Human versus Machine: Can the struggle for better decision-making apparatuses prove to be a forum for partnership?

By Brian Neesby | March 10, 2019

“Man versus Machine” is a refrain whose origins are lost to history—perhaps it dates back to the Industrial Revolution, perhaps to John Henry and the steam drill. Searching the reams of books in Google’s archives, the first mention of the idiom appears to hail from an 1833 article in the New Anti-Jacobin Review. Authorship is credited, posthumously, to Percy Bysshe Shelley, whose cousin Thomas Medwin was the editor. Both poets are famous in their own right, but Shelley’s second wife, Mary Shelley, is probably more renowned. Personally, I choose to believe that the author of Frankenstein herself coined the phrase.

Not only must the phrase be updated for modern sensibilities—take note of the blog’s gender-agnostic title—but the debate itself must be reimagined. Our first concerns were over who was best at certain strategic, memory, or mathematical tasks. The public watched as world chess champion Garry Kasparov beat IBM’s Deep Blue in 1996, only to be conquered by the computer just one year later, when the machine could evaluate 200 million chess positions per second. In modern times, I think we can safely say that machines have won. In 2011, Watson, an artificial intelligence named after IBM’s founder, soundly beat Jeopardy champions Ken Jennings and Brad Rutter in the classic trivia challenge; it wasn’t close. But do computers make better decisions? They certainly make faster decisions, but are they substantively better? The modern debate over these first “thinking computers” centers on the use of automated decision making, especially for decisions that affect substantive rights.

Automated Decision Making

One does not have to look far to find automated decision-making gone awry. Some decisions are not about rights, per se, but they can still have far-reaching consequences.

  • Beauty.AI, a deep-learning system supported by Microsoft, was programmed to use objective factors, such as facial symmetry and lack of wrinkles, to identify the most attractive contestants in beauty pageants. It was used in 2016 to judge an international beauty contest with over 6,000 participants. Unfortunately, the system proved racist; its algorithms equated beauty with fair skin, despite the numerous minority applicants. Alex Zhavoronkov, Beauty.AI’s Chief Science Officer, blamed the system’s training data, which “did not include enough minorities.”
  • Under the guise of objectivity, a computer program called the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) was created to rate a defendant’s likelihood of recidivism, particularly of the violent variety. The verdict: the algorithm was given high marks for predicting recidivism in general, but with one fundamental flaw—it was not color blind. Black defendants who did not commit crimes over the next two years were nearly twice as likely to be misclassified as higher risk than their white counterparts. The inverse was also true: white defendants who reoffended within the two-year period had been mislabeled low risk approximately twice as often as black offenders.
  • When Washington, DC introduced an algorithm in 2009 to assess teacher performance, 206 teachers were terminated on the strength of its scores. Retrospective analysis eventually showed that the program had disproportionately weighted a small number of student survey results, and that other teachers had gamed the system by encouraging their students to cheat. At the time, the school district could not explain why excellent teachers had been fired.
  • A Massachusetts resident had his driving privileges suspended when a facial recognition system mistook him for another driver, one who had been flagged in an antiterrorist database.
  • Algorithms in airports inadvertently classify over a thousand customers a week as terrorists. A pilot for American Airlines was detained eighty times within a single year because his name was similar to that of a leader of the Irish Republican Army (IRA).
  • An Asian DJ was denied a New Zealand passport when his photograph was automatically processed; the algorithm decided that his eyes were closed. The victim was gracious: “It was a robot, no hard feelings,” he told Reuters.

Human Decision-Making is all too “Human”

Of course, one could argue that the problem with biased algorithms is the humans themselves: algorithms just entrench existing stereotypes and biases. Put differently, do algorithms amplify existing prejudice, or can they be a corrective? Unfortunately, decision-making by human actors does not fare much better than that of our robotic counterparts. Note the following use cases and statistics:

  • When researchers studied parole decisions, the results were surprising. A prisoner’s chance of being granted parole was heavily influenced by the timing of the hearing – specifically, its proximity to the judge’s lunch hour. About 65% of cases heard in the morning were granted parole. This rate fell precipitously over the next couple of hours, occasionally to 0%, then returned to 65% once the ravenous referee had been satiated. Late afternoon hours brought a resurgence of what researchers call decision fatigue.
  • College-educated Black workers are twice as likely to face unemployment as their White counterparts.
  • One study reported that applicants with white-sounding names received a call back 50% more often than applicants with black-sounding names, even when identical resumes were submitted to prospective employers.
  • A 2004 study found that when police officers were handed a series of pictures and asked to identify faces that “looked criminal,” they chose Black faces more often than White ones.
  • Black students are suspended three times more often than White students, even when controlling for the type of infraction.
  • Black children are 18 times more likely than White children to be sentenced as adults.
  • The Michigan State Law Review presented the results of a simulated capital trial. Participants were shown one of four simulated trial videotapes, identical except for the race of the defendant and/or the victim. The research participant turned juror was more likely to sentence a Black defendant to death, particularly when the victim was White. The researchers’ conclusion speaks for itself: “We surmised that the racial disparities that we found in sentencing outcomes were likely the result of the jurors’ inability or unwillingness to empathize with a defendant of a different race—that is, White jurors who simply could not or would not cross the ‘empathic divide’ to fully appreciate the life struggles of a Black capital defendant and take those struggles into account in deciding on his sentence.”

At this point, dear reader, your despair is palpable. Put succinctly, society has elements that are bigoted, racist, sexist – add your ‘ism’ of choice – and humans, and the algorithms created by humans, reflect that underlying reality. Nevertheless, there is reason for hope. I shared the litany of bad decisions attributable to humans acting without the aid of artificial intelligence to underscore that humans are just as prone to making unforgivable decisions as their robotic counterparts. Even so, I contend that automated decision-making can be an important corrective for human frailty. As a data scientist, I might be biased in this regard – according to Kahneman, that would be an example of my brain’s self-serving bias. I think the following policies can marry the benefits of human and automated decision-making into a truly cybernetic solution – if you’ll permit me to misuse that metaphor. Here are some correctives that can be applied to automated decision-making to provide an effective remedy for prejudiced or biased arbitration.

  • Algorithms should be reviewed by government and nonprofit watchdogs. I am advocating turning over both the high-level logic and the source code to the proper agency. There should be no doubt that government-engineered algorithms require scrutiny, since they involve articulable rights. A citizen’s Sixth Amendment right to face their accuser would alone necessitate this, even if the accuser in this case is an inscrutable series of 1s and 0s. Corporations could also benefit from such transparency, even if it is not legally coerced: if a trusted third-party watchdog or government agency has vetted a company’s algorithm, the good publicity – or, more likely, the avoidance of negative publicity – could be advantageous. The liability of possessing a company’s proprietary algorithm would need to be addressed. If a nonprofit agency’s security were compromised, damages would likely be insufficient to remedy the company’s potential loss. Escrow companies routinely take on such liability, but usually not for clients as big as Google, Facebook, or Amazon. The government might provide some assistance here by guaranteeing damages in the case of a security breach.
  • There also need to be publicly accessible descriptions of company algorithms. The level of transparency for the public cannot be expected to be as formulaic as above; such transparency should not expose proprietary information, nor permit the system to be gamed in a meaningful way.
  • Human review should be interspersed throughout the process. A good rule of thumb: automation may grant or preserve rights and other endowments, but rights, contractual agreements, or privileges should only be revoked after human review. Human review, by definition, entails a diminution in privacy, which should be weighed appropriately.
  • Statistical review is a must. Ongoing checks for discriminatory effects can be used to continually adjust and correct algorithms, so that bias does not inadvertently creep in.
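The statistical review suggested in the last bullet can be made concrete. The sketch below, using records invented purely for illustration (the field names are my own), compares false-positive rates across demographic groups – the disparity at the center of the COMPAS findings discussed earlier:

```python
# Sketch of a statistical bias audit: compare false-positive rates
# across groups. The records below are hypothetical.

def false_positive_rate(records):
    """Share of people who did NOT reoffend but were labeled high risk."""
    negatives = [r for r in records if not r["reoffended"]]
    if not negatives:
        return float("nan")  # no one in this group to mislabel
    flagged = sum(1 for r in negatives if r["labeled_high_risk"])
    return flagged / len(negatives)

def audit_by_group(records):
    """False-positive rate per demographic group."""
    groups = {}
    for r in records:
        groups.setdefault(r["group"], []).append(r)
    return {g: false_positive_rate(rs) for g, rs in groups.items()}

records = [
    {"group": "A", "reoffended": False, "labeled_high_risk": True},
    {"group": "A", "reoffended": False, "labeled_high_risk": False},
    {"group": "A", "reoffended": True,  "labeled_high_risk": True},
    {"group": "B", "reoffended": False, "labeled_high_risk": False},
    {"group": "B", "reoffended": False, "labeled_high_risk": False},
    {"group": "B", "reoffended": False, "labeled_high_risk": True},
]

rates = audit_by_group(records)
# A persistent gap between groups is the red flag a reviewer would chase.
```

A real audit would of course use the system’s actual decisions and observed outcomes, with confidence intervals on each rate.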

One final problem presents itself. Algorithms, especially those based on deep learning techniques, can be so opaque that it becomes difficult to explain their decisions. Alan Winfield, professor of robot ethics at the University of the West of England, is leading a project to solve this seemingly intractable problem. “My challenge to the likes of Google’s DeepMind is to invent a deep learning system that can explain itself,” Winfield said. “It could be hard, but for heaven’s sake, there are some pretty bright people working on these systems.” I couldn’t have said it better. We want the best and the brightest humans working not only to develop algorithms to get us to spend our money on merchandise, but also to develop algorithms to protect us from the algorithms themselves.

Sources:
https://www.theguardian.com/technology/2017/jan/27/ai-artificial-intelligence-watchdog-needed-to-prevent-discriminatory-automated-decisions
https://www.marketplace.org/2015/02/03/tech/how-algorithm-taught-be-prejudiced
https://humanhow.com/en/list-of-cognitive-biases-with-examples/
https://www.forbes.com/sites/markmurphy/2017/01/24/the-dunning-kruger-effect-shows-why-some-people-think-theyre-great-even-when-their-work-is-terrible/#541115915d7c
https://www.pnas.org/content/108/17/6889
https://deathpenaltyinfo.org/studies-racial-bias-among-jurors-death-penalty-cases

YouTube and the Momo Challenge

By Matt Vay | March 10, 2019

YouTube has been in hot water recently over a series of high-profile incidents that have gained massive media coverage, calling into question the algorithms that drive its business and the role it should play in censoring the content it serves. The first incident involved predatory comments on videos of children; the second, and the focus of this blog post, is a dangerous new trend called the “Momo Challenge.”

What is the Momo Challenge?
Momo began as an urban legend created in a public online forum but evolved over time. The Momo Challenge has become a series of images that appear in children’s videos, telling kids to harm themselves. Many believe the story has been perpetuated by mainstream media and has unnecessarily frightened parents across the world, given the lack of evidence that these videos exist on YouTube. Even so, it has once again raised the question: what role should YouTube play in censoring the content on its platform?

What are the legal and ethical issues?
YouTube’s recommender algorithm has been the subject of great debate over the past few years. It has a tendency to place individuals into “filter bubbles,” where they are shown videos similar to those they have watched in the past. But what dangers does that create when the videos it records our children watching are dangerous pranks? Could it see a child watching a Momo Challenge video and then recommend a Tide Pod Challenge video? Companies with this much power have a responsibility to protect young children from disturbing content. If a child watches one of these videos and then harms themselves, how much blame does YouTube bear for recommending it?

What has Youtube done?
The Momo Challenge is not the first time our nation has been captivated by a dangerous challenge targeted at our youth. From the Tide Pod challenge to the Bird Box challenge, YouTube has seen these dangerous pranks before and recently updated its Community Guidelines. YouTube policy now bans challenge and prank videos that could lead to serious physical injury. The company went one step further with the Momo Challenge and demonetized all videos even referencing Momo; many of those videos also carry warning screens classifying them as potentially offensive content.

Where do we go from here?
Unfortunately, these types of videos do not seem to be going away. YouTube has taken steps in the right direction toward censoring content for children, but how much further does it need to go? I think the answer is very unclear. Nobody will ever be fully happy with all of the content found on YouTube, and that is the nature of the beast: it is an open video-sharing platform where users can upload a video file with anything they want in it. But with children gaining access to these sites easily and at such a young age, we need to keep challenging YouTube to do better with its policies, its censorship, and its algorithms, as it likely will never be enough.

Sources:

Alexander, Julia. “YouTube Is Demonetizing All Videos about Momo.” The Verge, The Verge, 1 Mar. 2019, www.theverge.com/2019/3/1/18244890/momo-youtube-news-hoax-demonetization-comments-kids.

Hale, James Loke. “YouTube Bans Stunts Like Particularly Risky ‘Bird Box,’ Tide Pod Challenges In Updated Guidelines.” Tubefilter, Tubefilter, 16 Jan. 2019, www.tubefilter.com/2019/01/16/youtube-bans-bird-box-tide-pod-community-guidelines-strikes/.

A Day in My Life According to Google: The Case for Data Advocacy

By Stephanie Seward | March 10, 2019

Recently I was sitting in an online class for the University of California, Berkeley’s data science program, discussing privacy considerations. If someone from outside the program were to listen in, they would interpret our dialogue as some sort of self-help group for data scientists who fear an Orwellian future that we have worked to create. It’s an odd dichotomy, potentially akin to Oppenheimer’s proclamation that he had become Death, the destroyer of worlds, after working diligently to create the atomic bomb (https://www.youtube.com/watch?v=dus_M4sn0_I).

One of my fellow students mentioned, as part of our in-depth, perhaps somewhat paranoid, dialogue, that users can download the information Google has collected on them. He said he hadn’t downloaded the data, and the rest of the group insisted that they wouldn’t want to know. It would be too terrifying.

I, however, a battle-hardened philosopher who graduated from a military school in my undergraduate days, thought: I’m not scared, so why not have a look? I was surprisingly naïve just four weeks ago.

What follows is my story. This is a story of curiosity, confusion, fear, and a stark understanding that data transparency and privacy concerns are prevalent, prescient, and more pervasive than I could have possibly known. This is the (slightly dramatized) story of a day in my life according to Google.
This is how you can download your data.

https://support.google.com/accounts/answer/3024190?hl=en

A normal workday according to Google
0500: Wake up, search “News”/click on a series of links/read articles about international relations
0530: Movement assessed as “driving” start point: home location end point: work location
0630: Activity assessed as “running” grid coordinate: (series of coordinates)
0900: Shopping, buys swimsuit, researches work fashion


1317: Uses integral calculator
1433: Researches military acquisition issues for equipment
1434: Researches information warfare
1450: Logs into maps, views area around City (name excluded for privacy), views area around post
1525: Calls husband using Google assistant
1537: Watches Game of Thrones Trailer (YouTube)
1600: Movement assessed as “driving” from work location to home location
1757: Watches Inspirational Video (YouTube)
1914-2044: Researches topics in Statistics
2147: Watches various YouTube videos including Alice in Wonderland-Cheshire Cat Clip (HQ)
Lists all 568 cards in my Google Feed and annotates which I viewed
Details which Google Feed Notifications I received and which I dismissed

I’m not a data scientist yet, but it is very clear to me that the sheer amount of information Google has on me (about 10 GB in total) is dangerous. Google knows my interests and activities almost every minute of every day. What does Google do with all that information?

We already know that it is used in targeted advertising, to generate news stories of interest, and sometimes even in hiring practices. Is that, however, where the story ends? I don’t know, but I doubt it. I also doubt that we are advancing toward some Orwellian future in which everything about us is known by some Big Brother figure. We will probably fall somewhere in between.

I also know that I am not the only one on whom Google holds 10 GB, if not more, of information. If you would like to download your own data, visit https://support.google.com/accounts/answer/3024190?hl=en; to view your data online, visit https://myactivity.google.com/.
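For the curious, tallying such an export does not take much code. Here is a hypothetical Python sketch that counts activity items by product; the JSON layout assumed here (a list of objects with a "header" field naming the product) reflects my own Takeout "My Activity" export and may differ in yours:

```python
import json
from collections import Counter

# Hypothetical sketch: tally a Google Takeout "My Activity" export by
# product. The "header" field name is an assumption based on one export
# and may differ in your download.

def activity_counts(path):
    """Count activity items per product in a My Activity JSON export."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return Counter(item.get("header", "Unknown") for item in items)
```

Pointed at the JSON file inside the Takeout archive, this might return something like `Counter({'Search': 412, 'YouTube': 95, ...})` for a single day of activity.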

Privacy considerations cannot remain in the spheres of data science and politics alone; we each have a role in the debate. This post is a humble attempt to drum up more interest from everyday users. Consider researching privacy concerns. Consider advocating for transparency. Consider the data, and consider the consequences.


Looking for more?
Here is a good place to start: https://www.wired.com/story/google-privacy-data/. The article, “The Privacy Battle to Save Google from Itself” by Lily Hay Newman, appears in the security section of wired.com and details Google’s recent battles, as of late 2018, with privacy concerns. Newman contrasts the company’s emphasis on transparency efforts with its increased data collection on users, and discusses Google’s struggle to remain transparent to the public and its own employees when it comes to data collection and application use. In her final remarks, Newman reiterates: “In thinking about Google’s extensive efforts to safeguard user privacy and the struggles it has faced in trying to do so, this question articulates a radical alternate paradigm – one that Google seems unlikely to convene a summit over. What if the data didn’t exist at all?”

GDPR: The tipping point for a US Privacy Act?

By Harith Elrufaie | March 6, 2019

GDPR, short for the General Data Protection Regulation, was probably among the top ten buzzwords of 2018, and for good reason: the new regulation fundamentally reshapes the way data is handled across every sector. Under the new law, any company that is based in the EU, or does business with EU customers, must comply. Failing to comply can result in fines of up to 4% of annual global turnover or €20 million, whichever is greater. Here in the US, companies revamped their privacy policies, architectures, data storage, and encryption practices; it is estimated that US companies spent over $40 billion to become GDPR compliant.

To be GDPR compliant, a company must:

1. Obtain consent: consent requests must be simple, which means complex legal terms and conditions are not accepted.
2. Provide timely breach notification: if a security breach occurs, the company must not only inform affected users, but must do so within 72 hours.
3. Honor the right to data access: users have the right to request all of their stored data, free of charge.
4. Honor the right to be forgotten: users have the right to request the deletion of their data at any time, free of charge.
5. Support data portability: users have the right to obtain their data and reuse it in a different system.
6. Practice privacy by design: data protection must be included from the onset of system design, rather than bolted on.
7. Appoint a data protection officer: in some cases, a Data Protection Officer (DPO) must be appointed to oversee compliance.
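To make the access and erasure rights concrete, here is a minimal Python sketch of what honoring them might look like against a simple in-memory store. The class and method names are invented for illustration; a real system would also have to purge backups, logs, and copies shared with data processors.

```python
# Minimal sketch of the "right to data access" and "right to be
# forgotten" against an in-memory store. Names are illustrative only.

class UserDataStore:
    def __init__(self):
        self._records = {}  # user_id -> {field: value}

    def save(self, user_id, field, value):
        self._records.setdefault(user_id, {})[field] = value

    def export(self, user_id):
        """Right to access/portability: hand back everything, free of charge."""
        return dict(self._records.get(user_id, {}))

    def erase(self, user_id):
        """Right to be forgotten: delete all data held on the user."""
        return self._records.pop(user_id, None) is not None

store = UserDataStore()
store.save("alice", "email", "alice@example.com")
exported = store.export("alice")  # a copy of everything held on alice
erased = store.erase("alice")     # True; a second call would return False
```

The hard part in practice is not this bookkeeping but guaranteeing that every downstream copy of the data honors the same erase call.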

Is this the tipping point?

The last few years were a revolving door of data privacy scandals: website shutdowns, data mishandling, public apologies, and CEOs testifying before the US Congress. A question that pops into many minds: will a GDPR-like act appear in the United States sometime soon?

The answer is maybe.

In January 2019, two U.S. senators, Amy Klobuchar and John Kennedy, introduced the Social Media Privacy and Consumer Rights Act, bipartisan legislation that would protect the privacy of consumers’ online data. Senator Kennedy’s involvement is no surprise to many; he has been an advocate of data privacy and vocal about Facebook’s user agreement. During Mark Zuckerberg’s testimony before Congress, Senator Kennedy said: “Your user agreement sucks. The purpose of that user agreement is to cover Facebook’s rear end. It’s not to inform your users of their rights.” The act is very similar to GDPR in many respects; after reading the bill, I could not identify anything unique or different from GDPR. While this is a big step toward consumer data privacy, many believe such measures will never become law, because of the power of the tech lobby and the lack of public demand for a data privacy overhaul.

A second promising move happened here in California with the new California Consumer Privacy Act of 2018. The act grants consumers the right to know what data businesses and edge providers are collecting about them and offers them specific controls over how that data is handled, kept, and shared. The act takes effect on January 1, 2020 and applies only to residents of California.

To comply with the California Consumer Privacy Act, companies must:

1. Disclose to consumers the personal information being collected, how it is used, and to whom it is being disclosed or sold.
2. Allow consumers to opt out of the sale of their data.
3. Allow consumers to request the deletion of their personal information.
4. Offer opt-in services for consumers under the age of 16.

While the United States has a rich history of data protection acts, such as HIPAA and COPPA, there is no single act that addresses online consumer privacy. Corporations have benefited for many years by invading our privacy and selling our data without our knowledge. It is time to put an end to this and to voice our concerns and demands to our representatives. There is no better time than now for an online consumer privacy act.


Privacy Reckoning Comes For Healthcare

By Anonymous | March 3, 2019

The health insurance industry (“payors”) is, compared to other industries, relatively late to the game in utilizing data science and advanced analytics in its core business. While actuarial science has long been at the heart of pricing and risk management in insurance, not only are actuarial methods years behind the latest advances in applied statistics and data science, but the scope of these advanced analytical tools has been limited largely to underwriting and risk management.

But times are a-changing. Many leading payors are investing in data science capabilities in applications ranging from the traditional stats-heavy domain of underwriting to a range of other enterprise functions including marketing, care management, member engagement, and beyond. With this larger foray into data science has come requisite concerns with data privacy. ProPublica and NPR teamed up last year to publish the results of an investigation into privacy concerns related to the booming industry of using aggregated personal data in healthcare applications (link); while sometimes speculative and short on details, the report brings up skin-crawling possibilities of how this can go horribly wrong. Given the sensitivity of healthcare generally and the alarming scope of data collection in process, it’s high time for the healthcare industry to take a stand on how they intend to use this data and confront privacy issues top of mind for consumers. Let’s explore a few issues in particular.

Data usage: “Can they do that?”

One issue raised in the article — which would be an issue for any person with a health insurance plan — is how personal data will actually be used. There are a number of protections in place that prevent some of the more egregious imagined uses of personal data, the most important being that insurance companies cannot price-discriminate for individual plans (though insurers can charge different prices for different plan tiers in different geographies). Beyond this, however, one could imagine other uses that might raise privacy concerns, including: using personal data in group plan pricing (insurance plans fully underwritten by the payor and offered to employers with <500 employees); outreach to individuals that may alert others to personal medical information (consider the infamous Target incident, in which a father learned of his daughter’s then-unannounced pregnancy through pregnancy-related mailers sent by Target); and individualized pricing that draws on data collected from social media, in a world where the laws governing healthcare pricing are in flux. Data usage is something payors need to be transparent about with their consumers if they hope to engender and maintain the already-mercurial trust of their members…and ultimately of voters.

Data provenance: “Do I really subscribe to ‘Guns & Ammo’?”

It is demonstrable that payors are making significant investments in personal data, sourced from a cottage industry of providers that aggregate data using a variety of proprietary methods. Given the potential uses laid out above, consider the following: what if major decisions about the healthcare offered to consumers are based on data that is factually incorrect? Data aggregation firms sometimes resort to imputing data for people with missing data points — so that, if all my neighbors subscribe to Guns & Ammo magazine, for instance, a firm may assume I am also a subscriber. Notwithstanding what my hypothetical Guns & Ammo subscription might mean, what is the impact of erroneous data on important healthcare decisions? How do we protect consumers from being the victims of erroneous decisions based on erroneous data that is out of their control? A standard is required here to ensure decisions are not made based on inaccurate data.
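To see how easily imputation fabricates facts, consider this small Python sketch. The data and the neighborhood-mode strategy are hypothetical — aggregators' actual methods are proprietary — but the failure mode is the one described above:

```python
from collections import Counter

# Hypothetical illustration of the "Guns & Ammo" problem: filling a
# missing attribute with the neighborhood's most common value records
# a "fact" about an individual that no one ever observed.

def impute_by_neighborhood(people):
    """Fill missing 'subscriber' flags with the modal value per zip code."""
    by_zip = {}
    for p in people:
        if p["subscriber"] is not None:
            by_zip.setdefault(p["zip"], []).append(p["subscriber"])
    for p in people:
        if p["subscriber"] is None and by_zip.get(p["zip"]):
            # The most common known value among neighbors wins.
            p["subscriber"] = Counter(by_zip[p["zip"]]).most_common(1)[0][0]
    return people

people = [
    {"name": "neighbor_1", "zip": "12345", "subscriber": True},
    {"name": "neighbor_2", "zip": "12345", "subscriber": True},
    {"name": "me",         "zip": "12345", "subscriber": None},  # no data on file
]
impute_by_neighborhood(people)
# "me" is now recorded as a subscriber purely by association.
```

Any downstream decision that treats the imputed flag as ground truth inherits an error the consumer never had a chance to correct.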

Conclusion: Miles to go before we sleep on this issue

ProPublica and NPR merely scratched the surface of the data privacy issues that can arise from questionable data usage, data inaccuracy, and other problems not addressed in the article. As the healthcare industry continues to build out its data science capabilities — which, by the way, have the potential to help millions of people — it will be critical for payors to take a clear stand by articulating a clear data privacy policy with, at the very least, well-understood standards of data usage and data accuracy.

—————————

IMAGE SOURCES: both are examples of what a ‘personal dossier’ of an individual’s health risk might look like, including personal data. Both come from the main ProPublica article mentioned above (“Health Insurers Are Vacuuming Up Details About You – And It Could Raise Your Rates”, by Marshall Allen, July 17, 2018), found here: https://www.propublica.org/article/health-insurers-are-vacuuming-up-details-about-you-and-it-could-raise-your-rates

Both images are credited to Justin Volz, special to ProPublica

Contextual Violations of Privacy

By Anonymous | March 3, 2019

Facebook’s data processing practices are once again in the headlines (shocker, right?). One recent outrage surrounds the way data from unrelated mobile applications is shared with the social media platform in order to improve the targeting of users on Facebook. The practice has raised serious questions about end-user privacy harm, and has prompted the New York Department of Financial Services to request documents from Facebook. In this post we will review some of the evidence concerning third-party applications’ data sharing with Facebook, discuss a useful lens for evaluating the perceived privacy harm, and perhaps offer some insight into alternative norms by which the web might be made less of a commercial, surveillance-oriented tool for technology platforms.

The Wall Street Journal recently investigated 70 of the top Apple iOS 11 apps and found that 11 of them (16%) shared sensitive, user-submitted data with Facebook in order to enhance the ad-targeting effectiveness of Facebook’s platform. The sensitive health and fitness data provided by the culprit apps included very intimate details such as ovulation tracking, sexual activity logged as “exercise,” alcohol consumption, and heart rates. These popular apps use a Facebook feature called “App Events,” which feeds Facebook’s ad-targeting tools. In essence, the feature enables companies to track users across platforms to improve their ad-targeting effectiveness.

A separate, unrelated and earlier study, conducted by Privacy International on devices running Android 8.1 (Oreo), provides a more technical discussion of data sharing. In tests of 34 common apps, it found that 23 (61%) automatically transferred data to Facebook the moment a user opened the application, regardless of whether the user had a Facebook account. This data included the specific application accessed, events such as the opening and closing of the application, device-specific information, the user’s suspected location based on language and time zone settings, and the unique Google advertising ID (AAID) provided by the Google Play Store. Specific applications, such as the travel app Kayak, sent detailed end-user search behavior to Facebook.

In response to the Wall Street Journal report, a Facebook spokesperson commented that it is common for developers to share information with a wide range of platforms for advertising and analytics; to be clear, the report focused on how other apps use people’s information to create Facebook ads. If it is common practice to share information across platforms, which on the surface appears to be true (although the way targeted marketing and data exchanges work is not entirely clear), then why are people so upset? Moreover, why did the Wall Street Journal report spark regulatory action while the reports from Privacy International were not as polarizing?

Importance of Context

Helen Nissenbaum, an NYU researcher, criticizes the current approach to online privacy, which is dominated by discussion of transparency and choice. One central challenge to the whole paradigm is what Nissenbaum calls the “transparency paradox”: simple, digestible, easy-to-comprehend privacy policies are, with few exceptions, directly at odds with a detailed understanding of how data are actually controlled in practice. Instead, she argues for an approach grounded in contextual integrity to define how data and information ought to be handled. For example, if you operate an online bank, then the norms governing information in a banking context ought to apply whether the interaction is online or in person.

Now apply Nissenbaum’s approach to the specific topic of health applications sharing data. When a woman logs her menstrual cycle on her personal device, would she reasonably expect that information to be accessed and used in social media forums (e.g., on Facebook)? Moreover, would she reasonably expect her travel plans to Costa Rica to be algorithmically combined with her menstrual cycle information to predict whether she would be more or less inclined to purchase trip insurance? What if that information were then used to charge her more for the trip insurance? The number of permutations of this scenario is constrained only by one’s imagination.

Arguably, many of us would be uncomfortable with this contextual violation. Sharing flight information with Facebook does not provoke the same level of outrage as sharing health data, because the norms that govern health data privilege autonomy and privacy much more than those of other commercial activities like air travel. While greater transparency would have been a meaningful step toward minimizing the outrage experienced by the general public in the health example, it is still not sufficient to remove the privacy harm that was, or could be, experienced.

As Nissenbaum has proposed, perhaps it is time to rethink the norms of how data are governed and whether informed consent, on today’s internet, is really a sufficient approach to protecting individual privacy. We can’t agree on much in America today, but keeping our medical histories safe from advertisers feels like one area where a majority of us could find common ground.

A Case Study on the Evolution and Effectiveness of Data Privacy Advocacy and Litigation with Google

A Case Study on the Evolution and Effectiveness of Data Privacy Advocacy and Litigation with Google
By Jack Workman | March 3, 2019

2018 was an interesting year for data privacy. Multiple data breaches, the Facebook Cambridge Analytica scandal, and the arrival of the European Union’s General Data Protection Regulation (GDPR) mark just a few of the many headlines. Of course, data privacy is not a new concept, and it is gaining prominence as more online services collect and share our personal information. Unfortunately, as 2018 showed, this gathered personal information is not always safe, which is why governments are introducing and exploring new policies and regulations like GDPR to protect our online data. Some consumers might be surprised to learn that this is not the first time governments have attempted to tackle data privacy. GDPR actually replaced an earlier EU data privacy initiative, the Data Protection Directive of 1995. In the US, California’s Online Privacy Protection Act (CalOPPA) of 2003 governs many actions involving privacy and is planned to be replaced by the California Consumer Privacy Act (CCPA) in 2019. Knowing this, you might wonder: what’s changed? Why do these earlier policies need replacing? And are these policies actually effective in setting limits on data privacy practices? To answer these questions, we turn to the history of one of the internet’s most well-known superstars: Google.

Google: Two Decades of Data Privacy History

Google’s presence and contributions in today’s ultra-connected world cannot be overstated. It owns the most used search engine, the most used internet browser, and the most popular smartphone operating system. Perhaps more than any other company, Google has experienced and been at the forefront of the evolution of the internet’s data privacy debates.

As such, it is a perfect subject for a case study to answer our questions. Even better, Google publishes an archive of all of its previous privacy policy revisions, with highlights of what’s changed. Why are privacy policies important? Because a privacy policy is the document a company is legally required to share to explain how it collects and shares personal information. If a company changes its approach to using personal information, that change should be reflected in a privacy policy update. By reviewing the changes between Google’s privacy policies, we can assess how the major data privacy events of the last two decades affected Google and how Google responded to them.

2004: The Arrival of CalOPPA

Google’s first privacy policy, published in June of 1999, is a simple affair: only three sections and 617 words. The policy remained mostly unchanged until July 1, 2004, the date CalOPPA went into effect, when Google added a full section on “Data Collection” and much more detail on how it shared your information. Both additions were required under the new regulations set forth by CalOPPA and can be considered positive steps toward more transparent data practices.

2010: Concerns Over Government Data Requests

A new update in 2010 brings the first mention of the Google Dashboard. The Dashboard, published after rising media attention to reports that Google shared its data with governments upon request, is a utility that lets users view the data Google has collected about them. This massive increase in transparency can be considered a big win for data privacy advocates.

2012: A New Privacy Policy and Renewed Scrutiny

March 2012 brings Google’s biggest policy change yet. In a sweeping move, Google overhauled its policy to give itself the freedom to share user data across all of its services. At least, all except ads: “We will not combine DoubleClick cookie information with personally identifiable information unless we have your opt-in consent”. This move drew negative attention from international media and fines from governments.

2016: The Ad Wall Falls

With a simple, one-line change in its privacy policy, Google drops the barrier preventing it from using data from all of its services to better target its advertisements. This move shows that, despite previous negative attention, Google is not afraid of expanding its use of our personal information.

2018: The Arrival of GDPR

It is still far too soon to assess the impact of GDPR, but, if the impact on Google’s privacy policy is any indicator, then it represents a massive change. With the addition of videos, additional resources, and clearer language, it seems as if Google is taking these new regulations very seriously.

Conclusion

Comparing Google’s first privacy policy to its most recent reveals a company that has become more aware of, and more interested in communicating, its data practices. As demonstrated, this growth was driven by media scrutiny and governmental legislation along the way. However, while the increased transparency is appreciated, that same scrutiny and legislation has not prevented Google from expanding its use and sharing of our personal information. This raises a new question that only time will answer: will GDPR and the pending US regulations place real limits on the use of, and protections for, our personal information, or will they just continue to increase transparency?

Operation Neptune Spear

Operation Neptune Spear
By Chris Sanchez | March 3, 2019

Almost eight years ago, on May 2nd, 2011, at 11:35pm Eastern Time, President Barack Obama revealed Operation NEPTUNE SPEAR to the world:

*“…the United States has conducted an operation that killed Osama bin Laden, the leader of al-Qaeda, and a terrorist who’s responsible for the murder of thousands of innocent men, women, and children.”*

Neptune Spear Command Center

At the time, the American public was aware that the US was engaged in combat operations in Afghanistan, but the whereabouts of Osama bin Laden—including whether he was alive or dead—were unknown. The announcement by President Obama (which, by the way, interrupted my viewing of America’s Funniest Home Videos) confirmed to the American public that Osama bin Laden:

  • Survived the US invasion of Afghanistan in 2001.
  • Had been hiding in Pakistan for several years.
  • Was killed in the raid by a highly trained (but undisclosed) US military unit.

President Obama’s announcement and subsequent reporting provided additional details about the raid and the decisions leading up to it, but the primary substance of the event can be neatly summarized in the three bullet points above. Yet much to my shock and dismay, over the following days I watched as news channels reported leaked details of the event, including classified information such as the identity of the military unit responsible for the raid, their call signs, identifying features, and deployment rotation cycle. None of this disclosed information materially altered the narrative of what had happened or provided any particularly useful insight into this classified military operation.

Secrecy and representative democracy have long had a tumultuous relationship, which is not likely to improve significantly in our Age of Information (On-Demand), as there will always be a trade-off between government transparency and the desire to keep certain information hidden from the public in the name of national security, including economic, diplomatic, and physical security. And though it often takes major headline events (the Pentagon Papers in 1971, Wikileaks in 2006, Edward Snowden in 2013) to jar the public consciousness, the public discussion surrounding these events often finds that the balance between transparency and secrecy is neither well monitored nor well understood by those elected or appointed to safeguard both the public trust and the public’s security.

Take, for instance, the Terrorist Screening Database (TSDB, commonly known as the “terrorist watchlist”). The TSDB is managed by the Terrorist Screening Center, a multi-agency organization created in 2003 by presidential directive in response to the lack of intelligence sharing across governmental agencies prior to the September 11 terrorist attacks. People—both US citizens and foreigners—who are known or suspected to have terrorist organization affiliations are placed into the TSDB, along with unique personal identifiers, including, in some cases, biometric information. This central repository is then exported across federal agencies (Department of State, Department of Homeland Security, Department of Defense, etc.) to aid in terrorist identification across passive and active channels.
TSDB Nomination Regimen

In the aftermath of the 9/11 attacks and subsequent domestic terror incidents, one would be hard pressed to argue that the TSDB is not a useful and necessary information-sharing tool for US law enforcement and other agencies responsible for domestic security. But as in other instances of the government claiming the necessity of secrecy in the name of national security, there are indications that the secrecy/transparency balance is tilted in favor of unnecessary secrecy. A 2014 report from the Intercept, an award-winning news organization, presented evidence that 280,000 people in the TSDB (almost half the total at the time) had no known terrorist group affiliation. How or why were these unaffiliated people placed into this federal database? The consequences of being placed in the TSDB are not trivial. Depending on the circumstances, people in the TSDB can find themselves on the “no-fly list”, have visas denied, be subjected to enhanced screenings at various checkpoints, and find their personal information (including biometric information) exposed across multiple organizations.

With an average of over 1,600 daily nominations to the TSDB, I am hard-pressed to believe that due diligence is conducted on all of those names, despite the claims about the thoroughness of the nomination process in the FAQ section of the Federal Bureau of Investigation’s Terrorist Screening Center website. Furthermore, once nominated, individuals find it very cumbersome to correct or remove their records in the TSDB, in spite of a formal appeals procedure mandated by the Intelligence Reform and Terrorism Prevention Act of 2004. The Office of the Inspector General under the Department of Justice has criticized the maintainers of the TSDB for “…frequent errors and being slow to respond to complaints”. A 2007 Inspector General report found a 38% error rate in a 105-name sample from the TSDB.

As long as we live in a representative democracy that values individual privacy, free and open discussion of policy, and the applicability of Constitutional principles to all US citizens, there will always be “friction” at the nexus of government responsibility, public trust in governmental institutions, and secrecy. Trust in US governmental institutions has slowly eroded over time, due in large part to access to previously hidden information that contradicted what the public had been told or led to believe. Experience has shown that publicly elected representatives are often not enough of a check on the power of government agencies to strike an appropriate balance between secrecy and transparency. Fortunately, though not perfect in their efforts to right perceived wrongs, public advocacy organizations, academic institutions, investigative journalists, constitutional lawyers, and concerned citizens have made much progress at this nexus.

In my experience, which includes being on the front lines of the War on Terror from 2007-2013, the men and women who comprise the totality of “government institutions”, while imperfect, generally do have the best interests of the nation as a whole in mind when carrying out their responsibilities. Given the limitations of human decision making in times of both crisis and tranquility, there is a tendency to err on the side of secrecy in the name of security. Taken to extremes, however, this mentality can result in significant abuses of power, ranging from moderate invasions of privacy to severe abuses of personal freedoms. To compound the situation, the erosion of public trust in government casts suspicion on every governmental action that is not completely “above board”, even when there are very good reasons for non-disclosure (such as the operational details described in the Operation Neptune Spear example at the beginning of this article). At the end of the day, the government will take whatever measures it deems necessary to secure the safety of its citizenry, even if such actions come at the expense of the rights of minority groups or those who do not find themselves in political power. I think it’s our job as vigilant citizens to ensure that the balance of power is restored once the real or perceived crisis has passed.

How transparent does a government need to be? In a representative democracy it needs to be as transparent as possible without compromising public safety and security. How the US government and its citizens decide to strike that balance over the coming generations will be an interesting discussion indeed.

Primary Sources
1. https://en.wikisource.org/wiki/Remarks_by_the_President_on_Osama_bin_Laden
2. https://fas.org/sgp/crs/terror/R44678.pdf
3. https://theintercept.com/2014/08/05/watch-commander/
4. https://www.fbi.gov/file-repository/terrorist-screening-center-frequently-asked-questions.pdf/view

Data Privacy and the Chinese Social Credit System

Data Privacy and the Chinese Social Credit System
“Keeping trust is glorious and breaking trust is disgraceful”
By Victoria Eastman | February 24, 2019

Recently, the Chinese Social Credit System has been featured on podcasts, blogs, and news articles in the United States, often highlighting the Orwellian feel of the imminent system China plans to use to encourage good behavior amongst its citizens. The broad scope of this program raises questions about data privacy, consent, algorithmic bias, and error correction.

What is the Chinese Social Credit System?

In 2014, the Chinese government released a document entitled “Planning Outline for the Construction of a Social Credit System”. The system uses a broad range of public and private data to rank each citizen on a scale from 0 to 800. Higher ratings offer citizens benefits like discounts on energy bills, more matches on dating websites, and lower interest rates. Low ratings incur punishments such as the inability to purchase plane or train tickets, exclusion of you and your children from universities, and even pet confiscation in some provinces. The system has been undergoing testing in various provinces around the country, with different implementations and properties, but the government plans to take the rating system nationwide in 2020.

The exact workings of the system have not been explicitly detailed by the Chinese government; however, details have emerged since the policy was announced. Data is collected from a number of private and public sources: chat and email data; online shopping history; loan and debt information; smart devices, including smartphones, smart home devices, and fitness trackers; criminal records; travel patterns and location data; and the nationwide network of millions of cameras that watch all Chinese citizens. Even your family members and other people you associate with can affect your score. The government has signed up more than 44 financial institutions and has issued at least 8 licenses to private companies such as Alibaba, Tencent, and Baidu to submit data to the system. Algorithms are run over the entire dataset to generate a single credit score for each citizen.

This score will be publicly available on any number of platforms, including newspapers, online media, and even some people’s phones: when you call a person with a low score, you will hear a message telling you that the person you are calling has low social credit.

What does it mean for privacy and consent?

On May 1st, 2018, China announced the Personal Information Security Specification, a set of non-binding guidelines governing the collection and use of Chinese citizens’ personal data. The guidelines appear similar to the European GDPR, with some notable differences, namely a focus on national security. Under these rules, individuals have full rights to their data, including erasure, and must give consent for any use of personal data by the collecting company.

How do these guidelines square with the social credit system? The connection between the two policies has not been explicitly outlined by the Chinese government, but at first blush there appear to be some key conflicts. Do citizens have erasure power over poor credit history or other details that negatively affect their score? Are companies required to ask for consent before sending private information to the government for use in the social credit score? If the social credit score is public, how much control do individuals really have over the privacy of their data?

Other concerns about the algorithms themselves have also been raised. How are individual actions weighted by the algorithm? Are some ‘crimes’ worse than others? Does recency matter? How can incorrect data be fixed? Is the government removing demographic information like age, gender, or ethnicity or could those criteria unknowingly create bias?

Many citizens with high scores are happy with the system that gives them discounts and preferential treatment, but others fear the system will be used by the government to shape behavior and punish actions deemed inappropriate by the government. Dissidents and minority groups fear the system will be biased against them.

There are still many details that are unclear about how the system will work on a nationwide scale, however, there are clear discrepancies between the published data privacy policy China announced last year and the scope of the social credit system. How the government addresses the problems will likely lead to even more podcasts, news articles, and blogs.

Sources

Sacks, Sam. “New China Data Privacy Standard Looks More Far-Reaching than GDPR”. Center for Strategic and International Studies. Jan 29, 2018. https://www.csis.org/analysis/new-china-data-privacy-standard-looks-more-far-reaching-gdpr

Denyer, Simon. “China’s plan to organize its society relies on ‘big data’ to rate everyone“. The Washington Post. Oct 22, 2016. https://www.washingtonpost.com/world/asia_pacific/chinas-plan-to-organize-its-whole-society-around-big-data-a-rating-for-everyone/2016/10/20/1cd0dd9c-9516-11e6-ae9d-0030ac1899cd_story.html?utm_term=.1e90e880676f

Doxing: An Increased (and Increasing) Privacy Risk

Doxing: An Increased (and Increasing) Privacy Risk
By Mary Boardman | February 24, 2019

Doxing (or doxxing) is a form of online abuse in which one party releases another’s sensitive and/or personally identifiable information. While it isn’t the only privacy risk, it is one that can put people physically in harm’s way. The released data can include information such as name, address, and telephone number, exposing doxing victims to threats, harassment, and even violence.

People dox others for many reasons, all with the intention of doing harm. Because more data is available to more people than ever, we can and should assume the risk of being doxed is also increasing. For those of us working with this data, we need to remember that there are actual humans behind the data we use. As data stewards, it is our obligation to understand the risks to these people and do what we can to protect them and their privacy interests. We need to be deserving of their trust.

Types of Data Used
To address a problem, we must first understand it. Doxing happens when direct identifiers are released, but these aren’t the only data that can lead to doxing. Other data, such as indirect identifiers, can also be used to dox people. Below are various levels of identifiability and examples of each:

  • Direct Identifiers: Name, Address, SSN
  • Indirect Identifiers: Date of Birth, Zip Code, License Plate, Medical Record Number, IP Address, Geolocation
  • Data Linking to Multiple Individuals: Movie Preferences, Retail Preferences
  • Data Not Linking to Any Individual: Aggregated Census Data, Survey Results
  • Data Unrelated to Individuals: Weather

Anonymization and De-anonymization of Data
Anonymization, which removes identifiers from a dataset, is a common response to privacy concerns and can be seen as an attempt to protect people’s privacy. However, because anonymized data can be de-anonymized, anonymization is not a guarantee of privacy. In fact, we should never assume that anonymization provides more than a level of inconvenience for a doxer. (And, as data professionals, we should not assume anonymization is enough protection.)

Generally speaking, there are four types of anonymization:
1. Remove identifiers entirely.
2. Replace identifiers with codes or pseudonyms.
3. Add statistical noise.
4. Aggregate the data.
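As a quick illustration, here is how each of the four approaches might look on a toy dataset. All names and values below are invented, and a real pipeline would use a vetted anonymization library rather than this sketch:

```python
import hashlib
import random

# Toy dataset: one record per person (all values invented for illustration).
records = [
    {"name": "Alice Smith", "zip": "94110", "age": 34, "spend": 120.0},
    {"name": "Bob Jones",   "zip": "94110", "age": 36, "spend": 80.0},
    {"name": "Carol Lee",   "zip": "10001", "age": 52, "spend": 200.0},
]

# 1. Remove identifiers entirely.
removed = [{k: v for k, v in r.items() if k != "name"} for r in records]

# 2. Replace identifiers with codes or pseudonyms (here, a truncated salted hash).
SALT = "keep-this-secret"
def pseudonym(name):
    return hashlib.sha256((SALT + name).encode()).hexdigest()[:8]

pseudonymized = [{**r, "name": pseudonym(r["name"])} for r in records]

# 3. Add statistical noise to numeric fields.
random.seed(0)
noisy = [{**r, "spend": r["spend"] + random.gauss(0, 10)} for r in removed]

# 4. Aggregate the data (e.g., average spend per ZIP code).
by_zip = {}
for r in records:
    by_zip.setdefault(r["zip"], []).append(r["spend"])
averages = {z: sum(v) / len(v) for z, v in by_zip.items()}

print(averages)  # {'94110': 100.0, '10001': 200.0}
```

Note that even the "removed" and "noisy" variants keep ZIP code and age, which, as the next section shows, is exactly the kind of residue that enables re-identification.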

De-anonymization (or re-identification) occurs when data that had been anonymized are accurately matched back to the original owner or subject. This is often done by combining two or more datasets containing different information about the same or overlapping groups of people. For instance, anonymized data from social media accounts can be combined to identify individuals. This risk is often highest when anonymized data is sold to third parties who then re-identify people.



One example of this is Sweeney’s 2002 paper, which showed that 87% of the US population could be uniquely identified with just zip code, birthdate, and sex. Another is work by Acquisti and Gross from 2009, in which they were able to predict Social Security numbers from birthdate and geographic location. A third is a 2018 study by Kondor et al., which identified people based on mobility and spatial data; the study’s success rate was only 16.8% after a week of data, but jumped to 55% after four weeks.
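A linkage attack of the kind these studies describe can be sketched in a few lines of Python. The records, names, and quasi-identifier values below are entirely invented; the point is only to show how joining two datasets on shared indirect identifiers re-attaches names to “anonymized” records:

```python
# "Anonymized" hospital records: direct identifiers removed, quasi-identifiers kept.
hospital = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1980-02-14", "sex": "M", "diagnosis": "asthma"},
]

# Public voter roll: names alongside the same quasi-identifiers.
voters = [
    {"name": "Jane Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "John Roe", "zip": "02139", "dob": "1980-02-14", "sex": "M"},
]

# Linkage attack: join the two datasets on (zip, dob, sex).
def key(r):
    return (r["zip"], r["dob"], r["sex"])

by_key = {key(v): v["name"] for v in voters}
reidentified = [
    {"name": by_key[key(h)], "diagnosis": h["diagnosis"]}
    for h in hospital if key(h) in by_key
]
print(reidentified)
```

When a quasi-identifier combination is unique in both datasets, as Sweeney showed it usually is, the join recovers the name with certainty.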



Actions Moving Forward
There are many actions data professionals can take, ranging from negligent stewardship (doing as little as possible) to more sophisticated options like differential privacy. El Emam presented a protocol back in 2016 that does a very elegant job of balancing feasibility with effectiveness when anonymizing data. He proposed the following steps:

1. Classify variables according to direct, indirect, and non-identifiers
2. Remove or replace direct identifiers with a pseudonym
3. Use a k-anonymity method to de-identify the indirect identifiers
4. Conduct a motivated intruder test
5. Update the anonymization with findings from the test
6. Repeat as necessary
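Step 3 of the protocol, k-anonymity, generalizes indirect identifiers until every combination of them is shared by at least k records, so no individual stands out. A minimal sketch follows, with invented records and a deliberately simple generalization scheme; real de-identification tools implement this far more carefully:

```python
from collections import Counter

# Records with quasi-identifiers (ZIP code and age); direct identifiers removed.
records = [
    {"zip": "94110", "age": 34}, {"zip": "94112", "age": 36},
    {"zip": "94110", "age": 52}, {"zip": "94113", "age": 57},
    {"zip": "10001", "age": 29}, {"zip": "10002", "age": 25},
]

def generalize(r):
    # Generalize: truncate ZIP to a 3-digit prefix, bucket age into decades.
    return {"zip": r["zip"][:3] + "**", "age_range": f"{r['age'] // 10 * 10}s"}

generalized = [generalize(r) for r in records]

# Check k-anonymity: every (zip, age_range) combination occurs at least k times.
def is_k_anonymous(rows, k):
    counts = Counter((r["zip"], r["age_range"]) for r in rows)
    return all(c >= k for c in counts.values())

print(is_k_anonymous(generalized, 2))  # True
```

Steps 4-6, the motivated intruder test and its follow-ups, then probe whether the generalized data can still be linked back to individuals, feeding any findings back into the generalization scheme.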

We are unlikely to ever truly know the risk of doxing (and with it, de-anonymization of PII). However, we need to assume de-anonymization is always possible. Because our users trust us with their data and their assumed privacy, we need to make sure their trust is well-placed and be vigilant stewards of their data and privacy interests. What we do, and the steps we take as data professionals can and do have an impact on the lives of the people behind the data.

Works Cited:
Acquisti, A., & Gross, R. (2009). Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences, 106(27), 10975–10980. https://doi.org/10.1073/pnas.0904891106
Electronic Privacy Information Center. (2019). EPIC – Re-identification. Retrieved February 3, 2019, from https://epic.org/privacy/reidentification/
El Emam, Khaled. (2016). A de-identification protocol for open data. In Privacy Tech. International Association of Privacy Professionals. Retrieved from https://iapp.org/news/a/a-de-identification-protocol-for-open-data/
Federal Bureau of Investigation. (2011, December 18). (U//FOUO) FBI Threat to Law Enforcement From “Doxing” | Public Intelligence [FBI Bulletin]. Retrieved February 3, 2019, from https://publicintelligence.net/ufouo-fbi-threat-to-law-enforcement-from-doxing/
Lubarsky, Boris. (2017). Re-Identification of “Anonymized” Data. Georgetown Law Technology Review. Retrieved from https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Narayanan, A., Huey, J., & Felten, E. W. (2016). A Precautionary Approach to Big Data Privacy. In S. Gutwirth, R. Leenes, & P. De Hert (Eds.), Data Protection on the Move (Vol. 24, pp. 357–385). Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-94-017-7376-8_13
Narayanan, A., & Shmatikov, V. (2010). Myths and fallacies of “personally identifiable information.” Communications of the ACM, 53(6), 24. https://doi.org/10.1145/1743546.1743558
Snyder, P., Doerfler, P., Kanich, C., & McCoy, D. (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference on – IMC ’17 (pp. 432–444). London, United Kingdom: ACM Press. https://doi.org/10.1145/3131365.3131385
Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. https://doi.org/10.1142/S0218488502001648