A Day in My Life According to Google: The Case for Data Advocacy

By Stephanie Seward | March 10, 2019

Recently I was sitting in an online class for the University of California-Berkeley’s data science program discussing privacy considerations. If someone from outside the program were to listen in, they would interpret our dialogue as some sort of self-help group for data scientists who fear an Orwellian future that we have worked to create. It’s an odd dichotomy potentially akin to Oppenheimer’s proclamation that he had become death, destroyer of worlds after he worked diligently to create the atomic bomb (https://www.youtube.com/watch?v=dus_M4sn0_I).

One of my fellow students mentioned, as part of our in-depth and perhaps somewhat paranoid dialogue, that users can download the information Google has collected on them. He said he hadn’t downloaded the data, and the rest of the group insisted that they wouldn’t want to know. It would be too terrifying.

I, however, a battle-hardened philosopher who graduated from a military school in my undergraduate days, thought: I’m not scared, why not have a look? I was surprisingly naïve just four weeks ago.

What follows is my story. This is a story of curiosity, confusion, fear, and a stark understanding that data transparency and privacy concerns are prevalent, prescient, and more pervasive than I could have possibly known. This is the (slightly dramatized) story of a day in my life according to Google.
This is how you can download your data.


A normal workday according to Google
0500: Wake up, search “News”/click on a series of links/read articles about international relations
0530: Movement assessed as “driving” start point: home location end point: work location
0630: Activity assessed as “running” grid coordinate: (series of coordinates)
0900: Shopping, buys swimsuit, researches work fashion

1317: Uses integral calculator
1433: Researches military acquisition issues for equipment
1434: Researches information warfare
1450: Logs into maps, views area around City (name excluded for privacy), views area around post
1525: Calls husband using Google assistant
1537: Watches Game of Thrones Trailer (YouTube)
1600: Movement assessed as “driving” from work location to home location
1757: Watches Inspirational Video (YouTube)
1914-2044: Researches topics in Statistics
2147: Watches various YouTube videos including Alice in Wonderland-Cheshire Cat Clip (HQ)
Lists all 568 cards in my Google Feed and annotates which I viewed
Details which Google Feed Notifications I received and which I dismissed

I’m not a data scientist yet, but it is very clear to me that the sheer amount of information Google has on me (about 10 GB in total) is dangerous. Google knows my interests and activities almost every minute of every day. What does Google do with all that information?

We already know that it is used in targeted advertising, to generate news stories of interest, and sometimes even in hiring practices. Is that, however, where the story ends? I don’t know, but I doubt it. I also doubt that we are advancing toward some Orwellian future in which everything about us is known by some big brother figure. We will probably fall somewhere in between.

I also know that I am not the only one on whom Google has about 10 GB of information, if not more. If you would like to view your own data, visit https://support.google.com/accounts/answer/3024190?hl=en or, to view your data online, visit https://myactivity.google.com/.

Privacy considerations cannot remain in the spheres of data science and politics; we each have a role in the debate. This post is a humble attempt to drum up more interest from everyday users. Consider researching privacy concerns. Consider advocating for transparency. Consider the data, and consider the consequences.

Looking for more?
Here is a good place to start: https://www.wired.com/story/google-privacy-data/. This article, “The Privacy Battle to Save Google from Itself” by Lily Hay Newman, is in the security section of wired.com. It details Google’s recent battles, as of late 2018, with privacy concerns. Newman contrasts the company’s emphasis on transparency with its increased data collection on users. She describes Google’s struggle to remain transparent with the public and with its own employees when it comes to data collection and application use. In her final remarks, Newman writes, “In thinking about Google’s extensive efforts to safeguard user privacy and the struggles it has faced in trying to do so, this question articulates a radical alternate paradigm – one that Google seems unlikely to convene a summit over. What if the data didn’t exist at all?”

GDPR: The tipping point for a US Privacy Act?

By Harith Elrufaie | March 6, 2019

GDPR, which is short for General Data Protection Regulation, was probably in the top ten buzzwords of 2018! This new regulation fundamentally reshapes the way data is handled across every sector. According to the new law, any company that is based in the EU, or that does business with EU customers, must comply with the new regulations. Failing to comply can result in fines of up to 4% of annual global turnover or €20 million (whichever is greater). Here in the US, companies revamped their privacy policies and revised their architectures, data storage, and encryption policies. It is estimated that US companies spent over $40 billion to become GDPR compliant.
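The fine ceiling described above is simple arithmetic, and a short sketch makes the “whichever is greater” rule concrete. The function name and the sample turnover figures below are hypothetical, chosen only for illustration; the sketch computes the upper bound stated in the text, not the fine a regulator would actually impose.

```python
FLAT_CAP_EUR = 20_000_000.0  # the EUR 20 million figure from the text
TURNOVER_PERCENT = 4         # the 4% figure from the text


def max_gdpr_fine(annual_global_turnover_eur: int) -> float:
    """Return the fine ceiling: the greater of 4% of turnover or EUR 20 million."""
    percent_based = annual_global_turnover_eur * TURNOVER_PERCENT / 100
    return max(percent_based, FLAT_CAP_EUR)


# Hypothetical companies, for illustration only.
small = max_gdpr_fine(100_000_000)    # 4% is EUR 4 million, so the flat cap applies
large = max_gdpr_fine(2_000_000_000)  # 4% is EUR 80 million, which exceeds the flat cap
```

Under these figures, the percentage only becomes the binding limit once annual turnover passes €500 million; below that, the €20 million figure is the ceiling.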

To be GDPR compliant, a company must meet the following requirements:

1. Obtaining consent: consent requests must be simple. This means complex legal terms and conditions are not accepted.
2. Timely breach notification: if a data security breach occurs, the company must not only inform the users, but must also do so within 72 hours.
3. Right to data access: the user has the right to request all of their stored data, for free.
4. Right to be forgotten: the user has the right to request the deletion of their data at any time, for free.
5. Data portability: the user has the right to obtain their data and reuse the same data in a different system.
6. Privacy by design: data protection must be included from the outset of system design, rather than added later.
7. Data protection officers: in some cases, the company must appoint a Data Protection Officer (DPO) to oversee compliance.

Is this the tipping point?

The last few years were a revolving door of data privacy scandals: the shutdown of websites, data mishandling, public apologies, and CEOs testifying before the US Congress. A question on the minds of many is whether a GDPR-like act will appear in the United States sometime soon.

The answer is maybe.

In January 2019, two U.S. senators, Amy Klobuchar and John Kennedy, introduced the Social Media Privacy and Consumer Rights Act, bipartisan legislation intended to protect the privacy of consumers’ online data. Senator Kennedy’s involvement is no surprise to many. He has been an advocate of data privacy and has been vocal about Facebook’s user agreement. During Mark Zuckerberg’s testimony before Congress, Senator Kennedy said: “Your user agreement sucks. The purpose of that user agreement is to cover Facebook’s rear end. It’s not to inform your users of their rights.” The act is very similar to GDPR in many respects. After reading the bill, I could not identify anything unique or different from GDPR. While this is a big step towards consumer data privacy, many believe such measures will never become law because of the power of the tech lobby and the lack of public demand for a data privacy overhaul.

The second good move happened here in California with the new California Consumer Privacy Act of 2018. The act grants consumers the right to know what data businesses and edge providers are collecting from them and offers them specific controls over how that data is handled, kept, and shared. This new act will take effect on January 1st of 2020 and will only apply to the residents of California.

To comply with the California Consumer Privacy Act, companies must:

1. Disclose to consumers the personal information being collected, how it is used, and to whom it is being disclosed or sold.
2. Allow consumers to opt out of the sale of their data.
3. Allow consumers to request the deletion of their personal information.
4. Offer an opt-in service for consumers under the age of 16.

While the United States has a rich history of data protection acts, such as HIPAA, COPPA, etc., there is no single act that addresses online consumer privacy. Corporations have benefited for many years by invading our privacy and selling our data without our knowledge. It is time to put an end to this and to voice our concerns and demands to our representatives. There is no better time than now for an online consumer privacy act.


Privacy Reckoning Comes For Healthcare

By Anonymous | March 3, 2019

The health insurance industry (“payors”), compared to other industries, is relatively late to the game in utilizing data science and advanced analytics in its core business. While actuarial science has long been at the heart of pricing and risk management in insurance, not only are actuarial methods years behind the latest advances in applied statistics and data science, but the use of these advanced analytical tools has been limited largely to underwriting and risk management.

But times are a-changing. Many leading payors are investing in data science capabilities in applications ranging from the traditional stats-heavy domain of underwriting to a range of other enterprise functions including marketing, care management, member engagement, and beyond. With this larger foray into data science have come the requisite concerns about data privacy. ProPublica and NPR teamed up last year to publish the results of an investigation into privacy concerns related to the booming industry of using aggregated personal data in healthcare applications; while sometimes speculative and short on details, the report brings up skin-crawling possibilities of how this can go horribly wrong. Given the sensitivity of healthcare generally and the alarming scope of data collection in progress, it’s high time for the healthcare industry to take a stand on how it intends to use this data and to confront the privacy issues that are top of mind for consumers. Let’s explore a few issues in particular.

Data usage: “Can they do that?”

One issue raised in the article, which would be an issue for any person with a health insurance plan, is how personal data will actually be used. There are a number of protections in place that prevent some of the more egregious imagined uses of personal data, the most important being that insurance companies cannot price-discriminate for individual plans (though insurers can charge different prices for different plan tiers in different geographies). Beyond this, however, one could imagine other uses that might raise concerns about the expectation of privacy, including: using personal data in group plan pricing (insurance plans fully underwritten by the payor and offered to employers with fewer than 500 employees), outreach to individuals that may alert others to personal medical information (consider the infamous Target incident where a father learned of his daughter’s then-unannounced pregnancy through pregnancy-related mailers sent by Target), and individualized pricing that takes into account data collected from social media, in a world where laws governing health care pricing are in flux in our current political environment. Data usage is something that payors need to be transparent about with their consumers if they hope to engender and maintain the already-mercurial trust of their members…and ultimately voters.

Data provenance: “Do I really subscribe to ‘Guns & Ammo’?”

It is demonstrable that payors are making significant investments in personal data, sourced from a cottage industry of providers that aggregate data using a variety of proprietary methods. Given the potential uses laid out above, consider the following: what if major decisions about the healthcare offered to consumers are based on data that is factually incorrect? Data aggregation firms sometimes resort to imputing data for people with missing data points, so that if all my neighbors subscribe to Guns & Ammo magazine, for instance, the firm may assume I am also a subscriber. Notwithstanding what my specific hypothetical Guns & Ammo subscription might mean, what is the impact of erroneous data on important healthcare decisions? How do we protect consumers from being the victims of erroneous decisions based on erroneous data that is out of their control? A standard is required here in order to ensure decisions are not made based on inaccurate data.
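To make the imputation concern concrete, here is a minimal sketch of one way a missing data point could be filled in from neighbors: take the most common known value and assign it to whoever has no record. This is an assumed, deliberately simplified method for illustration; the function name and the sample data are hypothetical, and nothing here describes what any real data aggregation firm actually does.

```python
from collections import Counter
from typing import Optional


def impute_majority(values: list[Optional[bool]]) -> list[tuple[bool, bool]]:
    """Fill each missing entry (None) with the most common known value.

    Returns (value, was_imputed) pairs so a guessed value can be told
    apart from an observed one.
    """
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")
    majority = Counter(known).most_common(1)[0][0]
    return [(majority, True) if v is None else (v, False) for v in values]


# Hypothetical street: three neighbors subscribe; my own record is missing.
street = [True, True, True, None]
result = impute_majority(street)
# The last entry is now (True, True): I am recorded as a subscriber,
# and only the second flag reveals that this was a guess.
```

If the `was_imputed` flag is dropped somewhere downstream, the guess becomes indistinguishable from a fact, which is exactly the failure mode the paragraph above worries about.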

Conclusion: Miles to go before we sleep on this issue

ProPublica and NPR merely scratched the surface of potential data privacy issues that can arise from questionable data usage, data inaccuracy, and other issues not addressed in the article. As the healthcare industry continues to invest in growing its data science capabilities (which, by the way, also have the potential to help millions of people), it will be critical for payors to take a stand by articulating a clear data privacy policy with, at the very least, well-understood standards of data usage and data accuracy.


IMAGE SOURCES: both are examples of what a ‘personal dossier’ of an individual’s health risk might look like, including personal data. Both come from the main ProPublica article mentioned above (“Health Insurers Are Vacuuming Up Details About You – And It Could Raise Your Rates”, by Marshall Allen, July 17, 2018), found here: https://www.propublica.org/article/health-insurers-are-vacuuming-up-details-about-you-and-it-could-raise-your-rates

Both images are credited to Justin Volz, special to ProPublica

Contextual Violations of Privacy

By Anonymous | March 3, 2019

Facebook’s data processing practices are once again in the headlines (shocker, right?). One recent outrage surrounds the way in which data from unrelated mobile applications is shared with the social media platform in order to improve how effectively those applications can target users on Facebook. This practice has raised serious questions about privacy harm to end users and has in fact prompted the New York Department of Financial Services to request documents from Facebook. In this post we will discuss some of the evidence concerning the data sharing practices of third-party applications with Facebook, and then discuss a useful lens for evaluating the perceived privacy harm. Perhaps we will also provide some insight into alternative norms under which we might construct the web to be less of a commercial, surveillance-oriented tool for technology platforms.

The Wall Street Journal recently investigated 70 of the top Apple iOS 11 apps and found that 11 of them (16%) shared sensitive, user-submitted data with Facebook in order to enhance the ad targeting effectiveness of Facebook’s platform. The sensitive health and fitness data provided by the culprit apps includes very intimate data such as ovulation tracking, sexual activity defined as “exercise”, alcohol consumption, heart rates and other sensitive data. These popular apps use a Facebook feature called “App Events” that is then used to feed Facebook ad-targeting tools. In essence, this feature enables companies to effectively track users across platforms to improve their ad targeting effectiveness.

A separate, earlier study conducted by Privacy International on devices running Android 8.1 (Oreo) provides more technical detail on data sharing. In tests of 34 common apps, it found that 23 (61%) automatically transferred data to Facebook the moment a user opened the application. This occurred regardless of whether the user had a Facebook account. The data includes the specific application accessed by a user, events such as the opening and closing of the application, device-specific information, the user’s suspected location based on language and time zone settings, and a unique Google advertising ID (AAID) provided by the Google Play Store. For example, specific applications such as the travel app Kayak sent detailed search behavior of end users to Facebook.

In response to the Wall Street Journal reports, a Facebook spokesperson commented that it’s common for developers to share information with a wide range of platforms for advertising and analytics. To be clear, the report was focused on how other apps use people’s information to create Facebook ads. If it is common practice to share information across platforms, which on the surface appears to be true (although the way in which targeted marketing and data exchanges work is not entirely clear), then why are people so upset? Moreover, why did the report published by the Wall Street Journal spark regulatory action while the reports from Privacy International were not as polarizing?

Importance of Context

Helen Nissenbaum, an NYU researcher, criticizes the current approach to online privacy, which is dominated by discussion of transparency and choice. One central challenge to the whole paradigm is what Nissenbaum calls the “transparency paradox”. That is, providing simple, digestible, and easy-to-comprehend privacy policies is, with few exceptions, directly at odds with conveying a detailed understanding of how data are really controlled in practice. Instead, she argues for an approach that leverages contextual integrity in order to define the ways in which data and information ought to be handled. For example, if you operate as an online bank, then the norms that govern how information is used and handled in a banking context ought to apply whether it is online or in-person.

Now apply Nissenbaum’s approach to the specific topic of health applications sharing data. When a woman tracks her menstrual cycle on her personal device, would she reasonably expect that information to be accessed and used in social media forums (e.g., on Facebook)? Moreover, would she reasonably expect that her travel plans to Costa Rica would then be algorithmically aggregated with her menstrual cycle information in order to detect whether she would be more or less inclined to purchase trip insurance? What if that information was then used to charge her more for the trip insurance? The number of combinations and permutations of this scenario is only constrained by one’s imagination.

Arguably many of us would be uncomfortable with this contextual violation. Debatably, sharing flight information with Facebook does not result in the same level of outrage as does health data. That is due to the fact that the norms that govern health data tend to privilege autonomy and privacy much more than those of other commercial activities like airline travel. While greater transparency would have been a meaningful step towards minimizing the outrage experienced by the general public with the health specific example, it is still not sufficient to remove the privacy harm that could be, was or is experienced.

As Nissenbaum has proposed, perhaps it is time that we rethink the norms of how data are governed and whether informed consent with today’s internet is really a sufficient approach towards protecting individual privacy. We can’t agree on a lot in America today, but it feels like keeping our medical histories safe from advertisers is maybe one area where we could find a majority of support?

A Case Study on the Evolution and Effectiveness of Data Privacy Advocacy and Litigation with Google

By Jack Workman | March 3, 2019

2018 was an interesting year for data privacy. Multiple data breaches, the Facebook Cambridge Analytica scandal, and the release of the European Union’s General Data Protection Regulation (GDPR) mark just a few of the many headlines. Of course, data privacy is not a new concept, and it is gaining prominence as more online services collect and share our personal information. Unfortunately, as 2018 showed, this gathered personal information is not always safe, which is why governments are introducing and exploring new policies and regulations like GDPR to protect our online data. Some consumers might be surprised that this is not the first time governments have attempted to tackle the issue of data privacy. GDPR actually replaced an earlier data privacy initiative by the EU called the Data Protection Directive of 1995. In the US, California’s Online Privacy Protection Act (CalOPPA) of 2003 governs many actions involving privacy and is planned to be replaced by the California Consumer Privacy Act (CCPA) in 2020. Knowing this, you might be wondering, what’s changed? Why do these earlier policies need replacing? And are these policies actually effective in setting limits on data privacy practices? To answer these questions, we turn to the history of one of the internet’s most well-known superstars: Google.

Google: Two Decades of Data Privacy History

Google’s presence and contributions in today’s ultra-connected world cannot be overstated. It owns the most used search engine, the most used internet browser, and the most popular smartphone operating system. Perhaps more than any other company, Google has experienced and been at the forefront of the evolution of the internet’s data privacy debates.

As such, it is a perfect subject for a case study to answer our questions. Even better, Google publishes an archive of all of its previous privacy policy revisions with highlights of what’s changed. Why are privacy policies important? Because privacy policies are documents legally required to be shared by a company to explain how it collects and shares personal information. If a company changes its approach to personal information use, then this change should be reflected in a privacy policy update. By reviewing the changes between Google’s privacy policies, we can assess how Google responded to, and was affected by, the major data privacy events of the last two decades of data privacy advocacy and policy.

2004: The Arrival of CalOPPA

Google’s first privacy policy, published in June of 1999, is a simple affair: only 3 sections and 617 words. The policy remained mostly the same until July 1, 2004, the same date that CalOPPA went into effect, when Google added a full section on “Data Collection” and much further detail on how it shared your information. Both additions were required under the new regulations set forth by CalOPPA and can be considered positive steps towards more transparent data practices.

2010: Concerns Over Government Data Requests

A new update in 2010 brings the first mention of the Google Dashboard. The Dashboard, published after rising media attention focusing on reports that Google shared its data with governments upon request, is a utility for users to view the data Google has collected. This massive increase in transparency can be considered a big win for data privacy advocates.

2012: A New Privacy Policy and Renewed Scrutiny

March 2012 brings Google’s biggest policy change yet. In a sweeping move, Google overhauled its policy to give it the freedom to share user data across all of its services. At least, all except for ads: “We will not combine DoubleClick cookie information with personally identifiable information unless we have your opt-in consent”. This move drew negative attention from international media and fines from governments.

2016: The Ad Wall Falls

With a simple, one-line change in its privacy policy, Google drops the barrier preventing it from using data from all of its services to better target its advertisements. This move shows that, despite previous negative attention, Google is not afraid of expanding its use of our personal information.

2018: The Arrival of GDPR

It is still far too soon to assess the impact of GDPR, but, if the impact on Google’s privacy policy is any indicator, then it represents a massive change. With the addition of videos, additional resources, and clearer language, it seems as if Google is taking these new regulations very seriously.


Comparing Google’s first privacy policy to its most recent depicts a company that’s become more aware of and more interested in communicating its data practices. As demonstrated, this growth was caused by media scrutiny and governmental legislation along the way. However, while the increased transparency is appreciated, the same media scrutiny and governmental legislation have not prevented Google from expanding its use and sharing of our personal information. This raises a new question that will only be answered with time: will GDPR and the pending US regulations actually limit the use of our personal information and strengthen its protection, or will they just continue to increase transparency?

Operation Neptune Spear

By Chris Sanchez | March 3, 2019

Almost eight years ago, on May 2nd, 2011, at 11:35pm Eastern Time, former President Barack Obama revealed Operation NEPTUNE SPEAR to the world:

*“…the United States has conducted an operation that killed Osama bin Laden, the leader of al-Qaeda, and a terrorist who’s responsible for the murder of thousands of innocent men, women, and children.”*

Neptune Spear Command Center

At the time, the American public was aware that the US was engaged in combat operations in Afghanistan, but the whereabouts of Osama bin Laden, including whether he was alive or dead, were unknown. The announcement by President Obama (which, by the way, interrupted my viewing of America’s Funniest Home Videos) confirmed to the American public that Osama bin Laden:

  • Survived the US invasion of Afghanistan in 2001.
  • Had been hiding in Pakistan for several years.
  • Was killed in the raid by a highly trained (but undisclosed) US military unit.

President Obama’s announcement and subsequent reporting provided additional details about the raid and the decisions leading up to it, but the primary substance of the event can be neatly summarized in the above three bullet points. Yet much to my shock and dismay, over the coming days, I watched as news channels reported leaked details of the event to include classified information such as the identity of the military unit responsible for the raid including their call signs, identifying features, and deployment rotation cycle. None of this disclosed information materially altered the narrative of what had happened or provided any particularly useful insight into this classified military operation.

Secrecy and representative democracy have long had a tumultuous relationship, which is not likely to improve significantly in our Age of Information (On-Demand), as there will always be a trade-off between government transparency and the desire to keep certain pieces of information hidden from the public in the name of national security, including economic, diplomatic, and physical security. And though it often takes major headline events such as the Pentagon Papers (1971), Wikileaks (2006), and Edward Snowden (2013) to jar the public consciousness, the resultant public discussion often finds that the balance between transparency and secrecy is neither well monitored nor well understood by those who are elected or appointed to safeguard both the public trust and the public’s overall security.

Take for instance the Terrorist Screening Database (TSDB, commonly known as the “terrorist watchlist”). The TSDB is managed by the Terrorist Screening Center, a multi-agency organization created in 2003 by presidential directive, in response to the lack of intelligence sharing across governmental agencies prior to the September 11 terrorist attacks. People, both US citizens and foreigners, who are known to have or suspected of having terrorist organization affiliations are placed into the TSDB, along with unique personal identifiers, including in some cases biometric information. This central repository of information is then exported across federal agencies (Department of State, Department of Homeland Security, Department of Defense, etc.) to aid in terrorist identification across passive and active channels.
TSDB Nomination Regimen

In the aftermath of the 9/11 attacks and subsequent domestic terror incidents, one would be hard pressed to argue that the TSDB is not a useful and necessary information-sharing tool for US law enforcement and other agencies responsible for domestic security. But like other instances of the government claiming the necessity of secrecy in the name of national security, there are indications that the secrecy/transparency balance is tilted in favor of unnecessary secrecy. A 2014 report from The Intercept, an award-winning news organization, claimed evidence that 280,000 people in the TSDB (almost half the total number at the time) had no known terrorist group affiliation. How or why were these unaffiliated people placed into this federal database? The consequences of being placed in the TSDB are not trivial. Depending on the circumstances, people listed in the TSDB can find themselves on the “no-fly list”, have visas denied, be subjected to enhanced screenings at various checkpoints, and find their personal information (including biometric information) exposed across multiple organizations.

With an average of over 1,600 daily nominations to the TSDB, I am hard-pressed to believe that due diligence is conducted on all of those names, despite what the FAQ section of the Federal Bureau of Investigation’s Terrorist Screening Center website claims about the thoroughness of the TSDB nomination process. Furthermore, once nominated, it’s very cumbersome for individuals to correct or remove records about them in the TSDB, in spite of a formal appeals procedure mandated by the Intelligence Reform and Terrorism Prevention Act of 2004. The Office of the Inspector General under the Department of Justice has criticized the maintainers of the TSDB for “…frequent errors and being slow to respond to complaints”. A 2007 Inspector General report found a 38% error rate in a 105-name sample from the TSDB.

As long as we live in a representative democracy that values individual privacy, free and open discussion of policy, and the applicability of Constitutional principles to all US citizens, there will always be “friction” at the nexus of government responsibility, public trust in governmental institutions, and secrecy. Trust in US governmental institutions has slowly eroded over time, due in large part to the public gaining access to previously hidden information that contradicted what it had been told or had been led to believe. Experience has shown that publicly elected representatives are often not enough of a check on the power of government agencies to strike an appropriate balance between secrecy and transparency. Fortunately, much progress has been made at this nexus point by public advocacy organizations, academic institutions, investigative journalism, constitutional lawyers, and concerned citizens, though their efforts to right perceived wrongs have not been perfect.

In my experience, which includes being on the front lines of the War on Terror from 2007-2013, the men and women who comprise the totality of “government institutions”, while imperfect, generally do have the best interests of the nation (as a whole) in mind when carrying out their responsibilities. Given the limitations of human decision making in times of both crisis and tranquility, there is a tendency to err on the side of secrecy in the name of security. However, taken to extremes this mentality can result in significant abuses of power ranging from moderate invasions of privacy to severe abuses of personal freedoms. To compound the situation, the public erosion of trust in government creates a certain level of suspicion behind every governmental action that is not completely “above board”, even when there are very good reasons for non-public disclosure of information (such as the operational details described in the Operation Neptune Spear example cited at the beginning of this article). At the end of the day, the government will take those measures it deems necessary to secure the safety of its citizenry, even if such actions come at the expense of the rights of minority groups or those who do not find themselves in political power. I think it’s our job as vigilant citizens to ensure that the balance of power is restored once the real or perceived crisis has passed.

How transparent does a government need to be? In a representative democracy it needs to be as transparent as possible without compromising public safety and security. How the US government and its citizens decide to strike that balance over the coming generations will be an interesting discussion indeed.

Primary Sources
1. https://en.wikisource.org/wiki/Remarks_by_the_President_on_Osama_bin_Laden
2. https://fas.org/sgp/crs/terror/R44678.pdf
3. https://theintercept.com/2014/08/05/watch-commander/
4. https://www.fbi.gov/file-repository/terrorist-screening-center-frequently-asked-questions.pdf/view

Data Privacy and the Chinese Social Credit System

“Keeping trust is glorious and breaking trust is disgraceful”
By Victoria Eastman | February 24, 2019

Recently, the Chinese Social Credit System has been featured on podcasts, blogs, and news articles in the United States, often highlighting the Orwellian feel of the imminent system China plans to use to encourage good behavior amongst its citizens. The broad scope of this program raises questions about data privacy, consent, algorithmic bias, and error correction.

What is the Chinese Social Credit System?

In 2014, the Chinese government released a document entitled “Planning Outline for the Construction of a Social Credit System.” The system uses a broad range of public and private data to rank each citizen on a scale from 0-800. Higher ratings offer citizens benefits like discounts on energy bills, more matches on dating websites, and lower interest rates. Low ratings incur such punishments as the inability to purchase plane or train tickets, banishment for you and your children from universities, and even pet confiscation in some provinces. The system has been undergoing testing in various provinces around the country with different implementations and properties, but the government plans to take the rating system nationwide in 2020.

The exact workings of the system have not been explicitly detailed by the Chinese government, but details have emerged since the policy was announced. Data is collected from a number of private and public sources: chat and email data; online shopping history; loan and debt information; smart devices, including smart phones, smart home devices, and fitness trackers; criminal records; travel patterns and location data; and the nationwide network of millions of cameras that watch Chinese citizens. Even your family members and other people you associate with can affect your score. The government has signed up more than 44 financial institutions and has issued at least 8 licenses to private companies such as Alibaba, Tencent, and Baidu to submit data to the system. Algorithms are run over the entire dataset and generate a single credit score for each citizen.

This score will be publicly available on any number of platforms, including newspapers, online media, and even some people’s phones, so that when you call a person with a low score, you will hear a message telling you the person you are calling has low social credit.

What does it mean for privacy and consent?

On May 1st, 2018, China announced the Personal Information Security Specification, a set of non-binding guidelines to govern the collection and use of personal data of Chinese citizens. The guidelines appear similar to the European GDPR with some notable differences, namely a focus on national security. Under these rules, individuals have full rights to their data, including erasure, and must provide consent for any use of their personal data by the collecting company.

How do these guidelines jibe with the social credit system? The connection between the two policies has not been explicitly outlined by the Chinese government, but at first blush it appears there are some key conflicts between them. Do citizens have erasure power over their poor credit history or other details that negatively affect their score? Are companies required to ask for consent to send private information to the government if it’s to be used in the social credit score? If the social credit score is public, how much control do individuals really have over the privacy of their data?

Other concerns about the algorithms themselves have also been raised. How are individual actions weighted by the algorithm? Are some ‘crimes’ worse than others? Does recency matter? How can incorrect data be fixed? Is the government removing demographic information like age, gender, or ethnicity or could those criteria unknowingly create bias?

Many citizens with high scores are happy with the system that gives them discounts and preferential treatment, but others fear the system will be used by the government to shape behavior and punish actions deemed inappropriate by the government. Dissidents and minority groups fear the system will be biased against them.

There are still many details that are unclear about how the system will work on a nationwide scale. However, there are clear discrepancies between the data privacy policy China announced last year and the scope of the social credit system. How the government addresses these problems will likely lead to even more podcasts, news articles, and blogs.


Sacks, Sam. “New China Data Privacy Standard Looks More Far-Reaching than GDPR”. Center for Strategic and International Studies. Jan 29, 2018. https://www.csis.org/analysis/new-china-data-privacy-standard-looks-more-far-reaching-gdpr

Denyer, Simon. “China’s plan to organize its society relies on ‘big data’ to rate everyone”. The Washington Post. Oct 22, 2016. https://www.washingtonpost.com/world/asia_pacific/chinas-plan-to-organize-its-whole-society-around-big-data-a-rating-for-everyone/2016/10/20/1cd0dd9c-9516-11e6-ae9d-0030ac1899cd_story.html?utm_term=.1e90e880676f

Doxing: An Increased (and Increasing) Privacy Risk

By Mary Boardman | February 24, 2019

Doxing (or doxxing) is a form of online abuse where one party releases another person’s sensitive and/or personally identifiable information. While it isn’t the only risk associated with a privacy breach, it is one that can put people physically in harm’s way. For instance, this data can include information such as name, address, and telephone number. Such information exposes doxing victims to threats, harassment, and even violence.

People dox others for many reasons, all with the intention of harm. Because more data is more available to more people than ever, we can and should assume the risk of being doxed is also increasing. For those of us working with this data, we need to remember that there are actual humans behind the data we use. As data stewards, it is our obligation to understand the risks to these people and do what we can to protect them and their privacy interests. We need to be deserving of their trust.

Types of Data Used
To address a problem, we must first understand it. Doxing happens when direct identifiers are released, but these aren’t the only data that can lead to doxing. Some data, such as indirect identifiers, can also be used to dox people. Below are various levels of identifiability and examples of each:

  • Direct Identifier: Name, Address, SSN
  • Indirect Identifier: Date of Birth, Zip Code, License Plate, Medical Record Number, IP Address, Geolocation
  • Data Linking to Multiple Individuals: Movie Preferences, Retail Preferences
  • Data Not Linking to Any Individual: Aggregated Census Data, Survey Results
  • Data Unrelated to Individuals: Weather

Anonymization and De-anonymization of Data
Anonymization is a common response to privacy concerns and can be seen as an attempt to protect people’s privacy. The way this is done is by removing identifiers from a dataset. However, because this data can be de-anonymized, anonymization is not a guarantee of privacy. In fact, we should never assume that anonymization can provide more than a level of inconvenience for a doxer. (And, as data professionals, we should not assume anonymization is enough protection.)

Generally speaking, there are four types of anonymization:
1. Remove identifiers entirely.
2. Replace identifiers with codes or pseudonyms.
3. Add statistical noise.
4. Aggregate the data.
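
As a rough illustration of these four approaches, here is a minimal Python sketch. Everything in it is an assumption made for the example: the records are synthetic, the field names (name, zip, age) and the salt are hypothetical, and uniform integer noise stands in for whatever noise distribution a real project would choose.

```python
import hashlib
import random
from collections import Counter

# Synthetic records invented for illustration; not real people.
records = [
    {"name": "Alice Example", "zip": "94704", "age": 34},
    {"name": "Bob Example", "zip": "94704", "age": 41},
    {"name": "Carol Example", "zip": "94110", "age": 29},
]


def remove_identifiers(rows, fields=("name",)):
    """1. Drop the identifying fields entirely."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def pseudonymize(rows, field="name", salt="hypothetical-secret-salt"):
    """2. Replace an identifier with a code (here, a truncated salted hash)."""
    out = []
    for row in rows:
        digest = hashlib.sha256((salt + row[field]).encode("utf-8")).hexdigest()
        out.append({**row, field: digest[:10]})
    return out


def add_noise(rows, field="age", scale=2, seed=0):
    """3. Perturb a numeric field with small random integer noise."""
    rng = random.Random(seed)
    return [{**row, field: row[field] + rng.randint(-scale, scale)} for row in rows]


def aggregate(rows, field="zip"):
    """4. Publish only counts per group instead of individual rows."""
    return dict(Counter(row[field] for row in rows))


no_names = remove_identifiers(records)
coded = pseudonymize(records)
noisy = add_noise(records)
counts = aggregate(records)
```

Note that the pseudonymized and noisy versions still carry the zip code, which is exactly the kind of indirect identifier that makes re-identification possible.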

De-anonymization (or re-identification) is where data that had been anonymized are accurately matched with the original owner or subject. This is often done by combining two or more datasets containing different information about the same or overlapping groups of people. For instance, anonymized data from social media accounts could be combined to identify individuals. Often this risk is highest when anonymized data is sold to third parties who then re-identify people.
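
To make the idea of combining datasets concrete, the following Python sketch joins an "anonymized" table to a public table on shared indirect identifiers. The data is synthetic and the names (anonymized_health, public_roll, link) are hypothetical; it is a toy illustration of a linkage attack, not a description of any particular study's method.

```python
# Synthetic data invented for illustration; not real people.
anonymized_health = [
    {"zip": "94704", "birthdate": "1985-03-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birthdate": "1990-07-15", "sex": "M", "diagnosis": "diabetes"},
]
public_roll = [
    {"name": "Alice Example", "zip": "94704", "birthdate": "1985-03-02", "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birthdate": "1990-07-15", "sex": "M"},
    {"name": "Dan Example", "zip": "94110", "birthdate": "1971-01-30", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def link(anonymized, identified, keys=QUASI_IDENTIFIERS):
    """Attach a name to each anonymized row whose key combination is unique."""
    index = {}
    for person in identified:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate means a re-identification
            matches.append({"name": names[0], **row})
    return matches


reidentified = link(anonymized_health, public_roll)
for row in reidentified:
    print(row["name"], "->", row["diagnosis"])
```

Neither table contains both a name and a diagnosis, yet the join recovers the pairing for every row whose combination of zip code, birthdate, and sex is unique.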

One example of this is Sweeney’s 2002 paper, which reports that 87% of the US population could likely be uniquely identified with just zip code, birthdate, and sex. Another example is work by Acquisti and Gross from 2009, where they were able to predict social security numbers from birthdate and geographic location. A further example is a 2018 study by Kondor et al., where they were able to identify people based on mobility and spatial data. While their study only had a 16.8% success rate after a week, this jumped to 55% after four weeks.

Actions Moving Forward
There are many options data professionals can take. These range from being negligent stewards, doing as little as possible, to the more sophisticated differential privacy option. El Emam presented a protocol back in 2016 that does a very elegant job of balancing feasibility with effectiveness to anonymize data. He proposed the following steps:

1. Classify variables according to direct, indirect, and non-identifiers
2. Remove or replace direct identifiers with a pseudonym
3. Use a k-anonymity method to de-identify the indirect identifiers
4. Conduct a motivated intruder test
5. Update the anonymization with findings from the test
6. Repeat as necessary
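
Steps 1 through 3 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not El Emam's actual tooling: the records are synthetic, the column classification (DIRECT, INDIRECT) is hypothetical, and generalization is done by the crude method of masking trailing zip digits and bucketing birth years until every group has at least k members. Steps 4 through 6 involve human testing and are not shown.

```python
from collections import Counter

# Step 1 (assumed classification): which columns identify people, and how.
DIRECT = {"name"}
INDIRECT = ("zip", "birth_year")

# Synthetic records invented for illustration; not real people.
records = [
    {"name": "Alice Example", "zip": "94704", "birth_year": 1985, "purchase": "book"},
    {"name": "Bob Example", "zip": "94709", "birth_year": 1987, "purchase": "lamp"},
    {"name": "Carol Example", "zip": "94110", "birth_year": 1991, "purchase": "pen"},
    {"name": "Dan Example", "zip": "94112", "birth_year": 1994, "purchase": "mug"},
]


def pseudonymize(rows):
    """Step 2: replace direct identifiers with an arbitrary sequential code."""
    out = []
    for i, row in enumerate(rows):
        kept = {k: v for k, v in row.items() if k not in DIRECT}
        out.append({"pid": f"P{i:04d}", **kept})
    return out


def generalize(row, level):
    """Coarsen indirect identifiers: mask `level` zip digits, bucket years."""
    zip_code = row["zip"]
    keep = max(len(zip_code) - level, 0)
    width = 10 ** level  # year bucket width: 1, 10, 100, ...
    return {
        **row,
        "zip": zip_code[:keep] + "*" * (len(zip_code) - keep),
        "birth_year": row["birth_year"] // width * width,
    }


def smallest_group(rows):
    """Size of the smallest group sharing the same indirect identifiers."""
    groups = Counter(tuple(row[k] for k in INDIRECT) for row in rows)
    return min(groups.values())


def k_anonymize(rows, k, max_level=5):
    """Step 3: raise the generalization level until every group has >= k rows."""
    for level in range(max_level + 1):
        candidate = [generalize(row, level) for row in rows]
        if smallest_group(candidate) >= k:
            return candidate, level
    raise ValueError("could not reach k-anonymity; consider suppressing rows")


released, level = k_anonymize(pseudonymize(records), k=2)
```

The trade-off is visible even in this toy case: each extra level of generalization makes individuals harder to single out but also makes the released data less precise.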

We are unlikely to ever truly know the risk of doxing (and with it, de-anonymization of PII). However, we need to assume de-anonymization is always possible. Because our users trust us with their data and their assumed privacy, we need to make sure their trust is well-placed and be vigilant stewards of their data and privacy interests. What we do and the steps we take as data professionals can and do have an impact on the lives of the people behind the data.

Works Cited:
Acquisti, A., & Gross, R. (2009). Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences, 106(27), 10975–10980. https://doi.org/10.1073/pnas.0904891106
Electronic Privacy Information Center. (2019). EPIC – Re-identification. Retrieved February 3, 2019, from https://epic.org/privacy/reidentification/
El Emam, Khaled. (2016). A de-identification protocol for open data. In Privacy Tech. International Association of Privacy Professionals. Retrieved from https://iapp.org/news/a/a-de-identification-protocol-for-open-data/
Federal Bureau of Investigation. (2011, December 18). (U//FOUO) FBI Threat to Law Enforcement From “Doxing” | Public Intelligence [FBI Bulletin]. Retrieved February 3, 2019, from https://publicintelligence.net/ufouo-fbi-threat-to-law-enforcement-from-doxing/
Lubarsky, Boris. (2017). Re-Identification of “Anonymized” Data. Georgetown Law Technology Review. Retrieved from https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Narayanan, A., Huey, J., & Felten, E. W. (2016). A Precautionary Approach to Big Data Privacy. In S. Gutwirth, R. Leenes, & P. De Hert (Eds.), Data Protection on the Move (Vol. 24, pp. 357–385). Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-94-017-7376-8_13
Narayanan, A., & Shmatikov, V. (2010). Myths and fallacies of “personally identifiable information.” Communications of the ACM, 53(6), 24. https://doi.org/10.1145/1743546.1743558
Snyder, P., Doerfler, P., Kanich, C., & McCoy, D. (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference on – IMC ’17 (pp. 432–444). London, United Kingdom: ACM Press. https://doi.org/10.1145/3131365.3131385
Sweeney, L. (2002). k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. https://doi.org/10.1142/S0218488502001648

Android Apps in the Hot Seat for Violating Privacy Rules

Over 17k Android Apps in the Hot Seat for Violating Privacy Rules
A new ICSI study shows that Google’s user-resettable advertising IDs aren’t working
by Kathryn Hamilton (https://www.linkedin.com/in/hamiltonkathryn/)
February 24, 2019

What’s going on?
On February 14th 2019, researchers from the International Computer Science Institute (ICSI) published an article claiming that thousands of Android apps are breaking Google’s privacy rules. ICSI claims that while Google provides users with advertising privacy controls, these controls aren’t working. ICSI is concerned for users’ privacy and is looking for Google to address the problem.

But what exactly are the apps doing wrong? Since 2013, Google has required that apps record only the user’s “Ad ID” as an individual identifier. This is a unique code associated with each device that advertisers use to profile users over time. To ensure control remains in the hands of each user, Google allows users to reset their Ad ID at any time. This effectively resets everything that advertisers know about a person so that their ads are once again anonymous.

Unfortunately, ICSI found that some apps are recording other identifiers too, many of which the user cannot reset. These extra identifiers are typically hardware related like IMEI, MAC Address, SIM card ID, or device serial number.

Android’s Ad ID Settings

How does this violate privacy?

Let’s say you’ve downloaded one of the apps that ICSI has identified as being in violation. This list includes everything from Audible and Angry Birds to Flipboard News and antivirus software.

The app sends data about your interests to its advertisers. Included is your resettable advertising ID and your device’s IMEI, a non-resettable code that should not be there. Over time, the ad company begins to build an advertising profile about you, and the ads you see become increasingly personalized.

Eventually, you decide to reset your Ad ID to anonymize yourself. The next time you use the app, it will again send data to its advertisers about your interests, plus your new advertising ID and the same old IMEI.

To a compliant advertiser, you would appear to be a new person—this is how the Ad ID system is supposed to work. For the noncompliant app, however, advertisers simply match your IMEI to the old record they had about you and associate your two Ad IDs together.

Just like that, all your ads go back to being fully personalized, with all the same data that existed before you reset your Ad ID.
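
The scenario above can be mimicked with a short Python sketch. The two advertiser classes, the identifier strings, and the interests are all hypothetical and invented for this example; real ad networks are far more complex, but the matching logic is the same in spirit.

```python
from collections import defaultdict


class CompliantAdvertiser:
    """Keys profiles by the resettable Ad ID only and ignores the IMEI."""

    def __init__(self):
        self.profiles = defaultdict(set)

    def receive(self, ad_id, imei, interest):
        self.profiles[ad_id].add(interest)
        return sorted(self.profiles[ad_id])


class NoncompliantAdvertiser:
    """Also records the IMEI and uses it to bridge Ad ID resets."""

    def __init__(self):
        self.profiles = defaultdict(set)
        self.ad_ids_by_imei = defaultdict(set)

    def receive(self, ad_id, imei, interest):
        self.ad_ids_by_imei[imei].add(ad_id)  # old and new Ad IDs get linked
        self.profiles[imei].add(interest)
        return sorted(self.profiles[imei])


# Made-up identifiers: one device, one Ad ID before the reset and one after.
IMEI = "imei-000000000000000"
events = [
    ("ad-1111", IMEI, "running shoes"),
    ("ad-1111", IMEI, "swimsuits"),
    ("ad-2222", IMEI, "camping gear"),  # sent after the user resets the Ad ID
]

compliant = CompliantAdvertiser()
noncompliant = NoncompliantAdvertiser()
for ad_id, imei, interest in events:
    compliant_view = compliant.receive(ad_id, imei, interest)
    noncompliant_view = noncompliant.receive(ad_id, imei, interest)

print("Compliant profile after reset:   ", compliant_view)
print("Noncompliant profile after reset:", noncompliant_view)
```

After the reset, the compliant advertiser sees a fresh profile with a single interest, while the noncompliant one still holds the full history and now knows both Ad IDs belong to the same device.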

But they’re just ads. Can this really harm me?

I’m sure you have experienced the annoyance of being followed by ads after visiting a product’s page once and maybe even by accident. Or maybe you’ve tried to purchase something secretly for a loved one and had your surprise ruined by some side banner ad. The tangible harm to a given consumer might not be life-altering, but it does exist.

Regardless, the larger controversy here is not the direct harm to a consumer but rather the blatant lack of care or conscience exhibited by the advertisers. This is an example of the ever-present trend of companies being overly aggressive in the name of profit, and not respecting the mental and physical autonomy that should be fundamentally human.

This problem is only growing as personal data becomes more plentiful and more easily accessible. If we’re having this much difficulty anonymizing ads, what kind of trouble will we face when it comes to bigger issues or more sensitive information?

What is going to happen about it?

At this point, you might be thinking that your phone’s app list is due for some attention. Take a look through your apps and delete those you don’t need or use—it’s good practice to clear the clutter regardless of whether an app is leaking data. If you have questions about specific apps, search ICSI’s Android app analytics database, which has privacy reports for over 75,000 Android apps.

In the bigger picture, it’s not immediately clear that Google, app developers, or advertisers have violated any privacy law or that the case warrants government investigation. More likely, it seems that Google is in the public hot seat to provide a fix for the Ad ID system and to crack down on app developers.

Sadly, ICSI reported their finding to Google over five months ago, but have yet to hear back. Their study has spurred many media articles over the past few days, which means Google should feel increasing pressure and negative publicity over this in the coming weeks.

Interestingly, this case is very similar to a 2017 data scandal about Uber’s iOS app, which used hardware-based IDs to tag iPhones even after the Uber app had been deleted. This was in direct violation of Apple’s privacy guidelines, caused large amounts of public outrage, and resulted in threats from Apple CEO Tim Cook to remove Uber from the iOS App Store. Uber quickly updated their app.

It will be interesting to see how public reaction and Google’s response measure up to the loud public outcry and swift action taken by Apple in the case of Uber.

Fall 2017 Test

Hi there, everyone! This is a “test” post to ensure that the process is working as intended and that everyone should have access to create posts of your own!