February 2019 – Data Science W231 | Behind the Data: Humans and Values

February 26, 2019

Data Privacy and the Chinese Social Credit System

“Keeping trust is glorious and breaking trust is disgraceful”
By Victoria Eastman | February 24, 2019

Recently, the Chinese Social Credit System has been featured on podcasts, blogs, and news articles in the United States, often highlighting the Orwellian feel of the imminent system China plans to use to encourage good behavior amongst its citizens. The broad scope of this program raises questions about data privacy, consent, algorithmic bias, and error correction.

What is the Chinese Social Credit System?

In 2014, the Chinese government released a document entitled “Planning Outline for the Construction of a Social Credit System.” The system uses a broad range of public and private data to rank each citizen on a scale from 0-800. Higher ratings offer citizens benefits like discounts on energy bills, more matches on dating websites, and lower interest rates. Low ratings incur such punishments as the inability to purchase plane or train tickets, banishment for you and your children from universities, and even pet confiscation in some provinces. The system has been undergoing testing in various provinces around the country with different implementations and properties, but the government plans to take the rating system nationwide in 2020.

The exact workings of the system have not been explicitly detailed by the Chinese government; however, details have spilled out since the policy was announced. Data is collected from a number of private and public sources: chat and email data; online shopping history; loan and debt information; smart devices, including smart phones, smart home devices, and fitness trackers; criminal records; travel patterns and location data; and the nationwide collection of millions of cameras that watch all Chinese citizens. Even your family members and other people you associate with can affect your score. The government has signed up more than 44 financial institutions and has issued at least 8 licenses to private companies such as Alibaba, Tencent, and Baidu to submit data to the system. Algorithms are run over the entire dataset and generate a single credit score for each citizen.

This score will be publicly available on any number of platforms, including newspapers, online media, and even some people’s phones, so when you call a person with a low score, you will hear a message telling you the person you are calling has low social credit.

What does it mean for privacy and consent?

On May 1st, 2018, China announced the Personal Information Security Specification, a set of non-binding guidelines to govern the collection and use of personal data of Chinese citizens. The guidelines appear similar to the European GDPR with some notable differences, namely a focus on national security. Under these rules, individuals have full rights to their data, including erasure, and must give consent for any use of their personal data by the collecting company.

How do these guidelines jibe with the social credit system? The connection between the two policies has not been explicitly outlined by the Chinese government, but at first blush it appears there are some key conflicts between the two policies. Do citizens have erasure power over their poor credit history or other details that negatively affect their score? Are companies required to ask for consent to send private information to the government if it’s to be used in the social credit score? If the social credit score is public, how much control do individuals really have over the privacy of their data?

Other concerns about the algorithms themselves have also been raised. How are individual actions weighted by the algorithm? Are some ‘crimes’ worse than others? Does recency matter? How can incorrect data be fixed? Is the government removing demographic information like age, gender, or ethnicity or could those criteria unknowingly create bias?

Many citizens with high scores are happy with the system that gives them discounts and preferential treatment, but others fear the system will be used by the government to shape behavior and punish actions deemed inappropriate by the government. Dissidents and minority groups fear the system will be biased against them.

There are still many details that are unclear about how the system will work on a nationwide scale. However, there are clear discrepancies between the published data privacy policy China announced last year and the scope of the social credit system. How the government addresses the problems will likely lead to even more podcasts, news articles, and blogs.

Sources

Sacks, Sam. “New China Data Privacy Standard Looks More Far-Reaching than GDPR”. Center for Strategic and International Studies. Jan 29, 2018. https://www.csis.org/analysis/new-china-data-privacy-standard-looks-more-far-reaching-gdpr

Denyer, Simon. “China’s plan to organize its society relies on ‘big data’ to rate everyone”. The Washington Post. Oct 22, 2016. https://www.washingtonpost.com/world/asia_pacific/chinas-plan-to-organize-its-whole-society-around-big-data-a-rating-for-everyone/2016/10/20/1cd0dd9c-9516-11e6-ae9d-0030ac1899cd_story.html?utm_term=.1e90e880676f

February 26, 2019

Doxing: An Increased (and Increasing) Privacy Risk

By Mary Boardman | February 24, 2019

Doxing (or doxxing) is a form of online abuse where one party releases another person’s sensitive and/or personally identifiable information. While it isn’t the only risk associated with a privacy concern, it is one that can put people physically in harm’s way. For instance, this data can include information such as name, address, and telephone number. Such information exposes doxing victims to threats, harassment, and even violence.

People dox others for many reasons, all with the intention of harm. Because more data is more available to more people than ever, we can and should assume the risk of being doxed is also increasing. For those of us working with this data, we need to remember that there are actual humans behind the data we use. As data stewards, it is our obligation to understand the risks to these people and do what we can to protect them and their privacy interests. We need to be deserving of their trust.

Types of Data Used
To address a problem, we must first understand it. Doxing happens when direct identifiers are released, but these aren’t the only data that can lead to doxing. Some data, such as indirect identifiers, can also be used to dox people. Below are various levels of identifiability and examples of each:

Direct Identifier: Name, Address, SSN
Indirect Identifier: Date of Birth, Zip Code, License Plate, Medical Record Number, IP Address, Geolocation
Data Linking to Multiple Individuals: Movie Preferences, Retail Preferences
Data Not Linking to Any Individual: Aggregated Census Data, Survey Results
Data Unrelated to Individuals: Weather

Anonymization and De-anonymization of Data
Anonymization is a common response to privacy concerns and can be seen as an attempt to protect people’s privacy. The way this is done is by removing identifiers from a dataset. However, because this data can be de-anonymized, anonymization is not a guarantee of privacy. In fact, we should never assume that anonymization can provide more than a level of inconvenience for a doxer. (And, as data professionals, we should not assume anonymization is enough protection.)

Generally speaking, there are four types of anonymization:
1. Remove identifiers entirely.
2. Replace identifiers with codes or pseudonyms.
3. Add statistical noise.
4. Aggregate the data.
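
The four approaches above can be sketched on a toy dataset. This is only an illustrative sketch: the records, field names, salt, and noise scale are all made-up assumptions, and a real release would need far more care (for example, a salted hash of a name is still guessable if the salt leaks).

```python
import hashlib
import random
from collections import Counter

# Hypothetical toy records; every name and value is invented for illustration.
records = [
    {"name": "Ana Ruiz", "zip": "94704", "age": 34},
    {"name": "Bo Chen", "zip": "94704", "age": 37},
    {"name": "Cy Odell", "zip": "94110", "age": 52},
]

def remove_identifiers(rows, fields=("name",)):
    """1. Drop direct identifiers entirely."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]

def pseudonymize(rows, field="name", salt="example-salt"):
    """2. Replace an identifier with a code derived from a salted hash."""
    out = []
    for row in rows:
        digest = hashlib.sha256((salt + row[field]).encode("utf-8")).hexdigest()
        out.append({**row, field: digest[:10]})
    return out

def add_noise(rows, field="age", scale=2, seed=0):
    """3. Perturb a numeric field with small random integer noise."""
    rng = random.Random(seed)
    return [{**row, field: row[field] + rng.randint(-scale, scale)} for row in rows]

def aggregate(rows, field="zip"):
    """4. Release only counts per group instead of individual rows."""
    return dict(Counter(row[field] for row in rows))
```

Note that options 1 and 2 leave the zip code and age untouched, which is exactly the kind of indirect identifier that makes re-identification possible.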

De-anonymization (or re-identification) is where data that had been anonymized are accurately matched with the original owner or subject. This is often done by combining two or more datasets containing different information about the same or overlapping groups of people. For instance, anonymized data from social media accounts could be combined to identify individuals. Often this risk is highest when anonymized data is sold to third parties who then re-identify people.

Image Source:
http://technodocbox.com/Internet_Technology/75952421-De-anonymizing-social-networks-and-inferring-private-attributes-using-knowledge-graphs.html

One example of this is Sweeney’s 2002 paper, which reported that 87% of the US population could be uniquely identified with just zip code, birthdate, and sex. Another example is work by Acquisti and Gross from 2009, where they were able to predict social security numbers with birthdate and geographic location. Other examples include a 2018 study by Kondor, et al., where they were able to identify people based on mobility and spatial data. While their study only had a 16.8% success rate after a week, this jumped to 55% after four weeks.
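
To make the linkage idea concrete, here is a minimal sketch of joining an "anonymized" release to a named public list on zip code, birthdate, and sex. It is not the method of any of the studies above; both datasets, the field names, and the `link` helper are hypothetical and exist only for illustration.

```python
# Hypothetical "anonymized" release: names removed, indirect identifiers kept.
anonymized_health = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birthdate": "1980-01-15", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birthdate": "1980-01-15", "sex": "F", "diagnosis": "diabetes"},
]

# Hypothetical public list that still carries names (e.g. a membership roll).
public_roll = [
    {"name": "Person A", "zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"name": "Person B", "zip": "02139", "birthdate": "1980-01-15", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")

def key(row):
    """Combine the indirect identifiers into a single join key."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

def link(anonymized, public):
    """Return (name, record) pairs where the join key matches exactly one anonymized row."""
    by_key = {}
    for row in anonymized:
        by_key.setdefault(key(row), []).append(row)
    matches = []
    for person in public:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match means the person is re-identified
            matches.append((person["name"], candidates[0]))
    return matches

reidentified = link(anonymized_health, public_roll)
```

In this toy data, Person A is re-identified because their combination of values is unique, while Person B stays ambiguous because two anonymized rows share the same combination.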

Image Source:
https://portswigger.net/daily-swig/block-function-exploited-to-deanonymize-social-media-accounts

Actions Moving Forward
There are many options data professionals can take. These range from being negligent stewards, doing as little as possible, to the more sophisticated differential privacy option. El Emam presented a protocol back in 2016 that does a very elegant job of balancing feasibility with effectiveness to anonymize data. He proposed the following steps:

1. Classify variables according to direct, indirect, and non-identifiers
2. Remove or replace direct identifiers with a pseudonym
3. Use a k-anonymity method to de-identify the indirect identifiers
4. Conduct a motivated intruder test
5. Update the anonymization with findings from the test
6. Repeat as necessary
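
Step 3 can be illustrated with a small sketch of a k-anonymity check: a dataset is k-anonymous when every combination of indirect identifiers is shared by at least k rows. The generalization rules below (3-digit zip prefix, 10-year age bands), the field names, and the sample rows are assumptions chosen for illustration, not part of El Emam’s protocol.

```python
from collections import Counter

def generalize(row):
    """Coarsen indirect identifiers: 3-digit zip prefix and 10-year age band."""
    decade = (row["age"] // 10) * 10
    return {
        "zip": row["zip"][:3] + "**",
        "age": f"{decade}-{decade + 9}",
        "condition": row["condition"],
    }

def k_of(rows, quasi=("zip", "age")):
    """Size of the smallest group of rows sharing the same indirect identifiers."""
    groups = Counter(tuple(row[q] for q in quasi) for row in rows)
    return min(groups.values())

# Hypothetical records with direct identifiers already removed (step 2).
rows = [
    {"zip": "94704", "age": 34, "condition": "flu"},
    {"zip": "94709", "age": 37, "condition": "asthma"},
    {"zip": "94110", "age": 52, "condition": "flu"},
    {"zip": "94117", "age": 58, "condition": "diabetes"},
]

k_before = k_of(rows)                            # every row is unique
k_after = k_of([generalize(r) for r in rows])    # rows now fall into groups of two
```

Coarsening trades detail for safety: the generalized rows are less useful for analysis, which is why the protocol follows up with a motivated intruder test rather than generalizing blindly.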

We are unlikely to ever truly know the risk of doxing (and with it, de-anonymization of PII). However, we need to assume de-anonymization is always possible. Because our users trust us with their data and their assumed privacy, we need to make sure their trust is well-placed and be vigilant stewards of their data and privacy interests. What we do and the steps we take as data professionals can and do have an impact on the lives of the people behind the data.

Works Cited:
Acquisti, A., & Gross, R. (2009). Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences, 106(27), 10975–10980. https://doi.org/10.1073/pnas.0904891106
Electronic Privacy Information Center. (2019). EPIC – Re-identification. Retrieved February 3, 2019, from https://epic.org/privacy/reidentification/
El Emam, Khaled. (2016). A de-identification protocol for open data. In Privacy Tech. International Association of Privacy Professionals. Retrieved from https://iapp.org/news/a/a-de-identification-protocol-for-open-data/
Federal Bureau of Investigation. (2011, December 18). (U//FOUO) FBI Threat to Law Enforcement From “Doxing” | Public Intelligence [FBI Bulletin]. Retrieved February 3, 2019, from https://publicintelligence.net/ufouo-fbi-threat-to-law-enforcement-from-doxing/
Lubarsky, Boris. (2017). Re-Identification of “Anonymized” Data. Georgetown Law Technology Review. Retrieved from https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Narayanan, A., Huey, J., & Felten, E. W. (2016). A Precautionary Approach to Big Data Privacy. In S. Gutwirth, R. Leenes, & P. De Hert (Eds.), Data Protection on the Move (Vol. 24, pp. 357–385). Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-94-017-7376-8_13
Narayanan, A., & Shmatikov, V. (2010). Myths and fallacies of “personally identifiable information.” Communications of the ACM, 53(6), 24. https://doi.org/10.1145/1743546.1743558
Snyder, P., Doerfler, P., Kanich, C., & McCoy, D. (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference on – IMC ’17 (pp. 432–444). London, United Kingdom: ACM Press. https://doi.org/10.1145/3131365.3131385
Sweeney, L. (2002). k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. https://doi.org/10.1142/S0218488502001648

February 25, 2019

Android Apps in the Hot Seat for Violating Privacy Rules

Over 17k Android Apps in the Hot Seat for Violating Privacy Rules
A new ICSI study shows that Google’s user-resettable advertising IDs aren’t working
by Kathryn Hamilton (https://www.linkedin.com/in/hamiltonkathryn/)
February 24, 2019

What’s going on?
On February 14th, 2019, researchers from the International Computer Science Institute (ICSI) published an article claiming that thousands of Android apps are breaking Google’s privacy rules. ICSI claims that while Google provides users with advertising privacy controls, these controls aren’t working. ICSI is concerned for users’ privacy and is looking for Google to address the problem.

But what exactly are the apps doing wrong? Since 2013, Google has required that apps record only the user’s “Ad ID” as an individual identifier. This is a unique code associated with each device that advertisers use to profile users over time. To ensure control remains in the hands of each user, Google allows users to reset their Ad ID at any time. This effectively resets everything that advertisers know about a person so that their ads are once again anonymous.

Unfortunately, ICSI found that some apps are recording other identifiers too, many of which the user cannot reset. These extra identifiers are typically hardware related like IMEI, MAC Address, SIM card ID, or device serial number.

Android’s Ad ID Settings

How does this violate privacy?

Let’s say you’ve downloaded one of the apps that ICSI has identified as being in violation. This list includes everything from Audible and Angry Birds to Flipboard News and antivirus software.

The app sends data about your interests to its advertisers. Included is your resettable advertising ID and your device’s IMEI, a non-resettable code that should not be there. Over time, the ad company begins to build an advertising profile about you, and the ads you see become increasingly personalized.

Eventually, you decide to reset your Ad ID to anonymize yourself. The next time you use the app, it will again send data to its advertisers about your interests, plus your new advertising ID and the same old IMEI.

To a compliant advertiser, you would appear to be a new person—this is how the Ad ID system is supposed to work. For the noncompliant app, however, advertisers simply match your IMEI to the old record they had about you and associate your two Ad IDs together.

Just like that, all your ads go back to being fully personalized, with all the same data that existed before you reset your Ad ID.
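
The scenario above can be sketched in a few lines. This is a hypothetical model of an advertiser-side profile store, not code from any real ad network or from the ICSI study; the class, its method, and the sample IDs are all invented for illustration.

```python
class AdProfileStore:
    """Hypothetical advertiser-side store keyed by Ad ID or by hardware ID."""

    def __init__(self, use_hardware_id):
        self.use_hardware_id = use_hardware_id
        self.profiles = {}  # key -> {"ad_ids": set of str, "interests": list of str}

    def record(self, ad_id, imei, interest):
        # A compliant store keys only on the resettable Ad ID;
        # a noncompliant one keys on the non-resettable IMEI.
        key = imei if self.use_hardware_id else ad_id
        profile = self.profiles.setdefault(key, {"ad_ids": set(), "interests": []})
        profile["ad_ids"].add(ad_id)
        profile["interests"].append(interest)
        return profile

def simulate(use_hardware_id):
    """Send one event, reset the Ad ID, send another; return the profile seen after the reset."""
    store = AdProfileStore(use_hardware_id)
    store.record("ad-id-1", "imei-123", "running shoes")
    # The user resets their Ad ID here; the IMEI cannot change.
    return store.record("ad-id-2", "imei-123", "hiking boots")

compliant = simulate(use_hardware_id=False)
noncompliant = simulate(use_hardware_id=True)
```

After the reset, the compliant store sees a fresh profile containing only "hiking boots", while the noncompliant store still holds both interests and has tied the old and new Ad IDs together.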

But they’re just ads. Can this really harm me?

I’m sure you have experienced the annoyance of being followed by ads after visiting a product’s page once and maybe even by accident. Or maybe you’ve tried to purchase something secretly for a loved one and had your surprise ruined by some side banner ad. The tangible harm to a given consumer might not be life-altering, but it does exist.

Regardless, the larger controversy here is not the direct harm to a consumer but rather the blatant lack of care or conscience exhibited by the advertisers. This is an example of the ever-present trend of companies being overly aggressive in the name of profit, and not respecting the mental and physical autonomy that should be fundamentally human.

This problem is only growing as personal data becomes more plentiful and more easily accessible. If we’re having this amount of difficulty anonymizing ads, what kind of trouble will we face when it comes to bigger issues or more sensitive information?

What is going to happen about it?

At this point, you might be thinking that your phone’s app list is due for some attention. Take a look through your apps and delete those you don’t need or use—it’s good practice to clear the clutter regardless of whether an app is leaking data. If you have questions about specific apps, search ICSI’s Android app analytics database, which has privacy reports for over 75,000 Android apps.

In the bigger picture, it’s not immediately clear that Google, app developers, or advertisers have violated any privacy law or that their actions warrant government investigation. More likely, it seems that Google is in the public hot seat to provide a fix for the Ad ID system and to crack down on app developers.

Sadly, ICSI reported their finding to Google over five months ago, but have yet to hear back. Their study has spurred many media articles over the past few days, which means Google should feel increasing pressure and negative publicity over this in the coming weeks.

Interestingly, this case is very similar to a 2017 data scandal about Uber’s iOS app, which used hardware-based IDs to tag iPhones even after the Uber app had been deleted. This was in direct violation of Apple’s privacy guidelines, caused large amounts of public outrage, and resulted in threats from Apple CEO Tim Cook to delete Uber from the iOS App Store. Uber quickly updated their app.

It will be interesting to see how public reaction and Google’s response measure up to the loud public outcry and swift action taken by Apple in the case of Uber.