Doxing: An Increased (and Increasing) Privacy Risk
By Mary Boardman | February 24, 2019
Doxing (or doxxing) is a form of online abuse in which one party releases another person’s sensitive and/or personally identifiable information. While it isn’t the only risk associated with privacy concerns, it is one that can put people physically in harm’s way. For instance, this data can include a person’s name, address, and telephone number. Such information exposes doxing victims to threats, harassment, and even violence.
People dox others for many reasons, all with the intention of causing harm. Because more data is available to more people than ever, we can and should assume the risk of being doxed is also increasing. Those of us working with this data need to remember that there are actual humans behind the data we use. As data stewards, it is our obligation to understand the risks to these people and do what we can to protect them and their privacy interests. We need to be deserving of their trust.
Types of Data Used
To address a problem, we must first understand it. Doxing happens when direct identifiers are released, but these aren’t the only data that can lead to doxing. Other data, such as indirect identifiers, can also be used to dox people. Below are various levels of identifiability and examples of each:
- Direct Identifier: Name, Address, SSN
- Indirect Identifier: Date of Birth, Zip Code, License Plate, Medical Record Number, IP Address, Geolocation
- Data Linking to Multiple Individuals: Movie Preferences, Retail Preferences
- Data Not Linking to Any Individual: Aggregated Census Data, Survey Results
- Data Unrelated to Individuals: Weather
Anonymization and De-anonymization of Data
Anonymization is a common response to privacy concerns and can be seen as an attempt to protect people’s privacy. This is done by removing identifiers from a dataset. However, because this data can be de-anonymized, anonymization is not a guarantee of privacy. In fact, we should never assume that anonymization provides more than a level of inconvenience for a doxer, and as data professionals we should not treat it as sufficient protection on its own.
Generally speaking, there are four types of anonymization:
1. Remove identifiers entirely.
2. Replace identifiers with codes or pseudonyms.
3. Add statistical noise.
4. Aggregate the data.
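To make these four approaches concrete, here is a minimal Python sketch of each one applied to a toy dataset. The records, field names, salt, and noise scale are all made up for illustration, and the functions are simplified stand-ins rather than production-grade techniques. For example, a salted hash is only one way to build a pseudonym, and small uniform noise is far weaker than a formal method such as differential privacy.

```python
import hashlib
import random
from collections import Counter

# Hypothetical toy records; every field name and value is made up.
records = [
    {"name": "Ana Diaz", "zip": "94110", "age": 34},
    {"name": "Ben Ito", "zip": "94110", "age": 37},
    {"name": "Cy Marsh", "zip": "94607", "age": 52},
]

def remove_identifiers(rows, fields=("name",)):
    """1. Drop the listed identifier fields entirely."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]

def pseudonymize(rows, field="name", salt="demo-salt"):
    """2. Replace an identifier with a salted hash used as a pseudonym."""
    out = []
    for r in rows:
        digest = hashlib.sha256((salt + r[field]).encode()).hexdigest()[:10]
        out.append({**r, field: digest})
    return out

def add_noise(rows, field="age", scale=2, seed=0):
    """3. Perturb a numeric field with small random integer noise."""
    rng = random.Random(seed)
    return [{**r, field: r[field] + rng.randint(-scale, scale)} for r in rows]

def aggregate(rows, field="zip"):
    """4. Release only counts per group rather than individual rows."""
    return dict(Counter(r[field] for r in rows))
```

Note that none of these functions touches the zip code in the first three cases, which is exactly the kind of leftover indirect identifier that makes de-anonymization possible.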
De-anonymization (or re-identification) is the process by which anonymized data is accurately matched with its original owner or subject. This is often done by combining two or more datasets containing different information about the same or overlapping groups of people. For instance, anonymized data from several social media accounts could be combined to identify individuals. Often this risk is highest when anonymized data is sold to third parties who then re-identify people.
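The dataset-combining approach described above can be sketched in a few lines of Python. Both datasets here are invented, as are the field names and the `link` helper. The sketch assumes the attacker holds a second dataset that shares zip code, date of birth, and sex with the anonymized release, and it only counts exact, unambiguous matches.

```python
# Hypothetical "anonymized" release: names removed, indirect identifiers kept.
anonymized = [
    {"zip": "94110", "dob": "1985-03-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "94607", "dob": "1990-11-17", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical second dataset that still carries names.
public = [
    {"name": "Ana Diaz", "zip": "94110", "dob": "1985-03-02", "sex": "F"},
    {"name": "Ben Ito", "zip": "94607", "dob": "1990-11-17", "sex": "M"},
    {"name": "Cy Marsh", "zip": "94607", "dob": "1971-06-30", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

def link(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Attach a name to each anonymized row that matches exactly one public row."""
    index = {}
    for p in public_rows:
        index.setdefault(tuple(p[k] for k in keys), []).append(p["name"])
    matches = []
    for a in anon_rows:
        names = index.get(tuple(a[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match is a re-identification
            matches.append({"name": names[0], **a})
    return matches

reidentified = link(anonymized, public)
```

In this toy case both "anonymized" rows are re-identified, and the sensitive diagnosis field now sits next to a name. Real linkage attacks are usually fuzzier than an exact join, but the principle is the same.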
One example of this is Sweeney’s 2002 paper, in which she reported that 87% of the US population could likely be uniquely identified with just zip code, birthdate, and sex. Another example is work by Acquisti and Gross from 2009, where they were able to predict social security numbers with birthdate and geographic location. Other examples include a 2018 study by Kondor, et al., where they were able to identify people based on mobility and spatial data. While their study only had a 16.8% success rate after a week, this jumped to 55% after four weeks.
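The idea behind figures like these can be illustrated by measuring how many records in a dataset are unique on a given combination of fields. The sketch below is only an illustration of that measurement on invented data. It is not the method used in any of the studies above, and the `unique_fraction` helper and field names are assumptions made for the example.

```python
from collections import Counter

def unique_fraction(rows, keys):
    """Share of rows whose combination of `keys` appears exactly once."""
    if not rows:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

# Hypothetical records containing only indirect identifiers.
people = [
    {"zip": "94110", "dob": "1985-03-02", "sex": "F"},
    {"zip": "94110", "dob": "1985-03-02", "sex": "F"},
    {"zip": "94110", "dob": "1979-08-21", "sex": "M"},
    {"zip": "94607", "dob": "1990-11-17", "sex": "M"},
]

by_zip_only = unique_fraction(people, ("zip",))               # 0.25
by_all_three = unique_fraction(people, ("zip", "dob", "sex"))  # 0.5
```

Each added field makes more people stand out: here one in four records is unique on zip code alone, but half are unique once birthdate and sex are included.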
Actions Moving Forward
Data professionals have many options. These range from being negligent stewards who do as little as possible to more sophisticated approaches such as differential privacy. El Emam presented a protocol back in 2016 that does a very elegant job of balancing feasibility with effectiveness to anonymize data. He proposed the following steps:
1. Classify variables according to direct, indirect, and non-identifiers
2. Remove or replace direct identifiers with a pseudonym
3. Use a k-anonymity method to de-identify the indirect identifiers
4. Conduct a motivated intruder test
5. Update the anonymization with findings from the test
6. Repeat as necessary
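The first three steps above lend themselves to a small Python sketch. This is a simplified illustration under stated assumptions, not El Emam’s implementation: the toy records, the classification of fields, the generalization rules (3-digit zip prefix and 10-year age bands), and all function names are invented for the example. A dataset is k-anonymous when every combination of indirect-identifier values is shared by at least k records.

```python
from collections import Counter

# Step 1: an assumed classification for this toy schema.
DIRECT = ("name",)
INDIRECT = ("zip", "age")

def drop_direct(rows):
    """Step 2: remove direct identifiers."""
    return [{k: v for k, v in r.items() if k not in DIRECT} for r in rows]

def generalize(row):
    """Step 3: coarsen indirect identifiers (zip prefix, 10-year age band)."""
    low = (row["age"] // 10) * 10
    return {**row, "zip": row["zip"][:3] + "**", "age": f"{low}-{low + 9}"}

def smallest_group(rows, keys=INDIRECT):
    """Size of the smallest group sharing the same indirect-identifier values."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return min(counts.values()) if counts else 0

def is_k_anonymous(rows, k, keys=INDIRECT):
    """True when every indirect-identifier combination covers at least k rows."""
    return smallest_group(rows, keys) >= k

# Hypothetical records; every value is made up.
records = [
    {"name": "Ana Diaz", "zip": "94110", "age": 34, "diagnosis": "asthma"},
    {"name": "Ben Ito", "zip": "94112", "age": 37, "diagnosis": "flu"},
    {"name": "Cy Marsh", "zip": "94607", "age": 52, "diagnosis": "diabetes"},
    {"name": "Di Novak", "zip": "94609", "age": 58, "diagnosis": "asthma"},
]

released = [generalize(r) for r in drop_direct(records)]
```

Dropping names alone leaves every record unique on zip code and age, while the generalized release is 2-anonymous. Steps 4 through 6 are not captured here, since a motivated intruder test depends on a person actively trying to re-identify the released data.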
We are unlikely to ever truly know the risk of doxing (and with it, de-anonymization of PII). However, we need to assume de-anonymization is always possible. Because our users trust us with their data and their assumed privacy, we need to make sure their trust is well-placed and be vigilant stewards of their data and privacy interests. What we do and the steps we take as data professionals can and do have an impact on the lives of the people behind the data.
Acquisti, A., & Gross, R. (2009). Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences, 106(27), 10975–10980. doi.org/10.1073/pnas.0904891106
Electronic Privacy Information Center. (2019). EPIC – Re-identification. Retrieved February 3, 2019, from epic.org/privacy/reidentification/
El Emam, K. (2016). A de-identification protocol for open data. In Privacy Tech. International Association of Privacy Professionals. Retrieved from iapp.org/news/a/a-de-identification-protocol-for-open-data/
Federal Bureau of Investigation. (2011, December 18). (U//FOUO) FBI Threat to Law Enforcement From “Doxing” | Public Intelligence [FBI Bulletin]. Retrieved February 3, 2019, from publicintelligence.net/ufouo-fbi-threat-to-law-enforcement-from-doxing/
Lubarsky, B. (2017). Re-Identification of “Anonymized” Data. Georgetown Law Technology Review. Retrieved from georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Narayanan, A., Huey, J., & Felten, E. W. (2016). A Precautionary Approach to Big Data Privacy. In S. Gutwirth, R. Leenes, & P. De Hert (Eds.), Data Protection on the Move (Vol. 24, pp. 357–385). Dordrecht: Springer Netherlands. doi.org/10.1007/978-94-017-7376-8_13
Narayanan, A., & Shmatikov, V. (2010). Myths and fallacies of “personally identifiable information.” Communications of the ACM, 53(6), 24. doi.org/10.1145/1743546.1743558
Snyder, P., Doerfler, P., Kanich, C., & McCoy, D. (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference (IMC ’17) (pp. 432–444). London, United Kingdom: ACM Press. doi.org/10.1145/3131365.3131385
Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. doi.org/10.1142/S0218488502001648