Can fake data be good?
By Anonymous | June 20, 2022

With the chaos caused by deepfake videos in recent years, can fake, generated data also do good? Apparently yes: synthetic data has played an increasingly important role in machine learning in recent years. Many in the AI industry even think that synthetic data will become more commonplace than real data as the techniques for generating it improve.

Image source: NVIDIA Blog – Graph of projected synthetic data usage in 10 years

And there’s an ever-growing list of new companies focused on technology to generate all kinds of synthetic data (Devaux 2022). An interesting, albeit slightly creepy, example is the https://thispersondoesnotexist.com/ website, which generates realistic, photo-like images of people who do not actually exist. The image below shows examples from that site.

Image source: thispersondoesnotexist.com – Examples of generated human face images

This article provides a brief overview of synthetic data and its uses. For more details, refer to the Towards Data Science podcast on synthetic data (Harris 2022), from which much of this information originates.

What is synthetic data?

Synthetic data is data generated by an algorithm instead of collected from the real world (Devaux 2021). Depending on the use case the data is generated for, it can have different properties from its source data. Even though it’s generated, the statistics behind it are based on real world data so that its predictive value remains intact.
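
To make the idea concrete, here is a deliberately simplistic sketch (using NumPy, with made-up numbers) of the underlying principle: estimate statistics from real records, then sample new records from those statistics rather than collecting them. Real synthetic-data generators use far more sophisticated models (GANs, copulas, differentially private generators), but the principle is the same.

```python
import numpy as np

# Stand-in "real" data: 500 records of (age, income). In practice this would be
# the sensitive source dataset.
rng = np.random.default_rng(seed=0)
real = rng.normal(loc=[35.0, 52_000.0], scale=[8.0, 9_000.0], size=(500, 2))

# Estimate the statistics we want to preserve: means and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# "Generate" new records from those statistics instead of collecting them.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```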

You can also generate synthetic data from simulations of the real world. For example, self-driving vehicles require a lot of data to run safely, and some of the situations they need to handle, such as accidents, are difficult to come by and would be unethical to create deliberately (Devaux 2022). By simulating such incidents, you can generate data about them without endangering anyone.

Why synthetic data?

Synthetic data has many use cases, from nefarious actors generating deepfake videos to more positive ones like augmenting data collected from the real world when, for example, the dataset is too small or limited in some way (Andrews 2022). The most commonly touted reasons for using synthetic data are privacy protection and speed of development.

Privacy is a huge factor in using synthetic data instead of real-world data. Data from finance, medicine, and other sensitive industries isn’t readily available, and getting access means clearing a lot of hurdles. But synthetic data generated from the statistical properties of those datasets doesn’t need the same protection, because it doesn’t reveal anything about the individuals in the original dataset.

This brings me to the next reason for using synthetic data: the efficiency with which researchers can get access to data. Often, real-world data is either buried under privacy and security restrictions or expensive and time-consuming to collect and transform properly. Synthetic data provides an alternative without losing the predictive power of the original data.

Another useful way to use synthetic data is to test AI systems. As regulators and companies alike use AI more in their products and businesses, they need a way to test those systems without violating privacy. Synthetic data provides a good alternative.

Challenges of synthetic data

Overfitting is one of the challenges of generating synthetic data. This can happen if you generate a large synthetic dataset from a small real-world dataset. Because the pool of source data is limited, a model trained on the synthetic data will usually overfit to the characteristics of that smaller dataset. In extreme cases this can lead to a model memorizing specific individuals’ data, which violates privacy completely. Techniques such as detecting and removing generated records that are too similar to source records, or removing outlier data points, help prevent this; a rough sketch of such a similarity check appears below.
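
As an illustration of that similarity check, the following sketch (scikit-learn on toy arrays; the distance threshold is arbitrary and data-dependent, not a recommended value) flags synthetic rows that sit almost on top of a real record, since those may simply be memorized copies.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(seed=1)
real = rng.normal(size=(200, 3))         # stand-in for the small source dataset
synthetic = rng.normal(size=(1_000, 3))  # stand-in for the generated records

# Distance from each synthetic row to its nearest real row.
nn = NearestNeighbors(n_neighbors=1).fit(real)
distances, _ = nn.kneighbors(synthetic)

too_close = distances.ravel() < 0.05     # arbitrary threshold for this toy example
print(f"dropping {too_close.sum()} near-copies of source records")
filtered = synthetic[~too_close]
```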

Another big challenge is bias. If you’re unaware of a bias in the original dataset, generating another dataset from that original will just duplicate that bias. In some cases it can even exacerbate the bias, if the generated dataset is much larger than the original. There are a number of tools, and a great deal of ongoing work in the field, for detecting and preventing bias in data.

Conclusion

Synthetic data is becoming a mainstay of machine learning. It provides a way to continue to innovate despite the difficulty of collecting real data while still protecting the privacy of the individuals in the original data. Even though there are still big challenges in using these techniques, it seems using synthetic data will continue to be a growing part of AI development.

References

  1. Andrews, G. (2022, May 19). What Is Synthetic Data? | NVIDIA Blogs. NVIDIA Blog. Retrieved June 17, 2022, from https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/
  2. Devaux, E. (2021, December 15). Introduction to privacy-preserving synthetic data – Statice. Medium. Retrieved June 17, 2022, from https://medium.com/statice/introduction-to-privacy-preserving-synthetic-data-f5bccbdb8e0c
  3. Devaux, E. (2022, January 7). List of synthetic data startups and companies — 2021. Medium. Retrieved June 17, 2022, from https://elise-deux.medium.com/the-list-of-synthetic-data-companies-2021-5aa246265b42
  4. Hao, K. (2021, June 14). These creepy fake humans herald a new age in AI. MIT Technology Review. Retrieved June 17, 2022, from https://www.technologyreview.com/2021/06/11/1026135/ai-synthetic-data/
  5. Harris, J. (2022, May 21). Synthetic data could change everything – Towards Data Science. Medium. Retrieved June 17, 2022, from https://towardsdatascience.com/synthetic-data-could-change-everything-fde91c470a5b
  6. Somers, M. (2020, July 21). Deepfakes, explained. MIT Sloan. Retrieved June 17, 2022, from https://mitsloan.mit.edu/ideas-made-to-matter/deepfakes-explained
  7. Watson, A. (2022, March 24). How to Generate Synthetic Data: Tools and Techniques to Create Interchangeable Datasets. Gretel.Ai. Retrieved June 17, 2022, from https://gretel.ai/blog/how-to-generate-synthetic-data-tools-and-techniques-to-create-interchangeable-datasets

Image References

  1. Wang, P. (n.d.). [Human faces generated by AI]. Thispersondoesnotexist.com. https://imgix.bustle.com/inverse/4b/17/8f/0e/cf91/4506/99c7/e6a491c5d4ac/these-people-are-not-real–they-were-produced-by-our-generator-that-allows-control-over-different-a.png?w=710&h=426&fit=max&auto=format%2Ccompress&q=50&dpr=2
  2. Andrews, G. (2022, May 19). What Is Synthetic Data? | NVIDIA Blogs. NVIDIA Blog. Retrieved June 17, 2022, from https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/

Tesla: Should You Pay For My Car Insurance?
By Melinda Leung | June 17, 2022

Twitter lede: NHTSA published a report identifying that Tesla is involved in 75% of incidents involving autonomous vehicles. But before we blame Tesla, we need to better understand the data and context behind those numbers.

The National Highway Traffic Safety Administration (NHTSA) recently published its first-ever report on vehicles using driver-assist technologies. It found that there have been 367 crashes in the last nine months involving vehicles that were using these types of advanced driver assistance systems. Almost 75% of the incidents involved a Tesla running the company’s iconic Autopilot system, with three of them leading to injuries and five resulting in deaths.

Before we jump the gun, blame Tesla for these issues, and forever veto all self-driving vehicles, we need to take a step back and understand the context behind the numbers.

Levels of Autonomy

First, how do autonomous vehicles even work? There are actually five levels of autonomy.

  • Level 0: We’re still driving. Hence, this is not even considered a level in the autonomy scale.
  • Level 1: The vehicle can assist with basic steering, braking and accelerating.
  • Level 2: The vehicle can control steering, braking, and acceleration (adaptive cruise control, lane keeping). However, the human driver still needs to monitor the environment at all times. This is the level Tesla’s Autopilot officially operates at today.
  • Level 3: The vehicle can perform most driving tasks. However, the human driver needs to be ready to take control if needed and essentially acts as the figure behind the wheel.
  • Level 4: The vehicle can perform all driving tasks and monitor the driving environment in most conditions. The human doesn’t have to pay attention during those times.
  • Level 5: Forget steering wheels. At this point, the vehicle completely drives for you and the human occupants are just passengers. They are not involved in the driving whatsoever.

There are 5 levels of autonomous vehicles. (Source: Lemberg Law)

Now that we understand the levels of autonomy, we know that there is still a human component to Tesla’s Autopilot feature. Full self-driving isn’t fully here yet, so accidents that occur with driver-assist technologies still very much involve human interaction.

So Can We Blame Tesla Yet?

Not quite yet. Because Tesla is the brand name associated with autonomous vehicles, it is no surprise that it also sells the largest number of vehicles with the most advanced driver assistance technologies. Being the largest and most well-known player in the industry, Tesla will naturally account for the largest raw count of crashes. Counting crashes without accounting for exposure is a classic base rate fallacy; what is more useful is to look at the rate of accidents per mile driven, as the illustration below shows.
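
Here is a tiny, purely hypothetical illustration of that point (the numbers are invented for the sake of the arithmetic and are not from the NHTSA report): the automaker with far more raw crashes can still have the lower crash rate once you normalize by miles driven with driver assistance engaged.

```python
# Invented figures, for illustration only; not from the NHTSA report.
crashes = {"Automaker A": 270, "Automaker B": 30}
adas_miles = {"Automaker A": 3_000_000_000, "Automaker B": 150_000_000}

for maker in crashes:
    rate = crashes[maker] / adas_miles[maker] * 1_000_000  # crashes per million ADAS miles
    print(f"{maker}: {crashes[maker]} crashes, {rate:.2f} per million miles")

# Automaker A has 9x the crashes but a lower per-mile rate (0.09 vs 0.20).
```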

Unlike most automakers, Tesla also knows exactly which vehicles were using Autopilot at the time of a crash. Its vehicles are equipped with cellular connectivity that automatically reports this information back to Tesla when a crash occurs. Not all vehicles do so. Therefore, Tesla’s systems may also be better at relaying crash information than others.

Next, what if the crash would have happened regardless of whether the vehicle was in Autopilot? For example, if the car behind you was driving too quickly and rear-ended you, it didn’t matter who was driving: you would have been hit no matter what. Because we don’t have any context about the type of accident, it is difficult to determine who is at fault.

Lastly, according to NHTSA, these companies need to document crashes if any automated technologies were used within 30 seconds of impact. According to Waymo, Alphabet’s autonomous driving division, a third of its reported crashes took place when the vehicle was in manual mode but still fell within this 30-second window. Waymo is one of the oldest players in this industry, and we can extrapolate and expect similar statistics for Tesla. If that’s the case, it’s really difficult to put 100% of the blame on Tesla.

If two cars on Autopilot crash and this is a common occurrence, then yes, let’s make Elon Musk pay for our increased car insurance premiums. But until we have a lot more data about the conditions of these crashes, it’s hard for us to determine who is really at fault or make sweeping assumptions about the safety of these vehicles.

References

  1. National Highway Traffic Safety Administration. (2022, June). Summary Report: Standing General Order on Crash Reporting for Level 2 Advanced Driver Assistance Systems. https://www.nhtsa.gov/sites/nhtsa.gov/files/2022-06/ADAS-L2-SGO-Report-June-2022.pdf
  2. Lemberg Law. What You Need to Know about Driverless Cars. https://lemberglaw.com/are-driverless-cars-safe/ 
  3. McFarland, Matt. (2022, March 16). CNN. Tesla owners say they are wowed — and alarmed — by ‘full self-driving’. https://www.cnn.com/2021/11/03/cars/tesla-full-self-driving-fsd/index.html 
  4. Hawkins, Andrew. (2022, June 15). The Verge. US releases new driver-assist crash data & surprise, it’s mostly Tesla. https://www.theverge.com/2022/6/15/23168088/nhtsa-adas-self-driving-crash-data-tesla

This Pride Month, the Fight for LGBTQ Equality Continues
By Dustin Cox | March 16, 2022

Is it time for another Stonewall? Progress, as LGBTQ folks have learned, is often hard fought. Recent events have reminded us that equality is still elusive in places across the United States, and progress is far from inevitable. Limitations on how we talk about, categorize, and analyze data are a little-understood front in GOP culture wars targeting LGBTQ people at various levels of government. These limitations fundamentally undermine the ability of researchers, public policy makers, and data scientists to understand the LGBTQ community, craft effective measures to promote general welfare, and design effective AI and machine-learning-powered capabilities for good.

Republican Governor Ron DeSantis recently rammed through Florida’s “Don’t Say Gay” law, which will “prohibit instruction about sexual orientation and gender identity” and goes into effect just two weeks from now on July 1st (CBS, 2022). This chilling limitation on speech is a stark reminder that many on the political right desire to take us back to a time when LGBTQ people were relegated to the closet, marginalized, and even criminalized in society. It threatens LGBTQ youths’ safety by ensuring they have less knowledge and feel more isolated; jeopardizes LGBTQ teachers’ jobs and livelihoods if they let slip that they have same-sex spouses; and demeans LGBTQ families in Florida. And that’s the point. No discussing data, no studying it, and certainly no progress if it has anything to do with LGBTQ topics.

It’s not only at the state level that we see such policies being enacted. The federal government, under former President Donald J. Trump, advanced various changes aimed at erasing LGBTQ people from critical data sources used for a wide array of government initiatives. The Health and Human Services Department removed sexual orientation from its data collection activities, most notably the National Survey of Older Americans Act Participants and the Annual Program Performance Report for the Centers for Independent Living, which will limit the ability to serve LGBTQ seniors (Kozuch, 2017). The US Census Bureau even altered a report to Congress, literally erasing its plans to measure sexual orientation and gender identity in the American Community Survey (HRC, 2017).

As many data scientists will tell you, “more data beats better algorithms.” That is to say that – embarrassingly often – simply having higher quality or more training data will yield more accurate predictions than new algorithms do. The right-wing push to suppress, eliminate, and criminalize data and speech about LGBTQ people strikes at the very core of our ability to utilize AI and machine learning techniques to categorize, understand, model, and predict in ways that would benefit LGBTQ communities. When LGBTQ people are categorized incorrectly, lumped together with broader groups, or thrown away as “residual,” it makes for worse health outcomes, fewer government resources, inequitable public policy, and more. Again… that’s the point.

While these policies are more abstract in nature, they discriminate, harass, and abuse the LGBTQ community like police officers did decades ago in New York City. Drag queens, trans women of color, lesbians, and gays rose up against their antagonists in the Stonewall Uprising in 1969 and demanded equality (History.com, 2022)… it just may be time to rise up against this new wave of oppressors who would see us erased.

References
[1] CBS Miami Team (2022). CBS Broadcasting, Inc. https://www.cbsnews.com/miami/news/gov-ron-desantis-addresses-woke-gender-ideology-dont-say-gay-law/
[2] Kozuch, Elliott (2017). Human Rights Campaign. https://www.hrc.org/news/hrc-calls-on-trump-admin-to-reinstate-sexual-orientation-question
[3] HRC Staff (2017). Human Rights Campaign. https://www.hrc.org/news/trump-administration-eliminates-lgbtq-data-collection-from-census
[4] Schnoebelen, Tyler (2016). More Data Beats Better Algorithms. Data Science Central. https://www.datasciencecentral.com/more-data-beats-better-algorithms-by-tyler-schnoebelen/
[5] History.com Editors (2022). History.com. https://www.history.com/topics/gay-rights/the-stonewall-riots

Images
Image 1: https://news.harvard.edu/gazette/story/2019/06/harvard-scholars-reflect-on-the-history-and-legacy-of-the-stonewall-riots/
Image 2: https://cbs4indy.com/news/bill-passes-in-senate-would-allow-businesses-to-deny-service-to-gay-couples/
Image 3: https://www.documentarytube.com/articles/stonewall-riots-the-protest-that-inspired-modern-pride-parades

Surveillance in Bulk
By Chandler Haukap | February 18, 2022

When is the government watching, and do you have a right to know?
In 2014, the ACLU documented [47 civilians injured in no-knock warrants](https://www.aclu.org/report/war-comes-home-excessive-militarization-american-police). Since then, the deaths of Breonna Taylor and Amir Locke have resulted in mass protests opposed to the issuance of no-knock warrants.

While the country debates the right of citizens to security in their own homes, the federal government is also extending its reach into our virtual spaces. In 2019 the ACLU learned that the NSA was reading and storing text messages and phone calls from United States citizens with no legal justification to do so. While the stakes of these virtual no-knock raids are much lower in terms of human life, the surveillance of our online communities could manifest a new form of oppression if left unchecked.

Privacy rights online

Online privacy is governed by the Stored Communications Act, which is part of the larger Electronic Communications Privacy Act. The policy was signed into law in 1986: three years before the World Wide Web, 19 years before Facebook. The policy is archaic and does not scale well to a world where 3.6 billion people are on a social media platform.

The Stored Communications Act distinguishes between data stored for 180 days or more and data stored for less than 180 days. For data less than 180 days old, the government must obtain a warrant to view it. Data stored for more than 180 days can be surveilled using a warrant, subpoena, or court order. However, there is one loophole. All data stored “solely for the purpose of providing storage or computer processing services” fall under the same protections as data stored for more than 180 days. This means that once you’ve opened an email it could be considered “in storage”; fair game for the government to read using a court order.

Furthermore, the court can also issue a gag order to the data provider that prevents them from informing you that the government is watching.

Three mechanisms of the current law raise concern:
1) With a search warrant, the government does not have to inform you that they are spying on you.
2) The government can gag private companies from informing you.
3) The government can request multiple accounts per warrant.

These three mechanisms are a recipe for systematic oppression. The Dakota Access Pipeline protest camp contained a few hundred participants at any point. The only thing stopping the government from surveilling all of their online interactions is a warrant issued by a court that we cannot observe or contest.

Bulk Requests

How could we ever know if the government is reading protesters’ texts? With private companies gagged and warrants issued in bulk, we cannot know the extent of the surveillance until years later. Luckily we can see some aggregated statistics from Facebook and Twitter. Both companies issue reports on government requests aggregated every 6 months. The number of requests for user information by the government increases every year at both companies, but the truly disturbing statistic is the number of accounts per request.

The government is requesting more accounts per request from Facebook
Source: https://github.com/chaukap/chaukap.github.io/raw/main/img/Facebook_Account_Requests.png

 

By dividing the number of accounts requested by the number of requests made, we get the average number of accounts per request (a simple computation, sketched below). Facebook’s data shows a steady increase in the number of accounts per request, which suggests that the government is emboldened to issue more bulk requests.
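
For readers who want to reproduce the figures, the computation is just a division over each company’s semi-annual transparency report totals. The sketch below uses placeholder numbers purely to show the shape of the calculation; substitute the totals published in the Facebook and Twitter reports before drawing any conclusions.

```python
import pandas as pd

# Placeholder totals; replace with the figures from each semi-annual
# transparency report.
reports = pd.DataFrame({
    "period": ["2020 H2", "2021 H1", "2021 H2"],
    "requests": [50_000, 55_000, 60_000],
    "accounts_requested": [80_000, 95_000, 120_000],
})

reports["accounts_per_request"] = reports["accounts_requested"] / reports["requests"]
print(reports)
```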

The government is requesting more accounts per request from Twitter
Source: https://github.com/chaukap/chaukap.github.io/raw/main/img/Twitter_Account_Requests.png

 

Twitter’s data doesn’t trend as strongly upward, but the last half of 2021 is truly disturbing. Over 9 accounts per request! When will the government simply request an entire protest movement, or have they already?

Every warrant is a chip at our Fourth Amendment right to privacy. This new mechanism is an atomic bomb. It’s a recipe for guilt by association and the violation of entire online communities.

Privacy is a human right, and we deserve laws that respect it.

“If you’re doing nothing wrong, you have nothing to hide from the giant surveillance apparatus the government’s been hiding.” (Stephen Colbert)

Lived Experts Are Essential Collaborators for Impactful Data Work
By Alissa Stover | February 18, 2022

Imagine you are a researcher working in a state agency that administers assistance programs (like Temporary Assistance for Needy Families, a.k.a. TANF, a.k.a. cash welfare). Let’s assume you are in an agency that values using data to guide program decision-making, so you have support of leadership and access to the data you need. How might you go about your research?

If we consider a greatly simplified process, you would probably start by talking to agency decision-makers about their highest-priority questions. You’d figure out which ones might benefit from the help of a data-savvy researcher like you. You might then go see what data is available to you, query it, and dive into your analyses. You’d discuss your findings with other analysts, agency leadership, and maybe other staff along the way. Maybe at the end you’d summarize your findings in a report, or just dive into another project. Along the way, you want to do good; you want to help the people the agency serves. And although this sketch captures little of the complexity of the real, iterative steps researchers go through, it is accurate in one respect: at no point in this process did the researcher engage actual program participants in identifying and prioritizing problems or in going about solving them. Although some researchers might incorporate the perspectives of program participants into their analyses by collecting additional qualitative data from them, the vast majority of researchers do not work with program participants as collaborators in research.

A growing number of researchers recognize the value of participatory research, or directly engaging with those most affected by the issue at hand (and who thus have lived expertise) [11]. Rather than including the end-user only as a research “subject”, lived experts are treated as co-owners of the research with real decision-making power. In our example, the researcher might work with program participants and front-line staff in figuring out what problem to solve. What better way to do “good” by someone than to ask what they actually need, and then do it? How easily can we do harm if we ignore those most affected by a problem when trying to solve it? [2]

Doing research using data does not mean you are an unbiased actor; every researcher brings their own perspective into the work and without including other points of view will blindly replicate structural injustices in their work [6]. Working with people who have direct experience with the issue at hand brings a contextual understanding that can improve the quality of research and is essential for understanding potential privacy harms [7]. For example, actually living the experience of being excluded can help uncover where there might be biases in data from under- or over-representation in a dataset [1]. Lived experts bring a lens that can improve practices around data collection and processing and ultimately result in higher quality information [8].

But is consultation the same as collaboration? Not really. What really makes participatory research an act of co-creation is power sharing. As Sascha Costanza-Chock puts it, “Don’t start by building a new table; start by coming to the table.” Rather than going to the community to ask them to participate in pre-determined research activities, our researcher could go to the community and ask what they should focus on and how to go about doing the work [3].

Image from: Chicago Beyond, “Why Am I Always Being Researched?”. Accessed at: https://chicagobeyond.org/wp-content/uploads/2019/05/ChicagoBeyond_2019Guidebook_19_small.pdf

Doing community-led research is hard. It takes resources and a lot of self-reflection and inner work on the part of the researcher. Many individuals who might be potential change-makers in this research face barriers to engagement that stem from traumatic experiences with researchers in the past and feelings of vulnerability [5]. Revealingly, many researchers who publish about their participatory research findings don’t even credit the contributions of nonacademic collaborators [9].

Despite the challenges, community-led research could be a pathway to true change in social justice efforts. In our TANF agency example, the status quo is a system that serves a diminishing proportion of families in poverty and does not provide enough assistance to truly help families escape poverty, despite decades of research on this program asking “What works?” for program participants [10]. Many efforts to improve these programs with data have upheld the status quo or have made the situation even worse for families [4].

Image from: Center for Budget and Policy Priorities, “TANF Cash Assistance Helps Families, But Program Is Not the Success Some Claim”. Accessed at: https://www.cbpp.org/research/family-income-support/tanf-cash-assistance-helps-families-but-program-is-not-the-success

Participatory research is not a panacea. Deep changes in our social fabric require cultural and policy change on a large scale. However, a commitment to holding oneself accountable to the end-user of a system and choosing to co-create knowledge with them could be a small way individual researchers create change in their own immediate context.

References

[1] Buolamwini, J. and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81:1-15. Accessed at: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

[2] Chicago Beyond. (2019). Why Am I Always Being Researched?: A guidebook for community organizations, researchers, and funders to help us get from insufficient understanding to more authentic truth. Chicago Beyond. Accessed at: https://chicagobeyond.org/researchequity/

[3] Costanza-Chock, S. (2020). Design Justice: Community-Led Practices to Build the Worlds We Need. MIT Press. Accessed at: https://design-justice.pubpub.org/

[4] Eubanks, V. (2018). Automating inequality: how high-tech tools profile, police, and punish the poor. First edition. New York, NY: St. Martin’s Press.

[5] Grayson, J., Doerr, M. and Yu, J. (2020). Developing pathways for community-led research with big data: a content analysis of stakeholder interviews. Health Research Policy and Systems, 18(76). https://doi.org/10.1186/s12961-020-00589-7

[6] Jurgenson, N. (2014). View From Nowhere. The New Inquiry. Accessed at: https://thenewinquiry.com/view-from-nowhere/

[7] Nissenbaum, H.F. (2011). A Contextual Approach to Privacy Online. Daedalus 140:4 (Fall 2011), 32-48. Accessed at: https://ssrn.com/abstract=2567042

[8] Ruberg, B. and Ruelos, S. (2020). Data for Queer Lives: How LGBTQ Gender and Sexuality Identities Challenge Norms of Demographics. Big Data & Society. https://doi.org/10.1177/2053951720933286

[9] Sarna-Wojcicki, D., Perret, M., Eitzel, M.V., and Fortmann, L. (2017). Where Are the Missing Coauthors? Authorship Practices in Participatory Research. Rural Sociology, 82(4):713-746.

[10] Pavetti, L. and Safawi, A. (2021). TANF Cash Assistance Helps Families, But Program Is Not the Success Some Claim. Center on Budget and Policy Priorities. Accessed at: https://www.cbpp.org/research/family-income-support/tanf-cash-assistance-helps-families-but-program-is-not-the-success

[11] Vaughn, L. M., and Jacquez, F. (2020). Participatory Research Methods – Choice Points in the Research Process. Journal of Participatory Research Methods, 1(1). https://doi.org/10.35844/001c.13244

 

Embedding Ethics in the Code we Write
By Allison Fox | February 18, 2022

In the last few years, several researchers and activists have pulled back the curtain on algorithmic bias, sharing glaring examples of how artificial intelligence (AI) models have the potential to discriminate based on age, sex, race, and other identities. A 2013 study conducted by Latanya Sweeney revealed that if you have a name that is more often given to black babies than white babies, you are 80% more likely to have an ad suggestive of an arrest displayed when a Google search for your name is performed (Sweeney 2013).

Joy Buolamwini, Coded Bias

Similar discrimination was presented in the Netflix documentary Coded Bias – MIT researcher Joy Buolamwini discovered that facial recognition technologies do not accurately classify women or detect darker-skinned faces (Coded Bias 2020). Another case of algorithmic bias surfaced recently when news articles revealed that an algorithmic tool used by the Justice Department to assess the risk of prisoners returning to crime generated inconsistent results based on race (Johnson 2022). As the use of AI decision-making continues to increase, and more decisions are made by algorithms instead of humans, these algorithmic biases are only going to be amplified. Data science practitioners can take steps to mitigate these biases and their impacts by embedding ethics in the code they write – both figuratively and literally.

To better integrate conversations about ethics into the actual process of doing data science, the company DrivenData developed Deon, a command line tool that provides developers with reminders about ethics throughout the entire lifecycle of their project (DrivenData 2020). 

Deon Checklist: Command Line Tool

The checklist is organized into five sections, designed to mirror the stages of a data science project: data collection, data storage, analysis, modeling, and deployment. Each section includes several questions that aim to provoke discussion and ensure that important steps are not overlooked. DrivenData also put together a table of real-world ethical issues with AI that might have been avoided had the corresponding checklist questions been discussed during the project. For example, during analysis it is important to examine the dataset for possible sources of bias, and then take steps to address those biases. If this step is skipped, unintended consequences can ensue: garbage in often results in garbage out, meaning that if you provide a model with biased data, the model is likely to produce outputs that reflect that bias. For example, female jobseekers are more likely than male jobseekers to be shown Google ads for lower-paying jobs (Gibbs 2015). This discriminatory behavior by Google’s model could be a result of biased data, and had steps been taken to address it, the discriminatory treatment potentially could have been avoided. By using Deon to embed ethics in the code we write, data scientists are reminded of these ethical risks while coding and can address biased data before a model is released into the wild, in turn mitigating potential unintended biases.

Ethics are also relevant during the modeling stage of a data science project, where it is important to test model results for fairness across groups. The Deon checklist includes an item on this step, and several open-source, code-based toolkits like AI Fairness 360 and Fairlearn have been developed recently to help data scientists assess and improve fairness in AI models (a small example follows below). If this step is ignored, models may treat people differently based on certain identities, such as when Apple’s credit card first launched and offered smaller lines of credit to women than to men (Knight 2019).
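
As a rough illustration of what that fairness check can look like in code, here is a minimal sketch using Fairlearn’s MetricFrame on toy labels and predictions (assuming the fairlearn and scikit-learn packages are installed; the groups and numbers are made up). It reports each metric per group and the largest between-group gap.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

# Toy labels, predictions, and a made-up sensitive attribute, just to show the API.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)

print(mf.by_group)      # metric values broken out per group
print(mf.difference())  # largest between-group gap for each metric
```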

As the use of AI to make decisions that were previously made by humans becomes even more widespread, classification decisions will be made faster and at a larger scale, reaching more people than ever before. While this will have its benefits, in that new technologies such as the ones discussed in this blog can improve quality of life and access to opportunity, it will also have its consequences. Minority populations who already face discrimination have been shown to be the most susceptible to these consequences. Open-source tools that embed ethical considerations in the data science process, like Deon, AI Fairness 360, and Fairlearn, can all help to combat these consequences by encouraging data scientists to place ethics at the forefront during each stage of a data science project.

References:

1. Coded Bias. (2020). About the Film. Coded Bias. https://www.codedbias.com/

2. DrivenData. (2020). About – Deon. Deon. https://deon.drivendata.org/ 

3. Gibbs, S. (2015, July 8). Women less likely to be shown ads for high-paid jobs on Google, study shows. The Guardian. https://www.theguardian.com/technology/2015/jul/08/women-less-likely-ads-high-paid-jobs-google-study 

4. Johnson, C. (January 6, 2022). Flaws plague a tool meant to help low-risk federal prisoners win early release. NPR. https://www.npr.org/2022/01/26/1075509175/justice-department-algorithm-first-step-act 

5. Knight, W. (2019, November 19). The Apple Card Didn’t “See” Gender—and That’s the Problem. Wired. https://www.wired.com/story/the-apple-card-didnt-see-genderand-thats-the-problem/ 

6. Sweeney, Latanya, Discrimination in Online Ad Delivery (January 28, 2013). Available at SSRN: http://dx.doi.org/10.2139/ssrn.2208240 

The Appeal and the Dangers of Digital ID for Refugees Surveillance
By Joshua Noble | October 29, 2021

Digitization of national identity is growing in popularity as governments across the world seek to modernize access to services and streamline their own data stores. Refugees, especially those coming from war-torn areas where they have had to flee at short notice with few belongings or those who have made dangerous or arduous journeys, often lack any form of ID. Governments are often unwilling to provide ID to the stateless since they often have not determined whether they will allow a displaced person to stay and in some cases the stateless person may not want to stay in that country. Many agencies are beginning to explore non-state Digital ID as a way of providing some identity to stateless persons, among them the UNHCR, Red Cross, International Rescue Committee, and the UN Migration Agency. For instance, a UNHCR press release states: “UNHCR is currently rolling out its Population Registration and Identity Management EcoSystem (PRIMES), which includes state of the art biometrics.”

The need for a way for a stateless person to identify themselves is made all the more urgent by approaches that governments have begun to take to identifying refugees. Governments are increasingly using migrants’ electronic devices as verification tools. This practice is made easier with the use of mobile extraction tools, which allow an individual to download key data from a smartphone, including contacts, call data, text messages, stored files, location information, and more. In 2018, the Austrian government approved a law forcing asylum seekers to hand over their phones so authorities could check their origin, with the aim of determining if their asylum request should be invalidated if they were found to have previously entered another EU country.

NGO-provided ID initiatives may convince governments to abandon or curtail these highly privacy-invasive strategies. But while the intention of these initiatives is often charitable, seeking to provide assistance to refugees, they share a challenge common to many attempts to uniquely identify people: access to services is often tied to the creation of the ID itself. For a person who is stateless, homeless, and in need of aid, arriving in a new country and shuttled to a camp, this can feel like coercion. There is an absence of informed consent on the part of refugees. The agencies creating these data subjects often fail to adequately educate them on what data is being collected and how it will be stored. Once data is collected, refugees face extensive bureaucratic challenges if they want to change or update that data. Agencies creating the data offer little in the way of transparency around how data is stored, used, and offered, and most importantly, with whom it might be shared both inside and outside of the organizations collecting the data.

Recently, as NGOs and aid agencies fled Afghanistan following the US military withdrawal, thousands of Afghans who had worked with those organizations began to worry that biometric databases and their own digital history might be used by the Taliban to track and target them. In another example of the risks of using biometric data, the UNHCR shared information on Rohingya refugees with the government of Bangladesh. The Bangladeshi government then sent that same data to Myanmar to verify people for possible repatriation. Both of these cases illustrate the real and present risk that creating and storing biometric data and IDs can pose.

While the need for ID and the benefits that it can provide are both valid concerns, the challenge of ad hoc and temporary institutions providing those IDs and collecting and storing data associated with them presents not only privacy risks to refugees but often real and present physical danger as well.

References:

UNHCR. 2018. “UNHCR Strategy on Digital Identity and Inclusion” [https://www.unhcr.org/blogs/wp-content/uploads/sites/48/2018/03/2018-02-Digital-Identity_02.pdf](https://www.unhcr.org/blogs/wp-content/uploads/sites/48/2018/03/2018-02-Digital-Identity_02.pdf)

IOM & APSCA. 2018. 5th border management and identity conference (BMIC) on technical cooperation and capacity building. Bangkok: BMIC. [http://cb4ibm.iom.int/bmic5/assets/documents/5BMIC-Information-Brochure.pdf](http://cb4ibm.iom.int/bmic5/assets/documents/5BMIC-Information-Brochure.pdf).

Red Cross 510. 2018 An Initiative of the Netherlands Red Cross Is Exploring the Use of Self Managed Identity in Humanitarian Aid with Tykn.Tech. [https://www.510.global/510-x-tykn-press-release/](https://www.510.global/510-x-tykn-press-release/)

UNHCR. 2018. Bridging the identity divide – is portable user-centric identity management the answer? [https://www.unhcr.org/blogs/bridging-identity-divide-portable-user-centric-identity-management-answer/](https://www.unhcr.org/blogs/bridging-identity-divide-portable-user-centric-identity-management-answer/)

Data&Society 2020, “Digital Identity in the Migration & Refugee Context” [https://datasociety.net/wp-content/uploads/2019/04/DataSociety_DigitalIdentity.pdf](https://datasociety.net/wp-content/uploads/2019/04/DataSociety_DigitalIdentity.pdf)

India’s National Health ID – Losing Privacy with Consent
By Anonymous | October 29, 2021

Source: Ayushman Bharat Digital Mission (ABDM)

“Every Indian will be given a Health ID,” Prime Minister Narendra Modi promised on India’s Independence Day this year, adding, “This Health ID will work like a health account for every Indian. Your every test, every disease – which doctor, which medicine you took, what diagnosis was there, when they were taken, what was their report – all this information will be included in your Health ID.”[1] The 14 digit Health ID will be linked to a health data consent manager – used to seek patient’s consent for connecting and sharing of health information across healthcare facilities (hospitals, laboratories, insurance companies, online pharmacies, telemedicine firms).

Source: Ayushman Bharat Digital Mission (ABDM)

Technology Is The Answer, But What Was The Question?
India’s leadership of the landmark resolution on digital health by the World Health Organization (WHO) has been recognized globally. With a growing population widening the gap between number of health‑care professionals and patients (0.7 doctors per 1000 patients[3]) and with increasing cost of health care, investing in technology to enable health‑care delivery seems to be the approach to leapfrog public health in India. And, the National Digital Health Mission (NDHM) is India’s first big step in improving India’s health care system and a move towards universal health coverage.

PM Modi says “This mission will play a big role in overcoming problems faced by the poor and middle class in accessing treatment”[4]. It aims to digitize medical treatment facilities by connecting millions of hospitals. The Health ID will be free of cost and completely voluntary. Citizens will be able to manage their records in a private, secure, and confidential environment. The analysis of population health data will lead to better planning, budgeting and implementation for states and health programmes, helping save costs and improve treatment. But, with all its good intentions, this hasty rush to do something may actually be disconnected from ground reality, and challenges abound.

Source: Ayushman Bharat Digital Mission (ABDM)

Consent May Not Be The Right Way To Handle Data Privacy Issues
Let’s start with ‘voluntary’ consent. The government might be playing a digital sleight of hand here. Earlier this month, the Supreme Court of India issued notices to the Government seeking removal of the requirement for a National ID (Aadhar) from the government’s CoWin app. The CoWin app is used to schedule COVID vaccine appointments. For registration, Aadhar is voluntary (you can use a Driver’s License), but the app makes Aadhar required to generate a certificate[5]. You must be wondering what the National ID has to do with the National Digital Health ID. During the launch of the National Digital Health ID, the government automatically created Health IDs for individuals who used the National ID to schedule a vaccine appointment. 122 million (approx. 98%) of the 124 million IDs generated have been for people registered on CoWin. Most recipients of the vaccine were not aware that their unique Health ID had been generated[6].

Then there is the issue of ‘forced’ consent. Each year, 63 million Indians are pushed into poverty due to healthcare costs[7], i.e. two citizens every second, and 50% of the population lives in poverty (3.1 USD per day). One of the stated benefits of the Health ID is that it will be used to determine distribution of benefits under the Government’s health welfare schemes. So if you are dependent on Government schemes, or looking to participate in them, you have to create a Health ID and link it with the National ID. As Amulya Nidhi of the non-profit People’s Health Movement puts it, “People’s vulnerability while seeking health services may be misused to get consent. Informed consent is a real issue when people are poor, illiterate or desperate”[8].

Source: Ayushman Bharat Digital Mission (ABDM)

Good Digital Data Privacy Is Hard To Get Right
Finally, there is the matter of ‘privacy regulation’. The NDHM depends on a Personal Data Protection Bill (PDP), which overhauls the outdated Information Technology Act 2000. After two years of deliberation the PDP is yet to be passed, and 124 million Health IDs have already been generated. Moreover, principles such as qualified consent and specific user rights have no legal precedence in India[9]. In its haste, the Government has moved forward without a robust legal framework to protect health data. And without a data protection law or an independent data protection authority, there are few safeguards and no recourse when rights are violated.

The lack of PDP could lead to misuse of data by private firms and bad actors. It may happen that an insurance agency chooses to grant coverage only to customers willing to link their Health IDs and share digitised records. Similarly, they may offer incentives to those who share medical history and financial details for customised insurance premium plans[10]. Or, they may even reject insurance applications and push up premium rates for those with pre-existing medical conditions. If insurance firms, hospitals etc. demand health IDs, it will become mandatory, even if not required by law.

The New Normal: It’s All Smoke and Mirrors
In closing, medical data will lead to better planning, cost optimization, and implementation for health programs. But without a robust legal framework, the regulatory gap poses implementation challenges for a National Digital Health ID. Moreover, the government has to rein in intimidatory data collection practices; otherwise people will have no choice but to consent in order to access essential resources to which they are entitled. Lastly, as the GDPR explains, consent must be freely given, specific, informed and an unambiguous indication of the data subject’s wishes. The Government of India needs to decouple initiatives and remove any smoke and mirrors, so people are clearly informed about what they are agreeing to in each case. In the absence of such efforts, there will be one added ‘new normal’ for India – losing privacy with consent.

References:
1. Mehrotra Karishma (2020). PM Announces Health ID for Every Indian. The Indian Express. Accessed on October 25, 2021 from: https://indianexpress.com/article/india/narendra-modi-health-id-coronavirus-independence-day-address-6556559/
2. Bertalan Mesko et al (2017). Digital Health is a Cultural Transformation of Traditional Healthcare. Mhealth. Accessed on October 25, 2021 from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5682364/
3. Anup Karan et al (2021). Size, composition and distribution of health workforce in India. Human Resources for Health. Accessed on October 25, 2021 from: https://human-resources-health.biomedcentral.com/articles/10.1186/s12960-021-00575-2
4. Kaunain Sheriff (2021). PM Modi launches Ayushman Bharat Digital Mission. The Indian Express. Accessed on October 25, 2021 from: https://indianexpress.com/article/india/narendra-modi-pradhan-mantri-ayushman-bharat-digital-health-mission-7536669/
5. Ashlin Mathew (2021). Modi government issuing national health ID stealthily without informed consent. National Herald. Accessed on October 25, 2021 from: https://www.nationalheraldindia.com/india/modi-government-issuing-national-health-id-stealthily-without-informed-consent
6. Regina Mihindukulasuriya (2021). By 2025, rural India will likely have more internet users than urban India. ThePrint. Accessed on October 25, 2021 from: https://theprint.in/tech/by-2025-rural-india-will-likely-have-more-internet-users-than-urban-india/671024/
7. Vidhi Doshi (2018). India is rolling out a health-care plan for half a billion people. But are there enough doctors? Washington Post. Accessed on October 25, 2021 from: https://www.washingtonpost.com/world/2018/08/14/india-is-rolling-out-healthcare-plan-half-billion-people-are-there-enough-doctors/
8. Rina Chandran (2020). Privacy concerns as India pushes digital health plan, ID. Reuters. Accessed on October 25, 2021 from: https://www.reuters.com/article/us-india-health-tech/privacy-concerns-as-india-pushes-digital-health-plan-id-idUSKCN26D00B
9. Shahana Chatterji et al (2021). Balancing privacy concerns under India’s Integrated Unique Health ID. The Hindu. Accessed on October 25, 2021 from: https://www.thehindubusinessline.com/opinion/balancing-privacy-concerns-under-indias-integrated-unique-health-id/article36760885.ece
10. Mithun MK (2021). How the Health ID may impact insurance for patients with pre-existing conditions. The News Minute. Accessed on October 25, 2021 from: https://www.thenewsminute.com/article/how-health-id-may-impact-insurance-patients-pre-existing-conditions-156306

Social Media Analytics for Security : Freedom of Speech vs Government Surveillance
By Nitin Pillai | October 29, 2021

Introduction

The U.S. Department of Homeland Security (DHS) U.S. Customs and Border Protection (CBP) takes steps to ensure the safety of its facilities and personnel from natural disasters, threats of violence, and other harmful events and activities. To aid these efforts, CBP personnel monitor publicly available social media to provide situational awareness and to monitor potential threats or dangers to CBP personnel and facility operators. CBP may collect publicly available information posted on social media sites to create reports and disseminate information related to personnel and facility safety. CBP conducted a Privacy Impact Assessment (PIA) because, as part of this initiative, CBP may incidentally collect, maintain, and disseminate personally identifiable information (PII) over the course of these activities.

Social Media Surveillance’s impact on Privacy

Social Media Surveillance

The Privacy Impact Assessment (PIA) states that CBP searches public social media posts to bolster the agency’s “situational awareness”—which includes identifying “natural disasters, threats of violence, and other harmful events and activities” that may threaten the safety of CBP personnel or facilities, including ports of entry. The PIA aims to inform the public of privacy and related free speech risks associated with CBP’s collection of personally identifiable information (PII) when monitoring social media. CBP claims it only collects PII associated with social media—including a person’s name, social media username, address or approximate location, and publicly available phone number, email address, or other contact information—when “there is an imminent threat of loss of life, serious bodily harm, or credible threats to facilities or systems.”

Chilling Effect on Free Speech
CBP’s social media surveillance poses a risk to the free expression rights of social media users. The PIA claims that CBP is only monitoring public social media posts, and thus individuals retain the right and ability to refrain from making information public or to remove previously posted information from their respective social media accounts. While social media users retain control of their privacy settings, CBP’s policy chills free speech by causing people to self-censor, including by not expressing their opinions publicly on the Internet, for fear that CBP could collect their PII for discussing a topic of interest to CBP. Additionally, people running anonymous social media accounts might be afraid that the PII collected could lead to their true identities being unmasked. This chilling effect is made worse by the fact that CBP does not notify users when their PII is collected. CBP also may share information with other law enforcement agencies, which could result in immigration consequences or being added to a government watchlist.

CBP’s Practices Don’t Mitigate Risks to Free Speech
The PIA claims that any negative impacts on free speech of social media surveillance are mitigated by both CBP policy and the Privacy Act’s prohibition on maintaining records of First Amendment activity. Yet, these supposed safeguards ultimately provide little protection.

Social Network Analysis

Collecting information in emergency situations and to ensure public safety undoubtedly are important, but CBP collects vast amounts of irrelevant information – far beyond what would be required for emergency awareness – by amassing all social media posts that include matches to designated keywords. Additionally, CBP agents may use “situational awareness” information for “link analysis,” that is, identifying possible associations among data points, people, groups, events, and investigations. While that kind of analysis could be useful for uncovering criminal networks, in the hands of an agency that categorizes protests and immigration advocacy as dangerous, it may be used to track activist groups and political protesters.

Conclusion

Some argue that society must “balance” freedom and safety, and that in order to better protect ourselves from those who would do us harm, we have to give up some of our liberties. This might be a false choice in many areas. Especially in the world of data analysis, liberty does not have to be sacrificed to enhance security.

Freedom of speech is a critical stitch in the fabric of democracy. The public needs to know more about how agencies are gathering our data, what they’re doing with it, any policies that govern this surveillance, and the tools agencies use, including algorithmic surveillance and machine learning techniques. A single Facebook post or tweet may be all it takes to place someone on a watchlist, with effects that can range from repeated, invasive screening at airports to detention and questioning in the United States or abroad.

Our government should be fostering, not undermining our ability to maintain obscurity in our online personas for multiple reasons, including individual privacy, security, and consumer protection.

References :

1. Privacy Impact Assessment for Publicly Available Social Media Monitoring and Situational Awareness Initiative – DHS/CBP/PIA-058
https://www.dhs.gov/sites/default/files/publications/privacy-pia-cbp58-socialmedia-march2019.pdf
2. CBP’s New Social Media Surveillance: A Threat to Free Speech and Privacy
3. We’re demanding the government come clean on surveillance of social media
https://www.aclu.org/blog/privacy-technology/internet-privacy/were-demanding-government-come-clean-surveillance-social

 

Time flies when ISPs are having fun
By Anonymous | October 29, 2021

More than four years have passed since the US Congress repealed FCC rules bringing essential privacy protections to ISP consumers. This is a matter affecting millions of Americans, and measures need to be taken so that consumers are not left at the mercy of big corporations while accessing the Internet.

**What Happened?**

In March 2017, as the country transitioned from Obama’s second term to newly elected President Trump, the US Congress, without much alarm, repealed regulation providing citizens with privacy protections when using ISP and broadband services. The main aim of the regulation was to inhibit ISPs’ appetite to freely collect, aggregate and sell consumer data, including web browsing history.

The repeal was a massive victory for ISPs such as Verizon, Comcast and AT&T, and a blow to consumers’ privacy rights. Not only was the “wild west” privacy status quo maintained, but the repeal also barred the FCC from issuing any substantially similar regulations (!) in the future.

The main argument for repealing this regulation was that the FTC has traditionally been the agency regulating corporate and business privacy affairs. It was also argued that by regulating ISPs, the FCC would put them at a disadvantage compared to FTC-regulated web services such as Google, Apple, Yahoo and the like. Never mind that the ISP business model is based on charging for access and bandwidth, not monetization via data brokerage or advertising services. And never mind that the FCC’s newly appointed chair, Ajit Pai, who recommended voting against his own agency’s rules, was a former lawyer for Verizon.[1]

So four years have passed, and the FTC has not issued, nor is it expected to issue, any robust regulatory framework on ISP privacy. Consumers are left in privacy limbo, and states are scrambling to pass related laws [2]. How bad is it, and what can be done?

**What can ISPs see?**

The Internet – a network of networks – is an open architecture of technologies and services, where information flows through its participant nodes in little virtual envelopes called “packets”. Every information-containing packet passing through any of the network’s edges (known as routers) can be inspected, exposing its source address, destination address and information content (known as the payload).

Since the ISP is your first node entering the Internet (also known as the default gateway), it presents a great opportunity to collect data about everything sent or received by households. This complete visibility is only mitigated by the use of encryption, which prevents any nodes (except the sender and receiver) from seeing packets’ contents. As long as encryption is being used (think of HTTPS, for example), the payload is not visible to ISPs.

The good news is that encryption is becoming more pervasive across all internet domains. As of early 2021, 90% of internet traffic is encrypted, and the trend is still upward.

But even with encryption present, ISPs can collect a lot of information. ISPs have to route your packets after all, so they know exactly with whom you are communicating, along with how many packets are being exchanged and their timestamps. ISPs can easily deduce when one is, for example, watching Netflix movies, despite the communication with Netflix being encrypted.

In addition to the transport of information packets per se, there is another avenue ISPs use to collect data: the Domain Name System (DNS). Every time one needs to reach a domain (say by visiting the URL [www.nyt.com](http://www.nyt.com)), the translation of that domain to routable IP addresses is visible to the ISP, either because it provides the DNS service (which is usually the default setting) or by examining DNS traffic (port 53). ISPs can easily collect important web browsing usage in this fashion.

Beyond what is known to be used by ISPs to collect usage data, some technologies could also be used. ISPs could use technics such as sophisticated traffic fingerprinting [3] and in extreme cases even deep packet inspection, or other some nefarious techniques such as Verizon’s infamous X-UIDH’s [4]. Fingerprinting is how for example, ISPs were supposed to detect movies being shared illegally via torrent streams, a failed imposition by the Record Industry Association of America (RIAA) [5]. While it is speculative that ISPs could be resorting to such technologies, it is important to notice that abuses by ISPs occurred in the past, so without specific regulations, the potential danger remains.

**So what can you do?**

Since our legislators failed to protect us, some do-it-yourself work is needed, and some of these actions require a good level of caution.

Opt-in consent was one of the most important FCC provisions repealed in 2017, so the opt-out action now falls on the consumer: check your ISP account’s privacy settings and opt out of any data collection, sharing, or ad-targeting programs it offers.

Another measure is to configure your home router (or each individual device) so it no longer uses the ISP as the DNS server, and to make DNS traffic encrypted (see the sketch below the screenshot). Here one needs to be careful selecting a DNS provider, otherwise you are exposed to the same privacy risks. Make sure you select a DNS service with a good privacy policy. For example, Cloudflare’s DNS (server “1.1.1.1”) privacy policy can be found here: https://developers.cloudflare.com/1.1.1.1/privacy/public-dns-resolver

Setting up private DNS on Android device. Credits: cloudflare
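
To make the idea of encrypted DNS concrete, here is a minimal sketch (in Python, using the requests library against Cloudflare’s public DNS-over-HTTPS JSON endpoint) of a lookup that travels inside an HTTPS connection: the ISP sees only a TLS session to Cloudflare, not which hostname is being resolved. On a router or phone you would enable the equivalent DoH/DoT setting rather than run code like this.

```python
import requests

# Resolve a hostname via Cloudflare's DNS-over-HTTPS JSON API instead of
# sending a cleartext query to the ISP's resolver on port 53.
resp = requests.get(
    "https://cloudflare-dns.com/dns-query",
    params={"name": "www.nyt.com", "type": "A"},
    headers={"Accept": "application/dns-json"},
    timeout=10,
)
resp.raise_for_status()

for answer in resp.json().get("Answer", []):
    print(answer["name"], "->", answer["data"])
```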

For a complete “cloak” of your traffic, making it virtually invisible to the ISP, one can use a VPN service. These services make internet traffic extremely difficult for your ISP to analyze. Except for volumetrics, the ISP will not have much information about your traffic. The drawback is that the VPN provider in turn can see all your traffic, just like the ISP. So one has to be EXTREMELY diligent selecting this type of service. Some of these providers are incorporated abroad in countries with lax regulations, with varying degrees of privacy assurance. For example, vendor NordVPN is incorporated and regulated in Panama, while ExpressVPN has its privacy practices independently audited by the renowned firm PwC.

Last but most importantly, contact your representative and voice your concern about the current state of ISP privacy. As things stand, the FCC has its hands tied by Congress, and the FTC has done very little to protect consumer privacy. As the mid-term elections approach, this is a good time to make your voice heard. Your representative, along with ways to contact them, can be found here: https://www.house.gov/representatives/find-your-representative

References:

[1] <https://www.reuters.com/article/us-usa-internet-trump-idUSKBN1752PR>

[2] <https://www.ncsl.org/research/telecommunications-and-information-technology/2019-privacy-legislation-related-to-internet-service-providers.aspx>

[3] <https://www.ndss-symposium.org/wp-content/uploads/2017/09/website-fingerprinting-internet-scale.pdf>

[4] https://www.eff.org/deeplinks/2014/11/verizon-x-uidh

[5] https://www.pcworld.com/article/516230/article-4652.html