Data Breaches

According to the United States Government, “A data breach is a security violation in which sensitive, protected or confidential data is copied, transmitted, viewed, stolen or used by an individual unauthorized to do so.”  The news has been filled with massive company data breaches involving customer and employee information.

Notification Laws: Every state in the U.S., with the exception of Alabama and South Dakota, has a data breach notification law in place.  The National Conference of State Legislators has a link to all the different state laws so you can see what your state requires.  Keeping track of all these laws could be very confusing, not including all the international laws for multinational corporations.  Currently, there is no federal law that covers general personal information data breaches. Both the Data Security and Breach Notification Act of 2015 and Personal Data Notification and Protection Act of 2017 have been introduced into the House of Representatives but that is as far as they got.  For health information specifically, there are two rules at the federal level that cover notification to those effected which are the Health Breach Notification Rule and the HIPAA Breach Notification Rule.

Data Ownership: Discussion stemming from these breaches has brought up the topic of data ownership. The personal information that companies have residing in their databases has long been thought of as their property.  This concept has been changing and evolving as our personal data has been proliferated into many databases with increasingly more personal information being collected and generated.  Users of these websites and companies understand that organizations need their information to provide services, whether that’s a personalized shopping experience or hailing a ride.  This point of ownership cannot be highlighted enough.  The acquiring of personal information gained in a data breach is not just an attack on the company but is an attack on all this users whose personal information was stolen and could be sold or used for illegal activities.

Timing: Customers of these companies want to know if their information has been compromised, so they can evaluate if accounts or other identity fraud situations have occurred. There are several milestones in the data breach timeline.  One is when the data breach actually occurred.  This may not be known if the company does not have a digital trail and infrastructure to discover when this happened.  This may be well before the next milestone of the company discovering a breach and assessing the extent of the breach.  The next milestone would be the corrective action taken by the effected company or agency to ensure the data is now being protected.  Currently, only eight states have a firm deadline for notification which is usually 30 to 90 days after discovery of the breach.

Encryption: California led the data breach notification law effort by passing, in 2002, a law requiring businesses and government agencies to notify California residents of data security breaches.  In the California law, there is an exception to notifying those effected if the personal information is encrypted. The law defines the term “encrypted” to mean “rendered unusable, unreadable, or indecipherable to an unauthorized person through a security technology or methodology generally accepted in the field of information security.”  These broad terms for encryption do not include a particular levels of encryption but tries to leave open the increasing level of encryption by whatever the industry standard is at that time.  Maybe if a breach occurs, a government or third party could evaluate the company’s encryption levels to determine if reporting is required.

The issue of data breaches is not going away. If Government agencies and companies do not respond in a fashion that customers find acceptable, users will start to become wary of sharing this valuable personal information and the insights that come with it will be lost.

The Blurry Line of Wearables

Imagine a world where your heart rate, sleep pattern, and other biometrics are being measured and monitored by your employer on a daily basis. They can decide when and how much you can work based on these metrics. They can decide how much to pay you or even fire you based on this information. This is the world that a lot of athletes now live in. Wearable biometric trackers are becoming the norm in sports across the world. The NBA has begun adopting it to aid recovery and monitor training regimens. MLB is using it to monitor pitchers’ arms. There are European sports teams that display this information on a jumbo-tron live in the arena.

Currently in the NBA, the collective bargaining agreement between the player’s union and the NBA state that “data collected from a Wearable worn at the request of a team may be used for player health and performance purposes and Team on-court tactical and strategic purposes only. The data may not be considered, used, discussed or referenced for any other purpose such as in negotiations regarding a future Player Contract or other Player Contract transaction.” The line between using biometric data for strategic or health purposes and using it for contract talks or roster decisions can become blurred quickly. What begins as a simple monitoring of heart rate in practice can end with a team cutting a player for being out of shape or not trying hard enough. A pitcher getting rested by a team because they saw that his arm is tiring out can lead to less money next time that pitcher negotiates for a contract since that pitcher played in less games. If a team sees that a player isn’t sleeping as much as they should, they can offer less money and say that it’s because the player is going out partying too much and they question his character. These are all possible ethical and legal violations of a player’s rights. Being able to monitor a player during a game is one thing. But being able to monitor their entire lives and controlling what they can or cannot do and tying that to their paycheck is another.

On top of these ethical and legal issues that wearables raise, there are also a lot of privacy risks. These players are in the public eye every day and leaks happen within organizations. The more sensitive medical information a team collects, the more risks there are that there are of HIPAA violations. Imagine if a player gets traded or cut for what seems like an unexplainable reason to the public. People might start questioning if there are medical issues with the player. Fans love to dissect every part of a player’s life and teams love to post as much information as possible online for their fans to look at. We’ve seen the risks of open public datasets and how even anonymized information can be de-anonymized and tied back to an individual. With the vast amount of information in the world about these players, this is definitely a risk.

All of these issues come as wearable technology become commonplace, but there are other risks on the horizon. We’re beginning to see embeddables, such as a pill to take or a device implemented under the skin, being developed and it will only be a matter a time before professional sports start taking advantage of it. These issues are not only faced by professional athletes. Think about your own job. Do you wear a badge at work? Can your employer track where you are at all times and analyze your performance based on that? What if in the future DNA testing is required? As more technology is developed and data is collected, we need to be aware of the possible issues that come with it and ask ourselves these questions.

Between Free Speech and the Right to Privacy

In a W231 group project that I recently worked on, we attempted to combine Transparent California, an open database of California’s employees’ salaries and pension data, with social media information to assemble detailed profiles of California’s employees. While we were able to do so on a individual employees, one by one, limitations posed by large social networks on scraping by third party tools prevented us from doing this at scale. This may soon change.

In a recent ruling, the Federal US District Court for the Northern District of California concluded that the giant social network cannot prevent HiQ Labs from accessing and using its data. The startup helps HR professionals fight attrition by scraping LinkedIn data and deploying machine learning algorithms to predict employees’ flight risk.


This was a fascinating case of the sometimes-inevitable clash between the public’s right to access to information, and the individuals’ right to privacy. On one hand – most would agree that liberating our data from tech giants such as LinkedIn, now owned by Microsoft, is a positive outcome; on the other – allowing universal access to it doesn’t come without risks.


Free speech advocates praise this ruling as potentially signaling a new direction in US courts’ approach to social networks’ control over data their users shared publically. This new approach was exemplified only a month earlier by another ruling by the US supreme court, where usage of social media was described as “speaking and listening in the modern public square”. If social media indeed is a modern public square, there should be little debate on whether information posted there can be used by anyone, for any reason.


There are, however, disadvantages to this increased access to publically posted data. The first of which, as my group project discussed, is that by combining multiple publically available data sets one can violate users’ privacy. And, practically adding the vast sea of information held by social networks to these open data sets, enables an ever-increased violation of privacy. It may be claimed that if users themselves post information online, companies who use it do nothing wrong. However, it is unclear whether users who post information to their public LinkedIn profiles intend for it to be scraped, analyzed and sold by other services. It is much more likely that most users expect this information to be viewed by other individual users.


Finally, as argued by LinkedIn, the right to privacy covers not only the data itself but also changes to it. When changing their profiles, social media users mostly do not wish to broadcast the change publically, but to display it to friends or connections who visit their profiles. LinkedIn even allows users to choose whether to make changes private, and post them only to the user’s profile, or public and let them appear on others’ news feed. Allowing third parties to scrape information revokes this right from users – algorithms such as HiQ’s scrape profiles, pick up changes, and sell them.


LinkedIn already appealed the court’s decision, and it will likely be a while before information on social media will be treated literally as posted on the public square. Courts will be required to choose, again, between the right to privacy and the right to access to information in this case. But regardless of what the decision will be, this is yet another warning sign reminding us, again, to be thoughtful of what information we share online – it will likely reach more eyes, and be used in other ways than we originally intended.

Algorithm is not a magic word

People may throw the word “algorithm” around to justify that their decisions are good and unbiased, and it sounds technical enough that you might trust them. Today I want to demystify this concept for you. Fair warning: it may be a bit disillusioning.

What is an algorithm? You’ll be happy to know that it’s not even that technical. An algorithm is just a set of predetermined steps that are followed to reach some final conclusion. Often we talk about algorithms in the context of computer programs, but you probably have several algorithms that you use in your daily life without even realizing it. You may have an algorithm that you follow for clearing out your email inbox, for unloading your dishwasher, or for making a cup of tea.

Potential Inbox Clearing Algorithm

  • Until inbox is empty:
    • Look at the first message:
      • If it’s spam, delete it.
      • If it’s a task you need to complete:
        • Add it to your to-do list.
        • File the email into a folder.
      • If it’s an event you will attend:
        • Add it to your calendar.
        • File the email into a folder.
      • If it’s personal correspondence you need to reply to:
        • Reply.
        • File the email into a folder.
      • For all other messages:
        • File the email into a folder.
    • Repeat.

Computer programs use algorithms too, and they work in a very similar way. Since the steps are predetermined and the computer is making all the decisions and spitting out the conclusion in the end, some people may think that algorithms are completely unbiased. But if you look more closely, you’ll notice that interesting algorithms can have pre-made decisions built into the steps. In this case, the computer isn’t making the decision at all, it’s just executing the decision that a person made. Consider this (overly simplified) algorithm that could decide when to approve credit for an individual requesting a loan:

Algorithm for Responding to Loan Request

  • Check requestor’s credit score.
  • If credit score is above 750, approve credit.
  • Otherwise, deny credit.

This may seem completely unbiased because you are trusting the computer to act on the data provided (credit score) and you are not allowing any other external factors such as age, race, gender, or sexual orientation to influence the decision. In reality though, a human being had to decide on the appropriate threshold for the credit score. Why did they choose 750? Why not 700, 650, or 642? They also had to choose to base their decision solely on credit score, but could there be other factors that the credit score subtly reflects, such as age or duration of time spent in America? (hint: yes.) With more complicated algorithms, there are many more decisions about what thresholds to use and what information is worth considering, which brings more potential for bias to creep in, even if it’s unintentional.

Algorithms are useful because they can help humans use additional data and resources to make more informed decisions in a shorter amount of time, but they’re not perfect or inherently fair. Algorithms are subject to the same biases and prejudices that humans are, simply from the fact that 1) a human designed the steps in an algorithm, and 2) the data that is fed into the algorithm is generated in the context of our human society, including all of its inherent biases.

These inherent biases built into algorithms can manifest in dark ways with considerable negative impacts to individuals. If you’re interested in some examples, take a look at this story about Somali markets in Seattle that were prohibited from accepting food stamps, or this story about how facial recognition software used in criminal investigations can lead to a disproportionate targeting of minority individuals.

In the future, when you see claims that an important decision was based on some algorithm, I hope you will hold the algorithm to the same standards that you would any other human decision-maker. We should continue to question the motivations behind the decision, the information that was considered, and the impact of the results.

For further reading:

Open Data Challenges and Opportunities

            The Open Data movement has won the day and governments around the world  – as well as scientific researchers, non-profits and even private companies – are embracing data sharing. Open data has huge benefits in encouraging replicability of scientific research, encouraging community and non-profit engagement with government data, providing accountability, and aiding businesses (especially new businesses which may not yet have detailed customer data). Open Data has already scored several key wins in areas from criminal justice reform to improved diagnostics. On the criminal justice side, open data lead to the discovery of the fact that “stop and frisk” policies in New York were both ineffective and heavily biased against minorities. This discovery was instrumental in the successful community effort to end this policy[1]. Considering health care, promising innovations in using deep learning to diagnose cancer are being assisted by publicly released data sets of labeled MRI and CAT scan images. These efforts demonstrate how Open Data can democratize the benefits of government data and publicly funded research and in this positive context, it is easy to see Open Data as a panacea. However,  several key concerns remain around privacy, accessibility, and ethics.

The most popular concern about Open Data revolves around the privacy concerns that arise from the public release of data – especially the public release of data that users might not have realized is private. Balancing privacy with providing the granularity of data needed for more sophisticated analysis is an ongoing concern, although increasingly a shared set of policies and practices are being developed around privacy protection[2]. But despite these advances in policy and practice, key privacy concerns remain both in general and in specific instances in which clear privacy harms have been caused.  For example, New York city released taxi trip data with license numbers hashed, which lead to two separate key privacy concerns[3]. First, data was easily de-anonymized by a  civic hacker, leading to privacy concerns for the taxi drivers. And second, several data scientists demonstrate how they identified one particular individual as a frequent customer of a gentleman’s club (which, clearly, is potentially very publically embarrassing) [4]. This case specifically demonstrates how very specific GPS data and a little bit of clever analytics can very easily de-anonymize particular users and expose them to substantial privacy risk. Differential privacy – seeking to provide accurate results on an aggregate level without allowing any individual to be identified – should be applied to avoid these situations but advances in record linkage make this even more difficult as researchers have to consider both the privacy risks in their own data but also how it could be combined with other data in privacy-damaging ways. And these high-profile failures demonstrate that there is still a progress to be made in privacy protection.

In addition to these concerns about privacy, there are several other aspects of open data that merit further consideration. One of these is around access to open data – with the increasing size and complexity of data, just providing access data itself may not be enough. Several companies’ business models are the processing and serving of open data[5], which illustrates that open data doesn’t mean easy to access data. If the public is funding data collection, some argue, it is not enough to provide the data but more effort needs to be put into making this data truly accessible to the public and smaller firms. Seconds, concerns remain around opacity of the ethics behind publicly accessible data. While the data itself – and often the code that produces results – is publicly accessible, the ethical decision making around study design, eligibility, and protections for subject often is not[6].  The lack of accessibility of Internal Review Board reports and ethical evaluations impedes the public’s ability to make informed judgements about public data sets.

Overall, Open Data has great potential to improve accountability and foster innovative solutions to social issues, but still requires work in order to balance privacy and ethics with openness.



[3] ibid