Archive for December 19th, 2017

If you’re anything like me, it may be tough to imagine a world without the personalized product advances that we’ve seen.

I rely on the recommendations they provide: a personalized shopping experience on Amazon, relevant material in my Facebook news feed, and highly pertinent search results from Google. These companies collect information about my product usage, combine it with relevant characteristics gathered about me, and run both through a set of learning algorithms to produce highly relevant output. It's this unique personalization that makes these products so "sticky", typically creating a mutually beneficial relationship between the company and its users.

However, certain types of offers and uses of machine learning create serious drawbacks for consumers – particularly targeted advertising and offerings for things like job postings and access to financial services, which create characteristic-based reinforcement that often disproportionately impacts disadvantaged groups. Furthermore, misplaced incentives entice companies to overextend data collection practices and invade privacy. I'll briefly walk you through the framework that allows this to happen, then show how certain groups can be harmed. While there are some proposed solutions, I hope you'll also see why it's so important to begin addressing these issues now.

Companies seek profit maximization by maximizing revenue and minimizing costs. This can produce real consumer benefit, whether through firms creating innovative products to attract consumers or through lower prices as competition enters new markets. Nonetheless, there can also be negative impacts, which are especially pronounced when companies use learning algorithms for targeted advertising and offers.

In this case, the impact on consumer privacy is treated as a negative externality: the associated cost is borne by society rather than by the company. Academic work has shown that consumers value privacy enough to assign a monetary value to it when making purchases. Yet more consumer data creates better personalized products and more accurate models, both important for targeted advertising and offers. In collecting as much data as possible, then, companies wholly ignore consumers' desire to protect their privacy.

Moreover, companies have extra leeway regarding consumer data collection due to monopolistic markets and consumer information disparities. Monopolistic markets occur when one company dominates a market, leaving consumers with high switching costs and low consumer choice. Many tech markets (e.g., Search – Google, Social Media – Facebook, Online shopping – Amazon) are dominated by one company, and while consumers can choose to stop using the products, quality switching choices are very limited. Furthermore, consumers may not know what data is collected or how it is used, creating a veritable “black box” where consumers may be vaguely aware of privacy intrusion, but without any specific examples, won’t discontinue product use.

Perhaps more worrisome are the reinforcement mechanisms created by some targeted advertising and offers. Google AdWords provides highly effective targeted advertising; however, it becomes problematic when these ads pick up on existing stereotypes. In fact, researchers at Carnegie Mellon and the International Computer Science Institute found that AdWords showed high-income job ads more often to otherwise-equivalent male candidates. Machine learning models are typically trained on existing "training" data, producing an outcome or recommendation based on gathered usage patterns, inferred traits, and so on. While this typically benefits consumers, re-training can create an undesirable reinforcement mechanism if the results of the prior model affect outcomes in the next round of training data. In simplified form, this effectively works as follows:

  1. The model shows fewer women ads for high-income jobs
  2. Fewer women take high-income jobs (the same bias that helped inform the original model)
  3. Re-trained on this data, the next model favors men even more heavily as ad targets than the prior model did
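As a rough illustration, the re-training loop above can be sketched as a toy simulation. Every number and the update rule here are hypothetical, chosen only to show the reinforcement dynamic, not to model any real ad system:

```python
def simulate_feedback_loop(rounds=5, initial_male_share=0.55):
    """Toy simulation of the re-training loop described above.

    `male_share` is a hypothetical fraction of high-income-job ad
    impressions shown to men; each round, the model is re-trained on
    outcomes produced by its own prior targeting.
    """
    male_share = initial_male_share
    history = [male_share]
    for _ in range(rounds):
        # Outcomes observed next round mirror who saw the ads, so the
        # skewed exposure becomes the next training signal.
        observed_male_outcomes = male_share
        # The re-trained model nudges targeting toward the group that
        # "performed better" in the biased training data.
        male_share = min(1.0, male_share + 0.5 * (observed_male_outcomes - 0.5))
        history.append(round(male_share, 3))
    return history

# A small initial skew compounds round after round.
print(simulate_feedback_loop())
```

Even starting from a mild 55/45 skew, the share drifts steadily upward: the model never receives evidence to correct itself, because it shaped the data it learns from.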

Nonetheless, targeted-ad firms are incentivized to prioritize ad clicks over fairness, and even for those looking to fix this specific issue, the interconnectedness of the data inputs and the complexity of learning algorithms may make it more difficult to solve than it seems.

Furthermore, other industries with high societal impact are increasingly using machine learning. In particular, the banking and financial services sector now typically uses machine learning in credit risk assessment and offer decisions, lowering costs by reducing the probability of consumer default. However, a troubling trend is showing up in these targeted models: rather than being based on behaviors, credit decisions are being made on characteristics. Traditional credit scoring models compared known credit behaviors of a consumer, such as past late payments, with data showing how people with the same behaviors performed in meeting credit obligations. Many banks now purchase predictive analytics products for eligibility determinations that base decisions on characteristics like store preferences and purchases (FTC Report Reference). While seemingly innocuous, characteristics can be highly correlated with protected traits like gender, age, race, or family status. By bringing characteristics into predictive models, financial services firms can often make their decision models more accurate as a whole. However, this often comes at the expense of people from disadvantaged groups, who may be rejected for an offer despite exhibiting exactly the same behavior as another applicant, setting the individual and the group further back. Depending on the circumstance, the companies leveraging characteristics most heavily may end up with the most accurate models, giving them an unethical "leg up" on the competition.
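A minimal sketch of the problem, with entirely invented applicants and scoring rules: two applicants with identical payment behavior receive different scores once a characteristic that proxies for a protected group enters the model.

```python
# Hypothetical applicants: identical payment behavior, but a "store
# preference" characteristic that proxies for a protected group.
applicants = [
    # (late_payments, shops_at_discount_store, group)
    (0, True,  "A"),
    (0, False, "B"),
]

def behavior_only_score(late_payments):
    # Traditional model: the decision is driven by behavior alone.
    return 700 - 100 * late_payments

def characteristic_score(late_payments, shops_at_discount_store):
    # Characteristic-based model: same behavior, different score.
    penalty = 50 if shops_at_discount_store else 0
    return 700 - 100 * late_payments - penalty

for late, store, group in applicants:
    print(group,
          behavior_only_score(late),
          characteristic_score(late, store))
# Groups A and B score identically on behavior alone,
# but diverge once the proxy characteristic is included.
```

The point of the sketch is that the characteristic-based model can look "more accurate" in aggregate while still penalizing individuals whose behavior is identical to those it approves.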

However, the complexity of data collection, transformation, and algorithms makes these issues very difficult to regulate, and it is also critical not to stifle innovation. As such, industry experts have recommended adapting existing industry regulation to account for machine learning, rather than creating a set of all-encompassing rules (FTC Report Reference), an approach that fits well within the contextual integrity of existing frameworks. For example, motor vehicle regulation should adapt to AI in self-driving cars (FTC Report Reference). Currently, statutes like the Fair Credit Reporting Act, the Equal Credit Opportunity Act, and Title VII prohibit overt discrimination against protected groups. Much of the statutory framework therefore already exists, but it may need to be strengthened to include characteristic-based exclusion language.

Moreover, the British Information Commissioner's Office recommends that project teams perform a Privacy Impact Assessment detailing noteworthy data collection, identifying data needs, describing information flows, and identifying privacy risks, all of which help reduce the scope of privacy intrusion. Similarly, I recommend that teams perform an Unintended Outcome Assessment before implementing a targeted advertising campaign or learning-based offer, gathering project managers and data scientists to brainstorm unintended discriminatory reinforcement consequences and propose mitigating procedures or model updates.

Recently, the Illinois Department of Children and Family Services (DCFS) decided to discontinue a predictive analytics tool meant to predict whether children were likely to die from abuse, neglect, or other physical threats within the next two years. Director BJ Walker reported that the program was both 1) failing to predict cases of actual child deaths, including the homicide of 17-month-old Semaj Crosby in April of this year, and 2) flagging far too many cases, with hundreds of children assigned a 100% probability of death. When it comes to matters like abuse, neglect, and the deaths of children, false positives (Type I errors) and false negatives (Type II errors) have enormous consequences, which makes productionizing and applying these tools with unstable error rates even more consequential.
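The two error types can be made concrete with a small confusion-matrix helper. The counts below are invented purely for illustration and bear no relation to the DCFS program's actual numbers:

```python
def error_rates(tp, fp, fn, tn):
    """Type I (false positive) and Type II (false negative) rates
    computed from confusion-matrix counts."""
    type_i = fp / (fp + tn)   # safe families incorrectly flagged
    type_ii = fn / (fn + tp)  # at-risk children the model missed
    return type_i, type_ii

# Invented counts for illustration only.
fpr, fnr = error_rates(tp=40, fp=960, fn=10, tn=9000)
print(f"Type I rate: {fpr:.1%}, Type II rate: {fnr:.1%}")
```

In most domains one error type is tolerated to reduce the other; here, both carry severe costs, which is what makes unstable rates so dangerous.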

The algorithm used by the DCFS is based on the Rapid Safety Feedback program, the brainchild of Will Jones and Eckerd Connects, a Florida-based non-profit dedicated to helping children and families. First applied in Hillsborough County in 2012 by then-Eckerd Youth Alternatives, the predictive software read in data and records about children's parents, family history, and other agency records. Factors going into the algorithm included whether there was a new boyfriend or girlfriend in the house, whether the child had previously been removed for sexual abuse, and whether the parent had previously been a victim of abuse or neglect. Using these and many other factors, the software assigned each child a score of 0 to 100 indicating how likely it was that a death would occur in the next two years. Caseworkers would be alerted to children with high risk scores and, with proper training and knowledge, could intervene with the family. In Hillsborough County the program seemed to be a success, with child deaths declining after its implementation. The program's author and director acknowledged that the decrease cannot be attributed entirely to the program, as other factors could be at play, but for the most part the county saw a decrease in child deaths, and that's a good outcome.

Since then, the program has gained attention from other states and agencies looking to improve child welfare, Illinois among them. In Illinois, however, more than 4,000 children were reported to have over a 90% probability of death or injury, and 369 children under the age of 9 were assigned a 100% probability. Caseworkers are trained not to act solely on these numbers, but the fact remains that these counts were clearly unreasonably high. This volume of positive matches brings the conversation to the impact of false positives on welfare and families. A false positive means a caseworker could intervene and potentially remove a child from the parents. If abuse or neglect were not actually happening and the algorithm was wrong, the mental and emotional impact on the family can be devastating: the intervention can unnecessarily tear a family apart physically, and it can traumatize the family emotionally. In addition, trust in child services and the government agencies involved would deteriorate rapidly.
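One reason so many flags can be false is the base-rate problem: when the predicted event is very rare, even a model with seemingly strong accuracy produces mostly false positives. A quick Bayes'-rule sketch, with assumed numbers chosen only to show the effect:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of flagged cases that are true positives (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Assumed for illustration: the tragic outcome occurs in 0.1% of cases,
# and the model catches 90% of true cases at 95% specificity.
ppv = positive_predictive_value(0.001, 0.90, 0.95)
print(f"{ppv:.1%} of flagged families are true positives")
```

Under these assumptions, fewer than 2% of flagged families are true positives, so a tool that looks impressive on paper can still generate thousands of wrenching false alarms.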

On top of the high false-positive rate, the program also failed to predict two high-profile child deaths this year: those of 17-month-old Semaj Crosby and 22-month-old Itachi Boyle. As Director Walker put it, the predictive algorithm wasn't predicting things. The gravity of these Type II errors hardly needs further discussion.

Beyond the dilemmas with the algorithm itself, DCFS caseworkers also complained about the harsh language of the alerts, such as: "Please note that the two youngest children, ages 1 year and 4 years have been assigned a 99% probability by the Eckerd Rapid Safety Feedback metrics of serious harm or death in the next two years." Eckerd acknowledged that the language could have been improved, which raises another discussion: communicating data science findings well. The model might spit out a 99% probability, but when we're dealing with such sensitive and emotional topics, the wording of the alerts matters. Even if the numbers were entirely accurate, figuring out how to apply such technology within the actual practice of child protective services is another problem altogether.

When it comes to bringing data science tools like predictive software into government agencies, how small should error rates be? How big is too big to implement? Is failing to predict one child death enough to render the software a failure, or is it better than having no software at all and missing more? Is accidentally devastating some families in the search for those who actually mistreat their children worth saving those in need? Do the financial costs of the software outweigh the benefit of some predictive assistance, and if not, how do you measure the cost of losing a child? Is having the software helpful at all? As data science and analytics become more widely applied to social services, these are the questions many agencies will be trying to answer. And as more agencies take proactive, predictive steps to better protect children and families, these are the questions data scientists should be tackling in order to better integrate these products into society.



Digital Data and Privacy Concerns

In a world where fast-paced digital change happens every second, new services are constantly being built that use data in different ways. At the same time, more and more people are becoming concerned about their privacy as ever more data about them is collected and analyzed. But can privacy regulations keep up with the pace of data collection and usage? Are the existing efforts put into privacy notices effective in communicating with users and forming genuine agreements between services and users?

An estimated 77% of websites now post a privacy policy. These policies differ greatly from site to site, and often address issues that are different from those that users care about. They are in most cases the users' only source of information about a site's data practices.

Policy Accessibility

According to a study at the Georgia Institute of Technology that examined online privacy notice formats, of the 64 sites offering a privacy policy, 55 (86%) offer a link to it from the bottom of their homepage. Three sites (5%) offered it as a link in a left-hand menu, while two (3%) offered it as a link at the top of the page. While regulations require that a privacy notice be given to users, there is no explicit regulation about how or where it should be communicated. From the study above, we can see that most sites tend not to emphasize the privacy notice before users start handing over data. Rather, lacking any incentive to make the notice accessible, they push it to the least-viewed section of the home page, since most users navigate away from the home page before reaching the bottom.

In my final project, I conducted a survey collecting accessibility feedback directly from users who interact with new services regularly. The data supports the above observations from another perspective: notices are designed with much lower priority than the other content presented to users, and as a result only a very small percentage of users actually reads them.

Policy Readability

Another gap in privacy regulation is the readability of the notices themselves. In the course of W231, one of our assignments was to read various privacy notices, and the discussion revealed very different structures and approaches from site to site. A general pattern is the use of strong, intimidating language with heavy legal backing, which keeps the notice compliant with regulations to a large extent but makes its content difficult for the general population to understand.
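This readability gap can be quantified. As a rough sketch, the standard Flesch Reading Ease formula (implemented here with a crude vowel-group syllable heuristic, so the scores are ballpark only) shows how plain wording and legalese diverge:

```python
import re

def flesch_reading_ease(text):
    """Approximate Flesch Reading Ease score; higher = easier to read.
    Uses a crude vowel-group syllable heuristic, so treat the result
    as a rough estimate."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w)))
                    for w in words)
    n = len(words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

plain = "We collect your email. We use it to log you in."
legalese = ("Notwithstanding the foregoing, the aforementioned "
            "information may be utilized in accordance with "
            "applicable regulatory obligations.")
print(flesch_reading_ease(plain), flesch_reading_ease(legalese))
```

The plain phrasing scores far higher (easier) than the legalese, even though both could describe the same practice, which is exactly the gap a readability requirement could target.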

Policy Content

From the examinations of different privacy notices during W231, it is clear that even among service providers who intend to comply with privacy regulations, vague language and missing details are common. One frequently seen pattern is the use of language like "may or may not" and "could". Combined with the accessibility issues above, and given users' different mental states at different stages of using a service, few users actually seek clarification from providers before they effectively sign the privacy agreement. The lack of standards or controls over privacy policy content puts users at a disadvantage when they later encounter privacy issues, because the content was already agreed to.

To summarize, existing regulation of online privacy agreements is largely at the stage of getting from "zero" to "one", an important step as the digital data world evolves. However, considerable improvement is still needed to close the gap between existing policies and an ideal situation in which service providers are incentivized to make policy agreements accessible, readable, and reliable.


Privacy Policy Regulation

December 19th, 2017

With rapid advances in data science, there is an ever-increasing need to better regulate data and people's privacy. In the United States, existing guidelines such as the FTC guidelines and the California Online Privacy Protection Act are not sufficient to address all privacy concerns. We should draw inspiration from the European Union's General Data Protection Regulation (GDPR). Some have criticized the GDPR for potentially impeding innovation. I disagree for two reasons: 1) there is plenty of regulation in traditional industries in the US, and organizations still manage to innovate, so why should data-driven organizations be treated any differently? 2) The majority of data-driven organizations have been very good at innovating and coming up with new ideas; if they have to innovate with more heavily regulated data, I believe they will figure out how to do it.

From the standpoint of data and privacy, I believe we need more regulation in the following areas:

  • Data Security – We have seen a number of cases where user information was compromised and organizations were not held accountable. They get away with a fine that pales in comparison to the organization's finances.
  • Data Accessibility – Any data collected on a user should be made available to the user. The procedure to obtain the data should be simple and easy to execute.
  • Data Removal – Users should have the choice to remove any data they wish to be removed.
  • Data Sharing – There should be greater regulation in how organizations share data with third parties. The organization sharing the data should be held accountable in case of any complications that arise.
  • A/B Testing – Today, there is no regulation on A/B testing. Users need to be educated about A/B testing, and there should be regulation of A/B testing with respect to content. Users must consent before being included in content-related A/B tests, and they should be compensated fairly for their inputs. Today, organizations compensate users for completing a survey; why shouldn't users be compensated for being part of an A/B experiment?

The privacy policy of every organization should include a Privacy Policy Rubric. The rubric would indicate to users the organization's compliance with policy regulations, and could also be used to hold an organization accountable for any violation.
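As one possible sketch of such a rubric (the criteria names and pass/fail scoring are my own invention, mirroring the regulatory areas listed above):

```python
# Hypothetical rubric: each criterion is scored pass/fail and
# summarized for display alongside the privacy policy.
RUBRIC_CRITERIA = [
    "security",      # breaches disclosed, data protected
    "accessibility", # users can obtain their data easily
    "removal",       # users can delete their data
    "sharing",       # third-party sharing disclosed and limited
    "ab_testing",    # experiments run only with consent
]

def score_policy(compliance):
    """Return (criteria met, total criteria) for a
    {criterion: bool} compliance map."""
    met = sum(1 for c in RUBRIC_CRITERIA if compliance.get(c, False))
    return met, len(RUBRIC_CRITERIA)

example = {"security": True, "accessibility": True, "removal": False,
           "sharing": True, "ab_testing": False}
print(score_policy(example))  # → (3, 5)
```

A simple "3 of 5" summary like this gives users an at-a-glance compliance signal and gives regulators a concrete checklist to audit against.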


Lastly, there need to be stricter fines for any breach of regulation. The GDPR sets a maximum penalty of 4% of total global revenue, with penalties befitting the nature of the violation. An organization's top-level management should be held accountable by the board for failing to meet the regulations.