Child Deaths: A Case of Immensely Consequential Type I & II Errors

Recently, the Illinois Department of Children and Family Services (DCFS) decided to stop using a predictive analytics tool that estimated whether children were likely to die from abuse, neglect, or other physical threats within the next two years. Director BJ Walker reported that the program was both 1) failing to predict actual child deaths, including the homicide of 17-month-old Semaj Crosby in April of this year, and 2) flagging far too many cases with a greater than 90% probability of death. When it comes to matters like the abuse, neglect, and deaths of children, false positives (Type I errors) and false negatives (Type II errors) have enormous consequences, which makes putting tools with unstable error rates into production all the more risky.

The algorithm used by the DCFS is based on a program called the Rapid Safety Feedback program, the brainchild of Will Jones and Eckerd Connects, a Florida-based non-profit dedicated to helping children and families. First applied in Hillsborough County back in 2012 by then-Eckerd Youth Alternatives, the predictive software read in data and records about children’s parents, family history, and other agency records. Some factors going into the algorithm include whether there was a new boyfriend or girlfriend in the house, whether the child had been previously removed for sexual abuse, and whether the parent had also been a victim of abuse or neglect previously. Using these and many other factors, the software would assign each child a score from 0 to 100 indicating how likely it was that a death would occur in the next two years. Caseworkers would be alerted to children with high risk scores and, with proper training and knowledge, intervene in the family. In Hillsborough County, the program appeared to succeed: child deaths fell after its implementation. The author and director of the program acknowledged that the decrease cannot be attributed entirely to the program, as other factors could be at work, but the County did see a decrease in child deaths, and that’s a good outcome.

Since then, the program has gained attention from states and agencies looking to improve child welfare. One such state was Illinois. There, however, the program reported that more than 4,000 children had a greater than 90% probability of death or injury, and that 369 children under the age of 9 had a 100% probability. Caseworkers are trained not to act solely on these numbers, but the count of flagged children was clearly unreasonably high. So many positive matches raise the question of how false positives affect child welfare and families. A false positive means that a caseworker could intervene in a family and potentially remove the child from the parents. If abuse or neglect were not actually happening and the algorithm was wrong, the mental and emotional impact on the family could be devastating. Not only could the intervention unnecessarily tear the family apart physically, it could also traumatize its members emotionally. In addition, trust in child services and the government agencies involved would deteriorate rapidly.

On top of the many false positives, the program also failed to predict two high-profile child deaths this year, those of 17-month-old Semaj Crosby and 22-month-old Itachi Boyle. As Director Walker said, the predictive algorithm wasn’t predicting things. The impact of Type II errors in a case like this hardly needs to be discussed in detail.
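To make the two error types concrete, here is a minimal Python sketch that computes the false positive rate, the false negative rate, and the share of alerts that are correct. The counts are hypothetical and chosen only for illustration; they are not DCFS or Eckerd figures.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate, precision)."""
    fpr = fp / (fp + tn)        # Type I: share of safe children wrongly flagged
    fnr = fn / (fn + tp)        # Type II: share of at-risk children missed
    precision = tp / (tp + fp)  # share of alerts that were actually correct
    return fpr, fnr, precision

# Hypothetical counts: 100,000 children, 50 of them truly at risk
fpr, fnr, precision = error_rates(tp=30, fp=4000, tn=95950, fn=20)
print(f"FPR {fpr:.3f}, FNR {fnr:.3f}, precision {precision:.4f}")
```

With these made-up numbers, a false positive rate of about 4% still means that fewer than 1 in 100 alerts points to a child who is truly at risk, because the outcome being predicted is so rare.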

On top of the dilemmas with the algorithm itself, DCFS caseworkers also complained about the harsh wording of the alerts, such as “Please note that the two youngest children, ages 1 year and 4 years have been assigned a 99% probability by the Eckerd Rapid Safety Feedback metrics of serious harm or death in the next two years.” Eckerd acknowledged that the language could have been improved, which raises another topic of discussion: communicating the findings of data science well. The numbers might spit out a 99% probability, but when we’re dealing with such sensitive and emotional topics, the language of the alerts matters. Even if the numbers were entirely accurate, figuring out how to apply such technology in the actual practice of child protective services is another problem altogether.

When it comes to bringing data science tools like predictive software into government agencies, how small should these error rates be? How big is too big to actually implement? Is failing to predict one child death enough to render the software a failure, or is it better than having no software at all and missing more? Is accidentally devastating some families in the search for those who are actually mistreating their children worth it to save those in need? Do the financial costs of the software outweigh the benefit of some predictive assistance, and if not, how do you measure the cost of losing a child? Is having the software helpful at all? As data science and analytics become more widely applied to social services, these are the questions many agencies will be trying to answer. And as more agencies look towards taking proactive and predictive steps to better protect their children and families, these are the questions that data scientists should be tackling in order to better integrate these products into society.

 

References:

www.chicagotribune.com/news/watchdog/ct-dcfs-eckerd-met-20171206-story.html

www.chronicleofsocialchange.org/analysis/managing-flow-predictive-analytics-child-welfare

www.chronicleofsocialchange.org/featured/who-will-seize-the-child-abuse-prediction-market

The Effectiveness of Privacy Regulations

Digital Data and Privacy Concerns

In a world where fast-paced digital changes happen every second, new forms of services are being built that use data in different ways. At the same time, more and more people are becoming concerned about their data privacy as more and more data about them is collected and analyzed. However, are privacy regulations able to keep up with the pace of data collection and usage? Are the existing efforts put into privacy notices effective in communicating with users and in forming agreements between services and their users?

An estimated 77% of websites now post a privacy policy.
These policies differ greatly from site to site, and often
address issues that are different from those that users care
about. They are in most cases the users’ only source of
information.

Policy Accessibility

According to a study at the Georgia Institute of Technology that examined the format of online privacy notices, out of the 64 sites offering a privacy policy, 55 (86%) offered a link to it from the bottom of their homepage. Three sites (5%) offered it as a link in a left-hand menu, while two (3%) offered it as a link at the top of the page. While there are regulations requiring that the privacy notice be given to users, there is no explicit regulation about how or where it should be communicated. The study shows that most sites tend not to emphasize the privacy notice before users start using the site. Rather, because sites lack incentives to make the notice accessible, it is pushed to the least viewed section of the homepage, since most users navigate away from the homepage before reaching the bottom.

In my final project, I conducted a survey to collect accessibility feedback directly from users who interact with new services regularly. The data supports the above observations from another perspective: notices are designed with much lower priority than other content presented to users, which leads to a very small percentage of users actually reading them.

Policy Readability

Another gap in the privacy regulations is the readability of the notice. In one W231 assignment, we read various privacy notices, and the discussion revealed very different structures and approaches from site to site. In particular, a general pattern is the frequent use of strong, intimidating language with heavy legal backing, which makes the notice hard for the general population to understand while still abiding by the regulations to a large extent.

Policy Content

Based on our examination of different privacy notices in W231, it is clear that even among service providers who intend to abide by privacy regulations, vague language and missing information are common. One frequently seen pattern is the use of wording like ‘may or may not’ and ‘could’. Combined with the accessibility issue, and with users’ different mental states at different stages of using a service, few users actually seek clarification from service providers before they virtually sign the privacy agreement. The lack of standards or controls over privacy policy content puts users at a disadvantage when they encounter privacy-related issues, as the content was already agreed on.

To summarize, the existing regulations on online privacy agreements are largely at the stage of getting from ‘zero’ to ‘one’, which is an important step as the digital data world evolves. However, considerable improvement is still needed to close the gap between the existing policies and an ideal situation in which service providers are incentivized to make the policy agreement accessible, readable, and reliable.

 

Privacy Policy Regulation

With rapid advances in data science, there is an ever-increasing need to better regulate data and people’s privacy. In the United States, existing guidelines such as the FTC Guidelines and the California Online Privacy Protection Act are not sufficient to address all privacy concerns. We need to draw some inspiration from the European Union’s General Data Protection Regulation (GDPR). Some have criticized the GDPR for potentially impeding innovation. I don’t agree with this for two reasons: 1) There is a lot of regulation in traditional industries in the US, and organizations still manage to innovate. Why should data-driven organizations be treated any differently? 2) I feel that the majority of data-driven organizations have been really good at innovating and coming up with new ideas. If they have to innovate with more regulated data, I believe they will figure out how to do it.

From the standpoint of data and privacy, I believe we need more regulation in the following areas

  • Data Security – We have seen a number of cases where user information is compromised and organizations have not been held accountable for the same. They get away with a fine which pales in comparison to the organizations’ finances.
  • Data Accessibility – Any data collected on a user should be made available to the user. The procedure to obtain the data should be simple and easy to execute.
  • Data Rescission – Users should have the choice to remove any data they wish to be removed.
  • Data Sharing – There should be greater regulation in how organizations share data with third parties. The organization sharing the data should be held accountable in case of any complications that arise.
  • A/B Testing – Today, there is no regulation on A/B testing. Users need to be educated about A/B testing, and there should be regulation of A/B testing with respect to content. Users must give consent before any content-related A/B testing is performed, and they should be compensated fairly for their input. Today, organizations compensate users for completing a survey. Why shouldn’t users be compensated for being part of an A/B testing experiment?

The privacy policy of every organization needs to include a Privacy Policy Rubric as shown below. The rubric would inform a user about the organization’s compliance with the policy regulations. It can also be used to hold an organization accountable for any violation of regulations.

 

Lastly, there need to be stricter fines for any breach of regulation. The GDPR sets a maximum penalty of 4% of total global revenue, with penalties befitting the nature of the violation. The top-level management of an organization needs to be held accountable by the organization’s board for failing to meet the regulations.

Artificial Intelligence: The Doctor is In

When I hear that AI will be replacing doctors in the near future, images of Westworld cybernetics come to mind, with robots toting stethoscopes instead of rifles. The debate of the role of AI in medicine is raging, and with good reason. To understand the perspectives, you just have to ask these questions:

• What will AI be used for in medicine?
• If for diagnosis, does AI have the capability of understanding physiology in order to make a diagnosis?
• Will AI ever harm the patient?

To the first point, AI can be a significant player in areas such as gauging adverse events and outcomes for clinical trials and processing genomic data or immunological patterns. Image recognition in pathology and radiology is a flourishing field for AI, and there have even been gasp white papers proving so. The dangers start emerging when AI is used for new diagnoses or predictive analytics for treatment and patient outcomes. How a doctor navigates through the history and symptoms of a new patient to formulate a diagnosis is akin to the manner in which supervised learning occurs. We see a new patient, hear their history, do an exam, and come up with an idea of diagnosis. While that is going on, we have already wired into our brains, let’s say, a convolutional neural network. That CNN has already been created by medical school/residency/fellowship training with ongoing feature engineering every time we see a patient, read an article, or go to a medical conference. Wonderful. We have our own weights for each point found in the patient visit and voila! A differential diagnosis. Isn’t that how AI works?

Probably not. There is a gaping disconnect between the scenario described above and what actually goes on in a doctor’s mind. The problem is that machine learning can only learn from the data that is fed into it, probably through an electronic health record (EHR), a database also created by human users, with inherent bias. The CNN lacks the connection to medical knowledge and physiology that physicians have. If this is too abstract, consider this scenario: a new patient comes into your clinic with a referral for evaluation of chronic cough. Your clinic is located in the southwest US. Based on the patient’s history and symptoms, coupled with your knowledge of medicine, you diagnose her with coccidioidomycosis (Valley fever), a fungal infection endemic to the region. However, your CNN is based on EHR data from the northeast coast, which has almost no cases of coccidioidomycosis. Instead, the CNN diagnoses the patient with asthma, a prevalent issue across the US and a disease which has a completely different treatment.
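The regional mismatch in this scenario can be sketched with Bayes’ rule. The snippet below is a toy illustration, not a clinical model: the disease labels, prevalence figures, and symptom likelihoods are all hypothetical numbers picked to show how a prior learned in one region can flip the most probable diagnosis in another.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(disease | symptom) for each candidate disease."""
    joint = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(joint.values())
    return {d: p / total for d, p in joint.items()}

# Hypothetical P(chronic cough | disease), assumed identical in both regions
likelihoods = {"asthma": 0.30, "regional_infection": 0.60}

# Hypothetical prevalence among referred patients in each region
northeast_priors = {"asthma": 0.999, "regional_infection": 0.001}
southwest_priors = {"asthma": 0.60, "regional_infection": 0.40}

for region, priors in [("northeast", northeast_priors),
                       ("southwest", southwest_priors)]:
    post = posterior(priors, likelihoods)
    # Report the most probable diagnosis under that region's prior
    print(region, max(post, key=post.get))
```

Under these assumed numbers, the same symptom leads to asthma with the northeast prior and to the regional infection with the southwest prior, which is the kind of error a model trained only on northeast records would make.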

AI could harm the patient. After all, we do not have the luxury of missing one case like when we screen emails for spam. Testing models and reengineering features will come with risks that everyone – the medical staff and the patient – must understand and accept. But before we jump to conclusions of Dr. Robot, we must have much more discussion on the ethics as we improve healthcare with AI.

Understanding the Basics of the GDPR

On May 25, 2018, enforcement of the General Data Protection Regulation (GDPR) will begin in the European Union. The Regulation unifies data protections for all individuals within the European Union; however, in some cases, it also hinders the usage of such data. By no means a comprehensive analysis, this post will help get you up to speed on the GDPR, how it impacts business, and what analysts can do to still get valid results from data.

Very Brief History

On January 25, 2012, The European Commission proposed a comprehensive reform of the 1995 data protection rules to “strengthen online privacy rights and boost Europe’s digital economy.”  It was estimated that implementing a single law could bypass “the current fragmentation and costly administrative burdens, leading to savings for businesses of around €2.3 billion a year.”  On April 14, 2016, the Regulation was officially adopted by the European Parliament and is scheduled to be put into force on May 25, 2018.  Now that we know how we got here, let’s answer some basic questions:

Why does Europe need these new rules?

In 1995, when the prior regulations were written, there were only 16 million Internet users in the world. By June 2017, that number had increased to almost 4 billion users worldwide, and more than 433 million of the European Union’s 506 million inhabitants were online. The increased use ushered in new technology, search capabilities, data collection practices, and legal complexity. Individuals lacked control over their personal data, and businesses were required to develop complex compliance plans to deal with the varying implementations of the 1995 rules throughout Europe. The GDPR fixes these issues by applying the same law consistently throughout the European Union, and it will allow companies to interact with just one data protection authority. The rules are simpler, clearer, and provide increased protections to citizens.

What do we even mean by “personal data?”

Simply put, personal data is any information relating to an identified or identifiable natural person.  According to The Regulation’s intent, it “can be anything from a name, a photo, an email address, bank details, your posts on social networking websites, your medical information, or your computer’s IP address.”

Isn’t there also something called “Sensitive personal data?”

Yes.  Sensitive personal data is “personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation.” Under the GDPR, the processing of this is prohibited, unless it meets an exception.

What are those exceptions?

Without getting into the weeds of the rule, the exceptions lay out cases where it is necessary and beneficial to take sensitive personal data into consideration. These include legal proceedings, substantial public interests, medical purposes, protecting against cross-border threats, and scientific research.

With all this data being protected, can I still use Facebook?

Yes! The new rules just change how data controllers collect and use your information. Rather than users having to prove that the collection of information is unnecessary, businesses must prove that the collection and storage of your data is necessary for the business. Further, companies must take into account “data protections by default,” meaning those pesky default settings that you have to set on Facebook to keep people from seeing your pictures will already be set to the most restrictive setting. Finally, the GDPR includes a right to be forgotten, so you can make organizations remove your personal data if there is no legitimate reason for its continued possession.

How can data scientists continue to provide personalized results under these new rules?

This is a tricky question, but some other really smart people have been working on this problem and the results are promising! By aggregating data and applying pseudonymization, data gurus have continued to achieve great results! For a good jumping-off point on this topic, head over here!
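As a rough illustration of what pseudonymization plus aggregation can look like, here is a minimal Python sketch. It swaps each email address for a keyed hash (HMAC-SHA256) and then aggregates by that pseudonym. The records, the field names, and the key are hypothetical, and this is one common approach rather than a method the GDPR prescribes.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored apart from the data
SECRET_KEY = b"replace-with-a-secret-kept-separately"

def pseudonymize(user_id: str) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical purchase records keyed by email address
records = [
    {"email": "ann@example.com", "purchases": 3},
    {"email": "bob@example.com", "purchases": 5},
    {"email": "ann@example.com", "purchases": 2},
]

# Aggregate by pseudonym so the analysis never handles the raw email
totals = {}
for record in records:
    key = pseudonymize(record["email"])
    totals[key] = totals.get(key, 0) + record["purchases"]

print(sorted(totals.values()))
```

Because the same input always maps to the same pseudonym, per-user totals are still computable. Note that anyone holding the key can re-link the pseudonyms, so pseudonymized data is generally still treated as personal data.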

Online monitoring and “Do Not Track”

If you’ve ever noticed that online ads are targeted to your tastes and interests or that websites remember preferences from visit to visit, the reason is online tracking methods, such as cookies. In recent years, we, as consumers, have become increasingly aware of organizations’ ability to track our movement online. These organizations are able to track movement across their own website as well as sister websites; for example, LinkedIn is able to track movement across LinkedIn, Lynda.com, SlideShare, etc. Organizations track movement by using IP addresses, accounts, and much more to identify us and connect our online movements across sessions. From there, advertisers and third parties are able to purchase information about our movements to target ads.

While online tracking is now ubiquitous in 2017, efforts to curtail an organization’s ability to monitor online movement started almost 10 years earlier. In 2007, several consumer advocacy groups advocated to the FTC for an online “Do Not Track” list for advertising.  Then, in 2009, researchers created a prototype for an add-on for Mozilla Firefox that implemented a “Do Not Track” header.  One year later, the FTC Chairman told the Senate Commerce Committee that the commission was exploring the idea of proposing an online “Do Not Track” list.  And towards the end of 2010, the FTC issued a privacy report that called for a “Do Not Track” system that would enable users to avoid the monitoring of their online actions.

As a result of the FTC’s announcement, most internet browsers provided a “Do Not Track” header similar to the one created in 2009. These “Do Not Track” headers work by sending websites a signal that the user does not want their movements to be tracked, which the website can then choose to honor or not. While most of these browsers made the “Do Not Track” header opt-in, Internet Explorer 10 originally enabled the “Do Not Track” option by default. Microsoft faced blowback over the default setting from advertising companies, who thought that users should have to choose to use the “Do Not Track” header. Eventually, in 2015, Microsoft changed the “Do Not Track” option to be opt-in rather than a default.
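The signal itself is just an HTTP request header, `DNT: 1`. The sketch below shows, in Python, how a site could check for it. The `handle_request` function and its return strings are hypothetical stand-ins for whatever a real site would do, since honoring the header is entirely the site’s choice.

```python
def wants_no_tracking(headers: dict) -> bool:
    """Return True if the request carries the header `DNT: 1`."""
    # HTTP header names are case-insensitive, so normalize before lookup
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") == "1"

def handle_request(headers: dict) -> str:
    # Hypothetical handler: nothing forces a site to take this branch
    if wants_no_tracking(headers):
        return "serve page without tracking cookies"
    return "serve page with tracking cookies"

print(handle_request({"DNT": "1"}))
print(handle_request({"User-Agent": "ExampleBrowser"}))
```

The first call takes the no-tracking branch and the second does not, because a request without the header expresses no preference.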

Despite these browsers implementing “Do Not Track” solutions for users, there has been no legally binding standard for what organizations should do when they receive the signal. As a result, the majority of online organizations do not honor the “Do Not Track” signal. This was further enabled in 2015, when the FCC dismissed a petition that would have required some of the larger online companies (e.g., Facebook, Google, Netflix) to honor the “Do Not Track” signals from consumers’ browsers. In the response, the FCC stated that “the Commission has been unequivocal in declaring that it has no intent to regulate edge providers.” This response indicated that the FCC has no intention of enforcing the signal, which leaves organizations free to ignore it. Due to the lack of enforcement, today, the “Do Not Track” signal amounts to almost nothing for the majority of the web. However, there are a few organizations that have decided to honor the “Do Not Track” signal, including Pinterest and Reddit.

In general, while there is a “Do Not Track” option available for internet browsers, it does not do much at all. Instead, users who want to protect their online privacy and prevent tracking must consider other options, such as setting their browser to reject third-party cookies or using browser extensions that limit tracking.

Social Media’s Problem with the Truth

Based on your political views, your Facebook/Google/Twitter news feed probably looks quite a bit different from mine. This is because of an effect known as a “filter bubble”, in which news that comports with your world view is highlighted and news that conflicts with your world view is filtered out, painting a very lopsided picture in your news feed. A filter bubble results from the confluence of two phenomena. The first is known as an “echo chamber,” in which we seek out information to confirm what we already believe and disregard that which challenges those beliefs. The second is social media recommender algorithms doing what recommender algorithms do: recommending content they think you’ll enjoy. If you only read news of a certain political persuasion, eventually, your favorite social media site will only recommend news of that political persuasion.

Unfortunately, this has resulted in a breeding ground for fake news.  Unscrupulous content providers don’t care whether or not you know the information they peddle is false, so long as you click on it and share it with your social network (many members of which probably share your political views and will keep the fake news article propagating through the network).  The only barrier between fake news mongers and your news feed is the filter bubble you’ve created.  That same barrier, however, becomes an express lane simply by capitalizing on key words and phrases that you’ve already expressed an interest in.

Social media sites have done little to combat this barrage of fake news.  Their position on the matter is that it’s up to the user to decide what is fake and what is real.  In fact, Twitter relieves itself of any obligation to discern fact from fiction in its Terms of Service, stating that you “use or rely upon” anything you read there at your own risk.  Placing the onus of fact-checking on the users has led to real consequences such as “Pizzagate,” an incident in which a man, acting in response to fake news he had read on Facebook, fired an assault rifle in a pizzeria he believed was being used as a front by Hillary Clinton’s campaign manager, John Podesta, to traffic child sex slaves.

Clearly, placing the burden of verifying news on the users’ shoulders doesn’t work.  Many users suffer from information illiteracy — they aren’t equipped with the skills necessary to ascertain whether or not a news article has any grounding in reality.  They don’t know how to fact check a claim, or even question the expertise or motivation of someone going on the record as “an expert.”  And if the news article happens to align with their existing world view, many have little reason to question its authenticity.

Social media sites need to do more to combat fake news.  They’ve already been excoriated by Congressional committees over their part in the Russian meddling effort during the 2016 Presidential Election.  Facebook, Google, and Twitter have since pledged to find a solution to end fake news, and Twitter has suspended 45 accounts suspected of pushing pro-Russia propaganda into U.S. political discourse, but they are only addressing the issue now that they are facing scrutiny, and they are still dragging their feet about it.  Ultimately though, the incentive structures in place do little to encourage social media giants to change their ways.  Social media sites make the majority of their money through advertisements and sponsored content, so when a content provider offers large sums to ensure millions of people get their message, social media sites won’t ask questions until the fines for sponsoring misleading content offset any potential profit.

Data Privacy and GDPR

As an Economist article from May 2017 argued, data is now the world’s most valuable resource: it is the new oil. Internet and technology giants have transformed the business world and social interaction as we know them. Companies such as Amazon, Google, Facebook and Uber have collected more data on consumers around the world in the past 5 years than was gathered in the entire prior history of data collection.

With that said, companies don’t always collect data in the most “ethical” way. As we have seen in class, Google, for example, collected private data from Wi-Fi networks during the routine drives of its Street View cars. Uber gets access to your phone details, your contacts, and your schedule, among many other things on your phone, and my personal favorite: “third-party site or service you were using before interacting with our services.” (Uber Privacy)

Many other big and small tech companies capture more information by the second, and we as consumers have grown accustomed to scrolling, scrolling, scrolling some more, clicking “accept”, entering personal information, and starting to consume services. Through cognitive tricks and with the help of psychologists, companies nudge consumers’ behavior to their benefit; it is getting ever easier for them to make “opting out” harder than “opting in”, if users are given the opportunity at all. Most of us never pause to think about how we can continue using the services without sharing all this private information, we never pause to read what is being captured and tracked as we interact with the application or the platform, and we never pause to question the impact on our lives if this data gets leaked or the company gets hacked due to weak security policies or a lack of privacy safeguards implemented by the organization.

Fortunately, the EU has adopted a new set of data protection regulations built around the protection of the user’s privacy and information. The new General Data Protection Regulation (GDPR) will be enforced starting May 25, 2018. Companies in violation of these regulations will be subject to a “penalty of up to 4% of their annual global turnover or €20 Million (whichever is greater)”. Most of you reading this blog are thinking: great, but this is in Europe and we are in the U.S., so why should we care? How will it impact us?
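The “whichever is greater” rule in the quoted penalty is easy to express in code. The following Python sketch computes only the upper bound from that quote; the function name and the turnover figures are made up for illustration, and the actual fine in any case depends on the nature of the violation.

```python
def max_gdpr_fine(annual_global_turnover_eur: float) -> float:
    """Upper bound: 4% of annual global turnover or EUR 20 million, whichever is greater."""
    return max(0.04 * annual_global_turnover_eur, 20_000_000)

# Hypothetical smaller firm: 4% of turnover is below EUR 20M, so the EUR 20M bound applies
print(max_gdpr_fine(100_000_000))

# Hypothetical large firm: 4% of turnover exceeds EUR 20M
print(max_gdpr_fine(50_000_000_000))
```

For a company with €100 million in turnover the bound stays at €20 million, while for one with €50 billion it rises to €2 billion, which is why the percentage matters most for the internet giants.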

The beauty about GDPR is “it applies to all companies processing the personal data of data subjects residing in the Union, regardless of the company’s location”, hence all technology companies and internet giants will need to comply with these new regulations if they would like to continue operating in the EU.

It is worth noting that though data protection directives are not new to the EU, the GDPR introduces new rules that are necessary to address the issues brought forth by the evolution and creativity of today’s technology companies when it comes to data protection and privacy. The two biggest changes are the global reach of the GDPR and the financial penalty mentioned above. Other changes include strengthened consent statements and improved data subject rights. (see more details here).

All said and done, though the GDPR is a step in the right direction, focused on the protection of our data and privacy, there are still no clear and strict guidelines preventing companies from capturing and processing excessive data that is irrelevant to the user’s experience (if there is such a concept as “excessive data”). For example, my Uber hailing and riding experience is not linked in any shape or form to Uber capturing my browsing history on “how to sue Uber” or me checking my wife’s ovulation cycle before using their application!

Hence, I believe regulations should also include clear consent “Opt-in” as an option (empty checkbox) to capture, monitor and process data not relevant to the user’s experience and services offered by the platform.

How important is data integrity to a consumer?

I want to explore a couple of ideas. Is data a consumer’s best friend or worst enemy? Or both? Can they tell the difference? Do they care?

Big data has become the buzz around Silicon Valley over the last few decades. Every company strives not only to be a data-driven company but also, in many cases, to become a “data company.” From the perspective of a data enthusiast, this prospect is exciting and promises many opportunities. However, as we have seen in this class, with these opportunities comes risk. Throughout this class we have tried to make sense of the blurry line that governs what companies can collect, how they can collect it, and what their duty is to their consumers at the end of the day when it comes to data collection and data privacy.

Many times companies use data in less than ethical ways, yet in the end the user benefits. For example, let’s assume a hypothetical company scraped information off the web about every one of its users so that it could serve them personalized and relevant content. This benefits the users because the personalization makes the product more attractive to them, but at the same time it is a clear invasion of their privacy because the company has gained access to this information. My base question is: do people care? My assumption is that they definitely care, and I am sure everyone in this class assumes the same. But would the conversation change if they don’t? Put another way – why do we have the rules that we have? Is it because we feel that this is what the vast population wants, or is it because we feel that they need to be protected?

We have a vast array of rules and regulations but, according to an article by Aaron Smith published by Pew Research, about half of online Americans don’t even know what a privacy policy governs. In his article he suggests that the majority of online consumers believe that if a company has a privacy policy, it is not allowed to share the data that it receives. Given this research, and much other work on the same subject, I think it is reasonable to suggest that our privacy laws are put into place to protect the population rather than to conform to what they want, because for the most part they don’t even understand what we are protecting them from. Or, at the very least, we have not adequately explained to them what our laws say and how our laws restrict companies.

With that in mind, the next logical question is whether the consumer population actually cares. This is the topic of my final project. They may not understand what we are trying to protect them from today, but if they did understand, would it matter to them? Would they rather have a more compelling product, or would they rather have more control over how their information is being handled? A quick note: I have focused mainly on improving a product because I felt that the question of using data only for the company’s gain was not an interesting one. I am interested in what a consumer feels is more beneficial to them – a compelling product or control of their data.

Pew research article: http://www.pewresearch.org/fact-tank/2014/12/04/half-of-americans-dont-know-what-a-privacy-policy-is/

 

 

Sources of Bias in Machine Learning

Recently, Perspective, an application created by Google to score “toxicity” in online comments, has come under fire for displaying high levels of gender and racial bias. Here we will attempt to view this perceived bias through a machine learning lens.

There are two types of bias exhibited in machine learning, and it is useful to distinguish them. On one hand we have algorithmic bias. This occurs when the mathematical algorithm itself is too simple to account for all of the variance in the observed data. On the other hand we have training bias. This occurs when the data used as training input to the mathematical model is too limited to explain all of the variance in the world. The solution would seem simple: increase the complexity of the algorithm and the variety of the training data. However, in the real world this is often difficult and sometimes impossible.

One of the decisions that goes into creating a mathematical model is known as the “bias-variance tradeoff,” in which a supervised machine learning model is selected such that it isn’t so specific that it only works in a limited number of cases, but isn’t so general that it ignores all the details. This tradeoff is straightforward to quantify and is very well understood in known algorithms. With Perspective, Google uses a type of supervised machine learning called a deep neural network, a machine learning algorithm specifically designed to solve complex problems. Interestingly, deep neural networks almost exclusively sit on the high-variance end of the bias-variance spectrum. That is to say, Perspective almost certainly has very low algorithmic bias. While it is possible that the model does have some unquantified algorithmic bias (for example, it may not be able to distinguish intentional deception), the instances of text used in the referenced articles are not an example of this.
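To make the tradeoff concrete, here is a minimal sketch in Python (using NumPy) that estimates squared bias and variance for a simple model and a complex one. Everything in it is an illustrative assumption and has nothing to do with Perspective itself: the sine curve stands in for the “truth,” the noise level is arbitrary, and polynomial degree stands in for model complexity.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=25, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of a given degree.

    Assumed toy setup: the true function is sin(2*pi*x) on [0, 1], the training
    inputs are fixed, and only the noise on the training labels is resampled.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    true_y = np.sin(2 * np.pi * x_test)

    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set and refit the model.
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise, n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)

    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_y) ** 2))  # error of the average fit
    variance = float(np.mean(preds.var(axis=0)))         # spread across refits
    return bias_sq, variance

simple_bias, simple_var = bias_variance(degree=1)     # too simple: a straight line
complex_bias, complex_var = bias_variance(degree=9)   # flexible: follows the noise
print("degree 1: bias^2 = %.4f, variance = %.4f" % (simple_bias, simple_var))
print("degree 9: bias^2 = %.4f, variance = %.4f" % (complex_bias, complex_var))
```

In this toy setting the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, which is the same pattern the paragraph above attributes to simple models versus deep neural networks.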

The conclusion, then, is that training bias accounts for almost all of the bias in this application. Training bias is much less understood than its algorithmic counterpart. The data used to train Perspective comes from discussions between different Wikipedia editors about the content of page edits, a simple and widely available data set. However, the latent sources of bias in this training dataset are difficult to spot, ranging from local copyright law to the composition of the employees in the software industry. Algorithms can be corrected so that these biases are not amplified, but these adjustments require a priori knowledge to identify the affected classes and still result in at least the same bias as the training data itself. The solution to this problem will involve awareness, working with both the variables in the data set and the outcome to be predicted.

Developing tests for bias among the predictor variables in the training set can, at a minimum, allow the consumer of the model to be informed of its limits. With Perspective, Google simply put forward a test environment that allows an individual to enter any English utterance and get a toxicity score. But the testing data consisted of full sentences that were part of a larger thread of conversation, which is a bias in itself. If Perspective forced the user to submit an entire conversation and then select a specific response for a toxicity rating, the results might be more interpretable.
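One simple form such a test could take is a template swap: score the same sentences with only one term changed and compare the averages. The sketch below is hypothetical throughout. `toy_score` is a made-up stand-in for a trained model (it is not Perspective and does not call any Google API), and “groupa” and “groupb” are placeholder terms; the scorer has been deliberately given a spurious association with one of them so that the test has something to find.

```python
from typing import Callable, Dict, List

def term_swap_test(score: Callable[[str], float],
                   templates: List[str],
                   terms: List[str]) -> Dict[str, float]:
    """Average the model's score for each term over identical sentence templates."""
    averages = {}
    for term in terms:
        scores = [score(template.format(term=term)) for template in templates]
        averages[term] = sum(scores) / len(scores)
    return averages

def toy_score(text: str) -> float:
    """Hypothetical scorer that has 'learned' a spurious weight for one term."""
    weights = {"idiot": 0.8, "groupb": 0.4}  # assumed weights, for illustration only
    words = text.lower().replace(".", "").split()
    return min(1.0, sum(weights.get(word, 0.0) for word in words))

templates = ["I am a {term} person.", "My friend is {term}."]
averages = term_swap_test(toy_score, templates, ["groupa", "groupb"])
gap = averages["groupb"] - averages["groupa"]
print(averages, "gap:", gap)
```

A nonzero gap on sentences that differ only in the swapped term is a signal worth reporting to the consumer of the model, although a test like this can only find bias for terms someone thought to include.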

Adjusting the outcome variables to incorporate additional parameters of “fairness” is one avenue being explored. Another solution is to throw out the predicted outcomes entirely and allow the algorithm to infer the underlying structure of the data. Asking Perspective to partition the Wikipedia data into a number of unlabeled categories may yield an implicit toxic/non-toxic split. This type of machine learning is much less understood, but many experts believe this is the path forward towards a more generalizable intelligence.
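As a rough illustration of what partitioning without labels means, the sketch below clusters four invented comments into two groups using word counts and a bare-bones two-cluster k-means. The comments, the bag-of-words features, and the choice of k-means are all assumptions made for the example; nothing here reflects how Perspective works, and on real data there is no guarantee that the discovered groups would line up with toxicity.

```python
import numpy as np

def bag_of_words(comments):
    """Turn each comment into a vector of word counts over a shared vocabulary."""
    vocab = sorted({word for c in comments for word in c.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(comments), len(vocab)))
    for row, comment in enumerate(comments):
        for word in comment.lower().split():
            X[row, index[word]] += 1.0
    return X

def two_means(X, n_iter=10):
    """Minimal k-means with k=2 and a deterministic start (no labels used)."""
    first = X[0]
    farthest = X[np.argmax(((X - first) ** 2).sum(axis=1))]
    centers = np.stack([first, farthest])
    for _ in range(n_iter):
        # Assign every comment to its nearest center, then move the centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Invented toy comments; the algorithm never sees which ones are hostile.
comments = [
    "thanks for the helpful edit",
    "thanks for the helpful source",
    "this edit is garbage you fool",
    "this source is garbage you fool",
]
labels = two_means(bag_of_words(comments))
print(labels.tolist())
```

Here the two polite comments and the two hostile ones end up in separate clusters, but only because the toy data was built that way; the cluster numbers themselves carry no meaning until a person inspects what landed in each group.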

Overall, bias presents one of the most difficult obstacles to overcome on the way to wider adoption of machine learning. An awareness and understanding of the sources of those biases is the first step to correcting them.