We Need to Protect Our Information on Social Media

Currently, most tech companies earn the bulk of their revenue from online advertising. To deliver advertisements more efficiently to the right targets, companies like LinkedIn and Facebook collect a wide variety of data about society. They analyze people’s geographic locations, friends, education, the places they usually go, and so on. For marketing and A/B-testing purposes, the more information a company collects, the more accurately it can predict.
Recently, Facebook launched a new product called Facebook Local. Its description reads: “Keep up with what’s happening locally—wherever you are—whether you’re looking for something to do with friends this weekend or want to explore a new neighborhood.” The key idea is to help people know what’s happening in their neighborhood. Everything sounds good and interesting. However, the app’s average rating on the App Store is only 3.3 out of 5.
Here are the reasons. First, people have started to care about their privacy, and there are plenty of similar apps on the App Store, so why should they choose this one? Facebook already has most of their information, and they do not want to let Facebook know everything about them. Second, and most important, once you start using this app, Facebook’s picture of you becomes much more precise. A Facebook account only needs to list the city you live in; mine shows only that I live in San Francisco. Once I start using this app, however, Facebook learns which part of the city I live in and what my daily patterns are. I am a Pokémon Go fan, so Facebook would know where I go every weekend and how long I stay at each location. I might feel as if someone were watching me every day.
Based on what we have learned so far, Facebook gains enormous benefits, because it can charge advertising fees directly to its clients and deliver content to people based on their location. It might keep recommending good restaurants in the areas I usually visit, and I might go to those places more often simply because it saves me time. But what if Facebook uses this information for other purposes, such as developing other apps or conducting research? We do not know how companies store and use our data. Many of us have had a similar experience: after submitting contact information to apply for a credit card from a company like Visa or American Express, we start getting calls from many banks as well. Why? Because the card companies share our information with the banks, and the banks use it to ask us to open accounts with them.
A similar situation occurs on LinkedIn. Once we change our status to “currently looking for a full-time position,” we receive a flood of emails and requests from staffing companies.
Above all, technology does change our lives, but we need more rules to protect us and to keep these companies from bothering us.

Racial Bias in the National Instant Criminal Background Check System

Modern gun control began with the Gun Control Act of 1968, passed after the assassinations of John F. Kennedy, Martin Luther King Jr., and Bobby Kennedy. It prohibits mail-order gun purchases, requires all new guns be marked with a serial number, and created the Federal Firearms License (FFL) system, which manages licenses required for businesses to sell guns. The law was further strengthened in 1993 by the Brady Handgun Violence Prevention Act. This new addition established a set of criteria that disqualify a person from legally purchasing a gun. It also created the National Instant Criminal Background Check System (NICS), which is maintained by the FBI and used by FFL licensed businesses to quickly check if a person matches any of the disqualifying criteria.

Although NICS was created with good intent, and without any explicitly racist assumptions, the NICS database and algorithms likely inflict a greater burden on the African American community than on its white counterpart. This unintended consequence arises from a perfect storm of seemingly unrelated policies and history.

To see this bias we must first understand how a background check is performed. When you purchase a gun through an FFL-licensed business, you submit identifying information such as your name, age, address, and physical description. The NICS system then looks for an exact match on your personal data in three databases that track criminal records. If no exact match is found, a close name match can still halt your purchase.

The data and matching algorithms used by NICS are not publicly available, so we can only guess at what exists in the databases and how it is utilized. But based on public records and one particular criterion established by the Brady Act, conviction of a crime punishable by imprisonment for a term exceeding one year, we can make educated assumptions about the data. First, drug possession and distribution can result in multi-year imprisonment. Second, the largest proportion of inmates are incarcerated for drug-related offenses. Together these imply that a large, perhaps the largest, share of the people in NICS are there due to drug-related crimes. Lastly, African Americans are imprisoned at a rate six times greater than whites for drug-related crimes even though white and African Americans use and possess drugs at essentially the same rate. This final statistic indicates that the NICS databases must include a disproportionate number of African Americans due to biases in law enforcement and the criminal justice system. These upstream biases not only affect inmates at the time of conviction but follow them throughout life, limiting their ability to exercise rights protected by the Second Amendment.

Unfortunately this is not where the bias ends. There is evidence that loose name-based matching against felony records in Florida disproportionately and incorrectly identified black voters as felons and stripped them of their right to vote in the 2000 elections, in part because African Americans are over-represented among common names, a legacy of losing family names during the slavery era. It is worth wondering whether the FBI’s name-matching algorithm suffers from the same bias and denies or delays a disproportionate number of law-abiding African Americans from buying guns. In addition, this bias would result in law-abiding African Americans having their gun purchases tracked in NICS. By law, NICS deletes all traces of successful gun purchases. However, if you are incorrectly denied a purchase, you can appeal and add content to the databases proving you are allowed to purchase guns; this prevents the need to appeal every time you purchase a gun. The existence of that content is the only record of gun purchases in NICS, information the government is generally forbidden to retain. If this bias does exist, there is sad irony in laws passed in the wake of infamous violence perpetrated by non-African Americans now most negatively affecting African Americans.

This evidence should be weighed carefully, especially by those who advocate for both gun control and social justice. The solutions settled upon for gun control must pass intense scrutiny to ensure social justice is not damaged. In the case of NICS, the algorithms should be transparent, and simple probabilistic methods should be employed to lessen the chance of burdening African Americans who have common names.
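
To make that last recommendation concrete, here is a minimal sketch of how a probabilistic matcher could discount a bare name collision in proportion to how common the surname is, requiring corroborating fields such as date of birth before a purchase is halted. This is my own illustration in Python, not the FBI’s actual algorithm (which is not public); the surname frequencies, weights, and threshold are hypothetical.

from difflib import SequenceMatcher

# Hypothetical surname frequencies within the criminal-record databases.
# A real system would estimate these from the databases themselves.
SURNAME_FREQUENCY = {"washington": 0.004, "jackson": 0.006, "smith": 0.009, "zelinsky": 0.00001}

def name_similarity(a, b):
    """Crude string similarity between two names, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_probability(buyer, record):
    """Score how likely it is that a buyer and a criminal record are the same person.
    A name collision alone is discounted in proportion to how common the surname is,
    so people with common names are not flagged on the name alone; agreement on
    date of birth carries substantial independent weight."""
    similarity = name_similarity(buyer["name"], record["name"])
    dob_match = 1.0 if buyer["dob"] == record["dob"] else 0.0
    commonness = SURNAME_FREQUENCY.get(buyer["name"].split()[-1].lower(), 0.001)
    name_evidence = similarity * (1.0 - min(commonness * 50, 0.5))  # hypothetical discount
    return 0.6 * name_evidence + 0.4 * dob_match

buyer = {"name": "James Washington", "dob": "1985-03-02"}
record = {"name": "James Washington", "dob": "1979-11-21"}
print(match_probability(buyer, record))  # roughly 0.48, well below a plausible 0.9 "deny" threshold

The specific formula matters far less than the principle: the weight given to a bare name match should shrink as the name becomes more common, which is precisely the failure mode documented in the Florida voter purge.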

Enrollment Management from a Student Perspective

If you received an undergraduate degree in the United States, you are likely familiar with the U.S. financial aid system from a student perspective – you submit your essays, your academic records and test scores, you file the FAFSA, and you expect some amount of financial aid from the colleges you applied to in return. Your institution may or may not have provided an estimated cost calculator on its website, and you may or may not have received as much financial aid as you hoped for from your institution. Given that approximately 71% of each undergraduate class takes on student loan debt (TICAS, 2014), institutional aid typically does not cover the gap between what the student can pay and what the institution charges (the remainder is known as unmet need). What is clear, however, is that despite a consistent sticker price, the actual cost of college differs from student to student.

Colleges and consultants refer to the practice as “enrollment management” or “financial aid leveraging,” but the pricing strategy itself is known as price discrimination (How Colleges Know What You Can Afford, NY Times, 2017). As with any business constrained by net revenue, there is in some ways a fundamental opposition between the best interests of the consumer (student) and the seller (college), since consumers ideally want the product at the cheapest rate and sellers want to earn as much revenue as possible (though many factors other than revenue also drive colleges’ decision making). However, this idea becomes more problematic when we consider that education is not an inessential service but a key component of personal development and economic opportunity.

The looming ethical discussion, at least in the U.S., is whether higher education should be free for anyone who wants it, perhaps eliminating the need for universities to engage in price discrimination. A parallel discussion is whether price discrimination that leaves unmet need for students is what needs more immediate resolution.

Rather than taking a stance on U.S. college pricing, however, I am interested in the enrollment management paradigm from a student privacy perspective. If Nissenbaum et al. posit that “informational norms, appropriateness, roles, and principles of transmission” govern a framework of contextual integrity (Nissenbaum et al., 2006), how might the use of student-provided data by enrollment consultants violate contextual integrity from the perspective of a student?

I cannot find any existing studies on students’ expectations of how colleges handle their data. As a U.S. student myself, I expect that many students’ expectations are driven by the norms laid out by U.S. policy (particularly FERPA), which treats educational and financial data as private and protected.

I believe, therefore, that certain expectations about the flow of data from student to institution may be violated when universities don’t explicitly divulge their partnerships. If the flow is expected to be a straight line from the student to the college, the continuation of that information from college to consultancy and back to the college may seem aberrant. Equally important, I think, is the expectation of the extent of the information. Students likely expect, and cost calculators imply, that certain static pieces of information will be used to make an admission decision, offer merit aid, and determine financial need. The passing of that information to an outside consultancy, which can combine it with third-party data in a predictive model that surpasses any individual piece of data, both to recommend aid and to predict behavior, and then return the results to the college, may also violate students’ expectations.

It seems to me that whether financial aid leveraging is beneficial to the student or not, a lapse in privacy occurs to the benefit of institutions when they fail to disclose the extent to which student data will be used, and by whom exactly.

Privacy and Security of On-Board Diagnostics (OBD)

Privacy issues arising from technology often share a similar story. A technology is usually developed with simple intentions, to enhance a feature or perform a modest task. The fittest of those technologies survive to serve a wide set of users. However, as more information is logged and transmitted, a growing concern over privacy surfaces until that privacy issue devours the once-simple technology. We have observed many of these stories. Notably, each of the social networking sites that took turns in popularity was developed as a means to host personal opinions and connections. That never changed, except that the discussion around privacy infringements exploded and profoundly affected the direction of the sites. The baton for the next debate seems to be handed to On-Board Diagnostics (OBD). OBD is a system placed behind the driver’s dashboard to collect data on the car, such as whether tire pressure is low. But more features have been added, with more to come. The addition of entertainment systems, cameras, and navigation devices contributes richer layers of data to the OBD.

Originally developed to track maintenance information and later gas emissions, OBD is attracting mounting concern over its expanding capability to inflict serious privacy violations. Much like social networking sites, OBD is becoming a lucrative source of rich data. In the case of cars, insurance agencies, advertisers, manufacturers, governments, and hackers all have an interest in the data contained in the OBD. For example, some insurance companies have used information from OBD to measure driving distance and offer discounts to drivers with low mileage. Other insurance companies are issuing monetary incentives for customers to submit information from their OBD. Manufacturers can use the information to improve their cars and services, and governments can monitor and regulate traffic and gas emissions with it. Advertisers can be guided by the information as well. Of course, the distribution of information to insurers and marketers seems trivial when weighed against the harm of a possible hacking incident.

As more OBDs are loaded with internet connectivity, the vulnerability may be worsening. The types of information are no longer limited to whether your tires are low on pressure; they now include more personal information such as your music preferences, the number of passengers, and your real-time location. Location data can be used to infer your home address, school or office, choice of supermarkets, and maybe even your religious views or nightlife habits. Cameras in and around the vehicle can supply streaming video as well. While each of these devices is useful in enhancing the driver and passenger experience, the privacy and security concerns are alarming. Moreover, an OBD loaded on a “smart” car can collect more information more accurately and share it faster with a wider audience. Unlike smartphone developers, however, developers of smart cars face bigger challenges in keeping up with rapid technological evolution. And even if drivers were offered choices to turn off OBD features, many features would likely remain on, as safety concerns may override privacy concerns. The question of who owns the information is also debated in the absence of clear rules and regulations.

A collaborative effort involving governments, manufacturers, and cybersecurity professionals is needed to address the privacy and security concerns arising from OBD. In the United States, senators introduced a bill, the “Security and Privacy in Your Car Act of 2015,” which calls for cars to be “equipped with reasonable measures to protect against hacking attacks.” However, the bill is too ambiguous and will be difficult to enforce in a standardized way. Manufacturers, while acknowledging the possible risks associated with OBD, are not fully up to speed on the matter. Federal and state governments need to take leadership, with the cooperation of manufacturers and security professionals, to make sure safe and reliable automobiles are delivered to customers. How we collectively approach the issue will certainly affect the cost we pay.

Strava and Behavioral Economics

I am a self-described health and fitness nut, and in the years since smartphones have become an essential device in our day-to-day lives, technology has also slowly infiltrated my daily fitness regime.  With such pervasive use of apps to track one’s own health and lifestyle choices, is it any wonder that companies are also collecting the data that we freely give them, with the potential to monetize that information in unexpected ways?  Ten years ago, when I went outside for a run, I would try to keep to daylight hours and busy streets because of the worry that something could happen to me and no one would know.  Now, the worry is completely different – now I am worried that if I use my GPS-enabled running app, my location (along with my heart rate and running speed) is saved and stored in some unknown database, to be used in some unknown manner.

 

Recently, a fitness app called Strava made headlines after it published a heat map showing the locations and workouts of users who made the data public (which is the default setting) and inadvertently revealed the location of secret military bases and the daily habits of personnel.  It was a harsh reminder of how the seemingly innocuous use of an everyday tool can have serious consequences – not just personally, but also professionally, and even for one’s own safety (the Strava heatmap showed certain jogging routes of military personnel in the Middle East).  Strava’s response to the debacle was to release a statement that said they were reviewing their features, but also directed their users to review their own privacy settings – thus the burden remains on the user to opt out, for now.

 

Fitness apps don’t just have the problem of oversharing their users’ locations.  Apps and devices like Strava or Fitbit are in the business of collecting a myriad of health and wellness data, from sleep patterns and heart rates to what the user eats in a day.  Such data is especially sensitive because it relates to a user’s health; however, because the user is not sharing it with their doctor or hospital, they may not even realize the extent to which others may be able to infer their private, sensitive information.

 

One of the biggest issues here is the default setting.  Behavioral economics research shows that the status quo bias is a powerful driver of how we humans make (or fail to make) decisions.  Additionally, most users simply fail to read and understand privacy statements when they sign up to use an app.  So why do some companies still choose to make the default setting “public” for users of their app, especially in cases where it is not necessary?  For Strava, if the default had been to “opt in” to share your location and fitness tracking data with the public, their heatmaps would have looked very different.
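
As a toy illustration of how much work the default does, consider a hypothetical settings object for a fitness app, sketched in Python (the class, field names, and numbers are mine, not Strava’s): whatever value the developer writes on one line flows, untouched by most users, straight into the public data set.

from dataclasses import dataclass

@dataclass
class ActivityPrivacy:
    # The most consequential line in the file: the developer-chosen default.
    visibility: str = "public"   # an opt-in design would default to "private"

def publish_to_heatmap(activities):
    """Only activities whose setting is 'public' end up in the aggregate heat map."""
    return [a for a in activities if a["settings"].visibility == "public"]

# Status quo bias in action: most users never touch the setting,
# so the default alone determines what the aggregate map reveals.
activities = [{"settings": ActivityPrivacy()} for _ in range(1000)]                      # untouched defaults
activities += [{"settings": ActivityPrivacy(visibility="private")} for _ in range(50)]   # the few who opt out
print(len(publish_to_heatmap(activities)))  # 1000 of 1050 activities become public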

 

It is not in the interest of companies to allow the default settings to be anything other than public.  The fewer people who share data, the less data the company has, and the less it can use that data to its benefit, such as targeted marketing or developing additional features for the individual user.  Thus, companies could argue that collecting their users’ data on a more widespread basis also benefits those users in the long run (as well as their own revenues).  However, headlines like this one erode public trust in technology companies, and companies such as Strava would do well to remember that their revenues also depend on the trust of their users.  In the absence of “private” or “friends only” default settings, these companies would do well to analyze the potential consequences before releasing the public data that they collect about their users.

 

Candy Cigarettes: Now Available in “Blue Speech Bubble” Flavor

Less than two months after its launch, MessengerKids, Facebook’s new child-focused messaging app, has received backlash from child-health advocates, including a plea directly to Mark Zuckerberg to pull the plug. On January 30th, the Campaign for a Commercial-Free Childhood published an open letter compiled and signed by over 110 medical professionals, educators, and child development experts, which accuses the tech giant of forsaking its promise to “do better” for society and of targeting children under 13 to enter the world of social media.

At its introduction in early December 2017, MessengerKids was branded as another tool for parents struggling to raise children in the digital age. After installing the app on their child’s device(s), parents can control their child’s contact list from their own Facebook account. The app has kid-friendly gifs, frames, and stickers; built-in screening for age-inappropriate content in conversations; and a reporting feature for both parents and children intended to combat cyberbullying. It contains no advertisements, and the child’s personal information isn’t collected, in accordance with US federal law. Creating an account does not create a Facebook profile, but nonetheless the service introduces children to social media and to their own online presence.

Contrary to the image MessengerKids hoped to present, child-health advocates have interpreted the application less as a gatekeeper for online safety and more as a gateway to unhealthy online habits. In its letter to Mark Zuckerberg, the CCFC cites multiple studies linking screen time and social media presence to depression and negative mental-health effects. In addition, the letter argues, the app will interfere with the development of social skills, like the “ability to read human emotion, delay gratification, and engage in the physical world.” The connectivity MessengerKids promises is not an innovation, the letter continues, since these communication methods already exist with parents’ approval or supervision (e.g., Skype or parents’ Facebook accounts); nor does the app solve the problem of underage Facebook accounts, as there is little incentive for those users to migrate to a service with fewer features designed for younger kids. Instead, it reads as a play to bring users on board even earlier by marketing specifically to the untapped under-13 audience.

In addition to the concerns about psychological development, a user’s early-instilled brand trust may surpass the perceived importance of privacy later on. Data spread and usage is already a foggy concept to adults, and young children certainly won’t understand the consequences of sharing personal information. This is what US federal law (“COPPA”) hopes to mitigate by protecting underage users from targeted data collection. MessengerKids normalizes an online identity early on, so young users may not consider the risks of sharing their data with Facebook or other online services once they age out of COPPA protection. The prioritization of online identity that MessengerKids may propagate presents a developmental concern that could affect how those after Generation Z value online privacy and personal data collection.

While Facebook seems to have done its homework by engaging a panel of child-development and family advocates, this could be another high-risk situation for user trust, especially in the midst of the fake-news controversy. Facebook’s discussions with its team of advisors are neither publicly available nor subject to the review process of academic or medical research. With the CCFC’s public backlash, parents who wouldn’t have questioned the feature otherwise may now perceive the decision to introduce the app as a medical decision about their child’s health. A curated panel of experts may not be enough to assure parents that Facebook does, in fact, care about kids as more than potential users. And if the app’s reporting features cannot reliably prevent cyberbullying, and Facebook is concerned about unmitigated online activity, why not simply enforce the existing policy of age restrictions?

Comparing the “benefits” of this service to the developmental risks, private business interests have clearly outweighed Facebook’s concern for users’ well-being. While changing social interaction has long been Facebook’s trademark, MessengerKids threatens to alter interpersonal relationships by molding the children who form them, and it could additionally undermine data responsibility by normalizing an online presence at an early age. It appears that Facebook is willing to risk the current generation’s trust to gain the next generation’s: a profitable, but not necessarily ethical, decision.

5G is Coming. Get Your Popcorn Ready.

TL;DR

Mobile networks are on the cusp of a step-change improvement in speed and bandwidth. This will make previously impossible technological intrusions (like multi-gigabit-per-second data transmission) feasible. Government involvement in the oversight of these new networks may increase consumers’ anxiety. Society should welcome this change, but perspectives on the boundaries between private and public formed in the era of 4G need to be updated.

Detailed Discussion 

The mobile networks in developed nations have been sufficient to support substantial progress toward connecting people and things to each other. However, these networks were built primarily to support voice communication and standard applications like email and web surfing. Due to a variety of technical challenges, they are struggling to meet the needs of more demanding applications like augmented reality (AR), autonomous vehicles, and always-on HD video streaming. To address these concerns, private telecom firms in the United States, China, Japan, and South Korea have been racing to build out so-called fifth-generation (“5G”) network architectures.

If and when they are successful, 5G will fundamentally change the relationship private individuals have with technology and the ubiquity of computing in society. The promise of 5G is that mobile communication networks will be able to tolerate much larger transfer volumes with lower latency, making possible transfers that were previously prohibitively costly or slow. According to an article from DMV.org, modern automobiles may be able to continuously transmit information about your location, identity (e.g., fingerprints, facial images), and even your health (such as your heart rate and posture). Most connected devices do not have the local storage capacity to keep long time series of all that information, but the latency and throughput guarantees provided by 5G would allow it to be streamed out to a persistent data store, where it could be used to build a more detailed profile and possibly be joined with other information about the device or the individual(s) interacting with it.

In light of the new data transfers that will be possible, private individuals need to consider the new context within which they will be asked to make disclosure choices. In general, more consideration will need to be given to longitudinal data and what can be learned from it. For example, disclosing “location” in the 4G world may (for some applications / devices) just mean that you are consenting to the existence of a real-time endpoint that holds your information and can be used to trigger events like location-based ads. However, disclosing “location” in the 5G world may carry more weight, as it may imply consenting to the disclosure of time-series data which could be used to derive other information like patterns of behavior.
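
To make that contrast concrete, here is a minimal sketch in Python (my own illustration, not any carrier’s or automaker’s system) of how a retained stream of timestamped location samples, rather than a single real-time endpoint, yields “patterns of behavior” such as a likely home and workplace. The grid resolution and time windows are arbitrary assumptions.

from collections import Counter

def grid_cell(lat, lon, precision=3):
    """Bucket coordinates into coarse cells by rounding (a crude spatial grid)."""
    return (round(lat, precision), round(lon, precision))

def infer_routine(samples):
    """Guess home and work locations from a longitudinal trace of
    (hour_of_day, lat, lon) samples: the most-visited night-time cell is
    treated as 'home', the most-visited working-hours cell as 'work'."""
    night = Counter(grid_cell(lat, lon) for hour, lat, lon in samples if hour >= 22 or hour < 6)
    workday = Counter(grid_cell(lat, lon) for hour, lat, lon in samples if 9 <= hour < 17)
    return {
        "likely_home": night.most_common(1)[0][0] if night else None,
        "likely_work": workday.most_common(1)[0][0] if workday else None,
    }

# A hypothetical week of samples: nights clustered near one cell, weekdays near another.
trace = [(2, 37.789, -122.401)] * 40 + [(23, 37.789, -122.401)] * 30 + [(11, 37.872, -122.259)] * 50
print(infer_routine(trace))

A real-time endpoint (the 4G framing) never needs to keep this history; cheap, persistent, high-throughput streaming (the 5G framing) is what makes this kind of derived profile easy to build.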

To complicate things further, it is possible that in the United States 5G may become a government-run public utility. A few days ago, the Trump Administration floated the possibility of a nationalized 5G wireless communication network. Today’s communication infrastructure is dominated by a handful of private firms, and improvements to that infrastructure are largely driven by competitive forces. In an internal memo obtained by Axios, the administration cited national security concerns as the main reason it is considering subverting this competitive process. By some accounts, China’s Huawei Technologies is leading the 5G race in the private sector, and the Trump administration is worried about the national security implications of having a critical part of the U.S. communication infrastructure controlled by a foreign firm. If the federal government builds and maintains the network infrastructure, it may be easier for government agencies to access the data traveling over it.

To be clear, the intention of this post is not to convince readers that the 5G-pocalypse is coming and that we should fear its might. The promised improvements to mobile networks will open new opportunities for creativity to flourish, for individuals to connect, and for the reliability and effectiveness of institutions and infrastructure to improve. This post merely raises the concern that 5G will alter the context in which individuals make data privacy decisions.

 

References:

  • “Trump team considers nationalizing 5G network”. (Axios)
  • “How Huawei is leading 5G development”. (Forbes)
  • “5G Network Architecture, A High-Level Overview”. (Huawei)
  • “1 billion could be using 5G by 2023 with China set to dominate”. (CNBC)
  • “Autonomous cars, big data, and the post-privacy world”. (DMV.org)
  • “Next-generation 5G speeds will be about 10 to 20 Gbps”. (Network World)

The Impact of Oracle v. Rimini on Data Professionals and the Public

The average Internet user visits dozens of websites every day and thereby interacts with the infrastructure and code on which those websites are built.  However, individuals also interact with a layer beyond the digital code, often without their active knowledge: a legal code, the Terms of Service.

While the former code is easily modified and updated frequently, this legal code, the Terms of Service, is typically drafted by attorneys and updated much less often.  Understandably, many companies try to craft their Terms of Service as broadly as possible, to afford the company the greatest amount of protection.  While computer code is typically precise and exact, legal code allows for more ambiguity and interpretation.  The law must strike a balance between protecting companies’ and individuals’ property and affording good actors reasonable access and use.

One enduring question, curiously not tested until this past decade, is whether violating a website’s Terms of Service constitutes a crime.  The argument that it might rests on the Computer Fraud and Abuse Act, passed by the US Congress in 1986 and broadened in 1996, which prohibits accessing “protected computers” without authorization, or exceeding authorized access and obtaining information from those computers, where the access involves interstate or foreign communication.

As the Electronic Frontier Foundation (EFF) notes in its detailed blog post, in the recent case Oracle v. Rimini the Court held that violating a website’s Terms of Service is not criminally punishable under the Computer Fraud and Abuse Act (and similar state statutes).

Core to this case, and as argued in an amicus brief filed by the EFF, definitions of criminal activity must be very specific and follow the Rule of Lenity, which, as the EFF puts it, requires that “criminal statutes be interpreted to give clear notice of what conduct is criminal.”  Most critically, the EFF goes on to say, “Not only do people rarely (if ever) read terms of use agreements, but the bounds of criminal law should not be defined by the preferences of website operators.”

Of particular interest to Data Scientists was the question of whether using “bots and scrapers” for automated data collection violates the law when it breaches a Terms of Service.  An important tool in the Data Scientists’ and Data Engineers’ toolbox, automated scraping scripts provide for efficient accumulation of data.  Further, many individuals cite instances of Terms of Service being too broad or vague to interpret.

Scraped data can subsequently be used for academic research or to develop novel products and services that connect disparate sets of information and reduce information asymmetries across consumer populations (for example, search engines or price tracking).  On the other hand, malicious bots can sometimes burden a company’s website and impede its operations.

Legal scholars have argued that public websites implicitly give the public the right to access (including to scrape) their content, but some companies disagree.  This presents a fascinating quandary that is beyond the scope of this article.

At issue, as Oracle argued in the case, was whether “the manner in which [the defendant] used” “bots and scrapers” was more than a contractual violation (a violation of the Terms of Service) and also a criminal violation under the Computer Fraud and Abuse Act.  In oral argument viewable beginning at 33:42, Judge Susan Graber stated (at 36:00) that she had difficulty seeing how Oracle’s argument fits with the statute and previous cases.  “They had permission to take [the scraped data],” she states, and previous cases and statutes refer only to data that a party did not have legal access to.  Oracle’s attorney rebuts (at 34:47), “The manner restriction is critical to protect the integrity of the computer systems.”  Judge Graber counters that this argument may have force in the civil sphere, but not in the criminal realm.

In another, currently pending case, hiQ v. LinkedIn, the Court noted further danger:

Under [an aggressive] interpretation of [the Computer Fraud and Abuse Act (CFAA) ], a website would be free to revoke ‘authorization’ with respect to any person, at any time, for any reason, and invoke the CFAA for enforcement, potentially subjecting an Internet user to criminal, as well as civil liability.  Indeed … merely viewing a website in contravention of a unilateral directive from a private entity would be a crime, effectuating the digital equivalence of Medusa.

The Court goes on to articulate that website owners could block certain populations on a discriminatory basis, consequently putting any individual, including Data Professionals, who accesses a website at risk.

Fortunately, the Ninth Circuit articulated in the Oracle v. Rimini case that “[T]aking data using a method prohibited by the applicable terms of use when the taking itself generally is permitted, does not violate [criminal statutes]” (Page 3).

This Oracle decision further clarifies for Data Scientists, Data Engineers, and others that they cannot be criminally prosecuted for violating a website’s Terms of Service.  As mentioned above, because Terms of Service can be broad and open to interpretation, data professionals were potentially at risk of criminal prosecution and liability if a company were to encourage authorities to pursue criminal charges in addition to excluding or discriminating against them.  This resolution still leaves businesses a remedy against bad actors through civil litigation.  Oracle v. Rimini helps clarify some of the parameters within which the law will be applied to web scraping.  The other case mentioned in this post, hiQ v. LinkedIn, with oral arguments scheduled for March 2018, will further test the resolution in the Oracle case, along with previous cases that have been resolved similarly.

 

Note: When engaging in web scraping, there are a number of best practices to follow, such as respecting the Terms of Service as much as possible, respecting a website’s robots.txt, identifying your bot, not republishing data without consent, not gathering non-public or sensitive data, not overburdening the website, and emailing the site administrator if you have a question; if you have additional questions, seek advice from an attorney.  A minimal sketch of a few of these practices appears below.
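
The sketch uses Python’s standard-library robots.txt parser plus the widely used requests package; the domain, contact address, and crawl delay are placeholders, and actual obligations depend on the specific site’s terms.

import time
import urllib.robotparser

import requests

USER_AGENT = "example-research-bot/0.1 (contact: you@example.edu)"  # identify your bot; placeholder contact
CRAWL_DELAY_SECONDS = 5  # do not overburden the site; honor any crawl-delay the site states

def allowed_by_robots(url):
    """Check the site's robots.txt before fetching anything."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")  # placeholder domain
    parser.read()
    return parser.can_fetch(USER_AGENT, url)

def polite_get(url):
    """Fetch a page only if robots.txt allows it, identifying the bot and rate-limiting."""
    if not allowed_by_robots(url):
        raise PermissionError("robots.txt disallows fetching " + url)
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    time.sleep(CRAWL_DELAY_SECONDS)
    return response.text

# html = polite_get("https://example.com/some-public-page")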

Disclosure: I am not a lawyer and am interpreting these legal concepts and rulings from an aspiring Data Scientist’s perspective.  Should there be an error in my understanding or writing, or if you have a question, please let me know at dkent [at] Berkeley [dot] edu.  Thank you in advance.

 

Targeted Offers and a Race to the Bottom: How Incentives May Cause Unethical Machine Learning Users to Flourish

If you’re anything like me, it may be tough to imagine a world without the personalized product advances that we’ve seen.

I rely on the recommendations they provide: my personalized shopping experience with Amazon, the relevant material on my Facebook news feed, and highly pertinent search results from Google. These companies collect information about my product usage, combine it with relevant characteristics gathered about me, and run it all through a set of learning algorithms to produce a highly relevant output. It’s this unique personalization that makes these products so “sticky,” typically creating a mutually beneficial relationship between the company and its users.

However, certain types of offers and uses of machine learning create serious drawbacks for consumer groups, particularly targeted advertising and offers for things like job postings and access to financial services, which create characteristic-based reinforcement that often disproportionately impacts disadvantaged groups. Furthermore, misplaced incentives entice companies to overextend data collection practices and invade privacy. I’ll briefly walk you through the framework that allows this to happen, then show you how certain groups can be harmed. While there are some proposed solutions, I hope you’ll also see why it’s so important to begin addressing these issues now.

Companies seek profit maximization by maximizing revenue and minimizing costs. This can produce real consumer benefit, as firms try to attract consumers by creating innovative products or as competition entering new markets drives prices down. Nonetheless, there can also be negative impacts, which are especially pronounced when companies use learning algorithms for targeted advertising and offers.

In this case, the impact on consumer privacy gets treated as a negative externality: the associated cost is borne by society rather than by the company. Academic work has shown that consumers value privacy enough to assign a monetary value to it when making purchases. Yet more consumer data creates better personalized products and more accurate models, which matter for targeted advertising and offers, so in collecting as much data as possible, companies largely ignore consumers’ desire to protect their privacy.

Moreover, companies have extra leeway in consumer data collection due to monopolistic markets and information disparities. Monopolistic markets occur when one company dominates a market, leaving consumers with high switching costs and little choice. Many tech markets (e.g., search – Google, social media – Facebook, online shopping – Amazon) are dominated by one company, and while consumers can choose to stop using the products, quality alternatives are very limited. Furthermore, consumers may not know what data is collected or how it is used, creating a veritable “black box”: consumers may be vaguely aware of privacy intrusion, but without specific examples they won’t discontinue product use.

Perhaps more worrisome are the reinforcement mechanisms created by some targeted advertising and offers. Google AdWords provides highly effective targeted advertising; however, it is problematic when these ads pick up on existing stereotypes. Researchers at Carnegie Mellon and the International Computer Science Institute found that AdWords showed high-income job ads more often to otherwise equivalent male candidates. Machine learning models are typically trained on existing data, producing an outcome or recommendation based on gathered usage patterns, inferred traits, and so on. While this typically creates consumer benefit, re-training can create an undesirable reinforcement mechanism if the results of the prior model affect outcomes in the next round of training data. In simplified form, the loop works as follows (a toy simulation appears after the list):

  1. The model shows women fewer ads for high-income jobs
  2. Fewer women take high-income jobs (the same bias that helped inform the original model)
  3. Subsequent re-training shows men as even better candidates for these ads than the prior model did
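
The toy Python simulation below is not how AdWords or any real ad platform works; the update rule and numbers are invented solely to show the direction of the drift. Both groups click at the same true rate, but the system estimates each group’s rate only from the impressions it chose to serve, so the under-served group always looks slightly worse and the initial skew compounds.

def run_feedback_loop(rounds=8, impressions=10_000, true_ctr=0.10, prior_impressions=500, step=0.05):
    """Each round, per-group click rates are re-estimated from served impressions only,
    shrunk toward zero by a prior; a greedy update then shifts more impressions toward
    whichever group looks better, so the historical bias amplifies instead of washing out."""
    male_share = 0.55  # initial skew inherited from historical (biased) training data
    for round_number in range(1, rounds + 1):
        male_impressions = impressions * male_share
        female_impressions = impressions * (1 - male_share)
        male_estimate = (true_ctr * male_impressions) / (male_impressions + prior_impressions)
        female_estimate = (true_ctr * female_impressions) / (female_impressions + prior_impressions)
        # "Retraining": move impressions toward the group with the higher estimated click rate.
        if male_estimate > female_estimate:
            male_share = min(0.95, male_share + step)
        else:
            male_share = max(0.05, male_share - step)
        print("round", round_number, "share of high-income ads shown to men =", round(male_share, 2))

run_feedback_loop()  # the share climbs from 0.55 toward 0.95 despite identical true click rates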

Nonetheless, targeted-ad firms are incentivized to prioritize ad clicks over fairness, and even for a firm looking to fix this specific issue, the interconnectedness of the data inputs and the complexity of learning algorithms may make it harder to solve than it seems.

Furthermore, other industries with high societal impact are increasingly using machine learning. In particular, the banking and financial services sector now routinely uses machine learning in credit risk assessment and offer decisions, in order to lower costs by reducing the probability of consumer default. However, a troubling trend is showing up in these targeted models: rather than basing offer decisions on behaviors, credit decisions are being made on characteristics. Traditional credit scoring models compared known credit behaviors of a consumer (such as past late payments) with data showing how people with the same behaviors performed in meeting credit obligations. Many banks now purchase predictive analytics products for eligibility determinations that base decisions on characteristics like store preferences and purchases (FTC Report Reference). While seemingly innocuous, such characteristics can be highly correlated with protected traits like gender, age, race, or family status. By bringing characteristics into predictive models, financial services firms can often make their decision models more accurate as a whole. However, this is often at the expense of people from disadvantaged groups, who may be rejected for an offer even though they have exhibited exactly the same behavior as someone who was accepted, setting the individual and the group further back. Depending on the circumstance, the companies leveraging characteristics most heavily may end up with the most accurate models, giving them an unethical “leg up” on the competition.
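
To illustrate the proxy problem, here is a small synthetic sketch in Python (invented feature names and rates, not any lender’s model): the model never sees the protected attribute, yet a shopping characteristic correlated with it reproduces the disparity even when repayment behavior is identical across groups.

import random

random.seed(1)

def make_applicant():
    """Synthetic applicant: identical repayment behavior in both groups, but group
    membership is correlated with a 'shops_at_store_x' characteristic."""
    group = random.choice(["A", "B"])                                     # protected trait, never given to the model
    shops_at_store_x = random.random() < (0.8 if group == "A" else 0.2)   # correlated proxy characteristic
    repaid = random.random() < 0.85                                       # same true repayment rate for both groups
    return group, shops_at_store_x, repaid

applicants = [make_applicant() for _ in range(20_000)]

def approve(shops_at_store_x):
    # Hypothetical decision rule learned from characteristics rather than behavior.
    return not shops_at_store_x

for g in ("A", "B"):
    members = [a for a in applicants if a[0] == g]
    approval_rate = sum(approve(a[1]) for a in members) / len(members)
    repayment_rate = sum(a[2] for a in members) / len(members)
    print(g, "approval:", round(approval_rate, 2), "repayment:", round(repayment_rate, 2))

# Output is roughly A: approval 0.20, B: approval 0.80, with repayment near 0.85 for both:
# equal behavior, very unequal access, and the protected trait never appeared as a model input.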

However, the complexity of data collection, transformation, and algorithms makes these issues very difficult to regulate, and it is also critical not to stifle innovation. As such, industry experts have recommended adapting existing industry regulation to account for machine learning rather than creating a set of all-encompassing rules (FTC Report Reference), an approach that fits well within the contextual integrity of existing frameworks. For example, motor vehicle regulation should adapt to AI in self-driving cars (FTC Report Reference). Currently, statutes like the Fair Credit Reporting Act, the Equal Credit Opportunity Act, and Title VII prohibit overt discrimination against protected groups. Much of the statutory framework therefore already exists, but it may need to be strengthened with language addressing characteristic-based exclusion.

Moreover, the British Information Commissioner’s Office recommends that project teams perform a Privacy Impact Assessment detailing noteworthy data collection, identifying data needs, describing information flows, and identifying privacy risks, all of which help reduce the scope of privacy intrusion. Similarly, I recommend that teams perform an Unintended Outcome Assessment before implementing a targeted advertising campaign or learning-based offer, gathering project managers and data scientists to brainstorm unintended discriminatory reinforcement consequences and propose mitigating procedures or model updates.

Child Deaths: A Case of Immensely Consequential Type I & II Errors

Recently, the Illinois Department of Children and Family Services (DCFS) decided to discontinue a predictive analytics tool intended to flag children likely to die from abuse, neglect, or other physical threats within the next two years. Director BJ Walker reported that the program was both 1) failing to predict actual child deaths, including the homicide of 17-month-old Semaj Crosby in April of this year, and 2) flagging far too many children with probabilities of death at or near 100%. When it comes to matters like abuse, neglect, and the deaths of children, false positives (Type I errors) and false negatives (Type II errors) have enormous consequences, which makes putting tools with unstable error rates into production all the more fraught.

The algorithm used by the DCFS is based on the Rapid Safety Feedback program, the brainchild of Will Jones and Eckerd Connects, a Florida-based non-profit dedicated to helping children and families. First applied in Hillsborough County in 2012 by what was then Eckerd Youth Alternatives, the predictive software read in data about children’s parents, family history, and other agency records. Factors going into the algorithm include whether there was a new boyfriend or girlfriend in the house, whether the child had previously been removed for sexual abuse, and whether the parent had previously been a victim of abuse or neglect. Using these and many other factors, the software assigned each child a score from 0 to 100 indicating how likely a death was to occur in the next two years. Caseworkers would be alerted to children with high risk scores and, with proper training and knowledge, intervene with the family. In Hillsborough County, the program appeared to be a success, with child deaths declining after its implementation. The program’s author and director acknowledged that they cannot fully attribute the decrease to the program, as there could be other factors, but the county did see a decrease in child deaths, and that is a good outcome.

Since then, the program has gained attention from other states and agencies looking to improve child welfare, Illinois among them. In Illinois, however, more than 4,000 children were reported to have greater than a 90% probability of death or injury, and 369 children under the age of 9 were given a 100% probability. Caseworkers are trained not to act solely on these numbers, but the volume of alerts was clearly unreasonable. That volume brings the conversation to the impact of false positives on welfare and families. A false positive means that a caseworker could intervene in a family and potentially remove a child from the parents. If abuse or neglect were not actually happening and the algorithm was wrong, the mental and emotional impact on the family can be devastating. Not only could the intervention unnecessarily tear a family apart physically, it can traumatize the family emotionally. In addition, trust in child services and the government agencies involved would deteriorate rapidly.
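
The tension between the two error types can be shown with a small, entirely synthetic Python simulation (the score distributions, prevalence, and thresholds are made up and are not DCFS or Eckerd data): with a rare outcome and a noisy score, raising the alert threshold misses more at-risk children, while lowering it wrongly flags many more families.

import random

random.seed(42)

def simulate(n_children=100_000, prevalence=0.001, threshold=90):
    """Score children 0-100 with a noisy synthetic risk model and count the errors
    produced by a single alert threshold."""
    tp = fp = fn = tn = 0
    for _ in range(n_children):
        at_risk = random.random() < prevalence
        score = random.gauss(70 if at_risk else 40, 15)   # overlapping, made-up score distributions
        flagged = score >= threshold
        if flagged and at_risk:
            tp += 1
        elif flagged and not at_risk:
            fp += 1
        elif not flagged and at_risk:
            fn += 1
        else:
            tn += 1
    return {"threshold": threshold, "caught": tp, "missed (Type II)": fn, "wrongly flagged (Type I)": fp}

for threshold in (95, 90, 80, 60):
    print(simulate(threshold=threshold))

With an outcome this rare, even a far better model than this toy one will flag many families incorrectly for every tragedy it catches, which is exactly the tension the Illinois numbers exposed.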

On top of the high false-positive rate, the program also failed to predict two high-profile child deaths this year, those of 17-month-old Semaj Crosby and 22-month-old Itachi Boyle. As Director Walker put it, the predictive algorithm wasn’t predicting things. The impact of these Type II errors hardly needs to be spelled out.

Beyond the dilemmas with the algorithm itself, DCFS caseworkers also complained about the language of the alerts, which was harsh: “Please note that the two youngest children, ages 1 year and 4 years have been assigned a 99% probability by the Eckerd Rapid Safety Feedback metrics of serious harm or death in the next two years.” Eckerd acknowledged that the language could have been improved, which raises another topic: communicating data science findings well. The model might spit out a 99% probability, but when we’re dealing with such sensitive and emotional topics, the language of the alerts matters. Even if the numbers were entirely accurate, figuring out how to apply such technology in the actual practice of child protective services is another problem altogether.

When it comes to bringing data science tools like predictive software into government agencies, how small should these error rates be? How big is too big to implement? Is failing to predict one child death enough to render the software a failure, or is it better than having no software at all and missing more? Is accidentally devastating some families in the search for those who are actually mistreating their children worth saving those in need? Do the financial costs of the software outweigh the benefit of some predictive assistance, and if not, how do you measure the cost of losing a child? Is having the software helpful at all? As data science and analytics are applied more and more to social services, these are the questions many agencies will be trying to answer. And as more agencies look to take proactive, predictive steps to better protect children and families, these are the questions that data scientists should be tackling in order to better integrate these products into society.

 

References:

www.chicagotribune.com/news/watchdog/ct-dcfs-eckerd-met-20171206-story.html

www.chronicleofsocialchange.org/analysis/managing-flow-predictive-analytics-child-welfare

www.chronicleofsocialchange.org/featured/who-will-seize-the-child-abuse-prediction-market