Power to the (Facebook) users: is a tool that makes privacy policies easier to understand sufficient to help users understand the possible risks of data sharing?

Facebook and Cambridge Analytica have recently sparked major outrage, discussion, movements, and official investigations [1]. In a previous class blog post [2], author Kevin Foley aptly suggests that business incentives may have enabled this incident and that government regulators should work with companies to ensure privacy is respected. This made me wonder about the users themselves, who – from much of what I’ve read in the media – seem to be painted as helpless victims of this breach of trust. Should users rely solely on investors and the government, or should users themselves be able to take steps to better secure their privacy? Specifically, if users had read and better understood relevant documents, such as Facebook’s privacy policy, could they have understood how their data might be inappropriately collected by an organization that could build psychological profiles used to manipulate them?

Recognizing that privacy is “a secondary task – unless you’re a privacy professional, it’s never at the top of your mind” [3] and that “users generally do not read [privacy] policies and those who occasionally do struggle to understand what they read” [4], Carnegie Mellon professor Norman Sadeh developed the Usable Privacy Policy Project to make privacy policies easier to understand. The project used artificial intelligence to learn from training data – 115 privacy policies annotated by law students, essentially translated from legal jargon into simpler, more digestible language – to allow automated classification of privacy categories, reading level, and user-choice sections. The project exists as an online tool (https://explore.usableprivacy.org/) containing thousands of machine-annotated privacy policies. Reddit’s most current privacy policy from 2017, which I summarized for a previous assignment, was algorithmically classified as College (Grade 14) reading-level material and had mostly appropriate categories, with the corresponding sections highlighted in the text version of the policy [5]. The tool isn’t perfect, reportedly detecting relevant passages 79% of the time [6], but it points readers to areas of the text to search when they are interested in a topic, such as statements associated with “third-party sharing,” which could appear in multiple places in the document.
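To make the classification idea concrete, here is a toy sketch of the kind of passage classifier the project describes, not the project’s actual pipeline: a linear model over TF-IDF features that flags policy segments likely to discuss a category such as third-party sharing. The example segments, labels, and model choice below are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training segments; the real project trained on 115 privacy policies annotated by law students.
segments = [
    "We may share your information with advertising partners.",
    "You can delete your account at any time from your settings.",
    "Third-party services you connect may receive your profile data.",
    "We use cookies to keep you logged in.",
]
labels = [1, 0, 1, 0]  # 1 = third-party sharing, 0 = some other category

# TF-IDF features feeding a simple linear classifier.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(segments, labels)

new_segment = "Partners and developers may access information you share on the service."
# Estimated probability that the new passage concerns third-party sharing.
print(classifier.predict_proba([new_segment])[0][1])
```

A real system would train per-category classifiers on thousands of annotated segments and surface the highest-scoring passages to the reader, much as the tool highlights sections of Reddit’s and Facebook’s policies.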

The machine-annotated privacy policy for Facebook from late 2017 is also available and has multiple sections where third-party sharing is highlighted by the tool [7]. Reading through these non-summarized sections, users could understand that apps, services, and third-party integrations (subject to their own terms) could collect user information. The written examples suggest that active involvement on the user’s part could result in information sharing for *that* user (i.e., self-participation entailed consent), but I don’t think readers could reasonably expect that simple actions performed by their friends would allow their own data to be shared. This kind of sharing (user to app), albeit vaguely described and not necessarily expected by users, is part of Facebook’s permissive data policy. The violation behind the scandal was the inappropriate sharing of information from the app developer, Kogan, to a third party, Cambridge Analytica [8].

I have to agree with Foley that government regulators should be able to hold companies accountable for protecting user data. Even a good tool that makes privacy policies easier to understand can’t help users identify a risk if that risk (or the factors leading up to it: in this case, which data is being shared) isn’t sufficiently detailed in the document. The question then becomes: how should a privacy policy balance generality and specificity? How many relevant hypothetical examples and risks should be included, and what role (if any) should government regulators have in shaping an organization’s privacy policy?

On a more positive note, Facebook seems to be taking action to prevent misuse; for example, it now provides a link showing users which apps they use along with the information being shared with those apps [9]. A final question is left for the reader: as a user of any online service, how informed would you like to be about how your data is handled, and are there more effective ways to be informed than static privacy policy pages (e.g., flowcharts, embedded videos, interactive examples, short stories, separate features)?

[1] https://www.nytimes.com/2018/03/26/technology/ftc-facebook-investigation-cambridge-analytica.html

[2] https://blogs.ischool.berkeley.edu/w231/2018/03/27/why-regulation-of-handling-of-data-is-needed/

[3] https://www.fastcodesign.com/90164561/youre-never-going-to-read-that-privacy-policy-could-ai-do-it-for-you

[4] https://explore.usableprivacy.org/about/?view=machine

[5] https://explore.usableprivacy.org/reddit.com/?view=machine#

[6] https://motherboard.vice.com/en_us/article/a3yz4p/browser-plugin-to-read-privacy-policy-carnegie-mellon

[7] https://explore.usableprivacy.org/facebook.com/?view=machine#

[8] https://newsroom.fb.com/news/2018/03/suspending-cambridge-analytica/

[9] https://newsroom.fb.com/news/2018/04/restricting-data-access/

Algorithmic Discrimination and Equality of Opportunity

In recent years, organizations in both the public and private spheres have made widespread use of predictive analytics and machine learning for use cases such as college admissions, loan applications, airport screenings, and, of course, advertising. These applications not only promise speed and efficiency; they also carry an underlying assumption that decisions with social justice implications are best made by data-driven algorithms because algorithms are inherently impartial.

If only that were true. As it turns out, data is socially constructed, and inherits our human imperfections and biases with startling fidelity. So too do algorithms trained on these biased datasets, and the effects are very difficult to detect. Instead of curbing the potential for systemic discrimination against disadvantaged groups, many researchers believe that the use of algorithms has actually expanded it.

Consider the criminal justice system in the United States. In recent years, courts have been relying on a variety of third-party predictive algorithms to quantify the risk that a convicted criminal will commit a future crime (known as recidivism). Historically, judges have made these subjective determinations based on personal experience and professional expertise; the introduction of an objective, data-driven algorithm into these settings seems like a sensible thing to do. Indeed, it sounds like a marquee application for the field of machine learning.

Here’s the problem: in 2016, ProPublica published an analysis of the COMPAS Recidivism Risk Score algorithm showing that it was racially discriminatory towards Black defendants. According to the article, Black defendants were “77 percent more likely to be pegged as at higher risk of committing a future violent crime and 45 percent more likely to be predicted to commit a future crime of any kind”. Given that these risk scores were being shown to judges minutes before they presided over sentencing hearings, the implications are quite troubling; as of 2016, at least 9 US state courts were actively using COMPAS.

COMPAS is a proprietary algorithm and its publisher has declined to release the exact model specification; however, we do know that it is based on a questionnaire that includes criminal history as well as a set of behavioral questions. For example, it asks defendants questions such as “How many of your friends/acquaintances are taking drugs illegally?” and “How often did you get in fights at school?”. It also asks defendants to agree or disagree with statements such as “A hungry person has a right to steal” and “When people do minor offenses or use drugs they don’t hurt anyone except themselves”.

Notably, race is not referenced in the questionnaire; however, that does not mean it isn’t correlated with the questions above. These hidden correlations allow race to influence the model just as effectively as if it were included as an explicit variable. Predictive models that are “race-blind” are simply blind to the fact that they are, nonetheless, deeply racist.
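The proxy effect is easy to demonstrate on synthetic data. The sketch below does not use COMPAS data; every variable, coefficient, and labeling rule is invented. The model never sees the race column, yet its predictions still differ sharply by race because a correlated feature (here, a stand-in “neighborhood” variable) carries the same information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

race = rng.integers(0, 2, n)                                        # protected attribute (never given to the model)
neighborhood = (race + rng.normal(0, 0.5, n) > 0.5).astype(float)   # proxy strongly correlated with race
prior_arrests = rng.poisson(1.0, n)                                 # outcome-relevant feature, independent of race here

# Historical labels are biased: race shifts the label even though it shouldn't.
label = ((prior_arrests + race + rng.normal(0, 1, n)) > 2).astype(int)

X_blind = np.column_stack([neighborhood, prior_arrests])            # "race-blind" feature matrix
model = LogisticRegression().fit(X_blind, label)
pred = model.predict(X_blind)

print("predicted positive rate, group 0:", pred[race == 0].mean())
print("predicted positive rate, group 1:", pred[race == 1].mean())  # noticeably higher despite "blindness"
```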

One might object here on philosophical grounds. After all, if a “protected” attribute is truly correlated with an important outcome, then by definition, the rest of us are worse off for not being able to take advantage of this information. But the fundamental principle behind establishing “protected attributes” is the axiomatic notion that most observed differences among racial, ethnic, and gender groups are either the result of historical imbalances in opportunity or of reduced data quality due to smaller sample sizes. Absent those imbalances, we posit that we would not see, for example, differences in high school completion rates between White and Black students, or in SAT scores between Hispanic and Asian students. By reading into these differences and using them to make important decisions, we perpetuate the cycle of unfairness.

Thus far, we have seen that simply ignoring protected attributes is not a viable strategy for guarding against discrimination in our algorithms. An alternative is to control for the effects by establishing different threshold criteria for various disadvantaged groups. For example, a bank granting loan applications on the basis of a predictive model may notice that Hispanic applicants have an average score of 45 (on a hypothetical scale of 1-100), whereas White applicants have an average score of 52. With a single approval cutoff, it observes that, on average, White applicants are approved 62% of the time, whereas Hispanic applicants receive a loan in only 48% of cases.

In this case, the bank can curb the discriminatory behavior of the algorithm by adjusting the decision-making threshold based on demographic criteria so as to bring the two acceptance rates into alignment. In a sense, this is reverse discrimination, but with the explicit intent of harmonizing acceptance rates between the two populations.
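As a hypothetical sketch of that adjustment (all numbers below are synthetic, chosen only to roughly echo the 62% versus 48% gap described above), the bank could lower the cutoff for the disadvantaged group until the two approval rates match:

```python
import numpy as np

rng = np.random.default_rng(1)
scores_white = rng.normal(52, 15, 10_000).clip(1, 100)     # hypothetical score distributions
scores_hispanic = rng.normal(45, 15, 10_000).clip(1, 100)

single_cutoff = 47.0
print("single-cutoff approval rates:",
      (scores_white >= single_cutoff).mean(),               # higher
      (scores_hispanic >= single_cutoff).mean())            # lower

# Demographic-parity style fix: pick a lower cutoff for the disadvantaged group
# so that its approval rate matches the other group's.
target_rate = (scores_white >= single_cutoff).mean()
hispanic_cutoff = np.quantile(scores_hispanic, 1 - target_rate)

print("group-specific cutoffs:", single_cutoff, round(float(hispanic_cutoff), 1))
print("approval rates after adjustment:",
      (scores_white >= single_cutoff).mean(),
      (scores_hispanic >= hispanic_cutoff).mean())          # now approximately equal
```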

Yet there are problems with this approach. First, there is the obvious fact that acceptance criteria vary based on a protected attribute; i.e., all other things being equal, under this scenario a less qualified Hispanic applicant has the same chance of getting a loan as a more qualified White applicant (due to the manipulation of thresholds). Moreover, there is a significant cost borne by society or by the private enterprise in deviating from the “optimal solution”, which in this fictional scenario would accept White applications at a higher rate than Hispanic ones. Can we do better than this?

It turns out we can. In a paper titled Equality of Opportunity in Supervised Learning, researchers Moritz Hardt, Eric Price, and Nathan Srebro propose a framework for post-processing any learned predictor to satisfy a condition known as “equality of opportunity”. Equality of opportunity (illustrated here) is the idea that, instead of harmonizing acceptance rates by demographic across the entire body of loan applicants (the “positive rate”), we only need to ensure that the acceptance rate is equal for applicants who would actually pay back a loan (the “true positive rate”).
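Here is a minimal sketch of what that post-processing could look like on synthetic loan data (an illustration of the idea, not the authors’ exact procedure for choosing optimal thresholds): pick a separate cutoff for each group so that the true positive rate, i.e., the approval rate among applicants who would actually repay, is the same for both groups.

```python
import numpy as np

def threshold_for_tpr(scores, repaid, target_tpr):
    # Cutoff such that roughly target_tpr of the actual repayers score at or above it.
    return np.quantile(scores[repaid == 1], 1 - target_tpr)

rng = np.random.default_rng(2)
n = 10_000
group = rng.integers(0, 2, n)                                   # 0 and 1 are illustrative demographic groups
repaid = rng.binomial(1, 0.7, n)                                # ground truth: would the applicant repay?
scores = 50 + 10 * repaid - 5 * group + rng.normal(0, 10, n)    # a score that is biased against group 1

target_tpr = 0.80
cutoffs = {g: threshold_for_tpr(scores[group == g], repaid[group == g], target_tpr)
           for g in (0, 1)}
approved = scores >= np.where(group == 0, cutoffs[0], cutoffs[1])

for g in (0, 1):
    repayers = (group == g) & (repaid == 1)
    print(f"group {g}: true positive rate = {approved[repayers].mean():.2f}")  # both close to 0.80
```

Note that the overall approval rates can still differ between groups; only the rate among creditworthy applicants is equalized, which is exactly the distinction between demographic parity and equality of opportunity.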

Equality of opportunity provides an interesting alternative to the status quo. The best feature is that it is a post-processing step, meaning that the principle can be applied to existing predictive models. This is especially important in the scenario where organizations do not have access to the source code for a third-party algorithm, but still need to make use of it. It will be interesting to see how institutions will come to adopt equality of opportunity in the months and years ahead.

Deep Dream Vagine Machine – Parallels between Sexual Consent and Consent for Data Usage

This is going to be a relatively intimate post for this blog. It does, however, involve data and machine learning in different forms, as well as consent and the ethics of data use, with these concepts brought down to ground level in the form of an art project that, over time, raised questions about the collection and use of people’s private data, contextual integrity, and the need for an ongoing conversation about consent as context changes.

A few years ago, I became aware of a neural network technique known as ‘deep style transfer’ that allowed a user to transform an image using the artistic style of another.
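For readers curious about the mechanics, here is a minimal sketch of the classic Gatys-style optimization behind such tools, assuming PyTorch and torchvision are installed. The layer choices, loss weights, iteration count, and file names are illustrative assumptions, not the settings of the specific tool I used.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

preprocess = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(256), transforms.ToTensor()])

def load(path):
    return preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

def features(x, layers=("0", "5", "10", "19", "28")):   # early-to-late VGG conv layers
    out = {}
    for name, layer in vgg._modules.items():
        x = layer(x)
        if name in layers:
            out[name] = x
    return out

def gram(t):                                            # style is captured by feature correlations
    _, c, h, w = t.shape
    f = t.view(c, h * w)
    return f @ f.t() / (c * h * w)

content = load("content.jpg")                           # hypothetical file names
style = load("style.jpg")
target = content.clone().requires_grad_(True)

content_feats = features(content)
style_grams = {k: gram(v) for k, v in features(style).items()}
optimizer = torch.optim.Adam([target], lr=0.02)

for step in range(300):
    feats = features(target)
    content_loss = F.mse_loss(feats["19"], content_feats["19"])
    style_loss = sum(F.mse_loss(gram(feats[k]), style_grams[k]) for k in style_grams)
    loss = content_loss + 1e4 * style_loss              # relative weight is an illustrative choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    target.data.clamp_(0, 1)                            # keep pixel values in a displayable range
```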

I was consumed by the infinite combinations of stylizing everything, but realized that transforming my own selfies was not a sustainable hobby. I needed more content.  Something unexpected, challenging to obtain, with texture and variety.

To make this fun, I crowdsourced from Facebook friends.   “If anyone wants to send me vagina pics (your own vagina only), I will do something fun with it,”  I posted. Soon after this invitation, I received my first anonymous vagina photo. I did not need to know the identity of the women, so long as they were consenting adults (my audience for the post was a limited portion of my friend list, and I don’t have any friends under 18).

So began my year-long journey of making stylized vagina portraits for friends, a project now titled “Dream Vagine.” I sent them the results and requested consent to post the images anonymously on my Instagram. As I received support, people asked if I would show the pieces in a gallery, sell them, and maybe make lots of money! To avoid getting into the weeds about my philosophy on art, I will jump straight to my discomfort with how I would hypothetically be handling the privacy of these women if I were to take up that offer. It would not feel right to me to take these women’s images and post them in a gallery with the intention of profiting. Since my own photo was also part of the project, it was easy to put myself in the shoes of any of the other women. To put this dryly into Mulligan’s framework of the dimensions of privacy:

  • Dimensions of theory
    • object: dignity, control over personal information
    • justification: to protect from social or professional consequence as well as undesired use of image
    • exemplar: sexual shaming, or unwanted use of private photos
  • Dimensions of protection
    • target: body images
    • subject: women in the photos
  • Dimensions of harm
    • action: use or distribution of the photo without consent of the subject
    • offender: me or anyone on the internet
    • from-whom: everyone, employers, family, friends
  • Dimensions of provision
    • mechanism: communication, social norms
    • provider: me
  • Dimensions of scope
    • social boundaries: Instagram, anywhere the photos are stored or displayed
    • temporal scale: indefinitely until otherwise communicated
    • quantitative scope: per-case

To summarize, a woman whose photo was submitted to me for this project would want to protect herself from the use of the photo in an undesired manner, whether by me or by other people. The act of publishing the stylized photos was a way to exhibit a woman’s photo under consensual circumstances, in contrast to the stories we often hear of people getting hacked and having their nude photos exposed without their consent. Even though this project was a form of protest against the cliché outcome of having a nude photo shared without consent, technically there was still a risk of misuse within the project. I wanted to avoid that, but how would I know if I was misusing the images?

It was important to consider the context in which the images were used or displayed. When the context of data is static, the worry about loss of control is minimized: the image is used in one place, for one purpose, at one time, for a set audience. A change of context is precisely what many of the women worried about, and in this project the context changed every time an image was displayed. So, if I wasn’t going to maintain contextual integrity, how could I maintain the comfort of the women? How would I know what they were comfortable with? The only way to know was to ask each woman and keep communicating. If I continued to communicate with them about how their images would be used, I could obtain consent again at every layer of context as it changed. This allowed me to progress the project to the point that it could be displayed at a gallery. Further, if a viewer wanted to request a print, they could submit a request that would be relayed to the woman, who would then consent to the purchase of her image by each individual.

As this story unfolded, it paralleled the widespread sharing of sexual assault stories through social media. The vagina portraits became more of a way to draw viewers in to discuss the topic of consent in a sexual context in a different way: sexual communication could be fun and not just some kind of legal obstacle. The request for consent is, in fact, the coupled expression of desire and respect: “I want you, but more importantly, I respect you.”

This particular conversation about sexual consent stacked perfectly onto the conversation about consent for use of the visual information in these private photos. The portraits themselves were no longer the focus of the art; they were simply a by-product, a manifestation of the wonderful and unique things that can come from a woman’s consent to the use of her body and the use of her information.