Archive for April, 2018

The four teams in CTSP’s Facebook-sponsored Data for Good Competition will be presenting today in CITRIS and CTSP’s Tech & Data for Good Showcase Day. The event will be streamed through Facebook Live on the CTSP Facebook page. After deliberations from the judges, the top team will receive $5000 and the runner-up will receive $2000.



Data for Good Judges:

Joy Bonaguro, Chief Data Officer, City and County of San Francisco

Joy Bonaguro is the first Chief Data Officer for the City and County of San Francisco, where she manages the City’s open data program. Joy has spent more than a decade working at the nexus of public policy, data, and technology. Joy earned her Masters from UC Berkeley’s Goldman School of Public Policy, where she focused on IT policy.

Lisa García Bedolla, Professor, UC Berkeley Graduate School of Education and Director of UC Berkeley’s Institute of Governmental Studies

Professor Lisa García Bedolla is a Professor in the Graduate School of Education and Director of the Institute of Governmental Studies. Professor García Bedolla uses the tools of social science to reveal the causes of political and economic inequalities in the United States. Her current projects include the development of a multi-dimensional data system, called Data for Social Good, that can be used to track and improve organizing efforts on the ground to empower low-income communities of color. Professor García Bedolla earned her PhD in political science from Yale University and her BA in Latin American Studies and Comparative Literature from UC Berkeley.

Chaya Nayak, Research Manager, Public Policy, Data for Good at Facebook

Chaya Nayak is a Public Policy Research Manager at Facebook, where she leads Facebook’s Data for Good Initiative around how to use data to generate positive social impact and address policy issues. Chaya received a Masters of Public Policy from the Goldman School of Public Policy at UC Berkeley, where she focused on the intersection between Public Policy, Technology, and Utilizing Data for Social Impact.

Michael Valle, Manager, Technology Policy and Planning for California’s Office of Statewide Health Planning and Development

Michael D. Valle is Manager of Technology Policy and Planning at the California Office of Statewide Health Planning and Development, where he oversees the digital product portfolio. Michael has worked since 2009 in various roles within the California Health and Human Services Agency. In 2014 he helped launch the first statewide health open data portal in California. Michael also serves as Adjunct Professor of Political Science at American River College.


As detailed in the call for proposals, the teams will be judged on the quality of their application of data science skills, how well the proposal or project addresses a social good problem, how it advances the use of public open data, and how it mitigates potential pitfalls.

Please feel free to check us out:

Master of Information and Data Science student Daniel Kent is currently participating in the LAUNCH — UC Startup Accelerator for Jobwell, a job search organizer that allows jobseekers to take control of their job search. Kent tells us more as the team prepares for LAUNCH Demo Day, where they will present their product to investors and VCs.


Please join us for the last NLP Seminar of the semester on Monday, April 30, at 4:00pm in 202 South Hall. All are welcome!

Speaker: Marilyn Walker (UCSC)

Title: Modeling Narrative Structure in Informal First-Person Narratives


Many genres of natural language text are narratively structured, reflecting the human bias towards organizing our experiences as narratives. Understanding narrative structure in full requires many discourse-level NLP components, including modeling the motivations, goals and desires of the protagonists, modeling the affect states of the protagonists and their transitions across story timepoints, and modeling the causal links between story events. This talk will focus on our recent work on modeling first-person participant goals and desires and their outcomes. I describe DesireDB, a collection of personal first-person stories from the Spinn3r corpus, which are annotated for statements of desire, textual evidence for desire fulfillment, and whether the stated desire is fulfilled given the evidence in the narrative context. I will describe experiments on tracking desire fulfillment using different methods, and show that an LSTM Skip-Thought model using the context both before and after the desire statement achieves an F-Measure of 0.7 on the corpus. I will also briefly discuss our work on modeling affect states and causal links between story events on the same corpus of informal stories.

The presented work was jointly conducted with Elahe Rahimtoroghi, Jiaqi Wu, Pranav Anand, Ruimin Wang, Lena Reed and Shereen Oraby.


Marilyn Walker is a Professor of Computer Science at UC Santa Cruz and a fellow of the Association for Computational Linguistics (ACL), in recognition of her fundamental contributions to statistical methods for dialog optimization, to centering theory, and to expressive generation for dialog. Her current research includes work on computational models of dialogue interaction and conversational agents, analysis of affect, sarcasm and other social phenomena in social media dialogue, acquiring causal knowledge from text, conversational summarization, interactive story and narrative generation, and statistical methods for training the dialogue manager and the language generation engine for dialogue systems.

Before coming to Santa Cruz in 2009, Walker was a professor of computer science at the University of Sheffield. From 1996 to 2003, she was a principal member of the research staff at AT&T Bell Labs and AT&T Research, where she worked on the AT&T Communicator project, developing a new architecture for spoken dialogue systems and statistical methods for dialogue management and generation. Walker has published more than 200 papers and has more than 10 U.S. patents granted. She earned an M.S. in computer science at Stanford University, and a Ph.D. in computer science at the University of Pennsylvania.


Noura Howell, Laura Devendorf, Tomás Alfonso Vega Gálvez, Rundong Tian, Kimiko Ryokai


Biosensing displays, increasingly enrolled in emotional reflection, promise authoritative insight by presenting users’ emotions as discrete categories. Rather than machines interpreting emotions, we sought to explore an alternative with emotional biosensing displays in which users formed their own interpretations and felt comfortable critiquing the display. So, we designed, implemented, and deployed, as a technology probe, an emotional biosensory display: Ripple is a shirt whose pattern changes color responding to the wearer’s skin conductance, which is associated with excitement. 17 participants wore Ripple over 2 days of daily life. While some participants appreciated the ‘physical connection’ Ripple provided between body and emotion, for others Ripple fostered insecurities about ‘how much’ feeling they had. Despite our design intentions, we found participants rarely questioned the display’s relation to their feelings. Using biopolitics to speculate on Ripple’s surprising authority, we highlight ethical stakes of biosensory representations for sense of self and ways of feeling.


Facebook and Cambridge Analytica have recently sparked major outrages, discussions, movements, and official investigations [1]. In a previous class blog post [2], author Kevin Foley aptly suggests that business incentives may have enabled this incident and that government regulators should work with companies to ensure privacy is respected. This made me wonder about the users themselves, who – from much of what I’ve read in the media – seem to be painted as helpless victims of this breach in trust. Should users rely solely on investors and the government, or should the users themselves be able to take steps to better secure their privacy? Specifically, if users read and better understood relevant documents, such as Facebook’s privacy policy, could they have understood how their data could be inappropriately collected by an organization that could create psychological profiles used to manipulate them?

Recognizing that privacy is “a secondary task – unless you’re a privacy professional, it’s never at the top of your mind” [3] and that “users generally do not read [privacy] policies and those who occasionally do struggle to understand what they read” [4], Professor Norman Sadeh at Carnegie Mellon developed the Usable Privacy Policy Project to make privacy policies easier to understand. The project involved using Artificial Intelligence to learn from training data – 115 privacy policies annotated by law students, essentially transformed from legal jargon to simpler, more digestible language – to allow for automated classification of privacy categories, reading level, and user choice sections. The project exists as an online tool and contains thousands of machine-annotated privacy policies. Reddit’s most current privacy policy from 2017, which I summarized for a previous assignment, was algorithmically classified as College (Grade 14) reading level material and had mostly appropriate categories, with corresponding sections in the text version of the privacy policy highlighted [5]. The tool isn’t perfect, boasting 79% for relevant passage detection [6], but it provides areas to search in the text if you are interested in a topic, such as statements associated with “third-party sharing,” which could exist in multiple places in the document.

The machine-annotated privacy policy for Facebook from late 2017 is also available and has multiple sections where third-party sharing is highlighted by the tool [7]. Reading through these non-summarized sections, users could understand that apps, services, and third-party integrations (subject to their own terms) could collect user information. The written examples suggest active involvement on the user’s part could result in information sharing for *that* user (i.e. self-participation entailed consent), but I don’t think readers could reasonably expect that simple actions their friends perform would allow their own data to be shared. This kind of sharing (user to app), albeit vaguely described and not necessarily expected by users, is part of Facebook’s permissive data policy. The violation behind the scandal was inappropriate information sharing from the app developer, Kogan, to a third-party, Cambridge Analytica [8].

I have to agree with Foley that government regulators should be able to hold companies accountable for protecting user data. Even a good tool that makes it easier to understand privacy policies can’t help users identify a risk if that risk (or the factors leading up to it: which data is being shared, in this case) isn’t sufficiently detailed in the document. The question then becomes how balanced a privacy policy should be, in terms of generality and specificity. How many relevant hypothetical examples and risks should be included, and what role (if any) should government regulators have in shaping an organization’s privacy policy?

On a more positive note, Facebook seems to be taking action to prevent misuse; for example, they now provide a link showing users which apps they use along with the information being shared with those apps [9]. A final question is left for the reader: as a user of any online service, how informed would you like to be about how your data is handled, and are there more effective methods to be informed than static privacy policy pages (e.g. flowcharts, embedded videos, interactive examples, short stories, separate features)?










As of 2018, Covered California has served over 3.4 million consumers since its creation in 2014. Much of the administration of one’s account is available online; the service has a chat functionality, but it is restricted to queries that are not case specific.

In situations when it’s necessary to speak to a representative, it can be difficult to navigate the phone tree that you reach when you call their 1-800 contact number.

I spoke with Rebecca, a customer care representative, who gave me a strategy for navigating the tree to reach a live human.

After you dial the 1-800 number, she said:

To get through to a representative, select option #1 for English, then select option #2 for help with an online account. At this point, you’ll need to wait until the recording finishes, then select option #0 to indicate that none of the information in the recording helped and that you need to talk to a representative.

Hope this is useful!

Please join us for our NLP Seminar next Monday, April 16, at 4:00pm in 202 South Hall.

Speaker: Amber Boydstun (Associate Professor of Political Science, UC Davis)

Title: How Surges in Dominant Media Narratives Move Public Opinion


Studies examining the potential effects of media coverage on public attitudes toward policy issues (e.g., abortion, capital punishment) have identified three variables that, depending on the issue, can wield significant influence: the tone of the coverage (positive/negative/neutral), the frames used (e.g., discussing the issue from an economic vs. a moral perspective), and the overall level of media attention to the issue. Yet, to date, no study has examined all three variables in combination. We fill this gap by building a theoretical argument for why, despite the important variance across different issues, in general a single measure should be able to predict significant shifts in public opinion: surges in media attention to “dominant media narratives,” or stories that consistently frame the issue the same way (e.g., economic) using the same tone (e.g., anti-immigration) relative to other competing narratives. We test this hypothesis in U.S. newspaper coverage of three very different policy issues—immigration, same-sex marriage, and gun control—from 1992 to 2012. We use manual content analysis linked with computational modeling, tracking tone (pro/anti/neutral), emphasis frames (e.g., economic, morality), and overall levels of attention. Using time series analysis of public opinion data, we show that, for all three issues, previous surges in dominant media narratives significantly shape opinion. In short, when media coverage converges around a unified way of describing a policy issue, the public tends to follow. Our study adds to the fields of political communication and public opinion and marks an advance in computational text analysis methods. (Joint work with Dallas Card and Noah Smith)

In recent years, organizations in both the public and private sphere have made widespread use of predictive analytics and machine learning for use cases such as college admissions, loan applications, airport screenings, and of course, advertising. These applications drive speed and efficiency, and they also rest on an underlying assumption: that decisions with social justice implications are best made by data-driven algorithms, because they are inherently impartial.

If only that were true. As it turns out, data is socially constructed, and inherits our human imperfections and biases with startling fidelity. So too do algorithms trained on these biased datasets, and the effects are very difficult to detect. Instead of curbing the potential for systemic discrimination against disadvantaged groups, many researchers believe that the use of algorithms has actually expanded it.

Consider the criminal justice system in the United States. In recent years, courts have been relying on a variety of third-party predictive algorithms to quantify the risk that a convicted criminal will commit a future crime (known as recidivism). Historically, judges have made these subjective determinations based on personal experience and professional expertise; the introduction of an objective, data-driven algorithm into these settings seems like a sensible thing to do. Indeed, it sounds like a marquee application for the field of machine learning.

Here’s the problem: in 2016, ProPublica published an analysis of the COMPAS Recidivism Risk Score algorithm showing that it was racially discriminatory towards Black defendants. According to the article, Black defendants were “77 percent more likely to be pegged as at higher risk of committing a future violent crime and 45 percent more likely to be predicted to commit a future crime of any kind”. Given that these risk scores were being shown to judges minutes before they presided over sentencing hearings, the implications are quite troubling; as of 2016, at least 9 US state courts were actively using COMPAS.

COMPAS is a proprietary algorithm and its publisher has declined to release the exact model specification; however, we do know that it is based on a questionnaire that includes criminal history as well as a set of behavioral questions. For example, it asks defendants questions such as “How many of your friends/acquaintances are taking drugs illegally?” and “How often did you get in fights at school?”. It also asks defendants to agree or disagree with statements such as “A hungry person has a right to steal” and “When people do minor offenses or use drugs they don’t hurt anyone except themselves”.

Notably, race is not referenced in the questionnaire; however, that does not mean it isn’t correlated with the above questions. These hidden correlations allow race to influence the model just as effectively as if it were included as an explicit variable. Predictive models that are “race-blinded” are simply blind to the fact that they are, in fact, deeply racist.
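To see how a proxy can carry group information into a model that never sees the group label, here is a minimal sketch in Python. Everything in it is made up for illustration: the two groups, the yes/no questionnaire answer, and the scoring rule are assumptions, not anything taken from COMPAS.

```python
from collections import defaultdict

# Hypothetical records: (group, proxy). "proxy" stands in for a yes/no
# questionnaire answer that happens to be more common in group B.
RECORDS = (
    [("A", 0)] * 8 + [("A", 1)] * 2 +
    [("B", 0)] * 3 + [("B", 1)] * 7
)


def blind_score(proxy):
    """A 'blinded' risk score: it only ever sees the proxy answer."""
    return 7 if proxy == 1 else 3


def mean_score_by_group(records):
    """Average the blinded score separately for each group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [score sum, count]
    for group, proxy in records:
        totals[group][0] += blind_score(proxy)
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


means = mean_score_by_group(RECORDS)
print(means)
```

The scorer never reads the group label, yet group B ends up with a higher average score simply because the proxy answer is more common there.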

One might object here on philosophical grounds. After all, if a “protected” attribute is truly correlated with an important outcome, then by definition, the rest of us are worse off for not being able to take advantage of this information. The fundamental principle behind establishing “protected attributes” is the axiomatic notion that most observed differences between racial/ethnic/gender groups are either the result of historical imbalances in opportunity, or reduced data quality due to smaller sample size. Absent those imbalances, we posit that we would not see, for example, differences in high school completion rates between White and Black students, or in SAT scores between Hispanic and Asian students. By reading into these differences and using them to make important decisions, we are perpetuating the cycle of unfairness.

Thus far, we have seen that simply ignoring protected attributes is not a viable strategy for guarding against discrimination in our algorithms. An alternative is to control for the effects by establishing different threshold-criteria for various disadvantaged groups. For example, a bank granting loan applications on the basis of a predictive model may notice that Hispanic applicants have an average score of 45 (on a hypothetical scale of 1-100), whereas White applicants have an average score of 52. As a result, it observes that, on average, White applicants are approved 62% of the time, whereas Hispanic applicants only receive a loan in 48% of cases.

In this case, the bank can curb the discriminatory behavior of the algorithm by adjusting the decision-making threshold based on demographic criteria so as to bring the two acceptance rates into alignment. In a sense, this is reverse discrimination, but with the explicit intent to harmonize acceptance rates among the two populations.
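The threshold adjustment described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the scores, the neutral group names `group_a` and `group_b`, the shared cutoff of 50, and the 50% target rate are all hypothetical, and the helper functions are not from any library.

```python
def acceptance_rate(scores, threshold):
    """Fraction of applicants whose score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_for_rate(scores, target_rate):
    """Threshold that accepts the top round(target_rate * n) scores.

    Tied scores at the cutoff would admit a few more applicants.
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))  # applicants to accept
    return ranked[k - 1]


# Hypothetical model scores on a 1-100 scale.
SCORES = {
    "group_a": [30, 38, 44, 49, 52, 55, 58, 61, 66, 72],
    "group_b": [25, 31, 36, 40, 43, 46, 50, 54, 59, 65],
}

# One shared threshold produces unequal acceptance rates.
shared = {g: acceptance_rate(s, 50) for g, s in SCORES.items()}

# Per-group thresholds chosen to hit the same target rate.
thresholds = {g: threshold_for_rate(s, 0.5) for g, s in SCORES.items()}
adjusted = {g: acceptance_rate(s, thresholds[g]) for g, s in SCORES.items()}

print(shared, thresholds, adjusted)
```

With these made-up numbers, a shared cutoff of 50 accepts 60% of one group and 40% of the other, while the per-group cutoffs bring both to 50%. The price is that the two groups are now held to different scores.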

Yet there are problems with this approach. First, there is the obvious fact that acceptance criteria vary based on a protected attribute, i.e., all other things being equal, under this scenario a less qualified Hispanic applicant has the same chance at getting a loan as a more qualified White applicant (due to the manipulation of thresholds). Moreover, there is a significant cost borne by society or the private enterprise by deviating from the “optimal solution”, which in this fictional scenario would accept White applications at a higher rate than Hispanic ones. Can we do better than this?

It turns out we can. In a paper titled Equality of Opportunity in Supervised Learning, researchers Moritz Hardt, Eric Price, and Nathan Srebro propose a framework for post-processing any learned predictor to apply a condition known as “equality of opportunity”. Equality of opportunity (illustrated here) is the idea that, instead of harmonizing acceptance rates by demographic across the entire body of loan applicants (“positive rate”), we only need to ensure that the acceptance rate is equal for applicants who would actually pay back a loan (“true positive rate”).
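As a rough illustration of the idea, the sketch below picks a separate threshold for each group so that the true positive rate, the share of would-be repayers who get accepted, is the same in both. It is a simplification and not the authors' exact procedure: the data, the group names, the 0.8 target, and the helper functions are all hypothetical.

```python
def true_positive_rate(applicants, threshold):
    """Share of applicants who repaid (label 1) that the threshold accepts."""
    repaid_scores = [score for score, repaid in applicants if repaid]
    return sum(s >= threshold for s in repaid_scores) / len(repaid_scores)


def threshold_for_tpr(applicants, target_tpr):
    """Highest observed score that, used as a threshold, reaches target_tpr."""
    for t in sorted({score for score, _ in applicants}, reverse=True):
        if true_positive_rate(applicants, t) >= target_tpr:
            return t
    raise ValueError("target_tpr cannot be reached")


# Hypothetical (score, repaid) pairs; repaid is 1 if the loan was paid back.
GROUP_A = [(72, 1), (66, 1), (61, 0), (58, 1), (55, 1),
           (52, 0), (49, 1), (44, 0), (38, 0), (30, 0)]
GROUP_B = [(65, 1), (59, 1), (54, 0), (50, 1), (46, 1),
           (43, 0), (40, 1), (36, 0), (31, 0), (25, 0)]

# A shared threshold of 50 gives the two groups different true positive rates.
shared_tpr = (true_positive_rate(GROUP_A, 50), true_positive_rate(GROUP_B, 50))

# Post-processing step: per-group thresholds that equalize the TPR.
t_a = threshold_for_tpr(GROUP_A, 0.8)
t_b = threshold_for_tpr(GROUP_B, 0.8)
equal_tpr = (true_positive_rate(GROUP_A, t_a), true_positive_rate(GROUP_B, t_b))

print(shared_tpr, (t_a, t_b), equal_tpr)
```

Note that only the scores and the observed repayment outcomes are needed here; the model that produced the scores is treated as a black box.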

Equality of opportunity provides an interesting alternative to the status quo. The best feature is that it is a post-processing step, meaning that the principle can be applied to existing predictive models. This is especially important in the scenario where organizations do not have access to the source code for a third-party algorithm, but still need to make use of it. It will be interesting to see how institutions will come to adopt equality of opportunity in the months and years ahead.

This is going to be a relatively intimate post for this blog. It does, however, involve data and machine learning in different forms, as well as consent and the ethics of how data is used. These concepts are brought down to ground level in an art project that, over time, raised questions about the collection and use of people’s private data, contextual integrity, and the need for an ongoing conversation about consent as context changes.

A few years ago, I became aware of a neural network known as ‘deep style transfer’ that allowed a user to transform an image using the artistic style from another.

I was consumed by the infinite combinations of stylizing everything, but realized that transforming my own selfies was not a sustainable hobby. I needed more content.  Something unexpected, challenging to obtain, with texture and variety.

To make this fun, I crowdsourced from Facebook friends.   “If anyone wants to send me vagina pics (your own vagina only), I will do something fun with it,”  I posted. Soon after this invitation, I received my first anonymous vagina photo. I did not need to know the identity of the women, so long as they were consenting adults (my audience for the post was a limited portion of my friend list, and I don’t have any friends under 18).

So began my year-long journey of making stylized vagina portraits for friends, a project now titled “Dream Vagine.” I sent them the results and requested consent to post the images anonymously on my Instagram. As I received support, people asked if I would show them in a gallery, sell them, and maybe make lots of money! To avoid getting in the weeds about my philosophy on art, I will jump straight to my discomfort about how I would hypothetically be handling the privacy of these women if I were to take up that offer. It would not feel right to me to take these women’s images and post them in a gallery with the intention of profiting. Since my photo was also part of the project, it was easy to put myself into the shoes of any other woman. To put this dryly into Mulligan’s framework of the dimensions of privacy:

  • Dimensions of theory
    • object: dignity, control over personal information
    • justification: to protect from social or professional consequence as well as undesired use of image
    • exemplar: sexual shaming, or unwanted use of private photos
  • Dimensions of protection
    • target: body images
    • subject: women in the photos
  • Dimensions of harm
    • action: use or distribution of the photo without consent of the subject
    • offender: me or anyone on the internet
    • from-whom: everyone, employers, family, friends
  • Dimensions of provision
    • mechanism: communication, social norms
    • provider: me
  • Dimensions of scope
    • social boundaries: Instagram, anywhere the photos are stored or displayed
    • temporal scale: indefinitely until otherwise communicated
    • quantitative scope: per-case

To summarize, a woman whose photo was submitted to me for this project would want to protect herself from the use of the photo in an undesired manner, by me or by other people. The act of publishing the stylized photos was a way to exhibit a woman’s photo under consensual circumstances, in contrast to the stories we often hear of people getting hacked and having their nude photos exposed without their consent. Even though this project was a form of protest against the cliche outcome of having a nude photo shared without consent, technically, there is still risk of misuse within the project. I wanted to avoid that, but how would I know if I was misusing the images?

It was important to consider the context in which the images were used or displayed.  When the contextual integrity of data is static, the worry about loss of control is minimized. The image is used in one place, for one purpose, at one time, for a set audience.  The change of context is precisely what many of the women worried about. In this project, the context changed every time the image was displayed. So, if I wasn’t going to maintain contextual integrity, how could I maintain the comfort of the women?  How would I know what they were comfortable with? This was possible by maintaining communication. The only way to know was to ask each woman. If I continued to communicate with them about how their image would be used, I could obtain consent again, at every layer of context, as it changed. This allowed me to progress the project to the point that it could be displayed at a gallery.  Further, if a viewer wanted to request a print, they could submit a request that would be relayed to the woman who would then consent to the purchase of her image by each individual.

As this story unfolded, it paralleled the widespread sharing of sexual assault stories through social media. The vagina portraits became more of a lure to draw viewers into discussing consent in a sexual context in a different way. Sexual communication could be fun and not just some kind of legal obstacle. The request for consent is, actually, the coupled expression of desire and respect. “I want you, but more importantly, I respect you.”

This particular sexual consent conversation stacked perfectly on to the consent conversation for use of the visual information in these private photos. The portraits themselves were no longer the focus of the art; they were simply a by-product and a manifestation of the wonderful and unique things that can come from the consent of a woman for use of her body, and use of her information.