Archive for November, 2017

Data Breaches

November 23rd, 2017

According to the United States Government, “A data breach is a security violation in which sensitive, protected or confidential data is copied, transmitted, viewed, stolen or used by an individual unauthorized to do so.”  The news has been filled with massive company data breaches involving customer and employee information.

Notification Laws: Every state in the U.S., with the exception of Alabama and South Dakota, has a data breach notification law in place. The National Conference of State Legislatures has a link to all the different state laws so you can see what your state requires. Keeping track of all these laws can be very confusing, and that is before counting the international laws that apply to multinational corporations. Currently, there is no federal law that covers general personal information data breaches. Both the Data Security and Breach Notification Act of 2015 and the Personal Data Notification and Protection Act of 2017 have been introduced in the House of Representatives, but that is as far as they got. For health information specifically, two rules at the federal level cover notification to those affected: the Health Breach Notification Rule and the HIPAA Breach Notification Rule.

Data Ownership: Discussion stemming from these breaches has brought up the topic of data ownership. The personal information that companies hold in their databases has long been thought of as their property. This concept has been evolving as our personal data has proliferated into many databases, with ever more personal information being collected and generated. Users of these websites and companies understand that organizations need their information to provide services, whether that’s a personalized shopping experience or hailing a ride. This point of ownership cannot be highlighted enough. The theft of personal information in a data breach is not just an attack on the company but an attack on all the users whose personal information was stolen and could be sold or used for illegal activities.

Timing: Customers of these companies want to know if their information has been compromised, so they can check whether account fraud or other identity fraud has occurred. There are several milestones in the data breach timeline. One is when the data breach actually occurred. This may not be known if the company does not have the digital trail and infrastructure to discover when it happened, and it may be well before the next milestone: the company discovering the breach and assessing its extent. The next milestone is the corrective action taken by the affected company or agency to ensure the data is now protected. Currently, only eight states have a firm deadline for notification, which is usually 30 to 90 days after discovery of the breach.

Encryption: California led the data breach notification law effort by passing, in 2002, a law requiring businesses and government agencies to notify California residents of data security breaches. In the California law, there is an exception to notifying those affected if the personal information is encrypted. The law defines the term “encrypted” to mean “rendered unusable, unreadable, or indecipherable to an unauthorized person through a security technology or methodology generally accepted in the field of information security.” This broad definition does not specify a particular level of encryption; instead it leaves room for the standard to rise with whatever the industry accepts at the time. Perhaps, if a breach occurs, a government agency or third party could evaluate the company’s encryption to determine whether reporting is required.

The issue of data breaches is not going away. If government agencies and companies do not respond in a fashion that customers find acceptable, users will become wary of sharing this valuable personal information, and the insights that come with it will be lost.

The Blurry Line of Wearables

November 22nd, 2017

Imagine a world where your heart rate, sleep pattern, and other biometrics are being measured and monitored by your employer on a daily basis. They can decide when and how much you can work based on these metrics. They can decide how much to pay you or even fire you based on this information. This is the world that a lot of athletes now live in. Wearable biometric trackers are becoming the norm in sports across the world. The NBA has begun adopting them to aid recovery and monitor training regimens. MLB is using them to monitor pitchers’ arms. There are European sports teams that display this information live on a jumbotron in the arena.

Currently in the NBA, the collective bargaining agreement between the players’ union and the NBA states that “data collected from a Wearable worn at the request of a team may be used for player health and performance purposes and Team on-court tactical and strategic purposes only. The data may not be considered, used, discussed or referenced for any other purpose such as in negotiations regarding a future Player Contract or other Player Contract transaction.” The line between using biometric data for strategic or health purposes and using it for contract talks or roster decisions can become blurred quickly. What begins as simple monitoring of heart rate in practice can end with a team cutting a player for being out of shape or not trying hard enough. A pitcher who is rested because the team saw that his arm was tiring may earn less money the next time he negotiates a contract, since he played in fewer games. If a team sees that a player isn’t sleeping as much as he should, it can offer less money and say that the player is going out partying too much and that it questions his character. These are all possible ethical and legal violations of a player’s rights. Being able to monitor a player during a game is one thing. Being able to monitor players’ entire lives, control what they can or cannot do, and tie that to their paychecks is another.

On top of these ethical and legal issues that wearables raise, there are also a lot of privacy risks. These players are in the public eye every day and leaks happen within organizations. The more sensitive medical information a team collects, the greater the risk of HIPAA violations. Imagine if a player gets traded or cut for a reason that seems inexplicable to the public. People might start questioning if there are medical issues with the player. Fans love to dissect every part of a player’s life and teams love to post as much information as possible online for their fans to look at. We’ve seen the risks of open public datasets and how even anonymized information can be de-anonymized and tied back to an individual. With the vast amount of information in the world about these players, this is definitely a risk.

All of these issues come as wearable technology becomes commonplace, but there are other risks on the horizon. We’re beginning to see embeddables, such as a pill to take or a device implanted under the skin, being developed, and it will only be a matter of time before professional sports start taking advantage of them. These issues are not only faced by professional athletes. Think about your own job. Do you wear a badge at work? Can your employer track where you are at all times and analyze your performance based on that? What if in the future DNA testing is required? As more technology is developed and data is collected, we need to be aware of the possible issues that come with it and ask ourselves these questions.

In a W231 group project that I recently worked on, we attempted to combine Transparent California, an open database of California public employees’ salary and pension data, with social media information to assemble detailed profiles of those employees. While we were able to do so for individual employees, one by one, limits that large social networks place on scraping by third-party tools prevented us from doing this at scale. This may soon change.

In a recent ruling, the US District Court for the Northern District of California concluded that LinkedIn, the giant professional social network, cannot prevent HiQ Labs from accessing and using its data. The startup helps HR professionals fight attrition by scraping LinkedIn data and deploying machine learning algorithms to predict employees’ flight risk.


This was a fascinating case of the sometimes-inevitable clash between the public’s right to access to information, and the individuals’ right to privacy. On one hand – most would agree that liberating our data from tech giants such as LinkedIn, now owned by Microsoft, is a positive outcome; on the other – allowing universal access to it doesn’t come without risks.


Free speech advocates praise this ruling as potentially signaling a new direction in US courts’ approach to social networks’ control over data their users share publicly. This new approach was exemplified only a month earlier by another ruling, this one by the US Supreme Court, in which the use of social media was described as “speaking and listening in the modern public square”. If social media indeed is a modern public square, there should be little debate on whether information posted there can be used by anyone, for any reason.


There are, however, disadvantages to this increased access to publicly posted data. The first, as my group project discussed, is that by combining multiple publicly available data sets one can violate users’ privacy. Adding the vast sea of information held by social networks to these open data sets enables ever greater violations of privacy. It may be claimed that if users themselves post information online, companies who use it do nothing wrong. However, it is unclear whether users who post information to their public LinkedIn profiles intend for it to be scraped, analyzed and sold by other services. It is much more likely that most users expect this information to be viewed by other individual users.
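To make the linkage risk concrete, here is a minimal sketch in Python. It is not the method our group used; the two datasets, every name in them, and the `link_records` helper are invented for illustration, and real record linkage would need fuzzier matching than an exact name-and-employer join.

```python
# Two tiny, entirely fictional datasets standing in for an open salary
# database and a set of public social media profiles.
salaries = [
    {"name": "Alex Rivera", "agency": "City Parks", "salary": 82000},
    {"name": "Jordan Lee", "agency": "Water District", "salary": 97000},
]
profiles = [
    {"name": "Jordan Lee", "employer": "Water District", "hometown": "Fresno"},
    {"name": "Sam Patel", "employer": "Acme Corp", "hometown": "Oakland"},
]


def link_records(salary_rows, profile_rows):
    """Join the two datasets on (name, employer), ignoring letter case."""
    by_key = {
        (p["name"].lower(), p["employer"].lower()): p for p in profile_rows
    }
    linked = []
    for row in salary_rows:
        match = by_key.get((row["name"].lower(), row["agency"].lower()))
        if match is not None:
            # The merged record holds more than either source did alone.
            linked.append({**row, **match})
    return linked


linked = link_records(salaries, profiles)
```

Neither dataset alone ties a salary to a hometown, but the merged record does, which is the kind of privacy harm that combining public data sets can create.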


Finally, as argued by LinkedIn, the right to privacy covers not only the data itself but also changes to it. When changing their profiles, social media users mostly do not wish to broadcast the change publicly, but to display it to friends or connections who visit their profiles. LinkedIn even allows users to choose whether to keep changes private, posting them only to the user’s profile, or to make them public and let them appear on others’ news feeds. Allowing third parties to scrape information takes this right away from users – algorithms such as HiQ’s scrape profiles, pick up changes, and sell them.


LinkedIn has already appealed the court’s decision, and it will likely be a while before information on social media is treated as literally posted on the public square. Courts will be required to choose, again, between the right to privacy and the right to access to information in this case. But regardless of what the decision will be, this is yet another warning sign reminding us to be thoughtful about what information we share online – it will likely reach more eyes, and be used in more ways, than we originally intended.

Algorithm is not a magic word

November 20th, 2017

People may throw the word “algorithm” around to justify that their decisions are good and unbiased, and it sounds technical enough that you might trust them. Today I want to demystify this concept for you. Fair warning: it may be a bit disillusioning.

What is an algorithm? You’ll be happy to know that it’s not even that technical. An algorithm is just a set of predetermined steps that are followed to reach some final conclusion. Often we talk about algorithms in the context of computer programs, but you probably have several algorithms that you use in your daily life without even realizing it. You may have an algorithm that you follow for clearing out your email inbox, for unloading your dishwasher, or for making a cup of tea.

Potential Inbox Clearing Algorithm

  • Until inbox is empty:
    • Look at the first message:
      • If it’s spam, delete it.
      • If it’s a task you need to complete:
        • Add it to your to-do list.
        • File the email into a folder.
      • If it’s an event you will attend:
        • Add it to your calendar.
        • File the email into a folder.
      • If it’s personal correspondence you need to reply to:
        • Reply.
        • File the email into a folder.
      • For all other messages:
        • File the email into a folder.
    • Repeat.
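The same steps can be written as a short Python sketch. The `Message` and `Mailbox` classes and the `kind` labels are made up for this illustration; a real email client would have to decide what counts as spam or a task, which is exactly where human judgment enters.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    subject: str
    kind: str  # "spam", "task", "event", "personal", or anything else


@dataclass
class Mailbox:
    inbox: list = field(default_factory=list)
    folder: list = field(default_factory=list)
    todo: list = field(default_factory=list)
    calendar: list = field(default_factory=list)
    replied: list = field(default_factory=list)


def clear_inbox(box: Mailbox) -> None:
    # "Until inbox is empty": keep looping while messages remain.
    while box.inbox:
        msg = box.inbox.pop(0)  # look at the first message
        if msg.kind == "spam":
            continue  # deleted: spam is dropped and never filed
        if msg.kind == "task":
            box.todo.append(msg.subject)
        elif msg.kind == "event":
            box.calendar.append(msg.subject)
        elif msg.kind == "personal":
            box.replied.append(msg.subject)  # stands in for writing a reply
        # Every message that was not spam ends up filed into a folder.
        box.folder.append(msg)
```

Notice that the loop itself makes no judgment calls: each branch was decided in advance by whoever wrote the steps.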

Computer programs use algorithms too, and they work in a very similar way. Since the steps are predetermined and the computer is making all the decisions and spitting out the conclusion in the end, some people may think that algorithms are completely unbiased. But if you look more closely, you’ll notice that interesting algorithms can have pre-made decisions built into the steps. In this case, the computer isn’t making the decision at all; it’s just executing the decision that a person made. Consider this (overly simplified) algorithm that could decide when to approve credit for an individual requesting a loan:

Algorithm for Responding to Loan Request

  • Check requestor’s credit score.
  • If credit score is above 750, approve credit.
  • Otherwise, deny credit.

This may seem completely unbiased because you are trusting the computer to act on the data provided (credit score) and you are not allowing any other external factors such as age, race, gender, or sexual orientation to influence the decision. In reality though, a human being had to decide on the appropriate threshold for the credit score. Why did they choose 750? Why not 700, 650, or 642? They also had to choose to base their decision solely on credit score, but could there be other factors that the credit score subtly reflects, such as age or duration of time spent in America? (hint: yes.) With more complicated algorithms, there are many more decisions about what thresholds to use and what information is worth considering, which brings more potential for bias to creep in, even if it’s unintentional.
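Here is the loan algorithm as a small Python sketch, with the threshold pulled out as a parameter so the human choice is visible. The applicant scores are invented for illustration, and `approvals` is a hypothetical helper rather than part of any real lending system.

```python
def approve_credit(credit_score: int, threshold: int = 750) -> bool:
    """Approve when the score is above a threshold that a person chose."""
    return credit_score > threshold


# Hypothetical applicants, made up purely for this example.
applicant_scores = [640, 655, 701, 749, 750, 790]


def approvals(threshold: int) -> int:
    """Count how many of the example applicants a given threshold approves."""
    return sum(approve_credit(score, threshold) for score in applicant_scores)


if __name__ == "__main__":
    for threshold in (642, 650, 700, 750):
        print(threshold, approvals(threshold))
```

With these made-up scores, the identical code approves one applicant at a threshold of 750 and four at 700. The computer behaves the same way in both cases; only the person’s choice of number changed.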

Algorithms are useful because they can help humans use additional data and resources to make more informed decisions in a shorter amount of time, but they’re not perfect or inherently fair. Algorithms are subject to the same biases and prejudices that humans are, simply from the fact that 1) a human designed the steps in an algorithm, and 2) the data that is fed into the algorithm is generated in the context of our human society, including all of its inherent biases.

These inherent biases built into algorithms can manifest in dark ways with considerable negative impacts to individuals. If you’re interested in some examples, take a look at this story about Somali markets in Seattle that were prohibited from accepting food stamps, or this story about how facial recognition software used in criminal investigations can lead to a disproportionate targeting of minority individuals.

In the future, when you see claims that an important decision was based on some algorithm, I hope you will hold the algorithm to the same standards that you would any other human decision-maker. We should continue to question the motivations behind the decision, the information that was considered, and the impact of the results.

For further reading:

Listen to the full interview with Dr. Blumenstock on the most recent Bloomberg Benchmark podcast

Circle Design Workbook - colored cards

Several speculative designs and design fictions from a design workbook were printed onto cards and into other formats for participants to interact with.


Richmond Wong, Deirdre Mulligan, Ellen Van Wyk, James Pierce, and John Chuang published a paper in CSCW (Computer Supported Cooperative Work) 2018’s online-first publication, in the Proceedings of the ACM on Human-Computer Interaction.

The paper, titled “Eliciting Values Reflections by Engaging Privacy Futures Using Design Workbooks,” presents a case study where a set of design workbooks of conceptual speculative designs and design fictions were presented to technologists in training in order to surface discussions and critical reflections about privacy. From the paper:

Although “privacy by design” (PBD)—embedding privacy protections into products during design, rather than retroactively—uses the term “design” to recognize how technical design choices implement and settle policy, design approaches and methodologies are largely absent from PBD conversations. Critical, speculative, and value-centered design approaches can be used to elicit reflections on relevant social values early in product development, and are a natural fit for PBD and necessary to achieve PBD’s goal. Bringing these together, we present a case study using a design workbook of speculative design fictions as a values elicitation tool. Originally used as a reflective tool among a research group, we transformed the workbook into artifacts to share as values elicitation tools in interviews with graduate students training as future technology professionals. We discuss how these design artifacts surface contextual, socially-oriented understandings of privacy, and their potential utility in relationship to other values levers.

We suggest that technology professionals can view and interact with design workbooks (collections of design proposals or conceptual designs, drawn together to allow designers to investigate, explore, reflect on, and expand a design space) to elicit values reflections and discussions about privacy before a system is built, in essence “looking around corners” by broadening the imagination about what is possible.

Download the paper from the ACM Digital Library, or the Open Access version on eScholarship.

See the people and projects that advanced to the seed grant phase.

The Center for Technology, Society & Policy (CTSP) seeks proposals for a Data for Good Competition. The competition will be hosted and promoted by CTSP in coordination with the UC Berkeley School of Information IMSA, and made possible through funds provided by Facebook.

Team proposals will apply data science skills to address a social good problem with public open data. The objective of the Data for Good Competition is to incentivize students from across the UC Berkeley campus to apply their data science skills towards a compelling public policy or social justice issue.

The competition is intended to encourage the creation of data tools or analyses of open data. Open datasets may be local, state, national, or international so long as they are publicly accessible. The data tool or analysis may include, but is not limited to:

  1. integration or combination of two or more disparate datasets, including integration with private datasets;
  2. data conversions into more accessible formats;
  3. visualization of data graphically, temporally, and/or spatially;
  4. data validations or verifications with other open data sources;
  5. platforms that help citizens access and/or manipulate data without coding experience; etc.

Issues that may be relevant and addressed via this competition include environmental issues, civic engagement (e.g., voting), government accountability, land use (e.g., housing challenges, agriculture), criminal justice, access to health care, etc. CTSP suggests that teams should consider using local or California state data since there may be additional opportunities for access and collaboration with agencies who produce and maintain these datasets.

The competition will consist of three phases:

  • an initial proposal phase when teams work on developing proposals
  • seed grant execution phase when selected teams execute on their proposals
  • final competition and presentation of completed projects at an event in early April 2018

Teams selected for the seed grant must be able to complete a working prototype or final product ready for demonstration at the final competition and presentation event. It is acceptable for submitted proposals to have some groundwork already completed or to serve as a substantial extension of an existing project, but we are looking to fund something novel, not already completed work.

Initial Proposal Phase

The initial proposal phase ends at 11:59pm (PST) on January 28th, 2018 when proposals are due. Proposals will then be considered against the guidelines below. CTSP will soon announce events to support teams in writing proposals and to share conversations on data for good and uses of public open data.

Note: This Data for Good Competition is distinct from the CTSP yearlong fellowship RFP.

Proposal Guidelines

Each team proposal (approximately 2-3 pages) is expected to answer the following questions:

Project Title and Team Composition

  • What is the title of your project, and what are the names, department affiliations, student classifications (undergraduate/graduate), and email contact information of your team members?


  • What is the social good problem?
  • How do you know it is a real problem?
  • If you are successful how will your data science approach address this problem?  Who will use the data and how will they use it to address the problem?  


  • What public open data will you be using?

Output & Projected Timeframe

  • What will your output be? How may this be used by the public, stakeholders, or otherwise used to address your social good problem?
  • Outline a timeframe of how the project will be executed in order to become a finished product or working prototype by the April competition. Will any additional resources be needed in order to achieve the outlined goal?

Privacy Risks and Social Harms

  • What, if any, are the potential negative consequences of your project and how do you propose to minimize them? For example, does your project create new privacy risks?  Are there other social harms?  Is the risk higher for any particular group?  Alternatively, does your project aim to address known privacy risks, social harms, and/or aid open data practitioners in assessing risks associated with releasing data publicly?

Proposals will be submitted through the CTSP website. Successful projects will demonstrate knowledge of the proposed subject area by explaining expertise and qualifications of team members and/or citing sources that validate claims presented. This should be a well-developed proposal, and the team should be prepared to execute the project in a short timeframe before the competition. Please include all relevant information needed for CTSP evaluation–a bare bones proposal is unlikely to advance to the seed funding stage.

Seed Grant Phase

Four to six teams will advance to the seed grant phase. This will be announced in February 2018. Each member of an accepted project proposal team becomes a CTSP Data for Good grantee, and each team will receive $800 to support development of their project. If you pass to the seed grant phase we will be working with you to connect you with stakeholder groups and other resources to help improve the final product. CTSP will not directly provide teams with hardware, software, or data.

Final Competition and Presentation Phase

This phase consists of an April evening of public presentation before judges from academia, Facebook, and the public sector and a decision on the competition winner. The top team will receive $5000 and the runner-up will receive $2000. 

Note: The presentation of projects will support the remote participation of distance-learning Berkeley students, including Master of Information and Data Science (MIDS) students in the School of Information.

Final Judging Criteria

In addition to examining continued consideration of the project proposal guidelines, final projects will be judged by the following criteria and those judgments are final:

  • Quality of the application of data science skills
  • Demonstration of how the proposal or project addresses a social good problem
  • Advancing the use of public open data

After the Competition

Materials from the final event (e.g., video) and successful projects will be hosted on a public website for use by policymakers, citizens, and students. Teams will be encouraged to publish a blogpost on CTSP’s Citizen Technologist Blog sharing their motivation, process, and lessons learned.

General Rules

  • Open to current UC Berkeley students (undergraduate and graduate) from all departments (Teams with outside members will not be considered. However, teams that have a partnership with an external organization who might use the tool or analysis will be considered.)
  • Teams must have a minimum of two participants
  • Participants must use data sets that are considered public or open.

Code of Conduct

This code of conduct has been adapted from the 2017 Towards Inclusive Tech conference held at the UC Berkeley School of Information:

The organizers of this competition are committed to principles of openness and inclusion. We value the participation of every participant and expect that we will show respect and courtesy to one another during each phase and event in the competition. We aim to provide a harassment-free experience for everyone, regardless of gender, sexual orientation, disability, physical appearance, body size, race, or religion. Attendees who disregard these expectations may be asked to leave the competition. Thank you for helping make this a respectful and collaborative event for all.


Please direct all questions about the application or competition process to CTSP@berkeley.edu.


Please submit your application at this link.

Please join us for the NLP Seminar on Monday, November 13,  at 4:00pm in 202 South Hall.  All are welcome!

Speaker:  He He (Stanford)

Title:  Learning agents that interact with humans


The future of virtual assistants, self-driving cars, and smart homes requires intelligent agents that work intimately with users. Instead of passively following orders given by users, an interactive agent must actively collaborate with people through communication, coordination, and user-adaptation. In this talk, I will present our recent work towards building agents that interact with humans. First, we propose a symmetric collaborative dialogue setting in which two agents, each with some private knowledge, must communicate in natural language to achieve a common goal. We present a human-human dialogue dataset that poses new challenges to existing models, and propose a neural model with dynamic knowledge graph embedding. Second, we study the user-adaptation problem in quizbowl – a competitive, incremental question-answering game. We show that explicitly modeling different human behaviors leads to more effective policies that exploit sub-optimal players. I will conclude by discussing opportunities and open questions in learning interactive agents.


Paper accepted at NIPS workshop

November 1st, 2017

Raza Khan and Joshua Blumenstock have had a paper accepted at the NIPS 2017 workshop on Machine Learning for the Developing World:

  • Khan, MR, and Blumenstock, JE (2017). Determinants of Mobile Money Adoption in Pakistan, The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS ’17), Workshop on Machine Learning for the Developing World. [pdf]

The Open Data movement has won the day and governments around the world – as well as scientific researchers, non-profits and even private companies – are embracing data sharing. Open data has huge benefits in encouraging replicability of scientific research, encouraging community and non-profit engagement with government data, providing accountability, and aiding businesses (especially new businesses which may not yet have detailed customer data). Open Data has already scored several key wins in areas from criminal justice reform to improved diagnostics. On the criminal justice side, open data led to the discovery that “stop and frisk” policies in New York were both ineffective and heavily biased against minorities. This discovery was instrumental in the successful community effort to end this policy[1]. In health care, promising innovations in using deep learning to diagnose cancer are being assisted by publicly released data sets of labeled MRI and CAT scan images. These efforts demonstrate how Open Data can democratize the benefits of government data and publicly funded research, and in this positive context it is easy to see Open Data as a panacea. However, several key concerns remain around privacy, accessibility, and ethics.

The most popular concern about Open Data revolves around the privacy concerns that arise from the public release of data – especially the public release of data that users might not have realized is private. Balancing privacy with providing the granularity of data needed for more sophisticated analysis is an ongoing concern, although increasingly a shared set of policies and practices is being developed around privacy protection[2]. But despite these advances in policy and practice, key privacy concerns remain, both in general and in specific instances in which clear privacy harms have been caused. For example, New York City released taxi trip data with license numbers hashed, which led to two separate privacy concerns[3]. First, the data was easily de-anonymized by a civic hacker, leading to privacy concerns for the taxi drivers. Second, several data scientists demonstrated how they identified one particular individual as a frequent customer of a gentleman’s club (which, clearly, is potentially very publicly embarrassing) [4]. This case demonstrates how very specific GPS data and a little bit of clever analytics can easily de-anonymize particular users and expose them to substantial privacy risk. Differential privacy – seeking to provide accurate results on an aggregate level without allowing any individual to be identified – should be applied to avoid these situations, but advances in record linkage make this even more difficult, as researchers have to consider not only the privacy risks in their own data but also how it could be combined with other data in privacy-damaging ways. These high-profile failures demonstrate that there is still progress to be made in privacy protection.
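A short Python sketch shows why hashing alone can fail as anonymization when the identifiers come from a small, predictable set. The ID format (digit, letter, digit, digit) and the choice of SHA-256 are assumptions made for this illustration, not details taken from the taxi release; the point holds for any unsalted hash over a space small enough to enumerate.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def hash_id(license_id: str) -> str:
    """Unsalted hash, standing in for whatever scheme a publisher might use."""
    return hashlib.sha256(license_id.encode("utf-8")).hexdigest()


def all_ids():
    """Enumerate every ID in a hypothetical digit-letter-digit-digit format."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup() -> dict:
    """Hash every possible ID once and map each hash back to its ID."""
    return {hash_id(license_id): license_id for license_id in all_ids()}


lookup = build_lookup()  # 10 * 26 * 10 * 10 = 26,000 entries

released_value = hash_id("5X41")  # what an "anonymized" row would contain
recovered = lookup[released_value]  # the original ID, found by table lookup
```

Because the whole ID space fits in a small table, anyone can reverse every hashed value in the release. A secret salt or random replacement identifiers would block this particular attack, though not the location-based re-identification described above.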

In addition to these concerns about privacy, there are several other aspects of open data that merit further consideration. One of these is access to open data – with the increasing size and complexity of data, just providing access to the data itself may not be enough. Several companies’ business models are built on processing and serving open data[5], which illustrates that open data doesn’t mean easy-to-access data. If the public is funding data collection, some argue, it is not enough to provide the data; more effort needs to be put into making this data truly accessible to the public and smaller firms. Second, concerns remain around the opacity of the ethics behind publicly accessible data. While the data itself – and often the code that produces results – is publicly accessible, the ethical decision making around study design, eligibility, and protections for subjects often is not[6]. The lack of accessibility of Institutional Review Board reports and ethical evaluations impedes the public’s ability to make informed judgements about public data sets.

Overall, Open Data has great potential to improve accountability and foster innovative solutions to social issues, but still requires work in order to balance privacy and ethics with openness.



[3] ibid