Archive for November 1st, 2017

Paper accepted at NIPS workshop

November 1st, 2017

Raza Khan and Joshua Blumenstock have had a paper accepted at the NIPS 2017 workshop on Machine Learning for the Developing World:

  • Khan, MR, and Blumenstock, JE (2017). Determinants of Mobile Money Adoption in Pakistan, The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS ’17), Workshop on Machine Learning for the Developing World.

The Open Data movement has won the day, and governments around the world (as well as scientific researchers, non-profits, and even private companies) are embracing data sharing. Open data has huge benefits: it encourages replicability of scientific research, encourages community and non-profit engagement with government data, provides accountability, and aids businesses (especially new businesses that may not yet have detailed customer data). Open Data has already scored several key wins in areas from criminal justice reform to improved diagnostics. On the criminal justice side, open data led to the discovery that “stop and frisk” policies in New York were both ineffective and heavily biased against minorities. This discovery was instrumental in the successful community effort to end the policy[1]. In health care, promising innovations in using deep learning to diagnose cancer are being assisted by publicly released data sets of labeled MRI and CAT scan images. These efforts demonstrate how Open Data can democratize the benefits of government data and publicly funded research, and in this positive context it is easy to see Open Data as a panacea. However, several key concerns remain around privacy, accessibility, and ethics.

The most prominent concern about Open Data is the privacy risk that arises from the public release of data, especially data that users might not have realized is private. Balancing privacy against the granularity of data needed for more sophisticated analysis is an ongoing challenge, although a shared set of policies and practices is increasingly being developed around privacy protection[2]. Despite these advances in policy and practice, key privacy concerns remain, both in general and in specific instances in which clear privacy harms have been caused. For example, New York City released taxi trip data with license numbers hashed, which led to two separate privacy problems[3]. First, the data was easily de-anonymized by a civic hacker, creating privacy concerns for the taxi drivers. Second, several data scientists demonstrated how they identified one particular individual as a frequent customer of a gentleman’s club (which, clearly, is potentially very embarrassing if made public) [4]. This case demonstrates how precise GPS data and a little clever analytics can easily de-anonymize particular users and expose them to substantial privacy risk. Differential privacy, which seeks to provide accurate results at an aggregate level without allowing any individual to be identified, should be applied to avoid these situations. Advances in record linkage make this even more difficult, however, because researchers have to consider not only the privacy risks in their own data but also how it could be combined with other data in privacy-damaging ways. These high-profile failures demonstrate that there is still progress to be made in privacy protection.
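To make the hashing failure above concrete, here is a minimal sketch of why hashing alone does not anonymize a small identifier space. It is an illustration under stated assumptions, not a reconstruction of how the taxi data was actually processed or attacked: the choice of unsalted MD5, the identifier format (one digit, one letter, two digits), and the sample ID "5X41" are all hypothetical.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def hash_id(identifier: str) -> str:
    """Return the unsalted MD5 hex digest of an identifier (assumed scheme)."""
    return hashlib.md5(identifier.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Hash every possible ID of the hypothetical form digit-letter-digit-digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[hash_id(candidate)] = candidate
    return table


# Only 10 * 26 * 10 * 10 = 26,000 candidates exist, so the full table
# of hash -> original ID can be built in well under a second.
lookup = build_lookup()

# A "released" record containing only the hashed ID is reversed by a lookup.
released = hash_id("5X41")
recovered = lookup[released]
```

Because every possible identifier can be enumerated, the hash works like a simple substitution code. A secret salt or a random per-record pseudonym would block this particular lookup, but as the paragraph above notes, precise location data can still identify people on its own.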

In addition to these concerns about privacy, several other aspects of open data merit further consideration. One is access: with the increasing size and complexity of data, just providing access to the data itself may not be enough. Several companies’ business models consist of processing and serving open data[5], which illustrates that open data does not necessarily mean easy-to-access data. If the public is funding data collection, some argue, it is not enough to provide the data; more effort needs to be put into making it truly accessible to the public and to smaller firms. Second, concerns remain around the opacity of the ethics behind publicly accessible data. While the data itself, and often the code that produces results, is publicly accessible, the ethical decision making around study design, eligibility, and protections for subjects often is not[6]. The lack of access to Institutional Review Board reports and ethical evaluations impedes the public’s ability to make informed judgements about public data sets.

Overall, Open Data has great potential to improve accountability and foster innovative solutions to social issues, but still requires work in order to balance privacy and ethics with openness.

[1] https://sunlightfoundation.com/2015/05/01/the-benefits-of-criminal-justice-data-beyond-policing/

[2] http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf

[3] ibid

[4] https://research.neustar.biz/author/atockar/

[5] http://www.computerweekly.com/opinion/The-problem-with-Open-Data

[6] https://www.forbes.com/sites/kalevleetaru/2017/07/20/should-open-access-and-open-data-come-with-open-ethics/#1b0bc7565426