Clearview AI: The startup that is threatening privacy

By Stefania Halac, October 16, 2020

Imagine walking down the street when a stranger points their camera at you and immediately pulls up all your pictures from across the internet: your Instagram posts, your friends’ posts, any picture you appear in, including some you may never have seen before. This stranger could now ascertain where you live, where you work, where you went to school, whether you’re married, who your children are… This is one of many compromising scenarios that may become part of normal life if facial recognition software becomes widely available.

Clearview AI, a private technology company, offers facial recognition software that can effectively identify any individual. Facial recognition technology is intrinsically controversial, so much so that certain companies like Google don’t offer facial recognition APIs due to ethical concerns. And while some large tech companies like Amazon and Microsoft do sell facial recognition APIs, there is an important distinction between Clearview’s offering and that of the other tech giants. Amazon and Microsoft only allow you to search for faces from a private database of pictures supplied by the customer. Clearview instead allows for recognition of individuals in the public domain — practically anyone can be recognized. What sets Clearview apart is not its technology, but rather the database it assembled of over three billion pictures scraped from the public internet and social media. Clearview AI did not obtain consent from individuals to scrape these pictures, and has been sent cease and desist orders from major tech companies like Twitter, Facebook and Youtube over its practices due to policy violations.

In the wake of the Black Lives Matter protests earlier this year, IBM, Microsoft and Amazon updated their policies to restrict the sale of their facial recognition software to law enforcement agencies. On the other hand, Clearview AI not only sells to law enforcement and government agencies, but until May of this year was also selling to private companies, and has even been reported to have granted access to high net-worth individuals.

So what are the risks? On one hand, the algorithms behind these technologies are known to be heavily biased, performing worse on certain populations such as women and African Americans. In a recent study, Amazon’s Rekognition was found to misclassify women as men 19% of the time, and darker-skinned women as men 31% of the time. If this technology were used in the criminal justice system, one implication is that darker-skinned people would be more likely to be wrongfully identified and convicted.

Another major harm is that this technology essentially provides its users the ability to find anyone. Clearview’s technology would enable surveillance at protests, AA meetings and religious gatherings. Attending any one of these events or locations would become a matter of public record. In the wrong hands, such as those of a former abusive partner or a white supremacist organization, this surveillance technology could even be life-threatening for vulnerable populations.

In response, the ACLU filed a lawsuit against Clearview AI in May for violation of the Illinois Biometric Information Privacy Act (BIPA), alleging the company illegally collected and stored data on Illinois citizens without their knowledge or consent and then sold access to its technology to law enforcement and private companies. While some cities like San Francisco and Portland have enacted facial recognition bans, there is no overarching national law protecting civilians from such blatant privacy violations. With no such law in sight, this may be the end of privacy as we know it.


The Gender Square: A Different Way to Encode Gender

By Emma Tebbe, October 16, 2020

Image: square with two axes, the horizontal reading Masculine and Feminine and the vertical reading Low Gender Association / Agender and Strong Gender Association

As non-gender-conforming and transgender folks become more visible and normalized, the standard male / female / other gender selections we encounter in forms and surveys look increasingly tired and outdated. First of all, the terms “male” and “female” generally refer to sex, or someone’s biological configuration, “including chromosomes, gene expression, hormone levels and function, and reproductive/sexual anatomy.” Male and female are not considered the correct terms for gender identity, which “refers to socially constructed roles, behaviours, expressions and identities of girls, women, boys, men, and gender diverse people.” While sex exists on a spectrum that includes intersex people, gender spans a wide range of identities, including agender, bigender, and genderqueer. The gender square method of encoding gender aims to encompass more of the gender spectrum than a simple male / female / other selection.

Image: triangle defining sex, gender expression, gender attribution, and gender identity

Upon encountering this square in a form or survey, the user would drag the marker to the spot on the square that most accurately represents their gender identity. This location would then be recorded as a coordinate pair, where (0, 0) is the center of the square. The entity gathering the data would then likely use those coordinates to categorize respondents. However, using continuous variables to represent gender identity allows for many methods of categorization. The square could be divided into quadrants, as pictured above, vertical halves (or thirds, or quarters), or horizontal sections. This simultaneously allows for flexibility in how to categorize gender and reproducibility of results by other entities. Other analysts would be able to reproduce results if they are given respondents’ coordinates and the categorization methodology used. Coordinate data could even be used as it was recorded, turning gender from a categorical variable into a continuous one.
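A small sketch can make the encoding concrete. The axis conventions, thresholds, and category labels below are illustrative assumptions, not a prescribed scheme:

```python
def quadrant(x, y):
    """Categorize a gender-square coordinate pair into one of four quadrants.

    Assumed conventions: x runs from -1 (masculine) to +1 (feminine);
    y runs from -1 (low gender association / agender) to +1 (strong gender
    association); (0, 0) is the center of the square.
    """
    if not (-1 <= x <= 1 and -1 <= y <= 1):
        raise ValueError("point must lie inside the square")
    vert = "strongly gendered" if y >= 0 else "low-association"
    horiz = "feminine" if x >= 0 else "masculine"
    return f"{vert} {horiz}"

def vertical_thirds(x, y):
    """An alternative categorization of the very same coordinate data."""
    if x < -1/3:
        return "masculine-leaning"
    if x > 1/3:
        return "feminine-leaning"
    return "center"
```

Because the stored datum is just the coordinate pair, either function (or any other sectioning) can be applied after the fact, which is exactly the flexibility-plus-reproducibility point above.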

Although this encoding encompasses more dimensions, representing gender as a spectrum that includes agender identities, it still comes with its own problems. First, the gender square leaves no room for flexible gender identities, including those whose gender is in flux or who identify as genderfluid or bigender. There are a few potential UI-side solutions to this misrepresentation, but they create new problems for data encoding. Genderfluid folks could perhaps draw an enclosed area in which their gender generally exists, but recording that response is much more complex: it becomes an array of values rather than a single coordinate pair. People who identify as bigender could potentially place two markers, one for each of the genders they experience. Both approaches make categorization more complex – if an individual’s gender identity spans two categories, would they be labeled twice? Or would there be another category for people who fall into multiple categories?

Image: a gender spectrum defining maximum femininity as “Barbie” and maximum masculinity as “G.I. Joe”

Another issue might arise with users who haven’t questioned their gender identity along either of these axes, and who may not understand them (particularly the Strong Gender Association / Agender axis) well enough to use the gender square accurately. When implemented, the gender square would likely need an explanation, definitions, and potentially suggestions. Suggestions could include examples such as “If you identify as a man and were assigned that gender at birth, you may belong in the upper left quadrant.” Another option may be to include examples such as the somewhat problematic illustration above.

This encoding of gender would likely first be adopted by groups occupying primarily queer spaces, where concepts of masculinity, femininity, and agender identities are more prominent and considered. If used in places where data on sex and transgender status is vital information, such as at a doctor’s office, then the gender square would need to be supplemented by questions obtaining that necessary information. Otherwise, it is intended for use in spaces where a person’s sex is irrelevant information (which is most situations where gender information is requested).

Although still imperfect, representation and identification of gender along two axes represents more of the gender spectrum than a simple binary, and still allows for categorization, which is necessary for data processing and analytics. With potential weaknesses in misunderstanding and inflexibility, it finds its strength in allowing individuals to more accurately and easily represent their own identities.

Valentine, David. “The Categories Themselves.” GLQ: A Journal of Lesbian and Gay Studies, vol. 10, no. 2, 2004, pp. 215–220 (image only).


When Algorithms Are Too Accurate

By Jill Cheney, October 16, 2020

An annual rite of passage each spring for innumerable students is college entrance exams. Whatever their name, the end result is the same: to influence admission decisions. When the Covid-19 pandemic swept the globe in 2020, this milestone changed overnight. Examinations were cancelled, leaving students and universities with no traditional way to evaluate admission. Alternative solutions emerged with varying degrees of validity.

In England, the solution used to replace their A-level exams involved developing a computer algorithm to predict student performance. In the spirit of a parsimonious model, two parameters were used: the student’s current grades and the historical test record of the attending school. The outcome elicited nationwide ire by highlighting inherent testing realities.

Overall, the predicted exam scores were higher – more students did better than in any previous exam year, with 28% earning top grades in England, Wales and Northern Ireland. However, incorporating the school’s previous test performance into the algorithm created a self-fulfilling reality. Students at historically high-performing schools had inflated scores; conversely, students from lower-performing schools had deflated ones. Immediate cries of AI bias erupted. However, the data wasn’t wrong – the algorithm simply highlighted the inherent biases and disparities in the actual data modeled.
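A toy sketch makes the self-fulfilling dynamic concrete. The real model was considerably more elaborate; the `moderate` function, the grade lists, and the students here are all invented for illustration:

```python
def moderate(class_rank_order, school_historical_grades):
    """Assign grades by mapping the teacher's rank ordering of this year's
    students onto the school's historical grade distribution (best first)."""
    n = len(class_rank_order)
    top = len(school_historical_grades) - 1
    assigned = {}
    for i, student in enumerate(class_rank_order):
        j = round(i * top / max(n - 1, 1))   # i-th ranked student receives
        assigned[student] = school_historical_grades[j]  # the i-th historical grade
    return assigned

# Two equally able students, each ranked first in their class, but at
# schools with very different historical results:
print(moderate(["Priya"], ["A*", "A", "A", "B"]))  # {'Priya': 'A*'}
print(moderate(["Dev"],   ["B", "C", "C", "D"]))   # {'Dev': 'B'}
```

Whatever Dev actually achieves, he cannot score above the best grade his school produced in the past: the school’s record caps the individual.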

Reference points did exist for the predicted exam scores. One was from teachers, who provide a prediction of each student’s performance. The other was from students’ scores on previous ‘mock’ exams. Around 40 percent of students received a predicted score one step lower than their teachers’ predictions. Not surprisingly, the largest downgrades occurred amongst poorer students. Many others received predicted scores below their ‘mock’ exam scores. Mock exam results support initial university acceptance; however, they must be followed up with commensurate official exam scores. For many students, the disparity between their predicted and ‘mock’ exam scores jeopardized their university admission.

Attempting to rectify the disparities came with its own challenges. Opting to use teacher-predicted scores required accepting that not all teachers provided meticulous student predictions. Based on teacher predictions alone, 38% of predicted scores would have been at the highest levels: A*s and As. Other alternatives included permitting students to retake the exam in the fall or allowing the ‘mock’ exam scores to stand in when higher than the predicted ones. No easy answers existed when attempting to navigate an equitable national response.

As designed, the computer model weighted a school’s past performance over the individual student’s. Individual grades could not offset the influence of a school’s testing record. The model also discounted more qualitative variables, such as test-taking skills. In the face of a computer-generated scoring model, a feeling of powerlessness emerged. Students no longer felt they possessed control over their future and schooling opportunities.

Ultimately, the predictive model simply exposed the underlying societal realities and quantified how wide the gap actually is. In the absence of the pandemic, testing would have continued on the status quo. Affluent schools would have received higher scores on average than fiscally limited schools. Many students from disadvantaged schools would have individually succeeded and gained university admission. The public outcry this predictive algorithm generated underscores how the guise of traditional test conditions assuages our concerns about the realities of standardized testing.


Data as Taxation

By Anonymous, October 16, 2020

Data collection is often analogized to economic transaction. We frame our interactions with tech companies as an exchange: our data serves as payment for services, which in turn allows for the continued provision of those services.

Metaphors like these can be useful in that they allow us to port developed intuitions from a well-trodden domain (transactions) to help us navigate less familiar waters (data). In this spirit, I wanted to further develop this “data collection = economic transaction” metaphor and explore how our perceptions of data collection change with a slight tweak: “data collection = taxation.”

In the context of data collection, the following quote from Supreme Court Justice Oliver Wendell Holmes might give one pause: “Taxes are what we pay for civilized society.” Is this applicable, or entirely irrelevant?

Here’s what I mean: with taxation, government bodies mandate that citizens contribute a certain amount of resources to fund public services. The same goes for data – while Google, Facebook, and Amazon are not governments, they too create and maintain enormous ecosystems that facilitate otherwise impossible interactions. Governments allow for coordination around national security, education, and supply chains, and Big Tech provides the digital analogues. Taxation and ad revenue allow for the perpetual creation of this value. Both can embody some (deeply imperfect) notion of “consent of the governed” through voter and consumer choice, although neither provides an easy way to “opt out.”

Is this metaphor perfect? Not at all, but there is still value in making the comparison. We can recycle centuries of bickering over fairness in taxation.

For instance, one might ask “when is taxation / data collection exploitative?” On one end, some maintain that “all taxation is theft,” a process by which private property is coercively stripped. Some may feel a similar sense of violation as their personal information is harvested – for them, perhaps the amorphous concept of “data” latches onto the familiar notion of “private property,” which might in turn suggest the need for some kind of remuneration.

At the other extreme, some argue that taxation cannot be theft of private property, because the property was never private to begin with. Governments create the institutions and infrastructure that allow the concept of “ownership” to exist at all, and thus all property is on loan. One privacy analogue could be that the generation of data is impossible and worthless without the scaffolding of Big Tech, and thus users have a similarly tenuous claim on their digital trails.

The philosophy of just taxation has provided me an off-the-shelf frame by which to parse a less familiar space. Had I stayed with the “data collection = economic transaction” metaphor, I would have never thought about data from this angle. As is often the case, a different metaphor illuminates different dimensions of the issue.

Insights can flow the other way as well. For example, in data circles there is a developing sophistication around what it means to be an “informed consumer.” It is recognized by many that merely checking the “I agree” box does not constitute a philosophically meaningful notion of consent, as the quantity and complexity of relevant information is too much to expect from any one consumer. Policies and discussions around the “right to be forgotten”, user control of data, or the right to certain types of transparency acknowledge the moral tensions inherent in the space.

These discussions are directly relevant to justifications often given for a government’s right to tax, like the “social contract” or the “consent of the governed.” Both often have some notion of informed consent, but this sits on similarly shaky ground. How many voters know how their tax dollars are being spent? While government budgets are publicly available, how many are willing to sift through reams of legalese? How many voters can tell you what military spending is within an even order of magnitude? Probably as many as who know exactly how their data is packaged and sold. The data world and its critics have much to contribute to the question of how to promote informed decision-making in a world of increasing complexity.

Linguists George Lakoff and Mark Johnson suggest that metaphors are central to our cognitive processes.

Of course, all of these comparisons are deeply imperfect, and require much more space to elaborate. My main interest in writing this was exploring how this analogical shift led to different questions and frames. The metaphors we use have a deep impact on our ability to think through novel concepts, particularly when navigating the abstract. They shape the questions we ask, the connections we make, and even the conversations we can have. To the extent that that’s true, metaphors can profoundly reroute society’s direction on issues of privacy, consent, autonomy, and property, and are thus well-worth exploring.

When an Algorithm Replaces Cash Bail

By Allison Godfrey, October 9, 2020

In an effort to make the criminal justice system more equitable, California Senate Bill 10 replaced cash bail with a predictive algorithm that produces a risk assessment score to determine whether the accused must remain in jail before trial. The risk assessment places suspects into low, medium, or high risk categories. Low risk individuals are generally released before trial, while high risk individuals remain in jail. For medium risk individuals, the judge has much more discretion in determining their placement before trial and conditions of release. The bill also releases all suspects charged with a misdemeanor without requiring a risk assessment. It was signed into law in 2018 and took effect in October 2019. California Proposition 25 seeks to repeal the bill and return to cash bail, on the basis that the algorithm biases the system even more than cash bail does. People often see data and algorithms as purely objective, since they are based on numbers and formulas. However, these are often “black box” models, where we have no way of knowing exactly how the algorithm arrived at its output. If we cannot follow the model’s logic, we have no way of identifying and modifying its bias.


By their nature, predictive algorithms learn from data in much the same way humans learn from life’s inputs (experiences, conversations, schooling, family, etc.). Our life experiences make us inherently biased, since we each hold a unique perspective shaped purely by that set of experiences. Similarly, algorithms learn from the data we feed them and spit out the perspective that the data creates: an inherently biased one. Say, for example, we feed a predictive model data about 1,000 people with pending trials. While the Senate Bill is not clear on the exact inputs to the model, suppose we feed it the following attributes of each person: age, gender, charge, past record, income, zip code, and education level. We exclude the person’s race from the model in an effort to eliminate racial bias. But have we really eliminated racial bias?


Let’s compare two people: Fred and Marc. Fred and Marc have the exact same charge, identify as the same gender, have similar incomes, and both have bachelor’s degrees, but they live in different towns. The model learns from past data that people from Fred’s zip code are generally more likely to commit another crime than people from Marc’s zip code. Thus, Fred receives a higher risk score than Marc, and he awaits his trial in jail while Marc is allowed to go home. Due to the history and continuation of systemic racism in this country, neighborhoods are often racially and economically segregated, so people from one zip code may be much more likely to be people of color and lower income than those from a neighboring town. Thus, by including an attribute like zip code, we introduce economic and racial bias into the model even though neither race nor wealth is explicitly stated. While the original goal of Senate Bill 10 was to eliminate wealth as a determining factor in bail decisions, it inadvertently reintroduces wealth as a predictor through the economic bias woven into the algorithm. Instead of equalizing the scale of the criminal justice system, the algorithm tips it even further.
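A minimal sketch of this proxy effect follows. The records, zip codes, and scoring rule are invented for illustration; real risk tools are far more complex, but the mechanism is the same:

```python
from collections import defaultdict

def fit_zip_rates(records):
    """Learn a naive 'risk score' per zip code: the historical reoffense rate."""
    counts = defaultdict(lambda: [0, 0])  # zip -> [reoffenses, total]
    for r in records:
        counts[r["zip"]][0] += r["reoffended"]
        counts[r["zip"]][1] += 1
    return {z: hits / total for z, (hits, total) in counts.items()}

# Hypothetical training data: race is excluded as a feature, but in a
# segregated city zip code correlates strongly with race and income.
history = (
    [{"zip": "60601", "reoffended": 1}] * 6 + [{"zip": "60601", "reoffended": 0}] * 4
    + [{"zip": "60602", "reoffended": 1}] * 2 + [{"zip": "60602", "reoffended": 0}] * 8
)
rates = fit_zip_rates(history)

# Fred and Marc are identical on every input except zip code:
fred_score = rates["60601"]  # 0.6 -> flagged higher risk
marc_score = rates["60602"]  # 0.2 -> flagged lower risk
```

No race variable ever appears, yet two otherwise identical defendants receive different scores purely because of where they live.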


Additionally, the purpose of cash bail is to ensure the accused will show up to their trial. While it is true that the system of cash bail can be economically inequitable, the algorithm does not seem to be addressing the primary purpose of bail. There is no part of Senate Bill 10 that helps ensure that the accused will be present at their trial.

Lastly, Senate Bill 10 allows judicial discretion in any case, particularly medium-risk cases. Human bias in the courtroom has historically played a big role in the inequality of our justice system. The discretion judges have to overrule the risk assessment score could reintroduce the very human bias the model partly seeks to avoid. It has been shown that judges exercise this power more often to place someone in jail than to release them. In the time of Covid-19, going to jail carries an increased risk of infection. With this heightened risk, our decision system, whether algorithmic, monetary, and/or human-centered, should err more on the side of release, not detainment.

The fundamental question is one that neither cash bail nor algorithms can answer:
How do we eliminate wealth as a determining factor in the justice system while also not introducing other biases and thus perpetuating systemic racism in the courtroom?

Authentication and the State

By Julie Nguyen


For historical and cultural reasons, American society is one of the very few democracies in the world with no universal authentication system at the national level. Surprisingly, Americans do not extend to their government the trust they grant corporations, because they consider such an identifier system a serious violation of privacy and a major opening for Big Brother government. I will argue that it would be more beneficial for the US to create a universal authentication system to replace the patchwork of de facto paper documents currently in use, in disparate fashion, at the state level.

Though controversial and difficult to implement, a national-level authentication system would bring many benefits.

It is not reasonable to argue that a national-level authentication system is too complex to create. It is hard, but it has proved possible elsewhere.

The debate over a national-level authentication system is not new. In Europe, national census schemes inspired a great deal of resistance, as they tended to focus attention on privacy issues. One of the earliest examples was the protest against a census in the Netherlands in 1971. Likewise, nobody foresaw the storms of protest over the censuses in Germany in 1983 and 1987. In both countries, memories of World War II, and of how governments had terrorized the Dutch and German people during and after the war, could explain such reactions.

Similarly, proposals for national-level identity cards produced the same reactions in numerous countries. Today, however, almost all modern societies have developed systems to authenticate their citizens. Those systems have evolved with the advent of new technologies, in particular biometric cards and e-cards: the pocket-sized “ID cards” have become biometric cards in almost all European countries and e-cards in Estonia. Citizens of many countries, including democracies, are required by law to carry ID cards at all times. Yet these cards are still viewed by Americans as a major tool of oppressive governments, and any proposal to establish a national-level ID card is generally not considered fit for discussion.

In some countries where people shared the American view, governments have learned hard lessons. As a result, contemporary national identification policies tend to be introduced gradually, under labels other than an ID system per se. Thus, the new Australian policy has been termed an Access Card since its introduction in 2006. The Canadian government now talks of a national Identity Management policy. More recently, the Indian government has implemented Aadhaar, the world’s biggest biometric identification scheme, containing the personal details, fingerprints and iris patterns of 1.2 billion people – nine out of ten Indians.

It is time for the federal government, taking lessons from other countries, to create a national-level authentication system in the United States, given the many benefits such a system would create for Americans.

The advantages of a national authentication system would outweigh its disadvantages, contrary to the arguments of opponents concerned with privacy and discrimination. I will use two main arguments to justify this claim. First and foremost, the most significant justification for identifying citizens is to ensure the public’s safety and well-being. Even in Europe, where the right to privacy is extremely important, Europeans have made a trade-off in favour of their safety. Documents captured from al-Qaeda and ISIS show that terrorists are aware that anonymity is a valuable tool for penetrating an open society. For domestic terrorist acts, it would also be easier and simpler to catch terrorists if the country had a universal authentication system. For instance, the Unabomber became one of the most notorious terrorists in the United States in part because he was extremely hard to track, having left almost no trace of identity in society.

Second, opponents of a national authentication system argue that traditional ID cards, or a national authentication system, are a source of discrimination. Actually, universal identifiers could serve to reduce discrimination in some areas. All job applicants would be identified to prevent fake identities, not only immigrants or those who look or sound “foreign.” Take the example of E-Verify, a voluntary online system operated by the U.S. Department of Homeland Security (DHS) in partnership with the Social Security Administration (SSA). It is used to verify an employee’s eligibility to work legally in the United States: E-Verify checks a worker’s Form I-9 information for authenticity and work-authorization status against SSA and Citizenship and Immigration Services (CIS) databases. Today, more than 20 states have adopted laws requiring employers to use the federal government’s E-Verify program. Because E-Verify entails additional administrative costs for potential employers, it acts as a driver of discrimination against immigrant workers in the United States. A national “E-Verify” system covering all US residents would remove this source of discrimination.

The lack of a nationwide authentication system results in substantial social costs.

Identity theft has become a serious problem in the United States. Though the federal government passed the Identity Theft and Assumption Deterrence Act in 1998 to crack down on the problem and make it a federal felony, the cost of identity theft has continued to increase significantly[1]. Identity thieves have stolen over $107 billion in the US over the past six years. Identity theft is particularly frightening because there is no completely effective way for most people to protect themselves. Even the rich and powerful can be caught in the trap. For example, Abraham Abdallah, a busboy in New York, succeeded in stealing millions of dollars from famous people’s bank accounts using the Social Security numbers, home addresses and birthdays of Warren Buffett, Oprah Winfrey and Steven Spielberg.

People usually think of identity theft as someone using another person’s identity to steal money, mostly via stolen credit cards or, in more elaborate schemes, as in the case of the above-mentioned New Yorker. But the reality is much more complex. In his book The Limits of Privacy, Amitai Etzioni lists several categories of crime related to identity theft:

    • Criminal fugitive
    • Child abuse and sex offenses
    • Income tax fraud and welfare fraud
    • Nonpayment of child support
    • Illegal immigration

Additionally, the highest hidden cost to American society from the lack of a universal identity system is, in my opinion, the vulnerability of its democracy and the inefficient functioning of society as a whole. In most democracies, a universal authentication system permits citizens to interact with government, reducing transaction costs and increasing trust in government at the same time. Moreover, it is a step toward e-elections in countries where, as in the United States, the turnout rate has become critical. Without a universal and secure authentication system, any reform of elections in this country would be very difficult to put in place.

Overall, the tangible and intangible costs of not having a national authentication system are very high.


The United States is one of the very few democracies with no standardized universal identification system, and the social cost is very significant. New technologies today can protect such a system from abuse; this need not be a zero-sum game for society. Opponents of this kind of authentication system are wrong, and their arguments no longer hold today. “Information does not kill people; people kill people,” as Dennis Bailey wrote in The Open Society Paradox. It is time to create a single, secure and standardized national-level ID to replace the patchwork of de facto paper documents currently in use in the United States. An incremental implementation of an Estonian-like system, with a possible opt-out option as in the Canadian approach, could be an appropriate answer to the opponents of a national authentication system.


1/ The Privacy Advocates – Colin J. Bennett, The MIT Press, 2008.

2/ The Open Society Paradox – Dennis Bailey, Brassey’s Inc., 2004.

3/ The Limits of Privacy – Amitai Etzioni, Basic Books, 1999.

4/ E-Estonia: The power and potential of digital identity – Joyce Shen, 2016.

5/ E-Authentication Best Practices for Government – Keir Breitenfeld, 2011.

6/ My life under Estonia’s digital government – Charles Brett, 2015.

7/ Hello Aadhaar, Goodbye Privacy – Jean Drèze, 2017.

Impact of Algorithmic Bias on Society

By Anonymous | December 11, 2018

Artificial intelligence (AI) is being widely deployed in realms where it has never been used before. A few examples of areas where big data and AI techniques are used are selecting candidates for employment, deciding whether a loan should be approved or denied, and using facial recognition for policing. Unfortunately, AI algorithms are treated as a black box in which the “answer” provided by the algorithm is presumed to be the absolute truth. What is missed is that these algorithms are biased for many reasons, including the data used to train them. These hidden biases have a serious impact on society and, in many cases, on the divisions that have appeared among us. In the next few paragraphs we present examples of such biases and what can be done to address them.

Impact of Bias in Education

In her book Weapons of Math Destruction, the mathematician Cathy O’Neil gives many examples of how the mathematics underlying machine learning algorithms can easily cause untold harm to people and society. One example she provides is the goal set by Washington D.C.’s newly elected mayor, Adrian Fenty, to turn around the city’s underperforming schools. To achieve this goal, the mayor hired an education reformer as chancellor of Washington’s schools. This individual, acting on the theory that students were not learning enough because their teachers were not doing a good job, implemented a plan to weed out the “worst” teachers. A new teacher assessment tool called IMPACT was put in place, and teachers whose scores fell in the bottom 2% in its first year of operation, and the bottom 5% in its second year, were automatically fired. From a mathematical standpoint this approach seems perfectly sensible: evaluate the data and optimize the system to get the most out of it. Alas, as O’Neil points out, the factors used to determine the IMPACT score were flawed. Specifically, the score was based on a model that did not have enough data to reduce statistical variance and support accurate conclusions. As a result, teachers in poor neighborhoods who were performing very well on a number of other metrics were the ones penalized by the flawed model. The situation was further exacerbated by the fact that it is very hard to attract and retain talented teachers at schools in poor neighborhoods, many of which are underperforming.

Gender Bias in Algorithms Used By Large Public Cloud Providers

Bias in algorithms is not limited to small entities with limited data. Even large public cloud providers with access to vast numbers of records can easily create algorithms that are biased and cause irreparable harm when used to make impactful decisions. The study cited below provides one such example. Researchers tested whether the algorithms of three major facial recognition AI providers (Microsoft, IBM and Face++) were biased by supplying 1,270 images of individuals originating from Africa and Europe. The sample included subjects from 3 African countries and 3 European countries, with a 54.4% male and 45.6% female split; 53.6% of the subjects had lighter skin and 46.4% had darker skin. When the three companies’ algorithms were asked to classify the gender of the samples, as seen in the figure below, they performed relatively well if one looks only at overall accuracy.

However, on further investigation, as seen in the figure below, the algorithms performed poorly when classifying dark skinned individuals, particularly women. Clearly, any decisions that one makes based on the classification results of these algorithms, would be inherently biased and potentially harmful to dark skinned women in particular.
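An audit like this can be reproduced on any classifier's output by disaggregating accuracy across subgroups instead of reporting a single overall number. Here is a minimal sketch in Python; the record format and group labels are illustrative, not the study's actual data:

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute classification accuracy separately for each subgroup.

    `records` is a list of dicts with illustrative keys:
    'group' (the subgroup a subject belongs to), 'label'
    (the true value), and 'prediction' (the model's output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        if r["prediction"] == r["label"]:
            correct[r["group"]] += 1
    return {g: correct[g] / total[g] for g in total}

# Toy data: overall accuracy looks good (82%), but disaggregation
# reveals that one subgroup fares far worse than the other.
records = (
    [{"group": "lighter male", "label": "M", "prediction": "M"}] * 99
    + [{"group": "lighter male", "label": "M", "prediction": "F"}] * 1
    + [{"group": "darker female", "label": "F", "prediction": "F"}] * 65
    + [{"group": "darker female", "label": "F", "prediction": "M"}] * 35
)
print(accuracy_by_group(records))
# {'lighter male': 0.99, 'darker female': 0.65}
```

The point of the disaggregation is exactly the one the study makes: a single aggregate metric can hide a large per-group gap.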

Techniques to Address Biases in Algorithms

Recognizing that algorithms are potentially biased is the first and most important step toward addressing the issue. How best to reduce bias and improve the performance of algorithms is an active area of research. A number of techniques have been proposed and are being evaluated, ranging from an oath for data scientists, similar to the Hippocratic Oath that doctors pledge, to a conscious effort to use diverse training data that is far more representative of society. There is reason to be optimistic that, although bias in algorithms can never be fully eliminated, its extent will be substantially reduced in the near future.
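One common data-side approach in this spirit is to reweight training samples so that each subgroup contributes equally to the model's loss. The sketch below is only an illustration of that idea, not a technique endorsed by the sources above; the heuristic mirrors the "balanced" class-weight convention popularized by scikit-learn, and the group labels are made up:

```python
from collections import Counter

def balancing_weights(groups):
    """Assign each sample a weight inversely proportional to its
    subgroup's frequency, so every subgroup contributes equally
    to a weighted training loss."""
    counts = Counter(groups)
    n, n_groups = len(groups), len(counts)
    # weight for sample in group g = n / (n_groups * count[g])
    return [n / (n_groups * counts[g]) for g in groups]

# An imbalanced toy sample: 8 "light" subjects, 2 "dark" subjects.
groups = ["light"] * 8 + ["dark"] * 2
weights = balancing_weights(groups)
# Each subgroup's total weight is now equal: 8 * 0.625 == 2 * 2.5 == 5.0
```

Reweighting does not fix a dataset that simply lacks examples of a subgroup, but it prevents the majority group from dominating the objective the model optimizes.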


  1. Cathy O’Neil, 2016, Weapons of Math Destruction, Crown Publishing Company.
  2. How well do IBM, Microsoft and Face++ AI services guess the gender of a face?

Potential Negative Consequences IoT Devices Could Have on Consumers

Potential Negative Consequences IoT Devices Could Have on Consumers
By Anonymous | December 4, 2018

IoT, or the Internet of Things, refers to devices that can collect and transmit data across the internet or to other devices. The number of internet-connected devices owned by consumers has grown rapidly. In the past, a typical person owned only a few connected devices, such as a desktop, a laptop, a router and a smartphone. Now, thanks to technological advances, many people also own televisions, video game consoles, smart watches (e.g. Fitbit, Apple Watch), digital assistants (e.g. Amazon Alexa, Google Home), cars, security systems, appliances, thermostats, locks and lights that all connect and transmit information over the internet.

While companies are constantly trying to find new ways to implement IoT capabilities into the lives of consumers, security seems to be taking a back seat. Therefore, with all of these new devices, it is important for consumers to remain aware of the personal information that is being collected, and to be informed of the potential negative consequences that could result from owning such devices. Here are four things you may want to be aware of:

1. Hackers could spy on you


You have probably heard stories of people who were spied on after the webcams on their laptops were hacked. Other devices have proved hackable too: The Owlet, a wearable baby monitor, was found to be vulnerable, along with SecurView smart cameras. What if someone were able to access your Alexa? They could learn a lot about your personal life from recordings of your conversations. If someone were to hack your smart car, they would know where you are at most times. Recently, researchers uncovered vulnerabilities in Dongguan Diqee vacuum cleaners that could allow attackers to listen in or perform video surveillance.

2. Hackers could sell or use your personal information

It may not seem like a big deal if a device such as your Fitbit is hacked. However, many companies would be interested in obtaining this information and could profit from it. What if an insurance company improved its models with this data and, as a result, raised its rates for customers with poor vital signs? Earlier this year, hackers stole sensitive information from a casino after gaining access to a smart thermometer in a fish tank. If hackers can steal data from companies that prioritize security, they will probably have a much easier time doing the same to an average person. The data you generate is valuable, and hackers will find ways to monetize it.

3. Invasion of privacy by device makers

Our personal information is not only obtainable through hacks. We may be willingly giving it away to the makers of the devices we use. Each device and application has its own policies regarding the data it collects and stores. A GPS app may store your travel history so it can make recommendations in the future. However, it may also use this information to make money on marketing offers for local businesses. Device makers are financially motivated to use your information to improve their products and target their marketing efforts.

4. Invasion of privacy by government agencies

Government agencies are another group that may have access to our personal information. Some agencies, like the FBI, have the power to request data from device makers in order to gather intelligence related to possible threats. Law enforcement may be able to access certain information for investigations; last year, police used data from a woman’s Fitbit to charge her husband with her murder. Lawyers may also be able to subpoena device data in criminal and civil litigation.

IoT devices will continue to play an important role in our lives, forming integrated systems that increase efficiency for everyone. However, consumers should stay informed, and when choosing between brands of devices, such as Alexa versus Google Home, consider the company that better prioritizes the security and privacy issues discussed above. This sends a message that consumers care, and encourages positive change.

The View from The Middle

The View from The Middle
By Anonymous | December 4, 2018

If you are like me, you probably spend quite a bit of time online.

We read news articles online, watch videos, plan vacations, shop and much more. At the same time, we are generating data that is being used to tailor advertising to our personal preferences. Profiles constructed from our personal information are used to suggest movies and music we might like. Data driven recommendations make it easier for us to find relevant content. Advertising also provides revenue for the content providers which allows us to access those videos and articles at reduced cost.

But is the cost really reduced? How valuable is your data and how important is your privacy? Suppose you were sharing a computer with other members of your household. Would you want all your activities reflected in targeted advertising? Most of the time we are unaware that we are under surveillance and have no insight into the profiles created using our personal information. If we don’t want our personal information shared, how do we turn it off?

To answer that question, let’s first see what is being collected. We’ll put a proxy server between the web browser and the internet to act as a ‘Man-in-the-Middle’. All web communication goes through the proxy server which can record and display the content. We can now see what is being shared and where it is going.
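What the proxy observes can be summarized programmatically. Below is a minimal sketch, assuming each captured request is available as a simple dict of URL and headers; the hosts, field names, and helper name are illustrative, not the output of any particular proxy tool:

```python
from urllib.parse import urlsplit

def third_party_requests(page_url, captured):
    """Given the page being visited and the requests the browser issued
    (as recorded by an intercepting proxy), return the requests going to
    third-party hosts -- the traffic ad and analytics networks receive."""
    site = urlsplit(page_url).hostname
    leaks = []
    for req in captured:
        host = urlsplit(req["url"]).hostname
        if host and not host.endswith(site):
            # Record the destination plus headers that carry
            # identifying data along with the request.
            leaks.append({
                "host": host,
                "referer": req["headers"].get("Referer", ""),
                "dnt": req["headers"].get("DNT", ""),
            })
    return leaks

# Hypothetical capture: one first-party call, one tracker call.
page = "https://www.travel.example/search?dest=Lisbon"
captured = [
    {"url": "https://www.travel.example/api/results", "headers": {}},
    {"url": "https://ads.tracker.example/pixel",
     "headers": {"Referer": page, "DNT": "1"}},
]
print(third_party_requests(page, captured))
```

Only the second request is flagged: it goes to a third-party host and carries both the full search URL (in the Referer header) and the Do Not Track flag, which the tracker is free to ignore.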

The Privacy Settings of our Chrome browser allow us to turn off web services that share data. We also enable ‘Do Not Track’ to request that sites not track our browsing habits across websites.

Let’s see what happens when we browse to the webpage of a popular travel site and perform a search for vacation accommodation. In our proxy server we observe that the travel website caused many requests to be sent from our machine to advertising and analytics sites.

We can see requests being made to AppNexus, a company which builds groups of users for targeted advertising. These requests used the X-Proxy-Origin HTTP header to transmit our IP address. Since IP addresses can be associated with a geographic location, this is personal data we may prefer to protect.

Both the Google Marketing Platform and AppNexus are sharing details of the travel search in the Referer HTTP header. They know the intended destination, the dates, and the number of adults and children travelling.
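To see how much a Referer header can leak, one can simply parse its query string. A short sketch with a made-up travel URL (the parameter names are illustrative, not the real site's):

```python
from urllib.parse import urlsplit, parse_qs

def leaked_search_details(referer):
    """Extract the query parameters a tracker can read from a
    Referer header, e.g. the destination and dates of a travel search."""
    query = urlsplit(referer).query
    return {k: v[0] for k, v in parse_qs(query).items()}

ref = ("https://travel.example.com/search"
       "?dest=Lisbon&checkin=2018-12-20&adults=2&children=1")
print(leaked_search_details(ref))
# {'dest': 'Lisbon', 'checkin': '2018-12-20', 'adults': '2', 'children': '1'}
```

Anything the site encodes in the URL, from travel dates to family size, rides along to every third party that receives the header.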

ATDMT is owned by Atlas Solutions, a Facebook subsidiary. It is using a one-pixel image as a tracking bug even though the Do Not Track header is set to true. Clearbrain, a predictive analytics company, is also using a tracking bug.

Now we’ll have a look at the effectiveness of some popular privacy tools:

  1. The Electronic Frontier Foundation’s ‘Privacy Badger’ combined with ‘Adblock Plus’ in Chrome. Privacy Badger is a browser add-on from the Electronic Frontier Foundation that stops advertisers and other third-party trackers from secretly tracking what pages you look at on the web. Adblock Plus is a free open source ad blocker which allows users to customize how much advertising they want to see.
  2. The Cliqz browser with Ghostery enabled. Ghostery is a privacy plugin giving control over ads and tracking technologies. Cliqz is an open source browser designed for privacy.

There are now far fewer calls to third-party websites. Privacy Badger has successfully identified and blocked the ATDMT tracking bug, and our IP address and travel search are no longer being collected. However, neither Privacy Badger nor Ghostery detected the Clearbrain tracker. Since Privacy Badger learns to spot trackers as we browse, it may just need more time to detect this bug.

While these privacy tools are quite effective at giving individuals some control over their personal information, they are by no means a perfect solution. This approach places the burden of protecting privacy on individuals, who do not always understand the risks. And while these tools are designed to be easy to install, many people are unfamiliar with browser plugins.

Furthermore, we are making a trade-off between our privacy and tailored advertising. The content websites we love may be sponsored by the very advertising revenue we are now blocking.

For now, these tools at least offer the ability to make a choice.

The Customer Is Always Right: No Ethics in Algorithms Without Consumer Support

The Customer Is Always Right: No Ethics in Algorithms Without Consumer Support
by Matt Swan | December 4, 2018

There is something missing in data science today: ethics. It seems there is a new scandal every day, with more personal data leaked to any number of bad actors in ever greater quantities. Big Data has quickly given way to Big Data Theft.

The Internet Society of France, a public interest group advocating for online rights, is pushing Facebook to fix the problems that led to its recent string of violations. The group is suing for €100 million (~$113 million USD) and threatening EU-based group action if appropriate remedies are not made. Facebook is also being pursued by a public interest group in Ireland and recently paid a fine of £500,000 (~$649,000 USD) for its role in the Cambridge Analytica breach. Is this the new normal?

Before we answer that question, it might be more prudent to ask why this happened in the first place. That answer is simple.

Dollars dictate ethics.

Facebook’s primary use of our data is to offer highly targeted (read: effective) advertising. Ads are the price of admission and it seems we’ve all come to terms with that. Amid all the scandals and breaches, Facebook made their money – far more money than they paid in fines. And they did it without any trace of ethical introspection. Move fast and break things, so long as they’re not your things.

Dollars dictate ethics.

Someone should be more concerned about this. In recent hearings in the US Congress in early September, there was talk of regulating the tech industry to bring these problems under control. This feels like an encouraging move in the right direction. It isn’t.

First, laws cannot enforce ethical behavior. They can put measures in place to reduce the likelihood of breaches, and punish those who fail to safeguard personal data or to correct algorithms with measurable bias, but they cannot require a company to have a Data Ethicist on the payroll. We’ve already noted that Facebook made more money than it paid in fines, so what motivation does it have to change its behavior?

Second, members of Congress are more likely to believe TensorFlow is a new setting on their Keurig than to know it is an open source machine learning framework. Because of this, organizations such as 314 Action prioritize electing more STEM professionals to government: technology has progressed quickly, government is out of touch, and we need individuals with a thorough understanding of technological methods.

Meanwhile, higher education is making an effort to bring ethics into computer and data science programs, but there are still limitations. Some programs, such as UC Berkeley’s MIDS program, have implemented an ethics course. However, at the time of this writing, no program requires an ethics course for graduation.

Dollars dictate ethics.

Consider the time constraints: only so many courses can be taken. If one program requires an ethics course, programs that do not will have a recruiting advantage, because they can argue the ethics course is a lost opportunity to squeeze in one more technology course. This argument will resonate with prospective students, since there are no Data Ethicist jobs waiting for them and they would prefer to load up on technology-oriented courses.

Also, taking an ethics course does not make one ethical. While each budding data scientist should be forced to consider the effects of his or her actions, doing so is no guarantee of future ethical behavior.

If companies aren’t motivated to pursue ethics themselves and the government can’t force them to be ethical and schools can’t force us to be ethical, how can we possibly ensure the inclusion of ethics in data science?

I’ve provided the answer three times. If it were “ruby slippers”, we’d be home by now.

Dollars dictate ethics.

All the dollars start with consumers. And it turns out that when consumers collectively flex their economic muscles, companies bend and things break. Literally.

In late 2017, Fox News anchor Sean Hannity made some questionable comments regarding a candidate for an Alabama Senate seat. Consumers contacted Keurig, whose commercials aired during Hannity’s show, and complained. Keurig worked with Fox to ensure its ads would no longer be shown at those times, which also resulted in the untimely death of a number of Keurig machines.

The point is this: if we want to effect swift and enduring change within tech companies, the most effective way to do that is through consistent and persistent consumer influence. If we financially support companies that consider the ethical implications of their algorithms, or simply avoid those that don’t, we can create the necessary motivation for them to take it seriously.

But if we keep learning about the newest Facebook scandal from our Facebook feeds, we shouldn’t expect anything more than the same “ask for forgiveness, not permission” attitude we’ve been getting all along.