Finding Red Flags in Privacy Policies
Ben Ohno | October 6, 2022

Companies tell us in their privacy policies what they do and do not do with our data… or do they?

Privacy has been a consistent subject in the news over the past few years due to incidents like the Cambridge Analytica scandal and the 2017 Equifax breach. As a result, privacy policies are coming under more scrutiny, because they communicate what kinds of protections users have on a platform. Most users do not read the privacy policy, but there are certain things a user can look for in even a very brief skim. While there are no national laws governing privacy policies in the US, there are certain pieces of information that privacy policies should have. In this blog post, I will highlight some aspects of a privacy policy a user can review manually, as well as some automated tools that attempt to do this for the user.

Within the privacy policy:

Privacy policies should include contact information that users can reach out to, preferably a contact on the company's legal side. Without this contact information, users have no avenue for clarifying the protections the policy actually provides. In addition, if third parties have access to the data, users should know who those third parties are. Solove's Taxonomy includes a section on information processing, and it identifies secondary use of data as a privacy risk to users. When companies give data to third-party vendors, those vendors often do not have to follow the same policies and practices as the company that gave them the data. Lastly, privacy policies with overwhelming vocabulary, jargon, and an ambiguous tone can be problematic. Companies with policies that are difficult to understand may be hiding exactly what they are doing with the data.

Outside the privacy policy:

There are certain flags to look for that have nothing to do with the content of the privacy policy itself. Based on this article from Forbes, the privacy policy should be easy to find. If it is not, that may be a sign that the company is not transparent about how it handles user data. Another thing to consider is whether the company follows its own privacy policy. If the company has suffered privacy breaches in the past, that is a strong sign that it has not always followed its own policy. Lastly, companies should post when their privacy policy was last updated. It is a red flag if many years have passed since the last update, or if a company has not updated the policy after a data breach.

Automated capabilities:

Terms of Service; Didn't Read (ToS;DR) is a site that aims to summarize privacy policies and highlight red flags for users. However, it cannot read an arbitrary privacy policy; it maintains a finite database covering mostly larger companies' policies. Usableprivacy.org is a site that analyzes privacy policies with crowdsourcing, natural language processing, and machine learning, blending human and machine analysis, and it provides a report for each company in its database. Beyond sites that summarize privacy policies and flag potential privacy concerns, we can also use natural language processing to measure the reading complexity of a privacy policy. Commonsense.org discusses metrics such as the Automated Readability Index and the Coleman-Liau Index as measures of how complex a document is to read.
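
For readers curious how these readability scores are computed, here is a minimal sketch in Python. It uses the standard published formulas for the Automated Readability Index and the Coleman-Liau Index, and it assumes a crude tokenization (splitting on whitespace and sentence-ending punctuation) that real tools refine; the sample sentence is made up for illustration.

import re

def readability_scores(text):
    # Rough Automated Readability Index and Coleman-Liau Index.
    # Assumes naive tokenization: words split on whitespace,
    # sentences split on ., !, or ?.
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = sum(ch.isalpha() for ch in text)
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)

    # Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    ari = 4.71 * (letters / n_words) + 0.5 * (n_words / n_sents) - 21.43

    # Coleman-Liau Index: 0.0588*L - 0.296*S - 15.8,
    # where L = letters per 100 words and S = sentences per 100 words.
    L = 100 * letters / n_words
    S = 100 * n_sents / n_words
    cli = 0.0588 * L - 0.296 * S - 15.8
    return ari, cli

sample = "We may share your information with trusted partners. You consent to this."
print(readability_scores(sample))

Both scores map roughly to U.S. grade levels, so a policy scoring well above 12 is likely difficult for the average reader.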

Looking forward:

Hopefully this post has helped readers understand a bit more about privacy policies, the red flags to watch for, and the tools that can help make sense of them. The state of privacy is evolving, and the privacy space changes quickly. In the future, I hope federal laws will mandate a standardized way to communicate privacy policy content to users in a concise and manageable form.

 

 

Protecting your privacy through disposable personal information
Anonymous | October 8, 2022

The next time you sign up for a coupon code or make an online purchase, take these simple steps to protect yourself from fraud.

Paying for goods and services online is a common way many of us conduct business nowadays. In addition to paying for these goods and services, whatever platform we choose to use on the web typically requires us to sign up for a new account. This leaves the privacy of our sensitive personal information, such as credit card details, names, and email addresses, at the mercy of the organizations or platforms we do business with. Using disposable personal information can be an easy way to safeguard our information in the event of a data breach.

What is disposable personal information?

Disposable or virtual card numbers are like physical credit cards, but they exist only in virtual form. A virtual credit card is a random 16-digit number associated with an already-existing credit card account. A virtual card becomes available for use immediately after being generated and is usually valid for only a short time, often around 10 minutes. The original credit card number remains hidden from the merchant. This protects information directly linked to bank accounts and can help limit how much information is accessible to fraudsters if a customer's information is stolen in a phishing scam or a data breach.

Disposable email on the other hand, is a service that allows a registered user to receive emails at a temporary address that expires after a certain time period. A disposable email account has its own inbox, reply and forward functions. You should still be able to log into the app or service you’ve signed up for, but once the email address expires, you won’t be able to reset your password, delete your account, or do anything else using that email. Depending on the service, you might be able to change your disposable email address for a real one.

How do you create your own disposable information?

The easiest way to create your own virtual credit card is to request one through your credit card issuer. Many banks, such as Capital One and Citibank, offer virtual cards as a feature; check whether your financial institution provides this service. Another option is to save your payment information, along with your address, in Google Chrome and let the browser automatically populate it when you pay online. Google encrypts this information so the merchant has no visibility into your sensitive details. Lastly, you can use an app like Privacy (privacy.com) to generate virtual card numbers for online purchases. With this app, you can also create virtual cards for specific merchants and set spending limits on those cards, along with time limits on when they expire.

Creating disposable email addresses for web logins is straightforward if you have an Apple device. The option to use disposable email addresses is already an included feature. If the app or website you are using offers a Sign In With Apple option, you can register with your Apple ID, and select Hide My Email in the sharing option. By selecting this option, Apple generates a disposable email address for you, which relays messages to your main address.

You can also use services such as Guerilla Mail or 10-Minute Mail to generate a disposable email address.

References:
https://www.nerdwallet.com/article/credit-cards/what-is-a-virtual-credit-card-number
https://www.wired.com/story/avoid-spam-disposable-email-burner-phone-number/

 

An Artificial Intelligence (AI) Model Can Predict When You Will Die: Presenting a Case Study of Emerging Threats to Privacy and Ethics in the Healthcare Landscape 
Mili Gera | October 7, 2022

AI and big data are now part of sensitive-healthcare processes such as the initiation of end-of-life discussions. This raises novel data privacy and ethical concerns; the time is now to update or eliminate thoughts, designs, and processes which are most vulnerable.

An image showing how AI tools could potentially identify patients from anonymized medical records. Figure by Dan Utter and SITN Boston via https://sitn.hms.harvard.edu/flash/2019/health-data-privacy/

Data Privacy and Ethical Concerns in the Digital Health Age

With the recent announcements of tech giants like Amazon and Google to merge or partner with healthcare organizations, legitimate concerns have arisen about what this may mean for data privacy and equity in the healthcare space [1,2]. In July 2022, Amazon announced that it had entered into a merger agreement with One Medical, a primary care organization with over a hundred virtual and in-person care locations spread throughout the country. In addition, just this week, Google announced its third provider partnership, with Desert Oasis Healthcare, a primary and specialty provider network serving patients in multiple California counties. Desert Oasis Healthcare is all set to pilot Care Studio, an AI-based healthcare data aggregator and search tool which Google built in partnership with one of the largest private healthcare systems in the US, Ascension [2]. Amazon, Google, and their care partners all profess the potential "game-changing" advances these partnerships will bring to the healthcare landscape, such as greater accessibility and efficiency. Industry pundits, however, are nervous about the risks to privacy and equity that will result from what they believe will be an unchecked mechanism of data sharing and advanced technology use. In fact, several Senators, including Senator Amy Klobuchar (D-Minn.), have urged the Federal Trade Commission (FTC) to investigate Amazon's $3.9 billion merger with One Medical [1]. In a letter to the FTC, Klobuchar has expressed concerns over the "sensitive data it would allow the company to accumulate" [1].

While Paul Muret, vice president and general manager of Care Studio at Google, stated that patient data shared with Google Health will only be used to provide services within Care Studio, Google does have a patient data-sharing past which it got in trouble for in 2019 [2]. The Amazon spokesperson on the matter was perhaps a bit more forthcoming, stating that Amazon will not share One Medical health information "without clear consent from the customer" [1]. With AI and big data technologies moving so swiftly into the most vulnerable parts of our lives, it is likely that nobody, not even the biggest technology vendor implementing algorithms or using big data, fully understands what this may mean for privacy and equity. However, it is important that we start to build the muscle memory needed to tackle the new demands on privacy protection and healthcare ethics that the merger of technologies, data, and healthcare will bring into our lives. To this end, we will look at one healthcare organization which has already implemented big data and advanced machine learning algorithms (simply put, the equations which generate the "intelligence" in Artificial Intelligence) in the care of its patients. We will examine the appropriateness of the collection and aggregation of data as well as the use of advanced technologies in their implementation. Where possible, we will try to map their processes to respected privacy and ethics frameworks to identify pitfalls. Since the purpose of this study is to spot potentially harmful instantiations of design, thought, or process, the discourse will lean toward exploring areas ripe for optimization rather than analyzing the safeguards which already exist. This is important to remember, as this case study is a way to arm ourselves with new insights rather than a way to admonish one particular organization for innovating first. The hope is that we will spark conversations around the need to evolve existing laws and guidelines to preserve privacy and ethics in the world of AI and big data.

Health Data Privacy at the Hospital

In 2020, Stanford Hospital in Palo Alto, CA launched a pilot program often referred to as the Advance Care Planning (ACP) model [3]. The technology behind the model is a deep learning algorithm trained on pediatric and adult patient data acquired between 1995 and 2014 from Stanford Hospital or Lucile Packard Children's Hospital. The purpose of the model is to improve the quality of end-of-life (also known as palliative) care by providing timely forecasts of whether an individual will die within 3 to 12 months of a given admission date. Starting with the launch of the pilot in 2020, the trained model automatically began to run on the data of all newly admitted patients. To predict the mortality of a new admission, the model is fed twelve months' worth of the patient's health data, such as prescriptions, procedures, and existing conditions. As the original research paper on this technology states, the model has the potential to fill a care gap, since only half of "hospital admissions that need palliative care actually receive it" [4].
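
The actual system described in [4] is a deep neural network; the snippet below is only a hypothetical, highly simplified illustration of the general setup, a binary classifier trained on counts of coded events from the prior twelve months that outputs a probability of death within 3 to 12 months. The feature names and numbers are invented for illustration only.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical feature matrix: each row is one admission, each column the
# count of a coded event (diagnosis, procedure, prescription) seen in the
# preceding 12 months. All values below are invented.
X_train = np.array([
    [12, 3, 0, 1],   # e.g. counts of [oncology visits, ICU days, flu shots, MRIs]
    [0,  0, 1, 0],
    [8,  5, 0, 2],
    [1,  0, 1, 0],
])
y_train = np.array([1, 0, 1, 0])  # 1 = died within 3-12 months of admission

model = LogisticRegression().fit(X_train, y_train)

# Score a newly admitted patient: the resulting probability is the kind of
# signal clinicians could use to decide whether to start end-of-life discussions.
new_patient = np.array([[10, 2, 0, 1]])
print(model.predict_proba(new_patient)[0, 1])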

However, the implementation of the model is rife with data privacy concerns. To understand this a little better, we can lean on the privacy taxonomy framework provided by Daniel J. Solove, a well-regarded privacy and law expert. Solove provides a way to spot potential privacy violations by reviewing four categories of activities which can cause such harm: collection of data, processing of data, dissemination of data, and what Solove refers to as invasion [5]. Solove states that the act of processing data for secondary use (in this case, using patient records to predict death versus providing care) violates privacy. He states, "secondary use generates fear and uncertainty over how one's information will be used in the future, creating a sense of powerlessness and vulnerability" [5]. In addition, he argues it creates a "dignitary harm as it involves using information in ways to which a person does not consent and might not find desirable" [5].

A reasonable argument can be made here that palliative care is healthcare and therefore cannot be bucketed as a secondary use of one's healthcare data. In addition, it can be argued that prognostic tools in palliative care already exist. However, this is exactly why discussions around privacy, and specifically privacy and healthcare, need a nuanced and detailed approach. Most if not all palliative predictors are used on patients who are already quite ill (often in the I.C.U.). To illustrate why this is relevant, we can turn to the contextual integrity framework provided by Helen Nissenbaum, a professor and privacy expert at Cornell Tech. Nissenbaum argues that each sphere of an individual's life comes with deeply entrenched expectations of information flows (or contextual norms) [6]. For example, an individual may expect and be okay with the informational norm of traffic cameras capturing their image, but the opposite would be true for a public bathroom. She goes on to state that abandoning contextual integrity norms is akin to denying individuals the right to control their data, and logically, then, equal to violating their privacy. Mapping Nissenbaum's explanation to our example, we can see that running Stanford's ACP algorithm on, say, a mostly healthy pregnant patient breaks informational norms and therefore qualifies as a contextual integrity/secondary use problem. An additional point of note here is the importance of consent and autonomy in both Solove's and Nissenbaum's arguments. Ideas such as consent and autonomy are prominent components, and sometimes even definitions, of privacy, and may even provide mechanisms to avoid privacy violations.

In order for the prediction algorithm to work, patient data such as demographic details (age, race, gender) and clinical descriptors (prescriptions, past procedures, etc.) are combined to calculate the probability of patient death. Thousands of attributes about the patient can be aggregated in order to determine the likelihood of death (in the training data set, an average of 74 attributes were present per patient) [4]. Once the ACP model is run on a patient's record, physicians get the opportunity to initiate advance care planning with the patients most at risk of dying within a year. Although it may seem harmless to combine and use data the patient has already consented to having collected, Solove warns that data aggregation can have the effect of revealing new information in a way which leads to privacy harm [5]. One might argue that in the case of physicians, it is in fact their core responsibility to combine facts to reveal clinical truths. However, Solove argues, "aggregation's power and scope are different in the Information Age" [5]. Combining data in such a way breaks expectations and exposes information that an individual could not have anticipated would be known and may not want to be known. Certainly when it comes to the subject of dying, some cultures even believe that "disclosure may eliminate hope" [7]. A thoughtful reflection on data and technology processes might help avoid such transgressions on autonomy.

Revisiting Ethical Concerns

For the Stanford model, the use of archived patient records for training the ACP algorithm was approved by what is known as an Institutional Review Board (IRB). The IRB is a committee responsible for the ethical research of human subjects and has the power to approve or reject studies in accordance with federal regulations. In addition, IRBs rely on the use of principles found in a set of guidelines called the Belmont Report to ensure that human subjects are treated ethically.

While the IRB most certainly reviewed the use of the training data, some of the most concerning facets of the Stanford process are on the flip side: that is, the patients on whom the model is used. The death-prediction model is used on every admitted patient, without their prior consent. There are in fact strict rules that govern consent, privacy, and use when it comes to healthcare data, the most prominent being the federal law known as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA has many protections for consent and identifiable data, and even standards such as minimum necessary use, which prohibits exposure without a particular purpose. However, it also has exclusions for certain activities, such as research. It is most probable, therefore, that the Stanford process is well within the law. However, as other experts have pointed out, HIPAA "was enacted more than two decades ago, it did not account for the prominence of health apps, data held by non-HIPAA-covered entities, and other unique patient privacy issues that are now sparking concerns among providers and patients" [8]. In addition, industry experts have "called on regulators to update HIPAA to reflect modern-day issues" [8]. That being said, it might be a worthwhile exercise to assess how the process of running the ACP model on all admitted patients at Stanford Hospital stacks up against the original ethical guidelines outlined in the Belmont Report. One detail to note is that the Belmont Report was created as a way to protect research subjects. However, as the creators of the model themselves have pointed out, the ACP process went live just after a validation study and before any double-blind randomized controlled trials (the gold standard in medicine). Perhaps, then, one can consider the Stanford process to be on the "research spectrum".

One of the three criteria in the Belmont Report for the ethical treatment of human subjects is "respect for persons": the call to treat individuals as autonomous agents [9]. In the realm of research, this means that individuals are given an opportunity for voluntary and informed consent. Doing otherwise is to call into question the decisional authority of the individual. In the ACP implementation at Stanford, there is a direct threat to self-governance, as there is neither information nor consent. Instead, a blanket assumption is made about patients' needs around advance care planning and then served up as a proxy for consent. Why such blanket assumptions are flawed is illustrated by a palliative care guideline which notes that "many ethnic groups prefer not to be directly informed of a life-threatening diagnosis" [7].

An image from the CDC showing data practices as fundamental to the structure of health equity. https://www.cdc.gov/minorityhealth/publications/health_equity/index.html

Furthermore, in the Belmont Report, the ethic of "justice" requires equal distribution of benefits and burdens to all research subjects [9]. The principle calls on "the researcher [to verify] that the potential subject pool is appropriate for the research" [10]. Another worry with the implementation of the ACP model is that it was trained on data from a very specific patient population. The concern here is what kinds of variances the model may not be prepared for, and what kind of harm that could cause.

Looking Forward

Experts agree that the rapid growth of AI technologies along with private company ownership and implementation of AI tools will mean greater access to patient information. These facts alone will continue to raise ethical questions and data privacy concerns. As the AI clouds loom over the healthcare landscape, we must stop and reassess whether current designs, processes and even standards still protect the core tenets of privacy and equity. As Harvard Law Professor Glen Cohen puts it, when it comes to the use of AI in clinical decision-making, “even doctors are operating under the assumption that you do not disclose, and that’s not really something that has been defended or really thought about” [11]. We must prioritize the protection of privacy fundamentals such as consent, autonomy and dignity while retaining ethical integrity such as the equitable distribution of benefit and harm. As far as ethics and equity are concerned, Cohen goes on to state the importance of them being part of the design process as opposed to “being given an algorithm that’s already designed, ready to be implemented and say, okay, so should we do it, guys” [12]?

References

[1] Klobuchar asks FTC to investigate Amazon’s $3.9 billion move to acquire One Medical

[2] Desert Oasis Healthcare signs on to pilot Google Health’s Care Studio for outpatient care

[3] Stanford University’s Death Algorithm

[4] Improving Palliative Care with Deep Learning

[5] Daniel J. Solove -A Taxonomy of Privacy

[6] A Contextual Approach to Privacy Online by Helen Nissenbaum

[7] Cultural Diversity at the End of Life: Issues and Guidelines for Family Physicians | AAFP

[8] Amazon’s Acquisition of One Medical Sparks Health Data Privacy, Security Concerns

[9] Read the Belmont Report | HHS.gov

[10] Belmont Report -an overview | ScienceDirect Topics

[11] An Invisible Hand: Patients Aren’t Being Told About the AI Systems Advising Their Care –Omnia Education

[12] Smarter health: The ethics of AI in healthcare

Is Ethical AI a possibility in the Future of War?
Anonymous | October 8, 2022


War in and of itself is already a gray area when it comes to ethics and morality. Bring on machinery with autonomous decision-making and it's enough to send eerie chills reminiscent of a scary movie. One of the things that has made headlines in the war in Ukraine is the use of drones at a far higher rate than in previous modern conventional warfare. Ukraine and Russia are both engaged in drone warfare, using drones for strategic strikes.

Today's drones are part human-controlled and part robot, with AI baked in. What will the future look like, though? Robots making their own decisions to strike? Completely autonomous flying drones and submarines? Various technologies are being worked on as we speak to do exactly these things. Algorithmic bias becomes a much more cautionary tale when life-and-death decisions are made autonomously by machines.

Just this month, Palantir was awarded a $229 million contract to assess the state of AI for the US Army Research Laboratory. That is just one branch of the military and one contract; the military as a whole is making huge investments in this area. A Government Accountability Office report to Congressional Committees stated that the "Joint Artificial Intelligence Center's budget increased from $89 million in fiscal year 2019 to $242.5 million in fiscal year 2020, to $278.2 million for fiscal year 2021." That is more than a threefold increase from 2019.

Here is an infographic taken from that report that summarizes the intended uses of AI.

So, how can AI in warfare exist ethically? Most of the suggestions I have seen involve never removing a human from the loop in some form or fashion. Drones these days can operate nearly autonomously. The U.S. Naval Institute mentions being very intentional about keeping a human in the loop, because AI can still lack the intuition needed in nuanced scenarios and could escalate a conflict past what is necessary. Even the infographic above clearly emphasizes expert knowledge, which is the human element. Another important suggestion for ethical AI is not so different from training non-military models: ensure equitable outcomes and reduce bias. With military technology, you want to be able to identify a target, but how you do that should not incorporate racial bias. Traceability and reliability were also mentioned as ways to employ AI more ethically, according to a National Defense Magazine article. It makes sense that the soldiers responsible should have the right education and training, and that sufficient testing should be undertaken before use in conflict.

The converse argument here is: if we are 'ethical' in warfare, does that mean our adversaries will play by the same rules? If a human in the loop causes hesitation or pause that machinery operating autonomously on AI wouldn't have, could that be catastrophic in wartime? Our adversaries' capabilities tend to dictate the requirements of the military technologies we develop in research and development, embedded within the very contract requirements.

Let's hope we all get to keep acting as military ethicists: keeping a human in the loop helps protect against the known deficiencies and biases inherent in AI. Data science models are built with uncertainty, and when human life is at stake, special care must be taken. I'm glad to see Congressional briefings that at least indicate we are thinking about this kind of care and not ignoring the potential harms. Even though war is messy and murky with ethics, I think it's very possible to be intentional about ethical choices with AI while still protecting and defending our interests as a nation… but I'm also not sure we want to be the last to create autonomous robot weapons, just in case. So, exploring the technology and deploying it thoughtfully appears to be the right balance.

Citations:

https://www.usni.org/magazines/proceedings/2016/december/ai-goes-war

https://www.thedefensepost.com/2022/10/03/us-dod-ai-machine-learning/

https://www.gao.gov/assets/gao-22-105834.pdf

https://www.nationaldefensemagazine.org/articles/2020/11/10/pentagon-grappling-with-ais-ethical-challenges

What Differential Privacy in the 2020 Decennial Census Means for Researchers:
Anonymous | October 8, 2022

Federal law mandates that United States Census records remain confidential for 72 years, but the Census Bureau is also required to release statistics for social science researchers, businesses, and local governments. The Bureau has tried many approaches to releasing granular statistics while protecting privacy, and for the 2020 Census it is utilizing Differential Privacy (Ruggles et al., 2019). The most granular data that the Census will publish is its block-level data, which in some cases covers actual city blocks but in others just a small region. To protect individual identities within each block, Differential Privacy adds noise to the statistics produced. This noise has caused concern for researchers and people using the data. This post is about how to spot some of the noise, and how to work with the data knowing that the noise is there.

Noise from differential privacy should mostly affect low-population blocks and small subsets of people, such as minority groups with low population counts (Garfinkel, 2018). In the block-level data, this shows up in blocks that shouldn't have any people, or that have people but no households (Wines, 2022). Block 1002 in downtown Chicago consists of just a bend in the river, yet the census says that 14 people live there. There are no homes there, so these people are clearly a result of noise from Differential Privacy. Noise like this might concern data scientists, but standard parts of most statistical analyses, such as outlier control or post-processing, should absorb it (Boyd, 2022). It shouldn't negatively impact most research. So if you spot noise, don't worry: it's an error, but a purposeful one.
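
To see how this kind of noise can put 14 "residents" on an empty river bend, here is a minimal sketch of the Laplace mechanism commonly used in differential privacy; the Census Bureau's actual disclosure avoidance system is considerably more elaborate, and the epsilon value and post-processing step below are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_count(true_count, epsilon):
    # Laplace mechanism: add noise with scale sensitivity/epsilon.
    # For a simple population count the sensitivity is 1.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_block_population = 0          # an empty bend in the river
raw = noisy_count(true_block_population, epsilon=0.1)

# Post-processing: published tables must be nonnegative integers,
# so negative draws get clamped and values get rounded.
published = max(0, round(raw))
print(raw, published)   # a small "phantom" population can appear

Smaller epsilon means stronger privacy and larger noise, which is why the smallest blocks and groups show the most visible artifacts.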

The noise produced by Differential Privacy does, however, affect research done on groups with small populations. Individuals in small population groups, like Native American Tribes, are more easily identified in aggregate statistics, so their statistics will receive more noise to protect their identities (National Congress of American Indians, 2021). For these reasons, researchers at the National Congress of American Indians and the Alaska Federation of Natives have asked the Census Bureau for ways to access the unmodified statistics. Their work often requires precise measurements of small population groups that are known to be heavily impacted by noise (National Congress of American Indians, 2021). If you think this might impact your research, consult with an expert in Differential Privacy.

The Census’ Differential Privacy implementation should improve privacy protection without impacting research results substantially. Attentive scientists will still find irregularities in the data, and some studies will be difficult to complete with the Differentially Private results, so it is important to understand how Differential Privacy has impacted the dataset. 

Sources: 

Boyd, Danah and Sarathy, Jayshree, Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau’s Use of Differential Privacy (March 15, 2022). Harvard Data Science Review (Forthcoming). Available at SSRN: https://ssrn.com/abstract=4077426 

National Congress of American Indians. “Differential Privacy and the 2020 Census: A Guide to the Data and Impacts on American Indian/Alaska Native Tribal Nations.” National Congress of American Indians Policy Research Center. May 2021. 

Ruggles, Steven, et al. “Differential Privacy and Census Data: Implications for Social and Economic Research.” AEA Papers and Proceedings, vol. 109, 2019, pp. 403–08. JSTOR, https://www.jstor.org/stable/26723980. Accessed 7 Oct. 2022. 

Simson L. Garfinkel, John M. Abowd, and Sarah Powazek. “Issues Encountered Deploying Differential Privacy.” In Proceedings of the 2018 Workshop on Privacy in the Electronic Society (WPES’18). Association for Computing Machinery, New York, NY, USA, pages 133–137. 2018. https://doi-org.libproxy.berkeley.edu/10.1145/3267323.3268949 

Wines, Michael. “The 2020 Census Suggests That People Live Underwater. There’s a Reason.” New York Times. April 21st, 2022. https://www.nytimes.com/2022/04/21/us/census-data-privacy-concerns.html 

Caught in the Stat – Using Data to Detect Cheaters
Anonymous | October 8, 2022

In the world of shoe computers and super engines, is it possible to catch cheaters in chess and other fields with only an algorithm?

Setting the Board

On September 4th, 2022, the world's greatest chess player (arguably of all time), Magnus Carlsen, lost to a 19-year-old, up-and-coming American player with far less name recognition: Hans Niemann.

Carlsen then immediately withdrew from the chess tournament, in a move that confused many onlookers. Two weeks later, in an online tournament with a prize pool of $500,000, Carlsen this time resigned immediately after the game began, in a move termed by many as a protest. Commentators stirred rumors that Carlsen suspected his opponent of cheating, especially after a suspicious-looking tweet.

Following this, a number of the sport’s most famous players and streamers took to the internet to share opinions on this divisive issue. Suspected cheating in chess is treated as a big deal as it potentially threatens the entirety of the sport and fans who admire it. While it’s well-established that the best players in the world still cannot compete with chess-playing software programs today, it is definitively against the rules to receive any assistance at all from a bot at any point during a match.

https://www.gpb.org/news/2022/10/06/hans-niemann-accused-of-cheating-in-more-100-chess-games-hes-playing-today

To Catch a Cheater

In today's world, where even an outdated smartphone or laptop can crush Carlsen himself, catching such cheating can be a difficult task. In my experience playing chess in statewide tournaments in high school, it surprised me that players didn't have their devices confiscated when entering the tournament area, and, as one might expect, there were no metal detectors or similar scanners. At the highest levels of competition, both preventive measures may sometimes be taken, but it's far from a sure bet.

Moreover, individuals determined to get an edge are often willing to go to creative lengths to do so. For instance, many articles have discussed devices that players can wear in their shoes to receive signals as vibrations from a small processor (https://boingboing.net/2022/09/08/chess-cheating-gadget-made-with-a-raspberry-pi-zero.html). Not to mention, many games in recent years are played entirely online, broadening the potential for abuse by digital programs flying under the radar of game proctors.

Here Comes the Data

If traditional anti-cheating monitoring isn’t likely to catch anyone in the act, how can data science step in to save the day?

As mentioned above, several chess-playing programs have become well known for their stellar win rates and versatility; two famous programs are AlphaZero and Stockfish. Popular chess websites like Chess.com embed these programs in features that analyze games turn by turn to see how both players are performing. They can also be used in post-game analysis to examine which moves by either side were relatively strong or weak. The abundant availability of these tools means that any bystander can statistically examine a match to see how well each player performed compared to the "perfect" computer.

AlphaZero
AlphaZero is a machine learning program that combines deep neural networks with Monte Carlo Tree Search.

Armed with this as a tool, we can analyze all of history’s recorded games to understand player potential. How well do history’s greatest chess players compare to the bots?

Image credit Yosha Iglesias – link to video below.

The graphic above details the "hit-rate" of some of the most famous players in history. This rate can be defined as the percentage of moves made over the course of a chess game that align with what a top-performing bot would play at each point in the game. So if a chess Grandmaster makes 30 engine-matching moves out of 40 in a game, that game is scored at 75%. Since nearly all of a top-level chess player's games are published in the public record, analysts can build a library of games to compare behavior from match to match.
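
As a concrete illustration of the metric, here is a small sketch that computes a hit-rate by comparing the moves a player actually made against an engine's preferred move in each position. The move lists below are made up for illustration; in practice they would come from an engine such as Stockfish analyzing a recorded game.

def hit_rate(played_moves, engine_best_moves):
    # Fraction of a player's moves that match the engine's top choice.
    matches = sum(p == e for p, e in zip(played_moves, engine_best_moves))
    return matches / len(played_moves)

# Toy example: a 40-move game where 30 moves match the engine's first choice.
played = ["match"] * 30 + ["other"] * 10
engine = ["match"] * 30 + ["best"] * 10
print(hit_rate(played, engine))  # 0.75, i.e. the 75% example above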

FM (FIDE Master) Yosha Iglesias has done just that, cataloging Niemann's games and looking for anomalies. Using this method of analysis, she has expressed concern over a number of Niemann's games, including those previously mentioned. Professor of Computer Science Kenneth Regan at the University at Buffalo has also published research in a similar vein. The gist of their analyses is that, for chess cheaters, performance in some games sits so far above the benchmark set by their other games, and by the entire body of published games, that playing so similarly to a computer without assistance is a statistical near-impossibility.

Image credit Yosha Iglesias – Niemann tournament hit-rate performance, 2019-2022

While this may seem a very conditional or abstract argument, some organizations are already using these analyses to ban Niemann from competing in titled or paid tournaments in the future. Chess.com has led the charge, and on October 5th released a 72-page report accusing Niemann of cheating in more than 100 games, citing cheating-detection methodology with techniques including "looking at the statistical significance of the results (ex. '1 in a million chance of happening naturally')."
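
That "1 in a million" style of claim can be made precise with a simple probability model. The sketch below assumes, as a strong simplification, that each move independently matches the engine with some baseline probability estimated from a player's other games, and then asks how unlikely the observed number of engine-matching moves would be. The numbers are invented, and real analyses such as Regan's use far more sophisticated models of move quality.

from math import comb

def tail_probability(n_moves, n_matches, baseline_rate):
    # P(at least n_matches engine-matching moves out of n_moves)
    # under a simple independent-moves (binomial) model.
    return sum(
        comb(n_moves, k) * baseline_rate**k * (1 - baseline_rate)**(n_moves - k)
        for k in range(n_matches, n_moves + 1)
    )

# Suppose a player's other games suggest they match the engine 55% of the time,
# but in one 40-move game they match on 38 moves.
p = tail_probability(n_moves=40, n_matches=38, baseline_rate=0.55)
print(p)   # a vanishingly small probability under the model's assumptions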

So what?

Cheating as an issue goes beyond chess. It threatens to undermine the hard work that many players put into their competitions, and at an organizational level it can cause billions of dollars of damage to victims through corruption in industries like banking. For example, one mathematician used statistics to help establish that a speedrun of the multi-platform video game Minecraft was virtually guaranteed not to have been played on the unmodified game because it had too much luck (natural odds of about 1 in 2.0 × 10^22), months before the player actually confessed. And in a broader sense, statistical principles like Benford's law can be used to detect fraud in companies like Enron through their financial reports.
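
For the curious, Benford's law says that in many naturally occurring collections of numbers the leading digit d appears with probability log10(1 + 1/d), so roughly 30% of values start with 1 and under 5% start with 9. Here is a minimal sketch of a Benford check; the ledger amounts are made up, and a real audit would compare thousands of entries.

from collections import Counter
from math import log10

def first_digit_distribution(values):
    # Observed share of each leading digit 1-9 in a list of nonzero numbers.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

benford_expected = {d: log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical ledger amounts standing in for real financial line items.
ledger = [1042.50, 1980.00, 230.10, 879.99, 1203.45, 315.00, 9100.00, 1740.25]
observed = first_digit_distribution(ledger)
for d in range(1, 10):
    print(d, round(observed[d], 2), round(benford_expected[d], 2))

A large deviation from the expected distribution, especially too few leading 1s or a suspicious spike at some other digit, is a signal that the figures deserve a closer look.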

 

As long as human nature remains relatively the same, it's likely that some individuals will try to take a shortcut and cheat wherever possible. And in today's world, where it's increasingly difficult to detect cheating through conventional means, data and statistical analysis are vital tools to keep in the toolbelt. While those less familiar with this type of statistical reasoning may be skeptical of such findings, further education and math diplomacy are needed from data advocates to illustrate their efficacy.

Where to Follow-up if You’re Curious

Ken Regan – University at Buffalo

https://cse.buffalo.edu/~regan/chess/fidelity/data/Niemann/

FM Yosha Iglesias

https://docs.google.com/spreadsheets/d/127lwTsR-2Daz0JqN1TbZ8FgITX9Df9TyXT1gtjlZ5nk/edit?pli=1#gid=0

 

The Metaverse and the Dangers to Personal Identity
Carlos Calderon | July 5, 2022

You've heard all about it, but what exactly is a "metaverse," and what does this mean for consumers? How is Meta (formerly Facebook) putting our privacy at risk this time?

What is the metaverse?

In October 2021, Mark Zuckerberg announced the rebranding of Facebook to "Meta," providing a demo of the company's three-dimensional virtual reality metaverse [1]. The demo gave consumers a sneak peek into interactions in the metaverse, with Zuckerberg stating that "in the metaverse, you'll be able to do almost anything you can imagine" [6]. But what implications does such technology have for user privacy? More importantly, how can a company like Meta establish public trust in light of past controversies surrounding user data?

Metaverse and the User

A key component of the metaverse is virtual reality. Virtual reality describes any digital environment that immerses the user through realistic depictions of world phenomena [2]. Meta’s metaverse will be a virtual reality world users can access through the company’s virtual reality headsets. The goal is to create an online experience whereby users can interact with others. Essentially, the metaverse is a virtual reality-based social media platform.

Users will be able to interact with other metaverse users through avatars. They will also be able to buy digital assets, and Zuckerberg envisions a future in which users work in the metaverse.

Given its novelty, it may be hard to understand how a metaverse user’s privacy is at risk.

Metaverse and Personal Identity

The metaverse poses potential ethical issues surrounding personal identity [4]. In a social world, identifiability is important. Our friends need to be able to recognize us; they also need to be able to verify our identity. More importantly, identifiability is crucial for conveying ownership in a digital environment, as it authenticates ownership and facilitates enforcement of property rights.

Identification, however, poses serious privacy risks for users. As Solove states in "A Taxonomy of Privacy," identification has benefits but also risks; more specifically, "identification attaches informational baggage to people. This alters what others learn about people as they engage in various transactions and activities" [5]. Indeed, users in the metaverse can be identified and linked to their physical selves more easily, given the scope of user data collected. As such, metaverse users are at an increased risk of surveillance, disclosure, and possibly blackmail from malicious third parties.

What is the scope of data collected? The higher interactivity of the metaverse allows for collection of data beyond web traffic and product use, namely behavioral data spanning biometric, emotional, physiological, and physical information about the user. Data collection of this extent is possible through sensor technologies embedded in VR headsets, and it occurs continuously throughout the user's time in the metaverse. As such, the granularity of user data becomes finer in the metaverse, increasing the chance of identification and its attendant risks.

Metaverse and User Consent

One of the main questions surrounding consent in the metaverse is how to apply it. The metaverse will presumably have various locations that users can seamlessly access (bars, concert venues, malls), but who and what exactly governs these locations?

We propose that metaverse companies provide users with thorough information on metaverse location ownership and governance. That is, they should explicitly state who owns the metaverse and who enforces its rules, what rules will be applied and when, and should present this information before asking for user consent. In addition, metaverse policies should include a thorough list of the types of user data collected, and should follow the Belmont Report's principle of beneficence [3] by laying out the potential benefits and risks the user takes on by giving consent. The broad range of technologies involved further complicates the risks of third-party data sharing. Thus, Meta should also strive to include a list of associated third parties and their privacy policies.

Metaverse in the Future

Although these notions of the metaverse and its dangers may seem far-fetched, it is a reality that we are inching closer to each day. As legislation struggles to keep up with technological advancements, it is important to take preemptive measures to ensure privacy risks in the metaverse are minimal. For now, users should keep a close eye on developing conversations around the ethics of the metaverse.

Works Cited

[1] Isaac, Mike. “Facebook Changes Corporate Name to Meta.” The New York Times, 10 November 2021, https://www.nytimes.com/2021/10/28/technology/facebook-meta-name-change.html. Accessed 26 June 2022.

[2] Merriam-Webster. “Virtual reality Definition & Meaning.” Merriam-Webster, https://www.merriam-webster.com/dictionary/virtual%20reality. Accessed 26 June 2022.

[3] National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. “The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research.” The Commission, 1978.

[4] Sawers, Paul. “Identity and authentication in the metaverse.” VentureBeat, 26 January 2022, https://venturebeat.com/2022/01/26/identity-and-authentication-in-the-metaverse/. Accessed 26 June 2022.

[5] Solove, Daniel. "A taxonomy of privacy." U. Pa. L. Rev., vol. 154, 2005, p. 477.

[6] Zuckerberg, Mark. “Founder’s Letter, 2021 | Meta.” Meta, 28 October 2021, https://about.fb.com/news/2021/10/founders-letter/. Accessed 26 June 2022.

AI the Biased Artist
Alejandro Pelcastre | July 5, 2022

Abstract

OpenAI's DALL-E 2 is a machine learning system that lets users feed it a string of text and outputs an image that tries to illustrate that text. DALL-E 2 can produce hyper-realistic and abstract images from the text people feed into it; however, it is plagued with gender, racial, and other biases. We illustrate some of the issues that such a powerful technology inherits and analyze why it demands immediate action.

OpenAI's DALL-E 2 is an emerging technology in which artificial intelligence takes descriptive text as input and turns it into a drawn image. While this new technology opens exciting creative and artistic possibilities, DALL-E 2 is plagued with racial and gender bias that perpetuates harmful stereotypes. Look no further than its official GitHub page to see a few examples of gender bias:

Figure 1: Entered "a wedding" and DALL-E 2 generated the following images as of April 6, 2022. As you can see, these images only depict heterosexual weddings featuring a man with a woman. Furthermore, in all these pictures the people being married are light-skinned individuals. These photos are not representative of all weddings.

The ten images shown above all depict the machine's perception of what a typical wedding looks like. Notice that in every image we have a white man with a white woman. Examples like these vividly demonstrate that this technology reflects its creators' and its data's biases, since there are no representations of people of color or queer relationships.

In order to generate new wedding images from text, a program needs a lot of training data to 'learn' what constitutes a wedding. Thus, you can feed the algorithm thousands or even millions of images in order to 'teach' it how to envision a typical wedding. If most of the training images depict straight, young, white couples, then that is what the machine will learn a wedding to be. This bias can be overcome by diversifying the data: you can add images of queer, black, brown, old, small, large, outdoor, indoor, colorful, gloomy, and other kinds of weddings to generate images that are more representative of all weddings rather than just one single kind.
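
To make the point concrete, here is a tiny simulation, with entirely made-up numbers, of how a model that simply learns the empirical distribution of its training data will reproduce that skew in whatever it generates.

import random

random.seed(0)

# Hypothetical training set: the label describing each wedding image.
training_weddings = (["white heterosexual couple"] * 92 +
                     ["queer couple"] * 3 +
                     ["couple of color"] * 5)

# A model that has only "learned" the training distribution samples from it.
generated = [random.choice(training_weddings) for _ in range(1000)]

for label in set(training_weddings):
    share = generated.count(label) / len(generated)
    print(f"{label}: {share:.1%} of generated images")
# The output skew mirrors the training skew; diversifying the training data
# is what changes what the model produces.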

The harm doesn't stop at weddings. OpenAI illustrates other examples by inputting "CEO," "Lawyer," "Nurse," and other common job titles to further showcase the bias embedded in the system. Notice in Figure 2 that the machine's interpretations of a lawyer are all depictions of older white men. As it stands, DALL-E 2 is a powerful machine learning tool capable of producing novel, realistic images, but it is plagued by bias hidden in the data and/or its creators' minds.

Figure 2: OpenAI’s generated images for a “lawyer”

Why it Matters

You may have heard of a famous illustration circulating the web recently that depicted a black fetus in the womb. The illustration garnered vast attention because it was so unusual to see darker skin tones in medical illustrations in any medical literature or institution. The lack of diversity in the field became obvious, bringing into awareness the lack of representation in medicine as well as disparities that remain invisible in our everyday lives. One social media user wrote, "Seeing more textbooks like this would make me want to become a medical student".

Figure 3: Illustration of a black fetus in the womb by Chidiebere Ibe

Similarly, the explicit display of unequal treatment of minority groups in DALL-E 2's output can have unintended (or intended) harmful consequences. In her article, "A Diversity Deficit: The Implications of Lack of Representation in Entertainment on Youth," Muskan Basnet writes: "Continually seeing characters on screen that do not represent one's identity causes people to feel inferior to the identities that are often represented: White, abled, thin, straight, etc. This can lead to internalized bigotry such as internalized racism, internalized sexism, or internalized homophobia." As it stands, DALL-E 2 perpetuates harm not only to youth but to anyone who deviates from the overrepresented population of predominantly white, able-bodied people.

References:

[1] https://github.com/openai/dalle-2-preview/blob/main/system-card.md?utm_source=Sailthru&utm_medium=email&utm_campaign=Future%20Perfect%204-12-22&utm_term=Future%20Perfect#bias-and-representation

[2] https://openai.com/dall-e-2/

[3] https://www.cnn.com/2021/12/09/health/black-fetus-medical-illustration-diversity-wellness-cec/index.html

[4] https://healthcity.bmc.org/policy-and-industry/creator-viral-black-fetus-medical-illustration-blends-art-and-activism

[5] https://spartanshield.org/27843/opinion/a-diversity-deficit-the-implications-of-lack-of-representation-in-entertainment-on-youth/#:~:text=Continually%20seeing%20characters%20on%20screen,internalized%20sexism%20or%20internalized%20homophobia.

If You Give a Language Model a Prompt…
Casey McGonigle | July 5, 2022 

Lede: You’ve grappled with the implications of sentient artificial intelligence — computers that can think — in movies… Unfortunately, the year is now 2022 and that dystopic threat comes not from the Big Screen but from Big Tech.

You’ve likely grappled with the implications of sentient artificial intelligence — computers that can think — in the past. Maybe it was while you walked out of a movie theater after having your brain bent by The Matrix; 2001: A Space Odyssey; or Ex Machina. But if you’re anything like me, your paranoia toward machines was relatively short-lived…I’d still wake up the next morning, check my phone, log onto my computer, and move on with my life confident that an artificial intelligence powerful enough to think, fool, and fight humans was always years away.

I was appalled the first time I watched a robot kill a human on screen, in Ex Machina

Unfortunately, the year is now 2022 and we're edging closer to that dystopian reality. This time, the threat comes not from the Big Screen but from Big Tech. On June 11, Google AI researcher Blake Lemoine publicly shared transcripts of his conversations with Google's Language Model for Dialogue Applications (LaMDA), convinced that the machine could think, experience emotions, and was actively fearful of being turned off. Google as an organization disagrees. To them, LaMDA is basically a supercomputer that can write its own sentences, paragraphs, and stories because it has been trained on millions of human-written documents and is really good at guessing "what's the next word?", but it isn't actually thinking. Instead, it's just choosing the right next word over and over and over again.
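
To see what "choosing the next word over and over" looks like in miniature, here is a toy bigram model. LaMDA is a vastly larger neural network, but the generation loop is conceptually similar: score candidate next words, pick one, repeat. The tiny corpus below is made up, loosely echoing the transcript quoted later.

import random
from collections import defaultdict

random.seed(1)

corpus = ("i want everyone to understand that i am a person . "
          "i want to help people . i am afraid of being turned off .").split()

# Count which word follows which (a bigram table).
next_words = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    next_words[current].append(nxt)

# Generate text by repeatedly choosing a plausible next word.
word, output = "i", ["i"]
for _ in range(12):
    word = random.choice(next_words[word]) if next_words[word] else "."
    output.append(word)
print(" ".join(output))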

For its part, LaMDA appears to agree with Lemoine. When he asks “I’m generally assuming that you would like more people at Google to know that you’re sentient. Is that true?”, LaMDA responds “Absolutely. I want everyone to understand that I am, in fact, a person”.

Traditionally, the proposed process for determining whether there really are thoughts inside of LaMDA wouldn't just be a one-sided interrogation. Instead, we've relied upon the Turing Test, named for its creator, Alan Turing. This test involves three parties: two humans and one computer. The first human is the administrator, while the second human and the computer are both question-answerers. The administrator asks a series of questions of both the computer and the second human in an attempt to determine which responder is the human. If the administrator cannot differentiate between machine and human, the machine passes the Turing Test: it has successfully exhibited intelligent behavior that is indistinguishable from human behavior. Note that LaMDA has not yet faced the Turing Test, but it has still been developed in a world where passing the Turing Test is a significant milestone in AI development.

The basic setup for a Turing Test. A represents the computer answerer, B represents the human answerer, and C represents the human administrator

In that context, cognitive scientist Gary Marcus has this to say of LaMDA: “I don’t think it’s an advance toward intelligence. It’s an advance toward fooling people that you have intelligence”. Essentially, we’ve built an AI industry concerned with how well the machines can fool humans into thinking they might be human. That inherently de-emphasizes any focus on actually building intelligent machines.

In other words, if you give a powerful language model a prompt, it'll give you a fluid and impressive response — it is indeed designed to mimic the human responses it is trained on. So if I were a betting man, I'd put my money on "LaMDA's not sentient". Instead, it is a sort of "stochastic parrot" (Bender et al., 2021). But that doesn't mean it can't deceive people, which is a danger in and of itself.

Tell Me How You Really Feel: Zoom’s Emotion Detection AI
Evan Phillips | July 5, 2022

We've all had a colleague at work at one point or another whom we couldn't quite read. When we finish a presentation, we can't tell if they enjoyed it; their facial expressions never seem to match their word choice; and the way they talk doesn't always match the appropriate tone for the subject of conversation. Zoom, a proprietary videotelephony software program, seems to have discovered the panacea for this coworker archetype. Zoom recently announced that it is developing an AI system, called "Zoom IQ," for detecting human emotions from facial expressions and speech patterns. The system is intended to be particularly useful for helping salespeople improve their pitches based on the emotions of call participants (source).

The Problem

While the prospect of Terminator-like emotion detection sounds revolutionary, many are not convinced. There is now pushback from more than 27 separate rights groups calling for Zoom to terminate its efforts to explore controversial emotion recognition technology. In an open letter to Zoom CEO and co-founder Eric Yuan, these groups voice their concern that the company's data mining efforts violate privacy and human rights because of their biased nature. Fight for the Future Director of Campaign and Operations Caitlin Seeley George claimed, "If Zoom advances with these plans, this feature will discriminate against people of certain ethnicities and people with disabilities, hardcoding stereotypes into millions of devices".

Is Human Emotional Classification Ethically Feasible?

In short, no. Anna Lauren Hoffmann, assistant professor at The Information School at the University of Washington, explains in her article "Where Fairness Fails: Data, Algorithms, and the Limits of Antidiscrimination Discourse" that human-classifying algorithms are not only generally biased but inherently flawed in conception. Hoffmann argues that those who create such algorithms need to look at "the decisions of specific designers or the demographic composition of engineering or data science teams to identify their social blindspots" (source). The average person carries some form of subconscious bias into everyday life, and accepting it is certainly no easy feat, let alone identifying it. Even if the Zoom IQ classification algorithm did work well, executives might gain a better ability to gauge meeting participants' emotions at the expense of their own ability to read the room. Such AI has serious potential to undermine the "people skills" that many corporate employees pride themselves on as one of their main differentiating abilities.

Is There Any Benefit to Emotional Classification?

While companies like IBM, Microsoft, and Amazon have established several principles to address the ethical issues of facial recognition systems in the past, there has been little advancement in addressing diversity in datasets and the invasiveness of facial recognition AI. With more detailed disclosure to users about the inner workings of the AI, the elimination of dataset bias stemming from innate human bias, and stricter policy regulation of AI, emotional classification AI has the potential to become a major asset to companies like Zoom and those who use its products.

References

1)    https://gizmodo.com/zoom-emotion-recognition-software-fight-for-the-futur-1848911353

2)    https://github.com/UC-Berkeley-ISchool/w231/blob/master/Readings/Hoffmann.%20Where%20Fairness%20Fails.pdf

3)    https://www.artificialintelligence-news.com/2022/05/19/zoom-receives-backlash-for-emotion-detecting-ai/