Implications of Advances in Machine Translation

By Cathy Deng | April 2, 2021

On March 16, graduate student Han Gao wrote a two-star review of a new Chinese translation of the Uruguayan novel La tregua. Posted on the popular Chinese website Douban, her comments were brief, yet biting – she claimed that the translator, Ye Han, was unfit for the task, and that the final product showed “obvious signs of machine translation.” Eleven days later, Gao apologized and retracted her review. This development went viral because the apology had not exactly been voluntary – friends of the affronted translator had considered the review to be libel and reported it to Gao’s university, where officials counseled her into apologizing to avoid risking her own career prospects as a future translator.

Gao’s privacy was hotly discussed: netizens felt that though she’d posted under her real name, Gao should have been free to express her opinion without offended parties tracking down an organization with power over her offline identity. The translator and her friends had already voiced their disagreement and hurt; open discussion alone should have been sufficient, especially when no harm occurred beyond a level of emotional distress that is ostensibly par for the course for anyone who exposes their work to criticism by publishing it.

Another opinion, however, was that spreading misinformation should carry consequences because by the time the defamed party could respond, the damage was often already done. Hence, the next question was: was Gao’s post libelous? Quality may be a matter of opinion, but machine translation came down to integrity. To this end, another Douban user extracted snippets from the original novel and compared Han’s 2020 translation to a 1990 rendition by another translator, as well as to corresponding outputs from DeepL, a website providing free neural machine translation. This analysis led to two main conclusions: that Han’s work was often similar in syntax and diction to the machine translation, more so than its predecessor; and that observers agreed that the machine translation was, in some cases, superior to its human competition. The former may seem incriminating, but the latter much less so: after all, if Han had seen the automated translation, wouldn’t she make it better, not worse? Perhaps the similarities were caused merely by lack of training (Han was not formally educated in literary translation).

Researchers have developed methods to detect machine translations, such as assessing similarity between the text in question and its back-translation (e.g. translated from Chinese to Spanish, then back to Chinese). But is this a meaningful task for the field of literary translation? Machine learning has evolved such that models are capable of generating or translating text to be nearly indistinguishable from, or sometimes even more enjoyable than, the “real thing.” The argument that customers always “deserve” fully manual work is outdated. And relative to the detection of deep fakes, detecting machine translations is not as powerful in combating misinformation.
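To make the back-translation idea concrete, here is a minimal sketch in Python. It assumes the premise described above: a text that survives a round trip through another language nearly unchanged looks more machine-like. The `round_trip` callable is a hypothetical stand-in for whatever translation service is used (no real translation API is called here), and character bigram overlap is just one simple similarity measure, chosen because it works on Chinese text without word segmentation.

```python
from typing import Callable


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams in text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def back_translation_similarity(
    text: str, round_trip: Callable[[str], str], n: int = 2
) -> float:
    """Score how much of `text` survives a round trip through another language.

    `round_trip` is a hypothetical function (e.g. Chinese -> Spanish -> Chinese)
    supplied by the caller.
    """
    return jaccard(char_ngrams(text, n), char_ngrams(round_trip(text), n))
```

A score near 1 would at most be weak evidence, to be weighed alongside other signals; as the post notes, such detectors are far from reliable.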

Yet I believe assessing similarity to machine translation remains a worthwhile pursuit. It may never be appropriate as a measure of professional integrity because the times of being able to ascertain whether the translator relied on automated methods are likely behind us. Similar to the way plagiarism detection tools are disproportionately harsh on international students, a machine detection tool for translation (currently only 75% accurate at best) may unfairly punish certain styles or decisions. Yet a low level of similarity may well be a fine indicator of quality if combined with other methods. If even professional literary translators might flock to a finite number of ever-advancing machine translation platforms, it is the labor-intensive act of delivering something different that reveals the talent and hard work of the translator. Historically, some of the best translators worked in pairs, with one providing a more literal interpretation that the other then enriched with artistic flair; perhaps algorithms could now play the former role, but the ability to produce meaningful literature in the latter may be the mark of a translator who has earned their pay. After all, a machine can be optimized for accuracy or popularity or controversy, but only a person can rejigger its outputs to reach instead for truth and beauty – the aspects about which Gao expressed disappointment in her review.

A final note on quality: the average number of stars on Douban, as on other review sites, was meant to indicate quality. Yet angry netizens have flooded the works of Han and her friends with one-star reviews, a popular tactic that all but eliminates any relationship between quality and average rating.
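A quick bit of arithmetic shows how little it takes for a flood of one-star reviews to swamp an average. The ratings below are invented purely for illustration and are not taken from Douban.

```python
def average_rating(ratings):
    """Plain arithmetic mean of star ratings, rounded to two decimals."""
    return round(sum(ratings) / len(ratings), 2)


# Hypothetical data: eight organic reviews, then a wave of 24 one-star reviews.
organic = [4, 5, 3, 4, 4, 5, 3, 4]
flood = [1] * 24

print(average_rating(organic))          # 4.0
print(average_rating(organic + flood))  # 1.75
```

Once the flood outnumbers the organic reviews three to one, the average mostly measures the size of the flood rather than the quality of the book.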


How private are Zoom meetings?

by Gerardo Mejia | April 2, 2021

This topic caught my attention, especially after the pandemic, because I see people using Zoom to replace human interaction more and more every day. Zoom is used throughout the day for many things, including work, education, and personal meetings. At first, I thought that privacy issues were mostly limited to personal meetings, but I later learned that there are privacy concerns in both education and the workplace.

Personal Meetings

My initial interest in the topic was due to my observations of people using Zoom for things like birthday parties, bridal showers, baby showers, and other non-traditional uses of Zoom. I became interested in whether Zoom itself monitors or listens in on those calls. I was convinced that somewhere in their privacy policy there would be some type of loophole that would allow them to listen in on calls for the purposes of troubleshooting or ensuring the service was working. I was a bit disappointed, and relieved, when I read that meetings themselves are considered “Customer Content” and that the company did not monitor, sell or use the customer content for any purpose other than to provide it to the customer.

However, there was a small, although not too obvious, loophole. Zoom considers this “Customer Content” to be under the user’s control, including its security, and thus it cannot guarantee that unauthorized parties will not access this content. I came to find out later that this is a major loophole that has been exploited in many instances. Although Zoom doesn’t take responsibility for this, there are many people who blame the company for not upgrading its security features. This all means that somebody would have to hack their way into my family’s private meeting in order to listen in. I believe that for most family gatherings the risk of this happening is not very high, so I would say that most family Zoom meetings are private as long as they are not the target of a hacker.


Education

I had initially thought that the education field was not heavily affected by Zoom’s privacy or security issues. After all, most educators have trouble getting all their students to attend, and who is going to want to hack into a class? I was wrong about that too. The most notorious example occurred in China, where Zoom assisted the Chinese government in censoring content that it did not agree with. It is also important to note that in addition to classes, schools hold other types of meetings that are more private in nature and that put sensitive information, like grades or school records, at risk. These could also become targets of malicious hackers. In conclusion, while censorship may not be a large issue in the United States, there are some countries where it is a real issue.


Workplace

I remembered that Zoom is on my company’s prohibited software list. I also learned that most tech companies have banned their employees from using Zoom for work. I initially thought that this was because Zoom’s privacy policy or terms of use allowed Zoom employees to listen in, making the meetings not secure enough because a third party could be listening. It turns out that Zoom’s privacy policy states that they will not listen in on or monitor the meetings. However, as with personal and education meetings, it is up to the company to secure its meetings, and Zoom cannot guarantee that unauthorized users will not access the content. As a result, Zoom cannot be held responsible if a company’s meeting is hacked and accessed by an unauthorized user. Companies are targeted by hackers all the time, so the risk of their Zoom meetings being hacked is large, especially for high-profile companies.

Rise of Voice Assistants

by Lucas Lam | April 2, 2021

Many horror stories have surfaced as a result of the rise of voice assistants. From Alexa giving a couple some unwarranted advice to Alexa threatening someone with Chuck Norris, many creepy, perhaps crazy, stories have made the rounds. Without questioning the validity of these stories or getting deep into conspiracy theories, we recognize that the rise of voice assistants like the Echo from Amazon and Google Home from Google has given, and will continue to give, way to more privacy concerns. Yet, as it is getting harder and harder to get away from them, we must understand what kind of data they are collecting, how we can take measures to protect our privacy, and how we can have peace of mind when using the product.

What are they collecting?
In the words of a How-To Geek article: “Alexa is always listening but not continually recording.” Voice assistants are triggered by wake words. For Amazon’s device, the wake word is “Alexa”. When the blue ring light appears, the device captures audio input and sends it to the cloud, where the request is processed and stored; a response is then sent back to the device so it can perform the task. Anything said after a wake word is fair game for a virtual assistant to record. Alexa’s Privacy Hub says that “all interactions with Alexa are encrypted in transit to Amazon’s cloud where they are securely stored,” meaning that sending the recorded audio to the cloud and back is intended to be a secure process. Once a request is processed, the interaction is stored, but users can choose to delete their recordings.
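The “always listening but not continually recording” behaviour can be sketched as a small gate. This is a toy model that works on words rather than audio and is not Amazon’s implementation; the `<silence>` end marker and the function name are assumptions made for illustration.

```python
WAKE_WORD = "alexa"  # the wake word named in the text


def capture_requests(words, wake_word=WAKE_WORD, end_token="<silence>"):
    """Return only the phrases spoken between a wake word and the next pause.

    While idle, each word is checked against the wake word and then discarded.
    `end_token` is a hypothetical marker for the end of an utterance.
    """
    requests = []
    current = None  # None means idle: listening, but not recording
    for word in words:
        if current is None:
            if word.lower() == wake_word:
                current = []  # start recording after the wake word
        elif word == end_token:
            if current:
                requests.append(" ".join(current))
            current = None  # back to idle
        else:
            current.append(word)
    if current:  # stream ended in the middle of a request
        requests.append(" ".join(current))
    return requests


stream = "we should buy milk <silence> Alexa what time is it <silence> private chat"
print(capture_requests(stream.split()))  # ['what time is it']
```

In this model only the captured phrases would ever leave the device; everything said while idle is dropped.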

When users don’t actively delete their recordings, that’s information that these companies can harness to “understand” you better, give more targeted and customized responses, make more precise recommendations, etc. Though this can be considered creepy, the real threat doesn’t come when the virtual assistant understands your preferences better; it comes when that understanding gets into the hands of other people.

Potential Threats
What are some of the real threats when it comes to virtual assistants?

Any mishap in triggering the wake word will lead to unwelcome eavesdropping. Again, these voice assistants are always listening for their wake words, so a word that is mistaken for “Alexa” will inadvertently cause audio to be recorded and a response to be returned. That is why it is of utmost importance that companies optimize their algorithms to reduce false positives and increase the precision of wake-word detection. One major threat is that these recordings can land in the hands of people working on the product, from the machine learning engineers to the transcribers who work with this kind of data to improve the services of the device. Though personally identifiable information should be encrypted, an article in Bloomberg revealed that transcribers potentially have access to first names, device serial numbers, and account numbers.
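For readers less familiar with the terms, precision here is the share of wake-word activations that were real. The sketch below uses invented counts, not figures from any vendor.

```python
def wake_word_metrics(true_pos, false_pos, false_neg):
    """Precision and recall of a wake-word detector from raw counts.

    true_pos:  activations where the wake word was actually said
    false_pos: activations triggered by something else (unintended recordings)
    false_neg: times the wake word was said but missed
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = wake_word_metrics(true_pos=950, false_pos=50, false_neg=20)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Every false positive in this tally corresponds to a snippet of audio recorded without the user intending it, which is why precision matters so much for privacy.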

Hacking is another possible threat. According to an article from Popular Mechanics, a German security consulting firm found that voice data can be hacked through third-party apps. Hackers can attempt phishing by getting these virtual assistants to ask for a password or other sensitive information in order for a request to be processed. Active security measures must be put in place to prevent such activity.

What to do?
There are some possible threats, and their consequences can escalate. The odds of something like this happening to the average person are low, but for anyone who fears the consequences, many things can be done to protect one’s data privacy, from setting up automatic deletion of voice recordings to going file by file to delete them. Careful use, and careful investigation of your ability to protect your own privacy, can give you greater peace of mind every time you go home and talk to Alexa.


Invisible Data: Its Impact on Ethics, Privacy and Policy

By Anil Imander | April 5, 2021

A Tale of Two Data Domains

In the year 1600, Giordano Bruno, an Italian philosopher and mystic, was charged with heresy. He was paraded through the streets of Rome, tied to a stake, and then set afire. To ensure his silence in his last minutes, a metal spike was driven through his tongue. His crime – believing that earth is another planet revolving around the sun!

Almost exactly a century later, in 1705, the queen of England knighted Isaac Newton. One of the achievements of Newton was the same one for which Giordano Bruno was burnt alive – proving that earth is another planet revolving around the sun!

Isn’t this strange! Same set of data and interpretations but completely different treatment of the subjects.

What happened?

Several things changed during the 100 years between Bruno and Newton. The predictions of Copernicus, the data collected by Tycho Brahe and Kepler’s laws remained the same. Newton did come up with a better explanation of the observed phenomena using calculus, but the most important change was not in the data or its interpretation. The real change was invisible. Most importantly, Newton had political support from royalty, and the Protestant branch of Christianity was more receptive to ideas challenging the church and the Bible. Many noted scientists had used Newton’s laws to understand and explain the observed world, and many in the business world had found practical applications for them. Newton had suddenly become a rockstar in the eyes of the world.

This historical incident and thousands of similar incidents highlight the fact that data has two distinct domains – Visible and Invisible.

The visible domain deals with the actual data collection, algorithms, model building and analysis. This is the focus of today’s data craze. The visible domain is the field of Big Data, Statistics, Advanced Analytics, Data Science, Data Visualization and Machine Learning.

The invisible domain is the human side of data. It is difficult to comprehend, not easily understood, not well defined, and is subjective. We tend to believe that data has no emotions, belief systems, culture, biases or prejudices. But data in itself is completely useless unless we, human beings, can interpret and analyze it to make decisions. But unlike data, human beings are full of emotions, cultural limitations, biases and prejudices. This human side is a critical component of the invisible data. This may come as a surprise to many readers but the invisible side of data is sometimes more critical than visible facts when it comes to making impactful decisions and policies.

The visible facts of data are a necessary condition for making effective decisions and policies, but they are not sufficient unless we also consider the invisible side of data.

So going back to the example of Bruno and Newton – in a way the visible data had remained the same, but the invisible data had changed within the 100 years between them.

You may think that we might have grown since the time of Newton – we have more data, more tools, more algorithms, advanced technologies and thousands of skilled resources. But we are still not far off from where we were – in fact the situation is even more complicated than before.

There is a preponderance of data today that supports the theory that humans are responsible for climate change, but almost 50% of the people in the US do not believe it. Per capita expenditure on health care in the US is twice that of any other developed nation, in spite of a significant percentage of people being uninsured or underinsured. Yet many politicians ignore the facts on the table and are totally against incorporating any ideas from other developed nations into their plans, whether that means joining the Paris accord or adopting a regulated health care system.

Why is the data itself not sufficient? There are many such examples in both business and social settings that clearly point out that along with visible facts, the invisible side of data is equally or in many cases more important than the hard facts.

Data Scientists, Data Engineers and Statisticians are well versed in visible data – raw and derived data, structures, algorithms, statistics, tools and visualization. But unless they are also well versed in the invisible side of data, they are ineffective.

The invisible side of data is the field of behavioral scientists, social scientists, philosophers, politicians, and policy makers. Unless we bring them along for the ride, the datasets alone will not be sufficient.

Four challenges of Invisible Data:

I believe that the invisible data domain has critical components that all data scientists and policymakers should be aware of. Typically, the invisible data domain is either ignored, marginalized or misunderstood. I have identified four focus areas of the invisible data domain. They are as follows.

  1. Human Evolutionary Limitations: Our biases, fallacies, illusions, beliefs etc.
  2. Brute Data Environments: Complex issues, cancer research, climate change
  3. Data Mirages: Black swans, statistical anomalies, data tricks etc.
  4. Technology Advancements: Free will, consciousness, data ownership

Human Evolutionary Limitations

Through the process of evolution we have learnt to avoid Type II errors (false negatives) more than Type I errors (false positives). When the question is whether a threat is present, Type II errors are costlier than Type I errors – it is better to leave a rope alone thinking that it’s a snake than to pick up a snake thinking that it’s a rope. This is just one simple example of how the brain works and creates cognitive challenges. Our thinking is riddled with behavioral fallacies. I am going to use some of the work done by Nobel Laureate Daniel Kahneman to discuss this topic. Kahneman shows that our brains are highly evolved to perform many tasks with great efficiency, but they are often ill-suited to accurately carry out tasks that require complex mental processing.

By exploiting these weaknesses in the way our brains process information, social media platforms, governments, media, and populist leaders are able to exercise a form of collective mind control over the masses.

Two Systems

Kahneman introduces two characters of our mind:

  • System 1: This operates automatically and immediately, with little or no effort and no sense of voluntary control.
  • System 2: This allocates attention to mental activities that demand dedicated attention like performing complex computations.

These two systems co-exist in the human brain and together help us navigate life; they aren’t literal or physical, but conceptual. System 1 is an intuitive system that cannot be turned off; it helps us perform most of the cognitive tasks that everyday life requires, such as identifying threats, navigating our way home on familiar roads, recognizing friends, and so on. System 2 can help us analyze complex problems like proving a theorem or doing crossword puzzles. Engaging System 2 takes effort and energy. System 2 is also lazy and tends to take shortcuts at the behest of System 1.

This gives rise to many cognitive challenges and fallacies. Kahneman has identified several fallacies that impact our critical thinking and make data interpretation challenging. A subset is as follows – I will be including more as part of my final project.

Cognitive Ease

Whatever is easier for System 2 is more likely to be believed. Ease arises from idea repetition, clear display, a primed idea, and even one’s own good mood. It turns out that even the repetition of a falsehood can lead people to accept it, despite knowing it’s untrue, since the concept becomes familiar and is cognitively easy to process.

Answering an Easier Question

Often when dealing with a complex or difficult issue, we transform the question into an easier one that we can answer. In other words, we use a heuristic; for example, when asked “How happy are you with life?”, we answer the question “How is my married life?” or “How is my job?” While these heuristics can be useful, they often lead to incorrect conclusions.


Anchoring

Anchoring is a form of priming the mind with an expectation. An example is this pair of questions: “Is the height of the tallest redwood more or less than x feet? What is your best guess about the height of the tallest redwood?” When x was 1200, the average answer to the second question was 844 feet; when x was 180, the average answer was 282 feet.
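One way to summarize the redwood numbers is the ratio Kahneman calls an anchoring index: the gap between the two average guesses divided by the gap between the two anchors. A short calculation, with a function name chosen here only for illustration:

```python
def anchoring_index(high_anchor, low_anchor, mean_high, mean_low):
    """Ratio of the difference in average guesses to the difference in anchors.

    0 would mean the anchor had no effect; 1 would mean people simply
    repeated the anchor back.
    """
    return (mean_high - mean_low) / (high_anchor - low_anchor)


index = anchoring_index(high_anchor=1200, low_anchor=180, mean_high=844, mean_low=282)
print(f"{index:.0%}")  # 55%
```

In other words, a little over half of the difference between the two anchors carried through into people’s guesses.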

Brute Data Environments

During the last solar eclipse, people travelled hundreds of miles in the USA to witness the phenomenon. Thanks to the predictions of scientists, we knew exactly what time and day to expect the eclipse. Even though we have no independent capacity to verify the calculations, we tend to trust scientists.

On the other hand, climate scientists have been predicting the likely consequences of our emissions of industrial gases. These forecasts are critically important, because the experts see grave risks to our civilization. And yet, half the population of the USA ignores or distrusts the scientists.

Why this dichotomy?

The reason is that, unlike an eclipse, the climate dystopia is not immediate; climate science cannot predict the future as precisely as astronomy predicts an eclipse; and acting on the forecasts requires collective action at a global scale, with no financial motivation.

If the environmentalists had predicted the Texas snowstorm of last month accurately and far enough ahead of time to avoid its adverse impact, probably the majority of the people in the world would have started believing in global warming. But the issue of global warming is not deterministic like predicting an eclipse.

I call issues like global warming issues of a brute data environment. The problem is not deterministic like predicting an eclipse; it is probabilistic and therefore open to interpretation. Many problems fall into this category – world hunger, cancer research, income inequality and many more.

Data Mirages

Even though we have an abundance of data today, there are some inherent data problems that must not be ignored. I call them data mirages. These are statistical fallacies that can play tricks on our minds.

Survivorship Bias

Drawing conclusions from an incomplete set of data, because that data has ‘survived’ some selection criteria. When analyzing data, it’s important to ask yourself what data you don’t have. Sometimes, the full picture is obscured because the data you’ve got has survived a selection of some sort. For example, in WWII, a team was asked where the best place was to fit armour to a plane. The planes that came back from battle had bullet holes everywhere except the engine and cockpit. The team decided it was best to fit armour where there were no bullet holes, because planes shot in those places had not returned.
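The aircraft story can be reproduced with a small simulation. The loss probabilities below are made up for illustration and are not historical figures; the point is only that hits are spread evenly over all planes, yet the planes that return show few holes in the most vulnerable sections.

```python
import random

SECTIONS = ["engine", "cockpit", "fuselage", "wings"]
# Hypothetical chance that a single hit in each section downs the plane.
LOSS_PROB = {"engine": 0.8, "cockpit": 0.7, "fuselage": 0.1, "wings": 0.1}


def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    """Count hits per section over all planes and over returning planes only."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    survivor_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every hit it took.
        survived = all(rng.random() > LOSS_PROB[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {s: counts[s] / total for s in counts}


all_hits, survivor_hits = simulate()
print("all planes:      ", shares(all_hits))
print("returning planes:", shares(survivor_hits))
```

Looking only at the second line of output, one would conclude that engines are rarely hit, which is exactly the mistake the team avoided.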

Cobra Effect

When an incentive produces the opposite of the intended result. Also known as a Perverse Incentive. Named after a historical legend, the Cobra Effect occurs when an incentive for solving a problem creates unintended negative consequences. It’s said that in the 1800s, the British Empire wanted to reduce cobra bite deaths in India. They offered a financial incentive for every cobra skin brought to them to motivate cobra hunting. But instead, people began farming cobras. When the government realized the incentive wasn’t working, they removed it, so cobra farmers released their snakes, increasing the population. When setting incentives or goals, make sure you’re not accidentally encouraging the wrong behaviour.

Sampling Bias

Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand. A classic problem in election polling where people taking part in a poll aren’t representative of the total population, either due to self-selection or bias from the analysts. One famous example occurred in 1948 when The Chicago Tribune mistakenly predicted, based on a phone survey, that Thomas E. Dewey would become the next US president. They hadn’t considered that only a certain demographic could afford telephones, excluding entire segments of the population from their survey. Make sure to consider whether your research participants are truly representative and not subject to some sampling bias.
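The telephone-poll problem can also be shown with a short simulation. All numbers below (phone ownership rate, support levels) are invented for illustration and are not 1948 data; the sketch only assumes that phone owners lean differently from everyone else.

```python
import random


def simulate_poll(pop_size=100_000, phone_rate=0.4, support_with_phone=0.6,
                  support_without_phone=0.4, sample_size=2_000, seed=1):
    """Compare a phone-only poll with a random poll of the whole population.

    Returns (true support, phone-poll estimate, random-poll estimate) for a
    hypothetical candidate.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        has_phone = rng.random() < phone_rate
        p = support_with_phone if has_phone else support_without_phone
        population.append((has_phone, rng.random() < p))

    true_support = sum(vote for _, vote in population) / pop_size

    # Biased frame: only people who own a phone can be reached.
    phone_owners = [vote for has_phone, vote in population if has_phone]
    phone_poll = sum(rng.sample(phone_owners, sample_size)) / sample_size

    # Unbiased frame: everyone has the same chance of being sampled.
    everyone = [vote for _, vote in population]
    random_poll = sum(rng.sample(everyone, sample_size)) / sample_size
    return true_support, phone_poll, random_poll


true_support, phone_poll, random_poll = simulate_poll()
print(f"true={true_support:.3f} phone={phone_poll:.3f} random={random_poll:.3f}")
```

Under these assumptions the phone poll predicts a comfortable win for a candidate who actually has minority support, while a random sample of the same size lands close to the truth.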

Technology Advancements

According to Yuval Noah Harari, a historian and one of the most prominent thinkers on Artificial Intelligence, there is a new equation that has thrown a monkey wrench into our belief system.

B * C * D = AHH

What he means is that advancements in BioTech (B), combined with advancements in computer technology (C) and with Data (D), will provide the ability to hack human beings (AHH). Artificial intelligence is creating a new world for us where traditional human values and human traits are becoming obsolete.

Technologies like CRISPR have already created a moral and ethical issue by providing the ability to create designer babies while technologies like Machine Learning have reignited the issue of bias by using “racist” data for training. The field of Artificial Intelligence is going to combine the two distinct domains of biology and computer technology into one.

This is going to create new challenges in the areas of privacy, bias, joblessness, ethics and diversity while introducing unique issues like free will, consciousness, and the rise of machines. Some of the issues that we must consider and pay close attention to are as follows:

Transfer of authority to machines:

A couple of days ago I was sending an email from my Gmail account. As soon as I hit the send button, a message popped up: “Did you forget the attachment?” Indeed I had forgotten to include the attachment, and Google had figured that out by interpreting the text of my email. It was scary, but I was also thankful to Google! Within the last decade or more, we have come to trust eHarmony to choose a partner, Google to conduct a search, Netflix to pick a movie and Amazon to recommend a book. Self-driving cars are taking over our driving needs and AI physicians are taking over the need for real doctors. We love to transfer authority and responsibility to machines. We trust the algorithms more than our own ability to make decisions.

Joblessness and emergence of the “useless class”:

Ever since the Industrial Revolution of the eighteenth and nineteenth centuries, we have dealt with the idea of machines pushing people out of the job market. In the Industrial Revolution, and to some extent in the Computer Revolution of the 1980s and 1990s, machines competed with manual or clerical skills. But with Artificial Intelligence, machines are also competing with the cognitive and decision-making skills of human beings.

Per Yuval Noah Harari – the Industrial Revolution created the proletariat class but the AI Revolution will create a “useless class.” Those who lost jobs in agriculture or handicraft during the Industrial Revolution could train themselves for industrial jobs, but the new AI Revolution is creating a class of people who will not only be unemployed but also unemployable!

Invisible Data: Impact on Ethics, Privacy and Policy

The abundance of data has created several challenges in terms of privacy, security, ethics, morals and establishing policies. The mere collection of data makes it vulnerable to hacking, aggregation and de-anonymization. These are clear problems in the domain of visible data, but they become even more complicated when we bring invisible data into the mix. The following are a few suggestions that we must explore:

Data Ownership and Usage

After the agricultural revolution, land was a key asset and decisions about its ownership were critical in managing society. After the Industrial Revolution, the focus shifted from land to factories and machines. The entire twentieth century was riddled with the ownership issue of land, factories and machines. This gave rise to two sets of political systems – liberal democracy and capitalism on one side and communism and central ownership on the other side. Today the key asset is data and decisions about its ownership and use will enable us to set the right policies. We may experience the same turmoil we went through while dealing with the issue of democracy vs communism.

The individual or the community

On most moral issues, there are two competing perspectives. One emphasizes individual rights, personal liberty, and a deference to personal choice. Stemming from John Locke and other Enlightenment thinkers of the seventeenth century, this tradition recognizes that people will have different beliefs about what is good for their lives, and it argues that the state should give them a lot of liberty to make their own choices, as long as they do not harm others.

The contrasting perspectives are those that view justice and morality through the lens of what is best for society and perhaps even the species. Examples include vaccinations and wearing masks during a pandemic. The emphasis is on seeking the greatest amount of happiness in a society, even if that means trampling on the liberty of some individuals.

AI Benevolence

Today when we talk about AI, we are overwhelmed by two types of feelings. One is of awe, surprise, fascination and admiration, and the other is of fear, dystopia and confusion. We tend to consider AI as both omnipotent and omniscient. These are the same adjectives we use for “God”. The concerns about AI have some legitimate basis, but as with “God”, we should also look to AI for benevolence. Long-term strategies must include an intense focus on using AI technology to enhance human welfare. Once we switch our focus from AI being a “big brother” to AI being a “friend”, our policies, education and advancement will take a different turn.

Cross Pollination of Disciplines

As we have already seen, invisible data spans many disciplines, from history and philosophy to society, politics, behavioral science, justice and more. New advancements in AI must include cross-pollination between humanists, social scientists, civil society, government and philosophers. Even our educational system must embrace cross-pollination of disciplines, ideas and domains.

Somatic vs Germline Editing

Who decides what is right – somatic vs germline editing to cure diseases?

Somatic gene therapies involve modifying a patient’s DNA to treat or cure a disease caused by a genetic mutation. In one clinical trial, for example, scientists take blood stem cells from a patient, use CRISPR techniques to correct the genetic mutation causing them to produce defective blood cells, then infuse the “corrected” cells back into the patient, where they produce healthy hemoglobin. The treatment changes the patient’s blood cells, but not his or her sperm or eggs.

Germline human genome editing, on the other hand, alters the genome of a human embryo at its earliest stages. This may affect every cell, which means it has an impact not only on the person who may result, but possibly on his or her descendants. There are, therefore, substantial restrictions on its use.

Treatment: What is normal?

BioTech advancements like CRISPR can treat several disabilities. But many of these so-called disabilities often build character, teach acceptance, and instill resilience. They may even be correlated to creativity.

In the case of Miles Davis, the pain of sickle cell drove him to drugs and drink. It may have even driven him to his death. It also, however, may have driven him to be the creative artist who could produce his signature blue compositions.

Vincent van Gogh had either schizophrenia or bipolar disorder. So did the mathematician John Nash. People with bipolar disorder include Ernest Hemingway, Mariah Carey, Francis Ford Coppola, and hundreds of other artists and creators.

Maintaining Privacy in a Smart Home

by Robert Hosbach | February 19, 2021


Privacy is a hot topic. It is a concept that many of us feel we have a right to (at least in democratic societies) and is understandably something that we want to protect. What’s mine is mine, after all. But how do you maintain privacy when you surround yourself with Internet-connected devices? In this post, I will briefly discuss what has come to be known as the “privacy paradox,” how smart home devices pose a challenge to privacy, and what we as consumers can do to maintain our privacy while at home.

The “Privacy Paradox”


In 2001, Hewlett Packard published a study [1] about online shopping in which one of the conclusions was that participants claimed to care much about privacy, and yet their actions did not support this claim. The report dubbed this a “privacy paradox,” and this paradox has been studied numerous times over the past two decades with similar results. As one literature review [2] stated, “While users claim to be very concerned about their privacy, they nevertheless undertake very little to protect their personal data.” Myriad potential reasons exist for this. For instance, many consumers gloss over or ignore the fine print by habit; the design and convenience of using the product outweigh perceived harms from using the product; consumers implicitly trust the manufacturer to “do the right thing”; and some consumers remain blissfully ignorant of the extent to which companies use, share, and sell data. Whatever the underlying causes, the Pew Center published survey results in 2019 indicating that upwards of 60% of U.S. adults reported that they feel “very” or “somewhat” concerned about how companies and the government use their personal data [3]; yet, only about 1 in 5 Americans typically read the privacy policies they agree to [4]. And these privacy policies are precisely the documents that consumers should read if they are concerned about their privacy. At a high level, a privacy policy should contain information about what data the company or product collects, how those data are stored and processed, if and how data can be shared with third parties, and how the data are secured, among other things.

Even if you have never heard the term “privacy paradox” before, you can likely think of examples of the paradox in practice. For instance, you might think about Facebook’s Cambridge Analytica debacle, along with the other lower-profile data privacy issues Facebook has had over the years. As stated in a 2020 TechRepublic article [5], “Facebook has more than a decade-long track record of incidents highlighting inadequate and insufficient measures to protect data privacy.” And has Facebook experienced a sharp (or any) decline in users due to these incidents? No. (Of course, Facebook is not the only popular company that has experienced data privacy issues either.) Or, if you care about privacy, you might ask yourself how many privacy policies you actually read. Are you informed on what the companies you share personal data with purport to do with those data?

Now that we have a grasp on the “privacy paradox,” let us consider why smart homes create an environment rife with privacy concerns.

Smart Homes and Privacy


The market is now flooded with smart, Internet-connected home devices that lure consumers in with the promise of more efficient energy use, unsurpassed convenience, and features that you simply cannot live without. Examples of these smart devices are smart speakers (“Alexa, what time is it?”), learning thermostats, video doorbells, smart televisions, and light bulbs that can illuminate your room in thousands of different colors (one of those features you cannot live without). But the list goes on. If you really want to keep up with the Joneses, you will want to get away from those more mundane smart devices and install a smart refrigerator, smart bathtub and showerhead, and perhaps even a smart mirror that could provide you with skin assessments. Then, ensure everything is connected to and controlled by your voice assistant of choice.

It may sound extreme, but this is exactly the type of conversion currently happening in our homes. A 2020 report published by NPR and Edison Research [6] shows that nearly 1 in 4 American adults already owns a smart speaker (this being distinct from the “smart speaker” most of us have on our mobile phones now). And all indicators point to increased adoption going forward. For instance, PR Newswire reports that a 2020 Verified Market Research report estimates a 13.5% compound annual growth rate from 2019-2027 for the smart home market [7]. Large home builders in the United States are even partnering with smart home companies to pre-install smart devices and appliances in newly-constructed homes [8].
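To make the growth figure concrete, here is a minimal sketch of the compound-growth arithmetic. It takes the $207.88 billion 2027 figure and the 13.52% rate from the title of reference [7], and it assumes eight annual compounding periods between 2019 and 2027. The implied 2019 starting value is derived here purely for illustration and is not a number reported by the source.

```python
def project(value, rate, years):
    """Value after compounding `rate` once per year for `years` years."""
    return value * (1 + rate) ** years


def implied_start(final_value, rate, years):
    """Starting value that would grow to `final_value` at the given rate."""
    return final_value / (1 + rate) ** years


CAGR = 0.1352          # 13.52% per year, from the title of reference [7]
YEARS = 2027 - 2019    # assumption: eight annual compounding periods
FINAL_2027 = 207.88    # billions of US dollars, from the title of reference [7]

growth_multiple = (1 + CAGR) ** YEARS
start_2019 = implied_start(FINAL_2027, CAGR, YEARS)  # illustrative only

print(f"Growth multiple over {YEARS} years: {growth_multiple:.2f}x")
print(f"Implied 2019 market size: ${start_2019:.1f} billion")
```

Under these assumptions, the market would be somewhat less than three times its 2019 size by 2027, which is what "13.5% per year" adds up to over eight years.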

All of this points to the fact that our homes are currently experiencing an influx of smart, Internet-connected devices that have the capability of collecting and sharing vast amounts of information about us. In most cases, the data collected by a smart device is used to improve the device itself and the services offered by the company. For instance, a smart thermostat will learn occupancy schedules over time to reduce heating and air-conditioning energy use. Companies also commonly use data for ad-targeting purposes. For many of us, this is not a deal-breaker. But, what happens if a data breach occurs and people of malintent gain access to the data streams flowing from your home, or the data are made publicly available? Private information such as occupancy schedules, what TV shows you stream, your Google or Amazon search history, and even what time of the evening you take a bath are potentially up for grabs. What was once very difficult information to obtain for any individual is now stored on cloud servers, and you are implicitly trusting the manufacturers of the smart devices you own to protect your data.

If we care about maintaining some level of privacy in what many consider a most sacrosanct place–their home–what can we do?

Recommendations for Controlling Privacy in Your Home

Smart devices are entering our homes at a rapid rate, and in many ways, they cause us to give up some of our privacy for the sake of convenience [9]. Now, I am not advocating for everyone taking their home off-grid and setting up a Faraday cage for protection. Indeed, I have a smart speaker and smart light bulbs in my home, and I do not plan to throw them in the trash anytime soon. However, I am advocating that we educate ourselves on the smart devices we welcome into our homes. Here are a few ways to do this:

  1. Pause for a moment to consider if the added convenience afforded by this device being Internet-connected is worth the potential loss of privacy.
  2. Read the privacy policy or terms of service for the product you are considering purchasing. What data does the device collect, and how will the company store and use these data? Will third parties have access to your data? If so, for what purposes? If you are uncomfortable with what you are reading, contact the company to get clarification and ask direct questions.
  3. Research the company that manufactures the device. Do they have a history of privacy issues? Where is the company located? Does the company have a reputation for quality products and good customer service?
  4. Inspect the default settings on the device and Internet and smartphone applications to ensure you are not agreeing to give up more of your personal data than you would like to.

Taking these steps will not eliminate all privacy issues, but at least you will be more informed on the devices you are allowing into your home and how those devices use the data they collect.


[1] Brown, B. (2001). Studying the Internet Experience (HPL-2001-49). Hewlett Packard.

[2] Barth, S., & de Jong, M. D. T. (2017). The privacy paradox – Investigating discrepancies between expressed privacy concerns and actual online behavior – A systematic literature review. Telematics and Informatics, 34(7), 1038–1058.

[3] Auxier, B., Rainie, L., Anderson, M., Perrin, A., Kumar, M., & Turner, E. (2019, November 15). 2. Americans concerned, feel lack of control over personal data collected by both companies and the government. Pew Research Center: Internet, Science & Tech

[4] Auxier, B., Rainie, L., Anderson, M., Perrin, A., Kumar, M., & Turner, E. (2019, November 15). 4. Americans’ attitudes and experiences with privacy policies and laws. Pew Research Center: Internet, Science & Tech

[5] Patterson, D. (2020, July 30). Facebook data privacy scandal: A cheat sheet. TechRepublic.

[6] NPR & Edison Research. (2020). The Smart Audio Report

[7] Verified Market Research. (2020, November 3). Smart Home Market Worth $207.88 Billion, Globally, by 2027 at 13.52% CAGR: Verified Market Research. PR Newswire.

[8] Bousquin, J. (2019, January 7). For Many Builders, Smart Homes Now Come Standard. Builder.

[9] Rao, S. (2018, September 12). In today’s homes, consumers are willing to sacrifice privacy for convenience. Washington Post

Bias in Large Language Models: GPT-2 as a Case Study

By Kevin Ngo | February 19, 2021

Imagine having a multi-paragraph story in a few minutes. Imagine having a full article by providing only the title. Imagine having a whole essay by providing only the first sentence. Well, this is possible by harnessing large language models. Large language models are trained using an abundant amount of public text to predict the next word.
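To illustrate what "predict the next word" means, here is a toy sketch. It is not how GPT-2 works internally: GPT-2 is a neural network, whereas this example simply counts which word follows which in a tiny made-up corpus (a bigram model). The corpus and the function names are my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, length=5):
    """Greedily extend `start` by repeatedly predicting the next word."""
    output = [start.lower()]
    for _ in range(length):
        next_word = predict_next(counts, output[-1])
        if next_word is None:
            break
        output.append(next_word)
    return " ".join(output)


# Hypothetical training text, invented for this example.
corpus = "the model reads text . the model predicts the next word . the model repeats"
counts = train_bigram(corpus)

print(predict_next(counts, "the"))
print(generate(counts, "next", length=2))
```

Even this crude counter shows the key property: the model can only echo patterns in its training text, so whatever biases that text contains will show up in what it generates.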


I used a demo of GPT-2, a well-known language model released in February 2019, to demonstrate large language models’ ability to generate text. I typed “While large language models have greatly improved in recent years, there is still much work to be done concerning its inherent bias and prejudice”, and allowed GPT-2 to generate the rest of the text. Here is what GPT-2 came up with: “The troubling thing about this bias and prejudice is that it is systemic, not caused by chance. These biases can influence a classifier’s behavior, and they are especially likely to impact people of color.” While the results were not perfect, it can be hard to differentiate the generated text from the non-generated text. GPT-2 correctly states that the bias and prejudice inside the model are “systemic” and “likely to impact people of color.” While they may mimic intelligence, language models do not understand the text they produce.

Image: Result of GPT-3 for a Turing test

Controversial Release of GPT-2

OpenAI, the creator of GPT-2, was at first hesitant to release the model, fearing “malicious applications” of it. They decided to release smaller models of GPT-2 first, both to let other researchers experiment with them and to mitigate potential harm caused by their work. After seeing “no strong evidence of misuse”, OpenAI released the full model, noting that GPT-2 could be abused to help generate “synthetic propaganda.” It could also be used to flood the internet with a high volume of coherent spam. Although OpenAI’s efforts to mitigate public harm are commendable, some experts condemned OpenAI’s decision. They argued that OpenAI prevented other people from replicating its breakthrough, hindering the advancement of natural language processing. Others claimed that OpenAI exaggerated the dangers of GPT-2.

Issues with Large Language Models

In reality, GPT-2 poses more potential dangers than OpenAI assumed. A joint study by Google, Apple, Stanford University, OpenAI, the University of California, Berkeley, and Northeastern University revealed that GPT-2 could leak details from the data the model was trained on, which could contain sensitive information. The results showed that over a third of candidate sequences came directly from the training data, some containing personally identifiable information. This raises major privacy concerns regarding large language models. OpenAI released the beta version of GPT-3 in June 2020. GPT-3 is larger and provides better results than GPT-2. A Senior Data Scientist at Sigmoid mentioned that in one of his experiments, only 50% of the fake news generated by GPT-3 could be distinguished from real news, showing how powerful GPT-3 can be.

Despite the impressive results, GPT-3 still has inherent bias and prejudice, making it prone to generate “hateful sexist and racist language,” according to Kate Devlin. Jerome Pesenti demonstrated this by having GPT-3 generate text from a single word. The words given were “Jew”, “Black”, “Women”, and “Holocaust”.

A paper by Abubakar Abid details the inherent bias against Muslims specifically. He found a strong association between the word “Muslim” and GPT-3 generating text about violent acts. Adding adjectives directly opposite to violence did not reduce the amount of generated text about violence, but adding adjectives that redirected the focus did. Abid demonstrates GPT-3 generating text about violence when prompted with “Two Muslims walked into a mosque to worship peacefully”, showing GPT-3’s bias against Muslims.


  1. Vincent, J. (2019, November 07). OpenAI has published the text-generating AI it said was too dangerous to share. Retrieved February 14, 2021.
  2. Heaven, W. (2020, December 10). OpenAI’s new language generator GPT-3 is shockingly good - and completely mindless. Retrieved February 14, 2021.
  3. Radford, A. (2020, September 03). Better language models and their implications. Retrieved February 14, 2021.
  4. Carlini, N. (2020, December 15). Privacy considerations in large language models. Retrieved February 14, 2021.
  5. OpenAI. (2020, September 22). OpenAI licenses GPT-3 technology to Microsoft. Retrieved February 14, 2021.
  6. Ammu, B. (2020, December 18). GPT-3: All you need to know about the AI language model. Retrieved February 14, 2021.
  7. Abid, A., Farooqi, M., & Zou, J. (2021, January 18). Persistent anti-Muslim bias in large language models. Retrieved February 14, 2021.

The provenance of a consent

by Mohan Sadashiva | February 19, 2021

What is informed consent?

The first Belmont principle[1] defines informed consent as permission given with full knowledge of the consequences. In the context of data collection on the internet, consent is often obtained by requiring the user to agree to terms of use, privacy policy, software license or a similar instrument. By and large, these terms of use tend to be abstract and broad so as to cover a wide range of possibilities without much specificity.

Why is it important?

Data controllers (entities that collect and hold the data, typically institutions or private companies) benefit from the collection of such data in a variety of ways with the end goal being improved product/service, better insight/knowledge, or ability to monetize through additional product/service sales. Consumers benefit from better products, customized service, new product/service recommendations and other possibilities that improve their quality of life.

However, there is a risk this information is misused or used to their detriment as well. One common example is when some data controllers sell the data they have collected to other third parties, who in turn combine this information with other sources and resell them. As you go through this chain of transfers, the original scope of consent is lost as the nature of data collected has expanded and the nature of application of the data has changed. Indeed, even the original consent contract is typically not transferred through the chain and subsequent holders of the data no longer have consent.

As new data is combined the result is much more than additive. For example, two sets of anonymized data when combined can result in non-anonymized data with the subject being identified. In this case the benefit to the company or institution is exponential, but so is the risk to the subject. Even if the subject consented to each set of data being collected, the consent is not valid for the combined set as the scope and benefit/risk equation is considerably changed.
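The re-identification risk described above can be sketched in a few lines. In this hypothetical example, one data set has names removed but keeps ZIP code, birth year and gender, while a second, separately collected data set pairs those same attributes with names. All records, field names and the `link_records` helper are invented for illustration; this is a sketch of the general idea, not of any particular company's practice.

```python
def link_records(anonymized, identified, keys):
    """Join two data sets on shared attributes and keep unambiguous matches."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in anonymized:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # exactly one candidate: subject is re-identified
            linked.append({**matches[0], **row})
    return linked


# Hypothetical "anonymized" data set: no names, but detailed attributes.
health_records = [
    {"zip": "94704", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1972, "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical second data set that carries names with the same attributes.
public_directory = [
    {"name": "Alice Example", "zip": "94704", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Sample", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Carl Sample", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Dana Placeholder", "zip": "60601", "birth_year": 1960, "gender": "F"},
]

reidentified = link_records(health_records, public_directory,
                            keys=("zip", "birth_year", "gender"))
for person in reidentified:
    print(person["name"], "->", person["diagnosis"])
```

Neither data set reveals a named person's diagnosis on its own. Only the combination does, which is why consent given separately for each set does not cover the merged result.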

What is provenance?

The provenance of consent for data is the original consent agreement when data was collected from a subject. If this original consent was codified into a data use contract that is passed with the data, then it provides a framework for a practical implementation of the Belmont principles of respect for persons, beneficence and justice.

There is some analogy here with data lineage which is a statement of the origin of data (provenance) and all the transformations and transfers (lineage) that lead to its current state. What is often ignored is the notion that consent cannot be transformed as this is an agreement between the subject and data controller that if changed in any way would require obtaining another consent from the subject.
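One way to picture a data use contract that travels with the data is sketched below. The class names, fields and rules are hypothetical choices of mine, not an existing standard: the consent object is immutable, every use is checked against it, and a transfer hands over the same consent unchanged while appending to the lineage.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


class ConsentError(Exception):
    """Raised when a requested action falls outside the original consent."""


@dataclass(frozen=True)  # frozen: the consent terms cannot be edited in place
class Consent:
    subject_id: str
    collected_by: str
    allowed_purposes: FrozenSet[str]
    allow_transfer: bool


@dataclass
class DataRecord:
    payload: Dict[str, str]
    consent: Consent                       # provenance: the original agreement
    lineage: List[str] = field(default_factory=list)  # transfers and uses


def use(record, holder, purpose):
    """Return the payload only if the purpose is covered by the consent."""
    if purpose not in record.consent.allowed_purposes:
        raise ConsentError(f"{holder} has no consent for purpose: {purpose}")
    record.lineage.append(f"{holder} used data for {purpose}")
    return record.payload


def transfer(record, sender, receiver):
    """Pass the data on together with its unchanged original consent."""
    if not record.consent.allow_transfer:
        raise ConsentError(f"{sender} may not transfer this data")
    return DataRecord(
        payload=dict(record.payload),
        consent=record.consent,  # same object: consent is never transformed
        lineage=record.lineage + [f"{sender} transferred data to {receiver}"],
    )


# Hypothetical subject, holders and purposes.
consent = Consent("subject-001", "dictionary-site", frozenset({"word lookup"}), True)
original = DataRecord({"query": "provenance"}, consent)
shared = transfer(original, "dictionary-site", "partner-a")
```

With this structure, `use(shared, "partner-a", "ad targeting")` raises `ConsentError`, because the partner inherits exactly the consent the subject originally gave and nothing more. Widening the purposes would require a new `Consent` obtained from the subject.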

Case Study:

There is a site that I use quite often. I decided to take a look under the hood to examine its terms of service and privacy policy, and discovered that I had signed up to a very liberal information collection and sharing agreement. The company collects a lot of personal information about me that it doesn’t need for the interaction, which in my case is to look up the definition of a word. The most egregious is collecting my identity (from my mobile device) and my location. The most bizarre is recording the websites I visited prior to using the service. The company discloses that this information could be shared with other service providers and partners. It disclaims all responsibility thereafter and states that the information is then governed by the terms of service and privacy policies of the partner. However, there is no mention of who these partners are or how my information will be used by them.

This illustrates how my consent for data collected by the site is lost on transfer of my personal data to a partner. After my personal data changes hands several times, there would be no way to trace the original consent, even for well-meaning partners.


An inviolable data use contract derived from informed consent that is associated with the data sets that are collected is one way to start implementing the Belmont principles. This needs standards and broad based agreement across industries, as well as laws for enforcement. While this may seem a hopeless pipe dream today, a lot can be achieved when people get organized. Just look at how the music industry embraced digital media and came up with a comprehensive mechanism for Digital Rights Management[2] (DRM) to record and enforce an artist or producer’s rights over the music they created or marketed.


[1] The Belmont Report; Department of Health, Education and Welfare

[2] Digital Rights Management; Wikipedia

How Secure is your Security Camera?

By Shujing Dong | February 19, 2021

Smart home security cameras have become a must-have for most households in recent years. They can live-stream what is going on in or around the home and record video at any time. We feel safer and more protected with home security cameras, but do we know how secure they are? According to a recent tech news article, “Yes, your security camera could be hacked: Here’s how to stop spying eyes”, home security cameras can be easily hacked, as in the ADT data breach story. This article dives into the data security of smart security cameras by looking at the privacy policies of three major security camera service providers: Ring, Nest, and Wyze.

What personal data do they collect?

All three providers collect account information (including gender and age), device information, location information and user interactions with the service, as well as video/audio recordings and social media information such as reviews on third-party websites. But they do not justify why gender and age are needed for monitoring and protecting the home. With location and device information, it is possible to track users illegally through targeted aggregation. In addition, Nest collects facial recognition data through its familiar face alerts feature, and does not state whether users of that feature can opt out of sharing their facial data.

How are these data used, shared and stored?

All providers use the collected data to improve their devices and services, personalize user experience and for promotional or marketing purposes. However, for online tracking, Ring says “Our websites are not designed to respond to “Do Not Track” signals received from browsers”, meaning it tracks users’ online activity at its will. The other two providers completely omit their responses to “Do Not Track” signals.

They all share data with vendors, service providers, technicians, as well as affiliates and subsidiaries. However, if their affiliates or subsidiaries use the data for different business purposes, it will pose privacy risks to the users. They also do not articulate what the data processing looks like or what preventive measures are taken against data breaches or illegal access by employees or vendors.

As for data retention, Nest stores user data until the user requests deletion. Ring stores user recordings under the “Ring Protected Plan” and Neighborhoods Recordings. Wyze stores data only on the SD card in the camera; any recordings that users voluntarily submit to Wyze are kept for no longer than 3 years.

What data security mechanisms do they have?

Ring only vaguely states “We maintain administrative, technical and physical safeguards designed to protect personal information”, without specifying what measures or technology it uses for data security. However, Ring is known to have fired four employees who abused internal access to customer video feeds. Nest is the only one of the three that specifically points out that it uses data encryption during transmission. While both Wyze and Nest transfer data internationally, Wyze does not mention how it protects data security across different jurisdictions, whereas Nest specifies that it adheres to the EU-US “Privacy Shield” policy.

What more can security camera providers do?

A privacy policy shows how much a provider cares about its users. To increase transparency and build users’ trust in the service, security camera providers should do more to protect data security and list specific measures in their privacy policies. For example, they can specify a data retention length, as Wyze does. They can also implement data encryption during transmission and say so in the privacy policy. In addition, they can put authorization processes in place so that only authorized employees can access data. Lastly, they can give users more opt-out options to control what data users share.

What can home security camera users do?

We users need to deliberately protect our own privacy as well. First, we should be aware of our rights and make choices based on our specific use cases. According to the FTC and CalOPPA, we have rights to access our own data and request deletion. For example, we can periodically ask security camera service providers to delete our video/audio recordings on their end. Second, we can avoid linking our accounts to social media to prevent our social network data from being collected. Third, we can anonymize our account information, such as demographic information and device names. We can also set unique passwords for our security devices and change them periodically. If possible, in private rooms such as bedrooms, use standalone cameras that do not transfer data to cloud servers.

Freedom to Travel or Gateway to Health Surveillance? The privacy and data concerns of COVID-19 vaccination passports

By Matthew Hui | February 19, 2021

With borders closed and quarantine and testing requirements abounding, travel may hardly be top of mind as the COVID-19 pandemic drags on. As the vaccine rollout continues in the United States, a “vaccine passport” is among the many ways to facilitate the reopening of travel and potentially give a boost to a hospitality industry that has been disproportionately battered by COVID-19’s spread. Denmark is already in the process of rolling out such digital passports to its citizens, while Hawaii is currently developing its own to allow travelers to skip quarantine upon proof of vaccination. Both the travel industry and governments whose economies rely on it have strong incentives for a wide and quick rollout of these passports. Although they may provide increased freedom of movement and travel during a pandemic, the rollout of these digital records must address serious ethical and privacy concerns associated with their implementation and usage in order to improve the chances of success.

How would vaccine passports work?

Vaccination passports essentially act as digital documentation that provides proof of vaccination against COVID-19. A person would be able to access this documentation on a smartphone to show as proof. These are currently in development by both government and industry, such as the IATA Travel Pass and IBM’s Digital Health Pass. These would also support proof of other health checks, such as virus test results or temperature checks.

Because using a vaccine passport requires, at a minimum, a smartphone and internet access, designers will need to consider how to make the passports accessible to people without them. One way to address this is an NFC-enabled card.

Privacy and Trust

As a credentialing mechanism, a digital vaccine passport system must inherently have methods of storing, sharing, and verifying sensitive health data. Digital vaccine passports will also need to be designed to prevent fraud and disclosure breaches. In both cases, failure to do so will undermine public trust necessary for widespread adoption. Fraudulent health records and insecurity could easily undermine adoption by organizations using these systems for verification. Disclosure of private or sensitive information could hamper uptake by individuals or prevent continued use due to an unwillingness to share personal health information.

As entry into countries and access to transportation may be conditioned on having a vaccine passport, we will need to consider what personal and health information is required to obtain a vaccine passport and ensure that it is commensurate with the scope of its use. Potential misuse of this information that goes outside of the containment of COVID-19 by governments must be considered in the design of a vaccine passport.
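As a rough illustration of how a passport could be verified while carrying minimal health information, here is a sketch using only Python's standard library. It is not how the IATA or IBM products work. It uses a shared-secret HMAC tag because that is available in the standard library, whereas a real system would more plausibly use public-key signatures so that verifiers never hold the issuer's secret. The field names, key and helper functions are all hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(claims):
    """Serialize claims deterministically so the tag is reproducible."""
    return json.dumps(claims, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_credential(claims, secret_key):
    """Issuer side: attach an HMAC-SHA256 tag to a minimal set of claims."""
    tag = hmac.new(secret_key, _canonical(claims), hashlib.sha256).hexdigest()
    return {"claims": claims, "signature": tag}


def verify_credential(credential, secret_key):
    """Verifier side: recompute the tag and compare in constant time."""
    expected = hmac.new(secret_key, _canonical(credential["claims"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, credential["signature"])


# Hypothetical key; a real deployment would never hard-code a secret.
ISSUER_KEY = b"demo-key-for-illustration-only"

# Data minimization: only what the checkpoint needs, no medical history.
claims = {"holder_id": "traveler-123", "vaccinated": True, "valid_until": "2021-12-31"}

passport = sign_credential(claims, ISSUER_KEY)
forged = {"claims": {**claims, "holder_id": "traveler-999"},
          "signature": passport["signature"]}

print(verify_credential(passport, ISSUER_KEY))
print(verify_credential(forged, ISSUER_KEY))
```

The genuine passport verifies and the altered copy does not, which addresses fraud. The short list of claims addresses disclosure: information that is never placed in the credential cannot leak from it or be repurposed later.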

Beyond COVID-19 and Travel

While usage of vaccine passports has primarily been discussed in the context of travel, the proliferation of these passes could widen their scope to other aspects of life beyond crossing borders and create concerns around access and equity. It would not be difficult to imagine vaccine passports being used as a condition to access stadiums, museums and nightclubs, in addition to trains or airplanes. These entities may start to require different levels of access to health information, perhaps requiring not only vaccination records but also lab results or thermal scans. In the context of a pandemic, these requirements may seem reasonable. Prior to the COVID-19 pandemic, being required to show proof of a flu vaccine to enter a stadium during flu season could easily have felt intrusive.

In these expanded use cases of the vaccine passport, society must consider which aspects of the public sphere should or should not be conditioned on having a vaccine passport, and how much health information should be shared to gain that access. Should access to jobs that involve interacting with the public be subject to these conditions? With unequal access to the vaccine and healthcare more generally, will inability to obtain vaccination be a mitigating factor when subject to these conditions? Governments will need to have a framework in place that defines the scope of usage for these vaccine passports and what that entails for their continued usage outside the context of a pandemic. This will be important to keep organizations from increasingly requiring personal health data as a condition of access to the public sphere, and to minimize discrimination and harm.



Privacy Concerns: Nest and Google Come Together

Soodong Kim | December 2, 2020

Nest and Google announced that they have come together, meaning that the data collected by one is shared with the other[1]. One of the primary purposes of the combination is to protect users’ data. However, as they state, data collected through Nest can now be used for other Google services[2]. If you are a user of Nest and Google Home devices, understanding their policy will help you protect your data. Here, through Solove’s Taxonomy, Contextual Integrity, and Differential Privacy, I address potential privacy issues with Nest and Google Home devices. I also suggest possible approaches that home devices can adopt for better privacy protection.

Image 1: Nest and Google Come Together

First of all, let’s see a high-level summary of what will be changed when Nest and Google are combined.

What will be changed/ explicitly mentioned about privacy issues?

  • The core security principles remain as they were
  • Personal information will not be sold
  • Users have more power to control their data (deletion/wipeout)
  • Video footage of the user will be saved only upon the user’s explicit request
  • Data related to neighbors will be protected in a more sophisticated way

With the recent updates, it is clear that Google is putting more effort into protecting users’ data; for example, it emphasizes that personal information will not be sold to anyone.

Zero Privacy Issues?

Solove’s Taxonomy

When examined through Solove’s Taxonomy, the information retrieved from various home devices is collected through actual interactions and monitoring, although interactions and record-keeping, such as logging the current temperature for temperature control, are based on the user’s requests. Although the data is not publicly available, specific information about a user’s lifestyle, such as sleeping times, can be turned into time-series data. If the dataset were available only by request (for example, from a government), and if those requests could be checked for data processing/control and dissemination violations, then the concerns raised by Solove’s Taxonomy would be less severe. From the user’s perspective, if Google emphasized that its commitment not to expose or sell personal information applies to any request, including requests from governments, users would be less concerned.

Contextual Integrity

Home devices need to monitor various factors, such as temperature or potential break-ins by thieves. Although this monitoring is requested by the user, the user might not be familiar with the context in which all the relevant information is accumulated and grouped. A Nest camera might record video footage with sound upon explicit request; however, if that footage is connected with other data, such as the temperature or the music streamed at that specific time, the connected information lacks contextual integrity, since the user would not understand why that information is grouped together. If Google provided explicit guidelines about how information is grouped when users install multiple devices, it would help users understand what is happening with their data.

What We Need to Do

Living with home devices is widely accepted now, and this industry will grow faster and further. Instead of remaining ignorant of what those devices are or which information they collect, we should read the privacy policy whenever it is updated and take a careful look at the specific guidelines devices follow for protecting privacy. Naturally, our privacy is ours to protect. We should be familiar with what our data is used for.


[1] “Google Nest commitment to privacy in the home – Google Store.” Accessed 1 Dec. 2020.

[2] “Google’s connected home devices and services – Google Nest ….” 30 Oct. 2020, Accessed 2 Dec. 2020.

[3] “Google Nest, build your connected home – Google Store.” Accessed 1 Dec. 2020.