The average Internet user visits dozens of websites every day and thereby interacts with the infrastructure and code on which those websites are built. However, users also interact with a layer beyond the digital code, often without knowing it: a legal code, the Terms of Service.
While the former code is easily modified and frequently updated, this legal code, the Terms of Service, is typically drafted by attorneys and updated much less often. Understandably, many companies try to craft a Terms of Service that is as broad as possible, to afford the company the greatest amount of protection. While computer code is typically precise and exact, legal code leaves more room for ambiguity and interpretation. The law must strike a balance between protecting companies’ and individuals’ property and affording good actors reasonable access and use.
One enduring question, curiously not tested until this past decade, is whether violating a website’s Terms of Service constitutes a crime. The argument that it does rests on the Computer Fraud and Abuse Act, passed by the US Congress in 1986 and broadened in 1996, which prohibits anyone from accessing “protected computers” without authorization, or from exceeding authorized access and obtaining information from those computers, if the access involved interstate or foreign communication.
As the Electronic Frontier Foundation (EFF) notes in its detailed blog post, in the most recent case, Oracle v. Rimini, the Court held that violating a website’s Terms of Service is not criminally punishable under the Computer Fraud and Abuse Act (and similar state statutes).
Central to this case, and as argued in an amicus brief filed by the EFF, is that definitions of criminal activity must be very specific and follow the Rule of Lenity, which, as the EFF put it, requires that “criminal statutes be interpreted to give clear notice of what conduct is criminal.” But most critically, the EFF goes on to say, “Not only do people rarely (if ever) read terms of use agreements, but the bounds of criminal law should not be defined by the preferences of website operators.”
Of particular interest to Data Scientists was the question of whether using “bots and scrapers” for automated collection of data breaks the law when it violates a Terms of Service. Automated scraping scripts are an important tool in the Data Scientist’s and Data Engineer’s toolbox because they allow data to be accumulated efficiently. Further, many individuals cite instances of Terms of Service being too broad or vague to interpret.
Scraped data can be used for academic research or to develop novel products and services that connect disparate sets of information and reduce information asymmetries across consumer populations (for example, search engines or price tracking). On the other hand, malicious bots can burden a company’s website and impede its operations.
Legal scholars have argued that public websites implicitly give the public the right to access (including to scrape) their content, but some companies disagree. This presents a fascinating quandary that is beyond the scope of this article.
At issue, and argued by Oracle in the case, was whether “the manner in which [the defendant] used” “bots and scrapers” was more than a contractual violation (a violation of the Terms of Service) and also a criminal violation under the Computer Fraud and Abuse Act. In the recording, viewable beginning at 33:42, Judge Susan Graber stated (at 36:00) that she has difficulty seeing how Oracle’s argument fits with the statute and previous cases. “They had permission to take [the scraped data],” she states, adding that previous cases and statutes refer only to data that a party did not have legal access to. Oracle’s attorney rebuts (at 34:47), “The manner restriction is critical to protect the integrity of the computer systems.” Judge Graber counters that this may be a matter for the civil sphere, but not for the criminal realm.
In another, currently pending case, hiQ v. LinkedIn, the Court noted further danger:
Under [an aggressive] interpretation of [the Computer Fraud and Abuse Act (CFAA)], a website would be free to revoke ‘authorization’ with respect to any person, at any time, for any reason, and invoke the CFAA for enforcement, potentially subjecting an Internet user to criminal, as well as civil liability. Indeed … merely viewing a website in contravention of a unilateral directive from a private entity would be a crime, effectuating the digital equivalence of Medusa.
The Court goes on to articulate that website owners could block certain populations on a discriminatory basis, which would put any individual who accesses a website, including Data Professionals, at risk.
Fortunately, the Ninth Circuit articulated in the Oracle v. Rimini case that “[T]aking data using a method prohibited by the applicable terms of use when the taking itself generally is permitted, does not violate [criminal statutes]” (Page 3).
This Oracle decision further clarifies for Data Scientists, Data Engineers, and others that they cannot be criminally prosecuted for violating a website’s Terms of Service. As mentioned above, because Terms of Service can be broad and open to interpretation, data professionals were potentially at risk of criminal prosecution and liability, in addition to exclusion and discrimination, if a company encouraged authorities to pursue charges. This resolution, however, still leaves businesses a remedy against bad actors through civil litigation. Oracle v. Rimini helps clarify some of the parameters within which the law will be applied to web scraping. The other case mentioned in this post, hiQ v. LinkedIn, with oral arguments scheduled for March of 2018, will further test the resolution in the Oracle case, as well as previous cases that have been resolved similarly.
Note: When engaging in web scraping, there are a number of best practices to follow: respect the Terms of Service as much as possible, respect a website’s robots.txt, identify your bot, do not republish the data without consent, do not gather non-public or sensitive data, do not overburden the website, e-mail the admin if you have a question, and seek advice from an attorney if you have additional questions.
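Three of these practices (respecting robots.txt, identifying your bot, and not overburdening the website) can be partly automated. What follows is only a minimal sketch in Python using the standard library's `urllib.robotparser`, not a complete or legally sufficient scraper. The bot name, the contact address, the one-second default pause, and the helper names `build_robots_parser`, `may_fetch`, `polite_delay`, and `crawl` are all assumptions made for illustration. The sketch also assumes the robots.txt text has already been downloaded, and that the caller supplies the function that actually fetches a page and sends the identifying User-Agent header.

```python
import time
import urllib.robotparser

# Hypothetical bot identity; replace with your own name and contact address.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.org)"


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent=USER_AGENT):
    """Return True if robots.txt allows this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent=USER_AGENT, default_seconds=1.0):
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


def crawl(urls, parser, fetch, sleep=time.sleep):
    """Call the caller-supplied fetch(url) for each allowed URL, pausing after each."""
    delay = polite_delay(parser)
    results = {}
    for url in urls:
        if not may_fetch(parser, url):
            continue  # skip paths the site asks bots not to visit
        results[url] = fetch(url)
        sleep(delay)  # spread requests out so the site is not overburdened
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and lets it be exercised without touching a real website. None of this replaces reading the Terms of Service or asking the site's administrator.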
Disclosure: I am not a lawyer and am interpreting these legal concepts and rulings from an aspiring Data Scientist’s perspective. Should there be an error in my understanding or writing, or if you have a question, please let me know at dkent [at] Berkeley [dot] edu. Thank you in advance.