The rapidly evolving landscape of advanced technology renders data one of the most valuable commodities today. This is especially true for artificial intelligence (AI), which can advance significantly in capability and complexity by learning from massive data sets used as training data.
A company can have various types of data online — for example, content in the form of text, images, audio, and/or video. This article addresses proprietary content a company makes available on its website or a third-party website.
The web is a unique data source because the interface is widely accessible and the data is quickly transmitted. Such uniqueness leads to special technical and legal issues, the treatment of which often benefits from reinterpreting existing situations in a new light. It is important for a company that makes data available online to stay informed not only of advancements in technology, but also of laws and legal remedies that may be available to protect that data from unauthorized use.
Companies should consider the extent to which they make data available and the manner in which they do so. Companies often have a strong business interest in making at least some of their data available publicly or without modification, as the introduction of restrictions can interrupt the user experience.
A company that makes data publicly available or as-is may find itself more exposed to unauthorized use of that data, particularly if it takes no proactive measures. Fortunately, various measures are available which, when applied wisely, can reduce unauthorized use without significantly sacrificing accessibility or usability.
In this article, we analyze the legal protections that apply to companies’ online data against unauthorized use and, based on that analysis, identify considerations companies should account for when undertaking efforts to protect that data.
Restricting computer access with computer technology
A company that uses computer technology to restrict access to the computer hosting its data, thus restricting access to that data at the infrastructure level, is inherently afforded more protection. One example of such a restriction is an authentication mechanism that requires a username and a password. The company can also make claims under federal law when such a restriction is violated.
Computer Fraud and Abuse Act against circumvention of computer access
The Computer Fraud and Abuse Act (CFAA) prohibits intentionally accessing a computer without authorization or exceeding authorized access, thereby obtaining information. The “without authorization” or “exceeds authorized access” elements of the CFAA apply when a user obtains information from the computer by bypassing the restriction, where “access” and “authorization” are specifically construed. Current law favors restrictions that make a website unavailable to the general public and require permission for access.
Web scraping to collect training data for AI technology might not involve hacking to bypass user authentication, which was conventionally the target of the CFAA, but instead often involves improper use after gaining initial access.
Specifically, where a user accesses a restricted part of a website by logging in with a valid username and password, such access is “authorized.” At that point, using the accessed data for an unauthorized purpose (e.g., collecting data to misappropriate in violation of the data owner’s intellectual property rights) is insufficient to substantiate a CFAA claim.
This does not mean that the CFAA offers the company no further protection against inappropriate use once one type of restricted access is granted. The company can implement additional restrictions on computer access with computer technology, the violation of which can implicate the CFAA.
In Ryanair DAC v. Booking Holdings Inc., the Court noted the plaintiff company’s efforts to prevent scraping by blocking access to the website based on IP addresses that indicate bot activity and concluded that CFAA claims were validly raised where defendants allegedly circumvented this technology.
Whether a company can raise CFAA claims against a person (entity or individual) who accesses its data depends on the extent to which the company has made that data accessible and the details of the access.
Where a company employs login requirements and additional mechanisms (e.g., blocking IP addresses) to restrict access, the person’s activity once logged in would not normally be grounds for a CFAA claim, but there may be grounds for a CFAA claim where the person circumvented the additional mechanisms. These additional mechanisms strengthen the argument that the data is not public, such that access may be “unauthorized” under the CFAA.
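As a purely technical illustration of the kind of additional mechanism described above, the following is a minimal sketch of IP-based blocking driven by request volume. The class name `IpRateBlocker`, its parameters, and the thresholds are hypothetical and chosen only for illustration; nothing here reflects the technology at issue in Ryanair, and whether any given mechanism supports a CFAA claim remains a legal question.

```python
from collections import defaultdict, deque


class IpRateBlocker:
    """Hypothetical sketch: block an IP once it exceeds a request budget."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests        # allowed requests per window
        self.window_seconds = window_seconds    # length of the sliding window
        self._history = defaultdict(deque)      # ip -> recent request times
        self.blocked = set()                    # IPs flagged as likely bots

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request may proceed, False if the IP is blocked."""
        if ip in self.blocked:
            return False
        recent = self._history[ip]
        # Discard timestamps that have fallen outside the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Volume suggests automated access; block this and later requests.
            self.blocked.add(ip)
            return False
        return True


blocker = IpRateBlocker(max_requests=3, window_seconds=10.0)
# Four requests within three seconds from one address: the fourth is refused.
burst = [blocker.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
# A visitor making one request every 20 seconds is never blocked.
slow = [blocker.allow("198.51.100.2", t) for t in (0.0, 20.0, 40.0, 60.0)]
```

In this sketch the timestamp is passed in explicitly so the behavior is easy to test; a production system would more likely read a clock and combine volume with other signals, since request rate alone can misclassify legitimate users behind shared addresses.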
Where a user further defies a cease-and-desist letter, the company’s CFAA claim could be bolstered, particularly insofar as further access without authorization could be argued to be intentional.
A company may choose to keep its data generally accessible on its website or publish data on a similarly accessible website. Regardless of whether a company’s data is available generally or accessible with restrictions, there are other protections available to companies seeking to protect their data, which can depend on the management and nature of the data, as discussed below.
Altering data with management information or additional computer technology
Rather than simply making data available online as-is, a company can restrict access by layering additional data over, or embedding additional data into, its data. The company’s data often remains useful despite the presence of such additional data, which can, however, deter undesirable access while improving data management.
Examples of such restriction include adding digital watermarks or metatags associated with data ownership. The company can also make claims under federal law when such restriction is violated, and can do so even when the company makes its data available on a third-party website.
The Digital Millennium Copyright Act against circumvention of data access
Under the Digital Millennium Copyright Act (DMCA), to circumvent a technological measure means “to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner.”
In the online context, technological measures that readily deter access to data are especially relevant. Such technological measures are typically implemented by altering the data being protected to add copyright management information (CMI) as plain text or watermarks.
The DMCA makes it unlawful to falsify, remove, or alter the CMI of a copyrighted work. CMI can take many forms, such as a copyright notice, to indicate that the content is protected by copyright. Depending on the form taken, there can be other intellectual property implications, as discussed below.
When a company proactively undertakes efforts to protect its online data by including CMI in, or otherwise altering, the data, it can avail itself of protections and remedies afforded by the DMCA. Others may violate the DMCA when they alter the marked data to strip out the CMI for their own purposes.
For example, in the recent, highly publicized complaint filed by Getty Images against Stability AI, Getty asserted that it includes information, watermarks, and metadata with its content, and that Stability AI’s product Stable Diffusion “generates images that include distorted versions of Getty Images’ watermark.” Getty argued that Stability AI removed or altered Getty’s CMI and provided false CMI under the DMCA.
Additionally, the Court in Doe 1 v. GitHub, Inc. made it clear that defendants, who allegedly removed CMI, could not hide behind the AI tool with respect to the intent or knowledge elements of a DMCA claim:
Defendants argue that the complaint merely alleges “the passive non-inclusion of CMI” by neutral technology which excerpts code without the accompanying CMI, rather than the active removal of CMI from licensed code… This semantic distinction is not meaningful.
Lanham Act against infringement of trademark right
As noted above, CMI can include marks or designs that not only specify ownership but also may be associated with a company’s brand. Placing a trademark on the data, in addition to identifying the source of the data similar to a copyright notice, can create additional avenues for legal recourse under the Lanham Act against a party attempting to use that data.
For example, Getty also made several claims relating to its trademarks. The Getty Images watermark that was allegedly distorted contained Getty Images’ name as a trademark. Thus, in addition to arguing that this distortion implicated the DMCA with respect to CMI, Getty alleged trademark infringement. Getty further alleged that the AI output images create confusion as to the origin of the images, incorrectly suggest an affiliation with Getty, and damage Getty’s reputation and goodwill.
These cases exemplify the utility of marking data before making it available online. The inclusion of watermarks or metatags in data can pave the way for causes of action under the DMCA and the Lanham Act.
Controlling data access and use via user agreements
Companies can also leverage their user agreements to protect their online data. User agreements clarify certain operations with existing legal basis and impose additional requirements not inherently based in law.
User agreements lay out terms and conditions that apply to users (including visitors) of a company’s website — and, if properly implemented, create a binding contract between the company and user, the violation of which constitutes a breach of contract with legal remedies. These user agreements protect the company’s data by deterring users from prohibited usage. For purposes of this article, we assume any such user agreement is properly implemented such that it has a binding effect on the user.
Companies can strengthen their efforts to protect their data by explicitly prohibiting certain activities. For example, the terms can prohibit using any automated means (e.g., bots) to access or use the website (including via scraping) and bypassing (or attempting to bypass) any password protection or other restrictions on accessing any portion of the website.
Companies with account requirements can specify the requirements for creating and maintaining a valid account, such as requiring that all account information be accurate and that the account be used only by the account holder for non-commercial purposes.
Beyond laying the grounds for contractual claims, user agreements can support other claims. In Ryanair, one of the factors supporting the Court’s finding that plaintiffs raised a valid CFAA claim was the defendants’ alleged violation of the website terms, which prohibited screen scraping of a password-restricted portion of the website.
User agreements can specify not only how the website should be accessed, but also how its data should be used. The terms can prohibit replicating data from the website, for instance.
The Doe 1 case touched on the issue of attribution to data owners where an AI tool produces output using their data. The case remains ongoing and there have yet to be final rulings on the issues. Even so, the Court recognized that a breach of contract claim was validly raised: plaintiffs argued that defendants’ AI output failed to reproduce the copyright notice in plaintiffs’ data used to train the AI, and they identified the contractual obligations allegedly breached (which included attribution requirements).
This is instructive in that a company may seek to use its user agreement as a basis for contractual obligations relating to the use of its data in training AI tools, for example by requiring attribution to the company in a derivative work output by the AI tools. The burden of having to include such extra information in the AI output can itself deter use of the company’s data.
Companies that make data available online have various tools at their disposal when undertaking efforts to protect their data from unauthorized use. At the same time, myriad legal issues can arise and the unique facts and circumstances of each case inform the outcome. As data becomes increasingly valuable for developing powerful AI models, a company seeking to protect its data should consider the type of data it makes available online, the way in which it makes the data available (and to whom), any ancillary information it includes with the data indicating ownership, and what terms apply to use of the website and its content.
 Such claims can cover user-generated content on the company’s website.
 18 U.S.C.A. § 1030(a).
 See hiQ Labs, Inc. v. LinkedIn Corp., 31 F.4th 1180, 1199 (9th Cir. 2022) (“authorization is only required for password-protected sites or sites that otherwise prevent the general public from viewing the information”).
 See Van Buren v. United States, 210 L. Ed. 2d 26, 141 S. Ct. 1648 (2021) (no CFAA violation where someone with access credentials accesses information for an “improper use” if that information is otherwise available to them).
 No. CV 20-1191-WCB, 2022 WL 13946243, at *12 (D. Del. Oct. 24, 2022).
 In Ryanair, the Court found that a CFAA claim was validly supported by “various” factors including a login mechanism, an IP blocking mechanism, and use of cease-and-desist letters. Id. at 12. The Court pointed to the increased chance of having a claim under the CFAA by implementing a username and password authentication system for access in the first place. Id. at 11. See also Craigslist Inc. v. 3Taps Inc., 942 F.Supp.2d 962 (N.D. Cal. 2013) (while not addressing whether the CFAA applied to public information, the Court held that, assuming the CFAA applied, “Defendants’ continued use of Craigslist after the clear statements regarding authorization in the cease and desist letters and the technological measures to block them constitutes unauthorized access under the statute.”).
 17 U.S.C.A. § 1201.
 17 U.S.C.A. § 1202. The DMCA contains scienter elements which, depending on the alleged violation, require that the alleged copyright infringer knew and/or intended to falsify, remove, or alter the CMI. While owners can directly rely on the Copyright Act of 1976 for copyright infringement, a party engaging in scraping or other aggregation of the website’s data for “useful” purposes, such as AI training, may raise the fair use defense. A company’s success in overcoming a fair use defense may depend on whether the alleged infringer’s use of the data is sufficiently transformative such that it is fundamentally different than the original use.
 See Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135 (D. Del) at 2. Getty stated at the outset of its complaint that “[t]he visual assets on Getty Images’ websites are accompanied by: (i) titles and captions which are themselves original and creative copyrighted expression; (ii) watermarks with credit information and content identifiers that are designed to deter infringing uses of the content; and (iii) metadata containing other copyright management information.”
 No. 22-CV-06823-JST, 2023 WL 3449131 (N.D. Cal. May 11, 2023).
 The Lanham Act, 15 U.S.C.A. § 1051 et seq.
 See Getty at 22, 27.
 Id. at 28.
 In this case, defendants developed AI-based computer coding tools. According to plaintiffs, the data used to train these tools included code from defendants’ online repository of open-sourced code, to which plaintiffs had uploaded their own developed code under open-source license terms. These terms require that “any derivative work or copy include attribution, a copyright notice, and the license terms.” Doe 1 at 1. Plaintiffs argued that the tools’ output, which constitutes code, does not include such required information. The Court found that the open-source license terms were sufficient basis for a breach of contract claim.
Reprinted by permission.