INTELLECTUAL PROPERTY - March 2024

Generative AI and Copyright Fair Use

By Professor David Tan (NUS Law)

I.   Introduction

Part 1 discussed how Singapore copyright law is poised to tackle the issue of whether the use of copyright-protected works for machine learning (“input”) and the works created from natural language commands (“output”) are infringing copyright. It concluded that it is generally difficult to prove wholesale copying of millions of works as the various generative AI applications (GAIAs) do not disclose the training datasets, and one would have to proceed on a classic substantial similarity analysis in respect of each output text/image vis-à-vis the original work. This Part evaluates how the computational data analysis exception and fair use provision in the Singapore Copyright Act 2021 may be relevant to the input and output scenarios.

II.  Does the computational data analysis exception apply?

In Singapore, copyright law can provide a defence for infringing uses if these uses fall under either the computational data analysis exception (sections 243-244 of the Copyright Act 2021) or the fair use provision (sections 190-191). Under the computational data analysis exception, which is most relevant for the use of copyright-protected works for machine learning, five stringent conditions must be fulfilled.[1] Computational data analysis is defined non-exhaustively as “using a computer program to identify, extract and analyse information or data from the work” – which is synonymous with text and data mining (TDM).

Under the Copyright Act 2021, the five conditions to be satisfied include the following: the user must prove that the copy is made for the purpose of computational data analysis and not for any other purpose; the user must not supply the copy to any person other than for the purpose of verifying the results of the computational data analysis carried out by the user; the user must have lawful access to the material (the first copy) from which the copy is made; and the first copy must not be an infringing copy. The Singapore legislation gives the example that “X does not have lawful access to the first copy if X accessed the first copy by circumventing paywalls”, and identifies the use of images to train a computer program to recognise images, such as facial recognition software, as a permissible purpose. The latter scenario is likely deemed permissible since the software does not use the copyright-protected expression that inheres in a photograph, but rather utilises the “facts” (which are not protected) of a face to map the facial points, typically using 68 co-ordinates. Furthermore, the Act states that “X does not have lawful access to the first copy if X accessed the first copy in breach of the terms of use of a database” (s 244(2)(d)).

For machine learning purposes, the scraping of the internet for text and images will often circumvent paywalls or violate terms of use, hence failing the “lawful access” requirement under s 244(2)(d). The allegation that ChatGPT, as well as its precursor models, was trained using datasets from Common Crawl, BookCorpus, Books1 and Books2 from shadow libraries suggests that the access is “unlawful”. Furthermore, the making of a copy, which will involve the conversion of authorial works into a machine-readable format or, in some GAIAs, data storage, will not be for the sole purpose of analysing the data to improve the functioning of the AI in relation to that data (s 244(2)(b)); it will be for the purpose of generating new expressive works based on that data, which is an impermissible purpose. It is also worth noting that the UK government’s consultation on a proposed TDM exception that would allow TDM for any purpose received 88 written submissions, with only 13 in favour.[2]

III.  Is it fair use?

A.  Relevant Principles

But is it nonetheless fair use? In Singapore, section 191 of the Copyright Act 2021 enumerates a non-exclusive list of four factors to be weighed to determine whether an unauthorised use is fair, and hence a permitted use, much like the US fair use provision in 17 USC §107. In the US, fair use has allowed Google Books, acting without permission of rights holders, to make digital copies of tens of millions of books to establish a publicly available internet search function.[3] An important feature is that an internet user can use this function, without charge, to determine whether a book contains a specified word or term and to see snippets of text containing the searched-for terms.

This decision on “transformative use” by the US Second Circuit Court of Appeals was also cited by the Singapore Court of Appeal with approval in Global Yellow Pages v Promedia Directories.[4] In evaluating the extent to which a work is transformative, the court will typically consider the purpose of the original and infringing secondary works; the secondary use should be plainly different from the original purpose for which the works were created. In a recent decision, the US Supreme Court commented: “the first factor relates to the problem of substitution — copyright’s bête noire. The use of an original work to achieve a purpose that is the same as, or highly similar to, that of the original work is more likely to substitute for, or supplant the work”.[5]

It was important that Google Books augmented public knowledge by making available information about the books without providing the public with a substantial substitute for matter protected by the copyright interests in the original works.[6] In May 2023, the majority of the US Supreme Court in Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith found that Lynn Goldsmith’s original photograph of the pop singer Prince, and the Andy Warhol Foundation (AWF)’s copying use of that photograph in an image licensed to a special edition magazine devoted to Prince, shared substantially the same purpose, and the use was of a commercial nature; as a result, factor one weighed against fair use.[7] The majority commented that a use may be justified because copying is reasonably necessary to achieve the user’s new purpose.[8]

Regarding GAIAs, the two fair use factors that are likely to carry the greatest weight in the analysis are: (1) the purpose/character of the use, namely whether the use by generative AI is “transformative”, i.e. changes the purpose or the nature of the original work in some way; and (2) the impact of the generative AI’s use on the market, i.e. whether it threatens the livelihood of the original creator by competing with their works or the licensing market for their works.

In respect of the first factor, Authors Guild v HathiTrust (“HathiTrust”)[9] is also instructive – the issue was whether the unauthorised digitisation of copyrighted works by 13 universities and other organisations in creating the HathiTrust Digital Library (“HDL”) may constitute fair use. The US Second Circuit Court of Appeals found that the first factor weighed in favour of fair use as HDL’s enabling of full-text search “serves a new and different function from the original” and is socially beneficial.[10] Additionally, the dealing was found to carry a “non-profit educational” purpose as the HDL was a project started by educational and non-profit institutions targeted at providing greater access to works without any “purely commercial” motive.[11] Even if there is a commercial motivation, the Second Circuit in the later decision of Authors Guild v Google saw “no reason … why Google’s overall profit motivation should prevail as a reason for denying fair use over its highly convincing transformative purpose, together with the absence of significant substitutive competition, as reasons for granting fair use.”[12] The court held that, as in HathiTrust, the purpose of Google’s copying of the original copyrighted books is “to make available significant information about those books, permitting a searcher to identify those that contain a word or term of interest, as well as those that do not include reference to it”,[13] which is significantly different from the purposes of the original books.

The Ninth Circuit’s decision in Kelly v Arriba Soft Corp[14] is also useful in understanding how the evaluation of the third factor could be applied to generative AI uses. There, it was held that the use of entire copyrighted works was necessary in situations involving search engines since copying only a part of the copyrighted work would create practical difficulties for users, thereby reducing the usefulness of the search engine. In the same vein, even if entire works were copied by web robots in the text and data mining context, it could be reasoned that such a taking is reasonable, considering the different purpose of the dealing (i.e. to identify patterns in vast amounts of raw data); thus, the third factor might not necessarily weigh against fair use. But if the purpose of generative AI is to analyse the specific expression of particular artists, and then replicate portions of that expression in response to a text prompt, then it does not appear to be a different purpose. The Second Circuit has observed that the courts have rejected any categorical rule that a copying of the entirety cannot be a fair use, especially when the copying was reasonably appropriate to achieve the copier’s transformative purpose and was conducted in a manner that did not offer a competing substitute for the original.[15]

Again, the Supreme Court’s decision in Andy Warhol Foundation for the Visual Arts is instructive. The majority opinion observed that “whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.”[16] In that case, the use was AWF’s commercial licensing of Warhol’s Orange Prince (which was based on Lynn Goldsmith’s original photograph) to appear on the cover of Condé Nast’s special commemorative edition. The purpose of that use was to illustrate a magazine about Prince with a portrait of Prince, and an infringing work that portrays Prince somewhat differently from Goldsmith’s photograph (yet has no critical bearing on her photograph) was insufficient for the first factor to favour AWF, given the specific context of the use. The majority emphasised: “To hold otherwise would potentially authorize a range of commercial copying of photographs, to be used for purposes that are substantially the same as those of the originals. As long as the user somehow portrays the subject of the photograph differently, he could make modest alterations to the original, sell it to an outlet to accompany a story about the subject, and claim transformative use.”[17] These observations are especially pertinent for images produced by GAIAs such as DALL·E, Stable Diffusion or Midjourney. If a user is looking for an image for illustrative purposes for a magazine, book, annual report or marketing brochure, and provides specific text prompts to a generative AI system to produce such an image – as opposed to licensing one directly from the original author – then the first factor is unlikely to weigh in favour of fair use.

The application of the fourth factor is also highly dependent on the finding of the first factor. The US Supreme Court in Campbell v Acuff-Rose Music Inc emphasised the close linkage between the first and fourth factors, in that the more the copying is done to achieve a purpose that differs from the purpose of the original, the less likely it is that the copy will serve as a satisfactory substitute for the original.[18] The Second Circuit noted that even if the purpose of the copying was for a valuably transformative purpose, such copying might nonetheless harm the value of the copyrighted original if done in a manner that resulted in widespread revelation of sufficiently significant portions of the original as to make available a significantly competing substitute.[19] Generally, copyright-protected works copied for data mining purposes will require extensive processing and analysis before knowledge is derived and shared. Miners must ensure that they do not reveal significant portions of the original copyrighted works to the public. Although one could argue that data mining could limit the rights owners’ expansion into a potential market (e.g. a lost opportunity to license the works[20]) since markets are dynamic and change over time to meet new demands, the US Circuit Courts have universally dismissed this argument where only a small portion of the original works was revealed to the public. In the Google Books litigation, the Second Circuit held that “a mere revelation of 16% of the text of plaintiffs’ books overstates the degree to which snippet view can provide a meaningful substitute.”[21] In generative AI scenarios where a significant portion of an original work is reproduced in an output in response to a user’s text prompt, one may more confidently discern a substitutive impact.

B. Fair Use in respect of Input and Output

ChatGPT, Stable Diffusion, DreamStudio and many other comparable GAIAs are not search engines. A number of them are highly successful commercial enterprises, with Stability AI valued at US$1 billion, and some charging a user fee for their services. There is also little transformative purpose to be found as the AI would be accessing and reproducing the creative expression in these works in the outputs, i.e. the works would have been appropriated for their creative elements rather than their underlying facts.

It is not easy to apply the fair use analysis to the issue of input independently of the output, as the purpose of the defendant’s use of the training input is often discernible only when considered in light of the output. In Authors Guild v Google, the Second Circuit observed that “copying from an original for the purpose of criticism or commentary on the original or provision of information about it, tends most clearly to satisfy Campbell’s notion of the ‘transformative’ purpose involved in the analysis of Factor One.”[22] In the claim, Authors Guild et al v OpenAI, Inc et al, filed in September 2023, it was alleged that when prompted, ChatGPT can, inter alia, generate summaries of books and infringing, unauthorised and detailed outlines for the next purported instalment of certain books.[23] In the Chabon claim, when ChatGPT was prompted to produce a screenplay in the style of David Henry Hwang’s The Dance and the Railroad, it generated a script written in Hwang’s style.[24] Ideas, facts, style and genre generally do not attract copyright protection. As fair use is a fact-specific inquiry, courts will have to assess each particular infringing use to determine whether it is a fair use. In response to a user’s prompt, ChatGPT’s generation of a summary of a book that it had ingested for training purposes may not be fair (both training input and output). However, its critical essay of the book would qualify as the equivalent of a book review and may be fair use, especially in light of the public benefit and absence of market substitutability. Moreover, the output from GAIAs that generate text commands a very different analysis from the output of those that generate images. A summary or critique of a book is less likely to substitute for the original book compared to an AI-generated image that would substitute for one that could have been purchased from Getty Images or Shutterstock.

In Google LLC v Oracle America Inc, the US Supreme Court emphasised that the copier’s use must add “something new, with a further purpose or different character, altering the copyrighted work with new expression, meaning or message.”[25] GAIAs are trained essentially with existing creative works, and in a significant number of outputs, it appears the works have been simply remixed or collaged to derive more works of the same kind based on our text prompts. Depending on the text prompts, a GAIA may be reproducing certain copyright-protected works without the requisite degree of transformativeness that will tilt the first factor in favour of fair use. Furthermore, the majority’s decision in Andy Warhol Foundation for the Visual Arts, as well as Google v Oracle, suggests that the first factor analysis may depend on whether there is a market for licensing content for training data (fourth factor); commercial copying to generate training data input for GAIAs would likely fail the first factor analysis absent a “particularly compelling justification” for the copying.[26]

Let us assume I am publishing a book titled “Andy Warhol and the Pop Art Movement”, one that contains commentaries on Warhol’s illustrious iconography and his influence on other artists. I could approach the AWF to pay a licence fee for the use of one of Warhol’s “Marilyn” works for the cover of the book and related marketing materials. Instead, using Stable Diffusion Online,[27] I entered the text prompt “Marilyn Monroe in the style of Andy Warhol”, and the image generated by Stable Diffusion resembles the original “Marilyn” silkscreens by Andy Warhol (Figure 1). The output image with the pink facial hue and golden hair – albeit without the blue background – appears to have been reproduced from the original “Shot Sage Blue Marilyn” by Andy Warhol.[28] Since one can freely use the images from Stable Diffusion in a variety of commercial settings under the Creative ML OpenRAIL-M licence, one can assume that the generated image may be used in advertisements, creation and sale of non-fungible tokens, commercial merchandising – and book covers. Applying the reasoning of the majority opinion of the US Supreme Court, as well as the Second Circuit, in Andy Warhol Foundation for the Visual Arts, the purpose of the use is likely neither to be sufficiently distinct from the original nor to offer a significantly different meaning or message. As the Stable Diffusion output reproduces a significant proportion of the original Warhol work, it can be distinguished from the snippets produced by Google Books in Authors Guild v Google which were adjudged to be fair use. The image generated by Stable Diffusion Online will also compete with the licensing market for Andy Warhol’s “Marilyn”, even though it would not be a substitute for the original artwork by Warhol. In summary, both the input (assuming Warhol’s “Marilyn” was used for machine learning) and output are unlikely to be fair use.

Figure 1 – Generated by Stable Diffusion Online (“Marilyn Monroe in the style of Andy Warhol”)

Last but not least, ChatGPT’s replies to our text prompts are not based on a process of reasoning or anything akin to human comprehension; they are based on the probabilities of certain words occurring together, and ChatGPT may generate paragraphs of text from copyrighted literary works in its responses. To be clear, some of the output generated by AI may be highly transformative (e.g. it may resemble Jeff Koons’ collage “Niagara” in the Blanch v Koons case, which was reasonably perceived to be a new work of art with a distinct and new meaning, message and character), but it is the use of the creative works in the machine learning process that is arguably not transformative. Such unrestricted and widespread use would have a substantially adverse impact on the licensing markets of these copyright-protected works.[29]

IV.  Conclusions

The different creative, cultural and economic imperatives that underlie the copyright system in Singapore are not significantly different from those underlying the US approach.[30] Generative AI can equally inspire awe and concern. The arrival of ChatGPT in November 2022 and its ensuing global success with users precipitated a chaotic hustle amongst developers and corporations to refine their existing generative AI systems and launch new ones. Our lives may have been enriched by generative AI, and our secret dreams of being a writer or artist may now be realised with the aid of GPT-4 or Stable Diffusion, but it does not mean that these benefits should not come at a price.

AUTHOR INFORMATION

Professor David Tan is the Co-Director of the Centre for Technology, Robotics, Artificial Intelligence & the Law (TRAIL) and Head (Intellectual Property) of the EW Barker Centre for Law & Business at NUS Law.

Email: david.tan@nus.edu.sg

REFERENCES

[1]     For an analysis of the computational data analysis exception, see David Tan and Thomas Lee Chee Seng, ‘Copying Right in Copyright Law: Fair Use, Computational Data Analysis and the Personal Data Protection Act’ (2021) 33 Singapore Academy of Law Journal 1032.

[2]     UK Intellectual Property Office, ‘Consultation Outcome – Artificial Intelligence and IP: copyright and patents’, Gov.uk (28 June 2022): https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents. The Cyberspace Administration of China has also released a draft version of its Administrative Measures for Generative Artificial Intelligence Services for public consultation; the deadline for submitting comments was 10 May 2023. Article 7 of the draft Measures stipulates that providers of generative AI must ensure that data used for training and optimization is obtained through legal means, and must, inter alia, comply with requirements stipulated by the Cybersecurity Law and not contain content that infringes intellectual property. See, e.g., Jeremy Daum, ‘Overview of Draft Measures on Generative AI’, China Law Translate (14 April 2023): https://www.chinalawtranslate.com/en/overview-of-draft-measures-on-generative-ai/.

[3]     Authors Guild v Google, Inc, 804 F.3d 202 (2nd Cir. 2015).

[4]     Global Yellow Pages Ltd v Promedia Directories Pte Ltd [2017] SGCA 28; [2017] 2 SLR 185 at [81].

[5]     Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023) at 15 (internal quotations omitted).

[6]     Ibid (citing Authors Guild v Google, Inc, 804 F.3d 202, 207 (2nd Cir. 2015)).

[7]     Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023). The appeal to the Supreme Court was an unusual one as it was only on the finding of factor one against AWF by the Second Circuit; the other three factors also weighed in favour of the plaintiff Lynn Goldsmith. With the majority of the Supreme Court deciding that the first factor weighed against AWF, the Second Circuit’s judgment was affirmed. It is pertinent to note that “the Court expresses no opinion as to the creation, display, or sale of any of the original Prince Series works [by Andy Warhol].” Ibid at 21.

[8]     Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023) at 19.

[9]     Authors Guild v HathiTrust, 755 F.3d 87, 92 (2nd Cir. 2014).

[10]    Ibid at 97. See also William F Patry, Patry on Copyright, vol. 4 (West, Online, 2015) at §10:21 (observing that the use in HathiTrust is “socially beneficial, serves a different purpose than the original, and is in no way substitutional”).

[11]    Authors Guild v HathiTrust, 755 F.3d 87, 90-91 (2nd Cir. 2014).

[12]    Authors Guild v Google, 804 F.3d 202, 219 (2nd Cir. 2015).

[13]    Ibid at 217 (emphasis in original).

[14]    336 F.3d 811 (9th Cir. 2003).

[15]    Authors Guild v Google, 804 F.3d 202, 220 (2nd Cir. 2015).

[16]    Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023) at 33.

[17]    Ibid.

[18]    Campbell v Acuff-Rose Music, Inc, 510 US 569, 591 (1994).

[19]    Authors Guild v Google, 804 F.3d 202, 223 (2nd Cir. 2015).

[20]    Authors Guild v HathiTrust, 755 F.3d 87, 99 (2nd Cir. 2014) (this was an argument the plaintiffs raised).

[21]    Authors Guild v Google, 804 F.3d 202, 223 (2nd Cir. 2015).

[22]    Authors Guild v Google, 804 F.3d 202, 216 (2nd Cir. 2015).

[23]    Authors Guild et al v OpenAI, Inc et al, Case 1:23-cv-08292 (filed 19 September 2023) (Southern District of New York).

[24]    Chabon et al v OpenAI, Inc et al, Case 3:23-cv-04625-PHK (filed 8 September 2023) (Northern District of California) at [50].

[25]    Google LLC v Oracle America, Inc, 141 S.Ct. 1183, 1202 (2021).

[26]    Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023) at 35.

[27]    Stable Diffusion Online <https://stablediffusionweb.com/>.

[28]    ‘Warhol’s Marilyn Monroe painting sold for record-breaking $195m’, BBC News (10 May 2022) <https://www.bbc.com/news/world-us-canada-61339179>.

[29]    Global Yellow Pages Ltd v Promedia Directories Pte Ltd [2017] SGCA 28; [2017] 2 SLR 185 at [84].

[30]    E.g. Andy Warhol Foundation for the Visual Arts, Inc v Goldsmith, 598 U.S. __ (2023) (Slip Opinion, 18 May 2023) at 13.