David Tan-Generative AI and Copyright Infringement
INTELLECTUAL PROPERTY - January 2024

Generative AI and Copyright Infringement

By Professor David Tan (NUS Law)

I.   Introduction

This is part one of a two-part article that discusses how Singapore copyright law is poised to tackle two issues: (a) whether the use of copyright-protected works for machine learning (“input”) and the works created from natural language commands (“output”) infringe copyright; and (b) whether a text and data mining (“TDM”) exception or fair use defence applies to such uses. A longer and more comprehensive version of (a) may be found in the SAL Practitioner.[1]

Globally, and in Singapore, there is certainly significant public interest in what ChatGPT can deliver, whether in assisting students with writing school assignments or in generating scam e-mails. ChatGPT – where “Chat” in the name refers to it being a chatbot, and “GPT” stands for generative pre-trained transformer, which is a type of large language model (“LLM”) – is the artificial intelligence (“AI”) system designed by OpenAI, which builds generative models using deep learning technology that leverages large amounts of data to train an AI system to perform a task. It has been reported to be the fastest-growing consumer application in history, far surpassing the successes of TikTok, Facebook and Instagram.

OpenAI also operates DALL·E, which is an AI system that can create realistic images and art from a description in natural language. Other well-known image-generating AI applications include Stable Diffusion and Midjourney. These sophisticated AI technologies, which train on vast quantities of authorial works to generate new content in response to text prompts, are often described as “generative AI”,[2] and the manner in which these copyright-protected works are employed in training the AI has attracted a number of high-profile lawsuits since the start of 2023.[3] For the purposes of this article, the panoply of generative AI applications will be known as GAIAs.

The new Singapore Copyright Act 2021 was carefully calibrated to negotiate the complex relationships between protecting rights owners and enabling the public and other users to have access to these works in order to create new ones.[4] Significantly, by codifying an open-ended fair use provision akin to that in the United States, works protected by copyright – which include music, videos, images and lyrics – may just be more readily available for transformative repurposing on social media platforms such as TikTok, Instagram and Facebook. However, at the time of public consultation in the mid-2010s, GAIAs such as ChatGPT, DALL·E, Stable Diffusion, Midjourney and DreamStudio were not even in the public consciousness.

II.  Infringement – Machine Learning Input and Generated Output

1.   Machine learning input can potentially infringe copyright

In order for ChatGPT to respond to the questions or commands posed by human individuals, it needs to have access to millions or even billions of literary works – many of which are protected by copyright – in order to produce fully fleshed out answers and results based on digitally accessible text-based information. Often referred to as the input of data for machine learning or machine training, this is the process by which an AI system is “fed” the relevant works in order for it to function effectively. OpenAI had previously revealed that in its earlier AI models, e.g. GPT-1, it had accessed BookCorpus, which had a collection of over 7,000 unique unpublished books. In 2020, the datasets used to train GPT-3 came from two internet-based books corpora amounting to 357,000 titles.[5] However, the training datasets for GPT-4 were not revealed.

The Singapore Copyright Act 2021 defines a “copy” of an authorial work as a reproduction of the work in any material form (s 41(1)) and deems reproduction to have occurred if the work “is converted into … a digital or other electronic machine‑readable form” (s 41(2)(f)). Furthermore, the making of a copy of a work that is temporary or is incidental to some other use of the work is to be treated as making a copy of the work (s 50(1)). Section 146 stipulates that copyright is infringed if a person, who is neither the copyright owner nor a licensee, does in Singapore, or authorises the doing in Singapore of, any act comprised in the copyright.

Presently, when images are input for machine learning, an algorithm will usually scrape the internet for content from various websites, invariably accessing content without permission and in violation of express prohibitions against such conduct contained in the terms of use of these websites. It is unlikely that all works used in GAIA training are open-access works or works in the public domain. Generally, in the first stage of the data mining process (even if the AI system is not directly “fed” the relevant input), web robots may infringe the reproduction rights of the owners in the original literary, dramatic, musical and artistic (LDMA) authorial works if such works are copied.

In July 2023, the Authors Guild sent an open letter to the leaders of some of the world’s biggest generative AI companies. Signed by more than 9,000 writers, including prominent authors such as George Saunders and Margaret Atwood, it asked the likes of Alphabet, OpenAI, Meta and Microsoft “to obtain consent, credit, and fairly compensate writers for the use of copyrighted materials in training AI.”[6] More recently, in what is perhaps an implicit acknowledgement that the input of copyright-protected works for AI learning would be infringing, Adobe announced that it plans to pay content creators for using their works in its new AI tool – Firefly – and that it will also allow creators the choice not to let their work be used to train the AI.[7]

2.   Generated output can also infringe copyright

When ChatGPT or DreamStudio generates text or images based on the user’s questions or commands, the output can also infringe copyright in a source text or image if it is substantially similar to the original. In theory, ChatGPT’s output, like that of other LLMs, will be generated based on patterns and connections drawn from the training data. If asked to generate an essay on the position of copyright fair use in Singapore, ChatGPT is unlikely to paraphrase all the sentences from its training dataset of literary works, and will invariably reproduce a significant amount of text verbatim from its sources (which may include academic articles, court judgments and online commentaries).

In the Getty Images lawsuit, the claim identified some of the output delivered by Stability AI in the DreamStudio application as including a modified or distorted version of a Getty Images watermark, underscoring the clear link between the copyrighted images and the final product. In such circumstances, this would be another instance of copyright infringement. However, in the class action lawsuit by artists that included Sarah Andersen, the motions to dismiss filed by the defendants’ lawyers pointed out that the plaintiffs failed to clearly identify which particular output was substantially similar to a specific input, and hence infringing.

One should further note that copyright does not protect the style of an artist, no matter how distinctive – this includes a painting style (like Picasso’s distinctive cubist style or Warhol’s silkscreen treatments of photographs), writing style or singing style. The artistic style of an author in copyright law is generally considered an “idea” in the well-established idea-expression dichotomy, which has been codified in the US, and also adopted by the Singapore Court of Appeal.[8] In the same way that we can freely paint and sell a landscape of the Singapore Botanic Gardens in Monet’s impressionist style (assuming that Claude Monet’s paintings are still protected by copyright), it is not copyright infringement if DALL·E, in response to a prompt “Singapore Botanic Gardens in the style of Monet”, generates a particular image that evokes Monet’s Bridge Over A Pond Of Water Lilies. The assessment of infringing outputs is a fact-intensive inquiry where one would need to prove substantial similarity between the output text/image and the expression protected in the original input text/image, and one should not presume that if the storage of input for generative AI learning was infringing, the output would necessarily be infringing.

III. Conclusions

In summary, it is difficult to prove wholesale copying of millions of works as the various GAIAs do not disclose their training datasets, and one would have to proceed on a classic substantial similarity analysis in respect of each output text/image vis-à-vis the original work. Perhaps realising the enormity of the task before them should they resort to litigation, media companies such as CNN, The New York Times and Reuters have deployed technological defensive measures such as injecting code into their websites that blocks OpenAI’s web crawler, GPTBot, from scanning their platforms for content.[9]

AUTHOR INFORMATION

Professor David Tan is the Co-Director of the Centre for Technology, Robotics, Artificial Intelligence & the Law (TRAIL) and Head (Intellectual Property) of the EW Barker Centre for Law & Business at NUS Law.

Email: david.tan@nus.edu.sg

REFERENCES

[1]    David Tan, ‘Generative AI and Copyright – Part 1: Copyright Infringement’ [2023] SAL Prac 24 (https://journalsonline.academypublishing.org.sg/Journals/SAL-Practitioner/Intellectual-Property-Law/ctl/eFirstSALPDFJournalView/mid/597/ArticleId/1921/Citation/JournalsOnlinePDF).

[2]    This should be contrasted with the term “AGI”, which stands for Artificial General Intelligence and “refers to a theoretical type of artificial intelligence that possesses human-like cognitive abilities, such as the ability to learn, reason, solve problems, and communicate in natural language”. See generally Gil Press, ‘Artificial General Intelligence (AGI) Is A Very Human Hallucination’, Forbes (28 March 2023): https://www.forbes.com/sites/gilpress/2023/03/28/artificial-general-intelligence-agi-is-a-very-human-hallucination/?sh=2f75b23364f2

[3]    E.g. Authors Guild et al v OpenAI Inc et al, Case 1:23-cv-08292 (filed 19 September 2023) (Southern District Court of New York); Silverman et al v OpenAI Inc, Case 3:23-cv-03416 (filed 7 July 2023) (Northern District Court of California); Tremblay et al v OpenAI Inc, Case 3:23-cv-03223 (filed 28 June 2023) (Northern District Court of California); Getty Images (US), Inc v Stability AI Inc, Case 1:23-cv-00135-UNA (filed 3 February 2023) (District Court of Delaware); Andersen et al v Stability AI Ltd et al, Case 3:23-cv-00201 (filed 13 January 2023) (Northern District Court of California).

[4]    See David Tan, ‘The Price of Generative AI Learning: Exceptions and Limitations under the New Singapore Copyright Act’ (2023) European Intellectual Property Review 400.

[5]    Tremblay et al v OpenAI Inc, Case 3:23-cv-03223 (filed 28 June 2023) (Northern District Court of California) at [28]-[35].

[6]    Will Bedingfield, ‘The Generative AI Battle Has A Fundamental Flaw’, Wired (25 July 2023) <https://www.wired.co.uk/artificial-intelligence-copyright-law?verso=true>.

[7]    Krist Boo, ‘Adobe to pay AI-content creators in groundbreaking move’, The Straits Times (21 March 2023): https://www.straitstimes.com/business/adobe-to-pay-ai-content-creators-in-groundbreaking-move.

[8]    Global Yellow Pages Ltd v Promedia Directories Pte Ltd [2017] SGCA 28; [2017] 2 SLR 185 at [15].

[9]    Oliver Darcy, ‘Disney, The New York Times and CNN are among a dozen major media companies blocking access to ChatGPT as they wage a cold war on A.I.’, CNN (28 August 2023) <https://edition.cnn.com/2023/08/28/media/media-companies-blocking-chatgpt-reliable-sources/index.html>.