9,000 authors say AI firms exploited books to train chatbots

Three separate photos show authors Rebecca Makkai, left; George Saunders, center; and Celeste Ng.

July 19, 2023 6:40 PM PT

More than 9,000 authors are calling out the tech companies behind generative AI in an open letter that states there is an inherent injustice in exploiting copyright-protected works to train chatbots without consent, credit or compensation.

If users prompt GPT-4 to summarize works by Roxane Gay or Margaret Atwood, it can do so in detail, chapter by chapter. If users want ChatGPT to write a story in the style of an acclaimed author such as Maya Angelou, they can ask it to “write a personal essay in the style of Maya Angelou, exploring the theme of self-discovery and personal growth.” And voilà.

The generative AI is powered by software programs known as large language models, which forgo traditional programming methods and instead ingest massive amounts of text in order to produce natural and lifelike responses to user prompts.

In Tuesday’s open letter, the Authors Guild writes that “Generative AI technologies built on large language models owe their existence to our writings. These technologies mimic and regurgitate our language, stories, style, and ideas. Millions of copyrighted books, articles, essays, and poetry provide the ‘food’ for AI systems, endless meals for which there has been no bill.”

The letter further states that tech companies including OpenAI, Alphabet, Meta, Stability AI, IBM and Microsoft have spent billions to develop AI technology and that compensating the authors for using their works would be the fair move, because without those books, “AI would be banal and extremely limited.”

Novelist and essayist Jonathan Franzen commended the effort, stating, “The Authors Guild is taking an important step to advance the rights of all Americans whose data and words and images are being exploited, for immense profit, without their consent — in other words, pretty much all Americans over the age of six.”

Dan Brown, James Patterson, Margaret Atwood, Roxane Gay, Celeste Ng, Viet Thanh Nguyen, George Saunders and Rebecca Makkai are among the thousands of authors who are taking AI industry leaders to task, asking that their concerns be addressed and specific actions taken:

Obtain permission for the use of copyrighted material in generative AI programs.
Fairly compensate writers for both past and ongoing use of their works in generative AI programs.
Fairly compensate writers for the use of their works in AI output, regardless of whether the outputs infringe upon current laws.

“We understand that many of the books used to develop AI systems originated from notorious piracy websites,” the letter continues. “Not only does the recent Supreme Court decision in Warhol v. Goldsmith make clear that the high commerciality of your use argues against fair use, but no court would excuse copying illegally sourced works as fair use.”

The Authors Guild says generative AI threatens writers’ professions by “flooding the market with mediocre, machine-written books, stories, and journalism based on our work.” It adds that for at least the last decade, authors have experienced a 40% decline in income, with many full-time writers in 2022 barely surpassing the federal poverty level.

The letter comes just weeks after bestselling novelists Mona Awad and Paul Tremblay filed a suit against OpenAI in a San Francisco federal court, claiming that ChatGPT was trained in part by “ingesting” their novels without their consent.

When prompted, ChatGPT generated extremely detailed summaries of Tremblay’s “The Cabin at the End of the World” and Awad’s “Bunny” and “13 Ways of Looking at a Fat Girl.” Both authors claim this is proof that their novels were used to train the chatbot, and the filing includes ChatGPT’s responses to prompts regarding their novels.

In June 2018, OpenAI revealed that it trained GPT-1 using BookCorpus, which the suit described as a “controversial dataset” assembled by artificial intelligence researchers in 2015, with a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.

“They copied the books from a website called Smashwords.com that hosts unpublished novels that are available to readers at no cost. Those novels, however, are largely under copyright.”

According to the complaint, later iterations of the company’s large language models were trained using significantly larger quantities of copyright-protected books. In a July 2020 paper introducing GPT-3, the company revealed that 15% of the training data set came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2.”

The suit estimates that, based on numbers revealed in OpenAI’s paper about GPT-3, Books1 would contain roughly 63,000 titles, and Books2 would include approximately 294,000 titles.

Experts predict that more suits will follow as AI becomes more adept at using information from the web to generate new content.