On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.
Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.
The Joseph Saveri Law Firm is well known for litigation over artificial intelligence. In November 2022, the same firm filed suit over GitHub Copilot for alleged copyright violations. In January 2023, the same legal team repeated that formula with a lawsuit against Stability AI, Midjourney, and DeviantArt over AI image generators. The GitHub case is on a path to trial, according to attorney Matthew Butterick. Procedural maneuvering in the Stable Diffusion lawsuit is still underway, with no clear outcome yet.
In a press release last month, the law firm described ChatGPT and LLaMA as “industrial-strength plagiarists that violate the rights of book authors.” Authors and publishers have been reaching out to the law firm since March 2023, lawyers Joseph Saveri and Butterick wrote, because authors are “concerned” about these AI tools’ ability to generate text similar to that found in copyrighted written works, including thousands of books.
The most recent lawsuits from Silverman, Golden, and Kadrey were filed in a US district court in San Francisco. The authors have demanded jury trials in each case and are seeking permanent injunctive relief that could force Meta and OpenAI to make changes to their AI tools.
Meta declined Ars’ request for comment. OpenAI did not immediately respond to Ars’ request for comment.
A spokesperson for the Saveri Law Firm sent a statement to Ars, saying that if the alleged practices are allowed to continue, these models will eventually replace the authors whose stolen works power the AI products they are competing with. The spokesperson framed the lawsuit as part of a larger fight to preserve ownership rights for all artists and other creators.
Accused of using “unlawful” data sets
Neither Meta nor OpenAI has fully disclosed the contents of the data sets used to train LLaMA and ChatGPT. But lawyers for the authors who are suing say they have deduced the likely data sources from clues in statements and papers released by the companies or related researchers. The authors have accused both OpenAI and Meta of using training data sets that contain copyrighted materials distributed without the permission of authors or publishers, including by downloading works from large e-book pirate sites.
In the OpenAI lawsuit, the authors allege that, based on OpenAI disclosures, ChatGPT appears to have been trained on 294,000 books allegedly downloaded from “shadow library” websites such as Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. Meta has disclosed that LLaMA was trained on a data set called ThePile, which the other lawsuit alleges includes “all of Bibliotik” and amounts to 196,640 books.
On top of allegedly accessing copyrighted works through shadow libraries, OpenAI is also accused of using a “controversial data set” called BookCorpus.
BookCorpus, the OpenAI lawsuit said, “was collected in 2015 by a group of AI researchers to train language models.” The research team allegedly “copied the books from a website called Smashwords that hosts self-published books, available to readers at no cost.” These books, however, are still under copyright and allegedly “were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.”
Ars could not immediately reach the BookCorpus researchers or Smashwords for comment. [Update: Dan Wood, COO of Draft2Digital (which acquired Smashwords in March 2022), told Ars that the Smashwords “store site lists close to 800,000 titles for sale,” with “about 100,000” currently priced at free.
“Typically, the free book will be the first of a series,” Wood said. “Some authors will keep these titles free indefinitely, and some will run limited promotions where they offer the book for free. From what we understand of the BookCorpus data set, approximately 7,185 unique titles that were priced free at the time were scraped without the knowledge or permission of Smashwords or its authors.” It wasn’t until March 2023 when Draft2Digital “first became aware of the scraped books being used for commercial purposes and redistributed, which is a clear violation of Smashwords’ terms of service,” Wood said.
“Every author, whether they have an internationally recognizable name or have just published their first book, deserve to have their copyright protected,” Wood told Ars. “They also should have the confidence that the publishing service they entrust their work with will protect it. To that end, we are working diligently with our lawyers to fully understand the issues—including who took the data and where it was distributed—and to devise a strategy to ensure our authors’ rights are enforced. We are watching the current cases being brought against OpenAI and Meta very closely.”]