US lawmaker proposes a public database of all AI training material


US lawmaker proposes a public database of all AI training material

Amid a flurry of lawsuits over AI models’ training data, US Representative Adam Schiff (D-Calif.) has introduced a bill that would require AI companies to disclose exactly which copyrighted works are included in datasets training AI systems.

The Generative AI Disclosure Act “would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system,” Schiff said in a press release.

The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it’s enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use.

All notices would be kept in a publicly available online database.

Schiff described the act as championing “innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets.”

“This is about respecting creativity in the age of AI and marrying technological progress with fairness,” Schiff said.

Currently, creators who don’t have access to training datasets rely on AI models’ outputs to figure out if their copyrighted works may have been included in training various AI systems. The New York Times, for example, prompted ChatGPT to spit out excerpts of its articles, relying on a tactic to identify training data by asking ChatGPT to produce lines from specific articles, which OpenAI has curiously described as “hacking.”

Under Schiff’s law, The New York Times would need to consult the database to ID all articles used to train ChatGPT or any other AI system.

Any AI maker who violates the act would risk a “civil penalty in an amount not less than $5,000,” the proposed bill said.

At a hearing on artificial intelligence and intellectual property, Rep. Darrell Issa (R-Calif.)—who chairs the House Judiciary Subcommittee on Courts, Intellectual Property, and the Internet—told Schiff that his subcommittee would consider the “thoughtful” bill.

Schiff told the subcommittee that the bill is “only a first step” toward “ensuring that at a minimum” creators are “aware of when their work contributes to AI training datasets,” saying that he would “welcome the opportunity to work with members of the subcommittee” on advancing the bill.

“The rapid development of generative AI technologies has outpaced existing copyright laws, which has led to widespread use of creative content to train generative AI models without consent or compensation,” Schiff warned at the hearing.

In Schiff’s press release, Meredith Stiehm, president of the Writers Guild of America West, joined leaders from other creative groups celebrating the bill as an “important first step” for rightsholders.

“Greater transparency and guardrails around AI are necessary to protect writers and other creators” and address “the unprecedented and unauthorized use of copyrighted materials to train generative AI systems,” Stiehm said.

Until the thorniest AI copyright questions are settled, Ken Doroshow, a chief legal officer for the Recording Industry Association of America, suggested that Schiff’s bill filled an important gap by introducing “comprehensive and transparent recordkeeping” that would provide “one of the most fundamental building blocks of effective enforcement of creators’ rights.”

A senior adviser for the Human Artistry Campaign, Moiya McTier, went further, celebrating the bill as stopping AI companies from “exploiting” artists and creators.

“AI companies should stop hiding the ball when they copy creative works into AI systems and embrace clear rules of the road for recordkeeping that create a level and transparent playing field for the development and licensing of genuinely innovative applications and tools,” McTier said.

AI copyright guidance coming soon

While courts weigh copyright questions raised by artists, book authors, and newspapers, the US Copyright Office announced in March that it would be issuing guidance later this year, but the office does not seem to be prioritizing questions on AI training.

Instead, the Copyright Office will focus first on issuing guidance on deepfakes and AI outputs. This spring, the office will release a report “analyzing the impact of AI on copyright” of “digital replicas, or the use of AI to digitally replicate individuals’ appearances, voices, or other aspects of their identities.” Over the summer, another report will focus on “the copyrightability of works incorporating AI-generated material.”

Regarding “the topic of training AI models on copyrighted works as well as any licensing considerations and liability issues,” the Copyright Office did not provide a timeline for releasing guidance, only confirming that their “goal is to finalize the entire report by the end of the fiscal year.”

Once guidance is available, it could sway court opinions, although courts do not necessarily have to apply Copyright Office guidance when weighing cases.

The Copyright Office’s aspirational timeline does seem to be ahead of when at least some courts can be expected to decide on some of the biggest copyright questions for some creators. The class-action lawsuit raised by book authors against OpenAI, for example, is not expected to be resolved until February 2025, and the New York Times’ lawsuit is likely on a similar timeline. However, artists suing Stability AI face a hearing on that AI company’s motion to dismiss this May.


Leave a Reply

Your email address will not be published. Required fields are marked *