Meta’s AI memorised books verbatim – that could cost it billions
Many AI models were trained on the text of books, but a new test found that at least one model has memorised some books, including Harry Potter and the Philosopher's Stone, almost in their entirety – a finding that could complicate ongoing legal battles over copyright infringement
By Jeremy Hsu
10 June 2025
In April, book authors and publishers protested Meta’s use of copyrighted books to train AI
Vuk Valcic/Alamy Live News
Billions of dollars are at stake as courts in the US and UK decide whether tech companies can legally train their artificial intelligence models on copyrighted books. Authors and publishers have filed multiple lawsuits over this issue, and in a new twist, researchers have shown that at least one AI model has not only used popular books in its training data, but also memorised their contents verbatim.
Many of the ongoing disputes revolve around whether AI developers have the legal right to use copyrighted works without first asking permission. Previous research found many of the large language models (LLMs) behind popular AI chatbots and other generative AI programs were trained on the “Books3” dataset, which contains nearly 200,000 copyrighted books, including many pirated ones. The AI developers who trained their models on this material have argued that they did not violate the law because an LLM puts out fresh combinations of words based on its training, transforming rather than replicating the copyrighted work.
But now, researchers have tested multiple models to see how much of that training data they can spit back out verbatim. They found that many models do not retain the exact text of the books in their training data – but one of Meta’s models has memorised almost the entirety of certain books. If judges rule against the company, the researchers estimate that this could make Meta liable for at least $1 billion in damages.
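The kind of memorisation probe described above can be illustrated in miniature. The following is a hedged sketch, not the researchers' actual methodology: it feeds a model a short prefix from a text and checks whether the model's greedy continuation reproduces the passage that actually follows. Here the "model" is a toy next-word lookup that has memorised its training text; the function names (`make_toy_model`, `extraction_rate`) and all parameters are illustrative assumptions, and a real test would query an actual LLM instead.

```python
# Sketch: probing verbatim memorisation with prefix prompts.
# Toy stand-in for an LLM; a real test would call a model API.

def make_toy_model(training_text):
    """Return a next-word predictor that has memorised training_text."""
    words = training_text.split()
    following = {}
    # Record, for each word, the word that first followed it in training.
    for prev, nxt in zip(words, words[1:]):
        following.setdefault(prev, nxt)

    def predict_next(context_words):
        # Greedy prediction based only on the last word of the context.
        return following.get(context_words[-1])

    return predict_next


def extraction_rate(text, model, prompt_len=10, target_len=10, stride=20):
    """Fraction of sampled windows the model reproduces word-for-word."""
    words = text.split()
    hits = trials = 0
    for start in range(0, len(words) - prompt_len - target_len, stride):
        context = list(words[start:start + prompt_len])
        target = words[start + prompt_len:start + prompt_len + target_len]
        generated = []
        for _ in range(target_len):
            nxt = model(context)
            if nxt is None:
                break
            generated.append(nxt)
            context.append(nxt)
        trials += 1
        hits += generated == target  # exact verbatim match only
    return hits / trials if trials else 0.0
```

A model that has fully memorised a text scores 1.0 on that text; one trained on different material scores near 0. Real memorisation studies use much longer token windows and probabilistic extraction criteria, but the idea – prompt with a prefix, compare the continuation against the source – is the same.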
“That means, on the one hand, that AI models are not just ‘plagiarism machines’, as some have alleged, but it also means that they do more than just learn general relationships between words,” says Mark Lemley at Stanford University in California. “And the fact that the answer differs model to model and book to book means that it is very hard to set a clear legal rule that will work across all cases.”
Lemley previously defended Meta in a generative AI copyright case called Kadrey v Meta Platforms, in which authors whose books had been used to train Meta's AI models filed a class-action suit against the tech giant for copyright infringement. The case is still being heard in the Northern District of California.