Anthropic Shredded Millions of Physical Books to Train its AI


Book It

Jun 29, 1:30 PM EDT by Frank Landymore


Such waste.



Image by Getty / Futurism

Today in schnozz-smashing on-the-nose metaphors for the AI industry’s rapacious destruction of the arts: exactly how Anthropic gathered the data it needed to train its Claude AI model. 

As Ars Technica reports, the Google-backed startup didn’t just crib from millions of copyrighted books, a practice that’s ethically and legally fraught on its own. No: it cut the book pages out from their bindings, scanned them to make digital files, then threw away all those millions of pages of the original texts. To say that the AI “devoured” these books wouldn’t merely be colorful language.

This practice was revealed in a copyright ruling on Monday, which turned out to be a major win for Anthropic and the data-voracious tech industry at large. The judge presiding over the case, US district judge William Alsup, found that Anthropic can train its large language models on books that it bought legally, even without authors’ explicit permission.

It’s a decision that owes, in part, to Anthropic’s method of destructive book scanning — which it’s far from the first company to use, according to Ars, but is notable for its massive scale. In sum, it takes advantage of a legal concept known as the first-sale doctrine, which allows a buyer to do what they want with their purchase without the copyright holder intervening. This rule is what allows the secondhand market to exist — otherwise a book’s publisher, for example, might demand a cut or prevent their books from being resold.

Leave it to AI companies, though, to use this in bad faith. According to the court filing, Anthropic hired Tom Turvey, the former head of partnerships for Google’s book-scanning project, in February 2024 to obtain “all the books in the world” without running into “legal/practice/business slog,” as Anthropic CEO Dario Amodei described it. Turvey came up with a workaround. By buying physical books, Anthropic would be protected by the first-sale doctrine and would no longer have to obtain a license. Stripping the pages out allowed for cheaper and easier scanning. Since Anthropic only used the scanned books internally and tossed out the physical copies afterwards, the judge found this process to be akin to “conserv[ing] space,” Ars noted, meaning it was transformative. Ergo, it’s legally OK.

It’s a specious workaround and flagrantly hypocritical, of course. When Anthropic first got up and running, the startup went the even more unscrupulous route of downloading millions of pirated books to feed its AI. Meta did this with millions of pirated books, too, for which it is currently getting sued by a group of authors.

It’s also lazy and careless. As Ars notes, plenty of archivists have pioneered various approaches for scanning books en masse without having to destroy or alter the originals, including the Internet Archive and Google’s own Google Books (which not too long ago was also the subject of its own major copyright battle).

But anything to save a few bucks, and to get that all-too-precious training data. Indeed, the AI industry is running out of high-quality sources of food to feed its AI (not least of all because it’s short-sightedly spent this whole time crapping where it eats), so screwing over some authors and sending some books to the shredder is, for Big Tech, a small price to pay.

