It started blowing up on Thursday, March 20, with the publication of “The Unbelievable Scale of AI’s Pirated Book Problem” by Alex Reisner in The Atlantic. Reisner and The Atlantic posted a link to a tool that lets anyone search the site in question, LibGen, to see if their works are included in the collection.
To give you some context, I’ve searched other sites of pirated books in the past and usually find five or six of mine, the older books, along with one or two stories I self-published a long time ago.
This time, there were thirty. It included ALL my books, along with some foreign editions. All my pieces for Amazon Original Stories.
On Thursday and Friday morning, author after author was searching the site, then posting on Bluesky: OMG. I can’t believe it. They have all my books.
This is huge. Let me explain why. But first…
What is LibGen? LibGen (short for Library Genesis) started life as one of those “information should be free!” sites that in principle is hard to argue with, but in practice is a dumpster fire. It started (ostensibly) with researchers in poorer countries complaining that they needed access to scholarly journals to do their work, and persuading readers with access to those paywalled articles to make them available to the public.
Napster (you remember Napster, don’t you?) was once a torrent site.1 It was a way for people to download music without paying for it. It was immensely popular in the 2000s. It was also illegal. It played a huge role in training people to expect to get stuff from the internet for free. When I was recruiting for CIA, Napster activity was a big obstacle to employment (because it’s, you know, stealing). And lest you think we’re talking small potatoes, one guy confessed to at least 25,000 illegal downloads.
Torrent sites are decentralized, peer-to-peer sites.2 That decentralization makes it hard to track down the responsible parties. Servers are hosted typically in countries with lax laws or attitudes that make it easier for the people behind the site to evade prosecution and continue operating.
If you search on LibGen, you’ll see citations saying its contents are “science based” and “in the public domain” and both are lies, pretty excuses to hide behind that have no basis in reality. What’s more, these lies are quickly apparent if you interact with it in the slightest. Which is important because…
META really sh*t the bed when it decided to use LibGen to train its new LLM, Llama3. This was happening in spring 2024, well after META’s competitor OpenAI was being investigated for using a different pirated book site to train ChatGPT (and, as Reisner found out in this latest dump of info from META, apparently OpenAI also used LibGen to train at least one of its LLMs).
If these sites with pirated data were so bad, why did META decide to go ahead and steal, uh, use those files anyway? It’s because as training data goes, books are far better than the next available big dataset, which is web scrapings. Books are structured better; grammatically correct (one assumes); fact-checked; and all in all, more reliable and better suited to training AI programs than what you find on the wild, wild webs. In the cutthroat world of AI development, where META is lagging behind the leaders and billions of dollars are on the line, it seemed just too good to pass up.
It would be really hard to argue that META didn’t know it was doing wrong. According to the Atlantic article, by META’s own assessment it was a medium-to-high legal risk. Does that mean we’ll see action taken, finally, to stop genAI developers from using copyrighted work? I remain pessimistic. Too many politicians are in the pockets of the tech industry. They make the argument that in the global race to dominate AI, we can’t afford to fall behind other countries, like China. In reality, there’s nothing nationalistic here. META in particular, but also OpenAI, have shown themselves to be driven purely by profit and power.
Unless something drastic happens, the best authors can hope for here is some kind of payment from AI developers, something along the lines of the agreements I mentioned in a previous post. No one is going to be sending their kids to college on these payouts. I think I read a little while back about one big book publisher putting in its contracts a standard offer of $2500 to allow use in AI development; I’d be surprised if we see an offer nearly this high from META, OpenAI, and the others. Plus, no offers will be forthcoming while there’s a chance they could win it all in court.
Authors’ only hopes, I believe, are to band together so that the courts can’t deny the sheer scope of the issue. There are several ways you can do that. A first step is seeing what the Authors Guild recommends we do to put META on notice.
A second step is to formally join the class action suits against OpenAI and META. There are several lawsuits representing authors, I believe, including one involving the Authors Guild, but Saveri Law Firm is one you can reach out to; Baker Hostetler is another.
Coincidentally, a book just came out that shows how horrible and self-serving the founder and leaders of Facebook truly are. Careless People was written by a woman who was their head of international policy and had a front-row seat to what happened as the tech company amassed wealth and power and lost touch with reality. I’m waiting for my copy to come in (the injunction against it is making it hard to get copies) but I trust Rob Hart’s recent Substack on it. You see, when I was a social media researcher for the government, several of the people I had on contract also had projects with Facebook. It was clear from what they were able to tell me that something was not right, whether it was the insane levels of secrecy, their paranoia about any government involvement, or the crazy way they tried to run their programs. You’ve heard me beat this drum many times before, but the only way to salvage social media is for there to be oversight and regulation. How much more evidence do we need?
More reading:
My post on how genAI developers are trying to handle the copyright issue.
A primer on generative AI for the non-technical.
1. Now it is a subscription site. I thought it had been litigated into oblivion.
Very helpful write-up. Thank you, Alma!
Thoughtfully written, as usual, and with concrete information about what to do, not just a rant. I appreciate all your research into this. I found my books listed as well as some medical research my daughter published.