If you’re a writer on social platforms like X and Threads, you’ve probably seen a lot of anger over the NaNoWriMo organization’s recent policy statement in support of participants using generative artificial intelligence (genAI), meaning large language models (LLMs) like ChatGPT.
The really obnoxious part was in saying, “We want to be clear in our belief that the categorical condemnation of artificial intelligence has classist and ableist undertones, and that questions around the use of AI tie to questions around privilege.” (Or something like that. I’m so mad I can’t see straight.)
This is ridiculous. (Please keep reading, because I’m going to get to the heart of the matter soon, but I need to address this first.) There is nothing more privileged than using genAI to write for you. Anyone can pick up a pencil and write on a scrap of paper. That’s what writing is: a conversation between you and a piece of paper. But to use ChatGPT or another LLM, you need a computer and access to the program, and that access costs money. Further, ChatGPT reportedly cost OpenAI $540 million to develop in 2022, and current models are estimated to require $1 billion to develop and run. That’s why only big companies with deep pockets are battling it out for market dominance.
You’re probably wondering why you should listen to me on this issue. I’m just an old lady writer. Yes, but I’m also a respected tech futurist. I’ve analyzed emerging technologies from the advent of smartphones to the current day, mostly for the Intelligence Community (that’s the three-letter agencies) but also for private industry and start-ups. Since ChatGPT burst onto the scene, nearly all I’ve done is follow LLMs and genAI, putting developments into perspective for clients.
Today, you’re my client.
I think the NaNoWriMo case is not about genAI empowering anyone; it’s about copyright. See, the reason LLMs are so good at generating text is that they’re ginormous (hence the “large” in large language models), trained on data sets orders of magnitude larger than anything used previously. When developers like OpenAI were first building their models, they got this data from providers under a common clause that said researchers could use the data for, you know, research purposes. They scraped Wikipedia; they pulled copyrighted ebooks from a site that pirated them. (More on this below.)
The trouble is that the models are now being monetized, moving from research to application, and there is no way to take the copyrighted data out. You can’t “untrain” a model at this point. Being forced to try would be catastrophic for genAI developers, possibly killing the goose that lays the golden eggs, and they’re going to fight it for all they’re worth.
First, some background on copyright:
When this issue first started heating up in mid-2023, big content creators began suing AI companies. Let’s look at the New York Times. The only alternative to court was for the two companies to come to a licensing agreement that would allow OpenAI to continue to use NYT materials in its LLMs. But the NYT feared competition from chatbots like ChatGPT, which used the NYT’s content without compensation, and no deal was reached.
The NYT case is now in court. If a judge finds in the NYT’s favor, the remedy would be for OpenAI to remove all NYT content from its LLMs, and OpenAI and other chatbot developers would face statutory damages of up to $150,000 for each willful infringement.
The NYT is not alone in suing genAI developers. Comedian Sarah Silverman was one of the first to sue OpenAI (along with my friend, author Paul Tremblay), claiming that the company obtained a copy of her book from an illegal online “shadow library.” Stability AI, creator of the image generator Stable Diffusion, is also being sued by Getty Images for illegally training its model on 12 million images owned by Getty.
GenAI companies then learned they had to start controlling the conversation around this issue. OpenAI got cagier about what materials it had scraped for GPT-3 and GPT-4. A recent Business Insider article noted that the original paper that kicked off the generative AI gold rush provided “granular” information on its training data, on the understanding that models had to be transparent and traceable if anyone was to have any faith in them. Once the lawsuits started, that “granular” level of info was quickly yanked. And when Meta announced its Llama 2 model, the only thing it would admit about the training data was that it contained “a new mix of publicly available online data.”[1]
Which brings us to NaNoWriMo’s policy statement. I haven’t done any digging, but this is all starting to look familiar. I have seen tech companies circle the wagons, deploy lawyers, and fire up propaganda machines before. GenAI developers are signing licensing deals with major content producers, probably to undercut the coming court judgments. In mid-August 2024, it was announced that OpenAI had signed a contract allowing it to use Condé Nast (Vanity Fair, GQ, the New Yorker, etc.) copyrighted materials. Similar deals have been signed with the Associated Press, Axel Springer, the Atlantic, Dotdash Meredith, the Financial Times, Le Monde, News Corp, Time, and Vox Media. As genAI developers buy allies, they buy their silence, too, or their cooperation in propaganda campaigns designed to muddy the conversation and deflect criticism.
Back to writers. Once the big content providers, the Condé Nasts and NYTs of the world, are co-opted, it’s game over for small producers like… novelists. I’d like to think we’ll still win in court because a copyright violation is a copyright violation, but it is hard to be that optimistic. And lest anyone try to claim that I’m just anti-progress and all that malarkey, remember: I’m a tech futurist.
But I’m also a realist.
To my regular subscribers: please forgive me for sending out a newsletter two days in a row. That’s a drawback of using Substack for my newsletters: it’s also kind of a town square, and I felt I couldn’t pass up the opportunity to say something useful.
If you found this helpful, and you are part of a writers’ group that could use someone to talk to them about genAI, drop me a DM.
[1] In the OpenAI case, the company is alleged to have trained GPT-1 on BookCorpus, a data set assembled by AI researchers in 2015 from books scraped from a site that made works available to users at no cost, in violation of copyright. Emily St. Martin, LA Times, July 1, 2023, https://www.latimes.com/entertainment-arts/books/story/2023-07-01/mona-awad-paul-tremblay-sue-openai-claiming-copyright-infringement-chatgpt
There have been other repositories of scanned or scraped books, including Books3, part of a data set called The Pile, which was taken down after a DMCA copyright notice; and yet Meta has said that The Pile was used in its Llama training data set. Alex Perry, Mashable, August 18, 2023, https://mashable.com/article/books3-ai-training-dmca-takedown