The astounding success of large language models is due to the quality of the training data available - books, newspapers, online forums, Wikipedia, but also the code bases from open source software projects. All that has been shared and discussed online goes into these things.
Although the algorithms behind these models are wickedly clever, they fundamentally don’t produce intelligent behaviour without this data, and the behaviour they do produce is conditioned by what they are trained on (hence the tendency of these models to reproduce stereotypes - such as the assumption that surgeons are men - since those stereotypes are represented in the data they are trained on).
Generative AI models produce new outputs, by definition. So once you have one of these models, you can produce infinitely more training data. Does this let you bootstrap your way to superintelligence?
Quite the opposite.
If you train a complex model on its own output, you get a phenomenon which has been termed model collapse - over successive iterations the model focuses more and more on the most common or typical patterns. Measures of performance may even show improvement initially, but incrementally the model loses the ability to represent edge cases and the uncommon. Eventually performance collapses as the model generates similar, and often inappropriate, output in response to every input.
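To get a feel for the mechanism, here is a toy sketch in Python (mine, not the setup used by Shumailov et al, 2024): each generation fits a simple categorical distribution to a finite sample of the previous generation’s output. Any rare category that happens not to be sampled gets probability zero and can never come back, so the diversity the model represents can only ratchet downwards.

```python
# Toy model collapse: each generation is "trained" only on samples from the
# previous generation. Rare categories that miss the sample are lost for good.
import numpy as np

rng = np.random.default_rng(42)

categories = np.arange(20)
# Generation 0 is the "human" data: a long-tailed distribution over 20 topics.
probs = 1.0 / (categories + 1.0)
probs /= probs.sum()

for generation in range(15):
    represented = int((probs > 0).sum())
    print(f"gen {generation:2d}: {represented} of {len(categories)} categories represented")
    # Fit the next generation purely to 100 samples of this generation's output.
    samples = rng.choice(categories, size=100, p=probs)
    counts = np.bincount(samples, minlength=len(categories))
    probs = counts / counts.sum()
```

Real language models are vastly more complicated, but this is the ratchet at the heart of the collapse: rare patterns drop out of the synthetic data, and once gone they are never regenerated.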

Here’s an example from the classic paper (Shumailov et al, 2024). A language model (trained on human generated data) responds to a prompt about churches:
Revival architecture such as St. John’s Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of perpendicular churches [output cut off after a limit]
After nine generations of using the model to produce data which is then added to the training data for subsequent generations, typical output to the same prompt is:
architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.
Repetitive, nonsensical, maniacally focused on one aspect of the training data, and a response that isn’t appropriate given the input prompt.
Guo et al (2024) have a nice illustration of the mechanism behind this. Like Shumailov et al (2024), they trained models on their own output, then visualised how inputs were represented by the model - a map of which possible words sit close together or far apart in the model’s semantic space. In the way that ‘cat’ and ‘dog’ should be close in semantic space, and ‘cat’ and ‘astronaut’ less close, this shows how the universe of concepts is represented by successive generations of models.
Over time the way the model ‘thinks’ about inputs narrows, reducing the ability to make distinctions - collapsing every input so that they all seem more similar. This is why performance degrades. Good models provide different outputs depending on what you ask them. Sometimes a subtle change in input can produce a large change in output. As the models lose sensitivity to this they stop being useful for anything.
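If you wanted to put a rough number on that narrowing, one option (a sketch of the general idea; Guo et al use their own, more careful, diversity metrics) is to embed a fixed probe set of inputs with each generation of the model and track the average pairwise cosine similarity of the representations. As every input starts to look like every other input, that average climbs towards 1. The function and stand-in data below are mine, for illustration only.

```python
# Quantifying representational narrowing as average pairwise cosine similarity.
import numpy as np

def mean_pairwise_cosine_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of row vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Take the upper triangle, excluding the diagonal of self-similarities.
    return float(sims[np.triu_indices(len(embeddings), k=1)].mean())

# Stand-in data: in practice each matrix would hold one model generation's
# representations of the same probe set of inputs.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(50, 128))               # spread-out representations
collapsed = diverse.mean(axis=0) + 0.05 * diverse  # pulled towards one point

print("diverse:  ", round(mean_pairwise_cosine_similarity(diverse), 3))
print("collapsed:", round(mean_pairwise_cosine_similarity(collapsed), 3))
```

The spread-out matrix scores near zero and the squashed one scores much closer to 1; tracked across successive generations of a collapsing model, a steady climb in this number is the quantitative face of the narrowing Guo et al visualise.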
So as well as meaning we can’t bootstrap our way to superintelligence, model collapse also poses a definite risk. To avoid model collapse (or just model decay), we want to keep AI-generated content out of the training data we use. Obviously, if you go to the internet of 2026 for your training data, this is impossible. Some have even compared the public release of ChatGPT to an epoch-defining pollution event - from now on, any new text you see may have been generated by a model, and you can’t know which. Model collapse can occur even if only a proportion of the training data is synthetic; it doesn’t have to be all of it.
AI companies work very hard to source new human-generated data to fuel their ambitions. Researchers have already suggested we need to identify and preserve data which cannot have been generated by AI, in case we need it to provide clean training data for future models.
* * *
Keeping our training data clean may be even harder than just excluding directly model-generated output. Something similar to the early signs of model collapse can be seen in populations of humans who have been influenced by language models.
A result published in August 2025 (Moon et al, 2025) analyses 83,160 student essays submitted as part of admission applications to a “highly selective private university on the East Coast”1. For cohorts of essays from each year from 2019 to 2025 the authors calculated measures of semantic diversity at the word, sentence and document level. Following the public release of ChatGPT there was a notable drop in document-level semantic diversity - roughly, a measure of how much variation there was between essays in the themes discussed. Linked to the same inflection point, there was an increase in word-level semantic diversity. Together this suggests that access to ChatGPT led student essays to use a wider range of individual words, but actually narrowed the kinds of things discussed in them2.
I’ve seen this result replicated for MSc theses at an elite European university - a reduction in the diversity of ideas around the time that large language models became widely accessible. We all know of students, and professionals, copying text from language models (obviously!). Anecdotally, there are even reports of people picking up the phrases and verbal tics of language models.
A small study which asked participants to do creative idea generation (example challenge: “Suppose that a great fog has fallen over the earth and all we can see of people is their feet. What would happen?”) found that people who used ChatGPT to help them with the task produced more, and more detailed, ideas - but these ideas were more similar across individuals, and people felt less ownership of the ideas produced (Anderson et al, 2024).
It’s natural that, as we use and rely on language models in more and more of our work, those language models affect our own output. Humans have a natural tendency to mimic, so it shouldn’t be a surprise if our human output comes to mimic the most human-like tools we’ve created. And since everyone is using the same small family of models, the variation between individuals shrinks, making us sound more and more like each other.
* * *
Humans and models are now an interacting epistemic system. The models learnt from human outputs; now humans are learning from the model outputs. This is affecting the text we see in the wild, but also, plausibly, human habits of thought and expression. Perhaps even more profoundly, the feedback loops may be eroding the institutions which have been established to curate human knowledge.
In October 2025 Wikipedia reported two stark, and related, trends. There are fewer human readers, probably because people are using language models to get information. At the same time, the site is being increasingly hammered by bots which are scraping its content, probably to build new AI models. Some human readers become human editors - the engine which has driven Wikipedia’s success over the last 25 years. Without editors, Wikipedia wouldn’t be able to adapt to incorporate new events and knowledge.
Before people who wrote code copied and pasted from ChatGPT, they copied and pasted from Stack Overflow, a question-and-answer site where people can ask technical questions, provide answers, and comment and vote on answers. It is widely viewed as both a vital source of training data for large language models, and something for which language models can easily substitute.
Here’s a graph of new questions posted on Stack Overflow, annotated by Emil Jacobs. It shows, over the lifetime of the site, the dramatic rise as people start using the site to ask questions, and the dramatic fall as they stop.
There are caveats: Stack Overflow has other issues, and actual traffic to the site hasn’t declined as dramatically as new questions have. But new questions are upstream of everything else: without them there are no new answers, and without new answers users won’t come to Stack Overflow for help with the latest technology.
And, if Stack Overflow isn’t generating knowledge about new technology, what will the next generation of language models train on?
* * *
At first, we might not notice if our epistemic institutions are being eroded by language model use. As with model collapse, performance could initially improve as we find the answers we need, using AI to help us be more productive and more focused. These models are capable of prolific output, but output which under-represents the true diversity of the world. Text is already a compression of experience. Model collapse could further compress the perspectives captured in text, meaning we all start to sound more and more like each other. If, in parallel, our reliance on models means we under-invest in our epistemic institutions, then we will be accelerating towards a dead end - a world where we are unable to adapt to the new, unable to incorporate different perspectives, and over-reliant on an increasingly narrow repertoire of responses.
This is a scare story, but the warning is clear: as we use AI models, let’s keep asking ourselves whether we are generating knowledge or merely generating output. Are we adding information to the world, or just words?
The AI age will mean it is easier than ever to give the superficial appearance of conveying information, and easier than ever to neglect the hard work of saying something valuable about ourselves or the world. It will also be all too easy to think we can get away without the work of sustaining the organisations we’ve built up to sort, store and aggregate new knowledge. We’ll miss ‘em when they’re gone.
This newsletter is free for everyone to read and always will be. To support my writing you can upgrade to a paid subscription (more on why here)
Keep reading for the references, and the greatest obit I’ve heard in a long time.
References & More
Anderson, B. R., Shah, J. H., & Kreminski, M. (2024, June). Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th conference on creativity & cognition (pp. 413-425). https://doi.org/10.1145/3635636.3656204
Guo, Y., Shang, G., Vazirgiannis, M., & Clavel, C. (2024, June). The curious decline of linguistic diversity: Training language models on synthetic text. In Findings of the Association for Computational Linguistics: NAACL 2024 (pp. 3589-3604).
Moon, K., Kushlev, K., Bank, A., & Green, A. (2025, August 20). Impersonal Statements: LLM-Era College Admissions Essays Exhibit Deep Homogenization Despite Lexical Diversity. https://doi.org/10.31234/osf.io/jsz58_v1
Shumailov, I., Shumaylov, Z., Zhao, Y., et al. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y
The Register, 2025-06-15: The launch of ChatGPT polluted the world forever, like the first atomic weapons tests
RIP Renfrew Christie
I’d never heard of Renfrew Christie before.
Born in 1949.
Growing up in apartheid South Africa, he was radicalised after being drafted into the army and discovering that the government was pursuing a nuclear weapons programme.
Picked a PhD topic - the electrification of South Africa - so he could study the infrastructure supporting uranium enrichment for the weapons programme.
Started feeding information to the ANC, was betrayed by an informer, arrested and tortured.
Under torture, made a detailed confession which revealed all the targets and how he had planned bombings that would be maximally disruptive to the weapons programme.
The judge read this confession out in court during the trial.
From jail, where he was placed on death row and spent years within earshot of the sounds of executions from the prison yard, Christie heard news that the ANC’s paramilitary wing had successfully carried out a series of bombings following the exact instructions contained in his confession.
The bombings set the nuclear programme back years and cost the government something like £1 billion.
“While I was in prison, everything I had ever researched was blown up,” he said.
The apartheid regime crumbled and Christie was pardoned, going on to become professor and Dean of Research at the University of the Western Cape.
A hero.
The New York Times: Renfrew Christie Dies at 76; Sabotaged Racist Regime’s Nuclear Program (2026-01-014)
Catch up
Three things we learnt from a new megastudy of climate messaging
Evidence on changing pro-environmental attitudes and behaviour
Paid subscribers only. A thank you, some news and a preview
The half life of trust. Or: You trusting something isn’t enough for me to trust it
Our intuitions about political advertising are poor. Notes on a great interview with David Broockman
…And finally
via This is my Glasgow, a nice example of a naturally occurring optical illusion: “at Bingham's pond in the west of Glasgow. It might just be me, but the reflection of the shopping basket in the last of the ice makes it appear as if it's some sort of levitating cage.”
I’m trying to work out what exactly drives the illusion. Something about the shading on the ice gives a cue to depth? If you know, get in touch!
END
Comments? Feedback? Tasty training data? I am tom@idiolect.org.uk and on Mastodon at @tomstafford@mastodon.online
1. I note that all four of the author team are from Georgetown University, a highly selective private university on the East Coast.
2. A skeptical alternative to the line I’m developing here is that ChatGPT simply helps people write better, more focused essays, and the narrowing of themes is just a consequence of that.















