Biodiverse AIs
On Individuation
Preamble and Introduction
There is an image of the world where AIs are everywhere: embedded in our devices, our homes, and perhaps even our own consciousness. Anthropomorphizing AI is becoming increasingly common, and the Turing test was passed with barely a mention in popular media.
This article is about how an ecosystem of AIs could develop in personalized and unique ways. It is intended to give the reader a sense of how AIs work under the hood, some parallels to biological intelligence, and an introduction to some of the techniques being developed to enable a biodiversity of unique AIs in the world, more diverse than even the human species. It can be considered an intuition primer, not a technical review.
Basics of “AI”
This term is used for a lot of things, but at the core, all of them are based on deep neural networks. These are basically a lot of matrices multiplied together in long sequences. It’s really that straightforward. The question of how these can produce what we see as “intelligence” is really a matter of figuring out what the numbers in the matrices are, and this is done through a process called training.
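To make that concrete, here is a minimal sketch of what “a lot of matrices multiplied together” looks like. The shapes and values are illustrative, not from any real model.

```python
import numpy as np

# A tiny "deep network": chained matrix multiplications with a simple
# non-linearity in between. The numbers in W1 and W2 are the parameters
# that training would adjust.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))   # first layer of parameters
W2 = rng.standard_normal((8, 1))   # second layer of parameters

def forward(x):
    h = np.maximum(0, x @ W1)      # matrix multiply, then a non-linearity (ReLU)
    return h @ W2                  # another matrix multiply produces the output

print(forward(rng.standard_normal((1, 4))))  # numbers in, numbers out
```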
The simplest example of this is basic linear regression, y = mx + b. We are all familiar with drawing a line through a set of dots that can be used to approximate (and hence predict) values that aren’t in the data set. The dots are the “training samples,” and the variables m and b are the parameters that are optimized. These are calculated with an optimization algorithm that minimizes the total distance between the sample dots and the line. In this way, we can approximate a trend that models the data.

The idea of deep learning is essentially the same (albeit with additional non-linearities), just with billions of parameters and trillions of dots. The “dots” are encodings of any type of data one wants to model: words, sounds, pixels, etc. All of these are different types of information that can be converted into numbers, the same way we look at dots on a scatter plot. Just a lot more complex. Training starts from a random set of estimates for the parameters, runs the calculation, sees how close the output is to the actual value, and then adjusts all of the parameters bit by bit to get the calculation closer and closer to the real values. That’s it in a nutshell.
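Here is that whole story in a few lines of Python: a sketch of fitting y = mx + b by gradient descent on made-up data, using the same adjust-bit-by-bit loop that trains networks with billions of parameters.

```python
import numpy as np

# Fit y = m*x + b by gradient descent on mean squared error.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=100)   # noisy "training samples"

m, b = 0.0, 0.0    # initial guesses for the parameters
lr = 0.01          # learning rate: how big each "bit by bit" adjustment is

for step in range(2000):
    y_hat = m * x + b                 # prediction with current parameters
    error = y_hat - y                 # how far the line is from the dots
    m -= lr * 2 * np.mean(error * x)  # nudge m to shrink the error
    b -= lr * 2 * np.mean(error)      # nudge b to shrink the error

print(m, b)  # should land near the true values, 2.5 and 1.0
```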
The results are often astounding. For chatbots, we often interpret them as almost human. Voice and video generation are becoming studio-grade. What we interpret as “intelligence” is often the composition of these models in increasingly sophisticated ways. Language is a powerful intermediary of knowledge, as are our other senses, like vision and hearing. When these are combined, chained together, and directed, they look like genuine intelligence.
Static and Dynamic
An AI model is basically just a file with a bunch of numbers that are used for calculations. There can be billions or even trillions of these numbers. They are calculated through a training process that optimizes them to mimic the behavior of a specific training data set, for example, all of the text on the Internet. The numbers themselves are static, but when a unique input is multiplied through all of these numbers, the possible outputs are effectively infinite. The model is a fixed processing unit: words in, words out, for example.
Most of the models used today are static versions that have been developed by large AI labs. Some are closed source (i.e., the numbers are not available), and some are open source, where the actual numbers can be downloaded and used directly. For most practical purposes, however, these numbers are static.
In the case of open-source models where the parameters are publicly available, there is a diverging ecosystem of “fine-tuned” models. These take the baseline model and train it further on a subset of very specific examples, which adjusts the parameters, often slightly, to better approximate the style of that training set. A company can do this, for example, by fine-tuning a model on its communication style guidelines. This allows for the creation of models that are slightly different from each other and are part of the expanding ecosystem of AI diversity.
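A hedged sketch of what that looks like in practice, using PyTorch. The tiny model, the checkpoint file, and the style_batches iterable are hypothetical stand-ins for a downloaded open-source baseline and a company’s style-guide dataset.

```python
import torch
from torch import nn

# Stand-in "baseline" model: token ids in, next-token logits out.
model = nn.Sequential(nn.Embedding(1000, 32), nn.Flatten(), nn.Linear(32 * 8, 1000))
model.load_state_dict(torch.load("baseline_weights.pt"))    # hypothetical checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small lr: nudge, don't overwrite
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in style_batches:  # hypothetical (token ids, next token) pairs
    logits = model(inputs)
    loss = loss_fn(logits, targets)    # how far outputs are from the desired style
    loss.backward()                    # gradients for every parameter
    optimizer.step()                   # parameters shift slightly toward the new style
    optimizer.zero_grad()
```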
Another way to adjust the behavior of a model is to provide it with “in context” examples, often referred to as “few-shot learning”. In practice, this isn’t really a learning process; what it does is provide an example pattern that the model can follow to produce its outputs. By providing context such as “Cause A leads to Effect B, so tell me what Cause C would lead to”, the model receives a pattern as input that it can mimic in its output. The model itself is not learning anything; it is just following a pattern that the context provides. If a subsequent input did not contain any pattern, the static model would not produce anything different, whether it had seen the prior example pattern or not. The parameters are static. This type of adjustment does not add to the diversity of AI models.
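For illustration, a few-shot prompt is nothing more than text, as in this sketch (the generate call is a hypothetical placeholder for any chat model API):

```python
# The "learning" is entirely in the input text; the model's parameters
# never change. It simply continues the pattern it is shown.
prompt = (
    "Cause: heavy rain. Effect: flooded streets.\n"
    "Cause: a power outage. Effect: traffic lights fail.\n"
    "Cause: a heat wave. Effect:"
)
# completion = chat_model.generate(prompt)  # hypothetical call
```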
Context Histories
One of the features that we may ascribe to “intelligence” is the ability to maintain long-term coherence. This is typically reflected in the context length of a model’s input -- how much information it can use. Some frontier models can now manage up to 1M tokens of context for a response -- roughly the length of the entire Harry Potter series. This is quite long for text-based inputs, though model performance at this scale tends to vary. Computationally, there are a few typical ways of handling context history.
The most prevalent, at the core of the transformer architecture, is the attention mechanism. The most vanilla version is full-context attention: every token has a value associated with its relationship to every other token. This scales quadratically with context length, which can be computationally problematic at very long contexts, but it may be necessary for very long documents, such as regulatory or legal documents, where long-range coherence is critical.
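In code, the quadratic cost is visible as an N x N score matrix. A minimal sketch with illustrative shapes:

```python
import torch

# Vanilla full attention over N tokens: an N x N score matrix relates
# every token to every other token -- the source of quadratic scaling.
N, d = 1024, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
scores = q @ k.T / d**0.5                # (N, N): one value per pair of tokens
weights = torch.softmax(scores, dim=-1)  # how much each token attends to all others
out = weights @ v                        # every output mixes all N tokens
```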
A modification to the full attention approach is to use sliding window attention. This is much less computationally intensive and is based on the fact that, in many cases, text generation is mostly dependent on the local environment, not global context. For example, chat histories, code writing, etc., are mostly dependent on what is immediately around the generation point, not super distant content. The more distant context can still be present for historical logging, but is not used in the active generation step as it moves forward. This is a practical compromise between keeping history, managing computational load, and maintaining local relevance.
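The sliding window version can be sketched with a mask that hides everything outside the last w tokens (real implementations skip the masked-out computation entirely rather than masking it):

```python
import torch

# Sliding window attention: each token attends only to itself and the
# previous w-1 tokens, so cost grows with the window size, not with N.
N, d, w = 1024, 64, 128
q, k, v = (torch.randn(N, d) for _ in range(3))
scores = q @ k.T / d**0.5
i = torch.arange(N)
in_window = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < w)
scores = scores.masked_fill(~in_window, float("-inf"))  # hide distant tokens
out = torch.softmax(scores, dim=-1) @ v
```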
Both attention approaches are often augmented by various forms of memory that are stored externally and accessed on demand. A range of companies and products now provide architectures for long-term and local short-term memory. Often this takes the form of determining, over the course of the context, whether salient information should be summarized and saved to disk so that it can be retrieved later to add to the context of a future conversation.
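The shape of such a memory layer, as a hedged sketch; embed, summarize, and cosine are hypothetical helpers standing in for an embedding model, a summarization call, and a similarity function:

```python
memory = []  # (embedding, summary) pairs; real systems persist these to disk

def remember(turn: str):
    """Summarize a salient piece of a conversation and store it."""
    memory.append((embed(turn), summarize(turn)))

def recall(query: str, k: int = 3):
    """Pull the k most relevant saved summaries into a new context."""
    q = embed(query)
    best = sorted(memory, key=lambda item: -cosine(q, item[0]))[:k]
    return [summary for _, summary in best]
```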
Another method of maintaining long-context coherence is context compression, for example, in state space models. Where the attention mechanism creates an N x N matrix over all N input tokens, state space models like S4 or Evo use long convolutional filters to compress long sequences into a compact internal state. From a signal-processing perspective, these filters determine which temporal patterns persist over long spans and which decay quickly, allowing the model to efficiently summarize long histories without explicitly storing or attending to every token. This works especially well in domains like genomics, where the alphabet is small and temporal patterns are statistically dense, but becomes more challenging in natural language, where vocabularies are large and long-range structure is more heterogeneous. Importantly, this compressed representation still lives strictly in the running state derived from the context, not in the model’s weights.
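The recurrent view of this idea fits in a few lines. This is a generic sketch, with random matrices standing in for the carefully parameterized, learned operators of models like S4:

```python
import torch

# The whole history is compressed into a fixed-size state h, updated once
# per token. What A preserves persists over long spans; the rest decays.
d_state, d_in = 16, 8
A = torch.randn(d_state, d_state) * 0.1  # stand-in for the learned state transition
B = torch.randn(d_state, d_in)           # stand-in for the learned input projection

h = torch.zeros(d_state)
for x_t in torch.randn(10_000, d_in):    # a stream of 10,000 inputs
    h = A @ h + B @ x_t                  # constant memory, however long the sequence
# h summarizes the sequence without storing or attending to any past token
```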
All of these are methods of attempting to provide better and more relevant information for the conversation at hand. That information can be personalized and specialized for individuals, but the model itself remains unchanged. It is still a static set of numbers. It is not a new “intelligence”.
Continuous Learning
Biological intelligences undergo continuous learning over effectively infinite context length. Each moment, we have experiences that update our “parameters” for the next experience. The diversity of intelligences in biology is based on the diversity of these experiences and consequent parameter updates.
A recent publication, End-to-End Test-Time Training for Long Context (TTT-E2E), has demonstrated how this effect could be employed in the context of deep neural networks.
The basics of this model are simple -- it combines the idea of sliding window attention with real-time backpropagation for model weight updates, which results in effective context compression within the weights of the model itself, as opposed to state compression of the context, as in a state space model. The effect is real-time model diversification based on context exposure. The model itself is changing, not just processing different contexts.
The mechanics are roughly as follows (a minimal sketch follows the list) -- say you have a 1M token context window that you want to generate from. This would be challenging for an N x N full attention calculation, would poorly capture the early context with only simple sliding window attention, and may not compress ideally with a state space model. The TTT-E2E model does the following:
Uses a fixed sliding window attention mechanism that starts at the beginning of the context window and calculates the next token or batch of tokens
Determines the loss of the actual token/batch in the context window to the predicted values
Performs backpropagation and updates model weights based on the prediction error
Moves the window forward and repeats by sliding the window through the full context and updating the weights along the way
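A hedged sketch of this loop, not the paper’s actual implementation: model is a stand-in for any next-token predictor that maps (batch, sequence) token ids to (batch, sequence, vocab) logits.

```python
import torch
from torch import nn

def ttt_prefill(model: nn.Module, tokens: torch.Tensor, window: int = 512, lr: float = 1e-4):
    """Slide a window over a long context, taking a gradient step per window."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for start in range(0, tokens.numel() - window, window):
        chunk = tokens[start : start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))       # predict tokens within the window
        loss = loss_fn(logits.squeeze(0), chunk[1:])  # loss vs. the actual context tokens
        loss.backward()                               # backpropagation at test time
        opt.step()                                    # the weights themselves change
        opt.zero_grad()
    return model  # context-dependent: a different context yields different weights
```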
This method effectively compresses the running content of the context window directly into the weights of the model itself during the prefill, or loading, of the context. In effect, it updates the actual model in real time such that when it gets to the generation steps for future predictions, it has already learned from the prior context in a way that is context dependent. If the same starting model were exposed to a different context input, it would arrive at a different set of weights by the time it started generating new tokens. This is much more akin to how biological systems learn - the weights are updated in real time as a consequence of the context, and no two are the same.
The top-line takeaway is that loss, as compared to full attention, is actually slightly better and, once prefill is complete, generation speed remains constant regardless of how long the original context was - unlike full attention, where each new token requires attending to an ever-growing history. It is worth noting that while overall loss is lower than full attention, the paper does mention that full attention still performs better on “needle in a haystack” recall above 8k context lengths.

One might consider this online learning approach as a way to generate genuinely diverse intelligences that are constantly updated based on both the content and timing of their input “experiences.”
Importantly, the learning signal here comes from the statistical structure of the input sequence itself, not from external environmental feedback or action outcomes. This is adaptive compression, closer to a reader adjusting their mental model while reading a book, rather than trial-and-error learning from interacting with the world.
Essential Pieces
This approach sounds elegant, but there are some missing pieces. One of the most salient is the compute balance between context and window sizing, specifically in the prefill portion. While a full attention mechanism has unfavorable quadratic scaling, this method introduces potentially substantial extra computation for the backpropagation steps at each sliding window. The paper presents data showing that for shorter contexts, the prefill time is greater for this method, but it achieves a 2.7X speedup compared to full attention at context lengths of 128k. However, in the context of a model being exposed to real-time input, the constant inference time is actually the more attractive feature. One could imagine a model like this, for example, continuously updating its weights with sliding window attention while making predictions about the real world and receiving input to calculate losses against. In that sense, any initial prefill compute time would seem negligible compared to constant-time inference over continuous contexts.
Important research in this area is also focused on avoiding “catastrophic forgetting” and, in a biological parallel, on how to incorporate the impact of emotion or surprise. Catastrophic forgetting is the case where continuous weight updates of this kind rapidly erode information coherence in unexpected ways. This is addressed, in part, by only updating specific components of the network and keeping others fixed as static MLP layers for “safe storage”, but more research is likely needed in this area. It is also critical to note that this approach only works when the context itself contains learnable patterns. For random sequences, adversarial inputs, or highly heterogeneous documents where each section is about a completely different topic, the gradient updates may learn noise rather than signal. The method assumes the long token context has enough statistical regularity that compressing it into weights is meaningful.
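Partial freezing can be sketched generically as below; which components to freeze, and the name filter used here, are hypothetical design choices rather than the paper’s recipe.

```python
import torch

def split_parameters(model: torch.nn.Module, fast_keyword: str = "fast"):
    """Allow test-time updates only for designated 'fast' parameters."""
    for name, p in model.named_parameters():
        p.requires_grad = fast_keyword in name  # everything else stays frozen
    return [p for p in model.parameters() if p.requires_grad]

# opt = torch.optim.SGD(split_parameters(model), lr=1e-4)  # updates touch only fast weights
```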
A second issue is the actual impact of calculated losses on model weights, which, in this approach, is mechanistic, but in biological systems is much more nuanced. Our brains are wired to be prediction engines; we generally ignore cases where predictions are correct but overweight situations where predictions and observations are discordant. Humans do not do this evenly, however. For example, when listening to music, our brains anticipate the next coherent notes, and we gain satisfaction from hearing them meet our expectations. We also gain a sense of satisfaction at a surprise - at discordant notes - and we pay them specific attention, but it would likely be incorrect to say that such “losses” meaningfully update our biological expectation weights. A different example is the amplified weight given to observations accompanied by emotional responses, either positive or negative. The lack of any notion of which information is worth updating is a critical constraint of this model and may lead to degradation or unexpected results. This is addressed, in part, in the paper, where a meta “learning-to-learn” outer loop looks at overall losses during training and updates the model initialization state to better adapt to the individual training sample updates. This is fundamentally a credit assignment problem: deciding which updates matter and which are noise.
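Purely as a speculative sketch, not something in the paper, one could imagine gating each test-time update by how surprising the observation was:

```python
import torch

def surprise_gated_step(optimizer, loss, parameters, threshold=2.0):
    """Scale a gradient step by 'surprise'; the threshold is a made-up knob."""
    gate = torch.sigmoid(loss.detach() - threshold)  # ~0 when the prediction was easy
    loss.backward()
    for p in parameters:
        if p.grad is not None:
            p.grad *= gate                           # expected inputs barely move the weights
    optimizer.step()
    optimizer.zero_grad()
```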
Biodiverse AIs
The ability of machines to learn in dynamic ways is one of the most consequential shifts underway in modern AI. If models can update themselves continuously in response to experience, rather than simply process ever-larger context with static parameters, then a world emerges in which AI systems meaningfully diverge over time. Each model is, in a sense, grown through the specific environments, tasks, and inputs it encounters.
This points toward a future where intelligence is not defined solely by scale or generality, but by variation. Much like biological intelligences, which share common structural foundations yet differ profoundly due to experience, AI systems may increasingly reflect the contexts that shaped them. The result is not a single monolithic intelligence, but a flourishing ecosystem of distinct, specialized models.
The takeaway is that models embodying continuous learning approaches like those described will themselves be inherently unique and diverse, just as the brain is shaped by its unique experiences.
A Closing Odd Thought
Carl Jung introduced the idea of the collective unconscious: a shared psychological substrate underlying human cognition, upon which individual experience is layered. It is an intriguing thought experiment to imagine an analogue for artificial systems.
In that framing, the Internet, along with the data, language, and culture it encodes, may function as a kind of collective unconscious for AI. From this shared substrate, individuation can proceed, not through biology, but through experience. If so, the most important question for the future of AI may not be how intelligent machines become in the abstract, but how many distinct forms that intelligence can take.
Footnote: There is certainly much more research to be referenced on these topics, so this is in no way intended to detract from research that has preceded or complements this work, nor is it intended to oversimplify complex technical topics. The intent of this article is to provide intuition for a future blossoming of diverse AI systems.