I: “What am I doing wrong?”
Until ~2021, LinkedIn’s job recommendation system would take 2-4 days to learn from user signals like applications and dismisses. Most leading recommendation systems at the time had this limitation. Processing all that engagement and featurizing it is compute intensive, so we just accepted it and focused on improvements in other areas instead, where we were quite successful! We helped hundreds of thousands of people get jobs every year, and the pace at which our algorithms were improving was phenomenal.
That summer, one UX research session changed everything for me: a former professor returning to the workforce after raising her kids. As she scrolled through recommendations, the only jobs she got were as a hostess at chain restaurants. Cheesecake Factory. Applebee’s. The same ones, over and over.
She dismissed them. They kept appearing. She dismissed them again. They came back.
You could hear the frustration and fear in her voice: “What am I doing wrong? Are these the roles out there for me?”
Nothing was wrong with her - but she was dismissing jobs today, and the system would learn from that the day after tomorrow. By then, she’d already given up. We found that a majority of our churned users dropped off within that 2-4 day window. We spent the next six months building a real-time recommendation pipeline that ingested signals immediately instead of batch processing every few days.
The results were the largest quality improvement I’ve ever seen: 20% reduction in job dismissals, statistically significant increase in weekly active users (rare for a non notifications test). But the most telling result was when we measured how much performance degraded based on signal delay:
A 1-hour delay cost us 3.5% in AUC (a measure of model quality)
A 6-hour delay cost us 4.27%
A 24-hour delay cost us 4.5%
[Graph: decay in model performance as a function of the time taken to learn from user signals]
Most of the damage happened in that first hour after a user gave us feedback. In fact, the additional decline after 6 hours is nearly negligible – if you’re trying to speed up model learning, aim for within-session improvements if you want to see real wins.
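A quick back-of-the-envelope on those measured numbers (only the percentages above are real data; the code is just arithmetic) makes the shape of the decay obvious:

```python
# Measured AUC losses from the experiments above (in percent),
# keyed by signal delay in hours.
auc_loss = {1: 3.5, 6: 4.27, 24: 4.5}

total = auc_loss[24]
share_first_hour = auc_loss[1] / total                   # ~0.78
share_beyond_six = (auc_loss[24] - auc_loss[6]) / total  # ~0.05

print(f"First hour accounts for {share_first_hour:.0%} of the 24-hour damage;")
print(f"everything past six hours adds only {share_beyond_six:.0%} more.")
```

Roughly 78% of the total quality loss accrues within the first hour, which is what pushes you toward within-session learning.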
II: Defining the user problem
Here’s what users actually want from continual learning: they want the product to remember what they’ve already said.
When that professor dismissed Cheesecake Factory jobs, she wanted the system to remember she wasn’t interested. When someone tells Claude “be more concise,” they want it to stay concise in future responses. That’s it.
The research community has a more rigorous definition of continual learning – systems that self-improve post-deployment without catastrophic forgetting. From a product builder’s perspective, though, the problem we need to solve isn’t “are we doing something fundamentally new from a systems perspective?” It’s “can we make the feedback loop tight enough that users experience the product as something that learns and adapts to them?”
III: Can infinite context windows solve continual learning?
One frequently proposed solution to continual learning is infinite context windows. Essentially, if we can fit in every session a user has ever had, the model should be able to learn in-context and serve that user’s evolving needs and preferences perfectly.
However, even as context windows have grown tremendously in the past couple of years, we’re seeing that models struggle as more and more of the window gets used up. Drew Breunig’s post on failure modes for long context gives us a clear taxonomy of the issues we’d hit if we used infinite context as a solution for continual learning:
Context Poisoning: A hallucination or error gets embedded in context and is repeatedly referenced. For continual learning, this means incorrect information picked up in the context window at some point might get applied repeatedly.
Context Distraction: As context grows, it can distract models from their original training and instructions. This could lead to a model with an infinite context window ignoring its training and instructions to reason and think through problems, and instead simply over-indexing on recent sessions.
Context Confusion: Superfluous content degrades response quality because the model has to pay attention to any information in the context.
Context Clash: It’s not uncommon for information over multiple turns to start conflicting with each other, causing regressions in performance. Our preferences and relationships change over time, so we would likely have a lot of contradictory information collected over time in our context window. Additionally, most successful applications will be used in a variety of contexts (from work to personal, for instance).
Therefore, a system designed to offer us a context window that is effectively infinite isn’t a panacea – we’re going to need the system to also deal with these issues above.
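To make context clash concrete, here is a toy sketch – the store, the statements, and the retrieval heuristic are all illustrative – of what happens when a naive infinite-context design accumulates contradictory preferences:

```python
# Toy "infinite context": append every user statement forever, then
# retrieve everything that looks relevant. Data here is made up.
context_window = []

def remember(timestamp, statement):
    context_window.append((timestamp, statement))

remember("2024-01", "I prefer long, thorough answers.")
remember("2024-06", "Be more concise from now on.")

# A retriever that returns every match hands the model two directly
# conflicting instructions about response length: a context clash.
style_prefs = [s for _, s in context_window
               if "concise" in s or "thorough" in s]
print(style_prefs)
```

Nothing in the store marks the January preference as superseded, so the model has to arbitrate the conflict on every turn – which is exactly the work a real memory system should do for it.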
IV: The core components of a continually learning system
I think this system is possible and consists of three core pieces:
A Memory System: where we store past experiences in a manner that makes accurate, contextual retrieval in the future possible.
The “Cognitive Core”: the chatbot or agent or whatever other form the LLM takes that users ultimately interact with.
A feedback loop: the mechanism by which we stitch everything together so that we offer a cohesive, continually improving system rather than modular components that are just bolted on.
Together, the cognitive core and memory system can in effect give us an infinite context window. The feedback loop should be designed to solve for the context failure modes mentioned above, plus a new issue this system introduces: recall. I dive into each in detail below.
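A skeletal sketch of how the three pieces might wire together – every class, method, and string here is hypothetical, not a real API:

```python
# Hypothetical skeleton of the three components; all names are illustrative.
class MemorySystem:
    def __init__(self):
        self.raw_logs = []   # never retrieved directly; feeds the feedback loop
        self.entities = {}   # entity -> synthesized context

    def write(self, turn):
        self.raw_logs.append(turn)

    def retrieve(self, query):
        return [ctx for ent, ctx in self.entities.items() if ent in query]

class CognitiveCore:
    """The model the user talks to; queries memory before answering."""
    def __init__(self, memory):
        self.memory = memory

    def respond(self, message):
        past = self.memory.retrieve(message)      # pull past context first
        reply = f"(answer informed by {len(past)} memories)"
        self.memory.write({"user": message, "reply": reply})
        return reply

class FeedbackLoop:
    """Turns user signals into eval cases the system later optimizes against."""
    def __init__(self, memory):
        self.memory = memory
        self.eval_set = []

    def ingest(self, signal):
        if signal["type"] == "dismiss":           # explicit negative signal
            self.eval_set.append(signal)

memory = MemorySystem()
core = CognitiveCore(memory)
loop = FeedbackLoop(memory)

memory.entities["hostess jobs"] = "user has dismissed these repeatedly"
reply = core.respond("why do I keep seeing hostess jobs?")
```

The point of the sketch is the data flow: every turn lands in the raw logs, the core consults memory before relying on its own context, and the feedback loop reads signals rather than sitting bolted on to the side.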
[Diagram: the different components and how they fit together]
Memory System
The obvious ingredient: the infrastructure to store experiences and knowledge. Most people treat memory systems as just a sink to drop information into, assuming agentic retrieval solves all their problems. It doesn’t today, and probably won’t in the future either, since agents will have an explosion of data to comb through.
Your memory system needs something akin to the Dewey Decimal System, which solved search and retrieval at libraries: categorization and indexing that keep retrieval efficient even at scale. The right scheme will likely vary from application to application, and so will the infrastructure requirements.
At a high level, I expect a good memory system to have the following components at the very least:
Raw logs from all interactions: These logs will not be used for retrieval by the cognitive core but will instead play an important role with the feedback system, which I’ll detail next.
A memory graph of some sort: This doesn’t prescribe an actual graph database, a file system, or any other particular technique. The graph’s purpose is to link together entities that we naturally visualize as connected, so a retrieval system or agent can traverse related information and pull in potentially useful context if and when it’s needed.
A synthesized context for each node and relationship in the graph: Each node, once retrieved, needs to offer a detailed understanding of the entity it represents, structured so that retrieval from within the node is efficient and accurate. The connections between nodes need to carry their own context as well.
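A minimal sketch of the graph and synthesized-context components together – the structure below is illustrative, not a prescription for any particular database:

```python
class MemoryGraph:
    """Illustrative memory graph: nodes and edges both carry context."""
    def __init__(self):
        self.nodes = {}   # entity -> synthesized context
        self.edges = {}   # (a, b) -> relationship context

    def add(self, entity, context):
        self.nodes[entity] = context

    def link(self, a, b, relationship):
        self.edges[(a, b)] = relationship

    def neighborhood(self, entity):
        """Entity context plus everything one hop away, for retrieval."""
        found = {entity: self.nodes.get(entity, "")}
        for (a, b), rel in self.edges.items():
            if entity in (a, b):
                other = b if a == entity else a
                found[other] = f"[{rel}] {self.nodes.get(other, '')}"
        return found

g = MemoryGraph()
g.add("user", "former professor re-entering the workforce")
g.add("hostess roles", "dismissed many times; do not recommend")
g.link("user", "hostess roles", "dismissed")
context = g.neighborhood("user")
```

Because the edge carries its own context (“dismissed”), a single hop from the user node is enough to surface both the entity and why it matters, without re-reading raw logs.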
For the curious, the best detailed writing on memory architectures is Samantha Whitmore’s on the memory system behind Dot by new.computer (RIP).
“Cognitive Core”
I’m borrowing Andrej Karpathy’s phrase here. In practice, this should “feel” like the agent or the chatbot that the user typically interacts with. The primary difference is that this core needs to be aware that there may be past context and is trained (using prompt optimization or RL) to query the memory system to bring in the right additional information rather than simply relying on its own initial knowledge and context.
At the same time, the cognitive core needs to be sufficiently knowledgeable to be useful – you need some baseline awareness to ask good questions in the first place. This means the cognitive core can’t be a pure reasoning engine; it must have some awareness of the user and their past interactions.
One very promising direction for what the cognitive core might look like is a Recursive Language Model (RLM). Alex Zhang summarized them as “a new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length, as a REPL environment.” Isaac Miller just added the functionality to DSPy, and there are some great examples to look at, like this and this.
Feedback Loop
This stitches everything together, and it’s probably the hardest part to get right.
First, you need cross-session evals. We’re essentially looking for scenarios where past context retrieved from the memory layer causes any of the aforementioned failure modes in the current session, or where there’s a recall issue (i.e., we’re simply not retrieving useful past context). This is one of the key reasons to store raw conversation logs in the memory system – they’re never directly retrieved, but building these eval datasets over time requires them.
Then, you need an auto-updating evaluation stack. Static evals only tell you how your model performs on issues and scenarios you already know about, but user behavior will likely drift over time, either because expectations evolve or simply because users start relying on your product for more as it improves. This system needs to take explicit user feedback (like a thumbs down) or implicit feedback (a message indicating frustration) and add it to your eval set and to a store of user preferences.
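A sketch of that auto-updating step – the regex here is a crude stand-in for whatever frustration classifier you’d actually use, and all names are illustrative:

```python
import re

# Crude stand-in for an implicit-frustration classifier.
FRUSTRATION = re.compile(r"already told you|stop showing|again\?", re.I)

eval_set = []
preferences = []

def record_turn(user_msg, reply, explicit=None):
    """Route explicit or implicit negative feedback into the eval set."""
    implicit = bool(FRUSTRATION.search(user_msg))
    if explicit == "thumbs_down" or implicit:
        eval_set.append({"input": user_msg, "bad_output": reply})
        preferences.append(f"avoid responses like: {reply!r}")

record_turn("I already told you I'm not interested in hostess jobs",
            "Here are some hostess openings near you")
```

Each captured turn does double duty: it becomes a regression case for the next optimization run and a preference the retrieval layer can surface.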
With these eval sets defined, you need to run periodic optimization loops – prompt optimization can be done fairly frequently (at least weekly) if you have the right infrastructure, while RL might vary based on your architecture (online RL makes sense for some applications but might not for others). You are essentially looking to hill-climb on your eval set.
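The optimization step can be as simple as hill-climbing over prompt variants against that eval set. Everything below – the judge, the variants, the scoring rule – is a toy stand-in for a real optimizer:

```python
# Toy hill-climb: keep whichever prompt variant scores best on the eval set.
eval_set = [
    {"input": "user dismissed hostess jobs", "must_mention": "dismissal"},
]

def judge(prompt, case):
    # Stand-in for a real LLM judge: reward prompts that handle the case.
    return 1.0 if case["must_mention"] in prompt.lower() else 0.0

def score(prompt):
    return sum(judge(prompt, case) for case in eval_set) / len(eval_set)

variants = [
    "You are a helpful job assistant.",
    "Honor the user's past dismissals before recommending anything.",
]
best = max(variants, key=score)
```

In practice the variants would be generated (by an optimizer like those in DSPy) rather than hand-written, but the loop is the same: propose, score against the eval set, keep the winner.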
Lastly, you need guardrails to prevent catastrophic failures. If your model’s behavior changes dramatically post-deployment – and it might, as memory accumulates and starts overwhelming your instructions – it becomes harder to monitor and control. You want safeguards against the worst issues, the ones you absolutely cannot have your product associated with. These guardrails should be used not only in the online path but also in the optimization processes, penalizing failures strongly.
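One way to share a guardrail between the serving path and the optimizer is to make the violation check a single function both paths call, with the optimization penalty large enough to dominate any quality gain. The banned list and weight below are illustrative:

```python
# Illustrative guardrail shared by the serving and optimization paths.
BANNED_PHRASES = ("guaranteed job offer", "pay an application fee")

def violates(reply: str) -> bool:
    return any(p in reply.lower() for p in BANNED_PHRASES)

def serve(reply, fallback="Sorry, I can't help with that."):
    # Online path: never let a violation reach the user.
    return fallback if violates(reply) else reply

def penalized_score(reply, base_score, penalty=100.0):
    # Optimization path: a violation should dominate any quality gain.
    return base_score - penalty if violates(reply) else base_score
```

Wiring the same `violates` check into both paths keeps the online filter and the training objective from drifting apart.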
V: The real challenge: co-optimization
The key hurdle is that these components need to be optimized together rather than independently of each other. We often do the latter anyway – it’s simpler to create dedicated services and assign individuals or teams to optimize each of them. But that is akin to shipping your org chart in your design.
Your memory system’s performance depends upon how the cognitive core will query it. Your cognitive core needs baseline knowledge to reason about what to remember. Your feedback loops need to evaluate the entire integrated system together, not individual components in isolation.
Ultimately, though, you should start modular while investing in multi-session evaluation systems as early as you can. Then, create environments where your agent can experiment with different memory architectures – files, vector stores, knowledge graphs – and learn which combination works best for your specific use case. Collecting these samples over time is tedious and time-consuming, but it’s a tractable path forward.
This approach has broadly worked well for ML/AI for years now, which is why I think it’ll keep working going forward.
—
Thanks to Abhay Kashyap, Abhinav Sharma, Ankur Gupta, Barak Widawsky, Drew Breunig, Jeff Huber, Jeff Picel, Julia Seregina, Marco Sanvido and Mehul Arora for their feedback on drafts of this post, as well as the South Park Commons Research Community for participating in the discussion in December. If you’re figuring out your -1 to 0 journey, you really should apply to join either the Membership or Fellowship – more info here.