I: “What am I doing wrong?”
Until ~2021, LinkedIn’s job recommendation system would take 2-4 days to learn from user signals like applications and dismisses. Most leading recommendation systems at the time had this limitation. Processing all that engagement and featurizing it is compute intensive, so we just accepted it and focused on improvements in other areas instead, where we were quite successful! We helped hundreds of thousands of people get jobs every year, and the pace at which our algorithms were improving was phenomenal.
That summer, one UX research session changed everything for me: a former professor returning to the workforce after raising her kids. As she scrolled through recommendations, the only jobs she got were as a hostess at chain restaurants. Cheesecake Factory. Applebee’s. The same ones, over and over.
She dismissed them. They kept appearing. She dismissed them again. They came back.
You could hear the frustration and fear in her voice: “What am I doing wrong? Are these the roles out there for me?”
Nothing was wrong with her - but she was dismissing jobs today, and the system would learn from that the day after tomorrow. By then, she’d already given up. We found that a majority of our churned users dropped off within that 2-4 day window. We spent the next six months building a real-time recommendation pipeline that ingested signals immediately instead of batch processing every few days.
The results were the largest quality improvement I’ve ever seen: 20% reduction in job dismissals, statistically significant increase in weekly active users (rare for a non notifications test). But the most telling result was when we measured how much performance degraded based on signal delay:
A 1-hour delay cost us 3.5% in AUC (a measure of model quality)
A 6-hour delay cost us 4.27%
A 24-hour delay cost us 4.5%
[Graph: decay in model performance as a function of the time taken to learn from user signals]
Most of the damage happened in that first hour after a user gave us feedback. In fact, the additional decline after 6 hours is nearly negligible – if you’re trying to speed up model learning, aim for within-session improvements if you want to see real wins.
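A quick back-of-the-envelope on those measured numbers (only the percentages above are real data; the code is just arithmetic) makes the shape of the decay obvious:

```python
# Measured AUC losses from the experiments above (in percent),
# keyed by signal delay in hours.
auc_loss = {1: 3.5, 6: 4.27, 24: 4.5}

total = auc_loss[24]
share_first_hour = auc_loss[1] / total                   # ~0.78
share_beyond_six = (auc_loss[24] - auc_loss[6]) / total  # ~0.05

print(f"First hour accounts for {share_first_hour:.0%} of the 24-hour damage;")
print(f"everything past six hours adds only {share_beyond_six:.0%} more.")
```

Roughly 78% of the total quality loss accrues within the first hour, which is what pushes you toward within-session learning.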
II: Defining the user problem
Here’s what users actually want from continual learning: they want the product to remember what they’ve already said.
When that professor dismissed Cheesecake Factory jobs, she wanted the system to remember she wasn’t interested. When someone tells Claude “be more concise,” they want it to stay concise in future responses. That’s it.
The research community has a more rigorous definition of continual learning – systems that self-improve post-deployment without catastrophic forgetting. From a product builder’s perspective, though, the problem we need to solve isn’t “are we doing something fundamentally new from a systems perspective?” It’s “can we make the feedback loop tight enough that users experience the product as something that learns and adapts to them?”
III: Can infinite context windows solve continual learning?
One frequently proposed solution to continual learning is infinite context windows. Essentially, if we can fit in every session a user has ever had, the model should be able to learn in-context and serve that user’s evolving needs and preferences perfectly.
However, even as context windows have grown tremendously in the past couple of years, we’re seeing that models struggle as more and more of the window gets used up. Drew Breunig’s post on failure modes for long context gives us a clear taxonomy of the issues we’d hit if we used infinite context as a solution for continual learning:
Context Poisoning: A hallucination or error gets embedded in context and is repeatedly referenced. For continual learning, this means incorrect information picked up in the context window at some point might get applied repeatedly.
Context Distraction: As context grows, it can distract models from their original training and instructions. This could lead to a model with an infinite context window ignoring its training and instructions to reason and think through problems, and instead simply over-indexing on recent sessions.
Context Confusion: Superfluous content degrades response quality because the model has to pay attention to any information in the context.
Context Clash: It’s not uncommon for information over multiple turns to start conflicting with each other, causing regressions in performance. Our preferences and relationships change over time, so we would likely have a lot of contradictory information collected over time in our context window. Additionally, most successful applications will be used in a variety of contexts (from work to personal, for instance).
Therefore, a system designed to offer us a context window that is effectively infinite isn’t a panacea – we’re going to need the system to also deal with these issues above.
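To make context clash concrete, here is a toy sketch – the store, the statements, and the retrieval heuristic are all illustrative – of what happens when a naive infinite-context design accumulates contradictory preferences:

```python
# Toy "infinite context": append every user statement forever, then
# retrieve everything that looks relevant. Data here is made up.
context_window = []

def remember(timestamp, statement):
    context_window.append((timestamp, statement))

remember("2024-01", "I prefer long, thorough answers.")
remember("2024-06", "Be more concise from now on.")

# A retriever that returns every match hands the model two directly
# conflicting instructions about response length: a context clash.
style_prefs = [s for _, s in context_window
               if "concise" in s or "thorough" in s]
print(style_prefs)
```

Nothing in the store marks the January preference as superseded, so the model has to arbitrate the conflict on every turn – which is exactly the work a real memory system should do for it.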
IV: The core components of a continually learning system
I think this system is possible and consists of three core pieces:
A Memory System: where we store past experiences in a manner that makes accurate, contextual retrieval in the future possible.
The “Cognitive Core”: the chatbot or agent or whatever other form the LLM takes that users ultimately interact with.
A feedback loop: the mechanism by which we stitch everything together so that we offer a cohesive, continually improving system rather than modular components that are just bolted on.
Together, the cognitive core and memory system can in effect give us an infinite context window. The feedback loop should be designed to solve for the context failure modes mentioned above, plus a new issue this system introduces: recall. I dive into each in detail below.
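A skeletal sketch of how the three pieces might wire together – every class, method, and string here is hypothetical, not a real API:

```python
# Hypothetical skeleton of the three components; all names are illustrative.
class MemorySystem:
    def __init__(self):
        self.raw_logs = []   # never retrieved directly; feeds the feedback loop
        self.entities = {}   # entity -> synthesized context

    def write(self, turn):
        self.raw_logs.append(turn)

    def retrieve(self, query):
        return [ctx for ent, ctx in self.entities.items() if ent in query]

class CognitiveCore:
    """The model the user talks to; queries memory before answering."""
    def __init__(self, memory):
        self.memory = memory

    def respond(self, message):
        past = self.memory.retrieve(message)      # pull past context first
        reply = f"(answer informed by {len(past)} memories)"
        self.memory.write({"user": message, "reply": reply})
        return reply

class FeedbackLoop:
    """Turns user signals into eval cases the system later optimizes against."""
    def __init__(self, memory):
        self.memory = memory
        self.eval_set = []

    def ingest(self, signal):
        if signal["type"] == "dismiss":           # explicit negative signal
            self.eval_set.append(signal)

memory = MemorySystem()
core = CognitiveCore(memory)
loop = FeedbackLoop(memory)

memory.entities["hostess jobs"] = "user has dismissed these repeatedly"
reply = core.respond("why do I keep seeing hostess jobs?")
```

The point of the sketch is the data flow: every turn lands in the raw logs, the core consults memory before relying on its own context, and the feedback loop reads signals rather than sitting bolted on to the side.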
[Diagram: the different components and how they fit together]
Memory System
The obvious ingredient: the infrastructure to store experiences and knowledge. Most people treat memory systems as just a sink to drop information into, assuming agentic retrieval solves all their problems. It doesn’t today, and probably won’t in the future either, since agents will have an explosion of data to comb through.
Your memory system needs something akin to the Dewey Decimal System, which solved search and retrieval at libraries: categorization and indexing that keep retrieval efficient even at scale. The right scheme will likely vary from application to application, and so will the infrastructure requirements.
At a high level, I expect a good memory system to have the following components at the very least:
Raw logs from all interactions: These logs will not be used for retrieval by the cognitive core but will instead play an important role with the feedback system, which I’ll detail next.
A memory graph of some sort: This doesn’t prescribe an actual graph database, a file system, or any other particular technique. The graph’s purpose is to link together entities that we naturally visualize as connected, so a retrieval system or agent can traverse related information and pull in potentially useful context if and when it’s needed.
A synthesized context for each node and relationship in the graph: Each node, once retrieved, needs to offer a detailed understanding of the entity it represents, structured so that retrieval from within the node is efficient and accurate. The connections between nodes need to carry their own context as well.
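A minimal sketch of the graph and synthesized-context components together – the structure below is illustrative, not a prescription for any particular database:

```python
class MemoryGraph:
    """Illustrative memory graph: nodes and edges both carry context."""
    def __init__(self):
        self.nodes = {}   # entity -> synthesized context
        self.edges = {}   # (a, b) -> relationship context

    def add(self, entity, context):
        self.nodes[entity] = context

    def link(self, a, b, relationship):
        self.edges[(a, b)] = relationship

    def neighborhood(self, entity):
        """Entity context plus everything one hop away, for retrieval."""
        found = {entity: self.nodes.get(entity, "")}
        for (a, b), rel in self.edges.items():
            if entity in (a, b):
                other = b if a == entity else a
                found[other] = f"[{rel}] {self.nodes.get(other, '')}"
        return found

g = MemoryGraph()
g.add("user", "former professor re-entering the workforce")
g.add("hostess roles", "dismissed many times; do not recommend")
g.link("user", "hostess roles", "dismissed")
context = g.neighborhood("user")
```

Because the edge carries its own context (“dismissed”), a single hop from the user node is enough to surface both the entity and why it matters, without re-reading raw logs.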
For the curious, the best detailed writing on memory architectures is Samantha Whitmore’s on the memory system behind Dot by new.computer (RIP).
“Cognitive Core”
I’m borrowing Andrej Karpathy’s phrase here. In practice, this should “feel” like the agent or the chatbot that the user typically interacts with. The primary difference is that this core needs to be aware that there may be past context and is trained (using prompt optimization or RL) to query the memory system to bring in the right additional information rather than simply relying on its own initial knowledge and context.
At the same time, the cognitive core needs to be sufficiently knowledgeable to be useful – you need some baseline awareness to ask good questions in the first place. This means the cognitive core can’t be a pure reasoning engine; it must have some awareness of the user and their past interactions.
One very promising direction for what the cognitive core might look like is a Recursive Language Model (RLM). Alex Zhang summarized them as “a new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length, as a REPL environment.” Isaac Miller just added the functionality to DSPy, and there are some great examples to look at, like this and this.
Feedback Loop
This stitches everything together, and it’s probably the hardest part to get right.
First, you need cross-session evals. We’re essentially looking for scenarios where past context retrieved from the memory layer causes any of the aforementioned failure modes in the current session, or where there’s a recall issue (i.e., we’re simply not retrieving useful past context). This is one of the key reasons to store raw conversation logs in the memory system – they’re never directly retrieved, but building these eval datasets over time requires them.
Then, you need an auto-updating evaluation stack. Static evals only tell you how your model performs on issues and scenarios you already know about, but user behavior will likely drift over time, either because expectations evolve or simply because users start relying on your product for more as it improves. This system needs to take explicit user feedback (like a thumbs down) or implicit feedback (a message indicating frustration) and add it to your eval set and to a store of user preferences.
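A sketch of that auto-updating step – the regex here is a crude stand-in for whatever frustration classifier you’d actually use, and all names are illustrative:

```python
import re

# Crude stand-in for an implicit-frustration classifier.
FRUSTRATION = re.compile(r"already told you|stop showing|again\?", re.I)

eval_set = []
preferences = []

def record_turn(user_msg, reply, explicit=None):
    """Route explicit or implicit negative feedback into the eval set."""
    implicit = bool(FRUSTRATION.search(user_msg))
    if explicit == "thumbs_down" or implicit:
        eval_set.append({"input": user_msg, "bad_output": reply})
        preferences.append(f"avoid responses like: {reply!r}")

record_turn("I already told you I'm not interested in hostess jobs",
            "Here are some hostess openings near you")
```

Each captured turn does double duty: it becomes a regression case for the next optimization run and a preference the retrieval layer can surface.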
With these eval sets defined, you need to run periodic optimization loops – prompt optimization can be done fairly frequently (at least weekly) if you have the right infrastructure, while RL might vary based on your architecture (online RL makes sense for some applications but might not for others). You are essentially looking to hill-climb on your eval set.
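The optimization step can be as simple as hill-climbing over prompt variants against that eval set. Everything below – the judge, the variants, the scoring rule – is a toy stand-in for a real optimizer:

```python
# Toy hill-climb: keep whichever prompt variant scores best on the eval set.
eval_set = [
    {"input": "user dismissed hostess jobs", "must_mention": "dismissal"},
]

def judge(prompt, case):
    # Stand-in for a real LLM judge: reward prompts that handle the case.
    return 1.0 if case["must_mention"] in prompt.lower() else 0.0

def score(prompt):
    return sum(judge(prompt, case) for case in eval_set) / len(eval_set)

variants = [
    "You are a helpful job assistant.",
    "Honor the user's past dismissals before recommending anything.",
]
best = max(variants, key=score)
```

In practice the variants would be generated (by an optimizer like those in DSPy) rather than hand-written, but the loop is the same: propose, score against the eval set, keep the winner.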
Lastly, you need guardrails to prevent catastrophic failures. If your model’s behavior changes dramatically post-deployment – and it might, as memory accumulates and starts overwhelming your instructions – it becomes harder to monitor and control. You want safeguards against the worst issues, the ones you absolutely cannot have your product associated with. These guardrails should be used not only in the online path but also in the optimization processes, penalizing failures strongly.
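One way to share a guardrail between the serving path and the optimizer is to make the violation check a single function both paths call, with the optimization penalty large enough to dominate any quality gain. The banned list and weight below are illustrative:

```python
# Illustrative guardrail shared by the serving and optimization paths.
BANNED_PHRASES = ("guaranteed job offer", "pay an application fee")

def violates(reply: str) -> bool:
    return any(p in reply.lower() for p in BANNED_PHRASES)

def serve(reply, fallback="Sorry, I can't help with that."):
    # Online path: never let a violation reach the user.
    return fallback if violates(reply) else reply

def penalized_score(reply, base_score, penalty=100.0):
    # Optimization path: a violation should dominate any quality gain.
    return base_score - penalty if violates(reply) else base_score
```

Wiring the same `violates` check into both paths keeps the online filter and the training objective from drifting apart.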
V: The real challenge: co-optimization
The key hurdle is that these components need to be optimized together rather than independently of each other. We often do the latter anyway – it’s simpler to create dedicated services and assign individuals or teams to optimize each of them. But that is akin to shipping your org chart in your design.
Your memory system’s performance depends upon how the cognitive core will query it. Your cognitive core needs baseline knowledge to reason about what to remember. Your feedback loops need to evaluate the entire integrated system together, not individual components in isolation.
Ultimately, though, you should start modular while investing in multi-session evaluation systems as early as you can. Then, create environments where your agent can experiment with different memory architectures – files, vector stores, knowledge graphs – and learn which combination works best for your specific use case. Collecting these samples over time is tedious and time-consuming, but it’s a tractable path forward.
This approach has broadly worked well for ML/AI for years now, which is why I think it’ll keep working going forward.
—
Thanks to Abhay Kashyap, Abhinav Sharma, Ankur Gupta, Barak Widawsky, Drew Breunig, Jeff Huber, Jeff Picel, Julia Seregina, Marco Sanvido and Mehul Arora for their feedback on drafts of this post, as well as the South Park Commons Research Community for participating in the discussion in December. If you’re figuring out your -1 to 0 journey, you really should apply to join either the Membership or Fellowship – more info here.