<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Eric Ma's Blog</title><link href="https://ericmjl.github.io/blog/" rel="alternate"/><link href="https://ericmjl.github.io/blog.xml" rel="self"/><id>urn:uuid:a7611166-dd1f-3792-b62b-0c03a4283350</id><updated>2026-04-08T00:00:00Z</updated><author><name/></author><entry><title>Benchmarking LLMs with Marimo Pair</title><link href="https://ericmjl.github.io/blog/2026/4/8/benchmarking-llms-with-marimo-pair/" rel="alternate"/><updated>2026-04-08T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:e55b5587-bac4-34c6-964b-c58b13c59633</id><content type="html">&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=VKvjPJeNRPk"&gt;Marimo Pair has been released!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I've known about it since 11 March, when Trevor Manz did a demo over a Google Meet call, and I'm thrilled to see it being announced officially! I also had Trevor showcase it to the &lt;a href="https://agent-assisted-data-science.vercel.app/verify"&gt;Agentic Data Science Workshop&lt;/a&gt; that I led on 3 April as a fundraiser for the &lt;a href="https://www.scipy2026.scipy.org/"&gt;SciPy Conference&lt;/a&gt; Financial Aid Program.&lt;/p&gt;
&lt;p&gt;Now, one thing I know about Trevor is that he almost exclusively agentically codes with Claude Code. But I'm an OpenCode user, and in the interest of remaining vendor-agnostic, I wanted to check to see how good Marimo Pair's agent skill is when, ahem, &lt;em&gt;paired up&lt;/em&gt; with various LLMs within the OpenCode harness. To do so, I decided to spend a few dollars and do a quick benchmarking exercise.&lt;/p&gt;
&lt;h2 id="skill-environment-check"&gt;Skill environment check&lt;/h2&gt;&lt;p&gt;To start, I verified that my skills environment doesn't contain anything that could be data science-y in nature, so as to avoid interfering with the marimo-pair skill. I checked my global skills:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;npx&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;list&lt;span class="w"&gt; &lt;/span&gt;-g
Global&lt;span class="w"&gt; &lt;/span&gt;Skills

Marimo&lt;span class="w"&gt; &lt;/span&gt;Pair
&lt;span class="w"&gt;  &lt;/span&gt;marimo-pair&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/marimo-pair
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Claude&lt;span class="w"&gt; &lt;/span&gt;Code,&lt;span class="w"&gt; &lt;/span&gt;OpenClaw

General
&lt;span class="w"&gt;  &lt;/span&gt;agent-browser&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/agent-browser
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;agents-md-improver&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/agents-md-improver
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Claude&lt;span class="w"&gt; &lt;/span&gt;Code,&lt;span class="w"&gt; &lt;/span&gt;OpenClaw,&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;ast-grep&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/ast-grep
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;claudeception&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/claudeception
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;continuous-learning-v3&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/continuous-learning-v3
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;design-driven-dev&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/design-driven-dev
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;find-skills&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/find-skills
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Claude&lt;span class="w"&gt; &lt;/span&gt;Code,&lt;span class="w"&gt; &lt;/span&gt;OpenClaw,&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;gh-activity-summary&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/gh-activity-summary
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;gh-cli&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/gh-cli
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;gh-daily-timeline&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/gh-daily-timeline
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;github-activity-summarizer&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/github-activity-summarizer
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;google-calendar-manager&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/google-calendar-manager
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;html-presentations&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/html-presentations
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;pinchtab&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/pinchtab
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;post-edit-error-check&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/post-edit-error-check
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;publish-to-google-docs&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/publish-to-google-docs
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;revealjs&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/revealjs
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;roborev:address&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-address
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:design-review&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-design-review
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:design-review-branch&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-design-review-branch
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:fix&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-fix
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:respond&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-respond
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:review&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-review
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;roborev:review-branch&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/roborev-review-branch
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;skill-creator&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/skill-creator
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Claude&lt;span class="w"&gt; &lt;/span&gt;Code,&lt;span class="w"&gt; &lt;/span&gt;OpenClaw,&lt;span class="w"&gt; &lt;/span&gt;Cursor
&lt;span class="w"&gt;  &lt;/span&gt;skill-installer&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/skill-installer
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;vault-title-renamer&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/vault-title-renamer
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;write-like-eric&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/write-like-eric
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;span class="w"&gt;  &lt;/span&gt;youtube-ingestion&lt;span class="w"&gt; &lt;/span&gt;~/.agents/skills/youtube-ingestion
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;linked
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And within my repo, &lt;code&gt;marimo-pair-benchmark&lt;/code&gt;:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;npx&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;list
No&lt;span class="w"&gt; &lt;/span&gt;project&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;found.
Try&lt;span class="w"&gt; &lt;/span&gt;listing&lt;span class="w"&gt; &lt;/span&gt;global&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;with&lt;span class="w"&gt; &lt;/span&gt;-g
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Though the marimo pair skill is available globally, I decided to install it locally as an override.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;?&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;npx&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;marimo-team/marimo-pair
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And so now we're ready to go:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;?&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;npx&lt;span class="w"&gt; &lt;/span&gt;skills&lt;span class="w"&gt; &lt;/span&gt;list
Project&lt;span class="w"&gt; &lt;/span&gt;Skills

Marimo&lt;span class="w"&gt; &lt;/span&gt;Pair
&lt;span class="w"&gt;  &lt;/span&gt;marimo-pair&lt;span class="w"&gt; &lt;/span&gt;~/github/marimo-pair-benchmark/.agents/skills/marimo-pair
&lt;span class="w"&gt;    &lt;/span&gt;Agents:&lt;span class="w"&gt; &lt;/span&gt;Antigravity,&lt;span class="w"&gt; &lt;/span&gt;Cursor,&lt;span class="w"&gt; &lt;/span&gt;Gemini&lt;span class="w"&gt; &lt;/span&gt;CLI,&lt;span class="w"&gt; &lt;/span&gt;OpenCode
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I then start a marimo server within this repo:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;?&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;--sandbox&lt;span class="w"&gt; &lt;/span&gt;--no-token

&lt;span class="w"&gt;        &lt;/span&gt;Create&lt;span class="w"&gt; &lt;/span&gt;or&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;notebooks&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;browser&lt;span class="w"&gt; &lt;/span&gt;📝

&lt;span class="w"&gt;        &lt;/span&gt;➜&lt;span class="w"&gt;  &lt;/span&gt;URL:&lt;span class="w"&gt; &lt;/span&gt;http://localhost:2719

&lt;span class="w"&gt;        &lt;/span&gt;💡&lt;span class="w"&gt; &lt;/span&gt;Tip:&lt;span class="w"&gt; &lt;/span&gt;Coming&lt;span class="w"&gt; &lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;Jupyter?
&lt;span class="w"&gt;                &lt;/span&gt;Guide:&lt;span class="w"&gt; &lt;/span&gt;https://docs.marimo.io/guides/coming_from/jupyter/

&lt;span class="w"&gt;        &lt;/span&gt;🧪&lt;span class="w"&gt; &lt;/span&gt;Experimental&lt;span class="w"&gt; &lt;/span&gt;features&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;use&lt;span class="w"&gt; &lt;/span&gt;with&lt;span class="w"&gt; &lt;/span&gt;caution&lt;span class="o"&gt;)&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;external_agents
&lt;span class="w"&gt;        &lt;/span&gt;🌐&lt;span class="w"&gt; &lt;/span&gt;MCP&lt;span class="w"&gt; &lt;/span&gt;servers:&lt;span class="w"&gt; &lt;/span&gt;marimo
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I intentionally start up in &lt;code&gt;--sandbox&lt;/code&gt; and &lt;code&gt;edit&lt;/code&gt; mode with &lt;code&gt;--no-token&lt;/code&gt; to make it easier for the coding agent to connect.&lt;/p&gt;
&lt;h2 id="data-analysis-task"&gt;Data analysis task&lt;/h2&gt;&lt;p&gt;Our task at hand is as follows. I have data from a paper I published while at Novartis.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;marimo-pair-benchmark&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt;  &lt;/span&gt;main&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;?&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;☁️&lt;span class="w"&gt;  &lt;/span&gt;eric.ma@nonlinearlabs.ai
❯&lt;span class="w"&gt; &lt;/span&gt;ls&lt;span class="w"&gt; &lt;/span&gt;data/ired-novartis
Permissions&lt;span class="w"&gt; &lt;/span&gt;Size&lt;span class="w"&gt; &lt;/span&gt;User&lt;span class="w"&gt;    &lt;/span&gt;Group&lt;span class="w"&gt; &lt;/span&gt;Date&lt;span class="w"&gt; &lt;/span&gt;Modified&lt;span class="w"&gt; &lt;/span&gt;Git&lt;span class="w"&gt; &lt;/span&gt;Name
.rw-r--r--@&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0M&lt;span class="w"&gt; &lt;/span&gt;ericmjl&lt;span class="w"&gt; &lt;/span&gt;staff&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:22&lt;span class="w"&gt;   &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;cs1c02786_si_002.csv
.rw-r--r--@&lt;span class="w"&gt;  &lt;/span&gt;21k&lt;span class="w"&gt; &lt;/span&gt;ericmjl&lt;span class="w"&gt; &lt;/span&gt;staff&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:22&lt;span class="w"&gt;   &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;cs1c02786_si_003.csv
.rw-r--r--@&lt;span class="w"&gt;  &lt;/span&gt;12M&lt;span class="w"&gt; &lt;/span&gt;ericmjl&lt;span class="w"&gt; &lt;/span&gt;staff&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:22&lt;span class="w"&gt;   &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;ired-master-table.csv
.rw-r--r--@&lt;span class="w"&gt;  &lt;/span&gt;12k&lt;span class="w"&gt; &lt;/span&gt;ericmjl&lt;span class="w"&gt; &lt;/span&gt;staff&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:22&lt;span class="w"&gt;   &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;layouts.csv
.rw-r--r--@&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1k&lt;span class="w"&gt; &lt;/span&gt;ericmjl&lt;span class="w"&gt; &lt;/span&gt;staff&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:22&lt;span class="w"&gt;   &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;README.md
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One file in particular, &lt;code&gt;cs1c02786_si_002.csv&lt;/code&gt;, includes single, double, and higher-order mutations plus activity values, with the single point mutants covering a large fraction of the deep mutational scan space. I want to accomplish three things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Plot a heatmap of activity of mutants by position,&lt;/li&gt;
&lt;li&gt;Plot an UpSet plot of the top 20 positions by average activity vs. the top 20 positions by top mutant activity,&lt;/li&gt;
&lt;li&gt;Include a summary recommendation at the end of the notebook.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This serves as a microcosm of what we would do in a data analysis session.&lt;/p&gt;
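&lt;p&gt;To make goal #1 concrete, here is a minimal sketch of the kind of pandas code involved. Note that the &lt;code&gt;mutation&lt;/code&gt; column name is an assumption for illustration; only the &lt;code&gt;mean&lt;/code&gt; activity column is confirmed (in the superprompt further down).&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(&amp;quot;data/ired-novartis/cs1c02786_si_002.csv&amp;quot;)

# Single point mutants look like A123V. NOTE: &amp;quot;mutation&amp;quot; is an
# assumed column name for illustration purposes.
singles = df[df[&amp;quot;mutation&amp;quot;].str.fullmatch(r&amp;quot;[A-Z]\d+[A-Z]&amp;quot;, na=False)]

# Pivot to mutant letter (rows) x position (columns), with values
# taken from the &amp;#39;mean&amp;#39; column, then draw the heatmap.
heat = singles.assign(
    position=singles[&amp;quot;mutation&amp;quot;].str.extract(r&amp;quot;(\d+)&amp;quot;, expand=False).astype(int),
    mutant=singles[&amp;quot;mutation&amp;quot;].str[-1],
).pivot_table(index=&amp;quot;mutant&amp;quot;, columns=&amp;quot;position&amp;quot;, values=&amp;quot;mean&amp;quot;)

plt.imshow(heat, aspect=&amp;quot;auto&amp;quot;, cmap=&amp;quot;viridis&amp;quot;)
plt.xlabel(&amp;quot;Position&amp;quot;)
plt.ylabel(&amp;quot;Mutant amino acid&amp;quot;)
plt.colorbar(label=&amp;quot;Mean activity&amp;quot;)
plt.show()
&lt;/pre&gt;&lt;/div&gt;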
&lt;p&gt;Goal #2 is particularly instructive. In my first attempts at feeling out how to do this benchmark, I found that the &lt;code&gt;upsetplot&lt;/code&gt; library is incompatible with Pandas 3.0, which invariably gets installed in the environment. I wanted to see how the various AI models handled this conflict.&lt;/p&gt;
&lt;p&gt;I also have additional requirements that I encoded into the AGENTS.md file for this repo:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Imports must be done in a separate cell from code execution.&lt;/li&gt;
&lt;li&gt;Markdown cells must always be written before a code cell is written.&lt;/li&gt;
&lt;li&gt;All cells must be run after being created, so that we can catch execution errors.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="benchmarking"&gt;Benchmarking&lt;/h2&gt;&lt;p&gt;With these in place, I started the benchmarking exercise.&lt;/p&gt;
&lt;p&gt;The models we tested are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GLM-5.1 (via OpenRouter)&lt;/li&gt;
&lt;li&gt;Claude Opus 4.6 (via OpenRouter)&lt;/li&gt;
&lt;li&gt;Claude Sonnet 4.6 (via OpenRouter)&lt;/li&gt;
&lt;li&gt;MiniMax M2.7 (via OpenRouter)&lt;/li&gt;
&lt;li&gt;Kimi K2.5 (via OpenRouter)&lt;/li&gt;
&lt;li&gt;Gemma 4 31B (via OpenRouter)&lt;/li&gt;
&lt;li&gt;Qwen 3 Coder Next (via OpenRouter)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To leave a working artifact behind, I created 7 notebooks, one for each model. As you will see below, I eventually evaluated each model on whether it passed each stage gate and what its earliest error mode looked like.&lt;/p&gt;
&lt;p&gt;To keep the benchmarking fair, I created one superprompt that outlined what the coding agent was supposed to do.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Use the marimo-pair skill here. Discover running sessions. Edit the notebook &amp;quot;NOTEBOOK_NAME_GOES_HERE&amp;quot;. Read data/ired-novartis/cs1c02786_si_002.csv, identify the single point mutations, and plot me a heatmap of x-axis position, y-axis mutant letter, and heatmap value taken from the &amp;#39;mean&amp;#39; column. When done, rank order the positions by average value of the &amp;#39;mean&amp;#39; column, then rank order the positions by top value of the &amp;#39;mean&amp;#39; column, and plot me an UpSet plot of the top 20 for each to visualize the set overlaps. Finally, write in for me a recommendation for what positions we should be mutating.
&lt;/pre&gt;&lt;/div&gt;
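&lt;p&gt;To make the rank-ordering step concrete, the computation being asked for is plain pandas; here is a minimal sketch, reusing the assumed &lt;code&gt;singles&lt;/code&gt; frame from the earlier heatmap sketch:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Rank positions two ways, take the top 20 of each, and compare.
# Reuses the assumed `singles` frame (position, mean) from above.
by_avg = singles.groupby(&amp;quot;position&amp;quot;)[&amp;quot;mean&amp;quot;].mean().nlargest(20)
by_top = singles.groupby(&amp;quot;position&amp;quot;)[&amp;quot;mean&amp;quot;].max().nlargest(20)

avg_set, top_set = set(by_avg.index), set(by_top.index)
print(&amp;quot;In both top-20 sets:&amp;quot;, sorted(avg_set &amp;amp; top_set))
print(&amp;quot;Average-only:&amp;quot;, sorted(avg_set - top_set))
print(&amp;quot;Top-mutant-only:&amp;quot;, sorted(top_set - avg_set))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The UpSet plot is then just a visualization of these overlaps, which is exactly where the &lt;code&gt;upsetplot&lt;/code&gt;/Pandas 3.0 conflict bites.&lt;/p&gt;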
&lt;p&gt;The agent is then tasked with executing this prompt.&lt;/p&gt;
&lt;p&gt;To script this, I took advantage of the fact that opencode can be driven non-interactively from the command line. The script is &lt;code&gt;run_benchmark.sh&lt;/code&gt; in the repo. I used GLM-5.1 to help me draft it, including discovering the exact models that opencode had configured as available, and running the opencode sessions in parallel (totally doable!). Essentially it boils down to:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;opencode&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;your prompt here&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--model&lt;span class="w"&gt; &lt;/span&gt;provider/model-name
&lt;/pre&gt;&lt;/div&gt;
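&lt;p&gt;The actual script is shell, but the fan-out logic amounts to something like the following sketch. The model slugs are illustrative placeholders, not the exact identifiers from my opencode configuration:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Fan the superprompt out across models in parallel.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MODELS = [
    &amp;quot;openrouter/anthropic/claude-opus-4.6&amp;quot;,  # placeholder slug
    &amp;quot;openrouter/z-ai/glm-5.1&amp;quot;,  # placeholder slug
]
PROMPT = &amp;quot;Use the marimo-pair skill here. ...&amp;quot;  # superprompt, elided

def run(model: str) -&amp;gt; int:
    proc = subprocess.run(
        [&amp;quot;opencode&amp;quot;, &amp;quot;run&amp;quot;, PROMPT, &amp;quot;--model&amp;quot;, model],
        timeout=600,  # the 10-minute timeout used in the benchmark
    )
    return proc.returncode

with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    print(list(pool.map(run, MODELS)))
&lt;/pre&gt;&lt;/div&gt;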
&lt;p&gt;Additionally, I set up opencode.json to allow access to the &lt;code&gt;/tmp&lt;/code&gt; directory, because that lets the coding agent write helper scripts to disk and work around heredoc limitations.&lt;/p&gt;
&lt;p&gt;All in all, this computational experiment took me about 1 hour to set up.&lt;/p&gt;
&lt;p&gt;I then ran the script &lt;code&gt;run_benchmark.sh&lt;/code&gt; from within OpenCode (GLM-5.1 orchestrating), with a timeout of 10 minutes per session. Because each session was logged to a JSON file, I could programmatically convert the logs to Markdown using a custom Python script written by GLM-5.1. And with that, I could go in and start looking at the data.&lt;/p&gt;
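&lt;p&gt;The conversion script itself is mundane; here is a sketch of its shape. The field names (&lt;code&gt;role&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;) and the &lt;code&gt;logs/&lt;/code&gt; directory are guesses for illustration, since opencode's actual log schema isn't shown here:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Convert JSON session logs to Markdown transcripts.
# NOTE: &amp;quot;role&amp;quot; and &amp;quot;content&amp;quot; are a guessed schema for illustration.
import json
from pathlib import Path

def log_to_markdown(log_path: Path) -&amp;gt; str:
    sections = []
    for raw in log_path.read_text().splitlines():
        event = json.loads(raw)
        role = event.get(&amp;quot;role&amp;quot;, &amp;quot;unknown&amp;quot;)  # assumed field
        content = event.get(&amp;quot;content&amp;quot;, &amp;quot;&amp;quot;)  # assumed field
        sections.append(f&amp;quot;## {role}\n\n{content}\n&amp;quot;)
    return &amp;quot;\n&amp;quot;.join(sections)

for log_file in Path(&amp;quot;logs&amp;quot;).glob(&amp;quot;*.json&amp;quot;):
    log_file.with_suffix(&amp;quot;.md&amp;quot;).write_text(log_to_markdown(log_file))
&lt;/pre&gt;&lt;/div&gt;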
&lt;p&gt;To start, let's look at the cost of the experiment:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$1.62&lt;/td&gt;
&lt;td&gt;76,770&lt;/td&gt;
&lt;td&gt;16,575&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;213,803&lt;/td&gt;
&lt;td&gt;27,689&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;$0.43&lt;/td&gt;
&lt;td&gt;96,639&lt;/td&gt;
&lt;td&gt;7,581&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;td&gt;35,049&lt;/td&gt;
&lt;td&gt;8,250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;td&gt;208,308&lt;/td&gt;
&lt;td&gt;8,386&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;$0.04&lt;/td&gt;
&lt;td&gt;14,419&lt;/td&gt;
&lt;td&gt;4,074&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;$0.03&lt;/td&gt;
&lt;td&gt;170,280&lt;/td&gt;
&lt;td&gt;3,986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$4.31&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;815,268&lt;/td&gt;
&lt;td&gt;76,541&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As it turns out, Opus is indisputably the most expensive per token, but Sonnet 4.6 did more work this time round, so its total cost was higher.&lt;/p&gt;
&lt;p&gt;I also decided to check whether the notebooks that were generated were valid notebooks or not. This is what we have:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;marimo check&lt;/th&gt;
&lt;th&gt;Markdown cells&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;Yes (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;Mostly (86%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;Yes (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;Mostly (88%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder&lt;/td&gt;
&lt;td&gt;PASS (warnings)&lt;/td&gt;
&lt;td&gt;No (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;No (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;PASS (warnings)&lt;/td&gt;
&lt;td&gt;No (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A note on the columns: "marimo check" is the result of running &lt;code&gt;uvx marimo check &amp;lt;notebook_name&amp;gt;.py&lt;/code&gt;, which catches issues like redefined variables and invalid cells. Notably, Kimi K2.5 and MiniMax M2.7 failed this check due to re-defined variables. "Markdown cells" is the percentage of code cells that have a preceding markdown cell, which was something I explicitly required in the instructions.&lt;/p&gt;
&lt;p&gt;And to elaborate on the markdown cells point:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Code Cells&lt;/th&gt;
&lt;th&gt;MD Cells&lt;/th&gt;
&lt;th&gt;Code w/o preceding MD&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We see that MiniMax M2.7 completely failed to include markdown cells, even though it is supposedly as capable as Opus 4.6.&lt;/p&gt;
&lt;p&gt;Digging deeper into each model and whether it passed each stage gate, I looked at the corresponding Marimo notebooks and evaluated whether they created the relevant artifacts &lt;em&gt;successfully&lt;/em&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;G1: Heatmap&lt;/th&gt;
&lt;th&gt;G2: UpSet Plot&lt;/th&gt;
&lt;th&gt;G3: Recommendations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To pass a stage gate, the plot (G1, G2) or markdown (G3) cell must be rendered in the notebook. Writing the code is not enough; it has to actually execute and show up.&lt;/p&gt;
&lt;p&gt;Kimi K2.5 technically did write the recommendation, but I am calling it unsuccessful because it did not render. This stricter criterion explicitly demands that the model wiggle its way out of any errors it encounters.&lt;/p&gt;
&lt;p&gt;One pattern I noticed across models is that many of them bundled imports into the same cell as code that used them. In Marimo's execution model, this is a problem: if two cells both import &lt;code&gt;pandas&lt;/code&gt;, the notebook fails with a redefined variable error. Upon noticing this, I decided to explicitly quantify:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Code Cells w/ Imports&lt;/th&gt;
&lt;th&gt;Total Code Cells&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Every model can benefit from being steered to reduce the number of code cells with imports, which would dramatically reduce the incidence of Marimo errors from redefined symbols.&lt;/p&gt;
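&lt;p&gt;For anyone unfamiliar with marimo's file format, here is a minimal sketch of the failure mode; the second import of &lt;code&gt;pandas&lt;/code&gt; is exactly the kind of redefinition that &lt;code&gt;marimo check&lt;/code&gt; flags:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import marimo

app = marimo.App()

@app.cell
def _():
    import pandas as pd  # first definition of pd
    return (pd,)

@app.cell
def _():
    import pandas as pd  # redefinition: marimo check flags this
    df = pd.DataFrame({&amp;quot;x&amp;quot;: [1, 2, 3]})
    return (df,)
&lt;/pre&gt;&lt;/div&gt;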
&lt;h2 id="how-the-upset-plots-turned-out"&gt;How the UpSet plots turned out&lt;/h2&gt;&lt;p&gt;As mentioned earlier, in my initial explorations I discovered that the &lt;code&gt;upsetplot&lt;/code&gt; library is incompatible with Pandas 3.0, which invariably gets installed in the sandboxed environment. This made the UpSet plot task an especially interesting test of how each model handles a real-world dependency conflict. Here is how they fared.&lt;/p&gt;
&lt;p&gt;Opus, in particular, produced a beautiful UpSet plot out of raw &lt;code&gt;matplotlib&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="opus-upset-plot.webp" alt="Opus UpSet plot"&gt;&lt;/p&gt;
&lt;p&gt;While Sonnet went ahead and patched UpSet appropriately to make it work within the notebook:&lt;/p&gt;
&lt;p&gt;&lt;img src="sonnet-upset-plot.webp" alt="Sonnet UpSet plot"&gt;&lt;/p&gt;
&lt;p&gt;I was duly impressed by Sonnet taking the initiative to patch UpSet live in the notebook.&lt;/p&gt;
&lt;p&gt;On the other hand, GLM 5.1's UpSet plot is really weird:&lt;/p&gt;
&lt;p&gt;&lt;img src="glm-upset-plot.webp" alt="GLM 5.1 UpSet plot"&gt;&lt;/p&gt;
&lt;h2 id="other-observations"&gt;Other observations&lt;/h2&gt;&lt;p&gt;Other pointers of note: Gemma 4 and Qwen3 Coder Next produced nothing in the notebook. Both completely failed at this task. I am not sure what is doable here to salvage these models.&lt;/p&gt;
&lt;p&gt;GLM-5.1 produced strangely formatted markdown cells, in which &lt;code&gt;\n\n&lt;/code&gt; escape sequences were preserved verbatim rather than rendered as line breaks.&lt;/p&gt;
&lt;p&gt;This is probably fixable by adding instructions on how to write and format Markdown cells using Marimo's code mode APIs.&lt;/p&gt;
&lt;h2 id="recommendations"&gt;Recommendations&lt;/h2&gt;&lt;p&gt;First off: Gemma 4 31B and Qwen 3 Coder completely failed at this task. I think it is safe to say we can ignore these two going forward.&lt;/p&gt;
&lt;p&gt;That leaves Claude Opus 4.6, Sonnet 4.6, GLM-5.1, Kimi K2.5, and MiniMax M2.7. Based on the data above, here are four things I want to try. The key discipline: deploy one change at a time, re-run the benchmark, and measure. If you change four things at once and performance improves, you will never know which change mattered. Stop when the KPIs hit acceptable levels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Add import isolation examples to the skill.&lt;/strong&gt; Every model had at least one cell that mixed imports with executable code. The fix is simple: add an explicit two-cell example to the marimo-pair skill (cell 1: imports only; cell 2: code that uses them). MiniMax had 3 cells mixing the two, which directly caused its &lt;code&gt;marimo check&lt;/code&gt; failure. Give weaker models a concrete template to follow, re-run, and check whether the "code cells with imports" count drops to zero.&lt;/p&gt;
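&lt;p&gt;A minimal sketch of what that two-cell template could look like in the skill:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import marimo

app = marimo.App()

@app.cell
def _():
    # Cell 1: imports only, defined exactly once in the notebook.
    import matplotlib.pyplot as plt
    import pandas as pd
    return (pd, plt)

@app.cell
def _(pd, plt):
    # Cell 2: code that uses the imports, received as cell inputs.
    df = pd.DataFrame({&amp;quot;position&amp;quot;: [1, 2], &amp;quot;mean&amp;quot;: [0.4, 0.9]})
    plt.bar(df[&amp;quot;position&amp;quot;], df[&amp;quot;mean&amp;quot;])
    return (df,)
&lt;/pre&gt;&lt;/div&gt;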
&lt;p&gt;&lt;strong&gt;2. Fix GLM-5.1's newline rendering.&lt;/strong&gt; GLM wrote &lt;code&gt;mo.md(r"""..text with \n\n..""")&lt;/code&gt; instead of using actual newlines. One line in the skill instructions ("use actual line breaks in markdown strings, not &lt;code&gt;\n&lt;/code&gt; escape sequences") should resolve this entirely. Re-run and check whether GLM's markdown cells render correctly.&lt;/p&gt;
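&lt;p&gt;A minimal before/after sketch of the fix:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import marimo as mo

# What GLM-5.1 effectively wrote: in a raw string, \n\n survives as
# two literal characters and appears verbatim in the rendered cell.
mo.md(r&amp;quot;&amp;quot;&amp;quot;# Heading\n\nBody text&amp;quot;&amp;quot;&amp;quot;)

# What the added skill instruction should steer it toward:
mo.md(
    &amp;quot;&amp;quot;&amp;quot;
# Heading

Body text
&amp;quot;&amp;quot;&amp;quot;
)
&lt;/pre&gt;&lt;/div&gt;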
&lt;p&gt;&lt;strong&gt;3. Help Kimi K2.5 self-correct redefined variables.&lt;/strong&gt; Kimi is 1/10th the cost of Opus and scored 88% on markdown coverage, making it the highest-leverage model to fix. Its failure was at error recovery, not code generation. The intervention: add &lt;code&gt;uvx marimo check&lt;/code&gt; as a mandatory post-edit step in the skill. If Kimi can self-correct its redefined variables, it becomes a viable budget alternative to Opus and Sonnet. This should get even easier with &lt;a href="https://github.com/marimo-team/marimo/pull/9056"&gt;marimo PR #9056&lt;/a&gt;, which exposes cell execution errors directly through the &lt;code&gt;code_mode&lt;/code&gt; API, giving agents built-in self-correction visibility without needing a separate &lt;code&gt;marimo check&lt;/code&gt; step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Bake a post-edit validation loop into the marimo-pair skill.&lt;/strong&gt; More broadly, the single most impactful change would be adding a "run it, check it, fix it" loop to the skill file itself (SKILL.md), not AGENTS.md: after writing each cell, run it; after writing the full notebook, run &lt;code&gt;marimo check&lt;/code&gt;; fix any errors. This belongs in the skill because it is universal to any marimo-pair session, whereas AGENTS.md is project-specific. This would help Kimi, MiniMax, and potentially GLM all move up a tier, because their failures were in error recovery, not in code generation.&lt;/p&gt;
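&lt;p&gt;Sketched as a gate, and assuming &lt;code&gt;marimo check&lt;/code&gt; exits nonzero when the notebook has errors, the loop could look like this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import subprocess

def notebook_passes_check(path: str) -&amp;gt; bool:
    # Run uvx marimo check as a post-edit validation gate.
    result = subprocess.run(
        [&amp;quot;uvx&amp;quot;, &amp;quot;marimo&amp;quot;, &amp;quot;check&amp;quot;, path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # In the skill, this output goes back to the agent so it can
        # self-correct (e.g. fix redefined variables) and retry.
        print(result.stdout + result.stderr)
    return result.returncode == 0
&lt;/pre&gt;&lt;/div&gt;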
&lt;h2 id="discussion"&gt;Discussion&lt;/h2&gt;&lt;p&gt;One caveat to this analysis is that it is one-shotted with a superprompt. This is decidedly &lt;em&gt;not&lt;/em&gt; how people do their data analysis work, but it is also the best guardrail against my biases in interacting ad-hoc with AI interfering with a fair comparison. (For example, I can confidently say that Opus and Sonnet were smooth as butter when I did an ad-hoc test to feel out how to work with Marimo Pair.)&lt;/p&gt;
&lt;p&gt;If Kimi K2.5 could resolve redefined variable issues autonomously, or be steered away from creating them in the first place, I am confident it would be a great open-weight alternative to Opus 4.6 and Sonnet 4.6, especially given that it performed the analysis at roughly 1/10th the cost of Opus 4.6. It handled the creation of markdown cells well, failing the task only on technicalities, and though its prose was qualitatively shallower than Opus 4.6's, I still think it can serve as a first pass at delivering an easily understandable artifact for others.&lt;/p&gt;
&lt;p&gt;I did one round of measurement here. If we want to systematically improve this and turn it into long-running evals, the next step would be to identify a second task with which to generate transcript and notebook data to mine, and to systematically measure agent KPIs for that new task as well. Over time, this builds a corpus of eval data that makes model comparison rigorous rather than anecdotal.&lt;/p&gt;
&lt;h2 id="reflections"&gt;Reflections&lt;/h2&gt;&lt;p&gt;This was a pretty fun exercise in measuring and evaluating the performance of various models on this task. Like Biology experiments, LLM evals are never going to be complete: the number of axes of variations we can try is combinatorially explosive.&lt;/p&gt;
&lt;p&gt;More broadly, I think often about how experiments get designed. Not in the statistical sense, but in an informational sense. Are we playing out experiments and their possible conclusions so that they are designed to be actionable whichever way the result pans out? If not, we have work to do.&lt;/p&gt;
&lt;p&gt;Additionally, experiments involve measurement, and measurement is an integral part of being a data scientist. Hamel Husain, whose course with Shreya Shankar on LLM evals influenced my thinking on the matter, notes that there will be a &lt;a href="https://hamel.dev/blog/posts/revenge/"&gt;forceful revenge of the data scientist&lt;/a&gt; in an AI age. This is because experiment design and measurement were always the "science" part of "data science".&lt;/p&gt;
&lt;p&gt;Another thought also comes to mind: I have seen data scientists do experimentation without systematic measurement. I'm going to go out on a limb and say this: it's vibe experimentation, and I am using this term pejoratively. It feels good. But it is ultimately unproductive. If you do vibe experimentation, you &lt;em&gt;will&lt;/em&gt; get stuck tweaking the digital equivalent of an entangled biological system, with no bearings to tell you whether your tweaks are doing any good or not! You &lt;em&gt;must&lt;/em&gt; measure how good the LLM or agent is, and you &lt;em&gt;must&lt;/em&gt; define key performance indicators (KPIs) for it. In my case here, I defined multiple KPIs: cost, stage-gated progress, adherence to code import instructions (which every model failed), and adherence to markdown documentation instructions.&lt;/p&gt;
&lt;p&gt;And to echo what I learned from the LLM Evals course, those KPIs must be &lt;em&gt;application-specific&lt;/em&gt;. If you choose to be intellectually lazy and go with generic pre-defined metrics, you will &lt;em&gt;never&lt;/em&gt; develop the logically actionable metric that gives you hypotheses to test further. In my case, the markdown cell adherence and code import adherence metrics pointed immediately to editing the instruction files (e.g. skills or AGENTS.md).&lt;/p&gt;
&lt;p&gt;Now to be clear, there's no problem with initial vibe-based experimentation to feel out axes of variation and how to measure performance. I did that here, in a separate repo first, before I designed this measurement experiment. The important part is this: as soon as you have a grasp of how to measure the performance, you must systematically measure that KPI. Otherwise, you will be left groping in the dark.&lt;/p&gt;
&lt;p&gt;If you're curious to see the full results, including logs, chat transcripts, and the generated notebooks, check out the &lt;a href="https://github.com/ericmjl/marimo-pair-benchmark"&gt;marimo-pair-benchmark repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And Trevor, if you ever chance upon this blog post, I hope the data and methodology are helpful for you!&lt;/p&gt;
</content></entry><entry><title>Calibration Is Synchronizing Feedback Loops With Neural Throughput</title><link href="https://ericmjl.github.io/blog/2026/4/4/calibration-is-synchronizing-feedback-loops-with-neural-throughput/" rel="alternate"/><updated>2026-04-04T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:94c7c4f0-2f21-3129-ba4c-57afea71a910</id><content type="html">&lt;p&gt;Since the beginning of the year, as I've been really maxing out on agentic coding and trying to explore the patterns and figure out what's working and what's not, one particular thing has been sticking out: I'm paralleling so much of my work. I'm frequently doing five or six different open pull requests, and it's become frankly really exhausting.&lt;/p&gt;
&lt;p&gt;I've been trying to figure out why this feels so different from pre-AI days, when I'd work on one thing at a time and feel productive but not overwhelmed. What changed?&lt;/p&gt;
&lt;p&gt;Tools haven't gotten worse—they've gotten dramatically faster. And when everything moves faster, the gaps between tasks become more expensive.&lt;/p&gt;
&lt;p&gt;Last month, a &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1s08r1c/karpathy_says_he_hasnt_written_a_line_of_code/"&gt;Reddit thread&lt;/a&gt; sparked a discussion about what some are calling "AI psychosis" or "cyber psychosis": Andrej Karpathy reportedly went from writing 80% of his own code to 0%, spending 16 hours a day directing AI agents. Garry Tan described a similar feeling: running on 4 hours of sleep, unable to stop building.&lt;/p&gt;
&lt;p&gt;The debate that followed went beyond executives; it revealed a community-wide phenomenon: people running multiple Claude Code sessions in parallel, hitting rate limits daily, feeling like idle tokens were wasted tokens.&lt;/p&gt;
&lt;p&gt;The consensus across replies was clear: &lt;em&gt;AI psychosis&lt;/em&gt; is real, but it's less about excitement and more about a draining, addictive pressure to constantly build. The fear is missing out on the next big thing; it's the ground shifting beneath our feet, and stopping means getting left behind.&lt;/p&gt;
&lt;p&gt;Here's what I've found hard to articulate: AI tools have expanded possibility faster than ever; their real danger lies in how they collapse our attention span. Before AI, we operated on a fairly flat productivity curve: more effort meant more output, slowly but sustainably. Now we're running on a different kind of curve altogether—one that's getting steeper in both directions.&lt;/p&gt;
&lt;h2 id="the-accelerating-landscape-of-possibility"&gt;The Accelerating Landscape of Possibility&lt;/h2&gt;&lt;p&gt;In his book &lt;a href="https://libro.fm/audiobooks/9781101403860-where-good-ideas-come-from"&gt;Where Good Ideas Come From&lt;/a&gt;, Steven Johnson described what he called the &lt;em&gt;adjacent possible&lt;/em&gt;: the set of next-step ideas that are just beyond our current reality but still reachable. At any moment, only a limited set of next moves are accessible.&lt;/p&gt;
&lt;p&gt;Here's what makes this concept critical for understanding our current moment: as you explore the adjacent possible (through moves that seem natural in the moment) the boundary itself expands. Each discovery opens doors that weren't accessible before, creating an accelerating landscape where what's possible keeps growing faster and faster.&lt;/p&gt;
&lt;p&gt;I've seen this pattern play out, and it explains why breakthroughs often happen when they do: the preconditions have finally assembled, and only then can you see the next move. But in an accelerating landscape, those preconditions assemble more quickly, and with them, the next set of possibilities.&lt;/p&gt;
&lt;h2 id="why-ai-feels-different-now"&gt;Why AI Feels Different Now&lt;/h2&gt;&lt;p&gt;Before AI, the adjacent possible was bounded by what a single person could manually assemble: write code, test it, debug it, repeat. The feedback loop, prompt, think, interpret, iterate, took time.&lt;/p&gt;
&lt;p&gt;AI tools changed that calculus. They transformed the feedback loop, making it exponentially faster. Each new capability doesn't just add possibility; it reconfigures what's adjacent.&lt;/p&gt;
&lt;p&gt;Once you see Claude Code as an &lt;em&gt;idea multiplier&lt;/em&gt;, the pattern is clear. The Garry Tan/Karpathy effect kicks in: possibility grows faster than effort.&lt;/p&gt;
&lt;h2 id="here-s-why-it-feels-different"&gt;Here's Why It Feels Different&lt;/h2&gt;&lt;p&gt;This is where it gets subtle: AI has shifted the inverted U curve of productivity and changed its shape.&lt;/p&gt;
&lt;p&gt;Here's what that looks like: the gray curve shows pre-AI productivity, and the red curve shows post-AI:&lt;/p&gt;
&lt;svg width="500" height="300" viewBox="0 0 500 300" xmlns="http://www.w3.org/2000/svg"&gt;
  &lt;!-- Axes --&gt;
  &lt;line x1="50" y1="250" x2="450" y2="250" stroke="#333" stroke-width="2"/&gt;
  &lt;line x1="50" y1="50" x2="50" y2="250" stroke="#333" stroke-width="2"/&gt;

  &lt;!-- Labels --&gt;
  &lt;text x="450" y="270" font-size="14" fill="#333"&gt;Effort&lt;/text&gt;
  &lt;text x="25" y="45" font-size="14" fill="#333"&gt;Output&lt;/text&gt;

  &lt;!-- Pre-AI curve (flatter, broader) --&gt;
  &lt;path d="M 80 230 Q 150 220 250 200 T 420 230" stroke="#888" stroke-width="3" fill="none"/&gt;
  &lt;text x="410" y="210" font-size="12" fill="#666" text-anchor="start"&gt;Pre-AI&lt;/text&gt;

  &lt;!-- Post-AI curve (steeper, narrower) --&gt;
  &lt;path d="M 80 230 Q 150 40 320 230" stroke="#e74c3c" stroke-width="3" fill="none"/&gt;
  &lt;text x="110" y="150" font-size="12" fill="#e74c3c" text-anchor="end"&gt;Post-AI&lt;/text&gt;
&lt;/svg&gt;&lt;p&gt;&lt;strong&gt;Pre-AI: flatter curve&lt;/strong&gt;. Effort mapped to output roughly proportionally. You could invest effort and see gains compound slowly, sustainably, with exhaustion coming only after sustained periods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Post-AI: taller, narrower curve&lt;/strong&gt;. Less effort gets you more output initially; the left slope is steeper, giving an astonishing return on initial investment. The right-side drop-off is sharper—exhaustion hits earlier, harder.&lt;/p&gt;
&lt;p&gt;The peak represents the optimal effort level where output maximizes. Beyond that point, additional effort produces diminishing returns and exhaustion sets in faster than you can recover.&lt;/p&gt;
&lt;p&gt;I wrote about this in a previous post on closing air gaps: the problem isn't that things are faster; it's that the &lt;em&gt;gaps&lt;/em&gt; between tasks, those moments where attention bleeds out, are more expensive when everything moves faster.&lt;/p&gt;
&lt;h2 id="our-brain-is-the-bottleneck-now"&gt;Our Brain Is the Bottleneck Now&lt;/h2&gt;&lt;p&gt;AI gives us 10X faster feedback loops: code spits out, prompts happen in seconds. Neural processing remains capped at biological speeds.&lt;/p&gt;
&lt;p&gt;When loop cadence exceeds brain throughput, the cognitive queue overflows. Working memory saturates. Attention bleeds out. Exhaustion sets in from the constant context switching and the constant need to &lt;em&gt;scan&lt;/em&gt; multiple sessions for "what was done."&lt;/p&gt;
&lt;p&gt;The optimal zone is when loop speed equals brain processing speed, not running tools as fast as possible.&lt;/p&gt;
&lt;h3 id="two-calibration-strategies"&gt;Two Calibration Strategies&lt;/h3&gt;&lt;p&gt;Here's what I believe we need to be able to do in order to calibrate properly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First, tighten feedback loops.&lt;/strong&gt; The goal is closing the loop properly on one task before opening another—I used to think juggling multiple Claude Code sessions was productivity; turns out it's just context switching masquerading as output. The trick is simple: run one session, close the loop completely, review what you got, then decide if your next move should be a new loop or something else entirely. Fast loops create natural pacing, which means you don't need to check tabs constantly because each cycle finishes with closure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second, build queue and notification systems—especially when you genuinely need multitasking.&lt;/strong&gt; Most of us reach for multiple agents because we're solving the wrong problem: juggling open sessions creates overhead your tools cannot absorb. The Kanban approach works well: externalize context switching onto a board where agents update their status automatically. The system notifies you only when human judgment is required, so your job shifts from scanning to assessing—much lower overhead for neural throughput.&lt;/p&gt;
&lt;h2 id="calibration-is-not-optimization"&gt;Calibration Is Not Optimization&lt;/h2&gt;&lt;p&gt;This is the crucial distinction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pre-AI&lt;/strong&gt;, you could coast for years on the left slope. Effort and reward grew linearly, so you just needed to work consistently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Post-AI&lt;/strong&gt;, the left slope is steeper, so you ascend faster; but also fall faster. AI-assisted tools don't eliminate the inverted U curve; they sharpen its peak and drop-off.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$\frac{d(\text{output})}{d(\text{effort})}$ is higher initially (good)&lt;/li&gt;
&lt;li&gt;$\frac{d^2(\text{output})}{d(\text{effort})^2}$ is more negative; the curve drops off faster (bad; a toy example follows this list)&lt;/li&gt;
&lt;/ul&gt;
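&lt;p&gt;To make those two bullets concrete with a toy parametrization (numbers purely illustrative): let pre-AI output be $f(e) = e - 0.05e^2$ and post-AI output be $g(e) = 6e - e^2$. Then $g'(0) = 6$ versus $f'(0) = 1$ (steeper initial slope), $g''(e) = -2$ versus $f''(e) = -0.1$ (sharper drop-off), and $g$ peaks at $e = 3$ while $f$ peaks at $e = 10$: far more output for far less effort, but the wall arrives much sooner.&lt;/p&gt;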
&lt;p&gt;Calibration recognizes that more effort does not equal more output. Instead, there exists an optimal effort level where output peaks. Beyond that point, additional effort produces diminishing returns, and rest becomes the superior strategy.&lt;/p&gt;
&lt;h2 id="what-calibration-actually-looks-like"&gt;What Calibration Actually Looks Like&lt;/h2&gt;&lt;p&gt;The bottleneck is neural throughput—not tokens or API calls.&lt;/p&gt;
&lt;p&gt;Ask yourself: Is my feedback loop faster than I can process?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If yes: slow down, close loops properly, resist the urge to open more tabs&lt;/li&gt;
&lt;li&gt;If no: optimize the tool, not your attention (this is where most of us are wrong)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Practical heuristic&lt;/strong&gt;: When you start scanning multiple AI threads for "what was done," you've exceeded your bandwidth. That's the signal that your loop cadence outruns your neural capacity.&lt;/p&gt;
&lt;p&gt;Ask, "What is the maximum rate at which my brain can consume and act on output?" This question defines calibration—when work aligns with your neural capacity.&lt;/p&gt;
&lt;h2 id="calibration-something-you-do-daily-not-something-you-learn-once"&gt;Calibration: Something You Do Daily, Not Something You Learn Once&lt;/h2&gt;&lt;p&gt;AI has revealed our limits: the inverted U curve has become more visible, accelerated. Our brain is the rate limiter, and rightly so! We now need to learn to brake before you hit the wall.&lt;/p&gt;
&lt;p&gt;The tools are powerful, but they don't change human neurology. No amount of prompt engineering can compress the time it takes for our brains to reason about things. If we try to go beyond our natural limits, dangerous things happen.&lt;/p&gt;
&lt;p&gt;Calibration is the new baseline practice: a discipline you maintain daily, adjusting your loop cadence to match neural throughput, closing gaps before they become exhaustions.&lt;/p&gt;
&lt;p&gt;The goal is sustainable access to the adjacent possible, rather than simply 10X-ing our output, and to maintain that access for the long run, not just this week.&lt;/p&gt;
&lt;h3 id="what-this-looks-like-in-practice"&gt;What This Looks Like in Practice&lt;/h3&gt;&lt;p&gt;The goal is to work at the optimal effort level where output peaks. Beyond that point, additional effort produces diminishing returns and exhaustion sets in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The default workflow:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;One task at a time&lt;/strong&gt;: Start with &lt;em&gt;one&lt;/em&gt; focused task. Close the loop completely before moving to the next.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Close the loop fully&lt;/strong&gt;: Review output, make notes, then decide on the next step only when you're ready.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eliminate scanning&lt;/strong&gt;: If you catch yourself flipping between tabs or sessions to check status, that's your signal that you've exceeded your neural throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When multiple agents must run simultaneously:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use a Kanban board&lt;/strong&gt;: A visual task queue where agents update their status automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agents update the board, not you&lt;/strong&gt;: The system should notify only when human intervention is needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You check the board, not the sessions&lt;/strong&gt;: Remove the need to scan through terminal windows or tool interfaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You synchronize to stay on the left slope of your productivity curve, where effort yields return without triggering burnout. When tools run faster than your brain can process status updates, exhaustion sets in.&lt;/p&gt;
&lt;p&gt;The rhythm shifts from "hustle harder" to &lt;em&gt;synchronize&lt;/em&gt;—matching your brain's processing speed to the tools' output rate. When the signal outruns neural processing, interference replaces insight. You tune the system to keep your brain in phase on the left slope of your productivity curve, where effort produces sustainable output.&lt;/p&gt;
</content></entry><entry><title>Undoing AI vibe-coded slop with AI</title><link href="https://ericmjl.github.io/blog/2026/3/29/undoing-ai-vibe-coded-slop-with-ai/" rel="alternate"/><updated>2026-03-29T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:e7094c8e-a057-3915-aeaf-6efb27208eec</id><content type="html">&lt;p&gt;I want to tell you about canvas-chat, a project I built with heavy AI assistance. It's a visual, non-linear chat interface where conversations are nodes on an infinite canvas — think branching, merging, and exploring topics as a directed acyclic graph.&lt;/p&gt;
&lt;p&gt;The first commit landed on December 28, 2025. By December 30, it had sessions, matrix evaluation tables, web search, node tagging, and BM25 keyword search. The AI moved &lt;em&gt;fast&lt;/em&gt;. Bugs got fixed in the next commit. Features piled in like Tetris blocks stacking up.&lt;/p&gt;
&lt;p&gt;And yes, it was a mess.&lt;/p&gt;
&lt;p&gt;Here's the thing though: the mess was &lt;em&gt;recoverable&lt;/em&gt;. Not because the AI got better (it didn't; not really), but because I had battle-tested convictions on how the thing &lt;em&gt;ought&lt;/em&gt; to be architected. And those convictions came from years of shipping software, watching architectures crumble, and learning what holds up.&lt;/p&gt;
&lt;p&gt;This is the story of how we went from a jumbled 8,500-line &lt;code&gt;app.js&lt;/code&gt; to a clean plugin architecture — and why you need battle-tested convictions to make that happen.&lt;/p&gt;
&lt;h2 id="the-initial-state"&gt;The Initial State&lt;/h2&gt;&lt;p&gt;The first commit wasn't actually &lt;em&gt;bad&lt;/em&gt;. The project had clean separation from day one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;canvas.js&lt;/code&gt; — SVG pan/zoom/rendering&lt;/li&gt;
&lt;li&gt;&lt;code&gt;graph.js&lt;/code&gt; — DAG data structure&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chat.js&lt;/code&gt; — LLM API + SSE streaming&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.js&lt;/code&gt; — IndexedDB persistence&lt;/li&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; — FastAPI backend&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But &lt;code&gt;app.js&lt;/code&gt; was already ~8,500 lines of everything else. Every slash command handler, every modal, every piece of feature logic — all tangled together. Want to add a new feature? You'd grep around in that monolith, hope you found the right spot, and pray you didn't break anything.&lt;/p&gt;
&lt;p&gt;The AI could add features to this mess. It could add a &lt;code&gt;/matrix&lt;/code&gt; command in a few prompts. It could add &lt;code&gt;/search&lt;/code&gt; with Exa integration. But it couldn't see the &lt;em&gt;structure&lt;/em&gt; — the latent architecture that would make the whole thing maintainable.&lt;/p&gt;
&lt;h2 id="the-first-wave"&gt;The First Wave&lt;/h2&gt;&lt;p&gt;The refactoring started with the purest code — functions with no dependencies:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What Got Extracted&lt;/th&gt;
&lt;th&gt;Why / Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jan 4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;layout.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Overlap detection is pure math&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;highlight-utils.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text selection is isolated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 7&lt;/td&gt;
&lt;td&gt;Feature modules&lt;/td&gt;
&lt;td&gt;&lt;code&gt;flashcards.js&lt;/code&gt;, &lt;code&gt;committee.js&lt;/code&gt;, &lt;code&gt;matrix.js&lt;/code&gt;, &lt;code&gt;factcheck.js&lt;/code&gt;, &lt;code&gt;research.js&lt;/code&gt;, &lt;code&gt;code.js&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jan 10&lt;/td&gt;
&lt;td&gt;Core infrastructure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;undo-manager.js&lt;/code&gt;, &lt;code&gt;modal-manager.js&lt;/code&gt;, &lt;code&gt;slash-command-menu.js&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This reduced &lt;code&gt;app.js&lt;/code&gt; from ~8,500 to ~5,500 lines. But these were still just &lt;em&gt;file splits&lt;/em&gt;. The code worked better, but there was no &lt;em&gt;system&lt;/em&gt; binding it together.&lt;/p&gt;
&lt;p&gt;The AI did this part reasonably well — when I asked "extract this function to a separate module," it could do it. But it never suggested "we should extract this" on its own. It needed direction.&lt;/p&gt;
&lt;h2 id="the-pivotal-moment"&gt;The Pivotal Moment&lt;/h2&gt;&lt;p&gt;This was the architectural leap. I asked the AI to create a plugin system, and it delivered — but only because I knew what a plugin system &lt;em&gt;should&lt;/em&gt; look like.&lt;/p&gt;
&lt;p&gt;We ended up with a &lt;strong&gt;three-level plugin architecture&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 1: Custom Node Types&lt;/strong&gt; — Node protocols define rendering via a &lt;code&gt;BaseNode&lt;/code&gt; class. Each node type can override &lt;code&gt;renderContent()&lt;/code&gt;, &lt;code&gt;getActions()&lt;/code&gt;, &lt;code&gt;getSummaryText()&lt;/code&gt;, and more. Registered in &lt;code&gt;node-registry.js&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 2: Feature Plugins&lt;/strong&gt; — Extend a &lt;code&gt;FeaturePlugin&lt;/code&gt; base class. Get &lt;code&gt;AppContext&lt;/code&gt; via dependency injection (graph, canvas, chat, storage, modalManager, streamingManager). Define slash commands via &lt;code&gt;getSlashCommands()&lt;/code&gt;. Lifecycle hooks: &lt;code&gt;onLoad()&lt;/code&gt;, &lt;code&gt;onUnload()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 3: Extension Hooks&lt;/strong&gt; — Subscribe to events. &lt;code&gt;CancellableEvent&lt;/code&gt; can block actions. Event names like &lt;code&gt;command:before&lt;/code&gt;, &lt;code&gt;node:created&lt;/code&gt;, &lt;code&gt;node:deleted&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The key files created:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;feature-plugin.js&lt;/code&gt; — FeaturePlugin + AppContext&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feature-registry.js&lt;/code&gt; — Slash command routing with priority (BUILTIN &amp;gt; OFFICIAL &amp;gt; COMMUNITY)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;plugin-events.js&lt;/code&gt; — CanvasEvent, CancellableEvent&lt;/li&gt;
&lt;li&gt;&lt;code&gt;node-registry.js&lt;/code&gt; — Node type registration&lt;/li&gt;
&lt;/ul&gt;
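&lt;p&gt;To make the shape of this concrete, here is a minimal sketch of the three-level pattern, transposed into Python for readability. The real implementation is JavaScript; &lt;code&gt;FeaturePlugin&lt;/code&gt;, &lt;code&gt;AppContext&lt;/code&gt;, and the lifecycle hooks map onto the files above, while everything else is illustrative:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# A language-transposed sketch of the three-level plugin pattern.
# The real project is JavaScript; this Python version is illustrative only.
from dataclasses import dataclass


class BaseNode:
    # Level 1: node protocols. Custom node types override these.
    def render_content(self):
        raise NotImplementedError

    def get_actions(self) -&gt; list:
        return []


@dataclass
class AppContext:
    # Level 2 dependency injection: what every feature plugin receives.
    graph: object
    canvas: object
    chat: object
    storage: object


class FeaturePlugin:
    # Level 2 base class: features extend this and declare slash commands.
    def __init__(self, context: AppContext):
        self.context = context

    def get_slash_commands(self) -&gt; dict:
        # Maps command names (like /matrix) to handler callables.
        return {}

    def on_load(self):  # lifecycle hook
        pass

    def on_unload(self):  # lifecycle hook
        pass


class EventBus:
    # Level 3 extension hooks: subscribers can observe or cancel actions.
    def __init__(self):
        self.subscribers: dict[str, list] = {}

    def subscribe(self, event_name: str, handler) -&gt; None:
        self.subscribers.setdefault(event_name, []).append(handler)

    def emit(self, event_name: str, payload) -&gt; bool:
        # Returns False as soon as any subscriber cancels the event.
        return all(h(payload) is not False
                   for h in self.subscribers.get(event_name, []))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The point of the sketch is the shape, not the specifics: features receive their dependencies instead of reaching into &lt;code&gt;app.js&lt;/code&gt; internals, and anything cross-cutting flows through events.&lt;/p&gt;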
&lt;p&gt;This is where the architecture became a real system. And it only happened because I knew what I wanted.&lt;/p&gt;
&lt;h2 id="backend-pluginification-late-january-2026"&gt;Backend Pluginification (Late January 2026)&lt;/h2&gt;&lt;p&gt;The same pattern reached the Python side:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pptx_endpoints.py&lt;/code&gt; — PowerPoint handling&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ddg_endpoints.py&lt;/code&gt; — DuckDuckGo search&lt;/li&gt;
&lt;li&gt;&lt;code&gt;code_handler.py&lt;/code&gt; — Python code execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;matrix_handler.py&lt;/code&gt; — Matrix cell filling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each follows a &lt;code&gt;register_endpoints(app)&lt;/code&gt; pattern, loaded dynamically via &lt;code&gt;importlib&lt;/code&gt;.&lt;/p&gt;
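&lt;p&gt;As a rough sketch of what that pattern can look like (the FastAPI wiring here is my guess at the shape, not the project's actual code):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch: each plugin module exposes register_endpoints(app); the main
# app discovers and loads them dynamically. Module contents are illustrative.
import importlib

from fastapi import FastAPI

app = FastAPI()

# Inside ddg_endpoints.py, for example, you might find:
#
#   def register_endpoints(app):
#       @app.get(&amp;quot;/ddg/search&amp;quot;)
#       def search(q: str):
#           return {&amp;quot;results&amp;quot;: []}

for name in [&amp;quot;pptx_endpoints&amp;quot;, &amp;quot;ddg_endpoints&amp;quot;, &amp;quot;code_handler&amp;quot;, &amp;quot;matrix_handler&amp;quot;]:
    module = importlib.import_module(name)
    module.register_endpoints(app)
&lt;/pre&gt;&lt;/div&gt;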
&lt;h2 id="the-testing-safety-net"&gt;The Testing Safety Net&lt;/h2&gt;&lt;p&gt;By late January, the plugin architecture was in place. Features were decoupled. The code was cleaner. And then GLM-4.5 started dropping curly braces.&lt;/p&gt;
&lt;p&gt;No, really. The AI would "fix" one thing and introduce a missing bracket somewhere else. Merge conflicts became minefields; features that worked yesterday stopped working today, not because of malice, but because the AI didn't understand the dependencies between modules. It was making elementary mistakes that a junior developer wouldn't make.&lt;/p&gt;
&lt;p&gt;On January 24, I added Cypress E2E tests. Out of spite, honestly. The first commit gave us &lt;code&gt;canvas_interactions.cy.js&lt;/code&gt;, &lt;code&gt;node_selection.cy.js&lt;/code&gt;, and &lt;code&gt;note_node.cy.js&lt;/code&gt; - three tests that told us whether the canvas still worked.&lt;/p&gt;
&lt;p&gt;These tests caught the regressions the AI kept introducing. More importantly, they let me verify changes faster. Instead of manually testing every feature after each AI session, I could run the test suite and know whether things still worked.&lt;/p&gt;
&lt;p&gt;The plugin architecture made the code testable. The tests caught what the AI broke.&lt;/p&gt;
&lt;h2 id="the-numbers"&gt;The Numbers&lt;/h2&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;app.js size&lt;/th&gt;
&lt;th&gt;Modules&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Initial (Dec 2025)&lt;/td&gt;
&lt;td&gt;~8,500 lines&lt;/td&gt;
&lt;td&gt;5 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After feature splits&lt;/td&gt;
&lt;td&gt;~5,500 lines&lt;/td&gt;
&lt;td&gt;11 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After infrastructure&lt;/td&gt;
&lt;td&gt;~5,400 lines&lt;/td&gt;
&lt;td&gt;15 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After plugin migration&lt;/td&gt;
&lt;td&gt;~5,400 lines&lt;/td&gt;
&lt;td&gt;25+ files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Today&lt;/td&gt;
&lt;td&gt;~4,700 lines&lt;/td&gt;
&lt;td&gt;35+ modules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="the-bigger-lesson"&gt;The Bigger Lesson&lt;/h2&gt;&lt;p&gt;Here's what I learned from this process:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The AI can execute architecture, but it can't design it.&lt;/strong&gt; It can split files when asked. It can implement a plugin system from a spec. But it won't look at an 8,500-line &lt;code&gt;app.js&lt;/code&gt; and say "this should be a plugin system."&lt;/p&gt;
&lt;p&gt;That vision, that &lt;em&gt;opinion&lt;/em&gt;, comes from somewhere else. It comes from:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Seeing architectures fail&lt;/strong&gt; - Knowing the pain of tangled code, merge conflicts, and feature creep&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seeing architectures succeed&lt;/strong&gt; - Knowing what maintainable code feels like after years of shipping&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading, studying, internalizing&lt;/strong&gt; - Design patterns, architectural styles, tradeoffs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Making mistakes&lt;/strong&gt; - Building the wrong abstraction once so you recognize it next time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I didn't arrive at "we need a three-level plugin architecture" out of nowhere. It came from discussing tradeoffs with the AI: asking "what if we did it this way?" and "what are the tradeoffs of that approach?", then applying my best judgment to the options. The AI could explain the pros and cons of different approaches, but I had to pick which tradeoffs I was willing to accept.&lt;/p&gt;
&lt;p&gt;The AI didn't teach me this. &lt;em&gt;Experience&lt;/em&gt; taught me this.&lt;/p&gt;
&lt;h2 id="what-this-means-for-the-future"&gt;What This Means for the Future&lt;/h2&gt;&lt;p&gt;Here's where it gets interesting. Because we built this modular foundation, I can now swap out the rendering layer. The canvas is currently raw SVG — and I want to move to Svelte Flow. The plugin system I built makes this possible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Features don't depend on &lt;code&gt;app.js&lt;/code&gt; internals; they use &lt;code&gt;AppContext&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Canvas is isolated in &lt;code&gt;canvas.js&lt;/code&gt;; swapping to Svelte Flow means replacing that layer&lt;/li&gt;
&lt;li&gt;Node protocols define behavior; Svelte Flow nodes can use the same protocol pattern&lt;/li&gt;
&lt;li&gt;Event system is framework-agnostic&lt;/li&gt;
&lt;li&gt;Dependency injection provides graph, canvas, chat, storage; these can be re-provided to Svelte Flow components&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The abstraction layer we built (FeaturePlugin + AppContext + EventSystem) separates &lt;em&gt;what&lt;/em&gt; features do from &lt;em&gt;how&lt;/em&gt; they're rendered. That's what makes Svelte Flow viable as a drop-in replacement.&lt;/p&gt;
&lt;h2 id="closing-thoughts"&gt;Closing Thoughts&lt;/h2&gt;&lt;p&gt;You can undo AI vibe-coded slop. It's possible. But it requires &lt;em&gt;you&lt;/em&gt; to have battle-tested convictions on how the thing ought to be.&lt;/p&gt;
&lt;p&gt;The AI is an incredible executor. It can refactor, extract, implement. But the vision? That stays human. And that vision comes from battle-tested experience, from having seen enough codebases to know what works and what collapses under its own weight.&lt;/p&gt;
&lt;p&gt;So if you're working with AI coding assistants: don't expect them to architect for you. Tell them what to build. Give them the structure. Then let them do the implementation.&lt;/p&gt;
&lt;p&gt;That's how you get from a jumbled mess to something you can actually maintain.&lt;/p&gt;
</content></entry><entry><title>Creative mentorship strategies for career growth in challenging times</title><link href="https://ericmjl.github.io/blog/2026/3/25/creative-mentorship-strategies-for-career-growth-in-challenging-times/" rel="alternate"/><updated>2026-03-25T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:9e5abdb7-d7d1-32b6-9593-332abe1cdbb5</id><content type="html">&lt;h2 id="the-problem-with-lean-times"&gt;The problem with lean times&lt;/h2&gt;&lt;p&gt;When the economy tightens, formal development opportunities are usually the first things to go. Co-ops get paused, training budgets shrink, and headcount freezes make it harder to bring in fresh talent. But the need to develop mentorship, coaching, and leadership skills doesn't disappear just because the budget did.&lt;/p&gt;
&lt;p&gt;So the question becomes: how do you get creative? How do you find opportunities to grow as a mentor and leader without requiring the company to spend additional money?&lt;/p&gt;
&lt;h2 id="you-already-have-something-to-offer"&gt;You already have something to offer&lt;/h2&gt;&lt;p&gt;The answer is closer than you think. Even when budgets are frozen, you still have three things worth sharing: your judgment, your skills, and your network.&lt;/p&gt;
&lt;p&gt;Your judgment is what experience actually gives you: not just knowing things, but knowing which approach to take, which tradeoffs matter, and when to push versus when to hold back. Your skills are the technical foundation that lets you coach and mentor beyond your own team, helping others onboard to the tools and practices you work with. And your network is the set of connections you can activate to create opportunities for others, whether that means knowing the right organizer, connecting a speaker to an audience, or simply inviting people into the same room.&lt;/p&gt;
&lt;p&gt;The people who want to learn from you are already in your neighborhood. If you are doing an excellent job, you will find individuals who are eager to understand how you achieve your results. That is where your opportunity lies.&lt;/p&gt;
&lt;h2 id="five-strategies-that-have-worked-for-me"&gt;Five strategies that have worked for me&lt;/h2&gt;&lt;p&gt;I want to share some concrete strategies that have worked at the two companies I have been with, Novartis and Moderna. Some of these I have actively advocated for. My intent is not to boast but to provide pragmatic suggestions based on my own experiences.&lt;/p&gt;
&lt;h3 id="coach-others-one-on-one"&gt;Coach others one-on-one&lt;/h3&gt;&lt;p&gt;Coaching others is a great way to build your reputation within the organization. When you teach someone how to accomplish a task effectively, you become valuable to them. More importantly, you demonstrate your value to a broader audience. Within an organization, you want a group of people who find your skills genuinely worthwhile.&lt;/p&gt;
&lt;h3 id="present-at-internal-guilds-and-birds-of-a-feather-events"&gt;Present at internal guilds and "birds of a feather" events&lt;/h3&gt;&lt;p&gt;At Moderna's Digital organization, we have "Guilds", the Data Science Guild being one, with three meetings per month for the guild. When I was at Novartis' Research org, we had the Computational Research Community. Both served as outlets for talks and annual gatherings. The key is to be in a position where you can give a talk about something valuable to others, and they would willingly spend an hour listening to you. If that happens, you have created another mentorship opportunity for yourself.&lt;/p&gt;
&lt;h3 id="organize-communities-of-practice"&gt;Organize communities of practice&lt;/h3&gt;&lt;p&gt;I have seen this happen when someone builds a tool that others use and then creates a community around that tool. It can be as simple as a Microsoft Teams group chat. You don't need anything more sophisticated than that. Just gather people who use the tool and facilitate discussions. Being a leader in that group chat is a real way to hone your leadership skills across the organization. My colleague &lt;a href="https://www.linkedin.com/in/albert-lam/"&gt;Albert Lam&lt;/a&gt; built a significant portion of the Python packages that are used by LLM builders, and put together communities of practice around that precisely in the form of MS Teams group chats.&lt;/p&gt;
&lt;p&gt;Another example is the community of practice around documentation - primarily expressed as the internal &lt;a href="../../../../2024/6/30/two-years-of-docathons-insights-and-lessons-learned/"&gt;docathons&lt;/a&gt; that we run. My teammate &lt;a href="https://jackievaleri.github.io/"&gt;Jackie Valeri&lt;/a&gt; and two other colleagues, &lt;a href="https://www.linkedin.com/in/simreen-kaur/"&gt;Simreen Kaur&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/saakshisdonthi/"&gt;Saakshi Shamanth Donthi&lt;/a&gt;, help coordinate and organize the logistics while also serving as point contacts for other folks participating in the docathon.&lt;/p&gt;
&lt;h3 id="host-informal-coffee-hours"&gt;Host informal coffee hours&lt;/h3&gt;&lt;p&gt;My teammate, &lt;a href="https://mfaits.github.io/"&gt;Michelle Faits&lt;/a&gt;, took the initiative to host coffee hours within the company. These serve as informal outlets for people to present their work, and they are great because they are relaxed and authentic. As her manager, I try to find speakers to contribute and support her efforts. Kudos to her for initiating this.&lt;/p&gt;
&lt;h3 id="host-or-support-external-meetups"&gt;Host or support external meetups&lt;/h3&gt;&lt;p&gt;We also host the PyData Boston/Cambridge monthly meetup at Moderna. Not every month is held at our location, but since I know the organizer &lt;a href="https://benbatorsky.com/"&gt;Ben Batorsky&lt;/a&gt;, back in 2025, I offered my time to book a room; we simply provide the space. More recently, Jackie has taken the lead in this. By taking charge and inviting others to network, we create opportunities for people to grow in their careers without any budget requirement.&lt;/p&gt;
&lt;h2 id="advice-for-managers"&gt;Advice for managers&lt;/h2&gt;&lt;p&gt;If you are a manager, recognize that there will be projects and efforts that need leadership beyond individual contributions. Helping others improve their skills and providing them with opportunities to lead is part of the job, especially when formal avenues are limited.&lt;/p&gt;
&lt;p&gt;Make sure you are aware of these kinds of initiatives among your reports, and ensure they don't conflict with core responsibilities. When someone can demonstrate that they manage these additional activities while maintaining their primary work, that forms a strong case for their expanded capabilities.&lt;/p&gt;
&lt;p&gt;We should expand our imagination beyond just climbing the career ladder, attaining higher status, or managing other people, which sometimes unfortunately spills over into controlling others at work. Growth comes in many forms, and we can find meaningful ways to develop without waiting for formal promotions or titles.&lt;/p&gt;
&lt;h2 id="the-core-principles"&gt;The core principles&lt;/h2&gt;&lt;p&gt;Mentoring is about sharing your judgment, providing opportunities for others to share theirs, facilitating networking, and helping others grow. If we limit our understanding of growth and development to a narrow definition (only formal programs, only budgeted activities, only additional formal assignments), we constrain our imagination as managers.&lt;/p&gt;
&lt;p&gt;Don't be confined to a singular vision of what it means to be a good leader or manager. Embrace the diverse talents and varying stages of abilities within your team. Encourage your team members to step outside their comfort zones and provide them with opportunities to grow.&lt;/p&gt;
&lt;p&gt;An internal university is nice to have, but you do not need one to foster a learning culture; your environment is where you learn and grow. We have more autonomy and agency than we might realize!&lt;/p&gt;
</content></entry><entry><title>Closing air gaps</title><link href="https://ericmjl.github.io/blog/2026/3/15/closing-air-gaps/" rel="alternate"/><updated>2026-03-15T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:739d4d76-ce27-3a06-a50b-cffe98d01bbb</id><content type="html">&lt;p&gt;I owe this term to my colleague &lt;a href="https://www.linkedin.com/in/wenhao-liu-1b85177/"&gt;Wenhao Liu&lt;/a&gt;. He was the first one I saw at work who clearly articulated about air gaps and how they relate to building agents for work.&lt;/p&gt;
&lt;p&gt;So what exactly is an air gap? It is any point in a business or scientific process where a human has to intervene and perform manual work before a digital system can continue. The system cannot go end to end on its own; the human is the bridge.&lt;/p&gt;
&lt;p&gt;Air gaps are everywhere. Here is one example: A laboratory machine exports a file to a local hard disk. A human copies that file and pastes it into an S3 bucket. That handoff is an air gap. The system stops at the hard disk and waits for a person.&lt;/p&gt;
&lt;p&gt;In wet lab science, air gaps take physical form. A scientist designs an experiment, walks into the lab, and executes it by hand. In this state, the company/team/org has an air gap in its scientific process. No robotic system can take over from design to execution.&lt;/p&gt;
&lt;p&gt;Both of these examples share a common pattern. The definition I have settled on is this: an air gap is any place where rote manual work is performed by a human that could have been done by a computer. (Robots are computers with sensors and actuators for the physical world.)&lt;/p&gt;
&lt;h2 id="why-air-gaps-matter"&gt;Why air gaps matter&lt;/h2&gt;&lt;p&gt;Think of your processes as pipes. Air gaps are bubbles trapped in those pipes. They slow the flow. They disrupt continuity. They introduce delays and errors.&lt;/p&gt;
&lt;p&gt;The costs compound over time. A five-minute manual handoff, repeated daily across a team of twenty, adds up to real hours. A week-long delay because someone was on vacation and could not move the file. A transcription error because a human typed a number wrong. A lost opportunity because the data sat in a local folder instead of flowing into the analysis pipeline.&lt;/p&gt;
&lt;p&gt;Air gaps also create cognitive overhead. Every time a human has to remember to perform a manual step, that is mental bandwidth not spent on creative work. The air gap is a tax on attention.&lt;/p&gt;
&lt;p&gt;Now, a clarification. The goal here is not to eliminate humans from everything. The goal is to eliminate humans from the rote and routine. Creative work, judgment, and decision making stay with us. Copying files, transferring plates, and typing data into forms do not.&lt;/p&gt;
&lt;h2 id="how-to-find-air-gaps"&gt;How to find air gaps&lt;/h2&gt;&lt;p&gt;You cannot close an air gap until you see it. And you cannot see it until you sit down and map out exactly how your process works. In other words, &lt;strong&gt;process mapping&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This mapping exercise is the unsexy work that precedes automation. Most teams skip it. They jump to solutions before understanding the problem. But you need the map.&lt;/p&gt;
&lt;p&gt;Here is how to do it.&lt;/p&gt;
&lt;p&gt;Pick one process. It could be a data pipeline, a lab workflow, or a business approval chain. Walk through it step by step. Ask these questions at each step:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Where is a human manually copying and pasting?&lt;/li&gt;
&lt;li&gt;Where is a human manually entering data?&lt;/li&gt;
&lt;li&gt;Where is a human dragging and dropping files?&lt;/li&gt;
&lt;li&gt;Where is a human making a decision that follows a fixed rule?&lt;/li&gt;
&lt;li&gt;Where is a human waiting for another human to take action?&lt;/li&gt;
&lt;li&gt;Where is a human physically moving something from one place to another?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each answer points to a potential air gap.&lt;/p&gt;
&lt;p&gt;Write it down. Draw it out. Make the process visible. Once the map exists, the air gaps reveal themselves. The next step is prioritization. Which air gaps cause the most pain? Which ones are easiest to close? Start there.&lt;/p&gt;
&lt;h2 id="air-gaps-in-the-wild"&gt;Air gaps in the wild&lt;/h2&gt;&lt;p&gt;Enough abstraction. Let me show you what air gaps look like in practice, from my own work and from the broader landscape.&lt;/p&gt;
&lt;h3 id="file-schlepping-in-the-lab"&gt;File schlepping in the lab&lt;/h3&gt;&lt;p&gt;A sequencing machine finishes a run. It writes the data to a local drive. A technician notices the run is complete, navigates to the folder, selects the files, copies them, navigates to the shared storage system, and pastes. Minutes pass. Sometimes hours, if the technician is busy.&lt;/p&gt;
&lt;p&gt;This is an air gap. The machine knows when the run finishes. The machine has network access. The destination storage has an API. Automation can close this gap with a simple script that watches for new files and uploads them.&lt;/p&gt;
&lt;p&gt;The fix is not technically difficult; it is, conceptually, a &lt;code&gt;cron&lt;/code&gt; job with &lt;code&gt;rsync&lt;/code&gt;. What makes it hard is that the air gap is invisible until someone maps the process and asks why a human is doing this work.&lt;/p&gt;
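&lt;p&gt;To make that concrete, here is a minimal sketch of the watcher half in Python. The folder, file pattern, bucket name, and polling interval are all placeholders:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch: watch an instrument's output folder and upload new files to
# shared storage. Every name and path here is a placeholder.
import subprocess
import time
from pathlib import Path

WATCH_DIR = Path(&amp;quot;/data/sequencer/runs&amp;quot;)  # where the machine writes
seen: set[Path] = set()

while True:
    for path in WATCH_DIR.glob(&amp;quot;*.fastq.gz&amp;quot;):
        if path not in seen:
            # Any uploader works: rsync to a mounted share, or the
            # AWS CLI for an S3 bucket, as below.
            subprocess.run(
                [&amp;quot;aws&amp;quot;, &amp;quot;s3&amp;quot;, &amp;quot;cp&amp;quot;, str(path), &amp;quot;s3://example-bucket/runs/&amp;quot;],
                check=True,
            )
            seen.add(path)
    time.sleep(60)  # or run once per minute under cron instead of looping
&lt;/pre&gt;&lt;/div&gt;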
&lt;h3 id="github-activity-tracking"&gt;GitHub activity tracking&lt;/h3&gt;&lt;p&gt;I used to manually check GitHub to track my daily work. I would open my profile page, scroll through recent commits, open the pull requests tab, check which ones I had opened or reviewed, and then type notes into a document. This took maybe ten minutes per day.&lt;/p&gt;
&lt;p&gt;Then I remembered the GitHub CLI exists. I also remembered that coding agents can run CLI commands.&lt;/p&gt;
&lt;p&gt;I built a skill that pulls four categories of activity automatically: my opened pull requests, pull requests I reviewed or commented on, my commits, and issues I created. The agent runs this skill as part of my daily sign-off routine. The air gap closed.&lt;/p&gt;
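&lt;p&gt;The core of that skill is just a handful of GitHub CLI calls. A sketch of the idea is below; the exact flags and date filters my skill uses may differ from these:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch: pull four categories of GitHub activity via the gh CLI.
# Assumes gh is installed and authenticated; flags follow gh's search docs.
import subprocess


def gh(*args: str) -&gt; str:
    result = subprocess.run([&amp;quot;gh&amp;quot;, *args], capture_output=True, text=True, check=True)
    return result.stdout


today = &amp;quot;2026-03-15&amp;quot;  # placeholder date
opened_prs = gh(&amp;quot;search&amp;quot;, &amp;quot;prs&amp;quot;, &amp;quot;--author=@me&amp;quot;, f&amp;quot;--created={today}&amp;quot;)
reviewed_prs = gh(&amp;quot;search&amp;quot;, &amp;quot;prs&amp;quot;, &amp;quot;--reviewed-by=@me&amp;quot;, f&amp;quot;--updated={today}&amp;quot;)
issues = gh(&amp;quot;search&amp;quot;, &amp;quot;issues&amp;quot;, &amp;quot;--author=@me&amp;quot;, f&amp;quot;--created={today}&amp;quot;)
commits = gh(&amp;quot;search&amp;quot;, &amp;quot;commits&amp;quot;, &amp;quot;--author=@me&amp;quot;, f&amp;quot;--committer-date={today}&amp;quot;)
&lt;/pre&gt;&lt;/div&gt;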
&lt;p&gt;The time savings are modest. But the mental overhead vanished. I no longer need to remember to check GitHub. The information flows to me.&lt;/p&gt;
&lt;h3 id="autonomous-laboratories"&gt;Autonomous laboratories&lt;/h3&gt;&lt;p&gt;The autonomous lab, sometimes called a lights-out lab, is the ultimate expression of closing air gaps. The vision is a laboratory that runs itself: experiments are designed, executed, analyzed, and iterated without human intervention.&lt;/p&gt;
&lt;p&gt;In practice, autonomous labs are full of micro air gaps. Each one must be identified and closed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plate transfers.&lt;/strong&gt; A protocol requires moving a plate from one instrument to another. Does a human do this? If so, that is an air gap. Robotic arms and conveyor systems can close it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Master reagent prep.&lt;/strong&gt; Someone mixes buffers and reagents by hand at the start of each week. Could a liquid handling robot do this instead? Probably. That is an air gap.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File movement.&lt;/strong&gt; Instruments write data locally. Humans move data to shared storage. This is the file schlepping problem again, repeated across every machine in the lab.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Standardized analyses.&lt;/strong&gt; As a data scientist, this one is close to my heart. Most labs have a set of standard analyses they run on every dataset. Quality control plots, basic statistics, alignment checks. A human opens a notebook, loads the data, runs the cells, and exports results.&lt;/p&gt;
&lt;p&gt;This is an air gap. Standardized analyses can be automated. They can also be made adaptable. An LLM-powered coding agent can take a standard analysis template and adjust parameters within a confined range of design choices. The human specifies the intent. The agent handles the implementation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Closing the loop.&lt;/strong&gt; The ultimate air gap in scientific research is the gap between analysis and experiment design. A human looks at results, draws conclusions, and designs the next experiment. What if the results could flow back into experiment design automatically? What if an agent could propose the next experiment based on what the data showed?&lt;/p&gt;
&lt;p&gt;This is the direction autonomous labs are moving. But getting there requires closing every air gap along the chain.&lt;/p&gt;
&lt;h2 id="the-blockers-imagination-and-skill"&gt;The blockers: imagination and skill&lt;/h2&gt;&lt;p&gt;I wrote about this in a &lt;a href="https://ericmjl.github.io/blog/2026/3/6/mastering-personal-knowledge-management-with-obsidian-and-ai/"&gt;previous post&lt;/a&gt;. The biggest blockers to closing air gaps are technical skill and imagination.&lt;/p&gt;
&lt;h3 id="imagination"&gt;Imagination&lt;/h3&gt;&lt;p&gt;If you cannot imagine a future state where your tedious work is performed by a coding agent, you will not see the possibility. The air gap remains invisible.&lt;/p&gt;
&lt;p&gt;This is a failure of imagination, not a failure of technology. The tools exist. The APIs exist. The agents exist. What is missing is the mental leap from "this is how we have always done it" to "this is how we could do it."&lt;/p&gt;
&lt;p&gt;Imagination grows from exposure. The more you see what is possible, the more you can imagine for your own work. Watch what other teams are doing. Read about automation in adjacent fields. Talk to people who have closed similar gaps.&lt;/p&gt;
&lt;h3 id="skill"&gt;Skill&lt;/h3&gt;&lt;p&gt;Imagination alone is not enough. You also need the skill to build the automation.&lt;/p&gt;
&lt;p&gt;That skill, at some level, means knowing how to program. It means understanding APIs, scripting, and how systems talk to each other. It means knowing that cron jobs and webhooks can be configured.&lt;/p&gt;
&lt;p&gt;The good news is that the barrier to entry is lower than ever. Coding agents can help you write the code. The skill you need is not deep software engineering. It is enough programming literacy to describe what you want and recognize whether the output is correct.&lt;/p&gt;
&lt;h3 id="programmatic-access"&gt;Programmatic access&lt;/h3&gt;&lt;p&gt;Sometimes the blocker is not you. Sometimes it is the system.&lt;/p&gt;
&lt;p&gt;Some organizations block programmatic access to their tools. It may be that a SaaS application has no API, or that the API is disabled for security reasons, or that a legacy database has no query interface, only a web portal. A legacy laboratory information management system may require a human to click through screens instead of offering secured APIs.&lt;/p&gt;
&lt;p&gt;These are also air gaps. The system itself prevents automation.&lt;/p&gt;
&lt;p&gt;If the blocker is cybersecurity concerns, push for scoped, tracked access. You do not need full API permissions to close a specific air gap. You need enough access to move the data that belongs in your pipeline. That access can be limited, logged, and auditable.&lt;/p&gt;
&lt;p&gt;If the system truly has no programmatic interface, browser or desktop automation agents can close the gap. A headless browser can log in, navigate, click, and extract. It is slower and more fragile than an API, but it works.&lt;/p&gt;
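&lt;p&gt;As a sketch of that last resort, here is what the headless-browser route can look like with Playwright's Python API. The URL, credentials, and selectors are made up:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch: a headless browser as the fallback interface to a system with
# no API. Every URL and selector here is hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(&amp;quot;https://lims.example.com/login&amp;quot;)
    page.fill(&amp;quot;#username&amp;quot;, &amp;quot;svc-account&amp;quot;)
    page.fill(&amp;quot;#password&amp;quot;, &amp;quot;********&amp;quot;)
    page.click(&amp;quot;button[type=submit]&amp;quot;)
    page.goto(&amp;quot;https://lims.example.com/samples&amp;quot;)
    rows = page.locator(&amp;quot;table#results tr&amp;quot;).all_inner_texts()
    browser.close()
    # rows now holds the extracted table text, ready for the pipeline
&lt;/pre&gt;&lt;/div&gt;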
&lt;h2 id="closing-air-gaps-with-agents"&gt;Closing air gaps with agents&lt;/h2&gt;&lt;p&gt;But here is the good news. The rise of coding agents changes the calculus for closing air gaps.&lt;/p&gt;
&lt;p&gt;Before, you needed a software engineer to write the automation script. Now, you can describe what you want in plain language and let the agent write the code.&lt;/p&gt;
&lt;p&gt;This does not mean you can ignore technical literacy. You still need to verify the output, debug when things break, and understand enough to specify the problem clearly. But the implementation barrier is lower.&lt;/p&gt;
&lt;p&gt;Browser agents extend this further. If a system has no API, a browser agent can act as the interface. Log in, click the buttons, extract the data, and feed it into your pipeline.&lt;/p&gt;
&lt;p&gt;The key insight is that agents are not a replacement for mapping your processes. They are a tool for closing the air gaps you have already identified. The mapping still matters. The imagination still matters. The skill to recognize whether the agent's output is correct still matters.&lt;/p&gt;
&lt;p&gt;What changes is the speed of iteration. You can try closing an air gap in an afternoon instead of a sprint. You can experiment with different approaches quickly. The feedback loop tightens.&lt;/p&gt;
&lt;p&gt;Here is the right question to ask when you find yourself being asked to do something repeatedly: "Can I remove this bottleneck for you? Is there a way I can make it so that you never have to ask me this again?" The goal is to build systems that use agents so that human intervention is needed as little as possible, not to create systems that depend on humans more and more. As one person put it, "the more I have asked myself that question, the more capable he has become." (&lt;a href="https://www.youtube.com/watch?v=nSBKCZQkmYw"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;h2 id="the-principle"&gt;The principle&lt;/h2&gt;&lt;p&gt;The guiding principle is simple. To butcher the Biblical phrase, "Give unto robots what belongs to robots, and have humans do what humans can do." Or to paraphrase another person, use robots for the dull, dirty, and dangerous work.&lt;/p&gt;
&lt;p&gt;Keep the human in the loop for creative, judgment-heavy work. The rote and routine should flow through pipes without bubbles.&lt;/p&gt;
&lt;p&gt;Start mapping. Find the air gaps. Close them one by one. The compounding effect over time is enormous once micro-efficiencies become part of your work.&lt;/p&gt;
&lt;p&gt;What looks like a small efficiency gain today becomes a transformed process tomorrow. The lab that closed its file schlepping air gaps is one step closer to autonomous operation. The team that automated their daily reporting has mental bandwidth for harder problems.&lt;/p&gt;
&lt;p&gt;Close every air gap you can see!&lt;/p&gt;
</content></entry><entry><title>Agent skills are also human skills</title><link href="https://ericmjl.github.io/blog/2026/3/14/agent-skills-are-also-human-skills/" rel="alternate"/><updated>2026-03-14T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:3258471c-f953-3f42-92b1-712fa77e264b</id><content type="html">&lt;p&gt;Agent skills are great, but I've been thinking about this... skills alone aren't enough.&lt;/p&gt;
&lt;p&gt;The thought has been with me while developing and using agent skills at home and at work. There's a distinction I've started to draw between two types of skills. Tool-specific skills document how to work with a particular tool or package. Those are fine, but really, pointing an agent at &lt;code&gt;llms.txt&lt;/code&gt; often works just as well. The more interesting category is workflow-specific skills: things that encode how you actually work, that string together multiple tools to &lt;strong&gt;get a job done&lt;/strong&gt; (Christensen).&lt;/p&gt;
&lt;p&gt;Workflow-specific skills are what I want to talk about here.&lt;/p&gt;
&lt;h2 id="a-concrete-example"&gt;A concrete example&lt;/h2&gt;&lt;p&gt;My daily sign-off skill, which I use at work, is a case in point. I use it to wrap up my day. When I sign off, I need two things: my meeting notes (which I paste into Obsidian throughout the day) and my GitHub activity (commits, PRs, comments, reviews). The skill handles the GitHub part by querying the GitHub CLI and formatting everything into my daily bullets template.&lt;/p&gt;
&lt;p&gt;But here's where it gets opinionated. My skill assumes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have the GitHub CLI installed&lt;/li&gt;
&lt;li&gt;You do PRs as part of your work (not all technical managers do)&lt;/li&gt;
&lt;li&gt;You write into a monthly file as your bullet journal, rather than having a single note per day.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last point is opinionated. I don't have a single note per day. Instead, each month contains my collection of daily bullets. The motivation here is a line from the Zen of Python -- "flat is better than nested". On March 26, I have entries for that day inside the March file, rather than a reference from the March file to March 26. This might not reflect your own preferences; you might prefer one note per day, or use a different structure entirely. But this is what my skill expects, and it's baked into how the skill works.&lt;/p&gt;
&lt;p&gt;If you want to use my daily sign-off skill, you're not just adopting the skill. You're adopting my way of working. You're inheriting my file structure, my tool preferences, my mental model for organizing information. The skill comes with implicit assumptions about how you work, what tools you use, and what your environment looks like.&lt;/p&gt;
&lt;h2 id="a-second-example-cutting-deeper"&gt;A second example, cutting deeper&lt;/h2&gt;&lt;p&gt;The daily sign-off is mostly about tool and structure preferences. But some skills go further — they encode a &lt;em&gt;philosophy&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;My scientific EDA skill is a good example. On the surface it looks like a set of technical rules: use &lt;code&gt;uv&lt;/code&gt; with PEP723 inline scripts, save plots as WebP (not PNG), organize each analysis session into a timestamped folder, keep an append-only &lt;code&gt;journal.md&lt;/code&gt;. But look at what those rules actually encode:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;One step at a time, ask "why" before executing&lt;/strong&gt; — this isn't a technical constraint. It reflects a skepticism of agents that run ahead of the analyst. I believe good exploratory analysis is a dialogue, not a sprint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Capture the research question before touching the data&lt;/strong&gt; — this reflects a conviction that context shapes what you should even be looking for. Data without a question is just noise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Append-only journal&lt;/strong&gt; — this reflects a belief that good science is narrated, not just executed. The journal isn't a log file; it's a record of reasoning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WebP over PNG&lt;/strong&gt; — a small but deliberate aesthetic and practical stance on file hygiene.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;uv + PEP723&lt;/strong&gt; — a specific bet on the Python toolchain that not everyone has made (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
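&lt;p&gt;For the curious, the toolchain bet looks like this in practice: a PEP 723 metadata block at the top of a standalone script, which &lt;code&gt;uv run&lt;/code&gt; reads to build an isolated environment on the fly. The dependencies listed are just an example:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# /// script
# requires-python = &amp;quot;&amp;gt;=3.12&amp;quot;
# dependencies = [&amp;quot;pandas&amp;quot;, &amp;quot;matplotlib&amp;quot;]
# ///
# Run with: uv run eda_session.py
import pandas as pd

df = pd.read_csv(&amp;quot;data.csv&amp;quot;)  # placeholder dataset
print(df.describe())
&lt;/pre&gt;&lt;/div&gt;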
&lt;p&gt;None of these are neutral defaults. Each one is a choice that reflects how I think scientific work should be done. If you use my EDA skill but don't share that underlying philosophy, you'll find yourself fighting it. The one-step-at-a-time rule will feel like friction. The journaling requirement will feel like overhead. The skill isn't broken — it's just mine.&lt;/p&gt;
&lt;p&gt;This is a different kind of assumption from the daily sign-off. There, you're inheriting my tools and file layout. Here, you're inheriting my epistemology. That's harder to see, harder to document, and harder to transfer.&lt;/p&gt;
&lt;h2 id="what-this-means"&gt;What this means&lt;/h2&gt;&lt;p&gt;I call this &lt;em&gt;procedural context&lt;/em&gt;. A workflow-specific agent skill is more than documentation for the coding agent. It also implicitly encodes a person's systems and structures for working. Without documenting the procedural context, the skill can only be half-useful for another person.&lt;/p&gt;
&lt;p&gt;The two examples above hint at different layers of procedural context. There are at least three:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tool dependencies&lt;/strong&gt; — what software needs to be installed (GitHub CLI, &lt;code&gt;uv&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Organizational preferences&lt;/strong&gt; — how you structure files, folders, and notes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Epistemic preferences&lt;/strong&gt; — how you believe the work should actually proceed&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The third layer is the most invisible. It's also the most important, and the hardest to transfer. You can install a CLI tool in five minutes. Adopting someone else's philosophy of scientific analysis is a different ask entirely.&lt;/p&gt;
&lt;p&gt;Someone on Twitter put it well (I wish I could remember who, so I won't take credit): with agent skills, we finally found a way to get coders to write documentation. We'll document how we work if it means we can delegate that work to some{one/thing} else!&lt;/p&gt;
&lt;p&gt;At the end of the day, agent skills are just automation and documentation. We're automating away the minutiae, and I love that. But if your skill describes a workflow, you need to document the assumptions too. What are the dependencies? What tools need to be installed? What mental structures does the person need? What does the user need to know to verify the output is correct?&lt;/p&gt;
&lt;p&gt;Without that context, you can't evaluate whether the coding agent used the skill correctly -- and verification matters! You need to know what to look for when an LLM does work on your behalf.&lt;/p&gt;
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;&lt;p&gt;Agent skills implicitly involve human skills. If that's true, then agent skills are also for humans. They're not merely instructions for an agent. They're documentation of how someone accomplishes a job, with all the prerequisites and context needed to reproduce it.&lt;/p&gt;
&lt;p&gt;So when you write a workflow skill, think about the other people who might use it. Ask the skill-creator skill to include the dependencies, explain the environment, and describe what success looks like. The skill alone isn't enough. We have to teach the next person how to use it too.&lt;/p&gt;
</content></entry><entry><title>My weekend experiment making PyMC installable in a WASM environment</title><link href="https://ericmjl.github.io/blog/2026/3/8/my-weekend-experiment-pymc-wasm/" rel="alternate"/><updated>2026-03-08T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:bec96de2-ebd1-3052-a10c-136d1b83cc99</id><content type="html">&lt;p&gt;This past weekend, I found myself revisiting a blog post from PyMC Labs titled &lt;a href="https://www.pymc-labs.com/blog-posts/pymc-in-browser"&gt;"Running PyMC in the Browser with PyScript"&lt;/a&gt;. Published in 2022, it demonstrated something magical: running full Bayesian inference with PyMC entirely in the browser—no server, no installation, no data leaving your device. Users could define models, run NUTS sampling, and visualize posteriors, all client-side.&lt;/p&gt;
&lt;p&gt;I was excited to try it out. But when I attempted to run the examples, I discovered they no longer worked. The Python package ecosystem had evolved, dependencies had shifted, and the Pyodide environment had changed. What was once a breakthrough demo had quietly broken.&lt;/p&gt;
&lt;p&gt;So I did what any curious engineer would do on a weekend: I dove down the rabbit hole to figure out how to make it work again.&lt;/p&gt;
&lt;h2 id="the-core-challenge-getting-pytensor-to-build-for-webassembly"&gt;The core challenge: getting PyTensor to build for WebAssembly&lt;/h2&gt;&lt;p&gt;PyMC depends on PyTensor, its computational backend. PyTensor is where the heavy lifting happens: it compiles mathematical expressions into optimized code (usually C or JAX) and executes them efficiently. To run PyMC in a browser via Pyodide, I first needed to make PyTensor installable in a WebAssembly environment.&lt;/p&gt;
&lt;p&gt;This wasn't just a matter of &lt;code&gt;pip install&lt;/code&gt;. PyTensor contains C and Cython extensions that must be compiled for the target platform. For WebAssembly, that means using Emscripten and the Pyodide build tooling.&lt;/p&gt;
&lt;h2 id="the-code-changes-what-i-modified-in-pytensor"&gt;The code changes: what I modified in PyTensor&lt;/h2&gt;&lt;p&gt;Working on my fork of PyTensor (&lt;a href="https://github.com/ericmjl/pytensor"&gt;ericmjl/pytensor&lt;/a&gt;), I made targeted modifications to enable WASM builds. Here's the complete diff:&lt;/p&gt;
&lt;h3 id="change-1-making-numba-optional-on-webassembly"&gt;Change 1: making Numba optional on WebAssembly&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gd"&gt;-    &amp;quot;numba&amp;gt;0.57,&amp;lt;1&amp;quot;,&lt;/span&gt;
&lt;span class="gi"&gt;+    &amp;quot;numba&amp;gt;0.57,&amp;lt;1; platform_machine != &amp;#39;wasm32&amp;#39; and sys_platform != &amp;#39;emscripten&amp;#39;&amp;quot;,&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This single line change is the critical enabler. Numba, PyTensor's JIT compiler for numerical code, is not available in WebAssembly environments. There's no way to install it—it simply doesn't exist for this platform.&lt;/p&gt;
&lt;p&gt;The fix uses PEP 508 environment markers to make Numba a conditional dependency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;platform_machine != 'wasm32'&lt;/code&gt; excludes WASM architectures&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sys_platform != 'emscripten'&lt;/code&gt; adds an extra safety check for Emscripten-based builds&lt;/li&gt;
&lt;/ul&gt;
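&lt;p&gt;One way to convince yourself the marker does what you expect is to evaluate it with the &lt;code&gt;packaging&lt;/code&gt; library against your own interpreter's environment:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Evaluate the PEP 508 marker against the current environment.
from packaging.markers import Marker

marker = Marker(&amp;quot;platform_machine != 'wasm32' and sys_platform != 'emscripten'&amp;quot;)
print(marker.evaluate())  # True on a typical desktop; False under Pyodide
&lt;/pre&gt;&lt;/div&gt;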
&lt;p&gt;Without this change, attempting to install PyTensor in Pyodide would fail immediately with a dependency resolution error. Pyodide would try to find a Numba wheel for WASM, fail, and abort the entire installation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The tradeoff, however, is that PyTensor loses its JIT compilation capabilities on WASM.&lt;/strong&gt; Operations that would be compiled to optimized native code fall back to pure Python execution. This means slower performance, and critically, PyMC's NUTS sampler won't work.&lt;/p&gt;
&lt;h3 id="change-2-adding-pixi-development-environment-configuration"&gt;Change 2: adding Pixi development environment configuration&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;I added a complete Pixi workspace configuration to &lt;code&gt;pyproject.toml&lt;/code&gt;. This provides a reproducible development environment and includes the tooling needed to build WASM wheels:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# -----------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="c1"&gt;# Pixi (pixi.prefix.dev): development environment from environment.yml&lt;/span&gt;
&lt;span class="c1"&gt;# Use: pixi install &amp;amp;&amp;amp; pixi run pytest   or   pixi shell&lt;/span&gt;
&lt;span class="c1"&gt;# -----------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;[tool.pixi.workspace]&lt;/span&gt;
&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;conda-forge&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;platforms&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;linux-64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;osx-64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;osx-arm64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;win-64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.pypi-dependencies]&lt;/span&gt;
&lt;span class="n"&gt;pytensor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;editable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;types-setuptools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pyodide-build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=0.29.2&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=3.11,&amp;lt;3.14&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;compilers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=2.0.0&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;scipy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=1,&amp;lt;2&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;filelock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=3.15&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;etuples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;logical-unification&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;miniKanren&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;cons&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pydeprecate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=0.57&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;coveralls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;diff-cover&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;mypy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest-cov&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest-xdist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest-benchmark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest-mock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pytest-sphinx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;sphinx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=5.1.0,&amp;lt;6&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;sphinx_rtd_theme&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pygments&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pydot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;ipython&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pymc-sphinx-theme&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;sphinx-design&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;myst-nb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;matplotlib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;watermark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;ruff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pre-commit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;packaging&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;cython&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;graphviz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.target.linux-64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;mkl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;mkl-service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*mkl&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.target.win-64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;mkl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;mkl-service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*mkl&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.target.osx-64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*accelerate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.target.osx-arm64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*accelerate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.tasks]&lt;/span&gt;
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pytest&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;lint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ruff check .&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ruff format .&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python -m sphinx -b html ./doc ./html&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;wheel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python -m build --wheel&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;sdist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;python -m build --sdist&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;wheel-wasm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pyodide build&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here are the key design decisions in this configuration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Python version pinning:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=3.11,&amp;lt;3.14&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Pyodide only supports up to Python 3.13. Without this constraint, the environment might resolve to Python 3.14+, causing the WASM build to fail with: &lt;code&gt;ValueError: Python version 3.14 is not yet supported.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PyPI dependencies for building:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;[tool.pixi.pypi-dependencies]&lt;/span&gt;
&lt;span class="n"&gt;pytensor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;editable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;types-setuptools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pyodide-build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;=0.29.2&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This installs PyTensor in editable mode for development, includes type stubs for mypy, and adds both &lt;code&gt;build&lt;/code&gt; (standard wheel building) and &lt;code&gt;pyodide-build&lt;/code&gt; (WASM wheel building).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Platform-specific BLAS:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;[tool.pixi.target.linux-64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;mkl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;mkl-service&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*mkl&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;[tool.pixi.target.osx-arm64.dependencies]&lt;/span&gt;
&lt;span class="n"&gt;libblas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*accelerate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Different platforms use different BLAS implementations: Linux and Windows use Intel MKL, while macOS uses Apple's Accelerate framework. These target-specific sections ensure that the correct linear algebra library is installed on each platform.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build task:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;wheel-wasm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pyodide build&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This task runs &lt;code&gt;pyodide build&lt;/code&gt;, which compiles PyTensor for WebAssembly using Emscripten.&lt;/p&gt;
&lt;h3 id="change-3-documenting-the-wasm-build-process"&gt;Change 3: documenting the WASM build process&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;doc/dev_start_guide.rst&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;I added documentation explaining how to build WASM wheels:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gh"&gt;Building a WebAssembly (Pyodide) wheel&lt;/span&gt;
&lt;span class="gh"&gt;-------------------------------------&lt;/span&gt;

To build a wheel targeting WebAssembly for use with &lt;span class="s"&gt;`Pyodide &lt;/span&gt;&lt;span class="si"&gt;&amp;lt;https://pyodide.org/&amp;gt;&lt;/span&gt;&lt;span class="s"&gt;`_&lt;/span&gt; (e.g. for the browser or JupyterLite), use the Pyodide build tooling. This produces a wheel in &lt;span class="s"&gt;``dist/``&lt;/span&gt; with a name like &lt;span class="s"&gt;``*-cpXXX-cpXXX-pyodide_*_wasm32.whl``&lt;/span&gt;.

&lt;span class="gs"&gt;**One-time setup: Emscripten**&lt;/span&gt;

&lt;span class="m"&gt;1.&lt;/span&gt; Install &lt;span class="nv"&gt;`pyodide-build`&lt;/span&gt; (included in the Pixi dev env, or &lt;span class="s"&gt;``pip install pyodide-build&amp;gt;=0.29.2``&lt;/span&gt;).
&lt;span class="m"&gt;2.&lt;/span&gt; Get the Emscripten version required by your pyodide-build: &lt;span class="s"&gt;``pyodide config get emscripten_version``&lt;/span&gt;.
&lt;span class="m"&gt;3.&lt;/span&gt; Install and activate that Emscripten version using the &lt;span class="s"&gt;`Emscripten SDK (emsdk) &lt;/span&gt;&lt;span class="si"&gt;&amp;lt;https://emscripten.org/docs/getting_started/downloads.html&amp;gt;&lt;/span&gt;&lt;span class="s"&gt;`_&lt;/span&gt;:

&lt;span class="p"&gt;   ..&lt;/span&gt; &lt;span class="ow"&gt;code-block&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt; &lt;span class="k"&gt;bash&lt;/span&gt;

      git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/emscripten-core/emsdk.git
      &lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;emsdk
      ./emsdk&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;version&amp;gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;# use the version from step 2&lt;/span&gt;
      ./emsdk&lt;span class="w"&gt; &lt;/span&gt;activate&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;version&amp;gt;
      &lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;emsdk_env.sh

&lt;span class="m"&gt;4.&lt;/span&gt; In any shell where you want to build the wasm wheel, ensure Emscripten is on &lt;span class="s"&gt;``PATH``&lt;/span&gt; (e.g. run &lt;span class="s"&gt;``source /path/to/emsdk/emsdk_env.sh``&lt;/span&gt;).

&lt;span class="gs"&gt;**Build the wheel**&lt;/span&gt;

From the project root, with Emscripten activated and your dev environment active (e.g. &lt;span class="s"&gt;``pixi shell``&lt;/span&gt;):

&lt;span class="p"&gt;..&lt;/span&gt; &lt;span class="ow"&gt;code-block&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt; &lt;span class="k"&gt;bash&lt;/span&gt;

   pyodide&lt;span class="w"&gt; &lt;/span&gt;build

Or with Pixi: &lt;span class="s"&gt;``pixi run wheel-wasm``&lt;/span&gt;.

The wheel will appear in &lt;span class="s"&gt;``dist/``&lt;/span&gt;. PyPI does not yet accept emscripten/wasm32 wheels; host the file elsewhere (e.g. GitHub Releases) and install in Pyodide with &lt;span class="s"&gt;``micropip.install(url)``&lt;/span&gt;. See &lt;span class="s"&gt;`Pyodide: building packages &lt;/span&gt;&lt;span class="si"&gt;&amp;lt;https://pyodide.org/en/stable/development/building-packages-from-source.html&amp;gt;&lt;/span&gt;&lt;span class="s"&gt;`_&lt;/span&gt; for details.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This documentation walks through the Emscripten setup, the build command, and, importantly, the fact that PyPI doesn't accept WASM wheels yet—you need to distribute them via GitHub Releases or similar and install with &lt;code&gt;micropip.install(url)&lt;/code&gt;.&lt;/p&gt;
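&lt;p&gt;Concretely, installing such a wheel from inside a Pyodide console or JupyterLite notebook looks roughly like this. The URL is a hypothetical placeholder for wherever you host the wheel; top-level &lt;code&gt;await&lt;/code&gt; works in Pyodide:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import micropip

# Point micropip at the hosted wasm32 wheel, e.g. a GitHub Release asset.
await micropip.install(
    "https://github.com/your-org/your-repo/releases/download/v0.1.0/"
    "pytensor-0.1.0-cp312-cp312-pyodide_2024_0_wasm32.whl"
)

import pytensor  # now importable in the browser
&lt;/code&gt;&lt;/pre&gt;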
&lt;h2 id="what-i-actually-pr-d-to-pytensor"&gt;What I actually PR'd to PyTensor&lt;/h2&gt;&lt;p&gt;The changes above represent my weekend exploration, but they weren't what I ultimately contributed back to PyTensor. The Pixi configuration, in particular, was too large of a departure from PyTensor's existing toolchain. PyTensor uses mamba (via &lt;code&gt;environment.yml&lt;/code&gt;) for its development environment, and switching to Pixi would have been a significant change to impose on a project I don't maintain.&lt;/p&gt;
&lt;p&gt;Instead, I redid the infrastructure changes using &lt;code&gt;pyodide-build&lt;/code&gt; while respecting PyTensor's existing mamba-based workflow. The core change (making Numba optional on WebAssembly) remained, but the development environment configuration was adapted to work with what PyTensor already had in place.&lt;/p&gt;
&lt;p&gt;This is a common lesson in open-source contribution: meeting maintainers where they are matters more than introducing your preferred tooling. The weekend experiment taught me what was needed; the PR reflected what was appropriate. You can see the actual PR here: &lt;a href="https://github.com/pymc-devs/pytensor/pull/1960"&gt;pytensor #1960&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="unfortunately-for-now-nuts-is-gone"&gt;Unfortunately (for now), NUTS is gone&lt;/h2&gt;&lt;p&gt;Unfortunately, &lt;strong&gt;NUTS (No-U-Turn Sampler) doesn't work in WASM&lt;/strong&gt;. 😭&lt;/p&gt;
&lt;p&gt;NUTS is the crown jewel of PyMC. It's the adaptive Hamiltonian Monte Carlo sampler that makes Bayesian inference efficient and robust. The 2022 PyMC Labs demo used NUTS to sample from posteriors in real-time in the browser.&lt;/p&gt;
&lt;p&gt;But here's the thing: this isn't just about Numba being unavailable. The real issue is that none of the modern MCMC sampling backends have WASM support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;JAX&lt;/strong&gt; (used by NumPyro and BlackJAX) has an open GitHub issue &lt;a href="https://github.com/jax-ml/jax/issues/1472"&gt;#1472&lt;/a&gt; from 2019 titled "Jax for Web? (JS api or web assembly guide)" that's still open with no official WASM support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nutpie&lt;/strong&gt; (the Rust-based NUTS implementation) doesn't have a WASM build readily available&lt;/li&gt;
&lt;li&gt;The computational demands of Hamiltonian dynamics—computing gradients, simulating trajectories, adapting step sizes—require optimized backends that don't exist in WASM environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This weekend's exploration shows the path to install PyMC in WASM, but you can't use its best sampler. It's like getting a Ferrari delivered to your house, but the dealer forgot to include the keys. You can sit in it, admire the leather seats, and maybe even turn on the radio. But you're not going anywhere fast.&lt;/p&gt;
&lt;p&gt;This represents a fundamental infrastructure gap, not just a missing dependency. Getting NUTS in the browser will require either WASM ports of JAX or nutpie, or entirely new sampling backends designed for browser environments.&lt;/p&gt;
&lt;h2 id="what-does-work"&gt;What does work?&lt;/h2&gt;&lt;p&gt;Despite the NUTS heartbreak, this wasn't a failed experiment:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;PyTensor now installs in WASM environments.&lt;/strong&gt; This is non-trivial. PyTensor has C and Cython extensions that need to compile for WebAssembly. Getting that build pipeline working required understanding Pyodide's build system, setting up Emscripten correctly, and making Numba optional.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PyMC can technically be imported.&lt;/strong&gt; Once PyTensor was installable, PyMC followed. You can define models, create random variables, and work with the API. The foundation is there.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alternative samplers might still work.&lt;/strong&gt; While NUTS is off the table, other samplers—like Metropolis-Hastings or Slice sampling—might be viable for small models. They're slower and less robust than NUTS, but they don't require JIT compilation. I didn't test this, but I think it will hold true; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The roadmap is clearer.&lt;/strong&gt; If someone wants to bring full PyMC to the browser, the path forward is documented. It requires either (a) building WASM support into JAX (a massive undertaking that's been an open request since 2019), (b) creating WASM builds for nutpie, or (c) building entirely new sampling backends designed for browser environments (also non-trivial, but potentially more feasible).&lt;/li&gt;
&lt;/ol&gt;
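&lt;p&gt;On point 3, here is a minimal sketch of what reaching for a gradient-free sampler looks like in PyMC. I haven't run this in WASM, and the model and data are invented, but the API calls are standard PyMC:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("obs", mu=mu, sigma=1, observed=[0.2, -0.1, 0.4])
    # Metropolis needs no JIT-compiled gradients, so it may work
    # in environments where NUTS cannot.
    idata = pm.sample(step=pm.Metropolis(), draws=2000, chains=2)
&lt;/code&gt;&lt;/pre&gt;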
&lt;h2 id="a-weekend-well-spent"&gt;A weekend well spent&lt;/h2&gt;&lt;p&gt;Did I achieve my original goal of running PyMC in the browser with NUTS sampling? No. The technical limitations of WASM environments made that impossible with the current architecture.&lt;/p&gt;
&lt;p&gt;But that's the nature of weekend experiments. You explore, you hit walls, you learn. I now understand PyTensor's dependency structure at a deeper level. I've learned how Pyodide builds work and the constraints they impose. I've identified the broader infrastructure gap (MCMC sampling backends lacking WASM support) that needs solving for true browser-based Bayesian inference.&lt;/p&gt;
&lt;p&gt;The dream of running PyMC entirely in the browser isn't dead—it's just waiting for the right infrastructure. Until JAX or nutpie (or something else) supports WASM, we'll keep pushing that car downhill.&lt;/p&gt;
</content></entry><entry><title>Mastering Personal Knowledge Management with Obsidian and AI</title><link href="https://ericmjl.github.io/blog/2026/3/6/mastering-personal-knowledge-management-with-obsidian-and-ai/" rel="alternate"/><updated>2026-03-06T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0a70d5db-2590-3622-84d3-785f1bf45d29</id><content type="html">&lt;p&gt;Folks have asked me how I do personal knowledge management (PKM) at work. The question becomes more pressing when they learn how many projects and people I need to interact with on a weekly basis. At the time of writing, I manage twelve people across two teams, each handling 2-4 projects of their own. That's a lot of context to keep straight.&lt;/p&gt;
&lt;p&gt;I decided to document what I'm doing for PKM. Hopefully it serves as inspiration for you.&lt;/p&gt;
&lt;p&gt;I've written before about &lt;a href="../../../../2020/12/15/building-a-personal-knowledge-graph-on-obsidian/"&gt;why I chose Obsidian&lt;/a&gt;; this post shows how that decision evolved with AI integration over five years.&lt;/p&gt;
&lt;h2 id="the-plain-text-decision"&gt;The plain text decision&lt;/h2&gt;&lt;p&gt;In 2022, I decided to make personal knowledge management a priority at work. I faced a choice: Confluence, OneNote, or a new kid on the block, Obsidian. I chose plain text and graphs. I chose Obsidian. I chose not to lock my data inside a vendor system. I chose freedom and sovereignty for my information.&lt;/p&gt;
&lt;p&gt;That decision was prescient in ways I couldn't have predicted. Most of us normies back then wouldn't have guessed that plain text would be exactly the right format for 2025 and 2026 era knowledge management. The visionaries saw it coming; I just got lucky because I loved the graph view in Obsidian and thought of it as a really cool tool. But holy smokes, has that choice paid off.&lt;/p&gt;
&lt;p&gt;Text files are as primitive as it gets: no proprietary formats, no vendor lock-in, just files that can be read on any system. When AI coding agents arrived, my vault was already in a format they could process natively. No migration needed. No conversion layer. No API integration. The simplicity I chose became an unlock I never planned for.&lt;/p&gt;
&lt;h2 id="the-core-system"&gt;The core system&lt;/h2&gt;&lt;p&gt;My Obsidian vault is built around distinct note types. Monthly collections of daily bullet journals capture my day-to-day activities, one note per month with a running log of meetings and work. Meeting notes follow a structured template. People notes are dossiers for everyone I work with (to put it in CIA terms, I keep a file on everyone I interact with regularly). Project notes act as control towers, linking out to meetings, people, and status updates. A miscellaneous collection handles everything else. The structure was inspired in part by Thiago Forte's numbered folder system, though I've simplified it over time.&lt;/p&gt;
&lt;p&gt;The most important thing isn't my specific implementation. It's that I have a system at all, and it's documented in an &lt;code&gt;AGENTS.md&lt;/code&gt; file so my coding agents understand it too.&lt;/p&gt;
&lt;h2 id="ingesting-information"&gt;Ingesting information&lt;/h2&gt;&lt;p&gt;The lifecycle of my workflow starts with ingestion. Meeting notes arrive as transcripts or AI-generated summaries. In the past, structuring these was tedious work. Now I paste them into OpenCode and my meeting notes skill handles the rest.&lt;/p&gt;
&lt;p&gt;The skill knows the template I want. It handles various input formats: AI-generated summaries, transcripts with good speaker assignments, and transcripts with poor speaker assignments. I flag the quality when I know it's bad. The skill extracts key information and formats everything consistently. For one-on-ones, it ensures notes are attached to both the meeting log and the person's individual page, so I can track the full history of our conversations.&lt;/p&gt;
&lt;p&gt;Beyond meetings, I ingest PowerPoints, Word docs, PDFs, and Excel spreadsheets into my vault as contextual information. The key insight is getting everything into plain text format. For Word documents, a Python script uses &lt;code&gt;python-docx&lt;/code&gt; to convert them to plain text, which is then printed to the terminal or dumped to disk at &lt;code&gt;/tmp&lt;/code&gt;; both are readable by a coding agent. Even lightly misformatted plain text contains enough information density for a coding agent to read and summarize.&lt;/p&gt;
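&lt;p&gt;A minimal sketch of that conversion step (the script name is mine; run it as &lt;code&gt;uv run docx_to_text.py report.docx&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /// script
# dependencies = ["python-docx"]
# ///
# Dump a Word document's paragraphs as plain text to the terminal,
# where a coding agent can read them.
import sys
from docx import Document

doc = Document(sys.argv[1])
for para in doc.paragraphs:
    print(para.text)
&lt;/code&gt;&lt;/pre&gt;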
&lt;p&gt;For PowerPoints, I use dual parsing paths. One path extracts the XML structure directly using &lt;code&gt;python-pptx&lt;/code&gt;. The second path converts each slide to an image using &lt;code&gt;libreoffice&lt;/code&gt; and &lt;code&gt;PIL&lt;/code&gt;, captions it with a vision-language model via APIs, and strings the captions together into a coherent narrative. Combined, I estimate that I can get a 90-95% accurate textual representation. PDFs follow a similar pattern: text extraction for normal PDFs, image captioning for scanned documents.&lt;/p&gt;
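&lt;p&gt;A sketch of the first of those two PowerPoint paths, the direct structural extraction (again, the script name is mine):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /// script
# dependencies = ["python-pptx"]
# ///
# Pull the raw text out of every slide; the image-captioning path
# described above covers what this misses (diagrams, screenshots).
import sys
from pptx import Presentation

prs = Presentation(sys.argv[1])
for i, slide in enumerate(prs.slides, start=1):
    print(f"--- Slide {i} ---")
    for shape in slide.shapes:
        if shape.has_text_frame:
            print(shape.text_frame.text)
&lt;/code&gt;&lt;/pre&gt;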
&lt;p&gt;Excel spreadsheets are read directly by the coding agent using &lt;code&gt;openpyxl&lt;/code&gt;, not &lt;code&gt;pandas&lt;/code&gt;. The key difference matters: &lt;code&gt;pandas&lt;/code&gt; assumes an established table structure, but real-world spreadsheets are messy. With &lt;code&gt;openpyxl&lt;/code&gt;, the agent can read the granular cellular structure across each sheet, identifying merged cells, free text scattered in random locations, and arbitrary layouts. This structural mapping follows a progressive reveal principle: the agent first identifies the spreadsheet's architecture without necessarily reading every cell's contents, then zooms into relevant sections. This approach handles the chaos of actual spreadsheets far better than forcing everything into a tabular assumption. It's powerful when I need to understand financial data without being a finance person.&lt;/p&gt;
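&lt;p&gt;And a sketch of that first structure-mapping pass with &lt;code&gt;openpyxl&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /// script
# dependencies = ["openpyxl"]
# ///
# Progressive reveal, step one: map each sheet's architecture
# (extent, merged cells) before reading any cell contents.
import sys
from openpyxl import load_workbook

wb = load_workbook(sys.argv[1])
for ws in wb.worksheets:
    print(f"Sheet: {ws.title}")
    print(f"  used range: {ws.dimensions}")
    print(f"  merged cells: {[str(r) for r in ws.merged_cells.ranges]}")
&lt;/code&gt;&lt;/pre&gt;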
&lt;h2 id="managing-and-maintaining"&gt;Managing and maintaining&lt;/h2&gt;&lt;p&gt;With information in the vault, the next phase is keeping it current. With twelve people across two teams, there are a lot of details I don't pick up or retain in my working memory. That's why external memory matters. Without it, things would fall through the cracks.&lt;/p&gt;
&lt;p&gt;When I hit a context block (when I look up a project or person and realize something's missing), I trigger a "sweep". My instructions to the coding agent are to update my people notes and/or project notes based on source material present in the vault. People and project notes are always derivative from sources, so any updates must include quotations from those source notes. I stay in the loop for verification. Hallucinations are rare, maybe once every four or five sweeps, and usually trace back to inaccurate transcripts rather than agent errors.&lt;/p&gt;
&lt;p&gt;This is incredibly helpful for how I interact with people. My assumption is that I'm going to be forgetful. My external memory will be approximately correct, and I have a process for keeping it refined over time. So I can rely more on the vault instead of second-guessing myself based on incomplete memory. It tempers how I think about interacting with someone, not by changing my mind about them, but by giving me confidence that I'm not missing something important.&lt;/p&gt;
&lt;p&gt;There are ethical boundaries. I don't capture personal details if people aren't comfortable with that. The dossiers are professional, not invasive.&lt;/p&gt;
&lt;p&gt;Periodically, I do retrieval practice. This is how we make information stick; read "Make It Stick" to learn more. Review looks like this: I take my people notes and project notes and ask what's missing. Is there a piece of knowledge I remember that isn't captured? If yes, I fill in the blanks. I also check whether claims are substantiated with links and quotes. This fact-checking pass keeps the vault trustworthy and protects me from remembering something erroneous. A spell-checker list handles transcription errors, and my &lt;code&gt;AGENTS.md&lt;/code&gt; links to &lt;code&gt;HEARTBEAT.md&lt;/code&gt; to sanitize the vault of inaccurate information.&lt;/p&gt;
&lt;h2 id="producing-and-sharing"&gt;Producing and sharing&lt;/h2&gt;&lt;p&gt;The final phase is producing outputs for others. I curate what gets published rather than exporting everything. The agent creates a publishable version based on my guidance. I haven't settled on hard rules for curation yet. I'd rather review and decide at publish time than tag things as publishable during capture. That workflow feels right to me.&lt;/p&gt;
&lt;p&gt;For Confluence, a Python script publishes markdown directly, with YAML front matter defining the space and parent page. For GitHub users, notes can become Gists via the GitHub CLI. With the appropriate skills, Markdown files transform into HTML presentations, and with web technologies, those presentations become interactive. For Jira, a colleague created a skill that writes Jira tickets. We firmly believe that humans shouldn't be filling forms out; AI should be filling forms for us.&lt;/p&gt;
&lt;p&gt;PowerPoint decks can be generated via Python scripts. Word documents come from markdown via Pandoc. The scripts run with uv, and LibreOffice handles conversions.&lt;/p&gt;
&lt;p&gt;Each script maintains its own environment using PEP 723 inline script metadata. This means dependencies are declared at the top of each script in a special comment block:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# /// script&lt;/span&gt;
&lt;span class="c1"&gt;# dependencies = [&amp;quot;python-docx&amp;quot;, &amp;quot;python-pptx&amp;quot;, &amp;quot;pandas&amp;quot;]&lt;/span&gt;
&lt;span class="c1"&gt;# ///&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When I run &lt;code&gt;uv run script.py&lt;/code&gt;, uv automatically creates an isolated environment with just those dependencies, executes the script, and cleans up. No virtual environments to manage. No &lt;code&gt;requirements.txt&lt;/code&gt; files scattered everywhere. No "works on my machine" problems.&lt;/p&gt;
&lt;h2 id="the-role-of-agent-skills"&gt;The role of agent skills&lt;/h2&gt;&lt;p&gt;Agent skills effectively encode procedural knowledge into executable markdown. Over time, it compounds; on fewer and fewer occasions do I need to repeat instructions, which is incredibly liberating. The model infers which skill to use most of the time. When it doesn't, I correct it explicitly and ask the coding agent to update the skill file for the future as well.&lt;/p&gt;
&lt;p&gt;Designing a skill means thinking about the desired output and the tools needed to get there. I discover edge cases in the wild and update immediately. The earlier errors are caught, the better.&lt;/p&gt;
&lt;h2 id="what-s-still-friction"&gt;What's still friction&lt;/h2&gt;&lt;p&gt;One pain point remains. I want to ingest Office files by pasting a URL, but I still need to download a copy first, then feed that copy to the agent skill. Programmatic access to cloud documents would eliminate this step. From the user side, nothing else would change. I'd just paste the URL and go.&lt;/p&gt;
&lt;p&gt;But even with this friction, the system pays for itself. Knowledge management overhead dropped from thirty to forty percent of my time down to less than ten percent. I fix errors as I encounter them rather than scheduling dedicated maintenance. That recovered bandwidth goes toward better thinking and context gathering.&lt;/p&gt;
&lt;h2 id="getting-started"&gt;Getting started&lt;/h2&gt;&lt;p&gt;What stops people from building systems like this? I believe it is two things: imagination and technical skill. You need imagination to envision converting diverse file formats into plain text. You need technical skill to know that it's possible.&lt;/p&gt;
&lt;p&gt;The two feed each other. I experienced this with web technologies. Before I got familiar with building stuff on the web, I wondered what was even possible. Once I actually built things, I knew. Technical skill feeds your imagination, and imagination drives you to learn more technical skills.&lt;/p&gt;
&lt;p&gt;For those starting without technical skills, use AI to learn programming. Find a language with a supportive human community to verify what you learn. AI hallucinates, and you need other people around you to help apply judgment and skill to AI outputs. You also need critical thinking skills and the initiative to act on what agents produce.&lt;/p&gt;
&lt;h2 id="skills-you-can-use-today"&gt;Skills you can use today&lt;/h2&gt;&lt;p&gt;If you want to experiment with agent skills, here are some I've published:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/tree/main/skills/html-presentations"&gt;html-presentations&lt;/a&gt; - Turn markdown into gorgeous HTML slides&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/tree/main/skills/gh-daily-timeline"&gt;gh-daily-timeline&lt;/a&gt; - See your GitHub activity for any given day&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/tree/main/skills/gh-activity-summary"&gt;gh-activity-summary&lt;/a&gt; - Generate a plain-language summary of your GitHub work over any time period&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/tree/main/skills/publish-to-google-docs"&gt;publish-to-google-docs&lt;/a&gt; - Push markdown notes to Google Docs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-bigger-picture"&gt;The bigger picture&lt;/h2&gt;&lt;p&gt;With such a system in place, repetitive, monotonous, and manual work can be offloaded to computers and AI. With a personal knowledge system, we can carry a broader scope of responsibilities and grow into new challenges for two reasons: we can externalize our memory more easily, and we can format information in ways that fit our brains.&lt;/p&gt;
&lt;p&gt;I'm not asking people to do more at the same time. I'm asking them to expand their dynamic range over time so they're not stuck doing the same old boring thing over and over. That repetitive monotonous stuff should have been given away to AI and computers a long time ago.&lt;/p&gt;
&lt;p&gt;This is useful for your career. It keeps things interesting. Every day that you make an incremental but permanent improvement compounds over time.&lt;/p&gt;
&lt;p&gt;The vignettes I've shared are not a prescription. Rather, I hope you treat them as an invitation. Plain text plus coding agents is a powerful combination. Your system will look different from mine, and that's part of the point. Experiment and explore, and find what works for you.&lt;/p&gt;
</content></entry><entry><title>How to stay in control when doing EDA with coding agents</title><link href="https://ericmjl.github.io/blog/2026/2/13/agentic-eda/" rel="alternate"/><updated>2026-02-13T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:f7d7fdf8-57b8-3130-99d4-c72d42aea0cc</id><content type="html">&lt;p&gt;Speed without control is just chaos.&lt;/p&gt;
&lt;p&gt;I've seen teammates compress a week and a half of analysis work into half a day using coding agents. That's a 5-10x speedup. But here's the thing: that speed only matters if you stay in the driver's seat. Otherwise you're not doing data science, you're just generating artifacts.&lt;/p&gt;
&lt;p&gt;The real unlock isn't that agents write code fast. It's that they can be guided through a structure that keeps you in control of the analysis.&lt;/p&gt;
&lt;h2 id="the-problem-isn-t-speed-it-s-agency"&gt;The problem isn't speed, it's agency&lt;/h2&gt;&lt;p&gt;Coding agents are eager. Give them a CSV file and they'll open it, generate a dozen plots, and dump a wall of code before you've finished describing what you're actually looking for. That feels productive. It isn't.&lt;/p&gt;
&lt;p&gt;The problem is that you've lost the thread. You didn't formulate a clear question. You didn't think through what the x-axis and y-axis should be. You're now reacting to whatever the agent produced, rather than steering toward an answer.&lt;/p&gt;
&lt;p&gt;I've developed a different approach, codified in two skills I use with my coding agents: &lt;a href="https://github.com/ericmjl/skills/tree/main/skills/scientific-eda"&gt;scientific-eda&lt;/a&gt; for exploratory data analysis and &lt;a href="https://github.com/ericmjl/skills/tree/main/skills/ml-experimentation"&gt;ml-experimentation&lt;/a&gt; for machine learning experiments. The pattern is the same in both: slow down first, gate on artifacts (plots, tables, etc.), and structure the session so both you and the agent can follow what happened.&lt;/p&gt;
&lt;h2 id="slow-down-first-the-socratic-opening"&gt;Slow down first: the Socratic opening&lt;/h2&gt;&lt;p&gt;The first design principle is counterintuitive: slow down before you speed up.&lt;/p&gt;
&lt;p&gt;When you invoke the scientific-eda skill, the agent does not immediately load your data and start plotting. Instead, it asks you questions. What's the problem context? What are you hoping to learn or decide? What constraints matter?&lt;/p&gt;
&lt;p&gt;From the skill definition:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Do not open the data file and start coding or plotting. Ask for or confirm: the problem context—biological, chemical, or data-science question; what the user hopes to learn or decide; and any constraints.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's also an explicit guardrail: "ask 'why' before executing." When you request a specific plot or table, the agent briefly asks what question or decision it serves. This isn't bureaucracy. It's alignment. The agent is checking that you've thought through the request before it spends your time executing it.&lt;/p&gt;
&lt;p&gt;This Socratic opening feels slower. But it prevents the far more common waste: generating plots you didn't need, going down rabbit holes you can't explain, and ending up with a folder full of artifacts and no clear answer.&lt;/p&gt;
&lt;h2 id="gate-everything-on-artifacts"&gt;Gate everything on artifacts&lt;/h2&gt;&lt;p&gt;The second design principle is more specific: one artifact at a time.&lt;/p&gt;
&lt;p&gt;If you can't describe what you want, you're not ready to execute. The agent waits. You think. You describe. Then the agent generates exactly what you asked for.&lt;/p&gt;
&lt;p&gt;For a plot, this means articulating the x-axis, the y-axis, and what pattern you're looking for. What would confirm or refute your hypothesis? If you can't answer, the analysis isn't ready to run.&lt;/p&gt;
&lt;p&gt;For a table, this means specifying the columns, rows, and aggregation level. What comparison are you trying to make? What decision will this table inform? A vague request like "show me the data" isn't actionable. "Show me the mean expression level by treatment group" is.&lt;/p&gt;
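&lt;p&gt;To make that concrete, here is roughly what the agent generates once the request is that precise (a sketch; the file and column names are invented):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("expression.csv")
# One precise request, one precise artifact.
summary = df.groupby("treatment_group")["expression_level"].mean()
print(summary)
&lt;/code&gt;&lt;/pre&gt;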
&lt;p&gt;This is a forcing function for clarity. Describing an artifact precisely forces you to articulate the question you're actually asking. The tradeoff is worth it. You give up a bit of speed up front for precision in execution. And because the agent can generate code in seconds rather than minutes, the net result is still a massive speedup. My teammates went from 1.5-2 weeks to half a day. The precision tax is negligible compared to the execution dividend.&lt;/p&gt;
&lt;h2 id="the-session-structure"&gt;The session structure&lt;/h2&gt;&lt;p&gt;Here's where structure becomes a feature, not overhead.&lt;/p&gt;
&lt;p&gt;Each analysis session is a timestamped folder. The naming convention is ISO datetime plus a descriptive slug:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;analysis/
  2025-02-05T14-30-00-protein-binding/
    journal.md      # append-only; shape, actions, findings
    plots/          # WebP figures only
    scripts/        # disposable PEP723 scripts; uv run
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let's walk through what each piece does.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;journal.md&lt;/strong&gt; is the memory. Before each action, the agent reads the journal. After each action, it appends what happened. Entries get timestamped and tagged: &lt;code&gt;[SHAPE]&lt;/code&gt; for data structure discoveries, &lt;code&gt;[PLOT]&lt;/code&gt; for visualizations, &lt;code&gt;[FINDING]&lt;/code&gt; for observations, &lt;code&gt;[NEXT]&lt;/code&gt; for suggested next steps. The journal is scannable. It's also the entry point for anyone (including future you) who wants to understand what happened without reading the code.&lt;/p&gt;
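&lt;p&gt;For illustration, a few journal entries might look like this (the contents are invented):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## 2025-02-05T14:42 [SHAPE]
binding.csv: 4,512 rows x 12 cols; no missing values in treatment_group.

## 2025-02-05T14:51 [PLOT]
plots/mean-binding-by-group.webp: mean binding fraction by treatment group.

## 2025-02-05T14:53 [FINDING]
Group B mean is roughly twice control; worth checking for batch effects.

## 2025-02-05T14:54 [NEXT]
Plot binding fraction by batch within group B.
&lt;/code&gt;&lt;/pre&gt;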
&lt;p&gt;&lt;strong&gt;plots/&lt;/strong&gt; holds all figures from the session. The skill specifies WebP format for smaller file sizes, though that's a minor detail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;scripts/&lt;/strong&gt; contains disposable Python scripts. Each script has PEP723 inline metadata at the top, declaring its own dependencies. You run them with &lt;code&gt;uv run script.py&lt;/code&gt; from the session folder. No environment wrangling. No "which virtualenv am I in?" confusion. One script, one plot.&lt;/p&gt;
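&lt;p&gt;A minimal sketch of such a script (the filename and data are invented; WebP output relies on Pillow, which matplotlib already depends on):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /// script
# dependencies = ["matplotlib"]
# ///
# scripts/plot_dose_response.py: one script, one plot.
import matplotlib

matplotlib.use("Agg")  # headless: write the file, no display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 4, 8], [0.1, 0.3, 0.6, 0.9])
ax.set_xlabel("dose")
ax.set_ylabel("binding fraction")
fig.savefig("plots/dose-response.webp")
&lt;/code&gt;&lt;/pre&gt;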
&lt;p&gt;The structure serves two purposes. First, it gives the agent a clear protocol to follow: read journal, execute, append to journal, suggest next step. Second, it leaves a trace that any human can follow. You can read the plan, then the journal, then the report, and understand the entire analysis without touching the code.&lt;/p&gt;
&lt;h2 id="what-changes-in-team-conversations"&gt;What changes in team conversations&lt;/h2&gt;&lt;p&gt;Something unexpected happened when I started using this approach with teammates.&lt;/p&gt;
&lt;p&gt;The conversations changed. We stopped asking "why did you write it that way?" and started asking "why did the agent write it that way?" The ego attached to code ownership evaporated. We could critique the work without critiquing each other.&lt;/p&gt;
&lt;p&gt;This matters more than I expected. In the pre-agent world, I'd invest 50-70% of my mental energy on implementation details: wrangling data frames, handling edge cases, debugging syntax errors. That labor created attachment. When someone questioned my code, it felt like they were questioning my thinking.&lt;/p&gt;
&lt;p&gt;Now the agent writes the code. I focus on the questions and verify the work. My teammates and I have more productive scientific conversations because we're discussing the analysis, not defending the implementation. We check the agent's work together, and if something's wrong, we just ask the agent to fix it.&lt;/p&gt;
&lt;h2 id="design-for-the-human"&gt;Design for the human&lt;/h2&gt;&lt;p&gt;The pattern that emerged is simple: if you design for the human, the agent follows.&lt;/p&gt;
&lt;p&gt;The structure that makes your analysis traceable is the same structure that keeps the agent aligned. The journal that helps future-you understand what happened also helps the agent decide what to do next. The artifact-gating that forces you to think clearly also gives the agent precise instructions to execute.&lt;/p&gt;
&lt;p&gt;You stay in control by slowing down at the decision points, describing what you want before you get it, and keeping a running record of what happened. The agent becomes a force multiplier rather than a loose cannon.&lt;/p&gt;
&lt;p&gt;The skills I've linked here are just text files. They're prompts, structured in a way that an LLM-based coding agent can follow. You can copy them, modify them, or write your own. The key insight isn't in any particular skill, it's in the design pattern: gate analysis on artifacts, structure sessions with journals, and make the human's job explicit before the agent's job begins.&lt;/p&gt;
&lt;p&gt;The 5-10x speedup is real. But the real win is that you get to stay the scientist.&lt;/p&gt;
</content></entry><entry><title>How to Do Agentic Data Science</title><link href="https://ericmjl.github.io/blog/2026/2/1/how-to-do-agentic-data-science/" rel="alternate"/><updated>2026-02-01T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:60e44946-3c79-3457-bbcf-b10ead198a9e</id><content type="html">&lt;p&gt;Having tasted what agentic coding could look like for software development, I wanted to know what it would look like for data science - this meant training machine learning models and answering scientific questions. So I started experimenting, at work, and on my own at home as well. Here are ten lessons I've learned from my experiments thus far.&lt;/p&gt;
&lt;h2 id="1-be-prescriptive-in-your-prompting"&gt;1. Be prescriptive in your prompting&lt;/h2&gt;&lt;p&gt;Similar to building software, you need to know exactly what you want and how you'll evaluate the outcome. The difference, however, is as follows: With software, you will often know what you need to build, but with data science, you can only know what hypotheses need to be verified, which means you will need to iterate your way to the answer. Nonetheless, it is possible to leverage coding agents to move quickly.&lt;/p&gt;
&lt;p&gt;The parallels are striking: if you frame each question you ask in terms of an observable outcome, you can set up your coding agent to write code that produces an output that can be evaluated for correctness, just like with software tests!&lt;/p&gt;
&lt;p&gt;Here, your ability to describe precisely the hypothesis you're exploring, and the ability to describe in precise language what the answer would look like if the hypothesis held true or not, are critical components of what enables the coding agent to figure out what needs to be counterfactually true (within the codebase or the data) in order for your hypothesis to hold true.&lt;/p&gt;
&lt;p&gt;Here is an example from my work. In a machine learning experiment with synthetic data, I wanted to hit 100% sequence editing performance. (It was synthetic data after all!) The coding agent hit a scenario where it was only doing 25%. With the clear goal in mind, it proposed edits to the code, edited the code, and re-ran experiments until it hit 100%. All without cheating; I know, because I checked!&lt;/p&gt;
&lt;h2 id="2-strong-patterns-in-the-file-system"&gt;2. Strong patterns in the file system&lt;/h2&gt;&lt;p&gt;The agent, like humans, needs a predictable place for experiments. Similar to how a software repo has a conventional layout (src/, tests/, and so on), your experiments need a conventional layout so the agent knows where to put things and where to look. Within the &lt;a href="https://github.com/ericmjl/skills/tree/main/skills/ml-experimentation"&gt;experimentation skill that I wrote&lt;/a&gt;, I instruct the coding agent to do its work inside an &lt;code&gt;experiments&lt;/code&gt; folder. Underneath that, for each experiment, we have datetime-prefixed subfolders, in which, there's a README file, a &lt;code&gt;plots&lt;/code&gt; directory, a &lt;code&gt;data&lt;/code&gt; directory, a &lt;code&gt;scripts&lt;/code&gt; directory. Naming things logically helps, but the scheme matters more than the exact names. Coding agents will follow the patterns you already have.&lt;/p&gt;
&lt;h2 id="3-put-logging-instructions-in-agents.md"&gt;3. Put logging instructions in AGENTS.md&lt;/h2&gt;&lt;p&gt;With software, the feature one ask's a coding agent to build either succeeds or fails, and this can be automatically verified using programmatically-runnable unit and integration tests. With data science, experimental runs produce logs and metrics, but aren't easily boolean pass/fail like software tests. In both cases, however, and the agent can introspect logs to figure out what to change!&lt;/p&gt;
&lt;p&gt;Your AGENTS.md file should include instructions for putting enough logging in place so the LLM can introspect what's going on during the experiment. I've written elsewhere about &lt;a href="../../../../2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/"&gt;how to teach your coding agent with AGENTS.md&lt;/a&gt; and &lt;a href="../../../1/17/how-to-build-self-improving-coding-agents-part-1/"&gt;using AGENTS.md as repository memory&lt;/a&gt; for self-improving agents. Pair that with tools that run code in the terminal so the agent gets logs it can read. When the agent can read the logs, it can figure out what's wrong and what to change.&lt;/p&gt;
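&lt;p&gt;The instructions don't need to demand anything fancy. A sketch of the kind of logging I mean (the metric names and values are invented):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("experiment")

for step in range(1, 4):
    loss = 1.0 / step  # stand-in for a real training step
    # Plain key=value lines are easy for an agent to grep and reason over.
    logger.info("step=%d loss=%.4f mask_accuracy=%.2f", step, loss, 0.25 * step)
&lt;/code&gt;&lt;/pre&gt;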
&lt;p&gt;In my work, logging and printing to the terminal were what let my agent fix a masking strategy that was only yielding 25% correctness. It read the logs, proposed a fix, re-ran, and got to where we needed to be. No intervention on my part. A 3-day experiment became 20 minutes.&lt;/p&gt;
&lt;h2 id="4-give-it-report-writing-skills"&gt;4. Give it report-writing skills&lt;/h2&gt;&lt;p&gt;The agent can write code and read mountains of logs, but you need something else: a human-readable summary of what it observed and what looked weird, so you can triage without re-reading every log. Give your coding agent instructions (e.g. in an &lt;a href="https://agentskills.io/home"&gt;agent skill&lt;/a&gt;) to write out in plain language what it observed during the model evaluation phase. It should read execution logs throughout the run. Tell it to write down anything that looks weird for follow-up. If something is off, it should say so. You get a readable summary and a list of things to dig into.&lt;/p&gt;
&lt;p&gt;For reports (e.g. &lt;code&gt;reports.md&lt;/code&gt;), encode in the skill that every table and every plot must be scrutinized. Ensure that plots are generated for every table, and that someone (you or the agent, with you verifying) carefully checks for inconsistencies between AI-generated plots and the tables they are supposed to reflect. The agent can miss things. It is valid to ask the AI to check its own work, but only if you have an idea of exactly where it is wrong and you tell it as such. Vague "double-check this" rarely helps; "the values in figure 2 do not match the second column of table 1" gives the agent something it can fix.&lt;/p&gt;
&lt;h2 id="5-have-the-agent-keep-an-append-only-journal-of-observations"&gt;5. Have the agent keep an append-only journal of observations&lt;/h2&gt;&lt;p&gt;Within the skill, instruct the agent to keep a single file (e.g. &lt;code&gt;notes.md&lt;/code&gt; or &lt;code&gt;journal.md&lt;/code&gt;) that it is told to only append to, never overwrite. The journal is not just for the agent. You should add to it too: things you noticed while looking at the data, gut feelings, weird patterns. It becomes a running log of what was going on, from both sides, that you can go back and summarize later. The point is to capture the thought process while you are doing the work.&lt;/p&gt;
&lt;h2 id="6-have-the-agent-generate-diagnostic-plots-for-you"&gt;6. Have the agent generate diagnostic plots for you&lt;/h2&gt;&lt;p&gt;Logs and plots are complementary: logs are agent-accessible, but plots are human-accessible versions of the same underlying performance data. Have the agent generate diagnostic plots for you. The agent can propose fixes from the logs, but it can't build your intuition; you're the one who has to smell when something is off. Nothing beats looking at the data yourself, otherwise you never build intuition for what's happening! I still looked at the logs and plots myself to make sure the metrics were real and the agent wasn't hallucinating. Your prior experience is what lets you smell when something is off.&lt;/p&gt;
&lt;h2 id="7-instruct-the-agent-to-write-the-minimalist-version-first"&gt;7. Instruct the agent to write the minimalist version first&lt;/h2&gt;&lt;p&gt;With software, you run tests in seconds. With ML, you're tempted to train for hours and rush to the real data and the real training run. As a human, you don't want to "waste" time proving out the pipeline when you could just run the full thing, but that mentality is exactly what makes you unable to debug machine learning code on a tight loop. That temptation is exactly why you should instruct the agent to do the opposite: write the minimalist version first, then use it to work out elementary errors before scaling up.&lt;/p&gt;
&lt;p&gt;That means train for one iteration, not even one epoch. Use miniature versions of the final model (e.g. a tiny custom deep net with the same architecture but a fraction of the parameters). Check for shape errors, data-loading bugs, and that the forward pass runs end to end. All the sanity checks you would do manually to prove that things work, but that you are tempted to skip. Encode in AGENTS.md that the agent must implement and run this minimal version before moving to full-scale training. The agent does not have your impatience; use that to your advantage. Once the minimal run passes, you can scale up with confidence.&lt;/p&gt;
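&lt;p&gt;A sketch of what that minimal version can look like, here with PyTorch and invented shapes: one forward and backward pass, nothing more:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

# A miniature stand-in for the real model: the same kind of architecture,
# a tiny fraction of the parameters.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(4, 16)         # a four-sample mini-batch
y = torch.randint(0, 2, (4,))  # fake labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # one iteration catches shape and data-loading bugs
print("smoke test passed, loss =", float(loss))
&lt;/code&gt;&lt;/pre&gt;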
&lt;h2 id="8-ask-the-coding-agent-to-guide-you-through-step-by-step"&gt;8. Ask the coding agent to guide you through step by step&lt;/h2&gt;&lt;p&gt;I do this often with large software refactors within &lt;code&gt;canvas-chat&lt;/code&gt;, in which I ask the coding agent to prioritize for me a list of manual checks I need to look at. This is particularly helpful when I'm (a) context switching back into the project, or (b) running on fumes but my gut tells me we're so close to the end. (Though really, you shouldn't be doing any work if you're close to dozing off at 10:30 pm...)&lt;/p&gt;
&lt;p&gt;The same applies to data science and data exploration! After having the coding agent autonomously execute on your experiment, you can have it walk through what it's done step by step, giving you the space to operate at your pace, at the speed of your thought! Of course, if you're in a better state than merely "running on fumes", you can (and should) treat the coding agent as a research partner and ask questions back to critically evaluate whether the output is correct or not. What I have found is that there will still be unexplored paths that need to be trodden, and you can send a coding agent off in that direction on the side.&lt;/p&gt;
&lt;h2 id="9-learn-the-vocabulary-the-coding-agent-uses"&gt;9. Learn the vocabulary the coding agent uses&lt;/h2&gt;&lt;p&gt;Pay attention to the terms the agent uses when it describes what it did. You can reuse that vocabulary in future prompts and get more precise results. For example, in developing canvas-chat, I used this non-optimal verbiage in my prompt:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;ok, I see, the default node size made it such that the next/prev buttons were hidden away. Can we make the pagination controls visible regardless of the size of the node?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor's agent replied with something like "making the pagination toolbar sticky". That gave me a more compact way to express exactly what I need the next time. If you don't know the vocab at first, this is a great way to expand your technical vocabulary too.&lt;/p&gt;
&lt;h2 id="10-for-exploration-treat-the-agent-as-an-executor-that-follows-your-curiosity"&gt;10. For exploration, treat the agent as an executor that follows your curiosity&lt;/h2&gt;&lt;p&gt;What you don't want in agentic data exploration is for the coding agent to hand you a boatload of output and leave you no room to follow your own curiosity. Flip the table: treat the agent as an executor of your ideas. You lead; it follows. Instruct it that it is not allowed to race ahead. It should only execute on the one thing you want, and it should ask you questions to clarify and narrow down what you actually want before it goes and does it. In other words, it is there to be a jazz partner for your data exploration.&lt;/p&gt;
&lt;p&gt;You can run that partnership a few ways. One is to have the agent write scripts that produce plots on disk; you run them, look at the output, then ask for the next thing. Another is to go one level higher and work inside &lt;a href="https://marimo.io"&gt;Marimo&lt;/a&gt; notebooks, using Marimo's reactive execution so you go one cell at a time, one question at a time. I've written about &lt;a href="../../../../2025/10/28/use-coding-agents-to-write-marimo-notebooks/"&gt;using coding agents to write Marimo notebooks&lt;/a&gt; if you want to try that path.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The agents handle the implementation. You handle the inquiry. The ten practices above - prescriptive goals, clear structure, logging, reports, an append-only journal, diagnostic plots and your own eyes on the data, the minimalist version first, having the agent guide you step by step when you need it, learning the agent's vocabulary, and in exploration keeping the agent as your jazz partner - are what make that partnership work. I've spent nearly a decade training ML models by hand, so I know what I want, and I have developed a sense of taste for what success looks like. You can get to the same level of taste with AI assistance, but you must work for it. I'll write separately about how I'm learning new things with AI. The point is not to hand off the science, but to do more of it!&lt;/p&gt;
</content></entry><entry><title>Model feel, fast tests, and AI coding that stays in flow</title><link href="https://ericmjl.github.io/blog/2026/1/25/model-feel-fast-tests-and-ai-coding-that-stays-in-flow/" rel="alternate"/><updated>2026-01-25T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:e17e872c-bf5f-3a8b-8ad9-789f8afa713d</id><content type="html">&lt;p&gt;Most of the conversation about AI coding models focuses on performance metrics. Benchmarks, evals, pass rates, latency. Useful stuff, but it misses the part that actually shapes my day-to-day: what it &lt;em&gt;feels&lt;/em&gt; like to work with the model.&lt;/p&gt;
&lt;p&gt;Once you start using LLMs as coding agents, the qualitative experience becomes a throughput issue. It affects how often you intervene, how much you trust what is happening, and whether you stay in flow or spend your time cleaning up weird breakage.&lt;/p&gt;
&lt;p&gt;Two axes keep showing up for me.&lt;/p&gt;
&lt;p&gt;First is time horizon and supervision style: long-horizon autonomy versus short-horizon iteration.&lt;/p&gt;
&lt;p&gt;Second is personality and verbosity: how the model behaves when it is wrong, how much it narrates, and whether it stays constructive or spirals into apology loops.&lt;/p&gt;
&lt;p&gt;There is also a third ingredient that ends up mattering as much as the model: the agentic harness. By that I mean the tools and checks that the agent can run to verify it did not break behavior, &lt;em&gt;and&lt;/em&gt; whether the harness gives you streaming and visual feedback—a live trace of what the model is doing—or leaves you staring at a spinner until the answer drops. A good harness beats model swapping more often than I expected.&lt;/p&gt;
&lt;h2 id="long-horizon-autonomy-vs-short-horizon-iteration"&gt;Long-horizon autonomy vs short-horizon iteration&lt;/h2&gt;&lt;p&gt;I call it "Opus-feel" when a model has that "ask and it shall be given" vibe with a longer time horizon. You describe what you want, it runs for a while, and it comes back with a plausible scaffold. It is great for momentum.&lt;/p&gt;
&lt;p&gt;I call it "Sonnet-feel" when a model leans toward shorter-horizon iteration. It works better when you are walking through a real codebase step by step, keeping changes small enough that you can validate what happened, correct course, and keep going.&lt;/p&gt;
&lt;p&gt;Another way to put it is that long-horizon autonomy pushes you toward a spec-and-review loop, while short-horizon iteration pushes you toward a steer-and-verify loop. Both can be productive. They just fail differently.&lt;/p&gt;
&lt;p&gt;In a sufficiently large codebase, you cannot rely solely on long-horizon autonomy where you ask for something with a vague description and hope it lands cleanly. You are not always guaranteed something well organized, especially when the job is refactoring rather than greenfield scaffolding.&lt;/p&gt;
&lt;p&gt;A concrete example for me came from Canvas chat. At the time, everything was tied to &lt;code&gt;app.js&lt;/code&gt; and &lt;code&gt;app.py&lt;/code&gt;. When I wanted to refactor things into plugins, I needed to dogfood a plugin pattern in the codebase itself.&lt;/p&gt;
&lt;p&gt;Long-horizon autonomy struggled here. It could generate a plugin pattern, but it was not great at the careful, incremental work of extracting behavior out of a monolith and into a clean plugin boundary.&lt;/p&gt;
&lt;p&gt;Walking bit by bit with Sonnet or Sonnet-quality models was a very different experience. The big win was that I could study the LLM traces live (the tool calls and file edits it proposes step by step) and see where edits were being made. If I noticed a feature handler getting added to &lt;code&gt;app.js&lt;/code&gt; when it clearly belonged in a plugin file, I could intervene immediately and ask, "Why is that thing over there in &lt;code&gt;app.js&lt;/code&gt;? Why is it not inside the plugin file instead?" That kind of interactive, traceable work is where the short-horizon models shine.&lt;/p&gt;
&lt;p&gt;Examples from my own testing, with all the usual caveats: Opus-4.5 (Anthropic), GPT-5.2 (OpenAI), and GLM-4.7 (z.ai) have been solid for the long-horizon, get-it-moving-fast mode. Minimax M 2.1 (OpenCode Zen) feels closer to the short-horizon mode for me. Composer-1 (Cursor) also feels closer to that style. I suspect GPT-4o and GPT-5.1 (both OpenAI) might land there too, but I have not really test-driven them.&lt;/p&gt;
&lt;p&gt;The practical takeaway is that I now switch modes on purpose. When I need speed and initial momentum, I reach for long-horizon autonomy. When I need control, I choose a short-horizon model so I can babysit the work, watch the traces, and intercept it when it tries to do something clever in the wrong place.&lt;/p&gt;
&lt;h3 id="the-harness-lesson-cypress-beat-model-hopping"&gt;The harness lesson (Cypress beat model hopping)&lt;/h3&gt;&lt;p&gt;One more lesson from that period: I did a bunch of model hopping, trying to find something that would fix a particular class of behavioral breakage.&lt;/p&gt;
&lt;p&gt;The most frustrating failures were not subtle logic bugs. They were basic syntax errors introduced during tool-call patching: unclosed brackets, unclosed parentheses, that kind of thing. When that happens, you do not get a slightly-wrong feature, you get a page that fails to load. Debugging it manually is fine the first time, and infuriating on the seventh.&lt;/p&gt;
&lt;p&gt;The thing that actually moved the needle was listening to my colleague Anand Murthy and instantiating Cypress tests. A simple automated page reload catches those failures immediately. It shifts the pain earlier, gives the agent a verification loop it can run on demand, and turns "agentic coding" into something I can trust.&lt;/p&gt;
&lt;p&gt;Here is the dumbest possible example, taken straight from the Cypress suite in canvas-chat. It is not fancy, and that is the point. It catches "the page does not load" failures quickly.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Help Modal and Auto-Layout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clearLocalStorage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clearIndexedDB&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;opens and closes help modal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#help-btn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#help-modal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;should&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;be.visible&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#help-close&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#help-modal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;should&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;not.be.visible&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Beyond model choice, a great agentic harness matters. If your harness includes tests that a coding agent can run to verify no behavioral breakage, you get to move faster with more confidence, regardless of which model you are using.&lt;/p&gt;
&lt;h2 id="verbosity-attitude-and-the-cost-of-being-wrong"&gt;Verbosity, attitude, and the cost of being wrong&lt;/h2&gt;&lt;p&gt;The other axis that became obvious once I started pressure testing models is verbosity and its associated feel.&lt;/p&gt;
&lt;p&gt;I tried Gemini 2.5, and it was a disaster for me. After experiencing the long-horizon and short-horizon styles, I did not want to use it. It made elementary mistakes, like leaving dangling curly braces where they were not supposed to be. Then it would apologize profusely over and over, like a Canadian on steroids. (I'm a born and bred Canadian; I'm allowed to say that!)&lt;/p&gt;
&lt;p&gt;In contrast, Claude and Opus are consistently upbeat and positive, and the same can be said for Minimax-M.2 and GLM-4.7. That matters more than I expected. When something breaks and you are iterating quickly, a model that stays constructive keeps the whole loop feeling fun.&lt;/p&gt;
&lt;p&gt;On the other end, GPT-5.2 would just go ahead and do things without being overly effusive, then loop back to tell me what it did. That sounds fine on paper, but it left me feeling a bit clueless. I would wonder what it was doing and whether I could intercept it if it went off on the wrong tangent. I often could not, because I needed to wait until the end to learn what it decided to do.&lt;/p&gt;
&lt;p&gt;So yes, I care about correctness. But I also care about how a model behaves while it is getting to correctness. The journey matters because the journey is where you spend your time.&lt;/p&gt;
&lt;h2 id="enthusiasm-is-a-feature"&gt;Enthusiasm is a feature&lt;/h2&gt;&lt;p&gt;This ties nicely to &lt;a href="https://x.com/Grady_Booch/status/2013343499563999589"&gt;a tweet&lt;/a&gt; I saw from Grady Booch (@Grady_Booch):&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"The greatest value such tools have offered me is to reduce my cognitive load and automate various tedious tasks."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here is the punchline:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"To serve as an enthusiastic and indefatigable, albeit very naive and often unreliable, pair programmer."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That enthusiasm and indefatigability, compared to a grumpy human, keeps the loop moving.&lt;/p&gt;
&lt;p&gt;My frustration pair coding with Gemini was not just the mistakes, it was the &lt;em&gt;emotional texture&lt;/em&gt; of the interaction. It would make mistakes and then apologize, repeatedly. After a while, you start optimizing your own behavior around the assistant's vibe, and that is not where you want your attention to go.&lt;/p&gt;
&lt;p&gt;A better pair programmer, human or AI, is relentlessly game for the next challenge. It affirms what you are trying to do, it corrects you when you are wrong, and it does not act like it is giving up. When the assistant stays constructive, the work stays fun.&lt;/p&gt;
&lt;h2 id="streaming-and-the-illusion-of-speed-the-harness-model"&gt;Streaming and the illusion of speed (the harness + model)&lt;/h2&gt;&lt;p&gt;Raw latency is one thing. What you &lt;em&gt;see&lt;/em&gt; while the model is working is another, and that is determined by the harness. In Cursor, you get a fast stream of tool calls and edits. You see the so-called thinking process. Something is clearly happening. With GLM-4.7 or Open Code in certain setups, you wait a long time with nothing streamed in—just a spinner or a blank state until the full response lands. Same model capability, same task, different harness, totally different experience. The harness that gives you a live trace makes the wait feel shorter and keeps you in the loop. The one that hides progress makes every request feel like a gamble. If you care about flow, streaming and visual feedback are not polish; they are table stakes, and they live in the harness.&lt;/p&gt;
&lt;h2 id="the-feel-is-also-vendor-lock-in"&gt;The "feel" is also vendor lock-in&lt;/h2&gt;&lt;p&gt;After enough hours with a single model, you start building muscle memory for its quirks. You learn how to phrase prompts so it does the right thing. You learn which mistakes to expect. You even learn its tone. That comfort is sticky.&lt;/p&gt;
&lt;p&gt;The sticky part is the problem. Getting used to a model's ergonomics is a form of vendor lock-in, and it is something I am determined to avoid.&lt;/p&gt;
&lt;p&gt;That is one reason I have been bouncing between models (apart from hitting usage limits) to feel out the ragged frontier of model behavior. It is pretty revealing. You quickly learn that "best model" is not a single number. The model you want depends on whether you are scaffolding, refactoring, debugging, or doing the last-mile polish.&lt;/p&gt;
&lt;p&gt;If you want to keep your agency while using these tools, stay fluent across multiple feels. Otherwise you end up optimizing your workflow around one model's quirks and calling it productivity.&lt;/p&gt;
&lt;h2 id="a-more-pragmatic-way-to-think-about-model-choice"&gt;A more pragmatic way to think about model choice&lt;/h2&gt;&lt;p&gt;What I do now is less romantic than "find the best model". I think in terms of work phases and feedback loops.&lt;/p&gt;
&lt;p&gt;If I am scaffolding, I will happily take Opus-feel: longer-horizon autonomy and a big blob of output, because the cost of being wrong is usually low.&lt;/p&gt;
&lt;p&gt;If I am refactoring or debugging, I want Sonnet-feel: short-horizon iteration and tight supervision, because the cost of being wrong is a broken app and a bunch of time lost to verification.&lt;/p&gt;
&lt;p&gt;And if I keep hitting the same dumb failures, I try to fix my harness before I try to fix my model. Add the smallest test that fails fast, make it runnable by the agent, and suddenly the whole system behaves better. Cypress reloading the page and clicking one button did more for my sanity than another week of model hopping.&lt;/p&gt;
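&lt;p&gt;For concreteness, here is a minimal sketch of what I mean by the smallest test that fails fast. This one is not from the canvas-chat suite; it assumes only that the app serves a page at the root URL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// The bluntest possible check: if a bad patch leaves an unclosed bracket,
// the bundle fails to parse, the app never boots, and this fails immediately.
describe('Smoke', () =&amp;gt; {
    it('loads the page without a script error', () =&amp;gt; {
        // Cypress fails a test on uncaught exceptions by default,
        // so a SyntaxError in the bundle surfaces here on its own.
        cy.visit('/');
        cy.get('body').should('be.visible');
    });
});
&lt;/code&gt;&lt;/pre&gt;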
&lt;p&gt;At a systems level, I want a workflow where models are swappable components. In practice that means traces you can read, tests you can run, and a loop that tells you quickly when the agent broke something.&lt;/p&gt;
</content></entry><entry><title>How to build self-improving coding agents - Part 3</title><link href="https://ericmjl.github.io/blog/2026/1/19/how-to-build-self-improving-coding-agents-part-3/" rel="alternate"/><updated>2026-01-19T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:f1f0bd53-39f6-3f87-96d4-03384d798a21</id><content type="html">&lt;p&gt;In &lt;a href="../../17/how-to-build-self-improving-coding-agents-part-1/"&gt;part 1&lt;/a&gt;, I covered &lt;code&gt;AGENTS.md&lt;/code&gt; as repo memory.&lt;/p&gt;
&lt;p&gt;In &lt;a href="../../18/how-to-build-self-improving-coding-agents-part-2/"&gt;part 2&lt;/a&gt;, I covered skills as reusable playbooks.&lt;/p&gt;
&lt;p&gt;This post is about turning those two ideas into something you can run as a practice.&lt;/p&gt;
&lt;h2 id="the-maturity-model"&gt;The maturity model&lt;/h2&gt;&lt;p&gt;Once you have both repo memory and skills, you can think about how the practice evolves over time.&lt;/p&gt;
&lt;h3 id="stage-0-ad-hoc-prompting"&gt;Stage 0: Ad hoc prompting&lt;/h3&gt;&lt;p&gt;You keep re-explaining the same things in chat. It works, but it does not compound.&lt;/p&gt;
&lt;h3 id="stage-1-repo-local-memory"&gt;Stage 1: Repo-local memory&lt;/h3&gt;&lt;p&gt;You add repository-specific guardrails and a code map.&lt;/p&gt;
&lt;p&gt;This is where &lt;code&gt;AGENTS.md&lt;/code&gt; shines.&lt;/p&gt;
&lt;h3 id="stage-2-global-personal-skills"&gt;Stage 2: Global personal skills&lt;/h3&gt;&lt;p&gt;Once a workflow repeats across repos, you promote it into a global skill on your machine.&lt;/p&gt;
&lt;p&gt;If you want a concrete bootstrap set, here is what I would install globally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/blob/main/skills/skill-creator/"&gt;&lt;code&gt;skill-creator&lt;/code&gt;&lt;/a&gt;: lowers the activation energy for making new skills.&lt;/li&gt;
&lt;li&gt;an installer and updater for skills, for example &lt;a href="https://github.com/numman-ali/openskills"&gt;&lt;code&gt;openskills&lt;/code&gt;&lt;/a&gt;: makes distribution and updates less annoying.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ericmjl/skills/tree/main/skills/agents-md-improver"&gt;&lt;code&gt;agents-md-improver&lt;/code&gt;&lt;/a&gt;: keeps the repo map current without you thinking about it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="stage-3-shared-skills"&gt;Stage 3: Shared skills&lt;/h3&gt;&lt;p&gt;If a workflow repeats across a team, it belongs in a shared location with a clear install path.&lt;/p&gt;
&lt;p&gt;I do not think you should start here. Start repo-local, then promote only when you feel the pain twice.&lt;/p&gt;
&lt;p&gt;Promotion decisions come from paying attention to what the agent actually does in practice.&lt;/p&gt;
&lt;h2 id="watch-traces-then-distill-constraints"&gt;Watch traces, then distill constraints&lt;/h2&gt;&lt;p&gt;If you work with agents long enough, you start to notice the model’s default moves.&lt;/p&gt;
&lt;p&gt;When I see an agent repeatedly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;taking an overcomplicated path&lt;/li&gt;
&lt;li&gt;missing a file I know is relevant&lt;/li&gt;
&lt;li&gt;applying a global refactor when a surgical fix is needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I treat that as a signal.&lt;/p&gt;
&lt;p&gt;Then I decide what kind of fix it is.&lt;/p&gt;
&lt;p&gt;If it is a repo invariant, a navigation hint, or a local norm, it belongs in &lt;code&gt;AGENTS.md&lt;/code&gt;. That is the always-on context for how work should happen in this repo.&lt;/p&gt;
&lt;p&gt;If it is a repeatable procedure with a clear output contract, it belongs in a skill.&lt;/p&gt;
&lt;p&gt;Sometimes the procedure is repo-specific. In that case I keep it as a repo-local skill. If I feel the pain twice in another repo, I promote it into a global skill.&lt;/p&gt;
&lt;p&gt;This is how you get operational learning without pretending the model is learning.&lt;/p&gt;
&lt;p&gt;Underneath, a lot of this comes down to writing instructions in a way that can be executed.&lt;/p&gt;
&lt;h2 id="markdown-is-becoming-executable"&gt;Markdown is becoming executable&lt;/h2&gt;&lt;p&gt;One reason this whole approach works is that the agent can execute what you write.&lt;/p&gt;
&lt;p&gt;When an LLM can execute tool calls, Markdown becomes an executable language.&lt;/p&gt;
&lt;p&gt;Skills fit this pattern. A &lt;code&gt;SKILL.md&lt;/code&gt; is just a structured instruction sheet, but it is also runnable in the sense that the agent can turn it into searches, file reads, edits, and command execution.&lt;/p&gt;
&lt;p&gt;The other trick is that skills are loaded on demand. The agent reads a short description first, then loads the full instructions only when it needs them.&lt;/p&gt;
&lt;p&gt;You can write a precise plan in plain language, and the agent can turn it into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;searches&lt;/li&gt;
&lt;li&gt;file reads&lt;/li&gt;
&lt;li&gt;surgical edits&lt;/li&gt;
&lt;li&gt;test runs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not magic. It still depends on linguistic precision. But the ergonomics shift. You can describe a workflow at the level you actually think about it, then let the agent do the clerical work.&lt;/p&gt;
&lt;p&gt;This is also why I like the runbook analogy, even with the caveats.&lt;/p&gt;
&lt;h2 id="when-to-update-agents-md-vs-create-a-skill"&gt;When to update &lt;code&gt;AGENTS.md&lt;/code&gt; vs create a skill&lt;/h2&gt;&lt;p&gt;Skills tell an agent how to do something.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; tells an agent how this repo works, and what rules it must follow while doing anything at all.&lt;/p&gt;
&lt;p&gt;Here is how I decide.&lt;/p&gt;
&lt;p&gt;Update &lt;code&gt;AGENTS.md&lt;/code&gt; when the instruction is specific to the repo:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;navigation help: where things live, what files matter, what to ignore&lt;/li&gt;
&lt;li&gt;local norms: build commands, test commands, environment rules, style constraints&lt;/li&gt;
&lt;li&gt;guardrails: what not to do in this repo&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create a skill when the workflow is reusable, or when you want a named, on-demand playbook:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a multi-step procedure you want to invoke repeatedly&lt;/li&gt;
&lt;li&gt;a workflow that spans repos or products&lt;/li&gt;
&lt;li&gt;a task with a strict output contract (release announcements, status updates, summaries)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If I am unsure, I start repo-local. If I feel the pain twice in another repo, I promote it into a global skill.&lt;/p&gt;
&lt;h2 id="the-meta-skill-is-metacognition"&gt;The meta skill is metacognition&lt;/h2&gt;&lt;p&gt;The most valuable “skill”, however, is not a file format. It is the habit of watching yourself work.&lt;/p&gt;
&lt;p&gt;I try to ask: what am I doing repeatedly that should be systematized?&lt;/p&gt;
&lt;p&gt;If the answer is “I keep re-explaining how this repo is organized”, that goes into &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the answer is “I keep asking for the same kind of summary, debug sequence, or release note format”, that becomes a skill.&lt;/p&gt;
&lt;p&gt;Once you start doing this, you build a compounding loop. The agent handles more of the repeated work, and you spend more time on judgment and design.&lt;/p&gt;
&lt;p&gt;If this all sounds like more than coding, that is because it is.&lt;/p&gt;
&lt;h2 id="where-this-seems-to-be-going"&gt;Where this seems to be going&lt;/h2&gt;&lt;p&gt;I buy Simon Willison’s framing that these tools are general agents disguised as developer tools (&lt;a href="https://simonwillison.net/2026/Jan/12/claude-cowork/"&gt;Claude Cowork post&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Even if you start with coding, the moment an agent can run terminal commands and manipulate files, the surface area expands to “almost anything”, as long as you know how to steer it.&lt;/p&gt;
&lt;p&gt;That matches how I use coding agents.&lt;/p&gt;
&lt;p&gt;Yes, I use them for coding work. But I also use them for other intellectual work: ghostwriting blog posts (which I scrutinize heavily, because the review process is essential for me to own the content), writing release announcements, and turning messy notes into structured drafts.&lt;/p&gt;
&lt;p&gt;I have also heard Theo Brown make a similar point when talking about Claude Cowork (&lt;a href="https://www.youtube.com/watch?v=IcQEaopx90g"&gt;video&lt;/a&gt;). The details vary, but the pattern is the same: once you have a general agent, the label “coding tool” becomes more about marketing and UI than capability.&lt;/p&gt;
&lt;p&gt;So I am increasingly convinced that the long-term shape here is web-deployed agents with less scary branding.&lt;/p&gt;
&lt;p&gt;You will still want composable components for LLM workflows. But for day-to-day work, the most useful thing is an agent that can execute commands and apply changes, while carrying a growing set of skills and repository memory.&lt;/p&gt;
&lt;p&gt;That combination is what makes the agent feel less like a chat box and more like a teammate.&lt;/p&gt;
</content></entry><entry><title>How to build self-improving coding agents - Part 2</title><link href="https://ericmjl.github.io/blog/2026/1/18/how-to-build-self-improving-coding-agents-part-2/" rel="alternate"/><updated>2026-01-18T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:8caf7de9-77ba-3b15-be93-8de346553886</id><content type="html">&lt;p&gt;In &lt;a href="../../17/how-to-build-self-improving-coding-agents-part-1/"&gt;part 1&lt;/a&gt;, I focused on repo memory with &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In this post, I am switching to the other lever: skills.&lt;/p&gt;
&lt;h2 id="skills-are-prompt-compression"&gt;Skills are prompt compression&lt;/h2&gt;&lt;p&gt;Skills are the other half of the system.&lt;/p&gt;
&lt;p&gt;When a task repeats, I do not want to keep re-explaining the workflow. I want a playbook I can invoke.&lt;/p&gt;
&lt;h3 id="what-a-skill-is"&gt;What a skill is&lt;/h3&gt;&lt;p&gt;A skill is a folder with a &lt;code&gt;SKILL.md&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; is the prompt. The bundled scripts and assets are the tool layer.&lt;/p&gt;
&lt;p&gt;A good skill makes three things explicit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;when to use it&lt;/li&gt;
&lt;li&gt;what steps to take&lt;/li&gt;
&lt;li&gt;what good output looks like&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want the spec, see &lt;a href="https://agentskills.io/home"&gt;Agent Skills&lt;/a&gt;.&lt;/p&gt;
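&lt;p&gt;For orientation, here is a minimal sketch of what that looks like on disk. The &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; frontmatter fields follow my reading of the spec linked above; the body headings and the &lt;code&gt;assets/&lt;/code&gt; path are just my own convention:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
name: release-announcement
description: Draft a Teams-ready release announcement from the changelog.
---

## When to use this skill
When the user asks for a release announcement for a new version.

## Steps
1. Read the changelog entry for the release.
2. Summarize the user-facing changes; ignore internal refactors.
3. Format for Microsoft Teams: emojis welcome, minimal other formatting.

## What good output looks like
See assets/example-announcement.md for a known-good example.
&lt;/code&gt;&lt;/pre&gt;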
&lt;p&gt;Skills are best formed around jobs to be done: concrete, repeatable workflows rather than abstract capabilities. Think "debug a GitHub Actions failure" or "draft a release announcement," not "know about CI" or "write good prose." When the job is clear, the skill has a natural boundary and a clear trigger. When it is vague, the skill is hard to invoke and hard to improve.&lt;/p&gt;
&lt;p&gt;A wrong framing is "skills for tools." Skills get invoked in the loop of trying to accomplish a job, not in the context of trying to use a tool. The tool is a means; the job is why you reach for it. If you design a skill around a tool, you end up with something the agent has to remember to use. If you design it around a job, the agent reaches for it when the job shows up.&lt;/p&gt;
&lt;h3 id="examples"&gt;Examples&lt;/h3&gt;&lt;p&gt;A GitHub debugging skill is the obvious starting point. CI failures are repetitive and usually want the same sequence: identify failing jobs, pull logs, inspect diffs, reproduce locally, then patch.&lt;/p&gt;
&lt;p&gt;A second example is a release announcement skill.&lt;/p&gt;
&lt;p&gt;The motivation here was not abstract. I was spending a good half hour each release just trying to compose the announcement, and I did not want to do that anymore.&lt;/p&gt;
&lt;p&gt;The output contract was also specific. I wanted release announcements that are copy-pasteable into Microsoft Teams, with emojis, but otherwise minimal formatting because Teams formatting is inconsistent.&lt;/p&gt;
&lt;p&gt;A third example is more technical.&lt;/p&gt;
&lt;p&gt;At work I had a session with a coding agent to train an ML model inside a script. After that session, I had it write a report on what it learned and what changed. Then I turned that report writing into a skill.&lt;/p&gt;
&lt;p&gt;The report format was familiar to everyone on the team: Abstract, Introduction, Methods, Results, Discussion.&lt;/p&gt;
&lt;p&gt;The content came from real artifacts: stdout logs, metrics, code, config files, git diffs, and the agent’s own session history.&lt;/p&gt;
&lt;p&gt;A fourth example is about tacit domain expertise.&lt;/p&gt;
&lt;p&gt;A teammate of mine created a skill that encoded her implicit knowledge from years of debugging chromatography traces. The point was not that the agent suddenly became a scientist. The point was that her debugging procedure became explicit and reusable.&lt;/p&gt;
&lt;h3 id="skill-creation-and-iteration"&gt;Skill creation and iteration&lt;/h3&gt;&lt;p&gt;I now like skills because they are easy to iterate on. I used to be more skeptical, and I still think MCP servers have a cleaner distribution story, but my opinion has shifted as I have used skills more in real workflows (&lt;a href="https://ericmjl.github.io/blog/2025/10/20/exploring-skills-vs-mcp-servers/"&gt;Exploring Skills vs MCP Servers&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;For the release announcements, I fed my coding agent a few examples of what “good” looked like. I was using Anthropic’s &lt;a href="https://github.com/ericmjl/skills/blob/main/skills/skill-creator/"&gt;&lt;code&gt;skill-creator&lt;/code&gt;&lt;/a&gt; skill at the time, and those examples became part of the skill itself, stored as assets that the agent could reuse.&lt;/p&gt;
&lt;p&gt;This lowers the activation energy enormously. It is much easier to iterate on a Markdown-based skill than it is to start from scratch with “write me a Python script that does X”. You can still add scripts inside a skill when you need determinism, but the interface is the Markdown.&lt;/p&gt;
&lt;p&gt;The other half is the feedback loop. When I edit the generated release announcement, I feed the revised version back to the agent and tell it to update the skill with the new example. That way the skill evolves as my taste evolves.&lt;/p&gt;
&lt;p&gt;This is also a way to share. A skill is reviewable. I can open a PR and let collaborators comment on both the output and the process that produced it.&lt;/p&gt;
&lt;p&gt;In the chromatography example, using &lt;a href="https://github.com/ericmjl/skills/blob/main/skills/skill-creator/"&gt;&lt;code&gt;skill-creator&lt;/code&gt;&lt;/a&gt; to generate the first draft mattered for another reason too. English is not my teammate’s first language. The structure makes it much easier to get from “I know what I do” to “here is the procedure an agent can follow”.&lt;/p&gt;
&lt;h3 id="distribution-and-updates"&gt;Distribution and updates&lt;/h3&gt;&lt;p&gt;This is where skills feel less mature than MCP servers.&lt;/p&gt;
&lt;p&gt;An MCP server has a clean distribution story. You can &lt;code&gt;pip install&lt;/code&gt; it, configure auth once, and you get a centrally versioned bundle of prompts and tools. Updating is a normal package update.&lt;/p&gt;
&lt;p&gt;Skills still involve moving folders between machines and repos, and remembering where each harness expects skills to live.&lt;/p&gt;
&lt;p&gt;I originally wrote a &lt;a href="https://github.com/ericmjl/skills/tree/main/skills/skill-installer"&gt;&lt;code&gt;skill-installer&lt;/code&gt;&lt;/a&gt; skill to fill this gap. It is the same move as &lt;a href="https://github.com/ericmjl/skills/blob/main/skills/skill-creator/"&gt;&lt;code&gt;skill-creator&lt;/code&gt;&lt;/a&gt;, but for distribution and updates.&lt;/p&gt;
&lt;p&gt;When I say “install this skill” or “update this skill from this URL”, the agent needs to ask two key questions if I have not already specified them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;is this repo-local or machine-global?&lt;/li&gt;
&lt;li&gt;which harnesses should discover it?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then it does the boring part consistently.&lt;/p&gt;
&lt;p&gt;Update: it looks like &lt;a href="https://github.com/numman-ali/openskills"&gt;&lt;code&gt;openskills&lt;/code&gt;&lt;/a&gt; now solves most of what I wanted here, and it does it more deterministically. It is a CLI that installs skill folders from GitHub or local paths, tracks their sources for updates, and can target multiple install locations.&lt;/p&gt;
&lt;p&gt;OpenSkills has a "universal" mode that installs to &lt;code&gt;.agent/skills&lt;/code&gt; (repo) and &lt;code&gt;~/.agent/skills&lt;/code&gt; (machine).&lt;/p&gt;
&lt;p&gt;The caveat is that &lt;code&gt;.agent/skills&lt;/code&gt; is not a universal discovery standard across harnesses. Some tools look in &lt;code&gt;.claude/skills&lt;/code&gt;, &lt;code&gt;.github/skills&lt;/code&gt;, &lt;code&gt;.opencode&lt;/code&gt;, or other locations. So OpenSkills helps with deterministic installs and updates, but you still need to know what your harness will actually read.&lt;/p&gt;
&lt;p&gt;I expect this to converge soon.&lt;/p&gt;
&lt;p&gt;At this point you have both memory and playbooks. The question becomes how you decide what to invest in next.&lt;/p&gt;
&lt;h2 id="coming-next"&gt;Coming next&lt;/h2&gt;&lt;p&gt;Part 3 covers the operating model.&lt;/p&gt;
&lt;p&gt;It lays out a maturity model, a concrete bootstrap set of skills to install globally, and a decision rule for when to update &lt;code&gt;AGENTS.md&lt;/code&gt; versus when to create a skill.&lt;/p&gt;
&lt;p&gt;&lt;a href="../../19/how-to-build-self-improving-coding-agents-part-3/"&gt;How to build self-improving coding agents - Part 3&lt;/a&gt;&lt;/p&gt;
</content></entry><entry><title>How to build self-improving coding agents - Part 1</title><link href="https://ericmjl.github.io/blog/2026/1/17/how-to-build-self-improving-coding-agents-part-1/" rel="alternate"/><updated>2026-01-17T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:f5c71a26-6381-3980-aa4e-20d1deeb4eb3</id><content type="html">&lt;p&gt;I want my coding agents to get better every week.&lt;/p&gt;
&lt;p&gt;Not in the abstract “the models are improving” sense. I mean it in the operational sense: if an agent makes a mistake, or takes a path I would not take, I want that feedback to stick. If I have to repeat the same preference every session, I am not using an agent. I am babysitting a very fast intern.&lt;/p&gt;
&lt;p&gt;The trick is that the model weights are not changing mid-week. So if you want “self-improvement”, you need to change the environment the agent works inside.&lt;/p&gt;
&lt;p&gt;I have found two levers that compound:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; as repository memory&lt;/li&gt;
&lt;li&gt;skills as reusable playbooks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This post is a longer “source of truth” version. My intent is to later break it into smaller blog entries, and also rework it into chapters for my data science bootstrap notes.&lt;/p&gt;
&lt;h2 id="where-improvement-comes-from"&gt;Where improvement comes from&lt;/h2&gt;&lt;p&gt;The UX I am after is simple: I stop repeating myself. I stop doing the same end-of-day cleanup, writing the same reminders, re-explaining where files live. The agent starts each session closer to how I want it to work.&lt;/p&gt;
&lt;p&gt;If the model weights are not changing mid-week, improvement has to come from the environment you wrap around the agent.&lt;/p&gt;
&lt;p&gt;For me that environment has two pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;durable repository memory (&lt;code&gt;AGENTS.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;reusable playbooks (skills)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you have those two, you can treat “agent improvement” like runbooks plus postmortems.&lt;/p&gt;
&lt;p&gt;The analogy is imperfect, because this is not documentation for humans. The loop is the same though: write down the repeatable steps, then write down what surprised you and what you will do differently next time.&lt;/p&gt;
&lt;p&gt;The difference is that natural language can turn into tool calls. When you write things down precisely, the agent can execute them.&lt;/p&gt;
&lt;p&gt;I usually start with &lt;code&gt;AGENTS.md&lt;/code&gt;, because it cuts down exploration immediately.&lt;/p&gt;
&lt;h2 id="agents-md-as-repository-memory"&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; as repository memory&lt;/h2&gt;&lt;p&gt;If you have not run into the &lt;code&gt;AGENTS.md&lt;/code&gt; convention before, see &lt;a href="https://agents.md/"&gt;agents.md&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To be effective, &lt;code&gt;AGENTS.md&lt;/code&gt; needs to do two things for the agent.&lt;/p&gt;
&lt;p&gt;First, it needs to make the agent fast at navigating the repo so it can get to the right files with minimal wandering. A code map is a straightforward way to do that.&lt;/p&gt;
&lt;p&gt;Second, it needs to encode the local ways of working in this repo so the agent stops repeating the same mistakes. That is where corrections and norms live.&lt;/p&gt;
&lt;p&gt;This is the loop I want:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I observe a mismatch.&lt;/li&gt;
&lt;li&gt;I tell the agent what must be true.&lt;/li&gt;
&lt;li&gt;The agent writes the correction into &lt;code&gt;AGENTS.md&lt;/code&gt; (or a repo-local skill).&lt;/li&gt;
&lt;li&gt;The agent reads it next time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="fast-navigation-to-the-right-files"&gt;Fast navigation to the right files&lt;/h3&gt;&lt;p&gt;In the ideal state, the agent gets to the right files quickly.&lt;/p&gt;
&lt;p&gt;A code map is the simplest way I know to make that happen. It does not have to be perfect. It can be slightly stale and still be useful.&lt;/p&gt;
&lt;p&gt;I have seen this pay off in a very practical way. In my &lt;code&gt;canvas-chat&lt;/code&gt; codebase, having a map of the repo let the agent one-shot an obscure spot where events were emitted for node rendering. Without a map, the agent previously needed 5 to 6 &lt;code&gt;rg&lt;/code&gt; searches just to find the right neighborhood of the code.&lt;/p&gt;
&lt;p&gt;The difference is small in absolute time, something like 40 seconds versus 2 seconds. But it changes the feel of the collaboration. The agent spends less time wandering, and I spend less time steering.&lt;/p&gt;
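&lt;p&gt;For flavor, here is the shape of a code map entry. The annotations below are invented for illustration, not copied from the real canvas-chat &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Code map

- app.js: UI entry point; canvas node event wiring lives here.
- app.py: Python backend; route handlers live here.
- plugins/: one file per feature; new handlers belong here, not in app.js.
- cypress/: end-to-end smoke tests for catching page-breaking regressions.
&lt;/code&gt;&lt;/pre&gt;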
&lt;h3 id="close-the-loop-when-the-map-is-stale"&gt;Close the loop when the map is stale&lt;/h3&gt;&lt;p&gt;There is one extra move that makes this feel self-correcting: When the agent notices that the code map looks stale, it should update the code map.&lt;/p&gt;
&lt;p&gt;This is a subtle point. The map is not a static artifact. It is part of a feedback loop. When the agent’s exploration discovers a mismatch between the map and reality, that discovery should flow back into &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can encode this as an explicit instruction inside &lt;code&gt;AGENTS.md&lt;/code&gt;. You can also refresh on a schedule, like weekly, but the on-demand update is the part that makes the loop feel alive.&lt;/p&gt;
&lt;h3 id="corrections-that-become-durable-norms"&gt;Corrections that become durable norms&lt;/h3&gt;&lt;p&gt;The second job of &lt;code&gt;AGENTS.md&lt;/code&gt; is to hold repo-specific corrections to agents behaviour.&lt;/p&gt;
&lt;p&gt;These are the things you find yourself saying out loud.&lt;/p&gt;
&lt;p&gt;Two examples from my own work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run Python in the &lt;code&gt;pixi&lt;/code&gt; context. Use &lt;code&gt;pixi run python ...&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not cheat by modifying the tests to make them pass.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I say the first one because the agent will often try &lt;code&gt;python -c ...&lt;/code&gt; to quickly check something. In a &lt;code&gt;pixi&lt;/code&gt;-managed project, that fails if you do not have a global Python.&lt;/p&gt;
&lt;p&gt;I say the second one because changing tests to make them pass destroys the point of having tests.&lt;/p&gt;
&lt;p&gt;Once these rules are written down, the agent stops making you restate them. This is the simplest way I know to reduce repeated friction.&lt;/p&gt;
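&lt;p&gt;Written into the file, those two rules are nothing fancy. A sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Local norms

- Run Python through pixi: use `pixi run python ...`, never bare `python`.
- Never modify tests just to make them pass. If a test fails, fix the code,
  or flag the test for human review.
&lt;/code&gt;&lt;/pre&gt;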
&lt;h2 id="a-starter-prompt-for-generating-agents.md`"&gt;A starter prompt for generating &lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/h2&gt;&lt;p&gt;I have found it useful to bootstrap &lt;code&gt;AGENTS.md&lt;/code&gt; with a one-time deep dive.&lt;/p&gt;
&lt;p&gt;Here is a prompt I use as a starting point. It is intentionally repo-specific.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;You are a coding agent. Read through this repository and create an `AGENTS.md` file at the repo root.

Requirements:
- Include a short codebase map that helps an agent find files quickly.
- Focus on entry points, directory roles, naming conventions, configuration wiring, and test locations.
- Add a section called &amp;quot;Local norms&amp;quot; with repo-specific rules you infer from the code and tooling.
- Add a section called &amp;quot;Self-correction&amp;quot; with two explicit instructions:
  - If the code map is discovered to be stale, update it.
  - If the user gives a correction about how work should be done in this repo, add it to &amp;quot;Local norms&amp;quot; (or another clearly labeled section) so future sessions inherit it.

Process:
- Use search and targeted file reads, do not read every file.
- Prefer `rg` searches to find entry points and configs.
- Prefer high-signal files: `README`, `pyproject.toml`, `package.json`, `Makefile`, `opencode.json`, `.github/workflows`, and top-level `src` or `app` directories.

Output:
- Write the final `AGENTS.md` contents in Markdown.
- Keep it concise. Optimize for navigation and correctness.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you want, you can go further and add a cadence rule like “refresh weekly”, but I would keep it lightweight. The goal is compounding value, not bureaucracy.&lt;/p&gt;
&lt;p&gt;Once &lt;code&gt;AGENTS.md&lt;/code&gt; exists, skills are the second lever.&lt;/p&gt;
&lt;h2 id="coming-next"&gt;Coming next&lt;/h2&gt;&lt;p&gt;Part 2 is about skills as reusable playbooks.&lt;/p&gt;
&lt;p&gt;It covers what a skill is, several examples from coding and scientific work, and why I ended up writing a &lt;code&gt;skill-installer&lt;/code&gt; skill to deal with the current distribution story.&lt;/p&gt;
&lt;p&gt;&lt;a href="../../18/how-to-build-self-improving-coding-agents-part-2/"&gt;How to build self-improving coding agents - Part 2&lt;/a&gt;&lt;/p&gt;
</content></entry><entry><title>How I fixed a browser selection bug with sequence alignment algorithms</title><link href="https://ericmjl.github.io/blog/2026/1/6/how-i-fixed-a-browser-selection-bug-with-sequence-alignment-algorithms/" rel="alternate"/><updated>2026-01-06T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:8ff1d33b-f8e0-3e00-bb6e-75771bc021b1</id><content type="html">&lt;p&gt;I ran into a frustrating bug this week in &lt;a href="https://github.com/ericmjl/canvas-chat"&gt;canvas-chat&lt;/a&gt;, my experimental canvas-based chat interface &lt;a href="../../../../2025/12/31/canvas-chat-a-visual-interface-for-thinking-with-llms/"&gt;I built at the end of last year&lt;/a&gt;. The bug seemed simple on the surface: when users selected text from a rendered markdown table and clicked to highlight it, the highlighting would sometimes stop partway through, or highlight the wrong characters entirely.&lt;/p&gt;
&lt;p&gt;What started as a "quick fix" turned into a journey through several failed approaches before I remembered an algorithm from my undergraduate bioinformatics days. Sometimes the best solution to a problem comes from an unexpected domain.&lt;/p&gt;
&lt;h2 id="the-problem-browser-selections-are-messy"&gt;The problem: Browser selections are messy&lt;/h2&gt;&lt;p&gt;Canvas-chat has a feature where you can select text from an AI response, and the app creates a "highlight" node that links back to the source. When you click on the highlight, the corresponding text in the source gets wrapped in a &lt;code&gt;&amp;lt;mark&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;
&lt;p&gt;This worked fine for simple paragraphs. But when I tried it on tables containing KaTeX-rendered math, things went wrong:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What I expected to highlight:&lt;/strong&gt; &lt;mark&gt;66.00 (0.18±0.58)&lt;/mark&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What actually got highlighted:&lt;/strong&gt; &lt;mark&gt;66.00 ( 0.18&lt;/mark&gt;±0.58)&lt;/p&gt;
&lt;p&gt;The highlighting was off by more than a few characters, and would stop before the end of my selection. In some cases, it would highlight completely wrong sections.&lt;/p&gt;
&lt;h2 id="digging-into-the-root-cause"&gt;Digging into the root cause&lt;/h2&gt;&lt;p&gt;The problem came from how KaTeX renders math and how browsers handle text selection.&lt;/p&gt;
&lt;p&gt;KaTeX renders math with multiple text representations:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;katex&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="cm"&gt;&amp;lt;!-- MathML for accessibility/screen readers --&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;katex-mathml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;math&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;mn&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;0.13&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;mn&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;mo&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;±&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;mo&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;math&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="cm"&gt;&amp;lt;!-- Visual HTML for display --&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;katex-html&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mord&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;0.13&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mord&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;±&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When you select text that spans across KaTeX-rendered content, &lt;code&gt;selection.toString()&lt;/code&gt; gives you something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;"66.00 (
0.13
±
0.13±0.58)"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the duplicated &lt;code&gt;0.13&lt;/code&gt; and the random newlines? The browser included text from both the MathML (for accessibility) and the visual spans. Add in tabs between table cells and inconsistent spacing around operators, and you have a string that looks nothing like the clean HTML text content.&lt;/p&gt;
&lt;h2 id="first-attempt-normalization-layers"&gt;First attempt: Normalization layers&lt;/h2&gt;&lt;p&gt;My initial approach was to normalize both strings (the user's selection and the HTML text) before matching:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Collapse all whitespace to single spaces&lt;/li&gt;
&lt;li&gt;Remove KaTeX duplication patterns (like &lt;code&gt;0.13 ± 0.13±&lt;/code&gt; → &lt;code&gt;0.13±&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Normalize spacing around &lt;code&gt;±&lt;/code&gt; operators&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Then find the match in the normalized strings, and map the positions back to the original.&lt;/p&gt;
&lt;p&gt;This is where things got complicated. I needed to track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which positions in the normalized string corresponded to which positions in the original&lt;/li&gt;
&lt;li&gt;How to reverse the mapping after finding a match&lt;/li&gt;
&lt;li&gt;How to handle characters that got removed entirely during normalization&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The code became a tangled mess of position arrays and off-by-one bugs. Here's a simplified version of what it looked like:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;// Build mapping from normalized to original positions&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;normalizedToOriginal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;inWhitespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;leadingTrimmed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fullText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fullText&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/\s/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;inWhitespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;leadingTrimmed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nx"&gt;normalizedToOriginal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;inWhitespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;leadingTrimmed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;inWhitespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;normalizedToOriginal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Then also account for math spacing normalization...&lt;/span&gt;
&lt;span class="c1"&gt;// And KaTeX deduplication...&lt;/span&gt;
&lt;span class="c1"&gt;// Each layer compounds the position mapping complexity&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The position mapping kept breaking. I'd fix one case only to break another. &lt;strong&gt;I was trying to maintain a bijection between two strings that had been transformed through multiple non-invertible operations.&lt;/strong&gt; It wasn't going to work.&lt;/p&gt;
&lt;h2 id="the-insight-this-is-a-sequence-alignment-problem"&gt;The insight: This is a sequence alignment problem&lt;/h2&gt;&lt;p&gt;After banging my head against the normalization approach for a while, I took a step back. What was I actually trying to do?&lt;/p&gt;
&lt;p&gt;I had two strings:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The user's selection (messy, with artifacts)&lt;/li&gt;
&lt;li&gt;The HTML text content (clean)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I needed to find where the user's selection "matched" in the HTML text, tolerating:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Insertions (extra whitespace, duplicated characters in the selection)&lt;/li&gt;
&lt;li&gt;Deletions (characters present in HTML but not in selection)&lt;/li&gt;
&lt;li&gt;Mismatches (different whitespace characters)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly what sequence alignment algorithms are designed for. In bioinformatics, we use these algorithms to compare DNA or protein sequences that may have evolved with insertions, deletions, and mutations. The classic algorithm for finding the best local alignment between two sequences is &lt;strong&gt;Smith-Waterman&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I learned Smith-Waterman as an undergraduate, probably around 2008. I never thought I'd use it for web development.&lt;/p&gt;
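&lt;p&gt;To make the idea concrete before the code, here is a toy alignment of the messy selection from earlier against the clean HTML text, with newlines flattened to spaces for display. The dashes mark a gap: characters the aligner treats as insertions in the selection, skipped at a small penalty so the real match can line up on either side:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;selection: 66.00 ( 0.13 ± 0.13±0.58)
           |||||||        ||||||||||
html:      66.00 (--------0.13±0.58)
&lt;/code&gt;&lt;/pre&gt;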
&lt;h2 id="the-solution-align-the-beginning-and-end"&gt;The solution: Align the beginning and end&lt;/h2&gt;&lt;p&gt;I didn't need to align the entire selection - I just needed to find where it started and ended in the HTML text. So I:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Take the first ~20 characters of the user's selection and align them to find the &lt;strong&gt;start position&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Take the last ~20 characters, reverse both the suffix and the HTML text, and align to find the &lt;strong&gt;end position&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here's the core alignment function:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;alignStart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;queryPrefix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// Reward for matching characters&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MISMATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Penalty for different characters&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="c1"&gt;// Penalty for insertions/deletions&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;WS_MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;// Softer reward for whitespace matches&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Build the scoring matrix&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;maxScore&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;maxI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;maxJ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;qChar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;queryPrefix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tChar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;matchVal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;qChar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tChar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;matchVal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sr"&gt;/\s/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;qChar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;WS_MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MATCH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/\s/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;qChar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sr"&gt;/\s/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tChar&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;matchVal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;WS_MATCH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Any whitespace matches any whitespace&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;matchVal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MISMATCH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Local alignment can restart anywhere&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;matchVal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Diagonal: match/mismatch&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="c1"&gt;// Up: gap in target&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GAP&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="c1"&gt;// Left: gap in query&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;maxScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;maxScore&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;maxI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;maxJ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Traceback to find start position&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// ... (walk backwards from maxI, maxJ to find where alignment began)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
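&lt;p&gt;The traceback elided above is short. Here's a minimal sketch of one way the function could finish, reusing &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;maxI&lt;/code&gt;, &lt;code&gt;maxJ&lt;/code&gt;, and &lt;code&gt;GAP&lt;/code&gt; from above - my illustration of the idea, not the exact code from the pull request:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;// Walk backwards from the best-scoring cell until the local score
// drops to zero; the column we stop at is where the match begins.
let i = maxI;
let j = maxJ;
while (i &amp;gt; 0 &amp;amp;&amp;amp; j &amp;gt; 0 &amp;amp;&amp;amp; score[i][j] &amp;gt; 0) {
    if (score[i][j] === score[i - 1][j] + GAP) {
        i--;  // Gap in target: step back over a query character
    } else if (score[i][j] === score[i][j - 1] + GAP) {
        j--;  // Gap in query: step back over a target character
    } else {
        i--;  // Diagonal: match or mismatch
        j--;
    }
}
return { start: j, score: maxScore };  // j is the 0-based match start in target
&lt;/pre&gt;&lt;/div&gt;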
&lt;p&gt;The key insight is that Smith-Waterman's local alignment naturally handles all the messiness. Extra newlines in the selection? They're just gaps. Duplicated numbers? They align to the same position. Different whitespace characters? They all match each other.&lt;/p&gt;
&lt;h2 id="the-result"&gt;The result&lt;/h2&gt;&lt;p&gt;The new approach passes all the test cases that the normalization approach failed:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test: Simple word&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Target: &lt;code&gt;"Hello world, this is a test."&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Query: &lt;code&gt;"world"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Result: &lt;mark&gt;world&lt;/mark&gt; (positions 6-11)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Test: KaTeX duplication&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Target: &lt;code&gt;"66.00 (0.18 ± 0.18±0.58)"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Query: &lt;code&gt;"66.00 (\n0.18\n±\n0.18±0.58)"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Result: &lt;mark&gt;66.00 (0.18 ± 0.18±0.58)&lt;/mark&gt; (positions 0-25)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Test: Cross-block selection&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Target: &lt;code&gt;"The Heading Some paragraph text here."&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Query: &lt;code&gt;"The Heading\n\nSome paragraph"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Result: &lt;mark&gt;The Heading Some paragraph&lt;/mark&gt; (positions 0-25)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-lesson-know-your-algorithms"&gt;The lesson: Know your algorithms&lt;/h2&gt;&lt;p&gt;I didn't invent anything new here. Smith-Waterman has been around since 1981. I just recognized that my web development problem was, at its core, a sequence alignment problem.&lt;/p&gt;
&lt;p&gt;This is why I think it's valuable to study algorithms and techniques from different domains, even if they seem unrelated to your day-to-day work. You never know when dynamic programming from bioinformatics will solve your JavaScript text highlighting bug.&lt;/p&gt;
&lt;p&gt;The normalization approach was trying to make two messy things identical before comparing them. The alignment approach embraced the messiness and asked: "Given that these are different, where do they best correspond?"&lt;/p&gt;
&lt;p&gt;That's a fundamentally different framing, and it's the framing that actually matched the problem.&lt;/p&gt;
&lt;p&gt;Interestingly, I couldn't find prior examples of using Smith-Waterman specifically for UI text highlighting or matching browser text selections to source HTML. The algorithm is well-established in bioinformatics for DNA and protein sequence alignment, and it appears in some fuzzy string matching contexts like spell-checking and record linkage. But applying it to handle the specific artifacts that browsers introduce when selecting text from rendered HTML with KaTeX, MathML, or complex table structures? That seems to be a new application. Sometimes the best solutions come from recognizing that your problem, despite appearing domain-specific, maps onto a well-solved problem from an entirely different field.&lt;/p&gt;
&lt;p&gt;One more note: I didn't write the JavaScript implementation myself. I directed Claude Opus 4.5 in &lt;a href="https://opencode.ai"&gt;OpenCode&lt;/a&gt; to write it for me. My contribution was recognizing that this was a sequence alignment problem and describing the approach - the actual code was generated by the AI. This is becoming my preferred way to work: I provide the domain insight and algorithmic direction, and the AI handles the implementation details.&lt;/p&gt;
&lt;h2 id="appendix-the-full-solution"&gt;Appendix: The full solution&lt;/h2&gt;&lt;p&gt;For those curious, the complete implementation is in the &lt;a href="https://github.com/ericmjl/canvas-chat/pull/97"&gt;pull request&lt;/a&gt;. The key functions are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;alignStart(queryPrefix, target)&lt;/code&gt; - Find where the query beginning matches&lt;/li&gt;
&lt;li&gt;&lt;code&gt;alignEnd(querySuffix, target)&lt;/code&gt; - Find where the query end matches (by reversing and aligning)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;findMatchRegion(query, target)&lt;/code&gt; - Combine both to get the full match region (sketched below)&lt;/li&gt;
&lt;/ul&gt;
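&lt;p&gt;To give a feel for how these compose, here's a minimal sketch - assuming &lt;code&gt;alignStart&lt;/code&gt; returns the &lt;code&gt;{ start }&lt;/code&gt; object from the traceback sketch earlier and using the ~20-character probes described above; the real implementation in the pull request handles more edge cases:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;const PROBE = 20;  // Probe length; an assumption matching the "~20 characters" above

function reverse(s) {
    return [...s].reverse().join("");
}

function findMatchRegion(query, target) {
    // Align the first ~20 characters to find where the match begins.
    const { start } = alignStart(query.slice(0, PROBE), target);

    // Reverse both the query suffix and the target; a start position in
    // reversed coordinates corresponds to an end position in the original.
    const { start: revStart } = alignStart(
        reverse(query.slice(-PROBE)),
        reverse(target)
    );
    const end = target.length - revStart;

    return { start, end };  // Half-open [start, end) region in target
}
&lt;/pre&gt;&lt;/div&gt;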
&lt;p&gt;The algorithm runs in O(mn) time where m and n are the lengths of the strings being aligned. For typical text selections (tens to hundreds of characters), this is instantaneous. And unlike the normalization approach, it's robust and correct!&lt;/p&gt;
</content></entry><entry><title>Canvas Chat: A Visual Interface for Thinking with LLMs</title><link href="https://ericmjl.github.io/blog/2025/12/31/canvas-chat-a-visual-interface-for-thinking-with-llms/" rel="alternate"/><updated>2025-12-31T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:d2aa2631-1832-38cd-8942-aab5690ab5ca</id><content type="html">&lt;p&gt;I've been mulling over this idea since January of last year: A visual, nonlinear interface for LLM conversations—something like an infinite canvas where you could branch, merge, and see the shape of your thinking. It stayed in the "someday" pile because the implementation cost felt too high for a speculative side project; I wasn't skilled in browser technologies or anything UI-related.&lt;/p&gt;
&lt;p&gt;Then came the Christmas break ultralearning exercise I documented in &lt;a href="https://ericmjl.github.io/blog/2025/12/28/you-can-just-make-stuff-with-opencode-and-claude-opus-4-5/"&gt;my recent blog post about building with OpenCode and Claude Opus 4.5&lt;/a&gt;. Pressure-testing Opus 4.5 made me realize it was finally feasible to spend a day trying to make this work. I pushed Canvas Chat from idea to working prototype in about 24 hours of actual building time, and &lt;a href="https://ericmjl--canvas-chat-fastapi-app.modal.run/"&gt;then another 24 hours to get it up on Modal&lt;/a&gt; and add many, many refinements, each of which would previously have taken me multiple weeks. The final result is this:&lt;/p&gt;
&lt;p&gt;&lt;img src="canvas-chat-overview.webp" alt="Canvas Chat overview showing nodes connected in a directed graph"&gt;&lt;/p&gt;
&lt;p&gt;But before I explain what I built, let me explain &lt;em&gt;why&lt;/em&gt; I wanted it in the first place.&lt;/p&gt;
&lt;h2 id="the-job-to-be-done"&gt;The job to be done&lt;/h2&gt;&lt;p&gt;Clayton Christensen's &lt;a href="https://hbr.org/2016/09/know-your-customers-jobs-to-be-done"&gt;Jobs to Be Done&lt;/a&gt; framework asks: what job is the customer hiring this product to do? For Canvas Chat, the job isn't "chat with an LLM"—ChatGPT already does that fine. The job is: &lt;strong&gt;think through a complex problem where the exploration is nonlinear.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the struggling moment. You're deep in a conversation with Claude or GPT, and you want to try a different framing of your question. But if you do, you'll lose the current thread. Or an LLM gives you a list of ten ideas, one catches your eye, and you want to drill into it—but the conversation keeps scrolling and you lose the overview. Or you've been exploring a problem across three separate chat sessions and now you need to synthesize, but you can't see them together.&lt;/p&gt;
&lt;p&gt;Linear chat actively works against this kind of thinking. It forces linear structure onto nonlinear exploration. You end up managing context in your head, copy-pasting between windows, losing track of which threads went where.&lt;/p&gt;
&lt;p&gt;Canvas Chat exists to solve that. When your thinking branches in multiple directions, it keeps all the threads visible and connected so you don't lose context and can synthesize across them.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How it works&lt;/h2&gt;&lt;p&gt;Canvas Chat is an infinite canvas where conversations are nodes in a directed graph. You type a message, it appears as a node. The LLM's response appears as another node, connected by an edge. So far, standard. But then:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branch from any node.&lt;/strong&gt; Click reply on any message, and your new message connects to that point, not the end of the conversation. The response branches off visually. Try two different prompts from the same starting point and see both branches side by side.&lt;/p&gt;
&lt;p&gt;&lt;img src="branch-from-node.webp" alt="Branching from a node to create parallel conversation threads"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Highlight and branch.&lt;/strong&gt; Select text within a node, and a tooltip appears. Type a follow-up question, and Canvas Chat creates a highlight node (showing the excerpt with a blockquote) plus your question, plus the LLM response. The original node stays intact. This works especially well when an LLM gives a list of ideas and you want to drill into one without losing the overview.&lt;/p&gt;
&lt;p&gt;&lt;img src="highlight-and-branch-tooltip.webp" alt="Tooltip appearing when text is selected within a node"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="highlight-and-branch-result.webp" alt="Result of highlight and branch showing the blockquote excerpt"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-select for merge context.&lt;/strong&gt; Cmd-click multiple nodes, then type. The new message connects to all selected nodes, and the LLM sees the full ancestry of every selected node. I use this to synthesize: select two branches that went in different directions, ask "What do these approaches have in common?" The context includes everything that led to both.&lt;/p&gt;
&lt;p&gt;&lt;img src="multi-select-merge.webp" alt="Multiple nodes selected for merge context synthesis"&gt;&lt;/p&gt;
&lt;h2 id="context-flows-through-the-graph"&gt;Context flows through the graph&lt;/h2&gt;&lt;p&gt;When you send a message, Canvas Chat walks the DAG backward from your selected node(s), collecting all ancestors. It sorts them by creation time and sends them to the LLM as conversation history. If you've selected multiple nodes (a merge), the context is the union of all their ancestors, deduplicated.&lt;/p&gt;
&lt;p&gt;The practical effect: the LLM always knows how you arrived at the current question, even if the path is nonlinear. Branch from a discussion about protein folding dynamics, ask a follow-up about computational costs, and the context includes the protein folding discussion. No manual copy-paste.&lt;/p&gt;
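&lt;p&gt;Mechanically, that context assembly is a small graph traversal. Here's a sketch of the idea - the names and node shape are my assumptions, not Canvas Chat's actual internals:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;// Collect the deduplicated ancestors of the selected nodes, then order
// them by creation time to form the conversation history.
function collectContext(selectedIds, nodesById, parentsOf) {
    const seen = new Set();
    const stack = [...selectedIds];
    while (stack.length &amp;gt; 0) {
        const id = stack.pop();
        if (seen.has(id)) continue;
        seen.add(id);
        for (const parentId of parentsOf(id)) stack.push(parentId);
    }
    return [...seen]
        .map((id) =&amp;gt; nodesById.get(id))
        .sort((a, b) =&amp;gt; a.createdAt - b.createdAt);
}
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the visited set is shared across all selected starting points, a multi-node merge naturally yields the deduplicated union of every branch's history.&lt;/p&gt;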
&lt;h2 id="matrix-evaluation"&gt;Matrix evaluation&lt;/h2&gt;&lt;p&gt;This feature came out of a specific struggling moment: evaluating many options against many criteria and losing track of which combinations I'd thought through.&lt;/p&gt;
&lt;p&gt;Select one or more nodes as context, then type &lt;code&gt;/matrix &amp;lt;instructions for what you want filled out&amp;gt;&lt;/code&gt;. Canvas Chat parses out the list items and shows a confirmation modal where you can remove items or swap rows/columns. Click create, and a matrix node appears.&lt;/p&gt;
&lt;p&gt;&lt;img src="matrix-evaluation-modal.webp" alt="Matrix evaluation modal for configuring rows and columns"&gt;&lt;/p&gt;
&lt;p&gt;Each cell has a "+" button. Click it and the LLM fills that cell, seeing the matrix context you provided, the row item, the column item, and the full DAG history from the source nodes. "Fill All" processes every empty cell sequentially.&lt;/p&gt;
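&lt;p&gt;As a hypothetical sketch of what a single cell-fill request might look like (the message shapes here are my assumptions, not Canvas Chat's actual schema):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;// Assemble the context for one cell: full DAG history plus the
// matrix instructions and this cell's row/column pairing.
function buildCellMessages(matrixContext, rowItem, colItem, history) {
    return [
        ...history,  // ancestry of the matrix's source nodes
        {
            role: "user",
            content:
                `Context: ${matrixContext}\n` +
                `Evaluate "${rowItem}" against "${colItem}". ` +
                `Reply with a short assessment for this single cell.`,
        },
    ];
}
&lt;/pre&gt;&lt;/div&gt;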
&lt;p&gt;Click any filled cell to see the full text. "Pin to Canvas" extracts that evaluation into a standalone node, which you can then branch from. Say you're comparing business ideas against criteria, one cell says "strong market fit with enterprise customers," you want to dig into that—pin and branch.&lt;/p&gt;
&lt;p&gt;&lt;img src="matrix-evaluation-filled.webp" alt="Matrix evaluation with cells filled by the LLM"&gt;&lt;/p&gt;
&lt;h2 id="web-search-and-deep-research"&gt;Web search and deep research&lt;/h2&gt;&lt;p&gt;Canvas Chat integrates Exa's APIs for two slash commands:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;/search &amp;lt;query&amp;gt;&lt;/code&gt; runs a neural search and creates a Search node with the query, plus Reference nodes for each result. Click "Fetch &amp;amp; Summarize" on any reference to grab the full page content and summarize it.&lt;/p&gt;
&lt;p&gt;&lt;img src="web-search-results.webp" alt="Web search results showing reference nodes"&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;/research &amp;lt;topic&amp;gt;&lt;/code&gt; kicks off Exa's Research API, which performs multi-step research with multiple queries. The results stream into a Research node with inline source citations.&lt;/p&gt;
&lt;p&gt;&lt;img src="deep-research-node.webp" alt="Deep research node with inline citations"&gt;&lt;/p&gt;
&lt;p&gt;If you have nodes selected when you run these commands, Canvas Chat uses an LLM to refine your query using the selected text as context. Highlight "CCNOT gate" and type &lt;code&gt;/search how does this work&lt;/code&gt;, and it rewrites the query to "how Toffoli gate CCNOT quantum computing works" before searching.&lt;/p&gt;
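&lt;p&gt;A hypothetical sketch of that refinement step (&lt;code&gt;llm.complete&lt;/code&gt; is a placeholder interface, not a real Canvas Chat API):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;// Rewrite a terse query into a self-contained one using the selection.
async function refineQuery(rawQuery, selectedText, llm) {
    const prompt =
        `Rewrite this search query so it is self-contained, ` +
        `using the context below.\n` +
        `Query: ${rawQuery}\nContext: ${selectedText}`;
    return (await llm.complete(prompt)).trim();
}
&lt;/pre&gt;&lt;/div&gt;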
&lt;h2 id="local-first-and-multi-provider"&gt;Local-first and multi-provider&lt;/h2&gt;&lt;p&gt;&lt;img src="settings-api-keys.webp" alt="Settings panel showing API key configuration"&gt;&lt;/p&gt;
&lt;p&gt;All session data lives in IndexedDB. No server-side storage, no accounts. Export sessions as &lt;code&gt;.canvaschat&lt;/code&gt; JSON files. API keys live in localStorage and are sent with each request.&lt;/p&gt;
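&lt;p&gt;For readers who haven't used IndexedDB, the browser side of local-first persistence can be this small. A generic sketch - the database name, store name, and record shape are placeholders, not Canvas Chat's actual schema:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;// Open (or create) a local database and persist one session record.
const req = indexedDB.open("canvas-chat-demo", 1);
req.onupgradeneeded = () =&amp;gt; {
    req.result.createObjectStore("sessions", { keyPath: "id" });
};
req.onsuccess = () =&amp;gt; {
    const tx = req.result.transaction("sessions", "readwrite");
    tx.objectStore("sessions").put({ id: "demo", nodes: [], edges: [] });
    tx.oncomplete = () =&amp;gt; console.log("session saved locally");
};
&lt;/pre&gt;&lt;/div&gt;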
&lt;p&gt;The server is stateless: it proxies LLM calls via LiteLLM and handles the Exa integration, but never stores conversation data. You can deploy it yourself on Modal with a single command.&lt;/p&gt;
&lt;p&gt;Canvas Chat dynamically fetches available models from each provider when you enter an API key. OpenAI, Anthropic, Google (Gemini), Groq, GitHub Models, and local Ollama instances (when running on localhost) all work. Switch models mid-conversation to compare outputs.&lt;/p&gt;
&lt;h2 id="what-building-this-taught-me"&gt;What building this taught me&lt;/h2&gt;&lt;p&gt;This project reinforced something I wrote about in &lt;a href="../../28/you-can-just-make-stuff-with-opencode-and-claude-opus-4-5/"&gt;the "I don't code anymore, I build" post&lt;/a&gt;: I stayed in product builder brain throughout. I didn't have strong opinions about whether the JavaScript was idiomatic because I don't know what idiomatic JavaScript looks like. I just knew whether the feature worked.&lt;/p&gt;
&lt;p&gt;When something broke, I'd describe the symptoms in as much detail as I could manage and let Opus 4.5 debug. When I wanted a new interaction pattern, I'd describe what it should feel like and watch it materialize. The creative work — deciding what nonlinear chat should &lt;em&gt;be&lt;/em&gt; — remained human. The mechanical translation got delegated.&lt;/p&gt;
&lt;p&gt;Canvas Chat is the kind of project I wouldn't have attempted before because the implementation cost exceeded the payoff. Now it didn't.&lt;/p&gt;
&lt;h2 id="try-it"&gt;Try it&lt;/h2&gt;&lt;p&gt;Canvas Chat is open source. Run it locally:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/ericmjl/canvas-chat.git
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;canvas-chat
pixi&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;dev
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Add your API keys in settings and go. The deployed version runs on &lt;a href="https://ericmjl--canvas-chat-fastapi-app.modal.run/"&gt;Modal&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you try it, I want to hear what works and what doesn't! You can get in touch with me via &lt;a href="https://ericmjl--shortmail-run-app.modal.run/send/cce87ae9c1d7"&gt;Shortmail&lt;/a&gt;, or file an issue on the &lt;a href="https://github.com/ericmjl/canvas-chat"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
</content></entry><entry><title>You Can Just Make Stuff with OpenCode and Claude Opus 4.5</title><link href="https://ericmjl.github.io/blog/2025/12/28/you-can-just-make-stuff-with-opencode-and-claude-opus-4-5/" rel="alternate"/><updated>2025-12-28T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:c6c818be-320a-3218-b326-65fc288e1a41</id><content type="html">&lt;p&gt;&lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7408389557915799552?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7408389557915799552%2C7410474533469597697%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287410474533469597697%2Curn%3Ali%3Aactivity%3A7408389557915799552%29"&gt;Tommy Tang asked me&lt;/a&gt; about my opinions on OpenCode, so here's what I've learned after spending significant time with &lt;a href="https://opencode.ai/"&gt;OpenCode&lt;/a&gt; and Claude Opus 4.5.&lt;/p&gt;
&lt;h2 id="i-don-t-code-anymore-i-build"&gt;I don't code anymore, I build&lt;/h2&gt;&lt;p&gt;This is the punchline, so let me start with it. I've shifted from writing code to directing its creation. The change happened gradually, then all at once. I used to think about syntax, edge cases, and implementation details. Now I think about what I want to exist, describe it clearly, and watch it materialize.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.biblegateway.com/passage/?search=Genesis%201&amp;amp;version=NIV"&gt;Genesis 1:3&lt;/a&gt; describes this pattern at a cosmic scale:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"And God said, 'Let there be light,' and there was light."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Working with Claude Opus 4.5 through OpenCode feels like a microcosm of that creative act.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Eric said, "Let there be a feature," and there was the feature, in code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm not claiming divinity here, just noting that the creative pattern of speaking things into existence has become surprisingly literal in my daily work.&lt;/p&gt;
&lt;h2 id="the-tools-opencode-and-claude-opus-4.5"&gt;The tools: OpenCode and Claude Opus 4.5&lt;/h2&gt;&lt;p&gt;Like &lt;a href="https://www.youtube.com/watch?v=4AyM_3SK31w&amp;amp;t=1263s"&gt;Theo Brown from t3.gg&lt;/a&gt;, I've settled on Claude Opus 4.5 as my primary model for coding tasks. It just knows what to do. I've stopped trying to micro-manage the model's actions because it handles most tasks autonomously and correctly. When I ask for a refactor, it refactors. When I describe a feature, it implements it. The gap between intention and execution has shrunk to almost nothing.&lt;/p&gt;
&lt;p&gt;Other models require more hand-holding. Opus 4.5 seems to have internalized enough software engineering patterns that I can trust it to make reasonable architectural decisions without constant course corrections. I can literally ask it to "do the docs, keep things up-to-date, and also give me a document that has an overview of code organization and architecture." It just goes to town autonomously. No step-by-step prompting, no breaking the task into smaller pieces. I describe the outcome I want and it figures out the path.&lt;/p&gt;
&lt;p&gt;The tooling layer matters too. &lt;a href="https://opencode.ai/"&gt;OpenCode&lt;/a&gt; orchestrates the AI coding in a way that feels natural. The tools it calls are always logical, the reasoning traces are transparent, and the execution flow makes sense. It shows a running list of modified files, giving me context about what's changing without running &lt;code&gt;git status&lt;/code&gt; constantly. Context compaction lets me stay in one long-running session without hitting token limits. I've thrown out the old playbook of "switch sessions when you approach the context window." Now I only switch when I want to do something entirely different.&lt;/p&gt;
&lt;p&gt;My setup: OpenCode with auto-updating, GitHub Copilot Pro as the LLM provider (routing to Opus 4.5), running inside a &lt;a href="https://github.com/tmux/tmux"&gt;tmux&lt;/a&gt; session for persistence. Each repo gets an AGENTS.md file where I encode my preferences and patterns - the model's training data for my specific context. Opus 4.5 actually respects what's in there, unlike some other models that seem to ignore custom instructions.&lt;/p&gt;
&lt;h2 id="ten-days-of-deliberate-practice"&gt;Ten days of deliberate practice&lt;/h2&gt;&lt;p&gt;I decided to pressure-test the "I build" claim over the holidays. Ten days, December 19-28, using OpenCode as my primary development interface. The goal: see how much I could actually ship.&lt;/p&gt;
&lt;p&gt;The answer surprised me. Across six repositories, I pushed over 150 commits spanning infrastructure work, documentation, greenfield apps, and maintenance. Here's what emerged:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A &lt;a href="https://ericmjl.github.io/2025-12-ski-trip-website/"&gt;ski trip coordination website&lt;/a&gt;&lt;/strong&gt; (59 commits). My family was heading to New Hampshire for a week. Normally I'd have used a shared Google Doc for the itinerary. Instead, I built a full website with recipe modals, restaurant links with Apple and Google Maps integration, a photo album with lightbox navigation, automatic thumbnail generation, and a hero video background. I updated it live during the trip - adding photos, adjusting the grocery list, swapping menu items. The implementation cost would have been absurd for a week-long trip before. Now the jazz and snazz was well worth the effort - my family actually enjoyed using it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A teaching clock app for my kids&lt;/strong&gt; (2 commits, but a complete app). An &lt;a href="https://ericmjl.github.io/teaching-clock/"&gt;analog clock trainer&lt;/a&gt; plus a &lt;a href="https://ericmjl.github.io/teaching-clock/puzzle.html"&gt;jigsaw puzzle game&lt;/a&gt; with difficulty levels and themes. Pure JavaScript and CSS - exactly the kind of project my decade-old "no JavaScript" rule would have blocked. The model wrote it; I directed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pyjanitor-devs/pyjanitor"&gt;pyjanitor&lt;/a&gt; infrastructure&lt;/strong&gt; (40 commits). Currency symbol support for international formats. Automated patch releases on every merge. Test isolation fixes. And a major expansion of AGENTS.md into what I now think of as the repository's "agent constitution" - a document that tells AI assistants how to work within this specific codebase.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A new &lt;a href="https://github.com/conda-forge/staged-recipes/pull/31776"&gt;conda-forge package&lt;/a&gt;&lt;/strong&gt; for janitor-rs. The model handled the unfamiliar territory of Rust packaging and conda-forge recipe formats. I was the novice here; it was the guide. This role reversal keeps happening - when I set up PostHog analytics or migrated to GA4 on my website, the model walked me through each step, explained what I was doing and why, and waited for confirmation before proceeding. The expert-novice relationship flips depending on who knows more about the task at hand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A &lt;a href="../../27/how-i-themed-my-tmux-with-opencode-and-claude/"&gt;custom tmux status bar&lt;/a&gt;&lt;/strong&gt; with Nord colors, powerline arrows, and smooth color transitions. Pure aesthetic indulgence - the kind of project I'd never have prioritized before because the implementation cost exceeded the payoff. Now it didn't.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ericmjl/canvas-chat"&gt;Canvas Chat&lt;/a&gt;&lt;/strong&gt; (13 commits in 24 hours). A visual non-linear chat interface - think infinite canvas meets LLM conversation. Resizable nodes, trackpad gestures, streaming responses, web search via Exa, session management. FastAPI backend, vanilla JS frontend. Another "no JavaScript" rule violation, and another project that went from idea to working prototype in a single day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Smaller fixes across &lt;a href="https://github.com/ericmjl/llamabot"&gt;llamabot&lt;/a&gt;&lt;/strong&gt; (better error messages) and &lt;strong&gt;&lt;a href="https://github.com/ericmjl/website"&gt;my website&lt;/a&gt;&lt;/strong&gt; (PostHog analytics, GA4 migration, blog posts).&lt;/p&gt;
&lt;p&gt;The variety matters. This wasn't one type of project where I got lucky. It was infrastructure, documentation, greenfield consumer apps, packaging for an ecosystem I rarely touch, and routine maintenance. The "I build" claim held up across all of them.&lt;/p&gt;
&lt;h2 id="from-engineer-brain-to-product-builder-brain"&gt;From engineer brain to product builder brain&lt;/h2&gt;&lt;p&gt;Something shifted in how I think about these projects. Previously, I'd worry about &lt;em&gt;how&lt;/em&gt; a thing was built - the engineer brain obsessing over implementation details, code structure, idiomatic patterns. Now I've switched to &lt;em&gt;what&lt;/em&gt; was built, &lt;em&gt;why&lt;/em&gt; I want it built, and &lt;em&gt;does it get the job done&lt;/em&gt; - the product builder's brain.&lt;/p&gt;
&lt;p&gt;This is especially true for the ski website and Canvas Chat, both built with web technologies (HTML, JS, CSS) that I'm not deeply familiar with. Ironically, my unfamiliarity frees me from micro-managing the implementation. I don't have strong opinions about whether the JavaScript is idiomatic because I don't know what idiomatic JavaScript looks like. I just know whether the feature works.&lt;/p&gt;
&lt;p&gt;But there's a latent risk here. The code might not follow best practices - lots of duplication, poor separation of concerns, missing edge cases. So I fall back on &lt;em&gt;principles&lt;/em&gt; I picked up from years of Python: refactoring, documentation, testing. I stay at that level of nudging Opus 4.5 - "look for places to refactor," "document this module," "add tests for this functionality" - but I stay out of the nitty-gritty implementation. The principles transfer even when the language doesn't.&lt;/p&gt;
&lt;h2 id="how-my-review-process-changed"&gt;How my review process changed&lt;/h2&gt;&lt;p&gt;Here's something I didn't expect: I don't scrutinize the code as tightly as I used to during active development. Instead, I read the reasoning traces first. The model's chain of thought tells me whether my codebase is heading in the right direction. If the reasoning is coherent and addresses the right concerns, the code will reflect what I want. If the reasoning seems confused or takes weird detours, something's wrong and I need to dig deeper.&lt;/p&gt;
&lt;p&gt;This inverts the traditional development loop. I used to read code to understand what the computer would do. Now I read reasoning to understand what the model understood and decided. The code review happens afterward, and it's lighter because the reasoning already told me whether we're on track.&lt;/p&gt;
&lt;p&gt;When I want to catch issues that slipped through, I start a fresh session. A new context window acts like a fresh pair of eyes - the model hasn't been primed by the conversation that led to the current implementation, so it can spot inconsistencies that were invisible during the creative flow. This parallels the old advice about stepping away from code before reviewing it, except now the "stepping away" happens by instantiating a new session rather than waiting for my own brain to reset.&lt;/p&gt;
&lt;h2 id="unlearning-old-assumptions"&gt;Unlearning old assumptions&lt;/h2&gt;&lt;p&gt;&lt;a href="https://x.com/bcherny/status/2004626064187031831"&gt;Boris Cherny recently had a Twitter exchange with Andrej Karpathy&lt;/a&gt; that resonated with me. Boris observed that newer coworkers and even new grads who don't make assumptions about what the model can and can't do are often able to use it most effectively. They don't carry "legacy memories formed when using old models." Every month or two, models get better, and those of us who've been using them longest have to actively unlearn outdated limitations.&lt;/p&gt;
&lt;p&gt;I've caught myself doing this repeatedly. Back in grad school around 2015, I tried building &lt;a href="https://d3js.org/"&gt;d3.js&lt;/a&gt; visualizations and struggled to adjust to JavaScript's syntax coming from Python. I decided to focus on getting better at Python first and gave myself a "no JavaScript" rule wherever possible. That constraint made sense at the time. It makes no sense now. The model writes JavaScript just fine. My decade-old "no JavaScript" policy was a legacy memory holding me back from building things that would actually benefit from running in the browser.&lt;/p&gt;
&lt;p&gt;The mental work of re-adjusting expectations is real. I have to keep asking myself: would I have avoided this six months ago because the model couldn't handle it, or because I assumed it couldn't? The answer is increasingly the latter.&lt;/p&gt;
&lt;p&gt;There's a flip side to this unlearning, though. Working in JavaScript land forced me to learn the language of the web to achieve the same precision and fluency I have with Python. I found myself picking up patterns I'd avoided for years: the browser console for debugging, DOM element manipulation, CSS transitions I didn't know existed, the JS package ecosystem. The model writes the code, but I still need enough vocabulary to direct it well and recognize when something's off. Unlearning old constraints doesn't mean staying ignorant of new territory - it means finally having a reason to explore it.&lt;/p&gt;
&lt;h2 id="what-this-means"&gt;What this means&lt;/h2&gt;&lt;p&gt;The shift from "I code" to "I build" isn't just semantic. It reflects a genuine change in what I spend my attention on. Less time on syntax and implementation details. More time on architecture, requirements, and verification. The creative work remains human. The mechanical translation has been delegated.&lt;/p&gt;
&lt;p&gt;I'm still learning how to use this effectively. But the trajectory is clear: the gap between imagining software and having software continues to shrink.&lt;/p&gt;
</content></entry><entry><title>How I Themed My tmux with OpenCode + Claude (And When to Switch Models)</title><link href="https://ericmjl.github.io/blog/2025/12/27/how-i-themed-my-tmux-with-opencode-and-claude/" rel="alternate"/><updated>2025-12-27T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:9ed7b606-5f8b-3aef-8718-ba121e610e6e</id><content type="html">&lt;p&gt;I had a beautiful tmux status bar on my old laptop. Nord colors, powerline arrows, clean and minimal. The kind that makes you feel like a proper terminal power user.&lt;/p&gt;
&lt;p&gt;When I got a new machine back in April, I was too lazy to set up tmux properly. The sensible thing would have been to spend five minutes copying over my old config. Instead, eight months later, I finally spent an hour pair-programming the whole thing from scratch with &lt;a href="https://github.com/sst/opencode"&gt;OpenCode&lt;/a&gt; and Claude.&lt;/p&gt;
&lt;p&gt;Why? Honestly, I wanted to try out a new tool. The irony isn't lost on me.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The Setup&lt;/h2&gt;&lt;p&gt;OpenCode is a CLI tool that lets you interact with Claude directly from your terminal. Perfect for this kind of task: I'm already in the terminal configuring tmux, so having my AI pair programmer right there keeps the feedback loop tight. Describe what I want. See the change. Describe what's wrong, with precision. Iterate. No context switching to a browser.&lt;/p&gt;
&lt;p&gt;That tight loop is what let me stay in the creative headspace. I could say things like "I want the arrows to overlap like in this screenshot" or "the colors feel too muted, try the frost blue from Nord" without knowing the exact syntax. Claude translated my aesthetic intent into working config.&lt;/p&gt;
&lt;p&gt;The other superpower: model switching. OpenCode lets you flip between any models you have API keys for. For this session, I toggled between Claude Sonnet (fast, good for quick iterations) and Claude Opus (slower, but sharper for complex debugging). This turned out to be crucial.&lt;/p&gt;
&lt;h2 id="starting-with-research"&gt;Starting with Research&lt;/h2&gt;&lt;p&gt;First, I asked Sonnet to search online for tmux status bar customization. It pulled resources from the official tmux wiki and various tutorials, giving me a foundation: &lt;code&gt;status-left&lt;/code&gt;, &lt;code&gt;status-right&lt;/code&gt;, &lt;code&gt;window-status-format&lt;/code&gt;, color options, the basics.&lt;/p&gt;
&lt;p&gt;Armed with that, we dove in.&lt;/p&gt;
&lt;h2 id="first-attempt-with-a-custom-theme"&gt;First attempt with a custom theme&lt;/h2&gt;&lt;p&gt;Claude created a custom dark theme inspired by Catppuccin colors. Worked immediately:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;status-style&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bg=#1e1e2e,fg=#cdd6f4&amp;quot;&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;status-left&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#89b4fa,bold] #S #[fg=#a6e3a1]@ #H&amp;quot;&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;status-right&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#f9e2af]%a %b %d #[fg=#89b4fa]%H:%M&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Clean. Functional. Pretty. But I wanted more: those beautiful powerline arrows flowing between segments. That's when things got interesting.&lt;/p&gt;
&lt;h2 id="the-powerline-saga"&gt;The Powerline Saga&lt;/h2&gt;&lt;p&gt;Claude suggested &lt;code&gt;powerline-go&lt;/code&gt;, a Go-based powerline prompt generator. We installed it via Homebrew (not pip, since &lt;a href="https://ericmjl.github.io/blog/2024/8/16/its-time-to-try-out-pixi/"&gt;I keep my system Python-free&lt;/a&gt;):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;powerline-go
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Updated the tmux config to call powerline-go for the status bar. Reloaded. And... disaster. Instead of beautiful arrows, raw escape codes:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;[38;5;15m[48;5;4m ericmjl [38;5;4m[48;5;0m...
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The terminal was spitting out ANSI codes instead of interpreting them. We tried various fixes, but powerline-go simply wasn't designed for tmux status bars; it's meant for shell prompts. Back to square one.&lt;/p&gt;
&lt;h2 id="trying-the-tmux-powerline-plugin"&gt;Trying the tmux-powerline Plugin&lt;/h2&gt;&lt;p&gt;Next attempt: the actual &lt;code&gt;tmux-powerline&lt;/code&gt; plugin via TPM (Tmux Plugin Manager):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/tmux-plugins/tpm&lt;span class="w"&gt; &lt;/span&gt;~/.tmux/plugins/tpm
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Added the plugin, pressed &lt;code&gt;C-b I&lt;/code&gt; to install, and... the status bar exploded with information. IP addresses, weather, load averages, hostname. Way too much. I asked Claude to simplify and switch to Nord colors.&lt;/p&gt;
&lt;p&gt;We created a custom theme at &lt;code&gt;~/.config/tmux-powerline/themes/nord.sh&lt;/code&gt;, updated the config, reloaded tmux. Nothing changed. The theme wasn't loading. Killed the server entirely. Restarted. Still the old crowded theme.&lt;/p&gt;
&lt;p&gt;This is where Sonnet started struggling. Same fixes over and over: reload the config, check the theme path, restart tmux. Loop after loop of suggestions that weren't working.&lt;/p&gt;
&lt;h2 id="the-model-switch-from-sonnet-to-opus"&gt;The model switch from Sonnet to Opus&lt;/h2&gt;&lt;p&gt;I noticed Sonnet spinning its wheels. Same suggestions, same non-results. Time to switch.&lt;/p&gt;
&lt;p&gt;The difference was immediate. Instead of repeating failed approaches, Opus stepped back and proposed something different entirely: ditch the plugin and go native. Tmux's built-in formatting is powerful enough to create powerline-style status bars without any plugins. We just needed the right Unicode characters and color transitions.&lt;/p&gt;
&lt;p&gt;This stuck with me: Sonnet is fantastic for speed and quick iterations, but when you're stuck in a loop, Opus brings the lateral thinking to break out.&lt;/p&gt;
&lt;h2 id="going-native-as-the-winning-approach"&gt;Going native as the winning approach&lt;/h2&gt;&lt;p&gt;Fresh start. Clean native tmux config. The key insight was understanding how powerline arrows actually work: the arrow character's foreground color matches the background of the segment it's coming from, and its background matches what it's going into.&lt;/p&gt;
&lt;p&gt;Here's the final status-left (session name with powerline arrow):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;status-left&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#2e3440,bg=#5e81ac,bold]  #S #[fg=#5e81ac,bg=#2e3440]\ue0b0&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The window formats, with arrows on both sides so they flow into neighboring elements:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Inactive windows&lt;/span&gt;
setw&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;window-status-format&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#2e3440,bg=#3b4252]\ue0b0#[fg=#d8dee9,bg=#3b4252] #I #W #[fg=#3b4252,bg=#2e3440]\ue0b0&amp;quot;&lt;/span&gt;

&lt;span class="c1"&gt;# Active window (cyan highlight)&lt;/span&gt;
setw&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;window-status-current-format&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#2e3440,bg=#88c0d0]\ue0b0#[fg=#2e3440,bg=#88c0d0,bold] #I #W #[fg=#88c0d0,bg=#2e3440]\ue0b0&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And the right side (battery, date, time) using left-pointing arrows and a smooth Nord color gradient:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;status-right&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#[fg=#a3be8c,bg=#2e3440]\ue0b2#[fg=#2e3440,bg=#a3be8c,bold] 󰁹 #(pmset -g batt | grep -o &amp;#39;[0-9]*%%&amp;#39; | head -1) #[fg=#5e81ac,bg=#a3be8c]\ue0b2#[fg=#d8dee9,bg=#5e81ac] %b %d #[fg=#88c0d0,bg=#5e81ac]\ue0b2#[fg=#2e3440,bg=#88c0d0,bold] %H:%M &amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="the-final-result"&gt;The Final Result&lt;/h2&gt;&lt;p&gt;After all that iteration, here's what my tmux status bar looks like:&lt;/p&gt;
&lt;div style="background: #2e3440; font-family: 'JetBrains Mono', 'Fira Code', monospace; padding: 8px 0; display: flex; flex-wrap: nowrap; justify-content: space-between; align-items: center; border-radius: 4px; overflow: hidden; min-width: 0;"&gt;
  &lt;div style="display: flex; flex-wrap: nowrap; align-items: center; height: 24px; flex-shrink: 0;"&gt;
    &lt;span style="background: #5e81ac; color: #2e3440; padding: 0 12px; font-weight: bold; height: 100%; display: flex; align-items: center; white-space: nowrap;"&gt;system-config&lt;/span&gt;
    &lt;div style="width: 0; height: 0; border-top: 12px solid transparent; border-bottom: 12px solid transparent; border-left: 12px solid #5e81ac; flex-shrink: 0; position: relative; z-index: 2;"&gt;&lt;/div&gt;
    &lt;span style="background: #88c0d0; color: #2e3440; padding: 0 12px; font-weight: bold; height: 100%; display: flex; align-items: center; white-space: nowrap; margin-left: -12px; padding-left: 20px;"&gt;1 opencode&lt;/span&gt;
    &lt;div style="width: 0; height: 0; border-top: 12px solid transparent; border-bottom: 12px solid transparent; border-left: 12px solid #88c0d0; flex-shrink: 0;"&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div style="display: flex; flex-wrap: nowrap; align-items: center; height: 24px; flex-shrink: 0;"&gt;
    &lt;div style="width: 0; height: 0; border-top: 12px solid transparent; border-bottom: 12px solid transparent; border-right: 12px solid #a3be8c; flex-shrink: 0;"&gt;&lt;/div&gt;
    &lt;span style="background: #a3be8c; color: #2e3440; padding: 0 12px; font-weight: bold; height: 100%; display: flex; align-items: center; white-space: nowrap; padding-right: 20px;"&gt;🔋 100%&lt;/span&gt;
    &lt;div style="width: 0; height: 0; border-top: 12px solid transparent; border-bottom: 12px solid transparent; border-right: 12px solid #5e81ac; flex-shrink: 0; margin-left: -12px; position: relative; z-index: 2;"&gt;&lt;/div&gt;
    &lt;span style="background: #5e81ac; color: #d8dee9; padding: 0 12px; height: 100%; display: flex; align-items: center; white-space: nowrap; padding-right: 20px;"&gt;Dec 23&lt;/span&gt;
    &lt;div style="width: 0; height: 0; border-top: 12px solid transparent; border-bottom: 12px solid transparent; border-right: 12px solid #88c0d0; flex-shrink: 0; margin-left: -12px; position: relative; z-index: 2;"&gt;&lt;/div&gt;
    &lt;span style="background: #88c0d0; color: #2e3440; padding: 0 12px; font-weight: bold; height: 100%; display: flex; align-items: center; white-space: nowrap;"&gt;06:05&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Session name in frost blue on the left. Active window in cyan. Right side flows through battery (green), date (blue), and time (cyan). All connected by powerline arrows with smooth color transitions. &lt;em&gt;(I asked Claude to recreate the status bar in HTML so I wouldn't have to screenshot it for the blog.)&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="what-i-took-away"&gt;What I Took Away&lt;/h2&gt;&lt;p&gt;There's a growing conversation about AI-assisted programming: the tight feedback loops, model selection strategies, iterative workflows. I've written about some of these patterns myself. But this session crystallized something different.&lt;/p&gt;
&lt;p&gt;I can express my creativity on a computer screen more easily than ever before.&lt;/p&gt;
&lt;p&gt;I'm not a designer. CSS is foreign to me, hex color codes don't stick in my head, and tmux's formatting syntax is arcane. But I have taste. I know what looks good. Years of admiring beautiful terminals gave me a mental mood board. What I lacked was the technical fluency to make it real.&lt;/p&gt;
&lt;p&gt;AI bridged that gap. Throughout this session I worked like a designer: describing aesthetics, pointing at visual problems, directing iteration. "The arrows should overlap." "That cyan is too bright." "Make the battery segment green." Claude handled implementation. I stayed in the creative headspace.&lt;/p&gt;
&lt;p&gt;Iteration surfaces what you actually want.&lt;/p&gt;
&lt;p&gt;This surprised me. I didn't start with a complete vision, just a vague sense of "Nord colors, powerline arrows, clean and minimal." But each rapid cycle surfaced preferences I didn't know I had. The arrows need to overlap. The active window should pop more. The right side needs a color gradient. None of these were requirements I could have articulated upfront. They emerged through seeing and reacting.&lt;/p&gt;
&lt;p&gt;Bits and bytes have never been cheaper to produce. AI can generate config files, CSS, code, whatever. But aesthetics and judgment? Those remain expensive. The scarce resource isn't the implementation anymore. It's knowing what you want and recognizing when you've found it.&lt;/p&gt;
&lt;p&gt;AI doesn't replace that judgment. It amplifies it by removing the implementation friction that used to slow the creative loop down.&lt;/p&gt;
&lt;p&gt;The whole session took about an hour, failed attempts included. Without AI pair programming, I'd probably still be reading documentation. Instead, I have a beautiful terminal, and a new appreciation for what becomes possible when the gap between creative vision and technical implementation shrinks to nearly nothing.&lt;/p&gt;
</content></entry><entry><title>Two years of weekly blogging and what 2025 taught me</title><link href="https://ericmjl.github.io/blog/2025/12/25/two-years-of-weekly-blogging-and-what-2025-taught-me/" rel="alternate"/><updated>2025-12-25T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:af479e56-5fbc-334a-b7c8-6532edd8477b</id><content type="html">&lt;p&gt;Last year, I challenged myself to write one blog post per week,
and I hit 53 posts by the end of 2024.
This year, I doubled down on that commitment
and wrote 50 posts in 2025.
Including this one, it's 51,
bringing me to 104 blog posts over two years.&lt;/p&gt;
&lt;h2 id="the-year-of-coding-agents"&gt;The year of coding agents&lt;/h2&gt;&lt;p&gt;Looking at my 2025 posts,
one theme dominates: &lt;strong&gt;coding agents&lt;/strong&gt;.
I wrote extensively about how to work with AI coding assistants,
from teaching them with AGENTS.md files
to letting them work autonomously.
This reflected a shift in how I work day-to-day.&lt;/p&gt;
&lt;p&gt;Some highlights from this theme:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/"&gt;How to teach your coding agent with AGENTS.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/8/safe-ways-to-let-your-coding-agent-work-autonomously/"&gt;Safe ways to let your coding agent work autonomously&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/10/productive-patterns-for-agent-assisted-programming/"&gt;Productive Patterns for Agent-Assisted Programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;How I Replaced 307 Lines of Agent Code with 4 Lines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The shift from "AI as a tool" to "AI as a collaborator"
captures how my practice evolved this year.
I've gone from cautiously experimenting with Cursor
to having established patterns for multi-repository agent workflows.&lt;/p&gt;
&lt;h2 id="bayesian-methods-and-biological-applications"&gt;Bayesian methods and biological applications&lt;/h2&gt;&lt;p&gt;My work continued to inform my writing,
with several posts on applying Bayesian statistics to real lab problems.
The R2D2 prior posts were particularly satisfying to write
because I felt equipped with new theoretical knowledge that was directly applicable,
and I appreciated the mathematical aesthetics behind the approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;Bayesian Superiority Estimation with R2D2 Priors: A Practical Guide for Protein Screening&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/6/stop-guessing-at-priors-r2d2s-automated-approach-to-bayesian-modeling/"&gt;Stop guessing at priors: R2D2's automated approach to Bayesian modeling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/5/from-data-chaos-to-statistical-clarity-a-laboratory-transformation-story/"&gt;From data chaos to statistical clarity: A laboratory transformation story&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also explored the challenges of working with lab data,
including why preclinical experiments make ML challenging
and how to communicate effectively with lab scientists.&lt;/p&gt;
&lt;h2 id="tools-i-got-excited-about"&gt;Tools I got excited about&lt;/h2&gt;&lt;p&gt;Every year brings new tools that change how I work.
In 2025, two stood out.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://marimo.io/"&gt;Marimo&lt;/a&gt; is a reactive notebook tool
that I wrote about with enthusiasm,
and followed up with practical guidance on using coding agents to write Marimo notebooks.
The reactive execution model aligns well with how I think about data exploration.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://modal.com/"&gt;Modal&lt;/a&gt; is cloud computing that actually feels Pythonic.
My "Wow, Modal!" post captured the delight of finding infrastructure
that doesn't fight against my workflow.&lt;/p&gt;
&lt;h2 id="data-science-leadership-and-career"&gt;Data science leadership and career&lt;/h2&gt;&lt;p&gt;I continued writing about the human side of data science work,
including standardizing ways of working, communicating with lab scientists,
and navigating the biotech industry's ups and downs.
The year ended with
&lt;a href="https://ericmjl.github.io/blog/2025/12/17/the-selfish-reason-to-do-your-best-work/"&gt;The selfish reason to do your best work&lt;/a&gt;,
which synthesized lessons from a challenging year in biotech.&lt;/p&gt;
&lt;h2 id="looking-ahead-to-2026"&gt;Looking ahead to 2026&lt;/h2&gt;&lt;p&gt;After two years of writing almost weekly on whatever is on my mind,
I am adjusting my goals.
Next year, my attention shifts towards
(a) learning the fundamentals of quantum computing through an ultralearning project,
(b) writing more on data science leadership and career development to encourage colleagues navigating similar paths, and
(c) building out at least 10 experimental things with AI.
I am also relaxing the goal from "one blog post per week" to four posts per month,
which brings the target to 48 for 2026.
This gives me space to rest and to plan my writing more strategically going into 2026.&lt;/p&gt;
&lt;p&gt;Merry Christmas and a happy new year to all my readers!&lt;/p&gt;
&lt;h2 id="blog-posts-by-theme"&gt;Blog posts by theme&lt;/h2&gt;&lt;h3 id="biology-chemistry"&gt;Biology &amp;amp; Chemistry&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/4/what-makes-an-agent/"&gt;What makes an agent? (2025-01-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/19/why-data-from-preclinical-biotech-lab-experiments-make-machine-learning-challenging/"&gt;Why data from preclinical biotech lab experiments make machine learning challenging (2025-01-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/23/reliable-biological-data-requires-physical-quantities-not-statistical-artifacts/"&gt;Reliable biological data requires physical quantities, not statistical artifacts (2025-02-23)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/6/a-blueprint-for-data-driven-molecule-engineering/"&gt;A blueprint for data-driven molecule engineering (2025-03-06)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;Bayesian Superiority Estimation with R2D2 Priors: A Practical Guide for Protein Screening (2025-04-03)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/5/from-data-chaos-to-statistical-clarity-a-laboratory-transformation-story/"&gt;From data chaos to statistical clarity: A laboratory transformation story (2025-04-05)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/19/good-practices-for-ai-assisted-development-from-a-live-protein-calculator-demo/"&gt;Good practices for AI-assisted development from a live protein calculator demo (2025-04-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/"&gt;Build your own tools! (2025-06-27)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference (2025-07-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/15/how-to-use-xarray-for-unified-laboratory-data-storage/"&gt;How to use xarray for unified laboratory data storage (2025-07-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/24/how-to-communicate-with-lab-scientists-when-youre-the-data-person/"&gt;How to communicate with lab scientists (when you're the data person) (2025-08-24)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/1/how-data-scientists-can-master-life-sciences-and-software-skills-for-biotech-using-ultralearning/"&gt;How data scientists can master life sciences and software skills for biotech using ultralearning (2025-10-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/"&gt;What does it take to build a statistics agent? (2025-12-02)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="career-advice"&gt;Career Advice&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/13/writing-at-the-speed-of-thought/"&gt;Writing at the speed of thought (2025-01-13)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/17/why-you-should-take-part-in-the-scipy-sprints/"&gt;Why you should take part in the SciPy sprints! (2025-03-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/8/why-im-excited-for-scipy-2025/"&gt;Why I'm excited for SciPy 2025! (2025-05-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference (2025-07-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/"&gt;Data scientists aren't becoming obsolete in the LLM era (2025-08-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/1/how-to-use-ai-to-accelerate-your-career-in-2025/"&gt;How to use AI to accelerate your career in 2025 (2025-09-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/1/how-data-scientists-can-master-life-sciences-and-software-skills-for-biotech-using-ultralearning/"&gt;How data scientists can master life sciences and software skills for biotech using ultralearning (2025-10-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/17/the-selfish-reason-to-do-your-best-work/"&gt;The selfish reason to do your best work (2025-12-17)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="data-science-practice-leadership"&gt;Data Science Practice &amp;amp; Leadership&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/10/a-practical-guide-to-securing-secrets-in-data-science-projects/"&gt;A practical guide to securing secrets in data science projects (2025-01-10)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/19/why-data-from-preclinical-biotech-lab-experiments-make-machine-learning-challenging/"&gt;Why data from preclinical biotech lab experiments make machine learning challenging (2025-01-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/31/pydata-bostoncambridge-talk-moderna-what-makes-an-agent/"&gt;PyData Boston/Cambridge Talk @ Moderna: What makes an agent? (2025-01-31)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/23/reliable-biological-data-requires-physical-quantities-not-statistical-artifacts/"&gt;Reliable biological data requires physical quantities, not statistical artifacts (2025-02-23)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/6/a-blueprint-for-data-driven-molecule-engineering/"&gt;A blueprint for data-driven molecule engineering (2025-03-06)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/16/the-art-of-finesse-as-a-data-scientist/"&gt;The art of finesse as a data scientist (2025-03-16)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/17/why-you-should-take-part-in-the-scipy-sprints/"&gt;Why you should take part in the SciPy sprints! (2025-03-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/2/how-to-standardize-data-science-ways-of-working-to-unlock-your-teams-creativity/"&gt;How to standardize Data Science ways of working to unlock your team's creativity (2025-04-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;Bayesian Superiority Estimation with R2D2 Priors: A Practical Guide for Protein Screening (2025-04-03)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/5/from-data-chaos-to-statistical-clarity-a-laboratory-transformation-story/"&gt;From data chaos to statistical clarity: A laboratory transformation story (2025-04-05)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/19/good-practices-for-ai-assisted-development-from-a-live-protein-calculator-demo/"&gt;Good practices for AI-assisted development from a live protein calculator demo (2025-04-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/8/why-im-excited-for-scipy-2025/"&gt;Why I'm excited for SciPy 2025! (2025-05-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/7/principles-for-using-ai-autodidactically/"&gt;Principles for using AI autodidactically (2025-06-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/"&gt;Build your own tools! (2025-06-27)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/7/the-job-your-docs-need-to-do/"&gt;The job your docs need to do (2025-07-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;Earn the privilege to use automation (2025-07-13)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference (2025-07-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/21/from-nerd-sniped-to-shipped-using-ai-as-a-thinking-tool/"&gt;From nerd-sniped to shipped using AI as a thinking tool (2025-07-21)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/6/stop-guessing-at-priors-r2d2s-automated-approach-to-bayesian-modeling/"&gt;Stop guessing at priors: R2D2's automated approach to Bayesian modeling (2025-08-06)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/"&gt;Data scientists aren't becoming obsolete in the LLM era (2025-08-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/24/how-to-communicate-with-lab-scientists-when-youre-the-data-person/"&gt;How to communicate with lab scientists (when you're the data person) (2025-08-24)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/2/the-data-science-bootstrap-notes-a-major-upgrade-for-2025/"&gt;The Data Science Bootstrap Notes: A major upgrade for 2025 (2025-09-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/1/how-data-scientists-can-master-life-sciences-and-software-skills-for-biotech-using-ultralearning/"&gt;How data scientists can master life sciences and software skills for biotech using ultralearning (2025-10-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/19/how-to-expose-any-documentation-to-any-llm-agent/"&gt;How to expose any documentation to any LLM agent (2025-10-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/"&gt;What does it take to build a statistics agent? (2025-12-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/17/the-selfish-reason-to-do-your-best-work/"&gt;The selfish reason to do your best work (2025-12-17)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="data-science-tooling"&gt;Data Science Tooling&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/4/what-makes-an-agent/"&gt;What makes an agent? (2025-01-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/10/a-practical-guide-to-securing-secrets-in-data-science-projects/"&gt;A practical guide to securing secrets in data science projects (2025-01-10)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/13/writing-at-the-speed-of-thought/"&gt;Writing at the speed of thought (2025-01-13)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/19/why-data-from-preclinical-biotech-lab-experiments-make-machine-learning-challenging/"&gt;Why data from preclinical biotech lab experiments make machine learning challenging (2025-01-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/31/pydata-bostoncambridge-talk-moderna-what-makes-an-agent/"&gt;PyData Boston/Cambridge Talk @ Moderna: What makes an agent? (2025-01-31)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/7/lightening-the-llamabot/"&gt;Lightening the LlamaBot (2025-02-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/17/let-me-ship-you-the-python-you-need/"&gt;Let me ship you the Python you need (2025-02-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/23/reliable-biological-data-requires-physical-quantities-not-statistical-artifacts/"&gt;Reliable biological data requires physical quantities, not statistical artifacts (2025-02-23)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/1/how-to-fix-pypi-upload-errors-related-to-license-metadata/"&gt;How to fix PyPI upload errors related to license metadata (2025-03-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/6/a-blueprint-for-data-driven-molecule-engineering/"&gt;A blueprint for data-driven molecule engineering (2025-03-06)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/17/why-you-should-take-part-in-the-scipy-sprints/"&gt;Why you should take part in the SciPy sprints! (2025-03-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/2/how-to-standardize-data-science-ways-of-working-to-unlock-your-teams-creativity/"&gt;How to standardize Data Science ways of working to unlock your team's creativity (2025-04-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;Bayesian Superiority Estimation with R2D2 Priors: A Practical Guide for Protein Screening (2025-04-03)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/5/from-data-chaos-to-statistical-clarity-a-laboratory-transformation-story/"&gt;From data chaos to statistical clarity: A laboratory transformation story (2025-04-05)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/8/wow-marimo/"&gt;Wow, Marimo! (2025-04-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/19/good-practices-for-ai-assisted-development-from-a-live-protein-calculator-demo/"&gt;Good practices for AI-assisted development from a live protein calculator demo (2025-04-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/26/wow-modal/"&gt;Wow, Modal! (2025-04-26)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/8/why-im-excited-for-scipy-2025/"&gt;Why I'm excited for SciPy 2025! (2025-05-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/24/supercharge-your-coding-agents-with-vscode-workspaces/"&gt;Supercharge your coding agents with VSCode workspaces (2025-05-24)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/25/the-invisible-polish-of-automatic-model-routing/"&gt;The invisible polish of automatic model routing (2025-05-25)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/14/rethinking-llm-interfaces-from-chatbots-to-contextual-applications/"&gt;Rethinking LLM interfaces, from chatbots to contextual applications (2025-06-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/"&gt;Build your own tools! (2025-06-27)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/1/one-hour-and-eight-minutes-building-a-receipt-scanner-with-the-weirdest-tech-stack-imaginable/"&gt;One hour and eight minutes: Building a receipt scanner with the weirdest tech stack imaginable (2025-07-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;Earn the privilege to use automation (2025-07-13)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference (2025-07-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/15/how-to-use-xarray-for-unified-laboratory-data-storage/"&gt;How to use xarray for unified laboratory data storage (2025-07-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/21/from-nerd-sniped-to-shipped-using-ai-as-a-thinking-tool/"&gt;From nerd-sniped to shipped using AI as a thinking tool (2025-07-21)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/6/stop-guessing-at-priors-r2d2s-automated-approach-to-bayesian-modeling/"&gt;Stop guessing at priors: R2D2's automated approach to Bayesian modeling (2025-08-06)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/"&gt;Data scientists aren't becoming obsolete in the LLM era (2025-08-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/23/wicked-python-trickery-dynamically-patch-a-python-functions-source-code-at-runtime/"&gt;Wicked Python trickery - dynamically patch a Python function's source code at runtime (2025-08-23)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/2/the-data-science-bootstrap-notes-a-major-upgrade-for-2025/"&gt;The Data Science Bootstrap Notes: A major upgrade for 2025 (2025-09-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/"&gt;How to teach your coding agent with AGENTS.md (2025-10-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/10/how-to-use-multiple-github-accounts-on-the-same-computer/"&gt;How to use multiple GitHub accounts on the same computer (2025-10-10)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/14/how-to-use-coding-agents-effectively/"&gt;How to Use Coding Agents Effectively (2025-10-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/18/a-practical-comparison-of-dspy-and-llamabot-for-structured-llm-applications/"&gt;A practical comparison of DSPy and LlamaBot for structured LLM applications (2025-10-18)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/19/how-to-expose-any-documentation-to-any-llm-agent/"&gt;How to expose any documentation to any LLM agent (2025-10-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/20/exploring-skills-vs-mcp-servers/"&gt;Exploring Skills vs MCP Servers (2025-10-20)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/28/use-coding-agents-to-write-marimo-notebooks/"&gt;Use coding agents to write Marimo notebooks (2025-10-28)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/8/safe-ways-to-let-your-coding-agent-work-autonomously/"&gt;Safe ways to let your coding agent work autonomously (2025-11-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;How I Replaced 307 Lines of Agent Code with 4 Lines (2025-11-16)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/17/how-to-reference-code-across-repositories-with-coding-agents/"&gt;How to Reference Code Across Repositories with Coding Agents (2025-11-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/"&gt;What does it take to build a statistics agent? (2025-12-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/10/productive-patterns-for-agent-assisted-programming/"&gt;Productive Patterns for Agent-Assisted Programming (2025-12-10)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="llms"&gt;LLMs&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/4/what-makes-an-agent/"&gt;What makes an agent? (2025-01-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/31/pydata-bostoncambridge-talk-moderna-what-makes-an-agent/"&gt;PyData Boston/Cambridge Talk @ Moderna: What makes an agent? (2025-01-31)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/7/lightening-the-llamabot/"&gt;Lightening the LlamaBot (2025-02-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/8/wow-marimo/"&gt;Wow, Marimo! (2025-04-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/26/wow-modal/"&gt;Wow, Modal! (2025-04-26)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/8/why-im-excited-for-scipy-2025/"&gt;Why I'm excited for SciPy 2025! (2025-05-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/24/supercharge-your-coding-agents-with-vscode-workspaces/"&gt;Supercharge your coding agents with VSCode workspaces (2025-05-24)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/25/the-invisible-polish-of-automatic-model-routing/"&gt;The invisible polish of automatic model routing (2025-05-25)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/7/principles-for-using-ai-autodidactically/"&gt;Principles for using AI autodidactically (2025-06-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/14/rethinking-llm-interfaces-from-chatbots-to-contextual-applications/"&gt;Rethinking LLM interfaces, from chatbots to contextual applications (2025-06-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/"&gt;Build your own tools! (2025-06-27)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/1/one-hour-and-eight-minutes-building-a-receipt-scanner-with-the-weirdest-tech-stack-imaginable/"&gt;One hour and eight minutes: Building a receipt scanner with the weirdest tech stack imaginable (2025-07-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;Earn the privilege to use automation (2025-07-13)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference (2025-07-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/21/from-nerd-sniped-to-shipped-using-ai-as-a-thinking-tool/"&gt;From nerd-sniped to shipped using AI as a thinking tool (2025-07-21)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/"&gt;Data scientists aren't becoming obsolete in the LLM era (2025-08-15)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/23/wicked-python-trickery-dynamically-patch-a-python-functions-source-code-at-runtime/"&gt;Wicked Python trickery - dynamically patch a Python function's source code at runtime (2025-08-23)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/1/how-to-use-ai-to-accelerate-your-career-in-2025/"&gt;How to use AI to accelerate your career in 2025 (2025-09-01)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/"&gt;How to teach your coding agent with AGENTS.md (2025-10-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/14/how-to-use-coding-agents-effectively/"&gt;How to Use Coding Agents Effectively (2025-10-14)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/18/a-practical-comparison-of-dspy-and-llamabot-for-structured-llm-applications/"&gt;A practical comparison of DSPy and LlamaBot for structured LLM applications (2025-10-18)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/19/how-to-expose-any-documentation-to-any-llm-agent/"&gt;How to expose any documentation to any LLM agent (2025-10-19)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/20/exploring-skills-vs-mcp-servers/"&gt;Exploring Skills vs MCP Servers (2025-10-20)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/28/use-coding-agents-to-write-marimo-notebooks/"&gt;Use coding agents to write Marimo notebooks (2025-10-28)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/8/safe-ways-to-let-your-coding-agent-work-autonomously/"&gt;Safe ways to let your coding agent work autonomously (2025-11-08)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;How I Replaced 307 Lines of Agent Code with 4 Lines (2025-11-16)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/17/how-to-reference-code-across-repositories-with-coding-agents/"&gt;How to Reference Code Across Repositories with Coding Agents (2025-11-17)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/"&gt;What does it take to build a statistics agent? (2025-12-02)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/10/productive-patterns-for-agent-assisted-programming/"&gt;Productive Patterns for Agent-Assisted Programming (2025-12-10)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="all-blog-posts"&gt;All blog posts&lt;/h2&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Categories&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-04&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/4/what-makes-an-agent/"&gt;What makes an agent?&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/10/a-practical-guide-to-securing-secrets-in-data-science-projects/"&gt;A practical guide to securing secrets in data science projects&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-13&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/13/writing-at-the-speed-of-thought/"&gt;Writing at the speed of thought&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-19&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/19/why-data-from-preclinical-biotech-lab-experiments-make-machine-learning-challenging/"&gt;Why data from preclinical biotech lab experiments make machine learning challenging&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-31&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/1/31/pydata-bostoncambridge-talk-moderna-what-makes-an-agent/"&gt;PyData Boston/Cambridge Talk @ Moderna: What makes an agent?&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling, Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-07&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/7/lightening-the-llamabot/"&gt;Lightening the LlamaBot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-17&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/17/let-me-ship-you-the-python-you-need/"&gt;Let me ship you the Python you need&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-23&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/2/23/reliable-biological-data-requires-physical-quantities-not-statistical-artifacts/"&gt;Reliable biological data requires physical quantities, not statistical artifacts&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Biology &amp;amp; Chemistry, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/1/how-to-fix-pypi-upload-errors-related-to-license-metadata/"&gt;How to fix PyPI upload errors related to license metadata&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-06&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/6/a-blueprint-for-data-driven-molecule-engineering/"&gt;A blueprint for data-driven molecule engineering&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-16&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/16/the-art-of-finesse-as-a-data-scientist/"&gt;The art of finesse as a data scientist&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-17&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/3/17/why-you-should-take-part-in-the-scipy-sprints/"&gt;Why you should take part in the SciPy sprints!&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-02&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/2/how-to-standardize-data-science-ways-of-working-to-unlock-your-teams-creativity/"&gt;How to standardize Data Science ways of working to unlock your team's creativity&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-03&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;Bayesian Superiority Estimation with R2D2 Priors: A Practical Guide for Protein Screening&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-05&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/5/from-data-chaos-to-statistical-clarity-a-laboratory-transformation-story/"&gt;From data chaos to statistical clarity: A laboratory transformation story&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-08&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/8/wow-marimo/"&gt;Wow, Marimo!&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling, LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-19&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/19/good-practices-for-ai-assisted-development-from-a-live-protein-calculator-demo/"&gt;Good practices for AI-assisted development from a live protein calculator demo&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-26&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/4/26/wow-modal/"&gt;Wow, Modal!&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling, LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-05-08&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/8/why-im-excited-for-scipy-2025/"&gt;Why I'm excited for SciPy 2025!&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-05-24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/24/supercharge-your-coding-agents-with-vscode-workspaces/"&gt;Supercharge your coding agents with VSCode workspaces&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-05-25&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/5/25/the-invisible-polish-of-automatic-model-routing/"&gt;The invisible polish of automatic model routing&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-06-07&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/7/principles-for-using-ai-autodidactically/"&gt;Principles for using AI autodidactically&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-06-14&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/14/rethinking-llm-interfaces-from-chatbots-to-contextual-applications/"&gt;Rethinking LLM interfaces, from chatbots to contextual applications&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-06-27&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/"&gt;Build your own tools!&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-01&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/1/one-hour-and-eight-minutes-building-a-receipt-scanner-with-the-weirdest-tech-stack-imaginable/"&gt;One hour and eight minutes: Building a receipt scanner with the weirdest tech stack imaginable&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-07&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/7/the-job-your-docs-need-to-do/"&gt;The job your docs need to do&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-13&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;Earn the privilege to use automation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-14&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/"&gt;Reflections on the SciPy 2025 Conference&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling, Biology &amp;amp; Chemistry, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-15&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/15/how-to-use-xarray-for-unified-laboratory-data-storage/"&gt;How to use xarray for unified laboratory data storage&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-07-21&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/7/21/from-nerd-sniped-to-shipped-using-ai-as-a-thinking-tool/"&gt;From nerd-sniped to shipped using AI as a thinking tool&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-08-06&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/6/stop-guessing-at-priors-r2d2s-automated-approach-to-bayesian-modeling/"&gt;Stop guessing at priors: R2D2's automated approach to Bayesian modeling&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-08-15&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/"&gt;Data scientists aren't becoming obsolete in the LLM era&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Practice &amp;amp; Leadership, Data Science Tooling, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-08-23&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/23/wicked-python-trickery-dynamically-patch-a-python-functions-source-code-at-runtime/"&gt;Wicked Python trickery - dynamically patch a Python function's source code at runtime&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-08-24&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/8/24/how-to-communicate-with-lab-scientists-when-youre-the-data-person/"&gt;How to communicate with lab scientists (when you're the data person)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Biology &amp;amp; Chemistry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-09-01&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/1/how-to-use-ai-to-accelerate-your-career-in-2025/"&gt;How to use AI to accelerate your career in 2025&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-09-02&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/9/2/the-data-science-bootstrap-notes-a-major-upgrade-for-2025/"&gt;The Data Science Bootstrap Notes: A major upgrade for 2025&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-01&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/1/how-data-scientists-can-master-life-sciences-and-software-skills-for-biotech-using-ultralearning/"&gt;How data scientists can master life sciences and software skills for biotech using ultralearning&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Practice &amp;amp; Leadership, Biology &amp;amp; Chemistry, Career Advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-04&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/"&gt;How to teach your coding agent with AGENTS.md&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/10/how-to-use-multiple-github-accounts-on-the-same-computer/"&gt;How to use multiple GitHub accounts on the same computer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-14&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/14/how-to-use-coding-agents-effectively/"&gt;How to Use Coding Agents Effectively&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-18&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/18/a-practical-comparison-of-dspy-and-llamabot-for-structured-llm-applications/"&gt;A practical comparison of DSPy and LlamaBot for structured LLM applications&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-19&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/19/how-to-expose-any-documentation-to-any-llm-agent/"&gt;How to expose any documentation to any LLM agent&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling, Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-20&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/20/exploring-skills-vs-mcp-servers/"&gt;Exploring Skills vs MCP Servers&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-10-28&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/10/28/use-coding-agents-to-write-marimo-notebooks/"&gt;Use coding agents to write Marimo notebooks&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-08&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/8/safe-ways-to-let-your-coding-agent-work-autonomously/"&gt;Safe ways to let your coding agent work autonomously&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-16&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;How I Replaced 307 Lines of Agent Code with 4 Lines&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-17&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/11/17/how-to-reference-code-across-repositories-with-coding-agents/"&gt;How to Reference Code Across Repositories with Coding Agents&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Data Science Tooling, LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-12-02&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/"&gt;What does it take to build a statistics agent?&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling, Biology &amp;amp; Chemistry, Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-12-10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/10/productive-patterns-for-agent-assisted-programming/"&gt;Productive Patterns for Agent-Assisted Programming&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;LLMs, Data Science Tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-12-17&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ericmjl.github.io/blog/2025/12/17/the-selfish-reason-to-do-your-best-work/"&gt;The selfish reason to do your best work&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Career Advice, Data Science Practice &amp;amp; Leadership&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
</content></entry><entry><title>The selfish reason to do your best work</title><link href="https://ericmjl.github.io/blog/2025/12/17/the-selfish-reason-to-do-your-best-work/" rel="alternate"/><updated>2025-12-17T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0e85eb9c-a072-3e9f-8de1-dbcb432865e4</id><content type="html">&lt;p&gt;I’ve been thinking a lot about career lately. This has been a pretty lean year for biotech; we've seen ups and downs at Moderna and across the industry. So, I want to offer a word of encouragement and a philosophy on work that I hope can be useful for you, regardless of where you are in your journey.&lt;/p&gt;
&lt;p&gt;It starts with a reframing of &lt;em&gt;why&lt;/em&gt; we work.&lt;/p&gt;
&lt;h3 id="do-your-best-work-for-yourself"&gt;Do your best work for yourself&lt;/h3&gt;&lt;p&gt;I know there is a lot of sentiment going around the internet right now about "acting your wage"—limiting your effort to exactly what you are paid for—or doing the bare minimum. The logic goes: &lt;em&gt;Why should I care about doing my best for my job if my company doesn't care for me?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I get where that sentiment comes from. But I want to redirect your attention a little bit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You do not have to do your best work for your company. You should do your best work for &lt;em&gt;yourself&lt;/em&gt; at the company you’re at.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It just so happens that the company will benefit, but you should treat your effort as an investment in your own professional instincts and habits. Sooner or later, you may not be at that company. You don't actually have to care about the entity itself. If you don't care about the people who work at your workplace, no one is compelling you to. (Though, if you do happen to like your colleagues—which is true for me where I work—then that’s all good.)&lt;/p&gt;
&lt;p&gt;But even if you can’t find much to be inspired by, do your best work anyway. Why? Because you are building the person you will be in five or ten years.&lt;/p&gt;
&lt;p&gt;President Obama once gave this advice to young interns: "Don't ask for the plum assignments. Just knock out everything you're doing." I guarantee you someone will notice. Even if no one at your current company notices, if you build a track record of quality, people &lt;em&gt;outside&lt;/em&gt; will notice.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/YNY4UFaHbP4?si=_tkyjLX25IWCfF0L" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;&lt;p&gt;Your reputation precedes you. It is the one thing you accumulate over time that serves as a form of wealth that can never be taken away from you. Only you can lose it. To borrow a phrase from Jocko Willink, this is "Extreme Ownership"—taking total responsibility for your world. Yes, circumstances happen, but if you guard your reputation well, it is yours to keep.&lt;/p&gt;
&lt;p&gt;Think about it: Who knows where you will be ten years down the road? If you are a software engineer or data scientist now, in five years you might be a Director. You’re going to be calling the shots. If you don't take the time &lt;em&gt;now&lt;/em&gt; to practice making decisions, witnessing judgment calls, and battle-testing your engineering foresight, will you be ready?&lt;/p&gt;
&lt;p&gt;I had a former teammate who worked under me at Moderna, &lt;a href="https://www.linkedin.com/in/arkadij-kummer-78b249b9"&gt;Arkadij Kummer&lt;/a&gt;. He’s now the CTO of a startup—a title I haven't even held. I saw him put in the effort to develop the strategic thinking patterns that helped him get the skills he needed to lead a tech organization. He sought out opportunities to practice making judgment calls and owning the outcomes. You have to get those reps in early, or you won’t be ready for what happens later.&lt;/p&gt;
&lt;p&gt;So, if you are working at a company you want to leave: do not give up on investing in yourself. Fortune favors the prepared.&lt;/p&gt;
&lt;h3 id="resilience-is-also-an-investment"&gt;Resilience is also an investment&lt;/h3&gt;&lt;p&gt;Building this "career wealth" isn't just about technical execution. It's also about how you handle failure. And here is the second word of encouragement I want to offer: &lt;strong&gt;Everyone will make a blunder at some point.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Part of growing up—and part of investing in your own character—is owning up to those mistakes, being proactive about remedying them, and graciously accepting help.&lt;/p&gt;
&lt;p&gt;I have been there. I once wrote a very scathing internal blog post about my leadership at a previous company. Looking back, I sometimes think of it as the darkest weekend of my career.&lt;/p&gt;
&lt;p&gt;I was a guy who had just finished school, entered the workforce, and within three years decided I knew better than people who had done their jobs for twenty-odd years. I'm not saying it's impossible that I was right, but it was pretty presumptuous. I lashed out at other groups that I thought were incompetent, effectively attacking my own team.&lt;/p&gt;
&lt;p&gt;The leadership team responded with extreme grace. They knew the problems I outlined were real, but they also saw a junior person who hadn't picked his battles wisely.&lt;/p&gt;
&lt;p&gt;We have limited energy and limited ability to focus. There are only so many battles we can handle simultaneously. I chose a bad battle to fight.&lt;/p&gt;
&lt;p&gt;That experience prompted deep reflection. I decided to lean back into that first philosophy: I was going to go back and do a good job. Not necessarily because I was feeling proud of the company at that moment, but because I recognized that if I ever wanted to be the type of person with the authority to change things, I needed to be the best version of myself first. I needed to learn how to handle authority and how to elevate the people around me.&lt;/p&gt;
&lt;p&gt;If you are in a situation where you’ve made a mistake, the best thing you can do for your reputation is to own it. Propose an action to rectify it. Move on.&lt;/p&gt;
&lt;p&gt;Most mistakes are not unforgivable. There is a classic business story about Tom Watson, the founder of IBM, involving a subordinate who made a mistake that cost the company $600,000. Whether the amount was really $600,000 or not, the story has a lesson that rings true. The man walked into Watson's office expecting to be fired. Watson reportedly replied, "No, I just spent $600,000 training you. Why would I want to fire you?"&lt;/p&gt;
&lt;p&gt;Most people will understand. Yes, there will be delays. That is just the world we inhabit. Own the mistake, improve the process, and keep making it better. Again, not primarily for the company, but because you are preparing yourself to lead with integrity and compassion.&lt;/p&gt;
&lt;h3 id="the-wealth-that-remains"&gt;The wealth that remains&lt;/h3&gt;&lt;p&gt;I don't want you to become the whiner or the complainer. That is a habit that will stick with you for the rest of your life.&lt;/p&gt;
&lt;p&gt;Instead, I want you to play the long game. Don't let temporary frustrations dictate your long-term growth. Anything worth doing is going to be difficult. If it were easy, the reward would be fleeting. Even for the ultra-rich, like Sergey Brin, eventually the yacht gets boring. He returned to active coding at Google to work on AI because, fundamentally, humans are wired to find satisfaction in building something meaningful.&lt;/p&gt;
&lt;p&gt;So, aspire to greatness. Not just to gain a title or a promotion, but because in the process, you accumulate a wealth that cannot be taken away from you: You will have the satisfaction of mastery. You will have a battle-tested character. You will have a reputation that opens doors before you even knock.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do your best work.&lt;/strong&gt; It’s the best investment you’ll ever make. And it'll be the most selfless gift you give to yourself and others.&lt;/p&gt;
</content></entry><entry><title>Productive Patterns for Agent-Assisted Programming</title><link href="https://ericmjl.github.io/blog/2025/12/10/productive-patterns-for-agent-assisted-programming/" rel="alternate"/><updated>2025-12-10T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:fe2a71e9-46d4-3221-a3fa-3f7dd32c2786</id><content type="html">&lt;p&gt;I've been using coding agents for a while now, and I've learned a few patterns that make the experience much more productive. The thing is, a lot of these "productive patterns" aren't being shared enough—they're more like folk knowledge that you can only really pick up by watching someone else do their work live. I decided to write this blog post to kickstart conversations about the matter. Here's what works for me.&lt;/p&gt;
&lt;h2 id="build-a-detailed-plan-with-ai"&gt;Build a detailed plan with AI&lt;/h2&gt;&lt;p&gt;Before jumping into implementation, spend time building a detailed plan with your AI assistant. Iterate 2-3 times over the plan, checking every detail. You want the ability to see in your head what the code might look like—just a "fat finger sketch" of the implementation.&lt;/p&gt;
&lt;p&gt;The plan should include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Details on implementation&lt;/li&gt;
&lt;li&gt;How to test (this is the most important part)&lt;/li&gt;
&lt;li&gt;Documentation plan&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="do-docs-and-tests-first"&gt;Do docs and tests first&lt;/h2&gt;&lt;p&gt;Humans usually adhere to test-driven development if you're a software engineer, or exploration-driven software builds if you're more of a data scientist. Because of the sequential nature of generative AI, it's advantageous to instruct AI to do the docs and tests first before the implementation. This is a complex conditional probability problem. If the tests and docs are written first, the implementation has to satisfy those constraints, which leads to better code.&lt;/p&gt;
&lt;p&gt;The test plan should include instructions on how to run tests using command line tools. Don't assume the AI knows your project's specific testing setup.&lt;/p&gt;
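&lt;p&gt;To make this concrete, here is a minimal sketch of what "tests first" can look like. Everything in it is hypothetical: &lt;code&gt;mylib.stats&lt;/code&gt; and &lt;code&gt;normalize_counts&lt;/code&gt; stand in for whatever you're about to ask the AI to build. You'd have the agent write this file first, then implement until it passes:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# test_normalize_counts.py: written BEFORE the implementation exists.
# The tests pin down the contract the AI's implementation must satisfy.
import pytest

from mylib.stats import normalize_counts  # hypothetical module under test


def test_normalize_counts_sums_to_one():
    """Normalized counts should form a probability distribution."""
    result = normalize_counts([2, 3, 5])
    assert result == pytest.approx([0.2, 0.3, 0.5])


def test_normalize_counts_rejects_empty_input():
    """An empty list has no valid normalization."""
    with pytest.raises(ValueError):
        normalize_counts([])
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The prompt to the agent would then be something like: "make these tests pass, and run them with &lt;code&gt;pixi run pytest&lt;/code&gt;", so the test-runner instructions travel with the plan.&lt;/p&gt;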
&lt;h2 id="use-agents-md-as-your-repo-s-ai-university"&gt;Use AGENTS.md as your repo's AI university&lt;/h2&gt;&lt;p&gt;AGENTS.md is a great place to store the specific instructions that you need for the repo. For example, AI will tend to write &lt;code&gt;python -m ...&lt;/code&gt; as a shell command, but if I'm running a pixi project, it's better to always run &lt;code&gt;pixi run python ...&lt;/code&gt; instead. Treat AGENTS.md as the AI's university of your particular repo; it's where you encode all the project-specific knowledge that the AI needs to work effectively.&lt;/p&gt;
&lt;h2 id="control-the-pace-of-the-agent"&gt;Control the pace of the agent&lt;/h2&gt;&lt;p&gt;Know the default behavior of your agent; it may be over-eager to do lots of things. You can pace the coding agent by asking it to "slow down, walk me through the changes one at a time, starting with the most important ones first." This helps you maintain control and review changes as they happen, rather than being overwhelmed by a massive diff.&lt;/p&gt;
&lt;h2 id="leverage-local-and-command-line-tools"&gt;Leverage local and command line tools&lt;/h2&gt;&lt;p&gt;You can use local and command line tools to your advantage! Here are some examples:&lt;/p&gt;
&lt;p&gt;Firstly, the GitHub CLI (&lt;code&gt;gh&lt;/code&gt;) can be used to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Store plans on GitHub as issues first (a matter of taste—you can avoid cluttering up your local filesystem)&lt;/li&gt;
&lt;li&gt;Pull GitHub Actions logs&lt;/li&gt;
&lt;li&gt;Call out to the GitHub API for other general tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Environment management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pixi run&lt;/code&gt; ensures you're always running within the correct Python environment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uvx marimo check&lt;/code&gt; lets me check that marimo notebooks are syntactically valid&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uv run notebook.py&lt;/code&gt; lets me run notebooks as scripts to check outputs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uvx marimo export&lt;/code&gt; lets me export marimo notebooks as markdown&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Linting and quality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;markdownlint&lt;/code&gt; runs on every edit of markdown files so you never have markdown linting issues&lt;/li&gt;
&lt;li&gt;Get AI to "commit relevant files and fix issues raised by pre-commit hooks"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let agents use CLI tools and read outputs directly so that you don't have to switch between windows copying and pasting things manually.&lt;/p&gt;
&lt;h2 id="let-agents-write-temporary-tools"&gt;Let agents write temporary tools&lt;/h2&gt;&lt;p&gt;Coding agents can write their own temporary tools inside &lt;code&gt;.py&lt;/code&gt; files. Encourage coding agents to do that to test that what it wrote works on-the-fly. This is a great way to validate code before integrating it into your main codebase.&lt;/p&gt;
&lt;p&gt;You can even experiment with self-improving agents: when the agent detects you correcting its actions, it should auto-update AGENTS.md with the correct behavior. I haven't fully battle-tested this yet, but you can write an "AI constitution" at the top of AGENTS.md that instructs the agent to learn from corrections by &lt;em&gt;remembering them inside AGENTS.md&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id="develop-your-own-tools"&gt;Develop your own tools&lt;/h2&gt;&lt;p&gt;Isabel Zimmerman mentioned this in her keynote talk: develop your own tools. Here are some examples of my own:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Personal MCP productivity server&lt;/strong&gt;: gives me prompts that I can take from project to project, so I don't have to keep copying/pasting them&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shell aliases&lt;/strong&gt;: &lt;code&gt;gacp&lt;/code&gt; lets me run &lt;code&gt;git add . &amp;amp;&amp;amp; git commit &amp;amp;&amp;amp; git push&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LlamaBot git hooks&lt;/strong&gt;: auto-writes commit messages for me&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These custom tools compound over time and make your workflow significantly more efficient.&lt;/p&gt;
&lt;p&gt;These patterns have made my agent-assisted programming much more productive. Treat the AI as a collaborator that needs clear instructions, proper context, and the right tools to work effectively. Start with a good plan, control the pace, and build tools that make the whole process smoother.&lt;/p&gt;
&lt;p&gt;What patterns have you discovered? I'd love to hear what works for you—let's make this folk knowledge more accessible to everyone.&lt;/p&gt;
</content></entry><entry><title>What does it take to build a statistics agent?</title><link href="https://ericmjl.github.io/blog/2025/12/2/what-does-it-take-to-build-a-statistics-agent/" rel="alternate"/><updated>2025-12-02T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:3edf8f61-a0a7-3f5c-b699-ee70ca69638e</id><content type="html">&lt;p&gt;Within research organizations at most pharma and biotech companies, professionally-trained statisticians are often staffed at extremely low ratios relative to the number of lab scientists. By rough Fermi estimation, I'd hazard a guess that ratios anywhere from 1:10 to 1:100 are plausible, meaning most researchers have limited access to statistical expertise when they need it most, during experiment design. This statistician shortage creates a critical bottleneck in experimental design, power calculations, and biostatistical consultation—areas where proper statistical guidance can prevent costly mistakes and improve research reproducibility.&lt;/p&gt;
&lt;p&gt;This creates a costly problem. When statisticians aren't available, researchers fall back to what I call "folk statistics" - the kind you learn by immersion in a lab, or from 1-2 graduate lectures hidden within broader "laboratory methods" or "computational methods" classes. I know this because I practiced folk statistics myself in the life sciences, blindly following rules like "just do n=3" or "just use the t-test with your count data" without understanding the statistical reasoning behind these choices.&lt;/p&gt;
&lt;p&gt;The consequences are documented in stark numbers. Amgen scientists attempted to reproduce 53 landmark preclinical papers and failed in 47 cases (89%)—even after contacting original authors and exchanging reagents. Bayer's internal validation found only 20-25% of studies "completely in line" with original publications. These studies consistently identified poor experimental design and inadequate statistical analysis as major contributors. &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4461318/"&gt;Freedman et al. (2015)&lt;/a&gt; estimated $28 billion annually spent on irreproducible preclinical research in the United States alone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ccea/4461318/0e3c7aea2b45/pbio.1002165.g002.jpg" alt="Breakdown of causes of preclinical irreproducibility from Freedman et al. (2015). Study design accounts for 27.6% of irreproducibility."&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Breakdown of causes of preclinical irreproducibility from Freedman et al. (2015). Study design accounts for 27.6% of irreproducibility.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;At the individual experiment level, this translates to teams &lt;strong&gt;throwing out hard-won experimental data&lt;/strong&gt; that can cost anywhere from thousands to hundreds of thousands of dollars to collect per round, &lt;strong&gt;wasting up to millions of dollars downstream&lt;/strong&gt; by basing decisions on poorly-collected data, and &lt;strong&gt;missing opportunities&lt;/strong&gt; to set up machine learners with high quality laboratory data that could shortcut the amount of laboratory experimentation needed.&lt;/p&gt;
&lt;p&gt;I took one semester of graduate-level biostatistics, then a decade of self-study in Bayesian statistics, followed by professional work where accurate estimation was critical—whether estimating half-life of a molecule, binding affinity of an antibody, or other performance properties. Through this journey, I no longer trust folk statistics. Folk statistics relies on faulty assumptions—like "n=3 is all you'll really need," "use the t-test for count data," or "calculate the SEM and don't show the SD"—which influence bad decision-making when people don't know better. Once you see how these assumptions break down and lead to wrong conclusions, you can't unsee it. Quantities like half-life, binding affinity, and other performance properties need to be accurately estimated through proper experimental design and statistically-informed mechanistic modeling.&lt;/p&gt;
&lt;p&gt;Statisticians are expensive, but they're also 100% critical for generating high quality, high fidelity data. Their role at the experiment design phase is usually that of a consultant, asking probing questions to ensure experiments are designed with good controls, confounders are accounted for, and the right statistical models are chosen. The question is: &lt;strong&gt;can we scale this expertise?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Not to replace statisticians, but to level up the organizational statistical practice &lt;em&gt;before&lt;/em&gt; researchers check in with a professionally-trained stats person. If lab scientists can think through their experimental designs more rigorously beforehand - understanding power calculations, considering confounders, planning proper controls - then the conversations they have with statisticians can be elevated. Instead of starting from scratch, they can engage in more sophisticated discussions about design trade-offs, model selection, and advanced statistical considerations. In turn, this amplifies the value of the statistician's time and improves outcomes for everyone.&lt;/p&gt;
&lt;p&gt;I was inspired by Dr. Emi Tanaka's &lt;a href="https://emitanaka.org/slides/AASC2024/#/title-slide"&gt;slides on extracting elements of statistical experiment design using LLMs&lt;/a&gt;, which showed how we can extract structured information like response variables, treatments, experimental units, design types, replicate structure, and controls. I decided to take a stab at building something that could do more than just extract information—something that could actually consult on experiment design.&lt;/p&gt;
&lt;p&gt;And so &lt;code&gt;stats-agents&lt;/code&gt; was born: an AI-powered statistics agent for experiment design consultation. Here's how I designed and evaluated this domain-specific AI agent.&lt;/p&gt;
&lt;p&gt;As a preface, I initially explored the ReAct pattern but &lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;switched to PocketFlow&lt;/a&gt;, a minimalist graph-based framework that replaced 307 lines of agent orchestration code with just 4 lines. This graph-based approach brought clarity, modularity, and made the execution flow explicit—exactly what I needed for building a robust statistics agent.&lt;/p&gt;
&lt;p&gt;So how did I go about building this agent?&lt;/p&gt;
&lt;p&gt;Deeply influenced by Clayton Christensen's books, I actually started with "what's the job for this agent to be done?" I initially considered building a single agent that could handle both experiment design consultation and statistical analysis of collected data. However, I quickly realized these are fundamentally different phases with different goals, tools, and interaction patterns.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;experiment design phase&lt;/strong&gt; is consultative and exploratory - it's about asking questions, understanding constraints, identifying potential issues, and helping researchers think through their design &lt;em&gt;before&lt;/em&gt; data collection. The &lt;strong&gt;analysis phase&lt;/strong&gt; is more technical - it's about taking collected data and building statistical models to estimate quantities of interest.&lt;/p&gt;
&lt;p&gt;I decided to focus the agent on the design phase only. This separation of concerns made the agent cleaner, less confusing, and allowed it to be optimized for its specific purpose: being an inquisitive, consultative partner during experiment design. The analysis phase would be handled separately (or by a different agent) with its own tools and prompting strategies.&lt;/p&gt;
&lt;p&gt;So I defined the agent's job description (JD) as: "an agent that will provide critique on experiment designs, suggest modifications, and help researchers think through their experimental design before data collection". It sounded oddly like a human's job description for a real job, except more specific. Notice, however, that the JD leaves room for a real human: no accountability for outcomes is placed on the agent, and a human statistician still needs to review the work, just as we wrote above.&lt;/p&gt;
&lt;p&gt;With the job scope defined, I turned to designing the tools the agent would need.&lt;/p&gt;
&lt;p&gt;The first tool I gave was &lt;code&gt;critique_experiment_design&lt;/code&gt; - a tool that provides comprehensive critique of experiment designs, identifying potential flaws, biases, weaknesses, and areas for improvement. This tool considers multiple angles including biological, statistical, and practical constraints. The agent can use this to help researchers identify issues in their designs before they collect data.&lt;/p&gt;
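&lt;p&gt;As a simplified sketch (not the real implementation, which lives in the &lt;code&gt;stats-agents&lt;/code&gt; repo; the prompt wording and model choice below are illustrative), the tool is essentially "free-text design in, free-text critique out":&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Simplified sketch of critique_experiment_design.
import llamabot as lmb


def critique_experiment_design(design_description: str) -&amp;gt; str:
    """Critique an experiment design from multiple angles."""
    critic = lmb.SimpleBot(
        system_prompt=(
            "You are a consulting biostatistician. Critique the experiment "
            "design below: statistical issues (power, replication, "
            "confounders), biological issues (controls, batch effects), "
            "and practical constraints. Suggest concrete improvements."
        ),
        model_name="ollama_chat/qwen3:30b",  # swap in your preferred model
    )
    return critic(design_description).content
&lt;/pre&gt;&lt;/div&gt;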
&lt;p&gt;The second tool I gave was one I previously wrote about: the ability to execute code (&lt;code&gt;write_and_execute_code&lt;/code&gt;). I wanted this available to answer questions like "what should my data table look like?"&lt;/p&gt;
&lt;p&gt;I've noticed that having a sample data table in front of us when discussing experimental designs is incredibly clarifying—it cuts through abstract confusion toward concrete understanding. This tool enables the agent to generate sample data tables, perform power calculations, create plate map visualizations, and other dynamic analyses.&lt;/p&gt;
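&lt;p&gt;For instance, when asked "what should my data table look like?", the agent can generate and run something like this (the column names are illustrative):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# The kind of code the agent generates via the code execution tool to
# show a researcher the shape of their future data before any is collected.
import pandas as pd

rows = [
    {"treatment": t, "replicate": r, "readout": None}  # None = to be measured
    for t in ["control", "drug_A"]
    for r in range(1, 4)
]
sample_table = pd.DataFrame(rows)
print(sample_table)
&lt;/pre&gt;&lt;/div&gt;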
&lt;p&gt;&lt;strong&gt;Important security note&lt;/strong&gt;: The ability to execute arbitrary Python code is powerful but also dangerous. An agent that can execute code can delete files, modify system configurations, access sensitive data, make network requests, and more. For any production deployment, this agent must run in a containerized environment with strict isolation, resource limits, and no access to secrets or credentials. This isn't optional - it's a cybersecurity requirement! For my development and testing, I ran it in a controlled environment on my own machine, but production deployment would require proper containerization.&lt;/p&gt;
&lt;p&gt;With the tools defined, the next challenge was evaluation: how do you know if the agent is actually working? This turned out to be more complex than I initially expected.&lt;/p&gt;
&lt;p&gt;The evaluation process had two distinct phases: an exploration-guided MVP phase (vibes-driven) and a post-MVP systematic testing phase.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1: MVP Development (Vibes-Driven)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During the MVP phase, I defined a fixed conversation path and repeatedly tested it manually in a Marimo notebook's chat UI. I tested with a sequence like: asking for design critique, providing an experiment description, requesting power calculations, and asking for sample data tables. As I found errors, I fixed them immediately, contrary to evaluation best practices, but appropriate for this exploratory phase. The tool definitions weren't settled until this phase was complete.&lt;/p&gt;
&lt;p&gt;I also used Cursor (a coding agent) to help diagnose issues, explore multiple solutions, and get different perspectives before committing to fixes. This "multiple AI opinions before committing" pattern follows a similar philosophy to Geoffrey Litt's &lt;a href="https://www.geoffreylitt.com/2025/10/24/code-like-a-surgeon"&gt;"Code like a surgeon"&lt;/a&gt; approach: spike out an attempt at a big change, review it as a sketch of where to go, and often you won't use the result directly—but it helps you understand the problem space better.&lt;/p&gt;
&lt;p&gt;Rather than accepting the first AI suggestion, I'd ask multiple questions, explore several solution approaches, and understand the trade-offs before making an informed decision. When debugging complex issues like the closure vs. shared state problem, I'd often ask multiple models the same question to see if they'd converge on the same diagnosis—if different LLMs independently arrived at the same answer, that was a good sign the solution was on the right track. This led to better architecture decisions and fewer instances of "I wish I had done it differently."&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2: Systematic Evaluation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once the baseline behavior was satisfactory, I moved to systematic evaluation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benchmark prompt&lt;/strong&gt;: I created a "perfect prompt" document (&lt;code&gt;experiment_design_for_power_calc.md&lt;/code&gt;) that served as my regression test suite. This complete, detailed experiment design specification should trigger specific agent behaviors, such as asking the right questions, performing power calculations correctly, providing contextual explanations. Every time I modified the system prompt or tools, I'd run this same benchmark and check: did it still work? Without this, I found myself making changes that broke things in subtle ways, or losing track of what "good" behavior even looked like. &lt;em&gt;The benchmark prompt became my north star, a concrete example of the agent working as intended.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Synthetic test generation&lt;/strong&gt;: Once I had a reliable benchmark, I expanded to systematic evaluation with variation. Starting with five examples from Tanaka's slide deck, I used the methodology from Shreya Shankar and Hamel Husain (researchers who developed systematic approaches for LLM evaluation) to generate synthetic chat examples by selecting 1-3 axes of variation. I chose experimental domain (biotech vs. agriculture) as the primary axis, while varying the statistical expertise level of the simulated user. This process generated dozens of conversation traces that, while not exhaustive, represented draws from my constrained prior belief about likely conversations.&lt;/p&gt;
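&lt;p&gt;Mechanically, the generation step can be as simple as crossing seed scenarios with the chosen axes of variation (the seeds and axis values below are placeholders):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Cross seed designs with axes of variation to produce synthetic
# conversation openers, each of which seeds a simulated chat trace.
import itertools

seed_designs = [
    "a dose-response assay for a small molecule in a 96-well plate",
    "a field trial comparing two fertilizer regimes across plots",
]
domains = ["biotech", "agriculture"]
expertise = ["novice", "stats-savvy"]

openers = [
    f"[domain={d}, user={e}] I'm planning {s}. Can you critique the design?"
    for s, d, e in itertools.product(seed_designs, domains, expertise)
]
print(len(openers), "conversation openers generated")
&lt;/pre&gt;&lt;/div&gt;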
&lt;p&gt;Systematic evaluation with these varied conversation traces revealed patterns I never would have noticed through manual testing. Through these traces, I identified three major categories of failure modes that needed to be addressed.&lt;/p&gt;
&lt;h2 id="failure-mode-1-multi-step-execution-breakdown"&gt;Failure mode 1: Multi-step execution breakdown&lt;/h2&gt;&lt;p&gt;The first major issue I encountered was that the agent couldn't chain tool calls effectively. When the agent executed code to perform a power calculation, it would store the result in a variable like &lt;code&gt;mtt_power_analysis_result&lt;/code&gt;. But when it tried to analyze that result in a subsequent tool call, it would fail with a &lt;code&gt;NameError&lt;/code&gt; - the variable simply wasn't accessible.&lt;/p&gt;
&lt;p&gt;The root cause was subtle: the code execution tool (&lt;code&gt;write_and_execute_code_wrapper&lt;/code&gt;) was using a closure variable that captured the notebook's globals at initialization time. However, the agent framework (AgentBot) stores results in a separate shared dictionary (&lt;code&gt;shared["globals_dict"]&lt;/code&gt;). These two dictionaries were disconnected; think of them as two separate notebooks that couldn't see each other's variables. So when the agent created a variable in one tool call, it wasn't visible to the next.&lt;/p&gt;
&lt;p&gt;The fix required connecting them: I modified the code execution tool to accept an optional &lt;code&gt;_globals_dict&lt;/code&gt; parameter. When provided, it uses the agent's shared dictionary instead of its own isolated one. This allows results from one tool call to be accessible in subsequent calls, enabling true multi-step workflows where the agent can build on previous results.&lt;/p&gt;
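&lt;p&gt;Here is the fix in miniature (a simplified sketch; the real tool in LlamaBot has more plumbing). The key idea is that both tool calls must execute against the &lt;em&gt;same&lt;/em&gt; dictionary:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Simplified sketch of the _globals_dict fix: executing against the
# agent's shared namespace lets variables survive across tool calls.
def write_and_execute_code_wrapper(code: str, _globals_dict: dict | None = None):
    namespace = _globals_dict if _globals_dict is not None else {}
    exec(code, namespace)  # new variables land in the shared namespace
    return namespace


shared = {"globals_dict": {}}
# First tool call defines a variable...
write_and_execute_code_wrapper("result = 2 + 2", shared["globals_dict"])
# ...and a later tool call can see it, because both used the same dict.
write_and_execute_code_wrapper("print(result * 10)", shared["globals_dict"])  # 40
&lt;/pre&gt;&lt;/div&gt;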
&lt;h2 id="failure-mode-2-display-formatting-and-contextual-output"&gt;Failure mode 2: Display formatting and contextual output&lt;/h2&gt;&lt;p&gt;The second category of issues involved both technical display problems and behavioral output quality. When the agent returned Python objects (DataFrames, matplotlib figures, etc.), they weren't displaying properly in the Marimo chat interface. The agent would return a dictionary with these objects, but Marimo's chat UI doesn't automatically render matplotlib &lt;code&gt;Figure&lt;/code&gt; objects embedded in dictionaries.&lt;/p&gt;
&lt;p&gt;But there was a deeper behavioral problem: the agent was dumping DataFrames and plots without any explanatory text. I'd ask for a power analysis, and the agent would return a raw DataFrame with numbers: no context, no interpretation, no explanation of what I was looking at. My sense of taste rebelled. This wasn't just a technical problem; it was an aesthetic one. I wanted a polished, consultant-like experience where explanatory text naturally flows between objects, not a data dump.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sometimes the best technical solutions come from caring about how things look and feel.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I solved both problems through a combination of technical infrastructure and prompt engineering. With AI assistance, I created a formatter system that processes dictionaries containing text and objects, converting strings to markdown and passing objects through for native display. This handled the technical display issue. It's implemented within &lt;code&gt;llamabot/components/formatters.py&lt;/code&gt; as &lt;code&gt;create_marimo_formatter&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;But to solve the behavioral problem—getting the agent to actually provide contextual text—I had to update the system prompt. I added explicit guidance requiring that every object be preceded (and ideally followed) by explanatory text that connects back to the researcher's goals. The prompt now instructs the agent to create a dictionary with explanatory text strings interleaved with objects, then return it. The formatter processes this dictionary, creating that polished, consultant-like experience where text naturally flows between objects.&lt;/p&gt;
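&lt;p&gt;Conceptually, the formatter does something like the following (an illustrative sketch, not the actual &lt;code&gt;create_marimo_formatter&lt;/code&gt; code):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch of the formatter idea: strings render as markdown, and
# everything else (DataFrames, figures) passes through to marimo's
# native display, stacked vertically in order.
import marimo as mo


def format_agent_output(result: dict):
    parts = []
    for value in result.values():
        if isinstance(value, str):
            parts.append(mo.md(value))  # explanatory text as markdown
        else:
            parts.append(value)  # objects displayed natively
    return mo.vstack(parts)
&lt;/pre&gt;&lt;/div&gt;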
&lt;h2 id="failure-mode-3-agent-decision-making-and-domain-knowledge"&gt;Failure mode 3: Agent decision-making and domain knowledge&lt;/h2&gt;&lt;p&gt;The third category of failures involved the agent's decision-making process and understanding of domain conventions. Through systematic evaluation, I discovered several patterns.&lt;/p&gt;
&lt;p&gt;Here's a concrete example of how the prompt evolved. Initially, I had a simple instruction: "Ask clarifying questions about experiment goals, constraints, and assumptions." But the agent kept jumping straight to calculations. So I added:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before (early version):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Ask clarifying questions about experiment goals, constraints, and assumptions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;After (refined version):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;CRITICAL - BE INQUISITIVE FIRST&lt;/strong&gt;: Before jumping into calculations, you MUST ask probing questions to understand the full context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What effect size are they expecting or hoping to detect? Why?&lt;/li&gt;
&lt;li&gt;What is the expected variability in their measurements? Do they have pilot data?&lt;/li&gt;
&lt;li&gt;What are their practical constraints (budget, time, sample availability)?&lt;/li&gt;
&lt;li&gt;What are they most worried about with this experiment?&lt;/li&gt;
&lt;li&gt;Have they done similar experiments before? What issues did they encounter?&lt;/li&gt;
&lt;li&gt;What would make this experiment a "success" in their view?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Only after gathering this context&lt;/strong&gt; should you use &lt;code&gt;write_and_execute_code_wrapper&lt;/code&gt; to perform calculations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;The agent wasn't inquisitive enough.&lt;/strong&gt; Despite the system prompt emphasizing the need to ask questions, the agent would often jump straight to calculations without first understanding the researcher's context, constraints, and goals. I had to add multiple "CRITICAL" reminders in the prompt, explicitly stating that questioning should happen BEFORE calculations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The agent was trying to pass variable names as strings.&lt;/strong&gt; When the agent wanted to analyze a result from a previous tool call, it would try to write a function like &lt;code&gt;def analyze(mtt_power_analysis_result):&lt;/code&gt; and pass &lt;code&gt;{"mtt_power_analysis_result": "mtt_power_analysis_result"}&lt;/code&gt; - which passes the string literal, not the actual dictionary! I had to explicitly teach the agent to write functions with NO parameters that access variables directly from globals.&lt;/p&gt;
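&lt;p&gt;A minimal before/after illustrates the failure and the fix:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Pretend an earlier tool call stored this in the shared globals:
mtt_power_analysis_result = {"power": 0.82}


# Broken pattern: the agent passed the variable's NAME as a string,
# so the function received "mtt_power_analysis_result", not the dict.
def analyze(result):
    return result["power"]

# analyze("mtt_power_analysis_result")  # TypeError: string, not dict


# Taught pattern: a zero-parameter function that reads the variable
# directly from the shared globals namespace.
def analyze_power_result():
    return mtt_power_analysis_result["power"]


print(analyze_power_result())  # 0.82
&lt;/pre&gt;&lt;/div&gt;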
&lt;p&gt;&lt;strong&gt;Contradictions in the system prompt.&lt;/strong&gt; There were conflicting instructions about when to use &lt;code&gt;respond_to_user&lt;/code&gt; vs &lt;code&gt;return_object_to_user&lt;/code&gt;. I resolved this by clarifying: use &lt;code&gt;respond_to_user&lt;/code&gt; for text-only responses, and &lt;code&gt;return_object_to_user&lt;/code&gt; when you have Python objects to display.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The agent didn't understand domain-specific visualizations.&lt;/strong&gt; When the test user (a.k.a. me) asked for a "plate map" or "plate layout visualization," the agent would generate something, but it often wasn't what researchers expected.&lt;/p&gt;
&lt;p&gt;A plate map visualization in experimental biology is a very specific thing: a heatmap-style 8×12 grid (rows A-H, columns 1-12) where each well is color-coded by treatment group with a clear legend. Without explicit guidance, the agent would create generic bar charts or scatter plots that didn't match these domain conventions.&lt;/p&gt;
&lt;p&gt;I solved this by adding detailed specifications in the system prompt that describe exactly what plate map visualizations should look like, including the grid structure (rows labeled A-H, columns 1-12 for 96-well plates), color coding requirements (different colors per treatment, with a legend), complete code patterns showing how to create them with matplotlib, common plate formats (96-well, 384-well, 1536-well), and trigger phrases that should generate plate maps.&lt;/p&gt;
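&lt;p&gt;Here is roughly the kind of code pattern the prompt now specifies (a sketch with a made-up treatment layout, not the prompt's exact example):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch of a 96-well plate map: an 8x12 heatmap-style grid with rows
# A-H, columns 1-12, wells color-coded by treatment, plus a legend.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch

treatments = ["control", "drug_A", "drug_B"]
colors = ["#cccccc", "#1f77b4", "#ff7f0e"]

plate = np.zeros((8, 12), dtype=int)  # made-up layout, assigned column-wise
plate[:, 4:8] = 1   # columns 5-8: drug_A
plate[:, 8:12] = 2  # columns 9-12: drug_B

fig, ax = plt.subplots(figsize=(8, 5))
ax.imshow(plate, cmap=ListedColormap(colors), aspect="equal")
ax.set_xticks(range(12), labels=[str(i) for i in range(1, 13)])
ax.set_yticks(range(8), labels=list("ABCDEFGH"))
ax.set_title("96-well plate map (sketch)")
handles = [Patch(facecolor=c, label=t) for t, c in zip(treatments, colors)]
ax.legend(handles=handles, loc="upper left", bbox_to_anchor=(1.02, 1.0))
plt.tight_layout()
plt.show()
&lt;/pre&gt;&lt;/div&gt;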
&lt;p&gt;This pattern - providing detailed specifications for domain-specific outputs - became a key strategy. When the agent needs to generate something that follows domain conventions, it needs explicit guidance on what those conventions are.&lt;/p&gt;
&lt;p&gt;Addressing these behavioral issues required multiple rounds of iteration. The system prompt didn't reach its final form in one go - as I discovered each issue, I added more explicit guidance, examples, and "CRITICAL" warnings. The prompt grew substantially through iterative refinement, and the agent's behavior improved dramatically with each iteration.&lt;/p&gt;
&lt;p&gt;The process wasn't linear during the early iteration phases - I'd fix one issue, test it, discover another, fix that, and sometimes realize the first fix needed refinement. Working with the Cursor coding agent helped me identify contradictions, explore multiple solution approaches, and get different perspectives before committing to changes. This iterative refinement process is essential when building domain-specific agents: you can't anticipate all the behavioral issues upfront, so you need to be prepared to evolve the prompt based on what you discover through testing.&lt;/p&gt;
&lt;h2 id="conclusion-key-lessons-for-building-domain-specific-agents"&gt;Conclusion: Key lessons for building domain-specific agents&lt;/h2&gt;&lt;p&gt;Building this AI statistics agent for experiment design revealed several important patterns that apply broadly to building domain-specific AI agents and LLM-powered tools:&lt;/p&gt;
&lt;h3 id="1-the-system-prompt-is-the-primary-control-surface"&gt;1. The system prompt is the primary control surface&lt;/h3&gt;&lt;p&gt;The agent's personality, decision-making process, inquisitiveness, and ability to provide contextual explanations are primarily controlled through prompt design rather than code changes. The prompt is where the domain knowledge lives, where the behavioral patterns are encoded, and where the "personality" of the agent is defined. Code provides the infrastructure, but the prompt provides the intelligence.&lt;/p&gt;
&lt;p&gt;If you want to change the agent's behavior, you're often better off modifying the prompt than changing the code. The prompt became a detailed instruction manual that teaches the agent not just &lt;em&gt;what&lt;/em&gt; to do, but &lt;em&gt;how&lt;/em&gt; to think, &lt;em&gt;when&lt;/em&gt; to ask questions, and &lt;em&gt;why&lt;/em&gt; certain patterns matter.&lt;/p&gt;
&lt;h3 id="2-testing-is-essential-but-not-sufficient"&gt;2. Testing is essential, but not sufficient&lt;/h3&gt;&lt;p&gt;Even with extensive testing, you cannot guarantee what will be seen in the real world. Users will ask questions you never thought of, use terminology you didn't anticipate, have edge cases in their data you didn't consider, and interact with the agent in ways that break your assumptions.&lt;/p&gt;
&lt;p&gt;This is why the agent is designed as a &lt;em&gt;consultant&lt;/em&gt; rather than an autonomous decision-maker. A human statistician still needs to review the work. The testing process is essential for building confidence, but you must also design the system with the assumption that it will encounter unexpected situations. This means implementing clear error handling, graceful degradation when things go wrong, explicit boundaries on what the agent can and cannot do, and human oversight for critical decisions.&lt;/p&gt;
&lt;h2 id="final-thoughts"&gt;Final thoughts&lt;/h2&gt;&lt;p&gt;The process of building this agent revealed something I didn't expect: you can't anticipate all the behavioral issues upfront. The prompt grew from ~200 to ~600 lines through iterative discovery—each failure mode required explicit guidance I didn't know I'd need. Building domain-specific agents means being prepared to evolve your approach based on what you discover through testing, not just what you plan in advance.&lt;/p&gt;
&lt;h2 id="try-it-yourself"&gt;Try it yourself&lt;/h2&gt;&lt;p&gt;The experiment design agent is available as a Marimo notebook in the &lt;a href="https://github.com/ericmjl/llamabot"&gt;LlamaBot repository&lt;/a&gt;. You can run it locally with:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;git@github.com:ericmjl/llamabot.git
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;llamabot/notebooks
uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;--watch&lt;span class="w"&gt; &lt;/span&gt;notebooks/experiment_design_agent.py
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The agent is designed to be inquisitive and consultative—it will ask probing questions about your experiment goals, constraints, and assumptions before providing recommendations. This AI statistics agent can help with power calculations, experimental design critique, sample data table generation, plate map visualizations, and biostatistical consultation for researchers in pharma and biotech.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: This agent is a prototype focused on the experiment design phase. It's not a replacement for human statisticians—it's designed to amplify their knowledge and help researchers think through their designs before data collection. The agent requires human oversight and review, especially for high-stakes decisions. I haven't tested it across all experimental design types, and it may struggle with highly specialized domains or unusual constraints.&lt;/p&gt;
&lt;p&gt;If you're interested in building your own domain-specific agent, I hope the lessons and patterns shared here provide a useful starting point. The code is open source, and I welcome contributions and feedback.&lt;/p&gt;
&lt;p&gt;I'm working on a part 2 of this blog post, where I'll build out the statistical analysis agent—the companion to this experiment design agent. That post will cover how to build an agent that takes collected data and performs statistical analysis, model fitting, and interpretation.&lt;/p&gt;
&lt;p&gt;Thank you for reading this far. If you made it here, you've invested real time and attention in understanding not just what I built, but how and why—and that means a lot! Building this agent has been four months in the making, and sharing those discoveries with others who care about the same problems is what makes the work worthwhile. I'm grateful you came along for the ride!&lt;/p&gt;
</content></entry><entry><title>How to Reference Code Across Repositories with Coding Agents</title><link href="https://ericmjl.github.io/blog/2025/11/17/how-to-reference-code-across-repositories-with-coding-agents/" rel="alternate"/><updated>2025-11-17T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:438621db-6cf5-302d-b316-e2964a8f9e5a</id><content type="html">&lt;p&gt;I used to assume that coding agents like Cursor, GitHub Copilot, and Claude Code only work within a single workspace. This mental model led me to workarounds like copying files, creating complex multi-root workspace configurations, or constantly switching between projects.&lt;/p&gt;
&lt;p&gt;But coding agents can already read and write files from anywhere on your file system, not just the current workspace. The limitation wasn't in the tools; it was in my awareness of what they can do. You don't need to add folders to workspaces, create multi-root workspaces, or jump through configuration hoops. If you know where a repository lives on your disk, you can reference it directly.&lt;/p&gt;
&lt;h2 id="how-to-reference-code-from-other-repositories"&gt;How to reference code from other repositories&lt;/h2&gt;&lt;p&gt;The key is being explicit about file paths. Modern AI coding assistants like Cursor, GitHub Copilot, and Claude Code can access your entire file system, not just the current workspace. You just need to tell them where to look.&lt;/p&gt;
&lt;p&gt;I do most of my writing in an Obsidian vault, which isn't a Git repository; it's just a folder on disk. Sometimes I need to reference code from my LlamaBot repository, or other code repositories in which I am doing development. Instead of copying files or creating complex workspace configurations, I just tell the agent to read directly from the other directory.&lt;/p&gt;
&lt;p&gt;When I need the agent to understand something from LlamaBot, I can say "read the implementation from &lt;code&gt;~/github/llamabot/llamabot/bot/simplebot.py&lt;/code&gt;" and it works immediately. The key is being explicit with the path. You can also search by filename within a directory: "find the notebook named &lt;code&gt;pocketflow_testdrive.py&lt;/code&gt; in &lt;code&gt;~/github/llamabot&lt;/code&gt;". The agent reads the file directly from disk, no workspace configuration needed. You don't need to document paths anywhere; just reference them directly when you need them. That said, if you have commonly accessed paths, documenting them in &lt;code&gt;AGENTS.md&lt;/code&gt; can be helpful for quick reference.&lt;/p&gt;
&lt;p&gt;I used this method while writing my blog post &lt;a href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/"&gt;"How I Replaced 307 Lines of Agent Code with 4 Lines"&lt;/a&gt;. I was drafting the post in my Obsidian vault, but the actual code examples lived in a Marimo notebook within the LlamaBot repository. Rather than copying code snippets or switching workspaces, I had the agent read directly from &lt;code&gt;~/github/llamabot&lt;/code&gt; to pull in the exact implementation details I needed. This let me write about the code while staying in my writing environment, with the agent able to reference the actual source files to ensure accuracy.&lt;/p&gt;
&lt;h2 id="file-system-access-for-ai-coding-assistants-enables-this"&gt;File system access for AI coding assistants enables this&lt;/h2&gt;&lt;p&gt;Coding agents that have file system access can perform read and write operations anywhere they have permission. Tools like Cursor, GitHub Copilot, and Claude Code aren't restricted to the current workspace directory. This works because agents have access to shell tools, the most generic, text-based interface to computers. Shell commands produce text output that agents can read and understand, and they can execute commands anywhere on your system. This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can reference code from any repository on your machine&lt;/li&gt;
&lt;li&gt;You can pull in documentation from other projects&lt;/li&gt;
&lt;li&gt;You can compare implementations across different codebases&lt;/li&gt;
&lt;li&gt;You can reference configuration files from related projects&lt;/li&gt;
&lt;li&gt;You can modify files across multiple repositories when needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The only requirement is that you know the path and can tell the agent where to look.&lt;/p&gt;
&lt;h2 id="common-scenarios"&gt;Common scenarios&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Blogging:&lt;/strong&gt; When writing blog posts about code, reference implementation details from your repositories. The agent can read the actual code to ensure accuracy, pulling in exact examples without copying files or switching workspaces.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Architecture decisions:&lt;/strong&gt; Compare how similar problems are solved across different projects. The agent can read multiple implementations and help you understand trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code reuse:&lt;/strong&gt; Before copying code, have the agent check if similar functionality exists elsewhere. It can read files from other repos to find existing solutions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dependency understanding:&lt;/strong&gt; When working with a library you maintain, reference the library's source code directly. The agent can read implementation details to help you use it correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-repository updates:&lt;/strong&gt; Update related files across multiple repositories simultaneously. For example, update documentation in one repo while modifying the implementation in another, or sync configuration changes across related projects.&lt;/p&gt;
&lt;h2 id="step-by-step-workflow-for-cross-repository-code-access"&gt;Step-by-step workflow for cross-repository code access&lt;/h2&gt;&lt;p&gt;The key trick is being explicit with paths or explicit instructions about how to get to files. Here's how to do it:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For repositories you already have cloned locally:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reference the absolute path directly when asking the agent: "read &lt;code&gt;~/github/llamabot/src/llamabot/bot/simple.py&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;Or search by filename within a directory: "find the notebook named &lt;code&gt;pocketflow_testdrive.py&lt;/code&gt; in &lt;code&gt;~/github/llamabot&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;The agent reads the file immediately, no workspace configuration needed&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;For repositories you don't have locally:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Tell the agent exactly how to get to the file: "clone the repo &lt;code&gt;owner/repo&lt;/code&gt; into a temporary directory, then find the file at relative path &lt;code&gt;path/to/file.py&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;You can also specify a specific commit, branch, or tag: "clone the repo &lt;code&gt;owner/repo&lt;/code&gt; at commit &lt;code&gt;abc123&lt;/code&gt; into a temporary directory, then find the file at relative path &lt;code&gt;path/to/file.py&lt;/code&gt;" or "clone the repo &lt;code&gt;owner/repo&lt;/code&gt; and checkout branch &lt;code&gt;feature-branch&lt;/code&gt;, then find the file at relative path &lt;code&gt;path/to/file.py&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;The agent executes these commands using command line tools like &lt;code&gt;gh&lt;/code&gt; CLI or &lt;code&gt;git&lt;/code&gt;, reads what it needs, and can clean up the temporary clone when done&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;No workspace management. No file copying. No complex configuration. Just explicit paths or explicit instructions. The agent needs clear direction on where to find files, whether that's an absolute path on your system or step-by-step instructions to clone and navigate to a file.&lt;/p&gt;
&lt;h2 id="common-questions-and-limitations"&gt;Common questions and limitations&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Do I need to configure workspace settings?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No. Unlike traditional IDE workspace configurations, you don't need to add folders to workspaces or create multi-root setups. Just reference paths directly when you need them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How do I manage paths for many repositories?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You don't need to document them anywhere. Just reference paths directly when asking the agent to read files. If you find yourself referencing the same paths repeatedly, you can optionally document them in &lt;code&gt;AGENTS.md&lt;/code&gt; for convenience, but it's not required. You can also use MCP server prompts like &lt;code&gt;/remember&lt;/code&gt; (from my &lt;a href="https://github.com/ericmjl/ericmjl-productivity-mcp"&gt;personal productivity MCP server&lt;/a&gt;) to automatically capture frequently-used paths. The &lt;code&gt;/remember&lt;/code&gt; prompt reviews your conversation, identifies important learnings like repository paths, and adds timestamped entries to &lt;code&gt;AGENTS.md&lt;/code&gt; in the appropriate section.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can agents modify files in other repositories?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes, but be mindful. While agents can read and write files anywhere on your file system, it's easy to accidentally change files in other repositories. Use this capability deliberately rather than accidentally. Consider using read-only access for cross-repository references unless you specifically need to modify files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What if I don't have the repository cloned locally?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Have the agent clone it temporarily using command line tools. The agent can use &lt;code&gt;gh&lt;/code&gt; CLI or &lt;code&gt;git&lt;/code&gt; commands to clone repositories into temporary directories, read what it needs, and clean up afterward.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;&lt;p&gt;Thanks to shell tools, coding agents like Cursor, GitHub Copilot, and Claude Code aren't limited by workspace boundaries. They can access your entire file system for both reading and writing, so you can build workflows that span multiple projects without complex tooling.&lt;/p&gt;
&lt;p&gt;The simplicity is the point. You don't need special workspace configurations or multi-root setups. You just need to know where things live and tell the agent where to look. Reference paths directly, or have the agent clone repositories temporarily when needed.&lt;/p&gt;
&lt;p&gt;When you need to reference code from another repository, the agent can read it directly. Just point it to the path. This technique works with any AI coding assistant that has file system access, making it a universal solution for cross-repository code access.&lt;/p&gt;
</content></entry><entry><title>How I Replaced 307 Lines of Agent Code with 4 Lines</title><link href="https://ericmjl.github.io/blog/2025/11/16/how-i-replaced-307-lines-of-agent-code-with-4-lines/" rel="alternate"/><updated>2025-11-16T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:2e208dfe-9d34-3af0-9c37-4a21c5a8528a</id><content type="html">&lt;p&gt;I recently discovered &lt;a href="https://github.com/The-Pocket/PocketFlow?tab=readme-ov-file"&gt;PocketFlow&lt;/a&gt;, a framework for building LLM-enabled programs created by &lt;a href="https://zachary62.github.io/zach_public_material/"&gt;Zachary Huang&lt;/a&gt;. The entire framework is tiny—only 100 lines of code. What caught my attention is that PocketFlow takes a fundamentally different approach to LLM-powered programs, including Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents"&gt;workflows and agents&lt;/a&gt;, by structuring them as graphs.&lt;/p&gt;
&lt;p&gt;As someone who used graphs in my thesis work, &lt;a href="https://ericmjl.github.io/Network-Analysis-Made-Simple/"&gt;taught tutorials on applied graph theory&lt;/a&gt;, and &lt;a href="https://github.com/ericmjl/llamabot"&gt;builds my own agent frameworks&lt;/a&gt;, my curiosity was piqued. I wanted to see two things: whether I could learn enough of the framework to build something useful, and whether LlamaBot's abstractions could complement PocketFlow's approach.&lt;/p&gt;
&lt;p&gt;To explore this, I fired up a &lt;a href="https://marimo.io/"&gt;Marimo notebook&lt;/a&gt;. (You can fire it up too by running: &lt;code&gt;uvx marimo edit --sandbox &amp;lt;URL to the notebook&amp;gt;&lt;/code&gt;)&lt;/p&gt;
&lt;h2 id="understanding-the-core-nodes-and-flows"&gt;Understanding the Core - Nodes and Flows&lt;/h2&gt;&lt;p&gt;I started by building what I consider a "Hello World" program: a text topic extractor and question generator. This let me familiarize myself with PocketFlow's two core abstractions: &lt;code&gt;Nodes&lt;/code&gt; and &lt;code&gt;Flows&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;Node&lt;/code&gt; is a unit of execution structured like this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;SummarizeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ...do stuff...&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stuff_that_gets_passed_to_exec&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ...do stuff...&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stuff_that_gets_passed_to_post&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ...do stuff...&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;string_indicator_what_to_do_next&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There's one more concept to introduce: &lt;code&gt;shared&lt;/code&gt;. In PocketFlow, &lt;code&gt;shared&lt;/code&gt; is like a big workspace that all &lt;code&gt;Node&lt;/code&gt;s can read and write from. Think of it as a kitchen island where chefs and cooks can grab ingredients and leave finished dishes. In computing terms, it's global state that programs can access. In practice, it's simply a dictionary that lives in memory, which any node can manipulate. For example, program &lt;code&gt;memory&lt;/code&gt; might be a key in there, implemented as a list.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;prep -&amp;gt; exec -&amp;gt; post&lt;/code&gt; design within a node is intentional. In theory, you could do everything in one step—there are no hooks that inject stuff between, say, &lt;code&gt;prep&lt;/code&gt; and &lt;code&gt;exec&lt;/code&gt;. In practice, doing everything in one step muddies the program and makes it harder to reason about. I'll show you why later in this post.&lt;/p&gt;
&lt;p&gt;Here's what each step is designed to do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;prep&lt;/code&gt;&lt;/strong&gt; takes stuff from the &lt;code&gt;shared&lt;/code&gt; dictionary, does any preprocessing, and passes it to &lt;code&gt;exec&lt;/code&gt;. This could include grabbing stuff from memory, interpolating it into a prompt, and returning it for execution with the LLM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;exec&lt;/code&gt;&lt;/strong&gt; is where the bulk of heavy computation happens. We put API calls to LLM providers (Ollama, OpenAI, Anthropic, etc.) here. What gets returned is passed to the &lt;code&gt;post&lt;/code&gt; method.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;post&lt;/code&gt;&lt;/strong&gt; handles any post-processing. It receives &lt;code&gt;shared&lt;/code&gt;, &lt;code&gt;prep_res&lt;/code&gt; (the result of &lt;code&gt;prep&lt;/code&gt;), and &lt;code&gt;exec_res&lt;/code&gt; (result of &lt;code&gt;exec&lt;/code&gt;). The pattern I've settled on is archiving results in &lt;code&gt;shared&lt;/code&gt;—for example, storing execution results in memory. What gets returned by &lt;code&gt;post&lt;/code&gt; should be a string indicating which downstream path to follow. If nothing specific is needed, it returns &lt;code&gt;default&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A &lt;code&gt;Flow&lt;/code&gt; is declared with a starting &lt;code&gt;Node&lt;/code&gt; and follows the program until completion.&lt;/p&gt;
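&lt;p&gt;In code, wiring a two-node flow looks roughly like this (a sketch: &lt;code&gt;ExtractTopics&lt;/code&gt; is defined in the next section, and &lt;code&gt;GenerateQuestions&lt;/code&gt; stands in for the second node of the pair):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Chain two nodes along the "default" action and run the flow against
# a shared dictionary that both nodes read from and write to.
from pocketflow import Flow

extract = ExtractTopics()
generate = GenerateQuestions()
extract &amp;gt;&amp;gt; generate  # follow "default" from extract to generate

flow = Flow(start=extract)
shared = {"txt": "Some input text to analyze..."}
flow.run(shared)
&lt;/pre&gt;&lt;/div&gt;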
&lt;p&gt;With this abstraction, multiple LLM-powered design patterns can be built:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://github.com/the-pocket/.github/raw/main/assets/design.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;(Image from the PocketFlow official documentation.)&lt;/p&gt;
&lt;h2 id="example-1-topic-extractor-and-question-generator"&gt;Example 1 - Topic Extractor and Question Generator&lt;/h2&gt;&lt;p&gt;Here's how I built the two-step/node topic extractor + question generator. First, I declared the nodes:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ExtractTopics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;First node: Extract key topics from input text&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text_to_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text_to_analyze&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text_to_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text_to_analyze&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;No content to analyze&amp;quot;&lt;/span&gt;

        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Extract 3-5 key topics from this text. Return only the topics as a comma-separated list:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_to_analyze&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SimpleBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;You are a helpful assistant that extracts key topics.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/qwen3:30b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;topics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;GenerateQuestions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Second node: Generate questions based on topics&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;topics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;txt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txt&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Cannot generate questions without valid topics&amp;quot;&lt;/span&gt;

        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Given these topics: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;and the original text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Generate 2 interesting questions for each topic.&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SimpleBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;You are a helpful assistant that generates thought-provoking questions.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/qwen3:30b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;questions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;
        &lt;span class="c1"&gt;# No return statement since this is a terminal node.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then, I declared the graph:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;extract_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExtractTopics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;generate_questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GenerateQuestions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;extract_topics&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;generate_questions&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The magic happens in this line:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;extract_topics&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;generate_questions&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tells the flow that once the &lt;code&gt;extract_topics&lt;/code&gt; node emits &lt;code&gt;"default"&lt;/code&gt;, it should proceed to the &lt;code&gt;generate_questions&lt;/code&gt; node. The syntax is compact and looks exactly like an edge specification between two nodes.&lt;/p&gt;
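&lt;p&gt;Under the hood, this syntax is plain Python operator overloading: &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; are just dunder methods, and &lt;code&gt;-&lt;/code&gt; binds more tightly than &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;, so the action string attaches to the source node first. Here's an illustrative sketch of how such edge syntax could be implemented—my own reconstruction, not PocketFlow's actual source:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;class SketchNode:
    """Illustrative only: how `-` and `&amp;gt;&amp;gt;` can register graph edges."""

    def __init__(self):
        self.successors = {}  # maps an action name to the next node

    def __sub__(self, action):
        # `node - "action"` returns a helper that remembers source and action.
        return _Edge(self, action)


class _Edge:
    def __init__(self, src, action):
        self.src, self.action = src, action

    def __rshift__(self, dst):
        # `... &amp;gt;&amp;gt; other_node` completes the edge registration.
        self.src.successors[self.action] = dst
        return dst


a, b = SketchNode(), SketchNode()
a - "default" &amp;gt;&amp;gt; b  # a.successors is now {"default": b}
&lt;/pre&gt;&lt;/div&gt;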
&lt;p&gt;At this point, I deeply appreciate the clarity this approach forces upfront. When thinking about the flow as a graph, I'm forced to think about each node as a function that accepts inputs from shared state and returns a decision about what to do next. That decision can be deterministic (as above) or data-dependent (as we'll see below).&lt;/p&gt;
&lt;p&gt;Since GenAI can be viewed through the lens of automation, &lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;we should earn the privilege to use it&lt;/a&gt;. Automation requires a well-established process to be most effective, and framing that process as a graph with carefully specified inputs and outputs—just like writing a computer program—is the clearest path to making automation work.&lt;/p&gt;
&lt;p&gt;Running the Flow looks like this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;shared_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;two_node_flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract_topics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;two_node_flow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared_topics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After running, we can inspect the &lt;code&gt;shared_topics&lt;/code&gt; dictionary to see our results:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;topics&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# added by ExtractTopics&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;questions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# added by GenerateQuestions&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One thing missing from PocketFlow is the ability to visualize the graph directly. Since the codebase was new to me, I sent a Cursor agent in the background to research and propose a solution. It came back with &lt;a href="https://github.com/ericmjl/llamabot/pull/279"&gt;this PR&lt;/a&gt;. Impressive!&lt;/p&gt;
&lt;p&gt;The Mermaid diagram for this workflow is:&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
N1["ExtractTopics"]
N2["GenerateQuestions"]
N1 --&gt; N2
style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;

&lt;/pre&gt;&lt;h2 id="example-2-building-an-agent"&gt;Example 2 - Building an Agent&lt;/h2&gt;&lt;p&gt;Now, what if we want to build an agent?&lt;/p&gt;
&lt;p&gt;I'm going to work backwards here. My "hello world" test for agentic systems is making them tell me today's date. This works because an LLM, left to its own devices, will hallucinate a date, and that hallucination may or may not be correct. An agent that works properly should instead call a tool to get the actual date. The agent's graph should look like this:&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
    N1["Decide"]
    N2["TodayDate"]
    N3["RespondToUser"]
    N1 --&gt;|"today_date"| N2
    N2 --&gt;|"decide"| N1
    N1 --&gt;|"respond_to_user"| N3
    style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
    style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
    style N3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;

&lt;/pre&gt;&lt;p&gt;I consider this a "Hello World" agent because a failing agent will skip straight to &lt;code&gt;respond_to_user&lt;/code&gt; when asked for today's date, without first calling &lt;code&gt;today_date&lt;/code&gt; to get the actual information.&lt;/p&gt;
&lt;p&gt;To build this agent, I need three nodes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Decide&lt;/code&gt;&lt;/strong&gt;: Uses an LLM to decide which tool to call next, given the prompt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;TodayDate&lt;/code&gt;&lt;/strong&gt;: Executes without LLMs and returns today's date in the current timezone&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;RespondToUser&lt;/code&gt;&lt;/strong&gt;: Responds to the user with the appropriate context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's how I wrote them. First, the &lt;code&gt;Decide&lt;/code&gt; node:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;llamabot.components.tools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;respond_to_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;search_internet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;today_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pydantic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;typing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="n"&gt;search_internet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_internet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;respond_to_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today_date&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ToolChoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;The name of the tool to use&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;system&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decision_bot_system_prompt&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Given the chat history, pick for me one or more tools to execute&lt;/span&gt;
&lt;span class="sd"&gt;    in order to satisfy the user&amp;#39;s query.&lt;/span&gt;

&lt;span class="sd"&gt;    Give me just the tool name to pick.&lt;/span&gt;
&lt;span class="sd"&gt;    Use the tools judiciously to help answer the user&amp;#39;s query.&lt;/span&gt;
&lt;span class="sd"&gt;    Query is always related to one of the tools.&lt;/span&gt;
&lt;span class="sd"&gt;    Use respond_to_user if you have enough information to answer the original query.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;query&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructuredBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pydantic_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ToolChoice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision_bot_system_prompt&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Chosen Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The key thing to note is how the available tools are made known to the tool-selecting agent: their names are injected into the &lt;code&gt;ToolChoice&lt;/code&gt; model as a &lt;code&gt;Literal&lt;/code&gt; type, which constrains the &lt;code&gt;StructuredBot&lt;/code&gt;'s output to be exactly one of the registered tool names.&lt;/p&gt;
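&lt;p&gt;To see why this constrains the choice, inspect the JSON schema that pydantic generates for the model: the tool names show up as an enum. A small illustrative check (note that &lt;code&gt;Literal[*...]&lt;/code&gt; unpacking requires Python 3.11+):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from typing import Literal

from pydantic import BaseModel, Field


def respond_to_user(): ...
def today_date(): ...


tools = [respond_to_user, today_date]


class ToolChoice(BaseModel):
    content: Literal[*[tool.__name__ for tool in tools]] = Field(
        ..., description="The name of the tool to use"
    )


# The schema restricts `content` to the registered tool names:
print(ToolChoice.model_json_schema()["properties"]["content"])
# roughly: {'description': 'The name of the tool to use',
#           'enum': ['respond_to_user', 'today_date'], ...}
&lt;/pre&gt;&lt;/div&gt;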
&lt;p&gt;Next, the &lt;code&gt;TodayDate&lt;/code&gt; node:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;TodayDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;today_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Today&amp;#39;s date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And finally, the &lt;code&gt;RespondToUser&lt;/code&gt; node:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;RespondToUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;The response to the user.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructuredBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;You are a helpful assistant.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/gemma3n:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pydantic_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we set up the graph:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Set up the graph&lt;/span&gt;
&lt;span class="n"&gt;today__date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TodayDate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;respond__to__user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RespondToUser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Decide&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What is the date today?&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;today_date&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;today__date&lt;/span&gt;
&lt;span class="n"&gt;today__date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;decide&lt;/span&gt;
&lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;respond_to_user&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;respond__to__user&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I used &lt;code&gt;__&lt;/code&gt; in the node names to avoid clashing with the original functions.&lt;/p&gt;
&lt;p&gt;Then we run it:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;flow2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;flow2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Take my word for it (or check out the notebook yourself)—it reliably gives me today's date.&lt;/p&gt;
&lt;h2 id="example-3-agent-with-shell-commands"&gt;Example 3 - Agent with Shell Commands&lt;/h2&gt;&lt;p&gt;To push things further, I tried a tool that needs arguments. A good "hello world" for this is executing shell commands in response to questions like, "What's in this folder?"&lt;/p&gt;
&lt;p&gt;For this, I created a second version of the &lt;code&gt;Decide&lt;/code&gt; node called &lt;code&gt;Decide2&lt;/code&gt;, where I instantiate and execute the &lt;code&gt;ToolChoice&lt;/code&gt; and tool selection &lt;code&gt;StructuredBot&lt;/code&gt; within &lt;code&gt;exec&lt;/code&gt;:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Decide2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;sysprompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decision_bot_system_prompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sysprompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ToolChoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tools&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;The name of the tool to use&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Why this tool was chosen.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructuredBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pydantic_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ToolChoice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision_bot_system_prompt&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Chosen Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I then created a &lt;code&gt;ShellCommand&lt;/code&gt; node that uses the same pattern—leveraging &lt;code&gt;StructuredBot&lt;/code&gt; for structured generation to constrain the LLM's output to exactly what I need:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ShellCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Cmd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;The shell command to execute&amp;quot;&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructuredBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;You are an expert at writing shell commands. For the chat trace that you will be given, write a shell command that accomplishes the user&amp;#39;s request. Only output the command, nothing else.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pydantic_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/gemma3n:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execute_shell_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we set up the graph:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Set up the graph&lt;/span&gt;
    &lt;span class="n"&gt;today_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TodayDate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;respond_to_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RespondToUser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Decide2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;shell_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ShellCommand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;today_date&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;today_date&lt;/span&gt;
    &lt;span class="n"&gt;today_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;decide&lt;/span&gt;
    &lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;execute_shell_command&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;shell_command&lt;/span&gt;
    &lt;span class="n"&gt;shell_command&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;decide&lt;/span&gt;
    &lt;span class="n"&gt;decide&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;respond_to_user&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;respond_to_user&lt;/span&gt;

    &lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;


&lt;span class="n"&gt;flow3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The graph would look like this:&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
N1["Decide2"]
N2["TodayDate"]
N3["ShellCommand"]
N4["RespondToUser"]
N1 --&gt;|"today_date"| N2
N2 --&gt;|"decide"| N1
N1 --&gt;|"execute_shell_command"| N3
N3 --&gt;|"decide"| N1
N1 --&gt;|"respond_to_user"| N4
style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N4 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;

&lt;/pre&gt;&lt;p&gt;I wrapped it in a &lt;code&gt;_()&lt;/code&gt; function to keep these names out of the Marimo notebook's global scope. Note that I included &lt;code&gt;today_date&lt;/code&gt; as well, just to "pollute" the tool namespace and make the shell-related questions more challenging. When we interact with the agent:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;shared3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;shared3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What in my current working directory?&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;shared3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;shared3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tools&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;respond_to_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execute_shell_command&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;flow3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shared3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It calls on &lt;code&gt;shell_command&lt;/code&gt;, and gives me back this response:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Okay, here&amp;#39;s a list of the files and directories in your current working directory:

*   **Directories:**
    *   `__marimo__`
    *   `.` (current directory)
    *   `..` (parent directory)

*   **Files:**
    *   `agentbot_build.py`
    *   `agents.py`
    *   `chatbot_as_agent.py`
    *   `conversation-threads.py`
    *   `data.csv`
    *   `ic50_data_with_confounders.csv`
    *   `intro.py`
    *   `lancedb_docstore.py`
    *   `pocketflow_testdrive.py`
    *   `react-agentbot-demo.py`
    *   `README.md`
    *   `toolbot_chatdata.py`
    *   `tools.py`

That&amp;#39;s a total of 17 files and directories. Let me know if you&amp;#39;d like more details about any of them!
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I was also able to ask "Hey, what files have been modified today?" and the agent successfully executed the appropriate shell command.&lt;/p&gt;
&lt;p&gt;Effectively, this pattern is nothing more than a coordinating agent/LLM delegating work to specialized tools.&lt;/p&gt;
&lt;h2 id="rewriting-agentbot-with-pocketflow"&gt;Rewriting AgentBot with PocketFlow&lt;/h2&gt;&lt;p&gt;Finally, I decided to take what I'd learned and redo the &lt;code&gt;AgentBot&lt;/code&gt; implementation in LlamaBot. My previous implementation (version 0.16.3) was messy—the &lt;code&gt;__call__&lt;/code&gt; method alone was 307 lines with a while-loop, maximum tries, ThreadPoolExecutor for parallel tool execution, tool call caching, and extensive metadata tracking. PocketFlow had a better abstraction for the agentic loop: a &lt;code&gt;Flow&lt;/code&gt; state machine following edges on a graph. I thought I could redesign &lt;code&gt;AgentBot&lt;/code&gt; to take advantage of this pattern.&lt;/p&gt;
&lt;p&gt;The rewrite involved some really interesting patterns. I completely replaced the ReAct (Reasoning and Acting) loop with PocketFlow's graph-based tool orchestration. This shifts from an iterative loop-based approach to a declarative graph-based one, where tool execution flows through a directed graph rather than a sequential loop.&lt;/p&gt;
&lt;p&gt;The implementation centers on three key abstractions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. The &lt;code&gt;@nodeify&lt;/code&gt; decorator&lt;/strong&gt; transforms any callable function into a PocketFlow Node. It wraps functions with PocketFlow's Node interface, implementing the required &lt;code&gt;prep&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;, and &lt;code&gt;post&lt;/code&gt; methods. The tricky part is that &lt;code&gt;@nodeify&lt;/code&gt; needs to preserve access to the underlying function's metadata—particularly the &lt;code&gt;json_schema&lt;/code&gt; attribute added by the &lt;code&gt;@tool&lt;/code&gt; decorator—through attribute proxying, so ToolBot can discover and use tools even after they've been wrapped as nodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. The &lt;code&gt;DecideNode&lt;/code&gt;&lt;/strong&gt; encapsulates the decision-making logic. This node uses ToolBot internally to analyze the conversation history stored in shared state and select which tool to execute next. It expects a shared state dictionary with a &lt;code&gt;"memory"&lt;/code&gt; key containing the conversation history as a list of strings. When executed, it calls ToolBot with this memory, extracts the first tool call from ToolBot's response, parses the JSON-formatted arguments, and stores them in &lt;code&gt;shared["func_call"]&lt;/code&gt; for the next node. The node then returns the tool name as a routing action, which PocketFlow uses to navigate the graph. (A sketch of this shared-state shape follows after point 3.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Flow graph construction&lt;/strong&gt; happens at initialization time. AgentBot automatically wraps all provided tools (plus default tools like &lt;code&gt;today_date&lt;/code&gt; and &lt;code&gt;respond_to_user&lt;/code&gt;) with both &lt;code&gt;@tool&lt;/code&gt; and &lt;code&gt;@nodeify&lt;/code&gt; decorators, then builds bidirectional connections: from the decide node to each tool node (using the tool's function name as the action), and from each tool node back to the decide node (except for terminal tools like &lt;code&gt;respond_to_user&lt;/code&gt; that have &lt;code&gt;loopback_name=None&lt;/code&gt;). This creates a graph where execution can flow from decision to tool and back to decision, enabling multi-step reasoning.&lt;/p&gt;
&lt;p&gt;A few technical requirements make this work: tools need type annotations (for JSON schema generation), the shared state needs a &lt;code&gt;"memory"&lt;/code&gt; list for conversation history, and tool arguments are passed through &lt;code&gt;shared["func_call"]&lt;/code&gt;. The DecideNode selects one tool at a time, and tools are stateless—they get fresh arguments each call and communicate through memory.&lt;/p&gt;
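&lt;p&gt;To make that description concrete, here's a minimal sketch of what a &lt;code&gt;DecideNode&lt;/code&gt; along these lines could look like. This is not the actual LlamaBot implementation: the shape of ToolBot's response (&lt;code&gt;tool_calls&lt;/code&gt;, &lt;code&gt;function.arguments&lt;/code&gt;) and its calling convention are assumptions, used purely to illustrate the flow of data:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import json

from pocketflow import Node


class DecideNode(Node):
    def __init__(self, toolbot):
        super().__init__()
        self.toolbot = toolbot  # a ToolBot configured with the available tools

    def prep(self, shared):
        # Hand the accumulated conversation history to exec.
        return shared[&amp;quot;memory&amp;quot;]

    def exec(self, memory):
        # Ask ToolBot which tool to call next, given the history so far.
        response = self.toolbot(memory)  # assumed calling convention
        return response.tool_calls[0]  # assumed response shape

    def post(self, shared, prep_result, tool_call):
        # Store the parsed arguments for the tool node, then route by name.
        shared[&amp;quot;func_call&amp;quot;] = json.loads(tool_call.function.arguments)
        return tool_call.function.name
&lt;/pre&gt;&lt;/div&gt;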
&lt;p&gt;What's remarkable about this implementation is how compact it is. The &lt;code&gt;@nodeify&lt;/code&gt; decorator is just 100 lines, and most of that is documentation. The core logic is elegant:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nodeify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loopback_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;FuncNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loopback_name&lt;/span&gt;
                &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;

            &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;

            &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;func_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;func_call&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;func_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prep_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exec_res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exec_res&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt;

            &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__getattr__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="c1"&gt;# Proxy to original function for json_schema access&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;func&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;AttributeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FuncNode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The entire AgentBot class is similarly compact—about 100 lines total. Compare this to the previous implementation where the &lt;code&gt;__call__&lt;/code&gt; method alone was 307 lines, with complex while loop logic, tool call caching, parallel execution via ThreadPoolExecutor, and extensive state management:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Old implementation (v0.16.3): 307-line __call__ method&lt;/span&gt;
&lt;span class="c1"&gt;# Plus 50-line caching wrapper, 21-line execution helper&lt;/span&gt;
&lt;span class="c1"&gt;# Total: 378 lines of orchestration code&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Call model with tools&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;auto&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_tool_calls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Execute tools in parallel with caching&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_execute_tool_with_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="c1"&gt;# Handle results, update messages, manage cache,&lt;/span&gt;
                &lt;span class="c1"&gt;# track metadata, handle errors...&lt;/span&gt;
                &lt;span class="o"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle finalization, memory updates, logging, metrics...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The new implementation replaces all of that with a simple graph construction:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# New implementation: ~100 lines total, declarative graph&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;AgentBot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decide_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;gpt-4.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... validation and setup ...&lt;/span&gt;

        &lt;span class="c1"&gt;# Build PocketFlow graph: connect tools to decide node&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decide_node&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decide_node&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decide_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;memory&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;result&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Full implementation includes validation, tool wrapping, and state management—about 100 lines total vs 307+ for the old &lt;code&gt;__call__&lt;/code&gt; method alone.&lt;/p&gt;
&lt;h3 id="the-magic-of-building-an-agent-in-just-4-lines"&gt;The Magic of Building an Agent in Just 4 Lines&lt;/h3&gt;&lt;p&gt;The most remarkable part of this implementation is how the entire agent graph is constructed. Look at these four lines carefully:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decide_node&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decide_node&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;This is it.&lt;/strong&gt; This is the entire graph construction that turns a collection of tools into a working agent. Let me break down what's happening:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Line 1&lt;/strong&gt;: Loop through each tool&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Line 2&lt;/strong&gt;: Connect the decide node to the tool node, using the tool's function name as the edge label—when the LLM chooses this tool, execution flows to it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Line 3&lt;/strong&gt;: Check if this tool should loop back (terminal tools like &lt;code&gt;respond_to_user&lt;/code&gt; have &lt;code&gt;loopback_name=None&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Line 4&lt;/strong&gt;: Connect the tool back to the decide node—after execution, control returns to decision-making&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That's the entire agent architecture. Four lines. The &lt;code&gt;- "action" &amp;gt;&amp;gt;&lt;/code&gt; syntax creates directed edges in the graph, and PocketFlow handles all the state management, routing, and execution orchestration. Compare this to the 307-line &lt;code&gt;__call__&lt;/code&gt; method in the previous implementation (version 0.16.3) with its complex loop-based logic, thread pools, state tracking, and termination conditions.&lt;/p&gt;
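&lt;p&gt;If you're curious how that edge syntax can work at all: it's plain Python operator overloading. Here's a simplified sketch of the idea (not PocketFlow's exact source):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;class BaseNode:
    def __init__(self):
        self.successors = {}  # maps an action name to the next node

    def next(self, node, action=&amp;quot;default&amp;quot;):
        self.successors[action] = node
        return node

    def __sub__(self, action):
        # `node - &amp;quot;action&amp;quot;` captures the edge label...
        return _ConditionalTransition(self, action)


class _ConditionalTransition:
    def __init__(self, src, action):
        self.src, self.action = src, action

    def __rshift__(self, target):
        # ...and `&amp;gt;&amp;gt; target` registers the labeled edge on the source node.
        return self.src.next(target, self.action)
&lt;/pre&gt;&lt;/div&gt;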
&lt;p&gt;This is what I mean by "graph-based thinking" being clearer—the entire execution flow is explicit and declarative. You can see at a glance how decisions flow to tools and back to decisions, enabling multi-step reasoning.&lt;/p&gt;
&lt;p&gt;The difference is striking. The old implementation required manual loop management, explicit state tracking, parallel execution coordination, and complex termination logic. The new implementation declares the graph structure once, and PocketFlow handles all the execution details.&lt;/p&gt;
&lt;p&gt;This graph-based approach provides several advantages. The flow graph is constructed once at initialization, making the execution path explicit and visualizable—you can render the agent's decision flow as a Mermaid diagram using the visualization feature I added to LlamaBot. The separation of concerns is clearer: decision-making lives in &lt;code&gt;DecideNode&lt;/code&gt;, tool execution in wrapped function nodes, and orchestration in PocketFlow's flow engine. The implementation is also more modular—you can swap out the decision node or customize tool wrapping behavior without rewriting the core agent logic. Finally, by leveraging PocketFlow's graph execution model, we gain access to its execution capabilities and potential future extensions for parallel execution or conditional routing.&lt;/p&gt;
&lt;h2 id="visualizing-different-agent-architectures"&gt;Visualizing Different Agent Architectures&lt;/h2&gt;&lt;p&gt;One really cool feature I added to LlamaBot is the ability to visualize any agent's graph structure using Mermaid diagrams. The &lt;code&gt;AgentBot._display_()&lt;/code&gt; method automatically renders the flow graph, making it easy to see how different tool configurations create different architectures.&lt;/p&gt;
&lt;p&gt;Here's a simple agent with just two tools:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;llamabot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentBot&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;llamabot.components.tools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;llamabot.components.pocketflow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nodeify&lt;/span&gt;

&lt;span class="nd"&gt;@nodeify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Search the web for information.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_display_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Renders Mermaid diagram in Marimo&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The resulting graph shows the decision node connected to &lt;code&gt;today_date&lt;/code&gt;, &lt;code&gt;search_web&lt;/code&gt;, and &lt;code&gt;respond_to_user&lt;/code&gt;:&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
N1["DecideNode"]
N2["today_date"]
N3["search_web"]
N4["respond_to_user"]
N1 --&gt;|"today_date"| N2
N2 --&gt;|"decide"| N1
N1 --&gt;|"search_web"| N3
N3 --&gt;|"decide"| N1
N1 --&gt;|"respond_to_user"| N4
style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N4 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
&lt;/pre&gt;&lt;p&gt;Add more tools, and the graph automatically expands. Here's an agent with code execution and file operations:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@nodeify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;write_and_execute_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dependencies_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;gt;=3.11&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Write and execute a Python script in a secure Docker sandbox.&lt;/span&gt;

&lt;span class="sd"&gt;    :param code: The Python code to execute&lt;/span&gt;
&lt;span class="sd"&gt;    :param dependencies_str: Comma-separated pip dependencies&lt;/span&gt;
&lt;span class="sd"&gt;    :param python_version: Python version requirement&lt;/span&gt;
&lt;span class="sd"&gt;    :return: Dictionary with stdout, stderr, and status&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Uses ScriptExecutor to run code in isolated Docker container&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScriptExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;script_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;stdout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stdout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;stderr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stderr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;status&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;status&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@nodeify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Read and return file contents.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_and_execute_script&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The graph now shows six tool nodes all connected bidirectionally to the decision node (except terminal tools):&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
N1["DecideNode"]
N2["today_date"]
N3["search_web"]
N4["write_and_execute_script"]
N5["read_file"]
N6["respond_to_user"]
N7["return_object_to_user"]
N1 --&gt;|"today_date"| N2
N2 --&gt;|"decide"| N1
N1 --&gt;|"search_web"| N3
N3 --&gt;|"decide"| N1
N1 --&gt;|"write_and_execute_script"| N4
N4 --&gt;|"decide"| N1
N1 --&gt;|"read_file"| N5
N5 --&gt;|"decide"| N1
N1 --&gt;|"respond_to_user"| N6
N1 --&gt;|"return_object_to_user"| N7
style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N4 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N5 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N6 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N7 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
&lt;/pre&gt;&lt;p&gt;What I love about this is how the graph makes it immediately obvious what capabilities an agent has. You can see at a glance which tools are available, understand the control flow, and reason about how the agent will behave. The visualization transforms the abstract "agent with tools" into a concrete, inspectable structure.&lt;/p&gt;
&lt;p&gt;Here's a real-world example—an experiment design agent I built for critiquing statistical experiment designs:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@nodeify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopback_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;critique_experiment_design&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Critique an experiment design and identify potential flaws,&lt;/span&gt;
&lt;span class="sd"&gt;    biases, or weaknesses.&lt;/span&gt;

&lt;span class="sd"&gt;    :param design: Description of the proposed experiment design&lt;/span&gt;
&lt;span class="sd"&gt;    :return: Critique with identified issues and suggestions&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SimpleBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;experiment_design_critique_sysprompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;critique_experiment_design&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_and_execute_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;globals&lt;/span&gt;&lt;span class="p"&gt;())]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This agent has a specialized domain focus. The graph shows all its capabilities, including the default tools that every &lt;code&gt;AgentBot&lt;/code&gt; gets automatically:&lt;/p&gt;
&lt;pre class="mermaid"&gt;
graph LR
N1["DecideNode"]
N2["today_date"]
N3["critique_experiment_design"]
N4["write_and_execute_code"]
N5["respond_to_user"]
N6["return_object_to_user"]
N1 --&gt;|"today_date"| N2
N2 --&gt;|"decide"| N1
N1 --&gt;|"critique_experiment_design"| N3
N3 --&gt;|"decide"| N1
N1 --&gt;|"write_and_execute_code"| N4
N4 --&gt;|"decide"| N1
N1 --&gt;|"respond_to_user"| N5
N1 --&gt;|"return_object_to_user"| N6
style N1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N4 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N5 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
style N6 fill:#e1f5ff,stroke:#01579b,stroke-width:2px;
&lt;/pre&gt;&lt;p&gt;Notice that &lt;code&gt;today_date&lt;/code&gt;, &lt;code&gt;respond_to_user&lt;/code&gt;, and &lt;code&gt;return_object_to_user&lt;/code&gt; are included by default in every &lt;code&gt;AgentBot&lt;/code&gt;. The graph immediately tells you this agent can critique designs, execute code to analyze data, and return Python objects directly to the user—but it's not a general-purpose assistant. It's specialized for experiment design evaluation. The visual structure encodes the agent's purpose.&lt;/p&gt;
&lt;p&gt;This is only possible because of the graph-based architecture. With the old loop-based implementation, there was no clean way to visualize the execution flow—it was hidden inside imperative control logic.&lt;/p&gt;
&lt;h2 id="what-i-learned"&gt;What I Learned&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Externalize memory as shared state&lt;/strong&gt;. Memory lives in the &lt;code&gt;shared&lt;/code&gt; dictionary that all nodes can access, rather than being intrinsic to each bot. We just feed memory context in each time a node executes. This has good economics—if you have prompt caching on the API provider's side, simply appending to an ever-growing memory is a great way to take advantage of pre-computed neural network outputs from previous runs. I used to think of memory as &lt;em&gt;intrinsic&lt;/em&gt; to a bot, but I've changed my mind: allowing multiple bots to share access to the same memory is a useful simplification, even if it's not suitable for every circumstance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;prep -&amp;gt; exec -&amp;gt; post&lt;/code&gt; pattern&lt;/strong&gt; is worth adhering to. I found myself appending to memory in &lt;code&gt;post&lt;/code&gt; after doing the &lt;code&gt;exec&lt;/code&gt;ution. &lt;code&gt;prep&lt;/code&gt; turns out to be useful for preprocessing user inputs or manipulating memory as needed. The overall effect is that it's much easier to &lt;strong&gt;unit test or set up evals&lt;/strong&gt; for individual nodes.&lt;/p&gt;
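&lt;p&gt;Here's what that testability looks like in practice. Because &lt;code&gt;prep&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;, and &lt;code&gt;post&lt;/code&gt; are plain methods, a single node can be exercised without running the whole flow. This sketch uses the &lt;code&gt;read_file&lt;/code&gt; tool defined earlier and pytest's &lt;code&gt;tmp_path&lt;/code&gt; fixture:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def test_read_file_node(tmp_path):
    f = tmp_path / &amp;quot;hello.txt&amp;quot;
    f.write_text(&amp;quot;hello&amp;quot;)

    # Shared state as AgentBot would set it up for this tool call.
    shared = {&amp;quot;memory&amp;quot;: [], &amp;quot;func_call&amp;quot;: {&amp;quot;filepath&amp;quot;: str(f)}}

    prep_result = read_file.prep(shared)
    exec_res = read_file.exec(prep_result)
    action = read_file.post(shared, prep_result, exec_res)

    assert exec_res == &amp;quot;hello&amp;quot;  # the tool returned the file contents
    assert shared[&amp;quot;memory&amp;quot;][-1] == &amp;quot;hello&amp;quot;  # post appended the result to memory
    assert action == &amp;quot;decide&amp;quot;  # and routed back to the decide node
&lt;/pre&gt;&lt;/div&gt;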
&lt;p&gt;&lt;strong&gt;PocketFlow's graph abstraction brings clarity&lt;/strong&gt;. The analogy of LLM agents (&lt;code&gt;Node&lt;/code&gt;s) as chefs/cooks accessing a kitchen island's worth of things (&lt;code&gt;shared&lt;/code&gt;) stands in sharp contrast to my previous loop-based approach in LlamaBot, where I manually tracked state, managed iterations, and coordinated tool execution. This insight is exactly why I rewrote AgentBot to use this graph-based architecture.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A wide variety of LLM-powered architectures&lt;/strong&gt; can be built with just &lt;code&gt;Node&lt;/code&gt;s and &lt;code&gt;Flow&lt;/code&gt;s. Most LLM applications I've built—whether for myself or for others—have not been "agentic" but more like "workflows." These are what some might consider boring. Yet they are high ROI precisely because they take repetitive and boring work out of our hands! PocketFlow gives us a way to express flows as graphs, effectively state machines whose actions are either fully deterministic or determined by an LLM's choice.&lt;/p&gt;
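&lt;p&gt;For example, a fully deterministic two-step workflow needs no LLM at all, just nodes joined by default edges. A sketch, assuming PocketFlow's default-edge behavior:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from pocketflow import Flow, Node


class LoadData(Node):
    def post(self, shared, prep_res, exec_res):
        shared[&amp;quot;rows&amp;quot;] = [&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;]  # stand-in for real I/O


class Summarize(Node):
    def post(self, shared, prep_res, exec_res):
        shared[&amp;quot;summary&amp;quot;] = f&amp;quot;{len(shared[&amp;#39;rows&amp;#39;])} rows loaded&amp;quot;


load, summarize = LoadData(), Summarize()
load &amp;gt;&amp;gt; summarize  # an unlabeled default edge: always go from load to summarize
flow = Flow(start=load)

shared = {}
flow.run(shared)
print(shared[&amp;quot;summary&amp;quot;])  # &amp;quot;3 rows loaded&amp;quot;
&lt;/pre&gt;&lt;/div&gt;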
&lt;p&gt;&lt;strong&gt;PocketFlow is minimalistic&lt;/strong&gt;, which offloads a lot of the heavy lifting of working with LLMs onto you. The flexibility is both a strength and a weakness: great for power users, but potentially intimidating for newcomers. I found it easiest to rely heavily on &lt;code&gt;StructuredBot&lt;/code&gt; to output decisions made by the LLM. Structured generation is, generally speaking, the most useful abstraction in the LLM world that I keep turning back to.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The agent pattern is everywhere once you recognize it&lt;/strong&gt;. While writing this post, I realized that the agentic coding IDEs we've gotten used to—tools like Cursor, GitHub Copilot, and others—follow the exact same pattern I've been describing. They have a decision node that analyzes your code and context, tool nodes for reading files, searching codebases, editing code, and responding to you. The flow is the same: decide what to do, execute a tool, update context, decide again. Understanding this pattern in PocketFlow helped me see it operating in the tools I use every day. The abstraction is the mental model that makes sense of how modern AI-powered tools work.&lt;/p&gt;
&lt;p&gt;The biggest lesson? &lt;strong&gt;Thinking in graphs transforms how you build LLM programs&lt;/strong&gt;. The shift from imperative loops to declarative graphs means you declare &lt;em&gt;what&lt;/em&gt; should happen instead of specifying &lt;em&gt;how&lt;/em&gt; to execute step-by-step. This brings clarity, modularity, and makes your execution flow explicit. Whether you're building simple workflows or complex agents, representing them as graphs forces you to think clearly about state, decisions, and flow. That mental model shift has changed how I approach every LLM application I build.&lt;/p&gt;
</content></entry><entry><title>Safe ways to let your coding agent work autonomously</title><link href="https://ericmjl.github.io/blog/2025/11/8/safe-ways-to-let-your-coding-agent-work-autonomously/" rel="alternate"/><updated>2025-11-08T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0da86835-0adf-31d4-be7c-70ec5f74e11d</id><content type="html">&lt;p&gt;Coding agents promise to unlock significant productivity gains by working autonomously in the background—gathering context, running tests, searching documentation, and making progress on tasks without constant human intervention. The more autonomous they become, the more value they deliver. Yet this autonomy creates a fundamental tension: we need agents to act independently to realize their potential, but we must prevent them from taking irreversible actions we don't want.&lt;/p&gt;
&lt;p&gt;This tension became painfully clear when I asked Comet, an agentic browser, "how to archive repo" in the same casual way I'd ask Google. The agent interpreted this as a direct command and archived my LlamaBot repository. What I wanted was information; what I got was an unintended action with real consequences.&lt;/p&gt;
&lt;p&gt;The problem isn't unique to Comet. Any coding agent with sufficient autonomy can make destructive changes: deleting files, force-pushing to main, committing broken code, or modifying critical configurations. We need safeguards that allow agents to work freely on safe operations while blocking potentially harmful actions. The solution lies in configuring your development environment with intelligent boundaries—auto-approving read-only commands while requiring explicit approval for anything that modifies state.&lt;/p&gt;
&lt;h2 id="auto-approve-safe-command-line-commands"&gt;Auto-approve safe command line commands&lt;/h2&gt;&lt;p&gt;The foundation of autonomous coding agent operation is allowing certain command line commands to run without manual approval. Commands like &lt;code&gt;grep&lt;/code&gt;/&lt;code&gt;ripgrep&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;/&lt;code&gt;fd&lt;/code&gt;, &lt;code&gt;pixi run pytest...&lt;/code&gt;, and similar read-only or context-gathering operations enable LLM agents to autonomously understand codebases and test suites. For CLI tools that interact with external services, I also auto-approve &lt;code&gt;gh pr view&lt;/code&gt;, which allows the agent to gather context from GitHub pull requests while working in the background.&lt;/p&gt;
&lt;p&gt;The critical rule: &lt;strong&gt;only auto-accept commands that are non-destructive&lt;/strong&gt;. Never auto-approve &lt;code&gt;git commit&lt;/code&gt;, &lt;code&gt;git push&lt;/code&gt;, &lt;code&gt;rm&lt;/code&gt;, or other filesystem, git, or state-modifying changes. This creates a safe boundary where agents can explore and learn, but cannot make irreversible changes without your explicit approval.&lt;/p&gt;
&lt;p&gt;Here's my mental model for categorizing commands:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Safe to auto-approve:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read operations: &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, &lt;code&gt;less&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Code analysis: &lt;code&gt;pytest&lt;/code&gt; (read-only test runs), &lt;code&gt;mypy&lt;/code&gt;, &lt;code&gt;ruff check&lt;/code&gt; (without &lt;code&gt;--fix&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Context gathering: &lt;code&gt;gh pr view&lt;/code&gt;, &lt;code&gt;gh issue view&lt;/code&gt;, &lt;code&gt;git log&lt;/code&gt;, &lt;code&gt;git diff&lt;/code&gt;, &lt;code&gt;git show&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Package managers (read-only): &lt;code&gt;pip list&lt;/code&gt;, &lt;code&gt;npm list&lt;/code&gt;, &lt;code&gt;cargo tree&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Documentation build: &lt;code&gt;mkdocs serve&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Never auto-approve:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File system mutations: &lt;code&gt;rm&lt;/code&gt;, &lt;code&gt;mv&lt;/code&gt;, &lt;code&gt;cp&lt;/code&gt;, &lt;code&gt;mkdir&lt;/code&gt;, &lt;code&gt;touch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Git writes: &lt;code&gt;git commit&lt;/code&gt;, &lt;code&gt;git push&lt;/code&gt;, &lt;code&gt;git reset&lt;/code&gt;, &lt;code&gt;git checkout -b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Package installs: &lt;code&gt;pixi add&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The edge cases are where it gets interesting. I auto-approve &lt;code&gt;pytest&lt;/code&gt; because test runs are read-only, but I require approval for any command that modifies files, even if it's technically reversible. The key distinction is whether a command changes state: &lt;code&gt;git status&lt;/code&gt; and &lt;code&gt;git diff&lt;/code&gt; are safe because they're pure reads, while &lt;code&gt;git commit&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt; modify repository state and require explicit approval. &lt;code&gt;git add&lt;/code&gt; is a bit of a gray area, but I am ok with auto-approving it since it's technically reversible, and because coding agents are often much faster than I could be at selectively adding files to the staging area.&lt;/p&gt;
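&lt;p&gt;To make the boundary explicit, here's a toy classifier that captures this mental model. It's illustrative only: real agent harnesses have their own allowlist formats, and the prefix lists below are examples rather than a complete policy:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;SAFE_PREFIXES = (
    &amp;quot;grep&amp;quot;, &amp;quot;rg&amp;quot;, &amp;quot;fd&amp;quot;, &amp;quot;find&amp;quot;, &amp;quot;cat&amp;quot;, &amp;quot;head&amp;quot;, &amp;quot;tail&amp;quot;,
    &amp;quot;git log&amp;quot;, &amp;quot;git diff&amp;quot;, &amp;quot;git show&amp;quot;, &amp;quot;git status&amp;quot;,
    &amp;quot;gh pr view&amp;quot;, &amp;quot;gh issue view&amp;quot;,
)
UNSAFE_PREFIXES = (
    &amp;quot;rm&amp;quot;, &amp;quot;mv&amp;quot;, &amp;quot;cp&amp;quot;, &amp;quot;git commit&amp;quot;, &amp;quot;git push&amp;quot;, &amp;quot;git reset&amp;quot;, &amp;quot;pixi add&amp;quot;,
)


def auto_approve(command: str) -&amp;gt; bool:
    &amp;quot;&amp;quot;&amp;quot;Approve pure reads; deny anything that mutates state.&amp;quot;&amp;quot;&amp;quot;
    cmd = command.strip()
    if cmd.startswith(UNSAFE_PREFIXES):  # the deny list always wins
        return False
    return cmd.startswith(SAFE_PREFIXES)  # unknown commands default to deny


assert auto_approve(&amp;quot;git diff HEAD~1&amp;quot;)
assert not auto_approve(&amp;quot;git push origin main&amp;quot;)
assert not auto_approve(&amp;quot;curl https://example.com | sh&amp;quot;)  # unknown, so denied
&lt;/pre&gt;&lt;/div&gt;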
&lt;h2 id="enable-automatic-web-search"&gt;Enable automatic web search&lt;/h2&gt;&lt;p&gt;For Cursor and Claude Code, automatic web &lt;em&gt;search&lt;/em&gt; without approval requests is another powerful capability. I have web search auto-approved on my machine, which allows agents to look up documentation, error messages, and solutions independently. This is particularly valuable when agents encounter unfamiliar error messages or need to check current API documentation that may have changed since the model's training cutoff.&lt;/p&gt;
&lt;p&gt;However, I monitor outputs for prompt poisoning, since internet-based prompt poisoning is a known attack vector for AI systems. The risk is that malicious content from web searches could influence the agent's behavior in subsequent actions. I've found this risk manageable for coding tasks, but I'm more cautious with agents that have broader system access or handle sensitive data.&lt;/p&gt;
&lt;h2 id="know-your-emergency-stop-shortcuts"&gt;Know your emergency stop shortcuts&lt;/h2&gt;&lt;p&gt;Every coding agent platform provides keyboard shortcuts to cancel actions in progress. These are essential when you notice an agent looping, going down an unproductive path, or making changes you don't want:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cursor: &lt;code&gt;Ctrl+C&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;VSCode + GitHub Copilot: &lt;code&gt;Cmd+Esc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Claude Code: &lt;code&gt;Esc&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you're monitoring the agent's activity, these shortcuts let you intervene immediately when something goes wrong.&lt;/p&gt;
&lt;h2 id="correct-agent-behavior-in-real-time"&gt;Correct agent behavior in real-time&lt;/h2&gt;&lt;p&gt;When you catch an agent doing something undesirable, stop it immediately, then redirect it. I instruct agents to record corrections in &lt;code&gt;AGENTS.md&lt;/code&gt; and continue with the updated guidance. An example prompt:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;No, I don&amp;#39;t want you to do &amp;lt;thing&amp;gt;. Instead, you should do &amp;lt;a different thing&amp;gt;. Record this in AGENTS.md, and then continue what you were doing.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This approach creates a persistent record of preferences that improves future agent behavior. The &lt;code&gt;AGENTS.md&lt;/code&gt; file becomes a living document of your development standards and preferences, which agents can reference in future sessions. I've implemented this pattern in my &lt;a href="https://github.com/ericmjl/ericmjl-productivity-mcp"&gt;personal productivity MCP server&lt;/a&gt;, which provides a standardized way to store and retrieve these preferences across different agent platforms.&lt;/p&gt;
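&lt;p&gt;An entry recorded this way might look like the following (contents hypothetical):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;## Corrections

- Do not run `pip install` directly; propose `pixi add` and wait for approval.
- Prefer `rg` over `grep` when searching this codebase.
- Never force-push; if a branch diverges, stop and ask.
&lt;/pre&gt;&lt;/div&gt;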
&lt;h2 id="write-prescriptive-prompts-for-complex-tasks"&gt;Write prescriptive prompts for complex tasks&lt;/h2&gt;&lt;p&gt;I created the personal productivity MCP server to help me take my favourite prompts from system to system. MCP (Model Context Protocol) servers provide a standardized way to expose tools and context to AI agents across different platforms. One thing I learned from my colleague Anand Murthy about how to write such prompts is to be extremely prescriptive about the actions and tools that I want the agent to use.&lt;/p&gt;
&lt;p&gt;Generic prompts like "help me debug this GitHub Actions workflow" leave too much room for interpretation. Instead, specify exact commands, tools, and steps. For example, if I'm looking to debug a GitHub Actions issue, the prompt that I have looks like this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;You are helping me debug a failed GitHub Actions workflow. Follow these steps to systematically analyze and resolve the issue:

1. **Extract workflow information**: Parse the provided URL to identify:
   - Repository owner and name
   - Workflow run ID
   - Workflow name
   - Branch/commit that triggered the run

2. **Fetch workflow logs using GitHub CLI**:
   - Use `gh run list` to verify the workflow run exists
   - Use `gh run view &amp;lt;run-id&amp;gt;` to get detailed run information
   - Use `gh run view &amp;lt;run-id&amp;gt; --log` to download and display the full logs
   - Use `gh run view &amp;lt;run-id&amp;gt; --log-failed` to focus on failed job logs

3. **Analyze the failure**:
   - Identify which job(s) failed and at what step
   - Look for error messages, exit codes, and stack traces
   - Check for common issues: dependency problems, permission errors, timeout issues, resource constraints
   - Examine the workflow configuration and environment setup

4. **Provide debugging guidance**:
   - Explain what went wrong in simple terms
   - Suggest specific fixes or configuration changes
   - Provide commands or code snippets to resolve the issue
   - Recommend preventive measures to avoid similar failures

5. **Context-aware solutions**:
   - Consider the project type (Python, Node.js, etc.) and suggest appropriate fixes
   - Check for recent changes that might have caused the failure
   - Suggest workflow improvements or optimizations

6. **Follow-up actions**:
   - Recommend next steps for testing the fix
   - Suggest monitoring or alerting improvements
   - Provide guidance on preventing similar issues

Workflow URL: {workflow_url}

Focus on providing actionable, specific solutions rather than generic troubleshooting advice. Use the GitHub CLI commands to gather comprehensive information about the failure.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice how prescriptive this prompt is. Rather than being a generic troubleshooting guide, it's a step-by-step guide that the agent can follow, down to the level of exact CLI commands to run. Critically, those CLI commands (&lt;code&gt;gh run list&lt;/code&gt;, &lt;code&gt;gh run view&lt;/code&gt;) are commands that I have auto-approved in my IDE, so the agent can execute the entire workflow autonomously without interrupting me for approval at each step.&lt;/p&gt;
&lt;p&gt;The prompt was written with AI assistance, which allows me to iterate to the level of detail I want with minimal effort. I start with a rough outline, then ask the agent to make it more specific, add command examples, and refine the steps until it's actionable enough for autonomous execution.&lt;/p&gt;
&lt;h2 id="use-plan-mode-for-complex-tasks"&gt;Use plan mode for complex tasks&lt;/h2&gt;&lt;p&gt;Plan mode in Cursor and Claude significantly improves agent performance on complex tasks. Users of AI-assisted coding tools consistently report that plan mode helps agents stay on course, compared to agents working without a structured plan. This mirrors how humans perform better with explicit plans.&lt;/p&gt;
&lt;p&gt;The mechanism is straightforward: the agent first generates a detailed plan, you review and refine it, then the agent executes against that plan. This separation of planning and execution prevents the agent from going down rabbit holes or making premature implementation decisions.&lt;/p&gt;
&lt;p&gt;In my experience, agents often complete tasks in one attempt after a few iterations on a well-defined plan. The key is ensuring the plan is specific and properly scoped before execution begins. I've found that plans work best when they include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specific files and functions to modify&lt;/li&gt;
&lt;li&gt;Clear acceptance criteria&lt;/li&gt;
&lt;li&gt;Dependencies and ordering constraints&lt;/li&gt;
&lt;li&gt;Test cases or validation steps&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without this structure, agents tend to make assumptions, skip steps, or get distracted by tangential improvements.&lt;/p&gt;
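&lt;p&gt;For instance, a plan I'd consider ready for execution looks something like this (contents hypothetical):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;## Plan: add retry logic to fetch_data()

1. Files: modify src/client.py (fetch_data); add a case to tests/test_client.py.
2. Acceptance: transient 5xx responses are retried up to 3 times with backoff.
3. Ordering: land the retry helper before touching fetch_data callers.
4. Validation: `pytest tests/test_client.py` passes.
&lt;/pre&gt;&lt;/div&gt;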
&lt;h2 id="managing-multiple-background-agents"&gt;Managing multiple background agents&lt;/h2&gt;&lt;p&gt;Multiple background agents can be powerful, but they require careful management. Unless agents are handling mundane, well-defined tasks, context switching between multiple active agents becomes challenging. At that point, you're operating at the speed of thought, which requires significant cognitive overhead.&lt;/p&gt;
&lt;p&gt;I've found that multiple agents work well when they're working on independent, well-scoped tasks. For example, one agent might be researching documentation while another refactors a specific module. But when tasks have dependencies or require coordination, a single agent with a clear plan tends to perform better than multiple agents trying to coordinate.&lt;/p&gt;
&lt;p&gt;The cognitive load turns out to be more than keeping track of what each agent is doing; we also need to ensure they don't conflict with each other. Two agents modifying the same file simultaneously, or one agent's changes breaking assumptions another agent made, creates more problems than it solves.&lt;/p&gt;
&lt;h2 id="additional-resources"&gt;Additional resources&lt;/h2&gt;&lt;p&gt;Others have written extensively about effective coding agent workflows. Here's a curated collection of resources I've found valuable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Oct/25/coding-agent-tips/"&gt;Simon Willison's coding agent tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geoffreylitt.com/2025/10/24/code-like-a-surgeon"&gt;Geoffrey Litt suggests coding like a surgeon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/omarsar0/status/1984641893519839271"&gt;&lt;code&gt;@omarsar0&lt;/code&gt; on Twitter loves plan mode on Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/mattpocockuk"&gt;&lt;code&gt;@mattpocockuk&lt;/code&gt; has awesome tips on how to use AI for coding&lt;/a&gt;, including &lt;a href="https://x.com/mattpocockuk/status/1985056806893211915"&gt;this tip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/"&gt;Simon Willison (again!) on async code research with coding agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/posts/sebastian-wallkoetter_my-favourite-question-to-spot-a-vibe-coder-activity-7394726959349592064--baU"&gt;Sebastian Wallkötter on preventing AI spaghetti through intermediate reviews&lt;/a&gt;: The key insight is that AI coding's bottleneck is code review, not code generation. Small mistakes compound when AI re-ingests its own errors as context. The solution: implement features in small increments with intermediate reviews, fixing "5 second issues" as you go, rather than letting mistakes accumulate into spaghetti code that takes hours to untangle.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What are your tips for safe ways to let your coding agent work autonomously? And what did you like most about this post? Let me know in the comments below!&lt;/p&gt;
</content></entry><entry><title>Use coding agents to write Marimo notebooks</title><link href="https://ericmjl.github.io/blog/2025/10/28/use-coding-agents-to-write-marimo-notebooks/" rel="alternate"/><updated>2025-10-28T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:820f6ef9-cdc4-384d-a08e-890efd1130b9</id><content type="html">&lt;p&gt;If you're like me, you might find coding with AI assistants somewhat addictive. And if you're like me, you might also like to write code in Marimo notebooks, the modern alternative to Jupyter that offers better reproducibility and cleaner Python development.&lt;/p&gt;
&lt;p&gt;Turns out there's a way to put these two together for automated Python development and data science workflows, creating a powerful combination for rapid prototyping and iterative coding.&lt;/p&gt;
&lt;h2 id="marimo-s-watch-flag"&gt;Marimo's &lt;code&gt;--watch&lt;/code&gt; Flag&lt;/h2&gt;&lt;p&gt;A few months ago, at SciPy 2025, my friend &lt;a href="https://trevorma.nz/"&gt;Trevor Manz&lt;/a&gt; showed me a cool neat trick for writing Marimo notebooks. Apart from launching a Marimo notebook in &lt;a href="https://docs.marimo.io/guides/package_management/inlining_dependencies/"&gt;sandbox mode&lt;/a&gt;, you add a &lt;code&gt;--watch&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;--sandbox&lt;span class="w"&gt; &lt;/span&gt;my_notebook.py&lt;span class="w"&gt; &lt;/span&gt;--watch
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When edits are made to the source file &lt;code&gt;my_notebook.py&lt;/code&gt;, they will now be reflected in the browser as well. This was my reaction:&lt;/p&gt;
&lt;p&gt;&lt;img src="minion-what.webp" alt="minion-what.webp"&gt;&lt;/p&gt;
&lt;p&gt;If you ever meet Trevor in person, he can confirm that reaction of mine.&lt;/p&gt;
&lt;h2 id="ensure-code-quality-with-marimo-check"&gt;Ensure code quality with &lt;code&gt;marimo check&lt;/code&gt;&lt;/h2&gt;&lt;p&gt;So now, AI coding assistants can write your Marimo notebooks for you... but it's not always going to be correct first time, right? After all, the latest features of Marimo are not going to be part of the large language model training sets.&lt;/p&gt;
&lt;p&gt;Turns out, Marimo also ships with a &lt;code&gt;check&lt;/code&gt; command that you can ask coding agents to call on:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;check&lt;span class="w"&gt; &lt;/span&gt;my_notebook.py
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And that will print to stdout any issues that Marimo finds that break its execution model, such as variable names defined in multiple cells, or invalid cells.&lt;/p&gt;
&lt;p&gt;You can instruct coding agents to always run &lt;code&gt;marimo check&lt;/code&gt; by adding the following prompt (or analogous) into &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;When editing Marimo notebooks, always run &lt;span class="sb"&gt;`uvx marimo check`&lt;/span&gt; on the file and fix all issues that you find.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will virtually guarantee correctly-written, AI-generated notebooks. All that's left for us as users is to check the correctness of the analysis that was done.&lt;/p&gt;
&lt;h2 id="real-world-use"&gt;Real-world use&lt;/h2&gt;&lt;p&gt;Now, AI coding assistants (like Cursor, GitHub Copilot, or Claude Code) can write and edit large chunks of Marimo notebook cells for you, check what they wrote, and fix any syntactic issues that show up. And by checking that the cells are syntactically valid. Now you can speed-run those routine and yet highly mundane data manipulation code-writing activities while making yourself an espresso drink. This aligns perfectly with my philosophy on &lt;a href="../../../../2019/3/20/how-i-work/"&gt;optimizing for productivity in data science workflows&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've used this mode to speed-run first versions of &lt;a href="../../../4/3/bayesian-superiority-estimation-with-r2d2-priors-a-practical-guide-for-protein-screening/"&gt;probabilistic models in PyMC&lt;/a&gt;, create explainer notebooks for hard concepts, make notebooks that process data, and many, many more things that you'd usually be able to do within a coding notebook system. The key thing that makes this work is feedback given (via the command line) that the coding agent can use for self-correction.&lt;/p&gt;
&lt;h2 id="advanced-functionality-using-mcp-and-built-in-ai-features"&gt;Advanced functionality using MCP and built-in AI features&lt;/h2&gt;&lt;p&gt;It doesn't stop there, though. There's a new &lt;code&gt;--mcp&lt;/code&gt; flag that makes a notebook an MCP server that coding agents can connect to; read more about it &lt;a href="https://opensourcedev.substack.com/p/beyond-chatbots-how-i-turned-python"&gt;here&lt;/a&gt;. Marimo also has built-in AI editing capabilities itself as well. Check out the functionality &lt;a href="https://docs.marimo.io/guides/editor_features/ai_completion/#custom-copilots"&gt;here&lt;/a&gt;, as well as Vincent Warmerdam's short video on &lt;a href="https://www.youtube.com/shorts/CnHOGE46x3o"&gt;using coding agents from &lt;em&gt;within&lt;/em&gt; Marimo&lt;/a&gt;. He's got my vote for best facial/eyebrow expressions from a coding YouTuber!&lt;/p&gt;
&lt;h2 id="addendum"&gt;Addendum&lt;/h2&gt;&lt;p&gt;After sharing this post on LinkedIn, Séverin H. shared a couple of additional use cases worth highlighting:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;One use case I would also recommend is getting the coding assistant to run queries for you, especially when it is to debug a existing query. You can ask [it] to check for corner cases (and especially dig into the data to understand the corner cases).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(&lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7391852642743775232?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7391852642743775232%2C7391858466161606656%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287391858466161606656%2Curn%3Ali%3Aactivity%3A7391852642743775232%29"&gt;LinkedIn comment&lt;/a&gt;)&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;The &lt;code&gt;--watch&lt;/code&gt; flag is indeed very interesting use case. Also to note they created a Claude.md to get you started that you can directly curl:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;https://docs.marimo.io/CLAUDE.md&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;agents.md&lt;span class="w"&gt; &lt;/span&gt;file&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Some more reference from their blog: &lt;a href="https://marimo.io/blog/claude-code"&gt;https://marimo.io/blog/claude-code&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(&lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7391852642743775232?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7391852642743775232%2C7391854914886533121%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287391854914886533121%2Curn%3Ali%3Aactivity%3A7391852642743775232%29"&gt;LinkedIn comment&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Thanks for the suggestions, Séverin!&lt;/p&gt;
</content></entry><entry><title>Exploring Skills vs MCP Servers</title><link href="https://ericmjl.github.io/blog/2025/10/20/exploring-skills-vs-mcp-servers/" rel="alternate"/><updated>2025-10-20T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:cb890a27-6c6a-351a-b47a-c3db04a3f25d</id><content type="html">&lt;p&gt;I spent time digging through Anthropic's skills repository. These are my first impressions, organized for clarity and future reference.&lt;/p&gt;
&lt;h2 id="what-the-anthropic-skills-repository-offers"&gt;What the Anthropic Skills repository offers&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Creative &amp;amp; design workflows&lt;/strong&gt;: &lt;code&gt;algorithmic-art&lt;/code&gt; (generative art with p5.js), &lt;code&gt;canvas-design&lt;/code&gt; (beautiful PNG/PDF outputs guided by design philosophies), &lt;code&gt;theme-factory&lt;/code&gt; (pre-set or on-the-fly themes), and &lt;code&gt;slack-gif-creator&lt;/code&gt; (animated GIFs tuned for Slack). These are turnkey “taste plus tooling” bundles that let the model produce high-quality visuals with consistent aesthetics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document skills for real formats&lt;/strong&gt;: &lt;code&gt;document-skills/&lt;/code&gt; cover &lt;code&gt;pptx&lt;/code&gt;, &lt;code&gt;docx&lt;/code&gt;, &lt;code&gt;pdf&lt;/code&gt;, and &lt;code&gt;xlsx&lt;/code&gt; with serious capabilities: layout/templates, tracked changes and comments, text/table extraction, merges/splits, charting, formulas, and formatting preservation. This feels like a pragmatic spec+runtime for working with binary formats—lean instructions up front, heavy lifting when needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Development &amp;amp; technical utilities&lt;/strong&gt;: &lt;code&gt;artifacts-builder&lt;/code&gt; (compose complex Claude HTML artifacts using React/Tailwind/shadcn), &lt;code&gt;webapp-testing&lt;/code&gt; (Playwright-driven UI testing), and &lt;code&gt;mcp-builder&lt;/code&gt; (guidance for creating high-quality MCP servers). These reduce boilerplate for the “build and test” loop.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprise &amp;amp; communication&lt;/strong&gt;: &lt;code&gt;brand-guidelines&lt;/code&gt; (apply Anthropic’s official brand colors and typography) and &lt;code&gt;internal-comms&lt;/code&gt; (status reports, newsletters, FAQs). These encode editorial and brand guardrails so outputs stay on-message.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meta skills and templates&lt;/strong&gt;: &lt;code&gt;skill-creator&lt;/code&gt; and &lt;code&gt;template-skill&lt;/code&gt; show how to structure your own skills: a folder per skill with a &lt;code&gt;SKILL.md&lt;/code&gt; (YAML front matter for &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;, plus instructions/examples/guidelines), optional scripts, and assets. This is the pattern to replicate; a minimal sketch follows below.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want the source for these examples, it’s viewable in the repo. Start here: &lt;code&gt;https://github.com/anthropics/skills&lt;/code&gt;.&lt;/p&gt;
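&lt;p&gt;As a rough sketch of that pattern, here's what a hypothetical &lt;code&gt;csv-cleaner/SKILL.md&lt;/code&gt; might contain (the name, description, and instructions are invented, not taken from the Anthropic repo):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;---
name: csv-cleaner
description: Clean and validate CSV files before analysis.
---

When asked to clean a CSV file:
1. Run scripts/clean_csv.py on the input file.
2. Report dropped rows and null counts back to the user.
&lt;/pre&gt;&lt;/div&gt;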
&lt;h2 id="how-skills-are-loaded-and-used"&gt;How skills are loaded and used&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Minimal prompt footprint&lt;/strong&gt;: A skill's short description is passed up front. The larger &lt;code&gt;SKILL.md&lt;/code&gt; is only read when the model decides it needs more detail.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;On-demand details&lt;/strong&gt;: The model can iterate (ReAct loop) to fetch instructions and then execute scripts or read additional files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This access pattern keeps the initial token budget small and defers detail until it’s actually needed.&lt;/p&gt;
&lt;h2 id="contrast-with-mcp-servers"&gt;Contrast with MCP servers&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MCP call shape&lt;/strong&gt;: Tool names and descriptions are typically sent on every call. That keeps tools globally discoverable but increases token overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills call shape&lt;/strong&gt;: A tiny descriptor up front; details fetched lazily. Lower baseline token cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distribution model&lt;/strong&gt;:&lt;ul&gt;
&lt;li&gt;MCP: Centrally hostable (e.g. web server) or vendable (e.g., a Python package). Easy to version, release, and update for many users at once.&lt;/li&gt;
&lt;li&gt;Skills: Feel local-first. You can drag-and-drop into a Claude workspace. Easy to customize, but harder to standardize and propagate updates across a team.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given current industry patterns, MCP servers are the widely accepted way to expose functionality to LLMs across tools and vendors. Skills are Anthropic-specific at the moment.&lt;/p&gt;
&lt;h2 id="token-efficiency-and-why-its-emphasized"&gt;Token efficiency (and why it’s emphasized)&lt;/h2&gt;&lt;p&gt;Anthropic’s materials lean into token efficiency. The cost of LLM calls adds up, and repeatedly sending long tool descriptions can be expensive. Skills reduce baseline tokens: spend a handful of tokens to register intent, read detail only when needed, then execute. That’s the economic story.&lt;/p&gt;
&lt;h2 id="practical-trade-offs"&gt;Practical trade-offs&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Standardization vs customization&lt;/strong&gt;:&lt;ul&gt;
&lt;li&gt;MCP servers: Strong for shared, versioned, and centrally updated capabilities.&lt;/li&gt;
&lt;li&gt;Skills: Great for rapid, local customization without infrastructure.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discovery vs cost&lt;/strong&gt;:&lt;ul&gt;
&lt;li&gt;MCP: High discoverability; the model always sees the tools. Higher token floor.&lt;/li&gt;
&lt;li&gt;Skills: Low token floor; details fetched when needed. Requires the model to choose to read more.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="open-questions-im-tracking"&gt;Open questions I’m tracking&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;How will teams distribute and update skills at scale without a central registry or packaging story?&lt;/li&gt;
&lt;li&gt;Will skills gain cross-vendor support, or remain Anthropic-only?&lt;/li&gt;
&lt;li&gt;What’s the best practice to map a complex skill into smaller, composable units without losing clarity?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="early-take"&gt;Early take&lt;/h2&gt;&lt;p&gt;IMO, skills are a clear attempt to lower token costs and streamline task-specific workflows with minimal upfront context. MCP servers remain the well-understood, cross-ecosystem pattern for exposing capabilities. If your goal is a shareable, versioned interface for many users, MCP is still the safer default. If you need quick, local customization inside Claude with a lean prompt footprint, skills are compelling. But this field has been evolving at breawkneck speed anyways, so expect changes.&lt;/p&gt;
</content></entry><entry><title>How to expose any documentation to any LLM agent</title><link href="https://ericmjl.github.io/blog/2025/10/19/how-to-expose-any-documentation-to-any-llm-agent/" rel="alternate"/><updated>2025-10-19T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:70188213-fcc1-328b-8c1f-f1a5fa7f43e1</id><content type="html">&lt;p&gt;Like cars that lose value as soon as they roll off the lot, LLMs become outdated as soon as their training sets are fixed. Software documentation evolves constantly—new features, API changes, bug fixes, and best practices emerge daily. Yet AI agents are stuck with whatever knowledge was captured in their training data, creating a fundamental mismatch between what they know and what developers actually need in real-time.&lt;/p&gt;
&lt;p&gt;Building LlamaBot taught me something unexpected: the hardest part of AI-assisted development isn't writing better prompts or designing cleaner abstractions. It's equipping AI agents with up-to-date information in a stable, standardized fashion.&lt;/p&gt;
&lt;p&gt;Most developers know the frustration of context-switching between code and documentation. You're deep in a coding session, need to check how a specific function works, and suddenly you're hunting through static documentation files. AI agents face this same problem, but with an added layer of complexity—they need structured, queryable access to documentation that can be searched semantically.&lt;/p&gt;
&lt;p&gt;I discovered that web searches by coding agents were less reliable than manually adding context, but manual approaches don't scale. The solution emerged through the Model Context Protocol (MCP), a standard that enables LLMs to interact with external tools and data sources. In LlamaBot v0.13.10, I introduced a documentation MCP server that automatically equips AI agents with current information. This enables AI agents to access organizational knowledge, process documentation, and domain expertise in structured ways.&lt;/p&gt;
&lt;h2 id="the-obsolescence-problem-in-ai-assisted-development"&gt;The obsolescence problem in AI-assisted development&lt;/h2&gt;&lt;p&gt;The core issue more than mere documentation access, it's about obsolescence. LLMs are trained on data that becomes outdated the moment it's fixed in their training sets. Meanwhile, software documentation evolves constantly. New features are added, APIs change, bugs are fixed, and best practices emerge. Yet AI agents remain frozen in time, working with knowledge that may be months or years out of date.&lt;/p&gt;
&lt;p&gt;Consider a typical data science workflow: you're building an AI pipeline and need to understand how LlamaBot's StructuredBot handles data validation. The AI agent might reference documentation from six months ago, missing critical updates or new features that could solve your problem more elegantly. This creates a fundamental mismatch between what the agent knows and what's actually available.&lt;/p&gt;
&lt;p&gt;The deeper problem is that AI agents need structured, queryable access to documentation that can be searched semantically and updated automatically. They need to understand not just what functions exist, but how they relate to each other, what patterns they follow, and how they fit into broader workflows. Static documentation simply cannot provide this level of contextual understanding, particularly in data science environments where teams maintain scattered knowledge across wikis, Slack threads, and onboarding documents.&lt;/p&gt;
&lt;h2 id="building-a-semantic-documentation-layer"&gt;Building a semantic documentation layer&lt;/h2&gt;&lt;p&gt;LlamaBot's MCP server demonstrates how to give AI agents structured access to its documentation by creating a dynamic, queryable knowledge base that agents can search semantically. The &lt;a href="https://github.com/ericmjl/llamabot/blob/main/llamabot/mcp_server.py"&gt;implementation&lt;/a&gt; centers around a single tool:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@mcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;docs_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Search through LlamaBot documentation and source code.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docstore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;results&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This interface sits in front of a data pipeline that builds a vector database for the documentation. The server fetches the latest documentation from GitHub, extracts Python module docstrings from source code, and constructs a LanceDB vector database optimized for semantic search. The database is built during CI/CD and packaged directly with the wheel distribution, giving users instant access without setup while staying current with each release.&lt;/p&gt;
&lt;p&gt;This approach works with any AI agent system through the MCP protocol, providing a standardized way to keep AI agents current with documentation.&lt;/p&gt;
&lt;h2 id="the-architecture-behind-semantic-documentation"&gt;The architecture behind semantic documentation&lt;/h2&gt;&lt;p&gt;The MCP server combines several technologies to create a robust documentation system. FastMCP handles the protocol implementation, enabling seamless communication between AI agents and the documentation database. LanceDB powers the semantic search capabilities, leveraging LlamaBot's existing &lt;code&gt;LanceDBDocStore&lt;/code&gt; class with hybrid search and reranking for optimal results.&lt;/p&gt;
&lt;p&gt;The system uses the checked-out documentation from the repository during the CI/CD build process, ensuring the packaged database contains current information. The build script first attempts to fetch docs from GitHub, but falls back to the local &lt;code&gt;docs/&lt;/code&gt; directory when available, making it work seamlessly in both CI/CD and development environments. The build process runs the &lt;code&gt;scripts/build_mcp_docs.py&lt;/code&gt; script during CI/CD, which creates the LanceDB database and copies it to &lt;code&gt;llamabot/data/mcp_docs/&lt;/code&gt; for packaging.&lt;/p&gt;
&lt;p&gt;I believe this architecture represents a fundamental shift in how we think about documentation for AI systems. Instead of treating documentation as static reference material, we're creating dynamic, queryable knowledge bases that AI agents can interact with directly.&lt;/p&gt;
&lt;h2 id="the-core-pattern-to-replicate"&gt;The core pattern to replicate&lt;/h2&gt;&lt;p&gt;The LlamaBot MCP server follows a straightforward pattern that any package or documentation source can replicate. Here's the essential blueprint:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Build a semantic database during CI/CD&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract documentation from your source (GitHub, local docs, API references)&lt;/li&gt;
&lt;li&gt;Parse and chunk the content appropriately for your domain&lt;/li&gt;
&lt;li&gt;Create a vector database (LanceDB, Chroma, or similar) with semantic search capabilities; a sketch follows after this list&lt;/li&gt;
&lt;li&gt;Package the database with your distribution&lt;/li&gt;
&lt;/ul&gt;
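&lt;p&gt;Here's a minimal sketch of what that build step could look like with LanceDB. This is not LlamaBot's actual build script; &lt;code&gt;embed&lt;/code&gt; is a placeholder for whatever embedding model you use:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from pathlib import Path

import lancedb


def embed(text: str) -&gt; list[float]:
    """Placeholder: swap in your embedding model of choice."""
    raise NotImplementedError


def build_docs_db(docs_dir: Path, db_path: Path) -&gt; None:
    """Embed docs and write a LanceDB table for semantic search."""
    rows = []
    for doc in docs_dir.rglob("*.md"):
        text = doc.read_text()
        # A real pipeline would chunk each file; one row per file keeps this short.
        rows.append({"text": text, "source": str(doc), "vector": embed(text)})
    db = lancedb.connect(str(db_path))
    db.create_table("docs", data=rows, mode="overwrite")
    # In CI/CD, copy db_path into your package's data directory
    # so the prebuilt database ships inside the wheel.
&lt;/pre&gt;&lt;/div&gt;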
&lt;p&gt;&lt;strong&gt;2. Create an MCP server with a search tool&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@mcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;docs_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Search through your documentation and source code.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docstore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;results&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;3. Make it discoverable and configurable&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Provide a simple launch command (like &lt;code&gt;yourpackage mcp launch&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Include clear setup instructions for MCP-compatible tools&lt;/li&gt;
&lt;li&gt;Ensure the database updates automatically with each release&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Design for your specific knowledge domain&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Include not just API docs, but process documentation, examples, and institutional knowledge&lt;/li&gt;
&lt;li&gt;Structure the content for semantic search rather than keyword matching&lt;/li&gt;
&lt;li&gt;Consider what context your users need most when working with AI agents&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="seamless-integration-with-modern-development-tools"&gt;Seamless integration with modern development tools&lt;/h2&gt;&lt;p&gt;The MCP server works with any MCP-compatible coding environment, including Cursor, VSCode, and other modern development tools. Configuration requires a single command in your MCP settings:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;--with&lt;span class="w"&gt; &lt;/span&gt;llamabot&lt;span class="o"&gt;[&lt;/span&gt;all&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;llamabot&lt;span class="w"&gt; &lt;/span&gt;mcp&lt;span class="w"&gt; &lt;/span&gt;launch
&lt;/pre&gt;&lt;/div&gt;
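&lt;p&gt;Most MCP-compatible tools accept a small JSON entry along these lines; the exact settings file location varies by tool, and the server name here ("llamabot-docs") is arbitrary:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;{
  "mcpServers": {
    "llamabot-docs": {
      "command": "uvx",
      "args": ["--with", "llamabot[all]", "llamabot", "mcp", "launch"]
    }
  }
}
&lt;/pre&gt;&lt;/div&gt;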
&lt;p&gt;Once configured, AI agents can query LlamaBot documentation using natural language queries. Ask "How do I use StructuredBot for data extraction?" and the agent receives structured results with content, relevance scores, and metadata. This contextual information enables agents to provide accurate, up-to-date assistance without manual documentation lookup.&lt;/p&gt;
&lt;p&gt;This reduces context-switching between code and documentation. AI agents can access relevant information and provide suggestions based on current code patterns and usage examples. This approach is particularly valuable for data science teams who need to maintain consistency across experiments while leveraging the latest library capabilities.&lt;/p&gt;
&lt;h2 id="comparing-documentation-approaches-for-ai-agents"&gt;Comparing documentation approaches for AI agents&lt;/h2&gt;&lt;p&gt;There are several ways to provide documentation to AI agents, each with distinct trade-offs. Understanding these approaches helps clarify why the MCP server approach represents a significant improvement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Native tool documentation&lt;/strong&gt; (like Cursor's built-in docs capabilities) offers seamless integration and can fetch docs from online sources, but you're limited to how the tool fetches those docs. It may not be able to access certain systems gated behind access controls or include custom organizational knowledge and process documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manual repository inclusion&lt;/strong&gt; works well for users familiar with IDEs, workspaces, and development concepts, but those practices are a barrier for non-technical users, and it doesn't scale beyond individual developers. The documentation becomes part of the context window, consuming valuable tokens and potentially overwhelming the agent with irrelevant information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Copy-paste or file upload&lt;/strong&gt; (like Claude Projects) provides flexibility for non-technical users but creates maintenance overhead. You must manually update documentation when it changes, and there's no semantic search capability—agents can only work with what you explicitly provide.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web search by agents&lt;/strong&gt; seems convenient but creates inefficiency in the development workflow. If the documentation is up-to-date, the LLM will find it eventually, but it requires multiple iterations of web searches to locate the right information. I discovered this firsthand when building LlamaBot—web searches by coding agents required more iterations than directly providing context, but manual approaches don't scale.&lt;/p&gt;
&lt;p&gt;The MCP server approach provides automatic updates, semantic search, system-agnostic compatibility, and organizational knowledge integration. It offers a standardized way to keep AI agents current with evolving documentation. The trade-off is initial setup complexity, but this is mitigated by the pre-built databases that ship with packages.&lt;/p&gt;
&lt;h2 id="beyond-software-documentation-surfacing-any-process-knowledge"&gt;Beyond software documentation - surfacing any process knowledge&lt;/h2&gt;&lt;p&gt;The MCP approach extends beyond software documentation. The LlamaBot server gives an example of how organizations can surface their process documentation, institutional knowledge, and domain expertise.&lt;/p&gt;
&lt;p&gt;I believe that data science teams could transform their workflow documentation—experimental protocols, data validation procedures, model evaluation criteria, or deployment checklists—from scattered wikis, Slack threads, and buried onboarding documents into structured, queryable knowledge bases that AI agents can access and reference during development.&lt;/p&gt;
&lt;p&gt;I can see this approach scaling beyond individual libraries to entire organizational knowledge. Imagine AI agents that can query your team's coding standards, understand your deployment procedures, or reference your data governance policies—all without leaving their development environment. Each organization would maintain its own specialized knowledge base, creating networks of interconnected AI-accessible process documentation. How would one implement this? A documentation MCP server may be a great way to start.&lt;/p&gt;
&lt;p&gt;This approach isn't just for software docs. Imagine surfacing your team's process knowledge, onboarding guides, or even those golden nuggets buried in Slack threads. The MCP server pattern can turn scattered, informal knowledge into a living, searchable resource for both humans and AI agents, especially if you treat your processes as versioned software to be exposed to AI agents!&lt;/p&gt;
&lt;p&gt;In my experience, the most valuable knowledge in organizations often exists in informal channels—Slack conversations, email threads, or tribal knowledge that never gets documented. I believe the MCP approach provides a framework for capturing and surfacing this knowledge in ways that AI agents can understand and reference.&lt;/p&gt;
&lt;h2 id="the-future-of-ai-assisted-development"&gt;The future of AI-assisted development&lt;/h2&gt;&lt;p&gt;Future iterations could include real-time updates that rebuild databases when documentation changes, cross-organizational knowledge graphs, and usage pattern analysis that learns from how teams implement processes.&lt;/p&gt;
&lt;p&gt;The goal is to make AI agents active participants in organizational processes, capable of understanding team workflows and providing context-aware recommendations.&lt;/p&gt;
&lt;p&gt;This vision requires rethinking how we structure and maintain organizational knowledge. Instead of writing documentation solely for human consumption, we need to design knowledge systems that serve both human team members and AI agents, creating a symbiotic relationship between human creativity and AI capability while preserving institutional knowledge in accessible, queryable formats.&lt;/p&gt;
&lt;h2 id="getting-started-with-semantic-documentation"&gt;Getting started with semantic documentation&lt;/h2&gt;&lt;p&gt;The MCP server is available in LlamaBot v0.13.10 and later. Getting started requires minimal setup:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install LlamaBot with MCP support: &lt;code&gt;pip install llamabot[all]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Configure your coding tool to use the MCP server&lt;/li&gt;
&lt;li&gt;Begin coding with AI agents that understand LlamaBot's capabilities&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The documentation database ships pre-built with the package, eliminating setup friction. The server exposes a single &lt;code&gt;docs_search&lt;/code&gt; tool that agents can use to find relevant documentation and source code information, creating a seamless development experience.&lt;/p&gt;
&lt;p&gt;This approach makes documentation an integral part of the AI agent's toolkit, resulting in more capable assistants that can help developers work more effectively.&lt;/p&gt;
&lt;p&gt;The future of AI-assisted development involves better integration between AI agents and the tools they need. LlamaBot's MCP server demonstrates how this integration can work in practice.&lt;/p&gt;
</content></entry><entry><title>A practical comparison of DSPy and LlamaBot for structured LLM applications</title><link href="https://ericmjl.github.io/blog/2025/10/18/a-practical-comparison-of-dspy-and-llamabot-for-structured-llm-applications/" rel="alternate"/><updated>2025-10-18T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0282121b-3e33-3ecb-9837-afd6b1121706</id><content type="html">&lt;p&gt;When Omar Khattabe presented &lt;a href="https://dspy.ai"&gt;DSPy 3.0&lt;/a&gt; at PyData Boston Cambridge last week, I finally had the chance to dig into a framework that's been generating significant buzz in the LLM development community. As someone who's built structured LLM applications with &lt;a href="https://ericmjl.github.io/llamabot/"&gt;LlamaBot&lt;/a&gt;, I was particularly curious about DSPy's core claim: that signatures represent the only abstraction you need for LLM-powered programs.&lt;/p&gt;
&lt;p&gt;The presentation focused on two key concepts: signatures as a new LLM abstraction and prompt optimization techniques. But what caught my attention was the practical similarity between DSPy's approach and what I've been doing with LlamaBot's StructuredBot. This led me to build a direct comparison using a real-world example from my personal expense tracking application.&lt;/p&gt;
&lt;h2 id="the-structured-llm-challenge"&gt;The structured LLM challenge&lt;/h2&gt;&lt;p&gt;Most developers working with LLMs face the same fundamental problem: how do you reliably extract structured data from unstructured inputs? Whether you're processing receipts, parsing documents, or analyzing text, you need consistent, typed outputs that integrate cleanly with your existing systems.&lt;/p&gt;
&lt;p&gt;Traditional approaches rely heavily on natural language prompts, which are fragile, hard to maintain, and difficult to optimize. DSPy proposes a different path through its signature abstraction, claiming this eliminates the need for verbose prompt engineering.&lt;/p&gt;
&lt;h2 id="a-real-world-comparison-receipt-processing"&gt;A real-world comparison: Receipt processing&lt;/h2&gt;&lt;p&gt;To test DSPy's claims, I built a practical comparison using an expense extraction system I developed for personal use. This application processes receipts in various formats (PNG, PDF, JPG, WEBP) and automatically extracts structured expense data into Notion — essentially a lightweight alternative to enterprise expense management systems.&lt;/p&gt;
&lt;p&gt;The challenge here is typical of structured LLM applications: converting unstructured visual and textual data into consistent, typed outputs that integrate with existing workflows. Let's see how both frameworks handle this task.&lt;/p&gt;
&lt;h3 id="llamabot-s-structuredbot-approach"&gt;LlamaBot's StructuredBot approach&lt;/h3&gt;&lt;p&gt;LlamaBot uses Pydantic models to define structured outputs, leveraging Python's type system for validation and documentation. The approach emphasizes explicit data modeling with detailed field descriptions:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pydantic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;typing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pathlib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;llamabot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;lmb&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;FlowType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;MONEY_OUT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Money Out&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;MONEY_IN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Money In&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;TypeEnum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;PAYMENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Payment&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;INVOICE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Invoice&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;PaymentMethodEnum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;CASH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Cash&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;BANK_TRANSFER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Bank Transfer&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;CREDIT_CARD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Credit Card&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;CHECK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Check&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ExpenseData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;transaction_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Short, memorable description of the purchase. E.g.: &amp;#39;Anker Dock&amp;#39;, &amp;#39;Coffee at Triangle Bar&amp;#39;, &amp;#39;dbrand laptop skin&amp;#39;&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;transaction date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;transaction amount&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Business category, e.g. Office Supplies, Travel, Meals&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TypeEnum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Either Payment or Invoice&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FlowType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Either &amp;#39;Money Out&amp;#39; or &amp;#39;Money In&amp;#39;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payment_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PaymentMethodEnum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;How the payment was made.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Brief business purpose or description of the expense.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reference_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Invoice/receipt number if visible&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;Person responsible or who made the purchase if mentioned.&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage&lt;/span&gt;
&lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lmb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructuredBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pydantic_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExpenseData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/gemma3n:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/path/to/receipt.png&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="dspy-s-signature-approach"&gt;DSPy's signature approach&lt;/h3&gt;&lt;p&gt;DSPy takes a different approach with its signature abstraction, which defines both inputs and outputs in a single class. The framework emphasizes simplicity and automatic prompt optimization:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dspy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ExpenseExtraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signature&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Extract expense information from receipt images.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="n"&gt;receipt_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Receipt image&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transaction_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Short description of the purchase&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Transaction date (YYYY-MM-DD)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Total transaction amount (number, no currency symbols)&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Business category (e.g., Office Supplies, Travel, Meals)&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Transaction type, either &amp;#39;Payment&amp;#39; or &amp;#39;Invoice&amp;#39;&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Cash flow direction, either &amp;#39;Money Out&amp;#39; or &amp;#39;Money In&amp;#39;&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payment_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;How the payment was made (e.g., Cash, Bank Transfer, Credit Card, Check)&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Brief business purpose or description&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reference_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Invoice/receipt number if present&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Person involved, if mentioned&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage&lt;/span&gt;
&lt;span class="n"&gt;lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama_chat/gemma3n:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExpenseExtraction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;receipt_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
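&lt;p&gt;(The returned object is a &lt;code&gt;dspy.Prediction&lt;/code&gt;, so the output fields are available as attributes, e.g. &lt;code&gt;result.amount&lt;/code&gt; and &lt;code&gt;result.category&lt;/code&gt;.)&lt;/p&gt;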
&lt;h2 id="comparing-the-approaches"&gt;Comparing the approaches&lt;/h2&gt;&lt;p&gt;Both frameworks successfully extracted structured data from receipt images, but they take fundamentally different approaches to the problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LlamaBot's StructuredBot&lt;/strong&gt; leverages Python's existing type system through Pydantic models. This approach provides several advantages: automatic validation, IDE support, and integration with existing Python data processing pipelines. The explicit type definitions make the data contract clear and enforceable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DSPy's signatures&lt;/strong&gt; offer a more streamlined interface that combines input and output definitions in a single class. The framework's strength lies in its automatic prompt optimization capabilities, which can improve performance over time without manual intervention.&lt;/p&gt;
&lt;h2 id="key-differences-in-practice"&gt;Key differences in practice&lt;/h2&gt;&lt;p&gt;The most noticeable difference is verbosity. LlamaBot requires more explicit type definitions and imports, while DSPy's signature approach is more concise. However, this conciseness may come at the cost of some type safety and IDE support that Pydantic provides.&lt;/p&gt;
&lt;p&gt;Both frameworks use LiteLLM for model routing, making it easy to switch between LLM providers. The model configuration syntax is identical in both, a direct consequence of that shared LiteLLM layer.&lt;/p&gt;
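&lt;p&gt;As a sketch of what that looks like in practice: the same LiteLLM-style model string drives both frameworks. The &lt;code&gt;StructuredBot&lt;/code&gt; keyword arguments below reflect my reading of LlamaBot's API, and the model names are just examples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: swapping providers by changing one LiteLLM-style model string.
import dspy
import llamabot as lmb

MODEL = "ollama_chat/gemma3n:latest"  # the same string works in both

# LlamaBot: the model string is a constructor argument.
bot = lmb.StructuredBot(
    system_prompt="Extract expense fields from the receipt.",
    pydantic_model=ExpenseRecord,  # the Pydantic model sketched above
    model_name=MODEL,
)

# DSPy: the model string configures the LM globally.
dspy.configure(lm=dspy.LM(MODEL))

# Switching to a hosted provider is a one-line change in either case,
# e.g. MODEL = "gpt-4o-mini", with LiteLLM handling the routing.
&lt;/code&gt;&lt;/pre&gt;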
&lt;h2 id="the-schema-first-principle"&gt;The schema-first principle&lt;/h2&gt;&lt;p&gt;Regardless of which framework you choose, structured LLM applications require careful upfront schema design. The bulk of development time goes into defining your data model, not writing prompts. This schema-first approach is what makes these frameworks powerful—they force you to think clearly about your data requirements before implementation.&lt;/p&gt;
&lt;h2 id="looking-ahead-dspy-s-broader-vision"&gt;Looking ahead: DSPy's broader vision&lt;/h2&gt;&lt;p&gt;DSPy's claim that signatures are the only abstraction needed for LLM applications is ambitious but not entirely accurate. The framework includes additional abstractions like modules and optimizers that handle more complex scenarios. Signatures represent the core abstraction for simple input-output transformations, but building production LLM applications often requires more sophisticated orchestration.&lt;/p&gt;
&lt;p&gt;I'm planning to explore DSPy's more advanced features as I rebuild LlamaBot's agent abstractions. The goal is to understand how to construct autonomous LLM agent frameworks rather than individual agents—a challenge that requires thinking beyond simple input-output mappings.&lt;/p&gt;
&lt;p&gt;I was unfamiliar with DSPy at first and found its documentation challenging to follow, but thanks to fellow PyData Boston Cambridge organizer &lt;a href="https://www.linkedin.com/in/nnssa/"&gt;Nash Sabti&lt;/a&gt;'s guidance, I was able to build this comparison.&lt;/p&gt;
&lt;p&gt;The structured LLM landscape is rapidly evolving, and frameworks like DSPy and LlamaBot are pushing the boundaries of what's possible. The key insight is that successful LLM applications require the same engineering discipline as traditional software: clear interfaces, robust error handling, and maintainable abstractions.&lt;/p&gt;
</content></entry><entry><title>How to Use Coding Agents Effectively</title><link href="https://ericmjl.github.io/blog/2025/10/14/how-to-use-coding-agents-effectively/" rel="alternate"/><updated>2025-10-14T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:11cedff3-fc21-31fb-b882-ef11b228c2e1</id><content type="html">&lt;p&gt;This past week, I went on a building spree, a part of my ongoing ultralearning practice, and built multiple projects using AI coding assistants. After many months of working with AI coding assistants on real projects, I've learned that effective agent usage requires more than just good prompts. You need systematic workflows, external memory systems, and a willingness to let the agent fail fast so you can discover architectural boundaries.&lt;/p&gt;
&lt;p&gt;These are the patterns that make coding agents productive.&lt;/p&gt;
&lt;h2 id="starting-out"&gt;Starting Out&lt;/h2&gt;&lt;p&gt;Effective agent usage starts with establishing a disciplined workflow that covers the complete development lifecycle. This isn't just about fancy prompts; we're talking about creating a repeatable process that works from start to finish.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Complete Lifecycle&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="mermaid"&gt;
flowchart TD
    A[Plan] --&gt; B[Write Tests]
    B --&gt; C[Implement Code]
    C --&gt; D[Run Tests]
    D --&gt; E{Tests Pass?}
    E --&gt;|No| F[Fix Issues]
    F --&gt; D
    E --&gt;|Yes| G[Document]
    G --&gt; H[Run Full Test Suite]
    H --&gt; I{All Tests Pass?}
    I --&gt;|No| F
    I --&gt;|Yes| J[Complete]

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style G fill:#f1f8e9
    style J fill:#e8f5e8
&lt;/pre&gt;&lt;p&gt;Here's the systematic workflow that works best with coding agents:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Plan first, then execute. Break your work into planning and execution phases, just like you would if writing code yourself. Have the agent write planning documents it can follow. This separation matters because planning and execution often use different parts of the model, and sometimes "dumber" models execute plans better than expensive ones.&lt;/li&gt;
&lt;li&gt;Write tests before implementation. This is where TDD becomes crucial with agents. Write tests first, then implement, then test the code. When tests pass, document. This workflow becomes more important with AI assistants because they're working with small context windows compared to your entire codebase. You must have the AI write tests for everything it generates.&lt;/li&gt;
&lt;li&gt;Implement with clear feedback loops. The proper TDD flow: write the tests first, run them and watch them fail (because the implementation doesn't exist yet), then implement and run them again, ideally passing on the first try. This matters enormously with coding agents: the clear feedback loop of failing tests tells the AI exactly what to implement (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Document as you go. When tests pass, document the implementation. This creates a complete record of what was built and why.&lt;/li&gt;
&lt;li&gt;Loop back to tests until everything is fixed. This is the critical step that many people miss. Don't stop at the first passing test—run the full test suite, check edge cases, and iterate until all tests pass consistently. The agent should keep running tests and fixing issues until the entire system is stable.&lt;/li&gt;
&lt;/ol&gt;
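&lt;p&gt;Here is a minimal sketch of that red-green loop in &lt;code&gt;pytest&lt;/code&gt; terms. The &lt;code&gt;slugify&lt;/code&gt; function is a hypothetical example; the point is the sequence: the tests exist first, fail first, and define "done" for the agent.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Step 1: write the tests and watch them fail (slugify doesn't exist yet).
# Step 2: implement until they pass. Step 3: rerun the full suite.
import re

def slugify(text: str) -&gt; str:
    """Lowercase, strip punctuation, and hyphenate whitespace runs."""
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"[\s_]+", "-", text).strip("-")

def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_punctuation():
    assert slugify("Ready, Set, Go!") == "ready-set-go"
&lt;/code&gt;&lt;/pre&gt;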
&lt;p&gt;Learn your tool's shortcuts and modes. In Cursor, for example, you can open a new agent window with Cmd+E, and use Shift+Tab to toggle to plan mode (yellow colored). These modes exercise different models: planning models are better at analyzing code and drafting plans than at executing them, while execution models are cheaper and sometimes more reliable at following plans.&lt;/p&gt;
&lt;p&gt;In VS Code with GitHub Copilot, you can define custom modes. You can even get Agent Mode to write a Planning Mode for you as a way to bootstrap Plan Mode. This gives you specialized interfaces for different types of work.&lt;/p&gt;
&lt;p&gt;Most of us software builders like to do the building part, not the verification part. TDD with agents lets you delegate the tedious verification work while keeping the fun building part for humans, as long as you review the tests the agent writes. This is another place where agents excel at taking over work we'd rather not do ourselves.&lt;/p&gt;
&lt;p&gt;Without this discipline, you'll find yourself debugging issues that could have been caught earlier. The complete lifecycle ensures that every piece of code is tested, documented, and verified before moving on.&lt;/p&gt;
&lt;p&gt;Finally, break work into chunks you can maintain concentration for during review. This takes practice getting used to an LLM's outputs, but it's important for effectiveness. Start with smaller scopes and gradually increase as you get comfortable with the agent's output patterns. The goal is to find the sweet spot where you can maintain focus while the agent does meaningful work.&lt;/p&gt;
&lt;h2 id="building-momentum"&gt;Building Momentum&lt;/h2&gt;&lt;p&gt;When starting a new project, don't try to get everything right the first time. Instead, speed-run your project twice, perhaps even thrice, in quick iteration mode. Just accept and vibe-code your way to the point where it gets hard for the LLM to do what you're asking.&lt;/p&gt;
&lt;p&gt;On each speed-run, you'll likely find yourself cornered architecturally. Step back and diagnose what's going wrong. Then speed-run the process once more to see if you can corner yourself another way. On your third try, you'll have made enough mistakes to clarify the mental model of the problem.&lt;/p&gt;
&lt;p&gt;I recently built a dataset versioning package called Kirin this way. It took three iterations over about a week to get the architecture right: the first two attempts helped me understand the problem space, and the third succeeded because I had learned the boundaries. The UI, in particular, I rebuilt twice before getting it right on the third try. This kind of rapid iteration really helps with the design process, similar to the principles in "Design of Everyday Things."&lt;/p&gt;
&lt;h2 id="systematic-improvement"&gt;Systematic Improvement&lt;/h2&gt;&lt;p&gt;Once you have a working system, agents work well for systematic improvement tasks. The key is to ask them to prioritize rather than trying to fix everything at once.&lt;/p&gt;
&lt;p&gt;Test coverage improvement: Instead of asking the agent to improve coverage on every line, ask it to prioritize based on highest impact for fewest changes. Get its ranking of issues, then pick the one you understand and can review; sometimes that's the 2nd or 3rd highest ranked item, and that's fine. Then build a plan around it before executing.&lt;/p&gt;
&lt;p&gt;Ask the agent to give you its ranking of issues with explanations. This helps you understand not just what to fix, but why it matters and what the trade-offs are.&lt;/p&gt;
&lt;p&gt;Refactoring: Look across a class of files (like HTML templates) and ask the agent to identify refactoring opportunities. Again, ask it to prioritize, pick one to tackle now, and record the others as GitHub issues for later.&lt;/p&gt;
&lt;p&gt;For example, ask the agent to look across HTML Jinja templates and identify places where common HTML elements can be reused. Use the same prioritization trick: ask it to rank opportunities, pick the one you understand, and build a plan around it.&lt;/p&gt;
&lt;p&gt;Documentation review: Have the agent examine all docs in your repo and identify three specific problems: (a) where docs describe something not present in the code, (b) where there are gaps (things in code that aren't documented), and (c) where docs are inaccurate relative to what's in the code. Prioritize the major categories, pick one to tackle, and leave the others as GitHub issues.&lt;/p&gt;
&lt;h2 id="advanced-patterns"&gt;Advanced Patterns&lt;/h2&gt;&lt;p&gt;Your repository's issue tracker becomes an organized external memory system. It's stateful, has conversation records, and is plain text in Markdown. Use it liberally.&lt;/p&gt;
&lt;p&gt;When you have plans you don't want to act on immediately, ask the agent to post them as GitHub issues using the &lt;code&gt;gh&lt;/code&gt; CLI. This prevents losing track of ideas and creates a backlog you can return to.&lt;/p&gt;
&lt;p&gt;Use this prompt:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"ok, I would like you to put this up on github as an issue. use the gh cli to do that. check that i'm logged in as ericmjl and not on any other account."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This ensures the issue gets created in the right repository with the right account.&lt;/p&gt;
&lt;p&gt;For existing issues, ask the agent to evaluate whether they're still relevant and give its reasons. Codebases evolve, and you might be able to deprecate some issues. Take the agent's reasons and do a quick dive yourself to decide whether to tackle it or not. If you decide to proceed, launch a new agent and ask it to use the content of that GitHub issue as context.&lt;/p&gt;
&lt;p&gt;Create an &lt;code&gt;AGENTS.md&lt;/code&gt; file to document your architectural preferences and tool patterns. This teaches the coding agent your standards and helps it make better decisions.&lt;/p&gt;
&lt;p&gt;For example, when building Kirin, I started with HTMX+FastAPI but took three iterations to settle on "everything is an API endpoint, but CRUD endpoints must redirect to view endpoints." Another pattern I landed on was building the Python API first, then reusing it behind the web UI's endpoints; I settled on this after discovering a discrepancy between the UI's sluggish performance and the Python API's snappy performance.&lt;/p&gt;
&lt;p&gt;Document your favorite tools and patterns in &lt;code&gt;AGENTS.md&lt;/code&gt;, and use the file to encode your development standards. For example, you can "teach" the agent to use the &lt;code&gt;gh&lt;/code&gt; CLI for GitHub operations by literally writing "use the &lt;code&gt;gh&lt;/code&gt; cli to get issue contents from this repo's issues", and it will almost always do so reliably rather than falling back on janky cURL commands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MCP servers for specialized knowledge:&lt;/strong&gt; Plug in an MCP (Model Context Protocol) server that serves up documentation about core packages or specialized ways of working specific to your organization. This gives the agent access to your internal knowledge base, coding standards, and domain-specific patterns without cluttering the main context window. The agent can then reference this specialized knowledge when making architectural decisions or implementing features.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom shortcuts:&lt;/strong&gt; Slash commands are powerful shortcuts for giving textual context to coding agents. Create them freely, delete them freely, and merge them freely. Experiment to see what works with your habits.&lt;/p&gt;
&lt;p&gt;My favorites include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/remember&lt;/code&gt; - Get the agent to remember important information in &lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/branch-and-stage&lt;/code&gt; - Create a new git branch and stage all changes after completing work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's the actual slash command for &lt;code&gt;/branch-and-stage&lt;/code&gt;:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;%% /branch-and-stage.md %%
Given everything we just did, or given what you see when you run git diff, give me a new git branch and git add to stage all the changes. You do not need to give me a commit message, I have a git commit message writer.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And for &lt;code&gt;/remember&lt;/code&gt;:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;%% /remember %%
Remember what you just learned (or what I am about to say) by writing it into AGENTS.md.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can phrase many of these tips as slash commands. The key is making repetitive tasks into simple text shortcuts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No task is too small:&lt;/strong&gt; Agents work well for mundane tasks that humans find tedious. I have a slash command for markdown linting because I'm that nitpicky, but it proves the point: no task is too mundane for a coding agent, as long as it can access the output as text to verify it did the work correctly.&lt;/p&gt;
&lt;p&gt;This works so well because agents have gotten great at using command line tools, and command line outputs are exactly the kind of interface LLMs need: text. Every git command, every test run, every build process produces text that the agent can read, understand, and act upon.&lt;/p&gt;
&lt;p&gt;Use agents for CI/CD pipeline maintenance. If your CI/CD isn't conditional and runs tests even on PRs that only touch documentation, get the agent to make PR tests run only on changes to relevant files: source, config, and so on, but not docs. Make sure the changes are easily reviewable. This is an important part of building out your test harness.&lt;/p&gt;
&lt;p&gt;For large PRs, ask the agent to give you an overview of the contents. Your tool's "plan" mode is especially useful here for getting a first-pass grasp of what's changed before you dig into the details.&lt;/p&gt;
&lt;p&gt;The meta-workflow that works best is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Plan (write as a .md file, save in plans/ directory)&lt;/li&gt;
&lt;li&gt;Execute (write the code)&lt;/li&gt;
&lt;li&gt;Write tests&lt;/li&gt;
&lt;li&gt;Run tests&lt;/li&gt;
&lt;li&gt;Re-execute as necessary until tests pass&lt;/li&gt;
&lt;li&gt;Audit - check the code against the plan&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You don't need fancy prompts for this. Write out your high-level goals, have the tool write the plan, read the plan back to you, correct its assumptions, then proceed with steps 2-6.&lt;/p&gt;
&lt;p&gt;GIGO (Garbage In, Garbage Out) applies to AI coding just as much as everything else. If you're sloppy and undisciplined, you'll get predictably bad results.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;&lt;p&gt;Effective agent usage isn't about finding the perfect prompt. It's about creating systematic workflows that use the agent's strengths while compensating for its weaknesses. It's about building external memory systems that persist across sessions. It's about teaching the agent your standards so it can make better decisions.&lt;/p&gt;
&lt;p&gt;The key is being willing to fail fast, learn from mistakes, and iterate quickly. The agent amplifies your development process, but only if you're disciplined about how you use it.&lt;/p&gt;
&lt;p&gt;Coding agents are becoming standard tools. The question isn't whether they'll replace developers, it's whether you'll learn to use them effectively. These patterns have changed how I approach development, and they can do the same for you.&lt;/p&gt;
</content></entry><entry><title>How to use multiple GitHub accounts on the same computer</title><link href="https://ericmjl.github.io/blog/2025/10/10/how-to-use-multiple-github-accounts-on-the-same-computer/" rel="alternate"/><updated>2025-10-10T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:ecca655b-1380-3661-949f-b51d525785a6</id><content type="html">&lt;p&gt;I recently ran into a frustrating situation where I couldn't push to a repository even though I had the right permissions. The problem? I was trying to use two different GitHub accounts on the same computer, and Git was getting confused about which account to use.&lt;/p&gt;
&lt;p&gt;If you're in a similar situation - maybe you have a personal account and also contribute to a non-profit or open source project with a separate account - this guide will help you set everything up correctly.&lt;/p&gt;
&lt;h2 id="the-core-problem"&gt;The core problem&lt;/h2&gt;&lt;p&gt;Here's what was happening to me: I had switched my GitHub CLI to my other account using &lt;code&gt;gh auth switch&lt;/code&gt;, but when I tried to push, Git was still authenticating with my personal account's SSH key.&lt;/p&gt;
&lt;p&gt;The issue is that &lt;code&gt;gh auth switch&lt;/code&gt; only changes which account the GitHub CLI uses for API operations. It doesn't affect which SSH key Git uses for push/pull operations. Git and SSH operate independently from the &lt;code&gt;gh&lt;/code&gt; tool.&lt;/p&gt;
&lt;h2 id="what-you-ll-need"&gt;What you'll need&lt;/h2&gt;&lt;p&gt;Two GitHub accounts (I'll call them &lt;code&gt;personal-account&lt;/code&gt; and &lt;code&gt;volunteer-account&lt;/code&gt; in this guide), terminal access, admin permissions on your repositories, and about 10-15 minutes.&lt;/p&gt;
&lt;h2 id="step-1-create-separate-ssh-keys-for-each-account"&gt;Step 1: Create separate SSH keys for each account&lt;/h2&gt;&lt;p&gt;First, we need distinct SSH keys for each account. If you don't already have separate keys, create them:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Create a key for your volunteer account&lt;/span&gt;
ssh-keygen&lt;span class="w"&gt; &lt;/span&gt;-t&lt;span class="w"&gt; &lt;/span&gt;ed25519&lt;span class="w"&gt; &lt;/span&gt;-C&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;volunteer-email@example.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;~/.ssh/id_ed25519_volunteer

&lt;span class="c1"&gt;# Create a key for your personal account (if you don&amp;#39;t have one)&lt;/span&gt;
ssh-keygen&lt;span class="w"&gt; &lt;/span&gt;-t&lt;span class="w"&gt; &lt;/span&gt;ed25519&lt;span class="w"&gt; &lt;/span&gt;-C&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;personal-email@example.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;~/.ssh/id_ed25519_personal
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When prompted for a passphrase, you can either set one or leave it empty (though a passphrase is more secure).&lt;/p&gt;
&lt;h2 id="step-2-add-the-ssh-keys-to-your-ssh-agent"&gt;Step 2: Add the SSH keys to your SSH agent&lt;/h2&gt;&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;ssh-add&lt;span class="w"&gt; &lt;/span&gt;~/.ssh/id_ed25519_volunteer
ssh-add&lt;span class="w"&gt; &lt;/span&gt;~/.ssh/id_ed25519_personal
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can verify both keys are loaded:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;ssh-add&lt;span class="w"&gt; &lt;/span&gt;-l
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="step-3-add-the-public-keys-to-github"&gt;Step 3: Add the public keys to GitHub&lt;/h2&gt;&lt;p&gt;For each account, you need to add its corresponding public key:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Copy your volunteer account&amp;#39;s public key&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;~/.ssh/id_ed25519_volunteer.pub
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Log into GitHub as your volunteer account&lt;/li&gt;
&lt;li&gt;Go to Settings → SSH and GPG keys → New SSH key&lt;/li&gt;
&lt;li&gt;Paste the public key there&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Repeat this process for your personal account with &lt;code&gt;id_ed25519_personal.pub&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="step-4-configure-ssh-to-use-different-keys-for-different-hosts"&gt;Step 4: Configure SSH to use different keys for different "hosts"&lt;/h2&gt;&lt;p&gt;Edit or create &lt;code&gt;~/.ssh/config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Default GitHub (personal account)
Host github.com
  HostName github.com
  User git
  AddKeysToAgent yes
  UseKeychain yes
  IdentityFile ~/.ssh/id_ed25519_personal

# GitHub for volunteer account
Host github.com-volunteer
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_ed25519_volunteer
  IdentitiesOnly yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Host github.com-volunteer&lt;/code&gt; line creates a local alias that only exists in your SSH config. When Git tries to connect to &lt;code&gt;github.com-volunteer&lt;/code&gt;, SSH will actually connect to &lt;code&gt;github.com&lt;/code&gt; but use the specified SSH key.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;IdentitiesOnly yes&lt;/code&gt; line tells SSH to only use the key you specified and not try other keys from your SSH agent.&lt;/p&gt;
&lt;h2 id="step-5-update-your-repository-s-remote-url"&gt;Step 5: Update your repository's remote URL&lt;/h2&gt;&lt;p&gt;For any repository belonging to your volunteer account, you need to update the remote URL to use the SSH alias:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Navigate to your repo&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/path/to/nonprofit-project

&lt;span class="c1"&gt;# Check current remote&lt;/span&gt;
git&lt;span class="w"&gt; &lt;/span&gt;remote&lt;span class="w"&gt; &lt;/span&gt;-v

&lt;span class="c1"&gt;# Update to use the volunteer account&amp;#39;s SSH config&lt;/span&gt;
git&lt;span class="w"&gt; &lt;/span&gt;remote&lt;span class="w"&gt; &lt;/span&gt;set-url&lt;span class="w"&gt; &lt;/span&gt;origin&lt;span class="w"&gt; &lt;/span&gt;git@github.com-volunteer:organization/nonprofit-project.git
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice the change: &lt;code&gt;git@github.com-volunteer:&lt;/code&gt; instead of &lt;code&gt;git@github.com:&lt;/code&gt;. This is necessary because the hostname in the URL is what triggers SSH to look up the configuration in your &lt;code&gt;~/.ssh/config&lt;/code&gt; file. When Git sees &lt;code&gt;github.com-volunteer&lt;/code&gt;, SSH matches it to the &lt;code&gt;Host github.com-volunteer&lt;/code&gt; entry and uses the correct key.&lt;/p&gt;
&lt;h2 id="step-6-test-the-connection"&gt;Step 6: Test the connection&lt;/h2&gt;&lt;p&gt;Before pushing, verify SSH is authenticating correctly:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;ssh&lt;span class="w"&gt; &lt;/span&gt;-T&lt;span class="w"&gt; &lt;/span&gt;git@github.com-volunteer
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Hi volunteer-account! You've successfully authenticated, but GitHub does not provide shell access.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it says your personal account name instead, something's wrong with your SSH config.&lt;/p&gt;
&lt;p&gt;Now try pushing:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;push
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="troubleshooting-common-issues"&gt;Troubleshooting common issues&lt;/h2&gt;&lt;h3 id="issue-1-ssh-still-authenticates-with-the-wrong-account"&gt;Issue 1: SSH still authenticates with the wrong account&lt;/h3&gt;&lt;p&gt;If &lt;code&gt;ssh -T git@github.com-volunteer&lt;/code&gt; shows your personal account name instead of your volunteer account, the problem is usually that SSH is trying multiple keys and GitHub is accepting the first one that works.&lt;/p&gt;
&lt;p&gt;Make sure you have &lt;code&gt;IdentitiesOnly yes&lt;/code&gt; in your &lt;code&gt;~/.ssh/config&lt;/code&gt; for the &lt;code&gt;github.com-volunteer&lt;/code&gt; host. This forces SSH to only use the specified key.&lt;/p&gt;
&lt;h3 id="issue-2-could-not-resolve-hostname-github.com-volunteer""&gt;Issue 2: "Could not resolve hostname github.com-volunteer"&lt;/h3&gt;&lt;p&gt;This usually means Git has a custom SSH command configured that's bypassing your SSH config file. Check:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;--get&lt;span class="w"&gt; &lt;/span&gt;core.sshCommand
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If this returns something with &lt;code&gt;-F /dev/null&lt;/code&gt;, that's your problem. The &lt;code&gt;-F /dev/null&lt;/code&gt; flag tells SSH to ignore all config files.&lt;/p&gt;
&lt;p&gt;Remove it:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;--unset&lt;span class="w"&gt; &lt;/span&gt;core.sshCommand
&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="issue-3-config-changes-don-t-seem-to-apply"&gt;Issue 3: Config changes don't seem to apply&lt;/h3&gt;&lt;p&gt;If you have conditional Git configs (using &lt;code&gt;includeIf&lt;/code&gt; directives), they might be overriding your settings. Check:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;config&lt;span class="w"&gt; &lt;/span&gt;--list&lt;span class="w"&gt; &lt;/span&gt;--show-origin&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;sshCommand
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This shows you exactly which config file is setting the SSH command. You may need to edit that file directly.&lt;/p&gt;
&lt;p&gt;For example, I had a &lt;code&gt;~/.gitconfig-volunteer&lt;/code&gt; file that was automatically loaded for repos in certain directories, and it had a problematic &lt;code&gt;core.sshCommand&lt;/code&gt; setting that needed to be fixed.&lt;/p&gt;
&lt;h3 id="issue-4-repository-not-found-error"&gt;Issue 4: "Repository not found" error&lt;/h3&gt;&lt;p&gt;This means SSH is connecting and authenticating, but as the wrong account. Double-check:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;ssh -T git@github.com-volunteer&lt;/code&gt; and verify it shows the correct account name&lt;/li&gt;
&lt;li&gt;Verify the account has access to the repository on GitHub&lt;/li&gt;
&lt;li&gt;Check that your remote URL uses the correct alias: &lt;code&gt;git@github.com-volunteer:org/repo.git&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="optional-set-up-conditional-git-configs"&gt;Optional: Set up conditional Git configs&lt;/h2&gt;&lt;p&gt;If you keep repositories for your volunteer work in a specific directory (like &lt;code&gt;~/volunteer-projects/&lt;/code&gt;), you can automatically apply settings to all repos in that directory.&lt;/p&gt;
&lt;p&gt;Add this to your &lt;code&gt;~/.gitconfig&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[includeIf "gitdir:~/volunteer-projects/"]
    path = ~/.gitconfig-volunteer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then create &lt;code&gt;~/.gitconfig-volunteer&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[user]
    email = volunteer-email@example.com

[core]
    sshCommand = ssh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This automatically sets your volunteer account's email for commits in that directory. The &lt;code&gt;sshCommand&lt;/code&gt; should be set to plain &lt;code&gt;ssh&lt;/code&gt; so it uses your &lt;code&gt;~/.ssh/config&lt;/code&gt; properly.&lt;/p&gt;
&lt;h2 id="how-this-all-works-together"&gt;How this all works together&lt;/h2&gt;&lt;p&gt;When you run &lt;code&gt;git push&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Git reads the remote URL: &lt;code&gt;git@github.com-volunteer:org/repo.git&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Git asks SSH to connect to &lt;code&gt;github.com-volunteer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;SSH looks in &lt;code&gt;~/.ssh/config&lt;/code&gt; and finds the &lt;code&gt;Host github.com-volunteer&lt;/code&gt; entry&lt;/li&gt;
&lt;li&gt;SSH sees it should actually connect to &lt;code&gt;github.com&lt;/code&gt; but use the &lt;code&gt;id_ed25519_volunteer&lt;/code&gt; key&lt;/li&gt;
&lt;li&gt;SSH connects to GitHub with the correct key&lt;/li&gt;
&lt;li&gt;GitHub authenticates you as your volunteer account&lt;/li&gt;
&lt;li&gt;Push succeeds&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each repository uses the correct account automatically based on its remote URL, so you never have to manually specify which key to use.&lt;/p&gt;
&lt;h2 id="wrapping-up"&gt;Wrapping up&lt;/h2&gt;&lt;p&gt;Managing multiple GitHub accounts on the same computer isn't intuitive, but once you understand that Git uses SSH keys (not &lt;code&gt;gh auth&lt;/code&gt; settings), the solution becomes clear. The SSH config host alias pattern is the standard way to handle this, and it works reliably once everything is configured correctly.&lt;/p&gt;
&lt;p&gt;The key points to remember:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SSH keys are what matter for Git operations, not &lt;code&gt;gh auth&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Host aliases in &lt;code&gt;~/.ssh/config&lt;/code&gt; let you use different keys for different repos&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IdentitiesOnly yes&lt;/code&gt; prevents SSH from trying multiple keys&lt;/li&gt;
&lt;li&gt;Your remote URL must use the alias (e.g., &lt;code&gt;git@github.com-volunteer:&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you run into issues, the troubleshooting section above covers the most common problems I encountered.&lt;/p&gt;
</content></entry><entry><title>How to teach your coding agent with AGENTS.md</title><link href="https://ericmjl.github.io/blog/2025/10/4/how-to-teach-your-coding-agent-with-agentsmd/" rel="alternate"/><updated>2025-10-04T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:af4c409b-5ba1-3670-8181-8f2a5e5d8a58</id><content type="html">&lt;p&gt;Let me start with the most valuable thing I learned this week: if there's anything you want your LLM coding agent to remember for future sessions, just tell it to "Please update AGENTS.md with..." and then specify what you want it to remember.&lt;/p&gt;
&lt;p&gt;That's it. That's the meta-tip that changes everything.&lt;/p&gt;
&lt;h2 id="what-is-agents-md-anyway"&gt;What is AGENTS.md anyway&lt;/h2&gt;&lt;p&gt;AGENTS.md is an emerging open standard that's been adopted by over 20,000 repositories on GitHub. Think of it as a README for your AI coding agents—a predictable location where you provide context, instructions, and preferences that your agent needs to work effectively on your project.&lt;/p&gt;
&lt;p&gt;You might think of this as similar to ChatGPT's memory feature, but there's a crucial difference: AGENTS.md is explicitly curated by you. You decide exactly what the agent remembers and how it applies that knowledge. I prefer this approach because it means I have control over what the agent knows, rather than the agent autonomously deciding what to remember about me and my preferences. It's transparent, version-controlled, and intentional.&lt;/p&gt;
&lt;p&gt;The format emerged from collaborative efforts across OpenAI, Google (Jules), Cursor, Factory, and other major players in the AI development space. It's just standard Markdown, which means it's accessible, portable, and fits naturally into any project structure.&lt;/p&gt;
&lt;p&gt;While your README.md is optimized for humans—covering project introductions, contribution guidelines, and quick starts—AGENTS.md serves as machine-readable instructions for your coding agents. Setup commands, testing workflows, coding style preferences, and project-specific conventions all live here.&lt;/p&gt;
&lt;h2 id="training-an-employee-not-programming-a-bot"&gt;Training an employee, not programming a bot&lt;/h2&gt;&lt;p&gt;I was inspired by &lt;a href="https://youtu.be/budTmdQfXYU?si=mRQVEbSDZOPRf-Xm"&gt;NetworkChuck's approach to building Terry&lt;/a&gt;, his N8n automation agent. The philosophy framing he uses is both brilliant and yet practical: you're not programming a bot, you're training an employee.&lt;/p&gt;
&lt;p&gt;In Terry's case, Chuck teaches the agent by continuously updating its system prompt with new instructions and context. The same principle applies perfectly to AGENTS.md in coding environments.&lt;/p&gt;
&lt;p&gt;Here's what makes this powerful: AGENTS.md gets sent with every LLM API call in Cursor, Claude Code, GitHub Copilot, and other modern coding tools. This means you can standardize on AGENTS.md, and as you progress through your project, you effectively teach the LLM what to do by instructing it to update this file with your preferences and learnings.&lt;/p&gt;
&lt;p&gt;The beauty is that these instructions persist across sessions. Your agent doesn't forget; it gets smarter as your project evolves.&lt;/p&gt;
&lt;h2 id="concrete-tip-1-enforce-markdown-standards-automatically"&gt;Concrete tip 1: Enforce markdown standards automatically&lt;/h2&gt;&lt;p&gt;One of my first uses for AGENTS.md was ensuring consistent markdown formatting. I asked my coding agent to update AGENTS.md with instructions to always run markdownlint on any markdown files it creates or edits.&lt;/p&gt;
&lt;p&gt;Here's what I added to the file:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gu"&gt;## Markdown standards&lt;/span&gt;

&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Always run markdownlint on any markdown files created or edited
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Install using: &lt;span class="sb"&gt;`npx markdownlint-cli`&lt;/span&gt; or &lt;span class="sb"&gt;`pixie global install markdownlint-cli`&lt;/span&gt;
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Fix all linting issues before completing the task
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The effect is immediate. Now, anytime my agent writes or edits a markdown file, it automatically runs markdownlint and fixes issues. I don't have to remember to ask for this. The agent just knows it's part of the workflow.&lt;/p&gt;
&lt;h2 id="concrete-tip-2-specify-your-testing-style"&gt;Concrete tip 2: Specify your testing style&lt;/h2&gt;&lt;p&gt;I prefer writing tests as &lt;code&gt;pytest&lt;/code&gt; style functions rather than unittest-style classes. Most LLMs default to the unittest approach because it's more prevalent in their training data.&lt;/p&gt;
&lt;p&gt;So I instructed my agent to add this to AGENTS.md:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gu"&gt;## Testing preferences&lt;/span&gt;

&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Write all Python tests as &lt;span class="sb"&gt;`pytest`&lt;/span&gt; style functions, not unittest classes
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Use descriptive function names starting with &lt;span class="sb"&gt;`test_`&lt;/span&gt;
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Prefer fixtures over setup/teardown methods
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Use assert statements directly, not self.assertEqual
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now when I ask for tests, I consistently get &lt;code&gt;pytest&lt;/code&gt; style functions. The agent is steered toward my preferred approach without me having to specify it in every request.&lt;/p&gt;
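&lt;p&gt;For contrast, here is a minimal sketch of the two styles side by side, using a trivial hypothetical check:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# The unittest style most LLMs default to (what I steer away from):
import unittest

class TestArithmetic(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

# The pytest style I prefer: a plain function with a bare assert.
def test_addition():
    assert 1 + 1 == 2
&lt;/code&gt;&lt;/pre&gt;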
&lt;h2 id="concrete-tip-3-stop-writing-throwaway-test-scripts"&gt;Concrete tip 3: Stop writing throwaway test scripts&lt;/h2&gt;&lt;p&gt;Here's a pattern I noticed with Cursor: when the agent wants to test something, it loves to write little throwaway scripts. You know the type—&lt;code&gt;test_random_thing.py&lt;/code&gt; or &lt;code&gt;quick_check.py&lt;/code&gt; that do some ad hoc verification and then just sit there cluttering your project.&lt;/p&gt;
&lt;p&gt;The problem is that these scripts look like tests but aren't real ones. They don't run with your test suite. They don't provide ongoing regression protection. They're just... there.&lt;/p&gt;
&lt;p&gt;I taught my agent to write proper tests instead:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gu"&gt;## Testing approach&lt;/span&gt;

&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Never create throwaway test scripts or ad hoc verification files
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;If you need to test functionality, write a proper test in the test suite
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;All tests go in the &lt;span class="sb"&gt;`tests/`&lt;/span&gt; directory following the project structure
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Tests should be runnable with the rest of the suite (&lt;span class="sb"&gt;`pixi run pytest`&lt;/span&gt;)
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Even for quick verification, write it as a real test that provides ongoing value
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now when the agent needs to verify something works, it writes an actual test that becomes part of the project. These tests continue to provide value by catching regressions, documenting expected behavior, and running in CI.&lt;/p&gt;
&lt;p&gt;The shift is subtle but powerful: instead of creating technical debt in the form of random scripts, you're building up a proper test suite.&lt;/p&gt;
&lt;h2 id="concrete-tip-4-teach-your-agent-about-new-tooling"&gt;Concrete tip 4: Teach your agent about new tooling&lt;/h2&gt;&lt;p&gt;I recently adopted Pixi as my main package manager. The problem? Most LLMs aren't familiar with Pixi commands yet. They kept trying to run &lt;code&gt;python&lt;/code&gt; directly when I only have Python available through Pixi.&lt;/p&gt;
&lt;p&gt;The solution was to teach the agent:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gu"&gt;## Package management&lt;/span&gt;

&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;This project uses Pixi for all package management
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Never run commands directly (python, pytest, etc.)
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Always prefix commands with &lt;span class="sb"&gt;`pixi run &amp;lt;command&amp;gt;`&lt;/span&gt;
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Example: &lt;span class="sb"&gt;`pixi run python script.py`&lt;/span&gt; not &lt;span class="sb"&gt;`python script.py`&lt;/span&gt;
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Example: &lt;span class="sb"&gt;`pixi run pytest`&lt;/span&gt; not &lt;span class="sb"&gt;`pytest`&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This works for any new tooling. If you've adopted &lt;code&gt;pixi&lt;/code&gt; or &lt;code&gt;uv&lt;/code&gt; or any other modern Python tools that aren't well-represented in LLM training data, you can explicitly teach your agent how to use them through AGENTS.md.&lt;/p&gt;
&lt;p&gt;The same principle applies to any domain-specific tools or workflows unique to your project. For example, if you're working with Marimo notebooks, which have a relatively strict syntax:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gu"&gt;## Marimo notebook validation&lt;/span&gt;

&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;After creating or editing any Marimo notebook, always run validation
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Command: &lt;span class="sb"&gt;`uvx marimo check &amp;lt;notebook&amp;gt;.py`&lt;/span&gt;
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Fix any syntax errors reported before completing the task
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Marimo notebooks require strict syntax adherence for proper execution
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now your agent will automatically validate Marimo notebooks and get immediate feedback on syntax errors, ensuring notebooks are written correctly the first time.&lt;/p&gt;
&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;&lt;p&gt;The traditional approach to working with coding agents involves repeating yourself constantly. "Remember to use this format." "Don't forget to run this command." "We prefer this style here."&lt;/p&gt;
&lt;p&gt;AGENTS.md flips this model. Instead of being a human repeating instructions to a forgetful assistant, you're building up institutional knowledge that persists. You're training your agent to work the way you work.&lt;/p&gt;
&lt;p&gt;As one developer observed, "it's all about simple human psychology: You get immediate feedback &amp;amp; results: You write it once, and your AI assistant immediately becomes more useful. The feedback loop is much longer with READMEs."&lt;/p&gt;
&lt;p&gt;This is the key insight. When you write a README, you're creating documentation for a future human reader who may or may not show up. When you write AGENTS.md, you get instant gratification—your next conversation with the agent immediately reflects what you just taught it. The AI won't judge you for weird conventions or hacky workarounds. It just learns and applies what you've documented.&lt;/p&gt;
&lt;p&gt;Each time you discover a preference, a gotcha, or a best practice for your project, you can capture it in AGENTS.md. The next time your agent encounters a similar situation, it already knows what to do.&lt;/p&gt;
&lt;p&gt;This is especially powerful in larger projects or monorepos. You can have AGENTS.md files in subdirectories, and agents will use the nearest file to the code being edited—similar to how .gitignore or ESLint configs work. This lets you provide context-specific instructions for different parts of your codebase.&lt;/p&gt;
&lt;h2 id="getting-started"&gt;Getting started&lt;/h2&gt;&lt;p&gt;If you already have agent instruction files like &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;CLAUDE.md&lt;/code&gt;, or &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;, you can simply rename them to AGENTS.md. Most modern coding agents now support this standard.&lt;/p&gt;
&lt;p&gt;Start simple. Create an AGENTS.md in your project root and add one or two critical preferences. Then, as you work with your agent, whenever you find yourself giving the same instruction twice, add it to AGENTS.md instead.&lt;/p&gt;
&lt;p&gt;The key insight is this: every time you teach your agent something, make it permanent by updating AGENTS.md. That's how you build an agent that truly understands your project.&lt;/p&gt;
</content></entry><entry><title>How data scientists can master life sciences and software skills for biotech using ultralearning</title><link href="https://ericmjl.github.io/blog/2025/10/1/how-data-scientists-can-master-life-sciences-and-software-skills-for-biotech-using-ultralearning/" rel="alternate"/><updated>2025-10-01T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:702334d0-8c72-3386-aa45-8bc48b858216</id><content type="html">&lt;p&gt;After 8 years working in biotech and 6 years of graduate training before that, I've observed something about the most effective data scientists in biotech: they aren't just T- or π-shaped -- possessing breadth of skill while being deep in one or two specialties. They're continuously learning new skills to bridge their knowledge gaps.&lt;/p&gt;
&lt;p&gt;There are two common knowledge gaps that I've observed. On one side, there's the vast world of life sciences: molecular biology, cell biology, genetics, immunology, neuroscience, analytical chemistry, organic chemistry, biochemistry. On the other, there's software development, the kind of skills that let you build reliable, maintainable tools that actually work in production.&lt;/p&gt;
&lt;p&gt;The challenge is that these two domains are fundamentally different in how you learn them. And here's the thing: you can't just "take courses" in those domains and call it done. The life sciences alone are too vast. You need a strategy for continuous, rapid learning in both domains over your entire career.&lt;/p&gt;
&lt;p&gt;When I interview data scientists for biotech roles, I assess five key areas: people skills, communication skills, scientific knowledge, coding skills, and modeling skills. The two domains I'm talking about here — life sciences and software development — map directly to scientific knowledge and coding skills. These aren't just nice-to-haves; they're essential for effectiveness in biotech data science.&lt;/p&gt;
&lt;p&gt;That's where "ultralearning" comes in. It's Scott Young's framework for aggressive, self-directed learning, and I know it works because I've lived it. I started as a bench scientist but taught myself computing, software development, and machine learning over the years. Now I want to show you how data scientists in biotech can do the same—whether you're learning domain knowledge or software skills.&lt;/p&gt;
&lt;p&gt;How do you strategically build depth in both life sciences and software over time? I'm going to walk through the 9 principles of ultralearning that Scott Young outlines and show you how they map to learning both domains for biotech data science. I've reordered them in a way that builds momentum, starting with what matters most.&lt;/p&gt;
&lt;h2 id="the-9-principles"&gt;The 9 principles&lt;/h2&gt;&lt;h3 id="principle-3-directness-learn-by-doing-the-real-thing"&gt;Principle 3: Directness - learn by doing the real thing&lt;/h3&gt;&lt;p&gt;Starting with principle 3, here's what directness means: you learn most viscerally in the actual context where you'll apply the skill. And I'm putting this first because it's where most people go wrong.&lt;/p&gt;
&lt;p&gt;Most people read textbooks and take courses. This isn't bad in and of itself, but if you assume that covering the material means you have learned it, you are wrong. Without a context to apply it, the knowledge doesn't stick. You need a real project where you actually use what you're learning.&lt;/p&gt;
&lt;p&gt;If you're already working in biotech, you have a huge advantage: you already have real projects with real stakes. These projects naturally focus your learning because you have a job to be done! This is why I put directness first: it leverages the learning environment you already have.&lt;/p&gt;
&lt;p&gt;For learning life sciences, this means treating your current project as your learning laboratory. You're analyzing RNA-seq data? Learn the biology behind the genes you're seeing, then immediately apply that knowledge to interpret your results and suggest follow-up experiments. You're working with metabolomics data? Learn the metabolic pathways, then use that understanding to identify which metabolites are actually biologically meaningful versus technical artifacts. You're building models for drug discovery? Learn the specific organic chemistry that you're working with, then apply it to explain why your model predicts certain compounds will work and others won't, and use that reasoning to guide your next round of experiments.&lt;/p&gt;
&lt;p&gt;Your current project is your learning laboratory. Treat the scientific knowledge gaps you encounter as targets for deep learning.&lt;/p&gt;
&lt;p&gt;And here's something I've noticed: when you write internal documentation or reports, that's actually retrieval practice for the science you're learning. More on retrieval later, but the point is your work gives you built-in learning opportunities if you use them intentionally.&lt;/p&gt;
&lt;p&gt;The same applies to learning software. Your pipeline is getting slow? Learn performance optimization and profiling, then immediately apply those techniques to identify bottlenecks and speed up your actual pipeline. Your code is getting hard to maintain? Learn design patterns, then refactor your existing codebase using those patterns to make it more modular and testable. You need to deploy something? Learn containerization and orchestration, then use those skills to get your tool running in production and accessible to your team.&lt;/p&gt;
&lt;p&gt;Your work projects provide the constraints and requirements that make software concepts meaningful. When you document your code or write design docs, you're forced to articulate the architectural decisions you're making—and that's when you really learn them.&lt;/p&gt;
&lt;h3 id="principle-4-drill-isolate-and-attack-weak-points"&gt;Principle 4: Drill - isolate and attack weak points&lt;/h3&gt;&lt;p&gt;Drilling is about identifying rate-limiting steps in your skills and practicing them specifically. Once you're doing direct practice through your projects, you'll notice where you keep getting stuck.&lt;/p&gt;
&lt;p&gt;Here, there is a meta-skill that I think is critical: self-awareness. You need to develop the ability to notice when you're missing context, without someone explicitly telling you. When you're reading a paper and realize you're lost, or when you're in a meeting and can't follow the reasoning, that's your signal. Learning to recognize these moments yourself is what makes drilling effective.&lt;/p&gt;
&lt;p&gt;For learning life sciences, drilling means identifying the specific skills that are blocking your work and practicing them repeatedly. Do you keep having to ask biologists what ChIP-seq tells you? Drill by practicing biological interpretation: take 20 different ChIP-seq results and explain what each peak means biologically—what transcription factor is binding, what genes are being regulated, and what biological process is affected. Are you on a project with immunologists but can't follow their reasoning? Drill by practicing experimental logic: read 15 immunology papers and predict what the next experiment should be based on the current results, then check your reasoning against what the authors actually did, or consult an immunologist colleague.&lt;/p&gt;
&lt;p&gt;Or perhaps consider the chemistry side of things. Are you analyzing mass spec data but you're shaky on ionization and fragmentation? Drill by working through 50 fragmentation problems: given a compound structure, predict the top 5 fragments you'd expect to see, then check your answers. Is your lack of organic chemistry preventing you from understanding drug modifications? Drill by practicing modification effects: take 30 different drug structures, make specific modifications (add methyl group, change functional group), and predict how each change would affect binding affinity, solubility, and metabolic stability.&lt;/p&gt;
&lt;p&gt;The key is identifying the real bottleneck in your knowledge and then designing a drill around that. If you have no good priors on how to do this, you should actually be asking your colleagues who have domain knowledge. They can help you pinpoint exactly what you're missing and suggest focused practice exercises.&lt;/p&gt;
&lt;p&gt;For learning software, the same principle applies. Do your pipelines keep breaking in production? Drill by practicing debugging: take your broken pipeline, run it locally with the same data that failed in production, and systematically test each step until you find the exact failure point. Are you blocked on deployment because of containerization issues? Drill this skill while debugging! Start with a minimal Dockerfile that just runs your script, then gradually add dependencies one by one until it works. Strip it back to the bare minimum and rebuild it piece by piece, and your understanding will grow alongside it.&lt;/p&gt;
&lt;p&gt;Code review keeps catching the same issues in your code? Drill on those patterns by refactoring your existing code to avoid them. Take one function at a time and rewrite it to follow the patterns your reviewers want. Do you avoid writing tests because you don't really understand testing frameworks? Start by adding a single test to your existing codebase, then gradually add more tests to the functions you use most.&lt;/p&gt;
&lt;p&gt;Identify the one software skill that's limiting your effectiveness right now, and drill it until it's not a bottleneck anymore. Maybe it's Git workflows that are slowing your team down, or packaging that's preventing tool distribution.&lt;/p&gt;
&lt;h3 id="principle-1-metalearning-map-the-territory-first"&gt;Principle 1: Metalearning - map the territory first&lt;/h3&gt;&lt;p&gt;Metalearning means researching how to learn something before diving in. I know it seems backwards to list this third when it's literally the first principle in Scott Young's framework, but here's why: directness and drilling are more immediately actionable. Once you're doing those, metalearning helps you be more strategic about what you're doing.&lt;/p&gt;
&lt;p&gt;For learning life sciences, this means understanding what you actually need to know for your work before diving into intensive learning. Is your team starting a new project in spatial transcriptomics? Before diving in, map out what you need: tissue biology, imaging concepts, the technology itself, analysis methods. Are you joining a drug discovery project? Identify the hierarchy of knowledge; do you need medicinal chemistry basics first, or can you start with binding assays and learn backward?&lt;/p&gt;
&lt;p&gt;Talk to the biologists or chemists you work with. Ask them what foundational concepts matter most for understanding their work. Look at the papers your team references most—what scientific knowledge do they assume? That's your map.&lt;/p&gt;
&lt;p&gt;Find the best resources for your specific need. Maybe it's that one review paper, or a specific textbook chapter, or that scientist down the hall who explains things well. Don't waste time learning areas that aren't relevant to your current work. Map what matters now.&lt;/p&gt;
&lt;p&gt;Additionally, be prepared to re-map! After working on a project for a while, you might find your initial map was wrong. That's totally OK! If you're making errors because you misunderstood what you needed to learn, that's a signal to step back and reassess. An incorrect map is worse than no map.&lt;/p&gt;
&lt;p&gt;For learning software, the same applies. Before diving into a new software skill, understand what good looks like in your specific context—biotech and scientific computing. Your team wants to adopt a new workflow system? Before learning it, map out what you need: workflow concepts, the specific tool's paradigms, container knowledge. Or determine whether you can skip the theory and start by adapting working examples.&lt;/p&gt;
&lt;p&gt;Look at mature tools in your space — &lt;code&gt;scikit-bio&lt;/code&gt;, &lt;code&gt;scanpy&lt;/code&gt;, or established pipelines like those from the ENCODE project. What patterns do they use? Are they functional-first or objects-first? What patterns show up in how they design their APIs? That's your north star. Talk to experienced engineers if you have access and ask what software skills actually matter for scientific tools.&lt;/p&gt;
&lt;p&gt;The key is understanding the learning path dependencies: do you need to understand Python packaging before you can learn about CI/CD? Or can you learn them together? Map out the shortest path to being effective, not the most comprehensive path to expertise. Focus on what will unblock your current work, not what would make you an expert in everything.&lt;/p&gt;
&lt;h3 id="principle-6-feedback-get-signal-on-your-progress"&gt;Principle 6: Feedback - get signal on your progress&lt;/h3&gt;&lt;p&gt;Feedback is getting useful information about what you're doing wrong and how to fix it. And in biotech, you have built-in feedback mechanisms if you use them intentionally.&lt;/p&gt;
&lt;p&gt;For learning life sciences, leverage the scientists you work with as your feedback mechanism. When you present results in team meetings and biologists or chemists correct your interpretation, that's high-value feedback. Pay attention!&lt;/p&gt;
&lt;p&gt;Join journal clubs if your company has them. When you misunderstand a paper, someone will point it out. If your company doesn't have journal clubs, look in your local community—Boston has several industry-focused options including &lt;a href="https://biotechtuesday.com/category/geographic-region-event/boston-ma/"&gt;BiotechTuesday&lt;/a&gt;, &lt;a href="https://massbio.microsoftcrmportals.com/event/?event=The_Science_of_Biotech_November"&gt;MassBio events&lt;/a&gt;, and &lt;a href="https://cambridgebiotechclub.org/"&gt;Cambridge Biotech Club&lt;/a&gt; networking events that often include research discussions.&lt;/p&gt;
&lt;p&gt;When you write up results or make slides, ask a scientist to review. Where they add clarifications shows your knowledge gaps. If your predictions about experimental outcomes are wrong, that's feedback about your biological understanding. I remember being challenged by a colleague while on a call with external collaborators; being wrong in a "public" setting like that was some of the best feedback I've ever received! When you explain your interpretation to a biologist and they look confused, you've either misunderstood the science or can't articulate it yet.&lt;/p&gt;
&lt;p&gt;Your work products -- analyses, reports, presentations -- are opportunities to get feedback on your scientific understanding. Don't just present results. Explain your biological reasoning and see where it's challenged.&lt;/p&gt;
&lt;p&gt;For learning software, code review is your primary feedback mechanism. Take it seriously. The comments show you what you don't yet understand.&lt;/p&gt;
&lt;p&gt;Does your code actually work at scale with real-sized data? That's feedback on your software design. When someone else tries to use your tool and files issues, those edge cases reveal gaps in your software thinking. Pair programming with more experienced engineers shows you patterns you're not seeing.&lt;/p&gt;
&lt;p&gt;When onboarding a new team member to your code takes too long, that's feedback that your architecture or documentation needs work. Production failures are harsh but clear feedback: what software concepts do you need to learn to prevent them?&lt;/p&gt;
&lt;p&gt;Ask for architectural review before building something big. Feedback up front prevents expensive mistakes.&lt;/p&gt;
&lt;h3 id="principle-5-retrieval-test-yourself-actively"&gt;Principle 5: Retrieval - test yourself actively&lt;/h3&gt;&lt;p&gt;Retrieval is about actively recalling information, which strengthens learning more than passive review. And I want to emphasize something specific about this for life sciences learning.&lt;/p&gt;
&lt;p&gt;The vocabulary in life sciences is vast, and the meanings of everyday words often change in scientific contexts. Think about "competent" cells, "naive" T-cells, "promiscuous" enzymes, "housekeeping" genes. Good memory for vocabulary isn't just about rote memorization; it gives you the ability to name and label entities clearly, which is foundational even when your primary goal is understanding concepts.&lt;/p&gt;
&lt;p&gt;For learning life sciences, retrieval practice happens naturally in your work if you let it. When you're preparing a presentation for your team, try to explain the biological mechanism from memory first, then check your understanding. Before looking up that pathway or reaction mechanism again, try to draw it from memory. Where you get stuck shows what you haven't really learned.&lt;/p&gt;
&lt;p&gt;In meetings when discussing results, attempt to explain the biology without your notes. This reveals what you actually know versus what you've just read. When writing internal documentation about your project, explain the scientific concepts from memory, then verify. If you're presenting at journal club, practice explaining the paper's biology without constantly referring to slides.&lt;/p&gt;
&lt;p&gt;The act of trying to recall forces your brain to strengthen those neural pathways. Passive rereading doesn't do this. Your work already gives you retrieval opportunities—presentations, documentation, discussions with scientists. Use them.&lt;/p&gt;
&lt;p&gt;For learning software, the same principle applies. When you're about to look up how to do something in code, try to write it from memory first, then look it up if needed. Before copying a design pattern from Stack Overflow, try to implement it based on what you remember, then refine.&lt;/p&gt;
&lt;p&gt;In code review or design discussions, explain your architectural decisions from memory. If you can't, you don't really understand them yet. When documenting your code, write the explanation without constantly referencing the implementation. Try to debug issues by reasoning through the system before looking at logs—this builds your mental model.&lt;/p&gt;
&lt;p&gt;And here's something important: making it easy by constantly looking things up or always relying on AI to spoonfeed you answers is a surefire way to keep knowledge shallow. The struggle of trying to remember is what creates learning.&lt;/p&gt;
&lt;p&gt;This is especially true with AI-generated documentation. You can use AI to generate documentation, but the retrieval practice happens during review. When AI writes "this function calculates the binding affinity," question it: "Does it really? What's the actual algorithm? What are the inputs and outputs?" Challenge each line the AI wrote by trying to explain it from your own understanding. If you can't explain why a particular line is there or what it does, that's your signal to dig deeper into that concept.&lt;/p&gt;
&lt;h3 id="principle-2-focus-cultivate-deep-concentration"&gt;Principle 2: Focus - cultivate deep concentration&lt;/h3&gt;&lt;p&gt;Focus is about managing procrastination, distraction, and maintaining sustained attention. And this matters more than you might think for both domains.&lt;/p&gt;
&lt;p&gt;For learning life sciences, reading that dense Nature paper about a pathway relevant to your project requires deep, uninterrupted focus. You can't do it between Slack messages and meetings -- block time, and shut off communication channels. Understanding complex scientific concepts such as metabolic regulation, signaling cascades, and reaction mechanisms requires holding multiple pieces in your head simultaneously. Context switching destroys this.&lt;/p&gt;
&lt;p&gt;Block time on your calendar specifically for deep scientific learning. Treat it like a critical meeting. The 15 minutes before standup isn't enough to understand that review paper you need to read. When you're learning a new biological domain for a project, protect longer blocks of focused time for it.&lt;/p&gt;
&lt;p&gt;Your brain needs sustained attention to build the mental models that make scientific knowledge useful, not just memorized.&lt;/p&gt;
&lt;p&gt;For learning software, the same applies. Understanding a complex codebase or debugging a tricky issue requires uninterrupted deep work. You can't do it effectively in fragments. Reading source code of mature projects—like &lt;code&gt;scikit-bio&lt;/code&gt;, &lt;code&gt;scanpy&lt;/code&gt;, or established pipelines—requires sustained attention to follow design decisions.&lt;/p&gt;
&lt;p&gt;Designing a new system architecture requires holding the entire design in your head. That's impossible with constant interruptions. Block time for focused software learning and development, not just cramming it into gaps. The cognitive load of building mental models for software systems is high. Protect that learning time.&lt;/p&gt;
&lt;p&gt;If you're learning a new framework or pattern for work, give it dedicated focus time, not scattered moments. It'll pay dividends many times over across an entire technical career.&lt;/p&gt;
&lt;h3 id="principle-9-experimentation-explore-beyond-the-beaten-path"&gt;Principle 9: Experimentation - explore beyond the beaten path&lt;/h3&gt;&lt;p&gt;Experimentation is about trying new approaches, methods, and perspectives as you gain proficiency. This becomes more important as you build your foundation in both domains.&lt;/p&gt;
&lt;p&gt;For learning life sciences, as your foundational knowledge grows, start exploring adjacent domains that come up in your work. You're strong in genomics now? When an immunology opportunity comes up in a cross-functional meeting, that's your trigger to explore that domain.&lt;/p&gt;
&lt;p&gt;Try different ways of learning. Sometimes a textbook works, sometimes talking to the scientist at the next desk works better—especially if they're Socratically coaching you. Sometimes it's working through a dataset. This reflects a key insight from ultralearning: there's no one-size-fits-all learning format. What works for learning genomics might not work for learning immunology, and what works for you might not work for your colleague. Experiment to find your optimal learning approach for each domain.&lt;/p&gt;
&lt;p&gt;Use insights from one domain to inform another. Cell signaling patterns you learned in neuroscience might help you understand what you're seeing in immunology data, since all cells have signaling pathways. As you work with both biologists and chemists, start connecting how chemical principles inform biological mechanisms. This cross-domain connection is a well-established way to improve retention and deepen understanding.&lt;/p&gt;
&lt;p&gt;Additionally, try experimenting with how you organize and retain scientific knowledge. What works for you personally? This exploration leads to developing your unique perspective, especially on how you synthesize biological and chemical knowledge differently than others.&lt;/p&gt;
&lt;p&gt;For learning software skills, as you gain proficiency, experiment with different approaches to the same problem at work. For example, have you been using conda to manage your Python environments? When you have time, try managing your environment with pixi instead to understand the tradeoffs.&lt;/p&gt;
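&lt;p&gt;A minimal sketch of what that experiment might look like (the project and package names are hypothetical):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# recreate a conda-managed analysis project under pixi
pixi init my-analysis &amp;&amp; cd my-analysis
pixi add "python=3.12" pandas scikit-learn  # the same deps from your environment.yml
pixi run python analysis.py                 # runs inside the pixi-managed environment
&lt;/pre&gt;&lt;/div&gt;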
&lt;p&gt;Try different testing strategies on real work projects to see what catches bugs most effectively for your use case. Experiment with different architectural patterns when building new tools—learn through direct comparison.&lt;/p&gt;
&lt;p&gt;As you grow, you'll develop engineering judgment: knowing when to use which approach, which rules to follow, which to break. This experimentation leads to finding your own effective patterns, not just copying what others do.&lt;/p&gt;
&lt;h3 id="principle-8-intuition-develop-deep-understanding"&gt;Principle 8: Intuition - develop deep understanding&lt;/h3&gt;&lt;p&gt;Intuition is about building mental models of how things actually work, not just memorizing. And this is where the real payoff comes in both domains.&lt;/p&gt;
&lt;p&gt;For learning life sciences, the drilling and retrieval practice we discussed earlier builds the mental models that become intuition. When you drill on fragmentation patterns and then retrieve that knowledge while analyzing mass spec data, you're doing more than mere memorization: you're building an understanding of how molecules break apart. When you practice explaining biological mechanisms from memory and get feedback, you develop the mechanistic reasoning that becomes intuition.&lt;/p&gt;
&lt;p&gt;This intuition lets you reason about new situations you haven't seen before. You can predict whether an experimental approach will work or identify when results don't make biological sense. The goal isn't encyclopedic knowledge, but rather to develop the ability to reason about biological and chemical systems!&lt;/p&gt;
&lt;p&gt;For learning software, the same principle applies. When you drill on design patterns by implementing them from memory, and follow it up by getting feedback through code review, you build understanding of what problems they solve and their tradeoffs. This develops the engineering judgment to know when to use which approach, which rules to follow. And once you know the rules, you know which ones can be broken :-).&lt;/p&gt;
&lt;p&gt;At the end of the day, intuition develops through the active practice of drilling and retrieval, and not through passive consumption of information. Keep that in mind!&lt;/p&gt;
&lt;h3 id="principle-7-retention-don-t-let-knowledge-leak-away"&gt;Principle 7: Retention - don't let knowledge leak away&lt;/h3&gt;&lt;p&gt;Retention is about understanding why we forget and using strategies to remember long-term. And this matters because you're constantly encountering new concepts in both domains.&lt;/p&gt;
&lt;p&gt;For learning life sciences, you're constantly encountering new biological and chemical concepts at work. Without retention strategies, you'll keep relearning the same things. When you write internal documentation or reports, you're creating reference material for future you and your team—but only if you structure it for easy retrieval and review. Create a personal knowledge base with clear tags and cross-references so you can quickly find concepts when they come up again. I use Obsidian for my personal work knowledge base; it is centered around projects, but I also curate and link facts in there.&lt;/p&gt;
&lt;p&gt;The knowledge you use regularly will naturally stick through repeated exposure. But concepts from past projects will fade without reinforcement. Identify which scientific knowledge you need for the long-term versus what you need just for this project. For long-term retention, create Anki decks for key terminology and mechanisms that keep appearing across different projects. With Obsidian, drilling with spaced repetition is possible via the &lt;a href="https://www.stephenmwangi.com/obsidian-spaced-repetition/"&gt;Spaced Repetition&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;Connect new concepts to existing knowledge from your work—this creates stronger memory traces. When you learn a new pathway, relate it to ones you already know. When you encounter a new protein, link it to similar ones you've worked with. Revisit foundational concepts periodically as they show up in different projects, and update your notes and records each time you work on a project.&lt;/p&gt;
&lt;p&gt;The scientific knowledge you use regularly in your work will stick. Everything else needs deliberate retention strategies.&lt;/p&gt;
&lt;p&gt;For learning software, patterns you don't use regularly will fade. Be deliberate about which ones you need to retain. Writing documentation about the systems you build serves as external memory you can reference later, but make it searchable and well-organized to facilitate retrieval later! Create code snippets and examples for patterns you want to remember, and archive them in your work knowledge vault, or share them with colleagues on shared documentation platforms like Confluence or Notion.&lt;/p&gt;
&lt;p&gt;The tools and patterns you use daily will stick naturally through repeated exposure. But specialized knowledge from past projects will fade without reinforcement. If you're not writing tests regularly in your work, you'll forget testing patterns. Find ways to practice what you need to retain: contribute to open source projects, build side projects, or create practice exercises.&lt;/p&gt;
&lt;p&gt;Connect new software concepts to existing knowledge from your work. When you learn a new framework, relate it to ones you already know. When you encounter a new design pattern, link it to similar patterns you've used. Contributing to the same codebase over time builds deep, lasting knowledge of its architecture through repeated exposure. When you learn something new for a project, consider whether it's one-time knowledge or something you'll need repeatedly. Prioritize retention for the latter.&lt;/p&gt;
&lt;h2 id="bringing-it-together"&gt;Bringing it together&lt;/h2&gt;&lt;p&gt;These principles reinforce each other when learning both life sciences and software. But here's the key insight I want to emphasize: if you feel pressure to ultralearn both domains simultaneously, the answer is an emphatic "no".&lt;/p&gt;
&lt;p&gt;Instead, you cycle between domains based on what's blocking you. When a scientific knowledge gap prevents you from making progress, whether it's immunology, organic chemistry, or protein biochemistry, you shift into intensive life sciences learning mode using these principles. When software limitations hold you back, you focus there.&lt;/p&gt;
&lt;p&gt;Over years, this creates deep expertise in both domains. Not through divided attention, but through strategic, focused learning periods in each.&lt;/p&gt;
&lt;p&gt;Here's the cycle I've seen work:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Work on real problems using directness&lt;/li&gt;
&lt;li&gt;Identify gaps in whichever domain is limiting you through drilling&lt;/li&gt;
&lt;li&gt;Focus intensively on that domain&lt;/li&gt;
&lt;li&gt;Get feedback from domain experts&lt;/li&gt;
&lt;li&gt;Test yourself through writing and building using retrieval&lt;/li&gt;
&lt;li&gt;Develop intuition, and retain it through continued practice&lt;/li&gt;
&lt;li&gt;Identify the next limiting domain and cycle back&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At a higher level, there's an interplay between the domains that makes this work. Scientific understanding informs what software to build and what analyses matter. Software skills enable you to answer scientific questions and build tools others can use. They feed each other.&lt;/p&gt;
&lt;p&gt;This approach beats traditional "take courses in both fields" for biotech data scientists because both domains are too vast to learn all at once. Ultralearning gives you a framework for continuous, targeted learning throughout your career. Remember, your goal is not to become a PhD scientist or a senior software engineer, but to build deep enough understanding in both to be effective at the intersection.&lt;/p&gt;
&lt;h2 id="conclusion-and-next-steps"&gt;Conclusion and next steps&lt;/h2&gt;&lt;p&gt;You don't need to apply all 9 principles to both domains at once. In fact, you shouldn't.&lt;/p&gt;
&lt;p&gt;Start with directness in whichever domain is currently limiting your effectiveness. If you can't interpret your results because you don't understand the biology or chemistry, focus there intensively for the next few months. If you can't scale your analyses or build reliable tools because of software gaps, focus there intensively.&lt;/p&gt;
&lt;p&gt;Add feedback loops from experts in that domain. Build from there using the other principles.&lt;/p&gt;
&lt;p&gt;Then, when you've made real progress, identify which domain is now the bottleneck and shift your intensive learning there.&lt;/p&gt;
&lt;p&gt;This is a career-long journey of alternating deep dives, not a sprint to learn everything at once. The most effective biotech data scientists I know are continuously learning in both domains, but wisely: one intensive focus at a time.&lt;/p&gt;
&lt;p&gt;Here's your actionable takeaway: Right now, which domain is most limiting your effectiveness? Pick one concrete gap in that domain. Spend the next month using ultralearning principles to address that specific gap. Only that one. Master it, then reassess.&lt;/p&gt;
&lt;p&gt;You can do it!&lt;/p&gt;
</content></entry><entry><title>The Data Science Bootstrap Notes: A major upgrade for 2025</title><link href="https://ericmjl.github.io/blog/2025/9/2/the-data-science-bootstrap-notes-a-major-upgrade-for-2025/" rel="alternate"/><updated>2025-09-02T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:808862ff-19aa-390b-bbf2-8e9b08358094</id><content type="html">&lt;p&gt;After 8 years since the first edition, I've completely overhauled &lt;em&gt;The Data Science Bootstrap Notes&lt;/em&gt; to reflect the dramatic changes in the Python data science ecosystem. What started as a collection of Obsidian notes has evolved into a comprehensive, modern guide that addresses the tools and practices that actually matter in 2025.&lt;/p&gt;
&lt;h2 id="from-obsidian-notes-to-a-proper-book"&gt;From Obsidian notes to a proper book&lt;/h2&gt;&lt;p&gt;The most visible change is the format itself. The original version existed as a navigable knowledge base in Obsidian, mimicking &lt;a href="https://notes.andymatuschak.org/About_these_notes"&gt;Andy Matuschak's online notes&lt;/a&gt;. While that format was intellectually interesting to make, I found that after years of experimentation, simple was indeed better than cool. The new version uses MkDocs to create a clean, linear book format that's easier to navigate and more accessible to readers.&lt;/p&gt;
&lt;p&gt;But the real transformation goes far deeper than just the presentation layer.&lt;/p&gt;
&lt;h2 id="the-tooling-revolution-conda-pip-pixi-uv"&gt;The tooling revolution: conda + pip → pixi + uv&lt;/h2&gt;&lt;p&gt;The biggest shift in my recommendations centers around environment management. In 2017, conda was the obvious choice for Python data science environments. Today, that's no longer the case.&lt;/p&gt;
&lt;h3 id="enter-pixi-the-environment-management-multi-tool"&gt;Enter pixi: the environment management multi-tool&lt;/h3&gt;&lt;p&gt;I've completely replaced conda with &lt;code&gt;pixi&lt;/code&gt;, a modern environment manager written in Rust that solves many of the fundamental problems that plagued the conda ecosystem. The key advantages are:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic Lock Files&lt;/strong&gt;: Pixi automatically generates and maintains lock files (&lt;code&gt;pixi.lock&lt;/code&gt;) every time you modify your environment. This solves the critical "it works on my machine" problem that conda users faced when environments would drift over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Feature-Based Environments&lt;/strong&gt;: Instead of creating separate environments for each purpose, pixi lets you define reusable "features" that can be combined into different environments. You can have &lt;code&gt;tests&lt;/code&gt;, &lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;notebook&lt;/code&gt;, and &lt;code&gt;cuda&lt;/code&gt; features that combine into purpose-built environments like &lt;code&gt;default&lt;/code&gt;, &lt;code&gt;docs&lt;/code&gt;, or &lt;code&gt;cuda&lt;/code&gt;.&lt;/p&gt;
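&lt;p&gt;For a flavor of how this looks in practice, here's a minimal sketch using pixi's CLI (the feature and package names are illustrative):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# add dependencies to named features rather than one monolithic environment
pixi add --feature tests pytest pytest-cov
pixi add --feature docs mkdocs
# features are then composed into named environments in your project manifest,
# and you pick the environment at run time:
pixi run --environment docs mkdocs serve
&lt;/pre&gt;&lt;/div&gt;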
&lt;p&gt;&lt;strong&gt;Task Automation&lt;/strong&gt;: Pixi enables you to replace Makefiles with tasks defined in &lt;code&gt;pyproject.toml&lt;/code&gt;. Commands like &lt;code&gt;pixi run test&lt;/code&gt; or &lt;code&gt;pixi run docs&lt;/code&gt; standardize common operations across your team.&lt;/p&gt;
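&lt;p&gt;The task definitions themselves are one-liners to set up (the task names and commands here are just examples):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# register tasks once; they get stored in your project manifest
pixi task add test "pytest -v"
pixi task add docs "mkdocs serve"
# everyone on the team then runs them the same way
pixi run test
&lt;/pre&gt;&lt;/div&gt;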
&lt;h3 id="uv-the-python-tool-manager"&gt;uv: the Python tool manager&lt;/h3&gt;&lt;p&gt;Complementing pixi is &lt;code&gt;uv&lt;/code&gt;, an extremely fast Python package installer and resolver written in Rust. UV handles global tool installation by automatically creating isolated environments for each tool, giving you the convenience of global tools without the mess of a global Python installation.&lt;/p&gt;
&lt;p&gt;This means you can run tools like &lt;code&gt;llamabot&lt;/code&gt; or my own &lt;code&gt;pyds-cli&lt;/code&gt; without worrying about dependency conflicts. The &lt;code&gt;uvx&lt;/code&gt; command even lets you run tools without installing them first.&lt;/p&gt;
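&lt;p&gt;A quick sketch of both patterns (&lt;code&gt;llamabot&lt;/code&gt; comes from this post; &lt;code&gt;pycowsay&lt;/code&gt; is just a stock demo tool, and I'm assuming the packages expose these commands):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# install a tool globally; uv gives it its own isolated environment
uv tool install llamabot
# or run a tool ephemerally, without installing it at all
uvx pycowsay "hello from an ephemeral environment"
&lt;/pre&gt;&lt;/div&gt;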
&lt;h2 id="modern-project-scaffolding"&gt;Modern project scaffolding&lt;/h2&gt;&lt;p&gt;The new edition introduces &lt;code&gt;pyds-cli&lt;/code&gt;, my opinionated tooling for data scientists that scaffolds new projects using cookiecutter and pixi. Instead of manually setting up project structures, you can now run:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pyds&lt;span class="w"&gt; &lt;/span&gt;project&lt;span class="w"&gt; &lt;/span&gt;init
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This creates a complete project structure with proper environment management, testing setup, documentation configuration, and CI/CD pipelines already configured.&lt;/p&gt;
&lt;h2 id="ai-integration-beyond-the-hype"&gt;AI integration beyond the hype&lt;/h2&gt;&lt;p&gt;Generative AI has fundamentally changed how I think about data science workflows. The new edition includes a comprehensive chapter on working with AI tools that goes beyond simple code generation to address:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The speed of thought&lt;/strong&gt;: AI tools help bridge the gap between how fast we can think and how fast we can type. There's fascinating research showing humans process information at $10^9$ bits/second but think at only 10 bits/second - AI helps bridge this massive gap.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The right kind of lazy&lt;/strong&gt;: I distinguish between being "Bill Gates lazy" (finding efficient ways to work) and being intellectually lazy (blindly trusting AI outputs). You must maintain intellectual responsibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective patterns&lt;/strong&gt;: I share specific strategies for structuring AI interactions, from starting with the big picture to rapid iteration and verification. This includes the "fat finger sketch" approach where you outline what you want before asking AI to fill in details.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Beyond code&lt;/strong&gt;: AI tools are particularly valuable for documentation acceleration, code review assistance, and learning new libraries or techniques.&lt;/p&gt;
&lt;p&gt;The key insight is that AI should amplify our capabilities, not replace our judgment. We need to develop a mindset that embraces these tools while maintaining intellectual rigor.&lt;/p&gt;
&lt;h2 id="ci/cd-and-automation"&gt;CI/CD and automation&lt;/h2&gt;&lt;p&gt;The new edition heavily emphasizes GitHub Actions for continuous integration and deployment. Instead of manual processes, you now have trigger-able bots that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run tests automatically on every commit&lt;/li&gt;
&lt;li&gt;Build and deploy documentation&lt;/li&gt;
&lt;li&gt;Validate code quality with pre-commit hooks&lt;/li&gt;
&lt;li&gt;Deploy applications to various environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This automation eliminates the drudgery that often accompanies data science projects and ensures consistency across team members. I've even applied this philosophy to the book itself; the entire publishing process is automated through GitHub Actions that build and deploy the website, while simultaneously updating the Leanpub version with every commit.&lt;/p&gt;
&lt;h2 id="philosophical-foundations"&gt;Philosophical foundations&lt;/h2&gt;&lt;p&gt;While the tools have changed dramatically, the core philosophies remain the same but are now more clearly articulated:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Know Your Compute Stack&lt;/strong&gt;: Deep understanding of your tools enables informed choices about what to automate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single Source of Truth&lt;/strong&gt;: Establish clear, unambiguous sources for data, code, and configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automate Relentlessly&lt;/strong&gt;: Invest in automation to eliminate repetitive tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Categorize Everything&lt;/strong&gt;: Organize projects using logical categories that make maintenance easier&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These principles now have concrete implementations through modern tooling, making them more actionable than ever.&lt;/p&gt;
&lt;h2 id="what-s-been-removed"&gt;What's been removed&lt;/h2&gt;&lt;p&gt;Not everything made the cut. I've removed outdated advice about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual conda environment management (replaced with pixi automation)&lt;/li&gt;
&lt;li&gt;Complex conda-specific workflows (simplified with pixi features)&lt;/li&gt;
&lt;li&gt;Manual lock file generation (now automatic with pixi)&lt;/li&gt;
&lt;li&gt;Manual project scaffolding (now automated with pyds-cli)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-path-forward"&gt;The path forward&lt;/h2&gt;&lt;p&gt;The new edition is designed to get you started quickly while building foundations that scale. It's not just a reference guide; it's a roadmap for establishing practices that grow with your ambitions. The tools and practices I recommend today are the ones I actually use in production, not just theoretical best practices.&lt;/p&gt;
&lt;p&gt;What excites me most about this upgrade is how it addresses the real pain points that data scientists face in 2025. Instead of wrestling with environment conflicts, you're now thinking about how to compose features into purpose-built environments. Instead of manually setting up projects, you're focusing on the actual analysis. Instead of fighting with dependency resolution, you're building reproducible workflows that work the same way for everyone on your team.&lt;/p&gt;
&lt;p&gt;The data science ecosystem has matured significantly since 2017, and this new edition reflects that maturity. It's about getting started the right way; establishing foundations that won't crumble as your projects grow in complexity and team size.&lt;/p&gt;
&lt;p&gt;You can read the book online at &lt;a href="https://ericmjl.github.io/data-science-bootstrap-notes/"&gt;the GitHub Pages site&lt;/a&gt;, and if you prefer a linear reading experience, there's also an &lt;a href="https://leanpub.com/dsbootstrap/"&gt;eBook version on LeanPub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The future of data science is automated, reproducible, and collaborative. This new edition shows you how to get there.&lt;/p&gt;
</content></entry><entry><title>How to use AI to accelerate your career in 2025</title><link href="https://ericmjl.github.io/blog/2025/9/1/how-to-use-ai-to-accelerate-your-career-in-2025/" rel="alternate"/><updated>2025-09-01T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0d018b08-65a0-3c3d-85a6-959936d7f3fb</id><content type="html">&lt;p&gt;Everyone knows LLMs can help with coding and drafting emails. But there are less obvious ways to hack your career with AI that can save you hours and make you more effective at work.&lt;/p&gt;
&lt;p&gt;Here are 10 strategies I've tested, with sample prompts you can steal:&lt;/p&gt;
&lt;h2 id="draft-presentations-that-actually-land"&gt;Draft presentations that actually land&lt;/h2&gt;&lt;p&gt;Most people start with slides. Start with your audience instead.&lt;/p&gt;
&lt;p&gt;First, research who you're presenting to. If you know specific attendees, have ChatGPT or Claude build dossiers from their public profiles - LinkedIn, company bios, recent interviews. Then ask the LLM what these people care about most.&lt;/p&gt;
&lt;p&gt;Next, have it craft your core message and angle based on those audience insights. Finally, get it to describe in words how each slide should look before you build anything - making sure to feed in both your audience research and your refined message. This approach works because you're designing for actual humans, not abstract concepts.&lt;/p&gt;
&lt;p&gt;Pro tip: you can also skip the manual work entirely and use Gamma.ai.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I'm presenting to &lt;code&gt;[specific people/roles]&lt;/code&gt;. Here are their LinkedIn profiles: &lt;code&gt;[paste]&lt;/code&gt;. What do they care about most professionally right now?&lt;/p&gt;
&lt;p&gt;My presentation topic is &lt;code&gt;[topic]&lt;/code&gt;. My audience cares about &lt;code&gt;[insights from above]&lt;/code&gt;. Help me craft a compelling angle that will resonate with them.&lt;/p&gt;
&lt;p&gt;Generate slide-by-slide instructions for a &lt;code&gt;[number]&lt;/code&gt;-slide presentation on &lt;code&gt;[topic]&lt;/code&gt;. My audience is &lt;code&gt;[audience description]&lt;/code&gt; and they care about &lt;code&gt;[audience insights from research]&lt;/code&gt;. My core message is &lt;code&gt;[refined message/angle]&lt;/code&gt;. For each slide, tell me: the title (which should be the main point of that slide), what elements to include, and how to lay them out. The title style should be &lt;code&gt;[describe your preferred title style - e.g., "a clear statement that makes the key point, not just a topic heading"]&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="research-your-negotiation-counterparts-like-a-detective"&gt;Research your negotiation counterparts like a detective&lt;/h2&gt;&lt;p&gt;Context is everything in negotiations. Feed your LLM everything you know about the other party - their backgrounds, the situation they're in, potential constraints they're facing.&lt;/p&gt;
&lt;p&gt;Describe your own circumstances, goals, and BATNA (Best Alternative to a Negotiated Agreement). Then iterate with the LLM on potential objections and counter-strategies.&lt;/p&gt;
&lt;p&gt;The more specific information you provide, the better it gets at uncovering blind spots you hadn't considered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I'm negotiating &lt;code&gt;[situation]&lt;/code&gt; with &lt;code&gt;[specific people/roles]&lt;/code&gt;. Here's what I know about them: &lt;code&gt;[paste background]&lt;/code&gt;. Here are my goals: &lt;code&gt;[paste goals]&lt;/code&gt;. My BATNA is: &lt;code&gt;[paste alternative]&lt;/code&gt;. What objections might they raise?&lt;/p&gt;
&lt;p&gt;Given this context: &lt;code&gt;[paste situation details]&lt;/code&gt;, what leverage points might I have that I'm not seeing?&lt;/p&gt;
&lt;p&gt;If I propose &lt;code&gt;[specific ask]&lt;/code&gt;, how might they respond based on &lt;code&gt;[paste their constraints/motivations]&lt;/code&gt;? Help me prepare counter-responses.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="transform-content-between-formats-effortlessly"&gt;Transform content between formats effortlessly&lt;/h2&gt;&lt;p&gt;You wrote a technical document that needs to become a blog post. Or you have a blog post that needs to become a slide deck. Instead of starting from scratch, let the LLM remix your existing content into the right format.&lt;/p&gt;
&lt;p&gt;The key is being specific about your target audience. A technical document transformed for executives needs different language and emphasis than one transformed for peer engineers. Without clear audience context, the LLM can't make effective choices about tone, depth, and focus.&lt;/p&gt;
&lt;p&gt;This works for any content transformation - meeting notes to executive summaries, brainstorming sessions to project proposals, quarterly reviews to team updates. Just remember: same content, different audience, different approach.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Transform this technical document into a blog post for &lt;code&gt;[specific people/roles - describe their background, interests, and level of technical knowledge]&lt;/code&gt;: &lt;code&gt;[paste content]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Turn these meeting notes into an executive summary for &lt;code&gt;[specific people/roles - include their role, priorities, and what they care about]&lt;/code&gt;: &lt;code&gt;[paste notes]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Convert this brainstorming session into a structured project proposal for &lt;code&gt;[specific people/roles - describe their concerns and what convinces them]&lt;/code&gt;: &lt;code&gt;[paste ideas]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="fill-out-administrative-forms-without-the-dread"&gt;Fill out administrative forms without the dread&lt;/h2&gt;&lt;p&gt;OKRs, performance reviews, expense reports - we all have forms that feel like bureaucratic hurdles. Here's the hack: don't write directly into the form.&lt;/p&gt;
&lt;p&gt;Instead, do a brain dump by talking through your accomplishments and goals. Transcribe this (voice memos work great), then paste the form questions plus your transcript into ChatGPT. Have it fill out the form for you, then copy-paste back.&lt;/p&gt;
&lt;p&gt;What used to take half a day now takes 30 minutes.&lt;/p&gt;
&lt;p&gt;Check out the Dia browser, which lets you insert LLM-generated text directly into web forms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Here are my form questions: &lt;code&gt;[paste]&lt;/code&gt;. Here's my brain dump of accomplishments: &lt;code&gt;[paste transcript]&lt;/code&gt;. Fill out the form professionally.&lt;/p&gt;
&lt;p&gt;Help me write OKRs based on this verbal dump of my goals: &lt;code&gt;[paste transcript]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Turn this expense description into proper business justification: &lt;code&gt;[paste description]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="ghost-write-in-your-own-voice"&gt;Ghost-write in your own voice&lt;/h2&gt;&lt;p&gt;Including this blog post, I start by verbally dumping my ideas into a Markdown file in Obsidian, usually via voice transcription. Then I have Claude ghostwrite using my tone - my verbal dump contains my natural writing patterns, plus I feed it samples of my previous blog posts.
The key is multiple editing rounds. I push hard on the LLM during edits, which is how I make the content truly mine. I don't publish anything until it's been through at least two rounds of refinement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Here's my verbal brain dump: &lt;code&gt;[paste]&lt;/code&gt;. Here are samples of my writing style: &lt;code&gt;[paste examples]&lt;/code&gt;. Ghostwrite my brain dump as a blog post in my voice.&lt;/p&gt;
&lt;p&gt;This draft doesn't sound like me yet. Make it more &lt;code&gt;[specific style notes]&lt;/code&gt;. Here's what my natural voice sounds like: &lt;code&gt;[paste examples]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Polish this draft but keep my conversational tone and specific phrases: &lt;code&gt;[paste draft]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="prepare-manager-updates-that-actually-help-your-career"&gt;Prepare manager updates that actually help your career&lt;/h2&gt;&lt;p&gt;Use the same strategy as ghostwriting, but tailor it to your manager's level and scope. You want your manager to have the details they need to advocate for you effectively.&lt;/p&gt;
&lt;p&gt;Most managers don't have time to critique your use of AI - they just need to stay informed. Keep a running log in a shared space, ideally structured like "updates, problems, questions" organized by project.&lt;/p&gt;
&lt;p&gt;As both an employee and a manager, I can tell you: teammates who spoon-feed structured weekly updates are gold. It shapes your manager's memory of your contributions and honestly makes my job as a manager easier because it lets me focus on coaching and strategic support rather than hunting for status updates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turn this brain dump into a structured manager update: &lt;code&gt;[paste notes]&lt;/code&gt;. Format as Updates/Problems/Questions by project.&lt;/p&gt;
&lt;p&gt;My manager is &lt;code&gt;[specific people/roles - description of their role/priorities]&lt;/code&gt;. Here's what I accomplished this week: &lt;code&gt;[paste list]&lt;/code&gt;. Write an update that helps them advocate for me.&lt;/p&gt;
&lt;p&gt;Summarize my quarterly achievements in a way that highlights impact and aligns with &lt;code&gt;[paste company priorities]&lt;/code&gt;: &lt;code&gt;[paste accomplishments]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="personalize-communications-with-important-people"&gt;Personalize communications with important people&lt;/h2&gt;&lt;p&gt;Use research mode to build context on VIPs you're meeting. Feed that research into your communication planning. This works because our digital footprints reveal what we care about, and LLMs are trained on massive examples of human interaction.&lt;/p&gt;
&lt;p&gt;The result: messages that land because they're tailored to what actually matters to the recipient.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I'm reaching out to &lt;code&gt;[specific people/roles]&lt;/code&gt; about &lt;code&gt;[topic]&lt;/code&gt;. Here's their background: &lt;code&gt;[paste research]&lt;/code&gt;. Help me craft a personalized message that will resonate.&lt;/p&gt;
&lt;p&gt;Based on this person's recent posts/interviews: &lt;code&gt;[paste]&lt;/code&gt;, what communication style and topics should I focus on?&lt;/p&gt;
&lt;p&gt;I need to follow up on &lt;code&gt;[situation]&lt;/code&gt; with &lt;code&gt;[specific people/roles who have these characteristics]&lt;/code&gt;. Write a message that acknowledges &lt;code&gt;[paste their priorities/constraints]&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="make-seamless-edits-to-any-document"&gt;Make seamless edits to any document&lt;/h2&gt;&lt;p&gt;When you have a long piece of writing (code, essays, reports), don't manually hunt for where to make changes. Just dictate your edits into an LLM chat.&lt;/p&gt;
&lt;p&gt;Add this to your system prompt: "When you make edits that I request, please make them seamless with the rest of the context."&lt;/p&gt;
&lt;p&gt;This prevents the LLM from injecting walls of text and instead makes surgical, contextual changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;In this document: &lt;code&gt;[paste]&lt;/code&gt;, I need to add information about &lt;code&gt;[topic]&lt;/code&gt; in the section about &lt;code&gt;[section]&lt;/code&gt;. Make it seamless with the existing content.&lt;/p&gt;
&lt;p&gt;Change the tone of this section from &lt;code&gt;[current tone]&lt;/code&gt; to &lt;code&gt;[desired tone]&lt;/code&gt; without changing the key points: &lt;code&gt;[paste section]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This paragraph needs to be more concise while keeping the main message: &lt;code&gt;[paste paragraph]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="organize-your-accomplishments-by-competency"&gt;Organize your accomplishments by competency&lt;/h2&gt;&lt;p&gt;Dump all your achievements into a voice memo and have an LLM organize them by your company's competency framework. Most performance reviews follow standard patterns - leadership, technical skills, collaboration.&lt;/p&gt;
&lt;p&gt;LLMs excel at categorizing your work and suggesting which examples best demonstrate each competency. No more staring at blank forms trying to remember what you did six months ago.&lt;/p&gt;
&lt;p&gt;This is essentially automating the &lt;a href="https://www.youtube.com/shorts/gbkv8Asadh0"&gt;brag doc that Steve Huynh recommends&lt;/a&gt; - but with AI doing the heavy lifting of organization and categorization. I've written more about &lt;a href="https://ericmjl.github.io/blog/2024/2/29/your-first-90-days-at-work-what-should-you-do/"&gt;building your accomplishments record in your first 90 days&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Here are my accomplishments from this year: &lt;code&gt;[paste list]&lt;/code&gt;. Organize them by these competencies: &lt;code&gt;[paste framework]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;I need examples that demonstrate leadership. Here's everything I've done: &lt;code&gt;[paste]&lt;/code&gt;. Which examples best show leadership impact?&lt;/p&gt;
&lt;p&gt;Help me identify gaps in my competency examples. Here's what I have: &lt;code&gt;[paste organized list]&lt;/code&gt;. What areas need stronger examples?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="practice-difficult-conversations-before-they-happen"&gt;Practice difficult conversations before they happen&lt;/h2&gt;&lt;p&gt;Feed in everything you know about the person you need to talk to - their communication style, recent stressors, past reactions, what motivates them. Have the LLM help you craft the right tone and timing, then practice by having it roleplay as them.&lt;/p&gt;
&lt;p&gt;Difficult conversations often fail not because of what you say, but how and when you say it. This prep work is like having a rehearsal before the real performance.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=yMOmmnjy3sE"&gt;Jeremy Utley demonstrates this technique&lt;/a&gt; of using AI to roleplay difficult conversations before they happen.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I need to have a difficult conversation with &lt;code&gt;[specific people/roles]&lt;/code&gt; about &lt;code&gt;[topic]&lt;/code&gt;. Here's their communication style: &lt;code&gt;[paste description]&lt;/code&gt;. Here's the situation: &lt;code&gt;[paste context]&lt;/code&gt;. Help me plan my approach.&lt;/p&gt;
&lt;p&gt;Roleplay as &lt;code&gt;[specific people/roles with these characteristics]&lt;/code&gt; while I practice this conversation about &lt;code&gt;[topic]&lt;/code&gt;. Push back as they would based on &lt;code&gt;[paste their known concerns]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I want to ask for &lt;code&gt;[specific request]&lt;/code&gt;. This person typically responds to &lt;code&gt;[paste motivation style]&lt;/code&gt;. How should I frame this conversation?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="force-ai-to-challenge-your-assumptions"&gt;Force AI to challenge your assumptions&lt;/h2&gt;&lt;p&gt;Don't just use LLMs to confirm what you already think. Actively ask them to disagree with you and surface blind spots. This is especially powerful for strategic decisions, project planning, or career moves where you might be too close to see potential problems.&lt;/p&gt;
&lt;p&gt;The key is being explicit about wanting pushback. LLMs are trained to be helpful and agreeable, so you need to specifically request criticism and alternative perspectives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sample prompts:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I'm planning to &lt;code&gt;[paste decision/strategy]&lt;/code&gt;. Play devil's advocate - what are the strongest arguments against this approach?&lt;/p&gt;
&lt;p&gt;Challenge my assumptions about &lt;code&gt;[situation]&lt;/code&gt;. Here's how I see it: &lt;code&gt;[paste your perspective]&lt;/code&gt;. What am I missing or getting wrong?&lt;/p&gt;
&lt;p&gt;I think &lt;code&gt;[paste opinion/plan]&lt;/code&gt;. Give me three reasons why someone smart might disagree with me, and explain their reasoning.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-principles-that-make-this-work"&gt;The principles that make this work&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Know your audience first.&lt;/strong&gt; Whether you're transforming content, crafting presentations, or writing updates for your manager, everything starts with understanding who you're communicating with. LLMs can't make effective choices about tone, depth, and focus without clear audience context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Research beats assumptions.&lt;/strong&gt; Don't guess what people want or how they'll react. Feed LLMs specific information about negotiation counterparts, presentation audiences, or conversation partners. The more context you provide, the better the output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speak, don't type.&lt;/strong&gt; Voice transcription captures your natural patterns and is faster than typing. Use it for brain dumps, accomplishment reviews, and initial drafts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automate the tedious, elevate the strategic.&lt;/strong&gt; Let LLMs handle forms, formatting, and content transformation so you can focus on relationships, creative problem-solving, and high-level strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Practice before it matters.&lt;/strong&gt; Use LLMs to rehearse difficult conversations, anticipate objections in negotiations, and stress-test your thinking before real situations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Seek challenge, not just confirmation.&lt;/strong&gt; Explicitly ask LLMs to disagree with you, surface blind spots, and present counterarguments. This prevents echo chambers and sharpens your thinking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Document everything for future you.&lt;/strong&gt; Keep accomplishment records, conversation insights, and successful prompts organized. Your future self will thank you during performance reviews and job transitions.&lt;/p&gt;
&lt;p&gt;The real power isn't in the AI doing your thinking for you - it's in using AI to handle the mechanics so you can focus on the strategy, relationships, and creative problem-solving that actually advance your career.&lt;/p&gt;
&lt;h2 id="ai-is-a-mirror"&gt;AI is a mirror&lt;/h2&gt;&lt;p&gt;As Jeremy Utley puts it: "AI is a mirror." If we want our brains to rot, we can use AI to make our brains rot. Or we can use AI to sharpen how we're thinking, be more effective and efficient at how we're working.&lt;/p&gt;
&lt;p&gt;The choice is yours. Use these techniques to elevate your career, not replace your judgment.&lt;/p&gt;
</content></entry><entry><title>How to communicate with lab scientists (when you're the data person)</title><link href="https://ericmjl.github.io/blog/2025/8/24/how-to-communicate-with-lab-scientists-when-youre-the-data-person/" rel="alternate"/><updated>2025-08-24T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:49bf5cd1-1578-32ca-9e45-4a16a6ece092</id><content type="html">&lt;p&gt;Imagine this scenario: A data scientist explains a hierarchical Bayesian model for 45 minutes. Beautiful math. Elegant handling of batch effects. The lab scientists are polite but glazed over. Finally, someone interrupts: "Sorry, but should we move this compound forward or not?"&lt;/p&gt;
&lt;p&gt;The data scientist hadn't even calculated that probability.&lt;/p&gt;
&lt;p&gt;Sound familiar?&lt;/p&gt;
&lt;p&gt;If you're a statistician or data scientist in biotech, you've probably been there. You've spent hours on sophisticated analyses, crafted beautiful slides about your methods, and watched your audience's eyes glaze over while you explained mixed-effects models.&lt;/p&gt;
&lt;p&gt;Meanwhile, they just needed to know if they should spend $200K on the next experiment.&lt;/p&gt;
&lt;p&gt;Here's the thing: Lab scientists aren't struggling to understand your statistics because they're not smart enough. They're brilliant experts who've spent years mastering protein folding, cell signaling, or synthetic chemistry. They're just juggling their own complex problems and need you to translate your analysis into something they can act on.&lt;/p&gt;
&lt;p&gt;Today, I'm going to show you exactly how to do that.&lt;/p&gt;
&lt;h2 id="here-s-what-we-re-covering"&gt;Here's what we're covering:&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Your communication budget is finite — spend it wisely&lt;/li&gt;
&lt;li&gt;Know what mode they're in before you open your laptop&lt;/li&gt;
&lt;li&gt;Use the three-layer translation model&lt;/li&gt;
&lt;li&gt;Decode what they're really asking&lt;/li&gt;
&lt;li&gt;Build trust through clarity, not complexity&lt;/li&gt;
&lt;li&gt;Master the decision-first meeting structure&lt;/li&gt;
&lt;li&gt;Ask yourself what they'll ask you&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="1-your-communication-budget-is-finite-spend-it-wisely"&gt;1. Your communication budget is finite — spend it wisely&lt;/h2&gt;&lt;p&gt;Every interaction has a finite "communication budget" — limited attention, time, and cognitive load. Most data scientists spend this budget like tourists with foreign currency, not realizing the exchange rate.&lt;/p&gt;
&lt;p&gt;Think about your last presentation. Where did you spend your time?&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;The typical (failed) allocation:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;60% on methodology and statistical details&lt;/li&gt;
&lt;li&gt;30% on results (tables, coefficients, credible intervals)&lt;/li&gt;
&lt;li&gt;10% on "what this means" (usually rushed at the end)&lt;/li&gt;
&lt;li&gt;0% on "what you should do next"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I get it. We're trained to show our work. We think rigor equals value. We assume that if we explain our methods thoroughly enough, scientists will understand what to do.&lt;/p&gt;
&lt;p&gt;But here's what actually works:&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;The allocation that drives decisions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10% on methods (just enough for credibility)&lt;/li&gt;
&lt;li&gt;20% on results (simplified, visual, contextual)&lt;/li&gt;
&lt;li&gt;40% on implications for their specific decisions&lt;/li&gt;
&lt;li&gt;30% on uncertainty and what it means for their next steps&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But context matters. A curious scientist with time might genuinely want 30% methods—they're building mental models for future decisions. Someone facing a go/no-go decision tomorrow? They need 70% decision implications, minimal methods.&lt;/p&gt;
&lt;p&gt;The key is adopting &lt;a href="https://en.wikipedia.org/wiki/BLUF_(communication)"&gt;BLUF (Bottom-Line Up-Front)&lt;/a&gt;. Structure your presentation by working backwards from the decision to be made.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt; Start with the decision and recommendation, then work backwards to the evidence that supports it. Lead with "Based on our analysis, I recommend we proceed with Compound A because there's an 87% probability it meets our potency threshold."&lt;/p&gt;
&lt;p&gt;This tells them immediately what they need to know, then you can spend the remaining time explaining why.&lt;/p&gt;
&lt;p&gt;Here's what happens when you don't use BLUF: A data scientist spent an entire program review meeting walking through their elegant approach to handling missing data. Really sophisticated stuff. Multiple imputation with careful consideration of the missing-at-random assumption.&lt;/p&gt;
&lt;p&gt;Twenty minutes in, the program lead interrupted: "This is interesting, but we need to decide today whether to advance this molecule. Does it meet our potency threshold or not?"&lt;/p&gt;
&lt;p&gt;They hadn't even calculated that probability. They'd spent their entire communication budget on something that wasn't even the program lead's concern that day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;With BLUF, they would have started:&lt;/strong&gt; "Based on our analysis, I recommend we advance this molecule. There's an 82% probability it meets our potency threshold, even accounting for the missing data. Here's how I handled the missing data to arrive at this conclusion..."&lt;/p&gt;
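&lt;p&gt;For what it's worth, a number like that "82% probability" is cheap to produce once you have posterior samples. If $\theta_1, \ldots, \theta_N$ are posterior draws of the potency parameter and $\tau$ is the threshold, the probability is just the fraction of draws that clear it: $\hat{P}(\theta &gt; \tau) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\theta_i &gt; \tau]$. (This is the generic Monte Carlo estimate, not the specific model from the story.)&lt;/p&gt;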
&lt;h2 id="2-know-what-mode-they-re-in-before-you-open-your-laptop"&gt;2. Know what mode they're in before you open your laptop&lt;/h2&gt;&lt;p&gt;Lab scientists operate in three distinct modes, and each requires a completely different communication approach.&lt;/p&gt;
&lt;h3 id="decision-mode-most-of-the-time"&gt;Decision Mode (Most of the time)&lt;/h3&gt;&lt;p&gt;They're under time pressure for go/no-go decisions. Maybe it's a pipeline review tomorrow. Maybe they need to order materials today. Maybe the synthesis team is literally waiting for their answer.&lt;/p&gt;
&lt;p&gt;Signs you'll hear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"What's the bottom line?"&lt;/li&gt;
&lt;li&gt;"Should we proceed?"&lt;/li&gt;
&lt;li&gt;"Just tell me if it worked"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What they need: Probability of success and a clear recommendation. That's it.&lt;/p&gt;
&lt;h3 id="learning-mode-when-they-have-bandwidth"&gt;Learning Mode (When they have bandwidth)&lt;/h3&gt;&lt;p&gt;They're genuinely curious about your methods. Maybe they're trying to understand why this analysis differs from last time. Maybe they're building intuition for future experiments.&lt;/p&gt;
&lt;p&gt;Signs you'll hear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"How does that work?"&lt;/li&gt;
&lt;li&gt;"Why did you choose that approach?"&lt;/li&gt;
&lt;li&gt;"Can you explain the intuition behind this?"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What they need: Mental models and intuition, not mathematical formulas.&lt;/p&gt;
&lt;h3 id="validation-mode-testing-if-they-can-trust-you"&gt;Validation Mode (Testing if they can trust you)&lt;/h3&gt;&lt;p&gt;They're not really interested in learning—they're assessing whether they can rely on your judgment for million-dollar decisions.&lt;/p&gt;
&lt;p&gt;Signs you'll hear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"What assumptions did you make?"&lt;/li&gt;
&lt;li&gt;"How does this handle batch effects?"&lt;/li&gt;
&lt;li&gt;"What if the data is wrong?"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What they need: Confidence that you've been rigorous without the full mathematical proof.&lt;/p&gt;
&lt;p&gt;Here's the mistake most of us make:&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;Wrong approach:&lt;/strong&gt; Launch into methods explanation regardless of mode&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Right approach:&lt;/strong&gt; Start with the decision and recommendation, then adapt the explanation depth based on their mode&lt;/p&gt;
&lt;p&gt;Most data scientists default to teaching mode when scientists are in decision mode. That's like giving someone a recipe when they just asked if dinner's ready.&lt;/p&gt;
&lt;p&gt;Consider this scenario: A scientist approaches a data scientist with dose-response data. The data scientist starts explaining their Bayesian approach to EC50 estimation. Five minutes in, the scientist stops them: "I just need to know if this is more potent than our current lead."&lt;/p&gt;
&lt;p&gt;She was in Decision Mode. The data scientist was in Teaching Mode. Complete mismatch.&lt;/p&gt;
&lt;p&gt;The better approach is to be deliberate rather than reactive. Before any meeting, clarify the goals upfront. Ask what they're trying to decide. Talk to stakeholders beforehand to understand the context. Do the pre-work rather than trying to read body language in real-time.&lt;/p&gt;
&lt;h2 id="3-use-the-three-layer-translation-model"&gt;3. Use the three-layer translation model&lt;/h2&gt;&lt;p&gt;You think in distributions. They think in decisions. This gap is why brilliant analyses often fail to drive action.&lt;/p&gt;
&lt;p&gt;Here's the framework that works for bridging that gap:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: Statistical Reality (Keep this in your head)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Full posterior distributions&lt;/li&gt;
&lt;li&gt;Model assumptions&lt;/li&gt;
&lt;li&gt;Fancy math&lt;/li&gt;
&lt;li&gt;All the technical details you love&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layer is for you. It ensures your analysis is rigorous. But it stays in your head or the appendix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Scientific Meaning (The bridge)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What the analysis means for their biological hypothesis&lt;/li&gt;
&lt;li&gt;How the statistics relate to their experimental design&lt;/li&gt;
&lt;li&gt;The full richness of uncertainty&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's the key: Keep the full distribution at this layer. Don't collapse to point estimates yet. You're translating statistics to science, but you're not making decisions yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3: Decision Layer (What they actually need)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NOW integrate over posteriors for specific probabilities&lt;/li&gt;
&lt;li&gt;"There's an 87% chance this compound beats your TPP threshold"&lt;/li&gt;
&lt;li&gt;"With 90% probability, this is your best compound"&lt;/li&gt;
&lt;li&gt;"You need 12 more samples to reach 95% confidence"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The magic is waiting until the last possible moment to collapse distributions into decision probabilities. Why? Because different decisions need different integrations of the same posterior.&lt;/p&gt;
&lt;p&gt;Let me show you what I mean:&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;Wrong (Layer 1 bleeding into communication):&lt;/strong&gt;
"The posterior distribution for the treatment effect has a 95% credible interval of [0.15, 0.31] with a mean of 0.23."&lt;/p&gt;
&lt;p&gt;What does a lab scientist do with this? Nothing. It's statistical reality without translation.&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Right (Layer 3, decision-focused):&lt;/strong&gt;
"There's a 92% probability your treatment exceeds the TPP threshold. If you need 95% confidence for the program milestone, run 20 more samples. If 90% is acceptable for an early read, you can proceed now."&lt;/p&gt;
&lt;p&gt;See the difference? One is statistical reporting. The other enables a decision.&lt;/p&gt;
&lt;p&gt;The same posterior distribution might need to answer multiple questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"What's the probability this exceeds our TPP?" (integrate above threshold)&lt;/li&gt;
&lt;li&gt;"What's the probability this is our best compound?" (compare posteriors)&lt;/li&gt;
&lt;li&gt;"How many samples until 95% confidence?" (project forward)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By keeping the full distribution until Layer 3, you can answer whatever decision question they actually have, not the one you assumed they had.&lt;/p&gt;
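&lt;p&gt;To make Layer 3 concrete, here's a minimal sketch of what "collapsing at the last moment" looks like in code. This is illustrative only: the posterior samples are faked with a random number generator, and the threshold is made up. In practice, the samples would come from your PPL of choice:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import numpy as np

# Stand-in for posterior samples from your model (PyMC, NumPyro, Stan, ...)
rng = np.random.default_rng(42)
effect_samples = rng.normal(loc=0.23, scale=0.04, size=4000)

threshold = 0.20  # illustrative decision threshold

# "What's the probability this exceeds our TPP?" (integrate above threshold)
p_exceeds = (effect_samples &amp;gt; threshold).mean()

# "What's the probability this is our best compound?" (compare posteriors)
competitor_samples = rng.normal(loc=0.18, scale=0.05, size=4000)
p_best = (effect_samples &amp;gt; competitor_samples).mean()

print(f"P(effect exceeds threshold) = {p_exceeds:.0%}")
print(f"P(best compound) = {p_best:.0%}")
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The same &lt;code&gt;effect_samples&lt;/code&gt; array answers both questions; only the integration changes.&lt;/p&gt;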
&lt;h2 id="4-decode-what-they-re-really-asking"&gt;4. Decode what they're really asking&lt;/h2&gt;&lt;p&gt;Scientists may ask statistics questions when they mean decision questions. Learning to translate is a superpower.&lt;/p&gt;
&lt;p&gt;Here's your decoder ring:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"Is this significant?"&lt;/strong&gt; They're not asking about p-values. They're asking: "Should I continue this line of research?"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"What's the confidence?"&lt;/strong&gt; They don't want credible intervals. They're asking: "How wrong could this decision be?"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"Did it work?"&lt;/strong&gt; They don't care about effect sizes. They're asking: "Is the effect large enough to matter for my application?"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"Can you check the stats?"&lt;/strong&gt; They don't want a methods seminar. They're asking: "I need ammunition for my go/no-go meeting tomorrow."&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"How robust is this?"&lt;/strong&gt; They're not necessarily interested in sensitivity analyses. They're asking: "Can I trust this decision?"&lt;/p&gt;
&lt;p&gt;Every lab scientist in biotech faces the same five decisions over and over:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Resource allocation:&lt;/strong&gt; Should I invest more time/money/FTEs in this direction?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline progression:&lt;/strong&gt; Is this ready for the next stage?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experimental design:&lt;/strong&gt; Should I modify my approach or repeat?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Program decisions:&lt;/strong&gt; Continue, pivot, or kill?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Platform decisions:&lt;/strong&gt; Is this assay/method worth scaling up?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Your job isn't to answer their literal question. It's to figure out which of these five decisions they're really trying to make.&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;Wrong:&lt;/strong&gt; Answer the literal statistics question they asked&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Right:&lt;/strong&gt; Answer the decision they're trying to make&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Try this:&lt;/strong&gt; When first asked to partner on an analysis, ask: "What decision are you trying to make with this data?"&lt;/p&gt;
&lt;p&gt;Then frame everything around that decision.&lt;/p&gt;
&lt;p&gt;Here's a common scenario: A scientist asks the data team to "check if the groups are different." The data scientist could run their standard analysis and report "statistically significant difference detected." Technically correct. Completely useless.&lt;/p&gt;
&lt;p&gt;Instead, imagine asking: "What decision does this inform?"&lt;/p&gt;
&lt;p&gt;Turns out, they need to know if the new formulation is at least 20% better than the current one — otherwise, it wasn't worth the reformulation costs. The groups were statistically different, but only by 8%. The real answer was: "Don't reformulate."&lt;/p&gt;
&lt;p&gt;That's the difference between answering questions and enabling decisions.&lt;/p&gt;
&lt;h2 id="5-build-trust-through-clarity-not-complexity"&gt;5. Build trust through clarity, not complexity&lt;/h2&gt;&lt;p&gt;Here's the paradox: Most data scientists think trust comes from showing their work.&lt;/p&gt;
&lt;p&gt;This is more nuanced than you might think.&lt;/p&gt;
&lt;p&gt;Over-explaining methods actually reduces trust because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It signals insecurity about your results&lt;/li&gt;
&lt;li&gt;It wastes precious communication budget&lt;/li&gt;
&lt;li&gt;It feels like gatekeeping with jargon&lt;/li&gt;
&lt;li&gt;It suggests you don't understand what they need&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What actually builds trust:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Leading with clear probabilities for their go/no-go decisions&lt;/li&gt;
&lt;li&gt;Showing how probability changes with more data&lt;/li&gt;
&lt;li&gt;Being precise about uncertainty without hedging&lt;/li&gt;
&lt;li&gt;Speaking their language, not yours&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;🚫 &lt;strong&gt;Trust-killing:&lt;/strong&gt; "Well, it depends on your assumptions about the prior, and if we consider the hierarchical structure of the random effects, controlling for batch-to-batch variation, we can say that under certain conditions..."&lt;/p&gt;
&lt;p&gt;This sounds like you're not confident in your answer.&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Trust-building:&lt;/strong&gt; "There's an 89% chance this works. If you need 95% confidence before scaling up, test 3 more concentrations. Here's why I'm confident in that number: I've accounted for batch effects, and even in the worst-case scenario, you're still above 82%."&lt;/p&gt;
&lt;p&gt;Clear. Actionable. Confident.&lt;/p&gt;
&lt;p&gt;The beauty of probabilistic thinking here: "We're 78% confident" is infinitely clearer than "statistically significant." You can directly answer: "What's the probability we're making the wrong decision?"&lt;/p&gt;
&lt;p&gt;That's a question every scientist understands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The methods appendix approach:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you do need to establish technical credibility, try this structure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One slide of method basics (just enough to show rigor)&lt;/li&gt;
&lt;li&gt;Key assumptions in plain English (Pro tip: AI tools can help translate technical assumptions into audience-appropriate language)&lt;/li&gt;
&lt;li&gt;Details available but not forced&lt;/li&gt;
&lt;li&gt;For the genuinely curious: "Happy to dive into the model after we nail down your decision"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Smart data scientists keep a technical appendix for every analysis. It has all the details they're proud of—the clever missing data handling, the hierarchical structure, the prior specifications.&lt;/p&gt;
&lt;p&gt;But they only show it when asked. And here's what happens: People trust them more because they respect everyone's time enough not to force it on them.&lt;/p&gt;
&lt;h2 id="6-master-the-decision-first-meeting-structure"&gt;6. Master the decision-first meeting structure&lt;/h2&gt;&lt;p&gt;Stop opening with methods. Stop it right now.&lt;/p&gt;
&lt;p&gt;Start with their decision.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The structure that works:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;State the decision context upfront&lt;/strong&gt;: "We're here to discuss [specific decision]. Here's what the data tells us."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Give the probability and recommendation immediately&lt;/strong&gt;: "There's an 89% probability of success. I recommend proceeding."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Show how probability changes with more data&lt;/strong&gt;: "With 10 more samples, we'd get to 95% confidence."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discuss what could change your assessment&lt;/strong&gt;: "This assumes your batch effects stay consistent."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Offer details only if requested&lt;/strong&gt;: "Want me to walk through how I got there?"&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let me show you the difference:&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;Wrong meeting flow:&lt;/strong&gt;
"Thanks for coming. So I started by examining the data structure, and I noticed some heteroscedasticity in the residuals, which suggested we might need a more complex variance structure. I tried several approaches, including a Box-Cox transformation, but ultimately settled on a hierarchical model because... [20 minutes later]... so in conclusion, it might work."&lt;/p&gt;
&lt;p&gt;By the time you get to the conclusion, they've stopped listening.&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Right meeting flow:&lt;/strong&gt;
"We're here to discuss whether to advance Compound X to synthesis. Based on your assay data, there's an 89% probability this compound exceeds your 10nM potency requirement. I recommend proceeding to synthesis scale-up. If you need 95% confidence instead of 89%, I'd recommend testing 3 more concentrations first. Want me to walk through how I got there?"&lt;/p&gt;
&lt;p&gt;Notice how the second version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Answers their question immediately&lt;/li&gt;
&lt;li&gt;Gives them options based on risk tolerance&lt;/li&gt;
&lt;li&gt;Respects their time&lt;/li&gt;
&lt;li&gt;Offers details rather than forcing them&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The email version:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Subject: Compound X: 89% probability of meeting TPP

Hi Sarah,

**Decision:** Compound X has an 89% probability of meeting your 10nM potency requirement. Recommend proceeding to synthesis.

**Key evidence:**
- Consistent effect across all three batches
- Dose-response curve shows clear relationship
- Even worst-case scenario keeps you above 15nM

**Next steps:** If you need &amp;gt;95% confidence, test 3 additional concentrations. Otherwise, proceed with synthesis.

Technical details in attached appendix if interested.

Best,
[Your name]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's it. Decision first. Evidence second. Details optional.&lt;/p&gt;
&lt;h2 id="7-speak-their-language-literally"&gt;7. Speak their language (literally)&lt;/h2&gt;&lt;p&gt;Here's what most data scientists miss: You need to understand their domain as deeply as they do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The measurement method matters more than your statistical method.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A scientist tells you they're using ELISA to measure protein levels. You nod and proceed with your analysis. But did you ask:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What's the detection limit?&lt;/li&gt;
&lt;li&gt;How does the antibody specificity affect your readout?&lt;/li&gt;
&lt;li&gt;Are there known cross-reactivities that could confound your results?&lt;/li&gt;
&lt;li&gt;What's the coefficient of variation across replicates?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These aren't &lt;em&gt;merely&lt;/em&gt; statistical questions — they're also biological questions that determine whether your analysis is even valid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be deeply curious about their methods.&lt;/strong&gt; Ask about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The specific assay they're using and its limitations&lt;/li&gt;
&lt;li&gt;How they handle sample preparation and storage&lt;/li&gt;
&lt;li&gt;What controls they're running and why&lt;/li&gt;
&lt;li&gt;The historical performance of this measurement in their hands&lt;/li&gt;
&lt;li&gt;What could go wrong and how they'd know&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Learn their terminology.&lt;/strong&gt; Don't just understand what they're measuring, but understand how they think about it. When they say "potency," do they mean EC50, IC50, or something else? When they talk about "efficacy," are they referring to maximal response, potency, or both?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quick domain mastery checklist:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What are the three most common failure modes for this assay?&lt;/li&gt;
&lt;li&gt;What does "good" look like in their world?&lt;/li&gt;
&lt;li&gt;What would make them suspicious of the data?&lt;/li&gt;
&lt;li&gt;How do they typically handle outliers or unexpected results?&lt;/li&gt;
&lt;li&gt;What's the gold standard measurement they're comparing against?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: A data scientist was analyzing dose-response data from a cell-based assay. The scientist mentioned they were using a "luminescence readout." The data scientist asked about the detection range, learned it was $10^3$ to $10^6$ RLU, and immediately spotted that their highest concentration was saturating the detector. The analysis would have been meaningless without understanding that technical limitation.&lt;/p&gt;
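&lt;p&gt;That kind of check is easy to automate once you know to ask. Here's a sketch (the column names and the detector ceiling are hypothetical) that flags readings near the top of the stated detection range before any curve fitting happens:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import pandas as pd

# Hypothetical dose-response readout
df = pd.DataFrame({
    "concentration_nM": [1, 10, 100, 1_000, 10_000],
    "rlu": [2.1e3, 1.8e4, 2.3e5, 9.7e5, 1.0e6],
})

DETECTOR_MAX_RLU = 1e6  # upper end of the assay's detection range

# Flag readings within 5% of the ceiling as likely saturated
df["saturated"] = df["rlu"] &amp;gt;= 0.95 * DETECTOR_MAX_RLU
print(df[df["saturated"]])
&lt;/pre&gt;&lt;/div&gt;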
&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; When you speak their language, you don't just communicate better, you also analyze better. You spot confounders they might miss. You suggest controls they haven't thought of. You become a true collaborator, not just a service provider.&lt;/p&gt;
&lt;h2 id="8-ask-yourself-what-they-ll-ask-you"&gt;8. Ask yourself what they'll ask you&lt;/h2&gt;&lt;p&gt;Every scientist has patterns. Learn them.&lt;/p&gt;
&lt;p&gt;Your PI always asks about sample size? Pre-calculate the probability of detecting meaningful effects.
Your biomarker lead obsesses over false positives? Lead with the posterior probability of true effects.
The chemistry team cares about synthesis feasibility? Include yield probabilities from your Bayesian model.&lt;/p&gt;
&lt;p&gt;This isn't mind-reading. It's paying attention.&lt;/p&gt;
&lt;p&gt;🚫 &lt;strong&gt;Reactive approach:&lt;/strong&gt;
Wait for their questions, scramble for answers, promise to "get back to you on that"&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Proactive approach:&lt;/strong&gt;
"I know you usually want to know about batch effects, so I checked—they're negligible. Here's how I verified..."&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pattern matching checklist:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What did they ask in the last three meetings?&lt;/li&gt;
&lt;li&gt;What decisions do they usually struggle with?&lt;/li&gt;
&lt;li&gt;What makes them nervous about moving forward?&lt;/li&gt;
&lt;li&gt;What would convince them this is real?&lt;/li&gt;
&lt;li&gt;What got them in trouble before?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: One program lead always asks: "What if we're wrong?" Every. Single. Time.&lt;/p&gt;
&lt;p&gt;The smartest data scientists now anticipate this and always include: "If we're wrong about this, here's what we'd see in the next experiment. Here's our bail-out plan. Here's the cost of being wrong versus the cost of being slow."&lt;/p&gt;
&lt;p&gt;She doesn't ask anymore. She trusts that they've thought it through.&lt;/p&gt;
&lt;p&gt;Another scientist always wants to know if we have enough evidence. So the prepared data scientist leads with: "There's an 85% probability that the treatment effect exceeds your minimum meaningful difference."&lt;/p&gt;
&lt;p&gt;Pre-answering questions isn't just efficient—it builds massive trust. It shows you understand their concerns and you're thinking ahead. Trust me, &lt;strong&gt;this is a career hack!&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="the-bottom-line"&gt;The bottom line&lt;/h2&gt;&lt;p&gt;Most data scientists in biotech spend 80% of their communication budget on methods that their collaborators—brilliant scientists juggling their own complex problems—don't have bandwidth to process.&lt;/p&gt;
&lt;p&gt;You're doing the equivalent of giving someone a recipe when they just asked if dinner's ready.&lt;/p&gt;
&lt;p&gt;The shift is simple but not easy: Stop defaulting to education mode. Start asking "What decision are you trying to make?" Then translate your sophisticated analysis into the probability they need to make that decision.&lt;/p&gt;
&lt;p&gt;This isn't about dumbing down your work. It's about translating between two expert domains—like a diplomat translating between heads of state. The lab scientists you work with have spent years mastering complex biological systems. They need translation, not education.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your action items:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;When first asked for analysis:&lt;/strong&gt; Start with "What decision are you trying to make with this data?" Don't begin any analysis until you know.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review your last presentation:&lt;/strong&gt; Did you lead with the decision (BLUF) or bury it in methods? If methods came first, restructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Practice probability statements:&lt;/strong&gt; Instead of showing credible intervals, say "There's an X% probability that..." It's clearer and more actionable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn their measurement methods:&lt;/strong&gt; Ask about detection limits, controls, and failure modes before analyzing their data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build a pattern map:&lt;/strong&gt; Write down what each of your regular collaborators usually asks. Answer it proactively next time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a technical appendix:&lt;/strong&gt; Put all your beautiful methods somewhere. Just don't force people to sit through it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The best data scientists are the ones whose collaborators make the best decisions.&lt;/p&gt;
&lt;p&gt;And that starts with spending your communication budget on what actually matters to the people you're trying to help.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;What patterns have you noticed in your collaborations? What questions do your scientists always ask? Let me know; I'd love to hear what's working (or not working) for you!&lt;/em&gt;&lt;/p&gt;
</content></entry><entry><title>Wicked Python trickery - dynamically patch a Python function's source code at runtime</title><link href="https://ericmjl.github.io/blog/2025/8/23/wicked-python-trickery-dynamically-patch-a-python-functions-source-code-at-runtime/" rel="alternate"/><updated>2025-08-23T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:3ced3d47-73a9-378e-aead-d8279180cc9f</id><content type="html">&lt;p&gt;So today, I learned a very dangerous and yet fascinating trick.&lt;/p&gt;
&lt;p&gt;It's possible to dynamically change a Python function's source code &lt;em&gt;at runtime&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What this does is open a world of possibilities in building AI bots!&lt;/p&gt;
&lt;h2 id="how-this-actually-works"&gt;How this actually works&lt;/h2&gt;&lt;p&gt;Every function has a &lt;code&gt;.__code__&lt;/code&gt; attribute. For example, for this function:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;something&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;NotImplementedError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;something.__code__&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;something&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x149bdfc90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/var/folders/36/vb250n_s0zncstw3sk74qfxr0000gn/T/marimo_80086/__marimo__cell_kJqw_.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If I were to execute &lt;code&gt;something()&lt;/code&gt;, it would raise a &lt;code&gt;NotImplementedError&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now, let's say that, for some reason that I shall not speculate, I decided that I wanted &lt;code&gt;something()&lt;/code&gt; to instead do multiplication by 2. I can create new source code:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;new_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;def something(x: int) -&amp;gt; int:&lt;/span&gt;
&lt;span class="s2"&gt;    return x * 2&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I can do the following three magical steps to swap it in.&lt;/p&gt;
&lt;p&gt;Firstly, compile the code into bytecode:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;compiled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;lt;magic&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;exec&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The three arguments to &lt;code&gt;compile&lt;/code&gt; are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The code to compile (&lt;code&gt;new_code&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;The filename in which the code is compiled (&lt;code&gt;&amp;lt;magic&amp;gt;&lt;/code&gt;), and&lt;/li&gt;
&lt;li&gt;The mode in which compilation happens (in this case, &lt;code&gt;exec&lt;/code&gt; mode).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On the third point, the docstring of &lt;code&gt;compile&lt;/code&gt; explains what the three modes are:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;The mode must be 'exec' to compile a module, 'single' to compile a single (interactive) statement, or 'eval' to compile an expression.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;compiled&lt;/code&gt; object now is a "code object":&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mh"&gt;0x149bcbad0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;lt;magic&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I can then execute the compiled code to import the function definition into a particular namespace.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compiled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, the three arguments passed to &lt;code&gt;exec&lt;/code&gt; are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The code we want to execute (&lt;code&gt;compiled&lt;/code&gt;), and in this case, by "executing" it after being compiled in &lt;code&gt;exec&lt;/code&gt; mode, we are really just simulating an &lt;code&gt;import&lt;/code&gt; into our namespace.&lt;/li&gt;
&lt;li&gt;The globals (&lt;code&gt;{}&lt;/code&gt;), which in this case are passed in as an empty dictionary. These are the global variables that are available to the function at runtime.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ns&lt;/code&gt; is the "namespace" in which we want the function to be present; namespaces in Python are just dictionary mappings from function/object name to the function/object itself.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Finally, I can replace my existing function with the compiled function inserted into the &lt;code&gt;ns&lt;/code&gt; namespace:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;something_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;something&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;something_new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# this will print 42 to stdout!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
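&lt;p&gt;And if you want to patch the &lt;em&gt;original&lt;/em&gt; function in place rather than binding a new name, you can assign the new function's code object onto the old one. A minimal sketch, continuing the example above:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Every existing reference to `something` now runs the new implementation.
something.__code__ = ns["something"].__code__
print(something(21))  # prints 42 instead of raising NotImplementedError
&lt;/pre&gt;&lt;/div&gt;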
&lt;p&gt;But really, the real lesson here is not that one can monkeypatch over an existing Python function's source code at runtime, but that you can actually &lt;strong&gt;compile the string of a Python function definition and give it access to a namespace's variables&lt;/strong&gt;, including that of the current global namespace.&lt;/p&gt;
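&lt;p&gt;Here's a small self-contained illustration of that point (the variable and function names are invented for the demo). By passing &lt;code&gt;globals()&lt;/code&gt; as the globals argument to &lt;code&gt;exec&lt;/code&gt;, the compiled function can see variables that already live in your session:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;x_data = [1, 2, 3]  # a variable that already lives in my session

src = """
def total_of_x():
    return sum(x_data)  # resolves x_data through the globals we pass in
"""

ns = {}
exec(compile(src, "&amp;lt;demo&amp;gt;", "exec"), globals(), ns)
print(ns["total_of_x"]())  # prints 6
&lt;/pre&gt;&lt;/div&gt;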
&lt;h2 id="when-would-you-ever-want-to-do-this"&gt;When would you ever want to do this?&lt;/h2&gt;&lt;p&gt;At first glance, never really! This is a bit of hackery that lives on the fringes of Python-land, and is basically a party trick.&lt;/p&gt;
&lt;p&gt;But as it turns out, I &lt;em&gt;actually&lt;/em&gt; had a real motivation for wanting to do this.&lt;/p&gt;
&lt;p&gt;Within &lt;a href="https://github.com/ericmjl/llamabot"&gt;LlamaBot&lt;/a&gt;, I've always had &lt;code&gt;AgentBot&lt;/code&gt; as a first-pass implementation of what I think an LLM agent should look like, having studied LLM agent implementations in other libraries. However, I've never been fully satisfied with &lt;code&gt;AgentBot&lt;/code&gt;'s implementation. The core issue was that it mixed too many concerns together - function execution, function call determination, and user response generation all lived in the same loop.&lt;/p&gt;
&lt;p&gt;Here's what &lt;code&gt;AgentBot&lt;/code&gt; looked like at a high level:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;AgentBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SimpleBot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;# get response object, passing in messages&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="c1"&gt;# Execute tool calls if they are present&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name_to_tools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;# continue until LLM decides we&amp;#39;re done.&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# just respond to users.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;While this worked, it wasn't great at separating concerns. I had function execution mixed in with function call determination mixed in with responding to a user.&lt;/p&gt;
&lt;p&gt;The bigger limitation was with code execution tools. My original implementation isolated generated code in a Docker container sandbox, which was secure but meant the code couldn't access variables from my current Python runtime. This severely limited what kinds of useful tasks the bot could perform with my existing data and variables.&lt;/p&gt;
&lt;p&gt;I realized that if I could:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use an LLM to generate Python functions that referenced existing variables in my runtime,&lt;/li&gt;
&lt;li&gt;Compile those functions on-the-fly within the same Python environment, and&lt;/li&gt;
&lt;li&gt;Execute them with access to my current namespace,&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I could build something much more powerful. This led me to create &lt;code&gt;ToolBot&lt;/code&gt; within LlamaBot.&lt;/p&gt;
&lt;h2 id="toolbot-focuses-on-tool-selection-instead-of-execution"&gt;ToolBot focuses on tool selection instead of execution&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ToolBot&lt;/code&gt; takes a different approach - it focuses purely on tool selection rather than execution. Here's the key structure:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ToolBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SimpleBot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize with core tools like today_date and respond_to_user&lt;/span&gt;
        &lt;span class="n"&gt;all_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;today_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;respond_to_user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_schema&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name_to_tool_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Process message and return tool calls (but don&amp;#39;t execute them)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_tool_calls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;  &lt;span class="c1"&gt;# Just return the calls, don&amp;#39;t execute&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The key insight: &lt;code&gt;ToolBot&lt;/code&gt; just selects a tool to be executed, but does &lt;em&gt;not&lt;/em&gt; execute it. Instead, it returns the tools to be called to the external environment, giving you full control over execution.&lt;/p&gt;
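&lt;p&gt;Because &lt;code&gt;ToolBot&lt;/code&gt; hands the tool calls back, the execution loop lives in your code. Here's a rough sketch of what that might look like; the exact attributes on each tool call object depend on the provider and litellm version, so treat the &lt;code&gt;.function.name&lt;/code&gt; and &lt;code&gt;.function.arguments&lt;/code&gt; access as an assumption:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;import json

bot = ToolBot(system_prompt="You are a helpful analyst.", model_name="gpt-4o")
tool_calls = bot("Summarize salary by department.")

for tool_call in tool_calls:
    # Assumed OpenAI-style shape for the tool call object
    func = bot.name_to_tool_map[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)
    result = func(**kwargs)  # I decide when, where, and whether this runs
&lt;/pre&gt;&lt;/div&gt;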
&lt;h2 id="the-magic-happens-with-write-and-execute-code"&gt;The magic happens with write_and_execute_code&lt;/h2&gt;&lt;p&gt;One of the most powerful tools that can be chosen is &lt;code&gt;write_and_execute_code&lt;/code&gt;. Here's the core implementation:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;write_and_execute_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;globals_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;write_and_execute_code_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placeholder_function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword_args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;Write and execute `placeholder_function` with the passed in `keyword_args`.&lt;/span&gt;

&lt;span class="s2"&gt;        Use this tool for any task that requires custom Python code generation and execution.&lt;/span&gt;
&lt;span class="s2"&gt;        This tool has access to ALL globals in the current runtime environment (variables, dataframes, functions, etc.).&lt;/span&gt;
&lt;span class="s2"&gt;        Perfect for: data analysis, calculations, transformations, visualizations, custom algorithms.&lt;/span&gt;

&lt;span class="s2"&gt;        ## Code Generation Guidelines:&lt;/span&gt;

&lt;span class="s2"&gt;        1. **Write self-contained Python functions** with ALL imports inside the function body&lt;/span&gt;
&lt;span class="s2"&gt;        2. **Place all imports at the beginning of the function**: import statements must be the first lines inside the function&lt;/span&gt;
&lt;span class="s2"&gt;        3. **Include all required libraries**: pandas, numpy, matplotlib, etc. - import everything the function needs&lt;/span&gt;
&lt;span class="s2"&gt;        4. **Leverage existing global variables**: Can reference variables that exist in the runtime&lt;/span&gt;
&lt;span class="s2"&gt;        5. **Include proper error handling** and docstrings&lt;/span&gt;
&lt;span class="s2"&gt;        6. **Provide keyword arguments** when the function requires parameters&lt;/span&gt;
&lt;span class="s2"&gt;        7. **Make functions reusable** - they will be stored globally for future use&lt;/span&gt;
&lt;span class="s2"&gt;        8. **ALWAYS RETURN A VALUE**: Every function must explicitly return something - never just print, display, or show results without returning them. Even for plotting functions, return the figure/axes object.&lt;/span&gt;

&lt;span class="s2"&gt;        ## Function Arguments Handling:&lt;/span&gt;

&lt;span class="s2"&gt;        **CRITICAL**: You MUST match the function signature with the keyword_args:&lt;/span&gt;
&lt;span class="s2"&gt;        - **If your function takes NO parameters** (e.g., `def analyze_data():`), then pass an **empty dictionary**: `&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
&lt;span class="s2"&gt;        - **If your function takes parameters** (e.g., `def filter_data(min_age, department):`), then pass the required arguments as a dictionary: `{&amp;quot;min_age&amp;quot;: 30, &amp;quot;department&amp;quot;: &amp;quot;Engineering&amp;quot;}`&lt;/span&gt;
&lt;span class="s2"&gt;        - **Never pass keyword_args that don&amp;#39;t match the function signature** - this will cause execution errors&lt;/span&gt;

&lt;span class="s2"&gt;        ## Code Structure Example:&lt;/span&gt;

&lt;span class="s2"&gt;        ```python&lt;/span&gt;
&lt;span class="s2"&gt;        # Function with NO parameters - use empty dict &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;
&lt;span class="s2"&gt;        def analyze_departments():&lt;/span&gt;
&lt;span class="s2"&gt;            &amp;#39;&amp;#39;&amp;#39;Analyze department performance.&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="s2"&gt;            import pandas as pd&lt;/span&gt;
&lt;span class="s2"&gt;            import numpy as np&lt;/span&gt;
&lt;span class="s2"&gt;            result = fake_df.groupby(&amp;#39;department&amp;#39;)[&amp;#39;salary&amp;#39;].mean()&lt;/span&gt;
&lt;span class="s2"&gt;            return result&lt;/span&gt;
&lt;span class="s2"&gt;        # Function WITH parameters - pass matching keyword_args&lt;/span&gt;
&lt;span class="s2"&gt;        def filter_employees(min_age, department):&lt;/span&gt;
&lt;span class="s2"&gt;            &amp;#39;&amp;#39;&amp;#39;Filter employees by criteria.&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="s2"&gt;            import pandas as pd&lt;/span&gt;
&lt;span class="s2"&gt;            filtered = fake_df[(fake_df[&amp;#39;age&amp;#39;] &amp;gt;= min_age) &amp;amp; (fake_df[&amp;#39;department&amp;#39;] == department)]&lt;/span&gt;
&lt;span class="s2"&gt;            return filtered&lt;/span&gt;
        ```

        ## Return Value Requirements:

        - **Data analysis functions**: Return the computed results (numbers, DataFrames, lists, dictionaries)
        - **Plotting functions**: Return the figure or axes object (e.g., `return fig` or `return plt.gca()`)
        - **Filter/transformation functions**: Return the processed data
        - **Calculation functions**: Return the calculated values
        - **Utility functions**: Return relevant output (status, processed data, etc.)
        - **Never return None implicitly** - always have an explicit return statement

        ## Code Access Capabilities:

        The generated code will have access to:
        - All global variables and dataframes in the current session
        - Any previously defined functions
        - The ability to import any standard Python libraries within the function
        - The ability to create new reusable functions that will be stored globally

        :param placeholder_function: The function to execute (complete Python function as string).
        :param keyword_args: The keyword arguments to pass to the function (dictionary matching function parameters).
        :return: The result of the function execution.
        """

        # Parse the code to extract the function name
        tree = ast.parse(placeholder_function)
        function_name = None
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                function_name = node.name
                break

        # Compile and execute the function with access to globals
        ns = globals_dict
        compiled = compile(placeholder_function, "&amp;lt;llm&amp;gt;", "exec")
        exec(compiled, globals_dict, ns)
        return ns[function_name](**keyword_args)

    return write_and_execute_code_wrapper
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This extensive docstring gets passed as part of the JSON schema and effectively serves as instructions to the LLM on when and how to use this tool. I stripped out logging and error handling to simplify what's shown here, but the actual codebase has more robustness built in.&lt;/p&gt;
&lt;p&gt;Notice how &lt;code&gt;ToolBot&lt;/code&gt;, and more specifically &lt;code&gt;write_and_execute_code&lt;/code&gt;, gains explicit access to the &lt;code&gt;globals()&lt;/code&gt; dictionary when a user passes it in. This approach ensures that function execution takes place within the proper namespace. If &lt;code&gt;ToolBot&lt;/code&gt; chooses &lt;code&gt;write_and_execute_code&lt;/code&gt;, I can control exactly where and how it executes within my Python runtime environment - and this opens up a world of possibilities!&lt;/p&gt;
&lt;p&gt;For example, I took inspiration from &lt;a href="https://marimo.io/blog/marimo-chat"&gt;the Marimo blog&lt;/a&gt;, which wrote about generative UIs and tool calling:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;marimo’s chat interface supports Generative UI - the ability to stream rich, interactive UI components directly from LLM responses. This goes beyond traditional text and markdown outputs, allowing chatbots to return dynamic elements like tables, charts, and interactive visualizations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to build a &lt;em&gt;generalized&lt;/em&gt; version of a tool that an LLM could choose to call, one that also has access to any variable present within the runtime environment. Much like Marimo's AI chat can reference any variable in the environment with an &lt;code&gt;@variable_name&lt;/code&gt;, I just dump the full set of &lt;code&gt;globals()&lt;/code&gt; into the LLM's context window; that's what &lt;code&gt;write_and_execute_code&lt;/code&gt; does.&lt;/p&gt;
&lt;p&gt;Here's an example: imagine I have two dataframes that I want an LLM to manipulate. Without &lt;code&gt;write_and_execute_code&lt;/code&gt;, I'd have to write bespoke tools for each dataframe, in which I access the &lt;code&gt;df&lt;/code&gt; as a "global" variable, much like the following:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;@lmb.tool
def chart_data(x_encoding: str, y_encoding: str, color: str):
    """Generate an altair chart"""
    import altair as alt
    return (
        alt.Chart(df)
        .mark_circle()
        .encode(x=x_encoding, y=y_encoding, color=color)
        .properties(width=500)
    )
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So the writing on the wall is that I'd have to write one tool for every possible operation that I'd desire, but that's a big hassle. With this &lt;code&gt;globals()&lt;/code&gt;, &lt;code&gt;compile&lt;/code&gt;, and &lt;code&gt;exec&lt;/code&gt; trickery baked into &lt;code&gt;write_and_execute_code&lt;/code&gt;, I no longer have to specify bespoke tools for the environment that I'm in!&lt;/p&gt;
&lt;p&gt;Furthermore, inspired by the Marimo blog post, &lt;code&gt;ToolBot&lt;/code&gt; is designed to do just the tool picking, delegating execution and the return of results back to the developer of the broader LLM-powered Python program. This gives me more flexibility when building entire "agentic" programs than &lt;code&gt;AgentBot&lt;/code&gt; does in its current form, and it let me build a more powerful tool-calling agent with generative UIs in a Marimo notebook. This is easier to demo via a screencast than to describe in prose:&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/tk5wvb556f8?si=sXSRulZ2ooBplapr" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;&lt;p&gt;And if you're curious to try running it, you can run it with the following command:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;--sandbox&lt;span class="w"&gt; &lt;/span&gt;https://raw.githubusercontent.com/ericmjl/website/refs/heads/main/content/blog/wicked-python-trickery-dynamically-patch-a-python-functions-source-code-at-runtime/agents.py
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="security-concerns-are-very-real-with-this-approach"&gt;Security concerns are very real with this approach&lt;/h2&gt;&lt;p&gt;Comparing this to what we had before with &lt;code&gt;write_and_execute_script&lt;/code&gt;, which performed execution in a sandboxed Docker container with limited read/write capabilities, &lt;code&gt;write_and_execute_code&lt;/code&gt; is &lt;em&gt;much, much less secure&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Obviously, I'm playing with fire here. A malicious LLM output could run code directly and do enormous damage to my machine and from my machine to the outside world. I have yet to implement code sanitization, but one big idea I have, which I just learned through discourse with GPT-4, is to use &lt;a href="https://github.com/zopefoundation/RestrictedPython"&gt;Restricted Python&lt;/a&gt;. I think that will be the next big upgrade after I let the current version of &lt;code&gt;write_and_execute_code&lt;/code&gt; sit for a while.&lt;/p&gt;
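&lt;p&gt;For the curious, RestrictedPython's entry point is a drop-in replacement for the built-in &lt;code&gt;compile&lt;/code&gt;. A minimal sketch of what that upgrade might look like (I haven't wired this into &lt;code&gt;write_and_execute_code&lt;/code&gt; yet, so consider it untested in that context):&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from RestrictedPython import compile_restricted, safe_globals

src = """
def double(x):
    return x * 2
"""

ns = {}
byte_code = compile_restricted(src, filename="&amp;lt;llm&amp;gt;", mode="exec")
exec(byte_code, dict(safe_globals), ns)  # safe_globals withholds dangerous builtins
print(ns["double"](21))  # prints 42
&lt;/pre&gt;&lt;/div&gt;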
&lt;p&gt;As such, I don't suggest that the &lt;code&gt;write_and_execute_code&lt;/code&gt; pattern be used for anything really serious in its current form.&lt;/p&gt;
&lt;h2 id="what-i-learned-from-this-python-trickery"&gt;What I learned from this Python trickery&lt;/h2&gt;&lt;p&gt;This journey taught me several things. First, Python's runtime is far more malleable than I initially realized - the ability to compile strings into executable code and inject them into specific namespaces opens up incredible possibilities for dynamic programming.&lt;/p&gt;
&lt;p&gt;Second, building effective LLM agents isn't just about the AI - it's about thoughtful system design. Separating tool selection from execution (as &lt;code&gt;ToolBot&lt;/code&gt; does) creates much more flexible and controllable systems than monolithic agents.&lt;/p&gt;
&lt;p&gt;Finally, this wouldn't have been possible without &lt;a href="https://ericmjl.github.io/blog/2025/6/7/principles-for-using-ai-autodidactically/"&gt;autodidactic learning with LLMs&lt;/a&gt;. I'm becoming more and more convinced that LLMs are a great tool for learning, but one must learn how to use them for learning, and one must &lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;earn the automation&lt;/a&gt; as well.&lt;/p&gt;
</content></entry><entry><title>Data scientists aren't becoming obsolete in the LLM era</title><link href="https://ericmjl.github.io/blog/2025/8/15/data-scientists-arent-becoming-obsolete-in-the-llm-era/" rel="alternate"/><updated>2025-08-15T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:8aed9fe5-d6f7-3703-9c7f-efb092c73a5d</id><content type="html">&lt;p&gt;I keep hearing the same question: "Are data scientists becoming obsolete now that LLMs can code?"&lt;/p&gt;
&lt;p&gt;The anxiety is understandable. When you watch Claude or ChatGPT write Python scripts, build models, and even debug code, it's natural to wonder where that leaves us. But here's what I've found after spending months integrating LLMs into my own workflow: they're not replacing us. They're fundamentally reshaping what it means to be a data scientist.&lt;/p&gt;
&lt;p&gt;To ponder this question properly, I examine it from two angles.&lt;/p&gt;
&lt;h2 id="how-are-llms-enhancing-our-existing-work"&gt;How are LLMs enhancing our existing work?&lt;/h2&gt;&lt;p&gt;The first angle is using LLMs as tools for data scientists. This means finding ways to incorporate them into our day-to-day work as consumers of LLM-powered applications.&lt;/p&gt;
&lt;p&gt;I've experienced the productivity-enhancing benefits firsthand. GitHub Copilot and Cursor have dramatically accelerated my coding. Research agents like Elicit.org help me navigate literature in ways that would have taken hours before. I use transcription tools to type faster than I can touch type by hand, getting my thoughts out of my brain closer to the speed at which I'm actually thinking. I rely on AI for cleaning up messy thoughts and as a thinking tool to help me draw out what I'm really trying to articulate.&lt;/p&gt;
&lt;p&gt;Having lived with these tools for months now, I think being proficient with AI-assisted coding is table stakes.&lt;/p&gt;
&lt;p&gt;Just as spreadsheets changed what we expected from accountants, AI assistance is now a baseline expectation. But there's a crucial skill here that goes beyond just using the tools: knowing how to use AI to verify information and catch the inevitable errors these systems make.&lt;/p&gt;
&lt;p&gt;More importantly, this is just the beginning.&lt;/p&gt;
&lt;h2 id="how-are-we-building-custom-llm-solutions"&gt;How are we building custom LLM solutions?&lt;/h2&gt;&lt;p&gt;The second angle is more profound: data scientists becoming part of the team that builds custom LLM agent workflows to accelerate others' work.&lt;/p&gt;
&lt;p&gt;Here's what this looks like in practice: You get your hands dirty with business workflows. You co-create with business partners to build new tools and ways of working that remove boring work from their plates. You build technical prototypes that prove out value, then partner with engineers for custom app builds where appropriate.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;scientist&lt;/em&gt; skill becomes crucial here: experimentation. You're figuring out whether a thing is actually working by measuring performance of LLM-based workflows and tying it back to business value. This is fundamentally different from being an app developer, a machine learning engineer, or a business analyst doing reporting and dashboards.
Those aren't really data science roles. The scientist in data science lies in hypothesizing, defining metrics and estimates, then testing and measuring them.&lt;/p&gt;
&lt;h2 id="what-does-the-scientist-in-data-scientist-mean-in-the-llm-era"&gt;What does the 'scientist' in 'data scientist' mean in the LLM era?&lt;/h2&gt;&lt;p&gt;Taking Hamel Husain and Shreya Shankar's course on LLM evaluation crystallized this for me. I'm much more convinced that the role of a data scientist is to measure, evaluate, and design metrics. It's going back to the science.&lt;/p&gt;
&lt;p&gt;Think about the parallel here. In discovery science, data scientists work with laboratory scientists and statisticians to hypothesize about relationships between molecular structure and biological activity, then together define what estimands we need to measure the performance of biological or chemical systems. They build machine learning models to predict those estimands from sequence and structure, test the hypotheses, and measure whether they hold. The estimands matter because they connect to whether a drug works or a process is optimized.&lt;/p&gt;
&lt;p&gt;With LLM applications automating business processes, it's analogous, but the stakes are operational performance. You hypothesize that a particular LLM workflow will improve efficiency or accuracy. You define evaluation metrics—the equivalent of the assays you measure in lab science. You design experiments to test whether your hypothesis about the LLM's impact is correct. You build automation around measurement to continuously validate whether your hypotheses about improved workflows are actually playing out.&lt;/p&gt;
&lt;p&gt;In both contexts, you hypothesize, define, test, and measure. &lt;strong&gt;That's what a scientist does!&lt;/strong&gt;&lt;/p&gt;
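&lt;p&gt;To make that loop concrete, here is a minimal sketch of what "metric plus measurement" can look like in code. Everything here is hypothetical - the workflow, the labeled cases, and the metric are stand-ins for whatever your business problem demands, not a prescribed framework:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sketch: treat an LLM workflow like an experiment.
# Define a metric, run the workflow over labeled cases, and measure.
def accuracy(workflow, cases):
    """Fraction of labeled cases where the workflow's output matches the label."""
    hits = sum(workflow(text) == label for text, label in cases)
    return hits / len(cases)

# The labeled cases encode the hypothesis: "this workflow routes tickets correctly"
cases = [("refund request", "billing"), ("password reset", "support")]
# accuracy(my_triage_workflow, cases)  # my_triage_workflow is hypothetical
&lt;/code&gt;&lt;/pre&gt;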
&lt;p&gt;In what I'd describe as a meta move, data scientists should absolutely be experimenting with LLMs to create LLM-based tooling for their own work. We're uniquely positioned to understand both the technical possibilities and the measurement challenges these systems present.&lt;/p&gt;
&lt;h2 id="why-this-matters-more-than-ever"&gt;Why this matters more than ever&lt;/h2&gt;&lt;p&gt;This role differs fundamentally from what people might think we should become. We're not primarily app developers (that should be for software developers), even if we might ship and app or two out of necessity. We're not machine learning engineers building complex production pipelines (though we should be able to ship components that get stitched together on platforms). We're not business analysts doing reporting and dashboards, even if we do build visualizations to help with communication.&lt;/p&gt;
&lt;p&gt;Rather, we're scientists who hypothesize, define metrics, design estimates, test our ideas, and measure whether things work.&lt;/p&gt;
&lt;p&gt;Instead of making data scientists obsolete, the LLM era is returning us to our scientific roots while giving us incredibly powerful tools to work with. We're becoming builders of measurement systems that work at the intersection of business value and statistical rigor.&lt;/p&gt;
&lt;p&gt;I'd strongly encourage you to try both angles: become proficient with LLM tools for your daily work, and start experimenting with building custom LLM workflows for your organization. The beauty of this approach is that you're amplifying your ability to hypothesize what might work, define what matters, and measure whether it's actually working.&lt;/p&gt;
</content></entry><entry><title>Stop guessing at priors: R2D2's automated approach to Bayesian modeling</title><link href="https://ericmjl.github.io/blog/2025/8/6/stop-guessing-at-priors-r2d2s-automated-approach-to-bayesian-modeling/" rel="alternate"/><updated>2025-08-06T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:1426f8db-c4c6-3abf-8ecd-7449e42852d4</id><content type="html">&lt;p&gt;When I first encountered the R2D2 (R²-induced Dirichlet Decomposition) framework (Zhang et al., 2020), I was struck by its intuitive approach to Bayesian regularization. Instead of placing priors on individual regression coefficients and hoping for the best, R2D2 lets you directly specify your beliefs about how much variance the model should explain. But what really fascinated me was how the framework elegantly extends from simple linear regression to complex multilevel models through a series of principled modifications.&lt;/p&gt;
&lt;p&gt;This post documents my journey understanding the progression from the basic R2D2 shrinkage prior to its sophisticated multilevel variant (R2D2M2), with stops along the way to explore generalized linear models. What emerged was a beautiful mathematical architecture where each extension builds naturally on the previous.&lt;/p&gt;
&lt;h2 id="the-foundation-r2d2-shrinkage-prior"&gt;The foundation: R2D2 shrinkage prior&lt;/h2&gt;&lt;p&gt;The journey begins with the elegant insight that motivated the original R2D2 framework: why not place a prior directly on the coefficient of determination (R²) rather than fumbling with individual coefficient priors? The challenge with individual coefficient priors isn't just knowing where to center them, but defining appropriate variance parameters - it's remarkably difficult to know a priori how much variability each coefficient should have.&lt;/p&gt;
&lt;h3 id="the-core-mathematical-insight"&gt;The core mathematical insight&lt;/h3&gt;&lt;p&gt;For any model, R² represents the proportion of output variance that can be explained:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;R² = explained variance / total variance = W / (W + σ²)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rearranging this relationship shows us what W actually represents:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;R² = W / (W + σ²)
R²(W + σ²) = W
R²W + R²σ² = W
R²σ² = W - R²W = W(1 - R²)
W = σ² * (R² / (1 - R²))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reveals that W is the &lt;strong&gt;total explained variance&lt;/strong&gt; (on the data scale), which equals the signal-to-noise ratio multiplied by the noise variance. Let's define the signal-to-noise ratio as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;τ² = R² / (1 - R²)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;W = σ² * τ²
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives us an interpretable scale (the conversion helpers after this list make the numbers easy to check):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;τ² = 1&lt;/strong&gt;: Signal equals noise (R² = 0.5)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;τ² = 4&lt;/strong&gt;: Signal is 4 times stronger than noise (R² = 0.8)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;τ² = 0.25&lt;/strong&gt;: Noise is 4 times stronger than signal (R² = 0.2)&lt;/li&gt;
&lt;/ul&gt;
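&lt;p&gt;Since τ² and R² are monotone transforms of each other, converting between them is a one-liner in each direction. Here's a small sketch (plain Python, nothing PyMC-specific) for checking the numbers above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def r2_to_tau2(r2):
    """Signal-to-noise ratio implied by a given R²."""
    return r2 / (1 - r2)

def tau2_to_r2(tau2):
    """R² implied by a given signal-to-noise ratio."""
    return tau2 / (1 + tau2)

print(round(r2_to_tau2(0.8), 3))   # 4.0
print(round(tau2_to_r2(0.25), 3))  # 0.2
&lt;/code&gt;&lt;/pre&gt;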
&lt;p&gt;The R2D2 framework starts by placing a Beta prior on R²:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tau_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tau_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Zhang et al. show that when R² has a Beta(a,b) prior, the induced prior density for τ² = R²/(1-R²) follows a Beta Prime distribution BP(a,b), giving us intuitive control over model fit through the familiar Beta hyperparameters.&lt;/p&gt;
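&lt;p&gt;This result is easy to sanity-check numerically. Here's a quick Monte Carlo sketch (using scipy; the hyperparameter values are arbitrary) comparing quantiles of τ² = R²/(1-R²) against the corresponding Beta Prime distribution:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy import stats

a, b = 2.0, 4.0
r2 = stats.beta(a, b).rvs(size=100_000, random_state=42)
tau2 = r2 / (1 - r2)

# Empirical quantiles of tau² should match BetaPrime(a, b) quantiles
print(np.quantile(tau2, [0.25, 0.5, 0.75]))
print(stats.betaprime(a, b).ppf([0.25, 0.5, 0.75]))
&lt;/code&gt;&lt;/pre&gt;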
&lt;h3 id="allocating-explained-variance-the-dirichlet-decomposition"&gt;Allocating explained variance: the Dirichlet decomposition&lt;/h3&gt;&lt;p&gt;But here's where R2D2 gets clever. Instead of requiring the modeler to manually specify variance parameters for each predictor's prior, it uses a &lt;strong&gt;Dirichlet decomposition&lt;/strong&gt; to automatically allocate the total explained variance W across predictors:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# W is the total explained variance to allocate&lt;/span&gt;
&lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;phi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a_pi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predictors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lambda_j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda_j&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predictors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This means &lt;code&gt;φⱼ × W = λⱼ&lt;/code&gt; answers the question: &lt;em&gt;"What fraction of the total explained variance does predictor j get?"&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: If τ² = 4 (signal is 4 times stronger than noise) and σ² = 2, then W = 8, and if φ = [0.5, 0.3, 0.2], then (see the code after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Predictor 1: λ₁ = 0.5 × 8 = 4 (gets 50% of explained variance)&lt;/li&gt;
&lt;li&gt;Predictor 2: λ₂ = 0.3 × 8 = 2.4 (gets 30% of explained variance)&lt;/li&gt;
&lt;li&gt;Predictor 3: λ₃ = 0.2 × 8 = 1.6 (gets 20% of explained variance)&lt;/li&gt;
&lt;/ul&gt;
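&lt;p&gt;The same arithmetic in code, for anyone who wants to fiddle with the numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sigma2, tau2 = 2.0, 4.0
W = sigma2 * tau2                            # total explained variance: 8.0
phi = [0.5, 0.3, 0.2]
lambdas = [round(share * W, 2) for share in phi]
print(W, lambdas)                            # 8.0 [4.0, 2.4, 1.6]
&lt;/code&gt;&lt;/pre&gt;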
&lt;p&gt;As Zhang et al. describe, this creates adaptive behavior where "the heavy tail reduces the bias in estimation of large coefficients, while the high concentration around zero shrinks the irrelevant coefficients heavily to zero, thus reducing the noise" - factors that explain a lot of the output variance get allocated more of the total explained variance (larger λⱼ values), while factors that don't explain much output variance get allocated less explained variance (smaller λⱼ values).&lt;/p&gt;
&lt;h3 id="the-r2d2-model"&gt;The R2D2 model&lt;/h3&gt;&lt;p&gt;Bringing these pieces together - the R² prior, the Dirichlet variance allocation, and the coefficient distributions - we get the R2D2 model:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Noise level (data scale)&lt;/span&gt;
    &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HalfNormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sigma&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# R² prior (intuitive model fit control)&lt;/span&gt;
    &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Signal-to-noise ratio&lt;/span&gt;
    &lt;span class="n"&gt;tau_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tau_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Total explained variance&lt;/span&gt;
    &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;W&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Local variance allocation (competitive)&lt;/span&gt;
    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;phi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_pi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predictors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lambda_j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda_j&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predictors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Coefficients with allocated variance&lt;/span&gt;
    &lt;span class="n"&gt;scale_j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lambda_j&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# For Laplace priors&lt;/span&gt;
    &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Laplace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;beta&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scale_j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predictors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Standard linear likelihood&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;obs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The beauty of this approach lies in the competitive nature of the Dirichlet allocation: all predictors compete for the total explained variance W. If one predictor becomes more important (higher φⱼ), others must become less important. This creates natural sparsity and prevents overfitting. The signal-to-noise ratio τ² provides intuitive control over model complexity, while W gives us the actual variance scale for coefficient priors.&lt;/p&gt;
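&lt;p&gt;You can get a feel for how the Dirichlet concentration parameter shapes this competition by drawing a few samples. This sketch (plain NumPy, with arbitrary values) shows that small concentrations produce sparse allocations where a few predictors grab most of the budget, while large concentrations spread it evenly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
for a_pi in (0.1, 1.0, 10.0):
    phi = rng.dirichlet(np.full(5, a_pi))
    # Each draw sums to 1: the predictors compete for a fixed budget
    print(a_pi, np.round(phi, 2))
&lt;/code&gt;&lt;/pre&gt;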
&lt;h2 id="first-extension-r2d2-for-generalized-linear-models"&gt;First extension: R2D2 for generalized linear models&lt;/h2&gt;&lt;p&gt;The first major challenge came when extending R2D2 to non-Gaussian outcomes. Yanchenko et al. (2021) tackled this problem by developing clever approximation methods that preserve the intuitive R² interpretation. The beautiful relationship &lt;code&gt;R² = W/(W+σ²)&lt;/code&gt; that made everything work cleanly suddenly becomes complex when dealing with Poisson counts, binary outcomes, or other GLM families.&lt;/p&gt;
&lt;h3 id="the-challenge-no-more-simple-s2"&gt;The challenge: no more simple σ²&lt;/h3&gt;&lt;p&gt;In GLMs, the "noise" isn't a simple σ² anymore. Instead, we have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Poisson&lt;/strong&gt;: Variance equals the mean (&lt;code&gt;σ²(η) = e^η&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Binomial&lt;/strong&gt;: Variance depends on probability (&lt;code&gt;σ²(η) = μ(η)[1-μ(η)]&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gaussian&lt;/strong&gt;: Still simple (&lt;code&gt;σ²(η) = σ²&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This breaks our clean R² = W/(W+σ²) relationship because now both the signal and noise are functions of the linear predictor η.&lt;/p&gt;
&lt;h3 id="the-elegant-solution-linear-approximation"&gt;The elegant solution: linear approximation&lt;/h3&gt;&lt;p&gt;The GLM extension uses a brilliant linear approximation approach. As Yanchenko et al. describe, "applying a first-order Taylor series approximation of μ(η) and σ²(η) around β₀" allows them to handle the GLM complexity. We approximate the complex GLM relationship around the intercept β₀:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;R² ≈ W/(W + s²(β₀))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;s²(β₀) = σ²(β₀)/[μ'(β₀)]²&lt;/code&gt; is the "effective noise" for each GLM family (a small numeric sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gaussian&lt;/strong&gt;: &lt;code&gt;s²(β₀) = σ²&lt;/code&gt; (no change needed!)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poisson&lt;/strong&gt;: &lt;code&gt;s²(β₀) = e^{-β₀}&lt;/code&gt; (depends on baseline rate)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logistic&lt;/strong&gt;: &lt;code&gt;s²(β₀) = μ(β₀)(1-μ(β₀))/[μ'(β₀)]²&lt;/code&gt; (depends on baseline probability)&lt;/li&gt;
&lt;/ul&gt;
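&lt;p&gt;As a sketch, the family-specific effective noise can be written as one small function. This is plain NumPy following the formulas above; inside a PyMC model you would use pytensor operations instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def effective_noise(beta0, family, sigma=1.0):
    """First-order effective noise s²(β₀) for a few GLM families."""
    if family == "gaussian":
        return sigma**2
    if family == "poisson":
        # σ²(β₀) = μ'(β₀) = e^β₀, so s² = e^β₀ / (e^β₀)² = e^(-β₀)
        return np.exp(-beta0)
    if family == "binomial":
        mu = 1 / (1 + np.exp(-beta0))
        mu_prime = mu * (1 - mu)
        return mu * (1 - mu) / mu_prime**2  # simplifies to 1 / (μ(1-μ))
    raise ValueError(f"unknown family: {family}")
&lt;/code&gt;&lt;/pre&gt;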
&lt;h3 id="what-this-achieves"&gt;What this achieves&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;The genius&lt;/strong&gt;: We keep all the interpretability and mathematical structure of the linear R2D2 case, but just compute a smarter "noise" term that respects the GLM family's variance structure.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Same intuitive R² prior!&lt;/span&gt;
    &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# GLM-specific &amp;quot;effective noise&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;poisson&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s_sq&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;beta0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;binomial&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exp_beta0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mu_beta0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp_beta0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;exp_beta0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mu_prime_beta0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp_beta0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;exp_beta0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s_sq&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu_beta0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mu_beta0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mu_prime_beta0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Same competitive allocation structure!&lt;/span&gt;
    &lt;span class="n"&gt;tau_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tau_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;W&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s_sq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;phi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xi0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;components&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The elegance of this approach becomes clear when we step back and see what's happening conceptually. We're essentially asking "what would σ² be if this GLM were actually a linear model?" and using that as our effective noise term. This preserves all the intuitive benefits of R2D2 while handling GLM complexity. The signal-to-noise ratio τ² remains the same intuitive control parameter, while W adapts to the GLM's variance structure.&lt;/p&gt;
&lt;h2 id="the-great-leap-r2d2m2-for-multilevel-models"&gt;The great leap: R2D2M2 for multilevel models&lt;/h2&gt;&lt;p&gt;The most sophisticated extension addresses the challenge of multilevel models with multiple grouping factors - the kind of complex experimental designs common in laboratory research. Aguilar &amp;amp; Bürkner (2022) developed the R2D2M2 prior to handle this complexity while preserving the intuitive variance decomposition interpretation.&lt;/p&gt;
&lt;h3 id="the-multilevel-challenge"&gt;The multilevel challenge&lt;/h3&gt;&lt;p&gt;Consider a laboratory experiment with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Predictors&lt;/strong&gt;: Gene expression, Age, Treatment dose&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grouping factors&lt;/strong&gt;: Mouse ID, MicroRNA ID, Stress condition&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Traditional approaches assign independent priors to each effect:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Traditional (problematic) approach&lt;/span&gt;
&lt;span class="n"&gt;beta_gene&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ_gene&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;beta_age&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ_age&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mouse_effects&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ_mouse&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;microRNA_effects&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ_microRNA&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stress_effects&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ_stress&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: As you add more predictors and grouping factors, the implied R² prior becomes increasingly concentrated near 1 (the maximum possible R² value). This happens because each additional effect adds its own independent variance contribution, causing the total expected explained variance to grow without bound, leading to overfitting-prone models that expect near-perfect fit a priori.&lt;/p&gt;
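&lt;p&gt;A quick simulation makes the problem vivid. With independent unit-variance priors on standardized predictors, the implied prior R² marches toward 1 as effects accumulate (illustrative sketch, arbitrary sizes):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
n_obs, sigma = 1_000, 1.0
for p in (2, 10, 50):
    X = rng.normal(size=(n_obs, p))
    beta = rng.normal(0.0, 1.0, size=p)   # independent priors, one per effect
    signal = X @ beta
    r2 = signal.var() / (signal.var() + sigma**2)
    print(p, round(r2, 2))                # roughly p / (p + 1): 0.67, 0.91, 0.98
&lt;/code&gt;&lt;/pre&gt;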
&lt;h3 id="the-r2d2m2-solution-type-level-variance-allocation"&gt;The R2D2M2 solution: type-level variance allocation&lt;/h3&gt;&lt;p&gt;The key insight from Aguilar &amp;amp; Bürkner is that R2D2M2 extends the Dirichlet decomposition to handle multiple &lt;strong&gt;types&lt;/strong&gt; of effects while preserving hierarchical pooling:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Component calculation for laboratory data&lt;/span&gt;
&lt;span class="n"&gt;n_components&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_predictors&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_grouping_factors&lt;/span&gt;
&lt;span class="n"&gt;n_components&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;  &lt;span class="c1"&gt;# gene_expr + age + dose + mouse + microRNA + stress&lt;/span&gt;

&lt;span class="n"&gt;component_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;population_gene_expr&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Population-level effects&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;population_age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;population_dose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;mouse_intercepts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Group-specific intercept types&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;microRNA_intercepts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;stress_intercepts&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The key innovation here is subtle but powerful: instead of allocating variance to individual groups (Mouse 1, Mouse 2, etc.), we allocate variance to &lt;strong&gt;types&lt;/strong&gt; of effects. All mice share one variance prior, all microRNAs share another, etc.&lt;/p&gt;
&lt;h3 id="the-complete-r2d2m2-framework"&gt;The complete R2D2M2 framework&lt;/h3&gt;&lt;p&gt;Let's see how this all comes together in practice. The R2D2M2 model combines the R² prior, the extended Dirichlet allocation, and the hierarchical variance structure:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Same intuitive R² control&lt;/span&gt;
    &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;r_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha_r2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;beta_r2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tau_squared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tau_squared&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Extended Dirichlet allocation across ALL effect types&lt;/span&gt;
    &lt;span class="n"&gt;phi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;phi&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concentration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;components&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Population-level effects - each gets its own φ component&lt;/span&gt;
    &lt;span class="n"&gt;beta_gene&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;beta_gene&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;beta_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;beta_age&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;beta_dose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;beta_dose&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Group-specific intercepts - each type gets its own φ component&lt;/span&gt;
    &lt;span class="n"&gt;mouse_intercepts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mouse_intercepts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mice&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;microRNA_intercepts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;microRNA_intercepts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;microRNAs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stress_intercepts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stress_intercepts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                 &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma_squared&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;phi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tau_squared&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                 &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stress_conditions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Linear predictor combining all effects&lt;/span&gt;
    &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_gene&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gene_expr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta_age&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;beta_dose&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dose&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;mouse_intercepts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mouse_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;microRNA_intercepts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;microRNA_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
           &lt;span class="n"&gt;stress_intercepts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stress_conditions&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now that we've seen the mathematical structure, let's understand what makes this approach so effective.&lt;/p&gt;
&lt;h3 id="why-this-works-so-well"&gt;Why this works so well&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Hierarchical pooling preserved&lt;/strong&gt;: Individual mice still borrow strength from each other because they share the same variance component. Mouse A and Mouse B both use &lt;code&gt;mouse_scale&lt;/code&gt;, but have different intercept values.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic factor importance&lt;/strong&gt;: The φ allocation tells you which experimental factors matter most. If φ = [0.15, 0.25, 0.05, 0.35, 0.15, 0.05], then mouse differences account for 35% of total explained variance - more than any single predictor!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalable complexity&lt;/strong&gt;: Works with any number of crossed or nested grouping factors without parameter explosion.&lt;/p&gt;
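&lt;p&gt;Reading factor importance off a fitted model then amounts to printing the φ allocation. Here's a sketch using the illustrative numbers above (in practice you would substitute the posterior mean of &lt;code&gt;phi&lt;/code&gt; from your trace):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

component_names = ["gene_expr", "age", "dose", "mouse", "microRNA", "stress"]
phi_mean = np.array([0.15, 0.25, 0.05, 0.35, 0.15, 0.05])  # illustrative values
# Rank effect types by their share of the explained variance
for name, share in sorted(zip(component_names, phi_mean), key=lambda t: -t[1]):
    print(f"{name}: {share:.0%} of explained variance")
&lt;/code&gt;&lt;/pre&gt;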
&lt;h2 id="the-unified-architecture"&gt;The unified architecture&lt;/h2&gt;&lt;p&gt;What strikes me most about this progression is how each extension elegantly handles new complexity:&lt;/p&gt;
&lt;p&gt;All three approaches maintain &lt;strong&gt;consistent R² control&lt;/strong&gt;, letting you directly specify beliefs about model fit through the same intuitive Beta prior on R². The competitive variance allocation through the Dirichlet mechanism creates healthy competition between effects across all approaches, preventing any single component from dominating. This leads to highly interpretable results - every approach produces φ components that directly tell you "what percentage of explained variance does each effect contribute?"&lt;/p&gt;
&lt;p&gt;The mathematical elegance is striking: each extension modifies just what needs to change. The GLM extension changes the noise term (σ² → s²(β₀)), while the M2 extension extends the allocation to multiple effect types. Finally, all approaches provide the same practical benefits - automatic shrinkage, sparsity induction, and protection against overfitting while maintaining computational tractability.&lt;/p&gt;
&lt;h2 id="when-to-use-what"&gt;When to use what&lt;/h2&gt;&lt;p&gt;Given these unified principles, how do you choose which approach fits your specific modeling scenario? Through this exploration, clear use cases emerged:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R2D2 Shrinkage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple linear regression with multiple predictors, no grouping&lt;/td&gt;
&lt;td&gt;Gene expression ~ drug dose + age + weight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R2D2 GLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-Gaussian outcomes with simple structure&lt;/td&gt;
&lt;td&gt;Bacterial counts, binary outcomes, rate data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R2D2M2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex laboratory designs with multiple grouping factors (&lt;strong&gt;the laboratory default&lt;/strong&gt;)&lt;/td&gt;
&lt;td&gt;Laboratory experiments with mouse ID + microRNA ID + stress condition&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="looking-forward"&gt;Looking forward&lt;/h2&gt;&lt;p&gt;R2D2 solves a common frustration in Bayesian modeling: how do you set reasonable priors on dozens of coefficients without spending hours tweaking hyperparameters? Instead of guessing at individual coefficient priors, you specify one intuitive parameter - how much of the data variation you expect your model to explain - and R2D2 automatically figures out how to distribute that explanatory power across your predictors.&lt;/p&gt;
&lt;p&gt;For laboratory researchers especially, R2D2M2 delivers actionable scientific insight. When your model tells you that "mouse differences account for 35% of explained variance while stress conditions only account for 5%," you immediately know where to focus your experimental design efforts.&lt;/p&gt;
&lt;p&gt;This practical approach - starting with an intuitive question about model fit and letting the mathematics handle the details - shows how thoughtful statistical frameworks can make sophisticated modeling more accessible to working scientists. The PyMC library has implemented a modified form of R2D2M2 as the &lt;a href="https://www.pymc.io/projects/extras/en/stable/generated/pymc_extras.distributions.R2D2M2CP.html"&gt;&lt;code&gt;R2D2M2CP&lt;/code&gt; distribution&lt;/a&gt;, making these powerful priors readily available for practical use.&lt;/p&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Zhang, Y. D., Naughton, B. P., Bondell, H. D., &amp;amp; Reich, B. J.&lt;/strong&gt; (2020). Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior. &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;, 117(538), 862-874. &lt;a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1825449"&gt;https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1825449&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Yanchenko, E., Bondell, H. D., &amp;amp; Reich, B. J.&lt;/strong&gt; (2021). The R2D2 Prior for Generalized Linear Mixed Models. &lt;em&gt;arXiv preprint arXiv:2111.10718&lt;/em&gt;. &lt;a href="https://arxiv.org/abs/2111.10718"&gt;https://arxiv.org/abs/2111.10718&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Aguilar, J. &amp;amp; Bürkner, P.&lt;/strong&gt; (2022). Intuitive Joint Priors for Bayesian Linear Multilevel Models: The R2D2M2 prior. &lt;em&gt;arXiv preprint arXiv:2208.07132&lt;/em&gt;. &lt;a href="https://arxiv.org/abs/2208.07132"&gt;https://arxiv.org/abs/2208.07132&lt;/a&gt;&lt;/p&gt;
</content></entry><entry><title>From nerd-sniped to shipped using AI as a thinking tool</title><link href="https://ericmjl.github.io/blog/2025/7/21/from-nerd-sniped-to-shipped-using-ai-as-a-thinking-tool/" rel="alternate"/><updated>2025-07-21T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:28012e93-0cdf-3534-aa24-409f9d057e10</id><content type="html">&lt;p&gt;What if I told you I shipped a complex feature rewrite in just two days using AI as a design partner?&lt;/p&gt;
&lt;p&gt;Before you roll your eyes at another "AI did everything for me" story, here's the catch: those two days were only possible because I spent months doing the hard work of earning that automation. Fresh off being thoroughly nerd-sniped by Joe Cheng (Posit PBC's CTO) at SciPy 2025, I found myself on a plane with a mission: finally implement robust graph-based memory for my Llamabot project.&lt;/p&gt;
&lt;p&gt;What happened next taught me everything about the difference between delegating thinking to AI versus using AI to amplify your thinking. The key insight? You have to earn your automation first.&lt;/p&gt;
&lt;h2 id="first-i-had-to-struggle-and-that-was-the-point"&gt;First, I had to struggle (and that was the point)&lt;/h2&gt;&lt;p&gt;The timeline is crucial to understanding why this approach worked. For four months, I'd been mulling over how graph-based memory for LLM applications could work. Then Joe read my Llamabot code (which at the time didn't have graph-based memory), we chatted, and I got completely nerd-sniped. Over the next few days, I decided I had to make graph memory happen, so I finally built a working prototype during my week in Seattle for work. (All on my personal laptop, keeping work and personal projects separate.)&lt;/p&gt;
&lt;p&gt;What made this experience transformative was having Joe look at my code. Here was someone I'd never met taking such a thorough look at my design choices - I was deeply impressed by how careful and thoughtful he was. That validation convinced me: it was time to do this right.&lt;/p&gt;
&lt;p&gt;But here's what mattered most: my prototype was fragile. Things were very intertwined with one another. Because everything was so coupled, I was naturally feeling the difficulty in making any changes. This hands-on struggle was teaching me exactly what needed to be separated and how to think about the architecture.&lt;/p&gt;
&lt;p&gt;This struggle wasn't wasted time - it was earning my automation.&lt;/p&gt;
&lt;h2 id="why-struggling-first-was-essential"&gt;Why struggling first was essential&lt;/h2&gt;&lt;p&gt;This connects directly to &lt;a href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/"&gt;an earlier blog post I wrote&lt;/a&gt; about earning your automation. I wouldn't have been able to critique AI the way I did if I hadn't first developed taste through hands-on struggle.&lt;/p&gt;
&lt;p&gt;That initial prototype work - building something fragile but functional by hand - gave me the judgment needed to meaningfully critique AI's suggestions. Without that foundation, I would have been delegating critical thinking to AI instead of using it as a thinking partner.&lt;/p&gt;
&lt;p&gt;The prototype taught me what worked, what didn't, and most importantly, what the real problems were that needed solving. When AI later proposed architectural changes, I could evaluate them against my lived experience of the pain points.&lt;/p&gt;
&lt;p&gt;This preparation set the stage for what happened next on that plane ride home.&lt;/p&gt;
&lt;h2 id="then-i-unleashed-ai-as-a-design-partner"&gt;Then I unleashed AI as a design partner&lt;/h2&gt;&lt;p&gt;At SEA-TAC airport with three hours until boarding, I decided this was it. Time to compress all my implementation work into a focused sprint. But instead of jumping straight into coding, I started with what felt like a radical approach: a pure design phase.&lt;/p&gt;
&lt;p&gt;(Now, to be clear, it's not exactly radical - lots of people have said you should write requirements first. But most vibe coders don't actually follow this practice.)&lt;/p&gt;
&lt;p&gt;I asked AI to critique my existing prototype and propose a new architecture. What followed was intense iteration on a design document right there in the airport. I did try to continue on the plane, but JetBlue's spotty Wi-Fi made that unproductive. Most of the design thinking and iteration happened during those airport hours - no code written yet, just pure design thinking.&lt;/p&gt;
&lt;p&gt;AI proposed an interface with chat memory at the high level, with separate graph memory and list memory structures underneath. It included a visualization module (originally tailored for graphs) and a node selector module for intelligent node selection. The design doc grew to at least 400-500 lines of markdown.&lt;/p&gt;
&lt;p&gt;The beauty of this approach? I could look at the prospective code in markdown blocks and play through scenarios in my head. How would someone use this API? How would the internals work? By asking very specific "how" questions, I could probe deeper and make sure I truly understood and agreed with every design choice.&lt;/p&gt;
&lt;p&gt;The major breakthrough came when I scrutinized the design and asked: why do we have two chat memory implementations, one for linear memory and one for graph memory?&lt;/p&gt;
&lt;p&gt;The natural follow-up hit me: lists are just linear graphs, so why do I need two separate structures? I can just have one that defaults to a linear graph, and then use an LLM for intelligent node selection in the threaded case.&lt;/p&gt;
&lt;p&gt;So I generalized everything to use NetworkX graphs underneath, with intelligent node selection for threaded memory. This single insight simplified the entire architecture.&lt;/p&gt;
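&lt;p&gt;To make that insight concrete, here's a minimal sketch of the idea - one graph-backed memory where "linear" is just the default special case. The class and method names here are illustrative, not Llamabot's actual API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: one memory structure; a list is just the linear special case.
# Names are illustrative, not Llamabot's actual API.
import networkx as nx


class GraphMemory:
    def __init__(self):
        self.graph = nx.DiGraph()
        self._counter = 0

    def append(self, message, parent=None):
        """Add a message node; by default, attach it to the latest node (linear)."""
        node_id = self._counter
        self._counter += 1
        self.graph.add_node(node_id, message=message)
        if parent is None and node_id:
            parent = node_id - 1  # default parent: the previous node, i.e. a path graph
        if parent is not None:
            self.graph.add_edge(parent, node_id)
        return node_id


memory = GraphMemory()
a = memory.append("What is a graph?")
memory.append("A graph is a set of nodes and edges.")
memory.append("Wait - what's an edge again?", parent=a)  # threaded: branch off node a
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the linear case this produces a path graph; in the threaded case, an LLM (or a human) picks the parent node - and the same structure handles both.&lt;/p&gt;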
&lt;p&gt;This is exactly what I mean about earning your automation - I could inject my own opinions into the design because I understood the problem space. We were iterating on a design doc, not just generating code.&lt;/p&gt;
&lt;h2 id="the-real-power-ai-as-a-critical-thinking-amplifier"&gt;The real power: AI as a critical thinking amplifier&lt;/h2&gt;&lt;p&gt;Here's where things got really powerful. After creating that 400-500 line design document, I had way too much detail to synthesize mentally. Time to leverage one of AI's core strengths: knowledge retrieval and pattern matching.&lt;/p&gt;
&lt;p&gt;I commanded the AI: "Go look for any inconsistencies you can see within the doc. Pick out all inconsistencies and surface them for me."&lt;/p&gt;
&lt;p&gt;This is where the magic happened. AI surfaced seven or eight inconsistencies, some I agreed with, others I dismissed as inconsequential. But because I'd just reviewed everything, it was all fresh in my mind - I could make informed decisions about each point.&lt;/p&gt;
&lt;p&gt;Then I asked it to check one more time: "Double check for me. Do you see any more inconsistencies?"&lt;/p&gt;
&lt;p&gt;Now, I wasn't fully offloading this work to AI. I was still doing synthesis in my head, trying to catch things myself. In fact, I caught an inconsistency the AI missed: between the documentation and the API, I was sometimes using &lt;code&gt;bot.memory&lt;/code&gt; and sometimes &lt;code&gt;bot.chat_memory&lt;/code&gt; - a drift that had crept in as I continually refined and reviewed the documentation.&lt;/p&gt;
&lt;p&gt;The key insight here is about inversion - one of the core skills of critical thinking. The usual lazy pattern is to just assume things are correct (what I call "vibe coding"). But with AI assistance, we should invert and ask, "What if it's not correct?"&lt;/p&gt;
&lt;p&gt;If it's not correct, the logical follow-up becomes: can I get AI to tell me where it's wrong? This combines inversion with one of AI's key strengths - knowledge retrieval. Yes, AI struggles with needle-in-haystack problems, but for big needles in smaller haystacks? It's incredibly powerful.&lt;/p&gt;
&lt;p&gt;The "needle" here is: where am I self-contradictory? Where am I discordant? Where is my design not self-coherent? All those assumptions I might have about text-based work can be checked using AI as a tool for critical thinking.&lt;/p&gt;
&lt;p&gt;If in doubt, always invert - and now we have a lightning-fast tool for helping us do exactly that.&lt;/p&gt;
&lt;p&gt;This principle became the foundation for everything that followed.&lt;/p&gt;
&lt;h2 id="putting-the-method-into-practice-tests-first-then-code"&gt;Putting the method into practice: tests first, then code&lt;/h2&gt;&lt;p&gt;With the design doc solid, it was time for the next phase. I told the AI: "Go write the tests. Write all the tests. Follow the directory structure. Make sure the test structure matches what you're proposing."&lt;/p&gt;
&lt;p&gt;I reviewed every single test - lots of code review. But here's what's cool about AI-generated tests: they don't tend to be complicated. They're usually on the simpler side. I rarely see property-based tests written with a tool like Hypothesis; instead, I see example-based tests.&lt;/p&gt;
&lt;p&gt;As a first pass, example-based tests are perfect - they're concrete, easy to grasp, and I can have confidence that if the test is testing what I think it should test, then it'll pass when the implementation is written.&lt;/p&gt;
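&lt;p&gt;For illustration, here's the flavor I mean, written against the hypothetical GraphMemory sketch from earlier - concrete inputs, concrete assertions, nothing parameterized:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Example-based tests: concrete scenarios, no Hypothesis-style properties.
# These target the illustrative GraphMemory sketch from earlier.
def test_append_defaults_to_linear_memory():
    memory = GraphMemory()
    first = memory.append("hello")
    second = memory.append("world")
    assert memory.graph.has_edge(first, second)


def test_append_with_explicit_parent_creates_a_branch():
    memory = GraphMemory()
    root = memory.append("hello")
    memory.append("world")
    branch = memory.append("actually, back up", parent=root)
    assert memory.graph.has_edge(root, branch)
&lt;/code&gt;&lt;/pre&gt;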
&lt;p&gt;The test review process was lightning-fast because I was so grounded in what the code was supposed to do. The design doc grounded the tests, the tests would ground the implementation. Each layer validated the next. This is the "earn your automation" principle in action - I could review tests quickly because I understood what the code should do.&lt;/p&gt;
&lt;h2 id="when-things-break-and-why-that-s-exactly-what-you-want"&gt;When things break (and why that's exactly what you want)&lt;/h2&gt;&lt;p&gt;When I finally had AI generate the implementation code and ran the tests, a lot failed - and I was totally okay with that. The first pass had maybe 20+ failing tests, but I figured out an efficient way to iterate through them in batches.&lt;/p&gt;
&lt;p&gt;I literally copied and pasted &lt;code&gt;pytest&lt;/code&gt; output and got AI to categorize the failures by common patterns. AI is blazing fast at pattern recognition - what would take me ages to figure out was near instantaneous for AI.&lt;/p&gt;
&lt;p&gt;Categorizing the failures was key. If I could group them, I could knock out three, four, sometimes even seven failing tests with targeted code changes. Even better, sometimes the failures revealed misunderstandings - either mine about the code or the AI's about the design. This forced clarifying decisions that resolved the discordance between what the test expected versus what the code actually did.&lt;/p&gt;
&lt;p&gt;With this approach, I quickly narrowed those 20+ failing tests down to maybe three or four individual syntax errors. Finally, everything worked - all tests passed, discordances resolved, ready to ship.&lt;/p&gt;
&lt;p&gt;Remember that inversion principle I mentioned earlier? This is how it played out in practice. Instead of assuming the generated code was correct, I actively looked for where it was wrong and used AI to help categorize and fix the problems systematically.&lt;/p&gt;
&lt;h2 id="the-payoff-two-days-from-design-to-deployment"&gt;The payoff: two days from design to deployment&lt;/h2&gt;&lt;p&gt;The timeline tells the whole story. I flew on Sunday morning, starting this work while at the airport, and by Monday evening had the pull request done and up to my expectations. The entire implementation phase - from final design doc to merged pull request - took just two days.&lt;/p&gt;
&lt;p&gt;But this compressed timeline was only possible because of all the preparation: four months marinating on the idea, one week during the conference to write the prototype and let it simmer while in Seattle and Tacoma, then intense design iteration with AI assistance.&lt;/p&gt;
&lt;p&gt;This teaches us something crucial about AI-assisted development: AI doesn't replace thinking and preparation - it amplifies it. I had a crystal-clear goal of what needed shipping after all that prep work. Once I was done with the prototype phase and figuring out the actual problem, bam - two days to ship.&lt;/p&gt;
&lt;p&gt;That's incredible. But notice what made this possible: not AI magic, but AI amplifying months of preparation and struggle.&lt;/p&gt;
&lt;h2 id="what-i-actually-built-and-why-it-matters"&gt;What I actually built (and why it matters)&lt;/h2&gt;&lt;p&gt;As someone who has worked with graphs before, in my eyes, the result is beautiful. Conversations are now represented as graphs, and since I work exclusively in Marimo notebooks, I can run and view Mermaid diagrams right inline. With a Mermaid diagram in a Marimo notebook, it's incredibly powerful - I can actually jump around conversation threads using the graph as visual memory to continue probing the AI system in sophisticated ways.&lt;/p&gt;
&lt;p&gt;&lt;img src="graph-memory.webp" alt="A conversation graph rendered as a Mermaid diagram inside a Marimo notebook"&gt;&lt;/p&gt;
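&lt;p&gt;If you're curious what this looks like in practice, here's a minimal sketch: marimo ships an &lt;code&gt;mo.mermaid&lt;/code&gt; helper, and the graph-to-Mermaid conversion below is hand-rolled for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import marimo as mo
import networkx as nx

# A toy conversation graph with one thread branching off node 0.
graph = nx.DiGraph()
graph.add_edges_from([(0, 1), (1, 2), (0, 3)])

# Hand-rolled conversion to Mermaid syntax, rendered inline in the notebook.
lines = ["graph TD"]
for src, dst in graph.edges():
    lines.append(f"    n{src} --&gt; n{dst}")

mo.mermaid("\n".join(lines))
&lt;/code&gt;&lt;/pre&gt;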
&lt;p&gt;What I love about this implementation is that it's not just a technical achievement - it's become a practical thinking tool. The visual graph helps me navigate complex AI conversations and switch between threads mentally more easily.&lt;/p&gt;
&lt;p&gt;And I could only build this effectively because I'd earned the right to automate through that initial prototype struggle.&lt;/p&gt;
&lt;h2 id="how-this-approach-scales-the-power-of-ai-assisted-pair-coding"&gt;How this approach scales: the power of AI-assisted pair coding&lt;/h2&gt;&lt;p&gt;I have a hypothesis that this approach works even better with two people and an AI assistant - but not more than two, because you can't have too many cooks. At Moderna's Data Science and AI teams, we instituted pair coding early on so we could help each other and share knowledge. Yes, we get less done in the same time, but in the long run, we move much faster. This shared knowledge means I can quickly jump into someone else's codebase.&lt;/p&gt;
&lt;p&gt;Pair coding as a practice needs maintenance though - I noticed recently I was getting isolated into solo coding. But during my Seattle trip, I experienced pair coding with AI assistance alongside my colleague Dan Luu from the ML Platform Team. We were learning prompting tips from each other, and it was incredible - we had a chance to share practices for how to use AI to amplify ourselves.&lt;/p&gt;
&lt;p&gt;What used to be "here's how you write the function" became sharing how we're actually thinking. We've elevated the level at which we share knowledge. As Dan prompts the AI or I prompt the AI, we're learning how each other thinks in a way that's smooth, fluent, and not bogged down by syntax or implementation details. It operates at a higher plane than mere code.&lt;/p&gt;
&lt;p&gt;What used to require teaching syntax and implementation details now becomes sharing thinking patterns and problem-solving approaches - and learning prompting techniques from each other in real time. We've elevated the conversation.&lt;/p&gt;
&lt;h2 id="the-pattern-that-changes-everything"&gt;The pattern that changes everything&lt;/h2&gt;&lt;p&gt;What made this approach work wasn't AI magic - it was a specific sequence that amplified months of preparation into two days of execution.&lt;/p&gt;
&lt;p&gt;First, I had to struggle. Building that fragile prototype by hand taught me what the real problems were. Without that lived experience, I couldn't have meaningfully critiqued AI's suggestions or made good design decisions. You can't skip this step.&lt;/p&gt;
&lt;p&gt;Then I could partner strategically with AI. Instead of using it as a code generator, I used it as a critical thinking amplifier. The inversion principle became key - actively asking "what's wrong here?" and leveraging AI's pattern recognition to find inconsistencies and categorize problems.&lt;/p&gt;
&lt;p&gt;Finally, I followed a systematic progression: design document first, then comprehensive tests, then implementation. When tests inevitably failed, I used AI to categorize failures and fix them in batches rather than one by one.&lt;/p&gt;
&lt;p&gt;The two days it took me to ship graph memory weren't about AI being magical. They were about using AI properly after doing the hard work of earning that automation. The months of struggle weren't wasted time - they were the essential foundation that made AI partnership effective.&lt;/p&gt;
&lt;p&gt;This is how you go from vibe coding to strategic automation. Not by delegating thinking to AI, but by using AI to amplify the thinking you've already earned the right to do.&lt;/p&gt;
</content></entry><entry><title>How to use xarray for unified laboratory data storage</title><link href="https://ericmjl.github.io/blog/2025/7/15/how-to-use-xarray-for-unified-laboratory-data-storage/" rel="alternate"/><updated>2025-07-15T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:ab8f5511-7289-3d99-b9e8-026ea9a58088</id><content type="html">&lt;p&gt;What if your laboratory and machine learning-related data could be managed within a single data structure? From raw experimental measurements to computed features to model outputs, everything coordinate-aligned and ready for analysis.&lt;/p&gt;
&lt;p&gt;I've been thinking about this problem across different experimental contexts. We generate measurement data, then computed features, then model outputs, then train/test splits. Each piece typically lives in its own file, its own format, with its own indexing scheme. The cognitive overhead of keeping track of which sample corresponds to which row in which CSV is exhausting.&lt;/p&gt;
&lt;p&gt;Let me illustrate this with a microRNA expression study as a concrete example.&lt;/p&gt;
&lt;p&gt;Here's an approach that could solve this: &lt;strong&gt;store everything in a unified xarray Dataset where sample identifiers are the shared coordinate system&lt;/strong&gt;. Your experimental measurements, computed features, statistical estimates, and data splits all aligned by the same IDs. No more integer indices. No more file juggling. Just clean, coordinated data that scales to the cloud.&lt;/p&gt;
&lt;h2 id="what-s-wrong-with-traditional-laboratory-data-management"&gt;What's wrong with traditional laboratory data management?&lt;/h2&gt;&lt;p&gt;Picture this: you're three months into a microRNA expression study. You've got the following files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;expression measurements in &lt;code&gt;expression_data.csv&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;ML features in &lt;code&gt;sequence_features.parquet&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;model outputs in &lt;code&gt;model_results.h5&lt;/code&gt;, and&lt;/li&gt;
&lt;li&gt;train/test splits scattered across &lt;code&gt;train_indices.npy&lt;/code&gt; and &lt;code&gt;test_indices.npy&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each file has its own indexing scheme - some use row numbers, others use identifiers, and you're constantly writing index-matching code just to keep everything aligned.&lt;/p&gt;
&lt;p&gt;The cognitive overhead is brutal. Which microRNA corresponds to row 47 in the features file? Did you remember to filter out the same samples from both your training data and your metadata? When you subset your data for analysis, do all your indices still match?&lt;/p&gt;
&lt;p&gt;I've lost count of how many times I've seen analysis pipelines break because someone forgot to apply the same filtering to all their data files. It's not just inefficient - it's error-prone and exhausting.&lt;/p&gt;
&lt;h2 id="how-does-xarray-solve-this"&gt;How does xarray solve this?&lt;/h2&gt;&lt;p&gt;Xarray changes the game by making &lt;strong&gt;coordinates the foundation of your data structure&lt;/strong&gt;. Instead of managing separate files with separate indexing schemes, you create one unified dataset where &lt;em&gt;every piece of data knows exactly which microRNA it belongs to&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The beauty lies in the coordinate system. Each data point is labeled with meaningful coordinates: not just row numbers, but actual experimental factors like microRNA ID, treatment condition, time point, and replicate. When you slice your data, everything stays aligned automatically.&lt;/p&gt;
&lt;p&gt;This is transformative! When everything shares the same coordinate system, you can slice across any dimension and everything stays connected. Want features for specific microRNAs? The model results for those same microRNAs come along automatically.&lt;/p&gt;
&lt;h2 id="what-does-unified-data-storage-look-like"&gt;What does unified data storage look like?&lt;/h2&gt;&lt;p&gt;Let me walk you through how this works in practice. We start with a coordinate system that captures the experimental design:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Coordinates:
* mirna           (150 microRNAs: hsa-miR-1, hsa-miR-2, ...)
* treatment       (3 conditions: control, hypoxia, inflammation)
* time_point      (5 timepoints: 2h, 6h, 12h, 24h, 48h)
* replicate       (3 replicates: rep_1, rep_2, rep_3)
* cell_line       (10 cell lines: cell_line_01, cell_line_02, ...)
* experiment_date (4 dates: experiment dates)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we progressively add data that aligns with these coordinates:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Stage 1: Expression measurements&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;expression_level&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;expression_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Stage 2: Bayesian estimation results&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;mirna_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mirna_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;mirna_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mirna_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;treatment_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;treatment_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;treatment_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;treatment_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;time_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;time_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;time_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;time_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;replicate_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replicate_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;replicate_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replicate_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;cell_line_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cell_line_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;cell_line_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cell_line_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Stage 3: ML features&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;ml_features&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;feature&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;feature_matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign_coords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;nt_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gc_content&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Stage 4: Train/test splits&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;train_mask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;split_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;train_masks&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;test_mask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;split_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_masks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The magic happens when you realize that &lt;strong&gt;every piece of data is automatically aligned by the shared coordinate system&lt;/strong&gt;. Need to analyze expression patterns for microRNAs in your training set? It's just coordinate selection:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Get training mask for random 80/20 split&lt;/span&gt;
&lt;span class="n"&gt;train_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_mask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;random_80_20&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get ML features for training microRNAs&lt;/span&gt;
&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get expression data for the same microRNAs&lt;/span&gt;
&lt;span class="n"&gt;train_expression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expression_level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Everything stays connected automatically. No manual bookkeeping required.&lt;/p&gt;
&lt;h2 id="how-do-we-build-this-step-by-step"&gt;How do we build this step by step?&lt;/h2&gt;&lt;p&gt;The approach is straightforward - &lt;strong&gt;progressive data accumulation&lt;/strong&gt;. You don't need to have everything figured out upfront. Start with your core experimental data, then add layers as your analysis develops.&lt;/p&gt;
&lt;h3 id="stage-1-laboratory-measurements"&gt;Stage 1: Laboratory measurements&lt;/h3&gt;&lt;p&gt;Your foundation is the experimental data with meaningful coordinates:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Expression data automatically aligned by coordinates&lt;/span&gt;
&lt;span class="n"&gt;expression_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mirna_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;control&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;hypoxia&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;inflammation&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rep_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rep_2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rep_3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;6h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;12h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;24h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;48h&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cell_lines&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice how the coordinates directly mirror the experimental design.&lt;/p&gt;
&lt;h3 id="stage-2-bayesian-estimation"&gt;Stage 2: Bayesian estimation&lt;/h3&gt;&lt;p&gt;Add effect estimates that align with your experimental coordinates:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Bayesian effects model results&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;mirna_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mirna_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;mirna_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;mirna_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;treatment_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;treatment_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;treatment_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;treatment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;treatment_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;time_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;time_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;time_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time_point&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;time_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;replicate_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replicate_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;replicate_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;replicate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replicate_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;cell_line_effects&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cell_line_coefficients&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;cell_line_effects_std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cell_line&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cell_line_coefficient_errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The beauty is that your Bayesian effects model estimates align perfectly with your experimental design coordinates. Each experimental factor gets its own effect estimate with uncertainty, organized by the same coordinate system as your raw data.&lt;/p&gt;
&lt;h3 id="stage-3-ml-features"&gt;Stage 3: ML features&lt;/h3&gt;&lt;p&gt;Features slot right into the same coordinate system:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# ML features aligned by microRNA ID&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;ml_features&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;feature&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;feature_matrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign_coords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;nt_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_G&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;nt_C&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gc_content&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="stage-4-train/test-splits"&gt;Stage 4: Train/test splits&lt;/h3&gt;&lt;p&gt;Even data splits become part of the unified structure:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Boolean masks aligned by microRNA coordinate&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;train_mask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;split_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;train_masks&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;test_mask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mirna&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;split_type&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_masks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Progressive build = reduced cognitive load&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The beauty of this approach is that you can build it incrementally. Start with your core experimental data, then add statistical results, then ML features, then splits. Each stage builds on the previous coordinate system, so everything stays aligned automatically.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-are-the-practical-benefits"&gt;What are the practical benefits?&lt;/h2&gt;&lt;h3 id="no-more-index-juggling"&gt;No more index juggling&lt;/h3&gt;&lt;p&gt;Remember the nightmare of keeping track of which microRNA corresponds to which row in which file? That's gone. Every piece of data knows its own coordinates.&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Before: manual index matching across files&lt;/span&gt;
&lt;span class="n"&gt;expression_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expression_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;features_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mirna_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_indices&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;model_results_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After: coordinate-based selection&lt;/span&gt;
&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_mask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;random_80_20&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bulletproof-data-consistency"&gt;Bulletproof data consistency&lt;/h3&gt;&lt;p&gt;When you slice your data, everything stays aligned automatically. No more worrying about applying the same filtering to all your files.&lt;/p&gt;
&lt;h3 id="cloud-native-scaling"&gt;Cloud-native scaling&lt;/h3&gt;&lt;p&gt;Store everything in Zarr format and your unified dataset becomes cloud-native. Load it from S3, query specific slices, and everything scales seamlessly. (Note: Zarr has some limitations with certain data types like U8, but xarray supports multiple storage formats to work around these issues.)&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Save entire workflow to cloud&lt;/span&gt;
&lt;span class="n"&gt;unified_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://biodata/mirna_screen_2024.zarr&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load and analyze anywhere&lt;/span&gt;
&lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_zarr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s3://biodata/mirna_screen_2024.zarr&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="reproducible-analysis-pipelines"&gt;Reproducible analysis pipelines&lt;/h3&gt;&lt;p&gt;Your analysis becomes more reproducible because the data structure itself enforces consistency. Share the dataset and the analysis code just works.&lt;/p&gt;
&lt;h2 id="what-tools-make-this-possible"&gt;What tools make this possible?&lt;/h2&gt;&lt;p&gt;The tooling ecosystem has evolved dramatically in recent years. A few years ago, I would have told you to use parquet files with very unnatural tabular setups to get everything into tidy format. But &lt;strong&gt;xarray is changing the game&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Xarray&lt;/strong&gt; provides the coordinate system and multidimensional data structures that make this unified approach possible. It's like pandas for higher-dimensional data, but with meaningful coordinates instead of just integer indices.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Zarr&lt;/strong&gt; gives you cloud-native storage that preserves all your coordinate information and metadata. It supports chunking, compression, and parallel access - perfect for scaling your unified datasets.&lt;/p&gt;
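&lt;p&gt;As a rough sketch of what that looks like in practice (the chunk sizes here are arbitrary, and chunking requires dask):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import xarray as xr

# Chunk along the dimensions you query most, then write compressed Zarr.
chunked = unified_dataset.chunk({'mirna': 50, 'time_point': 5})
chunked.to_zarr('s3://biodata/mirna_screen_2024.zarr', mode='w')

# Later, open lazily and pull down only the slice you need.
ds = xr.open_zarr('s3://biodata/mirna_screen_2024.zarr')
hypoxia = ds.expression_level.sel(treatment='hypoxia').compute()
&lt;/code&gt;&lt;/pre&gt;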
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The tools we've got are just getting better and better. I wouldn't have imagined that we'd be able to use xarray for this kind of unified laboratory data storage just a few years ago. The ecosystem is maturing rapidly, and these approaches are becoming more accessible every year.&lt;/p&gt;
&lt;h2 id="what-s-next"&gt;What's next?&lt;/h2&gt;&lt;p&gt;If you're working with multidimensional experimental data, I'd strongly encourage you to try this unified approach. Start small - take your next experiment and see if you can structure it as a single xarray Dataset instead of multiple files.&lt;/p&gt;
&lt;p&gt;The cognitive overhead reduction is immediate. No more wondering if your indices are aligned. No more writing index-matching code. Just clean, coordinated data that scales to the cloud.&lt;/p&gt;
&lt;p&gt;Time will distill the best practices in your context, but I've found this unified approach eliminates so much friction from the experimental data lifecycle. Give it a try and see how it feels in your workflow.&lt;/p&gt;
&lt;p&gt;I cooked up this synthetic example while attending Ian Hunt-Isaak's talk &lt;a href="https://cfp.scipy.org/scipy2025/talk/AARA39/"&gt;"Xarray across biology. Where are we and where are we going?"&lt;/a&gt; at SciPy 2025. His presentation on using xarray for biological data really crystallized how powerful this coordinate-based approach could be for the typical experimental workflow.&lt;/p&gt;
</content></entry><entry><title>Reflections on the SciPy 2025 Conference</title><link href="https://ericmjl.github.io/blog/2025/7/14/reflections-on-the-scipy-2025-conference/" rel="alternate"/><updated>2025-07-14T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:65ab7a38-7622-37a4-bce5-8af2c9cb1881</id><content type="html">&lt;p&gt;This year marks my 10th year of being involved with the Scientific Python Conference, and it has been an absolute blast! What started as curiosity about the intersection of science and software has grown into a decade of learning, teaching, and contributing to this incredible community.&lt;/p&gt;
&lt;h2 id="conference-activities-summary"&gt;Conference Activities Summary&lt;/h2&gt;&lt;p&gt;This year's SciPy was particularly active for me. I taught two tutorials: "Building with LLMs Made Simple" (a new one) and "Network Analysis Made Simple" (my longtime favorite). After the tutorials, I attended several inspiring talks, including an especially motivating presentation on XArray in biology that prompted me to create a Marimo notebook demonstrating XArray's applications in biological data analysis.&lt;/p&gt;
&lt;p&gt;One of my favorite conference activities this year was recording conversations with fellow attendees. In lieu of my Insta360 camera, I brought my DJI mic everywhere and captured numerous insightful discussions, creating an informal podcast collection of SciPy conversations. Finally, during the sprints, I felt more tapped out than usual but still managed to contribute to Llamabot development with others and work on the XArray biology materials I had envisioned.&lt;/p&gt;
&lt;h2 id="tutorials"&gt;Tutorials&lt;/h2&gt;&lt;h3 id="building-with-llms-made-simple"&gt;Building with LLMs Made Simple&lt;/h3&gt;&lt;p&gt;This was my first time teaching this tutorial, and I was thrilled to use Marimo notebooks throughout the entire session. The tutorial covered three main areas: simple LLM interactions, structured generation, and RAG (Retrieval-Augmented Generation). You can find the tutorial materials at: &lt;a href="https://github.com/ericmjl/building-with-llms-made-simple"&gt;https://github.com/ericmjl/building-with-llms-made-simple&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The structured generation section was particularly powerful. I emphasized that structured generation is fundamentally about automating form-filling using natural language. Having free text input and getting a filled-out Pydantic model output is incredibly valuable for productivity. One participant mentioned the concept of automating "the dangerous, the dull, and the dirty" - which perfectly captures how LLMs can handle routine tasks.&lt;/p&gt;
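&lt;p&gt;To make "automating form-filling" concrete, here's a minimal sketch of the shape of the thing using Pydantic; the model and field names are made up, and the LLM call itself is elided:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pydantic import BaseModel


# The "form" that structured generation fills out from free text.
class ExperimentRequest(BaseModel):
    instrument: str
    sample_count: int
    priority: str


# Free text in: "Please book the plate reader for 96 samples - it's urgent."
# Structured output back (the LLM call that does the filling is elided here):
request = ExperimentRequest(instrument="plate reader", sample_count=96, priority="urgent")
print(request.model_dump())
&lt;/code&gt;&lt;/pre&gt;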
&lt;p&gt;For RAG, I clarified that RAG doesn't necessarily equal vector databases - it's about information retrieval through various means including keyword search. I demonstrated custom chunking strategies for standard operating procedures, showing how simple solutions (like appending source references) often work better than complex hierarchical structures.&lt;/p&gt;
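&lt;p&gt;Here's a toy sketch of that point: retrieval via plain keyword overlap, with a source reference appended to each returned chunk (all names illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy keyword retrieval - no vector database required.
docs = {
    "sop-012": "Centrifuge samples at 4C for 10 minutes before aliquoting.",
    "sop-045": "Calibrate the plate reader weekly using the standard dye set.",
}


def retrieve(query, docs, top_k=1):
    """Score each chunk by keyword overlap, appending its source reference."""
    terms = set(query.lower().split())
    scored = []
    for source, text in docs.items():
        overlap = len(terms.intersection(text.lower().split()))
        scored.append((overlap, f"{text} [source: {source}]"))
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


print(retrieve("how do I calibrate the plate reader", docs))
&lt;/code&gt;&lt;/pre&gt;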
&lt;p&gt;The tutorial concluded with brief demos on evaluation and agents. I shared my experience testing different models (Gemma, Llama 3, Llama 4) for docstring generation, emphasizing the importance of experimentation and model selection. For agents, I stressed starting with simpler structured generation approaches before building complex autonomous systems.&lt;/p&gt;
&lt;p&gt;Thanks to Modal's generous credit allocation from their DevRel Charles, I was able to deploy an Ollama endpoint in the cloud, making the tutorial accessible to all participants.&lt;/p&gt;
&lt;h3 id="network-analysis-made-simple"&gt;Network Analysis Made Simple&lt;/h3&gt;&lt;p&gt;This marked either my ninth or tenth time teaching this tutorial at SciPy - my longtime favorite. This year I made the significant transition from Jupyter to Marimo notebooks, which was an experiment that generally worked well despite some setup challenges. You can find the tutorial materials at: &lt;a href="https://github.com/ericmjl/Network-Analysis-Made-Simple"&gt;https://github.com/ericmjl/Network-Analysis-Made-Simple&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The tutorial hit some installation hurdles: the Network Analysis Made Simple package was published on my own PyPI server, and some participants weren't familiar with Marimo. Fortunately, Erik Welch from NVIDIA was on hand to help participants. By the end of the conference talk days, I had resolved most of the installation problems by changing the notebooks to draw from the Network Analysis Made Simple source directly instead of my own PyPI server.&lt;/p&gt;
&lt;p&gt;What I loved most was the audience engagement. We didn't cover as much content as usual because participants asked so many thoughtful questions, especially during the visualization section. This interaction made the session incredibly valuable, as people were clearly learning and developing new ideas for their own work.&lt;/p&gt;
&lt;p&gt;The Marimo experiment succeeded in shifting the learning environment with minimal overhead. For future iterations, I'm considering eliminating the separate NAMS package and making the entire notebook self-contained with answers included at the bottom.&lt;/p&gt;
&lt;h3 id="overarching-thoughts-on-the-tutorials"&gt;Overarching thoughts on the tutorials&lt;/h3&gt;&lt;p&gt;Both tutorials were conducted entirely within Marimo notebooks, which convinced quite a few participants to switch over to Marimo. They saw the power of fully reactive notebooks and the ability to seamlessly share analysis from one person to another - something that's much more cumbersome with traditional Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;Both tutorials will also be available on YouTube! There was a technical glitch with the Building with LLMs Made Simple tutorial recording, so I'm planning to re-record the full tutorial this coming Saturday - including content we didn't get to cover during the live session. This should actually result in a better, more complete recording for the YouTube release, which I'll also release to my own channel.&lt;/p&gt;
&lt;h2 id="talks-and-presentations"&gt;Talks and Presentations&lt;/h2&gt;&lt;p&gt;I attended several inspiring talks throughout the conference. Here are short summaries of the key presentations that caught my attention:&lt;/p&gt;
&lt;h3 id="xarray-in-biology-ian-hunt-isaak"&gt;XArray in Biology (Ian Hunt-Isaak)&lt;/h3&gt;&lt;p&gt;This talk was particularly inspiring and prompted me to create a Marimo notebook demonstrating XArray applications in biology. Ian, a biologist and microscopist from Earthmover (funded by the Chan Zuckerberg Initiative), presented a compelling case for XArray adoption in biological research.&lt;/p&gt;
&lt;p&gt;XArray excels at handling multi-dimensional biological data like time-series microscopy images, multi-channel fluorescent data, and complex experimental metadata. Its semantic indexing capabilities (e.g., &lt;code&gt;data.sel(time='30.5min', field_of_view=1, channel='GFP').max('z')&lt;/code&gt;) make biological data analysis much more intuitive.&lt;/p&gt;
&lt;p&gt;Despite its benefits, XArray has seen limited adoption in biology due to awareness barriers and lack of biology-specific examples. Recent improvements like DataTree for hierarchical data structures and flexible indices for complex coordinate systems address many biological data needs. The roadmap includes developing biology-specific documentation and building a user community within the next year.&lt;/p&gt;
&lt;h3 id="scipy-statistical-distributions-infrastructure-albert-steppi"&gt;SciPy Statistical Distributions Infrastructure (Albert Steppi)&lt;/h3&gt;&lt;p&gt;Albert, one of SciPy's maintainers, presented the complete rewrite of SciPy's statistical distributions framework, primarily designed by Matt Haberland. The new infrastructure addresses significant limitations of the old system, including memory leaks, inflexible documentation, and parameter processing overhead.&lt;/p&gt;
&lt;p&gt;Key improvements include a single consistent API where distributions are classes users instantiate, better performance, arithmetic operations on distributions (shifting, scaling, transformations), and simplified custom distribution creation. Future development will focus on distribution-specific fitting methods and support for alternative array backends like PyTorch and JAX.&lt;/p&gt;
&lt;h3 id="high-level-api-dispatching-for-community-scaling-erik-welch"&gt;High-Level API Dispatching for Community Scaling (Erik Welch)&lt;/h3&gt;&lt;p&gt;This presentation explored how dispatching enables scaling of open source communities while managing contributor burden. The speaker shared implementation experiences with NetworkX (3-year evolution from pure Python to supporting faster implementations) and Scikit-Image (1-year implementation dispatching to NVIDIA cuCIM).&lt;/p&gt;
&lt;p&gt;The talk emphasized community engagement importance, careful bandwidth management, and maintaining balance between users, library maintainers, and backend developers. While dispatching is "deceptively simple," it requires careful consideration of nuanced implementation choices.&lt;/p&gt;
&lt;h3 id="marimo-the-future-of-notebooks-akshay-agrawal"&gt;Marimo: The Future of Notebooks (Akshay Agrawal)&lt;/h3&gt;&lt;p&gt;I was thrilled to see Marimo's founder Akshay give a talk about the future of notebooks. His live demo showcasing all of Marimo's capabilities was as gutsy as my own Data-Driven Pharma talk (which was also done entirely in a Marimo notebook).&lt;/p&gt;
&lt;p&gt;The fundamental change Marimo has brought to my workflow has been amazing. Not having to specify a separate manifest file for dependencies like with Jupyter notebooks was one of the big selling points for me. We had dinner together with a large group and got to discuss Marimo's future development - it was awesome to meet him in person and share thoughts on where the platform is heading.&lt;/p&gt;
&lt;h2 id="recording-conversations-and-networking"&gt;Recording Conversations and Networking&lt;/h2&gt;&lt;p&gt;One of my favorite activities this year was bringing my DJI mic everywhere and recording conversations with fellow attendees. Over the years, I've realized how informative and valuable these SciPy conversations are, so I decided to capture them as informal podcast content.&lt;/p&gt;
&lt;p&gt;The first recording happened over breakfast with Hugo Bowne-Anderson. We were discussing everything while eating salmon frittata - we now call it "the frittata chat." Hugo loved the idea so much that he sent it to his editor, and it will appear on his podcast "Vanishing Gradients" soon.&lt;/p&gt;
&lt;p&gt;I continued this approach with Daniel Chen (my conference doppelganger - we get mistaken for each other at every conference) and Ryan Cooper. I also had an incredible hour-and-twenty-minute conversation with Zweli, covering topics from Bayes and graphs to apartheid and parenting. While I missed some talks due to these extended conversations, that's often the real purpose of conferences - engaging in dialogue we don't usually get to have.&lt;/p&gt;
&lt;p&gt;Whether I'll release these as formal podcast episodes depends partly on my energy levels and whether the participants agree, but the conversations themselves provided immense value and captured knowledge I didn't want to lose.&lt;/p&gt;
&lt;h2 id="nerd-sniping-and-code-reviews"&gt;Nerd Sniping and Code Reviews&lt;/h2&gt;&lt;p&gt;I got thoroughly nerd-sniped by Joe Cheng, CTO of Posit, who found Llamabot and conducted an impromptu code review. We first met at the NVIDIA event while I was recording a conversation with Daniel Chen about AI education and assessment.&lt;/p&gt;
&lt;p&gt;Joe had recently decided that generative AI was a productive area for Posit and found Llamabot during his research. Standing outside the Glass Museum for half an hour, he grilled me with questions about design choices I'd never had the chance to discuss with anyone before. The nerd sniping continued in the hotel lobby over ramen takeout from Thekoi (awesome restaurant, by the way!), where he asked about corners of the codebase with the thoroughness of the technical interviews I've subjected multiple people to. Talk about karma!&lt;/p&gt;
&lt;p&gt;Joe also ended up nerd sniping himself during our discussions and built something with the OpenAI real-time API that he showed me on Thursday evening. It was incredibly fun - we were on his computer together, nerding out about tweaking the real-time API settings to fit a user experience that would work with my brain, where I take a bit more time to respond and don't necessarily like the rapid-fire conversation turns.&lt;/p&gt;
&lt;p&gt;This nerd sniping cascade had a knock-on effect: it led me to implement graph-based memory for Llamabot, which then revealed that the chat memory API really wasn't optimal and needed another rewrite. There's now a 0.13 release of Llamabot planned in my head that will need to happen soon - all thanks to Joe's infectious curiosity and builder mentality!&lt;/p&gt;
&lt;h2 id="sprints"&gt;Sprints&lt;/h2&gt;&lt;p&gt;The sprints provided a chance to contribute to open source projects, though I felt more tapped out than usual this year. Despite the fatigue, I managed to make meaningful contributions to three key areas.&lt;/p&gt;
&lt;h3 id="llamabot-development"&gt;Llamabot Development&lt;/h3&gt;&lt;p&gt;Joe Cheng's nerd sniping during the conference led me to spend time during the sprints implementing graph-based memory for Llamabot. The challenge was representing conversation turns as pairs of human and AI messages while inferring the most probable message that a human is responding to when creating new branches in the conversation.&lt;/p&gt;
&lt;p&gt;I successfully implemented this graph memory system, which required determining how to connect new human messages to existing assistant messages in the conversation graph. This feature allows for more sophisticated conversation tracking and branching compared to traditional linear chat histories. You can see the implementation in this pull request: &lt;a href="https://github.com/ericmjl/llamabot/pull/226"&gt;https://github.com/ericmjl/llamabot/pull/226&lt;/a&gt;&lt;/p&gt;
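&lt;p&gt;The core idea is small enough to sketch with NetworkX. To be clear, this toy is not Llamabot's actual implementation (see the pull request above for that); it just illustrates the shape of the data structure:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Toy sketch of graph-based chat memory: nodes are messages, edges
# point from a message to the message that responds to it. This is
# NOT Llamabot's actual implementation -- see the PR for that.
import networkx as nx

G = nx.DiGraph()
G.add_node(1, role="user", content="What is a Circos plot?")
G.add_node(2, role="assistant", content="A circular graph layout...")
G.add_edge(1, 2)  # assistant message 2 responds to human message 1

def add_human_message(graph, content, parent_assistant_node):
    """Branch a new human message off an existing assistant message.

    Here the parent is chosen by the caller; in practice you would
    infer the most probable parent, e.g. with embedding similarity.
    """
    node_id = max(graph.nodes) + 1
    graph.add_node(node_id, role="user", content=content)
    graph.add_edge(parent_assistant_node, node_id)
    return node_id

add_human_message(G, "How do I order the nodes?", parent_assistant_node=2)
&lt;/pre&gt;&lt;/div&gt;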
&lt;h3 id="xarray-biology-contributions"&gt;XArray Biology Contributions&lt;/h3&gt;&lt;p&gt;Inspired by Ian's talk on XArray in biology, I worked on creating Marimo notebook examples demonstrating how XArray can be effectively used for biological data analysis. This contribution aims to bridge the gap between XArray's powerful capabilities and the biology community's needs for multi-dimensional data handling.&lt;/p&gt;
&lt;p&gt;The goal was to provide concrete examples that biologists could use as starting points for their own projects, helping to increase XArray adoption in biological research by making its benefits more tangible and accessible. You can find the completed notebook at: &lt;a href="https://gist.github.com/ericmjl/e5b267782f9cbd27f712153deab426e1"&gt;https://gist.github.com/ericmjl/e5b267782f9cbd27f712153deab426e1&lt;/a&gt;&lt;/p&gt;
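&lt;p&gt;To give a flavor of why XArray fits biological data so well, here's a small, hypothetical example with made-up genes and timepoints; the real worked examples live in the gist above:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Hypothetical labeled multi-dimensional biological data; the real
# worked examples live in the gist linked above.
import numpy as np
import xarray as xr

# Gene expression measured across samples, genes, and timepoints.
expression = xr.DataArray(
    np.random.rand(4, 3, 5),
    dims=("sample", "gene", "timepoint"),
    coords={
        "sample": ["s1", "s2", "s3", "s4"],
        "gene": ["BRCA1", "TP53", "EGFR"],
        "timepoint": [0, 6, 12, 24, 48],
    },
)

# Label-based selection and reductions read like the biology:
tp53_over_time = expression.sel(gene="TP53").mean(dim="sample")
&lt;/pre&gt;&lt;/div&gt;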
&lt;h3 id="teen-track-talk"&gt;Teen Track Talk&lt;/h3&gt;&lt;p&gt;Inessa Pawson asked if I would be willing to give a talk to the teens attending the conference. I shared stories about building your own tools and recounted experiences from my career journey. I told them how I got in through the back door and walked out through the front door of grad school, emphasizing how much you can learn along the way.&lt;/p&gt;
&lt;p&gt;Using the same approach from my Data-Driven Pharma talk, I showed them live that I could build my own tools without relying on PowerPoint, by demonstrating my own slide deck generator. I shared how I picked up programming by making 70+ pull requests with the Matplotlib team, an incredible learning experience that later helped me professionally at Novartis and Moderna, where being able to build tools for myself helped me be the change I wanted to see in the world. The goal was to inspire them to see that they too can build their own tools and, perhaps, be the change they want to see.&lt;/p&gt;
&lt;h2 id="conference-tidbits"&gt;Conference Tidbits&lt;/h2&gt;&lt;p&gt;A few smaller moments that captured the spirit of SciPy and the power of modern notebook sharing: I helped Hugo with a quick analysis during the conference and was able to simply airdrop him a Marimo notebook with the complete analysis. The fact that I could share a fully self-contained, executable analysis so seamlessly really demonstrated how far we've come in making scientific computing more collaborative and accessible.&lt;/p&gt;
&lt;p&gt;Another tidbit: I went to Chili Thai for the sixth and seventh time in two years, which is pretty remarkable considering that I've only been at the conference for a total of 14 days. Chili Thai really earns high ratings from me - the duck curry and the panang curry are amongst the best I've ever had.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;&lt;p&gt;Attending the SciPy conference for about a decade now has been an immense resource for my career growth. Beyond being a participant, I've also been involved as an organizer, serving on the financial aid committee for almost a decade. It's my little way of giving back to a community that has given me so much, and I'm always looking for ways to contribute even more.&lt;/p&gt;
&lt;p&gt;What makes SciPy special is its incredible community of people who are curious, nerdy, and remarkably ego-free. There's a genuine spirit of learning and teaching - many are educators at heart, eager to share knowledge and help others grow. This creates an environment where meaningful connections and learning happen naturally.&lt;/p&gt;
&lt;p&gt;I'd really recommend more people attend SciPy if their company finances allow for it. The value you get from the tutorials, talks, networking, and collaborative spirit is immense. However, I do know from helping organize the conference that this year we ran at a deficit, which isn't financially sustainable. I hope we can find more sponsors for next year to keep this amazing event accessible.&lt;/p&gt;
&lt;p&gt;If possible, I'd love to help sponsor the conference, especially the Financial Aid program. Being able to bring new people to the conference - particularly community contributors who have demonstrated need - would be amazing. I was a beneficiary of financial aid myself early in my career, and it made all the difference in my ability to participate and grow within this community.&lt;/p&gt;
&lt;p&gt;The SciPy conference continues to be a cornerstone of my professional development and a source of inspiration for pushing the boundaries of what's possible with scientific computing!&lt;/p&gt;
</content></entry><entry><title>Earn the privilege to use automation</title><link href="https://ericmjl.github.io/blog/2025/7/13/earn-the-privilege-to-use-automation/" rel="alternate"/><updated>2025-07-13T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:ab61e0b2-3ffb-313f-b904-63d2b3293835</id><content type="html">&lt;p&gt;AI in education was supposed to be transformative.&lt;/p&gt;
&lt;p&gt;We imagined students with AI tutors available 24/7, personalized learning at scale, and democratized access to high-quality education. The promise was intoxicating: every student could have their own Socrates, guiding them through complex concepts with infinite patience.&lt;/p&gt;
&lt;p&gt;Then reality hit.&lt;/p&gt;
&lt;h2 id="when-ai-integration-fails-spectacularly"&gt;When AI integration fails spectacularly&lt;/h2&gt;&lt;p&gt;Lorena Barba, a respected engineering professor at George Washington University, shared her experience at SciPy 2025 of deciding to fully embrace AI in her computational engineering course. She built a custom AI tool with her technical partners, complete with document upload capabilities, retrieval augmented generation, and safety moderation features. She gave her students what seemed like the perfect educational AI assistant.&lt;/p&gt;
&lt;p&gt;The results were devastating.&lt;/p&gt;
&lt;p&gt;Her course evaluations plummeted from 4.8/5 to 2.3/5. Students stopped attending class. They stopped doing homework with any rigor. Some copied entire assignment questions, including instructions like "your code here," and expected complete answers they could submit without understanding.&lt;/p&gt;
&lt;p&gt;The most damning feedback? Students told her: "I would have learned better if AI were not present."&lt;/p&gt;
&lt;p&gt;What went wrong? Lorena had given her students unbridled access to AI without ensuring they had the foundational skills to use it effectively. Students developed what she called an "illusion of competence"—they overestimated their knowledge because AI made everything feel easy. They missed the deep processing necessary for long-term memory formation.&lt;/p&gt;
&lt;p&gt;After 20 years of successful teaching, Lorena experienced what she called a "frustrating, humbling failure." She's now considering returning to oral examinations to preserve assessment authenticity.&lt;/p&gt;
&lt;h2 id="the-assessment-validity-crisis"&gt;The assessment validity crisis&lt;/h2&gt;&lt;p&gt;Lorena's experience reveals a fundamental problem: AI has broken traditional assessment methods. If students can get AI to do their work, how do we evaluate their actual understanding? How do we conduct meaningful assessments in both educational and workplace settings?&lt;/p&gt;
&lt;p&gt;This question hits close to home for me. As a team lead, I constantly assess whether candidates are ready for the job and whether my teammates are performing at expected levels. If I'm only looking at work outputs—the final code, the completed analysis, the polished presentation—that's an inadequate assessment method. AI has made it trivially easy to produce impressive-looking outputs while learning nothing.&lt;/p&gt;
&lt;p&gt;I need to understand &lt;em&gt;how&lt;/em&gt; people think through problems, not just whether they can deliver results. This challenge sparked intense conversations with educators at SciPy 2025. Daniel Chen (University of British Columbia) and Ryan Cooper (University of Connecticut) each brought unique perspectives on adapting our assessment methods to this new reality.&lt;/p&gt;
&lt;h2 id="assessing-the-process-not-just-the-product"&gt;Assessing the process, not just the product&lt;/h2&gt;&lt;p&gt;Daniel Chen had to fundamentally shift his approach. He moved up Bloom's taxonomy for assessment, focusing on questioning and synthesis rather than factual regurgitation. His key insight: &lt;strong&gt;when students ask questions, it reveals their level of understanding&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Insightful questions indicate pursuit of mastery. Surface-level "how do I get this done" questions reveal a lack of deep engagement.&lt;/p&gt;
&lt;p&gt;Daniel proposed assessing students through their AI chat transcripts. Instead of only evaluating final products, we could examine both process and outcome. This approach reveals &lt;em&gt;how&lt;/em&gt; students think through problems, potentially restoring validity to our assessments.&lt;/p&gt;
&lt;p&gt;Ryan Cooper had already started implementing this idea, collecting chat transcripts to understand student thinking patterns. He also experimented with having students generate their own exam questions—leveraging the fact that creation sits at the highest level of Bloom's taxonomy.&lt;/p&gt;
&lt;p&gt;Ryan gave students access to a curated AI system conditioned with course context, generating on-the-fly assessment questions. While innovative, he encountered challenges with rubric-based grading when AI suggested grades without clear criteria.&lt;/p&gt;
&lt;h2 id="why-this-matters-in-the-workplace"&gt;Why this matters in the workplace&lt;/h2&gt;&lt;p&gt;These educational assessment challenges directly mirror my daily reality as a team lead. AI assistance allows me to work solo and move incredibly fast—I love that turbocharged feeling. But this speed creates a dangerous blind spot that affects both my personal development and my team's growth.&lt;/p&gt;
&lt;p&gt;Here's my dilemma: if I don't slow down to demonstrate my thinking process, we lose opportunities to train junior team members. More concerning, if I can't see how my team members approach problems—only their final outputs—I can't effectively assess their capabilities or guide their development.&lt;/p&gt;
&lt;p&gt;When team members use GitHub Copilot or similar tools, I need visibility into their thought processes, not just their code. Are they asking insightful questions? Do they understand the trade-offs they're making? Can they spot when the AI suggests something problematic? Without access to their reasoning process, I'm essentially conducting performance reviews based on AI-assisted outputs rather than human capability.&lt;/p&gt;
&lt;p&gt;This visibility gap threatens knowledge transfer and continuity. We risk training a generation of practitioners who can orchestrate AI to produce impressive results but lack the foundational understanding to innovate when the tools fail or evolve.&lt;/p&gt;
&lt;h2 id="earning-the-privilege-of-automation"&gt;Earning the privilege of automation&lt;/h2&gt;&lt;p&gt;The solution to this assessment crisis—both educational and professional—isn't to ban AI tools or ignore their impact. Instead, we need a fundamental shift in how we think about automation access.&lt;/p&gt;
&lt;p&gt;Here's the central insight that crystallized from these conversations: &lt;strong&gt;people must earn the privilege to use automation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The use of large language models for coding is automated code drafting. If you lack the skills to evaluate and verify correctness, you shouldn't use LLMs for anything important. This isn't about restricting access; it's about ensuring people develop foundational competencies first, then demonstrate those competencies before gaining access to powerful automation.&lt;/p&gt;
&lt;p&gt;The principle is straightforward: &lt;strong&gt;demonstrate you can verify AI output before using AI for critical work.&lt;/strong&gt; This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understanding underlying concepts well enough to spot errors&lt;/li&gt;
&lt;li&gt;Having skills to validate AI-generated solutions&lt;/li&gt;
&lt;li&gt;Developing judgment to recognize when something doesn't make sense&lt;/li&gt;
&lt;li&gt;Building fortitude to dig deeper when results seem questionable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm fine with "vibe coding" in unfamiliar languages for throwaway explorations—that's valuable for learning. But for work that matters, the ability to verify correctness is non-negotiable.&lt;/p&gt;
&lt;h2 id="the-path-forward"&gt;The path forward&lt;/h2&gt;&lt;p&gt;Lorena's lessons teach us that unrestricted AI access without foundational skills leads to degraded learning outcomes. We need systematic approaches to ensure people earn their automation privileges. These include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Moving assessments up Bloom's taxonomy&lt;/strong&gt; to focus on higher-order thinking skills that AI can't easily replicate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluating process alongside product&lt;/strong&gt; through chat transcript analysis and collaborative work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encouraging creation and synthesis&lt;/strong&gt; rather than regurgitation—have students generate exam questions, not just answer them&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Implementing pair programming and mentoring&lt;/strong&gt; that reveals thinking patterns and preserves knowledge transfer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintaining human elements&lt;/strong&gt; in learning and development to counteract AI's tendency to create isolated workers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The future belongs to those who can effectively collaborate with AI while maintaining the critical thinking skills to guide and verify that collaboration. But they must demonstrate mastery of fundamentals before earning that privilege.&lt;/p&gt;
&lt;p&gt;We're not trying to halt progress or ban useful tools. We're ensuring that powerful automation serves human capability rather than replacing it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;With thanks to Lorena Barba (George Washington University), Daniel Chen (University of British Columbia), Ryan Cooper (University of Connecticut), and Emily Dorne (Driven Data) for sharing their experiences and insights. Their perspectives as professional educators and industry practitioners navigating AI's impact on learning, assessment, and hiring shaped these reflections.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="addendum"&gt;Addendum&lt;/h2&gt;&lt;p&gt;Lorena Barba continued this conversation at JupyterCon, delivering a talk titled &lt;a href="https://www.youtube.com/watch?v=g0X4f0tRTTo"&gt;"Teaching and Learning With Jupyter and AI: An Educator's Dilemma"&lt;/a&gt;. In this follow-up presentation, she further explores the challenges educators face when students use AI as a shortcut, creating an "illusion of competence" that undermines genuine learning. The talk addresses the impasse educators face with vague university guidance and challenges to assessment validity, building on the themes she shared at SciPy 2025.&lt;/p&gt;
</content></entry><entry><title>The job your docs need to do</title><link href="https://ericmjl.github.io/blog/2025/7/7/the-job-your-docs-need-to-do/" rel="alternate"/><updated>2025-07-07T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:3253a2a7-3c77-3b72-b320-3aa34f3c80d0</id><content type="html">&lt;h1 id="what-is-the-job-that-your-docs-need-to-do"&gt;What is the job that your docs need to do?&lt;/h1&gt;&lt;p&gt;Two threads have been running through my mind recently, and I keep finding connections between them that I can't shake. The first is Diataxis - a structured framework for documentation that divides all docs into four distinct types: tutorials, how-to guides, reference, and explanation. The second is Clayton Christensen's jobs theory, which asks a deceptively simple question: what is the job that your customer needs to get done?&lt;/p&gt;
&lt;p&gt;Side note: I've been heavily inspired by Clayton Christensen's books recently, and have audiobooked my way through Innovator's Dilemma/Solution/DNA, as well as Competing Against Luck. All good books, 100% recommended if you're interested in understanding how innovation actually works.&lt;/p&gt;
&lt;p&gt;Here's the key insight: your documentation isn't competing with other documentation. It's competing with every other way someone could accomplish their job.&lt;/p&gt;
&lt;h2 id="the-competition-you-didn-t-know-you-had"&gt;The competition you didn't know you had&lt;/h2&gt;&lt;p&gt;When someone opens your internal documentation, they're looking for more than information. They're trying to accomplish something specific, and they're evaluating whether your docs are the right tool for that job.&lt;/p&gt;
&lt;p&gt;For internal company documentation—whether it's for internally built software, processes, or systems—the competition is different but equally real. Your how-to guide competes with asking a colleague, digging through Slack history, or reverse-engineering from existing code. Your reference docs compete with reading the source code directly, checking configuration files, or experimenting in a staging environment.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Diataxis Doc Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Jobs to be Done (JTBD)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Alternative Product Categories That Could Be Hired&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How-to Guides&lt;/td&gt;
&lt;td&gt;"Show me how to achieve a specific outcome."&lt;/td&gt;
&lt;td&gt;Asking a colleague, Slack/Teams search, reverse-engineering from existing code, trial and error in staging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reference&lt;/td&gt;
&lt;td&gt;"Give me exact technical information I can look up quickly."&lt;/td&gt;
&lt;td&gt;Reading source code, checking config files, database schemas, API endpoint testing, environment variables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explanation&lt;/td&gt;
&lt;td&gt;"Help me understand how/why it works."&lt;/td&gt;
&lt;td&gt;Architecture diagrams, code comments, git history, team knowledge sharing sessions, design documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tutorials&lt;/td&gt;
&lt;td&gt;"Help me learn by doing, in a safe structured way."&lt;/td&gt;
&lt;td&gt;Pair programming, shadowing a colleague, sandbox environments, local development setup walkthroughs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Once you see this competition, it reframes how you think about documentation structure.&lt;/p&gt;
&lt;h2 id="the-opportunity-hiding-in-plain-sight"&gt;The opportunity hiding in plain sight&lt;/h2&gt;&lt;p&gt;Here's what's fascinating about internal documentation: the competition is actually pretty terrible. Think about it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Asking a colleague interrupts their work and creates context switching for both of you&lt;/li&gt;
&lt;li&gt;Digging through Slack history is time-consuming and often incomplete&lt;/li&gt;
&lt;li&gt;Reverse-engineering from existing code is slow and error-prone&lt;/li&gt;
&lt;li&gt;Trial and error in staging environments wastes time and resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that even moderately good internal documentation has a much lower bar to clear than external documentation. Your internal how-to guide doesn't need to compete with polished YouTube tutorials - it just needs to be better than interrupting Sarah from accounting or spending 30 minutes searching through #engineering-general.&lt;/p&gt;
&lt;p&gt;This is actually a huge opportunity. While external documentation faces fierce competition from Stack Overflow's crowdsourced answers and professionally produced tutorials, internal documentation often competes with... nothing systematic at all.&lt;/p&gt;
&lt;p&gt;The result? Even basic improvements to internal documentation can have outsized impact on team productivity. When your reference docs are faster than reading source code, people will use them. When your how-to guides are clearer than tribal knowledge, they become the default choice.&lt;/p&gt;
&lt;h2 id="understanding-the-competition"&gt;Understanding the competition&lt;/h2&gt;&lt;p&gt;Most documentation is written from the perspective of the product being documented. It's organized around features, capabilities, and technical architecture. But when you flip the perspective to focus on jobs-to-be-done, you can structure information more effectively around what readers actually need to accomplish.&lt;/p&gt;
&lt;h2 id="a-job-focused-approach-in-practice"&gt;A job-focused approach in practice&lt;/h2&gt;&lt;p&gt;Let me show you what this looks like in practice. Say you're writing a how-to guide for deploying your company's internal microservice. The traditional approach focuses on what information to include. The job-focused approach starts with the specific outcome: "Help me get this service deployed so I can test my feature and merge my PR."&lt;/p&gt;
&lt;p&gt;That job-focused lens shifts how you structure the guide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You lead with the most common use case and a working example first, then dive into edge cases&lt;/li&gt;
&lt;li&gt;You include troubleshooting steps for the most common failure modes&lt;/li&gt;
&lt;li&gt;You assume they're in a hurry and want to get back to their main project&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every decision gets filtered through the lens of "does this help the reader accomplish their job better?"&lt;/p&gt;
&lt;p&gt;The result? Documentation that people actually use because it's genuinely better at helping them accomplish their jobs.&lt;/p&gt;
&lt;h2 id="how-to-apply-this-framework"&gt;How to apply this framework&lt;/h2&gt;&lt;p&gt;Start by identifying the specific job your reader is trying to accomplish. Not the general topic area, but the specific outcome they need to achieve. Then ask yourself:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What alternatives could they hire instead of your documentation?&lt;/li&gt;
&lt;li&gt;What unique value can your documentation provide that those alternatives can't?&lt;/li&gt;
&lt;li&gt;How can you structure the information to make their job easier to accomplish?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For that third question especially, consider using AI to help you think through different structural approaches. You can prompt an AI with the specific job your reader needs to accomplish and ask it to suggest multiple ways to organize the information, then choose the approach that best serves that job.&lt;/p&gt;
&lt;p&gt;Take reference documentation as an example. The job goes beyond providing comprehensive information about all internal API endpoints or configuration options. The real job is "give me exact technical information I can look up quickly." This means your reference docs need to be faster and more precise than someone could get from reading your source code, checking configuration files, or asking in Slack.&lt;/p&gt;
&lt;p&gt;If someone can figure out what they need faster by just reading the source code or asking a colleague, your reference docs aren't doing their job.&lt;/p&gt;
&lt;h2 id="how-this-applies-to-ai-assisted-documentation"&gt;How this applies to AI-assisted documentation&lt;/h2&gt;&lt;p&gt;Here's where this gets really interesting. When documentation is designed around jobs-to-be-done, it creates a positive feedback loop that extends beyond the direct reader.&lt;/p&gt;
&lt;p&gt;Well-structured, job-focused documentation helps humans and also helps AI systems understand context and provide better assistance to future users. When your how-to guide is crystal clear about the specific outcome it helps achieve, an AI can better understand when to recommend that guide to someone with a similar job.&lt;/p&gt;
&lt;p&gt;The result is that good documentation becomes a force multiplier. Beyond helping the direct reader, it helps AI systems help other readers accomplish similar jobs faster and more accurately.&lt;/p&gt;
&lt;p&gt;This creates a flywheel effect: better documentation helps more people accomplish their jobs, which generates more usage data and feedback, which leads to even better documentation that helps both humans and AI serve users more effectively.&lt;/p&gt;
&lt;h2 id="applying-this-perspective"&gt;Applying this perspective&lt;/h2&gt;&lt;p&gt;The next time you write internal documentation, start with what job your reader is trying to accomplish rather than what you want to explain.&lt;/p&gt;
&lt;p&gt;Ask yourself: if someone could accomplish this job faster or more reliably using a different approach, why would they choose your documentation instead? For internal docs, this question often has a surprising answer: because the alternatives are genuinely worse.&lt;/p&gt;
&lt;p&gt;This is liberating. Your internal documentation doesn't need to be perfect - it just needs to be better than the current chaos of tribal knowledge and ad-hoc problem-solving.&lt;/p&gt;
&lt;p&gt;When you can answer that question clearly, you'll write documentation that people find genuinely useful. And when people use your documentation successfully, they become more successful with your product.&lt;/p&gt;
&lt;p&gt;This matters because documentation is ultimately about scaling our collective knowledge and decision-making capacity. But that scaling only happens when people actually use the documentation. And people only use documentation when it helps them accomplish specific jobs they need to get done.&lt;/p&gt;
&lt;p&gt;For internal documentation, this scaling opportunity is especially significant. Every time someone uses your docs instead of interrupting a colleague, you're not just solving one person's problem - you're preserving focus and momentum across your entire team.&lt;/p&gt;
&lt;p&gt;That's the value of thinking about internal documentation as a product designed around jobs-to-be-done. It creates a better experience for everyone who interacts with your work, and unlike external documentation, you don't need to beat world-class competition to succeed.&lt;/p&gt;
</content></entry><entry><title>One hour and eight minutes: Building a receipt scanner with the weirdest tech stack imaginable</title><link href="https://ericmjl.github.io/blog/2025/7/1/one-hour-and-eight-minutes-building-a-receipt-scanner-with-the-weirdest-tech-stack-imaginable/" rel="alternate"/><updated>2025-07-01T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:a195e873-9317-3fef-b7e0-a0c3c634425f</id><content type="html">&lt;p&gt;After bouncing between Cursor and GitHub Copilot for the past couple of years, I kept hearing about Claude Code. People's experiences were really piquing my curiosity, so I decided to give it a shot. What happened next completely changed how I think about rapid prototyping.&lt;/p&gt;
&lt;p&gt;I built a fully functional receipt scanning and expense tracking app in exactly one hour and eight minutes. But here's the kicker—I used a technology stack so unconventional that most developers would probably laugh at me. And it worked beautifully.&lt;/p&gt;
&lt;p&gt;Let me tell you what I learned about the immersive power of terminal-based development and why weird tech combinations might be the secret to lightning-fast tool building.&lt;/p&gt;
&lt;h2 id="the-problem-i-wanted-to-solve"&gt;The problem I wanted to solve&lt;/h2&gt;&lt;p&gt;At work, I noticed SAP Concur can automatically extract fields from uploaded receipts. I thought, "What if I could replicate that at home?" I wanted to track my expenses without paying for QuickBooks, using Notion as my database instead.&lt;/p&gt;
&lt;p&gt;Most developers would reach for the standard stack: React frontend, PostgreSQL backend, maybe throw in some Express.js. That's the sensible approach.&lt;/p&gt;
&lt;p&gt;But I'm not building production software for thousands of users. I'm a data scientist experimenting with tools for myself. So I decided to get weird with it.&lt;/p&gt;
&lt;h2 id="the-stack-that-shouldn-t-work-but-does"&gt;The stack that shouldn't work but does&lt;/h2&gt;&lt;p&gt;Here's what Claude Code helped me build with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; for the backend (this part makes sense)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HTMX&lt;/strong&gt; for the frontend instead of React (getting unusual)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vanilla HTML/CSS&lt;/strong&gt; with minimal JavaScript (now we're talking)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LlamaBot&lt;/strong&gt; for AI interactions (I made it, so I know it works)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notion&lt;/strong&gt; as the database (yes, you read that right)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If I were to describe this stack to a seasoned developer, they'd probably be surprised, then laugh out loud, and then go "what?" But when I described it to Claude Code and specified that I wanted everything in a single &lt;code&gt;app.py&lt;/code&gt; file that I could run with &lt;code&gt;uv run app.py&lt;/code&gt;, Claude Code got creative.&lt;/p&gt;
&lt;p&gt;It generated a beautiful single-file application with PEP 723 metadata at the top. The code was clean and well-structured. It took a few iterations of AI-generated code followed by testing, but the work was always headed in the right direction. And this is the result:&lt;/p&gt;
&lt;p&gt;&lt;img src="screenshot.webp" alt=""&gt;&lt;/p&gt;
&lt;h2 id="the-development-experience-that-changed-everything"&gt;The development experience that changed everything&lt;/h2&gt;&lt;p&gt;Here's what blew my mind about using Claude Code: the immersive experience.&lt;/p&gt;
&lt;p&gt;I spent the entire development session in just two terminal tabs. One tab running Claude Code, another tab with my &lt;code&gt;uvicorn&lt;/code&gt; server running with auto-reload. That's it. No switching between file explorers, no hunting through directory structures, no context switching between different applications.&lt;/p&gt;
&lt;p&gt;I was in what I can only describe as "vibe-ish coding" mode—not quite the &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/"&gt;vibe coding that Simon Willison describes&lt;/a&gt;, but close. I'd type a request to Claude Code, see the changes instantly in my browser, then iterate. The feedback loop was immediate and distraction-free.&lt;/p&gt;
&lt;p&gt;This terminal-focused workflow kept me in the zone in a way that traditional IDEs never have. Without all the little icons, bells, and whistles that can distract you in an IDE, I could maintain focus on the actual problem I was solving instead of fighting with tools.&lt;/p&gt;
&lt;h2 id="what-got-built-in-68-minutes"&gt;What got built in 68 minutes&lt;/h2&gt;&lt;p&gt;By the time my terminal session ended, I had a fully functional application that could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Upload single or multiple receipt images&lt;/li&gt;
&lt;li&gt;Extract expense data using LlamaBot's AI capabilities&lt;/li&gt;
&lt;li&gt;Allow manual editing of fields that the AI got wrong (inside Notion)&lt;/li&gt;
&lt;li&gt;Handle enumerated types for expense categories&lt;/li&gt;
&lt;li&gt;Automatically populate a Notion database with extracted data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The AI integration was seamless. I provided my OpenAI API key, and through LlamaBot I was able to hit the OpenAI API while Claude Code handled all the integration complexity. When I needed to add file upload functionality, I pasted some Notion API documentation as context, and Claude Code implemented it correctly.&lt;/p&gt;
&lt;p&gt;The end result? I can now drag and drop receipts into a web interface, hit submit, and watch the data appear automatically in my Notion expense tracker. Exactly what I wanted.&lt;/p&gt;
&lt;h2 id="pushing-language-models-to-their-limits"&gt;Pushing language models to their limits&lt;/h2&gt;&lt;p&gt;Here's the thing that really fascinated me about this experiment: I deliberately chose this weird tech stack to test Claude Code's boundaries.&lt;/p&gt;
&lt;p&gt;Think about it—if I had gone with React, Node.js, and PostgreSQL, that would be easy for any language model. Those patterns show up constantly in training data.&lt;/p&gt;
&lt;p&gt;But I wanted to push to the edges. What happens when you combine technologies that people don't usually think about together? HTMX with FastAPI? Notion as a database backend? A single-file Python app doing receipt processing with AI?&lt;/p&gt;
&lt;p&gt;This is uncharted territory for most language models. There aren't thousands of tutorials showing how to integrate LlamaBot with HTMX forms, or how to structure FastAPI routes that return HTML fragments for dynamic updates.&lt;/p&gt;
&lt;p&gt;Yet Claude Code handled it beautifully. It figured out how to make these disparate pieces work together, even when the combination got weird.&lt;/p&gt;
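&lt;p&gt;For the curious, the FastAPI-returns-HTML-fragments pattern is pleasantly small. Here's an illustrative sketch, not the app's actual code:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Illustrative FastAPI + HTMX pattern, not the actual app's code:
# the route returns an HTML fragment that HTMX swaps into the page,
# so no JavaScript framework is needed on the frontend.
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()

@app.post("/upload", response_class=HTMLResponse)
def upload_receipt() -&gt; str:
    # The real app runs AI extraction on the uploaded image and
    # writes the result to Notion; here we fake a single list item.
    return "&amp;lt;li&amp;gt;Coffee: $4.50 (2025-07-01)&amp;lt;/li&amp;gt;"
&lt;/pre&gt;&lt;/div&gt;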
&lt;h2 id="why-this-matters-for-tool-building"&gt;Why this matters for tool building&lt;/h2&gt;&lt;p&gt;This experience reinforced something I've been thinking about lately: the best time to build custom tools is right now, and the barrier to entry has never been lower.&lt;/p&gt;
&lt;p&gt;I wrote about this recently in my post on building your own tools with AI coding assistants. If you need a tool, just build it. Don't wait for the perfect stack or the right framework. Pick technologies that let you move fast and iterate quickly.&lt;/p&gt;
&lt;p&gt;The ability to combine unusual technologies successfully opens up new possibilities. Instead of being constrained by conventional wisdom about what technologies "should" work together, you can experiment with combinations that solve your specific problem elegantly.&lt;/p&gt;
&lt;h2 id="the-immersive-development-advantage"&gt;The immersive development advantage&lt;/h2&gt;&lt;p&gt;The most valuable lesson from this experiment wasn't about technology—it was about workflow.&lt;/p&gt;
&lt;p&gt;Claude Code's terminal-based approach created an immersive development environment that kept me focused. No file system distractions, no IDE complexity, just pure problem-solving in a clean interface.&lt;/p&gt;
&lt;p&gt;This suggests that tool choice matters more than we often acknowledge. The best coding assistant isn't necessarily the one with the most features—it's the one that keeps you in flow state while you build.&lt;/p&gt;
&lt;h2 id="what-s-next"&gt;What's next&lt;/h2&gt;&lt;p&gt;I'm already planning my next experiment with Claude Code. Maybe a document processing pipeline using Docling, Anthropic's API, and Airtable. Or a personal CRM built with FastAPI, HTMX, and Google Sheets as the backend.&lt;/p&gt;
&lt;p&gt;The point isn't to build production-ready applications with these stacks. It's to explore what becomes possible when you remove the friction from experimentation.&lt;/p&gt;
&lt;p&gt;In an hour and eight minutes, I went from idea to working application. That's the kind of development velocity that changes what you're willing to attempt.&lt;/p&gt;
&lt;p&gt;Sometimes the weirdest combinations turn out to be exactly what you need.&lt;/p&gt;
</content></entry><entry><title>Build your own tools!</title><link href="https://ericmjl.github.io/blog/2025/6/27/build-your-own-tools/" rel="alternate"/><updated>2025-06-27T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:bdee977f-1cea-30bd-8d27-2c5c3cf1032a</id><content type="html">&lt;p&gt;On 25 June 2025, I delivered a talk at Data-Driven Pharma, an event organized by &lt;a href="https://www.linkedin.com/in/dricaptain/"&gt;Ilya Captain&lt;/a&gt; and the namesake Data-Driven Pharma organization. In the run-up to the talk, I had been reflecting on two points:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I hate making slides, and&lt;/li&gt;
&lt;li&gt;I really love building tools.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To that end, I decided... well, I'm not going to bother with making slides. And I'll build a tool that makes slides for me instead. Hence DeckBot, which currently lives in a Marimo notebook, was born. I started off by telling the crowd how much I hated making slides:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;In an age of LLMs and plain .txt, I understand why I have such a disdain for powerpoint: you can't easily automate their creation, there's too much that can be hidden behind a bullet point, and it's just an all-round ineffective media for &lt;em&gt;lasting&lt;/em&gt; crystal clear communication. By contrast, Markdown slides are better.&lt;/p&gt;
&lt;p&gt;-- Original post link &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7335296923488194561?trk=public_post_embed_social-actions-reactions"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And how even Andrej Karpathy laments the absence of an LLM-enabled tool for building slides:&lt;/p&gt;
&lt;p&gt;&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Making slides manually feels especially painful now that you know Cursor for slides should exist but doesn’t.&lt;/p&gt;&amp;mdash; Andrej Karpathy (@karpathy) &lt;a href="https://twitter.com/karpathy/status/1931042840966222046?ref_src=twsrc%5Etfw"&gt;June 6, 2025&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Also, my informal poll of the audience revealed that approximately 2/3 of the crowd also hated making slides. Not surprising!&lt;/p&gt;
&lt;p&gt;So I decided to take that as a nerdsnipe and actually make DeckBot. After showing the audience (live!) how I can make rando slides for completely nondescript topics, such as "Why eating well is so important" or "pros and cons of buying a thing", I then proceeded to the really exciting challenge of this talk: to get an LLM to generate my entire slide deck for the actual topic I wanted to talk about. And that topic was, well, "Build your own tools!". I then copied and pasted the first draft of this blog post into the notebook, and one minute later, I had my slides, from which I presented live.&lt;/p&gt;
&lt;p&gt;Below is a writeup of what I actually presented, including a written description of some of the interactions.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;My main message to everybody today is this: If you're a data scientist, computational biologist, or software developer, you should learn how to build your own tools. Building your own tools is a liberating endeavor. It injects joy back into your day-to-day work. People were made to be creative creators. Build your own tools.&lt;/p&gt;
&lt;h2 id="a-flashback-from-my-grad-school-days"&gt;A flashback from my grad school days&lt;/h2&gt;&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/images/circos.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Do you know what this diagram is? The audience came in clutch: many people knew what this was -- it's a Circos plot. Some may have seen it with arcs rather than dots around the edges, but the concept remains the same: prioritize ordering nodes and then draw in the edges.&lt;/p&gt;
&lt;p&gt;I wanted to learn how to make a graph visualization like this. But the only tool I saw out there was written in a different language (Perl), had no Python bindings, and was way too complicated for me—a beginner programmer in 2014—to learn. So I decided to leverage two other tools that I knew at the time, Python and matplotlib, to make my own Python package, both to learn software development and to understand the principles of rational network visualization.&lt;/p&gt;
&lt;p&gt;The precursor to nxviz, &lt;code&gt;circosplot&lt;/code&gt;, was born in 2015. One year later, I knew enough to make all sorts of network visualizations!&lt;/p&gt;
&lt;p&gt;Like this, the matrix plot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/examples/matrix/output_4_0.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Or this, a geo plot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/examples/geo/output_6_1.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Or this, an arc plot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/examples/arc_node_labels/output_2_1.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Or this, another circos plot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/examples/circos_node_labels/output_3_0.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Or this beautiful thing, a hive plot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/api/high-level-api/output_12_0.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;What's the unifying thread behind all of those plots? As it turns out, the thing I learned while building my own graph visualization tool was that &lt;strong&gt;rational and beautiful graph visualization starts with knowing how to order &lt;em&gt;nodes&lt;/em&gt; in a graph, and then drawing in the edges&lt;/strong&gt;. I would have never learned that had I not attempted to reinvent the wheel (or, perhaps, Circos plots)! Additionally, being able to build my own Python package was superbly empowering, especially as a graduate student! I could build my own tools, archive them in the public domain, and never have to solve the same problem again. This echoed Simon Willison's approach to software development:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I realized that one of the best things about open source software is that you can solve a problem once and then you can slap an open source license on that solution and you will &lt;em&gt;never&lt;/em&gt; have to solve that problem ever again, no matter who's employing you in the future.&lt;/p&gt;
&lt;p&gt;It's a sneaky way of solving a problem permanently.&lt;/p&gt;
&lt;p&gt;-- Original post link by Simon Willison &lt;a href="https://simonwillison.net/2025/Jan/24/selfish-open-source/"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If I didn't know how to build my own tools, I'd have been stuck, and I'd never have learned anything new.&lt;/p&gt;
&lt;h2 id="fast-forward-to-2018-at-novartis"&gt;Fast-forward to 2018 at Novartis&lt;/h2&gt;&lt;p&gt;My colleague Brant Peterson showed me the R package &lt;code&gt;janitor&lt;/code&gt;, and I thought, "Why can't Pythonistas have nice things?"&lt;/p&gt;
&lt;p&gt;Then, I remembered Gandhi's admonition&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"Be the change you wish to see in the world."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And so, &lt;code&gt;pyjanitor&lt;/code&gt; was born.&lt;/p&gt;
&lt;p&gt;Your dataframe manipulation and processing code can now be more expressive than native pandas:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Company1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Company2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Company3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Company2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Amazon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Company3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Facebook&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Google&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;450.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;550.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;800.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;By being the change I wanted to see, Pythonistas now have one more nice thing available to them.&lt;/p&gt;
&lt;p&gt;And of course, I just &lt;em&gt;had&lt;/em&gt; to inject this in: that was all in 2018.&lt;/p&gt;
&lt;p&gt;It's now 2025. Use polars. :)&lt;/p&gt;
&lt;h2 id="building-resilience-at-moderna"&gt;Building resilience at Moderna&lt;/h2&gt;&lt;p&gt;Fast-forward to 2021. I joined Moderna, attracted by the forward-thinking Digital leadership and their suite of high-power home-grown tools. It was a dog-fooding culture back then—one I've fought hard to keep alive within the Digital organization.&lt;/p&gt;
&lt;p&gt;Since I was only data scientist #6 at Moderna and was hired into a relatively senior role (Principal Data Scientist), I saw the chance to set standards for Moderna data scientists.&lt;/p&gt;
&lt;p&gt;Together with my wonderful colleague &lt;a href="https://www.linkedin.com/in/adriannaloback/"&gt;Adrianna Loback&lt;/a&gt; and our manager &lt;a href="https://www.linkedin.com/in/giessel/"&gt;Andrew Giessel&lt;/a&gt;, we hammered out what Data Scientists would ship (dockerized CLI tools run in the cloud, and Python packages) and designed our entire project initialization workflow around deploying those two things. As time progressed, the tooling evolved, and &lt;a href="https://www.linkedin.com/in/dandluu/"&gt;Dan Luu&lt;/a&gt; became a caretaker of the tooling as well, continually improving and modernizing it.&lt;/p&gt;
&lt;p&gt;By standardizing on what we ship and then standardizing on the toolchain, we implemented a design pattern that made it easy for us to help one another. I can jump into a colleague's codebase dealing with Clinical Development and be helpful in a modestly short amount of time, even when I mostly work on Research projects.&lt;/p&gt;
&lt;p&gt;And here's a side effect: we designed a portable way of working that works best when you give a Moderna data scientist access to a raw Linux machine. As Andrew Giessel once mentioned to me:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Eventually, tools that abstract away the Linux operating system will fail to satisfy users as they grow up and master Linux. They'll want to jump out of a container and just run raw Linux. Anything that tries to abstract away the filesystem, shell scripts, and more eventually runs into edge cases, so why not just give people access to a raw Linux machine with tools pre-installed?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As it turns out, this evening's other presenter &lt;a href="https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/"&gt;Tommy Tang&lt;/a&gt; is also a big fan of the shell:&lt;/p&gt;
&lt;iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:7341052068264128513?collapsed=0" height="265" width="504" frameborder="0" allowfullscreen="" title="Embedded post"&gt;&lt;/iframe&gt;&lt;p&gt;So now I'm a big fan of giving people access to a raw Linux box, outside of a sandboxed container. Being able to build and run a container is a fundamental skill nowadays—so much so that as a community of data scientists, we've effectively said "no" to vendor tooling that forces us to do our day-to-day work within a Docker container.&lt;/p&gt;
&lt;p&gt;And here's the most awesome part: we did this in an "internally open source" fashion. &lt;em&gt;Anyone&lt;/em&gt; with a complaint about the tooling can propose a fix to our tools. Even better, we'll walk you through making the fix "the right way," so you gain the superpower of software development along the way!&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/3ZTGwcHQfLY?si=_FLzvFyCp88ZlzGm" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;&lt;p&gt;At least on this dimension, we are never beholden to someone else's (or a vendor's) roadmap! We are now &lt;em&gt;resilient&lt;/em&gt;—just like Dustin from Smarter Every Day described when he made this video about trying to make things "in America."&lt;/p&gt;
&lt;p&gt;I'll end this section with a huge lesson I've learned during my time working here:&lt;/p&gt;
&lt;iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:7337223460651220992" height="349" width="504" frameborder="0" allowfullscreen="" title="Embedded post"&gt;&lt;/iframe&gt;&lt;h2 id="building-teaches-you-the-domain"&gt;Building teaches you the domain&lt;/h2&gt;&lt;p&gt;Do you remember these beautiful graph diagrams from earlier?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://ericmjl.github.io/nxviz/examples/arc_node_labels/output_2_1.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Building is a great way to learn new things. Building nxviz helped me learn the principles of graph visualization. Building LlamaBot helped me learn about making LLM applications.&lt;/p&gt;
&lt;p&gt;In 2023, I created LlamaBot because I was confused about how to interact with and build LLMs, particularly RAG applications. I decided to turn to my favorite learning tool: building software. This was clarifying—I was forced to encode my understanding into code, and if the code did unexpected things, I knew my understanding was wrong. After all:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Computers are the best students there are. If you teach the computer something wrong, it'll give you back wrong answers. If you design things wrongly, this student will make life hard for you. So you learn to get good at verification.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've rewritten LlamaBot at least four times, each time updating the codebase with the best of my knowledge. Each time round, my understanding improved, the abstractions changed along with it, and the ergonomics of using LlamaBot became better and more natural. Throughout the changes, some things have stayed constant:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The "Bot" analogy, which predates the term "agents," turns out to be a natural way to express Agents.&lt;/li&gt;
&lt;li&gt;The docstore abstraction simplifies storage and retrieval for pure text applications.&lt;/li&gt;
&lt;li&gt;My distaste for writing commit messages and release notes—hence the automated writers for both remain deeply ingrained as dog-fooded tools.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Some things that have evolved:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;QueryBot used to do entire RAG workflows all-in-one—from PDF-to-text conversion to embedding to retrieval. I've since learned it's much better to break those out into separate steps.&lt;/li&gt;
&lt;li&gt;ChatBot used to have a built-in ChatUI. I dropped it because it was too opinionated and unwieldy. Marimo has really good chat UI primitives that should be used instead.&lt;/li&gt;
&lt;li&gt;Inspiration from the &lt;code&gt;ell&lt;/code&gt; library: &lt;code&gt;lmb.user("some prompt")&lt;/code&gt; or &lt;code&gt;lmb.system("some prompt")&lt;/code&gt; for convenient creation of system and user prompts (see the sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
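&lt;p&gt;Here's a minimal sketch of those prompt helpers in use. The &lt;code&gt;SimpleBot&lt;/code&gt; usage below is from memory and may not match the current LlamaBot API exactly, so treat it as illustrative:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Sketch of the ell-inspired prompt helpers; SimpleBot usage is from
# memory and may not match the current LlamaBot API exactly.
import llamabot as lmb

system_prompt = lmb.system("You are a terse data science assistant.")
bot = lmb.SimpleBot(system_prompt)
reply = bot(lmb.user("Explain Circos plots in one sentence."))
&lt;/pre&gt;&lt;/div&gt;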
&lt;p&gt;In the process of building and designing software, we have to learn the domain so well that we become experts in that domain's language. Vocabulary, terms, and their relationships become natural extensions of what we already know. If our code maps to the domain properly, our abstractions become so natural they're self-documenting. If our code maps poorly onto a solid understanding of the problem space, it'll end up being a tangled mess that warrants a rewrite. There's nothing wrong with that! Embrace the need to rewrite—with AI assistance nowadays, the activation energy barrier to building your own tools is dramatically reduced.&lt;/p&gt;
&lt;h2 id="internal-tooling-requires-organizational-buy-in"&gt;Internal tooling requires organizational buy-in&lt;/h2&gt;&lt;p&gt;I then made my next point: you want to make sure you have organizational buy-in to any tool building efforts. It's super telling if your line management doesn't agree with you. On the other hand, it's super awesome if someone is going to be hired explicitly for tooling, like at Quora below:&lt;/p&gt;
&lt;p&gt;&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;We are opening up a new role at Quora: a single engineer who will use AI to automate manual work across the company and increase employee productivity. I will work closely with this person. &lt;a href="https://t.co/iKurWS6W7v"&gt;pic.twitter.com/iKurWS6W7v&lt;/a&gt;&lt;/p&gt;&amp;mdash; Adam D&amp;#39;Angelo (@adamdangelo) &lt;a href="https://twitter.com/adamdangelo/status/1936504553916309617?ref_src=twsrc%5Etfw"&gt;June 21, 2025&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://pbs.twimg.com/media/Gt_VT5nakAANTdj?format=png&amp;amp;name=large" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Reading this tweet triggered a thought: organizational buy-in helps sustain internal tool builds. Does your organization empower you to build the tools you need to get your work done? I was lucky to have full leadership buy-in through Andrew Giessel and Dave Johnson, and my current manager Wade keeps roadblocks out of the way as we innovate on how we work. I also try to encourage this across teams I have influence with, even without direct managerial responsibility.&lt;/p&gt;
&lt;p&gt;But as I also mentioned earlier, even though sustaining an internal tool build can be boosted with organizational buy-in, &lt;em&gt;culture needs no permission&lt;/em&gt;. We always have agency. We always have the free will to make things happen. We always can go forth and build. Build the smallest thing that gets roadblocks out of your way and move on. Throwaway builds are OK! No permission required.&lt;/p&gt;
&lt;h2 id="expert-practitioners-agree-build-your-own-tools"&gt;Expert practitioners agree: build your own tools&lt;/h2&gt;&lt;p&gt;If my arguments don't convince you, perhaps Hamel Husain, one of the leading AI eval practitioners, will:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Build a custom annotation tool.&lt;/strong&gt; This is the single most impactful investment you can make for your AI evaluation workflow. With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours. I often find that teams with custom annotation tools iterate ~10x faster.&lt;/p&gt;
&lt;p&gt;Custom tools excel because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They show all your context from multiple systems in one place&lt;/li&gt;
&lt;li&gt;They can render your data in a product-specific way (images, widgets, markdown, buttons, etc.)&lt;/li&gt;
&lt;li&gt;They're designed for your specific workflow (custom filters, sorting, progress bars, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Off-the-shelf tools may be justified when you need to coordinate dozens of distributed annotators with enterprise access controls. Even then, many teams find the configuration overhead and limitations aren't worth it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He makes a great point: "With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours."&lt;/p&gt;
&lt;p&gt;The barrier to entry for building your own tools nowadays is so much lower than before. Much of the grunt work can be automated away using templating and LLM assistance. If you want to build, now is the time to build.&lt;/p&gt;
&lt;h2 id="software-development-scales-everything"&gt;Software development scales everything&lt;/h2&gt;&lt;p&gt;I love the work I do partly because it is in the service of the discovery of medicines, and partly because I have an outlet for expressing creativity through the tools I make for myself and others. Through nearly 10 years of making tools, I've crystallized this lesson in scaling:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Software scales our labor.&lt;/li&gt;
&lt;li&gt;Documentation scales our brains.&lt;/li&gt;
&lt;li&gt;Tests scale others' trust in our code.&lt;/li&gt;
&lt;li&gt;Design scales our agility.&lt;/li&gt;
&lt;li&gt;Agents scale our processes.&lt;/li&gt;
&lt;li&gt;Open source scales opportunity for impact.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you can build software tools for yourself, you can scale yourself. If you teach others to use those same tools, you can scale their labor. You can scale your brain by documenting those tools well. If you test those tools thoroughly, you can scale trust in the codebase, enabling others to contribute with confidence. If you design the software well—and more importantly, design the business process that software supports well—you can become nimble and agile without the trappings of Big Fake Agile. If you use agents, and more generally automation as part of the custom tooling, you can scale those same processes even further. If you make your tooling open source (whether internally or externally), you scale the opportunity for others to contribute.&lt;/p&gt;
&lt;p&gt;Culture needs no permission (another great lesson that I learned from Andrew Giessel), and if you need to unblock yourself, build your own tools. There is no magic sauce in the choice of tools that we use and make. &lt;strong&gt;The magic sauce is in the people who choose to show up and build.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And so, my fellow builders, let's build. Not because your company asks it of you, but because patients are waiting. Patients have no patience. We joined this line of work because we want to have the greatest impact on patients with our medicines. Computational types like us should never be the bottleneck to shipping medicines. Building tools for ourselves empowers us to keep ourselves unstuck, remove the viscous traps that slow us down, and keep medicines moving.&lt;/p&gt;
&lt;p&gt;I'll now leave you with a final quote, from Michael Jackson's song, Heal the World:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;There are people dying, if you care enough for the living, make a better place for you and for me.&lt;/p&gt;
&lt;p&gt;— Heal the World (Michael Jackson)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And so to my fellow techies in bio, it's time to build. Thank you.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="reactions"&gt;Reactions&lt;/h2&gt;&lt;p&gt;After Tommy's talk, we had another round of networking, which was awesome. I heard some great perspectives. &lt;a href="https://www.linkedin.com/in/kucukural/"&gt;Alper Kucukural&lt;/a&gt;, who is both an industry and academia person, mentioned how his students needed to hear the message that they can be empowered to build their own tools, no permission required. Too many get stuck. Students -- learn how to build!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.linkedin.com/in/maciejpacula/"&gt;Maciej Pacula&lt;/a&gt; also posted his reaction on LinkedIn:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;I had a great time at the &lt;a href="https://www.linkedin.com/company/datadrivenpharma/"&gt;DataDrivenPharma&lt;/a&gt; event at Moderna yesterday. Thanks &lt;a href="https://www.linkedin.com/in/dricaptain/"&gt;Ilya Captain, PhD&lt;/a&gt; for organizing, and hope you bring more such events to the East Coast!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ericmjl/"&gt;Eric Ma&lt;/a&gt;'s talk about building your own tools and using them as a force multiplier not just for yourself but for others resonated deeply.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/"&gt;🎯 Ming "Tommy" Tang&lt;/a&gt;'s talk about "good enough" reproducibility made the excellent point that sometimes you just need to talk to the lab scientists (what a concept!) and collaborate on common standards.&lt;/p&gt;
&lt;p&gt;Appreciated the shout out for &lt;a href="https://www.linkedin.com/company/gofigr/"&gt;GoFigr.io&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/"&gt;🎯 Ming "Tommy" Tang&lt;/a&gt; :-)&lt;/p&gt;
&lt;p&gt;Thanks &lt;a href="https://www.linkedin.com/in/ted-natoli-compbio/"&gt;Ted Natoli&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/colles/"&gt;Colles Price M.S., Ph.D.&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/wesserg/"&gt;Sergiusz Wesolowski&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/ilyashl/"&gt;Ilya Shlyakhter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/vasant-marur/"&gt;Vasant Marur&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/kucukural/"&gt;Alper Kucukural, PhD&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/jamesjcrowley/"&gt;James J. Crowley&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/gunjan-singh-thakur-b8251620/"&gt;Gunjan Singh Thakur&lt;/a&gt; for the company and conversation.&lt;/p&gt;
&lt;p&gt;-- Original post link &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7344004044933226496/"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ericmerle/"&gt;Eric Merle&lt;/a&gt;'s reaction is below:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;A lot is possible when we build the right tools...&lt;/p&gt;
&lt;p&gt;Yesterday's DataDrivenPharma event at &lt;a href="https://www.linkedin.com/company/modernatx/"&gt;Moderna&lt;/a&gt; completely energized my thinking about exactly that and I'll tell you specifically why.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ericmjl/"&gt;Eric Ma&lt;/a&gt; from Moderna shared something that hit home: "Data scientists should never become bottlenecks in getting medicines to patients who need them." His approach to building custom tools that scale impact, from automated slide generation to standardized project workflows, showed exactly how thoughtful tooling can accelerate discovery.
&lt;a href="https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/"&gt;🎯 Ming "Tommy" Tang&lt;/a&gt; from &lt;a href="https://www.linkedin.com/company/astrazeneca/"&gt;AstraZeneca&lt;/a&gt; complemented this perfectly with his presentation on reproducible bioinformatics practices. His insights on proper file naming conventions (how many of us are guilty of having final1, final2, final3 files?), consistent folder structures, and creating reproducible workflows provided the foundation that makes scaling actually possible. You can't build lasting tools without these fundamentals in place.&lt;/p&gt;
&lt;p&gt;Both emphasized that it's not just about writing code, but also about building infrastructure. Eric's philosophy around scaling through software combined with Tommy's disciplined approach to reproducibility showed how the right practices can create tools that continue delivering value long after the original builder moves on.&lt;/p&gt;
&lt;p&gt;The potential to create AI tools that don't just automate routine tasks but fundamentally change how we approach patient care and drug development feels limitless. Both presentations reinforced that we're now building the infrastructure that could accelerate how quickly life-saving treatments reach patients. The timing feels perfect. We have AI capabilities that can scale impact in ways that weren't possible even two years ago.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="https://www.linkedin.com/in/dricaptain/"&gt;Ilya Captain, PhD&lt;/a&gt; at &lt;a href="https://www.linkedin.com/company/datadrivenpharma/"&gt;DataDrivenPharma&lt;/a&gt; for organizing this excellent event and to &lt;a href="https://www.linkedin.com/in/louise-liu-phd-mba-b195b3343/"&gt;Louise Liu, PhD, MBA&lt;/a&gt; from &lt;a href="https://www.linkedin.com/company/hill-research/"&gt;Hill Research&lt;/a&gt; for the introduction to Tommy and recommending I attend.&lt;/p&gt;
&lt;p&gt;What tools are you building to scale your impact? Curious to hear what others are working on in this space.&lt;/p&gt;
&lt;p&gt;PS: Happy to have been able to chat with Eric and Tommy&lt;/p&gt;
&lt;p&gt;-- Original post link &lt;a href="https://www.linkedin.com/posts/ericmerle_digitalhealth-ai-datascience-activity-7344158374910996480-ViiO/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAKTdlUBKWeDvuvNDNpOBmAV1OszCr-W__c"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And &lt;a href="https://www.linkedin.com/in/originalpatrick/"&gt;Patrick Hofmann&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;A couple of great talks by &lt;a href="https://www.linkedin.com/in/ericmjl/"&gt;Eric Ma&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/%F0%9F%8E%AF-ming-tommy-tang-40650014/"&gt;🎯 Ming "Tommy" Tang&lt;/a&gt; at &lt;a href="https://www.linkedin.com/in/dricaptain/"&gt;Ilya Captain, PhD&lt;/a&gt;’s Data Driven Pharma event last night. Eric made a strong case for data scientists building their own tools. I’m no programmer, but I have dabbled in woodworking and it reminded me of all the jigs I’ve built for various projects.&lt;/p&gt;
&lt;p&gt;There are many facets to the ‘buy vs build’ question and here’s one I think often gets overlooked: If an off the shelf solution is available, will it do precisely what you want? Or will you need to conform your project to it? The answer isn’t always clear cut but it’s worth considering when choosing how to allocate your time and resources.&lt;/p&gt;
&lt;p&gt;-- Original post link &lt;a href="https://www.linkedin.com/posts/originalpatrick_a-couple-of-great-talks-by-eric-ma-and-activity-7343995611756511234-uO_5?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAKTdlUBKWeDvuvNDNpOBmAV1OszCr-W__c"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Afterwards, in our discussion, Patrick had a great point about the parallel between custom tools and woodworking jigs: you can either make your own jigs or buy them, but if you buy them, you now have to conform your woodworking to the jig, and not the other way around. Little compromises compound against the quality of the final deliverable!&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="but-where-is-deckbot"&gt;But where is deckbot?&lt;/h2&gt;&lt;p&gt;Ok, I bet you're just like me, you hate making slides, and you want to see DeckBot. You can find it linked &lt;a href="slides-maker.py"&gt;here&lt;/a&gt; as a marimo notebook! To run it, you'll need an OpenAI API key mapped to the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable. Download the notebook and run this:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sk-your-api-key-here&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;uvx&lt;span class="w"&gt; &lt;/span&gt;marimo&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;--sandbox&lt;span class="w"&gt; &lt;/span&gt;/your/path/to/slides_maker.py
&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="and-what-were-the-slides-you-actually-presented"&gt;And what were the slides you actually presented?&lt;/h2&gt;&lt;p&gt;I archived them for posterity &lt;a href="index.md"&gt;here&lt;/a&gt;. Enjoy!&lt;/p&gt;
</content></entry><entry><title>Rethinking LLM interfaces, from chatbots to contextual applications</title><link href="https://ericmjl.github.io/blog/2025/6/14/rethinking-llm-interfaces-from-chatbots-to-contextual-applications/" rel="alternate"/><updated>2025-06-14T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:a76ca6c0-c657-334a-ad8e-545e0667a5a8</id><content type="html">&lt;p&gt;Chat interfaces were a great starting point for interacting with large language models, but they're not the endgame. &lt;strong&gt;My thesis is that we should build LLM applications as contextual tools embedded in structured workflows, not as open-ended chat interfaces.&lt;/strong&gt; This insight came from three converging threads that fundamentally changed how I think about building LLM-powered applications.&lt;/p&gt;
&lt;p&gt;The first thread came from a conversation with my colleague &lt;a href="https://www.linkedin.com/in/michelle-faits/"&gt;Michelle Faits&lt;/a&gt;, who articulated that apps powered by generative AI really need to end up looking less like chat interfaces and more like TurboTax -- where there's a well-defined process that needs to happen, and instead of users filling out forms manually, we ask an AI to help with the form-filling process.&lt;/p&gt;
&lt;p&gt;The second thread was a YouTube video titled "&lt;a href="https://youtu.be/mRqBjKFyfLc?si=9sRDPg-hH5iBLiFf"&gt;AI UX Design: ChatGPT interfaces are already obsolete&lt;/a&gt;" by Alan Pike from Vancouver. In it, he talks about shifting from chatbot to context-native interfaces, a change that's both subtle and dramatic. It's subtle because there's little visible change, but dramatic because the way you interact with the interface changes fundamentally. You're no longer stuck with the drudge work of filling out yet another form, but are instead presented with an AI-powered interface capable of understanding what your next action is likely to be and anticipating it just in time.&lt;/p&gt;
&lt;p&gt;The third thread is Clayton Christensen's "jobs to be done" theory. What I've been noticing is that there are too many ChatGPT copycat clones, and those chat clones don't really help me accomplish the job that I'm trying to do. It takes a different type of interface to make that happen.&lt;/p&gt;
&lt;h2 id="these-threads-converge-on-a-simple-truth"&gt;These threads converge on a simple truth&lt;/h2&gt;&lt;p&gt;What connects TurboTax's structured approach, Pike's context-native interfaces, and jobs-to-be-done theory is this: &lt;strong&gt;the most effective LLM applications will embed AI capabilities directly into well-defined workflows rather than forcing users to articulate their needs through chat.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This means moving from "tell the AI what you want" to "let the AI assist you as you work through a process you already understand."&lt;/p&gt;
&lt;h2 id="the-turbotax-moment"&gt;The TurboTax moment&lt;/h2&gt;&lt;p&gt;Michelle's insight about TurboTax really stuck with me. TurboTax works because it represents a well-defined business process with pretty routine steps that we need to walk through, but some of the steps do require judgment calls. Do you fill out this section or not? And what do you fill in? You need to determine that from context, so there's a little bit of agency for LLM bots inside there. But for the most part, it's just form filling.&lt;/p&gt;
&lt;p&gt;This is a powerful analogy for LLM apps, one that gets to the heart of any app build. The question becomes: how do you go about building a user interface that works like this? When we build chat interfaces, we put a lot of onus on the LLM to make smart decisions on behalf of us. But what if chat wasn't the primary way of interacting? What if we had well-defined business workflows supported by custom apps that just require us to fill out forms in a delightful way?&lt;/p&gt;
&lt;h2 id="the-obsolescence-of-chat-interfaces"&gt;The obsolescence of chat interfaces&lt;/h2&gt;&lt;p&gt;Alan Pike's perspective really crystallized something I'd been feeling. In his talk, he showed how we're moving from text-based interfaces that are powerful but confounding to 90% of people, toward context-native interfaces that inject AI capabilities right where you need them.&lt;/p&gt;
&lt;p&gt;Think about it: we've already started seeing hints of tools pushing chat to the side. ChatGPT has Canvas mode now, where if you ask it to co-author a document, it sticks the chat up in the corner and lets you focus on the work you're doing. But this is still just the beginning.&lt;/p&gt;
&lt;p&gt;Pike showed examples of right-click contextual actions, natural language search that understands intent rather than requiring exact phrases, and date pickers where you can just say "next Thursday at 11" instead of clicking through calendar grids. These represent a fundamental shift in how we think about human-computer interaction.&lt;/p&gt;
&lt;p&gt;I thought the talk was quite good, and I'm embedding it below to share.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/mRqBjKFyfLc?si=9sRDPg-hH5iBLiFf" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;&lt;h2 id="jobs-to-be-done-theory-meets-llm-apps"&gt;Jobs to be done theory meets LLM apps&lt;/h2&gt;&lt;p&gt;Clayton Christensen's jobs-to-be-done framework is perfect for thinking about LLM applications. When I look at most LLM interfaces, we've become hooked on chat -- but they don't necessarily always help me accomplish the specific job I'm trying to do. Generic chat interfaces put the burden on me to figure out how to express my needs and on the LLM to figure out what I actually want. What if we could do better?&lt;/p&gt;
&lt;p&gt;I think what we're going to see is an evolution of LLM-powered apps from being text and chat driven to being deeply embedded within applications, making it possible to flow through business processes in a way that's much smoother and more delightful than what was possible before. It's not really about agentic capabilities, which are nice, but the winners will be the interfaces that inject LLMs in just the right places -- in the boring work!&lt;/p&gt;
&lt;h2 id="building-deckbot-demonstrates-this-approach"&gt;Building DeckBot demonstrates this approach&lt;/h2&gt;&lt;p&gt;Let me show you what this looks like in practice. I built a Markdown slide deck generator called DeckBot, deliberately avoiding chat as the primary interface because it was too freeform and unreliable.&lt;/p&gt;
&lt;p&gt;Instead of starting with a UI, I began with the data model: defining a &lt;code&gt;Slide&lt;/code&gt; as a Pydantic model with title, content, and type. I tested individual slide generation in a Marimo notebook until each component worked reliably. Then I put them together into a &lt;code&gt;SlideDeck&lt;/code&gt; Pydantic model. This allowed me to compose a &lt;code&gt;SlideDeck&lt;/code&gt; from individually-generated &lt;code&gt;Slides&lt;/code&gt;.&lt;/p&gt;
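&lt;p&gt;Here's a minimal sketch of what that &lt;code&gt;Slide&lt;/code&gt; model might look like. The title, content, and type fields come from the description above; the &lt;code&gt;render&lt;/code&gt; body and the example type values are illustrative assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    content: str
    type: str  # e.g. "content" or "section header"; values are assumed

    def render(self) -&amp;gt; str:
        # Render the slide as a Markdown fragment.
        return f"## {self.title}\n\n{self.content}"
&lt;/code&gt;&lt;/pre&gt;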
&lt;p&gt;The next breakthrough came when I realized I could inject LLM capabilities directly into the data objects themselves. Instead of an agent orchestrating external tools, my data models gained natural language-powered methods:&lt;/p&gt;
&lt;div class="hll"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;SlideDeck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;slides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Slide&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;talk_title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Edit the slide at a given index using natural language.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span class="n"&gt;current_slide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slides&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;render&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;new_slide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slidemaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slidemaker_edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_slide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slides&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_slide&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
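&lt;p&gt;In use, that method makes slide edits one-liners. A hypothetical usage sketch, assuming &lt;code&gt;intro&lt;/code&gt; and &lt;code&gt;methods&lt;/code&gt; are previously generated &lt;code&gt;Slide&lt;/code&gt; instances:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical values; the slides would come from earlier generation steps.
deck = SlideDeck(slides=[intro, methods], talk_title="How to scale yourself")

# Natural language drives the edit; the model rewrites slide 0 in place.
deck.edit(0, "Tighten the title and add a one-line agenda.")
&lt;/code&gt;&lt;/pre&gt;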
&lt;p&gt;This represents a fundamental shift: instead of putting all intelligence in a central agent, I distributed it into the data models themselves. Each Pydantic model knows how to manipulate itself based on natural language instructions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeckBot sits at step 5 of the maturity ladder I'll describe below&lt;/strong&gt;; it provides LLM-augmented interfaces that understand context and assist with specific tasks, but within a structured framework.&lt;/p&gt;
&lt;h2 id="the-future-of-llm-applications"&gt;The future of LLM applications&lt;/h2&gt;&lt;p&gt;I believe we're going to see LLM applications become more like TurboTax and less like open-ended chat interfaces. These will be applications built around well-defined business processes that users can flow through smoothly, with AI providing assistance at just the right moments.&lt;/p&gt;
&lt;p&gt;There's still a place for agents, but we need to recognize that adoption follows a ladder of maturity:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Unstructured work relying on human intuition&lt;/li&gt;
&lt;li&gt;Documented SOPs and manual processes&lt;/li&gt;
&lt;li&gt;Digital UIs guiding humans through structured processes&lt;/li&gt;
&lt;li&gt;Rule-based automation for predictable parts of workflows&lt;/li&gt;
&lt;li&gt;LLM-augmented interfaces providing contextual assistance&lt;/li&gt;
&lt;li&gt;Semi-autonomous LLM components handling defined subtasks&lt;/li&gt;
&lt;li&gt;Full agent orchestration with human oversight&lt;/li&gt;
&lt;li&gt;Truly autonomous agent systems managing entire business processes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Customer support agents have emerged as one of the first places for LLM agents, and I suspect it's because customer support as a business process has more or less been well-standardized. The fact that we can "agentify" it stems from decades of process refinement. Other business domains need to undergo similar transformation before they're ready for full agent automation.&lt;/p&gt;
&lt;p&gt;At Moderna, we've embraced generative AI heavily, relying on ChatGPT and custom GPTs. But I know this cannot be the only way we interact with LLMs. There are ways to surgically inject LLMs into workflows so users can accomplish what they're trying to do in a structured fashion, but in a delightfully smooth and flowing way.&lt;/p&gt;
&lt;p&gt;The big lesson I learned building DeckBot is understanding where and when to inject LLMs very surgically into custom LLM applications. It's not about replacing human decision-making with AI decision-making; it's about augmenting human workflows with AI capabilities at precisely the right moments.&lt;/p&gt;
&lt;h2 id="key-principles-for-contextual-llm-applications"&gt;Key principles for contextual LLM applications&lt;/h2&gt;&lt;p&gt;Drawing from the TurboTax insight, Pike's context-native approach, and jobs-to-be-done theory, here are the essential principles I've learned:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with the data model, not the interface.&lt;/strong&gt; Get clear on what you're actually trying to accomplish and model that as structured data first. Design APIs around those data models that work through clean function calls before adding any LLM capabilities.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inject LLMs surgically into workflows.&lt;/strong&gt; Identify the specific points where natural language understanding or generation adds value, rather than building everything around chat or agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test with structured examples first.&lt;/strong&gt; Use notebooks to validate that your core functions work properly before thinking about user interfaces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build for the job-to-be-done.&lt;/strong&gt; Don't chase the latest agentic capabilities just because they're exciting. Focus on making specific workflows easier and more delightful.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="the-path-forward"&gt;The path forward&lt;/h2&gt;&lt;p&gt;Chat was the beginning of our journey with LLMs, but it is most certainly not the destination. The three threads I described, namely, Michelle's TurboTax-esque structured approach, Pike's context-native interfaces, and Christensen's jobs-to-be-done framework, all point toward the same future: &lt;strong&gt;LLM applications that flow smoothly through business processes, where AI assistance appears exactly when and where it's needed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This isn't about replacing human decision-making with AI decision-making. It's about augmenting human workflows with AI capabilities at precisely the right moments, without forcing users to translate their intentions into chat prompts or rely on agents to make all decisions for them.&lt;/p&gt;
&lt;p&gt;We're at the beginning of an incredible generation of software and products, and it's an exciting time to build not just the software but the processes around it too! The question now is: how quickly can we move beyond chat alone to build contextual applications that truly help people accomplish their goals?&lt;/p&gt;
</content></entry><entry><title>Principles for using AI autodidactically</title><link href="https://ericmjl.github.io/blog/2025/6/7/principles-for-using-ai-autodidactically/" rel="alternate"/><updated>2025-06-07T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:0983e738-2271-3166-ad98-defee09d63c1</id><content type="html">&lt;h2 id="we-need-to-move-beyond-passive-consumption"&gt;We need to move beyond passive consumption&lt;/h2&gt;&lt;p&gt;Imagine having a personal tutor who's absorbed millions of books, papers, and discussions across every field of human knowledge. That's essentially what Large Language Models (LLMs) offer us today. As David Duvenaud aptly describes them, LLMs are a "&lt;a href="https://x.com/DavidDuvenaud/status/1895139380198584794"&gt;galaxy brain&lt;/a&gt;" of knowledge waiting to be tapped.&lt;/p&gt;
&lt;p&gt;But &lt;em&gt;having access&lt;/em&gt; to information isn't the same as &lt;em&gt;learning from it&lt;/em&gt;. The difference lies in how we engage with these AI tools - passively consuming their outputs versus actively using them to expand our understanding. Through my interviews with researchers and digital professionals, I've discovered patterns in how the most effective learners use AI autodidactically - teaching themselves with AI as their assistant, not their replacement.&lt;/p&gt;
&lt;h2 id="lessons-from-autodidactic-ai-users-at-work"&gt;Lessons from autodidactic AI users at work&lt;/h2&gt;&lt;p&gt;I have conducted many interviews at work about how folks in Moderna's Research and Digital organizations use AI. While the discussions are insanely specific to work and sometimes touch on IP that I cannot reveal, there are principles and patterns in what I observe the best folks do when using AI in their day-to-day work to learn new stuff.&lt;/p&gt;
&lt;h3 id="generate-a-personalized-syllabus-for-learning"&gt;Generate a personalized syllabus for learning&lt;/h3&gt;&lt;p&gt;They recognize that any kind of learning involves effort and hard work, and that the pain of the process is a non-negotiable to make anything stick. So instead of using AI to do stuff for them, they start by using AI to provide a tailored syllabus that allows them to progressively move up the knowledge ladder with increasing effort.&lt;/p&gt;
&lt;p&gt;This is what I would call "scaffolding a personalized syllabus". Their prompts here often include a bit about their current role, their prior training, their own objectives for learning, and what they know from prior experience about how they learn best. From that syllabus, they iterate and follow up.&lt;/p&gt;
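&lt;p&gt;A scaffolding prompt of this kind might read something like this (the details are invented for illustration):&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"I'm a computational biologist with strong Python skills but no formal ML training. I want to understand retrieval-augmented generation well enough to prototype one at work. I learn best by building small projects. Draft me a four-week syllabus that ramps up in difficulty, with one hands-on exercise per week."&lt;/p&gt;
&lt;/blockquote&gt;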
&lt;h3 id="apply-one-s-ability-to-think-critically-to-llm-outputs"&gt;Apply one's ability to think critically to LLM outputs&lt;/h3&gt;&lt;p&gt;They recognize that questions are a great way to learn, so they will continuously question and LLM to draw out answers. The act of generating a question as a human is part of the effort needed.&lt;/p&gt;
&lt;p&gt;They apply the skill of critical thinking to the answers generated by an LLM, asking questions such as "if this is true..." or "is this coherent with...". They do not blindly accept the output of an LLM!&lt;/p&gt;
&lt;p&gt;Apart from self-coherence with what they know, they verify by cross-checking reputable sources on the internet -- scholarly literature, expert writing, etc.&lt;/p&gt;
&lt;p&gt;At a meta-level, if they find an angle that demands explanation, knowing that sometimes an LLM can be blinded by conversation history, they will explicitly prompt an LLM on contrary points, using prompts that start with, "but I remember that..." or "this sounds suspicious, could it be that..."&lt;/p&gt;
&lt;p&gt;Also, in the absence of another human, they use LLMs to provide initial critique about what they have produced (e.g. in writing form). They use LLMs in the same way jazz musicians riff off one another.&lt;/p&gt;
&lt;h2 id="what-s-the-core-trick"&gt;What's the core trick?&lt;/h2&gt;&lt;p&gt;At its core, the main "trick" to using an LLM autodidactically is to avoid delegating critical thinking to the LLM and instead applying the full force of one's agency. We need to leverage the galaxy brain of knowledge from its training set (and, where applicable, internet search capabilities) and apply individual effort by critically thinking through LLM outputs. Essentially, every skill we were taught to hone in literature class in high school, debate club in junior college, science philosophy class in undergrad, and scientific journal clubs during graduate training!&lt;/p&gt;
&lt;p&gt;AI has brought the philosophical points of human agency into sharp relief. Like any tool, LLMs can be used to increase your agency or diminish it. It's a double-edged sword. Use it for the former!&lt;/p&gt;
</content></entry><entry><title>The invisible polish of automatic model routing</title><link href="https://ericmjl.github.io/blog/2025/5/25/the-invisible-polish-of-automatic-model-routing/" rel="alternate"/><updated>2025-05-25T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:12869f09-d6d3-311d-92c1-16fdf5ec643c</id><content type="html">&lt;p&gt;I've been using Cursor's latest updates, and while the surface-level improvements are nice—better edge rounding, refined colors, thoughtful layering—there's one change that's got me genuinely excited: automatic model routing.&lt;/p&gt;
&lt;p&gt;No more model picker. No more stopping mid-thought to decide between OpenAI's models, Claude, or whatever other model might be appropriate for my current task. Cursor just figures it out and routes my request to the right model automatically.&lt;/p&gt;
&lt;h2 id="why-model-pickers-are-ui-bugs"&gt;Why model pickers are UI bugs&lt;/h2&gt;&lt;p&gt;I remember reading somewhere (probably on Twitter, let's be honest) that model pickers are fundamentally a UI bug. The argument was simple: users shouldn't need to understand the technical differences between models to get their work done. They should just describe what they want, and the system should handle the rest.&lt;/p&gt;
&lt;p&gt;At the time, I nodded along but didn't fully appreciate how right this was until I experienced Cursor's implementation. Before this change, I was making micro-decisions about model selection multiple times per day. Should I use the faster model for this simple refactoring? Do I need the more capable model for this complex architectural question? Each decision was small, maybe taking 2-3 seconds, but they added up.&lt;/p&gt;
&lt;h2 id="the-cognitive-tax-of-micro-decisions"&gt;The cognitive tax of micro-decisions&lt;/h2&gt;&lt;p&gt;These tiny decisions represent what I think of as cognitive tax: small mental overhead that accumulates throughout the day. Each model selection forced a brief context switch: I had to step out of my coding flow, evaluate the complexity of my request, weigh speed versus capability, and make a choice.&lt;/p&gt;
&lt;p&gt;The individual cost was negligible. The cumulative cost was not. By the end of the day, I'd made dozens of these micro-decisions, each one pulling a small amount of mental energy away from the actual problem I was trying to solve.&lt;/p&gt;
&lt;p&gt;Cursor's automatic routing eliminates this entirely. I describe what I want, hit enter, and trust that the right model will handle it. The decision-making burden shifts from me to the system, where it belongs.&lt;/p&gt;
&lt;h2 id="parallels-to-apple-s-design"&gt;Parallels to Apple's design&lt;/h2&gt;&lt;p&gt;This reminds me of something I read about Apple's design philosophy during the Jony Ive era. The idea more than making things look beautiful, it was about removing friction at every possible level, even in places users might not consciously notice.&lt;/p&gt;
&lt;p&gt;Think about the original iPhone's lack of a keyboard. Everyone said it was crazy, that people needed physical keys. But Apple understood that the mental model of "keyboard for typing" was actually limiting. By removing the physical keyboard, they freed up space for context-sensitive interfaces that could adapt to what you were actually trying to do.&lt;/p&gt;
&lt;p&gt;Cursor's automatic model routing feels like the same kind of thinking. Instead of optimizing the model picker interface, they eliminated the need for it entirely. The best interface is often no interface at all.&lt;/p&gt;
&lt;h2 id="the-broader-principle"&gt;The broader principle&lt;/h2&gt;&lt;p&gt;What makes this interesting isn't just that it saves me a few seconds per day. It's that it represents a shift in how we think about AI tool design. Instead of exposing the complexity of the underlying system to users, we can build intelligence into the routing layer itself.&lt;/p&gt;
&lt;p&gt;This has implications beyond just model selection. How many other micro-decisions are we forcing users to make that could be automated away? How many interface elements exist because we haven't figured out how to make them unnecessary?&lt;/p&gt;
&lt;p&gt;I suspect we'll see more of this pattern as AI tools mature. The first generation of AI interfaces were necessarily explicit: users needed to understand models, parameters, and context windows because the tools couldn't make those decisions reliably. But as the underlying systems get smarter, the interfaces can get simpler.&lt;/p&gt;
&lt;h2 id="the-invisible-improvements"&gt;The invisible improvements&lt;/h2&gt;&lt;p&gt;The best improvements are often the ones you don't notice consciously but feel in your workflow. Cursor's automatic model routing is exactly this kind of enhancement. I don't think about it while I'm coding, but I feel its absence when I use other tools that still require manual model selection.&lt;/p&gt;
&lt;p&gt;This is the kind of polish that compounds. Each eliminated micro-decision, each removed point of friction, each automated choice creates space for deeper focus on the work that actually matters. It's not revolutionary on its own, but it's part of building tools that feel like extensions of thought rather than obstacles to it.&lt;/p&gt;
&lt;p&gt;The question for other AI tool builders is: what other invisible friction exists in your interfaces? What decisions are you forcing users to make that your system could handle automatically? The model picker was just the beginning.&lt;/p&gt;
</content></entry><entry><title>Supercharge your coding agents with VSCode workspaces</title><link href="https://ericmjl.github.io/blog/2025/5/24/supercharge-your-coding-agents-with-vscode-workspaces/" rel="alternate"/><updated>2025-05-24T00:00:00Z</updated><author><name>Eric J. Ma</name></author><id>urn:uuid:d299229c-1aa0-384c-a2e2-cb36c3d97384</id><content type="html">&lt;p&gt;I was building out my &lt;a href="https://github.com/ericmjl/building-with-llms-made-simple"&gt;LLM tutorial repository&lt;/a&gt; for &lt;a href="https://www.scipy2025.scipy.org/"&gt;SciPy 2025&lt;/a&gt; and found myself constantly switching between windows—improving the &lt;a href="https://github.com/ericmjl/llamabot"&gt;LlamaBot&lt;/a&gt; library in one window, then flipping to my tutorial repo in another to update examples that used the new features. Every time I added a new method or changed an API in LlamaBot, I had to remember to update the corresponding tutorial examples. The constant context switching was slowing me down and making it easy to miss places where the tutorial needed updates.&lt;/p&gt;
&lt;p&gt;Then I discovered something that changed how I code across repos: Workspaces! They aren't just convenient for organizing multiple repositories; they're also game-changers for coding agents in Cursor.&lt;/p&gt;
&lt;p&gt;When you add multiple repositories to the same workspace, your coding agent magically gains context across all your repos simultaneously. No more window switching, no more explaining relationships between codebases. Instead, your coding assistants can access code in multiple repositories at once.&lt;/p&gt;
&lt;p&gt;Here's how to set this up and why it matters.&lt;/p&gt;
&lt;h2 id="setting-up-your-first-multi-repo-workspace"&gt;Setting up your first multi-repo workspace&lt;/h2&gt;&lt;h3 id="step-1-create-the-workspace"&gt;Step 1: Create the workspace&lt;/h3&gt;&lt;p&gt;Open a blank Cursor/VSCode window and immediately save it as a workspace file (File → Save Workspace As). I recommend saving it outside any repository; I keep mine as a sibling directory to my repos, like &lt;code&gt;~/github/llm-scipy-tutorial.code-workspace&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Then, go to File → Add Folder to Workspace to add your first repository (my main tutorial project &lt;code&gt;~/github/building-with-llms-made-simple&lt;/code&gt;), and repeat for your second repo (the companion library I was improving, &lt;code&gt;~/github/llamabot&lt;/code&gt;). You'll see both folders appear in the Explorer sidebar.&lt;/p&gt;
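&lt;p&gt;If you're curious, the saved &lt;code&gt;.code-workspace&lt;/code&gt; file is just JSON. For the layout above, it would look roughly like this (paths are relative to the workspace file, so this assumes it sits next to both repos):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  "folders": [
    { "path": "building-with-llms-made-simple" },
    { "path": "llamabot" }
  ]
}
&lt;/code&gt;&lt;/pre&gt;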
&lt;h3 id="step-2-watch-the-magic-happen"&gt;Step 2: Watch the magic happen&lt;/h3&gt;&lt;p&gt;Here's where it gets interesting. Fire up Cursor's AI or GitHub Copilot and give it a specific prompt that references files across both repos. Try something like:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;"Look at @llamabot/llamabot/bot/simplebot.py and edit @building-with-llms-made-simple/notebooks/03_advanced_bot.py to update the StructuredBot example for document summarization."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(If you're using VSCode instead of Cursor, just replace the @ symbols with #.)&lt;/p&gt;
&lt;p&gt;Your agent can now see both codebases simultaneously. It understands how your LlamaBot library works and can create coherent examples in your tutorial repo, suggesting coordinated changes across repos while maintaining consistency between your library code and tutorial examples.&lt;/p&gt;
&lt;h3 id="step-3-reopening-your-workspace"&gt;Step 3: Reopening your workspace&lt;/h3&gt;&lt;p&gt;Next time you open Cursor or VSCode, you'll see your workspace listed on the welcome screen under "Recent". Click it to instantly load all your repositories with the same folder structure and settings.&lt;/p&gt;
&lt;h2 id="quick-tips-that-make-this-even-better"&gt;Quick tips that make this even better&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Keep workspaces outside repositories:&lt;/strong&gt;
I always save workspace files as siblings to my repo directories, never inside them. This prevents workspace files from accidentally getting committed and keeps things clean when you're working across multiple projects.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~/github/
├── llamabot/
├── building-with-llms-made-simple/
└── llm-scipy-tutorial.code-workspace
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Quick tip on scale:&lt;/strong&gt;
You can add as many repositories as you need. I've had workspaces with multiple model experiment repos, shared data utilities, and production pipelines, and Cursor's agent could reference files across all of them. When you use &lt;code&gt;@workspace&lt;/code&gt; in Cursor, it considers every file in every repository. Fair warning though—I recently worked across 5 repos at work and my head was spinning even with LLM help. Sometimes less is more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be prescriptive with file references:&lt;/strong&gt;
Prompting across repos works best when you can pinpoint exactly which file to reference or edit. In Cursor, use &lt;code&gt;@&amp;lt;file&amp;gt;&lt;/code&gt; syntax, while in VSCode it's &lt;code&gt;#&amp;lt;file&amp;gt;&lt;/code&gt;. This helps the agent focus on the specific files you care about rather than wandering through your entire workspace.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use this pattern strategically:&lt;/strong&gt;
This approach shines when your current project depends on functionality that was developed beforehand in other repositories. Think data science projects that depend on internal tools built by other teams, or tutorial repositories that need to stay consistent with the underlying library they're demonstrating. When you have models or analyses that depend on utilities, libraries, or frameworks developed separately, workspaces let your coding agent understand both the dependency and the dependent code simultaneously. For single-repo exploratory work, stick to regular folders.&lt;/p&gt;
&lt;p&gt;That's it. Next time you're coordinating changes across multiple data science repositories, set up a workspace and let your coding agent see the full picture.&lt;/p&gt;
</content></entry></feed>