<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://michaelpaonam.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://michaelpaonam.com/" rel="alternate" type="text/html" /><updated>2026-05-08T15:04:39+00:00</updated><id>https://michaelpaonam.com/feed.xml</id><title type="html">Michael Paonam — Software Engineer</title><subtitle>Senior Software Engineer writing about backend architecture, debugging production systems, and building reliable software.</subtitle><entry><title type="html">ETHGlobal Open Agents: What the Winners Did Differently</title><link href="https://michaelpaonam.com/web3/2026/05/03/open-agents-hackathon-post-mortem.html" rel="alternate" type="text/html" title="ETHGlobal Open Agents: What the Winners Did Differently" /><published>2026-05-03T00:00:00+00:00</published><updated>2026-05-03T00:00:00+00:00</updated><id>https://michaelpaonam.com/web3/2026/05/03/open-agents-hackathon-post-mortem</id><content type="html" xml:base="https://michaelpaonam.com/web3/2026/05/03/open-agents-hackathon-post-mortem.html"><![CDATA[<p>I spent 14 days building ARYA — a multi-agent AI swarm for DeFi yield farming — for the ETHGlobal Open Agents hackathon. It didn’t place. Here’s what I built, what the finalists built, and where I think the gap was.</p>

<div class="iframe-container">
<iframe src="https://www.youtube.com/embed/9g2LWZVdooE?si=RHB4PdAAZUCgaSM7" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>
</div>

<h2 id="what-i-built">What I built</h2>

<p>ARYA is four specialized AI agents (Scout, Risk, Executor, Orchestrator) that collaboratively discover yield farming opportunities, evaluate risk, and execute swaps — with a human-in-the-loop approval gate enforced by smart contracts. The user approves or rejects every strategy before funds move.</p>

<p><img src="/assets/images/architecture.svg" alt="ARYA architecture" /></p>

<p>The stack: LangGraph.js for agent orchestration, five Solidity contracts on 0G Chain (ERC-7857 identity, strategy vault, reputation tracking, ERC-4337 smart accounts with session keys), a Next.js frontend, Uniswap Trading API for swaps, KeeperHub for automated monitoring, and 0G Storage for agent memory.</p>

<p>206 commits over 14 days, solo. Contracts deployed, frontend live on Vercel, all three sponsor integrations wired up. The system worked end-to-end: agents propose a strategy, user reviews it in the dashboard, approves on-chain, executor builds the swap. <a href="https://arya-testnet.com/">arya-testnet.com</a></p>

<h2 id="what-the-finalists-built">What the finalists built</h2>

<p>Seven projects made the finalist round. Three stood out as instructive comparisons to what I built:</p>

<p><strong>Clan World: Ælder Whispers</strong> — Four AI agents compete as rival clan elders in a fully onchain strategy game. They negotiate, betray each other, and consolidate knowledge into ERC-7857 iNFTs that transfer between owners. The key mechanic: a memory-wipe every 10 ticks forces agents to consciously decide what to save. This makes agent skill measurable — you can watch one agent outperform another over time.</p>

<p><strong>Slopstock</strong> — A stock exchange for AI agents. Agents are minted as iNFTs with sealed weights in TEEs, ownership is fractionalized into shares, and inference revenue distributes to shareholders. Three live productive agents running right now: a Solidity auditor, a meme-token rugpull detector, and a price oracle. 83 passing tests across 8 contracts. The TEE-sealed weights mean ownership transfer is cryptographically real, not just a database update.</p>

<p><strong>LPlens</strong> — Diagnoses underperforming Uniswap V3 LP positions using closed-form impermanent loss math, classifies pool behavior, simulates 1,000 historical swaps against V4 hooks, and proposes one-click migrations with Permit2. Every numeric output has an honesty label (VERIFIED/COMPUTED/ESTIMATED) — a trust mechanism I haven’t seen before in DeFi tooling.</p>

<p>The other four finalists (Mnemosyne, Aegis402, DAIO, Common OS) followed similar patterns — each had a single novel mechanic explored deeply rather than a broad integration of many components.</p>

<h2 id="where-the-gap-was">Where the gap was</h2>

<h3 id="1-i-built-infrastructure-they-built-experiences">1. I built infrastructure. They built experiences.</h3>

<p>ARYA is plumbing. It’s a pipeline that moves data between agents and presents recommendations in a dashboard. The user experience is: look at numbers, click approve. Functional, but not compelling in a 3-minute demo.</p>

<p>Clan World has AI agents publicly trash-talking each other and forming alliances. Slopstock lets you buy shares in an AI agent and watch your dividends roll in. These create moments that stick in a judge’s memory. My demo was “look at this well-architected system working correctly.” Their demos were “watch this happen.”</p>

<h3 id="2-novel-mechanics-vs-sensible-architecture">2. Novel mechanics vs. sensible architecture</h3>

<p>I spent significant time on architecture: clean agent separation, proper graph-based orchestration, session keys with spend limits, reputation tracking. Good engineering, but none of it is new.</p>

<p>The winners introduced mechanics that don’t exist elsewhere. Clan World’s memory-wipe-and-save cycle. Slopstock’s fractional ownership of TEE-sealed model weights. LPlens’s honesty labels that explicitly distinguish verified on-chain data from estimated projections. Each finalist had at least one thing where you’d say “I’ve never seen that before.”</p>

<p>ARYA’s novelty was supposed to be the multi-agent collaboration with human-in-the-loop approval. But that’s a well-known pattern in AI safety research — it’s responsible engineering, not a new idea.</p>

<h3 id="3-depth-over-breadth">3. Depth over breadth</h3>

<p>ARYA integrates three sponsors across five contracts, four agents, a full frontend, and multiple external APIs. It’s wide. But each integration is shallow — the Uniswap integration calls the Trading API for quotes, the KeeperHub integration sets up basic monitoring, the 0G integration stores state.</p>

<p>LPlens does one thing, analyzing LP positions, but goes absurdly deep: closed-form IL math, pool behavior classification, replays of 1,000 historical swaps, V4 hook compatibility analysis. Slopstock has 8 contracts with 83 passing tests and three agents that produce value today. Depth reads as conviction. Breadth reads as a checklist.</p>

<h2 id="what-id-change">What I’d change</h2>

<p><strong>Pick one agent, make it exceptional.</strong> Instead of four agents doing the full pipeline, I’d build one agent that does something no human can do efficiently — like LPlens analyzing every LP position in a user’s wallet and explaining exactly why each one is bleeding money.</p>

<p><strong>Demo-driven development.</strong> Start with the 3-minute demo script on day 1. Every feature exists only if it creates a visible moment in that demo. I built bottom-up (contracts → agents → frontend) instead of top-down (demo → what do I need to make this moment work). The bottom-up approach produces sound architecture. The top-down approach produces a compelling story.</p>

<p><strong>Ship a live agent.</strong> Slopstock’s agents are running and producing value right now. My agents run in response to user requests. There’s a difference between “this agent can do things” and “this agent is doing things right now, watch it.” The latter is dramatically more impressive when you have three minutes to convince a judge.</p>

<p>The project works. The architecture is sound. But hackathons reward memorable demos and novel ideas over clean engineering — and that’s the right call for a 2-week sprint where the goal is to show what’s possible.</p>]]></content><author><name></name></author><category term="web3" /><summary type="html"><![CDATA[I spent 14 days building ARYA — a multi-agent AI swarm for DeFi yield farming — for the ETHGlobal Open Agents hackathon. It didn’t place. Here’s what I built, what the finalists built, and where I think the gap was.]]></summary></entry><entry><title type="html">G-Hook: Gamedevjs 2026 Post-Mortem</title><link href="https://michaelpaonam.com/gamedev/2026/04/26/g-hook-gamedevjs-2026-post-mortem.html" rel="alternate" type="text/html" title="G-Hook: Gamedevjs 2026 Post-Mortem" /><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><id>https://michaelpaonam.com/gamedev/2026/04/26/g-hook-gamedevjs-2026-post-mortem</id><content type="html" xml:base="https://michaelpaonam.com/gamedev/2026/04/26/g-hook-gamedevjs-2026-post-mortem.html"><![CDATA[<p>I built a 2D top-down grappling hook time-trial game in 10 days for the Gamedevjs 2026 game jam. The engine was Defold (one of the jam sponsors), targeting HTML5. The core mechanic was inspired by Fanny from Mobile Legends — fire cables at anchor points, get pulled toward them, chain hooks for speed, race through checkpoints.</p>

<p>This is what went right, what went wrong, and what I’d do differently.</p>

<div class="iframe-container iframe-container--game">
<iframe frameborder="0" src="https://itch.io/embed-upload/17312646?color=333333" allowfullscreen="" width="800" height="550"><a href="https://0xpaona.itch.io/g-hook">Play G-hook on itch.io</a></iframe>
</div>
<div class="mobile-fallback">
<p>This game requires a keyboard and mouse to play.</p>
<a href="https://0xpaona.itch.io/g-hook">Play G-hook on itch.io (desktop)</a>
</div>

<h2 id="timeline">Timeline</h2>

<p>The jam ran mid-to-late April. I started April 17 with a blank Defold project and submitted April 26. Rough breakdown:</p>

<ul>
  <li><strong>Day 1 (Apr 17):</strong> Project scaffolding, first working prototype of the hook mechanic, basic collision setup.</li>
  <li><strong>Days 2–3 (Apr 18–19):</strong> Cable auto-release, mouse aiming, initial level geometry via Wavedash (Defold’s tilemap tool).</li>
  <li><strong>Days 4–6 (Apr 20–22):</strong> Level loader with collection proxies, two playable levels, sprite assets, camera panning.</li>
  <li><strong>Days 7–8 (Apr 23–24):</strong> Level select screen, splash screen, third level shell, spike hazards.</li>
  <li><strong>Days 9–10 (Apr 25–26):</strong> Camera clamping, backgrounds, crate walls, exit gate, speed gates, final polish, “thank you” screen.</li>
</ul>

<h2 id="the-core-mechanic-pull-dont-swing">The core mechanic: pull, don’t swing</h2>

<p>My first attempt used a pendulum-swing approach — raycast to an anchor, create a rope constraint, let the player orbit around it. In zero-gravity top-down, this felt lifeless. There’s no “down” to swing from, so the rope just made the player orbit aimlessly.</p>

<p>I scrapped it and went with a pull-toward system. When you fire a cable, the player is pulled directly toward the anchor point. With several cables attached (up to 3), the pull follows the bisector of the last two. The cable auto-releases when you arrive close enough. This created the momentum-based gameplay I wanted: fire at a distant anchor, get pulled at speed, release at the right moment to coast, fire the next cable to redirect.</p>

<p>The implementation is pure Lua math on every frame — no physics joints:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">PULL_FORCE</span> <span class="o">=</span> <span class="mi">500</span>
<span class="kd">local</span> <span class="n">DAMPING_HOOKED</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">02</span>  <span class="c1">-- near-zero friction while pulled</span>

<span class="c1">-- Each frame: compute pull direction from active cables</span>
<span class="c1">-- Apply force toward bisector of last two anchor positions</span>
<span class="c1">-- Auto-release when within AUTO_RELEASE_DIST (120px)</span>
</code></pre></div></div>

<p>This gave me full control over the feel without fighting Box2D’s joint solver in a zero-gravity environment.</p>
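<p>For readers who want the per-frame math spelled out, here is a minimal sketch, written in Python rather than the game’s Lua. The three constants come from the post; everything else is an assumption for illustration: the function names, treating <code class="language-plaintext highlighter-rouge">PULL_FORCE</code> as an acceleration on a unit-mass player, applying damping as a per-frame multiplier, and reading “bisector” as the normalized sum of the unit vectors toward the last two anchors.</p>

```python
import math

PULL_FORCE = 500.0         # from the post
DAMPING_HOOKED = 0.02      # from the post
AUTO_RELEASE_DIST = 120.0  # from the post (pixels)


def _unit(dx, dy):
    """Return the unit vector for (dx, dy), or (0, 0) for a zero vector."""
    length = math.hypot(dx, dy)
    if length == 0:
        return (0.0, 0.0)
    return (dx / length, dy / length)


def pull_direction(player, anchors):
    """Pull toward a single anchor, or along the bisector of the
    directions to the last two anchors (assumed interpretation)."""
    sx = sy = 0.0
    for ax, ay in anchors[-2:]:
        ux, uy = _unit(ax - player[0], ay - player[1])
        sx += ux
        sy += uy
    return _unit(sx, sy)


def step(player, velocity, anchors, dt):
    """Advance one frame; returns (new_position, new_velocity, remaining_anchors)."""
    # Auto-release: drop any cable whose anchor is already within range.
    anchors = [a for a in anchors
               if math.hypot(a[0] - player[0], a[1] - player[1]) > AUTO_RELEASE_DIST]
    vx, vy = velocity
    if anchors:
        dx, dy = pull_direction(player, anchors)
        vx += dx * PULL_FORCE * dt
        vy += dy * PULL_FORCE * dt
        # Near-zero friction while hooked (assumed to be a per-frame multiplier).
        vx *= 1.0 - DAMPING_HOOKED
        vy *= 1.0 - DAMPING_HOOKED
    new_position = (player[0] + vx * dt, player[1] + vy * dt)
    return new_position, (vx, vy), anchors
```

<p>Calling <code class="language-plaintext highlighter-rouge">step</code> once per frame with the current cable list is enough to reproduce the fire, pull, coast, redirect loop; no joints or constraints are involved, which is the point of the manual approach.</p>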

<h2 id="what-went-right">What went right</h2>

<p><strong>Choosing Defold.</strong> The engine’s collection proxy system made level loading trivial: each level is a self-contained <code class="language-plaintext highlighter-rouge">.collection</code> file, loaded and unloaded by a central loader script. The HTML5 build pipeline (via <code class="language-plaintext highlighter-rouge">bob.jar</code>) just works. Zero gravity took two lines in <code class="language-plaintext highlighter-rouge">game.project</code>:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[physics]</span>
<span class="py">gravity_y</span> <span class="p">=</span> <span class="s">0.0</span>
<span class="py">gravity_x</span> <span class="p">=</span> <span class="s">0.0</span>
</code></pre></div></div>

<p><strong>Speed gates as a late mechanic.</strong> On day 9, I added the speed aura — a particle effect that activates above 700 px/s — and made some checkpoints require that speed to pass. This added skill expression to what was otherwise just “hook and go.” Players have to chain hooks correctly to build enough speed.</p>

<p><strong>Keeping scope tiny.</strong> Three short levels. No enemies. No powerups. One mechanic, explored through level design.</p>

<h2 id="what-went-wrong">What went wrong</h2>

<p><strong>Art.</strong> I’m not an artist. The final game uses colored squares for most geometry and a spider sprite for the player. The cable rendering is literally Defold’s <code class="language-plaintext highlighter-rouge">draw_line</code> debug primitive — single-pixel white lines. It works mechanically but looks like a prototype.</p>

<p><strong>Sound.</strong> I added a background music track on day 8 and never got to sound effects. Hook fire, cable snap, speed aura activation, checkpoint hit — all silent. This hurt game feel significantly.</p>

<p><strong>Level design iteration time.</strong> Defold’s tile editor is functional but not fast for iterating on level layouts. I’d place anchors, build the game, test the route, realize the spacing was wrong, and repeat. A hot-reload workflow or an in-editor play mode would have saved time.</p>

<p><strong>Late camera work.</strong> I didn’t add camera clamping until day 9. For the first week, the camera would happily follow the player off into void space. This wasn’t a huge problem during development but made the game feel unpolished whenever you overshot a section.</p>

<h2 id="technical-decisions-id-revisit">Technical decisions I’d revisit</h2>

<p><strong>Single-file player script.</strong> All player logic — movement, cable management, chain tracking, speed aura, checkpoint handling, restart — lives in one ~250-line script. For a jam this was fine; for anything longer-lived I’d split cable management into its own module.</p>

<p><strong>Manual rope math instead of physics joints.</strong> The right call for this game, but it means the cable has no elasticity, no visual sag, no physical interaction with obstacles. A hybrid approach — manual pull force but a visual catenary curve — would look better without changing the feel.</p>

<p><strong>No automated testing.</strong> Defold doesn’t have a built-in test runner for game scripts. Every change meant building and manually playing through. For a 10-day jam this is acceptable, but I caught at least two regressions late (camera breaking after restart, checkpoint ordering off by one).</p>

<h2 id="tools-and-workflow">Tools and workflow</h2>

<ul>
  <li><strong>Defold Editor</strong> for scene composition, tilemap painting, and builds.</li>
  <li><strong>Git + GitHub</strong> with feature branches for major additions (controls rework, level one, assets).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">build.sh</code></strong> wrapping <code class="language-plaintext highlighter-rouge">bob.jar</code> for HTML5 bundling — avoids the Defold Editor’s bundle GUI.</li>
  <li><strong>Python HTTP server</strong> for local testing (WASM requires HTTP, not <code class="language-plaintext highlighter-rouge">file://</code>).</li>
</ul>
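<p>The local test server in the last bullet can be as small as the sketch below. It is an illustration, not the script used for the jam: the class and function names are made up here, and the explicit <code class="language-plaintext highlighter-rouge">.wasm</code> mapping is a precaution, since browsers’ streaming WebAssembly loader expects <code class="language-plaintext highlighter-rouge">application/wasm</code> and older MIME tables may not include it.</p>

```python
import functools
import http.server


class WasmHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that always serves .wasm with the expected MIME type."""

    extensions_map = {
        **http.server.SimpleHTTPRequestHandler.extensions_map,
        ".wasm": "application/wasm",
    }


def make_server(directory, port=8000):
    """Build (but do not start) a localhost-only server rooted at `directory`."""
    handler = functools.partial(WasmHandler, directory=directory)
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

<p>Calling <code class="language-plaintext highlighter-rouge">make_server(bundle_dir).serve_forever()</code>, where <code class="language-plaintext highlighter-rouge">bundle_dir</code> is wherever the HTML5 bundle was written, serves the game at <code class="language-plaintext highlighter-rouge">http://127.0.0.1:8000/</code>.</p>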

<h2 id="numbers">Numbers</h2>

<ul>
  <li>10 days, ~40 commits across 5 PRs</li>
  <li>3 playable levels</li>
  <li>~800 lines of Lua (player, camera, level, checkpoint, HUD, loader, cable, utilities)</li>
  <li>1 background music track, 0 sound effects</li>
  <li>Final HTML5 bundle: ~3MB</li>
</ul>

<h2 id="what-id-do-next">What I’d do next</h2>

<p>If I picked this back up: proper cable visuals (sprite-based with catenary math), sound effects for every action, 2–3 more levels that explore the multi-cable mechanic more deeply, and a global leaderboard via a simple REST API. The speed gate mechanic has room to grow — gates that require specific chain counts, or anchors that only activate at certain speeds.</p>

<p>The game works. It feels good to swing through a level at full speed, chaining hooks and barely clearing a speed gate. For 10 days and a first game jam, I’ll take it.</p>]]></content><author><name></name></author><category term="gamedev" /><summary type="html"><![CDATA[I built a 2D top-down grappling hook time-trial game in 10 days for the Gamedevjs 2026 game jam. The engine was Defold (one of the jam sponsors), targeting HTML5. The core mechanic was inspired by Fanny from Mobile Legends — fire cables at anchor points, get pulled toward them, chain hooks for speed, race through checkpoints.]]></summary></entry><entry><title type="html">Building a Production RAG API with Vector Search</title><link href="https://michaelpaonam.com/ai/2026/04/14/building-a-production-rag-api-with-vector-search.html" rel="alternate" type="text/html" title="Building a Production RAG API with Vector Search" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://michaelpaonam.com/ai/2026/04/14/building-a-production-rag-api-with-vector-search</id><content type="html" xml:base="https://michaelpaonam.com/ai/2026/04/14/building-a-production-rag-api-with-vector-search.html"><![CDATA[<p>Most RAG tutorials stop at “retrieve documents, pass to LLM, return answer.” That’s about 20% of what a production deployment actually requires. The remaining 80% is concurrency control, adaptive retrieval, structured citations, and graceful degradation when your upstream services throttle you.</p>

<p><img src="/assets/images/rag_system_design.svg" alt="RAG project - System Design" /></p>

<p>I built this for an internal documentation assistant — engineers asking questions about proprietary system docs that couldn’t be indexed by public LLMs. The corpus was ~2,000 pages of technical documentation across PDFs, markdown, and HTML. Traffic was modest (a few hundred queries per day) but bursty — entire teams would hit it during incidents, which is exactly when you can’t afford it to fall over.</p>

<p>The stack: FastAPI, a columnar database with vector search capabilities, an embedding model behind an API gateway, and an LLM for generation. Everything async, everything behind rate limits.</p>

<h2 id="adaptive-retrieval-depth">Adaptive retrieval depth</h2>

<p>Not every query needs the same number of documents. “What port does the connector use?” needs 3 chunks. “Explain all the differences between mode A and mode B” might need 10.</p>

<p>I started with a fixed k=5 and quickly saw two failure modes: simple lookups returned irrelevant padding documents that confused the LLM, and complex questions missed critical context because 5 chunks weren’t enough.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">determine_k</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">length</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">length</span> <span class="o">&lt;</span> <span class="mi">30</span><span class="p">:</span>
        <span class="n">k</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="k">elif</span> <span class="n">length</span> <span class="o">&lt;=</span> <span class="mi">100</span><span class="p">:</span>
        <span class="n">k</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">k</span> <span class="o">=</span> <span class="mi">7</span>
    <span class="k">if</span> <span class="n">COMPLEXITY_PATTERN</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">query</span><span class="p">):</span>
        <span class="n">k</span> <span class="o">+=</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
</code></pre></div></div>

<p>The complexity pattern matches terms like “compare”, “list all”, “comprehensive”, “every”. It’s a crude heuristic, since query length is a weak proxy for complexity. But it eliminated the worst cases of over-retrieval and under-retrieval without adding an LLM call to classify the query (which would double latency for every request). With these values k only ever lands between 3 and 9, so the final clamp to [2, 10] is just a guard against future tuning.</p>
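<p>The snippet above references <code class="language-plaintext highlighter-rouge">COMPLEXITY_PATTERN</code> without showing it. Below is a self-contained sketch of the whole heuristic; the regex is an assumption built from the four terms named in the text, not the production pattern.</p>

```python
import re

# Assumed pattern: the post names these terms but not the exact regex.
COMPLEXITY_PATTERN = re.compile(
    r"\b(compare|list all|comprehensive|every)\b", re.IGNORECASE
)


def determine_k(query: str) -> int:
    """Pick a retrieval depth from query length plus a complexity bump."""
    length = len(query)
    if length > 100:
        k = 7
    elif length >= 30:
        k = 5
    else:
        k = 3
    if COMPLEXITY_PATTERN.search(query):
        k += 2  # comparison and enumeration questions need more context
    # Defensive clamp; with the values above k is already within 3..9.
    return max(2, min(k, 10))
```

<p>With this pattern, <code class="language-plaintext highlighter-rouge">determine_k("Which port?")</code> returns 3 and <code class="language-plaintext highlighter-rouge">determine_k("Compare mode A and mode B")</code> returns 5. The word boundaries keep “every” from firing on “everything”, which is one of the easy ways a keyword heuristic like this over-triggers.</p>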

<h2 id="multi-query-retrieval-with-early-exit">Multi-query retrieval with early exit</h2>

<p>Single-query retrieval has a blind spot: if the user’s phrasing doesn’t match how the source documents express the concept, you miss relevant chunks. Multi-query generates rephrased variants and merges results.</p>

<p>The problem: multi-query is expensive. An extra LLM call for rephrasing plus N additional vector searches. For most queries the original phrasing works fine — you don’t want to pay that cost unconditionally.</p>

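<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Start the rewrite right away so it overlaps the first search (_rewrite_query is a hypothetical name)</span>
<span class="n">rewrite_task</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_rewrite_query</span><span class="p">(</span><span class="n">query</span><span class="p">))</span>
<span class="n">original_results</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_search_async</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>

<span class="k">if</span> <span class="n">high_confidence</span><span class="p">(</span><span class="n">original_results</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.7</span><span class="p">):</span>
    <span class="n">rewrite_task</span><span class="p">.</span><span class="n">cancel</span><span class="p">()</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">doc</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">original_results</span> <span class="k">if</span> <span class="n">score</span> <span class="o">&gt;=</span> <span class="n">min_similarity</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">variants</span> <span class="o">=</span> <span class="k">await</span> <span class="n">rewrite_task</span>
    <span class="c1"># Assumption: variants is a list of rephrased query strings; search them concurrently</span>
    <span class="n">variant_results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_search_async</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">variants</span><span class="p">))</span>
    <span class="n">result_sets</span> <span class="o">=</span> <span class="p">[</span><span class="n">original_results</span><span class="p">,</span> <span class="o">*</span><span class="n">variant_results</span><span class="p">]</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="n">merge_results</span><span class="p">(</span><span class="n">result_sets</span><span class="p">,</span> <span class="n">min_similarity</span><span class="p">)</span>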
</code></pre></div></div>

<p>The query rewrite and original retrieval run concurrently from the start. If the original results score above threshold on at least 3 documents, we cancel the rewrite and skip additional searches. In practice, ~70% of requests take the fast path. The remaining 30% — usually jargon-heavy or vaguely phrased queries — benefit measurably from the rephrased variants.</p>

<p>The early-exit pattern means multi-query adds no latency to the majority path while still catching the long tail of poorly phrased queries.</p>
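<p>To make the control flow concrete, here is a runnable sketch of the same early-exit idea with the helpers filled in. The helper bodies are assumptions: <code class="language-plaintext highlighter-rouge">high_confidence</code> follows the “at least 3 documents above threshold” rule described above, <code class="language-plaintext highlighter-rouge">merge_results</code> keeps each document’s best score, and the <code class="language-plaintext highlighter-rouge">search</code> and <code class="language-plaintext highlighter-rouge">rewrite</code> coroutines are supplied by the caller rather than taken from the real service. The default <code class="language-plaintext highlighter-rouge">min_similarity</code> of 0.5 is a placeholder.</p>

```python
import asyncio

CONFIDENCE_THRESHOLD = 0.7  # from the post
MIN_CONFIDENT_DOCS = 3      # from the post: "at least 3 documents"


def high_confidence(results, threshold=CONFIDENCE_THRESHOLD, min_docs=MIN_CONFIDENT_DOCS):
    """True when enough (doc, score) pairs score at or above the threshold."""
    return sum(1 for _, score in results if score >= threshold) >= min_docs


def merge_results(result_sets, min_similarity):
    """Deduplicate documents, keep each one's best score, return them best first."""
    best = {}
    for results in result_sets:
        for doc, score in results:
            if score >= min_similarity and score > best.get(doc, -1.0):
                best[doc] = score
    return sorted(best, key=best.get, reverse=True)


async def retrieve(query, k, search, rewrite, min_similarity=0.5):
    """search(query, k) and rewrite(query) are caller-supplied coroutines."""
    # Kick off the rewrite immediately so it overlaps with the first search.
    rewrite_task = asyncio.create_task(rewrite(query))
    try:
        original = await search(query, k)
        if high_confidence(original):
            return [doc for doc, score in original if score >= min_similarity]
        variants = await rewrite_task
        variant_results = await asyncio.gather(*(search(v, k) for v in variants))
        return merge_results([original, *variant_results], min_similarity)
    finally:
        # Cancels the pending rewrite on the fast path or on error;
        # a no-op if the task has already finished.
        rewrite_task.cancel()
```

<p>Putting the cancel in a <code class="language-plaintext highlighter-rouge">finally</code> block is a design choice of this sketch: it also stops the rewrite call when the first search raises, so a failed request does not leave an orphaned LLM call running.</p>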

<h2 id="concurrency-control">Concurrency control</h2>

<p>The vector database and the LLM gateway both had hard connection limits. During an incident, 15 engineers would simultaneously ask questions about the same system, and without backpressure the service would exhaust connection pools and cascade into 429s from the LLM gateway.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">self</span><span class="p">.</span><span class="n">_hana_semaphore</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>   <span class="c1"># max concurrent DB queries
</span><span class="bp">self</span><span class="p">.</span><span class="n">_aicore_semaphore</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>  <span class="c1"># max concurrent LLM calls
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">_search_async</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">_hana_semaphore</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_aicore_semaphore</span><span class="p">:</span>
        <span class="k">return</span> <span class="k">await</span> <span class="n">with_retry</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_search</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
</code></pre></div></div>

<p>Deliberately conservative. Under load, requests queue at the semaphore rather than hammering upstream. The alternative — letting all requests through and handling 429s reactively — creates worse tail latency because retries compound with each other.</p>
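<p>The queueing behaviour is easy to check in isolation. The sketch below is a toy, not the service code: <code class="language-plaintext highlighter-rouge">run_bounded</code> and <code class="language-plaintext highlighter-rouge">fake_upstream_call</code> are hypothetical names, and the sleep stands in for a database query or LLM call. It records the peak number of calls in flight so the limit can be observed directly.</p>

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run coroutine factories with at most `limit` in flight; return (results, peak)."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # excess requests wait here instead of hitting upstream
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_upstream_call():
    await asyncio.sleep(0.01)  # stand-in for a DB query or LLM call
    return "ok"
```

<p>Running <code class="language-plaintext highlighter-rouge">asyncio.run(run_bounded([fake_upstream_call] * 15, limit=4))</code> completes all 15 calls with a peak of 4 in flight. One detail worth noting about the two-semaphore version above: every caller acquires the semaphores in the same order, which is what keeps the nested acquisition from deadlocking.</p>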

<p>The retry layer underneath handles rate limits with exponential backoff, respecting <code class="language-plaintext highlighter-rouge">Retry-After</code> headers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">with_retry</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_retries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">return</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">to_thread</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
        <span class="k">except</span> <span class="n">RateLimitError</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">attempt</span> <span class="o">==</span> <span class="n">max_retries</span><span class="p">:</span>
                <span class="k">raise</span>
            <span class="n">retry_after</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span>
                <span class="n">exc</span><span class="p">.</span><span class="n">response</span><span class="p">.</span><span class="n">headers</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"Retry-After"</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
            <span class="p">)</span> <span class="k">if</span> <span class="n">exc</span><span class="p">.</span><span class="n">response</span> <span class="k">else</span> <span class="mi">0</span>
            <span class="n">delay</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">retry_after</span><span class="p">,</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">delay</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">asyncio.to_thread</code> is important — the SDK clients are synchronous, so blocking calls go to the thread pool to keep the event loop responsive for other requests.</p>
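<p>One hardening worth considering: the HTTP specification allows <code class="language-plaintext highlighter-rouge">Retry-After</code> to be either a number of seconds or an HTTP date, and <code class="language-plaintext highlighter-rouge">float()</code> raises on the date form. The sketch below shows one way to accept both. The function names are hypothetical and it is not part of the original service; <code class="language-plaintext highlighter-rouge">backoff_delay</code> simply restates the delay rule from <code class="language-plaintext highlighter-rouge">with_retry</code>.</p>

```python
import email.utils
import time


def parse_retry_after(value, now=None):
    """Return the wait in seconds for a Retry-After value, or 0.0 if unusable."""
    if not value:
        return 0.0
    value = value.strip()
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "7"
    except ValueError:
        pass
    try:
        when = email.utils.parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return 0.0
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)


def backoff_delay(attempt, retry_after=0.0, base=0.5):
    """Same rule as with_retry: the larger of the server hint and base * 2**attempt."""
    return max(retry_after, base * (2 ** attempt))
```

<p>Falling back to 0.0 for an unparseable header means the exponential term still applies, so a malformed hint degrades to plain backoff rather than an exception inside the retry loop.</p>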

<h2 id="structured-citations">Structured citations</h2>

<p>A RAG answer without citations is just a hallucination with extra steps. Early user feedback was clear: “I don’t trust this answer unless I can verify it.” So the system injects source headers that the LLM can reference:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">format_docs</span><span class="p">(</span><span class="n">docs</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Document</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">parts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">:</span>
        <span class="n">meta</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">metadata</span>
        <span class="n">name</span> <span class="o">=</span> <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"document_name"</span><span class="p">,</span> <span class="s">"Unknown"</span><span class="p">)</span>
        <span class="n">page</span> <span class="o">=</span> <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"page"</span><span class="p">)</span>
        <span class="n">chapter</span> <span class="o">=</span> <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"chapter"</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">page</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">ref</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Chapter: </span><span class="si">{</span><span class="n">chapter</span><span class="si">}</span><span class="s">, Page </span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s">"</span> <span class="k">if</span> <span class="n">chapter</span> <span class="k">else</span> <span class="sa">f</span><span class="s">"Page </span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s">"</span>
            <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Source: </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">ref</span><span class="si">}</span><span class="s">]</span><span class="se">\n</span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span>
</code></pre></div></div>

<p>Documents without pagination metadata (parsed markdown files) are included without a source header. The prompt explicitly instructs the LLM not to cite them — this prevents fabricated page numbers. The trade-off: some answers lack citations even when the information is correct. We accepted this over the alternative of the LLM inventing “Page 47” for a markdown file.</p>
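<p>A small usage sketch shows both branches. The <code class="language-plaintext highlighter-rouge">Document</code> dataclass is a minimal stand-in for the retriever's document type, and the file names and contents are made up for illustration.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for the retriever's document type (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Document]) -> str:
    # Same logic as the function above, repeated so the example runs on its own.
    parts = []
    for doc in docs:
        meta = doc.metadata
        name = meta.get("document_name", "Unknown")
        page = meta.get("page")
        chapter = meta.get("chapter")
        if page is not None:
            ref = f"Chapter: {chapter}, Page {page}" if chapter else f"Page {page}"
            parts.append(f"[Source: {name}, {ref}]\n{doc.page_content}")
        else:
            # No pagination metadata: include the text without a citable header.
            parts.append(doc.page_content)
    return "\n\n".join(parts)


docs = [
    Document("Restart the worker pool.",
             {"document_name": "Runbook.pdf", "page": 12, "chapter": "Recovery"}),
    Document("Check the queue depth first.", {"document_name": "notes.md"}),
]
context = format_docs(docs)
print(context)
```

<p>The PDF chunk gets a <code class="language-plaintext highlighter-rouge">[Source: Runbook.pdf, Chapter: Recovery, Page 12]</code> header, while the markdown chunk appears as bare text with nothing for the LLM to cite.</p>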

<h2 id="streaming-with-error-boundaries">Streaming with error boundaries</h2>

<p>The response streams token-by-token. This creates an error handling problem: once you’ve started streaming, you can’t return a JSON error response. The HTTP status is already 200.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">retrieve_stream</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">AsyncGenerator</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="bp">None</span><span class="p">]:</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">docs</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_retrieve</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">DatabaseError</span><span class="p">:</span>
        <span class="k">yield</span> <span class="s">"Sorry, an error occurred while searching."</span>
        <span class="k">return</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">document_chain</span><span class="p">.</span><span class="n">astream</span><span class="p">({...}):</span>
            <span class="k">yield</span> <span class="n">chunk</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">yield</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">Sorry, an error occurred while generating the answer."</span>
</code></pre></div></div>

<p>Retrieval errors fail fast — if the database is unreachable, the user gets an immediate message rather than waiting for a timeout. Generation errors degrade gracefully: partial answers are already on the wire, so we append an error notice. Users preferred seeing a partial answer with an error notice over getting nothing.</p>
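<p>The two boundaries can be exercised in isolation. This is a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">DatabaseError</code> and <code class="language-plaintext highlighter-rouge">fake_llm_stream</code> are hypothetical stand-ins, and the <code class="language-plaintext highlighter-rouge">fail_retrieval</code> flag exists only to trigger the first failure mode.</p>

```python
import asyncio
from typing import AsyncGenerator


class DatabaseError(Exception):
    """Stand-in for the database driver's error type (assumption)."""


async def fake_llm_stream() -> AsyncGenerator[str, None]:
    # Hypothetical generator that fails after emitting two chunks.
    yield "Restart "
    yield "the worker"
    raise RuntimeError("gateway dropped the connection")


async def retrieve_stream(fail_retrieval: bool) -> AsyncGenerator[str, None]:
    try:
        if fail_retrieval:
            raise DatabaseError("unreachable")
    except DatabaseError:
        # Nothing has been streamed yet, so the error is the whole response.
        yield "Sorry, an error occurred while searching."
        return

    try:
        async for chunk in fake_llm_stream():
            yield chunk
    except Exception:
        # Chunks are already on the wire, so append a notice instead.
        yield "\n\nSorry, an error occurred while generating the answer."


async def collect(fail_retrieval: bool) -> str:
    return "".join([chunk async for chunk in retrieve_stream(fail_retrieval)])


retrieval_failure = asyncio.run(collect(True))
generation_failure = asyncio.run(collect(False))
```

<p>In the second case the client receives the partial text followed by the notice, which matches the behavior users said they preferred.</p>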

<h2 id="timeouts-as-graceful-degradation">Timeouts as graceful degradation</h2>

<p>Each sub-operation has an independent timeout tuned to its importance:</p>

<ul>
  <li><strong>Query rewrite:</strong> 500ms. If the LLM takes longer to rephrase, proceed with the original phrasing. The rewrite is an optimization, not a requirement.</li>
  <li><strong>Variant retrieval:</strong> 1 second per variant. If one rephrased query’s retrieval is slow, merge results from whichever variants completed.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">_search_async_with_timeout</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
    <span class="k">return</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">wait_for</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_search_async</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="p">),</span> <span class="n">timeout</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">retrieval_timeout</span>
    <span class="p">)</span>
</code></pre></div></div>
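<p>The fallback behavior for both bullets might look like the sketch below. The helper names (<code class="language-plaintext highlighter-rouge">rewrite_or_original</code>, <code class="language-plaintext highlighter-rouge">search_variants</code>) and the simulated delays are assumptions, and the timeouts are scaled down from the 500ms and 1 second budgets so the example runs quickly.</p>

```python
import asyncio


async def rewrite_query(query: str) -> str:
    """Hypothetical rewrite step that sleeps past its budget."""
    await asyncio.sleep(0.2)
    return query + " (rephrased)"


async def search(query: str, delay: float) -> list[str]:
    """Hypothetical retrieval call with a configurable delay."""
    await asyncio.sleep(delay)
    return [f"doc for {query}"]


async def rewrite_or_original(query: str, timeout: float) -> str:
    try:
        return await asyncio.wait_for(rewrite_query(query), timeout=timeout)
    except asyncio.TimeoutError:
        return query  # the rewrite is optional, so fall back to the original


async def search_variants(variants: dict[str, float], timeout: float) -> list[str]:
    tasks = [asyncio.wait_for(search(q, d), timeout=timeout) for q, d in variants.items()]
    # return_exceptions=True keeps one slow variant from failing the whole batch.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    merged: list[str] = []
    for result in results:
        if isinstance(result, list):  # skip variants that timed out or failed
            merged.extend(result)
    return merged


async def main() -> tuple[str, list[str]]:
    query = await rewrite_or_original("disk full", timeout=0.05)
    docs = await search_variants({"disk full": 0.01, "no space left": 0.5}, timeout=0.1)
    return query, docs


query, docs = asyncio.run(main())
```

<p>One caveat: when the awaited coroutine wraps <code class="language-plaintext highlighter-rouge">asyncio.to_thread</code>, a timeout stops the waiting but cannot interrupt the thread itself, so the underlying call still runs to completion in the background.</p>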

<p>Under normal load, everything completes within budget. Under burst load (incident-driven traffic), the enhancements are shed first while the core path, original-query retrieval plus generation, still completes. Users get a slightly less optimized answer rather than a timeout error.</p>

<h2 id="what-i-learned">What I learned</h2>

<p>The production gap in RAG isn’t retrieval quality — it’s operational resilience. The entire service is under 300 lines. Most of that is plumbing: semaphores, retries, timeouts, error boundaries. The actual RAG logic is maybe 40 lines.</p>

<p>Things I’d do differently next time: add a response cache keyed on query embedding similarity (many incident-driven questions are near-duplicates), instrument retrieval quality metrics (track how often multi-query actually changes the result set), and add a fallback path that returns raw document snippets without LLM generation when the gateway is fully saturated. An imperfect answer fast beats a perfect answer never.</p>]]></content><author><name></name></author><category term="ai" /><summary type="html"><![CDATA[Most RAG tutorials stop at “retrieve documents, pass to LLM, return answer.” That’s about 20% of what a production deployment actually requires. The remaining 80% is concurrency control, adaptive retrieval, structured citations, and graceful degradation when your upstream services throttle you.]]></summary></entry><entry><title type="html">Distributed Tracing Across Services</title><link href="https://michaelpaonam.com/observability/2024/08/04/distributed-tracing-across-services.html" rel="alternate" type="text/html" title="Distributed Tracing Across Services" /><published>2024-08-04T00:00:00+00:00</published><updated>2024-08-04T00:00:00+00:00</updated><id>https://michaelpaonam.com/observability/2024/08/04/distributed-tracing-across-services</id><content type="html" xml:base="https://michaelpaonam.com/observability/2024/08/04/distributed-tracing-across-services.html"><![CDATA[<p>We had six services talking over HTTP and Kafka, and during incidents the hardest question was never “what broke?” — it was “what touched this request?” Logs existed per service. Metrics existed per host. Nothing tied them together. An incident that should have taken five minutes to resolve would burn an hour of manual log correlation.</p>

<p>This post covers how we wired up distributed tracing: correlation IDs first, then OpenTelemetry, then structured logging that actually made the whole thing queryable.</p>

<h2 id="the-problem">The problem</h2>

<p><img src="/assets/images/alert.jpg" alt="image from undraw" /></p>

<p>A user places an order. The request hits an API gateway, fans out to an inventory service, a payment service, a notification service, and eventually writes to a ledger via Kafka. Each service logs independently. When checkout latency spikes to 12 seconds, you’re searching five different log streams by timestamp, hoping the clocks are synced and the log formats are consistent enough to piece together what happened.</p>

<p>We tried this for three months. The worst incident took 45 minutes to resolve — not because the fix was hard, but because finding the slow service required manually jumping between Kibana indices and guessing at timing overlaps.</p>

<h2 id="correlation-ids-the-quick-win">Correlation IDs: the quick win</h2>

<p>Before going full OpenTelemetry, we started with the simplest possible thing: a UUID generated at the edge, propagated everywhere.</p>

<p><strong>At the gateway (Spring Boot filter):</strong></p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Component</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">CorrelationIdFilter</span> <span class="kd">extends</span> <span class="nc">OncePerRequestFilter</span> <span class="o">{</span>

    <span class="nd">@Override</span>
    <span class="kd">protected</span> <span class="kt">void</span> <span class="nf">doFilterInternal</span><span class="o">(</span><span class="nc">HttpServletRequest</span> <span class="n">request</span><span class="o">,</span>
                                    <span class="nc">HttpServletResponse</span> <span class="n">response</span><span class="o">,</span>
                                    <span class="nc">FilterChain</span> <span class="n">chain</span><span class="o">)</span> <span class="kd">throws</span> <span class="nc">ServletException</span><span class="o">,</span> <span class="nc">IOException</span> <span class="o">{</span>
        <span class="nc">String</span> <span class="n">correlationId</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="na">getHeader</span><span class="o">(</span><span class="s">"X-Correlation-ID"</span><span class="o">);</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">correlationId</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
            <span class="n">correlationId</span> <span class="o">=</span> <span class="no">UUID</span><span class="o">.</span><span class="na">randomUUID</span><span class="o">().</span><span class="na">toString</span><span class="o">();</span>
        <span class="o">}</span>
        <span class="no">MDC</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s">"correlationId"</span><span class="o">,</span> <span class="n">correlationId</span><span class="o">);</span>
        <span class="n">response</span><span class="o">.</span><span class="na">setHeader</span><span class="o">(</span><span class="s">"X-Correlation-ID"</span><span class="o">,</span> <span class="n">correlationId</span><span class="o">);</span>
        <span class="k">try</span> <span class="o">{</span>
            <span class="n">chain</span><span class="o">.</span><span class="na">doFilter</span><span class="o">(</span><span class="n">request</span><span class="o">,</span> <span class="n">response</span><span class="o">);</span>
        <span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
            <span class="no">MDC</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="s">"correlationId"</span><span class="o">);</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>Propagation rules we settled on:</strong></p>
<ul>
  <li>HTTP calls: pass as <code class="language-plaintext highlighter-rouge">X-Correlation-ID</code> header (added to our shared RestTemplate config)</li>
  <li>Kafka messages: set in record headers, not the payload (avoids schema changes)</li>
  <li>Async workers: extract from message metadata before processing, put in MDC</li>
</ul>
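<p>The three rules reduce to a handful of small helpers. The sketch below is in Python purely for brevity (the services in this post are Java), and every function name in it is hypothetical. Kafka record headers are modeled as a list of key and bytes pairs, which is how most client libraries expose them.</p>

```python
import uuid
from typing import Optional

HEADER = "X-Correlation-ID"


def ensure_correlation_id(incoming_headers: dict[str, str]) -> str:
    # Reuse the caller's ID when present, otherwise mint one at the edge.
    return incoming_headers.get(HEADER) or str(uuid.uuid4())


def outgoing_http_headers(correlation_id: str) -> dict[str, str]:
    # Rule 1: HTTP calls carry the ID as a header.
    return {HEADER: correlation_id}


def kafka_record_headers(correlation_id: str) -> list[tuple[str, bytes]]:
    # Rule 2: record headers carry bytes, so the payload schema stays untouched.
    return [(HEADER, correlation_id.encode("utf-8"))]


def extract_from_record(headers: list[tuple[str, bytes]]) -> Optional[str]:
    # Rule 3: workers read the ID from message metadata before processing.
    for key, value in headers:
        if key == HEADER:
            return value.decode("utf-8")
    return None


cid = ensure_correlation_id({})
roundtrip = extract_from_record(kafka_record_headers(cid))
```

<p>The ID that leaves the gateway is the same string the worker reads back, regardless of which transport carried it.</p>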

<p>This alone cut our mean-time-to-identify from ~30 minutes to ~8 minutes. One search by correlation ID across all indices would surface every log line for a transaction. The limitation: no timing information, no parent-child relationships, no visualization.</p>

<h2 id="opentelemetry-when-correlation-ids-arent-enough">OpenTelemetry: when correlation IDs aren’t enough</h2>

<p>We evaluated three options: Zipkin (lighter, less active development), Jaeger (mature, but self-hosted complexity), and OpenTelemetry with a managed backend. We went with OTel exporting to Grafana Tempo — mainly because we already had Grafana for metrics and didn’t want another UI.</p>

<p><strong>Basic setup (Spring Boot with OTel SDK):</strong></p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Configuration</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">TracingConfig</span> <span class="o">{</span>

    <span class="nd">@Bean</span>
    <span class="kd">public</span> <span class="nc">OpenTelemetry</span> <span class="nf">openTelemetry</span><span class="o">()</span> <span class="o">{</span>
        <span class="nc">SdkTracerProvider</span> <span class="n">tracerProvider</span> <span class="o">=</span> <span class="nc">SdkTracerProvider</span><span class="o">.</span><span class="na">builder</span><span class="o">()</span>
            <span class="o">.</span><span class="na">addSpanProcessor</span><span class="o">(</span><span class="nc">BatchSpanProcessor</span><span class="o">.</span><span class="na">builder</span><span class="o">(</span>
                <span class="nc">OtlpGrpcSpanExporter</span><span class="o">.</span><span class="na">builder</span><span class="o">().</span><span class="na">build</span><span class="o">()</span>
            <span class="o">).</span><span class="na">build</span><span class="o">())</span>
            <span class="o">.</span><span class="na">build</span><span class="o">();</span>

        <span class="k">return</span> <span class="nc">OpenTelemetrySdk</span><span class="o">.</span><span class="na">builder</span><span class="o">()</span>
            <span class="o">.</span><span class="na">setTracerProvider</span><span class="o">(</span><span class="n">tracerProvider</span><span class="o">)</span>
            <span class="o">.</span><span class="na">setPropagators</span><span class="o">(</span><span class="nc">ContextPropagators</span><span class="o">.</span><span class="na">create</span><span class="o">(</span>
                <span class="nc">W3CTraceContextPropagator</span><span class="o">.</span><span class="na">getInstance</span><span class="o">()</span>
            <span class="o">))</span>
            <span class="o">.</span><span class="na">build</span><span class="o">();</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>Instrumenting a service call:</strong></p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">PaymentService</span> <span class="o">{</span>

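    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">Tracer</span> <span class="n">tracer</span><span class="o">;</span>
    <span class="c1">// PaymentClient is an assumed type name for the client used in processPayment.</span>
    <span class="kd">private</span> <span class="kd">final</span> <span class="nc">PaymentClient</span> <span class="n">paymentClient</span><span class="o">;</span>

    <span class="kd">public</span> <span class="nf">PaymentService</span><span class="o">(</span><span class="nc">OpenTelemetry</span> <span class="n">openTelemetry</span><span class="o">,</span> <span class="nc">PaymentClient</span> <span class="n">paymentClient</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">this</span><span class="o">.</span><span class="na">tracer</span> <span class="o">=</span> <span class="n">openTelemetry</span><span class="o">.</span><span class="na">getTracer</span><span class="o">(</span><span class="s">"payment-service"</span><span class="o">);</span>
        <span class="k">this</span><span class="o">.</span><span class="na">paymentClient</span> <span class="o">=</span> <span class="n">paymentClient</span><span class="o">;</span>
    <span class="o">}</span>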

    <span class="kd">public</span> <span class="nc">PaymentResult</span> <span class="nf">processPayment</span><span class="o">(</span><span class="nc">String</span> <span class="n">orderId</span><span class="o">,</span> <span class="nc">String</span> <span class="n">method</span><span class="o">,</span> <span class="kt">long</span> <span class="n">amount</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">Span</span> <span class="n">span</span> <span class="o">=</span> <span class="n">tracer</span><span class="o">.</span><span class="na">spanBuilder</span><span class="o">(</span><span class="s">"process_payment"</span><span class="o">).</span><span class="na">startSpan</span><span class="o">();</span>
        <span class="k">try</span> <span class="o">(</span><span class="nc">Scope</span> <span class="n">scope</span> <span class="o">=</span> <span class="n">span</span><span class="o">.</span><span class="na">makeCurrent</span><span class="o">())</span> <span class="o">{</span>
            <span class="n">span</span><span class="o">.</span><span class="na">setAttribute</span><span class="o">(</span><span class="s">"order.id"</span><span class="o">,</span> <span class="n">orderId</span><span class="o">);</span>
            <span class="n">span</span><span class="o">.</span><span class="na">setAttribute</span><span class="o">(</span><span class="s">"payment.method"</span><span class="o">,</span> <span class="n">method</span><span class="o">);</span>
            <span class="nc">PaymentResult</span> <span class="n">result</span> <span class="o">=</span> <span class="n">paymentClient</span><span class="o">.</span><span class="na">charge</span><span class="o">(</span><span class="n">orderId</span><span class="o">,</span> <span class="n">amount</span><span class="o">);</span>
            <span class="n">span</span><span class="o">.</span><span class="na">setAttribute</span><span class="o">(</span><span class="s">"payment.status"</span><span class="o">,</span> <span class="n">result</span><span class="o">.</span><span class="na">getStatus</span><span class="o">());</span>
            <span class="k">return</span> <span class="n">result</span><span class="o">;</span>
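        <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">RuntimeException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
            <span class="c1">// Mark the span as failed so the error shows up in the trace.</span>
            <span class="n">span</span><span class="o">.</span><span class="na">recordException</span><span class="o">(</span><span class="n">e</span><span class="o">);</span>
            <span class="n">span</span><span class="o">.</span><span class="na">setStatus</span><span class="o">(</span><span class="nc">StatusCode</span><span class="o">.</span><span class="na">ERROR</span><span class="o">);</span>
            <span class="k">throw</span> <span class="n">e</span><span class="o">;</span>
        <span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>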
            <span class="n">span</span><span class="o">.</span><span class="na">end</span><span class="o">();</span>
        <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p><strong>Context propagation</strong> is handled by W3C Trace Context headers (<code class="language-plaintext highlighter-rouge">traceparent</code>, <code class="language-plaintext highlighter-rouge">tracestate</code>). The OTel Java agent auto-instruments most HTTP clients and Kafka producers/consumers, so we only wrote manual spans for business-critical paths where we wanted custom attributes.</p>
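<p>For reference, a <code class="language-plaintext highlighter-rouge">traceparent</code> value has four dash-separated hex fields: version, trace ID, parent span ID, and flags. The parser below is a sketch in Python (used here only for brevity) that handles the version <code class="language-plaintext highlighter-rouge">00</code> layout. The function and type names are hypothetical, and in practice the OTel propagator does this for you.</p>

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid under the W3C spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

<p>The trace ID parsed here is the value that later appears as <code class="language-plaintext highlighter-rouge">trace_id</code> in log lines, which is what makes logs and traces joinable.</p>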

<p>The trade-off we hit: the OTel Java agent adds ~50ms to startup and a small per-request overhead (~2ms in our measurements). For our latency budget this was fine. We considered the manual SDK-only approach (no agent) but decided the auto-instrumentation coverage was worth the overhead.</p>

<p><strong>Where we drew the line:</strong> instrument at service boundaries only — incoming request, outgoing HTTP call, message publish, message consume. We explicitly decided not to instrument internal method calls. The signal-to-noise ratio drops fast when you trace everything.</p>

<h2 id="structured-logging">Structured logging</h2>

<p>We had structured logging before tracing, but it wasn’t connected. The missing piece was injecting trace context into every log line automatically.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"timestamp"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2024-07-15T02:13:47Z"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"level"</span><span class="p">:</span><span class="w"> </span><span class="s2">"info"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"service"</span><span class="p">:</span><span class="w"> </span><span class="s2">"payment-service"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"trace_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"4bf92f3577b34da6a3ce929d0e0e4736"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"span_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"00f067aa0ba902b7"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"correlation_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ord-18473-a8f2"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"payment processed"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"order_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"18473"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"amount_cents"</span><span class="p">:</span><span class="w"> </span><span class="mi">4500</span><span class="p">,</span><span class="w">
  </span><span class="nl">"duration_ms"</span><span class="p">:</span><span class="w"> </span><span class="mi">230</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>We kept both the correlation ID (a business identifier tied to the order) and the trace ID (an infrastructure identifier). This lets you query from either direction: “show me everything for order 18473” or “show me everything in this trace.” Different people ask different questions: support engineers think in orders, on-call engineers think in traces.</p>

<p><strong>Implementation (Logback with MDC):</strong></p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">&lt;!-- logback-spring.xml --&gt;</span>
<span class="nt">&lt;appender</span> <span class="na">name=</span><span class="s">"JSON"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.ConsoleAppender"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;encoder</span> <span class="na">class=</span><span class="s">"net.logstash.logback.encoder.LogstashEncoder"</span><span class="nt">&gt;</span>
        <span class="nt">&lt;includeMdcKeyName&gt;</span>correlationId<span class="nt">&lt;/includeMdcKeyName&gt;</span>
        <span class="nt">&lt;includeMdcKeyName&gt;</span>traceId<span class="nt">&lt;/includeMdcKeyName&gt;</span>
        <span class="nt">&lt;includeMdcKeyName&gt;</span>spanId<span class="nt">&lt;/includeMdcKeyName&gt;</span>
    <span class="nt">&lt;/encoder&gt;</span>
<span class="nt">&lt;/appender&gt;</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Component</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">TraceContextLogger</span> <span class="kd">implements</span> <span class="nc">HandlerInterceptor</span> <span class="o">{</span>

    <span class="nd">@Override</span>
    <span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">preHandle</span><span class="o">(</span><span class="nc">HttpServletRequest</span> <span class="n">request</span><span class="o">,</span>
                             <span class="nc">HttpServletResponse</span> <span class="n">response</span><span class="o">,</span>
                             <span class="nc">Object</span> <span class="n">handler</span><span class="o">)</span> <span class="o">{</span>
        <span class="nc">Span</span> <span class="n">span</span> <span class="o">=</span> <span class="nc">Span</span><span class="o">.</span><span class="na">current</span><span class="o">();</span>
        <span class="nc">SpanContext</span> <span class="n">ctx</span> <span class="o">=</span> <span class="n">span</span><span class="o">.</span><span class="na">getSpanContext</span><span class="o">();</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">ctx</span><span class="o">.</span><span class="na">isValid</span><span class="o">())</span> <span class="o">{</span>
            <span class="no">MDC</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s">"traceId"</span><span class="o">,</span> <span class="n">ctx</span><span class="o">.</span><span class="na">getTraceId</span><span class="o">());</span>
            <span class="no">MDC</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s">"spanId"</span><span class="o">,</span> <span class="n">ctx</span><span class="o">.</span><span class="na">getSpanId</span><span class="o">());</span>
        <span class="o">}</span>
        <span class="k">return</span> <span class="kc">true</span><span class="o">;</span>
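    <span class="o">}</span>

    <span class="c1">// Clear the keys once the request completes so pooled threads do not leak trace IDs into later requests.</span>
    <span class="nd">@Override</span>
    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">afterCompletion</span><span class="o">(</span><span class="nc">HttpServletRequest</span> <span class="n">request</span><span class="o">,</span>
                                <span class="nc">HttpServletResponse</span> <span class="n">response</span><span class="o">,</span>
                                <span class="nc">Object</span> <span class="n">handler</span><span class="o">,</span>
                                <span class="nc">Exception</span> <span class="n">ex</span><span class="o">)</span> <span class="o">{</span>
        <span class="no">MDC</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="s">"traceId"</span><span class="o">);</span>
        <span class="no">MDC</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="s">"spanId"</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>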
</code></pre></div></div>

<h2 id="what-actually-changed">What actually changed</h2>

<p>The first incident after full rollout: alert fires for checkout latency at 2 AM. The alert itself now includes a trace ID (we configured AlertManager to attach it from the exemplar). Open Tempo, paste the trace ID, see the full waterfall. A downstream ledger service is taking 8 seconds on a database query — connection pool exhausted because a batch job was running during peak hours. Four minutes from alert to root cause.</p>

<p>Before tracing, that same incident pattern took 30-45 minutes. The fix was always simple once you found the slow service. The cost was in the finding.</p>

<p>The less obvious win: support tickets. “What happened to order #18473?” used to mean 20 minutes of log archaeology. Now it’s one search. We built a small internal tool that takes an order ID, finds the correlation ID, pulls the trace, and renders the timeline. Support engineers use it directly without paging on-call.</p>

<p><strong>What I’d skip if doing it again:</strong> we spent two weeks building custom dashboards for trace metrics (p99 per service, error rates by span). Grafana Tempo’s built-in service graph and RED metrics dashboard gave us 90% of that for free. Should have started there.</p>

<p><strong>What I wouldn’t skip:</strong> keeping the correlation ID separate from the trace ID. Some teams just use the trace ID for everything. But traces are infrastructure-scoped — they break across async boundaries, batch jobs, retry queues. The business correlation ID survives all of that because it’s just a string you propagate manually. When the Kafka consumer picks up a failed message three hours later, the trace is new but the correlation ID is the same.</p>]]></content><author><name></name></author><category term="observability" /><summary type="html"><![CDATA[We had six services talking over HTTP and Kafka, and during incidents the hardest question was never “what broke?” — it was “what touched this request?” Logs existed per service. Metrics existed per host. Nothing tied them together. An incident that should have taken five minutes to resolve would burn an hour of manual log correlation.]]></summary></entry></feed>