Marginalia — An Autonomous Research Investigator

Inspiration

I'm an MRI researcher at UC Santa Cruz. Every new project — characterizing a reconstruction method, scoping a grant, replying to a reviewer — starts with the same 2–4 hour grind:

Search arXiv
Skim 20 abstracts
Open 5 papers
Mentally track who claims what
Notice that paper A says X works at 8× acceleration while paper B says above 4× introduces artifacts
Try to figure out why they disagree

That's bookkeeping at the literature scale. It should be automatable. The available tools fail at it three ways:

Chatbots → no provenance, hallucinated citations
Summarization pipelines → never compare claims across papers
Cloud RAG → sends paper text and patient context to third parties

Marginalia is the version that works for a researcher: runs where the data lives (local DGX Spark), builds a persistent structured knowledge graph across investigations, and actively looks for contradictions between papers — because contradictions are where the open research questions live.

🧠 What I Learned

Autonomous ≠ scripted. "Agentic AI" vs "pipeline" comes down to whether the model is making real decisions. With a SQLite knowledge graph as durable state and a ReAct loop reading from it, the agent's branching path on any question is genuinely data-driven.
Nemotron model selection matters. Using two distinct Nemotron variants in role-specific jobs (`Nemotron-3-Nano-30B-A3B` for structured JSON extraction and `Nemotron-3-Super-120B-A12B` for synthesis) was much more effective than running everything through the biggest model.
Persistence is the real moat. A chat assistant forgets. A knowledge graph compounds. The agent's first action on any new question is query_memory.
OpenClaw's skill architecture is genuinely good. SKILL.md + Python scripts invoked via exec is a pleasingly minimal contract.

🏗️ How I Built It

Three layers on a single DGX Spark, all local:

1. Orchestration layer

OpenClaw gateway running an isolated marginalia agent on Qwen 3.6 35B (32K context, tools-capable, reasoning). The agent's AGENTS.md carries four operating directives that make autonomy visible in the trace:

| Directive | Rule |
|---|---|
| D1 | Narrate every decision in 1–3 sentences between tool calls |
| D2 | Always call query_memory first on any new question |
| D3 | Run one contrarian arXiv search before synthesizing |
| D4 | Weight reasoning by confidence (cite >0.85 explicitly, acknowledge 0.4–0.85, ignore <0.4) |
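
The D4 thresholds can be written down as a small function. This is only a sketch: in the project D4 is a directive in AGENTS.md that the orchestrator follows, not necessarily code, and the function name and the handling of the exact boundary values (0.4 and 0.85 both fall in the "acknowledge" band here) are assumptions.

```python
def confidence_tier(confidence: float) -> str:
    """Map a claim confidence in [0, 1] to the D4 handling tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence > 0.85:
        return "cite"         # cite explicitly in the synthesis
    if confidence >= 0.4:
        return "acknowledge"  # mention, but flag as uncertain
    return "ignore"           # leave out of the reasoning


# Hypothetical claims, purely for illustration.
claims = [("claim A", 0.92), ("claim B", 0.60), ("claim C", 0.25)]
tiers = {text: confidence_tier(score) for text, score in claims}
print(tiers)
# {'claim A': 'cite', 'claim B': 'acknowledge', 'claim C': 'ignore'}
```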

2. Skill layer (10 skills via OpenClaw `exec`: 9 core skills plus 1 policy-demo skill)

| Skill | Purpose |
|---|---|
| marginalia-start-investigation | Open a new investigation row |
| marginalia-query-memory | Recall stored claims joined with paper info |
| marginalia-arxiv-search | Rate-limited arXiv search (1 req / 3 s), Atom XML parse |
| marginalia-fetch-abstract | DB-first abstract lookup with arXiv fallback |
| marginalia-extract-claims | Calls Nemotron Nano 30B → JSON claim extraction |
| marginalia-add-claim | Insert a derived claim |
| marginalia-check-contradiction | Calls Nemotron Nano 30B → contradiction judgment |
| marginalia-synthesize | Calls Nemotron Nano 30B / Super → final markdown report |
| marginalia-submit-final-report | Persist + mark complete |
| marginalia-verify-citation | NemoClaw policy-demo skill (blocks non-arXiv egress) |
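
As an illustration of what a rate-limited search with Atom parsing might look like, here is a minimal stdlib sketch. It is not the project's actual skill script: the class and function names are hypothetical, and the sample feed is a placeholder rather than real arXiv output. The query URL follows the public arXiv API (`export.arxiv.org/api/query`).

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


class MinIntervalLimiter:
    """Block so that successive calls are at least `interval` seconds apart."""

    def __init__(self, interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def _clean(text):
    """Collapse the line breaks and indentation that Atom text nodes often carry."""
    return " ".join(text.split())


def parse_atom_feed(xml_text):
    """Extract the arXiv id, title, and abstract from each <entry> of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_url = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        papers.append({
            "arxiv_id": entry_url.rsplit("/abs/", 1)[-1],
            "title": _clean(entry.findtext("atom:title", default="", namespaces=ATOM_NS)),
            "abstract": _clean(entry.findtext("atom:summary", default="", namespaces=ATOM_NS)),
        })
    return papers


def search_arxiv(query, limiter, max_results=5):
    """Query the arXiv API (network call), waiting on the limiter first."""
    limiter.wait()
    params = urllib.parse.urlencode({"search_query": f"all:{query}", "max_results": max_results})
    with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}", timeout=30) as resp:
        return parse_atom_feed(resp.read().decode("utf-8"))


# Offline demonstration on a placeholder feed (not real arXiv data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A placeholder
      title</title>
    <summary>Placeholder abstract text.</summary>
  </entry>
</feed>"""

papers = parse_atom_feed(SAMPLE_FEED)
print(papers[0]["arxiv_id"], "|", papers[0]["title"])
# 0000.00000v1 | A placeholder title
```

Making the clock and sleep functions injectable keeps the limiter testable without actually waiting three seconds per call.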

3. Persistent memory layer — SQLite knowledge graph

Schema invariant: every claim in any synthesis is anchored to a real arxiv_id in papers. The synthesizer is told never to invent citations, and that constraint is enforceable at the SQL level via foreign keys.
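
A minimal sketch of how that invariant could be enforced, assuming a two-table layout. The column names and placeholder rows are illustrative, not the project's actual schema; only the `papers` table and `arxiv_id` key come from the description above. Note that SQLite ignores foreign keys unless `PRAGMA foreign_keys = ON` is set on each connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE papers (
    arxiv_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL
);
CREATE TABLE claims (
    id         INTEGER PRIMARY KEY,
    arxiv_id   TEXT NOT NULL REFERENCES papers(arxiv_id),
    text       TEXT NOT NULL,
    confidence REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1)
);
""")

INSERT_CLAIM = "INSERT INTO claims (arxiv_id, text, confidence) VALUES (?, ?, ?)"

# A claim anchored to a stored paper is accepted.
conn.execute("INSERT INTO papers VALUES (?, ?)", ("0000.00001", "Placeholder paper"))
conn.execute(INSERT_CLAIM, ("0000.00001", "Placeholder claim", 0.9))

# A claim citing a paper that was never stored is rejected by the database.
try:
    conn.execute(INSERT_CLAIM, ("9999.99999", "Claim with an invented citation", 0.9))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Recall stored claims joined with paper info.
rows = conn.execute("""
    SELECT c.text, c.confidence, p.arxiv_id, p.title
    FROM claims AS c
    JOIN papers AS p ON p.arxiv_id = c.arxiv_id
""").fetchall()

print(rejected, rows)
# True [('Placeholder claim', 0.9, '0000.00001', 'Placeholder paper')]
```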

Security layer (NemoClaw bonus track)

An OpenShell PolicyFile (policies/marginalia-sandbox.yaml) following the NemoClaw reference blueprint schema:

Network egress allow-list: export.arxiv.org + local inference gateway only
Default deny for everything else
marginalia-verify-citation demonstrates the block: pointed at example.com, it returns status: blocked with an audit log line
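
The default-deny logic can be illustrated in a few lines of Python. This is not the OpenShell policy format and not the real enforcement path (which lives in the sandbox policy, not in skill code); the function name, the audit message wording, and the use of `localhost` as a stand-in for the local inference gateway are all assumptions.

```python
from urllib.parse import urlsplit

# "localhost" is a stand-in for the local inference gateway (assumed host name).
ALLOWED_HOSTS = {"export.arxiv.org", "localhost"}


def check_egress(url):
    """Allow a request only if its host is on the allow-list; deny everything else."""
    host = urlsplit(url).hostname or ""
    if host in ALLOWED_HOSTS:
        return {"status": "allowed", "host": host}
    return {
        "status": "blocked",
        "host": host,
        "audit": f"egress denied: {host!r} is not on the allow-list",
    }


print(check_egress("http://export.arxiv.org/api/query?search_query=all:mri")["status"])
# allowed
print(check_egress("https://example.com/paper.pdf")["status"])
# blocked
```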

Implementation stats

~570 KB of code
stdlib Python + requests + sqlite3 for skills, markdown for agent files, YAML for the policy
No web frameworks, no build steps, no cloud dependencies
105 unit tests pass in ~2 seconds

Time complexity

A typical investigation:

$$t_\text{total} \approx \underbrace{t_\text{orch}}_{\sim 10\,\text{s/turn}} \cdot N_\text{turns} + \underbrace{t_\text{extract}}_{\sim 40\,\text{s}} \cdot N_\text{papers} + \underbrace{t_\text{synth}}_{\sim 30\,\text{s}}$$

Fresh question, 3 papers: ~3 minutes
Question with rich prior memory: ~30 seconds (because query_memory short-circuits most of the work)
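
Plugging the per-step figures into the estimate gives a quick sanity check. The turn count is not stated above; three orchestrator turns is an assumption chosen here because it reproduces the roughly 3-minute figure for a fresh 3-paper question.

```python
def estimate_seconds(n_turns, n_papers, t_orch=10.0, t_extract=40.0, t_synth=30.0):
    """Rough wall-clock estimate: orchestration turns + per-paper extraction + one synthesis."""
    return t_orch * n_turns + t_extract * n_papers + t_synth


# Fresh question, 3 papers, assuming 3 orchestrator turns: 30 + 120 + 30 seconds.
fresh = estimate_seconds(n_turns=3, n_papers=3)
print(f"{fresh:.0f} s = {fresh / 60:.1f} min")
# 180 s = 3.0 min
```

Extraction dominates the budget at about 40 s per paper, which is why skipping it through stored memory shortens the run so much.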

⚔️ Challenges

The orchestrator drifts when given freedom

Even with explicit scope caps in AGENTS.md ("at most 3 extractions"), the Qwen orchestrator would happily extract from 6+ papers, blowing the demo budget. I had to layer multiple defenses: explicit tool prohibitions (do not use web_search), a tighter user prompt at demo time, and ultimately accepting that a "fast" live demo is fundamentally about leveraging the persistent-memory cache rather than running a full extraction loop in front of judges.

Local model orchestration is memory-thrashy

The Spark has 128 GB of unified memory, and the orchestrator plus both Nemotron models do not all stay resident at once. Ollama has to evict and reload, and a Super cold-load takes 60–90 seconds. Mitigation:

Pre-warm the two loop models with keep_alive: 30m
Swap synthesis to Nemotron Nano 30B for the live demo (Super reserved for "deep mode" offline runs)
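
Pre-warming could look roughly like the sketch below, which relies on the Ollama HTTP API: a request to `/api/generate` with a model name and no prompt loads the model, and `keep_alive` controls how long it stays resident. The helper names and the default port are assumptions about the setup, and the model tags in the usage note are placeholders, not the tags actually used.

```python
import json
import urllib.request


def build_prewarm_request(model, keep_alive="30m", base_url="http://localhost:11434"):
    """Build a load-only request: no prompt, just the model name and a keep_alive window."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prewarm(models, **kwargs):
    """Load each model in turn (network call; a cold load can take a while)."""
    for model in models:
        with urllib.request.urlopen(build_prewarm_request(model, **kwargs), timeout=300) as resp:
            resp.read()


request = build_prewarm_request("placeholder-model")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

With an Ollama server running, something like `prewarm(["orchestrator-model-tag", "extractor-model-tag"])` before the demo would load both loop models ahead of the first turn.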

OpenClaw's session model has surprising sharp edges

Sessions accumulate context across turns; a session that's seen 8 long extractions hits the 32K context limit, and the next message fails silently with a `Context overflow` error. Recovery: fresh session per investigation. Bench harnesses pass --session-id to force fresh sessions, and AGENTS.md explicitly tells the agent it operates one-investigation-per-session.

Demo reliability vs. demo impressiveness

A full live ReAct loop with 4-paper extraction + Super synthesis would be the most impressive thing to show, and also the thing most likely to crash. The honest answer for the demo: lead with the artifacts the agent has already produced (saved synthesis with real citations, contradictions, open questions) and use the live loop only to demonstrate the persistent-memory beat. That converted the "agent might stall" risk into "agent runs short because memory hits."

Test-driven discipline kept me sane

Every layer shipped with unit tests + microbenchmarks in the same task. When extract_claims returned malformed JSON during integration testing, I had already written tests around the JSON-parse retry logic in subagents.call_nemotron — so I knew the helper itself was fine and the issue was prompt-side. 105 tests across 7 modules saved me hours of debugging at 3am.
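
A JSON-parse retry helper of the kind described might look like the sketch below. It is not the actual `subagents.call_nemotron` implementation: the function name, the retry count, and the wording of the corrective prompt are assumptions, and the model call is injected as a plain callable so the logic can be tested without a model.

```python
import json


def call_with_json_retry(call_model, prompt, max_attempts=3):
    """Call the model until its reply parses as JSON, nudging it after each failure."""
    last_error = None
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({exc.msg}). Reply with JSON only."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


# Fake model: the first reply is malformed, the second is valid.
replies = iter(['Sure! {"claims": [', '{"claims": []}'])
result = call_with_json_retry(lambda _prompt: next(replies), "Extract claims as JSON.")
print(result)
# {'claims': []}
```

Injecting the model call is what lets a unit test separate helper bugs from prompt-side problems, which is the distinction the paragraph above relies on.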

🎯 The Selling Point

Marginalia isn't another summarization tool — it's the version of literature review that:

Finds contradictions between papers, not just summaries → where the open research questions live
Remembers across investigations → second question is faster because the first is cached
Runs entirely locally → for research environments where data must stay on the machine

> Imagine a research assistant who never forgets what you've read together, who actively looks for fights between papers, and who never sends your data anywhere.

Built With

apis
claude
cloud-services
databases
frameworks
nemoclaw
platforms

Updates

Edison Kuo started this project — May 16, 2026 09:11 PM EDT
