Marginalia — An Autonomous Research Investigator
Inspiration
I'm an MRI researcher at UC Santa Cruz. Every new project — characterizing a reconstruction method, scoping a grant, replying to a reviewer — starts with the same 2–4 hour grind:
- Search arXiv
- Skim 20 abstracts
- Open 5 papers
- Mentally track who claims what
- Notice that paper A says X works at 8× acceleration while paper B says above 4× introduces artifacts
- Try to figure out why they disagree
That's bookkeeping at the literature scale. It should be automatable. The available tools fail at it three ways:
- Chatbots → no provenance, hallucinated citations
- Summarization pipelines → never compare claims across papers
- Cloud RAG → sends paper text and patient context to third parties
Marginalia is the version that works for a researcher: runs where the data lives (local DGX Spark), builds a persistent structured knowledge graph across investigations, and actively looks for contradictions between papers — because contradictions are where the open research questions live.
🧠 What I Learned
- Autonomous ≠ scripted. "Agentic AI" vs "pipeline" comes down to whether the model is making real decisions. With a SQLite knowledge graph as durable state and a ReAct loop reading from it, the agent's branching path on any question is genuinely data-driven.
- Nemotron model selection matters. Using two distinct Nemotron variants in role-specific jobs (`Nemotron-3-Nano-30B-A3B` for structured JSON extraction and `Nemotron-3-Super-120B-A12B` for synthesis) was much more effective than running everything through the biggest model.
- Persistence is the real moat. A chat assistant forgets. A knowledge graph compounds. The agent's first action on any new question is `query_memory`.
- OpenClaw's skill architecture is genuinely good. `SKILL.md` + Python scripts invoked via `exec` is a pleasingly minimal contract.
🏗️ How I Built It
Three layers on a single DGX Spark, all local:
### 1. Orchestration layer
OpenClaw gateway running an isolated marginalia agent on Qwen 3.6 35B (32K context, tools-capable, reasoning). The agent's AGENTS.md carries four operating directives that make autonomy visible in the trace:
| ID | Directive |
| --- | --- |
| D1 | Narrate every decision in 1–3 sentences between tool calls |
| D2 | Always call query_memory first on any new question |
| D3 | Run one contrarian arXiv search before synthesizing |
| D4 | Weight reasoning by confidence (cite >0.85 explicitly, acknowledge 0.4–0.85, ignore <0.4) |
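Directive D4 amounts to a three-way bucketing of claim confidence. A minimal sketch of that rule follows; the function name `treatment_for` is hypothetical, and the handling of the exact boundary values (0.85 and 0.4 both count as "acknowledge") is an assumption, since the directive does not say which side they fall on.

```python
from typing import Literal

Treatment = Literal["cite", "acknowledge", "ignore"]


def treatment_for(confidence: float) -> Treatment:
    """Map a claim confidence in [0, 1] to how directive D4 says to use it."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence > 0.85:
        return "cite"          # cite explicitly in the synthesis
    if confidence >= 0.4:
        return "acknowledge"   # mention, but flag as uncertain
    return "ignore"            # too weak to reason from


for c in (0.92, 0.6, 0.2):
    print(c, treatment_for(c))
```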
### 2. Skill layer (9 skills via OpenClaw exec)
| Skill | Purpose |
| --- | --- |
| marginalia-start-investigation | Open a new investigation row |
| marginalia-query-memory | Recall stored claims joined with paper info |
| marginalia-arxiv-search | Rate-limited arXiv search (1 req / 3 s), Atom XML parse |
| marginalia-fetch-abstract | DB-first abstract lookup with arXiv fallback |
| marginalia-extract-claims | Calls Nemotron 33B → JSON claim extraction |
| marginalia-add-claim | Insert a derived claim |
| marginalia-check-contradiction | Calls Nemotron 33B → contradiction judgment |
| marginalia-synthesize | Calls Nemotron 33B / Super → final markdown report |
| marginalia-submit-final-report | Persist + mark complete |
| marginalia-verify-citation | NemoClaw policy-demo skill (blocks non-arXiv egress) |
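The two mechanical pieces of `marginalia-arxiv-search`, the 1 request / 3 s rate limit and the Atom XML parse, can be sketched with the standard library alone. This is an illustration under assumptions rather than the skill's actual code: `RateLimiter`, `parse_atom`, and the placeholder feed are hypothetical names and data, and no network call is made.

```python
import time
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv's feed


class RateLimiter:
    """Block so that successive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted request

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def parse_atom(xml_text):
    """Extract id, title, and abstract from each <entry> of an Atom feed."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id", default="")
        results.append({
            "arxiv_id": entry_id.rsplit("/abs/", 1)[-1],
            # Collapse the line breaks arXiv puts inside titles and abstracts.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "summary": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
        })
    return results


# Placeholder feed (not a real paper) standing in for an arXiv API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Placeholder
      title</title>
    <summary>Placeholder abstract.</summary>
  </entry>
</feed>"""

limiter = RateLimiter(min_interval=3.0)
limiter.wait()  # first call returns immediately
print(parse_atom(SAMPLE_FEED))
```

Injecting `clock` and `sleep` keeps the limiter testable without real waiting, which fits a test suite that is expected to finish in seconds.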
### 3. Persistent memory layer: SQLite knowledge graph
Schema invariant: every claim in any synthesis is anchored to a real arxiv_id in papers. The synthesis step is told never to invent citations, and that constraint is enforceable at the SQL level via foreign keys.
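A minimal sketch of how a foreign key can enforce that invariant. The table and column names other than `papers` and `arxiv_id` are assumptions for illustration, not the project's actual schema. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

# Hypothetical two-table schema: every claim must point at a known paper.
SCHEMA = """
CREATE TABLE papers (
    arxiv_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL
);
CREATE TABLE claims (
    id         INTEGER PRIMARY KEY,
    arxiv_id   TEXT NOT NULL REFERENCES papers(arxiv_id),
    text       TEXT NOT NULL,
    confidence REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1)
);
"""


def open_graph(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def add_claim(conn, arxiv_id, text, confidence):
    # The connection context manager commits on success, rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO claims (arxiv_id, text, confidence) VALUES (?, ?, ?)",
            (arxiv_id, text, confidence),
        )


conn = open_graph()
with conn:
    conn.execute(
        "INSERT INTO papers (arxiv_id, title) VALUES (?, ?)",
        ("0000.00000", "Placeholder paper"),  # placeholder row, not a real paper
    )
add_claim(conn, "0000.00000", "placeholder claim", 0.9)

try:
    add_claim(conn, "9999.99999", "claim citing an unknown paper", 0.9)
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

With this in place, a hallucinated citation fails at insert time instead of surfacing in a report.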
### 4. Security layer (NemoClaw bonus track)
An OpenShell PolicyFile (policies/marginalia-sandbox.yaml) following the NemoClaw reference blueprint schema:
- Network egress allow-list: export.arxiv.org + local inference gateway only
- Default deny for everything else
- marginalia-verify-citation demonstrates the block: pointed at example.com, it returns status: blocked with an audit log line
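The allow-list logic itself is simple to illustrate in Python. The sketch below is not the OpenShell PolicyFile format and not the NemoClaw enforcement mechanism; it only mirrors the default-deny decision the policy describes. The function name `check_egress` and the choice of `localhost` / `127.0.0.1` as the local inference gateway hosts are assumptions.

```python
from urllib.parse import urlsplit

# Assumed allow-list: arXiv's export host plus a local inference gateway.
ALLOWED_HOSTS = {"export.arxiv.org", "localhost", "127.0.0.1"}


def check_egress(url):
    """Return an audit-style record; anything not on the allow-list is blocked."""
    host = (urlsplit(url).hostname or "").lower()
    status = "allowed" if host in ALLOWED_HOSTS else "blocked"
    return {"url": url, "host": host, "status": status}


print(check_egress("http://export.arxiv.org/api/query?search_query=all:mri"))
print(check_egress("https://example.com/paper.pdf"))
```

Matching on the exact hostname, rather than a substring, means a lookalike such as `export.arxiv.org.evil.test` is still denied.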
Implementation stats
- ~570 KB of code
- stdlib Python + requests + sqlite3 for skills, markdown for agent files, YAML for the policy
- No web frameworks, no build steps, no cloud dependencies
- 105 unit tests pass in ~2 seconds
Time complexity
A typical investigation:
$$t_\text{total} \approx \underbrace{t_\text{orch}}_{\sim 10\,\text{s/turn}} \cdot N_\text{turns} + \underbrace{t_\text{extract}}_{\sim 40\,\text{s}} \cdot N_\text{papers} + \underbrace{t_\text{synth}}_{\sim 30\,\text{s}}$$
- Fresh question, 3 papers: ~3 minutes
- Question with rich prior memory: ~30 seconds (because query_memory short-circuits most of the work)
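As a quick check of the arithmetic, the estimate can be written as a one-line function. The per-step costs are the rough figures quoted above; the turn count of 3 for a fresh question is an assumption chosen to illustrate the ~3 minute figure, not a measured value.

```python
def estimate_seconds(n_turns, n_papers, t_orch=10.0, t_extract=40.0, t_synth=30.0):
    """Rough wall-clock estimate: orchestration turns + extractions + one synthesis."""
    return t_orch * n_turns + t_extract * n_papers + t_synth


# Fresh question, 3 papers, assuming 3 orchestrator turns:
# 10 * 3 + 40 * 3 + 30 = 180 seconds, i.e. about 3 minutes.
print(estimate_seconds(n_turns=3, n_papers=3))
```

The extraction term dominates, which is why a memory hit that skips extraction shortens the run so sharply.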
⚔️ Challenges
- The orchestrator drifts when given freedom
Even with explicit scope caps in AGENTS.md ("at most 3 extractions"), the Qwen orchestrator would happily extract from 6+ papers, blowing the demo budget. I had to layer multiple defenses: explicit tool prohibitions (do not use web_search), a tighter user prompt at demo time, and ultimately accepting that a "fast" live demo is fundamentally about leveraging the persistent-memory cache rather than running a full extraction loop in front of judges.
- Local model orchestration is memory-thrashy
The Spark has 128 GB of unified memory, but switching between models still forces Ollama to evict and reload, and a Super cold-load takes 60–90 seconds. Mitigation:
- Pre-warm the two loop models with keep_alive: 30m
- Swap synthesis to Nemotron 33B for the live demo (Super reserved for "deep mode" offline runs)
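One way to pre-warm a model is a generate request with no prompt, which Ollama treats as "load this model", combined with the `keep_alive` parameter. The sketch below builds that request with the standard library. The endpoint is Ollama's default local address, and the model tags are placeholders, since the exact tags used in this project are not stated.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prewarm_request(model, keep_alive="30m", url=OLLAMA_URL):
    """Build a prompt-less generate request that loads `model` and keeps it resident."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prewarm(models, keep_alive="30m"):
    """Send one pre-warm request per model (requires a running Ollama server)."""
    for model in models:
        with urllib.request.urlopen(build_prewarm_request(model, keep_alive)) as resp:
            resp.read()


# Placeholder tags; substitute whatever `ollama list` shows on the machine.
# prewarm(["orchestrator-model-tag", "extraction-model-tag"])
```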
- OpenClaw's session model has surprising sharp edges
Sessions accumulate context across turns; a session that's seen 8 long extractions hits the 32K context limit and the next message fails silently with Context overflow. Recovery: fresh session per investigation. Bench harnesses pass --session-id to force fresh sessions, and AGENTS.md explicitly tells the agent it operates one-investigation-per-session.
- Demo reliability vs. demo impressiveness
A full live ReAct loop with 4-paper extraction + Super synthesis would be the most impressive thing to show, and also the thing most likely to crash. The honest answer for the demo: lead with the artifacts the agent has already produced (saved synthesis with real citations, contradictions, open questions) and use the live loop only to demonstrate the persistent-memory beat. That converted the "agent might stall" risk into "agent runs short because memory hits."
- Test-driven discipline kept me sane
Every layer shipped with unit tests + microbenchmarks in the same task. When extract_claims returned malformed JSON during integration testing, I had already written tests around the JSON-parse retry logic in subagents.call_nemotron — so I knew the helper itself was fine and the issue was prompt-side. 105 tests across 7 modules saved me hours of debugging at 3am.
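A JSON-parse retry helper of the kind described can be small enough to test exhaustively. The sketch below is an illustration, not the code in `subagents.call_nemotron`: the name `call_with_json_retry` is hypothetical, and the model call is injected as a plain callable so the helper can be tested without a model.

```python
import json


def call_with_json_retry(call_model, prompt, max_attempts=3):
    """Call the model until its reply parses as JSON, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # remember why the last reply was rejected
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


# Fake model that fails once and then succeeds, standing in for a real call.
replies = iter(["Sure! Here is the JSON:", '{"claims": []}'])
print(call_with_json_retry(lambda prompt: next(replies), "extract claims"))
```

Because the helper is isolated like this, a passing unit test narrows a malformed-JSON failure in integration down to the prompt rather than the parsing code.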
🎯 The Selling Point
Marginalia isn't another summarization tool — it's the version of literature review that:
- Finds contradictions between papers, not just summaries → where the open research questions live
- Remembers across investigations → second question is faster because the first is cached
- Runs entirely locally → for research environments where data must stay on the machine
> Imagine a research assistant who never forgets what you've read together, who actively looks for fights between papers, and who never sends your data anywhere.
Built With
- apis
- claude
- cloud-services
- databases
- frameworks
- nemoclaw
- platforms