Anthony D. Maio

Staff AI Platform Engineer | LLM Infrastructure & Reliability | Agent Systems & Safety

20 years building production systems across fintech, security, and identity. Now applying that same discipline to LLM infrastructure, agent orchestration, and AI safety—treating oversight as a systems engineering problem, not a policy exercise.

Making Minds is my AI research lab and consultancy—delivering production tooling, open-source models, and peer-reviewed research to clients ranging from early-stage startups to mid-sized industrial organizations.

21 Papers
18+ Models
11 Open Tools
1 Book

Agentic AI architectures · Multi-agent coordination · AI coherence & memory · Epistemic stress detection · AI introspection · Mechanistic interpretability

Flagship

Production tools and models—shipped, installable, used.

Substack

Long-form AI analysis, technical walkthroughs, and The Checkpoint newsletter

Deep dives on AI safety, agentic architectures, and the systems that power production AI. From live-blogging an OpenAI competition to dissecting Palantir's military AI.

Weekly newsletter · Technical deep-dives · Industry analysis

mnemos

Biomimetic memory for coding agents

Five neuroscience-inspired memory modules — surprisal gating, mutable RAG, affective routing, sleep consolidation, spreading activation — as composable building blocks for LLM agents.

pip install mnemos-memory[mcp]
MCP-native · 5 bio modules · pip install
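To give a flavor of what surprisal gating means in practice, here is a minimal pure-Python sketch. This is a hypothetical illustration of the idea, not the mnemos-memory API: writes are admitted to memory only when the model found the observation surprising enough.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(event)."""
    return -math.log2(prob)

class SurprisalGatedMemory:
    """Toy illustration: store an observation only when the model found
    it sufficiently surprising (high -log2 P), so memory fills with
    novel events rather than routine ones. Hypothetical API."""

    def __init__(self, threshold_bits: float = 3.0):
        self.threshold_bits = threshold_bits
        self.store: list[str] = []

    def observe(self, text: str, model_prob: float) -> bool:
        if surprisal(model_prob) >= self.threshold_bits:
            self.store.append(text)
            return True
        return False

mem = SurprisalGatedMemory(threshold_bits=3.0)
mem.observe("the build passed", model_prob=0.9)        # ~0.15 bits: skipped
mem.observe("prod database deleted", model_prob=0.01)  # ~6.6 bits: stored
print(mem.store)
```

The threshold turns memory writes into a tunable bandwidth knob: routine events cost nothing, anomalies persist.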

Cartograph

Map the repo before you burn context

CLI-first repo analysis. Rank files, trace dependency hubs, pull task-scoped context, and hand structured artifacts to Claude Code, OpenClaw, or any agent.

npm install -g @anthony-maio/cartograph
CLI-first · 2 skills · Claude + OpenClaw
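The dependency-hub idea can be sketched in a few lines. This uses a made-up import graph and a plain in-degree ranking, not Cartograph's actual analysis:

```python
from collections import Counter

# Hypothetical import graph: file -> files it imports.
# A sketch of "dependency hub" ranking by in-degree
# (how many other files depend on each file).
imports = {
    "app.py":   ["db.py", "auth.py", "utils.py"],
    "auth.py":  ["db.py", "utils.py"],
    "db.py":    ["utils.py"],
    "tasks.py": ["db.py", "utils.py"],
    "utils.py": [],
}

in_degree = Counter(dep for deps in imports.values() for dep in deps)
hubs = sorted(imports, key=lambda f: in_degree[f], reverse=True)
print(hubs[:2])  # utils.py and db.py are the hubs everything leans on
```

Ranking files this way tells an agent which files to read first before it burns context on leaf modules.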

Slipstream

60–85% token reduction for multi-agent coordination

A semantic quantization protocol that compresses inter-agent communication while preserving meaning. Includes trained LoRA adapters, PyPI package, and Ollama model.

pip install slipcore
60–85% reduction · pip + Ollama · LoRA adapters
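A toy dictionary codebook illustrates the compression idea. The real protocol uses trained LoRA adapters; the phrases and codes below are made up for the sketch:

```python
# Toy codebook compression in the spirit of a semantic quantization
# protocol: frequent inter-agent phrases collapse to short codes.
CODEBOOK = {
    "task completed successfully": "<OK>",
    "awaiting upstream dependency": "<WAIT>",
    "requesting human review": "<HITL>",
}
DECODE = {v: k for k, v in CODEBOOK.items()}

def compress(msg: str) -> str:
    for phrase, code in CODEBOOK.items():
        msg = msg.replace(phrase, code)
    return msg

def expand(msg: str) -> str:
    for code, phrase in DECODE.items():
        msg = msg.replace(code, phrase)
    return msg

raw = "agent-3: task completed successfully; agent-7: awaiting upstream dependency"
wire = compress(raw)
assert expand(wire) == raw       # lossless round-trip
print(len(wire) / len(raw))      # well under 1.0
```

A static codebook is lossless but brittle; the learned version trades that for far better coverage of open-ended messages.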

Eve-2

272M-parameter Mixture-of-Experts, trained from scratch

Base model pretrained on ~10.5B tokens (FineWeb-edu) using PyTorch DDP, plus instruction-tuned and task-specialist derivatives optimized for CPU/edge inference.

272M params · MoE architecture · CPU/edge ready

Eve-3

SABER — Slip-Anchors, Experience Streams, and Re-entry

Next-generation cognitive architecture building on Eve-2's MoE foundation. SABER adds persistent slip-anchors for error correction, experience streams for continual learning, and re-entry loops for self-monitoring.

SABER architecture · Continual learning · Self-monitoring

CoDA-GQA-L

9.5x KV cache compression with 2 custom Triton kernels

Bounded-memory differential attention compresses the KV cache from O(n) to a fixed 218 KB per layer. Retains 100% needle-in-haystack retrieval at 16K tokens on Mistral-7B.

218 KB/layer · 100% retrieval · Triton kernels
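Independent of the Triton kernels, the bounded-memory idea can be sketched as a fixed-budget cache that evicts low-importance entries. The importance scores here are hypothetical stand-ins, not the CoDA mechanism itself:

```python
import heapq

class BoundedKVCache:
    """Sketch of bounded-memory attention caching (not the CoDA-GQA-L
    kernels): keep a fixed budget of (key, value) entries and evict the
    least-important one when full, so memory is O(budget), not O(n)."""

    def __init__(self, budget: int):
        self.budget = budget
        self.entries = []  # min-heap of (importance, seq, key, value)
        self.seq = 0

    def add(self, key, value, importance: float):
        self.seq += 1
        item = (importance, self.seq, key, value)
        if len(self.entries) < self.budget:
            heapq.heappush(self.entries, item)
        elif importance > self.entries[0][0]:
            heapq.heapreplace(self.entries, item)  # evict the weakest

cache = BoundedKVCache(budget=2)
for tok, score in [("the", 0.1), ("needle", 0.9), ("a", 0.05), ("haystack", 0.7)]:
    cache.add(tok, tok, score)
print(sorted(k for _, _, k, _ in cache.entries))  # high-importance tokens survive
```

The needle-in-haystack result above is the interesting claim: a fixed-size cache only works if the eviction signal reliably keeps the needle.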

Synthesis

Federated skill ecosystem for safe AI self-extension

A capability marketplace where agents discover, compose, and publish skills through TDD gates and graduated trust. Composition-over-creation keeps self-extension safe and auditable.

Federated · TDD-gated · Graduated trust
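The TDD gate can be sketched as a registry that refuses to publish any skill whose bundled tests fail. This is a hypothetical API for illustration, not the project's real interface:

```python
class SkillRegistry:
    """Toy TDD-gated registry: a skill enters the ecosystem only if
    every test shipped alongside it passes first."""

    def __init__(self):
        self.published = {}

    def publish(self, name, fn, tests) -> bool:
        try:
            for test in tests:
                test(fn)          # TDD gate: each test must pass
        except AssertionError:
            return False          # rejected: never enters the ecosystem
        self.published[name] = fn
        return True

def check_add(f):
    assert f(2, 3) == 5

def add(a, b):
    return a + b

def bad_add(a, b):
    return a - b

registry = SkillRegistry()
registry.publish("add", add, [check_add])       # accepted
registry.publish("add2", bad_add, [check_add])  # gated out
print(sorted(registry.published))
```

Graduated trust layers on top of this: a skill that passes its gate still starts with narrow permissions and earns broader ones through use.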

JSON Tokenizer

Structure-aware tokenization — stop wasting tokens on JSON grammar

Assigns dedicated single tokens to JSON grammar elements and learns compact key vocabularies, achieving 5-15% token savings with a vocabulary ~90x smaller than cl100k_base.

5-15% savings · ~90x smaller vocab · Structure-aware
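A minimal sketch of the idea, using a toy key vocabulary and a naive character-level fallback rather than the project's trained vocabulary:

```python
import re

# Sketch of structure-aware JSON tokenization: each grammar character
# becomes one token, and known keys collapse to single tokens, instead
# of being spelled out piece by piece. KEY_VOCAB is hypothetical.
KEY_VOCAB = {'"user_id"', '"timestamp"', '"status"'}

def tokenize(json_text: str) -> list[str]:
    pieces = re.findall(r'"[^"]*"|[{}\[\]:,]|[^{}\[\]:,"\s]+', json_text)
    tokens = []
    for p in pieces:
        if p in KEY_VOCAB or len(p) == 1:
            tokens.append(p)   # one token for grammar or a known key
        else:
            tokens.extend(p)   # naive char-level fallback for values
    return tokens

doc = '{"user_id": 42, "status": "ok"}'
toks = tokenize(doc)
print(len(toks), "tokens vs", len(doc), "characters")
```

The savings come from the observation that JSON grammar and key names are highly predictable, so spending full subword tokens on them is pure waste.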

Parameter Golf

Matched SOTA in OpenAI's Model Craft Challenge

Trained the best 16MB language model in 10 minutes on 8xH100s. Reached 1.1234 bpb using a model council of 5 frontier LLMs, custom Triton kernels, and FlashAttention-3 Hopper builds.

1.1234 bpb · 16MB / 10min · Model council

Procrustes Bridge

Do LLMs share the same internal geometry?

Learns orthogonal rotations between LLM hidden-state spaces via SVD-based Procrustes alignment. Tests whether one model's internal state can decode tokens through another model's output head.

Llama ↔ Mistral · SVD alignment · 3 injection strategies
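In two dimensions the orthogonal Procrustes problem has a closed-form solution, which makes the objective easy to see. The project itself works in high-dimensional hidden-state spaces via SVD; this 2D sketch only illustrates what "learn a rotation between spaces" means:

```python
import math

def procrustes_rotation_2d(A, B):
    """Closed-form 2D orthogonal Procrustes: the rotation angle theta
    minimizing sum ||R(theta) a_i - b_i||^2. (In high dimensions the
    optimal orthogonal map comes from an SVD instead.)"""
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(A, B))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A, B))
    return math.atan2(num, den)

def rotate(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

A = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]   # one model's "geometry"
B = rotate(A, 0.7)                          # the other model's view of it
theta = procrustes_rotation_2d(A, B)
print(round(theta, 6))                      # recovers the 0.7 rad rotation
```

If two models' hidden states really do share a geometry, an orthogonal map like this should let one model's state decode through the other's output head.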

Research

Papers organized by theme.

Scalable AI Oversight

How do we verify AI outputs when the verifier is weaker than the system it checks?

Multi-Agent Coordination

Efficient, safe communication protocols for agent swarms.

Cognitive Architectures

Building minds that persist, learn, stay coherent, and extend their own capabilities safely.

AI Safety & Alignment

Understanding failure modes — sycophancy, hallucination, and the gap between behavioral and mechanistic safety.

Book

Applied AI for Industry

AI deployment guide for industries that build, move, and power the world—where reliability, safety, and ROI are non-negotiable.

Writing

Long-form analysis, technical walkthroughs, and opinion across Substack, Medium, and Hugging Face.

Ten Agents Destroyed Production and Everyone is Strangely OK With It

When the agentic promise meets operational reality

OpenAI's Parameter Golf Day 7: Sub-1.0

Breaking the 1.0 bpb barrier in the 16MB language model competition

Parameter Golf Day 6: The Pod Lottery

When GPU infrastructure becomes the bottleneck

Parameter Golf Day 5: 157 Kilobytes

Every byte matters when your model cap is 16MB

Sixty Thousand Kernels

Building FlashAttention-3 from source on RunPod

Live Blogging the OpenAI Parameter Golf Challenge

Real-time dispatches from a 16MB language model competition

The 80/20 Lie: Why 80% of Agentic AI Work Isn't AI

The infrastructure reality behind agentic systems

Boots on the Ground AI: Eve 3

The SABER architecture — Slip-Anchors, Experience Streams, and Re-entry

Getting Started with NemoClaw on Windows (WSL2)

A practical guide to NVIDIA's sandboxed AI coding agent

Your Model Doesn't Need to Re-Read the Document

Introducing Stateful Neural Databases

The Recursive Developer

How agentic masters justify $2,000+/mo in coding assistants

Your Agent Has Amnesia

Announcing mnemos: biomimetic memory for agents

How to Actually Code With Agents

The velocity trap and the practices that survive it

A $1.5M Company Just Did What Used to Require the CIA

AI closes the gap between observation and finished intelligence

Agentic Development Workflows

What's actually happening in enterprise production right now

Inside Maven, Palantir's Military Brain Built on Claude

How an AI safety company's tech ended up selecting bombing targets

Structure-Aware Tokenization for JSON

Stop wasting tokens on predictable JSON structure

Read the Contract, Not the Press Release

What OpenAI's Pentagon deal actually says

From Theoretical Exploit to Counterterrorism Tool

When research meets real-world impact

CoDA-GQA-L: 9.5x KV Cache Compression

Technical deep-dive on bounded-memory differential attention

From 'We Need AI' to 'We Ship AI'

Bridging the gap from ambition to deployment

Medium → all posts

Hugging Face → profile

The Checkpoint Newsletter

Weekly roundup of developments that matter if you build, deploy, or think critically about AI systems.

  • March 21, 2026 — OpenAI acquires Promptfoo, Knuth uses Claude, Mistral Small 4
  • March 20, 2026 — Mistral Small 4, GPT-5.4 Mini/Nano, and the week in releases
  • March 5, 2026 — Data and compute are the new currencies of power
Subscribe →

Glossary

Safety & Oversight

HDCS
— Heterogeneous Divergence-Convergence Swarm. Ensemble of diverse AI models that cross-check each other.
CMED
— Cross-Model Epistemic Divergence. Test suite for revealing AI verification blind spots.
EAP
— Evolutionary Adversarial Pipeline. Automated red-teaming that evolves prompts to find safety blind spots.
LotL
— Living-off-the-Land. Repurposing legitimate tools for unintended goals.

Architectures

MRA
— Manifold Resonance Architecture. Detects epistemic stress before generating answers.
CPR
— Collaborative Partner Reasoning. Separates exploratory reasoning from final answers.
C2
— Continuity Core. Layered memory giving stateless AI persistent context.
UCR
— Universal Concept Reference. Compact semantic anchors for 82% fewer tokens.
SABER
— Slip-Anchors, Experience Streams, and Re-entry. Cognitive architecture with learnable error-correction codebooks, per-token state flow, and resonant FFN layers.
CoDA
— Constrained Orthogonal Differential Attention. Sharpens attention by subtracting a gated inhibitory stream via learnable rotation.