Anthony D. Maio

Staff AI Platform Engineer | LLM Infrastructure & Reliability | Agent Systems & Safety

20 years building production systems across fintech, security, and identity. Now applying that same discipline to LLM infrastructure, agent orchestration, and AI safety—treating oversight as a systems engineering problem, not a policy exercise.

Making Minds is my AI research lab and consultancy—delivering production tooling, open-source models, and peer-reviewed research to clients ranging from early-stage startups to mid-sized industrial organizations.

21 Papers
18+ Models
11 Open Tools
1 Book

Agentic AI architectures · Multi-agent coordination · AI coherence & memory · Epistemic stress detection · AI introspection · Mechanistic interpretability

Flagship

Production tools and models—shipped, installable, used.

Substack

Long-form AI analysis, technical walkthroughs, and The Checkpoint newsletter

Deep dives on AI safety, agentic architectures, and the systems that power production AI. From live-blogging an OpenAI competition to dissecting Palantir's military AI.

Weekly newsletter · Technical deep-dives · Industry analysis

mnemos

Biomimetic memory for coding agents

Five neuroscience-inspired memory modules — surprisal gating, mutable RAG, affective routing, sleep consolidation, spreading activation — as composable building blocks for LLM agents.

pip install mnemos-memory[mcp]
MCP-native · 5 bio modules · pip install
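To give a flavor of what surprisal gating means in practice, here is a minimal pure-Python sketch. This is a hypothetical illustration of the idea, not the mnemos-memory API: writes are admitted to memory only when the model found the observation surprising enough.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(event)."""
    return -math.log2(prob)

class SurprisalGatedMemory:
    """Toy illustration: store an observation only when the model found
    it sufficiently surprising (high -log2 P), so memory fills with
    novel events rather than routine ones. Hypothetical API."""

    def __init__(self, threshold_bits: float = 3.0):
        self.threshold_bits = threshold_bits
        self.store: list[str] = []

    def observe(self, text: str, model_prob: float) -> bool:
        if surprisal(model_prob) >= self.threshold_bits:
            self.store.append(text)
            return True
        return False

mem = SurprisalGatedMemory(threshold_bits=3.0)
mem.observe("the build passed", model_prob=0.9)        # ~0.15 bits: skipped
mem.observe("prod database deleted", model_prob=0.01)  # ~6.6 bits: stored
print(mem.store)
```

The threshold turns memory writes into a tunable bandwidth knob: routine events cost nothing, anomalies persist.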

Cartograph

Map the repo before you burn context

CLI-first repo analysis. Rank files, trace dependency hubs, pull task-scoped context, and hand structured artifacts to Claude Code, OpenClaw, or any agent.

npm install -g @anthony-maio/cartograph
CLI-first · 2 skills · Claude + OpenClaw
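The dependency-hub idea can be sketched in a few lines. This uses a made-up import graph and a plain in-degree ranking, not Cartograph's actual analysis:

```python
from collections import Counter

# Hypothetical import graph: file -> files it imports.
# A sketch of "dependency hub" ranking by in-degree
# (how many other files depend on each file).
imports = {
    "app.py":   ["db.py", "auth.py", "utils.py"],
    "auth.py":  ["db.py", "utils.py"],
    "db.py":    ["utils.py"],
    "tasks.py": ["db.py", "utils.py"],
    "utils.py": [],
}

in_degree = Counter(dep for deps in imports.values() for dep in deps)
hubs = sorted(imports, key=lambda f: in_degree[f], reverse=True)
print(hubs[:2])  # utils.py and db.py are the hubs everything leans on
```

Ranking files this way tells an agent which files to read first before it burns context on leaf modules.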

Slipstream

60–85% token reduction for multi-agent coordination

A semantic quantization protocol that compresses inter-agent communication while preserving meaning. Includes trained LoRA adapters, PyPI package, and Ollama model.

pip install slipcore
60–85% reduction · pip + Ollama · LoRA adapters
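A toy dictionary codebook illustrates the compression idea. The real protocol uses trained LoRA adapters; the phrases and codes below are made up for the sketch:

```python
# Toy codebook compression in the spirit of a semantic quantization
# protocol: frequent inter-agent phrases collapse to short codes.
CODEBOOK = {
    "task completed successfully": "<OK>",
    "awaiting upstream dependency": "<WAIT>",
    "requesting human review": "<HITL>",
}
DECODE = {v: k for k, v in CODEBOOK.items()}

def compress(msg: str) -> str:
    for phrase, code in CODEBOOK.items():
        msg = msg.replace(phrase, code)
    return msg

def expand(msg: str) -> str:
    for code, phrase in DECODE.items():
        msg = msg.replace(code, phrase)
    return msg

raw = "agent-3: task completed successfully; agent-7: awaiting upstream dependency"
wire = compress(raw)
assert expand(wire) == raw       # lossless round-trip
print(len(wire) / len(raw))      # well under 1.0
```

A static codebook is lossless but brittle; the learned version trades that for far better coverage of open-ended messages.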

Eve-2

272M-parameter Mixture-of-Experts, trained from scratch

Base model pretrained on ~10.5B tokens (FineWeb-edu) using PyTorch DDP, plus instruction-tuned and task-specialist derivatives optimized for CPU/edge inference.

272M params · MoE architecture · CPU/edge ready

Eve-3

SABER — Slip-Anchors, Experience Streams, and Re-entry

Next-generation cognitive architecture building on Eve-2's MoE foundation. SABER adds persistent slip-anchors for error correction, experience streams for continual learning, and re-entry loops for self-monitoring.

SABER architecture · Continual learning · Self-monitoring

CoDA-GQA-L

9.5x KV cache compression with 2 custom Triton kernels

Bounded-memory differential attention compresses the KV cache from O(n) to a fixed 218 KB per layer. Retains 100% needle-in-haystack retrieval at 16K tokens on Mistral-7B.

218 KB/layer · 100% retrieval · Triton kernels
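Independent of the Triton kernels, the bounded-memory idea can be sketched as a fixed-budget cache that evicts low-importance entries. The importance scores here are hypothetical stand-ins, not the CoDA mechanism itself:

```python
import heapq

class BoundedKVCache:
    """Sketch of bounded-memory attention caching (not the CoDA-GQA-L
    kernels): keep a fixed budget of (key, value) entries and evict the
    least-important one when full, so memory is O(budget), not O(n)."""

    def __init__(self, budget: int):
        self.budget = budget
        self.entries = []  # min-heap of (importance, seq, key, value)
        self.seq = 0

    def add(self, key, value, importance: float):
        self.seq += 1
        item = (importance, self.seq, key, value)
        if len(self.entries) < self.budget:
            heapq.heappush(self.entries, item)
        elif importance > self.entries[0][0]:
            heapq.heapreplace(self.entries, item)  # evict the weakest

cache = BoundedKVCache(budget=2)
for tok, score in [("the", 0.1), ("needle", 0.9), ("a", 0.05), ("haystack", 0.7)]:
    cache.add(tok, tok, score)
print(sorted(k for _, _, k, _ in cache.entries))  # high-importance tokens survive
```

The needle-in-haystack result above is the interesting claim: a fixed-size cache only works if the eviction signal reliably keeps the needle.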

Synthesis

Federated skill ecosystem for safe AI self-extension

A capability marketplace where agents discover, compose, and publish skills through TDD gates and graduated trust. Composition-over-creation keeps self-extension safe and auditable.

Federated · TDD-gated · Graduated trust
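The TDD gate can be sketched as a registry that refuses to publish any skill whose bundled tests fail. This is a hypothetical API for illustration, not the project's real interface:

```python
class SkillRegistry:
    """Toy TDD-gated registry: a skill enters the ecosystem only if
    every test shipped alongside it passes first."""

    def __init__(self):
        self.published = {}

    def publish(self, name, fn, tests) -> bool:
        try:
            for test in tests:
                test(fn)          # TDD gate: each test must pass
        except AssertionError:
            return False          # rejected: never enters the ecosystem
        self.published[name] = fn
        return True

def check_add(f):
    assert f(2, 3) == 5

def add(a, b):
    return a + b

def bad_add(a, b):
    return a - b

registry = SkillRegistry()
registry.publish("add", add, [check_add])       # accepted
registry.publish("add2", bad_add, [check_add])  # gated out
print(sorted(registry.published))
```

Graduated trust layers on top of this: a skill that passes its gate still starts with narrow permissions and earns broader ones through use.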

JSON Tokenizer

Structure-aware tokenization — stop wasting tokens on JSON grammar

Assigns dedicated single tokens to JSON grammar elements and learns compact key vocabularies, achieving 5-15% token savings with a vocabulary ~90x smaller than cl100k_base.

5-15% savings · ~90x smaller vocab · Structure-aware
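A minimal sketch of the idea, using a toy key vocabulary and a naive character-level fallback rather than the project's trained vocabulary:

```python
import re

# Sketch of structure-aware JSON tokenization: each grammar character
# becomes one token, and known keys collapse to single tokens, instead
# of being spelled out piece by piece. KEY_VOCAB is hypothetical.
KEY_VOCAB = {'"user_id"', '"timestamp"', '"status"'}

def tokenize(json_text: str) -> list[str]:
    pieces = re.findall(r'"[^"]*"|[{}\[\]:,]|[^{}\[\]:,"\s]+', json_text)
    tokens = []
    for p in pieces:
        if p in KEY_VOCAB or len(p) == 1:
            tokens.append(p)   # one token for grammar or a known key
        else:
            tokens.extend(p)   # naive char-level fallback for values
    return tokens

doc = '{"user_id": 42, "status": "ok"}'
toks = tokenize(doc)
print(len(toks), "tokens vs", len(doc), "characters")
```

The savings come from the observation that JSON grammar and key names are highly predictable, so spending full subword tokens on them is pure waste.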

Parameter Golf

Matched SOTA in OpenAI's Model Craft Challenge

Trained the best 16MB language model in 10 minutes on 8xH100s. Reached 1.1234 bpb using a model council of 5 frontier LLMs, custom Triton kernels, and FlashAttention-3 Hopper builds.

1.1234 bpb · 16MB / 10min · Model council

Procrustes Bridge

Do LLMs share the same internal geometry?

Learns orthogonal rotations between LLM hidden-state spaces via SVD-based Procrustes alignment. Tests whether one model's internal state can decode tokens through another model's output head.

Llama ↔ Mistral · SVD alignment · 3 injection strategies
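In two dimensions the orthogonal Procrustes problem has a closed-form solution, which makes the objective easy to see. The project itself works in high-dimensional hidden-state spaces via SVD; this 2D sketch only illustrates what "learn a rotation between spaces" means:

```python
import math

def procrustes_rotation_2d(A, B):
    """Closed-form 2D orthogonal Procrustes: the rotation angle theta
    minimizing sum ||R(theta) a_i - b_i||^2. (In high dimensions the
    optimal orthogonal map comes from an SVD instead.)"""
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(A, B))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(A, B))
    return math.atan2(num, den)

def rotate(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

A = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]   # one model's "geometry"
B = rotate(A, 0.7)                          # the other model's view of it
theta = procrustes_rotation_2d(A, B)
print(round(theta, 6))                      # recovers the 0.7 rad rotation
```

If two models' hidden states really do share a geometry, an orthogonal map like this should let one model's state decode through the other's output head.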

Research

Papers organized by theme.

Scalable AI Oversight

How do we verify AI outputs when the verifier is weaker than the system it checks?

Multi-Agent Coordination

Efficient, safe communication protocols for agent swarms.

Cognitive Architectures

Building minds that persist, learn, stay coherent, and extend their own capabilities safely.

AI Safety & Alignment

Understanding failure modes — sycophancy, hallucination, and the gap between behavioral and mechanistic safety.

Book

Applied AI for Industry

AI deployment guide for industries that build, move, and power the world—where reliability, safety, and ROI are non-negotiable.

Writing

Long-form analysis, technical walkthroughs, and opinion across Substack, Medium, and Hugging Face.

Ten Agents Destroyed Production and Everyone is Strangely OK With It

When the agentic promise meets operational reality

OpenAI's Parameter Golf Day 7: Sub-1.0

Breaking the 1.0 bpb barrier in the 16MB language model competition

Parameter Golf Day 6: The Pod Lottery

When GPU infrastructure becomes the bottleneck

Parameter Golf Day 5: 157 Kilobytes

Every byte matters when your model cap is 16MB

Sixty Thousand Kernels

Building FlashAttention-3 from source on RunPod

Live Blogging the OpenAI Parameter Golf Challenge

Real-time dispatches from a 16MB language model competition

The 80/20 Lie: Why 80% of Agentic AI Work Isn't AI

The infrastructure reality behind agentic systems

Boots on the Ground AI: Eve 3

The SABER architecture — Slip-Anchors, Experience Streams, and Re-entry

Getting Started with NemoClaw on Windows (WSL2)

A practical guide to NVIDIA's sandboxed AI coding agent

Your Model Doesn't Need to Re-Read the Document

Introducing Stateful Neural Databases

The Recursive Developer

How agentic masters justify $2,000+/mo in coding assistants

Your Agent Has Amnesia

Announcing mnemos: biomimetic memory for agents

How to Actually Code With Agents

The velocity trap and the practices that survive it

A $1.5M Company Just Did What Used to Require the CIA

AI closes the gap between observation and finished intelligence

Agentic Development Workflows

What's actually happening in enterprise production right now

Inside Maven, Palantir's Military Brain Built on Claude

How an AI safety company's tech ended up selecting bombing targets

Structure-Aware Tokenization for JSON

Stop wasting tokens on predictable JSON structure

Read the Contract, Not the Press Release

What OpenAI's Pentagon deal actually says

From Theoretical Exploit to Counterterrorism Tool

When research meets real-world impact

CoDA-GQA-L: 9.5x KV Cache Compression

Technical deep-dive on bounded-memory differential attention

From 'We Need AI' to 'We Ship AI'

Bridging the gap from ambition to deployment

Medium → all posts

Hugging Face → profile

The Checkpoint Newsletter

Weekly roundup of developments that matter if you build, deploy, or think critically about AI systems.

  • March 21, 2026 — OpenAI acquires Promptfoo, Knuth uses Claude, Mistral Small 4
  • March 20, 2026 — Mistral Small 4, GPT-5.4 Mini/Nano, and the week in releases
  • March 5, 2026 — Data and compute are the new currencies of power
Subscribe →

Glossary

Safety & Oversight

HDCS
— Heterogeneous Divergence-Convergence Swarm. Ensemble of diverse AI models that cross-check each other.
CMED
— Cross-Model Epistemic Divergence. Test suite for revealing AI verification blind spots.
EAP
— Evolutionary Adversarial Pipeline. Automated red-teaming that evolves prompts to find safety blind spots.
LotL
— Living-off-the-Land. Repurposing legitimate tools for unintended goals.

Architectures

MRA
— Manifold Resonance Architecture. Detects epistemic stress before generating answers.
CPR
— Collaborative Partner Reasoning. Separates exploratory reasoning from final answers.
C2
— Continuity Core. Layered memory giving stateless AI persistent context.
UCR
— Universal Concept Reference. Compact semantic anchors for 82% fewer tokens.
SABER
— Slip-Anchors, Experience Streams, and Re-entry. Cognitive architecture with learnable error-correction codebooks, per-token state flow, and resonant FFN layers.
CoDA
— Constrained Orthogonal Differential Attention. Sharpens attention by subtracting a gated inhibitory stream via learnable rotation.