Retrieval Divergence
Coming soon
Does a behavioral specification change what information an AI retrieves, not just how it responds? We are running a study across multiple memory systems (Mem0, Letta, Supermemory) and Base Layer to measure retrieval divergence on the same fact store. Results forthcoming.
Known Failure Modes
Where the system breaks
Every system has failure modes. We document ours publicly so you know what to expect, how we caught each problem, and what we did about it. Hiding failures doesn’t make them go away. Showing them builds the trust that lets you use the system seriously.
8 documented failure modes. Topic skew (fixed via 73-word guard), sycophancy amplification (verified via stacking study, mitigated architecturally), thin data overconfidence (partially fixed, density matters), cognitive anchoring (fixed via blind authoring), pronoun effects (open research question), extraction positional bias (fixed via chunking), ceremonial pipeline steps (10 of 14 cut after ablation), and provenance gap (fixed by re-adding Embed step).
GPT Memory Stacking
Coming soon
Does a Base Layer identity model improve AI interaction quality when stacked on top of platform memory (ChatGPT’s built-in memory)? We logged 100 responses across 5 conditions — GPT memory only, full model, granular files, fresh context, and no memory — to measure the interaction quality difference.
100 responses, 5 conditions, scoring in progress. Results will include per-condition quality scores and analysis of whether platform memory complements or conflicts with identity model injection.
Behavioral Grammar
46 predicates, formally specified
Before predicates, 57% of extracted facts started with “The user is...”: generic LLM artifacts that inflated scores and wasted tokens. The fix: a constrained vocabulary of 46 verbs that forces the extraction model to classify every fact into a structured triple: {subject, predicate, object}.
This is not a knowledge graph. It is a behavioral grammar, a finite set of verbs that can describe how any human thinks, acts, values, fears, builds, and relates. The vocabulary is organized into five categories: cognitive (believes, values, fears), behavioral (practices, avoids, builds), relational (collaborates, mentors, trusts), contextual (works_at, lives_in), and experiential (experienced, struggled_with).
46 predicates. Epistemic precision over convenience: “attended” is not “graduated_from,” “wants_to” is not “aspires_to.” Behavioral predicates (values, avoids, fears) are the most predictive for identity compression. Biographical predicates (works_at, lives_in) provide context but rarely discriminate.
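A minimal sketch of what a constrained-vocabulary check could look like. The category subsets below are illustrative samples drawn from the examples above, not the full 46-predicate vocabulary:

```python
# Abbreviated sample of the predicate vocabulary, grouped by category.
# The real system uses 46 predicates; these are illustrative subsets.
PREDICATES = {
    "cognitive":    {"believes", "values", "fears"},
    "behavioral":   {"practices", "avoids", "builds"},
    "relational":   {"collaborates", "mentors", "trusts"},
    "contextual":   {"works_at", "lives_in"},
    "experiential": {"experienced", "struggled_with"},
}
ALLOWED = set().union(*PREDICATES.values())

def validate_fact(fact: dict) -> bool:
    """Accept only {subject, predicate, object} triples whose
    predicate belongs to the constrained vocabulary."""
    return (
        {"subject", "predicate", "object"} <= fact.keys()
        and fact["predicate"] in ALLOWED
    )

print(validate_fact({"subject": "user", "predicate": "avoids",
                     "object": "premature abstraction"}))   # True
print(validate_fact({"subject": "user", "predicate": "is",
                     "object": "a developer"}))             # False: generic "is" rejected
```

The rejection of bare “is” is the point: facts like “The user is a developer” have nowhere to go unless the model restates them with a real behavioral or contextual verb.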
Coding Agent Study
An honest null result
The drift study showed that structured descriptions change how an AI approaches problems. The natural next question: does that actually make it better at solving them?
We tested this on real software engineering tasks. We took 30 hard bug reports from a well-known benchmark (SWE-Bench) and gave an AI coding agent different kinds of help: design principles from the project, generic encouragement, principles from a completely unrelated project, or no help at all. Then we measured: did it fix the bug?
The AI with no extra help performed best. The bare baseline solved 37% of problems. Every condition where we injected design knowledge performed worse, including our best treatment at 30%.
Relevant and irrelevant knowledge produced identical results. Django principles on Django bugs, and machine learning principles on Django bugs, both scored 30%. The AI was equally unaffected by relevant and irrelevant information. This is the cleanest finding. It rules out “our principles were just poorly written.”
The AI didn’t actually use the information. When we reviewed the AI’s step-by-step reasoning, it acknowledged the design principles in its first message and then completely ignored them. On some problems, it wasted time writing summaries about how its approach aligned with the principles instead of actually fixing the bug.
Understanding how someone thinks helps AI work with people. It doesn’t help AI fix code. Those are different problems, and this study proved it. Base Layer is built for humans, not coding agents.
Behavioral Drift
Does format matter?
V4 Brief: V4 was used in this study; V5 is the current version.

When you teach an AI something new about a person, does it update the right behavior, or does it change everything randomly? Imagine telling your assistant “this person once got burned by over-engineering a project.” Ideally, that should change how the AI approaches software architecture decisions, but it shouldn’t change how it helps with debugging or security reviews.
We tested this across four different AI models, from free local models to expensive frontier APIs. We described the same person three different ways:
Flat preferences
“Likes simple code, prefers TypeScript, wants tests.” How most AI memory systems work today.
Structured reasoning
“Avoids premature abstraction because they’ve seen it fail. Requires three concrete cases before extracting a pattern.”
Narrative prose
A flowing description of the person’s approach. Same information, written as paragraphs.
The structured format won decisively. When the AI was given structured reasoning about why someone thinks a certain way, new information was routed to the correct behavior. An architecture lesson changed architecture behavior specifically, not debugging, not security, not everything at once.
Flat preferences produced random change. A list of likes and dislikes gave the AI no way to figure out which behavior a new piece of information should update. The change was scattered across every dimension equally, or missed the target entirely.
A free 7B model with the right format outperformed a frontier API model with the wrong format. The way you describe someone to an AI matters more than which AI model you use. This was the most surprising finding.
An AI that understands why you avoid over-engineering routes new lessons to the right place. An AI that just knows you “prefer simple code” can’t. How you describe someone to AI determines whether the AI can actually learn from new information about them.
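One way such routing could be scored, sketched under assumed conditions (per-dimension behavior scores on a 0 to 1 scale; this metric and its threshold are illustrative, not the study’s actual protocol):

```python
def drift_profile(before: dict, after: dict) -> dict:
    """Absolute per-dimension change in behavior scores."""
    return {dim: abs(after[dim] - before[dim]) for dim in before}

def routed_correctly(before: dict, after: dict, target: str,
                     share: float = 0.5) -> bool:
    """True if the target dimension absorbs at least `share`
    of the total behavior change after a lesson is injected."""
    deltas = drift_profile(before, after)
    total = sum(deltas.values())
    return total > 0 and deltas[target] / total >= share

before    = {"architecture": 0.40, "debugging": 0.70, "security": 0.60}
# Structured format: the architecture lesson lands on architecture.
focused   = {"architecture": 0.75, "debugging": 0.71, "security": 0.61}
# Flat preferences: the change scatters across every dimension.
scattered = {"architecture": 0.50, "debugging": 0.80, "security": 0.72}

print(routed_correctly(before, focused, "architecture"))    # True
print(routed_correctly(before, scattered, "architecture"))  # False
```

The contrast between the two outcomes is the study’s core claim: same lesson, different description format, opposite routing behavior.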
Brief Optimization
31 versions tested
V5 Brief: V5 is the current brief format, citation-stripped, with cleaner prose.

The final step of the pipeline takes everything we know about someone and writes a summary that other AI systems will read. How you write that summary (the instructions you give the AI that writes it) dramatically affects how useful the result is. We tested 31 different versions across 7 rounds to find what works.
The best summary is shorter, marks where it’s uncertain, and tells the AI when NOT to apply a pattern. Less confident, more useful.
Pipeline Simplification
Less is more
V4 Brief: V4 was used in this study; V5 is the current version.

We originally built a 14-step pipeline to turn conversations into an identity summary. Before shipping, we asked: which of these steps actually matter? We tested every single one by removing it and measuring what happened to quality.
10 of the 14 steps were unnecessary. Scoring, classification, contradiction detection, adversarial review. They all sounded rigorous. None of them improved the final output. Removing them actually made it better.
The 3-layer structure is essential. We split identity into three layers: what you reason from (your foundations), how you behave (your patterns), and testable predictions (things we can verify). Combining all three into one pass scored lower.
Raw facts without synthesis don’t work. Just dumping extracted facts into the AI without organizing them scored worst. The synthesis step, where facts become structured patterns, is where real compression happens.
Simpler is better. The ablation reduced 14 steps to 4 — quality went up. We later added a 5th step (Embed) for traceability, not quality. Most of the complexity we built was ceremony, not substance.
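The three-layer split and the synthesis step can be sketched together. The routing rules and stage shape below are hypothetical stand-ins; only the layer names, the synthesis step, and the Embed step are from the text:

```python
from dataclasses import dataclass, field

@dataclass
class Brief:
    """Three-layer identity brief: foundations (what you reason from),
    patterns (how you behave), predictions (testable claims)."""
    foundations: list = field(default_factory=list)
    patterns: list = field(default_factory=list)
    predictions: list = field(default_factory=list)

def synthesize(facts: list) -> Brief:
    """Illustrative synthesis: route structured facts into the three
    layers. The predicate buckets here are simplified assumptions."""
    brief = Brief()
    for f in facts:
        if f["predicate"] in {"believes", "values", "fears"}:
            brief.foundations.append(f"{f['predicate']}: {f['object']}")
        elif f["predicate"] in {"practices", "avoids", "builds"}:
            brief.patterns.append(f"{f['predicate']}: {f['object']}")
        else:
            brief.predictions.append(f"would report {f['predicate']} {f['object']}")
    return brief

facts = [
    {"predicate": "values", "object": "simplicity"},
    {"predicate": "avoids", "object": "premature abstraction"},
    {"predicate": "works_at", "object": "a startup"},
]
b = synthesize(facts)
print(b.foundations)  # ['values: simplicity']
```

The point of the sketch is the shape, not the rules: facts become organized patterns in one synthesis pass, and the three layers stay separate because combining them into one pass scored lower.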
Compression & Format
How much data is enough?
V4 Brief: V4 was used in this study; V5 is the current version.

How much of someone’s conversation history does the system actually need? And does it matter whether the output is written as prose, bullet points, or a structured guide? We tested both questions.
The pipeline’s value is in compression, not accumulation. The best summary is short, behavioral (not biographical), and structured rather than narrative.
Prediction Benchmark
Can it predict real people?
V4 Brief: V4 was used in this study; V5 is the current version.

Can the system actually predict how a real person would respond to questions? We used a dataset of 100 real people, each with detailed descriptions of who they are. We compressed each description into a short summary and tested whether an AI could use that summary to predict the person’s actual survey responses.
The result
Our compressed summary (18x shorter) matched or outperformed giving the AI the entire description. On one model, the compressed version actually predicted better than the full dump, statistically significant at p=0.008.
Why compression works
A 130,000-character description contains a lot of noise: irrelevant details, repetition, tangents. Compressing it to 7,000 characters forces the system to keep only what actually predicts behavior. Less noise, more signal.
A compressed summary predicts real human responses better than giving the AI everything. Compression doesn’t lose signal. It concentrates it.
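The compression figures above are internally consistent, which is worth checking:

```python
full_chars = 130_000   # original description length quoted above
brief_chars = 7_000    # compressed summary length quoted above
ratio = full_chars / brief_chars
print(f"{ratio:.1f}x compression")  # 18.6x, matching the ~18x quoted above
```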
Quality Measurement
Testing our own work
V4 Brief: V4 was used in this study; V5 is the current version.

How do you measure whether a behavioral summary is actually good? We built five tests and ran them on a summary of Benjamin Franklin (extracted from his autobiography). Two passed, two failed, one couldn’t be measured. The failures taught us as much as the passes.
What passed
Claim traceability (99.98%): Nearly every claim in the summary traces back to something Franklin actually said or did. The system doesn’t make things up.
Signal retention: After heavy compression, the summary actually captures more of what matters than the raw source. Forcing brevity makes the system prioritize better.
What failed (and why that’s informative)
Adversarial resistance: The more accurately the summary captures someone, the easier it is to exploit their real contradictions. Accuracy and security are in tension. This is a real tradeoff, not a fixable bug.
Cross-model consistency: Different AI models interpret the same summary differently. Portability across models needs work.
Faithful summaries expose real tensions in someone’s worldview, and that makes them both more useful and more vulnerable. You can’t have perfect accuracy and perfect security. We think accuracy is the right trade.
Traceability
Can we prove it?
V4 Brief: V4 was used in this study; V5 is the current version.

Using an AI to judge another AI’s output is circular. You’re trusting the same kind of system you’re trying to evaluate. We built an evaluation framework where every result can be checked by a human, costs nothing to run, and produces the same answer every time. No AI judges.
If a human can’t check the claim, it’s not evidence. Every metric in this framework is verifiable without running an AI model.
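A minimal sketch of what a human-checkable, deterministic metric could look like: claim traceability as the fraction of claims whose text matches a span in the source. The exact-substring matching rule here is a simplifying assumption, not the framework’s actual rule:

```python
def traceability(claims: list, source: str) -> float:
    """Fraction of claims that appear verbatim in the source text.
    Deterministic: same inputs always give the same score, and any
    human can re-check a claim with a text search. No AI judge."""
    if not claims:
        return 1.0
    source_lower = source.lower()
    traced = sum(1 for c in claims if c.lower() in source_lower)
    return traced / len(claims)

source = "I early found that a man of industry gains credit and esteem."
print(traceability(["a man of industry", "gains credit"], source))  # 1.0
print(traceability(["loved the sea"], source))                      # 0.0
```

The design choice is the constraint itself: a metric only counts if a skeptical human with a text search could reproduce it.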
Output Format
How should a brief look?
V4 Brief: V4 was used in this study; V5 is the current version.

The last step of the pipeline writes the final summary that other AI systems will read. We tested six different formats, from flowing prose to structured guides to dense shorthand, to find what makes a summary most useful.
Same information, restructured, is dramatically more useful. The winning format doesn’t just describe patterns. It tells the AI when not to apply them and how to resolve contradictions.
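One way the winning structure could be represented. The field names and example content are illustrative, not the actual brief schema:

```python
# Hypothetical schema illustrating the winning format's key idea:
# each pattern carries explicit scope limits, uncertainty markers,
# and a rule for resolving contradictions with other patterns.
pattern = {
    "pattern": "prefers boring, well-tested solutions",
    "confidence": "high",  # the brief marks where it is uncertain
    "do_not_apply_when": [
        "prototyping throwaway code",
        "the user explicitly asks for novel approaches",
    ],
    "conflicts_with": "values rapid iteration",
    "resolution": "iteration speed wins for prototypes; robustness wins for production",
}
print(sorted(pattern.keys()))
```

A flat description carries only the first field; the other four are what let a reading AI decide when a pattern applies and what to do when two patterns collide.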
Design Decisions
78 decisions, all public
Every architectural choice is documented with reasoning, alternatives considered, and status. 78 decisions across 97 sessions. Most projects publish their code. We also publish why the code looks the way it does: every wrong turn, every superseded idea, every decision that survived. The full log is in the repository.
Architecture
Quality & Privacy
Evaluation Philosophy
What Didn't Work
Nothing is hidden. The prompts are in the code. The reasoning is in the log. We publish what didn’t work alongside what did.