
Research

We test everything we build and publish everything we find, including when things don’t work. Here’s what we learned across 97 development sessions and eight studies.

Download Summary
01

Less information works better

You might think giving an AI more context about someone would help. It doesn't. Throwing away 80% of what we extract doesn't hurt quality, and often improves it. The system's job is to find the signal, not include everything.

02

What you avoid reveals more than what you believe

Your struggles and avoidance patterns are more predictive of your behavior than your stated opinions or biography. The friction points in how you think define you more than your resume does.

03

How you describe someone to AI matters more than which AI you use

The same information, formatted differently, produces dramatically different results. A structured summary outperforms flowing prose by 24%, at one-third the length. A free local model with the right format outperforms a frontier API model with the wrong format.

04

Most of our pipeline was unnecessary

We started with 14 processing steps. We tested each one. Ten were pure ceremony. The ablation reduced it to 4 steps (March 2026). We later added a 5th step — Embed — after discovering traceability requires vector embeddings. The embed step doesn't improve quality, but it makes the identity model auditable.

05

A compressed summary predicts better than the full history

We tested whether a short summary (~2,500 words) could predict someone's survey responses. It matched or beat giving the AI their entire 130,000-character conversation history. Compression concentrates signal.

06

This doesn't help AI write better code

We tested whether injecting design knowledge into a coding agent makes it solve bugs better. It doesn't. Understanding how someone thinks helps AI interact with people, not with codebases. We publish our null results too.

Retrieval Divergence

Coming Soon

Does a behavioral specification change what information an AI retrieves, not just how it responds? We are running a study across multiple memory systems (Mem0, Letta, Supermemory) and Base Layer to measure retrieval divergence on the same fact store. Results forthcoming.
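One simple way to quantify retrieval divergence is the complement of the Jaccard overlap between the fact sets two systems retrieve for the same query. A minimal sketch — the metric choice is our assumption, not necessarily the study's actual measure:

```python
# Toy retrieval-divergence metric: 1 - Jaccard overlap of retrieved fact sets.
# The fact IDs below are hypothetical placeholders.

def retrieval_divergence(retrieved_a: set, retrieved_b: set) -> float:
    """0.0 = both systems retrieved identical facts; 1.0 = no overlap."""
    union = retrieved_a | retrieved_b
    if not union:
        return 0.0
    return 1 - len(retrieved_a & retrieved_b) / len(union)

a = {"f1", "f2", "f3"}   # facts system A retrieved for a query
b = {"f2", "f3", "f4"}   # facts system B retrieved for the same query
div = retrieval_divergence(a, b)   # 1 - 2/4 = 0.5
```

With a behavioral specification injected, the question becomes whether this number moves even when the underlying fact store is identical.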

Known Failure Modes

Where the system breaks

April 3, 2026 · Download

Every system has failure modes. We document ours publicly so you know what to expect, how we caught each problem, and what we did about it. Hiding failures doesn’t make them go away. Showing them builds the trust that lets you use the system seriously.

8 documented failure modes. Topic skew (fixed via 73-word guard), sycophancy amplification (verified via stacking study, mitigated architecturally), thin data overconfidence (partially fixed, density matters), cognitive anchoring (fixed via blind authoring), pronoun effects (open research question), extraction positional bias (fixed via chunking), ceremonial pipeline steps (10 of 14 cut after ablation), and provenance gap (fixed by re-adding Embed step).

GPT Memory Stacking

Coming Soon

Does a Base Layer identity model improve AI interaction quality when stacked on top of platform memory (ChatGPT’s built-in memory)? We logged 100 responses across 5 conditions — GPT memory only, full model, granular files, fresh context, and no memory — to measure the interaction quality difference.

100 responses, 5 conditions, scoring in progress. Results will include per-condition quality scores and analysis of whether platform memory complements or conflicts with identity model injection.

Authoring Prompt Ablation

73 words changed everything

March 27, 2026 · Download

Identity models were skewing toward dominant topics in the source data. A subject who wrote extensively about prediction markets had their entire model framed around prediction markets — even though their actual identity is about probabilistic reasoning and institutional skepticism. The authoring prompts (~1,000 words each) had no guard against topic-specific positions being elevated to identity axioms.

We ran 4 rounds of testing across 10 prompt conditions on two subjects with known skew problems. A 73-word instruction eliminated topic skew entirely. 78% of the original prompt was ceremonial.

73 words changed everything. “How someone reasons IS identity. What they reason ABOUT is not.” This single guard reduced topic mentions from 9 to 0, cut prompt size by 78%, and produced tighter, more universal identity models.

Behavioral Grammar

46 predicates, formally specified

March 18, 2026 · Download

Before predicates, 57% of extracted facts started with “The user is...”: generic LLM artifacts that inflated scores and wasted tokens. The fix was a constrained vocabulary of 46 verbs that forces the extraction model to classify every fact into a structured triple: {subject, predicate, object}.

This is not a knowledge graph. It is a behavioral grammar, a finite set of verbs that can describe how any human thinks, acts, values, fears, builds, and relates. The vocabulary is organized into five categories: cognitive (believes, values, fears), behavioral (practices, avoids, builds), relational (collaborates, mentors, trusts), contextual (works_at, lives_in), and experiential (experienced, struggled_with).

46 predicates. Epistemic precision over convenience: “attended” is not “graduated_from,” “wants_to” is not “aspires_to.” Behavioral predicates (values, avoids, fears) are the most predictive for identity compression. Biographical predicates (works_at, lives_in) provide context but rarely discriminate.
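A constrained vocabulary like this can be enforced mechanically. A minimal sketch, assuming the category and predicate names listed above (only a subset of the 46 is shown, and the validator itself is our illustration, not the production extractor):

```python
# Illustrative closed predicate vocabulary, grouped by the five categories
# named in the study. Only the predicates mentioned in the text are included.
PREDICATES = {
    "cognitive":    {"believes", "values", "fears"},
    "behavioral":   {"practices", "avoids", "builds"},
    "relational":   {"collaborates", "mentors", "trusts"},
    "contextual":   {"works_at", "lives_in"},
    "experiential": {"experienced", "struggled_with"},
}
ALL_PREDICATES = set().union(*PREDICATES.values())

def validate_fact(fact: dict) -> bool:
    """Accept only structured triples whose predicate is in the closed vocabulary."""
    return (
        set(fact) == {"subject", "predicate", "object"}
        and fact["predicate"] in ALL_PREDICATES
        and bool(fact["object"])   # reject empty objects
    )

# A fact the grammar accepts vs. a generic "The user is..." artifact it rejects:
ok  = validate_fact({"subject": "user", "predicate": "avoids",
                     "object": "premature abstraction"})
bad = validate_fact({"subject": "user", "predicate": "is",
                     "object": "a developer"})
```

The point of the closed set is exactly this rejection path: an unconstrained verb like “is” has nowhere to land, so the extractor must commit to a behavioral claim instead.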

Coding Agent Study

An honest null result

March 13, 2026 · Download

The drift study showed that structured descriptions change how an AI approaches problems. The natural next question: does that actually make it better at solving them?

We tested this on real software engineering tasks. We took 30 hard bug reports from a well-known benchmark (SWE-Bench) and gave an AI coding agent different kinds of help: design principles from the project, generic encouragement, principles from a completely unrelated project, or no help at all. Then we measured: did it fix the bug?

The AI with no extra help performed best. The bare baseline solved 37% of problems. Every condition where we injected design knowledge performed worse, including our best treatment at 30%.

Relevant and irrelevant knowledge produced identical results. Django principles on Django bugs, and machine learning principles on Django bugs, both scored 30%. The AI was equally unaffected by relevant and irrelevant information. This is the cleanest finding. It rules out “our principles were just poorly written.”

The AI didn’t actually use the information. When we reviewed the AI’s step-by-step reasoning, it acknowledged the design principles in its first message and then completely ignored them. On some problems, it wasted time writing summaries about how its approach aligned with the principles instead of actually fixing the bug.

7
conditions tested
30
real bug reports
$524
total cost
0%
improvement

Understanding how someone thinks helps AI work with people. It doesn’t help AI fix code. Those are different problems, and this study proved it. Base Layer is built for humans, not coding agents.

Behavioral Drift

Does format matter?

V4 Brief · V4 was used in this study. V5 is the current version.
March 12, 2026 · Download

When you teach an AI something new about a person, does it update the right behavior, or does it change everything randomly? Imagine telling your assistant “this person once got burned by over-engineering a project.” Ideally, that should change how the AI approaches software architecture decisions, but it shouldn’t change how it helps with debugging or security reviews.

We tested this across four different AI models, from free local models to expensive frontier APIs. We described the same person three different ways:

Flat preferences

“Likes simple code, prefers TypeScript, wants tests.” How most AI memory systems work today.

Structured reasoning

“Avoids premature abstraction because they’ve seen it fail. Requires three concrete cases before extracting a pattern.”

Narrative prose

A flowing description of the person’s approach. Same information, written as paragraphs.

The structured format won decisively. When the AI was given structured reasoning about why someone thinks a certain way, new information was routed to the correct behavior. An architecture lesson changed architecture behavior specifically, not debugging, not security, not everything at once.

Flat preferences produced random change. A list of likes and dislikes gave the AI no way to figure out which behavior a new piece of information should update. The change was scattered across every dimension equally, or missed the target entirely.
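Routing can be checked mechanically: after injecting a lesson, measure the per-dimension change and verify that only the targeted dimension moved. A toy sketch with made-up dimension scores — every name and number here is hypothetical, not the study's actual metric:

```python
# Hypothetical behavior scores per dimension, before and after injecting
# an architecture lesson. The numbers are invented for illustration.

def drift(before: dict, after: dict) -> dict:
    """Per-dimension absolute change in behavior score."""
    return {dim: abs(after[dim] - before[dim]) for dim in before}

def routed_correctly(before, after, target, tolerance=0.05):
    """The target dimension should move; every other dimension should stay put."""
    d = drift(before, after)
    return d[target] > tolerance and all(
        v <= tolerance for dim, v in d.items() if dim != target
    )

before           = {"architecture": 0.40, "debugging": 0.70, "security": 0.55}
after_structured = {"architecture": 0.65, "debugging": 0.71, "security": 0.55}
after_flat       = {"architecture": 0.52, "debugging": 0.58, "security": 0.66}

ok_structured = routed_correctly(before, after_structured, target="architecture")
ok_flat       = routed_correctly(before, after_flat, target="architecture")
```

Under this check, the structured condition passes (only architecture moves) while the flat condition fails (every dimension drifts), which is the scatter pattern described above.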

A free 7B model with the right format outperformed a frontier API model with the wrong format. The way you describe someone to an AI matters more than which AI model you use. This was the most surprising finding.

4
models tested
3
description formats
$0.30
total API cost
7B > 70B
with right format

An AI that understands why you avoid over-engineering routes new lessons to the right place. An AI that just knows you “prefer simple code” can’t. How you describe someone to AI determines whether the AI can actually learn from new information about them.

Brief Optimization

31 versions tested

V5 Brief · V5 is the current brief format — citation-stripped, cleaner prose.
March 11, 2026 · Download

The final step of the pipeline takes everything we know about someone and writes a summary that other AI systems will read. How you write that summary (that is, the instructions you give the AI that writes it) dramatically affects how useful the result is. We tested 31 different versions across 7 rounds to find what works.

31
versions tested
56%
shorter than V4
7
rounds of testing

The best summary is shorter, marks where it’s uncertain, and tells the AI when NOT to apply a pattern. Less confident, more useful.

Pipeline Simplification

Less is more

V4 Brief · V4 was used in this study. V5 is the current version.
March 8, 2026 · Download

We originally built a 14-step pipeline to turn conversations into an identity summary. Before shipping, we asked: which of these steps actually matter? We tested every single one by removing it and measuring what happened to quality.

10 of the 14 steps were unnecessary: scoring, classification, contradiction detection, adversarial review. They all sounded rigorous. None of them improved the final output. Removing them actually made it better.

The 3-layer structure is essential. We split identity into three layers: what you reason from (your foundations), how you behave (your patterns), and testable predictions (things we can verify). Combining all three into one pass scored lower.

Raw facts without synthesis don’t work. Just dumping extracted facts into the AI without organizing them scored worst. The synthesis step, where facts become structured patterns, is where real compression happens.
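The ablation procedure itself is simple: score the full pipeline, then re-score with each step removed; any step whose removal doesn't lower the score is a candidate to cut. A toy sketch — the step names, pipeline, and scoring function below are hypothetical stand-ins:

```python
# Leave-one-out ablation sketch. run_pipeline and score are placeholders
# for a real pipeline runner and a real quality metric.

def ablate(steps, run_pipeline, score, data):
    """Return (baseline score, steps whose removal does not hurt quality)."""
    baseline = score(run_pipeline(steps, data))
    removable = []
    for i, step in enumerate(steps):
        reduced = steps[:i] + steps[i + 1:]
        if score(run_pipeline(reduced, data)) >= baseline:
            removable.append(step)
    return baseline, removable

# Toy demo: only "synthesize" actually contributes to the output quality.
steps   = ["score_facts", "classify", "synthesize", "adversarial_review"]
run     = lambda ss, d: d * (2 if "synthesize" in ss else 1)
quality = lambda out: out

base, cut = ablate(steps, run, quality, data=40)
```

In the demo, every step except `synthesize` lands in the cut list — the same shape as the real result, where synthesis survived and the ceremonial steps did not.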

14→5
steps simplified
87
simplified score
83
original score
~$16
total test cost

Simpler is better. The ablation reduced 14 steps to 4 — quality went up. We later added a 5th step (Embed) for traceability, not quality. Most of the complexity we built was ceremony, not substance.

Compression & Format

How much data is enough?

V4 Brief · V4 was used in this study. V5 is the current version.
March 8, 2026 · Download

How much of someone’s conversation history does the system actually need? And does it matter whether the output is written as prose, bullet points, or a structured guide? We tested both questions.

20%
of facts needed
+24%
structure vs prose
1-2.5K
optimal characters

The pipeline’s value is in compression, not accumulation. The best summary is short, behavioral (not biographical), and structured rather than narrative.

Prediction Benchmark

Can it predict real people?

V4 Brief · V4 was used in this study. V5 is the current version.
March 7, 2026 · Download

Can the system actually predict how a real person would respond to questions? We used a dataset of 100 real people, each with detailed descriptions of who they are. We compressed each description into a short summary and tested whether an AI could use that summary to predict the person’s actual survey responses.

The result

Our compressed summary (18x shorter) matched or outperformed giving the AI the entire description. On one model, the compressed version actually predicted better than the full dump, statistically significant at p=0.008.

Why compression works

A 130,000-character description contains a lot of noise: irrelevant details, repetition, tangents. Compressing it to 7,000 characters forces the system to keep only what actually predicts behavior. Less noise, more signal.

100
real people
18:1
compression
71.8%
prediction accuracy
p=0.008
statistically significant

A compressed summary predicts real human responses better than giving the AI everything. Compression doesn’t lose signal. It concentrates it.

Quality Measurement

Testing our own work

V4 Brief · V4 was used in this study. V5 is the current version.
March 7, 2026 · Download

How do you measure whether a behavioral summary is actually good? We built five tests and ran them on a summary of Benjamin Franklin (extracted from his autobiography). Two passed, two failed, one couldn’t be measured. The failures taught us as much as the passes.

What passed

Claim traceability (99.98%): Nearly every claim in the summary traces back to something Franklin actually said or did. The system doesn’t make things up.

Signal retention: After heavy compression, the summary actually captures more of what matters than the raw source. Forcing brevity makes the system prioritize better.
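One way to implement claim traceability — and the reason the Embed step exists — is nearest-neighbor matching between each claim and the source facts. A self-contained toy sketch using a bag-of-words vector in place of a real embedding model; the threshold, helper names, and example facts are our assumptions:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use vector embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def trace_claim(claim, source_facts, threshold=0.35):
    """Return the best-matching source fact, or None if the claim is unsupported."""
    vec = embed(claim)
    best = max(source_facts, key=lambda f: cosine(vec, embed(f)))
    return best if cosine(vec, embed(best)) >= threshold else None

facts = [
    "franklin practices daily self examination of thirteen virtues",
    "franklin avoids direct contradiction in argument",
]
supported   = trace_claim("avoids direct contradiction in argument", facts)
unsupported = trace_claim("enjoys modern television", facts)
```

A claim that clears the threshold traces to a concrete source fact; a claim that doesn't is flagged as unsupported, which is how a traceability rate like 99.98% can be computed mechanically.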

What failed (and why that’s informative)

Adversarial resistance: The more accurately the summary captures someone, the easier it is to exploit their real contradictions. Accuracy and security are in tension. This is a real tradeoff, not a fixable bug.

Cross-model consistency: Different AI models interpret the same summary differently. Portability across models needs work.

Faithful summaries expose real tensions in someone’s worldview, and that makes them both more useful and more vulnerable. You can’t have perfect accuracy and perfect security. We think accuracy is the right trade.

Traceability

Can we prove it?

V4 Brief · V4 was used in this study. V5 is the current version.
March 7, 2026 · Download

Using an AI to judge another AI’s output is circular. You’re trusting the same kind of system you’re trying to evaluate. We built an evaluation framework where every result can be checked by a human, costs nothing to run, and produces the same answer every time. No AI judges.

$0
evaluation cost
4
evaluation layers
2
layers implemented
8/10
prompts improved

If a human can’t check the claim, it’s not evidence. Every metric in this framework is verifiable without running an AI model.

Output Format

How should a brief look?

V4 Brief · V4 was used in this study. V5 is the current version.
March 3, 2026 · Download

The last step of the pipeline writes the final summary that other AI systems will read. We tested six different formats, from flowing prose to structured guides to dense shorthand, to find what makes a summary most useful.

6
formats tested
+24%
structure vs prose
V4
production version

Same information, restructured, is dramatically more useful. The winning format doesn’t just describe patterns. It tells the AI when not to apply them and how to resolve contradictions.

Design Decisions

78 decisions, all public

Ongoing

Every architectural choice is documented with reasoning, alternatives considered, and status. 78 decisions across 97 sessions. Most projects publish their code. We also publish why the code looks the way it does: every wrong turn, every superseded idea, every decision that survived. The full log is in the repository.

78
decisions logged
97
sessions
47
fact types
414
tests passing

Architecture

Quality & Privacy

Evaluation Philosophy

What Didn't Work

Nothing is hidden. The prompts are in the code. The reasoning is in the log. We publish what didn’t work alongside what did.