The Third Memory Problem
On March 30, Anthropic shipped a packaging error with version 2.1.88 of Claude Code and accidentally published 512,000 lines of TypeScript. The code was mirrored within hours. The industry conclusion arrived fast: the real engineering is in the harness. Large language models are processors. The moat is the operating system you build around them.
This conclusion is correct. It’s also incomplete.
The leaked code is genuinely sophisticated. A background daemon called KAIROS — Dream Mode — wakes after 24 hours of inactivity, reviews memory files, prunes contradictions, consolidates learnings, and rewrites the index so it is small enough to load cleanly into the next session. Tool lists are sent to the API in alphabetical order, which keeps the prompt prefix stable, stabilizes the KV cache, and lets subsequent calls skip the compute-heavy prefill phase entirely.
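The tool-ordering trick is simpler than it sounds. A minimal sketch, assuming the mechanism described above (the names `ToolDef` and `buildToolBlock` are illustrative, not from the leaked code):

```typescript
// Sketch: sort tool definitions before each API call so the serialized
// prompt prefix is byte-identical across requests. A stable prefix lets
// the provider reuse its KV cache instead of re-running prefill.

interface ToolDef {
  name: string;
  description: string;
}

function buildToolBlock(tools: ToolDef[]): string {
  // Sort by name so the registration order of tools never changes the
  // serialized bytes of the prompt.
  const ordered = [...tools].sort((a, b) => a.name.localeCompare(b.name));
  return JSON.stringify(ordered);
}

// Two registrations in different orders produce the same serialized block,
// so the cached prefix from the previous call stays valid.
const a = buildToolBlock([
  { name: "write_file", description: "Write a file" },
  { name: "bash", description: "Run a shell command" },
]);
const b = buildToolBlock([
  { name: "bash", description: "Run a shell command" },
  { name: "write_file", description: "Write a file" },
]);
console.log(a === b); // prints "true"
```

The point is not the sort itself but what it buys: any nondeterminism in the prompt prefix, even harmless reordering, invalidates the cache and forces a full prefill.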
The memory problem is being treated as one problem. There are three. Most practitioners — including most engineers — are conflating them.
The retrieval problem is between-session forgetting. This is what the Amnesia Tax names: the hidden cost paid every time you re-explain yourself to a system that forgot everything from yesterday. Nine hundred seventy-seven GitHub repositories are solving this. Vector databases, semantic search indexes, episodic memory stores. The filing system problem — the work happened, you need to find it later.
The execution problem is mid-session degradation. Context windows grow. Attention computation scales quadratically. Large contexts become slow, expensive, and eventually incoherent. Claude Code’s harness addresses this directly: the self-healing loops, the context compaction, the KAIROS overnight consolidation. The OS problem. Complex, production-scale, genuinely hard engineering.
The reasoning problem is different in kind, not degree. It’s not about recovering what happened or preventing context collapse. It’s about encoding what the operator has learned — which calls to stop trusting, which patterns to resist, which instincts survived enough failures to be reliable. This is what Compiled Thinking produces: the operator’s accumulated judgment written in a form the model can load at session start and apply throughout.
No general-purpose repository solves this. KAIROS doesn’t either.
Here’s what that looks like in practice.
I was drafting TIE essays with full workspace context loaded — retrieval working, execution working, voice constraints in place. The drafts were coherent, structured correctly, and scored well against standard quality criteria.
They kept failing my evaluation.
The specific failure: the model was producing arguments — logically sound, well-reasoned — that didn’t trace to anything I’d actually built. The essays were credible enough to pass a surface read but couldn’t survive the question: which build produced this finding? The failure wasn’t obvious. The essays read as authoritative — specific claims, confident register, TIE voice intact. Without an explicit evaluation gate, I would have published at least two of them. The failure persisted across six drafts over three sessions before I traced it to a missing standard rather than a model limitation.
The retrieval layer couldn’t fix this. The execution layer couldn’t fix this. The system was already operating at the ceiling of what those layers produce. The gap wasn’t capability — it was the absence of an evaluation criterion.
My evaluation standard — claim must trace to an artifact, not to an argument — didn’t exist anywhere in the system. I had to encode it explicitly: “No finding without an experiment. No concept without evidence.”
Once written, the model applied it. Before that, even with perfect context and clean execution, it optimized for essay quality rather than research integrity. The standard was in my head. It had to be extracted.
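Extracted, the standard is unglamorous. A minimal sketch of what such a constraint file might look like (the filename and exact wording are illustrative, not my actual file):

```markdown
<!-- evaluation-standards.md — loaded at session start -->

## Claim integrity
- No finding without an experiment. Every claim must trace to a named
  build or artifact, not to an argument.
- No concept without evidence. If the supporting build does not exist,
  the claim is cut, not hedged.

## Failure patterns to resist
- Authoritative register is not evidence. Confident, well-structured
  prose that cannot answer "which build produced this finding?" fails
  the gate.
```

Nothing here is clever. The value is that it exists outside my head, in a form the model reads before it writes.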
The Honest Part
KAIROS can synthesize what happened. It prunes contradictions and consolidates learnings from memory files — real capability, and the subagent prompt Anthropic wrote for it is precise: *"You are performing a dream, a reflective pass over your memory files. Synthesize what you have learned recently into durable, well-organized memories so that future sessions can orient quickly."*
The question is: contradictions according to what standard? Learnings evaluated against what criteria?
The answer is: the model’s. Which means KAIROS can improve at executing the loop — managing context, compressing efficiently, flagging inconsistencies. It cannot get better at deciding whether the output was any good, because *good* in most knowledge domains is a judgment call that depends on the operator’s accumulated experience, not on the content of the memory files.
This is what the Reflection Problem describes. Automated reflectors don’t degrade because their architecture is wrong. They degrade in ambiguous domains because the feedback signals they need to calibrate improvement are exactly what automation can’t generate. If the evaluation standard lives in the practitioner’s head and nowhere else, no synthesis process can sharpen it.
KAIROS is excellent at what automation can do: synthesis, compression, contradiction-pruning where criteria are clear. The reasoning layer requires what automation structurally cannot do: a human deciding what the criteria are in the first place.
That said — Compiled Thinking persists judgment, it doesn’t validate it. Encode a bad standard and the system becomes reliably wrong rather than randomly wrong. Internal consistency is not correctness.
The practitioners who understand this distinction will build differently.
The reasoning problem requires ongoing operator investment. It doesn’t get solved. It gets maintained.
This means the constraint file discipline isn’t a workaround for what models can’t yet do. It’s the layer the model structurally cannot replace, because it encodes evaluative judgment — which preferences survived contact with real work, which decisions were relitigated once and shouldn’t be again, which patterns only became visible after the fourth failure.
The leaked codebase is 512,000 lines of TypeScript. The reasoning layer is three markdown files and the discipline to update them.
Both are real engineering. One requires a team at Anthropic. The other requires a practitioner who knows what they’ve learned and is willing to write it down.
The engineers built the OS. The file holds last month’s judgment.
Robert Ford builds products, writes stories and essays, and publishes The Intelligence Engine — a Substack about building AI practices that compound. His other writing lives at Brittle Views.