My AI Memory System Retrieved the Right Sessions. It Wasn’t Enough.
The system could find prior context. That did not mean I would reach for it.
A terminal hung mid-operation. No error, no output — the process stopped and didn’t recover. When I restarted, the workspace files were intact. Three hours of diagnostic reasoning existed only in the transcript. I found the relevant exchange by memory: opened the file, scrolled until I located it. Recovered.
The recovery depended on luck. I happened to remember which session to check. Most people in this situation lose the work. I decided the underlying problem was structural: there’s no way to query a transcript. You can open it. You can scroll. You can’t ask “what did I decide about the authentication layer six weeks ago” and get a ranked answer. The knowledge is there. The retrieval isn’t.
The first repair was retrieval. I implemented MemPalace — an open-source semantic search layer that mines conversation transcripts into a vector database and retrieves on meaning, not keywords. What made it useful wasn’t the deployment. It was a configuration decision the defaults get wrong.
The first failure
MemPalace ships with ChromaDB’s default embedding model: `all-MiniLM-L6-v2`. I used it. Mined 500+ sessions and ran the first searches.
Query: Supabase schema decisions.
Before: a migration log; a dependency update thread; a debugging session where Supabase was the environment, not the subject. The session where the schema was actually designed — 40 minutes of architecture work — didn’t appear in the top results.
The words matched. The substance didn’t surface.
The default is a sentence similarity model. A migration log mentions Supabase clearly in every sentence. An architecture session mentions it once, then spends 40 minutes deciding what it should do. The default scores the former higher.
Long-context retrieval models are trained to answer a different question: is this passage *about* the concept, or does it merely reference it? That distinction is exactly what retrieval over transcripts needs.
`nomic-embed-text` is that class of model. The specific model matters less than the class — sentence similarity vs. long-context retrieval. The difference isn’t size. It’s what it was trained to find.
I replaced the embedding model and rebuilt the index.
The system resisted
Two files needed patching: `palace.py` (which builds the vector collection) and `searcher.py` (which embeds queries at search time). I patched `palace.py`, wiped the collection, and started re-mining.
Before the mine completed, a repair process ran — re-importing a partial collection from an earlier state. The repair didn’t know the configuration had changed. It reset the embedding function to the default. The collection now held a mix: some chunks embedded at 768 dimensions, the rest at 384.
The first search after the rebuild failed. Dimension mismatch: 384 vs. 768.
The error looked like an incomplete patch. The cause was different: a repair process that reverted to a state it considered safe. Safe state is not the same as correct state.
I patched both files explicitly, wiped and rebuilt from scratch. After: the architecture session — 40 minutes of schema design — ranked first. The session where the schema was defined, not the sessions where it was mentioned.
This was not an evaluation framework — it was a known-answer probe. Good enough to expose the default failure. Not enough to certify retrieval quality.
The second problem
The retrieval worked. Three weeks later, I noticed I wasn’t using it.
Not because it had failed. Because using it required: opening Terminal, navigating to the build directory, activating a virtual environment, running `mempalace search “query”`, reading results in monochrome output, and — if something looked relevant — manually finding and opening the source file to read it in full.
A shell alias would have reduced the first two steps. A fuzzy-search wrapper might have made the CLI tolerable. But the failure wasn’t just command entry — it was result handling: scanning, comparing, opening the source session, returning to the work with enough surrounding context to trust what I’d found. The browser UI was not for search. It was for inspection.
The issue was not the CLI. Retrieval happens at a fragile moment: when you suspect prior context exists but don’t yet know whether finding it will repay the interruption. At that moment, every extra step argues for staying cold. You take the shortcut — start the session cold, rely on workspace files, accept partial context.
The second build
The second repair was not better retrieval. It was reducing the distance between needing memory and reaching it.
I built a Flask server wrapping the CLI and a browser-based UI: a search field, result cards with workspace tags and relevance scores, a slide-in panel that pulls the complete session when you want to read it in full.
Building the full-session panel turned up a structural problem underneath the interface one.
ChromaDB’s internal schema is undocumented. Pulling complete session content — not just the matched chunk, but the whole source file — required querying the SQLite backing store directly. The metadata key holding the source filename isn’t `source`. It’s `source_file`. Document text isn’t stored in the metadata table. It lives in `embedding_fulltext_search_content`, column `c0`, where the row ID maps to the embedding ID.
None of that is in any documentation. Finding it required building a debug endpoint to dump the actual table structure and inspect sample rows — building the inspector before building the feature.
The same pattern had appeared earlier. The collection could search until mixed embedding dimensions exposed hidden configuration drift. The CLI could retrieve chunks until full-session inspection exposed private storage assumptions. The public interface proved that retrieval worked. It did not expose what retrieval depended on.
The ingest step — re-mining sessions into the index — is now a button. It streams the mining process live in a terminal panel. The lag between session and index was always manageable. Now it’s visible.
The honest constraints
**No temporal weighting.** A session from eight months ago retrieves at the same weight as one from last week. For a practice that evolves, older sessions may surface positions you’ve since revised. You’re the tiebreaker.
**Conflicting decisions retrieve at parity.** If you changed your mind between sessions, both versions surface with equal confidence. The system has no awareness of which decision superseded the other.
**No evaluation framework — and no signal when it fails.** There’s no ground truth for retrieval quality. The system can return plausible but incorrect sessions with no indication it’s wrong. You can run this for months without knowing whether retrieval is working or producing confident noise.
**The repair fragility is a standing risk.** Any process that rebuilds the collection — migration, emergency restore, partial re-mine — can reset the embedding function to the default. Both files need updating atomically. If that documentation doesn’t travel with the collection, the failure recurs.
**The interface increases confidence without increasing correctness.** Result cards, relevance scores, and full-session panels make retrieval feel more authoritative. They don’t prove the retrieved session is the right one. The UI makes weak retrieval harder to detect.
**The full-session panel depends on private storage assumptions.** Search can keep working while session expansion breaks silently. The panel relies on ChromaDB internals discovered empirically — not a supported contract. If the storage schema changes, the panel fails even if search doesn’t.
What this is actually about
The mistake was thinking usable memory ended at retrieval. I had solved access. I had improved search. I had not made the system reachable at the moment prior context was needed.
My first retrieval build stopped one layer too early. The index was current. The results were good. The system still failed at the point of use because the interface couldn’t meet the cognitive moment when the question arose.
Defaults set the first ceiling. Friction sets the second. If either is wrong, memory remains a project you built, not a practice you use.
Case Study Insight: A retrieval system that works correctly and goes unused has the same operational value as one that doesn’t work. The model determines what can be found. The interface determines whether memory enters the work.
Robert Ford builds products, writes stories and essays, and publishes The Intelligence Engine — a Substack about building AI practices that compound. His other writing lives at Brittle Views.


