Your Conversation History Is a Knowledge Base. You Just Can’t Search It.
The problem isn’t that AI doesn’t remember. It’s that you can’t retrieve what it helped you build.
Every session leaves a record. Decisions get logged. Architecture gets documented. But the actual reasoning — where the problem was diagnosed, where the constraint was established, where two approaches were weighed against each other — that lives in the transcript. And transcripts can’t be queried.
You can open them. You can scroll. What you can’t do is ask “what did I decide about the authentication layer six weeks ago” and get a ranked answer. The knowledge is there. The retrieval isn’t.
A hung session made this concrete. The terminal stopped mid-operation — no error, no output. When I restarted, the workspace files were intact. Three hours of diagnostic reasoning existed only in the transcript. I found the relevant exchange by memory, opened the file, read through until I located it. Recovered. But the recovery took longer than it should have, and it only worked because I remembered which session to look in.
Most people hit this and lose the work. I decided the problem was structural.
The fix is a retrieval layer over conversation history. I built one — implemented here with MemPalace, an open-source semantic search layer that mines transcripts into a vector database and retrieves on meaning, not keywords. Query it and it returns ranked passages from past sessions with source metadata.
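MemPalace's internals aren't reproduced here, but the shape of the layer is easy to sketch. It uses ChromaDB as the vector store; with the chunking, collection name, and metadata fields below as illustrative stand-ins, mining and querying look roughly like this:

```python
# Rough shape of a transcript retrieval layer on ChromaDB (using its default
# embedding function, which is where this story starts). Illustrative only.
import chromadb

client = chromadb.PersistentClient(path="./palace_db")
collection = client.get_or_create_collection("transcripts")

# Mining: split each session transcript into chunks and index them with
# enough metadata to trace a hit back to its source session.
collection.add(
    ids=["session-042-chunk-07"],
    documents=["...40 minutes of schema discussion..."],
    metadatas=[{"session": "session-042", "date": "2025-01-14"}],
)

# Retrieval: semantic query, ranked passages with source metadata.
results = collection.query(query_texts=["supabase schema decisions"], n_results=5)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["session"], doc[:80])
```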
What made it useful wasn’t the deployment. It was a configuration decision the defaults get wrong.
The first failure
MemPalace ships with ChromaDB’s default embedding model: `all-MiniLM-L6-v2`. I used it. Mined 500+ sessions and ran the first searches.
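That default is a 384-dimensional sentence-similarity model, applied automatically whenever no embedding function is passed. A quick check of what you're actually getting:

```python
# ChromaDB's default embedding function wraps all-MiniLM-L6-v2: 384 dimensions.
from chromadb.utils import embedding_functions

ef = embedding_functions.DefaultEmbeddingFunction()
print(len(ef(["hello"])[0]))  # -> 384
```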
Query: Supabase schema decisions on one of my projects.
Before: a migration log; a dependency update thread; a debugging session where Supabase was the environment, not the subject. The session where the schema was actually designed — 40 minutes of architecture work — didn’t appear in the top results.
The words matched. The substance didn’t surface.
The model ranks surface similarity. These transcripts don’t surface the decision — they bury it. A migration log mentions Supabase clearly in every sentence. An architecture session mentions it once, then spends 40 minutes deciding what it should do. The default model scores the former higher.
Long-context models are trained to answer a different question: is this passage *about* the concept, or just mentioning it? That distinction is exactly what the retrieval needed.
`nomic-embed-text` is that class of model. The specific model matters less than the class — sentence similarity vs long-context retrieval. The difference isn’t size — it’s what it was trained to retrieve.
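One way to load it is through sentence-transformers; whether MemPalace loads it this way or through another runtime is an implementation detail. What isn't optional is the task prefixes: per the model card, nomic-embed-text was trained with different prefixes for documents and queries, and omitting them degrades retrieval.

```python
# Loading nomic-embed-text (768 dimensions) via sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

chunks = ["...transcript chunk..."]  # illustrative
doc_vecs = model.encode(["search_document: " + c for c in chunks])
query_vec = model.encode("search_query: supabase schema decisions")
print(len(query_vec))  # -> 768
```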
I replaced the embedding model and rebuilt the index.
The system resisted
Two files needed patching: `palace.py` (which builds the vector collection) and `searcher.py` (which embeds queries at search time). I patched `palace.py`, wiped the collection, and started re-mining.
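The patch amounts to pointing both halves at one definition. A drift-proof version is a shared module both files import, sketched here with hypothetical names; MemPalace's actual layout differs:

```python
# embedding_config.py -- hypothetical shared module. palace.py (index build)
# and searcher.py (query time) both import from here, so they can't diverge.
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

EMBED_MODEL = "nomic-ai/nomic-embed-text-v1"
EMBED_DIM = 768  # vs. 384 for the MiniLM default

class TranscriptEmbedder(EmbeddingFunction):
    """Wraps the model with the task prefix each side needs."""

    def __init__(self, prefix: str):
        self._prefix = prefix
        self._model = SentenceTransformer(EMBED_MODEL, trust_remote_code=True)

    def __call__(self, input: Documents) -> Embeddings:
        return self._model.encode([self._prefix + t for t in input]).tolist()

# palace.py passes TranscriptEmbedder("search_document: ") at build time;
# searcher.py passes TranscriptEmbedder("search_query: ") at query time.
```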
Before the mine completed, a repair process ran — re-importing a partial collection from an earlier state. The repair reset the embedding function to the default. The collection now held a mix: some chunks embedded at 768 dimensions, the rest at 384.
The first search after the rebuild failed. Dimension mismatch: 384 vs 768.
The error looked like an incomplete patch — query embedded by the old model, collection built by the new one, ChromaDB refusing to compare them. But the cause was different: a repair process that didn’t know what the configuration should be. It reverted to a state it considered safe.
Systems revert to defaults unless configuration is enforced. Safe state is not the same as correct state.
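Enforcement can be made concrete: record the configuration in the collection itself and refuse to search on a mismatch. A sketch of that guard; the metadata key is my convention, not ChromaDB's or MemPalace's:

```python
# Record the embedding model in the collection's metadata at build time and
# verify it before searching, so a repair that silently reverts to the default
# fails with an error that names the actual problem.
import chromadb

EMBED_MODEL = "nomic-ai/nomic-embed-text-v1"
client = chromadb.PersistentClient(path="./palace_db")

def build_collection():
    return client.get_or_create_collection(
        "transcripts", metadata={"embed_model": EMBED_MODEL}
    )

def open_collection_checked():
    col = client.get_collection("transcripts")
    recorded = (col.metadata or {}).get("embed_model")
    if recorded != EMBED_MODEL:
        raise RuntimeError(
            f"collection embedded with {recorded!r}, code expects {EMBED_MODEL!r}"
        )
    return col
```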
I patched both files explicitly, wiped the collection again, re-mined from scratch. The second fix held.
After: same query, same transcripts. The architecture session, the one with 40 minutes of schema design, ranked first, ahead of the migration logs that had topped the old results. The difference between mention and decision.
Wiring it in
The `/recall` skill makes this operational inside a work session. Call it with a query before starting work — it runs `mempalace search`, returns a pre-brief block of relevance-ranked passages with source metadata and session timestamps, and surfaces them in the conversation before the workspace files load.
The integration with `/open` is natural: recall runs first, then status files. The pre-brief assembles from two sources — the markdown files the workspace maintains, and the conversation history the workspace generated. These are different records of the same work. Both matter.
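A sketch of that assembly, shelling out to the `mempalace search` command named above. The CLI's exact interface and the status-file paths are assumptions; the ordering, recall before workspace state, is the point:

```python
# Pre-brief assembly: semantic recall first, then the workspace's own status
# files. CLI interface and file paths here are assumptions, not MemPalace's.
import subprocess
from pathlib import Path

def build_prebrief(query: str, workspace: Path) -> str:
    recall = subprocess.run(
        ["mempalace", "search", query],
        capture_output=True, text=True, check=True,
    ).stdout
    status = "\n\n".join(p.read_text() for p in sorted(workspace.glob("*.md")))
    return f"## Recall\n{recall}\n## Status\n{status}"
```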
The honest part
The palace is a snapshot. The corpus reflects the last time you ran `mempalace mine`. Recent sessions are dark until the next mine. A nightly task or a hook on `/close` keeps the lag short — this is manageable.
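The nightly version is a single crontab entry; the hour is arbitrary:

```
# Re-mine every night at 03:00 so the palace lags by at most a day.
0 3 * * * mempalace mine
```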
What isn’t manageable without deliberate design:
**No evaluation framework — and no signal when it fails.** There’s no ground truth for retrieval quality. The system can return plausible but incorrect sessions with no indication it’s wrong. You won’t know from the output whether you’re reading the session where a decision was made or a session where the same topic appeared in passing. You can’t measure precision or recall without building the evaluation harness yourself. This means you can run the system for months without knowing whether the retrieval is working or producing confident noise. (A minimal version of that harness is sketched after this list.)
**Conflicting decisions retrieve at parity.** If you changed your mind between sessions, MemPalace returns both versions with equal confidence. The system has no awareness of which decision superseded the other. You’re the tiebreaker.
**No temporal weighting.** A session from eight months ago retrieves at the same weight as one from last week. For a practice that evolves, that’s a problem the retrieval layer doesn’t solve on its own; a decay re-rank, sketched after this list, is one patch.
**The repair fragility doesn’t go away.** Any process that rebuilds or repairs the collection — import, migration, emergency restore — is an opportunity to reset the embedding function to the default. The fix requires both files updated atomically, documented explicitly. If the documentation doesn’t travel with the collection, the failure recurs.
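On the evaluation gap: the harness you'd have to build can start small. Hand-label a few dozen queries with the session that actually contains the decision, then track recall@k across configuration changes. A minimal sketch, with illustrative labels and an assumed search function:

```python
# Hand-labeled (query -> session that contains the decision) pairs,
# scored as recall@k. Labels and search_fn are illustrative.
LABELED = {
    "supabase schema decisions": "2025-01-14-architecture",
    # ...a few dozen more, labeled by hand
}

def recall_at_k(search_fn, k: int = 5) -> float:
    """search_fn(query, n_results) returns session ids, best match first."""
    hits = sum(
        expected in search_fn(query, n_results=k)
        for query, expected in LABELED.items()
    )
    return hits / len(LABELED)
```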
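On temporal weighting: a decay re-rank over the raw similarity scores is the obvious patch, though picking the half-life is itself a judgment about how fast your practice evolves. A sketch:

```python
# Recency-decay re-rank: halve a hit's effective score every HALF_LIFE_DAYS.
# MemPalace doesn't do this; it's a post-processing step you'd bolt on.
import time

HALF_LIFE_DAYS = 90.0  # tuning choice, not a recommendation

def rerank(results, now=None):
    """results: iterable of (score, timestamp_seconds, passage) tuples."""
    now = time.time() if now is None else now
    decayed = [
        (score * 0.5 ** ((now - ts) / 86400 / HALF_LIFE_DAYS), ts, passage)
        for score, ts, passage in results
    ]
    return sorted(decayed, key=lambda r: r[0], reverse=True)
```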
What this is actually about
The standard advice when building retrieval systems is to treat the embedding model as a commodity. Use the default. The model isn’t the product.
That’s wrong when your input distribution doesn’t match what the default was trained on. A sentence similarity model on long-form conversation transcripts is a category mismatch — technically functional, practically weak. The system ran for weeks before the mismatch was diagnosed, because weak retrieval doesn’t announce itself as a configuration error. It returns the wrong things with apparent confidence.
A natural alternative: fix the logging instead. Better structured summaries, more granular decision capture, outcome logs. Structured logging captures what was decided. It doesn’t capture the reasoning that produced the decision — the alternatives weighed, the constraints surfaced, the diagnostic path taken. Retrieval recovers that context. Logging records the conclusion.
The context window isn’t the limit. Retrieval is. And retrieval quality is bounded by how well your embedding model matches your data distribution.
In retrieval systems built on long-form content, the embedding model sets the ceiling.
Case Study Insight: You already have access to everything that was said. The question is whether you can retrieve what was decided. That distinction — between access and retrieval — is where the embedding model either earns its keep or fails quietly.
Robert Ford builds products, writes stories and essays, and publishes The Intelligence Engine — a Substack about building AI practices that compound. His other writing lives at Brittle Views.

