The Reflection Problem
An academic paper proved that evolving context beats static prompts. It also revealed where automation stops and practice begins.
A recent paper formalizes something I’ve been doing by hand for months.
“Agentic Context Engineering,” accepted at ICLR 2026, argues that instead of compressing what an AI knows into terse instructions, you should let the context grow. A Generator executes tasks, a Reflector extracts lessons, a Curator integrates them into structured context. Under test conditions, this structure matched GPT-4.1’s production agent with a fraction of the compute.
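The loop is easy to sketch. Here is a toy version in Python, with trivial stand-in logic where the paper uses LLM calls; the function bodies are hypothetical illustrations, not the paper's implementation:

```python
def generator(task, context):
    """Executes a task against the current context; returns a trace.
    Stand-in logic: a task 'fails' if the word 'fail' appears in it."""
    return {"task": task, "used": list(context), "ok": "fail" not in task}

def reflector(trace):
    """Extracts a lesson from the execution trace."""
    if trace["ok"]:
        return [f"strategy worked for: {trace['task']}"]
    return [f"avoid approach used for: {trace['task']}"]

def curator(context, lessons):
    """Integrates new lessons into the context as additions,
    never rewriting what is already there."""
    for lesson in lessons:
        if lesson not in context:
            context.append(lesson)
    return context

context = []
for task in ["parse logs", "fail: summarize"]:
    trace = generator(task, context)
    context = curator(context, reflector(trace))
# context now holds one lesson per task, and only grows
```

The point of the structure is the separation: the component that does the work is not the component that decides what the work taught.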
The paper names two problems that practitioners already know.
The first is brevity bias. Prompt optimization converges toward shorter, more generic instructions. Each revision strips domain-specific detail until the failures cluster in edge cases — the exact cases that needed the specific knowledge the optimization compressed away.
In my own system, I’ve watched this happen in reverse. Constraint files that started as three-line reminders grew to 163 lines across five projects. Each line earned its place by catching a specific failure. The academic version of brevity bias is what happens when you go the other direction — optimizing for conciseness until the constraints disappear and the failures return.
The second is context collapse. When an LLM rewrites its own accumulated context — summarizing what it’s learned into a fresh document — the summary degrades with each iteration. At step 60 of their experiment, the context held 18,000 tokens and performed well. At step 61, it collapsed to 122 tokens and performed worse than having no context at all.
The system forgot. Not gradually — catastrophically.
ACE solves both problems with architecture. Incremental delta updates instead of monolithic rewrites. A dedicated Reflector separated from the Generator. The context grows without collapsing.
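The difference between the two update styles fits in a few lines. This is an illustration of the failure mode, not the paper's code; the `summarize` callable stands in for an LLM rewriting its own accumulated context:

```python
def delta_update(context, deltas):
    """Apply small additions and removals; every untouched entry
    survives verbatim, so no single step can erase the context."""
    kept = [e for e in context if e not in deltas.get("remove", [])]
    return kept + deltas.get("add", [])

def monolithic_rewrite(context, summarize):
    """Replace the entire context with one summary; one bad
    summarization step can lose everything at once."""
    return summarize(context)

ctx = [f"rule {i}" for i in range(100)]
ctx = delta_update(ctx, {"add": ["rule 100"], "remove": ["rule 3"]})
# one entry added, one removed, 99 originals intact

collapsed = monolithic_rewrite(ctx, lambda c: ["be helpful"])
# a degenerate summary wipes 100 entries in a single step
```

The delta path bounds how much any single update can destroy; the rewrite path does not, which is exactly the step-61 collapse.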
The structure matches what I’ve been doing manually: append-only decision logs, constraint files that grow but never get fully rewritten, status files that track what changed rather than what the system thinks I should know. The paper demonstrates the same structure under controlled conditions.
ACE works brilliantly in clean-feedback environments. Agent tasks where code executes or throws an error. Financial analysis where the answer is right or wrong. The Reflector knows whether the Generator succeeded because there’s an objective signal.
The paper acknowledges what happens without clean feedback. When ground-truth labels are absent — when there’s no execution trace, no right answer to compare against — both ACE and its competitors degrade. The context gets polluted by lessons extracted from ambiguous results. The Reflector can’t distinguish good work from bad, so it encodes both as strategies.
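The pollution mechanism is simple enough to show in a toy sketch (the traces and labels here are invented): without a ground-truth signal, the reflector has no basis for filtering, so every approach, good or bad, becomes a "lesson."

```python
def reflect(trace, ground_truth=None):
    """With a label, only correct work becomes a lesson.
    Without one, every trace gets encoded, and the context
    absorbs bad strategies alongside good ones."""
    if ground_truth is not None:
        return [trace["approach"]] if trace["answer"] == ground_truth else []
    return [trace["approach"]]  # no signal: keep everything

traces = [
    {"approach": "check the schema first", "answer": 42},
    {"approach": "guess from the filename", "answer": 17},
]
labeled = [l for t in traces for l in reflect(t, ground_truth=42)]
unlabeled = [l for t in traces for l in reflect(t)]
# labeled keeps only the correct strategy; unlabeled keeps both
```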
This is where it starts to break.
The Reflection Problem: systems can accumulate context, but in ambiguous domains they can’t reliably decide what’s worth keeping.
The domains I work in — essay quality, strategic positioning, voice consistency, whether a constraint file has earned its place — don’t produce execution traces. The “feedback” is whether the constraint caught the right thing, whether the essay landed with practitioners, whether engagement produced reciprocity. These signals are real but ambiguous, delayed, and often invisible in the metrics.
In ACE’s architecture, the Reflector would encode my Friday afternoon publish slot as a viable strategy because the essay went live without errors. My system reads the signal differently — the 24-hour snapshot showed 8 views and flat traffic against five prior publish cycles, with concurrent absence of the thread engagement that correlates with subscriber growth. A weak signal at best, but one that only makes sense in the context of the five cycles before it.
No automated Reflector I’ve seen makes that call reliably. Not because the capability is impossible, but because the evaluation requires judgment that only accumulates through practice.
In practice, the split shows up immediately.
Automated context engineering — ACE’s mode — runs a clean feedback loop: try something, measure the result, extract the lesson, update the playbook. This scales. The paper proves it works.
Practiced context engineering runs the feedback loop through a human who holds the evaluator role — not because automation is impossible, but because the evaluation itself is the expertise. Knowing which constraint earned its place, which essay landed, which engagement signal matters — this is the practice. The system doesn’t produce the judgment. The judgment produces the system.
That second mode is the one the paper can't test directly. My constraint files work on the third project because I built two projects without them first; I learned where the joints were by building without the scaffolding and feeling where things broke. Automate the Reflector before the practitioner has that intuition, and the context grows in the wrong direction.
The Honest Part
I’m making a convenient argument.
The paper proves that automated context engineering works — measurably, reproducibly, at scale. My system is one person, nine subscribers, and a methodology I can't yet separate from my own expertise. Claiming that practiced reflection is architecturally necessary could be motivated reasoning dressed up as insight. Maybe what I call "judgment" is just the part I haven't figured out how to automate yet.
I don’t know where the boundary is. I know that ACE’s Reflector degrades without clean feedback. I know that my practiced reflection produces better context in ambiguous domains — or at least I believe it does, based on signals that an ML researcher would rightly call anecdotal.
The gap might close. Models might get better at evaluating their own work in judgment-dependent domains. Some of what I’m calling “practiced reflection” could probably be automated today — publish-slot analysis, engagement-pattern correlation, constraint-file usage tracking. I haven’t tried.
I also can’t always tell when my judgment is wrong. The mechanism I have is crude: when a constraint sits untouched for months, or when I route around the same rule three projects in a row, that’s the signal that the line was drawn in the wrong place. I’ve removed constraints this way. But there’s no execution trace that says “this judgment call was bad.” The feedback is slow, indirect, and easy to miss. An automated system with clean signals would catch its mistakes faster than I catch mine.
What I can’t automate yet is the decision about what matters. Which constraint earned its place. Which engagement signal is noise. Which lesson from the last project applies to the next one and which was specific to a context that won’t repeat.
That judgment is the practice. The system is the artifact the practice produces.
ACE proved that evolving context beats static prompts. The next question is whether the evolution itself can be fully automated.
I don’t think it can. But the history of these systems is a history of things that looked like judgment until they didn’t.
Robert Ford builds products, writes stories and essays, and publishes The Intelligence Engine — a Substack about building AI practices that compound. His other writing lives at Brittle Views.