Two AIs Rewrote Our Investor Deck — Here’s the Pattern That Took It From 3 to 9
The builder and the evaluator should never be the same model.
My co-founder sent me a pitch deck. Twelve slides for an angel raise. Consumer subscription startup — real product, real users, warm brand.
The deck had the right instincts in the wrong execution. Pricing was wrong — a number we’d already changed internally, still showing the old one. Revenue claims were unvalidated. The financial model didn’t reconcile: subscriber count times annual price per subscriber didn’t equal the revenue total on the slide. No traction slide. No ask slide. Several typos. A well-meaning deck that would lose the room in the first five minutes.
The question wasn’t how to fix the deck. It was how to systematically harden a high-stakes deliverable — investor-facing material where every claim gets tested against reality — without spending a week on revision cycles.
The Friction
The standard workflow for reviewing a co-founder’s work looks like this: read it, mark it up, send notes, wait for the revision, review the revision, send more notes. Each round takes a day. Politeness inflates the feedback. Disagreements over word choices stall progress on structural problems. After three rounds you still aren’t confident it’s ready, because neither of you is an investor.
I could have had Claude — the model I use for building — rewrite the deck from scratch. And I did, for the first pass. Claude produced a 14-slide revision that fixed the structural problems: correct pricing, validated claims only, bottom-up market sizing, a traction slide, an ask slide. It was a significant improvement.
But then I faced a problem that most AI workflows ignore: how do you evaluate the thing you just built?
If Claude rewrites the deck and Claude reviews the rewrite, you get confirmation bias with a confidence score. The model that chose those words will find reasons those words are good. The model that structured those slides will argue the structure is sound. It’s not lying. It’s doing what language models do — maintaining coherence with their own output.
The reviewer and the builder shouldn’t be the same model. I needed an adversary.
The Build
I built a five-round loop I’m calling Adversarial Hardening. Two models in deliberate opposition, with a structured protocol between them.
Claude builds a versioned artifact — deck v1, v2, v3 — with full context: company facts, confirmed pricing, internal policies, known issues with prior versions. I paste that artifact into ChatGPT with a contextual evaluation prompt. Not “review this deck.” A structured scoring rubric: specific dimensions, prior-version comparison, explicit instructions to be adversarial. ChatGPT stress-tests and scores it — dimension by dimension, line by line, with numerical ratings. I bring the feedback back to Claude for targeted revision. Not “make it better.” Specific fixes against specific scores. Repeat until convergence.
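If you want the shape of the loop rather than the transcript, here is a minimal sketch in Python. It is illustrative, not the tooling I used: the actual run was manual, pasting artifacts between two chat windows, and the rubric names, severity threshold, and function signatures below are placeholders rather than the real prompts.

```python
from typing import Callable

# Illustrative only: these dimension names are placeholders, not the actual rubric.
RUBRIC = [
    "problem framing", "claim validation", "market sizing",
    "financial model", "traction evidence", "ask terms", "clarity",
]

def harden(
    artifact: str,
    build: Callable[[str, list[dict]], str],     # builder role (Claude): apply targeted fixes
    evaluate: Callable[[str, list[str]], dict],  # evaluator role (ChatGPT): adversarial scoring
    max_rounds: int = 5,
    severity_floor: int = 3,                     # below this, an objection counts as cosmetic
) -> str:
    """Loop builder and evaluator until two consecutive rounds raise no material objections."""
    versions = [artifact]
    quiet_rounds = 0
    for _ in range(max_rounds):
        # The evaluator sees the newest version plus every prior version, scored against RUBRIC.
        review = evaluate(versions[-1], versions[:-1])
        material = [f for f in review["fixes"] if f["severity"] >= severity_floor]
        if not material:
            quiet_rounds += 1
            if quiet_rounds == 2:   # the stop rule: two quiet rounds in a row
                break
            continue
        quiet_rounds = 0
        # The builder gets specific fixes against specific scores, never "make it better".
        versions.append(build(versions[-1], material[:3]))
    return versions[-1]
```

In the real loop, the severity filter wasn't a number; it was me, reading the feedback and deciding what counted as material.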
The critical piece isn’t the models. It’s the prompt.
Round 1 was a single-document evaluation. I gave ChatGPT the original deck and my written feedback, and told it: “Evaluate both — don’t assume either one is right. Challenge the deck and challenge my recommendations.” The original scored 3 out of 10. Claude’s first rewrite scored 8.
Round 2 shifted to a three-version comparison. “Here are versions A, B, and C. Score each on these seven dimensions. Identify the top three priority fixes.” This round caught something I’d missed across two full reads of my own rewrite: the market-sizing slide still used top-down TAM numbers — $300 billion productivity market, one billion AI users — that looked impressive and proved nothing. ChatGPT flagged the slide as “decorative math” and demanded a bottom-up funnel with capture mechanics. It also caught claims language still too assertive for a pre-revenue company — “will achieve” became “designed to achieve” — and flagged the missing ask terms.
Rounds 3 and 4 were iterative convergence. Scores climbed from 8 to 8.5 to 9 to 9.4. The moves got smaller with each pass. Softening a single verb. Trimming a vision slide from five bullet points to three. Adding churn assumptions to the financial model so the numbers could be independently verified.
One reversal I resisted: ChatGPT flagged the financial projections as still too aggressive — even after I’d already scaled them down from my co-founder’s original numbers. I’d anchored on the revised figures as “conservative enough.” The adversary disagreed. It pointed out that the Year 1 subscriber count implied 1,200 new sign-ups per month against 5-7% churn, and demanded I either show the acquisition math or label the assumptions as modeled rather than projected. I didn’t want to weaken the slide further. I did it anyway. That single change — from “projected” to “modeled, not yet observed” — was the difference between a financial slide that invites scrutiny and one that survives it.
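The acquisition math is worth making concrete. Taking the figures the adversary cited, 1,200 gross sign-ups per month against 5-7% monthly churn, a simple cohort model shows what the assumption actually implies. The zero starting base and the 6% churn midpoint here are my illustrative inputs, not numbers from the deck.

```python
# What "1,200 sign-ups per month against 5-7% churn" implies over Year 1.
# Assumptions are illustrative: starting from zero subscribers, 6% monthly churn.
gross_adds_per_month = 1_200
monthly_churn = 0.06

subscribers = 0.0
for month in range(12):
    subscribers = subscribers * (1 - monthly_churn) + gross_adds_per_month

print(f"Gross sign-ups required in Year 1: {gross_adds_per_month * 12:,}")  # 14,400
print(f"Modeled subscribers at month 12:   {subscribers:,.0f}")             # ~10,500
```

Once that arithmetic is on the slide, "modeled, not yet observed" is simply the accurate label for it.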
ChatGPT also pushed to lower the subscription price — arguing it would improve conversion. The logic was clean and wrong for this system. Pricing wasn’t just conversion; it was positioning. We held the higher price and reserved the lower one for controlled entry conditions — not the default.
The loop stopped when two consecutive rounds produced no new material objections — only cosmetic suggestions the adversary itself scored below threshold.
Round 5 expanded the scope. Instead of evaluating the deck alone, I gave ChatGPT a four-document package: the deck, an investor Q&A prep document, a verbal delivery script, and an internal note to my co-founder explaining the changes. “Evaluate this as a complete fundraising package — not just ‘is the deck good’ but ‘is this team ready to walk into a room and raise money?’” The package scored 9.4.
Four design decisions made the prompt effective rather than generic:
I always included company context — confirmed facts, internal policies, known disagreements between the founders — so the evaluator had the same information an honest advisor would have. I always compared against prior versions, not just absolute quality, so regressions would get caught. I always demanded numerical scores, because numbers force specificity where adjectives allow drift. And I never asked “is this good?” I asked “score these seven dimensions and identify the three highest-priority fixes.”
The seven-dimension scoring rubric never changed across five rounds. Everything else did. The rubric was the stabilizing constraint — the fixed frame that made each round’s feedback comparable to the last, and made convergence measurable rather than felt.
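Here is a skeleton of that kind of evaluation prompt, written as a Python template so the variable slots are explicit. The wording and the dimension names are illustrative, not the exact prompt from these five rounds; substitute your own rubric.

```python
# Illustrative skeleton of a contextual adversarial-evaluation prompt.
# Dimension names and wording are placeholders, not the prompt actually used.
EVALUATION_PROMPT = """\
You are an adversarial reviewer for an early-stage investor deck. Your job is to
find the reasons this fails in the room, not to be encouraging.

Context to treat as ground truth: {company_facts}
Constraints the founders have already decided and will not revisit: {constraints}

Versions of the deck, oldest first:
A: {version_a}
B: {version_b}
C: {version_c}

1. Score each version 1-10 on each dimension: problem framing, claim validation,
   market sizing, financial model, traction evidence, ask terms, clarity.
2. Flag any dimension where the newest version regressed against a prior one.
3. List the three highest-priority fixes, each tied to a specific slide and line.

Do not tell me whether the deck is "good." Give numbers and fixes.
"""

prompt = EVALUATION_PROMPT.format(
    company_facts="...", constraints="...",
    version_a="...", version_b="...", version_c="...",
)
```

The fixed rubric is what makes round-over-round scores comparable; the adversarial framing is what keeps the evaluator from drifting into encouragement.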
The Insight
Adversarial Hardening is a workflow primitive in this system — not a technique I applied once, but a structure that made every subsequent round produce better output than the last.
The models didn’t drive the result. The separation did. When one model generates and refines its own work, you get coherent mediocrity — everything fits together, nothing gets pressure-tested, and the output is exactly as good as the model’s blind spots allow.
The separation only worked because the prompt forced scoring, comparison, and prioritization. A prompt that includes the specific artifact, prior versions, the author’s stated constraints, a structured rubric, and explicit adversarial framing produces feedback specific enough to act on.
3 to 8 was structural. 8 to 9.4 was precision. Each round was diminishing returns on quality but increasing returns on confidence. By round 5, a hostile evaluator with structured criteria and full context couldn’t find material issues. That’s a different kind of “done” than “I think this looks good.”
The counterfactual is specific. Without the adversarial loop, I would have shipped Claude’s round-1 rewrite — the 8/10 version. It was dramatically better than the original. The claims were cleaner. The structure was sound. And it still had unvalidated language, missing ask terms, and a financial model that couldn’t survive investor scrutiny. The 8/10 deck gets a polite meeting. The 9.4/10 deck gets a second one.
Adversarial Hardening is a session pattern with specific requirements — the builder never evaluates its own work, the evaluator gets full context and structured criteria, and the loop runs until the evaluator runs out of material objections.
The Honest Part
This worked for a pitch deck — a document with clear success criteria, a well-understood audience, and objective dimensions to score against. Whether it generalizes to artifacts with fuzzier quality criteria is an open question.
The scoring rubric made the feedback actionable. But the rubric itself was something I designed — choosing the seven dimensions, weighting them, deciding what constitutes a “material objection.” If the rubric is wrong, the loop converges on the wrong target. Adversarial Hardening hardens against the criteria you give it. It doesn’t tell you whether those criteria are the right ones.
The 3-to-9.4 arc also compressed a specific kind of work: taking existing knowledge and structuring it for a specific audience. The company facts existed. The strategy existed. The product existed. What didn’t exist was a tight presentation of those things. This loop compressed refinement. It didn’t generate new knowledge. Whether the same pattern works for building something genuinely new — where the evaluator can’t check claims against known facts because the facts don’t exist yet — is untested.
And the adversary wasn’t always right. ChatGPT pushed back on the “AI-as-condiment” positioning — arguing that angel investors in 2026 want to see “AI” front and center, not buried. That was generic investor-deck advice, not ours. Our positioning constraint existed for specific reasons, and the evaluator didn’t have the context to know why. I discarded the critique. Several others got filtered the same way — feedback that reflected best practices for a general pitch deck rather than the specific constraints we’d already decided on.
The human in the loop did real work. I wasn’t just copying and pasting between two models. I was reading ChatGPT’s feedback, deciding which critiques were valid, filtering out the generic ones, and translating the valid ones into revision instructions for Claude. The operator’s judgment is the quality function between the two models. If you remove that — if you automate the loop and let the models negotiate directly — you might get convergence, but you lose the judgment about which convergence matters.
What This Is Actually About
Prior case studies in this series deposited specific artifacts: a constraints template, a decision log pattern, a multi-tool orchestration protocol. This case study adds one more: the Adversarial Hardening prompt — a reusable evaluation structure where a contextual rubric, version comparison, and adversarial framing produce feedback that actually moves a score.
In this run, the decisive use of AI wasn’t producing the deck. It was pressure-testing it. That’s a different use case than most practitioners have built workflows for — and it’s the one that moved the score.
Systems that can’t tolerate error separate creation from approval. The engineer who writes the code doesn’t approve the pull request. The architect who designs the structure doesn’t certify the load calculations. Adversarial Hardening applies the same principle to AI workflows — and most AI workflows don’t have it.
The prompt is the artifact that made the loop transferable. The seven-dimension rubric, the version-comparison requirement, the “top three priority fixes” constraint on output — those transfer to any high-stakes deliverable. Strategy documents. Product specs. Legal agreements. Course modules. Anything where “I think this is good” isn’t a sufficient quality standard.
The deck went from 3 to 9.4. Not because AI is smart. Because agreement was structurally disallowed — and quality followed.
Case Study Insight: The highest-leverage AI pattern isn’t generation — it’s structured adversarial evaluation. When the builder and the critic are architecturally separated, quality converges faster than any single-model workflow allows.
Robert Ford builds products, writes stories and essays, and publishes The Intelligence Engine — a Substack about building AI practices that compound. His other writing lives at Brittle Views.


