How We Stopped Guessing: Hypothesis-Driven Development for AI Tools


We shipped a feature we were proud of. It was elegant, well-reasoned, and obviously helpful: before reading any code file, the AI should preview its structure first. See the functions, classes, and types at a glance. Then decide what to read.

We called it mandatory structure preview. It made perfect sense. Every senior developer does this instinctively—you don’t read a 500-line file top to bottom when you only need one function.

Then we measured it. Across 200 benchmark runs, mandatory structure preview made tasks 4x slower than reading files directly.

Our best idea was our worst feature.

The Vibes Problem

AI developer tools have a measurement problem. Most improvements are evaluated by feel:

  • “It seems faster”
  • “The responses feel more accurate”
  • “I think it’s using fewer tokens”

This isn’t laziness. Measuring AI tool effectiveness is genuinely hard. Tasks vary wildly in complexity. The same task can take 15 seconds or 90 seconds depending on which path the model explores. Session-to-session variance makes any single comparison meaningless.

So teams ship features that make intuitive sense, write blog posts about theoretical benefits, and never discover that their “optimization” is actually a regression.

We were that team. Then we built a system to stop guessing.

The Approach: Every Change Is a Hypothesis

We adopted a simple rule: no prompt change ships without benchmark evidence.

Every modification to how Gabb’s MCP tools guide the AI—the skill instructions, the tool descriptions, the usage guidance—follows the same process:

  1. State the hypothesis. What specific change will improve performance, and why?
  2. Define the expected outcome. Which metric should move, and in which direction?
  3. Pick a target task. A SWE-bench task we expect to improve.
  4. Pick a control task. A task we expect to be unaffected.
  5. Run 20 times. Each condition, same task, fresh checkout.
  6. Analyze. Statistical significance, not gut feeling.

If the target improves and the control holds, we widen to 40 tasks. If the wider set holds, we merge. If anything regresses, we investigate.

This sounds heavy. In practice, it takes about an hour to get results for a single hypothesis—most of that is automated benchmark runtime. The analysis takes minutes. The discipline saves weeks of wrong-direction work.
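The six-step record above is easy to capture as a small data structure. This is an illustrative sketch, not Gabb's actual schema; the field names and the SWE-bench task IDs are made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical experiment record mirroring the six steps above.
# Field names and task IDs are illustrative, not Gabb's real schema.
@dataclass
class Hypothesis:
    statement: str            # 1. what change, and why it should help
    expected_metric: str      # 2. which metric should move
    direction: str            # 2. "down" for time, "up" for success rate
    target_task: str          # 3. SWE-bench task expected to improve
    control_task: str         # 4. task expected to be unaffected
    runs_per_condition: int = 20  # 5. fresh checkout each run
    results: list = field(default_factory=list)  # 6. raw data for analysis

h = Hypothesis(
    statement="Conditional structure preview cuts navigation time",
    expected_metric="avg_task_time_s",
    direction="down",
    target_task="django__django-11999",
    control_task="sympy__sympy-13480",
)
```

Writing the prediction down before running anything is the point: the record exists before the data does, so there is nothing to rationalize after the fact.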

Five Hypotheses That Shaped Gabb

Here’s what this process actually looks like, through five real experiments.

Hypothesis 1: Mandatory Structure Preview

The idea: Force the AI to call gabb_structure before every file read.

Why it seemed right: Previewing file structure should help the AI read only the relevant sections, saving tokens and time.

What happened: On January 2nd, we ran 200 benchmarks (20 tasks × 10 runs). The data showed something unexpected:

Navigation Pattern         Avg Time   Frequency
Direct read (no preview)   14.8s      44% of runs
Structure then read        59.1s      22% of runs

Structure-then-read was 4x slower than just reading the file. The preview call added network latency and decision overhead without actually reducing how much the AI read. For files under 200 lines—which is most files—the preview was pure waste.

The fix: Make structure preview conditional. Skip it for small files, known locations, and post-symbol-search reads. Use it only for large unknown files where targeted reading genuinely helps.
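The conditional rule can be sketched in a few lines. The 200-line threshold comes from the finding above, but the function signature and parameter names are assumptions about how such guidance might be encoded, not Gabb's actual implementation.

```python
# Illustrative sketch of the conditional preview rule; the threshold
# matches the finding above, but names and signature are assumptions.
LINE_THRESHOLD = 200  # below this, preview was pure overhead in our runs

def should_preview(file_lines: int, known_location: bool,
                   after_symbol_search: bool) -> bool:
    """Preview only large, unfamiliar files where targeted reading helps."""
    if file_lines < LINE_THRESHOLD:
        return False  # small file: just read it directly
    if known_location or after_symbol_search:
        return False  # we already know where to look
    return True       # large unknown file: structure preview pays off
```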

Result: The conditional version improved our overall speedup from 32.5% to 34.4%.

Lesson: Intuition is not evidence. What makes sense for humans (preview before reading) doesn’t necessarily make sense for an AI operating with different cost structures.

Hypothesis 2: Expose Symbol Search as a Tool

The idea: Give the AI a dedicated gabb_symbol tool for workspace-wide semantic symbol search.

Why it seemed right: Instead of grepping for function names (text search), let the AI search by symbol name (semantic search). “Find the function called update_proxy_model_permissions” should return one result, not 23 fuzzy matches.

What happened: This was our biggest win. On 15 symbol-heavy tasks:

Control:  Grep "find_model" → 47 matches → Read 5 files → 55.3s
Gabb:     gabb_symbol "find_model" → 1 exact match → Read 1 file → 21.1s

Average speedup on symbol-heavy tasks: 43.2%. One task—finding a specific migration handler in Django—achieved a 73.9% speedup.

Lesson: The biggest gains come from eliminating entire categories of work, not from making existing work slightly faster.

Hypothesis 3: Anti-Patterns for Simple Tasks

The idea: Add explicit guidance telling the AI not to explore when the task is simple.

Why it seemed right: We noticed the AI sometimes ran 10+ search calls for tasks like fixing a typo. It was exploring because the instructions encouraged exploration, regardless of whether the task warranted it.

What happened: Before the anti-pattern guidance:

Task: Fix typo → 10 Bash calls, 8 Grep calls, 9 Read calls → 81.4s

After:

Task: Fix typo → Read known file → 22.8s

Success rate improved from 94% to 95%. The AI became better at matching effort to task complexity.

Lesson: Guidance about when not to use a tool is as important as guidance about when to use it.

Hypothesis 4: String-Search Exemption

The idea: Skip gabb_structure when the AI is searching for string literals (error messages, log text, comments).

Why it seemed right: Semantic indexing is great for finding symbols—functions, classes, types. But if you’re looking for the string “Connection refused” in an error handler, you need grep, not symbol search.
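One way to express that boundary is a simple routing heuristic: bare identifiers go to symbol search, anything else goes to text search. The regex and function names here are illustrative, not how Gabb actually implements the exemption.

```python
import re

# Illustrative router; the identifier pattern is an assumption.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def route(query: str) -> str:
    """Send bare identifiers to symbol search, everything else to grep."""
    return "gabb_symbol" if IDENT.fullmatch(query) else "grep"
```

A query like "Connection refused" contains a space, so it fails the identifier pattern and falls through to grep.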

What happened: 23% time improvement, 100% success rate maintained. The AI correctly used grep for text searches and Gabb for symbol searches.

Lesson: Know the boundaries of your tool’s value. Not everything is a nail.

Hypothesis 5: Parallel Exploration (Disproven)

The idea: Let the AI explore code structure in parallel with starting implementation, using subagent tasks.

Why it seemed right: Why wait for exploration to finish before writing code? Start implementing based on what you know, and incorporate additional context as it arrives.

What happened: It regressed. The anti-pattern guidance from Hypothesis 3 inadvertently discouraged the task/subagent system entirely, preventing legitimate parallel work. We reverted, then fixed the anti-pattern guidance to be more precise.

Lesson: Disproven hypotheses reveal interaction effects you can’t predict by reasoning alone. This failure taught us that guidance for one feature can silently break another. We would never have caught this without measurement.

Why Disproven Hypotheses Matter

It’s tempting to only share successes. But in our system, disproven hypotheses are first-class results:

  • They prevent revisiting dead ends
  • They document interaction effects between features
  • They build institutional knowledge about what doesn’t work and why

We track disproven hypotheses with the same rigor as proven ones. Each gets a GitHub issue, a PR with implementation, benchmark results, and a clear conclusion. The next engineer considering a similar change can see exactly why it was tried and why it failed.

In three months, we’ve proven 6 hypotheses, disproven 2, and reverted 1. That’s a 67% hit rate—meaning a third of our “good ideas” turned out to be wrong. Without measurement, all 9 would have shipped.

The Statistical Foundation

Good data requires good methodology. Every benchmark run follows the same protocol:

Sample size: 20 runs per condition per task. This gives us enough statistical power to detect meaningful effects while keeping runtime practical.

Significance testing: Welch’s t-test for time comparisons, chi-square for success rate differences. We require p < 0.05 to call a result significant.

Effect size: We report Cohen’s d alongside p-values. A statistically significant result that only saves 0.5 seconds isn’t worth shipping. Our primary benchmark shows d = 0.38 (small-to-medium effect), meaning the improvement is consistent enough to matter across diverse tasks.
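Both statistics are straightforward to compute. Here is a minimal pure-Python sketch with invented timing data; it computes the Welch t statistic and Cohen's d only (the p-value requires a t-distribution CDF, which the real pipeline would get from a stats library).

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Invented task times (seconds) per condition, for illustration only
control   = [55.3, 58.1, 52.9, 61.0, 54.4, 57.2, 59.8, 53.6, 56.5, 60.2]
treatment = [41.0, 44.3, 39.8, 46.1, 42.5, 40.7, 45.2, 43.9, 38.6, 42.1]
```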

Isolation: Every run starts from a fresh git checkout. No cached results, no cross-contamination between runs. Each run is independent.

Progressive widening: We don’t run 400 benchmarks for every hypothesis. Start with 1 task (20 runs). If it looks promising, widen to the control task. If the control holds, widen to the full 40-task suite. This keeps iteration fast while ensuring rigor for changes that reach production.
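The widening gate itself is a short loop. This is a hypothetical sketch; the stage list and the run_stage callback are assumptions about how such a gate could be wired up.

```python
# Hypothetical widening gate; stage names and the run_stage
# callback are illustrative, not Gabb's actual pipeline.
STAGES = [("target task", 1), ("plus control", 2), ("full suite", 40)]

def widen(run_stage):
    """Advance only while each stage passes; stop at the first failure."""
    for name, n_tasks in STAGES:
        if not run_stage(name, n_tasks):  # returns True if stage passes
            return f"stopped at {name}"
    return "merge"
```

Most hypotheses die cheaply at stage one; only changes that survive every stage earn the full 400-run bill.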

What This Looks Like in Practice

Our benchmark progression tells the story of the methodology in action:

Date     Scope               Speedup   What Changed
Jan 1    1 task, 50 runs     50%       Initial measurement (single task, unreliable)
Jan 2    20 tasks, 200 runs  3%        Discovered structure preview regression
Jan 5    10 tasks, 100 runs  23%       Fixed conditional guidance
Jan 6    20 tasks, 200 runs  14%       Identified over-exploration regressions
Jan 11   40 tasks, 400 runs  32.5%     Full suite, statistically significant (p < 0.001)
Jan 26   20 tasks, 200 runs  34.4%     Conditional structure validated

The first measurement (50% speedup on one task) was wildly optimistic. The second measurement (3% on twenty tasks) revealed that our feature was actually hurting most tasks. Only by measuring broadly and iterating did we arrive at a genuine, reproducible 32-34% improvement.

If we’d stopped at the first measurement, we’d have shipped a product that made most tasks slower while believing we’d achieved a 50% speedup. See our benchmark results for the current numbers.

The Meta-Lesson

Building tools for AI is different from building tools for humans. With human-facing tools, you can observe behavior, collect feedback, iterate on UX. With AI-facing tools, the “user” is a language model whose behavior changes based on subtle wording differences in prompts.

Consider: changing “you MUST call gabb_structure before reading any file” to “consider calling gabb_structure for large, unfamiliar files” produced a measurable performance difference. The tool itself didn’t change. The code didn’t change. A few words in a prompt changed how an AI navigated code.

This makes intuition unreliable. A change that seems obviously beneficial can regress performance. A change that seems too subtle to matter can produce the biggest improvement. The only reliable signal is measurement.

Applying This to Your Work

You don’t need SWE-bench to adopt this approach. The core principles apply to any AI tool development:

Measure before and after. Pick 5-10 representative tasks. Run each 3-5 times. Compare medians. It’s not rigorous enough for a paper, but it’s rigorous enough to catch regressions.
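That lightweight protocol fits in a few lines. The task names and timings below are invented for illustration.

```python
from statistics import median

# Invented before/after run times (seconds) per task, 3 runs each.
before = {"fix-typo": [81.4, 77.0, 85.2], "add-flag": [30.1, 28.8, 33.0]}
after  = {"fix-typo": [22.8, 25.1, 21.4], "add-flag": [29.5, 31.2, 28.0]}

for task in before:
    # Compare medians, not means: a single slow run won't skew the verdict.
    delta = median(after[task]) - median(before[task])
    print(f"{task}: {delta:+.1f}s")
```

Even this crude comparison would have caught our structure-preview regression on day one.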

State your hypothesis before testing. This prevents post-hoc rationalization. “We changed X and something got better” is weaker than “We predicted X would improve Y because of Z, and it did.”

Track failures. Keep a log of changes that didn’t work. Future-you will thank past-you for not wasting time re-exploring dead ends.

Test at realistic scale. One cherry-picked example proves nothing. Five diverse examples prove something. Forty diverse examples prove quite a lot.

Be suspicious of big numbers. Our initial 50% speedup was real—for one task. The true improvement across diverse tasks was 32%. Still excellent, but honest.

What’s Next

We’re continuing to iterate. Current hypotheses under investigation include:

  • Whether call graph information (callers/callees) improves multi-file bug fixes
  • Whether proactive context injection (showing related files before the AI asks) reduces search time
  • Whether different guidance strategies work better for different task complexities

Each will go through the same process: hypothesis, prediction, measurement, conclusion.

The days of shipping AI tool changes on vibes are over—at least for us. The methodology adds a small amount of overhead and eliminates a large amount of wasted effort. It’s the most productive process change we’ve made.

Your AI tools are making promises. Are you measuring whether they keep them?


Gabb uses hypothesis-driven development to validate every change to its AI integration. The full methodology, benchmark suite, and historical results are available in our GitHub repository. Our hypothesis tracking uses GitHub Issues—search for the hypothesis label to see every experiment we’ve run.