  • 33% faster task completion time
  • 42% fewer tool calls
  • 88% of tasks faster with Gabb

Executive Summary

  • 32.5% time improvement (p < 0.001)
  • 95% success rate (+1% over control)
  • 87.5% of tasks improved (35/40 tasks faster)
  • $0 cost increase (identical API costs)

Key Findings

  • Gabb excels on moderate-to-complex tasks (20-100s baseline) with 40%+ average speedup
  • Median improvement (46.5%) exceeds mean (32.5%), indicating consistent impact on typical tasks
  • Minor regressions, mostly on simple tasks (<20s) where control was already optimal
  • Token increase of 9.4% is cost-neutral due to efficient prompt caching

Primary Metrics

40 SWE-bench Lite tasks, 10 runs per task per condition, 800 total runs

| Metric | Control | With Gabb | Difference | Significance |
|---|---|---|---|---|
| Success Rate | 94.0% | 95.0% | +1.0% | - |
| Wall Time | 45.4s ± 43.3s | 30.6s ± 34.9s | -32.5% | *** |
| Total Tokens | 79,313 ± 116,402 | 86,748 ± 79,347 | +9.4% | ns |
| Cost (USD) | $0.040 ± $0.060 | $0.040 ± $0.040 | 0% | - |

Statistical Analysis

Wall Time

| Statistic | Value |
|---|---|
| Absolute improvement | 14.8s |
| t-statistic | 5.31 |
| 95% confidence interval | [9.2s, 20.3s] |
| Cohen's d | 0.38 (small-medium) |
| p-value | < 0.001 |
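These statistics can be recomputed from raw per-run wall times. A minimal sketch, using only the standard library and a large-sample normal approximation for the p-value; the actual analyze.py implementation may differ:

```python
import math
from statistics import NormalDist

def compare_runs(control, gabb):
    """Welch's t statistic, Cohen's d, and a 95% CI for the mean difference.

    Sketch only: with 400 runs per condition, the normal approximation to
    the t distribution is adequate for the p-value and interval.
    """
    n1, n2 = len(control), len(gabb)
    m1, m2 = sum(control) / n1, sum(gabb) / n2
    v1 = sum((x - m1) ** 2 for x in control) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in gabb) / (n2 - 1)
    diff = m1 - m2                       # positive => Gabb is faster
    se = math.sqrt(v1 / n1 + v2 / n2)    # Welch standard error
    t = diff / se
    # Two-sided p-value via the large-sample normal approximation
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    # Cohen's d with a pooled standard deviation
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = diff / pooled_sd
    ci = (diff - 1.96 * se, diff + 1.96 * se)  # 95% CI for the difference
    return {"t": t, "p": p, "diff": diff, "d": d, "ci": ci}
```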

Time Distribution

| Statistic | Control | With Gabb | Change |
|---|---|---|---|
| Minimum | 10.0s | 9.8s | -2% |
| Q1 (25th) | 15.2s | 13.8s | -10% |
| Median | 32.6s | 17.4s | -47% |
| Q3 (75th) | 61.1s | 25.9s | -58% |
| Maximum | 300s | 300s | 0% |
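The quantiles above can be recomputed from raw timings with the standard library. A sketch; note that `statistics.quantiles` uses the exclusive method by default, which may differ from the percentile rule the report used:

```python
from statistics import quantiles

def five_point_summary(times):
    """Min, Q1, median, Q3, and max of a list of wall times (seconds)."""
    q1, med, q3 = quantiles(times, n=4)  # exclusive method by default
    return {"min": min(times), "q1": q1, "median": med, "q3": q3, "max": max(times)}
```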

Time Distribution Shift

Gabb shifts 78% of runs into the <30s bucket (vs 46% for control)

| Time Bucket | Control | With Gabb | Shift |
|---|---|---|---|
| <15s | 93 (23.2%) | 147 (36.8%) | +13.6% |
| 15-30s | 92 (23.0%) | 165 (41.2%) | +18.2% |
| 30-60s | 111 (27.8%) | 35 (8.8%) | -19.0% |
| 60-120s | 86 (21.5%) | 39 (9.8%) | -11.7% |
| >120s | 18 (4.5%) | 14 (3.5%) | -1.0% |
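The bucket shares are straightforward to recompute from raw wall times. A sketch; the boundaries follow the table, but whether a boundary value such as exactly 15s falls in the lower or upper bucket is an assumption:

```python
from collections import Counter

# Upper bounds and labels matching the table above. A run of exactly 15s
# lands in "15-30s" here; the report's boundary rule is an assumption.
BUCKETS = [(15, "<15s"), (30, "15-30s"), (60, "30-60s"),
           (120, "60-120s"), (float("inf"), ">120s")]

def bucket(seconds):
    """Return the label of the first bucket whose upper bound exceeds `seconds`."""
    for upper, label in BUCKETS:
        if seconds < upper:
            return label

def bucket_shares(times):
    """Fraction of runs falling in each bucket."""
    counts = Counter(bucket(t) for t in times)
    return {label: counts[label] / len(times) for _, label in BUCKETS}
```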

Tool Usage Patterns

How Gabb changes the way Claude navigates code

| Tool | Control (avg) | With Gabb (avg) | Reduction |
|---|---|---|---|
| Read | 5.1 | 3.2 | -37.1% |
| Grep | 3.9 | 2.4 | -38.3% |
| Bash | 3.9 | 1.4 | -65.2% |
| Glob | 1.1 | 0.5 | -58.3% |
| Task (subagent) | 0.6 | 0.2 | -67.3% |
| gabb_symbol | - | 0.5 | new |
| gabb_structure | - | 0.3 | new |

Behavioral Pattern Analysis

Control Patterns

| Pattern | Share of Runs | Avg Time |
|---|---|---|
| grep_then_read | 74.8% | 55.3s |
| direct_read | 19.5% | 15.5s |
| search_only | 5.8% | 19.1s |

75% of runs stuck in slow grep→read cycle

Gabb Patterns

| Pattern | Share of Runs | Avg Time |
|---|---|---|
| direct_read | 44.0% | 25.3s |
| symbol_search | 39.2% | 21.1s |
| structure_then_read | 12.5% | 85.4s |
| solved_from_prompt | 4.2% | 13.4s |

83% using fast direct or symbol navigation
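One way to assign a run to a pattern like those above is a precedence heuristic over its per-run tool counts. The pattern labels come from this report; the precedence order and thresholds below are illustrative guesses, not the harness's actual rules:

```python
def classify_run(tool_counts):
    """Map a run's tool-call counts to a behavioral pattern label.

    Hypothetical heuristic: pattern names match the report, but the
    precedence and thresholds are assumptions for illustration.
    """
    get = tool_counts.get
    if get("gabb_symbol", 0) > 0:
        return "symbol_search"
    if get("gabb_structure", 0) > 0:
        return "structure_then_read"
    if get("Grep", 0) > 0 and get("Read", 0) > 0:
        return "grep_then_read"
    if get("Read", 0) > 0:
        return "direct_read"
    if get("Grep", 0) > 0 or get("Glob", 0) > 0:
        return "search_only"
    return "solved_from_prompt"
```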

Task Classification Analysis

By Baseline Complexity

Gabb provides most value on moderate-to-complex tasks (20-100s baseline)

| Complexity | Tasks | Avg Speedup | Notes |
|---|---|---|---|
| Simple (<20s) | 13 | 1.9% | Near break-even |
| Moderate (20-50s) | 13 | 41.7% | Sweet spot |
| Complex (50-100s) | 12 | 39.9% | Strong improvement |
| Very Complex (>100s) | 2 | 18.6% | Moderate improvement |

By Gabb Usage Pattern

| Pattern | Tasks | Avg Speedup | Description |
|---|---|---|---|
| Symbol-heavy | 15 | 43.2% | Used gabb_symbol effectively |
| Structure-heavy | 5 | 12.6% | Relied on gabb_structure |
| Minimal gabb | 20 | 18.6% | Little or no gabb tool usage |

Improvement Distribution

| Category | Tasks | Percentage |
|---|---|---|
| Major improvement (≥50%) | 12 | 30% |
| Moderate improvement (20-50%) | 8 | 20% |
| Minor improvement (0-20%) | 15 | 37.5% |
| Regression (<0%) | 5 | 12.5% |
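The per-task speedups and categories above follow directly from mean wall times. A sketch of the arithmetic, assuming the report's band boundaries:

```python
def speedup_pct(control_s, gabb_s):
    """Percent improvement over control; positive means Gabb was faster."""
    return (control_s - gabb_s) / control_s * 100

def improvement_category(pct):
    """Band labels matching the distribution table above."""
    if pct >= 50:
        return "Major improvement"
    if pct >= 20:
        return "Moderate improvement"
    if pct >= 0:
        return "Minor improvement"
    return "Regression"
```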

Deep Dive: Results by Task

Top Improvements

| Task | Control | Gabb | Speedup | Pattern |
|---|---|---|---|---|
| astropy__astropy-7746 | 76.9s | 18.6s | +75.8% | Minimal gabb |
| django__django-11283 | 67.4s | 17.6s | +73.9% | Symbol-heavy |
| django__django-12184 | 81.4s | 22.8s | +72.0% | Minimal gabb |
| django__django-12284 | 66.8s | 19.8s | +70.3% | Symbol-heavy |
| django__django-11999 | 61.0s | 19.0s | +68.8% | Symbol-heavy |
| django__django-11422 | 37.9s | 15.1s | +60.1% | Symbol-heavy |
| django__django-11905 | 33.2s | 13.5s | +59.3% | Symbol-heavy |
| django__django-12286 | 35.6s | 14.6s | +59.1% | Minimal gabb |
| django__django-12308 | 54.1s | 23.2s | +57.1% | Symbol-heavy |
| astropy__astropy-14995 | 41.8s | 18.6s | +55.6% | Symbol-heavy |

Case Study: astropy-7746 (75.8% Speedup)

Control (76.9s avg)

  • Bash: 10.7 calls
  • Grep: 8.9 calls
  • Read: 9.6 calls
  • Glob: 1.9 calls
  • Task: 1.0 calls

Gabb (18.6s avg)

  • Read: 2.1 calls only

Control required extensive exploratory searching (8.9 Grep, 10.7 Bash commands) while Gabb solved it with just 2.1 Read calls on average. Semantic navigation eliminated wasteful exploration.

Regression Analysis

Four of the five regression tasks were simple (<20s baseline) where control was already optimal; the fifth (django__django-11742) regressed by under 2% with a structure-heavy approach

| Task | Control | Gabb | Regression | Cause |
|---|---|---|---|---|
| django__django-12453 | 11.8s | 14.4s | -21.9% | Unnecessary Grep added |
| django__django-11583 | 14.3s | 17.3s | -21.0% | Extra exploration overhead |
| django__django-12589 | 11.7s | 12.9s | -11.0% | Trivial single-file task |
| astropy__astropy-6938 | 12.2s | 12.5s | -2.5% | Minimal overhead |
| django__django-11742 | 70.5s | 71.8s | -1.9% | Structure-heavy approach |

Methodology

SWE-bench Tasks

40 tasks derived from real GitHub issues in popular open-source projects (Django, Astropy).

A/B Testing

Each task runs under two conditions: standard Claude Code (control) and Claude Code with Gabb enabled.

Statistical Rigor

400 runs per condition, compared with Welch's t-test. The wall-time improvement is significant at p < 0.001; 95% confidence intervals are reported throughout.

Isolated Execution

Each task runs in a fresh git checkout. Gabb daemon indexes the workspace before task execution.

Benchmark Details

Type Suite (40 SWE-bench Lite tasks)
Runs per task 10
Total runs 800 (400 per condition)
Date January 11, 2026
Commit 554bfda

Run Your Own Benchmark

```shell
cd benchmark/claude-code

# Run a single SWE-bench task
python run.py --swe-bench django__django-11179 --runs 5

# Run the full benchmark suite
python run.py --swe-bench-suite --limit 40 --runs 10

# Analyze results
python analyze.py --latest --markdown
```

Full benchmark code is available in the gabb-cli repository.