33% Faster Task completion time

42% Fewer Tool calls

88% Improved Tasks faster with Gabb®

32.5% Time improvement p < 0.001

95% Success rate +1% over control

87.5% Tasks improved 35/40 tasks faster

$0 Cost increase Identical API costs

Key Findings

Gabb excels on moderate-to-complex tasks (20-100s baseline) with 40%+ average speedup
Median improvement (46.5%) exceeds mean (32.5%), indicating consistent impact on typical tasks
Minor regressions on simple tasks (<20s) where control was already optimal
Token increase of 9.4% is cost-neutral due to efficient prompt caching

40 SWE-bench lite tasks, 10 runs per task, 800 total runs

Metric	Control	With Gabb	Difference	Significance
Success Rate	94.0%	95.0%	+1.0%	-
Wall Time	45.4s ± 43.3s	30.6s ± 34.9s	-32.5%	***
Total Tokens	79,313 ± 116,402	86,748 ± 79,347	+9.4%	ns
Cost (USD)	$0.040 ± $0.060	$0.040 ± $0.040	0%	-

Statistical Analysis

Wall Time

Absolute improvement 14.8s

t-statistic 5.31

95% Confidence Interval [9.2s, 20.3s]

Cohen's d 0.38 (small-medium)

p-value < 0.001

Time Distribution

Minimum 10.0s → 9.8s (-2%)

Q1 (25th) 15.2s → 13.8s (-10%)

Median 32.6s → 17.4s (-47%)

Q3 (75th) 61.1s → 25.9s (-58%)

Maximum 300s → 300s (0%)

Time Distribution Shift

Gabb shifts 78% of runs into the <30s bucket (vs 46% for control)

Time Bucket	Control	With Gabb	Shift
<15s	93 (23.2%)	147 (36.8%)	+13.6%
15-30s	92 (23.0%)	165 (41.2%)	+18.2%
30-60s	111 (27.8%)	35 (8.8%)	-19.0%
60-120s	86 (21.5%)	39 (9.8%)	-11.7%
>120s	18 (4.5%)	14 (3.5%)	-1.0%

How Gabb changes the way Claude navigates code

Tool	Control (avg)	With Gabb (avg)	Reduction
Read	5.1	3.2	-37.1%
Grep	3.9	2.4	-38.3%
Bash	3.9	1.4	-65.2%
Glob	1.1	0.5	-58.3%
Task (subagent)	0.6	0.2	-67.3%
gabb_symbol	-	0.5	new
gabb_structure	-	0.3	new

Behavioral Pattern Analysis

Control Patterns

grep_then_read 74.8% of runs, 55.3s avg

direct_read 19.5% of runs, 15.5s avg

search_only 5.8% of runs, 19.1s avg

75% of runs stuck in slow grep→read cycle

Gabb Patterns

direct_read 44.0% of runs, 25.3s avg

symbol_search 39.2% of runs, 21.1s avg

structure_then_read 12.5% of runs, 85.4s avg

solved_from_prompt 4.2% of runs, 13.4s avg

83% using fast direct or symbol navigation

By Baseline Complexity

Gabb provides most value on moderate-to-complex tasks (20-100s baseline)

Complexity	Tasks	Avg Speedup	Notes
Simple (<20s)	13	1.9%	Near break-even
Moderate (20-50s)	13	41.7%	Sweet spot
Complex (50-100s)	12	39.9%	Strong improvement
Very Complex (>100s)	2	18.6%	Moderate improvement

By Gabb Usage Pattern

Pattern	Tasks	Avg Speedup	Description
Symbol-heavy	15	43.2%	Used gabb_symbol effectively
Structure-heavy	5	12.6%	Relied on gabb_structure
Minimal gabb	20	18.6%	Little/no gabb tool usage

Improvement Distribution

Category	Tasks	Percentage
Major improvement (≥50%)	12	30%
Moderate improvement (20-50%)	8	20%
Minor improvement (0-20%)	15	37.5%
Regression (<0%)	5	12.5%

Top Improvements

Task	Control	Gabb	Speedup	Pattern
astropy__astropy-7746	76.9s	18.6s	+75.8%	Minimal gabb
django__django-11283	67.4s	17.6s	+73.9%	Symbol-heavy
django__django-12184	81.4s	22.8s	+72.0%	Minimal gabb
django__django-12284	66.8s	19.8s	+70.3%	Symbol-heavy
django__django-11999	61.0s	19.0s	+68.8%	Symbol-heavy
django__django-11422	37.9s	15.1s	+60.1%	Symbol-heavy
django__django-11905	33.2s	13.5s	+59.3%	Symbol-heavy
django__django-12286	35.6s	14.6s	+59.1%	Minimal gabb
django__django-12308	54.1s	23.2s	+57.1%	Symbol-heavy
astropy__astropy-14995	41.8s	18.6s	+55.6%	Symbol-heavy

Case Study: astropy-7746 (75.8% Speedup)

Control (76.9s avg)

Bash: 10.7 calls
Grep: 8.9 calls
Read: 9.6 calls
Glob: 1.9 calls
Task: 1.0 calls

Gabb (18.6s avg)

Read: 2.1 calls only

Control required extensive exploratory searching (8.9 Grep, 10.7 Bash commands) while Gabb solved it with just 2.1 Read calls on average. Semantic navigation eliminated wasteful exploration.

Regression Analysis

All 5 regression tasks were simple (<20s baseline) where control was already optimal

Task	Control	Gabb	Regression	Cause
django__django-12453	11.8s	14.4s	-21.9%	Unnecessary Grep added
django__django-11583	14.3s	17.3s	-21.0%	Extra exploration overhead
django__django-12589	11.7s	12.9s	-11.0%	Trivial single-file task
astropy__astropy-6938	12.2s	12.5s	-2.5%	Minimal overhead
django__django-11742	70.5s	71.8s	-1.9%	Structure-heavy approach

SWE-bench Tasks

40 tasks derived from real GitHub issues in popular open-source projects (Django, Astropy).

A/B Testing

Each task runs under two conditions: standard Claude Code (control) and Claude Code with Gabb enabled.

Statistical Rigor

400 runs per condition with Welch's t-test. Results significant at p < 0.001 with 95% confidence intervals.

Isolated Execution

Each task runs in a fresh git checkout. Gabb daemon indexes the workspace before task execution.

Benchmark Details

Type	Suite (40 SWE-bench lite tasks)
Runs per task	10
Total runs	800 (400 per condition)
Date	January 11, 2026
Commit	554bfda

cd benchmark/claude-code

# Run a single SWE-bench task
python run.py --swe-bench django__django-11179 --runs 5

# Run the full benchmark suite
python run.py --swe-bench-suite --limit 40 --runs 10

# Analyze results
python analyze.py --latest --markdown

Full benchmark code available in the gabb-cli repository .

Benchmark Results

Executive Summary