Benchmark Results
Measuring the impact of semantic code indexing on AI-assisted navigation
Executive Summary
Key Findings
- Gabb excels on moderate-to-complex tasks (20-100s baseline) with 40%+ average speedup
- Median improvement (46.5%) exceeds mean (32.5%), indicating consistent impact on typical tasks
- Minor regressions on simple tasks (<20s) where control was already optimal
- Token increase of 9.4% is cost-neutral due to efficient prompt caching
Primary Metrics
40 SWE-bench lite tasks, 10 runs per task, 800 total runs
| Metric | Control | With Gabb | Difference | Significance |
|---|---|---|---|---|
| Success Rate | 94.0% | 95.0% | +1.0% | - |
| Wall Time | 45.4s ± 43.3s | 30.6s ± 34.9s | -32.5% | *** |
| Total Tokens | 79,313 ± 116,402 | 86,748 ± 79,347 | +9.4% | ns |
| Cost (USD) | $0.040 ± $0.060 | $0.040 ± $0.040 | 0% | - |
Statistical Analysis
Wall Time
Time Distribution
Time Distribution Shift
Gabb shifts 78% of runs into the <30s bucket (vs 46% for control)
| Time Bucket | Control | With Gabb | Shift |
|---|---|---|---|
| <15s | 93 (23.2%) | 147 (36.8%) | +13.6% |
| 15-30s | 92 (23.0%) | 165 (41.2%) | +18.2% |
| 30-60s | 111 (27.8%) | 35 (8.8%) | -19.0% |
| 60-120s | 86 (21.5%) | 39 (9.8%) | -11.7% |
| >120s | 18 (4.5%) | 14 (3.5%) | -1.0% |
Tool Usage Patterns
How Gabb changes the way Claude navigates code
| Tool | Control (avg) | With Gabb (avg) | Reduction |
|---|---|---|---|
| Read | 5.1 | 3.2 | -37.1% |
| Grep | 3.9 | 2.4 | -38.3% |
| Bash | 3.9 | 1.4 | -65.2% |
| Glob | 1.1 | 0.5 | -58.3% |
| Task (subagent) | 0.6 | 0.2 | -67.3% |
| gabb_symbol | - | 0.5 | new |
| gabb_structure | - | 0.3 | new |
Behavioral Pattern Analysis
Control Patterns
75% of runs stuck in slow grep→read cycle
Gabb Patterns
83% using fast direct or symbol navigation
Task Classification Analysis
By Baseline Complexity
Gabb provides most value on moderate-to-complex tasks (20-100s baseline)
| Complexity | Tasks | Avg Speedup | Notes |
|---|---|---|---|
| Simple (<20s) | 13 | 1.9% | Near break-even |
| Moderate (20-50s) | 13 | 41.7% | Sweet spot |
| Complex (50-100s) | 12 | 39.9% | Strong improvement |
| Very Complex (>100s) | 2 | 18.6% | Moderate improvement |
By Gabb Usage Pattern
| Pattern | Tasks | Avg Speedup | Description |
|---|---|---|---|
| Symbol-heavy | 15 | 43.2% | Used gabb_symbol effectively |
| Structure-heavy | 5 | 12.6% | Relied on gabb_structure |
| Minimal gabb | 20 | 18.6% | Little/no gabb tool usage |
Improvement Distribution
| Category | Tasks | Percentage |
|---|---|---|
| Major improvement (≥50%) | 12 | 30% |
| Moderate improvement (20-50%) | 8 | 20% |
| Minor improvement (0-20%) | 15 | 37.5% |
| Regression (<0%) | 5 | 12.5% |
Deep Dive: Results by Task
Top Improvements
| Task | Control | Gabb | Speedup | Pattern |
|---|---|---|---|---|
| astropy__astropy-7746 | 76.9s | 18.6s | +75.8% | Minimal gabb |
| django__django-11283 | 67.4s | 17.6s | +73.9% | Symbol-heavy |
| django__django-12184 | 81.4s | 22.8s | +72.0% | Minimal gabb |
| django__django-12284 | 66.8s | 19.8s | +70.3% | Symbol-heavy |
| django__django-11999 | 61.0s | 19.0s | +68.8% | Symbol-heavy |
| django__django-11422 | 37.9s | 15.1s | +60.1% | Symbol-heavy |
| django__django-11905 | 33.2s | 13.5s | +59.3% | Symbol-heavy |
| django__django-12286 | 35.6s | 14.6s | +59.1% | Minimal gabb |
| django__django-12308 | 54.1s | 23.2s | +57.1% | Symbol-heavy |
| astropy__astropy-14995 | 41.8s | 18.6s | +55.6% | Symbol-heavy |
Case Study: astropy-7746 (75.8% Speedup)
Control (76.9s avg)
- Bash: 10.7 calls
- Grep: 8.9 calls
- Read: 9.6 calls
- Glob: 1.9 calls
- Task: 1.0 calls
Gabb (18.6s avg)
- Read: 2.1 calls only
Control required extensive exploratory searching (8.9 Grep, 10.7 Bash commands) while Gabb solved it with just 2.1 Read calls on average. Semantic navigation eliminated wasteful exploration.
Regression Analysis
All 5 regression tasks were simple (<20s baseline) where control was already optimal
| Task | Control | Gabb | Regression | Cause |
|---|---|---|---|---|
| django__django-12453 | 11.8s | 14.4s | -21.9% | Unnecessary Grep added |
| django__django-11583 | 14.3s | 17.3s | -21.0% | Extra exploration overhead |
| django__django-12589 | 11.7s | 12.9s | -11.0% | Trivial single-file task |
| astropy__astropy-6938 | 12.2s | 12.5s | -2.5% | Minimal overhead |
| django__django-11742 | 70.5s | 71.8s | -1.9% | Structure-heavy approach |
Methodology
SWE-bench Tasks
40 tasks derived from real GitHub issues in popular open-source projects (Django, Astropy).
A/B Testing
Each task runs under two conditions: standard Claude Code (control) and Claude Code with Gabb enabled.
Statistical Rigor
400 runs per condition with Welch's t-test. Results significant at p < 0.001 with 95% confidence intervals.
Isolated Execution
Each task runs in a fresh git checkout. Gabb daemon indexes the workspace before task execution.
Benchmark Details
| Type | Suite (40 SWE-bench lite tasks) |
| Runs per task | 10 |
| Total runs | 800 (400 per condition) |
| Date | January 11, 2026 |
| Commit | 554bfda |
Run Your Own Benchmark
cd benchmark/claude-code
# Run a single SWE-bench task
python run.py --swe-bench django__django-11179 --runs 5
# Run the full benchmark suite
python run.py --swe-bench-suite --limit 40 --runs 10
# Analyze results
python analyze.py --latest --markdown Full benchmark code available in the gabb-cli repository .