From 45 Seconds to 30: How Semantic Indexing Reduces AI Task Time
Claims about AI productivity tools are easy to make and hard to verify. When we built Gabb, we wanted to know—not believe, but know—whether semantic indexing actually improved AI assistant performance.
So we ran 800 benchmark runs (400 per condition) on real coding tasks and measured everything. This post presents what we found.
The Experiment
We used tasks from SWE-bench, the industry-standard benchmark for AI coding assistants. SWE-bench tasks are real GitHub issues from production open-source projects—not synthetic exercises, but actual bugs that real developers filed and fixed.
Our test suite included 40 diverse tasks from Django and Astropy, spanning bug fixes, feature implementations, and refactoring work. Each task ran 10 times in both conditions (with and without Gabb), giving us 400 total runs per condition to ensure statistical reliability.
Setup:
- Control condition: Standard Claude Code with built-in tools (Grep, Read, Glob, Bash)
- Gabb condition: Claude Code with Gabb’s semantic indexing enabled
- Isolation: Each task ran in a fresh git checkout to prevent cross-contamination
- Measurement: Tool calls, completion time, token usage, and success rate were logged for every run
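The logged measurements boil down to a small aggregation step. This is a minimal sketch assuming a hypothetical per-run record shape of `(condition, seconds, tool_calls)`; the real harness lives in `benchmark/claude-code/run.py` and may log differently:

```python
from statistics import mean

def summarize(results):
    """Group per-run measurements by condition and report per-condition means.

    `results` is a list of (condition, seconds, tool_calls) tuples,
    one per benchmark run.
    """
    by_cond = {}
    for cond, secs, calls in results:
        by_cond.setdefault(cond, []).append((secs, calls))
    return {
        cond: {
            "mean_seconds": mean(s for s, _ in runs),
            "mean_tool_calls": mean(c for _, c in runs),
        }
        for cond, runs in by_cond.items()
    }
```

The headline numbers in the next section are exactly these per-condition means.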
The Results
| Metric | Without Gabb | With Gabb | Improvement |
|---|---|---|---|
| Task Completion Time | 45.4s | 30.6s | 32% faster |
| Tool Calls per Task | 14.6 | 8.1 | 45% fewer |
| Success Rate | 94% | 95% | +1 pt |
| Cost per Task | $0.040 | $0.040 | Identical |
The time reduction is statistically significant (p < 0.001) with a 95% confidence interval of 9–20 seconds saved per task. Put differently: we can say with high confidence that Gabb saves real time on real tasks, not just on lucky runs.
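The interval above follows from the standard two-sample formula. A minimal sketch, using a normal approximation (reasonable at 400 runs per condition) and illustrative variance figures rather than the study's actual ones:

```python
import math

def diff_ci95(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """95% CI for the difference in means, normal approximation.

    The standard error combines the per-condition sample variances,
    Welch-style, so the two groups need not share a variance.
    """
    se = math.sqrt(var_a / n_a + var_b / n_b)
    diff = mean_a - mean_b
    return diff - 1.96 * se, diff + 1.96 * se

# Means from the table; the variances here are illustrative assumptions.
lo, hi = diff_ci95(45.4, 380.0, 400, 30.6, 260.0, 400)
```

With 400 runs per side, even noisy per-task times pin the mean difference down to a fairly narrow band.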
Where the Time Goes
The 32% improvement doesn’t come from any single optimization. It emerges from a fundamental shift in how the AI navigates code.
Tool Usage Breakdown
| Tool | Without Gabb | With Gabb | Reduction |
|---|---|---|---|
| Read | 5.1 calls | 3.2 calls | 37% |
| Grep | 3.9 calls | 2.4 calls | 38% |
| Bash | 3.9 calls | 1.4 calls | 64% |
| Glob | 1.1 calls | 0.5 calls | 55% |
The Gabb condition adds two new tools (gabb_symbol and gabb_structure), which together averaged 0.8 calls per task. These replace 6+ calls from the traditional search-read-search cycle.
Pattern Shift
The most revealing metric is how the AI approaches navigation:
Without Gabb:
- 74.8% of runs used a “grep then read” pattern
- Multiple searches to narrow down file candidates
- Frequent backtracking when wrong files were read
With Gabb:
- 44.0% of runs used “direct read” (already knew the file)
- 39.2% used “symbol search” (direct lookup by name)
- Minimal backtracking
The AI stops exploring and starts navigating.
A Concrete Example
Here’s how the same task plays out differently:
Task: Find and fix a permission bug in Django’s migration system
Without Gabb:
1. Grep for "permission" → 847 matches
2. Grep for "migration permission" → 23 matches
3. Read django/contrib/auth/management/__init__.py → partial match
4. Grep for "create_permissions" → 12 matches
5. Read django/contrib/auth/migrations/0001_initial.py → wrong file
6. Read django/contrib/auth/migrations/0011_update_proxy_permissions.py → found it
Time: ~52 seconds, 6 tool calls
With Gabb:
1. gabb_symbol("update_proxy_model_permissions") → django/contrib/auth/migrations/0011_update_proxy_permissions.py:5
2. Read django/contrib/auth/migrations/0011_update_proxy_permissions.py → confirm and fix
Time: ~28 seconds, 2 tool calls
Same result. Half the time.
Why Fewer Tool Calls Matter
Each tool call carries hidden costs:
- Network latency: Round-trip time to the API adds up
- Context consumption: Tool results consume tokens from the context window
- Decision overhead: The AI must process results and decide next steps
- Failure potential: Each call is a chance to go down the wrong path
A 45% reduction in tool calls isn’t just faster—it’s more reliable. Fewer opportunities for the AI to get confused or sidetracked.
The Cost Question
You might expect that adding a new tool would increase token usage and therefore cost. The data tells a more nuanced story:
| Metric | Without Gabb | With Gabb |
|---|---|---|
| Total Tokens | 79,313 | 86,748 |
| Cost per Task | $0.040 | $0.040 |
Token usage increased by about 9% (primarily from the skill instructions in the system prompt). But the cost stayed identical. Why?
Prompt caching. Modern AI APIs cache repeated content, and Gabb’s instructions are the same across all tasks. The cache hit rate fully offsets the additional tokens.
The 32% time savings comes at zero additional API cost.
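Back-of-the-envelope, the caching effect is easy to model. This sketch uses illustrative per-million-token prices and a 90% cache discount; the actual rates and discount depend on the provider and are assumptions here, not measured values:

```python
def task_cost(input_tokens, output_tokens, cached_fraction,
              in_price=3.00, out_price=15.00, cache_discount=0.9):
    """Rough per-task cost in USD.

    Prices are USD per million tokens and are illustrative assumptions.
    Cached input tokens are billed at (1 - cache_discount) of the
    normal input price; output tokens are never cached.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost_in = (fresh + cached * (1 - cache_discount)) * in_price / 1e6
    cost_out = output_tokens * out_price / 1e6
    return cost_in + cost_out
```

When most of the larger prompt is served from cache, a ~9% token increase can net out to an unchanged dollar cost.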
Statistical Confidence
Good data requires good methodology. Here’s how we ensured our results are trustworthy:
Sample size: 400 runs per condition gives 99% power to detect effects of our observed size.
Significance testing: Welch’s t-test confirms p < 0.001 for both the time and tool-call reductions. In other words, if there were no real effect, differences this large would appear less than 0.1% of the time by chance.
Effect size: Cohen’s d of 0.38–0.41 indicates a small-to-medium practical effect. The improvement is consistent enough to matter across diverse tasks, not just outliers.
Variance handling: Standard deviations are high (tasks vary widely in complexity), but the mean differences remain significant even accounting for this variance.
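For readers who want to sanity-check these statistics on their own data, both Welch's t-statistic and Cohen's d are a few lines of standard-library Python (the samples below are synthetic; `scipy.stats.ttest_ind(..., equal_var=False)` gives the matching p-value):

```python
import math

def welch_t(a, b):
    """Welch's t-statistic: difference in means over the combined standard error."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def cohens_d(a, b):
    """Cohen's d: difference in means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```

High per-task variance widens the standard error, which is why the large sample size matters: it shrinks `va / na + vb / nb` until the mean difference stands clear of the noise.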
What This Means for You
For Individual Developers
32% faster means you get answers while your question is still fresh. Instead of context-switching while the AI searches, you stay in flow. At roughly 15 seconds saved per task, a day of heavy AI usage adds up to real time reclaimed.
For Teams
The tool call reduction has second-order benefits:
- Lower rate-limit pressure on shared API keys
- Less noisy conversation logs
- More predictable response times
For Large Codebases
Our benchmark included repositories with hundreds of thousands of lines of code. The time savings scale—larger codebases don’t mean proportionally longer searches when the index already knows where everything is.
The Limitations
We believe in honest benchmarking. Here’s what this data doesn’t prove:
Language coverage: Our benchmark focused on Python codebases. Results may vary for TypeScript, Rust, or mixed-language projects, though we expect similar patterns.
Task types: SWE-bench tasks are bug fixes. Feature development, refactoring, and code review workflows weren’t directly measured.
Model dependency: These results are with Claude Sonnet. Other models may show different improvement ratios.
Setup cost: First-time indexing takes time (typically 30–60 seconds for a medium codebase). The benchmarks measure task time after indexing is complete.
Try It Yourself
The benchmark suite is open source. You can reproduce these results or run against your own codebase:
```sh
# Install Gabb
brew install gabb-software/tap/gabb

# Set up for Claude Code
gabb setup
claude mcp add gabb -- gabb mcp-server

# Run the benchmark (optional - for verification)
cd benchmark/claude-code
python run.py --swe-bench django__django-11179 --runs 5
```
Or just use Gabb in your daily workflow and notice the difference.
Conclusion
We set out to answer a simple question: does semantic indexing actually make AI coding assistants faster?
The answer, across 400 measured runs per condition, is yes. 32% faster, 45% fewer tool calls, same accuracy, same cost.
These aren’t the dramatic “10x” claims that pervade the AI tooling space. They’re real, measured, reproducible improvements. The kind that add up over weeks and months of daily use.
Your AI assistant is already smart. Gabb just helps it stop getting lost.
The full benchmark methodology and raw data are available in our GitHub repository. Questions about the methodology? Open an issue—we’re happy to discuss the details.