From 45 Seconds to 30: How Semantic Indexing Reduces AI Task Time
Claims about AI productivity tools are easy to make and hard to verify. When we built Gabb, we wanted to know—not believe, but know—whether semantic indexing actually improved AI assistant performance.
So we ran 800 benchmark runs (400 per condition) on real coding tasks and measured everything. This post presents what we found.
The Experiment
We used tasks from SWE-bench, the industry-standard benchmark for AI coding assistants. SWE-bench tasks are real GitHub issues from production open-source projects—not synthetic exercises, but actual bugs that real developers filed and fixed.
Our test suite included 40 diverse tasks from Django and Astropy, spanning bug fixes, feature implementations, and refactoring work. Each task ran 10 times in both conditions (with and without Gabb), giving us 400 total runs per condition to ensure statistical reliability.
Setup:
- Control condition: Standard Claude Code with built-in tools (Grep, Read, Glob, Bash)
- Gabb condition: Claude Code with Gabb’s semantic indexing enabled
- Isolation: Each task ran in a fresh git checkout to prevent cross-contamination
- Measurement: Tool calls, completion time, token usage, and success rate were logged for every run
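The logged measurements boil down to a small aggregation step. This is a minimal sketch assuming a hypothetical per-run record shape of `(condition, seconds, tool_calls)`; the real harness lives in `benchmark/claude-code/run.py` and may log differently:

```python
from statistics import mean

def summarize(results):
    """Group per-run measurements by condition and report per-condition means.

    `results` is a list of (condition, seconds, tool_calls) tuples,
    one per benchmark run.
    """
    by_cond = {}
    for cond, secs, calls in results:
        by_cond.setdefault(cond, []).append((secs, calls))
    return {
        cond: {
            "mean_seconds": mean(s for s, _ in runs),
            "mean_tool_calls": mean(c for _, c in runs),
        }
        for cond, runs in by_cond.items()
    }
```

The headline numbers in the next section are exactly these per-condition means.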
The Results
| Metric | Without Gabb | With Gabb | Improvement |
|---|---|---|---|
| Task Completion Time | 45.4s | 30.6s | 32% faster |
| Tool Calls per Task | 14.6 | 8.1 | 45% fewer |
| Success Rate | 94% | 95% | +1 pt |
| Cost per Task | $0.040 | $0.040 | Identical |
The time reduction is statistically significant (p < 0.001) with a 95% confidence interval of 9–20 seconds saved per task. Put differently: we can say with high confidence that Gabb saves real time on real tasks, not just on lucky runs.
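The interval above follows from the standard two-sample formula. A minimal sketch, using a normal approximation (reasonable at 400 runs per condition) and illustrative variance figures rather than the study's actual ones:

```python
import math

def diff_ci95(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """95% CI for the difference in means, normal approximation.

    The standard error combines the per-condition sample variances,
    Welch-style, so the two groups need not share a variance.
    """
    se = math.sqrt(var_a / n_a + var_b / n_b)
    diff = mean_a - mean_b
    return diff - 1.96 * se, diff + 1.96 * se

# Means from the table; the variances here are illustrative assumptions.
lo, hi = diff_ci95(45.4, 380.0, 400, 30.6, 260.0, 400)
```

With 400 runs per side, even noisy per-task times pin the mean difference down to a fairly narrow band.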
Where the Time Goes
The 32% improvement doesn’t come from any single optimization. It emerges from a fundamental shift in how the AI navigates code.
Tool Usage Breakdown
| Tool | Without Gabb | With Gabb | Reduction |
|---|---|---|---|
| Read | 5.1 calls | 3.2 calls | 37% |
| Grep | 3.9 calls | 2.4 calls | 38% |
| Bash | 3.9 calls | 1.4 calls | 64% |
| Glob | 1.1 calls | 0.5 calls | 55% |
The Gabb condition adds two new tools (gabb_symbol and gabb_structure), which together averaged 0.8 calls per task. These replace 6+ calls from the traditional search-read-search cycle.
Pattern Shift
The most revealing metric is how the AI approaches navigation:
Without Gabb:
- 74.8% of runs used a “grep then read” pattern
- Multiple searches to narrow down file candidates
- Frequent backtracking when wrong files were read
With Gabb:
- 44.0% of runs used “direct read” (already knew the file)
- 39.2% used “symbol search” (direct lookup by name)
- Minimal backtracking
The AI stops exploring and starts navigating.
A Concrete Example
Here’s how the same task plays out differently:
Task: Find and fix a permission bug in Django’s migration system
Without Gabb:
1. Grep for "permission" → 847 matches
2. Grep for "migration permission" → 23 matches
3. Read django/contrib/auth/management/__init__.py → partial match
4. Grep for "create_permissions" → 12 matches
5. Read django/contrib/auth/migrations/0001_initial.py → wrong file
6. Read django/contrib/auth/migrations/0011_update_proxy_permissions.py → found it
Time: ~52 seconds, 6 tool calls
With Gabb:
1. gabb_symbol("update_proxy_model_permissions") → django/contrib/auth/migrations/0011_update_proxy_permissions.py:5
2. Read django/contrib/auth/migrations/0011_update_proxy_permissions.py → confirm and fix
Time: ~28 seconds, 2 tool calls
Same result. Half the time.
Why Fewer Tool Calls Matter
Each tool call carries hidden costs:
- Network latency: Round-trip time to the API adds up
- Context consumption: Tool results consume tokens from the context window
- Decision overhead: The AI must process results and decide next steps
- Failure potential: Each call is a chance to go down the wrong path
A 45% reduction in tool calls isn’t just faster—it’s more reliable. Fewer opportunities for the AI to get confused or sidetracked.
The Cost Question
You might expect that adding a new tool would increase token usage and therefore cost. The data tells a more nuanced story:
| Metric | Without Gabb | With Gabb |
|---|---|---|
| Total Tokens | 79,313 | 86,748 |
| Cost per Task | $0.040 | $0.040 |
Token usage increased by about 9% (primarily from the skill instructions in the system prompt). But the cost stayed identical. Why?
Prompt caching. Modern AI APIs cache repeated content, and Gabb’s instructions are the same across all tasks. The cache hit rate fully offsets the additional tokens.
The 32% time savings comes at zero additional API cost.
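Back-of-the-envelope, the caching effect is easy to model. This sketch uses illustrative per-million-token prices and a 90% cache discount; the actual rates and discount depend on the provider and are assumptions here, not measured values:

```python
def task_cost(input_tokens, output_tokens, cached_fraction,
              in_price=3.00, out_price=15.00, cache_discount=0.9):
    """Rough per-task cost in USD.

    Prices are USD per million tokens and are illustrative assumptions.
    Cached input tokens are billed at (1 - cache_discount) of the
    normal input price; output tokens are never cached.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost_in = (fresh + cached * (1 - cache_discount)) * in_price / 1e6
    cost_out = output_tokens * out_price / 1e6
    return cost_in + cost_out
```

When most of the larger prompt is served from cache, a ~9% token increase can net out to an unchanged dollar cost.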
Statistical Confidence
Good data requires good methodology. Here’s how we ensured our results are trustworthy:
Sample size: 400 runs per condition gives 99% power to detect effects of our observed size.
Significance testing: Welch’s t-test confirms p < 0.001 for both the time and tool-call reductions. In other words, if there were no real effect, differences this large would appear less than 0.1% of the time by chance.
Effect size: Cohen’s d of 0.38–0.41 indicates a small-to-medium practical effect. The improvement is consistent enough to matter across diverse tasks, not just outliers.
Variance handling: Standard deviations are high (tasks vary widely in complexity), but the mean differences remain significant even accounting for this variance.
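For readers who want to sanity-check these statistics on their own data, both Welch's t-statistic and Cohen's d are a few lines of standard-library Python (the samples below are synthetic; `scipy.stats.ttest_ind(..., equal_var=False)` gives the matching p-value):

```python
import math

def welch_t(a, b):
    """Welch's t-statistic: difference in means over the combined standard error."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def cohens_d(a, b):
    """Cohen's d: difference in means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```

High per-task variance widens the standard error, which is why the large sample size matters: it shrinks `va / na + vb / nb` until the mean difference stands clear of the noise.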
What This Means for You
For Individual Developers
32% faster means you get answers while your question is still fresh. Instead of context-switching while the AI searches, you stay in flow. At roughly 15 seconds saved per task, a day of heavy AI usage adds up to real time reclaimed.
For Teams
The tool call reduction has second-order benefits:
- Lower rate-limit pressure on shared API keys
- Less noisy conversation logs
- More predictable response times
For Large Codebases
Our benchmark included repositories with hundreds of thousands of lines of code. The time savings scale—larger codebases don’t mean proportionally longer searches when the index already knows where everything is.
The Limitations
We believe in honest benchmarking. Here’s what this data doesn’t prove:
Language coverage: Our benchmark focused on Python codebases. Results may vary for TypeScript, Rust, or mixed-language projects, though we expect similar patterns.
Task types: SWE-bench tasks are bug fixes. Feature development, refactoring, and code review workflows weren’t directly measured.
Model dependency: These results are with Claude Sonnet. Other models may show different improvement ratios.
Setup cost: First-time indexing takes time (typically 30–60 seconds for a medium codebase). The benchmarks measure task time after indexing is complete.
Try It Yourself
The benchmark suite is open source. You can reproduce these results or run against your own codebase:
```sh
# Install Gabb
brew install gabb-software/tap/gabb

# Set up for Claude Code
gabb setup
claude mcp add gabb -- gabb mcp-server

# Run the benchmark (optional - for verification)
cd benchmark/claude-code
python run.py --swe-bench django__django-11179 --runs 5
```
Or just use Gabb in your daily workflow and notice the difference.
Conclusion
We set out to answer a simple question: does semantic indexing actually make AI coding assistants faster?
The answer, across 400 measured runs per condition, is yes. 32% faster, 45% fewer tool calls, same accuracy, same cost.
These aren’t the dramatic “10x” claims that pervade the AI tooling space. They’re real, measured, reproducible improvements. The kind that add up over weeks and months of daily use.
Your AI assistant is already smart. Gabb just helps it stop getting lost.
The full benchmark methodology and raw data are available in our GitHub repository. Questions about the methodology? Open an issue—we’re happy to discuss the details.