How I improved car manual search accuracy from 8% to 64% through systematic experimentation, hypothesis testing, and data-driven optimization
Car owners struggle to find information in dense 600+ page manuals. Traditional keyword search (Ctrl+F) fails when users don't know exact terminology or when relevant information is scattered across multiple sections.
Example: Searching "towing capacity" might miss relevant information in sections titled "Trailer Towing," "Maximum Load," or "Vehicle Specifications."
"Can a hybrid retrieval system combining semantic understanding and keyword matching outperform traditional search methods for technical document Q&A?"
Each experiment tested a specific hypothesis about information retrieval performance
Hypothesis H₀: Traditional BM25 keyword search provides baseline performance.
Result: 8% accuracy → confirmed that semantic understanding is needed (a minimal sketch of this baseline appears after this list).
Hypothesis H₁: Embedding-based semantic search > keyword matching.
Result: 26% accuracy (+225% improvement) → hypothesis supported.
Hypothesis H₂: Combining semantic + keyword + reranking > either alone.
Result: 62% accuracy (+138% from H₁) → strong support.
Hypothesis H₃: Query expansion + page-aware voting further improves accuracy.
Result: 64% accuracy (+3% from H₂) → marginal gains confirmed.
Hypothesis H₄: Multi-stage retrieval and extreme tuning can push beyond 64%.
Result: 52-64% accuracy (regression!) → hypothesis rejected.
⚠️ Key learning: over-optimization can hurt performance.
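For concreteness, here is a minimal sketch of the H₀ baseline, assuming the rank_bm25 package and a list of page-tagged chunks; the names (bm25_top_page, chunks) are illustrative, not the project's actual code:

```python
# Minimal sketch of the H0 baseline: pure BM25 keyword retrieval.
# Assumes `chunks` is a list of dicts like {"text": ..., "page": ...}
# built from the manual; names here are illustrative placeholders.
from rank_bm25 import BM25Okapi

def bm25_top_page(query: str, chunks: list[dict]) -> int:
    tokenized_corpus = [c["text"].lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(query.lower().split())
    best_idx = max(range(len(chunks)), key=lambda i: scores[i])
    return chunks[best_idx]["page"]

# Example: page = bm25_top_page("towing capacity", chunks)
```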
- Baseline (BM25 keyword search, ±2 pages): 4/50 correct answers (8%)
- Ultimate RAG (±2 pages): 32/50 correct answers (64%)
- Absolute improvement: +56 percentage points, Cohen's h = 1.26 (large); a quick check of this effect size follows below
- Pinpoint page accuracy (exact match): 16/50 exact matches (32%)
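As a sanity check, Cohen's h for two proportions is h = 2*arcsin(sqrt(p2)) - 2*arcsin(sqrt(p1)); a minimal computation using the standard formula (not project code):

```python
# Cohen's h effect size for the jump from 4/50 to 32/50 correct answers.
import math

def cohens_h(p1: float, p2: float) -> float:
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

print(round(cohens_h(4 / 50, 32 / 50), 2))  # comfortably above 0.8, the conventional "large" threshold
```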
| Tolerance | Baseline | Ultimate RAG | Improvement |
|---|---|---|---|
| Exact match | 4% (2/50) | 32% (16/50) | +700% |
| ±1 page | 6% (3/50) | 50% (25/50) | +733% |
| ±2 pages | 8% (4/50) | 64% (32/50) | +700% |
| ±5 pages | 12% (6/50) | 70% (35/50) | +483% |
The system shows consistent improvement across all tolerance levels, with the largest relative gains at the tightest tolerances (±1 page: +733%). This indicates it doesn't just find "nearby" answers but actually identifies the most relevant page with high precision.
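A minimal sketch of how these tolerance-band scores can be computed, assuming each of the 50 questions has a single gold page number and the system returns one predicted page (function and variable names are illustrative):

```python
# Score page-level accuracy at several tolerance bands (exact, +/-1, +/-2, +/-5).
def accuracy_at_tolerance(predicted_pages, gold_pages, tolerance: int) -> float:
    hits = sum(
        1 for pred, gold in zip(predicted_pages, gold_pages)
        if abs(pred - gold) <= tolerance
    )
    return hits / len(gold_pages)

# Example over a 50-question eval set:
# for tol in (0, 1, 2, 5):
#     print(tol, accuracy_at_tolerance(preds, golds, tol))
```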
[Figure: Systematic progression from 8% to 64% accuracy]
[Figure: Visual story of the optimization process]
[Figure: Performance across different precision levels]
[Figure: Which optimizations made the difference]
PDF → Chunks (3000 chars, 30% overlap) → Embeddings (all-mpnet-base-v2) → Vector Store (ChromaDB) → Query → Query Expansion (3 variations) → Hybrid Search (70% semantic + 30% BM25) → Top-K Retrieval (60 chunks) → Cross-Encoder Reranking (MiniLM-L-6) → Top-12 Chunks → Page-Aware Voting (3.0^rank) → Final Answer
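Below is a condensed, hedged sketch of the retrieval core of that pipeline: 70/30 fusion of semantic and BM25 scores followed by cross-encoder reranking. It uses standard sentence-transformers and rank_bm25 calls; chunking, ChromaDB storage, query expansion, and the voting step are elided, the exact cross-encoder checkpoint name is my assumption, and the function is illustrative rather than the project's actual code.

```python
# Hybrid retrieval sketch: fuse semantic and BM25 scores (70% / 30%),
# then rerank the top candidates with a cross-encoder.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

def hybrid_retrieve(query: str, chunks: list[str], top_k: int = 60, top_n: int = 12):
    # Semantic signal: cosine similarity between query and chunk embeddings.
    chunk_emb = embedder.encode(chunks, normalize_embeddings=True)
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    semantic = chunk_emb @ query_emb

    # Keyword signal: BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    keyword = np.asarray(bm25.get_scores(query.lower().split()))

    # Min-max normalize both signals so the 70/30 weighting is meaningful.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    fused = 0.7 * norm(semantic) + 0.3 * norm(keyword)

    # Keep the top-k fused candidates, then let the cross-encoder rerank them.
    candidates = np.argsort(fused)[::-1][:top_k]
    ce_scores = reranker.predict([(query, chunks[i]) for i in candidates])
    order = np.argsort(ce_scores)[::-1][:top_n]
    return [int(candidates[i]) for i in order]  # indices of the top-12 chunks, best first
```

Normalizing both signals before fusing keeps the 70%/30% weights meaningful, since raw BM25 scores and cosine similarities live on very different scales.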
- H₀ (baseline): keyword search using BM25 only. Established the challenge.
- H₁ (semantic search): added sentence-transformers embeddings. A 3.25x improvement!
- H₂ (hybrid + reranking): combined semantic search and BM25 with cross-encoder reranking.
- H₃ (query expansion + page-aware voting): peak performance (the voting step is sketched after this list).
- H₄ (extreme tuning): six further attempts failed to improve. Learned when to stop!
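A minimal sketch of the page-aware voting step named above, assuming the reranked chunks arrive best-first with their source page attached; the base of 3.0 comes from the pipeline description, while the exact rank orientation (best rank gets the heaviest vote) is my assumption:

```python
# Page-aware voting sketch: the top reranked chunks "vote" for their source
# page with exponentially decaying weight; the page with the highest total wins.
from collections import defaultdict

def vote_for_page(ranked_chunks: list[dict], base: float = 3.0) -> int:
    """ranked_chunks: best-first list of dicts like {"page": 123, "text": ...}."""
    votes: dict[int, float] = defaultdict(float)
    for rank, chunk in enumerate(ranked_chunks):
        votes[chunk["page"]] += base ** -rank  # weights: 1, 1/3, 1/9, ...
    return max(votes, key=votes.get)
```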
Understanding failure modes reveals system limitations and future directions
- Requires synthesizing information from 2+ sections
- Multiple valid interpretations of the question
- Answer in tables/diagrams, not extractable text
- Rare topics with insufficient context
61% of errors (11/18) stem from reasoning gaps, not retrieval failures. The system successfully finds relevant chunks but lacks the capability to synthesize information across contexts or resolve ambiguity. This suggests the next performance breakthrough requires reasoning capabilities (e.g., chain-of-thought, self-consistency) rather than improved retrieval algorithms.
[Figure: Impact of removing each component (in percentage points)]
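Those per-component deltas can be produced with a simple leave-one-out loop; a hedged sketch in which build_pipeline and evaluate are hypothetical stand-ins for the project's own functions, and the component list is inferred from the pipeline above:

```python
# Leave-one-out ablation sketch: disable one component at a time and report
# the drop in accuracy, in percentage points. `build_pipeline(disabled=...)`
# and `evaluate(pipeline, eval_set)` are hypothetical placeholders.
COMPONENTS = ["query_expansion", "bm25", "cross_encoder", "page_voting"]

def ablation_report(eval_set, build_pipeline, evaluate):
    full_acc = evaluate(build_pipeline(disabled=None), eval_set)
    for component in COMPONENTS:
        acc = evaluate(build_pipeline(disabled=component), eval_set)
        delta_pp = (full_acc - acc) * 100
        print(f"without {component}: {acc:.0%} ({delta_pp:+.1f} pp)")
```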
After achieving 64% accuracy, six additional optimization attempts yielded zero improvement; some even decreased performance to 52%. This plateau indicated I had reached the practical limits of a retrieval-only approach for this task. Takeaway: Monitor validation curves carefully; continued optimization beyond a plateau often leads to overfitting.
The winning solution combined just 4 techniques. More complex multi-stage pipelines with 8+ components performed worse and were harder to debug. Takeaway: Start with simple baselines, add complexity incrementally, and require each addition to prove its value on held-out data.
Error analysis revealed 61% of failures required reasoning, not better retrieval. Quantitative metrics (64% accuracy) told me performance was good; qualitative analysis told me why and where to improve next. Takeaway: Always complement quantitative metrics with deep-dive error analysis.
Interested in discussing this project or exploring opportunities?
Built with ❤️ by Agape Miteu | October 2025