πŸš— ManualAi: A Data Science Case Study

How I improved car manual search accuracy from 8% to 64% through systematic experimentation, hypothesis testing, and data-driven optimization

64% Final Accuracy (Β±2 pages)
+56pp Absolute Improvement
11 Experiments Conducted
50 Test Questions

🎯 The Problem

Business Challenge

Car owners struggle to find information in dense 600+ page manuals. Traditional keyword search (Ctrl+F) fails when users don't know exact terminology or when relevant information is scattered across multiple sections.

Example: Searching "towing capacity" might miss relevant information in sections titled "Trailer Towing," "Maximum Load," or "Vehicle Specifications."

Research Question

"Can a hybrid retrieval system combining semantic understanding and keyword matching outperform traditional search methods for technical document Q&A?"

Success Metrics

  • Primary: Accuracy within Β±2 pages (balance of precision & usability)
  • Secondary: Exact page match rate
  • Tertiary: Query latency under 20 seconds
  • Dataset: 50 ground-truth Q&A pairs from 608-page Toyota manual

πŸ”¬ Experimental Methodology

Hypothesis-Driven Experimentation

Each experiment tested a specific hypothesis about information retrieval performance

Phase 1: Baseline (Hβ‚€)

Hypothesis: Traditional BM25 keyword search provides baseline performance

Result: 8% accuracy β†’ Confirmed semantic understanding is needed
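
For context, a BM25 baseline of this kind can be sketched as below. It assumes the rank_bm25 package and per-page text; it illustrates the baseline idea rather than reproducing the project's exact code.

```python
# BM25 keyword-search baseline sketch, assuming the rank_bm25 package and a
# list of per-page strings extracted from the manual (illustrative only).
from rank_bm25 import BM25Okapi

pages = [
    "trailer towing maximum load distribution hitch ...",
    "engine oil capacity and recommended viscosity ...",
]
bm25 = BM25Okapi([page.lower().split() for page in pages])

query = "towing capacity"
scores = bm25.get_scores(query.lower().split())
best = max(range(len(pages)), key=lambda i: scores[i])
print(f"Predicted page: {best + 1}")  # 1-indexed page number
```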

Phase 2: Semantic Search (H₁)

Hypothesis: Embeddings-based semantic search > keyword matching

Result: 26% accuracy (+225% improvement) β†’ Hypothesis supported
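
A minimal sketch of the semantic-search step, using the all-mpnet-base-v2 model named later in the architecture diagram; an in-memory cosine search stands in for the ChromaDB vector store used in the real pipeline.

```python
# Semantic search sketch with sentence-transformers; an in-memory cosine
# similarity search stands in for the ChromaDB vector store.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
chunks = [
    "Trailer towing: the maximum towing capacity is ...",
    "Engine oil: use the viscosity grade shown on the filler cap ...",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

query_embedding = model.encode("how much weight can I tow", convert_to_tensor=True)
similarities = util.cos_sim(query_embedding, chunk_embeddings)[0]
print(chunks[int(similarities.argmax())])
```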

Phase 3: Hybrid + Reranking (Hβ‚‚)

Hypothesis: Combining semantic + keyword + reranking > either alone

Result: 62% accuracy (+138% from H₁) β†’ Strong support
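
A sketch of the hybrid-plus-reranking idea is shown below, using the 70/30 weighting and the MiniLM cross-encoder quoted later in the architecture diagram; score normalization and candidate counts are simplified relative to the real pipeline.

```python
# Hybrid retrieval (70% semantic + 30% BM25) followed by cross-encoder
# reranking; normalization and candidate counts are simplified.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

chunks = [
    "Trailer towing: the maximum towing capacity is ...",
    "Headlight bulb replacement procedure ...",
]
query = "how much weight can I tow"

embedder = SentenceTransformer("all-mpnet-base-v2")
semantic = util.cos_sim(embedder.encode(query, convert_to_tensor=True),
                        embedder.encode(chunks, convert_to_tensor=True))[0]

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword = bm25.get_scores(query.lower().split())

hybrid = [0.7 * float(semantic[i]) + 0.3 * float(keyword[i]) for i in range(len(chunks))]
candidates = sorted(range(len(chunks)), key=lambda i: hybrid[i], reverse=True)[:60]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, chunks[i]) for i in candidates])
top_chunks = sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)[:12]
```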

Phase 4: Ultimate RAG (H₃)

Hypothesis: Query expansion + page-aware voting further improves accuracy

Result: 64% accuracy (+3% from Hβ‚‚) β†’ Marginal gains confirmed
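
Query expansion and page-aware voting can be sketched as follows. The expansion variants here are hand-written placeholders, and the exact form of the "3.0^rank" weight is an assumption; this version gives exponentially more weight to higher-ranked chunks.

```python
# Query expansion plus page-aware voting sketch. The expansion variants are
# hand-written placeholders, and the 3.0^rank weight is interpreted here as
# an exponential decay that favors higher-ranked chunks (an assumption).
from collections import defaultdict

def expand_query(query):
    return [query,
            f"{query} specifications",
            f"where does the manual explain {query}"]

def vote_for_page(ranked_pages, base=3.0):
    """ranked_pages: (page_number, rank) pairs pooled over all query variants,
    where rank 0 is the best reranked chunk."""
    votes = defaultdict(float)
    for page, rank in ranked_pages:
        votes[page] += base ** -rank  # each rank position is worth 1/3 of the previous
    return max(votes, key=votes.get)

print(expand_query("towing capacity"))
print(vote_for_page([(212, 0), (212, 1), (215, 2), (212, 3), (198, 4)]))  # -> 212
```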

Phases 5-6: Optimization Attempts (Hβ‚„)

Hypothesis: Multi-stage retrieval & extreme tuning can push beyond 64%

Result: 52-64% accuracy (regression!) β†’ Hypothesis rejected

⚠️ Key Learning: Over-optimization can hurt performance

πŸ“Š Results & Statistical Analysis

  β€’ Baseline Performance: 8% within Β±2 pages using BM25 keyword search (4/50 correct answers)
  β€’ Best Model: 64% within Β±2 pages using Ultimate RAG (32/50 correct answers)
  β€’ Effect Size: +56pp absolute improvement, Cohen's h = 1.26 (large)
  β€’ Exact Match: 32% pinpoint page accuracy (16/50 exact matches)

πŸ“ˆ Performance Breakdown by Tolerance Level

Tolerance     Baseline      Ultimate RAG    Improvement
Exact Match   4%  (2/50)    32% (16/50)     +700%
Β±1 page       6%  (3/50)    50% (25/50)     +733%
Β±2 pages      8%  (4/50)    64% (32/50)     +700%
Β±5 pages      12% (6/50)    70% (35/50)     +483%

πŸ’‘ Key Insight

The model shows consistent improvement across all tolerance levels, with the largest effect sizes at tighter tolerances (Β±1 page: +733%). This indicates the system doesn't just find "nearby" answers but actually identifies the most relevant page with high precision.

πŸ“ˆ Visual Results

  β€’ Performance Comparison: all 11 experiments compared, showing the systematic progression from 8% to 64% accuracy
  β€’ Improvement Journey: the +700% improvement journey, a visual story of the optimization process
  β€’ Tolerance Analysis: accuracy by page tolerance, showing performance across different precision levels
  β€’ Component Contribution: the feature ablation study, showing which optimizations made the difference

πŸ”§ Technical Approach

Technology Stack

Python 3.13
sentence-transformers
ChromaDB
LangChain
Gradio
PyMuPDF

Ultimate RAG Architecture

PDF β†’ Chunks (3000 chars, 30% overlap)
        ↓
    Embeddings (all-mpnet-base-v2)
        ↓
    Vector Store (ChromaDB)
        ↓
Query β†’ Query Expansion (3 variations)
        ↓
    Hybrid Search (70% semantic + 30% BM25)
        ↓
    Top-K Retrieval (60 chunks)
        ↓
    Cross-Encoder Reranking (MiniLM-L-6)
        ↓
    Top-12 Chunks
        ↓
    Page-Aware Voting (3.0^rank)
        ↓
    Final Answer
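
The first two stages of this pipeline (PDF to chunks) might look roughly like the sketch below, using PyMuPDF and keeping each chunk tied to its source page so the page-level evaluation works. Whether the real pipeline lets chunks span page boundaries is not specified; this version keeps them within a page.

```python
# Sketch of PDF extraction and chunking (3000 chars, 30% overlap) that keeps
# the source page number with each chunk; boundary handling is simplified and
# chunks do not cross pages in this version.
import fitz  # PyMuPDF

CHUNK_SIZE = 3000
STEP = int(CHUNK_SIZE * 0.7)  # 30% overlap between consecutive chunks

def chunk_manual(pdf_path):
    chunks = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text()
            for start in range(0, max(len(text), 1), STEP):
                piece = text[start:start + CHUNK_SIZE]
                if piece.strip():
                    chunks.append({"page": page_number, "text": piece})
    return chunks

# chunks = chunk_manual("toyota_manual.pdf")  # hypothetical file name
```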
                    

πŸ”¬ Experimental Journey

Phase 1: Baseline (8%). Keyword search using BM25 only. Established the challenge.

Phase 2: Semantic Search (26%). Added sentence-transformers embeddings. 3.25Γ— improvement!

Phase 3: Hybrid + Reranking (62%). Combined semantic + BM25 with cross-encoder reranking.

Phase 4: Ultimate RAG (64%). Query expansion + page-aware voting. Peak performance!

Phases 5-6: Over-Optimization (52-64%). Six attempts failed to improve. Learned when to stop!

πŸ” Error Analysis & Root Causes

Understanding failure modes reveals system limitations and future directions

  β€’ Multi-hop questions (39%): require synthesizing information from 2+ sections
  β€’ Ambiguous queries (28%): multiple valid interpretations of the question
  β€’ Visual information (22%): answer lives in tables or diagrams, not extractable text
  β€’ Edge cases (11%): rare topics with insufficient context

πŸ’‘ Key Insight from Error Analysis

61% of errors (11/18) stem from reasoning gaps, not retrieval failures. The system successfully finds relevant chunks but lacks the capability to synthesize information across contexts or resolve ambiguity. This suggests the next performance breakthrough requires reasoning capabilities (e.g., chain-of-thought, self-consistency) rather than improved retrieval algorithms.

πŸ’‘ Data Science Insights & Learnings

🎯 Feature Ablation Study

Impact of removing each component (in percentage points)

  β€’ Hybrid Search (semantic + BM25): -12pp
  β€’ Cross-Encoder Reranking: -10pp
  β€’ Large Context (3000 chars): -8pp
  β€’ Query Expansion: -2pp
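
The ablation numbers above come from re-running the evaluation with one component disabled at a time. A sketch of that loop is below; `build_pipeline` and `evaluate` are hypothetical stand-ins for the project's actual pipeline constructor and test harness.

```python
# Ablation-study loop sketch. `build_pipeline` and `evaluate` are hypothetical
# stand-ins for the project's actual pipeline constructor and test harness.

FULL_CONFIG = {
    "hybrid_search": True,        # semantic + BM25 fusion
    "cross_encoder_rerank": True,
    "chunk_chars": 3000,
    "query_expansion": True,
}

ABLATIONS = {
    "Hybrid Search": {"hybrid_search": False},
    "Cross-Encoder Reranking": {"cross_encoder_rerank": False},
    "Large Context": {"chunk_chars": 1000},   # smaller chunks stand in for removal
    "Query Expansion": {"query_expansion": False},
}

def run_ablation(build_pipeline, evaluate, test_set):
    baseline = evaluate(build_pipeline(FULL_CONFIG), test_set)
    for name, override in ABLATIONS.items():
        accuracy = evaluate(build_pipeline({**FULL_CONFIG, **override}), test_set)
        print(f"{name}: {(accuracy - baseline) * 100:+.0f}pp")
```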

βœ… What Worked

  • Hybrid retrieval: Combined strengths of semantic + keyword search
  • Large context: 3000-char chunks preserved answer coherence
  • Two-stage ranking: Fast retrieval + slow reranking balanced speed/accuracy
  • 30% overlap: Prevented boundary information loss

❌ What Failed

  • Over-optimization: 6 experiments caused 64%β†’52% regression
  • Multi-stage retrieval: Added complexity, no gains
  • Very large chunks: 4000+ chars introduced noise
  • Extreme parameters: Voting weights >5.0 destabilized predictions

πŸŽ“ Critical Data Science Learnings

πŸ“‰ Diminishing Returns & Overfitting Detection

After achieving 64% accuracy, six additional optimization attempts yielded zero improvementβ€”some even decreased performance to 52%. This plateau indicated I had reached the fundamental limits of retrieval-based approaches. Takeaway: Monitor validation curves carefully; continued optimization beyond plateaus often leads to overfitting.

βš–οΈ Occam's Razor in Production ML

The winning solution combined just 4 techniques. More complex multi-stage pipelines with 8+ components performed worse and were harder to debug. Takeaway: Start with simple baselines, add complexity incrementally, and require each addition to prove its value on held-out data.

πŸ”¬ Qual + Quant Analysis

Error analysis revealed 61% of failures required reasoning, not better retrieval. Quantitative metrics (64% accuracy) told me performance was good; qualitative analysis told me why and where to improve next. Takeaway: Always complement quantitative metrics with deep-dive error analysis.

πŸ“¬ Get In Touch

Interested in discussing this project or exploring opportunities?

Built with ❀️ by Agape Miteu | October 2025