πŸš— ManualAi: A Data Science Case Study

How I improved car manual search accuracy from 8% to 64% through systematic experimentation, hypothesis testing, and data-driven optimization

64% Final Accuracy (Β±2 pages)
+56pp Absolute Improvement
11 Experiments Conducted
50 Test Questions

🎯 The Problem

Business Challenge

Car owners struggle to find information in dense 600+ page manuals. Traditional keyword search (Ctrl+F) fails when users don't know exact terminology or when relevant information is scattered across multiple sections.

Example: Searching "towing capacity" might miss relevant information in sections titled "Trailer Towing," "Maximum Load," or "Vehicle Specifications."

Research Question

"Can a hybrid retrieval system combining semantic understanding and keyword matching outperform traditional search methods for technical document Q&A?"

Success Metrics

  • Primary: Accuracy within Β±2 pages (balance of precision & usability)
  • Secondary: Exact page match rate
  • Tertiary: Query latency under 20 seconds
  • Dataset: 50 ground-truth Q&A pairs from 608-page Toyota manual

πŸ”¬ Experimental Methodology

Hypothesis-Driven Experimentation

Each experiment tested a specific hypothesis about information retrieval performance

Phase 1: Baseline (Hβ‚€)

Hypothesis: Traditional BM25 keyword search provides baseline performance

Result: 8% accuracy β†’ Confirmed semantic understanding is needed
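
For context, a BM25 baseline of this kind can be sketched as below. It assumes the rank_bm25 package and per-page text; it illustrates the baseline idea rather than reproducing the project's exact code.

```python
# BM25 keyword-search baseline sketch, assuming the rank_bm25 package and a
# list of per-page strings extracted from the manual (illustrative only).
from rank_bm25 import BM25Okapi

pages = [
    "trailer towing maximum load distribution hitch ...",
    "engine oil capacity and recommended viscosity ...",
]
bm25 = BM25Okapi([page.lower().split() for page in pages])

query = "towing capacity"
scores = bm25.get_scores(query.lower().split())
best = max(range(len(pages)), key=lambda i: scores[i])
print(f"Predicted page: {best + 1}")  # 1-indexed page number
```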

Phase 2: Semantic Search (H₁)

Hypothesis: Embeddings-based semantic search > keyword matching

Result: 26% accuracy (+225% improvement) β†’ Hypothesis supported
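
A minimal sketch of the semantic-search step, using the all-mpnet-base-v2 model named later in the architecture diagram; an in-memory cosine search stands in for the ChromaDB vector store used in the real pipeline.

```python
# Semantic search sketch with sentence-transformers; an in-memory cosine
# similarity search stands in for the ChromaDB vector store.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
chunks = [
    "Trailer towing: the maximum towing capacity is ...",
    "Engine oil: use the viscosity grade shown on the filler cap ...",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

query_embedding = model.encode("how much weight can I tow", convert_to_tensor=True)
similarities = util.cos_sim(query_embedding, chunk_embeddings)[0]
print(chunks[int(similarities.argmax())])
```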

Phase 3: Hybrid + Reranking (Hβ‚‚)

Hypothesis: Combining semantic + keyword + reranking > either alone

Result: 62% accuracy (+138% from H₁) β†’ Strong support
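
A sketch of the hybrid-plus-reranking idea is shown below, using the 70/30 weighting and the MiniLM cross-encoder quoted later in the architecture diagram; score normalization and candidate counts are simplified relative to the real pipeline.

```python
# Hybrid retrieval (70% semantic + 30% BM25) followed by cross-encoder
# reranking; normalization and candidate counts are simplified.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

chunks = [
    "Trailer towing: the maximum towing capacity is ...",
    "Headlight bulb replacement procedure ...",
]
query = "how much weight can I tow"

embedder = SentenceTransformer("all-mpnet-base-v2")
semantic = util.cos_sim(embedder.encode(query, convert_to_tensor=True),
                        embedder.encode(chunks, convert_to_tensor=True))[0]

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword = bm25.get_scores(query.lower().split())

hybrid = [0.7 * float(semantic[i]) + 0.3 * float(keyword[i]) for i in range(len(chunks))]
candidates = sorted(range(len(chunks)), key=lambda i: hybrid[i], reverse=True)[:60]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, chunks[i]) for i in candidates])
top_chunks = sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)[:12]
```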

Phase 4: Ultimate RAG (H₃)

Hypothesis: Query expansion + page-aware voting further improves accuracy

Result: 64% accuracy (+3% from Hβ‚‚) β†’ Marginal gains confirmed
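
Query expansion and page-aware voting can be sketched as follows. The expansion variants here are hand-written placeholders, and the exact form of the "3.0^rank" weight is an assumption; this version gives exponentially more weight to higher-ranked chunks.

```python
# Query expansion plus page-aware voting sketch. The expansion variants are
# hand-written placeholders, and the 3.0^rank weight is interpreted here as
# an exponential decay that favors higher-ranked chunks (an assumption).
from collections import defaultdict

def expand_query(query):
    return [query,
            f"{query} specifications",
            f"where does the manual explain {query}"]

def vote_for_page(ranked_pages, base=3.0):
    """ranked_pages: (page_number, rank) pairs pooled over all query variants,
    where rank 0 is the best reranked chunk."""
    votes = defaultdict(float)
    for page, rank in ranked_pages:
        votes[page] += base ** -rank  # each rank position is worth 1/3 of the previous
    return max(votes, key=votes.get)

print(expand_query("towing capacity"))
print(vote_for_page([(212, 0), (212, 1), (215, 2), (212, 3), (198, 4)]))  # -> 212
```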

Phases 5-6: Optimization Attempts (Hβ‚„)

Hypothesis: Multi-stage retrieval & extreme tuning can push beyond 64%

Result: 52-64% accuracy (regression!) β†’ Hypothesis rejected

⚠️ Key Learning: Over-optimization can hurt performance

πŸ“Š Results & Statistical Analysis

  β€’ Baseline Performance: 8% within Β±2 pages using BM25 keyword search (4/50 correct answers)
  β€’ Best Model: 64% within Β±2 pages using Ultimate RAG (32/50 correct answers)
  β€’ Effect Size: +56pp absolute improvement, Cohen's h = 1.26 (large)
  β€’ Exact Match: 32% pinpoint page accuracy (16/50 exact matches)

πŸ“ˆ Performance Breakdown by Tolerance Level

Tolerance     Baseline      Ultimate RAG    Improvement
Exact Match   4%  (2/50)    32% (16/50)     +700%
Β±1 page       6%  (3/50)    50% (25/50)     +733%
Β±2 pages      8%  (4/50)    64% (32/50)     +700%
Β±5 pages      12% (6/50)    70% (35/50)     +483%

πŸ’‘ Key Insight

The model shows consistent improvement across all tolerance levels, with the largest effect sizes at tighter tolerances (Β±1 page: +733%). This indicates the system doesn't just find "nearby" answers but actually identifies the most relevant page with high precision.

πŸ“ˆ Visual Results

  β€’ Performance Comparison: all 11 experiments compared, showing the systematic progression from 8% to 64% accuracy
  β€’ Improvement Journey: the +700% improvement journey, a visual story of the optimization process
  β€’ Tolerance Analysis: accuracy by page tolerance, showing performance across different precision levels
  β€’ Component Contribution: the feature ablation study, showing which optimizations made the difference

πŸ”§ Technical Approach

Technology Stack

Python 3.13
sentence-transformers
ChromaDB
LangChain
Gradio
PyMuPDF

Ultimate RAG Architecture

PDF β†’ Chunks (3000 chars, 30% overlap)
        ↓
    Embeddings (all-mpnet-base-v2)
        ↓
    Vector Store (ChromaDB)
        ↓
Query β†’ Query Expansion (3 variations)
        ↓
    Hybrid Search (70% semantic + 30% BM25)
        ↓
    Top-K Retrieval (60 chunks)
        ↓
    Cross-Encoder Reranking (MiniLM-L-6)
        ↓
    Top-12 Chunks
        ↓
    Page-Aware Voting (3.0^rank)
        ↓
    Final Answer
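
The first two stages of this pipeline (PDF to chunks) might look roughly like the sketch below, using PyMuPDF and keeping each chunk tied to its source page so the page-level evaluation works. Whether the real pipeline lets chunks span page boundaries is not specified; this version keeps them within a page.

```python
# Sketch of PDF extraction and chunking (3000 chars, 30% overlap) that keeps
# the source page number with each chunk; boundary handling is simplified and
# chunks do not cross pages in this version.
import fitz  # PyMuPDF

CHUNK_SIZE = 3000
STEP = int(CHUNK_SIZE * 0.7)  # 30% overlap between consecutive chunks

def chunk_manual(pdf_path):
    chunks = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text()
            for start in range(0, max(len(text), 1), STEP):
                piece = text[start:start + CHUNK_SIZE]
                if piece.strip():
                    chunks.append({"page": page_number, "text": piece})
    return chunks

# chunks = chunk_manual("toyota_manual.pdf")  # hypothetical file name
```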
                    

πŸ”¬ Experimental Journey

Phase 1: Baseline (8%). Keyword search using BM25 only. Established the challenge.

Phase 2: Semantic Search (26%). Added sentence-transformers embeddings. 3.25Γ— improvement!

Phase 3: Hybrid + Reranking (62%). Combined semantic + BM25 with cross-encoder reranking.

Phase 4: Ultimate RAG (64%). Query expansion + page-aware voting. Peak performance!

Phases 5-6: Over-Optimization (52-64%). Six attempts failed to improve. Learned when to stop!

πŸ” Error Analysis & Root Causes

Understanding failure modes reveals system limitations and future directions

  β€’ Multi-hop questions (39%): require synthesizing information from 2+ sections
  β€’ Ambiguous queries (28%): multiple valid interpretations of the question
  β€’ Visual information (22%): answer lives in tables or diagrams, not extractable text
  β€’ Edge cases (11%): rare topics with insufficient context

πŸ’‘ Key Insight from Error Analysis

61% of errors (11/18) stem from reasoning gaps, not retrieval failures. The system successfully finds relevant chunks but lacks the capability to synthesize information across contexts or resolve ambiguity. This suggests the next performance breakthrough requires reasoning capabilities (e.g., chain-of-thought, self-consistency) rather than improved retrieval algorithms.

πŸ’‘ Data Science Insights & Learnings

🎯 Feature Ablation Study

Impact of removing each component (in percentage points)

  β€’ Hybrid Search (semantic + BM25): -12pp
  β€’ Cross-Encoder Reranking: -10pp
  β€’ Large Context (3000 chars): -8pp
  β€’ Query Expansion: -2pp
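
The ablation numbers above come from re-running the evaluation with one component disabled at a time. A sketch of that loop is below; `build_pipeline` and `evaluate` are hypothetical stand-ins for the project's actual pipeline constructor and test harness.

```python
# Ablation-study loop sketch. `build_pipeline` and `evaluate` are hypothetical
# stand-ins for the project's actual pipeline constructor and test harness.

FULL_CONFIG = {
    "hybrid_search": True,        # semantic + BM25 fusion
    "cross_encoder_rerank": True,
    "chunk_chars": 3000,
    "query_expansion": True,
}

ABLATIONS = {
    "Hybrid Search": {"hybrid_search": False},
    "Cross-Encoder Reranking": {"cross_encoder_rerank": False},
    "Large Context": {"chunk_chars": 1000},   # smaller chunks stand in for removal
    "Query Expansion": {"query_expansion": False},
}

def run_ablation(build_pipeline, evaluate, test_set):
    baseline = evaluate(build_pipeline(FULL_CONFIG), test_set)
    for name, override in ABLATIONS.items():
        accuracy = evaluate(build_pipeline({**FULL_CONFIG, **override}), test_set)
        print(f"{name}: {(accuracy - baseline) * 100:+.0f}pp")
```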

βœ… What Worked

  • Hybrid retrieval: Combined strengths of semantic + keyword search
  • Large context: 3000-char chunks preserved answer coherence
  • Two-stage ranking: Fast retrieval + slow reranking balanced speed/accuracy
  • 30% overlap: Prevented boundary information loss

❌ What Failed

  • Over-optimization: 6 experiments caused 64%β†’52% regression
  • Multi-stage retrieval: Added complexity, no gains
  • Very large chunks: 4000+ chars introduced noise
  • Extreme parameters: Voting weights >5.0 destabilized predictions

πŸŽ“ Critical Data Science Learnings

πŸ“‰ Diminishing Returns & Overfitting Detection

After achieving 64% accuracy, six additional optimization attempts yielded zero improvementβ€”some even decreased performance to 52%. This plateau indicated I had reached the fundamental limits of retrieval-based approaches. Takeaway: Monitor validation curves carefully; continued optimization beyond plateaus often leads to overfitting.

βš–οΈ Occam's Razor in Production ML

The winning solution combined just 4 techniques. More complex multi-stage pipelines with 8+ components performed worse and were harder to debug. Takeaway: Start with simple baselines, add complexity incrementally, and require each addition to prove its value on held-out data.

πŸ”¬ Qual + Quant Analysis

Error analysis revealed 61% of failures required reasoning, not better retrieval. Quantitative metrics (64% accuracy) told me performance was good; qualitative analysis told me why and where to improve next. Takeaway: Always complement quantitative metrics with deep-dive error analysis.

πŸ“¬ Get In Touch

Interested in discussing this project or exploring opportunities?

Built with ❀️ by Agape Miteu | October 2025