RAG systems are easy to build and hard to trust. You can assemble a retriever, a vector store, and a chat model in an afternoon — but without a principled way to measure retrieval quality, faithfulness, and answer relevancy, every configuration change is a guess. This project is the measurement layer: a fully local RAG pipeline with a rigorous evaluation suite built on DeepEval, designed around iterative A/B comparison against a fixed benchmark.
The system runs entirely on local hardware (RTX 3090) via LM Studio, with GPT-4o-mini as the judge model for reliable metric scoring. It targets a realistic mixed corpus — markdown documentation, shell scripts, and config files from a personal codebase — and tracks pipeline performance across four independent metrics as retrieval and generation strategies are swapped in and out.
System Architecture
The pipeline has two layers: a RAG pipeline that ingests documents, chunks them, stores embeddings in ChromaDB via a VectorStore abstraction, and generates answers at query time — and an evaluation layer that scores those answers across four DeepEval metrics.
```mermaid
flowchart LR
    D[Documents] --> C[Chunker] --> VS
    Q[Query] --> R[Retriever]
    R -->|embed + search| VS[(ChromaDB)]
    VS -->|ranked chunks| R
    R --> G[Generator] --> A[Answer]
    A --> F[Faithfulness]
    A --> AR[Answer Relevancy]
    R --> CP[Ctx Precision]
    R --> CR[Ctx Recall]
    classDef stable fill:#14532d,stroke:#4ade80,color:#bbf7d0
    classDef localVar fill:#78350f,stroke:#fbbf24,color:#fef3c7
    classDef judgeVar fill:#7f1d1d,stroke:#f87171,color:#fee2e2
    class R,VS stable
    class G,A localVar
    class F,AR,CP,CR judgeVar
```
Every eval run writes a timestamped artifact folder (artifacts/eval_runs/<UTC>/) containing a structured report.json and a human-readable report.md with per-sample scores, metric reasoning, and pass/fail summaries. This makes regressions visible and improvements measurable — each pipeline variant leaves a permanent, comparable record.
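A minimal sketch of the artifact layout makes the structure concrete. The field names inside `report.json` here are illustrative assumptions, not the project's actual schema; only the folder convention (`artifacts/eval_runs/<UTC>/` with `report.json` and `report.md`) comes from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_run_artifacts(results: list[dict], root: str = "artifacts/eval_runs") -> Path:
    """Write one timestamped run folder containing report.json and report.md."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    run_dir = Path(root) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)

    # Structured report: machine-comparable across runs.
    passed = sum(1 for r in results if r["passed"])
    report = {"pass_rate": passed / len(results), "samples": results}
    (run_dir / "report.json").write_text(json.dumps(report, indent=2))

    # Human-readable summary: per-sample pass/fail at a glance.
    lines = [f"# Eval run {stamp}", f"Pass rate: {passed}/{len(results)}", ""]
    for r in results:
        lines.append(f"- {r['id']}: {'PASS' if r['passed'] else 'FAIL'}")
    (run_dir / "report.md").write_text("\n".join(lines))
    return run_dir
```

Because every run lands in its own timestamped folder, two pipeline variants can be diffed long after the fact without re-running anything.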
The Dataset: Synthetic Q&A over a Real Codebase
The evaluation corpus is the documentation and configuration files for a personal Linux streaming setup (StreamDeck2) — markdown guides, install scripts, systemd unit templates, config file templates. The kind of mixed technical content that RAG systems frequently encounter in practice.
The 25-question synthetic dataset was generated to cover specific, verifiable facts from this corpus: exact file paths, command-line flags, environment variable names, systemd directives. These are questions that have unambiguous right answers, which makes metric calibration meaningful.
The questions span five categories:
- Installation (`install`, 10 questions) — exact behaviour of `install.sh`: default user values, config file paths and modes, environment variables that control the install, AUR packages required
- Systemd (`systemd`, 4 questions) — service unit directives: `User=`, `Requires=`, `After=`, `ExecStart=` commands
- Networking (`networking`, 2 questions) — firewall port requirements, service ports, pairing protocols
- MoonDeck / inter-process (`moondeck`, 6 questions) — IPC primitives, singleton mechanics, Qt shared memory, env var contracts between processes
- Xorg / display (`xorg`, 3 questions) — Xorg flags, Modeline config, input isolation mechanics
Difficulty is weighted toward medium questions that require pulling specific details from a single section, with a few that require correlating facts across two or more document chunks.
How the Evaluation Works
Each sample in the benchmark has a query, an expected answer, and no pre-specified context — the retriever must find the relevant chunks on its own. The pipeline generates an answer, then a judge model scores it across four independent metrics.
| Metric | What it measures | Threshold |
|---|---|---|
| Faithfulness | Are all claims in the answer supported by the retrieved chunks? Catches hallucination — the model adding facts beyond what was retrieved. | ≥ 0.70 |
| Answer Relevancy | Does the answer actually address the question asked? Catches tangential or evasive responses. | ≥ 0.70 |
| Contextual Precision | Are the most relevant chunks ranked highest in the retrieval results? A low score means relevant chunks exist but are buried under noise. | ≥ 0.60 |
| Contextual Recall | Do the retrieved chunks cover all the facts needed to answer the question? A low score means the retriever missed something important. | ≥ 0.60 |
These metrics are implemented via DeepEval (Confident AI). The conceptual definitions of Faithfulness and Answer Relevancy originate in the RAGAS paper — Es et al., 2023, published at EACL 2024 as a system demonstration — which introduced LLM-based, reference-free scoring for RAG pipelines. Contextual Precision and Contextual Recall were later additions to the RAGAS library, post-publication, and are adopted by DeepEval under the same conceptual definitions.
A sample passes only if all four metrics clear their thresholds. This means a perfectly faithful, well-written answer still fails if the retriever ranked its supporting chunks poorly — which keeps pressure on the whole pipeline, not just the generator.
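The gating logic is simple to state in code. This is a sketch using the thresholds from the table above; the metric key names are illustrative, not DeepEval's internal identifiers.

```python
# Per-metric pass thresholds from the evaluation design.
THRESHOLDS = {
    "faithfulness": 0.70,
    "answer_relevancy": 0.70,
    "contextual_precision": 0.60,
    "contextual_recall": 0.60,
}

def sample_passes(scores: dict[str, float]) -> bool:
    """A sample passes only if every metric clears its threshold."""
    return all(scores[m] >= t for m, t in THRESHOLDS.items())
```

The `all()` is the important part: a single weak metric fails the whole sample, so a polished answer cannot mask a poor retrieval ranking.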
The four metrics deliberately target different failure modes. Faithfulness and Answer Relevancy are generation-side: they catch a model that hallucinates or wanders off topic. Contextual Precision and Contextual Recall are retrieval-side: they catch a retriever that misses relevant chunks or buries them in noise. Separating these allows a pipeline change — say, switching from dense-only retrieval to hybrid BM25 + dense — to be attributed precisely to either retrieval or generation quality shifts.
A Sample in Detail
syn010 is representative of what the system does well — a debugging question about a non-obvious root cause:
Query
Why does MoonDeckStream fail with ‘Another instance already running!’ after a stream ends, and what is the root cause involving SIGTERM and quick_exit()?
Generated Answer (truncated)
The issue occurs because when Sunshine sends SIGTERM to MoonDeckStream, the signal handler calls std::quick_exit(128 + signum). Unlike a normal exit, quick_exit() does not run C++ destructors — so the QSharedMemory and QSystemSemaphore objects used as the singleton lock are never cleaned up. On Linux, System V IPC segments persist until explicitly released, so on the next launch the singleton check finds the orphaned segment and refuses to start.
Faithfulness
The score is 0.92 because the actual output incorrectly attributes management of QSharedMemory and QSystemSemaphore to the destructor of SingleInstanceGuard, while it is actually the responsibility of MoonDeckStream to ensure only one instance runs.
Answer Relevancy
The score is 1.00 because the response directly addresses the question about MoonDeckStream’s failure and the involvement of SIGTERM and quick_exit() without any irrelevant statements.
Contextual Precision
The score is 0.89 because the relevant nodes are well-ranked, with the first two providing direct answers. However, an irrelevant node in third position (a timeline without SIGTERM context) slightly lowers the score.
Contextual Recall
The score is 1.00 because every aspect of the expected output is thoroughly supported by the retrieval context — SIGTERM behaviour, quick_exit() semantics, and System V IPC persistence are all present.
The faithfulness deduction (0.92 rather than a perfect 1.00) is worth noting: the judge caught that the answer attributed cleanup responsibility to a destructor that doesn't exist in the described architecture. A human reviewer reading quickly might have missed it. This is exactly the kind of subtle factual error — correct root cause, wrong attribution — that the evaluation is designed to surface.
RAG Improvements: First A/B Comparison
With the evaluation infrastructure in place, three changes to the pipeline were applied simultaneously and compared against the baseline (chunk size 512, no deduplication, soft grounding prompt):
Chunk size 512 → 1024. The initial chunk_size=512 produced ~100-token chunks — too small for multi-fact queries. Many documentation sections span 500–2,000 characters; a 512-char limit forces mid-sentence splits that produce incoherent fragments. Bumping to 1024 chars with 128-char overlap keeps more sections whole and doubles information density per retrieval slot without increasing top_k.
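The arithmetic behind the change can be shown with a minimal fixed-stride chunker. This is not the pipeline's recursive splitter, just a sketch of the stride = size - overlap relationship that determines how much of each section survives intact.

```python
def chunk_text(text: str, size: int = 1024, overlap: int = 128) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `overlap` chars."""
    stride = size - overlap  # 896 chars of new content per chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), stride)]
```

With size 1024 a 1,500-character section yields two chunks whose shared 128-character overlap preserves continuity across the split point; at size 512 the same section would be cut into four fragments.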
Query-time deduplication. One sample (syn004) consumed 4 of 5 retrieval slots with near-identical systemctl status blobs from different log files. The fix: over-fetch 2 × top_k from the vector store, apply text-hash deduplication in retriever.py, and return up to top_k unique chunks. This frees slots that would otherwise be wasted on redundant content.
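The dedup step reduces to a few lines. In this sketch, `store.search` stands in for the VectorStore abstraction and is an assumed interface, not the project's actual `retriever.py` code.

```python
import hashlib

def dedup_retrieve(query: str, store, top_k: int = 5) -> list[str]:
    """Over-fetch 2x top_k, drop exact-text duplicates, keep rank order."""
    candidates = store.search(query, k=2 * top_k)  # assumed VectorStore API
    seen, unique = set(), []
    for chunk in candidates:
        h = hashlib.sha256(chunk.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(chunk)
        if len(unique) == top_k:
            break
    return unique
```

Hashing rather than comparing full strings keeps the check O(1) per chunk; the over-fetch factor of 2 bounds the extra vector-store work.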
System prompt grounding. Six faithfulness failures shared a pattern: the model supplemented retrieved context with pre-training knowledge. The prompt changed from “always ground your answers” (soft guidance) to “Answer using ONLY the provided context. Do not add information from your own knowledge” (explicit prohibition). This improved Answer Relevancy (+0.063 mean) but introduced one regression — a borderline case tipped toward over-cautious abstention.
| Pipeline | Chunk Size | Dedup | Pass Rate | Faithfulness | Answer Relevancy | Ctx Precision | Ctx Recall |
|---|---|---|---|---|---|---|---|
| Baseline | 512 / overlap 64 | — | 40% | 0.809 | 0.819 | 0.781 | 0.931 |
| + chunking + dedup + prompt | 1024 / overlap 128 | text-hash | 44% | 0.778 | 0.882 | 0.738 | 0.879 |
Pass rate improved by one sample (net +4 pp). The dominant factor was larger chunks: five samples where sections had been split mid-sentence now retrieved coherent context and passed. Four regressions partially offset these gains — three generation-side (over-cautious abstention, hallucinated role inference, marginal faithfulness miss) and one retrieval-side (embedding space reshaped by larger chunks, causing a niche query to lose ranking). The retrieval metric means dipped slightly across the board, consistent with larger chunks producing broader embeddings that compete differently in vector similarity.
Remaining failure modes after this change: hallucination is now the dominant pattern (7 of 14 failures), followed by retrieval ranking (relevant chunk present but buried at rank 4–5) and retrieval miss (facts not in the top-5 at all).
Measurement Reliability: Variance and Aggregation
Comparability between pipeline variants doesn’t come from making individual runs deterministic — it comes from running multiple times and aggregating. This stack has irreducible variance at two layers, and trying to engineer it away with seeds and API pinning has real maintenance cost and incomplete payoff. The right response is to measure the noise floor empirically and use median-over-N runs as the comparison primitive.
Where the variance lives. Retrieval is stable: Chroma with cosine distance and a fixed embedding model produces 25/25 identical contexts across back-to-back runs against an unchanged index. Variance enters at the two LLM layers:
Local generation (LM Studio / llama.cpp) — GPU floating-point non-associativity and batch scheduling mean llama.cpp does not produce bitwise-identical output even at temperature: 0. Dropping from temperature: 0.1 to 0.0 improved answer stability from 16% to 60% identical across runs — a meaningful improvement, but not determinism.
GPT-4o-mini judge — even when the generated answer is identical across two runs, judge scores vary. In back-to-back temperature: 0 runs, 11 of 15 same-answer samples received different scores, and 4 of those flipped pass/fail status. This is the dominant remaining noise source. Per-metric jitter:
| Metric | Stdev (same-answer pairs) | Pass/fail flips | Notes |
|---|---|---|---|
| Faithfulness | 0.106 | 4/25 | Multi-step claim extraction amplifies per-call variance |
| Contextual Recall | 0.164 | — | LLM reasoning against gold is subjective |
| Answer Relevancy | 0.051 | — | Most stable generation-side metric |
| Contextual Precision | 0.047 | — | Most stable retrieval-side metric |
Observed pass-rate bands. Six consecutive runs across two temperature settings establish the empirical noise floor:
| Condition | Pass rate range | Band |
|---|---|---|
| `temperature: 0.1` (4 runs) | 44–52% | 8 pp |
| `temperature: 0.0` (2 runs) | 48–60% | 12 pp |
A single run can land anywhere in a ~12 pp window. A pass-rate shift of 4 pp between two pipeline variants is within the noise; a consistent shift of 8+ pp across multiple runs is signal.
How to compare runs. The repo includes scripts/summarize_eval_runs.py, which reads meta.json from any set of run directories and prints min / max / median pass rate and metric means per dataset. The workflow for a meaningful pipeline comparison:
- Run the eval 3× for each pipeline variant
- Pass the run directories to `summarize_eval_runs.py`
- Compare medians — a change clears the noise floor if it exceeds ~5 pp on pass rate or ~0.05 on any metric mean
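A minimal version of the aggregation looks like this. The `meta.json` field name is an assumption for illustration; the real script also reports per-metric means per dataset.

```python
import json
from pathlib import Path
from statistics import median

def summarize_runs(run_dirs: list[Path]) -> dict:
    """Min / max / median pass rate across a set of eval run directories."""
    rates = []
    for d in run_dirs:
        meta = json.loads((Path(d) / "meta.json").read_text())
        rates.append(meta["pass_rate"])  # field name assumed
    return {"min": min(rates), "max": max(rates), "median": median(rates)}
```

Median rather than mean is deliberate: with only three runs, one outlier judge-noise run would drag a mean but leaves the median untouched.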
The A/B results elsewhere on this page are single-run comparisons. They show directional trends; treat them as hypotheses to confirm with multi-run aggregation.
Next: Retrieval Strategy A/B Tests
The infrastructure supports arbitrary pipeline swaps — any change to chunking, retrieval, or generation can be measured against the same 25-sample benchmark and compared by metric. The next comparison candidates, ordered by expected impact on the current failure modes:
Cross-encoder reranking. Three samples fail because the relevant chunk is retrieved but ranked 4th or 5th. A lightweight cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) re-scores the top-k candidates after vector retrieval, trading latency for precision. This directly targets the ranking failures without changing what gets retrieved.
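The rerank step is just a re-sort by a pairwise relevance score. In this sketch `score_fn` is injected to keep it library-agnostic; with sentence-transformers it would plausibly be `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`, which scores a list of (query, passage) pairs.

```python
def rerank(query: str, chunks: list[str], score_fn, keep: int = 5) -> list[str]:
    """Re-order retrieved chunks by a cross-encoder score over (query, chunk) pairs."""
    scores = score_fn([(query, c) for c in chunks])
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:keep]]
```

Note that reranking only permutes the candidate set, so it can fix the rank-4-or-5 failures but cannot recover chunks the vector search never returned.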
Hybrid search (dense + BM25). Four samples fail because the relevant chunk is never retrieved at all — the dense vector search misses it. BM25 sparse retrieval is strong on exact term matches (command names, env var strings, file paths) — exactly the kind of facts that fail in this corpus. A hybrid index combining Reciprocal Rank Fusion of dense and sparse scores should recover these misses.
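Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the conventional damping constant.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse two ranked doc-id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because RRF operates on ranks rather than raw scores, it needs no calibration between the dense cosine distances and the BM25 scores, which live on incompatible scales.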
HyDE (Hypothetical Document Embeddings). Instead of embedding the raw query, HyDE prompts the LLM to generate a hypothetical answer document first, then embeds that. The hypothesis lives in the same semantic space as the corpus chunks, which can improve recall for questions phrased differently from the source text.
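The HyDE flow can be sketched with the LLM and vector store injected as callables; both interfaces here are hypothetical stand-ins, not this project's actual classes.

```python
def hyde_search(query: str, llm, store, top_k: int = 5) -> list[str]:
    """HyDE: retrieve against a hypothetical answer instead of the raw query."""
    # Step 1: have the LLM draft a plausible answer passage (may be wrong in detail --
    # what matters is that its vocabulary resembles the target documents).
    hypothetical = llm(f"Write a short passage that answers: {query}")
    # Step 2: search with the hypothetical text; the store embeds it internally.
    return store.search(hypothetical, k=top_k)
```

The trade-off is one extra LLM call per query, and a risk that a badly hallucinated hypothetical steers retrieval toward the wrong neighborhood.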
AST-aware code chunking. The current RecursiveChunker splits code at `class`, `def`, and blank-line boundaries with a character limit. Tree-sitter parsing (via chonkie's CodeChunker) splits at actual AST nodes — functions are never cut mid-body, and scope context is preserved. This is primarily a chunking-quality improvement, but it should help the install-script and systemd categories where function-level context matters.
Each of these represents a single, independently measurable pipeline change — the benchmark stays fixed, and the four-metric scorecard separates retrieval effects from generation effects.