
Eval Suite for Local, Codebase RAG

DeepEval-powered evaluation infrastructure for measuring and comparing RAG pipeline variants — faithfulness, retrieval precision, and answer quality across iterative configuration changes.

Personal Project • March 2026

Python · RAG · LLMs · DeepEval · LM Studio · Vector Search · ChromaDB

RAG systems are easy to build and hard to trust. You can assemble a retriever, a vector store, and a chat model in an afternoon — but without a principled way to measure retrieval quality, faithfulness, and answer relevancy, every configuration change is a guess. This project is the measurement layer: a fully local RAG pipeline with a rigorous evaluation suite built on DeepEval, designed around iterative A/B comparison against a fixed benchmark.

The system runs entirely on local hardware (RTX 3090) via LM Studio, with GPT-4o-mini as the judge model for metric scoring. It targets a realistic mixed corpus — markdown documentation, shell scripts, and config files from a personal codebase — and tracks pipeline performance across four independent metrics as retrieval and generation strategies are swapped in and out.

System Architecture

The pipeline has two layers: a RAG pipeline that ingests documents, chunks them, stores embeddings in ChromaDB via a VectorStore abstraction, and generates answers at query time — and an evaluation layer that scores those answers across four DeepEval metrics.

flowchart LR
  D[Documents] --> C[Chunker] --> VS
  Q[Query] --> R[Retriever]
  R -->|embed + search| VS[(ChromaDB)]
  VS -->|ranked chunks| R
  R --> G[Generator] --> A[Answer]
  A --> F[Faithfulness]
  A --> AR[Answer Relevancy]
  R --> CP[Ctx Precision]
  R --> CR[Ctx Recall]

  classDef stable fill:#14532d,stroke:#4ade80,color:#bbf7d0
  classDef localVar fill:#78350f,stroke:#fbbf24,color:#fef3c7
  classDef judgeVar fill:#7f1d1d,stroke:#f87171,color:#fee2e2

  class R,VS stable
  class G,A localVar
  class F,AR,CP,CR judgeVar
  • Retrieval — deterministic (25/25 identical contexts)
  • Generation — local LLM variance (llama.cpp)
  • Evaluation — judge scoring variance (GPT-4o-mini)

Every eval run writes a timestamped artifact folder (artifacts/eval_runs/<UTC>/) containing a structured report.json and a human-readable report.md with per-sample scores, metric reasoning, and pass/fail summaries. This makes regressions visible and improvements measurable — each pipeline variant leaves a permanent, comparable record.
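A minimal sketch of the artifact-writing step, assuming a list of per-sample result dicts — the `write_run_artifacts` helper and its field names are illustrative, not the repo's actual code:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_run_artifacts(results: list[dict], root: str = "artifacts/eval_runs") -> Path:
    """Write one eval run to a UTC-timestamped folder: report.json + report.md."""
    run_dir = Path(root) / datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Machine-readable record for later aggregation across runs.
    (run_dir / "report.json").write_text(json.dumps(results, indent=2))

    # Human-readable summary with per-sample pass/fail lines.
    lines = ["# Eval run", ""]
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        lines.append(f"- {r['id']}: {status} (faithfulness={r['faithfulness']:.3f})")
    (run_dir / "report.md").write_text("\n".join(lines))
    return run_dir
```

Because each run lands in its own timestamped directory, no run ever overwrites another — the permanent record is a side effect of the layout.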

The Dataset: Synthetic Q&A over a Real Codebase

The evaluation corpus is the documentation and configuration files for a personal Linux streaming setup (StreamDeck2) — markdown guides, install scripts, systemd unit templates, config file templates — the kind of mixed technical content that RAG systems frequently encounter in practice.

The 25-question synthetic dataset was generated to cover specific, verifiable facts from this corpus: exact file paths, command-line flags, environment variable names, systemd directives. These are questions that have unambiguous right answers, which makes metric calibration meaningful.

The questions span five categories:

  • Installation (install, 10 questions) — exact behaviour of install.sh: default user values, config file paths and modes, environment variables that control the install, AUR packages required
  • Systemd (systemd, 4 questions) — service unit directives: User=, Requires=, After=, ExecStart= commands
  • Networking (networking, 2 questions) — firewall port requirements, service ports, pairing protocols
  • MoonDeck / inter-process (moondeck, 6 questions) — IPC primitives, singleton mechanics, Qt shared memory, env var contracts between processes
  • Xorg / display (xorg, 3 questions) — Xorg flags, Modeline config, input isolation mechanics

Difficulty is weighted toward medium questions that require pulling specific details from a single section, with a few that require correlating facts across two or more document chunks.
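For illustration, one sample and the category tally might look like the following — the field names are assumptions, not the repo's actual schema:

```python
# One benchmark sample — field names are illustrative, not the repo's exact schema.
SAMPLE = {
    "id": "syn010",
    "category": "moondeck",   # one of: install, systemd, networking, moondeck, xorg
    "difficulty": "medium",
    "query": "Why does MoonDeckStream fail with 'Another instance already running!' "
             "after a stream ends?",
    "expected_answer": "quick_exit() skips destructors, so the shared-memory singleton "
                       "lock is never released and the next launch refuses to start.",
    # No pre-specified context: the retriever must find supporting chunks itself.
}

# Category sizes as described above — they must total the 25-sample benchmark.
CATEGORY_COUNTS = {"install": 10, "systemd": 4, "networking": 2, "moondeck": 6, "xorg": 3}
```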

How the Evaluation Works

Each sample in the benchmark has a query, an expected answer, and no pre-specified context — the retriever must find the relevant chunks on its own. The pipeline generates an answer, then a judge model scores it across four independent metrics.

| Metric | What it measures | Threshold |
| --- | --- | --- |
| Faithfulness | Are all claims in the answer supported by the retrieved chunks? Catches hallucination — the model adding facts beyond what was retrieved. | ≥ 0.70 |
| Answer Relevancy | Does the answer actually address the question asked? Catches tangential or evasive responses. | ≥ 0.70 |
| Contextual Precision | Are the most relevant chunks ranked highest in the retrieval results? A low score means relevant chunks exist but are buried under noise. | ≥ 0.60 |
| Contextual Recall | Do the retrieved chunks cover all the facts needed to answer the question? A low score means the retriever missed something important. | ≥ 0.60 |

These metrics are implemented via DeepEval (Confident AI). The conceptual definitions of Faithfulness and Answer Relevancy originate in the RAGAS paper — Es et al., 2023, published at EACL 2024 as a system demonstration — which introduced LLM-based, reference-free scoring for RAG pipelines. Contextual Precision and Contextual Recall were later additions to the RAGAS library, post-publication, and are adopted by DeepEval under the same conceptual definitions.

A sample passes only if all four metrics clear their thresholds. This means a perfectly faithful, well-written answer still fails if the retriever ranked its supporting chunks poorly — which keeps pressure on the whole pipeline, not just the generator.

The four metrics deliberately target different failure modes. Faithfulness and Answer Relevancy are generation-side: they catch a model that hallucinates or wanders off topic. Contextual Precision and Contextual Recall are retrieval-side: they catch a retriever that misses relevant chunks or buries them in noise. Separating these allows a pipeline change — say, switching from dense-only retrieval to hybrid BM25 + dense — to be attributed precisely to either retrieval or generation quality shifts.
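The all-or-nothing pass rule is simple to state in code — a sketch, with the metric keys and thresholds taken from the table above:

```python
# Thresholds from the metric table; a sample passes only if every metric clears its bar.
THRESHOLDS = {
    "faithfulness": 0.70,
    "answer_relevancy": 0.70,
    "contextual_precision": 0.60,
    "contextual_recall": 0.60,
}

def sample_passes(scores: dict[str, float]) -> bool:
    """True only when all four metrics meet or exceed their thresholds."""
    return all(scores[metric] >= bar for metric, bar in THRESHOLDS.items())
```

A perfect generation score cannot rescue a poor retrieval score — the conjunction is what keeps pressure on the whole pipeline.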

A Sample in Detail

syn010 is representative of what the system does well — a debugging question about a non-obvious root cause:

Query

Why does MoonDeckStream fail with ‘Another instance already running!’ after a stream ends, and what is the root cause involving SIGTERM and quick_exit()?

Generated Answer (truncated)

The issue occurs because when Sunshine sends SIGTERM to MoonDeckStream, the signal handler calls std::quick_exit(128 + signum). Unlike a normal exit, quick_exit() does not run C++ destructors — so the QSharedMemory and QSystemSemaphore objects used as the singleton lock are never cleaned up. On Linux, System V IPC segments persist until explicitly released, so on the next launch the singleton check finds the orphaned segment and refuses to start.

Faithfulness — ✓ 0.917

The score is 0.92 because the actual output incorrectly attributes management of QSharedMemory and QSystemSemaphore to the destructor of SingleInstanceGuard, while it is actually the responsibility of MoonDeckStream to ensure only one instance runs.

Answer Relevancy — ✓ 1.000

The score is 1.00 because the response directly addresses the question about MoonDeckStream’s failure and the involvement of SIGTERM and quick_exit() without any irrelevant statements.

Contextual Precision — ✓ 0.887

The score is 0.89 because the relevant nodes are well-ranked, with the first two providing direct answers. However, an irrelevant node in third position (a timeline without SIGTERM context) slightly lowers the score.

Contextual Recall — ✓ 1.000

The score is 1.00 because every aspect of the expected output is thoroughly supported by the retrieval context — SIGTERM behaviour, quick_exit() semantics, and System V IPC persistence are all present.

The 0.917 faithfulness deduction is worth noting: the judge caught that the answer attributed cleanup responsibility to a destructor that doesn’t exist in the described architecture. A human reviewer reading quickly might have missed it. This is exactly the kind of subtle factual error — correct root cause, wrong attribution — that the evaluation is designed to surface.

RAG Improvements: First A/B Comparison

With the evaluation infrastructure in place, three changes to the pipeline were applied simultaneously and compared against the baseline (chunk size 512, no deduplication, soft grounding prompt):

Chunk size 512 → 1024. The initial chunk_size=512 produced ~100-token chunks — too small for multi-fact queries. Many documentation sections span 500–2,000 characters; a 512-char limit forces mid-sentence splits that produce incoherent fragments. Bumping to 1024 chars with 128-char overlap keeps more sections whole and doubles information density per retrieval slot without increasing top_k.
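The size/overlap arithmetic can be sketched as a plain sliding window — a simplification, since the repo's RecursiveChunker also respects structural boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    """Character-based sliding-window chunking; consecutive chunks share `overlap` chars."""
    step = chunk_size - overlap  # advance 896 chars per chunk at the new settings
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

At 1024/128, a 2,000-character documentation section becomes three chunks instead of five or six at 512/64 — and the first chunk alone covers half the section, which is why multi-fact queries stop needing lucky retrieval.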

Query-time deduplication. One sample (syn004) consumed 4 of 5 retrieval slots with near-identical systemctl status blobs from different log files. The fix: over-fetch 2 × top_k from the vector store, apply text-hash deduplication in retriever.py, and return up to top_k unique chunks. This frees slots that would otherwise be wasted on redundant content.
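A sketch of the over-fetch-and-dedup step — the `store.search(query, k)` method is an assumption standing in for the actual retriever.py interface:

```python
import hashlib

def retrieve_deduped(store, query: str, top_k: int = 5) -> list[str]:
    """Over-fetch 2×top_k, drop text-hash duplicates, return up to top_k unique chunks."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in store.search(query, k=2 * top_k):  # assumed vector-store API
        digest = hashlib.sha256(chunk.strip().encode()).hexdigest()
        if digest in seen:
            continue  # near-identical blob already holds a slot
        seen.add(digest)
        unique.append(chunk)
        if len(unique) == top_k:
            break
    return unique
```

Hashing on stripped text catches byte-identical and whitespace-only variants; genuinely near-identical (but not identical) chunks would need fuzzier matching, which this sketch deliberately avoids.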

System prompt grounding. Six faithfulness failures shared a pattern: the model supplemented retrieved context with pre-training knowledge. The prompt changed from “always ground your answers” (soft guidance) to “Answer using ONLY the provided context. Do not add information from your own knowledge” (explicit prohibition). This improved Answer Relevancy (+0.063 mean) but introduced one regression — a borderline case tipped toward over-cautious abstention.

| Pipeline | Chunk size | Dedup | Pass rate | Faithfulness | Answer Relevancy | Ctx Precision | Ctx Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 512 / overlap 64 | — | 40% | 0.809 | 0.819 | 0.781 | 0.931 |
| + chunking + dedup + prompt | 1024 / overlap 128 | text-hash | 44% | 0.778 | 0.882 | 0.738 | 0.879 |

Pass rate improved by one sample (net +4 pp). The dominant factor was larger chunks: five samples where sections had been split mid-sentence now retrieved coherent context and passed. Four regressions partially offset these gains — three generation-side (over-cautious abstention, hallucinated role inference, marginal faithfulness miss) and one retrieval-side (embedding space reshaped by larger chunks, causing a niche query to lose ranking). The retrieval metric means dipped slightly across the board, consistent with larger chunks producing broader embeddings that compete differently in vector similarity.

Remaining failure modes after this change: hallucination is now the dominant pattern (7 of 14 failures), followed by retrieval ranking (relevant chunk present but buried at rank 4–5) and retrieval miss (facts not in the top-5 at all).

Measurement Reliability: Variance and Aggregation

Comparability between pipeline variants doesn’t come from making individual runs deterministic — it comes from running multiple times and aggregating. This stack has irreducible variance at two layers, and trying to engineer it away with seeds and API pinning has real maintenance cost and incomplete payoff. The right response is to measure the noise floor empirically and use median-over-N runs as the comparison primitive.

Where the variance lives. Retrieval is stable: Chroma with cosine distance and a fixed embedding model produces 25/25 identical contexts across back-to-back runs against an unchanged index. Variance enters at the two LLM layers:

Local generation (LM Studio / llama.cpp) — GPU floating-point non-associativity and batch scheduling mean llama.cpp does not produce bitwise-identical output even at temperature: 0. Dropping from temperature: 0.1 to 0.0 improved answer stability from 16% to 60% identical across runs — a meaningful improvement, but not determinism.

GPT-4o-mini judge — even when the generated answer is identical across two runs, judge scores vary. In back-to-back temperature: 0 runs, 11 of 15 same-answer samples received different scores, and 4 of those flipped pass/fail status. This is the dominant remaining noise source. Per-metric jitter:

| Metric | Stdev (same-answer pairs) | Pass/fail flips | Notes |
| --- | --- | --- | --- |
| Faithfulness | 0.106 | 4/25 | Multi-step claim extraction amplifies per-call variance |
| Contextual Recall | 0.164 | — | LLM reasoning against gold is subjective |
| Answer Relevancy | 0.051 | — | Most stable generation-side metric |
| Contextual Precision | 0.047 | — | Most stable retrieval-side metric |

Observed pass-rate bands. Six consecutive runs across two temperature settings establish the empirical noise floor:

| Condition | Pass rate range | Band |
| --- | --- | --- |
| temperature: 0.1 (4 runs) | 44–52% | 8 pp |
| temperature: 0.0 (2 runs) | 48–60% | 12 pp |

A single run can land anywhere in a ~12 pp window. A pass-rate shift of 4 pp between two pipeline variants is within the noise; a consistent shift of 8+ pp across multiple runs is signal.

How to compare runs. The repo includes scripts/summarize_eval_runs.py, which reads meta.json from any set of run directories and prints min / max / median pass rate and metric means per dataset. The workflow for a meaningful pipeline comparison:

  1. Run the eval 3× for each pipeline variant
  2. Pass the run directories to summarize_eval_runs.py
  3. Compare medians — a change clears the noise floor if it exceeds ~5 pp on pass rate or ~0.05 on any metric mean
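The median comparison in step 3 can be sketched as follows, with pass rates given in percent and the ~5 pp noise floor from the empirical bands above:

```python
from statistics import median

def compare_variants(base_rates: list[float], new_rates: list[float],
                     noise_floor_pp: float = 5.0) -> str:
    """Compare median pass rates (in %) from N runs per variant against the noise floor."""
    delta = median(new_rates) - median(base_rates)
    if abs(delta) <= noise_floor_pp:
        return f"within noise ({delta:+.1f} pp)"
    return f"{'improvement' if delta > 0 else 'regression'} ({delta:+.1f} pp)"
```

Medians rather than means, because a single outlier run (a judge having a bad day) should not move the comparison.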

The A/B results elsewhere on this page are single-run comparisons. They show directional trends; treat them as hypotheses to confirm with multi-run aggregation.

Next: Retrieval Strategy A/B Tests

The infrastructure supports arbitrary pipeline swaps — any change to chunking, retrieval, or generation can be measured against the same 25-sample benchmark and compared by metric. The next comparison candidates, ordered by expected impact on the current failure modes:

Cross-encoder reranking. Three samples fail because the relevant chunk is retrieved but ranked 4th or 5th. A lightweight cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) re-scores the top-k candidates after vector retrieval, trading latency for precision. This directly targets the ranking failures without changing what gets retrieved.

Hybrid search (dense + BM25). Four samples fail because the relevant chunk is never retrieved at all — the dense vector search misses it. BM25 sparse retrieval is strong on exact term matches (command names, env var strings, file paths) — exactly the kind of facts that fail in this corpus. A hybrid index combining Reciprocal Rank Fusion of dense and sparse scores should recover these misses.
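Reciprocal Rank Fusion itself is only a few lines — a sketch over two ranked ID lists, using the conventional k = 60 smoothing constant:

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in either list float to the top of the fused order.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it needs no calibration between the incompatible score scales of BM25 and cosine similarity — which is the main reason it is the default fusion choice for hybrid search.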

HyDE (Hypothetical Document Embeddings). Instead of embedding the raw query, HyDE prompts the LLM to generate a hypothetical answer document first, then embeds that. The hypothesis lives in the same semantic space as the corpus chunks, which can improve recall for questions phrased differently from the source text.

AST-aware code chunking. The current RecursiveChunker splits code at class/def/\n\n boundaries with a character limit. Tree-sitter parsing (via chonkie’s CodeChunker) splits at actual AST nodes — functions are never cut mid-body, and scope context is preserved. This is primarily a chunking quality improvement, but it should help the install-script and systemd categories where function-level context matters.

Each of these represents a single, independently measurable pipeline change — the benchmark stays fixed, and the four-metric scorecard separates retrieval effects from generation effects.