Evaluation Suite

The evaluation framework is the backbone of RAGhelm. It provides:

  • Golden dataset: 100 curated Q&A pairs across 5 open-license RPG systems
  • Retrieval metrics: Recall@k, MRR, NDCG@k
  • Generation scoring: LLM-as-judge scoring of faithfulness, relevance, and completeness
  • Regression testing: automatic detection of performance degradation against a baseline
  • Latency benchmarking: P50/P95/P99 latencies for retrieval and generation
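
To make the retrieval metrics concrete, the following is a minimal sketch of Recall@k, MRR, and NDCG@k with binary relevance. It is not the code in metrics.py: the function names and signatures are hypothetical, and it assumes each retrieved chunk ID appears at most once in a ranking.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant ID first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative query: two relevant chunks, found at ranks 2 and 3.
retrieved_ids = ["c3", "c7", "c1", "c9"]
relevant_ids = {"c1", "c7"}
print(recall_at_k(retrieved_ids, relevant_ids, k=2))
print(reciprocal_rank(retrieved_ids, relevant_ids))
print(ndcg_at_k(retrieved_ids, relevant_ids, k=3))
```

Binary relevance keeps the sketch short. If the golden dataset carries graded relevance labels, the gain term in NDCG would use those grades instead of 1.0.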

Architecture

# The eval runner orchestrates the full pipeline
from raghelm.eval.runner import EvalRunner

runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")

Key Components

Component            Purpose
golden_dataset.py    Load and validate evaluation datasets
runner.py            Orchestrate eval runs
metrics.py           Compute Recall@k, MRR, NDCG@k
scorer.py            LLM-as-judge generation scoring
regression.py        Detect regression against baseline
benchmark.py         Latency benchmarking
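
As an illustration of what latency benchmarking and regression detection involve, here is a minimal sketch under stated assumptions. It does not reproduce benchmark.py or regression.py: the function names, the metric names, and the 5% tolerance are all hypothetical. Percentiles come from the standard library's statistics.quantiles, and a metric counts as regressed when it moves in the wrong direction by more than the tolerance relative to a positive baseline value.

```python
import statistics
from typing import Collection, Dict, List, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return P50/P95/P99 of the latency samples (needs at least two samples)."""
    # quantiles(n=100) returns the 99 cut points P1..P99, so P50 is index 49.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
    lower_is_better: Collection[str] = (),
) -> List[str]:
    """Names of metrics that got worse than the baseline by more than `tolerance`.

    Assumes positive baseline values. Metrics missing from `current` are skipped.
    """
    regressed = []
    for name, base in baseline.items():
        if name not in current:
            continue
        value = current[name]
        if name in lower_is_better:
            # Latency-style metric: worse means larger.
            worse = value > base * (1 + tolerance)
        else:
            # Quality-style metric: worse means smaller.
            worse = value < base * (1 - tolerance)
        if worse:
            regressed.append(name)
    return regressed


# Hypothetical baseline and current run, for illustration only.
baseline_run = {"recall@5": 0.80, "mrr": 0.70, "p95_ms": 400.0}
current_run = {"recall@5": 0.79, "mrr": 0.60, "p95_ms": 480.0}
print(find_regressions(baseline_run, current_run, lower_is_better={"p95_ms"}))
```

A relative tolerance keeps small run-to-run noise from tripping the check. A real suite might prefer per-metric thresholds or a statistical test over repeated runs.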

Golden Dataset Sources

All 100 examples are sourced from open-license RPG systems:

Source            License         Examples
Cairn RPG         CC-BY-SA 4.0    20
SCP Foundation    CC-BY-SA 3.0    20
Fate Core         CC-BY 3.0       20
Dungeon World     CC-BY 3.0       20
D&D 5.1 SRD       CC-BY 4.0       20
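
A per-source breakdown like the one above can be checked mechanically when the dataset is loaded. The sketch below assumes a hypothetical schema in which each example is a dict with "question", "answer", and "source" keys. The actual format of golden_dataset.json and the validation logic in golden_dataset.py are not shown here and may differ.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical schema: field names are assumptions, not the real dataset format.
REQUIRED_FIELDS = ("question", "answer", "source")


def count_examples_by_source(examples: Iterable[Dict[str, str]]) -> Counter:
    """Validate each example and return the number of examples per source.

    Raises ValueError if an example lacks a required field or leaves it empty.
    """
    counts: Counter = Counter()
    for index, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if not example.get(field)]
        if missing:
            raise ValueError(f"example {index} is missing fields: {missing}")
        counts[example["source"]] += 1
    return counts


# Tiny illustrative sample with placeholder text, not real dataset entries.
sample = [
    {"question": "q1", "answer": "a1", "source": "Cairn RPG"},
    {"question": "q2", "answer": "a2", "source": "Cairn RPG"},
    {"question": "q3", "answer": "a3", "source": "Fate Core"},
]
counts = count_examples_by_source(sample)
print(dict(counts), "total:", sum(counts.values()))
```

Comparing the returned counts and their total against the documented figures is a cheap way to catch a dataset that has drifted from its description.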