Evaluation Suite

The evaluation framework is the backbone of RAGhelm. It provides:

- Golden dataset: 100 curated Q&A pairs across 5 open-license RPG systems
- Retrieval metrics: Recall@k, MRR, NDCG@k
- Generation scoring: LLM-as-judge faithfulness, relevance, and completeness
- Regression testing: automatic detection of performance degradation
- Latency benchmarking: P50/P95/P99 latency for retrieval and generation
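To make the retrieval metrics listed above concrete, here is a minimal sketch of Recall@k, MRR, and NDCG@k with binary relevance. It is not the code in `metrics.py`; the function names and signatures are assumptions chosen for illustration, and it assumes each ranked list contains no duplicate document IDs.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains (1 for a relevant document, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: every relevant document placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical document IDs for a single query.
    retrieved = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(recall_at_k(retrieved, relevant, k=3))   # 0.5: one of two relevant docs in the top 3
    print(reciprocal_rank(retrieved, relevant))    # 0.5: first relevant doc is at rank 2
    print(ndcg_at_k(retrieved, relevant, k=4))
```

Graded relevance labels would need a different gain term in `ndcg_at_k`; the binary form is used here only because the source does not say how relevance is labeled.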

Architecture

```python
# The eval runner orchestrates the full pipeline
from raghelm.eval.runner import EvalRunner

# Point the runner at the golden dataset, then run the full suite
runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")
```

Key Components

| Component | Purpose |
|-----------|---------|
| `golden_dataset.py` | Load and validate evaluation datasets |
| `runner.py` | Orchestrate eval runs |
| `metrics.py` | Compute Recall@k, MRR, NDCG@k |
| `scorer.py` | LLM-as-judge generation scoring |
| `regression.py` | Detect regressions against a baseline |
| `benchmark.py` | Latency benchmarking |
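As an illustration of what the benchmarking and regression components do conceptually, the sketch below summarizes latency samples as P50/P95/P99 and flags metrics that got worse than a stored baseline. It does not reflect the actual code in `benchmark.py` or `regression.py`; the function names, the metric keys, the nearest-rank percentile method, and the 5% tolerance are all assumptions.

```python
import math
from typing import Collection, Dict, List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples (milliseconds) as P50/P95/P99."""
    return {f"p{pct}": percentile(samples_ms, pct) for pct in (50, 95, 99)}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,               # assumed: 5% relative slack
    lower_is_better: Collection[str] = (),  # e.g. latency metrics
) -> List[str]:
    """Return the names of metrics that are worse than baseline by more than `tolerance`."""
    regressions = []
    for name, base in baseline.items():
        if name not in current:
            continue  # metric not measured in this run; nothing to compare
        value = current[name]
        if name in lower_is_better:
            worse = value > base * (1 + tolerance)
        else:
            worse = value < base * (1 - tolerance)
        if worse:
            regressions.append(name)
    return regressions


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(latency_summary([float(ms) for ms in range(1, 101)]))
    baseline = {"recall@5": 0.80, "mrr": 0.70, "p95_ms": 200.0}
    current = {"recall@5": 0.74, "mrr": 0.69, "p95_ms": 260.0}
    print(find_regressions(baseline, current, lower_is_better={"p95_ms"}))
```

A relative tolerance keeps small run-to-run noise from triggering false alarms; a real suite might instead use per-metric absolute thresholds.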

Golden Dataset Sources

All 100 examples are sourced from open-license RPG systems:

| Source | License | Examples |
|--------|---------|----------|
| Cairn RPG | CC-BY-SA 4.0 | 20 |
| SCP Foundation | CC-BY-SA 3.0 | 20 |
| Fate Core | CC-BY 3.0 | 20 |
| Dungeon World | CC-BY 3.0 | 20 |
| D&D SRD 5.1 | CC-BY 4.0 | 20 |
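A per-source breakdown like the one above is easy to check automatically when the dataset is loaded. The sketch below is one way to do that, under a hypothetical record schema with `question`, `answer`, and `source` fields; the real layout of `golden_dataset.json` and the validation done by `golden_dataset.py` may differ.

```python
import json
from collections import Counter
from typing import Dict, List

# Hypothetical schema: the real dataset may use different field names.
REQUIRED_FIELDS = ("question", "answer", "source")


def count_by_source(examples: List[dict]) -> Counter:
    """Check that each example has the required fields, then count examples per source."""
    counts: Counter = Counter()
    for index, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if not example.get(field)]
        if missing:
            raise ValueError(f"example {index} is missing fields: {missing}")
        counts[example["source"]] += 1
    return counts


def count_mismatches(counts: Counter, expected: Dict[str, int]) -> Dict[str, tuple]:
    """Map each source whose count differs from the expected count to (expected, actual)."""
    sources = set(expected) | set(counts)
    return {
        source: (expected.get(source, 0), counts.get(source, 0))
        for source in sources
        if expected.get(source, 0) != counts.get(source, 0)
    }


if __name__ == "__main__":
    # Placeholder records standing in for the contents of the dataset file.
    raw = json.dumps([
        {"question": "placeholder question 1", "answer": "placeholder answer 1", "source": "Cairn RPG"},
        {"question": "placeholder question 2", "answer": "placeholder answer 2", "source": "Fate Core"},
    ])
    counts = count_by_source(json.loads(raw))
    print(count_mismatches(counts, {"Cairn RPG": 1, "Fate Core": 1}))  # {} when counts match
```

Running a check of this kind in CI would catch a drift between the documented per-source counts and the dataset file.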