Evaluation Suite
The evaluation framework is the backbone of RAGhelm. It provides:
- Golden dataset: 100 curated Q&A pairs across 5 open-license RPG systems
- Retrieval metrics: Recall@k, MRR, NDCG@k
- Generation scoring: LLM-as-judge faithfulness, relevance, completeness
- Regression testing: Automatic detection of performance degradation
- Latency benchmarking: P50/P95/P99 metrics for retrieval and generation
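
The retrieval metrics listed above have standard definitions. The following is a minimal sketch of them, assuming binary relevance (a document is either relevant or not). It is not RAGhelm's actual `metrics.py`, whose signatures are not shown on this page; all function names here are hypothetical.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For the ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2"}`, `recall_at_k(..., k=3)` is 0.5 (one of two relevant documents is in the top 3) and the reciprocal rank is 0.5 (the first hit is at rank 2).
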
Architecture
```python
# The eval runner orchestrates the full pipeline
from raghelm.eval.runner import EvalRunner

runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")
```
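
Latency benchmarking, one of the stages the suite covers, comes down to timing repeated calls and reporting percentiles. The sketch below shows one way to do that with the standard library. It is an illustration under assumptions, not the code in `benchmark.py`: the names `summarize_latencies` and `benchmark_latency` are hypothetical, and the callable being timed stands in for a retrieval or generation call.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def summarize_latencies(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return P50/P95/P99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index i - 1 is the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def benchmark_latency(fn: Callable[[], object], runs: int = 100) -> Dict[str, float]:
    """Time `fn` repeatedly and summarize the wall-clock latencies."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # stand-in for a retrieval or generation call
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize_latencies(samples_ms)
```

The `inclusive` method treats the samples as the whole population, so P99 never exceeds the slowest observed call. Tail percentiles are only meaningful with enough runs; with fewer than about 100 samples, P99 is mostly an interpolation near the maximum.
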
Key Components
| Component | Purpose |
|---|---|
| `golden_dataset.py` | Load and validate evaluation datasets |
| `runner.py` | Orchestrate eval runs |
| `metrics.py` | Compute Recall@k, MRR, NDCG@k |
| `scorer.py` | LLM-as-judge generation scoring |
| `regression.py` | Detect regression against baseline |
| `benchmark.py` | Latency benchmarking |
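
Regression detection against a baseline can be pictured as a per-metric comparison with a tolerance. The sketch below is one plausible shape for such a check, not the logic in `regression.py`. The function name, the 2% default tolerance, and the metric keys are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical default: allow a 2% relative drop
    lower_is_better: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the names of metrics that got worse by more than `tolerance`."""
    regressed = []
    for name, base in baseline.items():
        if name not in current:
            # A metric that disappeared is treated as a regression.
            regressed.append(name)
            continue
        change = current[name] - base
        if name in lower_is_better:
            # For latency-style metrics, an increase is the bad direction.
            change = -change
        # Use relative change so quality scores and latencies share one threshold.
        relative = change / abs(base) if base != 0 else change
        if relative < -tolerance:
            regressed.append(name)
    return regressed


baseline = {"recall@5": 0.80, "mrr": 0.70, "p95_ms": 200.0}
current = {"recall@5": 0.74, "mrr": 0.705, "p95_ms": 260.0}
flagged = detect_regressions(baseline, current, lower_is_better=frozenset({"p95_ms"}))
print(flagged)  # ['recall@5', 'p95_ms']
```

A relative tolerance keeps small run-to-run noise from failing a build while still catching meaningful drops. LLM-as-judge scores tend to be noisier than retrieval metrics, so a real check might use a wider tolerance for them.
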
Golden Dataset Sources
All 100 examples are sourced from open-license RPG systems:
| Source | License | Examples |
|---|---|---|
| Cairn RPG | CC-BY-SA 4.0 | 20 |
| SCP Foundation | CC-BY-SA 3.0 | 20 |
| Fate Core | CC-BY 3.0 | 20 |
| Dungeon World | CC-BY 3.0 | 20 |
| D&D 5.1 SRD | CC-BY 4.0 | 20 |
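
A per-source breakdown like the one above is easy to check mechanically when the dataset is loaded. The following sketch validates records and tallies them by source. The record schema (`question`, `answer`, `source` keys) is an assumption for illustration; the actual layout of `golden_dataset.json` is not documented on this page, and `validate_dataset` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical schema: each record is a dict with these non-empty fields.
REQUIRED_FIELDS = ("question", "answer", "source")


def validate_dataset(
    records: List[Dict[str, str]],
    expected_total: Optional[int] = None,
) -> Counter:
    """Check required fields and the total size, then count examples per source."""
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
    if expected_total is not None and len(records) != expected_total:
        raise ValueError(f"expected {expected_total} records, found {len(records)}")
    return Counter(record["source"] for record in records)


sample = [
    {"question": "Q1", "answer": "A1", "source": "Cairn RPG"},
    {"question": "Q2", "answer": "A2", "source": "Cairn RPG"},
    {"question": "Q3", "answer": "A3", "source": "Fate Core"},
]
counts = validate_dataset(sample, expected_total=3)
print(dict(counts))  # {'Cairn RPG': 2, 'Fate Core': 1}
```

Running a check like this in CI with `expected_total=100` would catch a mismatch between the documented counts and the file contents.
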