Changelog
v0.1.0 (2026-06-13)
- Initial evaluation suite release
- 100-item golden eval dataset from 5 open-license RPG sources
- Retrieval metrics: Recall@k, MRR, NDCG@k
- LLM-as-judge generation scoring
- Regression testing with baseline comparison
- Latency benchmarking harness
- Branded SVG badge generation
- CLI via `python -m raghelm eval`
- CI integration with GitHub Actions