Changelog

v0.1.0 (2026-06-13)

  • Initial evaluation suite release
  • 100-item golden eval dataset from 5 open-license RPG sources
  • Retrieval metrics: Recall@k, MRR, NDCG@k
  • LLM-as-judge generation scoring
  • Regression testing with baseline comparison
  • Latency benchmarking harness
  • Branded SVG badge generation
  • CLI via python -m raghelm eval
  • CI integration with GitHub Actions