Running Evaluations
CLI
# Quick run (10 random examples)
uv run python -m raghelm eval --suite quick
# Full run (all 100 examples)
uv run python -m raghelm eval --suite full
Programmatic API
from raghelm.eval.runner import EvalRunner
runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")
print(f"Recall@5: {results['metrics']['recall@5']}")
Output
Each run saves its results to data/eval_results/eval_YYYYMMDD_HHMMSS.json, where YYYYMMDD_HHMMSS is a timestamp.
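Because the filenames embed a timestamp, the most recent results file can be found by sorting the names. The following is a minimal sketch, not part of raghelm: the helper names `latest_result_path` and `load_metrics` are hypothetical, and it assumes the saved JSON has a top-level "metrics" key like the dictionary returned by `run_suite`.

```python
import json
from pathlib import Path


def latest_result_path(results_dir):
    """Return the newest eval_*.json file in results_dir, or None if there is none."""
    # YYYYMMDD_HHMMSS timestamps sort chronologically when sorted as plain strings.
    candidates = sorted(Path(results_dir).glob("eval_*.json"))
    return candidates[-1] if candidates else None


def load_metrics(path):
    """Read the 'metrics' mapping from a results file (assumed layout)."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    # Fall back to an empty dict if the file has no 'metrics' key.
    return data.get("metrics", {})


if __name__ == "__main__":
    newest = latest_result_path("data/eval_results")
    if newest is None:
        print("No evaluation results found.")
    else:
        print(newest.name, load_metrics(newest))
```

If the actual file layout differs, only `load_metrics` needs to change.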
Regression Testing
On the first run, a baseline is automatically saved to data/baseline.json. Subsequent runs are compared against this baseline, and a run fails if any metric drops significantly.
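To illustrate the idea of a baseline comparison, here is a minimal sketch of how such a check could work. It is not raghelm's actual implementation: the function `find_regressions`, the `tolerance` value of 0.05, and the "mrr" metric name are all assumptions made for illustration, and the sketch treats every metric as higher-is-better.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: (baseline_value, current_value)} for each metric
    whose value dropped by more than `tolerance` relative to the baseline."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            # Metric absent from the current run; this sketch simply skips it.
            continue
        if base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Hypothetical metric values, for illustration only.
baseline = {"recall@5": 0.80, "mrr": 0.60}
current = {"recall@5": 0.70, "mrr": 0.59}

failed = find_regressions(baseline, current)
for name, (old, new) in failed.items():
    print(f"{name} regressed: {old} -> {new}")
```

With these sample values only recall@5 is reported, because its drop of 0.10 exceeds the assumed tolerance while the 0.01 drop in "mrr" does not. An absolute tolerance is the simplest choice; a relative one (a percentage of the baseline value) would be an equally plausible design.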