Running Evaluations

CLI

# Quick run (10 random examples)
uv run python -m raghelm eval --suite quick

# Full run (all 100 examples)
uv run python -m raghelm eval --suite full

Programmatic API

from raghelm.eval.runner import EvalRunner

runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")
print(f"Recall@5: {results['metrics']['recall@5']}")

Output

Results are saved to data/eval_results/eval_YYYYMMDD_HHMMSS.json.
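Because the file name embeds a zero-padded timestamp, the newest results file can be found by sorting the names. The sketch below is illustrative only: `latest_result_path` and `load_metrics` are hypothetical helpers, not part of raghelm, and the assumption that the saved JSON carries a top-level `"metrics"` key is inferred from the programmatic API example rather than confirmed.

```python
import json
from pathlib import Path
from typing import Optional


def latest_result_path(results_dir: str = "data/eval_results") -> Optional[Path]:
    """Return the newest eval_YYYYMMDD_HHMMSS.json file, or None if there is none."""
    # Zero-padded timestamps sort chronologically when compared as strings.
    candidates = sorted(Path(results_dir).glob("eval_*.json"), key=lambda p: p.name)
    return candidates[-1] if candidates else None


def load_metrics(path: Path) -> dict:
    """Load a results file and return its metrics mapping.

    Assumption: the file has a top-level "metrics" key, mirroring the dict
    returned by the runner. An empty dict is returned if the key is absent.
    """
    with path.open(encoding="utf-8") as f:
        results = json.load(f)
    return results.get("metrics", {})


if __name__ == "__main__":
    newest = latest_result_path()
    if newest is None:
        print("No evaluation results found.")
    else:
        print(newest.name, load_metrics(newest))
```

Sorting by name rather than by modification time keeps the result stable if files are copied or restored from a backup.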

Regression Testing

On first run, a baseline is automatically saved to data/baseline.json. Subsequent runs compare against this baseline and fail if any metric drops significantly.
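The baseline behavior described above can be sketched roughly as follows. This is not raghelm's actual implementation: the function names, the flat `{metric: value}` layout of the baseline file, the assumption that higher values are better, and the absolute tolerance of 0.05 are all placeholders chosen for illustration, since the real threshold for a "significant" drop is not stated here.

```python
import json
from pathlib import Path
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,  # hypothetical threshold, not raghelm's
) -> Dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric missing from this run; nothing to compare
        drop = base_value - current[name]  # assumes higher is better
        if drop > tolerance:
            regressions[name] = drop
    return regressions


def check_against_baseline(
    metrics: Dict[str, float],
    baseline_path: str = "data/baseline.json",
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Save `metrics` as the baseline on first run, otherwise compare to it."""
    path = Path(baseline_path)
    if not path.exists():
        # First run: record the baseline and report no regressions.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
        return {}
    baseline = json.loads(path.read_text(encoding="utf-8"))
    return find_regressions(baseline, metrics, tolerance)
```

A caller could treat a non-empty return value as a failure, for example by raising an error or exiting with a non-zero status in CI.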