Running Evaluations

CLI

# Local quick run
uv run python -m raghelm eval --suite quick --mode local --judge-mode offline_heuristic

# Local full run
uv run python -m raghelm eval --suite full --mode local --judge-mode offline_heuristic

# Production run: requires Pinecone and LLM provider credentials
uv run python -m raghelm eval --suite full --mode production --judge-mode production

# Production smoke test: verify required production configuration without a full run
uv run python -m raghelm eval --mode production --judge-mode production --smoke

Programmatic API

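from raghelm.eval.runner import EvalRunner

# Point the runner at the golden dataset, then run the full suite.
runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="full")

# Metrics are keyed by name under results["metrics"].
print(f"Recall@5: {results['metrics']['recall@5']}")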

Output

Eval results are saved to data/eval_results/eval_YYYYMMDD_HHMMSS.json.
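
Because the timestamp is embedded in the file name, the most recent run can be located without opening any file. The following is a minimal sketch, assuming only the naming scheme and directory stated above; the helper name `latest_eval_result` is hypothetical and not part of raghelm.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Matches the documented naming scheme eval_YYYYMMDD_HHMMSS.json.
_RESULT_NAME = re.compile(r"^eval_(\d{8}_\d{6})\.json$")


def latest_eval_result(results_dir: str = "data/eval_results") -> Optional[Path]:
    """Return the newest result file by the timestamp in its name, or None.

    Hypothetical helper: it relies only on the file naming scheme.
    """
    directory = Path(results_dir)
    if not directory.is_dir():
        return None

    newest_stamp: Optional[datetime] = None
    newest_path: Optional[Path] = None
    for path in directory.glob("eval_*.json"):
        match = _RESULT_NAME.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # right shape, but not a valid date or time
        if newest_stamp is None or stamp > newest_stamp:
            newest_stamp, newest_path = stamp, path
    return newest_path
```

Parsing the timestamp rather than sorting by modification time keeps the choice stable when files are copied between machines.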

Production release readiness should be represented by a ReadinessScorecard linked to a RAGRunManifest. Raw eval output is an evidence input to that scorecard, not the release decision itself.

Release-readiness bundle

Use release-check to turn a frozen eval summary into deterministic product artifacts:

uv run python -m raghelm release-check \
  --eval-summary benchmarks/fixtures/release_readiness/improved_candidate_eval_summary.json \
  --release-candidate fixture-improved \
  --output-dir data/release_check/fixture-improved

The command writes:

ragrun-manifest.json — a RAGRunManifest with source, target, metric, privacy, and artifact hash evidence.
readiness-scorecard.json — a ReadinessScorecard with the gate decision and checks.

Gate statuses are deterministic: ship, needs_review, or block. Local/demo artifacts are explicitly labeled as local evidence and can reach at most needs_review for production readiness. Passing --production-claim fails closed unless the manifest mode is production.
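
The gate rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the release-check implementation: `GateStatus`, `decide_gate`, and the single `checks_passed` flag are hypothetical, and "fails closed" is modeled here as a block status, whereas the real command may instead exit with an error.

```python
from enum import Enum


class GateStatus(str, Enum):
    """The three gate statuses named in the documentation."""
    SHIP = "ship"
    NEEDS_REVIEW = "needs_review"
    BLOCK = "block"


def decide_gate(checks_passed: bool, mode: str,
                production_claim: bool = False) -> GateStatus:
    """Hypothetical gate decision; the same inputs always give the same status."""
    # A production claim backed by a non-production manifest fails closed.
    # Assumption: modeled as BLOCK rather than a process error.
    if production_claim and mode != "production":
        return GateStatus.BLOCK
    # Any failed check blocks the release regardless of mode.
    if not checks_passed:
        return GateStatus.BLOCK
    # Local/demo evidence is capped at needs_review.
    if mode != "production":
        return GateStatus.NEEDS_REVIEW
    return GateStatus.SHIP
```

Checking the production claim first means a mislabeled manifest can never be promoted by otherwise passing checks.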

The first benchmark fixtures live in benchmarks/fixtures/release_readiness/. The bad candidate is blocked, while the improved local candidate reaches needs_review, because local evidence cannot support a production claim without review.

Regression testing

On first run, a baseline is saved to data/baseline.json. Subsequent runs compare against that baseline and fail if a metric drops beyond the configured tolerance.
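
The baseline behavior described above can be approximated as follows. This is a rough sketch, assuming the baseline file is a flat JSON mapping of metric name to score, that higher scores are better, and that the tolerance is an absolute drop. The function names and the 0.02 default are hypothetical; the actual baseline format and tolerance setting may differ.

```python
import json
from pathlib import Path
from typing import Dict


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Return {metric: drop} for every metric that fell by more than tolerance."""
    regressions: Dict[str, float] = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # assumption: metrics absent from the current run are skipped
        drop = base_value - current[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions


def check_against_baseline(current: Dict[str, float],
                           baseline_path: str = "data/baseline.json",
                           tolerance: float = 0.02) -> Dict[str, float]:
    """Save a baseline on first run; afterwards report regressions against it."""
    path = Path(baseline_path)
    if not path.exists():
        # First run: there is nothing to compare against, so record the baseline.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(current, indent=2))
        return {}
    baseline = json.loads(path.read_text())
    return find_regressions(baseline, current, tolerance)
```

A caller would treat a non-empty result as a failed run, for example by exiting with a non-zero status.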

Runner options

| Option | Values | Purpose |
| --- | --- | --- |
| `--suite` | `quick`, `full` | Choose 10 random examples or the full dataset. |
| `--mode` | `local`, `production` | Choose local deterministic/dev behavior or real Pinecone/LLM behavior. |
| `--judge-mode` | `offline_heuristic`, `production` | Choose local scoring or production LLM-as-judge scoring. |
| `--smoke` | boolean flag | Validate production eval configuration without a full run. |

Check uv run python -m raghelm eval --help for the latest available options.
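
When launching evals from a script, the documented options can be validated before the command is built. The sketch below is one way to do that, assuming only the flags and values listed in this section; `build_eval_command` is a hypothetical wrapper, not something raghelm provides, and `--suite` is left out when `suite` is `None` to match the smoke-test invocation shown earlier.

```python
from typing import List, Optional

# Allowed values, copied from the options documented in this section.
SUITES = ("quick", "full")
MODES = ("local", "production")
JUDGE_MODES = ("offline_heuristic", "production")


def build_eval_command(suite: Optional[str] = "quick",
                       mode: str = "local",
                       judge_mode: str = "offline_heuristic",
                       smoke: bool = False) -> List[str]:
    """Build the argv list for `raghelm eval`, rejecting undocumented values."""
    if suite is not None and suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if judge_mode not in JUDGE_MODES:
        raise ValueError(f"unknown judge mode: {judge_mode!r}")

    cmd = ["uv", "run", "python", "-m", "raghelm", "eval"]
    if suite is not None:
        cmd += ["--suite", suite]
    cmd += ["--mode", mode, "--judge-mode", judge_mode]
    if smoke:
        cmd.append("--smoke")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems.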

Evidence rules

Local evals must be labeled local/demo and must not back production claims.
Production evals must fail closed when required credentials/config are missing.
Public claims require production scorecard and manifest evidence per ADR-010.
Private data must not be copied into public proof artifacts by default per ADR-003.
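
The fail-closed rule for production evals can be illustrated with a small configuration check. This is a sketch only: the exception, the function, and the two example variable names are hypothetical, and the configuration raghelm actually requires may be named differently.

```python
from typing import Iterable, List, Mapping

# Hypothetical example names; substitute whatever the deployment really requires.
EXAMPLE_REQUIRED = ("PINECONE_API_KEY", "LLM_API_KEY")


class MissingProductionConfig(RuntimeError):
    """Raised when a production eval lacks required configuration."""


def require_production_config(env: Mapping[str, str],
                              required: Iterable[str]) -> None:
    """Fail closed: raise unless every required name has a non-empty value."""
    missing: List[str] = [
        name for name in required if not env.get(name, "").strip()
    ]
    if missing:
        raise MissingProductionConfig(
            "production eval refused; missing: " + ", ".join(sorted(missing))
        )
```

Calling `require_production_config(os.environ, EXAMPLE_REQUIRED)` before a production run stops the run instead of silently falling back to local behavior.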