Eval Runner API¶

EvalRunner orchestrates local and production eval runs. Eval output is evidence input for release control; scorecards and manifests are the release decision artifacts.

`EvalRunner`¶

from raghelm.eval.runner import EvalRunner

runner = EvalRunner("raghelm/eval/golden_dataset.json")
results = runner.run_suite(suite="quick")

Constructor¶

EvalRunner(dataset_path: str)

Loads and validates the golden dataset. Invalid datasets fail before scoring.

`run_suite`¶

run_suite(suite: str = "full") -> dict

suite="full": run all examples.
suite="quick": run a 10-example subset.

Returns metrics, generation scores, and per-example details. Results are saved to data/eval_results/eval_YYYYMMDD_HHMMSS.json and compared against data/baseline.json when present.

Release-control usage¶

Use runner output as source evidence for ReadinessScorecards and RAGRunManifests. Do not publish local runner output as production quality proof.

Eval Runner API¶

EvalRunner¶

Constructor¶

run_suite¶

Release-control usage¶

`EvalRunner`¶

`run_suite`¶