Golden Dataset
The golden dataset is the ground truth used to evaluate RAG retrieval and generation quality.
Structure
Each example follows this schema:

    {
      "id": "cairn_001",
      "question": "What happens when a character reaches 0 STR in Cairn?",
      "expected_answer": "When a character reaches 0 STR, they are DEAD.",
      "source_docs": ["cairn_srd.md"],
      "relevant_chunks": ["When a character reaches 0 STR, they are DEAD."],
      "difficulty": 1,
      "category": "factual_lookup"
    }

Fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (source_NNN) |
| question | string | The question to answer |
| expected_answer | string | Ground-truth answer |
| source_docs | list[string] | Source document filenames |
| relevant_chunks | list[string] | Verbatim chunks containing the answer |
| difficulty | int | 1 = easy, 2 = medium, 3 = hard |
| category | string | Query type (see below) |
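
The field table maps naturally onto a typed record. The sketch below is a minimal illustration that parses the example from the Structure section with the standard library only. `ExampleRecord` is a hypothetical stand-in written for this page, not raghelm's own `GoldenExample` class, whose internals are not shown here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleRecord:
    """Hypothetical record mirroring the documented fields (not raghelm's class)."""

    id: str
    question: str
    expected_answer: str
    source_docs: list[str]
    relevant_chunks: list[str]
    difficulty: int
    category: str


# The sample example from the Structure section, as raw JSON text.
raw = """
{
  "id": "cairn_001",
  "question": "What happens when a character reaches 0 STR in Cairn?",
  "expected_answer": "When a character reaches 0 STR, they are DEAD.",
  "source_docs": ["cairn_srd.md"],
  "relevant_chunks": ["When a character reaches 0 STR, they are DEAD."],
  "difficulty": 1,
  "category": "factual_lookup"
}
"""

# Unpacking the parsed dict fails loudly if a key is missing or unexpected.
record = ExampleRecord(**json.loads(raw))
print(record.id, record.difficulty, record.category)
```

Unpacking with `**` raises a `TypeError` when a documented field is absent or an unknown key is present, which gives a cheap structural check while drafting examples by hand.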
Query Categories
| Category | Description | Count |
|---|---|---|
| factual_lookup | Single-fact retrieval from one document | 32 |
| comparison | Compare two or more rules/items | 13 |
| synthesis | Combine multiple rules to derive an answer | 14 |
| temporal | Sequence of events or consequences over time | 21 |
| contradictory | Resolve apparent rule conflicts | 20 |
Difficulty Distribution
| Level | Count | Description |
|---|---|---|
| 1 (Easy) | 25 | Direct lookup, single chunk |
| 2 (Medium) | 49 | Cross-reference 2-3 chunks |
| 3 (Hard) | 26 | Multi-hop reasoning across documents |
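
The category and difficulty counts above can be re-derived from the examples themselves, which is a useful check after adding or editing entries. The following is a minimal sketch, assuming the examples are available as plain dicts with the documented keys. The `distribution` helper and the three sample records are made up for illustration and are not part of raghelm or of the real dataset.

```python
from collections import Counter


def distribution(examples, field):
    """Count how many examples carry each value of `field`."""
    return Counter(example[field] for example in examples)


# Illustrative records only; a real run would use the loaded golden dataset.
sample = [
    {"id": "demo_001", "difficulty": 1, "category": "factual_lookup"},
    {"id": "demo_002", "difficulty": 2, "category": "comparison"},
    {"id": "demo_003", "difficulty": 1, "category": "factual_lookup"},
]

by_difficulty = distribution(sample, "difficulty")
by_category = distribution(sample, "category")

for level, count in sorted(by_difficulty.items()):
    print(f"difficulty {level}: {count}")
for name, count in sorted(by_category.items()):
    print(f"{name}: {count}")
```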
Sources
All 100 examples use open-license RPG content:
| Source | License | Count |
|---|---|---|
| Cairn RPG SRD | CC-BY-SA 4.0 | 20 |
| SCP Foundation Wiki | CC-BY-SA 3.0 | 20 |
| Fate Core SRD | CC-BY 3.0 | 19 |
| Fate Accelerated SRD | CC-BY 3.0 | 1 |
| Fate System Toolkit | CC-BY 3.0 | 1 |
| Dungeon World SRD | CC-BY 3.0 | 20 |
| D&D 5.1 SRD | CC-BY 4.0 | 20 |
Validation
The dataset is validated on every eval run using raghelm.eval.golden_dataset.validate_dataset():

    import sys

    from raghelm.eval.golden_dataset import load_golden_dataset, validate_dataset

    # Load the golden dataset and collect any validation problems.
    dataset = load_golden_dataset("raghelm/eval/golden_dataset.json")
    issues = validate_dataset(dataset)

    # Report every issue, then exit non-zero so the eval run fails.
    if issues:
        for issue in issues:
            print(f"ERROR: {issue}")
        sys.exit(1)

Validation checks:

- No duplicate IDs
- All required fields present (id, question, expected_answer, source_docs, relevant_chunks)
- Difficulty is 1, 2, or 3
- Category is one of the valid types
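
To make the four checks concrete, here is a standalone sketch that applies them to plain dicts. It is not the implementation of `validate_dataset()`, which is not shown on this page. `check_examples` and the constant names are hypothetical, and treating a missing difficulty or category as invalid is an assumption.

```python
REQUIRED_FIELDS = ("id", "question", "expected_answer", "source_docs", "relevant_chunks")
VALID_DIFFICULTIES = {1, 2, 3}
VALID_CATEGORIES = {"factual_lookup", "comparison", "synthesis", "temporal", "contradictory"}


def check_examples(examples):
    """Return a list of human-readable issues; an empty list means all checks passed."""
    issues = []
    seen_ids = set()
    for index, example in enumerate(examples):
        label = example.get("id", f"<index {index}>")

        # Check 1: required fields are present.
        missing = [name for name in REQUIRED_FIELDS if name not in example]
        if missing:
            issues.append(f"{label}: missing fields {missing}")

        # Check 2: ids are unique across the dataset.
        if "id" in example:
            if example["id"] in seen_ids:
                issues.append(f"{label}: duplicate id")
            seen_ids.add(example["id"])

        # Check 3: difficulty is 1, 2, or 3 (assumption: absent counts as invalid).
        if example.get("difficulty") not in VALID_DIFFICULTIES:
            issues.append(f"{label}: invalid difficulty {example.get('difficulty')!r}")

        # Check 4: category is one of the documented query types.
        if example.get("category") not in VALID_CATEGORIES:
            issues.append(f"{label}: invalid category {example.get('category')!r}")
    return issues


good = {
    "id": "my_001",
    "question": "What is the answer?",
    "expected_answer": "42",
    "source_docs": ["my_doc.txt"],
    "relevant_chunks": ["The answer is 42."],
    "difficulty": 1,
    "category": "factual_lookup",
}
print(check_examples([good]))        # no issues for a well-formed example
print(check_examples([good, good]))  # the second copy triggers the duplicate-id check
```

Collecting every issue in a list, rather than raising on the first one, matches the calling pattern shown above, where all issues are printed before the run exits.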
Creating Custom Datasets

    from raghelm.eval.golden_dataset import GoldenDataset, GoldenExample, save_golden_dataset

    # Build a dataset with a single example that follows the schema above.
    dataset = GoldenDataset(
        version="1.0.0",
        description="My custom eval dataset",
        examples=[
            GoldenExample(
                id="my_001",
                question="What is the answer?",
                expected_answer="42",
                source_docs=["my_doc.txt"],
                relevant_chunks=["The answer is 42."],
                difficulty=1,
                category="factual_lookup",
            )
        ],
    )

    # Write the dataset to my_dataset.json.
    save_golden_dataset(dataset, "my_dataset.json")

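
When a custom dataset grows beyond a handful of examples, ids in the source_NNN form can be generated instead of typed by hand. This is a small sketch, assuming a three-digit, zero-padded counter as in cairn_001 and my_001. `make_id` is a hypothetical helper, not a raghelm function.

```python
def make_id(source: str, number: int) -> str:
    """Build an id such as 'my_001' from a source prefix and a 1-based counter."""
    if number < 1:
        raise ValueError("number must be 1 or greater")
    # Assumption: ids are zero-padded to three digits, as in the examples on this page.
    return f"{source}_{number:03d}"


ids = [make_id("my", n) for n in range(1, 4)]
print(ids)  # ['my_001', 'my_002', 'my_003']
```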