Golden Dataset

The golden dataset is the ground truth used to evaluate RAG retrieval and generation quality.

Structure

Each example follows this schema:

{
  "id": "cairn_001",
  "question": "What happens when a character reaches 0 STR in Cairn?",
  "expected_answer": "When a character reaches 0 STR, they are DEAD.",
  "source_docs": ["cairn_srd.md"],
  "relevant_chunks": ["When a character reaches 0 STR, they are DEAD."],
  "difficulty": 1,
  "category": "factual_lookup"
}
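
As a rough illustration of how a record with this schema can be read in Python, the sketch below parses the example above into a plain dataclass. `ExampleRecord` is a hypothetical stand-in used only here; it is not raghelm's `GoldenExample` class, whose internals are not shown in this document.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleRecord:
    """Hypothetical mirror of the schema above (not raghelm's own class)."""
    id: str
    question: str
    expected_answer: str
    source_docs: list[str]
    relevant_chunks: list[str]
    difficulty: int
    category: str


raw = """
{
  "id": "cairn_001",
  "question": "What happens when a character reaches 0 STR in Cairn?",
  "expected_answer": "When a character reaches 0 STR, they are DEAD.",
  "source_docs": ["cairn_srd.md"],
  "relevant_chunks": ["When a character reaches 0 STR, they are DEAD."],
  "difficulty": 1,
  "category": "factual_lookup"
}
"""

# json.loads yields a dict whose keys match the dataclass fields one-to-one,
# so a missing or unexpected key raises a TypeError here.
record = ExampleRecord(**json.loads(raw))
print(record.id, record.difficulty, record.category)
```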

Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (source_NNN) |
| question | string | The question to answer |
| expected_answer | string | Ground-truth answer |
| source_docs | list[string] | Source document filenames |
| relevant_chunks | list[string] | Verbatim chunks containing the answer |
| difficulty | int | 1 = easy, 2 = medium, 3 = hard |
| category | string | Query type (see below) |
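
Because relevant_chunks holds verbatim text, one simple way to score retrieval against it is to check whether each chunk appears inside any retrieved passage. The sketch below is an assumption about how such a check could look, not the metric raghelm actually computes; `chunk_recall` and `_normalize` are hypothetical helpers, and the retrieved passages are placeholders.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return " ".join(text.split()).lower()


def chunk_recall(relevant_chunks: list[str], retrieved_texts: list[str]) -> float:
    """Fraction of relevant chunks found verbatim in at least one retrieved text."""
    if not relevant_chunks:
        return 0.0
    haystacks = [_normalize(text) for text in retrieved_texts]
    hits = sum(
        1
        for chunk in relevant_chunks
        if any(_normalize(chunk) in haystack for haystack in haystacks)
    )
    return hits / len(relevant_chunks)


relevant = ["When a character reaches 0 STR, they are DEAD."]
retrieved = [
    "Placeholder passage that does not contain the answer.",
    "Some leading text. When a character reaches 0 STR,  they are DEAD. Some trailing text.",
]

print(chunk_recall(relevant, retrieved))
```

Substring matching is strict by design: it only rewards a retriever that returns the exact source text, which is what a verbatim field implies. A fuzzier match would be a different design choice.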

Query Categories

| Category | Description | Count |
|---|---|---|
| factual_lookup | Single-fact retrieval from one document | 32 |
| comparison | Compare two or more rules/items | 13 |
| synthesis | Combine multiple rules to derive an answer | 14 |
| temporal | Sequence of events or consequences over time | 21 |
| contradictory | Resolve apparent rule conflicts | 20 |

Difficulty Distribution

| Level | Count | Description |
|---|---|---|
| 1 (Easy) | 25 | Direct lookup, single chunk |
| 2 (Medium) | 49 | Cross-reference 2-3 chunks |
| 3 (Hard) | 26 | Multi-hop reasoning across documents |
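
Breakdowns like the category and difficulty counts can be recomputed from the examples themselves. The sketch below tallies both with `collections.Counter`, using toy records invented for illustration rather than the real dataset, and then confirms that the published category and difficulty counts each total 100.

```python
from collections import Counter

# Toy records for illustration only; they are not taken from the real dataset.
examples = [
    {"id": "demo_001", "difficulty": 1, "category": "factual_lookup"},
    {"id": "demo_002", "difficulty": 2, "category": "comparison"},
    {"id": "demo_003", "difficulty": 2, "category": "factual_lookup"},
    {"id": "demo_004", "difficulty": 3, "category": "synthesis"},
]

by_category = Counter(example["category"] for example in examples)
by_difficulty = Counter(example["difficulty"] for example in examples)

# Both breakdowns partition the same examples, so their totals must agree.
assert sum(by_category.values()) == sum(by_difficulty.values()) == len(examples)

# The published counts obey the same invariant.
published_categories = {
    "factual_lookup": 32,
    "comparison": 13,
    "synthesis": 14,
    "temporal": 21,
    "contradictory": 20,
}
published_difficulty = {1: 25, 2: 49, 3: 26}

print(sum(published_categories.values()), sum(published_difficulty.values()))
# prints: 100 100
```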

Sources

All 100 examples draw on openly licensed RPG and collaborative-fiction content:

| Source | License | Count |
|---|---|---|
| Cairn RPG SRD | CC-BY-SA 4.0 | 20 |
| SCP Foundation Wiki | CC-BY-SA 3.0 | 20 |
| Fate Core SRD | CC-BY 3.0 | 19 |
| Fate Accelerated SRD | CC-BY 3.0 | 1 |
| Fate System Toolkit | CC-BY 3.0 | 1 |
| Dungeon World SRD | CC-BY 3.0 | 20 |
| D&D 5.1 SRD | CC-BY 4.0 | 20 |

Validation

The dataset is validated on every eval run using raghelm.eval.golden_dataset.validate_dataset():

import sys

from raghelm.eval.golden_dataset import load_golden_dataset, validate_dataset

dataset = load_golden_dataset("raghelm/eval/golden_dataset.json")
issues = validate_dataset(dataset)
if issues:
    # Report every issue, then exit with a non-zero status so the eval run fails.
    for issue in issues:
        print(f"ERROR: {issue}")
    sys.exit(1)

Validation checks:

- No duplicate IDs
- All required fields present (id, question, expected_answer, source_docs, relevant_chunks)
- Difficulty is 1, 2, or 3
- Category is one of the valid types
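
To make the checks listed above concrete, here is a minimal stand-alone sketch that applies them to plain dictionaries. It is not the implementation of `validate_dataset()`; the function name `check_examples`, the constants, and the message wording are all assumptions chosen for illustration. The category set is taken from the Query Categories section.

```python
REQUIRED_FIELDS = ("id", "question", "expected_answer", "source_docs", "relevant_chunks")
VALID_DIFFICULTIES = {1, 2, 3}
VALID_CATEGORIES = {"factual_lookup", "comparison", "synthesis", "temporal", "contradictory"}


def check_examples(examples: list[dict]) -> list[str]:
    """Return human-readable issues; an empty list means every check passed."""
    issues = []
    seen_ids = set()
    for index, example in enumerate(examples):
        label = example.get("id", f"<example #{index}>")

        # Required fields must all be present.
        for field in REQUIRED_FIELDS:
            if field not in example:
                issues.append(f"{label}: missing required field '{field}'")

        # IDs must be unique across the dataset.
        example_id = example.get("id")
        if example_id is not None:
            if example_id in seen_ids:
                issues.append(f"{label}: duplicate id")
            seen_ids.add(example_id)

        # Difficulty and category must come from the allowed sets.
        if example.get("difficulty") not in VALID_DIFFICULTIES:
            issues.append(f"{label}: difficulty must be 1, 2, or 3")
        if example.get("category") not in VALID_CATEGORIES:
            issues.append(f"{label}: unknown category {example.get('category')!r}")
    return issues


good = {
    "id": "cairn_001",
    "question": "What happens when a character reaches 0 STR in Cairn?",
    "expected_answer": "When a character reaches 0 STR, they are DEAD.",
    "source_docs": ["cairn_srd.md"],
    "relevant_chunks": ["When a character reaches 0 STR, they are DEAD."],
    "difficulty": 1,
    "category": "factual_lookup",
}
# Deliberately broken: reused id, no question, bad difficulty, bad category.
bad = {
    "id": "cairn_001",
    "expected_answer": "x",
    "source_docs": [],
    "relevant_chunks": [],
    "difficulty": 5,
    "category": "trivia",
}

issues = check_examples([good, bad])
for issue in issues:
    print(f"ERROR: {issue}")
```

Collecting every issue before reporting, rather than stopping at the first one, matches how the snippet above iterates over `issues` and lets a dataset author fix all problems in one pass.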

Creating Custom Datasets

from raghelm.eval.golden_dataset import GoldenDataset, GoldenExample, save_golden_dataset

dataset = GoldenDataset(
    version="1.0.0",
    description="My custom eval dataset",
    examples=[
        GoldenExample(
            id="my_001",
            question="What is the answer?",
            expected_answer="42",
            source_docs=["my_doc.txt"],
            relevant_chunks=["The answer is 42."],
            difficulty=1,
            category="factual_lookup"
        )
    ]
)

save_golden_dataset(dataset, "my_dataset.json")
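
When a custom dataset grows, it can help to generate IDs that follow the source_NNN convention from the Fields section. The helper below is a hypothetical sketch, not part of raghelm; it assumes the source prefix is lowercase alphanumeric and that NNN is a three-digit, zero-padded counter, as in "cairn_001" and "my_001".

```python
import re

# Assumed ID shape: lowercase alphanumeric prefix, underscore, three digits.
ID_PATTERN = re.compile(r"^(?P<source>[a-z][a-z0-9]*)_(?P<number>\d{3})$")


def next_id(source: str, existing_ids: list[str]) -> str:
    """Return the next unused ID for `source`, given the IDs already in use."""
    highest = 0
    for example_id in existing_ids:
        match = ID_PATTERN.match(example_id)
        if match and match.group("source") == source:
            highest = max(highest, int(match.group("number")))
    return f"{source}_{highest + 1:03d}"


existing = ["my_001", "my_002", "cairn_001"]
print(next_id("my", existing))    # my_003
print(next_id("fate", existing))  # fate_001
```

Deriving the next ID from the highest existing number, instead of from the list length, avoids collisions when earlier examples have been deleted.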