RAGhelm

Production-grade multi-namespace RAG platform with agentic routing, automated evaluation, and real-time observability.

Built on Pinecone. Ships in weeks, not months.

Python Ruff TypeScript Fastify

Eval Quality Gates

Recall@5 MRR NDCG@5 Generation

What is RAGhelm?

RAGhelm is a platform that combines:

  • Multi-namespace RAG with agentic routing across document collections
  • Golden evaluation datasets sourced from open-license RPG systems (Cairn, SCP Foundation, Fate Core, Dungeon World, D&D SRD 5.1)
  • Automated evaluation pipelines measuring Recall@k, MRR, NDCG@k, and generation quality
  • Regression testing to catch retrieval degradation before it reaches production
  • Real-time observability with Prometheus metrics and custom branded badges
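
For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch of how Recall@k, MRR, and NDCG@k are commonly computed with binary relevance. It is illustrative only and is not taken from RAGhelm's `metrics` module; the function names and the assumption that each query has a ranked list of unique document IDs plus a set of relevant IDs are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Sequence[str]], relevant: Sequence[Set[str]]
) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevant))
    return total / len(runs)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """NDCG@k with binary gains (assumes ranked_ids contains no duplicates)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ordering places every relevant document at the top of the list.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e"]
    relevant = {"b", "e", "z"}
    print(recall_at_k(ranked, relevant, k=5))
    print(reciprocal_rank(ranked, relevant))
    print(ndcg_at_k(ranked, relevant, k=5))
```

In this example two of the three relevant documents are retrieved, so Recall@5 is 2/3, and the first relevant hit is at rank 2, so the reciprocal rank is 0.5. Graded (non-binary) relevance would need a gain term in place of the constant 1.0.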

Quick Start

# Install dependencies
uv sync

# Run the evaluation suite (10 random examples)
uv run python -m raghelm eval --suite quick

# Run the full evaluation suite (all 100 examples)
uv run python -m raghelm eval --suite full

# Generate branded badges from latest eval results
uv run python scripts/generate_badges.py

Architecture

raghelm/
  raghelm/
    eval/              # Evaluation framework
      golden_dataset   # Dataset loader + validation
      runner           # Eval suite orchestrator
      metrics          # Recall@k, MRR, NDCG@k
      scorer           # LLM-as-judge scoring
      regression       # Regression detection
      benchmark        # Latency benchmarking
    agent/             # Agentic router
    retrieval/         # Multi-namespace retrieval
    generation/        # LLM generation pipeline
    ingestion/         # Document ingestion
  tests/               # Test suite
  data/
    eval_results/      # Evaluation run outputs
    eval_cache/        # Cached LLM scores
    baseline.json      # Regression baseline
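
To illustrate what regression detection against `data/baseline.json` could look like, here is a minimal sketch. It is not RAGhelm's actual `regression` module: the flat `{"metric_name": score}` schema, the function names, and the 0.02 tolerance are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def load_metrics(path: Path) -> Dict[str, float]:
    """Load a flat {"metric_name": score} mapping (hypothetical schema)."""
    with path.open(encoding="utf-8") as fh:
        return {name: float(value) for name, value in json.load(fh).items()}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # assumed threshold, not a RAGhelm default
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics that dropped more than `tolerance` below the baseline.

    A metric present in the baseline but missing from the current run is
    reported as a regression with a current value of None.
    """
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, base_value in baseline.items():
        cur_value = current.get(name)
        if cur_value is None or base_value - cur_value > tolerance:
            regressions[name] = (base_value, cur_value)
    return regressions


if __name__ == "__main__":
    baseline = {"recall@5": 0.90, "mrr": 0.80}
    current = {"recall@5": 0.85, "mrr": 0.79}
    for name, (base, cur) in find_regressions(baseline, current).items():
        print(f"REGRESSION {name}: baseline={base} current={cur}")
```

With these sample values only `recall@5` is flagged, because its 0.05 drop exceeds the tolerance while the 0.01 drop in `mrr` does not. A CI job could exit non-zero whenever the returned mapping is non-empty.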