RAGhelm
Production-grade multi-namespace RAG platform with agentic routing, automated evaluation, and real-time observability.
Built on Pinecone. Ships in weeks, not months.
Eval Quality Gates
What is RAGhelm?
A comprehensive platform that combines:
- Multi-namespace RAG with agentic routing across document collections
- Golden evaluation datasets sourced from open-license RPG systems (Cairn, SCP Foundation, Fate Core, Dungeon World, D&D 5.1 SRD)
- Automated evaluation pipelines measuring Recall@k, MRR, NDCG@k, and generation quality (via LLM-as-judge scoring)
- Regression testing to catch retrieval degradation before it reaches production
- Real-time observability with Prometheus metrics and custom branded badges
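The retrieval metrics listed above have standard definitions. The following is a minimal sketch with binary relevance, assuming each query yields a ranked list of unique document IDs and a set of relevant IDs from the golden dataset. The function names are hypothetical and this is not RAGhelm's own `metrics` module.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: reciprocal rank averaged over (retrieved, relevant) pairs, one per query."""
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For retrieved = ["a", "b", "c", "d"] and relevant = {"b", "d"}, Recall@3 is 0.5 and the reciprocal rank is 0.5, while NDCG@3 (about 0.387) additionally penalizes the second relevant document falling outside the top 3.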
Quick Start
```shell
# Install dependencies
uv sync

# Run the quick evaluation suite (10 random examples)
uv run python -m raghelm eval --suite quick

# Run the full evaluation suite (all 100 examples)
uv run python -m raghelm eval --suite full

# Generate branded badges from the latest eval results
uv run python scripts/generate_badges.py
```
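The difference between the quick and full suites can be pictured as sampling from the golden dataset. The following is a minimal sketch, assuming the dataset loads as a list of 100 example records. `select_suite`, `SUITE_SIZES`, and the optional `seed` are hypothetical illustrations, not RAGhelm's actual API, and the README does not say whether the quick sample is seeded.

```python
import random
from typing import Any, Dict, List, Optional, Sequence

# Hypothetical suite table: "quick" samples 10 examples, "full" (None) keeps all of them.
SUITE_SIZES: Dict[str, Optional[int]] = {"quick": 10, "full": None}


def select_suite(
    examples: Sequence[Dict[str, Any]], suite: str, seed: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Return the golden examples to evaluate for the given suite name."""
    if suite not in SUITE_SIZES:
        raise ValueError(f"unknown suite: {suite!r}")
    size = SUITE_SIZES[suite]
    if size is None or size >= len(examples):
        return list(examples)
    # A dedicated Random instance makes the sample reproducible when a seed is
    # given, without touching the global random state.
    rng = random.Random(seed)
    return rng.sample(list(examples), size)


if __name__ == "__main__":
    golden = [{"id": f"ex-{i:03d}"} for i in range(100)]
    quick = select_suite(golden, "quick", seed=7)
    print(len(quick), len(select_suite(golden, "full")))  # 10 100
```

Passing a fixed seed lets two quick runs score the same ten examples, which keeps their results directly comparable. Leaving it unset gives a fresh random sample each time.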
Architecture
```text
raghelm/
    raghelm/
        eval/                   # Evaluation framework
            golden_dataset      # Dataset loader + validation
            runner              # Eval suite orchestrator
            metrics             # Recall@k, MRR, NDCG@k
            scorer              # LLM-as-judge scoring
            regression          # Regression detection
            benchmark           # Latency benchmarking
        agent/                  # Agentic router
        retrieval/              # Multi-namespace retrieval
        generation/             # LLM generation pipeline
        ingestion/              # Document ingestion
    tests/                      # Test suite
    data/
        eval_results/           # Evaluation run outputs
        eval_cache/             # Cached LLM scores
        baseline.json           # Regression baseline
```
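The `regression` module and `baseline.json` suggest that each new run is compared against a stored baseline. The following is a minimal sketch, assuming the baseline is a flat JSON object mapping metric names to scores and that any drop larger than a fixed tolerance counts as a regression. `load_baseline`, `detect_regressions`, the metric keys, and the 0.02 tolerance are all hypothetical, not RAGhelm's actual code.

```python
import json
from pathlib import Path
from typing import Dict, List, Mapping


def load_baseline(path: Path) -> Dict[str, float]:
    """Read a flat {metric_name: value} mapping from a baseline JSON file (assumed schema)."""
    with path.open(encoding="utf-8") as fh:
        return {name: float(value) for name, value in json.load(fh).items()}


def detect_regressions(
    current: Mapping[str, float],
    baseline: Mapping[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List metrics that dropped more than `tolerance` below the baseline.

    Assumes higher is better (true for Recall@k, MRR, NDCG@k). A metric missing
    from the current run is also reported as a regression.
    """
    regressions = []
    for metric, baseline_value in sorted(baseline.items()):
        current_value = current.get(metric)
        if current_value is None or current_value < baseline_value - tolerance:
            regressions.append(metric)
    return regressions


if __name__ == "__main__":
    baseline = {"recall@5": 0.80, "mrr": 0.70, "ndcg@5": 0.75}
    current = {"recall@5": 0.81, "mrr": 0.65, "ndcg@5": 0.74}
    print(detect_regressions(current, baseline))  # ['mrr']
```

The tolerance absorbs small run-to-run noise (for example, from LLM-as-judge scores) so that only genuine drops fail the check. Latency figures from the benchmark module improve in the opposite direction (lower is better), so they would need their own comparison.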