# RAG System Evaluation: Hallucination & Reliability

## Objective
Evaluate the reliability of a Retrieval-Augmented Generation (RAG) system by testing consistency, faithfulness, and hallucination risk.
## Key Findings
- Ungrounded prompts produced hallucinated (unsupported) claims
- Strict grounding instructions eliminated hallucinations and improved answer consistency
- Model behavior varied substantially with the strictness of the grounding prompt
- Evaluation metrics can report false positives when the retrieved context is missing
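The last finding is easy to see with a toy faithfulness metric. The sketch below (a naive token-overlap score, not the evaluator used in this study; function and example strings are illustrative) shows how a partially hallucinated answer can still receive a misleadingly high "supported" score:

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context.

    A naive token-overlap proxy for faithfulness; real evaluators
    (e.g. NLI-based checkers) are more robust than this sketch.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


context = "The warranty covers battery replacement for two years."
grounded = "The warranty covers battery replacement for two years."
hallucinated = "The warranty also covers screen repair for five years."

# Fully grounded answer scores 1.0.
print(support_score(grounded, context))
# Hallucinated answer still scores well above 0.5 because function
# words ("the", "covers", "for", ...) overlap with the context.
print(support_score(hallucinated, context))
```

When no context is retrieved at all, such overlap-based checks degrade further: there is nothing to compare against, so a confident fabrication can slip through unflagged.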
## Conclusion
Prompt grounding is critical for reliable LLM behavior. Without it, models can generate confident but unsupported answers, and evaluation metrics may fail to detect these issues.
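A strictly grounded prompt can be assembled mechanically. The sketch below is one illustrative template (the function name and instruction wording are assumptions, not the exact prompt used in this evaluation); the key elements are an explicit "only from the passages" constraint and a sanctioned refusal path:

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a strictly grounded RAG prompt.

    The model is instructed to answer only from the supplied
    passages and given an explicit fallback response, which
    reduces confident-but-unsupported answers.
    """
    # Number the passages so answers can cite their sources.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, reply "
        "'Not found in context.'\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_grounded_prompt(
    "What does the warranty cover?",
    ["The warranty covers battery replacement for two years."],
))
```

Numbering the passages also makes it easy to ask the model for citations, which gives the evaluator something concrete to verify.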
