RAG System Evaluation: Hallucination & Reliability

Objective

Evaluate the reliability of a Retrieval-Augmented Generation (RAG) system by testing three properties: consistency (stable answers across repeated runs of the same question), faithfulness (answers supported by the retrieved context), and hallucination risk (confident claims with no support in that context).
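
As a minimal sketch of how faithfulness can be operationalized, the heuristic below scores the fraction of answer tokens that appear in the retrieved context. The function names and the token-overlap approach are illustrative assumptions, not the metric used in this evaluation; production setups typically use NLI models or LLM-as-judge scoring instead.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context.

    A crude stand-in for a real faithfulness metric: low overlap flags a
    possible hallucination, high overlap suggests grounding.
    """
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)

# Hypothetical example: the ungrounded claim scores lower than the grounded one.
ctx = "The report covers Q3 revenue growth and year-end headcount."
print(token_overlap_faithfulness("Revenue grew in Q3.", ctx))           # 0.5
print(token_overlap_faithfulness("The CEO resigned yesterday.", ctx))   # 0.25
```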

Key Findings

  • Ungrounded prompts led the model to hallucinate: it produced confident answers with no support in the retrieved context
  • Strict grounding eliminated hallucinations and improved answer consistency across repeated runs (see the prompt sketch after this list)
  • Model behavior changed significantly with the strength of the grounding instruction, from a bare "answer the question" to "use only the context and refuse otherwise"
  • Evaluation metrics can produce false positives when the retrieved context is missing, rating fluent but ungrounded answers as acceptable
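
To make the grounding contrast concrete, here is a minimal sketch of the two prompt styles. The exact wording and function names are illustrative assumptions, not the templates used in this evaluation.

```python
def ungrounded_prompt(question: str) -> str:
    # No context and no refusal instruction: the model falls back on
    # parametric knowledge, which is where hallucinations appeared.
    return f"Answer the following question.\n\nQuestion: {question}"

def grounded_prompt(question: str, context: str) -> str:
    # Strict grounding: restrict the model to the retrieved context and
    # give it an explicit way to refuse instead of guessing.
    return (
        "Answer using ONLY the context below. If the context does not "
        'contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

A common design choice worth noting: the explicit refusal path matters alongside the context itself, because it gives the model an alternative to guessing when retrieval comes back empty.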

Conclusion

Prompt grounding is critical for reliable LLM behavior. Without it, a model can generate confident but unsupported answers, and evaluation metrics that never consult the retrieved context may fail to detect the failure.
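
As a sketch of that false-positive failure mode, the toy metric below judges an answer only by its overlap with the question and never consults the retrieved context, so a fluent hallucination scores well. The metric, product name, and example are hypothetical, chosen to illustrate why context-free scoring misses these errors.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def naive_relevance(question: str, answer: str) -> float:
    """Question-answer token overlap; note it never looks at the context."""
    q = _tokens(question)
    return len(q & _tokens(answer)) / len(q) if q else 0.0

# A confident hallucination can score highly on relevance even though no
# retrieved context supports it; only a context-aware metric catches this.
question = "When was the Foo 3000 released?"            # hypothetical product
hallucinated = "The Foo 3000 was released in 2019."     # unsupported claim
print(naive_relevance(question, hallucinated))          # high score: ~0.83
```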