| Query handling | Query rewrite success rate, intent classification accuracy | Ensures retrieval starts from the right intent | Correct docs exist but are never retrieved |
| Retrieval | Recall@k, MRR, nDCG, context overlap | Measures whether useful evidence is fetched | Hallucination risk rises even with a strong model |
| Generation | Groundedness, citation precision, answer completeness | Validates answer quality against retrieved evidence | Fluent but unsupported responses |
| Operations | P95 latency, token cost per answer, timeout rate | Protects UX and margins | Quality unstable under load or budget pressure |
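
The retrieval metrics in the table can be computed from nothing more than a ranked list of document IDs and a set of relevance labels. The following is a minimal sketch, not a reference implementation: the function names, the ID-based input format, and the choice of a linear-gain, log2-discounted nDCG are all assumptions made for illustration.

```python
import math

# Hypothetical helpers: names and input shapes are illustrative assumptions.

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over a list of (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with linear gain; `gains` maps doc ID -> graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: two relevant docs, one of them ranked second.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))      # 0.5
print(reciprocal_rank(retrieved, relevant))       # 0.5
print(ndcg_at_k(retrieved, {"d1": 1.0, "d2": 1.0}, k=3))
```

In the toy query, Recall@3 is 0.5 because only one of the two relevant documents reaches the top three, which is the kind of gap the "Retrieval" row is meant to surface before generation quality is blamed.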