Per docs/design/search-quality-eval.md §1.5 (the two-criteria framing):
- Criterion 1 classes C-H (acronym, CamelCase fragment, deprecation-aware, cross-source canonical, prose, symbol-attribute). Each needs its own corpus and metric.
- Criterion 2 (anti-hallucination): does an LLM agent, given cupertino's top-K results, actually produce correct Swift? This is the real success measure; this baseline is at best a precondition. The Phase 1.7 agent-eval (design §14.4, not yet written) is where Criterion 2 gets measured.
An MRR baseline of 0.9467 is necessary but not sufficient for high-quality agent grounding: an agent can still hallucinate even when the right doc is at rank 1.
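
For reference, MRR (mean reciprocal rank) averages 1/rank of the first relevant result over all queries. The following is a minimal Python sketch of that definition, assuming exactly one relevant doc per query. The function names, doc ids, and sample queries are hypothetical and are not taken from cupertino's eval harness.

```python
from typing import Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant doc (1-indexed), or 0.0 if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical queries and doc ids, for illustration only.
runs = [
    (["doc-a", "doc-b", "doc-c"], "doc-a"),  # relevant doc at rank 1 -> 1.0
    (["doc-x", "doc-y", "doc-z"], "doc-y"),  # relevant doc at rank 2 -> 0.5
    (["doc-p", "doc-q", "doc-r"], "doc-m"),  # relevant doc missing   -> 0.0
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.0) / 3
```

The metric only sees rank positions. Nothing in it observes what an agent does with the retrieved docs, which is why a high score here cannot stand in for Criterion 2.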