Measures:
- Strict match rate against the programmatic ground truth within the top 3 results (26.7%)
- That the ranker often surfaces pages only tangentially related to the question (visible in the miss listing)
- The methodology limit for class G specifically
Does not measure:
- Whether the surfaced pages, taken together, would be useful for an LLM agent constructing a Swift code answer
- Whether human-judged relevance differs materially from the regex-based judgment
- Whether a broadened regex would substantially change the result (a worthwhile follow-up if the test is rerun)
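
For reference, a strict top-3 regex match rate of the kind reported above can be sketched as follows. This is a minimal illustration, not the harness used for the test: the function name `top_k_match_rate`, the shape of the inputs, and the toy page titles and patterns are all assumptions made for the example. It also shows how a broadened regex can move the number, which is the follow-up noted in the last bullet.

```python
import re


def top_k_match_rate(results, patterns, k=3):
    """Fraction of questions whose top-k pages contain a ground-truth match.

    results:  dict of question id -> ranked list of page titles (hypothetical shape)
    patterns: dict of question id -> regex string used as ground truth
    """
    if not patterns:
        return 0.0
    hits = 0
    for qid, pattern in patterns.items():
        regex = re.compile(pattern)
        top = results.get(qid, [])[:k]
        # A question counts as a hit if any of its top-k pages matches.
        if any(regex.search(page) for page in top):
            hits += 1
    return hits / len(patterns)


# Toy data, invented for illustration only.
results = {
    "q1": ["URLSession", "URLRequest", "Data"],
    "q2": ["View", "Text", "Button"],
    "q3": ["Array", "Set", "Dictionary"],
}
strict = {
    "q1": r"^URLSession$",
    "q2": r"^NavigationStack$",
    "q3": r"^Dictionary$",
}
# Same questions, with q2's pattern broadened to accept a related page.
broad = dict(strict, q2=r"^(NavigationStack|View)$")

strict_rate = top_k_match_rate(results, strict)  # 2 of 3 questions hit
broad_rate = top_k_match_rate(results, broad)    # 3 of 3 questions hit
print(f"strict: {strict_rate:.1%}, broadened: {broad_rate:.1%}")
```

Neither number says whether the surfaced pages would help an agent write a correct Swift answer. Both only count regex hits, which is the limit described under "Does not measure".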