`docs/design/search-quality-eval.md` §14.2 Phase 1.5 (prose / conceptual), query class G from §1.4

Search-quality baseline: prose / conceptual (Phase 1.5, v1.2.0 candidate)

This audit tests multi-word natural-language queries — the kind a developer or AI agent would actually issue in a coding session, like "how to make a type Sendable" or "actor reentrancy semantics" — that have no single canonical URI. The right answer is a small SET of relevant documents spread across apple-docs and swift-evolution.

Measured 2026-05-20 · Weak

Headline result
4 / 15 (26.7%)

Aggregate

This is the second-worst class baseline, after acronym (4/22, 18%). For prose, the number is an upper bound on the methodology problem and a lower bound on the ranker problem; the truth lies somewhere in between.


What the Regex Says

The harness defined per-query "valid" URIs as a tight enumeration: for "actor reentrancy semantics", valid = `apple-docs://swift/actor` OR `swift-evolution://SE-0306` OR `swift-evolution://SE-0327`.
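A tight enumeration of this kind could be expressed as a single anchored pattern. The following is a minimal sketch, not the harness's actual code: the helper name `build_valid_pattern`, the use of exact full-string matching, and the sub-page URI in the last line are all assumptions made for illustration.

```python
import re

def build_valid_pattern(valid_uris):
    """Hypothetical helper: one pattern that accepts only the listed URIs."""
    alternatives = "|".join(re.escape(uri) for uri in valid_uris)
    return re.compile(f"(?:{alternatives})")

# Valid set for "actor reentrancy semantics", as enumerated in the audit.
valid = build_valid_pattern([
    "apple-docs://swift/actor",
    "swift-evolution://SE-0306",
    "swift-evolution://SE-0327",
])

def is_valid(uri):
    # fullmatch: the whole URI must equal one of the alternatives
    # (assumed behaviour; the real harness may match more loosely).
    return valid.fullmatch(uri) is not None

accepted = is_valid("swift-evolution://SE-0306")
# Illustrative sub-page URI, not taken from the audit data.
rejected = is_valid("apple-docs://swift/actor/some-subpage")
```

Under these assumptions, a closely related page that is not in the enumeration is scored as a miss, which is the kind of strictness the next card examines.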


Reading the Misses Honestly

Several of the "misses" are arguably correct results that the regex rejected:


What This Audit Measures vs. What It Doesn't


What This Baseline Says About the Ranker

Cupertino's BM25F + RRF configuration is tuned for canonical-lookup and symbol-identifier queries (classes A, D), where it excels (92% P@1, 100% rank-1 fragment recall).
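For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion with k=60, the constant listed in the sources for this measurement. It follows the standard formula (each document scores the sum of 1 / (k + rank) over the lists it appears in). The function name, the two input rankings, and the alphabetical tie-break are illustrative assumptions and say nothing about which lists Cupertino actually fuses.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Two hypothetical ranked lists over the same three documents.
fused = rrf_fuse([
    ["doc-a", "doc-b", "doc-c"],
    ["doc-b", "doc-c", "doc-a"],
])
# doc-b ranks 2nd and 1st, so it edges out doc-a (1st and 3rd).
```

Because only ranks enter the formula, RRF rewards documents that place consistently well across lists, which suits lookups with one clear target.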


Possible Future Directions (Out of Scope for This Audit)

Following the feedback_code_changes_as_ideas_for_future rule:


Implications for Criterion 2 (anti-hallucination)

An AI agent issuing a prose query like "how to make a type Sendable" gets `appentity` and `systemcoordinator` in the top 3, NOT the Sendable protocol page.


Method Recap

15 prose queries, each with a regex matching a per-query enumerated valid-URI set. For each: run `cupertino search "<query>" --limit 10`, find the first-relevant rank, and compute P@3, P@5, and any-match-in-top-3 (binary).
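The per-query metrics can be sketched as follows. This is an illustration under stated assumptions, not the harness itself: the function names are hypothetical, the result list is invented rather than real `cupertino search` output, and P@k is computed with the textbook definition (relevant results in the top k, divided by k).

```python
import re

# Valid set for "actor reentrancy semantics", as enumerated in the audit.
VALID = re.compile(
    r"apple-docs://swift/actor|swift-evolution://SE-0306|swift-evolution://SE-0327"
)

def first_relevant_rank(results, valid):
    """1-based rank of the first result whose URI is in the valid set, else None."""
    for rank, uri in enumerate(results, start=1):
        if valid.fullmatch(uri):
            return rank
    return None

def precision_at_k(results, valid, k):
    """Fraction of the top-k slots filled by valid URIs."""
    return sum(1 for uri in results[:k] if valid.fullmatch(uri)) / k

def any_match_in_top_k(results, valid, k=3):
    """Binary metric: does at least one valid URI appear in the top k?"""
    return any(valid.fullmatch(uri) for uri in results[:k])

# Invented top-5 result list, for illustration only.
results = [
    "apple-docs://example/unrelated-1",
    "swift-evolution://SE-0306",
    "apple-docs://example/unrelated-2",
    "apple-docs://swift/actor",
    "apple-docs://example/unrelated-3",
]

rank = first_relevant_rank(results, VALID)   # 2
p_at_3 = precision_at_k(results, VALID, 3)   # 1/3
p_at_5 = precision_at_k(results, VALID, 5)   # 2/5
top3_hit = any_match_in_top_k(results, VALID)  # True
```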


Combined Phase 1 Baseline Coverage on v1.2.0

Six of eight Phase 1.x classes from §1.4 now have documented baselines. One remains: H (symbol-attribute), plus Phase 1.7 (anti-hallucination, agent end-to-end).


Sources cited in this measurement

Every metric and method this audit relies on, with a link to the foundational source. Auto-collected from the audit text.

P@k (Precision at k): Manning, Raghavan, Schütze (2008), IIR §8.4

Reciprocal Rank Fusion (k=60): Cormack, Clarke, Büttcher (2009), SIGIR

Mean Reciprocal Rank: Voorhees (1999), TREC-8 QA Report