`docs/design/search-quality-eval.md` §14.2 Phase 1.5 (prose / conceptual), query class G from §1.4

Search-quality baseline: prose / conceptual (Phase 1.5, v1.2.0 candidate)

This audit tests multi-word natural-language queries — the kind a developer or AI agent would actually issue in a coding session, like "how to make a type Sendable" or "actor reentrancy semantics" — that have no single canonical URI. The right answer is a small SET of relevant documents spread across apple-docs and swift-evolution.

Measured 2026-05-20 · Weak

Headline result

4 / 15 (26.7%)

Method & source: P@k (Precision at k). Manning, Raghavan, Schütze (2008), IIR §8.4

Aggregate

This is the second-worst class baseline after acronym (4/22, 18%). For prose, this number is an upper bound on the methodology problem and a lower bound on the ranker problem; the truth is somewhere in between.

What the Regex Says

The harness defined per-query "valid" URIs as a tight enumeration: for "actor reentrancy semantics", valid = apple-docs://swift/actor OR swift-evolution://SE-0306 OR swift-evolution://SE-0327.
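
A minimal sketch of how such a tight enumeration could be turned into a matcher, assuming the harness compares each result URI against the full enumerated string. The helper names `build_valid_pattern` and `is_valid` are hypothetical, and the harness's actual regex may be anchored differently.

```python
import re

# Valid URIs for the query "actor reentrancy semantics", as enumerated in the audit.
VALID_URIS = [
    "apple-docs://swift/actor",
    "swift-evolution://SE-0306",
    "swift-evolution://SE-0327",
]

def build_valid_pattern(uris):
    """Compile an alternation of the enumerated URIs (hypothetical helper)."""
    alternation = "|".join(re.escape(uri) for uri in uris)
    return re.compile(f"(?:{alternation})")

PATTERN = build_valid_pattern(VALID_URIS)

def is_valid(uri):
    """True only when the whole URI equals one of the enumerated entries."""
    return PATTERN.fullmatch(uri) is not None
```

Under this assumption, a result URI that is not literally in the list (for example a hypothetical subpage such as `apple-docs://swift/actor/some-subpage`) is counted as a miss, however relevant its content is.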

Reading the Misses Honestly

Several of the "misses" are arguably correct results that the regex rejected.

What This Audit Measures Vs What It Doesn't

What This Baseline Says About the Ranker

Cupertino's BM25F + RRF configuration is tuned for canonical-lookup and symbol-identifier queries (classes A, D), where it excels (92% P@1, 100% rank-1 fragment recall).
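
For readers unfamiliar with the fusion step, here is a generic sketch of Reciprocal Rank Fusion with k=60 (the value listed in the sources for this measurement). How Cupertino feeds its BM25F rankings into RRF is not described here, so the input lists and the function name `reciprocal_rank_fusion` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical per-field rankings for one query.
title_hits = ["a", "b", "c"]
body_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([title_hits, body_hits])
```

Because every list contributes at most 1 / (k + 1) per document, RRF rewards documents that rank well in several lists rather than very high in just one.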

Possible Future Directions (out of Scope for This Audit)

Following the feedback_code_changes_as_ideas_for_future rule:

Implications for Criterion 2 (anti-hallucination)

An AI agent issuing a prose query like "how to make a type Sendable" gets appentity and systemcoordinator in the top 3, NOT the Sendable protocol page.

Method Recap

15 prose queries, each with a regex matching a per-query enumerated valid-URI set. For each: run `cupertino search "<query>" --limit 10`, find the first-relevant rank, and compute P@3, P@5, and any-match-in-top-3 (binary).
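
The per-query scoring can be sketched as follows. This assumes the result URIs have already been extracted from the search output into an ordered list (the output format is not specified here), and the function names are hypothetical.

```python
import re

def first_relevant_rank(uris, pattern):
    """1-based rank of the first URI the regex accepts, or None if none match."""
    for rank, uri in enumerate(uris, start=1):
        if pattern.fullmatch(uri):
            return rank
    return None

def precision_at_k(uris, pattern, k):
    """Fraction of the top-k slots holding a valid URI (denominator is always k)."""
    hits = sum(1 for uri in uris[:k] if pattern.fullmatch(uri))
    return hits / k

def score_query(uris, pattern):
    """Collect the metrics named in the method recap for one query."""
    rank = first_relevant_rank(uris, pattern)
    return {
        "first_relevant_rank": rank,
        "p_at_3": precision_at_k(uris, pattern, 3),
        "p_at_5": precision_at_k(uris, pattern, 5),
        "any_match_top_3": rank is not None and rank <= 3,
    }

# Hypothetical ranked results: the only valid URI appears at rank 4.
valid = re.compile(r"swift-evolution://SE-0306|apple-docs://swift/actor")
results = ["x://a", "x://b", "x://c", "swift-evolution://SE-0306", "x://d"]
scores = score_query(results, valid)
```

In this hypothetical case the query counts as a miss for any-match-in-top-3 even though a valid document sits at rank 4, which is why the first-relevant rank is recorded alongside the binary metric.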

Combined Phase 1 Baseline Coverage on v1.2.0

Six of eight Phase 1.x classes from §1.4 now have documented baselines. One class remains: H (symbol-attribute). Phase 1.7 (anti-hallucination, agent end-to-end) is also still outstanding.

Sources cited in this measurement

Every metric and method this audit relies on, with a link to the foundational source. Auto-collected from the audit text.

P@k (Precision at k)

Reciprocal Rank Fusion (k=60)

Mean Reciprocal Rank