`docs/design/search-quality-eval.md` Phase 1 (single-system mode, no Phase 2 human judging)

Search-quality baseline: v1.2.0 candidate `search.db`

This audit records the v1.2.0 candidate database's standing on Criterion 1 (good search), restricted to query classes A (canonical lookup) and B (framework-root) per the design's §1.4 taxonomy. It is an absolute baseline: future ranking changes are measured against this single-system snapshot using the same harness in paired mode. Classes C-H of the taxonomy are out of scope per design §3 (NG6).

Measured 2026-05-20 · Strong

Headline result

46 / 50 queries returned the right answer at rank 1

Method & source: P@k (Precision at k), Manning, Raghavan, Schütze (2008) IIR §8.4

Read in detail

Each card opens its own page. The headline and charts above are all you need at a glance; the cards are for the why and how.

Aggregate

P@5 looks low next to MRR because each of the 50 queries has exactly one canonical right answer in this design. With at most one match possible in the top 5, P@5 is capped at 1/5 = 0.2 per query, while MRR can still reach 1.0.
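The ceiling can be checked with a small sketch. The function names and the sample ranks below are illustrative assumptions, not the audit's harness or its data; each entry is the 1-based rank of the single right answer, or `None` when it was not returned.

```python
def precision_at_k(ranks, k=5):
    """Mean P@k when every query has exactly one relevant result."""
    # A query contributes 1/k if its single right answer is in the top k, else 0.
    hits = [1 if r is not None and r <= k else 0 for r in ranks]
    return sum(h / k for h in hits) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank, counting a missing right answer as 0."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


# Hypothetical ranks for four queries (not the audit's results).
sample_ranks = [1, 1, 2, None]
p5 = precision_at_k(sample_ranks)
mrr = mean_reciprocal_rank(sample_ranks)

# Even a perfect run (every right answer at rank 1) tops out at P@5 = 0.2.
perfect_p5 = precision_at_k([1] * 50)
perfect_mrr = mean_reciprocal_rank([1] * 50)
```

Under these assumptions a perfect run scores an MRR of 1.0 but a P@5 of only 0.2, which is why the two numbers should not be read on the same scale.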

Sub-perfect Cases (4 of 50)

The four queries that did not yield a top-1 match. Each is informative.

Method Recap

50 canonical-lookup queries, each paired with a right-answer URI regex. For each query, `cupertino search "<query>" --limit 10` was invoked via the develop-tip binary with `cupertino.config.json` set to baseDirectory: ~/.cupe…
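The scoring step implied by this setup (find where the right-answer regex first matches in the returned list) might look like the sketch below. It is an assumption about the approach, not the harness itself: `first_match_rank`, the use of `re.search` rather than a full match, and the sample URIs are all hypothetical.

```python
import re


def first_match_rank(result_uris, right_answer_pattern, limit=10):
    """Return the 1-based rank of the first URI matching the pattern, or None."""
    pattern = re.compile(right_answer_pattern)
    # Only the first `limit` results are considered, mirroring --limit 10.
    for rank, uri in enumerate(result_uris[:limit], start=1):
        if pattern.search(uri):
            return rank
    return None


# Hypothetical result list and pattern (not real cupertino output).
results = ["docs://framework/other", "docs://framework/target"]
rank = first_match_rank(results, r"^docs://framework/target$")
missing = first_match_rank(results, r"^docs://framework/absent$")
```

A rank of 1 counts as a top-1 match; `None` means the right answer did not appear within the limit.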

What This Baseline Does Not Measure

Per docs/design/search-quality-eval.md §1.5 (the two-criteria framing):

How to Use This Baseline

When evaluating a future ranking change (BM25F weight tweak, new tokenizer, schema change), re-run the same 50-query corpus on both the unchanged binary/DB and the changed binary/DB, use the paired-comparison mode (/tmp/…
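As a rough illustration of what a paired comparison over the same queries involves, the sketch below computes the Wilcoxon signed-rank sums from per-query reciprocal ranks. That the paired mode works on reciprocal ranks, and that it uses this test, are assumptions based on the sources this audit cites; the function name and sample values are hypothetical, and no p-value is computed.

```python
def wilcoxon_signed_rank_sums(before, after):
    """Return (W+, W-) for paired scores, dropping zero differences."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of equal absolute differences starting at i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        # Tied values share the average of their 1-based ranks.
        average_rank = (i + 1 + j + 1) / 2.0
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical per-query reciprocal ranks for the unchanged and changed systems.
baseline_rr = [1.0, 0.5, 0.5, 1.0]
changed_rr = [1.0, 1.0, 0.25, 1.0]
w_plus, w_minus = wilcoxon_signed_rank_sums(baseline_rr, changed_rr)
```

Queries whose score did not change carry no information in this test, so a baseline where most queries already sit at rank 1 leaves few non-zero pairs to rank.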

Sources cited in this measurement

Every metric and method this audit relies on, with a link to the foundational source. Auto-collected from the audit text.

P@k (Precision at k)

Manning, Raghavan, Schütze (2008) IIR §8.4

Mean Reciprocal Rank

Voorhees (1999), TREC-8 QA Report

NDCG

Järvelin & Kekäläinen (2002)

Reciprocal Rank Fusion (k=60)

Cormack, Clarke, Büttcher (2009), SIGIR

Wilcoxon signed-rank test

Wilcoxon (1945), Biometrics Bulletin
