Sources · cupertino search-quality dashboards

Where every metric on the dashboards comes from

Every number on the cupertino search-quality dashboards is a measurement using methods from peer-reviewed information-retrieval research and standard statistics. Nothing is invented for this project. This page lists, for each metric or method we use, the foundational source.

Compiled 2026-05-20 · All sources are publicly accessible; most are free to read.

The textbook this whole thing builds on

Manning, Raghavan, Schütze (2008). Introduction to Information Retrieval.

Cambridge University Press. Stanford NLP. Free full text online.

The reference for everything we do on the search-quality side. Chapter 8 (Evaluation in information retrieval) defines all the metrics on the dashboards. The cupertino universal IR-eval rule and the cupertino-specific design both cite this book extensively.

Read free online

Metrics we compute on the dashboards

MRR

Mean Reciprocal Rank

The position of the first right answer, averaged across queries.

For each query, take 1 / (rank of the first relevant document). Average across all queries. If the right doc is at position 1 for every query, MRR = 1.0. If always at position 2, MRR = 0.5. Used when each query has one canonical right answer — cupertino's canonical-lookup case.

Source: Voorhees, E.M. (1999). The TREC-8 Question Answering Track Report. NIST Special Publication.

Read the paper (PDF)
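The MRR definition above can be sketched in a few lines of Python. This is a minimal illustration, not cupertino's harness: the function names and the sample runs are hypothetical, and scoring a query whose right answer never appears as 0 is an assumed convention.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1 / rank of the first relevant document, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_id) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical sample: three queries with one canonical answer each.
runs = [
    (["a", "b", "c"], "a"),  # answer at rank 1 -> 1.0
    (["a", "b", "c"], "b"),  # answer at rank 2 -> 0.5
    (["a", "b", "c"], "z"),  # answer missing   -> 0.0
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```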

P@k

Precision at k

Of the top k results, what fraction is relevant.

If the top-5 returns 3 relevant docs, P@5 = 0.6. P@1 = whether the very first result is relevant. We use P@1, P@5, P@10 on the dashboards.

Source: Manning, Raghavan, Schütze (2008), §8.4 "Evaluation of ranked retrieval results."

Read the chapter
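As a rough sketch of the P@k arithmetic above (hypothetical function name and document ids; dividing by k even when fewer than k results come back is an assumed convention):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are in the relevant set."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


ranked = ["d1", "d2", "d3", "d4", "d5"]   # hypothetical result list
relevant = {"d1", "d3", "d5"}             # hypothetical judgments

p_at_5 = precision_at_k(ranked, relevant, 5)  # 3 relevant in top 5 -> 0.6
p_at_1 = precision_at_k(ranked, relevant, 1)  # first result is relevant -> 1.0
```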

NDCG@k

Normalized Discounted Cumulative Gain

Like P@k but penalises relevant results that appear lower down.

Each relevant doc contributes a "gain" that shrinks the further down the ranking it appears. The discounted gains are summed and divided by IDCG (the same sum for a perfect ranking), so a perfect ranking scores 1.0. Used when ranking position matters and relevance can be graded. Standard in modern machine-learning ranking work.

Source: Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4).

DOI: 10.1145/582415.582418
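A minimal sketch of NDCG@k, assuming the commonly used discount gain / log2(rank + 1) and linear gains. The original paper parameterises the log base, and other gain functions are in use, so treat the exact formula here as one common variant rather than the one the dashboards necessarily use. Function names and grades are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, judged_gains, k):
    """ranked_gains: grades in returned order; judged_gains: all judged grades for the query."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


judged = [3, 2, 0]                          # hypothetical graded judgments
perfect = ndcg_at_k([3, 2, 0], judged, 3)   # ideal order -> 1.0
swapped = ndcg_at_k([3, 0, 2], judged, 3)   # grade-2 doc pushed to rank 3 -> below 1.0
```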

MAP

Mean Average Precision

Precision averaged across positions, then averaged across queries.

For each query, compute precision at every position where a relevant document appears, then average. Then average across queries. Known for "especially good discrimination and stability" (Manning IIR §8.4). The default headline metric at TREC.

Source: Manning, Raghavan, Schütze (2008), §8.4.

Read the chapter
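The two averaging steps in MAP can be sketched as follows. Names and data are hypothetical; the sketch divides by the total number of relevant documents, so a relevant document that is never retrieved contributes a precision of 0.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "x", "b", "y"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2 = 5/6
    (["x", "a"], {"a"}),                 # AP = (1/2) / 1 = 1/2
]
map_score = mean_average_precision(runs)  # (5/6 + 1/2) / 2 = 2/3
```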

R-Precision

P@k where k equals the number of known relevant docs.

Adjusts for queries that have widely different numbers of relevant documents. Cleanest when the relevant-set size is known up front; used for prose / conceptual queries where multiple documents jointly answer the question.

Source: Manning, Raghavan, Schütze (2008), §8.4.

Read the chapter
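R-Precision is P@k with k set per query, which a short sketch makes concrete (hypothetical names and data):

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of known relevant docs."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids)
    return hits / r


# Three known relevant docs, so only the top 3 results are inspected.
score = r_precision(["a", "x", "b", "c"], {"a", "b", "c"})  # 2 of top 3 -> 2/3
```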

Statistical tests on the dashboards

Paired test

Wilcoxon signed-rank test

The right significance test for rank metrics.

When comparing two search systems on the same queries, the per-query reciprocal-rank differences aren't normally distributed (rank metrics are bounded and discrete). The Wilcoxon signed-rank test is the standard non-parametric paired test for this case. Used in any future cupertino comparison of two builds.

Source: Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6).

JSTOR: 3001968
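For illustration, here is a standard-library sketch of the signed-rank statistic with a two-sided p-value from the normal approximation. It drops zero differences, gives tied absolute differences their average rank, and applies no tie or continuity correction. These are simplifying assumptions, and with few queries an exact test (for example scipy.stats.wilcoxon) is the safer choice. The function name and per-query scores are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard-normal tail
    return w_plus, p


# Hypothetical per-query reciprocal ranks for two builds on the same 8 queries.
rr_a = [1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.25]
rr_b = [0.5, 0.25, 0.5, 0.5, 0.25, 1.0, 0.5, 0.5]
w_plus, p_value = wilcoxon_signed_rank(rr_a, rr_b)  # w_plus = 19.5 out of a possible 21
```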

Paired binary test

McNemar's test

For paired binary outcomes — does X help or not?

When the per-task outcome is binary (compiled / didn't compile, agent's code was correct / wasn't), the right paired test is McNemar's. Used in the Phase 1.7 anti-hallucination eval design to compare LLM-with-cupertino vs LLM-alone.

Source: McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2): 153-157.

DOI: 10.1007/BF02295996
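A sketch of the exact (binomial) form of McNemar's test, which looks only at the discordant pairs: tasks where one condition succeeded and the other failed. The function name and outcome vectors are hypothetical, and the exact form is chosen here because it behaves well on small task sets; the original paper describes a chi-squared form.

```python
from math import comb


def mcnemar_exact(with_tool, without_tool):
    """Return (b, c, two-sided p) for paired boolean outcomes.

    b = tasks that pass only with the tool, c = tasks that pass only without it.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    b = sum(1 for w, wo in zip(with_tool, without_tool) if w and not wo)
    c = sum(1 for w, wo in zip(with_tool, without_tool) if wo and not w)
    n = b + c
    if n == 0:
        return b, c, 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for 10 tasks (True = generated code compiled).
with_tool = [True] * 8 + [False] * 2
without_tool = [True] * 2 + [False] * 6 + [True, False]
b, c, p_value = mcnemar_exact(with_tool, without_tool)  # b = 6, c = 1, p = 0.125
```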

Sign test

Binomial / sign test

For "did X win or lose" on each query.

Simplest paired non-parametric test: count wins vs losses on a pre-defined event, compare against 50/50 chance via the binomial distribution. Used in the deprecation-aware audit ("Swift form wins over Obj-C form 30 / 30 times, p = 0.0078").

Source: Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd ed., Wiley. The sign test goes back further (Arbuthnot 1710), but Conover is the modern canonical reference.

Background
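The win/loss counting described above reduces to a binomial tail. A minimal sketch, assuming ties have already been excluded and a two-sided alternative is wanted (function name and counts are hypothetical):

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided exact sign-test p-value against a 50/50 null."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction, then doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(9, 1)  # 2 * (10 + 1) / 1024 = 0.021484375
```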

Agreement

Cohen's kappa (κ)

How much two human judges agree, beyond chance.

When two people independently judge the same documents as relevant or not, you need to know whether their agreement is real or accidental. κ corrects for chance agreement. The Phase 2 TREC-grade human-pooling step would measure κ on a double-judged subset; cupertino targets κ ≥ 0.8 (Manning IIR §8.5).

Source: Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1).

DOI: 10.1177/001316446002000104
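To make the chance correction concrete, here is a sketch of κ with each judge's own label frequencies used for the expected agreement. Some textbook treatments pool the two judges' marginals instead, which can give a slightly different value. Names and judgments are hypothetical.

```python
def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(judge_a)
    observed = sum(1 for a, b in zip(judge_a, judge_b) if a == b) / n
    labels = set(judge_a) | set(judge_b)
    # Probability that both judges pick the same label if they judged independently.
    expected = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical relevance judgments (1 = relevant) on 10 double-judged documents.
judge_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(judge_a, judge_b)  # observed 0.8, expected 0.5 -> about 0.6
```

In this sample the judges agree on 8 of 10 documents, yet κ is only about 0.6, below the 0.8 target: half of that raw agreement would be expected by chance.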

Methodology paradigms

Paradigm

Cranfield paradigm

The basic recipe for evaluating any IR system.

A test collection has three components: a document corpus, a set of queries, and relevance judgments for each (query, document) pair. Run every system under comparison on every query, score the results against the judgments, and compute metrics. Every dashboard on this page is a Cranfield-style evaluation.

Source: Cleverdon, C. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6).

DOI: 10.1108/eb050097

Paradigm

TREC pooling

How to build relevance judgments without judging every document.

Take the top-K from every system being evaluated, take the union (the "pool"), and have humans judge only that subset. Used to make human-judging cost-bounded. The Phase 2 cupertino plan follows this. NIST's TREC has operated on pooled judgments since 1992.

Source: Sparck Jones, K., & van Rijsbergen, C.J. (1975). Report on the need for and provision of an ideal information retrieval test collection. British Library R&D Report 5266. Operationalised by TREC since 1992.

TREC overview
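The pooling step for a single query can be sketched as a union of top-K lists. The build names, document ids, and depth below are hypothetical; a real pool is built per query across all runs.

```python
def build_pool(runs, depth):
    """runs: {system_name: ranked doc ids}. Returns the set of docs humans must judge."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


runs = {
    "build_a": ["d1", "d2", "d3", "d4"],
    "build_b": ["d3", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)  # {"d1", "d2", "d3", "d5"}: 4 docs to judge, not 6
```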

Ranking algorithms inside cupertino

Ranking

BM25

The classical "best match" relevance score.

For each query term, weight by term-frequency-in-document and inverse-document-frequency, with a saturating function so very-frequent-in-doc terms don't dominate. The default ranking primitive in SQLite's FTS5 extension that cupertino uses.

Source: Robertson, S.E., & Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Proceedings of SIGIR 1994.

ACM DL
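The saturation behaviour is easiest to see in a single-term score. The sketch below uses a common textbook form of BM25 with k1 = 1.2 and b = 0.75; the exact IDF variant and constants differ between implementations, so treat these as assumptions rather than a statement of what any particular engine computes. The corpus statistics are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Same average-length document, term appearing 1, 10, and 100 times.
scores = [bm25_term_score(tf, 100, 100, 1000, 10) for tf in (1, 10, 100)]
idf = math.log((1000 - 10 + 0.5) / (10 + 0.5))
ceiling = idf * (1.2 + 1)  # the score approaches but never reaches idf * (k1 + 1)
```

Going from 1 to 10 occurrences roughly doubles the score instead of multiplying it by 10, and further repetitions add almost nothing.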

Ranking

BM25F (field-weighted BM25)

BM25 with per-column weights.

Each indexed column gets its own weight. Cupertino's main FTS5 table uses a 9-weight vector: title=10, symbols=5, summary=3, framework=2, symbol_components=1.5, others=1. The cupertino-specific tuning lives in Packages/Sources/Search/Search.Index.Search.swift.

Source: Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. Proceedings of CIKM 2004.

DOI: 10.1145/1031171.1031181
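Per-column weights can be passed to FTS5's built-in bm25() function at query time. The sketch below uses a hypothetical three-column table, not cupertino's actual schema, with the title=10, summary=3, others=1 weights from the vector above. It assumes the Python sqlite3 module is linked against an SQLite build with FTS5 enabled. bm25() returns lower (more negative) values for better matches, so an ascending sort puts the best match first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, summary, body)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("URLSession", "Coordinates network data transfer tasks.", "An object for requests."),
        ("Networking overview", "Introductory guide.", "Use URLSession to fetch data."),
        ("Color", "A representation of a color.", "Adapts to a given context."),
        ("Text", "A view that displays text.", "Displays one or more lines."),
        ("Button", "A control that initiates an action.", "Tap to trigger the action."),
    ],
)

# One weight per column, in declaration order: title, summary, body.
rows = con.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs, 10.0, 3.0, 1.0)",
    ("urlsession",),
).fetchall()
titles = [row[0] for row in rows]  # the title match outranks the body-only match
```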

Fusion

Reciprocal Rank Fusion

Combining rankings from multiple sources without per-source score normalisation.

For each document, sum weight(source) / (k + rank_in_source(doc)) across all sources that returned it. Robust to incompatible per-source score scales. Cupertino uses RRF with k=60 (the paper's default) to fuse apple-docs, Swift Evolution, HIG, Apple Archive, Swift Book, Swift.org, and packages.

Source: Cormack, G.V., Clarke, C.L.A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proceedings of SIGIR 2009.

DOI: 10.1145/1571941.1572114

Substrate cupertino sits on

Engine

SQLite FTS5

The full-text-search extension bundled with SQLite.

Cupertino uses off-the-shelf FTS5 — no custom build, no third-party loadable extensions. The 9-weight BM25F vector is configured at query time; the tokenizer is FTS5's bundled porter unicode61. See the database architecture doc for the full schema.

Source: SQLite documentation, FTS5: Full Text Search.

sqlite.org/fts5.html

Tokenizer

Porter stemmer

English-language word stemming.

Reduces regularly inflected forms to a common stem (running, runs → run). It is a suffix stripper, so irregular forms such as ran are left unchanged. Cupertino enables Porter stemming on the prose-bearing FTS5 tables and disables it on identifier-bearing tables (you don't want URLSession stemmed).

Source: Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3).

Reference implementation
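The effect of enabling or disabling stemming per table can be seen with two small FTS5 tables. The table names and sample text are hypothetical, and the sketch assumes the Python sqlite3 module is linked against an SQLite build with FTS5 enabled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same content, two tokenizer configurations.
con.execute("CREATE VIRTUAL TABLE prose USING fts5(body, tokenize = 'porter unicode61')")
con.execute("CREATE VIRTUAL TABLE identifiers USING fts5(body, tokenize = 'unicode61')")
for table in ("prose", "identifiers"):
    con.execute(f"INSERT INTO {table} VALUES ('the task runs in the background')")


def count_matches(table, query):
    """Number of rows in the given FTS5 table that match the query."""
    sql = f"SELECT count(*) FROM {table} WHERE {table} MATCH ?"
    return con.execute(sql, (query,)).fetchone()[0]


stemmed_hits = count_matches("prose", "running")          # "running" and "runs" share a stem
unstemmed_hits = count_matches("identifiers", "running")  # exact tokens only, so no match
```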

Apple toolchain

swift-syntax

Apple's Swift parser library, used by cupertino to extract symbols from code snippets.

Cupertino's AST indexer (issue #81) uses swift-syntax to walk Swift code embedded in Apple's documentation, extract every declaration's name, kind, signature, attributes, and generic constraints, then store them in a dedicated doc_symbols table and feed identifier columns into the FTS5 ranking.

github.com/swiftlang/swift-syntax

Apple toolchain

swift symbolgraph-extract

Apple's authoritative SDK symbol-graph emitter.

Runs on the active Xcode toolchain to produce a per-module JSON describing every public API surface in the SDK. Cupertino's sibling repo cupertino-symbolgraphs runs it for every Apple framework and uses the output to populate the authoritative generic-constraints table for cupertino's iteration-3 enrichment pass (#759).

Swift toolchain

The benchmark we deliberately don't replicate

SWE-bench

A general-purpose benchmark for LLM-driven code generation.

Phase 1.7's design borrows the "compile + functional-correctness" scoring approach from SWE-bench but does not attempt to replicate the full benchmark. Cupertino's task corpus is Apple-platform-Swift-specific and ~30 tasks; SWE-bench targets thousands of Python repository tasks. The right answer when we need broader coverage is to adopt SWE-bench wholesale, not extend our harness toward it.

Source: Jimenez, C., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.

arXiv:2310.06770