Reciprocal Rank Fusion (k=60)
Cormack, Clarke, Büttcher (2009), SIGIR
Cupertino is a local-first indexing and serving system for Apple developer documentation. It crawls the full Apple docs JSON API (~412,000 pages, ~420 frameworks), imports the result into a SQLite FTS5 database with a multi-pass BM25F ranker, and serves it to AI agents over the Model Context Protocol from a single hand-rolled Swift binary. Search latency is sub-100ms p99; the entire system runs offline after a one-time bundle download. The bundle is rebuilt and distributed via GitHub Releases (~685 MB compressed). Crawl-to-bundle is a ~14-day pipeline that runs out-of-band; the user-facing install path skips the crawl entirely.
Apple publishes developer documentation as a JavaScript-rendered single-page application backed by an undocumented JSON API. AI coding agents (Claude, Copilot, Cursor, etc.) need accurate, current Apple API references to avoid generating code that calls nonexistent symbols, uses deprecated APIs, or violates platform availability constraints. The naive approaches all fail:
The corpus has favorable properties for a local index:
URLSession, @Observable, dataTaskPublisher). FTS5 + BM25F handles this at sub-millisecond latency without an embeddings pipeline.

The Model Context Protocol (Anthropic, 2024) standardizes how AI agents discover and invoke tools and read external resources. MCP-compatible agents (Claude Desktop, Claude Code, Cursor, VS Code Copilot, Zed, Windsurf, OpenAI Codex, GitHub Copilot for Xcode) can consume Cupertino without per-agent integration work. Targeting MCP avoids building N adapter binaries.
- framework_aliases and document count in docs_metadata.
- Task finds Swift's Task struct above the Mach kernel's task_* C functions.
- kind, conforms_to, @MainActor, platform availability (min_ios, min_macos, etc.), and other DocC-extracted fields without parsing JSON at query time.

These are explicit non-goals so future contributors don't waste cycles re-proposing them:
- FASTLANE_SESSION and xcodes-style SRP authentication were evaluated and rejected; no consumer in the design needs them.
- cupertino setup to get a clean bundle. Migration code paths are a known source of subtle bugs in long-lived databases; we chose to skip them entirely and rebuild from source.

| ID | Requirement | Verified by |
|---|---|---|
| F1 | Full-text search across all indexed sources | cupertino search, MCP search tool |
| F2 | Read a full document by apple-docs:// URI | cupertino read, MCP read_document |
| F3 | List indexed frameworks with document counts | cupertino list-frameworks, MCP list_frameworks |
| F4 | Filter search results by framework, platform version, source | CLI flags and MCP tool parameters |
| F5 | Symbol-shape queries (find all actors conforming to Sendable) | MCP search_symbols, search_conformances |
| F6 | Property-wrapper and concurrency pattern lookup | MCP search_property_wrappers, search_concurrency |
| F7 | Sample code read at file granularity | MCP read_sample, read_sample_file |
| F8 | Crawl resume from arbitrary kill point | Crawler session checkpoint, keyed by start URL |
| F9 | Content-hash-based incremental re-crawl | SHA-256 per JSON file; skip on hash match |
| F10 | Audit log of all rejected and deduplicated rows | Search.JSONLImportLogSink writes JSONL |
| ID | Requirement | Target | Current state |
|---|---|---|---|
| N1 | Search latency p99 | < 100ms | Holds on v1.1.0 bundle (~285k docs); tracked in integration test |
| N2 | Read latency p99 | < 50ms | SQLite single-row lookup on indexed PK |
| N3 | Memory at serve time | < 500MB resident | SQLite mmap; process is the binary + open file handles |
| N4 | Storage footprint | < 5 GB bundle on disk | search.db 2.4 GB + packages.db 990 MB + samples.db 185 MB = ~3.6 GB (v1.1.0) |
| N5 | Setup time | < 60s | bundle download from GitHub Releases, ~685 MB compressed |
| N6 | Scale headroom | 10x current corpus | analyzed in §10 |
| N7 | Correctness | zero data loss on power loss; zero tier-C collisions on healthy run | SQLite ACID + door classification (docs/PRINCIPLES.md §2-3) |
| N8 | Reproducibility | same crawl → same bundle (modulo Apple changes) | content hashing + deterministic URI canonicalization |
| N9 | Portability | single Homebrew formula, no runtime deps | binary is self-contained except for SQLite (system) |
| N10 | Build hygiene | zero producer→producer imports; CI-enforced | scripts/check-package-purity.sh + scripts/check-target-foundation-only.sh |
Four sequential stages. Each is independently runnable; the user-facing cupertino save --docs chains them inline today, but they are designed to decouple cleanly (epic #769):
┌────────┐   ┌──────────────┐   ┌────────┐   ┌────────┐
│ Crawl  │ → │ Import/Index │ → │ Enrich │ → │ Serve  │
└────────┘   └──────────────┘   └────────┘   └────────┘
   ~14d            ~12h           ~5 min      runtime
| Stage | Binary | Input | Output | Idempotent |
|---|---|---|---|---|
| Crawl | cupertino fetch --type docs | Apple JSON API over the network | ~/.cupertino/docs/**/*.json | Yes (content hash skip) |
| Import / Index | cupertino save --docs | JSON files on disk | search.db | Yes (INSERT OR REPLACE) |
| Enrich | cupertino-postprocessor (after #769) | search.db | enrichment columns in same DB | Yes (upsert + version column) |
| Serve | cupertino serve | search.db | MCP responses over stdio | Read-only |
The decoupling lets us re-run enrichment without a re-crawl, ship pre-built bundles via GitHub Releases (skipping crawl + index for end users), and run the crawl on dedicated hardware.
┌──────────────────────────────────────────┐
│ Apple developer docs │
│ developer.apple.com/tutorials/data/... │
└─────────────────┬────────────────────────┘
│ WKWebView, 0.05s delay, BFS
▼
┌──────────────────────────────────────────┐
│ ~/.cupertino/docs/<framework>/*.json │
│ git-tracked corpus repo (cupertino-docs) │
└─────────────────┬────────────────────────┘
│ cupertino save --docs
▼
┌──────────────────────────────────────────┐
│ search.db (SQLite FTS5 + metadata + AST)│
└─────────────────┬────────────────────────┘
│ enrichment passes
▼
┌──────────────────────────────────────────┐
│ search.db (synonyms, constraints, │
│ hierarchy, recovery applied) │
└─────────────────┬────────────────────────┘
│ cupertino serve / search
┌───────────┴───────────────┐
▼ ▼
┌───────────────────┐ ┌──────────────────────┐
│ MCP (stdio JSON- │ │ CLI (search, read, │
│ RPC) for agents │ │ list, doctor) │
└───────────────────┘ └──────────────────────┘
Cupertino is a single Swift binary that runs in one of two modes, detected at startup:
| stdin shape | argv | Mode |
|---|---|---|
| pipe | none or serve | MCP server (long-lived, reads JSON-RPC, replies on stdout) |
| TTY | none or serve | error: serve from terminal is almost always wrong |
| any | subcommand (search, read, fetch, save, doctor, setup, cleanup, list-frameworks, ...) | CLI mode, executes subcommand, exits |
The single-binary topology was chosen over a separate cupertino-mcp to simplify deployment: one Homebrew formula, one PATH entry, one update path. Per-mode dispatch is a 5-line check at startup.
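The dispatch table above can be sketched as a small pure function. This is an illustration only: `RunMode` and `resolveMode` are hypothetical names, and the real startup check may differ in detail. In an actual entry point the `stdinIsTTY` value would typically come from `isatty(STDIN_FILENO)`.

```swift
import Foundation

/// The run modes from the table above, plus the TTY error case.
enum RunMode: Equatable {
    case mcpServer
    case cli(subcommand: String)
    case errorServeFromTerminal
}

/// Hypothetical helper: picks a mode from the stdin shape and argv
/// (argv here excludes the program name).
func resolveMode(stdinIsTTY: Bool, arguments: [String]) -> RunMode {
    // No argument, or an explicit `serve`, both mean "act as the MCP server".
    guard let first = arguments.first, first != "serve" else {
        return stdinIsTTY ? .errorServeFromTerminal : .mcpServer
    }
    // Anything else is treated as a CLI subcommand, whatever stdin looks like.
    return .cli(subcommand: first)
}

print(resolveMode(stdinIsTTY: false, arguments: []))
print(resolveMode(stdinIsTTY: true, arguments: ["search", "Task"]))
```

Keeping the decision in a pure function makes the three table rows directly unit-testable without spawning a process.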
Apple's developer documentation site is an SPA. The page at https://developer.apple.com/documentation/swiftui/view fetches its content from a parallel JSON endpoint:
https://developer.apple.com/tutorials/data/documentation/swiftui/view.json
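As a rough sketch of that mapping, the JSON endpoint can be derived from the page URL by prefixing the path with `/tutorials/data` and appending `.json`. The helper name `jsonEndpoint(forPage:)` is hypothetical, and the rule is inferred only from the single example pair above.

```swift
import Foundation

/// Hypothetical helper: maps a documentation page URL to its JSON data endpoint.
/// Returns nil for URLs outside developer.apple.com/documentation/.
func jsonEndpoint(forPage page: URL) -> URL? {
    guard page.host == "developer.apple.com",
          page.path.hasPrefix("/documentation/") else { return nil }
    // Drop any trailing slash so ".json" attaches to the last path component.
    var path = page.path
    while path.hasSuffix("/") { path.removeLast() }
    return URL(string: "https://developer.apple.com/tutorials/data\(path).json")
}

let page = URL(string: "https://developer.apple.com/documentation/swiftui/view")!
print(jsonEndpoint(forPage: page)?.absoluteString ?? "not a documentation page")
```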
The JSON response is structured. Key fields:
| Field | Type | Used for |
|---|---|---|
| metadata.title | string | display name, FTS5 title column |
| metadata.role | string | maps to kind column (func, struct, ...) |
| metadata.externalID | string | stable symbol identifier (e.g., c:@S@exclave_textlayout_info_v1) |
| metadata.modules[].name | string | framework slug |
| metadata.platforms[] | array | drives min_ios/min_macos/etc columns |
| abstract | array | short summary, FTS5 summary column |
| primaryContentSections[] | array | declarations, discussion, code examples |
| references | dict | linked symbol metadata |
| diffAvailability | dict | version diff vs previous SDKs |
The shape is not publicly documented. Cupertino reverse-engineers it. Schema drift is a known risk (§13).
For ~99% of pages, a direct HTTP GET against the JSON endpoint works. For the remaining ~1%, Apple's CDN returns one of:

- "title": null with otherwise valid JSON

WKWebView executes the page's JavaScript, follows redirects with browser semantics, and gives us the fully-resolved JSON. The cost is macOS-only crawling. The benefit is complete coverage.
Crawler runs WKWebView with @MainActor isolation, navigates to the page URL, waits for the JS to settle, extracts the embedded JSON via a small JS bridge, and writes it to disk. Each navigation uses a fresh WKWebView instance to avoid state leakage; instances are pooled via a small actor.
The crawl starts from https://developer.apple.com/documentation/technologies and runs a breadth-first search (BFS) over references[] links, scoped to documentation paths. Frameworks are also seeded from technologies.json (Apple's framework directory) to catch the ~200 frameworks not reachable from the homepage at depth ≤ 5.
Configurable knobs:
| Flag | Default | Purpose |
|---|---|---|
| --max-pages | unbounded for --type docs and --type all | safety bound for testing |
| --max-depth | 21 | empirical; depth at which Apple's deeper symbol pages live |
| --request-delay | 0.05s | conservative rate limit; nothing is publicly documented, but 50ms is well below Apple's threshold |
| --start-url | docs home | for targeted re-crawls |
| --allowed-prefixes | none | filter BFS to a path prefix |
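To make the interaction of these knobs concrete, here is a minimal breadth-first traversal sketch. It is not the crawler's actual code: `CrawlLimits` and `crawlOrder` are hypothetical names, the `links` closure stands in for "fetch the page and read references[]", and request delay and checkpointing are left out.

```swift
import Foundation

/// Hypothetical crawl limits mirroring the flags in the table above.
struct CrawlLimits {
    var maxPages: Int? = nil            // nil means unbounded
    var maxDepth: Int = 21
    var allowedPrefixes: [String] = []  // empty means no prefix filter
}

/// Breadth-first traversal that returns URLs in visit order.
func crawlOrder(from start: String,
                limits: CrawlLimits,
                links: (String) -> [String]) -> [String] {
    var visited: Set<String> = [start]
    var queue: [(url: String, depth: Int)] = [(start, 0)]
    var order: [String] = []
    var head = 0
    while head < queue.count {
        let (url, depth) = queue[head]
        head += 1
        // Stop once the page budget is spent.
        if let cap = limits.maxPages, order.count >= cap { break }
        order.append(url)
        // Pages at the depth limit are visited but not expanded.
        guard depth < limits.maxDepth else { continue }
        for next in links(url) where !visited.contains(next) {
            let allowed = limits.allowedPrefixes.isEmpty
                || limits.allowedPrefixes.contains(where: { next.hasPrefix($0) })
            if allowed {
                visited.insert(next)
                queue.append((next, depth + 1))
            }
        }
    }
    return order
}

// Tiny in-memory link graph used in place of real network fetches.
let graph = ["a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []]
print(crawlOrder(from: "a", limits: CrawlLimits(), links: { graph[$0] ?? [] }))
```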
Every fetched URL is canonicalized to an apple-docs:// URI before storage. The canonicalization is lossless and reversible:
- building-a-pass and building_a_pass; in progress #285).
- apple-docs://<framework>/<path-tail-joined-by-slashes>.

The URI is reversible to the original URL by string substitution. No hashing. This guarantees:

- 1 - exp(-N² / (2·2⁶⁴)) collision probability; at 40M docs the expected collisions are non-zero. Lossless URIs make collisions impossible by construction.

Each fetched JSON is hashed with SHA-256 and compared against the previously stored file at the canonicalized URI. If the hashes match, the new fetch is discarded (the disk file is already current). This makes re-crawls incremental: a re-fetch of an unchanged page is one HTTP request + one SHA computation, no disk write.
The hash is computed over the JSON bytes after canonicalizing key order. JSON serialization can vary across crawls even when the content hasn't changed; key-order canonicalization keeps such runs from registering as spurious content changes.
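A minimal sketch of key-order-canonical hashing, assuming Foundation's `JSONSerialization` with `.sortedKeys` as the canonicalizer and CryptoKit for SHA-256 (Apple platforms; on Linux the swift-crypto package would be the substitute). The function name `contentHash(ofJSON:)` is hypothetical, and the real crawler may canonicalize differently.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: SHA-256 over a key-order-canonical re-serialization.
func contentHash(ofJSON raw: Data) throws -> String {
    let object = try JSONSerialization.jsonObject(with: raw)
    // .sortedKeys makes the output independent of the key order in `raw`.
    let canonical = try JSONSerialization.data(withJSONObject: object,
                                               options: [.sortedKeys])
    return SHA256.hash(data: canonical)
        .map { String(format: "%02x", $0) }
        .joined()
}

// Same content, different key order: both canonicalize to the same bytes.
let a = Data(#"{"title":"View","kind":"protocol"}"#.utf8)
let b = Data(#"{"kind":"protocol","title":"View"}"#.utf8)
print(try contentHash(ofJSON: a) == contentHash(ofJSON: b))
```

One caveat of this approach: re-serializing also normalizes whitespace and number formatting, which is a broader canonicalization than key order alone.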
The BFS queue and visited-URI set are periodically serialized to disk under a session key derived from the start URL. A killed crawler that resumes with the same start URL picks up where it left off. Sessions from a different start URL do not collide.
Every JSON file crosses a door before any DB write. The door has three responsibilities, executed in order:
URLUtilities.appleDocsURI(from:) returns the canonical URI or nil. A nil return rejects the file (it doesn't represent a doc page; e.g., a marketing page that strayed into the BFS).
Search.StrategyHelpers.titleLooksLikePlaceholderError rejects files whose extracted title matches one of:
| Pattern | Example | Why |
|---|---|---|
| empty | "" | server-side rendering returned nothing |
| Apple Developer Documentation | bare shell, never rendered | React fallback when JS failed |
| error (case-insensitive, when URL leaf is also error) | WebKitJS FileReader.error, IDBRequest.error | Apple's renderer collapsed the property name into the title field |
Rejections are written to the JSONL import log with reason placeholderTitle. The associated symbol can be recovered downstream by the recovery enrichment pass (§8.4) using the URL leaf as a synthetic title.
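The three patterns in the table can be expressed as a short predicate. The sketch below is a hypothetical re-implementation, not the body of `titleLooksLikePlaceholderError`; in particular, trimming whitespace and treating a null title as empty are assumptions.

```swift
import Foundation

/// Hypothetical placeholder check following the three table rows above.
func looksLikePlaceholder(title: String?, uri: String) -> Bool {
    // Assumption: a null title and a whitespace-only title both count as empty.
    let trimmed = (title ?? "").trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.isEmpty { return true }
    // Bare shell title left behind when the page never rendered.
    if trimmed == "Apple Developer Documentation" { return true }
    // "error" is only a placeholder when the URL leaf is also "error".
    let leaf = uri.split(separator: "/").last.map(String.init) ?? ""
    return trimmed.lowercased() == "error" && leaf.lowercased() == "error"
}

print(looksLikePlaceholder(title: "error", uri: "apple-docs://webkitjs/filereader/error"))
print(looksLikePlaceholder(title: "View", uri: "apple-docs://swiftui/view"))
```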
The door maintains a per-run seen map keyed by URI. For any URI that's already been accepted in this run, classify the encounter:
| Tier | Match criteria | Action | Log |
|---|---|---|---|
| A | same URI + same contentHash | silent skip (byte-identical) | none |
| B | same URI + same canonical title + different contentHash | richest variant wins; loser logged | ⏭️ |
| C | same URI + different canonical title | first arrival stays; collision surfaced | 🚨 |
Tier-A is byte-equality. Tier-B is the case of "same logical Apple page, slightly drifted JSON between crawls" (a 2-byte trailing-slash difference, a rendering nondeterminism). Tier-C is the case where URI canonicalization conflated two genuinely different pages; this is a correctness bug that must be fixed in canonicalization, not papered over in the index.
Tier-C non-zero at end-of-run causes cupertino save to exit non-zero with a "work-not-done" banner. The full contract is in docs/PRINCIPLES.md §2-3.
When a Tier-B match fires, the door picks the "richest" variant deterministically:
- {abstract, declaration, sections, codeExamples, rawMarkdown}.

This guarantees Tier-B drift produces a monotonic improvement in the indexed row: a populated abstract never loses to an empty one, a full declaration never loses to an empty one. The corpus on disk preserves both variants for offline audit.
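The tier rules and the richest-variant selection can be sketched together. Everything named here (`Candidate`, `Door`, `richness`) is hypothetical. The richness score (the count of non-empty fields) and the tie-break (first arrival stays) are assumptions made for illustration, not the documented selection rule.

```swift
import Foundation

/// Minimal stand-in for a page crossing the door; field names are illustrative.
struct Candidate {
    var uri: String
    var canonicalTitle: String
    var contentHash: String
    var fields: [String: String]  // e.g. abstract, declaration, rawMarkdown
}

enum DoorOutcome: Equatable {
    case accept
    case tierA
    case tierB(replacedExisting: Bool)
    case tierC
}

/// Assumed richness score: the number of non-empty fields.
func richness(_ candidate: Candidate) -> Int {
    candidate.fields.values.filter { !$0.isEmpty }.count
}

/// Per-run door state keyed by URI.
struct Door {
    private(set) var seen: [String: Candidate] = [:]

    mutating func admit(_ incoming: Candidate) -> DoorOutcome {
        guard let existing = seen[incoming.uri] else {
            seen[incoming.uri] = incoming
            return .accept
        }
        // Tier A: byte-identical content, silent skip.
        if existing.contentHash == incoming.contentHash { return .tierA }
        // Tier C: same URI but a different title; the first arrival stays.
        if existing.canonicalTitle != incoming.canonicalTitle { return .tierC }
        // Tier B: same page, drifted JSON; keep the richer variant.
        let replace = richness(incoming) > richness(existing)
        if replace { seen[incoming.uri] = incoming }
        return .tierB(replacedExisting: replace)
    }
}
```

With a strict `>` comparison, a later variant can only replace an earlier one when it is strictly richer, which is one way to get the monotonic-improvement property described above.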
Six primary structures in search.db. The schema is in Search.Index.Schema.swift; columns called out here are the load-bearing ones for the design.
docs_fts

Virtual FTS5 table, porter + unicode61 tokenizer. Column order matches the BM25 weight vector:
CREATE VIRTUAL TABLE docs_fts USING fts5(
uri, -- weight 1.0
source, -- weight 1.0
framework, -- weight 2.0
language, -- weight 1.0
title, -- weight 10.0
content, -- weight 1.0
summary, -- weight 3.0
symbols, -- weight 5.0 (AST-extracted Swift symbol names)
symbol_components, -- weight 1.5 (CamelCase splits: LazyVGrid → Lazy / VGrid / Grid)
tokenize='porter unicode61'
);
Weight rationale:
- title dominates at 10× because Apple titles are concise, symbol-named, and the highest-signal field for "I'm looking for X" queries.
- symbols at 5× ensures AST-extracted symbol names beat random prose mentions. Without this, Task matches every page that uses the English word "task".
- summary at 3× boosts the curated 1-sentence abstract over the discussion body.
- framework at 2× breaks ties between swiftui/view and webkit/view when the query is View SwiftUI.
- symbol_components at 1.5× lets vgrid match LazyVGrid without requiring users to know the exact CamelCase form.

docs_metadata

One row per document. Non-FTS columns for filtering and JSON retrieval:
CREATE TABLE docs_metadata (
uri TEXT PRIMARY KEY,
source TEXT, framework TEXT, language TEXT,
kind TEXT, -- func | class | struct | enum | actor | protocol | ...
symbols TEXT, -- denormalized from doc_symbols for fast column read
file_path TEXT, content_hash TEXT, last_crawled INTEGER, word_count INTEGER,
source_type TEXT, package_id INTEGER, json_data TEXT,
min_ios TEXT, min_macos TEXT, min_tvos TEXT, min_watchos TEXT, min_visionos TEXT,
availability_source TEXT, -- api | parsed | inherited | derived
implementation_swift_version TEXT, -- for swift-evolution rows: toolchain version
FOREIGN KEY (package_id) REFERENCES packages(id)
);
json_data carries the full raw JSON, so read_document returns the original Apple payload without re-parsing the disk file. Trade-off: doubles the DB size. We accept it for read latency (N2).
Indexes on source, framework, language, kind, min_ios, min_macos, min_tvos, min_watchos, min_visionos, implementation_swift_version keep attribute-filter queries < 10ms even at full corpus size.
docs_structured

DocC-extracted declaration fields, one row per doc:
CREATE TABLE docs_structured (
uri TEXT PRIMARY KEY,
url TEXT, title TEXT, kind TEXT,
abstract TEXT, declaration TEXT, overview TEXT,
module TEXT, platforms TEXT,
conforms_to TEXT, inherited_by TEXT, conforming_types TEXT,
attributes TEXT, -- @MainActor, @Sendable, @available comma-separated
FOREIGN KEY (uri) REFERENCES docs_metadata(uri) ON DELETE CASCADE
);
This table is the surface for attribute queries: "find all protocols that inherit from Equatable" hits inherited_by directly, no FTS5 needed.
doc_symbols

AST-extracted symbols, one row per declared symbol per doc:
CREATE TABLE doc_symbols (
id INTEGER PRIMARY KEY,
doc_uri TEXT, name TEXT, kind TEXT,
line INTEGER, column INTEGER, signature TEXT,
is_async INTEGER, is_throws INTEGER, is_public INTEGER, is_static INTEGER,
attributes TEXT, conformances TEXT,
generic_params TEXT, generic_constraints TEXT,
FOREIGN KEY (doc_uri) REFERENCES docs_metadata(uri) ON DELETE CASCADE
);
generic_constraints is populated by the enrichment passes (§8). A separate FTS5 table doc_symbols_fts indexes name, signature, attributes, conformances for the semantic-search tools.
inheritance

Edge table for class inheritance, populated from DocC's relationshipsSections.inheritsFrom and inheritedBy arrays:
CREATE TABLE inheritance (
parent_uri TEXT, child_uri TEXT,
PRIMARY KEY (parent_uri, child_uri)
);
A dedicated table (vs a JSON column on docs_metadata) because NSObject and UIView have thousands of descendants; a JSON-blob column would be unscannable and bloated.
framework_aliases

Maps framework identifier, import name, display name, and search synonyms:
CREATE TABLE framework_aliases (
identifier TEXT PRIMARY KEY, -- "corebluetooth"
import_name TEXT, -- "CoreBluetooth"
display_name TEXT, -- "Core Bluetooth"
synonyms TEXT -- "bluetooth"
);
Populated during indexing; the synonym column is updated by the synonyms enrichment pass (§8.1).
After each doc is inserted, ASTIndexer.Extractor (SwiftSyntax) runs on two paths:
Declaration path: if the page has a declaration.code field (Apple's structured Swift declaration), extract symbol names, signatures, attributes, and import statements. Writes to doc_symbols.
Code example path (extractCodeExampleSymbols): for all pages with Swift code blocks in discussion sections, extract symbol names from usage snippets. This catches symbols that are used in examples but not formally declared by the page's own type. The extracted names are added to the page's symbols column for BM25 boost.
Coverage on the v1.1.0 corpus: ~165,000 of ~285,000 pages have at least one doc_symbols row. The remainder are pages with no Swift code (Objective-C frameworks, conceptual articles, HIG, sample-code metadata).
Every door event (accept, Tier-A, Tier-B, Tier-C, placeholder reject, validity reject) is recorded as a Search.ImportLogEntry and serialized to JSONL via Search.JSONLImportLogSink. Writes are actor-isolated so concurrent strategies (apple-docs, swift-evolution, samples, etc.) interleave cleanly. The log path is surfaced in the cupertino save final report. JSONL was chosen over a DB-internal log table because:
- \n flush is durable. A killed indexer leaves a valid prefix.
- jq, grep, anything line-oriented.

Four passes run after the main indexing loop. Today they execute inline inside Search.IndexBuilder.buildIndex(); epic #769 extracts them into a standalone cupertino-postprocessor binary. Full design in docs/design/post-processor.md.
| Pass | Writes to | Reads from | Depends on |
|---|---|---|---|
| synonyms | framework_aliases.synonyms | hardcoded list of 22 mappings (corebluetooth → bluetooth, etc.) | nothing |
| constraints | doc_symbols.generic_constraints | apple-constraints.json (output of swift symbolgraph-extract) | nothing |
| hierarchy | doc_symbols.generic_constraints | doc_symbols parent/child rows | constraints |
| recovery | re-inserts placeholder-title rejects | JSONL import log + URL leaf | nothing |
22 framework aliases get search-time alternate names so bluetooth finds CoreBluetooth, nfc finds CoreNFC, mpsgraph finds MetalPerformanceShadersGraph. The list is hardcoded (it doesn't change frequently and is small). The pass updates the synonyms column on existing framework_aliases rows; rows for frameworks not in the list keep synonyms IS NULL.
Apple ships authoritative generic constraints for stdlib symbols via swift symbolgraph-extract. The cupertino-constraints-gen binary parses that output once at build time and produces apple-constraints.json. The constraints pass loads that JSON and joins it into doc_symbols.generic_constraints for symbols whose pathComponents match an entry.
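This makes queries like "all stdlib protocols requiring Self == X" simple: a single WHERE generic_constraints LIKE '%Self == %' against a plain column of doc_symbols, with no JSON parsing at query time. Note that a LIKE pattern with a leading wildcard scans the column rather than using a B-tree index.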
Symbol hierarchies in Apple's docs are flat: each page declares its own constraints but doesn't inherit parent constraints. The hierarchy pass walks parent→child edges in doc_symbols and propagates generic_constraints from parent to child. Required for accurate constraint queries on extensions and conformances.
Depends on constraints because it propagates the constraint values that pass writes.
Some Apple JSON pages return title: null or title: "error" at fetch time and get rejected by the placeholder filter (§7.1.2). The recovery pass reads the JSONL import log, finds placeholderTitle rejections that have a recoverable URL leaf (e.g., apple-docs://webkitjs/filereader/error), and re-inserts the row with a synthesized title derived from the URL.
This is a design tradeoff: recovered rows have a worse title than they would if Apple's renderer were healthy, but they exist in the index instead of being dropped. The alternative (dropping them) loses content the user can clearly see exists on the live site.
Tracked at #777.
The default cupertino search <query> is not a plain FTS5 MATCH. It is a multi-pass pipeline:
input query
│
▼
┌──────────────────────────────────────────┐
│ 1. extractSourcePrefix │ "swift-evolution://" → source filter
│ 2. extractAttributeFilters │ "@MainActor" → SQL WHERE clause
│ 3. sanitizeFTS5Query │ quote terms, split hyphens
└──────────────┬───────────────────────────┘
│
┌───────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────────────┐
│ Symbol fast │ │ BM25F │
│ path (URI │ │ bm25(docs_fts, │
│ set from │ │ 1, 1, 2, 1, │
│ doc_symbols_ │ │ 10, 1, 3, 5, 1.5) │
│ fts) │ │ │
└──────┬───────┘ └────────┬─────────────┘
│ │
└────────┬───────────┘
▼
┌──────────────────────────────────────────┐
│ 4. HEURISTIC 1 (exact-title boost, #254) │
│ 50× for clean match, 20× for suffix │
│ Separates Swift's Task from │
│ Mach kernel task_* │
└──────────────┬───────────────────────────┘
▼
┌──────────────────────────────────────────┐
│ 5. HEURISTIC 1.5 (URI simplicity + #256) │
│ shorter URI ranks above longer │
│ + frameworkAuthority tiebreak │
│ (swift, swiftui, foundation ↑; │
│ webkitjs, installer_js ↓) │
└──────────────┬───────────────────────────┘
▼
┌──────────────────────────────────────────┐
│ 6. HEURISTIC 1.6 (kind tiebreak, #610) │
│ canonical type kinds (class, struct, │
│ enum, protocol, actor) win over │
│ property/method/initializer pages │
│ Closes Task / View / Hashable wins │
└──────────────┬───────────────────────────┘
▼
┌──────────────────────────────────────────┐
│ 7. fetchCanonicalTypePages (#254) │
│ Force-include canonical apple-docs │
│ page even if BM25 missed it │
└──────────────┬───────────────────────────┘
▼
┌──────────────────────────────────────────┐
│ 8. RRF fusion (k=60, #192) │
│ Reciprocal Rank Fusion across sources │
│ apple-docs 3.0 · evolution 1.5 · │
│ packages 1.5 │
└──────────────┬───────────────────────────┘
▼
results
BM25 scores are negative in FTS5 convention (lower = better match). Adjusted rank divides by the combined boost multiplier; a 50× exact-title boost moves a candidate above a 10× raw-BM25 winner.
Pure BM25F gets the right answer most of the time and the wrong answer in a small but high-visibility set of cases. The wrong answers cluster around:
- Array matches every page that uses the word. BM25 buries the canonical swift/array type page under thousands of someArray prose mentions.
- URLSession.dataTask is shorter than URLSession and has fewer terms, so BM25 with title weight ranks it above the type page.
- webkitjs/element matches Element queries that meant XMLDocument.element or Mirror.Element.

Each heuristic addresses one of these failure modes empirically. #610 audit on v1.1.0 documents 14 wrong-winner cases; heuristic 1.6 closes 9 of them (Task, View, String, Array, Hashable, Equatable, Codable, Identifiable, Sendable). The remaining 5 (URL, Color, Font, List, Data) are Class B and recover via lossless URI canonicalization (#283), not via ranking.
For a query without an explicit source prefix, the ranker runs per-source (apple-docs, swift-evolution, packages, ...) and fuses results via Reciprocal Rank Fusion (Cormack, Clarke, Büttcher 2009):
score(d) = Σ over sources s weight(s) / (k + rank_in_s(d))
with k = 60 (the standard RRF constant) and source weights tuned empirically: apple-docs gets 3.0, others get 1.5. The 2:1 ratio reflects that apple-docs is the largest and highest-signal source; we want it to dominate unless another source has a stronger match.
One dead source (e.g., samples.db missing because the user didn't run --samples) does not take the whole query down: its contribution is 0 / (60 + rank) = 0.
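The fusion formula above translates almost directly into code. The sketch below is illustrative: `fuse` is a hypothetical name, the URIs are made-up examples, ranks are assumed to start at 1, and the default weight of 1.5 for unlisted sources is an assumption based on "others get 1.5".

```swift
import Foundation

/// Weighted Reciprocal Rank Fusion:
/// score(d) = sum over sources s of weight(s) / (k + rank_in_s(d)).
/// `rankings` maps a source name to its ranked URIs, best first.
func fuse(rankings: [String: [String]],
          weights: [String: Double],
          k: Double = 60) -> [(uri: String, score: Double)] {
    var scores: [String: Double] = [:]
    for (source, uris) in rankings {
        let weight = weights[source] ?? 1.5  // assumed default for unlisted sources
        for (index, uri) in uris.enumerated() {
            // Ranks are 1-based, so the top hit contributes weight / (k + 1).
            scores[uri, default: 0] += weight / (k + Double(index + 1))
        }
    }
    // Sort by score, then by URI so ties are deterministic.
    return scores
        .map { (uri: $0.key, score: $0.value) }
        .sorted { $0.score != $1.score ? $0.score > $1.score : $0.uri < $1.uri }
}

let fused = fuse(
    rankings: [
        "apple-docs": ["apple-docs://swift/task", "apple-docs://kernel/task_info"],
        "swift-evolution": ["swift-evolution://SE-0304", "apple-docs://swift/task"],
        "samples": []  // a missing source simply contributes nothing
    ],
    weights: ["apple-docs": 3.0, "swift-evolution": 1.5, "samples": 1.5]
)
for entry in fused { print(entry.uri, entry.score) }
```

In this example the page ranked by two sources accumulates 3.0/61 + 1.5/62 and lands first, while the empty samples ranking leaves every score untouched.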
JSON-RPC 2.0 over stdio. Each message is a single compact line plus a \n delimiter. No embedded newlines (MCP spec requirement). The server reads with line buffering, parses, dispatches, and writes the response on stdout. stderr is reserved for log output.
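A minimal sketch of the write side of that framing, assuming Foundation's `JSONSerialization` as the encoder. The `frame(id:result:)` helper is hypothetical and handles only successful responses with integer ids.

```swift
import Foundation

/// Encodes one JSON-RPC 2.0 response as a single compact line ending in "\n".
func frame(id: Int, result: [String: Any]) throws -> Data {
    let message: [String: Any] = ["jsonrpc": "2.0", "id": id, "result": result]
    // Without .prettyPrinted the encoder emits no line breaks, and newlines
    // inside string values are escaped, so the payload stays on one line.
    var line = try JSONSerialization.data(withJSONObject: message,
                                          options: [.sortedKeys])
    line.append(0x0A)  // the "\n" delimiter
    return line
}

let framed = try frame(id: 1, result: ["text": "line one\nline two"])
print(String(decoding: framed, as: UTF8.self), terminator: "")
```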
The MCP tools provided are summarized in docs/ARCHITECTURE.md. Tool implementations live in SearchToolProvider, which depends only on ServicesModels, SearchModels, SampleIndexModels (the protocol seams), not on Search or SampleIndex concrete writers. This keeps the MCP layer testable in isolation.
cupertino read --source <s> and the MCP read_document / read_sample / read_sample_file tools all delegate to Services.ReadService. The service dispatches on source:
| Source | Backing | Returns |
|---|---|---|
| apple-docs, hig, apple-archive, swift-org, swift-book, swift-evolution | docs_metadata.json_data | full JSON payload |
| samples | Sample.Search.Service | sample project metadata |
| packages | Search.PackageQuery.fileContent | package file content from package_files_fts |
No on-disk file reads at serve time. Everything is served from the SQLite databases.
| Quantity | Value |
|---|---|
| Apple docs pages crawled | ~412,000 |
| Apple docs pages indexed (post-dedup) | 285,735 |
| Frameworks | 420 |
| Pages with doc_symbols rows | ~165,000 |
| search.db size | ~2.4 GB |
| packages.db size | ~990 MB |
| samples.db size | ~185 MB |
| Sample projects | 619 |
| Indexed Swift sample files | ~18,000 |
| Compressed bundle | ~685 MB |
The principles file commits to designing for 10x current scale (4M docs). Per-component analysis:
SQLite FTS5. Production FTS5 deployments index O(10⁸) documents (Mozilla, Notion). At 4M docs, expected index size ~24 GB (linear with corpus). Query latency for FTS5 is O(log N) on the index; p99 stays well under 100ms at 4M docs.
Door dedup map. Per-run hashmap keyed by URI. At 4M docs:
Acceptable on a build host. If we hit 100x (40M docs), we switch the hashmap to a bloom filter front + SQLite back, costing one extra DB hit per insert in exchange for O(1) memory.
Per-row work at the door: hashmap lookup (O(1)), SHA-256 (~40 μs on M1), one INSERT. No scans. The door is O(N) over the corpus; linear scaling.
AST extraction: SwiftSyntax parse of one declaration is ~5ms. At 4M docs × 50% AST coverage = 2M extractions × 5ms = ~2.8 hours. Acceptable.
Crawl: bottleneck is network + Apple's rate limit. We can't go below ~14 days at current delay; a 10x corpus would be ~140 days. This is the binding constraint at scale. Mitigation: parallel crawl from multiple IPs (operational, not in this design); incremental crawl (only fetch changed pages) is the design answer (G13).
Bundle distribution: GitHub Releases caps artifact size at 2 GB per file. At 10x corpus the compressed bundle is ~7 GB and needs chunking. Mitigation: split bundle by source (search.db.zip, packages.db.zip, samples.db.zip); cupertino setup already supports this.
| Component | Scales to | Limit |
|---|---|---|
| FTS5 query | ~10⁸ docs | tens of GB index |
| Door hashmap | ~10⁷ docs | host RAM |
| WKWebView crawl | ~10⁶ pages | crawl time, not memory |
| Setup download | ~5 GB compressed | GitHub Releases per-file limit |
| os.log volume | ~10⁵ events/sec | macOS log subsystem throttling |
The first limit we hit at 10x is crawl time, not architectural. Incremental crawl (G13) is the long-pole feature.
| Failure mode | Detection | Mitigation |
|---|---|---|
| Crawler killed mid-run | none (silent) | session checkpoint resumes from last serialized BFS state |
| Indexer killed mid-run | partial DB rows | SQLite transactions; partial inserts rolled back on cupertino save --clear |
| Power loss during indexing | DB may have uncommitted | SQLite WAL keeps DB consistent; #236 is open for explicit WAL on local builds |
| Apple API returns malformed JSON | parser throws | per-page error; doesn't kill the crawl; rejected URLs logged |
| Apple changes JSON schema | parser fails on every page | manual: regression caught in next crawl; CI canary not yet implemented (open work) |
| Apple rate-limits | HTTP 429 | retry with backoff; --request-delay knob exposes the rate |
| Disk full | SQLite write fails | crawl/save exits non-zero with the SQLite error |
| Corpus repo corrupted (bad JSON) | door rejects per-page | logged to JSONL; manual recovery |
| Schema mismatch (old DB, new binary) | open-time version check | binary refuses to open; user runs cupertino setup for fresh bundle |
| Tier-C collision in run | 🚨 log + non-zero exit | save report names both URIs + content paths; user audits |
| MCP transport closed mid-response | error frame | host re-spawns server (host responsibility); --no-reap for Codex-style spawn-per-call |
| Concurrent save against same DB | undefined (SQLite write lock) | #253 open: detection and bail with clear error |

feedback_assume_no_local_db.md memory.

| Threat | Vector | Mitigation |
|---|---|---|
| Malicious documentation injecting prompts via MCP response | Apple's docs themselves | trust boundary: we trust apple.com content; same trust model as any agent that consults docs |
| Sandbox escape via WKWebView | crawler runtime | WKWebView runs page JavaScript in its own sandboxed web-content process; the crawler never evaluates fetched JS in its own process and only reads the page's data layer |
| SQLite injection via search query | user input | Search.Index.QueryParsing sanitizes FTS5 special characters; parameterized SQL throughout |
| MCP server reads files outside corpus | path traversal | DB-backed reads only; read_document requires an apple-docs:// URI, not a file path |
| Local DB tampered with | filesystem | DB is signed-and-notarized via Homebrew distribution; user-built DBs are user-trusted |
None. The MCP server is stateless across sessions. Queries are not logged, not persisted, not transmitted. The os.log output stays on the local machine.
Telemetry, analytics, crash reporting, usage metrics: all explicitly not implemented. Adding any of these is a design change that requires an opt-in flag and a separate privacy review.
The serve binary makes zero outbound network calls. cupertino setup does (downloads the bundle from GitHub Releases over HTTPS); cupertino fetch does (crawls Apple); cupertino serve reads search.db and never opens a socket. This is enforceable by the MCP layer and Search packages not importing URLSession or Network.
os.log with subsystem com.cupertino and categories crawler, mcp, search, cli, transport, pdf, evolution, samples. Categories let consumers filter:
log show --predicate 'subsystem == "com.cupertino" AND category == "search"' --last 1h
The Logging concrete writer is composition-root-only: producer packages depend on LoggingModels (a protocol seam) and receive a Recording instance via init injection. This keeps producers testable without an os.log backend.
Long-running phases emit progress to stderr/stdout:
- Saved lines (configurable verbosity)
- Progress: X/Y (indexed, skipped) lines (#588)
- ⛔ for placeholder, ⏭️ for Tier-B, 🚨 for Tier-C

At the end of a run, cupertino save prints a structured summary:
✅ Indexed: N documents
⛔ Skipped (placeholder title): X
⏭️ Skipped (Tier-B dedup): Y
🚨 Collisions (Tier-C): Z ← non-zero is a build failure
Audit log: /path/to/import-log.jsonl
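The "non-zero is a build failure" rule in the summary above can be sketched as a small value type. This is an illustration only: the `SaveSummary` type, its field names, and the choice of exit code 1 are assumptions, not the actual implementation.

```swift
import Foundation

// Hypothetical model of the end-of-save summary.
struct SaveSummary {
    var indexed: Int
    var skippedPlaceholder: Int
    var skippedTierB: Int
    var collisionsTierC: Int

    // Any Tier-C collision fails the build; exit code 1 is an assumed convention.
    var exitCode: Int32 { collisionsTierC == 0 ? 0 : 1 }

    // Renders the four counter lines in the same order as the summary above.
    func render() -> String {
        [
            "✅ Indexed: \(indexed) documents",
            "⛔ Skipped (placeholder title): \(skippedPlaceholder)",
            "⏭️ Skipped (Tier-B dedup): \(skippedTierB)",
            "🚨 Collisions (Tier-C): \(collisionsTierC)",
        ].joined(separator: "\n")
    }
}

let clean = SaveSummary(indexed: 100, skippedPlaceholder: 2, skippedTierB: 5, collisionsTierC: 0)
print(clean.render())
print(clean.exitCode)
```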
cupertino doctor is a read-only inspection over the local DB and filesystem state. It verifies:
Used as a smoke test after setup or save.
| Tier | Tool | Scope | Count |
|---|---|---|---|
| Unit | Swift Testing (@Test, @Suite) | one type, one method, mocked deps via withDependencies | majority |
| Integration | Swift Testing tagged .integration | real SQLite, real WKWebView, real Apple docs (network) | smaller set |
| End-to-end | external (manual) | cupertino save on real corpus + golden-query regression | manual, pre-release |
Current count: ~330 test functions across 207 files, expanding to ~2,300+ runtime cases via @Test(arguments:) parameterization.
Every PR runs:
- xcrun swift build (must compile)
- xcrun swift test (must pass)
- swiftformat --lint (zero diffs)
- swiftlint (zero violations)
- scripts/check-package-purity.sh (no producer→producer imports)
- scripts/check-target-foundation-only.sh (strict producer allow-list)
- scripts/check-docs-commands-drift.sh (CLI surface matches docs/commands/)
- scripts/check-issue-body-staleness.sh (issue refs in PR body are valid)

docs/audits/stage-d-regression-locks-2026-05-17.md documents the manual regression set (specific queries, specific expected top results) that must pass before any v1.x release. These are the queries that motivated heuristics 1, 1.5, and 1.6.
| Channel | Path | Audience |
|---|---|---|
| Homebrew | mihaelamj/tap/cupertino | macOS users |
| Direct binary | GitHub Releases attached binaries | scripted installs, CI |
| Source | git clone + make build | contributors |
| Pre-built bundle | GitHub Releases cupertino-databases-vX.Y.Z.zip | every install path (downloaded by cupertino setup) |
Two version numbers are tracked separately:
- Shared.Constants.App.version: the released cupertino CLI version.
- Shared.Constants.App.databaseVersion: the schema/content version of search.db.

They are decoupled. A binary version bump that doesn't touch the schema does not bump databaseVersion. A schema-bumping change bumps both. cupertino setup downloads the bundle whose name matches the binary's databaseVersion.
Backward compatibility:
- search.db whose schema version is newer than the binary supports.
- search.db whose schema version is older than the minimum supported. User runs cupertino setup for a fresh bundle.

There is no in-place migration. The rebuild-instead policy is documented in docs/PRINCIPLES.md and in the assume-no-local-DB memory entry.
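The two compatibility cases and the bundle-name lookup can be sketched as follows. This is a hedged illustration: `SchemaCheck`, `checkSchema`, and `bundleFileName` are hypothetical names, and representing schema versions as integers is an assumption. The file-name pattern follows the cupertino-databases-vX.Y.Z.zip convention from the distribution table.

```swift
import Foundation

enum SchemaCheck: Equatable {
    case compatible
    case tooNew   // DB schema is newer than this binary supports
    case tooOld   // DB schema is older than the minimum supported
}

// Hypothetical gate: the supported range is passed in rather than read from
// Shared.Constants, and integer schema versions are an assumption.
func checkSchema(found: Int, supported: ClosedRange<Int>) -> SchemaCheck {
    if found > supported.upperBound { return .tooNew }
    if found < supported.lowerBound { return .tooOld }
    return .compatible
}

// Bundle name derived from the database version, not the binary version.
func bundleFileName(databaseVersion: String) -> String {
    "cupertino-databases-v\(databaseVersion).zip"
}

print(checkSchema(found: 5, supported: 3...4))     // prints tooNew
print(bundleFileName(databaseVersion: "1.2.0"))    // prints cupertino-databases-v1.2.0.zip
```

Because there is no in-place migration, both failure cases lead to the same remedy: fetch a fresh bundle (or a newer binary) rather than attempt to upgrade the file.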
The full pipeline runs out-of-band on the maintainer's hardware:
- cupertino fetch --type all (~14 days wall clock).
- cupertino-docs git repo (one commit per crawl session).
- cupertino save on the corpus (~12 hours).
- cupertino setup to pull the new bundle.

The crawl + save cost is paid once by the maintainer and amortized across every user. End users never run the full pipeline.
Considered: pure URLSession + JSON decode against tutorials/data/....
Rejected: ~1% of pages return title: null or a 3,297-byte React shell to non-browser fetches. Examples include some kernel pages (exclave_textlayout_info_v1) and certain WebKitJS property pages. WKWebView's JS execution is the only reliable way to get full content.
Cost paid: macOS-only crawling.
Considered: embed every page via a sentence-transformer model; serve cosine similarity.
Rejected as primary: Apple-docs queries are symbol-named ~80% of the time. URLSession.dataTaskPublisher is a token-match query, not a semantic query. Adding an embedding lookup to every query adds 50-200ms of inference latency and a model dependency. The 20% of queries that benefit from semantic matching (conceptual queries: "how do I make a network request") are a real win but a complement, not a replacement.
Status: planned as Phase 2.5 (#183) as a parallel index, not a replacement.
Considered: apple-docs://<8-byte-SHA-of-URL> instead of lossless path encoding.
Rejected: non-zero collision floor at any corpus size. At 4M docs (10x target), the expected number of 8-byte SHA collisions is 4e6² / 2 / 2⁶⁴ ≈ 4e-7 (one in 2.5 million corpora). Sounds small, but it's nonzero; a single collision corrupts the affected pages silently. The principles file commits to a zero-collision-floor design (docs/PRINCIPLES.md §1). Lossless URIs cost ~30% more bytes per row; we accept that cost.
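The collision estimate above uses the standard birthday-bound approximation, n² / (2 · 2^b) expected colliding pairs for n documents and a b-bit hash. A short sketch to reproduce the arithmetic (the function name is illustrative):

```swift
import Foundation

// Birthday-bound approximation for the expected number of colliding pairs:
// n documents hashed to b bits  =>  n^2 / (2 * 2^b).
func expectedCollisions(documents n: Double, hashBits b: Double) -> Double {
    (n * n) / (2 * pow(2, b))
}

let expected = expectedCollisions(documents: 4_000_000, hashBits: 64)
print(expected)       // about 4.3e-7
print(1 / expected)   // about 2.3 million corpora per expected collision
```

The unrounded result is slightly higher than the rounded 4e-7 quoted in the text, which does not change the conclusion: the probability is tiny but not zero, and lossless path encoding removes it entirely.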
cupertino-mcp binary
Considered: ship the MCP server as a separate binary from the CLI.
Rejected: doubles deployment surface (two binaries, two formulas, two PATH entries, two update paths). Mode detection at startup (pipe-vs-TTY on stdin) is 5 lines of code. The binary contains both at ~4.3 MB total; the duplication is in compiled code, not in deployment friction.
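A minimal sketch of pipe-vs-TTY mode detection, assuming that a non-terminal stdin means an MCP client is driving the process over a pipe. `LaunchMode` and `launchMode(stdinIsTTY:)` are hypothetical names; the actual startup logic may consider subcommands and flags as well.

```swift
import Foundation

enum LaunchMode: Equatable { case mcpServer, cli }

// Pure decision function so the rule is testable without a real terminal.
// Assumption: a piped stdin implies an MCP client on the other end.
func launchMode(stdinIsTTY: Bool) -> LaunchMode {
    stdinIsTTY ? .cli : .mcpServer
}

// isatty(3) returns 1 when the file descriptor refers to a terminal.
let stdinIsTTY = isatty(FileHandle.standardInput.fileDescriptor) == 1
print(launchMode(stdinIsTTY: stdinIsTTY))
```

Keeping the decision in a pure function separates the one system call from the policy, so the policy can be unit tested with plain booleans.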
Considered: one large SPM package containing all source.
Rejected: import boundaries cannot be enforced at compile time. The risk we've avoided by ExtremePackaging is Search accidentally importing Crawler because someone added a helper. With one library, every type is reachable from every other type. With 40 packages and a CI-enforced import contract, the build fails when the boundary is crossed. The cost is Package.swift boilerplate (large but mechanical); the benefit is structural.
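To make the import contract concrete, here is a hedged sketch of the kind of check a CI script could perform. It is a hypothetical stand-in, not the contents of the actual scripts: it only recognizes lines that start with a plain `import X` and ignores forms such as `@testable import X` or `import struct X.Y`.

```swift
import Foundation

// Returns the forbidden modules that a Swift source file imports.
// Only lines whose first word is `import` are considered.
func forbiddenImports(in source: String, forbidden: Set<String>) -> [String] {
    source.split(separator: "\n").compactMap { line -> String? in
        let words = line.split(separator: " ")
        guard words.count >= 2, words[0] == "import" else { return nil }
        let module = String(words[1])
        return forbidden.contains(module) ? module : nil
    }
}

// Illustrative input: a Search source file that wrongly imports Crawler.
let searchSource = """
import Foundation
import Crawler
"""
print(forbiddenImports(in: searchSource, forbidden: ["Crawler", "Network"]))
// prints ["Crawler"]
```

A non-empty result would fail the build, which is the structural guarantee a single large package cannot provide.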
Considered: server-backed DB for richer query and concurrent writes.
Rejected: deployment friction kills the "single binary, brew install" goal. SQLite is single-file portable, has FTS5 built in, and has zero operational overhead. The cost is no concurrent writers (one cupertino save at a time); we accept that.
Considered: convert all Apple JSON to Markdown at crawl time, ship Markdown.
Rejected: loses structural metadata (declaration tokens, availability ranges, references). The structured JSON is the source of truth; Markdown is a lossy projection.
Considered: industry-standard search backends.
Rejected: external service, deployment burden, JVM dependency. SQLite FTS5 gives us 95% of Solr's BM25F capabilities at zero operational overhead. The 5% we lose (distributed sharding, more sophisticated analyzers) is not needed at our scale.
Considered: depend on the official @anthropic-ai/mcp-swift-sdk.
Rejected: longer line count, more dependencies, and missing several edge cases that real MCP clients (Claude Desktop, Codex, GitHub Copilot for Xcode) exercise around stdio framing and Transport closed recovery. Cupertino's hand-rolled implementation in the MCP package covers exactly the protocol surface needed (docs/ARCHITECTURE.md §"MCP Server Implementation"). When the official SDK matures and adds the missing edge-case coverage, this is worth re-evaluating.
| ID | Question | Tracking |
|---|---|---|
| Q1 | When does vector search enter the ranker? | Phase 2.5, after v1.0.3 ships |
| Q2 | What's the minimum-viable diagnostic block for MCP responses? | Phase 2.1 design pending |
| Q3 | Apple's JSON schema drift: do we add a CI canary that re-fetches N representative pages weekly? | Open work |
| Q4 | Can we ship a Linux server crawler variant by replacing WKWebView with a headless Chrome wrapper? | Open work, not in scope for v1.x |
| Q5 | Concurrent cupertino save against the same DB: detect and bail | #253 open |
| Q6 | Recovery pass: how aggressive should URL-leaf title derivation be? Just split on _ and titlecase, or LLM-based? | #777, leaning toward mechanical only |
| Q7 | What happens to the 12-hour save time at 10x corpus? | open; AST extraction is the bottleneck |
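The mechanical option in Q6 (split the URL leaf on `_` and titlecase each piece) can be sketched in a few lines. The function name and the example path are illustrative assumptions; only the symbol name exclave_textlayout_info_v1 appears in this document.

```swift
import Foundation

// Mechanical fallback: take the last path component, split on "_",
// and uppercase the first character of each piece.
func derivedTitle(fromURLPath path: String) -> String {
    let leaf = path.split(separator: "/").last.map(String.init) ?? ""
    return leaf.split(separator: "_")
        .map { String($0.prefix(1)).uppercased() + String($0.dropFirst()) }
        .joined(separator: " ")
}

print(derivedTitle(fromURLPath: "documentation/kernel/exclave_textlayout_info_v1"))
// prints Exclave Textlayout Info V1
```

The output also shows the limit of the mechanical approach: it is deterministic and cheap, but it cannot recover the original symbol casing or word boundaries, which is the trade-off Q6 weighs against an LLM-based pass.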
| ID | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| R1 | Apple changes JSON API shape | medium | high (parser breaks) | parser fails fast; manual fix cycle; CI canary (Q3) |
| R2 | Apple rate-limits aggressive crawlers | low | medium (slower crawls) | conservative default delay; respectful UA; back off on 429 |
| R3 | WKWebView macOS-only locks us out of cloud crawl | high (already the case) | medium | Linux crawler variant is open work (Q4) |
| R4 | 12-hour save makes iteration slow | high (current pain) | medium | enrichment separated (#769) so small fixes don't need full re-index |
| R5 | SQLite FTS5 BM25F doesn't scale past ~10⁸ docs | low at current scale | hypothetical | sharding architecture is well-understood; defer until needed |
| R6 | New MCP client doesn't tolerate our transport edge cases | medium | low (per-client fix) | mock agent + tagged integration tests per client family |
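The R2 mitigation ("conservative default delay; back off on 429") does not specify a policy. One common choice is capped exponential backoff, sketched below as an assumption rather than a description of the crawler; the function name, parameters, and sample values are all hypothetical.

```swift
import Foundation

// Assumed policy: use the polite base delay normally; after each consecutive
// HTTP 429, double the delay, up to a fixed cap.
func nextDelay(baseDelay: TimeInterval, consecutive429s: Int, cap: TimeInterval) -> TimeInterval {
    guard consecutive429s > 0 else { return baseDelay }
    let backedOff = baseDelay * pow(2, Double(consecutive429s))
    return min(backedOff, cap)
}

print(nextDelay(baseDelay: 0.5, consecutive429s: 0, cap: 60))   // prints 0.5
print(nextDelay(baseDelay: 0.5, consecutive429s: 3, cap: 60))   // prints 4.0
print(nextDelay(baseDelay: 0.5, consecutive429s: 20, cap: 60))  // prints 60.0
```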
These are deliberately out of scope for the current design but worth flagging for sequencing:
- (why_this_result, bm25_breakdown, heuristics_applied) so agents can reason about ranking decisions instead of treating the response as opaque.
- cupertino-docs git history to enumerate pages Apple changed since the last commit; re-fetch only those. Cuts daily-update wall time from 14 days to hours.
- swift symbolgraph-extract output for more frameworks (not just stdlib) to richen the constraint table beyond what Apple's docs JSON exposes.

- docs/PRINCIPLES.md: six engineering principles
- docs/ARCHITECTURE.md: package layout, file maps, ranker diagrams
- docs/package-import-contract.md: per-target allowed/forbidden imports
- docs/design/post-processor.md: enrichment-pass pipeline design (epic #769)
- docs/audits/methodology.md: audit and issue-hygiene policy
- docs/audits/stage-d-regression-locks-2026-05-17.md: pre-release regression set
- docs/portability.md: cross-Mac development setup

- docs_fts implements.
- Search.SmartQuery uses.
- cupertino-postprocessor.

Auto-collected from the metric and method mentions in the text above.
Cormack, Clarke, Büttcher (2009), SIGIR