The problem nobody warned me about
Three months into AI-augmented brand work I had a problem nobody warned me about: copy-paste fatigue.
Every new Claude.ai chat started the same way. Paste recent decisions. Paste voice samples. Paste current sprint state. Paste the file I’m working on. Paste the constraints. Send. Wait for the model to acknowledge it has context. Then ask the actual question.
For someone shipping case studies, sprint reports, and decision logs daily, that's hours of copy-paste every week, and it didn't scale.
Cloud RAG services existed but missed the point. My brand corpus is private until I publish it. Pricing notes, internal decisions, voice experiments mid-iteration - none of that should leave the laptop.
So I built the thing I needed.
What it is, in one paragraph
sdet-brain is a single persistent index over my Markdown corpus, exposed as MCP tools so every MCP-aware client (Claude Desktop, Claude Code, OpenCode) sees the same context simultaneously. Embeddings are computed locally on Apple Silicon via MLX. The vector store is Qdrant in Docker. The server is FastAPI plus FastMCP 3.0 with stdio, SSE, and streamable HTTP transports. Markdown stays on disk as the single source of truth - Qdrant only holds derivatives.
Architecture
graph TB
subgraph Clients
CD[Claude Desktop<br/>MCP stdio]
CC[Claude Code<br/>MCP HTTP]
OC[OpenCode<br/>MCP HTTP]
WEB[Web client<br/>REST + SSE]
end
subgraph Server
FAPI[FastAPI app]
FMCP[FastMCP 3.0 wrapper]
TOOLS[MCP tools<br/>core + domain]
end
subgraph Pipeline
ING[Ingestion<br/>parser + chunker]
EMB[Embeddings<br/>MLX + Gemini]
WTC[Watchdog<br/>auto-reindex]
end
subgraph Storage
QD[(Qdrant<br/>vectors + payload)]
FS[(Markdown corpus<br/>on disk)]
end
CD --> FMCP
CC --> FMCP
OC --> FMCP
WEB --> FAPI
FAPI --> TOOLS
FMCP --> TOOLS
TOOLS --> ING
TOOLS --> EMB
TOOLS --> QD
FS --> WTC
WTC --> ING
ING --> EMB
EMB --> QD
Four layers, four top-level packages:
- server/ - FastAPI plus FastMCP 3.0. Exposes 11 MCP tools.
- ingestion/ - frontmatter parser, semantic chunker, watchdog.
- embeddings/ - MLX local (primary), Gemini fallback.
- storage/ - Qdrant client with hybrid search (BM25 + dense + RRF fusion).
CLI entrypoints live in cli/. Markdown corpus lives wherever I keep it on disk - Qdrant only stores derivatives.
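To make the ingestion layer concrete, here is a minimal sketch of what "parse frontmatter, chunk, derive a stable ID" could look like. It is not the sdet-brain implementation: a simple heading-based split stands in for the semantic chunker, the frontmatter parser only handles flat `key: value` lines, and the names `Chunk`, `split_frontmatter`, and `chunk_markdown` are hypothetical. The UUIDv5-per-chunk scheme is likewise an assumption about how derivatives might be kept reproducible from the files on disk.

```python
import re
import uuid
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # deterministic, so re-indexing the same file yields the same IDs
    path: str
    heading: str
    text: str
    meta: dict


def split_frontmatter(raw: str) -> tuple[dict, str]:
    """Split a leading '---' block of flat 'key: value' lines from the body."""
    match = re.match(r"\A---\n(.*?)\n---\n?(.*)\Z", raw, flags=re.DOTALL)
    if not match:
        return {}, raw
    meta = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, match.group(2)


def chunk_markdown(path: str, raw: str) -> list[Chunk]:
    """Heading-based chunking: one chunk per Markdown heading section."""
    meta, body = split_frontmatter(raw)
    chunks: list[Chunk] = []
    heading, lines = "", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(lines).strip()
        if text:
            index = len(chunks)
            chunk_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{path}#{index}"))
            chunks.append(Chunk(chunk_id, path, heading, text, meta))

    for line in body.splitlines():
        if re.match(r"#{1,6} ", line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Because the ID depends only on the path and the chunk position, re-running the chunker over an unchanged file produces identical IDs, which is one way to let a vector store hold pure derivatives of the Markdown.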
How a query flows
- Client (say, Claude Code) calls the MCP tool `search` over HTTP.
- FastMCP dispatches to the tools layer.
- The embedder (MLX) computes the query vector. Lazy-loaded on first call (a few seconds), sub-100ms after that.
- Qdrant runs hybrid search: BM25 + dense, RRF fusion, optional cross-encoder rerank.
- The tool returns chunks with payload - path, source type, score, snippet.
- The client gets JSON back, and the model cites the results in its response.
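The lazy-loading step in the flow above can be sketched as a cached factory: the first call pays the model-loading cost, every later call reuses the same instance. This is an illustration only. `FakeEmbedder`, `get_embedder`, and `embed_query` are hypothetical names, and the toy vector logic stands in for a real MLX model.

```python
import time
from functools import lru_cache


class FakeEmbedder:
    """Stand-in for a real model; the sleep imitates slow weight loading."""

    def __init__(self) -> None:
        time.sleep(0.05)
        self.dim = 8

    def embed(self, text: str) -> list[float]:
        # Toy deterministic vector: byte values folded into `dim` buckets.
        vec = [0.0] * self.dim
        for i, byte in enumerate(text.encode("utf-8")):
            vec[i % self.dim] += byte / 255.0
        return vec


@lru_cache(maxsize=1)
def get_embedder() -> FakeEmbedder:
    # Constructed on first use only; later calls return the cached instance.
    return FakeEmbedder()


def embed_query(text: str) -> list[float]:
    return get_embedder().embed(text)
```

One caveat with this pattern: `lru_cache` does not stop two threads from both constructing the model if they race on the very first call, so a server handling concurrent requests may prefer to warm the embedder at startup or guard it with a lock.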
For richer questions three LLM-backed tools layer on top: query_rewrite (HyDE-style query expansion), multi_query_search (decomposition plus RRF fusion across sub-queries), and summarize_results (cited summary).
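Reciprocal rank fusion (RRF) appears twice in this flow: once to merge BM25 and dense results, once to merge results across sub-queries. A minimal sketch of the technique follows. The function name `rrf_fuse` is hypothetical, and `k = 60` is a commonly used default rather than a confirmed sdet-brain setting. In the real stack Qdrant presumably performs the BM25 plus dense fusion itself; this only shows the arithmetic.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


bm25_hits = ["a", "b", "c"]   # keyword ranking (hypothetical chunk IDs)
dense_hits = ["b", "d", "a"]  # vector ranking
fused = rrf_fuse([bm25_hits, dense_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

RRF only looks at ranks, never at raw scores, which is why it can combine BM25 and cosine similarity without any score normalization.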
Why local-first, not cloud
Three reasons, in priority order.
Privacy by architecture. Brand corpus is private until I publish it. Running embeddings and reasoning on a Mac means no inference traffic leaves the laptop. There is no “we promise not to train on it” toggle to trust.
Zero per-query cost. Apple Silicon runs Qwen3-Embedding-0.6B and Qwen3-Next-80B locally. Each query costs the laptop’s electricity, not API tokens. With a hundred-plus queries on a working day this matters more than it sounds.
Latency. Round-tripping to a hosted embedding service adds 100-300ms per query. Local MLX is 20-50ms. For interactive tooling that’s the difference between flow and friction.
Six release tiers in two days
This is where the AI-velocity story comes in. The whole stack - MVP through DX polish - shipped between 30 April and 1 May 2026 across three autonomous Claude Code sessions.
| Tier | Tag | Highlight |
|---|---|---|
| 1 | v0.1.0 | MVP - Qdrant + MLX + 4 core MCP tools + watcher |
| 1.1 | v0.1.1 | Polish - healthcheck, env-driven paths, perf |
| 2 | v0.2.0 | Hybrid search (BM25 + RRF) + cross-encoder rerank + 5 domain tools |
| 3 | v0.3.0 | Local MLX LLM (Qwen3-Next-80B) + /chat + SSE streaming |
| 4 | v0.4.0 | Qwen3-Embedding-8B + tiered LLM router + multi-query agentic retrieval |
| 5 | v0.5.0 | DX - REPL CLI + inline citations + saved templates |
Each tier had its own atomic plan, atomic commits, and a full re-run of the quality gates before moving on: 213 tests passing, mypy --strict clean on 70 files, ruff clean.
Why source-available, not OSS (yet)
The repo is here for transparency, reference, and learning. You can inspect the architecture, run it locally for your own corpus, and learn the patterns. What you can’t do is fork it commercially or wrap it into a hosted product without talking to me first.
A formal OSI license decision (likely AGPL-3.0 or similar) will come with a structured public launch - when documentation, demo dataset, and onboarding tutorial are ready. That’s a separate sprint, planned for a later milestone.
What’s next
A full case study with setup tutorial, demo Markdown corpus, and architecture deep-dive is planned as a “From the field” series episode in the coming weeks.
For now: the source is on GitHub, the architecture is in the README, and the patterns are battle-tested on my own brand work daily.