sdet-brain - persistent RAG over MCP for one human and three Claudes

The problem nobody warned me about

Three months into AI-augmented brand work I had a problem nobody warned me about: copy-paste fatigue.

Every new Claude.ai chat started the same way. Paste recent decisions. Paste voice samples. Paste current sprint state. Paste the file I’m working on. Paste the constraints. Send. Wait for the model to acknowledge it has context. Then ask the actual question.

For someone shipping case studies, sprint reports, and decision logs daily, that’s hours per week of copy-paste, and it didn’t scale.

Cloud RAG services existed but missed the point. My brand corpus is private until I publish it. Pricing notes, internal decisions, voice experiments mid-iteration - none of that should leave the laptop.

So I built the thing I needed.

What it is, in one paragraph

sdet-brain is a single persistent index over my Markdown corpus, exposed as MCP tools so every MCP-aware client (Claude Desktop, Claude Code, OpenCode) sees the same context simultaneously. Embeddings are computed locally on Apple Silicon via MLX. The vector store is Qdrant in Docker. The server is FastAPI plus FastMCP 3.0 with stdio, SSE, and streamable HTTP transports. Markdown stays on disk as the single source of truth - Qdrant only holds derivatives.

Architecture

graph TB
    subgraph Clients
        CD[Claude Desktop<br/>MCP stdio]
        CC[Claude Code<br/>MCP HTTP]
        OC[OpenCode<br/>MCP HTTP]
        WEB[Web client<br/>REST + SSE]
    end

    subgraph Server
        FAPI[FastAPI app]
        FMCP[FastMCP 3.0 wrapper]
        TOOLS[MCP tools<br/>core + domain]
    end

    subgraph Pipeline
        ING[Ingestion<br/>parser + chunker]
        EMB[Embeddings<br/>MLX + Gemini]
        WTC[Watchdog<br/>auto-reindex]
    end

    subgraph Storage
        QD[(Qdrant<br/>vectors + payload)]
        FS[(Markdown corpus<br/>on disk)]
    end

    CD --> FMCP
    CC --> FMCP
    OC --> FMCP
    WEB --> FAPI
    FAPI --> TOOLS
    FMCP --> TOOLS
    TOOLS --> ING
    TOOLS --> EMB
    TOOLS --> QD
    FS --> WTC
    WTC --> ING
    ING --> EMB
    EMB --> QD

Four layers, four top-level packages:

server/ - FastAPI plus FastMCP 3.0. Exposes 11 MCP tools.
ingestion/ - frontmatter parser, semantic chunker, watchdog.
embeddings/ - MLX local (primary), Gemini fallback.
storage/ - Qdrant client with hybrid search (BM25 + dense + RRF fusion).

CLI entrypoints live in cli/. Markdown corpus lives wherever I keep it on disk - Qdrant only stores derivatives.
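
To make the ingestion layer concrete, here is a minimal sketch of splitting frontmatter from a Markdown file and cutting the body into chunks. It is an illustration under assumptions, not the code in the repo: the names `Chunk`, `split_frontmatter`, and `chunk_by_heading` are hypothetical, frontmatter is assumed to be flat `key: value` lines, and the split is on headings, which is a simpler stand-in for the semantic chunker.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str     # source file, so a result can point back to disk
    heading: str  # nearest heading above the chunk text ("" before the first one)
    text: str     # chunk body that would be embedded
    meta: dict    # frontmatter fields copied onto every chunk


def split_frontmatter(raw: str) -> tuple[dict, str]:
    """Split a leading '---' block of flat 'key: value' lines from the body."""
    lines = raw.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, raw
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, raw  # no closing marker: treat the whole file as body


def chunk_by_heading(path: str, raw: str) -> list[Chunk]:
    """Heading-based split (assumption: a simplification of semantic chunking)."""
    meta, body = split_frontmatter(raw)
    chunks: list[Chunk] = []
    heading, buf = "", []

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            chunks.append(Chunk(path, heading, text, meta))

    for line in body.splitlines():
        if line.startswith("#"):  # naive: ignores '#' inside code fences
            flush()
            heading, buf = line.lstrip("#").strip(), []
        else:
            buf.append(line)
    flush()
    return chunks


doc = "---\ntitle: Pricing notes\ntype: decision\n---\nIntro line.\n# Context\nWhy we changed.\n## Options\nA or B.\n"
chunks = chunk_by_heading("notes/pricing.md", doc)
```

Carrying the path and frontmatter on every chunk is what lets the vector store stay a derivative: anything in it can be rebuilt from the file on disk.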

How a query flows

Client (say, Claude Code) calls the MCP tool search over HTTP.
FastMCP dispatches to the tools layer.
The embedder (MLX) computes the query vector. The model is lazy-loaded on the first call (a few seconds); later calls take under 100ms.
Qdrant runs hybrid search: BM25 + dense, RRF fusion, optional cross-encoder rerank.
The tool returns chunks with payload - path, source type, score, snippet.
The client gets JSON back, and the model cites the results in its response.
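
The fusion step in the flow above can be sketched in a few lines. This is reciprocal rank fusion in plain Python, shown only to illustrate the idea: in the real stack the fusion runs as part of the Qdrant hybrid search, and the function name `rrf_fuse`, the chunk ids, and the constant `k = 60` (a common default) are assumptions for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers, best hit first.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]

fused = rrf_fuse([bm25_hits, dense_hits])
top_ids = [chunk_id for chunk_id, _ in fused]
```

RRF only looks at ranks, never at raw scores, so BM25 and cosine similarity can be combined without normalizing either one. Here `c1` ends up first because it sits near the top of both lists.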

For richer questions three LLM-backed tools layer on top: query_rewrite (HyDE-style query expansion), multi_query_search (decomposition plus RRF fusion across sub-queries), and summarize_results (cited summary).

Why local-first, not cloud

Three reasons, in priority order.

Privacy by architecture. Brand corpus is private until I publish it. Running embeddings and reasoning on the Mac means no inference traffic leaves the laptop. There is no “we promise not to train on it” toggle to trust.

Zero per-query cost. Apple Silicon runs Qwen3-Embedding-0.6B and Qwen3-Next-80B locally. Each query costs the laptop’s electricity, not API tokens. With a hundred-plus queries on a working day this matters more than it sounds.

Latency. Round-tripping to a hosted embedding service adds 100-300ms per query. Local MLX is 20-50ms. For interactive tooling that’s the difference between flow and friction.

Six release tiers in two days

This is where the AI-velocity story comes in. The whole stack - MVP through DX polish - shipped between 30 April and 1 May 2026 across three autonomous Claude Code sessions.

Tier	Tag	Highlight
1	v0.1.0	MVP - Qdrant + MLX + 4 core MCP tools + watcher
1.1	v0.1.1	Polish - healthcheck, env-driven paths, perf
2	v0.2.0	Hybrid search (BM25 + RRF) + cross-encoder rerank + 5 domain tools
3	v0.3.0	Local MLX LLM (Qwen3-Next-80B) + `/chat` + SSE streaming
4	v0.4.0	Qwen3-Embedding-8B + tiered LLM router + multi-query agentic retrieval
5	v0.5.0	DX - REPL CLI + inline citations + saved templates

Each tier had its own atomic plan, atomic commits, and re-run quality gates: 213 tests passing, mypy --strict on 70 files, ruff clean, before moving on.

Why source-available, not OSS (yet)

The repo is here for transparency, reference, and learning. You can inspect the architecture, run it locally for your own corpus, learn the patterns. What you can’t do is fork it commercially or wrap it into a hosted product without talking to me first.

A formal OSI license decision (likely AGPL-3.0 or similar) will come with a structured public launch - when documentation, demo dataset, and onboarding tutorial are ready. That’s a separate sprint, planned for a later milestone.

What’s next

A full case study with setup tutorial, demo Markdown corpus, and architecture deep-dive is planned as a “From the field” series episode in the coming weeks.

For now: the source is on GitHub, the architecture is in the README, and the patterns are battle-tested on my own brand work daily.
