In February I wrote that a flaky test is a cancer eating trust in CI/CD. That stability beats coverage. Easy enough to write a manifesto. Harder to back it with code.

So I built two QA tools from scratch, three days apart. They do different jobs. They share one pattern, and the pattern is the actual point: a deterministic core that does the real work, plus an AI layer you can lift off without anything breaking.

Not “AI writes the report.” The report is already complete without it. AI is optional advice on top - and the bytes prove it.

The pattern

Both tools have the same three stages: a deterministic core, an optional AI adapter, a renderer. The core parses, counts, classifies. It runs first, and it runs always. The AI layer runs only if you turn it on, and if it fails, throws, or invents a number, the tool ships the deterministic output unchanged.
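
The three stages can be sketched in a few lines. This is a minimal illustration in Python rather than the tools' own Node, and every name in it (`analyze`, `render`, `build_report`, the `Advisor` type) is hypothetical, not taken from either repo. It covers the "fails or throws" half of the fallback; the check for invented numbers is a separate gate.

```python
from typing import Callable, Optional

# Hypothetical shape: an advisor takes the finished analysis and returns prose.
Advisor = Callable[[dict], str]


def analyze(results: list[dict]) -> dict:
    """Deterministic core: parse, count, classify. No model involved."""
    total = len(results)
    passed = sum(1 for r in results if r["status"] == "passed")
    return {"total": total, "passed": passed, "failed": total - passed}


def render(analysis: dict, advisory: Optional[str] = None) -> str:
    """Renderer: the deterministic lines never depend on the advisory."""
    lines = [
        f"total: {analysis['total']}",
        f"passed: {analysis['passed']}",
        f"failed: {analysis['failed']}",
    ]
    if advisory:
        lines.append(f"advisory: {advisory}")
    return "\n".join(lines) + "\n"


def build_report(results: list[dict], advisor: Optional[Advisor] = None) -> str:
    analysis = analyze(results)       # runs first, runs always
    advisory = None
    if advisor is not None:           # the AI layer is opt-in
        try:
            advisory = advisor(analysis)
        except Exception:
            advisory = None           # fail closed: ship the deterministic output
    return render(analysis, advisory)
```

If the advisor raises, `build_report` returns exactly what it would have returned with no advisor at all.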

One rule sits under both: numbers from code, prose from AI. The model never owns a figure. A code-side gate harvests every legitimate number from the data and rejects any prose containing a number that isn’t in that set. Ungrounded number, dropped advisory, deterministic report. No exceptions, because it’s a regex and an exit code, not a vibe.

Default is off. The no-AI path isn’t a special case bolted on for safety - it’s the most-exercised path in the suite. In the first tool it’s a real null-object provider that happens to be the default. In the second, the AI is a session that can simply not show up.
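
A null-object default might look roughly like this. It is a sketch in Python, not the tool's code: the class names, the `PROVIDERS` registry, and the choice to fall back to the null provider on an unknown name are all assumptions made for illustration.

```python
from typing import Protocol


class AdvisoryProvider(Protocol):
    def advise(self, analysis: dict) -> str: ...


class NullProvider:
    """Null object: same interface as a real provider, never produces advice."""

    def advise(self, analysis: dict) -> str:
        return ""


class EchoProvider:
    """Hypothetical stand-in for a model-backed provider."""

    def advise(self, analysis: dict) -> str:
        return f"{analysis['failed']} failing tests deserve a look."


# Hypothetical registry; "none" is the default, so the no-AI path is the normal path.
PROVIDERS: dict[str, type] = {"none": NullProvider, "echo": EchoProvider}


def pick_provider(name: str = "none") -> AdvisoryProvider:
    # Assumption: an unknown name degrades to the null provider instead of raising.
    return PROVIDERS.get(name, NullProvider)()
```

Because the null provider is an ordinary implementation of the same interface, the calling code has no "if AI is enabled" branch to get wrong.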

Tool one: qa-report-lake

It started from a colleague’s question on LinkedIn: a repo bloating with test-report history, and nothing useful coming out of it.

qa-report-lake takes raw test output - Playwright, JUnit, Allure, or CTRF directly - normalizes it to CTRF, and renders one static HTML report with history, trends, and insight into which failures are flaky and which are new. It also does visual regression: a pixel diff with a reg-suit baseline model, the expected/actual/diff triptych, a verdict. Two mechanisms, one pipeline.
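
To make the normalization step concrete, here is a rough sketch for the JUnit case only. It is Python for brevity, not the tool's code, and it emits a reduced CTRF-like shape (tool name, summary counts, a test list with name, status, and duration in milliseconds). Treat the field set as an assumption to check against the CTRF specification; timestamps and other fields are left out.

```python
import xml.etree.ElementTree as ET


def junit_to_ctrf(xml_text: str, tool_name: str = "junit") -> dict:
    """Convert JUnit XML into a reduced, CTRF-like dict (illustrative only)."""
    root = ET.fromstring(xml_text)
    tests = []
    # iter() finds <testcase> at any depth, so both <testsuites> and
    # <testsuite> roots work.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        seconds = float(case.get("time") or 0)
        tests.append({
            "name": case.get("name", ""),
            "status": status,
            "duration": round(seconds * 1000),  # JUnit reports seconds
        })

    def count(status: str) -> int:
        return sum(1 for t in tests if t["status"] == status)

    summary = {
        "tests": len(tests),
        "passed": count("passed"),
        "failed": count("failed"),
        "skipped": count("skipped"),
        "pending": 0,
        "other": 0,
    }
    return {"results": {"tool": {"name": tool_name}, "summary": summary, "tests": tests}}
```

Once every input format lands in one shape, history, trends, and rendering only have to understand that shape.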

The invariant is tested at the byte level. Render the report with AI off: 3319 bytes. With AI on: 3498. The two files differ - that difference is the advisory block, wrapped in markers. Strip the block and what remains is byte-for-byte identical to the no-AI report. A test asserts exactly that. The correctness-bearing output does not move when the AI layer does.
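
The shape of that test, sketched in Python. The marker strings and the toy `render` function are assumptions for illustration; the real report and its markers will differ, but the assertion is the same idea.

```python
import re
from typing import Optional

# Hypothetical marker names; the real tool's markers may differ.
START = b"<!-- advisory:start -->"
END = b"<!-- advisory:end -->"

# Match the whole advisory block, markers included, plus one trailing newline.
_BLOCK = re.compile(re.escape(START) + b".*?" + re.escape(END) + b"\n?", re.DOTALL)


def strip_advisory(report: bytes) -> bytes:
    """Remove the advisory block; everything else is left untouched."""
    return _BLOCK.sub(b"", report)


def render(body: str, advisory: Optional[str] = None) -> bytes:
    """Toy renderer standing in for the real HTML report."""
    html = f"<main>{body}</main>\n"
    if advisory:
        html += f"{START.decode()}<aside>{advisory}</aside>{END.decode()}\n"
    return html.encode("utf-8")


def test_ai_layer_is_removable() -> None:
    without_ai = render("pass rate: 87")
    with_ai = render("pass rate: 87", advisory="Check the retry config.")
    assert with_ai != without_ai                   # the advisory is really there
    assert strip_advisory(with_ai) == without_ai   # and really removable
```

Comparing bytes rather than parsed structure means a stray space or reordered attribute caused by the AI path would fail the test too.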

The grounding gate walks every numeric token in the AI prose and matches it against the set of numbers actually present in the data, within a small epsilon - so 90 and 90.0 are the same number, but an invented 91 is not. The anonymizer scrubs client terms before any data reaches a brand repo, with the term map kept out of git. The AI path, when you want it, runs on a local model through Ollama, so nothing leaves the machine.
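
A grounding gate along those lines can be sketched as follows. This is Python, not the tool's code; the function names, the epsilon value, and the choice to match only non-negative decimal tokens are assumptions.

```python
import math
import re

# Assumption: figures in reports are non-negative decimals such as 90 or 90.0.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def harvest_numbers(data) -> list[float]:
    """Collect every numeric value from a nested dict/list structure."""
    if isinstance(data, bool):            # bool is an int subclass; skip it
        return []
    if isinstance(data, (int, float)):
        return [float(data)]
    if isinstance(data, dict):
        return [n for v in data.values() for n in harvest_numbers(v)]
    if isinstance(data, (list, tuple)):
        return [n for v in data for n in harvest_numbers(v)]
    return []


def is_grounded(prose: str, data, eps: float = 1e-6) -> bool:
    """True only if every number in the prose also appears in the data."""
    allowed = harvest_numbers(data)
    for token in NUMBER.findall(prose):
        value = float(token)
        if not any(math.isclose(value, a, rel_tol=0.0, abs_tol=eps) for a in allowed):
            return False                  # one ungrounded number rejects the text
    return True
```

With `data = {"pass_rate": 90}`, prose that says "90.0 percent" passes and prose that says "91 percent" is rejected. Prose with no numbers at all passes trivially, which is the point: the model is free to write words, not figures.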

On real anonymized data it tracked a ten-run pass rate of 79, 81, 31, 77, 45, 56, 79, 79, 87, then 0 percent - the last run a complete outage the report surfaced without a single word of AI. The tool’s own suite is 25 tests, all green. AGPL-3.0.

Tool two: flaky-analytics

Three days later, the same pattern, sharper.

flaky-analytics reads CI run history and answers one question: which tests flake, and what do you do about them. It pulls the last runs from GitHub Actions, normalizes them, and computes flake events - a retry that flips to pass, a single commit that comes back both red and green. It classifies each test as chronic, intermittent, isolated, or always-failing, and writes a quarantine plan with the numeric evidence behind every call.
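
The two flake signals and the four labels can be sketched like this. It is Python rather than the tool's Node, the `Attempt` record is a hypothetical shape, and the classification thresholds are invented for the example; the real cut-offs live in the repo.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    test: str
    commit: str
    run_id: int
    attempt: int      # 1 for the first try, 2+ for retries
    passed: bool


def flake_events(attempts: list[Attempt]) -> dict[str, int]:
    """Count flake events per test from raw attempt records."""
    by_run = defaultdict(list)
    for a in attempts:
        by_run[(a.test, a.commit, a.run_id)].append(a)

    events: dict[str, int] = defaultdict(int)
    finals = defaultdict(set)             # (test, commit) -> final outcomes seen
    for (test, commit, _run), group in by_run.items():
        group.sort(key=lambda a: a.attempt)
        # Signal 1: a retry that flips to pass within one run.
        if group[-1].passed and any(not a.passed for a in group[:-1]):
            events[test] += 1
        finals[(test, commit)].add(group[-1].passed)

    # Signal 2: the same commit came back both red and green across runs.
    for (test, _commit), outcomes in finals.items():
        if outcomes == {True, False}:
            events[test] += 1
    return dict(events)


def classify(events: int, runs: int, failed_runs: int) -> Optional[str]:
    """Label a test; thresholds here are assumptions, not the tool's values."""
    if runs == 0:
        return None
    if failed_runs == runs:
        return "always-failing"
    if events == 0:
        return None                       # nothing to report
    if events / runs >= 0.3:
        return "chronic"
    if events >= 2:
        return "intermittent"
    return "isolated"
```

Every label comes from integer counts, so the quarantine plan can print the evidence next to the call.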

It’s a Claude Code skill, not a plugin. And here the AI layer is literal: it’s Claude, in the session, reading the analysis and drafting root-cause hypotheses. There’s no model process to run. The skill says it plainly - you are the removable AI layer, the report is already complete without you. The draft goes through a grounding gate; if it cites a number that isn’t in the analysis, it’s rejected, fixed, re-checked, and after two failures the report renders without it.
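
The reject, fix, re-check loop has a simple shape. In the skill the drafting happens in the session, not in a function call, so modelling the drafter as a callable is an assumption of this Python sketch, as are the names `gated_advisory`, `draft`, and `is_grounded`.

```python
from typing import Callable, Optional


def gated_advisory(
    draft: Callable[[Optional[str]], str],
    is_grounded: Callable[[str], bool],
    max_failures: int = 2,
) -> Optional[str]:
    """Return a grounded advisory, or None so the report renders without one.

    `draft` receives feedback from the previous rejection (None on the first
    try) and returns a new draft. `is_grounded` is the code-side gate.
    """
    feedback: Optional[str] = None
    for _ in range(max_failures):
        text = draft(feedback)
        if is_grounded(text):
            return text
        feedback = "Rejected: cite only numbers present in the analysis."
    return None   # two failures: give up on the advisory, not on the report
```

The loop is bounded, so a stubborn draft costs at most two attempts before the deterministic report goes out alone.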

The invariant here is the sharpest version of the same idea. A test takes the with-AI report, strips the advisory block, and asserts it is byte-equal to the no-AI report. The quarantine plan is byte-identical with AI on or off. The golden fixtures match blessed output to the byte. Same proof as the first tool, one notch tighter. The anonymizer goes a step further too: it scrubs, then re-walks the files and fails the run if a single forbidden term survived.
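
The scrub-then-verify step might look something like this. It is a Python sketch under stated assumptions: files are UTF-8 text, matching is case-insensitive, and `scrub`, `verify_clean`, and `LeakError` are illustrative names, not the tool's.

```python
import re
from pathlib import Path


class LeakError(RuntimeError):
    """Raised when a forbidden term is still present after scrubbing."""


def scrub(root: Path, term_map: dict[str, str]) -> None:
    """Replace each client term with its alias in every file under root."""
    if not term_map:
        return
    # Longest terms first so a short term never clobbers part of a longer one.
    terms = sorted(term_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    lookup = {k.lower(): v for k, v in term_map.items()}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            cleaned = pattern.sub(lambda m: lookup[m.group(0).lower()], text)
            path.write_text(cleaned, encoding="utf-8")


def verify_clean(root: Path, forbidden: list[str]) -> None:
    """Second pass: re-walk the files and fail if any forbidden term survived."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8").lower()
        for term in forbidden:
            if term.lower() in text:
                # Name the file, not the term, so the error itself leaks nothing.
                raise LeakError(f"{path}: forbidden term survived scrubbing")
```

Running `verify_clean` as a separate pass means a bug in the scrubber shows up as a failed run instead of a leaked name.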

The whole thing is nine Node scripts, zero dependencies, pure built-ins - it runs in CI with nothing to install. 54 tests, all green. It’s a local, auditable, open-source-shaped answer to paid flake analytics like Datadog Test Optimization or Trunk. AGPL-3.0.

What’s the same, what got sharper

The shared spine: three stages, numbers-from-code, a removable default-off AI, fail-closed everywhere, a code-side grounding gate, anonymize-before-it-lands, CTRF as the common format, the invariant unit-tested with zero network.

flaky-analytics says the lineage out loud in its own source: inherited from qa-report-lake. What changed between them is the edge. The AI moved from an external local model to the session itself. The dependencies went to zero. The scope narrowed from two mechanisms to one. The distribution went from a CLI repo to a skill. And both now prove the invariant byte-for-byte - that part stopped being a difference and became the standard.

Why bother

From June 15, programmatic LLM calls in a pipeline bill at full API rate, separate from interactive use. A deterministic core that counts without tokens stops being a nice-to-have and becomes a line you don’t pay for. The AI goes on top, optionally, where it actually earns its place.

That’s the thesis I keep coming back to: context before LLM. Build the clean, counted context first. Bring the model in second, if at all. When the floor is deterministic and the AI is a layer you can lift off and prove identical, you get the advice without betting the report on it.

Both tools are AGPL. The pattern is the part worth stealing - it transfers to any pipeline drowning in reports, or in flaky tests with no owner.

Repos