60% of Our Tests Had Zero Signal
The Discovery
We build with highly parallel AI agent teams — 7 agents working simultaneously across different parts of the codebase during our initial sprint. 162 commits in 4 days. The velocity was extraordinary. The test suite was green. Coverage metrics looked healthy.
Then we deleted an entire API route handler as an experiment. 94% of the "covering" tests still passed.
That was the moment we stopped trusting our test suite and started auditing it.
A false-confidence test passes regardless of whether the feature it claims to test actually works. It provides the psychological comfort of coverage while providing zero signal about correctness.
The insidious part is that false-confidence tests do not fail. They sit in your suite, green and reassuring, while the code they are supposedly testing drifts, breaks, or gets deleted entirely. You only discover the problem when a production incident occurs in code you believed was tested.
When multiple AI agents write tests in parallel — each with their own context window, each optimising for green CI — the problem compounds. An agent that encounters a 404 from unseeded test data will "fix" the test by accepting 404 as a valid response rather than fixing the seed. Multiply that pattern across 7 parallel agents and 656 test files, and you get a test suite where 60% of tests provide zero signal.
Once we knew what to look for, we found it everywhere.
The FC Taxonomy
Through systematic audit of the agent-generated test suite, we identified 8 distinct false-confidence patterns, plus one conditional variant. Each has a code, a name, and a characteristic signature in test code. The taxonomy exists to enforce good testing practices during AI-assisted development — giving agents (and humans reviewing agent output) a shared vocabulary for test quality.
FC-A: Multi-Status Acceptance
The most common pattern we found. Instead of asserting a specific HTTP status, the test accepts multiple statuses as success:
// FC-A: This test passes whether the endpoint returns 200 OR 404
expect([200, 404]).toContain(res.status)
// FC-A: Same pattern with Array.includes
if ([200, 404].includes(res.status)) {
// assertions here only run on success
}
// FC-A: OR-chain variant
if (res.status === 200 || res.status === 201) {
expect(res.body).toBeDefined()
}
Why does this happen? An AI agent writing the test encounters a 404 because the test data is not seeded correctly. Instead of fixing the seed — which requires understanding the broader data model — it adds 404 to the accepted statuses. The test passes. The agent moves on to the next task. The endpoint has never actually been tested.
The fix: One test per expected status. If you expect 200, assert exactly 200.
// Correct: exact assertion
expect(res.status).toBe(200)
expect(res.body.id).toBe(expectedId)
FC-B: Shape-Only Assertions
The test checks that the response has a certain structure but never verifies the actual values:
// FC-B: Passes for ANY object response, including error responses
expect(typeof body).toBe("object")
expect(body).toBeDefined()
expect(Array.isArray(body.items)).toBe(true)
A response of { error: "Internal Server Error" } satisfies typeof body === "object". An empty array [] satisfies Array.isArray(). These assertions prove nothing about whether the feature works.
// Correct: value assertion
expect(body.mode).toBe("local")
expect(body.items).toHaveLength(3)
expect(body.items[0].name).toBe("Expected Item")
FC-C: Mock-Only Assertions
The test mocks a dependency to return a specific value, then asserts that the mock's return value arrived. It is testing the mock framework, not the application:
// FC-C: Mock returns [], test asserts Array.isArray([])
mock.module("../services/user-service", () => ({
getUsers: () => []
}))
const res = await app.request("/api/users")
const body = await res.json()
expect(Array.isArray(body)).toBe(true) // Always true — we just mocked it
The fix: Inject realistic mock data and assert specific values that prove the application logic processed the data correctly.
// Correct: inject real data, assert transformation
mock.module("../services/user-service", () => ({
getUsers: () => [{ id: 1, name: "Alice", role: "admin" }]
}))
const res = await app.request("/api/users")
const body = await res.json()
expect(body[0].name).toBe("Alice")
expect(body[0].role).toBe("admin")
FC-D: Route Never Reached
The test makes a request to an endpoint that is not mounted, or hits the wrong port. The framework returns a default response, and the test's assertions are loose enough to pass on it:
// FC-D: Endpoint is /api/v1/users but test hits /api/users
const res = await fetch("http://localhost:3001/api/users")
// Returns 404 from the framework, not from the handler
expect(res.status).toBeLessThan(500) // Passes! 404 < 500
FC-D often compounds with FC-E. The route is never reached, and the always-true assertion hides the fact. Two weak patterns combine to create a test that cannot possibly fail.
FC-E: Always-True / Tautological Assertions
Range checks that pass for virtually any response:
// FC-E: Passes for 200, 201, 301, 400, 404, 418, 499...
expect(res.status).toBeLessThan(500)
// FC-E: Passes for any non-negative number
expect(items.length).toBeGreaterThanOrEqual(0)
// FC-E: Tautological — asserts a literal against itself
expect(404).toBe(404)
The tautological variant is particularly insidious. It often appears when a developer extracts a value into a variable and then asserts that variable against itself:
const status = res.status
expect(status).toBe(status) // Always true, regardless of what status is
FC-F: Silent Skip
The test returns early when the response is not what it expects, silently skipping all assertions:
// FC-F: If the endpoint fails, the test passes with 0 assertions
const res = await fetch("/api/data")
if (!res.ok) return
const body = await res.json()
expect(body.items).toHaveLength(5)
When the endpoint breaks and returns 500, the test does not fail — it silently exits. The CI stays green.
FC-G: Graceful Degradation
The service returns an empty or default response when its database mock is missing, and the test accepts the empty response as correct:
// FC-G: No mock DB injected. Service catches the error and returns [].
const res = await app.request("/api/products")
expect(res.status).toBe(200)
const body = await res.json()
expect(Array.isArray(body)).toBe(true) // True — it is an empty array
The fix: Inject a mock database with realistic data:
const mockDb = makeMockDb({
selectResult: [{ id: 1, name: "Widget", price: 9.99 }]
})
const res = await app.request("/api/products")
expect(res.status).toBe(200)
const body = await res.json()
expect(body[0].name).toBe("Widget")
expect(body[0].price).toBe(9.99)
FC-H: Wrong Security Constant
Time-dependent security tests (JWT expiry, rate limiting, token refresh) that use the wrong constant, making them either always-pass or test the wrong boundary:
// FC-H: Test checks token expiry at 15 minutes,
// but the system uses REFRESH_GRACE_S (30s), not TOKEN_TTL_S (900s).
const token = createToken({ expiresIn: "15m" })
// ... wait or mock time ...
expect(isValid(token)).toBe(false)
The fix: Map each security test to exactly one constant. Before writing the test, locate all relevant constants (MAX_CLOCK_SKEW_S, REFRESH_GRACE_S, TOKEN_TTL_S, RATE_LIMIT). If two constants could explain the test outcome, the test is modelling the wrong behaviour.
COND: Conditional No-Op
Assertions wrapped in conditions where the else branch has zero assertions:
// COND: If condition is false, test passes with 0 assertions
if (feature.isEnabled) {
expect(res.body.feature).toBeDefined()
}
// No else branch — what if feature is NOT enabled?
The Audit Process
1. Delete an endpoint handler. Run the test suite and count how many "covering" tests still pass. This reveals the scale of the problem.
2. Read every passing test that should have failed. Categorise the reason it passed: why did this test not detect the deletion?
3. Name the pattern with a code. Build a checklist and give each pattern a memorable identifier (FC-A, FC-B, etc.).
4. Scan the entire test suite for that pattern. Count occurrences to understand the scale across the codebase.
5. Repeat until no new patterns emerge. The taxonomy stabilised at 8 codes (FC-A through FC-H) plus the COND variant after three full passes.
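The delete-and-count measurement in step 1 reduces to a simple ratio. A toy sketch (the test list and its pass/fail results are invented for illustration):

```javascript
// What fraction of a handler's "covering" tests still pass after the
// handler is deleted? That fraction is pure false confidence.
function falseConfidenceRate(coveringTests) {
  const stillPassing = coveringTests.filter((t) => t.passesAfterDeletion);
  return stillPassing.length / coveringTests.length;
}

const coveringTests = [
  { name: "returns 200",        passesAfterDeletion: false }, // real signal
  { name: "status < 500",       passesAfterDeletion: true },  // FC-E
  { name: "body is an object",  passesAfterDeletion: true },  // FC-B
  { name: "accepts 200 or 404", passesAfterDeletion: true },  // FC-A
];

// 3 of these 4 "covering" tests survive the deletion.
console.log(falseConfidenceRate(coveringTests)); // → 0.75
```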
Fix Direction: Tests Are Specifications
One of the most important principles we established was about fix direction. When a test fails, the default interpretation is that the code is wrong — not the test.
This matters because the most natural response to a failing test is to update the expected value to match what the code returns. That is spec corruption.
| Question | Answer | Action |
|---|---|---|
| Was the test passing before this change? | Yes | Fix the code |
| Does the test represent a stated requirement? | Yes | Fix the code |
| Does the test have an FC pattern? | Yes | Fix the test, document the FC code |
"The code returns X" is never sufficient justification for changing an expected value. Evidence must come from outside the implementation — a product requirement, an API contract, a user story.
The Impact
After the audit and remediation, our test suite looked very different:
- Before: 656 test files, CI green, zero signal on ~60% of tests
- After: Same test count, but every test asserts specific values, every route test verifies the correct endpoint is reached, every mock injects realistic data
The most telling metric was our delete-and-count experiment repeated after remediation. Deleting an endpoint handler now caused 100% of its covering tests to fail. The suite had gone from decorative to diagnostic.
We then built three ESLint rules (no-false-confidence-patterns, no-conditional-status-expect, no-tautological-expect) to prevent FC-A and FC-E patterns from re-entering the codebase. These run as CI gates at error severity on every test file.
ESLint catches FC-A and FC-E before code is committed. For the remaining patterns (FC-B through FC-D, FC-F through FC-H, COND) — which require semantic understanding that AST analysis cannot reliably detect — we built agent scaffolding: structured rules and checklists that AI agents load into their context window during development and testing. The agents reason about each FC pattern as they write tests, catching what static analysis cannot.
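To give a flavour of what the statically detectable checks look for, here is a deliberately simplified, regex-based sketch of the tautology check. Our actual rules work on the AST through ESLint; a regex like this misses aliasing and member expressions, so treat it as an illustration of the idea only:

```javascript
// Simplified sketch: flag expect(<identifier>).toBe(<same identifier>).
// A real implementation inspects the AST, not source text.
function findTautologicalExpects(source) {
  const findings = [];
  const pattern = /expect\((\w+)\)\s*\.\s*toBe\(\s*(\w+)\s*\)/g;
  for (const m of source.matchAll(pattern)) {
    if (m[1] === m[2]) findings.push(m[0]); // same name on both sides
  }
  return findings;
}

const sample = `
const status = res.status
expect(status).toBe(status)   // FC-E: always true
expect(res.status).toBe(200)  // fine: compares against a literal
`;
console.log(findTautologicalExpects(sample)); // → [ 'expect(status).toBe(status)' ]
```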
Lessons for Other Teams
- Coverage metrics lie. A file with 100% line coverage can have 0% assertion coverage. The lines execute, but nothing meaningful is verified.
- Green CI is not a quality signal. It is a necessary condition, not a sufficient one. The quality signal is: "does deleting the feature cause the test to fail?"
- The most common root cause is test-data seeding. FC-A, FC-C, and FC-G all stem from inadequate test data. When the test does not set up the right preconditions, the code under test takes error or empty paths, and loose assertions accept those paths as success.
- Tests are specifications, not reflections. If you find yourself changing expected values to match what the code returns, stop. That is the test telling you the code is wrong.
- Automate what you can, scaffold what you cannot. FC-A and FC-E are detectable via AST analysis — we automated those as ESLint rules. For the rest, we built agent scaffolding: structured testing rules that load into every AI agent's context during development. The agents reason about FC-B through FC-H as they write tests, catching semantic patterns that no linter can detect.
The false-confidence taxonomy is now part of our engineering rules. Every test — whether written by an AI agent or a human — is checked against the full FC-A through FC-H checklist. The ESLint rules catch the statically detectable patterns before commit; the agent scaffolding enforces the rest during development. It takes discipline, but the alternative — a test suite that provides comfort instead of confidence — is worse than having no tests at all.
Related Reading
- Building ESLint Rules to Prevent Tests That Lie shows how we turned the taxonomy from this audit into pre-commit AST-level enforcement.
- Why Agent-Native Teams Need Better Tests, Not More Tests explains why test truthfulness becomes more important as throughput rises.
- Agent-Native Shift-Left CI for High-Velocity Solo Engineering places this testing work inside the wider local quality and CI perimeter.