# 60% of Our Tests Had Zero Signal
## The Discovery
We build with highly parallel AI agent teams: 7 agents working simultaneously across different parts of the codebase during our initial sprint. 162 commits in 4 days. The velocity was extraordinary. The test suite was green. Coverage metrics looked healthy.
Then we deleted an entire API route handler as an experiment. 94% of the "covering" tests still passed.
That was the moment we stopped trusting our test suite and started auditing it.
A false-confidence test passes regardless of whether the feature it claims to test actually works. It provides the psychological comfort of coverage while providing zero signal about correctness.
The insidious part is that false-confidence tests do not fail. They sit in your suite, green and reassuring, while the code they are supposedly testing drifts, breaks, or gets deleted entirely. You only discover the problem when a production incident occurs in code you believed was tested.
When multiple AI agents write tests in parallel (each with their own context window, each optimising for green CI) the problem compounds. An agent that encounters a 404 from unseeded test data will "fix" the test by accepting 404 as a valid response rather than fixing the seed. Multiply that pattern across 7 parallel agents and 656 test files, and you get a test suite where 60% of tests provide zero signal.
Once we knew what to look for, we found it everywhere.
## The FC Taxonomy
Through systematic audit of the agent-generated test suite, we identified 8 distinct false-confidence patterns, plus one conditional variant. Each has a code, a name, and a characteristic signature in test code. The taxonomy exists to enforce good testing practices during AI-assisted development, giving agents (and humans reviewing agent output) a shared vocabulary for test quality.
### FC-A: Multi-Status Acceptance
The most common pattern we found. Instead of asserting a specific HTTP status, the test accepts multiple statuses as success:
```ts
// FC-A: This test passes whether the endpoint returns 200 OR 404
expect([200, 404]).toContain(res.status)

// FC-A: Same pattern with Array.includes
if ([200, 404].includes(res.status)) {
  // assertions here only run on success
}

// FC-A: OR-chain variant
if (res.status === 200 || res.status === 201) {
  expect(res.body).toBeDefined()
}
```

Why does this happen? An AI agent writing the test encounters a 404 because the test data is not seeded correctly. Instead of fixing the seed (which requires understanding the broader data model) it adds 404 to the accepted statuses. The test passes. The agent moves on to the next task. The endpoint has never actually been tested.
The fix: One test per expected status. If you expect 200, assert exactly 200.
```ts
// Correct: exact assertion
expect(res.status).toBe(200)
expect(res.body.id).toBe(expectedId)
```

### FC-B: Shape-Only Assertions
The test checks that the response has a certain structure but never verifies the actual values:
```ts
// FC-B: Passes for ANY object response, including error responses
expect(typeof body).toBe("object")
expect(body).toBeDefined()
expect(Array.isArray(body.items)).toBe(true)
```

A response of `{ error: "Internal Server Error" }` satisfies `typeof body === "object"`. An empty array `[]` satisfies `Array.isArray()`. These assertions prove nothing about whether the feature works.
```ts
// Correct: value assertion
expect(body.mode).toBe("local")
expect(body.items).toHaveLength(3)
expect(body.items[0].name).toBe("Expected Item")
```

### FC-C: Mock-Only Assertions
The test mocks a dependency to return a specific value, then asserts that the mock's return value arrived. It is testing the mock framework, not the application:
```ts
// FC-C: Mock returns [], test asserts Array.isArray([])
mock.module("../services/user-service", () => ({
  getUsers: () => []
}))

const res = await app.request("/api/users")
const body = await res.json()
expect(Array.isArray(body)).toBe(true) // Always true: we just mocked it
```

The fix: Inject realistic mock data and assert specific values that prove the application logic processed the data correctly.
```ts
// Correct: inject real data, assert transformation
mock.module("../services/user-service", () => ({
  getUsers: () => [{ id: 1, name: "Alice", role: "admin" }]
}))

const res = await app.request("/api/users")
const body = await res.json()
expect(body[0].name).toBe("Alice")
expect(body[0].role).toBe("admin")
```

### FC-D: Route Never Reached
The test makes a request to an endpoint that is not mounted, or hits the wrong port. The framework returns a default response, and the test's assertions are loose enough to pass on it:
```ts
// FC-D: Endpoint is /api/v1/users but test hits /api/users
const res = await fetch("http://localhost:3001/api/users")
// Returns 404 from the framework, not from the handler
expect(res.status).toBeLessThan(500) // Passes! 404 < 500
```

FC-D often compounds with FC-E. The route is never reached, and the tautological assertion hides the fact. Two weak patterns combine to create a test that cannot possibly fail.
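A hedged fix sketch (the status and payload field here are illustrative, not the article's actual routes): assert the exact status and a value only the real handler can produce, so a framework-level 404 cannot masquerade as success.

```ts
// Minimal stand-in for the framework's response type (assumption,
// not a real framework API).
interface TestResponse {
  status: number;
  json(): Promise<any>;
}

async function assertHandlerReached(res: TestResponse): Promise<void> {
  // Exact status: a framework default 404 fails here immediately.
  if (res.status !== 200) throw new Error(`expected 200, got ${res.status}`);
  const body = await res.json();
  // A framework error body will not carry the handler's payload shape.
  if (!Array.isArray(body.users)) throw new Error("handler payload missing");
}
```

With this shape, both halves of the compound failure (wrong route, loose range check) turn the test red.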
### FC-E: Always-True / Tautological Assertions
Range checks that pass for virtually any response:
```ts
// FC-E: Passes for 200, 201, 301, 400, 404, 418, 499...
expect(res.status).toBeLessThan(500)

// FC-E: Passes for any non-negative number
expect(items.length).toBeGreaterThanOrEqual(0)

// FC-E: Tautological. Asserts a literal against itself
expect(404).toBe(404)
```

The tautological variant is particularly insidious. It often appears when a developer extracts a value into a variable and then asserts that variable against itself:
```ts
const status = res.status
expect(status).toBe(status) // Always true, regardless of what status is
```

### FC-F: Silent Skip
The test returns early when the response is not what it expects, silently skipping all assertions:
```ts
// FC-F: If the endpoint fails, the test passes with 0 assertions
const res = await fetch("/api/data")
if (!res.ok) return
expect(res.body.items).toHaveLength(5)
```

When the endpoint breaks and returns 500, the test does not fail. It silently exits. The CI stays green.
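The fix, sketched with an illustrative helper (`fetchData` stands in for the real request, and the simulated failure is an assumption): assert the failure condition instead of returning early, so a broken endpoint turns the test red.

```ts
type DataResponse = { ok: boolean; items: unknown[] };

// Stand-in for the real fetch; simulates a broken endpoint.
async function fetchData(): Promise<DataResponse> {
  return { ok: false, items: [] };
}

// FC-F version: the early return means zero assertions run,
// so the "test" passes even though the endpoint is down.
async function silentSkipTest(): Promise<boolean> {
  const res = await fetchData();
  if (!res.ok) return true;
  return res.items.length === 5;
}

// Fixed version: the failure itself is asserted, so the breakage is visible.
async function fixedTest(): Promise<boolean> {
  const res = await fetchData();
  if (!res.ok) return false;
  return res.items.length === 5;
}
```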
### FC-G: Graceful Degradation
The service returns an empty or default response when its database mock is missing, and the test accepts the empty response as correct:
```ts
// FC-G: No mock DB injected. Service catches the error and returns [].
const res = await app.request("/api/products")
expect(res.status).toBe(200)
const body = await res.json()
expect(Array.isArray(body)).toBe(true) // True: it is an empty array
```

The fix: Inject a mock database with realistic data:
```ts
const mockDb = makeMockDb({
  selectResult: [{ id: 1, name: "Widget", price: 9.99 }]
})

const res = await app.request("/api/products")
expect(res.status).toBe(200)
const body = await res.json()
expect(body[0].name).toBe("Widget")
expect(body[0].price).toBe(9.99)
```

### FC-H: Wrong Security Constant
Time-dependent security tests (JWT expiry, rate limiting, token refresh) that use the wrong constant, making them either always-pass or test the wrong boundary:
```ts
// FC-H: Test checks token expiry at 15 minutes,
// but the system uses REFRESH_GRACE_S (30s), not TOKEN_TTL_S (900s).
const token = createToken({ expiresIn: "15m" })
// ... wait or mock time ...
expect(isValid(token)).toBe(false)
```

The fix: Map each security test to exactly one constant. Before writing the test, locate all relevant constants (`MAX_CLOCK_SKEW_S`, `REFRESH_GRACE_S`, `TOKEN_TTL_S`, `RATE_LIMIT`). If two constants could explain the test outcome, the test is modelling the wrong behaviour.
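A minimal sketch of the one-test-one-constant idea (the constant values follow the comment above; `isExpired` is a hypothetical helper, not the real system's API): derive the boundary from the single constant under test rather than hard-coding a duration, so only that constant can explain the test outcome.

```ts
// Illustrative constants; values assumed from the example above.
const TOKEN_TTL_S = 900;      // 15 minutes
const REFRESH_GRACE_S = 30;   // unrelated constant: must not enter this test
void REFRESH_GRACE_S;         // listed only to show it is deliberately unused

// Hypothetical expiry check used by the sketch.
function isExpired(issuedAtS: number, nowS: number): boolean {
  return nowS - issuedAtS > TOKEN_TTL_S;
}

// The TTL test flips exactly at TOKEN_TTL_S. If changing REFRESH_GRACE_S
// could also flip it, the test is modelling the wrong behaviour.
```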
### COND: Conditional No-Op
Assertions wrapped in conditions where the else branch has zero assertions:
```ts
// COND: If condition is false, test passes with 0 assertions
if (feature.isEnabled) {
  expect(res.body.feature).toBeDefined()
}
// No else branch. What if feature is NOT enabled?
```

## The Audit Process
1. **Delete an endpoint handler.** Run the test suite. Count how many "covering" tests still pass. This reveals the scale of the problem.
2. **Read every passing test that should have failed.** Categorise the reason it passed. Why did this test not detect the deletion?
3. **Name the pattern with a code.** Build a checklist. Give each pattern a memorable identifier (FC-A, FC-B, etc.).
4. **Scan the entire test suite for that pattern.** Count occurrences. Understand the scale across the codebase.
5. **Repeat until no new patterns emerge.**

The taxonomy stabilised at 8 codes (FC-A through FC-H) plus the COND variant after three full passes.
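The scanning step can be sketched as a coarse text pass (our real enforcement later moved to AST-level ESLint rules; the regexes here are illustrative and will miss variants):

```ts
// Coarse FC-A scan: flag lines that accept multiple statuses as success.
const FC_A_PATTERNS: RegExp[] = [
  /expect\(\s*\[[^\]]+\]\s*\)\s*\.toContain\(/,   // expect([200, 404]).toContain(...)
  /\[[^\]]+\]\s*\.includes\(\s*res\.status\s*\)/, // [200, 404].includes(res.status)
];

// Returns 1-based line numbers of suspicious lines in a test file's source.
function flagFCA(source: string): number[] {
  return source
    .split("\n")
    .map((line, i) => (FC_A_PATTERNS.some((re) => re.test(line)) ? i + 1 : -1))
    .filter((n) => n > 0);
}
```

A scan like this is only good enough to count occurrences and build the worklist; it cannot replace review of each flagged test.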
## Fix Direction: Tests Are Specifications
One of the most important principles we established was about fix direction. When a test fails, the default interpretation is that the code is wrong, not the test.
This matters because the most natural response to a failing test is to update the expected value to match what the code returns. That is spec corruption.
| Question | Answer | Action |
|---|---|---|
| Was the test passing before this change? | Yes | Fix the code |
| Does the test represent a stated requirement? | Yes | Fix the code |
| Does the test have an FC pattern? | Yes | Fix the test, document the FC code |
"The code returns X" is never sufficient justification for changing an expected value. Evidence must come from outside the implementation: a product requirement, an API contract, a user story.
## The Impact
After the audit and remediation, our test suite looked very different:
- Before: 656 test files, CI green, zero signal on ~60% of tests
- After: Same test count, but every test asserts specific values, every route test verifies the correct endpoint is reached, every mock injects realistic data
The most telling metric was our delete-and-count experiment repeated after remediation. Deleting an endpoint handler now caused 100% of its covering tests to fail. The suite had gone from decorative to diagnostic.
We then built three ESLint rules (`no-false-confidence-patterns`, `no-conditional-status-expect`, `no-tautological-expect`) to prevent FC-A and FC-E patterns from re-entering the codebase. These run as CI gates at error severity on every test file.
ESLint catches FC-A and FC-E before code is committed. For the remaining patterns (FC-B through FC-D, FC-F through FC-H, COND), which require semantic understanding that AST analysis cannot reliably detect, we built agent scaffolding: structured rules and checklists that AI agents load into their context window during development and testing. The agents reason about each FC pattern as they write tests, catching what static analysis cannot.
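To illustrate what the statically detectable patterns look like to a tool, here is a toy version of the tautology check. The real `no-tautological-expect` rule works on the AST, not regexes, so treat this purely as a sketch:

```ts
// Toy FC-E check: expect(x).toBe(x) where both sides are the same
// bare identifier. An AST rule also catches member expressions,
// literals, and renamed aliases that this regex misses.
function isTautologicalExpect(line: string): boolean {
  const m = line.match(
    /expect\(\s*([A-Za-z_$][\w$]*)\s*\)\s*\.toBe\(\s*([A-Za-z_$][\w$]*)\s*\)/
  );
  return m !== null && m[1] === m[2];
}
```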
## Why This Matters More Under High Velocity
The audit above describes what we found. This section explains why the same underlying weakness becomes more dangerous as throughput rises.
In slower development environments, weak tests are an inefficiency. The human integration layer compensates — reviewers remember more context, the surface area of each change wave is smaller, and manual verification has more time to catch what the tests miss.
In agent-native development, that compensation disappears.
When multiple parallel workstreams are running, integration loops are frequent, and a single operator is directing many changes at once, the test suite is no longer one signal among many. It becomes the primary quality signal. And when that signal says "safe to continue" without actually earning the claim, it authorises more change while hiding the fact that confidence has not been built.
That is the specific failure mode. Not that the tests are noisy. That they are too quiet. They pass when they should not, which in a high-output system means more code ships on a foundation that has not been verified.
Speed does not create false-confidence tests. But speed multiplies their cost.
## The Agent-Generated Test Problem
There is a structural reason why AI-assisted teams encounter this more often.
Agents are effective at producing tests that look correct: they mirror patterns, write plausible assertions, and satisfy structural expectations. What they optimise for, by default, is green CI — not test truthfulness.
When an agent writing tests encounters a 404 because test data is not seeded correctly, the path of least resistance is to accept 404 as a valid response (FC-A) rather than diagnosing the seed problem. The test goes green. The agent moves on. The endpoint has never actually been tested.
Multiply that across seven parallel agents and 656 test files, and you have a large green suite that provides genuine confidence on perhaps 40% of what it claims.
If your testing patterns are already weak, agentic throughput multiplies the weakness. You accumulate more green checks, more files, more surface area — and not necessarily more truth.
## What Stronger Tests Actually Mean
Correcting this is not about adding more tests or using more sophisticated frameworks. It is about what each test proves.
Stronger tests do four specific things:
- **Assert exact contracts.** If an endpoint should return 200 with a particular envelope, assert that exact status and that exact shape with specific values. If a route should reject unauthorised access with 401, assert 401. Permissive assertions let drift hide inside technically-green behaviour.
- **Exercise real wiring where it matters.** Unit tests have their place, but when correctness depends on real middleware order, auth, tenancy, or routing, isolated tests overstate confidence. Critical paths need at least some tests that prove the mounted, bootstrapped application path.
- **Fail for the reasons you care about.** A strong test has a believable failure mode: if the product breaks in the way you care about, the test turns red. A surprising proportion of test code merely proves that the system still produces something, rather than proving it produces the correct thing.
- **Protect boundaries, not just functions.** The most expensive breakages in modern systems happen at boundaries — auth, tenancy, plan gating, event delivery, integration contracts. Those areas deserve tests that prove the boundary itself, not only the local function implementation.
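Asserting an exact contract in practice means comparing the whole envelope, not probing fragments of it. A sketch, assuming a hypothetical envelope shape:

```ts
import { deepStrictEqual } from "node:assert";

// Whole-envelope comparison: any drift in shape or values fails the
// check, not just the one field a loose test happened to probe.
function assertExactContract(actual: unknown): void {
  deepStrictEqual(actual, {
    ok: true,
    data: { id: 1, name: "Alice", role: "admin" },
  });
}
```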
## Lessons for Other Teams
- **Coverage metrics lie.** A file with 100% line coverage can have 0% assertion coverage. The lines execute, but nothing meaningful is verified.
- **Green CI is not a quality signal.** It is a necessary condition, not a sufficient one. The quality signal is: "does deleting the feature cause the test to fail?"
- **The most common root cause is test-data seeding.** FC-A, FC-C, and FC-G all stem from inadequate test data. When the test does not set up the right preconditions, the code under test takes error or empty paths, and loose assertions accept those paths as success.
- **Tests are specifications, not reflections.** If you find yourself changing expected values to match what the code returns, stop. That is the test telling you the code is wrong.
- **Automate what you can, scaffold what you cannot.** FC-A and FC-E are detectable via AST analysis; we automated those as ESLint rules. For the rest, we built agent scaffolding: structured testing rules that load into every AI agent's context during development. The agents reason about FC-B through FC-H as they write tests, catching semantic patterns that no linter can detect.
The false-confidence taxonomy is now part of our engineering rules. Every test (whether written by an AI agent or a human) is checked against the full FC-A through FC-H checklist. The ESLint rules catch the statically detectable patterns before commit; the agent scaffolding enforces the rest during development. It takes discipline, but the alternative (a test suite that provides comfort instead of confidence) is worse than having no tests at all.
## Related Reading
- *Building ESLint Rules to Prevent Tests That Lie* shows how we turned the taxonomy from this audit into pre-commit AST-level enforcement.
- *Agent-Native Shift-Left CI for High-Velocity Solo Engineering* places this testing work inside the wider local quality and CI perimeter.