AI for Fixing Flaky Tests

A flaky test is a tax on trust. It trains the team to ignore failures, rerun pipelines, and accept uncertainty where the whole point of tests was to create certainty. The worst part is the slow drift: one flaky test becomes three, then ten, and soon the suite is no longer a signal you can rely on.

Flakiness is not mysterious. It is usually nondeterminism you have not controlled, or a contract you asserted too strictly for what the system guarantees. AI can help you diagnose patterns faster, but the core work is still about making the test environment and the test logic deterministic.

The main families of flakiness

Most flaky tests fall into a small set of causes.

Symptom | Likely cause | Typical fix
Fails around midnight or DST | time dependence | fixed clock, explicit time zones
Passes locally, fails in CI | environment drift | pin versions, normalize config
Fails only under load | race condition | await correct signals, remove shared state
Fails when run in a full suite | test pollution | isolate state, clean up resources
Fails with network-like errors | external dependency | stub services, record/replay, timeouts
Fails with random seeds | nondeterministic inputs | fix seeds, remove true randomness

This classification is valuable because each family points toward different evidence and different fixes.

Turn flakiness into evidence before touching code

Before you try to fix anything, collect enough data that the fix is not guesswork.

  • How often does it fail in CI over the last week?
  • What is the stable failure signature: timeout, assertion mismatch, unexpected exception?
  • What runs before it when it fails, and what runs before it when it passes?
  • What is different between local and CI runs: CPU, timing, parallelism, environment variables?
  • Does it fail more often when the suite runs in parallel?

AI is useful here because it can cluster failure logs across runs and highlight the variables that correlate with failure. Give it multiple runs and ask it to extract a short list of likely causes, then validate with controlled tests.
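The clustering step can also be approximated without AI. A minimal sketch, assuming failure lines have been pulled from several CI runs (the log lines below are hypothetical): normalize away run-specific noise such as durations and addresses, then count the distinct signatures.

```python
import re
from collections import Counter

def failure_signature(log_line: str) -> str:
    """Normalize a failure line so similar failures cluster together."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", log_line)  # memory addresses
    sig = re.sub(r"\d+(\.\d+)?", "<num>", sig)           # durations, ports, counts
    return sig.strip()

# Hypothetical failure lines collected from several CI runs.
logs = [
    "TimeoutError: server did not respond within 5.0s",
    "TimeoutError: server did not respond within 7.2s",
    "AssertionError: expected 200, got 503",
]

clusters = Counter(failure_signature(line) for line in logs)
# The two timeout lines collapse into one signature; the assertion stays separate.
```

Two signatures instead of three raw lines is already a hypothesis: one timeout-shaped failure and one assertion-shaped failure, which point at different families from the table above.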

A workflow that fixes flakiness without breaking intent

Make the test deterministic first

The first goal is not to make the test pass. It is to make the test behave predictably.

Common stabilizations:

  • Replace real time with a fixed clock.
  • Replace real randomness with a fixed seed.
  • Replace sleeps with awaitable signals and latches.
  • Replace network calls with a stub or in-memory fake.
  • Ensure the test owns its state and cleans up reliably.

A deterministic failing test is easier to fix than a test that fails only once every twenty runs.
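The first two stabilizations can be sketched by injecting the clock and the RNG as parameters. This assumes the code under test accepts them; `make_token` is a hypothetical stand-in, not a real API.

```python
import random
from datetime import datetime, timezone

def make_token(clock, rng) -> str:
    """Hypothetical code under test: builds an ID from time and randomness.
    It takes the clock and RNG as parameters so a test can control both."""
    now = clock()
    return f"{now:%Y%m%d%H%M%S}-{rng.randint(1000, 9999)}"

# In the test: a frozen clock and a seeded RNG make the output deterministic.
fixed_clock = lambda: datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
seeded_rng = random.Random(42)

token = make_token(fixed_clock, seeded_rng)
# Every run with the same seed and clock produces the same token.
```

Passing the clock and RNG in, rather than patching globals, also documents the nondeterminism the function depends on.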

Reduce to a minimal reproduction

Treat a flaky test like a production bug.

  • isolate it
  • run it repeatedly
  • shrink its dependencies

If it only fails in the full suite, that often means shared state or global pollution. Your job is to find the coupling and remove it.
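A tiny repetition harness makes "run it repeatedly" concrete. This is a sketch: `flaky_test` below is a stand-in that fails through shared mutable state, mimicking test pollution.

```python
def run_repeatedly(test_fn, runs=100):
    """Run the test many times and collect every failure instead of
    stopping at the first one."""
    failures = []
    for i in range(runs):
        try:
            test_fn()
        except AssertionError as exc:
            failures.append((i, str(exc)))
    return failures

# Stand-in for the real flaky test: shared state makes every seventh call fail.
calls = {"n": 0}

def flaky_test():
    calls["n"] += 1
    assert calls["n"] % 7 != 0, "shared state hit the bad case"

failures = run_repeatedly(flaky_test, runs=50)
# failures now holds (run index, message) pairs: a failure *rate*, not a rerun.
```

The output is evidence you can act on: how often it fails, and whether the failing runs share a position or a message.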

Find and remove hidden coupling

Hidden coupling is the most common root cause of suite-only flakiness.

Common culprits:

  • global singletons that retain state across tests
  • environment variables modified without reset
  • shared databases without cleanup or transaction isolation
  • shared ports and background services that collide
  • tests that assume execution order
  • caches that are global instead of per-test

Once you name the coupling, you can remove it or reset it.
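For the environment-variable case, the reset can be made mechanical. A minimal sketch using only the standard library (`FLAKY_DEMO_FLAG` is a hypothetical variable name):

```python
import os
from contextlib import contextmanager

@contextmanager
def isolated_env(**overrides):
    """Apply environment overrides for one test, then restore the previous
    state exactly so no later test can observe the change."""
    saved = dict(os.environ)
    os.environ.update({k: str(v) for k, v in overrides.items()})
    try:
        yield
    finally:
        os.environ.clear()
        os.environ.update(saved)

# The override is visible inside the block and fully reverted after it.
with isolated_env(FLAKY_DEMO_FLAG="on"):
    inside = os.environ.get("FLAKY_DEMO_FLAG")
after = os.environ.get("FLAKY_DEMO_FLAG")
```

The same snapshot-and-restore shape works for singletons and caches: save the state before the test, restore it in a `finally`.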

Align assertions with the real contract

Some flakiness is not nondeterminism. It is an assertion that was too strict for what the system guarantees.

Examples:

  • asserting exact timing instead of bounded timing
  • asserting ordering when order is intentionally unspecified
  • asserting a full JSON blob when only a subset is contractually stable
  • asserting text formatting that varies by locale or environment

If the contract does not require the strict assertion, relax it to the contract. That is not lowering quality. That is making the test tell the truth.
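Relaxing to the contract often means asserting a projection rather than the whole payload. A sketch for the JSON case, with hypothetical response data:

```python
STABLE_FIELDS = {"id", "status"}  # the fields the contract actually guarantees

def contract_view(payload: dict) -> dict:
    """Project a response down to the fields the contract promises."""
    return {k: payload[k] for k in STABLE_FIELDS}

# Two responses that differ only in non-contractual fields (timestamps).
resp_a = {"id": 1, "status": "ok", "served_at": "2026-03-23T18:31:00Z"}
resp_b = {"id": 1, "status": "ok", "served_at": "2026-03-23T18:32:09Z"}

# Comparing full payloads would flake; comparing the contract view is stable.
assert contract_view(resp_a) == contract_view(resp_b)
```

Keeping the stable-field set in one named constant also makes the contract explicit for the next reader.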

Stabilization patterns that work repeatedly

If your team fights flakiness often, a small pattern library pays off.

Pattern | What it replaces | Why it helps
Poll with timeout | fixed sleeps | waits for reality, not for guessed timing
Fake clock | wall clock | removes time zones, DST, and scheduling noise
Deterministic IDs | random UUIDs | allows stable assertions and ordering
Hermetic services | external calls | removes network and third-party uncertainty
Per-test isolation | shared state | prevents test-order and pollution bugs

AI can help you implement these patterns faster by suggesting refactor steps, but the patterns themselves are the real leverage.
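The poll-with-timeout pattern is small enough to sketch in full. The helper below waits for the actual condition instead of a guessed sleep duration; the readiness condition in the usage line is a stand-in for a real one.

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll until the predicate is true or the timeout expires.
    Returns the final truth value so callers can assert on it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one final check at the deadline

# Usage: instead of time.sleep(2) and hoping, wait for the real condition.
ready_at = time.monotonic() + 0.2
became_ready = wait_until(lambda: time.monotonic() >= ready_at, timeout=1.0)
```

Note the use of `time.monotonic` rather than wall-clock time: the wait cannot be disturbed by clock adjustments, which is the fake-clock lesson applied to the helper itself.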

Using AI to accelerate diagnosis

AI is most helpful when it is fed real failure data and asked to propose falsifiable experiments.

Productive uses:

  • Summarize differences between passing and failing logs.
  • Suggest likely nondeterminism sources based on stack traces.
  • Propose instrumentation to reveal races, such as logging state transitions.
  • Draft a minimal reproduction harness that runs the test repeatedly with controlled seeds.
  • Recommend where to replace sleeps with explicit synchronization.

Risky use:

  • letting AI “fix” code without a reproduction and without repeated verification.

Preventing flakiness from returning

Fixing flakiness once is good. Preventing it from returning is better.

Track and budget flakiness

Teams tolerate flakiness when it is invisible.

  • Track flaky tests explicitly.
  • Treat new flakiness as a regression that blocks merging.
  • Quarantine only as a short-lived mitigation, not a permanent state.

Keep the suite layered

When everything is end-to-end, the suite inherits all the nondeterminism of the world.

  • unit tests for pure behavior
  • integration tests for specific boundaries
  • end-to-end smoke tests only for critical flows

This layering gives you confidence without turning your suite into a weather report.

Stabilize the environment

CI is a different machine. If your tests assume a personal laptop, they will fail.

  • pin dependency versions
  • normalize time zones and locales
  • isolate resources per test
  • avoid shared global services
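Time-zone and locale normalization can be done once at suite startup. A sketch, assuming a POSIX environment (`time.tzset` does not exist on Windows, where `TZ` must be set before the process starts):

```python
import os
import time

def normalize_test_environment():
    """Pin the time zone and locale so CI machines and laptops agree."""
    os.environ["TZ"] = "UTC"
    os.environ["LC_ALL"] = "C.UTF-8"
    if hasattr(time, "tzset"):  # POSIX-only; no-op elsewhere
        time.tzset()

normalize_test_environment()
# The Unix epoch rendered in local time is now midnight everywhere.
stamp = time.strftime("%H:%M", time.localtime(0))
```

Calling this from a single shared entry point (a conftest or suite bootstrap) keeps individual tests from each inventing their own environment.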

A practical flaky-test checklist

  • Do we know the flakiness family?
  • Can we reproduce it by running the test repeatedly?
  • Have we eliminated time, randomness, and sleeps?
  • Is state isolated and cleaned up?
  • Are assertions aligned with contracts rather than implementation details?
  • Did we add a regression guard so the same pattern cannot return?

Flakiness is solvable. It is solved by making uncertainty visible, then removing nondeterminism until the test becomes a reliable witness again.

Keep Exploring AI Systems for Engineering Outcomes

AI Unit Test Generation That Survives Refactors
https://ai-rng.com/ai-unit-test-generation-that-survives-refactors/

Integration Tests with AI: Choosing the Right Boundaries
https://ai-rng.com/integration-tests-with-ai-choosing-the-right-boundaries/

AI Debugging Workflow for Real Bugs
https://ai-rng.com/ai-debugging-workflow-for-real-bugs/

Root Cause Analysis with AI: Evidence, Not Guessing
https://ai-rng.com/root-cause-analysis-with-ai-evidence-not-guessing/

AI Test Data Design: Fixtures That Stay Representative
https://ai-rng.com/ai-test-data-design-fixtures-that-stay-representative/
