---
name: test-flakiness
description: "Detect non-deterministic (flaky) tests by reading CI run logs or test result history. Aggregates pass rates per test, identifies intermittent failures, recommends quarantine or fix, and maintains a flaky test registry. Best run during Polish phase or after multiple CI runs."
argument-hint: "[ci-log-path | scan | registry]"
user-invocable: true
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
---

# Test Flakiness Detection

A flaky test is one that sometimes passes and sometimes fails without any code change. In some ways flaky tests are worse than no tests: they train the team to ignore red CI runs, masking genuine failures. This skill identifies them, explains likely causes, and recommends whether to quarantine or fix each one.

**Output:** Updated `tests/regression-suite.md` quarantine section + optional `production/qa/flakiness-report-[date].md`

**When to run:**
- Polish phase (tests have had many runs; statistical signal is reliable)
- When developers start dismissing CI failures as "probably flaky"
- After `/regression-suite` identifies quarantined tests that need diagnosis

---

## 1. Parse Arguments

**Modes:**
- `/test-flakiness [ci-log-path]` — analyse a specific CI run log file
- `/test-flakiness scan` — scan all available CI logs in `.github/` or standard log output directories
- `/test-flakiness registry` — read the existing regression-suite.md quarantine section and provide remediation guidance for already-known flaky tests
- No argument — auto-detect: run `scan` if CI logs are accessible, else `registry`

---

## 2. Locate CI Log Data

### Option A — GitHub Actions (preferred)

Check for test result artifacts:

```bash
ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null
```

For Godot projects: GdUnit4 outputs XML results compatible with JUnit format. Check `test-results/` for `.xml` files.

For Unity projects: the game-ci test runner outputs NUnit XML to `test-results/` by default.

For Unreal projects: automation logs go to `Saved/Logs/`. Grep for `Result: Success` and `Result: Fail` patterns.

### Option B — Local log files

If a path argument is provided, read that file directly.

### Option C — No log data available

If no logs are found:

> "No CI log data found. To detect flaky tests, this skill needs test result
> history from multiple runs. Options:
> 1. Run the test suite at least 3 times and collect the output logs
> 2. Check CI pipeline output and save a log to `test-results/`
> 3. Run `/test-flakiness registry` to review tests already flagged as flaky
>    in `tests/regression-suite.md`"

Stop and ask the user which option to pursue.

---

## 3. Parse Test Results

For each CI log or result file found, parse:

**JUnit XML format** (GdUnit4 / Unity):
- Grep for `<testcase` elements and record each test's `name` and `classname`
- A nested `<failure>` or `<error>` element marks that test as failed for the run; otherwise count it as a pass

Track each test across all runs: how many times it ran, and how many times it failed.
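A minimal sketch of how this parse-and-aggregate step can look, assuming one JUnit-style XML file per run has been collected under `test-results/` (the file naming and the script itself are illustrative, not part of the skill's contract); the fail rates it prints feed the classification in the next section:

```python
# Sketch only: aggregate per-test pass/fail counts across JUnit-style XML result files.
# Assumes one result file per CI run under test-results/ (see section 2).
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

runs_per_test = defaultdict(int)
fails_per_test = defaultdict(int)

for path in sorted(glob.glob("test-results/*.xml")):
    for case in ET.parse(path).getroot().iter("testcase"):
        name = f"{case.get('classname')}.{case.get('name')}"
        runs_per_test[name] += 1
        # A <failure> or <error> child means the test failed in this run.
        if case.find("failure") is not None or case.find("error") is not None:
            fails_per_test[name] += 1

for name, runs in sorted(runs_per_test.items()):
    fails = fails_per_test[name]
    if 0 < fails < runs:  # intermittent: neither always passing nor always failing
        print(f"{name}: failed {fails}/{runs} runs ({fails / runs:.0%})")
```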
---

## 4. Classify Flakiness

Aggregate the pass rate for each test across all runs, then classify by fail rate:

- **High flakiness**: Fails in >25% of runs — quarantine immediately
- **Moderate flakiness**: Fails in 5–25% of runs — investigate and fix soon
- **Low/suspected flakiness**: Fails in 1–5% of runs — monitor; may be a genuinely rare failure

For each flaky test, classify the likely cause:

### Cause classification

| Cause | Symptoms | Fix direction |
|-------|----------|---------------|
| **Timing / async** | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| **Order dependency** | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| **Random seed** | Fails intermittently with no pattern; involves RNG | Pass an explicit seed; don't use `randf()` in tests |
| **Resource leak** | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| **External state** | Fails when a file, scene, or global exists from a prior test | Isolate the test from the file system; use in-memory mocks |
| **Floating point** | Fails on comparisons like `== 0.5` | Use epsilon comparison (`is_equal_approx`, `Assert.AreApproximatelyEqual`) |
| **Scene/prefab load race** | Fails when scenes are not yet ready | Await one frame after instantiation; use `await get_tree().process_frame` |

Use Grep to check the test file for timing calls, `randf()`, global state access, or equality comparisons on floats to narrow down the cause.

---

## 5. Recommend Action

For each flaky test:

**Quarantine (High flakiness):**
> "Quarantine this test immediately. Disable it in CI by adding a
> `@pytest.mark.skip` / `[Ignore]` / `GdUnitSkip` annotation. Log it in
> the `tests/regression-suite.md` quarantine section. The test is now opt-in only.
> Fix the root cause before removing the quarantine."

**Investigate and fix soon (Moderate):**
> "This test is intermittently unreliable. The root cause appears to be [cause].
> Suggested fix: [specific fix based on cause classification]. Do not quarantine
> yet — fix the test directly."

**Monitor (Low/suspected):**
> "This test shows suspected flakiness. Collect more run data before
> quarantining. Note it as 'suspected' in the regression suite."

---

## 6. Generate Reports

### In-conversation summary

```
## Flakiness Detection Results

**Runs analysed**: [N]
**Tests tracked**: [N]

### Flaky Tests Found

| Test | System | Fail Rate | Likely Cause | Recommendation |
|------|--------|-----------|--------------|----------------|
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |

### Clean Tests (no flakiness detected)

[N] tests ran across [N] runs with consistent results — no flakiness detected.

### Data Limitations

[Note if fewer than 5 runs were available — fewer runs = less statistical confidence]
```

---

## 7. Update Regression Suite + Optional Report File

Ask: "May I update the quarantine section of `tests/regression-suite.md` with the flaky tests found?"

If yes: use `Edit` to append entries to the Quarantined Tests table. Never remove existing quarantine entries — only add new ones.

Ask (separately): "May I write a full flakiness report to `production/qa/flakiness-report-[date].md`?"

The full report includes per-test analysis with cause details and engine-specific fix snippets.

After writing:
- For each quarantined test: "Add the engine-specific skip annotation to disable this test in CI. Re-enable it after the root cause is fixed."
- For fix-eligible tests: "The fix for [test] is straightforward — change the equality comparison on line [N] to use `is_equal_approx`." (see the sketch below)
- Summary: "Once all quarantine annotations are applied, CI should run green. Schedule fix work for the [N] quarantined tests before the release gate."
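To illustrate the float-comparison fix in a language-neutral way, the sketch below uses Python's `math.isclose` as an analogue of Godot's `is_equal_approx`; the function under test and its values are purely hypothetical:

```python
import math

def scaled_damage(base: float, multiplier: float) -> float:
    # Hypothetical function under test, only here to illustrate the fix.
    return base * multiplier

def test_scaled_damage_fragile():
    # Fragile: exact float equality. 0.1 * 3 evaluates to 0.30000000000000004,
    # so this assertion fails even though the behaviour is correct.
    assert scaled_damage(0.1, 3) == 0.3

def test_scaled_damage_fixed():
    # Fixed: epsilon comparison, the analogue of is_equal_approx /
    # Assert.AreApproximatelyEqual. This passes reliably.
    assert math.isclose(scaled_damage(0.1, 3), 0.3, rel_tol=1e-9)
```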
---

## Collaborative Protocol

- **Never delete test files** — quarantine means annotate + list, not remove (see the annotation sketch below)
- **Statistical confidence matters** — with < 3 runs, flag findings as "suspected" not "confirmed"; ask if more run data is available
- **Fix is always the goal** — quarantine is temporary; surface the fix direction even when recommending quarantine
- **Ask before writing** — both the regression-suite update and the report file require explicit approval. On write: Verdict: **COMPLETE** — flakiness report written. On decline: Verdict: **BLOCKED** — user declined write.
- **Flakiness in CI is a team problem** — surface the list and recommended actions clearly; do not quarantine tests silently without the team knowing
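For reference, this is what annotate-and-list can look like in practice. It is a pytest-style sketch only (the test name and reason text are invented for illustration); the `[Ignore]` and GdUnit equivalents follow the same pattern of keeping the test in the file, disabled and logged:

```python
import pytest

# Quarantined flaky test: it stays in the suite, is skipped in CI, and is logged
# in tests/regression-suite.md. Remove the marker once the root cause is fixed.
@pytest.mark.skip(reason="flaky (timing): quarantined, see tests/regression-suite.md")
def test_wave_spawner_emits_spawn_signal():
    ...
```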