---
name: test-evidence-review
description: "Quality review of test files and manual evidence documents. Goes beyond existence checks — evaluates assertion coverage, edge case handling, naming conventions, and evidence completeness. Produces an ADEQUATE/INCOMPLETE/MISSING verdict per story. Run before QA sign-off or on demand."
argument-hint: "[story-path | sprint | system-name]"
user-invocable: true
allowed-tools: Read, Glob, Grep, Write
---

# Test Evidence Review

`/smoke-check` verifies that test files **exist** and **pass**. This skill goes further — it reviews the **quality** of those tests and evidence documents. A test file that exists and passes may still leave critical behaviour uncovered. A manual evidence doc that exists may lack the sign-offs required for closure.

**Output:** Summary report (in conversation) + optional `production/qa/evidence-review-[date].md`

**When to run:**

- Before QA hand-off sign-off (`/team-qa` Phase 5)
- On any story where test quality is in question
- As part of a milestone review, to audit Logic and Integration story quality

---

## 1. Parse Arguments

**Modes:**

- `/test-evidence-review [story-path]` — review a single story's evidence
- `/test-evidence-review sprint` — review all stories in the current sprint
- `/test-evidence-review [system-name]` — review all stories in an epic/system
- No argument — ask which scope: "Single story", "Current sprint", "A system"

---

## 2. Load Stories in Scope

Based on the argument:

**Single story**: Read the story file directly. Extract: Story Type, Test Evidence section, story slug, system name.

**Sprint**: Read the most recently modified file in `production/sprints/`. Extract the list of story file paths from the sprint plan. Read each story file.

**System**: Glob `production/epics/[system-name]/story-*.md`. Read each.

For each story, collect:

- `Type:` field (Logic / Integration / Visual/Feel / UI / Config/Data)
- `## Test Evidence` section — the stated expected test file path or evidence doc
- Story slug (from the file name)
- System name (from the directory path)
- Acceptance Criteria list (all checkbox items)

---

## 3. Locate Evidence Files

For each story, find the evidence:

**Logic stories**: Glob `tests/unit/[system]/[story-slug]_test.*`
- If not found, also try: Grep in `tests/unit/[system]/` for files containing the story slug

**Integration stories**: Glob `tests/integration/[system]/[story-slug]_test.*`
- Also check `production/session-logs/` for playtest records mentioning the story

**Visual/Feel and UI stories**: Glob `production/qa/evidence/[story-slug]-evidence.*`

**Config/Data stories**: Glob `production/qa/smoke-*.md` (any smoke check report)

Note what was found (path) or not found (gap) for each story.

---

## 4. Review Automated Test Quality (Logic / Integration)

For each test file found, read it and evaluate:

### Assertion coverage

Count the number of distinct assertions (lines containing `assert`, `expect`, `check`, `verify`, or engine-specific assertion patterns). A low assertion count is a quality signal — a test function that makes only one assertion may not cover the full range of expected behaviour.
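The skill performs this count with Grep; purely as an illustration of the heuristic, here is a minimal sketch for Python-style test files. The regexes and the `test_` naming convention are assumptions — other engines and frameworks will need different patterns.

```python
import re
from pathlib import Path

# Hypothetical sketch of the per-function assertion count heuristic.
# Matches bare `assert` statements plus assertion-style calls such as
# assertEqual(...), expect(...), check(...), verify(...).
ASSERTION_RE = re.compile(r"\b(assert|expect|check|verify)\w*\s*\(|\bassert\b")
TEST_FN_RE = re.compile(r"^\s*def (test_\w+)")

def assertion_counts(test_file: str) -> dict[str, int]:
    """Return an approximate assertion count per test function."""
    counts: dict[str, int] = {}
    current = None
    for line in Path(test_file).read_text().splitlines():
        m = TEST_FN_RE.match(line)
        if m:
            current = m.group(1)
            counts[current] = 0
        elif current and ASSERTION_RE.search(line):
            counts[current] += 1
    return counts
```

Any function that ends up at zero is the vacuous-pass case flagged in the thresholds below.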
Thresholds:

- **3+ assertions per test function** → normal
- **1-2 assertions per test function** → note as potentially thin
- **0 assertions** (test exists but no asserts) → flag as BLOCKING — the test passes vacuously and proves nothing

### Edge case coverage

For each acceptance criterion in the story that contains a number, threshold, or "when X happens" conditional, check whether a test function name or test body references that specific case.

Heuristics:

- Grep the test file for "zero", "max", "null", "empty", "min", "invalid", "boundary", "edge" — the presence of any is a positive signal
- If the story has a Formulas section with specific bounds: check whether tests exercise the minimum and maximum values

### Naming quality

Test function names should describe the scenario and the expected result. Pattern: `test_[scenario]_[expected_outcome]`

Flag generically named functions (`test_1`, `test_run`, `testBasic`) as **naming issues** — they make failures harder to diagnose.

### Formula traceability

For Logic stories where the GDD has a Formulas section: check that the test file contains at least one test whose name or comment references the formula name or a formula value. A test that exercises a formula without mentioning it by name is harder to maintain when the formula changes.

---

## 5. Review Manual Evidence Quality (Visual/Feel / UI)

For each evidence document found, read it and evaluate:

### Criterion linkage

The evidence doc should reference each acceptance criterion from the story. Check: does the evidence doc contain each criterion (or a clear rephrasing)? Any criterion missing from the doc was never verified.

### Sign-off completeness

Check for three sign-off lines (or equivalent fields):

- Developer sign-off
- Designer / art-lead sign-off (for Visual/Feel)
- QA lead sign-off

If any are missing or blank, flag as INCOMPLETE — the story cannot be fully closed without all required sign-offs.

### Screenshot / artefact completeness

For Visual/Feel stories: check whether screenshot file paths are referenced in the evidence doc. If referenced, Glob for them to confirm they exist.

For UI stories: check whether a walkthrough sequence (a step-by-step interaction log) is present.

### Date coverage

The evidence doc should carry a date. If the date is earlier than the story's last major change (heuristic: compare against the sprint start date from the sprint plan), flag as POTENTIALLY STALE — the evidence may not cover the final implementation.

---

## 6. Build the Review Report

For each story, assign a verdict:

| Verdict | Meaning |
|---------|---------|
| **ADEQUATE** | Test/evidence exists, passes quality checks, all criteria covered |
| **INCOMPLETE** | Test/evidence exists but has quality gaps (thin assertions, missing sign-offs) |
| **MISSING** | No test or evidence found for a story type that requires it |

The overall sprint/system verdict is the worst story verdict present.
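To make the aggregation rule concrete, a hypothetical sketch (the skill simply takes the worst verdict; the `Verdict` type is an assumption for illustration):

```python
from enum import IntEnum

# Higher value = worse verdict, so the overall verdict is the maximum.
class Verdict(IntEnum):
    ADEQUATE = 0
    INCOMPLETE = 1
    MISSING = 2

def overall_verdict(story_verdicts: list[Verdict]) -> Verdict:
    return max(story_verdicts, default=Verdict.ADEQUATE)
```

A single MISSING story therefore drags the whole sprint or system to MISSING. The report is then rendered from the template below.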
```markdown
## Test Evidence Review

> **Date**: [date]
> **Scope**: [single story path | Sprint [N] | [system name]]
> **Stories reviewed**: [N]
> **Overall verdict**: ADEQUATE / INCOMPLETE / MISSING

---

### Story-by-Story Results

#### [Story Title] — [Type] — [ADEQUATE/INCOMPLETE/MISSING]

**Test/evidence path**: `[path]` (found) / (not found)

**Automated test quality** *(Logic/Integration only)*:
- Assertion coverage: [N per function on average] — [adequate / thin / none]
- Edge cases: [covered / partial / not found]
- Naming: [consistent / [N] generic names flagged]
- Formula traceability: [yes / no — formula names not referenced in tests]

**Manual evidence quality** *(Visual/Feel/UI only)*:
- Criterion linkage: [N/M criteria referenced]
- Sign-offs: [Developer ✓ | Designer ✗ | QA Lead ✗]
- Artefacts: [screenshots present / missing / N/A]
- Freshness: [dated [date] — current / potentially stale]

**Issues**:
- BLOCKING: [description] *(prevents story-done)*
- ADVISORY: [description] *(should fix before release)*

---

### Summary

| Story | Type | Verdict | Issues |
|-------|------|---------|--------|
| [title] | Logic | ADEQUATE | None |
| [title] | Integration | INCOMPLETE | Thin assertions (avg 1.2/function) |
| [title] | Visual/Feel | INCOMPLETE | QA lead sign-off missing |
| [title] | Logic | MISSING | No test file found |

**BLOCKING items** (must resolve before story can be closed): [N]
**ADVISORY items** (should address before release): [N]
```

---

## 7. Write Output (Optional)

Present the report in conversation. Ask: "May I write this test evidence review to `production/qa/evidence-review-[date].md`?"

This is optional — the report is useful standalone. Write only if the user wants a persistent record.

After the report:

- For BLOCKING items: "These must be resolved before `/story-done` can mark the story Complete. Would you like to address any of them now?"
- For thin assertions: "Consider running `/test-helpers [system]` to see scaffolded assertion patterns for common cases."
- For missing sign-offs: "Manual sign-off is required from [role]. Share `[evidence-path]` with them to complete sign-off."

Verdict: **COMPLETE** — evidence review finished. Use CONCERNS if BLOCKING items were found.

---

## Collaborative Protocol

- **Report quality issues, do not fix them** — this skill reads and evaluates; it does not modify test files or evidence documents
- **ADEQUATE means adequate for shipping, not perfect** — avoid nitpicking tests that are functioning and comprehensive enough to give confidence
- **BLOCKING vs. ADVISORY distinction is important** — only flag BLOCKING when the gap leaves a story criterion genuinely unverified
- **Ask before writing** — the report file is optional; always confirm before writing