Categorizing False Positives in Automated Scan Results

A false positive is a finding the engine reports as a WCAG failure that is not actually perceivable by assistive technology — a contrast rule tripping on a skeleton loader still at opacity: 0, a landmark rule firing on a synthetic framework wrapper, a label rule evaluating a node that an ancestor’s aria-hidden="true" has already removed from the accessibility tree. This page resolves one narrow question: how to classify each of those findings deterministically so genuine regressions still block a merge while phantom violations are suppressed with an audit trail rather than a blanket rule disable.

It is the detail companion to the error categorization and triage pipelines guide — where that page defines the four-stage transform over an aggregated dataset, this one owns the false-positive scoring stage specifically — and both sit inside the broader automated scanning and dynamic content ingestion strategy for auditing client-rendered applications.

When This Applies

False-positive scoring only earns its complexity above a certain scale. On a ten-page marketing site you can eyeball the report. On an enterprise crawl emitting tens of thousands of raw violations per cycle, a double-digit false-positive rate is the difference between a trusted quality signal and a muted Slack channel. The technique below is relevant when three conditions hold at once: the target is a single-page application that hydrates after the initial fetch, the scan runs unattended in CI, and the same rule keeps flagging elements that a manual audit confirms are compliant.

At scale, false positives concentrate around three rendering-lifecycle friction points, and knowing which one you are looking at determines the fix:

Transient DOM evaluation. The engine captures a snapshot before CSS transitions, lazy-loading, or virtualized lists stabilize, so it scores a mid-flight state the user never sees. Correct dynamic content boundary detection reduces this class at the source, but never eliminates it.
Framework abstraction layers. React portals, Angular host bindings, and Vue transition wrappers inject synthetic nodes that lack an explicit role or aria-* attribute, tripping landmark or naming rules on containers that own no content.
Contextual heuristic gaps. Rules like color-contrast and label evaluate an isolated node without accounting for a parent-level style override, an SVG fallback, or aria-hidden state propagating down from an ancestor.

Blindly disabling the offending rules degrades audit coverage across every route. The goal instead is to map each misclassification to its lifecycle cause and suppress it only where the context proves it is safe.

Minimal Reproducible Example

The problem is easiest to see with a naive triage step that tickets every node of every violation straight out of an axe-core report:

import json

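# results.json is written after page.evaluate() runs axe.run() in the browser.
with open("results.json", encoding="utf-8") as fh:
    report = json.load(fh)

# create_ticket() stands in for whatever tracker integration the pipeline uses.
for violation in report["violations"]: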
    for node in violation["nodes"]:
        create_ticket(violation["id"], node["target"])  # tickets EVERY node

Feed it a report containing this finding and it opens a ticket for a color-contrast failure that no user can perceive, because the flagged element is a skeleton placeholder mid-transition:

{
  "id": "color-contrast",
  "impact": "serious",
  "nodes": [{
    "target": [".card-skeleton__line"],
    "html": "<div class=\"card-skeleton__line\" aria-hidden=\"true\"></div>",
    "failureSummary": "Element has insufficient color contrast of 1.4:1"
  }]
}

The element carries aria-hidden="true" and, in the live DOM, sits at opacity: 0 until hydration swaps in real content. It is invisible to both sighted users and screen readers, yet the engine, which snapshotted the DOM a few milliseconds too early, scored it as a serious failure. The naive loop cannot tell this apart from a genuine 1.4:1 contrast defect on visible body text. That is the gap the scorer closes.

Classifying a Finding as Real or False

Classification is a chain of contextual checks, not a single heuristic. The decision tree traces one flagged violation through successive tests: is it on an aria-hidden or off-screen node, is it a synthetic framework wrapper, and does its confidence score clear the blocking threshold. Only findings that survive every check are confirmed as real and routed to an owner; everything else is either suppressed with a recorded reason or sent to a reviewed quarantine.

The one non-negotiable rule in this tree: quarantine is a reviewed destination, never a silent delete. An accessibility program has to be able to prove why a finding was not actioned, and a discarded record cannot be reviewed.

Correct Implementation

Two things make the scorer reliable: it never re-opens a browser, and it never guesses. The DOM context every check needs — computed visibility, opacity, bounding box, ancestor aria-hidden — is captured once, inside the page, at the moment the scan runs, and travels with each node. This enrichment step belongs in the Playwright headless scanning workflows layer, immediately after axe.run() resolves and before the browser closes:

// Runs inside page.evaluate() right after axe finishes.
// Attaches the computed context each node needs so the Python scorer
// downstream never has to re-open a browser to make a decision.
function enrichNodes(results) {
  for (const violation of results.violations) {
    for (const node of violation.nodes) {
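      // A single string target is a plain CSS selector in this document.
      // Multi-item targets cross frames and nested arrays point into shadow
      // DOM, so neither can be resolved here; mark them as missing instead.
      const sel = node.target[0];
      const resolvable = node.target.length === 1 && typeof sel === 'string';
      const el = resolvable ? document.querySelector(sel) : null;
      if (!el) { node.context = { missing: true }; continue; }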
      const cs = getComputedStyle(el);
      const box = el.getBoundingClientRect();
      node.context = {
        aria_hidden: el.closest('[aria-hidden="true"]') !== null,
        hidden: cs.visibility === 'hidden' || cs.display === 'none',
        opacity_zero: parseFloat(cs.opacity) === 0,
        offscreen: box.bottom < 0 || box.right < 0 ||
                   box.top > innerHeight || box.left > innerWidth,
      };
    }
  }
  return results;
}
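Before the enriched report can be scored, each node's context has to be turned back into a typed object on the Python side. The following is a minimal sketch of that step, assuming the report was serialized as JSON with the context object produced by enrichNodes. The helper name context_from_node is hypothetical, and the NodeContext fields simply mirror the keys the enrichment step writes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeContext:
    aria_hidden: bool
    hidden: bool
    opacity_zero: bool
    offscreen: bool
    selector: str


def context_from_node(node: dict) -> Optional[NodeContext]:
    """Build a NodeContext from one enriched axe node, or None if unresolved."""
    raw = node.get("context") or {}
    target = node.get("target") or []
    # No context, or the element was gone at enrichment time: nothing to score.
    if not raw or raw.get("missing") or not target:
        return None
    return NodeContext(
        aria_hidden=bool(raw.get("aria_hidden", False)),
        hidden=bool(raw.get("hidden", False)),
        opacity_zero=bool(raw.get("opacity_zero", False)),
        offscreen=bool(raw.get("offscreen", False)),
        selector=str(target[0]),
    )
```

Returning None for an unresolved node is a deliberate choice in this sketch: without captured context there is no evidence that a suppression is safe, so the caller can ticket or quarantine the finding instead of guessing.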

The Python scorer then reduces that context to a single probability that the finding is a false positive. Each branch encodes one of the three lifecycle causes, and every non-zero score is explainable to a reviewer:

from dataclasses import dataclass

# Framework containers that render structure but carry no semantic content.
SYNTHETIC_WRAPPER_HINTS = ("react-portal", "ng-host", "vue-transition", "cdk-overlay")

# Rules that are only meaningful on a visible, settled node.
VISIBILITY_SENSITIVE = {"color-contrast", "link-name", "label", "image-alt"}


@dataclass
class NodeContext:
    aria_hidden: bool    # element or an ancestor carries aria-hidden="true"
    hidden: bool         # computed visibility:hidden or display:none
    opacity_zero: bool   # computed opacity == 0 (skeletons mid-transition)
    offscreen: bool      # bounding box entirely outside the layout viewport
    selector: str


def false_positive_score(rule_id: str, ctx: NodeContext) -> float:
    """P(false positive) in [0.0, 1.0]; higher means safer to suppress."""
    # A node the AT will never reach cannot fail a perceivability rule.
    if ctx.aria_hidden or ctx.hidden:
        return 1.0
    # Skeleton loaders sit at opacity:0 during hydration; contrast is undefined.
    if rule_id in VISIBILITY_SENSITIVE and (ctx.opacity_zero or ctx.offscreen):
        return 0.9
    # Synthetic wrappers trip role/landmark rules but own no content.
    if any(hint in ctx.selector for hint in SYNTHETIC_WRAPPER_HINTS):
        return 0.75
    return 0.0  # nothing suppressible — treat as a real violation

Routing applies the same 0.85 threshold the decision tree uses, with a middle band that goes to humans rather than to either extreme:

BLOCK_SUPPRESSION = 0.85   # at or above: add to version-controlled baseline
QUARANTINE_FLOOR = 0.50    # in [0.50, 0.85): reviewed, never dropped


def route(rule_id: str, ctx: NodeContext, impact: str) -> str:
    score = false_positive_score(rule_id, ctx)
    if impact == "critical":
        return "ticket"                      # never let a score demote a blocker
    if score >= BLOCK_SUPPRESSION:
        baseline.add(ctx.selector, rule_id)  # suppress, with the reason recorded
        return "suppressed"
    if score >= QUARANTINE_FLOOR:
        quarantine.enqueue(rule_id, ctx)     # human review
        return "quarantine"
    return "ticket"                          # real violation -> owner
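The route function refers to two collaborators, baseline and quarantine, that are not defined in the snippet. For local runs and unit tests they could be stubbed with in-memory stand-ins along these lines. The class names InMemoryBaseline and InMemoryQuarantine are hypothetical, and a real pipeline would presumably persist both to version control or a review queue rather than keep them in memory.

```python
from datetime import datetime, timezone


class InMemoryBaseline:
    """Hypothetical stand-in for the version-controlled suppression baseline."""

    def __init__(self):
        self.entries = {}

    def add(self, selector, rule_id):
        # Record when the suppression happened so the entry can be audited later.
        self.entries[(selector, rule_id)] = datetime.now(timezone.utc).isoformat()


class InMemoryQuarantine:
    """Hypothetical stand-in for the human-review queue."""

    def __init__(self):
        self.pending = []

    def enqueue(self, rule_id, ctx):
        # Keep the full context so a reviewer can see why the finding was held.
        self.pending.append({"rule_id": rule_id, "context": ctx})


# route() looks these up as module-level names.
baseline = InMemoryBaseline()
quarantine = InMemoryQuarantine()

# Example calls with the same argument shapes route() uses.
baseline.add(".card-skeleton__line", "color-contrast")
quarantine.enqueue("region", {"selector": "#react-portal-root"})
```

Neither stand-in ever deletes a record, which keeps the sketch consistent with the rule that quarantine is a reviewed destination and not a discard pile.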

The impact == "critical" guard is deliberate and belongs before the score check: a label or image-alt failure, both Level A criteria that axe-core rates critical, must never be scored out of routing, no matter how the visibility heuristics read. That single line is what keeps aggressive suppression from silently dropping a conformance blocker.

Common false-positive classes and their fixes

The scorer handles the general case; these are the specific patterns that dominate enterprise reports, with the source-level fix that stops each one from being generated in the first place. Structural exclusions you never own belong in the axe-core enterprise configuration exclude scope so they cost nothing downstream, while context-dependent cases stay in the scorer where a reviewer can audit them.

Violation class	Typical trigger	Root cause	Resolution pattern
`color-contrast`	Off-screen tooltips, skeleton loaders	Engine evaluates hidden DOM before `opacity: 0` or `display: none` applies	Add `aria-hidden="true"` to transient elements; `exclude` skeleton selectors, or post-filter by re-checking computed opacity/visibility before ticketing
`aria-allowed-role`	Framework wrapper divs	Synthetic containers inherit an implicit role that conflicts with an explicit `role`	Remove redundant `role` declarations on hosts; `exclude` verified architectural wrappers
`duplicate-id`	SSR hydration mismatches	Client rehydration emits duplicate `id` attributes before React reconciles	Use `useId()` (React 18+) or UUIDs; defer the audit until hydration completes
`landmark-one-main`	Component-level route scans	Partial DOM snapshots lack `<main>` context	Scope `axe.run()` to the route container via `context.include` rather than the document
`focus-trap`	Custom modal portals	Engine misreads `tabindex="-1"` on a backdrop as focusable	Give the backdrop `aria-hidden="true"` and `inert`; verify focus order with `page.keyboard.press()`
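Several resolutions in the table come down to the context argument of axe.run(), which accepts include and exclude lists of selectors. The sketch below builds that argument in Python and renders it as a JavaScript expression that a browser driver could evaluate in the page. The helper names and the #route-root selector are assumptions made for illustration, and the expression assumes axe-core has already been injected into the page.

```python
import json


def build_axe_context(include=(), exclude=()):
    """Return the context object for axe.run(context), or None for a whole-document run."""
    context = {}
    if include:
        # Each inner list is one selector path; a single item targets the top frame.
        context["include"] = [[selector] for selector in include]
    if exclude:
        context["exclude"] = [[selector] for selector in exclude]
    return context or None


def axe_run_expression(context):
    """Render a JavaScript expression to evaluate inside the page."""
    if context is None:
        return "axe.run()"
    # json.dumps also escapes quotes inside attribute selectors.
    return f"axe.run({json.dumps(context)})"


ctx = build_axe_context(include=["#route-root"], exclude=[".card-skeleton__line"])
expression = axe_run_expression(ctx)
```

Keeping the selector lists in Python means structural exclusions can live in one reviewed configuration file instead of being scattered through inline script strings.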

Pipeline Integration

In a full run, this scorer is a pure transform sandwiched between deterministic classification and the routing gate. It opens no browsers and mutates no DOM, so it scales, fails, and replays independently of the crawl, and it can be unit-tested against golden fixtures of enriched findings. Its 0.85 output threshold is the same confidence number the CI/CD threshold gating strategy uses to decide what blocks a pull request. Only findings that the scorer confirms as real (score below the quarantine floor) and that the engine marks critical hard-fail the build. Suppressions land in a version-controlled baseline that CI diffs each run, so a newly suppressed selector shows up in code review. Quarantined findings post as non-blocking PR comments. Because the enrichment happens once at scan time and the scoring is a cheap lookup, adding it to an existing pipeline that already streams through the batch validation architecture costs a few milliseconds per finding, not another crawl.
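The gating rules described above can be sketched as a small pure function over already-scored findings. This is an illustration under stated assumptions, not the gating strategy's actual code: ScoredFinding, blocks_build, pr_comments, and exit_code are hypothetical names, and the sketch follows the prose literally by hard-failing only on findings that are both critical and below the quarantine floor.

```python
from dataclasses import dataclass

BLOCK_SUPPRESSION = 0.85
QUARANTINE_FLOOR = 0.50


@dataclass
class ScoredFinding:
    rule_id: str
    selector: str
    impact: str    # axe impact: minor, moderate, serious, or critical
    score: float   # false-positive score produced by the scorer


def blocks_build(finding: ScoredFinding) -> bool:
    # Hard-fail only on findings confirmed real and marked critical by the engine.
    return finding.impact == "critical" and finding.score < QUARANTINE_FLOOR


def pr_comments(findings):
    # Quarantined findings surface as non-blocking review comments.
    return [
        f"[needs review] {f.rule_id} at {f.selector} (score {f.score:.2f})"
        for f in findings
        if f.impact != "critical" and QUARANTINE_FLOOR <= f.score < BLOCK_SUPPRESSION
    ]


def exit_code(findings) -> int:
    # A non-zero exit code is what fails the CI job.
    return 1 if any(blocks_build(f) for f in findings) else 0
```

Because the function takes plain data and returns plain data, it can be replayed against a stored report without re-running the crawl.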

Gotchas

Authenticated and multi-tenant routes shift the baseline. A selector that is a safe suppression behind a logged-out marketing route may map to a real, content-bearing component once a tenant’s data hydrates it. Key baseline entries on route template plus selector, not selector alone, or a suppression approved on one tenant will silently mask a genuine failure on another.
Viewport variance flips the offscreen check. An element outside the layout viewport at 1280px can be on-screen at 375px, so a false positive scored at desktop width becomes a real, visible violation at mobile width. Capture NodeContext per emulated viewport and never reuse a suppression across breakpoints.
A stale baseline hides regressions. Suppression entries pinned to a component version go stale the moment that component ships a real defect at the same selector. Expire baseline entries on the component hash they were approved against, and re-quarantine — rather than auto-suppress — any entry whose surrounding markup has changed since approval.
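The three gotchas above suggest a baseline key that is wider than the selector alone. A minimal sketch, assuming the key combines route template, emulated viewport width, rule, and selector, and assuming a SHA-256 of the surrounding markup stands in for the component hash. All names here, including the example route template, are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineKey:
    route_template: str   # for example "/tenants/:id/dashboard", not a concrete URL
    viewport_width: int   # suppressions are never shared across breakpoints
    rule_id: str
    selector: str


def markup_hash(html: str) -> str:
    # Stand-in for a component hash: any change to the markup changes the digest.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


class KeyedBaseline:
    def __init__(self):
        self._approved = {}  # BaselineKey -> hash of the markup that was approved

    def approve(self, key: BaselineKey, html: str) -> None:
        self._approved[key] = markup_hash(html)

    def decide(self, key: BaselineKey, html: str) -> str:
        approved = self._approved.get(key)
        if approved is None:
            return "unapproved"   # no suppression exists for this exact context
        if approved != markup_hash(html):
            return "quarantine"   # markup changed since approval: re-review
        return "suppressed"
```

With this shape, a suppression approved at desktop width does not apply at mobile width, and any markup change sends the finding back to a reviewer instead of silencing it.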

Frequently Asked Questions

Why re-check computed visibility instead of just disabling the noisy rule?

Disabling color-contrast to silence skeleton-loader noise turns off contrast checking on every visible element across every route, which is exactly the class of defect the audit exists to catch. Re-checking computed opacity and aria-hidden per node suppresses only the specific findings that are provably imperceptible and leaves the rule fully active everywhere else.

Should suppressed false positives live in the engine config or the pipeline?

Both, at different layers. Structural exclusions you never own — third-party iframes, analytics widgets, chart libraries — belong in the engine’s exclude scope so they never generate a finding. Context-dependent suppressions that require judgment belong in the pipeline’s scorer, where the decision is auditable, reversible, and held in quarantine for review rather than hard-coded in worker code where it escapes review.

What confidence threshold should actually block CI?

Bind the blocking tier to your conformance target, not a raw count. In the reference implementation, findings the scorer confirms as real (false-positive score below the quarantine floor) and the engine marks critical hard-fail; everything between the floor and 0.85 is quarantined for human review; and 0.85 or above is suppressed into a version-controlled baseline. The critical-impact guard runs first so no score can demote a genuine blocker.

How do I stop a suppression from hiding a real regression later?

Pin every baseline entry to the component hash it was approved against and diff the baseline on each run. When the markup around a suppressed selector changes, re-route the finding through quarantine instead of auto-suppressing it, so a human re-confirms the suppression against the new DOM before it silences the finding again.
