Running Playwright Accessibility Checks in CI/CD

To turn a Playwright accessibility scan into a deployment gate that a team actually trusts, wrap the scan in a script that serializes every result to an artifact, counts violations by impact level, and exits non-zero only when critical or serious findings cross a committed threshold — never on the raw pass/fail of axe.run() alone. This page resolves the specific failure where a CI accessibility step is either so noisy it gets disabled or so lenient it blocks nothing, and shows the exact gating script and GitHub Actions job that fix it.

This is the CI/CD implementation reference within the broader Playwright Headless Scanning Workflows strategy, which itself sits under the pipeline described in Automated Scanning & Dynamic Content Ingestion. The parent guide covers how to launch an isolated context and stabilize the DOM; this page assumes you already have a trustworthy scan function and focuses only on wiring its output into a pipeline gate.

When This Applies

Reach for this pattern once a scan runs reliably in isolation but the gate behaviour is wrong. The symptoms are specific:

A single new moderate finding fails the whole build, so the team sets the step to continue-on-error and it now blocks nothing.
The job passes locally and fails in CI, or reports different counts on reruns of the same commit.
A failed run leaves no artifact behind, so nobody can see what failed without re-running with tracing.
The scan is fast on ten routes but OOM-kills the runner at three thousand.

If instead your scan reports a clean page that is visibly broken, that is a hydration race, not a gating problem — fix it with the stabilization guards in the parent guide before you touch thresholds. Threshold tuning only makes sense once the underlying scan is deterministic.

Minimal Reproducible Example

The most common broken gate is a step that treats any non-empty violations array as a build failure. It looks reasonable and fails immediately in practice, because a real enterprise route almost always carries a handful of low-impact findings that are already ticketed.

# ci_gate_naive.py — DO NOT SHIP. Fails on the first violation of any impact.
import sys
from scan import run_accessibility_scan  # your existing scan function

results = run_accessibility_scan("https://staging.example.com/dashboard")

if results["violations"]:
    # Every moderate/minor finding blocks the deploy, so the team disables the step.
    print(f"{len(results['violations'])} violations found")
    sys.exit(1)

sys.exit(0)

Two things are wrong here. There is no severity discrimination, so a decorative-icon contrast nit blocks a release the same as a keyboard trap. And there is no artifact: the process exits with a count and nothing durable, so the failure is invisible after the runner is torn down.

Correct Implementation

A dependable gate does three things in order: serialize the full payload to a durable artifact, aggregate violations by impact level, then compare against a committed threshold. The script below is the gate; it takes an already-scanned payload and owns only the decision.

# ci_gate.py — impact-aware accessibility gate for CI.
import json
import sys
from collections import Counter
from pathlib import Path
from urllib.parse import urlsplit

# Committed thresholds. Zero critical/serious is the enforced line;
# moderate/minor are tracked as a budget, not a hard block.
THRESHOLDS = {"critical": 0, "serious": 0, "moderate": 25, "minor": 100}
ARTIFACT_DIR = Path("a11y-artifacts")


def count_by_impact(violations):
    # axe-core stamps each violation with an `impact` field.
    # Fall back to "minor" if the engine ever omits it.
    return Counter(v.get("impact") or "minor" for v in violations)


def gate(results, route):
    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    # 1. Persist the COMPLETE payload before deciding anything, so a failed
    #    gate always leaves inspectable evidence behind in the runner.
    #    The slug uses the whole URL path, so /a/settings and /b/settings do
    #    not overwrite each other and a bare host maps to "root".
    slug = urlsplit(route).path.strip("/").replace("/", "_") or "root"
    artifact = ARTIFACT_DIR / f"{slug}.json"
    artifact.write_text(json.dumps(results, indent=2), encoding="utf-8")

    # 2. Aggregate by severity rather than by raw count.
    counts = count_by_impact(results["violations"])
    breaches = {
        impact: (counts[impact], limit)
        for impact, limit in THRESHOLDS.items()
        if counts[impact] > limit
    }

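    for impact, (found, limit) in sorted(breaches.items()):
        # Only critical/serious breaches fail the job; label the rest as warnings.
        label = "[BLOCK]" if impact in ("critical", "serious") else "[WARN] "
        print(f"{label} {impact}: {found} > {limit} allowed  ({route})")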
    for impact, limit in THRESHOLDS.items():
        if impact not in breaches:
            print(f"[ok]    {impact}: {counts[impact]} <= {limit}")

    # 3. Block only when critical/serious breach; moderate/minor over budget
    #    warns but does not fail, keeping the gate credible.
    hard = {"critical", "serious"} & breaches.keys()
    return 1 if hard else 0


if __name__ == "__main__":
    payload = json.loads(Path(sys.argv[1]).read_text(encoding="utf-8"))
    sys.exit(gate(payload, sys.argv[2]))

The gate never re-runs the browser; it consumes a JSON payload the scan already produced. That separation matters — the scan is slow and stateful, the gate is fast and pure, and keeping them apart lets you unit-test the decision logic against fixture JSON without launching Chromium. Before the payload reaches this gate it should pass the JSON Schema validation for accessibility data contract, so a truncated or malformed result fails as a structural error rather than silently counting zero violations and passing the gate.
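To make the "structural error rather than silent zero" point concrete, here is a minimal pre-check sketch. It is not the JSON Schema contract referenced above; `check_payload`, `PayloadError`, and `KNOWN_IMPACTS` are hypothetical names, and the only assumption about the payload is that it carries a `violations` list whose entries have an `id` and an optional `impact`.

```python
# precheck.py: hypothetical structural pre-check, a stand-in for full schema validation.
KNOWN_IMPACTS = {"critical", "serious", "moderate", "minor"}


class PayloadError(ValueError):
    """Raised when a scan payload is structurally unusable."""


def check_payload(payload):
    # A truncated scan can produce an object with no "violations" key at all;
    # treat that as an error instead of as "zero violations".
    if not isinstance(payload, dict):
        raise PayloadError("payload must be a JSON object")
    violations = payload.get("violations")
    if not isinstance(violations, list):
        raise PayloadError("'violations' must be a list")
    for index, violation in enumerate(violations):
        if not isinstance(violation, dict) or "id" not in violation:
            raise PayloadError(f"violation {index} has no rule id")
        impact = violation.get("impact")
        # The gate already tolerates a missing impact, so only reject
        # impact strings it would not know how to bucket.
        if impact is not None and impact not in KNOWN_IMPACTS:
            raise PayloadError(f"violation {index} has unknown impact {impact!r}")
    return payload
```

Calling `check_payload(payload)` before `gate(payload, route)` would turn a malformed result into an exception, and therefore a non-zero exit, rather than a passing gate.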

The severity buckets themselves come from the engine, not from this script. Which rules run and how they map to WCAG success criteria is set once in your axe-core enterprise configuration; the gate only reads the impact the engine assigned. If a rule is producing noise, tune it there — suppressing a rule in configuration with a tracked review ticket is auditable, whereas raising the moderate budget to hide it is not.

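The gate's decision path is small but easy to get subtly wrong, so it is worth spelling out: write the artifact, count violations by impact, compare each count with its committed threshold, print one line per impact level, and return 1 only when critical or serious exceeds its limit. A moderate or minor count over budget is reported but never changes the exit code.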

Wiring it into GitHub Actions

The workflow provisions a pinned browser, runs the scan into a11y-artifacts/, invokes the gate, and — critically — uploads the artifacts with if: always() so the evidence survives even when the gate exits non-zero.

# .github/workflows/a11y-gate.yml
name: Accessibility Gate
on: [pull_request]

jobs:
  a11y:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install pinned deps
        run: |
          pip install -r requirements.txt          # playwright==1.44.*, jsonschema
          playwright install --with-deps chromium   # identical Chromium everywhere
      - name: Scan and gate
        run: |
          python scan_route.py "$TARGET_URL"  a11y-artifacts/dashboard.raw.json
          python ci_gate.py    a11y-artifacts/dashboard.raw.json  "$TARGET_URL"
        env:
          TARGET_URL: https://staging.example.com/dashboard
      - name: Upload evidence
        if: always()   # keep artifacts even when the gate fails the job
        uses: actions/upload-artifact@v4
        with:
          name: a11y-report
          path: a11y-artifacts/

Pinning both the Playwright package and the injected engine build is what makes the gate reproducible across a developer laptop and a headless runner; a floating version resolves to different rule sets on different days and turns environment drift into phantom regressions.

Pipeline Integration Note

A single-route gate is the unit; a real pipeline fans it out and then converges the results. At enterprise route counts you do not drive thousands of URLs from one process — you shard them with the sharding model in the batch validation architecture, run each shard as a parallel CI job that writes its own artifacts, and aggregate the per-shard counts in a final gate step so the threshold applies to the whole surface rather than to each route in isolation. Blocking findings feed the error categorization triage pipelines, which deduplicate selectors across routes and attach component ownership, so a failed gate produces routed tickets instead of a wall of raw JSON. Routes that never settle — continuously streaming feeds — must be scanned with the progressive traversal in async crawling for infinite scroll pages before they enter the gate, or the shard for that route will time out rather than fail cleanly.
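The final aggregation step could look roughly like the sketch below. It assumes each shard's per-route payloads have been downloaded into a single directory containing exactly one JSON file per route, and that the same thresholds apply to the summed counts; `aggregate_counts` and `surface_exit_code` are hypothetical names, not part of the gate script above.

```python
# aggregate_gate.py: hypothetical final step that gates the whole surface.
import json
from collections import Counter
from pathlib import Path

THRESHOLDS = {"critical": 0, "serious": 0, "moderate": 25, "minor": 100}


def aggregate_counts(artifact_dir):
    # Sum impact counts over every per-route payload the shards produced.
    total = Counter()
    for path in sorted(Path(artifact_dir).glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        total.update(v.get("impact") or "minor" for v in payload["violations"])
    return total


def surface_exit_code(total, thresholds=THRESHOLDS):
    # Same rule as the single-route gate: only critical/serious can block.
    hard = [i for i in ("critical", "serious") if total[i] > thresholds[i]]
    return 1 if hard else 0
```

Because the counts are summed before the comparison, the moderate and minor budgets describe the whole surface, which is the behaviour the paragraph above asks for.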

Gotchas

Authenticated routes need storage state committed as a CI secret, not a login step. Scripting a login inside the job adds a flaky, rate-limited dependency to every run. Instead capture storage_state once, store it as an encrypted secret, and load it into the context — the gate then scans the same authenticated surface a user sees without a live credential exchange in the hot path.
Multi-tenant routing can silently scan the wrong tenant’s shell. When a host resolves tenant by subdomain or header, a CI runner hitting the bare staging URL may land on a marketing shell with zero real violations and pass a gate for an app it never evaluated. Pin the tenant explicitly (host header or path prefix) and assert on a tenant-specific landmark before trusting the count.
Viewport variance flips target-size and reflow results. A gate that scans at 1280x800 and a developer who reproduces at a laptop’s native width will disagree on target-size and reflow findings. Fix the viewport in both the scan config and the local repro instructions so a “works on my machine” dispute cannot start. This is the same boundary sensitivity handled at the architecture level by dynamic content boundary detection.
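The storage-state and viewport points can be combined into one place that builds the browser context options. The sketch below is an illustration under assumptions: the secret name `A11Y_STORAGE_STATE_B64`, the base64 encoding, and the `en-US` / `UTC` values are hypothetical choices, while `viewport`, `locale`, `timezone_id`, and `storage_state` are keyword arguments of Playwright's Python `browser.new_context()`.

```python
# context_options.py: hypothetical helper shared by CI and local repro runs.
import base64
import json
import os


def context_options(env=os.environ):
    # Fixed dimensions so CI and laptops evaluate the same layout.
    options = {
        "viewport": {"width": 1280, "height": 800},
        "locale": "en-US",      # assumption: pick one locale and keep it
        "timezone_id": "UTC",   # assumption: pick one timezone and keep it
    }
    # Assumed convention: the captured storage state JSON is stored as a
    # base64-encoded CI secret and exposed to the job as an environment variable.
    encoded = env.get("A11Y_STORAGE_STATE_B64")
    if encoded:
        options["storage_state"] = json.loads(base64.b64decode(encoded))
    return options


# Usage inside the scan, given an already launched Playwright browser:
#   context = browser.new_context(**context_options())
```

Keeping the options in one function means the local repro instructions can import the same helper instead of restating the viewport by hand.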

Frequently Asked Questions

Why gate on impact level instead of total violation count?

Total count conflates a keyboard trap with a decorative-icon nit, so any threshold you pick is either too loose to catch real regressions or too tight to ever pass. Bucketing by the engine’s impact field lets you enforce a hard zero on critical and serious while carrying moderate and minor as a tracked budget, which keeps the gate both credible and unblocked.

Should the scan and the gate be the same script?

No. Keep the browser scan and the threshold decision in separate steps that communicate through a JSON artifact. The scan is slow and stateful; the gate is fast and pure. Splitting them lets you unit-test the gate against fixture JSON without launching Chromium, and it guarantees the payload is written to disk before any exit code is decided.

How do I keep artifacts when the gate fails the job?

Upload them with if: always() in GitHub Actions (or the equivalent unconditional step in your CI), and write the artifact inside the gate before it computes the exit code. If you only upload on success, every failing run — the ones you most need to inspect — loses its evidence when the runner is torn down.

My CI counts differ from my local run on the same commit. What is unpinned?

Almost always the engine build, the viewport, or the locale. Install axe-core as a locked npm dependency and inject it from disk rather than a CDN, provision the identical Chromium through playwright install, and fix viewport, locale, and timezone in the context options. Diff the violation IDs between environments; the rules that differ point straight at the drifted dimension.
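Diffing the violation IDs takes only a few lines once both runs have written their payloads to disk. This is a minimal sketch that assumes each payload has a `violations` list whose entries carry the axe rule `id`; `rule_ids` and `diff_rule_ids` are hypothetical helper names.

```python
# diff_rules.py: hypothetical helper for comparing a local run with a CI run.
def rule_ids(payload):
    # Collect the distinct rule ids that reported at least one violation.
    return {v["id"] for v in payload["violations"]}


def diff_rule_ids(local_payload, ci_payload):
    local, ci = rule_ids(local_payload), rule_ids(ci_payload)
    return {
        "only_local": sorted(local - ci),
        "only_ci": sorted(ci - local),
    }
```

A rule that shows up on only one side is a hint about which dimension drifted: layout-dependent rules suggest the viewport, while rules missing entirely from one side suggest a different engine build.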
