Error Categorization & Triage Pipelines

A single enterprise crawl can emit tens of thousands of raw violations across thousands of routes, and the overwhelming majority of them are noise the moment they land: cross-route duplicates of one component defect, false positives from third-party widgets, and low-impact deviations no one will action this quarter. The specific obstacle this page solves is turning that undifferentiated firehose into a small, ranked, developer-ready set of tickets — deterministically, on every run, without a human triaging the queue by hand. Get the boundary wrong and one of two failure signatures follows: the pipeline forwards everything and engineers learn to ignore accessibility tickets wholesale, or it filters too aggressively and quietly drops real Level AA blockers that resurface in a legal complaint.

This guide is part of the broader Automated Scanning & Dynamic Content Ingestion strategy and sits directly downstream of it. Where the batch validation architecture establishes how thousands of route evaluations are coordinated and aggregated into a schema-valid dataset, this page establishes what happens to that dataset next — how each finding is normalized, classified into a severity tier, scored for false-positive likelihood, and routed to the correct engineering queue or quarantine. The triage pipeline is a pure transform over the aggregated findings: it opens no browsers and evaluates no DOM, so it can scale, fail, and be replayed independently of the crawl. The five stages below trace one raw violation from ingestion to its final routing destination.

Prerequisites and Environment Context

The triage pipeline consumes the finding contract produced upstream, so version drift in that contract is the fastest way to break it silently. Pin the following before implementing anything below:

Python 3.11+ for the orchestration layer. The examples use dataclasses, structural pattern matching, and enum.IntEnum for ordered severity tiers.
The same pinned JSON Schema (jsonschema 4.x) that the JSON Schema validation for accessibility data contract defines. Triage must validate against the identical schema version the aggregator emitted, or a field rename upstream becomes a KeyError here.
A broker with visibility timeouts — Redis 7.x via redis-py 5.x, or any queue supporting per-message TTL and dead-letter routing — to batch classification work without loading the full dataset into RAM.
A version-controlled rule map that translates engine rule IDs into WCAG success criteria and severity tiers. This map is the single source of truth for classification and must be auditable; treat a change to it like a change to the axe-core enterprise configuration it mirrors.
Issue-tracker and chat credentials (Jira/GitHub Issues, Slack/Teams) injected as short-lived CI secrets, never checked in.

Environment parity matters here too: the rule map, the schema, and the engine version must move together. If a Chromium or axe-core bump upstream changes an impact value, the severity mapping must be updated in the same commit, or the pipeline will misclassify findings the day the crawler upgrades.

Conceptual Model: Normalize, Classify, Filter, Route

The pipeline is four transforms with a queue between ingestion and classification, and a routing gate at the end. Each transform is a pure function of its input, so it can be unit-tested against a golden fixture and replayed against historical datasets without re-crawling.

Normalization enforces the finding contract. Heterogeneous engine payloads carry rule IDs, DOM paths, impact levels, and contextual metadata in inconsistent shapes; normalization rejects malformed records at the boundary and produces a flat, typed model every downstream stage can consume without defensive parsing.

Classification is a deterministic lookup, never a heuristic guess. It maps each normalized record’s engine rule ID onto a standardized WCAG success criterion and an enterprise severity tier, then enriches it with business-impact signals (page reach, conversion-path proximity, historical remediation status) so that ranking reflects risk rather than raw count.

False-positive filtering scores the likelihood that a flagged element is functionally compliant despite tripping a rule. It draws the noise boundary the same way dynamic content boundary detection draws the “settled DOM” boundary — with explicit, testable predicates rather than intuition — and the deeper heuristics live in the child guide on categorizing false positives in automated scan results.

Routing applies the gate: confident critical findings block the pull request and open a ticket; confident non-critical findings go to backlog grooming; low-confidence findings are quarantined for human review rather than dropped.

Each finding moves from the aggregated dataset through normalization, classification, and false-positive filtering to the routing gate. Classification is cheap; filtering is where the judgment lives; routing is strict and idempotent.

The queue sits between ingestion and classification for the same backpressure reason that governs the upstream crawl: a burst of forty thousand findings should not spawn forty thousand concurrent tracker API calls. Quarantine is a first-class destination rather than a silent drop for auditability: an accessibility program has to be able to prove why a finding was not actioned, and a discarded record cannot be reviewed.

Step-by-Step Implementation

1. Normalize and schema-validate each finding

Normalization flattens the engine payload into a typed record and validates it against the pinned contract. A record that fails validation is dead-lettered with its errors attached, never passed downstream.

import json
from dataclasses import dataclass, asdict
from jsonschema import Draft202012Validator

with open("schemas/finding.schema.json") as fh:
    VALIDATOR = Draft202012Validator(json.load(fh))

@dataclass(frozen=True)
class Finding:
    rule_id: str          # engine rule identifier, e.g. "color-contrast"
    wcag_criterion: str    # mapped later; empty until classification
    element_selector: str  # canonical DOM path to the node
    impact: str            # engine impact: minor|moderate|serious|critical
    page_url: str

def normalize(raw: dict) -> Finding:
    # Coerce the heterogeneous engine shape into one flat contract. Missing
    # optional fields become explicit empties so downstream code never guards.
    return Finding(
        rule_id=raw["id"],
        wcag_criterion="",  # populated in step 3, not trusted from the engine
        element_selector=raw["nodes"][0]["target"][0],
        impact=raw.get("impact") or "moderate",
        page_url=raw["url"],
    )

def validate(finding: Finding) -> list[str]:
    # Return schema errors instead of raising, so the caller can dead-letter
    # the record with a reason rather than crashing the whole batch.
    return [e.message for e in VALIDATOR.iter_errors(asdict(finding))]

2. Batch through the queue with streaming ingestion

For datasets spanning millions of URLs, never deserialize the whole report into memory. Stream records off the aggregated artifact and enqueue them in bounded chunks so classification workers pull only as fast as they can drain.

import json
import ijson  # streaming JSON parser: does not load the whole blob
import redis

r = redis.Redis(host="broker", decode_responses=True)

def stream_enqueue(report_path: str, chunk: int = 500, ttl: int = 900) -> int:
    # ijson yields findings one at a time from disk, capping peak RSS
    # regardless of report size; contrast with json.load() on a 2 GB blob.
    count, pipe = 0, r.pipeline()
    with open(report_path, "rb") as fh:
        # use_float=True makes ijson emit floats rather than Decimal, which
        # json.dumps() cannot serialize.
        for raw in ijson.items(fh, "violations.item", use_float=True):
            pipe.rpush("triage:pending", json.dumps(raw))
            count += 1
            if count % chunk == 0:
                pipe.execute()          # flush in bounded batches
    pipe.execute()                      # flush the final partial chunk
    # The TTL applies to the whole list key, not to individual messages;
    # nx=True keeps an existing deadline (the NX flag needs Redis 7.0+).
    r.expire("triage:pending", ttl, nx=True)
    return count

3. Classify deterministically and map severity

Classification is a table lookup against the version-controlled rule map, followed by business-impact enrichment. It never guesses a criterion for a rule the map does not cover; an unmapped rule ID is an error condition, not a default.

from enum import IntEnum

class Severity(IntEnum):
    INFORMATIONAL = 0
    HIGH = 1
    CRITICAL = 2   # ordered so gating can compare with >=

# Version-controlled and auditable — one row per engine rule.
RULE_MAP = {
    "color-contrast":   {"wcag": "1.4.3", "base": Severity.HIGH},
    "image-alt":        {"wcag": "1.1.1", "base": Severity.CRITICAL},
    "label":            {"wcag": "4.1.2", "base": Severity.CRITICAL},
    "region":           {"wcag": "1.3.1", "base": Severity.INFORMATIONAL},
}

def classify(finding: Finding, monthly_views: int, on_conversion_path: bool) -> dict:
    rule = RULE_MAP.get(finding.rule_id)
    if rule is None:
        # Unmapped rule = the map drifted from the engine. Quarantine, don't guess.
        raise KeyError(f"unmapped rule_id: {finding.rule_id}")
    severity = rule["base"]
    # Business signals can escalate, never silently downgrade a legal blocker.
    if on_conversion_path and severity < Severity.CRITICAL:
        severity = Severity(severity + 1)
    reach = min(monthly_views / 10_000, 1.0)
    return {
        **asdict(finding),
        "wcag_criterion": rule["wcag"],
        "severity": severity.name,
        "priority_score": round(severity.value * 0.7 + reach * 0.3, 3),
    }
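
A worked example makes the ranking behavior of `priority_score` concrete. The sketch below restates the formula from `classify()` as a standalone helper (the `priority_score` function name and the three sample rows are illustrative, not part of the pipeline). Because the severity term moves in steps of 0.7 while reach contributes at most 0.3, a higher tier always outranks a lower one, and reach only orders findings within a tier.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFORMATIONAL = 0
    HIGH = 1
    CRITICAL = 2


def priority_score(severity: Severity, monthly_views: int,
                   reach_weight: float = 0.3) -> float:
    # Same formula as classify(): reach saturates at 10,000 monthly views.
    reach = min(monthly_views / 10_000, 1.0)
    return round(severity.value * (1.0 - reach_weight) + reach * reach_weight, 3)


# Illustrative rows only: (label, tier, monthly page views).
samples = [
    ("high-traffic informational", Severity.INFORMATIONAL, 50_000),
    ("high-traffic high", Severity.HIGH, 50_000),
    ("low-traffic critical", Severity.CRITICAL, 500),
]

ranked = sorted(samples, key=lambda s: priority_score(s[1], s[2]), reverse=True)
for label, tier, views in ranked:
    print(f"{priority_score(tier, views):>6}  {tier.name:<13}  {label}")
```

The low-traffic critical finding scores 2 × 0.7 + 0.05 × 0.3 = 1.415 and still ranks above the high-traffic HIGH finding at 1 × 0.7 + 1.0 × 0.3 = 1.0.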

4. Score false-positive likelihood and set a confidence gate

Automated engines flag dynamic content, ARIA overrides, and third-party widgets that are functionally compliant. Score each classified finding against explicit predicates and attach a confidence value; only high-confidence findings proceed to routing.

import re

VENDOR_HOSTS = re.compile(r"(intercom|hubspot|drift|__vendor)")

def false_positive_score(record: dict) -> float:
    # 0.0 = certainly real, 1.0 = almost certainly a false positive.
    # Each predicate is testable in isolation; see the child guide for the full set.
    score = 0.0
    if VENDOR_HOSTS.search(record["element_selector"]):
        score += 0.5   # third-party widget the team does not own
    if record["rule_id"] == "color-contrast" and "svg" in record["element_selector"]:
        score += 0.3   # dynamic SVG contrast is a known noisy check
    if record["rule_id"] == "region" and record["severity"] == "INFORMATIONAL":
        score += 0.2   # landmark best-practice, rarely a hard failure
    return min(score, 1.0)

def with_confidence(record: dict, quarantine_at: float = 0.5) -> dict:
    fp = false_positive_score(record)
    record["confidence"] = round(1.0 - fp, 3)
    record["disposition"] = "quarantine" if fp >= quarantine_at else "route"
    return record

5. Route idempotently to trackers and the PR gate

Routing maps a confident finding to an issue tracker via REST, keyed on a deterministic fingerprint so retries never create duplicate tickets. Critical findings additionally fail the CI run and comment on the pull request.

import hashlib
import httpx

def ticket_key(record: dict) -> str:
    # Deterministic: same defect on the same component maps to one ticket
    # across reruns, so network retries are idempotent, not duplicative.
    basis = f"{record['rule_id']}::{record['element_selector']}"
    return hashlib.sha256(basis.encode()).hexdigest()[:16]

def route(record: dict, client: httpx.Client) -> str:
    if record["disposition"] == "quarantine":
        return "quarantined"          # held for human review, never dropped
    key = ticket_key(record)
    # Upsert by external key so a re-run reopens rather than clones the ticket.
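    response = client.put(f"/rest/api/issue/{key}", json={
        "wcag": record["wcag_criterion"],
        "severity": record["severity"],
        "url": record["page_url"],
        "selector": record["element_selector"],
    }, timeout=10.0)
    # Fail loudly on 4xx/5xx so the caller retries or dead-letters the record
    # instead of reporting a ticket that was never written.
    response.raise_for_status()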
    return "blocked" if record["severity"] == "CRITICAL" else "backlog"
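
The idempotency claim can be checked without a real tracker. The sketch below reuses the same fingerprint basis and routes one finding list twice against an in-memory stub; `InMemoryTracker` and its `upsert` method are hypothetical stand-ins for whatever upsert-by-external-key operation the tracker offers.

```python
import hashlib


def ticket_key(record: dict) -> str:
    # Same fingerprint basis as above: rule ID plus selector, no page URL.
    basis = f"{record['rule_id']}::{record['element_selector']}"
    return hashlib.sha256(basis.encode()).hexdigest()[:16]


class InMemoryTracker:
    # Hypothetical stand-in for an issue tracker that supports upsert by key.
    def __init__(self) -> None:
        self.tickets: dict[str, dict] = {}

    def upsert(self, key: str, record: dict) -> None:
        ticket = self.tickets.setdefault(
            key, {"rule_id": record["rule_id"], "urls": set()}
        )
        ticket["urls"].add(record["page_url"])  # same defect seen on another route


findings = [
    {"rule_id": "label", "element_selector": "form#signup input.email", "page_url": "/signup"},
    {"rule_id": "label", "element_selector": "form#signup input.email", "page_url": "/pricing"},
    {"rule_id": "image-alt", "element_selector": "img.hero", "page_url": "/"},
]

tracker = InMemoryTracker()
for _ in range(2):  # the second pass simulates a retry or a full re-run
    for finding in findings:
        tracker.upsert(ticket_key(finding), finding)

print(len(tracker.tickets))  # ticket count after both passes
```

Three findings across two passes produce two tickets: the shared form component collapses into one ticket that lists both routes, and the second pass changes nothing.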

Configuration Reference

The parameters that govern classification strictness, filtering aggressiveness, and routing behavior. Tune the confidence gate against a labeled sample of your own findings, not a default — noise profiles differ sharply between marketing sites and authenticated apps.

Parameter	Type	Default	Description
`CHUNK_SIZE`	int	`500`	Findings enqueued per pipeline flush. Larger chunks reduce round-trips but raise peak worker RSS.
`QUARANTINE_THRESHOLD`	float	`0.5`	False-positive score at or above which a finding is held for review instead of routed. Lower it to catch more noise, raise it to forward more.
`BLOCKING_SEVERITY`	str	`"CRITICAL"`	Minimum tier that fails the CI run. `"HIGH"` blocks more aggressively; keep at `"CRITICAL"` for primary-journey-only gating.
`CONVERSION_ESCALATION`	bool	`true`	Whether a finding on a conversion-path route is escalated one severity tier.
`REACH_WEIGHT`	float	`0.3`	Weight of page reach in `priority_score`; the remainder weights severity.
`RETRY_BACKOFF_MS`	int	`500`	Base delay for exponential backoff on tracker API failures before dead-lettering.
`DEDUPE_SCOPE`	str	`"rule+selector"`	Fingerprint basis for the idempotent ticket key. Add `page_url` only when per-route tickets are wanted.
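
`RETRY_BACKOFF_MS` implies a retry wrapper around the tracker call. A minimal sketch follows, assuming a doubling schedule from the base delay; the retry count, the 8-second cap, the random jitter, and the names `backoff_delays`, `call_with_backoff`, and `DeadLettered` are all illustrative choices that the table above does not specify.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadLettered(Exception):
    """Raised when every attempt failed and the record should be dead-lettered."""


def backoff_delays(base_ms: int = 500, retries: int = 4, cap_ms: int = 8_000) -> list[float]:
    # Upper bound (seconds) before retry n: base * 2**n, capped.
    return [min(base_ms * 2**n, cap_ms) / 1000 for n in range(retries)]


def call_with_backoff(fn: Callable[[], T], base_ms: int = 500, retries: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    last_error: Exception | None = None
    # One immediate attempt, then one attempt after each backoff delay.
    for delay in [0.0, *backoff_delays(base_ms, retries)]:
        if delay:
            # Jitter spreads retries out so workers do not retry in lockstep.
            sleep(random.uniform(0, delay))
        try:
            return fn()
        except Exception as exc:  # a real worker would catch only transport errors
            last_error = exc
    raise DeadLettered(f"gave up after {retries + 1} attempts") from last_error
```

Wrapping the tracker call as `call_with_backoff(lambda: route(record, client))` keeps retry policy out of the routing function, and the deterministic ticket key keeps the repeated calls safe.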

CI/CD Integration and Threshold Gating

Position the triage pipeline as a stage that consumes the aggregated report and produces routing decisions: Batch Validation → Aggregated Findings → Triage → Threshold Gate → Deploy / Block. Attach the triage worker to the post-scan artifact — a workflow_run trigger in GitHub Actions or a downstream pipeline job in GitLab CI that fires the moment the crawl publishes its JSON report. The per-run mechanics of wiring an accessibility job into a pipeline are covered in running Playwright accessibility checks in CI/CD, and the same triage output feeds the tuning described in configuring axe-core for enterprise-scale batch scanning.

Avoid a binary pass/fail gate — it teaches teams to suppress findings rather than fix them, and it treats a landmark best-practice deviation like a missing form label. Use tiered logic instead: fail the run on confident CRITICAL findings on primary journeys, warn and attach HIGH findings to the pull request as tracked debt, and run regression detection against a baseline snapshot so the gate blocks new violations on previously clean routes even when the absolute count is within budget. Bind the blocking tier to your conformance target using the A/AA/AAA compliance level mapping so the threshold encodes legal intent rather than an arbitrary number. Attach SLA timers by severity, escalate to Slack or Teams as deadlines approach, and track mean-time-to-remediate per squad from the ticket metadata. All rule-to-criterion mappings should reference the W3C Web Content Accessibility Guidelines (WCAG) 2.2 as the authoritative baseline, and PR status checks should follow the official GitHub Actions documentation for branch protection.
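
The tiered logic above can be sketched as a single decision function. This is one possible reading, not a prescribed implementation: `GateResult`, `evaluate_gate`, the shape of `baseline`, and the choice to exempt INFORMATIONAL findings from the regression rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    blocking: list[dict] = field(default_factory=list)
    warnings: list[dict] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.blocking


def evaluate_gate(records: list[dict],
                  baseline: dict[str, set[tuple[str, str]]],
                  primary_routes: set[str]) -> GateResult:
    # baseline maps each previously scanned route to the (rule_id, selector)
    # pairs it already carried; an empty set means the route was clean.
    result = GateResult()
    for rec in records:
        if rec["disposition"] != "route":
            continue  # quarantined findings wait for review and never gate
        route = rec["page_url"]
        previously_clean = route in baseline and not baseline[route]
        if rec["severity"] == "CRITICAL" and route in primary_routes:
            result.blocking.append(rec)   # confident critical on a primary journey
        elif previously_clean and rec["severity"] != "INFORMATIONAL":
            result.blocking.append(rec)   # regression on a previously clean route
        elif rec["severity"] in ("HIGH", "CRITICAL"):
            result.warnings.append(rec)   # tracked debt attached to the pull request
    return result
```

The CI job would exit non-zero when `passed` is false and post `warnings` as a pull-request comment, so the two tiers stay visibly distinct to reviewers.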

Verification and Testing

Golden-fixture classification. Run classify() over a frozen set of labeled findings and assert every record maps to the expected criterion and tier. A diff means the rule map drifted from the engine — fix it before trusting any gate.
Unmapped-rule guard. Feed a synthetic finding with an unknown rule_id and assert the pipeline quarantines it (via the KeyError path) rather than defaulting it into the backlog. Silent defaults are how coverage rots.
Idempotent routing. Route the same finding list twice against a stubbed tracker and assert ticket cardinality is unchanged; the deterministic ticket_key must absorb the second pass.
Confidence-gate calibration. Score a labeled sample and plot precision against QUARANTINE_THRESHOLD; pick the value where real violations are not being quarantined, then re-check it after any engine upgrade.
Memory ceiling. Run stream_enqueue() against a deliberately oversized report and watch RSS stay flat — if it climbs with report size, something is buffering the whole blob and the streaming parser is being bypassed.
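
The confidence-gate calibration can be sketched as a threshold sweep over a labeled sample. The `sweep` helper is hypothetical, and the sample scores and labels below are invented purely to show the mechanics; a real calibration would use hand-labeled findings from your own scans.

```python
def sweep(samples: list[tuple[float, bool]], thresholds: list[float]) -> list[dict]:
    # samples: (false_positive_score, is_real_violation) pairs from a labeled set.
    total_real = sum(1 for _, real in samples if real)
    rows = []
    for t in thresholds:
        # Same rule as with_confidence(): a score at or above t is quarantined.
        routed = [real for score, real in samples if score < t]
        real_routed = sum(routed)
        rows.append({
            "threshold": t,
            "precision": round(real_routed / len(routed), 3) if routed else None,
            "real_quarantined": total_real - real_routed,
        })
    return rows


# Invented labels for illustration only.
labeled = [
    (0.0, True), (0.0, True), (0.2, True), (0.3, False),
    (0.5, False), (0.5, True), (0.8, False),
]

for row in sweep(labeled, [0.3, 0.5, 0.8]):
    print(row)
```

Reading the rows side by side shows the trade-off: a lower threshold raises the precision of what gets routed but holds back more real violations, which is the number to drive toward zero before trusting the gate.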

Failure Modes and Troubleshooting

Alert fatigue from unfiltered forwarding. Every finding becomes a ticket and engineers mute the whole channel within a sprint. Root cause: routing before filtering, or a QUARANTINE_THRESHOLD set so high nothing is held back. Fix: score false-positive likelihood before routing, forward only high-confidence findings, and calibrate the gate against a labeled sample rather than shipping the default.

Real Level AA blockers silently dropped. A missing-label or image-alt failure never reaches a ticket. Root cause is almost always an over-aggressive filter or a rule that quarantines a whole class of findings. Fix: make quarantine a reviewed destination, not a delete; alert on quarantine depth; and never let false_positive_score push a CRITICAL finding out of routing.

Duplicate tickets on every retry. A transient tracker timeout triggers a retry and a second ticket appears. Root cause: non-idempotent creation keyed on a mutable field or a random ID. Fix: upsert by the deterministic ticket_key, and make the tracker call a PUT-by-external-key rather than a POST-creates-new.

Memory exhaustion on large reports. The triage worker is OOM-killed on a multi-gigabyte artifact. Root cause: json.load() on the whole blob instead of streaming. Fix: parse incrementally with ijson, enqueue in bounded chunks, and cap in-flight records so peak RSS is independent of report size.

False positives from third-party widgets and dynamic SVGs. Marketing embeds and chart libraries trip contrast or role rules the team does not own, inflating the queue. Fix: scope them out at the source in axe-core enterprise configuration via exclude, and route the residue through the false-positive scorer so approved suppressions never re-enter the blocking set. The full predicate set lives in categorizing false positives in automated scan results.

Frequently Asked Questions

Why are engineers ignoring the accessibility tickets my pipeline creates?

Almost always because the pipeline forwards noise. If every axe finding becomes a ticket — including cross-route duplicates, third-party widget contrast, and landmark best-practices — the signal-to-noise ratio collapses and the channel gets muted. Score false-positive likelihood before routing, block only on confident critical findings on primary journeys, and collapse identical defects across routes into one ticket via a deterministic fingerprint.

How do I map engine rule IDs to WCAG success criteria without guessing?

Maintain a version-controlled lookup table with one row per engine rule, mapping its ID to a success criterion and a severity tier, and treat a change to it like a change to your engine configuration. Never guess a criterion for a rule the table does not cover; an unmapped rule ID should quarantine the finding for review, not silently default it. This keeps the mapping auditable, which is exactly what a compliance program needs to defend a classification.

Should the CI gate fail on every violation the triage pipeline surfaces?

No. A binary gate pushes teams to suppress rather than remediate. Fail on confident critical findings on primary journeys, warn on high-impact findings as tracked debt, and add regression detection against a baseline so new violations on previously clean routes block even when the total count is within budget. Bind the blocking tier to your conformance target rather than a raw number.

How does the triage pipeline avoid creating duplicate tickets on retries?

Key ticket creation on a deterministic fingerprint of the rule ID plus the DOM selector, and upsert by that external key instead of posting a new issue. A transient tracker timeout then reopens or updates the existing ticket on retry rather than cloning it. Pair this with exponential backoff so a slow tracker degrades gracefully instead of stampeding the API.

Where should suppressed false positives be handled — in the engine config or the pipeline?

Both, at different layers. Structural exclusions you never own (third-party iframes, analytics widgets) belong in the engine’s exclude scoping so they never generate a finding at all. Context-specific suppressions that require judgment belong in the triage layer’s false-positive scorer, where they are auditable, reversible, and held in quarantine for review. Never hard-code suppressions inline in worker code, where they escape review and drift out of sync with the standard.
