WCAG 2.2 vs 3.0 Success Criteria Taxonomy

An audit pipeline that hard-codes WCAG 2.2 success criteria as its unit of measurement will need re-instrumenting the day WCAG 3.0 becomes normative, because 3.0 does not measure the same thing: its draft replaces the pass/fail success criterion with a weighted, outcome-based score. This page addresses one engineering problem: how to build a single evaluation layer that reports a defensible WCAG 2.2 Level AA conformance result today while accumulating the graded telemetry WCAG 3.0 will demand, without maintaining two parallel scanners. It is the standards-mapping reference within the broader Enterprise WCAG Audit Architecture & Standards Mapping strategy. It defines the taxonomy both standards use, the abstraction that lets a single set of atomic test results project into either one, and the failure modes that quietly corrupt a dual-standard report.

The audience is accessibility specialists translating criteria into assertions, frontend QA teams who own the gate, and Python automation engineers who will copy the mapping module directly. WCAG 3.0 is a W3C Working Draft as of 2026 and its conformance model may change before publication, so treat every 3.0 detail here as design direction to abstract against, not a finalized specification. The one architectural commitment that holds regardless of how 3.0 lands is this: keep the atomic result — a single rule firing against a single node — as your source of truth, and derive every conformance verdict from it rather than storing verdicts directly.

Prerequisites & Environment Context

The mapping layer sits downstream of the scanner and upstream of the gate, so it inherits the version pins of both. Fix them before writing a line of mapping code, because a taxonomy that drifts against the rule engine underneath it produces conformance claims that no longer correspond to what was actually tested.

Python 3.11+ for the mapping and scoring module. The examples use only the standard library plus jsonschema>=4.21 to validate result payloads; the scoring math is pure Python so it runs identically on a laptop and a CI runner.
A pinned rule engine. These examples assume axe-core enterprise configuration is the source of raw findings, with axe-core pinned in the lockfile. Rule IDs and their tags (for example wcag2aa, wcag412) are the join key between the engine and this taxonomy, and they change across major versions — an unpinned engine silently repoints your mapping.
The normative criteria lists. Keep a local, versioned copy of the WCAG 2.2 Recommendation success-criteria list and the W3C ACT Rules that formalize how each criterion is tested. ACT rule identifiers are the stable bridge between a human-readable criterion and an executable check.
A validated result contract. Every raw finding must pass JSON Schema validation for accessibility data before the mapping layer touches it, so a malformed engine output fails loudly instead of skewing a graded score.

Once those are pinned, the mapping module is deterministic: the same DOM and the same rule engine version yield the same atomic results, and therefore the same 2.2 verdict and 3.0 score on every run.

Structural Divergence: The Two Conformance Models

WCAG 2.2 is a binary model organized under four principles — Perceivable, Operable, Understandable, and Robust (POUR). Every success criterion carries a conformance level of A, AA, or AAA, and a page either satisfies a criterion or it does not. That determinism is what makes 2.2 map cleanly onto automated assertions: each criterion becomes a discrete rule that returns a boolean, and a conformance target is simply the set of criteria at or below a chosen level that must all pass. How those levels compose into a gate threshold is the subject of the A/AA/AAA compliance level mapping.

WCAG 3.0 discards the binary verdict for a graded score. Instead of principles and criteria, its draft organizes work around outcomes (what the user must be able to do) and methods (concrete ways to test an outcome), and it rates results on a spectrum rather than pass/fail. This forces the automation layer to capture severity gradients and weighted aggregates rather than a single boolean, and to reason about partial credit — a page that gets most of a data table right but mislabels one header is no longer simply “failed.” The taxonomy also flattens the rigid POUR hierarchy into a more outcome-oriented structure that accommodates a wider range of assistive technologies. The table below fixes the vocabulary this page uses for both standards.

Dimension	WCAG 2.2	WCAG 3.0 (draft)
Conformance model	Binary (pass / fail)	Graded (0–100 score)
Top-level structure	POUR principles	Outcomes
Conformance levels	A, AA, AAA	Bronze, Silver, Gold
Testing unit	Success criterion	Outcome + method
Granularity	Criterion-level	Atomic test-level
Failure handling	Any failure breaks the level	Weighted deduction from the score

Conceptual Model: One Atomic Result, Two Projections

The mistake that makes a dual-standard audit expensive is storing conformance verdicts instead of the evidence behind them. If the pipeline persists “SC 1.4.3 = fail,” it has thrown away the per-node detail that a 3.0 score needs, and it cannot answer a 3.0 question without re-scanning. The durable design decouples the test-execution layer from the reporting layer: a single traversal collects raw atomic results, and each result is projected into whichever conformance model the consumer asks for.

An atomic result is the smallest fact the pipeline knows: one rule, run against one node, with an outcome and enough context to grade it. Formally it carries the node selector, the ACT/axe rule ID, the WCAG success criteria that rule maps to, a discrete outcome (pass, fail, or cantTell), and an impact severity. Everything else, including the binary 2.2 verdict and the graded 3.0 score, is a pure function of a collection of these facts. The routing for one criterion runs from raw finding to conformance report before it can gate a deploy: a rule that can decide emits pass or fail into both projections, and a rule that cannot decide emits cantTell.

The cantTell branch matters as much as the code: a large fraction of WCAG criteria are not machine-decidable, and a rule that cannot decide must emit cantTell, not a silent pass. The 2.2 projection routes cantTell to manual review; the 3.0 projection must exclude it from the scored denominator rather than count it as full credit. Getting that one rule wrong is the single most common way a graded score is inflated, and it is covered again in the failure modes below.
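
One way to pin down the shape of an atomic result is a frozen dataclass. This is a minimal sketch, not a prescribed schema: the names AtomicResult, Outcome, and as_projection_input, and every field name, are assumptions made for illustration. The impact values follow the minor/moderate/serious/critical scale that axe-core reports.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    """Three-state outcome; cantTell must never collapse into pass."""
    PASS = "pass"
    FAIL = "fail"
    CANT_TELL = "cantTell"


@dataclass(frozen=True)
class AtomicResult:
    """One rule evaluated against one node (hypothetical field names)."""
    selector: str                      # CSS selector identifying the node
    rule_id: str                       # ACT/axe rule id, the registry join key
    success_criteria: tuple[str, ...]  # WCAG 2.2 SC numbers the rule maps to
    outcome: Outcome                   # pass | fail | cantTell
    impact: Optional[str] = None       # "minor" .. "critical"; None if unknown

    def as_projection_input(self) -> dict:
        """Flatten to the {'rule_id', 'outcome'} dict the projections read."""
        return {"rule_id": self.rule_id, "outcome": self.outcome.value}


result = AtomicResult(
    selector="#checkout button.submit",
    rule_id="color-contrast",
    success_criteria=("1.4.3",),
    outcome=Outcome.FAIL,
    impact="serious",
)
```

Freezing the dataclass keeps a stored result immutable, which fits the idea that verdicts are derived from evidence and never written back into it.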

Step-by-Step Implementation

The following four steps build the mapping module: a versioned registry that joins rule IDs to criteria, a deterministic 2.2 assertion, a 2.2 conformance projection, and a 3.0 graded projection over the same atomic results.

1. Model the criteria as a versioned rule registry

The registry is the join table between the rule engine and the two taxonomies. Keyed by ACT/axe rule ID, each entry names the success criteria the rule proves, the 2.2 level, and the 3.0 outcome and weight it contributes. Versioning the registry against the engine version is what keeps the mapping honest when axe-core renames or splits a rule.

# criteria_registry.py — the single join between engine rules and both taxonomies
from dataclasses import dataclass

@dataclass(frozen=True)
class RuleMapping:
    rule_id: str            # axe-core / ACT rule id, the join key
    success_criteria: tuple # WCAG 2.2 SC numbers this rule helps prove
    level: str              # "A" | "AA" | "AAA" — the 2.2 conformance level
    outcome_id: str         # WCAG 3.0 outcome this rule contributes to
    weight: int             # 3.0 method weight; higher blocks more user impact

# Pinned to a specific axe-core version so rule ids cannot drift underneath us.
REGISTRY = {
    "target-size": RuleMapping("target-size", ("2.5.8",), "AA", "pointer-target", 3),
    "color-contrast": RuleMapping("color-contrast", ("1.4.3",), "AA", "text-contrast", 5),
    "label": RuleMapping("label", ("1.3.1", "4.1.2"), "A", "form-labels", 5),
    "image-alt": RuleMapping("image-alt", ("1.1.1",), "A", "text-alternatives", 4),
}
ENGINE_VERSION = "4.10.0"  # assert this matches the running engine at load time
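
The load-time assertion mentioned in the last comment could look like the sketch below. The exception class, the function name, and the payload shape (a results object exposing the engine version under testEngine.version) are assumptions for illustration; check them against the output your pinned engine actually produces.

```python
ENGINE_VERSION = "4.10.0"  # same pin as the registry module


class EngineVersionMismatch(RuntimeError):
    """Raised when the running engine differs from the version the registry targets."""


def assert_engine_version(reported_version: str, pinned: str = ENGINE_VERSION) -> None:
    """Fail fast if the engine that produced the results is not the pinned one."""
    if reported_version != pinned:
        raise EngineVersionMismatch(
            f"registry pinned to axe-core {pinned}, results came from {reported_version}"
        )


# Assumed payload shape: a results object carrying the engine name and version.
payload = {"testEngine": {"name": "axe-core", "version": "4.10.0"}}
assert_engine_version(payload["testEngine"]["version"])  # passes silently
```

Raising a dedicated exception rather than using a bare assert keeps the guard active even when Python runs with assertions disabled.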

2. Encode a 2.2 criterion as a deterministic assertion

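Where a criterion is machine-decidable, it reduces to a boolean. The example encodes SC 2.5.8 Target Size (Minimum), which requires pointer targets to measure at least 24 by 24 CSS pixels unless one of the criterion's exceptions (spacing, equivalent, inline, user agent control, essential) applies. The assertion checks only the size rule and leaves the exceptions to review. Keep such assertions pure and side-effect free so they are trivially unit-testable and identical across environments.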

# WCAG 2.2 SC 2.5.8 — a deterministic, boolean assertion
def validate_target_size(element, min_size=24):
    """Return True when an interactive target meets the 24x24 CSS px minimum."""
    box = element.bounding_box()
    if box is None:            # off-screen or display:none — not evaluable here
        return None            # None => cantTell, routed to manual review
    return box["width"] >= min_size and box["height"] >= min_size

The None return is deliberate: an element the engine cannot measure is undecided, not passing. Preserving that third state at the assertion level is what lets both projections treat it correctly downstream.
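
To carry that third state into an atomic result, the boolean-or-None verdict has to be mapped onto the outcome vocabulary explicitly. The sketch below assumes a hypothetical StubElement in place of a real browser element handle and a hypothetical to_outcome helper; the assertion itself is repeated only so the block runs on its own.

```python
from typing import Optional


class StubElement:
    """Hypothetical stand-in for a browser element handle."""

    def __init__(self, box: Optional[dict]):
        self._box = box

    def bounding_box(self) -> Optional[dict]:
        return self._box


def validate_target_size(element, min_size=24):
    """Same assertion as above, repeated so the sketch is self-contained."""
    box = element.bounding_box()
    if box is None:
        return None
    return box["width"] >= min_size and box["height"] >= min_size


def to_outcome(verdict: Optional[bool]) -> str:
    """Map True/False/None onto pass/fail/cantTell without losing the third state."""
    if verdict is None:          # test identity first: `not None` is also truthy
        return "cantTell"
    return "pass" if verdict else "fail"


big = StubElement({"x": 0, "y": 0, "width": 44, "height": 44})
small = StubElement({"x": 0, "y": 0, "width": 16, "height": 24})
hidden = StubElement(None)

outcomes = [to_outcome(validate_target_size(e)) for e in (big, small, hidden)]
# outcomes == ["pass", "fail", "cantTell"]
```

The explicit `is None` check is the point of the helper: a plain truthiness test would fold the undecided element into fail, and an inverted one would fold it into pass.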

3. Project atomic results into a 2.2 conformance verdict

The 2.2 projection is an AND across every in-scope criterion at or below the target level. A single fail breaks the level; any cantTell is surfaced for manual review rather than absorbed. This projection is what the deploy gate reads.

# Project raw atomic results into a binary WCAG 2.2 conformance verdict
LEVEL_ORDER = {"A": 1, "AA": 2, "AAA": 3}

def conformance_22(results, registry, target_level="AA"):
    """results: list of {'rule_id', 'outcome'} atomic facts."""
    ceiling = LEVEL_ORDER[target_level]
    failures, needs_review = [], []
    for r in results:
        mapping = registry.get(r["rule_id"])
        if not mapping or LEVEL_ORDER[mapping.level] > ceiling:
            continue                        # out of scope for this level
        if r["outcome"] == "fail":
            failures.extend(mapping.success_criteria)
        elif r["outcome"] == "cantTell":
            needs_review.extend(mapping.success_criteria)
    return {
        "level": target_level,
        "conformant": not failures,
        "failed_criteria": sorted(set(failures)),
        "needs_manual_review": sorted(set(needs_review)),
    }

4. Project the same results into a 3.0 graded score

The graded projection consumes the identical atomic results and computes a weighted mean of method scores, normalized to 0–100. The aggregate outcome score is the weighted mean of individual method scores:

$$\text{score} = \frac{\sum_{i} w_{i}\, s_{i}}{\sum_{i} w_{i}} \times 100$$

Critically, cantTell results are excluded from both the numerator and the denominator — an undecided method contributes no evidence in either direction rather than free credit.

# Project the SAME atomic results into a WCAG 3.0 style graded score (0-100)
def graded_score_30(results, registry):
    """s_i is 1.0 for pass, 0.0 for fail; cantTell is excluded from the mean."""
    weighted_sum = 0.0
    total_weight = 0
    for r in results:
        mapping = registry.get(r["rule_id"])
        if not mapping or r["outcome"] == "cantTell":
            continue                        # undecided => not scored, not zeroed
        s_i = 1.0 if r["outcome"] == "pass" else 0.0
        weighted_sum += mapping.weight * s_i
        total_weight += mapping.weight
    if total_weight == 0:
        return {"score": None, "band": "unscored"}
    score = round((weighted_sum / total_weight) * 100)
    band = "Gold" if score >= 90 else "Silver" if score >= 75 else "Bronze"
    return {"score": score, "band": band}

Weighting is where the model earns its keep: assign higher weights to methods whose failure blocks a critical task — an unlabeled checkout button — and lower weights to cosmetic deviations, so the score tracks lived user impact rather than raw defect counts. Because both projections read the same results list, the pipeline never re-scans to answer a second standard, and the same findings can flow straight into the error categorization triage pipelines that turn failures into tickets. A worked end-to-end walkthrough of wiring these assertions into a running suite lives in how to map WCAG 2.2 success criteria to automated tests.
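
As a worked instance of the formula, assume four atomic results that use the step 1 registry weights: color-contrast passes (w = 5), label fails (w = 5), target-size passes (w = 3), and image-alt is cantTell (w = 4). The result set is hypothetical; only the weights come from the registry.

```latex
\begin{aligned}
\text{score}_{\text{excluded}} &= \frac{5\cdot 1 + 5\cdot 0 + 3\cdot 1}{5 + 5 + 3}\times 100 = \frac{8}{13}\times 100 \approx 62 \\
\text{score}_{\text{as pass}} &= \frac{5 + 0 + 3 + 4}{5 + 5 + 3 + 4}\times 100 = \frac{12}{17}\times 100 \approx 71 \\
\text{score}_{\text{as zero}} &= \frac{5 + 0 + 3 + 0}{5 + 5 + 3 + 4}\times 100 = \frac{8}{17}\times 100 \approx 47
\end{aligned}
```

Only the first line matches what graded_score_30 computes. The other two show how far the number moves when the undecided result is miscounted: nine points up if it is treated as a pass, fifteen points down if it is zeroed into the denominator.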

Configuration Reference

The fields below define a single RuleMapping entry and the projection parameters. Keep them in version control alongside the pinned engine version so a mapping change is always reviewable against the rule set it targets.

Field	Type	Default	Description
`rule_id`	str	—	ACT/axe-core rule identifier; the join key between the engine output and this taxonomy. Must exist in the pinned engine version.
`success_criteria`	tuple[str]	`()`	WCAG 2.2 SC numbers the rule helps prove. A rule may prove more than one (for example `label` covers 1.3.1 and 4.1.2).
`level`	str	`"AA"`	2.2 conformance level: `A`, `AA`, or `AAA`. Drives which criteria the 2.2 projection includes for a given target.
`outcome_id`	str	—	WCAG 3.0 outcome the rule contributes to; groups methods for graded aggregation.
`weight`	int	`1`	3.0 method weight. Set proportional to user-task impact, not defect frequency, so the score reflects severity.
`target_level`	str	`"AA"`	Projection input for `conformance_22`; the highest 2.2 level the gate enforces.
`ENGINE_VERSION`	str	—	Pinned engine version; assert it matches the running engine at load so rule IDs cannot drift.

Because engineers copy these rows directly into a config module, keep the table in a container that scrolls horizontally on narrow screens rather than wrapping cells.

Verification & Testing

The mapping module gates other people’s deployments, so it needs its own regression suite before it can be trusted.

ACT golden cases. For each mapped rule, run the module against the passed and failed example fixtures published in the corresponding W3C ACT rule and assert the atomic outcome matches the rule’s expected result. This proves your registry agrees with the normative definition of the criterion, not just with your own reading of it.
Projection equivalence. Feed one fixed results list through both conformance_22 and graded_score_30 and snapshot both outputs. A change in either projection that was not intended by a code change is a regression — the two must stay derivable from the same evidence.
cantTell handling. Construct a result set that is all cantTell and assert the 2.2 projection reports conformant=True, meaning only that no failure was recorded, with every criterion in needs_manual_review, while the 3.0 projection returns score=None / unscored. This is the highest-value test in the suite because it pins the one rule that most often inflates a graded score.
Engine-version guard. Assert at load time that ENGINE_VERSION equals the version of the running engine, and fail the build on mismatch. Validate every incoming payload against the shared schema first so a shape change surfaces here rather than as a silent mis-mapping. Feeding the verified verdict into the CI/CD threshold gating layer closes the loop from criterion to deploy decision.
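
The cantTell handling check could be written as the test below. It is a sketch: the two projections are condensed copies of steps 3 and 4 so the block runs on its own, and the namedtuple registry is a hypothetical two-rule stand-in for the real one.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "success_criteria level weight")
REGISTRY = {
    "color-contrast": Mapping(("1.4.3",), "AA", 5),
    "label": Mapping(("1.3.1", "4.1.2"), "A", 5),
}
LEVEL_ORDER = {"A": 1, "AA": 2, "AAA": 3}


def conformance_22(results, registry, target_level="AA"):
    """Condensed copy of the step 3 projection."""
    ceiling = LEVEL_ORDER[target_level]
    failures, needs_review = [], []
    for r in results:
        m = registry.get(r["rule_id"])
        if not m or LEVEL_ORDER[m.level] > ceiling:
            continue
        if r["outcome"] == "fail":
            failures.extend(m.success_criteria)
        elif r["outcome"] == "cantTell":
            needs_review.extend(m.success_criteria)
    return {
        "conformant": not failures,
        "failed_criteria": sorted(set(failures)),
        "needs_manual_review": sorted(set(needs_review)),
    }


def graded_score_30(results, registry):
    """Condensed copy of the step 4 projection (band lookup omitted)."""
    weighted_sum, total_weight = 0.0, 0
    for r in results:
        m = registry.get(r["rule_id"])
        if not m or r["outcome"] == "cantTell":
            continue
        weighted_sum += m.weight * (1.0 if r["outcome"] == "pass" else 0.0)
        total_weight += m.weight
    if total_weight == 0:
        return {"score": None, "band": "unscored"}
    return {"score": round(weighted_sum / total_weight * 100)}


def test_all_cant_tell():
    results = [{"rule_id": rid, "outcome": "cantTell"} for rid in REGISTRY]
    verdict = conformance_22(results, REGISTRY)
    assert verdict["conformant"] is True
    assert verdict["needs_manual_review"] == ["1.3.1", "1.4.3", "4.1.2"]
    assert graded_score_30(results, REGISTRY) == {"score": None, "band": "unscored"}


test_all_cant_tell()
```

In a real suite the test would import the production projections instead of copying them, so a regression in either one fails here first.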

Failure Modes & Troubleshooting

Rule-ID drift between the engine and the registry. Symptom: criteria silently stop being evaluated after an engine upgrade, and conformance improves for no real reason. Root cause: axe-core renamed or split a rule, so a registry entry now joins to nothing and its criteria are never scored. Fix: pin ENGINE_VERSION, assert it at load, and diff the engine’s rule list against the registry keys in CI so an unmapped rule fails the build instead of vanishing from the report.
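
The CI diff in that fix can be a few lines of set arithmetic. In this sketch the function name, the ignored parameter, and both input lists are hypothetical; in practice the engine's rule list would be exported from the pinned engine (axe-core exposes one through axe.getRules() in its JavaScript API) and passed in as plain strings.

```python
def diff_rules(engine_rule_ids, registry_keys, ignored=frozenset()):
    """Compare the engine's rule list with the registry's join keys.

    unmapped: rules the engine runs that no registry entry accounts for.
    orphaned: registry entries whose rule id the engine no longer emits.
    """
    engine = set(engine_rule_ids)
    mapped = set(registry_keys)
    return {
        "unmapped": sorted(engine - mapped - set(ignored)),
        "orphaned": sorted(mapped - engine),
    }


# Hypothetical inputs: in CI the first list would come from the pinned engine.
engine_rules = ["color-contrast", "image-alt", "label", "target-size", "region"]
registry_keys = ["color-contrast", "image-alt", "label", "target-size", "old-rule"]

report = diff_rules(engine_rules, registry_keys, ignored={"region"})
if report["unmapped"] or report["orphaned"]:
    print("registry drift:", report)   # a CI wrapper would exit non-zero here
```

The ignored set is a deliberate escape hatch for rules the organization has chosen not to map, so that the choice is recorded in code rather than implied by absence.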

cantTell counted as credit. Symptom: the 3.0 score climbs on pages that got harder to test, and manual reviewers keep finding failures the score called clean. Root cause: an undecided result was treated as pass (or zeroed into the denominator) instead of excluded. Fix: keep the three-state outcome all the way through both projections — route cantTell to manual review in 2.2 and drop it from the weighted mean entirely in 3.0, exactly as in steps 3 and 4.

Double-counting a node that fails several criteria. Symptom: one broken component tanks the score far more than its user impact warrants. Root cause: a single node emitted multiple atomic failures (a control that is both unlabeled and low-contrast), and each was weighted independently. Fix: decide the aggregation unit deliberately — either dedupe per node before scoring or accept per-method weighting as intended — and document which, so the score is reproducible and explainable.
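
If the per-node option is chosen, the dedupe step can run just before scoring. The sketch below assumes each atomic result carries a selector key and that weights is a plain rule-id-to-weight dict taken from the registry; the function name and the policy of keeping the highest-weight failure per node are illustrative choices, not the only defensible ones.

```python
def dedupe_failures_per_node(results, weights):
    """Keep at most one failure per node: the one with the highest weight."""
    worst = {}   # selector -> highest-weight failing result seen so far
    kept = []
    for r in results:
        if r["outcome"] != "fail":
            kept.append(r)           # passes and cantTell pass through unchanged
            continue
        current = worst.get(r["selector"])
        if current is None or weights[r["rule_id"]] > weights[current["rule_id"]]:
            worst[r["selector"]] = r
    # Failures are appended after the untouched results, so input order changes.
    return kept + list(worst.values())


weights = {"label": 5, "color-contrast": 5, "target-size": 3, "image-alt": 4}
results = [
    {"selector": "#pay", "rule_id": "target-size", "outcome": "fail"},
    {"selector": "#pay", "rule_id": "label", "outcome": "fail"},
    {"selector": "img.logo", "rule_id": "image-alt", "outcome": "pass"},
    {"selector": "#pay", "rule_id": "color-contrast", "outcome": "cantTell"},
]
deduped = dedupe_failures_per_node(results, weights)
```

Here the #pay control contributes one failure (label, weight 5) instead of two, while its undecided contrast result still reaches manual review. Apply this only to the 3.0 projection input: the 2.2 projection needs every failure to attribute criteria correctly.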

Weight calibration that hides critical failures. Symptom: a page with a keyboard-trapped checkout still lands in the Silver band. Root cause: weights were set by defect frequency or assigned uniformly, so a task-blocking failure carries the same weight as a cosmetic one. Fix: weight methods by user-task impact and validate the band boundaries against a set of hand-scored reference pages, treating any critical blocker as a hard cap on the band regardless of the weighted mean.
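
The hard cap can be applied as a post-processing step on the graded result. This is a sketch under stated assumptions: the function name, the capped and capped_by keys, the choice of Bronze as the default cap, and the rule id keyboard-trap are all hypothetical.

```python
BAND_ORDER = ["Bronze", "Silver", "Gold"]   # lowest to highest


def cap_band(graded, blocking_failures, cap="Bronze"):
    """Lower the band to `cap` when any task-blocking failure is present.

    graded: {'score': int, 'band': str} as returned by the graded projection.
    blocking_failures: rule ids judged task-blocking (hypothetical input).
    """
    if not blocking_failures or graded.get("band") not in BAND_ORDER:
        return dict(graded, capped=False)        # nothing to cap, or unscored
    if BAND_ORDER.index(graded["band"]) <= BAND_ORDER.index(cap):
        return dict(graded, capped=False)        # already at or below the cap
    return dict(graded, band=cap, capped=True, capped_by=sorted(blocking_failures))


capped = cap_band({"score": 82, "band": "Silver"}, {"keyboard-trap"})
untouched = cap_band({"score": 82, "band": "Silver"}, set())
```

The numeric score is left unchanged so trend lines stay comparable; only the band, which is what a reader acts on, reflects the blocker.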

AAA criteria leaking into the AA gate. Symptom: builds fail on criteria the organization never committed to. Root cause: the projection did not filter by LEVEL_ORDER, so AAA rules in the registry entered an AA verdict. Fix: enforce the level ceiling in conformance_22 (as shown) and unit-test that an AAA failure leaves an AA verdict conformant while still surfacing in a separate AAA report.

Frequently Asked Questions

Should I gate deploys on the WCAG 3.0 graded score today?

No. WCAG 3.0 is a Working Draft and its scoring model can still change, so gate on the binary WCAG 2.2 projection and treat the 3.0 score as telemetry you accumulate in parallel. Because both derive from the same atomic results, you lose nothing by tracking the score early and can promote it to a gate once the standard stabilizes.

How do I map a criterion that axe-core reports as incomplete?

Translate the engine’s incomplete into your cantTell outcome, never into pass. incomplete means the engine could not decide, so the 2.2 projection routes it to manual review and the 3.0 projection excludes it from the weighted mean. Collapsing it to a pass is the most common source of an inflated score.
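
A normalizer along these lines makes the translation explicit. It is a sketch that assumes the engine output is an axe-core style results object with passes, violations, and incomplete arrays, each rule carrying an id and a list of nodes with a target selector array; the function name and the output keys are hypothetical.

```python
def normalize_axe_results(axe_results):
    """Flatten an axe-core style results object into atomic results."""
    bucket_to_outcome = {
        "passes": "pass",
        "violations": "fail",
        "incomplete": "cantTell",   # undecided: never collapse into pass
    }
    atomic = []
    for bucket, outcome in bucket_to_outcome.items():
        for rule in axe_results.get(bucket, []):
            for node in rule.get("nodes", []):
                atomic.append({
                    "rule_id": rule["id"],
                    "selector": " ".join(map(str, node.get("target", []))),
                    "outcome": outcome,
                    "impact": node.get("impact") or rule.get("impact"),
                })
    return atomic


sample = {
    "passes": [{"id": "image-alt", "impact": None, "nodes": [{"target": ["img.logo"]}]}],
    "violations": [{"id": "label", "impact": "critical",
                    "nodes": [{"target": ["#email"], "impact": "critical"}]}],
    "incomplete": [{"id": "color-contrast", "impact": "serious",
                    "nodes": [{"target": ["h1"], "impact": "serious"}]}],
    "inapplicable": [{"id": "target-size", "nodes": []}],
}
atomic = normalize_axe_results(sample)
```

The inapplicable bucket is skipped on purpose: a rule that found nothing to test is neither evidence of a pass nor a reason for review.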

One axe-core rule maps to several success criteria — how do I score it?

Store all the criteria a rule proves in success_criteria so the 2.2 projection can attribute a failure to each, but decide a single aggregation unit for 3.0 (per node or per method) and apply it consistently. Attributing one failure to three criteria in the score without deduping triple-counts the same defect.

Do I need a second scanner to produce a WCAG 3.0 score?

No. A second scanner is the anti-pattern this page exists to prevent. Run one traversal, persist atomic results as the source of truth, and add a pure projection function per standard. A new conformance model becomes a new projection over existing evidence, not a new scan.

Where do the Bronze / Silver / Gold band thresholds come from?

They are illustrative in the current draft, so treat the 90/75 cutoffs in the example as configuration, not fixed law, and pin them per organization. Calibrate the boundaries against hand-scored reference pages and re-check them whenever a new draft revises the scoring rubric.
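
Treating the cutoffs as configuration can be as simple as passing them in. In this sketch the function name, the tuple layout, the "unrated" label for scores below every minimum, and the strict override are hypothetical; only the 90/75 defaults come from the example in step 4.

```python
DEFAULT_BANDS = (("Gold", 90), ("Silver", 75), ("Bronze", 0))  # illustrative cutoffs


def band_for(score, bands=DEFAULT_BANDS):
    """Return the highest band whose minimum the score meets."""
    if score is None:
        return "unscored"            # no decidable evidence at all
    for name, minimum in sorted(bands, key=lambda b: b[1], reverse=True):
        if score >= minimum:
            return name
    return "unrated"                 # below every configured minimum


strict = (("Gold", 95), ("Silver", 85), ("Bronze", 60))  # hypothetical org override
```

With the defaults, band_for(89) returns "Silver" and band_for(74) returns "Bronze"; with the strict override, a score of 90 drops to "Silver" and a score of 50 is "unrated". Keeping the thresholds in one tuple means a new draft's rubric is a configuration change rather than a code change.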
