JSON Schema Validation for Accessibility Data

Automated auditing at enterprise scale produces a firehose of semi-structured telemetry: when scanners traverse thousands of routes across dynamic single-page applications, the accessibility reports that land downstream are rarely uniform. One scanner emits a violations array with an impact string; another nests the same finding two levels deeper and omits severity entirely; a browser crash truncates a payload mid-array. The specific obstacle this page solves is contract enforcement at the ingestion boundary — how to guarantee that every accessibility finding conforms to a single, predictable shape before it reaches a triage queue, an issue tracker, or a compliance data lake, and how to quarantine the records that do not without losing the audit trail.

This guide sits inside the broader Automated Scanning & Dynamic Content Ingestion strategy: where the sibling guides establish how findings are produced, this one establishes the shape they must arrive in. It defines the seam that lets the batch validation architecture aggregate strictly and lets error categorization triage pipelines skip defensive parsing entirely. Get the schema wrong and you inherit the classic ingestion failure signature — silent data corruption, unpredictable KeyError exceptions in the aggregator, and violation counts nobody trusts because malformed records were dropped instead of recorded.

Prerequisites and Environment Context

Schema validation is a contract, and a contract only holds if every party pins the same version of it. Settle the following before writing any validation code:

Python 3.11+ for the validation layer. The examples use dataclasses and modern typing; nothing below requires a newer runtime, but pin it so behaviour is identical on the crawl host and in CI.
jsonschema 4.18+ as the validator. This is the release line where the legacy RefResolver was deprecated in favour of the standalone referencing library. Pin a recent 4.x rather than depending on the old resolver API, which will be removed.
A single, version-controlled schema file — the finding contract. Store it in the repository (schemas/finding.schema.json), tag it with a $id, and treat any change to it as a breaking API change subject to review. This is the same schema the batch validation architecture loads at its aggregation gate.
Draft 2020-12 as the dialect. It is a widely supported modern draft; it provides prefixItems, unevaluatedProperties, and stable $ref semantics; and the axe-core result shape maps onto it cleanly.

Environment parity matters here as much as it does for the scanners themselves: the schema must describe the output of the exact engine build your workers run. A minor engine upgrade can add a field or change an impact enum, and a schema pinned to the old shape will reject valid records or, worse, accept malformed ones. Version the schema and the engine together, and fail CI when they drift out of lockstep.

Conceptual Model: A Deterministic Gate, Not a Logger

Validation is a stateless, deterministic gate. A payload enters, and exactly one of two things happens: it conforms and advances to triage, or it fails and is routed — with its recoverable metadata intact — to a quarantine queue for schema-drift analysis. Nothing is dropped, and nothing malformed passes. The distinction that trips teams up is that a validator is not a logger: its job is to make a routing decision, not to record that something looked odd.

The mechanism has three moving parts. First, the schema itself declares the contract — required fields, type constraints, and closed enumerations for severity. Second, a compiled validator instance walks each payload and yields a structured error for every point of divergence, each carrying a JSON path to the exact offending node. Third, a routing wrapper turns “zero errors” into advance and “one or more errors” into quarantine, extracting scan_id and url even from a corrupted record so the audit trail survives.

Put as a flow: a raw payload enters the validator, conforming records advance to triage, and malformed ones are quarantined rather than discarded.

The reason iter_errors() is preferred over a boolean is_valid() check is that a single malformed payload usually has several problems at once, and a triage engineer needs all of them — the JSON path plus the constraint that failed — to diagnose drift in one pass rather than fixing one field, re-running, and discovering the next. This early validation prevents malformed records from polluting the accessibility data warehouse and ensures remediation teams receive only actionable findings. Without the gate, engineering teams face unpredictable parsing exceptions, inconsistent WCAG violation categorization, and inflated false-positive rates that erode trust in the whole automated audit.

Step-by-Step Implementation

1. Define the Canonical Schema

Author a strict schema that mirrors the enterprise accessibility data contract. Enforce required fields, restrict severity to a closed enumeration, and validate nested violation objects down to the node level. Reference the official JSON Schema specification for draft compliance and advanced constraint definitions. The additionalProperties: false at the root is deliberate — it turns an unexpected new field from a silent pass into an explicit rejection, which is how you catch an engine upgrade that changed the output shape.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://www.wcag-audit.org/schemas/finding.schema.json",
  "title": "AccessibilityAuditReport",
  "type": "object",
  "required": ["scan_id", "url", "timestamp", "violations"],
  "properties": {
    "scan_id": { "type": "string", "format": "uuid" },
    "url": { "type": "string", "format": "uri" },
    "timestamp": { "type": "string", "format": "date-time" },
    "violations": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["rule_id", "impact", "nodes"],
        "properties": {
          "rule_id": { "type": "string", "pattern": "^[a-z0-9-]+$" },
          "impact": { "type": "string", "enum": ["critical", "serious", "moderate", "minor"] },
          "nodes": {
            "type": "array",
            "minItems": 1,
            "items": {
              "type": "object",
              "required": ["html", "target"],
              "properties": {
                "html": { "type": "string" },
                "target": { "type": "array", "items": { "type": "string" }, "minItems": 1 }
              }
            }
          }
        }
      }
    }
  },
  "additionalProperties": false
}

The impact enumeration mirrors axe-core’s four severity levels exactly; keeping it closed means a typo like "severe" fails validation instead of flowing downstream and silently escaping every severity filter. The nodes.minItems: 1 constraint encodes a real invariant — a violation with no affected node is a serialization bug, not a finding, and should be quarantined for inspection.

2. Implement the Validation Engine

Wrap the schema in a reusable class. Compile the validator once and reuse it across every payload — in a high-throughput pipeline, recompiling per record is a measurable cost. Call check_schema at construction so an invalid schema fails fast at startup rather than on the first payload in production.

import json
from jsonschema import Draft202012Validator, ValidationError
from typing import Any, Dict, List

class AccessibilityPayloadValidator:
    def __init__(self, schema_path: str):
        with open(schema_path, "r") as f:
            self.schema = json.load(f)
        # Fail fast if the schema itself is malformed, before building the validator.
        Draft202012Validator.check_schema(self.schema)
        # One compiled instance, reused across every payload in the pipeline.
        self.validator = Draft202012Validator(self.schema)

    def validate(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        errors: List[str] = []
        # iter_errors surfaces every divergence at once, each with a JSON path,
        # so triage sees the whole picture instead of one error per re-run.
        for error in self.validator.iter_errors(payload):
            path = ".".join(str(p) for p in error.absolute_path) or "root"
            errors.append(f"{path}: {error.message}")

        if errors:
            raise ValidationError(
                f"Accessibility payload validation failed with "
                f"{len(errors)} error(s):\n" + "\n".join(errors)
            )
        return payload

3. Route Failures Instead of Discarding Them

When validation fails, do not drop the payload — extract whatever is recoverable and quarantine the rest. A fallback serializer that pulls scan_id, url, and timestamp even when the violations array is corrupted keeps the audit trail complete, which is a hard requirement for compliance evidence. This mirrors the quarantine-not-discard rule enforced by error categorization triage pipelines, where data loss during ingestion failure is unacceptable.

from dataclasses import dataclass, field
from typing import Any, Dict, List

from jsonschema import ValidationError

@dataclass
class RoutingResult:
    valid: List[Dict[str, Any]] = field(default_factory=list)
    quarantined: List[Dict[str, Any]] = field(default_factory=list)

def route(validator: AccessibilityPayloadValidator,
          payloads: List[Dict[str, Any]]) -> RoutingResult:
    result = RoutingResult()
    for payload in payloads:
        try:
            result.valid.append(validator.validate(payload))
        except ValidationError as exc:
            # Preserve identity even when the body is corrupt, so drift is traceable.
            # A truncated payload may not parse to an object at all, so guard .get().
            meta = payload if isinstance(payload, dict) else {}
            result.quarantined.append({
                "scan_id": meta.get("scan_id", "unknown"),
                "url": meta.get("url", "unknown"),
                "timestamp": meta.get("timestamp"),
                "errors": str(exc),
            })
    return result
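
Quarantined records still need somewhere durable to land. The following is one possible sink, sketched under assumptions: the function name write_quarantine and the one-file-per-record layout are hypothetical, and the records are assumed to be plain dictionaries like the ones built above.

```python
import json
import re
from pathlib import Path
from typing import Any, Dict, List


def write_quarantine(records: List[Dict[str, Any]], quarantine_dir: str) -> List[Path]:
    """Persist each quarantined record as its own JSON file (hypothetical layout)."""
    sink = Path(quarantine_dir)
    sink.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, record in enumerate(records):
        # scan_id may be malformed or "unknown", so sanitize it for use in a
        # filename and prefix an index to keep names unique.
        raw_id = str(record.get("scan_id", "unknown"))
        safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", raw_id)
        path = sink / f"{index:06d}-{safe_id}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written
```

Writing one file per record keeps a corrupt entry from blocking the rest and makes individual records easy to attach to a drift report.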

For teams whose findings carry richer structure — localization tags, framework-specific ARIA attributes, or compliance-mapping layers — the child guide on validating accessibility metadata with JSON Schema extends this contract to cover those nested cases without loosening the root guarantees.

Configuration Reference

The parameters that govern how strictly the gate behaves and how it scales. Tune them against real payload volume and your tolerance for drift, not aspirationally.

Parameter	Type	Default	Description
`SCHEMA_PATH`	str	`schemas/finding.schema.json`	Path to the version-controlled contract. Treat any edit as a breaking change under review.
`DIALECT`	str	`draft2020-12`	JSON Schema draft. Pin it; mixing dialects across services silently changes `$ref` and `items` semantics.
`STRICT_FORMAT`	bool	`true`	Whether `format` (uuid, uri, date-time) is asserted rather than treated as annotation-only. Requires a `format_checker` on the validator.
`ADDITIONAL_PROPERTIES`	bool	`false`	Reject unknown fields at the root. Keep `false` so an engine upgrade that adds a field surfaces as a failure, not a silent pass.
`QUARANTINE_DIR`	str	`artifacts/quarantined/`	Sink for records that fail validation. Never route failures to `/dev/null`.
`FAIL_FAST_FIELDS`	list	`["scan_id", "url"]`	Fields whose absence hard-fails the CI stage rather than merely quarantining the record.
`BATCH_SIZE`	int	`500`	Records validated per chunk in async mode. Cap it so a corrupt megapayload cannot exhaust worker memory.

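To assert format keywords such as uuid and uri rather than treating them as annotations, pass a format checker when constructing the validator: Draft202012Validator(schema, format_checker=Draft202012Validator.FORMAT_CHECKER). Without it, the impact enum is still enforced but a malformed scan_id slips through, because jsonschema does not assert format unless a checker is supplied. Some formats, including uri and date-time, are only checked when their optional dependencies are installed (for example via the jsonschema[format] extra); if those packages are missing, the check is skipped silently.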

Verification and Testing

Golden-payload conformance. Keep a fixture of one known-good payload and one deliberately corrupted payload (missing url, bad impact enum, empty nodes). Assert the validator passes the first and raises on the second in a unit test. This is your early warning that an engine upgrade changed the output shape.
Schema self-check in CI. Run Draft202012Validator.check_schema() against the contract as its own CI step, so a malformed schema fails the build before it ever reaches a payload.
Round-trip on real output. Point the validator at an artifact directory of actual recent scan results and assert the quarantine rate is near zero. A sudden spike means drift — the engine and the schema have diverged.
Format assertion coverage. Feed a payload whose scan_id is a valid string but not a valid UUID (e.g. "not-a-uuid") and confirm it is rejected. If it passes, the format_checker is not wired in.
Quarantine integrity. Corrupt the violations body of a payload while leaving scan_id and url intact, then assert the quarantined record still carries both identifiers. The audit trail must survive a malformed body.

CI/CD Integration and Threshold Gating

Embedding validation in the pipeline turns accessibility auditing from a post-deployment activity into a continuous quality gate. Position it as a dedicated stage after scan execution but before artifact archival or database ingestion: Run Scans → Validate Payloads → Threshold Gate → Archive / Ingest. The same CI/CD threshold gating discipline that governs violation counts governs schema conformance here — a run that quarantines more than a set fraction of its records should fail the gate, because that fraction is a drift signal, not noise to ignore.

Frontend QA teams and accessibility specialists should also run validation locally before committing scan configurations or custom rule definitions. A lightweight pre-commit hook that validates example payloads against the canonical schema catches structural regressions early and keeps pipeline noise down.

# .github/workflows/accessibility-validation.yml
name: Validate Accessibility Telemetry
on:
  workflow_run:
    workflows: ["Run Accessibility Scans"]
    types: [completed]

jobs:
  validate-payloads:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install "jsonschema[format-nongpl]>=4.18" pyyaml
      - name: Run Schema Validation
        run: |
          python scripts/validate_accessibility.py \
            --schema schemas/finding.schema.json \
            --input-dir artifacts/scan-results/ \
            --output-dir artifacts/validated/ \
            --quarantine-dir artifacts/quarantined/ \
            --fail-over-quarantine-rate 0.02
      - name: Upload Validated Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: validated-audit-reports
          path: artifacts/validated/

Configure the gate to fail fast on structural violations that break identity (a missing url or scan_id) while routing non-critical deviations — extra metadata fields, an unexpected optional property — to quarantine for review. This dual strategy keeps the pipeline moving while preserving every record for the compliance audit trail. Bind the blocking decision to the conformance target defined in your A/AA/AAA compliance level mapping, so the gate encodes legal intent rather than an arbitrary threshold.
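
One way to express that dual strategy is a small decision function. This is a sketch under assumptions: the name gate_exit_code is hypothetical, the 0.02 default mirrors the --fail-over-quarantine-rate value in the workflow, and "unknown" is assumed to be the placeholder written when an identifier could not be recovered.

```python
from typing import Any, Dict, List, Sequence


def gate_exit_code(
    valid_count: int,
    quarantined: List[Dict[str, Any]],
    max_quarantine_rate: float = 0.02,
    fail_fast_fields: Sequence[str] = ("scan_id", "url"),
) -> int:
    """Return 0 to pass the CI stage, 1 to fail it."""
    # Identity loss fails immediately: such records cannot be triaged later.
    for record in quarantined:
        if any(record.get(name) in (None, "unknown") for name in fail_fast_fields):
            return 1

    total = valid_count + len(quarantined)
    if total == 0:
        # Nothing was scanned. Passing here is an assumption; some teams
        # may prefer to treat an empty run as a failure.
        return 0

    # Otherwise the quarantine rate is the drift signal.
    rate = len(quarantined) / total
    return 1 if rate > max_quarantine_rate else 0
```

A script would pass the result to sys.exit so the workflow step fails when the gate does.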

Scaling to Asynchronous Validation

As scanning grows to thousands of routes, synchronous validation on the result-handling worker introduces latency. Move validation onto a dedicated consumer behind a message broker — RabbitMQ or AWS SQS — so payloads are pulled, validated in parallel batches, and republished to downstream analytics or ticketing without blocking the scanners. Reuse one compiled validator instance per consumer process and stream large payloads rather than loading entire JSON blobs into memory. Pair this with connection pooling on downstream writes to hold throughput during peak crawl windows. The routing tier this feeds is the same one described in the batch validation architecture; validation is the gate immediately before its aggregation step.

Failure Modes and Troubleshooting

Everything passes but downstream still breaks on bad data. The schema validates but a malformed scan_id or url slips through. Root cause: jsonschema does not assert format keywords unless a format checker is supplied, so uuid and uri are never checked. Fix: construct the validator with an explicit format_checker, and add a golden test that feeds an identifier that is a valid string but not a valid UUID and expects rejection.

Quarantine rate spikes overnight with no code change. A large fraction of records suddenly fails on an unexpected property. Root cause is almost always an engine upgrade that added or renamed a field, colliding with additionalProperties: false. Fix: pin the engine and schema versions together, alert on quarantine rate rather than only on hard failures, and update the contract as a reviewed change — never by loosening additionalProperties to true to make the alert stop.

Validator throughput collapses under load. Latency climbs linearly with payload volume. Root cause: the schema is being recompiled per record, or is_valid() is being called and then iter_errors() re-run for the message — two full validation passes. Fix: compile one Draft202012Validator at startup and reuse it, and call iter_errors() once, treating an empty result as valid.

Memory exhaustion on a single giant payload. A worker is OOM-killed mid-batch. Root cause: a route emitted an enormous violations array (often an infinite-scroll page that was never bounded during the crawl) and the whole blob was loaded and validated at once. Fix: cap BATCH_SIZE, stream-parse large documents, and bound traversal upstream via async crawling for infinite scroll pages so payload size stays predictable.

Legitimate dynamic findings rejected as malformed. Records from client-rendered components fail nodes.minItems or carry lazily-attached ARIA attributes the schema does not model. Root cause: evaluation fired before the framework settled, producing a structurally incomplete finding. Fix: gate evaluation on a readiness signal per dynamic content boundary detection rather than loosening the schema, so the contract stays strict while the input becomes correct.

Frequently Asked Questions

Why validate with jsonschema when pydantic could enforce the same contract?

Both work, and the choice is about where the contract lives. A standalone JSON Schema file is language-neutral: the same finding.schema.json validates payloads in Python workers, in a JavaScript pre-commit hook, and in any third-party consumer, and it can be published as a versioned artifact. pydantic is excellent when the contract lives inside one Python service and you want typed models plus parsing in one step — and it can emit a JSON Schema from those models. For a cross-language ingestion boundary shared by several services, keep the schema as the source of truth and let each language load it.

Should I set additionalProperties to false or true at the root?

Keep it false on the ingestion contract. The whole point of the gate is to notice when the engine output changes shape, and additionalProperties: false is what turns a new, unmodelled field into an explicit rejection you can review rather than a silent pass that drifts into the data lake. Loosen it only on nested objects you deliberately treat as open-ended, and never as a quick fix to silence a quarantine-rate alert.

My validator accepts a bad UUID or malformed URL — why?

Because jsonschema treats format keywords (uuid, uri, date-time) as annotations unless you opt into assertion. Construct the validator with format_checker=Draft202012Validator.FORMAT_CHECKER so those formats are actually enforced. Without it, the string type passes and the format constraint is ignored. Note that uri and date-time checks also depend on optional packages, which the jsonschema[format] extra installs.

RefResolver is showing as deprecated — what replaced it?

In jsonschema 4.18+, $ref resolution moved to the separate referencing library, and the old RefResolver is deprecated pending removal. If your schema splits into multiple files via $ref, build a referencing.Registry, load each subschema into it, and pass it to the validator rather than relying on the legacy resolver. Pin a recent 4.x release so you are on the supported path.

Should the CI gate fail on every quarantined record?

No. Hard-fail only when a record loses identity — a missing scan_id or url — because those cannot be triaged after the fact. Route everything else to quarantine and fail the gate only when the quarantine rate crosses a threshold (e.g. 2%), since a rate spike is the real drift signal. A per-record hard fail pushes teams to loosen the schema to keep CI green, which defeats the contract.
