Audit Data Storage & Retention Policies

Enterprise accessibility scanning is a firehose: forty thousand routes evaluated on every merge emit millions of violation nodes, DOM fragments, and engine-version stamps per week. The specific obstacle this page solves is keeping that telemetry durable, queryable, and legally defensible without letting it become an unbounded cost centre or a compliance liability. Naive storage — dumping raw scanner JSON into an ever-growing table — produces four predictable failures: storage bills that grow superlinearly, findings whose provenance is lost the moment the rule engine updates, personally identifiable data captured in DOM snapshots that no retention job can locate to delete, and longitudinal trend reports that are statistically meaningless because nobody recorded which engine version produced which number. This guide treats the persistence layer as first-class engineering: a schema that preserves reproducibility, a tiered lifecycle that ages records through hot, warm, and cold storage automatically, and cryptographic deletion routines that satisfy regulatory sunset requirements.

This work is part of the broader Enterprise WCAG Audit Architecture & Standards Mapping strategy. Where the parent guide establishes how the evaluation pipeline produces findings, this page establishes what happens to those findings after aggregation — the storage contract, the lifecycle policy, and the deletion guarantees. It sits directly downstream of batch validation architecture, which fans scanning across workers and emits the schema-valid dataset this layer persists, and directly upstream of the compliance reports that governance boards and external auditors consume.

Prerequisites and Environment Context

Storage and retention are infrastructure code, and drift between the database schema, the ingestion writer, and the lifecycle job is the most common source of orphaned or unreconcilable data. Pin the following before implementing anything below:

PostgreSQL 15+ (or a managed equivalent such as Cloud SQL / RDS) for the relational store. The examples use JSONB, TIMESTAMPTZ, and a partial index, all of which predate 15 (generated columns, should you add them, arrived in 12), so the pin is about keeping every environment on one major version rather than unlocking a feature. A document store (DynamoDB, Cosmos DB) is a valid alternative for the raw-payload tier, but the run/finding relationship below assumes relational integrity.
Object storage with lifecycle rules and object-lock — Amazon S3, GCS, or Azure Blob. The archive tier depends on storage-class transitions (GLACIER_IR / Nearline) and, for legal-hold scenarios, write-once-read-many (WORM) object-lock.
Python 3.11+ with boto3 1.34+ and psycopg 3.x for the lifecycle orchestration. Pin these in the same lockfile the scanners use so a boto3 bump cannot silently change StorageClass defaults.
Retention policy as code. The tier windows, deletion actions, and legal-hold exceptions live in a version-controlled retention_policy.yaml, never in a console setting. The lifecycle job reads that file; changing retention is a reviewed pull request, not a click.
A pinned finding contract. The records written here must already conform to the schema defined in JSON Schema validation for accessibility data, so the storage layer can assume a fixed shape and index it rather than defensively parse it.

Environment parity matters here too: the database the ingestion writer targets in CI must have the same migration state as production, or a scan will write findings the lifecycle job cannot later classify. Run schema migrations as a gated CI step before any ingestion job is allowed to connect.

Conceptual Model: Provenance-First Records Aging Through Tiers

The design rests on two decisions. First, every record carries enough provenance to be replayed — the engine version, spec version, and rule-set hash that produced it — so a violation count is never a bare number but an assertion tied to the exact evaluator that made it. Second, records age through storage tiers by policy, not by hand, so cost and legal exposure both shrink automatically as data gets older.

Provenance is what makes longitudinal analysis honest. When an evaluation engine transitions between specification releases, a raw count of “target-size” violations can drop by half simply because the rule changed, not because the site improved. Storing engine_version, spec_version, and rule_set_hash alongside every finding lets analysts route legacy findings to historical partitions and normalize new outputs against current taxonomies — the same mapping problem worked through in the WCAG 2.2 vs 3.0 Success Criteria Taxonomy. Without it, every engine upgrade silently poisons the trend line.
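
The source does not say how rule_set_hash is computed, so the following is only a minimal sketch under one assumption: the active rule configuration can be serialized as JSON. The config keys (`tags`, `disabled_rules`) are hypothetical and stand in for whatever your engine actually accepts.

```python
import hashlib
import json

def rule_set_hash(rule_config: dict) -> str:
    """Return a stable SHA-256 hex digest of the active rule configuration."""
    # Canonical form: sorted keys and fixed separators, so two semantically
    # identical configs hash identically regardless of key order or whitespace.
    canonical = json.dumps(rule_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical rule configs, for illustration only.
baseline = {"tags": ["wcag2a", "wcag2aa"], "disabled_rules": []}
reordered = {"disabled_rules": [], "tags": ["wcag2a", "wcag2aa"]}
experimental = {"tags": ["wcag2a", "wcag2aa", "experimental"], "disabled_rules": []}

same = rule_set_hash(baseline) == rule_set_hash(reordered)      # key order ignored
changed = rule_set_hash(baseline) != rule_set_hash(experimental)  # new tag detected
```

A SHA-256 hex digest is 64 characters, which matches a VARCHAR(64) column. Note that list order still matters in this sketch: if your tag lists are unordered, sort them before hashing.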

The tiered lifecycle then moves each record through three windows, applying a distinct action as it ages past each boundary:

Tier	Window	Access Pattern	Use Case
Hot / Active	0–24 months	Full query, real-time dashboards	Sprint remediation tracking, regression baselines, CI gate validation
Warm / Archive	24–60 months	Restricted query, batch retrieval	Quarterly compliance reporting, legal discovery, trend analysis
Cold / Purge	>60 months	Cryptographic deletion	Regulatory sunset, PII/DOM-fragment sanitization, cost reduction

Beyond the active threshold, full records transition to object storage under restricted IAM, while the database keeps only aggregated compliance metrics — the numbers reports need without the bulky DOM payloads. Past the archive threshold, cryptographic deletion purges personally identifiable information, session tokens, and DOM fragments that could expose internal routing logic, while the small aggregate rollups survive to preserve the historical trend line. The key insight is that deletion and analytics are not in tension: you delete the raw, reconstructable payload and retain the derived, anonymized metric.
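
The tier boundaries can be expressed as a small pure function, which is handy for dry-run tests. This is a sketch, not the lifecycle job itself: the name `classify_tier` is hypothetical, months are approximated as 730 and 1825 days, and a record exactly on a boundary is treated as still belonging to the younger tier.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 730     # roughly 24 months: end of the hot window
PURGE_DAYS = 1825  # roughly 60 months: end of the warm window

def classify_tier(created_at: datetime, now: datetime,
                  hot_days: int = HOT_DAYS, purge_days: int = PURGE_DAYS) -> str:
    """Map a record's age to 'hot', 'warm', or 'cold'.

    Both datetimes must be timezone-aware. A record exactly at a boundary
    stays in the younger tier; it moves only once it is strictly older.
    """
    age = now - created_at
    if age <= timedelta(days=hot_days):
        return "hot"
    if age <= timedelta(days=purge_days):
        return "warm"
    return "cold"

# Fixed clock so the example is reproducible.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
recent = classify_tier(now - timedelta(days=10), now)
aging = classify_tier(now - timedelta(days=900), now)
expired = classify_tier(now - timedelta(days=2000), now)
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function is what makes clock-shifted fixtures straightforward.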

Step-by-Step Implementation

1. Design a normalized, provenance-carrying schema

Decouple the audit run (one execution, its configuration, and its provenance) from its findings (many violations, each pointing at a hashed DOM snapshot rather than embedding it). The run row is small and queried constantly; the finding rows are numerous and the DOM payload is the heavy part, so the snapshot lives in object storage and the row keeps only its hash.

-- PostgreSQL normalized schema: runs carry provenance, findings carry pointers
CREATE TABLE audit_runs (
    run_id            UUID PRIMARY KEY,
    target_url        TEXT NOT NULL,
    engine_version    VARCHAR(20) NOT NULL,   -- e.g. axe-core 4.9.1
    spec_version      VARCHAR(10) NOT NULL,   -- e.g. WCAG 2.2
    rule_set_hash     VARCHAR(64) NOT NULL,   -- hash of the active rule config
    execution_context JSONB NOT NULL,         -- scan_initiated, dom_render_complete, ...
    status            VARCHAR(20) DEFAULT 'active',  -- active | archived | purged
    created_at        TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE audit_findings (
    finding_id        UUID PRIMARY KEY,
    run_id            UUID REFERENCES audit_runs(run_id) ON DELETE CASCADE,
    wcag_criterion    VARCHAR(10) NOT NULL,   -- e.g. 1.4.3
    severity          VARCHAR(20) NOT NULL,   -- minor | moderate | serious | critical
    dom_snapshot_hash VARCHAR(64) NOT NULL,   -- pointer into object storage, not the blob
    violation_payload JSONB NOT NULL,
    remediation_status VARCHAR(30) DEFAULT 'open'
);

-- The lifecycle job scans by age and status; this partial index keeps it cheap.
CREATE INDEX idx_runs_lifecycle ON audit_runs (created_at) WHERE status = 'active';

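Use an immutable identifier per run (a UUID or ULID minted once per pipeline execution and reused on every retry, paired with the rule_set_hash) so parallel pipeline shards cannot overwrite each other, a replayed batch maps back to the same row, and any run can be reproduced from its recorded configuration. A UUIDv4 regenerated on each retry would defeat the idempotent writes in the next step.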

2. Write findings idempotently at ingestion time

The ingestion writer is the seam with the scanning pipeline. It must be idempotent: re-running the same batch (a retried CI job, a replayed shard) must not double-count. Key the run on its deterministic run ID and let a conflict be a no-op.

import json
import psycopg  # psycopg 3.x

def persist_run(conn, run: dict, findings: list[dict]) -> None:
    """Idempotently store one audit run and its findings in a single transaction."""
    with conn.transaction():  # all-or-nothing: a partial run is never visible
        conn.execute(
            """
            INSERT INTO audit_runs
              (run_id, target_url, engine_version, spec_version,
               rule_set_hash, execution_context)
            VALUES (%(run_id)s, %(target_url)s, %(engine_version)s,
                    %(spec_version)s, %(rule_set_hash)s, %(execution_context)s)
            ON CONFLICT (run_id) DO NOTHING
            """,
            {**run, "execution_context": json.dumps(run["execution_context"])},
        )
        # executemany keeps the round-trips bounded even for large finding sets.
        conn.cursor().executemany(
            """
            INSERT INTO audit_findings
              (finding_id, run_id, wcag_criterion, severity,
               dom_snapshot_hash, violation_payload)
            VALUES (%(finding_id)s, %(run_id)s, %(wcag_criterion)s,
                    %(severity)s, %(dom_snapshot_hash)s, %(violation_payload)s)
            ON CONFLICT (finding_id) DO NOTHING
            """,
            [{**f, "run_id": run["run_id"],
              "violation_payload": json.dumps(f["violation_payload"])}
             for f in findings],
        )
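
The ON CONFLICT clauses above only absorb a replay if the retried job presents the same IDs. One way to guarantee that, sketched here as an assumption rather than the pipeline's actual scheme, is to swap random UUIDs for name-based UUIDv5 values derived from stable inputs. The namespace URL, the `pipeline_id` argument, and the `selector` argument are all hypothetical.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
AUDIT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/audit")

def deterministic_run_id(pipeline_id: str, target_url: str,
                         rule_set_hash: str) -> uuid.UUID:
    """Same pipeline, target, and rule set always yield the same run ID."""
    name = f"{pipeline_id}|{target_url}|{rule_set_hash}"
    return uuid.uuid5(AUDIT_NAMESPACE, name)

def deterministic_finding_id(run_id: uuid.UUID, wcag_criterion: str,
                             dom_snapshot_hash: str, selector: str) -> uuid.UUID:
    """Derive a finding ID from its run plus the fields that identify it."""
    # The run ID doubles as the namespace, so findings from different runs
    # never collide even when criterion and selector are identical.
    name = f"{wcag_criterion}|{dom_snapshot_hash}|{selector}"
    return uuid.uuid5(run_id, name)

first = deterministic_run_id("pipeline-1001", "https://example.invalid/checkout", "ab" * 32)
retry = deterministic_run_id("pipeline-1001", "https://example.invalid/checkout", "ab" * 32)
other = deterministic_run_id("pipeline-1002", "https://example.invalid/checkout", "ab" * 32)
```

Here `first` and `retry` are equal, so a retried CI job hits the conflict path instead of inserting a second row, while `other` gets its own row.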

3. Orchestrate the lifecycle with an idempotent retention job

The lifecycle manager evaluates run metadata against the policy, archives serializable payloads to object storage, and flips the row status. It must be safe to run repeatedly — the status filter guarantees a re-run skips already-archived rows rather than re-uploading them.

# scripts/lifecycle_manager.py
import os
import json
import boto3
import psycopg
from datetime import datetime, timedelta, timezone

def archive_expired_runs(db_uri: str, s3_bucket: str, retention_days: int = 730) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    s3 = boto3.client("s3")
    archived = 0
    with psycopg.connect(db_uri) as conn:
        # Materialize the target set before issuing UPDATEs so we are not mutating
        # the table while iterating the same result cursor.
        rows = conn.execute(
            """
            SELECT run_id, target_url, execution_context
            FROM audit_runs
            WHERE created_at < %s AND status = 'active'
            """,
            (cutoff,),
        ).fetchall()

        for run_id, url, ctx in rows:
            # Archive to the warm tier as immutable, infrequently-accessed storage.
            s3.put_object(
                Bucket=s3_bucket,
                Key=f"archive/{run_id}/manifest.json",
                Body=json.dumps({"run_id": str(run_id), "url": url, "context": ctx}),
                StorageClass="GLACIER_IR",
            )
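            # Flip status only after the upload succeeds, so a failed PUT never
            # loses a row. Commit per run: psycopg 3 rolls the whole connection
            # block back on an exception, so without this a crash mid-loop would
            # undo every earlier status flip and force those runs to re-upload.
            conn.execute(
                "UPDATE audit_runs SET status = 'archived' WHERE run_id = %s",
                (run_id,),
            )
            conn.commit()
            archived += 1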
    return archived

if __name__ == "__main__":
    n = archive_expired_runs(
        db_uri=os.environ["AUDIT_DB_URI"],
        s3_bucket=os.environ["AUDIT_ARCHIVE_BUCKET"],
        retention_days=int(os.environ.get("RETENTION_DAYS", 730)),
    )
    print(f"archived {n} runs")

4. Cryptographically delete past the sunset window

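Deletion at the cold boundary must be irreversible and must target the reconstructable data specifically (DOM fragments and any captured PII) while preserving the aggregate metrics that compliance history depends on. Write the anonymized rollup first, delete the object-storage blobs, then delete the finding rows and mark the run purged; the run row itself stays behind as a tombstone so reconciliation can still count it. The function below continues the same script and reuses its imports.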

def purge_and_summarize(db_uri: str, s3_bucket: str, purge_days: int = 1825) -> None:
    """Past the sunset window, destroy raw payloads but keep aggregate metrics."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=purge_days)
    s3 = boto3.client("s3")
    with psycopg.connect(db_uri) as conn:
        rows = conn.execute(
            "SELECT run_id FROM audit_runs WHERE created_at < %s AND status = 'archived'",
            (cutoff,),
        ).fetchall()
        for (run_id,) in rows:
            # Persist the derived metric BEFORE destroying the source rows, so a
            # crash cannot leave us with neither the raw data nor the summary.
            conn.execute(
                """
                INSERT INTO compliance_metrics (run_id, criterion, severity, count)
                SELECT run_id, wcag_criterion, severity, COUNT(*)
                FROM audit_findings WHERE run_id = %s
                GROUP BY run_id, wcag_criterion, severity
                """,
                (run_id,),
            )
            # Object-lock/versioned buckets require deleting every version to make
            # the blob truly unrecoverable — a plain delete only adds a marker.
            _delete_all_versions(s3, s3_bucket, prefix=f"archive/{run_id}/")
            conn.execute(
                "UPDATE audit_runs SET status = 'purged' WHERE run_id = %s", (run_id,)
            )
            conn.execute("DELETE FROM audit_findings WHERE run_id = %s", (run_id,))

Align the destruction step with a recognized media-sanitization standard such as NIST SP 800-88 Rev. 2 Guidelines for Media Sanitization so archived artifacts cannot be reconstructed once purged, and coordinate the PII-handling rules with security and privacy framework integration, which governs what may be captured in the first place.

5. Wire lifecycle enforcement into the pipeline

Ingestion runs on every merge; retention runs on a schedule. Keep them as distinct stages so a scan never blocks on a retention sweep and a retention sweep never races an in-flight ingestion.

# .gitlab-ci.yml
stages:
  - scan
  - ingest
  - retention-eval

scan_accessibility:
  stage: scan
  script:
    - python -m audit_engine --target $TARGET_URL --output results.json
  artifacts:
    paths: [results.json]

ingest_telemetry:
  stage: ingest
  script:
    - python scripts/ingest_findings.py --input results.json --db-uri $AUDIT_DB_URI
  only:
    - main
    - release/*

evaluate_retention:
  stage: retention-eval
  # Weekly schedule via CI/CD > Schedules (cron "0 2 * * 0", Sunday 02:00 UTC).
  # The rule restricts this job to scheduled pipelines so merges never trigger it.
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - python scripts/lifecycle_manager.py --config retention_policy.yaml

Integrate storage validation into the merge checks as well: if the persistence layer rejects a malformed payload or violates a schema constraint, the pipeline fails before deployment, so every merged change produces queryable, standards-compliant telemetry rather than silent corruption.

Configuration Reference

The parameters that govern the lifecycle and its cost/legal trade-offs. Bind tier windows to your actual regulatory obligations, not to round numbers.

Parameter	Type	Default	Description
`RETENTION_DAYS`	int	`730`	Age at which a run leaves the hot tier and archives to object storage (24 months).
`PURGE_DAYS`	int	`1825`	Age at which archived runs are cryptographically deleted and reduced to aggregate metrics (60 months).
`AUDIT_DB_URI`	str	—	Connection string for the relational store; injected as a CI secret, never committed.
`AUDIT_ARCHIVE_BUCKET`	str	—	Object-storage bucket for the warm tier; should have versioning + object-lock for legal hold.
`ARCHIVE_STORAGE_CLASS`	str	`GLACIER_IR`	Storage class for archived manifests. `GLACIER_IR` balances retrieval latency against cost.
`LEGAL_HOLD_TAGS`	list	`[]`	Run tags exempt from purge regardless of age (active litigation, regulatory inquiry).
`RETENTION_BATCH_SIZE`	int	`500`	Runs processed per lifecycle invocation; caps memory and transaction duration on large backlogs.
`DELETE_ALL_VERSIONS`	bool	`true`	On versioned buckets, remove every object version so a purge is truly unrecoverable.
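
The parameters above can be gathered into one validated object so the lifecycle job fails fast on an inconsistent policy. This is a sketch under assumptions: it reads environment variables rather than retention_policy.yaml, it encodes `LEGAL_HOLD_TAGS` as a comma-separated string, and the names `RetentionConfig`, `load_config`, and `is_purge_exempt` are hypothetical.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionConfig:
    retention_days: int = 730
    purge_days: int = 1825
    archive_storage_class: str = "GLACIER_IR"
    legal_hold_tags: frozenset[str] = frozenset()
    batch_size: int = 500
    delete_all_versions: bool = True

    def __post_init__(self) -> None:
        # A purge window at or inside the hot window would delete live data.
        if self.purge_days <= self.retention_days:
            raise ValueError("PURGE_DAYS must be greater than RETENTION_DAYS")
        if self.batch_size < 1:
            raise ValueError("RETENTION_BATCH_SIZE must be at least 1")

def load_config(env=None) -> RetentionConfig:
    """Build the config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_tags = env.get("LEGAL_HOLD_TAGS", "")
    tags = frozenset(t.strip() for t in raw_tags.split(",") if t.strip())
    return RetentionConfig(
        retention_days=int(env.get("RETENTION_DAYS", 730)),
        purge_days=int(env.get("PURGE_DAYS", 1825)),
        archive_storage_class=env.get("ARCHIVE_STORAGE_CLASS", "GLACIER_IR"),
        legal_hold_tags=tags,
        batch_size=int(env.get("RETENTION_BATCH_SIZE", 500)),
        delete_all_versions=env.get("DELETE_ALL_VERSIONS", "true").lower() == "true",
    )

def is_purge_exempt(run_tags, config: RetentionConfig) -> bool:
    """True when any tag on the run is under legal hold."""
    return bool(config.legal_hold_tags & set(run_tags))
```

Accepting a plain mapping makes the loader testable without touching the process environment, for example `load_config({"LEGAL_HOLD_TAGS": "litigation-2024"})`.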

Verification and Testing

Idempotent ingestion. Persist the same batch twice and assert audit_runs and audit_findings cardinality is unchanged; the ON CONFLICT DO NOTHING clauses should absorb the replay. A count that grows means a non-deterministic ID upstream.
Lifecycle dry-run. Run archive_expired_runs against a seeded fixture with clock-shifted created_at values and assert exactly the rows past RETENTION_DAYS flip to archived, and that a second invocation archives zero.
Deletion completeness. After purge_and_summarize, assert no audit_findings rows remain for purged runs, that _delete_all_versions left no recoverable object version, and that the corresponding compliance_metrics rollup exists. Reconstructability is the failure you are testing against.
Reconciliation. Sum runs across active + archived + purged and reconcile against the ingestion manifest count; a gap means a lost row or a failed archive PUT, which should surface as a scheduled-job alert.
Restore drill. Periodically restore one archived manifest from the warm tier and confirm it deserializes to a valid run — an archive you cannot read is not a backup.
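
The reconciliation check reduces to simple arithmetic over per-status counts. A minimal sketch follows, assuming the counts come from a `GROUP BY status` query and the manifest count from the ingestion side; the function name `reconcile` is hypothetical.

```python
KNOWN_STATUSES = {"active", "archived", "purged"}

def reconcile(status_counts: dict[str, int], manifest_count: int) -> int:
    """Return manifest_count minus stored runs; 0 means fully reconciled.

    A positive gap suggests lost rows or failed archive PUTs. An unrecognized
    status is rejected outright, because the lifecycle job would never age
    such rows out.
    """
    unknown = set(status_counts) - KNOWN_STATUSES
    if unknown:
        raise ValueError(f"unrecognized status values: {sorted(unknown)}")
    stored = sum(status_counts.get(status, 0) for status in KNOWN_STATUSES)
    return manifest_count - stored

# Hypothetical counts for illustration.
gap = reconcile({"active": 120, "archived": 40, "purged": 15}, manifest_count=177)
```

With these numbers `gap` is 2, meaning two ingested runs are unaccounted for and the scheduled job should alert.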

Failure Modes and Troubleshooting

Storage bill grows superlinearly despite a retention policy. The lifecycle job runs but nothing ages out. Root cause is usually a status value the job does not recognize (a manual edit set it to retained) or DOM snapshots written inline into violation_payload instead of hashed into object storage. Fix: keep the raw snapshot out of the row entirely — store only dom_snapshot_hash — and assert the lifecycle job’s status vocabulary in a test.

Trend line breaks after an engine upgrade. Violation counts jump or collapse overnight with no code change. Root cause: findings from different engine_version / spec_version values are being aggregated as if comparable. Fix: always group longitudinal queries by the provenance columns and normalize legacy criteria through the WCAG 2.2 vs 3.0 Success Criteria Taxonomy before comparing across a version boundary.
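
To illustrate grouping by provenance, the sketch below assumes findings have already been joined to their runs and arrive as dictionaries; the function name and the sample rows are hypothetical.

```python
from collections import Counter

def counts_by_provenance(joined_rows) -> dict:
    """Count findings per (engine_version, spec_version, wcag_criterion)."""
    totals = Counter()
    for row in joined_rows:
        key = (row["engine_version"], row["spec_version"], row["wcag_criterion"])
        totals[key] += 1
    return dict(totals)

# Hypothetical joined rows: the same criterion evaluated by two engine versions.
rows = [
    {"engine_version": "4.8.0", "spec_version": "WCAG 2.1", "wcag_criterion": "1.4.3"},
    {"engine_version": "4.8.0", "spec_version": "WCAG 2.1", "wcag_criterion": "1.4.3"},
    {"engine_version": "4.9.1", "spec_version": "WCAG 2.2", "wcag_criterion": "1.4.3"},
]
trend = counts_by_provenance(rows)
```

Keeping the version pair in the key means a drop from two findings to one is visibly attributable to an engine change and is never plotted as a single continuous series.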

Purge cannot locate PII to delete. A subject-access or erasure request arrives and the captured DOM contains user data with no way to target it. Root cause: PII was embedded in violation_payload without a locator. Fix: never persist raw PII — hash or redact at ingestion, keep DOM fragments behind the dom_snapshot_hash pointer, and coordinate capture rules with security and privacy framework integration so the sensitive fields never reach storage.
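
Redacting at ingestion and hashing the redacted fragment can be sketched as follows. This is deliberately narrow: it masks only email addresses with a simple pattern, whereas real PII detection needs a broader, reviewed rule set. The function names and placeholder text are hypothetical.

```python
import hashlib
import re

# Simple email pattern for illustration; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_fragment(html: str) -> str:
    """Mask email addresses in a DOM fragment before it is stored anywhere."""
    return EMAIL_RE.sub("[redacted-email]", html)

def snapshot_pointer(html: str) -> tuple[str, str]:
    """Return (dom_snapshot_hash, redacted_html).

    The hash is computed over the redacted text, so the pointer stored in the
    row never depends on, or leaks, the original sensitive value.
    """
    redacted = redact_fragment(html)
    digest = hashlib.sha256(redacted.encode("utf-8")).hexdigest()
    return digest, redacted

digest, stored = snapshot_pointer('<input value="jane.doe@example.com">')
```

Only `stored` goes to object storage and only `digest` goes into the row, which keeps the raw value out of both tiers.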

Deleted objects remain recoverable. A purge “succeeded” but the blob is restorable from a version marker. Root cause: a plain delete on a versioned or object-locked bucket only adds a delete marker. Fix: enumerate and remove every version (DELETE_ALL_VERSIONS), and confirm object-lock retention has elapsed before attempting deletion — a legal hold correctly blocks it.

Retention job races an in-flight ingestion. A run is archived while its findings are still being written, producing a half-empty manifest. Root cause: ingestion and retention overlap in time. Fix: keep retention on a scheduled-pipeline-only trigger (as in the CI config above), wrap ingestion in a single transaction so partial runs are never visible, and have the lifecycle job skip runs younger than a short grace period.

Frequently Asked Questions

Should DOM snapshots live in the database or object storage?

Object storage, with only the hash in the row. A serialized DOM snapshot can be tens to hundreds of kilobytes; multiplied across millions of findings it dominates database size, slows every index, and inflates backup cost. Keep dom_snapshot_hash as the pointer, put the blob in the archive bucket, and the hot tier stays small enough to query in real time. This also makes cryptographic deletion cleaner — you destroy the blob and the pointer dangles harmlessly until its row is purged.

How long should audit telemetry actually be retained?

Bind the windows to your legal and regulatory obligations rather than to defaults. Sprint-level remediation only needs the hot tier’s 0–24 months; compliance reporting and discovery typically drive the 24–60 month warm window; anything past the regulatory sunset should be purged to shrink both cost and liability. Encode the exact numbers in retention_policy.yaml and treat exceptions (litigation hold) as tagged, reviewed opt-outs — not as a reason to retain everything forever.

Why store the rule-set hash if I already store the engine version?

Because two runs on the same engine version can still apply different rules — a team enables an experimental check, scopes out a third-party widget, or changes a tag set. The engine_version tells you which evaluator ran; the rule_set_hash tells you which configuration it ran with. You need both to reproduce a finding, and reproducibility is the whole point of provenance. The hash is cheap and it is the difference between “we think this is comparable” and “this is provably the same evaluation.”

Does deleting old data destroy my long-term compliance trend?

No, if you separate the raw payload from the derived metric. The purge step destroys the reconstructable data — DOM fragments, full violation payloads, anything with PII exposure — but first writes an anonymized rollup (criterion, severity, count per run) into a compliance_metrics table that survives. Your multi-year trend line is built from those small aggregates, so you can hold a five-year compliance trajectory while retaining zero raw DOM past the sunset window.

How do I keep the retention job from timing out on a large backlog?

Process runs in bounded batches (RETENTION_BATCH_SIZE) rather than one unbounded sweep, so each invocation holds a short transaction and predictable memory. On the first run against years of accumulated data, lower the batch size and let the weekly schedule drain the backlog over several cycles instead of one marathon job. Index the lifecycle scan with a partial index on created_at WHERE status = 'active' so the query that finds expired runs never table-scans the whole history.
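
The bounded-batch loop might look like the sketch below. It assumes two hypothetical callables: `fetch_expired(limit)` returns up to `limit` runs that are still eligible (for example a `SELECT ... LIMIT` query), and `archive_one(run)` archives a run and flips its status so the next fetch no longer returns it. The `max_batches` cap is an addition for this sketch.

```python
from typing import Callable, Sequence

def drain_backlog(fetch_expired: Callable[[int], Sequence],
                  archive_one: Callable[[object], None],
                  batch_size: int = 500,
                  max_batches: int = 10) -> int:
    """Process at most max_batches batches and return how many runs were handled."""
    processed = 0
    for _ in range(max_batches):
        batch = fetch_expired(batch_size)
        if not batch:
            break  # backlog drained
        for run in batch:
            archive_one(run)
        processed += len(batch)
    return processed

# In-memory stand-in for the database: 1200 expired runs.
backlog = list(range(1200))
handled = drain_backlog(lambda limit: backlog[:limit], backlog.remove,
                        batch_size=500, max_batches=2)
```

After this call `handled` is 1000 and 200 runs remain for the next scheduled invocation, which is the intended behaviour on a large first sweep.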
