Async Crawling for Infinite Scroll Pages

Infinite scroll defeats the single-request assumption baked into most accessibility scanners: the crawler fetches one URL, parses the initial HTML, and reports on a fraction of the content a real user reaches after ten scroll gestures. The nodes injected by an IntersectionObserver never enter the accessibility tree the scanner evaluates, so keyboard traps, unlabeled controls, and contrast failures in the tail of the feed pass silently. This page solves that coverage gap. It shows how to drive a headless browser through an infinite feed deterministically — advancing the viewport, waiting for network and DOM to settle, and running scoped audits on each freshly hydrated batch — so that a scroll-dependent page produces the same complete, reproducible violation set on every run rather than a different partial one each time.

This workflow sits inside the broader Automated Scanning & Dynamic Content Ingestion strategy and depends directly on the browser-orchestration primitives established in Playwright Headless Scanning Workflows. Where that parent workflow guarantees a stabilized DOM for a single view, the technique here extends stabilization across an unbounded sequence of views without duplicating findings or exhausting the browser’s memory.

Prerequisites and Environment Context

Async infinite-scroll crawling is timing-sensitive, so environment parity between local runs and CI matters more here than in static scans. Pin the following before implementing:

Python 3.11+ and Playwright 1.40+ (pip install playwright && playwright install chromium). The async API (playwright.async_api) is required; the sync API blocks the event loop and serializes scroll waits inefficiently.
axe-core 4.10.x, injected at runtime from a pinned CDN URL or a vendored local copy. Pin the exact version — rule IDs and default tag membership shift between minor releases, which changes your violation baseline.
A fixed rendering environment. Use the official mcr.microsoft.com/playwright/python image in CI so font rendering, device-pixel-ratio, and scroll physics match. Contrast results in particular depend on the fonts actually available in the container.
Deterministic inputs. Freeze viewport dimensions, timezone (TZ), and locale. A feed sorted by “recent” or personalized per session will hydrate different nodes on each run and make deduplication unreliable — seed test data or pin a fixture route where possible.
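The frozen inputs listed above could be collected in one place, for example as keyword arguments for Playwright's browser.new_context. This is a minimal sketch: the helper name frozen_context_options and the chosen values (UTC, en-US, a device scale factor of 1) are assumptions for illustration, not settings taken from this workflow.

```python
def frozen_context_options(width=1440, height=900):
    """Return new_context() keyword arguments that pin rendering inputs."""
    return {
        "viewport": {"width": width, "height": height},
        "device_scale_factor": 1,  # fixed device-pixel-ratio
        "timezone_id": "UTC",      # assumed value; match your fixture data
        "locale": "en-US",         # assumed value
    }

# Usage, inside an async function that already holds a launched browser:
#   context = await browser.new_context(**frozen_context_options())
```

Keeping these values in a single function means local runs and CI workers cannot drift apart by editing one call site and not another.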

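Establish the axe-core rule set with the same Axe-Core Enterprise Configuration you use for static routes, restricting runOnly to the A and AA conformance tags. In axe-core, the wcag2a and wcag2aa tags select only rules for WCAG 2.0 criteria; add wcag21a, wcag21aa, and wcag22aa so that criteria introduced in 2.1 and 2.2 are evaluated as well. Reusing one configuration source across scroll and non-scroll routes keeps a single source of truth for which WCAG 2.2 success criteria are actively evaluated.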
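A shared options object might be built as in the sketch below. The helper name build_axe_options is hypothetical; the runOnly shape (type "tag" plus a list of values) and the resultTypes option follow the axe-core API, and the tag names are axe-core's own.

```python
def build_axe_options(tags):
    """Build the options object passed as the second argument to axe.run()."""
    return {
        "runOnly": {"type": "tag", "values": list(tags)},
        "resultTypes": ["violations"],
    }

# Usage: one call shared by static and scroll routes.
# Append "wcag21a", "wcag21aa", "wcag22aa" to cover criteria added in
# WCAG 2.1 and 2.2.
AXE_OPTIONS = build_axe_options(["wcag2a", "wcag2aa"])
```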

Conceptual Model: Advance, Stabilize, Scope, Ledger

A reliable infinite-scroll crawl is a loop over four synchronized primitives. Treating any one of them as optional is the root cause of nearly every flaky result.

Viewport advancement — a predictable scroll increment that respects CSS scroll-snap, momentum scrolling, and virtualization boundaries. Advance too far and virtualized rows mount and unmount before you can evaluate them; advance too little and the crawl never terminates.
Stabilization detection — a composite wait that combines network-idle detection with DOM mutation quiescence. Network idle alone fires before client-side frameworks finish hydrating fetched JSON into DOM; mutation quiescence alone misses in-flight requests. Both are required.
Targeted audit execution — a scoped axe.run() against only the nodes newly exposed by this scroll batch. Re-auditing the whole document on every increment turns an O(n) crawl into O(n²) work and re-reports the same violations dozens of times.
A cumulative violation ledger — a deduplicating store keyed on a stable identity, so the same offending node counted across overlapping batches collapses to one finding.

The loop below shows how these primitives feed one another until the content stream is exhausted. It stays in the parent Playwright Headless Scanning Workflows lifecycle — the same launch/navigate/inject sequence — but replaces the single evaluation step with an iterated scroll-and-audit cycle.

The critical insight is that stabilization must be proven, not assumed. A fixed sleep() after each scroll is the single most common cause of both flakiness (too short on a slow CI runner) and wasted minutes (too long on a fast one). The pattern below waits on an observable network signal and uses a small fixed buffer only to absorb sub-perceptual jitter like image decoding; add the mutation-count check described under failure modes when a framework keeps hydrating after the network goes quiet.

Step-by-Step Implementation

The following pattern implements a deterministic async crawl in Python and Playwright. Each step is self-contained; compose them into a single module for CI use.

1. Initialize the Headless Environment and Suppress Noise

Configure the browser context with route interception to drop assets that do not affect the accessibility tree. The pattern below covers images and web fonts; extend it with your analytics hosts if beacons keep the network busy. This cuts network chatter so that networkidle becomes a meaningful stabilization signal rather than a moving target.

import re
from playwright.async_api import async_playwright

# Playwright route patterns accept a compiled regex for multi-extension matching.
HEAVY_ASSETS = re.compile(r"\.(png|jpg|jpeg|gif|svg|webp|woff|woff2|ttf|otf|eot)(\?.*)?$")

async def init_browser(p):
    browser = await p.chromium.launch(args=["--js-flags=--expose-gc"])
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        java_script_enabled=True,
        bypass_csp=True,  # allows axe injection on strict-CSP properties
    )
    page = await context.new_page()
    # Abort heavy assets to keep networkidle a reliable stabilization signal.
    await page.route(HEAVY_ASSETS, lambda route: route.abort())
    # Caller owns teardown: await context.close() / await browser.close() in a finally block.
    return browser, context, page

Launching with --js-flags=--expose-gc is what makes the manual garbage-collection trigger in the failure-modes section actually available; without it, window.gc is undefined.

2. Read Scroll Metrics After Each Hydration Cycle

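Infinite-scroll implementations frequently mount content inside a virtualized container (overflow: auto) rather than growing the document. Re-read scrollHeight after every cycle: it changes as new batches hydrate, and a value cached from page load will make the loop terminate early. The helper below reads the document scroller; when the feed lives in an inner container, read the same three properties from that element instead.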

async def get_scroll_metrics(page):
    return await page.evaluate("""() => ({
        scrollHeight: document.documentElement.scrollHeight,
        clientHeight: document.documentElement.clientHeight,
        scrollTop: window.scrollY
    })""")

3. Execute the Scroll and Stabilization Loop

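Advance the viewport by a configurable increment, then wait on network idle plus a short hydration buffer. The default of 1200 px is roughly 1.3× the 900 px viewport height, so consecutive viewports do not overlap; that is safe when the feed keeps off-screen rows mounted (a plain appending list, or a virtualizer with generous overscan), and the increment should drop below one viewport height when rows unmount as soon as they leave view. Emit a stable CSS selector for each landmark or interactive node so that deduplication and axe’s include context both consume the same identity.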

async def scroll_and_stabilize(page, increment=1200, max_scrolls=50, stabilization_buffer=800):
    seen_keys = set()
    for _ in range(max_scrolls):
        # Pass the increment as an argument instead of interpolating it into JS.
        await page.evaluate("(dy) => window.scrollBy(0, dy)", increment)

        # Prove stability instead of assuming it: network first, then a short
        # buffer (in ms) for JS hydration and image decoding.
        await page.wait_for_load_state("networkidle")
        await page.wait_for_timeout(stabilization_buffer)

        # Emit a stable selector (not truncated HTML) per candidate node so the
        # key stays consistent across overlapping batches and feeds axe.include.
        selectors = await page.evaluate("""() => {
            const sel = '[role], [aria-label], [tabindex], a, button, input, select, textarea, h1, h2, h3, h4, h5, h6';
            return Array.from(document.querySelectorAll(sel)).map(el => {
                if (el.id) return `#${CSS.escape(el.id)}`;
                const tid = el.getAttribute('data-testid');
                return tid ? `[data-testid="${CSS.escape(tid)}"]` : null;
            }).filter(Boolean);
        }""")

        # Keep first-seen order and drop repeats within this batch as well as
        # selectors already yielded by earlier batches.
        new_selectors = list(dict.fromkeys(s for s in selectors if s not in seen_keys))
        seen_keys.update(new_selectors)
        yield new_selectors

        # Terminate once the viewport bottom reaches the freshly read document
        # height; the 1 px tolerance absorbs fractional scroll offsets.
        metrics = await get_scroll_metrics(page)
        if metrics["scrollTop"] + metrics["clientHeight"] >= metrics["scrollHeight"] - 1:
            break

Yielding batch-by-batch lets the caller audit incrementally and release each batch’s report before the next scroll, keeping peak memory flat regardless of feed length.

4. Run a Scoped Accessibility Audit on the New Batch

Inject axe-core into the stabilized context, then scope the run with a context object whose include is an array of CSS-selector arrays. axe.run() returns a Promise, so it must be awaited inside an async arrow function passed to page.evaluate.

AXE_CDN = "https://cdnjs.cloudflare.com/ajax/libs/axe-core/4.10.2/axe.min.js"

async def run_targeted_audit(page, include_selectors):
    if not include_selectors:
        return []

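    # add_script_tag appends a new <script> on every call, so inject only when
    # the axe global is missing (first batch, or after a context reset).
    if not await page.evaluate("() => typeof window.axe !== 'undefined'"):
        await page.add_script_tag(url=AXE_CDN)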

    # include is an array of selector arrays, e.g. [["#feed-item-42"], ["#feed-item-43"]].
    context = {"include": [[sel] for sel in include_selectors]}
    report = await page.evaluate(
        "async (ctx) => await axe.run(ctx, {resultTypes: ['violations']})",
        context,
    )
    return report["violations"]

Map each returned violation to its WCAG success criterion using the tags array on the rule, cross-referenced against the W3C Web Content Accessibility Guidelines (WCAG) 2.2 so that downstream triage inherits a conformance level rather than a bare rule ID.
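The tag-to-criterion mapping described above can be sketched as a small parser. It assumes axe-core's tag naming convention, in which a tag such as wcag143 encodes success criterion 1.4.3 and a tag such as wcag21aa encodes a conformance level; the function name parse_wcag_tags is hypothetical.

```python
import re

_SC_TAG = re.compile(r"^wcag(\d)(\d)(\d+)$")      # wcag143  -> 1.4.3
_LEVEL_TAG = re.compile(r"^wcag2[12]?(a{1,3})$")  # wcag21aa -> AA

def parse_wcag_tags(tags):
    """Split axe rule tags into success-criterion numbers and levels."""
    criteria, levels = [], []
    for tag in tags:
        sc = _SC_TAG.match(tag)
        if sc:
            criteria.append(".".join(sc.groups()))
            continue
        level = _LEVEL_TAG.match(tag)
        if level:
            levels.append(level.group(1).upper())
    return {"criteria": criteria, "levels": levels}

# parse_wcag_tags(["cat.color", "wcag2aa", "wcag143"])
# returns {"criteria": ["1.4.3"], "levels": ["AA"]}
```

Tags that match neither pattern (category tags, best-practice tags) are ignored, so the result carries only conformance information.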

5. Deduplicate Into a Cumulative Ledger

Overlapping scroll batches will surface the same node more than once. Collapse findings on a stable composite key before they reach any reporting store.

def merge_into_ledger(ledger, route, violations):
    for v in violations:
        for node in v["nodes"]:
            target = node["target"][0] if node["target"] else ""
            key = (route, v["id"], target)  # route + rule + element selector
            ledger.setdefault(key, {
                "route": route,
                "rule": v["id"],
                "impact": v["impact"],
                "target": target,
                "wcag": [t for t in v["tags"] if t.startswith("wcag")],
            })
    return ledger

Keying on route + rule + element_selector — rather than on a truncated HTML snippet — is what makes the count stable across runs. Feed the deduplicated ledger into the same batch validation architecture that ingests your static-route results so scroll and non-scroll findings share one normalization and gating path.
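The five steps can be composed into one driver along the following lines. This is a sketch rather than the module's actual entry point: crawl_route is a hypothetical name, and the step functions are passed in as parameters so the driver stays independent of how each step is implemented.

```python
async def crawl_route(page, route, scroll_batches, audit, merge):
    """Audit each scroll batch as it arrives and fold it into one ledger."""
    ledger = {}
    async for batch in scroll_batches(page):
        violations = await audit(page, batch)
        merge(ledger, route, violations)
        # The batch report is not retained past this iteration; only the
        # deduplicated ledger accumulates.
    return ledger

# Usage with the functions from steps 3 to 5, inside an async function:
#   ledger = await crawl_route(
#       page, "/feed", scroll_and_stabilize, run_targeted_audit, merge_into_ledger
#   )
```

Because the driver consumes the async generator one batch at a time, peak memory is bounded by the largest single batch report plus the ledger.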

Configuration Reference

Tune these parameters per property. The defaults suit a text-dense feed at desktop viewport; media-heavy or mobile-parity crawls need the noted adjustments.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `increment` | int (px) | `1200` | Vertical scroll distance per cycle. Keep it below viewport height × 1.5 so virtualized rows do not mount and unmount between evaluations. |
| `max_scrolls` | int | `50` | Hard ceiling on cycles; the loop exits earlier when the bottom is reached. Prevents runaway crawls on truly endless feeds. |
| `stabilization_buffer` | int (ms) | `800` | Fixed wait after `networkidle` to absorb hydration and image decoding. Raise to `1200` on slow CI runners; lower to `500` for pure-text feeds. |
| `networkidle` | wait state | enabled | Waits until there have been no network connections for at least 500 ms. Disable only for feeds that hold open a streaming/WebSocket connection, and substitute mutation-count quiescence instead. |
| `include` scope | selector[][] | new nodes only | axe evaluation context. Scoping to the current batch plus one buffer zone bounds memory and audit time. |
| `AXE_CDN` | URL | `4.10.2` | Pinned axe-core build. Never float this to `latest`; rule membership changes shift the baseline. |
| `--js-flags=--expose-gc` | launch arg | set | Exposes `window.gc()` so long crawls can force context cleanup between batches. |

Verification and Testing

Confirm the crawl captures deferred content before you trust its zero-violation runs.

Assert node growth. Log len(seen_keys) per cycle. A healthy crawl shows the count climbing and then plateauing at the true end of the feed. A flat count from cycle one means scroll events are not firing — check for a scroll-jacking container that ignores window.scrollBy and requires scrolling a specific element instead.
Seed a known violation. In a fixture build, inject a control with a deliberate failure (an <img> without alt, a button labeled only by an icon) deep in the feed — say the 40th item. A correct crawl must report it; if it does not, your stabilization window is closing before that item hydrates.
Run it twice and diff the ledgers. Determinism is the whole point. Two consecutive runs against the same fixture must produce byte-identical deduplicated ledgers. A diff exposes personalization, unfrozen timestamps, or a race in the stabilization wait.
Gate in CI. Emit the ledger as JUnit XML and fail the job when impact-critical or impact-serious counts exceed the route’s threshold. Validate the raw JSON against a schema first, using the same contract described in JSON Schema validation for accessibility data, so a malformed crawl output fails loudly rather than passing an empty ledger.
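The run-twice-and-diff check above could be implemented roughly as follows. Both function names are hypothetical, and the sketch assumes ledgers shaped like the one built in step 5: a dict keyed by (route, rule, target) tuples of strings.

```python
import json

def canonical_ledger(ledger):
    """Serialize a ledger deterministically so equal ledgers give equal bytes."""
    rows = [ledger[key] for key in sorted(ledger)]
    return json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")

def diff_ledgers(first, second):
    """Return the ledger keys that appear in only one of two runs."""
    a, b = set(first), set(second)
    return {"only_in_first": sorted(a - b), "only_in_second": sorted(b - a)}
```

Comparing canonical_ledger output gives the byte-identical test; diff_ledgers then names the findings that appeared or vanished, which is usually enough to tell personalization apart from a stabilization race.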

The pipeline wraps the crawler in four stages: a pre-crawl step resolves the sitemap or route registry and fans each route out to an isolated worker container; workers run the async crawler with frozen viewport, timezone, and seed data; raw results pass a schema validator that normalizes severity and element paths; and a threshold gate fails the build when critical violations exceed the per-route budget. Routing of the surviving violations into developer-owned tickets is handled by the error categorization and triage pipelines, which classify each finding by framework pattern and severity before assignment.

stages:
  - crawl
  - validate
  - report

# Every job needs Python and the project dependencies, not only the crawl job.
default:
  image: mcr.microsoft.com/playwright/python:v1.40.0-jammy
  before_script:
    - pip install -r requirements.txt

accessibility-crawl:
  stage: crawl
  script:
    - python -m src.crawler --routes sitemap.json --output raw_violations.json
  artifacts:
    paths: [raw_violations.json]

schema-validation:
  stage: validate
  script:
    - python -m src.validator --input raw_violations.json --schema accessibility_schema.json
  artifacts:
    paths: [validated_violations.json]

accessibility-report:
  stage: report
  script:
    - python -m src.reporter --input validated_violations.json --format junit
  artifacts:
    reports:
      junit: junit_accessibility.xml
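The threshold gate that fails the build could be as small as the sketch below. The function name exceeded_budgets and the budget values in the usage comment are assumptions for illustration; the ledger shape matches step 5, where each entry carries an impact field.

```python
from collections import Counter

def exceeded_budgets(ledger, budgets):
    """Return the impacts whose finding count is above the allowed budget."""
    counts = Counter(entry["impact"] for entry in ledger.values())
    return {
        impact: counts[impact]
        for impact, limit in budgets.items()
        if counts[impact] > limit
    }

# Usage in a reporter entry point (budget values are illustrative):
#   over = exceeded_budgets(ledger, {"critical": 0, "serious": 2})
#   raise SystemExit(1 if over else 0)
```

For per-route budgets, filter the ledger to one route before calling the function and look up that route's limits.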

Failure Modes and Troubleshooting

1. Premature evaluation on un-hydrated DOM (false negatives)

Symptom: the crawl passes locally but reports intermittent zero-violation runs in CI on the same route. Root cause: networkidle fired after the fetch completed but before the framework rendered the JSON into DOM, so axe.run() saw an empty region. Fix: pair network-idle with mutation quiescence — poll a MutationObserver record count and only proceed after it stays at zero for one buffer interval. Raise stabilization_buffer on slower runners rather than lowering max_scrolls.
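A mutation-quiescence wait might look like the sketch below. It matters all the more because Playwright's documentation states that wait_for_load_state resolves immediately when the document has already reached the requested state, so the network-idle wait may not block on fetches triggered by later scrolls. The names QUIESCENCE_JS and wait_for_dom_quiescence and the default timings are assumptions, not part of the original crawler.

```python
QUIESCENCE_JS = """
({quietMs, timeoutMs}) => new Promise((resolve) => {
    let timer;
    const finish = (settled) => {
        observer.disconnect();
        clearTimeout(timer);
        clearTimeout(deadline);
        resolve(settled);
    };
    // Restart the quiet-period timer whenever the DOM changes.
    const arm = () => {
        clearTimeout(timer);
        timer = setTimeout(() => finish(true), quietMs);
    };
    const observer = new MutationObserver(arm);
    observer.observe(document.documentElement, {
        childList: true, subtree: true, attributes: true, characterData: true,
    });
    const deadline = setTimeout(() => finish(false), timeoutMs);
    arm();
})
"""

async def wait_for_dom_quiescence(page, quiet_ms=500, timeout_ms=10_000):
    """Return True once no mutation occurs for quiet_ms, False on timeout."""
    return await page.evaluate(
        QUIESCENCE_JS, {"quietMs": quiet_ms, "timeoutMs": timeout_ms}
    )
```

Call it after the network wait in the scroll loop and treat a False result as an unstable batch worth logging rather than auditing silently.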

2. Memory exhaustion on long feeds

Symptom: the headless process is OOM-killed after 30–40 cycles. Root cause: detached-node retention and an ever-growing document as batches accumulate. Fix: audit and release batch reports incrementally (never hold all reports in memory), scope include to the current viewport plus one buffer zone, and force cleanup every 5–10 cycles with await page.evaluate("window.gc && window.gc()") (available only with --expose-gc). For pathological feeds, checkpoint the ledger, navigate to about:blank, and re-enter to reset the context.
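The periodic cleanup and ledger checkpoint could be sketched as two helpers. The names release_memory and checkpoint_ledger, the default interval, and the JSON row layout are assumptions for illustration.

```python
import json

async def release_memory(page, cycle, every=5):
    """Trigger garbage collection every `every` cycles; return True if it ran."""
    if cycle == 0 or cycle % every != 0:
        return False
    # window.gc exists only when Chromium was launched with
    # --js-flags=--expose-gc; otherwise this is a no-op.
    await page.evaluate("() => { if (window.gc) window.gc(); }")
    return True

def checkpoint_ledger(ledger, path):
    """Write the ledger to disk so a context reset cannot lose findings."""
    rows = [ledger[key] for key in sorted(ledger)]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, sort_keys=True)
```

Write the checkpoint before navigating to about:blank, so a crash during the reset still leaves every finding collected so far on disk.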

3. Duplicate findings flooding triage

Symptom: one broken feed card is reported 20 times. Root cause: overlapping scroll batches re-expose the same node, and the ledger key includes a volatile field (a truncated HTML snippet or an index that shifts as items mount). Fix: key strictly on route + rule + stable_selector. Prefer id or data-testid selectors; if the feed emits neither, generate a structural path once and cache it on first sighting.

4. Scroll-jacking containers that ignore `window.scrollBy`

Symptom: seen_keys never grows; the page visibly does not move. Root cause: the feed captures wheel events and scrolls an inner element, so window-level scrolling is a no-op. Fix: detect the actual scroll container (document.scrollingElement may not be it) and call element.scrollBy on it, or dispatch synthetic wheel events. Fall back to element.scrollIntoView() on the last rendered item.
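Detecting the real scroll container might be approached as below. This is a heuristic sketch, not the crawler's actual logic: it picks the largest element whose computed overflow-y allows scrolling and whose content overflows, and it marks that element with a hypothetical data-crawl-scroller attribute so later calls can address it.

```python
FIND_SCROLLER_JS = """
() => {
    let best = null, bestArea = 0;
    for (const el of document.querySelectorAll('*')) {
        const overflowY = getComputedStyle(el).overflowY;
        const scrollable = (overflowY === 'auto' || overflowY === 'scroll')
            && el.scrollHeight > el.clientHeight;
        const area = el.clientWidth * el.clientHeight;
        if (scrollable && area > bestArea) { best = el; bestArea = area; }
    }
    if (!best) return null;
    best.setAttribute('data-crawl-scroller', '1');
    return '[data-crawl-scroller="1"]';
}
"""

async def scroll_feed(page, increment=1200):
    """Scroll the largest scrollable element, or the window if none exists."""
    selector = await page.evaluate(FIND_SCROLLER_JS)
    if selector is None:
        await page.evaluate("(dy) => window.scrollBy(0, dy)", increment)
        return "window"
    await page.evaluate(
        "([sel, dy]) => document.querySelector(sel).scrollBy(0, dy)",
        [selector, increment],
    )
    return selector
```

The return value tells the caller which scroller moved, which is worth logging next to the per-cycle node count when diagnosing a flat crawl.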

5. False positives from framework transition artifacts

Symptom: violations appear on wrapper nodes that no real user ever perceives (a Vue <transition> shell, a React portal placeholder). Root cause: the audit caught a node mid-transition, in a transient state. Fix: exclude known transition wrappers via exclude selectors in the axe context, and cross-reference dynamic violations against a known-artifact allowlist before writing them to the ledger. For single-page feeds where scroll boundaries intersect client-side routing, align scroll termination with router state as described in Implementing Async Crawling for Single Page Applications.
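Both halves of that fix can be sketched in a few lines: an axe context that carries exclude selectors alongside include, and a filter that drops allowlisted targets before they reach the ledger. The function names and the example selectors in the usage comment are hypothetical.

```python
def build_axe_context(include_selectors, exclude_selectors=()):
    """Build an axe context; both keys take arrays of selector arrays."""
    context = {"include": [[sel] for sel in include_selectors]}
    if exclude_selectors:
        context["exclude"] = [[sel] for sel in exclude_selectors]
    return context

def drop_known_artifacts(violations, artifact_targets):
    """Remove nodes whose first target selector is on the artifact allowlist."""
    kept = []
    for violation in violations:
        nodes = [
            node for node in violation["nodes"]
            if not (node["target"] and node["target"][0] in artifact_targets)
        ]
        if nodes:  # keep the violation only if a real node remains
            kept.append({**violation, "nodes": nodes})
    return kept

# Usage (selectors are illustrative):
#   ctx = build_axe_context(batch, [".fade-enter-active"])
#   clean = drop_known_artifacts(violations, {".portal-placeholder"})
```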

Frequently Asked Questions

Should I re-inject axe-core after every scroll, or once?

Inject once per page context. page.add_script_tag is not idempotent: every call appends another <script> element and re-executes the bundle, so guard the injection with a check for the axe global instead of repeating it each cycle. What must repeat each cycle is axe.run() with a fresh, batch-scoped include context. If you navigate to about:blank to reset memory, re-inject after returning, because the fresh document has no axe global.

Why does my CI run report more violations than my local run?

Almost always a font or viewport mismatch. Color-contrast rules resolve against the fonts actually rendered in the container, and a different device-pixel-ratio changes computed sizes. Pin the official Playwright image, freeze the viewport, and set TZ so both environments render identically. A determinism diff between two consecutive runs on the same machine isolates whether the variance is environmental or a genuine race.

Can I skip networkidle and just use a fixed timeout?

Not reliably. A fixed timeout is either too short on a loaded CI runner (premature evaluation, false negatives) or too long on a fast one (wasted minutes across thousands of routes). Wait on observable signals — network idle plus mutation quiescence — and use a small fixed buffer only to absorb sub-perceptual jitter like image decoding.

How do I keep the crawl from running forever on an endless feed?

Enforce both a hard max_scrolls ceiling and an end-of-content check that compares scrollTop + clientHeight against a freshly read scrollHeight. On genuinely infinite feeds (social timelines), the ceiling is your real terminator; set it to cover the depth your conformance scope requires and document that the tail beyond it is out of audit scope.
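The two terminators can be combined in one pure function that also reports which of them fired, which makes it easy to record when the tail of a feed was left out of scope. The name should_stop and the returned reason strings are assumptions; the metrics dict has the shape returned by get_scroll_metrics.

```python
def should_stop(cycle, metrics, max_scrolls=50, tolerance_px=1):
    """Return a stop reason, or None when the crawl should keep scrolling."""
    if cycle + 1 >= max_scrolls:
        return "ceiling"
    bottom = metrics["scrollTop"] + metrics["clientHeight"]
    # The tolerance absorbs fractional scroll offsets.
    if bottom >= metrics["scrollHeight"] - tolerance_px:
        return "end_of_content"
    return None
```

A "ceiling" result on a route expected to be finite is a signal to investigate, while on a social timeline it is the normal exit.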
