The automatable boundary of WCAG testing

The short version

A four-week effort to automate manual WCAG checks landed reliably on 2.4.7 Focus Visible but produced false positives or missed areas on 2.1.1 Keyboard, 2.1.2 No Keyboard Trap, and 1.3.1 Info and Relationships. That split is not a measure of engineering skill. It tracks a boundary the accessibility standards bodies themselves drew: some criteria are machine-decidable, and some require human judgment. The productive move is not to harden the detector; it is to adopt the published partition and put a person in the loop where the published test rules say a person belongs.

Why 2.4.7 yielded

The entire requirement is a single observable state:

“Any keyboard operable user interface has a mode of operation where the keyboard focus indicator is visible.”¹

One condition, present or absent. A tool can move focus and check whether a visible indicator appears. This is the kind of criterion automation handles well.
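A minimal sketch of that check, assuming a browser driver has already captured the element's computed styles before and after it received focus. The snapshot dictionaries, the function names, and the list of fallback properties are illustrative assumptions, not part of any tool named in this brief. A changed property can still be imperceptible (for example, an outline in the same colour as the background), so this is a first-pass heuristic only.

```python
# Properties other than outline that authors commonly use as a focus style.
# This list is an assumption for illustration, not an exhaustive set.
FALLBACK_PROPERTIES = ("box-shadow", "border-color", "background-color")


def outline_is_drawn(style):
    """True if the snapshot describes an outline that would actually render."""
    return (
        style.get("outline-style", "none") != "none"
        and style.get("outline-width", "0px") != "0px"
    )


def has_focus_indicator(unfocused, focused):
    """Compare two computed-style snapshots of the same element."""
    if outline_is_drawn(focused) and not outline_is_drawn(unfocused):
        return True
    return any(unfocused.get(p) != focused.get(p) for p in FALLBACK_PROPERTIES)


# Hypothetical snapshots: one element gains an outline, one stays unchanged.
resting = {"outline-style": "none", "outline-width": "0px", "box-shadow": "none"}
with_ring = {"outline-style": "solid", "outline-width": "2px", "box-shadow": "none"}

print(has_focus_indicator(resting, with_ring))  # True
print(has_focus_indicator(resting, resting))    # False
```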

Why 2.1.1, 2.1.2, and 1.3.1 resist

2.1.1 Keyboard — behavioural coverage

The requirement is that

“All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes”²

To verify it, a tool must exercise all functionality and decide whether each piece is operable by keyboard, a judgment that depends on behaviour and on the interaction path, and that varies across arbitrary widgets. Static markup analysis cannot confirm operability it never triggered, which is where both the false positives and the missed areas come from.
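The coverage problem can be made concrete with a small accounting sketch. It assumes a hypothetical inventory of the page's functions (supplied by a person or a crawler) and a record of which of them a keyboard-driven run actually exercised; both inputs and all names are invented for illustration. The point is the third bucket: anything never triggered is untested, not passed.

```python
def keyboard_coverage(inventory, keyboard_results):
    """Sort each known function into operable, not_operable, or untested.

    inventory: iterable of function names known to exist on the page.
    keyboard_results: dict mapping a function name to True/False for
        functions that a keyboard-only run actually exercised.
    """
    report = {"operable": [], "not_operable": [], "untested": []}
    for name in sorted(inventory):
        if name not in keyboard_results:
            report["untested"].append(name)
        elif keyboard_results[name]:
            report["operable"].append(name)
        else:
            report["not_operable"].append(name)
    return report


# Hypothetical page: three functions known, only two exercised by keyboard.
inventory = {"open menu", "sort table", "drag to reorder"}
results = {"open menu": True, "drag to reorder": False}
report = keyboard_coverage(inventory, results)
print(report["untested"])  # ['sort table']
```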

2.1.2 No Keyboard Trap — you have to drive it

The criterion requires that, once focus enters a component,

“…then focus can be moved away from that component using only a keyboard interface…”³

Detecting a trap means actually navigating into and out of components, not reading the DOM. A scanner that never tabs through the widget cannot know whether focus is stuck.
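A sketch of the decision step, assuming a driver has already pressed Tab repeatedly and recorded which element held focus after each press. The trace format, the element ids, and the function name are hypothetical. A "trapped" result is a candidate for review rather than a verdict: SC 2.1.2 allows exit by other keys when the user is advised of the method, and a Tab-only trace cannot see that.

```python
def classify_focus_trace(focus_trace, component_ids):
    """Classify a sequence of focused element ids recorded while pressing Tab.

    Returns "escaped" if focus entered the component and later left it,
    "trapped" if focus stayed inside for more consecutive steps than the
    component has focusable elements, and "inconclusive" otherwise
    (focus never entered, or the trace was too short to tell).
    """
    steps_inside = 0
    for element_id in focus_trace:
        if element_id in component_ids:
            steps_inside += 1
            if steps_inside > len(component_ids):
                return "trapped"
        elif steps_inside:
            return "escaped"
    return "inconclusive"


# Hypothetical widget with two focusable elements.
widget = {"m1", "m2"}
print(classify_focus_trace(["nav", "m1", "m2", "m1"], widget))      # trapped
print(classify_focus_trace(["nav", "m1", "m2", "footer"], widget))  # escaped
print(classify_focus_trace(["nav", "m1"], widget))                  # inconclusive
```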

1.3.1 Info & Relationships — the false-positive engine

The normative text asks that

“Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text”⁴

The trap is in the intent: the relationships that must be preserved are the ones

“…implied by visual or auditory formatting are preserved when the presentation format changes…”⁵

A tool can read the programmatic structure (DOM, ARIA, headings, lists, tables). It cannot reliably judge whether that structure matches the intended visual or auditory meaning. That inference gap is exactly where a non-AI checker flags relationships that are fine, and misses ones that are broken.
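To illustrate the gap, here is a sketch that surfaces candidates instead of verdicts. It looks for paragraphs whose entire text is bold and short, which may be headings expressed only through presentation. The heuristic, the class name, and the 60-character threshold are assumptions for illustration. Whether a candidate really functions as a heading is the judgment the tool cannot make.

```python
from html.parser import HTMLParser

MAX_HEADING_CHARS = 60  # assumed cut-off for "short enough to be a heading"


class HeadingCandidateFinder(HTMLParser):
    """Collect <p> elements whose whole text sits inside <b> or <strong>."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._in_p = False
        self._bold_depth = 0
        self._bold_text = []
        self._plain_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._bold_depth = 0
            self._bold_text, self._plain_text = [], []
        elif tag in ("b", "strong") and self._in_p:
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._in_p and self._bold_depth:
            self._bold_depth -= 1
        elif tag == "p" and self._in_p:
            bold = "".join(self._bold_text).strip()
            plain = "".join(self._plain_text).strip()
            # Candidate only if there is bold text and nothing outside it.
            if bold and not plain and len(bold) <= MAX_HEADING_CHARS:
                self.candidates.append(bold)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            bucket = self._bold_text if self._bold_depth else self._plain_text
            bucket.append(data)


# Hypothetical markup: one real heading, one bold-only paragraph, one normal one.
page = (
    "<h2>Returns</h2>"
    "<p><b>Shipping options</b></p>"
    "<p>We ship <b>worldwide</b>.</p>"
)
finder = HeadingCandidateFinder()
finder.feed(page)
print(finder.candidates)  # ['Shipping options']
```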

The boundary is already published

This is the part not worth re-deriving from scratch. W3C's Accessibility Conformance Testing (ACT) Rules tag every rule by how it can be implemented:

“Implementation types can be manual, semi-automated, automated, or linter.”⁶

The rules are designed to serve both kinds of tester — they provide

“guidance for developers of automated testing tools and manual testing methodologies”⁷

In other words, the standard expects conformance testing to be part-automated and part-manual by design. A mainstream engine reflects the same line in code: axe-core's rules are element/attribute/ARIA checks — for example, area-alt, whose job is to

“Ensure <area> elements of image maps have alternative text”⁸

and axe-core tags each rule with an issue type: some resolve to a definitive failure, others to needs review (area-alt, for example, is listed as failure, needs review⁹). The engine itself therefore separates a definitively automatable tier from a human-review tier rather than pretending everything is pass/fail.
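The same split is visible in the results object axe-core returns, which reports definitive failures under `violations` and items it could not decide under `incomplete`. The sketch below assumes those results have already been exported to a Python dictionary (for example, from saved JSON). The function name and the sample data are invented for illustration.

```python
def triage_axe_results(results):
    """Split an axe-core results dict into definitive and needs-review tiers."""

    def summarise(entries):
        # Keep only the rule id and how many nodes each entry touched.
        return [{"rule": e["id"], "nodes": len(e.get("nodes", []))} for e in entries]

    return {
        "definitive": summarise(results.get("violations", [])),
        "needs_review": summarise(results.get("incomplete", [])),
    }


# Invented sample shaped like an axe-core results object.
sample = {
    "violations": [{"id": "area-alt", "nodes": [{"target": ["area"]}]}],
    "incomplete": [
        {"id": "color-contrast", "nodes": [{"target": ["h1"]}, {"target": ["p"]}]}
    ],
}
tiers = triage_axe_results(sample)
print(tiers["definitive"])    # [{'rule': 'area-alt', 'nodes': 1}]
print(tiers["needs_review"])  # [{'rule': 'color-contrast', 'nodes': 2}]
```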

What automation reliably catches

The empirical picture sets honest expectations. WebAIM's 2026 analysis of the top one million home pages counted only automatically detectable failures and still concluded that

“…the rate of full WCAG 2 A/AA conformance was certainly lower than 4.1%”¹⁰

and cautions plainly that

“Absence of detected errors does not indicate that a page is accessible or conformant.”¹¹

Crucially, what automated scanners do catch is concentrated: “96% of all errors detected fall into these six categories.”¹²

WebAIM Million 2026 — most common automatically-detected WCAG failures, by share of home pages.¹²
WCAG failure type	% of home pages
Low contrast text	83.9%
Missing alternative text for images	53.1%
Missing form input labels	51%
Empty links	46.3%
Empty buttons	30.6%
Missing document language	13.5%

Every one of those is an attribute- or structure-level failure that a tool can detect without exercising behaviour or inferring intended meaning, which is what the three stuck criteria demand. Automation's reliable yield sits precisely where the hard criteria are not.

Recommendation

Stop chasing full automation of the judgment criteria

Ship what is decidable: 2.4.7 plus the attribute/structural tier (alt text, labels, contrast, document language, empty controls) — the same surface axe-core covers.
Adopt the ACT classification for 2.1.1, 2.1.2, and 1.3.1: map each to its existing ACT rule(s) and treat the non-automatable parts as manual or semi-automated steps rather than detector bugs to be eliminated.
Wrap a human in the loop for the judgment calls: a guided, semi-automated review (the tool surfaces candidates; a person rules) rather than a zero-false-positive autodetector, which the published manual and semi-automated tiers suggest is not achievable for these criteria.
That is where an AI layer earns its cost: as the assist for the semi-automated tier specifically (proposing candidate relationships / keyboard paths for a human to confirm), not as a from-scratch detector for criteria that turn on intended meaning.
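The recommendation can be sketched as a small routing step: findings for the decidable criterion are recorded as failures directly, and everything else waits in a queue until a person rules. The `Finding` class, the `AUTOMATED_CRITERIA` set, and the sample findings are hypothetical, chosen to mirror the split this brief proposes and not taken from any existing tool.

```python
from dataclasses import dataclass

# Criteria this sketch treats as fully automatable, following the brief.
AUTOMATED_CRITERIA = {"2.4.7"}


@dataclass
class Finding:
    criterion: str            # WCAG success criterion, e.g. "1.3.1"
    target: str               # selector or description of the element
    note: str                 # why the tool surfaced it
    verdict: str = "pending"  # "pending", "pass", or "fail"


def route(findings):
    """Split findings into tool-decided failures and a human review queue."""
    decided, review_queue = [], []
    for finding in findings:
        if finding.criterion in AUTOMATED_CRITERIA:
            finding.verdict = "fail"
            decided.append(finding)
        else:
            review_queue.append(finding)
    return decided, review_queue


def record_verdict(finding, verdict):
    """Store a reviewer's ruling; the tool never sets this for queued items."""
    if verdict not in ("pass", "fail"):
        raise ValueError("verdict must be 'pass' or 'fail'")
    finding.verdict = verdict
    return finding


# Invented findings: one decidable, one that needs a person.
findings = [
    Finding("2.4.7", "button.save", "no focus style detected"),
    Finding("1.3.1", "p.lead", "bold-only paragraph may be a heading"),
]
decided, review_queue = route(findings)
record_verdict(review_queue[0], "pass")  # a reviewer judged it acceptable
```

Keeping the verdict field at "pending" until a reviewer acts makes the tool's uncertainty explicit in the data, so a queued candidate is never reported as either a pass or a failure by default.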

Limitations of this brief

The ACT Rules and the Understanding documents are informative, not normative; the normative requirements are the WCAG success criteria themselves.
The WebAIM Million measures home pages only, and counts only automatically detectable failures — it understates, not overstates, the manual surface.
axe-core is one engine among several (WAVE, IBM Equal Access, Pa11y, ARC); the tiering pattern generalises but specific rule coverage varies.
This brief did not inspect or test the originating tool's code; it addresses the criteria, not the implementation.

Evidence register

At build time, every quotation above was re-checked verbatim against the captured source passage it cites. Tiers: T1 primary/normative, T2 authoritative secondary, T3 tool documentation.

1. “Any keyboard operable user interface has a mode of operation where the keyboard focus indicator is visible.” — WCAG 2.1 — SC 2.4.7 Focus Visible (Level AA) T1 · primary / normative
2. “All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes” — WCAG 2.1 — SC 2.1.1 Keyboard (Level A) T1 · primary / normative
3. “then focus can be moved away from that component using only a keyboard interface” — WCAG 2.1 — SC 2.1.2 No Keyboard Trap (Level A) T1 · primary / normative
4. “Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text” — WCAG 2.1 — SC 1.3.1 Info and Relationships (Level A) T1 · primary / normative
5. “implied by visual or auditory formatting are preserved when the presentation format changes” — Understanding SC 1.3.1 — Intent (W3C WAI) T2 · authoritative secondary
6. “Implementation types can be manual, semi-automated, automated, or linter.” — All ACT Rules (W3C WAI) T2 · authoritative secondary
7. “guidance for developers of automated testing tools and manual testing methodologies” — Understanding Test Rules for WCAG 2 (W3C WAI) T2 · authoritative secondary
8. “Ensure <area> elements of image maps have alternative text” — axe-core rule descriptions (Deque) T3 · tool documentation
9. “failure, needs review” — axe-core rule descriptions — issue types (Deque) T3 · tool documentation
10. “the rate of full WCAG 2 A/AA conformance was certainly lower than 4.1%” — WebAIM Million 2026 T2 · authoritative secondary
11. “Absence of detected errors does not indicate that a page is accessible or conformant.” — WebAIM Million 2026 T2 · authoritative secondary
12. “96% of all errors detected fall into these six categories.” — WebAIM Million 2026 — most common errors T2 · authoritative secondary