The short version
A four-week effort to automate manual WCAG checks landed reliably on 2.4.7 Focus Visible but produced false positives or missed areas on 2.1.1 Keyboard, 2.1.2 No Keyboard Trap, and 1.3.1 Info & Relationships. That split is not a measure of engineering skill. It tracks a boundary the accessibility standards bodies themselves drew: some criteria are machine-decidable, and some require human judgment. The productive move is not to harden the detector — it is to adopt the published partition and put a person in the loop where the spec says a person belongs.
Why 2.4.7 yielded
The entire requirement is a single observable state:
“Any keyboard operable user interface has a mode of operation where the keyboard focus indicator is visible.”1
One condition, present or absent. A tool can move focus and check whether a visible indicator appears. This is the kind of criterion automation handles well.
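A minimal sketch of that check, assuming the tool has already captured an element's computed styles before and after moving focus to it (the browser automation that collects those snapshots is not shown). The function name, the property list, and the snapshot shape are hypothetical. A production check would also need to consider indicators drawn on ancestors or pseudo-elements, and whether the change is actually perceptible.

```python
# Hypothetical sketch: compare computed-style snapshots taken before and after
# focusing an element, looking only at properties that commonly carry a
# focus indicator. The property list is illustrative, not exhaustive.
INDICATOR_PROPERTIES = (
    "outline-style",
    "outline-width",
    "outline-color",
    "box-shadow",
    "border-color",
    "background-color",
    "text-decoration-line",
)


def focus_indicator_changes(unfocused: dict, focused: dict) -> dict:
    """Return {property: (before, after)} for indicator properties that differ."""
    changes = {}
    for prop in INDICATOR_PROPERTIES:
        before, after = unfocused.get(prop), focused.get(prop)
        if before != after:
            changes[prop] = (before, after)
    # An outline whose style is "none" is not drawn, so changes to its other
    # outline-* properties do not count as a visible indicator.
    if focused.get("outline-style", "none") == "none":
        changes = {p: v for p, v in changes.items() if not p.startswith("outline-")}
    return changes
```

An empty result means no candidate indicator was found for that element. A non-empty result means something changed on focus, which is the single observable condition the criterion asks about.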
Why 2.1.1, 2.1.2, and 1.3.1 resist
2.1.1 Keyboard — behavioural coverage
The requirement is that
“All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes”2
To verify it, a tool must exercise all functionality and decide whether each piece is operable by keyboard — behaviour- and path-dependent across arbitrary widgets. Static markup analysis cannot confirm operability it never triggered, which is where both the false positives and the missed areas come from.
2.1.2 No Keyboard Trap — you have to drive it
The criterion requires that, once focus enters a component,
“…then focus can be moved away from that component using only a keyboard interface…”3
Detecting a trap means actually navigating into and out of components, not reading the DOM. A scanner that never tabs through the widget cannot know whether focus is stuck.
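The navigation requirement can be sketched as a loop that presses Tab until focus leaves the component or starts repeating. This is an illustration under stated assumptions: `press_tab` and `in_component` are hypothetical hooks that a browser driver would supply, and the cycle test assumes the next focus target depends only on the current one.

```python
from typing import Callable, Hashable


def escapes_component(
    start: Hashable,
    press_tab: Callable[[Hashable], Hashable],
    in_component: Callable[[Hashable], bool],
    max_presses: int = 200,
) -> bool:
    """Return True if repeated Tab presses move focus out of the component.

    press_tab(current) is a hypothetical driver hook that returns the element
    focused after pressing Tab while `current` has focus.
    """
    current = start
    seen = {start}
    for _ in range(max_presses):
        current = press_tab(current)
        if not in_component(current):
            return True  # focus left the component
        if current in seen:
            return False  # focus cycled back without ever leaving
        seen.add(current)
    return False  # no exit found within the budget; treat as a candidate trap
```

A `False` result is a candidate for human review, not a confirmed failure: 2.1.2 also permits other exit methods as long as the user is advised of them, and a Tab-only loop cannot see that.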
1.3.1 Info & Relationships — the false-positive engine
The normative text asks that
“Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text”4
The trap is in the intent: the relationships that must be preserved are the ones
“…implied by visual or auditory formatting are preserved when the presentation format changes…”5
A tool can read the programmatic structure (DOM, ARIA, headings, lists, tables). It cannot reliably judge whether that structure matches the intended visual or auditory meaning. That inference gap is exactly where a non-AI checker flags relationships that are fine, and misses ones that are broken.
The boundary is already published
This is the part worth not re-deriving from scratch. W3C's Accessibility Conformance Testing (ACT) Rules tag every rule by how it can be implemented:
“Implementation types can be manual, semi-automated, automated, or linter.”6
The rules are designed to serve both kinds of tester — they provide
“guidance for developers of automated testing tools and manual testing methodologies”7
In other words, the standard expects conformance testing to be part-automated and part-manual by design. A mainstream engine reflects the same line in code: axe-core's rules are element/attribute/ARIA checks, for example area-alt, whose job is to
“Ensure <area> elements of image maps have alternative text”8
and each rule carries an issue type of “failure, needs review”9. The engine itself separates a definitively automatable tier from a human-review tier rather than pretending everything is pass/fail.
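That two-tier split is visible in the engine's output. The sketch below assumes a results object produced by axe-core's `axe.run` and serialised to JSON, which reports definite failures under `violations` and items it could not decide under `incomplete`. The function name and the summary shape are illustrative, not part of axe-core.

```python
def triage_axe_results(results: dict) -> dict:
    """Split an axe-core results object into definite failures and review items."""

    def summarise(rule_results: list) -> list:
        # Keep only the rule id, its impact, and how many nodes it touched.
        return [
            {
                "rule": r["id"],
                "impact": r.get("impact"),
                "nodes": len(r.get("nodes", [])),
            }
            for r in rule_results
        ]

    return {
        "failures": summarise(results.get("violations", [])),
        "needs_review": summarise(results.get("incomplete", [])),
    }
```

Reporting the two lists separately keeps the pass/fail tier from absorbing items the engine itself has marked as needing a person.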
What automation reliably catches
The empirical picture sets honest expectations. WebAIM's 2026 analysis of the top one million home pages counted only automatically detectable failures and still concluded that
“…the rate of full WCAG 2 A/AA conformance was certainly lower than 4.1%.”10
and cautions plainly that
“Absence of detected errors does not indicate that a page is accessible or conformant.”11
Crucially, what automated scanners do catch is concentrated: “96% of all errors detected fall into these six categories.”12
| WCAG failure type | % of home pages |
|---|---|
| Low contrast text | 83.9% |
| Missing alternative text for images | 53.1% |
| Missing form input labels | 51% |
| Empty links | 46.3% |
| Empty buttons | 30.6% |
| Missing document language | 13.5% |
Every one of those is an attribute- or structure-level failure. None of them is one of the three criteria the automation effort got stuck on. Automation's reliable yield sits precisely where the hard criteria are not.
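To illustrate why failures of this kind are machine-decidable, here is a small static check for three of them using only Python's standard `html.parser`. It is a sketch, not a substitute for an engine such as axe-core: the class and function names are hypothetical, only `aria-label` is considered as an alternative link name, and contrast is left out because it needs computed styles.

```python
from html.parser import HTMLParser


class StructuralChecker(HTMLParser):
    """Hypothetical static checker for three attribute-level failures."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self._link_text = None  # text collected inside an <a>; None when outside

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and not (attributes.get("lang") or "").strip():
            self.issues.append("missing document language")
        elif tag == "img":
            if "alt" not in attributes:
                self.issues.append("image missing alt attribute")
            elif self._link_text is not None and attributes["alt"]:
                # A linked image's alt text gives the link its name.
                self._link_text.append(attributes["alt"])
        elif tag == "a":
            self._link_text = [attributes.get("aria-label") or ""]

    def handle_data(self, data):
        if self._link_text is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_text is not None:
            if not "".join(self._link_text).strip():
                self.issues.append("empty link")
            self._link_text = None


def check_html(markup: str) -> list:
    checker = StructuralChecker()
    checker.feed(markup)
    checker.close()
    return checker.issues
```

Each check is a yes/no question about markup that is either present or absent. Whether an alt text that is present actually describes the image is a different question, and not one this kind of check can answer.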
Recommendation
Stop chasing full automation of the judgment criteria
- Ship what is decidable: 2.4.7 plus the attribute/structural tier (alt text, labels, contrast, document language, empty controls) — the same surface axe-core covers.
- Adopt the ACT classification for 2.1.1, 2.1.2, and 1.3.1: map each to its existing ACT rule(s) and treat the non-automatable parts as manual or semi-automated steps rather than detector bugs to be eliminated.
- Wrap a human in the loop for the judgment calls: a guided, semi-automated review (the tool surfaces candidates; a person rules) rather than a zero-false-positive autodetector, which the published partition implies is not achievable for these criteria.
- That is where an AI layer earns its cost: as the assist for the semi-automated tier specifically (proposing candidate relationships / keyboard paths for a human to confirm), not as a from-scratch detector for criteria that turn on intended meaning.
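The recommendation above could be wired up roughly as follows. This is a sketch under assumptions: the tier assigned to each criterion follows this brief's reading rather than an official ACT mapping, and `Finding`, `Report`, and `route` are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    AUTOMATED = "automated"
    SEMI_AUTOMATED = "semi-automated"
    MANUAL = "manual"


# Illustrative assignment following this brief; not an official ACT mapping.
CRITERION_TIER = {
    "2.4.7": Tier.AUTOMATED,
    "2.1.1": Tier.SEMI_AUTOMATED,
    "2.1.2": Tier.SEMI_AUTOMATED,
    "1.3.1": Tier.SEMI_AUTOMATED,
}


@dataclass
class Finding:
    criterion: str  # WCAG success criterion number, e.g. "1.3.1"
    target: str     # selector or description of the element concerned
    note: str       # what the tool observed


@dataclass
class Report:
    failures: list = field(default_factory=list)      # reported as pass/fail
    review_queue: list = field(default_factory=list)  # a person rules on these


def route(findings: list) -> Report:
    """Send automated-tier findings to failures and everything else to review."""
    report = Report()
    for finding in findings:
        # Unknown criteria default to manual so nothing is auto-failed by accident.
        tier = CRITERION_TIER.get(finding.criterion, Tier.MANUAL)
        if tier is Tier.AUTOMATED:
            report.failures.append(finding)
        else:
            report.review_queue.append(finding)
    return report
```

With this shape, a detector for 1.3.1 or 2.1.2 can stay noisy without polluting the failure list, because its output is only ever a candidate for a reviewer.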
Limitations of this brief
- The ACT Rules and the Understanding documents are informative, not normative; the normative requirements are the WCAG success criteria themselves.
- The WebAIM Million measures home pages only, and counts only automatically detectable failures — it understates, not overstates, the manual surface.
- axe-core is one engine among several (WAVE, IBM Equal Access, Pa11y, ARC); the tiering pattern generalises but specific rule coverage varies.
- This brief did not inspect or test the originating tool's code; it addresses the criteria, not the implementation.
Evidence register
Every quotation above was re-checked verbatim at build time against the captured source passage it cites. Tiers: T1 primary/normative, T2 authoritative secondary, T3 tool documentation.
- 1. “Any keyboard operable user interface has a mode of operation where the keyboard focus indicator is visible.”
- 2. “All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes”
- 3. “then focus can be moved away from that component using only a keyboard interface”
- 4. “Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text”
- 5. “implied by visual or auditory formatting are preserved when the presentation format changes”
- 6. “Implementation types can be manual, semi-automated, automated, or linter.”
- 7. “guidance for developers of automated testing tools and manual testing methodologies”
- 8. “Ensure <area> elements of image maps have alternative text”
- 9. “failure, needs review”
- 10. “the rate of full WCAG 2 A/AA conformance was certainly lower than 4.1%”
- 11. “Absence of detected errors does not indicate that a page is accessible or conformant.”
- 12. “96% of all errors detected fall into these six categories.”