Accessibility Incidents Belong in the SRE Postmortem

Image description: A monitor showing an SRE incident-management dashboard with a red ‘INCIDENT’ alert banner and a pager device beside it — the visual marker for accessibility-as-incident reporting.

When a checkout page goes down, an SRE team gets paged, a severity tier is assigned, a war room opens, and twenty-four hours later a blameless postmortem document circulates with a timeline, a root-cause section, and a list of corrective actions. When the same checkout page ships a regression that makes the credit-card field unreachable by keyboard, what usually happens is that a frontend engineer notices it three sprints later, files a Jira ticket tagged “accessibility,” and the ticket sits in a backlog until someone has spare capacity. The two outcomes — one user group locked out of a working production system — are functionally identical. The internal response is wildly different. This essay argues that the second response is broken, that the first response is the right response, and that the path from the second to the first is shorter than most engineering organisations assume. For the broader practitioner context, see our companion piece on treating accessibility debt as technical debt; this piece is about incident response specifically.

The shift is not philosophical. Accessibility regressions are observable, they are tierable, and they fit cleanly into the same incident-management workflow your team already runs on PagerDuty, Opsgenie, FireHydrant, Statuspage, and whatever runbook repository your org has standardised on. The instruments exist. The signal exists. The categorisation framework — WCAG 2.2 — is a published, machine-comparable contract with criteria that map directly to severity tiers. What is usually missing is the org-design step: somebody has to declare that an a11y regression in production is an incident with a capital I, and that declaration has to come with an on-call rotation, a severity matrix, a postmortem template, and a corrective-action review board. That declaration is the work this essay is trying to support.

Why accessibility regressions are SRE-grade

An incident, in modern SRE practice, is any unplanned event that degrades the service for users. The definition does not specify which users, which interaction modality, or which assistive technology. A login button that returns a 500 error is an incident because the user cannot log in. A login button that has lost its accessible name and now announces as “button” to a screen reader is also an incident, because that user cannot log in either. The internal teams reading those two failure modes have historically applied different mental categories (the first is “an outage,” the second is “a bug”), but from the user’s seat the experience is the same: a working production system has stopped working for them.

The reason a11y has lived outside this frame is partly tooling. Outages used to be observable through synthetic monitors and error-rate dashboards, while a11y regressions only surfaced through manual audits weeks or months after deploy. That gap has closed. Axe-core, Pa11y, Lighthouse CI, and Deque’s continuous-monitoring suite now run on every deploy in mature pipelines, with deltas surfaced into the same Datadog or Grafana panels that show p99 latency and 5xx rates. The signal is real-time. The other reason a11y has lived outside the frame is severity-tier confusion: an outage’s severity is obvious because the metric is binary (the page returns or it doesn’t), while an a11y regression’s severity has felt softer. It is not softer. A WCAG 2.2 Level A failure on a checkout page is a Sev-1, because the legally and operationally critical surface is unusable for an entire user class. A Level AA failure on the same checkout is a Sev-2. A regression in a Level AAA enhancement on a marketing page is a Sev-4. The matrix fits in a one-page document and can be ratified by an engineering org in a single planning meeting.
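To make the matrix concrete, here is a minimal Python sketch of it as a lookup table. Only three cells come from the text above (Level A on checkout, Level AA on checkout, Level AAA on a marketing page). The `Surface` categories and every other cell are assumptions, placeholders an engineering org would replace when it ratifies its own matrix.

```python
from enum import Enum


class Level(Enum):
    """WCAG conformance level of the failed success criterion."""
    A = "A"
    AA = "AA"
    AAA = "AAA"


class Surface(Enum):
    """Hypothetical page classes; these names are assumptions for the sketch."""
    CRITICAL = "critical"    # checkout, login, and similar journeys
    STANDARD = "standard"    # settings pages, help pages
    MARKETING = "marketing"  # landing pages, campaigns


SEVERITY_MATRIX = {
    # The three cells stated in the essay.
    (Level.A, Surface.CRITICAL): 1,
    (Level.AA, Surface.CRITICAL): 2,
    (Level.AAA, Surface.MARKETING): 4,
    # Assumed placeholder values; ratify your own.
    (Level.A, Surface.STANDARD): 2,
    (Level.A, Surface.MARKETING): 3,
    (Level.AA, Surface.STANDARD): 3,
    (Level.AA, Surface.MARKETING): 3,
    (Level.AAA, Surface.CRITICAL): 3,
    (Level.AAA, Surface.STANDARD): 4,
}


def severity_tier(level: Level, surface: Surface) -> int:
    """Return the Sev tier (1 is most severe) for a failed criterion on a surface."""
    return SEVERITY_MATRIX[(level, surface)]


print(severity_tier(Level.A, Surface.CRITICAL))  # prints 1
```

Keeping the matrix as data rather than branching logic means the one-page document and the code can be reviewed side by side.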

Detection: scanning and alerting

The detection stack for a11y-as-incident has three layers, and all of them already exist in your delivery pipeline if you have done any continuous-accessibility work at all. The first layer is build-time scanning. Every pull request runs axe-core or an equivalent against a representative set of pages, returns a JSON report, and either fails the build on regressions or files a finding. This is the same shape as Snyk, SonarQube, or any other quality gate. The second layer is deploy-time synthetic monitoring. After a deploy lands in production, an a11y synthetic (headless Chrome pointed at the same critical-user-journey pages your uptime monitor hits) runs the same axe scan and writes the result to your time-series store. The third layer is runtime anomaly detection, which here means alerting on deltas between scans. Whenever the deploy-time scan returns a delta, meaning a new violation that was not present in the prior deploy, that delta fires a webhook into PagerDuty (or Opsgenie, or whatever your team uses), with a payload that includes the page URL, the WCAG criterion, the severity tier, and a deeplink to the screenshot.
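The third layer can be sketched in a few lines of Python. The sketch below assumes two axe-core results objects (each with a `violations` list whose entries carry an `id` and `nodes` with `target` selectors) and builds an event body in the shape of PagerDuty's Events API v2. The function names, the tier-to-severity mapping, the sample URLs, and the routing key are all hypothetical. Sending the event over HTTP is left out on purpose.

```python
def violation_keys(report: dict) -> set:
    """Collect (rule id, CSS target) pairs from an axe-core results object."""
    keys = set()
    for violation in report.get("violations", []):
        for node in violation.get("nodes", []):
            target = " ".join(str(part) for part in node.get("target", []))
            keys.add((violation["id"], target))
    return keys


def new_violations(previous: dict, current: dict) -> list:
    """Violations present in the current scan but absent from the prior one."""
    return sorted(violation_keys(current) - violation_keys(previous))


def pagerduty_event(routing_key, page_url, rule_id, target,
                    wcag_criterion, sev_tier, screenshot_url) -> dict:
    """Build an Events API v2 'trigger' body; the tier mapping is an assumption."""
    pd_severity = {1: "critical", 2: "error", 3: "warning"}.get(sev_tier, "info")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"a11y Sev-{sev_tier}: {rule_id} on {page_url}",
            "source": page_url,
            "severity": pd_severity,
            "custom_details": {
                "wcag_criterion": wcag_criterion,
                "sev_tier": sev_tier,
                "target": target,
            },
        },
        "links": [{"href": screenshot_url, "text": "Screenshot"}],
    }


# Sample data, invented for illustration.
previous_scan = {"violations": []}
current_scan = {
    "violations": [
        {"id": "label", "impact": "critical",
         "nodes": [{"target": ["#cc-number"]}]},
    ]
}

for rule_id, target in new_violations(previous_scan, current_scan):
    event = pagerduty_event(
        routing_key="YOUR_ROUTING_KEY",
        page_url="https://shop.example.com/checkout",
        rule_id=rule_id,
        target=target,
        wcag_criterion="4.1.2",
        sev_tier=1,
        screenshot_url="https://scans.example.com/latest.png",
    )
    print(event["payload"]["summary"])
```

Keying the delta on rule id plus target means a violation that simply persists across deploys does not page anyone twice; only a newly introduced one does.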

That webhook is what turns a scan result into an incident. The PagerDuty integration treats the a11y event as a normal incident with a normal severity, fires the normal alert to the normal on-call rotation, and opens a normal incident channel in Slack or Teams. The on-call engineer who picks it up does not need any special accessibility training to triage; they need only the runbook entry that says “for an a11y Sev-1, roll back the deploy and page the a11y SME in the rotation.” That runbook entry is a five-line YAML file, no more complicated than the runbook for a database failover. The detection stack is not the hard part. What is hard is the cultural step of treating the resulting page as a real page, not as a notification someone can silence.
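The essay does not show the YAML itself, so the following is only a guess at its shape, written as a Python structure to stay in one language with the other examples. The Sev-1 steps are taken from the text; the fallback for other tiers and all identifiers are assumptions.

```python
# First-response steps keyed by severity tier. Only the Sev-1 entry is from the essay.
RUNBOOK = {
    1: [
        "roll back the deploy",
        "open the incident channel",
        "page the a11y SME in the rotation",
    ],
}

# Assumed fallback for tiers without an explicit entry.
DEFAULT_STEPS = ["hand the finding to the a11y SME for triage"]


def first_response(sev_tier: int) -> list:
    """Return the steps a generalist on-call engineer follows for a given tier."""
    return RUNBOOK.get(sev_tier, DEFAULT_STEPS)


for step in first_response(1):
    print(step)
```

The point of the sketch is the size: the lookup an on-call engineer needs at three in the morning is a short list, not a training course.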

The postmortem template

A blameless postmortem for an a11y incident shares the standard sections of any postmortem — summary, timeline, impact, root cause, lessons learned, action items — and adds two specific fields that the generic template omits. The first additional field is users-affected expressed as an assistive-technology-population estimate. A standard postmortem reports “approx. 14,000 users were unable to complete checkout between 14:02 and 15:37.” An a11y postmortem reports “approx. 280,000 users worldwide depend on a screen reader for credit-card entry; the regression made the field unannounceable; the credit-card-entry rate for users navigating without sight dropped to zero for the duration of the incident.” The second additional field is WCAG criteria violated, expressed by criterion number and conformance level: “1.3.1 Info and Relationships (A), 4.1.2 Name, Role, Value (A), 2.4.6 Headings and Labels (AA).” These two fields are what make the postmortem legible to legal, accessibility, and compliance partners who do not read engineering postmortems by default.
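One way to keep the two extra fields from being skipped is to make them required in whatever structure backs the template. The sketch below is an assumption about how that could look in Python; the class and field names are hypothetical, and only the section list and the WCAG criteria come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WcagCriterion:
    number: str  # e.g. "4.1.2"
    name: str    # e.g. "Name, Role, Value"
    level: str   # "A", "AA", or "AAA"

    def __str__(self) -> str:
        return f"{self.number} {self.name} ({self.level})"


@dataclass
class A11yPostmortem:
    # Standard sections shared with any postmortem.
    summary: str
    timeline: list
    impact: str
    root_cause: str
    lessons_learned: list
    action_items: list
    # The two a11y-specific fields; no defaults, so they cannot be left out.
    assistive_tech_users_affected: str
    wcag_criteria: list

    def wcag_line(self) -> str:
        """Render the criteria as criterion number, name, and conformance level."""
        return ", ".join(str(criterion) for criterion in self.wcag_criteria)


postmortem = A11yPostmortem(
    summary="Credit-card field lost its accessible name.",
    timeline=[],
    impact="Checkout unusable with a screen reader.",
    root_cause="",
    lessons_learned=[],
    action_items=[],
    assistive_tech_users_affected="screen-reader users at card entry",
    wcag_criteria=[
        WcagCriterion("1.3.1", "Info and Relationships", "A"),
        WcagCriterion("4.1.2", "Name, Role, Value", "A"),
        WcagCriterion("2.4.6", "Headings and Labels", "AA"),
    ],
)
print(postmortem.wcag_line())
```

The rendered line matches the format quoted above, which is the part legal and compliance readers look for first.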

The rest of the document follows the standard SRE Workbook template — a clean prose timeline keyed to UTC timestamps, a “what went well / what went poorly” reflection block, and a list of corrective actions each owned by a named engineer with a due date and a Jira ticket. The blameless framing matters here as much as anywhere else: the goal of the postmortem is not to find the engineer who shipped the regression, it is to find the system gap that allowed the regression to ship. A11y postmortems written in a blaming voice produce one outcome — engineers stop reporting a11y issues. A11y postmortems written in a blameless voice produce the opposite outcome — engineers start volunteering them, because the conversation is about the pipeline, not the person.

The 5-whys, adapted for accessibility

Toyota’s 5-whys exercise — drill from symptom to cause by asking “why” five times in succession — translates cleanly to a11y regressions and produces a different set of root-cause findings than the equivalent exercise on a latency outage. A worked example: the credit-card field has lost its accessible name. Why? Because the <label> element was removed in the last redesign sprint. Why? Because the redesigner replaced the label with a floating-label pattern implemented as a styled <span>. Why? Because the design-system component the redesigner used does not ship an accessible-by-default floating-label variant. Why? Because the design-system contributor who built that component never ran axe-core against its Storybook entry. Why? Because the design-system repository does not have an axe-core CI gate.

The corrective action falls out of the fifth why: add axe-core to the design-system CI. Notice how different that conclusion is from the corrective action a one-why exercise would produce (“re-add the label to the credit-card field”). The one-why fix patches the symptom. The five-why fix prevents the next twenty regressions of the same shape. A11y is particularly responsive to five-whys analysis because most a11y regressions root-cause to a pipeline or a design-system gap rather than to a single negligent commit. Once you find the gap, you fix it once and the entire class of regressions stops happening. A team that runs five-whys on every Sev-1 and Sev-2 a11y incident for six months ends up with a pipeline that catches the vast majority of regressions before they reach production, without anyone having to run a single additional manual audit.
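A CI gate of the kind the fifth why calls for can be very small. The sketch below assumes an axe-core results object has already been produced for each Storybook entry by whatever runner the repository uses (how that scan is invoked is not covered here) and only decides whether the build should fail. The function names and the choice to block on `serious` and `critical` impacts are assumptions.

```python
# axe-core reports an impact of "minor", "moderate", "serious", or "critical".
# Which impacts block a merge is a policy choice; this set is an assumption.
BLOCKING_IMPACTS = {"serious", "critical"}


def blocking_violations(report: dict, blocking=BLOCKING_IMPACTS) -> list:
    """Return the rule ids in an axe-core results object that should fail the build."""
    return sorted(
        violation["id"]
        for violation in report.get("violations", [])
        if violation.get("impact") in blocking
    )


def gate_exit_code(report: dict) -> int:
    """Return 1 (fail the build) if any blocking violation is present, else 0."""
    found = blocking_violations(report)
    for rule_id in found:
        print(f"a11y gate: blocking violation '{rule_id}'")
    return 1 if found else 0


# Sample report, invented for illustration: the floating-label regression.
sample_report = {
    "violations": [
        {"id": "label", "impact": "critical", "nodes": []},
        {"id": "region", "impact": "moderate", "nodes": []},
    ]
}
print(gate_exit_code(sample_report))
```

In a pipeline, the script would load the JSON report from disk and pass the return value to `sys.exit`, so that a non-zero code fails the job the same way a failing unit test does.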

Three case studies

A European retail-banking fintech platform we have spoken with adopted the a11y-as-incident pattern in late 2024 after a regulator inquiry forced a posture shift. They added axe-core scans to every deploy, wired the results into PagerDuty as a dedicated “a11y” service, and added an a11y SME to the incident-commander rotation as a second-tier responder paged for Sev-1 and Sev-2 events. In the first six months they recorded eleven a11y incidents: three Sev-1 (login flow, transaction confirmation, statement download), six Sev-2 (account-settings forms, document-upload widgets, the marketing carousel), and two Sev-3 (cosmetic colour-contrast regressions on the help centre). The Sev-1 mean-time-to-resolve was forty-six minutes. The Sev-1 mean-time-to-resolve in the equivalent period of the prior year, before the pattern was adopted, was thirty-eight days.

An eCommerce platform on the US west coast wired the same pattern into FireHydrant rather than PagerDuty and added a Statuspage component called “Assistive technology compatibility” that reports an explicit status publicly to customers. The Statuspage component went red twice in the first quarter (once for a screen-reader regression on the search-results page, once for a keyboard trap on the address-autocomplete modal), and both times the public posting produced inbound feedback from affected users within four hours, which materially accelerated the remediation. The customer-trust effect of publicly acknowledging an a11y incident, the platform’s head of engineering told us, was an unexpected positive externality.

A B2B SaaS vendor selling project-management software took the pattern further: they appointed an a11y subject-matter expert to the incident-commander rotation, gave that role veto power on production deploys that introduce Sev-1 or Sev-2 a11y regressions, and reduced their post-deploy a11y-incident rate by approximately 70 percent over twelve months. The org-design step of putting a named person in a named seat with named authority was the single highest-leverage change.

Org-design implications

The detection and postmortem tooling is the easy part of the shift. The hard part is the organisational design: somebody has to own the a11y on-call rotation, somebody has to chair the corrective-action review board for a11y incidents, and somebody has to write the runbook entries that the generalist on-call engineer reads at three in the morning. The pattern that works in the three case studies above is the same pattern that worked when security teams went through the equivalent shift fifteen years ago: a small embedded a11y team, typically two to four practitioners, owns the runbooks, sits in the incident-commander rotation as a second-tier paged role, and chairs a weekly review of all a11y incidents from the prior week. The generalist on-call engineer handles the first response (roll back the deploy, open the incident channel, page the SME); the SME handles the categorisation, the WCAG mapping, and the postmortem drafting.

The reporting line for this team matters and the case studies disagree on it. The fintech put their a11y team under the SRE org directly. The eCommerce platform put theirs under design-systems. The B2B SaaS vendor put theirs under engineering excellence, a sibling to the security team. None of these is wrong. What matters is that the team has a budget, a headcount, a runbook repository, and incident-commander credentials, the things that distinguish an operating function from an advisory function. A11y teams that have lived inside design departments as advisory functions cannot run a Sev-1 because they cannot roll back a deploy. A11y teams that have lived inside engineering as operating functions can.

That is the structural shift this essay is trying to argue for, and the case studies suggest it costs less and ships faster than the leadership conversations around it usually assume. The detection stack is off-the-shelf. The postmortem template is a one-pager. The runbook is five lines of YAML. The org-design change is one named role with one named authority. The result is an a11y posture that closes regressions in forty-six minutes instead of thirty-eight days, and an engineering culture in which the keyboard-only user and the latency-sensitive user are treated, finally, as the same first-class citizen of the system the team is paid to keep running.
