AI-Assisted Incident Response — When Production Breaks

Production incidents are high-stakes, time-pressured, and often involve code or systems the on-call engineer didn't write. AI can significantly accelerate incident response — but only if you've prepared for it.

Production incidents don’t wait for you to be familiar with the system. At 2am, the on-call engineer gets paged. The service is down. They may or may not have worked on the component that’s failing.

This is the scenario where AI assistance has the highest stakes and the highest potential value. It’s also where a wrong AI suggestion is most likely to make the incident worse.


What AI Does Well in Incidents

Codebase navigation under pressure. “Show me every place this function is called and what could change this state” is a question Claude Code answers in 30 seconds. Without AI, it’s a multi-step grep and code read that takes minutes under pressure. In an incident, those minutes matter.

Log interpretation. AI interprets error messages and stack traces quickly and, in most cases, accurately. Given a stack trace, Claude Code can identify the failing component, suggest likely root causes, and point to relevant code sections. This doesn’t replace judgment, but it accelerates the initial triage.

Runbook discovery. If your runbooks are well-documented, AI can retrieve the relevant procedure quickly. “Show me the runbook for payment processor failures” with access to your documentation is faster than searching a wiki under pressure.

Hypothesis generation. “Given this error pattern, what are the likely causes?” AI generates a list of hypotheses ranked by likelihood for common failure modes. The on-call engineer can work through them systematically rather than starting from an empty hypothesis list.

Draft communications. During an incident, stakeholder updates take time and mental bandwidth. AI drafts status updates given the incident timeline. The engineer reviews and sends. This keeps communications consistent without pulling the engineer off the investigation.
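As an illustration of the kind of draft involved, here is a minimal Python sketch that turns an incident timeline into a status update for the engineer to review. The function name `draft_status_update`, its parameters, and the message layout are all hypothetical; in practice the AI would produce the prose, and this only shows the shape of the input and output.

```python
from datetime import datetime


def draft_status_update(service, status, timeline, next_update_minutes=30):
    """Build a plain-text status draft from (datetime, description) pairs.

    Hypothetical helper: the layout is an assumption, not a standard format.
    """
    # The header marks the text as a draft so it is not sent unreviewed.
    lines = [f"[DRAFT - review before sending] {service}: {status}", "", "Timeline:"]
    # Sort by time so entries added out of order still read chronologically.
    for when, what in sorted(timeline, key=lambda entry: entry[0]):
        lines.append(f"- {when:%H:%M} {what}")
    lines.append("")
    lines.append(f"Next update in {next_update_minutes} minutes.")
    return "\n".join(lines)
```

The draft marker in the first line is deliberate: the update only goes out after the engineer has read it.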


Where AI Gets Incidents Wrong

Confident wrong answers. The pressure of an incident makes engineers want a clear answer. AI gives clear answers whether or not it has enough information. The hallucination risk is real: AI suggests a root cause that sounds plausible but is incorrect, and the engineer spends 20 minutes investigating the wrong thing.

Missing context. The incident is usually caused by something that changed. AI doesn’t automatically have access to recent deployments, configuration changes, or the infrastructure changes from earlier that day. If you don’t provide this context, AI reasons from code without the relevant causal information.
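One way to close this gap is to hand the AI a short list of what changed before the incident started. The sketch below assumes you can already export deployments and configuration changes as simple records; the `ChangeEvent` type, the field names, and the 24-hour window are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    # Hypothetical record shape; adapt to whatever your deploy tooling exports.
    timestamp: datetime
    kind: str      # e.g. "deploy", "config", "infra"
    summary: str


def recent_changes(events, incident_start, window_hours=24):
    """Return events from the window before the incident, newest first."""
    cutoff = incident_start - timedelta(hours=window_hours)
    in_window = [e for e in events if cutoff <= e.timestamp <= incident_start]
    return sorted(in_window, key=lambda e: e.timestamp, reverse=True)


def format_context(events, incident_start, window_hours=24):
    """Render the recent changes as plain text to paste into a prompt."""
    lines = [f"Changes in the {window_hours}h before {incident_start.isoformat()}:"]
    changes = recent_changes(events, incident_start, window_hours)
    if not changes:
        # Say so explicitly: "nothing changed" is also useful information.
        lines.append("- (none recorded)")
    for e in changes:
        lines.append(f"- {e.timestamp.isoformat()} [{e.kind}] {e.summary}")
    return "\n".join(lines)
```

Listing the newest change first reflects the point above: the most recent change is usually the first suspect.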

The cascade problem. In distributed system incidents, the root cause is often several steps removed from the symptom. AI tends to focus on the visible symptom. A human engineer who knows the system knows to look upstream.


Preparing for AI-Assisted Incident Response

The work you do before an incident determines how useful AI is during one.

Give AI access to relevant context. Configure Claude Code with access to your logs, monitoring dashboards, and recent deployment history. The more context is available at incident time, the more useful the AI’s responses will be.

Document your runbooks in AI-accessible formats. Runbooks in a format Claude Code can read and retrieve are more valuable than PDF runbooks in a shared drive. Plain markdown in a repository is ideal.
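To show why plain markdown in a repository is easy to work with, here is a minimal sketch of keyword retrieval over a runbook directory. It is a deliberately naive stand-in for whatever search the AI tooling performs; the function name `find_runbooks` and the scoring rule (count of query terms present in the file) are assumptions made for illustration.

```python
from pathlib import Path


def find_runbooks(root, query):
    """Rank markdown files under root by how many query terms they contain."""
    terms = [t.lower() for t in query.split()]
    scored = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        # Score is the number of distinct query terms found in the file.
        score = sum(1 for t in terms if t in text)
        if score:
            scored.append((score, path))
    # Highest score first; ties are broken by path for stable output.
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [path for _, path in scored]
```

A call such as `find_runbooks("runbooks", "payment processor")` returns matching files with the best match first. A PDF in a shared drive offers no equally simple path.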

Pre-incident hypothesis documents. For each critical service, maintain a document of known failure modes, their symptoms, and their remedies. AI can retrieve and cross-reference these during an incident.
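A known-failure-modes document can be as simple as a list of structured entries. The sketch below shows one possible shape and a basic symptom match; the `FailureMode` type, the example entries, and the ranking rule (fraction of a mode's symptoms that were observed) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    # Hypothetical schema; symptoms are assumed to be stored in lowercase.
    name: str
    symptoms: frozenset
    remedy: str


def match_failure_modes(catalog, observed):
    """Return known failure modes that share symptoms with what was observed."""
    observed = {s.lower() for s in observed}
    ranked = []
    for mode in catalog:
        overlap = len(mode.symptoms & observed)
        if overlap:
            # Rank by the share of this mode's symptoms that are present.
            ranked.append((overlap / len(mode.symptoms), mode))
    ranked.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [mode for _, mode in ranked]


# Illustrative entries only; a real catalog comes from your own incidents.
CATALOG = [
    FailureMode("db-pool-exhaustion",
                frozenset({"timeouts", "high db connections"}),
                "Raise pool size or restart workers"),
    FailureMode("bad-deploy",
                frozenset({"5xx spike", "timeouts", "recent deploy"}),
                "Roll back after review"),
]
```

The result is a list of candidates to check, not a diagnosis. A match tells the engineer which documented remedy to read first.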

Post-incident feedback loops. After each incident, add the root cause and resolution to your institutional knowledge base. The AI’s answers about your specific systems become better grounded as that knowledge base grows.


The On-Call AI Workflow

The pattern that works during incidents:

  1. Symptom triage: ask AI to interpret the initial signals — errors, metrics, alerts. Get hypotheses.
  2. Targeted investigation: use AI to navigate the codebase and logs relevant to the top hypotheses.
  3. Verify before acting: any remediation action should be understood by the engineer before execution. “AI suggested this rollback” is not sufficient justification for a production change.
  4. Communications: use AI to draft status updates; engineer reviews before sending.
  5. Post-incident: use AI to draft the initial post-mortem structure; engineer fills in the context.

The through-line: AI accelerates; the engineer decides. No remediation action should happen just because AI suggested it. The engineer must understand the action and confirm it first.
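The "verify before acting" rule can be made explicit in tooling. Below is a minimal sketch of a gate that refuses to treat a remediation as ready until an engineer has written down why it is correct and has confirmed it. The `Remediation` type and its fields are hypothetical; real change-management systems will differ.

```python
from dataclasses import dataclass


@dataclass
class Remediation:
    # Hypothetical record for a proposed production change.
    description: str
    suggested_by: str             # e.g. "ai" or an engineer's name
    engineer_rationale: str = ""  # the engineer's own explanation
    confirmed_by: str = ""        # who approved execution


def ready_to_execute(action):
    """True only when an engineer has both explained and confirmed the action."""
    has_rationale = bool(action.engineer_rationale.strip())
    has_confirmation = bool(action.confirmed_by.strip())
    # Who suggested the action is deliberately ignored: an AI suggestion
    # alone never counts as justification.
    return has_rationale and has_confirmation
```

Requiring a written rationale forces the engineer to state the reasoning in their own words. That is a cheap check on whether the suggestion was actually understood.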


Day 17 of the AI-First Engineering Team series. Previous: Knowledge Transfer and Institutional Memory with AI

This post is licensed under CC BY 4.0 by the author.