Compromised AI agent: 0-72h incident response runbook

A team discovers on Tuesday morning that one of its agents sent 230 emails to a customer list overnight, using the wrong template. Likely cause: prompt injection, hallucination, or a logic bug. What do you do in the next 72 hours?

This is exactly the situation an agent runbook must answer, and it has to be written cold, before the incident. Without one, the team loses 12 to 36 hours organizing what should have been organized once.

Here's the 0-72h runbook I use in field work, adapted to AI agents.

H+0 to H+1 — Contain

Freeze the agent immediately

Trigger the global kill-switch: all agents on this version stop.
If there is no global kill-switch: use an "agents off" feature flag, or remove the route at the load balancer.
Confirm in logs that agents receive no new requests.

This is a binary decision taken by a single person (on-call or CISO), not by a committee.
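
For teams that have no kill-switch yet, here is a minimal sketch of one in Python. It assumes the switch is a JSON flag file read on every request; the `agents_enabled` key, the `AgentsDisabled` exception and `handle_request` are hypothetical names for illustration, not part of any particular framework. The one design choice that matters is that it fails closed: a missing or unreadable flag means agents are off.

```python
import json
from pathlib import Path


class AgentsDisabled(RuntimeError):
    """Raised when the kill-switch is engaged."""


def agents_enabled(flag_file: Path) -> bool:
    """Return True only if the flag file explicitly enables agents.

    Fails closed: a missing, unreadable or malformed flag counts as "agents off".
    """
    try:
        flags = json.loads(flag_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False
    return isinstance(flags, dict) and flags.get("agents_enabled") is True


def handle_request(flag_file: Path, request: str) -> str:
    # Check the switch on every request so that flipping it takes effect at once.
    if not agents_enabled(flag_file):
        raise AgentsDisabled("kill-switch engaged; request refused")
    # Placeholder for the real agent call (hypothetical).
    return f"agent handled: {request}"
```

Whoever is on call flips the flag to false (or deletes the file), then confirms in the logs that every new request is refused.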

Preserve evidence

Snapshot of agent long-term memory at time t.
Export complete audit logs for the last 48-72h.
Capture of the state of tools accessible to the agent.
Identify affected sessions.

No remediation action should alter this data before export. Containment is read-only on evidence by design.
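
One way to keep the export honest is to copy the evidence files and record a hash of each copy at capture time. The sketch below assumes the memory snapshot and audit logs are available as files on disk; `preserve_evidence` and the manifest layout are illustrative names, not a standard.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large log exports do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_evidence(sources: list[Path], dest_dir: Path) -> Path:
    """Copy evidence files into dest_dir and write a manifest of their hashes."""
    # exist_ok=False: refuse to overwrite an earlier export.
    dest_dir.mkdir(parents=True, exist_ok=False)
    entries = []
    for index, src in enumerate(sources):
        # The index prefix avoids collisions between files with the same name.
        copy = dest_dir / f"{index:03d}_{src.name}"
        shutil.copy2(src, copy)  # the source is only read, never modified
        entries.append(
            {"source": str(src), "copy": copy.name, "sha256": sha256_of(copy)}
        )
    manifest = dest_dir / "manifest.json"
    manifest.write_text(
        json.dumps(
            {
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "files": entries,
            },
            indent=2,
        ),
        encoding="utf-8",
    )
    return manifest
```

Any later question about whether the evidence was touched can then be answered by re-hashing the copies against the manifest.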

Internal notification

Security on-call.
Agent product owner.
DPO if potential GDPR impact.
Executive leadership if external impact or significant amount at stake.

These four must be reachable within an hour. If you don't know who's on call, the runbook has already failed.

H+1 to H+6 — Understand

Reconstruct the sequence

From the logs:

What's the first session where abnormal behavior appears?
What input triggered it (user prompt, RAG content, tool output)?
Which tools were called and with what parameters?
What's the scope: how many sessions, how many users, which external recipients?

Without a structured audit log, this step can take days. With one, hours.
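
To show why structure matters, here is a sketch that answers the four questions above from a JSON-lines audit log. The event schema (`ts`, `session_id`, `type`, `source`, `tool`, `params`) is an assumption made for the example, as is the idea that timestamps are ISO-8601 strings in a single timezone; adapt both to whatever your log actually records.

```python
import json
from typing import Callable, Iterable, Optional

Event = dict


def parse_events(lines: Iterable[str]) -> list[Event]:
    """Parse JSON-lines audit records and order them by timestamp."""
    events = [json.loads(line) for line in lines if line.strip()]
    # Assumes same-format ISO-8601 strings, which sort chronologically as text.
    return sorted(events, key=lambda event: event["ts"])


def reconstruct(
    events: list[Event], is_abnormal: Callable[[Event], bool]
) -> Optional[dict]:
    """Summarize the first abnormal event, its trigger and the overall scope."""
    abnormal = [event for event in events if is_abnormal(event)]
    if not abnormal:
        return None
    first = abnormal[0]
    # Trigger candidate: the last input seen in that session before the event.
    trigger = None
    for event in events:
        if event is first:
            break
        if event["session_id"] == first["session_id"] and event["type"] == "input":
            trigger = event
    return {
        "first_session": first["session_id"],
        "first_ts": first["ts"],
        "trigger": trigger,
        "sessions": sorted({event["session_id"] for event in abnormal}),
        "tool_calls": [
            (event["tool"], event["params"])
            for event in abnormal
            if event["type"] == "tool_call"
        ],
    }
```

The `is_abnormal` predicate is where the incident-specific knowledge goes, for instance "an email tool call whose template is not the approved one".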

Identify likely cause

Four hypotheses to eliminate in order:

Code bug: regression in agent logic, bad merge.
Prompt injection: untrusted content hijacked the agent.
Memory poisoning: persisted poisoned instruction.
Model changed: silent provider update.

Each calls for a different remediation. Don't skip any of them: fixing the wrong cause guarantees recurrence.

Map impact

External recipients: how many people received an unwanted email, call, notification?
Leaked data: what info left your perimeter?
Irreversible actions: what got deleted, modified, published?
Commitments made: did the agent "promise" something to a customer in your name?
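
The four questions above can be turned into a first tally straight from the tool calls recorded during the incident. This is a sketch under assumptions: the tool names, the `target` field and the mapping from tool to impact category are all hypothetical and need to reflect your own agent's tools.

```python
from collections import defaultdict

# Hypothetical mapping from tool name to the impact category it falls under.
IMPACT_BY_TOOL = {
    "send_email": "external_recipients",
    "export_report": "leaked_data",
    "delete_record": "irreversible_actions",
    "send_quote": "commitments_made",
}


def map_impact(tool_calls: list[dict]) -> dict[str, list[str]]:
    """Group the distinct targets of each tool call by impact category."""
    impact: dict[str, set[str]] = defaultdict(set)
    for call in tool_calls:
        # Tools missing from the mapping land in "unclassified" for manual review.
        category = IMPACT_BY_TOOL.get(call["tool"], "unclassified")
        impact[category].add(call["target"])
    return {category: sorted(targets) for category, targets in impact.items()}
```

Targets are deduplicated, so a customer who received three emails counts as one affected recipient. Anything in "unclassified" needs a human to look at it.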

H+6 to H+24 — Remediate (part 1)

Fix the immediate cause

Per confirmed hypothesis:

Code bug: patch and deploy the fixed version (agents stay off for now; only the service goes back up).
Prompt injection: add the attack signature to the test set, harden the guardrail.
Memory poisoning: purge contaminated entries, restore from earlier snapshot if available.
Model changed: escalate to vendor, pin to an explicit version (Anthropic and OpenAI now support explicit version pinning).
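
For the "model changed" case, a small guard at startup can refuse any model identifier that is not pinned. The sketch below assumes a pinned identifier ends with a date, which matches the dated snapshot names both vendors publish, but the regex is my assumption rather than an official rule; check it against your provider's current naming.

```python
import re

# Assumption: a pinned identifier ends with a date, written either as
# YYYYMMDD or as YYYY-MM-DD. Aliases and bare family names do not.
_DATED_SUFFIX = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")


def is_pinned(model_id: str) -> bool:
    """Return True if the identifier looks like a dated snapshot."""
    return bool(_DATED_SUFFIX.search(model_id))


def require_pinned(model_id: str) -> str:
    """Return the identifier unchanged, or raise if it is a floating alias."""
    if not is_pinned(model_id):
        raise ValueError(f"model {model_id!r} is not pinned to a dated version")
    return model_id
```

Run at agent startup or in CI, this turns a silent provider update into a deliberate, reviewed change of identifier.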

Prepare communications

To impacted people (customers, partners): factual message, what happened, what you're doing, what you ask of them (cancel an action, ignore a message).
Internal: synthetic note for the team, to align sales and support responses.
To CNIL and/or other authorities: if personal data is involved, assess whether the GDPR Article 33 obligation applies (notification within 72 hours of becoming aware of the breach).
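
The 72-hour window runs from the moment you become aware of the breach, not from the moment it happened, so it helps to write the deadline down as soon as the clock starts. A minimal sketch of that arithmetic, assuming the awareness time is recorded as a timezone-aware timestamp (the function names are illustrative):

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Return the latest notification time, 72 hours after awareness."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across timezones and DST changes.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + NOTIFICATION_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Return the hours left before the deadline (negative once it has passed)."""
    return (notification_deadline(aware_at) - now) / timedelta(hours=1)


# Example: the team becomes aware on Tuesday 7 January 2025 at 09:00 UTC.
aware = datetime(2025, 1, 7, 9, 0, tzinfo=timezone.utc)
deadline = notification_deadline(aware)  # Friday 10 January 2025, 09:00 UTC
```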

Restore what can be restored

Cancel reversible actions (delete unsent scheduled messages, take down publications).
Tell recipients of the unwanted emails to disregard them.
Document what's irreversible.

H+24 to H+72 — Stabilize

Deploy a fixed version

Patched code, explicit new version.
Non-regression tests on the identified cause.
Red team on the encountered scenario + close variants.
Progressive deployment (canary 5%, then 25%, then 100%) with tight monitoring.
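
The canary steps can be sketched as a deterministic bucketing of users. The code below is one common way to do it, not the only one: each user is hashed into a bucket from 0 to 99, and a stage of N percent includes buckets below N. The stage values come from the list above; `bucket` and `uses_fixed_version` are illustrative names.

```python
import hashlib

ROLLOUT_STAGES = (5, 25, 100)  # percent of users on the fixed version


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_fixed_version(user_id: str, stage_percent: int) -> bool:
    """Return True if this user is routed to the fixed version at this stage."""
    if stage_percent not in ROLLOUT_STAGES:
        raise ValueError(f"unknown rollout stage: {stage_percent}")
    # Buckets are stable, so users in the 5% stage stay included at 25% and 100%.
    return bucket(user_id) < stage_percent
```

Because the assignment depends only on the user identifier, nobody flips back and forth between versions as the percentage grows, which keeps the monitoring signal readable.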

Formal external notifications

CNIL if applicable (a personal data breach must be notified within 72 hours of becoming aware of it).
Customers: formal mail/email with incident recap, measures taken, contacts.
Cyber insurance partners: claim filing if covered.
CERT-FR or sectoral authority if regulated (NIS2, DORA).

Structured post-mortem

Within 72h, an internally-shared doc:

Detailed chronology.
Confirmed root cause.
Measured impact (recipients, data, costs).
Corrective actions taken.
Future corrective actions (short, mid-term).
Metrics you'll track to detect recurrence.

Not a blame document. A memory document. It will be valuable in 12 months if a similar incident occurs.

Prerequisites for this runbook to work

Everything above assumes:

Kill-switch operational and tested.
Structured audit log.
On-call defined: who calls whom in what timeframe.
Comm templates: customer email, press statement, CNIL notification — 80% ready in advance.
Documented CNIL notification procedure.
Relationship established with a cyber lawyer and (when relevant) a negotiator.

Without these six, the runbook stays theoretical. With them, you handle an incident while keeping the upper hand.

The cold test

At least once a year, run a tabletop incident exercise. Not theory: precise scenarios ("your sales agent sent a wrong-pricing offer to 50 customers") and timed decisions. That's the difference between a team with a runbook and a team that can execute it under pressure.
