When an agent is compromised (by prompt injection, a jailbreak, or a logic bug), the question isn't only "how do we prevent the compromise?" That question matters, but it comes second. The first operational question is: "what can an attacker do with a compromised agent?" That is what confinement answers.
Confinement in 2026 rests on three layers: sandbox, capabilities, and kill-switch. Each tackles the problem from a different angle; together they cover the cases that come up in practice.
Layer 1 — Sandbox
The sandbox bounds the agent's execution environment. If an attacker runs code there (via an arbitrary-exec tool, an injected eval, a bug in the output parser), what do they actually have?
Minimum architecture
- Isolated container per session or per task, never shared between users.
- Ephemeral filesystem, no persistence between invocations.
- Strict outbound network allowlist: the agent reaches only the LLM provider and the declared tools. No free egress.
- No long-lived credentials mounted in plaintext: every call goes through a broker that authorizes it.
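As a minimal illustration of the outbound allowlist, here is a Python sketch of the check an egress proxy could apply to each request. The host names are hypothetical placeholders, and in a real deployment this check belongs in the network layer (proxy or firewall), outside anything the agent can modify.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the LLM provider plus the declared tools, nothing else.
EGRESS_ALLOWLIST = frozenset({
    "api.llm-provider.example",
    "tools.internal.example",
})

def egress_allowed(url: str, allowlist: frozenset = EGRESS_ALLOWLIST) -> bool:
    """Return True only for HTTPS requests whose exact host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False  # no plaintext or exotic schemes
    host = (parts.hostname or "").lower()
    return host in allowlist  # exact match: no suffix or wildcard matching
```

Exact host matching is the conservative choice: a suffix check would let a look-alike domain such as `api.llm-provider.example.attacker.net` slip through.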
Special case: agents that execute code
Many modern agents (Claude Code, Cursor agents, OpenAI Code Interpreter) execute code. That's the feature. The sandbox must then:
- Reject all network syscalls outside the allowlist, including those issued by code the agent executes.
- Reject writes outside a dedicated folder.
- Enforce strict CPU/memory limits; abnormal consumption is itself an abuse signal.
- Apply a session timeout.
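For the CPU, memory and timeout bullets, a minimal POSIX-only Python sketch using the standard `resource` and `subprocess` modules. It covers only resource limits and a throwaway working directory; it does not block network syscalls or writes outside that directory, which remains the job of the isolation layer itself. The limit values and the `run_untrusted` name are hypothetical.

```python
import resource
import subprocess
import sys
import tempfile
from typing import Optional

# Hypothetical limits; tune per workload.
CPU_SECONDS = 5          # CPU time before the kernel terminates the child
MEMORY_BYTES = 1 << 30   # 1 GiB cap on the child's address space
WALL_TIMEOUT_S = 10.0    # session-level wall-clock timeout

def _cap(kind: int, value: int) -> None:
    """Set soft and hard limits to `value`, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))

def _apply_limits() -> None:
    # Runs in the child after fork() and before exec() (POSIX only).
    _cap(resource.RLIMIT_CPU, CPU_SECONDS)
    _cap(resource.RLIMIT_AS, MEMORY_BYTES)

def run_untrusted(code: str, timeout: float = WALL_TIMEOUT_S) -> Optional[subprocess.CompletedProcess]:
    """Run agent-generated Python in a limited child process; None means it timed out."""
    with tempfile.TemporaryDirectory() as workdir:  # throwaway, deleted afterwards
        try:
            return subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
                preexec_fn=_apply_limits,
            )
        except subprocess.TimeoutExpired:
            return None  # subprocess.run has already killed the child
```

A `None` result (timeout) or a child killed by a signal is worth logging as a possible abuse signal rather than silently retried.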
gVisor (a user-space kernel that sandboxes containers) or Firecracker microVMs offer sufficient isolation at reasonable cost.
Layer 2 — Capabilities
The sandbox protects the environment; capabilities protect external actions. They are the agent equivalent of least privilege.
Principle
Every tool exposed to the agent is associated with:
- A scope: exactly what it can do (read files in project X, not all files; send one email at a time, not in bulk).
- An authorization: whether the agent holds it permanently or must request it through the UI.
- Traceability: every use logged with parameters and result.
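The three attributes can be made explicit per tool. A Python sketch in which `Capability`, `Authorization`, `authorize` and the project path are hypothetical names chosen for illustration: the scope is a predicate over the call's parameters, and every decision, allowed or not, is appended to an audit log together with its parameters.

```python
import json
import time
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any, Callable, Dict, List

class Authorization(Enum):
    ALWAYS = "always"      # granted for the whole session
    ASK_USER = "ask_user"  # each use must be approved through the UI

@dataclass(frozen=True)
class Capability:
    tool: str
    scope: Callable[[Dict[str, Any]], bool]  # True if these parameters are in scope
    authorization: Authorization

AUDIT_LOG: List[str] = []

def project_files(root: str) -> Callable[[Dict[str, Any]], bool]:
    """Scope: paths that resolve inside `root` (blocks ../ and absolute-path escapes)."""
    base = Path(root).resolve()
    def in_scope(params: Dict[str, Any]) -> bool:
        target = (base / params["path"]).resolve()
        return target.is_relative_to(base)  # Python 3.9+
    return in_scope

def authorize(cap: Capability, params: Dict[str, Any], user_approved: bool = False) -> bool:
    """Check scope and authorization, and log the decision with its parameters."""
    allowed = cap.scope(params) and (
        cap.authorization is Authorization.ALWAYS or user_approved
    )
    # The tool's own result would be appended the same way after execution.
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(), "tool": cap.tool, "params": params, "allowed": allowed,
    }))
    return allowed

# "Read files in project X, not all files."
read_project_x = Capability("read_file", project_files("/srv/projects/x"), Authorization.ALWAYS)
```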
The brokered-capability pattern
Instead of exposing send_email(to, subject, body) directly to the agent, expose a broker:
`` agent → broker → check (user OK? rate limit? recipient allowlist?) → send_email ``
The broker can silently refuse off-list recipients, ask for UI confirmation on high-impact actions, throttle calls, and log everything.
That's the difference between "my agent has Gmail access" (catastrophic) and "my agent can send 5 emails/day to user contacts, with confirmation for new recipients" (livable).
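A minimal Python sketch of that "livable" broker, under stated assumptions: the real `send_email` is injected as `send` and never handed to the agent, the recipient allowlist is the user's contact list, `confirm` stands in for a hypothetical UI confirmation hook, "new recipient" is read as a contact the agent has not emailed before, and the 5-per-day quota is a sliding 24-hour window. `EmailBroker` is a hypothetical name.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Set

@dataclass
class EmailBroker:
    contacts: Set[str]                     # recipient allowlist: the user's contacts
    confirm: Callable[[str], bool]         # hypothetical UI hook; True if the user approves
    send: Callable[[str, str, str], None]  # the real send_email, hidden from the agent
    daily_limit: int = 5
    window_s: float = 24 * 3600.0
    clock: Callable[[], float] = time.monotonic
    _sent_at: deque = field(default_factory=deque, init=False)
    _confirmed: set = field(default_factory=set, init=False)
    audit: list = field(default_factory=list, init=False)

    def send_email(self, to: str, subject: str, body: str) -> str:
        """The only email entry point exposed to the agent. Returns the decision."""
        now = self.clock()
        while self._sent_at and now - self._sent_at[0] >= self.window_s:
            self._sent_at.popleft()  # forget sends older than the window
        if to not in self.contacts:
            decision = "refused"     # off-list: refused without prompting anyone
        elif len(self._sent_at) >= self.daily_limit:
            decision = "throttled"
        elif to not in self._confirmed and not self.confirm(to):
            decision = "declined"    # first email to this contact, user said no
        else:
            self._confirmed.add(to)
            self.send(to, subject, body)
            self._sent_at.append(now)
            decision = "sent"
        self.audit.append((now, to, subject, decision))  # every attempt is logged
        return decision
```

What the agent is told on a refusal (the exact reason, or just a generic "not sent") is a separate design choice; the audit log keeps the real reason either way.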
Capabilities and user permissions
A capability is not enough on its own: it must be derived from the current user's permissions. An agent acting for a low-privilege user must not inherit the rights of the application's service account. This is the cornerstone of preventing privilege escalation through agents.
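One way to express "derived from the current user's permissions" is to authorize every action against the intersection of what the agent's tools could do and what the user may do, and never against the service account's rights. A Python sketch; the `(action, resource)` permission tuples and the `AgentSession` type are hypothetical:

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Permission = Tuple[str, str]  # (action, resource), e.g. ("read", "invoices")

@dataclass(frozen=True)
class AgentSession:
    user_id: str
    user_permissions: FrozenSet[Permission]  # loaded for the user the agent acts for
    tool_grants: FrozenSet[Permission]       # what the agent's tools could do in principle

    def effective(self) -> FrozenSet[Permission]:
        # Intersection, never union: a tool grant cannot exceed the user's own rights.
        return self.tool_grants & self.user_permissions

    def can(self, action: str, resource: str) -> bool:
        return (action, resource) in self.effective()
```

Even if the backend's service account can delete invoices, an agent acting for a read-only clerk only ever sees `("read", "invoices")`.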
Layer 3 — Kill-switch
The kill-switch is the ability to cleanly and instantly stop a misbehaving agent without breaking the rest of the system. Three levels:
Local kill-switch — per session
An endpoint that interrupts the current session, closes its tools, saves the logs, and notifies the user. It must be reachable from both the user UI and an admin dashboard.
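The order of those four steps matters: stop the agent first, then revoke its tools, then preserve evidence. A minimal asyncio sketch, assuming the agent loop runs as an `asyncio.Task` and each tool object exposes a `close()` method; `RunningSession`, `kill_session`, `save_logs` and `notify_user` are hypothetical names, and the HTTP endpoint that would call `kill_session` is left out.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, List

@dataclass
class RunningSession:
    session_id: str
    task: asyncio.Task  # the running agent loop
    tools: List[Any]    # hypothetical: each exposes close()
    log_buffer: List[str] = field(default_factory=list)

async def kill_session(
    session: RunningSession,
    save_logs: Callable[[str, List[str]], None],
    notify_user: Callable[[str, str], Awaitable[None]],
) -> None:
    # 1. Interrupt: cancel the agent loop and wait until it has actually stopped.
    session.task.cancel()
    try:
        await session.task
    except asyncio.CancelledError:
        pass
    # 2. Close every tool so no handle stays usable after the stop.
    for tool in session.tools:
        tool.close()
    # 3. Persist the logs for the investigation.
    save_logs(session.session_id, list(session.log_buffer))
    # 4. Tell the user the session was stopped.
    await notify_user(session.session_id, "session stopped by kill-switch")
```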
Global kill-switch — per version
A flag (feature flag, env var, remote config) that disables every agent of a given version. Useful when, for example, v2.3.1 has a bug that makes the agent vulnerable to a new injection pattern.
Per-capability kill-switch
Selectively disables a single tool. Example: if send_email turns out to be exploitable via injection, disable it for the duration of the investigation without breaking the rest.
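All three levels can be checked in one place, before every tool call. A Python sketch with hypothetical names (`KillSwitchState`, `blocked_by`); how the state gets populated, from a feature-flag service or remote config, is out of scope here.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class KillSwitchState:
    killed_sessions: Set[str] = field(default_factory=set)    # local: per session
    disabled_versions: Set[str] = field(default_factory=set)  # global: per agent version
    disabled_tools: Set[str] = field(default_factory=set)     # per capability

    def blocked_by(self, session_id: str, version: str, tool: str) -> Optional[str]:
        """Return which switch blocks this call, or None if it may proceed."""
        if session_id in self.killed_sessions:
            return "session"
        if version in self.disabled_versions:
            return "version"
        if tool in self.disabled_tools:
            return "tool"
        return None
```

Returning the level that fired, rather than a bare boolean, makes the audit log say which switch stopped a call.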
What makes a kill-switch useful
Not its existence, but its propagation time. If the kill-switch takes 15 minutes to reach every production instance, you lose 15 critical minutes.
Good practice:
- Target propagation < 30 seconds.
- Tested monthly in production exercises (not only staging).
- Documented: who can trigger it, in which situations, and after which checks.
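If instances learn about the kill-switch by polling a remote config, worst-case propagation is roughly one poll interval plus one fetch. A Python sketch of such a client, assuming a hypothetical `fetch` callable that returns the set of disabled tool names; the 10-second interval is an illustrative value chosen to stay under the 30-second target. As one possible design choice, it fails closed: if the state cannot be refreshed for longer than `max_stale_s`, every tool is treated as disabled.

```python
import time
from typing import Callable, FrozenSet

class KillSwitchClient:
    """Caches the remote kill-switch state; refreshes it lazily every `interval_s` seconds."""

    def __init__(
        self,
        fetch: Callable[[], FrozenSet[str]],  # hypothetical: returns disabled tool names
        interval_s: float = 10.0,
        max_stale_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._interval_s = interval_s
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._disabled: FrozenSet[str] = frozenset()
        self._last_ok = float("-inf")   # time of the last successful refresh
        self._last_try = float("-inf")  # time of the last refresh attempt

    def _refresh_if_due(self) -> None:
        now = self._clock()
        if now - self._last_try < self._interval_s:
            return
        self._last_try = now
        try:
            self._disabled = frozenset(self._fetch())
            self._last_ok = now
        except Exception:
            pass  # keep the last known state; staleness is checked by the caller

    def tool_enabled(self, tool: str) -> bool:
        self._refresh_if_due()
        if self._clock() - self._last_ok > self._max_stale_s:
            return False  # fail closed: the cached state is too old to trust
        return tool not in self._disabled
```

The monthly production exercise is where the real number gets measured: flip the flag and time how long until every instance reports the tool as disabled.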
The classic mistake: confinement off by default
Many teams ship an agent and plan to "enable confinement after the first feedback". That is the wrong order. Confinement must be:
- On by default at launch.
- Loosened progressively as usage patterns become known.
Not the other way around. Confinement added later is almost always partial, because downstream integrations already depend on the loose perimeter.
Maturity test — 5 questions
- Does your agent run in a sandbox isolated per session?
- Can you list, for each tool, allowed parameters and exceptions?
- Is there human confirmation on non-reversible external actions?
- Do you have a kill-switch that propagates in under 30 seconds?
- Did you test it in an exercise this month?
5 yes = top 10% of deployments. 3 yes = good starting point. 0-1 yes = don't ship the agent as-is.