AI agent confinement: sandbox, capabilities, kill-switch

When an agent is compromised, whether by prompt injection, jailbreak, or logic bug, "how do we prevent the compromise?" is not the only question. It matters, but it comes second. The first operational question is: "what can the attacker do with a compromised agent?" That is what confinement answers.

Confinement in 2026 rests on three layers: sandbox, capabilities, kill-switch. Each tackles the problem from a different angle. Together they cover the real cases.

Layer 1 — Sandbox

The sandbox bounds the agent's execution environment. If an attacker runs code there (via an arbitrary-exec tool, an injected eval, a bug in the output parser), what do they actually have?

Minimum architecture

Isolated container per session or per task, never shared between users.
Ephemeral filesystem, no persistence between invocations.
Outbound network restricted to a strict allowlist: the agent reaches only the LLM provider and declared tools. No free egress.
No long-lived credentials mounted in plaintext: everything goes through a broker that authorizes each call.
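
As a rough illustration of the egress rule, the sketch below shows only the allow/deny decision. The host names are hypothetical placeholders, and in a real deployment this check would more likely live in an egress proxy or firewall than in agent code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the LLM provider plus the declared tool backends.
ALLOWED_HOSTS = frozenset({"api.llm-provider.example", "tools.internal.example"})


def is_egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose exact host is on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Exact match on the parsed host: no suffix or substring matching,
    # so look-alike domains and userinfo tricks are rejected.
    return parts.scheme == "https" and host in allowed_hosts
```

Matching on the parsed host rather than on the raw string is the point of the design: a URL such as `https://api.llm-provider.example@evil.example/` contains the allowed name but resolves to a different host.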

Special case: agents that execute code

Many modern agents (Claude Code, Cursor agents, OpenAI Code Interpreter) execute code. That's the feature. The sandbox must then:

Block all network access outside the allowlist, including from child processes spawned via exec.
Reject writes outside a dedicated folder.
Enforce strict CPU/memory limits: abnormal consumption is an abuse signal.
Apply a session timeout.

gVisor-sandboxed containers or Firecracker microVMs offer sufficient isolation at reasonable cost.
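
The "reject writes outside a dedicated folder" rule can also be checked at the application level, as a complement to enforcement by the sandbox itself and never a replacement for it. The sketch below is a minimal version, assuming Python 3.9+ and a hypothetical helper name `resolve_in_workdir`.

```python
from pathlib import Path


def resolve_in_workdir(workdir: str, requested: str) -> Path:
    """Resolve `requested` against `workdir`; refuse anything that escapes it."""
    root = Path(workdir).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real target rather than the string the agent supplied.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"write outside sandbox folder: {requested}")
    return candidate
```

A tool wrapper would call this before opening any file for writing. Both relative escapes (`../../etc/passwd`) and absolute paths (`/etc/passwd`) end up outside `root` and are refused.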

Layer 2 — Capabilities

The sandbox protects the environment. Capabilities protect external actions. They are the agent equivalent of least privilege.

Principle

Every tool exposed to the agent is associated with:

A scope: exactly what it can do (read files in project X, not all files; send one email at a time, not in bulk).
An authorization: whether the agent always holds it or must request it through the UI.
Traceability: every use logged with parameters and result.
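
One possible way to make these three attributes concrete is a small descriptor attached to each tool. The sketch below is illustrative only: the `Capability` class and its field names are assumptions. It declares the scope and authorization mode and implements traceability, leaving enforcement of the authorization flag to whatever sits between the agent and the tool.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Capability:
    name: str
    scope: str                  # plain statement of what the tool may do
    needs_confirmation: bool    # True: request via UI; False: always granted
    handler: Callable[..., Any]
    audit_log: list = field(default_factory=list)

    def invoke(self, **params: Any) -> Any:
        entry = {"ts": time.time(), "tool": self.name, "params": params}
        try:
            result = self.handler(**params)
            entry["result"] = repr(result)
            return result
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            # Traceability: every use is recorded, whether it succeeded or failed.
            self.audit_log.append(json.dumps(entry, default=str))
```

In production the log would go to an append-only store outside the sandbox rather than an in-memory list, so a compromised agent cannot rewrite its own history.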

The brokered-capability pattern

Instead of exposing send_email(to, subject, body) directly to the agent, expose a broker:

`` agent → broker → check (user OK? rate limit? recipient allowlist?) → send_email ``

The broker can refuse silently if the recipient is off-list, prompt for UI confirmation on high-impact actions, throttle, and log everything.

That's the difference between "my agent has Gmail access" (catastrophic) and "my agent can send 5 emails/day to user contacts, with confirmation for new recipients" (livable).
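
The "5 emails/day to user contacts, with confirmation for new recipients" policy could look roughly like the sketch below. Everything here is an assumption made for illustration: the `EmailBroker` class, the `confirm` callback standing in for a UI prompt, and the in-memory counters.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EmailBroker:
    send: Callable[[str, str, str], None]   # the real send_email, never handed to the agent
    known_recipients: set
    confirm: Callable[[str], bool]          # hypothetical UI prompt; returns the user's decision
    daily_limit: int = 5
    clock: Callable[[], float] = time.time
    _sent_at: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def request_send(self, to: str, subject: str, body: str) -> str:
        now = self.clock()
        # Keep only sends from the last 24 hours for the rate limit.
        self._sent_at = [t for t in self._sent_at if now - t < 86_400]
        if len(self._sent_at) >= self.daily_limit:
            decision = "throttled"
        elif to not in self.known_recipients and not self.confirm(to):
            decision = "refused"
        else:
            self.send(to, subject, body)
            self._sent_at.append(now)
            decision = "sent"
        self.log.append((to, decision))  # every request is logged, including refusals
        return decision
```

The agent only ever sees `request_send`. The rate limit is checked first so that a throttled agent cannot spam the user with confirmation prompts.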

Capabilities and user permissions

A capability doesn't suffice on its own. It must be derived from the current user's permissions. An agent acting for a low-privilege user must not inherit the app service account's rights. This is the cornerstone for preventing privilege escalation through agents.
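
One simple way to express this derivation is a set intersection: the agent gets only what the tool declares and the current user already holds. The scope names below are hypothetical, and a real system would read them from its existing permission model.

```python
def effective_scopes(tool_scopes: frozenset, user_permissions: frozenset) -> frozenset:
    """Grant only what the tool declares AND the current user already holds."""
    return tool_scopes & user_permissions


def authorize(action: str, tool_scopes: frozenset, user_permissions: frozenset) -> bool:
    return action in effective_scopes(tool_scopes, user_permissions)


# Hypothetical scope names, for illustration only.
FILE_TOOL_SCOPES = frozenset({"project:read", "project:write"})
VIEWER_PERMISSIONS = frozenset({"project:read"})
```

With these values, a read-only user driving the file tool can read but not write, even though the tool itself supports writes. The service account's rights never enter the computation.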

Layer 3 — Kill-switch

The kill-switch is the ability to cleanly and instantly stop a misbehaving agent without breaking the rest of the system. Three levels:

Local kill-switch — per session

An endpoint that interrupts the current session, closes its tools, saves logs, and notifies the user. It must be reachable from both the user UI and an admin dashboard.

Global kill-switch — per version

A flag (feature flag, env var, remote config) that disables all agents of a given version. Useful when v2.3.1 has a bug making the agent vulnerable to a new injection pattern.

Per-capability kill-switch

Selectively disable a tool. Example: if send_email is found exploitable via injection, disable it during investigation without breaking the rest.
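
The three levels can share a single gate that runs before every tool call. The sketch below covers only that gating decision, under assumed names (`KillSwitch`, `may_run`). Closing tools, saving logs, and notifying the user are left out.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitch:
    killed_sessions: set = field(default_factory=set)   # local, per session
    killed_versions: set = field(default_factory=set)   # global, per agent version
    killed_tools: set = field(default_factory=set)      # per capability

    def may_run(self, session_id: str, agent_version: str, tool: str) -> bool:
        """Checked before every tool call; any matching switch blocks it."""
        return (
            session_id not in self.killed_sessions
            and agent_version not in self.killed_versions
            and tool not in self.killed_tools
        )
```

Disabling `send_email` during an investigation then amounts to adding one entry to `killed_tools`, and every other tool keeps working.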

What makes a kill-switch useful

Not its existence: its propagation time. If the kill-switch takes 15 minutes to reach all production instances, you lose 15 critical minutes.

Good practice:

Target a propagation time under 30 seconds.
Test it monthly in production exercises (not only in staging).
Document it: who can trigger it, in which cases, and after which checks.
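
Propagation time is often dominated by how long each instance caches the flag. The sketch below assumes a pull model in which every instance re-reads a remote flag once its cached copy is older than a TTL. The `CachedFlag` class and the `fetch` callback are hypothetical.

```python
import time
from typing import Callable


class CachedFlag:
    """Caches a remote kill flag for at most `ttl_seconds`, which bounds staleness."""

    def __init__(
        self,
        fetch: Callable[[], bool],
        ttl_seconds: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = fetch()
        self._fetched_at = clock()

    def is_killed(self) -> bool:
        # Refresh once the cached value is older than the TTL.
        if self._clock() - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        return self._value
```

Under this model the worst-case delay on an instance is roughly the TTL plus one fetch, so a 10-second TTL leaves headroom under a 30-second target. A push channel can shorten it further, but the TTL stays useful as a fallback.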

The classic mistake: confinement off by default

Many teams ship an agent and "enable confinement after first feedback". Wrong sequence. Confinement must be:

On by default at launch.
Loosened progressively as usage patterns become known.

Not the other way around. Confinement added later is almost always partial, because downstream integrations already depend on the loose perimeter.

Maturity test — 5 questions

Does your agent run in a sandbox isolated per session?
Can you list, for each tool, allowed parameters and exceptions?
Is there human confirmation on non-reversible external actions?
Do you have a kill-switch that propagates in under 30 seconds?
Did you test it in an exercise this month?

5 yes = top 10% of deployments. 3 yes = good starting point. 0-1 yes = don't ship the agent as-is.
