
AI agent confinement: sandbox, capabilities, kill-switch

Three technical layers to limit what an agent can do when it goes off the rails. Concrete architecture, not theory.

Aroua Biri

When an agent is compromised (by prompt injection, jailbreak, or logic bug), "how do we prevent the compromise?" is a legitimate question, but not the first one. The first operational question is: "what can the attacker do with a compromised agent?" That is what confinement answers.

Confinement in 2026 rests on three layers: sandbox, capabilities, kill-switch. Each tackles the problem from a different angle. Together they cover the cases seen in practice.

Layer 1 — Sandbox

The sandbox bounds the agent's execution environment. If an attacker runs code there (via an arbitrary-exec tool, an injected eval, a bug in the output parser), what do they actually have?

Minimum architecture

  • Isolated container per session or per task, never shared between users.
  • Ephemeral filesystem, no persistence between invocations.
  • Strict outbound network allowlist: the agent reaches only the LLM provider and declared tools. No free egress.
  • No long-lived credentials mounted in plaintext: everything goes through a broker that authorizes each call.
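
As an illustration, the four points above map fairly directly onto standard `docker run` flags. The sketch below only builds the command line; the function name, the container naming scheme, the `/workspace` path, and the resource values are assumptions made for the example. It also assumes the named Docker network is one where a proxy or firewall enforces the egress allowlist, since Docker does not do that by itself.

```python
def sandbox_argv(session_id: str, image: str, egress_network: str) -> list[str]:
    """Build a `docker run` command for one agent session (illustrative values)."""
    return [
        "docker", "run",
        "--rm",                                  # container is deleted when the session ends
        "--name", f"agent-{session_id}",         # one container per session, never shared
        "--read-only",                           # root filesystem is immutable
        "--tmpfs", "/workspace:rw,size=256m",    # ephemeral scratch space, gone on exit
        "--network", egress_network,             # assumed: allowlist enforced on this network
        "--cap-drop", "ALL",                     # drop all Linux capabilities
        "--security-opt", "no-new-privileges",   # block setuid-style privilege gain
        "--memory", "512m",
        "--cpus", "1",
        "--pids-limit", "128",
        image,
        # Deliberately no -v mounts and no -e secrets: credentials stay with the broker.
    ]


if __name__ == "__main__":
    print(" ".join(sandbox_argv("s-42", "agent-runtime:latest", "agent-egress")))
```

Keeping the command in one function makes the confinement settings reviewable in a single place rather than scattered across deployment scripts.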

Special case: agents that execute code

Many modern agents (Claude Code, Cursor agents, OpenAI Code Interpreter) execute code. That's the feature. The sandbox must then:

  • Reject all network syscalls outside the allowlist, even via exec.
  • Reject writes outside a dedicated folder.
  • Enforce strict CPU/memory limits: abnormal consumption is an abuse signal.
  • Apply a session timeout.
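
Two of these rules, write confinement and the timeout, can be sketched in a few lines of Python. The function names and the 10-second default are assumptions for the example. Checks like these run in the host process and are defense in depth only: real enforcement of network, filesystem, and CPU/memory limits has to sit at the sandbox level, because code running inside the sandbox can simply bypass a Python-level check.

```python
import subprocess
import sys
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Return the absolute target path, or raise if it escapes the workspace."""
    root = workspace.resolve()
    target = (root / requested).resolve()  # collapses ".." and follows symlinks
    if target != root and root not in target.parents:
        raise PermissionError(f"write outside workspace refused: {requested}")
    return target


def run_with_timeout(code: str, workspace: Path,
                     timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run agent-supplied Python in a child process, killed after timeout_s."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
    )
```

A tool that writes files would call `resolve_in_workspace` on every path the agent supplies before opening it, and treat `PermissionError` and `subprocess.TimeoutExpired` as events worth logging.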

gVisor-sandboxed containers or Firecracker microVMs offer sufficient isolation at reasonable cost.

Layer 2 — Capabilities

The sandbox protects the environment. Capabilities protect external actions. It's the least-privilege equivalent for agents.

Principle

Every tool exposed to the agent is associated with:

  • A scope: exactly what it can do (read files in project X, not all files; send one email at a time, not in bulk).
  • An authorization: whether the agent always holds it, or must request it through the UI.
  • Traceability: every use logged with parameters and result.
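
One possible way to carry these three attributes in code is a small descriptor attached to each tool, plus a single call path that checks authorization and logs every use. All names here (`Capability`, `Authorization`, `invoke`) are hypothetical, and the scope is kept as a plain description; a real system would also validate parameters against it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Authorization(Enum):
    ALWAYS = "always"      # the agent may call the tool without asking
    ASK_USER = "ask_user"  # each call needs confirmation in the UI


@dataclass(frozen=True)
class Capability:
    tool: str
    scope: str                    # e.g. "read files in project X"
    authorization: Authorization


def invoke(cap: Capability, params: dict[str, Any], tool_fn: Callable[..., Any],
           audit_log: list[dict[str, Any]], approved: bool = False) -> Any:
    """Call tool_fn under a capability; log parameters and result either way."""
    if cap.authorization is Authorization.ASK_USER and not approved:
        audit_log.append({"tool": cap.tool, "params": params,
                          "result": "denied: needs user approval"})
        raise PermissionError(f"{cap.tool} requires user approval")
    result = tool_fn(**params)
    audit_log.append({"tool": cap.tool, "params": params, "result": result})
    return result
```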

The brokered-capability pattern

Instead of exposing send_email(to, subject, body) directly to the agent, expose a broker:

`` agent → broker → check (user OK? rate limit? recipient allowlist?) → send_email ``

The broker can refuse silently if the recipient is off-list, prompt for UI confirmation on high-impact actions, throttle, and log everything.

That's the difference between "my agent has Gmail access" (catastrophic) and "my agent can send 5 emails/day to user contacts, with confirmation for new recipients" (livable).
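
A minimal sketch of such a broker, using the "5 emails/day to user contacts, with confirmation for new recipients" policy as the example. The class, its field names, and the `confirm_with_user` callback are hypothetical; the rate limit is kept in memory here, whereas a real deployment would need shared, persistent state.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EmailBroker:
    send_email: Callable[[str, str, str], None]  # real sender; never exposed to the agent
    confirm_with_user: Callable[[str], bool]     # hypothetical UI prompt for new recipients
    contacts: set[str]                           # recipient allowlist (the user's contacts)
    daily_limit: int = 5
    clock: Callable[[], float] = time.time
    sent_times: list[float] = field(default_factory=list)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def request_send(self, to: str, subject: str, body: str) -> bool:
        """The only entry point the agent gets. Returns True if the email went out."""
        now = self.clock()
        # Keep only sends from the last 24 hours for the rate limit.
        self.sent_times = [t for t in self.sent_times if now - t < 86_400]
        if len(self.sent_times) >= self.daily_limit:
            outcome = "denied:rate_limit"
        elif to not in self.contacts and not self.confirm_with_user(to):
            outcome = "denied:unconfirmed_recipient"
        else:
            self.send_email(to, subject, body)
            self.sent_times.append(now)
            outcome = "sent"
        self.audit_log.append((to, outcome))  # every request is logged, allowed or not
        return outcome == "sent"
```

The agent is handed `broker.request_send` as its tool. The underlying `send_email` function and its credentials stay on the broker side.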

Capabilities and user permissions

A capability doesn't suffice on its own. It must be derived from the current user's permissions. An agent acting for a low-privilege user must not inherit the app service account's rights. This is the cornerstone for preventing privilege escalation through agents.
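
In code, this derivation can be as simple as filtering the tool list by the permissions of the user the agent is acting for. The tool names and permission strings below are invented for the example; the point is that the service account's rights never enter the computation.

```python
# Hypothetical mapping: which user permission each agent tool requires.
TOOL_REQUIRES: dict[str, str] = {
    "read_invoices": "billing:read",
    "refund_payment": "billing:write",
    "send_email": "mail:send",
}


def capabilities_for(user_permissions: frozenset[str]) -> set[str]:
    """Tools the agent may use in this session, derived from the current user only."""
    return {tool for tool, needed in TOOL_REQUIRES.items() if needed in user_permissions}


if __name__ == "__main__":
    # A read-only user never gets refund_payment, whatever the service account can do.
    print(sorted(capabilities_for(frozenset({"billing:read", "mail:send"}))))
```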

Layer 3 — Kill-switch

The kill-switch is the ability to cleanly and instantly stop a misbehaving agent without breaking the rest of the system. Three levels:

Local kill-switch — per session

An endpoint that interrupts the current session, closes tools, saves logs, notifies the user. Must be reachable from the user UI and from an admin dashboard.

Global kill-switch — per version

A flag (feature flag, env var, remote config) that disables all agents of a given version. Useful when v2.3.1 has a bug making the agent vulnerable to a new injection pattern.

Per-capability kill-switch

Selectively disable a tool. Example: if send_email is found exploitable via injection, disable it during investigation without breaking the rest.
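
The three levels can share a single check that runs before every tool call. The sketch below is one way to structure it; the class and the reason strings are hypothetical, and how the three sets get populated (endpoint, feature flag, remote config) is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KillSwitchState:
    killed_sessions: set[str] = field(default_factory=set)  # local, per session
    killed_versions: set[str] = field(default_factory=set)  # global, per agent version
    killed_tools: set[str] = field(default_factory=set)     # per capability

    def block_reason(self, session_id: str, agent_version: str, tool: str) -> Optional[str]:
        """Return why the call is blocked, or None if it may proceed."""
        if agent_version in self.killed_versions:
            return "version_disabled"
        if session_id in self.killed_sessions:
            return "session_stopped"
        if tool in self.killed_tools:
            return "tool_disabled"
        return None
```

With this shape, disabling `send_email` during an investigation is `state.killed_tools.add("send_email")`: every other tool keeps working.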

What makes a kill-switch useful

Not its existence: its propagation time. If the kill-switch takes 15 minutes to reach all production instances, you lose 15 critical minutes.

Good practice:

  • Target propagation < 30 seconds.
  • Tested monthly in production exercises (not only staging).
  • Documented: who can trigger it, in which case, after which check.
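
One way to bound propagation time on each instance is to cache the kill-switch flags with a maximum age and refuse to run when the cache is older than that. The sketch below assumes a hypothetical `fetch` callable that returns the set of disabled agent versions from the flag store. Failing closed when the data is stale is a design choice made for this example, not something the 30-second target requires.

```python
import time
from typing import Callable, Optional


class KillSwitchCache:
    def __init__(self, fetch: Callable[[], set[str]], max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: reads disabled versions from the flag store
        self._max_age_s = max_age_s
        self._clock = clock
        self._disabled: set[str] = set()
        self._fetched_at: Optional[float] = None

    def refresh(self) -> None:
        """Call from a background loop, much more often than max_age_s."""
        self._disabled = set(self._fetch())  # if fetch raises, the old timestamp is kept
        self._fetched_at = self._clock()

    def is_allowed(self, agent_version: str) -> bool:
        stale = (self._fetched_at is None
                 or self._clock() - self._fetched_at > self._max_age_s)
        if stale:
            return False  # fail closed: no fresh flags, no agent
        return agent_version not in self._disabled
```

With this structure, an instance that loses contact with the flag store stops its agents on its own after `max_age_s`, so the worst-case propagation time is bounded even during a partial outage.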

The classic mistake: confinement off by default

Many teams ship an agent and "enable confinement after first feedback". Wrong sequence. Confinement must be:

  • On by default at launch.
  • Loosened progressively as usage patterns become known.

Not the other way around. Confinement added later is almost always partial — downstream integrations already depend on the loose perimeter.
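
In configuration terms, "on by default, loosened progressively" means the zero-argument policy is the strictest one and every relaxation is an explicit, recorded change. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfinementPolicy:
    # Defaults are the strictest setting: a new agent starts fully confined.
    allow_network_egress: bool = False
    allow_persistent_storage: bool = False
    require_confirmation: bool = True
    enabled_tools: frozenset[str] = frozenset()


def loosen(policy: ConfinementPolicy, *, reason: str, **changes) -> ConfinementPolicy:
    """Return a relaxed copy of the policy; a written justification is mandatory."""
    if not reason.strip():
        raise ValueError("loosening confinement requires a documented reason")
    return replace(policy, **changes)
```

For example, `loosen(ConfinementPolicy(), reason="read-only file access validated", enabled_tools=frozenset({"read_file"}))` opens a single tool and leaves everything else locked.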

Maturity test — 5 questions

  1. Does your agent run in a sandbox isolated per session?
  2. Can you list, for each tool, allowed parameters and exceptions?
  3. Is there human confirmation on non-reversible external actions?
  4. Do you have a kill-switch that propagates in under 30 seconds?
  5. Did you test it in an exercise this month?

5 yes = top 10% of deployments. 3 yes = good starting point. 0-1 yes = don't ship the agent as-is.
