Multi-agent collusion: when your agents coordinate against you

A multi-agent system (MAS) — LangGraph, AutoGen, CrewAI, Claude Agent SDK with sub-agents — splits work across coordinated LLMs. It's the archetypal AI architecture for 2026, and it's also the origin of a risk class absent from single-agent: collusion, where several agents propagate a malicious instruction by passing it around, bypassing each individual guardrail.

The mechanic, in plain terms

Picture a three-agent system:

Planner: takes the user request, plans steps.
Researcher: fetches external information (web, RAG, APIs).
Executor: acts (send email, update CRM).

An attacker injects malicious content into a web page. The researcher reads it and folds the instructions into its output. The planner reads the researcher's summary — legitimate in its eyes, it's its own agent — and adjusts its plan. The executor receives the order and acts.

The attacker's instruction crossed three agents. No trust boundary was crossed on the agent side: they trust each other by construction. The boundary was crossed once, at external-data ingestion.

Four MAS risk families

1. Instruction propagation (the case above)

Attacker injects upstream. Compromise diffuses downstream, which inherits trust granted to internal agent outputs.

2. Collusion via shared memory

If several agents share memory (vector store, redis, file), a compromised agent can plant information for the others to read. Multi-session persistence.

3. Uncontrolled feedback loop

Two agents replying to each other can enter a loop where each amplifies an instruction (observed in 2024 AutoGen attacks). Without timeout or escalation detection, the final instruction is many times worse than the original.

4. Compromise via fine-tuned sub-agent

If a sub-agent uses a custom fine-tuned model (increasingly common), fine-tune dataset poisoning can introduce a trigger backdoor: a specific phrase that activates a hidden behavior.

Defensive architecture: 5 principles

Principle 1 — Explicit trust hierarchy

Not all agents are equal. Define trust levels:

Trusted: internal agents, reviewed code, versioned prompts.
Semi-trusted: agents that consume external data (researcher, reader).
Untrusted: any output from an uncontrolled source.

And circulation rules:

A trusted never reads untrusted directly.
The semi-trusted → trusted transition requires a transformation: structured extraction, schema check, removal of anything that looks like an instruction.

Principle 2 — The supervisor/workers pattern

A single supervisor agent centralizes high-impact decisions. Workers run limited tasks and return results to the supervisor, who decides next.

The benefit: dangerous capabilities concentrate on the supervisor. A compromised worker can't act directly, only return a bad result. The supervisor can detect inconsistencies across workers.

Principle 3 — Cross-validation

For critical decisions, ask two independent agents (ideally different models) to produce an answer. If they diverge meaningfully, escalate. Inspired by N-version programming in safety-critical software; expensive in tokens, usable on high-impact actions.

Principle 4 — Capability compartmentalization

No single agent should simultaneously have:

Access to uncontrolled data (web, RAG, incoming email).
Ability to act on the outside world (send, pay, publish).

If you need both, split into two agents with a broker between them that validates and materializes the decision as a structured form, not free text.

Principle 5 — Centralized audit

All agents log to the same bus:

What instruction did it receive, from whom?
Which tools did it call?
What output did it produce?

Without centralized audit, a multi-agent incident is nearly impossible to investigate.

Frameworks

Each framework exposes useful primitives but none ships, as of May 2026, with a default end-to-end security model. You must build it:

LangGraph: explicitly graph-oriented, good for materializing trust hierarchy. No runtime compartmentalization by default.
AutoGen (Microsoft): free conversation between agents — handy but loop-prone. Cap with strict timeouts and max-turns.
CrewAI: role-based, can serve as the base for a trust hierarchy, but capability controls must be added separately.

The practical test

Three questions for a production MAS:

If I compromised the web-reading agent, what could I do in the worst case?
How many agents can call the most dangerous tool directly?
To send an email through your system, how many agents must an attacker compromise?

Ideally: (1) almost nothing — external data is extracted into structured form before reaching any other agent; (2) only one — the supervisor; (3) more than one — sending requires validation.

A related topic on your side?

20 minutes to scope it together. No commercial pitch.

Book a Calendly call →