LLM hallucination, the generation of false information presented as true, is well known and widely discussed for chatbots. For agents the problem is different. A chatbot that hallucinates says something wrong. An agent that hallucinates acts on the wrong thing: calls the wrong tool, transfers the wrong amount, deletes the wrong file, contacts the wrong customer.
The shift from "saying" to "doing" changes the nature of the risk. Classic anti-hallucination defenses (RLHF, RAG, careful prompts) are necessary but insufficient for agents.
Three families of agent hallucination
1. Factual hallucination
The agent asserts something untrue. Example: "the customer's balance is €12,000" when the right amount is €1,200. Consequence in an agent: it acts on the wrong amount.
2. Capability hallucination
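The agent is wrong about its own tools: it thinks it has a tool it doesn't, or vice versa, or it misjudges what a tool does. For example, it tries to call cancel_subscription(), which doesn't exist, or it assumes update_user() leaves the password untouched when it actually changes it.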
3. Context hallucination
The agent invents missing context. Facing an ambiguous request, it decides "the user probably wants…" and acts on the invented interpretation rather than asking.
All three produce confident action on a false basis. That confidence is exactly what makes an agent dangerous: a human would doubt.
5 defenses that work in practice
1. Strict grounding on authoritative data
Factual information the agent acts on must come from explicit tools, not the model's knowledge:
- To talk about a customer's balance, the agent must call get_account_balance(client_id) and use only that return.
- The system prompt must explicitly forbid generating business facts without grounding.
- User output must cite the source ("according to your dashboard at 14:27").
The output reads less smoothly, but this is what prevents factual hallucinations from reaching action.
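As an illustration of the grounding rule, here is a minimal Python sketch. The GroundedFact type, the in-memory _LEDGER and the render_balance helper are hypothetical names introduced for this example; only get_account_balance(client_id) comes from the text, and its return shape is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class GroundedFact:
    """A value returned by a tool, with where and when it was read."""
    value: Decimal
    source: str
    fetched_at: datetime


# Hypothetical stand-in for the real system of record.
_LEDGER = {"c-001": Decimal("1200.00")}


def get_account_balance(client_id: str) -> GroundedFact:
    # An unknown client raises KeyError instead of producing a guessed value.
    return GroundedFact(_LEDGER[client_id], "your dashboard", datetime.now())


def render_balance(fact: GroundedFact) -> str:
    # The reply is built only from the tool return and always cites its source.
    return (
        f"According to {fact.source} at {fact.fetched_at:%H:%M}, "
        f"your balance is €{fact.value:,.2f}."
    )
```

The point of the design is that the user-facing sentence cannot be produced without a tool return: there is no code path where the model's own recollection of the balance reaches the reply.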
2. Structured validation before execution
When the agent emits a tool call, the payload passes through a validator before execution:
- Strict JSON schema.
- Types and value ranges checked.
- Consistency with known context (if session has user_id=42, refuse a call specifying user_id=43).
Many capability hallucinations surface here.
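The three checks above can be sketched in a few lines of Python. This is a simplified stand-in for a real JSON Schema validator: the TOOL_SPECS registry, the transfer_funds tool and its parameters are assumptions made for the example, not an existing API.

```python
from typing import Any

# Hypothetical registry: tool name -> {parameter: (type, minimum, maximum)}.
TOOL_SPECS = {
    "transfer_funds": {
        "user_id": (int, 1, None),
        "amount_cents": (int, 1, 1_000_000),
    },
}


def validate_tool_call(name: str, args: dict[str, Any], session: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        # Capability hallucination: the agent called a tool that does not exist.
        return [f"unknown tool: {name}"]

    errors = [f"unexpected parameter: {key}" for key in sorted(set(args) - set(spec))]
    for key, (expected_type, minimum, maximum) in spec.items():
        if key not in args:
            errors.append(f"missing parameter: {key}")
            continue
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
            continue
        if minimum is not None and value < minimum:
            errors.append(f"{key}: below minimum {minimum}")
        if maximum is not None and value > maximum:
            errors.append(f"{key}: above maximum {maximum}")

    # Consistency with known context: the call must target the session's user.
    if "user_id" in args and args["user_id"] != session.get("user_id"):
        errors.append("user_id does not match the session")
    return errors
```

For example, with a session where user_id is 42, a call to transfer_funds specifying user_id=43 returns one error and is never executed, and a call to the non-existent cancel_subscription is rejected at the first check.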
3. Human confirmation on high-impact actions
For everything orange (high impact, hard to reverse), require user confirmation with a structured recap:
> "The agent will send this email to client@example.com. Subject: 'Your quote'. Attachment: quote-2026-05.pdf. Confirm?"
A recipient hallucination ("clent@example.com") is caught visually before sending.
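A minimal Python sketch of such a confirmation gate follows. The function names build_recap and execute_with_confirmation are hypothetical, and the ask callback stands in for whatever UI collects the user's answer (input would do in a command-line prototype).

```python
from typing import Callable


def build_recap(action: str, fields: dict[str, str]) -> str:
    """Render the exact payload that will be executed, one field per line."""
    lines = [f"The agent will {action}."]
    lines += [f"  {label}: {value}" for label, value in fields.items()]
    lines.append("Confirm? [y/N]")
    return "\n".join(lines)


def execute_with_confirmation(
    recap: str,
    run: Callable[[], str],
    ask: Callable[[str], str],
) -> str:
    """Run the action only on an explicit yes; anything else cancels."""
    answer = ask(recap).strip().lower()
    if answer not in {"y", "yes"}:
        return "cancelled"
    return run()
```

The recap is built from the same fields the tool will receive, so what the user reads is what gets sent. Defaulting to "cancelled" on any unclear answer keeps an accidental keypress from triggering the action.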
4. "Forced ask" on ambiguity
Configure the agent to ask rather than infer on ambiguity. In the system prompt:
> "If information is missing to execute an action, don't guess. Ask the user."
A simple instruction with a strong effect on context hallucinations. It costs some user friction, but on sensitive actions friction is a virtue.
5. Divergence detection via double inference
For critical decisions:
- Generate the decision twice, ideally with two different models (Claude + GPT).
- If the two decisions converge, proceed.
- If they diverge, escalate to a human or to a third model acting as arbiter.
Expensive in tokens. Valuable on critical actions. Many fintech agents do it already.
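The control flow can be sketched in Python as follows. The models are passed in as plain callables so that no provider API is assumed; decide_with_cross_check and the decision dictionary shape are hypothetical, and exact equality is only one possible convergence test.

```python
from typing import Any, Callable

Decision = dict[str, Any]  # e.g. {"action": "refund", "amount_cents": 120000}


def decide_with_cross_check(
    prompt: str,
    model_a: Callable[[str], Decision],
    model_b: Callable[[str], Decision],
    escalate: Callable[[str, Decision, Decision], Decision],
) -> Decision:
    """Ask two independent models; act only when their decisions match."""
    decision_a = model_a(prompt)
    decision_b = model_b(prompt)
    if decision_a == decision_b:
        return decision_a
    # Divergence: hand both candidates to a human or an arbiter model.
    return escalate(prompt, decision_a, decision_b)
```

Comparing structured decisions rather than free text matters here: two models will rarely phrase an answer identically, but they should agree on the action name and its parameters.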
The metric: false positives and false negatives at action level
Two agent-specific observability indicators:
Action false negatives
The agent refuses a legitimate action because it judges it dangerous or ambiguous. This is visible in UX: users complain that "the agent does nothing". Calibrate to avoid an unusable agent.
Action false positives
The agent executes an illegitimate action because of context hallucination. This is the worst case: often invisible at first, and detected only later by users or by an audit.
A reasonable 2026 target: under 0.1% action false positives on external-impact actions, measured on a representative sample. Above that, the agent is not ready for autonomy on this scope.
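Both indicators can be computed from a reviewed sample of agent decisions. The Python sketch below assumes each sampled decision has been labeled by a reviewer, and it takes the whole sample as the denominator for both rates; the ReviewedAction type and that choice of denominator are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedAction:
    executed: bool    # did the agent carry the action out?
    legitimate: bool  # did a reviewer judge the action legitimate?


def action_error_rates(sample: list[ReviewedAction]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over the sample."""
    if not sample:
        raise ValueError("empty sample")
    false_positives = sum(a.executed and not a.legitimate for a in sample)
    false_negatives = sum(not a.executed and a.legitimate for a in sample)
    return false_positives / len(sample), false_negatives / len(sample)


def meets_target(false_positive_rate: float, target: float = 0.001) -> bool:
    # "Under 0.1%" is read as strictly below the threshold.
    return false_positive_rate < target
```

Note how large the sample has to be for the target to mean anything: with 1,000 reviewed actions, a single illegitimate execution already puts the rate at 0.1%, which does not meet a strict "under 0.1%" target.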
Special case: cumulative actions
An underrated risk: the agent hallucinates in small ways, but repeatedly:
- Each turn, the agent decides "the user probably wants a notification".
- No individual notification is aberrant.
- The user receives 47 notifications in an hour.
Defenses:
- Rate limits per tool and per session.
- Statistical drift detection (a user getting 47 notifications in an hour is in the distribution tail).
- Daily global quota with threshold alerts.
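The first defense, a rate limit per tool and per session, can be sketched as a sliding-window counter in Python. The ToolRateLimiter class and the send_notification tool name are hypothetical, and a production version would keep its counters in shared storage rather than in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class ToolRateLimiter:
    """Allow at most max_calls per (session, tool) within a sliding window."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: dict[tuple[str, str], deque] = defaultdict(deque)

    def allow(self, session_id: str, tool: str, now: Optional[float] = None) -> bool:
        # `now` can be injected for tests; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        calls = self._calls[(session_id, tool)]
        # Drop timestamps that have left the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

With a limit of 5 calls per hour on send_notification, the 47-notification scenario above stops at the fifth notification. The refused calls are themselves a useful signal to feed into drift detection.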
What not to wait for
Waiting for models to "stop hallucinating"
Model improvement is real. Claude Opus 4.7 or GPT-5 hallucinate measurably less than GPT-4. But the rate is not zero, and probably won't be for a long time. A system built on the assumption of residual hallucinations works in 2026 and still works in 2028.
Counting on end users to catch errors
In consumer-grade products, users don't read recaps carefully. 2025-2026 HCI studies on automation bias are unambiguous: past a certain agent-trust threshold, humans validate confirmations without really reading. For critical actions, the system can't rely solely on the human click.