AI agents and hallucination: when the false becomes an irreversible action

A hallucination in a chatbot is annoying. A hallucination in an agent that acts is an incident. How to reduce impact without killing usefulness.

Aroua Biri

LLM hallucination (generating false information presented as true) is well known and widely discussed for chatbots. Agents are a different matter. A chatbot that hallucinates says something wrong. An agent that hallucinates acts on the wrong thing: it calls the wrong tool, transfers the wrong amount, deletes the wrong file, contacts the wrong customer.

The shift from "saying" to "doing" changes the nature of the risk. Classic anti-hallucination defenses (RLHF, RAG, careful prompts) are necessary but insufficient for agents.

Three families of agent hallucination

1. Factual hallucination

The agent asserts something untrue. Example: "the customer's balance is €12,000" when the right amount is €1,200. Consequence in an agent: it acts on the wrong amount.

2. Capability hallucination

The agent believes it has a tool it doesn't have, or the reverse, or it misjudges what a tool does. It tries to call cancel_subscription(), which doesn't exist, or assumes update_user() doesn't change the password when it does.

3. Context hallucination

The agent invents missing context. Facing an ambiguous request, it decides "the user probably wants…" and acts on the invented interpretation rather than asking.

All three produce confident action on a false basis. That confidence is exactly what makes an agent dangerous: a human in the same position would hesitate.

5 defenses that work in practice

1. Strict grounding on authoritative data

Factual information the agent acts on must come from explicit tools, not the model's knowledge:

  • To talk about a customer's balance, the agent must call get_account_balance(client_id) and use only that return.
  • The system prompt must explicitly forbid generating business facts without grounding.
  • User output must cite the source ("according to your dashboard at 14:27").

The result is less smooth. It is also what prevents factual hallucinations from reaching the action stage.
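
As an illustration, here is a minimal Python sketch of that grounding rule. The GroundedFact wrapper, fetch_balance, render_balance and the in-memory balances table are assumptions made for the example; only the get_account_balance(client_id) name comes from the text above. The idea is that the user-facing sentence can only be built from a value that carries its tool provenance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GroundedFact:
    """A business fact together with the tool call that produced it."""
    name: str
    value: float
    source: str
    fetched_at: datetime


def get_account_balance(client_id: str) -> float:
    # Hypothetical stand-in for the real tool; a real one would query the ledger.
    balances = {"c-42": 1200.0}
    return balances[client_id]


def fetch_balance(client_id: str) -> GroundedFact:
    """Call the tool and wrap the result with its provenance."""
    return GroundedFact(
        name="balance",
        value=get_account_balance(client_id),
        source=f"get_account_balance({client_id!r})",
        fetched_at=datetime.now(timezone.utc),
    )


def render_balance(fact: GroundedFact) -> str:
    """Build the user-facing sentence from a grounded fact only."""
    if not isinstance(fact, GroundedFact):
        # A bare number (for example one the model "remembered") is refused.
        raise TypeError("refusing to state a balance without a grounded source")
    stamp = fact.fetched_at.strftime("%H:%M")
    return f"Your balance is €{fact.value:,.2f} (according to {fact.source} at {stamp} UTC)."
```

Passing a raw number such as render_balance(12000.0) raises a TypeError, so an ungrounded figure cannot reach the user or a downstream action through this path.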

2. Structured validation before execution

When the agent emits a tool call, the payload passes through a validator before execution:

  • Strict JSON schema.
  • Types and value ranges checked.
  • Consistency with known context (if session has user_id=42, refuse a call specifying user_id=43).

Many capability hallucinations surface here.
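
The three checks can be sketched with the standard library alone. The transfer_funds tool, its parameters, the 5,000 euro ceiling and the shape of the call dictionary are hypothetical, chosen only to make the example concrete; a production system would more likely rely on a JSON Schema library.

```python
# Hypothetical tool registry: parameter name -> (expected type, optional (min, max) range).
TOOL_SCHEMAS = {
    "transfer_funds": {
        "user_id": (int, None),
        "amount_eur": (float, (0.01, 5000.0)),
    },
}


def validate_tool_call(call: dict, session: dict) -> list:
    """Return a list of problems; an empty list means the call may run."""
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        # Capability hallucination: the model invoked a tool that does not exist.
        return [f"unknown tool: {call.get('tool')!r}"]

    errors = []
    args = call.get("args", {})
    extra = sorted(set(args) - set(schema))
    missing = sorted(set(schema) - set(args))
    if extra:
        errors.append(f"unexpected arguments: {extra}")
    if missing:
        errors.append(f"missing arguments: {missing}")

    for name, (expected_type, bounds) in schema.items():
        if name not in args:
            continue
        value = args[name]
        # Strict typing: bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
            continue
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            errors.append(f"{name}: {value} outside {bounds}")

    # Consistency with known context: the call must target the session's user.
    if "user_id" in args and args["user_id"] != session.get("user_id"):
        errors.append("user_id does not match the session")
    return errors
```

With a session of {"user_id": 42}, a call to the nonexistent cancel_subscription is rejected as an unknown tool, and a transfer_funds call carrying user_id=43 is rejected on the consistency check.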

3. Human confirmation on high-impact actions

For every action rated orange (high impact, hard to reverse), require user confirmation with a structured recap:

> "The agent will send this email to client@example.com. Subject: 'Your quote'. Attachment: quote-2026-05.pdf. Confirm?"

A recipient hallucination ("clent@example.com") is caught visually before sending.
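
A confirmation gate of this kind can be sketched as follows. PendingAction, recap and execute_with_confirmation are illustrative names, and the confirm callback stands in for whatever UI actually shows the recap to the user.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingAction:
    tool: str
    args: dict
    high_impact: bool  # assumed to be set by the platform, not by the model


def recap(action: PendingAction) -> str:
    """Render every argument on its own line so a wrong value is visible."""
    lines = [f"The agent will run {action.tool} with:"]
    lines += [f"  {key}: {value}" for key, value in action.args.items()]
    lines.append("Confirm?")
    return "\n".join(lines)


def execute_with_confirmation(
    action: PendingAction,
    run: Callable[[PendingAction], None],
    confirm: Callable[[str], bool],
) -> str:
    """Run low-impact actions directly; ask the user first for high-impact ones."""
    if action.high_impact and not confirm(recap(action)):
        return "cancelled"
    run(action)
    return "executed"
```

One design choice in this sketch: the high_impact flag belongs to the tool definition rather than to the model's output, so a hallucinating agent cannot downgrade its own action to skip the confirmation.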

4. "Forced ask" on ambiguity

Configure the agent to ask rather than infer on ambiguity. In the system prompt:

> "If information is missing to execute an action, don't guess. Ask the user."

A simple instruction with a strong effect on context hallucinations. It costs some user friction. On sensitive actions, friction is a virtue.
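
The prompt instruction can be backed by a check in code, so that a missing parameter becomes a question instead of a guess. This is a sketch of one possible backstop, not something the prompt alone provides; the REQUIRED_ARGS table, the send_email parameters and the returned dictionary shapes are hypothetical.

```python
# Hypothetical table of the arguments each tool cannot run without.
REQUIRED_ARGS = {
    "send_email": ("to", "subject", "body"),
}


def next_step(tool: str, args: dict) -> dict:
    """Return either an 'execute' step or an 'ask_user' step, never a guess."""
    missing = [
        name
        for name in REQUIRED_ARGS.get(tool, ())
        if args.get(name) in (None, "")
    ]
    if missing:
        question = f"To run {tool}, I still need: {', '.join(missing)}."
        return {"type": "ask_user", "question": question}
    return {"type": "execute", "tool": tool, "args": args}
```

This only catches information that is structurally absent. A value the model invented and filled in still needs the grounding and validation defenses.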

5. Divergence detection via double inference

For critical decisions:

  • Generate the decision twice, ideally with two different models (Claude + GPT).
  • Converge: OK.
  • Diverge: escalate to a human or a third model as arbiter.

Expensive in tokens. Valuable on critical actions. Many fintech agents do it already.
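
A minimal sketch of the converge/diverge logic, assuming each model is wrapped in a callable that returns a structured decision (a dictionary) rather than free text. The function names and the escalate callback are illustrative; no real model API is called here.

```python
import json
from typing import Callable


def canonical(decision: dict) -> str:
    """Serialize a decision with sorted keys so key order does not matter."""
    return json.dumps(decision, sort_keys=True)


def decide_twice(
    prompt: str,
    model_a: Callable[[str], dict],
    model_b: Callable[[str], dict],
    escalate: Callable[[str, dict, dict], dict],
) -> dict:
    """Ask two models for the same decision and escalate when they differ."""
    decision_a = model_a(prompt)
    decision_b = model_b(prompt)
    if canonical(decision_a) == canonical(decision_b):
        return decision_a
    # Divergence: hand both answers to a human or an arbiter model.
    return escalate(prompt, decision_a, decision_b)
```

Comparing structured fields (tool name, amount, recipient) rather than generated text is what makes the comparison meaningful: two correct answers rarely match word for word, but they should match on the action.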

The metric: false positives and false negatives at action level

Two agent-specific observability indicators:

Action false negatives

The agent refuses a legitimate action because it judges it dangerous or ambiguous. This is visible in the UX: users complain that "the agent does nothing". Calibrate to avoid an unusable agent.

Action false positives

The agent executes an illegitimate action because of a context hallucination. This is the worst case. It is often invisible at first and detected later by users or by an audit.

A reasonable 2026 target: under 0.1% action false positives on external-impact actions, measured on a representative sample. Above that, the agent is not ready for autonomy on that scope.
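
Both indicators can be computed from a sample of human-reviewed actions. The sketch below makes an assumption the text does not spell out: the false positive rate is taken over executed actions, and the false negative rate over legitimate requests. ReviewedAction and action_error_rates are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedAction:
    executed: bool    # did the agent run the action?
    legitimate: bool  # reviewer's verdict: should it have run?


def action_error_rates(sample: list) -> dict:
    """Compute action-level false positive and false negative rates."""
    executed = [a for a in sample if a.executed]
    legitimate = [a for a in sample if a.legitimate]
    false_positives = sum(1 for a in executed if not a.legitimate)
    false_negatives = sum(1 for a in legitimate if not a.executed)
    return {
        # Share of executed actions that should not have run.
        "false_positive_rate": false_positives / len(executed) if executed else 0.0,
        # Share of legitimate requests that the agent refused.
        "false_negative_rate": false_negatives / len(legitimate) if legitimate else 0.0,
    }
```

Sample size matters for a threshold this low. By the statistical rule of three, observing zero false positives in n reviewed actions only bounds the true rate below roughly 3/n at 95% confidence, so supporting a 0.1% claim takes on the order of 3,000 reviewed actions.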

Special case: cumulative actions

An underrated risk: the agent hallucinates in small ways, but repeatedly:

  • Each turn, the agent decides "the user probably wants a notification".
  • No individual notification is aberrant.
  • The user receives 47 notifications in an hour.

Defenses:

  • Rate limits per tool and per session.
  • Statistical drift detection (a user getting 47 notifications in an hour is in the distribution tail).
  • Daily global quota with threshold alerts.
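
The first defense, a rate limit per tool and per session, can be sketched as a sliding window. ToolRateLimiter and the limit of 5 calls per hour are hypothetical choices for the example; the caller passes the current time explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class ToolRateLimiter:
    """Sliding-window limit on calls, tracked per (session, tool) pair."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = defaultdict(deque)  # (session_id, tool) -> call timestamps

    def allow(self, session_id: str, tool: str, now: float) -> bool:
        """Record and allow the call if the window still has room."""
        calls = self._calls[(session_id, tool)]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

With ToolRateLimiter(max_calls=5, window_seconds=3600), an agent attempting a notification every minute gets its first 5 calls through and is blocked afterwards until the oldest call leaves the window. The blocked attempts are themselves a useful signal to feed into the drift detection and alerting described above.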

What not to wait for

Waiting for models to "stop hallucinating"

Model improvement is real. Claude Opus 4.7 or GPT-5 hallucinate measurably less than GPT-4. But the rate is not zero, and probably won't be for a long time. A system built on the assumption of residual hallucinations works in 2026 and in 2028.

Counting on end users to catch errors

In consumer-grade products, users don't read recaps carefully. 2025-2026 HCI studies on automation bias are unambiguous: past a certain agent-trust threshold, humans validate confirmations without really reading. For critical actions, the system can't rely solely on the human click.
