When an agent makes a mistake (wrong recipient, unintended deletion, accidental leak), the critical question is "how?". Without a structured audit log, the honest answer is "we don't know". And without that answer, you can't fix the problem, convince a regulator, or reassure a customer.
An audit log isn't an operational nice-to-have. It's a prerequisite for any serious production launch in 2026.
12 minimum fields per tool call
- timestamp — ISO 8601 with timezone and millisecond.
- session_id — unique conversation identifier.
- user_id — end-user identifier (hashed if needed).
- agent_version — agent code + system prompt version.
- model_id — provider + name + version.
- tool_name — tool called.
- tool_input — full parameters.
- tool_output — full response (truncated with hash if huge).
- decision_rationale — extract from model response explaining the choice.
- prior_context_hash — hash of context at call time.
- outcome — success / failure / refused by broker.
- latency_ms — execution time.
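As an illustration, the twelve fields above could be carried by a small record type. This is a minimal sketch, not a reference schema: the class name `ToolCallRecord`, the helpers `truncate_with_hash` and `now_iso_ms`, and the 2000-character truncation threshold are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any

MAX_OUTPUT_CHARS = 2000  # assumed threshold; tune to your storage budget


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def truncate_with_hash(output: str, limit: int = MAX_OUTPUT_CHARS) -> dict[str, Any]:
    """Keep small outputs whole; for large ones keep a prefix plus a hash of the full text."""
    if len(output) <= limit:
        return {"text": output, "truncated": False}
    return {"text": output[:limit], "truncated": True, "sha256": sha256_hex(output)}


def now_iso_ms() -> str:
    """ISO 8601 timestamp in UTC with millisecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


@dataclass(frozen=True)
class ToolCallRecord:
    timestamp: str
    session_id: str
    user_id: str
    agent_version: str
    model_id: str
    tool_name: str
    tool_input: dict[str, Any]
    tool_output: dict[str, Any]
    decision_rationale: str
    prior_context_hash: str
    outcome: str  # "success", "failure" or "refused_by_broker"
    latency_ms: int

    def to_json_line(self) -> str:
        """Serialize to a single JSON line (json.dumps escapes embedded newlines)."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)
```

Making the record frozen means a field cannot be silently changed between the tool call and the write.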
Beyond tool calls
Full prompt to the model
Not just the user input: system prompt, assembled context, and history. It's big, but it's also what lets you answer "why did the model do this?". Typical cost: 5-20 KB per turn, so 100,000 turns/month comes to roughly 0.5-2 GB/month raw.
Guardrail and broker decisions
When a guardrail rejects or a broker refuses, log why. Refusals are as valuable as successes: many attacks leave traces in refusals before they succeed.
Memory modifications
Log every long-term memory write: who wrote it, what was written, who or what validated it, and where the content came from.
Version transitions
Mark every agent version change clearly. Otherwise incidents spanning two versions become untraceable.
Format that works
Structured JSON, not text
Always JSON. Free text is easy on the eyes but hard to query at scale.
One event per line (NDJSON)
`` {"ts":"2026-05-12T14:30:01.123Z","type":"tool_call","session":"...","user":"...","tool":"send_email","input":{...},"output":{...},...} ``
Plugs straight into Loki, Elasticsearch, BigQuery, ClickHouse, Splunk, Datadog.
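To make the one-event-per-line idea concrete, here is a minimal sketch of an NDJSON writer and reader using only the standard library. The function names and the `guardrail_refusal` event type are hypothetical. The demo uses an in-memory buffer; in production the stream would more likely be an append-mode file or a log shipper.

```python
import io
import json
from typing import Any, Iterator, TextIO


def write_event(stream: TextIO, event: dict[str, Any]) -> None:
    """Serialize one event as a single line; json.dumps escapes embedded newlines."""
    stream.write(json.dumps(event, ensure_ascii=False, separators=(",", ":")) + "\n")


def read_events(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield events one by one, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# In-memory demo with two illustrative events.
buffer = io.StringIO()
write_event(buffer, {"ts": "2026-05-12T14:30:01.123Z", "type": "tool_call", "tool": "send_email"})
write_event(buffer, {"ts": "2026-05-12T14:30:02.456Z", "type": "guardrail_refusal", "tool": "delete_file"})
buffer.seek(0)
refusals = [e for e in read_events(buffer) if e["type"] == "guardrail_refusal"]
```

Because each line is a complete JSON document, a reader can stream and filter the log without loading it whole.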
Immutable storage
Append-only logs, ideally with chaining (each event carries the hash of the previous one). Tampering with any event breaks the chain. A growing number of SOC 2, ISO/IEC 27001, and ISO/IEC 42001 auditors ask for this.
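The chaining idea can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions: the field names `prev_hash` and `hash`, the all-zero genesis value, and the canonical-JSON hashing scheme are choices made for the example, not a standard.

```python
import hashlib
import json
from typing import Any, Optional

GENESIS_HASH = "0" * 64  # assumed sentinel used as prev_hash of the first event


def event_hash(body: dict[str, Any]) -> str:
    """Hash a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_chained(log: list[dict[str, Any]], payload: dict[str, Any]) -> dict[str, Any]:
    """Append payload with a link to the previous entry.

    The keys "prev_hash" and "hash" are reserved by this sketch.
    """
    prev = log[-1]["hash"] if log else GENESIS_HASH
    body = {**payload, "prev_hash": prev}
    entry = {**body, "hash": event_hash(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict[str, Any]]) -> Optional[int]:
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS_HASH
    for i, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry.get("prev_hash") != prev or entry.get("hash") != event_hash(body):
            return i
        prev = entry["hash"]
    return None
```

A chain alone does not detect truncation of the tail or a full rewrite of the log. Publishing the latest hash to a separate system at intervals is one common way to cover that gap.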
Differentiated retention
- 90 days hot.
- 12 months warm.
- 3-5 years cold.
Long retention is needed for late-discovered incidents (memory poisoning can take months to surface).
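A tiering policy like the one above reduces to a small age-based function. The sketch below assumes the three durations are cumulative ages measured from the event time, and takes the upper bound of the 3-5 year range for cold storage. Both are interpretations, and the `storage_tier` name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 90
WARM_DAYS = 365        # 12 months, approximated as 365 days
COLD_DAYS = 5 * 365    # upper end of the 3-5 year range (assumption)


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Map an event's age to a storage tier; 'expired' means eligible for deletion."""
    age = now - event_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    if age <= timedelta(days=COLD_DAYS):
        return "cold"
    return "expired"
```

In practice this logic usually lives in the storage backend's lifecycle rules rather than in application code. The function is only meant to make the policy explicit.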
What NOT to do
Logging PII in plaintext
A user pastes a credit card number, you log the full prompt: you now store PCI data. Pattern: a scrubber that detects and masks PII patterns before the event is written. Imperfect, but essential.
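A scrubber of this kind might look like the following sketch. It covers only two patterns (card numbers validated with the Luhn checksum, and email addresses). The regexes and the mask formats are assumptions for the example. A real deployment would need more patterns and would still miss some cases.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    if luhn_valid(digits):
        return f"[CARD:****{digits[-4:]}]"  # keep only the last four digits
    return match.group()  # not a valid card number, leave untouched


def scrub(text: str) -> str:
    """Mask card numbers and email addresses before the text reaches the log."""
    text = CARD_CANDIDATE.sub(_mask_card, text)
    return EMAIL.sub("[EMAIL]", text)
```

The Luhn check reduces false positives on order numbers and other long digit strings, at the cost of missing mistyped card numbers.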
Logging secrets
Tokens in context: don't log them. Redact credentials before the event is written, and enable any "sanitize" option your HTTP client offers.
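When the client library offers no such option, a redaction pass over the event before serialization is one fallback. The sketch below is hypothetical: the list of sensitive key names and the `[REDACTED]` marker are assumptions, and it only catches secrets stored under recognizable keys or in `Bearer` form.

```python
import re
from typing import Any

# Key names treated as sensitive (assumed list; extend for your stack).
# The token alternative is anchored so that "max_tokens" is not redacted.
SENSITIVE_KEY = re.compile(
    r"authorization|api[_-]?key|secret|password|cookie|(?:^|[_-])token$",
    re.IGNORECASE,
)
BEARER = re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]+")
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive keys and bearer tokens masked."""
    if isinstance(value, dict):
        return {
            k: REDACTED if SENSITIVE_KEY.search(str(k)) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return BEARER.sub("Bearer " + REDACTED, value)
    return value
```

Applying `redact` to `tool_input` and `tool_output` just before the write keeps the rest of the agent code unchanged.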
Logging without a volume limit
A looping agent can produce millions of events per minute. Rate-limit the logger and alert on overflow. Otherwise costs explode and visibility dies exactly when you need it most.
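One way to bound volume is a token bucket in front of the log sink, with a counter of dropped events to drive the overflow alert. This is a sketch under assumptions: the class name, its parameters, and the decision to drop rather than sample are illustrative choices.

```python
import time
from typing import Callable


class RateLimitedLogger:
    """Token-bucket limiter: allows bursts up to `burst`, then `max_per_second` sustained."""

    def __init__(
        self,
        sink: Callable[[str], None],
        max_per_second: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self._rate = max_per_second
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()
        self.dropped = 0  # expose this counter to your alerting system

    def log(self, line: str) -> bool:
        """Write the line if a token is available; otherwise count it as dropped."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            self._sink(line)
            return True
        self.dropped += 1
        return False
```

Dropping audit events is itself a loss of evidence, so the `dropped` counter should trigger an alert quickly. A per-session limit is a reasonable variation so that one looping session does not starve the others.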
Operational use
Having logs is 30% of the work. Using them is the rest.
Continuous dashboards
- Refusal rate per guardrail.
- Tool-call distribution.
- Top users/sessions by latency or volume.
- Anomalies (tool called 10x more than usual).
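Two of these dashboard metrics can be computed directly from the event stream. The sketch below assumes hypothetical event fields (`type`, `guardrail`, `outcome`) and a precomputed per-tool baseline. In practice these queries would usually run in the log backend instead.

```python
from collections import Counter
from typing import Any, Iterable


def refusal_rate_per_guardrail(events: Iterable[dict[str, Any]]) -> dict[str, float]:
    """Fraction of decisions that were refusals, per guardrail."""
    checks = Counter()
    refusals = Counter()
    for e in events:
        if e.get("type") == "guardrail_decision":
            checks[e["guardrail"]] += 1
            if e.get("outcome") == "refused":
                refusals[e["guardrail"]] += 1
    return {g: refusals[g] / n for g, n in checks.items()}


def anomalous_tools(
    current: Counter, baseline: dict[str, float], factor: float = 10.0
) -> list[str]:
    """Tools called more than `factor` times their baseline count for the same window.

    A tool with no baseline is flagged on its first call (baseline treated as 0).
    """
    return sorted(t for t, n in current.items() if n > factor * baseline.get(t, 0.0))
```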
Targeted alerts
- High-impact action outside business hours.
- Repeated user-confirmation refusals.
- Sudden vocabulary change in prompts.
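The first alert in this list can be expressed as a simple predicate over events. The set of high-impact tools, the 09:00-18:00 weekday window, and the use of UTC as the business timezone are all assumptions for the example.

```python
from datetime import datetime, time, timezone, tzinfo
from typing import Any

HIGH_IMPACT_TOOLS = {"send_email", "delete_file", "transfer_funds"}  # hypothetical list
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; the 'Z' suffix is normalized for older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def outside_business_hours(ts: str, tz: tzinfo = timezone.utc) -> bool:
    local = parse_ts(ts).astimezone(tz)
    if local.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= local.time() < BUSINESS_END)


def should_alert(event: dict[str, Any]) -> bool:
    """True for a high-impact tool call made outside business hours."""
    return (
        event.get("type") == "tool_call"
        and event.get("tool") in HIGH_IMPACT_TOOLS
        and outside_business_hours(event["ts"])
    )
```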
On-demand investigations
When a customer reports odd behavior: find the session, the full prompt, and the tool outputs. If reconstruction takes more than an hour, you have under-invested in tooling.
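Reconstructing a session amounts to a filter and a sort over the event stream. The sketch below uses the short field names from the NDJSON example (`ts`, `session`, `tool`) plus an assumed `outcome` field. It also assumes all timestamps share one format and timezone, so that sorting them as strings is chronological.

```python
from typing import Any, Iterable


def session_timeline(
    events: Iterable[dict[str, Any]], session_id: str
) -> list[dict[str, Any]]:
    """All events of one session, in chronological order."""
    selected = [e for e in events if e.get("session") == session_id]
    return sorted(selected, key=lambda e: e["ts"])


def summarize(timeline: list[dict[str, Any]]) -> list[str]:
    """One human-readable line per event, for a first pass over an incident."""
    return [
        f'{e["ts"]} {e["type"]} {e.get("tool", "-")} {e.get("outcome", "-")}'
        for e in timeline
    ]
```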
The simple rule
Before go-live, ask: "if tomorrow I have to explain to a customer, a regulator, or a judge what my agent did at 14:27 on May 12, can I?". If the answer is no, your audit log isn't sufficient.