
Agentforce in Production: Governance, AWUs, and What Breaks

Series: Agentforce in Production (Part 2 of 2)
  1. Agentforce Multi-Agent Orchestration: Design Patterns
  2. Agentforce in Production: Governance, AWUs, and What Breaks

When an agent modifies a record, there is no user to hold accountable. No session to attach an audit entry to. No human who consciously approved the action. That's the governance problem Agentforce creates at scale — and the platform doesn't solve it for you.

Salesforce reported 3.8 billion AWUs processed with 111% quarter-over-quarter growth. Someone is paying for that compute, and someone is responsible for the record changes those agents made. This post is about building the architecture that keeps both of those things under control.

AWU Cost: Forecasting and Where Architects Over-Provision

An Agentic Work Unit (AWU) is consumed each time an agent takes an action: a tool call, a record read, a record write, a reasoning step. The billing model means that a poorly designed agent chain doesn't just produce bad outcomes — it produces expensive bad outcomes.

The most common over-provisioning pattern I've seen is agents that loop. An agent is configured to retry on ambiguous input, so it calls the same tool three times with slightly different parameters trying to resolve uncertainty that should have been handled by better prompt design or by returning a structured "needs clarification" response. Each iteration consumes AWUs.

A practical forecasting approach: instrument a sample of 50–100 representative requests in a sandbox and log AWU consumption per request type. Then model against your expected production volume. Don't model against average complexity — model against the 90th percentile, because that's where the budget surprises come from.
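A minimal sketch of that forecast, written in Python only to show the arithmetic. The request type names, the `(request_type, awus)` sample format, and the monthly volumes are all hypothetical; the real numbers come from whatever consumption logging your sandbox gives you.

```python
from collections import defaultdict
from statistics import quantiles


def p90(values):
    """90th percentile of a sample (needs at least two data points).

    quantiles(n=10) returns nine cut points; the last one is the p90.
    """
    return quantiles(values, n=10, method="inclusive")[-1]


def forecast_monthly_awus(samples, monthly_volume):
    """Project monthly AWU consumption per request type at the 90th percentile.

    samples: iterable of (request_type, awus_consumed) pairs from sandbox runs.
    monthly_volume: dict of request_type -> expected production requests per month.
    """
    by_type = defaultdict(list)
    for request_type, awus in samples:
        by_type[request_type].append(awus)
    return {
        request_type: p90(values) * monthly_volume.get(request_type, 0)
        for request_type, values in by_type.items()
    }
```

Comparing the result against the same projection using the mean shows how much budget the tail requests account for.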

Patterns that predictably over-consume at scale:

  • Unbounded retrieval actions: an agent that fetches "all related records" instead of a bounded query. At small record counts this is invisible. At production scale, an account with 2,000 cases triggers a far larger AWU footprint than one with 5.
  • Retry loops without backoff: agents configured to retry failed tool calls indefinitely, consuming AWUs while waiting for an external system to recover.
  • Unnecessary reasoning steps: agents prompted to "think step by step" for simple CRUD operations that don't benefit from chain-of-thought.
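The retry-loop pattern is the easiest of the three to bound. The sketch below is illustrative Python, not Agentforce configuration: `call_with_bounded_retry` and `ToolCallFailed` are hypothetical names, and the attempt cap and delays are placeholder values.

```python
import time


class ToolCallFailed(Exception):
    """Raised when a tool call still fails after the retry budget is spent."""


def call_with_bounded_retry(tool, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call tool() at most max_attempts times, backing off exponentially between tries."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception as exc:  # real code should catch only retryable errors
            last_error = exc
            if attempt < max_attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * (2 ** attempt))
    # Give up explicitly instead of looping while the external system is down.
    raise ToolCallFailed(f"gave up after {max_attempts} attempts") from last_error
```

The point of the hard cap is that the worst-case cost of one failing dependency becomes a known number rather than an open-ended one.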

Set AWU consumption alerts at 70% of your entitlement threshold. By the time you're at 90%, you're already in recovery mode.

Audit Trail Design for Agent Actions

Flow audit trails work because a human initiated the Flow — there's a session, a user ID, and an implicit intent attached to the change. When an agent modifies a record, none of that is true by default. The audit entry shows the integration user or the running user of the agent's connected app, and the "why" is completely absent.

This gap matters for compliance. If an auditor asks why Opportunity XYZ had its stage changed on a specific date, "the Agentforce agent did it" is not an acceptable answer in most regulated industries.

The architecture I recommend: every agent action that modifies a record should write a companion record to a custom Agent_Audit__c object before executing the DML. At minimum, capture:

  • The agent name and version
  • The triggering request (or a sanitized summary if it contains PII)
  • The specific action taken and on which record
  • The data state before the change (a JSON snapshot of relevant fields)
  • The confidence level or reasoning summary if the model surfaces it
  • A correlation ID that links back to the full agent session log

This isn't automatic. You have to build the action that writes the audit record and invoke it explicitly in your agent's action sequence. The native Agentforce audit logs give you session-level data; they don't give you field-level change context linked to business intent.
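To make the shape of that companion record concrete, here is a rough Python model of it. In an org this would be an Apex action or Flow writing to Agent_Audit__c; the field names, the `audited_update` helper, and the in-memory `audit_log` list are assumptions for illustration only.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentAuditEntry:
    agent_name: str
    agent_version: str
    request_summary: str  # sanitized if the original request contains PII
    action: str
    record_id: str
    before_state: dict  # only the fields the action is about to change
    reasoning_summary: str = ""
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_payload(self):
        payload = asdict(self)
        # Store the snapshot as JSON text, as it would sit in a long text field.
        payload["before_state"] = json.dumps(self.before_state, sort_keys=True)
        return payload


def audited_update(record, changes, audit_log, **audit_fields):
    """Write the audit entry first, then apply the change to the record."""
    before = {name: record.get(name) for name in changes}
    entry = AgentAuditEntry(
        record_id=record["Id"], before_state=before, **audit_fields
    )
    audit_log.append(entry.to_payload())  # audit write precedes the update
    record.update(changes)
    return entry.correlation_id
```

Writing the entry before the update means a failed write still leaves a trace of what the agent attempted.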

For orgs under SOX, HIPAA, or similar regulatory frameworks, this companion audit object is non-negotiable. Design it at the start of the project, not as a retrofit.

Rollback Strategy

An agent chain executes 15 record changes. Step 12 fails. The platform rolls back step 12's DML transaction, but steps 1 through 11 are already committed. You now have a partially-executed workflow with no built-in recovery path.

Salesforce does not provide automatic saga-style compensation for multi-step agent chains. You design for it, or you accept partial execution as a permanent state.

There are two viable approaches depending on your tolerance and complexity budget.

Checkpoint-based rollback: before each significant stage of the chain, snapshot the relevant record state to the Agent_Audit__c object (or a purpose-built snapshot object). If the chain fails, a compensating Flow or Apex job can restore the snapshots. This works well when the number of affected records is bounded and predictable.
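As a sketch of the checkpoint idea, with plain Python dictionaries standing in for records and an in-memory list standing in for the snapshot object (both are simplifications; in an org the snapshots would be persisted and restored by a separate Flow or Apex job):

```python
import copy


def run_chain_with_checkpoints(records, steps):
    """Run steps in order, restoring every snapshot if any step fails.

    records: dict of record_id -> dict of field values.
    steps: list of (record_id, function) pairs; each function mutates one record.
    Returns True if the whole chain completed, False if it was rolled back.
    """
    snapshots = []  # (record_id, state before the step), oldest first
    try:
        for record_id, step in steps:
            snapshots.append((record_id, copy.deepcopy(records[record_id])))
            step(records[record_id])
    except Exception:
        # Compensate newest first so a record touched by several steps
        # ends up in the state it had before the chain started.
        for record_id, state in reversed(snapshots):
            records[record_id] = state
        return False
    return True
```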

Idempotent re-execution: design every action in the chain to be idempotent — running it twice produces the same result as running it once. Then, on failure, you re-run the entire chain from step 1 with the same input. This requires that your actions check pre-conditions before writing ("if this record already has status X, skip the write") and that external system calls are also idempotent. More upfront design work, but simpler failure recovery.
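The pre-condition check is the whole trick, so a small illustration may help. This is Python with hypothetical names (`set_status`, a `write_log` list standing in for actual DML), not an Agentforce API:

```python
def set_status(record, target_status, write_log):
    """Idempotent write: skip when the record is already in the target state."""
    if record.get("Status") == target_status:
        return False  # nothing to do, so re-running costs no write
    record["Status"] = target_status
    write_log.append((record["Id"], target_status))
    return True


def run_chain(records, target_status, write_log):
    """Apply the same action to every record; safe to re-run from the start."""
    return [set_status(record, target_status, write_log) for record in records]
```

Running `run_chain` a second time with the same input performs no additional writes, which is what makes "re-run from step 1" a safe recovery strategy.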

What doesn't work: relying on the agent to recover itself. If step 12 failed because of a data constraint violation or a downstream timeout, the agent doesn't have enough context to know what state steps 1 through 11 left the org in. Recovery must be outside the agent.

Human-in-the-Loop Checkpoints

For regulated industries, fully autonomous agent chains are often not acceptable — not because the technology can't do it, but because regulatory frameworks require documented human approval for certain categories of action.

The practical pattern: gate high-risk actions behind a Platform Event or a custom approval queue. The agent executes up to the checkpoint, publishes an event (or creates a Pending_Approval__c record), and waits. A human reviews and approves or rejects. The agent chain resumes or terminates based on the decision.

The implementation challenge is that the agent doesn't "wait" in a traditional sense. You need to persist the chain state at the checkpoint, store enough context to resume cleanly, and invoke continuation logic when the approval comes through. This is architecturally closer to an asynchronous workflow than to a single agent execution.
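A simplified model of that persist-and-resume flow, in Python for illustration. The `store` dictionary stands in for a Pending_Approval__c record, and the function names and status values are assumptions, not platform features:

```python
import json


def pause_for_approval(store, chain_id, next_step, context):
    """Persist everything needed to resume, then stop executing."""
    store[chain_id] = json.dumps(
        {"status": "PENDING", "next_step": next_step, "context": context}
    )


def on_approval_decision(store, chain_id, approved, steps):
    """Continuation logic, invoked when the human decision arrives."""
    state = json.loads(store[chain_id])
    if state["status"] != "PENDING":
        return state["status"]  # a duplicate decision must not re-run steps
    if approved:
        context = state["context"]
        # Resume from the first step after the checkpoint.
        for step in steps[state["next_step"]:]:
            context = step(context)
        state["context"] = context
        state["status"] = "COMPLETED"
    else:
        state["status"] = "REJECTED"
    store[chain_id] = json.dumps(state)
    return state["status"]
```

Serializing the context to JSON is deliberate here: it forces the checkpoint to contain only data that can survive outside the original execution.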

Avoid the temptation to make every action require approval — that's not automation, it's a complicated approval workflow. Gate the genuinely high-risk categories: irreversible actions, high-value transactions, changes to regulated data. Everything else should flow autonomously.

The Zombie Agent Problem

An agent gets stuck in an ambiguous state — waiting for an external system that's down, caught in a retry loop, or stalled on a tool call that never resolves. It's not failing hard enough to surface an error. It's just consuming AWUs and making no progress. This is a zombie agent.

The platform doesn't have a built-in circuit breaker for this at the chain level. You build it.

Detection patterns that work:

  • Time-bounded execution: set a maximum wall-clock time for the chain. If execution hasn't completed within N seconds, trigger a termination event and write a diagnostic record. Salesforce's agent execution has its own timeout behavior, but it's not configurable per-chain — you need external monitoring.
  • AWU consumption threshold per request: if a single request has consumed more than X AWUs without producing a final response, treat it as anomalous and flag it for investigation.
  • Dead letter queue: agent actions that fail after exhausting retries should write to a persistent failure queue (Agent_Dead_Letter__c or similar) with full context. Without this, failed chains disappear into logs that no one monitors.
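The three detection patterns above can be combined into one guard around the chain. The following Python is a sketch under assumptions: `ChainGuard` and `run_guarded` are hypothetical names, each step is assumed to report the AWUs it consumed, and the list stands in for an Agent_Dead_Letter__c object. Because it only checks between steps, it cannot interrupt a step that hangs, which is why external monitoring is still needed.

```python
import time


class ChainGuard:
    """Tracks wall-clock time and AWU spend for a single request."""

    def __init__(self, max_seconds, max_awus, clock=time.monotonic):
        self.max_seconds = max_seconds
        self.max_awus = max_awus
        self.clock = clock
        self.started = clock()
        self.awus = 0

    def record(self, awus):
        self.awus += awus

    def violation(self):
        """Return the reason the chain should stop, or None if it may continue."""
        if self.clock() - self.started > self.max_seconds:
            return "TIME_LIMIT"
        if self.awus > self.max_awus:
            return "AWU_LIMIT"
        return None


def run_guarded(steps, guard, dead_letters, request_id):
    """Run steps in order; stop and write a dead-letter entry on a violation."""
    for index, step in enumerate(steps):
        reason = guard.violation()
        if reason:
            dead_letters.append({
                "request_id": request_id,
                "reason": reason,
                "stopped_before_step": index,
                "awus_consumed": guard.awus,
            })
            return False
        guard.record(step())  # assumption: each step returns the AWUs it used
    return True
```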

The circuit-breaker pattern for external dependencies: maintain a record that tracks the health state of each external system your agents call. Custom Metadata is the obvious place for static configuration, but Apex cannot update it with ordinary DML, so a Custom Setting or a small custom object is usually a better fit for state that a monitoring job rewrites. Before making a callout, the action checks if the target system is marked as DEGRADED or DOWN. If so, it fails fast with a structured error rather than attempting the call and burning the full timeout. A separate monitoring job updates the health state based on recent call success rates.
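A compact Python model of that breaker, for illustration. The in-memory `HealthRegistry` stands in for wherever the health state is actually stored, and the window size, sample minimum, and success-rate threshold are placeholder values. It models only the HEALTHY and DEGRADED states.

```python
from collections import deque


class SystemDegraded(Exception):
    """Structured fail-fast error raised instead of attempting the callout."""


class HealthRegistry:
    def __init__(self, window=20, min_samples=3, min_success_rate=0.5):
        self.window = window
        self.min_samples = min_samples
        self.min_success_rate = min_success_rate
        self.results = {}  # system name -> recent outcomes (True = success)

    def report(self, system, success):
        recent = self.results.setdefault(system, deque(maxlen=self.window))
        recent.append(success)

    def state(self, system):
        recent = self.results.get(system, ())
        if len(recent) < self.min_samples:
            return "HEALTHY"  # too little data to judge
        rate = sum(recent) / len(recent)
        return "HEALTHY" if rate >= self.min_success_rate else "DEGRADED"


def guarded_callout(registry, system, callout):
    """Check health before calling; record the outcome afterwards."""
    if registry.state(system) != "HEALTHY":
        raise SystemDegraded(f"{system} is marked DEGRADED; skipping callout")
    try:
        result = callout()
    except Exception:
        registry.report(system, False)
        raise
    registry.report(system, True)
    return result
```

Once a system is marked DEGRADED, no agent callouts reach it, so recovery depends on the monitoring job probing the system and calling `report` with the results.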

None of this is built into Agentforce. It's infrastructure you bring with you.


Next in this series: giving agents safe read/write access to legacy ERPs and mainframes — the integration layer the documentation skips.


Gabriel Cruz Ferreira


Salesforce Architect · 15x Certified · Road to CTA
