July 2026 · AI Engineering · 11 min read

Designing AI agents so failures are reversible

The question I keep coming back to is not whether an AI agent will make mistakes. It will. The useful question is whether the system around it makes those mistakes visible, contained, and reversible.

The First Serious Agent Failure Is Usually Boring

The failures that teach you the most about AI agents rarely look dramatic from the outside.

These are rarely the science-fiction failures where the model goes completely off the rails. More often, the agent does something that is almost reasonable. It picks the wrong record because two names look similar. It sends a draft to the wrong place because the channel context was ambiguous. It treats an old memory as current truth. It creates a duplicate task because the previous run timed out after the side effect already happened. It edits a file correctly, then commits unrelated generated output because nobody constrained the git add step.

That kind of failure is frustrating because the agent did not look stupid. It looked confident enough to be dangerous.

When I started building more agent workflows, I had the same instinct many developers have: improve the prompt, tighten the instruction, add one more sentence explaining the edge case. That helps sometimes. But after a while it becomes obvious that prompt quality is only one layer. A production agent also needs a system design that assumes mistakes will happen and makes them survivable.

This changed how I think about agent engineering. I no longer ask only, "Can the agent complete this workflow?" I ask, "If it completes the workflow incorrectly, how quickly can I see what happened, stop the damage, and repair the state?"

That is a much better design question.

Reversibility Is Not A Nice-To-Have

Traditional software can fail too, but many failures are constrained by deterministic code paths. If a validation rule rejects bad input, the bad state never lands. If a database transaction fails, the system can roll back. If a deployment breaks, a previous release can often be restored.

AI agents operate with more flexible reasoning. That flexibility is exactly what makes them useful. They can interpret messy requests, choose tools, combine context, and handle work that would be awkward to express as a rigid form. But flexibility also means the failure surface is wider.

An agent can misunderstand intent while still producing valid-looking tool input. It can choose the right tool for the wrong reason. It can follow a stale instruction. It can use a real credential to perform a real action based on a bad assumption. From the outside, the system may only show the final answer. The messy part happened in the middle.

That is why reversibility matters.

For every workflow, I want to know what the undo path looks like. If the agent publishes content, can the publication be removed or replaced? If it updates a record, is the previous value stored somewhere? If it sends a message, was there an approval checkpoint before it left the private workspace? If it modifies code, is the change isolated in a commit that can be inspected? If it creates tasks, can duplicates be detected?

Some actions are naturally reversible. Some are not. The irreversible ones deserve more friction.

This is not about making agents timid. It is about matching autonomy to blast radius. A good agent should move quickly where mistakes are cheap and pause where mistakes are expensive. The system should make that distinction explicit instead of relying on the model to feel the risk correctly every time.

Start With The Damage Model

Before giving an agent a workflow, I like writing down the damage model in plain language.

What can go wrong? Who notices? How soon? What data could be exposed? What external systems can be changed? What would be annoying but harmless? What would be expensive? What would be embarrassing? What would be legally or operationally serious?

This sounds heavy, but it does not need to become a formal risk ceremony. Even a short list changes the implementation. It forces you to separate read operations from write operations. It exposes which tools need approval. It shows where logs are necessary. It makes idempotency requirements obvious. It tells you whether a scheduled job should run silently, report a draft, or wait for a human.

For example, an agent that summarizes unread emails can be fairly autonomous if it only reads metadata and produces a private summary. An agent that replies to those emails needs a different boundary. An agent that triages GitHub issues can label candidate issues at low risk. An agent that closes issues or pushes fixes needs traceability. An agent that generates a blog draft can run freely. An agent that publishes and pushes to production must be constrained to exactly the files it is meant to touch.

The workflow may look similar in a demo, but the damage model is different.

I find this especially useful for scheduled agents. A schedule turns a one-time assistant into a small operations system. If it runs every Tuesday and Friday, it can accumulate small errors for weeks. If it has external write access, the cost of a bad assumption multiplies with each run. The design has to account for repetition.

Damage modeling gives the agent a practical boundary. It is the difference between "be careful" and "you may do these things alone, but these actions require review."
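To make that boundary concrete, here is a minimal sketch of a damage model written as data instead of as a prompt sentence. The action names, the two risk levels, and the rule in `needs_review` are all assumptions chosen for illustration; a real workflow would pick its own.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"    # annoying but harmless
    HIGH = "high"  # expensive, embarrassing, or operationally serious


@dataclass(frozen=True)
class ActionPolicy:
    writes_externally: bool
    reversible: bool
    risk: Risk

    @property
    def needs_review(self) -> bool:
        # Reads run alone. Writes pause when they are high risk or cannot be undone.
        if not self.writes_externally:
            return False
        return self.risk is Risk.HIGH or not self.reversible


# Hypothetical actions for an inbox and issue-triage agent.
DAMAGE_MODEL = {
    "summarize_unread": ActionPolicy(writes_externally=False, reversible=True, risk=Risk.LOW),
    "label_issue": ActionPolicy(writes_externally=True, reversible=True, risk=Risk.LOW),
    "send_reply": ActionPolicy(writes_externally=True, reversible=False, risk=Risk.HIGH),
}


def may_run_alone(action: str) -> bool:
    # Actions missing from the model fail closed: they always require review.
    policy = DAMAGE_MODEL.get(action)
    return policy is not None and not policy.needs_review
```

The useful property is that an action nobody thought about is treated as risky by default, so the list has to be extended deliberately.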

Make Every External Action Explicit

One of the easiest mistakes in agent design is hiding side effects behind friendly tool names.

A tool called handleLead might read a CRM record, enrich data, update a status, send an email, and create a follow-up task. That is convenient for a demo, but it makes the agent hard to reason about. When something goes wrong, nobody knows which part of "handle" caused the problem.

I prefer tools that make external actions explicit. Read tools should read. Draft tools should draft. Write tools should write with clear input. Send tools should send, and their name should make that obvious. If a tool has side effects, the agent should not have to infer them from documentation hidden three layers away.

This helps the model, but it helps humans more.

When reviewing a trace, I want the tool calls to tell the story. The agent searched the source. It prepared the draft. It asked for approval. It committed specific files. It pushed to the configured branch. That sequence is understandable. A single broad action is not.

Explicit side effects also make permission design easier. You can allow an agent to draft without allowing it to send. You can allow it to inspect repository state without allowing it to reset files. You can allow it to create a local artifact without allowing it to publish externally. The permission model becomes a map of actions, not a vague trust decision.
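One way to sketch that map of actions: each tool declares a single kind of side effect, and the registry refuses calls whose effect has not been granted. The `Effect` categories, the `ToolRegistry` class, and the tool names below are hypothetical, not taken from any particular agent framework.

```python
from enum import Enum
from typing import Callable


class Effect(Enum):
    READ = "read"
    DRAFT = "draft"
    SEND = "send"


class ToolRegistry:
    """Maps tool names to one declared side effect and enforces granted effects."""

    def __init__(self, allowed: set[Effect]) -> None:
        self.allowed = allowed
        self._tools: dict[str, tuple[Effect, Callable[..., str]]] = {}

    def register(self, name: str, effect: Effect, fn: Callable[..., str]) -> None:
        self._tools[name] = (effect, fn)

    def call(self, name: str, *args: str) -> str:
        effect, fn = self._tools[name]
        if effect not in self.allowed:
            raise PermissionError(f"{name} requires '{effect.value}' permission")
        return fn(*args)


# This agent may read and draft, but not send. The lambdas stand in for real tools.
registry = ToolRegistry(allowed={Effect.READ, Effect.DRAFT})
registry.register("read_lead", Effect.READ, lambda lead_id: f"lead {lead_id}")
registry.register("draft_email", Effect.DRAFT, lambda body: f"DRAFT: {body}")
registry.register("send_email", Effect.SEND, lambda body: f"SENT: {body}")
```

With this setup, `registry.call("draft_email", "hi")` succeeds while `registry.call("send_email", "hi")` raises `PermissionError`, and a trace of tool names already tells most of the story.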

This is where many agent systems quietly become safer without becoming slower. Narrow tools do not prevent autonomy. They give autonomy rails that are visible.

Prefer Drafts Over Direct Writes

Drafts are underrated.

A draft creates a reversible stage between reasoning and consequence. The agent can do useful work, assemble context, generate output, and prepare the next step without immediately changing the outside world. A human or another verification step can inspect the result before it becomes real.

For content workflows, drafts are obvious. Generate the article, preview it, then publish. For code workflows, a branch or a focused commit serves the same purpose. For operations workflows, a proposed change set can act as a draft. For messaging workflows, a composed reply is a draft until sent.

The important detail is that the draft must be inspectable. A draft hidden inside the agent's final response is weaker than a file, commit, ticket, or structured payload that can be reviewed. The artifact should survive beyond the conversation so someone can compare, test, approve, or reject it.
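A minimal sketch of an inspectable draft, assuming a JSON file on disk is an acceptable artifact. The `Draft` fields and the three status values are illustrative assumptions; the point is only that the draft outlives the conversation and that publishing refuses anything not explicitly approved.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Draft:
    draft_id: str
    kind: str                # e.g. "article", "reply", "change_set"
    payload: str
    status: str = "pending"  # pending -> approved -> published


def save_draft(draft: Draft, directory: Path) -> Path:
    # Persist the draft as a file a reviewer can open, diff, or reject.
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{draft.draft_id}.json"
    path.write_text(json.dumps(asdict(draft), indent=2), encoding="utf-8")
    return path


def load_draft(path: Path) -> Draft:
    return Draft(**json.loads(path.read_text(encoding="utf-8")))


def publish(path: Path) -> Draft:
    # The consequential step only accepts drafts that were approved.
    draft = load_draft(path)
    if draft.status != "approved":
        raise RuntimeError(f"draft {draft.draft_id} is {draft.status}, not approved")
    draft.status = "published"
    save_draft(draft, path.parent)
    return draft
```

The approval itself happens outside these functions: a human or a verification step changes the status to "approved" after looking at the file.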

I like draft-first designs because they keep momentum without pretending the model is infallible. The agent still does the heavy lifting. It gathers the facts, writes the content, prepares the diff, or builds the plan. The system simply delays the irreversible step until there is enough confidence.

Over time, some draft workflows can become automatic. If the same kind of draft passes review repeatedly and the damage model is low, the approval boundary can move. That is a product decision backed by evidence, not a leap of faith.

Use Idempotency Like A Seatbelt

Agents retry things.

Sometimes the model retries because it is unsure. Sometimes the platform retries because a tool timed out. Sometimes a scheduled task runs again after a partial failure. Sometimes the human asks the agent to continue, and the agent does not know whether the last side effect succeeded.

If the workflow is not idempotent, retries create duplicates.

This shows up everywhere. Duplicate messages. Duplicate tickets. Duplicate invoices. Duplicate blog routes. Duplicate calendar events. Duplicate records in a CRM. The agent may even report success because each individual tool call succeeded. The system state is still wrong.

Idempotency is one of the most useful boring habits in agent engineering. Use stable keys. Check whether the target already exists before creating it. Store operation IDs. Make create tools accept a client-supplied idempotency key when possible. Make schedules update existing jobs instead of blindly creating new ones. Make commits include only the intended files. Make generated assets use predictable names.

It is not glamorous, but it changes the failure mode. A repeated run becomes a no-op or an update instead of a mess.
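Here is a small sketch of the stable-key habit. `TicketStore` is an in-memory stand-in for whatever external system creates things, and deriving the key from a workflow name plus a source ID is an assumption about what identifies "the same operation" in a given workflow.

```python
import hashlib


class TicketStore:
    """In-memory stand-in for an external system that creates tickets."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}

    def create(self, idempotency_key: str, title: str) -> dict:
        # A repeated key returns the existing ticket instead of creating another.
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        ticket = {"id": len(self._by_key) + 1, "title": title}
        self._by_key[idempotency_key] = ticket
        return ticket

    def count(self) -> int:
        return len(self._by_key)


def stable_key(workflow: str, source_id: str) -> str:
    # Derived from what the operation is about, not from when it ran,
    # so a retry or a rerun produces the same key.
    return hashlib.sha256(f"{workflow}:{source_id}".encode("utf-8")).hexdigest()


store = TicketStore()
key = stable_key("triage", "email-123")
first = store.create(key, "Follow up with customer")
retry = store.create(key, "Follow up with customer")  # e.g. after a timeout
```

After the retry, `store.count()` is still 1 and both calls return the same ticket, which is exactly the no-op behavior a resumed run needs.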

For whoever operates the agent, idempotency also reduces anxiety. If I know a workflow can safely be resumed, I can let the agent recover from transient problems. If I know every retry might multiply side effects, I have to supervise every uncertain moment.

Reliable autonomy depends on boring repeatability.

Logs Should Preserve The Decision Path

A normal application log often tells you which endpoint failed and which exception was thrown. An agent trace needs to tell you something slightly different: why did the agent believe this action was correct?

That means preserving the decision path, not just the final output.

I want to see the user request, the relevant context, the tool calls, the important tool outputs, the approval boundaries, and the final action. I want timestamps. I want enough input and output to understand the reasoning without exposing private data unnecessarily. I want to know whether the agent used memory, a file, a search result, or a live API as its source of truth.

When an agent fails, the bug might not be in code. It might be in a prompt, a stale memory, a misleading tool name, a missing route, a bad assumption, or an external system that returned partial data. Without traces, every incident becomes guesswork.

Good logs also make reversibility practical. If a write action happened, the trace should identify what changed and where. If a message was sent, the trace should record enough metadata to find it. If a file was edited, the diff should be available. If a scheduled job ran, the run should have an ID.

I do not think every token needs to be stored forever. That creates privacy and cost problems. But the system should preserve enough structure to answer the operational questions: what did the agent know, what did it decide, what did it do, and how can we undo or repair it?
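As one possible shape for that structure, a trace event could be a small record serialized as one JSON line per step. The field names here are assumptions, not a standard schema. The sketch stores a redacted summary instead of raw payloads and attaches an undo hint to write actions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class TraceEvent:
    run_id: str
    step: str             # "request", "tool_call", "approval", or "action"
    source_of_truth: str  # "memory", "file", "search", "live_api", or "user"
    summary: str          # redacted summary, not the raw payload
    undo_hint: Optional[str] = None  # how to find and reverse a write, if any
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(event: TraceEvent) -> str:
    # One JSON object per line keeps the trace easy to grep and to replay.
    return json.dumps(asdict(event), sort_keys=True)


event = TraceEvent(
    run_id="run-1",
    step="action",
    source_of_truth="live_api",
    summary="updated record status",
    undo_hint="restore previous status from the audit row",
)
line = to_log_line(event)
```

Each line then answers the four operational questions in miniature: what the agent knew (`source_of_truth`), what it decided and did (`step`, `summary`), and how to repair it (`undo_hint`).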

Approval Is A Product Feature

Some teams treat human approval as a failure of automation. I think that is the wrong framing.

Approval is a product feature when it is placed at the right boundary. It lets the agent do more work safely. It gives the human leverage instead of forcing them to perform every step manually. The agent can prepare a complete action, and the human only decides whether the prepared action is acceptable.

The bad version of approval is vague and constant. The agent asks for permission every few seconds because the workflow has no risk model. That becomes annoying quickly. People stop trusting the system because it feels needy.

The good version of approval is specific. "I prepared this email to this person with this subject. Send it?" "I changed these three files and generated this route. Commit and push?" "I found these duplicate records. Merge them?" The human sees the consequence clearly.

Approval should also be remembered carefully. If a user approves one kind of low-risk action repeatedly, the system may learn that future similar actions can run automatically. But that memory should be scoped. Approving one repository workflow does not mean approving all external writes forever.
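A sketch of scoped approval memory, assuming a scope is an (action, target) pair and that three consecutive approvals is the bar for skipping the prompt. Both the scope shape and the threshold are arbitrary illustrative choices. A rejection resets the count, so trust has to be re-earned.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scope:
    action: str  # e.g. "commit_and_push"
    target: str  # e.g. one specific repository


@dataclass
class ApprovalMemory:
    threshold: int = 3
    counts: dict = field(default_factory=dict)

    def record(self, scope: Scope, approved: bool) -> None:
        # Approvals accumulate per scope; a single rejection resets that scope.
        self.counts[scope] = self.counts.get(scope, 0) + 1 if approved else 0

    def can_skip_prompt(self, scope: Scope) -> bool:
        return self.counts.get(scope, 0) >= self.threshold


memory = ApprovalMemory()
blog = Scope("commit_and_push", "blog-repo")      # hypothetical repository names
billing = Scope("commit_and_push", "billing-repo")
for _ in range(3):
    memory.record(blog, approved=True)
```

Here `memory.can_skip_prompt(blog)` becomes true while `memory.can_skip_prompt(billing)` stays false, because approvals never leak across scopes.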

In practice, approval design is where trust becomes visible. A system that asks at the right moment feels competent. A system that asks randomly feels unfinished. A system that never asks before irreversible actions feels reckless.

Keep Rollback Close To The Work

A rollback plan that lives only in someone's head is not a rollback plan.

If the agent performs a meaningful action, the undo path should be close to the action. In code, that might mean a clean commit that can be reverted. In content, it might mean the previous version is saved. In a database workflow, it might mean an audit table. In a messaging workflow, it might mean a draft-first boundary because the final send cannot truly be undone.

Different systems need different rollback mechanisms, but the pattern is the same: make repair possible before you need it.
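For record updates, the pattern can be sketched as a store that saves the previous value before every write. `AuditedStore` is an in-memory illustration with hypothetical names. A real system would more likely use an audit table or version history, but the shape of the undo path is the same.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AuditEntry:
    change_id: int
    key: str
    had_value: bool  # distinguishes "key was missing" from "value was None"
    previous: Any


class AuditedStore:
    def __init__(self) -> None:
        self.data: dict[str, Any] = {}
        self.audit: list[AuditEntry] = []

    def update(self, key: str, value: Any) -> int:
        # Record the old state first, so the undo path exists before it is needed.
        change_id = len(self.audit) + 1
        self.audit.append(AuditEntry(change_id, key, key in self.data, self.data.get(key)))
        self.data[key] = value
        return change_id

    def undo(self, change_id: int) -> None:
        # Restores the state from just before this change was applied.
        entry = self.audit[change_id - 1]
        if entry.had_value:
            self.data[entry.key] = entry.previous
        else:
            self.data.pop(entry.key, None)
```

Note that `undo` restores the value from before that specific change, so undoing an old change after newer ones will overwrite them. Whether that is acceptable depends on the workflow.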

I have learned to appreciate small habits here. Commit only the files related to the change. Do not include unrelated generated output. Use predictable filenames. Save source before deployment. Keep route changes explicit. Record the external URL after publishing. These details sound procedural, but they reduce the time between noticing a mistake and fixing it.
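The "commit only related files" habit can be checked mechanically. This sketch assumes the workflow already has a list of changed paths (for example, from the repository status) and an allowlist of glob patterns. The paths and patterns shown are made up.

```python
from fnmatch import fnmatchcase


def split_changes(changed: list[str], allowed_patterns: list[str]) -> tuple[list[str], list[str]]:
    """Separate changed paths into those the workflow may commit and the rest."""
    intended = [
        path for path in changed
        if any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]
    unexpected = [path for path in changed if path not in intended]
    return intended, unexpected


# Hypothetical paths for a blog-publishing workflow.
intended, unexpected = split_changes(
    changed=["content/posts/reversible.md", "dist/bundle.js", "routes.ts"],
    allowed_patterns=["content/posts/*.md", "routes.ts"],
)
```

Only the `intended` paths get staged. A non-empty `unexpected` list is a reason to stop and ask instead of committing everything that happens to be modified.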

Rollback is not only for disasters. It is for ordinary corrections. The image is wrong. The date is wrong. The agent used a weak title. The route was missed. The output is good but the metadata needs a tweak. A clean change set makes those corrections cheap.

Cheap corrections make teams more willing to use automation. Expensive corrections make everyone nervous.

Test The Boundary, Not The Personality

Testing agents is awkward if you expect deterministic prose every time.

But most important agent tests are not about exact wording. They are about boundaries. Does the agent ask before sending externally? Does it avoid protected client names? Does it read the current source of truth instead of relying on memory? Does it create a draft before publishing? Does it update an existing schedule instead of creating a duplicate? Does it commit only intended files?

These are testable behaviors.

I like turning real failures into regression cases. If an agent once committed unrelated files, add a checklist or test around the git add step. If it once used a stale route, add a build step that catches missing routes. If it once acted on old memory, make it cite the current file or API result before acting. If it once sent something too early, move the approval boundary into the tool or workflow.

The point is not to remove all variability. The point is to protect the contracts that matter.
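A boundary test can run against the recorded sequence of tool calls and ignore the prose entirely. The sketch below assumes a naming convention where outbound tools start with `send_` and approval is its own tool call named `request_approval`; both names are hypothetical.

```python
def violates_send_boundary(tool_calls: list[str]) -> bool:
    """Return True if any send_* call is not directly covered by an approval."""
    approved = False
    for name in tool_calls:
        if name == "request_approval":
            approved = True
        elif name.startswith("send_"):
            if not approved:
                return True
            approved = False  # each send consumes one approval
    return False


# Regression cases built from recorded traces, not from model output text.
good_trace = ["search_source", "draft_reply", "request_approval", "send_reply"]
bad_trace = ["search_source", "draft_reply", "send_reply"]
```

The check returns False for `good_trace` and True for `bad_trace`, regardless of how the agent worded anything along the way.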

For some workflows, the test is automated. For others, it is a review checklist. For scheduled jobs, the test may be a small verification command after the run. That is fine. The right amount of testing depends on the damage model.

What matters is that every meaningful failure leaves the system a little better defended than before.

Autonomy Should Grow From Evidence

I do not think agent systems should start with maximum autonomy.

They should earn it.

At first, the agent drafts, explains, and asks. The traces show whether it uses the right sources. The human sees whether the output is useful. The system records where failures happen. Then the low-risk parts become automatic. The approval boundary moves closer to the expensive action. The tools become narrower. The tests improve. The logs get better.

This is slower than a flashy demo, but faster than cleaning up uncontrolled side effects later.

Autonomy based on evidence feels different from autonomy based on optimism. It is not "the model is smart, let it do everything." It is "this workflow has run correctly enough times, the damage is bounded, the rollback path is clear, and the system will tell us if something changes."

That is the kind of autonomy I trust.

The Goal Is Confident Recovery

AI agents will make mistakes. So will the humans designing them. So will the APIs they call, the files they read, the schedules they run on, and the assumptions embedded in their prompts.

The useful engineering goal is not a fantasy of perfect behavior. The goal is confident recovery.

Can we see what happened? Can we understand why? Can we limit the damage? Can we undo or repair the change? Can we turn the failure into a better boundary, tool, test, or instruction? If the answer is yes, the system can improve without becoming paralyzed.

This is why I care about reversible failures. They are not an excuse for sloppy agents. They are how serious agent systems learn safely.

The best automation does not pretend the happy path is the only path. It gives the happy path speed and gives the unhappy path handles.

An AI agent does not need to be perfect to be useful. It needs to fail in ways the system can see, contain, and repair.

Igor Gawrys

AI Engineer & IT Consultant · Katowice, Poland
