June 2026 · AI Engineering · 12 min read

Debugging AI agents: how I trace tool calls, memory, and wrong assumptions

Debugging an AI agent is not like debugging a normal function. The failure can live in the prompt, the retrieved context, the tool result, the memory layer, or the model's confident interpretation of all of it.

The Bug Was Not In The Code Path

The first time I seriously debugged an AI agent, I made the same mistake I would have made in a normal application: I looked for the broken code path.

The agent had done something wrong, so my instinct was to ask which function produced the wrong output. Which branch misfired? Which API returned an unexpected value? Which handler swallowed an error? That mindset works well for deterministic systems. It works less well when the system has a language model sitting in the middle, interpreting instructions, choosing tools, compressing context, and filling gaps with assumptions.

The visible failure looked simple. The agent completed a task and reported success, but one of the steps was based on an old assumption. Nothing crashed. The tool call returned valid data. The final answer sounded reasonable. The problem was that the agent had mixed current evidence with stale memory and acted as if both had the same authority.

That changed how I debug agents.

With ordinary software, I usually ask, "Where did the value change?" With agents, I ask a broader question: "Where did the belief come from?" The bug may still be in code, but it may also be in context selection, instruction priority, tool design, memory hygiene, or the absence of a verification step.

AI agents fail less like broken calculators and more like junior teammates who misunderstood the task in a very plausible way. The debugging process has to account for that.

I Start With The Timeline

When an agent fails, I want a timeline before I want an explanation.

Explanations are cheap after the fact. The model can produce a coherent story for almost anything. A timeline is harder to fake because it ties the failure to actual events: what the user asked, what context was loaded, what the agent planned, which tool it called, what the tool returned, what it wrote down, what it ignored, and what it finally told the user.

Without that sequence, debugging becomes vibes. You stare at the final answer and guess which part went wrong. Maybe the model hallucinated. Maybe the tool returned stale data. Maybe the prompt was ambiguous. Maybe memory injected an old preference. Maybe a previous summary dropped a condition. All of those are possible, but guessing is not diagnosis.

So I reconstruct the run as concretely as possible. I read the conversation. I inspect the tool calls. I check file diffs. I compare the final answer with actual state. I look for the first moment where the agent's belief diverged from reality.

That first divergence matters. If the agent misunderstood the task at the beginning, everything it did afterwards may be perfectly consistent with a wrong goal. If it understood the task but trusted the wrong tool result, the fix is different. If it had the right evidence but compressed it badly, the issue is in memory or summarization. The timeline tells me which layer deserves attention.
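One way to make "first divergence" concrete is to write the reconstructed run down as data. The sketch below is illustrative only: the `Event` fields, the `first_divergence` helper, and the sample timeline are hypothetical names and values, not part of any particular agent framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    step: int       # position in the run
    kind: str       # "user", "context", "tool_result", "report", ...
    belief: str     # what the agent took to be true at this step
    observed: str   # what was actually true, reconstructed afterwards


def first_divergence(timeline: list[Event]) -> Optional[Event]:
    """Return the earliest event where belief and observed state differ."""
    for event in sorted(timeline, key=lambda e: e.step):
        if event.belief != event.observed:
            return event
    return None


# Hypothetical run: the final report is wrong, but the error entered at step 2.
timeline = [
    Event(1, "user", "publish the new post", "publish the new post"),
    Event(2, "context", "posts live under /blog", "posts live under /posts"),
    Event(3, "tool_result", "build passed", "build passed"),
    Event(4, "report", "post is live", "post returns 404"),
]

culprit = first_divergence(timeline)
print(culprit.step, culprit.kind)  # prints: 2 context
```

The final report at step 4 is where the failure became visible, but the layer that deserves attention is the context loaded at step 2.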

Tool Calls Are Evidence, Not Decoration

I treat tool calls as the equivalent of stack frames in agent debugging.

A tool call says: at this point, the agent believed it needed this information or side effect. The arguments show what it thought the task was. The output shows what evidence entered the context next. The absence of a tool call can be just as revealing. If the agent claimed a file was updated but never inspected the diff, that is a debugging signal. If it reported that a page worked but never ran the build, that is not a small omission. It is a missing proof step.

This is why I like agent systems that make tool traces easy to inspect. I want to see the exact command, the working directory, the output, and the exit status. I want to know whether a search returned nothing or whether the agent never searched. I want errors to remain visible instead of being summarized into a polite sentence.
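As a rough sketch of what such a trace could capture, the wrapper below records the command, working directory, output, and exit status for a single call. `ToolTrace` and `run_traced` are hypothetical names chosen for illustration; only the standard library `subprocess.run` call is real.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ToolTrace:
    command: list[str]
    cwd: str
    exit_code: int
    stdout: str
    stderr: str


def run_traced(command: list[str], cwd: str = ".") -> ToolTrace:
    """Run a command and keep everything needed to inspect it later."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return ToolTrace(
        command=command,
        cwd=str(Path(cwd).resolve()),
        exit_code=result.returncode,
        stdout=result.stdout,
        stderr=result.stderr,
    )


# Stand-in for a real build command: it exits cleanly but prints a warning.
trace = run_traced([
    sys.executable,
    "-c",
    "import sys; print('built'); print('warning: route skipped', file=sys.stderr)",
])
print(trace.exit_code, trace.stderr.strip())
```

Keeping stderr as its own field means a warning stays visible next to a zero exit code instead of being folded into a summary.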

Many agent mistakes hide in the gap between "the tool returned something" and "the agent interpreted it correctly." A search result may be truncated. A command may succeed while still printing a warning. A page fetch may return a login screen instead of the real page. A build may finish with generated output that does not include the route the agent intended to publish.
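A few of these gaps can be checked mechanically before the agent is allowed to interpret the output. The function below is a minimal sketch under loud assumptions: the name `interpretation_flags`, the string heuristics, and the example route are all hypothetical, and real checks would need to be tuned to the tools involved.

```python
def interpretation_flags(exit_code: int, output: str, expected_marker: str) -> list[str]:
    """Flag cases where 'the tool returned something' may not mean what it seems."""
    flags = []
    lowered = output.lower()
    # A command can succeed while still printing a warning.
    if exit_code == 0 and "warning" in lowered:
        flags.append("succeeded with warnings")
    # The output should contain the thing the agent intended to produce.
    if expected_marker not in output:
        flags.append(f"expected marker missing: {expected_marker!r}")
    # A page fetch may return a login screen instead of the real page.
    if "sign in" in lowered or "log in" in lowered:
        flags.append("output looks like a login page")
    return flags


build_output = "Build complete\nwarning: 1 page skipped"
for flag in interpretation_flags(0, build_output, expected_marker="/posts/new"):
    print(flag)
```

An empty list does not prove the interpretation is right. It only means the cheapest known ways of being wrong were ruled out.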

The tool trace gives me a way to challenge the agent's confidence. If the final answer says "done," the trace should show the work becoming done.

The Most Dangerous Word Is "Probably"

Agents often make mistakes at the exact point where a human would say "probably."

The file is probably named like the others. The route probably follows the same pattern. The old memory is probably still true. The user's latest instruction probably means the same thing as last time. The external service probably deployed after the push. The untracked files are probably irrelevant.

Sometimes those assumptions are harmless. Sometimes they are correct. But when an agent acts on too many of them without labeling them as assumptions, the final result can look more certain than the evidence deserves.

In debugging, I look for silent probablys. They usually appear where the agent skipped a cheap verification step. It did not read the current config. It did not check git status. It did not open the generated file. It did not compare the route list with the new slug. It did not confirm that the downloaded image was actually an image. It just continued because the pattern looked familiar.

The fix is not to ban assumptions. That would make agents painfully slow and needy. The fix is to separate low-risk assumptions from assumptions that affect external side effects, production state, money, reputation, data loss, or user trust. If the cost of being wrong is meaningful, "probably" needs evidence.
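That separation can be sketched as a small data model in which every assumption carries a risk level and, optionally, the evidence that confirmed it. The `Risk` and `Assumption` types and the `needs_evidence` helper are hypothetical, and a two-level risk scale is a simplification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"    # a wrong guess is cheap to undo
    HIGH = "high"  # touches side effects, production, money, data, or trust


@dataclass
class Assumption:
    claim: str
    risk: Risk
    evidence: Optional[str] = None  # what confirmed it, if anything


def needs_evidence(assumptions: list[Assumption]) -> list[Assumption]:
    """Return high-risk assumptions that nothing has confirmed yet."""
    return [a for a in assumptions if a.risk is Risk.HIGH and a.evidence is None]


run = [
    Assumption("new file follows the existing naming pattern", Risk.LOW),
    Assumption("the service deployed after the push", Risk.HIGH),
    Assumption("untracked files are irrelevant", Risk.HIGH, evidence="git status reviewed"),
]

for a in needs_evidence(run):
    print("verify before acting:", a.claim)
```

Low-risk assumptions pass through untouched, so the agent stays fast where being wrong is cheap.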

Memory Bugs Look Like Personality

Persistent memory makes agents more useful, but it also creates a subtle class of bugs.

A memory bug often does not look like a bug. It looks like the agent being consistent. It remembers a preference, a workflow, a repository quirk, a past decision, or a shortcut. Then it applies that memory to a situation where it no longer belongs.

This is hard to notice because continuity feels good. When an agent remembers how a project works, the workflow becomes smoother. But memory is not automatically truth. It is historical context. It can be outdated, too broad, incomplete, or valid only under conditions that were obvious at the time and missing later.

When I debug memory-related failures, I ask three questions. Where did the memory come from? Was it meant to be durable? Did current evidence confirm it? If the answer to the last question is no, I treat the memory as a hint, not a fact.
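Those three questions map naturally onto fields of a memory record. This is a minimal sketch, assuming a hypothetical `MemoryEntry` shape; real memory layers store entries differently, and the example values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryEntry:
    text: str
    source: str                         # where the memory came from
    durable: bool                       # was it meant to outlive the original task?
    confirmed_by: Optional[str] = None  # current evidence that confirms it, if any


def authority(entry: MemoryEntry) -> str:
    """Unconfirmed memory is a hint; only confirmed memory counts as fact."""
    return "fact" if entry.confirmed_by else "hint"


stale = MemoryEntry(
    "deploys go through the staging branch",
    source="session notes",
    durable=False,
)
fresh = MemoryEntry(
    "deploys go through the staging branch",
    source="session notes",
    durable=True,
    confirmed_by="read current CI config",
)

print(authority(stale), authority(fresh))  # prints: hint fact
```

The `source` and `durable` fields do not change the verdict here. They exist so the first two questions can be answered during a later investigation.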

The best memory systems I have used separate raw logs from curated memory. Raw logs are useful for reconstruction. Curated memory is useful for behavior. Mixing them together makes every old detail feel equally important, and that is how an agent ends up following a stale note instead of the current task.

Memory should make the agent more grounded, not more superstitious.

Wrong Assumptions Usually Enter Through Gaps

An agent rarely invents a bad assumption in a vacuum. The assumption usually enters through a gap.

The user leaves out a detail because it seems obvious. The repo has two conventions because it is mid-migration. A summary says "the build failed" but not why. A tool returns partial output. A file name almost matches a route but not quite. A safety rule says to ask before publishing, while a current automation explicitly says to publish as part of the scheduled workflow.

Those gaps are where the model tries to be helpful. It chooses the most likely interpretation and keeps moving.

That is often exactly what we want. A useful agent should not stop at every missing comma. It should infer, inspect, and proceed when the risk is low. But when debugging a failure, I pay close attention to which gaps were filled by inference and whether the agent had a cheap way to close them with evidence.

If the agent could have read a file, searched the repo, checked the current branch, or inspected a generated artifact, then the failure is not simply "the model assumed wrong." The workflow allowed an assumption to survive when verification was available.

I Debug The Prompt Last, Not First

It is tempting to fix every agent failure by editing the prompt.

Prompts matter, but they are also an easy place to hide unclear thinking. If an agent skipped verification, I can add another sentence saying "always verify." If it touched the wrong file, I can add "do not touch unrelated files." If it trusted stale memory, I can add "current evidence wins." After a few incidents, the prompt becomes a growing wall of rules.

Sometimes that is necessary. More often, I first want to know whether the system design made the right behavior natural.

Was the authoritative source easy to find? Were tool outputs clear? Did the agent have access to the current state? Was there a checklist near the workflow? Could the final side effect be verified automatically? Were destructive or public actions separated from internal work?

A prompt rule is weakest when it asks the model to remember discipline that the system could enforce. If every deployment requires a build, the workflow should run the build. If every content post requires a route, the route list should be checked. If every external message requires approval, the sending tool should make that boundary explicit.
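The difference between a rule the model must remember and a rule the system enforces can be shown in a few lines. The `Workflow` class below is a hypothetical sketch, not a real deployment tool: it simply refuses to deploy unless a passing build was recorded in the same run.

```python
class Workflow:
    """Tracks whether the preconditions for a side effect were actually met."""

    def __init__(self) -> None:
        self.build_passed = False

    def record_build(self, exit_code: int) -> None:
        # Only a clean exit counts as a passing build.
        self.build_passed = exit_code == 0

    def deploy(self) -> str:
        if not self.build_passed:
            raise RuntimeError("deploy blocked: no passing build recorded in this run")
        return "deployed"


workflow = Workflow()
try:
    workflow.deploy()
except RuntimeError as err:
    print(err)

workflow.record_build(exit_code=0)
print(workflow.deploy())
```

With a guard like this, "always build before deploying" no longer needs to be a sentence in the prompt.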

I still improve prompts, but only after I understand the failure. Otherwise I am just adding instructions to compensate for a missing feedback loop.

Good Agent Logs Need Human Language

Raw logs are necessary, but they are not enough.

When I am debugging an agent, I want both machine-level evidence and human-level summaries. The machine-level evidence is the exact tool call, diff, output, or response. The human-level summary says why the agent did that thing and what it believed the result meant.

Either one alone is incomplete. A raw command log can tell me that the agent ran a build, but not whether it understood why the build mattered. A natural-language summary can tell me the agent thought the build passed, but not whether the command actually exited cleanly. Together, they let me compare intention with evidence.
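Comparing intention with evidence is easier when both live in the same record. The `LogEntry` shape and the "passed"/"failed" vocabulary below are assumptions made for this sketch, and the example command is only a placeholder string.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    why: str              # human-level: why the agent took this step
    believed_result: str  # human-level: "passed" or "failed"
    command: str          # machine-level: what was run
    exit_code: int        # machine-level: how it actually ended


def belief_matches_evidence(entry: LogEntry) -> bool:
    """Check the agent's stated belief against the recorded exit status."""
    actual = "passed" if entry.exit_code == 0 else "failed"
    return entry.believed_result == actual


entry = LogEntry(
    why="confirm the new route is included in the generated site",
    believed_result="passed",
    command="npm run build",
    exit_code=1,
)
print(belief_matches_evidence(entry))  # prints: False
```

A mismatch like this one is exactly the kind of divergence that neither the raw log nor the summary would reveal on its own.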

This is especially useful in long-running workflows. After an hour of work, nobody wants to reconstruct every decision from scratch. A good agent should leave behind a concise trail: selected this topic, edited these files, downloaded this asset, generated this output, saw this warning, committed this hash, pushed this branch, final URL is this.

Those summaries are not just for the user. They are for future debugging. A well-written status message can save more time than a clever abstraction.

Reproduction Means Replaying Beliefs

In normal software, reproduction often means finding the same input that triggers the same bug.

With agents, reproduction also means replaying the same beliefs. What did the agent know at the time? What did it think was authoritative? Which memory entries were active? Which tool outputs were available? Which instruction had priority? Which files were dirty before it started?

This is why agent bugs can feel slippery. If you rerun the task later with cleaner context, updated files, or a corrected memory entry, the failure may disappear. That does not mean the bug was imaginary. It means the original state included context that mattered.

For serious workflows, I like saving enough run metadata to make this possible. Not every token needs to be archived forever, but the important parts should be recoverable: user instruction, selected context, plan, tool calls, outputs, final diff, and final report. If the workflow has external side effects, the need for this evidence is even stronger.
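A minimal sketch of such a record, assuming JSON on disk is good enough: the `RunRecord` fields follow the list above, while the class name, the helper functions, and the sample values are hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunRecord:
    user_instruction: str
    selected_context: list[str]
    plan: list[str]
    tool_calls: list[dict]  # each call stored with its command and output
    final_diff: str
    final_report: str


def save_run(record: RunRecord, path: Path) -> None:
    path.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")


def load_run(path: Path) -> RunRecord:
    return RunRecord(**json.loads(path.read_text(encoding="utf-8")))


record = RunRecord(
    user_instruction="publish the new post",
    selected_context=["publishing checklist", "current site config"],
    plan=["build", "commit", "push", "verify"],
    tool_calls=[{"command": "build", "exit_code": 0, "output": "built"}],
    final_diff="+ posts/new-post.md",
    final_report="post published",
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.json"
    save_run(record, path)
    restored = load_run(path)

print(restored == record)  # prints: True
```

A plain, human-readable format is a deliberate choice here: the record is most useful when someone can open it weeks later without special tooling.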

When an agent sends something, publishes something, deletes something, or changes production state, "I think it did the right thing" is not a debugging strategy.

The Fix Is Usually A Smaller Working Set

After debugging enough agent failures, I noticed that many fixes make the agent's working set smaller.

Instead of giving it every project note, I give it the current checklist and a search path for more. Instead of relying on old memory, I make it read the current config. Instead of letting it infer file ownership, I define the files it is expected to edit. Instead of asking it to "handle the deployment," I break the workflow into build, commit, push, and verification.
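Breaking "handle the deployment" into named steps might look like the sketch below. The `run_steps` helper is hypothetical, and the lambdas are stubs standing in for real build, commit, push, and verification commands.

```python
from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, bool]]:
    """Run named steps in order and stop at the first failure."""
    results = []
    for name, action in steps:
        ok = action()
        results.append((name, ok))
        if not ok:
            break
    return results


# Stub actions: here the push fails, so verification never runs.
results = run_steps([
    ("build", lambda: True),
    ("commit", lambda: True),
    ("push", lambda: False),
    ("verify", lambda: True),
])
print(results)  # prints: [('build', True), ('commit', True), ('push', False)]
```

Each step leaves its own result behind, so a failed run points at one named step instead of at "the deployment".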

Smaller working sets reduce the number of plausible wrong answers. They also make tool traces easier to inspect. If an agent only needed five files and one API response, debugging is possible. If it had a giant bag of context from months of work, the source of a bad assumption can disappear into noise.

This is not about making agents less capable. It is about making the current task legible. Humans work this way too. When I debug production code, I do not keep the entire company history in my head. I narrow the problem until the important evidence fits in view.

My Practical Debugging Checklist

When an AI agent behaves incorrectly, I usually walk through the same checklist.

What was the exact user instruction, and did the agent restate the scope correctly?
Which context sources were loaded before the first plan?
Which tool call first introduced the wrong belief?
Was the tool output complete, current, and interpreted correctly?
Did memory influence the decision, and was that memory verified?
Which assumptions were made silently?
Was there a cheap verification step the agent skipped?
Did the final report match actual state, not just intended state?

This checklist is intentionally boring. That is the point. Agent debugging becomes much less mysterious when I stop treating the model as a black box and start treating the whole workflow as a system with inputs, beliefs, actions, and evidence.

Trust Comes From Traceability

I do not trust an agent because it sounds confident. I trust it when I can trace how it got there.

If the answer is correct, I want to know why. If the answer is wrong, I want to know where the wrongness entered. If the workflow succeeded, I want evidence. If it failed, I want a failure that is specific enough to fix.

That is the standard I keep coming back to. An AI agent does not become reliable by being impressive in a demo. It becomes reliable when its work can be inspected, questioned, replayed, and improved.

The best agents I have built are not the ones that never make mistakes. They are the ones that make mistakes in ways I can understand. Once I can understand the failure, I can improve the prompt, the memory, the tool contract, the context selection, or the verification step.

Debugging AI agents is still debugging. The difference is that the variable I care about most is often not a value in memory. It is a belief in context.

If you cannot trace where the agent's belief came from, you cannot tell whether its answer was engineered or guessed.

Igor Gawrys

AI Engineer & IT Consultant · Katowice, Poland
