June 2026 · AI Engineering · 12 min read

Keeping AI agents maintainable after the demo

The demo is usually the easiest part of an AI agent project. The harder work starts when the agent has users, tool access, schedules, memory, logs, failures, and a growing list of edge cases that nobody mentioned during the prototype.

The Demo Was Not The System

Most AI agent demos are intentionally small.

A user asks a question. The agent reads a document, calls a tool, writes a response, maybe updates a record, and everyone in the room can see the promise immediately. The interface feels light. The workflow feels close to magic. The distance between idea and result is short enough that it is easy to believe the hard part is already done.

Then the agent meets production.

Production is not impressed by the demo. Production has expired credentials, ambiguous tasks, partial outages, duplicated records, missing permissions, slow APIs, changed schemas, old prompts, overloaded schedules, users who phrase requests differently, and logs that only make sense if you already know what happened. A prototype can survive on novelty. A production agent has to survive ordinary operational pressure.

This is where the work becomes less glamorous and much more important. The real question is not whether an agent can complete one workflow when everything is prepared for it. The real question is whether the agent can remain understandable, debuggable, and changeable after it becomes part of daily work.

I have become more interested in that second question than in the first one.

A good agent demo proves that a capability exists. A maintainable agent proves that the capability can be trusted repeatedly. That difference changes how I design prompts, tools, memory, permissions, dashboards, and deployment habits. It makes me suspicious of anything that works only because the context is fresh in the builder's head.

If the system needs the original author nearby to explain every strange behavior, it is not mature yet. It may be useful. It may even be impressive. But it is still fragile.

Agents Accumulate Invisible State

Traditional software has state too, but we have decades of habits for containing it. Databases, migrations, queues, caches, sessions, feature flags, and configuration files are not always pleasant, but at least we know where to look.

AI agents introduce a different kind of state. Some of it is obvious: memory files, conversation history, scheduled jobs, tool credentials, vector stores, drafts, intermediate artifacts, and saved preferences. Some of it is less obvious: assumptions inside prompts, examples embedded in instructions, old decisions preserved in summaries, tool descriptions that no longer match reality, and workflow conventions that live only in a chat thread.

That invisible state is where many agent systems become hard to maintain.

An agent may behave differently today because a memory changed yesterday. It may choose the wrong tool because the tool description is too broad. It may continue following an old rule because nobody removed it from the instruction stack. It may skip a useful check because a previous conversation made the happy path feel more common than it really is.

When people say an agent is unreliable, they often mean the model is unreliable. Sometimes that is true. But often the system around the model is carrying too much untracked state.

I try to make that state visible. If a rule matters, it should live in a file or configuration area that can be reviewed. If a memory changes behavior, it should be possible to inspect it. If a scheduled task depends on a prompt, the prompt should be versioned or at least discoverable. If a tool can perform an external action, its boundary should be explicit.
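
As a rough illustration of making that state discoverable, a run could record which prompt version and which memory files it depended on. This is only a sketch: the function names, the job name, and the file paths are hypothetical, and hashing the prompt text stands in for whatever versioning a team already uses.

```python
import hashlib


def prompt_fingerprint(prompt_text: str) -> str:
    """Return a short, stable identifier for one exact version of a prompt."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def describe_run(job_name: str, prompt_text: str, memory_files: list[str]) -> dict:
    """Summarize the state a run depended on so it can be inspected later."""
    return {
        "job": job_name,
        "prompt_version": prompt_fingerprint(prompt_text),
        "memory_files": sorted(memory_files),
    }


# Hypothetical job name and file names, for illustration only.
manifest = describe_run(
    "weekly-report",
    "You summarize open tickets for the support team.",
    ["memory/preferences.md", "memory/clients.md"],
)
```

Storing a manifest like this next to each run makes it possible to ask whether behavior changed because the prompt changed, because a memory file changed, or because neither did.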

Agents do not become maintainable because they are clever. They become maintainable because their context is organized.

Prompts Need Ownership

A production prompt is not just text. It is part of the system contract.

It defines what the agent believes its job is, what it should refuse, which sources it should trust, how it should handle uncertainty, when it should ask for approval, and which style of output is acceptable. In many agent systems, the prompt quietly becomes a policy engine, a UX layer, a routing table, and an operations manual at the same time.

That can be useful, but it can also become messy fast.

The worst prompt files I have seen are not bad because they are long. They are bad because nobody can tell which parts are current, which parts are experiments, which parts were added to fix one incident, and which parts are now contradicted by newer rules. The prompt becomes a pile of scars.

I prefer treating prompts like code with a product owner. Someone should know why a rule exists. Changes should be small enough to review. Instructions should be grouped by responsibility instead of appended wherever there is space. Old constraints should be removed when the system changes. Repeated failures should become clearer instructions, better tools, or better validation, not just another sentence at the bottom of the prompt.

This does not mean every prompt needs an enterprise process. It means the team should respect that prompt changes can break behavior just like code changes can.
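
One lightweight way to get there, sketched here under assumptions, is to keep the prompt as a list of sections that each carry an owner and a reason. The `PromptSection` type, the `build_prompt` helper, and the example rules are all hypothetical; the point is only that ownership metadata lives beside the text and never reaches the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSection:
    responsibility: str  # what this block governs, e.g. "scope" or "approvals"
    owner: str           # who can explain and change it
    reason: str          # why the rule exists (incident, policy, product decision)
    text: str            # the instruction text the model actually sees


def build_prompt(sections: list[PromptSection]) -> str:
    """Join sections into one prompt, rejecting unowned or duplicated ones."""
    seen = set()
    parts = []
    for section in sections:
        if not section.owner.strip() or not section.reason.strip():
            raise ValueError(f"section '{section.responsibility}' needs an owner and a reason")
        if section.responsibility in seen:
            raise ValueError(f"duplicate section '{section.responsibility}'")
        seen.add(section.responsibility)
        parts.append(f"[{section.responsibility}]\n{section.text.strip()}")
    return "\n\n".join(parts)


# Hypothetical sections, for illustration only.
prompt = build_prompt([
    PromptSection("scope", "support-lead",
                  "agent answered billing questions it could not verify",
                  "Only answer questions about order status."),
    PromptSection("approvals", "ops", "refunds are irreversible",
                  "Ask a human before issuing any refund."),
])
```

Because each section is a separate value, a review can show exactly which rule changed, and a rule with no remaining reason is an obvious candidate for removal.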

When an agent starts doing more useful work, the prompt deserves more discipline. Otherwise the most important behavior in the system becomes the least reviewed artifact.

Tool Boundaries Matter More Than Tool Count

It is tempting to give an agent every tool it might ever need.

Read email. Write email. Search files. Update the CRM. Create tickets. Read calendars. Push code. Deploy sites. Generate images. Send messages. The list grows quickly because each tool unlocks a more impressive workflow.

But maintainability usually improves when tool access is narrower and clearer.

A tool should have a job that is easy to explain. Its inputs should be structured enough that the agent cannot accidentally smuggle an ambiguous instruction into a free-text field. Its output should be readable enough that the agent can make a good next decision. Its side effects should be obvious. If a tool can write externally, the approval model should be deliberate.

Too many agents are built around powerful tools with vague boundaries. The agent is told to "handle the task" and the tool accepts a flexible payload. That feels convenient until something goes wrong. Then the team has to reconstruct whether the failure came from the model's interpretation, the tool's implementation, missing context, bad input validation, or a misunderstood side effect.

I like tools that make the agent boring in the right places. The model should reason about the work, but the tool should enforce the mechanical contract. If a date must be ISO formatted, the tool should validate it. If an ID must come from a previous lookup, the tool should require that ID. If a destructive action needs approval, the tool should make that impossible to skip by accident.
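
The three checks in that paragraph can be sketched in a few lines. Everything named here is an assumption made for illustration: `CancelBookingTool`, its parameters, and the two exception types do not come from any particular framework, and a real tool would perform the side effect where the sketch only returns a description of it.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


class ToolInputError(ValueError):
    """The call broke the tool's mechanical contract."""


class ApprovalRequired(PermissionError):
    """A destructive action was attempted without a named approver."""


@dataclass
class CancelBookingTool:
    # IDs returned by an earlier lookup in the same run.
    looked_up_ids: set = field(default_factory=set)

    def record_lookup(self, booking_id: str) -> None:
        self.looked_up_ids.add(booking_id)

    def run(self, booking_id: str, effective_date: str,
            approved_by: Optional[str] = None) -> dict:
        if booking_id not in self.looked_up_ids:
            raise ToolInputError("booking_id must come from a previous lookup")
        try:
            parsed = date.fromisoformat(effective_date)
        except ValueError as exc:
            raise ToolInputError("effective_date must be an ISO date, e.g. 2026-06-01") from exc
        if not approved_by:
            raise ApprovalRequired("cancellation needs a named approver")
        # A real tool would perform the side effect here; this sketch only
        # returns a description of what it would do.
        return {"action": "cancel_booking", "booking_id": booking_id,
                "effective_date": parsed.isoformat(), "approved_by": approved_by}
```

The model can still decide whether a cancellation is the right next step. What it cannot do is invent an ID, pass a loosely formatted date, or skip the approval by leaving a field out.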

The maintainable agent is not the one with the largest toolbox. It is the one where each tool has a boundary the team can trust.

Logs Are The Debugger

Debugging agents without logs is mostly archaeology.

A user says the agent did something wrong. The final answer is visible, but the path is not. Which instruction mattered? Which tool was called? What did the tool return? Did the agent ignore a warning? Did it use stale memory? Did it misunderstand the request, or did the tool give it incomplete data?

If the system cannot answer those questions, every incident becomes slower than it needs to be.

For agent work, useful logs are not just backend traces. I want to see the task, the relevant instructions, the tool calls, the important tool outputs, the decisions, the approvals, and the final user-visible result. I want timestamps because latency changes behavior. I want correlation IDs because one agent action can trigger another. I want enough redaction that logs are safe to inspect without leaking private data everywhere.

The goal is not to store every token forever. The goal is to preserve the decision path.
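
A minimal sketch of such a trace record, assuming JSON lines as the log format, might look like the following. The field names and the `trace_event` helper are hypothetical, and the email pattern is a deliberately simple stand-in for real redaction rules.

```python
import json
import re
from datetime import datetime, timezone

# Simplified pattern for illustration; real redaction needs broader rules.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(value: str) -> str:
    """Replace email addresses so traces are safer to share."""
    return EMAIL_PATTERN.sub("[redacted-email]", value)


def trace_event(correlation_id: str, step: str, detail: dict) -> str:
    """Build one JSON log line for a single step on the decision path."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,  # ties together actions from one task
        "step": step,                      # e.g. "tool_call", "approval", "final"
        "detail": {
            key: redact(value) if isinstance(value, str) else value
            for key, value in detail.items()
        },
    }
    return json.dumps(event, sort_keys=True)


# Hypothetical tool name and recipient, for illustration only.
line = trace_event("run-42", "tool_call", {"tool": "send_email", "to": "anna@example.com"})
```

Emitting one line per step with a shared correlation ID is enough to reconstruct the order of instructions, tool calls, and approvals for a single task without keeping the full token stream.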

When a normal web request fails, a stack trace may point directly to the broken line. When an agent fails, the broken line may be an instruction, an assumption, a missing tool constraint, a bad memory, a stale external system, or a model choice. Logs need to make those layers visible.

I also care about logs for positive cases. When an agent handles a workflow correctly, that trace becomes an example of desired behavior. It helps future debugging, prompt tuning, onboarding, and regression testing. Good traces are not just incident artifacts. They are documentation written by the system while it works.

Memory Should Be Designed, Not Hoarded

Agent memory sounds harmless until it starts influencing decisions.

Remembering preferences, project facts, workflow conventions, and repeated decisions can make an agent much more useful. It removes friction. It lets the system pick up context without asking the same questions every week. It makes the agent feel less like a stateless form and more like a working assistant.

But memory also creates maintenance risk.

Old memories can become false. Personal preferences can leak into professional contexts. Temporary decisions can look permanent. A note written for one project can affect another. A memory can be too broad, too vague, or too confident. If nobody reviews it, the agent slowly builds a private model of the world that may not match reality.

I think memory needs the same discipline as any other stateful feature. It should be easy to inspect. It should be possible to edit. Sensitive information should not be stored casually. Short-term notes and long-term facts should not be mixed together. The agent should know when a memory is evidence and when it is only a hint.

One pattern I like is separating raw daily notes from curated long-term memory. Raw notes can capture what happened. Curated memory can keep only what is still useful. That gives the system continuity without turning every passing detail into a permanent instruction.
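
That separation can be sketched as two lists with an explicit promotion step between them. The `MemoryEntry` and `MemoryStore` names, the `kind` and `expires_on` fields, and the review requirement are assumptions added for illustration, not a description of any specific memory system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class MemoryEntry:
    text: str
    scope: str                        # project or context the memory belongs to
    recorded_on: date
    kind: str = "hint"                # "hint" or "evidence"
    expires_on: Optional[date] = None


@dataclass
class MemoryStore:
    raw_notes: list[MemoryEntry] = field(default_factory=list)
    curated: list[MemoryEntry] = field(default_factory=list)

    def note(self, entry: MemoryEntry) -> None:
        """Capture what happened; raw notes never influence recall directly."""
        self.raw_notes.append(entry)

    def promote(self, entry: MemoryEntry, reviewed_by: str) -> None:
        """Move a note into long-term memory only after someone reviews it."""
        if not reviewed_by:
            raise ValueError("promotion to curated memory requires a reviewer")
        self.curated.append(entry)

    def recall(self, scope: str, today: date) -> list[MemoryEntry]:
        """Return curated, unexpired memories for one scope only."""
        return [
            entry for entry in self.curated
            if entry.scope == scope
            and (entry.expires_on is None or entry.expires_on >= today)
        ]
```

Scoping recall to one project keeps a note written for one context from leaking into another, and an expiry date gives temporary decisions a way to stop looking permanent.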

A maintainable agent remembers enough to be helpful and forgets enough to stay clean.

Schedules Turn Agents Into Operations

An agent that only responds when called is one thing. An agent that runs on a schedule is another.

Scheduled agents create operational responsibility. They may publish content, check inboxes, generate reports, monitor systems, triage leads, update dashboards, or prepare daily summaries. Once a job runs every morning or every Friday, it becomes part of the business rhythm. Failure is no longer just an inconvenience. It can mean missed work.

That changes the engineering bar.

A scheduled agent needs a clear owner, a clear expected output, a retry policy, a way to detect failure, and a way to avoid duplicate side effects. If it sends something externally, it needs an approval boundary or a very clear policy. If it depends on an external API, it needs to handle rate limits and expired credentials. If it updates content, it needs to commit only the relevant files and avoid sweeping unrelated changes into the same operation.

Schedules also need quiet behavior. If nothing meaningful changed, the agent should not create noise. If something important failed, it should report the failure with enough detail to act. A scheduled agent that cries wolf every day will be ignored. A scheduled agent that fails silently will not be trusted.
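
The duplicate-avoidance, retry, and quiet-behavior requirements can be combined in one small wrapper. This is a sketch under assumptions: `ScheduledJob`, its status strings, and the in-memory bookkeeping are hypothetical, and a real job would persist the completed keys somewhere durable.

```python
import hashlib
from typing import Callable


class ScheduledJob:
    def __init__(self, name: str, produce: Callable[[], str],
                 deliver: Callable[[str], None], max_attempts: int = 3):
        self.name = name
        self.produce = produce        # read-only step, safe to retry
        self.deliver = deliver        # external side effect, never retried here
        self.max_attempts = max_attempts
        self.completed: set = set()   # in-memory here; persist this in practice
        self.last_digest = None

    def run(self, period: str) -> str:
        key = f"{self.name}:{period}"  # idempotency key: one delivery per period
        if key in self.completed:
            return "skipped-duplicate"
        last_error = None
        for _ in range(self.max_attempts):
            try:
                content = self.produce()
                break
            except Exception as exc:   # broad on purpose in this sketch
                last_error = exc
        else:
            # Every attempt failed: report the cause instead of failing silently.
            return f"failed: {last_error}"
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest == self.last_digest:
            self.completed.add(key)
            return "quiet-no-change"   # nothing new, so create no noise
        self.deliver(content)
        self.completed.add(key)
        self.last_digest = digest
        return "delivered"
```

Only the read-only step is retried. Retrying the delivery step would risk exactly the duplicate side effects the idempotency key is meant to prevent.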

This is where agent engineering starts to look like ordinary operations, and that is a good thing. Cron jobs, queues, alerts, idempotency, audit logs, and ownership are not old-fashioned. They are the boring parts that keep automation useful after the novelty fades.

Regression Testing Is Different But Still Necessary

Testing AI agents can feel awkward because the output is not always deterministic.

That does not mean testing is optional. It means the tests have to focus on the right contracts.

For some workflows, exact text matters less than behavior. Did the agent choose the correct tool? Did it ask for approval before an external action? Did it refuse a request outside scope? Did it preserve required fields? Did it use the current source of truth instead of memory? Did it generate an output with the necessary structure? Did it avoid mentioning private names that should stay anonymized?

Those are testable expectations.

I like building small regression cases from real failures. When an agent makes a bad assumption, misses a boundary, or formats something incorrectly, that case should become a fixture. The goal is not to freeze the agent's personality. The goal is to prevent known bad behavior from returning quietly.
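
A fixture of that kind can check behavior without comparing exact text. In the sketch below, `AgentTrace`, `RegressionCase`, and `check` are hypothetical names, and the trace is assumed to be something the system already records for each run.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    tool_calls: list[str]
    approvals_requested: list[str]
    final_text: str


@dataclass
class RegressionCase:
    name: str
    must_call: list[str] = field(default_factory=list)
    needs_approval: list[str] = field(default_factory=list)
    forbidden_phrases: list[str] = field(default_factory=list)


def check(case: RegressionCase, trace: AgentTrace) -> list[str]:
    """Return a list of contract violations; an empty list means the case passed."""
    failures = []
    for tool in case.must_call:
        if tool not in trace.tool_calls:
            failures.append(f"{case.name}: expected a call to {tool}")
    for tool in case.needs_approval:
        if tool in trace.tool_calls and tool not in trace.approvals_requested:
            failures.append(f"{case.name}: {tool} ran without an approval request")
    lowered = trace.final_text.lower()
    for phrase in case.forbidden_phrases:
        if phrase.lower() in lowered:
            failures.append(f"{case.name}: output mentions '{phrase}'")
    return failures


# Hypothetical case built from an imagined past failure.
case = RegressionCase(
    "refund-flow",
    must_call=["lookup_order"],
    needs_approval=["issue_refund"],
    forbidden_phrases=["Acme Corp"],
)
```

The wording of the final answer is free to vary from run to run. The case fails only when the agent skips the lookup, acts without approval, or names something it was supposed to keep anonymous.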

Some tests can be automated. Some need human review. Some are closer to checklists than unit tests. That is fine. The point is to create a habit of verification before trusting changes to prompts, tools, routes, schedules, or memory policy.

Without regression testing, every agent improvement is partly a guess. With even a small test suite, the team can move faster because it has a way to notice when a change improves one behavior and damages another.

Documentation Should Explain The Boundaries

Agent documentation often focuses on what the agent can do. That is useful, but incomplete.

The more important documentation explains boundaries. What is the agent allowed to do alone? What requires approval? Which files define its behavior? Which tools can create side effects? Where are logs stored? How does memory work? What should a human check before publishing or deploying? What is the fallback when a tool fails?

This kind of documentation does not need to be beautiful. It needs to be close to the work and updated when the work changes. A short, accurate file is better than a polished page that nobody trusts.

I also like documenting negative space: what the agent should not do. Do not send messages in group chats unless directly relevant. Do not publish under a person's name without approval. Do not mention protected client names. Do not add unrelated files to a commit. Do not use memory from private contexts in shared contexts.

These rules may sound obvious after an incident. They are less obvious before one.

Good documentation makes the agent easier to operate by someone who did not build it. That matters because maintainability is not only about code quality. It is about whether another person can understand the system well enough to change it responsibly.

Small Changes Beat Heroic Rewrites

When an agent system becomes messy, the temptation is to rebuild it.

Sometimes that is justified. But more often, maintainability improves through small, boring changes: splitting a long instruction file into clearer sections, renaming tools, tightening schemas, deleting old memory, adding a route to prerender, improving a log message, writing down an approval rule, adding one regression case, or moving a repeated workflow into a script.

Those changes do not look dramatic, but they compound.

Every visible boundary reduces the amount of context someone has to keep in their head. Every useful log reduces debugging time. Every narrow tool reduces accidental behavior. Every removed stale rule reduces contradiction. Every clean commit makes deployment easier to audit.

I have learned to respect this kind of work because agents are especially sensitive to hidden complexity. A small ambiguity in ordinary software might produce one bug. A small ambiguity in an agent can influence many future decisions because the model keeps interpreting it in new situations.

Maintenance is not what happens after agent engineering. It is agent engineering.

The Measure Is Change

The best test of maintainability is not the first launch. It is the third change.

Can the team update the prompt without breaking unrelated workflows? Can a new tool be added without making the agent overreach? Can a stale memory be removed without losing important context? Can a failed scheduled run be explained quickly? Can a new person understand why an approval boundary exists? Can the system be deployed without sweeping unrelated files into the release?

If those changes are painful, the agent is carrying too much hidden complexity.

I still enjoy the demo moment. It is satisfying when an agent performs a workflow that used to take manual effort. But I trust the system only after it survives change. That is when the design choices become visible. The logs, the memory policy, the tool boundaries, the approval rules, the documentation, and the tests either support the work or get in the way.

AI agents are not just prompts attached to tools. They are software systems with unusually flexible behavior. That flexibility is powerful, but it needs structure around it.

The demo shows what the agent can do once. Maintenance decides whether it can keep doing it when nobody is watching the happy path.

The hard part of agent engineering is not making the first workflow look intelligent. It is making the hundredth workflow understandable.

Igor Gawrys

AI Engineer & IT Consultant · Katowice, Poland
