June 2026 · AI Engineering · 12 min read

Building automation that survives production

The hard part of automation is not making a script work once. The hard part is building a workflow that keeps behaving when APIs are slow, data is weird, humans are busy, and nobody is watching the terminal.

The First Version Worked, Which Was The Problem

The most dangerous automation I have ever built was not broken. It worked on the first realistic run.

That sounds like a strange complaint, but it is exactly where a lot of automation projects start going wrong. A script works once, the output looks useful, and suddenly everyone starts treating it like a system. Nobody has asked what happens when an API times out halfway through the batch. Nobody has checked whether the job can be resumed. Nobody knows whether the same record will be processed twice if the server restarts at the wrong moment.

I have made that mistake enough times to respect it now.

The first version of an automation is usually built around the happy path. It assumes the input is roughly correct, the network is roughly available, the credentials are still valid, the third-party service responds the way the docs promised, and the person running it has enough attention left to notice if something looks odd.

That is fine for a prototype. It is not fine for production.

Production automation has a different job. It has to survive boredom, drift, retries, partial failure, stale state, duplicate runs, forgotten assumptions, and the quiet reality that humans will eventually stop watching it closely. If it only works while someone is actively babysitting it, it is not automation yet. It is a manual process wearing a shell script.

Automation Is A Promise About Attention

When I automate something, I am not only saving time. I am making a promise about attention.

I am saying that this workflow no longer deserves the same level of manual focus every time it runs. The machine can handle the repetition. The human can move up a level and look at decisions, exceptions, and outcomes.

That promise is powerful, but it has to be earned. If the automation is fragile, it does not remove attention. It steals attention unpredictably. Instead of doing a task manually at a known time, you now wait for a mysterious background process to fail at the least convenient moment.

This is why I think about production automation less as a clever script and more as a contract. The system must be honest about what it did, what it skipped, what it could not prove, and what it needs from a human. It should reduce routine work without hiding uncertainty.

The difference is subtle, but it changes the design. A fragile automation tries to complete the workflow. A mature automation tries to make the workflow observable, recoverable, and safe to repeat.

The Basic Shape: Input, State, Action, Evidence

Most useful automations I build eventually reduce to four things: input, state, action, and evidence.

Input is what the system believes it should process. That might be leads from a search API, unread emails, calendar events, invoices, job offers, social posts, support tickets, or records from an internal system.

State is what the automation remembers between runs. Which items were already processed? Which ones failed? Which ones are waiting for a retry? Which ones were ignored intentionally? Which external IDs map to which local records?

Action is the side effect. Send an email. Generate a page. Create a ticket. Update a spreadsheet. Publish a post. Call an API. Write a file. This is the part people usually focus on because it feels like the automation is doing real work.

Evidence is the proof that the action happened correctly enough to trust. Logs, timestamps, output files, response IDs, commit hashes, generated URLs, checksums, screenshots, or status messages. Without evidence, the automation has only vibes.

When an automation is small, people often skip state and evidence. They wire input directly to action. That is how you get scripts that feel fast in development and terrifying in production.

Idempotency Is The Difference Between Retry And Damage

If I could force one word into every automation discussion, it would be idempotency.

A workflow is idempotent when running it again does not create accidental duplicate damage. If a job crashes after processing 80 of 100 items, I should be able to restart it without sending the first 80 emails again, charging the same customer twice, publishing duplicate posts, or creating a second version of the same artifact with a slightly different name.

This sounds obvious until you inspect real scripts. A surprising number of them assume they will run once, perfectly, from beginning to end.

That assumption collapses the first time a network call fails, a token expires, a deployment restarts, or the operator hits the command twice because nothing printed for thirty seconds. If the automation cannot tolerate repetition, every retry becomes a risk decision.

I prefer to design retries as a normal path, not an emergency path. That means stable IDs. It means checking whether an output already exists. It means recording external response IDs. It means making actions conditional on state, not just on the presence of input. It means being able to answer: "If this runs again, what exactly will happen?"

Good idempotency makes automation calmer. You stop treating interruption as a disaster and start treating it as a recoverable state.
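A minimal sketch of that idea in Python, assuming a local SQLite table is an acceptable place to remember what has already been done. The names `idempotency_key`, `send_once`, the `sent` table, and the `send` callable are all hypothetical and stand in for whatever side effect the workflow performs.

```python
import hashlib
import sqlite3


def idempotency_key(recipient: str, template: str) -> str:
    """Derive a stable ID from what the action is, not from when it ran."""
    return hashlib.sha256(f"{recipient}|{template}".encode()).hexdigest()


def send_once(db: sqlite3.Connection, recipient: str, template: str, send) -> bool:
    """Call send() only if this exact action has not been recorded before.

    Returns True if the action ran, False if it was skipped as a duplicate.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS sent (key TEXT PRIMARY KEY, response_id TEXT)"
    )
    key = idempotency_key(recipient, template)
    if db.execute("SELECT 1 FROM sent WHERE key = ?", (key,)).fetchone():
        return False  # already recorded by an earlier run
    response_id = send(recipient, template)  # the real side effect
    # Keep the external response ID as evidence that the action happened.
    db.execute(
        "INSERT INTO sent (key, response_id) VALUES (?, ?)", (key, response_id)
    )
    db.commit()
    return True
```

Running the job a second time then skips everything already recorded. A crash between the send and the commit can still produce one duplicate, so this narrows the window rather than closing it; closing it usually needs an idempotency key that the external provider itself honors.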

Partial Failure Is The Default, Not The Edge Case

One mental shift helped me a lot: partial failure is not exceptional. It is the normal condition of any workflow that touches enough moving parts.

The database may be fine while the email provider is down. The search API may return data, but enrichment may fail for half of it. The generation step may succeed, but the upload may time out. A commit may be created locally, but the push may fail because the remote has a certificate issue. A build may pass, but deployment may lag behind.

If the system has ten steps, "all ten succeeded instantly" is only one possible outcome. Production design has to care about the other outcomes too.

That is why I dislike automations that only print "done" or "failed." They flatten reality. I want to know which items succeeded, which failed permanently, which failed temporarily, which were skipped by policy, and which require human review. A batch-level status is rarely enough.

When the system records item-level outcomes, recovery becomes practical. You can retry only the failed records. You can inspect patterns. You can avoid making a human reread everything just to find the two rows that matter.
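The outcome categories above can be sketched as a small data model. This is an illustration only: `Outcome`, `ItemResult`, `summarize`, and `retryable` are hypothetical names, and a real workflow may need more or fewer categories.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED_TEMPORARY = "failed_temporary"    # worth retrying later
    FAILED_PERMANENT = "failed_permanent"    # retrying will not help
    SKIPPED_BY_POLICY = "skipped_by_policy"
    NEEDS_REVIEW = "needs_review"


@dataclass
class ItemResult:
    item_id: str
    outcome: Outcome
    detail: str = ""


def summarize(results: list[ItemResult]) -> dict[str, int]:
    """Count items per outcome, so the batch status is more than done/failed."""
    return dict(Counter(r.outcome.value for r in results))


def retryable(results: list[ItemResult]) -> list[str]:
    """Return only the item IDs that are worth sending through again."""
    return [r.item_id for r in results if r.outcome is Outcome.FAILED_TEMPORARY]
```

With results stored per item, the next run can be fed `retryable(results)` instead of the full input list.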

Queues Beat Loops Once The Work Matters

A simple loop is seductive. Fetch items, iterate, process, exit. For prototypes, that is usually enough.

But once the work matters, I often want queue semantics even if I do not introduce a full queue server immediately. I want each unit of work to have a status, an attempt count, a last error, a next retry time, and a clear final state.

This changes the whole operational feel of the automation.

A loop says, "I tried to process a list." A queue says, "Here is the lifecycle of every item." That lifecycle is what lets the system survive restarts, backoff, rate limits, and manual intervention.

Not every project needs RabbitMQ, Redis, SQS, or a formal worker system. Sometimes a database table, a JSON state file, or a disciplined local queue is enough. The important part is not the technology. The important part is treating work as durable units instead of temporary loop iterations.
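As a rough sketch of the "database table" option, here is a tiny durable queue on SQLite. The schema, the `MAX_ATTEMPTS` limit, the fixed retry delay, and the function names are assumptions made for illustration, not a recommendation for any particular workload.

```python
import sqlite3
import time

MAX_ATTEMPTS = 3  # assumed limit before an item is marked as finally failed


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS work (
               item_id TEXT PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0,
               last_error TEXT,
               next_retry_at REAL NOT NULL DEFAULT 0
           )"""
    )
    return db


def enqueue(db: sqlite3.Connection, item_id: str) -> None:
    # INSERT OR IGNORE keeps enqueueing safe to repeat for the same item.
    db.execute("INSERT OR IGNORE INTO work (item_id) VALUES (?)", (item_id,))
    db.commit()


def run_due(db, handler, now=None, retry_delay=60.0) -> None:
    """Process every pending item whose retry time has arrived."""
    now = time.time() if now is None else now
    due = db.execute(
        "SELECT item_id, attempts FROM work "
        "WHERE status = 'pending' AND next_retry_at <= ? ORDER BY item_id",
        (now,),
    ).fetchall()
    for item_id, attempts in due:
        attempts += 1
        try:
            handler(item_id)
            db.execute(
                "UPDATE work SET status = 'done', attempts = ?, last_error = NULL "
                "WHERE item_id = ?",
                (attempts, item_id),
            )
        except Exception as exc:
            status = "failed" if attempts >= MAX_ATTEMPTS else "pending"
            db.execute(
                "UPDATE work SET status = ?, attempts = ?, last_error = ?, "
                "next_retry_at = ? WHERE item_id = ?",
                (status, attempts, str(exc), now + retry_delay, item_id),
            )
        db.commit()  # commit per item so a restart loses at most one update
```

Because the lifecycle lives in the table rather than in a loop variable, a restarted process simply calls `run_due` again and picks up whatever is still pending.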

Rate Limits Are Product Requirements

Engineers sometimes talk about rate limits like annoying infrastructure trivia. I think that is the wrong framing.

Rate limits are product requirements because they shape what the automation can responsibly promise. If an API allows 100 requests per minute, a workflow that needs 10,000 enrichments cannot behave like a one-shot command. It needs pacing, batching, persistence, and expectation management.

This matters especially with automations that cross external services: outreach, publishing, data enrichment, monitoring, AI generation, and anything involving email or social platforms. The technical limit is only one layer. Reputation, deliverability, and account safety are often stricter than the documented API limit.

I have learned to build pacing early. A system should know when to wait. It should know when to stop. It should know when a retry is helpful and when it is just hammering a service that already told you to slow down.

Backoff is not just a defensive coding pattern. It is a sign that the automation understands it is sharing space with other systems.
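One common way to implement this is exponential backoff with full jitter, sketched below. `RateLimited` is a hypothetical exception that a client wrapper would raise when the service asks the caller to slow down; the default delays and attempt count are placeholders, not tuned values.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when the service says to slow down."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the service, if any


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), waiting longer after each rate-limit error, then give up."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller record the failure
            if exc.retry_after is not None:
                delay = exc.retry_after  # the service told us how long to wait
            else:
                # Full jitter: a random wait up to an exponentially growing cap.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter exists so the pacing can be tested without actually waiting. Honoring the service's own hint before falling back to a computed delay is the part that keeps a retry from turning into hammering.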

Logs Are For Future Me Under Pressure

I do not write logs for the version of me who has the whole architecture in short-term memory. I write logs for future me, tired, interrupted, and trying to understand a failure from three days ago.

That future version does not need poetic logging. He needs direct answers.

What job ran? What input did it use? What version of the code was active? How many items were considered? How many were processed? Which external calls failed? What did the provider return? Was this a retry? Was anything skipped intentionally? Where is the generated artifact?

A good log line reduces the number of files I need to open. A bad log line creates another mystery.

I also care about final summaries. At the end of a run, the automation should report the outcome in human terms. Not a stream of technical noise, but a concise status: created this, updated that, skipped these, failed here, next action is this. That summary often becomes the difference between trust and suspicion.
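A small sketch of both halves, using only the standard library: one structured log line per event, and one human-readable summary at the end. The field names, the `automation` logger name, and the summary wording are assumptions for illustration.

```python
import json
import logging

logger = logging.getLogger("automation")


def log_event(event: str, **fields) -> str:
    """Emit one JSON line per event so a failure can be searched later."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


def final_summary(job: str, created: int, updated: int, skipped: int,
                  failed: int, next_action: str) -> str:
    """Describe the run in human terms, ending with what should happen next."""
    return (
        f"{job}: created {created}, updated {updated}, "
        f"skipped {skipped}, failed {failed}. Next action: {next_action}"
    )
```

For example, `log_event("run_started", job="sync", code_version="abc123", is_retry=False)` answers several of the questions above in a single line, and the string from `final_summary` is what would go into a chat message or an email.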

Silent Success Is Still A Design Choice

One of the harder judgment calls is deciding when the automation should notify a human.

If it speaks too often, people ignore it. If it stays quiet too aggressively, important failures age in the dark. The right answer depends on the workflow, but I usually separate outcomes into three classes.

Routine success can be quiet or logged somewhere passive. Expected minor issues can be summarized without urgency. Anything that changes business outcome, blocks the next run, risks external side effects, or needs a human decision should notify clearly.

This is a product design problem, not just an engineering problem. A notification is an interruption. The automation should spend interruptions carefully.

I like systems that can say, "I handled it," "I could not handle it, but it can wait," or "I need you now." Those are very different messages. Mixing them together is how alerts become background noise.
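The three messages can be made explicit in code so they never get mixed. In this sketch the `Urgency` enum and the boolean inputs to `classify` are hypothetical; how a real workflow decides each flag is the hard part and is not shown here.

```python
from enum import Enum


class Urgency(Enum):
    HANDLED = "I handled it"
    CAN_WAIT = "I could not handle it, but it can wait"
    NEED_YOU_NOW = "I need you now"


def classify(*, changes_business_outcome: bool = False,
             blocks_next_run: bool = False,
             risks_external_side_effects: bool = False,
             needs_human_decision: bool = False,
             had_minor_issues: bool = False) -> Urgency:
    """Map what happened in a run onto one of three notification levels."""
    if (changes_business_outcome or blocks_next_run
            or risks_external_side_effects or needs_human_decision):
        return Urgency.NEED_YOU_NOW
    if had_minor_issues:
        return Urgency.CAN_WAIT
    return Urgency.HANDLED
```

Only `NEED_YOU_NOW` would interrupt someone; `CAN_WAIT` could go into a daily digest, and `HANDLED` into a passive log.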

Human Approval Belongs At The Boundary Of Trust

Not every automation should be fully autonomous. Some actions deserve a human checkpoint.

For me, the boundary usually appears where a system crosses into public, financial, legal, reputational, or irreversible territory. Drafting an email is one kind of action. Sending it is another. Generating a post is one kind of action. Publishing it under someone's name is another. Preparing a data cleanup plan is one thing. Deleting production records is very much another.

The point is not to slow everything down. The point is to put friction where judgment still matters.

A good automation can still do most of the work before approval. It can gather context, prepare the artifact, run validation, show a diff, explain risks, and make the final decision cheap. But the final side effect should match the trust level of the system.

This is also where UI and workflow design matter. If approval requires reading ten files and reconstructing context manually, people will approve lazily. If the automation presents a clear summary, a diff, and the exact effect of saying yes, review becomes real.
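A minimal sketch of such a checkpoint, assuming the artifact is text and the approver is some callable that shows the review to a person and returns their answer. `build_review`, `run_with_approval`, and the `approve` callable are hypothetical names.

```python
import difflib


def build_review(before: str, after: str, effect: str) -> str:
    """Prepare what the approver sees: the exact effect plus a diff."""
    diff = "\n".join(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="current",
            tofile="proposed",
            lineterm="",
        )
    )
    return f"Effect of approving: {effect}\n\n{diff}"


def run_with_approval(review: str, approve, action) -> str:
    """Run the side effect only after the approver has said yes."""
    if not approve(review):
        return "held"  # nothing irreversible has happened
    action()
    return "applied"
```

All the preparation happens before the gate, so the human decision is cheap; the side effect itself sits entirely behind it.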

Configuration Drift Is A Slow Failure Mode

Some automation failures do not happen dramatically. They happen because the world changes slowly around a script.

An API changes a response field. A token expires. A deployment target moves. A domain changes SSL behavior. A file path is renamed. A business rule changes, but the cron job keeps running the old assumption. Nobody notices because the job still exits with code zero.

This kind of drift is frustrating because the automation looks alive while becoming less correct.

I try to defend against it with explicit checks. Validate required environment variables. Check that external endpoints return the expected shape. Fail loudly when a required field disappears. Keep configuration close to the workflow or document where it lives. Include sanity checks that compare the output against basic expectations.

Automation should not be optimistic about its environment. It should verify enough to know when the ground moved.
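A preflight check along these lines might look like the sketch below. The function names are hypothetical, the payload is assumed to be an already-parsed JSON object, and the required variable and field names would come from the workflow itself.

```python
import os


def missing_env(required, env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def missing_fields(payload: dict, required_fields):
    """Return the required fields absent from an API response payload."""
    return [field for field in required_fields if field not in payload]


def preflight(required_env, payload, required_fields, env=None) -> None:
    """Fail loudly, before any side effect, if the ground has moved."""
    problems = []
    for name in missing_env(required_env, env):
        problems.append(f"missing environment variable {name}")
    for field in missing_fields(payload, required_fields):
        problems.append(f"response is missing field {field!r}")
    if problems:
        raise RuntimeError("preflight failed: " + "; ".join(problems))
```

Raising here makes the job exit non-zero instead of quietly producing less correct output.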

The Best Automations Are Boring To Operate

There is a kind of automation that feels impressive in a demo and exhausting in production. It has clever prompts, many integrations, dramatic output, and no clear recovery story.

I prefer the boring kind.

The boring kind has stable names. It records state. It can be resumed. It has small units of work. It gives useful summaries. It fails in understandable ways. It does not make public changes without the right level of confidence. It can be debugged without archaeology.

That boringness is not a lack of ambition. It is what lets the system carry real responsibility.

When automation is boring to operate, humans start trusting it with bigger workflows. When it is exciting every time it runs, that is usually a warning sign.

What I Check Before I Trust A Workflow

Before I let an automation run regularly, I ask a few uncomfortable questions.

Can I run it twice without duplicate side effects?
Can it resume after failing halfway through?
Does it record enough evidence to prove what happened?
Does it distinguish temporary failures from permanent ones?
Does it know when to slow down?
Does it notify a human only when that notification is useful?
Can someone else understand the failure without reading the whole codebase?

If the answer to those questions is mostly no, I may still run the script manually. But I will not pretend it is production automation. That distinction saves pain.

The Lesson I Keep Relearning

The first working version of an automation is usually about capability. Can this be done by software at all?

The production version is about responsibility. Can this be done repeatedly, safely, observably, and with the right level of human control?

That second question is less glamorous, but it is where most of the engineering value lives. Anyone can wire a few APIs together on a good afternoon. The harder work is making sure the workflow still makes sense after the fifth failure, the third retry, the stale token, the weird input, and the day when nobody has time to inspect every line.

Automation should give attention back to humans. To do that, it has to be more than clever. It has to be durable.

The hard part is not making automation run. The hard part is making it safe to stop watching.

Igor Gawrys

AI Engineer & IT Consultant · Katowice, Poland
