
AI can write code quickly. That does not make the code safe, maintainable, or production-ready. Over time I built a testing workflow that treats AI output like a fast junior teammate: useful, productive, and never above verification.
The Pull Request That Looked Better Than It Was
A while ago I watched an AI tool generate a fix that looked almost unfairly good on first read. The naming was clean. The structure was elegant. The comments sounded thoughtful. The tests even passed.
If I had reviewed it lazily, I probably would have merged it in ten minutes and moved on with my day feeling efficient.
Then I started reading the code the way I read anything that might reach production. Not as text. As behavior.
The fix was for a validation flow. The happy path worked. The generated tests confirmed exactly that. But one branch quietly changed the semantics of an existing condition, and in a slightly different state transition the system would now accept invalid input that used to be correctly blocked. The AI had not written nonsense. It had written a plausible local solution that missed the broader contract.
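The actual code belongs to that codebase, so here is the shape of the failure in a deliberately invented sketch. The names and the rule are mine, not the original diff:

```python
# Invented illustration of the failure mode, not the original code.
def can_submit_original(form):
    # Contract: only forms in the "verified" state may be submitted.
    return form["state"] == "verified"

def can_submit_generated(form):
    # The generated "fix" also accepts "pending_verification" so the reported
    # symptom disappears -- locally plausible, but it quietly widens the set of
    # accepted states and breaks the contract for a different transition.
    return form["state"] in ("verified", "pending_verification")

assert can_submit_original({"state": "pending_verification"}) is False
assert can_submit_generated({"state": "pending_verification"}) is True  # contract drift
```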
That moment reinforced something I now treat as a rule: AI-generated code is not dangerous because it is always bad. It is dangerous because it is often good enough to lower your guard.
So I built a workflow around that reality. I use AI heavily, but I test its output in a way that assumes speed is cheap and confidence must be earned.
I Treat AI Like A Fast Junior Developer
This framing helps more than any abstract policy.
If a smart junior developer handed me a patch, I would not reject it just because they are junior. I would also not trust it blindly because it looks polished. I would ask what problem it solves, what assumptions it makes, what it might break, and whether the tests actually prove the intended behavior.
That is exactly how I handle AI output.
The model is fast, tireless, and often surprisingly helpful. It is also pattern-driven, context-limited, and biased toward producing something coherent even when the surrounding system requires caution. This means I do not ask, "Did the AI write clean code?" I ask a stricter question: "Did this change preserve the system's real invariants?"
Those are very different standards. One rewards appearance. The other protects production.
Step One: Test The Understanding Before The Code
My testing workflow starts before the implementation exists.
If I am using AI for anything non-trivial, I first make it restate the task in plain engineering terms. What is the bug? What behavior should change? What must remain unchanged? Which files are likely affected? Where are the dangerous side effects?
I do this because a lot of failures are not coding failures. They are understanding failures. If the model misunderstands the contract at the beginning, polished code only buries the problem faster.
Sometimes the restatement is enough to catch drift. The AI will focus on the visible symptom and ignore the actual domain rule. Or it will propose a broad refactor for a narrow bug. Or it will optimize a helper without recognizing that the real risk is a database write happening too early in the flow.
When I see that, I correct the direction before any code is generated. It is much cheaper to test understanding at prompt level than to debug a beautiful wrong answer later.
Step Two: Start With The Smallest Useful Surface
I do not like broad prompts that say "implement the whole feature." They create too many places for hidden mistakes to survive.
Instead, I break the work into smaller surfaces. One service method. One validator. One transformation layer. One test file. One migration. Sometimes one function.
This changes testing in an important way. Smaller surfaces are easier to reason about, easier to diff, and easier to falsify. If the AI only generated one narrow piece, I can inspect the full logic with real attention. If it generated the controller, service, repository, tests, and frontend state all at once, the review turns into theater. There is too much code for the time available, which means risk gets hidden inside volume.
One of the easiest ways to improve AI code quality is not asking the model to be smarter. It is reducing the blast radius of each answer.
Step Three: Review For Contracts, Not Just Syntax
This is the part I care about most.
When developers say they reviewed AI-generated code, sometimes they mean they skimmed it and nothing looked obviously broken. That is not review. That is pattern recognition.
My real review checklist is contract-oriented:
- What inputs are now accepted that were previously rejected?
- What outputs changed, even subtly?
- What state transitions are now possible?
- Did the error handling become broader or weaker?
- Did the code preserve permissions, validation, and ordering guarantees?
- Is the implementation solving the root rule or only the visible example?
This matters because AI often produces code that is locally reasonable and globally risky. It can satisfy the immediate prompt while violating assumptions that live elsewhere in the system.
Humans do this too, obviously. The difference is that AI can produce more of it, faster, and with enough confidence in the wording that tired engineers stop interrogating it as hard as they should.
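To make the first question on that list concrete, the cheapest tool I know is a pinning test: enumerate inputs the old code rejected and assert the new code still rejects them. A minimal pytest sketch, with an invented validate_email standing in for whatever function the diff actually touched:

```python
import pytest

# Invented example function standing in for the code under review.
def validate_email(value: str) -> bool:
    if not value or "@" not in value:
        return False
    local, _, domain = value.partition("@")
    return bool(local) and "." in domain

# Inputs the previous implementation rejected. If the new code starts
# accepting any of these, the contract changed and the diff needs a
# conversation, not a merge.
PREVIOUSLY_REJECTED = ["", "no-at-sign", "user@", "@domain.com", "user@nodot"]

@pytest.mark.parametrize("value", PREVIOUSLY_REJECTED)
def test_previously_rejected_inputs_stay_rejected(value):
    assert validate_email(value) is False
```

If one of those suddenly passes, the review stops being about syntax and becomes a conversation about the contract, which is exactly where it should be.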
Step Four: Write Tests That Try To Prove The Code Wrong
The worst AI test suites are mirror tests. They reproduce the shape of the implementation and confirm the code behaves exactly the way it was written. That tells me almost nothing.
I want tests that challenge the intent.
If the change is about validation, I write the invalid cases first. If the change is about a state machine, I test forbidden transitions. If it touches permissions, I verify unauthorized paths aggressively. If it is a refactor, I look for snapshot or contract tests that prove behavior did not drift.
In other words, I am not asking, "Can this code pass a test?" I am asking, "What test would embarrass this implementation?"
That mindset catches a lot. AI is very good at building a path through the example it was given. It is much less reliable when pressure comes from the edges, especially edges that require domain awareness rather than generic programming patterns.
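For the state-machine case, that usually means writing the forbidden transitions before looking at the implementation at all. A sketch under invented names; the transition table and OrderStateMachine are examples, not any real system:

```python
import pytest

# Invented example: a tiny state machine with an explicit transition table.
ALLOWED = {
    ("draft", "submitted"),
    ("submitted", "approved"),
    ("submitted", "rejected"),
}

class OrderStateMachine:
    def __init__(self, state="draft"):
        self.state = state

    def transition(self, new_state):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"forbidden transition {self.state} -> {new_state}")
        self.state = new_state

# The adversarial tests: forbidden edges, not the happy path.
@pytest.mark.parametrize("start,end", [
    ("draft", "approved"),      # skipping review entirely
    ("rejected", "approved"),   # resurrecting a rejected order
    ("approved", "draft"),      # walking backwards
])
def test_forbidden_transitions_raise(start, end):
    machine = OrderStateMachine(state=start)
    with pytest.raises(ValueError):
        machine.transition(end)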
My Favorite Pattern: Human-Written Tests, AI-Written Implementation
If I had to recommend one workflow to teams using AI, it would be this one.
I often write the test cases myself, or at least outline them precisely, before letting the AI implement the code. That keeps the definition of correct behavior in human hands.
Once the tests reflect the contract I actually care about, the model can help with the implementation safely. It can still make mistakes, but now the mistakes run into a stronger boundary. The code has to satisfy my understanding, not just the model's interpretation of the prompt.
This also changes the emotional dynamic of code generation. I am no longer hoping the output is right. I am forcing it to survive scrutiny. That is a much healthier relationship with the tool.
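Concretely, this often looks like a test file the model never edits: I write the case names and assertions, and the model writes whatever makes them pass. A sketch with hypothetical names; the discounts module does not exist yet, which is the point:

```python
# tests/test_discount.py -- written by a human before any implementation exists.
# The model is asked to write apply_discount() so these pass; it does not edit this file.
import pytest

from discounts import apply_discount  # hypothetical module the model will produce

def test_discount_never_produces_negative_total():
    assert apply_discount(total=10.0, percent=150) == 0.0

def test_zero_percent_is_a_no_op():
    assert apply_discount(total=10.0, percent=0) == 10.0

def test_negative_percent_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(total=10.0, percent=-5)
```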
I Also Test By Deleting Things
One surprisingly useful trick is asking whether the AI added code that does not really need to exist.
Models love to be helpful in the most expansive possible way. That means extra helpers, defensive branches, abstractions, wrappers, option flags, comments, and little pieces of ceremony that make the patch feel complete. Sometimes they are justified. Often they are not.
So part of my testing process is subtraction. Can I remove this helper and keep the behavior? Can I inline this transformation without losing clarity? Can I delete this branch because the invariant already guarantees it never happens? Can I use the existing pattern instead of introducing a second one?
Over-generated structure is not harmless. It increases the amount of code future-me has to understand under pressure. A test workflow that never questions excess code will slowly turn AI convenience into maintenance debt.
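A small, invented example of the kind of subtraction I mean:

```python
# Generated version: a wrapper, an option flag, and a defensive branch
# that the surrounding invariant already makes unreachable.
def _normalize_user_id(raw_id, strict=True):
    if raw_id is None:          # callers already validate this upstream
        return None
    return int(raw_id)

def load_user_generated(repo, raw_id):
    user_id = _normalize_user_id(raw_id)
    if user_id is None:
        return None
    return repo.get(user_id)

# After subtraction: same behavior for every input that can actually reach it.
def load_user(repo, raw_id):
    return repo.get(int(raw_id))
```

If callers really do validate the input upstream, the wrapper, the flag, and the defensive branch were only there to make the patch feel complete.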
Integration Tests Matter More Than People Admit
AI usually looks strongest at local reasoning and weakest at system boundaries. That is why I care so much about integration coverage.
A generated repository method might look fine. A controller action might look fine. A DTO mapping might look fine. But when the request enters through HTTP, hits validation, transforms state, triggers a side effect, persists data, and returns a response, that is where assumptions collide.
Whenever the change touches more than one layer, I want at least one test that exercises the whole path. Not because end-to-end coverage is glamorous, but because it exposes mismatches that unit tests politely ignore.
I have caught AI mistakes this way that would never have appeared in isolated tests: wrong serialization formats, duplicated writes, stale field names, validation messages wired to the wrong branch, and business rules accidentally bypassed by a refactor that was technically clean.
The code was not broken in isolation. The behavior was broken in context.
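A minimal sketch of what one whole-path test can look like, using FastAPI's TestClient purely as an example stack; the endpoint, the rules, and the in-memory store are invented:

```python
from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient
from pydantic import BaseModel

# Invented example app: request -> validation -> state change -> persistence -> response.
app = FastAPI()
DB: dict[str, dict] = {}  # stand-in for the real persistence layer

class SignupRequest(BaseModel):
    email: str

@app.post("/signup", status_code=201)
def signup(req: SignupRequest):
    if "@" not in req.email:
        raise HTTPException(status_code=422, detail="invalid email")
    if req.email in DB:
        raise HTTPException(status_code=409, detail="already registered")
    DB[req.email] = {"email": req.email}
    return {"email": req.email}

client = TestClient(app)

def test_duplicate_signup_is_rejected_and_not_written_twice():
    assert client.post("/signup", json={"email": "a@example.com"}).status_code == 201
    assert client.post("/signup", json={"email": "a@example.com"}).status_code == 409
    assert len(DB) == 1  # the failed request must not persist a second row
```

The interesting assertion is the last one. The 409 response is easy to get right; the absence of a second write is the kind of thing isolated unit tests politely ignore.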
I Use Production Smells As A Testing Heuristic
There are certain areas where I slow down immediately, no matter how small the diff looks.
Auth, permissions, money, concurrency, background jobs, idempotency, external API writes, migrations, caching, and anything that changes ordering logic all get reviewed with suspicion. These are places where a one-line AI suggestion can create a very expensive bug.
In those areas, I usually add one more layer of verification than seems strictly necessary. Extra tests. Manual run-through. Diff against previous behavior. Occasionally a local sandbox script to simulate edge cases quickly.
This is not paranoia. It is just respect for the parts of systems that fail expensively.
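For idempotency specifically, that extra layer can be as small as a throwaway sandbox script that replays the same operation twice and checks the side effect landed once. A hedged sketch with invented names:

```python
# Invented sandbox script: replays the same job twice and checks
# that the external side effect is recorded exactly once.
charges = []  # stand-in for the external payment API

def charge_once(payment_id: str, amount: int, seen: set):
    # Idempotency guard: the same payment_id must never be charged twice.
    if payment_id in seen:
        return
    seen.add(payment_id)
    charges.append({"payment_id": payment_id, "amount": amount})

if __name__ == "__main__":
    seen: set = set()
    charge_once("pay_123", 500, seen)
    charge_once("pay_123", 500, seen)  # duplicate delivery, e.g. a retried job
    assert len(charges) == 1, f"expected 1 charge, got {len(charges)}"
    print("idempotency check passed")
```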
The Question I Ask Before Merging
Before I trust an AI-generated change, I ask myself one uncomfortable question: if this breaks in production tonight, will I understand it well enough to debug it fast?
If the answer is no, the work is not done.
Maybe the tests are too shallow. Maybe I accepted an abstraction I did not fully interrogate. Maybe the implementation is fine but I cannot explain why it is fine. In all three cases, my confidence is artificial.
I like this question because it cuts through false productivity. Passing tests are good. Clean diffs are good. Fast output is good. But real ownership means I could still defend, modify, and repair the code without the model holding my hand.
What Teams Should Standardize
I do not think this should stay an individual habit. Teams using AI seriously need shared expectations.
At minimum, I think teams should agree on a few things:
- Which areas of the codebase require human-written tests first
- Which categories of changes require integration coverage
- How much generated code is acceptable in one reviewable chunk
- Which domains require manual design notes before implementation
- How to document reasoning when the code was generated quickly
Without standards, every developer builds a private relationship with AI and the repository becomes a museum of inconsistent trust levels. Some patches are deeply understood. Others are simply accepted because the model sounded convincing and the sprint was busy.
That is not a code problem. That is a process problem waiting to become a production problem.
My Actual Goal Is Not To Catch Every Mistake
No workflow catches everything. Human code does not. AI-generated code definitely does not. My goal is not perfection. It is disciplined skepticism.
I want a system where AI helps me move faster, but cannot quietly smuggle uncertainty into production behind polished syntax. I want the tests to be adversarial enough that the model has to earn the merge. I want my own review habits to stay sharp enough that I never confuse generated confidence with engineering confidence.
That is the balance I keep chasing. Use the tool aggressively. Trust it slowly.
The Standard I Keep Coming Back To
When AI writes code for me, I do not ask whether it looks professional. I ask whether it survived a process designed to prove it wrong.
If it survives that process, great. I merge faster.
If it does not, that is not a failure of the workflow. That is the workflow working.
AI-generated code should not be trusted because it is fast. It should be trusted only after it survives tests written by someone who assumes it might be wrong.
