June 2026 · AI Engineering · 12 min read

Context engineering is the real work behind reliable AI agents

The visible magic of an AI agent is the answer it gives. The engineering work is everything that happens before that answer: what context it receives, what it is allowed to remember, which tools it can use, and when it should stop and ask for help.

The Agent Was Not Stupid. It Was Underbriefed.

The most useful lesson I learned while building AI agents was also the least dramatic one: many agent failures are not model failures. They are context failures.

It is tempting to blame the model when an agent makes a bad decision. Sometimes that is fair. Models hallucinate, miss details, overgeneralize, and produce confident answers from weak evidence. But when I look back at the failures that actually mattered, a lot of them started earlier. The agent did not know which file was authoritative. It did not know which instruction was stale. It did not know that a human had changed the plan in a later message. It did not know the difference between a safe internal action and an external side effect. It was not stupid. It was underbriefed.

That distinction changed how I think about agent design.

When people talk about AI agents, they often focus on the exciting parts: tool calls, autonomous workflows, multi-step planning, browser control, code generation, memory, and long-running tasks. Those things matter, but they sit on top of a quieter discipline. The agent needs the right working set of information at the right moment, with enough structure to act and enough constraints to stay responsible.

That is context engineering. Not prompt decoration. Not stuffing more text into the window. It is the practice of deciding what the agent should know, what it should ignore, what it should verify, and how it should recover when the available context is not enough.

More Context Is Not The Same As Better Context

Long context windows are useful, but they can create a false sense of safety. If the model can read hundreds of thousands of tokens, it feels like the problem is solved. Put everything in. Let the model figure it out.

In practice, this creates a different problem. The agent now has more opportunities to find old assumptions, irrelevant details, duplicate instructions, conflicting examples, and stale state. A long context window is not a memory system. It is a bigger desk. If the desk is covered with unsorted paper, the size of the desk does not make the work cleaner.

I have seen agents behave worse with more context because the extra material blurred the priority of the task. The user asked for one narrow change, but the prompt included old plans, half-finished experiments, previous preferences, unrelated files, and system notes from a different workflow. The model tried to be helpful across all of it and produced a solution that was technically coherent but operationally wrong.

Good context engineering is selective. It asks: what information is necessary to make this decision? Which source should win if two sources disagree? What can be retrieved later instead of loaded now? Which details are dangerous if treated as current?

The goal is not to minimize context for its own sake. The goal is to make the context legible. A small, precise brief beats a huge pile of vaguely relevant material.

Context Has A Chain Of Custody

When an agent acts on information, I want to know where that information came from.

Was it from the current user message? A project file? A previous summary? A live API response? A cached memory? A generated plan? A tool result from five minutes ago? Those sources do not deserve equal trust.

This matters because agents are good at blending information. That is useful for reasoning, but dangerous for operations. A model can combine a current task with an old instruction and produce a confident action that nobody explicitly asked for. It can remember the shape of a previous solution and apply it to a repo that changed yesterday. It can treat a summary as if it were exact evidence.

So I like context systems that preserve provenance. Tool output should be visibly different from memory. User instructions should outrank generated notes. Live repository state should outrank old summaries. A current error message should outrank an assumption from a plan. If the agent cannot tell where a fact came from, it should be careful about using that fact for irreversible work.
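As a rough illustration of what preserving provenance could look like, here is a minimal Python sketch. The `Source` ranking, the `Fact` type, and the helper names are hypothetical, invented for this example rather than taken from any particular framework, and the exact ordering (for instance, summaries above cached memory) is an assumption you would tune for your own system.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Source(IntEnum):
    """Hypothetical trust ranking; a higher value wins a conflict."""
    GENERATED_NOTE = 1
    MEMORY = 2
    SUMMARY = 3
    PROJECT_FILE = 4
    TOOL_RESULT = 5
    USER_MESSAGE = 6


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    source: Optional[Source]  # None means nobody recorded where this came from


def resolve(facts: list[Fact], key: str) -> Optional[Fact]:
    """Return the most trusted fact for a key, skipping facts of unknown origin."""
    known = [f for f in facts if f.key == key and f.source is not None]
    return max(known, key=lambda f: f.source, default=None)


def safe_for_irreversible(fact: Optional[Fact]) -> bool:
    """Only live tool evidence or a direct user instruction passes this gate."""
    return fact is not None and fact.source is not None and fact.source >= Source.TOOL_RESULT


facts = [
    Fact("deploy_branch", "release", Source.SUMMARY),
    Fact("deploy_branch", "main", Source.TOOL_RESULT),
    Fact("deploy_branch", "staging", None),
]
winner = resolve(facts, "deploy_branch")
print(winner.value, safe_for_irreversible(winner))  # main True
```

The point of the sketch is that the source label travels with the fact, so the question "where did this come from?" has an answer at the moment the agent is about to act.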

This is especially important in coding agents. Codebases move. A summary that was true last week may be misleading today. Before editing, the agent should read the current files. Before pushing, it should inspect the diff. Before saying a deployment succeeded, it should check evidence from the current run. Context without custody turns into folklore.

The Real Unit Is The Task Boundary

A useful agent needs to understand the boundary of the task, not just the task itself.

If the user says, “publish the next blog post,” the task is not merely writing text. It includes choosing the next topic, following the existing content format, adding the route, downloading a cover image, running the build, committing only relevant files, pushing, updating the queue, and reporting the live link. It also excludes unrelated cleanup, design refactors, changing old posts, and touching untracked files that happened to be sitting in the repo.

That boundary is where many agents get into trouble. They overreach because they interpret helpfulness as doing adjacent work. Or they underreach because they complete the visible artifact and forget the operational steps that make it real.

I try to make agents treat scope as a first-class object. What files are owned by this task? What side effects are expected? What side effects require approval? What existing changes belong to someone else? What should be left alone even if it looks messy?
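One way to make scope a first-class object is to write it down as data the agent can check against. The sketch below is an assumption about how that might look: `TaskScope`, its field names, and the effect labels are all hypothetical, and which effects need approval is a per-task decision, not a rule.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class TaskScope:
    owned_paths: list[str]  # glob patterns this task may edit
    expected_effects: set[str] = field(default_factory=set)  # fine without asking
    approval_effects: set[str] = field(default_factory=set)  # need a human yes

    def may_edit(self, path: str) -> bool:
        """True only for files that match a pattern owned by this task."""
        return any(fnmatchcase(path, pattern) for pattern in self.owned_paths)

    def check_effect(self, effect: str) -> str:
        """Classify a side effect as 'allow', 'ask', or 'deny'."""
        if effect in self.expected_effects:
            return "allow"
        if effect in self.approval_effects:
            return "ask"
        return "deny"  # anything not listed is out of scope


# Hypothetical scope for a "publish the next blog post" task.
scope = TaskScope(
    owned_paths=["content/posts/*.md", "public/covers/*"],
    expected_effects={"run_build", "git_commit", "git_push"},
    approval_effects={"post_to_social"},
)
print(scope.may_edit("content/posts/new-post.md"))  # True
print(scope.may_edit("src/layout.tsx"))             # False
print(scope.check_effect("post_to_social"))         # ask
print(scope.check_effect("delete_branch"))          # deny
```

Defaulting unlisted effects to "deny" is a deliberate choice in this sketch: adjacent work has to be added to the scope explicitly instead of slipping in as helpfulness.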

This sounds like normal engineering discipline because it is. Agents do not remove the need for scope control. They make it more important, because an agent can execute a broad misunderstanding much faster than a human.

Memory Is Useful Only When It Is Curated

Persistent memory is one of the most attractive features of an agent system. It is also one of the easiest to get wrong.

If memory becomes a dumping ground, it stops being memory and becomes sediment. Every preference, temporary plan, past workaround, old credential note, abandoned project, and one-time exception accumulates until the agent is surrounded by context that may or may not still matter.

Curated memory has a different job. It should preserve decisions, durable preferences, active projects, known constraints, and lessons that should affect future behavior. It should not preserve every transient detail with equal weight. It should also be editable. Humans change their minds. Projects get archived. Workflows become obsolete. A memory system that only grows will eventually mislead the agent.

I like separating raw logs from long-term memory. Raw daily notes can capture what happened. Curated memory can hold what should shape future decisions. That separation makes recall less magical and more maintainable. If the agent needs exact history, it can search the logs. If it needs operating context, it can read the curated memory.
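A minimal sketch of that separation, under the assumption that everything fits in memory: the `AgentMemory` class, its method names, and the sample entries are hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    note: str
    archived: bool = False


@dataclass
class AgentMemory:
    raw_log: list[str] = field(default_factory=list)  # append-only history
    curated: dict[str, MemoryEntry] = field(default_factory=dict)  # one entry per topic

    def log(self, event: str) -> None:
        self.raw_log.append(event)

    def remember(self, topic: str, note: str) -> None:
        """Replace the entry for a topic, so a changed decision does not pile up."""
        self.curated[topic] = MemoryEntry(note)

    def archive(self, topic: str) -> None:
        """Keep the entry for reference but stop feeding it to the agent."""
        if topic in self.curated:
            self.curated[topic].archived = True

    def operating_context(self) -> list[str]:
        """Notes that should shape future decisions."""
        return [e.note for e in self.curated.values() if not e.archived]

    def search_log(self, term: str) -> list[str]:
        """Exact history, looked up only when it is needed."""
        return [line for line in self.raw_log if term.lower() in line.lower()]


memory = AgentMemory()
memory.log("Build failed: cover image missing")
memory.remember("publishing", "Ask before publishing")
memory.remember("publishing", "Publish directly after a green build")
memory.remember("old-site", "Use the legacy deploy script")
memory.archive("old-site")
print(memory.operating_context())       # ['Publish directly after a green build']
print(memory.search_log("build failed"))  # ['Build failed: cover image missing']
```

Because `remember` overwrites and `archive` hides, the curated side can shrink as well as grow, while the raw log stays complete for anyone who needs the exact history.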

The important thing is that memory should not become an excuse to skip verification. Remembered facts are hints. Current evidence still matters.

Tool Context Needs To Be Designed Too

Giving an agent tools is easy. Making tool use reliable is harder.

A tool is not just a function. It is a contract: what it can do, what it cannot do, what its output means, what errors look like, and what side effects it creates. If that contract is vague, the agent has to infer too much.

For example, a web fetch tool and a browser automation tool both access websites, but they should not be treated the same. A fetch is cheap and good for readable content. A browser is heavier and useful when interaction or JavaScript matters. If the agent has no guidance, it may use the expensive tool for simple lookups or trust a fetch result from a page that clearly required dynamic rendering.

The same applies to messaging, git, deployment, calendars, email, and social publishing. Reading an inbox is not the same as sending a reply. Drafting a post is not the same as publishing under a real name. Creating a local commit is not the same as pushing to production. The agent needs to understand the difference between internal work and external action.

Good tool context describes those boundaries explicitly. It also tells the agent how to verify. After generating a static site, run the build. After pushing, check the remote result if possible. After downloading an image, confirm it is actually an image. After editing a route, make sure the route renders. Tool use without verification is just faster guessing.
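To make the idea of a tool contract concrete, here is a small sketch that pairs each tool with a side-effect flag and a verification check. `ToolContract`, `run_tool`, and the two example tools are hypothetical names for illustration; the only external fact relied on is the eight-byte PNG file signature.

```python
from dataclasses import dataclass
from typing import Any, Callable

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of every PNG file


@dataclass(frozen=True)
class ToolContract:
    name: str
    purpose: str
    external_side_effect: bool  # True if the call changes anything outside the workspace
    verify: Callable[[Any], bool]  # check applied to the result before reporting success


def run_tool(contract: ToolContract, action: Callable[[], Any], approved: bool = False) -> dict:
    """Run an action under its contract and report a status the agent can act on."""
    if contract.external_side_effect and not approved:
        return {"tool": contract.name, "status": "needs_approval"}
    result = action()
    status = "verified" if contract.verify(result) else "unverified"
    return {"tool": contract.name, "status": status}


download_image = ToolContract(
    name="download_image",
    purpose="Fetch a cover image into the workspace",
    external_side_effect=False,
    verify=lambda data: isinstance(data, bytes) and data.startswith(PNG_SIGNATURE),
)
publish_post = ToolContract(
    name="publish_post",
    purpose="Make a post publicly visible",
    external_side_effect=True,
    verify=lambda status_code: status_code == 200,
)

# The lambdas below stand in for real tool calls.
print(run_tool(download_image, lambda: b"<html>Not found</html>")["status"])  # unverified
print(run_tool(download_image, lambda: PNG_SIGNATURE + b"rest")["status"])    # verified
print(run_tool(publish_post, lambda: 200)["status"])                          # needs_approval
```

The first call is the case worth noticing: the download "succeeded" in the sense that bytes came back, but the contract refuses to call it an image.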

Instructions Need Priority, Not Volume

Agent prompts often grow by accumulation. Someone adds a rule after a mistake. Then another rule. Then a special case. Then a project convention. Then a safety warning. Eventually the prompt is full of instructions, but the agent still makes mistakes because the instructions are not organized by priority.

Priority matters more than volume. If a current user request conflicts with an old plan, which wins? If a repository convention conflicts with a generic preference, which wins? If a cron job says “publish” but an old topic file says “ask before publishing,” how should the agent interpret that? These are not edge cases. They happen constantly in real workflows.

I prefer instruction sets that make precedence explicit. Safety rules outrank convenience. Current user intent outranks stale notes. Repo-local patterns outrank generic style preferences. Tool evidence outranks memory. External side effects deserve more caution than local edits.
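Explicit precedence can be as simple as tagging each instruction with a tier and refusing to resolve ties silently. The sketch below is one possible shape, with hypothetical names (`Tier`, `Instruction`, `pick`); the relative order of the lower tiers is an assumption, since only some of the pairings are fixed by the rules above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical precedence tiers; a higher value wins."""
    GENERIC_PREFERENCE = 1
    STALE_NOTE = 2
    REPO_CONVENTION = 3
    CURRENT_USER_INTENT = 4
    SAFETY_RULE = 5


@dataclass(frozen=True)
class Instruction:
    text: str
    tier: Tier


def pick(conflicting: list[Instruction]) -> tuple[Instruction, list[Instruction]]:
    """Return the winning instruction plus the ones it overrode, for the report."""
    if not conflicting:
        raise ValueError("no instructions to choose from")
    ranked = sorted(conflicting, key=lambda i: i.tier, reverse=True)
    if len(ranked) > 1 and ranked[0].tier == ranked[1].tier:
        raise ValueError("same-tier conflict: ask instead of guessing")
    return ranked[0], ranked[1:]


winner, overridden = pick([
    Instruction("ask before publishing", Tier.STALE_NOTE),
    Instruction("publish the next post", Tier.CURRENT_USER_INTENT),
])
print(winner.text)                   # publish the next post
print([i.text for i in overridden])  # ['ask before publishing']
```

Returning the overridden instructions alongside the winner means the conflict shows up in the agent's report instead of being negotiated out of sight.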

Without precedence, the agent has to negotiate conflicts silently. That is where surprising behavior comes from. The model may follow the wrong rule not because it ignored instructions, but because it had too many plausible instructions and no clear way to choose.

Context Compression Is A Product Feature

Long-running agents eventually need compression. They cannot keep every message, every tool call, every diff, and every observation in active context forever.

The hard part is that compression is lossy. A summary may preserve the headline and lose the exact command. It may remember that a build failed but omit the reason. It may say “the user wanted X” while dropping the condition that made X safe. If the agent treats compressed context as exact truth, it will eventually make a bad call.

That means compression needs a retrieval path. A summary should be a map, not the territory. When exact details matter, the agent should be able to expand the relevant history, search prior messages, or re-check the current state. For ordinary continuity, the summary is enough. For commands, file paths, credentials, decisions, and causal claims, it should verify.
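A retrieval path can be sketched by keeping pointers from each summary back to the messages it compressed. Everything named here (`Summary`, `context_for`, `EXACT_KINDS`, the message IDs, and the sample history) is hypothetical and exists only to show the routing decision.

```python
from dataclasses import dataclass

# Kinds of detail that should be re-read from the source rather than the summary.
EXACT_KINDS = {"command", "file_path", "credential", "decision", "causal_claim"}


@dataclass(frozen=True)
class Summary:
    text: str
    message_ids: tuple[int, ...]  # pointers back to the raw messages it compressed


def context_for(kind: str, summary: Summary, history: dict[int, str]) -> list[str]:
    """Use the summary for continuity, but expand to raw messages for exact details."""
    if kind in EXACT_KINDS:
        return [history[i] for i in summary.message_ids if i in history]
    return [summary.text]


history = {
    41: "User: rerun the build only after the cover image is in place",
    42: "Tool: build failed, cover image missing",
}
summary = Summary("Build failed once; user wants a rebuild.", (41, 42))

print(context_for("continuity", summary, history))  # the one-line summary
print(context_for("decision", summary, history))    # both raw messages, condition included
```

In this example the summary dropped the condition attached to the rebuild, which is exactly the kind of loss that expanding the raw messages recovers.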

This is one of those places where agent UX and engineering meet. A good system does not only compress context. It teaches the agent when compression is sufficient and when it must go back to evidence.

The Agent Should Know When It Is Not Ready

One of the most underrated capabilities in an agent is the ability to pause.

Not every missing detail should become a question to the user. A good agent should be resourceful. It should read files, inspect state, search docs, run safe checks, and make reasonable assumptions when the risk is low. But there are moments where continuing would be fake confidence.

If an action is public, destructive, financial, legal, or hard to reverse, missing context matters more. If the agent cannot identify the target account, cannot verify which branch is production, cannot tell whether a file belongs to the current task, or sees conflicting instructions about an external action, it should slow down and ask.
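The rule of thumb in the last two paragraphs can be written as a small gate. This is a deliberately simplified sketch: the tag names, the `next_step` function, and its return strings are hypothetical, and it assumes `unknowns` lists only what is still unresolved after the agent has already read files and run safe checks.

```python
# Risk categories taken from the paragraph above; "irreversible" stands for hard to reverse.
HIGH_RISK = {"public", "destructive", "financial", "legal", "irreversible"}


def next_step(action_tags: set[str], unknowns: list[str]) -> str:
    """Decide whether to proceed, proceed on stated assumptions, or stop and ask."""
    if not unknowns:
        return "proceed"
    if action_tags & HIGH_RISK:
        # Missing context plus a risky action: stop and surface the gaps.
        return "ask: " + "; ".join(unknowns)
    # Low-risk work can continue as long as the assumptions are reported.
    return "proceed_with_assumptions"


print(next_step({"local_edit"}, []))                               # proceed
print(next_step({"local_edit"}, ["preferred variable name"]))      # proceed_with_assumptions
print(next_step({"public"}, ["which account to post from"]))       # ask: which account to post from
```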

This is not weakness. It is operational maturity.

Humans do this too. A senior engineer does not ask about every tiny detail, but they do stop before deleting data, sending customer emails, or shipping a change that depends on an unverified assumption. Agents should be held to the same standard.

Small Working Sets Make Better Agents

The best agent workflows I have built do not try to make one giant agent understand everything all the time. They create small working sets.

For a coding task, that might mean reading the specific files involved, the nearby tests, the build command, and the current git status. For a content task, it might mean reading the topic queue, a few previous posts, the publishing checklist, and the image workflow. For an inbox task, it might mean reading only unread messages plus the rules for what deserves attention.

Small working sets keep the agent grounded. They reduce accidental influence from unrelated context and make verification easier. They also help with delegation. If a subtask can be described with a clear ownership boundary, another agent or process can handle it without needing the entire universe of context.

This is similar to good software design. Interfaces matter. Scope matters. Encapsulation matters. An agent with a clean task boundary behaves more predictably than one swimming through every fact it has ever seen.

Context Engineering Is Mostly Unromantic

The funny thing about context engineering is that it does not look impressive in a demo.

A demo shows the agent producing an answer, editing code, opening a page, or finishing a workflow. It usually does not show the boring machinery that made the result trustworthy: reading the current files, ignoring unrelated changes, checking the topic queue, choosing a current source over a stale memory, limiting the commit to specific files, running the build, and reporting the exact output that matters.

But that boring machinery is where reliability comes from.

When context is engineered well, the agent feels calmer. It asks fewer unnecessary questions because it knows where to look. It takes fewer reckless actions because it knows which boundaries matter. It makes fewer stale assumptions because it verifies current state. It recovers better from compression because it can find evidence again. It becomes less magical and more dependable.

That is the version of AI agents I care about. Not a system that pretends to be autonomous in every direction, but a system that can carry real work because its context, memory, tools, and responsibilities are shaped deliberately.

The Checklist I Keep Coming Back To

Before I trust an agent with a workflow, I ask a few practical questions.

Does it know which source of context is authoritative?
Can it distinguish current evidence from remembered assumptions?
Does it understand the boundary of the task?
Are external side effects treated differently from local work?
Can it verify the result before reporting success?
Can it recover exact details when compressed context is not enough?
Does it know when to stop and ask for help?

If those answers are weak, the agent may still produce impressive output. I just would not trust it with much responsibility yet.

The Work Behind The Magic

AI agents look magical when they complete a task from a short instruction. But the magic is usually the visible surface of a lot of unglamorous engineering.

Someone decided what memory means. Someone wrote the tool rules. Someone set the safety boundary. Someone designed the retrieval path. Someone taught the agent to inspect the current repo before editing. Someone made the build part of the workflow. Someone decided that old summaries are useful but not authoritative.

That work is easy to underestimate because it does not produce a flashy screenshot. It produces something better: an agent that behaves with continuity, caution, and enough evidence to be trusted.

For me, that is the real promise of agentic systems. Not replacing engineering judgment, but packaging more of that judgment into the workflow so the agent can operate responsibly when the human is not holding every detail in short-term memory.

The answer is the visible part. The context is the system that decides whether the answer deserves to exist.

Igor Gawrys

AI Engineer & IT Consultant · Katowice, Poland
