April 2026 · Engineering Stories · 12 min read

The hardest bug I ever fixed took three days and zero lines of code

The bug crashed production every Tuesday at 2 AM. Only Tuesday. Only 2 AM. The logs showed nothing. The monitoring showed nothing. Three days later I found it - and the fix was not code.

Tuesday at 2 AM

The first time it happened, we blamed infrastructure. The application - a Laravel monolith serving a mid-size eCommerce platform at a German enterprise - crashed hard. All workers died simultaneously. The load balancer started returning 502s. Customers saw error pages for about four minutes before the auto-restart kicked in.

The on-call engineer checked the logs: nothing unusual before the crash. Memory usage normal. CPU normal. Database connections normal. Then, at 02:00:12 UTC, every PHP-FPM worker process terminated with signal 9 - killed by the operating system.

"Probably an OOM kill during a traffic spike," said the ops team. We added more memory to the servers and moved on.

The next Tuesday, it happened again. Same time. Same behavior. Same lack of evidence.

The Pattern Nobody Believed

By the third Tuesday, I was convinced this was not random. Tuesday at 2 AM, three weeks in a row. The probability of that being coincidence was astronomically low.

But when I told the team "the bug only happens on Tuesday at 2 AM," I got looks. The senior ops engineer actually laughed. "Computers do not care what day it is."

Except they do. Cron jobs care. Scheduled tasks care. Batch processes care. Maintenance windows care. Something was happening at 2 AM on Tuesdays that was not happening at any other time.

I started listing everything that ran on a schedule. The application had seventeen cron jobs. The infrastructure had another dozen. The database had scheduled maintenance. The CDN had cache purge cycles. The monitoring system itself had weekly reports.

None of them ran at 2 AM on Tuesdays.

Day One: Following the Wrong Trail

I spent the first day looking at the application. I instrumented every cron job with detailed logging. I added memory profiling to the main request pipeline. I set up a separate monitoring process that would capture the system state every second starting at 01:55 on Tuesday.
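
The actual watcher script is not shown in this post, but in sketch form it was something like this - a hypothetical reconstruction assuming a Linux host with PHP-FPM; the log path and the pgrep/awk probes are my stand-ins:

```php
<?php
// Hypothetical per-second state capture; assumes Linux, php-fpm, /proc/meminfo.
$log = fopen('/var/log/tuesday-watch.log', 'ab');

while (true) {
    $load     = sys_getloadavg()[0]; // 1-minute load average
    $memAvail = (int) shell_exec("awk '/MemAvailable/ {print \$2}' /proc/meminfo");
    $workers  = (int) shell_exec('pgrep -c php-fpm'); // live PHP-FPM processes

    fwrite($log, sprintf(
        "%s load=%.2f mem_available_kb=%d fpm_workers=%d\n",
        date('c'), $load, $memAvail, $workers
    ));
    fflush($log); // make sure the last samples survive the crash

    sleep(1);
}
```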

I also reviewed every recent deployment. The crash started three weeks ago - what changed three weeks ago? I went through every commit, every configuration change, every infrastructure update. Nothing stood out.

The application code was clean. The infrastructure was standard. The database was healthy. Everything pointed to "this should not be happening."

That night - Tuesday night - I stayed up. At 01:55, I was watching the monitoring dashboard like a hawk. Memory: stable. CPU: idle. Connections: minimal. At 02:00:12, every metric flatlined. The workers were gone.

My custom monitoring script captured one interesting detail: in the last 200 milliseconds before the crash, memory usage spiked from 2.1 GB to 7.8 GB. Not gradually - instantaneously. Something allocated 5.7 GB of RAM in a fraction of a second, the OOM killer fired, and everything died.

Day Two: The Memory Spike

Now I had a clue. Something was allocating massive amounts of memory at exactly 2 AM on Tuesday. But what?

I added even more instrumentation. I wrote a custom PHP script that hooked into the memory allocator and logged every allocation over 1 MB. I deployed it on Monday evening and waited.
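
For context: plain PHP cannot hook the allocator directly, so a userland version of that logger has to approximate it - for example by sampling memory_get_usage() after every statement via ticks. A minimal sketch of that idea (not the actual script I used):

```php
<?php
// Tick-based approximation of an allocation logger: after every statement,
// compare memory usage to the previous reading and log any jump over 1 MB.
// An assumption about how such a tool could work, not the original script.
declare(ticks=1);

$lastUsage = memory_get_usage(true);

register_tick_function(function () use (&$lastUsage) {
    $current = memory_get_usage(true);

    if ($current - $lastUsage > 1024 * 1024) {
        // Frame 0 points at the statement that triggered the tick.
        $frame = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 1)[0] ?? [];
        error_log(sprintf(
            '[alloc] +%.1f MB at %s:%d (total %.1f MB)',
            ($current - $lastUsage) / 1048576,
            $frame['file'] ?? '?',
            $frame['line'] ?? 0,
            $current / 1048576
        ));
    }

    $lastUsage = $current;
});
```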

Tuesday, 02:00:11 - the log showed a single entry: a function in the order export module allocated a 5.7 GB string. A string. One single string variable consuming 5.7 gigabytes.

The order export module was a background job that generated CSV exports of orders for the accounting team. It ran daily - not weekly. But the daily run at 2 AM was fine on every other day. Why was Tuesday different?

I dug into the code. The export function was straightforward: query orders from the last 24 hours, format them as CSV, write to disk. Simple. Except for one detail.

The query had a date filter: orders created between "yesterday 2 AM" and "today 2 AM." On most days, this returned a few hundred orders. But on Tuesday, it returned something different.

The Eureka Moment

Monday nights were when the marketing team ran their weekly email campaign. The campaign went out at 8 PM on Monday. By Tuesday morning, the orders from the campaign were flowing in. But that was not the problem - a few thousand extra orders would not cause a 5.7 GB string.

The problem was in how the CSV was generated. The code did not stream the output to a file. It built the entire CSV in memory as a single string, then wrote it to disk in one operation. Bad practice, but harmless for a few hundred orders.

Except the marketing campaign also triggered a different system: the recommendation engine. When a customer placed an order from the campaign, the recommendation engine generated personalized product suggestions and stored them as JSON blobs in the order metadata. Each blob was about 50 KB of nested product data.

The CSV export included order metadata. On a normal day, metadata was minimal - maybe 200 bytes per order. On Tuesday, after the Monday campaign, each order carried 50 KB of recommendation data. Multiply that by 3,000 campaign-driven orders, and you get 150 MB of metadata alone - all concatenated into a single string, with the CSV formatting overhead pushing it well beyond what the server could handle.

But wait - 150 MB is not 5.7 GB. Where did the rest come from?

PHP grows a string by reallocating its buffer. When the allocator cannot extend the buffer in place, concatenation means allocating a new block of the combined length and copying everything over. The export code built the CSV by appending one row at a time in a loop, so the copying cost grew quadratically with the number of rows. With 3,000 rows of roughly 50 KB each, the repeated copies - and the fragmentation they left behind - ballooned peak memory far beyond the payload itself. The 150 MB of actual data became 5.7 GB of allocation churn.
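
Reconstructed from that description, the failing code had roughly this shape - the original module is not shown here, so the table layout and variable names are my assumptions:

```php
<?php
// Reconstruction of the anti-pattern, assuming a PDO handle in $pdo.
$from = (new DateTimeImmutable('yesterday 02:00'))->format('Y-m-d H:i:s');
$to   = (new DateTimeImmutable('today 02:00'))->format('Y-m-d H:i:s');

$stmt = $pdo->prepare(
    'SELECT * FROM orders WHERE created_at BETWEEN :from AND :to'
);
$stmt->execute(['from' => $from, 'to' => $to]);

$csv = "id,customer_id,total,created_at,status,metadata\n";
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $order) {
    // Every pass copies the ever-growing string - including the 50 KB
    // metadata blob that SELECT * silently drags along.
    $csv .= implode(',', array_map('strval', $order)) . "\n";
}

file_put_contents('/tmp/orders_export.csv', $csv); // one giant write at the end
```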

The Fix

The fix was not code. Or rather, it was not new code.

I did not rewrite the export module. I did not optimize the string concatenation. I did not add memory limits or chunking.

I excluded the recommendation metadata from the CSV export.

The accounting team did not need product recommendations in their order export. Nobody had asked for it. The metadata field was included because the original developer had used SELECT * instead of selecting specific columns, and the CSV generator exported everything it received.

One line changed in the database query: from SELECT * to SELECT id, customer_id, total, created_at, status. The export went from potentially gigabytes to a few megabytes. The Tuesday crash disappeared.

But I also fixed the underlying issues, because leaving a time bomb in the codebase felt wrong:

  • Replaced string concatenation with fputcsv() writing directly to a file handle - streaming output instead of building in memory (sketched after this list)
  • Added a memory limit check that would switch to chunked processing if the result set exceeded 1,000 rows
  • Added monitoring for the export job: execution time, memory peak, row count
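
A minimal sketch of the streaming version, using the column list from the fixed query (the helper name is illustrative, and the chunking and monitoring pieces are omitted):

```php
<?php
// Streaming rewrite: rows go straight from the database cursor to the file,
// so peak memory stays flat no matter how many orders the window contains.
function exportOrders(PDO $pdo, string $path, string $from, string $to): int
{
    $out = fopen($path, 'wb');
    fputcsv($out, ['id', 'customer_id', 'total', 'created_at', 'status']);

    $stmt = $pdo->prepare(
        'SELECT id, customer_id, total, created_at, status
           FROM orders
          WHERE created_at BETWEEN :from AND :to'
    );
    $stmt->execute(['from' => $from, 'to' => $to]);

    $rows = 0;
    while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
        fputcsv($out, $row); // one row at a time, never the whole file in RAM
        $rows++;
    }

    fclose($out);
    return $rows;
}
```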

Why It Took Three Days

The actual debugging, once I understood the pattern, took about four hours. The other two and a half days were spent looking in the wrong places.

I looked at infrastructure when the problem was application code. I looked at cron jobs when the trigger was external (the marketing campaign). I assumed the crash was caused by something that changed recently, when in fact the code had been there for years - it just never encountered enough data to trigger the bug.

The hardest part was not the technical investigation. It was believing the pattern. "It only crashes on Tuesday at 2 AM" sounds absurd until you understand why Tuesday is different from every other day. The lesson: trust the data, even when it sounds ridiculous.

What I Learned

Temporal patterns are always meaningful. If a bug happens at a specific time, something is different at that time. It might be a cron job, a scheduled task, an external system, user behavior patterns, or a marketing campaign. Computers are deterministic - if the same thing happens at the same time, there is a cause.

SELECT * is technical debt. It works until it does not. When someone adds a 50 KB JSON blob to a table that used to have 200-byte rows, every query that uses SELECT * suddenly moves 250 times more data. Explicit column selection is not premature optimization - it is defensive programming.

String concatenation in loops is a classic PHP trap. It looks innocent. It works fine for small data. Then one day the data grows and your server runs out of memory in 200 milliseconds. Always stream large outputs - never build them in memory.

The fix is not always where the bug is. The bug was in the CSV export code. The fix was in the database query. The root cause was in the marketing team's campaign schedule. Understanding the full chain - from trigger to symptom - is more valuable than patching the immediate failure point.

Cross-team visibility prevents these bugs. If the development team had known about the Monday email campaign, someone might have anticipated the data volume spike. Silos create blind spots. The dashboard I built later (the SSE one from a previous post) helped with this - making work visible across teams reduces surprises.

The Postmortem

I wrote a postmortem and shared it with the entire engineering organization. Not because I wanted credit - but because this class of bug is universal. Every codebase has a SELECT * somewhere. Every system has a scheduled job that nobody monitors. Every team has a blind spot where "it works fine" means "it works fine with today's data volume."

The postmortem had one recommendation that stuck: every background job should have three monitors - execution time, memory peak, and output size. If any of those triple from their baseline, alert. Do not wait for the crash.
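
In sketch form, that rule is a thin wrapper around each job. The baseline values, the return-the-output-size convention, and the alert() hook below are stand-ins for whatever your stack provides, not the system we actually built:

```php
<?php
// Sketch of the "three monitors" rule: wrap a job, measure the three
// metrics, and alert when any of them triples its baseline.
function runMonitored(string $name, callable $job, array $baseline): void
{
    $start = microtime(true);
    $outputSize = $job(); // convention: the job returns its output size

    $metrics = [
        'seconds'     => microtime(true) - $start,
        'peak_mb'     => memory_get_peak_usage(true) / 1048576,
        'output_size' => $outputSize,
    ];

    foreach ($metrics as $metric => $value) {
        if ($value > 3 * $baseline[$metric]) {
            alert("$name: $metric is {$value}, over 3x baseline {$baseline[$metric]}");
        }
    }
}
```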

We implemented that monitoring across all background jobs. In the following six months, we caught four similar time bombs before they exploded. None of them crashed production.

The hardest bugs are not the ones hidden in complex algorithms. They are the ones hidden in simple code that nobody thought to question.
Igor Gawrys
AI Engineer & IT Consultant · Katowice, Poland