5 Lessons Learned After Missing a Critical Bug in Production

No matter your seniority level or years of experience, there has probably been a time when a bug has slipped through your checks and landed in production. After the fire has been put out, the question is always the same: why did we not catch it earlier? In this blog post, we will cover the most common reasons why this may have happened and share 5 lessons that can be learned after missing a critical bug in production.

TL;DR

30-second summary

Why do critical bugs reach production despite a working QA process, and what do teams learn when they do?

Edge cases are where production surprises hide. Scenarios involving boundary values, system limitations, timing issues, and unexpected user behaviour are often already known about, sensed by developers and QA engineers who have seen similar systems before. Monitoring user feedback is essential because edge cases frequently surface from real user interactions that are hard to anticipate in a controlled test environment.
Assumptions are one of the most persistent sources of missed bugs. Familiarity with a system breeds comfort, and comfort breeds the habit of assuming features will behave as expected because they always have before. Every assumption, however small, deserves a direct question: is this familiar because it was documented, or because it feels right? If the answer is unclear, verify with the team before the release.
Testing takes longer than it looks on paper and the buffer matters. Environment instability, failing builds, last-minute requirement changes, and unclear acceptance criteria all add up. When QA teams work under time pressure, focus narrows to core functionality and smaller issues get skipped. A small time buffer leaves room for exploratory testing and reduces the human errors that pressure produces.
Small changes have non-obvious reach in connected systems. A single change in one part of an application can affect other parts in ways that aren't immediately visible. Before retesting only the changed component, check with developers whether the modified methods or functions are reused elsewhere. That one step prevents the most common variant of this failure.
Test data is where the hidden production failures live. Test environments use data that is too clean, too consistent, and too predictable compared to production. Production users bring years of account history — incomplete fields from older versions, migrated data, and edge combinations that freshly created test accounts never replicate. Thinking outside the box about test data means thinking about what the system looked like three years ago, not just what it looks like today.

Bottom line: Most production issues are not caused by completely unknown problems; they are caused by things teams assumed were safe, obvious, or already covered. Production rarely fails because nobody tested it. It fails because something wasn't tested the way reality actually behaves.

Lesson #1: Never underestimate the importance of edge cases

An edge case is a scenario that rarely occurs in real-life user interactions. It is usually related to boundary values, system limitations, timing issues, or unexpected user behaviour in general.

A few real-world examples of edge cases:

A user enters special characters in a field meant for a credit card number: will the system show an error or crash?
Performing bank transactions with a slow internet connection — what would the system’s response be and will there be any funds lost or stuck in processing?
A user clicking the “Submit” button twice in quick succession — will it create two orders, will the second click be ignored, or will the system handle it effectively?
Two browser tabs are open, and the user is editing the same document in both — will changes from both tabs be saved, or will the user see an error message indicating the document is already being edited?
A user refreshes the page in the middle of an order being created — will it show a warning message, will the user be redirected without saving the order, or will the system crash?

Of course, these edge cases will differ for each system, but most are already known about, even if they have not been formally tested. In most projects, developers and QA engineers already sense the areas that could cause issues, or see them in user reports (if it is a public-facing system or app). That’s because, in a lot of cases, these edge cases come directly from real user interactions in production, where users behave in ways that are hard to predict or fully cover in test cases. Therefore, monitoring user feedback is very important, as it often uncovers issues that were not considered before.
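To make one of the examples above concrete, here is a minimal sketch of how the double-click on “Submit” case could be turned into an automated check. It assumes the system deduplicates requests with an idempotency key; the `OrderService` class, its methods, and the key value are hypothetical and only stand in for whatever your application actually does.

```python
import uuid


class OrderService:
    """Hypothetical in-memory order service, used only for illustration."""

    def __init__(self):
        self._orders_by_key = {}  # idempotency key -> order id

    def submit(self, idempotency_key, cart):
        # A repeated request with the same key returns the original order
        # instead of creating a second one.
        if idempotency_key in self._orders_by_key:
            return self._orders_by_key[idempotency_key]
        order_id = str(uuid.uuid4())
        self._orders_by_key[idempotency_key] = order_id
        return order_id

    def order_count(self):
        return len(self._orders_by_key)


def test_double_click_creates_one_order():
    service = OrderService()
    key = "form-session-123"  # assumed: the client generates this once per form load
    first = service.submit(key, cart=["book"])
    second = service.submit(key, cart=["book"])  # simulated second click
    assert first == second
    assert service.order_count() == 1
```

The same pattern (repeat the action, then assert on the resulting state) can be adapted to the page-refresh and two-tabs scenarios, although what counts as the correct outcome there depends on your requirements.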

Lesson #2: Instead of making assumptions, verify requirements


Assumptions are one of the most persistent sources of missed bugs. After working on a project for some time, we grow comfortable with the system. While that is a good thing, we can sometimes lose focus and start making assumptions. For example, we might assume that a specific feature will always behave in a specific way because it has “always worked like that”, or partially skip tests in areas that have never caused issues before. Similarly, when a new feature looks like one we’ve seen in other projects, we might assume it works the same way in certain respects. While this experience is valuable for exploratory testing, we cannot assume that something will work the same way just because it is the most common pattern.

In one project I worked on, a mobile application, there were various submission forms with several input fields, all with a 160-character limit. When a new, similar submission form was added, it was assumed that the same limit applied. Later on, we discovered that the actual limit was significantly higher: users started entering long sentences, which broke the UI layout and caused the “Continue” button to become hidden. As a result, users were unable to proceed to the next step and were stuck on the form. This was caught early in production, so no major harm was done, but it definitely changed the way we approach assumptions.

Since then, every time I make even the smallest assumption, I ask myself: have I read about this requirement before, and that’s why it feels familiar, or is it just my intuition? If I cannot answer clearly, I double-check with the team, so that there is one less thing to worry about after the release.
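One way to make this habit harder to skip is to have test inputs come only from limits that were actually confirmed. The sketch below is a rough illustration, not a description of the project above: the form names and the `DOCUMENTED_LIMITS` table are hypothetical, and only the 160-character value is taken from the story.

```python
# Hypothetical: limits copied from the written requirements, per form.
DOCUMENTED_LIMITS = {
    "feedback_form": 160,
    "support_form": 160,
    # "new_form" is intentionally absent: nobody has confirmed its limit yet.
}


def boundary_inputs(form_name):
    """Return test strings at and just past the documented limit.

    Raises KeyError for forms without a documented limit, so an unverified
    assumption fails the test run instead of passing silently.
    """
    if form_name not in DOCUMENTED_LIMITS:
        raise KeyError(
            f"No documented character limit for {form_name!r}; confirm with the team"
        )
    limit = DOCUMENTED_LIMITS[form_name]
    return {
        "at_limit": "x" * limit,
        "over_limit": "x" * (limit + 1),
    }
```

Calling `boundary_inputs("new_form")` fails until someone adds a confirmed value, which turns “it is probably 160 like the others” into an explicit question for the team.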

Assumptions are cheaper to catch before a release than after one.


Lesson #3: Allocate slightly more time for testing than you think you need

On paper, the testing process often looks straightforward and fairly easy to estimate: execute test cases, run regression, and check the fixes. In reality, it is rarely that predictable. There is almost always something that takes longer than expected. Test environments can become unstable or break, test data needs updating, builds can fail unexpectedly, and last-minute requirement clarifications or small changes can appear and take more time than planned. When the QA team is under time pressure, the focus narrows to core functionality, so smaller things get skipped and it’s easier to miss something. Therefore, it is always important to communicate within the team and try to adjust the allocated testing timeframe when new features have been added and more testing time is needed.

Another thing I have noticed over time is that even when everything looks “ready for testing”, the actual start can often be delayed by small things that don’t seem like a big deal individually, but together they add up. For example, setting up the latest build, unclear acceptance criteria, and last-minute requirement changes can take up more time than we would expect. That’s why having a bit of extra buffer really helps. It takes away some of the pressure, leaves room for some exploratory testing, and reduces human error.

Lesson #4: Don’t underestimate the impact of small changes on the system


Most applications are highly connected. A single change in one place can easily affect other parts of the system in ways that are not immediately obvious. Because of this, the real impact is often not visible at first glance.

In one of the projects I worked on, there was a change in a form, where the “Continue” button was updated so that it would only be enabled (ready to be clicked) when all the mandatory fields were filled out correctly. This change did not seem very significant at the time, so we mostly just retested this form. Everything worked as expected, and nothing seemed unusual. It was only during regression testing that we discovered that the same button logic was reused in two other forms where not all fields are mandatory. If we had not found this issue, users would have been stuck with a disabled “Continue” button, unable to submit those forms.

Since then, when functional changes are being introduced, we communicate with developers to check whether the methods or functions being changed are reused anywhere else in the system, so we don’t accidentally miss related areas during testing. This small step really helped us understand the actual scope of a change and reduced the chances of similar issues happening again.
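As a rough sketch of what this looks like in a test suite, the check below runs the same assertions against every form that reuses the shared logic, rather than only the form that was changed. The form names, field names, and the `continue_enabled` function are all hypothetical and are not taken from the project described above.

```python
# Hypothetical form definitions: every form that reuses the shared check.
FORMS = {
    "registration": {"fields": ["name", "email", "phone"], "mandatory": ["name", "email", "phone"]},
    "feedback": {"fields": ["email", "comment"], "mandatory": ["comment"]},
    "newsletter": {"fields": ["email", "name"], "mandatory": ["email"]},
}


def continue_enabled(form_name, values):
    """Shared check: enabled once every *mandatory* field has a non-blank value."""
    mandatory = FORMS[form_name]["mandatory"]
    return all(values.get(field, "").strip() for field in mandatory)


def check_all_forms():
    """Run the same two checks against every form that reuses the logic."""
    results = {}
    for name, spec in FORMS.items():
        only_mandatory = {field: "value" for field in spec["mandatory"]}
        results[name] = (
            continue_enabled(name, only_mandatory)  # optional fields left empty
            and not continue_enabled(name, {})      # nothing filled in
        )
    return results
```

Keeping the list of forms in one place means that a newly discovered call site only needs to be added once for the regression check to cover it.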

Lesson #5: Test data can make or break your testing

In practice, we often end up working with test data that is either too clean, too controlled, or too predictable compared to production. Even when we try to simulate real scenarios, it is still very difficult to fully replicate things like long-term user behavior or edge combinations created over time. I’ve seen situations where everything worked perfectly during testing because the data was prepared to match the current version of the system and was consistent across all systems. But once the feature reached production, users with older, incomplete, partial or migrated data started experiencing issues that never appeared in testing.

That is why test data is not just the accepted or expected inputs. It is data from three years ago, when some fields that are mandatory today were optional, or the exact opposite. It is the user accounts that have been migrated several times and the entries that would not match the requirements of the latest version. So, it is useful to think ‘outside of the box’ when it comes to test data. After all, a feature might work perfectly with freshly created accounts, but production users are rarely fresh accounts. They bring years of history with them, and sometimes that history is exactly where the bugs are hiding.
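A small sketch of what such test data might look like in practice is shown below. The account shapes, field names, and the `contact_label` function are made up for illustration; the point is only that the function under test gets exercised with records shaped like older versions of the system, not just fresh ones.

```python
# Hypothetical account shapes; field names are made up for illustration.
FRESH_ACCOUNT = {"id": 1, "email": "new@example.com", "phone": "+10000000000", "schema_version": 3}

LEGACY_ACCOUNTS = [
    # Created when "phone" was optional: the key is missing entirely.
    {"id": 2, "email": "old@example.com", "schema_version": 1},
    # Migrated once: the field exists but was back-filled with None.
    {"id": 3, "email": "migrated@example.com", "phone": None, "schema_version": 2},
    # Migrated with empty strings instead of real values.
    {"id": 4, "email": "", "phone": "", "schema_version": 2},
]


def contact_label(account):
    """Example function under test: must not crash on incomplete records."""
    phone = account.get("phone") or "no phone on file"
    email = account.get("email") or "no email on file"
    return f"{email} / {phone}"


def run_against_all_shapes():
    """Exercise the function with fresh and legacy data alike."""
    return [contact_label(a) for a in [FRESH_ACCOUNT, *LEGACY_ACCOUNTS]]
```

A missing key, a `None`, and an empty string are three different failure modes, so it is worth keeping one legacy record for each rather than a single “old account” fixture.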

Final thoughts

Missing a critical bug in production is never a nice experience, but it tends to be one of those moments where you reevaluate how you work. Looking back at these lessons, they all point in the same direction. Namely, most production issues are not caused by completely unknown problems, but by things we assumed were safe, obvious, or already covered.

What brings all of this together is the idea that testing is not just about executing cases, but about always questioning what we think we already know. The moment we stop questioning, we start relying too much on experience, intuition, or a “this should be fine” mindset, and that’s usually where production surprises come from.

And if there is one takeaway from all of this, it is this one: production rarely fails because nobody tested it thoroughly. It fails because something wasn’t tested the way reality actually behaves.

FAQ

Most common questions

Why do critical bugs reach production even when a QA process is in place?

Most production failures are not caused by unknown problems; they are caused by things teams assumed were safe, obvious, or already covered. Common root causes include edge cases that were sensed but never formally tested, assumptions about feature behaviour based on familiarity rather than documented requirements, testing time compressed by environment instability or last-minute changes, small code changes whose reach across connected systems was underestimated, and test data too clean and consistent to replicate the history real production users bring with them. The moment a team stops questioning what they think they already know, they start relying on intuition, and that is where production surprises originate.

How should QA teams approach edge case testing?

Edge cases should be treated as known unknowns rather than exotic scenarios. In most projects, developers and QA engineers already sense the areas likely to cause issues based on prior experience and user reports. Monitoring user feedback from production is one of the most effective sources of edge case discovery because real users behave in ways that are genuinely hard to anticipate in a controlled environment. Formal edge case categories worth covering systematically include boundary values, timing issues like double-clicks and concurrent sessions, unexpected sequences like mid-process page refreshes, and system limitation boundaries like character limits and concurrent user thresholds.

What is the most effective way to manage assumptions in QA testing?

Treat every assumption as a question that hasn't been answered yet. When a feature feels familiar, either because it resembles something tested in a previous project, or because the team has always seen it behave a certain way, the right question to ask is whether that familiarity comes from documented requirements or from intuition. If the answer is unclear, verify with the team before the release. The cost of a brief clarification is always lower than the cost of a production bug caused by an assumption that turned out to be wrong.

How do small code changes cause unexpected production failures?

Most applications are highly connected. A change in one component can affect other parts of the system in ways that are not immediately visible from the change itself. The most common pattern is a shared method or function being modified for one use case while its other uses go untested. The fix is communicating with developers before testing begins to understand whether the modified code is reused elsewhere, so the actual scope of regression testing reflects the real impact of the change rather than just its visible surface area.

Why does test data quality matter so much for catching production bugs?

Test environments typically use data that is too clean, too controlled, and too predictable compared to what production users actually have. Production users bring years of account history, like fields that weren't mandatory in older versions, accounts migrated multiple times, and data combinations that freshly created test accounts never replicate. A feature can work perfectly with new test data and fail consistently for long-standing production users whose data predates the current system requirements. Effective test data strategy means thinking about what the system and its data looked like two or three years ago, not just what it looks like today.

Recognize any of these lessons in your own process? Let's talk about what's still slipping through.

We help engineering teams build testing practices that catch what assumptions, time pressure, and clean test data tend to miss.