Most outages are the result of a change in code or configuration, whether made manually by an employee or introduced accidentally by automation. Either way, something changed in the system that led to failure. The easiest way to prevent this would be to make no changes at all. However, as businesses (and the systems that support them) know, change is necessary to scale and grow.
Because change will happen, and failure with it, the best thing you can do is institute a rigorous change management system. What does this mean? First, ensure your team members know exactly what to do if a change breaks something, and where to turn (under high pressure) if they don't. This takes forward planning and testing, so that if a change does break a system, the team has a plan to follow. As point one made clear, every moment spent determining the root cause of an issue costs money. Second, track every change. Third, test every change before it is deployed. Fourth, while a change is rolling out, continuously monitor key services, transactions, and outputs so that any negative impact shows up immediately and can be addressed just as quickly.
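To make that fourth point concrete, here is a minimal sketch of watching a key endpoint while a change rolls out. The health URL, baseline, and thresholds are hypothetical placeholders; in practice you would wire a check like this into your deployment pipeline and existing monitoring rather than run a standalone script.

```python
"""Minimal sketch: watch a key endpoint while a change rolls out.

The endpoint URL, latency baseline, and thresholds below are hypothetical
placeholders -- substitute the services and baselines that matter to you.
"""
import time
import urllib.request
import urllib.error

HEALTH_URL = "https://example.com/healthz"   # hypothetical service endpoint
BASELINE_LATENCY_S = 0.25                    # pre-change baseline you measured
MAX_REGRESSION = 2.0                         # flag if latency doubles
CHECK_INTERVAL_S = 15
CHECKS = 20                                  # roughly 5 minutes of post-deploy watching


def check_once():
    """Return response latency in seconds, or None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status != 200:
                return None
    except (urllib.error.URLError, TimeoutError):
        return None
    return time.monotonic() - start


def watch_deployment():
    """Poll the endpoint after a change; return False if it degrades."""
    for _ in range(CHECKS):
        latency = check_once()
        if latency is None:
            print("Health check failed -- consider rolling back.")
            return False
        if latency > BASELINE_LATENCY_S * MAX_REGRESSION:
            print(f"Latency {latency:.3f}s exceeds baseline -- investigate or roll back.")
            return False
        print(f"OK: {latency:.3f}s")
        time.sleep(CHECK_INTERVAL_S)
    return True


if __name__ == "__main__":
    watch_deployment()
```

The design choice here is deliberate: the script compares against a baseline you recorded before the change, not an absolute number, which is exactly the habit the next lesson is about.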
Observability is about more than just logs and tracing. It's also about baselining and understanding trends over time. Ensure you are not only looking at telemetry during and after an incident, but also drawing on observability to analyze trends, establishing baselines before an outage hits so you know what normal looks like. This might include understanding (i) how the time of day or day of week affects the performance of your application or service; (ii) how long a DNS lookup takes with your DNS vendor, so that if they schedule a maintenance window and update their system, you have a benchmark to compare against; or (iii) whether a firmware update on your network devices is now dropping connections or adding latency to each packet sent.
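To illustrate the DNS example in (ii), here is a minimal sketch of recording a lookup baseline. The hostname and sample count are hypothetical, and the operating system's resolver cache can flatter these numbers, so a dedicated synthetic-monitoring probe run on a schedule (and from multiple vantage points) gives a cleaner trend line.

```python
"""Minimal sketch: record a DNS lookup baseline for a hostname.

The hostname and sample count are hypothetical; in practice you would run
this on a schedule and store the results so you have a trend to compare
against after a vendor maintenance window or a firmware change.
"""
import socket
import statistics
import time

HOSTNAME = "www.example.com"   # hypothetical host served by your DNS vendor
SAMPLES = 20


def time_dns_lookup(hostname):
    """Return how long a single name resolution takes, in milliseconds."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, 443)
    return (time.monotonic() - start) * 1000


def collect_baseline():
    timings = []
    for _ in range(SAMPLES):
        try:
            timings.append(time_dns_lookup(HOSTNAME))
        except socket.gaierror:
            # Failures are worth capturing as part of the baseline too.
            print("Lookup failed")
        time.sleep(1)
    if timings:
        print(f"median={statistics.median(timings):.1f} ms  max={max(timings):.1f} ms")


if __name__ == "__main__":
    collect_baseline()
```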
This is perhaps the most important lesson of all. You cannot be proactive enough in preparing for when the next outage will hit. Many of the teams we saw experience outages over the last 18 months were not prepared. When an issue did happen, it took too long to identify what was broken and then pinpoint the cause, making the Mean Time to Repair (MTTR) far longer than it needed to be.
A final note of caution. Many of us are now reliant on the cloud not just for hosting our infrastructure, but also for services that our developers would previously have coded and maintained. When a key service of a major cloud provider goes down, the ripple effect across other products and companies can lead to a massive chain of failures. We saw this in November 2021 with Google Cloud and again with AWS' trifecta of outages in December 2021. We may not even realize that we could be the downstream victim of another company's failure, but our shared dependence on a handful of key vendors for important services like hosting or DNS makes it essential that we plan not only for our own failures, but also for those of the third parties that underpin our services and applications. Remember, with careful planning, rigorous monitoring and observability practices, and a thorough change management plan, what can feel beyond your control can actually lead to fast resolution with minimal business impact.
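As a closing illustration of planning for third-party failure, here is a minimal sketch that probes your own service and a key vendor endpoint side by side, so that when something breaks you can quickly tell whose failure you are looking at. Both URLs are hypothetical placeholders for whatever your stack actually relies on (hosting, DNS, auth, payments, and so on).

```python
"""Minimal sketch: probe your own service and a key third-party dependency
side by side, to speed up attributing an incident to you or to them.
"""
import urllib.request
import urllib.error

PROBES = {
    "our-api": "https://api.example.com/healthz",           # hypothetical
    "vendor-auth": "https://auth.vendor-example.com/ping",  # hypothetical third party
}


def probe(url):
    """Return a short status string for one endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.URLError as exc:
        return f"unreachable ({exc.reason})"


if __name__ == "__main__":
    for name, url in PROBES.items():
        print(f"{name:12s} {probe(url)}")
```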