Six Ways to Prevent Network Outages in 2023

system won’t break (since we didn’t make a change to it). That means not enough attention is given to BGP, TCP configuration, DNS, SSL certificates, the networks our data travels along, or indeed any of the points of failure in the infrastructure we rarely alter. Because the cloud has abstracted much of the underlying network away from ops, network, and dev teams, this problem has been compounded. That abstraction can make it much harder to perceive a problem at all, and when one of these fundamental components does fail, we are caught by surprise. To avoid this, you need to continuously monitor across the Internet stack using Internet Performance Monitoring (IPM) and put a plan in place so that when an issue does occur, you’re prepared. This brings us to point four.
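
As one illustration of watching a component that rarely changes, the following is a minimal Python sketch that reports how many days remain before a TLS certificate expires. It is not a substitute for an IPM product; the function names and the port and timeout defaults are assumptions made for this example, and only standard-library calls are used.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan 10 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.

    Hypothetical helper: the port and timeout defaults are assumptions.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `fetch_not_after` for each hostname you depend on and raise an alert when `days_until_expiry` falls below a threshold your team chooses.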

Implement a proper change management process

Most outages are the result of a change in code or configuration, either performed manually by an employee or the accidental result of automation. Either way, something changed in the system that led to failure. The easiest way to prevent this would be to make no changes at all. However, businesses (and the systems that support them) must change in order to scale and grow.

Because change will occur, and failure with it, the best thing you can do is institute a rigorous change management system. What does this mean? First, ensure your team members know exactly what to do if a failure happens due to a change, and where to turn (under high pressure) if they don’t. This involves forward planning and testing so that if a change breaks a system, there is a plan in place for the team to follow. As noted in point one, every moment spent determining the root cause of an issue costs money. Second, track every change. Third, test every change before it is deployed. Fourth, while implementing a change, continuously monitor key services, transactions, and outputs so that you understand any negative impact the change is having and can address it immediately.
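
The fourth step, monitoring while a change is rolled out, can be sketched roughly as follows. This is a minimal Python example under stated assumptions: the `HealthSample` type, the function names, and the two-percentage-point threshold are all hypothetical, and in practice the samples would come from whatever monitoring system you already use.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Request and error counts for one monitoring interval (hypothetical shape)."""
    requests: int
    errors: int


def error_rate(samples: list[HealthSample]) -> float:
    """Overall fraction of failed requests across the samples."""
    total = sum(s.requests for s in samples)
    if total == 0:
        return 0.0
    return sum(s.errors for s in samples) / total


def should_roll_back(
    before: list[HealthSample],
    during: list[HealthSample],
    max_increase: float = 0.02,  # assumed threshold: 2 percentage points
) -> bool:
    """True if the error rate during the change exceeds the pre-change
    rate by more than `max_increase`."""
    return error_rate(during) - error_rate(before) > max_increase
```

For example, if the error rate was 0.5% before the change and rises to 5% during it, `should_roll_back` returns `True`, which is the signal to follow the rollback plan agreed on in advance.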

Develop an observability plan beyond logs and tracing

Observability is about more than just logs and tracing. It’s also about baselining and understanding trends over time. Ensure you are not only looking at information during and after an incident, but also drawing on observability to analyze trends, for instance, by establishing baselines before an outage hits so that you know what to compare against. This might include understanding (i) how the time of day or week impacts the performance of your application or service; (ii) how long a DNS lookup takes with your DNS vendor, so that if they schedule a maintenance window and update the system, you have a benchmark to compare against; or (iii) whether a network device starts dropping connections or adding latency to each packet after you update its firmware.
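
To make the DNS example concrete, here is a minimal Python sketch of building a baseline from lookup timings and flagging a later measurement that deviates from it. The function names, the median-plus-three-standard-deviations rule, and the tolerance value are assumptions chosen for illustration. Note also that `socket.getaddrinfo` goes through the local resolver and any caches, so it measures what your host experiences rather than the vendor's authoritative servers directly.

```python
import socket
import statistics
import time


def time_dns_lookup(hostname: str) -> float:
    """Time one lookup through the system resolver, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000


def build_baseline(samples_ms: list[float]) -> dict[str, float]:
    """Summarize historical timings as a median and a population std deviation."""
    return {
        "median_ms": statistics.median(samples_ms),
        "stdev_ms": statistics.pstdev(samples_ms),
    }


def deviates(sample_ms: float, baseline: dict[str, float], tolerance: float = 3.0) -> bool:
    """True if a new timing exceeds median + tolerance * stdev (assumed rule)."""
    threshold = baseline["median_ms"] + tolerance * baseline["stdev_ms"]
    return sample_ms > threshold
```

Collecting samples with `time_dns_lookup` at regular intervals before a vendor's maintenance window gives you the benchmark; running `deviates` on measurements taken afterwards shows whether the update changed lookup times.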

Practice (and then practice again)

This is perhaps the most important lesson of all. You cannot be proactive enough in preparing for when the next outage will hit. Many of the teams we saw experience outages over the last 18 months were not prepared. This meant that when an issue happened, it took too long to identify what the issue was and then pinpoint the cause, making the Mean Time to Repair (MTTR) far longer than it could have been.
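
One way to see whether practice is paying off is to track MTTR across drills and real incidents. The following is a minimal Python sketch, assuming each incident is recorded as a hypothetical pair of timestamps (when it was detected and when it was resolved); your own incident records may define these boundaries differently.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents.

    Each incident is a (detected, resolved) pair; this record
    shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

Comparing this figure before and after a round of practice exercises gives a rough indication of whether identification and diagnosis are getting faster.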

A final note of caution. Many of us are now reliant on the cloud not just for hosting our infrastructure, but also for services that our developers would previously have coded and maintained themselves. When a key service of a major cloud provider goes down, the ripple effect across other products and companies can lead to a massive chain of failures. We saw this in November 2021 with Google Cloud and again in December 2021 with AWS’ trifecta of outages. We may not even realize that we could be the downstream victim of another company’s failure, but our shared dependence on a handful of key vendors for important services like hosting or DNS makes it essential that we plan not only for our own failures, but also for those of the third parties that underpin our services and applications. Remember, with careful planning, rigorous monitoring and observability practices, and a thorough change management plan, even failures that feel beyond your control can be resolved quickly with minimal business impact.

