How network teams can be proactive to avoid outages

Feb. 9, 2023
No single solution can guarantee business continuity, but having preparedness strategies is critical

According to Microsoft, a change to the Microsoft Wide Area Network (WAN) left Microsoft services including Teams, Outlook, Microsoft 365, and Xbox Live inaccessible to users around the globe.

It’s incredible that even today, the simplest configuration change, or even a typo, can cause a ripple effect that brings down a network or disrupts a supposedly fault-tolerant business service. No one is immune, not even tech giants like Microsoft.

Modern, enterprise-class networks are incredibly complex and, at times, shockingly sensitive to changes affecting core network services like the Domain Name System (DNS), traffic filtering, and Border Gateway Protocol (BGP) routing. For this reason, many organizations have historically resisted core network changes with fervor. However, today’s businesses require agile networks capable of ingesting change at wire speed. Furthermore, given the pace at which new Common Vulnerabilities and Exposures (CVEs) and operating system (OS) bugs are identified, deploying OS updates and patches has become a never-ending project for most organizations operating at scale.

Finding the Root Cause

In many cases, an outage like this one may not begin immediately after the configuration change is made, which can make it difficult to correlate the outage with the change during root cause analysis. While many news reports have keyed in on the fact that a configuration change caused such a widespread outage, the real headline is that it took Microsoft four hours to restore service.

A few years ago, a coworker and I scheduled a change control window for a Saturday, bought a few cases of Mountain Dew, and then set about changing some parameters for the routing protocol controlling our Campus Area Network (CAN). We expected a simple change and had packed the caffeine to stay up late and do some follow-on DNS server swap-outs. Within half an hour or so of making the changes, our Network Management System (NMS) began notifying us that sites were intermittently going offline. It turned out we’d stumbled upon a previously unknown bug that effectively set our Time to Live (TTL) to three hops.

Luckily, we were able to roll back to the previous configurations from backups on all of our Layer 3 devices and rapidly restore service.

While the length of the Microsoft WAN outage sounds inordinate, we don’t have more technical detail about the cause of the outage or, more specifically, about the extenuating circumstances that extended the time it took to restore service. So rather than pass judgment, I will just honestly say: I’ve been there. In one case, I specifically remember having to put a network engineer on an airplane in order to restore service to a part of the network that had become isolated due to configuration issues.

Once, while working as a network engineer for a large telecommunications company, I was brought in to help when a network outage was taking an exceptional amount of time to resolve. In that case, one of the key engineers responsible for that part of the network had recently been laid off, and unfortunately, much of the knowledge required to restore service existed only in that person’s head. Could a similar situation have contributed to the duration of this outage? Might recent news of layoffs and extended WAN outages be related?

Downtime Not an Option

The cost of that extended downtime, in both real dollars and reputational damage, is nearly incalculable, and by the company's own admission, Microsoft is "still reviewing telemetry to determine the next troubleshooting steps." To put this into perspective, most network teams shoot for five or six nines of availability, and in many cases, network operations teams are bonused at least partially on network reliability. Five nines of availability, or 99.999% uptime, allows for roughly five minutes of downtime per year. This single outage lasted nearly 50 times that long.
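To see why: a year contains 525,600 minutes, and 0.001% of that works out to about 5.3 minutes of allowable downtime, so a roughly four-hour (240-minute) outage burns through about 45 years’ worth of a five-nines budget.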

After working as a network engineer for 30 years, I do have some thoughts on how network teams can be proactive to avoid a similar disaster:

  • Accelerate the resolution of difficult technical problems by ensuring that there is solid documentation and up-to-date network maps.
  • Implement continuous, automated configuration auditing and remediation to ensure that network and security devices are up to date and compliant with operational policies and industry standards. Automation delivers the frequency and consistency that manual checks can’t, which in turn lowers the risk of an outage (a simple illustration follows this list).
  • Automate network and security device configuration backups so that you can instantly restore when needed. At a minimum, your automation platform should create backups daily, as well as before and after changes, and store a long history of backups within an autoscaling, fault-tolerant data store (the second sketch below shows the pre/post-change pattern).
  • Architect a strategy for frequent, automated OS updates and patches. You should be able to reliably conduct upgrades at scale, even when they require at least mildly complex workflows.
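
To make the auditing idea concrete, here is a minimal sketch in Python that checks saved device configurations against a couple of illustrative policy rules. The rules, file names, and directory layout are hypothetical placeholders; a real deployment would draw its checks from your own operational standards and remediate findings rather than just report them.

    import re
    from pathlib import Path

    # Illustrative policy only: patterns that must appear and patterns that must not.
    REQUIRED = [r"^service password-encryption", r"^ntp server \S+"]
    FORBIDDEN = [r"^ip http server", r"^snmp-server community public"]

    def audit_config(text: str) -> list[str]:
        """Return human-readable findings for a single device configuration."""
        findings = []
        for pattern in REQUIRED:
            if not re.search(pattern, text, re.MULTILINE):
                findings.append(f"missing required line matching: {pattern}")
        for pattern in FORBIDDEN:
            if re.search(pattern, text, re.MULTILINE):
                findings.append(f"contains forbidden line matching: {pattern}")
        return findings

    if __name__ == "__main__":
        # Assumes nightly backups are saved as backups/<hostname>.cfg
        for cfg in sorted(Path("backups").glob("*.cfg")):
            findings = audit_config(cfg.read_text())
            print(f"{cfg.stem}: {'OK' if not findings else '; '.join(findings)}")

Run against your backup repository on a schedule and feed the findings into your ticketing or remediation workflow so drift gets corrected, not just noticed.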
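
Similarly, here is a hedged sketch of the pre- and post-change backup pattern, using the open-source netmiko library with placeholder device details and credentials; a production platform would add scheduling, an inventory source, secure credential handling, and a fault-tolerant store for the backup history.

    import difflib
    from datetime import datetime
    from pathlib import Path

    from netmiko import ConnectHandler  # open-source multivendor SSH library

    # Hypothetical device record; in practice this comes from your inventory system.
    DEVICE = {
        "device_type": "cisco_ios",
        "host": "192.0.2.10",        # documentation-range example address
        "username": "backup-user",
        "password": "change-me",     # use a vault or keyring, not plain text
    }

    def backup_running_config(label: str) -> Path:
        """Pull the running config and save it to a timestamped file."""
        with ConnectHandler(**DEVICE) as conn:
            config = conn.send_command("show running-config")
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        path = Path("backups") / f"{DEVICE['host']}-{label}-{stamp}.cfg"
        path.parent.mkdir(exist_ok=True)
        path.write_text(config)
        return path

    if __name__ == "__main__":
        before = backup_running_config("pre-change")
        # ... perform the change inside the maintenance window ...
        after = backup_running_config("post-change")
        # Print exactly what changed so the rollback point is unambiguous.
        diff = difflib.unified_diff(
            before.read_text().splitlines(),
            after.read_text().splitlines(),
            fromfile=before.name,
            tofile=after.name,
            lineterm="",
        )
        print("\n".join(diff))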

No single tool or approach can guarantee business continuity, but there are ways to be prepared if the worst does happen. In this case, an ounce of prevention may very well be worth 50 pounds of cure. 

About the author: Josh Stephens is Chief Technology Officer at BackBox, the most trusted network automation platform focused on automating network security and operations at scale. His decades of experience include serving in the U.S. Air Force as a network engineer and cybersecurity specialist, building highly complex networks for global banks, airports, and major enterprises for International Network Services (INS), and being among the founding team members of SolarWinds as well as an early team member at Itential.