Network Reliability and GenAI

By: Mark Cummings, Ph.D., Mark Daley

The CrowdStrike BSOD (Blue Screen of Death ― Windows crash and failed reboot) incident is a warning of things to come. It is a symptom of how heavily we rely on networks that are increasing in scale, complexity, and volatility. The situation is only going to get worse with the wide-scale deployment of GenAI. To respond effectively, innovative ways to monitor, prevent, and fix problems will need to be developed and deployed.

The CrowdStrike Incident in Perspective

The old Telco wired networks were designed to a “5 nines” reliability standard, meaning they would run reliably 99.999 percent of the time. To state it in an oversimplified way, in 1 million minutes (roughly 694 days) of operation, a network running to a 5 nines standard would have no more than about 10 minutes of downtime. In the CrowdStrike incident, downtime ranged from hours to days, and in the case of Delta Airlines over a week. If the incident were the only CrowdStrike outage, and CrowdStrike still met the 5 nines reliability standard, even a single day of downtime would have required roughly 100,000 days (about 274 years) of otherwise continuous operation; a week of downtime would require on the order of 700,000 days (nearly 2,000 years).
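
A quick back-of-the-envelope calculation makes that gap concrete. The sketch below (the outage durations are illustrative assumptions, not CrowdStrike data) computes the downtime a 5 nines target allows over a period, and the total operating time needed to absorb a single long outage:

```python
# Illustrative availability arithmetic for an "N nines" target.
# Outage durations below are assumptions for illustration only.

FIVE_NINES = 0.99999  # 99.999% availability

def allowed_downtime_minutes(total_minutes: float, availability: float) -> float:
    """Maximum downtime permitted over a period at a given availability."""
    return total_minutes * (1.0 - availability)

def required_uptime_days(outage_minutes: float, availability: float) -> float:
    """Total operating time (in days) needed so that a single outage of the
    given length still satisfies the availability target."""
    total_minutes = outage_minutes / (1.0 - availability)
    return total_minutes / (60 * 24)

# ~10 minutes of downtime allowed per 1 million minutes (~694 days):
print(allowed_downtime_minutes(1_000_000, FIVE_NINES))   # 10.0

# A single one-day outage needs ~100,000 days (~274 years) of operation:
print(required_uptime_days(24 * 60, FIVE_NINES))          # 100000.0

# A week-long outage (as with Delta) needs ~700,000 days (~1,900 years):
print(required_uptime_days(7 * 24 * 60, FIVE_NINES))      # 700000.0
```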

CrowdStrike didn’t create a problem in its own product. Rather, it created a problem in others’ products. It can be argued that the CrowdStrike product itself worked fine throughout the whole incident.

The effects of the problem were widespread and concerning. In addition to the disruption of airlines, emergency services such as 911 in the U.S. and similar services around the world went down. Hospitals and emergency rooms had trouble providing services. Delivery services were impacted. Multimodal freight transportation systems were affected. Effects rippled across networks, even those that were not running affected computers. We were lucky it happened leading into a weekend in summer vacation season. If it had started at the beginning of a week in a busy time of year, things could have been much worse.

The big problem is that CrowdStrike is not the only infrastructure software product in our networks. A friend of mine did an analysis of a typical online banking transaction and found that there were 176 software products involved in completing it. How many software products are involved when pumping gas into your car? In having water come out of your tap when you turn on the faucet? Or in that loaf of bread being on a store shelf when you reach for it?
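
To see why the sheer number of products matters, consider a deliberately simplified model in which the transaction completes only if every one of the 176 products is up, and failures are independent. Under those assumptions (illustrative only, not a claim about any real banking stack), the chain's overall availability falls well below that of any single product:

```python
# Illustrative sketch: composite availability when a transaction depends on
# many software products in series. The independence assumption and the
# per-product figures are simplifications for illustration only.

def composite_availability(per_product: float, n_products: int) -> float:
    """Availability of a chain that fails if any one product fails."""
    return per_product ** n_products

N = 176  # products reportedly involved in one online banking transaction

for label, a in [("5 nines", 0.99999), ("4 nines", 0.9999)]:
    overall = composite_availability(a, N)
    downtime_min_per_year = (1 - overall) * 365.25 * 24 * 60
    print(f"{label} per product -> {overall:.5f} overall "
          f"(~{downtime_min_per_year:.0f} min/year of downtime)")

# Even if every product individually meets 5 nines, the chain as a whole
# drops to roughly 99.82% availability, about 15 hours of downtime a year.
```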

Our quality of life is built on an ever-growing number of very large-scale, very complex, and very volatile systems. A problem with any one of the many thousands of software elements under the control of a profusion of different companies could do far more damage than what happened in the CrowdStrike incident. This is why the CrowdStrike incident is a concerning harbinger of things to come.

There are clear cybersecurity issues involved in keeping these systems running. The focus here, however, will be on understanding what the problem was in the CrowdStrike incident, how GenAI may exacerbate these types of problems as time goes on, and what can be done to avoid them.

The CrowdStrike Problem

CrowdStrike provides a security product that runs on Windows computers. The product is a system based on patterns (called signatures) of previous attacks; that is, attacks that have occurred, been detected, and for which signatures have been developed. There are so many kinds of these attacks, often called zero-day attacks, that CrowdStrike has to send out signature updates every two hours. This fact alone indicates the volatility of today’s networks. Given that new attack types can proliferate across very many targets in seconds, and do significant damage in minutes, my guess is that CrowdStrike’s customers would like updates even more frequently.

CrowdStrike checks the content of the updates to make sure they are okay. Because of the way Microsoft Windows is set up, these updates have to go into the kernel (the foundation of the OS). What happened in this incident was that the update file had an incorrect file name extension. Checking the contents did not, and never would have, detected the error. When the update was received, it was interpreted (because of the file extension) as a code update, not signature data. Loading it into the kernel as code caused the OS to crash in a manner that was not easily fixed. As a result, the affected computers kept trying to reboot and kept failing. It took a lot of manual effort (often requiring privileged physical access to the machines) over a significant amount of time to get all the computers involved up and running again.
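
The description above points to a general failure mode: the file's contents are validated, but the decision about how to treat the file is driven by its name. The hypothetical sketch below (the file names, extensions, formats, and checks are invented for illustration and are not CrowdStrike's actual code) shows how valid data with the wrong extension can pass a content check and still be routed down the code path:

```python
# Hypothetical sketch of the failure mode described above: content is
# validated, but the loader decides how to treat a file by its extension.
# File names, extensions, and checks are invented for illustration only.

import json

def validate_signature_content(payload: bytes) -> bool:
    """Content check: is this well-formed signature data?"""
    try:
        json.loads(payload)
        return True
    except ValueError:
        return False

def load_update(filename: str, payload: bytes) -> str:
    """Dispatch on the file extension, not on the validated content."""
    if filename.endswith(".sig"):
        signatures = json.loads(payload)          # treated as signature data
        return f"loaded {len(signatures['signatures'])} signatures"
    elif filename.endswith(".drv"):
        # Treated as executable code destined for the kernel driver.
        # Valid signature data sent down this path crashes rather than
        # degrading gracefully.
        return "loaded as kernel code"
    raise ValueError("unknown update type")

payload = b'{"signatures": ["pattern-1", "pattern-2"]}'
assert validate_signature_content(payload)        # the content check passes...
print(load_update("update-0042.drv", payload))    # ...but the wrong extension
                                                  # routes valid data down the
                                                  # code path
```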

We may never know for sure why there was an error in the file extension name. It could have been a human error (a fat-finger problem), a system error in the chain from file creation to file transmission, a system error in file preparation, or something else. What is clear is that it was an unexpected error type. Most testing tools look for types of errors that have been detected previously (known unknowns); unexpected error types (unknown unknowns) are not caught. The CrowdStrike error was unexpected; therefore, there were no control measures in place for the possibility of it occurring.

What is the probability of an unexpected type of error occurring? We know from the CrowdStrike incident that it is not zero. Moreover, the greater the scale, complexity, and volatility of our networks, the higher the probability of encountering unexpected error types.


