
Network Reliability and GenAI



GenAI and Future Problems

We are now introducing GenAI into this environment of scale, complexity, and volatility. Introducing anything fundamentally new to such an environment increases the probability of encountering unexpected error types. The introduction of GenAI, by its nature and in its current state of evolution, increases that probability even more, because GenAI is being used in software development, parameter setting, and other areas that affect the operation of our networks.

In software development, GenAI has three characteristics that, in combination, make it much more likely to produce unexpected error types. GenAI will:

  • Make errors that are very different from the types that human coders make.
  • Hallucinate in the software development process.
  • Make those hallucinations appear plausible on inspection.

Experience with previous forms of automation indicates that GenAI will produce many previously unknown problems, due to errors, or types of errors, that human software developers have not made before and are not likely to make. The hallucination problem is well known, but its subtlety is the more troublesome part. As a result, and along with other quality concerns, it is generally agreed that GenAI-developed software must be reviewed by people. But as time pressures, technical pressures from increased complexity, and attempts to control costs mount, the temptation to trust GenAI unreviewed will be hard to resist. And on cursory inspection, its output always looks good. That’s the problem.

The same factors are at work in parameter setting and other kinds of network operations. For example, before GenAI, we saw an incident in which a fat-finger error in manual operations brought down Google’s S1 network in the Northeastern U.S., demonstrating that parameter and operational errors can cause a crash. We can foresee further opportunities for such errors once GenAI enters the mix.

Many software updates and parameter-setting updates occur in our networks each day. Even a very small percentage of unknown problems (back to five nines again) can cause widespread crashes. Some of those crashes will be merely inconvenient. A few can be as bad as CrowdStrike, or worse.
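To make that arithmetic concrete, here is a back-of-envelope sketch. The change volume and per-change failure rate are assumptions chosen for illustration, not measured figures:

```python
# Back-of-envelope sketch (illustrative numbers, not measured data):
# even at five-nines reliability per change, a high change volume
# still yields failures at a steady rate.

def expected_failures(changes_per_day: int, failure_rate: float) -> float:
    """Expected number of failing changes per day."""
    return changes_per_day * failure_rate

# Assumed: a large operator applying 100,000 software/parameter
# updates per day, each failing with probability 1e-5 (five nines).
per_day = expected_failures(100_000, 1e-5)
print(f"expected failing changes per day: {per_day}")          # 1.0
print(f"expected failing changes per year: {per_day * 365}")   # 365.0
```

Under these assumed numbers, five-nines reliability still means roughly one failing change every day, and any one of those could be the next widespread crash.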

Guarding Against Future Problems

Our systems and networks were already struggling with scale, complexity, and volatility. With the addition of GenAI, this will only get worse, even as we become more and more dependent on those systems and networks. The way out is to develop new ways of making sure that what we think are “small changes” don’t bring everything suddenly crashing down.

We have solutions for the known types of errors. They’re not all perfect, but they’re a good start. Another approach, called “sandboxing,” has been around for a long time but is now beginning to be more widely recognized. In this approach, a small network/computing environment similar to the operational environment is created in a separate domain, one from which undesired effects cannot pass through into the operational environment. Into this sandbox the new software, or parameter setting, is inserted, and the results are observed. Special attention is paid to any unexpected outcomes.
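The gating step described above can be sketched as follows. This is a minimal illustration, assuming hypothetical hooks `apply_change()` and `collect_metrics()` supplied by your lab or orchestration tooling; real sandbox interfaces will differ:

```python
# Minimal sketch of a sandbox gate. The hooks apply_change() and
# collect_metrics() are assumed placeholders for whatever your lab
# tooling actually provides; they are not a real API.

from dataclasses import dataclass

@dataclass
class SandboxResult:
    passed: bool
    anomalies: list

def run_in_sandbox(change, baseline_metrics, apply_change, collect_metrics,
                   tolerance=0.05) -> SandboxResult:
    """Apply a change in the isolated environment and flag any metric
    that deviates from its baseline by more than `tolerance` (5%)."""
    apply_change(change)           # takes effect only inside the sandbox
    observed = collect_metrics()
    anomalies = [
        name for name, base in baseline_metrics.items()
        if abs(observed.get(name, 0.0) - base) > tolerance * abs(base)
    ]
    # Any unexpected deviation blocks promotion to the operational network.
    return SandboxResult(passed=not anomalies, anomalies=anomalies)
```

The key design point is the last line: the change is promoted only when nothing unexpected happened, which is exactly the “special attention to unexpected outcomes” discipline described above.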

Sandboxing as it is practiced today can catch some of these unexpected, previously unseen problems. Simple sandboxing might have caught the CrowdStrike problem. So the first recommendation is that vendors and large user organizations should develop sandboxes. Vendors should put anything they plan to ship into their sandboxes before shipping. Large organizations should put incoming changes into their sandboxes before installing. These measures will help a lot.

The problem we face today is that most sandboxes don’t accurately reflect the current state of the network. They tend to be oversimplified, small, slowly changing abstractions of the operational network. Creating sandboxes that reflect the full scale, complexity, and volatility of our networks is hard and expensive today. This is where innovation must come in. We need ways of making our sandboxes an accurate, up-to-the-minute (second?) representation of our networks in all their complexity and scale. And we need to find a way of doing this economically. The alternative is more crashes as bad as the CrowdStrike incident, or worse.
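One economical way to approximate scale, sketched below, is to sample the live device inventory so that each device class keeps its real-world proportion in a much smaller replica. The inventory format and the `stratified_sample` helper are illustrative assumptions, not an established tool:

```python
# Sketch: build a scaled-down sandbox inventory that preserves the
# per-type mix of the operational network. Assumes the inventory is
# a list of (device_type, config) records; this is an illustrative
# format, not a real tool's schema.

import random
from collections import defaultdict

def stratified_sample(inventory, sandbox_size, seed=0):
    """Pick about sandbox_size devices, preserving the per-type mix."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for device_type, config in inventory:
        by_type[device_type].append((device_type, config))
    total = len(inventory)
    sample = []
    for device_type, devices in by_type.items():
        # At least one representative of every type, sized by share.
        k = max(1, round(sandbox_size * len(devices) / total))
        sample.extend(rng.sample(devices, min(k, len(devices))))
    return sample
```

Statistical scaling like this only addresses size; keeping the sampled configurations synchronized with the live network (the volatility dimension) still requires a continuous feed of changes into the sandbox.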

Conclusion

If we want to preclude events like the CrowdStrike BSOD incident from happening in the future, we have to find innovative ways to monitor, prevent, and fix problems. Sandboxing may be the best way to do this at present. But innovation is needed to make sure the sandbox network effectively simulates the operational network. More specifically, it must be representative of the operational network’s:

  • Volatility, by being very frequently updated.
  • Complexity, by representing the full range of systems.
  • Scale, by developing innovative statistical methods to effectively duplicate the size of large networks.

Vendors and large end-user organizations forming partnerships with innovators is the best way to get to the environment we want and need.


