By: Asim Rasheed
Cloud-native infrastructure, open radio networks, edge computing, 6G, and the Internet of Things (IoT)—for large network operators, the list of technology trends to contend with is already long, and still growing. But for telecommunications providers struggling to keep pace with so much change, another massive trend is building on the horizon with the potential to dwarf all others: artificial intelligence and machine learning (AI/ML).
It’s not that AI/ML is new; you could call it an overnight success 40 years in the making. But with the rise of today’s Large Language Models (LLMs), deep neural networks, and generative AI tools, the AI revolution has officially begun. Analysts with S&P Global forecast that revenue from generative AI offerings will reach $3.7 billion in 2023, growing to $36 billion by 2028, a staggering 58% compound annual growth rate (CAGR). According to Dell’Oro Group, fully one-fifth of all Ethernet data center switch ports will connect to AI servers by 2027.
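As a back-of-the-envelope check (our arithmetic, not a figure from the S&P Global report), growing from $3.7 billion to $36 billion over five years does indeed work out to roughly the cited 58% CAGR:

```python
# Back-of-the-envelope check of the cited forecast: $3.7B (2023) to
# $36B (2028) is five years of compounding.
start_usd_b, end_usd_b, years = 3.7, 36.0, 5
cagr = (end_usd_b / start_usd_b) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # ~57.6%, in line with the cited 58%
```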
This new generation of AI applications and related workloads places tremendous stress on the world’s largest networks. Generative AI in particular brings new traffic patterns and networking requirements unlike anything operators have dealt with before. As a result, these trends are spurring a parallel revolution, still in its early stages, in the way networks are designed and tested. Today, hyperscalers bear the brunt of these growing pains, and they’re reimagining their networks in response. But telecommunications providers should pay close attention. In the not-so-distant future, they’ll need to address many of the same issues in their own networks: to bring new AI services to customers, to tap into the power of AI-driven automation and optimization, and to enable the dynamic applications of tomorrow.
If you pay any attention to the tech sector, you’ve seen the headlines about the explosive growth of OpenAI’s ChatGPT. Within weeks of its November 2022 launch, the generative AI chat client shattered Internet growth records, ramping up to 100 million monthly active users. By June 2023, the site had notched 1.6 billion visits. And that’s just one example. Major tech companies have already unveiled dozens of other AI projects, with LLMs representing just one type of AI application.
Operators of the world’s largest hyperscale data centers (Amazon, Microsoft, Meta, and others) were already scrambling to add compute and network capacity to keep pace with growing cloud utilization. New generative AI workloads, however, represent an entirely different kind of challenge—with business and technical demands that can’t be met just by adding more servers and bumping up interface speeds.
To start with, the most effective compute clusters for AI workloads use Graphics Processing Units (GPUs), which are far better suited than conventional server Central Processing Units (CPUs) to running many tasks in parallel. GPUs were already more expensive and in shorter supply than CPUs before the latest AI tools started capturing headlines; today they are extremely hard to acquire. That scarcity means organizations paying a premium for GPUs need to extract every bit of performance from them.
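To make the parallelism point concrete, here is a minimal sketch, assuming PyTorch is installed (the GPU path runs only if a CUDA-capable device is present), that times the same large matrix multiplication on a CPU and on a GPU:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    # One large matrix multiplication, the core operation behind neural
    # network training and inference.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure setup work has finished
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the asynchronous GPU kernel
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f}s")
```

On typical hardware the GPU finishes the same job many times faster, which is exactly why operators paying a premium for scarce GPUs can’t afford to leave them sitting idle.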
Even when data center operators have the processing power to scale up AI support, however, these workloads represent a very different kind of computing job with very different networking requirements. AI/ML clusters function more like a single high-performance machine than a collection of servers: each GPU crunches data individually, then must exchange huge amounts of information with many other processors in the cluster, all within very tight windows. The sketch below illustrates the collective exchange at the heart of that behavior.
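What follows is a minimal, illustrative sketch of that exchange pattern, an all-reduce collective, in which every worker contributes a local result and no worker can proceed until it receives the combined result. It assumes PyTorch and uses the CPU-only “gloo” backend so it can run on a single machine; real training clusters run the same collective across the data center fabric (typically with NCCL over high-speed Ethernet or InfiniBand), which is where the network strain appears.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each spawned process plays the role of one GPU host in the cluster.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Step 1: every worker computes its own local "gradient" shard...
    local_grad = torch.full((4,), float(rank))

    # Step 2: ...then all workers exchange and sum their shards. Nobody
    # moves on to the next training step until the slowest transfer
    # finishes, which is why tail latency matters so much.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: reduced gradient = {local_grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # stand-in for the thousands of GPUs in a real cluster
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because every worker blocks until the collective completes, a single slow or congested path through the network stalls the entire cluster. This places extreme demands on networks, with four pain points standing out in particular: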