Networks within data centers
Within AI factories, where the most intensive AI training and inference occurs, networks must operate at extreme speed, with exceptional reliability and near-lossless performance. Training large models requires enormous volumes of data to move in parallel across multiple lanes, tightly synchronized between GPUs. A single dropped packet or stalled flow can force the system to roll back to its last checkpoint, resetting hours or even days of work. Congestion can be just as damaging, leaving costly GPU resources idle while they wait for data or for other GPUs in the cluster to finish their part of the job.
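The arithmetic behind this is sobering. A rough back-of-envelope estimate (every figure below is an illustrative assumption, not a measurement) shows how quickly a single rollback or a modest congestion stall turns into lost GPU-hours:

```python
# Illustrative back-of-envelope estimate; every figure is an assumption.
gpus = 16_384                 # assumed cluster size
checkpoint_interval_h = 3.0   # assumed hours between checkpoints
rollback_loss_h = checkpoint_interval_h / 2  # expected work lost per failure,
                                             # assuming failures land uniformly
                                             # between checkpoints

gpu_hours_lost = gpus * rollback_loss_h
print(f"One failure wastes ~{gpu_hours_lost:,.0f} GPU-hours")  # ~24,576

# Congestion is a slower leak: even a small stall fraction compounds.
stall_fraction = 0.05         # assumed 5% of step time spent waiting on the network
run_length_h = 24 * 30        # assumed one-month training run
idle_gpu_hours = gpus * run_length_h * stall_fraction
print(f"5% network stall over a month: ~{idle_gpu_hours:,.0f} idle GPU-hours")
```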
In short, the data center network is becoming just as critical to AI performance as the GPUs themselves.
These demands call for an evolution in network architecture and platforms to deliver an essentially deterministic, lossless fabric. Sophisticated telemetry and congestion-management techniques have become essential to maintain smooth data movement and maximize GPU utilization.
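One widely deployed pattern pairs Explicit Congestion Notification (ECN) marking in the fabric with sender-side rate control; DCQCN is a well-known example in RDMA networks. The sketch below is a deliberately simplified, hypothetical model of that control loop, not any vendor’s implementation:

```python
# Simplified, hypothetical model of an ECN-driven rate-control loop in the
# spirit of DCQCN: switches mark packets as queues build, and the sender
# cuts its rate multiplicatively on marks, then recovers additively.

class EcnRateController:
    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps   # current sending rate
        self.alpha = 0.0             # smoothed congestion estimate
        self.g = 1 / 16              # smoothing gain (assumed)

    def on_ack(self, ecn_marked: bool) -> None:
        # Fold the ECN echo into the congestion estimate.
        self.alpha = (1 - self.g) * self.alpha + self.g * (1.0 if ecn_marked else 0.0)
        if ecn_marked:
            # Multiplicative decrease, proportional to observed congestion.
            self.rate *= (1 - self.alpha / 2)
        else:
            # Additive recovery back toward line rate.
            self.rate = min(self.line_rate, self.rate + 0.5)

ctl = EcnRateController(line_rate_gbps=800.0)
for marked in [True, True, False, False, False]:
    ctl.on_ack(marked)
    print(f"rate = {ctl.rate:.1f} Gbps, alpha = {ctl.alpha:.3f}")
```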
Ethernet, the world’s dominant networking technology, is stepping up to meet these requirements and becoming the preferred technology for scaling out AI infrastructure worldwide. Link speeds are accelerating from 800 Gbps toward 1.6 Tbps and beyond. New mechanisms for load balancing and congestion control have emerged to handle the massive “elephant flows” characteristic of AI factories. And Ultra Ethernet, defined by the Ultra Ethernet Consortium (UEC), reimagines Remote Direct Memory Access (RDMA) as an open, interoperable communications stack purpose-built for AI and High-Performance Computing (HPC) at scale.
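The load-balancing challenge is easy to see in miniature. Traditional ECMP hashes each flow onto a single path, so one elephant flow can saturate a link while parallel paths sit idle; newer schemes spread traffic at packet or flowlet granularity instead. The following toy comparison uses made-up flow sizes and a four-path fabric:

```python
# Toy illustration of why per-flow ECMP struggles with elephant flows.
# Flow sizes and path count are made-up numbers for demonstration.
import zlib

PATHS = 4
flows = {"f1": 800, "f2": 10, "f3": 12, "f4": 9}   # Gbps; f1 is the elephant

# Per-flow ECMP: one hash pins each flow to a single path for its lifetime.
ecmp_load = [0.0] * PATHS
for name, gbps in flows.items():
    path = zlib.crc32(name.encode()) % PATHS
    ecmp_load[path] += gbps
print("per-flow ECMP:", ecmp_load)   # one path carries the whole 800 Gbps elephant

# Packet spraying: each packet (or flowlet) picks the least-loaded path,
# so the elephant is spread evenly across the fabric.
spray_load = [0.0] * PATHS
for gbps in flows.values():
    for _ in range(int(gbps)):       # 1 Gbps "chunks" stand in for packets
        spray_load[spray_load.index(min(spray_load))] += 1
print("packet spray :", [round(x) for x in spray_load])
```

Spraying at this granularity means packets can arrive out of order, which is one of the problems the UEC transport is designed to handle.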
The industry’s shift toward Ethernet is no accident. It offers a broad, mature ecosystem, rapid innovation in speeds and protocols, global operational familiarity, seamless scalability, and true multivendor flexibility. These strengths make Ethernet the most practical, future-proof foundation for AI networks.
Connecting the AI ecosystem beyond data centers
AI training workloads increasingly span distributed infrastructures, with GPU clusters spread across multiple data centers to address space, energy, and operational constraints. In this context, network architectures optimized for low-latency, high-bandwidth “scale-across” interconnects are critical: the network becomes the unifying fabric that turns isolated facilities into a cohesive AI system.
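Physics sets a hard floor for these interconnects: light in fiber propagates at roughly two-thirds of its speed in vacuum, or about 5 microseconds per kilometer, so even a perfect network adds round-trip latency that synchronization between sites must absorb. A quick illustrative calculation (the distances are assumed):

```python
# Fiber propagation delay is ~5 microseconds per km. Distances are assumed
# purely for illustration.
US_PER_KM = 5.0

for label, km in [("metro pair, 80 km", 80), ("regional pair, 600 km", 600)]:
    rtt_ms = 2 * km * US_PER_KM / 1000
    print(f"{label}: >= {rtt_ms:.1f} ms round trip before any queuing")
# metro pair, 80 km: >= 0.8 ms; regional pair, 600 km: >= 6.0 ms
```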
Once models are trained, they must be efficiently delivered to inference locations, while large datasets must flow back to AI factories for training or fine-tuning. This demands robust, reliable cloud connectivity and seamless access networks that ensure AI workloads can be fed and consumed without bottlenecks.
Architecting for increased scale is paramount. AI applications can trigger a cascade of data requests and responses, leading to rapid traffic bursts that can overwhelm traditional networks. Without robust last-mile delivery, even the most advanced AI capabilities remain inaccessible.
To cope with these demanding requirements, several high-capacity connectivity options are available at both the IP and optical layers. The choice of technology depends on various factors, including distance, bandwidth requirements, latency, security considerations, and cost. Advanced network technologies, such as coherent optical engines, enable long-distance links with high speed, reliability, and energy efficiency, providing the backbone for this distributed AI ecosystem.
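As a rough illustration of how these factors interact, the hypothetical decision sketch below maps distance and demand to a technology class. The thresholds are illustrative assumptions rather than engineering guidance; real designs also weigh cost, fiber availability, and security requirements:

```python
# Hypothetical decision sketch; thresholds are illustrative assumptions.
def pick_interconnect(distance_km: float, demand_tbps: float) -> str:
    if distance_km < 10 and demand_tbps < 1:
        return "grey optics / direct-attach"
    if distance_km < 120:
        return "coherent pluggables (e.g., 400ZR-class) over metro fiber"
    return "embedded coherent engines over long-haul DWDM"

print(pick_interconnect(2, 0.4))     # grey optics / direct-attach
print(pick_interconnect(80, 3.2))    # coherent pluggables over metro fiber
print(pick_interconnect(900, 12.8))  # embedded coherent engines, long-haul
```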
Securing and automating the AI network
In the AI era, network responsiveness is a key differentiator. Intelligent, reliable network automation allows networks to adapt dynamically to the evolving demands of distributed AI workloads, optimizing performance and resource allocation in real time.
Equally important, AI is increasingly being embedded into network automation. Modern AIOps platforms are transitioning from concept to operational practice, providing capabilities such as conversational and context-aware interaction with the network, intelligent alarm correlation, accelerated root-cause analysis, and automated recommendations for remediation while maintaining operator oversight. By integrating AI directly into automation, network management becomes more proactive, efficient, and resilient.
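As a small illustration of one of these capabilities, the sketch below shows alarm correlation in its simplest form: alarms that share a resource within a short time window collapse into a single probable incident. The data model and window are hypothetical; production AIOps platforms add topology awareness and learned baselines:

```python
# Minimal, hypothetical sketch of alarm correlation: group alarms that share
# a root resource within a short time window, so operators see one incident
# instead of a flood of symptoms.
from collections import defaultdict

alarms = [
    {"t": 0.0, "resource": "link:lsp-7", "msg": "optical power low"},
    {"t": 0.4, "resource": "link:lsp-7", "msg": "BGP session down"},
    {"t": 0.9, "resource": "link:lsp-7", "msg": "LSP reroute"},
    {"t": 30.2, "resource": "node:leaf-12", "msg": "fan speed high"},
]

WINDOW_S = 5.0
incidents = defaultdict(list)
for a in sorted(alarms, key=lambda a: a["t"]):
    key = (a["resource"], int(a["t"] // WINDOW_S))   # resource + time bucket
    incidents[key].append(a["msg"])

for (resource, _), msgs in incidents.items():
    print(f"{resource}: probable single incident -> {msgs}")
```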
Accompanying the expansion of data centers and the proliferation of AI workloads is the growing prevalence, volume, and sophistication of global cybersecurity threats. Protecting sensitive data is paramount. Technologies such as quantum-safe encryption and advanced DDoS detection and mitigation are essential, and many of these security functions must be embedded directly into the network. By integrating security at the infrastructure level, organizations can maintain high-speed, efficient operations without overburdening costly compute resources.
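As an illustration of what embedded detection can look like at its simplest, the sketch below flags volumetric anomalies by comparing observed packet rates against an exponentially weighted baseline. All thresholds and traffic figures are assumptions for demonstration; real systems combine many such signals:

```python
# Minimal sketch of volumetric DDoS detection: compare the observed packet
# rate against an EWMA baseline and flag large deviations. All numbers are
# illustrative assumptions.
def detect(samples_pps, gain=0.2, threshold=3.0):
    baseline = samples_pps[0]
    for t, pps in enumerate(samples_pps):
        if pps > threshold * baseline:
            print(f"t={t}s: ALERT, {pps:,} pps vs baseline ~{baseline:,.0f} pps")
        baseline = (1 - gain) * baseline + gain * pps  # update the baseline

detect([100_000, 110_000, 95_000, 2_500_000, 2_600_000])
```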
The network as the foundation for AI’s future
Much as the video revolution reshaped network architecture two decades ago, AI is now driving a profound evolution in how we conceive of the cloud and the network.
Meeting AI’s demands requires a forward-looking approach to network design and automation, both within individual data centers and across distributed infrastructures. By proactively transforming and evolving the network, organizations can establish the foundational infrastructure for a seamless and efficient cloud continuum—one that is resilient, adaptable, and ready to respond to whatever innovations the age of AI brings next.
This transformation is the essential foundation for unlocking AI’s full potential and enabling the next wave of digital innovation.