Another big reason for the push to use artificial intelligence for service monitoring is that the mountain of data that networks currently process is going to get significantly larger. In the coming years, the continued rollout of 5G, a steady increase in the production of self-driving cars, more video-based applications, and the proliferation of IoT devices will produce larger data volumes that will tax CSP networks.
Many CSPs struggle with alert fatigue, a flurry of false-positive alerts that drains resources. When used properly, AI can speed time to detection and remediation for network service issues by correlating events and reducing false-positive alerts. For AI-based service monitoring to create meaningful alerts, it must identify relationships between metrics and the events that affect how they behave.
For example, AI might connect site downtime to a maintenance event. Typically, this is difficult for CSPs because the information about each event is siloed, and responsibility belongs to different departments within the organization. Quickly finding and correlating anomalies across silos before they impact the customer experience is a challenge for most organizations.
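To make the downtime-to-maintenance example concrete, here is a minimal sketch of one way such a link could be made: matching each anomaly to events at the same site whose time window covers it. This is a plain rule-based match, not a learned correlation engine, and the names used (`Event`, `Anomaly`, `explain_anomalies`, the site IDs, the 15-minute slack) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """An operational event, e.g. a maintenance window owned by another team."""
    site: str
    kind: str
    start: datetime
    end: datetime


@dataclass
class Anomaly:
    """A metric anomaly observed at a site at a point in time."""
    site: str
    metric: str
    at: datetime


def explain_anomalies(anomalies, events, slack=timedelta(minutes=15)):
    """Pair each anomaly with same-site events whose window, padded by
    `slack` on both sides, covers the anomaly timestamp."""
    pairs = []
    for anomaly in anomalies:
        matches = [
            e for e in events
            if e.site == anomaly.site
            and e.start - slack <= anomaly.at <= e.end + slack
        ]
        pairs.append((anomaly, matches))
    return pairs


# Illustrative data only: one planned maintenance window and two anomalies.
maintenance = [
    Event("site-17", "planned_maintenance",
          datetime(2023, 1, 10, 2, 0), datetime(2023, 1, 10, 4, 0)),
]
anomalies = [
    Anomaly("site-17", "availability", datetime(2023, 1, 10, 2, 30)),
    Anomaly("site-42", "availability", datetime(2023, 1, 10, 2, 30)),
]

pairs = explain_anomalies(anomalies, maintenance)
# Anomalies with no matching event are the ones worth alerting on.
unexplained = [a for a, matched in pairs if not matched]
```

In this sketch, the downtime at `site-17` is explained by its maintenance window and can be suppressed, while the anomaly at `site-42` has no matching event and remains a candidate alert.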
Correlating anomalies to the events that influence them is a must for root cause analysis. Without a strong AI correlation engine, a flood of false alerts could spur a lot of unnecessary truck rolls and losses in revenue. One CSP went from 54,000 alerts per day to around 30 alerts per day by moving to an AI-based solution, and it was able to shorten the time from alert to resolution by as much as 60 hours.
Correlating anomalies across silos can also reduce customer churn. As we emerge into a post-pandemic world, and business travelers and vacationers leave their home countries, roaming becomes an issue for networks. Subscribers across the globe rely on their phones and roaming services when they travel. Any degradation in phone service is especially hard on subscribers because, without a working phone, it's difficult for them to contact their CSP. AI can detect roaming issues before they lead to customer churn. For example, an anomalous drop in data volume from inbound roamers from multiple countries might be due to a drop in the DNS success rate. Quick detection and resolution of the issue would minimize the impact on subscribers.
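The roaming scenario can be sketched in a few lines. The example below flags sudden drops in each series with a simple trailing z-score and then checks whether the drops in roaming volume coincide with a drop in DNS success rate. The z-score rule stands in for whatever detection model a real system would use, and the function name `drop_indices`, the window and threshold values, and all of the numbers are assumptions made up for illustration.

```python
from statistics import mean, stdev


def drop_indices(series, window=6, threshold=3.0):
    """Return indices where a value falls more than `threshold` standard
    deviations below the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (mu - series[i]) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up hourly data volume for inbound roamers from two countries.
roaming_volume = {
    "FR": [100, 102, 98, 101, 99, 100, 60, 58],
    "DE": [80, 79, 81, 80, 82, 80, 45, 47],
}
# Made-up DNS success rate (percent) over the same hours.
dns_success_rate = [99.5, 99.6, 99.4, 99.5, 99.6, 99.5, 91.0, 90.5]

dns_drops = set(drop_indices(dns_success_rate))
volume_drops = {
    country: set(drop_indices(series))
    for country, series in roaming_volume.items()
}

# Countries whose volume drop lines up in time with a DNS drop.
affected = [c for c, idx in volume_drops.items() if idx & dns_drops]
```

Here both countries show a drop in the same hour as the DNS success rate, which points to a single shared cause rather than two unrelated incidents. Coincidence in time is only a hint, so a production system would need further evidence before naming DNS as the root cause.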
Zero touch remains the Holy Grail for CSPs. The emergence of 5G and edge computing technologies will enable CSPs to offer more services, further enhancing the customer experience. But this digital transformation comes at a cost: networks and network services are becoming more complex. CSPs cannot afford a misstep. Customers don't care about complexity; they care about services and lack of downtime. Machine learning will help CSPs move from reactively monitoring multifaceted networks to remediating issues in them.
Here's how this process will work. The first step is anomaly detection, which is already in place, as CSPs use business monitoring to measure service levels. The second step is correlation and root cause analysis, in which CSPs use ML to correlate events in real time across billions of metrics. Advances in ML have enabled CSPs to correlate different events across multiple technologies and multiple vendors, speeding time to remediation for teams. Autonomous remediation is the final step to zero touch. Currently, the automated closed-loop process can be observed in low-level tasks such as automating “bounce the server” or “open a ticket” types of scripts, but humans still must be involved to engineer a fix for the issue.
Today, the technology road map is in place to combine all three phases in the process. Looking ahead, an ML-based system will perform the anomaly detection and root cause analysis, as it does today, and then, based on previous events, suggest and execute an action. This will be done without human interaction. Completing the closed-loop process, ML-based systems will fine-tune their remediation efforts based on previous zero-touch activities.
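As a rough illustration of the last phase, the sketch below keeps a history of which remediation actions worked for which root causes, suggests the action with the best track record, and only marks it for automatic execution once it has been tried often enough and succeeded often enough. It uses simple success counting in place of a real ML model, and the class `RemediationPolicy`, its thresholds, and the cause and action names are all hypothetical.

```python
from collections import defaultdict


class RemediationPolicy:
    """Tracks outcomes of past remediation actions per root cause and
    suggests the action with the best observed success rate."""

    def __init__(self, min_trials=3, min_success_rate=0.8):
        # cause -> action -> [successes, trials]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        self.min_trials = min_trials
        self.min_success_rate = min_success_rate

    def record(self, cause, action, succeeded):
        """Feed the outcome of an executed action back into the history."""
        counts = self.stats[cause][action]
        counts[1] += 1
        if succeeded:
            counts[0] += 1

    def suggest(self, cause):
        """Return (action, auto_execute). With no history for this cause,
        return (None, False) so that a human engineers the fix."""
        actions = self.stats.get(cause)
        if not actions:
            return None, False
        action, (ok, trials) = max(
            actions.items(), key=lambda kv: kv[1][0] / kv[1][1]
        )
        auto = (trials >= self.min_trials
                and ok / trials >= self.min_success_rate)
        return action, auto


policy = RemediationPolicy()
# Outcomes of earlier, human-approved remediations (illustrative only).
for outcome in (True, True, True):
    policy.record("dns_resolver_overload", "restart_resolver", outcome)
policy.record("dns_resolver_overload", "open_ticket", False)

known = policy.suggest("dns_resolver_overload")
unknown = policy.suggest("fiber_cut")
```

With this history, `known` is `("restart_resolver", True)`, meaning the action has earned automatic execution, while `unknown` is `(None, False)` and falls back to a human. Recording each new outcome through `record` is what closes the loop: later suggestions reflect earlier zero-touch results.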
Deployed as an intelligent brain on top of existing architectures, autonomous monitoring tools provide early visibility of network issues that could lead to service disruptions. When set up correctly, they can detect and remediate problems before they impact the customer experience.