By: Aimei Wei
Cybersecurity advances come in waves, and the waves of the past few years have revolved around data collection and incorporating AI technology as a means of sifting through that data. But as cybersecurity tools proliferate, it’s clear that data collection and AI are not enough. Security analysts must monitor a dozen or more different consoles to inspect traffic across their entire attack surface, and they miss complex attacks that consist of seemingly innocuous events reported by several different tools.
This is why we have entered a third wave: correlation. Correlation involves inspecting and responding to data inputs across tools to achieve higher-fidelity alerting. Data collection alone produced seas of data that were impossible to inspect effectively, and AI improved results within individual security tools; correlation pulls together data from disparate tools to spot complex attacks more effectively.
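To make that concrete, here is a minimal sketch of the idea (not any particular vendor’s implementation, and using hypothetical, simplified event records): weak signals from different tools are grouped by the host they concern and by time window, and an incident is raised only when independent tools corroborate one another.

```python
from collections import defaultdict

# Hypothetical, simplified events from three different tools; real
# products normalize far richer telemetry than this.
events = [
    {"tool": "EDR",  "host": "srv-12", "time": 1000, "signal": "suspicious process"},
    {"tool": "NTA",  "host": "srv-12", "time": 1040, "signal": "unusual outbound traffic"},
    {"tool": "UEBA", "host": "srv-12", "time": 1100, "signal": "abnormal login"},
    {"tool": "NTA",  "host": "srv-07", "time": 1050, "signal": "port scan"},
]

WINDOW = 300  # correlate events on the same host within 5 minutes

def correlate(events, window=WINDOW):
    """Group events by host, then flag hosts where two or more
    distinct tools fired within the same time window."""
    by_host = defaultdict(list)
    for e in events:
        by_host[e["host"]].append(e)
    incidents = []
    for host, evs in by_host.items():
        evs.sort(key=lambda e: e["time"])
        start = evs[0]["time"]
        cluster = [e for e in evs if e["time"] - start <= window]
        tools = {e["tool"] for e in cluster}
        if len(tools) >= 2:  # independent corroboration -> higher fidelity
            incidents.append({"host": host, "tools": sorted(tools),
                              "signals": [e["signal"] for e in cluster]})
    return incidents

print(correlate(events))
# srv-12 is reported (three tools agree); srv-07's lone alert is not.
```

Each alert on its own looks innocuous; it is the agreement across tools, within a short window, that makes the incident worth an analyst’s time.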
In this article, we’ll look at the progress of cybersecurity technology through the first two waves, and see how the correlation wave is now emerging to deliver faster detections and responses to complex attacks.
Data has been at the heart of cybersecurity since its inception. It began with analysts looking at firewall and server logs and expanded as the number of security tools grew. The main purpose of security information and event management (SIEM) systems was to collect and aggregate logs from different tools and applications for compliance, incident investigation and log management. ArcSight, released around the turn of the century, was a typical example of a SIEM and log management system. Raw packets were also collected and stored as-is for forensics, even though they consume enormous amounts of storage and sifting through them for any indication of a breach is extremely difficult.
We quickly realized, however, that neither raw logs nor raw packets alone are enough to detect breaches effectively, and that raw packets are too heavy and have limited use beyond forensics. Information extracted from traffic, such as NetFlow/IPFIX records traditionally used for network visibility and performance monitoring, started to be used for security, and SIEMs began to ingest and store it as well. However, due to both technical scalability concerns and cost concerns, SIEMs never became the mainstream tool for traffic analysis.
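The appeal of flow records is that lightweight checks become possible without storing full packets. The sketch below, using hypothetical, simplified flow records with a few of the fields a NetFlow/IPFIX export carries, flags internal hosts sending unusually large volumes outbound; the threshold is illustrative, not a standard.

```python
# Hypothetical, simplified flow records; real NetFlow/IPFIX exports
# carry many more fields (timestamps, TCP flags, interface IDs, ...).
flows = [
    {"src": "10.0.0.5",  "dst": "203.0.113.9",  "dport": 443, "bytes": 48_000},
    {"src": "10.0.0.5",  "dst": "198.51.100.7", "dport": 22,  "bytes": 9_200_000},
    {"src": "10.0.0.12", "dst": "203.0.113.9",  "dport": 443, "bytes": 51_000},
]

EXFIL_THRESHOLD = 5_000_000  # bytes; an illustrative cutoff for the sketch

def suspicious_flows(flows, threshold=EXFIL_THRESHOLD):
    """Flag internal hosts sending unusually large volumes outbound,
    a crude stand-in for the exfiltration checks flow data enables."""
    return [f for f in flows
            if f["src"].startswith("10.") and f["bytes"] > threshold]

for f in suspicious_flows(flows):
    print(f"possible exfiltration: {f['src']} -> {f['dst']}:{f['dport']} "
          f"({f['bytes']:,} bytes)")
```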
As time went by, more data was collected: files, user information, threat intelligence and more. The goal of collecting more data was valid: pervasive visibility. But the resulting challenge, responding to critical attacks, is like finding needles in a haystack, especially via manual searches or rules defined by hand. That approach is both labor-intensive and inefficient.
There are two technical challenges facing data-driven security: storing large volumes of data at scale while still allowing efficient search and analysis, and handling the variety of data, especially unstructured data, since events can arrive in any format. Traditional relational databases based on SQL ran into both of these problems. Early vendors scrambled to solve them with homegrown solutions, but most were not as efficient as what we use today: NoSQL databases backing big data lakes.
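A toy way to see the schema problem: relational tables want a fixed schema up front, while security events from different tools share almost no fields. Document-oriented stores sidestep this with schema-on-read. The sketch below imitates that pattern using Python’s built-in sqlite3, storing each event as a JSON blob and extracting fields at query time; real deployments use purpose-built NoSQL stores, and the field names here are invented for illustration.

```python
import json
import sqlite3

# Events from different tools rarely share a schema: a firewall log,
# an EDR alert and a cloud audit record have almost no fields in common.
events = [
    {"tool": "firewall", "src_ip": "10.0.0.5", "action": "deny", "port": 445},
    {"tool": "EDR", "host": "srv-12", "process": "powershell.exe",
     "parent": "winword.exe"},
    {"tool": "cloud", "user": "alice", "api_call": "DeleteBucket",
     "region": "us-east-1"},
]

# Schema-on-read: store each event as an opaque JSON document and pull
# fields out at query time, instead of designing one rigid table per tool.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (doc TEXT)")
db.executemany("INSERT INTO events VALUES (?)",
               [(json.dumps(e),) for e in events])

# json_extract requires SQLite's JSON functions (built into modern builds).
rows = db.execute(
    "SELECT json_extract(doc, '$.tool'), json_extract(doc, '$.process') "
    "FROM events WHERE json_extract(doc, '$.process') IS NOT NULL"
).fetchall()
print(rows)  # [('EDR', 'powershell.exe')]
```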
There is one more challenge facing data-driven security: a software architecture that cost-effectively scales for enterprise customers. The typical three-tier architecture, with front-end, business-logic and database tiers, became a big hurdle. Today’s cloud-native architectures, built on microservices and containers, provide far more scalable and cost-effective solutions.
Once you have lots of data, what do you do with it? As mentioned previously, sifting through a large volume of data looking for meaningful patterns is tedious and time-consuming. If your IT infrastructure gets hacked, it might take days or weeks to find out, and by then the damage is already done or sensitive data has already been stolen.
In this case, too much data becomes a problem. Fortunately, we have seen the rise of machine learning (ML) thanks to advances in algorithms and computing power. Machines are very good at doing repetitive, tedious work quickly, efficiently and tirelessly, 24x7. When machines are equipped with learning capabilities, they help humans scale. Many researchers and vendors in security started leveraging AI to solve the data overload problem, to find those needles or to surface trends hidden inside large datasets. For example, endpoint detection and response (EDR) companies are using AI to address endpoint security problems; user and entity behavior analysis (UEBA) companies are using AI to address insider threats; and network traffic analysis (NTA) companies are using AI to find abnormal network traffic patterns.
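As a taste of what needle-finding looks like in the network-traffic case, here is a minimal sketch using scikit-learn’s IsolationForest on synthetic per-connection features. Production NTA systems use far richer features and models; this only illustrates why unsupervised anomaly detection suits security, where attacks are rare and rarely labeled.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic per-connection features: [bytes sent, duration in seconds].
# Most traffic is ordinary; a couple of connections move far more data.
normal = rng.normal(loc=[50_000, 30], scale=[10_000, 8], size=(500, 2))
odd = np.array([[9_000_000, 2], [7_500_000, 3]])  # exfiltration-like
X = np.vstack([normal, odd])

# IsolationForest isolates outliers without labeled attack data.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)  # -1 marks anomalies, 1 marks inliers

print(f"{(labels == -1).sum()} anomalous connections out of {len(X)}")
print(X[labels == -1])  # the large transfers stand out
```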