Identify threats faster with a security data lake

The glory days of SIEM are over. Security teams are no longer measured only by how much data they can collect; the emphasis is moving to how effectively they can analyze massive amounts of complex security data and produce actionable recommendations in real time.

To top that off, security data is primarily composed of events and logs that are growing in complexity and dimensionality. Each row (a log or an event) often includes dozens or even hundreds of different attributes. Yes, the data is complex, and being able to sift through it effectively will drive ROI and build a solid competitive advantage.

In this post, I want to focus on the efficiency aspect of security analytics solutions. It all comes down to timing: how quickly you can start analyzing; how quickly you can change queries and adapt; and how quickly your queries complete, so you can run all the queries you need to detect anomalies and new threats and to stop breaches.

Breakout time: 1 min to detect; 10 mins to understand; 60 mins to respond

To frame the question of timing, one cybersecurity firm introduced the concept of breakout time: the critical window between when an intruder compromises the first machine and when they can move laterally to other systems on the network. It’s not enough to identify and react; you need to do it very quickly.

Attacks generally include five stages: initial access, persistence, discovery, lateral movement, and objective. For effective mitigation, security teams need to detect, analyze and respond before lateral movement begins. That window is essentially the breakout time. According to 2018 research, the average breakout time was 1 hour and 58 minutes. Nearly two hours is far too long.

Instead, security teams should aim for the 1-10-60 rule:

  • 1 minute to detect
  • 10 minutes to understand
  • 60 minutes to respond
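
As a rough illustration, the 1-10-60 targets can be turned into a simple check against an incident timeline. This is a minimal sketch, not a product feature: the function names and the timestamps are made up for the example, and it assumes each budget applies to its own stage rather than cumulatively (so the three budgets add up to 1 hour and 11 minutes).

```python
from datetime import datetime, timedelta

# Per-stage budgets taken from the 1-10-60 rule.
BUDGETS = {
    "detect": timedelta(minutes=1),
    "understand": timedelta(minutes=10),
    "respond": timedelta(minutes=60),
}


def check_1_10_60(compromised, detected, understood, responded):
    """Return {stage: (elapsed, met_budget)} for one incident timeline."""
    stages = {
        "detect": detected - compromised,
        "understand": understood - detected,
        "respond": responded - understood,
    }
    return {
        name: (elapsed, elapsed <= BUDGETS[name])
        for name, elapsed in stages.items()
    }


def total_budget():
    """Sum of the three stage budgets."""
    return sum(BUDGETS.values(), timedelta())


# Hypothetical incident: fast detection and triage, slow response.
t0 = datetime(2024, 1, 1, 12, 0, 0)
report = check_1_10_60(
    compromised=t0,
    detected=t0 + timedelta(seconds=45),
    understood=t0 + timedelta(minutes=9),
    responded=t0 + timedelta(minutes=84),
)
for stage, (elapsed, ok) in report.items():
    print(f"{stage}: {elapsed} ({'within budget' if ok else 'over budget'})")
print("total budget:", total_budget())
```

In this made-up timeline, detection and understanding land inside their budgets while the response stage overruns its 60 minutes.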

Meanwhile, traditional SIEM, XDR and optimized search platforms struggle to meet the 1-10-60 goal (1 hour and 11 minutes in total) while staying within budget constraints.

That’s why a security data lake is the only analytics destination that can beat breakout time and also hit your budget goals. Let’s take a closer look below.

A data lake is the right place to start a security analytics project

As a general rule, threat hunting is finding a “needle in a haystack”. It’s not something you can prepare for or model. Data changes all the time, anomalies change all the time, and data has to remain in its most granular state. In addition to granularity, maintaining easy and quick access to historical data is critical, making the data lake a strong alternative for security analytics.

By leveraging cheap commodity storage, object storage (AWS S3, for example) and existing data pipelines, the data lake offers a cost-effective architecture for a wide range of data sources.

The advantages of the data lake are significant:

  • Access all relevant data in its raw, granular form
  • Data is retained for multiple uses over time, enabling full timeline investigations
  • The data lake serves as the single source of truth for all security analytics workloads

Security data lakes alone fail to deliver efficiencies

Not convinced just yet? The data lake’s agile architecture is often undercut by a lack of efficiency.

Today’s solutions are based on brute-force technology, which scans massive amounts of data to return answers. So what you save on storage, you may end up paying for in enormous compute clusters that are needed to support decent performance and concurrency SLAs, both of which are critical to delivering the 1-10-60 goal.

In reality, about 80% of compute will be squandered on scan and filter operations, resulting in:

  • Inefficient resource utilization
  • Inconsistent performance
  • Unpredictable costs
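
To make the scan-versus-lookup gap concrete, here is a toy comparison on synthetic log rows. It is only a sketch: the field names and data are invented, and a Python dictionary stands in for a real index. It counts how many rows each approach has to touch to answer the same question.

```python
from collections import defaultdict


def make_events(n):
    """Generate synthetic log rows; the fields are illustrative only."""
    return [
        {
            "id": i,
            "src_ip": f"10.0.{i % 250}.{i % 7}",
            "action": "login" if i % 3 else "denied",
        }
        for i in range(n)
    ]


def scan(events, column, value):
    """Brute force: every row is read and filtered."""
    rows_read = 0
    hits = []
    for row in events:
        rows_read += 1
        if row[column] == value:
            hits.append(row["id"])
    return hits, rows_read


def build_index(events, column):
    """Inverted index: value -> row ids. Built once, reused by every query."""
    index = defaultdict(list)
    for row in events:
        index[row[column]].append(row["id"])
    return index


def lookup(index, value):
    """Indexed lookup: only the matching rows are touched."""
    hits = index.get(value, [])
    return hits, len(hits)


events = make_events(10_000)
scan_hits, scan_read = scan(events, "src_ip", "10.0.42.0")
ip_index = build_index(events, "src_ip")
index_hits, index_read = lookup(ip_index, "10.0.42.0")

print(f"scan:  {len(scan_hits)} hits, {scan_read} rows read")
print(f"index: {len(index_hits)} hits, {index_read} rows read")
```

Both approaches return the same rows, but the scan reads the entire data set for every query, while the index pays its build cost once and then touches only the matches.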

Bottom line: data lake analytics solutions haven’t yet matured enough to support the pace and performance that threat hunting, incident response and security investigations require.

The missing ingredient: enable the security data lake with indexing

Indexing is all about finding that “needle in a haystack.”

But if you learned anything about indexing back in school / undergrad / university / anywhere else, it’s time for a new perspective. When it comes to truly massive data sets, the traditional concepts of indexing won’t deliver the much sought-after efficiencies.

Solving big data challenges took a fresh look at indexing, as well as a slight modification.

Starburst’s Smart Indexing and Caching Technology

Partitioning-based optimizations reduce the amount of data scanned, and so boost performance or reduce cost, by partitioning the data on frequently used columns. Starburst’s big data indexing technology, by contrast, is not limited to a few dimensions: it lets you quickly find relevant data across any dimension (column).

Data lakes are not homogeneous and include data from many different sources and formats, so the platform leverages a rich suite of indexes, such as bitmap, tree, Bloom and Lucene (for the text searches that are so critical for log and event analytics).
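
For readers who have not met these index types, the following is a minimal sketch of one of them, a bitmap index, written in plain Python. It is a textbook illustration and not Starburst’s implementation; the rows and column names are invented. Each distinct value gets a bitmap of the rows that contain it, and filters on several columns are combined with a bitwise AND.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int whose set bits
    are the numbers of the rows holding that value."""
    bitmaps = {}
    for row_number, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps


def matching_rows(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"user": "alice", "action": "login", "status": "ok"},
    {"user": "bob", "action": "login", "status": "denied"},
    {"user": "alice", "action": "sudo", "status": "denied"},
    {"user": "carol", "action": "login", "status": "denied"},
    {"user": "bob", "action": "sudo", "status": "ok"},
]

action_index = build_bitmap_index(rows, "action")
status_index = build_bitmap_index(rows, "status")

# "Denied logins" is the AND of two bitmaps; no row is rescanned.
denied_logins = action_index["login"] & status_index["denied"]
print(matching_rows(denied_logins))  # [1, 3]
```

Because every column can carry its own bitmaps, the same trick works for any combination of columns, which is the property the "any dimension" claim relies on.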

These capabilities don’t require any special skill sets: the platform automatically finds the best index. Taking it a step further, to deliver optimal performance on varying cardinality, Smart Indexing and Caching breaks each column down into small pieces, called nano-blocks, and finds the optimal index for each nano-block.
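
The idea of choosing an index per nano-block can be sketched as follows. To be clear, this is a hypothetical heuristic written for illustration only: the thresholds, the block size and the mapping from cardinality to index type are assumptions, not Starburst’s actual selection logic.

```python
def split_into_blocks(values, block_size):
    """Cut one column into fixed-size pieces (stand-ins for nano-blocks)."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def choose_index(block, low_ratio=0.1, high_ratio=0.9):
    """Hypothetical rule: pick an index type from the share of distinct values."""
    ratio = len(set(block)) / len(block)
    if ratio <= low_ratio:
        return "bitmap"  # few distinct values: one small bitmap per value
    if ratio >= high_ratio:
        return "bloom"   # nearly unique values: membership filter to skip the block
    return "tree"        # in between: ordered structure for lookups and ranges


# One synthetic column whose cardinality changes along its length.
column = (
    ["GET"] * 100
    + [f"session-{i}" for i in range(100)]
    + [f"host-{i % 30}" for i in range(100)]
)

choices = [choose_index(block) for block in split_into_blocks(column, 100)]
print(choices)  # ['bitmap', 'bloom', 'tree']
```

The point of the sketch is that a single column can call for different index types in different regions, so deciding per block rather than per column avoids a one-size-fits-all compromise.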

Indexes, as well as cached data, are stored on SSDs to enable highly effective access and extremely fast performance.

Starburst’s Smart Indexing and Caching is designed to improve breakout time:

  • No need to move data or model it – ask questions whenever you want
  • Keep data granular – ask any question you need
  • Run directly on your data lake – run on more data, including historical data
  • Add the power of indexing to your analytics – get blazing fast answers

In addition, to get a holistic view by joining security data with business data, you can use Starburst’s federated queries to combine data lake data with many other data sources (such as MongoDB, PostgreSQL and Salesforce).

All you need to do is start timing it!