Building a security data platform

Kadd Systems

The explosion of tools for managing security data has transformed the cybersecurity landscape. Modern security platforms now leverage sophisticated data warehouses and data lakes, and many organizations have embraced security data lakes and lakehouses, integrating them with data platforms such as Snowflake and Databricks.

Starburst, built on open source Trino (formerly PrestoSQL), introduces a federated approach that proves particularly effective in hybrid architectures and in use cases where data mobility is constrained. Below is a reference architecture for a comprehensive Security Data Platform powered by Starburst, aimed at companies that want to build their own security stack or augment an existing one for SOC and other use cases.

Let’s delve into the components.

Data Ingestion and Modeling

Ingestion

Most organizations can leverage existing data pipelines built by their data teams to get security telemetry into the platform. We have worked with clients who built their ingestion pipelines with tools like Vector.dev/VRL for push-based sources and Python-based tooling orchestrated via Airflow for pull-based systems like external APIs. Over the past few years, many log sources have shifted to semi-structured formats, often JSON, which simplifies parsing; tools like Vector.dev also support Grok, making unstructured text easier to parse. Additional options include Fluent Bit, Fluentd, rsyslog, NiFi, Cribl, and Airbyte for organizations already leveraging those tools.

Organizations commonly retain their existing SIEMs for primary detection use cases. They can feed data directly using established ingestion patterns, or land raw logs in the data lake and forward the required sources to the Operational SIEM.
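As a concrete illustration, here is a minimal sketch of a pull-based ingestion task of the kind an Airflow DAG might schedule. The vendor endpoint, bucket name, and partition layout are all hypothetical.

```python
"""Minimal sketch of a pull-based ingestion task; an Airflow DAG
would call pull_alerts() on a schedule. Endpoint and bucket are
hypothetical placeholders."""
import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example-vendor.com/v1/alerts"  # hypothetical endpoint
LANDING_BUCKET = "security-raw-logs"                  # hypothetical bucket

def pull_alerts(since: str) -> None:
    # Pull semi-structured (JSON) telemetry from the external API.
    resp = requests.get(API_URL, params={"since": since}, timeout=30)
    resp.raise_for_status()
    alerts = resp.json()

    # Land raw records as NDJSON, partitioned by ingest date, so the
    # lake keeps an untouched copy for replays and the Operational SIEM.
    key = f"vendor_alerts/dt={datetime.now(timezone.utc):%Y-%m-%d}/alerts.ndjson"
    body = "\n".join(json.dumps(a) for a in alerts)
    boto3.client("s3").put_object(Bucket=LANDING_BUCKET, Key=key, Body=body.encode())

if __name__ == "__main__":
    pull_alerts(since="2024-01-01T00:00:00Z")
```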

Standardization and Normalization

Distinguishing between standardization and normalization is essential. Standardization maintains a consistent nomenclature across the fields of a single source, while normalization maps all data sources into one shared model. A data model per source is the common baseline; a normalized model proves valuable for specific use cases like IOC searches, domain-specific detections, and ML applications.
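A small sketch of the distinction, with entirely hypothetical field names: standardization keeps one naming convention per source, while normalization folds every source into a shared model (real deployments often target a published schema such as OCSF or ECS).

```python
"""Sketch of cross-source normalization; field names are assumptions."""

# Per-source field maps. Standardization fixes naming within a source;
# normalization renames everything into the single shared model below.
FIELD_MAPS = {
    "firewall": {"src": "src_ip", "dst": "dst_ip", "act": "action"},
    "proxy":    {"c-ip": "src_ip", "cs-host": "dst_host", "sc-status": "action"},
}

def normalize(source: str, event: dict) -> dict:
    mapping = FIELD_MAPS[source]
    # Rename known fields into the shared model; pass the rest through.
    return {mapping.get(k, k): v for k, v in event.items()}

print(normalize("proxy", {"c-ip": "10.0.0.5", "cs-host": "evil.example", "sc-status": "blocked"}))
```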

Streaming into Kafka

Data from sources can be streamed into platforms like Kafka/Redpanda, which serve as a buffer for shorter-term retention. Kafka's role as a streaming platform lets multiple consumers transform the same data for diverse purposes, such as populating a graph database or loading into Elasticsearch. This stage can also act as the first detection layer for organizations that prefer layered detections.
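Below is a minimal sketch of one such consumer using the kafka-python client; broker addresses, topic names, and the watchlist are assumptions. Other consumer groups (a graph loader, an Elasticsearch sink) would read the same topic independently.

```python
"""Sketch of one Kafka consumer in a multi-consumer layout, applying a
cheap first-stage detection and republishing hits to an alerts topic.
Topic and broker names are hypothetical."""
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["localhost:9092"]          # hypothetical brokers
consumer = KafkaConsumer(
    "security-events",                # raw telemetry topic (assumed name)
    bootstrap_servers=BROKERS,
    group_id="streaming-detections",  # other groups consume independently
    value_deserializer=lambda v: json.loads(v),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode(),
)

# Layered detection, stage one: streaming match before batch analytics.
WATCHLIST = {"198.51.100.7", "203.0.113.42"}  # illustrative IOC IPs

for msg in consumer:
    event = msg.value
    if event.get("src_ip") in WATCHLIST:
        producer.send("security-alerts", event)  # downstream SOC triage topic
```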

Data Processing

Storage Engines

As the architecture indicates, the primary storage engine can be object storage from any major cloud provider or an on-prem system like MinIO. Starburst’s compatibility with on-premises systems positions it uniquely for hybrid architectures, a rising trend in modern organizations. Apache Iceberg or Delta Lake serves as the table format, aligning seamlessly with Starburst’s federated query capabilities.
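As a sketch of what table creation could look like through Starburst, here is a Trino DDL statement issued from the trino Python client. The coordinator host, catalog name (`lake`), and schema are assumptions; day-level partitioning is one illustrative choice for pruning time-bounded searches.

```python
"""Minimal sketch of creating an Iceberg table through Starburst/Trino,
assuming a catalog named `lake` backed by S3/MinIO object storage."""
import trino

conn = trino.dbapi.connect(
    host="starburst.internal",  # hypothetical coordinator
    port=8080,
    user="sdp-admin",
    catalog="lake",             # Iceberg catalog over object storage (assumed)
    schema="security",
)
cur = conn.cursor()

# Partition by day so 90-120 day quick searches prune most files.
cur.execute("""
CREATE TABLE IF NOT EXISTS proxy_logs (
    ts        timestamp(6) with time zone,
    src_ip    varchar,
    dst_host  varchar,
    action    varchar
)
WITH (partitioning = ARRAY['day(ts)'])
""")
cur.fetchall()  # drain the result to ensure the DDL completes
```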

Specialized Platforms

Specialized platforms like Neo4j (graph database), Elasticsearch, Snowflake, and ClickHouse are necessary in many use cases. Starburst distinguishes itself by providing connectors to most of these platforms out of the box. Starburst Enterprise’s compatibility with Jupyter notebooks also gives users an interface for data exploration, primarily in ML use cases and threat hunting.
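To illustrate the federation point, here is a hedged sketch of a single query joining lake telemetry with an asset inventory in PostgreSQL; the catalog names (`lake`, `cmdb`), tables, and columns are all assumptions. The same pattern reaches Elasticsearch, Snowflake, or ClickHouse through their respective connectors.

```python
"""Sketch of a federated Starburst query spanning two catalogs;
all names are hypothetical."""
import trino

conn = trino.dbapi.connect(
    host="starburst.internal", port=8080, user="hunter",  # hypothetical
)
cur = conn.cursor()

# One SQL statement spans both systems; Starburst federates the join.
cur.execute("""
SELECT p.ts, p.src_ip, a.owner, a.criticality
FROM lake.security.proxy_logs p
JOIN cmdb.public.assets a ON p.src_ip = a.ip_address
WHERE p.action = 'blocked'
  AND p.ts > current_timestamp - INTERVAL '7' DAY
""")
for row in cur.fetchall():
    print(row)
```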

Use Cases Across Teams

Teams Leveraging the Security Data Platform (SDP)

Several teams within an organization can leverage the SDP, each finding tailored advantages. The Security Operations Center (SOC), Network Operations Center (NOC), Insider Threat Detection, Threat Research, Governance Risk and Compliance (GRC), Vulnerability Management, Machine Learning (ML) teams, and Executives with a focus on metrics and dashboards all benefit from this unified platform.

Some of the use cases that can be built on this type of platform include:

  • Create on-demand reports over large-scale data.
  • Run quick searches across 90–120 days of data.
  • Load data into a graph store to study relationships between entities.
  • Run IOC searches on streaming as well as historical data (see the sketch after this list).
  • Run detections on both streaming and batch data.
  • Correlate data, run complex threat hunts, and automate them.
  • Integrate with ticketing systems like ServiceNow and TheHive and collaboration tools like Slack.
  • Combine network and operational data for Fraud, AML, or UEBA detection use cases.
  • Apply machine learning and deep learning to threat detection and response use cases.
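As one illustration of the IOC-search use case above, the following sketch sweeps the historical side; it reuses the hypothetical `lake.security.proxy_logs` table from the earlier storage sketch, and the indicators are placeholders. The streaming side would reuse the Kafka-stage watchlist shown earlier.

```python
"""Illustrative IOC sweep over historical lake data via Starburst;
table name and indicators are hypothetical."""
import trino

IOCS = ["198.51.100.7", "203.0.113.42"]  # placeholder indicators

conn = trino.dbapi.connect(host="starburst.internal", port=8080, user="soc")
cur = conn.cursor()

# Inline literals keep the sketch short; a production job would bind
# parameters or join against an IOC table instead.
placeholders = ", ".join(f"'{i}'" for i in IOCS)
cur.execute(f"""
SELECT src_ip, count(*) AS hits, min(ts) AS first_seen, max(ts) AS last_seen
FROM lake.security.proxy_logs
WHERE src_ip IN ({placeholders})
  AND ts > current_timestamp - INTERVAL '120' DAY
GROUP BY src_ip
""")
for src_ip, hits, first_seen, last_seen in cur.fetchall():
    print(src_ip, hits, first_seen, last_seen)
```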

Starburst introduces a decentralized methodology for handling security data, presenting a robust alternative in scenarios where economic considerations and hybrid cloud architectures are vital factors. This approach works particularly well for Federal agencies with multiple sub-agencies that may want to centralize or decentralize their security data.

Originally published here.