Deploying A Modern Security Data Lake
Deploying a
Modern Security
Data Lake
Solve Legacy SIEM Problems,
Integrate Data Science, and
Enable Collaboration
David Baum
REPORT
978-1-098-13495-2
Table of Contents
Acknowledgments
4. Achieving Your Security Program Objectives
Introducing the Threat Detection Maturity Framework
Embracing Detection-as-Code Principles
Improving Threat Detection Fidelity
Preparing for Breach Response
Measuring Alert Quality with KPIs
Applying Data Science to Threat Hunting
Summary
Acknowledgments
CHAPTER 1
The Rise of the Security Data Lake
storage allocations, which hinder the effectiveness of forensic inves‐
tigations. Meanwhile, because of increasingly complex regulatory
requirements, security professionals must help their organizations
comply with strict data privacy regulations governing the creation,
storage, and use of consumer data. This adds to the already oner‐
ous task of monitoring corporate information systems to avoid
unauthorized incursions—before data is lost, trust is breached, or
customers become aware of performance issues.
Security teams need a faster, easier way to get the data they need
so they can stop bad actors before attacks escalate into breaches.
Maintaining strong edge controls, such as endpoint detection and
response (EDR) software and secure access service edge (SASE)
systems, is an important part of network security, but savvy attackers
know how to penetrate these virtual perimeters. To minimize risk,
security operations teams must modernize their cybersecurity sys‐
tems so they can detect, analyze, and even predict potential threats
quickly and effectively to thwart breaches when intrusions do occur.
A security data lake is a specialized data lake designed for collect‐
ing and manipulating security data. This report describes how the
security data lake model can complement or replace the traditional
SIEM model. It also describes how to create a modern security data
lake with an organization’s existing cloud data platform to deliver
comprehensive visibility and powerful automation across multiple
security use cases.
Summary
Attack surfaces are expanding as enterprises increasingly rely on
complex, multi-cloud environments. Unfortunately, legacy SIEM
solutions fail to enable effective threat detection and response in
these diverse IT settings. These outdated solutions are plagued with
data storage and retention limitations, along with poor scalability
and slow query performance. As a result, many security teams can’t
easily determine what is happening across their organization’s infrastructure.
These constraints also restrict historical reporting and data
science initiatives because each set of security logs ingested into
the SIEM is available for a limited period, typically 90 days or less.
Effectively securing your environment is difficult when you have
access to only some of your security data, some of the time.
CHAPTER 2
Implementing a
Security Data Lake
• How is data from these sources being used today?
• What are the biggest challenges and gaps to success with these
use cases?
• What are the key security logs and data sources you depend on?
• How do you leverage this data?
• What data sources are you not collecting currently?
• What data sets are archived and not readily accessible?
• How would centralizing these data sources be helpful?
• In what ways can you collect data from these sources?
• Does your current SIEM solution provide the investigative
capabilities you need? If not, in what ways is it lacking?
• How many detection rules are active across the environment?
How many of these are prebuilt versus custom-developed?
• Can the team develop detections in house? Does it have ade‐
quate skills for data modeling and data engineering?
Figure 2-1. Data lakes are often designed to move data through vari‐
ous zones as the data is prepared for use. The same approach can
support threat detection, hunting, and incident response.
As you develop and test SQL queries that cover your end users’
expected questions, make sure to consider the less frequent but
potentially challenging queries and edge cases. You should also
monitor query performance at production scale to make sure it
is adequate for each use case. As you work with the users you
identified in Phase 1, provide them with sample queries, and make
sure those queries return the results your users anticipated.
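One way to make this validation repeatable is to pair each sample query with the result a user expects and check the two against a small fixture data set. The sketch below is illustrative only: it uses Python's built-in sqlite3 module as a stand-in for the cloud data platform, and the auth_events table, its columns, and the sample rows are hypothetical rather than taken from any particular product.

```python
import sqlite3


def build_test_lake() -> sqlite3.Connection:
    """Create an in-memory stand-in for the security data lake (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE auth_events (user_name TEXT, outcome TEXT, event_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO auth_events VALUES (?, ?, ?)",
        [
            ("alice", "FAILURE", "2024-01-01T10:00:00"),
            ("alice", "FAILURE", "2024-01-01T10:01:00"),
            ("alice", "SUCCESS", "2024-01-01T10:02:00"),
            ("bob", "FAILURE", "2024-01-01T11:00:00"),
        ],
    )
    return conn


# Sample query handed to analysts: failed logins per user, busiest first.
FAILED_LOGINS_SQL = """
    SELECT user_name, COUNT(*) AS failures
    FROM auth_events
    WHERE outcome = 'FAILURE'
    GROUP BY user_name
    ORDER BY failures DESC, user_name
"""


def run_sample_query(conn: sqlite3.Connection, sql: str) -> list:
    """Run a sample query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


def check_query(conn: sqlite3.Connection, sql: str, expected: list) -> bool:
    """Return True when the query returns exactly the rows the user anticipated."""
    return run_sample_query(conn, sql) == expected


if __name__ == "__main__":
    lake = build_test_lake()
    print(check_query(lake, FAILED_LOGINS_SQL, [("alice", 2), ("bob", 1)]))
```

Keeping the expected rows next to the query means a change in upstream data or query logic shows up as a failed check rather than as a surprise for the analyst.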
Validate your preparedness with penetration tests that simulate
attacks relevant to the threat models you have identified during
this iteration of the security data lake buildout. For example, you
might want to test for basic malware downloads, API keys that are
compromised, and lateral movement of an attacker through partic‐
ular domains. The goal is to identify gaps and issues in visibility,
detection logic, and IR search queries. One of the advantages of a
security data lake is its inherent support for engineering processes
such as testing, measurement, and “detection as code,” systematically
reducing risk for the organization.
Finally, consider how you can further empower your users and
other stakeholders with actionable dashboards and metrics. In a
production security data lake, self-service BI dashboards meet most
query requirements for business users that contribute to the security
Summary
Modern cloud data platforms enable security data lake initiatives
that require less effort and are faster to implement than traditional
security solutions. However, successful implementations require
careful planning and benefit from the following best practices. To
implement a security data lake that makes effective use of the enor‐
mous volume and complexity of your security data, remember these
key steps:
• Take stock of your most pressing needs and match them against
your current security stack to identify gaps in visibility or
capabilities.
• Unify security data sources and enterprise data using built-in
ingestion utilities within the cloud data platform or via an eco‐
system partner’s prebuilt connectors.
• Create a data model for accelerating analytics on security data,
and use schema-on-read to query collected data in its raw state.
• Partner with the data team within your business to utilize
existing data solutions and collaborate on a data-driven secu‐
rity strategy, including business intelligence (BI) reporting and
behavior analytics.
• Investigate incidents, hunt threats, and proactively reduce risk
by querying the security data lake both directly and through
purpose-built interfaces in connected applications.
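The schema-on-read step in the list above can be illustrated with a minimal sketch: events are stored exactly as they arrived, and structure is applied only when a query runs. The example uses plain Python and the standard json module rather than a real data platform, and the event fields (source, user, outcome, and so on) are hypothetical.

```python
import json

# Raw events kept as received; each source has its own shape (hypothetical samples).
RAW_EVENTS = [
    '{"source": "edr", "host": "laptop-7", "process": "powershell.exe"}',
    '{"source": "idp", "user": "alice", "outcome": "FAILURE"}',
    '{"source": "idp", "user": "bob", "outcome": "SUCCESS", "mfa": true}',
]


def read_field(raw_event: str, field: str, default=None):
    """Parse one raw event at read time and return a single field, if present."""
    return json.loads(raw_event).get(field, default)


def query_raw(events, **criteria):
    """Return parsed events whose fields equal every key/value in criteria.

    No schema is enforced at load time; events that lack a requested
    field simply do not match.
    """
    matches = []
    for raw in events:
        record = json.loads(raw)
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches


if __name__ == "__main__":
    for hit in query_raw(RAW_EVENTS, source="idp", outcome="FAILURE"):
        print(hit)
```

The trade-off is that parsing cost moves to query time, which is one reason the same list also recommends a data model for the queries that run most often.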
Today’s purpose-built solutions, whether they are labeled as open
SIEMs, XDRs, or security analytics platforms, handle everything
from data ingestion to incident response, complementing or replac‐
ing the more limited scalability and analytic capabilities of tradi‐
tional SIEMs. The goal of these connected ecosystems is to simplify
the creation, operation, and maintenance of your security data lake.
Connected security apps typically include a high-level graphical interface
for security analysts who aren’t yet comfortable with SQL. Tailored
interfaces can be helpful for advanced users as well when they sup‐
port graph-type navigation or point-and-click pivots. Security teams
can accelerate their implementations by leveraging out-of-the-box
integrations, content, and visualizations. They can also consolidate
all their enterprise and security data into a single location and take
advantage of advanced analytics for detection and response (see
Figure 3-1).
Context Matters
Most security solutions offer users some level of analytics and
reporting. The problem users face when they log in to each
point solution is that they see only that tool’s data, such as endpoint
activity or vulnerability findings, in a silo. This makes it hard, if not impossible,
to achieve high-fidelity insights or establish automated workflows.
Security professionals must learn the unique nuances of each envi‐
ronment before taking action within the data environment as a
whole.
A security data lake that combines data from all logs, users, assets,
and configurations into a cohesive repository makes it much easier
to understand the contextual relations among data elements, greatly
expanding the possibilities for automation. For example, a detection
engine can connect the dots to uncover threats that would have been
impossible to spot by looking at individual sources on their own. A
real-world example would be a compliance analyst joining employee
termination records from the HR application, permission policies
from the identity directory, and authentication events from sensitive
systems to identify users who accessed sensitive data or resources,
and to determine whether they abused that access. This versatile
model opens new opportunities for threat hunting as well as other
security tasks that benefit from a holistic view.
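The compliance example above amounts to a three-way join. The following sketch shows one possible shape for it, assuming Python's built-in sqlite3 module as a stand-in for the data lake. The table names (hr_terminations, directory_permissions, sensitive_auth_events), their columns, and the sample rows are all hypothetical; a real deployment would use whatever schema its HR, identity, and logging sources provide.

```python
import sqlite3


def load_context() -> sqlite3.Connection:
    """Load three hypothetical data sets into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hr_terminations (employee_id TEXT, terminated_at TEXT);
        CREATE TABLE directory_permissions (employee_id TEXT, system_name TEXT);
        CREATE TABLE sensitive_auth_events (
            employee_id TEXT, system_name TEXT, event_time TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO hr_terminations VALUES (?, ?)",
        [("e100", "2024-03-01T17:00:00")],
    )
    conn.executemany(
        "INSERT INTO directory_permissions VALUES (?, ?)",
        [("e100", "payroll"), ("e200", "payroll")],
    )
    conn.executemany(
        "INSERT INTO sensitive_auth_events VALUES (?, ?, ?)",
        [
            ("e100", "payroll", "2024-02-28T09:00:00"),  # before termination
            ("e100", "payroll", "2024-03-02T23:15:00"),  # after termination
            ("e200", "payroll", "2024-03-02T10:00:00"),  # active employee
        ],
    )
    return conn


# Authentications by terminated employees, after their termination time,
# on systems where the directory still grants them permission.
# ISO 8601 timestamps in one format compare correctly as text.
POST_TERMINATION_ACCESS_SQL = """
    SELECT a.employee_id, a.system_name, a.event_time
    FROM sensitive_auth_events AS a
    JOIN hr_terminations AS t
      ON t.employee_id = a.employee_id
    JOIN directory_permissions AS p
      ON p.employee_id = a.employee_id AND p.system_name = a.system_name
    WHERE a.event_time > t.terminated_at
    ORDER BY a.event_time
"""


def find_post_termination_access(conn: sqlite3.Connection) -> list:
    """Return (employee_id, system_name, event_time) rows that merit review."""
    return conn.execute(POST_TERMINATION_ACCESS_SQL).fetchall()


if __name__ == "__main__":
    for row in find_post_termination_access(load_context()):
        print(row)
```

None of the three sources could answer the question alone; the finding only appears once they sit side by side in one repository.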
This architecture empowers not just analysts but also a new gener‐
ation of applications that intelligently connect information across
data sets, merging security data with other enterprise sources.
Autonomous threat hunting, permission rightsizing, and security
control validation are some examples of connected application use
cases that can take advantage of the context established in your
modern security data lake.
Summary
Connected applications are SaaS solutions that separate code and
data. The SaaS vendor maintains the application infrastructure and
code while the customer manages the application’s data within their
own security data lake. Because the architecture rests on a scalable,
cloud-built data platform, building and maintaining API integra‐
tions between the data lake and the connected apps is unnecessary.
Each customer has its own environment within the cloud data plat‐
form where it can combine the data sets from multiple SaaS vendors
with its own business data to create a unified, single source of truth
for the entire organization.
Modern cloud data platforms power a broad vendor ecosystem that
attracts best-of-breed security software vendors. Role-based access
control, in conjunction with other security and governance mecha‐
nisms, ensures each user and each organization can access only the
data they are explicitly permitted to see.
CHAPTER 4
Achieving Your Security
Program Objectives
Bad actors are increasingly well funded, highly capable, and deter‐
mined to leverage new technologies and paradigms to launch their
attacks. In this constantly evolving threat landscape, building detec‐
tions for every possible adversary or technique is nearly impossible.
To maximize your defense posture, your security operations center
must employ proven, repeatable processes for creating and main‐
taining threat detections in conjunction with continuous monitor‐
ing and testing to adapt them to real-world conditions. A security
data lake enables you to achieve your security program objectives as
part of a continuous process of improvement known as the Threat
Detection Maturity Framework. This process yields more robust
threat detections and greater alert fidelity by adhering to detection-
as-code procedures.
To stay up to date on prevalent and emerging attack patterns,
many security organizations adopt the MITRE ATT&CK matrix, an
industry-standard framework for measuring threat detection matur‐
ity. This publicly available knowledge base tracks the tactics and
techniques used by threat actors across the entire attack lifecycle.
Created by a nonprofit organization for the United States govern‐
ment, the ATT&CK matrix helps security teams understand the
motivations of adversaries and determine how their actions relate to
specific classes of defenses.
Although the MITRE ATT&CK matrix is a good starting point,
conscientious threat detection teams consider many additional fac‐
tors outside of this matrix. A complete threat detection maturity
framework should encompass five categories:
Processes
Development methodologies and workflows
Data, tools, and technology
Logs, data sources, integration logic, and documentation
Capabilities
Searches, analytic dashboards, and risk-scoring models
Coverage
Mapping of detections to threats, and prioritization of responses
People
A diverse and well-rounded SOC team
For each of these five categories, the framework defines the follow‐
ing three maturity levels:
Ad hoc
Initial rollout of security data, logic, and tools
Organized
Gradual adoption of best practices
Optimized
Well-defined procedures based on proven principles
For example, within the processes category, a team with an ad hoc
level of maturity likely has no formalized development methodol‐
ogy, no defined inputs, only rudimentary detections, and no defined
metrics.
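A team that wants to track its position in this framework over time could record a self-assessment as a simple data structure. The sketch below is one possible representation in Python. The category and level names come from the lists above, but the numeric ordering of levels and the weakest_categories helper are assumptions made for illustration, not part of the framework itself.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The three maturity levels, ordered so they can be compared (assumed ordering)."""
    AD_HOC = 1
    ORGANIZED = 2
    OPTIMIZED = 3


# The five categories named in the framework.
CATEGORIES = (
    "Processes",
    "Data, tools, and technology",
    "Capabilities",
    "Coverage",
    "People",
)


def weakest_categories(assessment: dict) -> list:
    """Return the categories sitting at the lowest maturity level in an assessment.

    Raises ValueError if any of the five categories has not been assessed.
    """
    missing = [c for c in CATEGORIES if c not in assessment]
    if missing:
        raise ValueError(f"assessment is missing categories: {missing}")
    lowest = min(assessment[c] for c in CATEGORIES)
    return [c for c in CATEGORIES if assessment[c] == lowest]


if __name__ == "__main__":
    example = {
        "Processes": Maturity.AD_HOC,
        "Data, tools, and technology": Maturity.ORGANIZED,
        "Capabilities": Maturity.ORGANIZED,
        "Coverage": Maturity.OPTIMIZED,
        "People": Maturity.ORGANIZED,
    }
    print(weakest_categories(example))
```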
Summary
To break down the data silos and enable analytics on a scale that
can accommodate today’s nonstop network activity, invest in a cloud
data platform that can handle a broad set of use cases, including a
security data lake, and work with a very high volume of data. Secu‐
rity teams can use this platform as a foundation to progress on the
threat detection maturity framework and follow detection-as-code
principles, including the following:
• Agile development of detections throughout the continuous
loop of testing, debugging, deployment, and production
• Continuous integration/continuous delivery (CI/CD) of data
pipelines and models for fast and reliable detection and
response
• Automated testing and quality assurance (QA) for rules, espe‐
cially important as upstream data sources change over time
• Versioning and change management for detection code
• Promotion, reuse, and automation of data models, detections,
and other artifacts
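As a rough illustration of these principles, a detection can be written as versioned code with an automated test stored beside it, so that a CI/CD pipeline can run the test on every change. The sketch below is a minimal Python example under stated assumptions: the Detection structure, the rule name, the event fields, and the threshold of five attempts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass(frozen=True)
class Detection:
    """A versioned detection rule (hypothetical structure for illustration)."""
    name: str
    version: str
    matches: Callable[[Event], bool]


def repeated_login_failures(event: Event) -> bool:
    """Flag events reporting five or more failed login attempts (assumed threshold)."""
    return event.get("outcome") == "FAILURE" and int(event.get("attempts", 0)) >= 5


BRUTE_FORCE = Detection("repeated_login_failures", "1.0.0", repeated_login_failures)


def run_detection(detection: Detection, events: List[Event]) -> List[Event]:
    """Return the events that the detection flags."""
    return [event for event in events if detection.matches(event)]


def test_brute_force_detection() -> None:
    """Automated check intended to run in CI whenever the rule or its inputs change."""
    fixtures = [
        {"user": "alice", "outcome": "FAILURE", "attempts": 7},  # should alert
        {"user": "bob", "outcome": "FAILURE", "attempts": 2},    # below threshold
        {"user": "carol", "outcome": "SUCCESS", "attempts": 9},  # not a failure
        {"user": "dave", "outcome": "FAILURE"},                  # field missing upstream
    ]
    alerts = run_detection(BRUTE_FORCE, fixtures)
    assert [alert["user"] for alert in alerts] == ["alice"]


if __name__ == "__main__":
    test_brute_force_detection()
    print("all detection tests passed")
```

The fixture with a missing field reflects the point about upstream data sources changing over time: the test documents how the rule should behave when an expected field disappears.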
By moving your data sets to a security data lake, you can reduce
traditional SIEM license fees and operational overhead. You can use
one system to analyze data from a huge variety of sources. You
can store many types of data—including logs, user credentials, asset
details, findings, and metrics—in one central place and use the same
sets of data for multiple security initiatives. Collected data can be
stored in the security data lake for as long as you want, eliminating
complex storage tiers and rehydration overhead. Anytime you
want to search that data, you can do so easily via your connected
security applications of choice.
A modern cybersecurity strategy begins with a security data lake and
its rich ecosystem of security solutions and data providers equipped
to handle the vastly expanding threat landscape.