
Different Stages In Data Pipeline

There are primarily 3 different stages in Splunk:

• Data Input stage
• Data Storage stage
• Data Searching stage

Data Input Stage


In this stage, Splunk software consumes the raw data stream from its source, breaks it into 64K blocks, and annotates each block with metadata keys. The metadata keys include the hostname, source, and source type of the data. The keys can also include values that are used internally, such as the character encoding of the data stream, and values that control the processing of data during the indexing stage, such as the index into which the events should be stored.
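To make this concrete, here is a minimal sketch of how these metadata keys are typically assigned on a forwarder in inputs.conf. The monitored path, host, source type, and index name below are illustrative assumptions, not values taken from this document:

    # inputs.conf on a forwarder (hypothetical values, for illustration only)
    [monitor:///var/log/web/access.log]
    # metadata keys annotated at the input stage:
    host = web01
    sourcetype = access_combined
    # the index into which the events should be stored:
    index = web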

Data Storage Stage

Data storage consists of two phases: Parsing and Indexing.

1. In the parsing phase, Splunk software examines, analyzes, and transforms the data to extract only the relevant information. This is also known as event processing. It is during this phase that Splunk software breaks the data stream into individual events. The parsing phase has several sub-phases (a props.conf sketch of these settings follows this list):
   i. Breaking the stream of data into individual lines
   ii. Identifying, parsing, and setting timestamps
   iii. Annotating individual events with metadata copied from the source-wide keys
   iv. Transforming event data and metadata according to regex transform rules
2. In the indexing phase, Splunk software writes parsed events to the index on disk. It writes both the compressed raw data and the corresponding index files. The benefit of indexing is that the data can be easily accessed during searching.
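As a rough sketch of the parsing settings referenced above, the props.conf stanza below controls line breaking, timestamp extraction, and a regex transform for an imagined source type. The source type name, timestamp format, and transform name are assumptions for illustration only:

    # props.conf on the parsing tier (hypothetical source type and formats)
    [my_app_logs]
    # i. break the stream into individual events, one per line:
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)
    # ii. identify and parse the timestamp at the start of each event:
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S
    MAX_TIMESTAMP_LOOKAHEAD = 19
    # iv. apply a regex transform rule defined in transforms.conf:
    TRANSFORMS-example = my_routing_rule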
Data Searching Stage

This stage controls how the user accesses, views, and uses the indexed data. As
part of the search function, Splunk software stores user-created knowledge
objects, such as reports, event types, dashboards, alerts and field extractions.
The search function also manages the search process.
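For illustration, a saved search with an alert, one of the knowledge objects mentioned above, might be defined in savedsearches.conf roughly as follows. The search string, schedule, threshold, and e-mail address are assumptions, not part of the original material:

    # savedsearches.conf (hypothetical report/alert definition)
    [Excessive failed logins]
    search = index=web sourcetype=access_combined status=401 | stats count by clientip
    # run the search on a 15-minute schedule:
    enableSched = 1
    cron_schedule = */15 * * * *
    dispatch.earliest_time = -15m
    dispatch.latest_time = now
    # trigger an alert when more than 10 matching events are found:
    alert_type = number of events
    alert_comparator = greater than
    alert_threshold = 10
    actions = email
    action.email.to = secops@example.com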

Splunk Components
Each of the Splunk components described below falls under one of the data pipeline stages discussed above.
There are 3 main components in Splunk:

• Splunk Forwarder, used for data forwarding
• Splunk Indexer, used for parsing and indexing the data
• Search Head, a GUI used for searching, analyzing and reporting
Splunk Forwarder
Splunk Forwarder is the component you use to collect logs. Suppose you want to collect logs from a remote machine; you can accomplish that with Splunk's remote forwarders, which are independent of the main Splunk instance.
You can install several such forwarders on multiple machines, and they will forward the log data to a Splunk Indexer for processing and storage. What if you want to do real-time analysis of the data? Splunk forwarders can be used for that purpose too: you can configure them to send data to Splunk indexers in real time, collecting data simultaneously from many different machines.
To understand how real-time forwarding of data happens, you can read my blog on how Domino's is using Splunk to gain operational efficiency.
Compared to other traditional monitoring tools, the Splunk Forwarder consumes very little CPU (roughly 1-2%). You can scale up to tens of thousands of remote systems easily and collect terabytes of data with minimal impact on performance.
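As a minimal sketch of how a forwarder is pointed at an indexer, the outputs.conf below sends all data to a single indexer over Splunk's conventional receiving port 9997. The host name is an assumption for illustration:

    # outputs.conf on the forwarder (hypothetical indexer address)
    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = splunk-indexer.example.com:9997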
Now, let us understand the different types of Splunk forwarders.
Universal Forwarder – You can opt for a universal forwarder if you want to forward the raw data collected at the source. It is a simple component which performs minimal processing on the incoming data streams before forwarding them to an indexer.
Data transfer is a major problem with almost every tool in the market. Since there is minimal processing on the data before it is forwarded, a lot of unnecessary data is also forwarded to the indexer, resulting in performance overheads.

Why go through the trouble of transferring all the data to the indexers and then filtering out only the relevant data? Wouldn't it be better to send only the relevant data to the indexer and save on bandwidth, time and money? This can be solved by using heavy forwarders, which I have explained below.

Heavy Forwarder – You can use a heavy forwarder and eliminate half your problems, because one level of data processing happens at the source itself before the data is forwarded to the indexer. A heavy forwarder typically performs parsing (and optionally indexing) at the source and also intelligently routes the data to the indexer, saving on bandwidth and storage space. So when a heavy forwarder parses the data, the indexer only needs to handle the indexing segment.
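One common way a heavy forwarder keeps unnecessary events from reaching the indexer is queue routing: send everything for a source type to the null queue, then route only the events you care about back to the index queue. This is a hedged sketch; the source type and the ERROR keyword are purely illustrative:

    # props.conf on the heavy forwarder (hypothetical source type)
    [my_app_logs]
    TRANSFORMS-filter = drop_everything, keep_errors

    # transforms.conf
    [drop_everything]
    # match every event and send it to the null queue (discard)
    REGEX = .
    DEST_KEY = queue
    FORMAT = nullQueue

    [keep_errors]
    # events matching ERROR are routed back to the index queue and forwarded
    REGEX = ERROR
    DEST_KEY = queue
    FORMAT = indexQueue

Because the transforms are applied in order, only the events matching the second rule survive and are forwarded.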

Splunk Architecture
If you have understood the concepts explained above, you can easily relate to the Splunk architecture, which brings together the various components involved in the process and their functionalities:
• You can receive data from various network ports and run scripts to automate data forwarding
• You can monitor incoming files and detect changes in real time
• The forwarder can intelligently route the data, clone the data and load-balance that data before it reaches the indexer. Cloning creates multiple copies of an event right at the data source, whereas load balancing ensures that even if one instance fails, the data can be forwarded to another instance hosting an indexer (an outputs.conf sketch of both follows this list)
• As I mentioned earlier, the deployment server is used for managing the entire deployment, its configurations and policies
• When the data is received, it is stored in an indexer. The indexed data is then broken down into different logical data stores, and at each data store you can set permissions which control what each user views, accesses and uses
• Once the data is in, you can search the indexed data and also distribute searches to other search peers; the results will be merged and sent back to the search head
• Apart from that, you can also schedule searches and create alerts, which will be triggered when certain conditions match saved searches
• You can use saved searches to create reports and perform analysis using visualization dashboards
• Finally, you can use knowledge objects to enrich the existing unstructured data
• Search heads and knowledge objects can be accessed from the Splunk CLI or the Splunk Web interface. This communication happens over a REST API connection
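To make the routing, cloning and load-balancing bullet above concrete, here is a hedged outputs.conf sketch (all host names are assumptions): listing several indexers in one target group load-balances across them, while listing two target groups under defaultGroup clones each event to both groups.

    # outputs.conf on the forwarder (hypothetical host names)
    [tcpout]
    # two target groups => each event is cloned to both groups
    defaultGroup = dc1_indexers, dc2_indexers

    [tcpout:dc1_indexers]
    # multiple servers in one group => automatic load balancing across them
    server = idx1.dc1.example.com:9997, idx2.dc1.example.com:9997
    autoLBFrequency = 30

    [tcpout:dc2_indexers]
    server = idx3.dc2.example.com:9997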

SIEM Foundations: Architecture Primer


The McAfee SIEM solution comprises several appliance-based platforms working in conjunction to deliver unmatched value and performance to enterprise security professionals. A multitude of deployment configurations allows for the most scalable and feature-rich SIEM architecture available, delivering real-time forensics, comprehensive application and database traffic/content monitoring, advanced rule- and risk-based correlation for real-time as well as historical incident detection, and the most complete set of compliance features of any SIEM on the market. All appliances are available in a range of physical and virtual models.

The following list details the entire suite of available SIEM components.

ESM - Enterprise Security Manager (sometimes referred to as ETM)

The McAfee ESM is the ‘brains’ of the McAfee SIEM solution. It hosts the web interface through which all SIEM interaction is performed, as well as the master database of parsed events used for forensics and compliance reporting. It is powered by the industry-leading McAfee EDB proprietary embedded database, which boasts speeds more than 400% faster than any leading commercial or open source database.

All McAfee SIEM deployments must start with at least one ESM (or a combination ESM/REC/ELM appliance).

REC - Event Receiver (sometimes referred to as ERC)

The McAfee REC is used for the collection of all third-party event and flow data.

Event collection is supported via several methodologies:

1. Push – devices forward events or flows using SYSLOG, NetFlow, etc.
2. Pull – event/log data is collected from the data source using SQL, WMI, etc.
3. Agent – data sources are configured to send event/log/flow data using a small-footprint agent such as McAfee SIEM Event Collector, SNARE, Adiscon, Lasso, etc.

The Event Receiver can also be configured to collect scan results from existing vulnerability assessment platforms such as McAfee MVM, Nessus, Qualys, eEye, Rapid7, etc. In addition, the REC supports rule-based event correlation running as an application on the Receiver. Receiver-based correlation has several limitations: risk-based correlation, deviation correlation, and flow correlation are not supported on a Receiver; an ACE (see below) is required for these functions. Also, as a rule of thumb, Receiver-based correlation imposes an approximately 20% performance penalty on your Receiver. For most enterprise environments, McAfee recommends using an ACE to centralize correlation and provide sufficient resources for this function.

McAfee Event Receivers come as physical appliances rated from 6k to 26k events per second (EPS), as well as VM-based models with event collection rates ranging from 500 to 15k EPS.
Multiple REC appliances (or VM platforms) can be deployed centrally to provide a
consolidated collection environment or can be geographically distributed throughout
the enterprise.  Typical deployment scenarios will locate an Event Receiver in each
of several data centers, all of which will feed their collected events back to a
centralized ESM (or to multiple ESM appliances for redundancy and disaster recovery
purposes).

ELM - Enterprise Log Manager

The McAfee ELM stores the raw, litigation-quality event/log data collected from data
sources configured on Event Receivers.  In SIEM environments where compliance is
a success factor, the ELM is used to maintain event chain of custody and ensure full
non-repudiation.

In addition to providing compliance-quality raw event archival, the ELM also maintains the full-text index (FTI) for all event details, which allows the McAfee SIEM to perform ad-hoc searches against the unstructured data maintained in the archive.

ESM/REC/ELM

The ESMRECELM - also called an All-in-One (AIO) or a ‘combo box’ - provides the
combined functions of the McAfee Enterprise Security Manager (ESM), Event
Receiver (REC) and Enterprise Log Manager (ELM) in a single appliance.

As most SIEM POC deployments are intended to showcase functionality rather than
performance, the ESMRECELM is commonly used to demonstrate the features and
ease of use delivered by the McAfee SIEM.  It can be deployed with minimal
disruption (single appliance, minimal rack space and power, single network
connection and IP address).

In larger POC or production SIEM environments, a combo box may be inadequate to handle the sizable EPS requirements of an enterprise. The largest ESMRECELM peaks at 6k EPS and provides no local storage for the ELM archive; instead, it requires supplemental storage by means of a SAN connection, NFS or CIFS share.

ACE - Advanced Correlation Engine

The ACE provides the SIEM with unmatched advanced correlation capabilities that
include both rule- and risk-based options.  In addition to performing real-time
analysis, the ACE can be configured to process historical event/log data against the
current set of rule and risk profiles, as well as deviation correlation and flow-
correlation.  The ACE provides native risk scoring for GTI (for SIEM) and MRA-
enabled customer environments.  It also allows custom risk scoring to be configured
to highlight threats performed against high-value assets, sensitive data and/or by
privileged users.

Typical production SIEM deployments will include two ACE appliances – one
performing real-time rule and risk correlation and another configured for historical
rule and risk correlation of events.

ADM - Application Data Monitor (sometimes referred to as APM)

The ADM provides layer 7 application decode of enterprise traffic via four promiscuous network interfaces. It is used to track the transmission of sensitive data and application usage, as well as to detect malicious or covert traffic, theft or misuse of credentials, and application-layer threats.

The ADM is not to be confused with a true DLP; its integration with the SIEM provides advanced forensics value by preserving full transactional detail for sessions that violate the user-defined policy managed from within the McAfee ESM common user interface. Complex rule correlation can leverage policy violation or suspicious application usage events to identify potential security incidents in real time.

DEM - Database Event Monitor (sometimes referred to as DBM)

The DEM provides a network-based solution for real-time discovery and transactional monitoring of database activity via two or four promiscuous network interfaces. It works either in lieu of or in parallel with the McAfee (Sentrigo) agent-based database activity solution to provide comprehensive, transaction-level monitoring of database usage by users or applications.
