Kwasu-Csc204 Big Data Computing and Security-1
6. Traditional data is relatively simple to process; Big Data is complex and requires specialized tools and expertise.
7. Traditional data provides insights into historical trends; Big Data can provide real-time insights and predictive analytics.
What is Big Data Analytics?
Big Data Analytics is the process of examining large and complex datasets (big data)
to uncover market trends, customer preferences, and other useful business
insights.
Big Data Analytics:
• Works with huge volumes of data (terabytes to petabytes).
• Can process data in real-time or near real-time.
• Handles structured, semi-structured, and unstructured data (e.g., text, images,
videos, logs, etc.).
• Requires advanced tools and techniques such as machine learning and data mining.
Types of Big Data Analytics
Big Data Analytics is typically categorized into four types:
1. Descriptive Analytics – “What happened?”
• Summarizes past data to understand trends and patterns.
• Example: Monthly sales reports, website traffic dashboards.
2. Diagnostic Analytics – “Why did it happen?”
• Investigates the causes behind certain outcomes.
• Example: Sales dropped last quarter due to low holiday inventory.
3. Predictive Analytics – “What is likely to happen?”
• Uses historical data to forecast future events.
• Example: Predicting customer churn, sales forecasts, or equipment failure.
4. Prescriptive Analytics – “What should we do?”
• Suggests actions based on predictive models.
• Example: Recommending the best marketing strategy or optimal pricing model.
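As a rough illustration of descriptive and predictive analytics, the short Python sketch below summarizes hypothetical monthly sales (descriptive) and fits a simple linear trend to forecast the next month (a toy stand-in for real predictive models). All figures and column names are invented for illustration; units are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly revenue figures.
sales = pd.DataFrame({
    "month": pd.period_range("2024-01", periods=12, freq="M"),
    "revenue": [120, 135, 128, 150, 160, 158, 170, 182, 175, 190, 205, 214],
})

# Descriptive analytics -- "What happened?": summarize past data.
print(sales["revenue"].describe())   # mean, min, max, quartiles
print(sales.tail(3))                 # the most recent months

# Predictive analytics -- "What is likely to happen?": fit a simple linear
# trend and forecast the next month.
x = np.arange(len(sales))
slope, intercept = np.polyfit(x, sales["revenue"], deg=1)
forecast = slope * len(sales) + intercept
print(f"Forecast for month 13: {forecast:.1f}")
```

Diagnostic and prescriptive analytics build on the same data but add causal investigation and recommended actions, which require richer models than this sketch.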
Applications of Big Data Analytics
1. Healthcare
Applications:
• Predictive analytics for early disease detection
Example: Hospitals use patient history and real-time vitals to predict heart
attacks.
• Hospital resource management
Example: Emergency departments forecast patient inflow to allocate staff and
beds.
2. Retail & E-commerce
Applications:
• Personalized recommendations
Example: Amazon suggests items based on user browsing and purchase history.
• Dynamic pricing strategies
Example: E-commerce sites adjust prices based on demand and competition.
3. Banking & Finance
Applications:
• Fraud detection
Example: Banks flag suspicious transactions using real-time analytics.
• Credit risk assessment
Example: Lenders analyze financial history and social data to assess loan risk.
4. Transportation & Logistics
Applications:
• Route optimization
Example: Delivery services like FedEx use GPS and traffic data to reduce delays.
• Demand forecasting
Example: Airlines adjust schedules based on seasonal booking trends.
5. Media & Entertainment
Applications:
• Content recommendation
Example: Netflix suggests movies based on user watch history and ratings.
• Audience sentiment analysis
Example: Studios analyze social media to gauge reactions to movie trailers.
6. Government & Public Sector
Applications:
• Crime pattern analysis
Example: Police departments use data to predict high-crime areas.
• Public health surveillance
Example: Governments track disease outbreaks using social media and hospital
reports.
7. Real Estate & Construction
Applications:
• Market trend analysis
Example: Real estate firms predict property value changes using economic data.
• Project risk assessment
Example: Construction companies assess risks based on past project delays and
conditions.
Categories of Big Data Tools
1. Data Storage Tools
These store massive amounts of data in a scalable and fault-tolerant manner.
• Hadoop Distributed File System (HDFS): stores data across many machines.
• NoSQL databases: e.g., MongoDB (document-based) and Cassandra (column-based); suitable for semi-structured and unstructured data.
2. Data Collection
• Gather relevant data from various sources: databases, APIs, IoT devices, social media,
logs, etc.
• The data can be structured (e.g., spreadsheets), semi-structured (e.g., logs), or
unstructured (e.g., videos, images).
5. Data Processing
• Use batch or real-time processing to explore and filter large volumes of data.
• Create data pipelines to automate workflows.
2. Data Integration: Data often comes from multiple sources and in different
formats (structured, unstructured, and semi-structured), making it challenging to combine
into a single, usable dataset.
6. Interpretation: Even with strong analysis, turning insights into practical business
strategies can be difficult, especially if decision-makers don’t fully understand the data.
UNIT 2:
BIG DATA ECOSYSTEM
Big Data Ecosystem refers to the collection of technologies, frameworks, and processes used to
collect, store, process, analyze, and visualize large volumes of diverse and complex data
commonly known as "big data." It includes both hardware and software components working
together to handle data that is too large or moves too fast for traditional systems.
Purpose of the Big Data Ecosystem
The primary purposes of the big data ecosystem are:
1. Efficient Data Management: To collect, store, and manage massive datasets from various
sources (structured, semi-structured, and unstructured).
2. High-Speed Processing: To enable fast processing of data using tools like Apache Spark and
Hadoop, allowing for real-time or near-real-time insights.
3. Advanced Analytics: To perform complex analysis, including machine learning, data mining,
and predictive analytics for better decision-making.
4. Visualization: To present insights in understandable visual formats using tools like Tableau,
Power BI, or Kibana.
Core Components of the Ecosystem
1. Data Sources: This refers to any origin or provider of data that feeds into a big data system for
storage, processing, and analysis. These sources can be internal or external, and they generate
data in various formats (structured, semi-structured, or unstructured).
2. Data Storage: This is a fundamental component of the big data ecosystem, responsible for
holding vast amounts of data in a secure, reliable, and accessible manner. Given the volume,
velocity, and variety of big data, traditional storage systems are not sufficient; hence, specialized
storage technologies are used. Examples include distributed file systems and NoSQL databases.
a. Distributed File Systems: Distributed File Systems are storage architectures that spread data
across multiple nodes. These systems are essential in big data environments where traditional,
centralized storage cannot handle the volume, velocity, and variety of data.
Key Characteristics
i. Scalability: You can add more servers to the cluster as data grows.
ii. Fault Tolerance: Data is replicated across nodes; if one fails, others take over.
iii. High Availability: Ensures data is always accessible, even during node failures or maintenance.
iv. Distributed Access: This enables parallel data access.
Examples of distributed file systems include the Hadoop Distributed File System (HDFS) and the Google File System (GFS).
Hadoop Distributed File System (HDFS): splits data into blocks and distributes them across multiple nodes.
How the system works
➢Multiple Servers (Nodes): The organization sets up a cluster of servers — each one contributes storage
and compute resources.
➢Data is Split into Blocks: When a large file is uploaded (say, a video or log file), it is broken into smaller
blocks (e.g., 128MB or 256MB chunks).
➢Blocks Are Distributed Across Servers:
• Each block is stored on different servers.
• Redundancy is applied — the same block may be stored on 2 or 3 different nodes (replication) for
fault tolerance.
➢A Master Node (NameNode) keeps track of:
• Which block is stored on which server.
• File directory and metadata.
➢Worker Nodes (DataNodes) actually store the data and handle read/write operations.
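The toy Python sketch below mimics the block-splitting and replication idea described above. It is purely illustrative: the block size matches the common 128 MB default, but the node names and round-robin placement policy are simplified assumptions, and no real HDFS API is used.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB, a common HDFS default
REPLICATION_FACTOR = 3              # each block stored on 3 DataNodes
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical worker nodes


def split_into_blocks(file_size_bytes):
    """Return the number of fixed-size blocks a file would be split into."""
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE


def place_blocks(num_blocks):
    """Toy NameNode metadata: map each block to a list of DataNodes (round-robin)."""
    ring = itertools.cycle(DATANODES)
    return {block_id: [next(ring) for _ in range(REPLICATION_FACTOR)]
            for block_id in range(num_blocks)}


# A hypothetical 1 GB log file.
blocks = split_into_blocks(1 * 1024 * 1024 * 1024)
print(blocks, "blocks")          # 8 blocks
print(place_blocks(blocks)[0])   # e.g. ['dn1', 'dn2', 'dn3']
```

In a real cluster the NameNode holds this block-to-DataNode mapping as metadata, while the DataNodes hold the actual block contents.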
b. NoSQL Databases: NoSQL databases are non-relational and are designed to handle large
volumes of semi-structured and unstructured data.
Types of NoSQL Databases:
i. Document-Oriented Databases
• Store data as documents (usually JavaScript Object Notation(JSON) or Extensible
Markup Language (XML)).
• Ideal for semi-structured data.
Examples:
• MongoDB
• CouchDB
• Amazon DocumentDB
Use Cases:
• User profiles and catalogues
• IoT data
ii. Column-Oriented Databases
• Store data in columns instead of rows, suitable for analytics.
Examples:
• Apache Cassandra
• Apache HBase
Use Cases:
• Time-series data
• Event logging
• Recommendation systems
iii. Graph Databases
• Use nodes and edges to represent and store data relationships.
• Excellent for handling complex, interconnected data.
Key Concepts
Node: Represents an entity (person, place, or object)
Edge: Represents a relationship between two nodes
Examples:
• Neo4j
• ArangoDB
• Amazon Neptune
Use Cases:
• Social networks
• Fraud detection
• Network graphs
iv. Key-Value Stores
• Data is stored as key-value pairs.
• Extremely fast and scalable for simple lookups.
Examples:
• Redis
• Amazon DynamoDB
• Riak
Use Cases:
• Caching
• Session management
• Real-time analytics
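To make the contrast concrete, the sketch below shows typical Python usage of a document store (MongoDB via pymongo) and a key-value store (Redis via redis-py). It assumes local servers running on their default ports; the database, collection, and key names are invented for illustration.

```python
# Document store (MongoDB): a semi-structured user profile as a JSON-like document.
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
users = mongo["shop"]["user_profiles"]   # hypothetical database and collection
users.insert_one({"user_id": 42, "name": "Ada", "interests": ["books", "iot"]})
print(users.find_one({"user_id": 42}))

# Key-value store (Redis): fast lookups for caching and session management.
import redis

cache = redis.Redis(host="localhost", port=6379)
cache.set("session:42", "logged_in", ex=3600)   # expires after one hour
print(cache.get("session:42"))
```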
3. Big Data Processing Frameworks
These are software systems that enable efficient storage, processing, and analysis of
large-scale data, often in a distributed computing environment. Here are the most
widely used frameworks:
Apache Hadoop MapReduce: This is a batch processing engine that splits tasks into two
phases:
• Map: Filters and sorts data.
• Reduce: Aggregates the results.
• Use Case: Data Archiving
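Word counting is the classic way to see the two phases. The pure-Python sketch below only simulates Map, shuffle, and Reduce on one machine; in real Hadoop MapReduce these steps run as distributed tasks across the cluster.

```python
from collections import defaultdict

documents = ["big data needs big tools", "big data is fast data"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key (handled by the framework in real MapReduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 3, 'data': 3, ...}
```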
Apache Spark: This is an in-memory data processing framework that supports batch and
streaming processing, and machine learning.
• Faster than Hadoop due to in-memory computation
• Use Case: Machine learning, real-time analytics
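A minimal PySpark batch job might look like the sketch below. It assumes PySpark is installed and that a hypothetical sales.csv file with region and amount columns exists; the same engine also supports Structured Streaming and MLlib for machine learning.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Batch processing: read a CSV and aggregate in memory across the cluster.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()
```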
Apache Flink: This is a stream-processing framework for distributed, high-performing,
always-available, and accurate data streaming applications. It is especially well-suited for
real-time big data analytics.
• Use Case: Event Detection, Fraud Detection.
4. Data Management and Access
a. Data Warehouse: This is a centralized system designed to store large volumes of historical and
current data for reporting, analysis, and decision-making. It stores structured data from
transactional systems, relational databases, and other sources.
Disadvantages of ELT:
• Raw data stored in the warehouse can consume more space
• Requires a powerful warehouse
• Data consistency challenges
2. Extract-Transform-Load(ETL)
Process order:
i. Extract data from source systems
ii. Transform the data before loading (cleaning, filtering, aggregating)
iii. Load the transformed data into the data warehouse
Best for:
Traditional data warehouses with limited processing power and structured data. Common ETL
tools include Apache NiFi, Talend, and Informatica.
Advantages of ETL:
• Data arrives clean and ready for analysis
• Control over transformation logic
• ETL processes can catch and handle errors early during transformation, reducing corrupted or
invalid data entering the warehouse.
Disadvantages of ETL:
• Additional processing step before loading
• High latency
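A minimal ETL sketch in Python, following the extract, transform, load order above; the orders.csv source file and its column names are hypothetical, and an SQLite database stands in for the warehouse.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from the source system (here, a hypothetical CSV export).
raw = pd.read_csv("orders.csv")

# Transform: clean, filter, and aggregate before loading.
clean = raw.dropna(subset=["order_id", "amount"])
clean = clean[clean["amount"] > 0]
daily_totals = clean.groupby("order_date", as_index=False)["amount"].sum()

# Load: write the transformed data into the warehouse (SQLite stands in here).
warehouse = sqlite3.connect("warehouse.db")
daily_totals.to_sql("daily_sales", warehouse, if_exists="replace", index=False)
warehouse.close()
```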
b. Data Query Tools: These allow users to interact with a data warehouse or database
to extract, filter, aggregate, and visualize data using SQL or graphical interfaces.
Examples include SQL Workbench/J, DBeaver, and DataGrip.
5. Visualization and Monitoring Tools
Visualization tools: These tools convert raw data into interactive charts, graphs, and dashboards
for easy interpretation by business analysts, executives, and data scientists.
Key Features:
• Drag-and-drop dashboards
• Real-time data refresh
• Connects to data warehouses, SQL engines, or APIs
• Supports charts, maps, gauges, and custom visuals
Popular visualization tools include Tableau, Power BI, and Grafana.
Monitoring Tools: These are essential for ensuring that big data pipelines, clusters, and
applications run smoothly. They track system health, performance, data flow, and failures.
Key Functions:
• Track CPU, memory, disk usage
• Alerting and anomaly detection
• Data pipeline status (e.g., failed ETL jobs)
Examples of monitoring tools include Prometheus, Datadog, and Apache Ambari.
UNIT 3:
BIG DATA SECURITY ANALYTICS
ARCHITECTURE
Big Data Security Analytics Architecture is a critical framework designed to protect vast
amounts of data while ensuring analytical efficiency. Its key components include:
1. Data Ingestion and Processing
Data ingestion and processing are fundamental to Big Data Security Analytics, ensuring
that vast amounts of information are securely collected, stored, and analyzed without
exposing sensitive data to cyber threats. Here’s a deep dive into the key aspects:
a. Secure Data Collection Mechanisms
Before data can be analyzed for security insights, it must be gathered from various
sources, including network traffic, logs, application interactions, and external APIs.
Ensuring security at this initial step prevents exposure of raw, sensitive information.
• Encrypted Data Streams: Using Transport Layer Security (TLS) for encrypted data
transmission prevents interception by malicious entities (see the sketch after this list). For
example, a smart factory deploys IoT sensors to monitor equipment health. To prevent
cyber-attacks, the factory encrypts sensor data using TLS before transmitting it to the cloud.
This prevents unauthorized interception or data manipulation.
• Source Authentication: Ensuring that data sources are verified and trusted before
allowing them to feed into the system reduces risks of poisoning data with malicious
inputs.
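A minimal sketch of the TLS-encrypted transmission mentioned in the first bullet above, using Python's standard ssl module; the collector hostname and the sensor payload are hypothetical.

```python
import socket
import ssl

# Hypothetical ingestion endpoint for sensor data.
HOST, PORT = "ingest.example.com", 443

context = ssl.create_default_context()   # verifies the server certificate by default

with socket.create_connection((HOST, PORT)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        # The reading is encrypted in transit, so it cannot be read if intercepted.
        tls_sock.sendall(b'{"sensor_id": "pump-7", "temp_c": 71.4}\n')
```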
b. Encryption During Transit and Storage
Encryption is a key aspect of data security during ingestion and processing. It prevents unauthorized
entities from reading sensitive data even if it is intercepted.
Symmetric Encryption
• Uses a single key for both encryption and decryption.
• Faster but requires secure key distribution.
• Example: Advanced Encryption Standard (AES) is widely used for data protection.
Asymmetric Encryption
• Uses a public key for encryption and a private key for decryption.
• Suitable for secure communication and key exchange.
• Example: RSA (Rivest-Shamir-Adleman) is commonly used in secure web transactions.
• End-to-End Encryption (E2EE): Ensuring that data remains encrypted until it reaches its destination
mitigates the risks of man-in-the-middle (MITM) attacks.
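The sketch below illustrates both approaches with the widely used Python cryptography package: Fernet, a symmetric scheme built on AES, and an RSA key pair for asymmetric encryption. The messages and key sizes are illustrative only; real deployments also need careful key management.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Symmetric encryption: one shared key encrypts and decrypts (AES under the hood).
key = Fernet.generate_key()
f = Fernet(key)
token = f.encrypt(b"patient record #1234")
print(f.decrypt(token))

# Asymmetric encryption: public key encrypts, private key decrypts.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b"session key material", oaep)
print(private_key.decrypt(ciphertext, oaep))
```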
c. Privacy-Preserving Data Processing
Privacy is crucial in big data analytics, especially when processing personal or confidential
information. Secure processing mechanisms ensure compliance with regulations while
enabling insights extraction.
• Homomorphic Encryption: A method that allows computations on encrypted data
without the need to decrypt it, ensuring privacy at all stages of analysis.
• Secure Multi-Party Computation (MPC): Allows multiple entities to jointly compute
security analytics without exposing individual datasets to each other.
• Signature-Based IDS: Detects threats based on predefined attack signatures. This method is effective for known
threats but struggles with new or unknown attack patterns. For example, a multinational company implements a
signature-based IDS to scan its systems for known malware signatures. When an employee unknowingly
downloads an infected file containing ransomware, the IDS detects the malware and halts its execution before
encryption can begin.
• Anomaly-Based IDS: Uses machine learning to identify deviations from normal network behavior, enabling
detection of zero-day threats (see the sketch after this list). For example, a financial institution uses an
anomaly-based IDS to monitor transaction patterns. When a customer's account suddenly initiates multiple large
transfers to unknown recipients, the IDS flags it as potential fraud, preventing unauthorized transactions.
• Hybrid IDS: Combines both signature-based and anomaly-based detection for broader security coverage.
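A rough sketch of the anomaly-based approach referenced above, using scikit-learn's IsolationForest on hypothetical transaction amounts; a production IDS would combine many more features (source addresses, timing, payload characteristics) and carefully tuned thresholds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical historical transaction amounts (mostly small, routine transfers).
rng = np.random.default_rng(0)
normal_amounts = rng.normal(loc=120, scale=30, size=(500, 1))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_amounts)

# New activity: three routine transfers and one unusually large transfer.
new_activity = np.array([[110.0], [95.0], [130.0], [9_500.0]])
flags = model.predict(new_activity)   # -1 marks an anomaly, 1 marks normal
print(flags)                          # typically [ 1  1  1 -1]
```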
b. Behavioral Analytics for Anomaly Detection
Traditional security measures often rely on static rules that may not adapt to evolving threats. Behavioral analytics
introduces a dynamic approach by learning patterns over time and flagging unusual activities. It can be classified
into:
• User Behavior Analytics (UBA): Monitors user actions to detect unauthorized access, insider threats, or account
compromise. For example, in a healthcare organization, an employee suddenly accesses thousands of patient
records outside their usual working hours. User Behavior Analytics (UBA) identifies this as abnormal behavior and
alerts administrators before sensitive data is leaked.
• Network Behavior Analysis (NBA): Identifies anomalies in network traffic, such as unusual data transfer volumes
or unexpected connections. For example, a university network experiences a sudden surge in outbound data
transfers. Network Behavior Analysis (NBA) reveals that a compromised server is transmitting sensitive research
data to an unknown IP address, allowing immediate intervention.
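A minimal sketch of the network-behavior idea: flag hours whose outbound transfer volume deviates sharply from a baseline. The traffic figures and the 3-standard-deviation threshold are invented for illustration; real NBA tools model many traffic dimensions.

```python
import statistics

# Hypothetical hourly outbound traffic in GB for one server (the last value is a spike).
hourly_gb = [1.2, 0.9, 1.1, 1.4, 1.0, 1.3, 1.1, 18.5]

baseline = hourly_gb[:-1]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Flag any hour more than 3 standard deviations above the baseline mean.
for hour, gb in enumerate(hourly_gb):
    z = (gb - mean) / stdev
    if z > 3:
        print(f"hour {hour}: {gb} GB transferred (z-score {z:.1f}) -- possible exfiltration")
```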