
Unit 2: NoSQL Data Management
•Introduction to NoSQL, aggregate data models, aggregates, key-value and document data models, relationships,
•graph databases, schemaless databases, materialized views,
•distribution models, sharding, master-slave replication, peer-to-peer replication, sharding and replication,
•consistency, relaxing consistency, version stamps, map-reduce, partitioning and combining, composing map-reduce calculations.
• Definition
• NoSQL: "Not Only SQL" database
• Mechanism for storage and retrieval of data
• Next-generation database
2.1 Introduction to NoSQL
• Key Characteristics
• Distributed architecture (e.g., MongoDB)
• Open source
• Horizontal scalability
• Schema-free
• Easy replication with automatic failover
• Advantages
• Handles huge amounts of data
• Performance improvement by adding more machines
• Implementable on commodity hardware
• Around 150 NoSQL databases available
2.1 Introduction to NoSQL

• Why NoSQL?
• Data Variety
• Manages structured, unstructured, and semi-structured data
• RDBMS limitations with diverse data types
• Modern Needs
• Supports rapid, iterative development (higher code velocity and faster implementation)
• Simplifies database management and application development
• Benefits of NoSQL
• Data Management
• Handles and streams large volumes of data
• Analyzes structured and semi-structured data
• Flexibility and Scalability
• Object-oriented programming, easy to use
• Horizontal scaling with commodity hardware
• Avoids expensive vertical scaling (CPU, RAM)
2.1 Introduction to NoSQL

• NoSQL Database Categories


• Document Database
• Key paired with a complex data structure (document)
• Supports key-value pairs, key-array pairs, nested documents
• Key-Value Stores
• Simplest NoSQL databases
• Stores items as attribute name (key) with its value
• Graph Stores
• Stores information about networks (e.g., social connections)
• Examples: Neo4J, HyperGraphDB
• Wide Column Stores
• Optimized for large datasets
• Stores columns of data together
• Examples: Cassandra, HBase
2.1 Introduction to NoSQL
• NoSQL vs SQL: ACID Property
• Atomicity
• Transaction as a logical unit of work
• All data modifications complete or none at all
• Consistency
• Data must be in a consistent state post-transaction
• Isolation
• Data modifications in a transaction are independent
• Prevents erroneous outcomes
• Durability
• Effects of a completed transaction are permanent
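To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module (a relational example added for illustration, not part of the slides; the table and account names are invented): the two updates of a transfer either both commit or both roll back.

import sqlite3

# Sketch only: names and amounts are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, amount, fail_midway=False):
    with conn:  # one transaction: commit on success, roll back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        if fail_midway:
            raise RuntimeError("simulated crash before crediting bob")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))

try:
    transfer(conn, 70, fail_midway=True)
except RuntimeError:
    pass

# Both balances are unchanged: the partial debit was rolled back.
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 50}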
2.2 Aggregate Data Models
•Definition
• Aggregate: Collection of objects treated as a unit
• In NoSQL, an aggregate is a collection of data interacting as a unit
• Forms boundaries for ACID operations
•Key Characteristics
• Easier data management over clusters
• Data retrieval includes all related aggregates
• Doesn’t fully support ACID transactions
• Facilitates OLAP operations
• High efficiency when transactions occur within the same aggregate
• Types of Aggregate Data Models
• (1) Key-Value Model
• Description
• Contains a key or ID to access aggregate data
• Data is secure and encrypted
• Use Cases
• Storing user session data
• Schema-less user profiles
• User preferences and shopping cart data
2.2 Aggregate Data Models
(2) Document Model
• Access to parts of aggregates
• Stores and retrieves documents (XML, JSON, BSON)
• Restrictions on data structure and types
Use Cases: E-Commerce platforms, Content management systems, Blogging and analytics platforms
(3) Column Family Model
• Big-table style data models (column stores)
• Two-level aggregate structure
Use Cases: Systems maintaining counters, Services with expiring usage, Systems with heavy write
requests
(4) Graph-Based Model
Description
• Stores data in nodes connected by edges
• Suitable for complex aggregates and multidimensional data
Use Cases: Social networking sites, Fraud detection systems, Networks and IT operations
2.2 Aggregate Data Models
Steps to Build Aggregate Data Models
• Example: E-Commerce Data Model
Aggregates
• Customer: Contains billing addresses
• Order: Contains ordered items, shipping addresses, payments
• Payment: Contains billing address
Design Considerations
• Address data can be copied into aggregates as needed
• No predefined format for aggregate boundaries
• Design depends on data manipulation requirements
2.2 Aggregate Data Models
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{ "city": "Chicago" }],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          { "productId": 27, "price": 32.45, "productName": "NoSQL Distilled" }
        ],
        "shippingAddress": [{ "city": "Chicago" }],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": { "city": "Chicago" }
          }
        ]
      }
    ]
  }
}
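A minimal sketch of how this aggregate could be stored and read back in a document store, assuming a locally running MongoDB and the pymongo driver (the database and collection names are made up for the example): the whole customer, including orders and payments, is written and fetched as one unit, which is exactly the aggregate boundary described above.

from pymongo import MongoClient

# Sketch only: assumes MongoDB on localhost and pymongo installed.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]  # illustrative database name

customer = {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [{
        "id": 99,
        "customerId": 1,
        "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [{"ccinfo": "1000-1000-1000-1000",
                          "txnId": "abelif879rft",
                          "billingAddress": {"city": "Chicago"}}],
    }],
}

# The whole aggregate is written as a single document...
db.customers.insert_one(customer)

# ...and read back as a single unit; all related data comes with it.
doc = db.customers.find_one({"id": 1})
print(doc["orders"][0]["orderItems"][0]["productName"])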
2.2.1 Key-Value Data Models
Key-Value Data Model
• Stores data as key-value pairs
• Each key is a unique identifier
• Each value is associated with its key
Characteristics
• Simplicity: Straightforward and easy to use
• Scalability: High scalability by distributing data across servers or clusters
• Flexibility: Values can vary in structure, type, and size
• Popular Databases: Redis, Riak, Amazon DynamoDB
How Key-Value Databases Work:
• Data Structure
• Value associated with a key (similar to map object, array, dictionary)
• Persistent storage controlled by DBMS
• Indexing
• Efficient and compact index structure for fast retrieval
• Example: Redis tracks lists, maps, heaps, primitive types
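As a concrete illustration of this access pattern, here is a small sketch using the redis-py client against a local Redis server (the key name, session fields, and TTL are assumptions for the example): values are stored, fetched, and removed purely by key.

import json
import redis

# Sketch only: assumes a local Redis server and the redis-py package.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a user session under a key; the value is an opaque blob to Redis.
session = {"user_id": 42, "cart": ["sku-123", "sku-456"], "theme": "dark"}
r.set("session:42", json.dumps(session), ex=1800)  # expire after 30 minutes

# Retrieval is by key only; there is no query language to search by value.
raw = r.get("session:42")
if raw is not None:
    print(json.loads(raw)["cart"])

# Removing the item is equally simple.
r.delete("session:42")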
2.2.1 Key-Value Data Models
When to Use a Key-Value Database
•Use Cases
• User session attributes in real-time applications (finance, gaming)
• Caching mechanism for frequently accessed data
• Applications with key-based queries
•Features
• Simple functions for storing, getting, and removing data
• No querying language
• Built-in redundancy for reliability
Advantages of Key-Value Databases
•Easy to use
• The value can hold data of any kind
•Fast response time
• Due to simplicity and optimized environment
•Scalable vertically and horizontally
•Reliable with built-in redundancy
2.2.1 Key-Value Data Models
Disadvantages of Key-Value Databases
•No querying language
• Queries cannot be ported between different key-value databases
•Not refined
• Data can be retrieved only by its key, not by querying the value
2.2.2 Document Data Model:
• Stores data in JSON, BSON, or XML documents
• Allows nesting documents within documents
• Elements can be indexed for faster queries
Characteristics
• Documents are close to data objects used in applications
• Less translation required for application use
• JSON often used for storing and querying data
Working of Document Data Model
Semi-Structured Data Model
• Records and associated data stored in a single document
• Data is semi-structured, not completely unstructured
Key Features
• Document Type Model: Easy mapping in programming languages
• Flexible Schema: Not all documents need the same fields
• Distributed and Resilient: Supports horizontal scaling
• Manageable Query Language: Supports CRUD operations
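The CRUD operations and flexible schema listed above can be sketched with pymongo as follows (the database, collection, and field names are invented for the example); note that the two inserted documents do not share the same fields.

from pymongo import MongoClient

# Sketch only: assumes a local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
articles = client["cms"]["articles"]  # illustrative database/collection names

# Create: documents in the same collection may have different fields.
articles.insert_one({"title": "Intro to NoSQL", "tags": ["nosql", "db"]})
articles.insert_one({"title": "Graph basics", "author": "Kate", "views": 10})

# Read: query on any field, including array elements.
for doc in articles.find({"tags": "nosql"}):
    print(doc["title"])

# Update: add or change fields without a schema migration.
articles.update_one({"title": "Graph basics"}, {"$inc": {"views": 1}})

# Delete.
articles.delete_one({"title": "Intro to NoSQL"})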
2.2.2 Document Data Model:

Examples of Document Data Models


• Amazon DocumentDB
• MongoDB
• Cosmos DB
• ArangoDB
• Couchbase Server
• CouchDB
Advantages of Document Data Models
• Schema-less: No restrictions on data format and structure
• Faster Creation and Maintenance: Simple to create and maintain documents
• Open Formats: Uses XML, JSON, and other forms
• Built-in Versioning: Reduces conflicts as documents grow in size and complexity
2.2.2 Document Data Model:
Disadvantages of Document Data Models
• Weak Atomicity:
• Lacks support for multi-document ACID transactions.
• Changes involving multiple collections require separate queries
• Consistency Check Limitations: Searching unconnected collections and
documents can impact performance
• Security Concerns: Potential for sensitive data leakage in web applications
Applications of Document Data Models
• Content Management: Used for video streaming platforms, blogs, and
similar services
• Book Database: Useful for creating book databases with nested data
• Catalog: Fast reading ability for catalogs with thousands of attributes
• Analytics Platform: Widely used in analytics platforms
2.2.3 Graph Databases:
•NoSQL database designed to handle complex relationships and interconnections
•Data stored as nodes (entities) and edges (relationships)
•Applications: Social networks, Recommendation engines, Fraud detection systems,
Supply chain management, Network and infrastructure management, Bioinformatics

Advantages of Graph Databases


Handle Complex Relationships
• Relationships are as important as entities
• Cannot be easily represented in relational databases
Flexibility:
• Adapt to changing structures
• Suitable for rapidly changing data and complex requirements
Performance
• Fast querying using relationships
• No need for join operations
• Horizontal scalability
2.2.3 Graph Databases:
Disadvantages of Graph Databases
• Limited Use Cases: Not ideal for simple queries or data easily represented in relational databases
• Specialized Knowledge: Requires understanding of graph theory and algorithms
• Immature Technology: Relatively new and evolving
• Integration Challenges: May not integrate well with other tools and systems
Components of Graph Databases
Nodes:
• Represent objects or instances
• Equivalent to a row in a traditional database
Relationships
• Represent edges in the graph
• Have direction, type, and form data patterns
Properties
• Information associated with nodes
Examples of Graph Databases: Neo4j (most popular), Oracle NoSQL DB, Graphbase
2.2.3 Graph Databases:
Types of Graph Databases
Property Graphs:
• Used for querying and analyzing data
• Vertices (subjects) and edges (relationships) with properties
RDF Graphs
• Focus on data integration
• Represented by vertices, edges (subject, predicate, object)
• Use Uniform Resource Identifier (URI)
When to Use Graph Databases
• Heavily Interconnected Data
• Large Data Volumes with Relationships
• Representing a Cohesive Picture of the Data
How Graph Databases Work
Graph Models
• Allow traversal queries using connected data
• Apply graph algorithms to find patterns, paths, relationships
Advantages Over Traditional Databases
• No need for countless joins
• Efficient data modeling and flexible relationships
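To illustrate the join-free traversal described above, the following sketch uses the official neo4j Python driver with a Cypher query (the connection details, node label, and relationship type are assumptions for the example): "friends of friends" are found by following relationships rather than joining tables.

from neo4j import GraphDatabase

# Sketch only: assumes a local Neo4j instance and the neo4j driver package.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical query: traverse FRIEND relationships two hops out from Alice.
query = """
MATCH (me:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:Person)
WHERE fof <> me
RETURN DISTINCT fof.name AS name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["name"])

driver.close()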
2.2.3 Graph Databases:
Examples of Graph Database Applications
Recommendation Engines
Provide accurate recommendations and updates
Social Media
Find "friends of friends" and suggest products
Fraud Detection
Create graphs from transactions and identify fraud
Schemaless Databases
Schemaless Databases:
• Flexible and dynamic data modeling without predefined schemas
• Allows different structures and fields for each document or record
Key Characteristics:
• Flexible Data Models: Accommodates evolving data structures without predefined schemas
• No Fixed Relationships: Relationships represented through embedded documents or references
• Agile Development: Rapid iteration and modification of data models
• Heterogeneous Data: Handles structured, semi-structured, and unstructured data
• Scalability: Designed for horizontal scalability across multiple servers or clusters
Popular Schemaless Databases:
• MongoDB
• Couchbase
• Apache Cassandra
Schemaless Databases
How Schemaless Databases Work:
•Store information in JSON-style documents with varying fields
•Example:
{ "name": "Joe", "age": 30, "interests": "football" }
{ "name": "Kate", "age": 25 }
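A minimal sketch of the same idea with pymongo (the database and collection names are assumed): both documents go into one collection even though their fields differ, and a new field can be introduced later without any schema change.

from pymongo import MongoClient

# Sketch only: assumes a local MongoDB instance; names are illustrative.
people = MongoClient("mongodb://localhost:27017")["demo"]["people"]

# No schema to declare up front: each document brings its own fields.
people.insert_one({"name": "Joe", "age": 30, "interests": "football"})
people.insert_one({"name": "Kate", "age": 25})

# A later document can introduce a field the others lack.
people.insert_one({"name": "Ana", "age": 31, "city": "Lisbon"})

# Queries simply skip documents that lack the queried field.
print(people.count_documents({"interests": "football"}))  # 1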
Benefits:
1.Greater Flexibility Over Data Types:
•Stores and queries any data type, ideal for big data analytics
2.No Pre-defined Database Schemas:
•Accepts any data type, future-proofing the database
3.No Data Truncation:
•Retains all details, useful for changing analytics needs
4.Suitable for Real-time Analytics Functions:
•Processes unstructured data, ideal for IoT and AI operations
5.Enhanced Scalability and Flexibility:
•Adapts data models to job requirements, scales by adding nodes

Applications: Content management systems, Real-time analytics, IoT applications


Materialized Views
• Precomputed, persistent results of a query stored as database objects
• Derived from one or more source tables or views
Purpose: Improve query performance by storing complex or frequently queried data
Key Characteristics:
Data Storage:
• Stores the actual result set of a query in a table-like structure
• Updated periodically to reflect changes in source tables
Query Performance:
• Eliminates the need for repeated execution of complex queries
• Reduces processing and computation time
Data Aggregation and Joins:
Optimizes and simplifies complex queries by storing aggregated or joined results
Maintenance and Refresh:
• Needs periodic refresh to reflect underlying data changes
• Can be refreshed on a schedule or triggered by events
Query Rewrite: Some systems support automatic query rewrite to use materialized views
Common Uses: Data warehousing, Reporting, Decision support systems
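Since this unit is about NoSQL, one way to picture a materialized view is as a precomputed summary collection refreshed periodically. The sketch below (pymongo, with made-up database and collection names) aggregates order totals into a "sales_by_product" collection that readers can query without recomputing the aggregation each time.

from pymongo import MongoClient

# Sketch only: a hand-rolled "materialized view" as a summary collection.
db = MongoClient("mongodb://localhost:27017")["shop"]  # illustrative names

def refresh_sales_by_product():
    """Recompute the precomputed result set; run on a schedule or trigger."""
    pipeline = [
        {"$unwind": "$orderItems"},
        {"$group": {"_id": "$orderItems.productId",
                    "total": {"$sum": "$orderItems.price"}}},
        # $out replaces the target collection with the aggregation result.
        {"$out": "sales_by_product"},
    ]
    db.orders.aggregate(pipeline)

refresh_sales_by_product()

# Dashboards read the cheap, precomputed view instead of re-aggregating.
for row in db.sales_by_product.find().sort("total", -1):
    print(row["_id"], row["total"])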
Materialized Views
Advantages:
Enhanced Performance: Speeds up query execution by using precomputed results
Efficient Data Aggregation: Simplifies access to aggregated and joined data
Reduced Computation Time: Minimizes processing time for complex queries
Trade-offs:
Storage Space: Requires additional storage for precomputed data
Maintenance Overhead:
• Needs regular updates to reflect changes in source data
• Can introduce overhead in systems with frequent updates
System-Specific Functionality: Availability and features vary across different database systems
Considerations: Best suited for relatively static data or infrequent updates, Weigh the benefits of improved
query performance against storage and maintenance costs
Examples of Usage: Precomputing sales totals for reporting dashboards, Storing complex joins for decision
support queries, Aggregating large datasets for real-time analytics
Distribution Models Overview
• Strategies for distributing data across multiple nodes or servers in a distributed computing environment
• Ensure availability, fault tolerance, and efficient query processing
Key Distribution Models:
Horizontal Partitioning (Sharding):
• Divides data into smaller, more manageable pieces (shards)
• Each shard is stored on a different node
• Balances the load and improves performance
• Commonly used in large-scale systems with high data volumes
• Example: MongoDB uses sharding to distribute data across multiple servers
Replication:
• Copies data across multiple nodes
• Ensures high availability and fault tolerance
• Provides redundancy in case of node failure
• Can be synchronous (real-time) or asynchronous (delayed)
• Example: Apache Hadoop replicates data blocks across different nodes for reliability
Distribution Models Overview
Factors Influencing Distribution Model Choice:
Nature of Data: Structure and type of data being stored
Access Patterns: How frequently and in what manner data is accessed
Scalability Requirements: Need for horizontal scaling and handling increased data volumes
Fault Tolerance Goals: Desired level of data redundancy and availability
Performance Considerations: Requirements for query speed and data retrieval efficiency
Implementing Distribution Models:
• Database systems like MongoDB and Apache Hadoop offer mechanisms for data partitioning and
replication
• Essential to analyze application and workload characteristics for optimal distribution strategy
Sharding
Definition: Sharding involves dividing a large database into smaller, more manageable parts called shards or
partitions.
Data Distribution: Each shard contains a subset of the data and is stored on a separate node or server in the
distributed system.
Shard Key: The sharding process typically involves selecting a shard key or partitioning key, which determines
how data is distributed across shards.
Purpose: The goal of sharding is to evenly distribute data to avoid bottlenecks and enable horizontal
scalability.
Common Use: Sharding is commonly used in NoSQL databases to handle large-scale datasets and achieve
better performance and scalability.
Benefits of Sharding: Improves performance by distributing load across multiple servers, Enhances
scalability by allowing the database to grow horizontally, Prevents any single node from becoming a
bottleneck
Challenges of Sharding: Complexity in managing and maintaining shards, Potential for uneven data
distribution if the shard key is not chosen wisely, Increased complexity in query processing and transactions
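A toy sketch of hash-based shard routing (pure Python, no real database; shard names are made up): the shard key is hashed to pick which node stores a record, spreading the data set so that no single node holds everything. A poorly chosen key with few distinct values would defeat this balancing.

import hashlib

# Toy sketch of hash-based sharding; node names are illustrative.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(shard_key: str) -> str:
    """Map a shard key to one of the shards via a stable hash."""
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Each record lands on exactly one shard, determined by its key.
storage = {name: {} for name in SHARDS}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    storage[shard_for(user_id)][user_id] = {"id": user_id}

for name, rows in storage.items():
    print(name, sorted(rows))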
Replication
Definition: Replication involves maintaining multiple copies (replicas) of data across
different nodes in the distributed database cluster.
Data Copies: Each replica is an exact copy of the data stored on a separate server.
Purpose: Replication enhances data availability, fault tolerance, and read performance by
allowing data to be served from multiple replicas.
Replication Models:
Master-Slave Replication:
• One node (master) accepts write operations.
• Changes are asynchronously propagated to one or more replica nodes (slaves).
• Read queries can be distributed among replicas, reducing the load on the master.
Multi-Master Replication:
• Multiple nodes can accept write operations.
• Changes are replicated to other nodes.
• Allows for better write scaling and high availability.
Replication
Benefits of Replication: Enhances data availability and fault tolerance, Improves read performance by distributing read
queries across replicas, Provides redundancy to protect against data loss
Challenges of Replication:
•Increased storage requirements for maintaining multiple replicas
•Potential for data inconsistency if replicas are not synchronized properly
•Complexity in managing and maintaining multiple replicas
Implementing Distribution Models:
• Sharding and Replication Together: These models can be used together for a highly scalable and available database
system.
• Examples: Database systems like MongoDB and Apache Hadoop offer mechanisms for data partitioning (sharding)
and replication.
Choosing the Right Model:
• Nature of Data: Structure and type of data being stored.
• Access Patterns: How frequently and in what manner data is accessed.
• Scalability Requirements: Need for horizontal scaling and handling increased data volumes.
• Fault Tolerance Goals: Desired level of data redundancy and availability.
• Performance Considerations: Requirements for query speed and data retrieval efficiency.
Master-Slave Replication
Master-slave replication is a method of data replication in distributed database systems where one node, called the
master or primary node, serves as the authoritative source for data, and one or more nodes, known as slave or secondary
nodes, replicate and maintain copies of the master's data.
Key Features and Considerations:
Replication Process: The master node logs all modifications made to the data, often in the form of binary logs or
transaction logs. Slave nodes periodically connect to the master, retrieve the log files, and apply the logged changes to
replicate the data.
Data Consistency: Ensures data consistency by replicating changes made on the master to the slave nodes. Slave nodes
maintain a consistent copy of the data by applying the same set of modifications in the same order.
Fault Tolerance and High Availability: Provides fault tolerance and high availability. If the master node fails, one of the
slave nodes can be promoted as the new master, ensuring minimal disruption and data redundancy.
Read Scalability: Slave nodes handle read queries, offloading read traffic from the master node. Allows for horizontal
scalability in read-intensive scenarios.
Replication Lag: There may be a delay between changes made on the master node and their replication to the slave
nodes. The delay, known as replication lag, depends on network latency, system load, and synchronization frequency.
Applications requiring real-time or near-real-time data consistency should consider replication lag when accessing slave
nodes.
Data Integrity: Provides data integrity within the replicated data set. Does not protect against logical errors or data
corruption on the master, as those changes would also be replicated to the slave nodes.
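The log-shipping idea above can be sketched in a few lines of plain Python (an in-memory toy, not a real database): the master appends every write to a log, and each slave applies the log entries in order, so its copy converges to the master's state but may lag behind until it next synchronizes.

# Toy sketch of master-slave replication via a shared change log.
class Master:
    def __init__(self):
        self.data = {}
        self.log = []            # ordered record of all modifications

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Slave:
    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0         # position reached in the master's log

    def sync(self):
        """Pull and apply any log entries not yet replayed, in order."""
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)

master = Master()
replica = Slave(master)

master.write("balance:alice", 100)
print(replica.data)   # {} -> replication lag: the write is not visible yet
replica.sync()
print(replica.data)   # {'balance:alice': 100} -> consistent copy after sync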
Master-Slave Replication
Data Replication:
• Master node handles write operations (inserts, updates, deletes) and propagates changes to slave
nodes.
• Slave nodes synchronize with the master to receive and apply changes, ensuring an up-to-date copy
of the data.
• Slave nodes are typically read-only and do not accept write operations directly.
Load Distribution:
• Read requests can be served by the slave nodes, reducing the load on the master node.
• In case of master failure, a slave node can be promoted as the new master, ensuring continuous
operation.
Advantages:
• Enhanced Read Performance: By distributing read queries across multiple nodes.
• Improved Availability: Through data redundancy and fault tolerance.
• Scalability: Supports horizontal scalability for read-intensive applications.
Disadvantages:
• Replication Lag: Potential delay in data synchronization between master and slave nodes.
• Single Point of Write: Master node becomes a single point for write operations, which may limit write
scalability.
Master-Slave Replication

Common Use Cases:


• Read-Heavy Applications: Where read queries are more frequent than write operations.
• Data Redundancy: For fault tolerance and high availability.
• Improved Performance: In scenarios requiring distributed read load across multiple nodes.

Popular Technologies:
• MySQL
• PostgreSQL
• MongoDB
Peer-to-Peer Replication
In Peer-to-Peer Replication, there is no designated master node. Instead, all nodes in the
database cluster are considered equal "peers." Each node can handle both read and write
operations independently, and data is distributed across all nodes using techniques like
sharding or consistent hashing to ensure horizontal scalability.

Characteristics of Peer-to-Peer Replication:


• No designated master node; all nodes are peers and can handle read and write operations.
• Data is distributed across all nodes in the cluster using techniques like sharding or consistent hashing.
• Provides better write scalability since write operations can be distributed across multiple nodes.
• Offers improved fault tolerance since there is no single point of failure like a master node.
• May require conflict resolution mechanisms for concurrent writes on different nodes.
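The consistent-hashing technique mentioned above can be sketched as a hash ring (toy Python; peer names are invented): each key is placed on the ring and handled by the first peer clockwise from it, so any peer can accept writes and adding a node moves only a fraction of the keys.

import bisect
import hashlib

# Toy consistent-hash ring; peer names are made up for the example.
def h(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, peers):
        self.points = sorted((h(p), p) for p in peers)

    def peer_for(self, key: str) -> str:
        """First peer clockwise from the key's position on the ring."""
        positions = [point for point, _ in self.points]
        idx = bisect.bisect(positions, h(key)) % len(self.points)
        return self.points[idx][1]

ring = Ring(["peer-a", "peer-b", "peer-c"])
for k in ["user:1", "user:2", "user:3", "user:4"]:
    print(k, "->", ring.peer_for(k))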
Consistency in NoSQL Databases
Consistency in NoSQL databases refers to one of the three core principles in the CAP theorem (Consistency, Availability,
and Partition Tolerance). The CAP theorem states that a distributed data store cannot simultaneously provide all three
guarantees.
Consistency Models:
Strong Consistency:
• All nodes see the same data at the same time.
• Ensures the most up-to-date information.
• May result in slower response times and reduced availability during network partitions.
Eventual Consistency:
• Guarantees that all nodes will eventually converge to the same state given enough time.
• Allows for high availability and low-latency responses.
• May present temporarily inconsistent data during network partitions.
Causal Consistency:
Guarantees that causally related events are observed in a consistent order across all nodes.
Read-your-writes Consistency:
Ensures that a client always reads the most recent version of the data it wrote.
Relaxing Consistency in NoSQL Databases
Relaxing Consistency (Eventual Consistency): Eventual consistency allows for temporary inconsistencies
between replicas in a distributed system.
Data Updates:
Updates are propagated asynchronously to replicas.
Replicas continue to serve read requests based on their current state.
Eventual Convergence:
Over time, all replicas will converge to a consistent state.
Given enough time and no further updates, all replicas will have the same data.
Read Inconsistency:
Different replicas may show different views of the data during the period of inconsistency.
Read operations might return stale data until updates propagate to all replicas.
Advantages of Eventual Consistency:
High Availability: The system can continue to serve read and write requests even during network partitions
or node failures.
Fault Tolerance: Ensures continued operation despite failures.
Handling Inconsistencies: Applications often implement conflict resolution mechanisms, such as version
stamps or timestamps, to resolve conflicts and maintain data integrity.
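A toy illustration of that conflict-resolution idea in plain Python: each replica tags its value with a version stamp (here a counter plus timestamp), and a reader reconciling divergent replicas keeps the value with the highest stamp (last write wins). Real systems may use richer schemes such as vector clocks; this is only the simplest variant.

import time

# Toy last-write-wins reconciliation using (version, timestamp) stamps.
def stamped(value, version):
    return {"value": value, "version": version, "ts": time.time()}

# Two replicas diverged while a network partition hid their updates.
replica_a = {"cart:42": stamped(["book"], version=2)}
time.sleep(0.01)
replica_b = {"cart:42": stamped(["book", "pen"], version=3)}

def resolve(a, b):
    """Keep the entry with the higher version stamp (ties broken by time)."""
    return a if (a["version"], a["ts"]) >= (b["version"], b["ts"]) else b

winner = resolve(replica_a["cart:42"], replica_b["cart:42"])
print(winner["value"])   # ['book', 'pen'] -> the later write wins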
MapReduce

MapReduce is a programming model developed by Google for processing large data sets in a parallel,
distributed manner across a cluster of computers. It simplifies data processing by breaking it into two
primary tasks: Map and Reduce.

Components of MapReduce:
Map:
• Takes an input pair and produces a set of intermediate key/value pairs.
• Input data is divided into smaller chunks, each processed independently.
• Outputs are sorted and grouped by key.
Reduce:
• Takes intermediate key/value pairs from the map phase and merges them to form the final output.
• Processes each group of intermediate values associated with a key to produce the final results.
MapReduce

MapReduce Workflow:
Data Splitting:
Input data is split into chunks and assigned to mappers.
Mapping:
Mappers process chunks and generate intermediate key/value pairs.
Shuffling and Sorting:
Intermediate key/value pairs are shuffled and sorted based on keys, grouping all values associated with
the same key.
Reducing:
Reducers process each group of intermediate key/value pairs to generate the final output.
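The four workflow steps can be mimicked in a few lines of plain Python with a word-count example (the classic illustration; no Hadoop involved, input strings are made up): the input is split into chunks, each chunk is mapped to (word, 1) pairs, the pairs are grouped by key, and a reduce step sums each group.

from collections import defaultdict

# Toy word count following the split -> map -> shuffle/sort -> reduce workflow.
documents = ["big data needs big ideas", "nosql handles big data"]

# 1. Data splitting: each document is one chunk handled by one "mapper".
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

intermediate = [pair for chunk in documents for pair in map_phase(chunk)]

# 2./3. Shuffling and sorting: group all values that share a key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# 4. Reducing: merge each group into the final output.
def reduce_phase(key, values):
    return key, sum(values)

result = dict(reduce_phase(k, v) for k, v in sorted(groups.items()))
print(result)   # {'big': 3, 'data': 2, ...}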
MapReduce
Benefits of MapReduce:
• Scalability: Efficiently handles large data sets by leveraging the computational power of
multiple machines.
• Fault Tolerance: Automatically reassigns failed tasks to other machines, ensuring
reliable processing.
• Parallel Processing: Distributes data processing tasks across a cluster, speeding up the
overall computation.
Adoption:
Widely used in big data frameworks like Apache Hadoop, serving as a foundational model
for distributed data processing.
Partitioning and Combining
Partitioning and combining are key techniques in distributed computing, especially within
the MapReduce framework, to optimize performance and manage large data sets
effectively.
Partitioning:
Partitioning involves dividing data into smaller, manageable chunks for independent and
parallel processing across multiple machines. In MapReduce, partitioning occurs at various
stages:
Input Splitting:
The input data is split into partitions, each processed by a separate mapper, distributing the workload
across multiple nodes in the cluster.
Shuffling and Sorting:
Intermediate key/value pairs generated by mappers are partitioned based on their keys. Each partition is
directed to a specific reducer, ensuring all values associated with a key are processed together.
Effective partitioning ensures load balancing and efficient resource utilization. Poor
partitioning can result in imbalanced workloads, where some nodes are overwhelmed
while others remain underutilized.
Partitioning and Combining
Combining:
Combining reduces the amount of intermediate data transferred between the map and
reduce stages. A combiner function, which acts as a mini-reducer, operates on the output
of the map function, performing local aggregation of intermediate key/value pairs before
they are sent to reducers.
Benefits of Combining:
Reduced Data Transfer:
Aggregates data locally, decreasing the volume of data shuffled across the network, thereby improving
performance.
Improved Efficiency:
Lowers the number of intermediate key/value pairs, significantly reducing the computational load on
reducers.
The combiner function is often the same as the reduce function but operates on a smaller
scale. Proper use of partitioning and combining techniques is crucial for optimizing
distributed data processing systems and ensuring efficient resource utilization.
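Building on the word-count sketch above, the following toy Python shows where a combiner fits (same invented input): each mapper pre-aggregates its own (word, 1) pairs locally, so far fewer intermediate pairs cross the "network" to the reducers.

from collections import Counter, defaultdict

# Toy combiner: local aggregation on each mapper's output before the shuffle.
documents = ["big data needs big ideas", "nosql handles big data"]

def map_with_combiner(chunk):
    # The combiner is a mini-reducer run on one mapper's output only:
    # {'big': 2, ...} instead of [('big', 1), ('big', 1), ...].
    return Counter(chunk.split())

# Shuffle: only one pre-summed pair per word per mapper is transferred.
groups = defaultdict(list)
for chunk in documents:
    for word, partial_count in map_with_combiner(chunk).items():
        groups[word].append(partial_count)

# Reduce: sum the partial counts (same logic as the combiner, larger scope).
print({word: sum(parts) for word, parts in sorted(groups.items())})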
Composing MapReduce Calculations
Composing MapReduce calculations involves chaining multiple MapReduce jobs together
to execute complex data processing tasks. This method allows for sequential
transformations and aggregations on data, with each job's output serving as the input for
the next. Here's how the process works:
Steps in Composing MapReduce Calculations:
Job Chaining:
In this workflow, the output of one MapReduce job becomes the input for the next, enabling sequential
data processing. Each job performs a specific transformation or aggregation.
Intermediate Storage:
Intermediate results are often stored in a distributed file system (like HDFS in Hadoop) between jobs,
ensuring data preservation and accessibility for subsequent jobs.
Workflow Management:
Managing multiple MapReduce jobs requires orchestration to ensure each job starts only after the
previous one completes successfully. Tools like Apache Oozie or custom scripts automate this process.
Composing MapReduce Calculations
Example Workflow:
Consider processing log files to generate analytics reports with a composed MapReduce calculation:
Job 1 - Parsing:
Mapper: Reads raw log files and extracts relevant fields (e.g., timestamps, user IDs, actions).
Reducer: Aggregates parsed entries by key (e.g., user ID).
Job 2 - Aggregation:
Mapper: Takes output from Job 1 and maps additional transformations.
Reducer: Aggregates data further, such as calculating total actions per user or summarizing actions over
time periods.
Job 3 - Analysis:
Mapper: Processes the aggregated data for complex analysis.
Reducer: Identifies patterns or trends and generates final reports.
Benefits:
Scalability: Handles large data sets efficiently across distributed environments.
Fault Tolerance: Maintains data integrity and availability even in case of node failures.
Complex Processing: Enables sophisticated data processing pipelines, from simple transformations to
advanced analytical computations
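A toy sketch of job chaining in plain Python (the log format, field positions, and activity buckets are invented for the example): job 1 parses raw log lines into (user, 1) pairs and counts actions per user; its output is then fed as the input of job 2, which groups users into activity buckets.

from collections import defaultdict

# Toy composition of two map-reduce style jobs; log lines are made up.
logs = [
    "2024-01-01 u1 click", "2024-01-01 u2 view",
    "2024-01-02 u1 click", "2024-01-02 u1 purchase",
]

def run_job(records, mapper, reducer):
    """Generic mini map-reduce: map every record, group by key, reduce groups."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1 - parsing/aggregation: count actions per user.
job1 = run_job(
    logs,
    mapper=lambda line: [(line.split()[1], 1)],
    reducer=lambda user, ones: (user, sum(ones)),
)

# Job 2 - analysis: the output of job 1 is the input here; bucket users by activity.
job2 = run_job(
    job1,
    mapper=lambda rec: [("active" if rec[1] >= 2 else "casual", rec[0])],
    reducer=lambda bucket, users: (bucket, sorted(users)),
)

print(job1)   # [('u1', 3), ('u2', 1)]
print(job2)   # [('active', ['u1']), ('casual', ['u2'])]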
