Unit 2
NoSQL Data Management
• Introduction to NoSQL, aggregate data models, aggregates, key-value and document data models, relationships
• Graph databases, schemaless databases, materialized views
• Distribution models, sharding, master-slave replication, peer-to-peer replication, sharding and replication
• Consistency, relaxing consistency, version stamps
• Map-reduce, partitioning and combining, composing map-reduce calculations
2.1 Introduction to NoSQL
• Definition
• NoSQL: "Not Only SQL" database
• Mechanism for storage and retrieval of data
• Next-generation database
• Key Characteristics
• Distributed architecture (e.g., MongoDB)
• Open source
• Horizontal scalability
• Schema-free
• Easy replication with automatic failover
• Advantages
• Handles huge amounts of data
• Performance improvement by adding more machines
• Implementable on commodity hardware
• Around 150 NoSQL databases available
2.1 Introduction to NoSQL
• Why NoSQL?
• Data Variety
• Manages structured, unstructured, and semi-structured data
• RDBMS limitations with diverse data types
• Modern Needs
• Supports simultaneous development activities (higher code velocity and faster implementation)
• Simplifies database management and application development
• Benefits of NoSQL
• Data Management
• Handles and streams large volumes of data
• Analyzes structured and semi-structured data
• Flexibility and Scalability
• Object-oriented programming, easy to use
• Horizontal scaling with commodity hardware
• Avoids expensive vertical scaling (CPU, RAM)
2.1 Introduction to NoSQL
Popular Technologies:
• MySQL (relational)
• PostgreSQL (relational)
• MongoDB (NoSQL)
Peer-to-Peer Replication
In Peer-to-Peer Replication, there is no designated master node. Instead, all nodes in the
database cluster are considered equal "peers." Each node can handle both read and write
operations independently, and data is distributed across all nodes using techniques like
sharding or consistent hashing to ensure horizontal scalability.
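As a rough illustration of how keys can be spread across equal peers, here is a minimal consistent-hashing sketch in Python. The peer names, the MD5 hash, and the class name ConsistentHashRing are assumptions made for this example, not the API of any particular NoSQL product.

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Each peer owns a segment of the hash space; any peer can accept a write."""

    def __init__(self, nodes):
        # Place every node at a ring position derived from its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise from the key's hash to the first node position.
        idx = bisect_right(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["peer-1", "peer-2", "peer-3"])
print(ring.node_for("user:42"))  # whichever peer owns this key's ring segment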
MapReduce is a programming model developed by Google for processing large data sets in a parallel,
distributed manner across a cluster of computers. It simplifies data processing by breaking it into two
primary tasks: Map and Reduce.
Components of MapReduce:
Map:
• Takes an input pair and produces a set of intermediate key/value pairs.
• Input data is divided into smaller chunks, each processed independently.
• Outputs are sorted and grouped by key.
Reduce:
• Takes intermediate key/value pairs from the map phase and merges them to form the final output.
• Processes each group of intermediate values associated with a key to produce the final results.
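To make the two components concrete, below is a minimal word-count sketch in Python. The names map_fn and reduce_fn are invented for this example and do not belong to any framework API.

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: merge all counts grouped under one word into a single total.
    yield word, sum(counts)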
MapReduce Workflow:
Data Splitting:
Input data is split into chunks and assigned to mappers.
Mapping:
Mappers process chunks and generate intermediate key/value pairs.
Shuffling and Sorting:
Intermediate key/value pairs are shuffled and sorted based on keys, grouping all values associated with
the same key.
Reducing:
Reducers process each group of intermediate key/value pairs to generate the final output.
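The four workflow steps can be simulated on a single machine in a few lines of Python. This sketch reuses map_fn and reduce_fn from the earlier word-count example and models only the data flow, not the distributed execution.

from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn):
    # Data splitting happened upstream: each element of `chunks` is one split.
    # Mapping: every chunk is processed independently by a mapper.
    intermediate = []
    for i, chunk in enumerate(chunks):
        intermediate.extend(map_fn(i, chunk))

    # Shuffling and sorting: group all values under their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reducing: each key group produces part of the final output.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

print(run_mapreduce(["to be or not to be", "to do is to be"], map_fn, reduce_fn))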
Benefits of MapReduce:
• Scalability: Efficiently handles large data sets by leveraging the computational power of
multiple machines.
• Fault Tolerance: Automatically reassigns failed tasks to other machines, ensuring
reliable processing.
• Parallel Processing: Distributes data processing tasks across a cluster, speeding up the
overall computation.
Adoption:
Widely used in big data frameworks like Apache Hadoop, serving as a foundational model
for distributed data processing.
Partitioning and Combining
Partitioning and combining are key techniques in distributed computing, especially within
the MapReduce framework, to optimize performance and manage large data sets
effectively.
Partitioning:
Partitioning involves dividing data into smaller, manageable chunks for independent and
parallel processing across multiple machines. In MapReduce, partitioning occurs at various
stages:
Input Splitting:
The input data is split into partitions, each processed by a separate mapper, distributing the workload
across multiple nodes in the cluster.
Shuffling and Sorting:
Intermediate key/value pairs generated by mappers are partitioned based on their keys. Each partition is
directed to a specific reducer, ensuring all values associated with a key are processed together.
Effective partitioning ensures load balancing and efficient resource utilization. Poor
partitioning can result in imbalanced workloads, where some nodes are overwhelmed
while others remain underutilized.
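A hash-based partitioner, sketched below in Python, is one common way to route intermediate keys to reducers; the function name and the choice of three reducers are assumptions for illustration only.

def partition(key, num_reducers=3):
    # Route a key to a reducer: identical keys always map to the same reducer,
    # so all values for that key are processed together. Real frameworks use a
    # deterministic hash of the serialized key rather than Python's hash().
    return hash(key) % num_reducers

for key, value in [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]:
    print(f"{key!r} -> reducer {partition(key)}")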
Combining
Combining:
Combining reduces the amount of intermediate data transferred between the map and
reduce stages. A combiner function, which acts as a mini-reducer, operates on the output
of the map function, performing local aggregation of intermediate key/value pairs before
they are sent to reducers.
Benefits of Combining:
Reduced Data Transfer:
Aggregates data locally, decreasing the volume of data shuffled across the network, thereby improving
performance.
Improved Efficiency:
Lowers the number of intermediate key/value pairs, significantly reducing the computational load on
reducers.
The combiner function is often the same as the reduce function but operates on a smaller
scale. Proper use of partitioning and combining techniques is crucial for optimizing
distributed data processing systems and ensuring efficient resource utilization.
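A combiner doing local aggregation can be sketched as below; it runs the same summing logic as the word-count reducer, but only over one mapper's output and before anything crosses the network. The function name is invented for this example.

from collections import defaultdict

def combine(mapper_output):
    # Local aggregation: collapse this mapper's (word, 1) pairs into
    # (word, partial_total) pairs before they are shuffled to reducers.
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

# Without the combiner four pairs would be shuffled; with it, only two.
print(combine([("to", 1), ("be", 1), ("to", 1), ("to", 1)]))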
Composing MapReduce Calculations
Composing MapReduce calculations involves chaining multiple MapReduce jobs together
to execute complex data processing tasks. This method allows for sequential
transformations and aggregations on data, with each job's output serving as the input for
the next. Here's how the process works:
Steps in Composing MapReduce Calculations:
Job Chaining:
In this workflow, the output of one MapReduce job becomes the input for the next, enabling sequential
data processing. Each job performs a specific transformation or aggregation.
Intermediate Storage:
Intermediate results are often stored in a distributed file system (like HDFS in Hadoop) between jobs,
ensuring data preservation and accessibility for subsequent jobs.
Workflow Management:
Managing multiple MapReduce jobs requires orchestration to ensure each job starts only after the
previous one completes successfully. Tools like Apache Oozie or custom scripts automate this process.
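Chaining can be expressed by feeding one job's output into the next. The sketch below reuses the run_mapreduce driver and the word-count functions from earlier; the second job's functions are hypothetical names created just for this illustration.

# Job 1: word count (map_fn / reduce_fn as defined earlier).
job1_output = run_mapreduce(["to be or not to be", "to do is to be"], map_fn, reduce_fn)

# Job 2: group words by how often they occur, using Job 1's output as input.
def map_by_count(_, pair):
    word, count = pair
    yield count, word

def reduce_by_count(count, words):
    yield count, sorted(words)

print(run_mapreduce(job1_output, map_by_count, reduce_by_count))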
Composing MapReduce Calculations
Example Workflow:
Consider processing log files to generate analytics reports with a composed MapReduce calculation:
Job 1 - Parsing:
Mapper: Reads raw log files and extracts relevant fields (e.g., timestamps, user IDs, actions).
Reducer: Aggregates parsed entries by key (e.g., user ID).
Job 2 - Aggregation:
Mapper: Takes output from Job 1 and maps additional transformations.
Reducer: Aggregates data further, such as calculating total actions per user or summarizing actions over
time periods.
Job 3 - Analysis:
Mapper: Processes the aggregated data for complex analysis.
Reducer: Identifies patterns or trends and generates final reports.
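Job 1 of this workflow might be sketched as follows; the space-separated log format (timestamp, user ID, action) and the function names are assumptions made purely for the example, and the driver is the run_mapreduce sketch from earlier.

def parse_map(_, log_line):
    # Map: extract the relevant fields from one raw log line.
    timestamp, user_id, action = log_line.split()[:3]
    yield user_id, (timestamp, action)

def parse_reduce(user_id, entries):
    # Reduce: aggregate all parsed entries for one user, ordered by time.
    yield user_id, sorted(entries)

logs = ["2024-01-01T10:00 u1 login",
        "2024-01-01T10:05 u1 click",
        "2024-01-01T10:07 u2 login"]
print(run_mapreduce(logs, parse_map, parse_reduce))  # becomes the input for Job 2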
Benefits:
Scalability: Handles large data sets efficiently across distributed environments.
Fault Tolerance: Maintains data integrity and availability even in case of node failures.
Complex Processing: Enables sophisticated data processing pipelines, from simple transformations to advanced analytical computations.