Big Data Unit-II Notes
NoSQL databases are designed to store large volumes of unstructured or semi-structured data
and to scale horizontally. Unlike traditional relational databases, they do not rely on
fixed schemas, which makes them more flexible and well suited to modern applications such as
web services, IoT, and big data.
Introduction to NoSQL
Definition: NoSQL is usually expanded as "Not Only SQL." It covers a wide variety of
database technologies designed to work around the scaling and schema limitations of
relational databases.
Characteristics:
o Schema-less
o Horizontal scalability
o High availability and fault tolerance
o Support for unstructured, semi-structured, and structured data
Applications: Big Data analytics, real-time web applications, content management, etc.
Types:
o Key-Value Stores
o Document Stores
o Column-Family Stores
o Graph Databases
Aggregates
Key-Value Stores:
o Simplest NoSQL model.
o Data stored as key-value pairs.
o Examples: Redis, DynamoDB.
o Use Cases: Session management, caching.
Document Stores:
o Data stored as JSON, BSON, or XML documents.
o Examples: MongoDB, CouchDB.
o Use Cases: Content management, real-time analytics.
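The key-value model and its session-management use case can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Redis or DynamoDB work internally; the class name SessionStore, its methods, and the injectable clock parameter are all hypothetical names chosen for this example.

```python
import time


class SessionStore:
    """Toy in-memory key-value store with optional per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value


store = SessionStore()
store.put("session:42", {"user": "alice"}, ttl_seconds=1800)
print(store.get("session:42"))   # the stored session dict
print(store.get("session:999"))  # None: unknown key
```

The store only ever looks values up by key and never inspects what is inside them, which is the defining restriction of the key-value model.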
Relationships
Graph Databases
Graph databases are a type of database designed to store and query data structured as a graph,
where entities (nodes) are connected by relationships (edges). They are particularly suited for
applications that require modeling and querying complex, interconnected data efficiently.
Popular graph databases:
1. Neo4j: A widely used graph database, with Cypher as its query language.
2. Amazon Neptune: A cloud-based graph database service.
3. ArangoDB: A multi-model database supporting graph, document, and key-value data.
4. JanusGraph: Open-source, scalable graph database optimized for large graphs.
5. TigerGraph: Focused on real-time analytics and scalability.
Query Languages
Cypher: Used by Neo4j; a declarative language with SQL-inspired syntax for matching patterns in graphs.
Gremlin: A traversal language used with Apache TinkerPop-compliant databases like
JanusGraph.
SPARQL: A query language for RDF (Resource Description Framework) data,
often used in semantic web applications.
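To make the node-and-edge model concrete, here is a small sketch of the kind of traversal that a graph query expresses. A real graph database would run this through a query language such as Cypher or Gremlin; the plain-Python adjacency list, the sample names, and the function within_hops are assumptions made for illustration only.

```python
from collections import deque

# Adjacency list: node -> set of neighbours (undirected "friend" edges).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


hops = within_hops(friends, "alice", 2)
friends_of_friends = sorted(n for n, d in hops.items() if d == 2)
print(friends_of_friends)  # ['dave']
```

A "friends of friends" question like this follows edges directly from node to node, which is the access pattern graph databases are built to serve.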
Schema-Less Databases
1. Key-Value Stores
o Data is stored as key-value pairs.
o Example: Redis, Amazon DynamoDB, Riak.
o Use Case: Session management, caching, and user preferences.
2. Document Stores
o Data is stored as documents (e.g., JSON, BSON, or XML).
o Example: MongoDB, Couchbase, RavenDB.
o Use Case: Content management systems, e-commerce, and real-time analytics.
3. Column-Family Stores
o Data is stored in columns grouped into families.
o Example: Apache Cassandra, HBase.
o Use Case: Time-series data, log analysis, and IoT applications.
4. Graph Databases
o Focus on storing data as nodes and edges (relationships).
o Example: Neo4j, Amazon Neptune, TigerGraph.
o Use Case: Social networks, fraud detection, and recommendation engines.
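The column-family layout from item 3 can be pictured as a nested map: row key, then column family, then column name. The sketch below is a simplified mental model under that assumption, not the storage format of Cassandra or HBase; the helper names put and get_family and the sensor data are hypothetical.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one cell; rows and families are created on first use."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read one column family of one row without touching the others."""
    return dict(table.get(row_key, {}).get(family, {}))


# Time-series readings use the timestamp as the column name.
put("sensor-1", "readings", "2024-01-01T00:00", 21.5)
put("sensor-1", "readings", "2024-01-01T00:05", 21.7)
put("sensor-1", "meta", "location", "lab")
put("sensor-2", "meta", "location", "roof")  # rows need not share columns

print(get_family("sensor-1", "readings"))
print(get_family("sensor-2", "readings"))  # {} because no readings were written
```

Grouping columns into families lets a query for readings skip the metadata entirely, which suits the time-series and log workloads listed above.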
Advantages:
1. Rapid Development: Developers can iterate quickly because adding or changing a field
does not require a schema migration.
2. Adaptability: Supports semi-structured and unstructured data, such as JSON or
multimedia.
3. Scalability: Suited for distributed architectures with high availability and fault tolerance.
4. Cost-Effective: Handles large data volumes by scaling out across ordinary servers instead of requiring expensive, high-end hardware.
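The flexibility described in the list above is easiest to see with documents whose fields differ. The following is a minimal sketch using plain Python dictionaries as stand-ins for JSON documents; the collection contents and the find helper are hypothetical and do not correspond to any specific database API.

```python
import json

# Two documents in the same collection with different fields (no fixed schema).
articles = [
    {"_id": 1, "title": "Intro to NoSQL", "tags": ["nosql", "basics"]},
    {"_id": 2, "title": "Sharding 101", "author": {"name": "Bob"}, "views": 120},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    ]


print(find(articles, title="Sharding 101"))
print(find(articles, views=120))  # only documents that have this field can match
print(json.dumps(articles[0]))    # documents serialize naturally to JSON
```

Adding the views field to one document required no change to the other, which is the "no schema migration" property in practice. The trade-off is that the application must cope with fields that may be absent.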
Distribution Models
Definition: Distribution models are the architectures and strategies used to distribute data and
computational tasks across multiple nodes or servers.
Key Goals:
o Scalability: Handle growing data and workload.
o Fault Tolerance: Maintain reliability despite failures.
o Efficiency: Maximize resource utilization and minimize latency.
Common Models:
o Centralized vs. Decentralized
o Master-Slave Architectures
o Peer-to-Peer Systems
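As one illustration of the master-slave model listed above (often called primary-replica), the sketch below sends every write to a single primary and spreads reads across replicas. It assumes synchronous replication and an in-memory store purely for brevity; the class names Node and ReplicatedStore are hypothetical.

```python
import itertools


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class ReplicatedStore:
    """Writes go to one primary; reads rotate across replicas."""

    def __init__(self, replica_count=2):
        if replica_count < 1:
            raise ValueError("at least one replica is required")
        self.primary = Node("primary")
        self.replicas = [Node(f"replica-{i}") for i in range(replica_count)]
        self._next_replica = itertools.cycle(self.replicas)

    def write(self, key, value):
        self.primary.data[key] = value
        # Synchronous copy for simplicity. Real systems often replicate
        # asynchronously, so replicas can briefly lag behind the primary.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        replica = next(self._next_replica)  # round-robin read balancing
        return replica.name, replica.data.get(key)


cluster = ReplicatedStore(replica_count=2)
cluster.write("user:1", "alice")
print(cluster.read("user:1"))
print(cluster.read("user:1"))  # served by the other replica
```

This arrangement scales reads by adding replicas, but all writes still pass through one node. A peer-to-peer design removes that single write path at the cost of more complex conflict handling.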
Sharding
Definition: Sharding is a database architecture pattern that splits large datasets into
smaller, more manageable pieces called shards.
Purpose:
o Improves scalability and performance.
o Distributes load evenly across servers.
Key Techniques:
o Horizontal Partitioning: Split rows across shards.
o Vertical Partitioning: Split columns across shards.
Examples in Practice:
o Database sharding in NoSQL systems (e.g., MongoDB, Cassandra).
o URL shortening services.
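Horizontal partitioning needs a rule that decides which shard holds a given row. A common choice is to hash the key, sketched below. The function name shard_for, the shard count of 4, and the "user:N" key format are assumptions for illustration; SHA-256 is used instead of Python's built-in hash() because the built-in is randomized per process for strings.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a string key to a shard index in [0, shard_count) via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Route 10,000 hypothetical user IDs across 4 shards and count rows per shard.
SHARD_COUNT = 4
load = Counter(shard_for(f"user:{i}", SHARD_COUNT) for i in range(10_000))
print(sorted(load.items()))
```

Hashing spreads keys roughly evenly, which serves the goal of distributing load across servers. One known drawback of plain modulo hashing is that changing the shard count remaps most keys; consistent hashing is a commonly used alternative that limits this movement.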
MapReduce: A Distributed Computing Paradigm
Overview:
o Introduced by Google to process large-scale data.
o Works by dividing tasks into smaller sub-tasks that are processed in parallel.
Core Phases:
1. Map: Processes input data to generate intermediate key-value pairs.
2. Shuffle and Sort: Groups intermediate data by keys.
3. Reduce: Aggregates and combines data for the final output.
Advantages:
o Fault tolerance through re-execution of failed tasks.
o Scalability to handle petabytes of data.
Examples of Use Cases:
o Word count
o Log analysis
o Machine learning tasks
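The three core phases can be traced on the word count use case with a single-process sketch. This only mimics the data flow; a real framework such as Hadoop would run the map and reduce functions in parallel on many machines. The function names below are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle_and_sort(pairs):
    """Shuffle and Sort: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a final result."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))


print(word_count(["big data big ideas", "data at scale"]))
# {'at': 1, 'big': 2, 'data': 2, 'ideas': 1, 'scale': 1}
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both phases can be split across workers without coordination.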
Sharding Example:
o Design a sharding strategy for a social media application with billions of users.
MapReduce Example:
o Implement a MapReduce job to count the frequency of words in a large text
dataset.
Partitioning and Combining Example:
o Optimize a MapReduce job by adding a Combiner to pre-aggregate intermediate
results.
Composing MapReduce Example:
o Build a multi-stage pipeline to compute the PageRank of web pages.
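For the Combiner exercise, the idea can be sketched by pre-aggregating counts within each input split before anything is sent to the reducer. Here the combiner is folded into the map function for brevity, whereas frameworks such as Hadoop run it as a separate step on map output. The function names and sample splits are hypothetical.

```python
from collections import Counter


def map_with_combiner(lines):
    """Map one input split, pre-aggregating counts locally before emitting."""
    local = Counter()
    for line in lines:
        local.update(line.lower().split())
    return list(local.items())  # one (word, partial_count) pair per distinct word


def reduce_counts(all_pairs):
    """Reduce: sum the partial counts from every split."""
    totals = Counter()
    for word, partial in all_pairs:
        totals[word] += partial
    return dict(totals)


splits = [["big data big data big"], ["data data scale"]]
emitted = [pair for split in splits for pair in map_with_combiner(split)]
print(len(emitted))            # 4 pairs instead of 8 without a combiner
print(reduce_counts(emitted))  # same totals as an uncombined word count
```

Pre-aggregation is safe here because addition is associative and commutative, so summing partial sums gives the same result as summing the individual ones. An operation without that property, such as a plain average, cannot be combined this way without restructuring it.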