Big Data Unit-Ii Notes

Uploaded by

t88699857

NoSQL Data Management

NoSQL databases are designed to store large volumes of unstructured or semi-structured data
and to scale horizontally. Unlike traditional relational databases, they do not rely on
fixed schemas, which makes them more flexible and well suited to modern applications such as
web services, IoT, and big data.

Introduction to NoSQL

 Definition: NoSQL is usually read as "Not Only SQL." It covers a wide variety of
database technologies designed to overcome limitations of relational databases.
 Characteristics:
o Schema-less
o Horizontal scalability
o High availability and fault tolerance
o Support for unstructured, semi-structured, and structured data
 Applications: Big Data analytics, real-time web applications, content management, etc.
 Types:
o Key-Value Stores
o Document Stores
o Column-Family Stores
o Graph Databases

Aggregate Data Models

 Focus: Aggregate data models group related data into a single unit, called an aggregate.
 Example: A document in MongoDB that contains a user's profile, preferences, and
purchase history.
 Benefits:
o Simplifies data access patterns.
o Reduces the need for complex joins.
 Types:
o Key-Value Models: Simple key-value pairs.
o Document Models: JSON or BSON-like objects.
o Column-Family Models: Rows identified by a row key, with columns grouped into column families.
 Distribution: Because an aggregate is stored and retrieved as one unit, it is a natural unit for replicating and distributing data.
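
The aggregate idea can be sketched in Python. This is a minimal illustration, assuming a hypothetical user record: the field names (profile, preferences, purchases) and the in-memory `users` dictionary are invented for the example and do not come from any particular database. One lookup by key returns the whole aggregate, so no join is needed.

```python
# An aggregate: everything about one user stored as a single nested unit.
# All field names here are hypothetical.
user_aggregate = {
    "_id": "user-42",
    "profile": {"name": "Asha", "city": "Pune"},
    "preferences": {"language": "en", "newsletter": True},
    "purchases": [
        {"item": "keyboard", "price": 45.0},
        {"item": "monitor", "price": 180.0},
    ],
}

# A stand-in for a document collection keyed by aggregate id.
users = {user_aggregate["_id"]: user_aggregate}


def total_spent(user_id):
    """Read one aggregate and sum its embedded purchases (no join needed)."""
    doc = users[user_id]
    return sum(purchase["price"] for purchase in doc["purchases"])


print(total_spent("user-42"))
```

Because the purchases travel with the user, the whole aggregate can be placed on one server or copied to a replica as a unit.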

Aggregates

 Definition: Aggregates are collections of related data treated as a single unit.
 Importance:
o Defines boundaries for transactions.
o Improves scalability and data locality.
 Examples:
o Shopping cart (key-value)
o Blog post with comments (document model)

Key-Value and Document Data Models

 Key-Value Stores:
o Simplest NoSQL model.
o Data stored as key-value pairs.
o Examples: Redis, DynamoDB.
o Use Cases: Session management, caching.
 Document Stores:
o Data stored as JSON, BSON, or XML documents.
o Examples: MongoDB, CouchDB.
o Use Cases: Content management, real-time analytics.
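
The difference between the two models can be sketched with toy in-memory classes. This is only an illustration under stated assumptions: `KeyValueStore`, `DocumentStore`, and `find` are hypothetical names and are not the API of Redis, DynamoDB, MongoDB, or CouchDB. The point is that a key-value store treats the value as opaque, while a document store can look inside it.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque, lookup is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class DocumentStore(KeyValueStore):
    """Toy document store: values are dicts, so fields can be queried."""

    def find(self, field, expected):
        return [
            doc for doc in self._data.values()
            if isinstance(doc, dict) and doc.get(field) == expected
        ]


# Session data stored as an opaque string under a key.
sessions = KeyValueStore()
sessions.put("session:abc", json.dumps({"user": "asha"}))

# Articles stored as documents that can be filtered by field.
articles = DocumentStore()
articles.put("a1", {"title": "Intro to NoSQL", "author": "asha"})
articles.put("a2", {"title": "Sharding basics", "author": "ravi"})

by_asha = articles.find("author", "asha")
```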

Relationships

 Relational databases handle relationships using foreign keys and joins.
 NoSQL manages relationships differently:
o Embedding: Nest related data within a single document.
o Referencing: Link data using identifiers.
o Graph Databases: Use edges and nodes to model relationships explicitly.
 Examples:
o Embedded documents in MongoDB.
o Relationships in Neo4j.
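
Embedding and referencing can be compared side by side in plain Python. The sketch below assumes a hypothetical blog post with two comments; the names `comment_ids` and `load_comments` are illustrative only. Embedding reads everything in one fetch, while referencing needs extra lookups that the application performs itself.

```python
# Embedding: the comments live inside the post document.
post_embedded = {
    "_id": "post-1",
    "title": "Why aggregates?",
    "comments": [
        {"author": "ravi", "text": "Nice summary."},
        {"author": "mei", "text": "Helpful."},
    ],
}

# Referencing: the post stores only identifiers; comments live elsewhere.
comments = {
    "c1": {"author": "ravi", "text": "Nice summary."},
    "c2": {"author": "mei", "text": "Helpful."},
}
post_referenced = {
    "_id": "post-1",
    "title": "Why aggregates?",
    "comment_ids": ["c1", "c2"],
}


def load_comments(post):
    """Resolve references with extra lookups (an application-level join)."""
    return [comments[comment_id] for comment_id in post["comment_ids"]]
```

Embedding suits data that is always read together; referencing avoids duplication when the same item is shared by many documents.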

Graph Databases

Graph databases are a type of database designed to store and query data structured as a graph,
where entities (nodes) are connected by relationships (edges). They are particularly suited for
applications that require modeling and querying complex, interconnected data efficiently.

Key Concepts in Graph Databases

1. Nodes: Represent entities, such as people, places, or objects.
2. Edges: Represent relationships or connections between nodes. For example,
"FRIENDS_WITH" or "LIKES."
3. Properties: Metadata attached to nodes or edges, such as a person's name, age, or the
date a relationship was established.
4. Labels: Tags assigned to nodes to classify them (e.g., "Person" or "Movie").
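
The four concepts above can be sketched as a small property graph in Python. This is a hedged illustration, not how any graph database stores data internally: the `Node` and `Edge` classes, the sample people, and the `neighbors` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set                                     # e.g. {"Person"}
    properties: dict = field(default_factory=dict)  # e.g. {"name": "Asha"}


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str                                   # e.g. "FRIENDS_WITH"
    properties: dict = field(default_factory=dict)  # e.g. {"since": "..."}


nodes = {
    "p1": Node("p1", {"Person"}, {"name": "Asha", "age": 29}),
    "p2": Node("p2", {"Person"}, {"name": "Ravi", "age": 31}),
    "m1": Node("m1", {"Movie"}, {"title": "Arrival"}),
}
edges = [
    Edge("p1", "p2", "FRIENDS_WITH", {"since": "2021-03-01"}),
    Edge("p2", "m1", "LIKES"),
]


def neighbors(node_id, rel_type):
    """Follow outgoing edges of one relationship type from a node."""
    return [
        nodes[edge.target]
        for edge in edges
        if edge.source == node_id and edge.rel_type == rel_type
    ]
```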

Advantages of Graph Databases

1. Efficient Querying of Relationships: Ideal for traversing and querying relationships in
highly interconnected datasets.
2. Flexible Schema: No fixed schema, allowing for changes to the data model without
disrupting the database.
3. Performance: Queries involving relationships can be faster than in relational databases,
because relationships are stored directly as edges rather than computed through joins.
4. Visualization: Data and relationships are easy to visualize for better understanding.

Common Use Cases

 Social Networks: Modeling friendships, followers, or group memberships.
 Recommendation Engines: Suggesting products, movies, or content based on user
preferences and behaviors.
 Fraud Detection: Identifying patterns and anomalies in transactions.
 Network Analysis: Telecommunications, transport, or logistics optimization.
 Knowledge Graphs: Representing and querying knowledge bases.
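
A social-network recommendation can be expressed as a short graph traversal. The sketch below assumes a hypothetical friendship graph held in an adjacency dictionary and suggests "friends of friends" with a breadth-first search; the names `friends`, `within_hops`, and `suggest_friends` are invented for the example. A graph database would run a comparable traversal through its query language.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency dictionary.
friends = {
    "asha": {"ravi", "mei"},
    "ravi": {"asha", "tom"},
    "mei": {"asha", "tom", "lena"},
    "tom": {"ravi", "mei"},
    "lena": {"mei"},
}


def within_hops(start, max_hops):
    """Breadth-first traversal returning {person: hop distance}."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in friends[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance


def suggest_friends(person):
    """People exactly two hops away: friends of friends, not yet friends."""
    reachable = within_hops(person, 2)
    return sorted(name for name, hops in reachable.items() if hops == 2)


print(suggest_friends("asha"))
```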

Popular Graph Databases

1. Neo4j: One of the most widely used graph databases, with Cypher as its query language.
2. Amazon Neptune: A cloud-based graph database service.
3. ArangoDB: A multi-model database supporting graph, document, and key-value data.
4. JanusGraph: Open-source, scalable graph database optimized for large graphs.
5. TigerGraph: Focused on real-time analytics and scalability.

Query Languages

 Cypher: A declarative, pattern-matching language used by Neo4j, often compared to SQL.
 Gremlin: A traversal language used with Apache TinkerPop-compliant databases like
JanusGraph.
 SPARQL: A query language for querying RDF (Resource Description Framework) data,
often used in semantic web applications.

Schema-Less Databases

Types of Schema-less Databases

1. Key-Value Stores
o Data is stored as key-value pairs.
o Example: Redis, Amazon DynamoDB, Riak.
o Use Case: Session management, caching, and user preferences.
2. Document Stores
o Data is stored as documents (e.g., JSON, BSON, or XML).
o Example: MongoDB, Couchbase, RavenDB.
o Use Case: Content management systems, e-commerce, and real-time analytics.
3. Column-Family Stores
o Data is stored in columns grouped into families.
o Example: Apache Cassandra, HBase.
o Use Case: Time-series data, log analysis, and IoT applications.
4. Graph Databases
o Focus on storing data as nodes and edges (relationships).
o Example: Neo4j, Amazon Neptune, TigerGraph.
o Use Case: Social networks, fraud detection, and recommendation engines.

Advantages of Schema-less Databases

1. Rapid Development: Developers can iterate quickly without worrying about schema
changes.
2. Adaptability: Supports semi-structured and unstructured data, such as JSON or
multimedia.
3. Scalability: Suited for distributed architectures with high availability and fault tolerance.
4. Cost-Effective: Handles large data volumes without expensive, high-end hardware.

Disadvantages of Schema-less Databases

1. Complexity in Queries: May lack the rich querying capabilities of SQL.
2. Data Integrity: Schema enforcement must often be handled at the application level.
3. Consistency: May sacrifice strict consistency (in favor of eventual consistency) for better
performance and availability.
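
The data-integrity point can be illustrated with a small application-level check. This is a minimal sketch, assuming a hypothetical set of required fields; `REQUIRED_FIELDS`, `validate`, and `insert` are invented names, and the list `collection` stands in for a schema-less store that would accept anything.

```python
# Hypothetical rules the application enforces because the store does not.
REQUIRED_FIELDS = {"name": str, "email": str, "age": int}

collection = []  # stand-in for a schema-less collection


def validate(doc):
    """Return a list of problems; an empty list means the document is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    return errors


def insert(doc):
    """Reject invalid documents before they reach the store."""
    errors = validate(doc)
    if errors:
        raise ValueError("; ".join(errors))
    collection.append(doc)


insert({"name": "Asha", "email": "asha@example.com", "age": 29})
```

Without such a check, a document with a missing email or a string in the age field would be stored silently and cause errors later, at read time.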

Popular Schema-less Databases

1. MongoDB: A document-oriented database widely used for modern web applications.
2. Apache Cassandra: A column-family database optimized for high availability.
3. Redis: An in-memory key-value store known for its speed.
4. Amazon DynamoDB: A cloud-native key-value and document database.
5. Couchbase: A distributed, document-oriented NoSQL database.

When to Use Schema-less Databases

 Rapidly changing data models.
 Applications with large-scale or distributed requirements.
 Use cases involving unstructured or semi-structured data.
 High-performance, low-latency requirements (e.g., caching, real-time analytics).

Materialized Views

 Definition: Precomputed query results stored for faster access.
 Benefits:
o Improves performance for frequently run queries.
o Reduces computation overhead.
 Examples: Used in Cassandra for denormalized queries.
 Challenges: Keeping materialized views up-to-date.
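
The trade-off can be sketched in a few lines of Python. This is an illustration only, assuming a hypothetical order log: `revenue_by_customer` plays the role of the materialized view and is updated on every write, which is one simple way to keep it current (real systems may instead refresh views periodically or asynchronously).

```python
from collections import defaultdict

orders = []                                # base data
revenue_by_customer = defaultdict(float)   # the precomputed "view"


def add_order(customer, amount):
    """Write to the base data and keep the view in sync."""
    orders.append({"customer": customer, "amount": amount})
    revenue_by_customer[customer] += amount


def revenue_slow(customer):
    """Recompute from the base data on every call."""
    return sum(o["amount"] for o in orders if o["customer"] == customer)


def revenue_fast(customer):
    """Read the precomputed result."""
    return revenue_by_customer.get(customer, 0.0)


add_order("asha", 10.0)
add_order("ravi", 7.5)
add_order("asha", 5.5)
```

Reads become a single lookup, at the cost of extra work on each write and the risk of a stale view if an update is missed.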

Introduction to Distribution Models

 Definition:
Distribution models refer to architectures and strategies for distributing data and
computational tasks across multiple nodes or servers.
 Key Goals:
o Scalability: Handle growing data and workload.
o Fault Tolerance: Maintain reliability despite failures.
o Efficiency: Maximize resource utilization and minimize latency.
 Common Models:
o Centralized vs. Decentralized
o Master-Slave Architectures
o Peer-to-Peer Systems

Sharding

 Definition: Sharding is a database architecture pattern that splits large datasets into
smaller, more manageable pieces called shards.
 Purpose:
o Improves scalability and performance.
o Distributes load evenly across servers.
 Key Techniques:
o Horizontal Partitioning: Split rows across shards (the usual meaning of sharding).
o Vertical Partitioning: Split columns or tables across servers.
 Examples in Practice:
o Database sharding in NoSQL systems (e.g., MongoDB, Cassandra).
o URL shortening services.
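
Horizontal partitioning by key can be sketched as hash-based routing. The example below is a simplified illustration, assuming four in-memory shards; `NUM_SHARDS`, `shard_for`, `put`, and `get` are hypothetical names. It uses a stable hash from `hashlib` because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a server


def shard_for(key):
    """Map a key to a shard index with a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


for i in range(10):
    put(f"user:{i}", {"id": i})
```

Plain modulo routing moves most keys when the number of shards changes, which is why production systems often use consistent hashing or range-based schemes instead.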

MapReduce: A Distributed Computing Paradigm

 Overview:
o Introduced by Google to process large-scale data.
o Works by dividing tasks into smaller sub-tasks that are processed in parallel.
 Core Phases:
1. Map: Processes input data to generate intermediate key-value pairs.
2. Shuffle and Sort: Groups intermediate data by keys.
3. Reduce: Aggregates and combines data for the final output.
 Advantages:
o Fault tolerance through re-execution of failed tasks.
o Scalability to handle petabytes of data.
 Examples of Use Cases:
o Word count
o Log analysis
o Machine learning tasks
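
The three phases can be sketched for the word-count use case in a single Python process. This is a teaching sketch, not a distributed implementation: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a final result."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big ideas", "data moves"]))
```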

Partitioning and Combining

 Partitioning in MapReduce:
o Assigns each intermediate key to a reduce task, so that all values for a key reach the same reducer.
o Controlled by a partition function (e.g., hash-based partitioning), which should spread keys evenly to balance the workload.
 Combining:
o A local Reduce step performed on intermediate data before the shuffle phase.
o Optimizes the process by reducing the volume of data transferred.
o Example: Pre-aggregating word counts before the reduce phase.
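
Both ideas can be shown together in a small single-process sketch. It assumes two hypothetical input splits and two reducers; `map_with_combiner`, `partition`, and `run` are invented names. The combiner sums counts inside each mapper before anything is shuffled, and a hash-based partition function picks the reducer for each word. Summing works as a combiner because addition is associative and commutative.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2


def map_with_combiner(lines):
    """Map one input split, combining counts locally before the shuffle."""
    combined = Counter()
    for line in lines:
        for word in line.lower().split():
            combined[word] += 1  # local pre-aggregation
    return list(combined.items())


def partition(key):
    """Choose the reducer for a key; crc32 gives the same value on every run."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def run(splits):
    """Return final counts and the number of pairs sent through the shuffle."""
    reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    shuffled_pairs = 0
    for split in splits:
        for word, count in map_with_combiner(split):
            reducer_inputs[partition(word)][word].append(count)
            shuffled_pairs += 1
    totals = {}
    for groups in reducer_inputs:
        for word, counts in groups.items():
            totals[word] = sum(counts)
    return totals, shuffled_pairs


totals, shuffled_pairs = run([["a a a b"], ["a b b"]])
```

In this input the mappers would emit seven (word, 1) pairs without a combiner; with it, only four pairs cross the shuffle.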

Composing MapReduce Calculations

 Concept of Composition:
o Breaking down complex problems into multiple MapReduce jobs.
o Each job’s output serves as the input for the next.
 Techniques:
o Chaining Jobs: A pipeline of dependent MapReduce operations.
o Directed Acyclic Graph (DAG): Frameworks like Apache Spark generalize
MapReduce with DAGs for more complex workflows.
 Practical Applications:
o Data transformations (e.g., ETL pipelines).
o Multi-stage machine learning workflows.
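
Chaining can be sketched with a tiny generic job runner, where the output of one job becomes the input of the next. This is a single-process illustration with hypothetical names (`run_job`, `count_mapper`, `freq_mapper`, and so on): the first job counts words and the second groups words by their count.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map, shuffle, reduce pass over a list of records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: word count.
def count_mapper(line):
    return [(word, 1) for word in line.lower().split()]


def count_reducer(word, ones):
    return word, sum(ones)


# Job 2: group words by how often they occur.
def freq_mapper(pair):
    word, count = pair
    return [(count, word)]


def freq_reducer(count, words):
    return count, sorted(words)


def words_by_frequency(lines):
    counts = run_job(lines, count_mapper, count_reducer)  # output of job 1
    return run_job(counts, freq_mapper, freq_reducer)     # input of job 2


print(words_by_frequency(["big data big ideas", "data moves"]))
```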

Practical Examples and Exercises

 Sharding Example:
o Design a sharding strategy for a social media application with billions of users.
 MapReduce Example:
o Implement a MapReduce job to count the frequency of words in a large text
dataset.
 Partitioning and Combining Example:
o Optimize a MapReduce job by adding a Combiner to pre-aggregate intermediate
results.
 Composing MapReduce Example:
o Build a multi-stage pipeline to compute the PageRank of web pages.
