Unit-III CC&BD Cs62 Ab
NOTE: I declare that the PPT content is picked up from the prescribed course text
books or reference material prescribed in the syllabus book and Online Portals.
Unit III
Introduction to Big Data:
• What is big data and why is it Important?
• Industry Examples of Big Data: Big Data and the New School of Marketing; Advertising and Big Data.
• Types of Digital data, Big Data - Characteristics, Evolution of Big Data,
Challenges;
Storing Data in Databases and Data Warehouses:
• RDBMS and Big Data,
• Issues with Relational and Non-Relational Data Model,
• Integrating Big data with Traditional Data Warehouses,
• Big Data Analysis and Data Warehouse,
• Changing Deployment Models in Big Data Era.
NoSQL Data Management:
• Introduction to NoSQL Data Management,
• Types of NoSQL Data Models,
• Distribution Models,
• CAP Theorem,
• Sharding
Introduction to Big Data
• The "Internet of Things" and its widely ultra-connected nature are leading to a
burgeoning(increase rapidly) rise in big data. There is no dearth(scarcity) of data for
today's enterprise.
• Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or
traditional data processing applications.
What is Big Data
• Big data is data that exceeds the processing and storing capacity of
conventional database systems.
• The data is too big, moves too fast, or does not fit the structures of traditional database architectures/systems.
• Big data is a collection of data sets that are complex in nature, fast-growing, and varied, including both structured and unstructured data.
What is Big Data Analytics
• Big data analytics is the use of advanced analytic techniques against very
large, diverse data sets that include structured, semi-structured and
unstructured data, from different sources, and in different sizes from
terabytes to zettabytes
Big Data Analytics: Facts
Walmart
• handles 1 million customer transactions per hour,
• amounting to about 2.5 petabytes of data.
Facebook
• handles 40 billion photos from its user base!
• inserts 500 terabytes of new data every day
• stores, accesses, and analyzes 30 Petabytes of user generated data
More than 5 billion people are calling, texting, tweeting and browsing on mobile
phones worldwide
“apart from the changes in the actual hardware and software technology, there has also
been a massive change in the actual evolution of data systems. I compare it to the stages
of learning: dependent, independent, and interdependent.”
• Dependent (early days): Data systems were fairly new and users didn't quite know what they wanted. IT assumed that "build it and they shall come."
• Independent: Users understood what an analytical platform was and worked together with IT to define the business needs and approach for deriving insights for their firm.
During the customer relationship management (CRM) era of the 1990s, many
companies made substantial investments in customer-facing technologies that
subsequently failed to deliver expected value.
• The reason for most of those failures was fairly straightforward: Management either forgot (or just didn't know) that big projects require a synchronized transformation of people, process, and technology. All three must be marching in step or the project is doomed.
Why Big Data?
1. Understanding and Targeting Customers
• Here, big data is used to better understand customers and their behaviors and
preferences.
• Using big data, Telecom companies can now better predict customer churn;
• Wal-Mart can predict what products will sell, and
• car insurance companies understand how well their customers actually drive.
• Even government election campaigns can be optimized using big data analytics.
• Big data is not just for companies and governments but also for all of us
individually.
• We can now benefit from the data generated from wearable devices such as
smart watches or smart bracelets: collects data on our calorie consumption,
activity levels, and our sleep patterns.
• Most online dating sites apply big data tools and algorithms to find us the most
appropriate matches.
• The computing power of big data analytics enables us to decode entire DNA
strings in minutes and will allow us to understand and predict disease patterns.
• Big data techniques are already being used to monitor babies in a specialist
premature and sick baby unit.
• By recording and analyzing every heartbeat and breathing pattern of every
baby, the unit was able to develop algorithms that can now predict infections 24
hours before any physical symptoms appear.
5. Improving Sports Performance
• Most elite sports have now embraced big data analytics. We have the IBM
SlamTracker tool for tennis tournaments;
• we use video analytics that track the performance of every player in a football or baseball game, and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and cloud servers) on our game and how to improve it.
• The CERN data center has 65,000 processors to analyze its 30 petabytes of data; it also relies on thousands of computers distributed across 150 data centers worldwide to analyze the data.
• For example, big data tools are used to operate Google's self-driving car.
• The Toyota is fitted with cameras and GPS, as well as powerful computers and sensors, so it can drive safely on the road without human intervention.
• The National Security Agency (NSA) in the U.S. uses big data analytics to
prevent terrorist plots .
• Others use big data techniques to detect and prevent cyber attacks.
9. Improving and Optimizing Cities and Countries
• Big data is used to improve many aspects of our cities and countries.
• For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data.
• In such a system a bus would wait for a delayed train, and traffic signals would predict traffic volumes and operate to minimize jams.
• High-Frequency Trading (HFT) is an area where big data finds a lot of use
today. Here, big data algorithms are used to make trading decisions.
• Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.
A Wider Variety of Data
The variety of data sources continues to increase. Traditionally, internally focused
operational systems, such as ERP (enterprise resource planning) and CRM
applications, were the major source of data used in analytic processing.
• Unstructured data is basically information that either does not have a predefined data model and/or does not fit well into a relational database. Unstructured information is typically text heavy, but may contain data such as dates, numbers, and facts as well.
• The amount of data (all data, everywhere) is doubling every two years.
• Our world is becoming more transparent. We, in turn, are beginning to
accept this as we become more comfortable with parting with data that
we used to consider sacred and private.
• Most new data is unstructured. Specifically, unstructured data
represents almost 95 percent of new data, while structured data represents
only 5 percent.
• Unstructured data tends to grow exponentially, unlike structured data,
which tends to grow in a more linear fashion.
Big Data Analytics: Is Big Data analytics worth the effort? Yes
3. Frictionless actions: increased reliability and accuracy allow deeper and broader insights to be automated into systematic actions.
Industry Examples of Big Data
Digital Marketing
• Google's digital marketing evangelist and author Avinash Kaushik spent the first 10 years of his professional career in the world of business intelligence, during which he actually built large multiterabyte data warehouses and the intelligence platforms.
• Avinash Kaushik designed a framework in his book Web Analytics 2.0: The Art of Online Accountability and Science of Customer Centricity, in which he states that if you want to make good decisions on the Web, you have to learn how to use different kinds of tools to bring multiple types of data together and make decisions at the speed of light!
• Many of today's marketers are discussing and assessing their approaches to engage consumers in different ways, such as social media marketing.
• you have to have the primary outpost from where you can collect your own “big
data” and have a really solid relationship with the consumers you have and their data
so you can make smarter decisions.
Database Marketers, Pioneers of Big Data
• It began back in the 1960s, when people started building mainframe systems that
contained information on customers and information about the products and services
those customers were buying
• By the 1980s, marketers developed the ability to run reports on the information in
their databases. The reports gave them better and deeper insights into the buying
habits and preferences of customers
• In the 1990s, email entered the picture, and marketers quickly saw opportunities for
reaching customers via the Internet and the World Wide Web.
• Today, many companies have the capability to store and analyze data generated
from every search you run on their websites, every article you read, and every
product you look at.
Big Data and the New School of Marketing
"Today's consumers have changed. They've put down the newspaper, they fast forward through TV commercials, and they junk unsolicited email. Why? They have new options that better fit their digital lifestyle. They can choose which marketing messages they receive, when, where, and from whom.
• New School marketers deliver what today's consumers want: relevant interactive communication across the digital power channels: email, mobile, social, display, and the web."
(2) They can automate and optimize their programs and processes throughout the
customer lifecycle. Once marketers have that, they need a practical framework
for planning marketing activities.
• Let's take a look at the various loops that guide marketing strategies and tactics in the Cross-Channel Lifecycle Marketing approach: conversion, repurchase, stickiness, win-back, and re-permission (see Figure 2.1).
Web Analytics
• Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
• The following are some of the web analytics metrics: Hit, Page View, Visit/Session, First Visit / First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page Time Viewed / Page Visibility Time / Page View Duration, Session Duration / Visit Duration, Average Page View Duration, and Click Path, etc.
• What is unique about the Web is that the primary way in which data gets collected, processed, stored, and accessed is actually through a third party.
• Big Data on the Web will completely transform a company’s ability to understand
the effectiveness of its marketing and hold its people accountable for the
millions of dollars that they spend. It will also transform a company’s ability to
understand how its competitors are behaving.
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that
varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and
lifetime value
• It tells you how customers and prospective customers engage with your
different marketing campaigns and how that drives subsequent behaviour
Web analytics tools are good at delivering the standard reports that are common across
different business types
• Where does your traffic come from? For example:
• Sessions by marketing campaign / referrer
• Sessions by landing page
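As an illustration (not taken from the text), here is a minimal Python sketch of how such a standard report might be computed from raw session records; the record layout and field names (referrer, landing_page) are assumptions.

```python
from collections import Counter

# Hypothetical session records; the field names are illustrative assumptions.
sessions = [
    {"referrer": "google", "landing_page": "/home"},
    {"referrer": "newsletter", "landing_page": "/offer"},
    {"referrer": "google", "landing_page": "/pricing"},
]

# Sessions by marketing campaign / referrer
by_referrer = Counter(s["referrer"] for s in sessions)

# Sessions by landing page
by_landing_page = Counter(s["landing_page"] for s in sessions)

print(by_referrer)       # Counter({'google': 2, 'newsletter': 1})
print(by_landing_page)
```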
• As a result of the growing popularity and use of social media around the world and
across nearly every demographic, the amount of user-generated content—or “big
data”—created is immense, and continues growing exponentially.
• Millions of status updates, blog posts, photographs, and videos are shared every
second. Very intelligent software is required to parse all that social data to define
things like the sentiment of a post.
• In terms of geography, Singer explained that they are combining social check-in data from Facebook, Foursquare, and similar social sites and applications over maps to show brands, at the country, state/region, and down to the street level, where conversations are happening about their brand, products, or competitors.
• Customer intent is the big data challenge we're focused on solving. By applying intelligent algorithms and complex logic with very deep, real-time text analysis, we're able to group customers into buckets such as awareness, opinion, consideration, preference, and purchase.
• That ability lets marketers create unique messages and offers for people along each phase of the purchase process, and lets sales more quickly identify qualified sales prospects.
• Marketers now have the opportunity to mine social conversations for purchase
intent and brand lift through Big Data.
• Condition: The condition of data deals with the state of data, that is,
• "Can one use this data as is for analysis?" or
• "Does it require cleansing for further enhancement and enrichment?"
Storing Data in Databases and Data Warehouses – RDBMS and Big Data
RDBMS:
• Structured Schemas: Uses predefined tables and relationships.
• Schema on Write: Schema is defined before data is written.
• Transactional Systems: Ideal for applications like financial systems and inventory management.
• ACID Compliance: Ensures reliable transaction processing.
Big Data:
• Flexible Data Handling: Accommodates various data formats and structures.
• Schema on Read: Schema is applied when data is read (contrasted with schema on write in the sketch after this list).
• Batch and Real-Time Processing: Supports large-scale data processing.
• Scalability: Designed to scale out horizontally.
• Suitable Applications: Analytics, sentiment analysis, fraud detection, IoT data processing.
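To make the schema-on-write vs. schema-on-read distinction concrete, here is a minimal Python sketch; it is illustrative only, using sqlite3 to stand in for the relational side and raw JSON lines for the big data side.

```python
import json
import sqlite3

# Schema on write: the table structure must exist before any row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.5))

# Schema on read: raw records are stored as-is (e.g., JSON lines);
# structure is imposed only when the data is read and analyzed.
raw_lines = [
    '{"id": 1, "amount": 99.5, "coupon": "SPRING"}',
    '{"id": 2, "items": ["book", "pen"]}',          # different shape, still accepted
]
parsed = [json.loads(line) for line in raw_lines]
total = sum(rec.get("amount", 0) for rec in parsed)  # schema applied at read time
print(total)
```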
Storing Data in Databases and Data
Warehouses – RDBMS and Big Data
Storing Data in Databases and Data Warehouses –
Issues with Relational and Non-Relational model
Issues with Relational model
Traditional relational database models separate blog posts and comments into
different tables.
Each post has a unique ID in the Posts table, and comments related to a post
reference this ID in the Comments table.
When a visitor accesses a blog post, the software fetches the post content and
comments separately from their respective tables.
This separation can lead to inefficiencies, as retrieving comments requires
knowledge of the associated post.
NoSQL databases offer an alternative approach, allowing for more flexible data
structures that can better accommodate relationships between posts and
comments.
By storing posts and their comments together or in a more interconnected
manner, NoSQL databases can simplify querying and improve performance for
applications with complex relationships like blogs.
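A minimal sketch of the two approaches described above, in illustrative Python: sqlite3 stands in for the relational model, and a plain nested dictionary stands in for a document-style store.

```python
import sqlite3

# Relational model: posts and comments live in separate tables,
# and comments reference their post through a foreign key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)")
db.execute("INSERT INTO posts VALUES (1, 'Hello Big Data')")
db.execute("INSERT INTO comments VALUES (1, 1, 'Nice post!')")

# Rendering one blog page requires two separate fetches (or a join).
post = db.execute("SELECT * FROM posts WHERE id = 1").fetchone()
comments = db.execute("SELECT body FROM comments WHERE post_id = 1").fetchall()
print(post, comments)

# Document model: the post and its comments are stored together
# as one self-contained document, so a single read returns everything.
post_doc = {
    "_id": 1,
    "title": "Hello Big Data",
    "comments": [{"body": "Nice post!"}],
}
print(post_doc["comments"])
```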
Storing Data in Databases and Data Warehouses –
Issues with Relational and Non-Relational model
Issues with Non-Relational model
Non-relational databases, like NoSQL, diverge from the
traditional RDBMS table/key model, offering alternative solutions for
Big Data management.
They're favored by tech giants like Google, Amazon, Yahoo!, and Facebook for their
scalability and ability to handle unpredictable traffic spikes.
Non-relational databases provide scalability without traditional
table structures,
utilizing specialized frameworks for storage and querying.
Common characteristics include scalability across clusters, seamless expansion for
increasing data flows, and a query model based on key-value pairs or documents.
Efficient design principles, such as dynamic memory
utilization, ensure high performance in managing large data volumes.
Eventual consistency, a feature of non-relational databases, ensures availability and
network partition tolerance.
Despite simplicity, the non-relational model poses challenges in data organization and
retrieval, especially for tasks like tagging posts with categories, necessitating careful
software-level considerations.
Polyglot Persistence
Key methods:
• Data Availability: Big Data systems require immediate access to data. NoSQL
databases like Hadron help ensure availability, though challenges arise in handling
context-sensitive data and avoiding duplicate data impact.
• Pattern Study: Analyzing patterns in data allows for efficient retrieval and analysis.
Trending topics and pattern-based study models aid in knowledge gathering,
identifying relevant patterns in massive data streams.
Integrating Big Data with Traditional Data
Warehouses
• Data Incorporation and Integration: Integrating data poses challenges due to
continuous processing. Dedicated machines can alleviate resource conflicts,
simplifying configuration and setup processes.
• Data Volumes and Exploration: Managing large datasets is crucial. Retention
requirements vary, necessitating exploration and mining for procurement and
optimization. Neglecting these areas can lead to performance drains.
• Compliance and Legal Requirements: Adhering to compliance standards is
essential for data security. Data infrastructure can comply with standards while
implementing additional security measures to minimize risks and performance
impacts.
• Storage Performance: Optimizing storage performance is vital. Considerations
include disk performance, SSD utilization, and data exchange across layers.
Addressing these challenges ensures efficient storage in Big Data environments.
Big Data Analysis and Data Warehouse
Big Data Solutions Overview: Big Data solutions facilitate storing large,
heterogeneous data in low-cost devices in raw or unstructured formats, aiding
trend analysis and future predictions across various sectors.
Data Warehousing Definition: Data warehousing involves methods and software
for collecting, integrating, and synchronizing data from multiple sources into a
centralized database, supporting analytical visualization and key performance
tracking.
Case Study: Argon Technology: Argon Technology implements a data warehouse
for a client analyzing data from 100,000 employees worldwide, streamlining
performance assessment processes.
Complexity of Data Warehouse Environment: Recent years have seen increased
complexity with the introduction of various data warehouse technologies and tools
for analytics and real-time tasks.
Comparison: Big Data Solution vs. Data Warehousing: Big Data solutions
handle vast data quantities, while data warehousing organizes integrated data for
informed decision-making.
Differentiation and Use Cases: Big Data analysis focuses on raw data, while data
warehousing filters data for strategic and management purposes.
Future Prospects: Enterprises continue relying on data warehousing for reporting
and visualization, alongside Big Data analytics for insights, ensuring
comprehensive database support.
Changing Deployment Models in Big Data Era
• Deployment Shift: Transition from traditional data centers to distributed database
nodes within the same data center has optimized data warehouses, focusing on
scalability and cost-effectiveness.
• Challenges: Big Data architecture and cloud computing face challenges related to
data magnitude and location, processing requirements, and technical supportability
of cloud-based service models.
NoSQL Data Management - Introduction
NoSQL databases are non-relational and designed for distributed data stores with
large volumes of data, utilized by companies like Google and Facebook.
These databases do not require fixed schemas, avoid join operations, and scale data
horizontally to accommodate growing data volumes.
Tables in NoSQL databases are stored as ASCII files, with tuples represented by
fields separated with tabs, manipulated through shell scripts or UNIX pipelines.
NoSQL databases are still evolving, with varying opinions among software
developers regarding their usefulness, flaws, and long-term viability.
The chapter covers various aspects of NoSQL, starting with an introduction to its
aggregate data models, including key-value, column-oriented, document, and graph
models.
It further explains the concept of relationships in NoSQL and schema-less databases,
along with materialized views and distribution models.
The concept of sharding, or horizontal partitioning of data, is also discussed towards
the end of the chapter.
NoSQL Data Management - Introduction
Need for NoSQL: NoSQL databases meet the demand for scalability and
continuous availability, offering an alternative to traditional relational databases.
They address technical, functional, and financial challenges, particularly in
environments requiring large-scale data processing and management.
History of NoSQL: The concept of NoSQL emerged in the late 1990s, evolving
from relational databases to address distributed, non-relational, and schema-less
designs. It gained momentum in the early 2000s, driven by the need for open-
source distributed databases and led to the development of popular platforms like
MongoDB, Apache Cassandra, and Redis.
NoSQL Data Management – Types
• Key-value databases offer basic operations like retrieval, storage, and deletion.
• Values, typically Binary Large Objects (BLOBs), store data without internal interpretation.
• Efficient scaling is achieved through primary-key access.
• Popular options include Riak, Redis, Memcached, Berkeley DB, HamsterDB, Amazon DynamoDB, Project Voldemort, and Couchbase.
• Database selection depends on specific requirements, such as persistence and data durability: Riak ensures data persistence, while Memcached lacks persistence.
• Choose the database type based on individual use cases and requirements.
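A minimal in-memory sketch of the key-value model described above: values are treated as opaque blobs and are reachable only through their primary key. This is illustrative Python, not the API of any specific product.

```python
class KeyValueStore:
    """Toy key-value store: opaque values addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value_blob):
        self._data[key] = value_blob          # store; the value is never interpreted

    def get(self, key):
        return self._data.get(key)            # retrieve by primary key only

    def delete(self, key):
        self._data.pop(key, None)             # remove

store = KeyValueStore()
store.put("user:42", b'{"name": "Anna"}')     # the value is just bytes to the store
print(store.get("user:42"))
store.delete("user:42")
```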
NoSQL Data Management – Types
• Column-oriented databases store data by columns rather than rows.
• Each column's values are stored contiguously, allowing for efficient data retrieval.
• Examples of column-oriented databases include Cassandra, BigTable, SimpleDB, and HBase.
• These databases excel in performance for counting and aggregation queries, and are particularly efficient for operations like COUNT and MAX.
• Column-oriented databases are ideal for scenarios where data aggregation and analytics are frequent tasks.
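A minimal sketch of why column orientation helps aggregation: each column's values sit together, so COUNT or MAX touches only one column rather than whole rows. Illustrative Python only.

```python
# Row-oriented layout: every query touches whole rows.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 340},
    {"id": 3, "city": "Pune", "amount": 90},
]

# Column-oriented layout: each column's values are stored contiguously.
columns = {
    "id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
    "amount": [120, 340, 90],
}

# Aggregations read only the column they need.
print(len(columns["amount"]))   # COUNT
print(max(columns["amount"]))   # MAX
```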
NoSQL Data Management – Types
• Document databases store data in self-describing hierarchical structures like XML, JSON, and BSON.
• They provide indexing and searching similar to relational databases but with a different structure.
• While offering performance and scalability benefits, they lack the ACID properties of relational models.
• Choosing a document-oriented database trades database-level data integrity for increased performance.
• Document databases and relational databases serve different purposes and are not direct replacements for each other; organizations often use a combination of relational and document-oriented databases to meet different needs.
• Examples of popular document databases include MongoDB, Couchbase, and OrientDB.
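A minimal sketch of the document model: self-describing, JSON-like documents with different shapes in one collection, queried by field. This is illustrative Python, not the API of MongoDB or any other product.

```python
# A "collection" of self-describing documents; shapes may differ per document.
collection = [
    {"_id": 1, "type": "article", "title": "NoSQL basics", "tags": ["nosql", "intro"]},
    {"_id": 2, "type": "article", "title": "Sharding", "author": {"name": "Anna"}},
]

def find(coll, criteria):
    """Return documents whose fields match all key/value pairs in criteria."""
    return [doc for doc in coll if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, {"type": "article", "title": "Sharding"}))
```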
NoSQL Data Management – Types
• Graph databases utilize semantic queries and graph structures to represent and store data.
• Nodes represent entities or instances, while edges denote relationships between them.
• Unlike relational databases, graph databases allow for dynamic schema changes without extensive modifications.
• Relationships play a crucial role in graph databases, enabling the derivation of meaningful insights.
• Modelling relationships in graph databases requires careful consideration and design expertise.
• Popular graph databases include Neo4J, Infinite Graph, and FlockDB.
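A minimal sketch of the graph model: nodes for entities, edges for named relationships, and new relationship types added at any time without a schema change. Illustrative Python only, not the API of Neo4J or any other product.

```python
# Nodes keyed by name; edges stored as (relationship, target) pairs.
nodes = {
    "Anna": {"kind": "person"},
    "Barbara": {"kind": "person"},
    "NoSQL Distilled": {"kind": "book"},
}
edges = {
    "Anna": [("FRIEND_OF", "Barbara"), ("LIKES", "NoSQL Distilled")],
    "Barbara": [("LIKES", "NoSQL Distilled")],
}

# Adding a brand-new relationship type needs no schema migration.
edges["Barbara"].append(("COLLEAGUE_OF", "Anna"))

def neighbours(node, relationship):
    """Follow edges of one relationship type from a node."""
    return [target for rel, target in edges.get(node, []) if rel == relationship]

print(neighbours("Anna", "LIKES"))
```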
NoSQL Data Management – Distribution
Models
• Aggregate-oriented databases facilitate easy data distribution, since the distribution mechanism moves whole aggregates rather than having to track down related data.
• Data distribution is typically accomplished through two methods: sharding and
replication.
• Sharding involves distributing various data types across multiple servers, with
each server managing a subset of the data.
• Replication enhances fault tolerance by duplicating data across multiple servers,
ensuring each piece of data exists in multiple locations.
• Replication can occur through master-slave replication, where one node handles writes while others handle reads, or through peer-to-peer replication, where writes can go to any node and the nodes coordinate to synchronize their copies.
• While master-slave replication reduces update conflicts, peer-to-peer replication
avoids single points of failure, and some databases utilize a combination of both
techniques.
NoSQL Data Management – CAP Theorem
• The CAP theorem outlines three critical aspects in distributed databases: Consistency,
Availability, and Partition Tolerance.
• According to the CAP theorem, it's impossible for a distributed system to simultaneously achieve
all three aspects.
• Consistency ensures that all clients see the same data after an operation, maintaining data
integrity.
• Availability indicates that the system is continuously accessible without downtime.
• Partition Tolerance ensures that the system functions reliably despite communication failures
between servers.
• NoSQL databases typically operate under one of three combinations: CA (Consistency and
Availability), CP (Consistency and Partition Tolerance), or AP (Availability and Partition
Tolerance).
• Transactions in relational databases adhere to ACID properties (Atomicity, Consistency,
Isolation, Durability), ensuring data integrity and reliability.
• In contrast, NoSQL databases often prioritize BASE principles (Basically Available, Soft state, Eventual consistency), offering flexibility but requiring developers to implement transactional logic manually.
• The absence of built-in transaction support in many NoSQL databases necessitates custom
implementation strategies by developers.
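As one common pattern for the manual transactional logic mentioned above, here is a hedged sketch of an optimistic, version-checked update. It is illustrative Python over a plain dictionary; real NoSQL products expose their own conditional-write or compare-and-set primitives, which are not shown here.

```python
class VersionConflict(Exception):
    pass

# Each record carries a version number; writers must prove they saw the latest one.
store = {"cart:7": {"version": 3, "items": ["pen"]}}

def update_with_version_check(key, expected_version, new_value):
    current = store[key]
    if current["version"] != expected_version:
        # Someone else wrote in between; the application must retry or reconcile.
        raise VersionConflict(f"expected v{expected_version}, found v{current['version']}")
    store[key] = {"version": expected_version + 1, **new_value}

update_with_version_check("cart:7", 3, {"items": ["pen", "book"]})
print(store["cart:7"])   # {'version': 4, 'items': ['pen', 'book']}
```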
NoSQL Data Management – Sharding
• Definition: Database sharding partitions large databases across servers to boost
performance and scalability by distributing data into smaller segments, called shards.
• Origin and Popularity: Coined by Google engineers, the term gained traction through publications such as the Bigtable architecture paper. Major internet companies like Amazon and Facebook have adopted sharding due to the surge in transactional volume and database sizes.
• Purpose and Approach: Sharding aims to enhance the throughput and overall performance of
high-transaction business applications by scaling databases. It provides a scalable solution to
handle increasing data volumes and transaction loads.
• Factors Driving Adoption: With businesses striving to maintain optimal performance amidst
growing demands, sharding becomes essential. Despite improvements in disk I/O and database
management systems, the need for enhanced performance and scalability fuels the adoption of
database sharding.
What is a NoSQL database?
• NoSQL, which stands for “not only SQL,” is an approach to database design that
provides flexible schemas for the storage and retrieval of data beyond the traditional
table structures found in relational databases.
• A NoSQL database is a non-relational data management system that does not require a fixed schema.
• It avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for
distributed data stores with humongous data storage needs.
• NoSQL databases provide flexible schemas and scale easily with large amounts of
data and high user loads.
• NoSQL data models allow related data to be nested within a single data structure.
Why NoSQL?
Data-driven
Sharding of data
• Key-value databases are a simpler type of database where each item contains keys and values. Redis and DynamoDB are popular key-value databases.
• Wide-column stores store data in tables, rows, and dynamic columns. Wide-column
stores provide a lot of flexibility over relational databases because each row is not
required to have the same columns. Cassandra and HBase are two of the most
popular wide-column stores.
• Graph databases store data in nodes and edges. Nodes typically store information
about people, places, and things while edges store information about the relationships
between the nodes. Neo4j and JanusGraph are examples of graph databases.
Impedance Mismatch
• Impedance mismatch is the term used to refer to the problems that occur due to differences between the database model and the programming language model.
• Data type mismatch means the programming language attribute data type may differ
from the attribute data type in the data model.
• Hence it is quite necessary to have a binding for each host programming language
that specifies for each attribute type the compatible programming language types.
• It is necessary to have different data types, for example, we have different data
types available in different programming languages such as data types in C are
different from Java and both differ from SQL data types.
• The results of most queries are sets or multisets of tuples and each tuple is formed
of a sequence of attribute values.
• In the program, it is necessary to access the individual data values within individual
tuples for printing or processing.
• Hence there is a need for binding to map the query result data structure which
is a table to an appropriate data structure in the programming language.
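A minimal sketch of the binding problem described above: the database hands back flat tuples, and application code must map them onto the language's own data structures. Illustrative Python with sqlite3 and a dataclass.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:           # in-memory structure used by the program
    id: int
    name: str

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Anna')")

# The query result is a set of flat tuples ...
rows = db.execute("SELECT id, name FROM customers").fetchall()

# ... which the binding layer must convert into the program's richer types.
customers = [Customer(id=r[0], name=r[1]) for r in rows]
print(customers[0])
```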
Impedance Mismatch
• The difference between the relational model and the in-memory data structures.
• The relational data model organizes data into a structure of tables and rows, or
more properly, relations and tuples.
• In the relational model, a tuple is a set of name-value pairs and a relation is a set
of tuples. (The relational definition of a tuple is slightly different from that in
mathematics and many programming languages with a tuple data type, where a
tuple is a sequence of values.)
Each NoSQL solution has a different model that it uses, which we put into four categories
widely used in the NoSQL ecosystem:
• Graph databases are motivated by a different frustration with relational databases and
thus have an opposite model—small records with complex interconnections
• In Figure 3.1 we have a web of information whose nodes are very small
(nothing more than a name) but there is a rich structure of interconnections
between them
• This is where the important differences between graph and relational databases come in.
• Although relational databases can implement relationships using foreign keys, the
joins required to navigate around can get quite expensive—which means
performance is often poor for highly connected data models
• Graph databases make traversal along the relationships very cheap. A large part
of this is because graph databases shift most of the work of navigating relationships
from query time to insert time. This naturally pays off for situations where
querying performance is more important than insert speed.
• Most of the time you find data by navigating through the network of edges, with
queries such as “tell me all the things that both Anna and Barbara like.”
• You do need a starting place, however, so usually some nodes can be indexed
by an attribute such as ID.
• So you might start with an ID lookup (i.e., look up the people named “Anna”
and “Barbara”) and then start using the edges.
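A minimal sketch of that traversal in Python: look up the two indexed starting nodes ("Anna" and "Barbara"), follow their LIKES edges, and intersect the results. Illustrative only; a real graph database would express this in its own query language.

```python
# Edges of a small graph: person -> set of things they like.
likes = {
    "Anna": {"NoSQL Distilled", "Graph Theory", "Coffee"},
    "Barbara": {"NoSQL Distilled", "Coffee", "Databases"},
}

# "Tell me all the things that both Anna and Barbara like":
# index lookup for the two starting nodes, then traverse and intersect their edges.
both_like = likes["Anna"] & likes["Barbara"]
print(both_like)   # {'NoSQL Distilled', 'Coffee'}
```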
Complex queries typically run faster in graph databases than they do in relational
databases.
The flexibility of a graph database enables the ability to add new nodes and
relationships between nodes, making it reliable for real-time data.
• Relational databases make adding new tables and columns possible while
the database is running.
Distribution Models
The primary driver of interest in NoSQL has been its ability to run databases on a large
cluster.
• As data volumes increase, it becomes more difficult and expensive to scale
up—buy a bigger server to run the database on.
• A more appealing option is to scale out—run the database on a cluster of
servers.
• Aggregate orientation fits well with scaling out because the aggregate is a
natural unit to use for distribution.
• Depending on your distribution model, you can get a data store that will give you the ability to handle larger quantities of data, the ability to process greater read or write traffic, or more availability in the face of network slowdowns or breakages.
• Broadly, there are two paths to data distribution: replication and sharding.
• Replication takes the same data and copies it over multiple nodes.
• Replication and sharding are orthogonal techniques: You can use either or both of them.
• Single-server distribution (no distribution at all): run the database on a single machine that handles all the reads and writes to the data store. This option is often preferred because it eliminates all the complexities that the other options introduce.
• Although a lot of NoSQL databases are designed around the idea of running
on a cluster, it can make sense to use NoSQL with a single-server
distribution model if the data model of the NoSQL store is more suited to
the application
• MongoDB uses sharding to support deployments with very large data sets and
high throughput operations.
Features of Sharding:
• When a query is made, only one or a few machines may get involved in
processing the query.
• Sharding enables effective scaling and management of large datasets. There are
many ways to split a dataset into shards.
Key Based Sharding
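Below is a minimal sketch of key-based (hash) sharding, assuming a fixed number of shards and a stable hash of the record key; it is illustrative Python only, and production systems typically use consistent hashing so that adding shards moves less data.

```python
import hashlib

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}   # each shard holds a subset of the data

def shard_for(key: str) -> int:
    """Map a record key to a shard using a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value       # only one shard is written

def get(key):
    return shards[shard_for(key)].get(key)    # only one shard is read

put("user:1001", {"name": "Anna"})
print(shard_for("user:1001"), get("user:1001"))
```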
Sharding
• Often, a busy data store is busy because different people are accessing different parts of
the dataset.
• In these circumstances we can support horizontal scalability by putting different parts of
the data onto different servers—a technique that’s called sharding
In the ideal case, we have different users all talking to different server nodes.
• Each user only has to talk to one server, so gets rapid responses from that
server.
• In order to get close to it, we have to ensure that data that's accessed together is clumped (grouped) together on the same node, and that these clumps are arranged on the nodes to provide the best data access.
• The first part of this question is how to clump the data up so that one user mostly
gets her/his data from a single server.
• If you know that most accesses of certain aggregates are based on a physical
location, you can place the data close to where it’s being accessed.
• You should also try to arrange aggregates so they are evenly distributed across the nodes, so that each node receives an equal share of the load.
• In some cases, it’s useful to put aggregates together if you think they may be read
in sequence.
• The Bigtable paper [Chang et al.] described keeping its rows in lexicographic order, so that related web pages sort near one another. This way data for multiple pages could be accessed together to improve processing efficiency.
Historically most people have done sharding as part of application logic.
• You might put all customers with surnames starting from A to D on one shard
and E to G on another.
• This complicates the programming model, as application code needs to ensure
that queries are distributed across the various shards.
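A hedged sketch of that application-level approach: the routing logic (here, the surname ranges A–D and E–G from the example, plus a hypothetical remaining range) lives in the application code, which is exactly what makes the programming model more complicated. Illustrative Python only.

```python
# Shard names, keyed by the surname range each shard owns.
SHARD_RANGES = {
    ("A", "D"): "shard_1",
    ("E", "G"): "shard_2",
    ("H", "Z"): "shard_3",   # hypothetical remaining range
}

def shard_for_surname(surname: str) -> str:
    """Application-level routing: pick the shard that owns this surname."""
    first = surname[0].upper()
    for (lo, hi), shard in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard owns surnames starting with {first!r}")

# Application code must route every query itself.
print(shard_for_surname("Davis"))    # shard_1
print(shard_for_surname("Garcia"))   # shard_2
```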
• Sharding is particularly valuable for performance because it can improve both read and
write performance.
• Despite the fact that sharding is made much easier with aggregates, it’s still
not a step to be taken lightly.
Master-Slave Replication:
Master-slave replication makes one node the authoritative copy that handles writes
while slaves synchronize with the master and may handle reads.
Peer-to-Peer Replication:
Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize
their copies of the data.
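A minimal sketch of how a client might route traffic under master-slave replication: all writes go to the master, while reads are spread across the slaves. Illustrative Python only; real database drivers also handle failover, replication lag, and consistency concerns that are omitted here.

```python
import itertools

class MasterSlaveRouter:
    """Toy request router: writes to the master, reads round-robin over slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._read_cycle = itertools.cycle(slaves or [master])

    def route_write(self):
        return self.master                 # the master is the authoritative copy

    def route_read(self):
        return next(self._read_cycle)      # slaves serve reads after syncing

router = MasterSlaveRouter("db-master:5432", ["db-replica-1:5432", "db-replica-2:5432"])
print(router.route_write())
print(router.route_read(), router.route_read())
```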
Key Points
• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server manages a subset of the data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.
• A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes, while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.