Unit-III CC&BD Cs62 Ab
NOTE: I declare that the PPT content is picked up from the prescribed course text
books or reference material prescribed in the syllabus book and Online Portals.
Unit III
Introduction to Big Data:
• What is big data and why is it Important?
• Industry Examples of Big Data: Big Data and the New School of Marketing; Advertising and Big Data.
• Types of Digital data, Big Data - Characteristics, Evolution of Big Data,
Challenges;
Storing Data in Databases and Data Warehouses:
• RDBMS and Big Data,
• Issues with Relational and Non-Relational Data Model,
• Integrating Big data with Traditional Data Warehouses,
• Big Data Analysis and Data Warehouse,
• Changing Deployment Models in Big Data Era.
NoSQL Data Management:
• Introduction to NoSQL Data Management,
• Types of NoSQL Data Models,
• Distribution Models,
• CAP Theorem,
• Sharding
Introduction to Big Data
• The "Internet of Things" and its widely ultra-connected nature are leading to a
burgeoning(increase rapidly) rise in big data. There is no dearth(scarcity) of data for
today's enterprise.
• Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or
traditional data processing applications.
What is Big Data
• Big data is data that exceeds the processing and storing capacity of
conventional database systems.
• The data is too big, moves too fast, or does not fit the structures of traditional database architectures/systems.
• Big data is a collection of data sets that are complex in nature, fast-growing, and varied, including both structured and unstructured data.
What is Big Data Analytics
• Big data analytics is the use of advanced analytic techniques against very
large, diverse data sets that include structured, semi-structured and
unstructured data, from different sources, and in different sizes from
terabytes to zettabytes
Big Data Analytics: Facts
Walmart
• handles 1 million customer transactions per hour,
• amounting to about 2.5 petabytes of data.
Facebook
• handles 40 billion photos from its user base!
• inserts 500 terabytes of new data every day
• stores, accesses, and analyzes 30 Petabytes of user generated data
More than 5 billion people are calling, texting, tweeting and browsing on mobile
phones worldwide
“apart from the changes in the actual hardware and software technology, there has also
been a massive change in the actual evolution of data systems. I compare it to the stages
of learning: dependent, independent, and interdependent.”
• Dependent (early days): Data systems were fairly new and users didn't quite know what they wanted. IT assumed that "build it and they shall come."
• Independent: Users understood what an analytical platform was and worked together with IT to define the business needs and approach for deriving insights for their firm.
During the customer relationship management (CRM) era of the 1990s, many
companies made substantial investments in customer-facing technologies that
subsequently failed to deliver expected value.
• The reason for most of those failures was fairly straightforward: Management either forgot (or just didn't know) that big projects require a synchronized transformation of people, process, and technology. All three must be marching in step or the project is doomed.
Why Big Data?
1. Understanding and Targeting Customers
• Here, big data is used to better understand customers and their behaviors and
preferences.
• Using big data, Telecom companies can now better predict customer churn;
• Wal-Mart can predict what products will sell, and
• car insurance companies understand how well their customers actually drive.
• Even government election campaigns can be optimized using big data analytics.
• Big data is not just for companies and governments but also for all of us
individually.
• We can now benefit from the data generated from wearable devices such as
smart watches or smart bracelets: collects data on our calorie consumption,
activity levels, and our sleep patterns.
• Most online dating sites apply big data tools and algorithms to find us the most
appropriate matches.
• The computing power of big data analytics enables us to decode entire DNA
strings in minutes and will allow us to understand and predict disease patterns.
• Big data techniques are already being used to monitor babies in a specialist
premature and sick baby unit.
• By recording and analyzing every heartbeat and breathing pattern of every
baby, the unit was able to develop algorithms that can now predict infections 24
hours before any physical symptoms appear.
5. Improving Sports Performance
• Most elite sports have now embraced big data analytics. We have the IBM
SlamTracker tool for tennis tournaments;
• we use video analytics that track the performance of every player in a football or baseball game, and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and cloud servers) on our game and how to improve it.
• The CERN data center has 65,000 processors to analyze its 30 petabytes of data; it also relies on thousands of computers distributed across 150 data centers worldwide to analyze the data.
• For example, big data tools are used to operate Google's self-driving car.
• The Toyota is fitted with cameras and GPS, as well as powerful computers and sensors, so it can drive safely on the road without human intervention.
• The National Security Agency (NSA) in the U.S. uses big data analytics to
prevent terrorist plots .
• Others use big data techniques to detect and prevent cyber attacks.
9. Improving and Optimizing Cities and Countries
• Big data is used to improve many aspects of our cities and countries.
• For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data.
• In such a system a bus would wait for a delayed train, and traffic signals would predict traffic volumes and operate to minimize jams.
• High-Frequency Trading (HFT) is an area where big data finds a lot of use
today. Here, big data algorithms are used to make trading decisions.
• Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.
A Wider Variety of Data
The variety of data sources continues to increase. Traditionally, internally focused
operational systems, such as ERP (enterprise resource planning) and CRM
applications, were the major source of data used in analytic processing.
• Unstructured data is basically information that either does not have a predefined data model and/or does not fit well into a relational database. Unstructured information is typically text heavy, but may contain data such as dates, numbers, and facts as well.
• The amount of data (all data, everywhere) is doubling every two years.
• Our world is becoming more transparent. We, in turn, are beginning to
accept this as we become more comfortable with parting with data that
we used to consider sacred and private.
• Most new data is unstructured. Specifically, unstructured data
represents almost 95 percent of new data, while structured data represents
only 5 percent.
• Unstructured data tends to grow exponentially, unlike structured data,
which tends to grow in a more linear fashion.
Big Data Analytics: Is Big Data analytics worth the effort? Yes
3. Frictionless actions: increased reliability and accuracy allow deeper and broader insights to be automated into systematic actions.
Industry Examples of Big Data
Digital Marketing
• Google's digital marketing evangelist and author Avinash Kaushik spent the first 10 years of his professional career in the world of business intelligence, during which he actually built large multiterabyte data warehouses and the intelligence platforms.
• Avinash Kaushik designed a framework in his book Web Analytics 2.0: The Art of Online Accountability and Science of Customer Centricity, in which he states that if you want to make good decisions on the Web, you have to learn how to use different kinds of tools to bring multiple types of data together and make decisions at the speed of light!
• Many of today's marketers are discussing and assessing their approaches to engage consumers in different ways, such as social media marketing.
• you have to have the primary outpost from where you can collect your own “big
data” and have a really solid relationship with the consumers you have and their data
so you can make smarter decisions.
Database Marketers, Pioneers of Big Data
• It began back in the 1960s, when people started building mainframe systems that
contained information on customers and information about the products and services
those customers were buying
• By the 1980s, marketers developed the ability to run reports on the information in
their databases. The reports gave them better and deeper insights into the buying
habits and preferences of customers
• In the 1990s, email entered the picture, and marketers quickly saw opportunities for
reaching customers via the Internet and the World Wide Web.
• Today, many companies have the capability to store and analyze data generated
from every search you run on their websites, every article you read, and every
product you look at.
Big Data and the New School of Marketing
"Today's consumers have changed. They've put down the newspaper, they fast forward through TV commercials, and they junk unsolicited email. Why? They have new options that better fit their digital lifestyle. They can choose which marketing messages they receive, when, where, and from whom.
• New School marketers deliver what today's consumers want: relevant interactive communication across the digital power channels: email, mobile, social, display, and the web."
(2) They can automate and optimize their programs and processes throughout the
customer lifecycle. Once marketers have that, they need a practical framework
for planning marketing activities.
• Let's take a look at the various loops that guide marketing strategies and tactics in the Cross-Channel Lifecycle Marketing approach: conversion, repurchase, stickiness, win-back, and re-permission (see Figure 2.1).
Web Analytics
• Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
• The following are some of the web analytics metrics: Hit, Page View, Visit/Session, First Visit / First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page Time Viewed / Page Visibility Time / Page View Duration, Session Duration / Visit Duration, Average Page View Duration, and Click Path, etc.
• What is unique about the Web is that the primary way in which data gets collected, processed, stored, and accessed is actually through a third party.
• Big Data on the Web will completely transform a company’s ability to understand
the effectiveness of its marketing and hold its people accountable for the
millions of dollars that they spend. It will also transform a company’s ability to
understand how its competitors are behaving.
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how that
varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and
lifetime value
• It tells you how customers and prospective customers engage with your
different marketing campaigns and how that drives subsequent behaviour
Web analytics tools are good at delivering the standard reports that are common across
different business types
• Where does your traffic come from? For example:
• Sessions by marketing campaign / referrer
• Sessions by landing page
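As an illustration (not taken from the text), here is a minimal Python sketch of how such a standard report might be computed from raw session records; the record layout and field names (referrer, landing_page) are assumptions.

```python
from collections import Counter

# Hypothetical session records; the field names are illustrative assumptions.
sessions = [
    {"referrer": "google", "landing_page": "/home"},
    {"referrer": "newsletter", "landing_page": "/offer"},
    {"referrer": "google", "landing_page": "/pricing"},
]

# Sessions by marketing campaign / referrer
by_referrer = Counter(s["referrer"] for s in sessions)

# Sessions by landing page
by_landing_page = Counter(s["landing_page"] for s in sessions)

print(by_referrer)       # Counter({'google': 2, 'newsletter': 1})
print(by_landing_page)
```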
• As a result of the growing popularity and use of social media around the world and
across nearly every demographic, the amount of user-generated content—or “big
data”—created is immense, and continues growing exponentially.
• Millions of status updates, blog posts, photographs, and videos are shared every
second. Very intelligent software is required to parse all that social data to define
things like the sentiment of a post.
• In terms of geography, Singer explained that they are combining social check-in data from Facebook, Foursquare, and similar social sites and applications over maps to show brands, at the country, state/region, and down to the street level, where conversations are happening about their brand, products, or competitors.
• Customer intent is the big data challenge we're focused on solving. By applying intelligent algorithms and complex logic with very deep, real-time text analysis, we're able to group customers into buckets such as awareness, opinion, consideration, preference, and purchase.
• That ability lets marketers create unique messages and offers for people along each phase of the purchase process, and lets sales more quickly identify qualified sales prospects.
• Marketers now have the opportunity to mine social conversations for purchase
intent and brand lift through Big Data.
• Condition: The condition of data deals with the state of data, that is,
• "Can one use this data as is for analysis?" or
• "Does it require cleansing for further enhancement and enrichment?"
Storing Data in Databases and Data Warehouses – RDBMS and Big Data
RDBMS:
• Structured Schemas: Uses predefined tables and relationships.
• Schema on Write: Schema is defined before data is written.
• Transactional Systems: Ideal for applications like financial systems and inventory management.
• ACID Compliance: Ensures reliable transaction processing.
Big Data:
• Flexible Data Handling: Accommodates various data formats and structures.
• Schema on Read: Schema is applied when data is read (contrasted with schema on write in the sketch after this list).
• Batch and Real-Time Processing: Supports large-scale data processing.
• Scalability: Designed to scale out horizontally.
• Suitable Applications: Analytics, sentiment analysis, fraud detection, IoT data processing.
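To make the schema-on-write vs. schema-on-read distinction concrete, here is a minimal Python sketch; it is illustrative only, using sqlite3 to stand in for the relational side and raw JSON lines for the big data side.

```python
import json
import sqlite3

# Schema on write: the table structure must exist before any row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.5))

# Schema on read: raw records are stored as-is (e.g., JSON lines);
# structure is imposed only when the data is read and analyzed.
raw_lines = [
    '{"id": 1, "amount": 99.5, "coupon": "SPRING"}',
    '{"id": 2, "items": ["book", "pen"]}',          # different shape, still accepted
]
parsed = [json.loads(line) for line in raw_lines]
total = sum(rec.get("amount", 0) for rec in parsed)  # schema applied at read time
print(total)
```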
Storing Data in Databases and Data
Warehouses – RDBMS and Big Data
Storing Data in Databases and Data Warehouses –
Issues with Relational and Non-Relational model
Issues with Relational model
Traditional relational database models separate blog posts and comments into
different tables.
Each post has a unique ID in the Posts table, and comments related to a post
reference this ID in the Comments table.
When a visitor accesses a blog post, the software fetches the post content and
comments separately from their respective tables.
This separation can lead to inefficiencies, as retrieving comments requires
knowledge of the associated post.
NoSQL databases offer an alternative approach, allowing for more flexible data
structures that can better accommodate relationships between posts and
comments.
By storing posts and their comments together or in a more interconnected
manner, NoSQL databases can simplify querying and improve performance for
applications with complex relationships like blogs.
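A minimal sketch of the two approaches described above, in illustrative Python: sqlite3 stands in for the relational model, and a plain nested dictionary stands in for a document-style store.

```python
import sqlite3

# Relational model: posts and comments live in separate tables,
# and comments reference their post through a foreign key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)")
db.execute("INSERT INTO posts VALUES (1, 'Hello Big Data')")
db.execute("INSERT INTO comments VALUES (1, 1, 'Nice post!')")

# Rendering one blog page requires two separate fetches (or a join).
post = db.execute("SELECT * FROM posts WHERE id = 1").fetchone()
comments = db.execute("SELECT body FROM comments WHERE post_id = 1").fetchall()
print(post, comments)

# Document model: the post and its comments are stored together
# as one self-contained document, so a single read returns everything.
post_doc = {
    "_id": 1,
    "title": "Hello Big Data",
    "comments": [{"body": "Nice post!"}],
}
print(post_doc["comments"])
```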
Storing Data in Databases and Data Warehouses –
Issues with Relational and Non-Relational model
Issues with Non-Relational model
Non-relational databases, like NoSQL, diverge from the
traditional RDBMS table/key model, offering alternative solutions for
Big Data management.
They're favored by tech giants like Google, Amazon, Yahoo!, and Facebook for their
scalability and ability to handle unpredictable traffic spikes.
Non-relational databases provide scalability without traditional
table structures,
utilizing specialized frameworks for storage and querying.
Common characteristics include scalability across clusters, seamless expansion for
increasing data flows, and a query model based on key-value pairs or documents.
Efficient design principles, such as dynamic memory
utilization, ensure high performance in managing large data volumes.
Eventual consistency, a feature of non-relational databases, ensures availability and
network partition tolerance.
Despite simplicity, the non-relational model poses challenges in data organization and
retrieval, especially for tasks like tagging posts with categories, necessitating careful
software-level considerations.
Polyglot Persistence
Key methods:
• Data Availability: Big Data systems require immediate access to data. NoSQL
databases like Hadron help ensure availability, though challenges arise in handling
context-sensitive data and avoiding duplicate data impact.
• Pattern Study: Analyzing patterns in data allows for efficient retrieval and analysis.
Trending topics and pattern-based study models aid in knowledge gathering,
identifying relevant patterns in massive data streams.
Integrating Big Data with Traditional Data
Warehouses
• Data Incorporation and Integration: Integrating data poses challenges due to
continuous processing. Dedicated machines can alleviate resource conflicts,
simplifying configuration and setup processes.
• Data Volumes and Exploration: Managing large datasets is crucial. Retention
requirements vary, necessitating exploration and mining for procurement and
optimization. Neglecting these areas can lead to performance drains.
• Compliance and Legal Requirements: Adhering to compliance standards is
essential for data security. Data infrastructure can comply with standards while
implementing additional security measures to minimize risks and performance
impacts.
• Storage Performance: Optimizing storage performance is vital. Considerations
include disk performance, SSD utilization, and data exchange across layers.
Addressing these challenges ensures efficient storage in Big Data environments.
Big Data Analysis and Data Warehouse
Big Data Solutions Overview: Big Data solutions facilitate storing large,
heterogeneous data in low-cost devices in raw or unstructured formats, aiding
trend analysis and future predictions across various sectors.
Data Warehousing Definition: Data warehousing involves methods and software
for collecting, integrating, and synchronizing data from multiple sources into a
centralized database, supporting analytical visualization and key performance
tracking.
Case Study: Argon Technology: Argon Technology implements a data warehouse
for a client analyzing data from 100,000 employees worldwide, streamlining
performance assessment processes.
Complexity of Data Warehouse Environment: Recent years have seen increased
complexity with the introduction of various data warehouse technologies and tools
for analytics and real-time tasks.
Comparison: Big Data Solution vs. Data Warehousing: Big Data solutions
handle vast data quantities, while data warehousing organizes integrated data for
informed decision-making.
Differentiation and Use Cases: Big Data analysis focuses on raw data, while data
warehousing filters data for strategic and management purposes.
Future Prospects: Enterprises continue relying on data warehousing for reporting
and visualization, alongside Big Data analytics for insights, ensuring
comprehensive database support.
Changing Deployment Models in Big Data Era
• Deployment Shift: Transition from traditional data centers to distributed database
nodes within the same data center has optimized data warehouses, focusing on
scalability and cost-effectiveness.
• Challenges: Big Data architecture and cloud computing face challenges related to
data magnitude and location, processing requirements, and technical supportability
of cloud-based service models.
NoSQL Data Management - Introduction
NoSQL databases are non-relational and designed for distributed data stores with
large volumes of data, utilized by companies like Google and Facebook.
These databases do not require fixed schemas, avoid join operations, and scale data
horizontally to accommodate growing data volumes.
Tables in NoSQL databases are stored as ASCII files, with tuples represented by
fields separated with tabs, manipulated through shell scripts or UNIX pipelines.
NoSQL databases are still evolving, with varying opinions among software
developers regarding their usefulness, flaws, and long-term viability.
The chapter covers various aspects of NoSQL, starting with an introduction to its
aggregate data models, including key-value, column-oriented, document, and graph
models.
It further explains the concept of relationships in NoSQL and schema-less databases,
along with materialized views and distribution models.
The concept of sharding, or horizontal partitioning of data, is also discussed towards
the end of the chapter.
NoSQL Data Management - Introduction
Need for NoSQL: NoSQL databases meet the demand for scalability and
continuous availability, offering an alternative to traditional relational databases.
They address technical, functional, and financial challenges, particularly in
environments requiring large-scale data processing and management.
History of NoSQL: The concept of NoSQL emerged in the late 1990s, evolving
from relational databases to address distributed, non-relational, and schema-less
designs. It gained momentum in the early 2000s, driven by the need for open-
source distributed databases and led to the development of popular platforms like
MongoDB, Apache Cassandra, and Redis.
NoSQL Data Management – Types
• Key-value databases offer basic operations like retrieval, storage, and deletion.
• Values, typically Binary Large Objects (BLOBs), store data without internal interpretation.
• Efficient scaling is achieved through primary-key access.
• Popular options include Riak, Redis, Memcached, Berkeley DB, HamsterDB, Amazon DynamoDB, Project Voldemort, and Couchbase.
• Database selection depends on specific requirements, such as persistence and data durability: Riak ensures data persistence, while Memcached lacks persistence.
• Choose the database type based on individual use cases and requirements.
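A minimal in-memory sketch of the key-value model described above: values are treated as opaque blobs and are reachable only through their primary key. This is illustrative Python, not the API of any specific product.

```python
class KeyValueStore:
    """Toy key-value store: opaque values addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value_blob):
        self._data[key] = value_blob          # store; the value is never interpreted

    def get(self, key):
        return self._data.get(key)            # retrieve by primary key only

    def delete(self, key):
        self._data.pop(key, None)             # remove

store = KeyValueStore()
store.put("user:42", b'{"name": "Anna"}')     # the value is just bytes to the store
print(store.get("user:42"))
store.delete("user:42")
```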
NoSQL Data Management – Types
• Column-oriented databases store data by columns rather than rows.
• Each column's values are stored contiguously, allowing for efficient data retrieval.
• Examples of column-oriented databases include Cassandra, BigTable, SimpleDB, and HBase.
• These databases excel in performance for counting and aggregation queries, and are particularly efficient for operations like COUNT and MAX.
• Column-oriented databases are ideal for scenarios where data aggregation and analytics are frequent tasks.
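A minimal sketch of why column orientation helps aggregation: each column's values sit together, so COUNT or MAX touches only one column rather than whole rows. Illustrative Python only.

```python
# Row-oriented layout: every query touches whole rows.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 340},
    {"id": 3, "city": "Pune", "amount": 90},
]

# Column-oriented layout: each column's values are stored contiguously.
columns = {
    "id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
    "amount": [120, 340, 90],
}

# Aggregations read only the column they need.
print(len(columns["amount"]))   # COUNT
print(max(columns["amount"]))   # MAX
```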
NoSQL Data Management – Types
• Document databases store data in self-describing hierarchical structures like XML, JSON, and BSON.
• They provide indexing and searching similar to relational databases but with a different structure.
• While offering performance and scalability benefits, they lack the ACID properties of relational models.
• Choosing a document-oriented database trades database-level data integrity for increased performance.
• Document databases and relational databases serve different purposes and are not direct replacements for each other; organizations often use a combination of relational and document-oriented databases to meet different needs.
• Examples of popular document databases include MongoDB, Couchbase, and OrientDB.
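A minimal sketch of the document model: self-describing, JSON-like documents with different shapes in one collection, queried by field. This is illustrative Python, not the API of MongoDB or any other product.

```python
# A "collection" of self-describing documents; shapes may differ per document.
collection = [
    {"_id": 1, "type": "article", "title": "NoSQL basics", "tags": ["nosql", "intro"]},
    {"_id": 2, "type": "article", "title": "Sharding", "author": {"name": "Anna"}},
]

def find(coll, criteria):
    """Return documents whose fields match all key/value pairs in criteria."""
    return [doc for doc in coll if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, {"type": "article", "title": "Sharding"}))
```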
NoSQL Data Management – Types
• Graph databases utilize semantic queries and graph structures to represent and store data.
• Nodes represent entities or instances, while edges denote relationships between them.
• Unlike relational databases, graph databases allow for dynamic schema changes without extensive modifications.
• Relationships play a crucial role in graph databases, enabling the derivation of meaningful insights.
• Modelling relationships in graph databases requires careful consideration and design expertise.
• Popular graph databases include Neo4J, Infinite Graph, and FlockDB.
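A minimal sketch of the graph model: nodes for entities, edges for named relationships, and new relationship types added at any time without a schema change. Illustrative Python only, not the API of Neo4J or any other product.

```python
# Nodes keyed by name; edges stored as (relationship, target) pairs.
nodes = {
    "Anna": {"kind": "person"},
    "Barbara": {"kind": "person"},
    "NoSQL Distilled": {"kind": "book"},
}
edges = {
    "Anna": [("FRIEND_OF", "Barbara"), ("LIKES", "NoSQL Distilled")],
    "Barbara": [("LIKES", "NoSQL Distilled")],
}

# Adding a brand-new relationship type needs no schema migration.
edges["Barbara"].append(("COLLEAGUE_OF", "Anna"))

def neighbours(node, relationship):
    """Follow edges of one relationship type from a node."""
    return [target for rel, target in edges.get(node, []) if rel == relationship]

print(neighbours("Anna", "LIKES"))
```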
NoSQL Data Management – Distribution
Models
• Aggregate-oriented databases facilitate easy data distribution, since the distribution mechanism moves whole aggregates rather than having to track down related data.
• Data distribution is typically accomplished through two methods: sharding and
replication.
• Sharding involves distributing various data types across multiple servers, with
each server managing a subset of the data.
• Replication enhances fault tolerance by duplicating data across multiple servers,
ensuring each piece of data exists in multiple locations.
• Replication can occur through master-slave replication, where one node handles writes while others handle reads, or through peer-to-peer replication, where writes can go to any node and the nodes coordinate to synchronize their copies.
• While master-slave replication reduces update conflicts, peer-to-peer replication
avoids single points of failure, and some databases utilize a combination of both
techniques.
NoSQL Data Management – CAP Theorem
• The CAP theorem outlines three critical aspects in distributed databases: Consistency,
Availability, and Partition Tolerance.
• According to the CAP theorem, it's impossible for a distributed system to simultaneously achieve
all three aspects.
• Consistency ensures that all clients see the same data after an operation, maintaining data
integrity.
• Availability indicates that the system is continuously accessible without downtime.
• Partition Tolerance ensures that the system functions reliably despite communication failures
between servers.
• NoSQL databases typically operate under one of three combinations: CA (Consistency and
Availability), CP (Consistency and Partition Tolerance), or AP (Availability and Partition
Tolerance).
• Transactions in relational databases adhere to ACID properties (Atomicity, Consistency,
Isolation, Durability), ensuring data integrity and reliability.
• In contrast, NoSQL databases often prioritize BASE principles (Basically Available, Soft state, Eventual consistency), offering flexibility but requiring developers to implement transactional logic manually.
• The absence of built-in transaction support in many NoSQL databases necessitates custom
implementation strategies by developers.
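As one common pattern for the manual transactional logic mentioned above, here is a hedged sketch of an optimistic, version-checked update. It is illustrative Python over a plain dictionary; real NoSQL products expose their own conditional-write or compare-and-set primitives, which are not shown here.

```python
class VersionConflict(Exception):
    pass

# Each record carries a version number; writers must prove they saw the latest one.
store = {"cart:7": {"version": 3, "items": ["pen"]}}

def update_with_version_check(key, expected_version, new_value):
    current = store[key]
    if current["version"] != expected_version:
        # Someone else wrote in between; the application must retry or reconcile.
        raise VersionConflict(f"expected v{expected_version}, found v{current['version']}")
    store[key] = {"version": expected_version + 1, **new_value}

update_with_version_check("cart:7", 3, {"items": ["pen", "book"]})
print(store["cart:7"])   # {'version': 4, 'items': ['pen', 'book']}
```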
NoSQL Data Management – Sharding
• Definition: Database sharding partitions large databases across servers to boost
performance and scalability by distributing data into smaller segments, called shards.
• Origin and Popularity: Coined by Google engineers, the term gained traction through publications such as the Bigtable architecture paper. Major internet companies like Amazon and Facebook have adopted sharding due to the surge in transactional volume and database sizes.
• Purpose and Approach: Sharding aims to enhance the throughput and overall performance of
high-transaction business applications by scaling databases. It provides a scalable solution to
handle increasing data volumes and transaction loads.
• Factors Driving Adoption: With businesses striving to maintain optimal performance amidst
growing demands, sharding becomes essential. Despite improvements in disk I/O and database
management systems, the need for enhanced performance and scalability fuels the adoption of
database sharding.
What is a NoSQL database?
• NoSQL, which stands for “not only SQL,” is an approach to database design that
provides flexible schemas for the storage and retrieval of data beyond the traditional
table structures found in relational databases.
• A NoSQL database is a non-relational data management system that does not require a fixed schema.
• It avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for
distributed data stores with humongous data storage needs.
• NoSQL databases provide flexible schemas and scale easily with large amounts of
data and high user loads.
• NoSQL data models allow related data to be nested within a single data structure.
Why NoSQL?
Data-driven
Sharding of data
• Key-value databases are a simpler type of database where each item contains keys and values. Redis and DynamoDB are popular key-value databases.
• Wide-column stores store data in tables, rows, and dynamic columns. Wide-column
stores provide a lot of flexibility over relational databases because each row is not
required to have the same columns. Cassandra and HBase are two of the most
popular wide-column stores.
• Graph databases store data in nodes and edges. Nodes typically store information
about people, places, and things while edges store information about the relationships
between the nodes. Neo4j and JanusGraph are examples of graph databases.
Impedance Mismatch
• Impedance mismatch is the term used to refer to the problems that occur due to differences between the database model and the programming language model.
• Data type mismatch means the programming language attribute data type may differ
from the attribute data type in the data model.
• Hence it is quite necessary to have a binding for each host programming language
that specifies for each attribute type the compatible programming language types.
• It is necessary to have different data types, for example, we have different data
types available in different programming languages such as data types in C are
different from Java and both differ from SQL data types.
• The results of most queries are sets or multisets of tuples and each tuple is formed
of a sequence of attribute values.
• In the program, it is necessary to access the individual data values within individual
tuples for printing or processing.
• Hence there is a need for binding to map the query result data structure which
is a table to an appropriate data structure in the programming language.
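A minimal sketch of the binding problem described above: the database hands back flat tuples, and application code must map them onto the language's own data structures. Illustrative Python with sqlite3 and a dataclass.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:           # in-memory structure used by the program
    id: int
    name: str

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Anna')")

# The query result is a set of flat tuples ...
rows = db.execute("SELECT id, name FROM customers").fetchall()

# ... which the binding layer must convert into the program's richer types.
customers = [Customer(id=r[0], name=r[1]) for r in rows]
print(customers[0])
```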
Impedance Mismatch
• The difference between the relational model and the in-memory data structures.
• The relational data model organizes data into a structure of tables and rows, or
more properly, relations and tuples.
• In the relational model, a tuple is a set of name-value pairs and a relation is a set
of tuples. (The relational definition of a tuple is slightly different from that in
mathematics and many programming languages with a tuple data type, where a
tuple is a sequence of values.)
Each NoSQL solution has a different model that it uses, which we put into four categories
widely used in the NoSQL ecosystem:
• Graph databases are motivated by a different frustration with relational databases and
thus have an opposite model—small records with complex interconnections
• In Figure 3.1 we have a web of information whose nodes are very small
(nothing more than a name) but there is a rich structure of interconnections
between them
• This is where the important differences between graph and relational databases come in.
• Although relational databases can implement relationships using foreign keys, the
joins required to navigate around can get quite expensive—which means
performance is often poor for highly connected data models
• Graph databases make traversal along the relationships very cheap. A large part
of this is because graph databases shift most of the work of navigating relationships
from query time to insert time. This naturally pays off for situations where
querying performance is more important than insert speed.
• Most of the time you find data by navigating through the network of edges, with
queries such as “tell me all the things that both Anna and Barbara like.”
• You do need a starting place, however, so usually some nodes can be indexed
by an attribute such as ID.
• So you might start with an ID lookup (i.e., look up the people named “Anna”
and “Barbara”) and then start using the edges.
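A minimal sketch of that traversal in Python: look up the two indexed starting nodes ("Anna" and "Barbara"), follow their LIKES edges, and intersect the results. Illustrative only; a real graph database would express this in its own query language.

```python
# Edges of a small graph: person -> set of things they like.
likes = {
    "Anna": {"NoSQL Distilled", "Graph Theory", "Coffee"},
    "Barbara": {"NoSQL Distilled", "Coffee", "Databases"},
}

# "Tell me all the things that both Anna and Barbara like":
# index lookup for the two starting nodes, then traverse and intersect their edges.
both_like = likes["Anna"] & likes["Barbara"]
print(both_like)   # {'NoSQL Distilled', 'Coffee'}
```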
Complex queries typically run faster in graph databases than they do in relational
databases.
The flexibility of a graph database enables the ability to add new nodes and
relationships between nodes, making it reliable for real-time data.
• Relational databases make adding new tables and columns possible while
the database is running.
Distribution Models
The primary driver of interest in NoSQL has been its ability to run databases on a large
cluster.
• As data volumes increase, it becomes more difficult and expensive to scale
up—buy a bigger server to run the database on.
• A more appealing option is to scale out—run the database on a cluster of
servers.
• Aggregate orientation fits well with scaling out because the aggregate is a
natural unit to use for distribution.
• Depending on your distribution model, you can get a data store that will give you the ability to handle larger quantities of data, the ability to process greater read or write traffic, or more availability in the face of network slowdowns or breakages.
• Broadly, there are two paths to data distribution: replication and sharding.
• Replication takes the same data and copies it over multiple nodes.
• Replication and sharding are orthogonal techniques: You can use either or both of them.
• Single-server distribution (no distribution at all): run the database on a single machine that handles all the reads and writes to the data store. This option is often preferred because it eliminates all the complexities that the other options introduce.
• Although a lot of NoSQL databases are designed around the idea of running
on a cluster, it can make sense to use NoSQL with a single-server
distribution model if the data model of the NoSQL store is more suited to
the application
• MongoDB uses sharding to support deployments with very large data sets and
high throughput operations.
Features of Sharding:
• When a query is made, only one or a few machines may get involved in
processing the query.
• Sharding enables effective scaling and management of large datasets. There are
many ways to split a dataset into shards.
Key Based Sharding
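Below is a minimal sketch of key-based (hash) sharding, assuming a fixed number of shards and a stable hash of the record key; it is illustrative Python only, and production systems typically use consistent hashing so that adding shards moves less data.

```python
import hashlib

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}   # each shard holds a subset of the data

def shard_for(key: str) -> int:
    """Map a record key to a shard using a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value       # only one shard is written

def get(key):
    return shards[shard_for(key)].get(key)    # only one shard is read

put("user:1001", {"name": "Anna"})
print(shard_for("user:1001"), get("user:1001"))
```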
Sharding
• Often, a busy data store is busy because different people are accessing different parts of
the dataset.
• In these circumstances we can support horizontal scalability by putting different parts of
the data onto different servers—a technique that’s called sharding
In the ideal case, we have different users all talking to different server nodes.
• Each user only has to talk to one server, so gets rapid responses from that
server.
• In order to get close to it, we have to ensure that data that's accessed together is clumped (grouped) together on the same node, and that these clumps are arranged on the nodes to provide the best data access.
• The first part of this question is how to clump the data up so that one user mostly
gets her/his data from a single server.
• If you know that most accesses of certain aggregates are based on a physical
location, you can place the data close to where it’s being accessed.
• You should also try to arrange aggregates so they are evenly distributed across the nodes, so that each node receives an equal share of the load.
• In some cases, it’s useful to put aggregates together if you think they may be read
in sequence.
• The Bigtable paper [Chang et al.] described keeping its rows in lexicographic order, so that related web pages sort near one another. This way data for multiple pages could be accessed together to improve processing efficiency.
Historically most people have done sharding as part of application logic.
• You might put all customers with surnames starting from A to D on one shard
and E to G on another.
• This complicates the programming model, as application code needs to ensure
that queries are distributed across the various shards.
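A hedged sketch of that application-level approach: the routing logic (here, the surname ranges A–D and E–G from the example, plus a hypothetical remaining range) lives in the application code, which is exactly what makes the programming model more complicated. Illustrative Python only.

```python
# Shard names, keyed by the surname range each shard owns.
SHARD_RANGES = {
    ("A", "D"): "shard_1",
    ("E", "G"): "shard_2",
    ("H", "Z"): "shard_3",   # hypothetical remaining range
}

def shard_for_surname(surname: str) -> str:
    """Application-level routing: pick the shard that owns this surname."""
    first = surname[0].upper()
    for (lo, hi), shard in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard owns surnames starting with {first!r}")

# Application code must route every query itself.
print(shard_for_surname("Davis"))    # shard_1
print(shard_for_surname("Garcia"))   # shard_2
```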
• Sharding is particularly valuable for performance because it can improve both read and
write performance.
• Despite the fact that sharding is made much easier with aggregates, it’s still
not a step to be taken lightly.
Master-Slave Replication:
Master-slave replication makes one node the authoritative copy that handles writes
while slaves synchronize with the master and may handle reads.
Peer-to-Peer Replication:
Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize
their copies of the data.
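A minimal sketch of how a client might route traffic under master-slave replication: all writes go to the master, while reads are spread across the slaves. Illustrative Python only; real database drivers also handle failover, replication lag, and consistency concerns that are omitted here.

```python
import itertools

class MasterSlaveRouter:
    """Toy request router: writes to the master, reads round-robin over slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._read_cycle = itertools.cycle(slaves or [master])

    def route_write(self):
        return self.master                 # the master is the authoritative copy

    def route_read(self):
        return next(self._read_cycle)      # slaves serve reads after syncing

router = MasterSlaveRouter("db-master:5432", ["db-replica-1:5432", "db-replica-2:5432"])
print(router.route_write())
print(router.route_read(), router.route_read())
```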
Key Points
• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server manages a subset of the data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.
• A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes, while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.