
Basic Principles

Big Data Management


Anis Ur Rahman
Faculty of Computer Science & Information Technology
University of Malaya

October 15, 2019

Borrowed from http://www.ksi.mff.cuni.cz/~svoboda/courses/171-NDBI040/


1 / 55
Lecture Outline

Different aspects of data distribution


1 Scaling
Vertical vs. horizontal
2 Distribution models
Sharding
Replication: master-slave vs. peer-to-peer architectures
3 CAP properties
Consistency, availability and partition tolerance
ACID vs. BASE guarantees
4 Consistency
Read and write quora

2 / 55
Scalability

Outline

1 Scalability

2 Distribution Models

3 CAP Theorem

4 Consistency

3 / 55
Scalability

Scalability

What is scalability?
the capability of a system to handle growing amounts of data
and/or queries without losing performance, or
its potential to be enlarged in order to accommodate such growth
Two general approaches
Vertical scaling
Horizontal scaling

4 / 55
Scalability

Vertical Scalability

Vertical scaling (scaling up/down)


Adding resources to a single node in a system
e.g. increasing the number of CPUs, extending system memory,
using larger disk arrays, ...
i.e. larger and more powerful machines are involved
Traditional choice
In favor of strong consistency
Easy to implement and deploy
No issues caused by data distribution
...
Works well in many cases but ...

5 / 55
Scalability

Vertical Scalability: Drawbacks

Performance limits
Even the most powerful machine has a limit
Everything works well ... until you start approaching the limit
Higher costs
The cost of expansion increases exponentially
In particular, it is higher than the sum of costs of equivalent
commodity hardware
Proactive provisioning
New projects/applications might evolve rapidly
Upfront budget is needed when deploying new machines
So flexibility is severely limited

6 / 55
Scalability

Vertical Scalability: Drawbacks

Vendor lock-in
There are only a few manufacturers of large machines
Customer is made dependent on a single vendor
Their products, services, but also implementation details,
proprietary formats, interfaces, support, ...
i.e. it is difficult or impossible to switch to another vendor
Deployment downtime
Inevitable downtime is often required when scaling up

7 / 55
Scalability

Horizontal Scalability

Horizontal scaling (scaling out/in)


Adding more nodes to a system
i.e. the system is distributed across multiple nodes in a cluster
Choice of many NoSQL systems
Advantages
Commodity hardware, cost effective
Flexible deployment and maintenance
Often surpasses vertical scaling
Often no single point of failure
...

8 / 55
Scalability

Horizontal Scalability: Consequences

Significantly increases complexity


Complexity of management, programming model, ...
Introduces new issues and problems
Data distribution
Synchronization of nodes
Data consistency
Recovery from failures
...
And there are also plenty of false assumptions ...

9 / 55
Scalability

Horizontal Scalability: Fallacies

False assumptions
Network is reliable
Latency is zero
Bandwidth is infinite
Network is secure
Topology does not change
There is one administrator
Network is homogeneous
Transport cost is zero
Source: https://www.red-gate.com/simple-talk/blogs/the-eight-fallacies-of-distributed-computing/

10 / 55
Scalability

Horizontal Scalability: Conclusion

A standalone node still might be a better option in certain cases


e.g. for graph databases
Simply because it is difficult to split and distribute graphs
In other words
It can make sense to run even a NoSQL database system on a
single node
No distribution at all is the most preferred/simple scenario
But in general, horizontal scaling really opens new possibilities

11 / 55
Scalability

Horizontal Scalability: Architecture

What is a cluster?
A collection of mutually interconnected commodity nodes
Based on the shared-nothing architecture
Nodes do not share their CPUs, memory, hard drives, ...
Each node runs its own operating system instance
Nodes send messages to interact with each other
Nodes of a cluster can be heterogeneous
Data, queries, calculations, requests, workload, ... all distributed
among the nodes within a cluster

12 / 55
Distribution Models

Outline

1 Scalability

2 Distribution Models

3 CAP Theorem

4 Consistency

13 / 55
Distribution Models

Distribution Models

Generic techniques of data distribution


1 Sharding
Idea. different data on different nodes
Motivation. increasing volume of data, increasing performance
2 Replication
Idea. the same data on different nodes
Motivation. increasing performance, increasing fault tolerance
The two techniques are mutually orthogonal
i.e. we can use either of them, or combine them both
Distribution model
the specific way in which sharding and replication are implemented
NoSQL systems often offer automatic sharding and replication

14 / 55
Distribution Models

Sharding

Sharding (horizontal partitioning)


Placement of different data on different nodes
What does different data mean? Usually aggregates
e.g. key-value pairs, documents, ...
Related pieces of data that are accessed together should also
be kept together
Specifically, operations involving data on multiple shards should be
avoided (if possible)
The questions are...
how to design aggregate structures?
how to actually distribute these aggregates?

15 / 55
Distribution Models

Sharding

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.

16 / 55
Distribution Models

Sharding

Objectives
Achieve uniform data distribution
Achieve balanced workload (read and write requests)
Respect physical locations
e.g. different data centers for users around the world
...
Unfortunately, these objectives...
may contradict each other
may change in time
So, how to actually determine shards for aggregates?

17 / 55
Distribution Models

Sharding

Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
18 / 55
Distribution Models

Sharding

Sharding strategies
Based on mapping structures
Data is placed on shards in a random fashion
e.g. round-robin, ...
Knowledge of the mapping of individual aggregates to particular
shards must then be maintained
Thus it is usually maintained using a centralized index structure, with
all its disadvantages
Based on general rules
Each shard is responsible for storing certain data
Hash partitioning, range partitioning, ... (sketched below)
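
A minimal sketch of the two rule-based strategies in Python, with a hypothetical shard count and hypothetical range boundaries (not tied to any particular system):

    import hashlib

    SHARD_COUNT = 4  # hypothetical number of shards

    def hash_shard(key: str) -> int:
        # Key-based (hash) sharding: the shard follows from the key alone
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % SHARD_COUNT

    RANGES = [("a", "g"), ("g", "n"), ("n", "t"), ("t", "{")]  # hypothetical boundaries

    def range_shard(key: str) -> int:
        # Range-based sharding: each shard owns a contiguous key interval
        for shard, (low, high) in enumerate(RANGES):
            if low <= key < high:
                return shard
        raise ValueError("key outside all ranges")

    print(hash_shard("user:42"), range_shard("martin"))

Both functions are deterministic, so any node can route a read or write request for a given key without consulting a central mapping structure.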

19 / 55
Distribution Models

Sharding

Key-based sharding

Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
20 / 55
Distribution Models

Sharding

Range-based sharding

Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding

21 / 55
Distribution Models

Sharding

Directory-based sharding

Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
22 / 55
Distribution Models

Sharding

Should I Shard?
Amount of application data grows to exceed the storage capacity
of a single database node.
Volume of writes or reads to the database surpasses what a single
node can handle,
resulting in slowed response times or timeouts.

23 / 55
Distribution Models

Sharding

Why is sharding difficult?


Not only do we need to be able to determine particular shards during
write requests
i.e. when a new aggregate is about to be inserted
So that we can actually decide where it should be
physically stored
but also during read requests
i.e. when existing aggregate/s are about to be retrieved
So that we can actually find and return them efficiently (or detect
they are missing)
And all that based only on the search criteria provided (e.g. key, id,
...), unless all the nodes are to be accessed

24 / 55
Distribution Models

Sharding

Why is sharding even more difficult?


Structure of the cluster may be changing
Nodes can be added or removed
Nodes may have incomplete/obsolete cluster knowledge
Nodes involved, their responsibilities, sharding rules, ...
Individual nodes may be failing
Network may be partitioned
Messages may not be delivered even though sent

25 / 55
Distribution Models

Replication

Replication
Placement of multiple copies of the same data (replicas) on
different nodes
Replication factor = number of such copies
Two approaches
1 Master-slave architecture
2 Peer-to-peer architecture

26 / 55
Distribution Models

Replication

Master-Slave Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
27 / 55
Distribution Models

Replication

Master-Slave Architecture
One node is primary (master), all the other secondary (slave)
Master node bears all the management responsibility
All the nodes contain identical data
Read requests can be handled by either the master or the slaves
Suitable for read-intensive applications
More read requests to deal with → more slaves to deploy
When the master fails, read operations can still be handled

28 / 55
Distribution Models

Replication

Master-Slave Architecture
Write requests can only be handled by the master
Newly written replicas are propagated to all the slaves
Consistency issue
Luckily enough, at most one write request is handled at a time
But the propagation still takes some time during which obsolete
reads might happen
Hence certain synchronization is required to avoid conflicts
In case of master failure, a new one needs to be appointed
Manually (user-defined) or automatically (cluster-elected)
Since the nodes are identical, appointment can be fast
The master might therefore become a bottleneck (because of
performance limits or failures)
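
A minimal sketch of how a client library might route requests under this architecture, with hypothetical node names (not any particular system's API):

    import itertools

    class MasterSlaveRouter:
        """Send writes to the master, spread reads over master and slaves."""

        def __init__(self, master: str, slaves: list[str]):
            self.master = master
            self._read_cycle = itertools.cycle([master] + slaves)

        def node_for_write(self) -> str:
            return self.master              # only the master accepts writes

        def node_for_read(self) -> str:
            return next(self._read_cycle)   # any replica may serve reads

    router = MasterSlaveRouter("node-0", ["node-1", "node-2"])
    print(router.node_for_write(), router.node_for_read(), router.node_for_read())

Adding more read capacity only means appending more slave names; the single master remains the limit for writes.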

29 / 55
Distribution Models

Replication

Peer-to-Peer Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
30 / 55
Distribution Models

Replication

Peer-to-Peer Architecture
All the nodes have equal roles and responsibilities
All the nodes contain identical data once again
Both read and write requests can be handled by any node
No bottleneck, no single point of failure
Both the operations scale well
More requests to deal with → more nodes to deploy
Consistency issues
Unfortunately, multiple write requests can be initiated
independently and executed at the same time
Hence synchronization is required to avoid conflicts

31 / 55
Distribution Models

Sharding and Replication

Observations with respect to replication


Does the replication factor really need to correspond to the
number of nodes?
No, a replication factor of 3 will often be the right choice
Consequences
Nodes will no longer contain identical data
Replica placement strategy will be needed
Do all the replicas really need to be successfully written when
write requests are handled?
No, but consistency issues have to be tackled carefully
Sharding and replication can be combined... but how?

32 / 55
Distribution Models

Sharding and Replication

Sharding and Master-Slave Replication

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.

33 / 55
Distribution Models

Sharding and Replication

Sharding and Peer-to-Peer Replication

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.

34 / 55
Distribution Models

Sharding and Replication

Combinations of sharding and replication


1 Sharding + master-slave replication
Multiple masters, each for different data
Roles of the nodes can overlap
Each node can be master for some data and/or slave for other
2 Sharding + peer-to-peer replication
Basically placement of anything anywhere (although certain rules
can still be applied)
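
A minimal sketch of the first combination, assuming a hypothetical three-node cluster with three shards and replication factor 2, where every node is master for one shard and slave for another:

    topology = {
        "shard-0": {"master": "node-A", "slaves": ["node-B"]},
        "shard-1": {"master": "node-B", "slaves": ["node-C"]},
        "shard-2": {"master": "node-C", "slaves": ["node-A"]},
    }

    def node_for_write(shard: str) -> str:
        return topology[shard]["master"]            # writes go to the shard's master

    def nodes_for_read(shard: str) -> list[str]:
        entry = topology[shard]
        return [entry["master"]] + entry["slaves"]  # reads may use any replica

    print(node_for_write("shard-1"), nodes_for_read("shard-1"))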

35 / 55
Distribution Models

Sharding and Replication

Questions to figure out for any distribution model


Can all the nodes serve both read and write requests?
Which replica placement strategy is used?
How is the mapping of replicas maintained?
What level of consistency and availability is provided?
What extent of infrastructure knowledge do the nodes have?
...

36 / 55
CAP Theorem

Outline

1 Scalability

2 Distribution Models

3 CAP Theorem

4 Consistency

37 / 55
CAP Theorem

CAP Theorem

Assumptions
Distributed system with sharding and replication
Read and write operations on a single aggregate only
CAP properties
Properties of a distributed system:
Consistency, Availability, and Partition tolerance
CAP theorem
It is not possible to have a distributed system that would
guarantee all CAP properties at the same time.
Only 2 of these 3 properties can be enforced.
But what do these properties actually mean?

38 / 55
CAP Theorem

CAP Properties

Consistency
Read and write operations must be executed atomically
There must exist a total order on all operations
Each operation looks as if it was completed at a single instant
i.e. as if all operations were executed sequentially one by one on a
single standalone node
Practical consequence. after a write operation, all readers see
the same data
Since any node can be used to handle read requests, atomicity
of write operations means that changes must be propagated to all
the replicas
As we will see later on, other ways for such a strong consistency
exist as well

39 / 55
CAP Theorem

CAP Properties

Availability
If a node is working, it must respond to user requests
Every read or write request successfully received by a non-failing
node in the system must result in a response,
i.e. their execution must not be rejected
Partition tolerance
System continues to operate even when two or more sets of
nodes get isolated
The network is allowed to lose arbitrarily many messages sent
from one node to another
i.e. a connection failure must not shut the whole system down

40 / 55
CAP Theorem

CAP Theorem Consequences

If at most two properties can be guaranteed ...


1 CA = consistency + availability
Traditional ACID properties are easy to achieve
Examples. RDBMS, Google BigTable
Any single-node system, but even clusters (at least in theory)
However, should the network partition happen, all the nodes must be
forced to stop accepting user requests
2 CP = consistency + partition tolerance
Other examples. distributed locking
3 AP = availability + partition tolerance
New concept of BASE properties
Examples. Apache Cassandra, Apache CouchDB
Other examples. web caching, DNS

41 / 55
CAP Theorem

CAP Theorem Consequences

42 / 55
CAP Theorem

CAP Theorem Consequences

Partition tolerance is necessary in clusters


Why? Because it is difficult to detect network failures
Does it mean that only purely CP and AP systems are possible?
No...
The real meaning of the CAP theorem:
The real world does not need to be just black and white
Partition tolerance is a must, but we can trade off consistency
versus availability
Even slightly relaxed consistency can bring a lot of availability
Such trade-offs are not only possible, but often work very well in
practice

43 / 55
CAP Theorem

CAP Theorem Example

You are asked to design a distributed cluster of 4 data nodes.


The replication factor is 2, i.e. any data written to the cluster must be written on
2 nodes, so when one node goes down, the second can still serve the data. Now try
to apply the CAP theorem to this requirement.
In a distributed system, two things may happen at any time: node failure
(hard disk crash) or network failure (the connection between two nodes goes
down).
1 CP [Consistency/Partition Tolerance] Systems
2 AP [Availability/Partition Tolerance] Systems
3 CA [Consistency/Availability] Systems

44 / 55
CAP Theorem

ACID Properties

Traditional ACID properties


1 Atomicity
Partial execution of transactions is not allowed (all or nothing)
2 Consistency
Transactions bring the database from one consistent (valid) state
to another
3 Isolation
Transactions executed in parallel do not see uncommitted effects
of each other
4 Durability
Effects of committed transactions must remain durable

45 / 55
CAP Theorem

BASE Properties

New concept of BASE properties


1 Basically Available
The system works basically all the time
Partial failures can occur, but there are no total system failures
2 Soft State
The system is in an unstable (in flux), non-deterministic state
Changes occur all the time
3 Eventual Consistency
Sooner or later the system will be in some consistent state
BASE is just a vague term, no formal definition
Proposed to illustrate design philosophies at the opposite ends of
the consistency-availability spectrum

46 / 55
CAP Theorem

ACID and BASE

ACID
Choose consistency over availability
Pessimistic approach
Implemented by traditional relational databases
BASE
Choose availability over consistency
Optimistic approach
Common in NoSQL databases
Allows levels of scalability that cannot be acquired with ACID
Current trend in NoSQL. strong consistency → eventual consistency

47 / 55
CAP Theorem

CAP Theorem Conclusion

48 / 55
Consistency

Outline

1 Scalability

2 Distribution Models

3 CAP Theorem

4 Consistency

49 / 55
Consistency

Consistency

Consistency in general...
Consistency is the lack of contradiction in the database
However, it has many facets ... e.g.
we only assume atomic operations that manipulate just a single
aggregate,
but set operations could also be considered, etc.
Strong consistency is achievable even in clusters, but eventual
consistency might often be sufficient
1 An article on a news portal that is one minute out of date does not matter
2 Even when an already unavailable hotel room is booked once
again, the situation can still be resolved in the real world
3 ...

50 / 55
Consistency

Consistency

Write consistency (update consistency)


Problem. write-write conflict
Two or more write requests on the same aggregate are initiated
concurrently
Context. peer-to-peer architecture only
Issue. lost update
Solution.
1 Pessimistic strategies
Preventing conflicts from occurring
Write locks, ...
2 Optimistic strategies
Conflicts may occur, but are detected and resolved later on
Version stamps, vector clocks, ...
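
A minimal sketch of the optimistic approach using version stamps (compare-and-set); the store and its interface are hypothetical, not any particular database:

    class VersionedStore:
        """Each aggregate carries a version stamp; a write succeeds only if the
        writer has seen the latest version, otherwise the conflict is detected."""

        def __init__(self):
            self._data = {}  # key -> (version, value)

        def read(self, key):
            return self._data.get(key, (0, None))

        def write(self, key, expected_version, value):
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False                        # write-write conflict detected
            self._data[key] = (current_version + 1, value)
            return True

    store = VersionedStore()
    version, _ = store.read("room-101")
    print(store.write("room-101", version, "alice"))  # True: first write wins
    print(store.write("room-101", version, "bob"))    # False: conflict, must re-read and retry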

51 / 55
Consistency

Consistency

Read consistency (replication consistency)


Problem. read-write conflict
Write and read requests on the same aggregate are initiated
concurrently
Context. both master-slave and peer-to-peer architectures
Issue. inconsistent read
When not treated, an inconsistency window will exist
Propagation of changes to all the replicas takes some time
Until this process is finished, inconsistent reads may happen
Even the initiator of the write request may read stale data!
Session consistency / read-your-writes / sticky sessions (sketched below)
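
A minimal sketch of the read-your-writes (session consistency) idea: the session remembers the version it last wrote and only accepts reads from a replica that has caught up. The replica interface here is hypothetical:

    class Replica:
        """Toy replica holding a single versioned value."""
        def __init__(self):
            self.version, self.value = 0, None
        def read(self):
            return self.version, self.value

    class Master(Replica):
        def write(self, value):
            self.version += 1
            self.value = value
            return self.version

    class Session:
        def __init__(self, master, replicas):
            self.master, self.replicas = master, replicas
            self.seen = 0                    # highest version written in this session
        def write(self, value):
            self.seen = self.master.write(value)
        def read(self):
            for replica in [self.master] + self.replicas:
                version, value = replica.read()
                if version >= self.seen:     # fresh enough for this session
                    return value
            raise RuntimeError("no sufficiently fresh replica available")

    master, slave = Master(), Replica()
    session = Session(master, [slave])
    session.write("v1")
    print(session.read())  # "v1", even though the slave has not replicated yet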

52 / 55
Consistency

Strong Consistency

How many nodes need to be involved to get strong consistency?


Write quorum. W > N/2
Idea. only one write request can get the majority
W = number of nodes successfully participating in the write
N = number of nodes involved in replication (replication factor)
Read quorum. R > N − W
Idea. concurrent write requests cannot happen
R = number of nodes participating in the read
Should the retrieved replicas be mutually different, the newest
version is resolved and then returned
When a quorum is not attained → the request cannot be handled
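
A minimal sketch of these conditions, where each replica is assumed to return a (version, value) pair on read:

    N, W, R = 3, 2, 2      # replication factor, write quorum, read quorum

    def quorums_give_strong_consistency(n: int, w: int, r: int) -> bool:
        # Strong consistency requires W > N/2 and R > N - W
        return w > n / 2 and r > n - w

    def resolve_read(replies, r=R):
        # replies: (version, value) pairs collected from the contacted replicas
        if len(replies) < r:
            raise RuntimeError("read quorum not attained, request cannot be handled")
        return max(replies, key=lambda pair: pair[0])[1]  # newest version wins

    print(quorums_give_strong_consistency(N, W, R))   # True for N=3, W=2, R=2
    print(resolve_read([(7, "new"), (6, "old")]))     # "new"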

53 / 55
Consistency

Strong Consistency

Examples for replication factor N = 3


Write quorum W = 3 and read quorum R = 1
All the replicas are always updated
Can read any one of them
Write quorum W = 2 and read quorum R = 2
Typical configuration, reasonable trade-off
Consequence
Quora can be configured to balance read and write workload
The higher the write quorum that is required, the lower the read
quorum can be

54 / 55
Consistency

Lecture Conclusion

There is a wide range of options influencing...


Scalability – how well does the entire system scale?
Availability – when may nodes refuse to handle user requests?
Consistency – what level of consistency is required?
Latency – how long does it take to handle user requests?
Durability – is the committed data written reliably?
Resilience – can the data be recovered in case of failures?
It is good to know these properties and to choose the right trade-off

55 / 55
