Big Data Management Basic Principles
Outline
1 Scalability
2 Distribution Models
3 CAP Theorem
4 Consistency
Scalability
What is scalability?
the capability of a system to handle growing amounts of data
and/or queries without losing performance, or
its potential to be enlarged in order to accommodate such growth
Two general approaches
Vertical scaling
Horizontal scaling
Vertical Scalability
Performance limits
Even the most powerful machine has a limit
Everything works well ... until we start approaching that limit
Higher costs
The cost of expansion grows disproportionately
In particular, it is higher than the sum of the costs of equivalent
commodity hardware
Proactive provisioning
New projects/applications might evolve rapidly
An upfront budget is needed when deploying new machines
So flexibility is seriously limited
Vendor lock-in
There are only a few manufacturers of large machines
Customer is made dependent on a single vendor
Their products, services, but also implementation details,
proprietary formats, interfaces, support, ...
i.e. it is difficult or impossible to switch to another vendor
Deployment downtime
Inevitable downtime is often required when scaling up
Horizontal Scalability
False assumptions: the eight fallacies of distributed computing (see the sketch after this list)
Network is reliable
Latency is zero
Bandwidth is infinite
Network is secure
Topology does not change
There is one administrator
Network is homogeneous
Transport cost is zero
Source: https://www.red-gate.com/simple-talk/blogs/the-eight-fallacies-of-distributed-computing/
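None of these assumptions can simply be designed away, so client code has to defend against them. As a minimal illustration (not part of the original slides, with a hypothetical endpoint), the sketch below wraps a remote call in an explicit timeout and a bounded retry loop with backoff, treating the network as slow and unreliable by default:

```python
import time
import urllib.request

def call_remote(url: str, retries: int = 3, timeout_s: float = 2.0) -> bytes:
    """Fetch a remote resource while assuming the network is slow and unreliable."""
    last_error = None
    for attempt in range(retries):
        try:
            # Explicit timeout: latency is not zero and may be unbounded.
            with urllib.request.urlopen(url, timeout=timeout_s) as response:
                return response.read()
        except OSError as error:               # the network is not reliable
            last_error = error
            time.sleep(0.1 * (2 ** attempt))   # back off before retrying
    raise RuntimeError(f"remote call failed after {retries} attempts") from last_error

# Hypothetical usage:
# payload = call_remote("http://node-1.example.org/health")
```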
What is a cluster?
A collection of mutually interconnected commodity nodes
Based on the shared-nothing architecture
Nodes do not share their CPUs, memory, hard drives, ...
Each node runs its own operating system instance
Nodes send messages to interact with each other
Nodes of a cluster can be heterogeneous
Data, queries, calculations, requests, workload, ... are all
distributed among the nodes within a cluster
Distribution Models
Sharding
Figure source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Objectives
Achieve uniform data distribution
Achieve balanced workload (read and write requests)
Respect physical locations
e.g. different data centers for users around the world
...
Unfortunately, these objectives...
may contradict each other
may change over time
So how do we actually assign aggregates to shards?
Figure source: https://www.digitalocean.com/community/tutorials/understanding-database-sharding
Sharding strategies
Based on mapping structures
Data is placed on shards in an arbitrary fashion
e.g. round-robin, ...
The mapping of individual aggregates to particular shards must
then be maintained explicitly
usually in a centralized index structure, with all its disadvantages (see the sketch below)
Based on general rules
Each shard is responsible for storing certain data
Hash partitioning, range partitioning, ...
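The following sketch illustrates the mapping-structure approach described above; it is not from the slides, and the shard names are made up. New aggregates are placed round-robin, and the resulting aggregate-to-shard mapping must be kept and consulted on every lookup, here in a plain dictionary standing in for the centralized index structure:

```python
from itertools import cycle

class MappingBasedSharding:
    """Round-robin placement with an explicitly maintained aggregate-to-shard index."""

    def __init__(self, shard_names):
        self._next_shard = cycle(shard_names)   # round-robin order of shards
        self._index = {}                        # aggregate key -> shard name

    def place(self, key: str) -> str:
        """Assign a new aggregate to the next shard and remember the mapping."""
        shard = next(self._next_shard)
        self._index[key] = shard
        return shard

    def locate(self, key: str) -> str:
        """Every read must consult the index, which becomes a central dependency."""
        return self._index[key]

router = MappingBasedSharding(["shard-0", "shard-1", "shard-2"])
for user in ["alice", "bob", "carol", "dave"]:
    print(user, "->", router.place(user))
print("lookup bob ->", router.locate("bob"))
```

Hash and range partitioning, shown on the next slides, replace this explicit index with a rule computed from the shard key.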
Key-based sharding
Figure source: https://www.digitalocean.com/community/tutorials/understanding-database-sharding
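A minimal sketch of key-based (hash) sharding, assuming a fixed number of shards; the hash function and shard count are illustrative, not prescribed by the slides. The shard is derived deterministically from the shard key, so no mapping table is needed:

```python
import hashlib

def key_based_shard(shard_key: str, num_shards: int) -> int:
    """Derive the shard deterministically from the shard key."""
    # A stable hash (unlike Python's built-in hash()) so every node computes the same shard.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

for key in ["alice", "bob", "carol", "dave"]:
    print(key, "-> shard", key_based_shard(key, num_shards=4))
```

With this simple modulo scheme, changing the number of shards relocates most keys; consistent hashing is the usual way to soften that, but it is beyond this sketch.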
Range-based sharding
Figure source: https://www.digitalocean.com/community/tutorials/understanding-database-sharding
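A minimal sketch of range-based sharding with made-up key ranges: each shard owns a contiguous range of the shard key, which keeps range queries local to a shard but can create hot spots when the key distribution is skewed:

```python
import bisect

# Upper bounds (exclusive) of the key ranges handled by shards 0..3; values are illustrative.
RANGE_UPPER_BOUNDS = ["g", "n", "t"]          # shard 3 takes everything from "t" upwards
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def range_based_shard(shard_key: str) -> str:
    """Find the first range whose upper bound lies above the key."""
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, shard_key)]

for key in ["alice", "bob", "mallory", "zoe"]:
    print(key, "->", range_based_shard(key))
```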
Directory-based sharding
Figure source: https://www.digitalocean.com/community/tutorials/understanding-database-sharding
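A minimal sketch of directory-based sharding; the lookup attribute (region) and shard names are invented for illustration. A dedicated directory maps a lookup value to a shard, so assignments can be changed centrally without touching hashing or range rules, at the price of an extra lookup and of the directory itself becoming a critical component:

```python
# Directory mapping a lookup attribute (here: region) to a shard; values are illustrative.
directory = {
    "eu": "shard-eu-1",
    "us": "shard-us-1",
    "asia": "shard-asia-1",
}

def directory_based_shard(region: str) -> str:
    """Consult the directory on every request; unknown regions are an error."""
    return directory[region]

print(directory_based_shard("eu"))        # -> shard-eu-1
directory["eu"] = "shard-eu-2"            # re-assign a region without changing the data model
print(directory_based_shard("eu"))        # -> shard-eu-2
```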
Should I Shard?
Consider sharding when:
The amount of application data grows to exceed the storage
capacity of a single database node.
The volume of writes or reads to the database surpasses what a
single node can handle,
resulting in slowed response times or timeouts.
Replication
Placement of multiple copies of the same data (replicas) on
different nodes
Replication factor = number of such copies
Two approaches
1 Master-slave architecture
2 Peer-to-peer architecture
Master-Slave Architecture
Figure source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
One node is primary (the master), all the others are secondary (slaves)
Master node bears all the management responsibility
All the nodes contain identical data
Read requests can be handled by the master as well as by the slaves
Suitable for read-intensive applications
More read requests to deal with → more slaves to deploy
When the master fails, read operations can still be handled
Write requests can only be handled by the master
Newly written data is propagated to all the slaves
Consistency issue
Luckily enough, at most one write request is handled at a time
But the propagation still takes some time during which obsolete
reads might happen
Hence certain synchronization is required to avoid conflicts
In case of master failure, a new one needs to be appointed
Manually (user-defined) or automatically (cluster-elected)
Since the nodes are identical, appointment can be fast
The master might therefore represent a bottleneck (because of
performance or failures)
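A purely in-memory sketch of the master-slave scheme described above; it is not from the slides and the class and method names are invented. Writes are accepted only by the master and propagated to the slaves, reads may be served by any node, and delaying the propagation reproduces the obsolete reads mentioned above:

```python
import random

class Node:
    """A single replica holding a full copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, num_slaves: int):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(num_slaves)]

    def write(self, key, value, propagate: bool = True):
        """Writes are accepted only by the master."""
        self.master.data[key] = value
        if propagate:                      # propagation may lag in reality
            for slave in self.slaves:
                slave.data[key] = value

    def read(self, key):
        """Reads may be served by the master or by any slave."""
        node = random.choice([self.master] + self.slaves)
        return node.name, node.data.get(key)

cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("page:1", "v1")
cluster.write("page:1", "v2", propagate=False)   # propagation not finished yet
print(cluster.read("page:1"))                    # may still return the obsolete value "v1"
```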
Peer-to-Peer Architecture
Figure source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
All the nodes have equal roles and responsibilities
Again, all the nodes contain identical data
Both read and write requests can be handled by any node
No bottleneck, no single point of failure
Both operations scale well
More requests to deal with → more nodes to deploy
Consistency issues
Unfortunately, multiple write requests can be initiated
independently and executed at the same time
Hence synchronization is required to avoid conflicts
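A minimal sketch of the conflict problem in the peer-to-peer setup; the reconciliation rule shown (last write wins by timestamp) is just one illustrative choice, not something prescribed by the slides. Two peers accept independent writes to the same key, and the naive merge silently discards one of them:

```python
class Peer:
    """A peer that accepts writes locally; each value carries a logical timestamp."""
    def __init__(self, name):
        self.name = name
        self.data = {}            # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def merge_from(self, other):
        """Naive reconciliation: last write wins, so one concurrent update is lost."""
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)

a, b = Peer("a"), Peer("b")
a.write("room:42", "booked by Alice", timestamp=10)   # concurrent, independent writes
b.write("room:42", "booked by Bob", timestamp=11)
a.merge_from(b)
b.merge_from(a)
print(a.data["room:42"], b.data["room:42"])           # both converge, but Alice's booking is gone
```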
Figure source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
CAP Theorem
Assumptions
Distributed system with sharding and replication
Read and write operations on a single aggregate only
CAP properties
Properties of a distributed system:
Consistency, Availability, and Partition tolerance
CAP theorem
It is not possible to have a distributed system that would
guarantee all CAP properties at the same time.
Only 2 of these 3 properties can be enforced.
But what do these properties actually mean?
CAP Properties
Consistency
Read and write operations must be executed atomically
There must exist a total order on all operations
Each operation looks as if it was completed at a single instant
i.e. as if all operations were executed sequentially one by one on a
single standalone node
Practical consequence: after a write operation, all readers see
the same data
Since any node can be used to handle read requests, atomicity
of write operations means that changes must be propagated to all
the replicas
As we will see later on, other ways to achieve such strong
consistency exist as well
Availability
If a node is working, it must respond to user requests
Every read or write request successfully received by a non-failing
node in the system must result in a response,
i.e. its execution must not be rejected
Partition tolerance
System continues to operate even when two or more sets of
nodes get isolated
The network is allowed to lose arbitrarily many messages sent
from one node to another
i.e. a connection failure must not shut the whole system down
ACID Properties (Atomicity, Consistency, Isolation, Durability)
BASE Properties (Basically Available, Soft state, Eventual consistency)
ACID
Choose consistency over availability
Pessimistic approach
Implemented by traditional relational databases
BASE
Choose availability over consistency
Optimistic approach
Common in NoSQL databases
Allows levels of scalability that cannot be achieved with ACID
Current trend in NoSQL: strong consistency → eventual consistency
Consistency
Consistency in general...
Consistency is the lack of contradiction in the database
However, it has many facets ... e.g.
here we only assume atomic operations that manipulate just a single
aggregate,
but set operations could also be considered, etc.
Strong consistency is achievable even in clusters, but eventual
consistency might often be sufficient
1 A news article that is one minute out of date does not matter
2 Even when an already booked hotel room gets booked once
again, the situation can still be resolved in the real world
3 ...
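To make the idea of eventual consistency concrete, the following sketch (illustrative only, not from the slides) applies a write to one replica immediately and to a second replica only when propagation is explicitly triggered; a read in between returns stale data, which is exactly the window the examples above tolerate:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class EventuallyConsistentPair:
    """Two replicas; writes land on one and reach the other asynchronously."""
    def __init__(self):
        self.primary = Replica("r1")
        self.secondary = Replica("r2")
        self._pending = []                       # writes not yet propagated

    def write(self, key, value):
        self.primary.data[key] = value
        self._pending.append((key, value))       # propagation happens "later"

    def propagate(self):
        for key, value in self._pending:
            self.secondary.data[key] = value
        self._pending.clear()

store = EventuallyConsistentPair()
store.write("headline", "old article")
store.propagate()                                 # both replicas now agree
store.write("headline", "new article")            # not yet propagated
print(store.secondary.data["headline"])           # "old article": an obsolete read
store.propagate()
print(store.secondary.data["headline"])           # "new article": replicas have converged
```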
Strong Consistency
Lecture Conclusion