SGDB
Avoid Cartesian products; go for joins.
Given a choice of plans: equivalent, but different cost.
Estimating the cost
Example: page nested loops join of Sailors and Reservations.
No use of indexes (simplest option).
Improvements:
Push down the rating evaluation: imagine the rating goes from 0-10 (the selection keeps half).
Push down the bid evaluation.
500 + 250 x 1000
When you push a selection into the inner loop of a nested loop join, it doesn't [illegible] for the high-ranking sailors of this example.
Estimating the cost: [figures illegible in the source; they mention half of the Sailors, 250 I/Os for Reserves, and the cost of writing out a materialized table]
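The figure "500 + 250 x 1000" in the notes above fits the usual page nested loops cost formula: read the outer relation once, then scan the inner relation once per outer page that takes part in the join. A minimal sketch, assuming (the notes do not state this) that Sailors occupies 500 pages, Reserves occupies 1000 pages, and that pushing down the rating selection leaves half of the Sailors pages in the loop. The function name and constants are hypothetical.

```python
def page_nested_loops_cost(scan_pages: int, outer_pages: int, inner_pages: int) -> int:
    """I/O cost: scan the outer relation, then scan the inner once per joining outer page."""
    return scan_pages + outer_pages * inner_pages


SAILORS_PAGES = 500    # assumption, not stated in the notes
RESERVES_PAGES = 1000  # assumption, not stated in the notes

# Simplest option: no indexes, no pushdown.
baseline = page_nested_loops_cost(SAILORS_PAGES, SAILORS_PAGES, RESERVES_PAGES)

# Rating selection pushed down: assume only half of the Sailors pages reach the join.
pushed = page_nested_loops_cost(SAILORS_PAGES, SAILORS_PAGES // 2, RESERVES_PAGES)

print(baseline)  # 500500
print(pushed)    # 250500
```

Under these assumptions the pushdown roughly halves the I/O, because the inner scan dominates the cost.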
ACID
Atomicity: All actions happen or none happens.
Consistency: The database starts and ends each transaction in a consistent state.
What this means depends on the type of database, but it is normally defined by a set of declarative integrity constraints (any transaction that violates them is aborted).
Isolation: Each execution is isolated from the others.
Durability: If there is a commit, the data needs to be persisted.
Usually Atomicity and Durability are ensured by logging all actions, so that after a failure each action is handled in one of two ways:
Undo - Actions that failed or were aborted and need to be rolled back.
Redo - Actions that were committed but not propagated to the disk, and need to be reapplied so they are kept.
Transaction states
Active - Initial state.
Partially Committed - After the final statement has been executed.
Failed - When the execution can no longer proceed.
Aborted - After the transaction was rolled back and the data has been restored to its previous state.
Committed - After successful completion.
Terminated - Either committed or aborted.
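The undo/redo split described under ACID can be sketched as a single pass over the log. This is only an illustration: the `(transaction, action)` record format and the function name are hypothetical, and a real recovery manager also tracks the data values to restore or reapply.

```python
def classify_for_recovery(log):
    """Return (redo, undo) transaction ids from a list of (txn, action) log records."""
    started, committed = [], set()
    for txn, action in log:
        if action == "start":
            started.append(txn)
        elif action == "commit":
            committed.add(txn)
    # Committed transactions are redone; everything else that started is undone.
    redo = [t for t in started if t in committed]
    undo = [t for t in started if t not in committed]
    return redo, undo


log = [
    ("T1", "start"), ("T1", "write"), ("T2", "start"),
    ("T1", "commit"), ("T2", "write"), ("T3", "start"), ("T3", "abort"),
]
redo, undo = classify_for_recovery(log)
print(redo)  # ['T1']
print(undo)  # ['T2', 'T3']
```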
Concurrency Control
The simplest way to avoid concurrency problems is to enforce serial execution, where one transaction runs at a time, but this is slow and wastes resources.
There are 2 good reasons to allow concurrency:
Improve throughput and resource utilization (TPS).
Improve latency (average response time).
To allow concurrency, databases use concurrency-control schemes, such as lock-based protocols.
Transaction Schedules
Schedule - Sequence of actions on data from one or more transactions; it represents the chronological order in which instructions are executed in the system.
The instruction WRITE must be done before any READ instruction in any valid schedule.
Serial Schedule - Each transaction runs from start to finish without any interference from the actions of other transactions.
Serial Equivalence - 2 schedules are equivalent if:
They involve the same transactions.
Each individual transaction's actions are ordered the same way.
Both schedules leave the database in the same final state.
Serializability
A schedule is serializable if it is equivalent to some serial schedule.
Conflict - When 2 transactions have operations on the same data item and at least one of these operations is a WRITE.
Conflict serializable schedules
Two schedules are conflict equivalent if:
They involve the same actions of the same transactions.
Every pair of conflicting actions is ordered the same way.
A schedule is conflict serializable if it is conflict equivalent to some serial schedule, which implies that it is serializable.
It can then be transformed into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
A conflict dependency/precedence graph can be built to test this: the schedule is conflict serializable if and only if the graph has no cycles.
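The precedence-graph test can be sketched as follows: add an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, then look for a cycle. The schedule encoding as `(transaction, operation, item)` tuples and all function names are assumptions made for this illustration.

```python
from itertools import combinations


def precedence_edges(schedule):
    """schedule: list of (txn, op, item) with op in {"R", "W"}, in execution order."""
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        # Conflict: different transactions, same item, at least one WRITE.
        if t1 != t2 and x1 == x2 and "W" in (op1, op2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph[node]:
            if nxt in visiting or (nxt not in done and dfs(nxt)):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in list(graph) if n not in done)


def is_conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))


ok = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
bad = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
print(is_conflict_serializable(ok))   # True
print(is_conflict_serializable(bad))  # False
```

In `bad`, T1 reads A before T2 writes it (edge T1 -> T2) and T2 writes A before T1 writes it (edge T2 -> T1), so the graph has a cycle.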
Lock-Based Protocols - Data items are accessed in a mutually exclusive manner. Each action in the system follows a set of rules, called a locking protocol, that defines when the data can be accessed.
Shared Lock - The item can be read but not written (shared-mode lock, S).
Exclusive Lock - The item can be both read and written (exclusive-mode lock, X).
Lock and unlock requests are handled by the Lock Manager.
It maintains a hash table, keyed on the names of the objects being locked, with an entry for each currently held lock.
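A lock manager of this kind can be sketched with a dictionary as the hash table. This is a simplified illustration, not a description of any particular DBMS: the class and method names are hypothetical, and a refused request simply returns `False` instead of being placed in a wait queue.

```python
class LockManager:
    def __init__(self):
        # Hash table: item name -> (mode, set of transactions holding the lock)
        self.table = {}

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if the transaction must wait."""
        if item not in self.table:
            self.table[item] = (mode, {txn})
            return True
        held_mode, holders = self.table[item]
        if holders == {txn}:
            # The sole holder may keep its lock or upgrade S to X.
            new_mode = "X" if "X" in (held_mode, mode) else "S"
            self.table[item] = (new_mode, holders)
            return True
        if held_mode == "S" and mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False

    def release(self, txn, item):
        entry = self.table.get(item)
        if entry is None:
            return
        _mode, holders = entry
        holders.discard(txn)
        if not holders:
            del self.table[item]


lm = LockManager()
print(lm.request("T1", "A", "S"))  # True
print(lm.request("T2", "A", "S"))  # True, S is compatible with S
print(lm.request("T2", "A", "X"))  # False, T1 still holds S
lm.release("T1", "A")
print(lm.request("T2", "A", "X"))  # True, T2 is now the sole holder
```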
(2PL) Two-Phase Locking - Each transaction issues its lock and unlock requests in 2 phases. It obtains an S lock before reading and an X lock before writing.
The two phases are:
Growing phase or acquisition phase - A transaction can obtain locks but not release any lock.
Shrinking phase or release phase - A transaction can release locks, but not obtain any new locks.
Most common scheme for enforcing conflict serializability. Aborts when a conflict is detected.
Disadvantages: It sets locks for fear of conflict, which adds cost. It does not prevent cascading aborts.
It is conflict serializable because conflicting transactions block one another, which yields the equivalent of a serial schedule.
To avoid cascading aborts, we use strict 2PL: same as 2PL, but all locks are released together when the transaction completes.
The transaction completes in one of two ways:
committed, and all updates endure, or
aborted, and everything is undone.
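The two-phase rule and its strict variant can be checked mechanically on the action sequence of a single transaction. A minimal sketch, assuming a hypothetical encoding of actions as `(kind, item)` pairs where `kind` is `"lock"`, `"unlock"`, `"commit"` or `"abort"`:

```python
def is_two_phase(actions):
    """True if no lock is acquired after the first unlock (growing, then shrinking)."""
    shrinking = False
    for kind, _item in actions:
        if kind == "unlock":
            shrinking = True
        elif kind == "lock" and shrinking:
            return False
    return True


def is_strict(actions):
    """True if every unlock happens only after the transaction has committed or aborted."""
    finished = False
    for kind, _item in actions:
        if kind in ("commit", "abort"):
            finished = True
        elif kind == "unlock" and not finished:
            return False
    return True


plain_2pl = [("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B"), ("commit", None)]
not_2pl = [("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B"), ("commit", None)]
strict_2pl = [("lock", "A"), ("lock", "B"), ("commit", None), ("unlock", "A"), ("unlock", "B")]

print(is_two_phase(plain_2pl), is_strict(plain_2pl))    # True False
print(is_two_phase(not_2pl))                            # False
print(is_two_phase(strict_2pl), is_strict(strict_2pl))  # True True
```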
Deadlock - When a set of transactions are blocked waiting on each other. There are 3 ways to treat deadlocks:
Prevention
Avoidance
Detection and Resolution
Deadlock prevention - Requests are ordered to avoid cyclic waits, or a transaction is rolled back after waiting for some time, avoiding a deadlock.
Checkpoint
Save the state of the DB periodically, so that on recovery the entire log doesn't need to be reprocessed.
During a checkpoint, no new transactions are accepted.
The checkpoint waits until the running transactions have finished.
Data analytics - Process of inferring patterns, correlations and models for prediction from data.
Data Warehouse (DW) - Single location (repository or archive) where multiple sources of data are gathered, normally
providing a single interface for them through a unified schema.
ETL(Extract, Transform, Load) - Processes of collecting, cleaning/deduplicating and loading the data into a DW.
Source-driven architecture - Sources send the data to the DW.
Destination-driven architecture - The DW requests data from the sources.
The relations in a data warehouse schema are usually classified as:
Fact tables - Record information about individual events, example: a table that contains the individual
sales of a certain brand.
Dimension tables - Contain the attributes of a certain dimension, example: a table about the stores where the
product can be bought.
Data lake - Repository where data can be stored in more than one format, including structured and unstructured.
Row-oriented - Tuples are stored sequentially in a file.
Column-oriented - Each attribute of a relation is stored in a separate file.
OLAP (Online Analytical Processing) - Efficient systems that allow for aggregation and retrieval of large sets of
information in almost real time.
Allows an analyst to look into different cross-tabs on the same data by interacting with the attributes.
Cross-tab (cross-tabulation) or pivot-table - A 2-dimensional table derived from a relation.
Data cube - The n-dimensional generalization of a cross-tab.
Pivoting - To change the dimensions used in a cross-tab.
Slicing/Dicing - Viewing a subset of the data by fixing one or more dimensions to a certain value.
Rollup - To go from finer granularity to coarser by means of aggregation.
Drill down - To go from coarser granularity to finer.
Hierarchy - Different levels of detail can be organized into a hierarchy, example: datetime (day, month,
year).
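The OLAP operations above can be illustrated on a handful of fact rows. A small sketch with made-up data: the `sales` rows, their column layout `(item, color, day, quantity)` and the helper names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, color, day, quantity)
sales = [
    ("shirt", "dark", "2024-01-05", 2),
    ("shirt", "white", "2024-01-20", 3),
    ("skirt", "dark", "2024-02-03", 4),
    ("shirt", "dark", "2024-02-11", 1),
]


def cross_tab(rows, row_key, col_key):
    """2-dimensional aggregation: sum of quantity per (row value, column value)."""
    table = defaultdict(int)
    for row in rows:
        table[(row_key(row), col_key(row))] += row[3]
    return dict(table)


def rollup(rows, key):
    """Aggregate quantity at the granularity chosen by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


by_item_color = cross_tab(sales, lambda r: r[0], lambda r: r[1])
by_month = rollup(sales, lambda r: r[2][:7])       # day -> month
by_year = rollup(sales, lambda r: r[2][:4])        # month -> year (coarser)
dark_slice = [r for r in sales if r[1] == "dark"]  # slice: color fixed to "dark"
dark_by_item = rollup(dark_slice, lambda r: r[0])

print(by_item_color)  # {('shirt', 'dark'): 3, ('shirt', 'white'): 3, ('skirt', 'dark'): 4}
print(by_month)       # {'2024-01': 5, '2024-02': 5}
print(by_year)        # {'2024': 10}
print(dark_by_item)   # {'shirt': 3, 'skirt': 4}
```

Going from `by_month` to `by_year` is a rollup along the date hierarchy; going back from year to month is the corresponding drill down.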
PageRank
The original PageRank algorithm, described by Lawrence Page and Sergey Brin in 1998, is given by:
PR(A) = (1 - d) + d (PR(T1) / C(T1) + ... + PR(Tn) / C(Tn))
Where:
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of the pages Ti that link (point) to page A,
C(Ti) is the number of outbound links on page Ti,
d is the damping factor, which ranges between 0 and 1.
Example:
Let d = 0.5:
PR(A) = 0.5 + 0.5 (PR(C) / 1)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B) / 1)
Since in practice there are millions of unknowns, the PageRank system of equations is solved iteratively, initializing the variables with the value 1:
Iteration   A          B          C
0           1          1          1
1           1          0.75       1.25
2           1.125      0.75       1.125
3           1.0625     0.78125    1.15625
4           1.078125   0.765625   1.15625
5           1.078125   0.769531   1.152344
Rank        2          3          1
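The iterative solution for the d = 0.5 example can be reproduced with a few lines of code. A minimal sketch: the link structure (A links to B and C, B links to C, C links to A) is inferred from the three example equations, and the function name is hypothetical. Each iteration uses only the values of the previous iteration, which is what the table of iterations does.

```python
def pagerank(inbound, out_degree, d=0.5, iterations=5):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over the pages t linking to p."""
    pr = {page: 1.0 for page in inbound}  # every page starts at 1
    history = [dict(pr)]
    for _ in range(iterations):
        # The right-hand side reads only the previous iteration's values.
        pr = {
            page: (1 - d) + d * sum(pr[src] / out_degree[src] for src in sources)
            for page, sources in inbound.items()
        }
        history.append(dict(pr))
    return history


# Link structure inferred from the example equations (an assumption of this sketch).
inbound = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2, "B": 1, "C": 1}

history = pagerank(inbound, out_degree, d=0.5, iterations=5)
for i, row in enumerate(history):
    print(i, row["A"], row["B"], row["C"])

final = history[-1]
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)  # ['C', 'A', 'B']
```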
Data in the fact table can be:
· Additive: attributes that can be aggregated (summed) across all dimensions, e.g. sale value (always use Sum())
· Semi-additive: attributes that can be aggregated (summed) across some of the dimensions, e.g. quantity (use
Sum() under particular conditions)
· Non-additive: attributes that cannot be aggregated (summed), e.g. unit price (use Average(), for
example)
· Factless: only identifiers exist (use the Count() function on the identifiers)
1NF (1st normal form)
Each column in a table can only hold atomic (indivisible) values.
There can be no repeating groups of columns.
Example: in a books table Livros(ID, Titulo, Autor, Categoria), a book cannot have more than
one author or category; otherwise the table would have to be split into two tables.
2NF (2nd normal form)
1NF + every column that is not part of a candidate key must be fully dependent
on the primary key; there are no partial dependencies.
Example: in the same Livros table, ID is the primary key, yet Autor (author) and Categoria (category)
depend on ID but not on the title; so to normalize, we have to split it into two tables, Livros
(ID, Título) and Autores_Categorias(book ID, Autor, Categoria).