SGDB

Algebra Equivalence Heuristics (not always effective, but most of the time):
Selections - Apply them as soon as possible to the relevant columns; do the selections first, then do the join.
Projections - Keep only the columns needed to evaluate downstream operators.
Join Ordering - Given a choice, go for joins and avoid cartesian products. Different orderings are equivalent but have different costs.

Estimating the cost:
Example: page nested loops join of Sailors and Reserves.
Problems with the simplest option:
· the selection can be pushed down (the two plans are equivalent)
· no use of indexes

Improvements (imagine rating goes from 0 to 10, so rating > 5 keeps half of the Sailors pages):
Pushdown of the rating evaluation: 500 + 250 x 1000 I/Os
Pushdown of the bid evaluation: 500 + 250 x 1000 I/Os

When you push a selection into the inner loop of a nested loops join it doesn't save I/Os, so it's equivalent to doing it above: all 1000 Reserves pages will still be scanned for each page of high-rating sailors in this example.

What if we materialize the inner loop?
What if we change the join ordering?

Estimating the cost:
T1 = σ rating > 5 (Sailors): half of the 500 Sailors pages, so 250 I/Os to write the materialized table.
Scan Reserves once (1000 I/Os), keeping the roughly 10 pages of tuples that match the bid selection.
Scan Sailors once to create T1 (500 I/Os).
Join the 10 qualifying Reserves pages with T1, reading T1 for each one: 10 x 250 I/Os.
Total: 1000 + 500 + 250 + (10 x 250) = 4250 I/Os
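The I/O arithmetic in the example above can be sketched as a quick check (a minimal sketch; the 500-page Sailors / 1000-page Reserves sizes and the selectivities are the ones used in these notes):

```python
# Classic cost-estimation example: Sailors (500 pages), Reserves (1000 pages).
SAILORS_PAGES = 500
RESERVES_PAGES = 1000

# Naive page nested loops join (Sailors outer): scan Sailors once,
# then scan all of Reserves for every Sailors page.
naive = SAILORS_PAGES + SAILORS_PAGES * RESERVES_PAGES

# Pushing rating > 5 down keeps half of the Sailors pages, but the inner
# relation is still scanned in full for each surviving outer page.
pushdown = SAILORS_PAGES + (SAILORS_PAGES // 2) * RESERVES_PAGES

# Materializing: scan Reserves (1000), scan Sailors (500), write the
# 250-page T1, then join the ~10 qualifying Reserves pages with T1.
materialized = RESERVES_PAGES + SAILORS_PAGES + 250 + 10 * 250

print(naive, pushdown, materialized)  # 500500 250500 4250
```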
Transaction - Sequence of multiple actions to be executed as an atomic unit.
Begin Transaction
SQL Statements (a sequence of READs and WRITEs, from the Transaction Manager's point of view)
READ - Transfers data from the database into a variable in a buffer in the main memory.
WRITE - Transfers the value in the variable into the database.
End Transaction
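The READ/WRITE model above can be sketched in a few lines (a toy sketch; `db` and `buffer` are illustrative names, not a real DBMS API):

```python
# Minimal sketch of the READ/WRITE model: the database is a dict on "disk",
# and a transaction works on buffered copies in main memory.
db = {"A": 100, "B": 50}   # persistent database
buffer = {}                # transaction-local variables in main memory

def read(item):
    """READ: transfer a data item from the database into a buffer variable."""
    buffer[item] = db[item]

def write(item):
    """WRITE: transfer the buffered value back into the database."""
    db[item] = buffer[item]

# A transfer of 10 from A to B, as a sequence of READs and WRITEs:
read("A"); buffer["A"] -= 10; write("A")
read("B"); buffer["B"] += 10; write("B")
print(db)  # {'A': 90, 'B': 60}
```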
Storage Structure
Volatile Storage - Does not usually survive a system crash.
Examples: Main and Cache memory
Non-Volatile Storage - Usually survives a system crash.
Examples: Magnetic Disks, Flash Storage, devices such as
optical media and magnetic tapes.
Stable Storage - Information that is “never” lost.
Example: Disks replicated into several independent places.

ACID
Atomicity: All actions happen or none happens.
Consistency: Starts and ends in a consistent state. Depends on the type of database, but is normally defined by a set of declarative integrity constraints (any transaction that violates them is aborted).
Isolation: Each execution is isolated from the others.
Durability: If there is a commit, the data needs to be persisted.
Usually Atomicity and Durability are ensured by logging all actions, so that when there is a failure they are either:
Undo - Actions that failed or were aborted and need to be rolled back.
Redo - Actions that were committed but not propagated to the disk, and need to be kept.

Transactions Lifecycle
Active - Initial state.
Partially Committed - After the final statement has been executed.
Failed - When the execution can no longer proceed.
Aborted - After the action was rolled back and the data has been restored to the previous status.
Committed - After successful completion.
Terminated - Either committed or aborted.

Concurrency Control
The simplest way to avoid concurrency problems is to ensure a serial execution, one transaction running at a time, but it's a heavy and slow process.
There are 2 good reasons to allow concurrency:
Improve throughput and resource utilization (TPS)
Improve latency (average response time).

In order to allow concurrency the databases have a set of concurrency-control schemes, such as:
Transaction Schedules
Schedule - Sequence of actions on data from one or more transactions; schedules represent the chronological order in which instructions are executed in the system.
The WRITE instruction must be done before any READ instruction in any valid schedule.

Serial Schedule - Each transaction runs from start to finish without any interference from actions of other transactions.
Serial Equivalence - 2 schedules are equivalent if:
They involve the same transactions
Each individual transaction's actions are ordered the same
Both schedules leave the database in the same final state.
Serializability
A schedule is serializable if it is equivalent to some serial schedule.
Conflict - When 2 transactions have operations on the same data item and at least one of these instructions is a WRITE.
Conflict serializable schedules:
They involve the same actions of the same transactions.
Every pair of conflicting actions is ordered the same way.
This implies the schedule is serializable.
It's possible to transform one into a serial schedule by swapping consecutive non-conflicting operations of different transactions.
A conflict dependency/precedence graph can be built to determine it.
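The precedence-graph test above can be sketched as follows (a minimal sketch; the schedule representation and function names are our own):

```python
# Build a precedence (conflict) graph from a schedule and test conflict
# serializability by checking for cycles. A schedule is a list of
# (transaction, action, item) tuples.

def precedence_graph(schedule):
    """Edge Ti -> Tj when an action of Ti conflicts with a later one of Tj."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            # Conflict: same item, different transactions, at least one WRITE.
            if xi == xj and ti != tj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph.get(u, []):
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in graph if u not in done)

# T1 precedes T2 on item A but follows it on item B: not serializable.
s = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "W", "B"), ("T1", "W", "B")]
print(has_cycle(precedence_graph(s)))  # True -> not conflict serializable
```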

Lock-Based Protocols - Data items are accessed in a mutually exclusive manner. Each action in the system follows a set of rules, called a locking protocol, that defines when the data can be accessed or not.
Shared Lock - The item can still be read but not written (Shared Mode Lock, S).
Exclusive Lock - The item can be both read and written (Exclusive Mode Lock, X).
The lock and unlock requests are handled by the Lock Manager.
It maintains a hashtable, keyed on the names of the objects being locked, with an entry for each currently held lock.
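The Lock Manager's hashtable can be sketched like this (illustrative names and a deliberately simplified compatibility check, with no queueing of waiters):

```python
# Illustrative lock manager: a hashtable keyed on object name, holding the
# lock mode and the set of transactions that currently own it.

class LockManager:
    def __init__(self):
        self.table = {}  # object name -> {"mode": "S" or "X", "owners": set}

    def lock(self, txn, obj, mode):
        """Grant a lock if compatible; return True on success."""
        entry = self.table.get(obj)
        if entry is None:
            self.table[obj] = {"mode": mode, "owners": {txn}}
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["owners"].add(txn)   # shared locks are compatible
            return True
        return False                   # any X involvement conflicts: must wait

    def unlock(self, txn, obj):
        entry = self.table[obj]
        entry["owners"].discard(txn)
        if not entry["owners"]:
            del self.table[obj]        # last owner gone: drop the entry

lm = LockManager()
print(lm.lock("T1", "A", "S"))  # True: first lock on A
print(lm.lock("T2", "A", "S"))  # True: S is compatible with S
print(lm.lock("T3", "A", "X"))  # False: X conflicts with the held S locks
```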

(2PL) 2 Phase Locking - Each transaction issues its lock and unlock requests in 2 phases. It obtains an S lock before reading and an X lock before writing.
The locks are acquired and released as follows:

Growing phase or acquisition phase - A transaction can obtain locks but not release any lock.
Release phase or shrinking phase - A transaction can release locks, but not obtain any new locks.

Most common scheme for enforcing conflict serializability. Aborts when a conflict is detected.
Disadvantages: Sets locks for fear of conflict, generating cost. Does not prevent cascading aborts.
It's conflict serializable because one transaction gets blocked by the other, offering the equivalent of a serial schedule.
To avoid cascading aborts, we use strict 2PL - the same as 2PL but all locks are released together when the transaction completes.
The transaction completes either:
committed, and all updates endure, or
aborted, and everything is undone.

Deadlock - When a set of transactions is waiting on each other, and all are blocked. There are 3 ways deadlocks can be treated:
Prevention
Avoidance
Detection and Resolution
Deadlock Prevention - Requests are ordered to avoid cyclic waits, or a rollback is done on the transaction after some time waiting, avoiding a deadlock.

Deadlock Avoidance - Assigns priorities to lock requests by the timestamp of when each transaction begins (transaction age).
Not used very often, and is still a form of prevention.
When a transaction restarts it keeps its initial timestamp, so that it doesn't lose priority on the list.

Deadlock Detection and Resolution - By creating and maintaining "wait-for" graphs, the database keeps track of the transactions that are deadlocked, periodically checking for cycles and performing actions over those transactions.
Used by almost every SGDB.
An algorithm needs to be run to shoot (select as a victim) one of the transactions of the cycle and resolve it.
After that a total or partial rollback can be done.
Can cause starvation if it's cost based, since the same transactions are going to be constantly selected.
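A toy version of wait-for cycle detection and victim selection (assuming, for simplicity, that each blocked transaction waits for exactly one other; all names are ours):

```python
# "Wait-for" deadlock detection: nodes are transactions, and an edge
# Ti -> Tj means Ti is waiting for a lock that Tj holds.

def find_cycle(wait_for):
    """Return a list of transactions forming a cycle, or None."""
    for start in wait_for:
        path, node, seen = [], start, set()
        while node in wait_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = wait_for[node]       # each txn waits for one other here
        if node in path:                # walked back into the path: cycle
            return path[path.index(node):]
    return None

# T1 waits for T2, T2 waits for T3, T3 waits for T1: a deadlock.
waits = {"T1": "T2", "T2": "T3", "T3": "T1"}
cycle = find_cycle(waits)
# Resolution: shoot a victim (here simply the highest-numbered transaction,
# an arbitrary policy) and roll it back, breaking the cycle.
victim = max(cycle)
print(cycle, victim)  # ['T1', 'T2', 'T3'] T3
```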
Multiple granularity or multigranularity
Due to the complexity of deciding the granularity of what to lock (tuples vs pages vs tables), multiple locking granularity was implemented, so that different situations can be locked in different ways, defining a hierarchy of granularity.
Finer granularity (lower in the tree) - Better/higher concurrency but more overhead (lots of locks).
Coarser granularity (higher in the tree) - Less overhead but lost potential for concurrency.

Intention Lock Modes - Are added to all ancestors of the node in the tree, so that the transaction doesn't need to search the full tree. They add 3 more lock types:
IS: Intent to get an S (shared) lock
IX: Intent to get an X (exclusive) lock
SIX: Both (S and IX)
Multiple Granularity Locking - Each transaction starts from the root of the
hierarchy
To get S or IS lock on a node, must hold IS or IX on the parent,
To get X or IX or SIX on a node, must hold IX or SIX on the parent.
Must release locks from bottom to top.
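The parent rules above can be transcribed directly (a sketch; the mode names follow the list above):

```python
# Multiple Granularity Locking parent rules:
#   S or IS on a node requires IS or IX on the parent;
#   X, IX or SIX on a node requires IX or SIX on the parent.

def parent_ok(requested, parent_mode):
    """Check whether the lock held on the parent permits `requested` below."""
    if requested in ("S", "IS"):
        return parent_mode in ("IS", "IX")
    if requested in ("X", "IX", "SIX"):
        return parent_mode in ("IX", "SIX")
    return False

print(parent_ok("X", "IX"))   # True: IX on the parent allows X below
print(parent_ok("S", "IX"))   # True: IS or IX on the parent allows S below
print(parent_ok("X", "IS"))   # False: IS is not enough for a write below
```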
Types of Log records:
START Tx - Transaction Tx has begun
COMMIT Tx - Transaction Tx has committed
ABORT Tx - Transaction Tx has aborted
Undo transaction logging - T, X, v - Transaction Tx has updated element X, and its OLD value was v
Example: Write T1, D, 20
Redo transaction logging - T, X, v - Transaction Tx has updated element X, and its NEW value is v
Example: Write T1, D, 25
ARIES transaction logging - T, X, v, w - Transaction Tx has updated element X, recording both its OLD value v and its NEW value w
Example: Write T1, D, 20, 25
If there is a checkpoint (example: Read T1, D, Checkpoint), the Redo phase starts from the checkpoint instead of the beginning of the log.

Undo Logging (STEAL, FORCE)
At recovery time, undo the transactions that were not committed.
Leave committed transactions alone.
STEAL Policy: The latest transaction (even if uncommitted) is written into the disk.

Redo Logging (NO-STEAL, NO-FORCE)
At recovery time, redo the transactions that have been committed.
Leave uncommitted transactions alone.
FORCE Policy: Every update is forced into the DB disk before commit.

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) Logging
Combines UNDO and REDO.
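A toy recovery pass combining the redo and undo rules above (the log record format mirrors the examples in this section; it is not ARIES proper, which also tracks LSNs and checkpoints):

```python
# Log records: ("START", txn), ("COMMIT", txn), or
# ("WRITE", txn, item, old_value, new_value), as in the examples above.
log = [
    ("START", "T1"), ("WRITE", "T1", "D", 20, 25), ("COMMIT", "T1"),
    ("START", "T2"), ("WRITE", "T2", "E", 7, 9),   # T2 never committed
]

def recover(log, db):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    # Redo phase: forward pass, reapply NEW values of committed transactions.
    for r in log:
        if r[0] == "WRITE" and r[1] in committed:
            _, txn, item, old, new = r
            db[item] = new
    # Undo phase: backward pass, restore OLD values of uncommitted ones.
    for r in reversed(log):
        if r[0] == "WRITE" and r[1] not in committed:
            _, txn, item, old, new = r
            db[item] = old
    return db

db = {"D": 25, "E": 9}    # state found on disk after the crash
print(recover(log, db))   # {'D': 25, 'E': 7}: T1 kept, T2 rolled back
```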

Checkpoint
Save the status of the DB periodically, so that on recovery the entire log doesn’t need to be reprocessed.
During a checkpoint, no other transactions are accepted.
It waits until the end of the active transactions to checkpoint.
Data analytics - Process of inferring patterns, correlations and models for prediction from data.
Data Warehouse (DW) - Single location (repository or archive) where multiple sources of data are gathered, normally providing a single interface for them through a unified schema.
ETL (Extract, Transform, Load) - Process of collecting, cleaning/deduplicating and loading the data into a DW.
Source-driven architecture - Sources send the data to the DW.
Destination-driven architecture - The DW requests data from the sources.
The relations in a data warehouse schema are usually classified as:
Fact tables - Records of information about individual examples, example: a table that contains individual sales of a certain brand.
Dimension tables - Contain attributes of a certain dimension, example: a table about the stores where the product can be bought.
Data lake - Repository where data can be stored in more than one format, including structured and unstructured.
Row-oriented - Tuples are stored sequentially in a file.
Column-oriented - Each attribute of a relation is stored in a separate file.

Star schema - One fact table connected with one or more dimension tables.
Snowflake schema - One fact table, but the dimension tables can connect with other dimension tables.
Multi-Star (constellation) schema - More than one fact table.

OLAP (Online Analytics Processing) - Efficient systems that allow for aggregation and retrieval of large sets of
information in almost real time.
Allows an analyst to look into different cross-tabs on the same data by interacting with the attributes.
Cross-tab (cross-tabulation) or pivot-table - A table derived from a relation (2-dimensional).
Data cube - n-dimensional.
Pivoting - To change between dimensions on a cross-tab.
Slicing/Dicing - Checking a sub-set of the initial dimensions by a certain value.
Rollup - To go from finer granularity to coarser by means of aggregation.
Drill down - To go from coarser granularity to finer.
Hierarchy - Different levels of detail can be organized into a hierarchy, example datetime (day, month, year).
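Rollup can be illustrated with a tiny fact table (the sales figures are invented for the example):

```python
# Rollup: aggregate a fact table from (year, month) granularity up to
# year granularity by summing an additive measure.
from collections import defaultdict

sales = [  # (year, month, amount) facts
    (2023, 1, 100), (2023, 2, 150), (2024, 1, 200), (2024, 3, 50),
]

def rollup_to_year(facts):
    """Sum the additive 'amount' measure over the month dimension."""
    totals = defaultdict(int)
    for year, month, amount in facts:
        totals[year] += amount
    return dict(totals)

print(rollup_to_year(sales))  # {2023: 250, 2024: 250}
```

Drilling down is simply the inverse: going back to the finer (year, month) facts.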

Data visualization - Graphical representation of data.
Data mining - Process of analyzing large databases to find useful patterns.
TF-IDF
Term Frequency (TF):
TF(d, t) = (number of times term "t" appears in document "d") / (total number of terms in document "d")
Inverse Document Frequency (IDF):
IDF(t) = log(N / n(t))
N is the total number of documents
n(t) is the number of documents that contain the term "t"
We can then compute the relevance using the formula TF x IDF.
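The two formulas transcribe directly into code (the toy corpus below is ours):

```python
# TF-IDF exactly as defined above, over a tiny corpus of tokenized documents.
import math

docs = {
    "d1": ["database", "index", "query"],
    "d2": ["query", "plan", "query"],
    "d3": ["storage", "index"],
}

def tf(d, t):
    """Times t appears in d, divided by the total number of terms in d."""
    return docs[d].count(t) / len(docs[d])

def idf(t):
    """log(N / n(t)), with N documents and n(t) of them containing t."""
    n_t = sum(1 for terms in docs.values() if t in terms)
    return math.log(len(docs) / n_t)

def tf_idf(d, t):
    return tf(d, t) * idf(t)

print(tf("d2", "query"))      # 2/3: "query" is 2 of the 3 terms in d2
print(idf("storage"))         # log(3/1): appears in 1 of 3 documents
print(tf_idf("d2", "query"))  # (2/3) * log(3/2)
```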

PageRank
The original PageRank algorithm, described by Lawrence Page and Sergey Brin in 1995, is given by:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of the pages Ti that link (point) to page A,
C(Ti) is the number of outbound links on page Ti,
d is the damping factor, which varies between 0 and 1.

Example:
Let d = 0.5:
PR(A) = 0.5 + 0.5 (PR(C)/1)
PR(B) = 0.5 + 0.5 (PR(A)/2)
PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B)/1)

Given that in practice there are millions of unknowns, the PageRank system of equations is solved iteratively, initializing the variables with the value 1:

iteration      A           B           C
0              1           1           1
1              1           0.75        1.25
2              1.125       0.75        1.125
3              1.0625      0.78125     1.15625
4              1.078125    0.765625    1.15625
5              1.078125    0.769531    1.152344
Final ranking: C (1st), A (2nd), B (3rd).
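The iteration in the table can be reproduced exactly (same d and the same link structure as the three equations above: A links to B and C, B links to C, C links to A):

```python
# Iterative solution of the three-page PageRank example, d = 0.5,
# starting every variable at 1 and updating synchronously per iteration.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}

for _ in range(5):
    prev = dict(pr)  # use the previous iteration's values, as in the table
    pr["A"] = (1 - d) + d * (prev["C"] / 1)
    pr["B"] = (1 - d) + d * (prev["A"] / 2)
    pr["C"] = (1 - d) + d * (prev["A"] / 2 + prev["B"] / 1)

print(pr)  # {'A': 1.078125, 'B': 0.76953125, 'C': 1.15234375}
```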
Considering that in the fact table the data can be:
· Additive: attributes that can be aggregated (summed) over all dimensions, e.g. sale value (always use Sum())
· Semi-additive: attributes that can be aggregated (summed) over some of the dimensions, e.g. quantity (use Sum() only under particular conditions)
· Non-additive: attributes that cannot be aggregated (summed), e.g. unit price (use Average(), for example)
· Factless: only identifiers exist (use Count() on the identifiers)

1NF (first normal form)
Each column in a table can only have atomic (indivisible) values.
There can be no repeating groups of columns.
Example: in a Books table (ID, Title, Author, Category), a book cannot have more than one author or category; otherwise it would have to be split into two tables.
2NF (second normal form)
1NF + each column that is not part of a candidate key must be fully dependent on the primary key; there are no partial dependencies.
Example: in the same Books table, ID is the primary key, but Author and Category depend on ID and not on Title, so to normalize we have to split it into two tables, Books (ID, Title) and Authors_Categories (Book ID, Author, Category).
