ETL Notes
OLTP (Online Transaction Processing) : - The current business environment. Data that is being modified as part of
day-to-day transactions belongs to online transaction processing.
Data that is used for transactional purposes is called OLTP data.
A database that holds tables in the form of rows and columns is called a relational database.
OLAP (Online Analytical Processing) : - This is the data warehouse, which is used to generate reports for analysis. The
workload is read-heavy: roughly 80% read operations and 20% write operations.
E (Extract / Data Ingestion) : - We extract the data from different sources (multiple databases or any other source).
T (Data Transformation or processing) : - We transform the data in the staging area (removing any issues in the data).
L (Loading data into the data warehouse) : - We load the data into the data warehouse for reporting and analysis.
ELT (Extract, Load and Transform) : - We extract the data, load it into the data warehouse first, and transform it
there afterwards.
ETL suits large organizations because it supports incremental loads: the data is transformed in the staging area and
only then loaded into the data warehouse.
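The E, T and L steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows are made-up sample values, the in-memory SQLite database stands in for the data warehouse, and the function names (extract, transform, load, run_etl) are hypothetical.

```python
import sqlite3

# Hypothetical source rows (id, name, sal, deptno); values are illustrative only.
SOURCE_ROWS = [
    (101, "Smith", 2342, 10),
    (102, "Martin", 2342, 20),
    (102, "Martin", 2342, 20),   # duplicate row
    (103, "King", None, 30),     # missing salary
]


def extract():
    """E: read rows from the source (here just an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """T: clean the rows in the staging area before they reach the warehouse."""
    seen = set()
    clean = []
    for row in rows:
        if None in row or row in seen:   # drop rows with nulls and duplicates
            continue
        seen.add(row)
        clean.append(row)
    return clean


def load(conn, rows):
    """L: write only the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emp (id INTEGER, name TEXT, sal INTEGER, deptno INTEGER)"
    )
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")   # stand-in for the data warehouse
    load(conn, transform(extract()))
    return conn.execute("SELECT id, name FROM emp ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run_etl())
```

The point of the ordering is that the warehouse table only ever sees cleaned rows. The duplicate and the row with the missing salary are dropped in staging.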
Data Visualization : - Data can be represented in tables (rows and columns) or as documents. We can use Power BI to
design a report on top of the data warehouse.
Roles in Database
-------------------
Database Administrator : - The database admin is responsible for the design, implementation, maintenance and
operational aspects of on-premises and cloud-based database solutions built on Azure data services.
Managing Database
Data Engineer : -
Compute and Storage : - Resources (CPU and memory) that we provide to the Azure SQL database.
====================
------------------------------
SQL Connection
Azure Database : - Azure Database supports only SQL login or Active Directory authentication.
Indexing
-----------
View
----
A view is a schema object that provides access to specific records. A view is a virtual table based on the result set
of a query.
A view is a window onto specified rows in an underlying table.
A view doesn't hold any data. Whenever a user runs a query on the view, it fetches the data from the base table and
returns it to the user.
User1 (id,sal,deptno)
User2 (id,deptno)
We can create a view over multiple tables to give the user a single place to query them.
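A minimal sketch of the User1 / User2 idea: one user queries the base table (id, sal, deptno), the other queries a view that exposes only (id, deptno). SQLite is used here as a local stand-in for SQL Server, and the names emp and emp_public are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, sal INTEGER, deptno INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [(101, 2342, 10), (102, 2342, 20)])

# The view exposes only id and deptno; it stores no rows of its own.
conn.execute("CREATE VIEW emp_public AS SELECT id, deptno FROM emp")

before = conn.execute("SELECT * FROM emp_public ORDER BY id").fetchall()

# A change to the base table is visible through the view immediately.
conn.execute("INSERT INTO emp VALUES (103, 2422, 30)")
after = conn.execute("SELECT * FROM emp_public ORDER BY id").fetchall()

print(before)
print(after)
```

The second query returns the new row without the view being touched, which shows that the view reads from the base table each time it is queried.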
==========================================
==========================================
-- We cannot select a specific SQL version; it always uses the latest version.
==============================================================
==============================================================
Day02
------
A database is a collection of data. A database can be as simple as a desktop spreadsheet or as complex as a global
system holding petabytes of highly structured information.
-------------------------------
-----------------------
General Purpose : - We can use up to 80 vCPUs for the database, and storage is limited to 4 TB.
Hyperscale : - Hyperscale supports 100 TB of storage or even more, at approx. 7.20/- per GB per month.
Business Critical : - Up to 80 vCPUs can be used, as in General Purpose, but it provides faster storage along with high
availability, backed by secondary replicas that hold copies of the data.
------------------------
IaaS (Infrastructure as a Service) : -
------------------------------
-- We can access the O/S; we can even install any O/S that we require.
-- We can install any application or server software as per our requirement: SSAS, SSIS, SSRS, Power BI.
-- We have full access to the SQL Server features, as they are completely available to us.
-- We have access to the virtual storage where our data files are stored.
Azure SQLDB
---------------
-- We can use a runbook to configure automation tasks, but we cannot use SQL Server Agent jobs.
-- We can use Query Store to monitor execution plans and statistics.
Managed Instance
-----------------
Managed Instance is also an example of a PaaS offering. The differences between Azure SQLDB and Managed Instance
are below.
SaaS (Software as a Service) : - The provider runs the complete application and we only use it. In terms of SQL
databases, Azure SQLDB and Managed Instance are PaaS offerings, not SaaS.
-- Outlook Email
--------------------------------------------------------------
--------------------
-- DDL : - Create/Alter/Drop/Truncate
-- Creating a Table
-- Dropping a Column
-- Dropping a Table
insert into Test
values (101,'Smith',2342,10),(102,'Martin',2342,20),
(103,'King',2422,30)
--begin Tran
update Test
set sal=9000
where id=101
-- Deleting Data
-- Delete can be executed on a single row, but Truncate can only be executed on the entire table
-- Truncate is faster than Delete because it deallocates the data pages instead of removing rows one by one
-- The Delete statement doesn't free the pages immediately, whereas Truncate releases the space immediately
Types of Indexes
------------------
Clustered Index : - It stores the data in ascending key order and keeps the data at the leaf-level pages. It follows
the B-tree structure (DP-300). It stores all the columns of a row together in the same page.
Columnstore Index : - It stores each column in separate pages and in compressed form, which is why it is much faster
to read when a query fetches only a few columns.
Non-clustered Index : - It stores the index keys in a B-tree structure, and its leaf level points to the clustered
index (or to the heap if the table has no clustered index).
Virtual Machine
----------------
When creating a virtual machine we can choose between Azure availability zones and availability sets.
Zone1(Datacenter1)
Zone2(Datacenter2)
Zone3(Datacenter3)
All zones have separate power, cooling, networking, etc. We can rely on the other zones while we perform patching or
upgrades, or for high availability.
--------------------------------------------------------
Elastic Pool
--------------
DB1 : - This database is used in the daytime (9-5) and occupies 2 CPUs.
---------------
We can allocate resources to the elastic pool, and the elastic pool allocates CPU and other resources to the database
that needs them most.
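A rough arithmetic sketch of why pooling helps. DB1 follows the notes (2 CPUs during 9-5); DB2 is an assumed second database with a night-time workload, added purely for illustration. The real sizing of an elastic pool depends on Azure's own limits, which are not modelled here.

```python
# Hypothetical hourly CPU demand (24 values, index = hour of day).
# DB1 follows the notes: 2 CPUs during 9-5. DB2 is an assumed night-time workload.
db1 = [2 if 9 <= hour < 17 else 0 for hour in range(24)]
db2 = [2 if hour < 6 or hour >= 22 else 0 for hour in range(24)]

# Provisioning each database separately means paying for every individual peak.
separate_cpus = max(db1) + max(db2)

# A pool only has to cover the busiest hour of the combined workload.
pooled_cpus = max(a + b for a, b in zip(db1, db2))

print(separate_cpus, pooled_cpus)
```

Because the two assumed workloads never peak in the same hour, the pool needs 2 CPUs where separate provisioning needs 4.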
====================================================
====================================================
Module03
--------
Data comes in all shapes and sizes and can be used for a large number of purposes.
Azure Table : - Azure Table storage can be used to implement the NoSQL key-value model. In this model (Azure Table),
the data for an item is stored as a set of fields.
-----------------------------
This is a non-relational data management system. In Azure Table storage, items are referred to as rows and fields are
called columns.
-----------------------------------------------
-- Data retrieval is fast if we specify the partition key and row key as query criteria.
-- It is schema-free and relationship-free, which is why it is much faster.
-- It is simple to scale: it takes the same time to insert data into an empty table or a table with billions of entries.
-- If we specify the partition key and row key, that makes it much faster.
Disadvantage
------------
-- It is difficult to filter and sort on non-key data; queries that search on non-key fields can result in full table
scans.
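The difference between a key lookup and a non-key search can be sketched with a plain Python dictionary. This does not use the Azure SDK. The entities, field values and function names below are hypothetical and only model the key-value idea.

```python
# Entities keyed by (PartitionKey, RowKey); the other fields are schema-free.
table = {
    ("Sales", "101"): {"name": "Smith", "city": "Pune"},
    ("Sales", "102"): {"name": "Martin"},
    ("HR", "103"): {"name": "King", "city": "Delhi"},
}


def point_lookup(partition_key, row_key):
    """Key-based read: one direct lookup, independent of table size."""
    return table.get((partition_key, row_key))


def scan_by_field(field, value):
    """Non-key filter: every entity has to be inspected (a full table scan)."""
    return [key for key, entity in table.items() if entity.get(field) == value]


print(point_lookup("Sales", "102"))
print(scan_by_field("city", "Delhi"))
```

Note that the second entity has no city field at all, which is what schema-free means in practice. The scan has to check every entity because nothing is indexed on city.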
----------------------
Block Blobs : - Each block in a block blob can vary in size, up to 100 MB. A block blob can contain up to 50,000 blocks,
giving a maximum size of over 4.7 TB.
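The "over 4.7 TB" figure follows directly from the two limits quoted above. A quick check, assuming binary units (1 TB = 1024 * 1024 MB) and using only the limits stated in these notes, which may differ between Azure service versions:

```python
MAX_BLOCKS = 50_000        # blocks per block blob, as stated in the notes
MAX_BLOCK_MB = 100         # maximum block size in MB, as stated in the notes

max_blob_mb = MAX_BLOCKS * MAX_BLOCK_MB          # total size in MB
max_blob_tb = max_blob_mb / (1024 * 1024)        # binary units: 1 TB = 1024 * 1024 MB

print(f"{max_blob_tb:.2f} TB")
```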
Page Blobs : - A page blob is organized as a collection of fixed-size 512-byte pages. A page blob is optimized to
support random read and write operations.
Azure mainly uses page blobs to implement virtual disk storage for virtual machines.
Append Blobs : - They support append operations: we can only add blocks to the end of an append blob; updating or
deleting existing blocks is not supported.
Types of Disk
---------------
Standard HDD Disk : - More than 20 ms latency. This type of storage is good for backup files.
Standard SSD Disk : - 10-20 ms latency. This type of storage is good for test and dev.
Premium SSD Disk : - 5-10 ms latency. This type of storage is good for SQL Server configurations (storing data files
and system data files).
Ultra Disk : - 1-2 ms latency, for a fully configured SQL Server. We should store TempDB and the transaction log on the
Ultra Disk.
-------------------------
File Share
------------
A file share enables us to store a file on one computer and grant access to that file to users and applications running
on other computers. This approach can work well for computers in the same local area network, but it doesn't scale
well as the number of users increases.
--------------------------
Azure File storage lets us create file shares in the cloud and access them from anywhere with an internet
connection.
The application can be running on-premises or in the cloud. We can control access to shares in Azure File storage using
authentication and authorization methods.
Azure Queue Storage
--------------------
Azure Queue storage can be used to store large numbers of messages from any source. We can access these messages
from anywhere in the world.
A queue message can be up to 64 KB in size. The capacity limit depends on the capacity of the storage account.
------------------
Data Lake storage is a massively scalable and secure data lake for high-performance analytics workloads. Azure Data
Lake Storage was formerly known as, and is sometimes still called, Azure Data Lake Store.
Azure Data Lake Storage provides a single storage platform for data analytics, which we can use from Synapse Analytics
through PolyBase.
We can store very large amounts of data in the data lake store compared with a regular storage account.
----------------------------
It is an on-demand analytics platform for big data. Users can develop and run massively parallel data transformation and
processing programs in SQL.
----------------
---------------------
Azure Cosmos DB is a multi-model NoSQL database. We can store semi-structured data in it. Cosmos DB manages data as a
partitioned set of documents.
"CustomerId" : "101"
"Name"
"Customername" :"Allen"
A document can hold up to 2 MB of data, including small binary objects. If we need to store larger blobs as part of a
document, we can use Azure Blob Storage and add a reference to the blob in the document.
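The "store a reference instead of the blob" rule can be sketched as follows. This is plain Python, not the Cosmos DB or Blob Storage SDK: the function name, the Photo and PhotoBlobUrl fields, and the blob URL are all hypothetical, and uploading the blob itself is not shown.

```python
import json

MAX_DOCUMENT_BYTES = 2 * 1024 * 1024   # 2 MB document limit from the notes


def build_customer_document(customer_id, name, photo_bytes, blob_url):
    """Keep small binaries inline; replace a large binary with a reference to a blob."""
    document = {"CustomerId": customer_id, "Customername": name}
    inline = dict(document, Photo=photo_bytes.hex())
    if len(json.dumps(inline).encode("utf-8")) <= MAX_DOCUMENT_BYTES:
        return inline
    # Too large to embed: store only the (hypothetical) blob location.
    document["PhotoBlobUrl"] = blob_url
    return document


small = build_customer_document("101", "Allen", b"\x00" * 10, "https://example.invalid/photos/101")
large = build_customer_document("101", "Allen", b"\x00" * (3 * 1024 * 1024), "https://example.invalid/photos/101")
print(sorted(small))
print(sorted(large))
```

The small binary stays inside the document, while the 3 MB one is replaced by a URL field, so the document itself stays well under the limit.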
DP-420 (Cosmos DB)
Module04
--------
Data Warehouse : - A data warehouse gathers data from many different sources within an organization. This data is then
used as the source for analysis.
Normally we ingest the data from the different sources, then stage it in the staging area, and finally load it into
the data warehouse.
-- If it is an on-prem environment
SSIS
-- If it is the Azure cloud
Azure Data Factory
----------------------------------------
Data Lake
------------
A data lake is a repository for large quantities of raw data. Because the data is raw and unprocessed, it is very fast
to load and update.
-- Data Lake storage organizes our files into directories and subdirectories for improved file organization. Blob
storage can only mimic a directory structure.
-- Data Lake storage supports POSIX (Portable Operating System Interface) file and directory permissions.
-- Azure Data Lake storage is compatible with the Hadoop Distributed File System (HDFS).
--------------------------
Azure Synapse Analytics is an analytics engine. It is designed to process very large amounts of data very quickly.
We can ingest data from external sources such as flat files, Azure Data Lake or other database management systems, and
then transform and aggregate the data as well.
E : - Extracting the data from different sources (the source can be Azure SQLDB, an on-prem SQL Server, a CSV file,
Oracle, MySQL or a website).
T : - Transforming the data: removing any duplicate values, inconsistent values, invalid values and nulls in the
data.
E : - Extracting the data from different sources (the source can be Azure SQLDB, an on-prem SQL Server, a CSV file,
Oracle, MySQL or a website).
T : - After loading the data into the data warehouse, we can transform the data.
Transforming the data: removing any duplicate values, inconsistent values, invalid values and nulls in the data.
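The load-first, transform-later order can be sketched like this. The in-memory SQLite database is only a stand-in for the data warehouse, and the table names raw_emp and emp and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the data warehouse

# E + L: land the raw rows as they are, including the duplicate and the null.
conn.execute("CREATE TABLE raw_emp (id INTEGER, name TEXT, sal INTEGER, deptno INTEGER)")
conn.executemany(
    "INSERT INTO raw_emp VALUES (?, ?, ?, ?)",
    [
        (101, "Smith", 2342, 10),
        (102, "Martin", 2342, 20),
        (102, "Martin", 2342, 20),   # duplicate value
        (103, "King", None, 30),     # null value
    ],
)

# T: transform inside the warehouse, after loading, using SQL.
conn.execute(
    """
    CREATE TABLE emp AS
    SELECT DISTINCT id, name, sal, deptno
    FROM raw_emp
    WHERE sal IS NOT NULL
    """
)

clean_rows = conn.execute("SELECT id, name FROM emp ORDER BY id").fetchall()
print(clean_rows)
```

The raw table keeps all four rows, so the original data stays available in the warehouse, while the cleaned table is derived from it by a query.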
Power BI Desktop
Power BI Service
Power BI Apps
---------------------------