SQL Material
Language
The way of storing relational data
Data Engineering
Think about how the data will be used in the future so that the format
you choose makes sense. Here are some of the questions
you might want to consider:
• How do I store multimodal data, e.g., a sample that
might contain both images and texts?
• Where do I store my data so that it’s cheap and still fast
to access?
Row-Major and Column-Major
• Row-major formats are better when you have to do a lot of writes, whereas column-major ones are
better when you have to do a lot of column-based reads.
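A minimal sketch of the two layouts using plain Python lists. The sample records and the (id, name, price) columns are made up for illustration; real storage formats add headers, encoding, and compression on top of this idea.

```python
# Hypothetical sample data: three records with (id, name, price).
rows = [
    (1, "apple", 0.5),
    (2, "banana", 0.25),
    (3, "cherry", 3.0),
]
n = len(rows)

# Row-major layout: the values of one record sit next to each other,
# so appending a whole record is one contiguous write at the end.
row_major = [value for record in rows for value in record]

# Column-major layout: the values of one column sit next to each other,
# so reading a whole column is one contiguous slice.
column_major = [record[i] for i in range(3) for record in rows]

prices_from_columns = column_major[2 * n : 3 * n]  # contiguous slice
prices_from_rows = row_major[2::3]                 # strided access

print(prices_from_columns)
print(prices_from_rows)
```

Both reads return the same prices; the difference is that the column-major read touches adjacent positions while the row-major read has to skip over the other fields of every record.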
Row-Major and Column-Major
• Consider that you want to store the
number 1000000. If you store it in a text
file, it’ll require 7 characters, and if each
character is 1 byte, it’ll require 7 bytes. If
you store it in a binary file as int32, it’ll
take only 32 bits or 4 bytes.
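The size difference can be checked with a short sketch using Python's standard struct module. The little-endian byte order chosen here is an assumption for the example; the size is the same either way.

```python
import struct

number = 1000000

# Text encoding: one byte per decimal digit in ASCII.
as_text = str(number).encode("ascii")

# Binary encoding: a fixed-width signed 32-bit integer (little-endian).
as_int32 = struct.pack("<i", number)

print(len(as_text))   # 7
print(len(as_int32))  # 4
```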
Data Models
• Data models describe how data is represented. Consider cars in the real world. In a
database, a car can be described using its make, its model, its year, its color, and its
price
• Alternatively, you can also describe a car using its owner, its license plate, and its history
of registered addresses. This is another data model for cars.
• Two types of Data Models: Relational models and NoSQL models.
Relational Data Model
NoSQL Data Model
• All documents in a document database are assumed to be encoded in the same format.
• Each document has a unique key that represents that document, which can be used to retrieve it.
• A document is often a single continuous string, encoded as JSON, XML, or a binary format like BSON
(Binary JSON)
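As a rough illustration of the document model, the sketch below keeps JSON-encoded documents in a Python dictionary keyed by a unique ID. The `put` and `get` helpers and the sample car document are hypothetical and do not correspond to any particular document database's API.

```python
import json

# Hypothetical in-memory "document store": unique key -> encoded document.
store = {}

def put(key, document):
    """Encode the document as a single JSON string and store it under key."""
    store[key] = json.dumps(document)

def get(key):
    """Retrieve a document by its unique key and decode it."""
    return json.loads(store[key])

put("car:1", {"make": "Toyota", "model": "Corolla", "year": 2020,
              "owners": ["Ada", "Grace"]})

car = get("car:1")
print(car["make"], car["owners"])
```

Note that the stored value is one continuous string, and that the nested list of owners lives inside the document rather than in a separate table.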
Graph Data Model
• The graph model is built around the concept of a “graph.”
• A graph consists of nodes and edges, where the edges represent the relationships between the
nodes.
• A database that uses graph structures to store its data is called a graph database.
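A small sketch of the idea, assuming a made-up "follows" relationship between three people. Real graph databases add indexing and a query language; this only shows nodes, edges, and a neighbor lookup.

```python
from collections import defaultdict

# Each edge is (source node, relationship, target node). Sample data is made up.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]

# Adjacency list: for every node, the relationships that start at it.
adjacency = defaultdict(list)
for source, relation, target in edges:
    adjacency[source].append((relation, target))

def neighbors(node, relation):
    """Return the nodes reachable from node over edges of the given type."""
    return [target for rel, target in adjacency[node] if rel == relation]

print(neighbors("alice", "follows"))
```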
Structured and Unstructured
DW, DL, DLH
Data Warehouse
• Data warehouses are central repositories of integrated data from one or more disparate
sources. They store current and historical data in one single place[2] that is used for
creating analytical reports for workers throughout the enterprise.[3] This is beneficial for
companies as it enables them to interrogate and draw insights from their data and
make decisions.[4]
• The data stored in the warehouse is uploaded from the operational systems (such as
marketing or sales). The data may pass through an operational data store and may
require data cleansing[2] for additional operations to ensure data quality before it is
used in the data warehouse for reporting.
• Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main
approaches used to build a data warehouse system
Data Warehouse
• Problems with a Traditional Warehouse:
✓ Limited scalability
✓ Meant for OLTP processing (row-oriented and normalized)
✓ No distributed processing. (A distributed database is a set of
databases stored on multiple computers that appears to
applications as a single database. Distributed processing
occurs when an application system distributes its tasks
among different computers in a network.)
✓ No concurrency (concurrency means many users can access data
at the same time)
✓ Single point of failure
Data Warehouse
• Cloud Data Warehouse:
✓ On-Demand Scalability
✓ OLAP (Columnar Databases)
✓ Concurrency and Distributed Workloads
✓ No Single Point of Failure
✓ Zero Overhead and Maintenance
DW - Tools
• Traditional DW:
✓ SQL Server
✓ PostgreSQL
✓ MySQL
• Cloud DW:
✓ BigQuery (GCP)
✓ Redshift (AWS)
✓ Snowflake
• More DWs:
✓ Apache Hive (Hadoop)
✓ Amazon Athena (AWS)
Declarative & Imperative
• In the declarative paradigm, you specify the outputs you want, and the computer
figures out the steps needed to get you the queried outputs.
• In the imperative paradigm, you specify the steps needed for an action and the
computer executes these steps to return the outputs
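The contrast can be sketched by computing the same totals twice: once with an explicit Python loop (imperative) and once with a SQL query run through Python's built-in sqlite3 module (declarative). The `sales` table and its sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: (region, amount).
sales = [("north", 100), ("south", 250), ("north", 50)]

# Imperative: we spell out every step of the computation.
totals_imperative = {}
for region, amount in sales:
    totals_imperative[region] = totals_imperative.get(region, 0) + amount

# Declarative: we describe the result we want and the database
# engine decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
totals_declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(totals_imperative)
print(totals_declarative)
```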
OLTP vs OLAP
• OLTP: Online Transaction Processing
• OLTP systems are designed to support everyday transaction-
oriented applications in industries such as banking, retail, logistics,
etc.
• Prioritizes fast query processing and maintaining data integrity in
multi-access environments.
• Data is often current, not historical.
• Examples: A bank's system where customers withdraw or deposit
money; a retailer's system where customers make purchases.
OLTP vs OLAP
• OLAP: Online Analytical Processing
• OLAP systems are designed to support complex queries and offer
business insights. They facilitate multi-dimensional analytical
queries, providing a platform for business intelligence and data
mining.
• Simple relationships with fewer joins.
• Aggregated data.
• Commonly uses schemas like star and snowflake.
• Examples: An e-commerce company analysing sales trends over
the past year; a system providing business performance metrics.
What Is a Relational Database
• A relational database is a collection of information that
organizes data in predefined relationships, where data
is stored in one or more tables (or "relations") of
columns and rows.
Different SQL Tools
• MySQL: An open-source relational database management
system, owned by Oracle Corporation. One of the most popular
databases for web-based applications
• Microsoft SQL Server: A relational database management system
developed by Microsoft. Used for a variety of applications ranging
from small applications to large scale enterprise applications. SQL
Server uses T-SQL as its primary querying language
• PostgreSQL: It's an open-source relational database
management system (RDBMS). Known for its extensibility and SQL
compliance. It's not just an SQL processing tool but also offers
"NoSQL" capabilities.
Different SQL Tools
• PL/SQL(Procedural Language for SQL): Predominantly used in
Oracle Databases for writing stored procedures, functions, and
triggers.
• SQLite: A C-language library that offers a lightweight, disk-based
database, which doesn’t require a separate server process. It's
serverless, self-contained, and zero-configuration.
ACID
• ACID is an acronym representing a set of properties that guarantee that database transactions are
processed reliably and ensure the integrity of the database in a transactional system. These
properties are Atomicity, Consistency, Isolation, and Durability. They are fundamental to the
transaction management system of relational databases (SQL databases), and here's what each
one means:
• Atomicity (All or Nothing): This property ensures that all the steps in a transaction are completed
successfully as a single unit.
• Example: If you're transferring $100 from Account A to Account B, both the deduction from Account
A and the addition to Account B must be completed together. If either operation fails, neither
should occur
• Consistency (Follow the Rules): This ensures that a transaction can only bring the database from
one valid state to another, maintaining the database's integrity by ensuring that any data written
to the database must be valid according to all defined rules, including constraints, cascades,
triggers, and any combination thereof.
• Example: The overall balance of the system should remain the same after the transaction. The
sum of all accounts before and after the transaction should be equal.
• Rules such as not allowing accounts to go negative can be enforced using constraints in the
database.
ACID
• Isolation (One at a Time): This property ensures that multiple transactions can occur
concurrently without leading to the inconsistency of database state. Transactions are
protected from each other while they are in a transient state.
• Example: Even if many users are transferring money at the same time, each transaction
should remain isolated. No transaction should see the intermediate results of another.
• Durability (Stick Around): Durability guarantees that once a transaction has been
committed, it will remain so, even in the event of power loss, crashes, or errors
• Example: Once the transaction is committed, the changes (the $100 transfer) are
permanent. Even if the system crashes immediately after, the changes won't be lost.
• In SQL, once the COMMIT command is issued, the changes are written to disk, ensuring
durability.
ACID
• The key properties of transactions are encapsulated in the ACID model, which
stands for Atomicity, Consistency, Isolation, and Durability:
• Atomicity ensures that all operations within the transaction are completed
successfully; if not, the transaction is aborted and no changes are made to the
database.
• Consistency ensures that a transaction can only bring the database from one
valid state to another, maintaining database invariants.
• Isolation ensures that concurrent execution of transactions leaves the
database in the same state as if the transactions were executed sequentially.
• Durability ensures that once a transaction has been committed, it will remain
so, even in the event of a system failure.
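The $100 transfer example can be sketched with Python's built-in sqlite3 module. The `accounts` table, the starting balances, and the `transfer` helper are assumptions made for this illustration. A CHECK constraint enforces the "no negative balance" rule, and the connection's context manager commits on success or rolls back on error, so a failed transfer leaves no partial change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 150), ("B", 0)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move amount from source to target; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

first = transfer(conn, "A", "B", 100)   # A has 150, so this succeeds
second = transfer(conn, "A", "B", 100)  # A has only 50 left, so this fails
print(first, second, balances(conn))
```

In the second call the credit to B has already run when the debit from A violates the constraint; the rollback undoes that credit, which is the all-or-nothing behavior atomicity describes.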
5 Stages of DBMS
• Database management systems have 5 important
characteristics:
• Data Definition (DDL) – Define the data being tracked
• Data Manipulation (DML) – Add, update, and remove data
• Data Retrieval (DQL) – Extract and report the data available in the
database
• Transaction Control (TCL) – Group one or more SQL operations into a
transaction that is treated as a single unit of work
• Administration (DCL) – Define users on the system, security,
monitoring, and system administration
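A short sketch of one statement from most of these categories, run through Python's built-in sqlite3 module. The `employees` table and its rows are made up for illustration. DCL statements such as GRANT are left out because SQLite has no user accounts; they are available in server databases such as MySQL, PostgreSQL, and SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of the data being tracked.
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)"
)

# DML: add, update, and remove rows.
conn.execute("INSERT INTO employees (name, salary) VALUES ('Ada', 5000)")
conn.execute("INSERT INTO employees (name, salary) VALUES ('Grace', 6000)")
conn.execute("UPDATE employees SET salary = 5500 WHERE name = 'Ada'")
conn.execute("DELETE FROM employees WHERE name = 'Grace'")

# TCL: make the changes above permanent as one unit of work.
conn.commit()

# DQL: retrieve data.
rows = conn.execute("SELECT name, salary FROM employees").fetchall()
print(rows)
```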
Database Tables
• A database table is a lot like a spreadsheet.
• Data is kept in columns and rows.
• Each column is assigned:
• A unique name that identifies the column in human-readable form
(e.g., FIRST_NAME, LAST_NAME)
• A data type (e.g., string, date, time, number)
• Optionally, constraints (e.g., whether a value is required, the maximum
length of a string)
• Each row is a distinct database record.
Data Relationships
• One to One - A record in Table A matches exactly one record in Table B.
• One to Many - A record in Table A matches many records in Table B, but each record
in Table B matches only one record in Table A. (Think: an order with multiple items.)
• Many to Many - A record in Table A matches many records in Table B, and a record
in Table B matches many records in Table A.
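These relationship types are usually expressed with foreign keys. The sketch below uses Python's built-in sqlite3 module with made-up `orders`/`order_items` tables for one-to-many and a `students`/`courses` pair joined through an `enrollments` junction table for many-to-many. All table and column names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
-- One to many: each item row points at exactly one order.
CREATE TABLE order_items (
    item_id  INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    product  TEXT
);
-- Many to many: a junction table pairs students with courses.
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id),
    course_id  INTEGER REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO orders VALUES (1, 'Ada');
INSERT INTO order_items VALUES (1, 1, 'pen'), (2, 1, 'ink');
INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO courses VALUES (10, 'SQL'), (20, 'Python');
INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

def count(sql):
    return conn.execute(sql).fetchone()[0]

items_in_order = count("SELECT COUNT(*) FROM order_items WHERE order_id = 1")
courses_for_ada = count("SELECT COUNT(*) FROM enrollments WHERE student_id = 1")
students_in_sql = count("SELECT COUNT(*) FROM enrollments WHERE course_id = 10")
print(items_in_order, courses_for_ada, students_in_sql)
```

A one-to-one relationship can be modeled the same way as one-to-many, with a UNIQUE constraint added to the foreign key column.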
DDL
• DDL - Data Definition Language (e.g., CREATE TABLE...) is used to
define the relational model.
• Under the covers, the RDBMS stores data about your tables in
catalog tables.
• The RDBMS enforces that stored data conforms to the
rules you’ve defined for the data.
DML
• DML - Data Manipulation Language
• Allows you to add (INSERT), change (UPDATE), or remove (DELETE)
data.
• The RDBMS enforces that data manipulation adheres to the rules of the
Data Definition.
• The RDBMS lets you set up rules for multi-user systems.
• These rules manage what happens under competing conditions (for example,
what happens when two users want to update the same data at the
same time).
Retrieval
• Data Retrieval is the act of pulling data out of the database.
• The RDBMS determines the optimal way to retrieve data out of the
database.
• Multi-table joins can become very complex.
• Consider tables with billions and billions of rows.
• Reports can go from seconds to hours when the retrieval strategy
is wrong.
• The RDBMS also considers what happens when updates occur
while your report is running.
Character Set
• Computers are driven by binary information, i.e., ones and zeros.
• A bit is a binary one or zero.
• A byte is a collection of eight bits, e.g., 10000111 (binary) = 135 (decimal).
• ASCII - American Standard Code for Information Interchange
• One of the first character sets
• Limited to 128 characters (mostly letters, numbers, common
punctuation)
• UTF-8 is highly popular and widely used for email and the web. Each character is 1 - 4 bytes long.
• Up to 1,112,064 characters
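These figures can be checked with a few lines of Python. The sample characters are arbitrary choices for illustration, written as escapes so the snippet itself stays pure ASCII.

```python
# Interpret a string of eight bits as a number.
byte_value = int("10000111", 2)

# ASCII defines 128 characters, numbered 0 to 127.
ascii_letter = chr(70)

# UTF-8 uses between 1 and 4 bytes per character.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Unicode has 0x110000 code points; 2,048 of them are surrogates
# that cannot be encoded on their own in UTF-8.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)

print(byte_value, ascii_letter, utf8_lengths, valid_code_points)
```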
Data Normalization
Database normalization is one of the most important factors in database
design or data modeling. It is the process of
eliminating data redundancies and storing the data logically to make
data management easier.
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)
First Normal Form
In the first normal form, each column must contain only one value.
No table should store repeating groups of related data. The easiest
way to follow the first normal form is to inspect the database table
horizontally.
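As a rough illustration, the sketch below starts from a made-up customer table whose `phones` column packs several values into one field, then splits it so that every column holds a single value. The table layout and sample data are assumptions for the example.

```python
# Not in 1NF: the phones column holds a repeating group in one field.
unnormalized = [
    {"customer_id": 1, "name": "Ada", "phones": "555-0100, 555-0101"},
    {"customer_id": 2, "name": "Grace", "phones": "555-0199"},
]

# In 1NF: one table for customers, one row per phone number.
customers = [
    {"customer_id": row["customer_id"], "name": row["name"]}
    for row in unnormalized
]
customer_phones = [
    {"customer_id": row["customer_id"], "phone": phone.strip()}
    for row in unnormalized
    for phone in row["phones"].split(",")
]

for entry in customer_phones:
    print(entry)
```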
Second Normal Form
In the second normal form, the database must first be in the first
normal form, and there should not be any partial dependency: no
non-key column may depend on only part of a composite primary key.
Columns that depend on only part of the key should be stored in their
own separate tables and linked to the original table using foreign keys.
Third Normal Form
In the third normal form, the database must already be in the second
normal form, and there should not be any transitive dependency. Every non-key column must
be mutually independent. Identify any columns in the table that are
interdependent and break those columns into their own separate
tables.
Third Normal Form
Functional Dependency: When the value of one attribute (typically the primary key)
determines the value of another attribute (typically a non-key attribute) within a table, it is
called a functional dependency.
• X -> Y
• Here, X is known as the determinant, and Y is known as the
dependent.
Boyce Codd Normal Form
For a table to satisfy the Boyce-Codd Normal Form, it should satisfy the following two
conditions:
1. It should be in the Third Normal Form.
2. And, for any dependency A → B, A should be a super key.
• BCNF applies to relational databases that have a primary key. A primary key is a column
or combination of columns that uniquely identifies each row in a table.
• BCNF ensures that every determinant of a relation is a candidate key. This means that every
non-trivial functional dependency has a super key as its determinant.
• Informally, every column in a table should depend on the whole key, not on
any subset of it.
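A functional dependency X -> Y can be tested against sample data with a small helper. The `holds` function and the employee rows below are hypothetical. A check like this only shows whether the dependency is violated in the given rows; it cannot prove that the dependency holds for the schema in general.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not violated in rows."""
    seen = {}
    for row in rows:
        x = tuple(row[col] for col in determinant)
        y = tuple(row[col] for col in dependent)
        # The same determinant value must always map to the same dependent value.
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical sample data.
employees = [
    {"emp_id": 1, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "dept_id": 20, "dept_name": "Support"},
]

print(holds(employees, ["emp_id"], ["dept_id"]))     # key determines department
print(holds(employees, ["dept_id"], ["dept_name"]))  # non-key determinant
print(holds(employees, ["dept_id"], ["emp_id"]))     # dept_id is not a key
```

In this sample, dept_id -> dept_name holds even though dept_id is not a super key, which is the kind of dependency that the third normal form and BCNF remove by moving departments into their own table.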
Fourth Normal Form
For a table to satisfy the Fourth Normal Form, it should satisfy the
following two conditions:
1. It should be in the Boyce-Codd Normal Form.
2. And, the table should not have any multi-valued dependency.
A table is said to have a multi-valued dependency if the following
conditions are true:
1. For a dependency A → B, if multiple values of B exist for a single value of A,
then the table may have a multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued
dependency.
Denormalization
Denormalization is a database optimization technique in which we add
redundant data to one or more tables of an already normalized
database to increase performance. By including data from one table
in another table, it reduces the number of costly joins a query
needs, which helps speed up queries.
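A small sketch of the trade-off using Python's built-in sqlite3 module. The `customers`, `orders`, and `orders_denorm` tables and their rows are made up for illustration. The denormalized table copies each customer's city onto every order row, so the report no longer needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO orders VALUES (1, 1, 100), (2, 1, 50), (3, 2, 70);

-- Denormalized copy: city is duplicated onto every order row.
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.customer_id, o.amount, c.city
FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
""")

# Normalized: the report needs a join.
with_join = conn.execute(
    "SELECT c.city, SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id "
    "GROUP BY c.city ORDER BY c.city"
).fetchall()

# Denormalized: the same report reads a single table.
without_join = conn.execute(
    "SELECT city, SUM(amount) FROM orders_denorm GROUP BY city ORDER BY city"
).fetchall()

print(with_join)
print(without_join)
```

The cost is redundancy: if a customer moves to another city, every copied row in the denormalized table has to be updated as well.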
Data Integrity & Referential
Data integrity refers to the overall accuracy and consistency of the
data in your database. You want high-quality data. People lose
confidence in your data when they spot problems like a salary
value that contains alpha characters or a percent increase value
over 100 percent.
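Problems like these can often be rejected at write time with constraints. The sketch below uses Python's built-in sqlite3 module. The `raises` table, its columns, and the `try_insert` helper are assumptions for the example, and the 0 to 100 range is taken from the percent example above rather than from any general rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE raises (
    employee TEXT NOT NULL,
    -- Reject values that are not stored as numbers, e.g. '50k'.
    salary REAL NOT NULL CHECK (typeof(salary) IN ('integer', 'real')),
    -- Reject percent increases outside 0 to 100.
    percent_increase REAL CHECK (percent_increase BETWEEN 0 AND 100)
)
""")

def try_insert(employee, salary, percent):
    """Insert a row; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO raises VALUES (?, ?, ?)",
                (employee, salary, percent),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = try_insert("Ada", 5000, 10)
bad_salary = try_insert("Grace", "50k", 10)    # alpha characters in salary
bad_percent = try_insert("Linus", 5000, 150)   # percent over 100
stored = conn.execute("SELECT COUNT(*) FROM raises").fetchone()[0]
print(ok, bad_salary, bad_percent, stored)
```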