Chapter 3- Query Processing and Optimization
Introduction
Sometimes multiple tables house the various pieces of data you want to access.
In these instances, queries can help you retrieve and compile information from the assorted
tables.
For example, a data analyst might perform a query to find the average ages of a company's
customers.
This information can help the company learn more about its customers and make informed
business decisions.
QUERY
• Filtering data according to specific criteria
• Combining data from various tables
A database query can be either a select query or an action query. A select query is a query for
retrieving data, while an action query requests additional actions to be performed on the data,
such as inserting, updating, or deleting records.
A language used to store and retrieve data from a database is known as a query language.
Procedural Query Languages (How + What)
Non-Procedural Query Languages (What only)
The purpose of a query language is to retrieve data from database or perform various
Relational algebra is more operational and is very useful for representing execution plans.
A procedural query language states both what data is to be retrieved and how it is to be retrieved.
Relational algebra and Calculus mainly provides mathematical theory foundation for
Selection (σ)
Description The SELECT operation selects the subset of tuples that satisfy a given selection condition.
It is denoted by the sigma (σ) symbol and is used as an expression to choose the tuples that meet the selection
condition.
Syntax
σ Condition (Table)
Example
σ DeptId=1 (Student)
Output UID (PK) FullName StudID Batch DeptId (FK)
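The selection above can be sketched in Python as a filter over a list of rows. This is only an illustration: the sample rows and the helper name `select` are assumptions, and only the attribute names come from the Output header.

```python
# Hypothetical sample rows; attribute names follow the Output header above.
STUDENT = [
    {"UID": 1, "FullName": "Abel T.", "StudID": "S-101", "Batch": 2023, "DeptId": 1},
    {"UID": 2, "FullName": "Sara M.", "StudID": "S-102", "Batch": 2023, "DeptId": 2},
    {"UID": 3, "FullName": "Dawit K.", "StudID": "S-103", "Batch": 2024, "DeptId": 1},
]


def select(relation, condition):
    """Selection (sigma): keep only the tuples that satisfy the condition."""
    return [row for row in relation if condition(row)]


# sigma DeptId=1 (Student)
dept_1_students = select(STUDENT, lambda row: row["DeptId"] == 1)
for row in dept_1_students:
    print(row)
```

Every column of each matching tuple is kept; selection changes only which rows appear, not which attributes.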
Project (∏)
Description The Project operator is denoted by the ∏ symbol and is used to select desired columns (or attributes) from a
table (or relation), removing duplicate tuples from the result. It is similar to the column list of a SELECT statement in SQL.
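A minimal sketch of projection in Python, assuming rows are dictionaries. The sample data and the helper name `project` are hypothetical. Because a relation is a set of tuples, the sketch drops duplicates from the result.

```python
# Hypothetical sample rows, used only for illustration.
STUDENT = [
    {"UID": 1, "FullName": "Abel T.", "DeptId": 1},
    {"UID": 2, "FullName": "Sara M.", "DeptId": 2},
    {"UID": 3, "FullName": "Dawit K.", "DeptId": 1},
]


def project(relation, attributes):
    """Projection (pi): keep only the named attributes, without duplicates."""
    seen = set()
    result = []
    for row in relation:
        values = tuple(row[name] for name in attributes)
        if values not in seen:  # relations are sets, so repeated tuples are dropped
            seen.add(values)
            result.append(dict(zip(attributes, values)))
    return result


# pi DeptId (Student)
dept_ids = project(STUDENT, ["DeptId"])
print(dept_ids)
```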
Union (∪)
Description The Union operator is denoted by the ∪ symbol and returns all the rows (tuples) that appear in either of two tables (relations),
automatically eliminating duplicate tuples. For a union operation to be valid, the following conditions must hold: the number and order
of columns in both relations must be the same, and the data types of the corresponding columns must be compatible.
Output DeptId
1
2
3
4
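The union that produces a DeptId list like the one above can be sketched as follows. The two input relations and the helper name `union` are assumptions chosen for illustration; only the column-count condition is checked here, not data types.

```python
def union(r1, r2):
    """Union: all tuples from both relations, with duplicates removed."""
    if r1 and r2 and len(r1[0]) != len(r2[0]):
        raise ValueError("relations are not union-compatible: column counts differ")
    return sorted(set(r1) | set(r2))


# Hypothetical single-column relations holding DeptId values.
student_dept_ids = [(1,), (2,), (3,)]
course_dept_ids = [(2,), (3,), (4,)]

all_dept_ids = union(student_dept_ids, course_dept_ids)
print(all_dept_ids)
```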
RELATIONAL ALGEBRA
Cartesian Product (×)
Description The Cartesian Product is denoted by the × symbol. Given two relations R1 and R2, the Cartesian product
of these two relations (R1 × R2) combines each tuple of the first relation R1 with each tuple of the second
relation R2.
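As a small illustration of the pairing described above, the following Python sketch builds R1 × R2 with `itertools.product`. The contents of `R1` and `R2` are hypothetical.

```python
from itertools import product


def cartesian_product(r1, r2):
    """Combine every tuple of r1 with every tuple of r2."""
    return [t1 + t2 for t1, t2 in product(r1, r2)]


# Hypothetical relations: R1 has 2 tuples, R2 has 3 tuples.
R1 = [(1, "Abel"), (2, "Sara")]
R2 = [("CS",), ("Math",), ("Physics",)]

pairs = cartesian_product(R1, R2)
print(len(pairs))
```

The result always has |R1| × |R2| tuples, which is why a product that is not followed by a selection quickly becomes expensive.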
In a non-procedural query language, the user is not concerned with the details of how to obtain the end
results.
The relational calculus tells WHAT to do but never explains HOW to do it.
Universal Quantifiers: The universal quantifier, denoted by ∀, is read as "for all", which means that in
a given set of tuples all occurrences satisfy a given condition.
Existential Quantifiers: The existential quantifier, denoted by ∃, is read as "there exists", which means
that in a given set of tuples there is at least one occurrence whose value satisfies a given condition.
RELATIONAL CALCULUS
Quantifiers are words, expressions or phrases that indicate the number of elements that a statement refers to.
Existential Quantifier [∃]: indicates that at least one element exists that satisfies a certain condition.
Universal Quantifier [∀]: indicates that all of the elements of a given set satisfy the condition.
• Return all tuples t from the Employee relation where the salary is over 50,000.
• There exists a tuple d in the Department relation such that d.name = 'HR'
• For every tuple e in the Employee relation, e.salary > 0 must be true
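The three statements above map closely onto Python comprehensions, which likewise say what is wanted rather than how to loop. The sample relations below are hypothetical; `any` plays the role of ∃ and `all` the role of ∀.

```python
# Hypothetical relations, used only to illustrate the three statements.
EMPLOYEE = [
    {"name": "Abel", "salary": 62000},
    {"name": "Sara", "salary": 48000},
    {"name": "Dawit", "salary": 75000},
]
DEPARTMENT = [{"name": "HR"}, {"name": "Finance"}]

# { t | t in Employee and t.salary > 50000 }
high_earners = [t for t in EMPLOYEE if t["salary"] > 50000]

# Existential quantifier: there exists d in Department with d.name = 'HR'
hr_exists = any(d["name"] == "HR" for d in DEPARTMENT)

# Universal quantifier: for every e in Employee, e.salary > 0
all_salaries_positive = all(e["salary"] > 0 for e in EMPLOYEE)

print(high_earners, hr_exists, all_salaries_positive)
```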
FILE ORGANIZATION
As we've already seen, a database consists of various elements such as tables, views, indexes,
To a user, data appears in the form of tables or views but these are just logical representations.
What is a File?
A file is a named collection of related data that is stored on secondary storage devices such as
While users interact with structured data using SQL queries, behind the scenes, this data is
converted into binary format and stored across physical memory blocks.
These blocks have fixed capacities and are used to map and store actual data.
Each memory devices will have many data blocks, each of which will be capable of storing
The data and these blocks will be mapped to store the data in the memory.
Any user who wants to view these data or modify these data, simply fires SQL query and gets the
But how is this data fetched from physical memory?
Does simply storing the data on memory devices give us good results when we run
queries? No.
How the data is stored in memory, the access method, the query type, and so on greatly affect how quickly results are returned.
Hence, organizing the data in the database, and therefore in memory, is an important topic to think about.
A database holds a large amount of data, which is grouped into related collections called tables.
Any user will see these records in the form of tables on the screen.
As we saw above, accessing the contents of the files (the records) in physical memory is not
that easy.
They are not stored as tables there, so SQL queries cannot operate on them directly.
To access these files, we need to store them in a certain order so that it is easy to fetch the records.
This is similar to the index of a book or the catalogue of a library, which help us find the required topic or
book, respectively.
The File Organization can be defined as the logical relationships between different
Put simply, file organization means storing files in a particular sequence for
logical control.
Any insert, update or delete transaction on records should be easy, quick and should
Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed. It is a technique used in databases to speed up data retrieval.
It is a data structure technique which is used to quickly locate and access the data in a database.
The first column is the Search key that contains a copy of the primary key or candidate key of the
table. These values are stored in sorted order so that the corresponding data can be accessed
quickly.
The second column is the Data Reference or Pointer which contains a set of pointers holding the
address of the disk block where that particular key value can be found.
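The two-column index described above can be sketched with a sorted list of search keys and a parallel list of block pointers. This is a simplified in-memory model, not how any particular DBMS lays out its index; `DATA_BLOCKS`, the key values, and `lookup` are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical data file: block number -> records stored in that disk block.
DATA_BLOCKS = {
    0: [(101, "Abel"), (102, "Sara")],
    1: [(103, "Dawit"), (104, "Hana")],
    2: [(105, "Liya")],
}

# Index: first column = search keys in sorted order, second column = block pointers.
SEARCH_KEYS = [101, 102, 103, 104, 105]
BLOCK_POINTERS = [0, 0, 1, 1, 2]


def lookup(key):
    """Binary-search the index, then read only the block the pointer names."""
    position = bisect_left(SEARCH_KEYS, key)
    if position == len(SEARCH_KEYS) or SEARCH_KEYS[position] != key:
        return None  # key is not in the index
    block = DATA_BLOCKS[BLOCK_POINTERS[position]]
    for record in block:
        if record[0] == key:
            return record
    return None


print(lookup(104))
```

Because the keys are sorted, the search touches only a logarithmic number of index entries and then a single data block, instead of scanning every block.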
INDEXING
Introduction
SQL simplifies interactions with databases, turning complex data operations into intuitive, structured commands.
Example: A simple SELECT query can retrieve years of sales data in milliseconds.
Behind the scenes, databases process millions of requests daily, each needing to be fast, accurate, and resource-efficient.
Not all queries are created equal: Some crawl; others fly.
Modern DBMSs (like RDBMS) don't just execute queries; they optimize them.
Like a GPS finding the fastest route, the system evaluates multiple paths to deliver results at lightning speed.
One of the types of DBMS is RDBMS where data is stored in the form of rows and columns (in other
words, stored in tables) which have intuitive associations with each other.
The users have the freedom to select, insert, update and delete these rows and columns without
violating the constraints provided at the time of defining these relational tables.
Let's say you want the list of all the employees who have a salary of more than 10,000.
The SQL query "SELECT EMPNAME FROM EMPLOYEE WHERE SALARY > 10000;" is a high-level command used to
retrieve data. However, the DBMS cannot directly execute such high-level statements.
To bridge this gap, SQL is used, as it allows users to query data in a way that's more intuitive and closer to human
Despite its user-friendly syntax, the DBMS does not natively understand SQL.
Instead, SQL queries are passed through a processing unit that translates them into a lower-level language using
Relational Algebra.
This transformation is necessary because relational algebra, although more complex than SQL, provides the
As a result, users are only required to write SQL queries, which the system then processes, optimizing and
Query Processing refers to the series of steps involved in translating high-level queries into low-level
It encompasses a range of activities, including parsing, optimization, and actual execution of the
This process is essential for efficiently retrieving data from the database.
Query processing relies on fundamental concepts such as relational algebra and file structures.
It begins with the translation of high-level database language (like SQL) into intermediate
representations and eventually into low-level instructions that interact with the storage system.
By studying query processing, we gain insight into how queries are interpreted, optimized for
In the compile-time phase, the query compiler translates the high-level query specification into an executable
form.
This process, known as query compilation, includes several steps: lexical analysis, syntactic and semantic
The resulting code typically consists of a sequence of physical operators designed for the database engine.
These operators handle core operations such as data access, joins, selections, projections, grouping, and
aggregation.
In the runtime phase, the database engine interprets and executes the compiled program to retrieve and
This separation of compilation and execution allows for efficient processing and optimization of queries
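Query processing refers to the range of activities involved in extracting data from a database.
These activities include translating queries in high-level database languages into
expressions that can be used at the physical level of the file system, a variety of query-optimizing
transformations, and the actual evaluation of queries.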
Query Optimization
Lexical Analysis: The query is broken down into tokens such as keywords, identifiers, and symbols. During this
process, white spaces and comments are removed, simplifying the input for further analysis.
Syntactic Analysis: The query is checked for correct SQL syntax. The parser verifies whether the structure of the
query conforms to the grammar rules of SQL, ensuring that the command is well-formed.
Semantic Analysis: Beyond syntax, this step validates the meaning of the query. It checks for logical correctness,
such as whether referenced tables and columns exist, whether data types match, and if operations are valid.
Once these checks are successfully completed, the query is translated into intermediate representations such as
These representations help the database engine optimize and evaluate the query more efficiently in later stages.
PARSING AND TRANSLATION
• FROM → Keyword
• WHERE → Keyword
• > → Operator
• Are there any logical issues with the query (e.g., column names and
table relationships)?
• Do SALARY and the value 10000 have the same data type?
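The lexical-analysis step can be sketched with a small regular-expression tokenizer. This is a toy that covers only the example query: the keyword set, the token categories, and the function name `tokenize` are assumptions, and a real SQL lexer handles far more (strings, comments, quoted identifiers).

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE"}  # small illustrative subset

TOKEN_RE = re.compile(
    r"\s*(?:(?P<number>\d+)"
    r"|(?P<word>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<operator>>=|<=|<>|=|>|<)"
    r"|(?P<symbol>[,;()*]))"
)


def tokenize(sql):
    """Split a query into (kind, text) tokens, skipping white space."""
    tokens = []
    text = sql.strip()
    position = 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character near position {position}")
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "word":
            kind = "keyword" if value.upper() in KEYWORDS else "identifier"
        tokens.append((kind, value))
        position = match.end()
    return tokens


for token in tokenize("SELECT EMPNAME FROM EMPLOYEE WHERE SALARY > 10000;"):
    print(token)
```

The syntactic and semantic checks would then run over this token list, for example verifying that EMPLOYEE is a known table and that SALARY is comparable with a number.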
If the checks pass, the query is then translated into relational algebra (like the π operator for projection and the σ operator for selection).
So, the next step is to translate the generated set of tokens into a relational algebra query.
Query optimization is the process of enhancing the efficiency of a query by selecting the most cost-
The primary goal is to reduce the overall cost of query execution, which can include minimizing CPU
and memory usage as well as improving disk I/O performance (e.g., reducing the number of disk
reads).
In relational algebra, optimization involves restructuring the query execution plan to make it faster and
more resource-efficient.
This is achieved through various techniques like reordering operations, selection and projection
Typically, users do not need to write their queries in a way that makes them optimally efficient.
Instead, the database system is expected to handle the task of generating an optimized query execution
plan.
This plan should minimize the cost of executing the query, ensuring efficient resource utilization.
Once a query is submitted to the database server, it is first parsed by the query parser for parsing
After this, the query is passed to the query optimizer, which analyzes the possible execution strategies and
This optimization process happens automatically in the background, and users do not have direct control
over it.
It can range from a simple request like, "Find the address of the person with Social Security Number 123," to a more
complex one such as, "Find the average salary of all employed married men in California between the ages of 30 and
The result of a query is generated by processing the rows in the database to yield the requested information.
Given the complexity of most database structures, especially for more complex queries, the data needed to answer
the query may be retrieved in different ways, using various data structures and in different orders.
Each method of accessing the data typically requires different processing times.
Processing times for the same query can vary widely—from fractions of a second to several hours—depending on the
approach chosen.
The purpose of query optimization, an automated process, is to identify the most efficient way to process a given query
One aspect of query optimization occurs at the relational algebra level, where the system aims to find
an equivalent expression that is more efficient to execute than the original one.
Another aspect involves selecting the optimal strategy for processing the query, such as choosing the
appropriate algorithm for executing operations, determining which indices to use, and other such
decisions.
The difference in cost (in terms of execution time) between an efficient strategy and an inefficient one
This makes it worthwhile for the system to invest considerable time in selecting the best strategy for
The ultimate goal of query optimization is to minimize the system resources required to fulfill a query,
thereby providing faster results to the user. There are several benefits to this:
Improved User Experience: By delivering faster results, the application appears more responsive
to the user.
Increased Throughput: With optimized queries, the system can handle more queries in the same amount of time.
Resource Efficiency: Query optimization reduces the strain on hardware resources, such as disk
drives, and contributes to lower power consumption, reduced memory usage, and overall system
efficiency.
In the earlier example, we translate the query into two relational algebra expressions and submit them to the optimizer.
Expression 1
• Projection (π_EMPNAME) is applied first, which selects only the EMPNAME column from the EMPLOYEE table.
• After that, Selection (σ_SALARY > 10000) is applied to the result, meaning it keeps only the rows where SALARY > 10000.
Expression 2
• Selection (σ_SALARY > 10000) is applied first to the EMPLOYEE table, which keeps only the rows where SALARY > 10000.
• Then, Projection (π_EMPNAME) is applied to the filtered result, keeping only the EMPNAME column.
Expression 2 is more efficient because it filters out unnecessary rows first (with the
Selection) and then projects the results, minimizing the amount of data that needs to be processed.
Expression 1 does the projection first, reducing the number of columns but still keeping all
the rows from the original table, which could be wasteful in terms of resources since many of those rows
are discarded by the later selection. Strictly, that projection would also have to keep SALARY for the
selection to remain applicable.
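The difference between the two orderings can be made concrete by counting how many intermediate rows each one carries. The sketch below uses hypothetical data and function names; in the projection-first variant SALARY is kept in the intermediate result (an assumption) so that the selection can still be evaluated.

```python
# Hypothetical EMPLOYEE rows, used only for illustration.
EMPLOYEE = [
    {"EMPNAME": "Abel", "SALARY": 12000, "DEPT": "IT"},
    {"EMPNAME": "Sara", "SALARY": 8000, "DEPT": "HR"},
    {"EMPNAME": "Dawit", "SALARY": 15000, "DEPT": "IT"},
    {"EMPNAME": "Hana", "SALARY": 9500, "DEPT": "Finance"},
]


def selection_first(rows):
    """Expression 2: sigma, then pi. Returns (names, intermediate row count)."""
    filtered = [r for r in rows if r["SALARY"] > 10000]
    names = [r["EMPNAME"] for r in filtered]
    return names, len(filtered)


def projection_first(rows):
    """Expression 1: pi, then sigma. SALARY is retained so sigma still applies."""
    narrowed = [{"EMPNAME": r["EMPNAME"], "SALARY": r["SALARY"]} for r in rows]
    names = [r["EMPNAME"] for r in narrowed if r["SALARY"] > 10000]
    return names, len(narrowed)


print(selection_first(EMPLOYEE))
print(projection_first(EMPLOYEE))
```

Both orderings return the same names, but the selection-first version passes only the qualifying rows to the next step, while the projection-first version copies every row of the table.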
EXECUTION PLAN
An execution plan is a set of instructions that outlines the steps performed by the database engine during query
execution. This plan is often referred to as the SQL Server execution plan (or simply query plan).
The query optimizer is responsible for generating the execution plan, with the primary goal of creating an optimal and
cost-effective plan.
The optimizer evaluates various possible execution strategies and selects the one that is expected to provide the best
performance.
During optimization, the query processing engine generates multiple candidate execution plans.
From these options, the plan expected to perform best (i.e., the least resource-intensive) is chosen.
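Most systems let you inspect the plan the optimizer chose. As a hedged illustration, the sketch below uses SQLite (through Python's standard `sqlite3` module) rather than SQL Server, and its `EXPLAIN QUERY PLAN` statement. The table layout and the index name are assumptions, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def plan_for(connection, sql):
    """Return the plan steps SQLite reports for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the readable detail


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE EMPLOYEE (EMPID INTEGER PRIMARY KEY, EMPNAME TEXT, SALARY INTEGER)"
)
QUERY = "SELECT EMPNAME FROM EMPLOYEE WHERE SALARY > 10000"

plan_without_index = plan_for(connection, QUERY)

# Hypothetical index; with it the optimizer has a second access path to consider.
connection.execute("CREATE INDEX idx_employee_salary ON EMPLOYEE (SALARY)")
plan_with_index = plan_for(connection, QUERY)

print(plan_without_index)
print(plan_with_index)
```

The same query text yields a different plan once the index exists, which shows that the plan is chosen by the optimizer from the available access paths rather than dictated by the SQL itself.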
A plan cache is a memory location where execution plans are stored for future reuse.
By caching execution plans, the database can avoid regenerating the plan for identical or similar queries, improving performance.
The execution engine is responsible for performing the operations defined in the query execution plan
by fetching data from the database and carrying out the necessary computations. Once a query
evaluation plan is generated by the query optimizer, the execution engine interprets and executes that plan.
In essence, the query execution engine is the component that carries out the work specified by the
query, which includes accessing the relevant data, performing operations (such as joins, aggregations,
and filtering), and processing the results. It interprets the SQL commands in the query, determines the
most efficient way to execute the steps outlined in the execution plan, and retrieves the necessary data
The engine manages the interaction with the database storage, retrieving rows, columns, or index
entries, and may use various techniques (like sequential scans, index scans, hash joins, or nested loops)
to access and manipulate the data efficiently. After processing the query, it returns the computed results
Additionally, the execution engine plays a critical role in query performance by determining how
resources like memory, CPU, and disk I/O are utilized. Through careful management of these resources,
the execution engine aims to execute queries in the most efficient way possible, ensuring fast and
accurate results.
BEST PRACTICE FOR QUERY OPTIMIZATION … Developer Side (1)
There are a number of best practices that developers can adopt to enhance query performance and help
By writing optimized queries, following sound database design principles, and being mindful of how data
is accessed and manipulated, developers can significantly reduce the workload of the query optimizer.
This not only ensures that the database engine executes queries with optimal performance, but also helps
By applying these best practices, developers can ensure that the database system performs at its best,
delivering faster response times and reducing resource consumption, all while minimizing the reliance on
Avoid Unnecessary Complexity: Try to keep queries simple and focused. Overly complex queries with
unnecessary joins, subqueries, or nested operations can confuse the optimizer and result in suboptimal plans.
Limit the Use of Subqueries: If possible, avoid using subqueries, especially in the WHERE clause or FROM clause.
In many cases, subqueries can be rewritten as JOINs or common table expressions (CTEs), which might improve
performance.
Write SELECT E.* FROM EMPLOYEES E JOIN SALARY S ON E.EMPID = S.EMPID WHERE
S.AMOUNT > 10000;
Instead of SELECT * FROM EMPLOYEES WHERE EMPID IN (SELECT EMPID FROM SALARY WHERE
AMOUNT > 10000);
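The two forms above can be compared on a small in-memory SQLite database. The schema and rows are hypothetical. Note one caveat that the sketch relies on: the forms return the same rows only when each employee has at most one matching SALARY row, because the JOIN repeats an employee once per match while IN does not.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE EMPLOYEES (EMPID INTEGER PRIMARY KEY, EMPNAME TEXT);
    CREATE TABLE SALARY (EMPID INTEGER, AMOUNT INTEGER);
    INSERT INTO EMPLOYEES VALUES (1, 'Abel'), (2, 'Sara'), (3, 'Dawit');
    INSERT INTO SALARY VALUES (1, 12000), (2, 8000), (3, 15000);
    """
)

join_rows = connection.execute(
    "SELECT E.* FROM EMPLOYEES E JOIN SALARY S ON E.EMPID = S.EMPID "
    "WHERE S.AMOUNT > 10000 ORDER BY E.EMPID"
).fetchall()

subquery_rows = connection.execute(
    "SELECT * FROM EMPLOYEES WHERE EMPID IN "
    "(SELECT EMPID FROM SALARY WHERE AMOUNT > 10000) ORDER BY EMPID"
).fetchall()

print(join_rows)
print(subquery_rows)
```

Many modern optimizers rewrite such an IN subquery into a join on their own, so it is worth checking the actual plan before assuming one form is faster.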
• Use SELECT Carefully: Avoid SELECT * and only select the columns you need.
• Use WHERE Clauses to Filter Data: Always apply filters early with WHERE clauses to reduce the amount of data processed.
• Limit the Result Set: If you're only interested in the top N rows, use LIMIT or TOP clauses to restrict the number of rows returned.
Avoid Aggregations on Large Result Sets: Applying aggregations (SUM(), AVG(), COUNT(), etc.) to a
large number of rows can be expensive. Try to reduce the result set with a WHERE clause before
applying aggregation.
Use Grouping Efficiently: Use GROUP BY clauses only when necessary. Only apply GROUP BY when
you need to group rows by a specific attribute and then perform an aggregation.
Use WHERE to Filter Before Grouping: If you're filtering rows before applying the aggregation,
make sure to use the WHERE clause instead of the HAVING clause, as WHERE filters rows before the
aggregation is applied, while HAVING filters groups after it.
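A small SQLite sketch of the WHERE versus HAVING point, with a hypothetical EMPLOYEES table. Both queries return the same result here, but the WHERE version discards non-matching rows before any group is built.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE EMPLOYEES (EMPID INTEGER PRIMARY KEY, DEPT TEXT, SALARY INTEGER);
    INSERT INTO EMPLOYEES VALUES
        (1, 'IT', 12000), (2, 'IT', 14000), (3, 'HR', 8000), (4, 'HR', 9000);
    """
)

# Preferred: rows for other departments are removed before grouping.
where_rows = connection.execute(
    "SELECT DEPT, AVG(SALARY) FROM EMPLOYEES WHERE DEPT = 'IT' GROUP BY DEPT"
).fetchall()

# Same answer, but every department is grouped and aggregated first.
having_rows = connection.execute(
    "SELECT DEPT, AVG(SALARY) FROM EMPLOYEES GROUP BY DEPT HAVING DEPT = 'IT'"
).fetchall()

print(where_rows)
print(having_rows)
```

HAVING remains the right tool when the condition depends on an aggregate, for example `HAVING AVG(SALARY) > 10000`, since that value does not exist before grouping.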
Batch Updates and Deletes: If your queries involve UPDATE or DELETE operations that affect many
rows, try to batch them into smaller chunks to avoid long transactions and heavy locking.
UPDATE EMPLOYEES SET STATUS = 'ACTIVE' WHERE EMPID BETWEEN 1 AND 1000;
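The UPDATE above handles one range of 1000 rows. A batching loop that walks the whole table in such ranges might look like the sketch below, which uses SQLite and commits after each chunk so that no single transaction holds locks for long. The function name, table layout, and row counts are assumptions for illustration.

```python
import sqlite3


def activate_in_batches(connection, max_empid, batch_size=1000):
    """Run the UPDATE one EMPID range at a time, committing after each range."""
    batches = 0
    for start in range(1, max_empid + 1, batch_size):
        end = min(start + batch_size - 1, max_empid)
        connection.execute(
            "UPDATE EMPLOYEES SET STATUS = 'ACTIVE' WHERE EMPID BETWEEN ? AND ?",
            (start, end),
        )
        connection.commit()  # short transaction: locks are released per batch
        batches += 1
    return batches


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE EMPLOYEES (EMPID INTEGER PRIMARY KEY, STATUS TEXT)")
connection.executemany(
    "INSERT INTO EMPLOYEES VALUES (?, 'INACTIVE')", [(i,) for i in range(1, 2501)]
)
connection.commit()

batches_run = activate_in_batches(connection, max_empid=2500)
remaining = connection.execute(
    "SELECT COUNT(*) FROM EMPLOYEES WHERE STATUS <> 'ACTIVE'"
).fetchone()[0]
print(batches_run, remaining)
```

The trade-off is that the overall change is no longer atomic: if the loop stops partway, earlier batches stay committed, so the operation should be safe to re-run.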
The second query forces the database to scan every row (full table scan) to check the pattern.
FROM EMPLOYEES e