Refreshing Material
Tableau
What is Data Visualization?
Data Visualization is the process of representing data and information in graphical form. By
transforming written data into charts and graphics, we are able to identify trends and patterns
better. The goal of data visualization is not merely to convert data into an image, but to make
the information easier to understand. We tend to grasp visual graphics better and more easily,
which is why quarters and percentages are often represented as pie charts.
Examples of Data Visualization tools:
• Tableau
• Power BI
• Infogram
Introduction to Tableau
Tableau is a powerful and fast-growing data visualization tool. It is a BI tool that helps to interpret
raw data by converting it into a proper visual form; this may be a graph, report, chart, pie chart,
etc.
The software doesn’t require any high-level technical knowledge or programming skills to be
operated. It is very easy to use for creating visual graphics, and the results created can be
understood by professionals working at any level.
Tableau Desktop
Tableau Desktop is a data visualization application that will help you to transform any sort of
data into graphics within a few minutes. After installation, you can retrieve data from any
spreadsheets and present that information in different graphical forms.
Tableau Public
Tableau Public is a free program that lets anyone connect to a spreadsheet or file and
create interactive data visualizations for the web.
With Tableau Public, users can create rich interactive graphics without the help of
programmers or technical staff, and publish them easily.
Dimension vs Measure
• A dimension is an independent variable; a measure is a dependent variable.
• A dimension field can filter individual data elements; a measure field filters through a range.
• A discrete field becomes a header in the view; a continuous field becomes an axis in the view.
• Dimensions bring detail to the view; measures bring aggregates to the view.
• A discrete field can have a hierarchy; a continuous field cannot.
3) After the file is opened, click on the sheet tab to start the analysis process.
Connection to Database
1. Under the Data tab, click on the database connection that you wish to connect to. For
example, if you wish to connect to MySQL, select the MySQL option.
String Data type
As the name shows, this string type has a variable length. The user can input as many
characters as required without facing any restrictions.
Numeric Data type
This data type includes both integer and floating-point values. Users generally prefer the
integer type over the floating type, as whole numbers are easier to work with; decimals can be
rounded after a certain limit. Tableau even has a specific function, ROUND(), which is used for
rounding float values.
Date and Time Data type
Tableau supports all sorts of formats for date and time. You can use a format such as
"dd-mm-yy" or "mm-dd-yy", or anything else. The date can be used at any level of detail, such
as year, month, day, hour, minute, or even decade.
Boolean Data type
Boolean data types hold either True or False. These values are produced by relational
(comparison) calculations. When the result of such a calculation is unknown, it is shown as
Null.
Geographic Data type
Values that are used in maps are geographic data types. Examples include country name,
state name, city name, postal code, etc.
Cluster or Mixed Data type
When a data set contains values that mix several data types, such values are known as
cluster group values or mixed data values. They can be handled manually, or you can allow
Tableau to handle them.
Relationships vs Joins
Relationships:
• Are displayed as flexible "noodles" between the logical tables.
• Require you to select matching fields between two logical tables.
• Do not require you to select join types.
• Make all the row and column data from the related tables available in the data source.
• Maintain the level of detail of each table in the data source and during analysis.
• Create independent domains at multiple levels of detail; tables are not merged together in the data source.
Joins:
• Are displayed with Venn diagram icons between physical tables.
• Require you to select join types and join clauses.
• Merge physical tables into a single logical table with a fixed combination of the data.
• May drop unmatched measure values.
• May duplicate aggregate values when fields are at different levels of detail.
• Support scenarios that require a single table of data, such as extract filters and aggregation.
Creating a Join:
• When you need to create a join, first connect to all the relevant data sources.
• Drag the first table to the canvas.
• Select Open from the menu, or double-click the first table, to open the join canvas.
Types of Join
In Tableau, there are four types of joins:
• Inner Join
• Left Join
• Right Join
• Full Outer Join
Union
A union is another method of combining two or more tables, by appending the rows of data
from one table to another. The tables that you union must have the same number of fields,
and those fields must have matching names and data types.
Data Blending
Blending data is a very powerful and useful technique in Tableau. When you want to analyze
data that is related but lives in multiple data sources, blends let you analyze that data
together in a single view. Blending doesn’t combine the data; instead, it queries
each data source separately, and the results are displayed together.
3. Now, go to other data sources and ensure that there is a blend relation with a primary
data source.
4. Drag a field from a secondary data source.
Let’s take an example. Suppose the primary data source has a month field with values of
January, February, and March. The view will show Jan, Feb, and March data
along with the data from the secondary data source. Even if the secondary data source has
values for all twelve months, only those three months will be displayed.
In calculations, a field that comes from another data source is referenced by its data
source using dot notation.
Apart from calculations, there are certain limitations when working across blended data
sources. You may not be able to sort the data by a field from a secondary data source,
and action filters may not work properly with blended data.
Data Joining vs Data Blending
• Data joining is used when the data sets are from the same source; data blending is used when the data sets are from different sources.
• Joining can use different types of joins; blending can use only a left join.
• With joining, duplication and loss of data are possible; with blending they are not.
• Data joining cannot use published data sources; data blending can use published sources.
• Joins cannot use a calculated field as a key; blends can use a calculated field as a key.
Marks Card
The Marks card is the key element for visual analysis in Tableau. When you drag fields to the
different properties on the Marks card, you add context and detail to the marks in the view.
Highlighting
Highlighting allows you to call attention to marks of interest by colouring specific marks and
dimming the others. You can highlight marks using a variety of tools:
for example, you can manually select the marks you want to highlight, use the
legend to select related marks, use the highlighter to search for marks in context, or create an
advanced highlight action.
Sorting
Sorting is an important feature of data analysis. Sorting is usually applied to fields known as
dimensions. While viewing a visualization, data can be sorted using single-click options from an
axis, header, or field label. In the authoring environment, additional sorting options include
sorting manually in the headers and legends, using the toolbar sort icons, or sorting from the
sort menu.
Grouping
Grouping items within a dimension is very useful when you need to colour certain items,
sum groups of items together, or follow different item groups. A group combines the related
members of a field into a single member.
Creating a Group
Groups can be created in multiple ways: from a field in the Data pane, or
by selecting data in the view and then clicking the group icon.
In the Data pane, right-click a field and select Create > Group.
In the Create Group dialog box, select the members you want to group, and then click Group.
Sets
Sets can be used to compare and ask questions about a subset of data. They are custom
fields that define a subset of data based on some conditions.
Constant Sets
Constant sets cannot be changed. If the underlying data changes, the membership of
a constant set does not change to reflect those differences, which is why they are also known
as manually created sets.
Computed Sets
Computed sets use logic to dynamically update the membership of the set. This is the key
distinction between constant sets and computed sets: changes to the data will change a
computed set itself, as it re-computes what gets classified as IN the set and what falls OUT of
the set.
Bins
Bins are containers of equal size that store the data values falling within each bin's range.
Bins group a set of data into equal intervals, giving a systematic distribution of the data.
In Tableau, bins are typically created from a measure field.
Hierarchies
When a data source is connected, Tableau automatically separates date fields into
hierarchies so that you can easily break them down; you can also build your own hierarchies.
For example, if you have a set of fields named Region, State, and Country, you can
create a hierarchy using these fields so that you can quickly drill down between
levels.
1. Extract Filter
This type of filter is applied when extracting data from different sources. Using an extract
filter reduces the amount of data Tableau has to query.
2. Data Source Filter
This filter is used to restrict certain sensitive data from viewers, while viewers still retain
certain access rights to view the rest of the data. One important thing to remember is that the
data source filter and the extract filter are not linked to each other at all.
3. Context Filter
This filter helps to create filtered data sets by applying relevant presets before the other
filters are computed. The context filter adds actionable context to the data analysis process.
4. Dimension Filter
These filters are applied to the dimension field. It includes filtering based on a certain
category of text or numerical data.
5. Measure Filter
This filter is applied on the Measure field. A measure field has quantitative data and
thus filtering is based on the calculation part.
6. Visual Filter
While converting the data into visual form, you have to think not only as an analyst
but also as a designer and as an end reader of data.
7. Interactive Filter
You can add context and make your data interactive using actions. Users are
going to use your data and interact with it, so you need to make sure that your data is
worth interacting with. The different types of actions are listed below:
• Highlight
• Filter
• URL
• Parameter
• Set values
8. Date filter
Using this filter, the user can choose to filter on a relative date, a range of dates, a discrete
date, or an individual date. The moment you drag a date field to the Filters shelf, a box with
these options appears.
Charts in Tableau
Let’s see different forms of charts that we can make in Tableau.
Bar Charts
Bar charts are simple and can be used when your data can be divided into various categories.
With the help of bar charts, you can identify trends, compare high and low values, and
compare historical data at a glance.
Stacked Bar Chart
A stacked Bar Chart is a simple bar chart with further segmented bars. The bars in the chart
are then categorized further. The bars are internally divided to provide more advanced
details.
Line Charts
Line charts show the information via data points that are connected by line segments. The
result drawn from this chart is easy to understand and visualize.
Histogram
A histogram represents the frequencies of values of a variable bucketed into ranges. A
histogram is similar to a bar chart, but it groups the values into continuous ranges. The height
of each bar in the histogram represents the number of values present in that range.
A step-by-step process of creating Histogram
1. Under Toolbar, click on show me and then click on histogram chart.
2. Now, drag a segment to color
3. Hold down the Ctrl key and drag the CNT (Quantity) field from the Rows shelf to Label.
4. Right-click the CNT (Quantity) field on the Marks card and select Quick Table
Calculation > Percent of Total.
5. In the Table Calculation dialog box, change the value of the Compute Using field to
Cell.
Scatter plot
A scatter plot, also known as a scatter chart or scatter graph, uses dots to represent values
for two different variables. The position of each dot shows its values on the horizontal and
vertical axes. This type of visualization can be used to study the relationship between two
different variables.
Process of creating a scatter Plot in Tableau:
1. Select the Measure
2. Drag Measure to the Rows Section
3. Select Two Dimension Fields
Maps
If the user wants to analyze the data information in a geographical format, then he/she can
plot the data on a map in Tableau. It is appropriate to use maps when you have a spatial
question in your mind. A spatial question is something related to size, position, or area.
Bullet chart
A bullet chart is an advanced version of a Bar chart where we can compare two different
variables on a single bar only. In the bullet chart, the main primary variable is shown by the
dark color bar and the second variable is displayed in the light color bar.
Bar-in-bar chart
A bar-in-bar chart shows two bars on a single chart, overlapping each other. It is used when
there’s a requirement to study two variables simultaneously for comparison purposes.
Level of Detail
It allows you to compute the values at the data source level and the visualization level. LOD
expressions are used to run queries that are complex and involve many dimensions at the
data source level instead of bringing all the data to the Tableau interface.
Different Types of LOD expressions
FIXED
It computes a value using the specified dimensions, without reference to the dimensions
in the view.
INCLUDE
It computes the values using the specified dimensions in addition to whatever dimensions are
in the view.
INCLUDE level of detail expressions are useful when you want to calculate at a fine level of
detail in the database and then re-aggregate and show the results at a coarser level of detail
in your view. Fields based on INCLUDE level of detail expressions change when you add or
remove dimensions from the view.
EXCLUDE
It declares dimensions to omit from the view level of details. They are useful for the ‘percent
of total’ or ‘difference from overall average’ scenarios and can be comparable to such features
as Totals and Reference Lines.
Expression Syntax
A level of detail expression has this structure:
{ [FIXED | INCLUDE | EXCLUDE] <dimension declaration> : <aggregate expression> }
For example, { FIXED [Region] : SUM([Sales]) } computes the sum of sales per region,
regardless of the dimensions in the view.
Aggregation and replication with LOD expressions
FIXED-Aggregated Results
If the LOD expression’s results are more granular, the values from the LOD Expression are
aggregated to create the view.
• The LOD of the view is simply Segment, but the LOD expression is fixed at the more
granular level of both Segment and Category.
• The LOD expression’s values for each category are aggregated into a single value per
segment, which is displayed in the view as the result.
• If we wrap a level of detail expression in aggregation when we create it, Tableau will
use the aggregation specified rather than choosing one when that expression is placed
on a shelf.
FIXED-Replicated Results
• If the dimension declaration is more aggregated than the view, the values from the
LOD Expression are replicated to create the view.
• The LOD of the view is both Category and Segment, but the LOD expression is fixed
at the less granular level of just Segment.
The LOD expression’s values for each segment will be replicated for each category within a
segment to be displayed in the view.
Dashboards
It is a collection of several views, which will let you compare a variety of data
simultaneously.
Building a Dashboard
When you’ve created one or more sheets, you can combine them in a dashboard,
add interactivity, and much more.
You create a dashboard in much the same way you create a new worksheet. From the Sheets
list on the left, drag views onto your dashboard at the right.
Adding interactivity
Interactivity can be added to dashboards to enhance the data insights.
Adding an object
From the Objects section at the left, drag an item to the dashboard on the right.
Copying objects
You can copy and paste objects within the current dashboard, or from dashboards in other
sheets and files. Objects can even be copied between Tableau Desktop and Tableau in your
web browser.
You will not be able to copy:
• Sheets in a dashboard
• Items that rely on a specific sheet, such as filters, parameters, and legends
• The contents of layout containers, such as a sheet or filter placed inside them
• Objects on a device layout
• Dashboard titles
Adding images
URLs for web-based images require the HTTPS prefix for improved security. For image URLs
with other prefixes, use the Web Page object.
• From the Objects section at the left, drag an Image object to your dashboard at the
right. Or, on an existing Image object in a dashboard, click the pop-up menu in the
upper corner and choose Edit Image.
• Click either Insert Image File to embed an image file into the workbook, or Link to
Image to link to a web-based image.
• Consider linking to a web-based image when:
the image is very large and the dashboard audience will view it in a browser, or
the image is an animated GIF file.
• If you’re inserting an image, click Choose to select the file; if you’re linking to an
image, enter its web URL.
• Set the remaining image fitting, URL linking, and alt text options.
Dashboards can include layouts for different types of devices spanning a wide range of
screen sizes. When you publish these layouts to Tableau Server or Tableau Online,
people viewing your dashboard experience a design optimized for their phone,
tablet, or desktop.
Story Points
Story points are a very powerful feature of Tableau. Data is often shared as tables of numbers
in Excel, with a chart or two that have to be created and maintained by hand, and they aren't
always done well; an Excel chart can encode one or maybe two values at most, whereas a
Tableau chart can encode several (around six or more) variables.
Tableau story points combined with actions can solve these problems.
• Now you need to add a title to the tableau story that will apply to your entire
dashboard.
• Here you can drag your charts, dashboard, text box, image, or webpage, onto the
dashboard canvas.
• Title and captions on your charts and images should make sense to your end-user.
• To do that, double-click in the “Add a caption” box at the top and add a descriptive
sentence that helps build the action.
• Press Enter. A new Tableau story point will appear.
• Repeat these steps to build your entire story.
• To format your Tableau story points or the title, click the Story menu at the top and
click Format.
Presentation Mode
Presentation mode is used when the user wants to share their findings. In this mode, Tableau
hides the toolbar and menu options and displays only the view. The controls available in
presentation mode include:
Show Filmstrip - shows the sheets as thumbnails at the bottom of the workspace.
Show Tabs - shows the sheet tabs at the bottom of the workspace.
Previous/Next Sheet - advances forward or backward through the sheets in a
workbook.
Enter/Exit Full Screen - switches between expanding the workbook to fill the entire
screen and showing it in a window.
Exit Presentation Mode - returns the workbook to showing the entire workspace
including the menus, toolbar, and the Data pane.
You can print your Tableau reports to a printer or as a PDF. But before hitting the print
option, you are required to do a Page Setup.
SQL
What is a Database?
A database is a systematic collection of data. It supports electronic storage and
manipulation of data. Databases make the management of data easy and less complex.
We are able to organize the data into tables, rows, columns, and indexes, which makes it
easier to find relevant information.
The main purpose of a database is to handle a large amount of information by storing and
managing data.
Generally, modern databases are managed by database management systems (DBMS).
Database Management systems can be of several types. Here is a list of some common
database management systems.
• Hierarchical databases
• Network databases
• Relational databases
• Object-oriented databases
• Graph databases
• Centralized database
• Distributed database
• NoSQL databases
Hierarchical databases
In a hierarchical database management system, data is stored in nodes linked through
parent-child relationships. Besides the actual data, records also contain information about
their groups of parent/child relationships.
In hierarchical databases, data is organized into a tree-like structure. The data is stored in
such a way that each field contains only one value, and records are linked to each other via
parent-child links. Each child record can have only one parent, whereas a parent can have
multiple children. If we want to retrieve a field's data, we need to traverse the tree until the
record is found.
Network Databases
Network databases use a network structure to create relationships between entities. They
are mainly used on large digital computers. They are similar to hierarchical databases, but
unlike hierarchical databases, where one node can have only a single parent, in a network
database a node can be related to multiple entities, so the data looks like an interconnected
network of records.
Here children are known as members and parents are known as owners. The main difference
is that each child (member) can have more than one parent (owner).
Relational Databases
In a relational database management system (RDBMS), data is stored in the form of tables
consisting of columns and rows, and the relationships between the data are relational. Each
column of a table represents an attribute and each row in a table represents a record. Each
field in a table represents a data value.
Structured Query Language (SQL) is the language used to query RDBMS, which includes
inserting, updating, deleting, and searching records.
Each table has a key field that uniquely identifies each row; these key fields can later be used
to connect one table of data to another. Relational databases are the most widely used
databases. Some examples are Oracle, SQL Server, MySQL, SQLite, and IBM DB2.
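To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite being one example of an RDBMS); the employees table, its columns, and the sample rows are hypothetical and used only for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
cur = conn.cursor()

# Each row is a record, each column an attribute; "id" uniquely identifies a row.
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
cur.executemany("INSERT INTO employees (id, name, salary) VALUES (?, ?, ?)",
                [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)])

# SQL is used to query the RDBMS: inserting, updating, deleting, and searching records.
cur.execute("SELECT name, salary FROM employees WHERE salary > 55000")
print(cur.fetchall())                        # [('Ravi', 61000.0)]
conn.close()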
Object-Oriented Databases
In an object-oriented database, data is stored in the form of objects, as used in object-oriented
programming, so the data and the methods that operate on it are kept together.
Graph Databases
These Databases are NoSQL databases as they use a graph structure for semantic queries. The
data in this database is stored in the form of nodes, edges, and properties. In a graph
database, a Node represents an entity or instance such as a customer, person, or car where a
node is equivalent to a record in a relational database system. An Edge in a graph represents
a relationship that connects nodes.
Azure Cosmos DB, SAP HANA, and Oracle Spatial and Graph are some popular graph databases.
Graph capabilities are also supported by some RDBMSs such as Oracle and SQL Server.
Centralized Database
It is a type of database that is stored, located, and maintained at a single location only.
This type of database is modified and managed from that location itself; the location is
typically a central database system or a centralized computer system. The centralized
database is accessed via a network connection (LAN, WAN, etc.) and is mainly used by
institutions and organizations.
Distributed Database
It is a type of database that consists of multiple databases that are connected and are spread
across different physical locations. The data that is stored in various physical locations can
thus be managed independently of other physical locations. The communication between
databases at different physical locations is thus done by a computer network.
NoSQL Databases
These databases do not use SQL as their primary data access language. Graph databases,
network databases, object databases, and document databases are some common NoSQL
databases. A NoSQL database does not have a predefined schema, which makes it very
flexible and allows developers to make changes without affecting applications.
It has five major categories: Column, Document, Graph, Key-value, and Object databases.
Features of RDBMS:
Every database administrator and user relies on SQL queries for accessing and manipulating
the data of database tables and views.
This manipulation and retrieval of the data is performed with the help of reserved words
and characters, which are generally used to perform operations such as
arithmetic operations, logical operations, comparison operations, compound operations,
etc.
In SQL, these reserved words and characters are known as operators; they are generally used
with the WHERE clause of an SQL query. An SQL operator can be either a unary or a binary
operator: a unary operator works on a single operand, whereas a binary operator works on
two operands (a short example follows the list of operator categories below).
Types of Operators-
SQL operators fall into the following categories:
• SQL Arithmetic Operators
• SQL Comparison Operators
• SQL Logical Operators
• SQL Compound Operators
• SQL Unary Operators
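As a rough illustration of these operator categories inside a WHERE clause, here is a small sketch using Python's sqlite3 module; the orders table and its values are made up for this example.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, qty INTEGER, price REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 5, 10.0), (2, 20, 4.0), (3, 2, 50.0)])

# Arithmetic (*), comparison (>=, <) and logical (AND) operators combined.
cur.execute("""
    SELECT id, qty * price AS total          -- arithmetic operator
    FROM orders
    WHERE qty >= 5 AND price < 20            -- comparison + logical operators
""")
print(cur.fetchall())                        # [(1, 50.0), (2, 80.0)]
conn.close()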
Storage Engines
Storage engines (underlying software components) are MySQL components that handle the
SQL operations for different table types and store and manage information in a database.
InnoDB is the most commonly used general-purpose storage engine. The storage engine is
what interacts with the files at the OS level so that the data can be stored in them.
Keys in SQL-
A key is an attribute (a column), or a combination of attributes, that can uniquely identify a
row (a small example follows the list below).
• Primary key-
The primary key helps to identify every record that is present in the table uniquely.
We can have only one primary key in a table while there can be multiple unique
keys.
• Super Key-
It is a set of attributes that can be one or more than one that collectively identifies
an entity set.
• Candidate key-
It is a minimal super key. An entity set can have more than one candidate key.
• Alternate key- When a table has more than one candidate key, then after choosing
the primary key from those candidate keys, the rest of the candidate keys are known
as alternate keys of that table.
• Composite Key- A composite key is a key made up of more than one column.
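As a rough sketch of these key types in SQL (again through Python's sqlite3 module), the employee and enrollment tables below are hypothetical; they show a single-column primary key, a unique column that could serve as a candidate/alternate key, and a composite key.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "emp_id" is the primary key: it uniquely identifies every row in the table.
cur.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        email  TEXT UNIQUE,                  -- another candidate key (an alternate key here)
        name   TEXT
    )
""")

# A composite key uses more than one column to identify a row uniquely.
cur.execute("""
    CREATE TABLE enrollment (
        student_id INTEGER,
        course_id  INTEGER,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)  -- composite primary key
    )
""")
conn.close()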
DDL (Data Definition Language) Commands
• CREATE: It is used to create the database or its objects (such as a table, index,
function, view, stored procedure, or trigger).
• DROP: It is used to delete objects from the database.
• ALTER: It is used to alter the structure of the database.
• TRUNCATE: It is used to remove all records from a table, including all spaces
allocated for the records.
• COMMENT: It is used to add comments to the data dictionary.
• RENAME: It is used to rename an object existing in the database.
CREATE, ALTER, and DROP commands require exclusive access to the specified object. An
ALTER TABLE statement fails if another user has an open transaction on the specified table.
Some more DDL Commands like GRANT, REVOKE, ANALYZE, AUDIT, and COMMENT
commands do not require exclusive access to the specified object.
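Here is a minimal sketch of a few DDL statements issued from Python via sqlite3; the products table is hypothetical. Note that SQLite itself has no TRUNCATE statement, so a DELETE without a WHERE clause is shown in its place.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE products (id INTEGER, name TEXT)")    # CREATE: create an object
cur.execute("ALTER TABLE products ADD COLUMN price REAL")        # ALTER: change its structure
cur.execute("DELETE FROM products")                              # remove all records
cur.execute("DROP TABLE products")                               # DROP: delete the object
conn.close()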
Functions in SQL
They are methods that perform operations on data. SQL has many built-in functions used to
perform string concatenation, mathematical calculations, etc.
Aggregate Functions
These operate on a set of rows and return a single summarizing value, for example COUNT(),
SUM(), AVG(), MIN(), and MAX().
Scalar Functions
These operate on a single input value and return a single value, for example ROUND(),
UPPER(), and LOWER().
SQL Joins
SQL joins are used when we need to combine records from two or more different tables in a
database. In other words, they help us retrieve data from two or more database tables that
are related to each other using primary and foreign keys.
• Multiple Join-
A multiple join is a query statement that retrieves data by combining the records of
more than one table. Whenever we perform more than one join in a single query
statement, we are making use of multiple joins.
Inner join-
The INNER JOIN keyword selects all rows from both tables as long as the condition is satisfied.
This keyword creates the result set by combining all rows from both tables where the
condition is satisfied, i.e., the value of the common field is the same.
Left join-
Left join returns all the rows of the table on the left side of the join and matching rows for the
table on the right side of the join. For the rows for which there is no matching row on the
right side, the result-set will contain null. LEFT JOIN is also known as LEFT OUTER JOIN.
Right join-
RIGHT JOIN is similar to LEFT JOIN. This join returns all the rows of the table on the right side
of the join and matching rows for the table on the left side of the join. For the rows for which
there is no matching row on the left side, the result-set will contain null. RIGHT JOIN is also
known as RIGHT OUTER JOIN.
Full Join-
FULL JOIN creates the result-set by combining results of both LEFT JOIN and RIGHT JOIN. The
result-set will contain all the rows from both tables. For the rows for which there is no
matching, the result-set will contain NULL values.
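The following sketch, again using Python's sqlite3 module with made-up customers and orders tables, shows an INNER JOIN and a LEFT JOIN side by side (RIGHT and FULL OUTER joins behave analogously but are only available in newer SQLite versions).

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 250.0)])

# INNER JOIN: only rows where the common field (cust_id) matches in both tables.
cur.execute("""SELECT c.name, o.amount
               FROM customers c INNER JOIN orders o ON c.cust_id = o.cust_id""")
print(cur.fetchall())        # [('Asha', 250.0)]

# LEFT JOIN: every customer, with NULL (None) where no matching order exists.
cur.execute("""SELECT c.name, o.amount
               FROM customers c LEFT JOIN orders o ON c.cust_id = o.cust_id""")
print(cur.fetchall())        # [('Asha', 250.0), ('Ravi', None)]
conn.close()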
Subqueries in SQL-
A subquery is a query nested within another SQL query, usually embedded in the WHERE
clause. It is used mostly when we need to return data that will be used in the main query as a
condition to further restrict the data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with
operators like =, <, >, >=, <=, IN, BETWEEN, etc. A few rules apply (an example follows the list
below):
• It can have only one column in the SELECT clause, whereas multiple columns are in the
main query for the subquery to compare its selected columns.
• An ORDER BY command cannot be used in a subquery, the main query can use an
ORDER BY. Also, the GROUP BY command can be used to perform the same function
as the ORDER BY in a subquery.
• Subqueries that return more than one row can only be used with multiple value
operators such as the IN operator.
• The SELECT list cannot include any references to values that evaluate to a BLOB
(Binary Large Object).
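Here is a rough sketch of a subquery used as a condition in a WHERE clause, via sqlite3; the orders table and its rows are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 40.0), (2, 90.0), (3, 20.0)])

# The inner query returns the average amount; the outer query compares each row against it.
cur.execute("""
    SELECT id, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
""")
print(cur.fetchall())        # [(2, 90.0)]  -- the average is 50.0
conn.close()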
Stored Procedures-
A stored procedure is SQL code that is saved so that it can be reused many times. Whenever
you need to execute that query, you can simply call the stored procedure to perform the
task; you can even pass parameters to a stored procedure, and the stored procedure will act
based on the parameter values.
Stored procedures provide one or more SQL statements for selecting, updating, or removing
data from database tables. A user-defined procedure is specified by the user, accepts input
parameters, and returns output parameters; DDL and DML commands can be used together
in a user-defined procedure.
When SQL Server is installed, it creates system stored procedures. The system stored
procedures save the administrator from querying or modifying the system and database
catalog tables directly. Developers generally leave system stored procedures untouched.
Filter-
When you want to refine a query by running your aggregations against a limited set of values
in a column, you can use the FILTER clause. The FILTER clause extends aggregate functions
such as SUM, AVG, and COUNT with an additional WHERE clause: the result of the aggregate
is built up only from the rows that also satisfy that additional WHERE clause.
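As a sketch of the FILTER clause (supported by PostgreSQL and by SQLite 3.30 or newer, which recent Python builds bundle), the sales table below is made up for illustration; on older engines a CASE expression inside the aggregate achieves the same effect.

import sqlite3
print(sqlite3.sqlite_version)                # check the bundled SQLite version first

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("East", 100.0), ("West", 40.0), ("East", 60.0)])

# The filtered aggregate is built only from the rows satisfying its own WHERE clause.
cur.execute("""
    SELECT SUM(amount)                                AS total,
           SUM(amount) FILTER (WHERE region = 'East') AS east_total
    FROM sales
""")
print(cur.fetchone())                        # (200.0, 160.0)
conn.close()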
STATISTICS
What is Statistics?
Statistics is a branch of mathematics dealing with data collection, organization, analysis,
interpretation, and presentation.
As defined by the American Statistical Association (ASA), statistics is the science of learning
from data and of measuring, controlling, and communicating uncertainty. Statistics is the art
of learning from data: it is concerned with the collection of data, its subsequent description,
and its analysis, which leads to the drawing of conclusions.
Moreover, in statistics we usually study a large collection of people or objects.
Inferential statistics
In inferential statistics, we work on samples, because it is often too difficult or too expensive
to collect data from the whole population that we are interested in.
What is Probability?
In the most literal sense, the probability is the likelihood of the occurrence of an event.
Probability of an event = (Number of favorable outcomes) / (Total number of possible outcomes)
P(A) = n(E) / n(S)
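A tiny worked example of this formula in Python, using one roll of a fair six-sided die; the event chosen ("roll an even number") is arbitrary.

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}            # n(S) = 6 possible outcomes
event = {2, 4, 6}                            # E = "roll an even number", n(E) = 3

p_event = Fraction(len(event), len(sample_space))
print(p_event)                               # 1/2
print(float(p_event))                        # 0.5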
Probability Distributions
Uniform Distribution - A uniform distribution is fairly simple: every value has an equal chance
of occurring. The distribution is thus made up of random values with no trend in them.
Normal Distribution - The normal distribution is the familiar "Bell Curve"; much real data
follows it closely, but not perfectly (which is usual). It is called a "Bell Curve" because it looks
like a bell.
Binomial Distribution - A binomial distribution is a distribution with two possible outcomes
(the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes.
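For intuition, here is a short sketch drawing samples from these three distributions with NumPy; the sample sizes and parameters are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)

uniform_sample = rng.uniform(low=0.0, high=1.0, size=1000)    # every value equally likely
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)     # the "bell curve"
binomial_sample = rng.binomial(n=1, p=0.5, size=1000)         # 1000 coin tosses (0 or 1)

print(uniform_sample.mean(), normal_sample.mean(), binomial_sample.mean())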
Variable
A variable is a characteristic of a unit that is being observed and can assume more than
one of a set of values to which a numerical measure or a category from a classification
can be assigned.
Some types of variables in the field of Data Science are listed below:
• Numerical
• Categorical
Numerical: This category of variables is the variable that deals with numbers only. This can
now be divided into 2 subcategories.
• Discrete: A discrete variable is a numeric variable. Observations can be taken as a
value that is based on a count from a set of distinct whole values. A discrete variable
cannot take the value of a fraction that lies between one value and the next closest
value.
• Continuous: A Continuous Variable is a numerical Variable that deals with quantities
such as continuous quantities or fractional quantities.
o Height
o Age
o Temperature.
Categorical: Categorical variables have values that describe a ‘quality’ or ‘characteristic’ of a
data unit, such as ‘what type’ or ‘which category’. Categorical variables fall into a limited
number of categories.
Some terminologies in Probability
• Experiment: An activity whose outcome is not known in advance is called an experiment.
Every experiment has some favorable outcomes and some unfavorable outcomes.
• Event: An event is a trial with a clearly defined outcome; for example, getting a tail
while tossing a coin is an event.
• Random Event: An event that cannot be predicted easily is a random event. For such
events, the probability is very low; seeing a shooting star is an example of a random
event.
• Trial: Each of the numerous attempts in the process of experimenting is called a trial;
in other words, any particular performance of a random experiment is called a trial.
An example of a trial is tossing a coin.
• Outcome: The result of a trial is termed an outcome. For example, a footballer either
hits the goal or misses the goal.
• Mutually Exclusive Events: When the happening of one event prevents the happening
of another, the events are known as mutually exclusive events; in other words, two
events are mutually exclusive if they cannot occur at the same time.
Sample
A sample is a subset of the population, to study the larger population we select a sample. In
sampling, we select a portion of a larger population and study that portion to gain
information about the population.
Probability Sampling
Probability sampling represents a group of sampling techniques that help researchers
to select units from a population that they are interested in studying. Types of
Probability Sampling.
• Simple Random Sampling: With simple random sampling, there is an equal
chance (probability) that each of the units could be selected for inclusion in
our sample.
• Systematic Sampling: Every member of the population here is listed with a
number, but instead of randomly generating the numbers, here individuals are
chosen at regular intervals.
• Stratified Sampling: With the stratified random sample, there is an equal
chance (probability) of selecting each unit from within a particular stratum
(group) of the population when creating the sample.
• Clustered Sampling: It is a method where we divide the entire population into
sections or clusters that represent a population. This method is good for
dealing with large and dispersed populations
Non-Probability Sampling
Non-probability sampling represents a group of sampling techniques in which units are
selected using non-random criteria, so not every member of the population has a known
chance of being included. Its types are:
• Quota Sampling: With proportional quota sampling, the aim is to end up with
a sample where the strata (groups) being studied (e.g., male vs. female
students) are proportional to their share of the population being studied.
• Convenience Sampling: A convenience sample is simply one where the units
that are selected for inclusion in the sample are the easiest to access.
• Snowball Sampling: Snowball sampling is particularly appropriate when the
population you are interested in is hidden and/or hard to reach. It can be used
to recruit participants via other participants.
• Judgement Sampling: Also known as selective, or subjective, sampling, this
technique relies on the judgment of the researcher when choosing who to ask
to participate
Range
The range is the most common and most easily understood measure of dispersion. It is
the difference between the two extreme observations of the data set. Because it is based
only on those two extreme observations, it is strongly affected by fluctuations (outliers).
Quartile
The quartiles will divide the data set into quarters. The first quartile, (Q1) will be the
middle number between the smallest number and the median of the data. The second
quartile, (Q2) will be the median of the data set. The third quartile, (Q3) will be the middle
number between the median and the largest number.
A quartile divides a sorted data set into 4 equal parts so that each part represents ¼ of the
data set.
Variance
Variance is the average squared deviation from the mean of a set of data. Variance is
generally used to find the standard deviation.
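A quick numeric sketch with Python's statistics module (population formulas); the data values are arbitrary.

import statistics

data = [4, 8, 6, 5, 3, 7]
mean = statistics.mean(data)                 # 5.5
variance = statistics.pvariance(data)        # average squared deviation from the mean
std_dev = statistics.pstdev(data)            # standard deviation = square root of the variance

print(mean, variance, std_dev)               # 5.5 2.9166... 1.7078...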
T-Test
A t-test is a type of inferential statistic used to evaluate whether there is a significant
difference between the means of two groups that may be related in some way. It is typically
employed when the data sets, such as one representing the results of tossing a coin 100
times, would follow a normal distribution and may have unknown variances.
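Here is a hedged sketch of an independent two-sample t-test with SciPy; the scores in group_a and group_b are made-up data, and equal variances are assumed by default.

from scipy import stats

group_a = [82, 75, 91, 68, 77, 85, 79, 88]
group_b = [71, 64, 78, 69, 74, 66, 73, 70]

res = stats.ttest_ind(group_a, group_b)      # independent (two-sample) t-test
print(res.statistic, res.pvalue)

# A small p-value (e.g. below 0.05) suggests the two group means differ significantly.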
Types of T-Test
There are three types of t-tests: the one-sample t-test, the independent (two-sample) t-test,
and the paired t-test.
Chi-Square
The chi-square test is used with categorical data to check whether the observed frequencies
differ significantly from the expected frequencies, for example to test the association
between two categorical variables.
Anova
It stands for Analysis of Variance. It is used to determine whether there is a statistically
significant difference among more than two group means.
Examples of ANOVA-
We could use a one-way ANOVA test to determine whether, out of three or more rivers, at
least two differ significantly from each other in terms of pH, TDS, etc.
Similarly, we could determine whether at least two regions differ significantly in terms of
average sales of a particular product category.
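A small sketch of the river example with SciPy's one-way ANOVA; the pH readings for the three rivers are invented for illustration.

from scipy import stats

river_1 = [7.1, 7.3, 6.9, 7.2, 7.0]
river_2 = [6.5, 6.7, 6.6, 6.4, 6.8]
river_3 = [7.4, 7.6, 7.5, 7.3, 7.7]

result = stats.f_oneway(river_1, river_2, river_3)
print(result.statistic, result.pvalue)

# A p-value below the chosen significance level suggests at least two river means differ.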
Skewness
It is the degree of distortion from the symmetrical bell curve or the normal distribution. It
measures the lack of symmetry in data distribution.
It differentiates extreme values in one versus the other tail. A symmetrical
distribution will have a skewness of 0
Hypothesis
Hypothesis testing is a statistical method used to make statistical decisions from
experimental data. A hypothesis is an assumption that we make about a population
parameter; hypothesis testing evaluates that assumption using sample data.
Critical Value
To understand critical values, we need to consider two facts. One, the significance level is the
probability of rejecting a null hypothesis that is actually correct. Two, the sampling
distribution for a test statistic assumes that the null hypothesis is correct. The critical value
marks the boundary of the rejection region on that sampling distribution.
P-value
A statistical hypothesis test may return a value called p or the p-value. This is a
quantity that we can use to interpret or quantify the result of the test and either
reject or fail to reject the null hypothesis. This is done by comparing the p-value to a
threshold value chosen beforehand called the significance level. The significance
level is often referred to by the Greek lower-case letter alpha.
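A sketch of this decision rule in Python, reusing the two made-up groups from the t-test sketch above; the significance level alpha is chosen beforehand.

from scipy import stats

group_a = [82, 75, 91, 68, 77, 85, 79, 88]
group_b = [71, 64, 78, 69, 74, 66, 73, 70]

alpha = 0.05                                 # significance level chosen beforehand
res = stats.ttest_ind(group_a, group_b)

if res.pvalue <= alpha:
    print("Reject the null hypothesis")      # evidence of a significant difference
else:
    print("Fail to reject the null hypothesis")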
Skills Required by a Data Scientist
There are two different types of skills required by a data scientist: technical and non-technical
skills. Technical skills include machine learning, deep learning, data visualization, data
wrangling, etc., while the non-technical skills required are strong communication skills, data
intuition, and strong business acumen.
Machine Learning
Machine learning is an interesting branch of artificial intelligence. Machine learning helps us
in accessing data in new ways. For example, Facebook recommends you the ads for
products that you searched on other platforms. This amazing technology helps the machines
access data from the systems and performs smart tasks.
Data visualization
Data visualization is also known as information visualization. It is the process of translating
data and information into a visual context. The visual context involves a chart, bar, graph,
etc. This step in the process shows that the information is collected and processed also. The
visualized information allows the user to conclude.
Data wrangling
It is the process of cleaning, organizing, and transforming raw data into the desired format.
The exact method of data wrangling varies depending on the project and type of data. Data
wrangling helps make the raw data useful; accurately wrangled data helps ensure that correct
data goes into the analysis.
Communication skills
Data scientists extract, understand and analyze data. However, to be successful in the role
and to benefit the organization, you must be able to successfully communicate the results
with your team members who are not from the same background as you.
Computer programming
As a data scientist, you need to know various programming languages. Languages like
Python, C/C++, SQL, and Java are the most common coding languages required in data
science roles.
POWER BI
Introduction
What is Microsoft Power BI?
Microsoft Power BI is a suite that is a collection of business intelligence tools such as software
services, apps, and data connectors. It is a cloud-based platform used to consolidate data from
varied sources into a single data set. These data sets are used for data visualization, evaluation,
and analysis by making sharable reports, dashboards, and apps. Microsoft offers three types of
Power BI platforms i.e. Power BI Desktop (a desktop application), Power BI Service (SaaS i.e.,
Software as a Service), and Power BI Mobile (for iOS and Android devices).
Power BI can be deployed both on-premises and in the cloud. It can import data from local
databases/data sources, cloud-based data sources, big data sources, simple Excel files, and
other hybrid sources. Thus Power BI, a leader among BI tools, proves to be an efficient and
user-friendly tool for data analysis. Its file format is .pbix. It enables users to consolidate data
from multiple sources, make interactive dashboards, evaluate data, create informative
reports, and share them with other users.
In Power BI you can create visualizations such as:
3. Datasets Filtration
Dataset is a single set of data created as a result of taking data from multiple data
sources. You can use the datasets to create visualizations of different kinds. A dataset
can be made of data taken from a single source like an Excel workbook or more than a
data source.
You can filter the datasets and have smaller subsets containing only the important data
and contextual relevance. Power BI provides the users with a wide range of in-built data
connectors such as Excel, SQL database, Oracle, Azure, Facebook, Salesforce, MailChimp,
etc. Users can easily connect to such data sources and create datasets by importing data
from one or more sources.
4. Customizable Dashboards
Dashboards are a collection of visualizations offering meaningful information or insights
into data. Typical dashboards in Power BI are composed of multiple visualizations as tiles.
They are single pages from the reports. The dashboards are shareable as well as printable.
5. Flexible Tiles
A tile is a single block containing a visualization in a Power BI dashboard. Tiles segregate
each informative visualization properly to provide a clearer view. These tiles can be
adjusted and the size can also be changed. Also, they can be placed anywhere on the
dashboard per the users’ convenience.
6. Navigation Pane
The navigation pane has options for datasets, dashboards, and reports. Users can
conveniently work in Power BI and navigate between datasets, the dashboard they are
working on, and reports they are creating.
• Visualizations
• Datasets
• Reports
• Dashboards
a. Visualizations
Next, we’ll create an area chart that will show the total sales of products over 12 months. We
created an area chart by selecting the Area chart from Visualizations and adding respective
fields to it.
Step 8: Creating a multi-row card
In addition to visually representing data via graphs and charts, you can display data as textual
information using cards or multi-row cards. So, we’ll add a Multi-row card from the
Visualizations section.
Step 9: Creating a map showing the total units sold by the state
From the wide range of visualizations available, we can also represent information on the map.
In our dashboard, we will add a map showing the total units sold per state in the USA. We
selected a Filled map from the Visualizations section.
Step 10: Adding a Slicer for Sub-categories of products
Lastly, we’ll add a Slicer for Subcategories of products in the record. Using this slicer, users can
select specific categories and filter through data. Upon selecting in the slicer, all the other
visuals will change and show only the visuals related to the selected field or value.
Step 11: Finish the final dashboard
Now that we are done adding all the different types of visuals and graphics that we needed on
our dashboard, we are nearing the final steps. Adjust and resize the visualizations on the
dashboard as you like. You can also select a theme for the dashboard, its page size,
background, etc.
Step 12: Publish the dashboard
Once your dashboard is ready, you can publish it on the Power BI workspace. First, save your
dashboard in your system.
Go to the Publish option to publish the dashboard on the web. Log in to your Power BI account
and the publishing process will be successful. Then, you will get a link to a web source where
the dashboard is uploaded and available for other users’ access.
DAX in Power BI
DAX stands for Data Analysis Expressions i.e. such expressions or formulas that are used for data
analysis and calculations. These expressions are a collection and combination of functions,
operators, and constants that are evaluated as one formula to yield results (value or values). DAX
formulas are very useful in BI tools like Power BI as they help data analysts to use the data sets
they have to the fullest potential.
With the help of the DAX language, analysts can discover new ways to calculate data values they
have and come up with fresh insights.
DAX is a functional language i.e. its complete code is always a function. An executable DAX
expression may contain conditional statements, nested functions, value references, etc.
DAX formulas have two primary data types; Numeric and Non-numeric or Others. The numeric
data type includes integers, decimals, currency, etc. Whereas, the non-numeric consists of strings
and binary objects.
DAX expressions are evaluated from the innermost function going to the outermost one at the
last. This makes the formulation of a DAX formula important.
DAX Functions
A DAX function is a predefined formula that performs calculations on values provided to it in
arguments. The arguments in a function need to be in a particular order and can be a column
reference, numbers, text, constants, another formula or function, or a logical value such as TRUE
or FALSE. Every function performs a particular operation on the values enclosed in an argument.
You can use more than one argument in a DAX formula.
Key Points about DAX Functions
Here are some unique facts about DAX functions that you must know to understand them better:
Any DAX function always refers to a complete column/field or a table. It will never refer to
individual values. If you want to use the functions on separate values within a column, you need
to apply filters in a DAX formula.
DAX functions provide the flexibility to create a formula that is applied on a row-by-row basis. The
calculations or formulas get applied as per the context of the values in each row.
In some cases, DAX functions return a full table which can be used in other DAX formulas that
need a complete set of values. However, you cannot display this table’s contents.
DAX functions have a category known as time intelligence functions. Such functions are used to
calculate time/date ranges and periods.
Types of Filters in Power BI – Editing View and Reading View
In the Power BI service, reports can be opened in Editing view or in Reading view. In Editing
view (and in Desktop Report view), report owners can add filters to a report, and those Power
BI filters are saved with the report. People viewing the report in Reading view can interact
with the filters and save their changes, but cannot add new filters to the report.
How to Clear & Add Filters in Power BI
a. Clear the Power BI Filter
In either advanced or basic filtering mode, select the eraser icon to clear the filters in
Power BI.
b. Add the Power BI Filter
Follow these steps to add filters in Power BI:
In Power BI Desktop and in the Power BI service Editing view, add a filter to a visual, page,
drill-through, or report by selecting a field from the Fields pane and dragging it into the
appropriate filter well, where you see the words "Drag fields here". Once a field has been
added as a filter, fine-tune it using the Basic filtering and Advanced filtering controls
(described below).
Dragging another field into the Visual level filter area does not add that field to the visual, but
rather it allows you to filter the visual by this new field. In the example below, Chain is added
as another filter to the visual. Notice that simply adding Chain as a filter does not change the
visual until you use the Basic or Advanced filtering controls.
Python
Difference between Python and R
• Python can easily handle large data sets; R is not as suitable for very large data sets.
• Essential Python libraries are NumPy, Pandas, etc.; essential R libraries are caret, tidyverse, etc.
Difference between Python and Java
• Python is dynamically typed; Java is statically typed.
• Python code is shorter and easier to read than the equivalent Java code.
Python Variable
Python is not "statically typed": you don't need to declare variables before using them, or
even declare their type. A variable is created at the moment you first assign a value to it. A
variable is a name given to a memory location; it is the basic unit of storage in a program.
• Values stored in a variable can be changed during program execution.
• A variable is only a name given to a memory location, so all the operations done on
the variable affect that memory location.
Python Data Types
• Numeric: Integer, Float, Complex Number
• Boolean
• Set
• Dictionary
• Sequence Type: Strings, List, Tuple
Python Operators
Python operators are used to perform operations on values and variables.
Arithmetic Operations
These operators are used to perform mathematical operations such as addition (+),
subtraction (-), multiplication (*), division (/ for float division and // for floor division),
modulus (% for the remainder), and power (**).
Comparison Operators
Comparison operators compare the values of two operands and return either True or False
depending on the condition. Examples are greater than (>), less than (<), equal to (==), not
equal to (!=), greater than or equal to (>=), and less than or equal to (<=).
Logical Operators
Logical operators are used to combine conditional statements. They perform logical AND,
logical OR, and logical NOT.
Bitwise Operators
Bitwise operators work bit by bit; they are used to operate on the binary representations of
integers.
Operator Description Syntax
| Bitwise OR a|b
~ Bitwise NOT ~a
Assignment Operators
Assignment operators are used to assign values to variables.
• =   Assigns the value of the right-side expression to the left-side operand. Example: c = a + b assigns the value of a + b to c.
• +=  Adds the right operand to the left operand and assigns the result to the left operand. Example: c += a is equivalent to c = c + a.
• -=  Subtracts the right operand from the left operand and assigns the result to the left operand. Example: c -= a is equivalent to c = c - a.
• *=  Multiplies the left operand by the right operand and assigns the result to the left operand. Example: c *= a is equivalent to c = c * a.
• /=  Divides the left operand by the right operand and assigns the result to the left operand. Example: c /= a is equivalent to c = c / a.
• %=  Takes the modulus of the two operands and assigns the result to the left operand. Example: c %= a is equivalent to c = c % a.
• **= Performs exponential (power) calculation and assigns the result to the left operand. Example: c **= a is equivalent to c = c ** a.
• //= Performs floor division and assigns the result to the left operand. Example: c //= a is equivalent to c = c // a.
Identity Operators
is and is not are identity operators. They are used in the following manner:
• is — True if both operands refer to the same object
• is not — True if the operands refer to different objects
Python Pre-defined Functions or Built-in functions
These are functions that are already defined in the Python programming language. Users can
use these functions directly without defining them first. A few of the built-in functions are
print(), len(), type(), input(), and range().
User-defined Functions
These functions are written by programmers themselves to reduce complexity while programming and to reuse logic according to their needs, for example:
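A minimal sketch of a user-defined function (the function name and values are illustrative):
def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b

print(add_numbers(3, 4))  # 7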
Conditional and Control Statements
Conditional Statements
The if statement
It checks whether <Expressions> evaluates to True; if it does, the block <Statements> is executed immediately, otherwise it is skipped. Note the indentation before the statement block <Statements>.
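A short if/else sketch with illustrative values:
age = 20
if age >= 18:           # <Expressions>
    print("Adult")      # <Statements> runs only when the condition is True
else:
    print("Minor")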
Control Statements
The for loop
It enables you to iterate over an ordered collection of objects and execute the same sequence of statements for each element.
Here, <Statements> is evaluated for each i in the list; during the process, it skips evaluating any instance of <Statements> whenever Condition 2 is True, as sketched below.
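A minimal sketch of such a loop, where the even-number check stands in for Condition 2:
numbers = [1, 2, 3, 4, 5]
for i in numbers:
    if i % 2 == 0:      # Condition 2: skip even numbers
        continue        # skip the remaining statements for this i
    print(i)            # prints 1, 3, 5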
A lambda is a small anonymous function: its expression is executed and the result is returned. We can take an example here to understand it more clearly.
Let's add 25 to argument a and return the result:
x = lambda a: a + 25
print(x(5))   # prints 30
Introduction to Matplotlib
Matplotlib is one of the most used and powerful libraries in Python for Data Visualization. It is a 2D plotting library. The library is structured so that, within a few lines of code, it can generate a visual data plot.
Matplotlib was originally written as an open-source alternative for MATLAB. The key to
understanding the working of plots is to understand Matplotlib pyplot API:
• Figure: The figure is the top-level container. It visualizes everything in a plot including one
or more Axes.
• Axes: It is the area where data is plotted. It includes X-Axis, Y-Axis, and sometimes a Z-Axis
also.
Box Plot
This box plot is used to show a summary of the whole dataset. It contains minimum, first quartile,
median, third quartile, and maximum. The median is present between the first and third quartile.
A Box Plot is the visual representation of the statistical five-number summary of a given data set.
Scatter Plots
Scatter plots use dots to plot individual data points and easily show the relationships between variables. The function scatter() is used for scatter plots. The closer the dots lie to a line, the stronger the relationship shown in the plot.
Pie Chart
This circular chart is used to show the percentage share of the data; it compares the individual categories with the whole. The function pie() is used to create this visualization.
Line Plots
Line plots are used to study the relationship between the x and y axes, where x and y are the coordinates on the horizontal and vertical axes. In Matplotlib, the plot() function is used to create the visual graph. The sketch below shows these basic plot types.
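A minimal Matplotlib sketch of the plot types described above, using made-up data:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]

fig, axes = plt.subplots(2, 2)          # one Figure containing four Axes
axes[0, 0].plot(x, y)                   # line plot
axes[0, 1].scatter(x, y)                # scatter plot
axes[1, 0].boxplot(y)                   # box plot (five-number summary)
axes[1, 1].pie([30, 45, 25], labels=["A", "B", "C"])  # pie chart of percentages
plt.show()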
Introduction to Seaborn
Seaborn is a library in Python that is especially used for making statistical graphics. The library is built on top of Matplotlib and is integrated with Pandas. To learn Seaborn, the user needs to be familiar with NumPy, Matplotlib, and Pandas.
Heatmaps
The Heatmaps showcase the data in 2D form. It shows the data in a colored graph which makes
the understanding of data easy. The function sns.heatmap() is used to create the Heatmap.
Pair Plots
Pair plots are used to study the pairwise relationships among three or more variables. The function sns.pairplot() is used in the code to create pair plots using Seaborn.
Distribution Plots
Distribution plots in Seaborn can be compared with histograms in Matplotlib. Histograms plot frequencies, whereas a distribution plot plots the approximate probability density along the y-axis. The function sns.distplot() is used in the code, as in the sketch below.
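A minimal Seaborn sketch with a small made-up DataFrame; sns.distplot() is used as in the text, though newer Seaborn versions prefer sns.histplot() or sns.displot():
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"height": [150, 160, 165, 170, 180],
                   "weight": [50, 60, 62, 70, 80]})

sns.heatmap(df.corr(), annot=True)   # heatmap of the correlation matrix
plt.show()
sns.pairplot(df)                     # pairwise relationships between the columns
plt.show()
sns.distplot(df["height"])           # approximate probability density of one column
plt.show()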
Introduction to Pandas
Pandas is a library of Python that helps you work with tabular data, time-series data, matrix data, etc. The library allows you to work on big data, analyze it, and derive conclusions from it. A few of the things for which Pandas is used are:
• Data Handling
The Pandas library has made it easy to manage and explore data. The library comes with Series and DataFrames, which make the representation and manipulation of data easy.
• Organization of Data
Having a huge chunk of data without it being properly aligned is useless, as unorganized data is hard to interpret. The organization and labeling of data are handled by Pandas' intelligent methods of alignment and indexing.
• Handling Missing Data
Data is very important for every organization, but one of the big problems associated with it is missing data. It is of utmost importance to handle missing data properly so that the insights derived from the data are accurate. The feature of handling missing data is integrated into the Pandas library.
• Unique Data
Data is not always filtered; it comes with numerous repetitions, and therefore you must analyze the data that has unique values. The Pandas library allows users to see the unique values in a dataset.
• Visualize
An important step in data science is the visualization of data. Visualization makes the data easily understandable. Pandas has in-built features to help users plot the data and see the various kinds of graphs that can be formed.
• Grouping
The option of separating the data based on certain criteria and then grouping it differently is important. Pandas has a feature named GroupBy that allows the user to split the data into different categories.
• Multiple file formats supported
Data does not come from one source or in one particular format; it comes from different sources in different formats. Therefore, a library must support various file formats. Pandas supports a large number of file formats, from JSON to CSV, including Excel and many more.
Series objects in Pandas
A Series is a one-dimensional array that is capable of holding any type of data. In simple terms, a Pandas Series is nothing but a column in an Excel sheet; for example, the columns "name" and "age" each represent a Series.
A Pandas Series can be created using the following constructor parameters (see the example after this list):
• data
Data takes various forms like ndarray, list, constants
• index
an index is used to label the rows. The values of the index should be unique and of the
same length as the data.
• dtype
dtype is used to specify the datatype of the series. If not present, the datatype will be
inferred.
• name
This parameter is used to give a name to the series.
• copy
It is used to copy the input data.
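A minimal sketch of creating a Series with these parameters; the names and values are illustrative:
import pandas as pd

ages = pd.Series(data=[25, 30, 35],
                 index=["Amit", "Bina", "Chetan"],  # labels for the rows
                 dtype="int64",
                 name="age")
print(ages["Bina"])   # 30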
Stacking in Pandas means converting the innermost column index into the innermost row index. Unstacking is its opposite: it converts the innermost row index into the innermost column index.
Introduction To Numpy
NumPy can also be used as an efficient multi-dimensional container for generic data. Using NumPy, arbitrary data types can be easily defined, which allows NumPy to integrate speedily with a wide variety of databases.
Arrays in NumPy
• It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
• Dimensions in NumPy are called axes. The number of axes is the rank.
• NumPy's array class is called ndarray, which is also known by its alias array, as in the sketch below.
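A minimal NumPy sketch of an ndarray, its axes and reshaping; the values are illustrative:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D ndarray
print(a.ndim)            # 2 -> number of axes (rank)
print(a.shape)           # (2, 3)
print(a.reshape(3, 2))   # same data, new shape
print(np.add(a, 10))     # element-wise addition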
Elements Description
reshape Gives a new shape to an array without changing its data.
flat A 1-D iterator over the array.
flatten Returns a copy of the array collapsed into one dimension.
ravel Returns a contiguous flattened array.
transpose Permutes the dimensions of an array.
ndarray.T The same as self.transpose().
rollaxis Rolls the specified axis backwards.
swapaxes Interchanges the two axes of an array.
broadcast Produces an object that mimics broadcasting.
broadcast_to Broadcasts an array to a new shape.
expand_dims Expands the shape of an array.
squeeze Removes single-dimensional entries from the shape of an array.
concatenate Joins a sequence of arrays along an existing axis.
stack Joins a sequence of arrays along a new axis.
hstack Stacks arrays in sequence horizontally (column-wise).
vstack Stacks arrays in sequence vertically (row-wise).
split Splits an array into multiple sub-arrays.
hsplit Splits an array into multiple sub-arrays horizontally (column-wise).
vsplit Splits an array into multiple sub-arrays vertically (row-wise).
resize Returns a new array with the specified shape.
append Appends values to the end of an array.
insert Inserts values along the given axis before the given indices.
delete Returns a new array with sub-arrays along an axis deleted.
unique Finds the unique elements of an array.
Functions Description
add() Adds arguments element-wise.
positive() Numerical positive, element-wise.
negative() Numerical negative, element-wise.
multiply() Multiplies arguments element-wise.
power() Raises the first array's elements to the powers from the second array, element-wise.
MACHINE LEARNING
What is Machine Learning?
Machine Learning (ML) is the part of artificial intelligence that enables machines to become more accurate at predicting outcomes without being explicitly programmed to do so. The algorithms in machine learning use historical data to produce their output.
Need for Machine Learning
Top companies like Netflix, and Amazon use algorithms of Machine Learning models with a huge
amount of data to derive useful insights and obtain accurate results.
Data mining is considered to be one of the popular terms for machine learning as it extracts
meaningful information from a large pile of datasets and is used for decision-making tasks. Some
of the top reasons why Machine Learning is important are:
• Decision Making – The algorithms help businesses make better decisions. For example, they are used to forecast sales, predict stock market conditions, identify risks, etc.
• The increased amount of data – Every industry creates a huge amount of data, which is ultimately required for analysis. Machine learning helps exactly with that: it uses this data to solve problems. Machine learning algorithms help complete the most complicated tasks of an organization with ease.
• Identify patterns and trends – The most important part of Machine Learning is finding hidden patterns. With the help of statistical techniques, Machine Learning goes into detail and studies the data minutely. Understanding the data manually, on the other hand, would be a long process; Machine Learning can perform such operations in just a few seconds.
• Google's Gmail uses machine learning algorithms to filter spam messages and label them as spam, promotional, etc.
• Netflix uses Machine Learning to recommend to its users their next shows and series. More
than 75% of recommendations are from these algorithms.
There are various ways of applying the process of Machine Learning, but there are three main categories, namely:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
It is the approach where we teach machines using well-labeled data. This learning is the easiest to understand and very simple to implement. The user feeds the labeled data to the algorithm, allows the algorithm to predict the results, and then provides feedback so the machine knows whether it is right or wrong.
Over time, the algorithm finds relationships between different parameters, identifying cause-and-effect relationships between variables. By the end of the training, the algorithm has learned how the data in the dataset behaves.
Here, the level of accuracy majorly depends mainly on two factors, the availability of labeled data
and the algorithm used by machines. In addition to these, other factors are:
• Data scientists need to be careful while feeding data to machines. The data must be balanced and thoroughly cleaned; duplicate and missing data would affect the accuracy of the machine.
• The input of diversified data would help the machines learn new cases. If there are not
enough diversified data the machine will fail to provide reliable answers.
• Avoid overfitting. Overfitting is a situation where the model learns the details of the training data to the extent that it negatively impacts the results on a new data set. Therefore, it is important to keep test data separate from training data.
• Supervised learning is used for detecting fraudulent transactions and identifying customers. The algorithms use historic data to identify the patterns of possible fraud.
• This learning is also used for spam detection. The emails considered spam are directly
transferred to the spam folder.
• These algorithms are also used in speech recognition. The algorithm is trained with
voice data for voice-activated passwords and voice commands.
Supervised learning problems can be divided into two types:
• Regression
• Classification
Regression
Regressions are used when there’s a relationship between the input variable and the output
variable and one of them is a dependent variable and the other is an independent one. Regression
analysis helps to understand how a change in the value of an independent variable affects the
dependent variable. This analysis is used in prediction, such as weather forecasting, market
trends, etc.
Types of Regression
Various types of Regression are used in Machine Learning. Below mentioned are a few of the
important types of Regression:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Ridge Regression
• Lasso regression
Classification
Classification is used when the output variable is categorical, such as Yes/No or Spam/Not Spam; the algorithm learns from labeled examples to assign new inputs to one of these discrete classes.
Unsupervised Learning
In unsupervised learning, the machines work with unlabelled data. This means that the assistance
of humans is not required. The machines are trained with unlabelled datasets and the machine
provides insights without human supervision.
This learning aims to find similarities, patterns, and even differences from the unfiltered data.
Unsupervised Learning Algorithms:
• K-Means clustering
• Hierarchical clustering
• Anomaly detection
• Neural networks
Reinforcement Learning
This learning works on a feedback process, where an AI agent (a software component) explores the data by trial and error. It learns from experience and improves its performance.
When the agent finds the correct solution, it receives a reward. Otherwise, it keeps working until the correct solution is found.
Application of Reinforcement Learning
Text mining uses RL to transfer the free text in documents into organized data to make it suitable
for analysis.
• Forecasting outcomes like stock prices, sales, etc. is done based on regression analysis.
• Predicting the success of future retail sales or marketing campaigns to ensure resources are being used effectively.
• Predicting user or customer trends on streaming services or e-commerce websites.
• Analyzing the datasets so that a relationship between variables and an output can be
established.
• Stock prices and rates of interest can be predicted easily by analyzing a variety of factors.
• Creating time series visualizations.
Linear Regression
• Linear regression attempts to find a linear relationship between a target and one or more
predictors.
• It is a statistical regression method used for predictive analysis.
• It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
• In machine learning, regression problems can be easily solved by linear regression.
• It shows the linear relationship between the independent variable on the X-axis and the dependent variable on the Y-axis, hence the name linear regression. A minimal sketch follows this list.
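A minimal scikit-learn sketch of linear regression on made-up data:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([2, 4, 6, 8, 10])            # dependent variable

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)      # learned slope and intercept
print(model.predict([[6]]))               # prediction for a new value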
Applications and Uses of Regression
• It is a prominent machine learning technique that can be used in various fields from stock
markets to scientific research.
• Engine performance can be easily analyzed from test data in automobiles.
• It is used to model causal relationships between parameters in biological systems.
• It can also be used in weather data analysis.
• It is often used in customer survey result analysis and market research studies. It is also
used in observational astronomy for astronomical data analysis.
Regression analysis is commonly used in research to establish that a correlation exists between variables. But correlation is not the same as causation: a relationship between two variables does not mean one causes the other to happen. Even a line in a simple linear regression that fits the data points well does not guarantee a cause-and-effect relationship. Using a linear regression model allows you to discover whether a relationship between variables exists. To understand exactly what that relationship is, and whether one variable causes another, further work and research are needed.
Loss function
• The loss function is a method of evaluating how well an algorithm models the dataset. If your predictions are totally off, the loss function will output a higher number.
• It is a way to measure "how well your algorithm models your dataset."
• If your algorithm models the data well, it will output a lower number. When you change pieces of your algorithm to improve your model, the loss function tells you whether you are getting anywhere.
• When your predictions are off, the loss function produces a high number; if they are good, it produces a lower number.
• It informs you whether you need to change your algorithm as you refine your model; it lets one understand how distinct the predicted values are from the real values.
Loss Function V/S Cost Function
• Loss functions are used for single training examples. • The cost function is the average loss over the entire training dataset.
• The loss function is also known as the error function. • Optimization strategies aim at minimizing the cost function.
• It measures how well your model performs on a single training example. • It considers the entire training set and measures how well the model performs on it.
• The loss function measures the error for a single training example. • The cost function measures the average error for the entire training set.
Regression analysis aims to model the relationship between a certain number of features and a
continuous target variable.
These are some of the performance metrics used for evaluating a regression model:
Mean Squared Error (MSE)
The average of the squared differences between the predicted and actual values. Since it has a convex shape, it is easier to optimize.
Mean Absolute Error (MAE)
The average of the absolute differences between the target values and those predicted by the model. It is not preferred in cases where outliers are prominent.
Root Mean Squared Error (RMSE)
Root Mean Squared Error is the square root of the average of the squared residuals. Residuals are a measure of how distant the points are from the regression line; thus, RMSE measures the scatter of these residuals.
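A minimal sketch of computing these metrics with scikit-learn, using made-up predictions:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # RMSE is the square root of MSE
print(mse, mae, rmse, r2_score(y_true, y_pred))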
R- Squared
It is also known as the Coefficient of Determination. It explains the degree to which the input variables explain the variation of the output/predicted variable.
Adjusted R-squared
Adjusted R² = 1 - [(1 - R²)(N - 1) / (N - P - 1)], where N = the total sample size (number of rows) and P = the number of predictors (number of columns).
A limitation of R-squared is that it will either stay the same or increase with the addition of more variables, even if they have no relationship with the output variable; Adjusted R-squared corrects for this.
Ridge Regression
Ridge regression is a tuning method used to analyze data that suffers from multicollinearity; it performs L2 regularization. When a multicollinearity issue occurs, least-squares estimates are unbiased but their variances are large, which results in predicted values being far away from the actual values.
For a regression machine learning model, the usual regression equation forms the base which can
be written as:
Y= XB + e
• Y is the dependent variable.
• X represents the independent variables.
• B is the vector of regression coefficients.
• e represents the errors, i.e., the residuals.
Lasso Regression
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where the data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models.
"LASSO" stands for Least Absolute Shrinkage and Selection Operator.
It adds the "absolute value of magnitude" of the coefficient as a penalty term to the loss function.
Logistic Regression
Logistic regression is a statistical method used to predict the outcome of a dependent variable based on previous observations. It is a type of regression analysis and a commonly used algorithm for solving binary classification problems.
It is a classification algorithm that predicts a binary outcome based on a series of independent variables; although it is called regression, it is primarily used for classification problems.
When the response variable has two classes, it is referred to as binomial or binary logistic regression; when there are more than two classes, it is called multinomial logistic regression.
• Variable: It refers to any number, characteristics, or even quantity that can be measured
or can be counted. Speed, gender, and income are some of the examples.
• Coefficient: It is a number that is usually an integer multiplied by the variable that it
accompanies.
• EXP: It is a short form of exponential.
• Outliers: They are the data points that significantly differ from the rest.
• Estimator: An algorithm or formula that generates estimates of parameters.
• Chi-squared test: It’s a hypothesis testing method to check whether the data is as
expected.
• Standard error: It is the approximate standard deviation of a statistical sample population.
• Regularization: It is a method used for reducing the error.
• Multicollinearity: It is an occurrence of intercorrelations between two or more
independent variables.
• The goodness of fit: It’s a description of how well a statistical model fits a set of
observations.
• Odds ratio: A measure of the strength of association between two events.
In statistics, the logistic function is used to describe the properties of population growth. The sigmoid function and the logit function are variations of the logistic function.
Logistic regression can be applied to predict a categorical dependent variable such as yes or no, true or false, 0 or 1.
In the case of predictor variables, they can be part of the following categories:
• Continuous data: It can be measured on an infinite scale and can take any amount of value
between two numbers.
• Discrete, nominal data: It fits into the named categories.
• Discrete, ordinal data: Here the Data fits into some form of order on a scale.
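A minimal scikit-learn sketch of logistic regression for a binary outcome, on made-up data:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])        # binary dependent variable

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [4.5]]))      # predicted classes
print(clf.predict_proba([[4.5]]))       # predicted probabilities for each class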
Decision Tree
• A Decision Tree is a supervised learning technique that can be used for both Classification and Regression problems, but it is mostly used and preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent decision rules and each leaf node represents the outcome.
• The decision tree consists of two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• All of the tests and decisions are performed on the basis of the features of the given dataset.
• A decision tree is a graphical representation for getting all of the possible solutions to a problem/decision based on given conditions.
• A decision tree has a structure similar to a tree: it starts with the root node, expands into further branches, and constructs a tree-like structure.
• To build a tree, the CART algorithm can be used, which stands for Classification and Regression Tree algorithm.
• It can contain Yes/No as well as numeric data.
• Root Node: It is where the decision tree starts. It represents the whole dataset, which is then split into two or more homogeneous sets.
• Leaf Node: It is the final output node of the decision tree, after which the tree cannot be split further.
• Splitting: It divides the decision node/root node into sub-nodes according to the specified conditions.
Parent/Child node: Root node of the tree is considered as the parent node and the other nodes
are known as child nodes.
Machine learning has various algorithms, and choosing the best suitable algorithm for the dataset
you are working upon is the main point to remember while creating models in machine learning.
Decision Trees mimic the way humans think while making a decision, which makes them easy to understand. The logic behind a decision tree is very clear and easily understood because it shows a tree-like structure.
• It is a simple process to understand as it follows the same process we humans follow while
making any decision in real life.
• It is a very useful technique for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem
• There is less requirement for data cleaning compared to other algorithms.
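A minimal scikit-learn sketch of a decision tree classifier, using the built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # limiting max_depth pre-prunes the tree
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))     # accuracy on unseen data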
There are some techniques to handle the problem of overfitting such as:
• Pruning: By default, a decision tree can grow up to its full depth. Pruning is a technique that removes parts of the decision tree to prevent it from growing to its full depth; this is done by tuning the hyperparameters of the decision tree model to prune the trees and prevent overfitting.
• Pre-Pruning
• Post-Pruning
• Ensemble- Random Forest:
Information Gain
Entropy
It is a metric that measures the impurity in a given attribute; it defines the randomness in the data and can be calculated as:
• Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
• Here, S = Total number of samples.
• P(yes) = probability of yes
• P(no) = probability of no
Gini Index
Gini index is a measure of the impurity or purity used in the CART (Classification and Regression
Tree) algorithm when creating a decision tree.
An attribute with a low Gini index should be preferred over one with a high Gini index.
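A small sketch of computing entropy and the Gini index for a binary split, where p_yes is the proportion of "yes" samples:
import numpy as np

def entropy(p_yes):
    p_no = 1 - p_yes
    # convention: 0 * log2(0) is treated as 0, so zero probabilities are skipped
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]
    return 0.0 - sum(terms)

def gini(p_yes):
    p_no = 1 - p_yes
    return 1 - (p_yes ** 2 + p_no ** 2)

print(entropy(0.5), gini(0.5))   # maximum impurity: 1.0 and 0.5
print(entropy(1.0), gini(1.0))   # pure node: zero impurity for both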
Ensemble Learning
Ensemble models in machine learning combine the decision and insights from different and
multiple models to perform better and increase their overall decision-making capabilities.
In learning models, there are major sources of error such as noise, variance, and bias. Ensemble methods in machine learning help to minimize these error-causing factors, which ultimately ensures the accuracy and stability of machine learning algorithms.
You can say that ensembles are a divide-and-conquer approach used to improve overall performance.
• Mode: In statistics, the "mode" is the value that appears most frequently in a dataset. In this ensemble technique, machine learning professionals use a number of models to make predictions about each data point. The predictions made by the different models are treated as separate votes, and the prediction made by most of the models is treated as the ultimate prediction.
• Mean/Average: In this technique, data analysts take the average of the predictions made by all the models as the ultimate prediction.
• Weighted Average: In this technique, data scientists assign different weights to the models before combining their predictions, where the assigned weight defines the relevance of each model.
Another ensemble approach seeks a diverse group of members by varying the model types fit on the training data and later using a model to combine their predictions.
Random Forest
Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification
and Regression problems. It builds decision trees on different samples and takes their majority
vote for classification and average in case of regression.
One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It gives better results for classification problems.
Bagging– It creates a different training subset from sample training data with replacement & the
final output is based on majority voting. For example, Random Forest.
Boosting– It combines weak learners into strong learners by creating sequential models such that
the final model has the highest accuracy. For example, ADA BOOST, XG BOOST
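A minimal scikit-learn sketch of bagging with a random forest, using the iris dataset for illustration:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100)  # 100 trees, each built on a bootstrap sample
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))                # classification by majority vote of the trees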
Difference Between Random Forest & Decision Trees
A single decision tree is fast to train and easy to interpret but is prone to overfitting, whereas a random forest builds many trees on random samples of the data and features and combines their votes, which reduces overfitting at the cost of interpretability and training time.
AdaBoost, also called Adaptive Boosting, is a technique in Machine Learning used as an Ensemble Method. The most common algorithm used with AdaBoost is a decision tree with one level, i.e., a decision tree with only one split. These trees are also called Decision Stumps.
What this algorithm does is that it builds a model and gives equal weights to all the data points. It
then assigns higher weights to points that are wrongly classified. Now all the points which have
higher weights are given more importance in the next model. It will keep training models until
and unless a low error is received.
XG Boost
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle
competitions for structured or tabular data.
It is an implementation of gradient boosted decision trees designed for speed and performance.
XG Boost Feature
The library is laser-focused on computational speed and model performance, as such there are
few frills. Nevertheless, it does offer a number of advanced features like gradient boosting,
Regularized gradient boosting and Stochastic gradient boosting.
K Nearest Neighbour
Introduction
K Nearest Neighbour (KNN) is a supervised learning algorithm that uses labelled data to learn a function and produce an appropriate output for new data points. In the KNN process, a value of "K" is selected and the K closest neighbours to a query point are determined. The crucial element in this algorithm is the selection of the "K" value.
How does the KNN algorithm work?
To understand the working of the algorithm, we will take an example: there are two categories
below i.e. circles and squares.
We're required to find the category of a blue star. The star can be in the category circle or square.
We will now make a circle with BS as the center just as big as enclosing only three data points on
the plane.
The three closest points to the star are all circles; therefore, the star would belong to the circle category. In the whole process, the selection of K is the most important step, as the accuracy of the result depends upon it.
Selecting the value of K
• There is no predefined method to determine the value of K, apart from the elbow method. You can start computing with a random value of K and later increase or decrease it according to the accuracy.
• The value of K depends on the amount of data. In the case of different scenarios, the value
of K may vary. It is similar to the hit and trial method.
• Selecting a small K will lead to unstable outputs; those outputs won't be fully accurate. On the other hand, if we increase the value of K, the predictions become more stable and are more likely to be accurate.
The goal of KNN is to find the nearest neighbours of a particular data point. To perform this task,
KNN has a few requirements:
Determine Distance metrics
To determine which data point is closest to the query point, it is necessary to calculate the distance between them. Commonly used distance metrics are Euclidean Distance, Manhattan Distance, Minkowski Distance and Hamming Distance.
Advantages of KNN
• Easy Implementation
This is one of the first algorithms that data scientists will learn. The Algorithm's simplicity
and accuracy make it easy to implement.
• Adapts easily
Whenever new training samples are added, the algorithm adjusts itself for the new training
data.
• Fewer hyperparameters
KNN requires the “K” values and distance metric. These hyperparameters are low as
compared to other machine learning algorithms.
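A minimal scikit-learn sketch of KNN classification; K is set through n_neighbors:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # try different K values and keep the most accurate one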
Disadvantages of KNN
• It does not scale well: the algorithm stores the entire training set and computes distances at prediction time, so it becomes slow and memory-hungry on large datasets.
• It is sensitive to the scale of the data, to irrelevant features and to the curse of dimensionality, so feature scaling and selection are usually needed.
Support Vector Machine (SVM)
A Support Vector Machine classifies data by finding a Hyperplane that separates the classes in the feature space.
Now, how will we select the best Hyperplane that separates our data point? One valid method is
to select the Hyperplane that represents the largest separation between the two classes.
If such a Hyperplane is present, it is called a maximum-margin Hyperplane.
Types of SVM
Linear SVM Non-Linear SVM
The data can be separated with a single straight line. The data cannot easily be separated with a single straight line.
Data classification is done with the help of a Hyperplane. Kernels are used to classify the data.
Easy classification with a single straight line. Mapping of the data into a higher-dimensional space is required for classification.
Margins in SVM
Soft Margin:
In the linearly separable case, the Support Vector Machine tries to find the line that maximizes the margin, which is the distance between the line and the points closest to it. When the SVM is allowed to violate this margin for a few points, it is called a soft margin.
The motive of the soft margin is simple: to keep the margin as wide as possible while allowing the SVM to make a certain number of mistakes.
Hard Margin:
If we strictly impose that all instances must be off the street and to the right side, this is called
hard margin classification. There are two main issues with hard margin classification. First, it
only works if the data is linearly separable, and second, it is quite sensitive to outliers.
In the case of a hard margin, classification in SVM is very rigid. It tries to fit the training set perfectly and thus can cause overfitting.
Kernels are used to solve non-linear problems and this is known as the Kernel trick method. It
helps to frame the hyperplane in an extremely high dimension without raising any complexity.
Important components of Kernel SVC
The two important components of Kernel SVC are Gamma and the ‘C’ parameter:
• Gamma: Gamma decides how much influence a single training example has on the final output, which affects the decision boundaries of the model. When the value of Gamma is small, data points that are far apart are still considered similar. Larger values of Gamma mean only points that are very close are considered similar, which ultimately results in overfitting.
• The 'C' parameter: This parameter controls the amount of regularization applied to the data. Greater values of C lower the amount of regularization, and lower values of C result in more regularization.
Linear Kernel
This is the simplest Kernel function. The Kernel is used for text classification problems as
they can easily be separated through Linear Kernel. The formula for the same is:
F(x, y) = sum (x.y)
In the formula, x and y represent the data that you're supposed to classify.
• Polynomial Kernel
This type of Kernel is used when the training data are all standardized and normalized.
Polynomial Kernel is less preferred due to its less efficiency and accuracy. The formula for
this function is:
F(x, y) = (x.y+1) ^d
The dot in the formula shows the product of both the values and d shows the degree.
The value of gamma varies in the range of 0 to 1. The user is supposed to manually input
the value of Gamma and the most used value is 0.1.
• Sigmoid Kernel
It is mostly preferred for neural networks. The formula for the Sigmoid Kernel is:
F(x, y) = tanh(γ(x.y) + r)
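A minimal scikit-learn sketch comparing these kernels; C and gamma are the hyperparameters discussed above, and the iris dataset is used for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))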
Advantages of SVM
• SVM provides a technique called Kernel. With the usage of this function, any complex
problem can be solved easily. The kernel is applied to non-linear classes and is also known
as a non-parametric function.
• SVM doesn’t get the problem of overfitting and it performs well in terms of memory.
• SVM can efficiently solve both classification and regression problems. SVM is used for
classification problems and SVR is used for regression problems.
• Compared with Naive Bayes (another technique for classification) SVM is faster and more
accurate at prediction.
• SVM can be applied to semi-supervised learning models. It also applies not only to labeled
data but even to unlabeled data.
Disadvantages of SVM
• SVM can prove to be quite costly for the user. The cost of training them especially
nonlinear models is high.
• Data in SVM need to have feature vectors in advance. This needs pre-processing and this
is not always an easy task.
• Selecting an appropriate Kernel function is tricky and complex. Selecting higher or lower
Kernel can prove to be wrong for your output.
• SVM tends to take a long training time on mainly large data sets.
• SVM requires a good amount of computation capability. This is required to tune the hyper-
parameters including the value of the ‘C’ parameter and gamma.
Naive Bayes
It is a supervised learning algorithm that helps us solve classification problems; Naive Bayes is mostly used in text classification, which involves high-dimensional training datasets.
Why is it known as Naive Bayes?
Naive Bayes comprises two words, Naive and Bayes, which can be described as:
• Naive: It is called Naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, a fruit can be identified on the basis of its color, shape, and taste, and each feature contributes individually to identifying it as an apple without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
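A minimal scikit-learn sketch of a Naive Bayes classifier; GaussianNB is used here for numeric features, whereas text data would typically use MultinomialNB:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

nb = GaussianNB().fit(X_train, y_train)   # assumes features are independent given the class
print(nb.score(X_test, y_test))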
Precision and Recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
True Positive:
Interpretation: You predicted covid positive and it’s true.
You predicted that a patient is covid positive and he/she actually is.
True Negative:
Interpretation: You predicted negative and it's true.
You predicted that another patient is not covid positive and he/she actually is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it's false.
You predicted that a patient is covid positive, but in reality he/she is not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it's false.
You predicted that a patient is not covid positive, but in reality he/she is.
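A minimal sketch of a confusion matrix, precision and recall with made-up covid predictions (1 = positive, 0 = negative):
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual classes, columns: predicted classes
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)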
Unsupervised Learning Flow
An unsupervised algorithm handles data without prior training; it works with unlabeled data. The main purpose of unsupervised learning is to work under conditions where the results are unknown.
Unsupervised machine learning algorithms are mostly used to:
• Explore the structure of the information and detect distinct patterns
• Extract valuable insights
• Implement this in operations in order to increase the efficiency of the decision-making process
Some Common types of unsupervised learning approaches
Unsupervised learning models are utilized for three main tasks, namely clustering, association and dimensionality reduction.
Clustering
Clustering algorithms are used to group raw, unclassified data objects into groups represented by structures or patterns in the information. Clustering can be categorized into a few types, such as exclusive, overlapping, hierarchical and probabilistic.
Exclusive and Overlapping Clustering
Exclusive clustering is a form of grouping that stipulates that a data point can exist in only one cluster; this is also referred to as "hard" clustering. The K-means clustering algorithm is an example of exclusive clustering.
Understanding K-means clustering
It is a common example of an exclusive clustering method where data points are assigned to K groups, where K represents the number of clusters, based on the distance from each group's centroid. A minimal sketch follows.
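A minimal scikit-learn sketch of K-means with K = 2, on made-up two-dimensional points:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],      # one group of points
              [8, 8], [8.5, 9], [9, 8]])       # a second group of points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # the two centroids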
Hierarchical clustering
Agglomerative
It is considered a "bottom-up" approach: data points are initially treated as separate groupings and are then merged on the basis of similarity until one cluster is achieved.
In the agglomerative method there are some commonly used similarity measures, known as:
Ward's linkage: The distance between two clusters is defined by the increase in the sum of squared distances after the clusters are merged.
Average linkage: It is defined by the mean distance between two points in each cluster.
Complete (or maximum) linkage: It is defined by the maximum distance between two points in each cluster.
Single (or minimum) linkage: It is defined by the minimum distance between two points in each cluster.
What is Dimensionality?
Dimensionality is the number of variables in the dataset. In simple words, it is the total number of columns present in the dataset.
What is Correlation?
It depicts how strongly two variables are interconnected to each other. That is if one variable gets
changed, another variable will also get affected due to interdependency.
What is orthogonal?
The term is used to describe that the variables are not related to each other. When the correlation
between the pair is zero, it is said to be Orthogonal.
Step 1: Standardizing the data
Z = (value - mean) / standard deviation
Step 2: Calculating the Covariance Matrix
• Normalizing Scaling
Also known as Min-Max Scaling, Normalizing scaling is used when we want to range our
values between two numbers mostly between [0,1] or [-1,1]. The formula to calculate it is:
X_new = (X - X_min) / (X_max - X_min)
• Standardization
Also known as Z-score scaling, standardization is performed by subtracting the mean from the values and then dividing by the standard deviation. The formula for it is:
X_new = (X - mean) / Std
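A minimal scikit-learn sketch of both scalings on a single made-up column:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # values scaled into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit standard deviation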
Feature Encoding
Machine learning works only with numerical data. And, therefore, it is necessary to convert any
other form of data into numerical form. This process is known as Feature Encoding. The inserted
data is majorly processed with the following techniques of encoding:
• One-hot encoding
One hot encoding is the process of converting categorical data variables into numerical
values. This process is performed so that the data can be provided to machine learning
algorithms. The one-hot encoding provides numerical values to label values.
For example, consider a variable with two categories, red and blue. One-hot encoding creates an indicator column for each category (or, with only two categories, a single binary column where red is 0 and blue is 1). The data is then represented in binary terms and can successfully be passed to machine learning algorithms.
• Target-mean encoding
This process is another way of transforming categorical data variables into numerical ones.
The process includes removing and then replacing the categorical data with the average
value of the data.
However, the main challenge in the process of the target-mean encoding process is the
problem of overfitting. As we’re changing the variables based on the mean value that may
result in data leakage.
• Frequency encoding
Frequency encoding replaces each category with how often it appears in the dataset (its count or proportion), so categories that occur with similar frequency receive similar numerical values. An encoding sketch is shown below.
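A minimal pandas sketch of one-hot and frequency encoding on a made-up colour column:
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red"]})

one_hot = pd.get_dummies(df["colour"], prefix="colour")   # one indicator column per category
freq = df["colour"].map(df["colour"].value_counts())      # each category replaced by its count
print(one_hot)
print(freq)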
Feature Selection
Feature selection is the process of separating the consistent and relevant features used in the
model. The main goal of feature selection is to improve the performance of the model by reducing
the size of datasets. The process decreases the number of variables by removing the irrelevant
variables. Different types of Feature Selection Methods are:
Difference Between Filter, Wrapper, and Embedded Methods
Filter method: Faster than the wrapper method considering the time factor; low possibility of overfitting; does not use machine learning algorithms; examples are Correlation, ANOVA, etc.
Wrapper method: High computation time compared to the filter method; high chance of overfitting; works with specific machine learning algorithms; examples are forward and backward selection.
Embedded method: Speed lies between the filter and wrapper methods considering the time factor; used to reduce the problem of overfitting; feature selection happens as part of the model building process; examples are Lasso and Ridge Regression.
Outlier Treatment
An outlier is a particular data point in a data set that is extremely different from the rest of the
observation.
Outliers can be caused by the following reasons:
• Error in recording of data
• Error in observation
• Measurement Error
• Data variability
• Numeric Outlier
The numeric outlier method detects outliers in a one-dimensional space. The data is measured in terms of the Interquartile Range (IQR). First, the 1st and 3rd quartiles (Q1 and Q3) are calculated; an outlier is a data point that resides outside the interquartile range (typically below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR). This technique makes the detection of outliers easy.
• Z-score
The Z-score identifies outliers by assuming a Gaussian distribution of the data. The mean is found, and an outlier is a point in the distribution that lies far from the mean. The formula for the Z-score is:
z = (x - μ) / σ
• DBSCAN
This method is based on the DBSCAN clustering algorithm, which is suitable for multi-dimensional feature spaces. Here, the data is divided into three kinds of points: Core points, Border points, and Noise points.
Core points are the points that have at least the minimum number of neighbouring data points (MinPts). Border points are the values near core points and are part of the dataset. Data points far from the dataset are Noise points and are identified as outliers.
• Isolation Forest
Isolation Forest uses the concept of an isolation number: the total number of splits required to isolate a particular data point. Outliers have a lower isolation number compared to non-outlier points.
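A minimal NumPy sketch of the IQR and Z-score rules on a made-up sample:
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])       # 95 is the obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]                # a threshold of 2 or 3 is commonly used

print(iqr_outliers, z_outliers)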
DEEP LEARNING
Biological Neuron
It is a typical biological cell, mostly found in the animal brain, composed of a cell body (the soma, which contains the nucleus) and many extended tendrils. The tendrils can be classified into dendrites, which receive short electrical impulses from other neurons and bring them to the cell body, and an axon. When a neuron receives a sufficient number of signals from other neurons, its axon sends information from the cell body on to other neurons.
In short, Artificial Neural Networks are an ensemble of a large number of simple artificial neurons. Such a network learns to conduct a few tasks, such as recognizing an apple, by firing a neuron in a certain way when a given input is an apple. Next, we will see the perceptron, which was proposed at the very beginning of the research area of machine learning and is also a building block of a neural network.
Artificial Neuron (Perceptron)
An artificial neuron resembles a biological neuron only in a few aspects. A neural network is an interconnected system of perceptrons, so it is safe to say perceptrons are the foundation of any neural network. Perceptrons can be viewed as building blocks of a single layer in a neural network, made up of the following parts:
1. Input Values or One Input Layer
2. Weights and Bias
3. Net Sum
4. Activation Function
5. Outputs
A single artificial neuron is represented by a mathematical function. It takes inputs x_i, and each of them usually has its own weight w_i. The neuron calculates the weighted sum, which is passed through the activation function and on to the rest of the network.
The input to the neuron is x, which has a weight w associated with it. Weights show the strength of the particular node. The weight is the intrinsic parameter, the parameter the
model has control over in order to get a better fit for the output. When we pass an input into
a neuron, we multiply it by its weight, giving us x * w.
The second element of the input is called bias. A bias value allows you to shift the activation
function curve up or down. The bias adds an element of unpredictability to our model, which
helps it generalize and gives our model the flexibility to adapt to different unseen inputs when
using testing data.
The combination of the bias and input produces our output y, giving us a formula of w*x + b =y.
This should look familiar as a modification of the equation of a straight line, y = mx + c. Neural networks are made up of tens, hundreds, or even thousands of interconnected neurons, each of which runs its own regression.
Layers
Neural networks organize neurons into layers. A layer in which every neuron is connected to
every other neuron in its next layer is called a dense layer.
Working of Perceptron
A perceptron consists of four parts: input values, weights and a bias, a weighted sum, and an activation function. Assume we have a single neuron and two inputs x1, x2, multiplied by the weights W1, W2 respectively, as shown below.
Next, all the weighted inputs are added together with a bias b: (x1 * W1) + (x2 * W2) + b.
This looks like a reasonable function, but what if we wanted the outputs to fall into a certain range, say 0 to 1? For that, the sum is finally passed through an activation function.
The activation function is used to turn an unbounded input into an output that has a nice,
predictable form. An activation function is a function that converts the input given (the
input, in this case, would be the weighted sum) into a certain output based on a set of rules.
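A minimal NumPy sketch of a single perceptron forward pass with a sigmoid activation; the weights, bias and inputs are illustrative:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([2.0, 3.0])     # inputs x1, x2
w = np.array([0.5, -0.2])    # weights W1, W2
b = 0.1                      # bias

z = np.dot(x, w) + b         # weighted sum: x1*W1 + x2*W2 + b
y = sigmoid(z)               # activation squashes the sum into (0, 1)
print(z, y)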
Activation Function
The activation function is used to determine the output of a neural network, for example yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1. Activation functions can basically be divided into two types: Linear and Non-linear activation functions.
Linear Activation Function
As you can see, the function is a line, i.e., linear. Therefore, the output of the function is not confined to any range.
Equation: f(x) = x
The Non-linear Activation Functions are the most used activation functions. Non-linearity bends the graph away from a straight line, which makes it easy for the model to generalize or adapt to a variety of data and to differentiate between the outputs.
Sigmoid Activation Function
The main reason we use the sigmoid function is that its output exists between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output: since the probability of anything exists only between 0 and 1, sigmoid is the right choice.
ReLU (Rectified Linear Unit) Activation Function
The ReLU is the most used activation function in the world right now, since it is used in almost all convolutional neural networks and deep learning models.
As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less than zero
and f(z) is equal to z when z is above or equal to zero.
Range: [ 0 to infinity)
Multi-layer Neural Networks
A multi-layer neural network contains more than one layer of artificial neurons or nodes. They
differ widely in design. It is important to note that while single -layer neural networks were
useful early in the evolution of AI, the vast majority of networks used today have a multi-layer
model.
Input Layer: As the name suggests, it accepts inputs in several different formats provided by the
programmer. In the diagram, there are three nodes in the Input Layer. The Bias node has a value
of 1. X1 and X2 are taken by the other two nodes as external inputs (which are numerical values
depending upon the input dataset).
Hidden Layer: The hidden layer is present between the input and output layers. It performs all the calculations to find hidden features and patterns. As shown in the figure, the hidden layer also has three nodes, where the Bias node has an output of 1. The output of the other two nodes of the hidden layer depends on the outputs from the Input layer (1, X1, X2) and the weights associated with the edges.
The figure indicates output calculation for one of the hidden nodes
Similarly, it is possible to measure the output from another hidden node. Note that f
corresponds to the activation function. These outputs are then fed into the Output layer
nodes.
Output Layer: The input goes through a series of transformations using the hidden layer,
which finally results in output that is conveyed using this layer. - There are two nodes in the
output layer that take inputs from the hidden layer and perform identical computations as
seen for the hidden node that is highlighted in the diagram. The values determined (Y1 and
Y2) as a result of these computations are considered as the result of the multi-layer
perceptron.
For example, the following four-layer network has two hidden layers:
The hidden neurons act as feature detectors; as such, they play a critical role in the operation of a multilayer neural network. As the learning process progresses across the multilayer neural network, the hidden neurons begin to gradually "discover" the salient features that characterize the training data. They do so by performing transformations that map the input data into a new space called the feature space. In this new space, the classes of interest in a pattern-classification task, for example, may be more easily separated from each other than in the original input data space.
Tanh or hyperbolic tangent Activation Function
tanh is also like logistic sigmoid but better. The range of the tanh function is from (-1 to 1). tanh
is also sigmoidal (s-shaped).
The advantage is that the negative inputs will be mapped strongly negative and the zero inputs
will be mapped near zero in the tanh graph.
Leaky ReLU
It is an attempt to solve the dying ReLU problem.
The leak helps to increase the range of the ReLU function. Usually, the value of a (the slope for negative inputs) is 0.01 or so. Therefore, the range of the Leaky ReLU is (-infinity to infinity). The sketch below compares these activation functions.
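A minimal NumPy sketch of these activation functions, with a = 0.01 as the usual leak value:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # output in (0, 1)

def tanh(z):
    return np.tanh(z)                # output in (-1, 1)

def relu(z):
    return np.maximum(0, z)          # 0 for z < 0, z otherwise

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z) # small negative slope instead of 0

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))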
Computing Gradients
Now, before the equations, let's define what each variable means. We have already defined some
of them, but it's good to summarize. Some of this should be familiar to you
Although the weight w does not appear directly in the cost function, we start by considering the change of z with respect to w, since the z equation contains w. Next, we consider the change of a^L with respect to z^L, and then the change of the cost C with respect to a^L. Chained together, these effectively measure how the cost function changes with respect to a particular weight.
We need to move backward in the network and update the weights and biases: one equation for the weights, one for the biases, and one for the activations.
Each partial derivative with respect to the weights and biases is saved in a gradient vector, which has as many dimensions as you have weights and biases. The gradient is written with the nabla (triangle) symbol, and n is the number of weights and biases:
You compute the gradient over a mini-batch (a size of 16 or 32 often works well) of your data, i.e., you subsample your observations into batches. For each observation in your mini-batch, you compute the partial derivatives for each weight and bias and average them. The average over the mini-batch becomes the gradient used for the update, which produces a step in the average best direction over the mini-batch.
Then you would update the weights and biases after each mini-batch. Each weight and bias are
'nudged' a certain amount for each layer l:
The learning rate is usually written as an alpha α or eta η.
Gradient Descent
Gradient descent is an optimization algorithm that's used when training a machine learning
model. Gradient descent is essentially used to find the values of the parameters of a function
(coefficients) that minimize, as much as possible, a cost function. The goal of Gradient Descent
is to minimize the objective convex function f(x) using iteration. If the parameter values are good, the cost function will yield a lower number. Your loss function will inform you whether you are making progress when you tune your algorithm to try to refine your model. 'Loss' lets one understand how distinct the predicted value is from the real value. A minimal sketch follows.
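A minimal NumPy-free sketch of gradient descent minimizing the convex function f(x) = (x - 3)^2; the learning rate is the alpha mentioned above:
def f_grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x = 0.0                       # starting point
alpha = 0.1                   # learning rate
for _ in range(100):
    x = x - alpha * f_grad(x) # step in the direction of the negative gradient
print(x)                      # converges towards the minimum at x = 3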
Artificial Intelligence
Artificial intelligence (AI) is a broad area of computer science that focuses on creating intelligent
machines which can perform tasks that would usually require human intelligence. While AI is a
multidisciplinary science with many methods, developments in machine learning and deep
learning are causing a paradigm shift in nearly every field of the tech industry.
Deep Learning
Deep learning is a subset of machine learning, not anything distinct from it. As we are about to
start deep learning, many of us try to find out what the difference between machine learning
and deep learning is. Machine learning and deep learning both are all about learning from past
experience and making predictions based on future evidence.
In deep learning, artificial neural networks (mathematical models mimicking the human brain) are used to learn from previous data. At a high level, the diagram above illustrates the differences between machine learning and deep learning.
Some Real-Life Applications of Deep Learning
Computer Vision
Text Analysis & Understanding
Speech Recognition
Computer Games
AI Cybersecurity
Health Care
General Flow of Deep Learning Project
Any deep learning project, including predictive modeling, can be broken down into five common
tasks:
1. Define and prepare the problem
2. Summarize and understand data
3. Process and prepare data
4. Evaluate algorithms
5. Improve results
Workflow of AI Project
We can define the AI workflow in 5 stages.
1. Gathering data
2. Data pre-processing
3. Researching the model that will be best for the type of data
4. Training and testing the model
5. Evaluation
Training, Validation, and Test Sets
1. The training set is used to fit the model, i.e., to learn its weights and biases.
2. The validation set is used for unbiased model evaluation during hyperparameter tuning.
For example, when you want to find the optimal number of neurons in a neural network you
experiment with different values. For each considered setting of hyperparameters, you fit the
model with the training set and assess its performance with the validation set.
3. The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it
for fitting or validation.
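A common way to obtain the three sets in Python is to call scikit-learn's train_test_split twice; the toy data and the 60/20/20 split below are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (illustrative only)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# 60% train, 20% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

# Fit the model on the training set, tune hyperparameters on the validation set,
# and report the final, unbiased score on the test set exactly once.
print(len(X_train), len(X_val), len(X_test))   # 600, 200, 200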
What is Bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model; it leads to high error on both training and test data. High bias can also be termed underfitting. Underfitting is the case where the model has “not learned enough” from the training data, resulting in low generalization and unreliable predictions.
What is Variance?
Variance is the variability of the model's prediction for a given data point, and it tells us about the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. High variance can also be termed overfitting. Overfitting is the case where the training cost is very small but the generalization of the model is unreliable; this is due to the model learning “too much” from the training data set.
Regularization
This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
In L2 regularization, the squared magnitude of the weights is added to the cost: Cost = Loss + λ·Σ w². Here, lambda (λ) is the regularization parameter; it is the hyperparameter whose value is optimized for better results. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero). In L1, we have: Cost = Loss + λ·Σ |w|.
In this case, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced all the way to zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
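A minimal sketch of L1 and L2 regularization using TensorFlow's Keras API (the lambda value of 0.01 and the layer sizes are arbitrary illustrative choices):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A small dense network where each hidden layer penalizes large weights.
model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),   # L2 / weight decay
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(0.01)),   # L1 / sparsity
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()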
Dropout
This is one of the most interesting regularization techniques, and it produces very good results; it is consequently one of the most frequently used regularization techniques in deep learning. At each training step, dropout randomly 'drops' (ignores) a fraction of the neurons, so the network cannot rely too heavily on any single unit.
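A minimal dropout sketch in Keras, assuming arbitrary layer sizes and dropout rates of 0.5 and 0.3:

import tensorflow as tf
from tensorflow.keras import layers

# Dropout randomly zeroes a fraction of the previous layer's activations during
# training; it is switched off automatically at inference time.
model = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")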
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In classical machine learning this is often not feasible, because labelled data is too costly to obtain.
But now let's consider that we are dealing with images. In this case, there are a few ways of increasing the size of the training data: rotating the image, flipping, scaling, shifting, etc. Such transformations can, for example, be applied to a handwritten digits dataset.
This technique is known as data augmentation. It usually provides a big leap in improving the accuracy of the model.
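A small sketch of image augmentation with TensorFlow's preprocessing layers (requires TensorFlow 2.6 or newer; the transforms and the random batch of images are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

# Random transforms applied on the fly to each training image.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),          # rotate by up to ±10% of a full turn
    layers.RandomZoom(0.1),              # zoom in/out by up to 10%
    layers.RandomTranslation(0.1, 0.1),  # shift by up to 10% in each direction
])

images = tf.random.uniform((8, 28, 28, 1))   # stand-in for a batch of images
augmented = augment(images, training=True)   # new, slightly transformed images
print(augmented.shape)                       # (8, 28, 28, 1)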
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set aside as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop training the model. This is known as early stopping.
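In Keras, early stopping is available as a callback; the patience of 3 epochs below is an illustrative choice, and model, X_train, X_val, etc. are placeholders for your own objects:

import tensorflow as tf

# Stop training as soon as validation loss stops improving for 3 epochs,
# and roll the weights back to the best epoch seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])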
Batch Normalization
Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape.
Generally, when we input the data to a machine or deep learning algorithm, we tend to change
the values to a balanced scale. The reason we normalize is partly to ensure that our model can
generalize appropriately.
But what is the reason behind the term “Batch” in batch normalization? A typical neural network
is trained using a collected set of input data called batch. Similarly, the normalizing process in
batch normalization takes place in batches, not as a single input.
By using batch normalization, we can make neural networks faster and more stable by adding extra normalization layers to a deep neural network. The new layer performs the standardizing and normalizing operations on the input it receives from the previous layer.
Advantages of Batch Normalization
• Speed Up the Training
• Handles internal covariate shift
• Smooths the Loss Function
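A minimal sketch of batch normalization layers inserted between dense layers in Keras (layer sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers

# BatchNormalization standardizes the activations flowing out of the previous
# layer, batch by batch, before passing them on.
model = tf.keras.Sequential([
    layers.Dense(128, input_shape=(20,)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")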
We have four main strategies available for searching for the best configuration of our
hyperparameters:
• Babysitting (aka Trial & Error)
• Grid Search
• Random Search
• Bayesian Optimization
Grid Search
Grid Search – a naive approach of simply trying every possible configuration.
Here's the workflow:
• Define a grid on n dimensions, where each dimension maps to a hyperparameter, e.g. n = (learning_rate, dropout_rate, batch_size)
• For each dimension, define the range of possible values: e.g. batch_size = [4, 8, 16, 32, 64,
128, 256]
• Search for all the possible configurations and wait for the results to establish the best one:
e.g. C1 = (0.1, 0.3, 4) -> acc = 92%, C2 = (0.1, 0.35, 4) -> acc = 92.3%, etc...
Random Search
The only real difference between Grid Search and Random Search is in step 1 of the strategy cycle: Random Search picks points at random from the configuration space.
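The sketch below contrasts the two strategies with plain Python; the search space and the train_and_evaluate function are hypothetical stand-ins for a real training run:

import itertools
import random

# Hypothetical search space; in practice train_and_evaluate would fit the
# model with the training set and return accuracy on the validation set.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": [0.3, 0.5],
    "batch_size": [16, 32, 64],
}

def train_and_evaluate(config):
    # Placeholder: pretend smaller learning rates score slightly better.
    return 0.9 - config["learning_rate"]

# Grid search: try every combination in the grid.
grid = [dict(zip(search_space, values))
        for values in itertools.product(*search_space.values())]
best_grid = max(grid, key=train_and_evaluate)

# Random search: sample a fixed budget of random configurations.
random.seed(0)
samples = [{k: random.choice(v) for k, v in search_space.items()} for _ in range(5)]
best_random = max(samples, key=train_and_evaluate)

print(best_grid, best_random)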
Optimization
Optimizers are algorithms or methods used to change the attributes of your neural network such
as weights and learning rate in order to reduce the losses.
Gradient Descent
Gradient Descent is the most basic but most used optimization algorithm. It’s used heavily in
linear regression, classification and neural network algorithms. Backpropagation in neural
networks also uses a gradient descent algorithm.
Gradient descent is a first-order optimization algorithm, i.e., it depends on the first-order derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum (a small sketch follows the advantages and disadvantages below).
Algorithm: θ = θ − α·∇J(θ)
Advantages:
• Easy computation
• Easy to implement
• Easy to understand
Disadvantages:
• May trap at local minima
• Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, it may take an extremely long time to converge to the minimum
• Requires large memory to calculate gradient on the whole dataset
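A minimal NumPy sketch of the θ = θ − α·∇J(θ) update for a toy linear regression problem (the data, learning rate, and iteration count are illustrative):

import numpy as np

# Batch gradient descent for linear regression, following θ = θ − α·∇J(θ).
rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.uniform(-1, 1, size=200)]   # bias column + one feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + rng.normal(0, 0.1, size=200)

theta = np.zeros(2)
alpha = 0.5
for _ in range(500):
    grad = (2 / len(X)) * X.T @ (X @ theta - y)   # ∇J(θ) for mean squared error
    theta = theta - alpha * grad                  # the update rule above

print(np.round(theta, 2))   # approximately [ 2. -3.]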
Adam
Adam (Adaptive Moment Estimation) works with first and second moments of the gradients. The intuition is that, because we can jump over the minimum, we want to decrease the velocity a little bit for a careful search.
Advantages:
• Converges quickly and usually needs little manual tuning of the learning rate.
Disadvantages:
• Computationally costly.
Introduction to CNN
Convolutional Neural Network (ConvNet/CNN) is a Deep Learning method that can take in an input
image and assign importance (learnable weights and biases) to distinct aspects and objects in the
image while being able to distinguish between them. Comparatively speaking, a ConvNet requires
substantially less pre-processing than other classification techniques. ConvNets have the capacity
to learn these filters and properties, whereas in primitive techniques filters are hand-engineered.
A ConvNet's architecture was influenced by how the Visual Cortex is organised and is similar to
the connectivity network of neurons in the human brain. Only in this constrained area of the visual
field, known as the Receptive Field, do individual neurons react to stimuli. The entire visual field
is covered by a series of such fields that overlap.
Convolution Layer — The Kernel
The green area in the demonstration above mimics our 5x5x1 input image, I. The Kernel/Filter, K,
which is symbolised by the colour yellow, is the component that performs the convolution process
in the initial portion of a convolutional layer. K has been chosen as a 3x3x1 matrix.
The Kernel shifts nine times as a result of Stride Length = 1 (Non-Strided), conducting a matrix
multiplication operation between K and the area P of the picture that the kernel is now hovering
over each time.
Until it has parsed the entire width, the filter travels to the right with a specific Stride Value. Once
the entire image has been traversed, it hops back up to the image's beginning (on the left) with
the same Stride Value.
Pictures having several channels, like RGB images, have a kernel with the same depth as the input
image. A squashed one-depth channel Convoluted Feature Output is produced by performing
matrix multiplication across the Kn and In stacks ([K1, I1]; [K2, I2]; and [K3, I3]). All of the results
are then added together with the bias.
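The following NumPy sketch reproduces the basic operation on a toy 5x5 image and 3x3 kernel (the values are arbitrary): the kernel slides over the image with stride 1 and, at each of the nine positions, performs an element-wise multiplication followed by a sum.

import numpy as np

# Convolving a 5x5 single-channel image with a 3x3 kernel, stride 1, no padding.
image = np.arange(25).reshape(5, 5)          # stand-in for the 5x5x1 input I
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])              # stand-in for the 3x3x1 kernel K

out_h = image.shape[0] - kernel.shape[0] + 1   # 3
out_w = image.shape[1] - kernel.shape[1] + 1   # 3
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):                        # slide the kernel over the image
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply + sum

print(feature_map)   # the 3x3 convolved feature (the kernel shifts 9 times)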
The Convolution Operation's goal is to extract the high-level characteristics of the input image, such as edges. There is no requirement that ConvNets have just one convolutional
layer. Typically, low-level features like edges, colour, gradient direction, etc. are captured by the
first ConvLayer. With more layers, the architecture adjusts to High-Level characteristics as well,
giving us a network that comprehends the dataset's images holistically, much like we do.
The procedure yields two different types of results: one where the dimensionality of the
convolved feature is decreased as compared to the input, and the other where it is either
increased or stays the same.
Applying Valid Padding in the first scenario or Same Padding in the second accomplishes this. When the 5x5x1 image is padded out to 7x7x1 and the 3x3x1 kernel is applied to it, the convolved matrix has dimensions 5x5x1, the same as the original image; hence the name Same Padding.
On the other hand, if we carry out the identical operation without padding, we obtain a 3x3x1 convolved matrix (which here happens to match the kernel's own dimensions); this is known as Valid Padding.
Pooling Layer
The Pooling layer, like the Convolutional Layer, is in charge of shrinking the Convolved Feature's
spatial size. Through dimensionality reduction, the amount of computing power needed to
process the data will be reduced. Furthermore, it aids in properly training the model by allowing
the extraction of dominating characteristics that are rotational and positional invariant.
Max Pooling and Average Pooling are the two different types of pooling. The maximum value from
the area of the image that the Kernel has covered is returned by Max Pooling. The average of all
the values from the area of the image covered by the Kernel is what is returned by average
pooling, on the other hand.
Additionally, Max Pooling functions as a noise suppressant. It also does de-noising and
dimensionality reduction in addition to completely discarding the noisy activations. Average
Pooling, on the other hand, merely carries out dimensionality reduction as a noise-suppression
strategy. Therefore, we can conclude that Max Pooling outperforms Average Pooling significantly.
The i-th layer of a convolutional neural network is made up of the convolutional layer and the
pooling layer. The number of these layers may be expanded to capture even more minute details,
but doing so will require more computer power depending on how complex the images are.
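A small NumPy sketch of 2x2 max and average pooling with stride 2 on a toy 4x4 feature map (values are arbitrary):

import numpy as np

# 2x2 pooling with stride 2 on a 4x4 feature map.
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 8, 6]])

# Group the map into non-overlapping 2x2 windows, then reduce each window.
windows = feature_map.reshape(2, 2, 2, 2)
max_pooled = windows.max(axis=(1, 3))    # Max Pooling: keeps the strongest activation
avg_pooled = windows.mean(axis=(1, 3))   # Average Pooling: keeps the mean activation

print(max_pooled)   # [[6 4]
                    #  [7 9]]
print(avg_pooled)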
Fully Connected Layer
Adding a Fully-Connected layer is a (usually) inexpensive way of learning non-linear combinations of the high-level characteristics represented by the output of the convolutional layers. In that space, the Fully-Connected layer learns a function that may well be non-linear.
We now flatten the convolved output into a column vector, a format suitable for our multi-layer perceptron. A feed-forward neural network receives the flattened output, and backpropagation is applied in each training iteration. Over a number of epochs, the model learns to distinguish dominant and certain low-level features and to classify images using the Softmax classification method.
After going through the approach outlined above, we were able to successfully help the model
comprehend the features. Next, we will flatten the output for classification purposes and feed it
into a standard neural network.
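Putting the pieces together, here is a minimal ConvNet sketch in Keras following the Conv → Pool → Flatten → Dense → Softmax pattern described above; the layer sizes and the 28x28 grayscale input are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

# Conv -> Pool -> Conv -> Pool -> Flatten -> Dense -> Softmax,
# sized for 28x28 grayscale images with 10 classes (e.g. handwritten digits).
model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                        # column vector fed to the dense layers
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()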
There are numerous CNN architectures that may be used, and these architectures have been
essential in creating the algorithms that power and will continue to power AI as a whole in the
near future. Below is a list of a few of them:
• LeNet
• AlexNet
• VGGNet
• GoogLeNet
• ResNet
• ZFNet
Many-to-one RNNs
The most intuitive type of RNN is probably many-to-one. A many-to-one RNN can have input
sequences with as many time steps as you want, but it only produces one output after going
through the entire sequence. The following diagram depicts the general structure of a many-to-one RNN:
In the first diagram, f represents one or more recurrent hidden layers, where an individual layer
takes in its own output from the previous time step. The second diagram shows an example of
three hidden layers stacking up.
Many-to-one RNNs are widely used for classifying sequential data. Sentiment analysis is a good
example of this and is where the RNN reads the entire customer review, for instance, and assigns
a sentiment score (positive, neutral, or negative sentiment). Similarly, we can also use RNNs of
this kind in the topic classification of news articles. Identifying the genre of a song is another
application as the model can read the entire audio stream. We can also use many-to-one RNNs to
determine whether a patient is having a seizure based on an ECG trace.
One-to-many RNNs
One-to-many RNNs are the exact opposite of many-to-one RNNs. They take in only one input (not
a sequence) and generate a sequence of outputs. A typical one-to-many RNN is presented in the
following diagram:
Note that “one” here doesn’t mean that there is only one input feature. It means the input is
from a one-time step, or it is time-independent. One-to-many RNNs are commonly used as
sequence generators. For example, we can generate a piece of music given a starting note
and/or a genre. Similarly, we can write a movie script like a professional screenwriter using one-to-many RNNs with a starting word we specify. Image captioning is another interesting
application: the RNN takes in an image and outputs the description (a sentence of words) of the
image.
Many-to-many RNNs
Synced many-to-many RNNs, which produce an output at every time step, are also widely used in solving NLP problems such as named entity recognition, as well as in real-time speech recognition.
Sometimes, we only want to generate the output sequence after we’ve processed the entire input
sequence. This is the unsynced version of many-to-many RNN.
Refer to the following diagram for the general structure of a many-to-many (unsynced) RNN:
Note that the length of the output sequence (Ty in the preceding diagram) can be different from
that of the input sequence (Tx in the preceding diagram). This provides us with some flexibility.
This type of RNN is a go-to model for machine translation. In French-English translation, for
example, the model first reads a complete sentence in French and then produces a translated
sentence in English. Multi-step ahead forecasting is another popular example: sometimes, we
are asked to predict sales for multiple days in the future when given data from the past month.
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They work tremendously well on a large variety of
problems and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem; in practice these long-term dependency problems show up as vanishing and exploding gradients. Remembering information for long periods of time is practically the default behaviour of LSTMs, not something they struggle to learn! All recurrent neural networks have the form of a chain of repeating modules of
a neural network. In standard RNNs, this repeating module will have a very simple structure, such
as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module has a
different structure. Instead of having a single neural network layer, there are four, interacting in
a very special way.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by
step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs
of others. The pink circles represent point-wise operations, like vector addition, while the yellow
boxes are learned neural network layers. Lines merging denote concatenation, while a line forking
denotes its content being copied and the copies going to different locations.
Walkthrough of the architecture: -
Now we are ready to look into the LSTM architecture step by step:
● We’ve got a new value xt and value from the previous node ht-1 coming in.
● These values are combined and go through the sigmoid activation function, where it is
decided if the forget valve should be open, closed or open to some extent.
● The same values, or vectors of values, go in parallel through another layer operation
“tanh”, where it is decided what value we’re going to pass to the memory pipeline, and
also sigmoid layer operation, where it is decided if that value is going to be passed to the
memory pipeline and to what extent.
● Then, we have memory flowing through the top pipeline. If the forget valve is open and the memory valve is closed, the memory will not change. Otherwise, if the forget valve is closed and the memory valve is open, the memory will be updated completely.
● Finally, we’ve got xt and ht-1 combined to decide what part of the memory pipeline is
going to become the output of this module.
That’s basically what’s happening within the LSTM network.
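As a minimal illustration, the Keras sketch below builds a many-to-one LSTM for sentiment classification; the vocabulary size, sequence length, and layer sizes are assumed values:

import tensorflow as tf
from tensorflow.keras import layers

# A many-to-one LSTM for sentiment classification: read a whole (padded)
# sequence of word indices and emit one positive/negative probability.
vocab_size, seq_len = 10000, 200            # assumed preprocessing choices
model = tf.keras.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 32),
    layers.LSTM(64),                        # the four interacting gates live here
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()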
BIG DATA
What is Big Data?
Data can be analyzed computationally to reveal patterns, trends, and associations, especially those relating to human or machine behavior and interaction. Big data is just a collection of very large datasets (petabytes)
produced in huge volumes whose computation cannot be done using traditional computing
techniques. Big data isn’t a single technique or a tool, rather it has become a complete subject,
which involves various tools, techniques, and frameworks.
Types in Big Data
Big Data involves data of heterogeneous nature, produced in huge volumes and at a very high
rate. The data is generally classified into three categories.
Structured data: Structured data is data that has been organized into a formatted repository,
typically a database. It concerns all data which can be stored in relational tables having a fixed
schema. This type of data is the easiest to process and the simplest way to manage information. E.g. relational data stored in tables with rows and columns.
Semi-Structured data: Semi-structured data is a form of structured data that does not obey the
tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce some sort of hierarchy. E.g. XML data.
Unstructured data: Unstructured data is data that is not organized in a predefined manner and
does not fit in a predefined data model, thus it is not a good fit for a mainstream relational
database. There are, however, alternative platforms for storing and managing such data, which are used by organizations in a variety of business intelligence and analytics applications. E.g. text, media, logs, etc.
The five important V’s of Big Data are:
1. Value – It refers to changing data into value, which allows businesses to generate revenue.
2. Velocity – The rate at which data grows and is generated is known as its velocity. Social media is an important factor contributing to the growth of data.
3. Variety – Data can be of different types such as texts, audios, videos, etc. which are known as
variety.
4. Volume – It refers to the amount of any data that is growing at an exponential rate.
5. Veracity – It refers to the uncertainty found in the availability of data. It mainly arises due to
the high demand for data which results in inconsistency and incompleteness.
Data Storage is the next step in Big Data Solutions. In this step, the data extracted in the first step is stored in HDFS or in a NoSQL database such as HBase. HDFS storage is widely used for sequential access. On the contrary, HBase is used for random read or write access.
Data Processing is the final step of Big Data Solutions. In this step, with the help of different
processing frameworks, the data is processed. Various processing frameworks used are Pig,
MapReduce, Spark, etc.
Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets. Open source means it is freely available and we can even change its source code as per our requirements. Apache Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure.
Hadoop Architecture: -
Hadoop follows a master-slave architecture for storing data and data processing.
A cluster in the Hadoop ecosystem consists of one master node and numerous slave nodes. The
components included in the master-node of the architecture are Job-tracker and Name-node,
whereas the components of the slave-nodes are Task-tracker and data-node.
In the HDFS Layer:
Name-node: Name-node essentially performs a supervisory role that monitors and instructs the
Data-nodes to perform the computations. Name-node contains the meta-data for all the files
and by extension all the file segments in data blocks. It contains a table of information about the
file namespace, Ids of the blocks, and the data nodes possessing them for each file inside the
cluster.
Secondary Name-node: Secondary Name-node is a dedicated node to ensure availability in case
of a primary name-node failover.
Data-node: Data-node is a component of the slave nodes which is used to store the processes and
the data blocks in the HDFS Layer.
Replication in HDFS
HDFS has the reputation of being one of the most reliable file systems in the world. One of the
reasons for this is replication. Every block is replicated by a factor of 3 by default, meaning that
each block is copied to 3 different data nodes. The reason for this redundancy is to ensure fault
tolerance and high availability. If a data node goes down due to some failure, we can ensure that
we have at least 2 more copies of the same block which can be used when required. The figure
below shows an instance where the blocks of files are replicated and stored in the data nodes.
Now let's understand the complete end-to-end HDFS data write pipeline. As shown in the above figure, the data write operation in HDFS is distributed: the client writes the data to the datanodes in a distributed manner. The step-by-step explanation of the write operation is:
i) The HDFS client sends a create request on the DistributedFileSystem APIs.
ii) DistributedFileSystem makes an RPC call to the namenode to create a new file in the file system's namespace. The namenode performs various checks to make sure that the file doesn't already exist and that the client has the permissions to create the file. Only when these checks pass does the namenode make a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
iii) The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
iv) The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
v) DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by the datanodes in the pipeline. A datanode sends the acknowledgment once the required replicas are created (3 by default). Similarly, all the blocks are stored and replicated on the different datanodes; the data blocks are copied in parallel.
vi) When the client has finished writing data, it calls close() on the stream.
vii) This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for blocks to be minimally replicated before returning successfully.
Now let's understand the complete end-to-end HDFS data read operation. As shown in the above figure, the data read operation in HDFS is distributed: the client reads the data in parallel from the datanodes. The step-by-step explanation of the read cycle is:
i) The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
ii) DistributedFileSystem calls the namenode using RPC to determine the locations of the blocks for the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block, and the datanodes are sorted according to their proximity to the client.
iii) DistributedFileSystem returns an FSDataInputStream to the client for it to read data from. FSDataInputStream wraps the DFSInputStream, which manages the datanode and namenode I/O. The client calls read() on the stream. DFSInputStream, which has stored the datanode addresses, then connects to the closest datanode for the first block in the file.
iv) Data is streamed from the datanode back to the client, so the client can call read() repeatedly on the stream. When the block ends, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block.
v) If the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn't needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If it finds a corrupt block, it reports this to the namenode before attempting to read a replica of the block from another datanode.
vi) When the client has finished reading the data, it calls close() on the stream.
Backup Node
The Backup node provides the same checkpointing functionality as the Checkpoint node, as well
as maintaining an in-memory, up-to-date copy of the file system namespace that is always
synchronized with the active NameNode state. Along with accepting a journal stream of file
system edits from the NameNode and persisting this to disk, the Backup node also applies those
edits into its own copy of the namespace in memory, thus creating a backup of the namespace.
The Backup node does not need to download fsimage and edits files from the active NameNode
in order to create a checkpoint, as would be required with a Checkpoint node or Secondary
NameNode, since it already has an up-to-date state of the namespace in memory. The Backup node checkpoint process is more efficient, as it only needs to save the namespace into
the local fsimage file and reset edits. As the Backup node maintains a copy of the namespace in
memory, its RAM requirements are the same as the NameNode. The NameNode supports one
Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using
multiple Backup nodes concurrently will be supported in the future.
Map Reduce: -
Introduction to MapReduce
MapReduce is the programming model that is active in the Hadoop processing layer, it provides
easy scalability and data computation. MapReduce works on top of the Hadoop HDFS layer. This
programming model is designed to process large datasets by achieving parallelism. Parallelism in
the MapReduce layer is achieved by breaking down the job into a set of tasks that are
independent by nature.
Map-Reduce Job
A Map-Reduce job submitted by a user goes through two layers of processing, namely Map and Reduce, each consisting of many phases. A client application has to submit the input data and the
Map-reduce program along with its configuration. The data is then sliced into lists of inputs and
sent to the map and reduce processes.
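As an illustration of the model, here is the classic word-count example written as two Python scripts in the Hadoop Streaming style (the script names and paths are assumptions):

# mapper.py - emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word; Hadoop delivers keys already sorted
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A typical (installation-dependent) invocation looks roughly like: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /input -output /wordcount-out.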
Task in MapReduce
A task in MapReduce is an execution of a Mapper or a Reducer on a slice of data on a node.
Task Attempt is a particular instance of an attempt to execute a task on a node.
It is possible that any machine can go down at any time. Eg., while processing an application task
if any node goes down, the framework automatically reschedules the task to some other active
node. This rescheduling of the task cannot happen infinitely, there is an upper limit for that. The
default value for task attempts is 4. After a task (Mapper or reducer) has failed 4 times, the job is
considered to be a failed job. For high-priority jobs, the value of task attempts can also be
increased.
Input Split
InputSplit in Hadoop MapReduce is the logical representation of data. It describes the unit of work that is processed by a single map task inside a MapReduce program.
In Hadoop, InputSplit represents the data that is processed by an individual Mapper. Each split is
divided into records. Hence, the mapper processes each record (that is a key-value pair). The
important thing to notice is that Inputsplit does not contain the input data; it is just a reference
to the data. By default, the split size is approximately equal to the block size in HDFS. InputSplit
is a logical chunk of data i.e. it just has the information about blocks addresses or locations.
Record Reader
Input Format computes splits for each file and then sends them to the Job-Tracker (in
Namenode), which uses their storage locations to schedule map tasks to process them on the
TaskTrackers (in Data-nodes). Map task then passes the split to the task tracker to obtain a
RecordReader for that split. The RecordReader loads data from its source and converts it into
key-value pairs suitable for reading by the mapper (map function). It communicates with the input split until the reading of the split is completed.
The total number of blocks of the input files determines the number of map tasks in a program.
Hence, No. of Mappers = (total data size) / (input split size)
For example, if the data size is 1 TB and the InputSplit size is 100 MB, then
No. of Mappers = (1000 * 1000) / 100 = 10,000
Combiner
Combiner is also known as “Mini-Reducer” that summarizes the Mapper output record with the
same Key before passing to the Reducer.
On huge datasets, when we run a MapReduce job, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed on to the Reducer for further processing, which can lead to enormous network congestion. The MapReduce framework employs a function known as the Hadoop Combiner that plays a vital role in decreasing network congestion.
The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer and decreases the amount of data that needs to be processed by the reducer. The use of a Combiner enhances the overall performance of the reducer.
Partitioner
The Partitioner in MapReduce regulates the partitioning of the key of the intermediate output
produced by mappers. By the means of a hash function, a key (or a subset of the key) is used to
derive a partition. The total number of partitions is dependent on the number of reducer tasks.
Partitioner in Hadoop MapReduce redirects the mapper output to the reducer by determining the
reducer responsible for that particular key.
The total number of running Partitioners is equal to the number of reducers. The data from a
single partitioner is processed by a single reducer, and the partitioner exists only when there are
multiple reducers.
Sorting
The keys generated by the mapper functions are automatically sorted by MapReduce
Framework, i.e. Before starting a reducer, all intermediate key-value pairs in MapReduce that
are generated by the mappers get sorted by key and not by value.
Reducer
The reducer takes the output of the Mapper (intermediate key-value pairs) as input, processes
each of them, and generates the output also as key-value pairs. The output from the reducers is
the final output, which is stored in HDFS or a different file system.
Output Format
Output Format checks the Output-Specification of the job. It determines how RecordWriter
implementation is used to write output to output files.
Record Writer
As mentioned, the Reducer takes as input a set of intermediate key-value pairs produced by the mapper and runs a reducer function on them to generate output that is again zero or more key-value pairs.
RecordWriter writes these output key-value pairs generated after the Reducer phase to the output files.
Output Specification
The output specification is described by the Output Format for a Map-Reduce job. On the basis of the output specification, the MapReduce job checks that the output directory does not already exist.
Limitations of Hadoop 1.0
Availability: As the master daemon was a single point of failure, a failure meant killing all the jobs
in the queue.
Resource Utilization: A fixed number of map and reduce slots meant that cluster resources could sit idle, which gave rise to resource utilization issues.
Hadoop 1.0 could only run MapReduce: This architecture had only one option in programming
models to be run in the processing layer, which was MapReduce. It wasn't able to run non-MapReduce applications.
How YARN addresses these limitations
High Availability: Having just one Resource Manager would mean a single point of failure in case
the resource manager dies, therefore YARN adds a redundancy feature as an Active-Standby pair
of Resource Managers.
Resource Utilization: YARN allows dynamic allocation of resources, to avoid utilization issues.
YARN architecture: -
The client would submit the job request to the Resource Manager in Hadoop 2.0 (YARN) as
opposed to the Job Tracker in the older version. Thus, the Resource Manager is the Master
Daemon in the YARN architecture.
Node Manager is the Slave daemon in the YARN architecture, which is similar to the Task-tracker
in Hadoop MapReduce.
Components in YARN
Resource Manager
This is the master daemon in the YARN Architecture, its duty is to manage the assignment of
resources which include CPU cores, Memory, Network I/O for the applications submitted by the
clients.
The applications from the clients compete for resources and the Resource Manager resolves the
resource allocation, which is why it is called a resource negotiator.
The two main components in the Resource Manager are Scheduler and Application Manager.
Scheduler
The scheduler is the component of the Resource Manager that does the actual allocation of the
cluster resources to the applications. Its only job is to assign resources to the competing jobs, acting as a pure resource negotiator; it does not concern itself with tracking or monitoring the status of jobs.
Application Manager
The responsibility of the Application Manager is to manage the application masters that are
running in the cluster. Its duty is to spawn the first application master, monitoring them, and
also restarting them if there is a failure.
Node Manager
It is the slave daemon of Yarn. Every data node has its Node Manager, which manages the user
process on that machine. Node Manager helps the Resource Manager keep its data updated, and
Node Manager can also kill containers based on the instructions from the Resource Manager.
Application master
One application master runs per application. It negotiates resources from the resource manager
and works with the node manager. It Manages the application life cycle.
The Application Master acquires containers from the Scheduler before contacting the corresponding Node Manager to start the application's individual tasks.
Container
It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
Work-preserving Resource Manager restart: This type of Resource Manager restart focuses on
reconstructing the running state of Resource Manager by combining the container status from
Node Managers and container requests from Application Masters on restart. The key difference
from Non-work-preserving Resource Manager restart is that already running apps will not be
stopped after Resource Manager restarts, so applications will not lose their processed data
because of Resource Manager/master outage.
Scheduler Plug-ins
Plug-in Schedulers such as Fair Scheduler and the Capacity Scheduler are widely used in YARN
applications.
FIFO Scheduler
FIFO Scheduler runs the applications in submission order by placing them in a queue. The application submitted first gets resources first, and on its completion the scheduler serves the next application in the queue.
Capacity Scheduler
The Capacity Scheduler allows sharing of a Hadoop cluster between organizations. Each
organization is set up with a dedicated queue that is configured to use a given fraction of the
cluster capacity. Queues may be further divided in a hierarchical fashion, and it follows FIFO
scheduling for a single queue. Capacity Scheduler may allocate the spare resources to jobs in the
queue, even if the queue’s capacity is exceeded.
Fair Scheduler
The Fair Scheduler attempts to allocate resources so that all running applications get the same share
of resources. It enforces dynamic allocation of resources among the competing queues, with no
need for prior capacity. It also allows for a priority mechanism within the queues. When a high-priority job arrives in the same queue, the task is processed in parallel by replacing some portion of the already dedicated slots.
Latency:
One of the key advantages of Apache Spark is its ability to minimize latency, or the delay
between data processing and obtaining results. By keeping intermediate data in-memory and
optimizing task execution, Spark reduces the time required for iterative algorithms, interactive
queries, and real-time data processing. This low-latency processing capability is crucial for
applications requiring rapid insights and responsiveness.
Introduction to Pyspark:
PySpark is the Python API for Apache Spark, which allows developers to write Spark applications
using the Python programming language. PySpark provides a familiar and expressive interface
for data scientists and Python developers, enabling them to leverage Spark's distributed
computing capabilities without having to learn a new programming language or framework.
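A minimal PySpark sketch (the application name and the toy data are illustrative): start a SparkSession, build a small DataFrame, and run a distributed aggregation.

from pyspark.sql import SparkSession

# Minimal PySpark example: start a session, build a small DataFrame, aggregate it.
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

data = [("alice", "books", 12.0),
        ("bob", "books", 5.0),
        ("alice", "music", 7.5)]
df = spark.createDataFrame(data, ["user", "category", "amount"])

# Spark distributes this aggregation across the cluster (or local cores).
df.groupBy("category").sum("amount").show()

spark.stop()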
Deep Learning
1. What is AI and Deep Learning?
• AI, or Artificial Intelligence, is like giving computers the ability to think and solve problems
like humans. It's about making machines smart so they can perform tasks that typically
require human intelligence.
• Deep Learning, a subset of AI, is a specific way of building intelligent systems. It's inspired
by the structure and function of the human brain, using complex algorithms called neural
networks to analyze data, recognize patterns, and make decisions.
2. History of Deep Learning:
• Deep Learning has its roots in the 1940s when researchers started exploring neural
networks as a way to simulate the human brain's learning process. However, it wasn't until
the 2010s that Deep Learning really took off.
• Breakthroughs in computing power, availability of large datasets, and improvements in
algorithms fueled the rapid advancement of Deep Learning, leading to its widespread
adoption in various applications.
3. Machine Learning vs. Deep Learning:
• Machine Learning is a broader concept that encompasses various techniques enabling
computers to learn from data and improve over time without being explicitly programmed.
• Deep Learning is a specific approach to Machine Learning where neural networks with
multiple layers (deep neural networks) process data hierarchically, extracting intricate
patterns and representations from raw input.
4. How Deep Learning is Different from Other Machine Learning Methods:
• In traditional Machine Learning approaches, feature extraction and selection are often
done manually by humans, requiring domain expertise.
• Deep Learning automates this process by allowing the neural network to learn hierarchical
representations directly from raw data, eliminating the need for explicit feature
engineering and making it more adaptable to diverse tasks and datasets.
5. Real-Life Applications of Deep Learning:
• Deep Learning is ubiquitous in modern technology and powers numerous applications:
• Image and speech recognition: enabling facial recognition on smartphones, virtual
assistants like Siri and Alexa, and automated captioning of images.
• Natural language processing: facilitating language translation, sentiment analysis,
and chatbots.
• Autonomous vehicles: enabling self-driving cars to perceive and navigate the
environment.
• Healthcare: aiding in medical imaging diagnosis, drug discovery, and personalized
treatment recommendations.
• Finance: supporting fraud detection, algorithmic trading, and credit scoring.
6. The Benefits of Machine Learning:
• Machine Learning offers several advantages:
• Automation of repetitive tasks, reducing manual effort and increasing efficiency.
• Enhanced decision-making through data-driven insights and predictions.
• Scalability, allowing systems to handle large volumes of data and complex
problems.
• Adaptability to changing environments and evolving datasets, improving
performance over time.
7. Challenges of Deep Learning:
• Despite its successes, Deep Learning faces several challenges:
• Data requirements: Deep Learning models often require massive amounts of
labeled data to generalize well, posing challenges in data collection and annotation.
• Computational resources: Training deep neural networks demands substantial
computational power and memory, limiting accessibility to high-performance
hardware.
• Interpretability: Deep Learning models are often perceived as black boxes, making
it difficult to understand their decision-making process and assess their reliability.
• Overfitting: Deep Learning models can memorize noise in the training data, leading
to poor generalization performance on unseen data.
8. Latest Breakthroughs in Deep Learning:
• Recent advancements in Deep Learning include:
• Efficient model architectures: Development of compact and computationally
lightweight models suitable for deployment on edge devices with limited
resources.
• Self-supervised learning: Techniques that enable models to learn from unlabeled
data, reducing the reliance on annotated datasets.
• Explainable AI: Efforts to enhance the interpretability and transparency of Deep
Learning models, allowing users to understand model decisions and trust their
outputs.
9. General Flow of Deep Learning Projects:
• Deep Learning projects typically follow a systematic workflow:
• Data collection and preprocessing: Gathering relevant datasets and preparing them
for training, including tasks such as cleaning, normalization, and augmentation.
• Model design and training: Defining the architecture of the neural network,
selecting appropriate loss functions and optimization algorithms, and training the
model on the prepared data.
• Evaluation: Assessing the performance of the trained model on validation and test
datasets, measuring metrics such as accuracy, precision, recall, and F1 score.
• Fine-tuning and deployment: Iteratively refining the model based on evaluation
results, optimizing hyperparameters, and deploying the finalized model for real-
world applications.
10. Introduction to TensorFlow:
• TensorFlow is an open-source machine learning library developed by Google Brain for
building and training various types of machine learning models, including neural networks.
• It provides a flexible and efficient framework for numerical computations using data flow
graphs.
• TensorFlow supports both CPU and GPU computation, allowing for fast training of models
on different hardware platforms.
• With TensorFlow, developers and researchers can easily create, train, and deploy machine
learning models for a wide range of applications.
11. TensorFlow Hello World:
• The "Hello World" program in TensorFlow is a simple demonstration of how to create and
execute a computational graph.
• In TensorFlow, computations are represented as data flow graphs, where nodes represent
mathematical operations, and edges represent the flow of data between operations.
• The "Hello World" example typically involves creating a graph with a single node that
performs a basic mathematical operation, such as addition or multiplication, and then
executing the graph to obtain the result.
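A minimal sketch of such a "Hello World" using TensorFlow 2.x, where eager execution runs operations directly and tf.function traces them into a data flow graph (the values are arbitrary):

import tensorflow as tf

a = tf.constant(5.0)
b = tf.constant(3.0)

@tf.function              # traces a graph with a single add node
def add(x, y):
    return tf.add(x, y)

print(add(a, b).numpy())  # 8.0
print(tf.constant("Hello, TensorFlow!").numpy())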
12. Linear Regression With TensorFlow:
• Linear regression is a fundamental machine learning algorithm used for predicting a
continuous target variable based on one or more input features.
• In TensorFlow, linear regression can be implemented using the concept of placeholders,
variables, and operations.
• Placeholder nodes are used to feed input data into the computational graph, while variable
nodes are used to represent model parameters (e.g., slope and intercept).
• By defining the loss function (e.g., mean squared error) and optimization algorithm (e.g.,
gradient descent), TensorFlow can automatically adjust the model parameters to minimize
the loss and optimize the linear regression model.
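The following sketch implements linear regression in TensorFlow 2.x style, using tf.Variable and GradientTape rather than the TF 1.x placeholders described above; the toy data (y = 2x + 1), learning rate, and step count are illustrative:

import tensorflow as tf

# Linear regression y = w*x + b with mean squared error and gradient descent.
x = tf.constant([0.0, 1.0, 2.0, 3.0, 4.0])
y = tf.constant([1.0, 3.0, 5.0, 7.0, 9.0])      # true relation: y = 2x + 1

w = tf.Variable(0.0)
b = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = w * x + b
        loss = tf.reduce_mean(tf.square(y_pred - y))   # mean squared error
    grads = tape.gradient(loss, [w, b])
    optimizer.apply_gradients(zip(grads, [w, b]))

print(w.numpy(), b.numpy())   # close to 2.0 and 1.0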
13. Logistic Regression With TensorFlow:
• Logistic regression is a popular machine learning algorithm used for binary classification
tasks, where the goal is to predict the probability that an input belongs to a particular class
(e.g., spam or non-spam).
• TensorFlow can be used to implement logistic regression by constructing a computational
graph that computes the probability of the target class using a sigmoid function.
• Similar to linear regression, logistic regression in TensorFlow involves defining
placeholders for input data, variables for model parameters, and a loss function (e.g.,
binary cross-entropy) for optimizing the model.
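A minimal Keras sketch of logistic regression as a single sigmoid neuron trained with binary cross-entropy (the toy data is randomly generated purely for illustration):

import numpy as np
import tensorflow as tf

X = np.random.rand(500, 4).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")   # toy binary labels

# Logistic regression = one dense unit with a sigmoid output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(4,))
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

print(model.predict(X[:3]))   # predicted probabilities of the positive class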
14. Intro to Deep Learning:
• Deep Learning is a subfield of machine learning that focuses on training neural networks
with multiple layers (deep neural networks) to learn hierarchical representations of data.
• Deep Learning has gained popularity due to its ability to automatically learn complex
patterns and features from raw data, without the need for manual feature engineering.
• Neural networks in Deep Learning consist of interconnected layers of nodes (neurons),
each performing simple mathematical operations and passing the results to the next layer.
• Deep Learning has achieved remarkable success in various domains, including computer
vision, natural language processing, speech recognition, and reinforcement learning.
15. Deep Neural Networks:
• Deep Neural Networks (DNNs) are a class of neural networks with multiple hidden layers
between the input and output layers.
• DNNs are capable of learning hierarchical representations of data, where each layer
captures increasingly abstract features from the input.
• Common architectures of DNNs include feedforward neural networks, convolutional
neural networks (CNNs) for image data, recurrent neural networks (RNNs) for sequential
data, and transformer models for natural language processing tasks.
• Training deep neural networks typically involves techniques such as backpropagation,
stochastic gradient descent, and regularization to optimize model parameters and prevent
overfitting.
16. Biological Neuron:
• A biological neuron is a basic unit of the nervous system found in the brains of animals,
including humans.
• It consists of three main parts: dendrites (input terminals), a cell body (processing unit),
and an axon (output terminal).
• Neurons communicate with each other through electrochemical signals called action
potentials, which travel along the axon and stimulate neighboring neurons via synaptic
connections.
17. Perceptron:
• The perceptron is the simplest form of a neural network, inspired by the biological neuron.
• It takes multiple binary inputs, each with an associated weight, and produces a single
binary output.
• The perceptron computes a weighted sum of its inputs, applies a step function (threshold
function) to the sum, and outputs the result.
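A minimal NumPy sketch of a perceptron; the weights (1, 1) and threshold 2 are chosen so that it behaves like an AND gate, purely for illustration:

import numpy as np

# A single perceptron: weighted sum of binary inputs, then a step function.
def perceptron(inputs, weights, threshold):
    weighted_sum = np.dot(inputs, weights)
    return 1 if weighted_sum >= threshold else 0

# Example: an AND gate realised with weights 1, 1 and threshold 2.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), np.array([1, 1]), threshold=2))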
18. Multi-Layer Perceptron (MLP):
• The multi-layer perceptron is a type of feedforward neural network consisting of multiple
layers of perceptrons.
• It contains an input layer, one or more hidden layers, and an output layer.
• Each perceptron in the hidden layers and the output layer applies a weighted sum of
inputs, followed by an activation function, to produce the output.
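A minimal Keras sketch of an MLP with an input layer, two hidden layers, and an output layer (the sizes, 20 input features and 3 classes, are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),                  # input layer: 20 features
    layers.Dense(64, activation="relu"),        # hidden layer 1
    layers.Dense(32, activation="relu"),        # hidden layer 2
    layers.Dense(3, activation="softmax"),      # output layer: 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()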
19. Weight and Bias:
• In a neural network, weights and biases are parameters that determine the strength of
connections between neurons and the neuron's responsiveness to inputs, respectively.
• Weights represent the importance of input signals, while biases allow the network to learn
different activation patterns.
• During training, the values of weights and biases are adjusted through optimization
algorithms such as gradient descent to minimize the difference between predicted and
actual outputs.
20. Activation Function:
• An activation function is a mathematical function applied to the output of each neuron in
a neural network.
• It introduces non-linearity into the network, allowing it to learn complex patterns and
relationships in the data.
• Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), and
softmax, each with its unique properties and use cases.
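A small NumPy sketch of these activation functions (tanh is taken directly from NumPy):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes values into (0, 1)

def relu(z):
    return np.maximum(0, z)                # zero for negatives, identity otherwise

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), np.tanh(z), relu(z), softmax(z), sep="\n")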
21. Deep Neural Networks (DNNs):
• Deep neural networks (DNNs) are neural networks with multiple layers between the input
and output layers.
• They are capable of learning hierarchical representations of data by composing multiple
non-linear transformations.
• DNNs have achieved significant success in various machine learning tasks, including image
recognition, speech recognition, natural language processing, and reinforcement learning.
These concepts form the foundation of neural networks and deep learning. Understanding them
is essential for grasping more advanced topics in artificial intelligence and machine learning.
28. Recurrent Neural Network (RNN):
• A Recurrent Neural Network (RNN) is a type of neural network designed for processing
sequential data by maintaining an internal state or memory.
• Unlike feedforward neural networks, RNNs have connections that form directed cycles,
allowing them to exhibit temporal dynamic behavior.
• RNNs are suitable for tasks where the order and context of input data are essential, such
as time series prediction, natural language processing, and speech recognition.
29. Architecture of RNN:
• The architecture of an RNN consists of recurrent connections that allow information to
persist over time steps.
• At each time step t, the RNN receives an input x_t and the hidden state h_(t-1) from the previous time step. It then computes the current hidden state h_t and optionally produces an output y_t.
• The hidden state h_t serves as the memory of the network, capturing information from previous time steps and influencing the computation at the current time step.
30. Different Types of RNNs:
• Various extensions and modifications of basic RNNs have been developed to address their
limitations, including:
• Long Short-Term Memory (LSTM) networks: Introduced by Hochreiter and
Schmidhuber, LSTM networks use memory cells with gating mechanisms to better
capture long-range dependencies and prevent the vanishing gradient problem.
• Gated Recurrent Unit (GRU) networks: Similar to LSTM networks, GRU networks
utilize gating mechanisms to control the flow of information, but with a simpler
architecture and fewer parameters.
• Echo State Networks (ESNs): ESNs are a type of reservoir computing where a fixed
random recurrent network (the reservoir) is combined with a trainable readout
layer to perform tasks such as time series prediction and classification.
31. Bidirectional RNN:
• A Bidirectional Recurrent Neural Network (BiRNN) is an extension of RNNs that processes
input sequences in both forward and backward directions.
• BiRNNs consist of two separate RNNs: one processes the input sequence from the
beginning to the end, while the other processes it from the end to the beginning.
• By capturing information from both past and future contexts, BiRNNs can better
understand and model the dependencies within sequential data.
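A minimal Keras sketch of a bidirectional LSTM for sequence classification; the vocabulary size, sequence length, and layer sizes are assumed values:

import tensorflow as tf
from tensorflow.keras import layers

# One LSTM reads the sequence forwards, a second reads it backwards,
# and their outputs are concatenated.
model = tf.keras.Sequential([
    layers.Input(shape=(100,)),                 # sequences of 100 word indices
    layers.Embedding(10000, 32),
    layers.Bidirectional(layers.LSTM(64)),      # 64 units per direction -> 128 outputs
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()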
32. Applications of RNN:
• RNNs find applications in a wide range of sequential data processing tasks, including:
• Language modeling: Generating text, predicting the next word in a sentence, and
machine translation.
• Speech recognition: Converting spoken language into text.
• Time series prediction: Forecasting future values based on historical data, such as
stock prices and weather patterns.
• Handwriting recognition: Recognizing handwritten characters and words.
• Music generation: Creating new music compositions based on existing melodies
and styles.
33. Using RNN with Different Datasets:
• RNNs can be applied to various datasets beyond the aforementioned applications:
• Financial data: Analyzing stock prices, predicting market trends, and performing
algorithmic trading.
• Medical data: Predicting patient outcomes, diagnosing diseases from medical
images, and analyzing electrocardiogram (ECG) signals.
• Genomic data: Predicting gene sequences, identifying regulatory elements, and
analyzing protein sequences.
• Video data: Understanding human actions, recognizing objects and scenes, and
generating video captions.