DSA Practical Workbook - Lab Manuals (18CS)
Practical Workbook
Upon successful completion of the course, the student will be able to:
(Each topic below lists its linking associated CLO, the psychomotor taxonomy level, and the number of lecture hours required.)

1. To Practice and Execute NoSQL Database Queries - MongoDB (CLO 3, P3, 3 hrs)
2. To Practice the Python Platform and Explore its Libraries for Data Science & Machine Learning (CLO 3, P3, 3 hrs)
3. To Perform Exploratory Data Analysis and demonstrate in-depth visualizations of Data (CLO 3, P3, 3 hrs)
4. To Perform Inferential Statistics and Hypothesis Testing (CLO 3, P3, 3 hrs)
5. To Demonstrate the Implementation of Linear Regression in Data Modeling (CLO 3, P3, 3 hrs)
6. To Demonstrate the Implementation of Logistic Regression in Data Modeling (CLO 3, P3, 3 hrs)
7. To Practice and Demonstrate the Implementation of Artificial Neural Network & Deep Learning (CLO 3, P3, 3 hrs)
8. To Practice and Demonstrate the Implementation of Support Vector Machine (CLO 3, P3, 3 hrs)
9. To Practice and Demonstrate the Implementation of the K-Means Clustering Algorithm (CLO 3, P3, 3 hrs)
10. To Practice and Demonstrate the Implementation of the Decision Trees Algorithm (CLO 3, P3, 3 hrs)
11. To Practice and Demonstrate the Implementation of the Random Forest Algorithm (CLO 3, P3, 3 hrs)
12. To Execute Anomaly Detection Techniques in Machine Learning Models (CLO 3, P3, 3 hrs)
13. To Explore the Performance Measuring Variables and Optimize Hyper-parameters (CLO 3, P3, 3 hrs)
14. To Develop LSTM Models for Time Series Forecasting (CLO 4, P4, 3 hrs)
15. Open-ended Lab 1 (CLO 4, P4, 3 hrs)

Total lecture hrs: 45
PRACTICAL 01
To Practice and Execute NoSQL Database
Queries - MongoDB
Outline:
Database
A database is a collection of information that is organized so that it can be easily accessed,
managed and updated. Computer databases typically contain data records or files that hold information about, for example, sales transactions or interactions with specific customers.
NOSQL
NoSQL is an approach to database management that can accommodate a wide variety of data
models, including key-value, document, columnar and graph formats. A NoSQL database
generally means that it is non-relational, distributed, flexible and scalable.
Additional common NoSQL database features include the lack of a database schema, data
clustering, replication support and eventual consistency, as opposed to the typical ACID
(atomicity, consistency, isolation and durability) transaction consistency of relational and SQL
databases. Many NoSQL database systems are also open source.
SQL vs NOSQL
NOSQL DATABASES
• Redis
• CouchDB
• Neo4j
• MongoDB
• Cassandra
MONGODB
MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.
Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema
means that documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of data.
Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection
exists within a single database. Collections do not enforce a schema. Documents within a
collection can have different fields. Typically, all documents in a collection are of similar or
related purpose.
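To make these two concepts concrete, here is a minimal sketch using the PyMongo driver (the database, collection, and field names are only illustrative; the lab tasks below use the MongoDB GUI/shell):

from pymongo import MongoClient

# Connect to a locally running MongoDB instance (default port 27017)
client = MongoClient("mongodb://localhost:27017/")
db = client["company"]        # database (created lazily on first write)
employees = db["employees"]   # collection, the analogue of an RDBMS table

# Documents in the same collection need not share the same fields
employees.insert_one({"name": "Ali", "dept": "Sales", "salary": 50000})
employees.insert_one({"name": "Sara", "dept": "HR", "skills": ["Excel", "SQL"]})

for doc in employees.find({"dept": "Sales"}):
    print(doc)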
STRENGTHS OF MONGODB
The following are some of MongoDB's benefits and strengths:
• Dynamic schema: As mentioned, this gives you the flexibility to change your data schema without modifying any of your existing data.
• Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your business with ease.
• Manageability: The database doesn't require a database administrator. Since it is fairly user-friendly in this way, it can be used by both developers and administrators.
• Flexibility: You can add new columns or fields in MongoDB without affecting existing rows or application performance.
LAB TASK
1. Install MongoDB and use its interface to create databases and collections.
2. Attach screenshots of your installed database and create 3 different databases of your own choice.
3. Populate those databases with a minimum of 15 documents using the MongoDB GUI; each document should contain at least 6 key-value pairs. For example, use the Emp table data that you used in DBMS.
4. Create Database and create collection of the data of your own choice and attach
screenshot of every output.
5. Use at least 5 collection functions of your own choice and attach screenshot of every
output.
6. Use any five Aggregation Pipeline Operators and Aggregation Pipeline Stages on the data
of your own choice and attach screenshot of every output.
7. Create a database named Students and a collection named StudentsInfo; insert 10 documents with the fields first_name, last_name, roll_no, date_of_admission, current_semester, and the CGPA in every semester, e.g. semester_cgpa: [3.99, 2.87].
8. Write a query to display the student first name and current semester only.
9. Write a query to display the min, max and avg CGPA grouped by first_name.
10. Write a query to sum up the cgpa.
11. Update any value in the database.
12. Write a query to select cgpa greater than 2.50 and less than 3.0.
13. Remove record where cgpa is less than 2.5.
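As a sketch of the kind of queries tasks 7-13 call for, the snippet below expresses them with PyMongo. The field names follow the task description and are otherwise assumptions; the actual lab answers should still be produced in the MongoDB shell or GUI with screenshots.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
students = client["Students"]["StudentsInfo"]

# Task 8: first name and current semester only (projection)
for doc in students.find({}, {"first_name": 1, "current_semester": 1, "_id": 0}):
    print(doc)

# Task 9: min, max and avg CGPA grouped by first_name
pipeline = [
    {"$unwind": "$semester_cgpa"},
    {"$group": {"_id": "$first_name",
                "min_cgpa": {"$min": "$semester_cgpa"},
                "max_cgpa": {"$max": "$semester_cgpa"},
                "avg_cgpa": {"$avg": "$semester_cgpa"}}},
]
print(list(students.aggregate(pipeline)))

# Task 10: sum up the CGPA values
print(list(students.aggregate([
    {"$unwind": "$semester_cgpa"},
    {"$group": {"_id": None, "total_cgpa": {"$sum": "$semester_cgpa"}}},
])))

# Task 11: update a value
students.update_one({"roll_no": 1}, {"$set": {"current_semester": 8}})

# Task 12: CGPA greater than 2.50 and less than 3.0
print(list(students.find({"semester_cgpa": {"$gt": 2.5, "$lt": 3.0}})))

# Task 13: remove records where CGPA is less than 2.5
students.delete_many({"semester_cgpa": {"$lt": 2.5}})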
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Teacher:    Date:
Department of: Subject of:
Practical 02
To Practice the Python Platform and Explore
its Libraries for Data Science & Machine
Learning.
Outline:
Introduction to Python
Variables and Types
Data Structures in Python
Functions & Packages
Numpy package
Data visualization with Matplotlib
Control flow
Pandas
Required Tools:
PC with Windows
Anaconda 2 or 3
Introduction to Python
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. It was created by Guido van Rossum in the late 1980s and first released in 1991. Python is very beginner-friendly: the syntax (words and structure) is extremely simple to read and follow, most of which can be understood even if you do not know any programming. Let's take a look at one example of this.
Example 1.1:
garage = ["Ferrari", "Honda", "Porsche", "Toyota"]
for each_car in garage:
    print(each_car)
"print()" is a built-in Python function that will output some text to the console.
Looking at the code about cars in the garage, can you guess what will happen? You probably have a general idea. For
each_car in the garage, we're going to do something. What are we doing? We are printing each car.
Since "printing" outputs some text to the "console," you can probably figure out that the console will say something like
"Ferrari, Honda, Porsche, Toyota."
Python Shell
A python shell is a way for a user to interact with the python interpreter.
Python IDLE
IDLE (Integrated DeveLopment Environment or Integrated Development and Learning Environment) is an integrated development environment for Python, which has been bundled with the default implementation of the language since version 1.5.2b1. It provides a Python shell with syntax highlighting.
Python Scripts
Scripts are reusable
Basically, a script is a text file containing the statements that comprise a Python program. Once you have created the
script, you can execute it over and over without having to retype it each time. Python scripts are saved with .py extension.
Scripts are editable
Perhaps, more importantly, you can make different versions of the script by modifying the statements from one file to the
next using a text editor. Then you can execute each of the individual versions. In this way, it is easy to create different
programs with a minimum amount of typing.
Anaconda Distribution
The open-source Anaconda Distribution is the easiest way to perform Python/R data science and machine learning on
Linux, Windows, and Mac OS X. With over 11 million users worldwide, it is the industry standard for developing, testing,
and training on a single machine.
Spyder IDE
Spyder is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers
and data analysts. It offers a unique combination of the advanced editing, analysis, debugging, and profiling functionality
of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful
visualization capabilities of a scientific package.
Beyond its many built-in features, its abilities can be extended even further via its plugin system and API. Furthermore,
Spyder can also be used as a PyQt5 extension library, allowing developers to build upon its functionality and embed its
components, such as the interactive console, in their own PyQt software.
Some of the components of spyder include:
Editor, work efficiently in a multi-language editor with a function/class browser, code analysis tools,
automatic code completion
Ipython console, harness the power of as many IPython consoles as you like within the flexibility of a
full GUI interface; run your code by line, cell, or file; and render plots right inline.
Variable explorer, Interact with and modify variables on the fly: plot a histogram or time series, edit a
data frame or Numpy array, sort a collection, dig into nested objects, and more!
Debugger, Trace each step of your code's execution interactively.
Help, instantly view any object's docs, and render your own.
Here is how the Spyder IDE looks (Windows edition) in action:
Figure 1.2: Spyder IDE
The first thing to note is how the Spyder app is organized. The application includes multiple separate windows (marked
with red rectangles), each of which has its own tabs (marked with green rectangles). You can change which windows you
prefer to have open from the View -> Windows and Toolbars option. The default configuration has the Editor, Object
inspector/Variable explorer/File explorer, and Console/History log windows open as shown above.
The Console is where python is waiting for you to type commands, which tell it to load data, do math, plot data, etc. After
every command, which looks like >>> command, you need to hit the enter key (return key), and then python may or may
not give some output. The Editor allows you to write sequences of commands, which together make up a program. The
History Log stores the last 100 commands you've typed into the Console. The Object inspector/Variable explorer/File
explorer windows are purely informational -- if you watch what the first two display as we go through the tutorial, you'll
see that they can be quite helpful.
Variables & Types
In almost every single Python program you write, you will have variables. Variables act as placeholders for data. They can
aid in short hand, as well as with logic, as variables can change, hence their name.
Python variables do not need explicit declaration to reserve memory space. The declaration happens automatically when
you assign a value to a variable. The equal sign (=) is used to assign values to variables.
Variables help programs become much more dynamic, and allow a program to always reference a value in one spot, rather
than the programmer needing to repeatedly type it out, and, worse, change it if they decide to use a different definition for
it.
Variables can be called just about whatever you want. You wouldn't want them to conflict with function names, and they also cannot start with a number.
Example 1.2:
weight=55
height=166
BMI=weight/(height*height)
print(BMI)
In this case, we will have 0.001995935549426622 printed out to the console. Here, we were able to store integers and their
manipulations to different variables.
We can also find out the type of any variable by using type function.
print(type(BMI))
This will return the type of variable as <class 'float'>
Data Structures in Python
The most basic data structure in Python is the sequence. Each element of a sequence is assigned a number - its position or
index. The first index is zero, the second index is one, and so forth.
There are certain things you can do with all sequence types. These operations include indexing, slicing, adding,
multiplying, and checking for membership. In addition, Python has built-in functions for finding the length of a sequence
and for finding its largest and smallest elements.
Python Lists
The list is the most versatile datatype available in Python; it can be written as a list of comma-separated values (items) between square brackets. An important thing about a list is that items in a list need not be of the same type.
Creating a list is as simple as putting different comma-separated values between square brackets. For example −
Example 2.3:
list1 = ['physics', 'chemistry', 1997, 2000];
list2 = [1, 2, 3, 4, 5 ];
list3 = ["a", "b", "c", "d"]
List indices start at 0, and lists can be sliced, concatenated and so on.
List of lists can also be created
Example 2.4:
weight=[55,44,45,53]
height=[166, 150, 144,155]
demo_list=[weight, height]
print(demo_list)
Accessing Values in Lists
To access values in lists, use the square brackets for slicing along with the index or indices to obtain value available at that
index. For example −
Example 2.5:
list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5, 6, 7]
print("list1[0]: ", list1[0])      # prints 'physics'
print("list2[1:5]: ", list2[1:5])  # prints [2, 3, 4, 5]
Example 2.8:
def printme(str):
    "This prints a passed string into this function"
    print(str)
    return

# Calling the function
printme("I'm first call to user defined function!")
printme("Again second call to the same function")
# When using integer array indexing, you can reuse the same
# element from the source array (a is an assumed example array,
# not defined in the original snippet):
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"
Histograms
To explore more about dataset one can also use histograms to get the idea about distributions.
Consider an example where we have 12 values in the range 0 to 10. To build a histogram of these values, divide the range into equal chunks (bins). Suppose we go for 3 bins; then draw a bar for each bin whose height corresponds to the number of data points falling in that bin.
Python code for the above scenario-
Example 2.4:
import matplotlib.pyplot as plt
help (plt.hist)
#list with 12 values
values=[1.2,1.3,2.2,3.3,2.4,6.5,6.6,7.7,8.8,9.9,4.2,5.3]
plt.hist(values)
plt.grid(True)
plt.show()
The above code displays following graph.
Logical Operators
A boolean expression (or logical expression) evaluates to one of two states: true or false. Python provides the boolean type, which can be set to either False or True.

Operator    Description
& (AND)     Returns True only if both of the compared expressions are True
| (OR)      Returns True if at least one of the compared expressions is True
~ (NOT)     Unary; has the effect of 'flipping' the result

Table 2.2: Logical operators
Control flow
A program’s control flow is the order in which the program’s code executes. The control flow of a Python program
is regulated by conditional statements, loops, and function calls.
if Statement
Often, you need to execute some statements only if some condition holds, or choose statements to execute depending on several mutually exclusive conditions. The Python compound statement if, which uses if, elif, and else clauses, lets you conditionally execute blocks of statements. Here's the syntax for the if statement:

if expression:
    statement(s)
elif expression:
    statement(s)
elif expression:
    statement(s)
...
else:
    statement(s)

The elif and else clauses are optional. Note that unlike some languages, Python does not have a switch statement, so you must use if, elif, and else for all conditional processing.
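As a small illustration of the syntax above (the threshold values are arbitrary):

bmi = 27.5

if bmi < 18.5:
    print("Underweight")
elif bmi < 25:
    print("Normal")
elif bmi < 30:
    print("Overweight")   # this branch runs for bmi = 27.5
else:
    print("Obese")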
Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool available in any language. It is already well on its way
toward this goal.
Pandas is well suited for many different kinds of data:
Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
Any other form of observational / statistical data sets. The data actually need not be labeled at all to
be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast
majority of typical use cases in finance, statistics, social science, and many areas of engineering.
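A quick sketch of these two structures (the values are illustrative):

import pandas as pd

# A Series is a labeled 1-dimensional array
heights = pd.Series([166, 150, 144], index=["Ali", "Sara", "Omar"])
print(heights["Sara"])   # 150

# A DataFrame is a labeled 2-dimensional table of columns
people = pd.DataFrame({"height": [166, 150, 144],
                       "weight": [55, 44, 45]},
                      index=["Ali", "Sara", "Omar"])
print(people)
print(people["weight"])  # a single column is itself a Series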
A common way to get started is with CSV (Comma Separated Values) files, which pandas handles efficiently. CSV files are used to store a large number of variables, or data. They are incredibly simplified spreadsheets (think Excel), only the content is stored in plain text. The text inside a CSV file is laid out in rows, and each of those has columns, all separated by commas. Every line in the file is a row in the spreadsheet, while the commas are used to define and separate cells.
Let’s take a look at one such example which retrieves data from CSV file. Use the given csv file.
Example
import pandas as pd

# Path is illustrative; point read_csv at your own copy of the file
data = pd.read_csv('C:/Users/Downloads/Import_User_Sample_en.csv',
                   index_col=0)  # uses the 0th column as the row index

# Original File
print("Original file")
print(data)

# Adding Data
data["Marital status"] = ['Married', 'Single', 'Divorced', 'Single', 'Single']  # adds a column to the DataFrame
print("Retrieval of newly added column")
print(data["Marital status"])

# Manipulation of Columns
data["New Number"] = data['Office Number'] / data['ZIP or Postal Code']
print("Retrieval of newly added column through manipulation")
print(data['New Number'])

# Row Access
data.set_index("Last Name", inplace=True)
print(data)
print(data.loc["Andrews"])

# Element Access
print("Element Access")
print(data.loc["Andrews", "Address"])

# Row Access
print('Row access via iloc', data.iloc[0:2, :])

# Column Access
print('Column access via iloc', data.iloc[:, [1]])
Execute the above code and find out the alterations made in the CSV file data.
Exercise
Variables & Types
Question 1:
Create a variable savings with the value 100.
Check out this variable by typing print(savings) in the script.
Question 2:
Python Lists
Question 3:
Create a list, areas, that contains the area of the hallway (hall), kitchen (kit), living room (liv),
bedroom (bed) and bathroom (bath), in this order. Use the predefined variables.
Print areas with the print() function.
List sub-setting
Question 4:
Print out the second element from the areas list, so 11.25.
Subset and print out the last element of areas, being 9.50. Using a negative index makes sense here!
Select the number representing the area of the living room and print it out.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
"bedroom", 10.75, "bathroom", 9.50]
# Definition of radius
r = 0.43
# Calculate C
C =
# Calculate A
A =
# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))
Numpy Package
Question 9:
import the numpy package as np, so that you can refer to numpy with np.
Use np.array() to create a Numpy array from baseball. Name this array np_baseball.
Print out the type of np_baseball to check that you got it right.
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]
# Import numpy
import numpy as np
# Define variables
room = "kit"
area = 14.0
Pandas
In the exercises that follow, you will be working with vehicle data in different countries. Each observation corresponds to a
country, and the columns give information about the number of vehicles per capita, whether people drive left or right, and
so on. This data is available in a CSV file, named cars.csv.
Question 15:
To import CSV files, you still need the pandas package: import it as pd.
Use pd.read_csv() to import cars.csv data as a DataFrame. Store this dataframe as cars.
Print out cars. Does everything look OK?
# Import pandas as pd and matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('http://assets.datacamp.com/course/intermediate_python/gapminder.csv',
                 index_col=0)
gdp_cap = list(df.gdp_cap)
life_exp = list(df.life_exp)
# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')
# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'
# Add axis labels
# Add title
# After customizing, display the plot
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 02: To Practice the Python Platform and explore its Libraries for Data Science and Machine
Learning
Roll No: Student Name:
Teacher:    Date:
Department of: Subject of:
Practical 03
To Perform Exploratory Data Analysis and
Demonstrate In-Depth Visualizations of Data.
Outline:
Features and their types
Basic Plots
Higher Dimensionality Visualizations
Required Tools:
PC with Windows
Anaconda3-5.0.1
Continuous Features
Continuous features take numeric values on a continuous scale, for example:
Time
Distance
Cost
Temperature
Categorical Features
With categorical features, there is a specified number of discrete, possible feature values. These values may or may not
have an ordering to them. If they do have a natural ordering, they are called ordinal categorical features. Otherwise if
there is no intrinsic ordering, they are called nominal categorical features.
Nominal
Car Models
Colors
TV Shows
Ordinal
High-Medium-Low
1-10 Years Old, 11-20 Years Old, 30-40 Years Old
Happy, Neutral, Sad
Figure 3.1: Types of features
3.2. Visualizations
One of the most rewarding and useful things you can do to understand your data is to visualize it in a pictorial format.
Visualizing your data allows you to interact with it, analyze it in a straightforward way, and identify new patterns, making
your would-be complex data more accessible and understandable. The way our brains process visuals like shapes, colors, and lengths makes looking at charts and graphs more intuitive for us than poring over spreadsheets.
3.2.1. Matplotlib
MatPlotLib is a Python data visualization tool that supports 2D and 3D rendering, animation, UI design, event handling,
and more. It only requires you pass in your data and some display parameters and then takes care of all of the rasterization
implementation details. For the most part, you will be interacting with MatPlotLib's Pyplot functionality through a Pandas
series or dataframe's .plot namespace. Pyplot is a collection of command-style methods that essentially make
MatPlotLib's charting methods feel like MATLAB.
3.3. Basic Plots
3.3.1. Histograms
Histograms are graphical techniques which have been identified as being most helpful for troubleshooting issues.
Histograms help you understand the distribution of a feature in your dataset. They accomplish this by simultaneously
answering the questions where in your feature's domain your records are located at, and how many records exist there.
Let's go ahead and explore a little bit about how to use histograms and what they can actually do. You've probably seen wheat before, nothing new there; however, it turns out that there are quite a few different varieties of it. The Agrophysics Institute of the Polish Academy of Sciences created a dataset containing a number of different types of wheat. They X-rayed the different wheat specimens and then derived features from the results. Some of the metrics they curated include the groove length of the kernel, the kernel length, the kernel width, and other measurements of the wheat. In addition, they also added some engineered or calculated features, such as an asymmetry coefficient.
Figure 3.2: Wheat Kernel
Example 3.1:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')  # Look pretty; ggplot is a Python implementation of the grammar of graphics
# If the above line throws an error, use plt.style.use('ggplot') instead

# Path is illustrative; point read_csv at your own copy of wheat.csv
df = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/wheat.csv", index_col=0)

# Prints the columns present in your csv file
print(df.columns)
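Example 3.1 only loads the data and prints its columns; a histogram of one of its features might then be drawn as in this short sketch. It continues from Example 3.1 (df and plt already defined), and the column name used is an assumption about wheat.csv; substitute one printed by df.columns.

# Continues from Example 3.1; 'kernel_length' is an assumed column name
df['kernel_length'].plot.hist(alpha=0.75)
plt.xlabel('Kernel length')
plt.show()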
Example 3.2:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# df is assumed to be loaded from a concrete-strength CSV containing
# 'Water', 'Ash' and 'Strength' columns, e.g.:
# df = pd.read_csv("concrete.csv")

df.plot.scatter(x='Water', y='Strength')
plt.show()

df.plot.scatter(x='Ash', y='Strength')
plt.show()
Output:
Figure 3.6: 2D Scatter plot based on slag & strength
# Creates the figure (on older Matplotlib versions this requires: from mpl_toolkits.mplot3d import Axes3D)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # "1x1 grid, first subplot"
# OR use ax = fig.gca(projection='3d'); gca = "get the current axes, creating one if necessary"
ax.scatter(df['Slag'], df['Water'], df['Strength'])  # assumed data columns, matching the axis labels
ax.set_xlabel('Slag')
ax.set_ylabel('Water')
ax.set_zlabel('Strength')
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead
3.4.3. IMSHOW
One last higher-dimensionality visualization technique you should know how to use is MatPlotLib's .imshow() method.
This command generates an image based off of the normalized values stored in a matrix, or rectangular array of float64s.
The properties of the generated image will depend on the dimensions and contents of the array passed in:
An [X, Y] shaped array will result in a grayscale image being generated
A [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 1 for green, and 1 for blue
A [X, Y, 4] shaped array results in a full-color image as before with an extra channel for alpha
Besides being a straightforward way to display .PNG and other images, the .imshow() method has quite a few other use
cases. When you use the .corr() method on your dataset, Pandas calculates a correlation matrix for you that measures how close to linear the relationship between any two features in your dataset is. Correlation values range from -1 to 1, where 1 means the two features are perfectly positively correlated and have identical slopes for all values, and -1 means they are perfectly negatively correlated, again linearly but with a negative slope. Values closer to 0 mean there is little to no linear relationship between the two variables (e.g., pizza sales and plant growth), so the further away from 0 the value is, the stronger the relationship between the features.
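A minimal sketch of this .corr() plus .imshow() use case (any numeric DataFrame will do; the values below are purely illustrative, and the wheat data from Example 3.1 would work just as well):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"length": [5.2, 5.6, 6.1, 5.9],
                   "width":  [2.9, 3.1, 3.4, 3.2],
                   "asym":   [4.1, 2.2, 3.7, 1.9]})

corr = df.corr()
print(corr)

# Render the correlation matrix as an image
plt.imshow(corr, cmap='RdBu', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()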
Example 3.6:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.image as mpimg
img=mpimg.imread("C:/Users/Mehak/Desktop/DataScience/Capture.png")
print(img.shape)
plt.imshow(img)
#plt.imshow(img,aspect=0.5)
Output:
Exercise
For this assignment, you'll be using the seeds data set, generated by recording X-Ray measurements of various wheat
kernels.
Histograms
Question 1: Write python code that
1. Loads the seeds dataset into a dataframe.
2. Creates a slice of your dataframe that only includes the area and perimeter features
3. Creates another slice that only includes the groove and asymmetry features
4. Creates a histogram for the 'area and perimeter' slice, and another histogram for the 'groove and
asymmetry' slice. Set the optional display parameter: alpha=0.75
Once you're done, run your code and then answer the following questions about your work:
a) Looking at your first plot, the histograms of area and perimeter, which feature do you believe more
closely resembles a Gaussian / normal distribution?
b) In your second plot, does the groove or asymmetry feature have more variance?
2D Scatter Plots
Question 2: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Create a 2d scatter plot that graphs the area and perimeter features
3. Create a 2d scatter plot that graphs the groove and asymmetry features
4. Create a 2d scatter plot that graphs the compactness and width features
Once you're done, answer the following questions about your work:
a) Which of the three plots seems to totally be lacking any correlation?
b) Which of the three plots has the most correlation?
3D Scatter Plots
Question 3: Write python code that
1. Loads up the seeds dataset into a dataframe. You should be very good at doing this by now.
2. Graph a 3D scatter plot using the area, perimeter, and asymmetry features. Be sure to label your axes,
and use the optional display parameter c='red'.
3. Graph a 3D scatter plot using the width, groove, and length features. Be sure to label your axes, and
use the optional display parameter c='green'.
Once you're done, answer the following questions about your work.
a) Which of the plots seems more compact / less spread out?
b) Which of the plots were you able to visibly identify two outliers within, that stuck out from the
samples?
Parallel Coordinates
Question 4: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Drop the area, and perimeter features from your dataset. Use .drop method on data frame to drop
specified columns
3. Plot a parallel coordinates chart, grouped by the wheat_type feature. Be sure to set the optional
display parameter alpha to 0.4
Once you're done, answer the following questions about your work.
a) Which class of wheat do the two outliers you found previously belong to?
b) Which feature has the largest spread of values across all three types of wheat?
Andrew’s Plot
Question 5: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Plot an Andrews curves chart, grouped by the wheat_type feature. Be sure to set the optional display
parameter alpha to 0.4
Once you're done, answer the following questions about your work.
a) Are your outlier samples still easily identifiable in the plot?
IMSHOW
Question 6: Write python code that
1. Loads up any image of your choice, into a dataframe.
2. Print shape and type of the object holding image.
3. Plot image using imshow.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 03: To perform Exploratory Data Analysis and demonstrate in depth-visualization of Data
Roll No: Student Name:
Teacher:    Date:
Department of: Computer Systems Engineering    Subject of: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro
Year: 4th    Semester: 8th    Batch: 18CS    Duration: 03 Hours
PRACTICAL 04
To perform Inferential Statistics and Hypothesis
testing
Outline:
Statistical Testing
Data Science, Machine Learning, Artificial Intelligence, Deep Learning - You need to learn
the basics before you become a good Data Scientist. Math and Statistics are the building
blocks of Algorithms for Machine Learning. Knowing the techniques behind different
Machine Learning Algorithms is fundamental to knowing how and when to use them.
"Statistics is the practice or science of collecting and analyzing numerical data in large
quantities, especially to infer proportions in a whole from those in a representative
sample."
Statistics are used to interpret data and solve complicated real-world issues. Data scientists
and analysts use statistics to search for concrete patterns and data changes. To put it
simply, Statistics can be used to carry out mathematical computations to extract useful
insights from data.
Statistical Hypothesis
A statistical hypothesis, sometimes called confirmatory data analysis, is a hypothesis that is
testable based on observing a process that is modeled via a set of random variables.
Null hypothesis:- A null hypothesis implies that there is no significant difference in a given set of observations.
Alternative hypothesis:-
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to
the null hypothesis. It is usually taken to be that the observations are the result of a real
effect (with some amount of chance variation superposed)
Example:- a company's production is ≠ 50 units per day.
Level of significance: Refers to the degree of significance in which we accept or reject the
null hypothesis. 100% accuracy is not possible for accepting or rejecting a hypothesis, so
we, therefore, select a level of significance that is usually 5%.
This is normally denoted by alpha (α) and is generally 0.05 or 5%, which means you should be 95% confident that your output will give a similar kind of result in each sample.
Type I error: When we reject the null hypothesis, although that hypothesis was true. Type
I error is denoted by alpha. In hypothesis testing, the normal curve that shows the critical
region is called the alpha region
Type II errors: When we accept the null hypothesis, but it is false. Type II errors are
denoted by beta. In Hypothesis testing, the normal curve that shows the acceptance region
is called the beta region.
One-tailed test:- A test of a statistical hypothesis, where the region of rejection is on only
one side of the sampling distribution, is called a one-tailed test.
Example:- a college has ≥ 4000 students or data science ≤ 80% org adopted.
Two-tailed test:- A two-tailed test is a statistical test in which the critical area of a
distribution is two-sided and tests whether a sample is greater than or less than a certain
range of values. If the sample being tested falls into either of the critical areas, the
alternative hypothesis is accepted instead of the null hypothesis.
P-value:- The P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true; the definition of 'extreme' depends on how the hypothesis is being tested.
If your P value is less than the chosen significance level then you reject the null hypothesis i.e.
accept that your sample gives reasonable evidence to support the alternative hypothesis. It
does NOT imply a “meaningful” or “important” difference; that is for you to decide when
considering the real-world relevance of your result.
Example: you have a coin and you don't know whether it is fair or tricky, so let's set up the null and alternative hypotheses:
Null hypothesis (H0): the coin is fair.
Alternative hypothesis (H1): the coin is tricky.
Now let's toss the coin and calculate the p-value (probability value).
Toss the coin a 1st time and the result is tails: p-value = 50% (as heads and tails have equal probability).
Toss the coin a 2nd time and the result is tails again: p-value = 50/2 = 25%.
Similarly, after 6 consecutive tails the p-value works out to about 1.5%. But we set our significance level at 5% (i.e. 95% confidence), and 1.5% is below that level, so our null hypothesis does not hold: we reject it and conclude that the coin is in fact a tricky one.
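A one-line check of the arithmetic behind this example:

# Probability of 6 tails in a row with a fair coin
p_value = 0.5 ** 6
print(p_value)   # 0.015625, i.e. about 1.5%, below the 5% significance level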
Now let's see some of the widely used hypothesis testing types: the T-Test, the Z-Test, the ANOVA test, and the Chi-Square test.
1. T-Test
A t-test is a type of inferential statistic which is used to determine if there is a significant
difference between the means of two groups that may be related to certain features. It is
mostly used when the data sets, like the set of data recorded as an outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. The T-
test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a
population.
T-test has 2 types: 1. one sampled t-test 2. two-sampled t-test.
One-sample t-test: The One Sample t Test determines whether the sample mean is
statistically different from a known or hypothesized population mean. The One Sample t-Test
is a parametric test.
Example:- you have 10 ages and you are checking whether avg age is 30 or not. (check the
code below for that using python)
Code:
Create CSV on your own
import numpy as np
from scipy.stats import ttest_1samp

# ages.csv is a file you create yourself, containing one age per line
ages = np.genfromtxt("ages.csv")
print(ages)
ages_mean = np.mean(ages)
print(ages_mean)

# Test whether the mean age differs from 30
tset, pval = ttest_1samp(ages, 30)
print("p-values", pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we are rejecting the null hypothesis")
else:
    print("we are accepting the null hypothesis")
Example: is there any association between week1 and week2 ( code is given below in python)
Code:
Create your CSV files:
import numpy as np
from scipy.stats import ttest_ind

# week1.csv and week2.csv are files you create yourself
week1 = np.genfromtxt("week1.csv")
week2 = np.genfromtxt("week2.csv")
print(week1)
print(week2)

week1_mean = np.mean(week1)
week2_mean = np.mean(week2)
week1_std = np.std(week1)
week2_std = np.std(week2)

ttest, pval = ttest_ind(week1, week2)
print("p-value", pval)
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")
Z Test.
Several different types of tests are used in statistics (i.e. f test, chi-square test, t-test). You
would use a Z test if:
• Your sample size is greater than 30 (otherwise, use a t-test).
• Data points are independent of each other; in other words, one data point isn't related to and doesn't affect another data point.
• Your data is normally distributed (although for large sample sizes, over 30, this doesn't always matter).
• Your data is randomly selected from a population, where each item has an equal chance of being selected.
For example, again we are using a z-test for blood pressure with some mean like 156 (python
code is below for same) one-sample Z test.
import pandas as pd
from statsmodels.stats.weightstats import ztest

df = pd.read_csv("blood_pressure.csv")   # file and column names are illustrative
ztest_score, pval = ztest(df['bp_after'], value=156)
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")
LAB TASK
1. Define Statistical Testing and its importance.
2. Explain Hypothesis Testing with code examples.
3. Explain at least two types of statistical testing (along with code) as
discussed in theory lecture.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
PRACTICAL 05
To Demonstrate the Implementation of Linear
Regression in Data Modeling
Outline:
Linear Regression implementation with Scikit-Learn
Required Tools:
PC with Windows
Python 3.6.2
Anaconda 3-5.0.1
Example 4.1
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/unchanged data set/lsd.csv")
print(data.head())
X = data['Tissue Concentration']
y = data['Test Score']
X=X.values.reshape(len(X),1)
y=y.values.reshape(len(y),1)
print(X)
print(y)
model = LinearRegression()
model.fit(X, y)
plt.scatter(X, y,color='r')
plt.plot(X, model.predict(X),color='k')
plt.show()
Output:
Figure 4.2:Linear regression in Python, Math Test Scores on the Y-Axis, and Amount of LSD intake on the X-
Axis.
Given data, we can try to find the best fit line. After we discover the best fit line, we can use it to make predictions.
Consider we have data about houses: price, size, driveway and so on.
Example 4.2:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

matplotlib.style.use('ggplot')

# Housing.csv is assumed to contain 'price' and 'lotsize' columns
df = pd.read_csv('Housing.csv')

Y = df['price']
X = df['lotsize']
X = X.values.reshape(len(X), 1)
Y = Y.values.reshape(len(Y), 1)

# Split into training and testing subsets, then fit the model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
regr = LinearRegression()
regr.fit(X_train, Y_train)

# Plot outputs
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()
The next example uses only one feature of the diabetes dataset, in order to illustrate a two-dimensional
plot of this regression technique. The straight line can be seen in the plot, showing how linear regression attempts to
draw a straight line that will best minimize the residual sum of squares between the observed responses in the
dataset, and the responses predicted by the linear approximation.
The coefficients, the residual sum of squares and the variance score are also calculated.
Example 4.3:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset and use only one feature
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Train the model and make predictions on the testing set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.show()
Output:
The general structure of the linear regression model in this case would be:
Y = a + b·X1 + c·X2
We have following datasets:
Example 4.4:
import matplotlib.pyplot as plt #provides methods for plotting
import pandas as pd #provides methods to deal with data
import numpy as np #provides methods for array creation
from sklearn.linear_model import LinearRegression #provides methods related
to LR
from sklearn.model_selection import train_test_split #provides methods
related to division of dataset
from mpl_toolkits.mplot3d import Axes3D
# co2_df and temp_df are assumed to be loaded from CSV files containing yearly
# global CO2 emissions and relative temperature; the file names are illustrative
co2_df = pd.read_csv('global_co2.csv')
temp_df = pd.read_csv('annual_temp.csv')   # assumed to already have 'Year' and 'Temperature' columns

# Clean data
co2_df = co2_df.iloc[:, :2]                    # Keep only total CO2
co2_df = co2_df.loc[co2_df['Year'] >= 1960]    # Keep only 1960 - 2010
co2_df.columns = ['Year', 'CO2']               # Rename columns
co2_df = co2_df.reset_index(drop=True)         # Reset index

print(co2_df.head())
print(temp_df.head())
# Concatenate
climate_change_df = pd.concat([co2_df, temp_df.Temperature], axis=1)
print(climate_change_df.head())
# Create an initial 3D scatter plot of the combined data
fig = plt.figure()
fig.set_size_inches(12.5, 7.5)
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=climate_change_df['Year'],
           ys=climate_change_df['Temperature'],
           zs=climate_change_df['CO2'])
ax.set_ylabel('Relative temperature')
ax.set_xlabel('Year')
ax.set_zlabel('CO2 Emissions')
plt.show()
X = climate_change_df[['Year']].values                                   # Creates vector X (as_matrix was removed in newer pandas)
Y = climate_change_df[['CO2', 'Temperature']].values.astype('float32')   # Creates vectors Y, Z

# Split into training and testing subsets, then fit
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
reg = LinearRegression()
reg.fit(X_train, y_train)
print('Score: ', reg.score(X_test, y_test))
x_line = np.arange(1960,2011).reshape(-1,1)
p = reg.predict(x_line).T
fig2 = plt.figure()
fig2.set_size_inches(12.5, 7.5)#sets size of figure
ax = fig2.gca( projection='3d')# creates 3D graph
ax.scatter(xs=climate_change_df['Year'],
ys=climate_change_df['Temperature'], zs=climate_change_df['CO2']) #sets data
for X,Y,Z axis and creates scatter plot
ax.set_ylabel('Relative temperature')
ax.set_xlabel('Year')
ax.set_zlabel('CO2 Emissions')
ax.plot(xs=x_line, ys=p[1], zs=p[0], color='red')
Output:
Advances in medicine, an increase in healthcare facilities, and improved standards of care have all contributed to an increased overall life expectancy over the last few decades. Although this might seem like a great achievement for humanity, it has also led to more elderly people being placed into senior-care and assisted-living communities. The morality, benefits, and disadvantages of leaving one's parents in such facilities are still debatable; however, the fact that this practice has increased the financial burden on both the private sector and government is not.
In this lab assignment, you will be using a subset of a life expectancy dataset, provided courtesy of the Center for
Disease Control and Prevention's National Center for Health Statistics page. The page hosts many open datasets on
topics ranging from injuries, poverty, women's health, education, health insurance, and of course infectious diseases,
and much more. But the one you'll be using is their "Life expectancy at birth, at age 65, and at age 75, by sex, race,
and origin" data set, which has statistics dating back from the 1900's to current, taken within the United States. The
dataset only lists the life expectancy of whites and blacks, because throughout most of the collection period, those
were the dominant two races that actively had their statistics recorded within the U.S.
Using linear regression, you will extrapolate how long people will live in the future. The private sector and
governments mirror these calculations when computing social security payouts, taxes, infrastructure, and more.
Complete the following:
1) Make sure the dataset has been properly loaded.
2) Create a linear model to use and re-use throughout the assignment. You can retrain the same
model again, rather than re-creating a new instance of the class.
3) Slice out using indexing any records before 1986 into a brand new slice.
4) Have one slice for training and one for testing. First, map the life expectancy of white males as a
function of age, or WhiteMales = f(age).
5) Fit your model, draw a regression line and scatter plot with the convenience function, and then
print out the actual, observed 2015 White Male life expectancy value from the dataset.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Practical 06
To Demonstrate the Implementation of
Logistic Regression in Data Modeling
Outline:
Logistic Regression implementation with Scikit-Learn
Required Tools:
PC with Windows
Python 3.6.2
Anaconda 3-5.0.1
6.1 Introduction
Logistic regression is another technique borrowed by machine learning from the field of statistics. Logistic
regression is a statistical method for analyzing a dataset in which there are one or more independent variables that
determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible
outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE,
success, malignant, etc.) or 0 (FALSE, failure, benign, etc.)
Classification problems
• Email -> spam/not spam?
• Online transactions -> fraudulent?
• Tumor -> Malignant/benign
These types of problems are called binary classification problems. The output variable in these problems is Y.
• Y is either 0 or 1
– 0 = negative class (absence of something)
– 1 = positive class (presence of something)
Before diving into the underlying mathematical concepts of logistic regression, let's brush up our understanding of logistic regression with an example.
Suppose HPenguin wants to know how likely it is to be happy based on its daily activities. If the penguin wants to build a logistic regression model to predict its happiness from its daily activities, it needs both the happy and the sad activities. In machine learning terminology these activities are known as the input parameters (features).
Figure 5.1: Logistic Regression Model Example
So let’s create a table which contains penguin activities and the result of that activity like happy or sad.
No. Penguin Activity Penguin Activity Description How Penguin felt ( Target )
Penguin is going to use the above activities (features) to train the logistic regression model. Later the trained logistic
regression model will predict how the penguin is feeling for the new penguin activities.
As it’s not possible to use the above categorical data table to build the logistic regression. The above activities data
table needs to convert into activities score, weights, and the corresponding target.
No.  Penguin Activity  Activity Score  Weights  Target  Target Description
1    X1                6               0.6      1       Happy
2    X2                3               0.4      1       Happy
3    X3                7               -0.7     0       Sad
4    X4                3               -0.3     0       Sad
Table 5.2: Penguin Activity Continuous Data chart
In the previous example, linear regression does a reasonable job of stratifying the data points into one of two classes.
• But what if we had a single "Yes" with a very small tumor?
Another issue with linear regression: we know Y is 0 or 1, but the hypothesis can give values larger than 1 or less than 0. Logistic regression instead generates a hypothesis value that always lies between 0 and 1:

0 ≤ hθ(x) ≤ 1

Logistic regression is a classification algorithm.
6.3 Hypothesis Representation
The function used to represent our hypothesis in classification is based on the sigmoid function. We want our classifier to output values between 0 and 1.
• When using linear regression we had hθ(x) = θᵀx.
• For logistic regression we pass z = θᵀx (a real number) through the sigmoid function g(z) = 1 / (1 + e^(-z)), giving:

hθ(x) = 1 / (1 + e^(-θᵀx))
6.3.1. What does the sigmoid function look like?
Asymptotes at 0 and 1.
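A quick sketch to visualize the S-shaped curve and its asymptotes at 0 and 1:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
g = 1 / (1 + np.exp(-z))     # sigmoid function g(z)

plt.plot(z, g)
plt.axhline(0, color='gray', linewidth=0.5)   # lower asymptote
plt.axhline(1, color='gray', linewidth=0.5)   # upper asymptote
plt.xlabel('z')
plt.ylabel('g(z)')
plt.show()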
For example, if hθ(x) = 0.7 for the feature vector

x = [x0, x1] = [1, tumorSize]

we would tell the patient there is a 70% chance of the tumor being malignant. Since the two outcomes are complementary:

P(y=1 | x; θ) + P(y=0 | x; θ) = 1
P(y=0 | x; θ) = 1 − P(y=1 | x; θ)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
# this is our test set, it's just a straight line with some
# Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)   # np.float was removed in newer NumPy; the builtin float works the same
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)
X = X[:, np.newaxis]
# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)
def model(x):
    return 1 / (1 + np.exp(-x))

# Points at which to evaluate the fitted models (not defined in the original extract)
X_test = np.linspace(-5, 10, 300)

loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
plt.plot(X_test, loss, color='red', linewidth=3)
ols = linear_model.LinearRegression()
ols.fit(X, y)
plt.plot(X_test, ols.coef_ * X_test + ols.intercept_,
linewidth=1)
plt.axhline(.5, color='.5')
plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression
Model'),loc="lower right", fontsize='small')
plt.show()
Output:
Example 5.2:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Split the data into training and testing subsets
# (stratify and random_state chosen here for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))

# A smaller C means stronger regularization
log_reg001 = LogisticRegression(C=0.01)
log_reg001.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg001.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg001.score(X_test, y_test)))
Output:
Lab 06: To Practice and demonstrate the implementation of Logistic regression in Data
Modeling
Roll No: Student Name:
Practical 07
To Practice and Demonstrate the
Implementation Of Artificial Neural
Network & Deep Learning.
Outline:
Neural Network implementation with Scikit-Learn
Image Classification using Convolutional Neural Networks
Required Tools:
PC with Windows
Python 3.6.2
Anaconda 3-5.0.1
7.1. Introduction
Neural Networks are a machine learning framework that attempts to mimic the learning pattern of
natural biological neural networks. Biological neural networks have interconnected neurons with
dendrites that receive inputs, then based on these inputs they produce an output signal through an
axon to another neuron. We will try to mimic this process through the use of Artificial Neural
Networks (ANN), which we will just refer to as neural networks from now on. The process of creating
a neural network begins with the most basic form, a single perceptron.
7.2. The Perceptron
Let's start our discussion by talking about the perceptron. A perceptron has one or more inputs, a bias, an activation function, and a single output. The perceptron receives inputs, multiplies them by some weights, and then passes them into an activation function to produce an output. There are many possible activation functions to choose from, such as the logistic function, a trigonometric function, a step function, etc. We also make sure to add a bias to the perceptron; this avoids issues where all inputs could be equal to zero (meaning no multiplicative weight would have an effect). The bias is also used to delay the triggering of the activation function. Check out the diagram below for a visualization of a perceptron:
To create a neural network, we simply begin to add layers of perceptrons together, creating a multi-layer perceptron model of a neural network. You'll have an input layer which directly takes in your
feature inputs and an output layer which will create the resulting outputs. Any layers in between are
known as hidden layers because they don't directly "see" the feature inputs or outputs.
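A minimal NumPy sketch of a single perceptron's forward pass, as described above (the weights, bias, and inputs are arbitrary illustrative values):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # logistic activation function

inputs = np.array([0.5, -1.2, 3.0])   # feature inputs
weights = np.array([0.4, 0.7, -0.2])  # one weight per input
bias = 0.1

# Weighted sum of inputs plus bias, passed through the activation
output = sigmoid(np.dot(inputs, weights) + bias)
print(output)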
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

cancer = load_breast_cancer()
# Data Exploration
print(cancer['DESCR'])
print(cancer.feature_names)
print(cancer.target_names)
print(cancer.data)
print(cancer.data.shape)
# Split, scale, and train an MLP classifier (setup assumed; not shown in the original extract)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)
# Visualize the learned input-to-hidden weights
plt.figure(figsize=(20,5))
plt.imshow(mlp.coefs_[0], interpolation='None', cmap='GnBu')
plt.yticks(range(30), cancer.feature_names)
plt.xlabel('Columns in weight matrix')
plt.ylabel('Input feature')
plt.colorbar()
Output:
Figure 6.3: Accuracies based on different parameters of MLPClassifier
ABOUT DATASET:
The dataset that will be used is from Kaggle: Pima Indians Diabetes Database.
It has 9 variables: ‘Pregnancies’, ‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’, ‘BMI’,
‘DiabetesPedigreeFunction’,’Age’, ‘Outcome’.
Here is the variable description from Kaggle:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
All these variables are continuous; the goal of the tutorial is to predict whether someone has diabetes (Outcome=1) according to the other variables. It is worth noticing that all the observations are from women older than 21 years old.
Example 6.2:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# Path is illustrative; point read_csv at your own copy of diabetes.csv
Diabetes = pd.read_csv('C:/Users/Mehak/Desktop/DataScience/unchanged data set/diabetes.csv')

table1 = np.mean(Diabetes, 0)   # mean of each column
table2 = np.std(Diabetes, 0)    # standard deviation of each column
print(table1)
print(table2)

# First 8 columns are the features, the 9th column ('Outcome') is the target
inputData = Diabetes.iloc[:, :8]
outputData = Diabetes.iloc[:, 8]

# Train an MLP classifier on the dataset (parameters are illustrative)
mlp = MLPClassifier(max_iter=1000)
mlp.fit(inputData, outputData)

#### Model performance
#### Classification rate 'by hand'
## Correctly classified
print('CORRECTLY CLASSIFIED', np.mean(mlp.predict(inputData) == outputData))

## True positive
trueInput = Diabetes.loc[Diabetes['Outcome'] == 1].iloc[:, :8]
trueOutput = Diabetes.loc[Diabetes['Outcome'] == 1].iloc[:, 8]
## True positive rate
print('TRUE POSITIVE RATE', np.mean(mlp.predict(trueInput) == trueOutput))

## True negative
falseInput = Diabetes.loc[Diabetes['Outcome'] == 0].iloc[:, :8]
falseOutput = Diabetes.loc[Diabetes['Outcome'] == 0].iloc[:, 8]
## True negative rate
print('TRUE NEGATIVE RATE', np.mean(mlp.predict(falseInput) == falseOutput))

plt.figure()
plt.scatter(inputData.iloc[:, 1], inputData.iloc[:, 5], c=outputData, alpha=0.4)
plt.xlabel('Glucose level')
plt.ylabel('BMI')
plt.show()
Output:
Figure 6.5: Scatter plot based on glucose level & BMI (predicted outcomes)
Figure 6.6: Scatter plot based on glucose level & BMI (actual outcomes)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 32, 32, 32)        896

 dropout (Dropout)           (None, 32, 32, 32)        0

 conv2d_1 (Conv2D)           (None, 32, 32, 32)        9248

 max_pooling2d (MaxPooling2D (None, 16, 16, 32)        0
 )

 flatten (Flatten)           (None, 8192)              0

 dense (Dense)               (None, 512)               4194816

 dropout_1 (Dropout)         (None, 512)               0

 dense_1 (Dense)             (None, 10)                5130

=================================================================
Total params: 4,210,090
Trainable params: 4,210,090
Non-trainable params: 0
_________________________________________________________________
It is typical in a network for image classification to be comprised of
convolutional layers at an early stage, with dropout and pooling layers
interleaved. Then, at a later stage, the output from convolutional layers is
flattened and processed by some fully connected layers.
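The summary shown above could come from a model along the following lines. This is only a sketch, assuming 32x32 RGB inputs (e.g. CIFAR-10) and 10 output classes; the layer sizes are chosen to match the summary, while the dropout rates, optimizer, and loss are illustrative choices rather than the original notebook's exact settings.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # Two convolutional layers with dropout in between
    layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                  input_shape=(32, 32, 3)),
    layers.Dropout(0.25),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Flatten and classify with fully connected layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()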
...
# Extract output from each layer
extractor = tf.keras.Model(inputs=model.inputs,
                           outputs=[layer.output for layer in model.layers])
features = extractor(np.expand_dims(X_train[7], 0))

# Show the 32 feature maps from the first layer
l0_features = features[0].numpy()[0]

fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(l0_features[..., i])

plt.show()
The above code will print the feature maps like the following:
You can see that they are called feature maps because they are
highlighting certain features from the input image. A feature is identified
using a small window (in this case, over a 3×3 pixels filter). The input
image has three color channels. Each channel has a different filter
applied, and their results are combined for an output feature.
You can similarly display the feature map from the output of the second
convolutional layer as follows:
...
# Show the 32 feature maps from the third layer
l2_features = features[2].numpy()[0]

fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(l2_features[..., i])

plt.show()
This shows the following:
From the above, you can see that the features extracted are more
abstract and less recognizable.
Exercise
Question 1:
Use the Housing.csv file from Example 4.2. Write code for an MLPClassifier that classifies how likely the attribute 'fullbase' is to be yes or no.
a) Divide the dataset into training and testing subsets.
b) Find out accuracy on the basis of provided data (without preprocessing)
c) Find out accuracy on the basis of scaled data (including preprocessing) Also create a scatter plot
in which one can visualize the probability of getting yes or no via MLPclassifier
d) Write a CNN based code using MNIST data set for Handwritten digit Classification
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 07: To Practice and demonstrate the implementation of Artificial Neural Network &
Deep Learning
Roll No: Student Name:
Practical 08
To Practice and Demonstrate the
Implementation Of Support Vector
Machine.
Outline:
SVM implementation with Scikit-Learn
Required Tools:
PC with Windows
Python 3.6.2
Anaconda 3-5.0.1
8.1. Introduction
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both
classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we
plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of
each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane
that differentiates the two classes very well.
Figure 7.1: SVM hyperplane visualization
Support vectors are simply the coordinates of individual observations. The Support Vector Machine is the frontier
(hyper-plane/line) that best segregates the two classes.
8.1.1. What is Classification Analysis?
Let’s consider an example to understand these concepts. We have a population composed of 50%-50% males and
females. Using a sample of this population, you want to create a set of rules that will guide us to the gender
class for the rest of the population. Using this algorithm, we intend to build a robot which can identify whether a person
is a male or a female. This is a sample problem of classification analysis. Using some set of rules, we will try to
classify the population into two possible segments. For simplicity, let’s assume that the two differentiating factors
identified are: height of the individual and hair length. Following is a scatter plot of the sample.
The blue circles in the plot represent females and the green squares represent males. A few expected insights from the
graph are:
1. Males in our population have a higher average height.
2. Females in our population have longer scalp hair.
If we were to see an individual with a height of 180 cm and hair length of 4 cm, our best guess would be to classify this
individual as a male. This is how we do a classification analysis.
Support vectors are simply the coordinates of individual observations. For instance, (45,150) is a support vector
which corresponds to a female. The Support Vector Machine is the frontier which best segregates the males from the
females. In this case, the two classes are well separated from each other, hence it is easier to find an SVM.
Finding the SVM for the case in hand:
There are many possible frontiers which can classify the problem in hand. Following are three possible frontiers.
How do we decide which is the best frontier for this particular problem statement?
The easiest way to interpret the objective function in an SVM is to find the minimum distance of the frontier from the
closest support vector (this can belong to either class). For instance, the orange frontier is closest to the blue circles, and the
closest blue circle is 2 units away from the frontier. Once we have these distances for all the frontiers, we simply
choose the frontier with the maximum distance from its closest support vector. Out of the three shown frontiers,
we see the black frontier is farthest from its nearest support vector (i.e. 15 units).
8.2. Dive Deeper
8.2.1. Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and
C). Now, identify the right hyper-plane to classify stars and circles.
You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the
two classes better.” In this scenario, hyper-plane “B” has performed this job excellently.
8.2.2. Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and
C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?
Figure 7.5: Scenario 2, Hyperplanes
Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the
right hyper-plane. This distance is called the margin. Let’s look at the snapshot below:
Above, you can see that the margin for hyper-plane C is high compared to both A and B. Hence, we name
the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness:
if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
8.2.3. Identify the right hyper-plane (Scenario-3): Hint: use the rules discussed in the previous
sections to identify the right hyper-plane.
Some of you may have selected hyper-plane B as it has a higher margin compared to A. But here is the catch:
SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane
B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A.
Find the hyper-plane to segregate the two classes (Scenario-4): In the scenario below, we can’t
have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have
only looked at linear hyper-planes.
SVM can solve this problem easily! It solves it by introducing an additional feature. Here, we will add a
new feature z=x^2+y^2. Now, let’s plot the data points on the x and z axes:
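As a quick illustration, the sketch below generates two concentric-circle classes (an assumed toy dataset) and adds the feature z = x² + y²; plotted against x, the two classes become linearly separable.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Two classes that are not linearly separable in (x, y)
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4)

# New feature: z = x^2 + y^2 (squared distance from the origin)
z = X[:, 0] ** 2 + X[:, 1] ** 2

# In the (x, z) plane the two classes can be separated by a straight line
plt.scatter(X[:, 0], z, c=y, cmap=plt.cm.Paired)
plt.xlabel('x')
plt.ylabel('z = x^2 + y^2')
plt.show()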
The following example applies an SVC to the breast cancer dataset, first on the raw features and then on features rescaled to the [0, 1] range:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,
                                                    cancer.target, random_state=0)
# SVC on the unscaled data
svm = SVC()
svm.fit(X_train, y_train)
# Rescale each feature to the [0, 1] range using training-set statistics
min_train = X_train.min(axis=0)
range_train = (X_train - min_train).max(axis=0)
X_train_scaled = (X_train - min_train) / range_train
# SVC on the scaled data
svm = SVC()
svm.fit(X_train_scaled, y_train)
# A larger C fits a more complex model
svm = SVC(C=1000)
svm.fit(X_train_scaled, y_train)
Example 7.2:
import numpy as np
import pylab as pl
from sklearn import svm, datasets

# load the iris ("flower") dataset referenced in the exercise below
# and keep the first two features so the decision regions can be plotted in 2D
iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target

h = 0.02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, Y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# 'SVC with linear kernel'
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
pl.contourf(xx, yy, Z, cmap=pl.cm.Paired)
pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)
pl.axis('off')
pl.show()
Exercise
Question 1: Repeat example 7.2 to classify the flower dataset into 3 classes based on:
a) RBF kernel
b) Polynomial (degree 3) kernel
c) LinearSVC (linear kernel)
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 08: To Practice and demonstrate the implementation of Support Vector Machine
Roll No: Student Name:
Practical 09
To Practice and Demonstrate the
Implementation Of K- Means Clustering
Algorithm.
Outline:
K-Means clustering implementation with Scikit-Learn
Required Tools:
PC with windows
Python 3.6.2
Anaconda 3-5.0.1
9.1. Introduction
9.1.1. Unsupervised Learning
Unsupervised learning is the training of a machine learning algorithm using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. It is, in essence, learning
from unlabeled data.
Unsupervised machine learning can be applied to many of the same problems as supervised machine learning, though
it may not be as efficient or accurate.
Unsupervised machine learning is most often applied to questions of underlying structure. Genomics, for example, is
an area where we do not truly understand the underlying structure. Thus, we use unsupervised machine learning to
help us figure out the structure.
Unsupervised learning can also aid in "feature reduction." A term we will cover eventually here is "Principal
Component Analysis," or PCA, which is another form of feature reduction, used frequently with unsupervised
machine learning. PCA attempts to locate linearly uncorrelated variables, calling these the Principal Components,
since these are the more "unique" elements that differentiate or describe whatever the object of analysis is.
There is also a meshing of supervised and unsupervised machine learning, often called semi-supervised machine
learning. You will often find things get more complicated with real world examples. You may find, for example,
that first you want to use unsupervised machine learning for feature reduction, then you will shift to supervised
machine learning.
Flat Clustering
Flat clustering is where the scientist tells the machine how many categories to cluster the data into.
Hierarchical
Hierarchical clustering is where the machine is allowed to decide how many clusters to create based on its own
algorithms.
9.1.2. Comparison of Supervised & Unsupervised Learning
• Supervised learning
• Unsupervised learning
9.2.1. Algorithm
Our algorithm works as follows, assuming we have inputs x1, x2, x3, …, xn and a value of K:
Step 1 - Pick K random points as cluster centers, called centroids.
Step 2 - Assign each xi to the nearest cluster by calculating its distance to each centroid.
Step 3 - Find the new cluster centers by taking the average of the assigned points.
Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change.
9.2.2. Dive Deeper
Step 1
We randomly pick K cluster centers (centroids). Let’s assume these are c1, c2, …, cK, so that
C = {c1, c2, …, cK}
where C is the set of all centroids.
Step 2
In this step we assign each input value to the closest center. This is done by calculating the Euclidean (L2) distance between
the point and each centroid:
arg min_{ci ∈ C} dist(ci, x)²
Step 3
In this step, we find the new centroid by taking the average of all the points assigned to that cluster:
ci = (1 / |Si|) Σ_{xj ∈ Si} xj
where Si is the set of all points assigned to the i-th cluster.
Step 4
In this step, we repeat Steps 2 and 3 until none of the cluster assignments change. That is, we repeat the algorithm
until the clusters remain stable.
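A minimal NumPy sketch of these four steps (illustrative only; the lab examples below rely on scikit-learn's KMeans implementation):
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on a tiny 2D dataset
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6], [8, 8], [9, 11], [5, 8]])
print(kmeans(X, k=2))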
9.3. Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the
number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and
compare the results. In general, there is no method for determining the exact value of K, but an accurate estimate can be
obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is the mean distance
between data points and their cluster centroid. Since increasing the number of clusters will always reduce the
distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the
same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the
centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to
roughly determine K.
A number of other techniques exist for validating K, including cross-validation, information criteria, the information
theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of
data points across groups provides insight into how the algorithm is splitting the data for each K.
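A short sketch of the elbow heuristic using scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared distances); the synthetic blob data is an illustrative assumption:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# Fit K-means for a range of K and record the within-cluster sum of squares
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot inertia against K and look for the "elbow"
plt.plot(ks, inertias, 'o-')
plt.xlabel('K (number of clusters)')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow method')
plt.show()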
For more information on scikit learn’s implementation of KMeans clustering follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Example 8.1:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans

x = [1, 5, 1.5, 8, 1, 9]
y = [2, 8, 1.8, 8, 0.6, 11]
plt.scatter(x, y)
plt.show()

X = np.array([[1, 2],
              [5, 8],
              [1.5, 1.8],
              [8, 8],
              [1, 0.6],
              [9, 11]])

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)

colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.show()
Output:
Figure 8.3: Scatter Plot
Example 8.2:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")

centers = [[1, 1], [5, 5], [3, 10]]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1)

plt.scatter(X[:, 0], X[:, 1])
plt.show()

Kmeans = KMeans(n_clusters=3)
Kmeans.fit(X)
labels = Kmeans.labels_
cluster_centers = Kmeans.cluster_centers_
print(cluster_centers)

n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

colors = 10 * ['r.', 'g.', 'b.', 'c.', 'k.', 'y.', 'm.']
for i in range(len(X)):
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
            marker="x", color='k', s=150, linewidths=5, zorder=10)
plt.show()
Output:
Example 8.3:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import style
style.use("ggplot")

centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]
# generate 3D sample data around the centers (as in Example 8.2)
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1)

Kmeans = KMeans(n_clusters=3)
Kmeans.fit(X)
labels = Kmeans.labels_
cluster_centers = Kmeans.cluster_centers_
print(cluster_centers)

n_clusters_ = len(np.unique(labels))
colors = 10 * ['r', 'g', 'b', 'c', 'k', 'y', 'm']
print(colors)
print(labels)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# plot each data point in its cluster colour, then mark the cluster centers
for i in range(len(X)):
    ax.scatter(X[i][0], X[i][1], X[i][2], c=colors[labels[i]])
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1], cluster_centers[:, 2],
           marker="x", color='k', s=150, linewidths=5, zorder=10)
plt.show()
Output:
Figure 8.7: Cluster assignment
Example 8.4:
# imports
import matplotlib.pyplot as plt, numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('C:/Users/Mehak/Desktop/DataScience/unchanged data set/data_1024.csv', sep='\t')
df.head()

plt.rcParams['figure.figsize'] = (12, 6)
plt.figure()
plt.plot(df.Distance_Feature, df.Speeding_Feature, 'ko')
plt.ylabel('Speeding Feature')
plt.xlabel('Distance Feature')
plt.ylim(0, 100)
plt.show()
Output:
9.5. Conclusion
Even though it works very well, K-means clustering has its own issues. These include:
• If you run K-means on uniform data, you will still get clusters.
• It is sensitive to scale due to its reliance on Euclidean distance.
• Even on perfect data sets, it can get stuck in a local minimum.
Exercise
Question 1:
Repeat Example 8.4 to make 4 clusters using the same dataset.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 09: To Practice and demonstrate the implementation of K-means Clustering algorithm
Roll No: Student Name:
Practical 10
To Practice and Demonstrate the
Implementation Of Decision Trees
Algorithm.
10.1 Introduction
Tree based learning algorithms are considered to be among the best and most used supervised learning
methods. Tree based methods empower predictive models with high accuracy, stability and ease of
interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable at solving any
kind of problem at hand (classification or regression).
Figure 9.1: Decision Trees
Example:
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6
ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket
during the leisure period. In this problem, we need to segregate students who play cricket in their leisure time based on
the most significant input variable among all three.
This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the
variable which creates the best homogeneous sets of students (which are heterogeneous to each other). In the
snapshot below, you can see that the variable Gender is able to identify the best homogeneous sets compared to the other
two variables.
Split on Gender:
Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
Similarly, for split on Class:
Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
Since the split on Gender has the higher weighted Gini score, the tree splits on Gender first.
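These calculations can be reproduced with a few lines of Python; note that the workbook uses the Gini score p1² + p2² (higher means a purer node), rather than the Gini impurity 1 − (p1² + p2²):
def gini_score(p1, p2):
    # Gini score used above: sum of squared class proportions (higher = purer)
    return p1 ** 2 + p2 ** 2

gini_female = gini_score(0.2, 0.8)            # 0.68
gini_male = gini_score(0.65, 0.35)            # ~0.55
weighted_gender = (10 / 30) * gini_female + (20 / 30) * gini_male       # ~0.59

gini_class_ix = gini_score(0.43, 0.57)        # ~0.51
gini_class_x = gini_score(0.56, 0.44)         # ~0.51
weighted_class = (14 / 30) * gini_class_ix + (16 / 30) * gini_class_x   # ~0.51

# The split with the higher weighted Gini score (Gender) is chosen first
print(weighted_gender, weighted_class)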
10.7 Are tree based models better than linear models?
“If I can use logistic regression for classification problems and linear regression for regression problems, why is
there a need to use trees”? Many of us have this question. And, this is a valid one too.
Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key
factors which will help you to decide which algorithm to use:
1. If the relationship between dependent & independent variable is well approximated by a
linear model, linear regression will outperform tree based model.
2. If there is a high non-linearity & complex relationship between dependent & independent
variables, a tree model will outperform a classical regression method.
3. If you need to build a model which is easy to explain to people, a decision tree model will
always do better than a linear model. Decision tree models are even simpler to interpret
than linear regression!
For more information on scikit-learn’s implementation of the Decision Tree classifier, follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Example 9.1
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_circles
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# linearly separable data
X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1)
X1_min, X1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
X2_min, X2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

clf = LogisticRegression()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Logistic Regression: {}'.format(score))
# decision tree comparison (uses the imported DecisionTreeClassifier; cf. Figure 9.4)
tree = DecisionTreeClassifier()
score = np.mean(cross_val_score(tree, X, y, cv=5))
print('Decision Tree: {}'.format(score))

# customisation: non-linear (circular) data
X, y = make_circles(n_samples=200, noise=0.2, factor=0.5)
X1_min, X1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
X2_min, X2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

clf = LogisticRegression()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Logistic Regression: {}'.format(score))
tree = DecisionTreeClassifier()
score = np.mean(cross_val_score(tree, X, y, cv=5))
print('Decision Tree: {}'.format(score))
Output:
Figure 9.4: Decision Trees v/s Logistic Regression Accuracy
Exercise
1. Let’s say you are wondering whether to quit your job or not. You have to consider some
important points and questions. Here is an example of a decision tree in this case.
Develop a code for this example tree.
2. Imagine you are an IT project manager and you need to decide whether to start a
particular project or not. You need to take into account important possible outcomes
and consequences. The decision tree examples, in this case, might look like the
diagram below. Develop a code for this example tree.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 10: To Practice and demonstrate the implementation of Decision Trees algorithm
Roll No: Student Name:
Practical 11
To Practice and Demonstrate the
Implementation Of Random Forest
Algorithm.
The following are the basic steps involved in performing the random forest algorithm:
1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1
and 2.
4. In case of a regression problem, for a new record, each tree in the forest
predicts a value for Y (output). The final value can be calculated by taking the
average of all the values predicted by all the trees in the forest. In case of a
classification problem, each tree in the forest predicts the category to which
the new record belongs. Finally, the new record is assigned to the category
that wins the majority vote.
11.3 Advantages of using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. In the next
two sections we'll take a look at the pros and cons of using random forest for
classification and regression.
1. The random forest algorithm is not strongly biased, since there are multiple trees
and each tree is trained on a subset of the data. Basically, the random forest
algorithm relies on the power of "the crowd"; therefore the overall bias
of the algorithm is reduced.
2. This algorithm is very stable. Even if a new data point is introduced in the
dataset, the overall algorithm is not affected much, since new data may impact
one tree, but it is very hard for it to impact all the trees.
3. The random forest algorithm works well when you have both categorical
and numerical features.
4. The random forest algorithm also works well when data has missing values
or has not been scaled well (although we perform feature scaling in
this practical just for the purpose of demonstration).
11.4 Disadvantages of using Random Forest
1. A major disadvantage of random forests lies in their complexity. They
require much more computational resources, owing to the large number of
decision trees joined together.
2. Due to their complexity, they require much more time to train than other
comparable algorithms.
The Scikit-Learn library can be used to implement the random forest algorithm to
solve regression as well as classification problems.
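As a quick sketch of the classification side (the regression case follows in Example 10.1), the code below fits a RandomForestClassifier on the iris dataset; the dataset and parameter choices are illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators is the number of trees in the forest (step 3 of the algorithm above)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))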
Example 10.1
#1. Import Libraries
import pandas as pd
import numpy as np
#2. Importing Dataset
# Dataset available at: https://drive.google.com/file/d/1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_/view
dataset = pd.read_csv('D:\Datasets\petrol_consumption.csv')
dataset.head()
#3. Prepare features and target (assumed: the last column is the consumption target)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#4. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#5. Train the random forest regressor and evaluate it
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
Mean Absolute Error: 51.765
Mean Squared Error: 4216.16675
Root Mean Squared Error: 64.932016371
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 11: To practice and demonstrate the implementation of Random Forest algorithm
Roll No: Student Name:
Practical 12
12.1 Introduction:
Anomaly detection is a technique used to identify unusual patterns that do not conform
to expected behavior, called outliers. It has many applications in business, from
intrusion detection (identifying strange patterns in network traffic that could signal a
hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and
from fraud detection in credit card transactions to fault detection in operating
environments.
12.2 Categories of Anomalies:
Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as
determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It creates ‘k’ similar clusters of data
points. Data instances that fall outside of these groups could potentially be marked as
anomalies.
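A minimal sketch of this idea using scikit-learn's KMeans: fit the clusters, then flag the points farthest from their assigned centroid (the 5% threshold and the synthetic data are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance of every point to its own cluster centroid
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the 5% of points farthest from their centroid as potential anomalies
threshold = np.percentile(distances, 95)
anomalies = X[distances > threshold]
print("Number of flagged anomalies:", len(anomalies))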
12.3.3 Support Vector Machine-Based Anomaly Detection
A support vector machine is another effective technique for detecting anomalies. An SVM
is typically associated with supervised learning, but there are extensions
(OneClassSVM, for instance) that can be used to identify anomalies as an unsupervised
problem (in which training data are not labeled). The algorithm learns a soft boundary
in order to cluster the normal data instances using the training set, and then, using the
testing instances, it tunes itself to identify the abnormalities that fall outside the learned
region.
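For reference, scikit-learn provides a OneClassSVM estimator along these lines; a minimal sketch (the training data and the nu/gamma settings are illustrative assumptions):
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# "Normal" training data around the origin, plus a test set containing far-away points
X_train = 0.3 * rng.randn(200, 2)
X_test = np.vstack([0.3 * rng.randn(20, 2), rng.uniform(low=-4, high=4, size=(20, 2))])

# nu roughly bounds the fraction of training errors / support vectors
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)

# +1 = inside the learned region (normal), -1 = outside (anomaly)
print(clf.predict(X_test))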
Depending on the use case, the output of an anomaly detector could be numeric scalar
values for filtering on domain-specific thresholds or textual labels (such as binary/multi
labels).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers
# generating a random dataset with two features
X_train, y_train = generate_data(n_train=300, train_only=True,
                                 n_features=2)
# Visualising the dataset
# store outliers and inliers separately and compute the outlier fraction
x_outliers, x_inliers = get_outliers_inliers(X_train, y_train)
n_outliers, n_inliers = len(x_outliers), len(x_inliers)
outlier_fraction = n_outliers / float(n_outliers + n_inliers)
# separate the two features
f1, f2 = X_train[:, 0], X_train[:, 1]
# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200),
                     np.linspace(-10, 10, 200))
# scatter plot
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Step 4: Training and evaluating the model
# Training the classifier
clf = KNN(contamination=outlier_fraction)
clf.fit(X_train)
# Counting the number of prediction errors on the training data
y_pred = clf.predict(X_train)
n_errors = (y_pred != y_train).sum()
print('Number of errors:', n_errors)
# threshold value to consider a datapoint inlier or outlier
scores_pred = clf.decision_function(X_train) * -1
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
# decision function over the meshgrid, for plotting the learned region
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)
subplot = plt.subplot(1, 1, 1)
# shade scores up to the threshold, draw the decision boundary in red,
# then mark the region beyond the threshold in orange
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10), cmap=plt.cm.Blues_r)
a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
# inliers (white) and outliers (black); generate_data appends the outliers last
b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white', s=20, edgecolor='k')
c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black', s=20, edgecolor='k')
subplot.legend(
    [a.collections[0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop=matplotlib.font_manager.FontProperties(size=10),
    loc='lower right')
subplot.set_title('K-Nearest Neighbours')
subplot.set_xlim((-10, 10))
subplot.set_ylim((-10, 10))
plt.show()
Exercise
Practical 13
To Explore the Performance Measuring
Variables and Optimize Hyper-Parameters.
true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the disease.
(Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
This is a list of rates that are often computed from a confusion matrix for a binary
classifier:
Example:
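As a quick worked example, the sketch below computes the rates most often derived from a binary confusion matrix (accuracy, error rate, sensitivity/recall, specificity, precision, prevalence) using scikit-learn; the ground-truth and predicted labels are made up for illustration.
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for a binary "disease" classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

print("Accuracy   :", (tp + tn) / total)   # how often the classifier is correct
print("Error rate :", (fp + fn) / total)   # how often it is wrong
print("Sensitivity:", tp / (tp + fn))      # true positive rate / recall
print("Specificity:", tn / (tn + fp))      # true negative rate
print("Precision  :", tp / (tp + fp))      # when it predicts yes, how often it is right
print("Prevalence :", (tp + fn) / total)   # how often "yes" actually occurs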
Exercise:
1. Develop a mini project using an image dataset. Train a classifier on the dataset in such a way
that its predictions on the given images can be evaluated as TP, TN, FP and FN.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 13: To explore the performance measuring variables and optimize hyper-parameters
Roll No: Student Name:
Practical 14
To Develop the LSTM Models for Time
Series Forecasting.
Outline:
How to develop LSTM models for univariate time series forecasting.
How to develop LSTM models for multivariate time series forecasting.
How to develop LSTM models for multi-step time series forecasting.
Required Tools:
PC with windows
Python 3.6.2
Anaconda 3-5.0.1
14.1 Introduction to Time Series Forecasting
Time-series forecasting refers to the use of a machine learning model to predict future values based on
previously observed values. Time series data captures a series of data points recorded at (usually) regular
intervals. Though this definition might somewhat remind you of regression models, time-series
forecasting is applied to forecast data that are ordered by time, for example, stock prices by year.
Figure 1 Example of data ordered by date on which time-series forecasting can be applied.
Time-series forecasting is one of the most used applications of Deep Learning in the modern world.
Quantitative analysts use it to predict the value of stocks, business professionals use it to forecast their
sales, and government agencies use it to forecast resource consumption (energy, water, etc.).
Time series prediction problems are a difficult type of predictive modeling problem. Unlike regression
predictive modeling, time series also adds the complexity of a sequence dependence among the input
variables.
Many classical methods (e.g. ARIMA) try to deal with Time Series data with varying success (not to say
they are bad at it). In the last couple of years, Long Short-Term Memory Networks (LSTM) models have
become a very useful method when dealing with those types of data.
A powerful type of neural network designed to handle sequence dependence is the recurrent neural
network. Recurrent Neural Networks (LSTMs are one type of these) are very good at processing
sequences of data. They can “recall” patterns in the data that are very far into the past (or future). The
Long Short-Term Memory network, or LSTM network, is a type of recurrent neural network used in deep
learning because very large architectures can be successfully trained.
14.2 Introduction to Long Short-Term Memory (LSTM) Models
We will explore how to develop a suite of different types of LSTM models for time series forecasting.
The models are demonstrated on small contrived time series problems intended to give the flavor of the
type of time series problem being addressed. The chosen configuration of the models is arbitrary and not
optimized for each problem.
There are two types of LSTM Models
1. Univariate LSTM Models
1. Data Preparation
2. Vanilla LSTM
3. Stacked LSTM
4. Bidirectional LSTM
2. Multivariate LSTM Models
1. Multiple Input Series.
2. Multiple Parallel Series.
14.2.1 Univariate LSTM Models
LSTMs can be used to model univariate time series forecasting problems.
These are problems comprised of a single series of observations and a model is required to learn from the
series of past observations to predict the next value in the sequence.
We will demonstrate a number of variations of the LSTM model for univariate time series forecasting.
This section is divided into four parts; they are:
1. Data Preparation
2. Vanilla LSTM
3. Stacked LSTM
4. Bidirectional LSTM
Each of these models is demonstrated for one-step univariate time series forecasting but can easily be
adapted and used as the input part of a model for other types of time series forecasting problems.
14.2.1.1. Data Preparation
Before a univariate series can be modelled, it must be prepared.
The LSTM model will learn a function that maps a sequence of past observations as input to an output
observation. As such, the sequence of observations must be transformed into multiple examples from
which the LSTM can learn.
Consider a given univariate sequence:
We can divide the sequence into multiple input/output patterns called samples, where three time steps are
used as input and one time step is used as output for the one-step prediction that is being learned.
The split_sequence() function below implements this behaviour and will split a given univariate
sequence into multiple samples where each sample has a specified number of time steps and the output is
a single time step.
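A minimal sketch of such a split_sequence() function, together with a small contrived sequence (the sequence values here are an assumption for illustration):
from numpy import array

def split_sequence(sequence, n_steps):
    # split a univariate sequence into (input, output) samples
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps
        # stop once there is no complete pattern left in the sequence
        if end_ix > len(sequence) - 1:
            break
        X.append(sequence[i:end_ix])
        y.append(sequence[end_ix])
    return array(X), array(y)

# contrived univariate series (assumed for illustration)
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
X, y = split_sequence(raw_seq, n_steps=3)
for i in range(len(X)):
    print(X[i], y[i])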
Now that we know how to prepare a univariate series for modeling, let us look at developing LSTM
models that can learn the mapping of inputs to outputs, starting with a Vanilla LSTM.
14.2.1.2. Vanilla LSTM
A Vanilla LSTM is an LSTM model that has a single hidden layer of LSTM units, and an output layer
used to make a prediction.
Our split_sequence() function in the previous section outputs the X with the shape [samples, timesteps],
so we easily reshape it to have an additional dimension for the one feature.
In this case, we define a model with 50 LSTM units in the hidden layer and an output layer that predicts a
single numerical value.
The model is fit using the efficient Adam version of stochastic gradient descent and optimized using the
mean squared error, or ‘mse‘ loss function. Once the model is defined, we can fit it on the training
dataset.
EXAMPLE 14.2
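A minimal sketch of the Vanilla LSTM described above (50 LSTM units, Adam optimizer, MSE loss), reusing the split_sequence() sketch from the data-preparation step; the contrived sequence, epoch count and test input are illustrative assumptions:
from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# prepare the data: 3 input time steps, 1 feature
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n_steps, n_features = 3, 1
X, y = split_sequence(raw_seq, n_steps)
X = X.reshape((X.shape[0], X.shape[1], n_features))   # [samples, timesteps, features]

# define the model: one hidden LSTM layer and a single-output Dense layer
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# fit and make a one-step prediction
model.fit(X, y, epochs=200, verbose=0)
x_input = array([70, 80, 90]).reshape((1, n_steps, n_features))
print(model.predict(x_input, verbose=0))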
14.2.1.3. Stacked LSTM
Multiple hidden LSTM layers can be stacked one on top of another in what is referred to as a Stacked
LSTM model.
An LSTM layer requires a three-dimensional input and LSTMs by default will produce a two-
dimensional output as an interpretation from the end of the sequence.
We can address this by having the LSTM output a value for each time step in the input data by setting
the return_sequences=True argument on the layer. This allows the hidden LSTM layer to produce a 3D
output that can be used as input to the next layer.
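For instance, a two-layer Stacked LSTM might be defined as in the sketch below (the unit counts and activations are arbitrary choices); note return_sequences=True on the first layer so that it passes a 3D sequence to the second LSTM layer:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 3, 1
model = Sequential()
# return_sequences=True makes this layer emit one output per time step (a 3D output)
model.add(LSTM(50, activation='relu', return_sequences=True,
               input_shape=(n_steps, n_features)))
model.add(LSTM(50, activation='relu'))   # consumes the 3D sequence output
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')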
We can demonstrate
this with a simple example of two parallel input time series where the output series is the simple addition
of the input series.
We can reshape
these three arrays of data as a single dataset where each row is a time step, and each column is a separate
time series. This is a standard way of storing parallel time series in a CSV file.
We can define a function named split_sequences() that will take a dataset as we have defined it with rows
for time steps and columns for parallel series and return input/output samples.
EXAMPLE 14.3
The complete example is listed below.
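A minimal sketch of such a multiple-input-series example, under the assumptions described above (two parallel input series, an output series that is their sum, and three input time steps); the second series' values and the training settings are illustrative assumptions:
from numpy import array, hstack
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def split_sequences(sequences, n_steps):
    # inputs are all columns except the last; the output is the last column
    X, y = list(), list()
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        X.append(sequences[i:end_ix, :-1])
        y.append(sequences[end_ix - 1, -1])
    return array(X), array(y)

# two parallel input series and an output series that is their sum
in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = in_seq1 + in_seq2

# stack the columns: each row is a time step, each column is a series
dataset = hstack((in_seq1.reshape(-1, 1), in_seq2.reshape(-1, 1), out_seq.reshape(-1, 1)))

n_steps = 3
X, y = split_sequences(dataset, n_steps)
n_features = X.shape[2]

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# predict the next output value from the last three time steps of inputs
x_input = array([[80, 85], [90, 95], [100, 105]]).reshape((1, n_steps, n_features))
print(model.predict(x_input, verbose=0))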
EXERCISE:
1. Repeat Example 2 and 3 by using Stacked and Bidirectional LSTM models.
2. The metrics that you choose to evaluate your machine learning algorithms are very important;
the choice of metrics influences how the performance of machine learning algorithms is measured and
compared. They influence how you weight the importance of different characteristics in the results
and your ultimate choice of which algorithm to use. In this practical, discover how
to select and use different machine learning performance metrics in Python with scikit-learn.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics
Lab 14: To develop the LSTM Models for Time series Forecasting
Roll No: Student Name:
Objective: Write a program to predict and forecast the climate or weather of a region.
Apply appropriate data science approaches to clean the data. Use prediction models (K-
Nearest Neighbor, Artificial Neural Network or Long Short-Term Memory) or any other
method which we have discussed. Evaluate and present the performance metrics.
Course code: CS-454 Course Name: Data science and Analytics
Teacher: Date: