
MEHRAN UNIVERSITY OF ENGINEERING & TECHNOLOGY

DEPARTMENT OF COMPUTER SYSTEMS ENGINEERING

DATA SCIENCE AND ANALYTICS

Practical Workbook

Dr. Sanam Narejo


Department: Computer Systems Engineering        Course Code: CS-454

Name of Teacher: Dr. Sanam Narejo               Semester: 8th

Upon successful completion of the course, the student will be able to:

CLO-1 (Psychomotor, P3, linked to PLO 3): To develop a practical understanding of NoSQL databases, statistical methods and EDA techniques for the effective application of data science, and to write code for machine learning models.

CLO-2 (Psychomotor, P4, linked to PLO 5): To develop a comprehensive data-driven project using modern tools, including problem formulation, data wrangling, elementary analysis and conclusions.

CLO-3 (Psychomotor, P4, linked to PLO 4): To develop a solution for a complex engineering problem in a methodical way.

S#  Topic (Associated CLO Linking / Psychomotor Taxonomy Level / Lecture hrs required)

1.  To Practice and Execute NoSQL Database Queries - MongoDB (3 / P3 / 3)
2.  To Practice the Python Platform and Explore its Libraries for Data Science & Machine Learning (3 / P3 / 3)
3.  To Perform Exploratory Data Analysis and Demonstrate In-Depth Visualizations of Data (3 / P3 / 3)
4.  To Perform Inferential Statistics and Hypothesis Testing (3 / P3 / 3)
5.  To Demonstrate the Implementation of Linear Regression in Data Modeling (3 / P3 / 3)
6.  To Demonstrate the Implementation of Logistic Regression in Data Modeling (3 / P3 / 3)
7.  To Practice and Demonstrate the Implementation of Artificial Neural Networks & Deep Learning (3 / P3 / 3)
8.  To Practice and Demonstrate the Implementation of Support Vector Machines (3 / P3 / 3)
9.  To Practice and Demonstrate the Implementation of the K-Means Clustering Algorithm (3 / P3 / 3)
10. To Practice and Demonstrate the Implementation of the Decision Tree Algorithm (3 / P3 / 3)
11. To Practice and Demonstrate the Implementation of the Random Forest Algorithm (3 / P3 / 3)
12. To Execute Anomaly Detection Techniques in Machine Learning Models (3 / P3 / 3)
13. To Explore Performance Measures and Optimize Hyper-parameters (3 / P3 / 3)
14. To Develop LSTM Models for Time Series Forecasting (4 / P4 / 3)
15. Open-ended Lab 1 (4 / P4 / 3)

Total lecture hrs: 45

Department: Computer Systems Engineering      Subject: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro
Year: 4th        Semester: 8th        Batch: 18 CS        Duration: 03 Hours

PRACTICAL 01
To Practice and Execute NoSQL Database
Queries - MongoDB

Outline:

• To get familiar with databases
• To gain insight into NoSQL databases
• To explore and get familiar with MongoDB

Database
A database is a collection of information that is organized so that it can be easily accessed,
managed and updated. Computer databases typically contain data records or files, containing
information about sales transactions or interactions with specific customers.

NOSQL
NoSQL is an approach to database management that can accommodate a wide variety of data
models, including key-value, document, columnar and graph formats. A NoSQL database
generally means that it is non-relational, distributed, flexible and scalable.

Additional common NoSQL database features include the lack of a database schema, data
clustering, replication support and eventual consistency, as opposed to the typical ACID
(atomicity, consistency, isolation and durability) transaction consistency of relational and SQL
databases. Many NoSQL database systems are also open source.

SQL vs NOSQL
NOSQL DATABASES
• Redis
• CouchDB
• Neo4j
• MongoDB
• Cassandra

MONGODB
MongoDB is a cross-platform, document-oriented database that provides high performance,
high availability, and easy scalability. MongoDB works on the concepts of collections and documents.

• Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema
means that documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of data.

• Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection
exists within a single database. Collections do not enforce a schema. Documents within a
collection can have different fields. Typically, all documents in a collection are of similar or
related purpose.

STRENGTHS OF MONGODB
The following are some of MongoDB's benefits and strengths:

• Dynamic schema: As mentioned, this gives you the flexibility to change your data schema without
modifying any of your existing data.

• Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your
business with ease.

• Manageability: The database doesn't require a database administrator. Since it is fairly user-friendly
in this way, it can be used by both developers and administrators.

• Speed: It is high-performing for simple queries.

• Flexibility: You can add new columns or fields to MongoDB without affecting existing rows or
application performance.

LAB TASK
1. Install MongoDB and use its interface to create databases and collections.
2. Attach screenshots of your installed database and create 3 different databases of your
own choice.
3. Populate those databases with a minimum of 15 documents using the MongoDB GUI; each
document should contain at least 6 key-value pairs. For example, use the Emp table data that
you used in DBMS.
4. Create a database and a collection with data of your own choice and attach a
screenshot of every output.
5. Use at least 5 collection functions of your own choice and attach a screenshot of every
output.
6. Use any five Aggregation Pipeline Operators and Aggregation Pipeline Stages on data
of your own choice and attach a screenshot of every output.
7. Create a database named Students, create a collection named StudentsInfo, and insert 10
documents with the fields first_name, last_name, roll_no, date_of_admission,
current_semester and the cgpa in every semester, e.g. semester_cgpa: [3.99, 2.87].
8. Write a query to display the student first name and current semester only.
9. Write a query to display the min, max and average cgpa grouped by first_name.
10. Write a query to sum up the cgpa.
11. Update any value in the database.
12. Write a query to select cgpa greater than 2.50 and less than 3.0.
13. Remove records where cgpa is less than 2.5.
(A sketch of several of these queries, written with the PyMongo driver, is given after this list; the same operations can also be issued from the mongo shell or GUI.)
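The sketch below shows how several of these tasks might look from Python using the PyMongo driver. It assumes a MongoDB instance running on localhost and the hypothetical Students/StudentsInfo collection described in task 7, so all field names and values are illustrative only.

# Minimal PyMongo sketch (assumes a local MongoDB instance and the
# hypothetical Students/StudentsInfo collection from task 7).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["Students"]
info = db["StudentsInfo"]

# Task 7: insert a sample document (repeat with different values for 10 documents).
info.insert_one({
    "first_name": "Ali", "last_name": "Khan", "roll_no": "18CS-001",
    "date_of_admission": "2018-09-01", "current_semester": 8,
    "semester_cgpa": [3.99, 2.87, 3.10]
})

# Task 8: display first name and current semester only (projection).
for doc in info.find({}, {"first_name": 1, "current_semester": 1, "_id": 0}):
    print(doc)

# Task 9: min, max and average CGPA grouped by first_name (aggregation pipeline).
pipeline = [
    {"$unwind": "$semester_cgpa"},
    {"$group": {"_id": "$first_name",
                "min_cgpa": {"$min": "$semester_cgpa"},
                "max_cgpa": {"$max": "$semester_cgpa"},
                "avg_cgpa": {"$avg": "$semester_cgpa"}}}
]
print(list(info.aggregate(pipeline)))

# Task 11: update a value.
info.update_one({"roll_no": "18CS-001"}, {"$set": {"current_semester": 8}})

# Task 12: cgpa greater than 2.50 and less than 3.0 (element match on the array).
print(list(info.find({"semester_cgpa": {"$elemMatch": {"$gt": 2.50, "$lt": 3.0}}})))

# Task 13: remove records where cgpa is less than 2.5.
info.delete_many({"semester_cgpa": {"$lt": 2.5}})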
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 01: To Practice and Execute NoSQL Database queries - MongoDB


Roll No: Student Name:

Batch: Section: Group: Date:

Rubric 1: Problem Understanding and Analysis
 - Proficient (0.16): The student has carefully read and understood the problem and gives the best idea to solve it.
 - Adequate (0.1): The student understands the problem and its analysis well.
 - Just Acceptable (0.05): The student understands the problem and its analysis in a just acceptable way.
 - Unacceptable (0): The student does not understand the problem.

Rubric 2: Code Originality
 - Proficient (0.16): Students express their own solutions and translate the logic from a flowchart or pseudocode into a programming language by writing their own code.
 - Adequate (0.1): Adequate code originality.
 - Just Acceptable (0.05): Acceptable code originality.
 - Unacceptable (0): Code originality is not acceptable.

Rubric 3: Completeness and Accuracy
 - Proficient (0.16): The lab tasks are complete and accurate in the context of implementation.
 - Adequate (0.1): The lab is mostly complete and accurate in the context of its implementation.
 - Just Acceptable (0.05): Lab completion is just acceptable.
 - Unacceptable (0): The lab in its current state is not implementable.

Rubric 4: Analysis and Results
 - Proficient (0.16): Appropriate data collected; analysis and results correctly interpreted.
 - Adequate (0.1): Appropriate data collected, but insufficient analysis; results adequately interpreted.
 - Just Acceptable (0.05): Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
 - Unacceptable (0): Inappropriate data collected; no understanding of analysis and results; results incorrectly interpreted.

Teacher:                                Date:
Department: Computer Systems Engineering      Subject: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro
Year: 4th        Semester: 8th        Batch: 18 CS        Duration: 03 Hours

Practical 02
To Practice the Python Platform and Explore
its Libraries for Data Science & Machine
Learning.
Outline:
• Introduction to Python
• Variables and Types
• Data Structures in Python
• Functions & Packages
• Numpy package
• Data visualization with Matplotlib
• Control flow
• Pandas

Required Tools:
• PC with Windows
• Anaconda 2 or 3

Introduction to Python
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but
effective approach to object-oriented programming. It was created by Guido van Rossum in the late 1980s. Python is
very beginner-friendly: the syntax (words and structure) is extremely simple to read and follow, and most of it can be
understood even if you do not know any programming. Let's take a look at one example of this:
Example 1.1:
Garage = "Ferrari", "Honda", "Porsche", "Toyota"

for each_car in Garage:


print(each_car)

"print()" is a built-in Python function that will output some text to the console.

Looking at the code about cars in the garage, can you guess what will happen? You probably have a general idea. For
each_car in the garage, we're going to do something. What are we doing? We are printing each car.

Since "printing" outputs some text to the "console," you can probably figure out that the console will say something like
"Ferrari, Honda, Porsche, Toyota."

Python Shell
A python shell is a way for a user to interact with the python interpreter.
Python IDLE
IDLE (Integrated DeveLopment Environment, or Integrated Development and Learning Environment) is an integrated
development environment for Python, bundled with the default implementation of the language since
version 1.5.2b1. It provides a Python shell with syntax highlighting.
Python Scripts
Scripts are reusable
Basically, a script is a text file containing the statements that comprise a Python program. Once you have created the
script, you can execute it over and over without having to retype it each time. Python scripts are saved with .py extension.
Scripts are editable
Perhaps, more importantly, you can make different versions of the script by modifying the statements from one file to the
next using a text editor. Then you can execute each of the individual versions. In this way, it is easy to create different
programs with a minimum amount of typing.
Anaconda Distribution
The open-source Anaconda Distribution is the easiest way to perform Python/R data science and machine learning on
Linux, Windows, and Mac OS X. With over 11 million users worldwide, it is the industry standard for developing, testing,
and training on a single machine.

Figure 1.1: Anaconda Navigator

Spyder IDE
Spyder is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers
and data analysts. It offers a unique combination of the advanced editing, analysis, debugging, and profiling functionality
of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and beautiful
visualization capabilities of a scientific package.
Beyond its many built-in features, its abilities can be extended even further via its plugin system and API. Furthermore,
Spyder can also be used as a PyQt5 extension library, allowing developers to build upon its functionality and embed its
components, such as the interactive console, in their own PyQt software.
Some of the components of spyder include:
• Editor: work efficiently in a multi-language editor with a function/class browser, code analysis tools and
automatic code completion.
• IPython console: harness the power of as many IPython consoles as you like within the flexibility of a
full GUI interface; run your code by line, cell, or file; and render plots right inline.
• Variable explorer: interact with and modify variables on the fly: plot a histogram or time series, edit a
data frame or Numpy array, sort a collection, dig into nested objects, and more.
• Debugger: trace each step of your code's execution interactively.
• Help: instantly view any object's docs, and render your own.
Here is how Spyder IDE looks like (Windows edition) in action:
Figure 1.2: Spyder IDE

The first thing to note is how the Spyder app is organized. The application includes multiple separate windows (marked
with red rectangles), each of which has its own tabs (marked with green rectangles). You can change which windows you
prefer to have open from the View -> Windows and Toolbars option. The default configuration has the Editor, Object
inspector/Variable explorer/File explorer, and Console/History log windows open as shown above.
The Console is where python is waiting for you to type commands, which tell it to load data, do math, plot data, etc. After
every command, which looks like >>> command, you need to hit the enter key (return key), and then python may or may
not give some output. The Editor allows you to write sequences of commands, which together make up a program. The
History Log stores the last 100 commands you've typed into the Console. The Object inspector/Variable explorer/File
explorer windows are purely informational -- if you watch what the first two display as we go through the tutorial, you'll
see that they can be quite helpful.
Variables & Types
In almost every single Python program you write, you will have variables. Variables act as placeholders for data. They can
aid in short hand, as well as with logic, as variables can change, hence their name.
Python variables do not need explicit declaration to reserve memory space. The declaration happens automatically when
you assign a value to a variable. The equal sign (=) is used to assign values to variables.
Variables help programs become much more dynamic, and allow a program to always reference a value in one spot, rather
than the programmer needing to repeatedly type it out, and, worse, change it if they decide to use a different definition for
it.
Variables can be named just about whatever you want. You wouldn't want them to conflict with function names, and
they also cannot start with a number.
Example 1.2:
weight=55
height=166
BMI=weight/(height*height)
print(BMI)
In this case, 0.001995935549426622 will be printed to the console. Here, we were able to store integers and their
manipulations in different variables.
We can also find out the type of any variable by using type function.
print(type(BMI))
This will return the type of variable as <class 'float'>
Data Structures in Python
The most basic data structure in Python is the sequence. Each element of a sequence is assigned a number - its position or
index. The first index is zero, the second index is one, and so forth.
There are certain things you can do with all sequence types. These operations include indexing, slicing, adding,
multiplying, and checking for membership. In addition, Python has built-in functions for finding the length of a sequence
and for finding its largest and smallest elements.
Python Lists
The list is a most versatile datatype available in Python which can be written as a list of comma-separated values (items)
between square brackets. Important thing about a list is that items in a list need not be of the same type.
Creating a list is as simple as putting different comma-separated values between square brackets. For example −
Example 2.3:
list1 = ['physics', 'chemistry', 1997, 2000];
list2 = [1, 2, 3, 4, 5 ];
list3 = ["a", "b", "c", "d"]
List indices start at 0, and lists can be sliced, concatenated and so on.
List of lists can also be created
Example 2.4:
weight=[55,44,45,53]
height=[166, 150, 144,155]
demo_list=[weight, height]
print(demo_list)
Accessing Values in Lists
To access values in lists, use the square brackets for slicing along with the index or indices to obtain value available at that
index. For example −
Example 2.5:
list1 = ['physics', 'chemistry', 1997, 2000];
list2 = [1, 2, 3, 4, 5, 6, 7 ];

print ("list1[0]: ", list1[0])


print ("list2[1:5]: ", list2[1:5])
When the above code is executed, it produces the following result –
list1[0]: physics
list2[1:5]: [2, 3, 4, 5]
Updating Lists
You can update single or multiple elements of a list by giving the slice on the left-hand side of the assignment operator, and
you can add elements to a list with the append() method. For example –
Example 2.6:
list = ['physics', 'chemistry', 1997, 2000];

print ("Value available at index 2 : ")


print (list[2])
list[2] = 2001;
print ("New value available at index 2:")
print(list[2])
When the above code is executed, it produces the following result –
Value available at index 2:
1997
New value available at index 2:
2001
Basic List Operations
Lists respond to the + and * operators much like strings; they mean concatenation and
repetition here too, except that the result is a new list, not a string.
Python Expression          Results                                                        Description
len([1, 2, 3])             3                                                              Length
[1, 2, 3] + [4, 5, 6]      [1, 2, 3, 4, 5, 6]                                             Concatenation
['Hello User!'] * 4        ['Hello User!', 'Hello User!', 'Hello User!', 'Hello User!']   Repetition
3 in [1, 2, 3]             True                                                           Membership
Table 1.1: Basic list operations
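These operations can be tried directly in the interpreter; a minimal sketch reproducing Table 1.1:

# Basic list operations from Table 1.1.
print(len([1, 2, 3]))            # 3 (length)
print([1, 2, 3] + [4, 5, 6])     # [1, 2, 3, 4, 5, 6] (concatenation)
print(['Hello User!'] * 4)       # the same string repeated four times (repetition)
print(3 in [1, 2, 3])            # True (membership)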

Indexing, Slicing, & Matrices


Because lists are sequences, indexing and slicing work the same way for lists as they do for strings.
Assuming following input –
L =['spam','Spam','SPAM!']

Python Expression Results Description


L[2] 'SPAM!' Offsets start at zero
L[-2] 'Spam' Negative: count from the right
L[1:] ['Spam', 'SPAM!'] Slicing fetches sections
Table 1.2: Indexing & Slicing

Functions & Packages


Functions
A function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better
modularity for your application and a high degree of code reuse.
Python gives you many built-in functions like print(), etc. but you can also create your own functions. These functions are
called user-defined functions.
Let’s first discuss some built-in functions
Example 2.7:
list = [2017, 2001, 1997, 2000]; #creates list
print(max(list)) #finds max element in list
print(round(1.77,1)) #rounds off the given argument upto
specified digits
help(round) # returns detailed help text
Output of the above code is
2017
1.8
Help on built-in function round in module builtins:
round(...)
round(number[, ndigits]) -> number
Round a number to a given precision in decimal digits
(default 0 digits). This returns an int when called with one
argument, otherwise the same type as the number. ndigits may be
negative.
User-defined functions
You can define functions to provide the required functionality. Here are simple rules to define a function in Python.
• Function blocks begin with the keyword def followed by the function name and parentheses ( ).
• Any input parameters or arguments should be placed within these parentheses.
• The first statement of a function can be an optional statement - the documentation string of the function,
or docstring.
• The code block within every function starts with a colon (:) and is indented.
• The statement return [expression] exits a function, optionally passing back an expression to the caller. A
return statement with no arguments is the same as return None.

SYNTAX FOR DEFINING A FUNCTION


def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]

Example 2.8:
def printme(str):
    "This prints a passed string into this function"
    print(str)
    return

#Calling function
printme("I'm first call to user defined function!")
printme("Again second call to the same function")

The result of the above code is-


I'm first call to user defined function!
Again second call to the same function
Methods
Methods are like functions, but they are called on objects. Everything in Python is an object, and depending on
the type of the object, different methods are available.
Example 2.9:
family=["mom","dad","sister","brother",44,42,22,23]
print(family.index("mom"))
print(family.count(44))
print("sister".capitalize())
The result of the above code is-
0
1
Sister
Packages
A package is a hierarchical file directory structure that defines a single Python application environment that consists of
modules and sub-packages and sub-sub-packages, and so on.
There are different packages for different purposes, some of which are: Numpy (efficient work with arrays),
Matplotlib (used for visualizations) and Scikit-learn (used for machine learning).
Import Statement
You can use any Python source file as a module by executing an import statement in some other Python source file.
The import has the following syntax:
import module1[, module2[, ... moduleN]]
When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A
search path is a list of directories that the interpreter searches before importing a module.
From…. Import Statement
Python's from statement lets you import specific attributes from a module into the current namespace.
The from...import has the following syntax –
from modname import name1[, name2[, ... nameN]]
This statement does not import the entire module into the current namespace; it just introduces the specified items from the
module into the global symbol table of the importing module.
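A short illustration of both import forms, using only the standard math module:

# Import the whole module and qualify names with the module prefix.
import math
print(math.sqrt(16))      # 4.0

# Import specific names into the current namespace.
from math import pi, floor
print(pi)                 # 3.141592653589793
print(floor(2.9))         # 2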
Numpy Package
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array
object, and tools for working with these arrays.
Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of
dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each
dimension.
We can initialize numpy arrays from nested Python lists, and access elements using square brackets:
Example 2.10:
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print(type(a)) # Prints "<class 'numpy.ndarray'>"
print(a.shape) # Prints "(3,)"
print(a[0], a[1], a[2]) # Prints "1 2 3"
a[0] = 5 # Change an element of the array
print(a) # Prints "[5, 2, 3]"

b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array


print(b.shape) # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0]) # Prints "1 2 4"

Numpy also provides many functions to create arrays:


Example 2.11:
import numpy as np
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # Prints "[[ 0. 0.]
# [ 0. 0.]]"

b = np.ones((1,2)) # Create an array of all ones


print(b) # Prints "[[ 1. 1.]]"

c = np.full((2,2), 7) # Create a constant array


print(c) # Prints "[[ 7. 7.]
# [ 7. 7.]]"

d = np.eye(2) # Create a 2x2 identity matrix


print(d) # Prints "[[ 1. 0.]
# [ 0. 1.]]"

e = np.random.random((2,2)) # Create an array filled with random values


print(e) # Might print "[[ 0.91940167 0.08143941]
# [ 0.68744134 0.87236687]]"
Array Indexing
Numpy offers several ways to index into arrays.
Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a
slice for each dimension of the array:
Example 2.12:
import numpy as np

# Create the following rank 2 array with shape (3, 4):
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

# Re-create a (it was modified through b above).
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)  # Prints "[5 6 7 8] (4,)"
print(row_r2, row_r2.shape)  # Prints "[[5 6 7 8]] (1, 4)"

# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)  # Prints "[ 2  6 10] (3,)"
print(col_r2, col_r2.shape)  # Prints "[[ 2]
                             #          [ 6]
                             #          [10]] (3, 1)"
Integer Array Indexing
When you index into numpy arrays using slicing, the resulting array view will always be a subarray of the original array. In
contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array.
Example 2.13:
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array will have shape (3,) and contain the elements
# a[0, 0], a[1, 1] and a[2, 0]:
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"
Boolean Array Indexing
Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select
the elements of an array that satisfy some condition.
Example 2.14:
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print(bool_idx)     # Prints "[[False False]
                    #          [ True  True]
                    #          [ True  True]]"

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])  # Prints "[3 4 5 6]"

# We can do all of the above in a single concise statement:
print(a[a > 2])     # Prints "[3 4 5 6]"
Basic Statistics with Numpy
Data analysis is all about getting to know your data. Consider a city-wide survey where we ask 5000 adults for
their height and weight. We would then have a 2D numpy array with 5000 rows and 2 columns (height and
weight), and we can generate summary statistics about the data using different numpy methods. Some of the
commonly used statistical numpy methods are:
• a.sum()            Array-wise sum
• a.min()            Array-wise minimum value
• b.max(axis=0)      Maximum value of each column (axis=0)
• b.cumsum(axis=1)   Cumulative sum of the elements along each row
• a.mean()           Mean
• np.median(b)       Median
• np.corrcoef(a)     Correlation coefficient matrix
• np.std(b)          Standard deviation
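A minimal sketch of these calls, using randomly generated heights and weights in place of the survey described above (the 5000x2 array is assumed, not real survey data):

import numpy as np

# Hypothetical survey data: 5000 rows, two columns (height in cm, weight in kg).
np.random.seed(0)
height = np.random.normal(170, 10, 5000)
weight = np.random.normal(65, 12, 5000)
survey = np.column_stack([height, weight])   # shape (5000, 2)

print(survey.shape)                  # (5000, 2)
print(survey.mean(axis=0))           # mean height and mean weight
print(survey.min(axis=0))            # minimum of each column
print(np.median(survey, axis=0))     # median of each column
print(np.std(survey, axis=0))        # standard deviation of each column
print(np.corrcoef(survey[:, 0], survey[:, 1]))  # 2x2 correlation matrix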

Data Visualizations with Matplotlib


Matplotlib is capable of creating most kinds of charts, like line graphs, scatter plots, bar charts, pie charts, stack plots, 3D
graphs, and geographic map graphs.
First, in order to actually use Matplotlib we need to import it, so the first statement is an import statement. Next, we
invoke the .plot method of pyplot to plot some coordinates (data).
Example
import matplotlib.pyplot as plt
year = [1994, 1995, 1998, 2000]        # data points
population = [2.59, 3.69, 5.33, 6.77]  # data points
plt.plot(year, population)   # plots the data on the specified coordinates in the background
plt.show()                   # visualises the graph
plt.scatter(year, population)
plt.show()
The above code creates the following plots.

Figure 2.1: Results of matplotlib functions

Histograms
To explore a dataset further, one can also use histograms to get an idea of its distributions.
Consider an example where we have 12 values in the range 0 to 10. To build a histogram of these values, divide the range into
equal chunks (bins). Suppose we go for 3 bins; we then draw a bar for each bin whose height corresponds to the number of data
points falling in that particular bin.
Python code for the above scenario-
Example 2.4:
import matplotlib.pyplot as plt
help (plt.hist)
#list with 12 values
values=[1.2,1.3,2.2,3.3,2.4,6.5,6.6,7.7,8.8,9.9,4.2,5.3]
plt.hist(values)
plt.grid(True)
plt.show()
The above code displays following graph.

Figure 2.4: Histogram

Boolean Logic & Control Flow


Relational Operators
These operators compare the values on either side of them and decide the relation between them.
The following table lists relational operators:
Operator Description
== If the values of two operands are equal, then the condition becomes true.
!= If values of two operands are not equal, then condition becomes true.
> If the value of left operand is greater than the value of right operand, then
condition becomes true.
< If the value of left operand is less than the value of right operand, then
condition becomes true.
>= If the value of left operand is greater than or equal to the value of right
operand, then condition becomes true.
<= If the value of left operand is less than or equal to the value of right operand,
then condition becomes true.
Table 2.1: Relational operators

Logical Operators
A boolean expression (or logical expression) evaluates to one of two states: true or false. Python provides the boolean type,
which can be set to either False or True.
Operator   Description
&          AND: returns true if both of the compared expressions are true
|          OR: returns true if at least one of the compared expressions is true
~          NOT: unary, has the effect of 'flipping' the result
(In plain Python expressions the keyword forms and, or and not are used; &, | and ~ are the element-wise forms used with numpy and pandas boolean arrays.)
Table 2.2: Logical operators
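A small sketch of both operator families, using made-up values:

import numpy as np

# Relational operators produce booleans.
bmi = 0.00199
print(bmi > 0.0018)            # True
print(bmi == 0.0018)           # False

# On plain Python booleans, use the keywords and / or / not.
print(bmi > 0.0018 and bmi < 0.0025)   # True

# On NumPy boolean arrays, &, | and ~ work element-wise.
heights = np.array([150, 166, 172, 181])
print((heights > 160) & (heights < 180))   # [False  True  True False]
print(~(heights > 160))                    # [ True False False False]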
Control flow
A program’s control flow is the order in which the program’s code executes. The control flow of a Python program
is regulated by conditional statements, loops, and function calls.
if Statement
Often, you need to execute some statements only if some condition holds, or choose statements to execute depending on
several mutually exclusive conditions. The Python compound statement if, which uses if, elif, and else clauses, lets
you conditionally execute blocks of statements. Here's the syntax for the if statement:
if expression:
    statement(s)
elif expression:
    statement(s)
elif expression:
    statement(s)
...
else:
    statement(s)
The elif and else clauses are optional. Note that unlike some languages, Python does not have
a switch statement, so you must use if, elif, and else for all conditional processing.
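A runnable version of the syntax above, using an illustrative BMI threshold check (the cut-off values are made up for the example):

# if / elif / else: exactly one branch runs.
bmi = 0.00199

if bmi < 0.0018:
    print("below range")
elif bmi < 0.0025:
    print("within range")
else:
    print("above range")
# prints "within range"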
Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool available in any language. It is already well on its way
toward this goal.
Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data set; the data need not be labeled at all to
be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast
majority of typical use cases in finance, statistics, social science, and many areas of engineering.
A common way to get started is with CSV (Comma Separated Values) files, which pandas handles efficiently.
CSV files are used to store a large number of variables, or data. They are incredibly simplified spreadsheets (think Excel),
only the content is stored in plain text. The text inside a CSV file is laid out in rows, and each of those has columns, all
separated by commas. Every line in the file is a row in the spreadsheet, while the commas are used to define and separate
cells.
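Before turning to CSV files, it can help to build the two primary structures by hand; a minimal sketch with made-up names and values:

import pandas as pd

# A Series is a labeled 1-dimensional array.
heights = pd.Series([166, 150, 144], index=["Ali", "Sara", "Omar"])
print(heights["Sara"])          # 150

# A DataFrame is a labeled 2-dimensional table; each column is a Series.
df = pd.DataFrame({"height": [166, 150, 144],
                   "weight": [55, 44, 45]},
                  index=["Ali", "Sara", "Omar"])
print(df)
print(df["weight"].mean())      # 48.0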
Let’s take a look at one such example which retrieves data from CSV file. Use the given csv file.

Example
import pandas as pd
data = pd.read_csv('C:/Users/Downloads/Import_User_Sample_en.csv', index_col=0)  # uses the 0th column as the index
print(data)

The above code prints the following data


Using pandas, basic operations such as manipulation, addition and deletion of data from a CSV file can be done easily.
Example 2.7:
import pandas as pd
data = pd.read_csv('C:/Users/Downloads/Import_User_Sample_en.csv')

# Original file
print("Original file")
print(data)

# Retrieval of a particular column
print("Retrieval of particular column")
print(data['User Name'])  # retrieves the User Name column

# Adding data
data["Marital status"] = ['Married', 'Single', 'Divorced', 'Single', 'Single']  # adds a column to the dataframe
print("Retrieval of newly added column")
print(data["Marital status"])

# Manipulation of columns
data["New Number"] = data['Office Number'] / data['ZIP or Postal Code']
print("Retrieval of newly added column through manipulation")
print(data['New Number'])

# Row access
data.set_index("Last Name", inplace=True)
print(data)
print(data.loc["Andrews"])

# Element access
print("Element Access")
print(data.loc["Andrews", "Address"])

# Row access
print('Row access via iloc', data.iloc[0:2, :])

# Column access
print('Column access via iloc', data.iloc[:, [1]])
Execute the above code and find out the alterations made in the CSV file data.

Exercise
Variables & Types
Question 1:
 Create a variable savings with the value 100.
 Check out this variable by typing print(savings) in the script.
Question 2:

 Create a variable factor, equal to 1.10.


 Use savings and factor to calculate the amount of money you end up with after 7 years. Store the result in a new
variable, result.
 Print out the value of result.

Python Lists
Question 3:
 Create a list, areas, that contains the area of the hallway (hall), kitchen (kit), living room (liv),
bedroom (bed) and bathroom (bath), in this order. Use the predefined variables.
 Print areas with the print() function.

List sub-setting
Question 4:
 Print out the second element from the areas list, so 11.25.
 Subset and print out the last element of areas, being 9.50. Using a negative index makes sense here!
 Select the number representing the area of the living room and print it out.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
"bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas


# Print out last element from areas
# Print out the area of the living room
List Manipulation
Question 5:
 You did a miscalculation when determining the area of the bathroom; it's 10.50 square meters instead
of 9.50. Can you make the changes?
 Make the areas list more trendy! Change "living room" to "chill zone".

# Create the areas list


areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
"bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area


# Change "living room" to "chill zone"
Functions
Question 6:
 Use print() in combination with type() to print out the type of var1.
 Use len() to get the length of the list var1. Wrap it in a print() call to directly print it out.
 Use int() to convert var2 to an integer. Store the output as out2.
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1

# Print out length of var1

# Convert var2 to an integer: out2


Methods
Question 7: String Methods
 Use the upper() method on room and store the result in room_up. Use the dot notation.
 Print out room and room_up. Did both change?
 Print out the number of o's on the variable room by calling count() on room and passing the
letter "o" as an input to the method. We're talking about the variable room, not the word "room"!

# string to experiment with: room


room = "poolhouse"

# Use upper() on room: room_up

# Print out room and room_up

# Print out the number of o's in room


Packages
Question 8:
 Import the math package. Now you can access the constant pi with math.pi.
 Calculate the circumference of the circle and store it in C.
 Calculate the area of the circle and store it in A.

# Definition of radius
r = 0.43

# Import the math package

# Calculate C
C =
# Calculate A
A =
# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))
Numpy Package
Question 9:
 import the numpy package as np, so that you can refer to numpy with np.
 Use np.array() to create a Numpy array from baseball. Name this array np_baseball.
 Print out the type of np_baseball to check that you got it right.
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Import the numpy package as np

# Create a Numpy array from baseball: np_baseball

# Print out type of np_baseball


2D Numpy Arrays
Question 10: First 2D numpy array
 Use np.array() to create a 2D Numpy array from baseball. Name it np_baseball.
 Print out the type of np_baseball.
 Print out the shape attribute of np_baseball. Use np_baseball.shape.

# Create baseball, a list of lists


baseball = [[180, 78.4],
[215, 102.7],
[210, 98.5],
[188, 75.2]]

# Import numpy
import numpy as np

# Create a 2D Numpy array from baseball: np_baseball


# Print out the type of np_baseball

Basic Statistics with Numpy


Question 11:
 Create Numpy array np_height, that is equal to first column of np_baseball.
 Print out the mean of np_height.
 Print out the median of np_height

# Use np_baseball from previous question

# Import numpy
import numpy as np

# Create np_height from np_baseball


# Print out the mean of np_height
# Print out the median of np_height
Basic plots with matplotlib
Question 12:
• print() the last item from both the year and the pop lists to see what the predicted population for
the year 2100 is.
• Before you can start, you should import matplotlib.pyplot as plt. pyplot is a sub-package
of matplotlib, hence the dot.
• Use plt.plot() to build a line plot. year should be mapped on the horizontal axis, pop on the
vertical axis. Don't forget to finish off with the show() function to actually display the plot.

# Print the last item from year and pop


# Import matplotlib.pyplot as plt
# Make a line plot: year on the x-axis, pop on the y-axis
Histograms
Question 13:
• Use plt.hist() to create a histogram of the values in prices (a list containing at least
20 different values). Do not specify the number of bins; Python will set the number of bins to
10 by default for you.
• Add plt.show() to actually display the histogram. Can you tell which bin contains the most
observations?
# Import matplotlib.pyplot as plt

#Create a list named as prices

# Create histogram of prices data


# Display histogram
Boolean logic & Controlflow
Question 14:
 Examine the if statement that prints out "Looking around in the
kitchen." if room equals "kit".
 Write another if statement that prints out "big place!" if area is greater than 15.

# Define variables
room = "kit"
area = 14.0

# if statement for room


if room == "kit" :
print("looking around in the kitchen.")

# if statement for area

Pandas
In the exercises that follow, you will be working with vehicle data in different countries. Each observation corresponds to a
country, and the columns give information about the number of vehicles per capita, whether people drive left or right, and
so on. This data is available in a CSV file, named cars.csv.
Question 15:
 To import CSV files, you still need the pandas package: import it as pd.
 Use pd.read_csv() to import cars.csv data as a DataFrame. Store this dataframe as cars.
 Print out cars. Does everything look OK?

# Import pandas as pd

# Import the cars.csv data: cars


# Print out cars
Think Tank
It's time to customize your own plot. This is the fun part, you will see your plot come to life!
You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale),
life expectancy on the y-axis. The code for this plot is available in the script.
Question 16:
 The strings xlab and ylab are already set for you. Use these variables to set the label of the x- and y-
axis.
 The string title is also coded for you. Use it to add a title to the plot.
 After these customizations, finish the script with plt.show()to actually display the plot.

import matplotlib.pyplot as plt
import importlib
importlib.reload(plt)
import pandas as pd
plt.clf()

df = pd.read_csv('http://assets.datacamp.com/course/intermediate_python/gapminder.csv', index_col=0)
gdp_cap = list(df.gdp_cap)
life_exp = list(df.life_exp)
# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')
# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'
# Add axis labels
# Add title
# After customizing, display the plot
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 02: To Practice the Python Platform and explore its Libraries for Data Science and Machine
Learning
Roll No: Student Name:

Batch: Section: Group: Date:

Rubric 1: Problem Understanding and Analysis
 - Proficient: The student has carefully read and understood the problem and gives the best idea to solve it.
 - Adequate: The student understands the problem and its analysis well.
 - Just Acceptable: The student understands the problem and its analysis in a just acceptable way.
 - Unacceptable: The student does not understand the problem.

Rubric 2: Code Originality
 - Proficient: Students express their own solutions and translate the logic from a flowchart or pseudocode into a programming language by writing their own code.
 - Adequate: Adequate code originality.
 - Just Acceptable: Acceptable code originality.
 - Unacceptable: Code originality is not acceptable.

Rubric 3: Completeness and Accuracy
 - Proficient: The lab tasks are complete and accurate in the context of implementation.
 - Adequate: The lab is mostly complete and accurate in the context of its implementation.
 - Just Acceptable: Lab completion is just acceptable.
 - Unacceptable: The lab in its current state is not implementable.

Rubric 4: Analysis and Results
 - Proficient: Appropriate data collected; analysis and results correctly interpreted.
 - Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
 - Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
 - Unacceptable: Inappropriate data collected; no understanding of analysis and results; results incorrectly interpreted.

Teacher:                                Date:
Department: Computer Systems Engineering      Subject: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro
Year: 4th        Semester: 8th        Batch: 18 CS        Duration: 03 Hours

Practical 03
To Perform Exploratory Data Analysis and
Demonstrate In Depth-Visualizations Of
Data.
Outline:
• Features and their types
• Basic Plots
• Higher Dimensionality Visualizations

Required Tools:
• PC with Windows
• Anaconda3-5.0.1

3.1. Features & Their Types


A feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative,
discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and
regression.
There are two types of features:
Continuous Features
A measurable difference exists between the values continuous features take on. Continuous variables are variables that can
have an infinite number of possible values, as opposed to discrete variables which can only have a specified range of
values. Also continuous features are usually a subset of all real numbers. Some example features are:

• Time
• Distance
• Cost
• Temperature
Categorical Features
With categorical features, there is a specified number of discrete, possible feature values. These values may or may not
have an ordering to them. If they do have a natural ordering, they are called ordinal categorical features. Otherwise if
there is no intrinsic ordering, they are called nominal categorical features.
Nominal
• Car Models
• Colors
• TV Shows
Ordinal
• High-Medium-Low
• 1-10 Years Old, 11-20 Years Old, 30-40 Years Old
• Happy, Neutral, Sad
Figure 3.1: Types of features
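In pandas this distinction can be made explicit with the categorical dtype; a small sketch with made-up values (not part of any dataset used in this lab):

import pandas as pd

# Nominal categorical feature: no natural ordering.
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")
print(colors.cat.categories)

# Ordinal categorical feature: the categories carry an order.
sizes = pd.Categorical(["low", "high", "medium", "low"],
                       categories=["low", "medium", "high"],
                       ordered=True)
print(sizes.min(), sizes.max())   # low high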

3.2. Visualizations
One of the most rewarding and useful things you can do to understand your data is to visualize it in a pictorial format.
Visualizing your data allows you to interact with it, analyze it in a straightforward way, and identify new patterns, making
your would-be complex data more accessible and understandable. The way our brains process visuals like shapes, colors,
and lengths makes looking at charts and graphs more intuitive for us than poring over spreadsheets.
3.2.1. Matplotlib
MatPlotLib is a Python data visualization tool that supports 2D and 3D rendering, animation, UI design, event handling,
and more. It only requires you pass in your data and some display parameters and then takes care of all of the rasterization
implementation details. For the most part, you will be interacting with MatPlotLib's Pyplot functionality through a Pandas
series or dataframe's .plot namespace. Pyplot is a collection of command-style methods that essentially make
MatPlotLib's charting methods feel like MATLAB.
3.3. Basic Plots
3.3.1. Histograms
Histograms are graphical techniques which have been identified as being most helpful for troubleshooting issues.
Histograms help you understand the distribution of a feature in your dataset. They accomplish this by simultaneously
answering the questions where in your feature's domain your records are located at, and how many records exist there.
Let's explore a little about how to use histograms and what they can actually do. You've probably seen wheat before, but it
turns out that there are quite a few different varieties of it. The Institute of Agrophysics of the Polish Academy of Sciences
created a dataset containing several different types of wheat. They X-rayed different specimens of wheat and then extracted
features from the results. Some of the metrics they curated include the groove length of the kernel, the kernel length and the
kernel width, along with other measurements. In addition, they added some engineered or calculated features, such as an
asymmetry coefficient.
Figure 3.2: Wheat Kernel

DATA SET DESCRIPTION


The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70
elements each, randomly selected for
the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is
non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or
laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine
harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy
of Sciences in Lublin.

Example 3.1:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')  # Look pretty; ggplot is a Python implementation of the grammar of graphics.
# If the above line throws an error, use plt.style.use('ggplot') instead

df = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/wheat.csv", index_col=0)

# Prints the columns present in your csv file
print(df.columns)

# Creates a series named my_series from the asymmetry column
my_series = df.asymmetry

# Creates a dataframe named my_dataframe using the provided columns
my_dataframe = df[['wheat_type', 'length', 'asymmetry']]

# Histogram of the series
my_series.plot.hist(alpha=0.5)
plt.show()

# Histogram of the dataframe
my_dataframe.plot.hist(alpha=0.5)
plt.show()

# Histograms based on a particular condition
df[df.wheat_type==1].asymmetry.plot.hist(alpha=0.4)  # Kama
df[df.wheat_type==2].asymmetry.plot.hist(alpha=0.4)  # Rosa
df[df.wheat_type==3].asymmetry.plot.hist(alpha=0.4)  # Canadian
plt.show()
Output:

Figure 3.3: Histogram based on one feature

Figure 3.4: Histogram based on three features

Figure 3.5: Histogram based on conditions

3.3.2. 2D Scatter Plots


2D scatter plots are used to visually inspect if a correlation exist between the charted features. Both axes of a 2D scatter
plot represent a distinct, numeric feature. They don't have to be continuous, but they must at least be ordinal since each
record in your dataset is being plotted as a point with its location along the axes corresponding to its feature values.
Without ordering, the position of the plots would have no meaning.
It is possible that either a negative or positive correlation exist between the charted features, or alternatively, none at all.
The correlation type can be assessed through the overall diagonal trending of the plotted points.
Positive and negative correlations may further display a linear or non-linear relationship. If a straight line can be drawn
through your scatter plot and most of points seem to stick close to it, then it can be said with a certain level of confidence
that there is a linear relationship between the plotted features. Similarly, if a curve can be drawn through the points, there is
likely a non-linear relationship. If neither a curve nor line adequately seems to fit the overall shape of the plotted points,
chances are there is neither a correlation nor relationship between the features, or at least not enough information at present
to determine.
Dataset for the next example is taken from UCI Machine Learning Repository.

Example 3.2:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')  # Look pretty
# If the above line throws an error, use plt.style.use('ggplot') instead

# Creation of a dataframe from the csv file
df = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/Concrete_Data.csv", index_col=0)

# Prints the columns in the file
print(df.columns)

# Rename columns
df.columns = ['Slag', 'Ash', 'Water', 'Superplasticizer', 'Coarse Aggregate',
              'Fine Aggregate', 'Age', 'Strength']

# Print the new column names
print(df.columns)

# Creates scatter plots based on different features
df.plot.scatter(x='Slag', y='Strength')
plt.show()

df.plot.scatter(x='Water', y='Strength')
plt.show()

df.plot.scatter(x='Ash', y='Strength')
plt.show()
Output:
Figure 3.6 (a): 2D scatter plot of slag vs. strength

Figure 3.6 (b): 2D scatter plot of water vs. strength

Figure 3.6 (c): 2D scatter plot of ash vs. strength

3.3.3. 3D Scatter Plots


There surely is a way to visualize the relationship between three variables simultaneously. That way is through 3D scatter
plots. Unfortunately, the Pyplot member of Pandas data frames don't natively support the ability to generate 3D plots, so
for the sake of your visualization repertoire, you're going to learn how to make them directly with MatPlotLib.
Example 3.3:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # for creation of 3D plots

matplotlib.style.use('ggplot')  # Look pretty
# If the above line throws an error, use plt.style.use('ggplot') instead

# Creation of a dataframe from the csv file
df = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/Concrete_Data.csv", index_col=0)

# Prints the columns in the file
print(df.columns)

# Rename columns
df.columns = ['Slag', 'Ash', 'Water', 'Superplasticizer', 'Coarse Aggregate',
              'Fine Aggregate', 'Age', 'Strength']

# Print the new column names
print(df.columns)

# Create the figure and a 3D subplot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # "1x1 grid, first subplot"
# or use ax = fig.gca(projection='3d'); gca gets the current axes, creating them if necessary

ax.set_xlabel('Slag')
ax.set_ylabel('Water')
ax.set_zlabel('Strength')

ax.scatter(df.Slag, df.Water, df.Strength, c='b', marker='.')

plt.show()
Output:

Figure 3.7: 3D Scatter plot

3.4. Higher Dimensionality Visualizations


Scatter plots are effective in communicating data by mapping a feature to spatial dimensions, which we understand
intuitively. However, you and I are limited in that we lose the ability to easily and passively comprehend an image past
three spatial dimensions. It takes a great deal of thought and even more creativity to push the envelope any further. You
can introduce a time dimension using animations, but it really doesn't get much better than that.
Real world datasets often have tens of features, if not more. Sparse datasets can have tens of thousands of features. What
are your visualization options when you have a dataset with more than three dimensions?
3.4.1. Parallel Coordinates
Parallel coordinate plots are similar to scatter plots in that each axis maps to the ordered, numeric domain of a feature. But
instead of having axes aligned in an orthogonal manner, parallel coordinates get their name due to their axes being
arranged vertically and in parallel. All that is just a fancy way of saying parallel coordinates are a bunch of parallel,
labeled, numeric axes.
Each graphed observation is plotted as a polyline, a series of connected line segments. The joints of the polyline fall on
each axis. Since each axis maps to the domain of a numeric feature, the resulting polyline fully describes the value of each
of the observation's features.
Parallel coordinates are a useful charting technique you'll want to add to your exploration toolkit. They are
a higher dimensionality visualization technique because they allow you to easily view observations with more than three
dimensions simply by tacking on additional parallel coordinates. However, at some point it becomes hard to comprehend
the chart due to the sheer number of axes and also potentially due to the number of observations. If your data has
more than 10 features, parallel coordinates might not do it for you.
Parallel coordinates are useful because polylines belonging to similar records tend to cluster together. To graph them with
Pandas and MatPlotLib, you have to specify a feature to group by (it can be non-numeric). This results in each distinct
value of that feature being assigned a unique color when charted. Here's an example of parallel coordinates using SciKit-
Learn's Iris dataset.
Example 3.4:
from sklearn.datasets import load_iris
from pandas.plotting import parallel_coordinates  # in older pandas this lived in pandas.tools.plotting

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead

# Load up SKLearn's Iris dataset into a Pandas dataframe
data = load_iris()  # <class 'sklearn.utils.Bunch'>
df = pd.DataFrame(data.data, columns=data.feature_names)  # creates the dataframe

# Creates a target_names column in df; the class of each observation is stored
# in the .target attribute of the dataset.
df['target_names'] = [data.target_names[i] for i in data.target]

# Parallel coordinates start here:
plt.figure()
parallel_coordinates(df, 'target_names')
plt.show()
Output:

Figure 3.8: Parallel coordinates


Pandas' parallel coordinates interface is extremely easy to use, but use it with care. It only supports a single scale for all
your axes. If you have some features that are on a small scale and others on a large scale, you'll have to deal with a
compressed plot. For now, your main options are to:
• Normalize your features before charting them (a sketch of this follows below)
• Change the scale to a log scale
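A sketch of the first option, min-max normalizing each feature to the [0, 1] range before charting; it reuses the Iris dataframe from Example 3.4, so nothing new is assumed beyond that example:

from sklearn.datasets import load_iris
from pandas.plotting import parallel_coordinates
import pandas as pd
import matplotlib.pyplot as plt

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Min-max normalize every feature column to the [0, 1] range.
normalized = (df - df.min()) / (df.max() - df.min())
normalized['target_names'] = [data.target_names[i] for i in data.target]

plt.figure()
parallel_coordinates(normalized, 'target_names')
plt.show()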

3.4.2. Andrew’s Curve


An Andrews plot, also known as an Andrews curve, helps you visualize higher dimensionality, multivariate data by plotting
each of your dataset's observations as a curve. The feature values of the observation act as the coefficients of the curve, so
observations with similar characteristics tend to group closer to each other. Because of this, Andrews curves have some use in
outlier detection.
Andrews curves are a method for visualizing multidimensional data by mapping each observation x = (x1, x2, x3, x4, ...) onto a function. This
function is defined as:

f_x(t) = x1 / sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ...
For implementation details of Andrews curves in Python, follow the link:


https://glowingpython.blogspot.com/2014/10/andrews-curves.html
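To make the formula concrete, here is a minimal sketch that traces the Andrews curve of a single, hypothetical 4-feature observation by hand; the feature values are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

# One hypothetical 4-feature observation (Iris-like values, chosen for illustration)
x = np.array([5.1, 3.5, 1.4, 0.2])

t = np.linspace(-np.pi, np.pi, 200)

# Andrews mapping for four features:
# f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t)
f = x[0] / np.sqrt(2) + x[1] * np.sin(t) + x[2] * np.cos(t) + x[3] * np.sin(2 * t)

plt.plot(t, f)
plt.xlabel('t')
plt.ylabel('f_x(t)')
plt.title('Andrews curve of a single observation')
plt.show()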
Here's an example of Andrews curves using SciKit-Learn's Iris dataset.
Example 3.5:
from sklearn.datasets import load_iris
# andrews_curves lives in pandas.plotting in current pandas versions
from pandas.plotting import andrews_curves

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead

# Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target_names'] = [data.target_names[i] for i in data.target]

# Andrews Curves Start Here:
plt.figure()
andrews_curves(df, 'target_names')
plt.show()
Output:
Figure 3.9: Andrews curves

3.4.3. IMSHOW
One last higher dimensionality visualization technique you should know how to use is MatPlotLib's .imshow() method.
This command generates an image based on the normalized values stored in a matrix, or rectangular array of float64s.
The properties of the generated image depend on the dimensions and contents of the array passed in, as sketched below:
 An [X, Y] shaped array will result in a grayscale image being generated
 A [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 1 for green, and 1 for blue
 A [X, Y, 4] shaped array results in a full-color image as before, with an extra channel for alpha
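As a quick illustration of the first two cases, this sketch feeds randomly generated arrays to .imshow(); the contents are arbitrary, only the shapes matter:

import numpy as np
import matplotlib.pyplot as plt

gray = np.random.random((64, 64))      # [X, Y] array -> rendered as a grayscale image
rgb = np.random.random((64, 64, 3))    # [X, Y, 3] array -> full-color image (R, G, B channels)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(gray, cmap='gray')
ax1.set_title('Grayscale')
ax2.imshow(rgb)
ax2.set_title('RGB')
plt.show()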
Besides being a straightforward way to display .PNG and other images, the .imshow() method has quite a few other use
cases. When you use the .corr() method on your dataset, Pandas calculates a correlation matrix for you that measures
how close to linear the relationship between any two features in your dataset is. Correlation values range
from -1 to 1, where 1 means the two features are perfectly positively correlated, and -1 means they are perfectly
negatively correlated, again in a linear fashion. Values closer to 0 mean there is little to no linear relationship
between the two variables at all (e.g., pizza sales and plant growth), so the further away from 0 the value is, the
stronger the relationship between the features. Such a correlation matrix can itself be displayed with .imshow(), as sketched below.
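A minimal sketch of this use case, again assuming the Iris DataFrame df from Example 3.4 is available, renders the correlation matrix returned by .corr() with .imshow():

import matplotlib.pyplot as plt

# Assumes df is the Iris DataFrame from Example 3.4; keep only the numeric feature columns.
corr = df.drop('target_names', axis=1).corr()

plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Feature correlation matrix')
plt.show()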
Example 3.6:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.image as mpimg

# Read an image file from disk (adjust the path to an image on your machine)
img = mpimg.imread("C:/Users/Mehak/Desktop/DataScience/Capture.png")
print(img.shape)
plt.imshow(img)
# plt.imshow(img, aspect=0.5)
plt.show()
Output:

Figure 3.10: IMSHOW for images

Exercise
For this assignment, you'll be using the seeds data set, generated by recording X-Ray measurements of various wheat
kernels.

Histograms
Question 1: Write python code that
1. Loads the seeds dataset into a dataframe.
2. Creates a slice of your dataframe that only includes the area and perimeter features
3. Creates another slice that only includes the groove and asymmetry features
4. Creates a histogram for the 'area and perimeter' slice, and another histogram for the 'groove and
asymmetry' slice. Set the optional display parameter: alpha=0.75
Once you're done, run your code and then answer the following questions about your work:
a) Looking at your first plot, the histograms of area and perimeter, which feature do you believe more
closely resembles a Gaussian / normal distribution?
b) In your second plot, does the groove or asymmetry feature have more variance?

2D Scatter Plots
Question 2: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Create a 2d scatter plot that graphs the area and perimeter features
3. Create a 2d scatter plot that graphs the groove and asymmetry features
4. Create a 2d scatter plot that graphs the compactness and width features
Once you're done, answer the following questions about your work:
a) Which of the three plots seems to totally be lacking any correlation?
b) Which of the three plots has the most correlation?

3D Scatter Plots
Question 3: Write python code that
1. Loads up the seeds dataset into a dataframe. You should be very good at doing this by now.
2. Graph a 3D scatter plot using the area, perimeter, and asymmetry features. Be sure to label your axes,
and use the optional display parameter c='red'.
3. Graph a 3D scatter plot using the width, groove, and length features. Be sure to label your axes, and
use the optional display parameter c='green'.
Once you're done, answer the following questions about your work.
a) Which of the plots seems more compact / less spread out?
b) Which of the plots were you able to visibly identify two outliers within, that stuck out from the
samples?

Parallel Coordinates
Question 4: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Drop the area, and perimeter features from your dataset. Use .drop method on data frame to drop
specified columns
3. Plot a parallel coordinates chart, grouped by the wheat_type feature. Be sure to set the optional
display parameter alpha to 0.4
Once you're done, answer the following questions about your work.
a) Which class of wheat do the two outliers you found previously belong to?
b) Which feature has the largest spread of values across all three types of wheat?

Andrew’s Plot
Question 5: Write python code that
1. Loads up the seeds dataset into a dataframe
2. Plot an Andrews curve chart, grouped by the wheat_type feature. Be sure to set the optional display
parameter alpha to 0.4
Once you're done, answer the following questions about your work.
a) Are your outlier samples still easily identifiable in the plot?
IMSHOW
Question 6: Write python code that
1. Loads up any image of your choice, into a dataframe.
2. Print shape and type of the object holding image.
3. Plot image using imshow.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 03: To perform Exploratory Data Analysis and demonstrate in depth-visualization of Data
Roll No: Student Name:

Batch: Section: Group: Date:

Rubric Components and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: Students have carefully read and understood the problem, and give the best idea to solve it.
   Adequate: Student understands the problem and analysis in a good way.
   Just Acceptable: Student understands the problem and analysis in a just acceptable way.
   Unacceptable: Student doesn't understand the problem.

2. Code Originality
   Proficient: Students find their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program's code.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab/tasks are complete and accurate in the context of implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
   Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.

Teacher:                                                  Date:

Department of: Computer Systems Engineering                Subject of: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro    Year: 4th    Semester: 8th    Batch: 18CS    Duration: 03 Hours

PRACTICAL 04
To perform Inferential Statistics and Hypothesis
testing

Outline:

 To get familiar with statistical testing


 To practice and use statistical testing methods.

Statistical Testing
Data Science, Machine Learning, Artificial Intelligence, Deep Learning - You need to learn
the basics before you become a good Data Scientist. Math and Statistics are the building
blocks of Algorithms for Machine Learning. Knowing the techniques behind different
Machine Learning Algorithms is fundamental to knowing how and when to use them.

"Statistics is the practice or science of collecting and analyzing numerical data in large
quantities, especially to infer proportions in a whole from those in a representative
sample."

Statistics are used to interpret data and solve complicated real-world issues. Data scientists
and analysts use statistics to search for concrete patterns and data changes. To put it
simply, Statistics can be used to carry out mathematical computations to extract useful
insights from data.

Key terminology in statistics:

1. The Population is the collection of sources from which to gather data.

2. A Variable is any characteristic, number, or quantity observable or countable.


3. A Sample is a Population subset
4. A Statistical Parameter or Population Parameter is a quantity indexing a family of probability distributions

Statistical Hypothesis
A statistical hypothesis, sometimes called confirmatory data analysis, is a hypothesis that is
testable based on observing a process that is modeled via a set of random variables.

A statistical hypothesis test is a method of statistical inference. Commonly, two statistical


data sets are compared, or a data set obtained by sampling is compared against a synthetic
data set from an idealized model. An alternative hypothesis is proposed for the statistical
relationship between the two datasets and is compared to an idealized null hypothesis that
proposes no relationship between these two datasets. This comparison is deemed
statistically significant if the relationship between the datasets would be an unlikely
realization of the null hypothesis according to a threshold probability—the significance
level. Hypothesis tests are used when determining what outcomes of a study would lead to
a rejection of the null hypothesis for a pre-specified level of significance.

Or in other words, Hypothesis testing is a statistical approach that is used with


experimental data to make statistical decisions. This is used to determine whether an
experiment performed offers ample evidence to reject a proposal. Before we start to
distinguish between various tests or experiments, we need to gain a clear understanding of
what a null hypothesis is.

A Null Hypothesis implies that there is no strong difference in a given set of observations.

Parameters of Hypothesis Testing


Null hypothesis: - In inferential statistics, the null hypothesis is a general statement or
default position that there is no relationship between two measured phenomena or no
association among groups

In other words, it is a basic assumption made on the basis of domain or problem

knowledge. Example: a company's production is = 50 units per day, etc.

Alternative hypothesis:-
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to
the null hypothesis. It is usually taken to be that the observations are the result of a real
effect (with some amount of chance variation superposed)
Example: a company's production is != 50 units per day, etc.
Level of significance: Refers to the degree of significance with which we accept or reject the
null hypothesis. 100% accuracy is not possible for accepting or rejecting a hypothesis, so
we therefore select a level of significance, usually 5%.

This is normally denoted by alpha (α) and is generally 0.05 or 5%, which means you require
95% confidence that each sample would give a similar kind of result.

Type I error: When we reject the null hypothesis, although that hypothesis was true. Type
I error is denoted by alpha. In hypothesis testing, the normal curve that shows the critical
region is called the alpha region

Type II errors: When we accept the null hypothesis, but it is false. Type II errors are
denoted by beta. In Hypothesis testing, the normal curve that shows the acceptance region
is called the beta region.

One-tailed test:- A test of a statistical hypothesis, where the region of rejection is on only
one side of the sampling distribution, is called a one-tailed test.

Example: a college has ≥ 4000 students, or data science is adopted by ≤ 80% of organizations.

Two-tailed test:- A two-tailed test is a statistical test in which the critical area of a
distribution is two-sided and tests whether a sample is greater than or less than a certain
range of values. If the sample being tested falls into either of the critical areas, the
alternative hypothesis is accepted instead of the null hypothesis.

Example: a college has != 4000 students, or data science is adopted by != 80% of organizations.

Figure 1: One and Two Tailed

P-value: The P value, or calculated probability, is the probability of finding the observed,
or more extreme, results when the null hypothesis (H0) of a study question is true; the
definition of 'extreme' depends on how the hypothesis is being tested.

If your P value is less than the chosen significance level then you reject the null hypothesis i.e.
accept that your sample gives reasonable evidence to support the alternative hypothesis. It
does NOT imply a “meaningful” or “important” difference; that is for you to decide when
considering the real-world relevance of your result.

Example: you have a coin and you don't know whether it is fair or tricky, so let's define the
null and alternative hypotheses:

H0: a coin is a fair coin.

H1: a coin is a tricky coin. and alpha = 5% or 0.05

Now let’s toss the coin and calculate the p-value ( probability value).

Toss a coin 1st time and the result is tail- P-value = 50% (as head and tail have equal
probability)

Toss a coin 2nd time and result is tail, now p-value = 50/2 = 25%

and similarly, if we toss the coin 6 consecutive times and get tails every time, the p-value
works out to about 1.5%. We set our significance level at 5% (the error rate we allow), and
1.5% is below that threshold, i.e. our null hypothesis does not hold, so we reject it and
conclude that the coin is a tricky coin.
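The same conclusion can be reached more formally. The sketch below uses SciPy's binomial distribution to compute the exact probability of seeing 6 tails in 6 tosses of a fair coin and compares it to the 5% significance level; the variable names are illustrative only:

from scipy.stats import binom

# P(X >= 6) for 6 tosses of a fair coin, under H0: the coin is fair (one-sided)
n_tosses = 6
n_tails = 6
p_value = binom.sf(n_tails - 1, n_tosses, 0.5)
print('p-value: {:.4f}'.format(p_value))   # about 0.0156, i.e. roughly 1.5%

alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: the coin looks tricky')
else:
    print('Fail to reject the null hypothesis: no evidence the coin is unfair')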

Now Let’s see some of the widely used hypothesis testing types:-

1. T Test ( Student T-test)

2. Z Test

3. ANOVA Test

4. Chi-Square Test

1. T-Test
A t-test is a type of inferential statistic which is used to determine if there is a significant
difference between the means of two groups that may be related to certain features. It is
mostly used when the data sets, like the set of data recorded as an outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. The T-
test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a
population.
T-test has 2 types: 1. one sampled t-test 2. two-sampled t-test.

One-sample t-test: The One Sample t Test determines whether the sample mean is
statistically different from a known or hypothesized population mean. The One Sample t-Test
is a parametric test.

Example:- you have 10 ages and you are checking whether avg age is 30 or not. (check the
code below for that using python)

Code:
# Create ages.csv on your own (a single column of ages)

from scipy.stats import ttest_1samp
import numpy as np

ages = np.genfromtxt("ages.csv")
print(ages)

ages_mean = np.mean(ages)
print(ages_mean)

tset, pval = ttest_1samp(ages, 30)
print("p-values", pval)

if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")


Two sampled T-test:-The Independent Samples t-Test or 2-sample t-test compares the
means of two independent groups to determine whether there is statistical evidence that the
associated population means are significantly different. The Independent Samples t-Test is a
parametric test. This test is also known as the Independent t-Test.

Example: is there any association between week1 and week2 ( code is given below in python)

Code:
# Create week1.csv and week2.csv on your own

from scipy.stats import ttest_ind
import numpy as np

week1 = np.genfromtxt("week1.csv", delimiter=",")
week2 = np.genfromtxt("week2.csv", delimiter=",")

print(week1)
print("week2 data :-\n")
print(week2)

week1_mean = np.mean(week1)
week2_mean = np.mean(week2)
print("week1 mean value:", week1_mean)
print("week2 mean value:", week2_mean)

week1_std = np.std(week1)
week2_std = np.std(week2)
print("week1 std value:", week1_std)
print("week2 std value:", week2_std)

ttest, pval = ttest_ind(week1, week2)
print("p-value", pval)

if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")

Z Test.
Several different types of tests are used in statistics (i.e. f test, chi-square test, t-test). You
would use a Z test if:

 Your sample size is greater than 30. Otherwise, use a t-test.
 Data points should be independent of each other. In other words, one data point isn't related to, and doesn't affect, another data point.
 Your data should be normally distributed. However, for large sample sizes (over 30) this doesn't always matter.
 Your data should be randomly selected from a population, where each item has an equal chance of being selected.
 Sample sizes should be equal if at all possible.

For example, we can use a one-sample z-test to check whether the mean blood pressure equals a
hypothesized value such as 156 (Python code is below for the same).

import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

# Load a dataset containing a 'bp_before' (blood pressure before treatment) column.
# The file name below is only a placeholder - point it at your own CSV.
df = pd.read_csv("blood_pressure.csv")

ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))

if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

LAB TASK
1. Define Statistical Testing and its importance.
2. Explain Hypothesis Testing with code examples.
3. Explain at least two types of statistical testing (along with code) as
discussed in theory lecture.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 04: To perform inferential statistics and hypothesis testing


Roll No: Student Name:

Batch: Section: Group: Date:

Rubric Components and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: Students have carefully read and understood the problem, and give the best idea to solve it.
   Adequate: Student understands the problem and analysis in a good way.
   Just Acceptable: Student understands the problem and analysis in a just acceptable way.
   Unacceptable: Student doesn't understand the problem.

2. Code Originality
   Proficient: Students find their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program's code.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab/tasks are complete and accurate in the context of implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
   Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Computer Systems Engineering                Subject of: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro    Year: 4th    Semester: 8th    Batch: 18CS    Duration: 03 Hours

PRACTICAL 05
To Demonstrate the Implementation of Linear
Regression in Data Modeling
Outline:
 Linear Regression implementation with Scikit-Learn

Required Tools:
 PC with windows
 Python 3.6.2
 Anaconda 3-5.0.1

5.1 Linear Regression


Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed
data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent
variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear
regression model.
Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a
relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for
example, higher SAT scores do not cause higher college grades), but that there is some significant association
between the two variables. Similarly, you can build a linear regression model that fits a relationship between
university students' GPAs and their first job's annual salary. But simply having a high GPA
doesn't cause someone to have a high-paying job, although there is probably some significant association between the two.
A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there
appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not
indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not
provide a useful model. A valuable numerical measure of association between two variables is the correlation
coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the
two variables.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the
dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).
Least-Squares Regression
The most common method for fitting a regression line is the method of least-squares. This method calculates the
best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data
point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are
first squared, then summed, there are no cancellations between positive and negative values.
To view the fit of the model to the observed data, one may plot the computed regression line over the actual data
points to evaluate the results.
Outliers and Influential Observations
After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a
large residual value) is known as an outlier. Such points may represent erroneous data, or may indicate a poorly
fitting regression line.
With regression, a continuous value output is calculated from a best fit curve function that runs through your data. In
the special case of linear regression, the curve is restricted such that it is linear.
Effectively predicting the future, known as extrapolating, or identifying a trend in your existing data, known
as interpolating, requires there be a statistically significant, linear correlation between your features.
5.2 How Does Univariate Linear Regression Work?
The univariate part just means that you're only relating a single independent feature "x" to the dependent feature or
output "y". You will see that upgrading this to a full-blown, multivariate linear regression is just an exercise of adding
in additional coefficients, so let's first walk through the simple case.
You probably recall a certain equation from school that looked like this: Y = a + bX. This is the basic equation of
linear regression and is the formula for the green line in Figure 4.1; however, we're going to alter the variable naming
convention slightly, just so that it matches the SciKit-Learn documentation: y = w0 + w1x. All that we've done
so far is change b to w1, and a to w0. The w's stand for 'weight coefficients', our currently unknown parameters for
calculating y given x. From the diagram it's clear that w0 corresponds to the y-intercept, or offset between
the green line and the x-axis. As for w1, that is the quotient of the change in your dependent variable y and the
change in your independent variable x. That's pretty much it! Linear regression is all about computing the dependent
feature as a linear combination of weights multiplied by the independent features.

Figure 4.1: Working of Linear Regression

So How Do We Find the Weights?


SciKit-Learn uses a technique called ordinary least squares to compute the weight coefficients and intercept needed
to solve for the best fitting line that goes through your samples. In the figure above, each of the black dots represents
one of your observations, and of course the green line is your least squares, best fitting line. The red lines represent the
distances between the true, observed values of your samples and the least squares line we're hoping to
calculate. Stated differently, these distances are the error between the approximate solution and the actual value.
Ordinary least squares works by minimizing the squared sum of all these red line errors to compute the best
fitting line. A minimal sketch of this computation is shown below.
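The sketch below is not SciKit-Learn's internal code; it simply fits a line to synthetic data using the closed-form least-squares formulas for w1 and w0 and then checks that SciKit-Learn recovers the same weights. All data here is made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y roughly follows 2x + 1 plus noise
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, 50)

# Closed-form ordinary least squares for the univariate case:
# w1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print('By hand:      w0 = {:.3f}, w1 = {:.3f}'.format(w0, w1))

# The same weights recovered by SciKit-Learn
model = LinearRegression().fit(x.reshape(-1, 1), y)
print('Scikit-learn: w0 = {:.3f}, w1 = {:.3f}'.format(model.intercept_, model.coef_[0]))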
Once you have the equation, you can use it to calculate an expected value for feature y, given that you have feature
x. If the x values you plug into the equation happen to lie within the x-domain boundary of the samples you
trained your regression with, then this is called interpolation, or even approximation, because you do have the actual
observed values of y for the data in that range. When you use the equation to calculate a y-value outside the bounds
of your training data's x-domain, that is called extrapolation.
5.3 When to use Linear Regression
Linear regression is widely used in all disciplines for forecasting upcoming feature values by extrapolating the
regression line, and for estimating current feature values by interpolating the regression curve over existing data. It
is an extremely well-understood and interpretable technique that runs very fast, and produces reasonable results
as long as you do not extrapolate too far away from your training data.
One of the main advantages of linear regression over other machine learning algorithms is that even though it's a
supervised learning technique, it doesn't force you to fine tune a bunch of parameters to get it working. You can
literally just dump your data into it and let it produce its results.
You can use linear regression if your features display a significant correlation. The stronger the feature correlation, it
being closer to +1 or -1, the better and more accurate the linear regression model for your data will be. The
questions linear regression helps you answer are which independent feature inputs relate to the dependent feature
output variable, and the degree of that relationship.
In business, linear regression is often used to forecast sales. By finding a correlation between time and the number of
sales, a company can predict their near-terms future revenue, which will then help them budget accordingly. Linear
regression can also be used to assess risk. Before issuing a loan, most banks will consider many features or aspects
about their customers, and run a regression to see if it is worthwhile for them to borrow the money, of if their return
on investment is insignificant or even positive for that matter.
In the sciences, geologists train linear regression against historic records to calculate the rate of glacier snow
melting, and can use it extrapolate how long it'll take for it to all disappear. Oil engineers do the same while
calculating how much is potentially left. When measuring experimental results, chemists use linear regression to
empirically calculate and validate concentrations and expected reactions. And of course there are many more uses.
5.4 Scikit-Learn & Linear Regression
SciKit-Learn's LinearRegression class implements all the expected methods found in the rest of their supervised
learning classes, including fit(), predict(), and score(). As for its
outputs, the attributes you're interested in are:
intercept_ the scalar constant offset value
coef_ an array of weights, one per input feature, which will act as a scaling factor
For more information on scikit learn’s implementation of linear regression follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
In Example 4.1 we are going to find the relationship between math test scores and the tissue concentration of LSD
in a test-taker’s skin (Source: Wagner, Agahajanian, and Bing.1968. “Correlation of Performance Test Scores with
Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects.” Clinical Pharmacology and
Therapeutics, Vol.9 pp 635-638.)

Example 4.1
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data =
pd.read_csv("C:/Users/Mehak/Desktop/DataScience/unchanged data
set/lsd.csv")
print(data.head())

X = data['Tissue Concentration']
y = data['Test Score']
X=X.values.reshape(len(X),1)

y=y.values.reshape(len(y),1)
print(X)
print(y)

model = LinearRegression()
model.fit(X, y)

plt.scatter(X, y,color='r')
plt.plot(X, model.predict(X),color='k')
plt.show()
Output:

Figure 4.2:Linear regression in Python, Math Test Scores on the Y-Axis, and Amount of LSD intake on the X-
Axis.

Given data, we can try to find the best fit line. After we discover the best fit line, we can use it to make predictions.
Consider we have data about houses: price, size, driveway and so on.

Example 4.2:
import matplotlib
matplotlib.style.use('ggplot')

import matplotlib.pyplot as plt


import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
# Load CSV and columns
df = pd.read_csv("C:/Users/Mehak/Desktop/DataScience/unchanged
data set/Housing.csv")

Y = df['price']
X = df['lotsize']

X=X.values.reshape(len(X),1)
Y=Y.values.reshape(len(Y),1)

# Split the data into training/testing sets


X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets


Y_train = Y[:-250]
Y_test = Y[-250:]

# Plot outputs
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')

# Create linear regression object


regr = LinearRegression()

# Train the model using the training sets


regr.fit(X_train, Y_train)

# Plot outputs
plt.plot(X_test, regr.predict(X_test), color='red',linewidth=3)
plt.show()

# make an individual prediction (the input must be 2D: one sample with one feature)

print(str(np.round(regr.predict([[5000]]))))  # Output [[ 61623.]]
Output:
Figure 4.3: Prediction of house prices using linear regression

The next example uses only the first feature of the diabetes dataset, in order to illustrate a two-dimensional
plot of this regression technique. The straight line can be seen in the plot, showing how linear regression attempts to
draw a straight line that will best minimize the residual sum of squares between the observed responses in the
dataset, and the responses predicted by the linear approximation.
The coefficients, the residual sum of squares and the variance score are also calculated.
Example 4.3:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset


diabetes = datasets.load_diabetes()

# Use only one feature


diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets


diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets


diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object


regr = linear_model.LinearRegression()

# Train the model using the training sets


regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test,
diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue',
linewidth=3)

plt.show()
Output:

Figure 4.4: Linear regression on diabetes dataset

5.5 Multiple Variable Linear Regression


Multiple linear regression is similar to simple linear regression, but the major difference is that we try to
establish a linear relationship between one response variable and more than one predictor variable. For example,
suppose a researcher is studying how housing prices are affected by the area of the apartment and the demand-supply
gap in that region. The response variable is Y and the predictor variables are X1 and X2.

The general structure of linear regression model in this case would be:
Y = a + b.X1 + c.X2
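Before the real-data example that follows, here is a minimal sketch of this structure on synthetic data; the predictor names (apartment area and demand-supply gap) and all numbers are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: X1 = apartment area, X2 = demand-supply gap
rng = np.random.RandomState(1)
X1 = rng.uniform(500, 2000, 100)
X2 = rng.uniform(0, 10, 100)

# Hypothetical response following Y = a + b*X1 + c*X2, plus noise
Y = 50000 + 120 * X1 + 3000 * X2 + rng.normal(0, 5000, 100)

X = np.column_stack([X1, X2])
model = LinearRegression().fit(X, Y)

print('a (intercept):      ', model.intercept_)
print('b, c (coefficients):', model.coef_)
print('Prediction for area=1200, gap=4:', model.predict([[1200, 4]])[0])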
We have the following datasets:
Example 4.4:
import matplotlib.pyplot as plt #provides methods for plotting
import pandas as pd #provides methods to deal with data
import numpy as np #provides methods for array creation
from sklearn.linear_model import LinearRegression #provides methods related
to LR
from sklearn.model_selection import train_test_split #provides methods
related to division of dataset
from mpl_toolkits.mplot3d import Axes3D

co2_df = pd.read_csv('C:/Users/Mehak/Desktop/DataScience/unchanged data


set/global_co2.csv') #loads data
temp_df = pd.read_csv('C:/Users/Mehak/Desktop/DataScience/unchanged data
set/annual_temp.csv') #loads data
print(co2_df.head()) #prints starting records of the csv file based on
default value
print(temp_df.head())#prints starting records of the csv file based on
default value

# Clean data
co2_df = co2_df.iloc[:,:2] # Keep only total CO2
co2_df = co2_df.loc[co2_df['Year'] >= 1960] # Keep only 1960 - 2010
co2_df.columns=['Year','CO2'] # Rename columns
co2_df = co2_df.reset_index(drop=True) # Reset index

temp_df = temp_df[temp_df.Source != 'GISTEMP']


# Keep only one source
temp_df.drop('Source', inplace=True, axis=1)
# Drop name of source
temp_df = temp_df.reindex(index=temp_df.index[::-1])
# Reset index
temp_df = temp_df.loc[temp_df['Year'] >= 1960].loc[temp_df['Year'] <= 2010]
# Keep only 1960 - 2010
temp_df.columns=['Year','Temperature']
# Rename columns
temp_df = temp_df.reset_index(drop=True)
# Reset index

print(co2_df.head())
print(temp_df.head())

# Concatenate
climate_change_df = pd.concat([co2_df, temp_df.Temperature], axis=1)

print(climate_change_df.head())

#from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
fig.set_size_inches(12, 7) #set size for figure
ax = fig.add_subplot(projection='3d')  # create 3D axes (fig.gca(projection=...) is deprecated)
ax.scatter(xs=climate_change_df['Year'],
ys=climate_change_df['Temperature'], zs=climate_change_df['CO2']) #sets data
for X,Y,Z axis and creates scatter plot

ax.set_ylabel('Relative temperature')
ax.set_xlabel('Year')
ax.set_zlabel('CO2 Emissions')

X = climate_change_df[['Year']].values  # creates vector X (as_matrix() was removed from pandas)
Y = climate_change_df[['CO2', 'Temperature']].values.astype('float32')
# creates vectors Y, Z

X_train, X_test, y_train, y_test = train_test_split(X, Y)

# creates training & testing sets

reg = LinearRegression()
reg.fit(X_train, y_train)
print('Score: ', reg.score(X_test, y_test))
x_line = np.arange(1960,2011).reshape(-1,1)
p = reg.predict(x_line).T
fig2 = plt.figure()
fig2.set_size_inches(12.5, 7.5)#sets size of figure
ax = fig2.add_subplot(projection='3d')  # creates 3D graph
ax.scatter(xs=climate_change_df['Year'],
ys=climate_change_df['Temperature'], zs=climate_change_df['CO2']) #sets data
for X,Y,Z axis and creates scatter plot
ax.set_ylabel('Relative temperature')
ax.set_xlabel('Year')
ax.set_zlabel('CO2 Emissions')
ax.plot(xs=x_line, ys=p[1], zs=p[0], color='red')
Output:

Figure 4.5: 3D Scatter plot

Figure 4.6: Multiple Variable Linear Regression


Exercise
Question 1:

Advances in medicine, an increase in healthcare facilities, and improved standards of care have all contributed to an
increased overall life expectancy over the last few decades. Although this might seem like great achievement for
humanity, it has also led to the abandonment of more elderly people into senior-care and assisted living
communities. The morality, benefits, and disadvantages of leaving one's parents in such facilities are still debatable;
however, the fact that this practice has increased the financial burden on both the private-sector and government is
not.
In this lab assignment, you will be using the subset a life expectancy dataset, provided courtesy of the Center for
Disease Control and Prevention's National Center for Health Statistics page. The page hosts many open datasets on
topics ranging from injuries, poverty, women's health, education, health insurance, and of course infectious diseases,
and much more. But the one you'll be using is their "Life expectancy at birth, at age 65, and at age 75, by sex, race,
and origin" data set, which has statistics dating back from the 1900's to current, taken within the United States. The
dataset only lists the life expectancy of whites and blacks, because throughout most of the collection period, those
were the dominant two races that actively had their statistics recorded within the U.S.
Using linear regression, you will extrapolate how long people will live in the future. The private sector and
governments mirror these calculations when computing social security payouts, taxes, infrastructure, and more.
Complete the following:
1) Make sure the dataset has been properly loaded.
2) Create a linear model to use and re-use throughout the assignment. You can retrain the same
model again, rather than re-creating a new instance of the class.
3) Slice out using indexing any records before 1986 into a brand new slice.
4) Have one slice for training and one for testing. First, map the life expectancy of white males as a
function of age, or WhiteMales = f(age).
5) Fit your model, draw a regression line and scatter plot with the convenience function, and then
print out the actual, observed 2015 White Male life expectancy value from the dataset.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 05: To demonstrate the implementation of Linear Regression in Data Modeling


Roll No: Student Name:

Batch: Section: Group: Date:

Rubric Components and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: Students have carefully read and understood the problem, and give the best idea to solve it.
   Adequate: Student understands the problem and analysis in a good way.
   Just Acceptable: Student understands the problem and analysis in a just acceptable way.
   Unacceptable: Student doesn't understand the problem.

2. Code Originality
   Proficient: Students find their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program's code.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab/tasks are complete and accurate in the context of implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
   Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Computer Systems Engineering                Subject of: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro    Year: 4th    Semester: 8th    Batch: 18CS    Duration: 03 Hours

Practical 06
To Demonstrate the Implementation of
Logistic Regression in Data Modeling
Outline:
 Logistic Regression implementation with Scikit-Learn

Required Tools:
 PC with windows
 Python 3.6.2
 Anaconda 3-5.0.1

6.1 Introduction
Logistic regression is another technique borrowed by machine learning from the field of statistics. Logistic
regression is a statistical method for analyzing a dataset in which there are one or more independent variables that
determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible
outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE,
success, malignant, etc.) or 0 (FALSE, failure, benign, etc.)
Classification problems
• Email -> spam/not spam?
• Online transactions -> fraudulent?
• Tumor -> Malignant/benign
These types of problems are called binary class problems. Variable in these problems is Y.
• Y is either 0 or 1
– 0 = negative class (absence of something)
– 1 = positive class (presence of something)
Before diving into the underlying mathematical concepts of logistic regression, let's build up our understanding
with an example.
Suppose HPenguin wants to know how likely it is to be happy based on its daily activities. If the penguin wants to
build a logistic regression model to predict its happiness based on its daily activities, it needs examples of both the
happy and sad activities. In machine learning terminology these activities are known as the input parameters
(features).
Figure 5.1: Logistic Regression Model Example

So let’s create a table which contains penguin activities and the result of that activity like happy or sad.

No. Penguin Activity Penguin Activity Description How Penguin felt ( Target )

1 X1 Eating squids Happy

2 X2 Eating small Fishes Happy

3 X3 Hit by other Penguin Sad

4 X4 Eating Crabs Sad


Table 5.1: Penguin Activity chart

The penguin is going to use the above activities (features) to train the logistic regression model. Later, the trained logistic
regression model will predict how the penguin is feeling for new penguin activities.
As it's not possible to use the above categorical data table to build the logistic regression model directly, the activities
table needs to be converted into activity scores, weights, and the corresponding targets.
No. Penguin Activity Activity Score Weights Target Target Description

1 X1 6 0.6 1 Happy

2 X2 3 0.4 1 Happy

3 X3 7 -0.7 0 Sad

4 X4 3 -0.3 0 Sad
Table 5.2: Penguin Activity Continuous Data chart

6.2 Developing a Classification Algorithm


Consider developing a classification algorithm for finding out if tumor is malignant or benign based on tumor size.
We could use linear regression
• Then threshold the classifier output (i.e. anything over some value is yes, else no)
• In our example below linear regression with thresholding seems to work.

Figure 5.2: Linear regression performance

In the previous example, linear regression does a reasonable job of stratifying the data points into one of two classes.
• But what if we had a single Yes with a very small tumor?
• This would lead to classifying all the existing yeses as nos.
Another issue with linear regression: we know Y is 0 or 1, but the hypothesis can give values larger than 1 or less than 0.
Logistic regression instead produces a hypothesis whose output always lies in this range:
0 ≤ hθ(x) ≤ 1
Logistic regression is a classification algorithm
6.3 Hypothesis Representation
The function that is used to represent our hypothesis in classification is based on the sigmoid function. We want our
classifier to output values between 0 and 1.
• When using linear regression we did hθ(x) = (θT x)

• For classification hypothesis representation we do hθ(x) = g((θT x))

• Where we define g(z)

– z is a real number

• g(z) = 1/(1 + e-z)

– This is the sigmoid function, or the logistic function

• If we combine these equations we can write out the hypothesis as

hθ(x) = 1 / (1 + e^(−θT x))
6.3.1. What does the sigmoid function look like?

Figure 5.3: Sigmoid function

Crosses 0.5 at the origin, then flattens out

 Asymptotes at 0 and 1.
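A minimal sketch that plots g(z) makes these properties easy to verify:

import numpy as np
import matplotlib.pyplot as plt

def g(z):
    # logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, g(z))
plt.axhline(0.5, color='grey', linestyle='--')   # g(0) = 0.5
plt.xlabel('z')
plt.ylabel('g(z)')
plt.title('Sigmoid (logistic) function')
plt.show()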

6.4 Interpreting hypothesis output


When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x.
Example
– If x is a feature vector with x0 = 1 (as always) and x1 = tumorSize, i.e.
x = [x0, x1] = [1, tumorSize]
– and hθ(x) = 0.7
• Tell a patient they have a 70% chance of a tumor being malignant

We can write this using the following notation


hθ(x) = P(y=1|x ; θ)
What does this mean?
• Probability that y=1, given x, parameterized by θ
Since this is a binary classification task we know y = 0 or 1
• So the following must be true

– P(y=1|x ; θ) + P(y=0|x ; θ) = 1

– P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
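In SciKit-Learn, these two probabilities are exactly what LogisticRegression's predict_proba() method returns. A minimal sketch on the breast cancer dataset (the dataset choice and the max_iter value are just for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# Each row of predict_proba is [P(y=0|x; theta), P(y=1|x; theta)] for one observation;
# the two columns sum to 1, as required for a binary classifier.
probs = clf.predict_proba(X_test[:5])
print(probs)
print(probs.sum(axis=1))   # all ones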

6.5 Scikit-learn & Logistic Regression


SciKit-Learn's LogisticRegression class implements all the expected methods found in the rest of their supervised
learning classes, including fit(), predict(), predict_proba() and score(). As for its outputs, the attributes you're
interested in are:
intercept_ the scalar constant offset value
coef_ an array of weights, one per input feature, which will act as a scaling factor
For more information on scikit learn’s implementation of logistic regression follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Example 5.1:
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model

# this is our test set, it's just a straight line with some
# Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)

X = X[:, np.newaxis]
# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)

# and plot the result


plt.figure(1, figsize=(4, 3))
plt.clf()
plt.scatter(X.ravel(), y, color='black', zorder=20)
X_test = np.linspace(-5, 10, 300)

def model(x):
return 1 / (1 + np.exp(-x))
loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
plt.plot(X_test, loss, color='red', linewidth=3)

ols = linear_model.LinearRegression()
ols.fit(X, y)
plt.plot(X_test, ols.coef_ * X_test + ols.intercept_,
linewidth=1)
plt.axhline(.5, color='.5')

plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression
Model'),loc="lower right", fontsize='small')
plt.show()
Output:

Figure 5.4: Linear and logistic regression comparision

Example 5.2:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test =


train_test_split(cancer.data, cancer.target,
stratify=cancer.target)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
print('Accuracy on the training subset:
{:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset:
{:.3f}'.format(log_reg.score(X_test, y_test)))

#optimize the classifier to work better

#Regularization: prevention of overfitting


#'C':
#parameter to control the strength of regularization
#lower C => log_reg adjusts to the majority of data points.
#higher C => correct classification of each data point.
log_reg100 = LogisticRegression(C=100)
log_reg100.fit(X_train, y_train)
print('Accuracy on the training subset:
{:.3f}'.format(log_reg100.score(X_train, y_train)))
print('Accuracy on the test subset:
{:.3f}'.format(log_reg100.score(X_test, y_test)))

log_reg001 = LogisticRegression(C=0.01)
log_reg001.fit(X_train, y_train)
print('Accuracy on the training subset:
{:.3f}'.format(log_reg001.score(X_train, y_train)))
print('Accuracy on the test subset:
{:.3f}'.format(log_reg001.score(X_test, y_test)))
Output:

Figure 5.5: Accuracies based on different values of logistic regression parameters


Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 06: To Practice and demonstrate the implementation of Logistic regression in Data
Modeling
Roll No: Student Name:

Batch: Section: Group: Date:

Rubric Components and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: Students have carefully read and understood the problem, and give the best idea to solve it.
   Adequate: Student understands the problem and analysis in a good way.
   Just Acceptable: Student understands the problem and analysis in a just acceptable way.
   Unacceptable: Student doesn't understand the problem.

2. Code Originality
   Proficient: Students find their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program's code.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab/tasks are complete and accurate in the context of implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
   Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Computer Systems Engineering                Subject of: Data Science & Analytics
Mehran University of Engineering & Technology, Jamshoro    Year: 4th    Semester: 8th    Batch: 18CS    Duration: 03 Hours

Practical 07
To Practice and Demonstrate the
Implementation Of Artificial Neural
Network & Deep Learning.
Outline:
 Neural Network implementation with Scikit-Learn
 Image Classification using Convolutional Neural Networks

Required Tools:
 PC with windows
 Python 3.6.2
 Anaconda 3-5.0.1

7.1. Introduction
Neural Networks are a machine learning framework that attempts to mimic the learning pattern of
natural biological neural networks. Biological neural networks have interconnected neurons with
dendrites that receive inputs, then based on these inputs they produce an output signal through an
axon to another neuron. We will try to mimic this process through the use of Artificial Neural
Networks (ANN), which we will just refer to as neural networks from now on. The process of creating
a neural network begins with the most basic form, a single perceptron.
7.2. The Perceptron
Let's start our discussion by talking about the Perceptron! A perceptron has one or more inputs, a
bias, an activation function, and a single output. The perceptron receives inputs, multiplies them by
some weight, and then passes them into an activation function to produce an output. There are
many possible activation functions to choose from, such as the logistic function, a trigonometric
function, a step function etc. We also make sure to add a bias to the perceptron, this avoids issues
where all inputs could be equal to zero (meaning no multiplicative weight would have an effect),
bias is also used to delay the triggering of the activation function. Check out the diagram below for a
visualization of a perceptron:

Figure 6.1: Perceptron Visualization

To create a neural network, we simply begin to add layers of perceptrons together, creating a multi-
layer perceptron model of a neural network. You'll have an input layer which directly takes in your
feature inputs and an output layer which creates the resulting outputs. Any layers in between are
known as hidden layers because they don't directly "see" the feature inputs or outputs.

Figure 6.2: Hidden layer ANN (Multi-layer Perceptron)


7.3. What is an Activation Function?
An activation function takes the sum of the weighted inputs (w1*x1 + w2*x2 + w3*x3 + 1*b) as its argument and returns
the output of the neuron:
a = ∑ wi xi , for i = 0 … N
The activation function is mostly used to apply a non-linear transformation. There are multiple activation functions,
such as "Sigmoid" (the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x))), "Tanh" (the hyperbolic tan function,
returns f(x) = tanh(x)), "ReLU" (the rectified linear unit function, returns f(x) = max(0, x)) and many others. A small
sketch of these functions follows below.
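A minimal sketch that plots these three activation functions side by side:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)

sigmoid = 1.0 / (1.0 + np.exp(-x))   # squashes input into (0, 1)
tanh = np.tanh(x)                    # squashes input into (-1, 1)
relu = np.maximum(0, x)              # zero for negative input, identity otherwise

plt.plot(x, sigmoid, label='sigmoid')
plt.plot(x, tanh, label='tanh')
plt.plot(x, relu, label='ReLU')
plt.legend()
plt.title('Common activation functions')
plt.show()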
7.4. Scikit learn & Multi-Layer Perceptron
SciKit-Learn's MLPClassifier class implements all the expected methods found in the rest of their supervised learning classes,
including fit(), predict(), predict_proba() and score(). As for its outputs, the attribute
you're interested in is:
coefs_ a list of weight matrices, where the matrix at index i holds the weights between layer i and layer i + 1
For more information on scikit learn's implementation of the MLP classifier follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
ABOUT DATASET:
We'll use SciKit Learn's built in Breast Cancer Data Set which has several features of tumors with a
labeled class indicating whether the tumor was Malignant or Benign.
Example 6.1:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # preprocessing module
import matplotlib.pyplot as plt

cancer = load_breast_cancer()

# Data Exploration
print(cancer['DESCR'])
print(cancer.feature_names)
print(cancer.target_names)
print(cancer.data)
print(cancer.data.shape)

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

mlp = MLPClassifier()  # defaults: 200 iterations, one hidden layer of 100 units
mlp.fit(X_train, y_train)

print('Accuracy on the training subset:', (mlp.score(X_train, y_train)))
print('Accuracy on the test subset: ', (mlp.score(X_test, y_test)))

print('The maximum per each feature:\n{}'.format(cancer.data.max(axis=0)))

# Standardize the features before retraining
scaler = StandardScaler()  # instantiate the object
X_train_scaled = scaler.fit(X_train).transform(X_train)  # fit and transform the data
X_test_scaled = scaler.fit(X_test).transform(X_test)

mlp = MLPClassifier(max_iter=1000, random_state=42)  # multi-layer perceptron with 1000 iterations
mlp.fit(X_train_scaled, y_train)

print('Accuracy on the training subset (Scaled Set):', (mlp.score(X_train_scaled, y_train)))
print('Accuracy on the test subset (Scaled Set):', (mlp.score(X_test_scaled, y_test)))

mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=42)
mlp.fit(X_train_scaled, y_train)

print('Accuracy on the training subset (Changed alpha):', (mlp.score(X_train_scaled, y_train)))
print('Accuracy on the test subset (Changed alpha): ', (mlp.score(X_test_scaled, y_test)))

# Visualize the first layer of learned weights
plt.figure(figsize=(20, 5))
plt.imshow(mlp.coefs_[0], interpolation='None', cmap='GnBu')
plt.yticks(range(30), cancer.feature_names)
plt.xlabel('Columns in weight matrix')
plt.ylabel('Input feature')
plt.colorbar()
plt.show()
Output:
Figure 6.3: Accuracies based on different parameters of MLPClassifier

Figure 6.4: MLP Co-efficient with color map visualizations

We have following dataset for our next example:

ABOUT DATASET:
The dataset that will be used is from Kaggle: Pima Indians Diabetes Database.
It has 9 variables: ‘Pregnancies’, ‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’, ‘BMI’,
‘DiabetesPedigreeFunction’,’Age’, ‘Outcome’.
Here is the variable description from Kaggle:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
All these variables are continuous; the goal of the tutorial is to predict whether someone has diabetes (Outcome=1)
according to the other variables. It is worth noticing that all the observations are from women older than 21 years old.
Example 6.2:
import pandas as pd
import numpy as np
Diabetes=pd.read_csv('C:/Users/Mehak/Desktop/DataScience/
unchanged data set/diabetes.csv')
table1=np.mean(Diabetes,0)# finds out mean based on columns
table2=np.std(Diabetes,0)# finds out std deviation based on
columns
print(table1)
print(table2)

inputData=Diabetes.iloc[:,:8] #creates dataframes based on


columns 1 through 8
outputData=Diabetes.iloc[:,8] #creates sequence based on column
9
print(inputData.head())
print(outputData.head())

from sklearn.neural_network import MLPClassifier


mlp = MLPClassifier()
mlp.fit(inputData,outputData)

print('Score',mlp.score(inputData,outputData)) #defines how


well the algorithm has performed

####Model performance
####Classification rate 'by hand'
##Correctly classified
print('CORRECTLY CLASSIFIED', np.mean(mlp.predict(inputData)==outputData))
##True positive
trueInput=Diabetes.loc[Diabetes['Outcome']==1].iloc[:,:8]
trueOutput=Diabetes.loc[Diabetes['Outcome']==1].iloc[:,8]
##True positive rate
print('TRUE POSITIVE RATE', np.mean(mlp.predict(trueInput)==trueOutput))
##True negative
falseInput=Diabetes.loc[Diabetes['Outcome']==0].iloc[:,:8]
falseOutput=Diabetes.loc[Diabetes['Outcome']==0].iloc[:,8]
##True negative rate
print('TRUE NEGATIVE RATE', np.mean(mlp.predict(falseInput)==falseOutput))

##Real vs predicted plot


import matplotlib.pyplot as plt
plt.figure()
plt.scatter(inputData.iloc[:,1], inputData.iloc[:,5], c=mlp.predict_proba(inputData)[:,1], alpha=0.4)
plt.xlabel('Glucose level ')
plt.ylabel('BMI ')
plt.show()

plt.figure()
plt.scatter(inputData.iloc[:,1], inputData.iloc[:,5], c=outputData, alpha=0.4)
plt.xlabel('Glucose level ')
plt.ylabel('BMI ')
plt.show()
Output:

Figure 6.5: Scatter plot based on glucose level & BMI (predicted outcomes)
Figure 6.6: Scatter plot based on glucose level & BMI (actual outcomes)

7.5. Pros & Cons of Artificial Neural Networks


Pros:
 can be used efficiently on large datasets
 can build very complex models
 many parameters for tuning
 flexibility and rapid prototyping
 etc.
Cons:
 many parameters for tuning
 some solvers are scale sensitive
 data may need to be pre-processed
 etc.
Alternatives:
 theano
 tensorflow
 keras
 lasagne
 etc.

7.6 DEEP LEARNING


Deep Learning – which has emerged as an effective tool for analyzing big data – uses complex algorithms and
artificial neural networks to train machines/computers so that they can learn from experience, classify and
recognize data/images just like a human brain does. Within Deep Learning, a Convolutional Neural Network
or CNN is a type of artificial neural network, which is widely used for image/object recognition and
classification. Deep Learning thus recognizes objects in an image by using a CNN. CNNs are playing a major
role in diverse tasks/functions like image processing problems, computer vision tasks like localization and
segmentation, video analysis, to recognize obstacles in self-driving cars, as well as speech recognition in
natural language processing. As CNNs are playing a significant role in these fast-growing and emerging areas,
they are very popular in Deep Learning.
A Potent Tool Within Deep Learning
 CNNs have fundamentally changed our approach towards image recognition as they can detect
patterns and make sense of them. They are considered the most effective architecture for image
classification, retrieval and detection tasks as the accuracy of their results is very high.
 They have broad applications in real-world tests, where they produce high-quality results and can do a
good job of localizing and identifying where in an image a person/car/bird, etc., are. This aspect has
made them the go-to method for predictions involving any image as an input.
 A critical feature of CNNs is their ability to achieve ‘spatial invariance’, which implies that they can
learn to recognize and extract image features anywhere in the image. There is no need for manual
extraction as CNNs learn features by themselves from the image/data and perform extraction directly
from images. This makes CNNs a potent tool within Deep Learning for getting accurate results.
 According to the paper published in ‘Neural Computation’, “the purpose of the pooling layers is to
reduce the spatial resolution of the feature maps and thus achieve spatial invariance to input
distortions and translations.” As the pooling layer brings down the number of parameters needed to
process the image, processing becomes faster, while memory requirements and computational cost are
reduced (a short sketch of this effect appears after this list).
 While image analysis has been the most widespread use of CNNs, they can also be used for other data
analysis and classification problems. Therefore, they can be applied across a diverse range of sectors
to get precise results, covering critical aspects like face recognition, video classification, street/traffic
sign recognition, galaxy classification, and the interpretation and diagnosis of medical images,
among others.
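
To make the pooling point above concrete, here is a minimal sketch (assuming TensorFlow/Keras is available, as in the CIFAR-10 example that follows; the feature map here is random dummy data) showing how a MaxPooling2D layer halves the spatial resolution while leaving the number of channels unchanged:

import numpy as np
from tensorflow.keras.layers import MaxPooling2D

# A dummy batch containing one 32x32 feature map with 8 channels
feature_map = np.random.rand(1, 32, 32, 8).astype("float32")

# The default pool size (2, 2) keeps the maximum of each 2x2 window
pooled = MaxPooling2D()(feature_map)

print(feature_map.shape)  # (1, 32, 32, 8)
print(pooled.shape)       # (1, 16, 16, 8) -- spatial resolution halved, channels unchanged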

The following is a program to do image classification on the CIFAR-10 dataset:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dropout, MaxPooling2D, Flatten, Dense
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.datasets.cifar10 import load_data


(X_train, y_train), (X_test, y_test) = load_data()

# rescale image
X_train_scaled = X_train / 255.0
X_test_scaled = X_test / 255.0

model = Sequential([
    Conv2D(32, (3,3), input_shape=(32, 32, 3), padding="same", activation="relu", kernel_constraint=MaxNorm(3)),
    Dropout(0.3),
    Conv2D(32, (3,3), padding="same", activation="relu", kernel_constraint=MaxNorm(3)),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation="relu", kernel_constraint=MaxNorm(3)),
    Dropout(0.5),
    Dense(10, activation="sigmoid")
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=25, batch_size=32)
This network should be able to achieve around 70% accuracy in classification. The images are 32×32 pixels
in RGB color, and they belong to 10 different classes, with labels that are integers from 0 to 9.

You can print the network using Keras’s summary() function:


...
model.summary()
In this network, the following will be shown on the screen:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 32, 32, 32)        896

 dropout (Dropout)           (None, 32, 32, 32)        0

 conv2d_1 (Conv2D)           (None, 32, 32, 32)        9248

 max_pooling2d (MaxPooling2D (None, 16, 16, 32)        0
 )

 flatten (Flatten)           (None, 8192)              0

 dense (Dense)               (None, 512)               4194816

 dropout_1 (Dropout)         (None, 512)               0

 dense_1 (Dense)             (None, 10)                5130

=================================================================
Total params: 4,210,090
Trainable params: 4,210,090
Non-trainable params: 0
_________________________________________________________________
It is typical for an image classification network to be composed of convolutional layers at an early stage,
with dropout and pooling layers interleaved. Then, at a later stage, the output from the convolutional layers is
flattened and processed by some fully connected layers.

Showing the Feature Maps


In the above network, there are two convolutional layers (Conv2D). The
first layer is defined as follows:
Conv2D(32, (3,3), input_shape=(32, 32, 3), padding="same", activation="relu", kernel_constraint=MaxNorm(3))
This means the convolutional layer has a 3×3 kernel and is applied to an input image of 32×32 pixels
with three channels (the RGB colors). Therefore, the output of this layer will have 32 channels.
In order to make sense of the convolutional layer, you can check out its
kernel. The variable model holds the network, and you can find the kernel
of the first convolutional layer with the following:
...
print(model.layers[0].kernel)
This prints:

<tf.Variable 'conv2d/kernel:0' shape=(3, 3, 3, 32) dtype=float32, numpy=
array([[[[-2.30068922e-01, 1.41024575e-01, -1.93124503e-01,
          -2.03153938e-01, 7.71819279e-02, 4.81446862e-01,
          -1.11971676e-01, -1.75487325e-01, -4.01797555e-02,
...
          4.64215249e-01, 4.10646647e-02, 4.99733612e-02,
          -5.22711873e-02, -9.20209661e-03, -1.16479330e-01,
          9.25614685e-02, -4.43541892e-02]]]], dtype=float32)>
You can tell that model.layers[0] is the correct layer by comparing the
name conv2d from the above output to the output of model.summary().
This layer has a kernel of the shape (3, 3, 3, 32), which are the height,
width, input channels, and output feature maps, respectively.
Assume the kernel is a NumPy array k. A convolutional layer will take its
kernel k[:, :, 0, n] (a 3×3 array) and apply on the first channel of the
image. Then apply k[:, :, 1, n] on the second channel of the image,
and so on. Afterward, the result of the convolution on all the channels is
added up to become the feature map n of the output, where n, in this
case, will run from 0 to 31 for the 32 output feature maps.
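
As a rough sketch of this per-channel accumulation (using a random kernel of the same (3, 3, 3, 32) shape instead of the trained weights, and ignoring the bias and activation that the real Conv2D layer also applies), one output feature map could be computed by hand as follows:

import numpy as np

k = np.random.rand(3, 3, 3, 32)    # height, width, input channels, output feature maps
image = np.random.rand(32, 32, 3)  # one RGB input image
n = 0                              # which output feature map to compute

# "same" padding: one pixel of zeros around each channel keeps the output at 32x32
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))

feature_map = np.zeros((32, 32))
for ch in range(3):                # loop over the three colour channels
    for i in range(32):
        for j in range(32):
            window = padded[i:i+3, j:j+3, ch]                     # 3x3 neighbourhood in this channel
            feature_map[i, j] += np.sum(window * k[:, :, ch, n])  # convolve and accumulate

print(feature_map.shape)  # (32, 32) -- one of the 32 output feature maps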
In Keras, you can extract the output of each layer using an extractor
model. In the following, you will create a batch with one input image and
send it to the network. Then look at the feature maps of the first
convolutional layer:

...
# Extract output from each layer
extractor = tf.keras.Model(inputs=model.inputs,
                           outputs=[layer.output for layer in model.layers])
features = extractor(np.expand_dims(X_train[7], 0))

# Show the 32 feature maps from the first layer
l0_features = features[0].numpy()[0]

fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(l0_features[..., i])

plt.show()
The above code will print the feature maps like the following:

This corresponds to the following input image:

You can see that they are called feature maps because they are
highlighting certain features from the input image. A feature is identified
using a small window (in this case, over a 3×3 pixels filter). The input
image has three color channels. Each channel has a different filter
applied, and their results are combined for an output feature.
You can similarly display the feature map from the output of the second
convolutional layer as follows:

...
# Show the 32 feature maps from the second convolutional layer (features[2] is the third layer of the model)
l2_features = features[2].numpy()[0]

fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(l2_features[..., i])

plt.show()
This shows the following:

From the above, you can see that the features extracted are more
abstract and less recognizable.

Exercise
Question 1:
Using the Housing.csv file from example .2, write code for an MLPClassifier that classifies whether the attribute
'fullbase' is going to be yes or no.
a) Divide the dataset into training and testing subsets.
b) Find the accuracy on the basis of the provided data (without preprocessing).
c) Find the accuracy on the basis of scaled data (with preprocessing). Also create a scatter plot
in which one can visualize the probability of getting yes or no via the MLPClassifier.
d) Write CNN-based code using the MNIST dataset for handwritten digit classification.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 07: To Practice and demonstrate the implementation of Artificial Neural Network &
Deep Learning
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics: Components and Levels of Achievement

1. Problem Understanding and Analysis
Proficient: Students have carefully read and understood the problem and give the best ideas to solve it.
Adequate: Student understands the problem and its analysis in a good way.
Just Acceptable: Student understands the problem and its analysis in a just acceptable way.
Unacceptable: Student does not understand the problem.

2. Code Originality
Proficient: Students work out and express their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program code.
Adequate: Adequate code originality.
Just Acceptable: Acceptable code originality.
Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
Proficient: The lab tasks are complete and accurate in the context of implementation.
Adequate: The lab is mostly complete and accurate in the context of its implementation.
Just Acceptable: The lab completion is just acceptable.
Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
Proficient: Appropriate data collected, correctly analysed and interpreted.
Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics

Mehran University of Engineering Year 4th Semester 8th
&Technology
Batch 18 CS Duration 03
Hours
Jamshoro

Practical 08
To Practice and Demonstrate the
Implementation Of Support Vector
Machine.
Outline:
 SVM implementation with Scikit-Learn

Required Tools:
 PC with windows
 Python 3.6.2
 Anaconda 3-5.0.1

8.1. Introduction
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both
classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we
plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of
each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane
that differentiates the two classes very well.
Figure 7.1: SVM hyperplane visualization

Support vectors are simply the coordinates of individual observations. A Support Vector Machine is the frontier
(hyper-plane/line) that best segregates the two classes.
8.1.1. What is Classification Analysis?
Let’s consider an example to understand these concepts. We have a population composed of 50% males and 50%
females. Using a sample of this population, you want to create a set of rules which will tell us the gender
class for the rest of the population. Using this algorithm, we intend to build a robot which can identify whether a person
is a male or a female. This is a sample problem of classification analysis. Using some set of rules, we will try to
classify the population into two possible segments. For simplicity, let’s assume that the two differentiating factors
identified are the height of the individual and the hair length. Following is a scatter plot of the sample.

Figure 7.2: Scatter plot

The blue circles in the plot represent females and the green squares represent males. A few expected insights from the
graph are:
1. Males in our population have a higher average height.
2. Females in our population have longer scalp hair.
If we were to see an individual with a height of 180 cm and hair length of 4 cm, our best guess would be to classify this
individual as a male. This is how we do a classification analysis.
Support vectors are simply the coordinates of individual observations. For instance, (45, 150) is a support vector
which corresponds to a female. The Support Vector Machine is the frontier which best segregates the males from the
females. In this case, the two classes are well separated from each other, hence it is easier to find the SVM.
Finding the SVM for the case in hand:
There are many possible frontiers which can classify the problem in hand. Following are the three possible frontiers.

Figure 7.3: Hyperplane selection

How do we decide which is the best frontier for this particular problem statement?
The easiest way to interpret the objective function in an SVM is to find the minimum distance of the frontier from the
closest support vector (which can belong to any class). For instance, the orange frontier is closest to the blue circles,
and the closest blue circle is 2 units away from the frontier. Once we have these distances for all the frontiers, we
simply choose the frontier with the maximum distance from the closest support vector. Out of the three frontiers shown,
we see the black frontier is farthest from its nearest support vector (i.e. 15 units).
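
As a hedged sketch of this margin idea (the (hair length, height) points below are made-up illustrative values, not data from any figure in this lab), a linear SVC exposes the chosen frontier and its support vectors directly:

import numpy as np
from sklearn.svm import SVC

# Toy (hair length in cm, height in cm) samples: label 0 = female, 1 = male (illustrative only)
X = np.array([[45, 150], [40, 155], [35, 160], [30, 158],
              [5, 175], [8, 180], [3, 172], [10, 178]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

svc = SVC(kernel='linear', C=1.0)
svc.fit(X, y)

print('Support vectors:\n', svc.support_vectors_)  # the observations that define the frontier
print('Support vectors per class:', svc.n_support_)

# For a linear SVM the margin width is 2 / ||w||, where w holds the feature weights
w = svc.coef_[0]
print('Margin width:', 2 / np.linalg.norm(w))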
8.2. Dive Deeper
8.2.1. Identify the right hyper-plane (Scenario-1):Here, we have three hyper-planes (A, B and
C). Now, identify the right hyper-plane to classify star and circle.

Figure 7.4: Scenario 1, Hyperplanes

You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the
two classes better”. In this scenario, hyper-plane “B” has excellently performed this job.
8.2.2. Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and
C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?
Figure 7.5: Scenario 2, Hyperplanes

Here, maximizing the distances between nearest data point (either class) and hyper-plane will help us to decide the
right hyper-plane. This distance is called as Margin. Let’s look at the below snapshot:

Figure 7.6: Scenario 2, Hyperplanes (Details)

Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name C as
the right hyper-plane. Another important reason for selecting the hyper-plane with the higher margin is robustness:
if we select a hyper-plane having a low margin then there is a high chance of mis-classification.

8.2.3. Identify the right hyper-plane (Scenario-3): Hint: Use the rules as discussed in previous
section to identify the right hyper-plane

Figure 7.7: Scenario 3, Hyperplanes

Some of you may have selected the hyper-plane B as it has higher margin compared to A. But, here is the catch,
SVM selects the hyper-plane which classifies the classes accurately prior to maximizing margin. Here, hyper-plane
B has a classification error and A has classified all correctly. Therefore, the right hyper-plane is A.
Find the hyper-plane to segregate two classes (Scenario-4): In the scenario below, we can’t
have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have
only looked at linear hyper-planes.

Figure 7.8: Scenario 4, Hyperplanes

SVM can solve this problem easily! It solves it by introducing an additional feature. Here, we will add a
new feature z=x^2+y^2. Now, let’s plot the data points on the x and z axes:

Figure 7.9: Scenario 4, Hyperplanes with kernel 1

In above plot, points to consider are:


 All values for z would be positive always because z is the squared sum of both x and y
 In the original plot, red circles appear close to the origin of x and y axes, leading to lower value
of z and star relatively away from the origin result to higher value of z.
In SVM, it is easy to have a linear hyper-plane between these two classes. But another burning question which
arises is: do we need to add this feature manually to get a hyper-plane? No, SVM has a technique called
the kernel trick. Kernels are functions which take a low-dimensional input space and transform it into a higher-
dimensional space, i.e. they convert a non-separable problem into a separable problem. They are
mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations,
then finds out the process to separate the data based on the labels or outputs you have defined.
When we look at the hyper-plane in original input space it looks like a circle:
Figure 7.10: Scenario 4, Hyperplanes with kernel 2
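
To connect the manual z = x^2 + y^2 feature above with the kernel trick, the following sketch (on a synthetic "circles" dataset, since the data plotted in the figures is not provided) compares a linear SVC on the hand-made feature against an RBF-kernel SVC that learns an equivalent non-linear boundary without any manual feature engineering:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Synthetic two-class data: one class inside a ring, one outside (not linearly separable in x, y)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Manual feature: add z = x^2 + y^2, after which a linear frontier separates the classes
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X_lifted = np.hstack([X, z])
print('Linear kernel on lifted data:', SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y))

# Kernel trick: the RBF kernel implicitly maps the original data to a higher-dimensional space
print('RBF kernel on original data :', SVC(kernel='rbf').fit(X, y).score(X, y))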

8.3. Scikit-learn & SVM


SciKit-Learn's SVC class implements all the expected methods found in the rest of its supervised learning classes,
including fit(), predict() and score(). As for its outputs, the attributes you are interested in are:
n_support_ (the number of support vectors for each class), coef_ (the weights assigned to the features, i.e. the
coefficients in the primal problem; only available in the case of a linear kernel), and
intercept_ (the constants in the decision function).
For more information on scikit-learn's implementation of the SVC classifier follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
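
A minimal sketch of reading those attributes (reusing the breast cancer data that Example 7.1 below also loads; a linear kernel is chosen here only because coef_ is defined for that case):

from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
svm = SVC(kernel='linear').fit(cancer.data, cancer.target)

print('Support vectors per class:', svm.n_support_)    # n_support_
print('Feature weight matrix shape:', svm.coef_.shape) # coef_ (linear kernel only)
print('Intercept:', svm.intercept_)                    # constant in the decision function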
Example 7.1:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,
cancer.target, random_state=0)

svm = SVC()
svm.fit(X_train, y_train)

print('The accuracy on the training subset: {:.3f}'.format(svm.score(X_train, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svm.score(X_test, y_test)))

import matplotlib.pyplot as plt

plt.plot(X_train.min(axis=0), 'o', label='Min')


plt.plot(X_train.max(axis=0), 'v', label='Max')
plt.xlabel('Feature Index')
plt.ylabel('Feature Magnitude in Log Scale')
plt.yscale('log')
plt.legend(loc='upper right')

min_train = X_train.min(axis=0)
range_train = (X_train - min_train).max(axis=0)

X_train_scaled = (X_train - min_train)/range_train


print('Minimum per feature\n{}'.format(X_train_scaled.min(axis=0)))
print('Maximum per feature\n{}'.format(X_train_scaled.max(axis=0)))

X_test_scaled = (X_test - min_train)/range_train

svm = SVC()
svm.fit(X_train_scaled, y_train)

print('The accuracy on the training subset: {:.3f}'.format(svm.score(X_train_scaled, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svm.score(X_test_scaled, y_test)))

svm = SVC(C=1000)
svm.fit(X_train_scaled, y_train)

print('The accuracy on the training subset: {:.3f}'.format(svm.score(X_train_scaled, y_train)))
print('The accuracy on the test subset: {:.3f}'.format(svm.score(X_test_scaled, y_test)))

print('The decision function is:\n\n{}'.format(svm.decision_function(X_test_scaled[:20])))
print('Thresholded decision function:\n\n{}'.format(svm.decision_function(X_test_scaled[:20])>0))
svm = SVC(C=1000, probability=True)
svm.fit(X_train_scaled, y_train)

print('Predicted probabilities for the samples (malignant and benign):\n\n{}'.format(svm.predict_proba(X_test_scaled[:20])))
print(svm.predict(X_test_scaled))
Output:
Figure 7.11: Accuracy based on different parameters, Scatter plot based on min & max values

Example 7.2:
import numpy as np
import pylab as pl
from sklearn import svm, datasets

# import some data to play with


iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We
#could avoid this ugly slicing by using
#a two-dim dataset
Y = iris.target

h = .02 # step size in the mesh

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, Y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
#'SVC with linear kernel'
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
pl.contourf(xx, yy, Z, cmap=pl.cm.Paired)
pl.axis('off')

# Plot also the training points


pl.scatter(X[:, 0], X[:, 1], c='Black', cmap=pl.cm.Paired)

pl.title('SVC with linear kernel')


pl.show()
Output:

Figure 7.12: SVC Classifier with linear kernel


8.4. Pros and Cons of Support Vector Machines
Every classification algorithm has its own advantages and disadvantages that are come into play according to the
dataset being analyzed. Some of the advantages of SVMs are as follows:
 SVM is an algorithm which is suitable for both linearly and non-linearly separable data (using the
kernel trick). The only thing to do is to come up with the regularization parameter C.
 SVMs work well on small as well as high dimensional data spaces. It works effectively for high-
dimensional datasets because of the fact that the complexity of the training dataset in SVM is
generally characterized by the number of support vectors rather than the dimensionality. Even if
all other training examples are removed and the training is repeated, we will get the same
optimal separating hyperplane.
 SVMs can work effectively on smaller training datasets as they don't rely on the entire data.
Disadvantages of SVMs are as follows:
 They are not suitable for larger datasets because the training time with SVMs can be high and
much more computationally intensive.
 They are less effective on noisier datasets that have overlapping classes.

Exercise
Question 1: Repeat example 7.2 to classify the flower (iris) dataset into 3 classes based on:
a) RBF kernel
b) Polynomial (degree 3) kernel
c) LinearSVC (linear kernel)
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 08: To Practice and demonstrate the implementation of Support Vector Machine
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics: Components and Levels of Achievement

1. Problem Understanding and Analysis
Proficient: Students have carefully read and understood the problem and give the best ideas to solve it.
Adequate: Student understands the problem and its analysis in a good way.
Just Acceptable: Student understands the problem and its analysis in a just acceptable way.
Unacceptable: Student does not understand the problem.

2. Code Originality
Proficient: Students work out and express their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program code.
Adequate: Adequate code originality.
Just Acceptable: Acceptable code originality.
Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
Proficient: The lab tasks are complete and accurate in the context of implementation.
Adequate: The lab is mostly complete and accurate in the context of its implementation.
Just Acceptable: The lab completion is just acceptable.
Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
Proficient: Appropriate data collected, correctly analysed and interpreted.
Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics


Mehran University of Engineering Year 4th Semester 8th
&Technology
Batch 18 CS Duration 03
Hours
Jamshoro

Practical 09
To Practice and Demonstrate the
Implementation Of K- Means Clustering
Algorithm.
Outline:
 K-Means clustering implementation with Scikit-Learn

Required Tools:
 PC with windows
 Python 3.6.2
 Anaconda 3-5.0.1

9.1. Introduction
9.1.1. Unsupervised Learning
Unsupervised learning is the training of a machine learning algorithm using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. It is basically learning
from unlabeled data.
Unsupervised machine learning can actually solve the exact same problems as supervised machine learning, though
it may not be as efficient or accurate.
Unsupervised machine learning is most often applied to questions of underlying structure. Genomics, for example, is
an area where we do not truly understand the underlying structure. Thus, we use unsupervised machine learning to
help us figure out the structure.
Unsupervised learning can also aid in "feature reduction." A term we will cover eventually here is "Principal
Component Analysis," or PCA, which is another form of feature reduction, used frequently with unsupervised
machine learning. PCA attempts to locate linearly uncorrelated variables, calling these the Principal Components,
since these are the more "unique" elements that differentiate or describe whatever the object of analysis is.
There is also a meshing of supervised and unsupervised machine learning, often called semi-supervised machine
learning. You will often find things get more complicated with real world examples. You may find, for example,
that first you want to use unsupervised machine learning for feature reduction, then you will shift to supervised
machine learning.
Flat Clustering
Flat clustering is where the scientist tells the machine how many categories to cluster the data into.
Hierarchical
Hierarchical clustering is where the machine is allowed to decide how many clusters to create based on its own
algorithms.
9.1.2. Comparison of Supervised & Unsupervised Learning
• Supervised learning

– Given a set of labels, fit a hypothesis to it

• Unsupervised learning

– Try to determine structure in the data

– A clustering algorithm groups data together based on data features

9.1.3. What is clustering good for?


Clustering is a type of unsupervised learning. This is very often used when you don’t have labeled data. K-Means
Clustering is one of the popular clustering algorithm. The goal of this algorithm is to find groups (clusters) in the
given data.
• Market segmentation - group customers into different market segments
• Social network analysis - Facebook "smart lists"
• Organizing computer clusters and data centers for network layout and location
• Astronomical data analysis - Understanding galaxy formation
• Image Segmentation
• Clustering Gene Segmentation Data
• News Article Clustering
• Clustering Languages
• Species Clustering
• Anomaly Detection

9.2. K-Means Clustering


The Κ-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the
number of clusters Κ and the data set. The goal of this algorithm is to find groups in the data, with the number of
groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature similarity. K-means algorithm
involves randomly selecting K initial centroids where K is a user defined number of desired clusters. Each point is
then assigned to a closest centroid and the collection of points close to a centroid form a cluster. The centroid gets
updated according to the points in the cluster and this process continues until the points stop changing their clusters.
Figure 8.1: K- Means Clustering

9.2.2. Algorithm
Our algorithm works as follows, assuming we have inputs x1, x2, x3, …, xn and a value of K.
Step 1 - Pick K random points as cluster centers called centroids.
Step 2 - Assign each xi to nearest cluster by calculating its distance to each centroid.
Step 3 - Find new cluster center by taking the average of the assigned points.
Step 4 - Repeat Step 2 and 3 until none of the cluster assignments change.
9.2.1 Dive Deeper

Step 1
We randomly pick K cluster centers (centroids). Let’s assume these are c1, c2, …, ck, and we can say that
C = {c1, c2, …, ck}
is the set of all centroids.
Step 2
In this step we assign each input value to the closest center. This is done by calculating the Euclidean (L2) distance between
the point and each centroid:

arg min_{c_i ∈ C} dist(c_i, x)^2

where dist(·) is the Euclidean distance.

Step 3
In this step, we find the new centroid by taking the average of all the points assigned to that cluster:

c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

where S_i is the set of all points assigned to the i-th cluster.
Step 4
In this step, we repeat step 2 and 3 until none of the cluster assignments change. That means until our clusters
remain stable, we repeat the algorithm.
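
The four steps above can be written out directly in NumPy. The following is a minimal from-scratch sketch (random initial centroids, Euclidean distance, no handling of empty clusters); it is not the scikit-learn implementation used in the examples below:

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 3: move each centroid to the average of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Small check with the same points as Example 8.1 below
X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]], dtype=float)
centroids, labels = kmeans(X, K=2)
print(centroids)
print(labels)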
9.3. Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the
number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and
compare the results. In general, there is no method for determining exact value of K, but an accurate estimate can be
obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is the mean distance
between data points and their cluster centroid. Since increasing the number of clusters will always reduce the
distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the
same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the
centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to
roughly determine K.
A number of other techniques exist for validating K, including cross-validation, information criteria, the information
theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of
data points across groups provides insight into how the algorithm is splitting the data for each K.

Figure 8.2: Elbow Method for choosing number of clusters
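
A hedged sketch of the elbow heuristic described above, using scikit-learn's inertia_ attribute (the within-cluster sum of squared distances) on synthetic blobs; the choice of three true clusters is an assumption made only for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# Fit K-means for a range of K and record the inertia
# (the sum of squared distances of samples to their closest centroid)
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, 'bo-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method')
plt.show()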

9.4. Scikit-learn & K-Means Clustering


SciKit-Learn's KMeans class implements all the expected methods, including fit() (computes the k-means clustering),
predict() (predicts the closest cluster each sample in X belongs to)
and score() (the opposite of the value of X on the K-means objective).

For more information on scikit learn’s implementation of KMeans clustering follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Example 8.1:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans

x = [1, 5, 1.5, 8, 1, 9]
y = [2, 8, 1.8, 8, 0.6, 11]

plt.scatter(x,y)
plt.show()
X = np.array([[1, 2],
[5, 8],
[1.5, 1.8],
[8, 8],
[1, 0.6],
[9, 11]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)
colors = ["g.","r.","c.","y."]

for i in range(len(X)):
print("coordinate:",X[i], "label:", labels[i])
plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize =
10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x",


s=150, linewidths = 5)

plt.show()
Output:
Figure 8.3: Scatter Plot

Figure 8.4: Cluster assignment for different data points

Example 8.2:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # samples_generator is deprecated; make_blobs lives in sklearn.datasets
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
centers = [[1,1],[5,5],[3,10]]
X, _ = make_blobs(n_samples = 500, centers = centers,
cluster_std = 1)
plt.scatter(X[:,0],X[:,1])
plt.show()
Kmeans = KMeans(n_clusters=3)
Kmeans.fit(X)
labels = Kmeans.labels_
cluster_centers = Kmeans.cluster_centers_
print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
colors = 10*['r.','g.','b.','c.','k.','y.','m.']
for i in range(len(X)):
plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize =
10)

plt.scatter(cluster_centers[:,0],cluster_centers[:,1],
marker="x",color='k', s=150, linewidths = 5,
zorder=10)

plt.show()
Output:

Figure 8.5: Scatter Plot

Figure 8.6: Cluster assignment

Example 8.3:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # samples_generator is deprecated; make_blobs lives in sklearn.datasets
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import style
style.use("ggplot")

centers = [[1,1,1],[5,5,5],[3,10,10]]

X, _ = make_blobs(n_samples = 500, centers = centers,


cluster_std = 1.5)

Kmeans = KMeans(n_clusters=3)
Kmeans.fit(X)
labels = Kmeans.labels_
cluster_centers = Kmeans.cluster_centers_

print(cluster_centers)

n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters:", n_clusters_)

colors = 10*['r','g','b','c','k','y','m']

print(colors)
print(labels)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # works on both older and newer matplotlib versions
for i in range(len(X)):

ax.scatter(X[i][0], X[i][1], X[i][2], c=colors[labels[i]],


marker='o')

ax.scatter(cluster_centers[:,0],cluster_centers[:,1],cluster_ce
nters[:,2], marker="x",color='k', s=150, linewidths = 5,
zorder=10)

plt.show()
Output:
Figure 8.7: Cluster assignment

Consider we have data about driving speeds: Driver_ID, Distance_Feature, Speeding_Feature.

Example 8.4:
# imports
import matplotlib.pyplot as plt, numpy as np
import pandas as pd
from sklearn.cluster import KMeans
df=pd.read_csv('C:/Users/Mehak/Desktop/DataScience/unchanged
data set/data_1024.csv', sep='\t')
df.head()
plt.rcParams['figure.figsize'] = (12, 6)
plt.figure()
plt.plot(df.Distance_Feature,df.Speeding_Feature,'ko')
plt.ylabel('Speeding Feature')
plt.xlabel('Distance Feature')
plt.ylim(0,100)
plt.show()

# For the purposes of this example, we store the feature data
# from our dataframe `df` in the `f1` and `f2` arrays. We
# combine these into a feature matrix `X` before passing it to
# the algorithm.
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X = df[['Distance_Feature','Speeding_Feature']].values.astype('float32')
kmeans = KMeans(n_clusters=2).fit(X)
# Plot the results
plt.figure()
h1,=plt.plot(f1[kmeans.labels_==0],f2[kmeans.labels_==0],'go')
plt.plot(np.mean(f1[kmeans.labels_==0]),np.mean(f2[kmeans.labels_==0]),'k*',markersize=20)
# print centroid 1
print(np.mean(f1[kmeans.labels_==0]),np.mean(f2[kmeans.labels_==0]))
h2,=plt.plot(f1[kmeans.labels_==1],f2[kmeans.labels_==1],'bo')
plt.plot(np.mean(f1[kmeans.labels_==1]),np.mean(f2[kmeans.labels_==1]),'k*',markersize=20)
# print centroid 2
print(np.mean(f1[kmeans.labels_==1]),np.mean(f2[kmeans.labels_==1]))
plt.ylabel('Speeding Feature')
plt.xlabel('Distance Feature')
plt.legend([h1,h2],['Group 1','Group 2'], loc='upper left')
plt.show()

Output:

Figure 8.8: Scatter Plot


Figure 8.9: Cluster assignment

9.5. Conclusion
Even though it works very well, K-Means clustering has its own issues. These include:
• If you run K-means on uniform data, you will still get clusters.
• It is sensitive to scale due to its reliance on Euclidean distance.
• Even on perfect data sets, it can get stuck in a local minimum.

Exercise
Question 1:
Repeat example 8.4 to make 4 clusters using the same dataset.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 09: To Practice and demonstrate the implementation of K-means Clustering algorithm
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics: Components and Levels of Achievement

1. Problem Understanding and Analysis
Proficient: Students have carefully read and understood the problem and give the best ideas to solve it.
Adequate: Student understands the problem and its analysis in a good way.
Just Acceptable: Student understands the problem and its analysis in a just acceptable way.
Unacceptable: Student does not understand the problem.

2. Code Originality
Proficient: Students work out and express their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program code.
Adequate: Adequate code originality.
Just Acceptable: Acceptable code originality.
Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
Proficient: The lab tasks are complete and accurate in the context of implementation.
Adequate: The lab is mostly complete and accurate in the context of its implementation.
Just Acceptable: The lab completion is just acceptable.
Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
Proficient: Appropriate data collected, correctly analysed and interpreted.
Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics

Mehran University of Engineering Year 4th Semester 8th
&Technology
Batch 18 CS Duration 03
Hours
Jamshoro

Practical 10
To Practice and Demonstrate the
Implementation Of Decision Trees
Algorithm.

10.1 Introduction
Tree based learning algorithms are considered to be among the best and most widely used supervised learning
methods. Tree based methods empower predictive models with high accuracy, stability and ease of
interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable to solving any
kind of problem at hand (classification or regression).

Figure 9.1: Decision Trees

Example:-
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6
ft). 15 out of these 30 play cricket in their leisure time. Now, we want to create a model to predict who will play cricket
during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on
the most significant input variable among the three.
This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the
variable which creates the most homogeneous sets of students (which are heterogeneous to each other). In the
snapshot below, you can see that the variable Gender is able to identify the most homogeneous sets compared to the other
two variables.

Figure 9.2: Decision Trees based on different splitting criteria

10.3 Types of Decision Trees


The type of decision tree is based on the type of target variable we have. It can be of two types:
Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a
categorical variable decision tree. Example: In the above scenario of the student problem, the target variable was
“Student will play cricket or not”, i.e. YES or NO.
Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a
continuous variable decision tree.
Example: Let’s say we have a problem to predict whether a customer will pay his renewal premium with an
insurance company (yes/ no). Here we know that income of customer is a significant variable but insurance
company does not have income details for all customers. Now, as we know this is an important variable, then we can
build a decision tree to predict customer income based on occupation, product and various other variables. In this
case, we are predicting values for continuous variable.
10.4 Important Terminology related to Decision Trees
Let’s look at the basic terminology used with Decision trees:
1. Root Node: It represents entire population or sample and this further gets
divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called
decision node.
4. Leaf/ Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, this process is
called pruning. You can say it is the opposite of splitting.
6. Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
7. Parent and Child Node: A node, which is divided into sub-nodes is called
parent node of sub-nodes whereas sub-nodes are the child of parent node.
10.5 Pros & Cons
Pros:
1. Easy to Understand
2. Useful in Data exploration
3. Less data cleaning required
4. Data type is not a constraint
5. Non Parametric Method
Cons:
1. Over fitting
2. Not fit for continuous variables

10.6 How does a tree decide where to split?


The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria is different for
classification and regression trees.
Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes. The creation of sub-nodes
increases the homogeneity of resultant sub-nodes. In other words, we can say that purity of the node increases with
respect to the target variable. Decision tree splits the nodes on all available variables and then selects the split which
results in most homogeneous sub-nodes.
The algorithm selection is also based on type of target variables. Let’s look at the one of the most commonly used
algorithms in decision tree:
10.6.1 Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the same class, and the
probability of this is 1 if the population is pure.
It works with a categorical target variable, “Success” or “Failure”.
It performs only binary splits.
The higher the value of Gini, the higher the homogeneity.
CART (Classification and Regression Tree) uses the Gini method to create binary splits.
Steps to Calculate Gini for a split
Calculate Gini for sub-nodes, using formula sum of square of probability for success and failure
(p^2+q^2).
Calculate Gini for split using weighted Gini score of each node of that split
Example: – Referring to example used above, where we want to segregate the students based on target variable
( playing cricket or not ). In the snapshot below, we split the population using two input variables Gender and Class.
Now, I want to identify which split is producing more homogeneous sub-nodes using Gini index.

Figure 9.3: Decision Trees based on different splitting criteria

Split on Gender:
Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68
Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55
Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Similar for Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51
Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51
Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51
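
The arithmetic above can be checked with a short helper. This is a hedged sketch that implements the "sum of squares of class probabilities" score used in this section (where higher means more homogeneous); it is not scikit-learn's internal splitting criterion:

def gini_score(p_success):
    # Gini score as used above for a binary node: p^2 + q^2
    q = 1 - p_success
    return p_success ** 2 + q ** 2

def weighted_gini(sizes, scores):
    # Weighted Gini of a split: each node's score weighted by its share of the samples
    total = sum(sizes)
    return sum(n / total * s for n, s in zip(sizes, scores))

# Split on Gender: Female node (10 students, 20% play), Male node (20 students, 65% play)
print(round(weighted_gini([10, 20], [gini_score(0.2), gini_score(0.65)]), 2))  # ~0.59

# Split on Class: Class IX (14 students, 43% play), Class X (16 students, 56% play)
print(round(weighted_gini([14, 16], [gini_score(0.43), gini_score(0.56)]), 2))  # ~0.51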
10.7 Are tree based models better than linear models?
“If I can use logistic regression for classification problems and linear regression for regression problems, why is
there a need to use trees”? Many of us have this question. And, this is a valid one too.
Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key
factors which will help you to decide which algorithm to use:
1. If the relationship between dependent & independent variable is well approximated by a
linear model, linear regression will outperform tree based model.
2. If there is a high non-linearity & complex relationship between dependent & independent
variables, a tree model will outperform a classical regression method.
3. If you need to build a model which is easy to explain to people, a decision tree model will
always do better than a linear model. Decision tree models are even simpler to interpret
than linear regression!

10.8 Scikit-learn & Decision Tree Classifier


SciKit-Learn's DecisionTreeClassifier class implements all the expected methods, including fit() (builds a
decision tree classifier from the training set (X, y)), predict() (predicts the class or regression value for X)
and score() (returns the mean accuracy on the given test data and labels).

For more information on scikit-learn's implementation of the Decision Tree classifier follow the link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Example 9.1
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score  # the old sklearn.cross_validation module was replaced by model_selection
X, y = make_classification(n_samples=100, n_features=2,
n_redundant=0, n_classes=2, n_clusters_per_class=1)
X1_min, X1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
X2_min, X2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow')


axes = plt.gca()
axes.set_xlabel('X1')
axes.set_ylabel('X2')
axes.set_xlim([X1_min, X1_max])
axes.set_ylim([X2_min, X2_max])
plt.show()
clf = DecisionTreeClassifier()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Decision Tree: {}'.format(score))

clf = LogisticRegression()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Logistic Regression: {}'.format(score))
#customisation
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.2, factor=0.5)
X1_min, X1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
X2_min, X2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow')


axes = plt.gca()
axes.set_xlabel('X1')
axes.set_ylabel('X2')
axes.set_xlim([X1_min, X1_max])
axes.set_ylim([X2_min, X2_max])
plt.show()
clf = DecisionTreeClassifier()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Decision Tree: {}'.format(score))

clf = LogisticRegression()
score = np.mean(cross_val_score(clf, X, y, cv=5))
print('Logistic Regression: {}'.format(score))
Output:
Figure 9.4: Decision Trees v/s Logistic Regression Accuracy

Figure 9.5: Decision Trees v/s Logistic Regression Customized Accuracy

Exercise

1. Let’s say you are wondering whether to quit your job or not. You have to consider some
important points and questions. Here is an example of a decision tree in this case.
Develop a code for this example tree.
2. Imagine you are an IT project manager and you need to decide whether to start a
particular project or not. You need to take into account important possible outcomes
and consequences. The decision tree examples, in this case, might look like the
diagram below. Develop a code for this example tree.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 10: To Practice and demonstrate the implementation of Decision Trees algorithm
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics: Components and Levels of Achievement

1. Problem Understanding and Analysis
Proficient: Students have carefully read and understood the problem and give the best ideas to solve it.
Adequate: Student understands the problem and its analysis in a good way.
Just Acceptable: Student understands the problem and its analysis in a just acceptable way.
Unacceptable: Student does not understand the problem.

2. Code Originality
Proficient: Students work out and express their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program code.
Adequate: Adequate code originality.
Just Acceptable: Acceptable code originality.
Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
Proficient: The lab tasks are complete and accurate in the context of implementation.
Adequate: The lab is mostly complete and accurate in the context of its implementation.
Just Acceptable: The lab completion is just acceptable.
Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
Proficient: Appropriate data collected, correctly analysed and interpreted.
Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Practical 11
Department of: Subject of:

Computer Systems Engineering Data Science & Analytics


Mehran University of Engineering Year 4th Semester 8th
&Technology
Batch 18 CS Duration 03
Hours
Jamshoro

To Practice and Demonstrate the


Implementation Of Random Forest
Algorithm

11.1 What Is Random Forest?

Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where you join different types of algorithms, or the same
algorithm multiple times, to form a more powerful prediction model. The random forest algorithm
combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of
trees, hence the name "Random Forest". The random forest algorithm can be used for both regression
and classification tasks.
11.2 How the Random Forest Algorithm Works?

The following are the basic steps involved in performing the random forest algorithm:
1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1
and 2.
4. In case of a regression problem, for a new record, each tree in the forest
predicts a value for Y (output). The final value can be calculated by taking the
average of all the values predicted by all the trees in forest. Or, in case of a
classification problem, each tree in the forest predicts the category to which
the new record belongs. Finally, the new record is assigned to the category
that wins the majority vote.
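
Since Example 10.1 below covers the regression case, the following is a minimal sketch of the voting-based classification case described in step 4, reusing the breast cancer dataset from the earlier labs (the choice of 20 trees is an illustrative assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 20 decision trees, each trained on a bootstrap sample; the forest aggregates the trees' votes
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

print('Accuracy on the training subset:', forest.score(X_train, y_train))
print('Accuracy on the test subset    :', forest.score(X_test, y_test))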
11.3 Advantages of using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. In the next
two sections we'll take a look at the pros and cons of using random forest for
classification and regression.
1. The random forest algorithm is not biased, since there are multiple trees
and each tree is trained on a subset of the data. Basically, the random forest
algorithm relies on the power of "the crowd"; therefore the overall bias
of the algorithm is reduced.
2. This algorithm is very stable. Even if a new data point is introduced in the
dataset the overall algorithm is not affected much since new data may impact
one tree, but it is very hard for it to impact all the trees.
3. The random forest algorithm works well when you have both categorical
and numerical features.
4. The random forest algorithm also works well when data has missing values
or it has not been scaled well (although we have performed feature scaling in
this article just for the purpose of demonstration).
11.4 Disadvantages of using Random Forest
1. A major disadvantage of random forests lies in their complexity. They
require many more computational resources, owing to the large number of
decision trees joined together.
2. Due to their complexity, they require much more time to train than other
comparable algorithms.
Scikit-Learn library can be used to implement the random forest algorithm to
solve regression, as well as classification, problems.
Example 10.1

#1.Import Libraries
import pandas as pd
import numpy as np
#2. Importing the dataset
# Dataset available at: https://drive.google.com/file/d/1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_/view
dataset = pd.read_csv('D:\Datasets\petrol_consumption.csv')
dataset.head()

#3. Preparing Data For Training
# Two tasks are performed in this section. The first task is to divide the data
# into 'attributes' and 'label' sets. The resultant data is then divided into
# training and test sets.
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

from sklearn import metrics


print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Output:
Mean Absolute Error: 51.765
Mean Squared Error: 4216.16675
Root Mean Squared Error: 64.932016371
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 11: To practice and demonstrate the implementation of Random Forest algorithm
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics: Components and Levels of Achievement

1. Problem Understanding and Analysis
Proficient: Students have carefully read and understood the problem and give the best ideas to solve it.
Adequate: Student understands the problem and its analysis in a good way.
Just Acceptable: Student understands the problem and its analysis in a just acceptable way.
Unacceptable: Student does not understand the problem.

2. Code Originality
Proficient: Students work out and express their own solutions and translate the logic from the flowchart or pseudocode to a programming language by writing their own program code.
Adequate: Adequate code originality.
Just Acceptable: Acceptable code originality.
Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
Proficient: The lab tasks are complete and accurate in the context of implementation.
Adequate: The lab is mostly complete and accurate in the context of its implementation.
Just Acceptable: The lab completion is just acceptable.
Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
Proficient: Appropriate data collected, correctly analysed and interpreted.
Adequate: Appropriate data collected, but insufficient analysis; results adequately interpreted.
Just Acceptable: Inappropriate data collected, but sufficient analysis; results inadequately interpreted.
Unacceptable: Inappropriate data collected; no understanding of analysis; results incorrectly interpreted.
Teacher: Date:

Practical 12

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics


Mehran University of Engineering Year 4th Semester 8th
&Technology
Batch 18 CS Duration 03
Hours
Jamshoro

To Execute Anomaly Detection


Techniques in Machine Learning Models.

12.1 Introduction:

Anomaly detection is a technique used to identify unusual patterns that do not conform
to expected behavior, called outliers. It has many applications in business, from
intrusion detection (identifying strange patterns in network traffic that could signal a
hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and
from fraud detection in credit card transactions to fault detection in operating
environments.
12.2 Categories of Anomalies:

Anomalies can be broadly categorized as:


1. Point anomalies: A single instance of data is anomalous if it’s too far off
from the rest. Business use case: Detecting credit card fraud based on
“amount spent.”
2. Contextual anomalies: The abnormality is context-specific. This type of
anomaly is common in time-series data. Business use case: spending $100 on
food every day during the holiday season is normal but may be odd otherwise.
3. Collective anomalies: A set of data instances collectively helps in
detecting anomalies. Business use case: someone unexpectedly copying data from a
remote machine to a local host, an anomaly that would be flagged as a
potential cyber-attack.
12.3 Anomaly detection techniques in Machine Learning based approaches:

Below is a brief overview of popular machine learning-based techniques for anomaly


detection.
12.3.1 Density-Based Anomaly Detection

Density-based anomaly detection is based on the k-nearest neighbors algorithm.
Assumption: Normal data points occur around a dense neighborhood and abnormalities are far away.
The nearest set of data points is evaluated using a score, which could be the Euclidean distance or a similar measure depending on the type of data (categorical or numerical). Density-based methods can be broadly classified into two algorithms:
1. K-nearest neighbor: k-NN is a simple, non-parametric lazy learning
technique used to classify data based on similarities in distance metrics such
as the Euclidean, Manhattan, Minkowski, or Hamming distance.
2. Relative density of data: This is better known as the local outlier factor (LOF).
The concept is based on a distance metric called the reachability distance; a brief
scikit-learn sketch is shown below.
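As a minimal illustration of the relative-density idea, the following sketch uses scikit-learn's LocalOutlierFactor on synthetic 2-D data. The dataset, parameter values, and contamination level are illustrative assumptions, not part of the original lab code.

# Hedged sketch: Local Outlier Factor on synthetic 2-D data (illustrative only)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense cluster of "normal" points
X_anomalies = rng.uniform(low=-6, high=6, size=(10, 2))    # scattered low-density points
X = np.vstack([X_normal, X_anomalies])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)          # -1 = outlier, 1 = inlier
print("Detected outliers:", (labels == -1).sum())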

12.3.2 Clustering-Based Anomaly Detection

Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as
determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It creates ‘k’ similar clusters of data
points. Data instances that fall outside of these groups could potentially be marked as
anomalies.
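One simple way to turn K-means into an anomaly detector is to score each point by its distance to the nearest cluster centre and flag the farthest points. The sketch below is an assumption on how this could be done with scikit-learn; the data, cluster count, and percentile threshold are illustrative choices, not prescribed by the lab.

# Hedged sketch: distance-to-centroid anomaly scoring with K-means (illustrative only)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(6, 1, size=(150, 2)),
               rng.uniform(-10, 15, size=(6, 2))])   # a few scattered anomalies

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
distances = np.min(kmeans.transform(X), axis=1)      # distance to the nearest centroid
threshold = np.percentile(distances, 98)             # top 2% of distances treated as anomalies
anomalies = X[distances > threshold]
print("Flagged", len(anomalies), "potential anomalies")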
12.3.3 Support Vector Machine-Based Anomaly Detection

A support vector machine (SVM) is another effective technique for detecting anomalies. An SVM
is typically associated with supervised learning, but there are extensions
(OneClassSVM, for instance) that can be used to identify anomalies as an unsupervised
problem (in which the training data are not labeled). The algorithm learns a soft boundary
in order to cluster the normal data instances using the training set and then, using the
test instances, identifies the abnormalities that fall outside the learned region.
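A minimal One-Class SVM sketch using scikit-learn's OneClassSVM is shown below; the synthetic data and the kernel/nu parameter values are illustrative assumptions, not part of the original lab code.

# Hedged sketch: One-Class SVM anomaly detection (illustrative only)
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(2)
X_train = rng.normal(0, 1, size=(300, 2))             # "normal" training data
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),   # normal test points
                    rng.uniform(-6, 6, size=(5, 2))]) # a few abnormal points

ocsvm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05).fit(X_train)
pred = ocsvm.predict(X_test)                          # +1 = inlier, -1 = outlier
print("Predicted labels:", pred)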
Depending on the use case, the output of an anomaly detector could be numeric scalar
values for filtering on domain-specific thresholds or textual labels (such as binary/multi
labels).

Step 1: Importing the required libraries

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

Step 2: Creating the synthetic data

# generating a random dataset with two features
X_train, y_train = generate_data(n_train = 300, train_only = True,
n_features = 2)

# Setting the percentage of outliers


outlier_fraction = 0.1

# Storing the outliers and inliners in different numpy arrays


X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
n_inliers = len(X_inliers)
n_outliers = len(X_outliers)

# Separating the two features


f1 = X_train[:, [0]].reshape(-1, 1)
f2 = X_train[:, [1]].reshape(-1, 1)

Step 3: Visualizing the data

# Visualising the dataset
# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200),
                     np.linspace(-10, 10, 200))

# scatter plot
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
Step 4: Training and evaluating the model

# Training the classifier
clf = KNN(contamination=outlier_fraction)
clf.fit(X_train, y_train)

# You can print this to see all the prediction scores
scores_pred = clf.decision_function(X_train) * -1

y_pred = clf.predict(X_train)

# Counting the number of errors
n_errors = (y_pred != y_train).sum()
print('The number of prediction errors are ' + str(n_errors))

Step 5: Visualizing the predictions

# threshold value to consider a
# datapoint inlier or outlier
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

# decision function calculates the raw
# anomaly score for every point
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

# fill blue colormap from minimum anomaly
# score to threshold value
subplot = plt.subplot(1, 2, 1)
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                 cmap=plt.cm.Blues_r)

# draw red contour line where anomaly
# score is equal to threshold
a = subplot.contour(xx, yy, Z, levels=[threshold],
                    linewidths=2, colors='red')

# fill orange contour lines where range of anomaly
# score is from threshold to maximum anomaly score
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

# scatter plot of inliers with white dots
b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                    c='white', s=20, edgecolor='k')

# scatter plot of outliers with black dots
c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                    c='black', s=20, edgecolor='k')
subplot.axis('tight')

subplot.legend(
    [a.collections[0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop=matplotlib.font_manager.FontProperties(size=10),
    loc='lower right')

subplot.set_title('K-Nearest Neighbours')
subplot.set_xlim((-10, 10))
subplot.set_ylim((-10, 10))
plt.show()
Exercise

1. Repeat the example given above.
2. Repeat the example given above using different types of plots.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 12: To execute anomaly detection techniques in Machine learning Models


Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: The student has carefully read and understood the problem and proposes the best idea to solve it.
   Adequate: The student understands the problem and analyses it in a good way.
   Just Acceptable: The student understands the problem and analyses it in a just-acceptable way.
   Unacceptable: The student does not understand the problem.

2. Code Originality
   Proficient: The student works out and expresses an original solution and translates the logic from the flowchart or pseudocode into a programming language by writing the program code independently.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab tasks are complete and accurate in the context of the implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected but insufficient; analysis and results adequately interpreted.
   Just Acceptable: Inappropriate data collected but sufficient; analysis and results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis and results; results incorrectly interpreted.
Teacher: Date:

Practical 13

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics


Mehran University of Engineering & Technology, Jamshoro
Year: 4th      Semester: 8th      Batch: 18 CS      Duration: 03 Hours

To Explore the Performance Measuring Variables and Optimize Hyper-Parameters.
13.1 Introduction:

A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known.
After acquiring the data and then cleaning, pre-processing and wrangling it, the next step
is to feed it to a model and obtain output, often in the form of probabilities. The question
then is how to measure the effectiveness of the model: the better the effectiveness, the
better the performance, and that is exactly what we want. This is where the confusion
matrix comes in. The confusion matrix is a performance measurement for machine learning
classification.
13.2 Example of confusion matrix
An example confusion matrix for a binary classifier is given below (though it can easily
be extended to the case of more than two classes). It is laid out here from the counts
quoted in the remainder of this section (165 cases in total):

                   Predicted: NO    Predicted: YES    Total
    Actual: NO         TN = 50          FP = 10          60
    Actual: YES        FN = 5           TP = 100        105
    Total                   55              110         165
13.2.1 What can we learn from this matrix?
 There are two possible predicted classes: "yes" and "no". If we were predicting the
presence of a disease, for example, "yes" would mean they have the disease, and
"no" would mean they don't have the disease.
 The classifier made a total of 165 predictions (e.g., 165 patients were being tested
for the presence of that disease).
 Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
 In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let's now define the most basic terms, which are whole numbers (not rates):

 true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
 true negatives (TN): We predicted no, and they don't have the disease.
 false positives (FP): We predicted yes, but they don't actually have the disease.
(Also known as a "Type I error.")
 false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")

This is a list of rates that are often computed from a confusion matrix for a binary
classifier:

 Accuracy: Overall, how often is the classifier correct?


o (TP+TN)/total = (100+50)/165 = 0.91
 Misclassification Rate: Overall, how often is it wrong?
o (FP+FN)/total = (10+5)/165 = 0.09
o equivalent to 1 minus Accuracy
o also known as "Error Rate"
 True Positive Rate: When it's actually yes, how often does it predict yes?
o TP/actual yes = 100/105 = 0.95
o also known as "Sensitivity" or "Recall"
 False Positive Rate: When it's actually no, how often does it predict yes?
o FP/actual no = 10/60 = 0.17
 True Negative Rate: When it's actually no, how often does it predict no?
o TN/actual no = 50/60 = 0.83
o equivalent to 1 minus False Positive Rate
o also known as "Specificity"
 Precision: When it predicts yes, how often is it correct?
o TP/predicted yes = 100/110 = 0.91
 Prevalence: How often does the yes condition actually occur in our sample?
o actual yes/total = 105/165 = 0.64

Example:
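The original worked example is not reproduced in this copy; the following is a minimal scikit-learn sketch. The label vectors are constructed here purely to reproduce the counts used above (TN = 50, FP = 10, FN = 5, TP = 100); they are an assumption for illustration, not a real dataset.

# Hedged sketch: confusion matrix and derived rates with scikit-learn
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = np.array([0] * 60 + [1] * 105)                       # 60 actual "no", 105 actual "yes"
y_pred = np.array([0] * 50 + [1] * 10 + [0] * 5 + [1] * 100)  # predictions aligned with y_true

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)                      # 50 10 5 100
print("Accuracy :", accuracy_score(y_true, y_pred))           # ~0.91
print("Precision:", precision_score(y_true, y_pred))          # ~0.91
print("Recall   :", recall_score(y_true, y_pred))             # ~0.95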
Exercise:
1. Develop a mini project using an image dataset; train a model on the dataset so that
it is able to classify each prediction on a given image as a TP, TN, FP, or FN.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 13: To explore the performance measuring variables and optimize hyper-parameters
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: The student has carefully read and understood the problem and proposes the best idea to solve it.
   Adequate: The student understands the problem and analyses it in a good way.
   Just Acceptable: The student understands the problem and analyses it in a just-acceptable way.
   Unacceptable: The student does not understand the problem.

2. Code Originality
   Proficient: The student works out and expresses an original solution and translates the logic from the flowchart or pseudocode into a programming language by writing the program code independently.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab tasks are complete and accurate in the context of the implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected but insufficient; analysis and results adequately interpreted.
   Just Acceptable: Inappropriate data collected but sufficient; analysis and results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis and results; results incorrectly interpreted.
Teacher: Date:

Department of: Subject of:

Computer Systems Engineering Data Science & Analytics


Mehran University of Engineering & Technology, Jamshoro
Year: 4th      Semester: 8th      Batch: 18 CS      Duration: 03 Hours

Practical 14
To Develop the LSTM Models for Time
Series Forecasting.
Outline:
 How to develop LSTM models for univariate time series forecasting.
 How to develop LSTM models for multivariate time series forecasting.
 How to develop LSTM models for multi-step time series forecasting.
Required Tools:
 PC with Windows
 Python 3.6.2
 Anaconda 3-5.0.1
14.1 Introduction to Time Series Forecasting
Time-series forecasting refers to the use of a machine learning model to predict future values based on
previously observed values. Time series data captures a series of data points recorded at (usually) regular
intervals. Though this definition might somewhat remind you of regression models, time-series
forecasting is applied to forecast data that are ordered by time, for example, stock prices by year.
Figure 1 Example of data ordered by date on which time-series forecasting can be applied.
Time-series forecasting is one of the most used applications of Deep Learning in the modern world.
Quantitative analysts use it to predict the value of stocks, business professionals use it to forecast their
sales, and government agencies use it to forecast resource consumption (energy, water, etc.).
Time series prediction problems are a difficult type of predictive modeling problem. Unlike regression
predictive modeling, time series also adds the complexity of a sequence dependence among the input
variables.

Many classical methods (e.g. ARIMA) try to deal with time series data with varying success (which is not to say
they are bad at it). In the last couple of years, Long Short-Term Memory (LSTM) models have
become a very useful method when dealing with this type of data.
A powerful type of neural network designed to handle sequence dependence is the recurrent neural
network (RNN). RNNs, of which LSTMs are one type, are very good at processing sequences of data:
they can "recall" patterns that lie far in the past. The Long Short-Term Memory network, or LSTM,
is a type of recurrent neural network used in deep learning because very large architectures can be
trained successfully.
14.2 Introduction to Long Short-Term Memory (LSTM) Models
We will explore how to develop a suite of different types of LSTM models for time series forecasting.
The models are demonstrated on small contrived time series problems intended to give the flavor of the
type of time series problem being addressed. The chosen configuration of the models is arbitrary and not
optimized for each problem.
There are two types of LSTM Models
1. Univariate LSTM Models
1. Data Preparation
2. Vanilla LSTM
3. Stacked LSTM
4. Bidirectional LSTM
2. Multivariate LSTM Models
1. Multiple Input Series.
2. Multiple Parallel Series.
14.2.1 Univariate LSTM Models
LSTMs can be used to model univariate time series forecasting problems.
These are problems comprised of a single series of observations and a model is required to learn from the
series of past observations to predict the next value in the sequence.
We will demonstrate a number of variations of the LSTM model for univariate time series forecasting.
This section is divided into four parts; they are:
1. Data Preparation
2. Vanilla LSTM
3. Stacked LSTM
4. Bidirectional LSTM
Each of these models is demonstrated for one-step univariate time series forecasting but can easily be
adapted and used as the input part of a model for other types of time series forecasting problems.
14.2.1.1. Data Preparation
Before a univariate series can be modelled, it must be prepared.
The LSTM model will learn a function that maps a sequence of past observations as input to an output
observation. As such, the sequence of observations must be transformed into multiple examples from
which the LSTM can learn.
Consider a given univariate sequence:
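The original sequence is not preserved in this copy; a contrived series of this kind (assumed here, and used in the sketches that follow) is:

# assumed contrived univariate sequence
[10, 20, 30, 40, 50, 60, 70, 80, 90]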

We can divide the sequence into multiple input/output patterns called samples, where three time steps are
used as input and one time step is used as output for the one-step prediction that is being learned.

The split_sequence() function below implements this behaviour and will split a given univariate
sequence into multiple samples where each sample has a specified number of time steps and the output is
a single time step.
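A sketch of such a split_sequence() helper is given below. It is consistent with the description above, but the exact original listing is not preserved in this copy, so treat the details as a reconstruction.

# Hedged sketch: split a univariate sequence into supervised-learning samples
from numpy import array

def split_sequence(sequence, n_steps):
    """Split a univariate sequence into samples of n_steps inputs and 1 output."""
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps                  # end of this input window
        if end_ix > len(sequence) - 1:        # stop when the output would run past the series
            break
        X.append(sequence[i:end_ix])
        y.append(sequence[end_ix])
    return array(X), array(y)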

We can demonstrate this function on our small contrived dataset above.


EXAMPLE 14.1
The complete example is listed below:
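A runnable version of the complete example, reconstructed as a sketch (the contrived nine-value sequence is an assumption consistent with the six-sample output described below):

# Hedged sketch: univariate data preparation, complete example
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps
        if end_ix > len(sequence) - 1:
            break
        X.append(sequence[i:end_ix])
        y.append(sequence[end_ix])
    return array(X), array(y)

# define the input sequence and choose three time steps per sample
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n_steps = 3

X, y = split_sequence(raw_seq, n_steps)
for i in range(len(X)):
    print(X[i], y[i])

# Expected output (six samples):
# [10 20 30] 40
# [20 30 40] 50
# [30 40 50] 60
# [40 50 60] 70
# [50 60 70] 80
# [60 70 80] 90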
Running the example splits the univariate series into six samples where each sample has three input time
steps and one output time step.

Now that we know how to prepare a univariate series for modeling, let us look at developing LSTM
models that can learn the mapping of inputs to outputs, starting with a Vanilla LSTM.
14.2.1.2. Vanilla LSTM
A Vanilla LSTM is an LSTM model that has a single hidden layer of LSTM units, and an output layer
used to make a prediction.

We can define a Vanilla LSTM for univariate time series forecasting as follows.
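A sketch of the model definition using the standalone Keras API is given below (swap in tensorflow.keras if that is what your environment provides). The sizes follow the text of this section: 50 LSTM units and a single output neuron; the listing itself is a reconstruction, not the original code.

# Hedged sketch: Vanilla LSTM definition (Keras)
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_steps, n_features = 3, 1
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')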
Key in the definition is the shape of the input; that is what the model expects as input for each sample in
terms of the number of time steps and the number of features.
We are working with a univariate series, so the number of features is one, for one variable.
The number of time steps as input is the number we chose when preparing our dataset as an argument to
the split_sequence() function.
The shape of the input for each sample is specified in the input_shape argument on the definition of first
hidden layer.

We almost always have multiple samples; therefore, the model will expect the input component of
the training data to have the shape [samples, timesteps, features].
Our split_sequence() function in the previous section outputs the X with the shape [samples, timesteps],
so we easily reshape it to have an additional dimension for the one feature.
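A sketch of that reshape step, assuming X came from split_sequence() with shape [samples, timesteps]:

# reshape from [samples, timesteps] to [samples, timesteps, features]
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))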

In this case, we define a model with 50 LSTM units in the hidden layer and an output layer that predicts a
single numerical value.
The model is fit using the efficient Adam version of stochastic gradient descent and optimized using the
mean squared error, or ‘mse‘ loss function. Once the model is defined, we can fit it on the training
dataset.

EXAMPLE 14.2
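A reconstructed sketch of the complete Vanilla LSTM example is given below. The number of training epochs and the prediction input are assumptions chosen to match the contrived series; the original listing is not preserved in this copy.

# Hedged sketch: univariate one-step forecast with a Vanilla LSTM
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps
        if end_ix > len(sequence) - 1:
            break
        X.append(sequence[i:end_ix])
        y.append(sequence[end_ix])
    return array(X), array(y)

# prepare the data
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n_steps, n_features = 3, 1
X, y = split_sequence(raw_seq, n_steps)
X = X.reshape((X.shape[0], X.shape[1], n_features))

# define and fit the model
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# demonstrate a one-step prediction
x_input = array([70, 80, 90]).reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat)   # should be close to 100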
14.2.1.3. Stacked LSTM
Multiple hidden LSTM layers can be stacked one on top of another in what is referred to as a Stacked
LSTM model.
An LSTM layer requires a three-dimensional input and LSTMs by default will produce a two-
dimensional output as an interpretation from the end of the sequence.
We can address this by having the LSTM output a value for each time step in the input data by setting
the return_sequences=True argument on the layer. This allows us to have 3D output from hidden LSTM
layer as input to the next.

We can therefore define a Stacked LSTM as follows.
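A sketch of a two-layer Stacked LSTM definition (layer sizes assumed, following the Vanilla example; not the original listing):

# Hedged sketch: Stacked LSTM definition (Keras)
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_steps, n_features = 3, 1
model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True,
               input_shape=(n_steps, n_features)))   # 3D output feeds the next LSTM layer
model.add(LSTM(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')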
14.2.1.4. Bidirectional LSTM
On some sequence prediction problems, it can be beneficial to allow the LSTM model to learn the input
sequence both forward and backwards and concatenate both interpretations.
This is called a Bidirectional LSTM.
We can implement a Bidirectional LSTM for univariate time series forecasting by wrapping the first
hidden layer in a wrapper layer called Bidirectional.
An example of
defining a Bidirectional LSTM to read input both forward and backward is as follows.
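A sketch of the Bidirectional wrapper (sizes assumed as before; a reconstruction, not the original listing):

# Hedged sketch: Bidirectional LSTM definition (Keras)
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional

n_steps, n_features = 3, 1
model = Sequential()
model.add(Bidirectional(LSTM(50, activation='relu'),
                        input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')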

14.2.2 Multivariate LSTM Models


Multivariate time series data means data where there is more than one observation for each time step.
There are two main models that we may require with multivariate time series data; they are:
1. Multiple Input Series.
2. Multiple Parallel Series.
Let us take a look at each in turn.
14.2.2.1 Multiple Input Series
A problem may have two or more parallel input time series and an output time series that is dependent on
the input time series.
The input time series are parallel because each series has an observation at the same time steps.

We can demonstrate
this with a simple example of two parallel input time series where the output series is the simple addition
of the input series.
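A sketch of such parallel series is shown below; the particular numbers are assumptions chosen so that the output series is the sum of the two inputs, as the text describes.

# two parallel input series and an output series that is their sum (sketch)
from numpy import array

in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = array([in_seq1[i] + in_seq2[i] for i in range(len(in_seq1))])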

We can reshape
these three arrays of data as a single dataset where each row is a time step, and each column is a separate
time series. This is a standard way of storing parallel time series in a CSV file.
We can define a function named split_sequences() that will take a dataset as we have defined it with rows
for time steps and columns for parallel series and return input/output samples.
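A sketch of stacking the series column-wise and of a split_sequences() helper for this multiple-input case follows. It is reconstructed from the description above, so treat the details as assumptions.

# stack the series column-wise: rows = time steps, columns = parallel series (sketch)
from numpy import array, hstack

in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
dataset = hstack((in_seq1, in_seq2, out_seq))

def split_sequences(sequences, n_steps):
    """Split a multivariate dataset into input windows and the matching output value."""
    X, y = list(), list()
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        seq_x = sequences[i:end_ix, :-1]       # all columns except the last are inputs
        seq_y = sequences[end_ix - 1, -1]      # last column at the end of the window is the output
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)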

We are now ready to fit an LSTM model on this data. Any of the varieties of LSTMs in the previous
section can be used, such as a Vanilla, Stacked, or Bidirectional model.
We will use a Vanilla LSTM where the number of time steps and parallel series (features) are specified
for the input layer via the input_shape argument.

EXAMPLE 14.3
The complete example is listed below.
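A reconstructed sketch of the complete multiple-input example is given below; the training epochs and the prediction input are assumptions, and the original listing is not preserved in this copy.

# Hedged sketch: multiple-input-series one-step forecast with a Vanilla LSTM
from numpy import array, hstack
from keras.models import Sequential
from keras.layers import LSTM, Dense

def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        X.append(sequences[i:end_ix, :-1])
        y.append(sequences[end_ix - 1, -1])
    return array(X), array(y)

# build the dataset: two input series plus their sum as the output series
in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90]).reshape((9, 1))
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95]).reshape((9, 1))
out_seq = in_seq1 + in_seq2
dataset = hstack((in_seq1, in_seq2, out_seq))

n_steps = 3
X, y = split_sequences(dataset, n_steps)
n_features = X.shape[2]                      # here: 2 parallel input series

# define and fit a Vanilla LSTM
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# one-step prediction from the last observed window
x_input = array([[70, 75], [80, 85], [90, 95]]).reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat)   # should be close to 185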
EXERCISE:
1. Repeat Example 2 and 3 by using Stacked and Bidirectional LSTM models.
2. The metrics that you choose to evaluate your machine learning algorithms are very important,
choice of metrics influences how the performance of machine learning algorithms is measured and
compared. They influence how you weight the importance of different characteristics in the results
and your ultimate choice of which algorithm to choose. In this practical we are going discover how
to select and use different machine learning performance metrics in Python with scikit-learn.
Lab Rubrics
Course code: CS-454 Course Name: Data science and Analytics

Lab 14: To develop the LSTM Models for Time series Forecasting
Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics and Levels of Achievement

1. Problem Understanding and Analysis
   Proficient: The student has carefully read and understood the problem and proposes the best idea to solve it.
   Adequate: The student understands the problem and analyses it in a good way.
   Just Acceptable: The student understands the problem and analyses it in a just-acceptable way.
   Unacceptable: The student does not understand the problem.

2. Code Originality
   Proficient: The student works out and expresses an original solution and translates the logic from the flowchart or pseudocode into a programming language by writing the program code independently.
   Adequate: Adequate code originality.
   Just Acceptable: Acceptable code originality.
   Unacceptable: Code originality is not acceptable.

3. Completeness and Accuracy
   Proficient: The lab tasks are complete and accurate in the context of the implementation.
   Adequate: The lab is mostly complete and accurate in the context of its implementation.
   Just Acceptable: The lab completion is just acceptable.
   Unacceptable: The lab in its current state is not implementable.

4. Analysis and Results
   Proficient: Appropriate data collected; analysis and results correctly interpreted.
   Adequate: Appropriate data collected but insufficient; analysis and results adequately interpreted.
   Just Acceptable: Inappropriate data collected but sufficient; analysis and results inadequately interpreted.
   Unacceptable: Inappropriate data collected; no understanding of analysis and results; results incorrectly interpreted.
Teacher: Date:

OPEN ENDED LAB -1

Objective: Write a program to predict and forecast the climate or weather of a region.
Apply appropriate data science approaches to clean the data. Use prediction models
(K-Nearest Neighbors, Artificial Neural Networks, Long Short-Term Memory, or any other
method discussed in the labs). Evaluate and present the performance metrics.
Course code: CS-454 Course Name: Data science and Analytics

Lab 15: Open-ended lab


Roll No: Student Name:

Batch: Section: Group: Date:

Rubrics and Levels of Achievement

1. Analyzing / Understanding the Problem
   Proficient: The student has engaged with the whole problem and made an effort to understand the problem, the dataset and the desired outcome.
   Adequate: The student has tried and has understood the problem.
   Just Acceptable: An acceptable level of problem analysis has been attempted.
   Unacceptable: The student was not able to understand the actual problem.

2. Methodology Selection
   Proficient: The student has selected the best available methods to achieve the desired outcome, well suited to the dataset.
   Adequate: The student has used methods through which some accurate results can be achieved.
   Just Acceptable: The student has used methods that are just acceptable and produce only a few outcomes.
   Unacceptable: The student has used wrong methods that are not useful for achieving the outcomes.

3. Completeness / Training and Testing
   Proficient: The student has trained the model on the specified data, divided the dataset into training and test sets, and verified performance on the test set, producing 100% accuracy.
   Adequate: The student has trained the model using an 80% training split of the dataset and verified performance on the test set.
   Just Acceptable: The student has trained and tested the model with around 50% accuracy.
   Unacceptable: The student did not train or test the model.

4. Performance Metrics
   Proficient: The student has shown classification/regression performance metrics along with graphs and explanation.
   Adequate: The student has shown classification/regression performance metric results.
   Just Acceptable: The student has shown only a few performance metrics.
   Unacceptable: The student has not reported any performance metrics.

5. Presentation / Demonstration
   Proficient: The student has presented the problem, solution and results with an excellent demonstration.
   Adequate: The student has presented the problem, solution and results with a good demonstration.
   Just Acceptable: The student has presented the problem, solution and results with a just-acceptable demonstration.
   Unacceptable: The student was not able to present anything.

Teacher: Date:
