
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group/learning community. If you are not the addressee, you should not disseminate, distribute or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
24AM201
INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
(Lab Integrated)

Department: AIML/CSBS/CSE/ECE/IT

Batch/Year: BATCH 2024-2028/I

Created by:

Dr. Hemalatha M, Professor, CSE, RMDEC

Dr. Srijayanthi S, Assoc. Professor, ADS, RMKEC

Dr. Josephin Shermila P, Assoc. Professor, ADS, RMKCET

Date: 17.02.2025
Table of Contents

1. Contents
2. Course Objectives
3. Pre Requisites (Course Name with Code)
4. Syllabus (with Subject Code, Name, LTPC details)
5. Course Outcomes
6. CO-PO/PSO Mapping
7. Lecture Plan (Sl. No., Topic, No. of Periods, Proposed Date, Actual Lecture Date, Pertaining CO, Taxonomy Level, Mode of Delivery)
8. Activity Based Learning
9. Lecture Notes (with links to videos, e-book references, PPTs, quizzes and other learning materials)
10. Assignments (for higher-level learning and evaluation; examples: case study, comprehensive design, etc.)
11. Part A Q & A (with K level and CO)
12. Part B Qs (with K level and CO)
13. Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)
14. Real-Time Applications in Day-to-Day Life and in Industry
15. Contents Beyond the Syllabus (COE-related value-added courses)
16. Assessment Schedule (Proposed Date & Actual Date)
17. Prescribed Text Books & Reference Books
18. Mini Project


Course Objectives

The course will enable learners to:
• Understand the basics and applications of Artificial Intelligence.
• Apply the basics of Python programming.
• Use Python libraries to solve simple problems.
• Understand the different types of Machine Learning algorithms.
• Solve real-world problems using AI/ML.
• Explore the various applications in the field of Artificial Intelligence and Machine Learning.
PRE REQUISITES

• Mathematical skills
• Basic programming knowledge
• Strong analytical skills
• Basic knowledge of statistics and modelling
• Ability to understand complex algorithms


Syllabus

Unit I ARTIFICIAL INTELLIGENCE 6+6

Introduction – Types of AI – ANI, AGI, ASI – Narrow, General, Super AI, Examples – AI problems – Production Systems – State Space Representation – Applications of AI in various industries.

List of Exercises:
1. Build a simple AI model using Python.

Unit II BASICS OF PYTHON 6+6

Introduction to Python programming – Arithmetic operators – Values and types – Variables, expressions, statements – Functions – Conditionals and recursion – Iteration. Lists: sequence, mutable, traversing, operations, list slices, list methods – Tuples: immutable, tuple assignment, tuples as return values, comparing and sorting.

List of Exercises:
1. Compute the GCD of two numbers.
2. Operations on tuples: a) finding repeated elements, b) slicing a tuple, c) reversing a tuple, d) replacing the last value of a tuple.
Unit III PYTHON LIBRARIES 6+6

Introduction to Numpy - Multidimensional Ndarrays – Indexing – Properties – Constants – Data Visualization: Ndarray Creation – Matplotlib - Introduction to Pandas – Series – Dataframes – Visualizing the Data in Dataframes - Pandas Objects – Data Indexing and Selection – Handling missing data – Hierarchical indexing – Combining datasets – Aggregation and Grouping – Joins - Pivot Tables - String operations – Working with time series – High performance Pandas.

List of Exercises:
1. Download, install and explore the features of R/Python for data analytics: installing Anaconda, basic operations in Jupyter Notebook, basic data handling.
2. Working with Numpy arrays: creation of a numpy array from a tuple; determining the size, shape and dimension of the array; manipulating array attributes; creating a sub-array; reshaping the array along the row vector and column vector; creating two arrays and performing concatenation between them.
3. Working with Pandas data frames: Series, DataFrame and Index; implementing the data selection operations; data indexing operations like loc, iloc and ix; handling missing data like None and NaN; manipulating null values (isnull(), notnull(), dropna(), fillna()).
4. Perform the statistics operations on the data (sum, product, median, minimum and maximum, quantiles, argmin, argmax, etc.).
5. Using any data set, compute the mean, standard deviation and percentiles.
Unit IV MACHINE LEARNING 6+6
Introduction – ML Algorithms Overview – Types – Supervised – Unsupervised –
Reinforcement Learning – Introduction to Neural Networks – Working of Deep
Learning – Applications of DL – Ethical consideration in AI and ML.
List of Exercises:
1. Apply any Machine Learning model to predict the sales in a store.

Unit V CASE STUDIES 6+6


Disease Prediction – Share Price Forecasting – Weather Prediction – Domain Specific
Case Studies.
List of Domain Specific Case Studies:
• For CSE & allied: Sentiment analysis of product reviews using machine learning.
• For ECE & allied: Smart homes using AI.
• For EEE: Forecasting of Renewable energy availability during a specified period
using AI.
• Civil: Application of ML for crack detection on concrete structures.
• Mech: Predictive Maintenance for CNC Machines Using AI and Machine Learning.
List of Exercises:
1. Build a machine learning model to solve any real-world problem from your domain.
Course Outcomes

Course Outcome | Description | Knowledge Level
CO1 | Elaborate the basics and applications of Artificial Intelligence. | K2
CO2 | Apply the basics of Python programming to solve problems. | K3
CO3 | Use Python libraries to solve simple ML problems. | K3
CO4 | Outline the different types of Machine Learning algorithms. | K2
CO5 | Use Machine Learning algorithms to solve real-world problems. | K3
CO6 | Outline the recent developments in the field of Artificial Intelligence. | K2

Knowledge Level | Description
K6 | Evaluation
K5 | Synthesis
K4 | Analysis
K3 | Application
K2 | Comprehension
K1 | Knowledge
CO – PO/PSO Mapping Matrix

CO    PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1    3   3   3   2   2   3   3   3   3   2    2    2    3    2    2
CO2    3   3   3   2   2   3   3   3   3   2    2    2    3    2    2
CO3    3   3   3   2   2   3   3   3   3   2    2    2    3    3    2
CO4    3   3   3   2   2   3   3   3   3   2    2    2    3    3    2
CO5    3   3   3   2   2   3   3   3   3   2    2    2    3    3    3
CO6    3   3   3   2   2   3   3   3   3   2    3    3    3    3    3
UNIT – III

Lecture Plan – Unit III
Sl. No. | Topic | No. of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Introduction to Numpy | 1 | 1.03.2025 | 1.03.2025 | CO3 | K1 | PPT
2 | Data Visualization | 1 | 5.03.2025 | 5.03.2025 | CO3 | K1 | PPT
3 | Pandas Objects | 1 | 7.03.2025 | 7.03.2025 | CO3 | K1 | PPT
4 | Combining datasets | 1 | 10.03.2025 | 10.03.2025 | CO3 | K1 | PPT
5 | Working with time series | 1 | 11.03.2025 | 11.03.2025 | CO3 | K1 | PPT
6 | High performance Pandas | 1 | 12.03.2025 | 12.03.2025 | CO3 | K1 | PPT
Activity Based Learning

Improve your Python coding ability through each of the problem statements below.

1. Calculation of sum and product of (a) individual elements (b) collection of elements.
2. Identify maximum and minimum (a) using library functions (b) without using library
functions.
3. Develop code that enables you to calculate quartiles.
4. Enable variance calculations with (a) math library (b) numpy (c) pandas.
5. Represent a set of data by a representative value which would approximately define the
entire collection.
6. Execute standard deviation calculations with (a) math library (b) numpy (c) pandas.
7. Compare covariance and correlation through python coding.
8. Learn different types of plots and try to enhance your data-based storytelling skill.
Lecture Notes – Unit III

Sl. No. | Contents
1 | Introduction to Numpy - Multidimensional Ndarrays – Indexing – Properties – Constants
2 | Data Visualization: Ndarray Creation – Matplotlib – Introduction to Pandas – Series – Data frames – Visualizing the Data in Data frames
3 | Pandas Objects – Data Indexing and Selection – Handling missing data – Hierarchical indexing
4 | Combining datasets – Aggregation and Grouping – Joins - Pivot Tables - String operations
5 | Working with time series
6 | High performance Pandas


3.1 INTRODUCTION TO NUMPY
NumPy stands for Numerical Python. It is a Python library used for working
with arrays. It also has functions for working in domain of linear algebra,
Fourier transform, and matrices. NumPy was created in 2005 by Travis
Oliphant. It is an open-source project and can be used freely. It is written
partially in Python, but most of the parts that require fast computation are
written in C or C++.

3.1.1. Lists vs NumPy:

In Python we have lists. Lists serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50 times faster than traditional Python lists. NumPy arrays are stored in one continuous place in memory, unlike lists, so processes can access and manipulate them very efficiently. This behavior is called locality of reference in computer science. This is the main reason why NumPy is faster than lists. NumPy is also optimized to work with the latest CPU architectures.
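
To see the speed difference in practice, here is a minimal timing sketch (not part of the original notes; the exact numbers vary by machine) that sums one million values with a plain Python list and with a NumPy array:

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
sum(py_list)               # element-by-element Python loop
list_time = time.perf_counter() - start

start = time.perf_counter()
np_arr.sum()               # vectorized loop over contiguous memory
numpy_time = time.perf_counter() - start

print('list sum :', list_time)
print('numpy sum:', numpy_time)

On a typical machine the NumPy sum finishes many times faster, illustrating the locality-of-reference advantage described above.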

3.1.2. Installation of NumPy:

Step 1:
Two ways can be adopted for installation of NumPy:
• Use a Python distribution that already has NumPy installed, like Anaconda or Spyder.
• If you have Python and PIP already installed on a system, installation of NumPy is very easy. Install it using this command:
C:\Users\Your Name>pip install numpy

Step 2:
Once NumPy is installed, import it in your applications by adding the import keyword:
import numpy
Once NumPy is imported and ready to use, try out the simple code below to check:
import numpy
arr=numpy.array([1, 2, 3, 4, 5])
print(arr)
You will get an output as below
[1 2 3 4 5]
NumPy is usually imported under the np alias.
The version string is stored under __version__ attribute.
Try the below code to check
import numpy as np
arr=np.array([1,2,3,4,5])
print(arr)
print("np version is ",np.__version__)
output:
[1 2 3 4 5]
np version is 1.26.4

3.1.3. Numpy ndarray Object:


NumPy is used to work with arrays. The array object in NumPy is called ndarray. Create a NumPy ndarray object by using the array() function. We can pass a list, tuple or any array-like object into the array() method, and it will be converted into an ndarray object.
type() is a built-in Python function that tells us the type of the object passed to it. It can also be used with ndarray objects. The code below shows that arr is of 'numpy.ndarray' type.

import numpy as np
arr = np.array([1,2,3,4,5])
print(type(arr))

output:
<class 'numpy.ndarray'>

3.1.4. Dimensions in Arrays:

A dimension in arrays is one level of array depth. Nested arrays are arrays that have arrays as their elements. Change the code above with the snippets given below to learn array dimensions.

Dimension | Description | Code Snippet
0-D arrays (scalars) | The elements in an array; each value in an array is a 0-D array. | arr = np.array(42)
1-D (unidimensional) arrays | Have 0-D arrays as their elements; the most common and basic arrays. | arr = np.array([1, 2, 3, 4, 5])
2-D arrays (matrices) | Have 1-D arrays as their elements; often used to represent matrices or 2nd-order tensors. NumPy has a whole submodule dedicated to matrix operations called numpy.mat. | arr = np.array([[1, 2, 3], [4, 5, 6]])
3-D arrays | Have 2-D arrays (matrices) as their elements; often used to represent a 3rd-order tensor. | arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
Higher dimensional arrays | An array can have any number of dimensions. To define the number of dimensions, use the "ndmin" argument. | arr = np.array([1, 2, 3, 4], ndmin=3)

import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
arr = np.array([1, 2, 3, 4], ndmin=4)  # higher-dimensional array created with ndmin
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
print(arr.ndim)
output:
0
1
2
3
4

NumPy arrays provide the ndim attribute, which returns an integer telling us how many dimensions the array has.
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=4)
print(arr)
output:
[[[[1 2 3 4]]]]

3.2 MULTIDIMENSIONAL NDARRAYS

1-D array:
When numpy arrays have one dimension, their elements are arranged as a list and can be accessed using a single index, as stated below:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
a[0] # get the 0-th element of the array
output: 1

Definition:
A multi-dimensional array is an array with more than one level or dimension.
For example, a 2D array, or two-dimensional array, is an array of arrays,
meaning it is a matrix of rows and columns (think of a table). A 3D array
adds another dimension, turning it into an array of arrays of arrays.

3.2.1. Methods to create Multidimensional arrays


In general numpy arrays can have more than one dimension. The following methods are used to create multidimensional arrays.

Method 1:
start with a 1-dimensional array and use the numpy reshape() function that
rearranges elements of that array into a new shape.

Method 2:
The numpy functions zeros(), ones(), and empty() can be also used to
create arrays with more than one dimension:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.reshape(
    a,     # the array to be reshaped
    (2,3)  # dimensions of the new array
)
c = np.zeros((3,4))  # creates an array with 3 rows and 4 columns filled with zeros
d = np.ones((3,4))   # creates an array with 3 rows and 4 columns filled with ones
e = np.empty((3,4))  # creates an uninitialized 3x4 array; its contents are arbitrary
print('1D array')
print(a)  # the original 1-dimensional array
print('2D array')
print(b)  # the reshaped array
print('2D array - zeros')
print(c)
print('2D array - ones')
print(d)
print('2D array - empty')
print(e)

Output:
1D array
[1 2 3 4 5 6]
2D array
[[1 2 3]
 [4 5 6]]
2D array - zeros
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
2D array - ones
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
2D array - empty
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
(The empty array may contain any values; here it happened to contain zeros.)

3.2.2. Mathematical Operations on Multidimensional Arrays


Mathematical operations on multidimensional arrays work similarly as for 1-dimensional arrays. The following operations can be performed:
• arange
• Reshape
• Scalar multiplication
• Addition of two arrays of the same dimensions
• Multiplication of two arrays of the same dimensions - array multiplication multiplies corresponding elements of the arrays
• Multiplication using the dot() function - to perform matrix multiplication of 2-dimensional arrays, use the numpy dot() function
• Application of mathematical functions defined by NumPy (e.g. cos)

Python code for all the above operations is given below.


import numpy as np
a = np.arange(4)  # creates an array with elements from 0 to 3
print('1D array:', a)
b = np.reshape(a, (2,2))  # reshaping 1D to 2D
print('reshaped array:')
print(b)
c = 10*b  # multiplication by a number
print('scalar multiplication')
print(c)
d = np.ones((2,2))
print('Matrix d')
print(d)
e = b+d  # addition of two arrays of the same dimensions
print('sum of 2 matrices b&d:')
print(e)
f = b*e  # element-wise multiplication of two arrays of the same dimensions
print('multiplication of 2 matrices b&e:')
print(f)
g = np.dot(b, e)  # matrix multiplication of b and e
print('multiplication using dot function')
print(g)
print('applying numpy mathematical functions')
h = np.cos(g)  # compute cosine of all elements of the array g
print(h)
Output:
1D array: [0 1 2 3]
reshaped array:
[[0 1]
 [2 3]]
scalar multiplication
[[ 0 10]
 [20 30]]
Matrix d
[[1. 1.]
 [1. 1.]]
sum of 2 matrices b&d:
[[1. 2.]
 [3. 4.]]
multiplication of 2 matrices b&e:
[[ 0.  2.]
 [ 6. 12.]]
multiplication using dot function
[[ 3.  4.]
 [11. 16.]]
applying numpy mathematical functions
[[-0.9899925  -0.65364362]
 [ 0.0044257  -0.95765948]]

3.3. INDEXING

Array indexing is the same as accessing an array element by referring to its index number.

3.3.1. Indexing in 1-D Arrays

The indexes in NumPy arrays start with 0, meaning that the first element has index 0, the second has index 1, and so on.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
output:
7

3.3.2. Indexing 2-D Arrays

To access elements from 2-D arrays we can use comma-separated integers representing the dimension and the index of the element. Think of 2-D arrays like a table with rows and columns, where the dimension represents the row and the index represents the column.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
print('5th element on 2nd row: ', arr[1, 4])
output:
2nd element on 1st row:  2
5th element on 2nd row:  10

3.3.3. Access 3-D Arrays


To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

Explanation for the above example follows


arr[0, 1, 2] prints the value 6.
And this is why:
The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
The second number represents the second dimension, which also contains
two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
The third number represents the third dimension, which contains three
values:
4
5
6
Since we selected 2, we end up with the third value:
6

3.3.4. Negative Indexing


Use negative indexing (minus operator) to access an array from the end.

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
output:
Last element from 2nd dim: 10

3.3.5. Select Elements in Rows

Specify the rows and columns that are of interest. Choose the syntax that suits your requirement. Remember that a[i] is the same as a[i,:], i.e. it selects the i-th row of the array.
Syntax:
• arrayname[start index:end index+1]
• arrayname[:end index+1]
• arrayname[:]
• We can also define the step, like this: [start:end:step].

Program:
'''selecting elements by row/column
o arrayname[start index:end index+1]
o arrayname[:end index+1]
o arrayname[:]
'''
import numpy as np
a = np.reshape(np.arange(30), (5,6))  # create a 5x6 array
print('create matrix with reshape')
print(a)
print('syntax-1')
b = a[1:4, 0:2]  # select elements in rows 1-3 and columns 0-1
print(b)
print('syntax 2')
c = a[:3, 2:4]  # select elements in rows 0-2 and columns 2-3
print(c)
print('syntax 3')
d = a[:, 0]  # select all elements in the 0-th column
print(d)

Output:
create matrix with reshape
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]]
syntax-1
[[ 6 7]
[12 13]
[18 19]]
syntax 2
[[ 2 3]
[ 8 9]
[14 15]]
syntax 3
[ 0 6 12 18 24]
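
The step form [start:end:step] is not used in the program above; here is a small illustrative sketch of it on the same 5x6 matrix:

import numpy as np
a = np.reshape(np.arange(30), (5,6))
print(a[::2, 1::2])  # every 2nd row; columns 1, 3 and 5

Output:
[[ 1  3  5]
 [13 15 17]
 [25 27 29]]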

3.3.6. Slicing Multidimensional Arrays

In order to create a slice of a multidimensional array, we need to specify which part of each dimension we want to select. Slicing a NumPy array produces a view of the original array, so changing a slice changes the original array. The two examples below give a clear understanding of the slicing operation.
Example 1:
import numpy as np
a = np.reshape(np.arange(30), (5,6))  # create a 5x6 array
print(a)
Output:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]]
Example 2:
import numpy as np
a = np.reshape(np.arange(30), (5,6))  # create a 5x6 array
print(a[1])        # the row with index 1
b = a[:3, :3]      # slice of the first 3 rows and first 3 columns
print(b)
b[0,0] = 1000      # changing the slice changes the original array
print(b)
print(a)
# use this to change many entries of an array at once
a[:4, :4] = 0      # set all entries of the slice to 0
print(a)

Output:
[ 6 7 8 9 10 11]
[[ 0 1 2]
[ 6 7 8]
[12 13 14]]
[[1000 1 2]
[ 6 7 8]
[ 12 13 14]]

[[1000 1 2 3 4 5]
[ 6 7 8 9 10 11]
[ 12 13 14 15 16 17]
[ 18 19 20 21 22 23]
[ 24 25 26 27 28 29]]
[[ 0 0 0 0 4 5]
[ 0 0 0 0 10 11]
[ 0 0 0 0 16 17]
[ 0 0 0 0 22 23]
[24 25 26 27 28 29]]

3.4. PROPERTIES

In NumPy, attributes are properties of NumPy arrays that provide information about the array's shape, size, data type, dimension, and so on. To access NumPy attributes, we use the . (dot) notation. There are numerous attributes available; some of the commonly used ones are stated below (with array1 = np.array([[2, 4, 6], [1, 3, 5]])).

Attribute | Description | Usage
ndim | Returns the number of dimensions of the array | array1.ndim
size | Returns the total number of elements in the array, regardless of the number of dimensions | array1.size
dtype | Returns the data type of the elements in the array | array1.dtype
shape | Returns a tuple of integers that gives the size of the array in each dimension | array1.shape
itemsize | Returns the size (in bytes) of each element in the array | array1.itemsize
data | Returns the buffer containing the actual elements of the array; it is like a pointer to the memory location where the array's data is stored | array1.data

The below code demonstrates the above listed Numpy Properties/attributes

import numpy as np
# create a 2-D array
array1 = np.array([[2, 4, 6],
[1, 3, 5]])
# check the dimension of array1
print(array1.ndim)
# return total number of elements in array1
print(array1.size)
# return a tuple that gives size of array in each dimension
print(array1.shape)
# create an array of integers
array1 = np.array([6, 7, 8])
# check the data type of array1
print(array1.dtype)
# create a 1-D array of 32-bit integers
array2 = np.array([6, 7, 8, 10, 13], dtype=np.int32)
print(array1.itemsize)
print(array2.itemsize)
# print memory address of array1's and array2's data
print("\nData of array1 is: ",array1.data)

print("Data of array2 is: ",array2.data)

Output
2
6
(2, 3)
int64
8
4
Data of array1 is: <memory at 0x7fd869e68ac0>
Data of array2 is: <memory at 0x7fd869e68ac0>

3.5. CONSTANTS

NumPy constants are predefined fixed values used for mathematical calculations. Using predefined constants makes our code concise and easier to read. The most commonly used constants are pi and e.
np.pi
It is a mathematical constant that returns the value of pi (π) as a floating point number. Its value is approximately 3.141592653589793. Instead of the long floating point number, we can use the constant np.pi. It makes our code look clean. Let's see an example.
import numpy as np
radius = 2
circumference = 2 * np.pi * radius
print(circumference)
Output:
12.566370614359172
np.e
It is widely used with exponential and logarithmic functions. Its value is
approximately 2.718281828459045.
Let's see an example.

import numpy as np
y = np.e
print(y)
Output:
2.718281828459045

We usually use the constant e with the function exp(). e is the base of the exponential function exp(x), which is equivalent to e^x.
import numpy as np
x = 2
y = np.exp(x)  # calculating e to the power 2
print(y)
Output:
7.38905609893065

3.6 Ndarray Creation


There are 6 general mechanisms for creating arrays:
1. Conversion from other Python structures (i.e. lists and tuples)
2. Intrinsic NumPy array creation functions (e.g. arange, ones, zeros, etc.)
3. Replicating, joining, or mutating existing arrays
4. Reading arrays from disk, either from standard or custom formats
5. Creating arrays from raw bytes through the use of strings or buffers
6. Use of special library functions (e.g., random)
You can use these methods to create ndarrays or Structured arrays. The
following are general methods for ndarray creation.
3.6.1. Converting Python Sequences to NumPy Arrays
NumPy arrays can be defined using Python sequences such as lists and tuples. Lists and tuples are defined using [...] and (...), respectively. Lists and tuples can define ndarray creation:
• a list of numbers will create a 1D array,
• a list of lists will create a 2D array,
• further nested lists will create higher-dimensional arrays. In general, any array object is called an ndarray in NumPy.
>>> import numpy as np
>>> a1D = np.array([1, 2, 3, 4])
>>> a2D = np.array([[1, 2], [3, 4]])
>>> a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
When using numpy.array to define a new array, consider the dtype of the
elements in the array, which can be specified explicitly. This feature gives more
control over the underlying data structures and how the elements are handled.
When values do not fit and you are using a dtype, NumPy may raise an error:
>>> import numpy as np
>>> np.array([127, 128, 129], dtype=np.int8)
Traceback (most recent call last):
...
OverflowError: Python integer 128 out of bounds for int8
An 8-bit signed integer represents integers from -128 to 127. Assigning the int8 array to integers outside of this range results in overflow. This feature can often be misunderstood. Performing calculations with mismatching dtypes gives unwanted results, as the example below shows.
>>> import numpy as np
>>> a = np.array([2, 3, 4], dtype=np.uint32)
>>> b = np.array([5, 6, 7], dtype=np.uint32)
>>> c_unsigned32 = a - b
>>> print('unsigned c:', c_unsigned32, c_unsigned32.dtype)
unsigned c: [4294967293 4294967293 4294967293] uint32
>>> c_signed32 = a - b.astype(np.int32)
>>> print('signed c:', c_signed32, c_signed32.dtype)
signed c: [-3 -3 -3] int64
Operations with two arrays of the same dtype (here uint32) produce a resulting array of the same type. When you perform operations with different dtypes, NumPy will assign a new type that satisfies all of the array elements involved in the computation; here uint32 and int32 can both be represented as int64.

The default NumPy behavior is to create arrays in either 32 or 64-bit signed integers (platform dependent; matches the C long size) or double precision floating point numbers. If you expect your integer arrays to be a specific type, then you need to specify the dtype while you create the array.

3.6.2. Intrinsic Numpy Array Creation Functions


NumPy has over 40 built-in functions for creating arrays as laid out in the Array
creation routines. These functions can be split into roughly three categories,
based on the dimension of the array they create:
a) 1D arrays
b) 2D arrays
c) Ndarrays
a) 1D Array Creation Functions
The 1D array creation functions, e.g. numpy.linspace and numpy.arange, generally need at least two inputs, start and stop.
numpy.arange creates arrays with regularly incrementing values. Check the documentation for complete information and examples. A few examples are shown:
>>> import numpy as np
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(2, 10, dtype=float)
array([2., 3., 4., 5., 6., 7., 8., 9.])
>>> np.arange(2, 3, 0.1)
array([2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])
Note: best practice for numpy.arange is to use integer start, end, and step values. There are some subtleties regarding dtype. In the second example, the dtype is defined. In the third example, the array is dtype=float to accommodate the step size of 0.1. Due to roundoff error, the stop value is sometimes included.
numpy.linspace will create arrays with a specified number of elements, and
spaced equally between the specified beginning and end values. For example:
>>> import numpy as np
>>> np.linspace(1., 4., 6)
array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])
The advantage of this creation function is that you guarantee the number of
elements and the starting and end point. The
previous arange(start, stop, step) will not include the value stop.
b) 2D Array Creation Functions
The 2D array creation functions e.g. numpy.eye, numpy.diag,
and numpy.vander define properties of special matrices represented as 2D
arrays.
np.eye(n, m) defines a 2D identity matrix. The elements where i=j (row index
and column index are equal) are 1 and the rest are 0, as such:
>>> import numpy as np
>>> np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> np.eye(3, 5)
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.]])

numpy.diag can define either a square 2D array with given values along the
diagonal or if given a 2D array returns a 1D array that is only the diagonal
elements. The two array creation functions can be helpful while doing linear
algebra, as such:

>>> import numpy as np
>>> np.diag([1, 2, 3])
array([[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])
>>> np.diag([1, 2, 3], 1)
array([[0, 1, 0, 0],
[0, 0, 2, 0],
[0, 0, 0, 3],
[0, 0, 0, 0]])
>>> a = np.array([[1, 2], [3, 4]])
>>> np.diag(a)
array([1, 4])
vander(x, n) defines a Vandermonde matrix as a 2D NumPy array. Each column
of the Vandermonde matrix is a decreasing power of the input 1D array or list or
tuple, x where the highest polynomial order is n-1. This array creation routine is
helpful in generating linear least squares models, as such:
>>> import numpy as np
>>> np.vander(np.linspace(0, 2, 5), 2)
array([[0. , 1. ],
[0.5, 1. ],
[1. , 1. ],
[1.5, 1. ],
[2. , 1. ]])
>>> np.vander([1, 2, 3, 4], 2)
array([[1, 1],
[2, 1],
[3, 1],
[4, 1]])
>>> np.vander((1, 2, 3, 4), 4)
array([[ 1, 1, 1, 1],

[ 8, 4, 2, 1],
[27, 9, 3, 1],
[64, 16, 4, 1]])
c) General ndarray Creation Functions
• The ndarray creation functions, e.g. numpy.zeros, numpy.ones, and random, define arrays based upon the desired shape.
• The ndarray creation functions can create arrays with any dimension by specifying how many dimensions and the length along each dimension in a tuple or list.
numpy.zeros will create an array filled with 0 values with the specified shape. The default dtype is float64:
>>> import numpy as np
>>> np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
>>> np.zeros((2, 3, 2))
array([[[0., 0.],
[0., 0.],
[0., 0.]],
[[0., 0.],
[0., 0.],
[0., 0.]]])
numpy.ones will create an array filled with 1 values. It is identical to zeros in
all other respects as such:
>>> import numpy as np
>>> np.ones((2, 3))
array([[1., 1., 1.],
[1., 1., 1.]])
>>> np.ones((2, 3, 2))
array([[[1., 1.],
[1., 1.],

[1., 1.]],
[[1., 1.],
[1., 1.],
[1., 1.]]])
The random method will create an array filled with random values between 0
and 1. It is included with the numpy.random library. Below, two arrays are
created with shapes (2,3) and (2,3,2), respectively. The seed is set to 42 so you
can reproduce these pseudorandom numbers:
>>> import numpy as np
>>> from numpy.random import default_rng
>>> default_rng(42).random((2,3))
array([[0.77395605, 0.43887844, 0.85859792],
[0.69736803, 0.09417735, 0.97562235]])
>>> default_rng(42).random((2,3,2))
array([[[0.77395605, 0.43887844],
[0.85859792, 0.69736803],
[0.09417735, 0.97562235]],
[[0.7611397 , 0.78606431],
[0.12811363, 0.45038594],
[0.37079802, 0.92676499]]])
numpy.indices will create a set of arrays (stacked as a one-higher
dimensioned array), one per dimension with each representing variation in that
dimension:
>>> import numpy as np
>>> np.indices((3,3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])

This is particularly useful for evaluating functions of multiple dimensions on a
regular grid.
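
As a quick sketch of that use, the two index grids returned by numpy.indices can be fed directly into a vectorized expression, here the illustrative function f(i, j) = i^2 + j^2:

>>> import numpy as np
>>> rows, cols = np.indices((3, 3))   # row-index grid and column-index grid
>>> rows**2 + cols**2                 # evaluate f on every grid point at once
array([[0, 1, 4],
       [1, 2, 5],
       [4, 5, 8]])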

3.6.3. Replicating, joining, or mutating existing arrays


Once you have created arrays, you can replicate, join, or mutate those existing
arrays to create new arrays. When you assign an array or its elements to a new
variable, you have to explicitly numpy.copy the array, otherwise the variable is
a view into the original array. Consider the following example:
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6])
>>> b = a[:2]
>>> b += 1
>>> print('a =', a, '; b =', b)
a = [2 3 3 4 5 6] ; b = [2 3]
In this example, you did not create a new array. You created a variable, ‘b’ that
viewed the first 2 elements of ‘a’. When you added 1 to b you would get the
same result by adding 1 to a[:2]. If you want to create a new array, use
the numpy.copy array creation routine as such:
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> b = a[:2].copy()
>>> b += 1
>>> print('a = ', a, 'b = ', b)
a = [1 2 3 4] b = [2 3]
There are a number of routines to join existing arrays
e.g. numpy.vstack, numpy.hstack, and numpy.block. Here is an example
of joining four 2-by-2 arrays into a 4-by-4 array using block:
>>> import numpy as np
>>> A = np.ones((2, 2))
>>> B = np.eye(2, 2)
>>> C = np.zeros((2, 2))

>>> D = np.diag((-3, -4))
>>> np.block([[A, B], [C, D]])
array([[ 1., 1., 1., 0.],
[ 1., 1., 0., 1.],
[ 0., 0., -3., 0.],
[ 0., 0., 0., -4.]])
Other routines use similar syntax to join ndarrays.
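
For instance, numpy.vstack and numpy.hstack stack arrays vertically and horizontally; a short sketch:

>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6], [7, 8]])
>>> np.vstack((a, b))        # stack rows: result is 4x2
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
>>> np.hstack((a, b))        # stack columns: result is 2x4
array([[1, 2, 5, 6],
       [3, 4, 7, 8]])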

3.6.4. Reading arrays from disk, from standard or custom formats


This is the most common case of large array creation. The details depend
greatly on the format of data on disk. This section gives general pointers on
how to handle various formats.
Standard binary formats
Various fields have standard formats for array data. The following lists the ones
with known Python libraries to read them and return NumPy arrays.
HDF5: h5py
FITS: Astropy
Examples of formats that cannot be read directly but for which it is not hard to
convert are those formats supported by libraries like PIL (able to read and write
many image formats such as jpg, png, etc).

Common ASCII formats


Delimited files such as Comma Separated Value (csv) and Tab Separated Value
(tsv) files are used for programs like Excel and LabView. Python functions can
read and parse these files line-by-line.
NumPy has two standard routines for importing a file with delimited data
numpy.loadtxt and
numpy.genfromtxt.
These functions have more involved use cases in Reading and writing files.
A simple example given a simple.csv:
$ cat simple.csv

x, y
0, 0
1, 1
2, 4
3, 9
Importing simple.csv is accomplished using numpy.loadtxt:
>>> import numpy as np
>>> np.loadtxt('simple.csv', delimiter = ',', skiprows = 1)
array([[0., 0.],
[1., 1.],
[2., 4.],
[3., 9.]])

3.6.5. Creating arrays from raw bytes through strings or buffer usage
There are a variety of approaches one can use. If the file has a relatively simple format, then one can write a simple I/O library and use the NumPy fromfile() function and .tofile() method to read and write NumPy arrays directly (mind your byte order though!). If a good C or C++ library exists that reads the data, one can wrap that library with a variety of techniques, though that certainly is much more work and requires significantly more advanced knowledge to interface with C or C++.
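
A minimal sketch of this round trip, assuming a throwaway file name data.bin (the dtype and shape are not stored in the file, so the dtype must be supplied again when reading):

>>> import numpy as np
>>> a = np.arange(6, dtype=np.int32)
>>> a.tofile('data.bin')                      # raw bytes only; no shape or dtype metadata
>>> np.fromfile('data.bin', dtype=np.int32)   # dtype must match what was written
array([0, 1, 2, 3, 4, 5], dtype=int32)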

3.6.6. Use of special library functions


NumPy is the fundamental library for array containers in the Python scientific computing stack. Many Python libraries, including SciPy, Pandas, and OpenCV, use NumPy ndarrays as the common format for data exchange. These libraries can create, operate on, and work with NumPy arrays.

3.7 Data Visualization using Matplotlib in Python

Matplotlib is a powerful and widely used Python library for creating static, animated and interactive data visualizations. This section introduces Matplotlib and shows how to use it for data visualization with practical implementations.
3.7.1. Installing Matplotlib for Data Visualization
To install Matplotlib type the below command in the terminal.
pip install matplotlib
If you are using Jupyter Notebook, you can install it within a notebook cell by
using:
!pip install matplotlib
3.7.2. Data Visualization with Pyplot using Matplotlib
Matplotlib provides a module called pyplot which offers a MATLAB-like
interface for creating plots and charts. It simplifies the process of generating
various types of visualizations by providing a collection of functions that handle
common plotting tasks.
Matplotlib supports a variety of plots including line charts, bar charts,
histograms, scatter plots, etc. Let’s understand them with implementation using
pyplot.
Line Chart
Line chart is one of the basic plots and can be created using
the plot() function. It is used to represent a relationship between two data X
and Y on a different axis.
Example:
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()

Output: (line chart figure)

Bar Chart
A bar chart is a graph that represents categories of data with rectangular bars whose lengths and heights are proportional to the values they represent. The bars can be plotted horizontally or vertically. A bar chart describes comparisons between the different categories. It can be created using the bar() method.

In the below example we will use the tips dataset. Tips database is the record
of the tip given by the customers in a restaurant for two and a half months in
the early 1990s. It contains 6 columns as total_bill, tip, sex, smoker, day, time,
size.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['day']
y = data['total_bill']
plt.bar(x, y)

plt.title("Tips Dataset")
plt.ylabel('Total Bill')
plt.xlabel('Day')
plt.show()
Output: (bar chart figure)

Histogram
A histogram is basically used to represent data provided in a form of some
groups. It is a type of bar plot where the X-axis represents the bin ranges while
the Y-axis gives information about frequency. The hist() function is used to
compute and create histogram of x.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['total_bill']
plt.hist(x)
plt.title("Tips Dataset")
plt.ylabel('Frequency')
plt.xlabel('Total Bill')
plt.show()

Output: (histogram figure)

Scatter Plot
Scatter plots are used to observe relationships between variables.
The scatter() method in the matplotlib library is used to draw a scatter plot.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['day']
y = data['total_bill']
plt.scatter(x, y)
plt.title("Tips Dataset")
plt.ylabel('Total Bill')
plt.xlabel('Day')
plt.show()

Output: (scatter plot figure)

Pie Chart
Pie chart is a circular chart used to display only one series of data. The area of
slices of the pie represents the percentage of the parts of the data. The slices of
pie are called wedges. It can be created using the pie() method.
Syntax:
matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None,
autopct=None, shadow=False)
Example:
import matplotlib.pyplot as plt
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]
plt.pie(data, labels=cars)
plt.title("Car data")
plt.show()

Output: (pie chart figure)

Box Plot
A Box Plot is also known as a Whisker Plot and is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3) and maximum. It can also show outliers. Let's see an example of how to create a Box Plot using Matplotlib in Python:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(10)
data = [np.random.normal(0, std, 100) for std in range(1, 4)]
# Create a box plot
plt.boxplot(data, vert=True, patch_artist=True,
boxprops=dict(facecolor='skyblue'),
medianprops=dict(color='red'))
plt.xlabel('Data Set')
plt.ylabel('Values')
plt.title('Example of Box Plot')

plt.show()
Output: (box plot figure)

Explanation:
• plt.boxplot(data): Creates the box plot. The vert=True argument makes the plot vertical, and patch_artist=True fills the box with color.
• boxprops and medianprops: Customize the appearance of the boxes and median lines respectively.
The box shows the interquartile range (IQR), the line inside the box shows the median, and the "whiskers" extend to the minimum and maximum values within 1.5 * IQR from the first and third quartiles. Any points outside this range are considered outliers and are plotted as individual points.

Heatmap
A Heatmap is a data visualization technique that represents data in matrix form, where individual values are represented as colors. Heatmaps are particularly useful for visualizing the magnitude of multiple features on a two-dimensional surface and for identifying patterns, correlations and concentrations. Let's see an example of how to create a Heatmap using Matplotlib in Python:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
data = np.random.rand(10, 10)
plt.imshow(data, cmap='viridis', interpolation='nearest')
plt.colorbar()
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Example of Heatmap')
plt.show()
Output: (heatmap figure)

Explanation:
• plt.imshow(data, cmap='viridis'): Displays the data as an image (heatmap). The cmap='viridis' argument specifies the color map used for the heatmap.
• interpolation='nearest': Ensures that each data point is shown as a block of color without smoothing.
The color bar on the side provides a scale to interpret the colors, with darker colors representing lower values and lighter colors representing higher values. This type of plot is often used in fields like data analysis, bioinformatics and finance to visualize data correlations and distributions across a matrix.

3.7.3. Matplotlib’s Core Components: Figures and Axes


Before moving any further with Matplotlib, let's discuss some important classes that will be used further in this tutorial:
• Figure
• Axes

Figure class
Consider the Figure class as the overall window or page on which everything is drawn. It is a top-level container that contains one or more axes. A figure can be created using the figure() method.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None, facecolor=None, edgecolor=None, linewidth=0.0, frameon=None, subplotpars=None, tight_layout=None, constrained_layout=None)
Example:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# Creating a new figure with width = 7 inches and height = 5 inches,
# with face color green, edge color blue and an edge line width of 7
fig = plt.figure(figsize=(7, 5), facecolor='g', edgecolor='b', linewidth=7)
ax = fig.add_axes([1, 1, 1, 1])
ax.plot(x, y)
plt.title("Linear graph", fontsize=25, color="yellow")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.ylim(0, 80)
plt.xticks(x, labels=["one", "two", "three", "four"])
plt.legend(["GFG"])
plt.show()
Output: (figure: line plot titled "Linear graph" on a green figure background)

Axes Class
The Axes class is the most basic and flexible unit for creating sub-plots. A given figure may contain many axes, but a given axes object can only be present in one figure. The axes() function creates the axes object.
Syntax:
axes([left, bottom, width, height])
Just like the pyplot class, the axes class also provides methods for adding titles, legends, limits, labels, etc. Let's see a few of them:
• ax.set_title() is used to add a title.
• ax.set_xlabel(), ax.set_ylabel() add the X label and Y label.
• ax.set_xlim(), ax.set_ylim() set the limits.
• ax.set_xticklabels(), ax.set_yticklabels() are used to set the tick labels.
• ax.legend() is used to add a legend.
Example:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
fig = plt.figure(figsize=(5, 4))
ax = fig.add_axes([1, 1, 1, 1])
ax1 = ax.plot(x, y)
ax2 = ax.plot(y, x)
ax.set_title("Linear Graph")
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
ax.legend(labels=('line 1', 'line 2'))
plt.show()
Output: (figure with two plotted lines and a legend)

3.8 Introduction to Pandas


Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This Pandas tutorial has been prepared for those who want to learn about the foundations and advanced features of the Pandas Python package. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics and analytics. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.
What is Pandas?
Pandas is a powerful Python library that is specifically designed to work on data
frames that have "relational" or "labeled" data. Its aim aligns with doing real-
world data analysis using Python. Its flexibility and functionality make it
indispensable for various data-related tasks. Hence, this Python package works
well for data manipulation, operating a dataset, exploring a data frame, data
analysis, and machine learning-related tasks. To work on it we should first
install it using a pip command like "pip install pandas" and then import it like
"import pandas as pd". After successfully installing and importing, we can enjoy
the innovative functions of pandas to work on datasets or data frames. Pandas
versatility and ease of use make it a go-to tool for working with structured data
in Python.
Generally, Pandas operates a data frame using Series and DataFrame; where
Series works on a one-dimensional labeled array holding data of any type
like integers, strings, and objects, while a DataFrame is a two-dimensional data
structure that manages and operates data in tabular form (using rows and
columns).
Why Pandas?
The beauty of Pandas is that it simplifies tasks related to data frames and makes simple many of the time-consuming, repetitive tasks involved in working with data, such as:
• Importing datasets - available in the form of spreadsheets, comma-separated values (CSV) files, and more.
• Data cleansing - dealing with missing values and representing them as NaN, NA, or NaT (a short sketch follows this list).
• Size mutability - columns can be added to and removed from DataFrames and higher-dimensional objects.
• Data normalization - normalizing the data into a suitable format for analysis.
• Data alignment - objects can be explicitly aligned to a set of labels.
• Intuitive merging and joining of data sets.
• Reshaping and pivoting of datasets - datasets can be reshaped and pivoted as per the need.
• Efficient manipulation and extraction - manipulation and extraction of specific parts of extensive datasets using intelligent label-based slicing, indexing, and subsetting techniques.
• Statistical analysis - performing statistical operations on datasets.
• Data visualization - visualizing datasets and uncovering insights.
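
A minimal sketch of two of these tasks, missing-value detection and cleansing, on a small hypothetical frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})
print(df.isnull())    # True where a value is missing (NaN)
print(df.dropna())    # drop every row containing a missing value
print(df.fillna(0))   # replace missing values with 0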

Applications of Pandas
The most common applications of Pandas are as follows:
• Data Cleaning: Pandas provides functionality to clean messy data, deal with incomplete or inconsistent data, handle missing values, remove duplicates, and standardize formats for effective data analysis.
• Data Exploration: Pandas easily summarizes statistics, finds trends, and visualizes data using built-in plotting functions, Matplotlib, or Seaborn integration.
• Data Preparation: Pandas can pivot, melt, convert variables, and merge datasets based on common columns to prepare data for analysis.
• Data Analysis: Pandas supports descriptive statistics, time series analysis, group-by operations, and custom functions.
• Data Visualization: Pandas itself has basic plotting capabilities; it integrates with and supports data visualization libraries like Matplotlib, Seaborn, and Plotly to create innovative visualizations.
• Time Series Analysis: Pandas supports date/time indexing, resampling, frequency conversion, and rolling statistics for time series data.
• Data Aggregation and Grouping: the Pandas groupby() function lets you aggregate data and compute group-wise summary statistics or apply functions to groups (see the sketch after this list).
• Data Input/Output: Pandas makes data import and export easy by reading and writing CSV, Excel, JSON, SQL databases, and more.
• Machine Learning: Pandas works well with Scikit-learn for data preparation, feature engineering, and model input data.
• Web Scraping: Pandas may be used with BeautifulSoup or Scrapy to parse and analyse structured web data for web scraping and data extraction.
• Financial Analysis: Pandas is commonly used in finance for stock market data analysis, financial indicator calculation, and portfolio optimization.
• Text Data Analysis: Pandas' string manipulation, regular expressions, and text mining functions help analyse textual data.
• Experimental Data Analysis: Pandas makes it easy to manipulate and analyse large datasets, perform statistical tests, and visualize results.
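
A minimal sketch of group-wise aggregation with groupby(), using a hypothetical two-column frame:

import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 30, 40]
})
print(df.groupby('team')['score'].mean())
# team
# A    15.0
# B    35.0
# Name: score, dtype: float64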

Key Features of Pandas

• A DataFrame object that is quick and effective, with both standard and custom indexing.
• Used for reshaping and pivoting of data sets.
• Group-by operations on data for aggregations and transformations.
• Aligns data and integrates handling of missing data.
• Provides time series functionality.
• Processes a variety of data sets in various formats, such as matrix data, heterogeneous tabular data, and time series.
• Manages multiple operations on data sets, including subsetting, slicing, filtering, groupBy, reordering, and reshaping.
• Integrates with other libraries like SciPy and scikit-learn.
• Performs quickly, and Cython can be used to accelerate it even further.

Python Pandas Data Structures
Data structures in Pandas are designed to handle data efficiently. They allow for
the organization, storage, and modification of data in a way that optimizes
memory usage and computational performance. Python Pandas library provides
two primary data structures for handling and analyzing data −
 Series
 DataFrame
In general programming, the term "data structure" refers to the method of
collecting, organizing, and storing data to enable efficient access and
modification. Data structures are collections of data types that provide the best
way of organizing items (values) in terms of memory usage.
Pandas is built on top of NumPy and integrates well within a scientific
computing environment with many other third-party libraries. This tutorial will
provide a detailed introduction to these data structures.
Dimension and Description of Pandas Data Structures

Data Structure | Dimensions | Description
Series | 1 | A one-dimensional labeled homogeneous array; size immutable.
DataFrame | 2 | A two-dimensional, size-mutable, labeled tabular structure with potentially heterogeneously typed columns.

Working with two or more dimensional arrays can be complex and time-
consuming, as users need to carefully consider the data's orientation when
writing functions. However, Pandas simplifies this process by reducing the
mental effort required. For example, when dealing with tabular data
(DataFrame), it's easier to think in terms of rows and columns instead of axis 0
and axis 1.
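
A small sketch of this row/column way of thinking, summing a hypothetical frame down the rows (axis 0, one total per column) and across the columns (axis 1, one total per row):

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})
print(df.sum(axis=0))   # column totals: x = 3, y = 30
print(df.sum(axis=1))   # row totals: 11 and 22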

Mutability of Pandas Data Structures
All Pandas data structures are value mutable, meaning their contents can be changed. However, their size mutability varies, as stated below:
• Series - size immutable.
• DataFrame - size mutable.

3.9. Series
A Series is a one-dimensional labeled array that can hold any data type. It can
store integers, strings, floating-point numbers, etc. Each value in a Series is
associated with a label (index), which can be an integer or a string.

Name Steve
Age 35
Gender Male
Rating 3.5
Example
Consider the following Series which is a collection of different data types
import pandas as pd
data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)
On executing the above program, you will get the following

Output:
Name Steve
Age 35
Gender Male
Rating 3.5
dtype: object

3.10. DataFrame
A pandas DataFrame can be created using the following constructor:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

The parameters of the constructor are as follows:

Sl. No | Parameter & Description
1 | data - takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
2 | index - for the row labels, the index to be used for the resulting frame; optional, defaults to np.arange(n) if no index is passed.
3 | columns - specifies the column labels; optional, defaults to np.arange(n) if no column labels are passed.
4 | dtype - data type of each column.
5 | copy - whether to copy data from the inputs; defaults to False.

Creating a DataFrame from Different Inputs

A pandas DataFrame can be created using various inputs like:
• Lists
• Dictionary
• Series
• Numpy ndarrays
• Another DataFrame
• External input files like CSV, JSON, HTML, Excel sheets, and more.

A DataFrame is a two-dimensional labeled data structure with columns that can hold different data types. It is similar to a table in a database or a spreadsheet.
Consider the following data representing the performance rating of a sales team:

Name  | Age | Gender | Rating
Steve | 32  | Male   | 3.45
Lia   | 28  | Female | 4.6
Vin   | 45  | Male   | 3.9
Katie | 38  | Female | 2.78

Example
The above tabular data can be represented in a DataFrame as follows
import pandas as pd
# Data represented as a dictionary
data = {
'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
'Age': [32, 28, 45, 38],
'Gender': ['Male', 'Female', 'Male', 'Female'],
'Rating': [3.45, 4.6, 3.9, 2.78]
}
# Creating the DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print(df)

Output
On executing the above code, you will get the following output −
Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.60
Vin 45 Male 3.90
Katie 38 Female 2.78

Features of DataFrame
• Columns can be of different types.
• Size is mutable.
• Labeled axes (rows and columns).
• Can perform arithmetic operations on rows and columns.
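
Besides the dictionary input shown above, the other listed inputs work the same way; a short sketch with a list of lists and a NumPy ndarray (the column labels here are illustrative):

import numpy as np
import pandas as pd

# From a list of lists, with explicit column labels
df1 = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'char'])
print(df1)

# From a NumPy ndarray
df2 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['c0', 'c1', 'c2'])
print(df2)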

3.11. Data Visualization with Pandas

Pandas allows you to create various graphs directly from your data using built-in functions. This topic covers Pandas' capabilities for visualizing data with line plots, area charts, bar plots, and more.

3.11.1. Introducing Pandas for Data Visualization


Pandas is a powerful open-source data analysis and manipulation library for
Python. The library is particularly well-suited for handling labeled data such as
tables with rows and columns.

Key Features for Data Visualization with Pandas:

Pandas offers several features that make it a great choice for data visualization:
• Variety of Plot Types: Pandas supports various plot types including line plots, bar plots, histograms, box plots, and scatter plots.
• Customization: Users can customize plots by adding titles, labels, and styling, enhancing the readability of the visualizations.
• Handling of Missing Data: Pandas efficiently handles missing data, ensuring that visualizations accurately represent the dataset without errors.
• Integration with Matplotlib: Pandas integrates with Matplotlib, allowing users to create a wide range of static, animated, and interactive plots.

68
Advantages of Data Visualization:

 Easy to understand: Managers and decision-makers use data visualization
tools to create plots easily and rapidly consume important metrics. These
metrics show clear growth or loss in the business. For example, if sales are
significantly going down in one region, decision-makers can easily find out
from the data what circumstances are responsible and how to respond to them.
Through graphical representations we can interpret the many features of the
data clearly and cohesively, which allows us to understand the data, draw
conclusions from those insights, and see the business outlook.
 Quick Decision Making: The human mind processes visual images faster
than text and numerical values. Hence, a graph, chart, or other visual
representation of data is more pleasant and easier for our brain to process.
Reading and grasping text, then mentally converting it into a picture of the
data, is difficult and time-consuming for a team of decision-makers, and the
result may not even be entirely accurate. Because humans interpret visual
data so readily, data visualization demonstrably improves the speed of
decision-making processes; it also helps shorten business meetings and
supports efficient decision making.
 Better Analysis: Data visualization plays an important role in helping
business stakeholders analyze reports on sales, marketing strategies, and
product interest. Better analysis puts the focus on the areas that require
more attention, improving the strategies that increase profits and make the
business more productive.
 Identifying patterns: Huge amounts of sophisticated data offer many
opportunities for insight once visualized. Visualization permits business
users to recognize relationships and patterns in the data, giving it greater
meaning. Exploring these patterns helps users concentrate on the specific
areas of the data that need attention, and establish how important those
areas are for driving the business forward.
 Detecting Errors: Visualizing our data helps quickly determine any errors
within it. If the data suggest incorrect actions, visualization helps detect
the inaccurate data sooner so that it can be removed from the analysis.
 Exploring business insights: In the current competitive business
environment, finding data correlations through visual representations is
essential to uncovering business insights. Exploring these insights is
important for business users and executives to set the correct path toward
achieving business goals.
 Efficient Storytelling: Data visualization is acknowledged as a method of
displaying data to produce insights that support better decisions, i.e.,
telling the story behind the data. It can present facts, raise and draw
attention to crucial insights, and communicate them visually and
persuasively.
 Discovery of Latest Trends in the Market: Using data visualization, we can
discover the most recent trends in the business, deliver a quality product,
and detect issues before they arise. By staying on top of trends, we can
direct more effort toward increased profits for the business.

Installation of Pandas
To get started you need to install Pandas using pip:
pip install pandas
Importing necessary libraries and data files
Once Pandas is installed, import the required libraries and load your data
from the sample CSV files df1 and df2:
import numpy as np
import pandas as pd
df1 = pd.read_csv('df1', index_col=0)
df2 = pd.read_csv('df2')

3.11.2. Pandas DataFrame Plots


Pandas provides several built-in plotting functions to create various types of
charts mainly focused on statistical data. These plots described below help
visualize trends, distributions, and relationships within the data.
1. Line Plots using Pandas DataFrame
A line plot shows how data values change along a number line and is best used
when the data is a time series. It can be created using the DataFrame.plot()
function.
df2.plot()

We get a basic line plot for df2 without specifying any options in the
.plot() function.
2. Area Plots using Pandas DataFrame
Area plot shows data with a line and fills the space below the line with color. It
helps see how things change over time. We can plot it
using the DataFrame.plot.area() function.
df2.plot.area(alpha=0.4)

3. Bar Plots using Pandas DataFrame


A bar chart presents categorical data with rectangular bars with heights or
lengths proportional to the values that they represent. The bars can be plotted
vertically or horizontally with the DataFrame.plot.bar() function.
df2.plot.bar()

4. Histogram Plot using Pandas DataFrame
Histograms help visualize the distribution of data by grouping values into bins.
Pandas uses the DataFrame.plot.hist() function to plot histograms.
df1['A'].plot.hist(bins=50)

5. Scatter Plot using Pandas DataFrame


Scatter plots are used when you want to show the relationship between two
variables. They are also called correlation plots and can be created
using the DataFrame.plot.scatter() function.
df1.plot.scatter(x ='A', y ='B')

6. Box Plots using Pandas DataFrame
A box plot displays the distribution of data, showing the median, quartiles, and
outliers. We can use DataFrame.plot.box() or DataFrame.boxplot() to
create it.
df2.plot.box()

7. Hexagonal Bin Plots using Pandas DataFrame


Hexagonal binning helps manage dense datasets by using hexagons instead of
individual points. It’s useful for visualizing large datasets where points may
overlap. Let’s create the hexagonal bin plot.
df.plot.hexbin(x ='a', y ='b', gridsize = 25, cmap ='Oranges')

8. Kernel Density Estimation plot (KDE) using Pandas DataFrame
KDE (Kernel Density Estimation) creates a smooth curve to show the shape of
data by using the df.plot.kde() function. It’s useful for visualizing data
patterns and simulating new data based on real examples.
df2['a'].plot.kde()

3.11.3. Customizing Plots


Pandas allows you to customize your plots in many ways. You can change
things like colors, titles, labels, and more. Here are some common
customizations.
1. Adding a Title, Axis Labels, and Gridlines
You can customize the plot by adding a title and labels for the x and y axes. You
can also enable gridlines to make the plot easier to read:
df.plot(title='Customized Line Plot', xlabel='Index', ylabel='Values', grid=True)

2. Line Plot with Different Line Styles
If you want to differentiate between the two lines visually you can change the
line style (e.g., solid line, dashed line) with the help of pandas.
df.plot(style=['-', '--', '-.', ':'], title='Line Plot with Different Styles',
xlabel='Index', ylabel='Values', grid=True)

3. Adjusting the Plot Size


Change the size of the plot to better fit the presentation or analysis context.
You can change it by using the figsize parameter:
df.plot(figsize=(12, 6), title='Line Plot with Adjusted Size', xlabel='Index',
ylabel='Values', grid=True)

4. Stacked Bar Plot


A stacked bar plot can be created by setting stacked=True. It helps you
visualize the cumulative value for each index.

df.plot.bar(stacked=True, figsize=(10, 6), title='Stacked Bar Plot',
xlabel='Index', ylabel='Values', grid=True)

3.12 Pandas Objects

Pandas objects can be thought of as enhanced versions of NumPy structured


arrays in which the rows and columns are identified with labels rather than
simple integer indices.
Pandas provides a host of useful tools, methods, and functionality on top of the
basic data structures, but nearly everything that follows will require an
understanding of what these structures are.
There are three fundamental Pandas data structures: the Series, DataFrame, and
Index. We begin with the standard NumPy and Pandas imports:
import numpy as np
import pandas as pd

3.12.1 The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created


from a list or array as follows:

Input:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

The Series wraps both a sequence of values and a sequence of indices, which
we can access with the values and index attributes.

The values are simply a familiar NumPy array


Input:
data.values
Output:
array([ 0.25, 0.5 , 0.75, 1. ])

The index is an array-like object of type pd.Index


Input:
data.index
Output:
RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the
familiar Python square-bracket notation:

Example 1:
Input:
data[1]
Output:
0.5

Example 2:
Input:
data[1:3]
Output:
1 0.50
2 0.75
dtype: float64

The Pandas Series is much more general and flexible than the one-dimensional
NumPy array.
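For instance (a small sketch), the index need not be a sequence of integers; explicit labels of other types can be used and then indexed by label:

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data['b'])  # 0.5 -- item access by the explicit string label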

3.12.2 The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame. Like


the Series object discussed in the previous section, the DataFrame can be
thought of either as a generalization of a NumPy array, or as a specialization of
a Python dictionary

DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices,


a DataFrame is an analog of a two-dimensional array with both flexible row
indices and flexible column names. Just as you might think of a two-dimensional
array as an ordered sequence of aligned one-dimensional columns, you can think
of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we
mean that they share the same index.

To demonstrate this, let's construct a new Series listing the area of each of the
five states

Input:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64

Let us now define the population Series referred to above; as in the area
example, we can use a dictionary to construct a single two-dimensional object
containing this information:

Input:
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
states = pd.DataFrame({'population': population, 'area': area})
states
Output:
Output:

area population

California 423967 38332521

Florida 170312 19552860

Illinois 149995 12882135

New York 141297 19651127

Texas 695662 26448193

3.12.3 The Pandas Index Object

Here that both the Series and DataFrame objects contain an explicit index that
lets you reference and modify data. This Index object is an interesting structure
in itself, and it can be thought of either as an immutable array or as an ordered
set (technically a multi-set, as Index objects may contain repeated values).
Those views have some interesting consequences in the operations available
on Index objects.

Let's construct an Index from a list of integers:

Input:
ind = pd.Index([2, 3, 5, 7, 11])
ind
Output:
Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as immutable array

The Index in many ways operates like an array. For example, we can use
standard Python indexing notation to retrieve values or slices:

Example 1
Input:
ind[1]
Output:
3

Example 2
Input:
ind[::2]
Output:
Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

Input:
print(ind.size, ind.shape, ind.ndim, ind.dtype)
Output:
5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices are
immutable–that is, they cannot be modified via the normal means:

Input:
ind[1] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0

/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py
in __setitem__(self, key, value)
1243
1244 def __setitem__(self, key, value):
-> 1245 raise TypeError("Index does not support mutable operations")
1246
1247 def __getitem__(self, key):

TypeError: Index does not support mutable operations


This immutability makes it safer to share indices between multiple DataFrames
and arrays, without the potential for side effects from inadvertent index
modification.

3.13 Indexing and selecting data

The axis labeling information in pandas objects serves many purposes:

 Identifies data (i.e. provides metadata) using known indicators, important


for analysis, visualization, and interactive console display.
 Enables automatic and explicit data alignment.
 Allows intuitive getting and setting of subsets of the data set.

The Python and NumPy indexing operators [] and attribute operator . provide
quick and easy access to pandas data structures across a wide range of use
cases. This makes interactive work intuitive, as there’s little new to learn if you
already know how to deal with Python dictionaries and NumPy arrays. However,
since the type of the data to be accessed isn’t known in advance, directly using
standard operators has some optimization limits. For production code, we
recommend that you take advantage of the optimized pandas data access
methods.

3.13.1 Different choices for indexing


Object selection has had a number of user-requested additions in order to
support more explicit location based indexing. pandas now supports three types
of multi-axis indexing.

 .loc is primarily label based, but may also be used with a boolean
array. .loc will raise KeyError when the items are not found. Allowed
inputs are:

o A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of
the index. This use is not an integer position along the index.).
o A list or array of labels ['a', 'b', 'c'].
o A slice object with labels 'a':'f' (Note that contrary to usual Python
slices, both the start and the stop are included, when present in
the index! See Slicing with labels and Endpoints are inclusive.)
o A boolean array (any NA values will be treated as False).
o A callable function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the
above).
o A tuple of row (and column) indices whose elements are one of
the above inputs.
See more at Selection by Label.

 .iloc is primarily integer position based (from 0 to length-1 of the axis),


but may also be used with a boolean array. .iloc will raise IndexError if a
requested indexer is out-of-bounds, except slice indexers which allow
out-of-bounds indexing. (this conforms with
Python/NumPy slice semantics). Allowed inputs are:
o An integer e.g. 5.
o A list or array of integers [4, 3, 0].
o A slice object with ints 1:7.
o A boolean array (any NA values will be treated as False).

o A callable function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the
above).
o A tuple of row (and column) indices whose elements are one of
the above inputs.
Destructuring tuple keys into row (and column) indexes occurs before callables
are applied, so you cannot return a tuple from a callable to index both rows and
columns.
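The difference between the two indexers can be seen on a small labeled DataFrame (a brief sketch; the frame below is illustrative):

import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
print(df_demo.loc['a':'b'])  # label-based slice: both endpoints are included
print(df_demo.iloc[0:2])     # position-based slice: the stop position is excluded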

The primary function of indexing with [] (a.k.a. __getitem__ for those familiar
with implementing class behavior in Python) is selecting out lower-dimensional
slices. The following table shows return type values when indexing pandas
objects with []:

Object Type    Selection         Return Value Type
Series         series[label]     scalar value
DataFrame      frame[colname]    Series corresponding to colname
Let us construct a simple time series data set to use for illustrating the indexing
functionality:

Input:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df

Output:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885

3.13.2 Attribute access
The index on a Series, or a column on a DataFrame, can be accessed directly as
an attribute:

sa = pd.Series([1, 2, 3], index=list('abc'))

dfa = df.copy()
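With these objects defined, labels can be read as attributes; a quick sketch (equivalent to the usual [] access):

print(sa.b)   # 2, the same as sa['b']
print(dfa.A)  # column 'A' of the copied DataFrame, the same as dfa['A']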

3.13.3 Slicing ranges


The most robust and consistent way of slicing ranges along arbitrary axes is
described in the Selection by Position section detailing the .iloc method. For
now, we explain the semantics of slicing using the [] operator.
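For example, with the df of dates constructed above, [] slicing operates on the rows (a minimal sketch):

print(df[:3])    # first three rows
print(df[::-1])  # all rows in reverse order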

3.14 Handling Missing Data

In pandas, missing data can also be referred to as NA (Not Available) values.


Many datasets simply have missing data when they are imported into
DataFrame, either because the data was never gathered or because it was
present but was not captured. For example, suppose different individuals being
surveyed opt not to reveal their income, and some users choose not to share
their address; as a result, several datasets go missing.

Pandas support two values to represent missing data:

 None: None is a Python singleton object that is commonly used in Python


programs to represent missing data.
 NaN: Also known as Not a Number, or NaN, is a particular floating-point
value that is accepted by all systems that employ the IEEE standard for
floating-point representation.

Handling missing data will go through each step in more detail, but here's a
general idea:

1. We start by importing the necessary packages.


2. We use the read_csv() function to read the dataset.
3. The dataset is printed. And we check if any record has NaN values or
missing data.
4. On the dataset, we apply the dropna() function. The records that contain
missing values are deleted by this procedure. In order to remove the
entries and update the dataset in the same variable, we additionally
pass the argument inplace=True.
5. The dataset is printed. No records contain missing values anymore.
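A minimal sketch of this workflow (the file name employees.csv and its contents are assumed for illustration):

import pandas as pd

df = pd.read_csv('employees.csv')  # Step 2: read the dataset
print(df)                          # Step 3: inspect the data for NaN values
print(df.isnull().sum())
df.dropna(inplace=True)            # Step 4: delete records containing missing values
print(df)                          # Step 5: no records contain missing values anymore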

3.14.1 Calculation with Missing Data

None is a Python singleton object that is frequently used for missing data in
Python programs. Because it is a Python object, None can only be used in
arrays of the data type "object" (i.e., arrays of Python objects), and cannot be
used in any other NumPy/Pandas array:

import numpy as np
import pandas as pd
array = np.array([3, None, 0, 4, None])
array

Output:
array([3, None, 0, 4, None], dtype=object)

The alternative missing data representation, NaN (an acronym for Not a
Number), is distinct; it is a unique floating-point value recognized by all
systems that utilize the common IEEE floating-point notation

import numpy as np
import pandas as pd
array = np.array([3, np.nan, 0, 4, np.nan])
array
Output:
array([ 3., nan, 0., 4., nan])

It's important to note that NumPy selected a native floating-point type for
this array, which means that, in contrast to the object array from earlier, this
array supports rapid operations pushed down into compiled code.
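One practical consequence of the floating-point NaN (a short sketch): ordinary aggregations propagate NaN, while NumPy's NaN-aware variants skip it:

import numpy as np

array = np.array([3, np.nan, 0, 4, np.nan])
print(array.sum())       # nan, since NaN propagates through arithmetic
print(np.nansum(array))  # 7.0, the NaN-aware aggregate ignores missing entries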

3.14.2 Cleaning Missing Data


The result of the isna() and isnull() methods is a Boolean check of whether
or not each cell of the DataFrame has a missing value. In this way, if a value
is absent from a certain cell, the function will return True; otherwise, it will
return False.
#Import the libraries
import numpy as np
import pandas as pd

# Create a CSV dataset


data_string = '''ID,Gender,Salary,Country,Company
1,Male,15000,India,Google
2,Female,45000,China,NaN
3,Female,25000,India,Google
4,NaN,NaN,Australia,Google
5,Male,NaN,India,Google
6,Male,54000,NaN,Alibaba
7,NaN,74000,China,NaN
8,Male,14000,Australia,NaN
9,Female,15000,NaN,NaN
10,Male,33000,Australia,NaN'''

with open('salary.csv', 'w') as out:
    out.write(data_string)

# Import the dataset


df = pd.read_csv('salary.csv')
print('Salary Dataset: \n', df)

# Check for missing data


print('Missing Data\n', df.isna())

print('Missing Data\n', df.isnull())

# Print only missing data


print('Filter based on columns: \n', df[df.isnull().any(axis=1)])

# Sum up the missing values


print('Sum up the missing values: \n', df.isnull().sum())

Output:
Salary Dataset:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
2 3 Female 25000.0 India Google
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Missing Data
ID Gender Salary Country Company
0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True

7 False False False False True
8 False False False True True
9 False False False False True
Missing Data
ID Gender Salary Country Company
0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True
7 False False False False True
8 False False False True True
9 False False False False True
Filter based on columns:
ID Gender Salary Country Company
1 2 Female 45000.0 China NaN
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Sum up the missing values:
ID 0
Gender 2
Salary 2
Country 2
Company 5
dtype: int64

3.14.3 Important Functions for Handling Missing Data in Pandas

isnull()

The isnull() method returns a dataframe of boolean values that are True for
NaN values when checking null values in a Pandas DataFrame.

notnull()

The notnull() method returns a dataframe of boolean values that are False for
NaN values when checking for null values in a Pandas Dataframe.

dropna()

The dropna() method is used to remove null values from a dataframe. This
function removes rows and columns of datasets containing null values in several
ways.

fillna()

By using the fillna(), replace(), and interpolate() functions, we can fill in any
null values in a dataset by replacing NaN values with alternative values. All
these functions help fill null values in the datasets of a DataFrame. The
interpolate() function is mostly used to fill NA values in a dataframe, but
rather than hard-coding the replacement value it uses various interpolation
techniques to compute it.

replace()

Using the replace method, we can not only replace or fill null values but any
value specified as a function attribute. We specify the value to be replaced
in to_replace and the new value in value.

interpolate()

Another useful Pandas capability is substituting missing values with values that
make sense; the interpolate() function is employed for this. By default it fills
the gaps using the midpoints between the surrounding points (linear
interpolation). Naturally, if the data were curvilinear, a function could be
fitted to it to determine the missing values in a different way.
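A brief sketch showing these functions side by side on a small Series (the values are chosen only for illustration):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.fillna(0))            # fill NaN with a constant value
print(s.replace(np.nan, -1))  # replace any specified value, here NaN with -1
print(s.interpolate())        # fill NaN with linearly interpolated values (2.0 and 4.0)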

3.15 Hierarchical Indexing

So far we have worked with one-dimensional and two-dimensional data, stored in
Pandas Series and DataFrame objects, respectively. Often it is useful to go
beyond this and store higher-dimensional data, that is, data indexed by more
than one or two keys. While Pandas does provide Panel and Panel4D objects
that natively handle three-dimensional and four-dimensional data,
higher-dimensional data can also be compactly represented within the familiar
one-dimensional Series and two-dimensional DataFrame objects by means of
hierarchical indexing.

A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data within


a one-dimensional Series. For concreteness, we will consider a series of data
where each point has a character and numerical key.

3.15.1 Creating a MultiIndex

Let’s start by creating a MultiIndex. Assume we have data on students’ scores in

different subjects across various semesters. Here’s how we can create a

MultiIndex DataFrame:
import pandas as pd
# Creating a sample dataset
arrays = [['Semester 1', 'Semester 1', 'Semester 2', 'Semester 2'],
['Math', 'Science', 'Math', 'Science']]

index = pd.MultiIndex.from_arrays(arrays, names=('Semester', 'Subject'))


data = {
'Student A': [88, 92, 85, 90],
'Student B': [78, 85, 80, 88],
'Student C': [92, 90, 93, 95]
}

df = pd.DataFrame(data, index=index)
print(df)

Output:

Student A Student B Student C


Semester Subject
Semester 1 Math 88 78 92
Science 92 85 90
Semester 2 Math 85 80 93
Science 90 88 95

3.15.2 Accessing Data in a MultiIndex DataFrame


You can access data in a MultiIndex DataFrame using various methods:
Using .loc[] with Tuple Indexing

# Accessing scores of Student A in Math during Semester 1


print(df.loc[('Semester 1', 'Math'), 'Student A'])

Output:
88

3.15.3 Using Slicing


# Accessing all scores for Semester 1
print(df.loc['Semester 1'])
Output:
Student A Student B Student C

Subject
Math 88 78 92
Science 92 85 90

3.15.4 Advanced Operations with Hierarchical Indexing


3.15.4.1 Swapping Levels
You can swap the levels of a MultiIndex to facilitate different analyses:
# Swapping levels
swapped_df = df.swaplevel('Semester', 'Subject')
print(swapped_df)
Output:
Student A Student B Student C
Subject Semester
Math Semester 1 88 78 92
Semester 2 85 80 93
Science Semester 1 92 85 90
Semester 2 90 88 95
3.15.4.2 Sorting the MultiIndex
Sorting is crucial for efficient slicing:
# Sorting the index
sorted_df = df.sort_index()
print(sorted_df)
Output:
Student A Student B Student C
Semester Subject
Semester 1 Math 88 78 92
Science 92 85 90
Semester 2 Math 85 80 93
Science 90 88 95
3.15.4.3 Grouping by Level

You can group data by a specific level in the MultiIndex


# Grouping by 'Semester' level and calculating mean scores
mean_scores = df.groupby(level='Semester').mean()
print(mean_scores)
Output:
Student A Student B Student C
Semester
Semester 1 90.0 81.5 91.0
Semester 2 87.5 84.0 94.0

3.16 Combining Datasets

Using pandas to Combine DataFrames

pandas is the most common library for handling datasets in Python. You can
use concat(), merge(), or join() to combine datasets.

3.16.1 Using concat() (Stacking DataFrames)

concat() is used for stacking DataFrames either vertically (rows) or horizontally


(columns).

import pandas as pd

# Sample data

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [4, 5, 6], 'Name': ['David', 'Eve', 'Frank']})

# Vertical concatenation (stacking rows)

df_combined = pd.concat([df1, df2], ignore_index=True)

print(df_combined)

For horizontal concatenation (adding columns):

df3 = pd.DataFrame({'Age': [25, 30, 35]})

df_combined = pd.concat([df1, df3], axis=1)

print(df_combined)

3.16.2 Using merge() (SQL-style Joins)

Use merge() when combining datasets based on a key column, like an SQL join.

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [50000, 60000, 70000]})

# Inner join (only matching IDs)

df_merged = df1.merge(df2, on='ID', how='inner')

print(df_merged)

# Left join (keep all records from df1)

df_left = df1.merge(df2, on='ID', how='left')

print(df_left)

# Outer join (keep all records from both datasets)

df_outer = df1.merge(df2, on='ID', how='outer')

print(df_outer)

3.16.3 Using join() (Joining on Index)

If DataFrames have the same index, you can use join().

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])


df2 = pd.DataFrame({'Salary': [50000, 60000, 70000]}, index=[1, 2, 3])

df_joined = df1.join(df2)
print(df_joined)

3.16.4. Using NumPy for Array-based Concatenation


If your data is in NumPy arrays, you can use numpy.concatenate().

import numpy as np

arr1 = np.array([[1, 2], [3, 4]])


arr2 = np.array([[5, 6], [7, 8]])

# Stacking vertically
arr_combined = np.vstack((arr1, arr2))
print(arr_combined)

# Stacking horizontally
arr_combined_h = np.hstack((arr1, arr2))
print(arr_combined_h)

3.16.5. Appending Data


For pandas, the older .append() method is deprecated in favor of concat(), so use pd.concat() to append rows:

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})


df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined)

Method Use Case


concat() Stacking datasets vertically or horizontally
merge() SQL-like joins on a common key
join() Joining based on index
NumPy concatenate() Combining NumPy arrays

.append() Appending rows (use concat() instead)

3.17. Aggregating and Grouping Data


Grouping and aggregating data is essential for summarizing and analyzing
datasets. The groupby() function in pandas allows you to group data by one or
more columns and perform aggregate operations like sum, mean, count, etc.

3.17.1. Basic groupby() Usage


You can group data by a column and apply aggregate functions like sum(),
mean(), count(), etc.
import pandas as pd
# Sample dataset
data = {'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 55000, 60000, 62000, 70000, 72000]}
df = pd.DataFrame(data)
# Group by 'Department' and get the total salary per department
df_grouped = df.groupby('Department')['Salary'].sum()
print(df_grouped)

Output:
Department
HR 122000

IT 142000
Sales 105000
Name: Salary, dtype: int64

3.17.2. Applying Multiple Aggregations


You can use .agg() to apply multiple functions at once.
df_grouped = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])
print(df_grouped)
Output:
sum mean count
Department
HR 122000 61000.0 2
IT 142000 71000.0 2
Sales 105000 52500.0 2

3.17.3. Grouping by Multiple Columns


You can group by more than one column.
data = {'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
'Location': ['NY', 'NY', 'CA', 'CA', 'TX', 'TX'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 55000, 60000, 62000, 70000, 72000]}
df = pd.DataFrame(data)
# Group by 'Department' and 'Location' and sum salaries
df_grouped = df.groupby(['Department', 'Location'])['Salary'].sum()
print(df_grouped)
Output:
Department Location
HR CA 122000
IT TX 142000
Sales NY 105000
Name: Salary, dtype: int64

3.17.4. Using transform() for Group-wise Calculations
If you want to perform group calculations and retain the original structure of the
DataFrame, use transform().
df['Avg_Dept_Salary'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
3.17.5. Pivot Tables for Aggregation
For a cleaner table format, use pivot_table().
df_pivot = df.pivot_table(index='Department', values='Salary', aggfunc=['sum',
'mean', 'count'])
print(df_pivot)
Method Use Case
groupby() Aggregate data based on one or more columns
agg() Apply multiple aggregation functions
transform() Apply transformations while keeping the original structure
pivot_table() Reshape and summarize data in a table

3.18. Joins
Joins in pandas work similarly to SQL joins, allowing you to merge data from
multiple DataFrames based on common columns. The main function for joins in
pandas is merge(). Their types are stated below.

Join Type Description


Inner Join Keeps only matching rows from both tables.
Left Join Keeps all rows from the left table and matching rows from the right.
Right Join Keeps all rows from the right table and matching rows from the left.

Outer Join Keeps all rows from both tables, filling missing values with NaN.

Sample DataFrames
Let's create two DataFrames to demonstrate different types of joins.
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Salary': [60000, 70000, 80000, 90000]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
df1
ID Name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 David
df2
ID Salary
0 3 60000
1 4 70000
2 5 80000
3 6 90000

3.18.1. Performing Different Types of Joins


1. Inner Join (Keeps only matching rows)
df_inner = df1.merge(df2, on='ID', how='inner')
print(df_inner)

Output:
ID Name Salary
0 3 Charlie 60000
1 4 David 70000

2. Left Join (Keeps all rows from df1, fills missing values from df2)
df_left = df1.merge(df2, on='ID', how='left')
print(df_left)
Output:
ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 60000.0
3 4 David 70000.0

3. Right Join (Keeps all rows from df2, fills missing values from df1)
df_right = df1.merge(df2, on='ID', how='right')
print(df_right)
Output:
ID Name Salary
0 3 Charlie 60000
1 4 David 70000
2 5 NaN 80000
3 6 NaN 90000

4. Outer Join (Keeps all rows from both tables, fills missing values with NaN)
df_outer = df1.merge(df2, on='ID', how='outer')
print(df_outer)
Output:
ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 60000.0
3 4 David 70000.0
4 5 NaN 80000.0
5 6 NaN 90000.0

5. Joining on Multiple Columns
If your DataFrames have multiple keys, you can join on multiple columns.
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Dept': ['HR', 'IT', 'HR'], 'Name': ['Alice',
'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Dept': ['HR', 'IT', 'Finance'], 'Salary':
[50000, 60000, 70000]})
df_multi = df1.merge(df2, on=['ID', 'Dept'], how='inner')
print(df_multi)

6. Using join() for Index-based Joins


If you want to join DataFrames based on their index, use join().
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df_index_join = df1.join(df2, how='inner') # Default is left join
print(df_index_join)

Method                          Description                                     Example
.merge(on='col', how='inner')   Matches only common values                      df1.merge(df2, on='ID', how='inner')
.merge(on='col', how='left')    Keeps all from left, fills missing from right   df1.merge(df2, on='ID', how='left')
.merge(on='col', how='right')   Keeps all from right, fills missing from left   df1.merge(df2, on='ID', how='right')
.merge(on='col', how='outer')   Keeps all records from both                     df1.merge(df2, on='ID', how='outer')
.join()                         Index-based join                                df1.join(df2, how='inner')

3.19. Pivot Tables
Pivot tables are used to summarize and analyze data in a structured way. They
work similarly to Excel pivot tables and are useful for aggregating and
restructuring large datasets.

Creating a Simple Pivot Table


The pivot_table() function in pandas is used to create pivot tables.
Example Dataset
import pandas as pd
# Sample data
data = {
'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Sales', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hank'],
'Salary': [50000, 55000, 60000, 62000, 70000, 72000, 58000, 75000],
'Bonus': [5000, 7000, 8000, 6000, 9000, 7500, 6500, 8000],
'Location': ['NY', 'NY', 'CA', 'CA', 'TX', 'TX', 'NY', 'TX']
}
df = pd.DataFrame(data)
print(df)

3.19.1.Creating a Pivot Table


pivot = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
print(pivot)
Output:
Salary
Department
HR 61000.0
IT 72333.3
Sales 54333.3
This shows the average salary per department.

3.19.2. Multiple Aggregations in Pivot Tables
You can apply multiple aggregation functions at the same time.
pivot = df.pivot_table(index='Department', values=['Salary', 'Bonus'],
aggfunc=['sum', 'mean'])
print(pivot)
Output:
sum_salary mean_salary sum_bonus mean_bonus
Department
HR 122000 61000.0 14000.0 7000.0
IT 217000 72333.3 24500.0 8166.7
Sales 163000 54333.3 18500.0 6166.7

3.19.3. Grouping by Multiple Columns


You can group by more than one category using index and columns.
pivot = df.pivot_table(index=['Department', 'Location'], values='Salary',
aggfunc='sum')
print(pivot)
Output:
Salary
Department Location
HR CA 122000
IT TX 217000
Sales NY 163000

3.19.4. Using columns Parameter to Add a Second Dimension


pivot = df.pivot_table(index='Department', columns='Location', values='Salary',
aggfunc='mean')
print(pivot)
Output:
Location CA NY TX
Department
HR 61000.0 NaN NaN
IT NaN NaN 72333.3

Sales NaN 54333.3 NaN

3.19.5. Including Totals with margins=True


pivot = df.pivot_table(index='Department', values='Salary', aggfunc='mean',
margins=True)
print(pivot)
Output:
Salary
Department
HR 61000.0
IT 72333.3
Sales 54333.3
All 62750.0

3.19.6. Filling Missing Values with fill_value


pivot = df.pivot_table(index='Department', columns='Location', values='Salary',
aggfunc='sum', fill_value=0)
print(pivot)

This replaces NaN with 0 for better readability.


Feature Code Example
Basic Pivot Table df.pivot_table(index='Department', values='Salary',
aggfunc='mean')
Multiple Aggregations df.pivot_table(index='Department', values=['Salary',
'Bonus'], aggfunc=['sum', 'mean'])
Multiple Index Columns df.pivot_table(index=['Department', 'Location'],
values='Salary', aggfunc='sum')
Column-based Grouping df.pivot_table(index='Department',
columns='Location', values='Salary', aggfunc='sum')

Include Totals df.pivot_table(index='Department', values='Salary',
aggfunc='sum', margins=True)
Fill Missing Values df.pivot_table(index='Department',
columns='Location', values='Salary', aggfunc='sum',
fill_value=0)

3.20. String Operations


Working with text data in Python is easy with built-in string methods and
pandas string functions (.str) for handling text data in DataFrames.
3.20.1. Basic String Methods
These methods work on individual strings.
text = " Hello, World! "
print(text.lower()) # Convert to lowercase
print(text.upper()) # Convert to uppercase
print(text.strip()) # Remove leading & trailing spaces
print(text.replace("Hello", "Hi")) # Replace substring
print(text.split(",")) # Split into a list
Output:
hello, world!
HELLO, WORLD!
Hello, World!
Hi, World!
[' Hello', ' World! ']

3.20.2. String Operations in pandas (.str)


When working with DataFrames, use .str for string operations.
import pandas as pd
data = {'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown'],
'Email': ['alice@example.com', 'bob_smith@gmail.com',
'charlie@company.org']}

df = pd.DataFrame(data)
print(df)

3.20.3. Converting Case


df['Name_Lower'] = df['Name'].str.lower() # Lowercase
df['Name_Upper'] = df['Name'].str.upper() # Uppercase
print(df)

3.20.4. Splitting Strings


df['First_Name'] = df['Name'].str.split().str[0] # Extract first word
df['Last_Name'] = df['Name'].str.split().str[1] # Extract second word
print(df)

3.20.5. Extracting Parts of a String


df['Domain'] = df['Email'].str.split("@").str[1]  # Extract domain name
df['Username'] = df['Email'].str.extract(r'(^[\w\d_.-]+)')  # Extract username before @
print(df)

3.20.6. Replacing Text


df['Email'] = df['Email'].str.replace("gmail", "outlook") # Change domain
print(df)

3.20.7. Finding Text


df['Has_Company'] = df['Email'].str.contains("company")  # Boolean: whether 'company' exists
print(df)

3.20.8. String Length


df['Name_Length'] = df['Name'].str.len() # Count characters in Name

print(df)
3.20.9. Removing Spaces & Special Characters
df['Cleaned_Name'] = df['Name'].str.strip()  # Remove extra spaces
df['No_Special_Chars'] = df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)  # Keep only letters and spaces
print(df)
3.20.10. Using apply() for Custom String Functions
If .str methods don’t cover your case, use apply().
df['Custom'] = df['Name'].apply(lambda x: x[::-1]) # Reverse the name
print(df)
Operation Example
Convert to lowercase df['Name'].str.lower()
Convert to uppercase df['Name'].str.upper()
Split a column df['Name'].str.split().str[0]
Extract from string df['Email'].str.extract(r'(\w+)')
Replace substring df['Email'].str.replace("gmail", "outlook")
Check if substring exists df['Email'].str.contains("company")
Remove spaces df['Name'].str.strip()
Remove special characters df['Name'].str.replace(r'[^a-zA-Z ]', '',
regex=True)
Apply a custom function df['Name'].apply(lambda x: x[::-1])

3.21 Time Series Data


Time series data refers to a sequence of data points collected or recorded at
successive time intervals. It is used to analyze trends, patterns, and behaviors
over time. Time series data can be continuous or discrete, depending on the
nature of the data.

3.21.1 Key Aspects of Time Series Data
1. Time Interval: The data points are indexed in time order, with each
data point associated with a specific timestamp (e.g., daily, weekly,
monthly).
2. Examples (Areas of Application):
o Stock prices
o Temperature readings over a period
o Sales data over months or years
o Economic indicators (like GDP, inflation rates)
3. Components of Time Series Data:
o Trend: The long-term movement or direction in the data (e.g., an
upward or downward trend in stock prices).
o Seasonality: Regular, repeating fluctuations in the data within
fixed periods (e.g., higher ice cream sales in summer).
o Cyclic Patterns: Irregular fluctuations that do not have a fixed
period (e.g., economic booms and recessions).
o Noise: Random variations or fluctuations in the data that do not
follow any pattern.
4. Time Series Analysis: Techniques such as ARIMA (AutoRegressive
Integrated Moving Average), exponential smoothing, and decomposition
methods are used to analyze time series data, forecast future values, and
uncover underlying patterns.

3.21.2 Types of Time Series Data


a) Univariate vs. Multivariate Time Series
 Univariate: A single variable or feature recorded over time (e.g., daily
temperature).
 Multivariate: Multiple variables recorded over time (e.g., daily
temperature, humidity, and wind speed).

b) Regular vs. Irregular Time Series:
 Regular: Data points are recorded at consistent time intervals (e.g.,
hourly, daily).
 Irregular: Data points are recorded at inconsistent time intervals.
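A small sketch of the two cases using pandas date handling:

import pandas as pd

regular = pd.date_range(start='2025-01-01', periods=5, freq='D')        # consistent daily intervals
irregular = pd.to_datetime(['2025-01-01', '2025-01-04', '2025-01-09'])  # inconsistent gaps
print(regular)
print(irregular)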

3.21.3 Python Libraries


In Python, working with time series data is commonly done using libraries such
as Pandas, NumPy, Matplotlib, Statsmodels, and Scikit-learn. Below is
an overview of how to handle time series data in Python, along with some basic
steps to manipulate and analyze the data.With python tools such as Pandas for
manipulation, Matplotlib for visualization, and Statsmodels for statistical
modeling, you can process, analyze, and forecast time series data efficiently.

3.21.3.1 Importing Libraries


To begin working with time series data, we first need to import the necessary
libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

3.21.3.2 Loading Time Series Data


Time series data can be loaded from various sources like CSV files, Excel files,
or APIs. In this case, let’s assume we’re loading data from a CSV file
df = pd.read_csv('path_to_your_file.csv', parse_dates=['Date'],
index_col='Date')
# Load time series data with 'Date' as the timestamp column

print(df.head())
# Check the structure of the data

parse_dates=['Date']: This tells Pandas to parse the Date column as datetime


objects.
index_col='Date': This sets the Date column as the index of the DataFrame,
which is essential for time series analysis.
head() - It is used to quickly view the first few rows of a DataFrame or Series,
making it easier to inspect the data.

3.21.3.3 Inspecting and Exploring Time Series Data


Once the data is loaded, it’s important to check some basic properties like
summary statistics, and the frequency of time intervals. Also need to
understand the structure of the data and identify any missing or anomalous
entries.
print(df.head()) # Check the first few rows to inspect the data
print(df.describe()) # Summary statistics for numerical columns
print(df.isnull().sum()) # Check for missing values
print(df.index.freq) # Check the frequency of the time series data
(daily, monthly, etc.)

3.21.3.4 Resampling Time Series Data


Time series data often needs to be aggregated or resampled to a different
frequency (e.g., from daily to monthly data). Pandas makes this easy through
the resample() method.

# Resample to monthly data using the mean of the values


df_monthly = df.resample('M').mean()
# Resample to weekly data using the sum of the values
df_weekly = df.resample('W').sum()
# View the resampled data

print(df_monthly.head())
print(df_weekly.head())
'M' stands for monthly, 'W' stands for weekly, and you can use other time
frequencies like 'D' for daily, 'Q' for quarterly, etc.
mean() and sum() aggregate the data by taking the average and sum,
respectively.

3.21.3.5 Plotting Time Series Data


Visualizing your time series data helps you detect underlying trends, seasonality,
and outliers. You can easily create line plots with Matplotlib.

Matplotlib is a popular Python library used for creating static, animated, and
interactive visualizations. It is highly customizable and can be used to
generate a wide range of plots, including line plots, bar charts, histograms,
scatter plots, 3D plots, and more.
Example Template
df.plot(figsize=(10, 6),
title='Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
df_monthly.plot(figsize=(10, 6), title='Monthly Resampled Data')
plt.xlabel('Date')
plt.ylabel('Average Value')
plt.show()

3.21.3.6 Decomposing Time Series Data


Time series data often has three main components: Trend, Seasonality, and
Noise (residuals). You can decompose the series into these components using
the seasonal_decompose() function from Statsmodels.

Example Template
decomposition = seasonal_decompose(df, model='additive',
period=12)
decomposition.plot()
plt.show()
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

seasonal_decompose() decomposes the series into trend, seasonal, and
residual components.
model='additive' assumes that the components add together (you can also use
'multiplicative' if they multiply together).
period=12 defines the seasonal period, for instance 12 for twelve months
(monthly data with a yearly cycle).

3.21.3.7 Making Time Series Forecasts (ARIMA Model)


ARIMA (AutoRegressive Integrated Moving Average) is one of the most popular
models for forecasting time series data. It requires tuning three parameters: p
(autoregressive), d (differencing), and q (moving average).

model = ARIMA(df['your_column_name'], order=(1, 1, 1))


model_fit = model.fit()
print(model_fit.summary())
forecast = model_fit.forecast(steps=10)
print(forecast)

Python offers a variety of libraries and techniques for time-series forecasting,


and one popular method is the autoregressive integrated moving average
(ARIMA) model. ARIMA is a powerful and widely used approach that combines the
following three components to capture the patterns and trends in time-series
data:
1. Autoregression (AR)
2. Differencing (I)
3. Moving Average (MA)

The moving average (MA) model calculates the average of past observations to
forecast future values. It helps eliminate short-term fluctuations and identify
underlying trends in the data. The autoregressive (AR) model predicts future
values using past observations and a linear regression equation. It assumes that
the future values depend on the previous values with a lag. Differencing (I) in
time series analysis refers to a method of transforming a time series dataset by
subtracting the previous value from the current value.
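For instance, the differencing step can be applied directly in pandas before fitting (a minimal sketch; the column name value follows the examples used later in this section):

df_diff = df['value'].diff().dropna()  # first-order differencing: x[t] - x[t-1]
print(df_diff.head())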

3.21.3.8. Forecasting Future Values


Once the model is trained, you can use it to forecast future values. If you want
to forecast beyond the current data range, simply call forecast() with the
desired number of steps.
forecast_values = model_fit.forecast(steps=30)
# Forecast the next 30 time points

3.21.3.9 Stationarity
Stationarity is a key concept in time-series analysis: the statistical
properties of the series, such as its mean and variance, remain constant over
time. In Python, testing for stationarity involves methods like the Augmented
Dickey-Fuller (ADF) test, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test,
and visual inspection of time series plots.

from statsmodels.tsa.stattools import adfuller


result = adfuller(timeseriesdata)
print('ADF Statistic:', result[0])

print('p-value:', result[1])

3.21.3.10 Autocorrelation and Partial Autocorrelation

Autocorrelation measures the relationship between a variable's current and past


values at different time lags. On the other hand, partial autocorrelation
quantifies the direct relationship between a variable's current value and its past
values, excluding the influence of intermediate-lagged variables.
In Python, testing for autocorrelation and partial autocorrelation often involves
plotting the autocorrelation function (ACF) and partial autocorrelation function
(PACF) and observing the patterns.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf


plot_acf(timeseriesdata)
plot_pacf(timeseriesdata)
plt.show()

3.21.4 Pandas Time Series Library


The core library for time-series analysis in Python is pandas. Pandas provides a
rich set of functionalities for analyzing and visualizing time-series data. You can
perform various operations, including data aggregation, filtering, and computing
summary statistics. Additionally, pandas integrates well with visualization
libraries like Matplotlib, allowing you to create insightful plots and charts to
explore patterns and trends in the data.
With pandas, you can perform basic analysis and visualization of time-series
data. The central data structure in pandas is the DataFrame, which serves as
the primary unit for representing time-series data.

Here's an example that demonstrates the steps of loading and working with
time-series data using pandas in Python:

Example Program

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Load time-series Data


dates = pd.date_range(start='2025-01-01', periods=100)
values = np.sin(np.linspace(0, 2*np.pi, 100))
data = pd.DataFrame({'Date': dates, 'Value': values})

# Step 2: Perform Data Analysis


# Calculate summary statistics
summary_stats = data.describe()

# Filter data based on specific conditions


filtered_data = data[data['Value'] > 0]

# Resample data to a different frequency


resampled_data = data.resample('1W', on='Date').sum()

# Step 3: Visualize time-series Data


plt.plot(data['Date'], data['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()

3.21.5 Matplotlib Library and Time Series Data
Python provides the Matplotlib library, which includes the Pyplot module for
creating various types of plots, including line plots, scatter plots, and
histograms. Plotting time-series data is an essential step in visualizing patterns,
trends, and anomalies.

Example Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate random time-series data
np.random.seed(0)
dates = pd.date_range(start='2025-02-01', periods=100)
values = np.random.randn(100).cumsum()
# Plot the time-series data
plt.plot(dates, values)
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')

plt.show()

Output: (line plot of the random cumulative time series over the date range)

3.21.6 Combining Pandas and Matplotlib for Time Series Data


 Load the Data: Import your time series data into a Pandas DataFrame.
 Preprocess the Data: Perform any cleaning or transformation, such as
setting a DateTime index or filling missing values.
 Plot the Data: Use Matplotlib to create line plots or other visualizations of
the time series.
Example Program
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42) # Generate random time-series data
dates = pd.date_range(start='2025-01-05', periods=100, freq='D')
values = np.random.randn(100).cumsum()

# Create a DataFrame from the generated data
data = pd.DataFrame({'date': dates, 'value': values})
# Set the 'date' column as the index
data.set_index('date', inplace=True)
# Plot the time-series data
plt.plot(data.index, data['value'])
plt.xlabel('Time')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()
Output: (line plot of the time series with the date index on the x-axis)

3.21.7 Time Series Analysis with the statsmodels Python Library

The statsmodels library in Python provides various tools for performing time
series analysis, including models for autoregressive (AR), moving average (MA),
and integrated (I) processes, which are the building blocks for models like
ARIMA (AutoRegressive Integrated Moving Average). It also supports other time
series models like SARIMA (Seasonal ARIMA) and state space models.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.api import SimpleExpSmoothing, Holt,
ExponentialSmoothing
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

# Generate Simulated Time Series Data


np.random.seed(42)
n = 200
time = pd.date_range(start="2025-01-01", periods=n, freq="D")
trend = np.linspace(10, 50, n)
seasonality = 10 * np.sin(np.linspace(0, 2 * np.pi, n))
noise = np.random.normal(0, 2, n)
data = trend + seasonality + noise
# Create a DataFrame; split the data
df = pd.DataFrame({"date": time, "value": data})
df.set_index("date", inplace=True)
hold_out_days = 30
train = df.iloc[:-hold_out_days]
hold_out = df.iloc[-hold_out_days:]
# Plot the Data
plt.figure(figsize=(10, 6))
plt.plot(df.index, df["value"], label="Full Dataset", color="Blue")
plt.plot(hold_out.index, hold_out["value"], label="Hold-Out (True Values)",
color="Green")
plt.title("Simulated Time Series with Training and Hold-Out Sets")
plt.xlabel("Date")

plt.ylabel("Value")
plt.legend()
plt.grid()
plt.savefig("simulated_time_series.png")
plt.show()
Output: (plot of the simulated series showing the training and hold-out sets)

3.21.7.1 Time Series Decomposition – Template


from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df["value"], model="additive", period=30)
# Plot the components
fig = decomposition.plot()
plt.show()

3.21.7.2 Time Series Stationarity

The Augmented Dickey-Fuller (ADF) test is a statistical test that tells whether
a time series is stationary, i.e., whether its statistical properties change
over time or not.

from statsmodels.tsa.stattools import adfuller


# Perform ADF test
result = adfuller(df["value"])
print(f"ADF Statistic: {result[0]:.4f}")
print(f"P-Value: {result[1]:.4f}")
if result[1] > 0.05:
    print("The time series is non-stationary.")
else:
    print("The time series is stationary.")

Output
ADF Statistic: -0.5022
P-Value: 0.8916
The time series is non-stationary.

3.21.7.3 Autocorrelation and Partial Autocorrelation

In Python, you can calculate and plot the Autocorrelation Function (ACF) and
Partial Autocorrelation Function (PACF) using the statsmodels library. These
plots help in analyzing the autocorrelations and partial autocorrelations of time
series data, which are important for model selection in time series forecasting.

Template
# Plot ACF and PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
plot_acf(df["value"], lags=30, ax=axes[0])
plot_pacf(df["value"], lags=30, ax=axes[1])
plt.suptitle("ACF and PACF Plots", fontsize=16)
plt.savefig("acf_pacf_plots.png")
plt.show()

3.21.7.4 ARIMA model to the data for forecasting
ARIMA stands for AutoRegressive Integrated Moving Average. It is a popular
statistical method used for time series forecasting and modeling. ARIMA is a
powerful tool for forecasting univariate (single variable) time series data,
especially when the data exhibits patterns such as trends and seasonality.
Template
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df["value"], order=(2, 1, 2))
arima_result = model.fit()
print(arima_result.summary())
# Plot residual diagnostics before saving the figure
arima_result.plot_diagnostics(figsize=(10, 8))
plt.savefig("arima_residuals_diagnostics.png")
plt.show()

The ARIMA model is fitted with the specified order (2, 1, 2) where:
o 2: AR (AutoRegressive) term
o 1: Differencing order (1st difference to make the series stationary)
o 2: MA (Moving Average) term

3.22 High Performance Pandas

Pandas is a powerful Python library for data analysis, but it can be slow when
working with large datasets. Here are some techniques to improve the
performance of your Pandas code:
1. Vectorization:
 Avoid explicit loops whenever possible. Pandas and NumPy are optimized
for vectorized operations, which are much faster than iterating through
rows.
 Use Pandas' built-in functions like apply(), map(), and agg() for efficient
data manipulation.
2. Data Types:
 Choose the smallest appropriate data type for your columns. For
example, use int8 instead of int64 if your data allows it. This can
significantly reduce memory usage and improve performance.
 Use category data type for columns with a limited number of unique
values.

3. Memory Optimization:
 Read only the columns you need from your data files.
 Use chunking to process large files in smaller parts.
 Delete unnecessary dataframes to free up memory.
4. Parallel Processing:
 Use libraries like Dask or Ray to parallelize your Pandas operations across
multiple cores or machines.
5. Just-in-Time Compilation:
 Use Numba to compile your Python functions to machine code for faster
execution.
6. GPU Acceleration:
 Consider using cuDF, a GPU-accelerated library with a Pandas-like API,
for even greater performance gains.
7. Profiling and Benchmarking:
 Use tools like cProfile or line_profiler to identify performance bottlenecks
in your code.
 Benchmark different approaches to choose the most efficient solution.
8. Other Tips:
 Use pd.eval() and df.query() for faster expression evaluation.
 Avoid creating unnecessary copies of dataframes.
 Use inplace operations when possible.
 Optimize your data import process.
By applying these techniques, you can significantly improve the performance of
your Pandas code and work more efficiently with large datasets.
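
To make the vectorization and data-type points concrete, here is a minimal
sketch; the column names and sizes are illustrative, not from any particular
dataset:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "price": np.random.rand(n) * 100,
    "qty": np.random.randint(1, 10, size=n),
    "city": np.random.choice(["Chennai", "Delhi", "Mumbai"], size=n),
})

# Slow alternative: a Python-level loop over rows
# totals = [row.price * row.qty for row in df.itertuples()]

# Fast: vectorized arithmetic over whole columns
df["total"] = df["price"] * df["qty"]

# Data-type optimization: downcast integers and categorize strings
before = df.memory_usage(deep=True).sum()
df["qty"] = df["qty"].astype("int8")        # values 1-9 fit in int8
df["city"] = df["city"].astype("category")  # few unique values
after = df.memory_usage(deep=True).sum()
print(f"memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")

# df.query() for faster expression evaluation on large frames
subset = df.query("total > 500 and city == 'Chennai'")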
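
And a chunked-processing sketch for files too large to load at once; the file
name and column are hypothetical:

import pandas as pd

total, count = 0.0, 0
# Read the (hypothetical) large CSV 100,000 rows at a time
for chunk in pd.read_csv("large_sales.csv", usecols=["amount"],
                         chunksize=100_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print("overall mean:", total / count)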

ASSIGNMENTS

Very Easy Level
1. Create a NumPy array of zeros with shape (4,4) and change the diagonal elements to 1. (K1, CO3)

Easy Level
2. Write a program to perform element-wise addition, subtraction, multiplication, and division of two NumPy arrays. (K1, CO3)

Medium Level
3. Handle missing data: create a DataFrame with some missing values (NaN), fill the missing values with the column mean, and drop rows with missing values. (K4, CO3)

Hard Level
4. Create a pivot table summarizing average scores by gender and subject, and generate a crosstab showing the count of students passing or failing by department. (K4, CO3)

Very Hard Level
5. Explore working with large datasets: load a large dataset (e.g., 100,000+ rows) using chunksize, process each chunk separately, and calculate summary statistics. (K4, CO3)
Part – A Questions and Answers

1. What is NumPy and why is it used? (K1,CO3)

NumPy (Numerical Python) is a Python library used for numerical computing. It
provides support for multidimensional arrays, mathematical operations, and
linear algebra functions.

2. What is the difference between a list and a NumPy array? (K2,CO3)

Lists are built-in Python data structures that store heterogeneous data, whereas
NumPy arrays are homogeneous, optimized for fast numerical computations.

3. What is a NumPy ndarray? (K4,CO3)


A NumPy ndarray (n-dimensional array) is a multi-dimensional container for
homogeneous data, meaning all elements must be of the same type.

import numpy as np
arr=np.array([1,2,3,4,5])
print(type(arr))
output:
<class 'numpy.ndarray'>

4. What are NumPy constants? Give an example. (K3,CO3)

Constants are predefined values like np.pi (π), np.e (Euler’s number).

import numpy as np
radius = 2
circumference = 2 * np.pi * radius
print(circumference)
Output:
12.566370614359172

5. What is Matplotlib used for? (K3,CO3)
Matplotlib is used for creating static, animated, and interactive visualizations
in Python. For example:

import matplotlib.pyplot as plt


x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()

6. What is Matplotlib? (K1,CO3)


Matplotlib is a Python library used for data visualization like line plots, scatter
plots, and histograms.

7.How do you plot a line graph using Matplotlib? (K1,CO3)


import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
plt.show()

8. What is a Pandas Series? (K1,CO3)

A Pandas Series is a one-dimensional labeled array capable of holding any data
type. The Series wraps both a sequence of values and a sequence of indices,
which we can access with the values and index attributes. The Pandas Series is
much more general and flexible than the one-dimensional NumPy array.

9. What is Pandas used for? (K1,CO3)

Pandas is a Python library for data manipulation and analysis using Series and
DataFrames.

10. How can you handle missing data in Pandas? (K1,CO3)


Using df.fillna(value) to replace missing values or df.dropna() to remove missing
data.

11. What is a Pandas DataFrame? (K1,CO3)

A DataFrame is a two-dimensional table-like structure with labeled rows and
columns.

import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Salary': [60000, 70000, 80000, 90000]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)

12. What is the difference between .loc[] and .iloc[] in Pandas?
(K1,CO3)

 .loc[] is label-based indexing.
 .iloc[] is position-based indexing.
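
A minimal sketch with a hypothetical DataFrame showing the difference:

import pandas as pd

# String index labels, so label- and position-based access differ visibly
df = pd.DataFrame({'score': [85, 90, 78]}, index=['a', 'b', 'c'])
print(df.loc['b'])   # label-based: the row labelled 'b'
print(df.iloc[1])    # position-based: the second row (position 1)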

13. What is hierarchical indexing in Pandas? (K1,CO3)


Hierarchical indexing allows multiple index levels on a DataFrame, enabling
more complex data structures.

14. What is a pivot table in Pandas? (K1,CO3)
A pivot table is a table that summarizes data using aggregation functions like
sum, mean, or count.
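
A small illustrative sketch (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
})
# Average sales for each Region/Product combination
print(pd.pivot_table(df, values='Sales', index='Region',
                     columns='Product', aggfunc='mean'))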

15. How do you perform an inner join on two DataFrames? (K3,CO3)

df1.merge(df2, on='key', how='inner')

16. How do you improve Pandas performance? (K1,CO3)


Using vectorized operations, efficient data types (category, Int32), and libraries
like Dask or NumPy.

17. What is hierarchical indexing? (K2,CO3)

Hierarchical indexing allows multiple levels of indexing in Pandas.

18. What is the difference between merge and join in Pandas? (K1,CO3)

merge() is flexible for different types of joins, while join() is primarily for
index-based merging.

19. What is time series data in Pandas? (K2,CO3)

Time series data consists of indexed data points based on time, useful for trend
analysis.

20. How do you group data by a column and compute the mean?
(K1,CO3)

df.groupby('Category')['Sales'].mean().

21. What is resampling in time-series analysis? (K1,CO3)

Resampling is changing the frequency of time-series data (e.g., daily to
monthly).
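
A minimal sketch with a hypothetical daily series resampled to monthly means:

import numpy as np
import pandas as pd

idx = pd.date_range('2024-01-01', periods=90, freq='D')
daily = pd.Series(np.arange(90), index=idx)
print(daily.resample('M').mean())  # 'M' = month-end frequency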

22. What is the advantage of using .apply() in Pandas? (K1,CO3)

It allows custom functions to be applied to rows or columns efficiently.

23. What is vectorization in Pandas? (K1,CO3)

Performing operations without explicit loops, improving performance.
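
For example, a minimal sketch contrasting the two styles:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(1_000_000)})

# Vectorized: one operation over the whole column at C speed
df['y'] = df['x'] * 2 + 1

# Equivalent explicit loop (much slower, shown for contrast):
# df['y'] = [v * 2 + 1 for v in df['x']]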

24. How do you improve performance when handling large datasets in Pandas? (K1,CO3)

Use vectorized operations and chunksize while reading large files.

25. How do you create a hierarchical index in Pandas? (K3,CO3)

import pandas as pd

data = pd.Series([10, 20, 30], index=[['A', 'A', 'B'], [1, 2, 1]])

print(data)

26. What is the advantage of using NumPy over Python lists? (K1,CO3)

NumPy arrays are faster, memory-efficient, and optimized for numerical
computations.

27. What are the different types of joins in Pandas? (K1,CO3)

 inner (default)
 outer
 left
 right

28. What is the benefit of using .apply() over loops in Pandas?
(K1,CO3)
.apply() is more concise and generally faster than writing explicit Python loops
over DataFrame rows/columns, although fully vectorized operations are faster still.

29. What is the purpose of data visualization in Pandas? (K2,CO3)

Data visualization helps in understanding trends, patterns, and insights from a
dataset by representing data graphically.

30. What function is used to create a histogram in Pandas? (K1,CO3)

The .hist() function is used to create a histogram.

df['column_name'].hist()

Part – B Questions

1. Explain the concept of NumPy arrays and their advantages over Python lists. Write a program to create a 2D NumPy array and demonstrate indexing, slicing, and reshaping. (CO3, K1)

2. What is Matplotlib? Explain different types of plots available in Matplotlib. Write a Python program to plot a line graph, scatter plot, and histogram using NumPy arrays. (CO3, K1)

3. Define a Pandas Series and DataFrame. How do they differ? Write a Python program to create a Pandas DataFrame from a dictionary and perform the following operations: (CO3, K4)
 Display the first 5 rows.
 Select a specific column and filter rows based on a condition.
 Sort the DataFrame by a specific column in descending order.
(Marks Distribution: Theory – 4, Code – 4, Output & Explanation – 4)

4. Explain different methods of selecting and filtering data in Pandas. Write a program to demonstrate indexing using .loc[] and .iloc[] on a sample DataFrame. (CO3, K4)

5. What are missing values in Pandas? Why is handling missing data important? Write a Python program to create a Pandas DataFrame with missing values and perform the following operations: (CO3, K4)
 Check for missing values.
 Fill missing values with the column mean.
 Drop rows with missing values.

6. Explain different ways to merge and join Pandas DataFrames. Write a Python program to perform inner join and outer join on two DataFrames using a common key. (CO3, K4)

7. What is a pivot table in Pandas? How does groupby() work? Write a Python program to create a DataFrame and use the groupby() function to find the average salary by department. Also, create a pivot table summarizing sales by product and region. (CO3, K4)

8. Explain how Pandas handles string operations. Write a Python program to create a Pandas DataFrame with a text column and perform the following operations: (CO3, K4)
 Convert all text to lowercase.
 Replace spaces with underscores.
 Find the number of times a word appears in the column.

9. Explain time-series analysis in Pandas. Write a Python program to generate a time-series dataset with a date column and perform the following: (CO3, K4)
 Convert the column to DateTime format.
 Extract the year, month, and day from the date column.
 Resample the data to show monthly averages.

10. What is vectorization in Pandas? How does it improve performance? Write a Python program to apply a function to an entire column using both .apply() and vectorized operations. Compare their performance using IPython's %timeit magic. (CO3, K4)
Supportive online Certification courses (NPTEL, Swayam, Coursera, Udemy, etc.)

1. Python for Machine Learning with Numpy, Pandas & Matplotlib – Udemy
   https://www.udemy.com/course/python-for-machine-learning-with-numpy-and-pandas/

2. NumPy, SciPy, Matplotlib & Pandas A-Z: Machine Learning – Udemy
   https://www.udemy.com/course/numpy-scipy-matplotlib-pandas-a-z-machine-learning/

3. Data Analysis with Python – Coursera
   https://www.coursera.org/learn/python-data-analysis

4. Python for Data Science – edX
   https://www.edx.org/course/python-for-data-science

5. Python Data Analysis – LinkedIn Learning
   https://www.linkedin.com/learning/python-data-analysis-2

Real time Applications in day to day life and to Industry

1. NumPy Applications
Daily Life Applications
✅ Scientific Calculations: Used in weather forecasting to process large climate
datasets.
✅ Personal Finance Management: Helps calculate monthly budgets, track
expenses, and analyze financial trends using array operations.
✅ Gaming: NumPy is used in game physics (e.g., calculating object
movements, collision detection).
✅ Image Processing: NumPy arrays represent images, allowing easy
manipulation (e.g., Instagram filters, facial recognition).
Industry Applications
🔹 Artificial Intelligence & Machine Learning: Used in deep learning frameworks
like TensorFlow and PyTorch for handling multidimensional data efficiently.
🔹 Robotics & Automation: Helps in motion planning and sensor data processing
in autonomous vehicles.
🔹 Healthcare & Genomics: Used in medical image analysis (MRI, CT scans) and
DNA sequence data processing.
🔹 Stock Market Analysis: Helps analyze large financial datasets for predicting
market trends.
2. Matplotlib Applications
Daily Life Applications
✅ Tracking Personal Goals: Used to plot weight loss, fitness progress,
expenses, or investment growth.
✅ Smart Home Systems: Visualizing power consumption trends in smart
meters.
✅ Social Media Analytics: Helps influencers track their engagement rates and
follower growth over time.

Industry Applications
🔹 Business Analytics: Companies use Matplotlib to create sales reports,
performance dashboards, and revenue trends.
🔹 Healthcare & Medicine: Used to visualize patient heart rate, sugar levels, or
other health indicators.
🔹 Engineering & Manufacturing: Engineers plot stress-strain curves, thermal
expansion graphs, and real-time monitoring of machine performance.
🔹 Weather Forecasting: Scientists analyze climate data to predict rainfall,
temperature trends, and hurricanes.
3. Pandas Applications
Daily Life Applications
✅ Personal Expense Tracking: Helps in managing monthly expenses, grocery
budgets, and travel costs.
✅ Sports Performance Analysis: Used by coaches to analyze player statistics,
win-loss records, and game trends.
✅ E-commerce Price Comparison: Used in online shopping sites to analyze
product prices across different platforms.
✅ Social Media Management: Helps influencers and marketers track
engagement metrics.
Industry Applications
🔹 Banking & Finance: Used for fraud detection, customer segmentation, and
loan approval analysis.
🔹 Retail & E-Commerce: Helps companies like Amazon, Flipkart, and Walmart
analyze sales trends, customer preferences, and demand forecasting.
🔹 Supply Chain & Logistics: Used to optimize delivery routes, manage
warehouse inventory, and reduce transportation costs.
🔹 Healthcare: Hospitals use Pandas for patient records analysis, predicting
disease outbreaks, and optimizing medical resources.

CONTENT BEYOND THE SYLLABUS

1. Advanced NumPy Topics


🔹 Broadcasting in NumPy: Allows operations between arrays of different
shapes without explicit loops.
🔹 Memory Optimization in NumPy: Using views and strides to manipulate data
without memory overhead.
🔹 Numba & Cython for Speeding Up NumPy: How to use JIT (Just-In-Time)
compilation to accelerate array operations.
🔹 Multithreading & Parallel Processing in NumPy: Using multiprocessing and
joblib for handling large datasets efficiently.
🔹 Sparse Matrices & SciPy Integration: Handling large, sparse datasets
commonly used in machine learning & scientific computing.

2. Advanced Data Visualization with Matplotlib


🔹 3D Plotting with Matplotlib: Creating 3D scatter plots, surface plots, and
wireframes.
🔹 Animation in Matplotlib: Using FuncAnimation to create dynamic, real-time
visualizations.
🔹 Seaborn & Plotly for Interactive Graphs: Moving beyond Matplotlib for more
visually appealing and interactive plots.
🔹 Heatmaps & Geospatial Data Visualization: How to plot geographical data,
population density maps, and heatmaps.
🔹 Dashboards with Matplotlib & Dash: Building real-time business dashboards
for industry applications.

3. Beyond Pandas: High-Performance & Big Data Handling


🔹 Dask - Parallel Computing for Pandas: Works like Pandas but scales to big
data with parallel execution.

🔹 Vaex - Handling Billion-Row Datasets: An alternative to Pandas for fast,
out-of-core DataFrame operations.
🔹 Modin - Speeding Up Pandas with Multi-Core Processing: Uses parallel
execution to accelerate Pandas operations.
🔹 SQL Integration with Pandas: Using Pandas with databases (SQLite, MySQL,
PostgreSQL) for real-world data analytics.
🔹 Time Series Forecasting with Pandas: Advanced analysis techniques for
predicting stock prices, weather trends, and sales forecasting.

4. Industry Applications and Case Studies


🔹 Real-Time Stock Market Analysis using Pandas & NumPy

🔹 Using NumPy for AI & Deep Learning with TensorFlow/PyTorch

🔹 Financial Risk Prediction using Pandas & Matplotlib


🔹 Big Data Analytics in E-commerce using Pandas and Dask

🔹 Healthcare Data Analysis & Patient Trend Forecasting

5. Machine Learning & AI Beyond NumPy & Pandas


🔹 Scikit-Learn - The Next Step After Pandas:
 Used for machine learning models, feature selection, and data
preprocessing.
🔹 Deep Learning with NumPy & Pandas:
 How NumPy is used in deep learning models, neural network weights,
and tensor operations.
🔹 Natural Language Processing (NLP) with Pandas:
 Analyzing text datasets, sentiment analysis, and chatbot development.
🔹 Web Scraping with Pandas & BeautifulSoup:
 Extracting real-time data from websites and storing it in Pandas
DataFrames.

Assessment Schedule (Proposed Date & Actual Date)

Assessment Tool | Proposed Date | Actual Date | Course Outcome
Assessment I    | 24.02.2025    | 24.02.2025  | CO1, CO2
Assessment II   | 01.04.2025    | 01.04.2025  | CO3, CO4
Model           | 28.04.2025    | 28.04.2025  | CO1, CO2, CO3, CO4, CO5, CO6
Prescribed Text Books & References

Text Books:
1. Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
2. Jake VanderPlas, "Python Data Science Handbook – Essential tools for working with data", O'Reilly, 2017.
3. Steve Abrams, "Artificial Intelligence and Machine Learning for Beginners: A simple guide to understanding and Applying AI and ML", Independently published, May 14, 2024.

Reference Books:
4. Vinod Chandra S S, Anand Hareendran S, "Artificial Intelligence and Machine Learning", PHI Learning, 2014.
5. Russell, S. and Norvig, P., "Artificial Intelligence: A Modern Approach", Third Edition, Prentice Hall, 2010.
6. Ethem Alpaydın, "Introduction to Machine Learning", Second Edition, The MIT Press, Cambridge, Massachusetts, London, England.
7. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", 2nd Edition, Taylor & Francis Group, 2015.
8. Tom M. Mitchell, "Machine Learning", McGraw-Hill Science, ISBN: 0070428077.
9. Mayuri Mehta, Vasile Palade, Indranath Chatterjee, "Explainable AI: Foundations, Methodologies and Applications", Springer, 2023.
10. Siddhartha Bhattacharyya, Indrajit Pan, Ashish Mani, Sourav De, Elizabeth Behrman, Susanta Chakraborti, "Quantum Machine Learning", De Gruyter Frontiers in Computational Intelligence, 2020.
MINI PROJECT SUGGESTIONS

Very Easy Level
1. Convert a color image to grayscale using NumPy arrays and perform basic image manipulations (e.g., flipping, rotation). (K4, CO1)

Easy Level
2. Build a Python program that allows users to input matrices and perform operations like addition, multiplication, and inversion. (K4, CO1)

Medium Level
3. Use publicly available COVID-19 data to analyze cases, recoveries, and deaths across different countries and visualize the trends. (K4, CO1)

Hard Level
4. Analyze an e-commerce dataset and classify customers based on spending behavior. (K4, CO1)

Very Hard Level
5. Scrape Twitter data, process text using Pandas, and visualize public sentiment on trending topics. (K4, CO1)
THANK YOU

Disclaimer:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
