Unit 3
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community. If you are not the addressee you should not
disseminate, distribute or copy it through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
24AM201
INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
(Lab Integrated)
Department: AIML/CSBS/CSE/ECE/IT
Created by:
Date: 17.02.2025
Table of Contents

Sl. No. | Contents
1 | Contents
2 | Course Objectives
5 | Course Outcomes
6 | CO-PO/PSO Mapping
7 | Lecture Plan (S.No., Topic, No. of Periods, Proposed Date, Actual Lecture Date, Pertaining CO, Taxonomy Level, Mode of Delivery)
8 | Activity Based Learning
9 | Lecture Notes (with links to videos, e-book references, PPTs, quizzes and any other learning materials)
10 | Assignments (for higher-level learning and evaluation - examples: case study, comprehensive design, etc.)
11 | Part A Q & A (with K level and CO)
Introduction – Types of AI – ANI, AGI, ASI – Narrow, General, Super AI, Examples - AI
problems – Production Systems – State space Representation – Applications of AI in
various industries.
List of Exercise:
K6 Evaluation
K5 Synthesis
K4 Analysis
K3 Application
K2 Comprehension
K1 Knowledge
CO – PO/PSO Mapping

CO – PO/PSO Mapping Matrix

CO  | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO1 |  3   3   3   2   2   3   3   3   3   2    2    2   |  3    2    2
CO2 |  3   3   3   2   2   3   3   3   3   2    2    2   |  3    2    2
CO3 |  3   3   3   2   2   3   3   3   3   2    2    2   |  3    3    2
CO4 |  3   3   3   2   2   3   3   3   3   2    2    2   |  3    3    2
CO5 |  3   3   3   2   2   3   3   3   3   2    2    2   |  3    3    3
CO6 |  3   3   3   2   2   3   3   3   3   2    3    3   |  3    3    3
UNIT – III
ARTIFICIAL INTELLIGENCE
Lecture Plan – Unit III – ARTIFICIAL INTELLIGENCE

Sl. No. | Topic | No. of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1 | Introduction to NumPy | 1 | 1.03.2025 | 1.03.2025 | CO3 | K1 | PPT
4 | Combining datasets | 1 | 10.03.2025 | 10.03.2025 | CO3 | K1 | PPT
6 | High performance Pandas | 1 | 12.03.2025 | 12.03.2025 | CO3 | K1 | PPT
Activity Based Learning
Improve your Python coding ability through each of the problem statements below.
1. Calculation of sum and product of (a) individual elements (b) collection of elements.
2. Identify maximum and minimum (a) using library functions (b) without using library
functions.
3. Develop code that enables you to calculate quartiles.
4. Enable variance calculations with (a) math library (b) numpy (c) pandas.
5. Represent a set of data by a representative value which would approximately define the
entire collection.
6. Execute standard deviation calculations with (a) math library (b) numpy (c) pandas.
7. Compare covariance and correlation through python coding.
8. Learn different types of plots and try to enhance your data-based storytelling skill.
Lecture Notes – Unit III
UNIT III ARTIFICIAL INTELLIGENCE
In Python we have lists, which serve the purpose of arrays, but they are slow
to process. NumPy aims to provide an array object that is up to 50 times
faster than traditional Python lists. Unlike lists, NumPy arrays are stored at one
continuous place in memory, so processes can access and manipulate them
very efficiently. This behavior is called locality of reference in computer science,
and it is the main reason why NumPy is faster than lists.
NumPy is also optimized to work with the latest CPU architectures.
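As a rough illustration of this speed difference, the sketch below times summing one million numbers stored as a plain Python list versus a NumPy array (the absolute timings will vary by machine):
import timeit

setup = "import numpy as np; lst = list(range(1_000_000)); arr = np.arange(1_000_000)"
t_list = timeit.timeit("sum(lst)", setup=setup, number=10)   # Python-level loop
t_arr = timeit.timeit("arr.sum()", setup=setup, number=10)   # vectorized C loop
print("list sum :", t_list)
print("numpy sum:", t_arr)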
Once NumPy is imported and ready to use, try out the simple code below to check:
import numpy
arr=numpy.array([1, 2, 3, 4, 5])
print(arr)
You will get an output as below:
[1 2 3 4 5]
NumPy is usually imported under the np alias.
The version string is stored under __version__ attribute.
Try the below code to check
import numpy as np
arr=np.array([1,2,3,4,5])
print(arr)
print("np version is ",np.__version__)
output:
[1 2 3 4 5]
np version is 1.26.4
import numpy as np
arr=np.array([1,2,3,4,5])
print(type(arr))
output:
<class 'numpy.ndarray'>
An array that has 2-D arrays (matrices) as its elements is a 3-D array; it can represent a 3rd order tensor.
Higher Dimensional Arrays: an array can have any number of dimensions. To define the number of dimensions when the array is created, use the ndmin argument:
arr = np.array([1, 2, 3, 4], ndmin=3)
import numpy as np
a = np.array(42)                                   # 0-D array (a scalar)
b = np.array([1, 2, 3, 4, 5])                      # 1-D array
c = np.array([[1, 2, 3], [4, 5, 6]])               # 2-D array
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])  # 3-D array
arr = np.array([1, 2, 3, 4], ndmin=4)              # forced to 4 dimensions
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
print(arr.ndim)
output:
0
1
2
3
4
NumPy arrays provide the ndim attribute, which returns an integer telling
us how many dimensions the array has.
import numpy as np
arr = np.array([1, 2, 3, 4],ndmin=4)
print(arr)
output:
[[[[1 2 3 4]]]]
1-D array:
When NumPy arrays have one dimension, their elements are arranged as a
list and can be accessed using a single index, as shown below:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
a[0] # get the 0-th element of the array
output: 1
Definition:
A multi-dimensional array is an array with more than one level or dimension.
For example, a 2D array, or two-dimensional array, is an array of arrays,
meaning it is a matrix of rows and columns (think of a table). A 3D array
adds another dimension, turning it into an array of arrays of arrays.
Method 1:
Start with a 1-dimensional array and use the NumPy reshape() function, which
rearranges the elements of that array into a new shape.
Method 2:
The NumPy functions zeros(), ones(), and empty() can also be used to
create arrays with more than one dimension:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.reshape(
a, # the array to be reshaped
(2,3) # dimensions of the new array
)
c = np.zeros((3,4))  # creates a 3x4 array filled with zeros
d = np.ones((3,4))   # creates a 3x4 array filled with ones
e = np.empty((3,4))  # creates an uninitialized 3x4 array (contents are arbitrary; here they happen to be zeros)
print('1D array')
print(a)  # the original 1-dimensional array
print('2D array')
print(b)  # the reshaped array
print('2D array - zeros')
print(c)
print('2D array - ones')
print(d)
print('2D array - empty')
print(e)
Output:
1D array
[1 2 3 4 5 6]
2D array
[[1 2 3]
[4 5 6]]
2D array - zeros
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
2D array - ones
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
2D array - empty
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
import numpy as np
a = np.arange(4)             # 1-D array [0 1 2 3]; values chosen to match the output below
print('1D array:', a)
b = np.reshape(a, (2, 2))    # reshape into a 2x2 matrix
print('reshaped array:')
print(b)
c = b * 10                   # scalar multiplication
print('scalar multiplication')
print(c)
d = np.ones((2, 2))          # 2x2 matrix of ones
print('Matrix d')
e = b+d # addition of two arrays of the same dimensions
print('sum of 2 matrices b&d:')
print(e)
f = b*e # multiplication of two arrays of the same dimensions
print('multiplication of 2 matrices b&e:')
print(f)
g = np.dot(b, e) # matrix multiplication of b and e
print('multiplication using dot function')
print(g)
print('applying numpy mathematical functions')
h = np.cos(g) # compute cosine of all elements of the array g
print(h)
Output:
1D array: [0 1 2 3]
reshaped array:
[[0 1]
[2 3]]
scalar multiplication
[[ 0 10]
[20 30]]
Matrix d
sum of 2 matrices b&d:
[[1. 2.]
[3. 4.]]
multiplication of 2 matrices b&e:
[[ 0. 2.]
[ 6. 12.]]
multiplication using dot function
[[ 3. 4.]
[11. 16.]]
applying numpy mathematical functions
[[-0.9899925 -0.65364362]
[ 0.0044257 -0.95765948]]
3.3. INDEXING
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
output:
6
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
output:
Last element from 2nd dim: 10
Remember that a[i] is the same as a[i, :], i.e. it selects the i-th row of the array.
Syntax:
arrayname[start index:end index+1]
arrayname[:end index+1]
arrayname[:]
We can also define the step, like this: [start:end:step].
Program:
'''selecting elements by row/column
o arrayname[start index:end index+1]
o arrayname[:end index+1]
o arrayname[:]
'''
a = np.reshape(np.arange(30), (5,6)) # create a 5x6 array
print('create matrix with reshape')
print(a)
print('syntax-1')
b = a[1:4, 0:2] #select elements in rows 1-3 and columns 0-1
print(b)
print('syntax 2')
c = a[:3, 2:4] #select elements in rows 0-2 and columns 2-3
print(c)
print('syntax 3')
d = a[:, 0] # select all elements in the 0-th column
print(d)
Output:
create matrix with reshape
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]]
syntax-1
[[ 6 7]
[12 13]
[18 19]]
syntax 2
[[ 2 3]
[ 8 9]
[14 15]]
syntax 3
[ 0 6 12 18 24]
Example 1:
a = np.reshape(np.arange(30), (5,6)) # create a 5x6 array
print(a)
Output:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]]
Example 2:
a = np.reshape(np.arange(30), (5,6)) # create a 5x6 array
b = a[1:4, 0:2] #select elements in rows 1-3 and columns 0-1
c = a[:3, 2:4] #select elements in rows 0-2 and columns 2-3
d = a[:, 0] # select all elements in the 0-th column
print(a[1])
b = a[:3, :3]
print(b)
b[0,0] = 1000
print(b)
print(a)
#use this to change many entries of an array at once
a[:4, :4] = 0 # set all entries of the slice to 0
print(a)
Output:
[ 6 7 8 9 10 11]
[[ 0 1 2]
[ 6 7 8]
[12 13 14]]
[[1000 1 2]
[ 6 7 8]
[ 12 13 14]]
[[1000 1 2 3 4 5]
[ 6 7 8 9 10 11]
[ 12 13 14 15 16 17]
[ 18 19 20 21 22 23]
[ 24 25 26 27 28 29]]
[[ 0 0 0 0 4 5]
[ 0 0 0 0 10 11]
[ 0 0 0 0 16 17]
[ 0 0 0 0 22 23]
[24 25 26 27 28 29]]
3.4. PROPERTIES
In NumPy, attributes are properties of NumPy arrays that provide
information about the array's shape, size, data type, dimension, and so on.
To access NumPy attributes, we use the . (dot) notation. There are
numerous attributes available; some of the commonly used NumPy attributes
are stated below.
Attribute | Description | Usage (for array1 = np.array([[2, 4, 6], [1, 3, 5]]))
ndim | returns the number of dimensions of the array | array1.ndim
size | returns the total number of elements in the array, regardless of the number of dimensions | array1.size
dtype | returns the data type of the elements in the array | array1.dtype
shape | returns a tuple of integers that gives the size of the array in each dimension | array1.shape
itemsize | returns the size (in bytes) of each element in the array | array1.itemsize
data | returns the buffer containing the actual elements of the array; it is like a pointer to the memory location where the array's data is stored in the computer's memory | array1.data
import numpy as np
# create a 2-D array
array1 = np.array([[2, 4, 6],
[1, 3, 5]])
# check the dimension of array1
print(array1.ndim)
# return total number of elements in array1
print(array1.size)
# return a tuple that gives size of array in each dimension
print(array1.shape)
# create an array of integers
array1 = np.array([6, 7, 8])
# check the data type of array1
print(array1.dtype)
# create a 1-D array of 32-bit integers
array2 = np.array([6, 7, 8, 10, 13], dtype=np.int32)
print(array1.itemsize)
print(array2.itemsize)
# print memory address of array1's and array2's data
print("\nData of array1 is: ",array1.data)
print("Data of array2 is: ",array2.data)
Output
2
6
(2, 3)
int64
8
4
Data of array1 is: <memory at 0x7fd869e68ac0>
Data of array2 is: <memory at 0x7fd869e68ac0>
3.5. CONSTANTS
NumPy constants are the predefined fixed values used for mathematical
calculations. Using predefined constants makes our code concise and easier
to read. The most commonly used constants are pi and e.
np.pi
It is a mathematical constant that returns the value of pi(π) as a floating
point number. Its value is approximately 3.141592653589793. Instead of the
long floating point number, we can use the constant np.pi. It makes our
code look clean. Let's see an example.
import numpy as np
radius = 2
circumference = 2 * np.pi * radius
print(circumference)
Output:
12.566370614359172
np.e
It is widely used with exponential and logarithmic functions. Its value is
approximately 2.718281828459045.
Let's see an example.
import numpy as np
y = np.e
print(y)
Output:
2.718281828459045
We usually use the constant e with the function exp(). e is the base of the
exponential function, exp(x), which is equivalent to e^x.
import numpy as np
x = 2
y = np.exp(x)  # calculating e to the power 2
print(y)
Output:
7.38905609893065
further nested lists will create higher-dimensional arrays. In general, any
array object is called an ndarray in NumPy.
>>> import numpy as np
>>> a1D = np.array([1, 2, 3, 4])
>>> a2D = np.array([[1, 2], [3, 4]])
>>> a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
When using numpy.array to define a new array, consider the dtype of the
elements in the array, which can be specified explicitly. This feature gives more
control over the underlying data structures and how the elements are handled.
When values do not fit and you are using a dtype, NumPy may raise an error:
>>> import numpy as np
>>> np.array([127, 128, 129], dtype=np.int8)
Traceback (most recent call last):
...
OverflowError: Python integer 128 out of bounds for int8
An 8-bit signed integer represents integers from -128 to 127. Assigning
the int8 array to integers outside of this range results in overflow. This feature
can often be misunderstood. Performing calculations with mismatched dtypes
gives unwanted results, as the example below shows.
>>> import numpy as np
>>> a = np.array([2, 3, 4], dtype=np.uint32)
>>> b = np.array([5, 6, 7], dtype=np.uint32)
>>> c_unsigned32 = a - b
>>> print('unsigned c:', c_unsigned32, c_unsigned32.dtype)
unsigned c: [4294967293 4294967293 4294967293] uint32
>>> c_signed32 = a - b.astype(np.int32)
>>> print('signed c:', c_signed32, c_signed32.dtype)
signed c: [-3 -3 -3] int64
When performing operations with two arrays of the same dtype (uint32), the
resulting array is of the same type. When you perform operations with
different dtypes, NumPy will assign a new type that satisfies all of the array
elements involved in the computation; here uint32 and int32 can both be
represented as int64.
numpy.arange creates arrays with regularly incrementing values; the dtype of
the result is deduced from the inputs unless the dtype is defined explicitly. With
a fractional step such as 0.1 the array is dtype=float, and due to roundoff
error the stop value is sometimes included.
numpy.linspace will create arrays with a specified number of elements, and
spaced equally between the specified beginning and end values. For example:
>>> import numpy as np
>>> np.linspace(1., 4., 6)
array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])
The advantage of this creation function is that you guarantee the number of
elements and the starting and end point. The
previous arange(start, stop, step) will not include the value stop.
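For instance, a quick side-by-side sketch of the two behaviors:
>>> import numpy as np
>>> np.arange(1, 4)          # the stop value 4 is excluded
array([1, 2, 3])
>>> np.linspace(1., 4., 4)   # the end point 4.0 is included
array([1., 2., 3., 4.])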
b) 2D Array Creation Functions
The 2D array creation functions e.g. numpy.eye, numpy.diag,
and numpy.vander define properties of special matrices represented as 2D
arrays.
np.eye(n, m) defines a 2D identity matrix. The elements where i=j (row index
and column index are equal) are 1 and the rest are 0, as such:
>>> import numpy as np
>>> np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> np.eye(3, 5)
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.]])
numpy.diag can define either a square 2D array with given values along the
diagonal or if given a 2D array returns a 1D array that is only the diagonal
elements. The two array creation functions can be helpful while doing linear
algebra, as such:
>>> import numpy as np
>>> np.diag([1, 2, 3])
array([[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])
>>> np.diag([1, 2, 3], 1)
array([[0, 1, 0, 0],
[0, 0, 2, 0],
[0, 0, 0, 3],
[0, 0, 0, 0]])
>>> a = np.array([[1, 2], [3, 4]])
>>> np.diag(a)
array([1, 4])
vander(x, n) defines a Vandermonde matrix as a 2D NumPy array. Each column
of the Vandermonde matrix is a decreasing power of the input 1D array or list or
tuple, x where the highest polynomial order is n-1. This array creation routine is
helpful in generating linear least squares models, as such:
>>> import numpy as np
>>> np.vander(np.linspace(0, 2, 5), 2)
array([[0. , 1. ],
[0.5, 1. ],
[1. , 1. ],
[1.5, 1. ],
[2. , 1. ]])
>>> np.vander([1, 2, 3, 4], 2)
array([[1, 1],
[2, 1],
[3, 1],
[4, 1]])
>>> np.vander((1, 2, 3, 4), 4)
array([[ 1, 1, 1, 1],
[ 8, 4, 2, 1],
[27, 9, 3, 1],
[64, 16, 4, 1]])
c) General ndarray Creation Functions
The ndarray creation functions e.g. numpy.zeros, numpy.ones,
and random define arrays based upon the desired shape.
The ndarray creation functions can create arrays with any dimension by
specifying how many dimensions and length along that dimension in a
tuple or list.
numpy.zeros will create an array filled with 0 values with the specified shape.
The default dtype is float64:
>>> import numpy as np
>>> np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
>>> np.zeros((2, 3, 2))
array([[[0., 0.],
[0., 0.],
[0., 0.]],
[[0., 0.],
[0., 0.],
[0., 0.]]])
numpy.ones will create an array filled with 1 values. It is identical to zeros in
all other respects as such:
>>> import numpy as np
>>> np.ones((2, 3))
array([[1., 1., 1.],
[1., 1., 1.]])
>>> np.ones((2, 3, 2))
array([[[1., 1.],
[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.],
[1., 1.]]])
The random method will create an array filled with random values between 0
and 1. It is included with the numpy.random library. Below, two arrays are
created with shapes (2,3) and (2,3,2), respectively. The seed is set to 42 so you
can reproduce these pseudorandom numbers:
>>> import numpy as np
>>> from numpy.random import default_rng
>>> default_rng(42).random((2,3))
array([[0.77395605, 0.43887844, 0.85859792],
[0.69736803, 0.09417735, 0.97562235]])
>>> default_rng(42).random((2,3,2))
array([[[0.77395605, 0.43887844],
[0.85859792, 0.69736803],
[0.09417735, 0.97562235]],
[[0.7611397 , 0.78606431],
[0.12811363, 0.45038594],
[0.37079802, 0.92676499]]])
numpy.indices will create a set of arrays (stacked as a one-higher
dimensioned array), one per dimension with each representing variation in that
dimension:
>>> import numpy as np
>>> np.indices((3,3))
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
This is particularly useful for evaluating functions of multiple dimensions on a
regular grid.
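For example (a small sketch), the index grids returned by numpy.indices can be combined to evaluate f(i, j) = i**2 + j**2 at every point of a 3x3 grid:
>>> import numpy as np
>>> i, j = np.indices((3, 3))
>>> i**2 + j**2
array([[0, 1, 4],
       [1, 2, 5],
       [4, 5, 8]])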
numpy.block can assemble an array from nested blocks of arrays. In the example below, the opening lines defining A, B and C were reconstructed to match the printed output:
>>> import numpy as np
>>> A = np.ones((2, 2))
>>> B = np.eye(2, 2)
>>> C = np.zeros((2, 2))
>>> D = np.diag((-3, -4))
>>> np.block([[A, B], [C, D]])
array([[ 1., 1., 1., 0.],
[ 1., 1., 0., 1.],
[ 0., 0., -3., 0.],
[ 0., 0., 0., -4.]])
Other routines use similar syntax to join ndarrays.
Suppose the file simple.csv contains the following data:
x, y
0, 0
1, 1
2, 4
3, 9
Importing simple.csv is accomplished using numpy.loadtxt:
>>> import numpy as np
>>> np.loadtxt('simple.csv', delimiter = ',', skiprows = 1)
array([[0., 0.],
[1., 1.],
[2., 4.],
[3., 9.]])
3.6.5. Creating arrays from raw bytes through strings or buffer usage
There are a variety of approaches one can use. If the file has a relatively simple
format, then one can write a simple I/O library and use the
NumPy fromfile() function and .tofile() method to read and write NumPy arrays
directly (mind your byte order, though!). If a good C or C++ library exists that
reads the data, one can wrap that library with a variety of techniques, though
that is certainly much more work and requires significantly more advanced
knowledge to interface with C or C++.
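A minimal sketch of this raw-byte round trip (the file name data.bin is only an illustration; fromfile() must be given the matching dtype):
import numpy as np

a = np.arange(5, dtype=np.int32)
a.tofile('data.bin')                          # write the raw bytes to disk
b = np.fromfile('data.bin', dtype=np.int32)   # read them back with the same dtype
print(b)                                      # [0 1 2 3 4]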
Matplotlib is a powerful and widely used Python library for creating static,
animated, and interactive data visualizations. This section introduces Matplotlib
and shows how to use it for data visualization with practical implementations.
3.7.1. Installing Matplotlib for Data Visualization
To install Matplotlib type the below command in the terminal.
pip install matplotlib
If you are using Jupyter Notebook, you can install it within a notebook cell by
using:
!pip install matplotlib
3.7.2. Data Visualization with Pyplot using Matplotlib
Matplotlib provides a module called pyplot which offers a MATLAB-like
interface for creating plots and charts. It simplifies the process of generating
various types of visualizations by providing a collection of functions that handle
common plotting tasks.
Matplotlib supports a variety of plots including line charts, bar charts,
histograms, scatter plots, etc. Let’s understand them with implementation using
pyplot.
Line Chart
Line chart is one of the basic plots and can be created using
the plot() function. It is used to represent a relationship between two data X
and Y on a different axis.
Example:
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()
Output:
Bar Chart
A bar chart is a graph that represents categories of data with
rectangular bars whose lengths or heights are proportional to the
values they represent. Bar charts can be plotted horizontally or
vertically, and they describe comparisons between different
categories. A bar chart can be created using the bar() method.
In the below example we will use the tips dataset. Tips database is the record
of the tip given by the customers in a restaurant for two and a half months in
the early 1990s. It contains 6 columns as total_bill, tip, sex, smoker, day, time,
size.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['day']
y = data['total_bill']
plt.bar(x, y)
plt.title("Tips Dataset")
plt.ylabel('Total Bill')
plt.xlabel('Day')
plt.show()
Output:
Histogram
A histogram is basically used to represent data provided in a form of some
groups. It is a type of bar plot where the X-axis represents the bin ranges while
the Y-axis gives information about frequency. The hist() function is used to
compute and create histogram of x.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['total_bill']
plt.hist(x)
plt.title("Tips Dataset")
plt.ylabel('Frequency')
plt.xlabel('Total Bill')
plt.show()
Output:
Scatter Plot
Scatter plots are used to observe relationships between variables.
The scatter() method in the matplotlib library is used to draw a scatter plot.
Example:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('tips.csv')
x = data['day']
y = data['total_bill']
plt.scatter(x, y)
plt.title("Tips Dataset")
plt.ylabel('Total Bill')
plt.xlabel('Day')
plt.show()
Output:
Pie Chart
Pie chart is a circular chart used to display only one series of data. The area of
slices of the pie represents the percentage of the parts of the data. The slices of
pie are called wedges. It can be created using the pie() method.
Syntax:
matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None,
autopct=None, shadow=False)
Example:
import matplotlib.pyplot as plt
# this example uses its own data, so the tips.csv read is not needed here
cars = ['AUDI', 'BMW', 'FORD',
'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]
plt.pie(data, labels=cars)
plt.title("Car data")
plt.show()
Output:
Box Plot
A Box Plot is also known as a Whisker Plot and is a standardized way of
displaying the distribution of data based on a five-number summary: minimum,
first quartile (Q1), median (Q2), third quartile (Q3) and maximum. It can also
show outliers. Let's see an example of how to create a Box Plot using Matplotlib
in Python:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(10)
data = [np.random.normal(0, std, 100) for std in range(1, 4)]
# Create a box plot
plt.boxplot(data, vert=True, patch_artist=True,
boxprops=dict(facecolor='skyblue'),
medianprops=dict(color='red'))
plt.xlabel('Data Set')
plt.ylabel('Values')
plt.title('Example of Box Plot')
plt.show()
Output:
Explanation:
plt.boxplot(data): Creates the box plot. The vert=True argument
makes the plot vertical, and patch_artist=True fills the box with color.
boxprops and medianprops: Customize the appearance of the boxes
and median lines respectively.
The box shows the interquartile range (IQR), the line inside the box shows
the median, and the "whiskers" extend to the minimum and maximum values
within 1.5 * IQR from the first and third quartiles. Any points outside this range
are considered outliers and are plotted as individual points.
Heatmap
A Heatmap is a data visualization technique that represents data in a matrix
form where individual values are represented as colors. Heatmaps are
particularly useful for visualizing the magnitude of multiple features on a
two-dimensional surface and identifying patterns, correlations and concentrations.
Let’s see an example of how to create a Heatmap using Matplotlib in Python:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
data = np.random.rand(10, 10)
plt.imshow(data, cmap='viridis', interpolation='nearest')
plt.colorbar()
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Example of Heatmap')
plt.show()
Output:
Explanation:
plt.imshow(data, cmap='viridis'): Displays the data as an image
(heatmap). The cmap='viridis' argument specifies the color map used for
the heatmap.
interpolation='nearest': Ensures that each data point is shown as a
block of color without smoothing.
The color bar on the side provides a scale to interpret the colors with darker
colors representing lower values and lighter colors representing higher values.
This type of plot is often used in fields like data analysis, bioinformatics and
finance to visualize data correlations and distributions across a matrix.
import matplotlib.pyplot as plt
# assumed head: the opening lines of this example were lost
x = [1, 2, 3, 4]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.ylim(0, 80)
plt.xticks(x, labels=["one", "two", "three", "four"])
plt.legend(["GFG"])
plt.show()
Output:
Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given
figure may contain many axes, but a given axes can only be present in one
figure. The axes() function creates the axes object.
Syntax:
axes([left, bottom, width, height])
Just like the pyplot class, the axes class also provides methods for adding titles,
legends, limits, labels, etc. Let's see a few of them:
ax.set_title() is used to add a title.
To add X and Y labels: ax.set_xlabel(), ax.set_ylabel()
To set the limits: ax.set_xlim(), ax.set_ylim()
ax.set_xticklabels(), ax.set_yticklabels() are used to set tick labels.
To add a legend: ax.legend()
Example:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
fig = plt.figure(figsize = (5, 4))
ax = fig.add_axes([1, 1, 1, 1])
ax1 = ax.plot(x, y)
ax2 = ax.plot(y, x)
ax.set_title("Linear Graph")
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
ax.legend(labels = ('line 1', 'line 2'))
plt.show()
Output:
Pandas is used in a wide range of fields, including academic and commercial
domains such as finance, economics, statistics, and analytics. In this tutorial,
we will learn the various features of Python Pandas and how to use them in practice.
What is Pandas?
Pandas is a powerful Python library that is specifically designed to work on data
frames that have "relational" or "labeled" data. Its aim aligns with doing real-
world data analysis using Python. Its flexibility and functionality make it
indispensable for various data-related tasks. Hence, this Python package works
well for data manipulation, operating a dataset, exploring a data frame, data
analysis, and machine learning-related tasks. To work on it we should first
install it using a pip command like "pip install pandas" and then import it like
"import pandas as pd". After successfully installing and importing, we can enjoy
the innovative functions of pandas to work on datasets or data frames. Pandas
versatility and ease of use make it a go-to tool for working with structured data
in Python.
Generally, Pandas operates a data frame using Series and DataFrame; where
Series works on a one-dimensional labeled array holding data of any type
like integers, strings, and objects, while a DataFrame is a two-dimensional data
structure that manages and operates data in tabular form (using rows and
columns).
Why Pandas?
The beauty of Pandas is that it simplifies the task related to data frames and
makes it simple to do many of the time-consuming, repetitive tasks involved in
working with data frames, such as:
Import datasets - available in the form of spreadsheets, comma-
separated values (CSV) files, and more.
Data cleansing - dealing with missing values and representing them as
NaN, NA, or NaT.
Size mutability - columns can be added and removed from DataFrame
and higher-dimensional objects.
Data normalization – normalize the data into a suitable format for
analysis.
Data alignment - objects can be explicitly aligned to a set of labels.
Intuitive merging and joining data sets – we can merge and join
datasets.
Reshaping and pivoting of datasets – datasets can be reshaped and
pivoted as per the need.
Efficient manipulation and extraction - manipulation and extraction of
specific parts of extensive datasets using intelligent label-based slicing,
indexing, and subsetting techniques.
Statistical analysis - to perform statistical operations on datasets.
Data visualization - Visualize datasets and uncover insights.
Applications of Pandas
The most common applications of Pandas are as follows:
Data Cleaning: Pandas provides functionalities to clean messy data, deal
with incomplete or inconsistent data, handle missing values, remove
duplicates, and standardize formats to do effective data analysis.
Data Exploration: Pandas easily summarize statistics, find trends, and
visualize data using built-in plotting functions, Matplotlib, or Seaborn
integration.
Data Preparation: Pandas may pivot, melt, convert variables, and merge
datasets based on common columns to prepare data for analysis.
Data Analysis: Pandas supports descriptive statistics, time series analysis,
group-by operations, and custom functions.
Data Visualisation: Pandas itself has basic plotting capabilities; it
integrates and supports data visualization libraries like Matplotlib,
Seaborn, and Plotly to create innovative visualizations.
Time Series Analysis: Pandas supports date/time indexing, resampling,
frequency conversion, and rolling statistics for time series data.
Data Aggregation and Grouping: Pandas groupby() function lets you
aggregate data and compute group-wise summary statistics or apply
functions to groups.
Data Input/Output: Pandas makes data input and export easy by reading
and writing CSV, Excel, JSON, SQL databases, and more.
Machine Learning: Pandas works well with Scikit-learn for data
preparation, feature engineering, and model input data.
Web Scraping: Pandas may be used with BeautifulSoup or Scrapy to
parse and analyse structured web data for web scraping and data
extraction.
Financial Analysis: Pandas is commonly used in finance for stock market
data analysis, financial indicator calculation, and portfolio optimization.
Text Data Analysis: Pandas' string manipulation, regular expressions, and
text mining functions help analyse textual data.
Experimental Data Analysis: Pandas makes manipulating and analysing
large datasets, performing statistical tests, and visualizing results easy.
Python Pandas Data Structures
Data structures in Pandas are designed to handle data efficiently. They allow for
the organization, storage, and modification of data in a way that optimizes
memory usage and computational performance. Python Pandas library provides
two primary data structures for handling and analyzing data −
Series
DataFrame
In general programming, the term "data structure" refers to the method of
collecting, organizing, and storing data to enable efficient access and
modification. Data structures are collections of data types that provide the best
way of organizing items (values) in terms of memory usage.
Pandas is built on top of NumPy and integrates well within a scientific
computing environment with many other third-party libraries. This tutorial will
provide a detailed introduction to these data structures.
Dimension and Description of Pandas Data Structures
Working with two or more dimensional arrays can be complex and time-
consuming, as users need to carefully consider the data's orientation when
writing functions. However, Pandas simplifies this process by reducing the
mental effort required. For example, when dealing with tabular data
(DataFrame), it's easier to think in terms of rows and columns instead of axis 0
and axis 1.
Mutability of Pandas Data Structures
All Pandas data structures are value mutable, meaning their contents can be
changed. However, their size mutability varies as stated below
Series − Size immutable.
DataFrame − Size mutable.
3.9. Series
A Series is a one-dimensional labeled array that can hold any data type. It can
store integers, strings, floating-point numbers, etc. Each value in a Series is
associated with a label (index), which can be an integer or a string.
Name Steve
Age 35
Gender Male
Rating 3.5
Example
Consider the following Series which is a collection of different data types
import pandas as pd
data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)
On executing the above program, you will get the following
Output:
Name Steve
Age 35
Gender Male
Rating 3.5
dtype: object
3.10. DataFrame
A pandas DataFrame can be created using the pandas.DataFrame() constructor, whose main parameters are:

Sl. No. | Parameter & Description
1 | data - takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
2 | index - the index to be used for the row labels of the resulting frame; optional, defaults to np.arange(n) if no index is passed.
5 | copy - used for copying the data; defaults to False.
Consider the following data representing the performance rating of a sales team
Example
The above tabular data can be represented in a DataFrame as follows
import pandas as pd
# Data represented as a dictionary
data = {
'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
'Age': [32, 28, 45, 38],
'Gender': ['Male', 'Female', 'Male', 'Female'],
'Rating': [3.45, 4.6, 3.9, 2.78]
}
# Creating the DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
Output
On executing the above code, you will get the following output −
Name Age Gender Rating
0 Steve 32 Male 3.45
1 Lia 28 Female 4.60
2 Vin 45 Male 3.90
3 Katie 38 Female 2.78
Features of DataFrame
Columns can be of different types.
Size is mutable.
Labeled axes (rows and columns).
Can Perform Arithmetic operations on rows and columns.
Advantages of Data Visualization:
Data visualization provides tools to create plots easily and rapidly consume
important metrics. These metrics show the clear-cut growth or loss in business.
For example, if sales are significantly going down in one region, decision-makers
can easily find out from the data what circumstances or decisions are at play
and how to respond to the factors encountered. Through graphical
representations, we can interpret the vast features of data clearly and
cohesively, which allows us to understand the data, draw conclusions from
those insights, and see the business outlook.
Quick Decision Making: the human mind processes visual images faster than
text and numerical values, so a graph, chart, or other visual representation of
data is more pleasant and easier for our brain to process. Reading and grasping
text and then converting it into a mental picture of the data is difficult and
time-consuming for a team of decision-makers, and the result might not be
entirely accurate. Since humans naturally interpret visual data with ease, data
visualization demonstrably improves the speed of decision-making processes,
helps shorten business meetings, and supports efficient decision making.
Better Analysis: data visualization plays an important role for businesses in
understanding their data. If the data suggest incorrect actions, visualization
helps detect the inaccurate data sooner so it can be removed from the analysis.
Exploring business insights: in the current competitive business environment,
we can discover the most recent trends in our business to produce a
quality product and identify issues before they arise. Staying on top of
trends lets us put more effort into increasing profits for our business.
Installation of Pandas
To get started you need to install Pandas using pip:
pip install pandas
Importing necessary libraries and data files
Once Pandas is installed, import the required libraries and load your data
Sample CSV files df1 and df2.
import numpy as np
import pandas as pd
df1 = pd.read_csv('df1', index_col=0)
df2 = pd.read_csv('df2')
We get a well-formed line plot for df without specifying any extra options in
the .plot() function.
2. Area Plots using Pandas DataFrame
An area plot shows data with a line and fills the space below the line with color.
It helps show how things change over time. We can plot it
using the DataFrame.plot.area() function.
df2.plot.area(alpha=0.4)
4. Histogram Plot using Pandas DataFrame
Histograms help visualize the distribution of data by grouping values into bins.
Pandas uses the DataFrame.plot.hist() function to plot a histogram.
df1['A'].plot.hist(bins=50)
6. Box Plots using Pandas DataFrame
A box plot displays the distribution of data, showing the median, quartiles, and
outliers. We can use DataFrame.plot.box() or DataFrame.boxplot() to
create it.
df2.plot.box()
8. Kernel Density Estimation plot (KDE) using Pandas DataFrame
KDE (Kernel Density Estimation) creates a smooth curve to show the shape of
data by using the df.plot.kde() function. It’s useful for visualizing data
patterns and simulating new data based on real examples.
df2['a'].plot.kde()
2. Line Plot with Different Line Styles
If you want to differentiate between the two lines visually you can change the
line style (e.g., solid line, dashed line) with the help of pandas.
df.plot(style=['-', '--', '-.', ':'], title='Line Plot with Different Styles',
xlabel='Index', ylabel='Values', grid=True)
df.plot.bar(stacked=True, figsize=(10, 6), title='Stacked Bar Plot',
xlabel='Index', ylabel='Values', grid=True)
Input:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
The Series wraps both a sequence of values and a sequence of indices, which
we can access with the values and index attributes. As with a NumPy array,
data can be accessed by the associated index via the familiar Python
square-bracket notation:
Example 1:
Input:
data[1]
Output:
0.5
Example 2:
Input:
data[1:3]
Output:
1 0.50
2 0.75
dtype: float64
The Pandas Series is much more general and flexible than the one-dimensional
NumPy array
3.12.2 The Pandas DataFrame Object
To demonstrate this, let's construct a new Series listing the area of each of the
five states
Input:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
Considering the population Series from before, we can use a dictionary to
construct a single two-dimensional object containing this information:
Input:
states = pd.DataFrame({'population': population,'area': area})
states
Output:
(table showing the population and area columns for each state)
Note that both the Series and DataFrame objects contain an explicit index that
lets you reference and modify data. This Index object is an interesting structure
in itself, and it can be thought of either as an immutable array or as an ordered
set (technically a multi-set, as Index objects may contain repeated values).
Those views have some interesting consequences for the operations available
on Index objects.
Input:
ind = pd.Index([2, 3, 5, 7, 11])
ind
Output:
Int64Index([2, 3, 5, 7, 11], dtype='int64')
The Index in many ways operates like an array. For example, we can use
standard Python indexing notation to retrieve values or slices:
Example 1
Input:
ind[1]
Output:
3
Example 2
Input:
ind[::2]
Output:
Int64Index([2, 5, 11], dtype='int64')
Index objects also have many of the attributes familiar from NumPy arrays:
Input:
print(ind.size, ind.shape, ind.ndim, ind.dtype)
Output:
5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices are
immutable–that is, they cannot be modified via the normal means:
Input:
ind[1] = 0
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0
/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py
in __setitem__(self, key, value)
1243
1244 def __setitem__(self, key, value):
-> 1245 raise TypeError("Index does not support mutable operations")
1246
1247 def __getitem__(self, key):
The Python and NumPy indexing operators [] and attribute operator . provide
quick and easy access to pandas data structures across a wide range of use
cases. This makes interactive work intuitive, as there’s little new to learn if you
already know how to deal with Python dictionaries and NumPy arrays. However,
since the type of the data to be accessed isn’t known in advance, directly using
standard operators has some optimization limits. For production code, we
recommended that you take advantage of the optimized pandas data access
methods
.loc is primarily label based, but may also be used with a boolean
array. .loc will raise KeyError when the items are not found. Allowed
inputs are:
o A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of
the index. This use is not an integer position along the index.).
o A list or array of labels ['a', 'b', 'c'].
o A slice object with labels 'a':'f' (Note that contrary to usual Python
slices, both the start and the stop are included, when present in
the index! See Slicing with labels and Endpoints are inclusive.)
o A boolean array (any NA values will be treated as False).
o A callable function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the
above).
o A tuple of row (and column) indices whose elements are one of
the above inputs.
See more at Selection by Label.
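A small sketch of label-based selection with .loc (the sample Series below is assumed for illustration):
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.loc['a'])          # single label -> 10
print(s.loc[['a', 'c']])   # list of labels
print(s.loc['a':'b'])      # label slice: both endpoints are included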
Destructuring tuple keys into row (and column) indexes occurs before callables
are applied, so you cannot return a tuple from a callable to index both rows and
columns.
The primary function of indexing with [] (a.k.a. __getitem__ for those familiar
with implementing class behavior in Python) is selecting out lower-dimensional
slices. The following table shows return type values when indexing pandas
objects with []:
Input:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df
Output:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
3.13.1 Attribute access
You can access an index on a Series, or a column on a DataFrame, directly as
an attribute:
dfa = df.copy()
dfa.A  # the 'A' column of dfa, equivalent to dfa['A']
The following sections on handling missing data go through each step in more
detail, but here's the general idea:
3.14.1 Calculation with Missing Data
None is a Python singleton object that is frequently used for missing data in
Python programs. Because it is a Python object, None can only be used in
arrays of the data type "object" (i.e., arrays of Python objects), and cannot be
used in any other NumPy/Pandas array:
import numpy as np
import pandas as pd
array = np.array([3, None, 0, 4, None])
print(array)
Output:
[3 None 0 4 None]
The alternative missing data representation, NaN (an acronym for Not a
Number), is distinct; it is a unique floating-point value recognized by all
systems that utilize the common IEEE floating-point notation
import numpy as np
import pandas as pd
array = np.array([3, np.nan, 0, 4, np.nan])
print(array)
Output:
[ 3. nan 0. 4. nan]
It's important to note that NumPy selected a native floating-point type for
this array, which implies that in contrast to the object array from earlier, this
array allows rapid operations that are pushed into the produced code.
# the opening of this data string was lost; the first rows below are reconstructed from the output
data = '''ID,Gender,Salary,Country,Company
1,Male,15000,India,Google
2,Female,45000,China,NaN
3,Female,25000,India,Google
4,NaN,NaN,Australia,Google
5,Male,NaN,India,Google
6,Male,54000,NaN,Alibaba
7,NaN,74000,China,NaN
8,Male,14000,Australia,NaN
9,Female,15000,NaN,NaN
10,Male,33000,Australia,NaN'''
Output:
Salary Dataset:
ID Gender Salary Country Company
0 1 Male 15000.0 India Google
1 2 Female 45000.0 China NaN
2 3 Female 25000.0 India Google
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Missing Data
ID Gender Salary Country Company
0 False False False False False
1 False False False False True
2 False False False False False
3 False True True False False
4 False False True False False
5 False False False True False
6 False True False False True
7 False False False False True
8 False False False True True
9 False False False False True
Filter based on columns:
ID Gender Salary Country Company
1 2 Female 45000.0 China NaN
3 4 NaN NaN Australia Google
4 5 Male NaN India Google
5 6 Male 54000.0 NaN Alibaba
6 7 NaN 74000.0 China NaN
7 8 Male 14000.0 Australia NaN
8 9 Female 15000.0 NaN NaN
9 10 Male 33000.0 Australia NaN
Sum up the missing values:
ID 0
Gender 2
Salary 2
Country 2
Company 5
dtype: int64
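For reference, a minimal sketch of the program that produces the output above (reading the CSV text via StringIO is an assumption; the pandas calls are standard):
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data))        # parse the CSV text defined earlier
print("Salary Dataset:")
print(df)
print("Missing Data")
print(df.isnull())                      # True wherever a value is missing
print("Filter based on columns:")
print(df[df.isnull().any(axis=1)])      # rows with at least one missing value
print("Sum up the missing values:")
print(df.isnull().sum())                # missing-value count per column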
isnull()
The isnull() method returns a dataframe of boolean values that are True for
NaN values when checking null values in a Pandas DataFrame.
notnull()
The notnull() method returns a dataframe of boolean values that are False for
NaN values when checking for null values in a Pandas Dataframe.
dropna()
The dropna() method is used to remove null values from a dataframe. This
function removes rows and columns of datasets containing null values in several
ways.
fillna()
By using the fillna(), replace(), and interpolate() functions, we can fill null
values in a dataset by replacing NaN values with alternative values. The
interpolate() method is mostly used to fill NA values in a dataframe, but rather
than hard-coding the replacement value it applies various interpolation
techniques.
replace()
Using the replace method, we can not only replace or fill null values but any
value specified as a function attribute. We specify the value to be replaced
in to_replace and the new value in value.
interpolate()
Another pandas capability is substituting missing values with values that make
sense, using the interpolate() function. For linear data, pandas fills the gaps
neatly using the midpoints between the surrounding points. If the data were
curvilinear, a function would be fitted to it to find another way of determining
the average.
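A short sketch of these methods applied to the salary DataFrame above (the fill and replacement values are arbitrary examples):
df_drop = df.dropna()                        # drop rows that contain NaN
df_fill = df.fillna({'Salary': 0})           # replace missing salaries with 0
df_repl = df.replace('Google', 'Alphabet')   # replace() works on any value
df_interp = df['Salary'].interpolate()       # fill NaN salaries by linear interpolation
print(df_interp)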
3.15.1 Creating a MultiIndex
MultiIndex DataFrame:
import pandas as pd
# Creating a sample dataset
arrays = [['Semester 1', 'Semester 1', 'Semester 2', 'Semester 2'],
          ['Math', 'Science', 'Math', 'Science']]
# the index and data definitions below are assumed; the original lines were lost
index = pd.MultiIndex.from_arrays(arrays, names=('Semester', 'Subject'))
data = {'Score': [88, 92, 78, 85]}
df = pd.DataFrame(data, index=index)
print(df)
Output:
                    Score
Semester   Subject
Semester 1 Math        88
           Science     92
Semester 2 Math        78
           Science     85
3.16 Combining Datasets
pandas is the most common library for handling datasets in Python. You can
use concat(), merge(), or join() to combine datasets.
import pandas as pd
# Sample data (assumed; the original sample was lost)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# concat() stacks the two DataFrames vertically
df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined)
Use merge() when combining datasets based on a key column, like an SQL join.
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})     # assumed samples
df2 = pd.DataFrame({'ID': [2, 3], 'Salary': [50000, 60000]})
df_merged = df1.merge(df2, on='ID')               # inner join by default
print(df_merged)
df_left = df1.merge(df2, on='ID', how='left')
print(df_left)
df_outer = df1.merge(df2, on='ID', how='outer')
print(df_outer)
df_joined = df1.set_index('ID').join(df2.set_index('ID'))  # join() uses the index
print(df_joined)
import numpy as np
arr1 = np.array([1, 2, 3])   # assumed sample arrays
arr2 = np.array([4, 5, 6])
# Stacking vertically
arr_combined = np.vstack((arr1, arr2))
print(arr_combined)
# Stacking horizontally
arr_combined_h = np.hstack((arr1, arr2))
print(arr_combined_h)
df_combined = pd.concat([df1, df2], ignore_index=True)  # append rows; _append() is a private API
print(df_combined)
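The grouped output below summarizes salaries by department; a minimal sketch that reproduces it (the sample data are assumed, chosen to match the totals shown):
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Sales', 'Sales'],
    'Salary': [60000, 62000, 70000, 72000, 50000, 55000]})
print(df.groupby('Department')['Salary'].sum())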
Output:
Department
HR 122000
IT 142000
Sales 105000
Name: Salary, dtype: int64
3.17.4. Using transform() for Group-wise Calculations
If you want to perform group calculations and retain the original structure of the
DataFrame, use transform().
df['Avg_Dept_Salary'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
3.17.5. Pivot Tables for Aggregation
For a cleaner table format, use pivot_table().
df_pivot = df.pivot_table(index='Department', values='Salary', aggfunc=['sum',
'mean', 'count'])
print(df_pivot)
Method Use Case
groupby() Aggregate data based on one or more columns
agg() Apply multiple aggregation functions
transform() Apply transformations while keeping the original structure
pivot_table() Reshape and summarize data in a table
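For instance, agg() from the table above can apply several aggregations at once to the same grouped data (a sketch reusing the department DataFrame from earlier):
df_agg = df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count'])
print(df_agg)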
3.18. Joins
Joins in pandas work similarly to SQL joins, allowing you to merge data from
multiple DataFrames based on common columns. The main function for joins in
pandas is merge(). The types of join are stated below:
Inner Join - keeps only rows with matching keys in both tables.
Left Join - keeps all rows from the left table, filling missing values with NaN.
Right Join - keeps all rows from the right table, filling missing values with NaN.
Outer Join - keeps all rows from both tables, filling missing values with NaN.
Sample DataFrames
Let's create two DataFrames to demonstrate different types of joins.
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Salary': [60000, 70000, 80000, 90000]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
df1
ID Name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 David
df2
ID Salary
0 3 60000
1 4 70000
2 5 80000
3 6 90000
1. Inner Join (keeps only rows with matching ID in both tables)
df_inner = df1.merge(df2, on='ID', how='inner')
print(df_inner)
Output:
ID Name Salary
0 3 Charlie 60000
1 4 David 70000
2.Left Join (Keeps all rows from df1, fills missing values from df2)
df_left = df1.merge(df2, on='ID', how='left')
print(df_left)
Output:
ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 60000.0
3 4 David 70000.0
3.Right Join (Keeps all rows from df2, fills missing values from df1)
df_right = df1.merge(df2, on='ID', how='right')
print(df_right)
Output:
ID Name Salary
0 3 Charlie 60000
1 4 David 70000
2 5 NaN 80000
3 6 NaN 90000
4.Outer Join (Keeps all rows from both tables, fills missing values with NaN)
df_outer = df1.merge(df2, on='ID', how='outer')
print(df_outer)
Output:
ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 60000.0
3 4 David 70000.0
4 5 NaN 80000.0
5 6 NaN 90000.0
5.Joining on Multiple Columns
If your DataFrames have multiple keys, you can join on multiple columns.
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Dept': ['HR', 'IT', 'HR'], 'Name': ['Alice',
'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Dept': ['HR', 'IT', 'Finance'], 'Salary':
[50000, 60000, 70000]})
df_multi = df1.merge(df2, on=['ID', 'Dept'], how='inner')
print(df_multi)
Output:
ID Dept Name Salary
0 1 HR Alice 50000
1 2 IT Bob 60000
3.19. Pivot Tables
Pivot tables are used to summarize and analyze data in a structured way. They
work similarly to Excel pivot tables and are useful for aggregating and
restructuring large datasets.
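A basic example first (a sketch: the employee DataFrame is assumed, with values chosen to match the aggregation outputs shown below, since the original introductory example fell on a lost page):
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT', 'Sales', 'Sales', 'Sales'],
    'Salary': [61000, 61000, 70000, 72000, 75000, 52000, 54000, 57000],
    'Bonus': [7000, 7000, 8000, 8000, 8500, 6000, 6200, 6300]})
pivot = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
print(pivot)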
3.19.2. Multiple Aggregations in Pivot Tables
You can apply multiple aggregation functions at the same time.
pivot = df.pivot_table(index='Department', values=['Salary', 'Bonus'],
aggfunc=['sum', 'mean'])
print(pivot)
Output:
sum_salary mean_salary sum_bonus mean_bonus
Department
HR 122000 61000.0 14000.0 7000.0
IT 217000 72333.3 24500.0 8166.7
Sales 163000 54333.3 18500.0 6166.7
Include Totals: df.pivot_table(index='Department', values='Salary', aggfunc='sum', margins=True)
Fill Missing Values: df.pivot_table(index='Department', columns='Location', values='Salary', aggfunc='sum', fill_value=0)
import pandas as pd
# assumed sample data; the original definition fell on a lost page
data = {'Name': ['  Alice! ', 'Bob#', ' Charlie '],
        'Email': ['alice@gmail.com', 'bob@company.com', 'charlie@gmail.com']}
df = pd.DataFrame(data)
print(df)
3.20.9. Removing Spaces & Special Characters
df['Cleaned_Name'] = df['Name'].str.strip() # Remove extra spaces
df['No_Special_Chars'] = df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True) #
Keep only letters
print(df)
3.20.10. Using apply() for Custom String Functions
If .str methods don’t cover your case, use apply().
df['Custom'] = df['Name'].apply(lambda x: x[::-1]) # Reverse the name
print(df)
Operation | Example
Convert to lowercase | df['Name'].str.lower()
Convert to uppercase | df['Name'].str.upper()
Split a column | df['Name'].str.split().str[0]
Extract from string | df['Email'].str.extract(r'(\w+)')
Replace substring | df['Email'].str.replace("gmail", "outlook")
Check if substring exists | df['Email'].str.contains("company")
Remove spaces | df['Name'].str.strip()
Remove special characters | df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
Apply a custom function | df['Name'].apply(lambda x: x[::-1])
3.21.1 Key Aspects of Time Series Data
1. Time Interval: The data points are indexed in time order, with each
data point associated with a specific timestamp (e.g., daily, weekly,
monthly).
2. Examples (areas of application):
o Stock prices
o Temperature readings over a period
o Sales data over months or years
o Economic indicators (like GDP, inflation rates)
3. Components of Time Series Data:
o Trend: The long-term movement or direction in the data (e.g., an
upward or downward trend in stock prices).
o Seasonality: Regular, repeating fluctuations in the data within
fixed periods (e.g., higher ice cream sales in summer).
o Cyclic Patterns: Irregular fluctuations that do not have a fixed
period (e.g., economic booms and recessions).
o Noise: Random variations or fluctuations in the data that do not
follow any pattern.
4. Time Series Analysis: Techniques such as ARIMA (AutoRegressive
Integrated Moving Average), exponential smoothing, and decomposition
methods are used to analyze time series data, forecast future values, and
uncover underlying patterns.
b) Regular vs. Irregular Time Series:
Regular: Data points are recorded at consistent time intervals (e.g.,
hourly, daily).
Irregular: Data points are recorded at inconsistent time intervals.
import pandas as pd
df = pd.read_csv('timeseries.csv', parse_dates=['date'], index_col='date')  # assumed file name
print(df.head())
# Check the structure of the data
print(df.info())
df_monthly = df.resample('M').mean()   # average within each month
df_weekly = df.resample('W').sum()     # total within each week
print(df_monthly.head())
print(df_weekly.head())
'M' stands for monthly, 'W' stands for weekly, and you can use other time
frequencies like 'D' for daily, 'Q' for quarterly, etc.
mean() and sum() aggregate the data by taking the average and sum,
respectively.
Example Template
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df, model='additive', period=12)
decomposition.plot()
plt.show()
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
ARIMA combines the following three components to capture the patterns and
trends in time-series data:
1. Autoregression (AR)
2. Differencing (I)
3. Moving Average (MA)
The moving average (MA) model calculates the average of past observations to
forecast future values. It helps eliminate short-term fluctuations and identify
underlying trends in the data. The autoregressive (AR) model predicts future
values using past observations and a linear regression equation. It assumes that
the future values depend on the previous values with a lag. Differencing (I) in
time series analysis refers to a method of transforming a time series dataset by
subtracting the previous value from the current value.
3.21.3.9 Stationarity
Stationarity is a key concept in time-series analysis: a series is stationary when
its statistical properties, such as mean and variance, remain constant over
time. In Python, testing for stationarity involves methods like the
Augmented Dickey-Fuller (ADF) test, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
test, and visual inspection of time series plots.
from statsmodels.tsa.stattools import adfuller
# the first lines of this snippet were lost; a standard ADF call on df['value'] is assumed
result = adfuller(df['value'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
Here's an example that demonstrates the steps of loading and working with
time-series data using pandas in Python:
Example Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
3.21.5 Matplotlib Library and Time Series Data
Python provides the Matplotlib library, which includes the Pyplot module for
creating various types of plots, including line plots, scatter plots, and
histograms. Plotting time-series data is an essential step in visualizing patterns,
trends, and anomalies.
Example Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate random time-series data
np.random.seed(0)
dates = pd.date_range(start='2025-02-01', periods=100)
values = np.random.randn(100).cumsum()
# Plot the time-series data
plt.plot(dates, values)
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()
Output
# Create a DataFrame from the generated data
data = pd.DataFrame({'date': dates, 'value': values})
# Set the 'date' column as the index
data.set_index('date', inplace=True)
# Plot the time-series data
plt.plot(data.index, data['value'])
plt.xlabel('Time')
plt.ylabel('Value')
plt.xticks(rotation = 45)
plt.title('Time Series Data')
plt.show()
Output: the same line plot, drawn from the DataFrame with the date index on the x-axis.
The statsmodels library in Python provides various tools for performing time
series analysis, including models for autoregressive (AR), moving average (MA),
and integrated (I) processes, which are the building blocks for models like
ARIMA (AutoRegressive Integrated Moving Average). It also supports other time series models such as SARIMA (Seasonal ARIMA), state space models, and exponential smoothing.
Example Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.api import SimpleExpSmoothing, Holt, ExponentialSmoothing
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
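The data-generation step below is a minimal sketch (the trend-plus-seasonality form, the dates, and the series length are assumptions); the plotting lines that follow complete the figure:

# Simulate a series with trend, yearly seasonality, and noise
np.random.seed(0)
t = np.arange(120)
values = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.randn(120)
df = pd.DataFrame({"value": values},
                  index=pd.date_range("2020-01-31", periods=120, freq="M"))

plt.plot(df.index, df["value"], label="Simulated series")
plt.title("Simulated Time Series")
plt.xlabel("Time")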
plt.ylabel("Value")
plt.legend()
plt.grid()
plt.savefig("simulated_time_series.png")
plt.show()
Output: a labelled, gridded line plot of the simulated series, saved as simulated_time_series.png.
3.21.7.2 Time Series Stationarity
The Augmented Dickey-Fuller (ADF) test is a statistical hypothesis test for a unit root: it indicates whether a time series is stationary (its statistical properties do not change over time) or non-stationary.
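A minimal sketch of running the test with statsmodels on the simulated series above (the 0.05 threshold is the usual convention):

from statsmodels.tsa.stattools import adfuller

result = adfuller(df["value"])
print(f"ADF Statistic: {result[0]:.4f}")
print(f"P-Value: {result[1]:.4f}")
if result[1] > 0.05:
    print("The time series is non-stationary.")
else:
    print("The time series is stationary.")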
Output
ADF Statistic: -0.5022
P-Value: 0.8916
The time series is non-stationary.
In Python, you can calculate and plot the Autocorrelation Function (ACF) and
Partial Autocorrelation Function (PACF) using the statsmodels library. These
plots help in analyzing the autocorrelations and partial autocorrelations of time
series data, which are important for model selection in time series forecasting.
Template
# Plot ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
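A minimal completion of the template, using statsmodels' plotting helpers (the lags value is an arbitrary choice):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
plot_acf(df["value"], lags=20, ax=axes[0])    # autocorrelation
plot_pacf(df["value"], lags=20, ax=axes[1])   # partial autocorrelation
plt.show()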
3.21.7.3 Fitting an ARIMA Model to the Data for Forecasting
ARIMA stands for AutoRegressive Integrated Moving Average. It is a popular
statistical method used for time series forecasting and modeling. ARIMA is a
powerful tool for forecasting univariate (single variable) time series data,
especially when the data exhibits patterns such as trends and seasonality.
Template
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(df["value"], order=(2, 1, 2))
arima_result = model.fit()
print(arima_result.summary())

# Residual diagnostics (plot_diagnostics is a sketch of the plotting
# step, which is not spelled out in the template)
arima_result.plot_diagnostics(figsize=(12, 8))
plt.savefig("arima_residuals_diagnostics.png")
plt.show()
The ARIMA model is fitted with the specified order (2, 1, 2), i.e. (p, d, q):
o p = 2: number of AR (AutoRegressive) terms
o d = 1: differencing order (a first difference to make the series stationary)
o q = 2: number of MA (Moving Average) terms
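Since the goal of this section is forecasting, a minimal sketch of producing forecasts from the fitted model (the 10-step horizon is an arbitrary choice):

forecast = arima_result.forecast(steps=10)  # forecast the next 10 periods
print(forecast)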
3.22 High Performance Pandas
Pandas is a powerful Python library for data analysis, but it can be slow when
working with large datasets. Here are some techniques to improve the
performance of your Pandas code:
1. Vectorization:
Avoid explicit loops whenever possible. Pandas and NumPy are optimized
for vectorized operations, which are much faster than iterating through
rows.
Use Pandas' built-in functions like apply(), map(), and agg() for efficient data manipulation (a short sketch illustrating several of these tips follows this list).
2. Data Types:
Choose the smallest appropriate data type for your columns. For
example, use int8 instead of int64 if your data allows it. This can
significantly reduce memory usage and improve performance.
Use category data type for columns with a limited number of unique
values.
3. Memory Optimization:
Read only the columns you need from your data files.
Use chunking to process large files in smaller parts.
Delete unnecessary dataframes to free up memory.
4. Parallel Processing:
Use libraries like Dask or Ray to parallelize your Pandas operations across
multiple cores or machines.
5. Just-in-Time Compilation:
Use Numba to compile your Python functions to machine code for faster
execution.
6. GPU Acceleration:
Consider using cuDF, a GPU-accelerated library with a Pandas-like API,
for even greater performance gains.
7. Profiling and Benchmarking:
Use tools like cProfile or line_profiler to identify performance bottlenecks
in your code.
Benchmark different approaches to choose the most efficient solution.
8. Other Tips:
Use pd.eval() and df.query() for faster expression evaluation.
Avoid creating unnecessary copies of dataframes.
Use inplace operations when possible.
Optimize your data import process.
By applying these techniques, you can significantly improve the performance of
your Pandas code and work more efficiently with large datasets.
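A short sketch illustrating tips 1, 2, and 8 (the column names and sizes are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randint(0, 100, 1_000_000),
                   "b": np.random.rand(1_000_000)})

# Tip 1 - vectorization: whole-column arithmetic instead of a Python loop
df["c"] = df["a"] * df["b"]

# Tip 2 - data types: downcast when the value range allows it
df["a"] = df["a"].astype("int8")    # values 0-99 fit in int8

# Tip 8 - query() for fast expression evaluation
subset = df.query("a > 50 and b < 0.5")
print(subset.head())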
ASSIGNMENTS
S.No.  Questions  K Level  COs
Easy Level
Medium Level
Hard Level
Part – A Questions and Answers
Lists are built-in Python data structures that store heterogeneous data, whereas
NumPy arrays are homogeneous, optimized for fast numerical computations.
import numpy as np
arr=np.array([1,2,3,4,5])
print(type(arr))
output:
<class 'numpy.ndarray'>
Constants are predefined values like np.pi (π), np.e (Euler’s number).
import numpy as np
radius = 2
circumference = 2 * np.pi * radius
print(circumference)
Output:
12.566370614359172
5. What is Matplotlib used for? (K3,CO3)
Matplotlib is used for creating static, animated, and interactive visualizations in Python. For example:
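One minimal possibility for such an example (a simple line plot):

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x')
plt.ylabel('x squared')
plt.show()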
9. What is Pandas used for? (K1,CO3)
Pandas is a Python library for data manipulation and analysis using Series and
DataFrames.
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Salary': [60000, 70000, 80000, 90000]})
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
12. What is the difference between .loc[] and .iloc[] in Pandas? (K1,CO3)
.loc[] selects rows and columns by label, whereas .iloc[] selects them by integer position.
14. What is a pivot table in Pandas? (K1,CO3)
A pivot table is a table that summarizes data using aggregation functions like
sum, mean, or count.
merge() is flexible for different types of joins, while join() is primarily for index-
based merging.
Time series data consists of indexed data points based on time, useful for trend
analysis.
20. How do you group data by a column and compute the mean?
(K1,CO3)
df.groupby('Category')['Sales'].mean()
Resampling is changing the frequency of time-series data (e.g., daily to
monthly).
import pandas as pd

# a minimal sketch: 90 daily values resampled to monthly means
rng = pd.date_range('2025-01-01', periods=90, freq='D')
data = pd.Series(range(90), index=rng).resample('M').mean()
print(data)
26. What is the advantage of using NumPy over Python lists? (K1,CO3)
NumPy arrays are stored contiguously in memory, consume less space, and support fast vectorized operations, which makes them much faster than Python lists for numerical work.
merge() supports the following types of joins:
inner (default)
outer
left
right
28. What is the benefit of using .apply() over loops in Pandas? (K1,CO3)
.apply() is usually faster and more concise than explicit Python loops when applying functions to DataFrame rows/columns.
To plot a histogram of a column: df['column_name'].hist()
Part – B Questions
Q.No.  Questions  CO  K Level
5. What are missing values in Pandas? Why is handling missing data important? Write a Python program to create a Pandas DataFrame with missing values and perform the following operations: (CO3, K4)
- Check for missing values.
- Fill missing values with the column mean.
- Drop rows with missing values.
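A minimal sketch of a program answering this question (the column names and values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0],
                   "B": [np.nan, 2.0, 3.0, np.nan]})

print(df.isnull().sum())         # check for missing values per column
filled = df.fillna(df.mean())    # fill missing values with the column mean
dropped = df.dropna()            # drop rows containing missing values
print(filled)
print(dropped)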
- Extract the year, month, and day from the date column.
- Resample the data to show monthly averages.
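A minimal sketch of these two operations, assuming a DataFrame df with date and value columns:

import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=60, freq="D"),
                   "value": range(60)})

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

monthly_avg = df.set_index("date")["value"].resample("M").mean()
print(monthly_avg)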
Supportive online Certification courses (NPTEL, Swayam, Coursera, Udemy, etc.)
Real-Time Applications in Day-to-Day Life and in Industry
1. NumPy Applications
Daily Life Applications
✅ Scientific Calculations: Used in weather forecasting to process large climate
datasets.
✅ Personal Finance Management: Helps calculate monthly budgets, track
expenses, and analyze financial trends using array operations.
✅ Gaming: NumPy is used in game physics (e.g., calculating object
movements, collision detection).
✅ Image Processing: NumPy arrays represent images, allowing easy
manipulation (e.g., Instagram filters, facial recognition).
Industry Applications
🔹 Artificial Intelligence & Machine Learning: Used in deep learning frameworks
like TensorFlow and PyTorch for handling multidimensional data efficiently.
🔹 Robotics & Automation: Helps in motion planning and sensor data processing
in autonomous vehicles.
🔹 Healthcare & Genomics: Used in medical image analysis (MRI, CT scans) and
DNA sequence data processing.
🔹 Stock Market Analysis: Helps analyze large financial datasets for predicting
market trends.
2. Matplotlib Applications
Daily Life Applications
✅ Tracking Personal Goals: Used to plot weight loss, fitness progress,
expenses, or investment growth.
✅ Smart Home Systems: Visualizing power consumption trends in smart
meters.
✅ Social Media Analytics: Helps influencers track their engagement rates and
follower growth over time.
Industry Applications
🔹 Business Analytics: Companies use Matplotlib to create sales reports,
performance dashboards, and revenue trends.
🔹 Healthcare & Medicine: Used to visualize patient heart rate, sugar levels, or
other health indicators.
🔹 Engineering & Manufacturing: Engineers plot stress-strain curves, thermal
expansion graphs, and real-time monitoring of machine performance.
🔹 Weather Forecasting: Scientists analyze climate data to predict rainfall,
temperature trends, and hurricanes.
3. Pandas Applications
Daily Life Applications
✅ Personal Expense Tracking: Helps in managing monthly expenses, grocery
budgets, and travel costs.
✅ Sports Performance Analysis: Used by coaches to analyze player statistics,
win-loss records, and game trends.
✅ E-commerce Price Comparison: Used in online shopping sites to analyze
product prices across different platforms.
✅ Social Media Management: Helps influencers and marketers track
engagement metrics.
Industry Applications
🔹 Banking & Finance: Used for fraud detection, customer segmentation, and
loan approval analysis.
🔹 Retail & E-Commerce: Helps companies like Amazon, Flipkart, and Walmart
analyze sales trends, customer preferences, and demand forecasting.
🔹 Supply Chain & Logistics: Used to optimize delivery routes, manage
warehouse inventory, and reduce transportation costs.
🔹 Healthcare: Hospitals use Pandas for patient records analysis, predicting
disease outbreaks, and optimizing medical resources.
CONTENT BEYOND THE SYLLABUS
🔹 Vaex - Handling Billion-Row Datasets: An alternative to Pandas for fast, out-
of-core DataFrame operations.
🔹 Modin - Speeding Up Pandas with Multi-Core Processing: Uses parallel
execution to accelerate Pandas operations.
🔹 SQL Integration with Pandas: Using Pandas with databases (SQLite, MySQL,
PostgreSQL) for real-world data analytics.
🔹 Time Series Forecasting with Pandas: Advanced analysis techniques for
predicting stock prices, weather trends, and sales forecasting.
Assessment Schedule (Proposed Date & Actual Date)
Prescribed Text Books & Reference
Sl.No.  Book Name & Author  Book Type
3  Steve Abrams, “Artificial Intelligence and Machine Learning for Beginners: A Simple Guide to Understanding and Applying AI and ML”, Independently published, May 14, 2024.  Text Book
4  Vinod Chandra S S, Anand Hareendran S, “Artificial Intelligence and Machine Learning”, PHI Learning, 2014.  Reference Book
MINI PROJECT SUGGESTIONS
S.No.  Questions  K Level  COs
Easy Level
Medium Level
Hard Level
THANK YOU