Swarang Raut EDVA Experiment 1 Numpy Pandas

The document provides an overview of NumPy, a fundamental Python package for numerical computations, detailing its array structures, creation, and various operations. It covers topics such as array dimensions, data types, reshaping, slicing, and arithmetic operations, along with examples of how to implement these features in Python code. Additionally, it discusses advanced topics like stacking, random number generation, and concatenation of arrays.

Uploaded by

devyanigawade

NumPy, Skill-Based Lab Course [Python Programming] (CSL405) by Prof Nilesh Deotale,
Department of Computer Engineering, St John College of Engineering and Management, Palghar

May 2, 2021

1 Introduction to NumPy
NumPy is a Python package; the name stands for Numerical Python.
It is the fundamental package for numerical computation in Python.
It supports N-dimensional array objects that can be used for processing multidimensional data.
It supports different data types.

2 Array
An array is a data structure that stores values of the same data type.
Python lists can contain values of different data types.
NumPy arrays can only contain values of the same data type.
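The difference between a list and an array can be seen in a minimal sketch like the one below. The variable names are illustrative only; the point is that a list keeps each element's own type, while NumPy picks one common dtype for the whole array.

```python
import numpy as np

# A Python list keeps the original type of every element.
mixed_list = [1, 2.5, "three"]
list_types = [type(v).__name__ for v in mixed_list]
print(list_types)

# A NumPy array converts int and float inputs to one shared dtype.
arr = np.array([1, 2.5, 3])
print(arr, arr.dtype)
```

Because the array holds a single dtype, the integer inputs are stored as floating-point values here.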

3 NumPy Array
• A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers.
• The number of dimensions is the rank of the array.
• The shape of an array is a tuple of integers giving the size of the array along each dimension.

4 Creation of array
To create a NumPy array, we first need to import the numpy package:

[89] : import numpy as np

Single-dimensional NumPy array:

[90] : import numpy as np
a = np.array([1, 2, 3])
print(a)

[1 2 3]

[91] : print(type(a))

<class 'numpy.ndarray'>
Multi-dimensional array:

[94]: a = np.array([(1,2,3),(4,5,6)])
print(a)

[[1 2 3]
 [4 5 6]]

5 Python NumPy Operations

6 ndim:
You can find the number of dimensions of an array, for example whether it is two-dimensional or one-dimensional.

[100]: import numpy as np
a = np.array([(1,2,3),(4,5,6)])
print(a.ndim)

2

7 itemsize:
You can find the size in bytes of each element.

[101]: import numpy as np
a = np.array([(1,2,3)])
print(a.itemsize)

8 dtype:
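You can find the data type of the elements that are stored in an array. The default integer type depends on the platform and the NumPy version, so on other systems this may print int64.

[7]: import numpy as np
a = np.array([(1,2,3)])
print(a.dtype)

int32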

9 size and shape of the array using the ‘size’ and ‘shape’ attributes

[102]: import numpy as np
a = np.array([(1,2,3,4,5,6)])
print(a.size)
print(a.shape)
print(a.ndim)

6
(1, 6)
2

10 reshape:
Reshaping changes the number of rows and columns of an array, which gives a new view of the same data.

[103]: import numpy as np
a = np.array([(8,9,10),(11,12,13)])
print(a)
a = a.reshape(3,2)
print(a)

[[ 8  9 10]
 [11 12 13]]
[[ 8  9]
 [10 11]
 [12 13]]
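The "view" behaviour of reshape can be checked with a small sketch. It assumes a freshly created, contiguous array like the one above; for such arrays reshape shares memory with the original instead of copying. Passing -1 for one dimension asks NumPy to infer it.

```python
import numpy as np

a = np.array([(8, 9, 10), (11, 12, 13)])

# -1 lets NumPy work out the missing dimension (2 columns here).
b = a.reshape(3, -1)
print(b.shape)

# For a contiguous array, reshape returns a view, so writing
# through b is also visible through a.
b[0, 0] = 100
print(a[0, 0])
print(np.shares_memory(a, b))
```

If an independent array is needed, calling .copy() on the reshaped result avoids this sharing.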

11 slicing:
Slicing means extracting a particular set of elements from an array.

[105]: import numpy as np
a = np.array([(1,2,3,4),(3,4,5,6)])
print(a[1,2])

5

[108]: import numpy as np
a = np.array([(1,2,3,4),(3,4,5,6),(23,24,25,26)])
print(a[1:,2])

[ 5 25]
Here 1: selects every row from index 1 onward, and 2 selects the element at column index 2 in each of those rows, which gives the values 5 and 25 respectively.

[109]: import numpy as np
a = np.array([(8,9),(10,11),(12,13)])
print(a[0:2,1])

[ 9 11]
The slice 0:2 stops before index 2, so the third row is not included. Therefore only 9 and 11 are printed; with a[0:,1] you would get all the elements of that column, i.e. [ 9 11 13].
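The slicing rules described above can be compared side by side in a short sketch that uses the same array. The variable names are illustrative.

```python
import numpy as np

a = np.array([(8, 9), (10, 11), (12, 13)])

first_two = a[0:2, 1]   # rows 0 and 1 (stop index 2 is excluded), column 1
whole_col = a[:, 1]     # every row, column 1
last_row = a[-1, :]     # last row, every column

print(first_two)
print(whole_col)
print(last_row)
```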

12 max/ min
[13]: import numpy as np
a = np.array([1,2,3])
print(a.min())
print(a.max())
print(a.sum())

1
3
6
To calculate the sum along each column or each row, use the axis argument: axis=0 gives one sum per column and axis=1 gives one sum per row.

[14] : a = np.array([(1,2,3),(3,4,5)])
print(a.sum(axis=0))

[4 6 8]

[15] : a = np.array([(1,2,3),(3,4,5)])
print(a.sum(axis=1))

[ 6 12]
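One way to remember the axis argument is that the named axis is the one that disappears from the shape. The following sketch, with illustrative variable names, shows this for the same 2 x 3 array.

```python
import numpy as np

a = np.array([(1, 2, 3), (3, 4, 5)])   # shape (2, 3)

col_sums = a.sum(axis=0)   # collapses the 2 rows, leaving 3 column sums
row_sums = a.sum(axis=1)   # collapses the 3 columns, leaving 2 row sums

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```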

13 Square Root & Standard Deviation

[113]: import numpy as np
a = np.array([(1,2,3),(3,4,5)])
q = np.sqrt(a)
print(q)  # the square root of every element is printed
print(q.dtype)

[[1.         1.41421356 1.73205081]
 [1.73205081 2.         2.23606798]]
float64

[115]: print(np.std(a))

1.2909944487358056
The standard deviation of the array above is printed, i.e. how much the elements vary from the mean value of the array.
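To make the meaning of that number concrete, the sketch below recomputes it by hand. By default np.std uses the population formula, i.e. the square root of the mean squared deviation from the mean. The name manual_std is illustrative.

```python
import numpy as np

a = np.array([(1, 2, 3), (3, 4, 5)])

mean = a.mean()                                  # 3.0 for this array
manual_std = np.sqrt(np.mean((a - mean) ** 2))   # root of the mean squared deviation

print(manual_std)
print(np.std(a))
```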

14 Addition Operation

[119] : import numpy as np
x = np.array([(1,2,3),(3,4,5)])
y = np.array([(1,2,3),(3,4,5)])
q = x + y
print(q)
print(q.reshape(3,2))

[[ 2  4  6]
 [ 6  8 10]]
[[ 2  4]
 [ 6  6]
 [ 8 10]]

[120] : import numpy as np
x = np.array([(1,2,3),(3,4,5)])
y = np.array([(1,2,3),(3,4,5)])
print(x - y)
print(x * y)
print(x / y)

[[0 0 0]
 [0 0 0]]
[[ 1  4  9]
 [ 9 16 25]]
[[1. 1. 1.]
 [1. 1. 1.]]
15 Vertical & Horizontal Stacking
If you want to concatenate two arrays rather than add them element-wise, there are two ways to do it: vertical stacking and horizontal stacking.

[20] : import numpy as np
x = np.array([(1,2,3),(3,4,5)])
y = np.array([(1,2,3),(3,4,5)])  # same values as y in the previous cell
print(np.vstack((x,y)))

[[1 2 3]
 [3 4 5]
 [1 2 3]
 [3 4 5]]

[21] : print(np.hstack((x,y)))

[[1 2 3 1 2 3]
 [3 4 5 3 4 5]]
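For 2-D arrays like these, both kinds of stacking can also be written with np.concatenate and an explicit axis. The sketch below is only an illustration of that equivalence, using the same x and y values.

```python
import numpy as np

x = np.array([(1, 2, 3), (3, 4, 5)])
y = np.array([(1, 2, 3), (3, 4, 5)])

v = np.concatenate((x, y), axis=0)   # same result as np.vstack for 2-D input
h = np.concatenate((x, y), axis=1)   # same result as np.hstack for 2-D input

print(v.shape)
print(h.shape)
```

Vertical stacking requires the arrays to have the same number of columns, and horizontal stacking requires the same number of rows.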

[22] : import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

[23] : x1

[23] : array([5, 0, 3, 3, 7, 9])

[24] : x2

[24] : array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

[25] : x3

[25] : array([[[8, 1, 5, 9, 8],
        [9, 4, 3, 0, 3],
        [5, 0, 2, 3, 8],
        [1, 3, 3, 3, 7]],

       [[0, 1, 9, 9, 0],
        [4, 7, 3, 2, 7],
        [2, 0, 0, 4, 5],
        [5, 6, 8, 4, 1]],

       [[4, 9, 8, 1, 1],
        [7, 9, 9, 3, 6],
        [7, 2, 0, 3, 5],
        [9, 4, 4, 6, 4]]])

We use NumPy’s random number generator, seeded with a fixed value so that the same random arrays are generated each time this code is run.
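The effect of seeding can be verified directly: reseeding with the same value reproduces the same numbers. The sketch below also shows the newer Generator interface (np.random.default_rng), which this lab does not use; it is included only as an assumption about what readers may meet in more recent code.

```python
import numpy as np

# Legacy global-state interface, as used in this lab.
np.random.seed(0)
first = np.random.randint(10, size=6)
np.random.seed(0)
second = np.random.randint(10, size=6)
print(np.array_equal(first, second))

# Newer Generator interface: two generators with the same seed agree too.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
draw_a = rng_a.integers(10, size=6)
draw_b = rng_b.integers(10, size=6)
print(np.array_equal(draw_a, draw_b))
```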

[26] : print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype gives the data type of the array:

[27] : print("dtype:", x3.dtype)

dtype: int32
itemsize lists the size (in bytes) of each array element, and nbytes lists the total size (in bytes) of the array:

[28] : print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 4 bytes
nbytes: 240 bytes
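The relationship between these attributes is simply nbytes = size x itemsize. The sketch below checks it on an array of the same shape; the dtype is set explicitly to int32 (an assumption made here so the numbers match the output above on any platform).

```python
import numpy as np

# Same shape as x3, with the dtype fixed explicitly.
x3 = np.zeros((3, 4, 5), dtype=np.int32)

print(x3.size, x3.itemsize, x3.nbytes)
print(x3.nbytes == x3.size * x3.itemsize)
```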

16 Reshaping of Arrays

[29] : grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]

[30] : a = np.array([[1,2,3],[4,5,6]])
print(a.shape)

(2, 3)

[31] : # reshaping the ndarray in place by assigning to its shape
a.shape = (3, 2)
print(a)

[[1 2]
 [3 4]
 [5 6]]

[32] : # reshape function to change the shape of an array
b = a.reshape(3,2)
print(b)

[[1 2]
 [3 4]
 [5 6]]

[33] : x = np.array([1, 2, 3])
# row vector via reshape
x.reshape((1, 3))

[33] : array([[1, 2, 3]])

[34] : r = range(24)

[35] : print(r)

range(0, 24)

[122]: # An array of evenly spaced numbers
a = np.arange(24)
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

[123]: print(a.ndim)

[125]: # Reshaping the array a
b = a.reshape(6,4,1)
print(b)

[[[ 0]
  [ 1]
  [ 2]
  [ 3]]

 [[ 4]
  [ 5]
  [ 6]
  [ 7]]

 [[ 8]
  [ 9]
  [10]
  [11]]

 [[12]
  [13]
  [14]
  [15]]

 [[16]
  [17]
  [18]
  [19]]

 [[20]
  [21]
  [22]
  [23]]]

[84]: array = np.arange(16).reshape(2, 2, 4)
print("\nOriginal array reshaped to 3D : \n", array)

Original array reshaped to 3D :
 [[[ 0  1  2  3]
  [ 4  5  6  7]]

 [[ 8  9 10 11]
  [12 13 14 15]]]
ndarray.itemsize
This array attribute returns the length of each element of the array in bytes.

[127]: # dtype of array is int8 (1 byte)
x = np.array([1,2,3,4,5], dtype = np.int8)
print(x.itemsize)

1

[40]: # dtype of array is float32 (4 bytes)
x = np.array([1,2,3,4,5], dtype = np.float32)
print(x.itemsize)

4
In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

[41]: x2 = np.random.randint(10, size=(3, 4))

[42]: x2

[42] : array([[4, 3, 4, 4],
       [8, 4, 3, 7],
       [5, 5, 0, 1]])

[43] : x2[0, 0]

[43]: 4

[44] : x2[2, 0]

[44]: 5

[45] : x2[2, -1]

[45]: 1

17 NumPy Arithmetic operations

[46] : x = np.array([[1,2],[3,4]], dtype=np.float64)
print(x)

[[1. 2.]
 [3. 4.]]

[47] : y = np.array([[5,6],[7,8]], dtype=np.float64)
print(y)

[[5. 6.]
 [7. 8.]]

[48] : print(x + y)

[[ 6. 8.]
[10. 12.]]

[49] : print(np.add(x, y))

[[ 6. 8.]
[10. 12.]]

[50] : print(x - y)

[[-4. -4.]
[-4. -4.]]

[51] : print(np.subtract(x, y))

[[-4. -4.]
[-4. -4.]]

[52] : print(x * y)

[[ 5. 12.]
 [21. 32.]]

[53] : print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]

[54] : print(x.dot(y))  # matrix product, not element-wise multiplication

[[19. 22.]
 [43. 50.]]

[55] : print(x.dot(y))

[[19. 22.]
[43. 50.]]

[56] : print(np.dot(x, y))

[[19. 22.]
[43. 50.]]

[57] : print(x / y)

[[0.2 0.33333333]
[0.42857143 0.5 ]]

[58] : print(np.divide(x, y))

[[0.2 0.33333333]
[0.42857143 0.5 ]]

[59] : print(np.sum(x))  # Compute sum of all elements

10.0

[60] : print(np.sum(x, axis=0))  # Compute sum of each column

[4. 6.]

[61] : print(np.sum(x, axis=1))  # Compute sum of each row

[3. 7.]

18 Concatenation of arrays

[62] : x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

[62] : array([1, 2, 3, 3, 2, 1])

[63] : z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]

Pandas, Skill-Based Lab Course [Python Programming] (CSL405) by Prof Vivian Lobo,
Department of Computer Engineering, St John College of Engineering and Management, Palghar

May 7, 2021

1 Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding any data type. A Series can be thought of as a single column in an Excel sheet.

How to Create a Series?

A Pandas Series can be created from a Python list or a NumPy array. Remember that, unlike a Python list, a Series has a single dtype for all of its data. This makes a NumPy array a natural candidate for creating a Pandas Series.

[1] : import pandas as pd
import numpy as np
series_list = pd.Series([1,2,3,4,5,6])
series_np = pd.Series(np.array([10,20,30,40,50,60]))

[2] : series_list

[2]: 0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

[3] : series_np

[3]: 0    10
1    20
2    30
3    40
4    50
5    60
dtype: int32

The example below uses a NumPy-generated sequence as the index:

[11] : series_index = pd.Series(
    np.array([10,20,30,40,50,60,70]),
    index=np.arange(0,14,2)
)

[12] : series_index

[12]: 0 10
2 20
4 30
6 40
8 50
10 60
12 70
dtype: int32

The example below uses strings as the row index:

[13] : series_index = pd.Series(
    np.array([10,20,30,40,50,60]),
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

[14] : series_index

[14]: a 10
b 20
c 30
d 40
e 50
f 60
dtype: int32
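Once a Series has a custom index, values can be looked up either by label or by position. The sketch below rebuilds the string-indexed Series and shows both; note that a label slice includes both endpoints, unlike a positional slice.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([10, 20, 30, 40, 50, 60]),
              index=['a', 'b', 'c', 'd', 'e', 'f'])

by_label = s['c']            # lookup by index label
by_position = s.iloc[2]      # lookup by integer position
label_slice = s['b':'d']     # label slices include the end label 'd'

print(by_label, by_position)
print(label_slice)
```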

Creating a Pandas Series from a Python dictionary

[15]: t_dict = {'a': 1, 'b': 2, 'c': 3}
# Creating a Series out of above dict
series_dict = pd.Series(t_dict)

[16]: series_dict

[16]: a    1
b    2
c    3
dtype: int64

[13]: t_dict = {'a': [1,2,3], 'b': [4,5], 'c': 6, 'd': "Hello World"}
# Creating a Series out of above dict
series_dict1 = pd.Series(t_dict)

[14]: series_dict1

[14]: a      [1, 2, 3]
b         [4, 5]
c              6
d    Hello World
dtype: object

2 Python Pandas Data Frame


Pandas is a Python library that provides high-performance, easy-to-use data structures, such as Series and Data Frame, and data analysis tools for the Python programming language. (Older pandas releases also provided a Panel structure, which has since been removed.) A Pandas Data Frame consists of three main components: the data, the rows, and the columns. To use the pandas library and its data structures, all you have to do is install it and import it.
Basic operations that can be applied on a pandas Data Frame are as shown below.
Creating a Data Frame.
Performing operations on Rows and Columns.
Data Selection, addition, deletion.
Working with missing data.
Renaming the Columns or Indices of a DataFrame.
1. Creating a Data Frame.
A Pandas data frame can be created by loading data from existing external storage such as a SQL database or CSV files. A Pandas Data Frame can also be created from lists, dictionaries, etc. One of the ways to create a pandas data frame is shown below:

[17]: # import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
data

[17]: {'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
 'Age': [24, 23, 22, 19, 10]}

[18]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df

[18]:       Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10

2. Performing operations on Rows and Columns.


A Data Frame is a two-dimensional data structure: data is stored in rows and columns. Below we perform some operations on rows and columns.
Selecting a Column:
To select a particular column, index the data frame with the column name; passing a list of names selects several columns.

[56]: # import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
data

[56]: {'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
 'Age': [24, 23, 22, 19, 10]}

[21]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)

# Selecting column
df[['Name', 'Age']]

[21]:       Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10

Selecting a Row:
A Pandas Data Frame provides an indexer called “loc”, which retrieves rows by label. Rows can also be selected by integer position using “iloc”.

[23]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)

# Selecting a row
#row = df.loc[3]
row = df.iloc[0:3]
row

[23]:      Name  Age
0  Ashika   24
1    Tanu   23
2  Ashwin   22
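The difference between loc and iloc is easier to see when the row labels are not the default integers. The sketch below uses the same data with hypothetical labels r1 to r5 (these labels are an assumption made for illustration, not part of the lab).

```python
import pandas as pd

data = {'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
# Hypothetical string labels so that labels and positions differ.
df = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4', 'r5'])

by_label = df.loc['r1':'r3']    # label slice: the end label is included
by_position = df.iloc[0:3]      # positional slice: the stop position is excluded
one_cell = df.loc['r4', 'Name'] # row label and column name together

print(by_label)
print(by_position)
print(one_cell)
```

Both slices return the same three rows here, even though one is written with labels and the other with positions.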

3. Data Selection, addition, deletion.


You can treat a DataFrame semantically like a dictionary of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dictionary operations:

[29]: # import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting the data from the column
df['Age']

[29]: 0    24
1    23
2    22
3    19
4    10
Name: Age, dtype: int64

[30]: del df['Age']
df

[30]:       Name
0   Ashika
1     Tanu
2   Ashwin
3    Mohit
4  Sourabh
Data can be added by using the insert function, which inserts a new column at a particular position among the columns:

[33]: df.insert(1, 'name', df['Name'])
df

[33]:       Name     name
0   Ashika   Ashika
1     Tanu     Tanu
2   Ashwin   Ashwin
3    Mohit    Mohit
4  Sourabh  Sourabh

4. Working with missing data.


Missing data occurs often when working with large data sets. It usually appears as NaN (Not a Number). To detect those values, we can use the “isnull()” method (“isna()” is an alias for it), which checks each cell of a data frame for a null value.
Checking for the missing values.

[35]: # importing both pandas and numpy libraries
import pandas as pd
import numpy as np
# Dictionary of key pair values called data
data = {'First name':[np.nan, np.nan],
        'Age': [23, np.nan]}
df = pd.DataFrame(data)
df

[35]:    First name   Age
0         NaN  23.0
1         NaN   NaN

[37]: # using the isna() function (alias of isnull())
df.isna()

[37]:    First name    Age
0        True  False
1        True   True

isnull() returns False where a value is present and True for null values. Now that we have found the missing values, the next task is to fill them with 0; this can be done as shown below:

[76]: df.fillna(0)  # returns a new DataFrame; df itself is unchanged
df.dtypes

[76]: name           object
age             int64
designation    object
dtype: object
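The detect-then-fill workflow can be sketched end to end on the small frame defined above. The variable names missing_per_column and filled are illustrative. Note that fillna returns a new DataFrame, so the result has to be assigned to a name to be kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First name': [np.nan, np.nan],
                   'Age': [23, np.nan]})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Replace every missing value with 0 in a new DataFrame.
filled = df.fillna(0)
print(filled)

# The original frame still contains its NaN values.
still_missing = bool(df.isna().any().any())
print(still_missing)
```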

5. Renaming the Columns or Indices of a DataFrame.


To give the columns or the index values of your data frame different names, it is best to use the .rename() method. The column names below are deliberately misspelled so that the effect of renaming is easy to see.

[43]: # import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'NAMe':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'AGe': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df

[43]:       NAMe  AGe
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10

[50]: newcols = {
    'NAMe': 'Name',
    'AGe': 'Age'
}
# Use `rename()` to rename your columns; it returns a new
# DataFrame, so assign the result back to df
df = df.rename(columns=newcols)
df

[50]:       Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10

[52]: # The values of new index
newindex = {
    0: 'a',
    1: 'b',
    2: 'c',
    3: 'd',
    4: 'e'
}
# Rename your index
df.rename(index=newindex, inplace=False)

[52]:       Name  Age
a   Ashika   24
b     Tanu   23
c   Ashwin   22
d    Mohit   19
e  Sourabh   10
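A point that is easy to miss with rename() is that, without inplace=True, it leaves the original frame untouched and returns a renamed copy. The sketch below, on a shortened two-row version of the data (an assumption made to keep it small), renames columns and index in one call and then checks both frames.

```python
import pandas as pd

df = pd.DataFrame({'NAMe': ['Ashika', 'Tanu'],
                   'AGe': [24, 23]})

# Columns and index labels can be renamed in the same call.
renamed = df.rename(columns={'NAMe': 'Name', 'AGe': 'Age'},
                    index={0: 'a', 1: 'b'})

print(list(df.columns))        # original column names are unchanged
print(list(renamed.columns))
print(list(renamed.index))
```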

[54] : my_dict = {
    'name' : ["a", "b", "c", "d", "e", "f", "g"],
    'age' : [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

[55] : import pandas as pd
df = pd.DataFrame(my_dict)

[56] : df

[56]:   name  age designation
0    a   20          VP
1    b   27         CEO
2    c   35         CFO
3    d   55          VP
4    e   18          VP
5    f   21         CEO
6    g   35          MD

The Row Index
Since we haven’t provided any row index values to the DataFrame, it automatically generates a sequence (0...6) as the row index. To provide our own row index, we pass the index parameter to the DataFrame(...) function:

[61]: df = pd.DataFrame(my_dict, index=[2,3,4,5,6,7,8])

[62]: df

[62]:   name  age designation
2    a   20          VP
3    b   27         CEO
4    c   35         CFO
5    d   55          VP
6    e   18          VP
7    f   21         CEO
8    g   35          MD

The index need not always be numerical; we can also pass strings as the index. For example:

[42]: df = pd.DataFrame(
    my_dict,
    index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"])

[43]: df

[43]:         name  age designation
First      a   20          VP
Second     b   27         CEO
Third      c   35         CFO
Fourth     d   55          VP
Fifth      e   18          VP
Sixth      f   21         CEO
Seventh    g   35          MD

As you might have guessed, an Index is homogeneous in nature, which means we can also use a NumPy array as the Index.

[67]: np_arr = np.array([10,20,30,40,50,60,70])
df = pd.DataFrame(my_dict, index=np_arr)

[68]: df

[68]:    name  age designation
10    a   20          VP
20    b   27         CEO
30    c   35         CFO
40    d   55          VP
50    e   18          VP
60    f   21         CEO
70    g   35          MD

The Columns of Pandas DataFrame


Unlike Python lists or dictionaries, and just like NumPy arrays, a column of a DataFrame always has a single data type. We can check the data type of a column using either dictionary-like syntax or attribute access on the DataFrame:

[75]: #df['age'].dtype  # Dict-like syntax
#df.age.dtype     # DataFrame.ColumnName
df.name.dtype     # DataFrame.ColumnName

[75]: dtype('O')

If we want to check the data types of all columns inside the DataFrame, we use the dtypes attribute of the DataFrame:

[48]: df.dtypes

[48]: name           object
age             int64
designation    object
dtype: object

Viewing the Data of a DataFrame

[49]: df.head()  # Displays first five rows

[49] :    name  age designation
10    a   20          VP
20    b   27         CEO
30    c   35         CFO
40    d   55          VP
50    e   18          VP

[50] : df.tail()  # Displays last five rows

[50]:    name  age designation
30    c   35         CFO
40    d   55          VP
50    e   18          VP
60    f   21         CEO
70    g   35          MD

[51] : df.head(2)  # Displays first two rows

[51]:    name  age designation
10    a   20          VP
20    b   27         CEO

[52]: df.tail(7)  # Displays last 7 rows

[52]:    name  age designation
10    a   20          VP
20    b   27         CEO
30    c   35         CFO
40    d   55          VP
50    e   18          VP
60    f   21         CEO
70    g   35          MD

Getting a Series out of a Pandas DataFrame

[54] : my_dict = {
    'name' : ["a", "b", "c", "d", "e"],
    'age' : [10, 20, 30, 40, 50],
    'designation': ["CEO", "VP", "SVP", "AM", "DEV"] }
df = pd.DataFrame( my_dict,
    index = [
        "First -> ",
        "Second -> ",
        "Third -> ",
        "Fourth -> ",
        "Fifth -> "])

[55] : df

[55] :            name  age designation
First ->      a   10         CEO
Second ->     b   20          VP
Third ->      c   30         SVP
Fourth ->     d   40          AM
Fifth ->      e   50         DEV

A DataFrame provides two ways of accessing a column: dictionary syntax, df['column_name'], or attribute syntax, df.column_name. Each time we use either form to get a column, we get a Pandas Series.

[56] : series_name = df.name
series_age = df.age
series_designation = df.designation

[57] : series_name

[57]: First ->     a
Second ->    b
Third ->     c
Fourth ->    d
Fifth ->     e
Name: name, dtype: object

[58]: series_age

[58]: First ->     10
Second ->    20
Third ->     30
Fourth ->    40
Fifth ->     50
Name: age, dtype: int64

[59]: series_designation

[59]: First ->     CEO
Second ->     VP
Third ->     SVP
Fourth ->     AM
Fifth ->     DEV
Name: designation, dtype: object

3 Grouping Function in Pandas
Grouping is an essential part of data analysis in Pandas. We can group similar types of data and apply various functions to them.
For grouping in Pandas, we use the .groupby() function to group according to a column (“Team” in the examples below) and then apply a function such as first() or mean() to each group:

[20]: # importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# Print the dataframe
df

[20]:
              Name            Team  Number Position   Age Height  Weight  \
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0
..             ...             ...     ...      ...   ...    ...     ...
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN

               College     Salary
0                Texas  7730337.0
1            Marquette  6796117.0
2    Boston University        NaN
3        Georgia State  1148640.0
4                  NaN  5000000.0
..                 ...        ...
453             Butler  2433333.0
454                NaN   900000.0
455                NaN  2900000.0
456             Kansas   947276.0
457                NaN        NaN

[458 rows x 9 columns]

[26]: # applying groupby() function to
# group the data on team value.
gk = df.groupby('Team')

# Let's print the first entries
# in all the groups formed.
gk.first()

[26]:
                                         Name  Number Position   Age Height  \
Team
Atlanta Hawks                   Kent Bazemore    24.0       SF  26.0    6-5
Boston Celtics                  Avery Bradley     0.0       PG  25.0    6-2
Brooklyn Nets                Bojan Bogdanovic    44.0       SG  27.0    6-8
Charlotte Hornets               Nicolas Batum     5.0       SG  27.0    6-8
Chicago Bulls                Cameron Bairstow    41.0       PF  25.0    6-9
Cleveland Cavaliers       Matthew Dellavedova     8.0       PG  25.0    6-4
Dallas Mavericks              Justin Anderson     1.0       SG  22.0    6-6
Denver Nuggets                 Darrell Arthur     0.0       PF  28.0    6-9
Detroit Pistons                  Joel Anthony    50.0        C  33.0    6-9
Golden State Warriors         Leandro Barbosa    19.0       SG  33.0    6-3
Houston Rockets                  Trevor Ariza     1.0       SF  30.0    6-8
Indiana Pacers                    Lavoy Allen     5.0       PF  27.0    6-9
Los Angeles Clippers             Cole Aldrich    45.0        C  27.0   6-11
Los Angeles Lakers               Brandon Bass     2.0       PF  31.0    6-8
Memphis Grizzlies                Jordan Adams     3.0       SG  21.0    6-5
Miami Heat                         Chris Bosh     1.0       PF  32.0   6-11
Milwaukee Bucks         Giannis Antetokounmpo    34.0       SF  21.0   6-11
Minnesota Timberwolves        Nemanja Bjelica    88.0       PF  28.0   6-10
New Orleans Pelicans            Alexis Ajinca    42.0        C  28.0    7-2
New York Knicks                 Arron Afflalo     4.0       SG  30.0    6-5
Oklahoma City Thunder            Steven Adams    12.0        C  22.0    7-0
Orlando Magic                  Dewayne Dedmon     3.0        C  26.0    7-0
Philadelphia 76ers                Elton Brand    42.0       PF  37.0    6-9
Phoenix Suns                     Eric Bledsoe     2.0       PG  26.0    6-1
Portland Trail Blazers        Cliff Alexander    34.0       PF  20.0    6-8
Sacramento Kings                   Quincy Acy    13.0       SF  25.0    6-7
San Antonio Spurs           LaMarcus Aldridge    12.0       PF  30.0   6-11
Toronto Raptors               Bismack Biyombo     8.0        C  23.0    6-9
Utah Jazz                       Trevor Booker    33.0       PF  28.0    6-8
Washington Wizards              Alan Anderson     6.0       SG  33.0    6-6

                        Weight                College      Salary
Team
Atlanta Hawks            201.0           Old Dominion   2000000.0
Boston Celtics           180.0                  Texas   7730337.0
Brooklyn Nets            216.0         Oklahoma State   3425510.0
Charlotte Hornets        200.0  Virginia Commonwealth  13125306.0
Chicago Bulls            250.0             New Mexico    845059.0
Cleveland Cavaliers      198.0           Saint Mary's   1147276.0
Dallas Mavericks         228.0               Virginia   1449000.0
Denver Nuggets           235.0                 Kansas   2814000.0
Detroit Pistons          245.0                   UNLV   2500000.0
Golden State Warriors    194.0         North Carolina   2500000.0
Houston Rockets          215.0                   UCLA   8193030.0
Indiana Pacers           255.0                 Temple   4050000.0
Los Angeles Clippers     250.0                 Kansas   1100602.0
Los Angeles Lakers       250.0                    LSU   3000000.0
Memphis Grizzlies        209.0                   UCLA   1404600.0
Miami Heat               235.0           Georgia Tech  22192730.0
Milwaukee Bucks          222.0                Arizona   1953960.0
Minnesota Timberwolves   240.0             Louisville   3950001.0
New Orleans Pelicans     248.0             California   4389607.0
New York Knicks          210.0                   UCLA   8000000.0
Oklahoma City Thunder    255.0             Pittsburgh   2279040.0
Orlando Magic            245.0                    USC    947276.0
Philadelphia 76ers       254.0                   Duke    947276.0
Phoenix Suns             190.0               Kentucky  13500000.0
Portland Trail Blazers   240.0                 Kansas    525093.0
Sacramento Kings         240.0                 Baylor    981348.0
San Antonio Spurs        240.0                  Texas  19689000.0
Toronto Raptors          245.0               Missouri   2814000.0
Utah Jazz                228.0                Clemson   4775000.0
Washington Wizards       220.0         Michigan State   4000000.0

[28]: # Finding the values contained in the "Brooklyn Nets" group
gk.get_group('Brooklyn Nets')

[28]:
                       Name  Number Position   Age Height  Weight  \
15         Bojan Bogdanovic    44.0       SG  27.0    6-8   216.0
16             Markel Brown    22.0       SG  24.0    6-3   190.0
17          Wayne Ellington    21.0       SG  28.0    6-4   200.0
18  Rondae Hollis-Jefferson    24.0       SG  21.0    6-7   220.0
19             Jarrett Jack     2.0       PG  32.0    6-3   200.0
20           Sergey Karasev    10.0       SG  22.0    6-7   208.0
21          Sean Kilpatrick     6.0       SG  26.0    6-4   219.0
22             Shane Larkin     0.0       PG  23.0   5-11   175.0
23              Brook Lopez    11.0        C  28.0    7-0   275.0
24         Chris McCullough     1.0       PF  21.0   6-11   200.0
25              Willie Reed    33.0       PF  26.0   6-10   220.0
26          Thomas Robinson    41.0       PF  25.0   6-10   237.0
27               Henry Sims    14.0        C  26.0   6-10   248.0
28             Donald Sloan    15.0       PG  28.0    6-3   205.0
29           Thaddeus Young    30.0       PF  27.0    6-8   221.0

           College      Salary
15             NaN   3425510.0
16  Oklahoma State    845059.0
17  North Carolina   1500000.0
18         Arizona   1335480.0
19    Georgia Tech   6300000.0
20             NaN   1599840.0
21      Cincinnati    134215.0
22      Miami (FL)   1500000.0
23        Stanford  19689000.0
24        Syracuse   1140240.0
25     Saint Louis    947276.0
26          Kansas    981348.0
27      Georgetown    947276.0
28       Texas A&M    947276.0
29    Georgia Tech  11235955.0

Use the groupby() function to form groups based on more than one category (i.e. use more than one column to perform the splitting).

[30]: # importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# First grouping based on "Team"
# Within each team we are grouping based on "Position",
# and within each position based on "Weight"
gkk = df.groupby(['Team', 'Position', 'Weight'])

# Print the first value in each group
gkk.first()

[30]:
                                              Name  Number   Age Height  \
Team               Position Weight
Atlanta Hawks      C        245.0       Al Horford    15.0  30.0   6-10
                            260.0   Walter Tavares    22.0  24.0    7-3
                   PF       235.0   Kris Humphries    43.0  31.0    6-9
                            237.0       Mike Scott    32.0  27.0    6-8
                            240.0     Mike Muscala    31.0  24.0   6-11
...                                            ...     ...   ...    ...
Washington Wizards SF       225.0     Jared Dudley     1.0  30.0    6-7
                   SG       195.0   Garrett Temple    17.0  30.0    6-6
                            207.0     Bradley Beal     3.0  22.0    6-5
                            218.0     Jarell Eddie     8.0  24.0    6-7
                            220.0    Alan Anderson     6.0  33.0    6-6

                                          College      Salary
Team               Position Weight
Atlanta Hawks      C        245.0         Florida  12000000.0
                            260.0             NaN   1000000.0
                   PF       235.0       Minnesota   1000000.0
                            237.0        Virginia   3333333.0
                            240.0        Bucknell    947276.0
...                                           ...         ...
Washington Wizards SF       225.0  Boston College   4375000.0
                   SG       195.0             LSU   1100602.0
                            207.0         Florida   5694674.0
                            218.0   Virginia Tech    561716.0
                            220.0  Michigan State   4000000.0

[414 rows x 6 columns]
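Per-group statistics such as the mean follow the same split-then-apply pattern as first(). Since nba.csv may not be at hand, the sketch below uses a four-row stand-in built from rows that appear in the output earlier in this section; the variable names are illustrative.

```python
import pandas as pd

# Small stand-in for nba.csv, using four rows shown in the output above.
df = pd.DataFrame({
    'Team': ['Boston Celtics', 'Boston Celtics', 'Utah Jazz', 'Utah Jazz'],
    'Position': ['PG', 'SF', 'PG', 'C'],
    'Age': [25.0, 25.0, 26.0, 26.0],
    'Weight': [180.0, 235.0, 203.0, 256.0],
})

# Mean of the numeric columns within each team.
mean_by_team = df.groupby('Team')[['Age', 'Weight']].mean()
print(mean_by_team)

# Number of rows in each (Team, Position) group.
group_sizes = df.groupby(['Team', 'Position']).size()
print(group_sizes)
```

Selecting the numeric columns before calling mean() avoids trying to average text columns such as Position.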
4 Pandas dataframe.aggregate()
The DataFrame.aggregate() function is used to apply one or more aggregations across one or more columns. It accepts a callable, a string, a dict, or a list of strings/callables. The most frequently used aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis

[33]: # Aggregate 'sum' and 'min' functions across all the columns in the data frame.

# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("nba.csv")

# selecting the first 15 rows of the dataframe
df[0:15]
df.dtypes

[33]: Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

[34]: # Aggregation works with only numeric type columns.

# Applying aggregation across all the columns
# sum and min will be found for each
# numeric type column in df dataframe
df.aggregate(['sum', 'min'])

[34]:      Number      Age    Weight        Salary
sum  8079.0  12311.0  101236.0  2.159837e+09
min     0.0     19.0     161.0  3.088800e+04

In Pandas, we can also apply different aggregation functions across different columns. For that, we need to pass a dictionary whose keys are the column names and whose values are the lists of aggregation functions for each specific column.

[35]: # importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("nba.csv")

# We are going to find aggregation for these columns
df.aggregate({"Number":['sum', 'min'],
              "Age":['max', 'min'],
              "Weight":['min', 'sum'],
              "Salary":['sum']})

[35] :      Number   Age    Weight        Salary
max     NaN  40.0       NaN           NaN
min     0.0  19.0     161.0           NaN
sum  8079.0   NaN  101236.0  2.159837e+09
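The NaN entries in a result like the one above mark combinations of function and column that were not requested. A small self-contained sketch (with made-up values, so nba.csv is not needed) shows the same behaviour.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({'Age': [25.0, 27.0, 22.0],
                   'Weight': [180.0, 205.0, 185.0]})

# Different aggregations per column; the rows of the result are the
# union of all requested function names.
result = df.agg({'Age': ['max', 'min'],
                 'Weight': ['min', 'sum']})
print(result)
```

Here 'max' was only requested for Age and 'sum' only for Weight, so the other cells in those rows are NaN.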

5 Merging DataFrame

[36] : # Merging a dataframe with one unique key combination
# importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age':[27, 24, 22, 32],}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)

print(df, "\n\n", df1)

  key    Name  Age
0  K0     Jai   27
1  K1  Princi   24
2  K2  Gaurav   22
3  K3    Anuj   32

   key    Address Qualification
0  K0     Nagpur         Btech
1  K1     Kanpur           B.A
2  K2  Allahabad          Bcom
3  K3    Kannuaj        B.hons

[37] : # Now we are using .merge() with one unique key combination
# using .merge() function
res = pd.merge(df, df1, on='key')
res

[37]:   key    Name  Age    Address Qualification
0  K0     Jai   27     Nagpur         Btech
1  K1  Princi   24     Kanpur           B.A
2  K2  Gaurav   22  Allahabad          Bcom
3  K3    Anuj   32    Kannuaj        B.hons

[39]: # Merging dataframe using multiple join keys.
# importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K1', 'K0', 'K1'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age':[27, 24, 22, 32],}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K0', 'K0', 'K0'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)

print(df, "\n\n", df1)

  key key1    Name  Age
0  K0   K0     Jai   27
1  K1   K1  Princi   24
2  K2   K0  Gaurav   22
3  K3   K1    Anuj   32

   key key1    Address Qualification
0  K0   K0     Nagpur         Btech
1  K1   K0     Kanpur           B.A
2  K2   K0  Allahabad          Bcom
3  K3   K0    Kannuaj        B.hons

[40]: # Now we merge dataframe using multiple keys
# merging dataframe using multiple keys
res1 = pd.merge(df, df1, on=['key', 'key1'])
res1

[40] :   key key1    Name  Age    Address Qualification
0  K0   K0     Jai   27     Nagpur         Btech
1  K2   K0  Gaurav   22  Allahabad          Bcom

Merging dataframes using the how argument:

The how argument of merge specifies which keys are to be included in the resulting table. If a key combination is missing from either the left or the right table, the corresponding values in the joined table will be NA.

[41] : # importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K1', 'K0', 'K1'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age':[27, 24, 22, 32],}

# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K0', 'K0', 'K0'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)

print(df, "\n\n", df1)

  key key1    Name  Age
0  K0   K0     Jai   27
1  K1   K1  Princi   24
2  K2   K0  Gaurav   22
3  K3   K1    Anuj   32

   key key1    Address Qualification
0  K0   K0     Nagpur         Btech
1  K1   K0     Kanpur           B.A
2  K2   K0  Allahabad          Bcom
3  K3   K0    Kannuaj        B.hons

[42] : # Now we set how = 'left' in order to use keys from left frame only.
# using keys from left frame
res = pd.merge(df, df1, how='left', on=['key', 'key1'])
res

[42] :   key key1    Name  Age    Address Qualification
0  K0   K0     Jai   27     Nagpur         Btech
1  K1   K1  Princi   24        NaN           NaN
2  K2   K0  Gaurav   22  Allahabad          Bcom
3  K3   K1    Anuj   32        NaN           NaN

[43] : # Now we set how = 'right' in order to use keys from right frame only.
# using keys from right frame
res1 = pd.merge(df, df1, how='right', on=['key', 'key1'])
res1

[43]:   key key1    Name   Age    Address Qualification
0  K0   K0     Jai  27.0     Nagpur         Btech
1  K1   K0     NaN   NaN     Kanpur           B.A
2  K2   K0  Gaurav  22.0  Allahabad          Bcom
3  K3   K0     NaN   NaN    Kannuaj        B.hons

[44]: # getting union of keys
res2 = pd.merge(df, df1, how='outer', on=['key', 'key1'])
res2

[44]:   key key1    Name   Age    Address Qualification
0  K0   K0     Jai  27.0     Nagpur         Btech
1  K1   K1  Princi  24.0        NaN           NaN
2  K2   K0  Gaurav  22.0  Allahabad          Bcom
3  K3   K1    Anuj  32.0        NaN           NaN
4  K1   K0     NaN   NaN     Kanpur           B.A
5  K3   K0     NaN   NaN    Kannuaj        B.hons

[45]: # getting intersection of keys
res3 = pd.merge(df, df1, how='inner', on=['key', 'key1'])
res3

[45]:   key key1    Name  Age    Address Qualification
0  K0   K0     Jai   27     Nagpur         Btech
1  K2   K0  Gaurav   22  Allahabad          Bcom
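The four values of how can be compared by counting the rows each one keeps. The sketch below uses two deliberately small frames (left and right are hypothetical names, with only partly overlapping keys) and the indicator option of pd.merge, which adds a _merge column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'Name': ['Jai', 'Princi', 'Gaurav']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'],
                      'Address': ['Nagpur', 'Allahabad', 'Kannuaj']})

# Row count produced by each join type.
row_counts = {}
for how in ('inner', 'left', 'right', 'outer'):
    merged = pd.merge(left, right, how=how, on='key')
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
outer = pd.merge(left, right, how='outer', on='key', indicator=True)
print(outer[['key', '_merge']])
```

An inner join keeps only keys present in both frames, left and right keep all keys of one side, and outer keeps the union of keys.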

[67]: # importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col='Name')

# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Team        Boston Celtics
Number                   0
Position                PG
Age                     25
Height                 6-2
Weight                 180
College              Texas
Salary         7.73034e+06
Name: Avery Bradley, dtype: object

 Team        Boston Celtics
Number                  28
Position                SG
Age                     22
Height                 6-5
Weight                 185
College      Georgia State
Salary         1.14864e+06
Name: R.J. Hunter, dtype: object
