Swarang Raut EDVA Experiment 1 Numpy Pandas
May 2, 2021
1 Introduction to NumPy
NumPy is a Python package; the name stands for "Numerical Python". It is the fundamental package for numerical computation in Python.
It supports N-dimensional array objects that can be used for processing multidimensional data, and it supports different data types.
2 Array
An array is a data structure that stores values of the same data type. Lists, by contrast, can contain values of different data types.
Arrays in Python can only contain values of the same data type.
3 NumPy Array
•A numpy array is a grid of values, all of the same type, and is indexed
by a tuple of nonnegative integers
•The number of dimensions is the rank of the array
•The shape of an array is a tuple of integers giving the size of the array along each dimension
4 Creation of array
To create a NumPy array, we first need to import the numpy package:
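The cell that produced the output here did not survive extraction. A minimal sketch of what it might look like, assuming the conventional alias np and the variable name a (inferred from the later cells, not taken from a surviving cell):

```python
import numpy as np  # conventional alias for the numpy package

# Build a one-dimensional array from a Python list.
a = np.array([1, 2, 3])
print(a)
```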
[1 2 3]
[91] : print(type(a))
<class 'numpy.ndarray'>
Multi-dimensional Array
[94]: a=np.array([(1,2,3),(4,5,6)])
print(a)
[[1 2 3]
[4 5 6]]
6 ndim:
You can find the number of dimensions of the array, i.e. whether it is a two-dimensional or a one-dimensional array.
print(a.ndim)
2
7 itemsize:
You can calculate the byte size of each element.
8 dtype:
You can find the data type of the elements that are stored in an array.
2
int32
9 size and shape of the array using the ‘size’ and ‘shape’ attributes
6
(1, 6)
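The attributes from sections 6 to 9 can be tried together. The sketch below uses the two-row array from the earlier multi-dimensional example rather than the (1, 6) array whose cell is missing, so its shape differs from the output above. The integer dtype and item size are platform dependent, so no value is shown for them.

```python
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6)])

print(a.ndim)      # 2: number of dimensions (axes)
print(a.itemsize)  # bytes per element; depends on the dtype
print(a.dtype)     # element type; the default integer type varies by platform
print(a.size)      # 6: total number of elements
print(a.shape)     # (2, 3): length along each dimension
```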
2
10 reshape:
Reshaping changes the number of rows and columns of an array, which gives a new
view of the same data.
[[ 8 9 10]
[11 12 13]]
[[ 8 9]
[10 11]
[12 13]]
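The cell behind this output is missing. A sketch that reproduces it, assuming the array was built from the values shown (the names a and b are illustrative):

```python
import numpy as np

a = np.array([(8, 9, 10), (11, 12, 13)])  # shape (2, 3)
b = a.reshape(3, 2)                        # same six values as 3 rows x 2 columns

print(a)
print(b)

# For a contiguous array like this one, reshape returns a view,
# so b shares its memory with a.
print(np.shares_memory(a, b))
```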
2]) 5
3
Here the colon represents all the rows, starting from row zero. To get the element
at index 2 of each row, we use index 2 for the column, which gives us the values 3 and 5
respectively.
[ 9 11]
Here the slice 0:2 excludes index 2, so the third row of the array is left out.
Therefore only 9 and 11 are printed; otherwise you would get all the
elements, i.e. [9 11 13].
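The slicing cells themselves are missing, so the following is a reconstruction from the described results, not the original code. The array contents and variable names are assumptions chosen to match the values quoted in the text.

```python
import numpy as np

a = np.array([(1, 2, 3), (3, 4, 5)])
col = a[0:, 2]          # all rows, column index 2

b = np.array([(8, 9), (10, 11), (12, 13)])
first_two = b[0:2, 1]   # rows 0 and 1 only (stop index 2 is excluded), column 1
all_rows = b[0:, 1]     # every row, column 1

print(col)        # [3 5]
print(first_two)  # [ 9 11]
print(all_rows)   # [ 9 11 13]
```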
12 max/ min
[13]: import numpy as np
a = np.array([1,2,3])
print(a.min())
print(a.max())
print(a.sum())
1
3
6
To calculate sums along one direction of the array, use the axis argument: axis=0 sums down each column and axis=1 sums across each row.
[14] : a= np.array([(1,2,3),(3,4,5)])
print(a.sum(axis=0))
[4 6 8]
[15] : a= np.array([(1,2,3),(3,4,5)])
print(a.sum(axis=1))
[ 6 12]
[115]: print(np.std(a))
1.2909944487358056
The standard deviation is printed for the above array, i.e. how much the elements vary
from the mean value of the NumPy array.
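To see where the number comes from, the population standard deviation can be computed by hand and compared with np.std. This is a sketch using the same array a as in the sum examples above:

```python
import numpy as np

a = np.array([(1, 2, 3), (3, 4, 5)])

mean = a.mean()                                  # 3.0
manual_std = np.sqrt(((a - mean) ** 2).mean())   # root of the mean squared deviation

print(manual_std)
print(np.std(a))  # np.std uses the same population formula by default (ddof=0)
```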
14 Addition Operation
print(q.reshape(3,2))
[[ 2 4 6]
[ 6 8 10]]
[[ 2 4]
[ 6 6]
[ 8 10]]
[[0 0 0]
[0 0 0]]
[[ 1 4 9]
[ 9 16 25]]
[[1. 1. 1.]
[1. 1. 1.]]
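Only the outputs of this section survived. A sketch of cells that would produce them, assuming x and y are both the array used in the earlier sum examples and that q holds their sum (q appears in the surviving fragment; x and y are inferred):

```python
import numpy as np

x = np.array([(1, 2, 3), (3, 4, 5)])
y = np.array([(1, 2, 3), (3, 4, 5)])

q = x + y               # element-wise addition
print(q)
print(q.reshape(3, 2))  # same values laid out as 3 rows x 2 columns
print(x - y)            # element-wise subtraction
print(x * y)            # element-wise multiplication (not a matrix product)
print(x / y)            # element-wise division; the result is floating point
```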
15 Vertical & Horizontal Stacking
If you want to concatenate two arrays rather than add them, you can do so in
two ways: vertical stacking and horizontal stacking.
[20] : import numpy as np
x= np.array([(1,2,3),(3,4,5)])
y= np.array([(1,2,3),(3,4,5)])
print(np.vstack((x,y)))
[[1 2 3]
[3 4 5]
[1 2 3]
[3 4 5]]
[21] : print(np.hstack((x,y)))
[[1 2 3 1 2 3]
[3 4 5 3 4 5]]
[24] : x2
[25] : x3
[[0, 1, 9, 9, 0],
[4, 7, 3, 2, 7],
[2, 0, 0, 4, 5],
[5, 6, 8, 4, 1]],
[[4, 9, 8, 1, 1],
[7, 9, 9, 3, 6],
[7, 2, 0, 3, 5],
[9, 4, 4, 6, 4]]])
We’ll use NumPy’s random number generator, which we will seed with a set value in order to
ensure that the same random arrays are generated each time this code is run
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype, the data type of the array
dtype: int32
itemsize, which lists the size (in bytes) of each array element, and nbytes, which
lists the total size (in bytes) of the array
[28] : print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
itemsize: 4 bytes
nbytes: 240 bytes
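The cells that created x2 and x3 are missing. A sketch of the seeded generation described above, assuming a seed of 0 and np.random.randint with an upper bound of 10 (neither is confirmed by the surviving text, so the values need not match the arrays shown):

```python
import numpy as np

np.random.seed(0)  # assumed seed; any fixed value makes the run reproducible

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("nbytes:", x3.nbytes, "bytes")  # always size * itemsize
```

Re-seeding with the same value and repeating the calls in the same order regenerates the same arrays.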
16 Reshaping of Arrays
[[1 2 3]
[4 5 6]
[7 8 9]]
[30] : a = np.array([[1,2,3],[4,5,6]])
print(a.shape)
(2, 3)
[[1 2]
[3 4]
[5 6]]
[32] : # Reshape function to resize an array
b = a.reshape(3,2)
print(b)
[[1 2]
[3 4]
[5 6]]
[33] : x = np.array([1, 2, 3])
[34] : r = range(24)
[35] : print(r)
range(0, 24)
[[[ 0]
[ 1]
[ 2]
[ 3]]
[[ 4]
[ 5]
[ 6]
[ 7]]
[[ 8]
[ 9]
[10]
[11]]
[[12]
[13]
[14]
[15]]
[[16]
[17]
[18]
[19]]
[[20]
[21]
[22]
[23]]]
[[ 8 9 10 11]
[12 13 14 15]]]
numpy.itemsize
This array attribute returns the length in bytes of each element of the array.
e) 1
4
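The two values above are consistent with a 1-byte and a 4-byte element type. A sketch that gives the same results, assuming the dtypes int8 and float32 (the original cells are missing, so these choices are assumptions):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=np.int8)
print(x.itemsize)  # 1: an int8 element occupies one byte

y = np.array([1, 2, 3, 4, 5], dtype=np.float32)
print(y.itemsize)  # 4: a float32 element occupies four bytes
```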
In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:
[42]: x2
[43] : x2[0, 0]
[43]: 4
[44] : x2[2, 0]
[44]: 5
[45] : x2[2, -1]
[45]: 1
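Because x2 above is randomly generated, the same indexing pattern is easier to check on a fixed array. In this sketch the array grid and its values are hypothetical:

```python
import numpy as np

# Hypothetical fixed array, used only to make the indexing results checkable.
grid = np.array([[3, 5, 2, 4],
                 [7, 6, 8, 8],
                 [1, 6, 7, 7]])

print(grid[0, 0])   # 3: row 0, column 0
print(grid[2, 0])   # 1: row 2, column 0
print(grid[2, -1])  # 7: row 2, last column (negative indices count from the end)
```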
[46] : x = np.array([[1,2],[3,4]], dtype=np.float64)
[47] : y = np.array([[5,6],[7,8]], dtype=np.float64)
[48] : print(x + y)
[[ 6. 8.]
[10. 12.]]
[[ 6. 8.]
[10. 12.]]
[50] : print(x - y)
[[-4. -4.]
[-4. -4.]]
[[-4. -4.]
[-4. -4.]]
[52] : print(x * y)
[[ 5. 12.]
[21. 32.]]
[53] : print(np.multiply(x, y))
[[ 5. 12.]
[21. 32.]]
[54] : print(x.dot(y))
[[19. 22.]
[43. 50.]]
[55] : print(x.dot(y))
[[19. 22.]
[43. 50.]]
[[19. 22.]
[43. 50.]]
[57] : print(x / y)
[[0.2 0.33333333]
[0.42857143 0.5 ]]
[[0.2 0.33333333]
[0.42857143 0.5 ]]
[60] : print(np.sum(x, axis=0)) # Compute sum of each column
[4. 6.]
[61] : print(np.sum(x, axis=1)) # Compute sum of each row
[3. 7.]
18 Concatenation of arrays
Pandas, Skill-Based Lab Course [Python Programming] (CSL405) by Prof Vivian Lobo,
Department of Computer Engineering, St John College of Engineering and Management, Palghar
May 7, 2021
1 Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding
any data type. A Series can be thought of as a single column in an Excel
sheet.
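The cells that built the Series shown in this section are missing. A sketch of how such Series might be created, assuming the variable names from the surviving output and values read off that output:

```python
import numpy as np
import pandas as pd

# From a Python list: the index defaults to 0, 1, 2, ...
series_list = pd.Series([1, 2, 3, 4, 5, 6])

# From a NumPy array.
series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))

# With an explicit index of string labels.
series_index = pd.Series(np.array([10, 20, 30, 40, 50, 60]),
                         index=['a', 'b', 'c', 'd', 'e', 'f'])

print(series_list)
print(series_np)
print(series_index)
```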
[2] : series_list
[2]: 0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
[3] : series_np
[3]: 0 10
1 20
2 30
3 40
4 50
5 60
dtype: int32
[12] : series_index
[12]: 0 10
2 20
4 30
6 40
8 50
10 60
12 70
dtype: int32
[14] : series_index
[14]: a 10
b 20
c 30
d 40
e 50
f 60
dtype: int32
[16]: a 1
b 2
c 3
dtype: int64
[13]: t_dict = {'a' : [1,2,3], 'b': [4,5], 'c':6, 'd': "Hello World"}
# Creating a Series out of above dict
series_dict1 = pd.Series(t_dict)
[14]: series_dict1
[14]: a [1, 2, 3]
b [4, 5]
c 6
d Hello World
dtype: object
[18]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
[21]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting column
df[['Name', 'Age']]
Selecting a Row:
A Pandas DataFrame provides an indexer called “loc”, which retrieves rows from the data
frame by index label. Rows can also be selected by integer position using the “iloc” indexer.
[23]: # Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting a row
#row = df.loc[3]
row = df.iloc[0:3]
row
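The difference between the two selectors can be sketched as follows. The dictionary data is not defined in the surviving cells at this point, so the version below is assumed from the Name and Age values printed elsewhere in this document:

```python
import pandas as pd

# Assumed contents of `data`, taken from the values printed elsewhere.
data = {'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
df = pd.DataFrame(data)

row_by_label = df.loc[3]         # the row whose index label is 3, as a Series
rows_by_position = df.iloc[0:3]  # the first three rows by position, as a DataFrame

print(row_by_label)
print(rows_by_position)
```

With the default integer index the labels and positions coincide, so the two selectors only give different results once a custom index is set.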
df = pd.DataFrame(data)
# Selecting the data from the column
df['Age']
[29]: 0 24
1 23
2 22
3 19
4 10
Name: Age, dtype: int64
[30]: del df['Age']
df
[30]: Name
0 Ashika
1 Tanu
2 Ashwin
3 Mohit
4 Sourabh
A column can be added by using the insert function, which places the new column at
a particular position among the existing columns:
1 Tanu Tanu
2 Ashwin Ashwin
3 Mohit Mohit
4 Sourabh Sourabh
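The insert cell itself is missing. The surviving rows suggest the Name column was copied into a second column; a sketch under that assumption, with a hypothetical column name Name_copy:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh']})

# insert(loc, column, value) adds the column in place at position `loc`.
# 'Name_copy' is a hypothetical name chosen for this illustration.
df.insert(1, 'Name_copy', df['Name'])

print(df)
```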
isnull() returns False where a value is present and True where it is null. Now that we
have found the missing values, the next task is to fill them with 0; this can
be done as shown below:
[76]: df.fillna(0)
df.dtypes
[76]: name object
age int64
designation object
dtype: object
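Note that fillna returns a new DataFrame and leaves the original unchanged unless the result is assigned. A small sketch with a hypothetical frame containing one missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'age': [20, np.nan, 35]})

mask = df.isnull()     # True where a value is missing
filled = df.fillna(0)  # new DataFrame with missing values replaced by 0

print(mask)
print(filled)
print(df['age'].isnull().sum())  # 1: the original still has its missing value
```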
[43]: # import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'NAMe':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'AGe': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
[50]: newcols = {
'NAMe': 'Name',
'AGe': 'Age'
}
# Use `rename()` to rename your columns
df.rename(columns=newcols, inplace=False)
df
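With inplace=False, rename returns a renamed copy and leaves df itself untouched, so the result has to be captured to be used. A sketch of that behaviour, where the variable name renamed is an assumption for illustration:

```python
import pandas as pd

data = {'NAMe': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'AGe': [24, 23, 22, 19, 10]}
df = pd.DataFrame(data)

newcols = {'NAMe': 'Name', 'AGe': 'Age'}

# inplace=False is the default: a new DataFrame is returned.
renamed = df.rename(columns=newcols)

print(list(df.columns))       # ['NAMe', 'AGe']
print(list(renamed.columns))  # ['Name', 'Age']
```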
d Mohit 19
e Sourabh 10
[54] : my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
[56] : df
The Row Index: since we haven’t provided any row index values to the DataFrame,
it automatically generates a sequence (0...6) as the row index. To provide our own
row index, we pass the index parameter to the DataFrame(...) constructor.
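The cell that set a custom index is missing. A minimal sketch, assuming my_dict as defined above and a hypothetical index running from 1 to 7:

```python
import pandas as pd

my_dict = {
    'name': ["a", "b", "c", "d", "e", "f", "g"],
    'age': [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

# The index values here are an assumption; any list with one label
# per row works.
df = pd.DataFrame(my_dict, index=[1, 2, 3, 4, 5, 6, 7])

print(df)
```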
[62]: df
The index need not be numerical; we can also pass strings as the index. For example:
[42]: df = pd.DataFrame(
my_dict,
index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"] )
[43]: df
[43]:        name  age designation
First      a   20          VP
Second     b   27         CEO
Third      c   35         CFO
Fourth     d   55          VP
Fifth      e   18          VP
Sixth      f   21         CEO
Seventh    g   35          MD
As you might have guessed, an Index is homogeneous in nature, which means we
can also use NumPy arrays as the Index.
[68]: df
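The cell that built this frame is missing. The row labels visible further down (30 to 70) suggest an index in steps of 10; a sketch assuming np.arange(10, 80, 10) and the variable name np_index:

```python
import numpy as np
import pandas as pd

my_dict = {
    'name': ["a", "b", "c", "d", "e", "f", "g"],
    'age': [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

np_index = np.arange(10, 80, 10)  # 10, 20, ..., 70 (assumed values)
df = pd.DataFrame(my_dict, index=np_index)

print(df)
```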
If we want to check the data types of all columns inside the DataFrame, we use
the dtypes attribute of the DataFrame:
[48]: df.dtypes
[49]: df.head() # Displays 1st Five Rows
designation
30 c 35 CFO
40 d 55 VP
50 e 18 VP
60 f 21 CEO
70 g 35 MD
[54] : my_dict = {
'name' : ["a", "b", "c", "d", "e"],
'age' : [10,20, 30, 40, 50],
'designation': ["CEO", "VP", "SVP", "AM", "DEV"] }
df = pd.DataFrame( my_dict,
index = [
"First -> ",
"Second -> ",
"Third -> ",
"Fourth -> ",
"Fifth -> "])
[55] : df
A DataFrame provides two ways of accessing a column: dictionary
syntax, df['column_name'], or attribute syntax, df.column_name. Each time we use either
form to get a column, we get a Pandas Series.
[56] : series_name = df.name
series_age = df.age
series_designation = df.designation
[57] : series_name
[59]: series_designation
3 Grouping Function in Pandas
Grouping is an essential part of data analysis in Pandas. We can group similar
types of data and apply various functions to each group.
For grouping in Pandas, we will use the .groupby() function to group according to
“Month” and then find the mean:
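The cell for this step is missing, and the data printed below has no Month column. A minimal sketch of the idea on a hypothetical frame with Month and Sales columns:

```python
import pandas as pd

# Hypothetical data, used only to illustrate groupby followed by mean.
sales = pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
                      'Sales': [100.0, 300.0, 50.0, 150.0]})

# Split the rows by Month, then average the Sales within each group.
mean_sales = sales.groupby('Month')['Sales'].mean()

print(mean_sales)
```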
# Print the dataframe
df
[20]:          Name            Team  Number Position   Age Height  Weight \
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0
..
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN
College Salary
0 Texas 7730337.0
1 Marquette 6796117.0
2 Boston University NaN
3 Georgia State 1148640.0
4 NaN 5000000.0
.. ... ...
453 Butler 2433333.0
454 NaN 900000.0
455 NaN 2900000.0
456 Kansas 947276.0
457 NaN NaN
# Let's print the first entries
# in all the groups formed.
gk.first()
College Salary
15 NaN 3425510.0
16 Oklahoma State 845059.0
17 North Carolina 1500000.0
18 Arizona 1335480.0
19 Georgia Tech 6300000.0
20 NaN 1599840.0
21 Cincinnati 134215.0
22 Miami (FL) 1500000.0
23 Stanford 19689000.0
24 Syracuse 1140240.0
25 Saint Louis 947276.0
26 Kansas 981348.0
27 Georgetown 947276.0
28 Texas A&M 947276.0
29 Georgia Tech 11235955.0
Use the groupby() function to form groups based on more than one category (i.e. use more
than one column to perform the splitting).
College Salary
Team Position Weight
Atlanta Hawks C 245.0 Florida 12000000.0
260.0 NaN 1000000.0
PF 235.0 Minnesota 1000000.0
237.0 Virginia 3333333.0
240.0 Bucknell 947276.0
... ... ...
Washington Wizards SF 225.0 Boston College 4375000.0
SG 195.0 LSU 1100602.0
207.0 Florida 5694674.0
218.0 Virginia Tech 561716.0
220.0 Michigan State 4000000.0
[414 rows x 6 columns]
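The grouping cell is missing. A sketch of multi-column grouping on a small hypothetical frame; the column names follow the output above, the rows are invented, and gk is the name used in the earlier surviving cell:

```python
import pandas as pd

# Hypothetical rows; only the column names follow the document's data.
players = pd.DataFrame({
    'Team': ['Hawks', 'Hawks', 'Hawks', 'Wizards'],
    'Position': ['C', 'C', 'PF', 'SG'],
    'Salary': [12.0, 1.0, 3.0, 5.0],
})

# Passing a list of columns groups by each distinct (Team, Position) pair.
gk = players.groupby(['Team', 'Position'])
first = gk.first()  # first entry of each group, indexed by a MultiIndex

print(first)
```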
4 Pandas dataframe.aggregate()
DataFrame.aggregate() is used to apply some aggregation across one or
more columns. It aggregates using a callable, string, dict, or list of strings/callables. The most
frequently used aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis
[33]: #Aggregate `sum' and `min' function across all the columns in data frame.
[34]: #Aggregation works with only numeric type columns.
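The aggregation cells did not survive. A sketch of both call styles on a small hypothetical numeric frame; the column names and values are illustrative only:

```python
import pandas as pd

# Hypothetical numeric data.
stats = pd.DataFrame({'Age': [25.0, 27.0, 22.0],
                      'Salary': [100.0, 200.0, 300.0]})

# A list of function names applies every function to every column.
by_list = stats.aggregate(['sum', 'min'])

# A dict chooses one aggregation per column.
by_dict = stats.aggregate({'Age': 'max', 'Salary': 'sum'})

print(by_list)
print(by_dict)
```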
5 Merging DataFrame
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
print(df, "\n\n",
Age
0 K0 Jai 27
1 K1 Princi 24
2 K2 Gaurav 22
3 K3 Anuj 32
[37] : #Now we are using .merge() with one unique key combination
on='key') res
module
import pandas as pd
# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df, "\n\n",
Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
res1
print(df, "\n\n",
Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
[42] : #Now we set how = 'left' in order to use keys from left frame only.
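Most of the merge cells are truncated. The sketch below reassembles the single-key case from the surviving fragments and adds a how='left' call. The contents of data1 are taken from the printed frame, while the names df1, res and res_left, and dropping one row from the right frame to make the effect visible, are assumptions:

```python
import pandas as pd

data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32]}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)

# Merge on one shared key column (an inner join by default).
res = pd.merge(df, df1, on='key')

# how='left' keeps every key from the left frame; keys missing on the
# right (K3 here, because the last right-hand row is dropped) get NaN.
res_left = pd.merge(df, df1.iloc[:3], on='key', how='left')

print(res)
print(res_left)
```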
Team Boston Celtics
Number 0
Position PG
Age 25
Height 6-2
Weight 180
College Texas
Salary 7.73034e+06
Name: Avery Bradley, dtype: object