SR Ip Pandas I Full Notes
Some pandas functions return their results as NumPy arrays, so you need a thorough knowledge of NumPy arrays.
Python NumPy (Numerical Python)
A NumPy array holds a homogeneous collection of elements, i.e. all elements are of the same data type.
Vectorised operations can be performed on it.
Two types of arrays are used in these notes:
1. 1-D array
2. 2-D array
Creation of arrays
To create both 1-D and 2-D arrays, the module to be imported is:
import numpy as np
Creating 1-D and 2-D arrays:
1-D array
import numpy as np
A=np.array([1,2,3,4])
print(A)
2-D array
import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(A)
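The vectorised operations mentioned above can be sketched as follows. This is a minimal illustration; the variable names a, b, doubled and shifted are chosen here only for the example.

```python
import numpy as np

# 1-D and 2-D arrays, as created in the notes above
a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Vectorised operations: the arithmetic is applied to every element at once
doubled = a * 2      # element-wise multiplication on the 1-D array
shifted = b + 10     # element-wise addition on the 2-D array

print(a.ndim, a.shape)   # number of dimensions and shape of the 1-D array
print(b.ndim, b.shape)   # number of dimensions and shape of the 2-D array
print(doubled)
print(shifted)
```

No explicit loop is needed: the operator is applied to each element of the array.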
Working with pandas
Series | DataFrame
One-dimensional | Two-dimensional
Homogeneous data, i.e. all elements are of the same type | Heterogeneous data, i.e. elements can be of different data types
Value mutable, i.e. an element's value can be changed | Value mutable, i.e. an element's value can be changed
Size immutable, i.e. once created, the size of a series cannot be changed | Size mutable, i.e. the size can be changed after creation
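A small sketch of the comparison above. The sample data, and the names s and df, are assumptions made only for this illustration.

```python
import pandas as pd

s = pd.Series([12, 10, 14, 16])               # one-dimensional, a single dtype
df = pd.DataFrame({"name": ["Anu", "Binu"],   # two-dimensional,
                   "marks": [80, 90]})        # each column has its own dtype

s[1] = 99            # value mutable: the element with index label 1 is changed
df["grade"] = "A"    # size mutable: a new column is added to the DataFrame

print(s.ndim, df.ndim)   # number of dimensions of each object
print(df.dtypes)         # dtype of every column
print(df.shape)          # (rows, columns) after adding the column
```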
import pandas as pd
S1=pd.Series()
print(S1)
OUTPUT
Series([], dtype: object)
Note: the dtype shown for an empty Series depends on the pandas version; older versions show float64.
Note: the index argument is optional. If it is not given, the index is taken as 0, 1, 2, 3, ... by default.
List: the values must be enclosed in [ ] and separated by commas.
Tuple: the values must be enclosed in ( ) and separated by commas.
import pandas as pd
s1=pd.Series([12,10,14,16])
s2=pd.Series([12,10,14,16],index=['a','b','c','d'])
print("Series object with default index")
print(s1)
print("Series object with specified index")
print(s2)
Output:
Series object with default index
0    12
1    10
2    14
3    16
dtype: int64
Series object with specified index
a    12
b    10
c    14
d    16
dtype: int64
Here you need to specify arguments for data and index as per the syntax:
Series object=pandas.Series(data,index=idx)
Eg:1
import pandas as pd
S1=pd.Series(['Anjali','Arunima','Chaithra','Diya'])
print(S1)
OUTPUT
0      Anjali
1     Arunima
2    Chaithra
3        Diya
dtype: object
Eg:2
import pandas as pd
l=[31,28,31,30,31]
ind=['Jan','Feb','Mar','Apr','May']
obj=pd.Series(l,ind)
print(obj)
OUTPUT
Jan    31
Feb    28
Mar    31
Apr    30
May    31
dtype: int64
Eg:1
import pandas as pd
obj=pd.Series(range(2,10,2))
print(obj)
OUTPUT
0    2
1    4
2    6
3    8
dtype: int64
Eg:2
import pandas as pd
obj=pd.Series([2.5,3.,3.5,4.])
print(obj)
OUTPUT
0    2.5
1    3.0
2    3.5
3    4.0
dtype: float64
2. An ndarray
Eg:
import pandas as pd
import numpy as np
A=np.array([2,4,6,8])
obj=pd.Series(A)
print(obj)
OUTPUT
0    2
1    4
2    6
3    8
dtype: int64
3. Dictionary
Here the argument passed to the Series() function is a dictionary. Syntax:
Series object=pandas.Series(any Python dictionary)
Eg:
import pandas as pd
S=pd.Series({'ahil':12,'abhay':9,'mohit':8,'anjali':10})
print(S)
OUTPUT
ahil      12
abhay      9
mohit      8
anjali    10
dtype: int64
When a series object is created from a dictionary, the keys are taken as the index labels and the values as the elements.
4. A Scalar value
A scalar value means the data is a single value. Note the following points when you create a series object from a scalar value:
If data is a scalar value, an index should normally be provided. If it is omitted, the series has a single element with index 0.
The index can have more than one entry.
If the index has more than one entry, the scalar value is repeated to match the length of the index.
Eg:1
import pandas as pd
S=pd.Series(10)
print(S)
OUTPUT
0    10
dtype: int64
Eg:2
import pandas as pd
S=pd.Series(10,index=[1,2])
print(S)
OUTPUT
1    10
2    10
dtype: int64
Here both data and index are given as sequences of the same length. If you skip either parameter, its default value None is used.
Eg:1
import pandas as pd
S=pd.Series(data=[10,15,20,25],index=[1,2,3,4])
print(S)
OUTPUT
1 10
2 15
3 20
4 25
dtype: int64
Eg:2
import pandas as pd
l=[10,15,20,25]
i=[1,2,3,4]
S=pd.Series(data=l,index=i)
print(S)
Output will be same as above.
Attributes of Pandas Series
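As a minimal sketch, some commonly used attributes of a Series can be inspected as shown below. The series S and its labels are sample data assumed for this illustration.

```python
import pandas as pd

S = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

print(S.index)     # the index labels of the series
print(S.values)    # the data as a NumPy array
print(S.dtype)     # data type of the elements
print(S.shape)     # (5,): a tuple holding the number of elements
print(S.size)      # 5: number of elements
print(S.ndim)      # 1: a Series is always one-dimensional
print(S.empty)     # False: the series has elements
print(S.hasnans)   # False: no element is NaN
```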
i) Using a single label: the label is given in square brackets after the series object.
>>>import pandas
>>>L=[10,20,30,40,50]
>>>index=['a','b','c','d','e']
>>>S=pandas.Series(L,index)
>>>print(S['b'])
o/p: 20
ii) Using multiple labels: The multiple labels must be passed as a list, i.e.
the labels must be separated by commas and enclosed in double
square brackets. Labels that are not present in the index should be avoided:
depending on the pandas version, they either give NaN values or raise a KeyError.
import pandas
L=[10,20,30,40,50]
index=['a','b','c','d','e']
S=pandas.Series(L,index)
print(S[['b','d','e']])
o/p:
b 20
d 40
e 50
dtype: int64
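Slicing a series object: consider the series object SO and the results of some slices.
>>> SO
0    11
1    25
2    33
3    66
4    85
5    75
6    95
7    45
8    17
9    16
dtype: int64
Elements from position 2 to the end, e.g. SO[2:]
2    33
3    66
4    85
5    75
6    95
7    45
8    17
9    16
dtype: int64
Elements at positions 3 to 6, e.g. SO[3:7]
3    66
4    85
5    75
6    95
dtype: int64
Every second element, e.g. SO[::2]
0    11
2    33
4    85
6    95
8    17
dtype: int64
>>> SO[-5:]
5    75
6    95
7    45
8    17
9    16
dtype: int64
>>> SO[-3:-1]
7    45
8    17
dtype: int64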
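The slices shown above can be reproduced with the sketch below. It uses iloc, which always slices by position; the result names (from_two, middle and so on) are chosen here only for illustration.

```python
import pandas as pd

SO = pd.Series([11, 25, 33, 66, 85, 75, 95, 45, 17, 16])

from_two = SO.iloc[2:]       # from position 2 to the end
middle = SO.iloc[3:7]        # positions 3, 4, 5 and 6 (the stop position is excluded)
alternate = SO.iloc[::2]     # every second element
last_five = SO.iloc[-5:]     # the last five elements
near_end = SO.iloc[-3:-1]    # third-last and second-last elements

print(last_five)
print(near_end)
```

A negative position counts from the end of the series, so -1 refers to the last element.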
Operations on series object :
a) modifying elements of a series object :
The data value of a series object can be easily modified by the following syntax:
Seriesobject[index]=new data value
Eg:
Consider the series object obj7=pd.Series(data=[18,20,22,24],index=[9,10,11,12]). If we write obj7[11]=23,
the output will be:
9     18
10    20
11    23
12    24
dtype: int64
b) Modifying the data values within a given slice. Use the syntax:
Seriesobject[start:stop:step]=new data value
Eg:
#modifying series object
import pandas as pd
obj7=pd.Series(data=[18,20,22,24],index=[9,10,11,12])
print(obj7)
OUTPUT will be:
9 18
10 20
11 22
12 24
dtype: int64
obj7[0:2]=18
print(obj7)
9 18
10 18
11 22
12 24
dtype: int64
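In the example above, obj7[11] refers to an index label while obj7[0:2] refers to positions. A minimal sketch that makes this difference explicit with loc (labels) and iloc (positions), using the same data as obj7:

```python
import pandas as pd

obj7 = pd.Series(data=[18, 20, 22, 24], index=[9, 10, 11, 12])

obj7.loc[11] = 23      # label-based: the element whose index label is 11
obj7.iloc[0:2] = 18    # position-based: the first two elements

print(obj7)
```

Using loc and iloc avoids any doubt about whether a number is meant as a label or as a position when the index itself is made of integers.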
Head and Tail functions
Let us consider the following example.
>>> import pandas as pd
>>> import numpy as np
>>> seriesTenTwenty=pd.Series(np.arange(10,20,1))
>>> print(seriesTenTwenty)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int32
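A short sketch of head() and tail() on the series above. The result names are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

seriesTenTwenty = pd.Series(np.arange(10, 20, 1))

first_five = seriesTenTwenty.head()     # first 5 elements (the default)
first_two = seriesTenTwenty.head(2)     # first 2 elements
last_five = seriesTenTwenty.tail()      # last 5 elements (the default)
last_three = seriesTenTwenty.tail(3)    # last 3 elements

print(first_two)
print(last_three)
```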
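Vector operations on a series object: an arithmetic operation on a series object is applied to each of its elements.
Eg, for obj1=pd.Series(data=[18,20,22,24],index=[9,10,11,12]):
print(obj1+2)
9     20
10    22
11    24
12    26
dtype: int64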
print(obj1**2)
9 324
10 400
11 484
12 576
dtype: int64
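Arithmetic operations can also be performed with two series objects. The operation is carried out on the
elements with matching indexes; for an index that is present in only one of the objects, the result is NaN.
0     3.75
1    15.50
2    37.20
3    15.10
4    25.50
5      NaN
6      NaN
dtype: float64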
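The index matching described above can be sketched as follows. The two series s1 and s2 hold hypothetical values and are not the data behind the output shown in the notes.

```python
import pandas as pd

# Hypothetical sample data: s1 has 7 elements, s2 has only 5
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
s2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

total = s1 + s2                      # indexes 5 and 6 exist only in s1, so the result is NaN
filled = s1.add(s2, fill_value=0)    # treat a missing partner as 0 instead of NaN

print(total)
print(filled)
```

The add() method with fill_value is one way to avoid NaN when the indexes do not match completely.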
Some additional operations on series objects:
Re-indexing:
Sometimes you want to create a similar object with the same indexes in a
different order. You can use the syntax:
Seriesobject=object.reindex(sequence with new order of indexes)
The same data values and their indexes are stored in the new
object in the order given.
Eg:
import pandas as pd
obj1=pd.Series(data=[2,8,11,6,20],index=[0,1,2,3,4])
obj2=obj1.reindex([2,3,1,4,0])
print(obj2)
OUTPUT:
2 11
3 6
1 8
4 20
0 2
Filtering series:
Filtering displays only those elements of a series that satisfy a given condition. The condition is checked for each element and gives True or False; the elements for which it is True are returned.
Syntax: <series object>[Boolean expression on the series object]
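A minimal sketch of filtering. The series s and the condition s > 30 are sample choices assumed for this illustration.

```python
import pandas as pd

s = pd.Series([11, 25, 33, 66, 85], index=['a', 'b', 'c', 'd', 'e'])

mask = s > 30         # Boolean series: True where the condition holds
filtered = s[mask]    # keeps only the elements for which mask is True

print(mask)
print(filtered)
```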
Sorting series:
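As a sketch under assumed sample data, a series can be sorted by its values with sort_values() or by its index with sort_index(). Both return a new series and leave the original unchanged.

```python
import pandas as pd

s = pd.Series([33, 11, 66, 25], index=['c', 'a', 'd', 'b'])

by_value = s.sort_values()                        # ascending order of data values
by_value_desc = s.sort_values(ascending=False)    # descending order of data values
by_index = s.sort_index()                         # ascending order of index labels

print(by_value)
print(by_value_desc)
print(by_index)
```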
Empty DataFrame
import pandas as pd
df=pd.DataFrame()
print(df)
Creating a DataFrame Object from a List of Dictionaries :
Eg:
import pandas as pd
model1={'make':'maruti','mileage':20,"price":'5L'}
model2={'make':'hyundai','mileage':18,"price":'10L'}
model3={'make':'tata','mileage':21,"price":'12L'}
cars=[model1,model2,model3]
d=pd.DataFrame(cars)
print(d)
OUTPUT
      make  mileage price
0   maruti       20    5L
1  hyundai       18   10L
2     tata       21   12L
Creating a DataFrame Object from a Dictionary of Series:
Eg:
import pandas as pd
smarks=pd.Series([80,90,70],index=['Neha','Maya','Reena'])
sage=pd.Series([25,30,29],index=['Neha','Maya','Reena'])
dict1={'Marks':smarks,'Age':sage}
df3=pd.DataFrame(dict1)
print(df3)
Creation of DataFrame from NumPy ndarrays
Consider the following three NumPy ndarrays. Let us create a simple DataFrame without any column
labels, using a single ndarray:
>>> import numpy as np
>>> array1 = np.array([10,20,30])
>>> array2 = np.array([100,200,300])
>>> array3 = np.array([-10,-20,-30, -40])
>>> dFrame4 = pd.DataFrame(array1)
>>> dFrame4
0
0 10
1 20
2 30
We can create a DataFrame using more than one ndarray, as shown in the following example. Each array becomes a row, and a shorter row is filled with NaN:
>>> dFrame5 = pd.DataFrame([array1, array3, array2], columns=[ 'A', 'B', 'C', 'D'])
>>> dFrame5
A B C D
0 10 20 30 NaN
1 -10 -20 -30 -40.0
2 100 200 300 NaN
DataFrame attributes
<DataFrame object>.<attribute name>
Attribute | Description
index | Returns the index (row labels) of the DataFrame
columns | Returns the column labels of the DataFrame
axes | Returns a list representing both the axes of the DataFrame (axis=0 i.e. index and axis=1 i.e. columns)
values | Returns a NumPy representation of the DataFrame
dtypes | Returns the dtypes of the data in the DataFrame
shape | Returns a tuple giving the shape of the DataFrame, i.e. (number of rows, number of columns)
ndim | Returns the number of dimensions of the DataFrame
size | Returns the number of elements in the DataFrame
empty | Returns True if the DataFrame object is empty, otherwise False
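The attributes in the table can be tried out as sketched below. The dataframe df5 here is a hypothetical marks table: the row and column labels follow the names used later in these notes, but the values and the column order are assumptions.

```python
import pandas as pd

# Hypothetical marks table; the values are made up for illustration
df5 = pd.DataFrame(
    {'BS': [65, 72, 58, 80], 'ACC': [70, 67, 82, 55],
     'ECO': [88, 61, 73, 69], 'IP': [95, 79, 86, 91]},
    index=['Ammu', 'Achu', 'Manu', 'Abu'])

print(df5.index)      # row labels
print(df5.columns)    # column labels
print(df5.axes)       # list holding the row index and the column index
print(df5.values)     # 2-D NumPy array of the data
print(df5.dtypes)     # dtype of each column
print(df5.shape)      # (4, 4): number of rows, number of columns
print(df5.ndim)       # 2
print(df5.size)       # 16: rows * columns
print(df5.empty)      # False
```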
In the dot notation, make sure not to put any quotation marks around the column name.
print(df5.BS)
or
print(df5['BS'])
To select more than one column, give the column names as a list:
print(df5[['BS','IP']])
To access a row:
<dataframe object>.loc[<row label>, : ]
Make sure not to miss the colon after comma.
print(df5.loc['Ammu', :])
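To access a range of rows, give the start row and the end row. pandas returns all rows falling between the start row and the end row, including the start row and the end row themselves.
print(df5.loc['Ammu':'Manu', : ])
To access a range of columns:
print(df5.loc[:,'ACC':'IP'])
To access a range of rows and a range of columns together:
print(df5.loc['Manu':'Abu','ACC':'ECO'])
With iloc, rows and columns are given by integer position, and the end position is excluded:
print(df5.iloc[1:3,1:3])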
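The selections above can be sketched on a small frame. The dataframe df5 below is hypothetical: the labels follow the notes, while the values and the column order are assumptions.

```python
import pandas as pd

# Hypothetical marks table; the values are made up for illustration
df5 = pd.DataFrame(
    {'BS': [65, 72, 58, 80], 'ACC': [70, 67, 82, 55],
     'ECO': [88, 61, 73, 69], 'IP': [95, 79, 86, 91]},
    index=['Ammu', 'Achu', 'Manu', 'Abu'])

col = df5['BS']                               # one column, returned as a Series
cols = df5[['BS', 'IP']]                      # list of columns, returned as a DataFrame
row = df5.loc['Ammu', :]                      # one row, returned as a Series
rows = df5.loc['Ammu':'Manu', :]              # label slice: both ends are included
block = df5.loc['Manu':'Abu', 'ACC':'ECO']    # range of rows and range of columns
by_pos = df5.iloc[1:3, 1:3]                   # position slice: the end is excluded

print(rows)
print(by_pos)
```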
Selecting / Accessing individual value
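(i) Either give the name of the row or its numeric position after the column name:
print(df5.ACC['Achu'])
or, by position,
print(df5.ACC.iloc[1])
Note: a plain integer in square brackets, as in df5.ACC[1], is treated as a position only by older pandas versions when the row labels are not integers.
(ii) Use at or iat:
Use | Description
dfobject.at[row label,column label] | Access a single value for a row/column label pair.
dfobject.iat[row index no,col index no] | Access a single value for a row/column pair by integer position.
print(df5.at['Achu','ACC'])
OUTPUT
67
or
print(df5.iat[1,1])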
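A minimal sketch of single-value access. The frame df5 is hypothetical; its values and column order are assumed so that the cell for 'Achu' and 'ACC' holds 67 and lies at position [1,1].

```python
import pandas as pd

# Hypothetical marks table; the values are made up for illustration
df5 = pd.DataFrame(
    {'BS': [65, 72, 58, 80], 'ACC': [70, 67, 82, 55],
     'ECO': [88, 61, 73, 69], 'IP': [95, 79, 86, 91]},
    index=['Ammu', 'Achu', 'Manu', 'Abu'])

by_label = df5.at['Achu', 'ACC']     # row label and column label
by_position = df5.iat[1, 1]          # row position 1, column position 1
via_column = df5['ACC']['Achu']      # column first, then row label

print(by_label, by_position, via_column)
```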
If the given column name does not exist in dataframe then a new column with the name is added.
df5['ENG']=60
print(df5)
If you want to add a column that has different values for its rows, assign the data values for
each row of the column in the form of a list:
df5['ENG']=[50,60,40,30,70]
There are some other ways of adding a column to a dataframe:
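<dataframe object>.at[ : ,<column name>]=value
or
<dataframe object>.loc[ : ,<column name>]=value
Note: at is documented for accessing a single value. Whether it accepts a slice may depend on the pandas version, so loc is the safer choice for a whole row or column.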
df5.at[ : ,'ENG']=60
print(df5)
or
df5.loc[ : ,'ENG']=60
print(df5)
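To add a row, or to set all the values of an existing row:
df5.at['Sabu', : ]=50
print(df5)
To modify a single value, give the column name followed by the row label or the row position:
df5.BS['Ammu']=100
print(df5)
or
df5.BS[0]=100
print(df5)
Note: in recent pandas versions this two-step (chained) assignment may not change the original dataframe; df5.loc['Ammu','BS']=100 is the more reliable form.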
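The additions and modifications above can be sketched with loc only, which works the same way for columns, rows and single values. The frame df5 is hypothetical (four rows, values made up), so the list assigned to 'ENG' has four entries here.

```python
import pandas as pd

# Hypothetical marks table; the values are made up for illustration
df5 = pd.DataFrame(
    {'BS': [65, 72, 58, 80], 'ACC': [70, 67, 82, 55],
     'ECO': [88, 61, 73, 69], 'IP': [95, 79, 86, 91]},
    index=['Ammu', 'Achu', 'Manu', 'Abu'])

df5['ENG'] = 60                   # new column, same value in every row
df5['ENG'] = [50, 60, 40, 30]     # existing column, one value per row
df5.loc['Sabu', :] = 50           # new row: every column is set to 50
df5.loc['Ammu', 'BS'] = 100       # change a single value

print(df5)
```

Adding a row this way may change integer columns to float, depending on the pandas version.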
To delete columns, use drop() with the columns argument:
df5=df5.drop(columns=['ECO','IP'])
We can also use pop() to delete a column. The deleted column is returned as a Series object.
bstud=df5.pop('BS')
print(bstud)
To delete rows, give the row labels to drop():
df5=df5.drop(['Ammu','Achu'])
or
df5=df5.drop(index=['Ammu','Achu'])
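A short sketch of drop() and pop() on a hypothetical frame (labels as in the notes, values made up). Note that drop() returns a new dataframe, while pop() changes the dataframe it is called on.

```python
import pandas as pd

# Hypothetical marks table; the values are made up for illustration
df5 = pd.DataFrame(
    {'BS': [65, 72, 58, 80], 'ACC': [70, 67, 82, 55],
     'ECO': [88, 61, 73, 69], 'IP': [95, 79, 86, 91]},
    index=['Ammu', 'Achu', 'Manu', 'Abu'])

no_cols = df5.drop(columns=['ECO', 'IP'])     # new DataFrame without two columns
no_rows = df5.drop(index=['Ammu', 'Achu'])    # new DataFrame without two rows
same_rows = df5.drop(['Ammu', 'Achu'])        # labels refer to rows by default
bstud = df5.pop('BS')                         # removes 'BS' from df5 and returns it

print(no_cols)
print(no_rows)
print(bstud)
```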
Renaming rows/columns
To change the name of any row/column individually, use the rename() function of the
dataframe as per the syntax:
dfobject.rename(index={names dictionary},columns={names dictionary},inplace=False)
where:
1. The index argument is for the index names (row labels). Use this if you want to rename rows only.
2. The columns argument is for the column names. Use this if you want to rename
columns only.
3. For both the index and columns arguments, specify a names-change dictionary
containing the original names and the new names in the form {old name:new name}.
4. Specify the inplace argument as True if you want to rename the rows/columns in the
same dataframe. If you skip this, a new dataframe is created with the new
index/column names and the original remains unchanged.
Eg:
Consider the dataframe df as below:
       rollno   Name  marks
sec a     115  Pavni   97.5
sec b     236  Rishi   98.0
sec c     307  Preet   98.5
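A minimal sketch of rename() on the dataframe shown above. The new names ('A', 'B', 'rno', 'score') are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'rollno': [115, 236, 307],
     'Name': ['Pavni', 'Rishi', 'Preet'],
     'marks': [97.5, 98.0, 98.5]},
    index=['sec a', 'sec b', 'sec c'])

# Without inplace=True a new dataframe is returned and df is unchanged
renamed = df.rename(index={'sec a': 'A', 'sec b': 'B'},
                    columns={'rollno': 'rno'})

# With inplace=True the change is made in df itself
df.rename(columns={'marks': 'score'}, inplace=True)

print(renamed)
print(df)
```

Names that are not mentioned in the dictionary, such as 'sec c' here, are left as they are.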
Boolean Indexing:
Boolean indexing means having Boolean values (True or False) or (1 or 0) as the
index of a dataframe. The Boolean indexes divide the dataframe into two groups:
True rows and False rows.
Creating DataFrames with Boolean indexes:
Whenever you create a dataframe with Boolean indexes, never enclose True
and False in single or double quotes.
Eg:
import pandas as pd
Days=['Mon','Tue','Wed','Thur','Fri']
Classes=[3,0,4,0,5]
dc={'Days':Days,'No:of Classes':Classes}
df=pd.DataFrame(dc,index=[True,False,True,False,True])
print(df)
In place of True and False, 1s and 0s can also be given, as:
df=pd.DataFrame(dc,index=[1,0,1,0,1])
This kind of indexing is very useful for filtering records, i.e. extracting the True and False rows separately.
eg:
import pandas as pd
Days=['Mon','Tue','Wed','Thur','Fri']
Classes=[3,0,4,0,5]
dc={'Days':Days,'No:of Classes':Classes}
df=pd.DataFrame(dc,index=[True,False,True,False,True])
print(df)
OUTPUT:
Days No:of Classes
True Mon 3
False Tue 0
True Wed 4
False Thur 0
True Fri 5
To filter rows with a Boolean condition:
<DataFrame>.loc[<Boolean condition>]
print(df.loc[df['No:of Classes']>0])
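Filtering with a Boolean condition can be sketched as below. This sketch uses the default integer index instead of a Boolean index; the names has_class, with_classes and without_classes are chosen here only for illustration.

```python
import pandas as pd

Days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
Classes = [3, 0, 4, 0, 5]
df = pd.DataFrame({'Days': Days, 'No:of Classes': Classes})

has_class = df['No:of Classes'] > 0      # Boolean Series, one value per row
with_classes = df.loc[has_class]         # rows where the condition is True
without_classes = df.loc[~has_class]     # ~ inverts the condition

print(with_classes)
print(without_classes)
```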