0% found this document useful (0 votes)
11 views75 pages

Dataframe Ip

The document provides an overview of data structures in Pandas DataFrames, highlighting their features, creation methods, and operations on rows and columns. It includes examples of creating DataFrames from various data types, importing/exporting CSV files, and performing operations like selection, addition, renaming, and deletion of rows and columns. Additionally, it explains the use of iteration functions and showcases how to manipulate DataFrames effectively.

Uploaded by

murali.media01
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views75 pages

Dataframe Ip

The document provides an overview of data structures in Pandas DataFrames, highlighting their features, creation methods, and operations on rows and columns. It includes examples of creating DataFrames from various data types, importing/exporting CSV files, and performing operations like selection, addition, renaming, and deletion of rows and columns. Additionally, it explains the use of iteration functions and showcases how to manipulate DataFrames effectively.

Uploaded by

murali.media01
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 75

DATA STRUCTURES

IN
PANDAS DATAFRAMES
Pandas DataFrame
A Data frame is a two-dimensional data structure,
i.e., data is aligned in a tabular fashion in rows
and columns.
Features of Data Frame
• Columns can be of different types
• Size of data frame is mutable i.e., the number of
rows and columns can be increased or
decreased
• Its data/values are also mutable and can be
changed any time
• Indexes may be of numbers, strings or letters
syntax :
pandas.DataFrame( data, index, columns, dtype,
copy)
data - takes various forms like ndarray, series, map,
lists, dict, constants and also another DataFrame.
index - used for row labels. Default np.arrange(n) if
no index is passed.
columns - used for column labels. This is only true if
no index is passed.
dtype - Data type of each column.
copy - used for copying of data, if the default is False.
Library
Python library is a collection of functions, methods
and huge number of modules that can be imported.

• import pandas as pd – Python library pandas is


added for working out with data structures in
pandas series.

• import numpy as np – Python library numpy is


added for creating numpy array and generating
random numbers.
Creation of DataFrame
A dataframe can be created using various inputs
like

• Series
• List
• Dictionary
• Numpy ndarrays
Create an Empty DataFrame
import pandas as pd
df = pd.DataFrame()
print df

output
Empty DataFrame
Columns: []
Index: []
Create Dataframe from dict of series
• Dictionary of Series can be passed to form a
DataFrame.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df
Output
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Create Dataframe from list of dicts
• List of dictionaries
import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df

Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
• List of dictionaries and the row indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print df

Output
a b c
first 1 2 NaN
second 5 10 20.0
• List of dictionaries, row indices and column
indices.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b'])
df2 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b1'])
print df1
print df2
Output
#df1 output
a b
first 1 2
second 5 10

#df2 output
a b1
first 1 NaN
second 5 NaN
Create CSV file
A CSV is a comma separated value file, which allows
data to be saved in a tabular format.
• Open Ms-Excel, type the details in rows and
columns
• Save the file with the extension csv
ex: ('C:\Users\emp.csv')
Importing CSV in a DataFrame
• pd.read_csv() – is used to read a csv file

import pandas as pd
df = pd.read_csv(r'C:\Users\emp.csv')
print(df)
Display subset of columns from CSV
import pandas as pd
data = pd.read_csv (r'C:\Users\emp.csv')
df = pd.DataFrame(data,
columns=[‘empid',‘ename'])
print (df)
Exporting data from dataframe to csv
import pandas as pd
l = [{'Name' : 'Sachin', 'Sirname': 'Bhardwaj'},
{'Name' : 'Vinod', 'Sirname' : 'Verma'},
{'Name' : 'Rajesh', 'Sirname' : 'Mishra'}]
df = pd.DataFrame(l)
df.to_csv('C:\Users\det.csv‘)
Iteration on rows and columns
• Iteration is used to access record or data
from a data frame row wise or column wise.

• Pandas provide 2 functions to perform


iterations
– iterrows()
– iteritems()
• iterrows() – used to access data row wise

import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df
for(row_index,row_value) in df.iterrows():
print('Row index is : ', row_index)
print('Row value is : ')
print(row_value)
Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
Row index is : 0
Row value is :
Name Raj
lastname Kumar
Name: 0, dtype: object
Row index is : 1
Row value is :
Name Vinodh
lastname Sharma
Name: 1, dtype: object
• iteritems() – used to access data column wise

import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df
for(col_name,col_value) in df.iteritems():
print('Column Name is : ', col_name)
print('Column Values are : ')
print(col_value)
Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
Column Name is : Name
Column Values are :
0 Raj
1 Vinodh
Name: Name, dtype: object
Column Name is : lastname
Column Values are :
0 Kumar
1 Sharma
Name: lastname, dtype: object
Operations on Rows

• Various operations are carried out in rows.


– Add
– Select
– Delete
Row Addition
• Add new rows to a DataFrame using the append
function.
• This function will append the rows at the end.

import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df1.append(df2)
print df
Output
a b
0 10 20
1 30 40
0 50 60
1 70 80
Row Selection
Row selection can be performed by

• Label – it is used to access a group of rows and


columns

• Integer - it is used to access a group of rows and


columns based on integer index

• multiple rows – it is used to slice the rows


Selection by label
• Rows can be selected by passing row label to a loc
function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df.loc['b', :]
print df.loc['a':'c', :]
Output
one 2.0
two 2.0
Name: b, dtype: float64
one two
a 1.0 1
b 2.0 2
c 3.0 3
Selection by integer location

• Rows can be selected by passing integer location


to an iloc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Dataframe :")
print df
print("Integer location index 2 value : ")
print df.iloc[2]
print("First 2 rows of second column : ")
print df.iloc[0:2,1:2]
print("All rows of first column : ")
print df.iloc[: , 0:1]
Output
Dataframe :
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Integer location index 2 value :
one 3.0
two 3.0
Name: c, dtype: float64
First 2 rows of second column :
two
a 1
b 2
All rows of first column :
one
a 1.0
b 2.0
c 3.0
d NaN
Slice rows
• Multiple rows can be selected using ‘ : ’ operator.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df[2:4]
Output
one two
c 3.0 3
d NaN 4
Row Deletion

• Use index label to delete or drop rows from a


DataFrame.

• If label is duplicated, then multiple rows will


be dropped.
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns =
['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns =
['a','b'])
df = df1.append(df2)
df = df.drop(0)
print df
df=df.drop(1)
print df
Output
# drop(0)
a b
1 3 4
1 7 8
# drop(1)
a b
0 1 2
0 5 6
Operations on Columns
• Various operations are carried out in rows.
– Select
– Add
– Rename
– Delete
Column Selection
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print df['Col1']
print df['Col2']
Output
0 10
1 15
2 20
3 25
Name: Col1, dtype: int64
0 30
1 30
2 30
3 30
Name: Col2, dtype: int64
Column Addition
1) Adding a new column to an existing data frame

2) Adding a new column by passing a series

3) Adding a new column using insert function


Adding a column using existing columns
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Adding a new column using the existing
columns in DataFrame : ")
print(df)
Output
Adding a new column using the existing columns
in DataFrame :
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Adding a column by passing a series
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print ("Adding a new column by passing a Series:")
df['Col3']=pd.Series([5,5,5,5])
print(df)
Output
Adding a new column by passing a Series:
Col1 Col2 Col3
0 10 30 5
1 15 30 5
2 20 30 5
3 25 30 5
Adding a new column using insert
• Add a column in the last place(3rd place) named
‘Col3' value=[50,100,150,200].
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print ("Adding a new column using insert
function:")
df.insert(loc=2,column='Col3',value=[50,100,150,20
0])
print df
Output
Adding a new column using insert function:
Col1 Col2 Col3
0 10 30 50
1 15 30 100
2 20 30 150
3 25 30 200
Rename Column
• rename column m1 as marks1in both the dataframes
df1 and df2.

import pandas as pd
df1=pd.DataFrame({'m1':[56,77,82], 'm2':[89,68,45]})
df2=pd.DataFrame({'m1':[10,20,30], 'm2':[15,25,35]})
print("Dataframe 1")
df1.rename(columns={'m1':'marks1'},inplace=True)
print df1
print("Dataframe 2")
df2.rename(columns={'m1':'marks1'}, inplace=True)
print df2
Output
Dataframe 1
marks1 m2
0 56 89
1 77 68
2 82 45
Dataframe 2
marks1 m2
0 10 15
1 20 25
2 30 35
Rename index label
• change index label from 0 to zero, from 1 to
one and 2 to two for the given dataframe.
import pandas as pd
df=pd.DataFrame({'m1':[56,77,82],'m2':[89,68,4
5]},index=[0,1,2])
df.rename(index = {0: 'zero', 1:'one', 2:'two'},
inplace = True)
print df
Output
m1 m2
zero 56 89
one 77 68
two 82 45
Column Deletion
• Column from a data frame can be deleted by any
of the following:
– del()
– pop()
– drop()

• Note - output will be displayed based on column name


ascending
del () – deletes a column by passing column name in
subscript with df
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Dataframe : ")
print df
print ("Deleting column using del():")
del df['Col1']
print df
Output
Dataframe:
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Deleting column using del():
Col2 Col3
0 30 40
1 30 45
2 30 50
3 30 55
pop() – deletes a column by passing column name
in pop method.

import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Dataframe : ")
print df
print ("Deleting column using pop():")
df.pop('Col2')
print df
Output
Dataframe :
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Deleting column using pop():
Col1 Col3
0 10 40
1 15 45
2 20 50
3 25 55
drop() – deletes a column by passing column name
and specify axis =1.

import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print("Dataframe : ")
print df
print ("Deleting column using drop():")
df1=df.drop('Col2',axis=1)
print df1
Output
Dataframe :
Col1 Col2
0 10 30
1 15 30
2 20 30
3 25 30
Deleting column using drop():
Col1
0 10
1 15
2 20
3 25
Head function
• Head -returns the first n rows. The default
number of elements to display is five.
• Display first two rows of given dataframe

import pandas as pd
df=pd.DataFrame({'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]})
print df.head(2)
Output
Age Name
0 28 Tom
1 34 Jack
Tail function
• Tail - returns the last n rows. The default
number of elements to display is five.
• Display last three rows of given dataframe

import pandas as pd
df=pd.DataFrame({'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]})
print df.tail(3)
Output
Age Name
1 34 Jack
2 29 Steve
3 42 Ricky
Boolean indexing
• Boolean indexing helps to select the data from the data frames
using a boolean vector.
import pandas as pd
dic = {'Name' : ['Sachin', 'Vinod', 'Rajesh'],
'Age' : [32, 35, 40]}
df = pd.DataFrame(dic, index = [True, False, True])
print("Dataframe : ")
print df
print
print("Boolean True Value : ")
print df.loc[True]
print
print("Boolean False Value : ")
print df.loc[False]
Output
Dataframe :
Age Name
True 32 Sachin
False 35 Vinod
True 40 Rajesh
Boolean True Value :
Age Name
True 32 Sachin
True 40 Rajesh
Boolean False Value :
Age 35
Name Vinod
dtype: object
Concatenation

• concat() - performs concatenation operations


along an axis

• ignore_index – Boolean. Default False. If True,


do not use the index values on the
concatenation axis.
import pandas as pd
dic1 = { 'id' : [1,2,3], ' v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2' :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.concat([df1,df2])
print df3
Output
id v1 v2
0 1 56 34
1 2 78 48
2 3 98 85
0 2 22 12
1 3 48 89
2 4 65 77
import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2' :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.concat([df1,df2], ignore_index=True)
print df3
Output
id v1 v2
0 1 56 34
1 2 78 48
2 3 98 85
3 2 22 12
4 3 48 89
5 4 65 77
Merging

• merge() – used to join two data frames with


holding different values and linked by common
column

• on – columns to join on.


import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98], 'v2'
:[34,44,85]}
dic2 = { 'id' : [2,3,4], 'v3':['a','b','c']}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.merge(df1,df2,on='id')
print df3
Output
id v1 v2 v3
0 2 78 44 a
1 3 98 85 b
Joining
• The full outer join combines the results of both
left and right outer joins.

• The joined data frame will contain all records


from both the data frames and fill NaN for
missing data on either side.

• how − ('left', 'right', 'outer', 'inner‘)


import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2‘ :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.merge(df1,df2,on='id',how='outer')
print df3
Output
id v1_x v2_x v1_y v2_y
0 1 56.0 34.0 NaN NaN
1 2 78.0 48.0 22.0 12.0
2 3 98.0 85.0 48.0 89.0
3 4 NaN NaN 65.0 77.0

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy