UNIT 3 (Chapter 2) Pandas

The document provides an overview of the Pandas library in Python, focusing on data manipulation techniques such as indexing, selection, and handling missing data. It introduces key concepts including Series and DataFrame objects, and demonstrates how to create and operate on these structures. Additionally, it covers installation, importing, and basic operations using Pandas.

Uploaded by

kavya sree bandi
Copyright
© © All Rights Reserved
UNIT-3 Python for Data Handling (Chapter-2 Pandas)

Data Manipulation with Pandas – Data indexing and selection, Operating on data,
Missing data, Hierarchical indexing, Combining Datasets, Aggregation and
Grouping, Pivot Tables.

Introduction to Pandas
What is Pandas?
 Pandas is a Python library used for working with data sets.
 It has functions for analyzing, cleaning, exploring, and manipulating data.
 The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library
was created by Wes McKinney in 2008.
Why Use Pandas?
 Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.
 Pandas can clean messy data sets, and make them readable and relevant.
 Relevant data is very important in data science.

Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy.
Install it using this command:
C:\Users\Your Name>pip install pandas
If this command fails, then use a Python distribution that already has Pandas installed, such as
Anaconda (which also bundles the Spyder IDE).
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
Example
import pandas
mydataset = {'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python, an alias is an alternate name for referring to the same thing.
Create an alias with the as keyword while importing:
import pandas as pd
Example
import pandas as pd
mydataset = {'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pd.DataFrame(mydataset)
print(myvar)
Checking Pandas Version
The version string is stored in the __version__ attribute.
Example
import pandas as pd
print(pd.__version__)

Pandas Objects
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the
rows and columns are identified with labels rather than simple integer indices.
Three fundamental Pandas data structures:
1. Series
2. DataFrame
3. Index.

The Pandas Series Object


A Pandas Series is a one-dimensional array of indexed data.
Example:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Output:
0 0.25
1 0.50
2 0.75
3 1.00

The Series wraps both a sequence of values and a sequence of indices, which we can access with
the values and index attributes.

print(data.values)
print(data.index)

Output:
[0.25 0.5 0.75 1. ]
RangeIndex(start=0, stop=4, step=1)

 The essential difference between NumPy one-dimensional array and pandas Series is the
presence of the index: while the NumPy array has an implicitly defined integer index used to
access the values, the Pandas Series has an explicitly defined index associated with the values.
 This explicit index definition gives the Series object additional capabilities. For example, the
index need not be an integer, but can consist of values of any desired type.

Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
print(data)
Output:
a 0.25
b 0.50
c 0.75
d 1.00
We can even use non-contiguous or non-sequential indices:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data)
Output:
2 0.25
5 0.50
3 0.75
7 1.00
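
With an integer index that is not in order, it matters whether a value is looked up by label or by position. The short sketch below reuses the Series from the example above; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks a value up by its index label
by_label = data.loc[3]        # the value stored under label 3
# .iloc looks a value up by its integer position
by_position = data.iloc[3]    # the fourth value in the Series

print(by_label)
print(by_position)
```

Here label 3 and position 3 refer to different elements, which is why the explicit index is described as separate from the implicit NumPy-style position.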

Constructing Series Object:


The general syntax to create pandas Series object is
pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities.
 data can be a list or NumPy array, in which case index defaults to an integer sequence.
 data can be a scalar, which is repeated to fill the specified index.
 data can be a dictionary, in which case index defaults to the dictionary keys (in insertion
order in current Pandas versions; older versions sorted the keys).
Example:
import pandas as pd
import numpy as np
arr=np.arange(10,60,10)
li=[10,20,30,40,50]
s=10
dic={'Ten':10,'Twenty':20,'Thirty':30,'Forty':40,'Fifty':50}
ser1 = pd.Series(arr)  # A one-dimensional ndarray
ser2 = pd.Series(li)  # A Python list
ser3 = pd.Series(s)  # A scalar value
ser4 = pd.Series(s,index=['a','b','c','d','e'])  # A scalar repeated over an explicit index
ser5 = pd.Series(dic)  # A Python dictionary
print('Numpy 1-D array is converted into Pandas Series:')
print(ser1)
print('--------------------------------------------------')
print('Python list is converted into Pandas Series:')
print(ser2)
print('--------------------------------------------------')
print('Scalar Value is converted into Pandas Series:')
print(ser3)
print('--------------------------------------------------')
print('Scalar Value is converted into Pandas Series with explicit indexing:')
print(ser4)
print('--------------------------------------------------')
print('Python dictionary is converted into Pandas Series with explicit indexing:')
print(ser5)

Output:
Numpy 1-D array is converted into Pandas Series:
0 10
1 20
2 30
3 40
4 50
dtype: int32
--------------------------------------------------
Python list is converted into Pandas Series:
0 10
1 20
2 30
3 40
4 50
dtype: int64
--------------------------------------------------
Scalar Value is converted into Pandas Series:
0 10
dtype: int64
--------------------------------------------------
Scalar Value is converted into Pandas Series with explicit indexing:
a 10
b 10
c 10
d 10
e 10
dtype: int64
--------------------------------------------------
Python dictionary is converted into Pandas Series with explicit indexing:
Ten 10
Twenty 20
Thirty 30
Forty 40
Fifty 50
dtype: int64
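
When data is a dictionary, the optional index argument can also be used to pick out and order a subset of the keys. This is a small sketch with a shortened version of the dictionary above; the name subset is only illustrative.

```python
import pandas as pd

dic = {'Ten': 10, 'Twenty': 20, 'Thirty': 30}

# Only the keys listed in index are kept, in the order given
subset = pd.Series(dic, index=['Thirty', 'Ten'])
print(subset)
```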

The Pandas DataFrame Object


 The DataFrame can be thought of either as a generalization of a NumPy array, or as a
specialization of a Python dictionary.
 A DataFrame is an analog of a two-dimensional array with both flexible row indices
and flexible column names.
 We can think of a DataFrame as a sequence of aligned (they share the same index)
Series objects.
 The DataFrame can be thought of as a generalization of a two-dimensional NumPy
array, where both the rows and columns have a generalized index for accessing the data.

Example:
#Pandas DataFrame
import pandas as pd
print('Data Frame:')
d=pd.DataFrame([[10,20],[30,40],[50,60]])
print(d)
d=pd.DataFrame([[10,20],[30,40],[50,60]],index=['row1','row2','row3'])
print('==========================================================')
print('Data Frame with explicit indexing for row:')
print(d)
d=pd.DataFrame([[10,20],[30,40],[50,60]],index=['row1','row2','row3'],columns=['col1','col2'])
print('=========================================================')
print('Data Frame with explicit indexing for rows and columns:')
print(d)

Output:
Data Frame:
0 1
0 10 20
1 30 40
2 50 60
=============================================================
Data Frame with explicit indexing for row:
0 1
row1 10 20
row2 30 40
row3 50 60
=============================================================
Data Frame with explicit indexing for rows and columns:
col1 col2
row1 10 20
row2 30 40
row3 50 60
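
The description of a DataFrame as a sequence of aligned Series objects can be checked directly. The sketch below rebuilds the last DataFrame from the example above and pulls out one column; the name col is only illustrative.

```python
import pandas as pd

d = pd.DataFrame([[10, 20], [30, 40], [50, 60]],
                 index=['row1', 'row2', 'row3'],
                 columns=['col1', 'col2'])

# Selecting one column returns a Series
col = d['col1']
print(type(col).__name__)

# Every column shares the row index of the DataFrame
print(col.index.equals(d['col2'].index))
```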

Constructing DataFrame Object:


A Pandas DataFrame can be constructed in a variety of ways.
 From a single Series object
 From List of Dicts
 From a dictionary of Series objects
 From a two-dimensional NumPy array
 From a NumPy structured array

#Constructing DataFrame from a Single Series Object


import pandas as pd
markslist = {'Kumar':89,'Rao':78,'Ali':67,'Singh':96}
marks = pd.Series(markslist)
df= pd.DataFrame(marks,columns=['Marks'])
print(df)

Output:
Marks
Kumar 89
Rao 78
Ali 67
Singh 96

#Construct a DataFrame from List of Dictionaries


import pandas as pd
d1={"A":10,"B":20,"C":30}
d2={"A":40,"B":50,"C":60}
d3={"A":70,"B":80,"C":90}
l=[d1,d2,d3]
data=pd.DataFrame(l)
print()
print('List of dictionaries as a DataFrame:')
print(data)

Output:
List of dictionaries as a DataFrame:


A B C
0 10 20 30
1 40 50 60
2 70 80 90

#Constructing DataFrame from Dictionary of Series Objects


import pandas as pd
branch={'Sajid':'CSE','Wahid':'EEE','Hafeez':'MECH'}
address={'Sajid':'SAP','Wahid':'NRT','Hafeez':'GNT'}
B=pd.Series(branch)
A=pd.Series(address)
data=pd.DataFrame({'Branch':B,'Address':A})
print(data)

Output:
Branch Address
Sajid CSE SAP
Wahid EEE NRT
Hafeez MECH GNT

#Construct a DataFrame from NumPy 2-D array


import pandas as pd
import numpy as np
data=pd.DataFrame(np.arange(10,16,1).reshape(2,3),index=['row1','row2'])
print(data)

Output:
0 1 2
row1 10 11 12
row2 13 14 15

#Constructing DataFrame from a NumPy Structured Array


import pandas as pd
import numpy as np
SA=np.zeros(3,dtype=[('A','i8'),('B','f8')])
data=pd.DataFrame(SA,index=['row1','row2','row3'])
print(data)

Output:
A B
row1 0 0.0
row2 0 0.0
row3 0 0.0

Data Indexing and Selection

Pandas Index Object:


 Both the Series and DataFrame objects contain an explicit index that lets us
reference and modify data.
 This Index object is an interesting structure in itself, and it can be thought of either as an
immutable array or as an ordered set.

Example:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
cind =pd.Index(['col1'])
ser = pd.Series([100,200,300,400],index=rind)
df = pd.DataFrame(ser,columns=cind)
print(df)
Output:

col1
row1 100
row2 200
row3 300
row4 400
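
As a rough illustration of referencing and modifying data through the explicit index, the sketch below builds a DataFrame like the one above and uses the loc (label-based) and iloc (position-based) accessors. The variable names value and first are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [100, 200, 300, 400]},
                  index=['row1', 'row2', 'row3', 'row4'])

# Reference a value by row label and column label
value = df.loc['row2', 'col1']

# Modify a value through the same labels
df.loc['row3', 'col1'] = 350

# Reference a value by integer position (first row, first column)
first = df.iloc[0, 0]

print(value, first)
print(df)
```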

Example:
#Index Object
import pandas as pd
rind=pd.Index(['row1','row2','row3'])
cind=['col1','col2']
data1=pd.DataFrame([[10,20],[30,40],[50,60]],rind,cind)
data2=pd.DataFrame([[1,2],[3,4],[5,6]],rind,cind)
data3=pd.DataFrame([[100,200],[300,400],[500,600]],rind,cind)
print(data1)
print("--------------------------")
print(data2)
print("--------------------------")
print(data3)

Output:
col1 col2
row1 10 20
row2 30 40
row3 50 60
--------------------------
col1 col2
row1 1 2
row2 3 4
row3 5 6
--------------------------
col1 col2
row1 100 200
row2 300 400
row3 500 600
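
The two views of the Index object mentioned earlier, as an immutable array and as an ordered set, can be sketched as follows. The index values and the names indA, indB, common, and combined are chosen only for illustration.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Array-like: indexing and slicing work as for a NumPy array
print(indA[1], indA[::2])

# Immutable: assigning to an element raises TypeError
immutable = False
try:
    indA[1] = 0
except TypeError as err:
    immutable = True
    print('Index is immutable:', err)

# Set-like: intersection and union of two indices
common = indA.intersection(indB)
combined = indA.union(indB)
print(common)
print(combined)
```

Immutability makes it safer to share one Index between several Series or DataFrames, as the examples above do with rind.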

Operating on Data in Pandas


 Pandas inherits much of its numerical functionality from NumPy, in particular the universal
functions (ufuncs). Pandas can therefore perform quick element-wise operations, both with basic
arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations
(trigonometric functions, exponential and logarithmic functions, etc.).
 For unary operations like negation and trigonometric functions, these ufuncs will preserve
index and column labels in the output.
 For binary operations such as addition and multiplication, Pandas will automatically align
indices when passing the objects to the ufunc.
 Universal functions therefore work on Series and DataFrames through:
 Index preservation
 Index alignment
Index Preservation:
#Operating on Data in pandas
#index preservation in series and dataframe
import numpy as np
import pandas as pd
s=pd.Series([10,20,30,40])
print('Series:')
print(s)
df=pd.DataFrame(np.arange(1,13,1).reshape(3,4))
print('DataFrame:')
print(df)
print("===================================================")
print("Adding 5 to each element of the Series")
print(np.add(s,5))
print("===================================================")
print("Adding 10 to each element of the DataFrame")
print(np.add(df,10))
print('================================================')
print('Trigonometric function sin applied on Series:')
print(np.sin(s))
print('Logarithmic function applied on one element of the DataFrame (df[0][0]):')
print(np.log(df[0][0]))

Output:
Series:
0 10
1 20
2 30
3 40
dtype: int64
DataFrame:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
===================================================
Adding 5 to each element of the Series
0 15
1 25
2 35
3 45
dtype: int64
===================================================
Adding 10 to each element of the DataFrame
0 1 2 3
0 11 12 13 14
1 15 16 17 18
2 19 20 21 22
================================================
Trigonometric function sin applied on Series:
0 -0.544021
1 0.912945
2 -0.988032
3 0.745113
dtype: float64
Logarithmic function applied on one element of the DataFrame (df[0][0]):
0.0

Index Alignment in Series


 Pandas will align indices in the process of performing the operation. This is very convenient
when we are working with incomplete data.
 Suppose we are combining two different data sources; the indices will then be aligned accordingly.

Example:
#Index Alignment in Series
import numpy as np
import pandas as pd
A=pd.Series([2,4,6],index=[0,1,2])
B=pd.Series([1,3,5],index=[1,2,3])
print(A.add(B))
print("===========================================================")
print("Fill value for any elements in A or B that might be missing")
print(A.add(B,fill_value=0))# fill value for any elements in A or B that might be missing

Output:
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
===========================================================
Fill value for any elements in A or B that might be missing
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64

Index Alignment in DataFrame


A similar type of alignment takes place for both columns and indices when we are performing
operations on DataFrames.

Example:
#Index Alignment in DataFrame
import numpy as np
import pandas as pd
A=pd.DataFrame(np.arange(1,5,1).reshape(2,2), columns=list('AB'))
B=pd.DataFrame(np.arange(1,10,1).reshape(3,3), columns=list('BAC'))
print("DataFrame A:")
print("-------------------")
print(A)
print("DataFrame B:")
print("-------------------")
print(B)
print("Addition of DataFrame A and B:")
print("-----------------------------")
print(A.add(B))

Output:
DataFrame A:
-------------------
A B
0 1 2
1 3 4
DataFrame B:
-------------------
B A C
0 1 2 3
1 4 5 6
2 7 8 9
Addition of DataFrame A and B:
-----------------------------
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
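
As with Series, the add method on a DataFrame accepts a fill_value argument that is used in place of entries missing from one of the two inputs. The sketch below reuses A and B from the example above; the choice of 0 as the fill value and the name filled are only illustrative.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.arange(1, 5).reshape(2, 2), columns=list('AB'))
B = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=list('BAC'))

# Entries present in only one DataFrame are combined with 0 instead of NaN
filled = A.add(B, fill_value=0)
print(filled)
```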

Operations between DataFrame and Series


 When we are performing operations between a DataFrame and a Series, the index and column
alignment is similarly maintained.
 Operations between a DataFrame and a Series are similar to operations between a two-
dimensional and one-dimensional NumPy array.
#Operation between DataFrame and Series
import numpy as np
import pandas as pd
s=pd.Series([10,20])
df=pd.DataFrame([[100,200],[300,400]])
print("Series:")
print('----------')
print(s)
print("\nDataFrame:")
print('-----------')
print(df)
print("\nSubtraction of DataFrame with Series:")
print("-------------------------------------")
print(df.subtract(s))
print("\nSubtraction of DataFrame with Series at Axis=0: ")
print("-------------------------------------")
print(df.subtract(s, axis=0))

Output:
Series:
----------
0 10
1 20
dtype: int64

DataFrame:
-----------
0 1
0 100 200
1 300 400

Subtraction of DataFrame with Series:


-------------------------------------
0 1
0 90 180
1 290 380

Subtraction of DataFrame with Series at Axis=0:


-------------------------------------
0 1
0 90 190
1 280 380

Handling Missing Data


 A number of schemes have been developed to indicate the presence of missing data in a table
or DataFrame.
 Generally, they revolve around one of two strategies: using a mask that globally indicates
missing values, or choosing a sentinel value that indicates a missing entry.
 In the masking approach, the mask might be an entirely separate Boolean array, or it may
involve appropriation of one bit in the data representation to locally indicate the null status
of a value.
 In the sentinel approach, the sentinel value could be some data-specific convention, such
as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a
more global convention, such as indicating a missing floating-point value with NaN (Not
a Number), a special value which is part of the IEEE floating-point specification.
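
The two strategies can be compared in a small NumPy sketch. It assumes NumPy's masked-array module (np.ma) as one possible realization of the masking approach, and NaN as the sentinel; the array contents are made up for illustration.

```python
import numpy as np

# Masking approach: a separate Boolean array marks the missing entry
masked = np.ma.masked_array([1, 2, 3, 4], mask=[False, False, True, False])
masked_total = masked.sum()     # the masked entry is skipped

# Sentinel approach: NaN stored in the data itself marks the missing entry
x = np.array([1, 2, np.nan, 4])
plain_total = np.sum(x)         # NaN propagates through an ordinary sum
nan_total = np.nansum(x)        # nansum ignores the NaN sentinel

print(masked_total, plain_total, nan_total)
```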

Example: Missing Values in Numpy


import numpy as np
import pandas as pd
x=np.array([1,2,np.nan,4])
print('x=',x,'\n')
print('Sum of elements in numpy array x:')
print('sum(x)=',np.nansum(x))
y=np.array([10,20,30,np.nan])
print('----------------------------------------')
print('y=',y,'\n')
print('Sum of elements in numpy array y:')
print('sum(y)=',np.nansum(y))
print('----------------------------------------')
print('Addition of Numpy Array x and y:')
print(x+y)
Output:
x= [ 1. 2. nan 4.]

Sum of elements in numpy array x:


sum(x)= 7.0
----------------------------------------
y= [10. 20. 30. nan]

Sum of elements in numpy array y:


sum(y)= 60.0
----------------------------------------
Addition of Numpy Array x and y:
[11. 22. nan nan]

Missing Data in Pandas


 The way in which Pandas handles missing values is constrained by its reliance on the NumPy
package, which does not have a built-in notion of NA values for non-floating-point data types.
 NumPy supports fourteen basic integer types once we account for available precisions,
signedness, and endianness of the encoding.
 Reserving a specific bit pattern in all available NumPy types would lead to an unwieldy
amount of overhead in special-casing various operations for various types, likely even
requiring a new fork of the NumPy package.
 Pandas chose to use sentinels for missing data, and further chose to use two already-
existing Python null values: the special floating point NaN value, and the Python None
object.
 This choice has some side effects, as we will see, but in practice ends up being a good
compromise in most cases of interest.

None: Pythonic missing data


 The first sentinel value used by Pandas is None, a Python singleton object that is often
used for missing data in Python code. Because None is a Python object, it cannot be used
in an arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays
of Python objects).
 This dtype=object means that the best common type representation NumPy could infer
for the contents of the array is that they are Python objects.
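
A short sketch of what this means in practice: an array containing None falls back to dtype=object, and numerical aggregations on it fail because None cannot take part in arithmetic. The array contents are made up for illustration.

```python
import numpy as np

vals = np.array([1, None, 3, 4])
print(vals.dtype)          # the common type is a generic Python object

# Adding an integer and None is not defined, so the aggregation fails
failed = False
try:
    vals.sum()
except TypeError as err:
    failed = True
    print('Aggregation failed:', err)
```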

NaN: Missing numerical data


 NaN is a special floating-point value recognized by all systems that use the standard IEEE
floating-point representation.
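
Because NaN is an ordinary floating-point value, arithmetic with it does not raise an error; instead the result is again NaN. The sketch below shows this behaviour, along with the fact that NaN never compares equal to itself, so np.isnan should be used to test for it. The variable names are only illustrative.

```python
import numpy as np

added = 1 + np.nan             # NaN propagates through arithmetic
scaled = 0 * np.nan            # even multiplication by zero gives NaN
equal = (np.nan == np.nan)     # NaN is not equal to itself
detected = np.isnan(np.nan)    # the reliable way to test for NaN

print(added, scaled, equal, detected)
```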

NaN and None in Pandas


NaN and None both have their place, and Pandas is built to handle the two of them nearly
interchangeably.

NaN(not a number) is considered a missing value:


In Python, you can create nan with float('nan'), math.nan, or np.nan. nan is considered a missing
value in pandas.
Example:
import numpy as np
import pandas as pd
import math
s_nan = pd.Series([float('nan'), math.nan, np.nan])
print(s_nan)
Output:
0 NaN
1 NaN
2 NaN
dtype: float64

None is also considered a missing value:


In pandas, None is also treated as a missing value. None is a built-in constant in Python.
print(None)
# None
print(type(None))
# <class 'NoneType'>

For numeric columns, None is converted to nan when a DataFrame or Series containing None is
created, or None is assigned to an element.

Example:
import pandas as pd
s_none_float = pd.Series([None, 10, 20])
print(s_none_float)

Output:
0 NaN
1 10.0
2 20.0
dtype: float64

None in the object column remains as None:


Example:
import pandas as pd
s_none_object = pd.Series([None, 'abc', 'xyz'])


print(s_none_object)

Output:
0 None
1 abc
2 xyz
dtype: object

Operating on Null Values


There are several useful methods for detecting, removing, and replacing null values in Pandas
data structures.
They are:
 isnull() - Generate a Boolean mask indicating missing values
 notnull() - Opposite of isnull()
 dropna() - Return a filtered version of the data
 fillna() - Return a copy of the data with missing values filled or imputed
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull() and notnull().

Example:
import pandas as pd
s_none_float = pd.Series([None, 10, 20])
print(s_none_float)
print('--------------------------------')
print(s_none_float.isnull())
print('------------------------------------------')
print(s_none_float.notnull())
Output:
0 NaN
1 10.0
2 20.0
dtype: float64
--------------------------------
0 True
1 False
2 False
dtype: bool
------------------------------------------
0 False
1 True
2 True
dtype: bool

Dropping Null Values:


import pandas as pd
s_none_float = pd.Series([None, 10, 20])
print(s_none_float)
print('------------------------------------------')
print('Null Values dropped from the series:')
print(s_none_float.dropna())

Output:
0 NaN
1 10.0
2 20.0
dtype: float64
------------------------------------------
Null Values dropped from the series:
1 10.0
2 20.0
dtype: float64
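
For a DataFrame, dropna() cannot remove a single value; it removes whole rows (the default) or whole columns that contain a null value. The sketch below uses a made-up 3x3 DataFrame; the names rows_kept and cols_kept are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Default: drop every row that contains at least one null value
rows_kept = df.dropna()

# axis='columns': drop every column that contains at least one null value
cols_kept = df.dropna(axis='columns')

print(rows_kept)
print(cols_kept)
```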
Filling null values:
The fillna() method replaces the NULL values with a specified value.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([np.nan, 10, 20,30])
print(ser)
print('-------------------------------')
print('Series Null Values are filled with 0:')
print(ser.fillna(0))
Output:
0 NaN
1 10.0
2 20.0
3 30.0
dtype: float64
-------------------------------
Series Null Values are filled with 0:
0 0.0
1 10.0
2 20.0
3 30.0
dtype: float64
ffill():
The ‘ffill’ (forward fill) method fills each missing value with the last valid value before it in
the data sequence.
Example:
import numpy as np
import pandas as pd
ex1 = pd.Series([1,3,np.nan,4])
print(ex1.ffill())
The older spelling ex1.fillna(method='ffill') gives the same result, but the method argument is
deprecated in recent Pandas releases, so ffill() is preferred.

Output:
0 1.0
1 3.0
2 3.0
3 4.0
dtype: float64

‘bfill’ (Backward Fill):


The ‘bfill’ method instead fills each missing value with the first valid value that comes after
it in the data sequence.
Example:
import numpy as np
import pandas as pd
ex1 = pd.Series([1,3,np.nan,4])
print(ex1.bfill())

Output:
0 1.0
1 3.0
2 4.0
3 4.0
dtype: float64

Hierarchical Indexing
 Hierarchical indexing (also known as multi-indexing) is used to incorporate multiple index
levels within a single index.
 In this way, higher-dimensional data can be compactly represented within the familiar one-
dimensional Series and two-dimensional DataFrame objects.
 A Multiply Indexed Series: Here we represent two-dimensional data within a one-dimensional
Series.
Example:
#Hierarchical Indexing
import numpy as np
import pandas as pd
ser = pd.Series([10,20,30,40,50,60],index = [[1,1,1,2,2,2,],['a','b','c','a','b','c']])
print(ser)
print('----------------------------------------')
ser.index.names = ['index1','index2']
print(ser)

Output:
1 a 10
b 20
c 30
2 a 40
b 50
c 60
dtype: int64
----------------------------------------
index1 index2
1 a 10
b 20
c 30
2 a 40
b 50
c 60
dtype: int64
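
Once a Series has a hierarchical index, data can be selected by the outer level alone or by a full (outer, inner) pair, and unstack() turns the inner level into columns. The sketch below reuses the Series from the example above; the names outer, single, and table are only illustrative.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50, 60],
                index=[[1, 1, 1, 2, 2, 2], ['a', 'b', 'c', 'a', 'b', 'c']])
ser.index.names = ['index1', 'index2']

outer = ser.loc[1]            # all entries under outer label 1
single = ser.loc[(2, 'b')]    # one entry, addressed by both levels
table = ser.unstack()         # inner level becomes the columns

print(outer)
print(single)
print(table)
```

The unstacked result is an ordinary two-dimensional DataFrame, which shows how the multiply indexed Series stores two-dimensional data.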
Example 2:
#Multiply Indexed DataFrame
import numpy as np
import pandas as pd
data = [[25,24],[28,26],[29,28],[27,26],[30,29],[28,27]]
ind = [['1201','1201','1264','1264','12C7','12C7'],['mid1','mid2','mid1','mid2','mid1','mid2']]
col = ['DS','DP']
df = pd.DataFrame(data,index=ind,columns=col)
print(df)
print('--------------------------------------------------')
df.index.names =['RollNo','Mid Result']
print(df)

Output:
DS DP
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
--------------------------------------------------
DS DP
RollNo Mid Result
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
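
Selection on a multiply indexed DataFrame works in a similar way. The sketch below rebuilds the DataFrame from Example 2 and selects by the outer level, by a full index pair, and by the inner level using xs(); the names student, mark, and mid1 are only illustrative.

```python
import pandas as pd

data = [[25, 24], [28, 26], [29, 28], [27, 26], [30, 29], [28, 27]]
ind = [['1201', '1201', '1264', '1264', '12C7', '12C7'],
       ['mid1', 'mid2', 'mid1', 'mid2', 'mid1', 'mid2']]
df = pd.DataFrame(data, index=ind, columns=['DS', 'DP'])
df.index.names = ['RollNo', 'Mid Result']

student = df.loc['1201']                  # both mid results of one roll number
mark = df.loc[('1264', 'mid2'), 'DS']     # a single mark
mid1 = df.xs('mid1', level='Mid Result')  # mid1 results of every roll number

print(student)
print(mark)
print(mid1)
```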
Combining Datasets
Some of the most interesting studies of data come from combining different data sources.
 These operations can involve anything from very straightforward concatenation of two
different datasets, to more complicated database-style joins and merges that correctly handle
any overlaps between the datasets.
 These operations can be:
 simple concatenation of Series and DataFrames with the pd.concat function
 in-memory merges and joins implemented in Pandas.

Simple Concatenation with pd.concat


 Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains
a number of other options.
 pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as
np.concatenate() can be used for simple concatenations of arrays.
Example 1:
import pandas as pd
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[1, 2, 3])
print('Concatenation of Series 1 and Series 2:')
print(pd.concat([ser1, ser2]))

Output:
Concatenation of Series 1 and Series 2:
1 A
2 B
3 C
1 D
2 E
3 F
dtype: object
Example-2:
#Combining Datasets
#Concatenation in DataFrame
import pandas as pd
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['A','B'])
print(df1)
print('---------------------------')
print(df2)
print('---------------------------')
print(pd.concat([df1, df2]))

Output:
A B
1 10 20
2 30 40
---------------------------
A B
1 50 60
2 70 80
---------------------------
A B
1 10 20
2 30 40
1 50 60
2 70 80
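
Notice that the concatenated result above repeats the index labels 1 and 2. The sketch below shows three options of pd.concat for dealing with this: ignore_index to renumber the rows, keys to add an outer index level, and verify_integrity to raise an error on duplicates. The key names 'first' and 'second' are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]], index=[1, 2], columns=['A', 'B'])
df2 = pd.DataFrame([[50, 60], [70, 80]], index=[1, 2], columns=['A', 'B'])

# Discard the original labels and number the rows 0, 1, 2, ...
renumbered = pd.concat([df1, df2], ignore_index=True)

# Keep the labels but add an outer level naming the source
labelled = pd.concat([df1, df2], keys=['first', 'second'])

# Refuse to build a result with duplicate index labels
rejected = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as err:
    rejected = True
    print('Duplicate labels rejected:', err)

print(renumbered)
print(labelled)
```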

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like
np.concatenate, pd.concat allows specification of an axis along which concatenation will take
place.
Example-3:
#Axis wise Concatenation in DataFrame
import pandas as pd
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['A','B'])
print(df1)
print('-------------------------------------')
print(df2)
print('-------------------------------------')
print(pd.concat([df1, df2],axis=1))

Output:
A B
1 10 20
2 30 40
-------------------------------------
A B
1 50 60
2 70 80
-------------------------------------
A B A B
1 10 20 50 60
2 30 40 70 80

Example-4:
#Axis wise Concatenation in DataFrame
import pandas as pd
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[3,4],columns=['C','D'])
print(df1)
print('-------------------------------------')
print(df2)
print('-------------------------------------')
print(pd.concat([df1, df2],axis=1))

Output:
A B
1 10 20
2 30 40
-------------------------------------
C D
3 50 60
4 70 80
-------------------------------------
A B C D
1 10.0 20.0 NaN NaN
2 30.0 40.0 NaN NaN
3 NaN NaN 50.0 60.0
4 NaN NaN 70.0 80.0

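append()
Older versions of Pandas gave Series and DataFrame objects an append method that accomplished
the concatenation in fewer keystrokes: rather than calling pd.concat([df1, df2]), we could simply
call df1.append(df2).
This method was deprecated in Pandas 1.4 and removed in Pandas 2.0, so with current versions use
pd.concat instead:
print(df1)
print(df2)
print(pd.concat([df1, df2]))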

Merge and Join


One essential feature offered by Pandas is its high-performance, in-memory join and merge
operations.
Categories of Joins
 One-to-one joins
 Many-to-one joins
 Many-to-many joins
One-to-one joins
The simplest type of merge expression is the one-to-one join, which is in many ways very similar
to the column-wise concatenation.
Example:
#Merging Data Frames
#one to one join
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 'group': ['Accounting',
'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'], 'hire_date': [2004, 2008, 2012,
2014]})
print(df1)
print('-------------------------------')
print(df2)
print('-------------------------------')
df3=pd.merge(df1,df2)
print(df3)
Output:
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
-------------------------------
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
-------------------------------
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
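
In the example above, pd.merge finds the shared "employee" column by itself. As a hedged variation, the key column can be named explicitly with the on argument, and the validate argument can be used to have Pandas check that the join really is one-to-one. The data is the same as above.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Name the key column explicitly and check that keys are unique on both sides
df3 = pd.merge(df1, df2, on='employee', validate='one_to_one')
print(df3)
```

If either key column contained duplicates, the validate check would raise an error instead of silently producing extra rows.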

Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For
the many-to-one case, the resulting DataFrame will preserve those duplicate entries as
appropriate.
Example:
#Many to one join
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 'group': ['Accounting',
'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'], 'hire_date': [2004, 2008, 2012,
2014]})
df3=pd.merge(df1,df2)
print(df3)
print('-------------------------------')
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 'supervisor': ['Carly', 'Guido',
'Steve']})
print(pd.merge(df3,df4))

Output:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
-------------------------------
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve

The resulting DataFrame has an additional column with the “supervisor” information, where the
information is repeated in one or more locations as required by the inputs.
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the
key column in both the left and right DataFrames contains duplicates, then the result is a
many-to-many merge.

Example:
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 'group': ['Accounting',
'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR',
'HR'], 'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
df6=pd.merge(df1,df5)
print(df6)

Output:
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding


5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
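
All of the merge examples above use keys that appear in both inputs. When some keys appear on only one side, the how argument of pd.merge decides which rows are kept. This is not covered in the examples above, so the sketch below uses small made-up DataFrames named left and right: the default is an inner join, and how='outer' keeps every key and fills the gaps with NaN.

```python
import pandas as pd

left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                     'group': ['Accounting', 'Engineering', 'Engineering']})
right = pd.DataFrame({'employee': ['Bob', 'Sue'],
                      'hire_date': [2008, 2014]})

# Default (inner join): only keys present in both DataFrames survive
inner = pd.merge(left, right)

# Outer join: every key from either DataFrame survives
outer = pd.merge(left, right, how='outer')

print(inner)
print(outer)
```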

Aggregation and Grouping


An essential piece of analysis of large data is efficient summarization: computing aggregations
like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the
nature of a potentially large dataset.
Aggregation in pandas can be performed by:
 Simple Aggregation
 Operations based on the concept of a groupby.
Simple Aggregation in Pandas
As with a one-dimensional NumPy array, for a Pandas Series the aggregates return a single value:

Example:
# Aggregation in pandas
import pandas as pd
ser1=pd.Series([1,2,3,4,5])
print('Mean Value of Series:')
print(ser1.mean())
print('-----------------------')
print('Minimum Value of the Series:')
print(ser1.min())
print('-----------------------')
print('Maximum Value of the Series:')
print(ser1.max())
df=pd.DataFrame([[1,2,3],[4,5,6]])
print('-----------------------')
print('Maximum Value of the DataFrame:')
print(df.max())

Output:
Mean Value of Series:
3.0
-----------------------
Minimum Value of the Series:
1
-----------------------
Maximum Value of the Series:
5
-----------------------
Maximum Value of the DataFrame:
0 4
1 5
2 6
dtype: int64
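For a DataFrame, the aggregates are computed within each column by default, which is why the last result above has one entry per column. As a small sketch, the axis argument can be used to aggregate within each row instead; the names col_max and row_max are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Default: one result per column.
col_max = df.max()
# axis='columns': one result per row.
row_max = df.max(axis='columns')

print(col_max.tolist())
print(row_max.tolist())
```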

Pandas Series and DataFrames include all of the common aggregates. In addition, there is a
convenience method, describe(), that computes several common aggregates for each column and
returns the result.
Example:
#Describe function
import pandas as pd
df=pd.DataFrame([[1,2,3],[4,5,6]])
print(df.describe())

Output:
0 1 2
count 2.00000 2.00000 2.00000
mean 2.50000 3.50000 4.50000
std 2.12132 2.12132 2.12132
min 1.00000 2.00000 3.00000
25% 1.75000 2.75000 3.75000
50% 2.50000 3.50000 4.50000
75% 3.25000 4.25000 5.25000
max 4.00000 5.00000 6.00000

Some other built-in Pandas aggregations include count(), median(), std(), var(), prod(), and sum().
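As a brief illustration, the sketch below applies several built-in aggregates (count(), median(), sum(), prod(), var(), and std()) to the small Series used earlier. The variable names are only illustrative.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

n_items = ser.count()     # number of non-missing items
middle = ser.median()     # median value
total = ser.sum()         # sum of all items
product = ser.prod()      # product of all items
variance = ser.var()      # sample variance (divides by n - 1)
std_dev = ser.std()       # sample standard deviation

print(n_items, middle, total, product, variance, std_dev)
```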

GroupBy: Split, Apply, Combine


 The groupby operation allows us to quickly and efficiently compute aggregates on subsets of data.
 The groupby operation is used to aggregate conditionally on some label or index.
 The name “group by” comes from a command in the SQL database language, but it is perhaps
more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame:
split, apply, combine.
 The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
 The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
 The combine step merges the results of these operations into an output array.
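The three steps can be written out by hand to see what groupby does in a single call. This is only a sketch of the idea, not how pandas implements it internally; the names pieces, sums, and manual are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})

# Split: one sub-DataFrame per distinct value of the key column.
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# Apply: reduce each piece to a single number (here, the sum).
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(sums)
print(manual)

# The same result from a single groupby call.
print(df.groupby('key')['data'].sum())
```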

Example:
#group by function
import pandas as pd
import numpy as np
df = pd.DataFrame({'key':['A','B','C','A','B','C'],'data':np.arange(1,7)},columns=['key','data'])
print(df)
print('----------------------------------------')
print('Applying group by function on data frame:')
print(df.groupby('key').sum())

Output:
key data
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
----------------------------------------
Applying group by function on data frame:
data
key
A 5
B 7
C 9
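Besides a single aggregate such as sum(), the apply step can be an aggregation with several functions, a transformation, or a filter. The sketch below shows one possible use of each on the same data; the threshold of 6 in the filter is an arbitrary value chosen for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': np.arange(1, 7)})
grouped = df.groupby('key')['data']

# Aggregate: several summaries per group at once.
summary = grouped.agg(['min', 'max', 'mean'])

# Transform: output has the same length as the input;
# here each value is centered on the mean of its own group.
centered = grouped.transform(lambda x: x - x.mean())

# Filter: keep only the rows of groups whose sum exceeds 6.
big = df.groupby('key').filter(lambda g: g['data'].sum() > 6)

print(summary)
print(centered)
print(big)
```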

Pivot Tables
 A pivot table is an operation similar to GroupBy that is commonly seen in spreadsheets and other
programs that operate on tabular data.
 The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
 We can think of pivot tables as essentially a multidimensional version of GroupBy aggregation,
i.e., we split-apply-combine, but both the split and the combine happen across a
two-dimensional grid rather than a one-dimensional index.
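The link to GroupBy can be made concrete: grouping by two keys and unstacking the second key produces the same two-dimensional table as a pivot table. The sketch below uses a small, made-up salary table whose column names and values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Job': ['FullTime', 'Intern', 'PartTime', 'FullTime'],
                   'Dept': ['Admin', 'Tech', 'Admin', 'Management'],
                   'Sal': [20000, 50000, 10000, 20000]})

# GroupBy route: split on two keys, apply the mean, then move the
# second key ('Dept') from the row index to the columns.
via_groupby = df.groupby(['Job', 'Dept'])['Sal'].mean().unstack()

# Pivot-table route: the same summary in one call.
via_pivot = df.pivot_table(values='Sal', index='Job', columns='Dept',
                           aggfunc='mean')

print(via_groupby)
print(via_pivot)
```

Combinations that do not occur in the data, such as an Intern in Admin, appear as NaN in both tables.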
Pivot Table Syntax: The main parameters of the pandas pivot_table function are as follows
(the DataFrame.pivot_table method accepts the same arguments except data):
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All')

where
data : the pandas DataFrame to summarize
index : column(s) whose values become the row labels (the grouping keys)
values : column(s) to aggregate
columns : column(s) whose values are displayed horizontally across the top of the resulting table
fill_value and dropna : control how missing data is handled
aggfunc : the type of aggregation applied, which is a mean by default
margins : if True, compute totals along each grouping
margins_name : the label used for the totals row and column ('All' by default)

Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Kumar','Rao','Ali','Singh'],
'Job':['FullTimeEmployee','Intern','PartTime Employee','FullTimeEmployee'],
'Dept':['Admin','Tech','Admin','management'],
'YOJ':[2018,2019,2018,2010],'Sal':[20000,50000,10000,20000]})
print(df.to_string())
output = pd.pivot_table(data=df,index=['Job'],columns = ['Dept'],values ='Sal',aggfunc ='mean')
print('\n-------------------------------------------------------\n')
print(output.to_string())

Output:
Name Job Dept YOJ Sal
0 Kumar FullTimeEmployee Admin 2018 20000
1 Rao Intern Tech 2019 50000
2 Ali PartTime Employee Admin 2018 10000
3 Singh FullTimeEmployee management 2010 20000

-------------------------------------------------------

Dept Admin Tech management
Job
FullTimeEmployee 20000.0 NaN 20000.0
Intern NaN 50000.0 NaN
PartTime Employee 10000.0 NaN NaN
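The NaN entries above mark Job and Dept combinations that do not occur in the data. As a possible variation on the same example, the sketch below uses fill_value to replace them with 0 and margins to add totals; the label 'Total' and the choice of 'sum' as the aggregation are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kumar', 'Rao', 'Ali', 'Singh'],
                   'Job': ['FullTimeEmployee', 'Intern',
                           'PartTime Employee', 'FullTimeEmployee'],
                   'Dept': ['Admin', 'Tech', 'Admin', 'management'],
                   'YOJ': [2018, 2019, 2018, 2010],
                   'Sal': [20000, 50000, 10000, 20000]})

# fill_value=0 replaces missing combinations with 0;
# margins=True adds a totals row and column, labelled by margins_name.
table = pd.pivot_table(data=df, index='Job', columns='Dept', values='Sal',
                       aggfunc='sum', fill_value=0,
                       margins=True, margins_name='Total')
print(table.to_string())
```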