Eda Unit 2

The document discusses various data manipulation techniques using Pandas library in Python like data indexing and selection, handling missing data, hierarchical indexing, combining datasets, aggregation and grouping. It covers Pandas objects like Series, DataFrame, introducing Pandas indexing techniques like [], loc[], iloc[] and ix[] along with examples.

Uploaded by

60 Vibha Shree.S

UNIT II

EDA USING PYTHON



Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append, Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations
Installing and Using Pandas
• Once Pandas is installed, you can import it and check the version:
In[1]: import pandas
       pandas.__version__
Out[1]: '0.18.1'
• Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
In[2]: import pandas as pd
• In IPython, tab completion lets you explore the contents of a package. For example, to display all the contents of the pandas namespace, you can type this:
In [3]: pd.<TAB>
• And to display the built-in Pandas documentation, you can use this:
In [4]: pd?
Introducing Pandas Objects
• Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
• Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
• Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.
• We will start our code sessions with the standard NumPy and Pandas imports:
In[1]: import numpy as np
       import pandas as pd
Introducing Pandas Objects
Series as generalized NumPy array
A Series looks much like a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
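The difference between an implicit and an explicit index can be illustrated with a small sketch. The values and the labels 'a' to 'd' are arbitrary choices made for this example, not taken from the slides.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed only by its implicit integer positions.
arr = np.array([0.25, 0.5, 0.75, 1.0])

# A Series wraps the same values but carries an explicit index.
# The labels 'a'..'d' are illustrative assumptions.
s = pd.Series(arr, index=['a', 'b', 'c', 'd'])

print(arr[1])    # positional access on the array
print(s['b'])    # label-based access on the Series
print(s.index)   # the explicit index object
```

Because the index is an explicit object, it does not have to be made of integers; any hashable labels could be used in the same way.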
Series as specialized dictionary

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.
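The dictionary analogy can be made concrete by building a Series directly from a Python dictionary. This is a minimal sketch; the subject names and marks are made up for illustration.

```python
import pandas as pd

# Hypothetical data: any dictionary with values of one type would do.
marks = {'maths': 90, 'physics': 75, 'chemistry': 82}
marks_series = pd.Series(marks)

# Dictionary-style lookup by key.
print(marks_series['physics'])

# Unlike a plain dict, a Series also supports slicing by label
# (both end labels are included).
print(marks_series['maths':'physics'])

# The values share a single type.
print(marks_series.dtype)
```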
Constructing Series objects
The Pandas DataFrame Object
DataFrame as specialized dictionary
Indexing and Selecting Data with Pandas
Indexing in Pandas:
Indexing in pandas simply means selecting particular rows and columns of data from a DataFrame. It could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each. Indexing is also known as subset selection.
• Pandas Indexing using [ ], .loc[ ], .iloc[ ], .ix[ ]
• There are many ways to pull elements, rows, and columns from a DataFrame. The indexing methods below look very similar but behave very differently. Pandas supports four types of multi-axis indexing:
• Dataframe[ ]: the indexing operator.
• Dataframe.loc[ ]: selection by label.
• Dataframe.iloc[ ]: selection by integer position.
• Dataframe.ix[ ]: selection by both label and integer position (deprecated in pandas 0.20 and removed in pandas 1.0).
• Collectively, they are called the indexers. They are by far the most common ways to get elements, rows, and columns from a DataFrame.
Indexing and Selecting Data with Pandas
Selecting a single column
In order to select a single column, we simply put the name of the column between the brackets.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving columns by indexing operator
first = data["Age"]
print(first)
Indexing and Selecting Data with Pandas
Selecting multiple columns
In order to select multiple columns, we pass a list of column names to the indexing operator.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]
first
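Since nba.csv is not included with these slides, the same two selections can be tried on a small in-memory frame. This is only a sketch: the player names and all values are made up, and only the column names follow the slides.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv; rows and values are invented.
players = pd.DataFrame(
    {
        "Team": ["Red", "Red", "Blue"],
        "Age": [25, 22, 30],
        "Salary": [1000, 2000, 3000],
    },
    index=pd.Index(["Player A", "Player B", "Player C"], name="Name"),
)

age = players["Age"]                  # one column name gives a Series
subset = players[["Age", "Salary"]]   # a list of names gives a DataFrame

print(type(age).__name__)
print(type(subset).__name__)
print(list(subset.columns))
```

Note the double brackets in the second selection: the outer pair is the indexing operator and the inner pair is the Python list of column names.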
Indexing and Selecting Data with Pandas
• Indexing a DataFrame using .loc[ ]:
This indexer selects data by the labels of the rows and columns. The df.loc indexer selects data in a different way than the plain indexing operator: it can select subsets of rows or columns, and it can also select subsets of rows and columns simultaneously.
• Selecting a single row
In order to select a single row using .loc[], we put a single row label inside the brackets.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving rows by loc indexer
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Indexing and Selecting Data with Pandas
Selecting multiple rows
In order to select multiple rows, we put all the row labels in a list and pass that list to .loc[].
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple rows by loc indexer
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)
Indexing and Selecting Data with Pandas
Selecting two rows and three columns
In order to select two rows and three columns, we put the two row labels in one list and the three column labels in a separate list, like this:
Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving two rows and three columns by loc indexer
first = data.loc[["Avery Bradley", "R.J. Hunter"],
                 ["Team", "Number", "Position"]]
print(first)
Indexing and Selecting Data with Pandas
Selecting all of the rows and some columns
• In order to select all of the rows and some columns, we use a single colon (:) to select all of the rows and a list of the columns we want, like this:
Dataframe.loc[:, ["column1", "column2", "column3"]]
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving all rows and some columns by loc indexer
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)
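The four .loc patterns above can be compared side by side on a small frame that does not need nba.csv. The sketch below uses made-up player names and values; only the column names come from the slides.

```python
import pandas as pd

# Hypothetical rows standing in for nba.csv.
players = pd.DataFrame(
    {
        "Team": ["Red", "Red", "Blue"],
        "Number": [4, 11, 23],
        "Position": ["PG", "SG", "C"],
    },
    index=pd.Index(["Player A", "Player B", "Player C"], name="Name"),
)

one_row = players.loc["Player A"]                  # single label gives a Series
two_rows = players.loc[["Player A", "Player C"]]   # list of labels gives a DataFrame
block = players.loc[["Player A", "Player C"], ["Team", "Position"]]
all_rows = players.loc[:, ["Team", "Number"]]      # ':' keeps every row

print(one_row["Team"])
print(two_rows.shape)
print(block.shape)
print(all_rows.shape)
```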
Indexing and Selecting Data with Pandas
• Indexing a DataFrame using .iloc[ ]:
This indexer allows us to retrieve rows and columns by position. To do that, we specify the integer positions of the rows and of the columns that we want. The df.iloc indexer is very similar to df.loc but uses only integer locations to make its selections.
• Selecting a single row
In order to select a single row using .iloc[], we pass a single integer to it. Positions start at 0, so position 3 is the fourth row.
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving the fourth row (position 3) by iloc indexer
row4 = data.iloc[3]
print(row4)
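One detail worth seeing in code is how position-based slices differ from label-based ones. The sketch below uses an invented three-row frame (names and values are assumptions) to contrast .iloc with .loc.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the slides.
players = pd.DataFrame(
    {
        "Team": ["Red", "Red", "Blue"],
        "Number": [4, 11, 23],
        "Position": ["PG", "SG", "C"],
    },
    index=pd.Index(["Player A", "Player B", "Player C"], name="Name"),
)

first_row = players.iloc[0]        # position 0 is the first row
picked = players.iloc[[0, 2]]      # rows at positions 0 and 2
corner = players.iloc[0:2, 0:2]    # position slices exclude the end position

# Label slices with .loc include the end label.
by_label = players.loc["Player A":"Player B"]

print(first_row.name)
print(list(picked.index))
print(corner.shape)
print(by_label.shape)
```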
Indexing and Selecting Data with Pandas
• Indexing using Dataframe.ix[ ]:

Early in the development of pandas, there existed another indexer, ix. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused a lot of confusion because it was not explicit: integers can also be labels for rows or columns, so there were instances where a selection was ambiguous. Generally, ix is label based and acts just like the .loc indexer. However, .ix also supports integer-type selections (as in .iloc) when passed an integer; this only works where the index of the DataFrame is not integer based. .ix will accept any of the inputs of .loc and .iloc. Because of this ambiguity, .ix was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.
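In current pandas versions, a mixed "row by position, column by label" selection of the kind .ix allowed has to be written explicitly. The sketch below shows two possible ways to do this with .loc and .iloc; the frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
players = pd.DataFrame(
    {"Team": ["Red", "Red", "Blue"], "Age": [25, 22, 30]},
    index=["Player A", "Player B", "Player C"],
)

# Option 1: turn the row position into a label, then use .loc.
second_label = players.index[1]
age_by_label = players.loc[second_label, "Age"]

# Option 2: turn the column label into a position, then use .iloc.
age_position = players.columns.get_loc("Age")
age_by_position = players.iloc[1, age_position]

print(age_by_label, age_by_position)
```

Both forms state clearly whether each axis is addressed by label or by position, which is exactly the ambiguity .ix left open.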
Hierarchical Indexing
• The index is like an address: it is how any data point in a DataFrame or Series can be accessed. Rows and columns both have indexes; the row labels are called the index, and the column labels are simply the column names.
• Hierarchical Indexes
• Hierarchical indexing, also known as multi-indexing, means setting more than one column as the index. In this section, we are going to use the homelessness.csv file.
Hierarchical Indexing
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
Hierarchical Indexing
Columns in the Dataframe:
# using the pandas columns attribute.
col = df.columns
print(col)
Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')
Hierarchical Indexing
• To make a column the index, we use the set_index() function of pandas. If we want to make one column the index, we simply pass the name of the column as a string to set_index(). For multi-indexing or hierarchical indexing, we pass a list of column names to set_index().
• The code below demonstrates hierarchical indexing in pandas:
# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])
# sort_index() returns a new sorted DataFrame, so we assign the result back
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))
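The same steps can be tried without homelessness.csv on a four-row frame. This is a sketch under assumptions: the region, state and individuals values are the four rows that appear in the tuple example later in this section, and only two index levels are used so that one data column remains.

```python
import pandas as pd

# Four rows taken from the tuple example in this section.
df = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [1434, 4131, 7259, 1297],
})

# Two columns become the two levels of a hierarchical index.
df_ind2 = df.set_index(["region", "state"])

# sort_index() returns a new frame; the original stays unsorted.
df_sorted = df_ind2.sort_index()

print(df_ind2.index.names)
print(df_ind2.index.nlevels)
print(df_sorted)
```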
Hierarchical Indexing
• Now the dataframe is using hierarchical indexing, or multi-indexing.
• Note that here we have made 3 columns the index ('region', 'state', 'individuals'). The first index, 'region', is called the level(0) index and sits at the top of the hierarchy of indexes; the next index, 'state', is the level(1) index, which is below the level(0) index, and so on. A hierarchy of indexes is formed, which is why this is called hierarchical indexing.
• Sometimes we need to make a column an index, or to convert an index column back into a normal column. For the latter there is the pandas reset_index() function, which turns the index columns into normal columns; calling it as reset_index(inplace=True) modifies the dataframe in place.
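A short sketch of reset_index() follows, using the four rows from the tuple example in this section as assumed sample data. It shows both resetting every level and resetting only one level by name.

```python
import pandas as pd

# Four sample rows (values appear in the tuple example in this section).
indexed = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [1434, 4131, 7259, 1297],
}).set_index(["region", "state"])

# All index levels go back to ordinary columns; a new frame is returned.
flat = indexed.reset_index()

# Only the 'state' level is moved back; 'region' stays as the index.
partial = indexed.reset_index(level="state")

print(list(flat.columns))
print(partial.index.name, list(partial.columns))
```

Returning a new frame, as done here, is usually easier to follow than inplace=True because the original indexed frame remains available.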
Hierarchical Indexing
Selecting data using the hierarchical index: to select data from the dataframe with the .loc[] indexer, we pass the index labels in a list.
# selecting the 'Pacific' and 'Mountain' regions from the dataframe,
# using the level(0) index or main index.
df_ind3_region = df_ind3.loc[['Pacific', 'Mountain']]
print(df_ind3_region.head(10))
Hierarchical Indexing
• We cannot pass only level(1) labels in a list to .loc[] to get data from the dataframe; doing so gives an error. Level(1) and other inner index labels can only be used this way together with the level(0) or main index, with the help of a list of tuples.
# using only the inner index 'state' for getting data: this gives an error,
# because these labels are looked up in the level(0) index 'region'.
df_ind3_state = df_ind3.loc[['Alaska', 'California', 'Idaho']]
print(df_ind3_state.head(10))
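Although a plain list of inner labels does not work with .loc[], pandas does offer other ways to select on an inner level. The sketch below is an addition that goes beyond the slides: it uses DataFrame.xs() and pd.IndexSlice on a four-row sample frame (values taken from the tuple example in this section).

```python
import pandas as pd

# Four sample rows, indexed by region (level 0) and state (level 1).
df_ind2 = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [1434, 4131, 7259, 1297],
}).set_index(["region", "state"]).sort_index()

# xs() takes a cross-section on a named level.
alaska = df_ind2.xs("Alaska", level="state")

# IndexSlice lets .loc take "all regions, these states".
# Slicing a MultiIndex like this expects a sorted index, hence sort_index() above.
some_states = df_ind2.loc[pd.IndexSlice[:, ["Alaska", "Idaho"]], :]

print(alaska)
print(some_states)
```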
Hierarchical Indexing
• Using inner level indexes with the help of a list of tuples:
• Syntax:
df.loc[[(level(0), level(1), level(2))]]
# selecting data by passing all levels of the index.
df_ind3_region_state = df_ind3.loc[[("Pacific", "Alaska", 1434),
                                    ("Pacific", "Hawaii", 4131),
                                    ("Mountain", "Arizona", 7259),
                                    ("Mountain", "Idaho", 1297)]]
df_ind3_region_state
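The tuple syntax can be tried on a two-level version of the same data, where each tuple names one (region, state) pair. This is a sketch on an assumed four-row sample built from the values in the example above.

```python
import pandas as pd

# Sample rows from the example above, with a two-level index.
df_ind2 = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [1434, 4131, 7259, 1297],
}).set_index(["region", "state"]).sort_index()

# A list of tuples selects several full index entries at once.
picked = df_ind2.loc[[("Pacific", "Alaska"), ("Mountain", "Idaho")]]

# A single tuple plus a column name selects one value.
hawaii_count = df_ind2.loc[("Pacific", "Hawaii"), "individuals"]

print(picked)
print(hawaii_count)
```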
Combine datasets using Pandas merge(), join(), concat() and append()
• In Pandas, for a horizontal combination we have merge() and join(), whereas for a vertical combination we can use concat() and append(). Merge and join perform similar tasks but internally they have some differences, similar to concat and append.
1. merge() is used for combining data on common columns or indices.
import pandas as pd
d1 = {'Id': ['A1', 'A2', 'A3', 'A4', 'A5'],
      'Name': ['Vivek', 'Rahul', 'Gaurav', 'Ankit', 'Vishakha'],
      'Age': [27, 24, 22, 32, 28]}
d2 = {'Id': ['A1', 'A2', 'A3', 'A4'],
      'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
      'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Case 1. merging data on the common column 'Id'
# inner join (the default): only records with matching keys in both data frames
pd.merge(df1, df2)
pd.merge(df1, df2, how='inner')
# left join: matching and non-matching records from the left DF (df1) are present in the result
pd.merge(df1, df2, how='left')
# right join: matching and non-matching records from the right DF (df2) are present in the result
pd.merge(df1, df2, how='right')
# outer join: all matching and non-matching records from both data frames are present in the result
pd.merge(df1, df2, how='outer')
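To see what each join type keeps, the row counts can be compared directly. The sketch below reuses the Id, Name and Address values from the frames above (other columns are left out for brevity) and adds the indicator=True option of pd.merge, which the slides do not use.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": ["A1", "A2", "A3", "A4", "A5"],
                    "Name": ["Vivek", "Rahul", "Gaurav", "Ankit", "Vishakha"]})
df2 = pd.DataFrame({"Id": ["A1", "A2", "A3", "A4"],
                    "Address": ["Delhi", "Gurgaon", "Noida", "Pune"]})

inner = pd.merge(df1, df2, on="Id", how="inner")   # keys present in both
left = pd.merge(df1, df2, on="Id", how="left")     # every row of df1

# indicator=True adds a '_merge' column saying where each row came from.
outer = pd.merge(df1, df2, on="Id", how="outer", indicator=True)

print(len(inner), len(left), len(outer))
print(left[left["Address"].isna()])   # 'A5' has no match in df2
print(outer[["Id", "_merge"]])
```

Passing on="Id" explicitly, rather than relying on the default of using all common columns, makes the join key visible to the reader.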
2. join() is used for combining data on a key column or an index.
import pandas as pd
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': ['B0', 'B1', 'B2']})
Case 1. join on indexes
By default, the pandas join operation is performed on indexes. Both data frames have default index values, so there is no need to specify a join key; the join is implicitly performed on the indexes.
# default nature of pandas join is left outer join
# both frames have a 'key' column, so suffixes are needed to tell them apart
df1.join(df2, lsuffix='_l', rsuffix='_r')

When the index values in the two data frames are different, an inner (equi) join gives an empty result, whereas the default left join still returns the data from the left DF (df1).
Create two data frames with different index values
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']}, index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': ['B0', 'B1', 'B2']}, index=[6, 7, 8])
# df1 is left DF and df2 is right DF
df1.join(df2, lsuffix='_l', rsuffix='_r')
# inner join
df1.join(df2, lsuffix='_l', rsuffix='_r', how='inner')
# outer join
df1.join(df2, lsuffix='_l', rsuffix='_r', how='outer')
Case 2. join on columns
Data frames can be joined on columns as well, but as joins work on indexes, we need to convert the join key into the index and then perform the join; everything else is the same.

df1.set_index('key').join(df2.set_index('key'))
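Both join cases can be checked on the two frames defined above. This is a minimal sketch using the same 'key', 'A' and 'B' values; the variable names by_index and by_key are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["K0", "K1", "K5", "K3", "K4", "K2"],
                    "A": ["A0", "A1", "A5", "A3", "A4", "A2"]})
df2 = pd.DataFrame({"key": ["K0", "K1", "K2"], "B": ["B0", "B1", "B2"]})

# Case 1: join on the default integer indexes; suffixes separate the two 'key' columns.
by_index = df1.join(df2, lsuffix="_l", rsuffix="_r")

# Case 2: make 'key' the index on both sides, then join (left join by default).
by_key = df1.set_index("key").join(df2.set_index("key"))

print(list(by_index.columns))
print(by_key)
```

In the second result, keys K3, K4 and K5 have no partner in df2, so their 'B' values are missing.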
3. concat() is used for combining Data Frames across rows or columns.
Case 1. concat data frames on axis=0, default operation
import pandas as pd
m1 = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78]}, index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]}, index=[4, 5, 6, 7, 8])
pd.concat([m1, m2])
# same operation, but discard the original index labels and renumber the rows
pd.concat([m1, m2], ignore_index=True)
Case 2. concat operation on axis=1, horizontal operation
pd.concat([m1, m2], axis=1)
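The effect of the index on each concat variant is easier to see on shortened frames. The sketch below keeps three rows of each frame from above and uses overlapping index labels (an assumption made so that the overlap is visible with fewer rows).

```python
import pandas as pd

m1 = pd.DataFrame({"Name": ["Alex", "Amy", "Allen"],
                   "Marks_scored": [98, 90, 87]}, index=[1, 2, 3])
m2 = pd.DataFrame({"Name": ["Billy", "Brian", "Bran"],
                   "Marks_scored": [89, 80, 79]}, index=[3, 4, 5])

stacked = pd.concat([m1, m2])                         # rows stacked, labels kept
renumbered = pd.concat([m1, m2], ignore_index=True)   # labels replaced by 0..n-1
side_by_side = pd.concat([m1, m2], axis=1)            # aligned on index labels

print(stacked.index.is_unique)    # label 3 appears twice
print(list(renumbered.index))
print(side_by_side.shape)
```

With axis=1, rows are matched by index label, so labels present in only one frame produce missing values in the other frame's columns.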
4. append() combines data frames vertically.
Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions, use pd.concat() instead.
Case 1. appending data frames, duplicate index issue
m1 = pd.DataFrame({'Name': ['Vivek', 'Vishakha', 'Ash', 'Natalie', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78],
                   'Rank': [1, 3, 6, 20, 13]}, index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Barak', 'Wayne', 'Saurav', 'Yuvraj', 'Suresh'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]}, index=[1, 2, 3, 4, 5])
# works only in pandas versions before 2.0; the result repeats index labels 1 to 5
m1.append(m2)
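Since append() is no longer available in pandas 2.0 and later, the same vertical combination can be sketched with pd.concat(). The example below uses the first two rows of each frame above; the verify_integrity=True option is an addition not shown in the slides, used here to surface the duplicate index issue.

```python
import pandas as pd

m1 = pd.DataFrame({"Name": ["Vivek", "Vishakha"],
                   "Marks_scored": [98, 90],
                   "Rank": [1, 3]}, index=[1, 2])
m2 = pd.DataFrame({"Name": ["Barak", "Wayne"],
                   "Marks_scored": [89, 80]}, index=[1, 2])

# Equivalent of m1.append(m2): rows stacked, index labels repeated.
combined = pd.concat([m1, m2])
print(combined.index.is_unique)         # labels 1 and 2 each appear twice
print(combined["Rank"].isna().sum())    # m2 has no 'Rank' column

# verify_integrity=True refuses to build a result with duplicate labels.
try:
    pd.concat([m1, m2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)
```

If the original labels carry no meaning, ignore_index=True avoids the duplicates by renumbering the rows.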
Aggregation and grouping
• Grouping and aggregating make data analysis easier through a variety of functions. These methods help us to group and summarize our data, which makes complex analysis comparatively easy.
• Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical operation on our dataset and return a summary of the result. Aggregation can be used to get a summary of the columns in our dataset, such as the sum, minimum, or maximum of a particular column. The function used for aggregation is agg(); its parameter is the function (or list of functions) we want to apply.
• Some functions used in aggregation are:
Function      Description
sum()         Compute sum of column values
min()         Compute min of column values
max()         Compute max of column values
mean()        Compute mean of column
size()        Compute column sizes
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
count()       Compute count of column values
std()         Standard deviation of column
var()         Compute variance of column
sem()         Standard error of the mean of column

df.sum()

df.agg(['sum', 'min', 'max'])
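The two calls above can be tried on a small frame of marks. This is a sketch: the frame is hypothetical (the slides do not show the data behind df), with 'Maths' and 'Science' columns chosen to match the grouping example.

```python
import pandas as pd

# Hypothetical marks; the slides do not show the actual data.
df = pd.DataFrame({"Maths": [9, 8, 7, 8], "Science": [6, 9, 7, 6]})

totals = df.sum()                         # one value per column
summary = df.agg(["sum", "min", "max"])   # one row per function

# A dict applies a different function to each column.
per_column = df.agg({"Maths": "mean", "Science": "max"})

print(totals)
print(summary)
print(per_column)
```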


Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows the split-apply-combine strategy:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Below, the groupby() function is applied to group the data on the "Maths" value, and then on both "Maths" and "Science". To view the first entry of each group, use the first() function.
a = df.groupby('Maths')
a.first()
b = df.groupby(['Maths', 'Science'])
b.first()
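The split-apply-combine steps can be followed on a small example. The sketch below assumes a hypothetical marks frame with 'Name', 'Maths', 'Science' and 'English' columns; none of these values come from the slides.

```python
import pandas as pd

# Hypothetical marks for four students.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Maths": [9, 8, 9, 8],
    "Science": [6, 7, 6, 5],
    "English": [5, 6, 7, 8],
})

# Split on 'Maths', then take the first row of each group.
firsts = df.groupby("Maths").first()

# Split on 'Maths', apply mean() to 'English', combine into a Series.
mean_english = df.groupby("Maths")["English"].mean()

# Grouping on two columns gives a hierarchical index on the result.
sizes = df.groupby(["Maths", "Science"]).size()

print(firsts)
print(mean_english)
print(sizes)
```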
Vectorized String Operations
Introducing Pandas String Operations
• We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
Output:
array([ 4,  6, 10, 14, 22, 26])
• This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.
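For arrays of strings, NumPy does not provide such simple access, so we fall back on a more verbose loop syntax.
Eg1:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Output:
['Peter', 'Paul', 'Mary', 'Guido']
Eg2:
import pandas as pd
# the same data with a missing value added at position 2
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names
Output:
0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object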
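The reason for moving the data into a Series is that the .str accessor applies string methods element by element and skips missing values, whereas the list comprehension from Eg1 fails on them. A minimal sketch of both behaviours:

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The plain loop breaks on the missing value: None has no capitalize().
try:
    [s.capitalize() for s in data]
    loop_failed = False
except AttributeError:
    loop_failed = True

# The vectorized string method passes the missing value through untouched.
names = pd.Series(data)
capitalized = names.str.capitalize()

print(loop_failed)
print(capitalized)
```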
Tables of Pandas String Methods
If you have a good understanding of string manipulation in
Python, most of Pandas string syntax is intuitive enough
that it's probably sufficient to just list a table of available
methods; we will start with that here, before diving deeper
into a few of the subtleties. The examples in this section
use the following series of names:
monte = pd.Series(['Graham Chapman', 'John Cleese',
'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
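A few of the string methods can be tried on this series straight away. The selection below (lower, len, startswith, split, get) is only a sample of what the .str accessor offers.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese',
                   'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])

lowered = monte.str.lower()               # same as Python's str.lower(), per element
lengths = monte.str.len()                 # number of characters in each name
starts_with_t = monte.str.startswith('T') # boolean mask
last_names = monte.str.split().str.get(-1)  # last word of each name

print(lowered)
print(lengths)
print(monte[starts_with_t])
print(last_names)
```

The boolean result of startswith() can be used directly as a mask, as in monte[starts_with_t], to keep only the matching names.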
