Eda Unit 2
Early in the development of pandas, there existed another indexer, .ix. This
indexer could select both by label and by integer location. While it was
versatile, it caused a lot of confusion because it was not explicit: integers
can also be labels for rows or columns, so some selections were ambiguous.
Generally, .ix was label based and acted just like the .loc indexer. However,
.ix also supported integer-position selection (as in .iloc) when passed an
integer, but only when the index of the DataFrame was not integer based. In
that case .ix accepted any of the inputs of .loc and .iloc. Because of this
ambiguity, .ix was deprecated in pandas 0.20 and removed in pandas 1.0; use
.loc and .iloc instead.
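The ambiguity described above can be seen with a small sketch. The Series below is a hypothetical example (not from the original notes) whose integer labels deliberately differ from the row positions, so that .loc and .iloc return different values for the same integer:

```python
import pandas as pd

# Hypothetical Series: the labels are integers that do not match the positions.
s = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the row at position 0 (the first row)

print(by_label, by_position)
```

With an integer index like this one, a single indexer that accepts both meanings cannot tell which one the caller intended, which is why the two explicit indexers are preferred.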
Hierarchical Indexing
The index is like an address: it is how any data point in a DataFrame or
Series can be accessed. Both rows and columns have indexes. The row index is
simply called the index, and the column index is usually referred to as the
column names.
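As a minimal sketch of this idea, the frame below uses made-up labels and values (they are illustrative assumptions, not data from homelessness.csv) to show the two indexes and how a single cell is addressed by a row label and a column name:

```python
import pandas as pd

# Hypothetical two-row frame; labels and values are for illustration only.
df_small = pd.DataFrame(
    {'state': ['Alabama', 'Alaska'], 'individuals': [10, 20]},
    index=['r1', 'r2'],
)

print(df_small.index)    # the row index
print(df_small.columns)  # the column index (column names)

# A row label plus a column name together address one cell.
value = df_small.loc['r2', 'individuals']
print(value)
```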
Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more
than one column as the index. In this article, we are going to use the
homelessness.csv file.
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
Columns in the DataFrame:
# using the pandas columns attribute.
col = df.columns
print(col)
Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')
To make a column the index, we use the set_index() function of pandas. If
we want to make one column the index, we can simply pass the name of the
column as a string to set_index(). If we want multi-indexing (hierarchical
indexing), we pass a list of column names to set_index().
The code below demonstrates hierarchical indexing in pandas:
# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])
# sort_index() returns a new, sorted DataFrame, so assign the result
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))
Now the dataframe is using Hierarchical Indexing or multi-indexing.
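Once a hierarchical index is in place, rows can be selected level by level with .loc. The sketch below uses a small stand-in frame with made-up values, since the contents of homelessness.csv are not reproduced here; only the column names region, state, individuals and family_members are taken from the output shown earlier:

```python
import pandas as pd

# Hypothetical stand-in rows; the values are illustrative, not real data.
toy = pd.DataFrame({
    'region': ['Pacific', 'Pacific', 'Mountain'],
    'state': ['Alaska', 'California', 'Arizona'],
    'individuals': [10, 30, 20],
    'family_members': [1, 3, 2],
})

# Two-level index: region is the outer level, state the inner level.
toy_ind = toy.set_index(['region', 'state']).sort_index()

# Passing only an outer-level label selects every row under it.
pacific = toy_ind.loc['Pacific']

# Passing a tuple with one label per level selects a single row.
arizona = toy_ind.loc[('Mountain', 'Arizona')]

print(toy_ind.index.nlevels)
print(pacific)
print(arizona)
```

Sorting the index first, as done here, is a common precaution because label-based slicing on a MultiIndex generally expects a sorted index.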
# outer join
df1.join(df2, lsuffix='_l', rsuffix='_r', how='outer')
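To make the effect of how='outer' and the two suffixes concrete, here is a small sketch. The frames left and right are hypothetical stand-ins for df1 and df2, which are not defined in this part of the notes:

```python
import pandas as pd

# Hypothetical frames that share a column name and overlap on one index label.
left = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'value': [3, 4]}, index=['b', 'c'])

# Outer join keeps the union of both indexes; the suffixes disambiguate
# the two columns that are both called 'value'.
joined = left.join(right, lsuffix='_l', rsuffix='_r', how='outer')
print(joined)
```

Labels that exist in only one frame are kept, and the cells that have no match on the other side are filled with NaN.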
Case 2. join on columns
Data frames can be joined on columns as well, but because join() works on
indexes, we first need to convert the join key into the index and then
perform the join. Everything else is the same.
df1.set_index('key1').join(df2.set_index('key2'))
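The pattern above can be tried on two small frames. The names df_a and df_b and their values are assumptions made for this sketch; only the key column names key1 and key2 come from the line above:

```python
import pandas as pd

# Hypothetical frames whose join keys live in ordinary columns.
df_a = pd.DataFrame({'key1': ['x', 'y', 'z'], 'a': [1, 2, 3]})
df_b = pd.DataFrame({'key2': ['y', 'z', 'w'], 'b': [20, 30, 40]})

# Move each key into the index, then join (a left join by default).
joined_cols = df_a.set_index('key1').join(df_b.set_index('key2'))
print(joined_cols)
```

Because the default is a left join, every key from df_a is kept and the key 'w', which appears only in df_b, is dropped. pd.merge(df_a, df_b, left_on='key1', right_on='key2') is another way to join on columns without touching the index.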
3. concat() is used for combining Data Frames across rows or columns.
Case 1. concat data frames on axis=0, default operation
import pandas as pd
m1 = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78]}, index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]}, index=[4, 5, 6, 7, 8])
pd.concat([m1, m2])
Case 1 (continued). With ignore_index=True the original index labels are
discarded and the rows are renumbered from 0:
pd.concat([m1,m2],ignore_index=True)
Case 2. concat operation on axis=1, horizontal operation
pd.concat([m1,m2],axis=1)
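The three concat() variants can be compared side by side on a pair of tiny frames. The frames top and bottom are hypothetical and kept deliberately small so the resulting indexes are easy to check by hand:

```python
import pandas as pd

# Hypothetical frames whose indexes overlap on the label 2.
top = pd.DataFrame({'Name': ['A', 'B']}, index=[1, 2])
bottom = pd.DataFrame({'Name': ['C', 'D']}, index=[2, 3])

# axis=0 (default): rows are stacked and the original labels are kept.
stacked = pd.concat([top, bottom])

# ignore_index=True: rows are stacked and renumbered from 0.
renumbered = pd.concat([top, bottom], ignore_index=True)

# axis=1: frames are placed side by side and aligned on the index;
# labels present in only one frame get NaN in the other frame's column.
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```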
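4. append() combines data frames vertically
Note: DataFrame.append() was deprecated in pandas 1.4 and removed in
pandas 2.0; pd.concat() is its replacement.
Case 1. appending data frames, duplicate index issue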
m1 = pd.DataFrame({'Name': ['Vivek', 'Vishakha', 'Ash', 'Natalie', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78],
                   'Rank': [1, 3, 6, 20, 13]}, index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Barak', 'Wayne', 'Saurav', 'Yuvraj', 'Suresh'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]}, index=[1, 2, 3, 4, 5])
# both frames use the index labels 1-5, so each label appears twice in the result
# (on pandas 2.0 and later, write pd.concat([m1, m2]) instead of m1.append(m2))
m1.append(m2)
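The duplicate index issue, and two possible ways of handling it, can be sketched with pd.concat(), which behaves like append() for this purpose and is available in current pandas versions. The frames a and b are hypothetical and share the same index labels on purpose:

```python
import pandas as pd

# Hypothetical frames that reuse the same index labels.
a = pd.DataFrame({'Name': ['P', 'Q']}, index=[1, 2])
b = pd.DataFrame({'Name': ['R', 'S']}, index=[1, 2])

# Stacking keeps the labels, so 1 and 2 each appear twice.
combined = pd.concat([a, b])
has_duplicates = combined.index.duplicated().any()
rows_for_label_1 = combined.loc[1]  # two rows, not one

# Option 1: discard the old labels and renumber.
clean = pd.concat([a, b], ignore_index=True)

# Option 2: ask pandas to refuse overlapping labels.
try:
    pd.concat([a, b], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(has_duplicates, len(rows_for_label_1), clean.index.is_unique, raised)
```

Which option is appropriate depends on whether the index labels carry meaning: renumbering is fine for throwaway row numbers, while verify_integrity is a guard for cases where duplicates would indicate a data problem.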
Aggregation and grouping
Grouping and aggregating make data analysis easier through a range of
built-in functions. These methods help us group and summarize our data
and make complex analysis comparatively easy.
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and return a summary of the result. Aggregation can be used to
summarize columns in our dataset, for example by getting the sum, minimum, or maximum of a
particular column. The function used for aggregation is agg(); its parameter is the function
(or list of functions) we want to apply.
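A short sketch of agg() follows. The frame marks is a hypothetical example that reuses the Marks_scored and Rank values from the append() example earlier in these notes; agg() is shown once with a list of function names and once with a per-column mapping:

```python
import pandas as pd

# Values borrowed from the earlier m1 example; the frame name is an assumption.
marks = pd.DataFrame({
    'Marks_scored': [98, 90, 87, 69, 78],
    'Rank': [1, 3, 6, 20, 13],
})

# A list of function names applies every function to every column.
summary = marks.agg(['sum', 'min', 'max'])

# A dict applies a different function to each named column.
per_column = marks.agg({'Marks_scored': 'mean', 'Rank': 'max'})

print(summary)
print(per_column)
```

The list form returns a DataFrame with one row per function, while the dict form here returns a Series with one entry per column.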
Some functions used in aggregation are:
Function      Description
sum()         Compute sum of column values
min()         Compute min of column values
max()         Compute max of column values
mean()        Compute mean of column
size()        Compute column sizes
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
count()       Compute count of column values
std()         Standard deviation of column
var()         Compute variance of column
sem()         Standard error of the mean of column
df.sum()
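Several of the functions listed above are most useful after grouping. The following is a minimal sketch with a hypothetical scores frame (the column names follow the earlier examples, the rows are made up) that groups by subject_id and then aggregates each group:

```python
import pandas as pd

# Hypothetical rows: sub2 and sub4 each appear twice, sub1 once.
scores = pd.DataFrame({
    'subject_id': ['sub1', 'sub2', 'sub2', 'sub4', 'sub4'],
    'Marks_scored': [98, 90, 89, 87, 80],
})

# Split the rows by subject_id, then summarize Marks_scored per group.
grouped = scores.groupby('subject_id')['Marks_scored']
stats = grouped.agg(['count', 'mean', 'max'])

# size() counts the rows in each group.
sizes = scores.groupby('subject_id').size()

print(stats)
print(sizes)
```

The result has one row per distinct subject_id, which is the usual pattern for group-wise summaries: split the data by a key, apply an aggregation, and combine the results.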