Pandas: Import
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
Pandas can clean messy data sets and make them readable and relevant.
A simple way to store big data sets is to use CSV (comma-separated values)
files.
import pandas as pd
df = pd.read_csv('data.csv')
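To try read_csv without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV content is made up for illustration and only stands in for data.csv; the column names Duration and Calories are borrowed from later examples in these notes.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = "Duration,Calories\n60,409.1\n45,282.4\n60,\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(df.isna().sum())   # number of empty cells per column
```

The last row has no Calories value, so it is read as an empty cell (NaN).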
Cleaning Data
1. Clean Data
2. Clean Empty Cells
3. Clean Wrong Format
4. Clean Wrong Data
5. Remove Duplicates
Bad data in a data set can be:
Empty cells
Data in wrong format
Wrong data
Duplicates
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
1. Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
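import pandas as pd
df = pd.read_csv('data.csv')
# dropna() returns a new DataFrame and leaves the original unchanged
new_df = df.dropna()
print(new_df.to_string())
# With inplace=True, dropna() removes the rows from the original DataFrame
df.dropna(inplace=True)
print(df.to_string())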
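The difference between the two dropna() calls can be checked on a small frame. This is a minimal sketch with made-up values, not the contents of data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an empty cell in two of the four rows
df = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 300.0, 195.1],
})

new_df = df.dropna()        # returns a new DataFrame; df is unchanged
print(len(df), len(new_df))

df.dropna(inplace=True)     # modifies df itself and returns None
print(len(df))
```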
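Replace Empty Values
Another way of dealing with empty cells is to replace them with a new value
using fillna(). This way you do not have to delete entire rows just because
of some empty cells.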
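Replacing empty cells across the whole DataFrame might look like the sketch below. The data and the placeholder value 130 are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 130 is an arbitrary placeholder value
df = pd.DataFrame({
    "Duration": [60, np.nan, 45],
    "Calories": [409.1, 282.4, np.nan],
})

filled = df.fillna(130)     # every empty cell in every column becomes 130
print(filled)
```

No rows are lost, but the same value lands in every column, which is rarely what you want when the columns measure different things.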
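To replace empty values in only one column, select that column by name
before calling fillna(). A common replacement value is the mean, median, or
mode of the column:
x = df["Calories"].mean()      # average value
x = df["Calories"].median()    # middle value after sorting
x = df["Calories"].mode()[0]   # most frequent value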
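Putting the two ideas together, a single column can be filled with its own mean. This is a minimal sketch on made-up data; assigning the result back to the column is one of several ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell in each column
df = pd.DataFrame({
    "Duration": [60, np.nan, 45],
    "Calories": [400.0, np.nan, 300.0],
})

x = df["Calories"].mean()                   # mean() skips empty cells by default
df["Calories"] = df["Calories"].fillna(x)   # only this column is filled
print(df)
```

The empty cell in Duration is left untouched because fillna() was called on the Calories column only.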
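2. Wrong Format
Cells with data in the wrong format can make it difficult, or even impossible,
to analyze the data. To fix this, you have two options: remove the rows, or
convert all cells in the columns into the same format.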
Removing Rows
df.dropna(subset=['Date'], inplace = True)
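Both options can be combined: convert the column first, then drop the rows that could not be converted. The sketch below assumes a Date column holding text, with made-up values; pd.to_datetime with errors="coerce" is used so that unparseable values become empty (NaT) instead of raising an error.

```python
import pandas as pd

# Hypothetical Date column: two valid dates, one invalid string, one empty cell
df = pd.DataFrame({"Date": ["2020/12/01", "2020/12/02", "not a date", None]})

# errors="coerce" turns values that cannot be parsed into NaT (an empty date)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Rows whose Date could not be converted are now empty and can be dropped
df.dropna(subset=["Date"], inplace=True)
print(df)
```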
3. Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format"; it can
just be wrong, like a duration of "450" instead of "60".
Replacing Values
One way to fix wrong values is to replace them with something else.
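# Small data set: fix a single value, here the Duration in row 7
df.loc[7, 'Duration'] = 120
# Larger data set: cap every Duration above 120 at 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120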
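The same capping rule can be written without an explicit loop by using a Boolean mask. This is a sketch on made-up values, offered as an alternative to the loop rather than a replacement for it.

```python
import pandas as pd

# Hypothetical durations; 450 and 200 are the "wrong" values
df = pd.DataFrame({"Duration": [60, 450, 45, 120, 200]})

# The mask selects the rows to change; one assignment updates them all
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df["Duration"].tolist())
```

On large data sets the mask form is usually much faster than iterating over df.index row by row.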
Removing Rows
Another way of handling wrong data is to remove the rows that contain
wrong data.
# Drop every row whose Duration is above 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
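Rows can also be removed by filtering with a Boolean condition instead of looping. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical durations; 450 and 200 are the "wrong" values
df = pd.DataFrame({"Duration": [60, 450, 45, 120, 200]})

# Keep only the rows whose Duration is at most 120
df = df[df["Duration"] <= 120]
print(df)
```

One difference from the loop is worth knowing: a row with an empty Duration fails the `<= 120` test and is dropped by this filter, whereas the loop keeps it.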
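4. Duplicates
Duplicate rows are rows that appear more than once. The duplicated() method
returns True for every row that repeats an earlier row, and False otherwise.
print(df.duplicated())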
Removing Duplicates
df.drop_duplicates(inplace = True)
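Finding and removing duplicates can be tried on a small frame. The sketch below uses made-up values in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [409.1, 282.4, 409.1, 195.1],
})

flags = df.duplicated()             # True for a row that repeats an earlier row
print(flags.tolist())

df.drop_duplicates(inplace=True)    # keeps the first occurrence of each row
print(len(df))
```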
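Plotting
import matplotlib.pyplot as plt
import pandas as pd
# Line chart (the default kind) of one column against Month
df.plot(y='Tmax', x='Month')
# Two columns drawn as separate lines on the same axes
df.plot(y=['Tmax','Tmin'], x='Month')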
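The plotting calls need a DataFrame with Month, Tmax, and Tmin columns. The sketch below builds one from made-up temperatures. The "Agg" backend is an assumption that lets the code run without a display; leave that line out and call plt.show() to see the chart in a window.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend, no window is opened
import pandas as pd

# Hypothetical monthly temperatures
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Tmax": [7.0, 8.5, 11.2, 14.9],
    "Tmin": [1.1, 1.4, 3.0, 5.2],
})

# df.plot returns a matplotlib Axes that can be customized further
ax = df.plot(y=["Tmax", "Tmin"], x="Month")
ax.set_ylabel("Temperature")
```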
Bar Charts
df.plot.bar(y='Tmax', x='Month')
df.plot.barh(y='Tmax', x='Month')
df.plot.bar(y=['Tmax','Tmin'], x='Month')
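# Fill color for each plotted column, and outline color of the bars
df.plot.bar(y=['Tmax','Tmin'], x='Month', color=['blue', 'red'])
df.plot.bar(y='Tmax', x='Month', edgecolor='blue')
# Axis labels and chart title
df.plot.bar(xlabel='Class')
df.plot.bar(ylabel='Amounts')
df.plot.bar(title='I am title')
# Figure size in inches (width, height)
df.plot.bar(y='Tmax', x='Month', figsize=(8, 6))
Rotate Label
# Rotate the x-axis tick labels by 70 degrees
df.plot.bar(y='Tmax', x='Month', rot=70)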
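The bar chart options can all be passed in a single call. This is a sketch on made-up temperatures; the label and title strings are placeholders, and the "Agg" backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: headless backend, no window is opened
import pandas as pd

# Hypothetical monthly temperatures
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Tmax": [7.0, 8.5, 11.2],
    "Tmin": [1.1, 1.4, 3.0],
})

ax = df.plot.bar(
    y=["Tmax", "Tmin"], x="Month",
    color=["blue", "red"], edgecolor="black",   # fill and outline colors
    xlabel="Month", ylabel="Temperature",       # axis labels
    title="Monthly temperatures",
    figsize=(8, 6), rot=70,                     # size in inches, label rotation
)
```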
Multiple charts
# Kernel density estimate (KDE) plot, one curve per numeric column
df.plot.kde()
Pie
df = pd.DataFrame({'cost': [79, 40, 60]}, index=['Oranges', 'Bananas', 'Apples'])
df.plot.pie(y='cost', figsize=(8, 6))
plt.show()
Python Pandas - Series
A Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index.
pandas.Series(data, index, dtype, copy)
A Series can be created from different kinds of input data:
Array
Dict
Scalar value or constant
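The array case from the list above can be sketched as follows. The array values and the custom labels 100 to 103 are arbitrary.

```python
import numpy as np
import pandas as pd

data = np.array(["a", "b", "c", "d"])

s = pd.Series(data)                                # default index 0, 1, 2, 3
t = pd.Series(data, index=[100, 101, 102, 103])    # custom index labels

print(s)
print(t[101])   # look up a value by its label
```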
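import pandas as pd
# Create an empty Series
s = pd.Series()
print(s)
import pandas as pd
# Create a Series from a dict; the index argument selects and orders the keys
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# 'd' is not a key in data, so its value is NaN
s = pd.Series(data,index=['b','c','d','a'])
print(s)
import pandas as pd
# Create a Series from a scalar; the value is repeated for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)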
DataFrame:
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.
Features of DataFrame
The constructor pandas.DataFrame(data, index, columns, dtype, copy) takes the following
parameters:
data:
Data takes various forms such as ndarray, Series, map, list, dict, constant, or another
DataFrame.
index:
The row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
index is passed.
columns:
The column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
column labels are passed.
dtype:
Data type of each column.
copy:
Whether to copy the data from the input. The default was originally False; recent pandas
versions default to None, which copies dict input but not DataFrame or 2-D ndarray input.
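The effect of the index, columns, and dtype parameters can be seen by building the same data twice. This is a minimal sketch; the labels r1 to r3 and x, y are made up.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns passed: both default to 0, 1, 2, ...
df_default = pd.DataFrame(data)

# Explicit row labels, column labels, and data type
df_labeled = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
)

print(df_default)
print(df_labeled)
```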
Create DataFrame
import pandas as pd
# Create an empty DataFrame
df = pd.DataFrame()
print(df)
import pandas as pd
# Create a DataFrame from a list of rows and name the columns
data = [['SY',10],['TY',12],['FY',13]]
df = pd.DataFrame(data,columns=['CLASS','RN'])
print(df)
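A dict is another of the accepted data forms. The sketch below arranges the same CLASS and RN values by column instead of by row; the row labels rank1 to rank3 are hypothetical.

```python
import pandas as pd

# Each dict key becomes a column; each list holds that column's values
data = {"CLASS": ["SY", "TY", "FY"], "RN": [10, 12, 13]}
df = pd.DataFrame(data, index=["rank1", "rank2", "rank3"])

print(df)
print(df.loc["rank2", "RN"])   # select one cell by row label and column name
```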