ANL252 SU4 Jul2022
Data Management
Import Data
pandas Package
• “pandas” is the most common package for data management in Python.
• First, install pandas using pip, then import it in our program:
import pandas as pd
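• The pip installation happens on the command line (outside Python):
pip install pandas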
Import Data
• We need Python-compatible datasets to work with pandas.
• Load a dataset into Python and open it as a pandas object.
• Convert .csv data files into a pandas DataFrame with the read_csv() function.
DataFrame_name = pd.read_csv("csv_file_name.csv")
• read_csv() is a reader that converts data files of this specific format into a
pandas DataFrame.
• pandas also provides readers to import data files from other sources such
as Excel, SPSS, Stata, etc.
pandas Readers
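• A few common readers, sketched below (some require optional packages such as openpyxl for Excel or pyreadstat for SPSS):
DataFrame_name = pd.read_excel("excel_file_name.xlsx") # Excel workbooks
DataFrame_name = pd.read_stata("stata_file_name.dta") # Stata data files
DataFrame_name = pd.read_spss("spss_file_name.sav") # SPSS data files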
Display pandas DataFrames
• We can use the print() function to display the whole DataFrame.
print(DataFrame_name)
• Alternatively, the display() function (available in Jupyter environments) achieves the same result.
display(DataFrame_name)
• Another possibility is to display a DataFrame by typing its name alone.
DataFrame_name
• Use the .head() method to display the first five rows of a DataFrame.
DataFrame_name.head()
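• A minimal sketch (the demo DataFrame and its values are hypothetical):
import pandas as pd
df = pd.DataFrame({"x": range(10)}) # small demo DataFrame with 10 rows
df.head() # displays rows 0 to 4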
Data Selection
Select Columns by Variables
• Create a list of variable names to select specific columns of a DataFrame.
• The variable names must be put within a pair of quotation marks.
DataFrame_name[["var_name1", "var_name2", …]]
• To access one column, put the variable name as a string inside the index
operator directly.
Example of column selection
• Suppose we have a dataset on fruits, prices, and country of origin
➢ imported as a pandas DataFrame and named Imports
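• A minimal reconstruction of such a DataFrame (the variable names Fruits, Prices, and Country and all values are assumptions for illustration):
import pandas as pd
Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Durian"],
    "Prices": [3.5, 1.2, 15.0],
    "Country": ["USA", "Philippines", "Malaysia"]
})
Imports[["Fruits", "Prices"]] # select two columns
Imports["Country"] # select a single column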
Select Rows by Positions
• We cannot refer to natural “observation names” to select rows.
• pandas provides a row index to every row.
• It starts with 0 and ends with the number of rows minus one.
• Rows can be queried by the numeric index position using the DataFrame
attribute iloc.
DataFrame_name.iloc[start:end]
• The indices must be integers, but they do not need to be consecutive.
• If we select multiple rows, the indices must be put in a list first.
• If we select one row, the index can be put in the index operator directly.
Example of row selection
• First row and third row of Imports, keeping all columns
➢ [0, 2] indicates the first and third rows of Imports
➢ The ":" after the comma indicates all columns
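• A sketch with the hypothetical Imports DataFrame from above:
Imports.iloc[[0, 2], :] # first and third rows, all columns
Imports.iloc[0:2] # rows 0 and 1 (the end position is exclusive)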
Select Rows by Indices
• Another way to select rows is to use row indices.
• Create row index labels by the method .set_index().
DataFrame_name.set_index(key, inplace = True)
• This method converts the values of a variable to row index labels.
• The rows can be queried by the row index labels using the .loc attribute.
DataFrame_name.loc[["row_label1", "row_label2", …]]
• Put row labels as strings in a list for selecting multiple rows.
• To select rows of a single label, put the label in .loc directly.
Example of using row indices
• Use the values of one variable (e.g., the fruit names) as the key for Imports
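• A sketch, assuming the fruit names serve as the key:
Imports.set_index("Fruits", inplace = True) # fruit names become the row labels
Imports.loc[["Apple", "Durian"]] # multiple rows by label
Imports.loc["Apple"] # a single row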
Select Cells by Positions and Indices
• Specify columns and rows by the .loc and/or .iloc attributes to select
cells from a DataFrame.
• Use one of the following syntaxes to select cells from a DataFrame:
DF_name.iloc[row_start:row_end, col_start:col_end]
DF_name.loc[["row_labels"], ["col_labels"]]
DF_name[["col_labels"]].iloc[row_start:row_end]
DF_name.loc[["row_labels"]].iloc[0:, col_start:col_end]
Select Cells by Boolean Masking
• The elements of a Boolean mask array are either True or False.
• The Boolean mask array is overlaid on top of the queried DataFrame.
• Elements aligned with True are selected.
DataFrame_name[Condition]
• More complex queries with several conditions are connected by bitwise
logical operators.
DataFrame_name[(Condition1) &/| (Condition2) &/| …]
• If there are two conditions, two Boolean masks will be compared
elementwise by the bitwise operator.
• Each condition needs to be encased in parentheses.
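• A sketch with the hypothetical Imports DataFrame:
Imports[Imports["Prices"] > 2] # rows where the price exceeds 2
Imports[(Imports["Prices"] > 2) & (Imports["Country"] == "USA")] # both conditions must hold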
First Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_model.csv” into Python as a pandas DataFrame.
The data file contains 4 variables: Year, Make, Model, and Category.
• Print the DataFrame and check its dimensions.
Discussion
• What are the main differences between a dataset (or pandas DataFrames)
and an array (or NumPy array)?
• How different are the outputs of a DataFrame resulting from the different
printing functions in Python? Which one is most/least preferable when
using JupyterLab as our programming environment?
Second Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Select the column (variable) “Make” from the DataFrame.
• Select all entries with the value “Audi” in the variable “Make” from the
DataFrame.
• Select the last 10 observations and only the variable “Year” from the
DataFrame.
Discussion
• Name situations where we would need to select data from a DataFrame.
• How different are the syntaxes between row and column selections?
Merge DataFrames
(This section on merging DataFrames is optional)
Append DataFrames by Rows
• Concatenate two DataFrames with identical variables by rows:
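• A sketch with two hypothetical DataFrames that share the same variables:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
pd.concat([df1, df2], axis = 0) # df2 is appended below df1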
Merge DataFrames by Columns
Merge two DataFrames with identical observations by columns:
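• A sketch with two hypothetical DataFrames that describe the same observations:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6], "D": [7, 8]})
pd.concat([df1, df2], axis = 1) # df2 is placed beside df1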
Outer Join DataFrames with Some Common Variables
Outer join two DataFrames with some common variables:
Inner Join DataFrames with Some Common Variables
Inner join two DataFrames with some common variables:
Outer Join DataFrames with Some Common Observations
Outer join two DataFrames with some common observations:
Inner Join DataFrames with Some Common Observations
Inner join two DataFrames with some common observations:
Outer Join DataFrames with Different Shapes
Outer join two DataFrames with different shapes:
Inner Join DataFrames with Different Shapes
Inner join two DataFrames with different shapes:
Concatenate DataFrames
• Use concat() to merge multiple DataFrames with different shapes.
finalDF_name = pd.concat(objs, axis, join)
• The names of the DataFrames to be concatenated must be put in a list.
• If axis = 0, the DataFrames will be concatenated below one another,
and the concatenation will take place beside one another if axis = 1.
• The join parameter controls the type of concatenation. The possible
values here are "outer" and "inner", written as string.
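• A sketch contrasting the two join types for hypothetical DataFrames of different shapes:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
pd.concat([df1, df2], axis = 0, join = "outer") # all columns kept; gaps filled with NaN
pd.concat([df1, df2], axis = 0, join = "inner") # only the common column B is kept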
Third Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the usual data file “car_model.csv” into Python.
• Then read in “car_price_model.csv” into Python. The data file contains 4
variables: Year, Make, Model, and Price (in USD).
• Inner join the “car_model” and “car_price_model” data by their “Year”,
“Make” and “Model” variables.
• Print the merged DataFrame.
Missing Data and Outliers
Missing Data
• In empirical studies, an observed value of a variable could be missing.
• Reasons for missing data: defective measurement tools, withdrawal from
the study, refusal of responses to sensitive questions, etc.
• In Python, pandas indicates missing data with the special floating-point
value NaN (“Not a Number”), which comes from NumPy.
• Missing data cannot be included in constructing models, forecasting, etc.
• In pandas, missing values are ignored by the statistical functions.
• Since the underlying sample size of each variable can vary in the
computation due to missing data, the statistical estimates can be biased.
Identify Missing Data
• pandas’ readers such as read_csv() have two parameters, na_filter
and na_values, to convert certain strings to missing values directly.
DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string", na_filter = True/False)
• If na_filter is True, pandas converts empty fields ("") to NaN.
• With na_values, we can declare certain strings from our DataFrame to
be recognised as missing values (a sketch follows the list below).
• The following strings are treated as missing values by default and do not
require explicit declaration:
"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"
Locate Missing Data (I)
• It is often not easy to locate the missing data in a dataset.
• Count the NaNs in each row and each column to find the missing data.
• Missing values exist in a variable if the number of NaNs is larger than zero.
• The same approach applies to rows.
DataFrame_name.isnull().sum(axis = 0/1)
• The method .isnull() checks every cell of a DataFrame and returns
True if the cell contains a missing value and False otherwise.
• The method .sum() adds up all the “True” (equal to 1) in each row or
each column. It is equivalent to counting the missing values.
• If axis = 0, the values in a column will be added up together. If axis =
1, it returns the sum of a row.
Example of Locating Missing Data (I)
• Suppose our Imports dataset is corrupted; name it C_Imports
➢ Observe that the price of Apple is missing (NaN)
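• A minimal reconstruction of the corrupted dataset (all values are assumed for illustration):
import pandas as pd
import numpy as np
C_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Durian"],
    "Prices": [np.nan, 1.2, 15.0], # the price of Apple is missing
    "Country": ["USA", "Philippines", "Malaysia"]
})
C_Imports.isnull().sum(axis = 0) # the Prices column shows a count of 1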
Locate Missing Data (II)
• We can use .any() just to check whether missing data exist.
• It returns True if any element in the attached array is True.
DataFrame_name.isnull().any(axis = 0/1)
• Retrieve the row/column indices with missing data by .index.
object_name = DataFrame_name.isnull().any()
object_name[object_name == True].index
Example of Locating Missing Data (II)
• Check for missing entries in C_Imports. Recall that the price of Apple is missing
➢ We see a True for “Prices”
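• A sketch with the hypothetical C_Imports from above:
check = C_Imports.isnull().any(axis = 0) # True for Prices, False elsewhere
check[check == True].index # Index containing the label "Prices"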
Delete Missing Data
• The usual ways to treat missing data are:
➢ delete the entire observations
➢ replace them
➢ ignore them
• With .drop(), we specify the row indices with missing values that should
be removed.
DF_name.drop(axis = 0, index = [index1, index2, …])
• The .dropna() method combines the localisation and removal of rows
or columns with missing data in a single function.
DF_name.dropna(axis = 0, how = "any"/"all")
• All missing values are treated equally and deleted by .dropna().
Example of Deleting Missing Data
• Recall that Prices for Apple is missing in C_Imports.
• We set how = "any" to drop the row for Apple with NaN for Prices.
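• A sketch with the hypothetical C_Imports from above:
C_Imports.dropna(axis = 0, how = "any") # removes the Apple row
C_Imports.drop(axis = 0, index = [0]) # equivalent here, using Apple's row index 0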
Replace Missing Data
• We can also replace missing values by a pre-defined value.
• The most common replacement values are 0 or the variable's mean.
• pandas facilitates replacement of missing values by .fillna().
DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)
• The .fillna() method replaces all missing values with the value
specified in the parameter.
• We can also specify a column in the DataFrame where the missing values
should be replaced by the .fillna() method.
Example of Replacement
• Recall that Prices for Apple is missing in C_Imports (i.e., NaN)
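• A sketch with the hypothetical C_Imports from above:
C_Imports.fillna(value = 0) # replace every NaN with 0
mean_price = C_Imports["Prices"].mean() # NaN is ignored when computing the mean
C_Imports["Prices"].fillna(value = mean_price) # replace only in the Prices column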
Outliers
• Outliers can cause bias in the statistical estimation.
• Draw a histogram/boxplot or use the interquartile range (IQR) to detect
outliers in a variable.
• Use .quantile() to determine the 1st and 3rd quartiles of a variable.
DataFrame_name["column_label"].quantile(q = quantile)
• An observation y is considered an outlier if:
y < q1 - 1.5 * iqr or y > q3 + 1.5 * iqr
where q1 is the 1st quartile, q3 the 3rd quartile, and iqr = q3 - q1
is the interquartile range.
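• A sketch computing the quartiles of the hypothetical Prices variable:
q1 = Imports["Prices"].quantile(q = 0.25) # 1st quartile
q3 = Imports["Prices"].quantile(q = 0.75) # 3rd quartile
iqr = q3 - q1 # interquartile range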
Detect and Remove Outliers
• Usually, outliers should be removed from the dataset.
• In Python, it suffices to select observations with no outliers in the target
variable.
• The following syntax generates a subset of rows that do not fulfil the
outlier condition.
DF[~((DF["Col"] < q1 - 1.5*iqr) | (DF["Col"] > q3 + 1.5*iqr))]
Example of Outliers (I)
• Suppose our fruit imports dataset, O_Imports, now has 1 outlier
➢ A durian with a price of 1,000, from Ukraine; this looks suspicious
Example of Outliers (II)
• Define the interquartile range, iqr, as q3 - q1
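• A minimal reconstruction of O_Imports and the outlier filter (all values are assumed):
import pandas as pd
O_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Cherry", "Mango", "Orange", "Durian"],
    "Prices": [3.5, 1.2, 2.0, 4.0, 2.5, 1000.0], # the durian price is suspicious
    "Country": ["USA", "Philippines", "Chile", "India", "Spain", "Ukraine"]
})
q1 = O_Imports["Prices"].quantile(q = 0.25)
q3 = O_Imports["Prices"].quantile(q = 0.75)
iqr = q3 - q1
O_Imports[~((O_Imports["Prices"] < q1 - 1.5*iqr) | (O_Imports["Prices"] > q3 + 1.5*iqr))] # keeps every row except Durian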
Fourth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Remove all the rows with missing data in “Price” from the merged
DataFrame created in the previous activity.
• Detect car models that can be considered outliers in terms of their selling
price.
• Discuss whether the models with extraordinarily high/low selling prices
should be excluded from the DataFrame.
Data Modification
Sort Data
• Sometimes, we may want to sort the data according to the values of some
variables for better understanding.
• The method .sort_values() rearranges the order of the rows in a
DataFrame.
DataFrame_name.sort_values(
by = [List_of_var_names], ascending)
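• A sketch with the hypothetical Imports DataFrame:
Imports.sort_values(by = ["Country", "Prices"], ascending = [True, False]) # Country ascending, Prices descending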
Discretisation
• Through discretisation, variables become easier to understand or
compatible with specific analytics models such as decision trees.
• Use the cut() function to discretise continuous variables.
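• A sketch, assuming three equal-width price bands with hypothetical labels:
Imports["Price_band"] = pd.cut(Imports["Prices"], bins = 3, labels = ["low", "mid", "high"])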
Grouping Data
• Analysing aggregated statistics of some variables is often of interest in
data analytics.
• Aggregated statistics are calculated based on grouped data.
• A DataFrame can be grouped by one or more variables using the
.groupby() method.
DataFrame_name.groupby(by = [List_of_Labels]).anymethod()
• Any method that calculates the aggregated statistics can be attached to
the .groupby() method.
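• A sketch computing the mean price per country of origin for the hypothetical Imports data:
Imports.groupby(by = ["Country"])["Prices"].mean()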
Transformation
• Sometimes, we may need to transform the values of a variable.
• For instance, log-transformation can help to stabilise the variance of a
variable.
• In Python, there are various functions to transform variables.
• For example, the log-transformation of a numeric variable can be carried
out by the log() function from the NumPy package.
DF_name["new_var"] = np.log(DF_name["var_name"])
Standardisation
• Sometimes, variables included in clustering could be measured on different
scales and thus do not contribute equally to the analysis.
• Hence, we need to standardise or normalise them.
• The standardisation function can be found in the “scikit-learn” package.
• To standardise a variable in the traditional way, we need to find the mean
and the standard deviation of the variable first.
var_mean = np.mean(DF["var_name"])
var_std = np.std(DF["var_name"])
DF["std_var"] = (DF["var_name"] – var_mean) / var_std
Normalisation
• Normalisation is another transformation method to scale down a variable.
• The values of a normalised variable can only be in the interval [0, 1].
• The normalisation function can also be found in the “scikit-learn” package.
• To normalise a variable in the traditional way, we need to find the minimum
and maximum of the variable first.
var_min = np.min(DF["var_name"])
var_max = np.max(DF["var_name"])
DF["norm_var"] = (DF["var_name"] – var_min)
/ (var_max – var_min)
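• Similarly, a sketch using scikit-learn's MinMaxScaler on the hypothetical Imports data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
Imports["norm_price"] = scaler.fit_transform(Imports[["Prices"]]).flatten() # values scaled into [0, 1]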
Fifth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_price_model.csv” into Python as a pandas
DataFrame. The data file contains 4 variables: Year, Make, Model, and Price.
• Print the DataFrame and check its dimensions.
• Drop the rows where at least one element is missing.
• Since the prices stored in the “Price” variable are in US dollars, add a new
column named “SGD Price” that is 1.5 times the “Price”.
• Calculate the mean price (in SGD) for each car Make.
Discussion
• What happens to the bin edges in the discretisation process using the cut()
function?
• When is log-transformation/normalisation/standardisation necessary? What
is the main difference between normalisation and standardisation?