ANL252 SU4 Jul2022

The document discusses importing and managing data using the pandas package in Python. It covers: 1) Importing pandas and loading CSV data files into pandas DataFrames using the read_csv() function. 2) Methods for selecting, displaying, and querying data from DataFrames including selecting columns, rows, cells by position, index, and boolean masking. 3) Merging and concatenating DataFrames using various join types like inner, outer joins for DataFrames with common variables or observations and with different shapes.


Study Unit 4

Data Management
Import Data
pandas Package
• “pandas” is the most common package for data management in Python.
• First, install pandas using pip, then import it in our program:

import pandas as pd

3/57
Import Data
• We need Python-compatible datasets to work with pandas.
• Load a dataset into Python and open it in pandas format.
• Convert .csv data files to a pandas DataFrame with the read_csv() function.

DataFrame_name = pd.read_csv("csv_file_name.csv")
• read_csv() is a reader that converts a specific data file format into a
pandas DataFrame.
• pandas also provides readers to import data files from other sources such
as Excel, SPSS, Stata, etc.
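A minimal sketch of the call above. The file name and data are hypothetical; io.StringIO stands in for an actual .csv file on disk, since read_csv() accepts a path, a URL, or any file-like object:

```python
import io
import pandas as pd

# A small CSV as a string; io.StringIO plays the role of "csv_file_name.csv"
# (hypothetical data, for illustration only).
csv_text = """Fruits,Prices,Country
Apple,0.5,China
Orange,0.7,Spain
Banana,0.9,Ecuador
"""

Imports = pd.read_csv(io.StringIO(csv_text))
print(Imports.shape)  # rows and columns of the new DataFrame
```

In a real program the argument would simply be the path to the .csv file.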

4/57
pandas Readers

Reader Format Type Data Description


read_csv() text CSV
read_html() text HTML
read_clipboard() text Local clipboard
read_excel() binary MS Excel
read_stata() binary Stata
read_sas() binary SAS
read_spss() binary SPSS
read_pickle() binary Python Pickle Format
read_sql() SQL SQL
read_gbq() SQL Google BigQuery

5/57
Display pandas DataFrames
• We can use the print() function to display the whole DataFrame.
print(DataFrame_name)
• Alternatively, the display() function achieves the same in Jupyter.
display(DataFrame_name)
• Another possibility is to print the DataFrames without any function.
DataFrame_name
• Use the .head() method to display the first five rows of a DataFrame.
DataFrame_name.head()

6/57
Data Selection
Select Columns by Variables
• Create a list of variable names to select specific columns of a DataFrame.
• The variable names must be put within a pair of quotation marks.
DataFrame_name[["var_name1", "var_name2", …]]
• To access one column, put the variable name as string inside the index
operator directly.

8/57
Example of column selection
• Suppose we have a dataset on fruits, prices, and country of origin
➢ imported as a pandas DataFrame named Imports

• To get the fruits and their prices, we use Imports[["Fruits", "Prices"]]
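A runnable sketch of this selection; the prices and countries are assumed values, since the slide's table is not reproduced here:

```python
import pandas as pd

# Rebuild the Imports example from the slides (values assumed for illustration).
Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [0.5, 0.7, 0.9],
    "Country": ["China", "Spain", "Ecuador"],
})

# A list of column labels inside the index operator returns a DataFrame.
subset = Imports[["Fruits", "Prices"]]

# A single label as a plain string returns a Series instead.
prices = Imports["Prices"]
```

Note the difference: double brackets always yield a DataFrame, a single string yields a Series.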

9/57
Select Rows by Positions
• We cannot refer to natural "observation names" to select rows.
• pandas provides a row index to every row.
• It starts with 0 and ends with the number of rows minus one.
• Rows can be queried by the numeric index position using the DataFrame
attribute iloc.
DataFrame_name.iloc[start:end]
• The indices must be integers, but they do not need to be consecutive.
• If we select multiple rows, the indices must be put in a list first.
• If we select one row, the index can be put in the index operator directly.

10/57
Example of row selection
• First and third rows of Imports, keep all columns
➢ [0, 2] in Imports.iloc[[0, 2], :] indicates the first and third rows
➢ The ":" after the comma indicates all columns
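The same example as runnable code (data values assumed, as before):

```python
import pandas as pd

Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [0.5, 0.7, 0.9],
    "Country": ["China", "Spain", "Ecuador"],
})

# First and third rows, all columns: a list of positions before the comma,
# ":" after the comma.
rows = Imports.iloc[[0, 2], :]

# A slice works too; iloc slices exclude the end position.
first_two = Imports.iloc[0:2]
```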

11/57
Select Rows by Indices
• Another way to select rows is to use row indices.
• Create row index labels by the method .set_index().
DataFrame_name.set_index(key, inplace = True)
• This method converts the values of a variable to row index labels.
• The rows can be queried by the row index labels using the .loc attribute.
DataFrame_name.loc[["row_label1", "row_label2", …]]
• Put row labels as strings in a list for selecting multiple rows.
• To select rows of a single label, put the label in .loc directly.

12/57
Example of using row indices
• Use the variable names as a key for Imports

• Then, we locate the ‘Apple’ and ‘Orange’ rows
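A sketch of both steps, using the assumed Imports data from earlier slides:

```python
import pandas as pd

Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [0.5, 0.7, 0.9],
})

# Promote the Fruits values to row index labels, modifying Imports in place.
Imports.set_index("Fruits", inplace=True)

# Query rows by their labels with .loc; multiple labels go in a list.
picked = Imports.loc[["Apple", "Orange"]]
```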

13/57
Select Cells by Positions and Indices
• Specify columns and rows by the .loc and/or .iloc attributes to select
cells from a DataFrame.
• Use one of the following syntaxes to select cells from a DataFrame:

DF_name.iloc[row_start:row_end, col_start:col_end]

DF_name.loc[["row_labels"], ["col_labels"]]

DF_name[["col_labels"]].iloc[row_start:row_end]

DF_name.loc[["row_labels"]].iloc[0:, col_start:col_end]

14/57
Select Cells by Boolean Masking
• The elements of a Boolean mask array are either True or False.
• The Boolean mask array is overlaid on top of the queried DataFrame.
• Elements aligned with True are selected.
DataFrame_name[Condition]
• More complex queries with several conditions are connected by bitwise
logical operators.
DataFrame_name[(Condition1) &/| (Condition2) &/| …]
• If there are two conditions, two Boolean masks will be compared
elementwise by the bitwise operator.
• Each condition needs to be encased in parentheses.
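A sketch of single- and multi-condition masking on the assumed Imports data:

```python
import pandas as pd

Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [0.5, 0.7, 0.9],
    "Country": ["China", "Spain", "Ecuador"],
})

# Single condition: the comparison yields a Boolean Series (the mask),
# and only rows aligned with True are kept.
cheap = Imports[Imports["Prices"] < 0.8]

# Two conditions combined with the bitwise "&"; each condition sits
# in its own pair of parentheses.
cheap_spanish = Imports[(Imports["Prices"] < 0.8) &
                        (Imports["Country"] == "Spain")]
```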

15/57
First Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_model.csv” into Python as a pandas DataFrame.
The data file contains 4 variables: Year, Make, Model, and Category.
• Print the DataFrame and check on its dimension.

16/57
Discussion
• What are the main differences between a dataset (or pandas DataFrames)
and an array (or NumPy array)?
• How different are the outputs of a DataFrame resulting from the different
printing functions in Python? Which one is most/least preferable when
using JupyterLab as our programming environment?

17/57
Second Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Select the column (variable) “Make” from the DataFrame.
• Select all entries with the value “Audi” in the variable “Make” from the
DataFrame.
• Select the last 10 observations and only the variable “Year” from the
DataFrame.

19/57
Discussion
• Name situations where we would need to select data from a DataFrame.
• How different are the syntaxes between row and column selections.

20/57
Merge DataFrames
(This section on DataFrames is optional)
Append DataFrames by Rows
• Concatenate two DataFrames with identical variables by rows:

• Use the .append() method for such concatenation (deprecated since pandas
1.4 and removed in 2.0; pd.concat() is the replacement):

DataFrame_name.append(other = [OtherDataFrames])
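A sketch of row-wise concatenation with hypothetical data. Because .append() no longer exists in pandas 2.0, the example uses pd.concat(), which produces the same stacked result:

```python
import pandas as pd

# Two DataFrames with identical variables (hypothetical values).
df1 = pd.DataFrame({"Fruits": ["Apple"], "Prices": [0.5]})
df2 = pd.DataFrame({"Fruits": ["Orange"], "Prices": [0.7]})

# Stack the rows of df2 below df1; ignore_index renumbers the row index.
stacked = pd.concat([df1, df2], ignore_index=True)
```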

23/57
Merge DataFrames by Columns
Merge two DataFrames with identical observations by columns:

24/57
Outer Join DataFrames with Some
Common Variables
Outer join two DataFrames with some common variables:

25/57
Inner Join DataFrames with Some
Common Variables
Inner join two DataFrames with some common variables:

26/57
Outer Join DataFrames with Some
Common Observations
Outer join two DataFrames with some common observations:

27/57
Inner Join DataFrames with Some
Common Observations
Inner join two DataFrames with some common observations:

28/57
Outer Join DataFrames with Different
Shapes
Outer join two DataFrames with different shapes:

29/57
Inner Join DataFrames with Different
Shapes
Inner join two DataFrames with different shapes:

30/57
Concatenate DataFrames
• Use concat() to merge multiple DataFrames with different shapes.
finalDF_name = pd.concat(objs, axis, join)
• The names of the DataFrames to be concatenated must be put in a list.
• If axis = 0, the DataFrames will be concatenated below one another,
and the concatenation will take place beside one another if axis = 1.
• The join parameter controls the type of concatenation. The possible
values here are "outer" and "inner", written as string.
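A small sketch of both parameters, with hypothetical two-column data:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"B": [3, 4]}, index=[1, 2])

# axis=1 places the DataFrames beside one another, aligned on the row index.
outer = pd.concat([left, right], axis=1, join="outer")  # union of indices
inner = pd.concat([left, right], axis=1, join="inner")  # intersection only
```

The outer join keeps all three row labels (filling gaps with NaN), while the inner join keeps only the label both DataFrames share.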

31/57
Third Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the usual data file "car_model.csv" into Python.
• Then read in "car_price_model.csv" into Python. The data file contains 4
variables: Year, Make, Model, and Price (in USD).
• Inner join the "car_model" and "car_price_model" data by their "Year",
"Make" and "Model".
• Print the merged DataFrame.

32/57
Missing Data
and Outliers
Missing Data
• In empirical studies, an observed value of a variable could be missing.
• Reasons for missing data: defective measurement tools, withdrawal from
the study, refusal of responses to sensitive questions, etc.
• In Python, pandas indicates missing data with the special floating-point
value NaN ("Not a Number"), adopted from NumPy.
• Missing data cannot be included in constructing models, forecasting, etc.
• In pandas, missing values are ignored by the statistical functions.
• Since the underlying sample sizes for each variable could vary in the
computation due to missing data, the statistical estimation can be biased.

34/57
Identify Missing Data
• pandas’ readers such as read_csv() have two parameters, na_filter
and na_values, to convert certain strings to missing values directly.

DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string", na_filter = True/False)

• If na_filter is True (the default), pandas converts empty strings "" and
other recognised markers to NaN.
• With na_values, we can declare certain strings from our DataFrame to
be recognised as missing values.
• The following strings are treated as missing values by default and do not
require explicit declaration:
"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"

35/57
Locate Missing Data (I)
• It is often not easy to locate the missing data in a dataset.
• Count the NaNs in each row and each column to find the missing data.
• Missing values exist in a variable if the number of NaNs is larger than zero.
• Same approach is applicable to rows.
DataFrame_name.isnull().sum(axis = 0/1)
• The method .isnull() checks every cell of a DataFrame and returns
True if the cell contains a missing value and False otherwise.
• The method .sum() adds up all the “True” (equal to 1) in each row or
each column. It is equivalent to counting the missing values.
• If axis = 0, the values in a column will be added up together. If axis =
1, it returns the sum of a row.
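A sketch of the counting, rebuilding the corrupted C_Imports dataset from the slides (Apple's price set to NaN; other values assumed):

```python
import numpy as np
import pandas as pd

# C_Imports: the corrupted dataset, with Apple's price missing.
C_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [np.nan, 0.7, 0.9],
})

# Missing values per column (axis=0) ...
per_column = C_Imports.isnull().sum(axis=0)

# ... and per row (axis=1).
per_row = C_Imports.isnull().sum(axis=1)
```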

36/57
Example of Locating Missing Data (I)
• Suppose our Imports dataset is corrupted; name it C_Imports
➢ Observe that the price of Apple is missing (NaN)

• Now locate the missing data; there is 1 missing value in Prices

37/57
Locate Missing Data (II)
• We can use .any() just to check whether missing data exist.
• It returns True if any element in the attached array is True.
DataFrame_name.isnull().any(axis = 0/1)
• Retrieve the row/column indices with missing data by .index.
object_name = DataFrame_name.isnull().any()
object_name[object_name == True].index

• Counting the NaNs in columns means to check on the existence of missing
values in each variable.
• Counting the NaNs in rows means to identify observations with missing
values in at least one of the variables.
• The treatments of missing data are different in these two cases.

38/57
Example of Locating Missing Data (II)
• Check for missing entries in C_Imports. Recall that the price is missing
➢ We see a "True" for "Prices"

• Now retrieve the index of the variable with missing data

➢ It returns "Prices" as the index

39/57
Delete Missing Data
• The usual ways to treat missing data are:
➢ delete the entire observations
➢ replace them
➢ ignore them
• With .drop(), we specify the row indices with missing values that should
be removed.
DF_name.drop(axis = 0, index = [index1, index2, …])
• The .dropna() method combines the localisation and removal of rows
or columns with missing data in a single function.
DF_name.dropna(axis = 0, how = "any"/"all")
• .dropna() treats and deletes all missing values equally.
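A sketch of row-wise deletion on the assumed C_Imports data:

```python
import numpy as np
import pandas as pd

C_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [np.nan, 0.7, 0.9],
})

# how="any" drops a row if any of its values is missing;
# how="all" would drop it only if every value were missing.
cleaned = C_Imports.dropna(axis=0, how="any")
```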

40/57
Example of Deleting Missing Data
• Recall that Prices for Apple is missing in C_Imports.

• We set how = “any” to drop the row for Apple with NaN for Prices.

41/57
Replace Missing Data
• We can also replace missing values by a pre-defined value.
• The most common values are 0 or the variable mean.
• pandas facilitates replacement of missing values by .fillna().

DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)

• The .fillna() method replaces all missing values with the value
specified in the parameter.
• We can also specify a column in the DataFrame where the missing values
should be replaced by the .fillna() method.
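A sketch of mean replacement on the assumed C_Imports data; since pandas skips NaN when computing the mean, the replacement value is (0.7 + 0.9) / 2 = 0.8, matching the example on the next slide:

```python
import numpy as np
import pandas as pd

C_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana"],
    "Prices": [np.nan, 0.7, 0.9],
})

# Mean of the observed prices (NaN is skipped automatically).
mean_price = C_Imports["Prices"].mean()

# Replace the missing price in the Prices column only.
C_Imports["Prices"] = C_Imports["Prices"].fillna(value=mean_price)
```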

42/57
Example of Replacement
• Recall that Prices for Apple is missing in C_Imports (i.e., NaN)

• Replace NaN with the mean price of orange and banana


➢ Mean is (.7 + .9) / 2 = .8

43/57
Outliers
• Outliers can bias the statistical estimation.
• Draw histogram/boxplot or use interquartile range (IQR) to detect outliers
in a variable.
• Use .quantile() to determine the 1st and 3rd quartiles of a variable.
DataFrame_name["column_label"].quantile(q = quantile)
• An observation y is considered an outlier if:
y < q1 - 1.5 * iqr or y > q3 + 1.5 * iqr
where q1 is the 1st quartile, q3 the 3rd quartile, and iqr = q3 - q1
is the interquartile range.

44/57
Detect and Remove Outliers
• Usually, outliers should be removed from the dataset.
• In Python, it suffices to select observations with no outliers in the target
variable.
• The following syntax generates a subset of rows that do not fulfil the
outlier condition.

DF[~((DF["Col"] < q1 - 1.5*iqr) | (DF["Col"] > q3 + 1.5*iqr))]
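The whole detection-and-removal recipe as a runnable sketch, with the O_Imports data from the next slides (ordinary prices assumed; the 1,000 entry is the planted outlier):

```python
import pandas as pd

# O_Imports: fruit data with one suspicious observation.
O_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Orange", "Banana", "Durian"],
    "Prices": [0.5, 0.7, 0.9, 1000.0],
})

# 1st and 3rd quartiles, and the interquartile range.
q1 = O_Imports["Prices"].quantile(q=0.25)
q3 = O_Imports["Prices"].quantile(q=0.75)
iqr = q3 - q1

# Keep only the rows that do NOT satisfy the outlier condition.
no_outliers = O_Imports[~((O_Imports["Prices"] < q1 - 1.5 * iqr) |
                          (O_Imports["Prices"] > q3 + 1.5 * iqr))]
```

Only the Durian row exceeds q3 + 1.5 * iqr, so it is the row that gets filtered out.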

45/57
Example of Outliers (I)
• Suppose our fruit imports dataset, O_Imports, now has 1 outlier
➢ Durian with a Price of 1,000, from Ukraine; looks suspicious

• Obtain the 1st quartile (q = .25) and 3rd quartile (q = .75)

46/57
Example of Outliers (II)
• Define interquartile range, iqr, as q3 – q1

• Find the observations which are not outliers


➢ Hence, Durian from Ukraine with a Price of 1,000 is an outlier

47/57
Fourth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Remove all the rows with missing data in "Price" from the merged
DataFrame created in the previous activity.
• Detect car models that can be considered as outliers in terms of their selling
price.
• Discuss whether the models with extraordinarily high/low selling prices
should be excluded from the DataFrame.

48/57
Data Modification
Sort Data
• Sometimes, we may want to sort the data according to the values of some
variables for better understanding.
• The method .sort_values() rearranges the order of the rows in a
DataFrame.

DataFrame_name.sort_values(
by = [List_of_var_names], ascending)

• A list of variable names based on which the DataFrame is sorted must be
provided.
• Variables earlier in the list take precedence in the sorting hierarchy.
• Data can be sorted in the ascending or descending order.
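A sketch of sorting the assumed Imports data by price, highest first:

```python
import pandas as pd

Imports = pd.DataFrame({
    "Fruits": ["Banana", "Apple", "Orange"],
    "Prices": [0.9, 0.5, 0.7],
})

# Sort by price in descending order; names earlier in the "by" list
# would take precedence if several were given.
by_price = Imports.sort_values(by=["Prices"], ascending=False)
```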

50/57
Discretisation
• Through discretisation, variables become easier to understand or
compatible with specific analytics models such as decision trees.
• Use the cut() function to discretise continuous variables.

DataFrame_name["column"] = pd.cut(x = array,
    bins, right, labels, include_lowest, ordered)

• Note that the data to be discretised should first be converted to a
one-dimensional NumPy array.
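A sketch of cut() on assumed price data; the bin edges and labels are illustrative choices, not values from the slides:

```python
import numpy as np
import pandas as pd

# One-dimensional NumPy array of prices (assumed values).
prices = np.array([0.5, 0.7, 0.9, 1.4])

# Discretise into two labelled bins; include_lowest=True keeps the smallest
# value inside the first bin despite the right-closed intervals.
bands = pd.cut(x=prices, bins=[0.0, 0.8, 1.5],
               labels=["cheap", "expensive"],
               right=True, include_lowest=True)
```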

51/57
Grouping Data
• Analysing aggregated statistics of some variables is often of interest in
data analytics.
• Aggregated statistics are calculated based on grouped data.
• A DataFrame can be grouped by one or more variables using the
.groupby() method.
DataFrame_name.groupby(by = [List_of_Labels]).anymethod()
• Attached to the .groupby() method can be any method for the
calculation of the aggregated statistics.
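A sketch of grouping on assumed import data, attaching a mean calculation to .groupby():

```python
import pandas as pd

Imports = pd.DataFrame({
    "Country": ["Spain", "Spain", "Ecuador"],
    "Prices": [0.5, 0.7, 0.9],
})

# Mean price per country; any aggregation method can follow .groupby().
mean_by_country = Imports.groupby(by=["Country"])["Prices"].mean()
```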

52/57
Transformation
• Sometimes, we may need to transform the values of a variable.
• For instance, log-transformation can help to stabilise the variance of a
variable.
• In Python, there are various functions to transform variables.
• For example, the log-transformation of a numeric variable can be carried
out by the log() function from the NumPy package.

DF_name["new_var"] = np.log(DF_name["var_name"])

53/57
Standardisation
• Sometimes, variables included in clustering could be measured at different
scales and do not contribute equally to the analysis.
• Hence, we need to standardise or normalise them.
• The standardisation function can be found in the “scikit-learn” package.
• To standardise a variable in the traditional way, we need to find the mean
and the standard deviation of the variable first.

var_mean = np.mean(DF["var_name"])
var_std = np.std(DF["var_name"])
DF["std_var"] = (DF["var_name"] - var_mean) / var_std
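The three lines above as a runnable sketch on assumed data; note that np.std defaults to the population standard deviation (ddof = 0), which is what the formula here uses:

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({"var_name": [1.0, 2.0, 3.0, 4.0]})

# Centre on the mean and divide by the (population) standard deviation.
var_mean = np.mean(DF["var_name"])
var_std = np.std(DF["var_name"])
DF["std_var"] = (DF["var_name"] - var_mean) / var_std
```

The standardised variable has mean 0 and standard deviation 1.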

54/57
Normalisation
• Normalisation is another transformation method to scale down a variable.
• The values of a normalised variable can only be in the interval [0, 1].
• The normalisation function can also be found in the “scikit-learn” package.
• To normalise a variable in the traditional way, we need to find the minimum
and maximum of the variable first.
var_min = np.min(DF["var_name"])
var_max = np.max(DF["var_name"])
DF["norm_var"] = (DF["var_name"] - var_min) / (var_max - var_min)
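The min-max formula above as a runnable sketch on assumed data:

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({"var_name": [2.0, 4.0, 6.0, 10.0]})

# Min-max scaling maps the variable onto the interval [0, 1].
var_min = np.min(DF["var_name"])
var_max = np.max(DF["var_name"])
DF["norm_var"] = (DF["var_name"] - var_min) / (var_max - var_min)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between.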

55/57
Fifth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_price_model.csv” into Python as a pandas
DataFrame. The data file contains 4 variables: Year, Make, Model, and Price.
• Print the DataFrame and check on its dimension.
• Drop the rows where at least one element is missing.
• Since the prices stored in the "Price" variable are in US dollars, add a new
column named "SGD Price" that is 1.5 times the "Price".
• Calculate the mean price (in SGD) of each car Make.

56/57
Discussion
• What happens to the bin edges in the discretisation process using the cut()
function?
• When is log-transformation/normalisation/standardisation necessary? What
is the main difference between normalisation and standardisation?

57/57
