ANL252 SU4 Jul2022
Data Management
Import Data
pandas Package
• “pandas” is the most common package for data management in Python.
• First, install pandas using pip, then import it in our program:
import pandas as pd
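• The pip installation happens on the command line (outside Python):
pip install pandas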
Import Data
• We need Python-compatible datasets to work with pandas.
• Load a dataset into Python and open it as a pandas object.
• Convert .csv data files into a pandas DataFrame with the read_csv() function.
DataFrame_name = pd.read_csv("csv_file_name.csv")
• read_csv() is a reader that converts data files of this specific format into a
pandas DataFrame.
• pandas also provides readers to import data files from other sources such
as Excel, SPSS, Stata, etc.
pandas Readers
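• A few common readers, sketched below (some require optional packages such as openpyxl for Excel or pyreadstat for SPSS):
DataFrame_name = pd.read_excel("excel_file_name.xlsx") # Excel workbooks
DataFrame_name = pd.read_stata("stata_file_name.dta") # Stata data files
DataFrame_name = pd.read_spss("spss_file_name.sav") # SPSS data files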
Display pandas DataFrames
• We can use the print() function to display the whole DataFrame.
print(DataFrame_name)
• Alternatively, the display() function (available in Jupyter environments) achieves the same result.
display(DataFrame_name)
• Another possibility is to display a DataFrame by typing its name alone.
DataFrame_name
• Use the .head() method to display the first five rows of a DataFrame.
DataFrame_name.head()
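• A minimal sketch (the demo DataFrame and its values are hypothetical):
import pandas as pd
df = pd.DataFrame({"x": range(10)}) # small demo DataFrame with 10 rows
df.head() # displays rows 0 to 4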
Data Selection
Select Columns by Variables
• Create a list of variable names to select specific columns of a DataFrame.
• The variable names must be put within a pair of quotation marks.
DataFrame_name[["var_name1", "var_name2", …]]
• To access one column, put the variable name as a string inside the index
operator directly.
Example of column selection
• Suppose we have a dataset on fruits, prices, and country of origin
➢ imported as a pandas DataFrame and named Imports
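• A minimal reconstruction of such a DataFrame (the variable names Fruits, Prices, and Country and all values are assumptions for illustration):
import pandas as pd
Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Durian"],
    "Prices": [3.5, 1.2, 15.0],
    "Country": ["USA", "Philippines", "Malaysia"]
})
Imports[["Fruits", "Prices"]] # select two columns
Imports["Country"] # select a single column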
Select Rows by Positions
• We cannot refer to natural “observation names” to select rows.
• pandas provides a row index to every row.
• It starts with 0 and ends with the number of rows minus one.
• Rows can be queried by the numeric index position using the DataFrame
attribute iloc.
DataFrame_name.iloc[start:end]
• The indices must be integers, but they do not need to be consecutive.
• If we select multiple rows, the indices must be put in a list first.
• If we select one row, the index can be put in the index operator directly.
Example of row selection
• First row and third row of Imports, keeping all columns
➢ [0, 2] indicates the first and third rows of Imports
➢ The ":" after the comma indicates all columns
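• A sketch with the hypothetical Imports DataFrame from above:
Imports.iloc[[0, 2], :] # first and third rows, all columns
Imports.iloc[0:2] # rows 0 and 1 (the end position is exclusive)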
Select Rows by Indices
• Another way to select rows is to use row indices.
• Create row index labels by the method .set_index().
DataFrame_name.set_index(key, inplace = True)
• This method converts the values of a variable to row index labels.
• The rows can be queried by the row index labels using the .loc attribute.
DataFrame_name.loc[["row_label1", "row_label2", …]]
• Put row labels as strings in a list for selecting multiple rows.
• To select rows of a single label, put the label in .loc directly.
Example of using row indices
• Use the values of one variable (e.g., the fruit names) as the key for Imports
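• A sketch, assuming the fruit names serve as the key:
Imports.set_index("Fruits", inplace = True) # fruit names become the row labels
Imports.loc[["Apple", "Durian"]] # multiple rows by label
Imports.loc["Apple"] # a single row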
Select Cells by Positions and Indices
• Specify columns and rows by the .loc and/or .iloc attributes to select
cells from a DataFrame.
• Use one of the following syntaxes to select cells from a DataFrame:
DF_name.iloc[row_start:row_end, col_start:col_end]
DF_name.loc[["row_labels"], ["col_labels"]]
DF_name[["col_labels"]].iloc[row_start:row_end]
DF_name.loc[["row_labels"]].iloc[0:, col_start:col_end]
Select Cells by Boolean Masking
• The elements of a Boolean mask array are either True or False.
• The Boolean mask array is overlaid on top of the queried DataFrame.
• Elements aligned with True are selected.
DataFrame_name[Condition]
• More complex queries with several conditions are connected by bitwise
logical operators.
DataFrame_name[(Condition1) &/| (Condition2) &/| …]
• If there are two conditions, two Boolean masks will be compared
elementwise by the bitwise operator.
• Each condition needs to be encased in parentheses.
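• A sketch with the hypothetical Imports DataFrame:
Imports[Imports["Prices"] > 2] # rows where the price exceeds 2
Imports[(Imports["Prices"] > 2) & (Imports["Country"] == "USA")] # both conditions must hold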
First Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_model.csv” into Python as a pandas DataFrame.
The data file contains 4 variables: Year, Make, Model, and Category.
• Print the DataFrame and check its dimensions.
Discussion
• What are the main differences between a dataset (or pandas DataFrames)
and an array (or NumPy array)?
• How different are the outputs of a DataFrame resulting from the different
printing functions in Python? Which one is most/least preferable when
using JupyterLab as our programming environment?
Second Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Select the column (variable) “Make” from the DataFrame.
• Select all entries with the value “Audi” in the variable “Make” from the
DataFrame.
• Select the last 10 observations and only the variable “Year” from the
DataFrame.
Discussion
• Name situations where we would need to select data from a DataFrame.
• How different are the syntaxes between row and column selections?
Merge DataFrames
(This section on merging DataFrames is optional)
Append DataFrames by Rows
• Concatenate two DataFrames with identical variables by rows:
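• A sketch with two hypothetical DataFrames that share the same variables:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
pd.concat([df1, df2], axis = 0) # df2 is appended below df1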
Merge DataFrames by Columns
Merge two DataFrames with identical observations by columns:
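• A sketch with two hypothetical DataFrames that describe the same observations:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6], "D": [7, 8]})
pd.concat([df1, df2], axis = 1) # df2 is placed beside df1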
Outer Join DataFrames with Some Common Variables
Outer join two DataFrames with some common variables:
Inner Join DataFrames with Some Common Variables
Inner join two DataFrames with some common variables:
Outer Join DataFrames with Some Common Observations
Outer join two DataFrames with some common observations:
Inner Join DataFrames with Some Common Observations
Inner join two DataFrames with some common observations:
Outer Join DataFrames with Different Shapes
Outer join two DataFrames with different shapes:
Inner Join DataFrames with Different Shapes
Inner join two DataFrames with different shapes:
Concatenate DataFrames
• Use concat() to merge multiple DataFrames with different shapes.
finalDF_name = pd.concat(objs, axis, join)
• The names of the DataFrames to be concatenated must be put in a list.
• If axis = 0, the DataFrames will be concatenated below one another,
and the concatenation will take place beside one another if axis = 1.
• The join parameter controls the type of concatenation. The possible
values here are "outer" and "inner", written as string.
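• A sketch contrasting the two join types for hypothetical DataFrames of different shapes:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
pd.concat([df1, df2], axis = 0, join = "outer") # all columns kept; gaps filled with NaN
pd.concat([df1, df2], axis = 0, join = "inner") # only the common column B is kept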
Third Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the usual data file “car_model.csv” into Python.
• Then read in “car_price_model.csv” into Python. The data file contains 4
variables: Year, Make, Model, and Price (in USD).
• Inner join the “car_model” and “car_price_model” data by their “Year”,
“Make” and “Model” variables.
• Print the merged DataFrame.
Missing Data and Outliers
Missing Data
• In empirical studies, an observed value of a variable could be missing.
• Reasons for missing data: defective measurement tools, withdrawal from
the study, refusal of responses to sensitive questions, etc.
• In Python, pandas indicates missing data with the special floating-point
value NaN (“Not a Number”), which comes from NumPy.
• Missing data cannot be included in constructing models, forecasting, etc.
• In pandas, missing values are ignored by the statistical functions.
• Since the underlying sample size of each variable can vary in the
computation due to missing data, the statistical estimates can be biased.
Identify Missing Data
• pandas’ readers such as read_csv() have two parameters, na_filter
and na_values, to convert certain strings to missing values directly.
DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string", na_filter = True/False)
• If na_filter is True, pandas converts empty fields ("") to NaN.
• With na_values, we can declare certain strings from our DataFrame to
be recognised as missing values (a sketch follows the list below).
• The following strings are treated as missing values by default and do not
require explicit declaration:
"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"
Locate Missing Data (I)
• It is often not easy to locate the missing data in a dataset.
• Count the NaNs in each row and each column to find the missing data.
• Missing values exist in a variable if the number of NaNs is larger than zero.
• The same approach applies to rows.
DataFrame_name.isnull().sum(axis = 0/1)
• The method .isnull() checks every cell of a DataFrame and returns
True if the cell contains a missing value and False otherwise.
• The method .sum() adds up all the “True” (equal to 1) in each row or
each column. It is equivalent to counting the missing values.
• If axis = 0, the values in a column will be added up together. If axis =
1, it returns the sum of a row.
Example of Locating Missing Data (I)
• Suppose our Imports dataset is corrupted; name it C_Imports
➢ Observe that the price of Apple is missing (NaN)
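• A minimal reconstruction of the corrupted dataset (all values are assumed for illustration):
import pandas as pd
import numpy as np
C_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Durian"],
    "Prices": [np.nan, 1.2, 15.0], # the price of Apple is missing
    "Country": ["USA", "Philippines", "Malaysia"]
})
C_Imports.isnull().sum(axis = 0) # the Prices column shows a count of 1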
Locate Missing Data (II)
• We can use .any() just to check whether missing data exist.
• It returns True if any element in the attached array is True.
DataFrame_name.isnull().any(axis = 0/1)
• Retrieve the row/column indices with missing data by .index.
object_name = DataFrame_name.isnull().any()
object_name[object_name == True].index
Example of Locating Missing Data (II)
• Check for missing entries in C_Imports. Recall that the price of Apple is missing
➢ We see a True for “Prices”
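• A sketch with the hypothetical C_Imports from above:
check = C_Imports.isnull().any(axis = 0) # True for Prices, False elsewhere
check[check == True].index # Index containing the label "Prices"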
Delete Missing Data
• The usual ways to treat missing data are:
➢ delete the entire observations
➢ replace them
➢ ignore them
• With .drop(), we specify the row indices with missing values that should
be removed.
DF_name.drop(axis = 0, index = [index1, index2, …])
• The .dropna() method combines the localisation and removal of rows
or columns with missing data in a single function.
DF_name.dropna(axis = 0, how = "any"/"all")
• All missing values are treated equally and deleted by .dropna().
Example of Deleting Missing Data
• Recall that Prices for Apple is missing in C_Imports.
• We set how = "any" to drop the row for Apple with NaN for Prices.
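• A sketch with the hypothetical C_Imports from above:
C_Imports.dropna(axis = 0, how = "any") # removes the Apple row
C_Imports.drop(axis = 0, index = [0]) # equivalent here, using Apple's row index 0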
Replace Missing Data
• We can also replace missing values by a pre-defined value.
• The most common replacement values are 0 or the variable's mean.
• pandas facilitates replacement of missing values by .fillna().
DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)
• The .fillna() method replaces all missing values with the value
specified in the parameter.
• We can also specify a column in the DataFrame where the missing values
should be replaced by the .fillna() method.
Example of Replacement
• Recall that Prices for Apple is missing in C_Imports (i.e., NaN)
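• A sketch with the hypothetical C_Imports from above:
C_Imports.fillna(value = 0) # replace every NaN with 0
mean_price = C_Imports["Prices"].mean() # NaN is ignored when computing the mean
C_Imports["Prices"].fillna(value = mean_price) # replace only in the Prices column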
Outliers
• Outliers can cause bias in the statistical estimation.
• Draw a histogram/boxplot or use the interquartile range (IQR) to detect
outliers in a variable.
• Use .quantile() to determine the 1st and 3rd quartiles of a variable.
DataFrame_name["column_label"].quantile(q = quantile)
• An observation y is considered an outlier if:
y < q1 - 1.5 * iqr or y > q3 + 1.5 * iqr
where q1 is the 1st quartile, q3 the 3rd quartile, and iqr = q3 - q1
is the interquartile range.
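• A sketch computing the quartiles of the hypothetical Prices variable:
q1 = Imports["Prices"].quantile(q = 0.25) # 1st quartile
q3 = Imports["Prices"].quantile(q = 0.75) # 3rd quartile
iqr = q3 - q1 # interquartile range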
Detect and Remove Outliers
• Usually, outliers should be removed from the dataset.
• In Python, it suffices to select observations with no outliers in the target
variable.
• The following syntax generates a subset of rows that do not fulfil the
outlier condition.
DF[~((DF["Col"] < q1 - 1.5*iqr) | (DF["Col"] > q3 + 1.5*iqr))]
Example of Outliers (I)
• Suppose our fruit imports dataset, O_Imports, now has 1 outlier
➢ A durian with a price of 1,000, from Ukraine; this looks suspicious
Example of Outliers (II)
• Define the interquartile range, iqr, as q3 - q1
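• A minimal reconstruction of O_Imports and the outlier filter (all values are assumed):
import pandas as pd
O_Imports = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Cherry", "Mango", "Orange", "Durian"],
    "Prices": [3.5, 1.2, 2.0, 4.0, 2.5, 1000.0], # the durian price is suspicious
    "Country": ["USA", "Philippines", "Chile", "India", "Spain", "Ukraine"]
})
q1 = O_Imports["Prices"].quantile(q = 0.25)
q3 = O_Imports["Prices"].quantile(q = 0.75)
iqr = q3 - q1
O_Imports[~((O_Imports["Prices"] < q1 - 1.5*iqr) | (O_Imports["Prices"] > q3 + 1.5*iqr))] # keeps every row except Durian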
Fourth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Remove all the rows with missing data in “Price” from the merged
DataFrame created in the previous activity.
• Detect car models that can be considered outliers in terms of their selling
price.
• Discuss whether the models with extraordinarily high/low selling prices
should be excluded from the DataFrame.
Data Modification
Sort Data
• Sometimes, we may want to sort the data according to the values of some
variables for better understanding.
• The method .sort_values() rearranges the order of the rows in a
DataFrame.
DataFrame_name.sort_values(
by = [List_of_var_names], ascending)
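• A sketch with the hypothetical Imports DataFrame:
Imports.sort_values(by = ["Country", "Prices"], ascending = [True, False]) # Country ascending, Prices descending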
Discretisation
• Through discretisation, variables become easier to understand or
compatible with specific analytics models such as decision trees.
• Use the cut() function to discretise continuous variables.
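• A sketch, assuming three equal-width price bands with hypothetical labels:
Imports["Price_band"] = pd.cut(Imports["Prices"], bins = 3, labels = ["low", "mid", "high"])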
Grouping Data
• Analysing aggregated statistics of some variables is often of interest in
data analytics.
• Aggregated statistics are calculated based on grouped data.
• A DataFrame can be grouped by one or more variables using the
.groupby() method.
DataFrame_name.groupby(by = [List_of_Labels]).anymethod()
• Any method that calculates the aggregated statistics can be attached to
the .groupby() method.
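• A sketch computing the mean price per country of origin for the hypothetical Imports data:
Imports.groupby(by = ["Country"])["Prices"].mean()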
Transformation
• Sometimes, we may need to transform the values of a variable.
• For instance, log-transformation can help to stabilise the variance of a
variable.
• In Python, there are various functions to transform variables.
• For example, the log-transformation of a numeric variable can be carried
out by the log() function from the NumPy package.
DF_name["new_var"] = np.log(DF_name["var_name"])
Standardisation
• Sometimes, variables included in clustering could be measured on different
scales and thus do not contribute equally to the analysis.
• Hence, we need to standardise or normalise them.
• The standardisation function can be found in the “scikit-learn” package.
• To standardise a variable in the traditional way, we need to find the mean
and the standard deviation of the variable first.
var_mean = np.mean(DF["var_name"])
var_std = np.std(DF["var_name"])
DF["std_var"] = (DF["var_name"] – var_mean) / var_std
Normalisation
• Normalisation is another transformation method to scale down a variable.
• The values of a normalised variable can only be in the interval [0, 1].
• The normalisation function can also be found in the “scikit-learn” package.
• To normalise a variable in the traditional way, we need to find the minimum
and maximum of the variable first.
var_min = np.min(DF["var_name"])
var_max = np.max(DF["var_name"])
DF["norm_var"] = (DF["var_name"] – var_min)
/ (var_max – var_min)
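• Similarly, a sketch using scikit-learn's MinMaxScaler on the hypothetical Imports data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
Imports["norm_price"] = scaler.fit_transform(Imports[["Prices"]]).flatten() # values scaled into [0, 1]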
Fifth Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Read in the data file “car_price_model.csv” into Python as a pandas
DataFrame. The data file contains 4 variables: Year, Make, Model, and Price.
• Print the DataFrame and check its dimensions.
• Drop the rows where at least one element is missing.
• Since the prices stored in the “Price” variable are in US dollars, add a new
column named “SGD Price” that is 1.5 times the “Price”.
• Calculate the mean price (in SGD) for each car Make.
Discussion
• What happens to the bin edges in the discretisation process using the cut()
function?
• When is log-transformation/normalisation/standardisation necessary? What
is the main difference between normalisation and standardisation?