BI Lecture05A DataWrangling

Data Wrangling

CS 459 Business Intelligence


Data Wrangling
February 24 CS459 - Business Intelligence - Abeera Tariq 2
Data Wrangling, also called Data Munging
• Data wrangling is the process of gathering, collecting, and
transforming raw data into another format for better
understanding, decision-making, accessing, and analysis
in less time.
• All the activity that you do on the raw data to make it “clean”
enough to input to your analytical algorithm is called data
wrangling or data munging. (Shubham Simar Tomar, 2016)



Summarizing
6-steps of
Data Wrangling



Importance of
Data Wrangling
• In data science and data
analysis, the amount of
work that goes into data
wrangling is embodied by
the 80/20 rule – data
scientists typically spend
80% of their time
‘wrangling’ or preparing
data and 20% of their time
actually analyzing the data.



Exploratory Data Analysis (EDA)

In data science, exploratory data analysis involves examining the
distribution of various variables in the dataset, identifying
outliers, finding trends and patterns, and looking for
relationships between variables by using heat maps or
correlation metrics.
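
As a minimal sketch of these EDA steps, the snippet below builds a small synthetic sales table (the column names and values are made up for illustration) and looks at per-column distributions and pairwise correlations with pandas. A heat map would simply be a colored rendering of the same correlation table.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data, generated only for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"units_sold": rng.integers(1, 50, size=200)})
df["revenue"] = df["units_sold"] * 9.99 + rng.normal(0, 5, size=200)
df["discount"] = rng.uniform(0, 0.3, size=200)

# Distribution of each variable: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

print(summary)
print(corr.round(2))
```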



EDA



Data Wrangling

Data Cleaning



Types of dirty data



Missing Values



Missing Values

• Every value in every column has a certain probability of being
missing (Rubin, 1976)
• Generally, every column in a dataset has a probability distribution
that defines the shape of the probabilities of occurrence of its
values (e.g., bell curve, exponential, logarithmic etc.)
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)



Missing Values - MCAR

• Missing Completely at Random (MCAR):
• Every column value has the same probability of being missing
• Causes of the missing data are unrelated to the data
• A product weighing scale generates missing data because its
batteries have died
• Sales data for an outlet is missing because the outlet was closed
for maintenance
• ATM data is missing over some time period because the ATM was
being filled with cash, or because a technical glitch rippled
across multiple locations.



Missing Values

• Missing at Random (MAR):
• Different column values (e.g., different groups) can have different
probabilities of being missing; this is the most common case
• Causes of the missing data are related to the observed data
• A weighing scale produces more missing values for heavier products
• Sales data is missing for teenage customers because there was no
promotion for teenagers
• ATM data is missing for a time period because of weekends, holidays,
or lower transaction volumes. The missingness is related to the
observed variable (day of the week) but not directly to the missing
values.
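
The difference between MCAR and MAR can be sketched with simulated data. Everything below is an assumption made for illustration: the `day_type` and `amount` columns, the 10% MCAR rate, and the 40% versus 5% MAR rates are hypothetical. Under MCAR the missing rate is roughly the same in every group, while under MAR it depends on the observed `day_type`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical ATM transactions: an observed day type and an amount.
df = pd.DataFrame({
    "day_type": np.where(rng.random(n) < 2 / 7, "weekend", "weekday"),
    "amount": rng.normal(100, 20, size=n),
})

# MCAR: every row has the same 10% chance of losing its amount.
mcar_mask = rng.random(n) < 0.10
df["amount_mcar"] = df["amount"].mask(mcar_mask)

# MAR: the chance of being missing depends on the observed day_type.
p_missing = np.where(df["day_type"] == "weekend", 0.40, 0.05)
mar_mask = rng.random(n) < p_missing
df["amount_mar"] = df["amount"].mask(mar_mask)

# Share of missing values per group for each mechanism.
mcar_rates = df["amount_mcar"].isna().groupby(df["day_type"]).mean()
mar_rates = df["amount_mar"].isna().groupby(df["day_type"]).mean()
print(mcar_rates.round(3))
print(mar_rates.round(3))
```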
Missing Values

• Missing Not at Random (MNAR):
• When the case cannot be categorized as MCAR or MAR: the
probability of being missing varies for unknown reasons
• A weighing scale gives more missing values over time because it is
wearing out, which cannot be detected from the data
• Sales data is increasingly missing over time because customers are
relocating, which cannot be detected from the data
• ATM data: fewer and fewer people come to the ATM for fear of theft



Data Cleaning



Problems with the Data



Interpreting
Histograms and Box
plots
What is a
Histogram?
A histogram is a graphical representation
of the frequency distribution of a continuous
series using rectangles.
The x-axis of the graph represents the class
intervals, and the y-axis shows the
frequencies corresponding to the different
class intervals.
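
The class intervals and frequencies behind a histogram can be computed directly. This is a small sketch on synthetic prices (the data and the choice of 10 bins are assumptions); `np.histogram` returns the frequency for each class interval, and a plotting call such as `matplotlib.pyplot.hist` would draw the same information as rectangles.

```python
import numpy as np

# Hypothetical continuous data: 1,000 product prices.
rng = np.random.default_rng(1)
prices = rng.normal(loc=50, scale=10, size=1_000)

# counts[i] is the frequency of the class interval [edges[i], edges[i + 1]).
# The last interval also includes its right edge.
counts, edges = np.histogram(prices, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.1f}, {right:6.1f}) -> {count}")
```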



Analyzing Histograms:
Shape, Skew and Kurtosis



Mean, Median, Mode

• Mean: The "average" number; found by adding all data points and
dividing by the number of data points.
(Impacted by outliers)
• Median: The middle number; found by ordering all data points and
picking out the one in the middle (or if there are two middle numbers,
taking the mean of those two numbers).
(Largely unaffected by outliers)
• Mode: The most frequent number, that is, the number that occurs the
highest number of times.
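
A short sketch of how one outlier affects the three measures, using made-up sales figures: the mean moves sharply, while the median and mode stay where they were.

```python
import pandas as pd

# Hypothetical daily sales, with and without one extreme value.
sales = pd.Series([10, 12, 12, 13, 15])
sales_with_outlier = pd.Series([10, 12, 12, 13, 150])

for name, s in [("clean", sales), ("with outlier", sales_with_outlier)]:
    # mode() returns a Series because several values can tie for most frequent.
    print(name, "mean:", s.mean(), "median:", s.median(), "mode:", s.mode()[0])
```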



Skew

• Skewness is a statistical measure that assesses the
asymmetry of a probability distribution. It quantifies the
extent to which the data is skewed or shifted to one side:
positive (long tail on right) or negative (long tail on left).



Kurtosis

• Kurtosis is a statistical measure that quantifies the shape of a
probability distribution. It provides information about the
tails and peakedness of the distribution compared to a
normal distribution.
• Positive excess kurtosis (kurtosis above that of the normal
distribution) indicates heavier tails and a more peaked
distribution, while negative excess kurtosis suggests lighter
tails and a flatter distribution.
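
Skew and kurtosis can be checked numerically before reading them off a histogram. The sketch below uses synthetic samples chosen as assumptions to show each case; pandas `Series.skew()` and `Series.kurt()` are used, and `kurt()` reports excess kurtosis, so a normal distribution scores close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 20_000

# Synthetic samples, one per shape discussed above.
symmetric = pd.Series(rng.normal(size=n))             # bell curve
right_skewed = pd.Series(rng.exponential(size=n))     # long tail on the right
left_skewed = -right_skewed                           # long tail on the left
heavy_tailed = pd.Series(rng.standard_t(df=5, size=n))
flat = pd.Series(rng.uniform(size=n))

print("skew:", symmetric.skew(), right_skewed.skew(), left_skewed.skew())
print("excess kurtosis:", symmetric.kurt(), heavy_tailed.kurt(), flat.kurt())
```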
Interpreting Box Plots



Histograms and Box Plots



Wrangling
Techniques
Standardization vs
Normalization

• Standardization typically
means rescaling data to have a
mean of 0 and a standard
deviation of 1 (unit variance).
• Normalization typically means
rescaling the values into a
range of [0,1] or [-1,1].
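
Both rescalings are one-line formulas. This sketch applies them to a made-up price column with plain pandas; the population standard deviation (`ddof=0`) is an assumption chosen so the standardized column has a standard deviation of exactly 1.

```python
import pandas as pd

# Hypothetical prices.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (prices - prices.mean()) / prices.std(ddof=0)

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
normalized = (prices - prices.min()) / (prices.max() - prices.min())

print(standardized.round(3).tolist())
print(normalized.tolist())
```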



Discretization

• Discretization is the
process through which
we can transform
continuous variables,
models or functions
into a discrete form.
• It can also be applied to
categorical variables to
reduce the number of
possible groups.
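
A minimal sketch of both uses, with hypothetical data: `pd.cut` bins a continuous price into three bands (the bin edges and labels are assumptions), and rare categories of a made-up city column are merged into a single "Other" group.

```python
import pandas as pd

# Continuous variable -> discrete bands.
prices = pd.Series([5, 18, 42, 75, 120, 260])
bins = [0, 20, 100, float("inf")]          # intervals (0, 20], (20, 100], (100, inf]
labels = ["low", "medium", "high"]
price_band = pd.cut(prices, bins=bins, labels=labels)

# Categorical variable -> fewer groups: categories seen only once become "Other".
city = pd.Series(["Karachi", "Lahore", "Quetta", "Karachi", "Multan"])
counts = city.value_counts()
rare = counts[counts < 2].index
city_grouped = city.where(~city.isin(rare), "Other")

print(price_band.tolist())
print(city_grouped.tolist())
```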



Example – Price of commonly sold products



Outlier Analysis
Outliers Vs Anomalies

Outlier: usually a single observation that is extreme relative to the
“Median” and can fall on either side of it.

Anomaly: observations (usually more than one) that do not conform to
the pattern exhibited by a certain variable.



Outliers- Example



Outliers- Example



Outliers with Box Plots



Outliers

Outliers in data may contain valuable information.

Or they may be meaningless aberrations caused by measurement and recording/data entry errors.

For example, failing to convert a weight between units, or making a typo in a sales value with an additional zero.

Investigate why they are occurring. Where, and what, might the meaning be?

The answer could differ from business to business, but it is important to have the
conversation rather than ignore the data.



Outliers Testing and Visualization

• Visualization: boxplot and scatterplot

• The Tietjen-Moore test is useful for detecting multiple outliers in a
data set. The null hypothesis for this test is that there are no
outliers in the data.
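
The boxplot view of outliers can also be reproduced numerically. The sketch below uses the common 1.5 × IQR whisker convention, which is an assumption here since the slides do not state which rule the boxplots use, on made-up sales values where one entry has an extra zero.

```python
import pandas as pd

# Hypothetical sales values; 1300 looks like 130 typed with an additional zero.
sales = pd.Series([120, 130, 125, 128, 135, 122, 1300])

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Points beyond the whiskers are flagged as candidate outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers.tolist())
```

Flagging is only the detection step; whether a flagged value is an error or a meaningful signal still has to be investigated.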



What should I do with outliers?

• The treatment depends largely on the business needs.
• A good BI dashboard should be able to detect outliers for the
right decision making at the right time.
• Outlier detection is important; treatment depends on the
requirements of the analysis.
• Removal/imputation may become important when it is essential
to have a normal distribution for some statistical testing or
machine learning algorithms.



Missing Value
Analysis (MVA)
Missing Values

• Missing values are
usually represented in
the form of NaN or Null
or None in the dataset.
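
In pandas, both `NaN` and `None` are treated as missing, so they can be counted with `isna()`. A small sketch on a made-up table:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries written as None and NaN.
df = pd.DataFrame({
    "product": ["A", "B", None, "D"],
    "price": [10.0, np.nan, 30.0, np.nan],
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

print(missing_count)
print(missing_share)
```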
Dropping Rows and Columns

• The data is not in use, i.e., not useful for your analysis
• The column contains the same value throughout (with or without
missing values)
• Very few rows have missing values in comparison to the full size of
the dataset, and information in multiple columns is missing.
• Use this method in extreme cases when there are too many null values
in the column or row.
• Tradeoff: Loss of information.
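
The dropping rules above can be sketched as follows. The table and the 50% threshold for "too many null values" are assumptions made for illustration, not values from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical table: a useful column, a constant column, a mostly empty column.
df = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0],
    "store": ["X", "X", "X", "X"],
    "notes": [np.nan, np.nan, np.nan, "promo"],
})

# 1. Drop columns where more than half of the values are missing.
mostly_missing = df.columns[df.isna().mean() > 0.5]
reduced = df.drop(columns=mostly_missing)

# 2. Drop columns that contain a single repeated value.
constant = [col for col in reduced.columns if reduced[col].nunique() <= 1]
reduced = reduced.drop(columns=constant)

# 3. Drop the few remaining rows that still have missing values.
cleaned = reduced.dropna()

print(cleaned)
```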



Imputation

NUMERICAL
1. Filling the missing data with the mean.
2. Filling the missing data with the median.
CATEGORICAL
1. Filling the missing data with the mode.
2. Filling with a new category that represents the missing values.
Last observation carried forward (LOCF)
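
Each of these simple imputation options maps onto a pandas call. The sketch below uses a made-up table; the "Unknown" label for the new category is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, 100.0],
    "category": ["a", None, "a", "b"],
})

# Numerical: fill with the mean or with the median.
mean_filled = df["price"].fillna(df["price"].mean())
median_filled = df["price"].fillna(df["price"].median())

# Categorical: fill with the mode or with a new category.
mode_filled = df["category"].fillna(df["category"].mode()[0])
new_category = df["category"].fillna("Unknown")

# Last observation carried forward: repeat the previous known value.
locf = df["price"].ffill()

print(median_filled.tolist(), mode_filled.tolist(), locf.tolist())
```

With the extreme value 100 in the column, the mean fill (about 41.3) is pulled far from the typical prices, while the median fill (14) is not.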



Interpolation – Linear

• Linear interpolation approximates a
missing value by joining the
neighboring known points, taken in
increasing order, with a straight line.
• In a nutshell, it estimates the
unknown value so that it follows the
same ordered trend as the known
values around it.
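
A minimal sketch with pandas `Series.interpolate`, on made-up values: the two gaps are filled with evenly spaced points on the straight line between the known neighbors 10 and 40.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with two consecutive gaps.
readings = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation treats the values as equally spaced by position.
filled = readings.interpolate(method="linear")

print(filled.tolist())
```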



Forward Interpolation



Imputation by KNN

• The k-nearest-neighbors (kNN) algorithm is a fundamental
classification approach.
• The outcome of kNN classification is a class membership.
• If k = 1, the item is simply assigned to the class of the item’s
closest neighbor.
• For imputation, find the k closest neighbours of the observation
with missing data, then impute the missing values based on the
non-missing values in that neighborhood.
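
One way to try this is scikit-learn's `KNNImputer`, sketched below on a tiny made-up array. The choice of library and of `n_neighbors=2` are assumptions; the slides do not say which implementation the course uses. The missing entry is replaced by the average of that feature over the two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Distances are computed on the features that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The two nearest rows are [1, 2] and [3, 6], so the fill is (2 + 6) / 2.
print(X_imputed[1])
```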



MICE - Multiple Imputation by Chained
Equation

• Multiple Imputation by Chained Equation assumes that data is MAR,
i.e. missing at random.
• Sometimes the data missing in a dataset is related to the other
features and can be predicted using the other feature values.
• In such cases it should not be imputed with the general approaches
of using the mean, mode, or median.



IterativeImputer class

• Models each feature with missing values as a function of other
features and uses that estimate for imputation.
• It does so in an iterated round-robin fashion: at each step, a feature
column is designated as output y and the other feature columns are
treated as inputs X.
• A regressor is fit on (X, y) for known y. Then, the regressor is used to
predict the missing values of y. This is done for each feature in an
iterative fashion, and then is repeated for max_iter imputation rounds.
The results of the final imputation round are returned.
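
The round-robin procedure described above can be tried with scikit-learn's `IterativeImputer`. This is a sketch under stated assumptions: the data is made up so that the second feature is exactly twice the first, and the default regressor is used. The class is still marked experimental in scikit-learn, which is why the extra `enable_iterative_imputer` import is required.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data where the second feature is 2 x the first; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
    [6.0, 12.0],
])

# Each feature with missing values is regressed on the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

# The filled value should land close to 8, following the relationship in the data.
print(X_imputed[3])
```

A single run like this returns one completed dataset. Producing several imputed datasets, as the "multiple" in MICE implies, would mean repeating the run with different seeds and `sample_posterior=True`.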
Python Notebook
Required Imports

# Import the basic libraries
import pandas as pd               # data frames
import numpy as np                # numerical arrays
import matplotlib.pyplot as plot  # plotting
import missingno as mano          # missing-value visualizations
%matplotlib inline



df.dtypes

