
Unit III

Introduction To ML

Introduction
The advancements in computing and communicating technologies have resulted in an
explosion of connected and interacting digital devices. In the current digital age, the amount
of data we produce every day is truly mind-boggling. A single click on any website, social
media usage, online transactions, smartphone use, streaming audio and video, and many more
such activities generate enormous amounts of data. Data has evolved dramatically in recent
years, increasing in volume, velocity, and variety. Analyzing this data and deriving
meaningful insights from it is a huge challenge. So in this era of computing and
information science, Machine Learning has become a rapidly growing and challenging area
for research and development activities.
Machine Learning is used for everything from automating mundane tasks to offering intelligent
insights, and industries in every sector try to benefit from it. You may already be using a device
that utilizes it, for example, a wearable fitness tracker like Fitbit or an intelligent home
assistant like Google Home.
Machine learning is the field of science where computer algorithms are used to
learn from past data and information. In machine learning, computers don't have to be
explicitly programmed; they can change and improve their algorithms by learning on
their own.
Hence, machine learning techniques have become key components of many intelligent
systems.
Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Many development organizations are now
looking for innovative ways to control their data assets to gain competitive edge and to help
the business to gain a new level of understanding. Here, machine learning plays a very
important role.

3.1 History and Evolution of ML


 In 1952, Arthur Samuel, a pioneer of machine learning, created a program
that helped an IBM computer play the game of checkers. This program observed and
understood the positions in the game, and then learnt and developed a model that
finds better moves for the player. Arthur Samuel later coined the phrase
"Machine Learning" in 1959.
 In 1957, Frank Rosenblatt collectively used Donald Hebb's model of brain cell
interaction along with Arthur Samuel's Machine Learning efforts and created the
perceptron at the Cornell Aeronautical Laboratory. Rosenblatt proposed the first
simple neural network unit called perceptron, which simulates the thought processes
of the human brain.
 In 1960, the multi-layer perceptron was proposed, showing that adding more
layers provides significantly more processing power than a single-layer perceptron.
 In 1967, the nearest neighbor algorithm was invented for pattern recognition; it was
used to map routes and to find the most efficient route for the traveling
salesperson problem.
 Later, in 1970, the multilayer perceptron with backpropagation was invented, which allows
a network to adjust its hidden layers of neurons to adapt to new situations.
Backpropagation is now used to train deep learning neural networks.
 In the early 1980s, the artificial intelligence approach shifted from algorithms to logical,
knowledge-based approaches.
 In 1997, IBM's Deep Blue computer won a chess match against the world champion
Garry Kasparov, becoming the first computer to beat a reigning human chess
champion.
 In 2006, the Face Recognition Grand Challenge - a National Institute of Standards
and Technology program - evaluated the popular face recognition algorithms of the
time. 3D face scans, iris images, and high-resolution face images were tested.
Geoffrey Hinton rebooted neural net research as deep learning. Today, the internet
users use his techniques to improve tools like voice recognition and image tagging.
 In 2012, Google created a neural network to recognize humans and cats in YouTube
videos without ever being told how to characterize either.
 In 2014, Facebook developed DeepFace, an algorithm capable of recognizing or
verifying individuals in photographs with the same accuracy as humans.
 Modern machine learning algorithms have now moved out of the labs and into our daily lives,
with applications across industry, home, games, education and research. Today this
technology takes amazing forms such as self-driving cars, airplanes on autopilot, and
self-learning drones. It spans everything from improving in-store retail experiences
with IoT to boosting security with biometric data to predicting and diagnosing
diseases.

AI vs ML

Machine Learning and Artificial Intelligence are two closely related but distinct
fields within the broader field of computer science. Machine learning is a part of AI
that helps machines learn from data and get better over time without being told
exactly what to do. So, all machine learning is AI, but not all AI is machine
learning. AI can include things like robots or voice assistants, while machine
learning focuses more on learning from patterns in data to make predictions or
decisions.

Machine learning is the brain behind AI, teaching machines to learn from data and
make smarter decisions.
Imagine a smart chef named Alex who can prepare any dish you ask for. Alex doesn’t
need instructions; they know every recipe by heart and can even come up with new
dishes. Alex represents Artificial Intelligence (AI), a system that mimics human
intelligence to make decisions and solve problems on its own.
Now, meet Jamie, Alex’s learning assistant. Jamie is great at chopping vegetables
and following recipes but doesn’t know how to cook creatively. Jamie learns over
time by observing Alex and practicing recipes repeatedly. For instance, if Jamie
makes a mistake in seasoning one day, they adjust it the next time until they perfect it.
This story highlights that while ML is a subset of AI, they each have unique roles and
serve different purposes.

Artificial Intelligence | Machine Learning
AI is a broader field focused on creating systems that mimic human intelligence, including reasoning, decision-making, and problem-solving. | ML is a subset of AI that focuses on teaching machines to learn patterns from data and improve over time without explicit programming.
The main goal of AI is to develop machines that can perform complex tasks intelligently, similar to how humans think and act. | ML focuses on finding patterns in data and using them to make predictions or decisions. It aims to help systems improve automatically with experience.
AI systems aim to simulate human intelligence and can perform tasks across multiple domains. | ML focuses on training systems for specific tasks, such as prediction or classification.
AI aims to create systems that can think, learn, and make decisions autonomously. | ML aims to create systems that learn from data and improve their performance for a particular task.
AI has a wider application range, including problem-solving, decision-making, and autonomous systems. | ML applications are typically narrower, focused on tasks like pattern recognition and predictive modeling.
AI can operate with minimal human intervention, depending on its complexity and design. | ML requires human involvement for data preparation, model training, and optimization.
AI produces intelligent behavior, such as driving safely, responding to customer queries, or diagnosing diseases, and can adapt to changing scenarios. | ML generates predictions or classifications based on data, such as predicting house prices, identifying objects in images, or categorizing emails.
AI involves broader goals, including natural language processing, vision, and reasoning. | ML focuses specifically on building models that identify patterns and relationships in data.
Examples: Robotics, virtual assistants like Siri, autonomous vehicles, and intelligent chatbots. | Examples: Recommender systems, fraud detection, stock price forecasting, and social media friend suggestions.

3.2 Machine Learning Life cycle

Machine learning gives computer systems the ability to learn automatically without
being explicitly programmed. But how does a machine learning system work? It can be
described using the machine learning life cycle. The machine learning life cycle is a cyclic
process to build an efficient machine learning project. The main purpose of the life cycle is to
find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given below:
 Gathering Data
 Data preparation
 Data Wrangling
 Analyse Data
 Train the model
 Test the model
 Deployment

The most important thing in the complete process is to understand the problem and to
know the purpose of the problem. Therefore, before starting the life cycle, we need to
understand the problem, because a good result depends on a good understanding of the
problem.
In the complete life cycle process, to solve a problem, we create a machine learning
system called "model", and this model is created by providing "training". But to train a
model, we need data, hence, life cycle starts by collecting data.

1.Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step
is to identify the data sources and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle. The quantity and quality of the collected data will determine
the efficiency of the output: the more data we have, the more accurate the prediction
will be.
This step includes the below tasks:
* Identify various data sources
* Collect data
* Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset, which
will be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a
step where we put our data into a suitable place and prepare it to use in our machine learning
training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
 Data exploration: It is used to understand the nature of the data that we have to work
with. We need to understand the characteristics, format, and quality of the data. A better
understanding of data leads to an effective outcome. In this step, we find correlations,
general trends, and outliers.
 Data pre-processing:
Now the next step is pre-processing of data for its analysis.

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variables to use, and transforming
the data into a proper format to make it more suitable for analysis in the next step. It is one of
the most important steps of the complete process. Cleaning of data is required to address
quality issues: the data we have collected is not always useful, as some of it may be
irrelevant or flawed. In real-world applications, collected data may have various issues,
including:
* Missing values
* Duplicate data
* Invalid data
* Noise
So, we use various filtering techniques to clean the data. It is mandatory to detect and remove
these issues because they can negatively affect the quality of the outcome.

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
* Selection of analytical techniques
* Building models
* Review the result
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with determining the
type of problem; we then select a machine learning technique such as classification, regression,
cluster analysis, or association, build the model using the prepared data, and evaluate the
model. Hence, in this step, we take the data and use machine learning algorithms to build the
model.

5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its
performance for a better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can understand the various patterns, rules, and features.
Model training is an iterative process where the algorithm adjusts its parameters to minimize
errors and enhance predictive accuracy. During this phase the model fine-tunes itself for a
better understanding of the data and optimizes its ability to make predictions. A rigorous training
process ensures that the trained model works well with new, unseen data and makes reliable
predictions in real-world scenarios.
Here are the basic features of Model Training:
 Training Data: Expose the model to historical data to learn patterns, relationships
and dependencies.
 Iterative Process: Train the model iteratively, adjusting parameters to minimize
errors and enhance accuracy.
 Optimization: Fine-tune model to optimize its predictive capabilities.
 Validation: Rigorously validate the model to ensure it remains accurate on new, unseen data.

6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirements of the project or problem.
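To make the train and test steps concrete, here is a minimal sketch using scikit-learn. The built-in Iris dataset and the decision tree classifier are stand-ins chosen for illustration, not prescribed by the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset as a stand-in for your own prepared data
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)   # any supervised algorithm could be used here
model.fit(X_train, y_train)                       # train the model (step 5)

y_pred = model.predict(X_test)                    # test the model on unseen data (step 6)
print("Test accuracy:", accuracy_score(y_test, y_pred))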

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the
model in the real-world system.
If the prepared model produces accurate results as per our requirements with acceptable
speed, then we deploy the model in the real system. But before deploying the
project, we check whether it is improving its performance using the available data or not.
The deployment phase is similar to making the final report for a project.
Upon successful evaluation, the machine learning model is ready for deployment in a real-world
application. Model deployment involves integrating the predictive model with existing
systems, allowing the business to use it for informed decision-making.
Here are the basic features of Model Deployment:
 Integration: Integrate the trained model into existing systems or processes for real-
world application.
 Decision Making: Use the model's predictions for informed decision.
 Practical Solutions: Deploy the model to transform theoretical insights into practical
use that address business needs.
 Continuous Improvement: Monitor model performance and make adjustments as
necessary to maintain effectiveness over time.

3.3 Different Forms of Data

1. Data Mining
Definition:
Data mining is the computational process of discovering patterns, trends, correlations, or
anomalies from large datasets using methods from statistics, machine learning, and database
systems.
Steps in Data Mining Process

1. Data Collection (Data Understanding)


Goal: Acquire relevant data from multiple sources.
 Sources include: databases, data warehouses, IoT sensors, social media, web logs, etc.
 Raw data might be unstructured or semi-structured.
Example: Collecting transaction data from retail stores and customer feedback from surveys.

2. Data Cleaning
Goal: Improve data quality by removing inconsistencies and errors.
 Remove duplicate records
 Fill or eliminate missing values
 Smooth noisy data
 Detect and fix outliers
Example: Removing customers who have incomplete transaction records or incorrect email
addresses.

3. Data Integration
Goal: Combine data from multiple heterogeneous sources into a unified view.
 Resolve data format conflicts
 Match schemas (e.g., merging “customer_ID” from two different databases)
Example: Combining in-store purchases and online purchases into one dataset per customer.

4. Data Selection
Goal: Select relevant data for mining tasks.
 Focus only on attributes/features useful for your objective
 Reduces dimensionality and speeds up processing
Example: From a dataset of 100 variables, select 20 features that influence customer churn.

5. Data Transformation
Goal: Convert data into suitable format for mining.
 Normalize numerical data (e.g., scale to 0–1)
 Aggregate or summarize data
 Encode categorical variables (e.g., one-hot encoding)
Example: Converting purchase amounts to a normalized scale so that clustering isn't biased.
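As a rough sketch of this transformation step, the snippet below normalizes a numeric column and one-hot encodes a categorical one with pandas and scikit-learn; the column names and values are made up for illustration.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical retail data (column names are illustrative)
df = pd.DataFrame({
    "purchase_amount": [120.0, 45.5, 980.0, 15.0],
    "channel": ["online", "in-store", "online", "in-store"],
})

# Normalize numerical data to the 0-1 range
df[["purchase_amount"]] = MinMaxScaler().fit_transform(df[["purchase_amount"]])

# One-hot encode the categorical variable
df = pd.get_dummies(df, columns=["channel"])
print(df)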

6. Data Mining (Core Step)


Goal: Use algorithms to discover patterns or models.
 Techniques:
o Classification (Decision Trees, SVM)
o Clustering (K-means, DBSCAN)
o Association (Apriori, FP-Growth)
o Anomaly Detection
o Regression (Linear, Logistic)
Example: Discovering that customers aged 30–40 buying fitness gear are likely to buy
protein supplements.

7. Pattern Evaluation
Goal: Identify patterns that are truly interesting and useful.
 Remove redundant or irrelevant patterns
 Use metrics like support, confidence, lift (in association rules)
Example: From 500 rules found, keep only those with high confidence and lift.
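To illustrate the support, confidence, and lift metrics mentioned above, here is a small Python calculation using made-up transaction counts.

# Hypothetical counts from 1,000 transactions (illustrative numbers, not from the text)
n_transactions = 1000
count_A = 200          # transactions containing item A
count_B = 150          # transactions containing item B
count_A_and_B = 90     # transactions containing both A and B

support = count_A_and_B / n_transactions          # 0.09 -> 9% of all transactions
confidence = count_A_and_B / count_A              # 0.45 -> 45% of A-buyers also buy B
lift = confidence / (count_B / n_transactions)    # 3.0  -> 3x more likely than chance

print(support, confidence, lift)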

8. Knowledge Presentation
Goal: Visualize and present the mined knowledge to stakeholders.
 Charts, graphs, dashboards
 Interactive visualizations using tools like Tableau, Power BI
 Model interpretations for non-technical users
Example: Creating a dashboard showing customer clusters based on buying habits.

Example Use Case: Retail Customer Segmentation


1. Collect sales and demographic data.
2. Clean missing or incorrect entries.
3. Merge online and in-store data.
4. Select variables like age, income, purchase frequency.
5. Normalize spending patterns.
6. Use K-means clustering to group customers.
7. Evaluate cluster quality using silhouette score.
8. Present insights to marketing: “Group A = Loyal Bargain Shoppers; Group B = High-
Spenders.”
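A minimal sketch of steps 5-7 of this use case (normalizing features, clustering with K-means, and checking the silhouette score) with scikit-learn; the customer values are invented for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical customer features: [age, income, purchase_frequency]
customers = np.array([
    [25, 30000, 4], [32, 42000, 6], [45, 80000, 2],
    [51, 95000, 1], [23, 28000, 5], [47, 88000, 2],
])

X = StandardScaler().fit_transform(customers)                    # normalize spending patterns
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # group customers

print("Cluster labels:", kmeans.labels_)
print("Silhouette score:", silhouette_score(X, kmeans.labels_))  # evaluate cluster quality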
Features:
 Extracts hidden patterns from large datasets
 Often unsupervised or semi-supervised
 Uses clustering, classification, regression, association rules
Example:
A retail chain uses data mining to identify purchasing patterns of customers, such as:
 “Customers who buy diapers often buy beer on weekends.”
Data Mining Techniques

Technique | Description | Example
Classification | Assign items to predefined classes | Email → Spam or Not Spam
Clustering | Group similar items without predefined labels | Market segmentation
Association Rules | Find relationships between items | "People who buy X also buy Y"
Anomaly Detection | Identify rare or unusual patterns | Fraud detection in banking
Regression | Predict continuous values | Predicting house prices

 Association Rule Learning (e.g., Apriori algorithm)
 Clustering (e.g., K-means)
 Classification (e.g., Decision Trees, SVM)

2. Data Analytics
Definition:
Data analytics refers to the process of examining datasets to draw conclusions about the
information they contain. It is often used in decision-making and business intelligence.
Types:
 Descriptive Analytics: What happened?
Example: Sales increased by 15% last quarter.
 Diagnostic Analytics: Why did it happen?
Example: Sales increased due to a holiday promotion.
 Predictive Analytics: What will happen?
Example: Forecasting next quarter's sales using historical trends.
 Prescriptive Analytics: What should be done?
Example: Recommending discount strategies to boost future sales.

Type | Purpose | Example
Descriptive | What happened? | “Sales increased by 10% in Q.”
Diagnostic | Why did it happen? | “Sales rose due to the new marketing campaign.”
Predictive | What is likely to happen? | “Sales are expected to grow 15% next quarter.”
Prescriptive | What should be done? | “Launch a loyalty program to retain top customers.”
Data Analytics Process

1. Define the Objective


Goal: Clearly identify the problem or question to be answered.
 Understand what the business or project is trying to achieve.
 Frame specific, measurable, and actionable questions.
Example:
“Why are customer returns increasing in the past quarter?”

2. Data Collection
Goal: Gather relevant data needed to answer your question.
 Sources: Databases, APIs, CRM systems, web logs, surveys, IoT sensors, social
media.
 Ensure data is timely, accurate, and relevant.
Example:
Collect data on customer purchases, return reasons, product reviews, and support
tickets.

3. Data Cleaning & Preparation


Goal: Ensure data quality and consistency.
 Remove duplicates
 Handle missing values (e.g., using averages, medians, or deletion)
 Correct formatting issues (e.g., date formats, currencies)
 Detect and remove outliers if necessary
Example:
A dataset with 5% missing "return reasons" may require filling based on product type
or customer segment.

4. Data Exploration (EDA – Exploratory Data Analysis)


Goal: Understand the data through visualization and summary statistics.
 Use graphs (bar charts, histograms, box plots)
 Calculate mean, median, standard deviation
 Look for trends, anomalies, and relationships
Example:
EDA might show that most returns happen on electronics purchased during
promotions.
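A small EDA sketch with pandas along these lines; the column names and values are hypothetical, not taken from the text.

import pandas as pd

# Hypothetical returns data
returns = pd.DataFrame({
    "category": ["electronics", "clothing", "electronics", "toys", "electronics"],
    "on_promotion": [True, False, True, False, True],
    "refund_amount": [120.0, 25.0, 310.0, 15.0, 95.0],
})

print(returns.describe())                                    # summary statistics
print(returns.groupby("category")["refund_amount"].mean())   # trend by product category
print(returns["on_promotion"].value_counts())                # how many returns involved promotions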

5. Data Modeling (Optional/Advanced)


Goal: Apply statistical or machine learning models to predict or classify data.
 Techniques:
o Regression (predict numeric values)
o Classification (categorize outcomes)
o Clustering (group similar observations)
o Time Series Forecasting
Example:
Use logistic regression to predict which customers are likely to return a product.

6. Data Interpretation & Visualization


Goal: Present insights in a clear, actionable manner.
 Use dashboards, reports, infographics
 Storytelling with data (what’s happening, why, and what to do)
 Tailor visualizations to audience (executives vs analysts)
Example:
Present a dashboard showing return rates by product, season, and location.

7. Decision-Making and Action


Goal: Use insights to guide business strategies.
 Communicate findings to stakeholders
 Implement recommendations
 Monitor outcomes post-implementation
Example:
Decide to reduce promotional discounts on electronics or revise product descriptions
to reduce returns.
Tools Commonly Used
 Basic: Excel, Google Sheets
 BI Tools: Tableau, Power BI, Qlik
 Programming: Python (Pandas, NumPy, Matplotlib), R
 Databases: SQL, NoSQL
 Advanced: Apache Spark, SAS, Snowflake
Example 1: Online Retailer Analytics
Problem: A retailer wants to understand why customer churn is increasing.
 Descriptive: Review past sales and customer feedback.
 Diagnostic: Find that churn is highest in customers not using discount codes.
 Predictive: Train a model to predict who is likely to churn next month.
 Prescriptive: Recommend sending targeted discounts to at-risk customers.
Example 2:
An e-commerce company analyzes customer behavior to improve user experience and boost
sales conversion rates.

3. Statistical Data
Definition:
Statistical data consists of data collected for analysis using statistical methods to summarize,
interpret, and draw conclusions.
Types:
Statistical data can be broadly categorized based on measurement levels and nature of the
values:
 Qualitative (Categorical): Gender, Colors, Brands
 Quantitative (Numerical): Age, Height, Sales
1. Quantitative Data (Numerical Data)
This data type consists of numbers that represent measurable quantities. It can be discrete or
continuous.
a) Discrete Data
 Definition: Countable and finite numbers.
 Examples:
o Number of students in a class (20, 30)
o Number of clicks on a website
o Number of cars in a parking lot
Features:
o Gaps between values
o No decimals or fractions
o Often used in classification tasks
b) Continuous Data
 Definition: Measurable values that can take any value within a range.
 Examples:
o Height, weight, temperature
o Time, speed, distance
 Features:
o Infinite possibilities within a range
o Can include decimals
o Important for regression tasks

2. Qualitative Data (Categorical Data)


This data type represents categories or labels rather than numbers.
a) Nominal Data
 Definition: Categories with no inherent order or ranking.
 Examples:
o Gender (Male, Female, Other)
o Eye color (Blue, Green, Brown)
o Country (India, USA, UK)
 Features:
o Purely labels
o Cannot be used in mathematical operations
o Often encoded using one-hot encoding for ML models
b) Ordinal Data
 Definition: Categories with a meaningful order, but intervals are not equal.
 Examples:
o Movie ratings (Poor, Average, Good, Excellent)
o Education level (High School < Bachelor < Master < PhD)
o Customer satisfaction (Low, Medium, High)
 Features:
o Order matters
o Useful in ranking problems
o Distance between categories is not measurable

3. Binary Data (Special case of Nominal)


 Definition: Only two categories or states.
 Examples:
o Yes/No, 1/0, True/False
 Application:
o Frequently used in classification tasks (e.g., spam vs. not spam)
How These Types Affect Machine Learning:

Data Type Typical Use Preprocessing Methods

Discrete Classification Label encoding, scaling

Continuous Regression, clustering Normalization, standardization

Nominal Classification One-hot encoding

Ordinal Classification, ranking Label encoding (preserving order)

Binary Binary classification No transformation or binary labels

Descriptive Statistics
Descriptive statistics are used to summarize and describe the main features of a dataset.
They provide a simple overview of the distribution, central tendency, and variability of the
data.
___________________________________________________________________________
1. Measures of Central Tendency
These describe the center of a data distribution.
a) Mean (Average)
 Formula: Mean = (x₁ + x₂ + … + xₙ) / n = (Σxᵢ) / n
 Example:
Data: 4, 5, 6, 7, 8
Mean = (4+5+6+7+8) / 5 = 6
b) Median
 The middle value when data is sorted.
 Example:
Data: 3, 6, 7, 9, 12
Median = 7
If even number of values:
Data: 2, 4, 6, 8 → Median = (4+6)/2 = 5
c) Mode
 The value that appears most frequently.
 Example:
Data: 2, 4, 4, 5, 6 → Mode = 4
___________________________________________________________________________
2. Measures of Dispersion (Spread or Variability)
a) Range
 Difference between the maximum and minimum values.
 Example:
Data: 10, 15, 20, 25
Range = 25 - 10 = 15
b) Variance
 Average of the squared deviations from the mean.
 Example:
Data: 4, 6, 8
Mean = 6
Variance = [(4-6)² + (6-6)² + (8-6)²] / 3 = (4 + 0 + 4)/3 = 2.67
c) Standard Deviation (SD)
 The square root of variance; shows how much data deviates from the mean.
 SD = √Variance
 From the above example:
SD = √2.67 ≈ 1.63
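The figures above can be reproduced with Python's statistics module and NumPy (note that np.var and np.std return the population variance and standard deviation by default):

import statistics
import numpy as np

data = [4, 6, 8]                         # the data used in the variance example above
print(statistics.mean(data))             # 6
print(statistics.median(data))           # 6
print(np.var(data))                      # population variance = 2.666... (≈ 2.67)
print(np.std(data))                      # population standard deviation ≈ 1.63
print(statistics.mode([2, 4, 4, 5, 6]))  # mode example from above → 4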

3. Shape of Distribution
Descriptive stats also help identify:
a) Skewness
 Indicates asymmetry:
o Right-skewed (positive): long tail on the right
o Left-skewed (negative): long tail on the left
b) Kurtosis
 Describes the "peakedness" of a distribution:
o High kurtosis: sharp peak
o Low kurtosis: flat curve
Use Cases in ML and Data Science

Measure Why It's Useful in ML

Mean, Median, Mode Understand central values for features

Range, SD, Variance Detect outliers or spread of data

Skewness, Kurtosis Identify need for transformation

___________________________________________________________________________

Statistical Concepts in Machine Learning


1. Probability Distributions
Definition: A probability distribution describes how the values of a random variable are
distributed.
Common Distributions:

Distribution Use Case in ML Example Use

Normal (Gaussian) Many ML models assume normality Linear Regression, Naive Bayes

Bernoulli Binary outcomes (1 or 0) Logistic Regression

Binomial Count of success/failure in trials Click prediction

Poisson Number of events in fixed time Web traffic prediction

Uniform Equal probability for all outcomes Random selection or sampling

Why it matters: Many ML algorithms (e.g., Naive Bayes, Gaussian Mixture Models) rely on
understanding how data is distributed.

2. Hypothesis Testing
Definition: A statistical method to test assumptions (hypotheses) about a population.
Concepts:
 Null Hypothesis (H₀): No effect or no difference.
 Alternative Hypothesis (H₁): There is an effect or difference.
 p-value: Probability of obtaining results at least as extreme as the observed, assuming
H₀ is true.
 Significance Level (α): Threshold to reject H₀ (e.g., 0.05 or 5%).
Example:
 H₀: “Feature X has no impact on target Y”
 If p-value < α, reject H₀ → Feature X is statistically significant.
ML Use: Feature selection, model comparison, A/B testing.
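A hedged sketch of hypothesis testing in practice: a two-sample t-test with SciPy on made-up A/B test values.

import numpy as np
from scipy import stats

# Hypothetical A/B test data: conversion rates observed for two groups
group_a = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16])
group_b = np.array([0.18, 0.21, 0.17, 0.19, 0.22, 0.20])

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # H0: the group means are equal
alpha = 0.05                                          # significance level

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 (the groups differ significantly)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")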

3. Correlation & Covariance


a) Correlation
 Measures the strength and direction of a linear relationship between two variables.
 Values range from -1 to +1:
o +1: Perfect positive
o -1: Perfect negative
o 0: No correlation
Example: Height and weight might have high positive correlation.
b) Covariance
 Measures how two variables change together.
 Positive: variables increase together.
 Negative: one increases, the other decreases.
ML Use: Feature selection, dimensionality reduction (PCA).
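A quick NumPy illustration of correlation and covariance, using invented height and weight values:

import numpy as np

height = np.array([150, 160, 170, 180, 190])   # cm
weight = np.array([52, 60, 68, 77, 85])        # kg

print(np.corrcoef(height, weight)[0, 1])   # close to +1: strong positive linear correlation
print(np.cov(height, weight)[0, 1])        # positive covariance: the variables increase together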

4. Bayes’ Theorem
Formula: P(A|B) = [P(B|A) × P(A)] / P(B)

Interpretation: The probability of A given B.


Use in ML:
 Fundamental in Naive Bayes classifier.
 Used when updating beliefs based on new evidence.
Example: Spam filtering — update the probability of an email being spam given the presence
of certain words.
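A tiny worked version of the spam example, with made-up probabilities purely for illustration:

# Illustrative (made-up) probabilities for the spam example
p_spam = 0.2                 # P(spam): prior probability that any email is spam
p_word_given_spam = 0.5      # P(word | spam)
p_word_given_ham = 0.05      # P(word | not spam)

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)   # total probability of the word
p_spam_given_word = (p_word_given_spam * p_spam) / p_word               # Bayes' theorem

print(round(p_spam_given_word, 3))   # ≈ 0.714: seeing the word raises P(spam) from 0.2 to about 0.71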

5. Statistical Inference
Definition: Drawing conclusions about a population based on sample data.
Includes:
 Confidence Intervals: Estimate a population parameter range with confidence (e.g.,
95% CI).
 Z-test / t-test: Comparing means when population variance is known/unknown.
 Chi-square test: For categorical variables.
ML Use: Model diagnostics, evaluating model assumptions, interpreting model parameters.

6. Descriptive vs Inferential Statistics

Category Description ML Relevance

Descriptive Summarizes data Exploratory Data Analysis (EDA)

Inferential Makes predictions/inferences Model estimation & testing

7. Statistical Assumptions in ML Models


Some ML models (especially classical ones) rely on:
 Linearity of data (e.g., Linear Regression)
 Independence of observations
 Homoscedasticity (equal variance)
 Normality of errors
If violated, these assumptions may lead to poor performance or invalid inferences.

Summary Table: Statistical Concepts and Their Roles

Concept Function in ML

Probability Modeling uncertainty, data generation processes

Hypothesis Testing Feature selection, A/B testing

Correlation Identify relationships, multicollinearity

Bayes’ Theorem Probabilistic models, spam filters

Inference Generalize from samples, interpret models

Distributions Used in assumptions of many ML algorithms


Role in Machine Learning

Aspect Statistical Role

Data Analysis Summarize and explore features

Feature Selection Use correlation, hypothesis testing

Model Building Regression, probabilistic models

Model Evaluation Use metrics like accuracy, precision, recall, ROC

Inference Draw conclusions from data

Example Use Cases


 Linear Regression: Predict continuous outcomes using least squares.
 Logistic Regression: Classification using sigmoid function.
 PCA (Principal Component Analysis): Uses covariance to reduce dimensions.
Example:
A university analyzes the average GPA of students to understand academic performance
across departments.
Common Pitfalls
 Misinterpreting correlation as causation
 Ignoring data skewness or outliers
 Overfitting due to poor statistical assumptions

________________________________________________________________________

4. Statistics vs. Data Mining

Aspect | Statistics | Data Mining
Purpose | Understanding data & making inferences | Discovering unknown patterns in data
Approach | Hypothesis-driven | Data-driven (often exploratory)
Data Size | Small to medium | Large-scale, big data
Methodology | Focus on sampling, distributions, hypothesis testing | Machine learning, pattern recognition
Tools | SPSS, R, SAS | Weka, RapidMiner, Python (Scikit-learn)

Example:
 Statistics: Testing if a new drug is effective through a clinical trial.
 Data Mining: Uncovering hidden correlations in electronic health records without
predefined hypotheses.

5. Data Analytics vs. Data Science

Aspect | Data Analytics | Data Science
Focus | Analyzing existing data to generate insights | End-to-end process: data collection, analysis, modeling
Tools | Excel, SQL, Tableau, Power BI | Python, R, Hadoop, TensorFlow, Spark
Scope | Narrower (mostly business decisions) | Broader (includes AI/ML, product development)
Methods Used | BI tools, statistics | Statistics, machine learning, deep learning
Outcome | Business insights and reporting | Predictive models, automation, strategic solutions

Example:
 Data Analytics: Creating a dashboard to visualize monthly sales data.
 Data Science: Building a recommendation engine for Netflix using collaborative
filtering.
3.4 Datasets for ML
What Is a Dataset in Machine Learning?
A machine learning dataset is, quite simply, a collection of data pieces that can be
treated by a computer as a single unit for analytic and prediction purposes. This means
that the data collected should be made uniform and understandable for a machine that doesn’t
see data the same way as humans do.
ML models depend on machine learning datasets to learn and improve. These datasets
act as collections of data used to train, test, and validate a machine learning algorithm.
Without the right dataset, your model won’t perform well or provide accurate results.
Why Are Datasets Important?
The success of any machine learning model depends on the quality and relevance of the data
it learns from. Here's why:
 Training: AI training datasets provide examples that models use to identify patterns.
 Validation: Separate datasets help tune the model's performance.
 Testing: Unseen data evaluates the model's ability to generalize.
Without the right data, even the best-designed algorithm won’t work effectively.
Data Types in Machine Learning
There are two main categories:
Structured Data: Organized into tables or rows, such as numerical values or categories. Example: sales data.
Unstructured Data: Includes text, images, video, or audio. Example: social media posts or medical datasets for machine learning containing X-ray images.
Choosing the right type depends on your project’s goals. For instance, image recognition
models require unstructured data, while forecasting tasks often rely on structured datasets.

Types of Machine Learning Datasets


(Figure: Splitting of a dataset into training, testing, and validation datasets)
Machine Learning models rely on three main types of datasets during development. Each
serves a distinct purpose, contributing to a model’s accuracy and reliability.

1) Training Data
What is Training Data?
Training data is used to train the machine learning model, whereas testing data is used to
determine the performance of the trained model. Training data is the power that supplies the
model in machine learning, and it is larger than the testing data, because more data helps to
build more effective predictive models. When a machine learning algorithm receives data from our
records, it recognizes patterns and creates a decision-making model.
Algorithms allow a company's past experience to be used to make decisions. They analyze all
previous cases and their results and, using this data, create models to score and predict the
outcome of current cases. The more data ML models have access to, the more reliable their
predictions get over time.
Purpose: To teach the model.
The AI training dataset is the largest subset and forms the foundation of model development.
The model uses this data to identify patterns, relationships, and trends.
Characteristics:
 Large size for better learning opportunities.
 Diversity to avoid bias and improve generalization.
 Well-labeled for supervised learning tasks.
Example: AI training datasets with annotated images train models to recognize objects like
cars and animals.
Validation Datasets
Purpose: To fine-tune the model.
Validation datasets evaluate the model during training, helping you adjust parameters like
learning rates or weights to prevent overfitting.
Key Characteristics:
 Separate from the training data.
 Small but representative of the problem space.
 Used iteratively to improve performance.
Tip: Validation data ensures your model isn’t simply memorizing the training data but can
generalize well.

2) Testing Datasets
What is Testing Data?
You will need unseen data to test your machine learning model after it has been created
(using your training data). This data is known as testing data, and it can be used to assess the
progress and efficiency of your algorithm's training as well as to modify or optimize it for
better results. Testing data should:
 Represent the original set of data.
 Be large enough to produce reliable projections.
This dataset needs to be "unseen" and recent, because the training data has already been
"learned" by your model. By observing how the model performs on fresh test data, you can
decide whether it is operating successfully or whether it needs more training data to meet your
standards. Test data provides a final, real-world check of whether the machine learning
algorithm was trained correctly on an unknown dataset.
Purpose: To evaluate the model.
The testing dataset provides an unbiased assessment of the model’s performance on unseen
data.
Characteristics:
 Exclusively used after training and validation.
 Mimics real-world scenarios for robust evaluation.
 Should remain untouched during the training process.
Example: Testing a sentiment analysis model with user reviews it hasn’t seen before.

Why do we need Training data and Testing data


Training data teaches a machine learning model how to behave, whereas testing data assesses
how well the model has learned.
 Training Data: The machine learning model is taught how to generate predictions or
perform a specific task using training data. Since the training data is usually labeled, the
expected output for every data point is known. In order to make predictions, the model
must first learn to recognize patterns in the data. Training data can be compared to a
student's textbook when learning a new subject. The learner learns by reading the text
and completing the tasks, and the book offers all the knowledge they require.
 Testing Data: The performance of the machine learning model is measured using
testing data. Usually, it is labeled but kept distinct from the training set; the model has
not seen these data points, so its predictions can be compared with the known labels. On
the testing data, the model's accuracy in predicting outcomes is assessed. Testing data is
comparable to the exam a
student takes to determine how well-versed in a subject they are. The test asks
questions that the student must respond to, and the test results are used to gauge the
student's comprehension.
Why is it important to use separate training and testing data?
To avoid overfitting, it is essential to use separate training and testing data. When a machine
learning model learns the training data too well, it becomes hard for it to generalize to new data.
This may happen if the training data is insufficient or not representative of the real-world data
on which the model will be used.
We can confirm that the model is learning the fundamental patterns and relationships in the
data and not simply memorizing the training data by using separate training and testing sets.
This will assist the model in making more accurate predictions based on new data.
How Training and Testing Data Work?
Machine learning models are built by algorithms that examine your training dataset, classify
the inputs and outputs, and then analyze the data again.
When an algorithm is trained enough, it will effectively memorize all of the inputs and
outputs in the training dataset; however, this presents an issue when it is required to evaluate
data from other sources, such as real-world consumers.
The training data collection procedure consists of three steps:
 Feed - Providing data to a model.
 Define - The model converts the training data into feature vectors (numbers corresponding to
data features).
 Test - Lastly, you put your model to the test by feeding it test data (unseen data).
When training is complete, you can use the portion of data you held back from your
actual dataset (for example, 20%, withholding the labeled outcomes if leveraging supervised
learning) to test the model. This is where the model is fine-tuned to make sure it works the
way we want it to.
The entire process (training and testing) is often conducted in a matter of seconds, so you don't
have to worry about fine-tuning it manually. However, it is always good to know
what's happening behind the scenes so the model is not a black box.
How Training and Testing Data Used in Automation Tools?
It makes sense that test automation tools use data from both training and testing, as this
raises the correctness and dependability of the tests. The test automation tool is trained on
the particular application or system under test using training data. This helps the tool
learn the application's intended behavior and detect any potential flaws. The test
automation tool's performance is then assessed using testing data. This makes it more likely that
the tool will detect errors and won't overfit the training set.
The following are brief examples of how test automation technologies use training and
testing data:
 The test automation tool learns how to communicate with the application or system it
is testing using training data. It should be both large enough to enable the tool to
recognize patterns in the behavior of the application and representative of the real
world.
 Test automation tool performance is assessed using testing data. It ought to be
unlabeled and distinct from the training set. This guarantees that the tool can
detect errors in fresh data rather than simply repeating the training set.
 You may create more accurate and dependable test automation tools by using training
and testing data.

How to Split Datasets Effectively


Splitting your machine learning datasets into these three subsets is crucial for accurate model
evaluation:
 70% Training
 15% Validation
 15% Testing
Keep these subsets diverse and balanced to ensure your model learns and performs well
across different scenarios.
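One common way to get a 70/15/15 split is to call scikit-learn's train_test_split twice; the sketch below uses the built-in Iris data as a stand-in for your own dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 30%, then split that 30% in half → 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 105 / 22 / 23 of the 150 samples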

How to Choose the Right Dataset Source

Define Your Requirements


Identify the type of data and the level of annotation required for your model.
Match the Source to Your Needs
For smaller projects, start with open-source repositories. For specialized tasks, consider paid
or synthetic data.
Budget Considerations
Free datasets are cost-effective but may require significant cleaning and preprocessing, while
paid sources offer higher quality but come at a price.

Features to look for when building or selecting machine learning datasets:

1)Relevance to the Task


Your dataset must align with the problem your model is designed to solve.
 Why it matters: Irrelevant data leads to noisy models and poor performance.
 Example: For autonomous vehicles, datasets with street, vehicle, and pedestrian
images are essential; wildlife photos are irrelevant.
2)Balanced Data Distribution
A balanced dataset ensures all target classes or features are well-represented.
 Why it matters: Imbalanced data can bias the model toward the dominant class,
reducing its ability to generalize.
 Example: A sentiment analysis model trained only on positive reviews will struggle
to classify negative sentiments accurately.
How to address imbalance (a code sketch follows this list):
 Oversample underrepresented classes.
 Use techniques like SMOTE (Synthetic Minority Oversampling Technique).
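A minimal SMOTE sketch, assuming the imbalanced-learn (imblearn) package is installed; synthetic data stands in for a real imbalanced dataset.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After :", Counter(y_res))   # minority class oversampled to match the majority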
3)Data Diversity
Diverse machine learning datasets improve a model's ability to generalize across real-world
scenarios.
 Why it matters: Limited diversity can lead to overfitting, where the model performs
well on AI training datasets but poorly on unseen data.
 Example: A facial recognition model trained only on specific demographics will
likely fail on broader populations.
Tips for ensuring diversity:
 Source data from multiple locations or demographics.
 Avoid over-representing any single feature.
4)Clean and Accurate Data
Errors in your dataset can introduce noise, misleading the model.
 Why it matters: Duplicates, missing values, and incorrect labels affect the reliability
of your model.
 Example: A mislabeled image in a dataset can teach the model to associate the wrong
label with similar images.
Steps to clean data:
 Remove duplicates.
 Impute or drop missing values.
 Validate annotations for accuracy.
5)Sufficient Quantity
Large datasets for machine learning are typically better for training robust models, especially
for complex tasks.
 Why it matters: Insufficient data can prevent the model from learning effectively,
while larger datasets improve performance and generalization.
 Example: Training deep learning models like GPT-4 requires vast datasets spanning
billions of data points.
6)Proper Annotation
Expert data annotation guides the model’s learning process.
 Why it matters: Inaccurate or incomplete annotations lead to misaligned predictions.
 Example: For object detection tasks, bounding boxes around objects must be precise
to help the model identify features accurately.

How to Build Machine Learning Datasets


(Figure: Three steps of data processing in machine learning)
Creating a high-quality dataset is a critical process that involves multiple stages. Each step
contributes to ensuring your machine learning model learns effectively and delivers accurate
predictions.
Step 1: Define Your Objective
Before collecting any data, identify the specific problem your model is solving. This
determines the type, format, and features of your dataset.
Key Questions:
 What is the model’s goal? (e.g., image classification, sentiment analysis)
 What type of data is required? (structured, unstructured, numerical, text, images, etc.)
 Are there specific target outputs (labels) needed?
Step 2: Collect Relevant Data
Choose the most suitable sources for data collection based on your project’s requirements.
Key Questions:
 Open-source datasets: Great for experimentation and academic projects.
 Proprietary data: Ideal for niche applications requiring specific data.
 Synthetic data: Useful for replicating controlled scenarios or addressing privacy
concerns.
Tips:
 Use tools like web scrapers or APIs to gather raw data.
 Diversify data sources to improve model generalization.
Step 3: Preprocess and Clean the Data
Raw data is often messy and unsuitable for direct use in machine learning. Preprocessing
ensures the dataset is consistent and usable.
Common Steps:
 Remove duplicates and outliers.
 Handle missing values by imputing or discarding them.
 Normalize or standardize numerical data.
 Tokenize and clean text data for NLP tasks.
Why it matters: Clean data improves learning efficiency and reduces noise in predictions.
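A small pandas sketch of these cleaning and preprocessing steps; the column names and values are invented for illustration.

import pandas as pd

# Hypothetical raw data with a duplicate, a missing value and an invalid entry
raw = pd.DataFrame({
    "age": [25, 25, None, 40, 120],
    "income": [30000, 30000, 52000, None, 61000],
})

clean = raw.drop_duplicates()                          # remove duplicate rows
clean = clean.fillna(clean.median(numeric_only=True))  # impute missing values with the median
clean = clean[clean["age"].between(0, 100)]            # drop an obviously invalid age (outlier)

# Standardize numeric columns (zero mean, unit variance)
clean = (clean - clean.mean()) / clean.std()
print(clean)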
Step 4: Annotate Data
Annotations provide the labels your model learns from. This step is crucial for supervised
learning tasks.
Methods:
 Manual Annotation: Involves domain experts or outsourced teams for precise data
annotation services.
 Automated Annotation: Use AI-powered tools for large-scale projects.
Examples:
 Bounding boxes for object detection in images.
 Sentiment tags for text reviews.
 Transcriptions for audio data.
Step 5: Split the Dataset
Dividing your dataset into subsets ensures the model is trained and evaluated properly.
Standard Splits:
 Training Set (70-80%): Used to teach the model.
 Validation Set (10-15%): Fine-tunes hyperparameters and prevents overfitting.
 Testing Set (10-15%): Evaluates performance on unseen data.
Step 6: Analyze the Dataset
Before feeding the data into your model, analyze it to uncover potential biases or
inconsistencies.
Techniques:
 Use histograms, scatter plots, or box plots to visualize data distribution.
 Check for imbalances in class labels or missing features.
Step 7: Document the Dataset
Comprehensive documentation ensures transparency and reproducibility.
What to include:
 Data sources and collection methods.
 Preprocessing and annotation steps.
 Descriptions of features and labels.
Step 8: Store and Manage Data
Secure and organized storage ensures scalability and ease of access.
Best Practices:
 Use cloud storage solutions like AWS S3 or Google Cloud Storage.
 Implement version control to track dataset changes over time.
This step-by-step guide should streamline your dataset creation process, whether you’re
working on a small project or building datasets for large-scale AI applications.

Use Cases for Machine Learning Datasets

Machine learning datasets serve as the foundation for various applications across industries.
The type of dataset often dictates its use case, ranging from enhancing user experiences to
solving complex scientific challenges.
Here are some of the most impactful use cases:
1) Computer Vision
Computer vision tasks rely heavily on labeled video and image datasets for machine learning
for accurate predictions.
Applications:
 Object detection and recognition (e.g., autonomous vehicles, security systems).
 Image segmentation for medical diagnostics (e.g., tumor detection in X-rays).
 Scene understanding in robotics and virtual reality, and geospatial annotation.
Dataset Examples: COCO, ImageNet, VisualData.
2) Natural Language Processing (NLP)
NLP tasks require diverse and well-annotated text datasets to train models that understand
and generate human language.
Applications:
 Sentiment analysis for customer feedback.
 Machine translation for multilingual content.
 Chatbots and conversational AI systems.
Dataset Examples: IMDb Reviews, SQuAD, Common Crawl.
3) Time Series Analysis
Time-series datasets are crucial for forecasting and trend analysis.
Applications:
 Predicting stock prices or market trends in finance.
 Monitoring sensor data in IoT devices.
 Analyzing patient health data for predictive healthcare.
Dataset Examples: UCI Gas Sensor, Yahoo Finance Datasets.
4) Speech and Audio Recognition
Datasets containing speech and audio signals power models for recognizing and processing
sound in automatic speech recognition.
Applications:
 Voice assistants like Alexa and Siri.
 Speaker diarization in meeting transcription tools.
 Acoustic scene analysis for smart environments.
Dataset Examples: LibriSpeech, VoxCeleb.
5) Recommendation Systems
Recommendation systems rely on user behavior data to personalize suggestions.
Applications:
 E-commerce platforms suggesting products.
 Streaming services recommending content (e.g., Netflix, Spotify).
 Personalized learning systems in education.
Dataset Examples: MovieLens, Amazon Product Data.
Difference between Training Data and Testing Data

Feature | Training Data | Testing Data
Purpose | The machine-learning model is trained using training data. The more training data a model has, the more accurate predictions it can make. | Testing data is used to evaluate the model's performance.
Exposure | By using the training data, the model can gain knowledge and become more accurate in its predictions. | Until evaluation, the testing data is not exposed to the model. This guarantees that the model cannot learn the testing data by heart and produce flawless forecasts.
Distribution | The training data distribution should be similar to the distribution of the actual data that the model will use. | The distribution of the testing data and the data from the real world differs greatly.
Use | Training data is utilized to stop overfitting. | The performance of the model is assessed by making predictions on the testing data and comparing them to the actual labels.
Size | Typically larger | Typically smaller


3.5 Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset. It is an essential step in the data preprocessing phase of any data
science or machine learning workflow.
“Garbage in, garbage out” – the quality of your data directly affects the quality of your
results.
 Data collected from various sources is often dirty, and this will affect the accuracy of the
prediction results.
• Data quality is a main issue in quality information management.
• Data quality problems occur anywhere in information systems, and these problems are
solved by data cleaning.
• Data cleaning is a process used to detect inaccurate, incomplete or unreasonable
data and then improve the quality by correcting the detected errors and omissions.
• Generally, data cleaning reduces errors and improves data quality.

Importance of Data Cleaning:

• Data cleaning, data cleansing, or data scrubbing is the act of first identifying any issues or
bad data, then systematically correcting these issues.
• Data is required to be cleaned before you start analyzing it, especially if you will be running it
through a machine learning model.
Here are a few data quality issues you will likely come across when working on any real-
world data science or machine learning project -
•Missing Data/ Null Values
•Duplicate Data
•Outliers present in Data
•Erroneous Data
•Presence of Irrelevant Data

1. Handling Missing Data or Null values


 Missing data or missing values occur when we have NO data points stored for a
particular column or feature.
 There might be multiple data sources for creating a data set.
 Different data sources may indicate missing values in various ways, which makes analysis
even more complicated and can significantly impact the conclusions drawn from the data.

Remember, not all NULL data is corrupt; sometimes we will need to accept
missing values. This scenario is entirely dependent on the dataset and the type of
business problem.
Causes of Missing Data:
 Human error during data entry
 Data corruption or transmission failure
 Inapplicable responses (e.g., income question for unemployed)
 Software bugs or system issues

Types of Missing Data:

Type Description
MCAR (Missing Completely The missing values are independent of any data (e.g., a
at Random) sensor failed randomly).
The missing values depend on observed data (e.g., men less
MAR (Missing at Random)
likely to report weight).
MNAR (Missing Not at Missing values depend on unobserved data (e.g., income not
Random) reported by high earners).

Techniques to Handle Missing Data:


a) Deletion Methods:
 Listwise Deletion: Remove entire row if any value is missing.
o Simple, but can cause data loss.
 Pairwise Deletion: Use available data for each analysis without removing entire
rows.
o Preserves more data but harder to implement.
b) Imputation Methods (a code sketch follows this list):
 Mean/Median/Mode Imputation:
o Replace missing values with column mean (for numerical), median (for
skewed), or mode (for categorical).
 Forward/Backward Fill (Time series):
o Use previous or next value to fill gaps.
 KNN Imputation:
o Use k-nearest neighbors to estimate missing value based on similar rows.
 Multivariate Imputation (MICE):
o Models each feature with missing values as a function of other features.
 Using ML Models:
o Train a model to predict the missing value from other variables.
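A brief sketch of mean-based and KNN-based imputation with scikit-learn; the array values are illustrative.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Illustrative array with missing values (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [4.0, 5.0, 6.0],
])

print(SimpleImputer(strategy="mean").fit_transform(X))   # mean imputation per column
print(KNNImputer(n_neighbors=2).fit_transform(X))        # estimate each gap from the 2 most similar rows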

Operating on Null Values


Pandas provides various functions for detecting, removing, and replacing null values:
• isnull() - Generates a Boolean mask (True/False) indicating missing values.
• notnull() - The opposite of isnull(); generates a Boolean mask of non-missing values.
• dropna() - Returns a filtered version of the data with null entries removed.
• fillna() - Returns a copy of the dataframe with missing values replaced or imputed.

Detecting Null Values


Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
Either of them returns a Boolean mask (True/False).
Note: for isnull(), True indicates a null entry and False indicates no null data.
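A short demonstration of the four pandas functions listed above, on a tiny made-up dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Pune", "Delhi", None]})

print(df.isnull())       # Boolean mask: True where a value is missing
print(df.notnull())      # opposite mask: True where a value is present
print(df.dropna())       # filtered copy with rows containing nulls removed
print(df.fillna({"age": df["age"].mean(), "city": "Unknown"}))   # replace nulls per column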

2) Dealing with Outliers


An outlier is a data point that differs significantly from other observations. It can be due to
variability in the data, measurement error, or incorrect entry.
• Outliers are data entries whose values deviate significantly from the rest of the data.
• Outliers can be found in every possible real-world dataset you come across, and dealing
with Outliers is one of the many data cleaning techniques.
• Outliers also have a significance on data accuracy and business analytics.
• With most machine learning algorithms, predominantly linear regression models, outliers need
to be dealt with, or else the variance of the model could turn out very high, which
further leads to false conclusions by the model.
Causes of Outliers:
 Data entry errors (e.g., typing 1000 instead of 100)
 Measurement errors (e.g., faulty sensors)
 Legitimate rare events (e.g., very high income)

Detecting Outliers
a) Statistical Techniques:
 Z-score Method:
o Z = (x − mean) / standard deviation
o If |Z| > 3, the point is considered an outlier.
 IQR (Interquartile Range) Method:
o IQR = Q3 − Q1
o Outlier if: x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR
b) Visualization Tools:
 Boxplot: Shows outliers as points beyond whiskers.
 Scatter plot: Helps visualize isolated values.
 Histogram: Detects skewed or long-tailed distributions.
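A small sketch applying the Z-score and IQR rules in NumPy, using the same sample data as the boxplot example later in this section:

import numpy as np

data = np.array([10, 12, 15, 18, 19, 20, 23, 26, 28, 80])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])          # may be empty for small samples like this one

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)                     # → [80]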

The two most efficient business practices for detecting outliers are:
1. Normal Distribution
2. Box-plots
1) Normal Distribution: Also known as the bell curve, the normal distribution helps us
visualize how a particular feature's values are distributed; points that fall far out in its
tails can be flagged as outliers.

a) Scatter Plot
Purpose:
To identify relationships between two variables and detect anomalous points.
Characteristics:
 Each point represents one data observation.
 X-axis and Y-axis represent two continuous variables.
 Outliers appear as isolated points, far from the main cluster.
Example:
Imagine a dataset tracking:
 X-axis: Number of website visits
 Y-axis: Sales made
Most data points cluster between (100–300 visits, 10–50 sales).
But one point (600 visits, 5 sales) is an anomaly — low conversion despite high traffic.
Use Case:
 Detect outliers and trends
 Understand correlations
 Visualize regression patterns

b) Histogram
Purpose:
To visualize the frequency distribution of a single continuous variable.
Characteristics:
 Divides data into bins (intervals).
 Y-axis shows count/frequency.
 Shows:
o Shape (normal, skewed)
o Peaks (modes)
o Gaps
o Tails (long tails → potential outliers)
Example:
Imagine tracking daily temperatures:
 Data mostly falls between 20–30°C.
 A few days are >40°C.
In the histogram:
 Most bins are concentrated around 25°C.
 A long right tail shows outliers in the hot range (skewed distribution).
Use Case:
 Check distribution normality
 Identify skewness or multiple peaks
 Detect unusual frequency gaps
2. Boxplot (Box-and-Whisker Plot): A box plot is a visualization technique used in Python to
represent data distribution. A box plot is described by a five-number summary:

1. Minimum Value: The lowest data point of a feature (excluding outliers).
2. Maximum Value: The highest data point of a feature (excluding outliers).
3. Median: The middle data point of a feature.
4. Lower Quartile (Q1): The value below which 25% of the data falls.
5. Upper Quartile (Q3): The value below which 75% of the data falls.
A box plot usually includes two parts, a box and a set of whiskers; points lying beyond the
whiskers (for example, the few students taller than 6 ft in a height dataset) are plotted
individually as outliers.
6. IQR (Interquartile Range): The distance between the upper and lower quartiles.

Purpose:
To visualize the distribution of a dataset and highlight outliers.
Components:
 Box: Represents the Interquartile Range (IQR = Q3 - Q1).
 Line inside box: Median (Q2).
 Whiskers: Extend to the minimum and maximum non-outlier values.
 Outliers: Shown as individual dots beyond the whiskers.
Example:
Data: [10, 12, 15, 18, 19, 20, 23, 26, 28, 80]
 Q1 = 14, Q3 = 25 → IQR = 11
 Whiskers:
o Lower limit = Q1 − 1.5×IQR = 14 − 16.5 = -2.5
o Upper limit = Q3 + 1.5×IQR = 25 + 16.5 = 41.5
 Outlier = 80, since it exceeds 41.5
Use Case:
 Outlier detection
 Distribution comparison across groups

Handling Techniques:

Method Use Case

Capping/Clipping Set limits (e.g., max salary = 95th percentile).

Transformation Use log/sqrt to reduce impact of outliers.

Winsorization Replace extreme values with nearest non-outlier value.

Removing Outliers Only when justified (e.g., clear error).

Robust Models Some ML models (like Decision Trees) are naturally resistant to outliers.

Advantages and benefits of data cleaning


Having clean data will ultimately increase overall productivity and allow for the highest
quality information in your decision-making. Benefits include:
 Removal of errors when multiple sources of data are at play.
 Fewer errors make for happier clients and less-frustrated employees.
 Ability to map the different functions and what your data is intended to do.
 Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
 Using tools for data cleaning will make for more efficient business practices and
quicker decision-making.
