Unit 3
Introduction to ML
Introduction
Advancements in computing and communication technologies have resulted in an
explosion of connected and interacting digital devices. In the current digital age, the amount
of data we produce every day is truly mind-boggling. A single click on a website, social
media usage, online transactions, smartphone use, streaming audio and video, and many other
such activities generate an enormous amount of data. Data has evolved dramatically in recent
years, increasing in volume, velocity, and variety. Analyzing this data and deriving
meaningful insights from it is a huge challenge. So, in this era of computing and
information science, machine learning has become a rapidly growing and challenging area
for research and development.
Machine learning is used for everything from automating mundane tasks to offering intelligent
insights, and industries in every sector try to benefit from it. You may already be using a device
that utilizes it, for example, a wearable fitness tracker like Fitbit or an intelligent home
assistant like Google Home.
Machine learning is the field of science in which computer algorithms learn from past data
and information. In machine learning, computers do not have to be explicitly programmed;
they can change and improve their algorithms by learning on their own.
Hence, machine learning techniques have become key components of many intelligent
systems.
Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Many organizations are now looking for
innovative ways to leverage their data assets to gain a competitive edge and to help the
business reach a new level of understanding. Here, machine learning plays a very
important role.
AI vs ML
Machine Learning and Artificial Intelligence are two closely related but distinct
fields within the broader field of computer science. Machine learning is a part of AI
that helps machines learn from data and get better over time without being told
exactly what to do. So, all machine learning is AI, but not all AI is machine
learning. AI can include things like robots or voice assistants, while machine
learning focuses more on learning from patterns in data to make predictions or
decisions.
Machine learning is the brain behind AI, teaching machines to learn from data and
make smarter decisions.
Imagine a smart chef named Alex who can prepare any dish you ask for. Alex doesn’t
need instructions; they know every recipe by heart and can even come up with new
dishes. Alex represents Artificial Intelligence (AI), a system that mimics human
intelligence to make decisions and solve problems on its own.
Now, meet Jamie, Alex’s learning assistant. Jamie is great at chopping vegetables
and following recipes but doesn’t know how to cook creatively. Jamie learns over
time by observing Alex and practicing recipes repeatedly. For instance, if Jamie
makes a mistake in seasoning one day, they adjust it the next time until they perfect it.
This story highlights that while ML is a subset of AI, they each have unique roles and
serve different purposes.
Machine learning has given computer systems the ability to learn automatically without
being explicitly programmed. But how does a machine learning system work? It can be
described using the machine learning life cycle, a cyclic process for building an efficient
machine learning project. The main purpose of the life cycle is to find a solution to the
problem or project.
Machine learning life cycle involves seven major steps, which are given below:
Gathering Data
Data preparation
Data Wrangling
Analyse Data
Train the model
Test the model
Deployment
The most important thing in the complete process is to understand the problem and to
know the purpose of the problem. Therefore, before starting the life cycle, we need to
understand the problem, because a good result depends on a good understanding of the
problem.
In the complete life cycle process, to solve a problem, we create a machine learning
system called a "model", and this model is created by providing "training". But to train a
model, we need data; hence, the life cycle starts by collecting data.
1. Gathering Data
Data gathering is the first step of the machine learning life cycle. The goal of this step
is to identify and obtain the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle. The quantity and quality of the collected data determine
the efficiency of the output: the more data we have, the more accurate the prediction
will be.
This step includes the below tasks:
* Identify various data sources
* Collect data
* Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in the further steps.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the
step where we put our data into a suitable place and prepare it for use in machine learning
training.
In this step, we first put all the data together and then randomize the ordering of the data.
This step can be further divided into two processes:
Data exploration: It is used to understand the nature of the data we have to work
with. We need to understand the characteristics, format, and quality of the data. A better
understanding of the data leads to an effective outcome. In this step, we find correlations,
general trends, and outliers.
Data pre-processing:
The next step is pre-processing the data for analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variables to use, and transforming
the data into a proper format to make it more suitable for analysis in the next step. It is one of
the most important steps of the complete process. Cleaning of data is required to address
quality issues. The data we have collected is not always useful to us, as some of it may be
irrelevant. In real-world applications, collected data may have various issues,
including:
* Missing Values
* Duplicate data
* Invalid data
* Noise
So, we use various filtering techniques to clean the data. It is mandatory to detect and remove
these issues because they can negatively affect the quality of the outcome.
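For illustration, a minimal data-wrangling sketch in Python using pandas is given below; the column names and values are hypothetical, and a real project would apply problem-specific rules:

import pandas as pd

# Hypothetical raw data containing duplicates, invalid values, and gaps
df = pd.DataFrame({
    "age":    [25, 25, -3, 41, None, 39],
    "income": [52000, 52000, 61000, None, 45000, 47000],
})

df = df.drop_duplicates()                                 # duplicate data
df = df[df["age"].isna() | (df["age"] > 0)]               # invalid data (negative ages)
df["age"] = df["age"].fillna(df["age"].median())          # missing values
df["income"] = df["income"].fillna(df["income"].median())
print(df)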
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
* Selection of analytical techniques
* Building models
* Review the result
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and to review the outcome. It starts with determining the
type of problem; we then select a machine learning technique such as classification, regression,
cluster analysis, or association, build the model using the prepared data, and evaluate the
model. Hence, in this step, we take the data and use machine learning algorithms to build the
model.
5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can understand the various patterns, rules, and features.
Model training is an iterative process in which the algorithm adjusts its parameters to minimize
errors and enhance predictive accuracy. During this phase the model fine-tunes itself,
improving its understanding of the data and optimizing its ability to make predictions. A rigorous
training process ensures that the trained model works well with new, unseen data and makes
reliable predictions in real-world scenarios.
Here are the basic features of Model Training:
Training Data: Expose the model to historical data to learn patterns, relationships
and dependencies.
Iterative Process: Train the model iteratively, adjusting parameters to minimize
errors and enhance accuracy.
Optimization: Fine-tune model to optimize its predictive capabilities.
Validation: Rigorously validate the model to ensure it generalizes accurately to new, unseen data.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirements of the project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system.
If the above-prepared model produces accurate results as per our requirements
with acceptable speed, then we deploy the model in the real system. But before deploying the
project, we check whether it is improving its performance using the available data.
The deployment phase is similar to making the final report for a project.
Upon successful evaluation, the machine learning model is ready for deployment in real-world
applications. Model deployment involves integrating the predictive model with existing
systems, allowing businesses to use it for informed decision-making.
Here are the basic features of Model Deployment:
Integration: Integrate the trained model into existing systems or processes for real-
world application.
Decision Making: Use the model's predictions for informed decisions.
Practical Solutions: Deploy the model to transform theoretical insights into practical
use that address business needs.
Continuous Improvement: Monitor model performance and make adjustments as
necessary to maintain effectiveness over time.
1. Data Mining
Definition:
Data mining is the computational process of discovering patterns, trends, correlations, or
anomalies from large datasets using methods from statistics, machine learning, and database
systems.
Steps in Data Mining Process
2. Data Cleaning
Goal: Improve data quality by removing inconsistencies and errors.
Remove duplicate records
Fill or eliminate missing values
Smooth noisy data
Detect and fix outliers
Example: Removing customers who have incomplete transaction records or incorrect email
addresses.
3. Data Integration
Goal: Combine data from multiple heterogeneous sources into a unified view.
Resolve data format conflicts
Match schemas (e.g., merging “customer_ID” from two different databases)
Example: Combining in-store purchases and online purchases into one dataset per customer.
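As a small illustration (not from the original text), the two hypothetical purchase tables below are combined with pandas after resolving a schema conflict:

import pandas as pd

# Hypothetical purchase records from two heterogeneous sources
in_store = pd.DataFrame({"customer_ID": [1, 2], "amount": [40.0, 15.5], "channel": "store"})
online = pd.DataFrame({"cust_id": [2, 3], "amount": [22.0, 60.0], "channel": "online"})

# Resolve the schema conflict (customer_ID vs. cust_id), then combine into one dataset
online = online.rename(columns={"cust_id": "customer_ID"})
purchases = pd.concat([in_store, online], ignore_index=True)
print(purchases)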
4. Data Selection
Goal: Select relevant data for mining tasks.
Focus only on attributes/features useful for your objective
Reduces dimensionality and speeds up processing
Example: From a dataset of 100 variables, select 20 features that influence customer churn.
5. Data Transformation
Goal: Convert data into suitable format for mining.
Normalize numerical data (e.g., scale to 0–1)
Aggregate or summarize data
Encode categorical variables (e.g., one-hot encoding)
Example: Converting purchase amounts to a normalized scale so that clustering isn't biased.
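A minimal sketch of both transformations in Python, assuming a hypothetical purchase table; scikit-learn's MinMaxScaler and pandas' get_dummies are one common choice among several:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical purchase data
df = pd.DataFrame({"amount": [12.0, 250.0, 75.0],
                   "category": ["food", "electronics", "food"]})

# Normalize the numeric column to the 0-1 range
df["amount_scaled"] = MinMaxScaler().fit_transform(df[["amount"]]).ravel()
# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["category"])
print(df)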
6. Data Mining
Goal: Apply intelligent methods (e.g., classification, clustering, association rules) to extract patterns from the prepared data.
Example: Applying a clustering algorithm to group customers with similar buying habits.
7. Pattern Evaluation
Goal: Identify patterns that are truly interesting and useful.
Remove redundant or irrelevant patterns
Use metrics like support, confidence, lift (in association rules)
Example: From 500 rules found, keep only those with high confidence and lift.
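To make these metrics concrete, the small hand-computed sketch below (with a made-up transaction list) evaluates one association rule, {bread} -> {butter}:

# Support, confidence, and lift for a hypothetical rule {bread} -> {butter}
transactions = [
    {"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"},
    {"milk"}, {"bread", "butter"},
]
n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n          # 0.8
support_butter = sum("butter" in t for t in transactions) / n        # 0.6
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # 0.6

confidence = support_both / support_bread   # P(butter | bread) = 0.75
lift = confidence / support_butter          # 1.25 (> 1 means positive association)
print(support_both, confidence, lift)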
8. Knowledge Presentation
Goal: Visualize and present the mined knowledge to stakeholders.
Charts, graphs, dashboards
Interactive visualizations using tools like Tableau, Power BI
Model interpretations for non-technical users
Example: Creating a dashboard showing customer clusters based on buying habits.
Anomaly Detection: Identify rare or unusual patterns. Example: fraud detection in banking.
2. Data Analytics
Definition:
Data analytics refers to the process of examining datasets to draw conclusions about the
information they contain. It is often used in decision-making and business intelligence.
Types:
Descriptive Analytics: What happened?
Example: Sales increased by 15% last quarter.
Diagnostic Analytics: Why did it happen?
Example: Sales increased due to a holiday promotion.
Predictive Analytics: What will happen?
Example: Forecasting next quarter's sales using historical trends.
Prescriptive Analytics: What should be done?
Example: Recommending discount strategies to boost future sales.
2. Data Collection
Goal: Gather relevant data needed to answer your question.
Sources: Databases, APIs, CRM systems, web logs, surveys, IoT sensors, social
media.
Ensure data is timely, accurate, and relevant.
Example:
Collect data on customer purchases, return reasons, product reviews, and support
tickets.
3. Statistical Data
Definition:
Statistical data consists of data collected for analysis using statistical methods to summarize,
interpret, and draw conclusions.
Types:
Statistical data can be broadly categorized based on measurement levels and nature of the
values:
Qualitative (Categorical): Gender, Colors, Brands
Quantitative (Numerical): Age, Height, Sales
1. Quantitative Data (Numerical Data)
This data type consists of numbers that represent measurable quantities. It can be discrete or
continuous.
a) Discrete Data
Definition: Countable and finite numbers.
Examples:
o Number of students in a class (20, 30)
o Number of clicks on a website
o Number of cars in a parking lot
Features:
o Gaps between values
o No decimals or fractions
o Often used in classification tasks
b) Continuous Data
Definition: Measurable values that can take any value within a range.
Examples:
o Height, weight, temperature
o Time, speed, distance
Features:
o Infinite possibilities within a range
o Can include decimals
o Important for regression tasks
Descriptive Statistics
Descriptive statistics are used to summarize and describe the main features of a dataset.
They provide a simple overview of the distribution, central tendency, and variability of the
data.
___________________________________________________________________________
1. Measures of Central Tendency
These describe the center of a data distribution.
a) Mean (Average)
Formula: Mean = (sum of all values) / (number of values)
Example:
Data: 4, 5, 6, 7, 8
Mean = (4+5+6+7+8) / 5 = 30 / 5 = 6
b) Median
The middle value when data is sorted.
Example:
Data: 3, 6, 7, 9, 12
Median = 7
If even number of values:
Data: 2, 4, 6, 8 → Median = (4+6)/2 = 5
c) Mode
The value that appears most frequently.
Example:
Data: 2, 4, 4, 5, 6 → Mode = 4
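The three measures can be reproduced with Python's built-in statistics module, using the example data above:

import statistics

print(statistics.mean([4, 5, 6, 7, 8]))    # 6
print(statistics.median([3, 6, 7, 9, 12])) # 7
print(statistics.median([2, 4, 6, 8]))     # 5.0 (average of the two middle values)
print(statistics.mode([2, 4, 4, 5, 6]))    # 4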
___________________________________________________________________________
2. Measures of Dispersion (Spread or Variability)
a) Range
Difference between the maximum and minimum values.
Example:
Data: 10, 15, 20, 25
Range = 25 - 10 = 15
b) Variance
Average of the squared deviations from the mean.
Example:
Data: 4, 6, 8
Mean = 6
Variance = [(4-6)² + (6-6)² + (8-6)²] / 3 = (4 + 0 + 4)/3 = 2.67
c) Standard Deviation (SD)
The square root of variance; shows how much data deviates from the mean.
SD = √Variance
From the above example:
SD = √2.67 ≈ 1.63
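The range, variance, and standard deviation can be checked with NumPy; note that np.var divides by n by default (population variance), which matches the worked example above, while ddof=1 would give the sample variance:

import numpy as np

data = np.array([4, 6, 8])
print(np.ptp(data))   # range: max - min = 4
print(np.var(data))   # population variance = 2.666...
print(np.std(data))   # standard deviation ≈ 1.63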
3. Shape of Distribution
Descriptive stats also help identify:
a) Skewness
Indicates asymmetry:
o Right-skewed (positive): long tail on the right
o Left-skewed (negative): long tail on the left
b) Kurtosis
Describes the "peakedness" of a distribution:
o High kurtosis: sharp peak
o Low kurtosis: flat curve
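Both measures are available in SciPy; a quick sketch on a made-up temperature-like sample:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([20, 22, 23, 25, 25, 26, 27, 45])  # one large value stretches the right tail
print(skew(data))      # > 0 indicates a right-skewed distribution
print(kurtosis(data))  # excess kurtosis: > 0 sharp peak, < 0 flat curve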
Use Cases in ML and Data Science
___________________________________________________________________________
Normal (Gaussian) distribution: many ML models assume normality (e.g., Linear Regression, Naive Bayes).
Why it matters: Many ML algorithms (e.g., Naive Bayes, Gaussian Mixture Models) rely on
understanding how data is distributed.
2. Hypothesis Testing
Definition: A statistical method to test assumptions (hypotheses) about a population.
Concepts:
Null Hypothesis (H₀): No effect or no difference.
Alternative Hypothesis (H₁): There is an effect or difference.
p-value: Probability of obtaining results at least as extreme as the observed, assuming
H₀ is true.
Significance Level (α): Threshold to reject H₀ (e.g., 0.05 or 5%).
Example:
H₀: “Feature X has no impact on target Y”
If p-value < α, reject H₀ → Feature X is statistically significant.
ML Use: Feature selection, model comparison, A/B testing.
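A minimal sketch of such a test with SciPy, using made-up samples; an independent two-sample t-test stands in here for checking whether Feature X is associated with a difference in the target Y:

import numpy as np
from scipy import stats

# Hypothetical target values for two groups defined by Feature X
group_a = np.array([5.1, 4.8, 5.6, 5.0, 5.3])
group_b = np.array([6.0, 6.4, 5.9, 6.2, 6.1])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05
print(p_value, p_value < alpha)  # if True, reject H0: the group means differ significantly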
4. Bayes’ Theorem
Formula: P(A|B) = [P(B|A) × P(A)] / P(B), where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence.
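A tiny worked example (with made-up numbers) applying the formula to a spam-filter style question:

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.2                 # prior: 20% of mail is spam
p_word_given_spam = 0.6      # likelihood: the word appears in 60% of spam
p_word_given_ham = 0.05      # the word appears in 5% of non-spam
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)  # total probability
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75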
5. Statistical Inference
Definition: Drawing conclusions about a population based on sample data.
Includes:
Confidence Intervals: Estimate a population parameter range with confidence (e.g.,
95% CI).
Z-test / t-test: Comparing means when population variance is known/unknown.
Chi-square test: For categorical variables.
ML Use: Model diagnostics, evaluating model assumptions, interpreting model parameters.
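For instance, a 95% confidence interval for a mean can be sketched with SciPy; the sample values here are made up:

import numpy as np
from scipy import stats

sample = np.array([4.9, 5.1, 5.3, 4.8, 5.2, 5.0, 5.4])  # hypothetical measurements

# 95% confidence interval for the population mean (t-distribution, variance unknown)
ci = stats.t.interval(0.95, len(sample) - 1, loc=sample.mean(), scale=stats.sem(sample))
print(ci)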
________________________________________________________________________
Example:
Statistics: Testing if a new drug is effective through a clinical trial.
Data Mining: Uncovering hidden correlations in electronic health records without
predefined hypotheses.
Methods used: Data Analytics typically relies on BI tools and statistics, while Data Science uses statistics, machine learning, and deep learning.
Example:
Data Analytics: Creating a dashboard to visualize monthly sales data.
Data Science: Building a recommendation engine for Netflix using collaborative
filtering.
3.4 Dataset For ML
What Is a Dataset in Machine Learning?
A machine learning dataset is, quite simply, a collection of data pieces that can be
treated by a computer as a single unit for analytic and prediction purposes. This means
that the data collected should be made uniform and understandable for a machine that doesn’t
see data the same way as humans do.
ML models depend on machine learning datasets to learn and improve. These datasets
act as collections of data used to train, test, and validate a machine learning algorithm.
Without the right dataset, your model won’t perform well or provide accurate results.
Why Are Datasets Important?
The success of any machine learning model depends on the quality and relevance of the data
it learns from. Here's why:
Training: AI training datasets provide examples that models use to identify patterns.
Validation: Separate datasets help tune the model's performance.
Testing: Unseen data evaluates the model's ability to generalize.
Without the right data, even the best-designed algorithm won’t work effectively.
Data Types in Machine Learning
There are two main categories:
Structured Data: Organized into tables or rows, such as numerical values or categories. Example: sales data.
Unstructured Data: Includes text, images, video, or audio. Example: social media posts or medical datasets for machine learning containing X-ray images.
Choosing the right type depends on your project’s goals. For instance, image recognition
models require unstructured data, while forecasting tasks often rely on structured datasets.
1) Training Datasets
What is Training data?
Testing data is used to determine the performance of the trained model, whereas training data
is used to train the machine learning model. Training data is the fuel that powers the
model in machine learning, and it is larger than the testing data, because more data helps build
more effective predictive models. When a machine learning algorithm receives data from our
records, it recognizes patterns and creates a decision-making model.
Algorithms allow a company's past experience to be used to make decisions. They analyze all
previous cases and their results and, using this data, create models to score and predict the
outcome of current cases. The more data ML models have access to, the more reliable their
predictions get over time.
Purpose: To teach the model.
The AI training dataset is the largest subset and forms the foundation of model development.
The model uses this data to identify patterns, relationships, and trends.
Characteristics:
Large size for better learning opportunities.
Diversity to avoid bias and improve generalization.
Well-labeled for supervised learning tasks.
Example: AI training datasets with annotated images train models to recognize objects like
cars and animals.
Validation Datasets
Purpose: To fine-tune the model.
Validation datasets evaluate the model during training, helping you adjust parameters like
learning rates or weights to prevent overfitting.
Key Characteristics:
Separate from the training data.
Small but representative of the problem space.
Used iteratively to improve performance.
Tip: Validation data ensures your model isn’t simply memorizing the training data but can
generalize well.
2) Testing Datasets
What is Testing Data?
After your machine learning model has been created (using your training data), you will need
previously unseen data to test it. This data is known as testing data, and it can be used to assess the
progress and efficiency of your algorithms' training, as well as to modify or optimize them for
better results.
Testing data should:
* Represent the original data set
* Be large enough to produce reliable predictions
This dataset needs to be "unseen" and recent, because the training data has already been
"learned" by your model. By observing how the model performs on fresh test data, you can
decide whether it is operating successfully or whether it needs more training data to meet your
standards. Test data provides a final, real-world check of whether the machine learning
algorithm was trained correctly.
Purpose: To evaluate the model.
The testing dataset provides an unbiased assessment of the model’s performance on unseen
data.
Characteristics:
Exclusively used after training and validation.
Mimics real-world scenarios for robust evaluation.
Should remain untouched during the training process.
Example: Testing a sentiment analysis model with user reviews it hasn’t seen before.
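One common way to produce the three subsets is scikit-learn's train_test_split applied twice; a minimal sketch with made-up data (the 60/20/20 ratio is only an example):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features
y = np.arange(10)                 # hypothetical targets

# First carve out 20% as the final test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 6, 2, 2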
Machine learning datasets serve as the foundation for various applications across industries.
The type of dataset often dictates its use case, ranging from enhancing user experiences to
solving complex scientific challenges.
Here are some of the most impactful use cases:
1) Computer Vision
Computer vision tasks rely heavily on labeled video and image datasets for machine learning
for accurate predictions.
Applications:
Object detection and recognition (e.g., autonomous vehicles, security systems).
Image segmentation for medical diagnostics (e.g., tumor detection in X-rays).
Scene understanding in robotics and virtual reality, and geospatial annotation.
Dataset Examples: COCO, ImageNet, VisualData.
2) Natural Language Processing (NLP)
NLP tasks require diverse and well-annotated text datasets to train models that understand
and generate human language.
Applications:
Sentiment analysis for customer feedback.
Machine translation for multilingual content.
Chatbots and conversational AI systems.
Dataset Examples: IMDb Reviews, SQuAD, Common Crawl.
3) Time Series Analysis
Time-series datasets are crucial for forecasting and trend analysis.
Applications:
Predicting stock prices or market trends in finance.
Monitoring sensor data in IoT devices.
Analyzing patient health data for predictive healthcare.
Dataset Examples: UCI Gas Sensor, Yahoo Finance Datasets.
4) Speech and Audio Recognition
Datasets containing speech and audio signals power models for recognizing and processing
sound in automatic speech recognition.
Applications:
Voice assistants like Alexa and Siri.
Speaker diarization in meeting transcription tools.
Acoustic scene analysis for smart environments.
Dataset Examples: LibriSpeech, VoxCeleb.
5) Recommendation Systems
Recommendation systems rely on user behavior data to personalize suggestions.
Applications:
E-commerce platforms suggesting products.
Streaming services recommending content (e.g., Netflix, Spotify).
Personalized learning systems in education.
Dataset Examples: MovieLens, Amazon Product Data.
Difference between Training data and Testing data
Purpose:
* Training data: The machine learning model is trained using training data. The more training data a model has, the more accurate predictions it can make.
* Testing data: Testing data is used to evaluate the model's performance.
Use:
* Training data: Used to fit the model; keeping it separate from the test data helps avoid overfitting to the evaluation.
* Testing data: The performance of the model is assessed by making predictions on the testing data and comparing them to the actual labels.
Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset. It is an essential step in the data preprocessing phase of any data
science or machine learning workflow.
“Garbage in, garbage out” – the quality of your data directly affects the quality of your
results.
Data collected from various sources is often dirty, and this affects the accuracy of the
prediction results.
•Data quality is a main issue in quality information management.
•Data quality problems occur anywhere in information systems, and these problems are
solved by data cleaning.
•Data cleaning is a process used to detect inaccurate, incomplete, or unreasonable
data and then improve the quality by correcting the detected errors and omissions.
•Generally, data cleaning reduces errors and improves the data quality.
Missing data or missing values occur when we have no data points stored for a
particular column or feature.
There might be multiple data sources for creating a data set. Different data sources may
indicate missing values in different ways, which makes analysis even more complicated and
can significantly impact the conclusions drawn from the data.
Remember, not all NULL data is corrupt; sometimes we need to accept missing values.
This scenario depends entirely on the data set and the type of business problem.
Causes of Missing Data:
Human error during data entry
Data corruption or transmission failure
Inapplicable responses (e.g., income question for unemployed)
Software bugs or system issues
Types of Missing Data:
* MCAR (Missing Completely at Random): The missing values are independent of any data (e.g., a sensor failed randomly).
* MAR (Missing at Random): The missing values depend on observed data (e.g., men are less likely to report weight).
* MNAR (Missing Not at Random): Missing values depend on unobserved data (e.g., income not reported by high earners).
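In practice, missing values are often inspected and handled with pandas. A minimal sketch on a made-up frame is shown below; mean imputation and row dropping are only two of many possible strategies:

import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 41, 39],
                   "city": ["Pune", "Delhi", None, "Mumbai"]})

print(df.isnull().sum())                        # count missing values per column
df["age"] = df["age"].fillna(df["age"].mean())  # impute the numeric column with its mean
df = df.dropna(subset=["city"])                 # drop rows still missing the city
print(df)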
Detecting Outliers
a) Statistical Techniques:
Z-score Method: z = (x − μ) / σ, where μ is the mean and σ the standard deviation; values with |z| above a chosen threshold (commonly 3) are treated as outliers.
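A minimal NumPy sketch of z-score based detection on made-up data (the usual |z| > 3 rule is relaxed here because the sample is tiny):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier
z_scores = (data - data.mean()) / data.std()

threshold = 2  # small sample, so a lower cutoff than the usual 3 is used here
print(data[np.abs(z_scores) > threshold])  # -> [95]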
b) Visualization Tools:
Boxplot: Shows outliers as points beyond whiskers.
Scatter plot: Helps visualize isolated values.
Histogram: Detects skewed or long-tailed distributions.
a) Scatter Plot
Purpose:
To identify relationships between two variables and detect anomalous points.
Characteristics:
Each point represents one data observation.
X-axis and Y-axis represent two continuous variables.
Outliers appear as isolated points, far from the main cluster.
Example:
Imagine a dataset tracking:
X-axis: Number of website visits
Y-axis: Sales made
Most data points cluster between (100–300 visits, 10–50 sales).
But one point (600 visits, 5 sales) is an anomaly — low conversion despite high traffic.
Use Case:
Detect outliers and trends
Understand correlations
Visualize regression patterns
b) Histogram
Purpose:
To visualize the frequency distribution of a single continuous variable.
Characteristics:
Divides data into bins (intervals).
Y-axis shows count/frequency.
Shows:
o Shape (normal, skewed)
o Peaks (modes)
o Gaps
o Tails (long tails → potential outliers)
Example:
Imagine tracking daily temperatures:
Data mostly falls between 20–30°C.
A few days are >40°C.
In the histogram:
Most bins are concentrated around 25°C.
A long right tail shows outliers in the hot range (skewed distribution).
Use Case:
Check distribution normality
Identify skewness or multiple peaks
Detect unusual frequency gaps
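Both plots are straightforward to produce with Matplotlib; the sketch below uses synthetic data that mimics the two examples above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
visits = rng.integers(100, 300, size=50)            # made-up website visits
sales = visits * 0.15 + rng.normal(0, 5, size=50)   # roughly proportional sales
temps = np.append(rng.normal(25, 3, size=100), [41, 43])  # temperatures with hot outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(visits, sales)   # isolated points would stand out here
ax1.set(xlabel="Visits", ylabel="Sales", title="Scatter plot")
ax2.hist(temps, bins=20)     # a long right tail reveals the hot outliers
ax2.set(xlabel="Temperature (°C)", ylabel="Frequency", title="Histogram")
plt.show()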
2. Boxplot (Box-and-Whisker Plot): A box plot is a visualization technique (commonly produced in Python) used to
represent a data distribution. It is based on a five-number summary: minimum, Q1, median (Q2), Q3, and maximum.
Purpose:
To visualize the distribution of a dataset and highlight outliers.
Components:
Box: Represents the Interquartile Range (IQR = Q3 - Q1).
Line inside box: Median (Q2).
Whiskers: Extend to the minimum and maximum non-outlier values.
Outliers: Shown as individual dots beyond the whiskers.
Example:
Data: [10, 12, 15, 18, 19, 20, 23, 26, 28, 80]
Q1 = 14, Q3 = 25 → IQR = 11
Whiskers:
o Lower limit = Q1 − 1.5×IQR = 14 − 16.5 = -2.5
o Upper limit = Q3 + 1.5×IQR = 25 + 16.5 = 41.5
Outlier = 80, since it exceeds 41.5
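The same rule can be applied programmatically. Note that NumPy's default quartile interpolation gives slightly different Q1 and Q3 than the hand computation above, but the point 80 is flagged either way:

import numpy as np

data = np.array([10, 12, 15, 18, 19, 20, 23, 26, 28, 80])
q1, q3 = np.percentile(data, [25, 75])  # interpolated quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # -> [80]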
Use Case:
Outlier detection
Distribution comparison across groups
Handling Techniques: