ML Notes
• How it works
Machine learning uses algorithms to analyze data, identify patterns, and make
predictions. The more data a machine learning system is exposed to, the better it
performs.
• Applications
Machine learning is used in many areas, including healthcare, entertainment, online
shopping, and smart homes. For example, a financial institution can use machine
learning to classify transactions as fraudulent or genuine.
Features of Machine Learning:
• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is similar to data mining, as both deal with huge amounts of data.
Key Points:
• Rapid increase in the production of data
• Solving complex problems that are difficult for humans
• Decision-making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data
Applications of Machine Learning:
1. Self-Driving Cars
2. Fraud Detection
3. Face Recognition
4. Stock Prediction
Types of Machine Learning:
• The goal of supervised learning is to map input data to output data.
• The goal of unsupervised learning is to restructure the input data into new features or groups of objects with similar patterns.
• In reinforcement learning, the goal of an agent is to collect the most reward points and thereby improve its performance. Ex: a robotic dog.
[Diagram: applications of machine learning — Image Recognition, Speech Recognition, Traffic Prediction, Medical Diagnosis, Automatic Language Translation]
IMAGE RECOGNITION:
• Image recognition in machine learning involves using algorithms to identify and classify objects, patterns, or features within an image. It is a subset of computer vision, which aims to enable computers to interpret and make sense of the visual world.
• Ex: social media, face locks, etc.
SPEECH RECOGNITION:
• Speech recognition is the process of converting spoken language into text. This technology enables machines to understand and respond to human speech, and it has numerous applications such as voice-activated assistants, transcription services, and accessibility tools.
• Ex: Siri, Alexa
TRAFFIC PREDICTION:
• Traffic prediction is the process of estimating traffic conditions, such as congestion levels, travel times, or traffic volumes, for a specific road network. It is widely used in smart transportation systems, navigation applications, and urban planning.
• Algorithms used (a brief comparison sketch follows):
1. Multilayer Perceptron
2. Decision Tree
3. Naïve Bayes classifier
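A minimal sketch (not from the notes) of how the three algorithms listed above could be compared on a traffic-style classification task. The synthetic features and congestion labels come from scikit-learn's make_classification and are purely illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for features such as hour of day, weather, and vehicle count;
# the three classes represent low / medium / high congestion (illustrative only).
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and compare the three algorithms named in the notes
models = [
    ("Multilayer Perceptron", MLPClassifier(max_iter=1000, random_state=42)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("Naive Bayes", GaussianNB()),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```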
VIRTUAL PERSONAL ASSISTANT:
• Virtual Personal Assistants (VPAs) are AI-powered applications designed to assist users in managing tasks, accessing information, and controlling smart devices. They utilize natural language processing (NLP), machine learning, and automation to provide human-like interaction and support.
ONLINE FRAUD DETECTION:
• Online fraud detection uses advanced data analysis, machine learning, and pattern recognition to identify and prevent fraudulent activities in real time. Fraudulent activities can include identity theft, financial fraud, phishing, and payment fraud in e-commerce, banking, and other digital platforms.
• Algorithms used: feed-forward neural network (a brief sketch follows).
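A minimal sketch, assuming synthetic transaction features: a small feed-forward neural network (the algorithm named above) classifying transactions as fraudulent or genuine. The class imbalance and feature count are illustrative assumptions, not real fraud data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Imbalanced synthetic data: roughly 3% of transactions labeled fraudulent (1)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Scale the features, then train a small feed-forward network (two hidden layers)
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

print(classification_report(y_test, clf.predict(scaler.transform(X_test))))
```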
STOCK MARKET TRADING:
• Stock market trading involves buying and selling financial securities, such as stocks, bonds, and derivatives. Machine learning is widely used in stock market analysis to predict prices, optimize trading strategies, and manage risks.
MEDICAL DIAGNOSIS:
• Machine learning (ML) plays a transformative role in medical diagnosis by analyzing complex medical data to assist in detecting, diagnosing, and managing diseases. These models can process large datasets, identify patterns, and provide insights that support clinical decision-making.
AUTOMATIC LANGUAGE TRANSLATION:
• Automatic Language Translation is the process of translating text or speech from one language to another using computational models. Machine learning (ML) has significantly improved translation quality, making it more accurate and scalable.
MACHINE LEARNING LIFE CYCLE:
Gathering Data → Data Preparation → Data Wrangling → Analyze Data → Train the Model → Test the Model → Deployment
1. Gathering Data
This is the first and most crucial step in the machine learning lifecycle. Data is the
backbone of any machine learning model, and gathering relevant, accurate, and
high-quality data is vital to the success of the project.
Key Points:
• Sources of Data: Data can come from multiple sources such as databases, APIs, files
(CSV, Excel), or online platforms (social media, websites).
• Types of Data: Depending on the problem, the data can be structured (tables,
spreadsheets), unstructured (text, images, videos), or semi-structured (JSON, XML).
• Data Relevance: The data gathered should be relevant to the problem you're trying to
solve. It needs to represent the patterns or information you want the model to learn.
• Data Quantity: More data is generally better, but quality is more important than
quantity. The data should be sufficiently representative of the problem domain.
Example: If you are building a recommendation system, you might collect data on user
behavior, product interactions, and preferences.
2. Data Preparation
Once the data is gathered, it must be prepared for use in training the machine learning
model. This involves cleaning and transforming the data into a form that the model can use
effectively.
Key Points:
• Data Cleaning: This step involves handling missing values, dealing with duplicates,
correcting errors, and managing inconsistencies in the data. For example, if some rows
have missing values, you might fill them with the mean value or remove those rows
entirely.
• Data Transformation: This involves converting raw data into a form that can be easily
processed by machine learning algorithms. For example, you might normalize or scale
features to ensure that variables have comparable ranges, or you might encode
categorical data into numeric values.
• Data Splitting: Typically, data is split into three sets:
• Training Set: Used to train the model.
• Validation Set: Used to tune the model’s hyperparameters.
• Test Set: Used to evaluate the model’s performance after training.
Example: If you have a dataset with different scales (e.g., income in dollars and age in
years), you might scale the features using standardization techniques.
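A minimal sketch of the cleaning, transformation, and splitting steps above, using pandas and scikit-learn. The tiny table and its column names (income, age, city, churn) are made-up placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [52000, 61000, None, 45000, 98000, 39000],
    "age":    [34, 45, 29, None, 52, 41],
    "city":   ["NY", "LA", "NY", "SF", "LA", "SF"],
    "churn":  [0, 1, 0, 0, 1, 0],
})

# Data cleaning: fill missing numeric values with the column mean
df["income"] = df["income"].fillna(df["income"].mean())
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: encode the categorical column and scale numeric columns
df = pd.get_dummies(df, columns=["city"])
df[["income", "age"]] = StandardScaler().fit_transform(df[["income", "age"]])

# Data splitting: training, validation, and test sets
X, y = df.drop(columns="churn"), df["churn"]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```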
3. Data Wrangling
Data wrangling (also known as data munging) is a part of data preparation but deserves
special attention. It refers to the process of cleaning, transforming, and mapping raw data
into a more suitable format for analysis.
Key Points:
• Handling Missing Data: You might impute missing values, drop rows or columns with
excessive missing data, or use predictive models to estimate missing values.
• Outlier Detection and Removal: Outliers are extreme values that deviate significantly
from the rest of the data. Identifying and handling them is crucial, as they can skew the
results.
• Data Merging: Often, data comes from different sources, and combining these datasets
(merging or joining) is needed to create a comprehensive dataset for analysis.
• Normalizing/Scaling: Many algorithms (like distance-based algorithms) require
normalized or standardized data, so the features can be compared on the same scale.
Example: In a financial dataset, you might remove rows with excessively high transaction
values (outliers) that could distort trends.
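A minimal wrangling sketch covering the merging, outlier, and scaling points above. The two toy tables and the 1.5×IQR outlier rule are illustrative assumptions:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [120.0, 80.0, 95.0, 150000.0, 60.0],   # one extreme value
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "east"],
})

# Data merging: combine the two sources into one dataset
data = transactions.merge(customers, on="customer_id", how="left")

# Outlier removal: keep rows within 1.5 * IQR of the transaction amount
q1, q3 = data["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
data = data[data["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Normalizing: rescale the amount column to the [0, 1] range
data["amount"] = (data["amount"] - data["amount"].min()) / (data["amount"].max() - data["amount"].min())
print(data)
```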
4. Analyze Data
After the data has been prepared, it’s time to explore and analyze it. This step involves
understanding the relationships between different features and identifying patterns that
might be useful for building a model.
Key Points:
• Exploratory Data Analysis (EDA): EDA is used to summarize the main characteristics of
the dataset, often with visual methods. This step helps identify trends, patterns,
correlations, and anomalies.
• Use histograms, boxplots, and scatter plots to understand the distribution of variables
and the relationships between them.
• Correlation matrices can help identify how different features are related to each other.
• Identifying Key Features: Through EDA, you may identify which features are most
influential in predicting the target variable. For example, in a house price prediction
problem, square footage and number of bedrooms might be more influential than the color
of the house.
Example: If you predict customer churn, analyzing which demographic or behavior-related
features are most associated with churn can guide the selection process.
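A minimal EDA sketch using pandas and matplotlib; the churn-style columns are random, made-up data that mirror the customer-churn example above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up customer data for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 200),
    "monthly_spend": rng.normal(60, 20, 200).round(2),
    "support_calls": rng.poisson(2, 200),
})
df["churn"] = (df["support_calls"] > 3).astype(int)

print(df.describe())   # summary statistics for each feature
print(df.corr())       # correlation matrix between features and the target

df["monthly_spend"].hist(bins=20)                            # distribution of one variable
plt.scatter(df["age"], df["monthly_spend"], c=df["churn"])   # relationship, colored by churn
plt.show()
```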
5. Train the Model
Training a model is the step where the actual machine-learning algorithm is applied to the
data. The goal is to enable the model to learn patterns and relationships in the data so it can
make accurate predictions.
Key Points:
• Model Selection: Choose an appropriate machine learning algorithm based on the nature
of the problem (e.g., classification, regression, clustering).
• Examples: Decision Trees, Logistic Regression, Random Forest, Support Vector
Machines, and Neural Networks.
• Hyperparameter Tuning: Hyperparameters are settings or configurations that control the
model's training process (e.g., learning rate, number of layers in a neural network).
Fine-tuning these hyperparameters is crucial for getting the best performance.
• Training Process: The model is trained by feeding the training data and letting it adjust
internal parameters to minimize errors or loss. The training process may take some time
depending on the size of the dataset and the complexity of the model.
Example: In a classification problem, you may train a support vector machine (SVM) to
predict whether an email is spam or not based on features like word frequency and sender
details.
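A minimal sketch of the spam example above: a support vector machine trained on word-frequency features. The six short messages and their labels are invented placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 tomorrow", "free money click here",
         "lunch with the team", "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = not spam

# CountVectorizer turns each message into word-frequency features;
# the SVM then learns a boundary between spam and non-spam.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(texts, labels)

print(model.predict(["free prize waiting for you", "see you at the meeting"]))
```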
6. Test the Model
After training the model, it's time to evaluate its performance using the test dataset. This step
helps assess how well the model generalizes to unseen data.
Key Points:
• Model Evaluation Metrics: The choice of evaluation metrics depends on the type of
model and the problem you're solving.
• Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
• Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared.
• Overfitting/Underfitting: A model that performs well on the training data but poorly on the
test data is overfitting. A model that performs poorly on both the training and test data is
underfitting. Fine-tuning the model and selecting the right features can help avoid these
issues.
Example: You might test a classification model to see how accurately it classifies emails as
spam or non-spam, calculating precision and recall to ensure it's not falsely labeling
non-spam as spam.
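A minimal sketch of the evaluation step; y_true and y_pred stand in for the test-set labels and the model's predictions (1 = spam, 0 = non-spam) and are made up for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels from the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the model

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # of predicted spam, how much really is spam
print("Recall   :", recall_score(y_true, y_pred))      # of actual spam, how much was caught
print("F1-Score :", f1_score(y_true, y_pred))
```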
7. Deployment
Once the model has been trained and tested, the final step is deployment—putting the
model into production so it can be used to make predictions on new data.
Key Points:
• Integration: The model is integrated into an existing system or application. For example,
an e-commerce site might integrate a recommendation model into its website to suggest
products to users.
• Monitoring: Once deployed, the model’s performance should be monitored regularly to
ensure that it continues to perform well. This may involve setting up dashboards or logs to
track model outputs.
• Model Updates: In many cases, models degrade over time as new data becomes
available. It’s important to retrain the model periodically with fresh data and potentially
update it to improve accuracy.
Example: After training a model to predict customer churn, you deploy it as an API in a
customer relationship management (CRM) tool to predict which customers are likely to
churn, helping businesses take proactive measures.
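A minimal deployment sketch, assuming a trained churn model has been saved to a hypothetical file "churn_model.pkl" and that Flask is used as the serving framework (both are assumptions, not part of the notes):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the previously trained model (hypothetical file name)
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [age, tenure, monthly_spend]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"churn": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)   # in production this would sit behind a proper web server
```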
RETRIEVING DATASETS:
What is a dataset? A dataset is a collection of related data (for example, rows in a table or a set of images) used to train and evaluate machine learning models. Some common sources for downloading datasets are listed below.
1. Kaggle Datasets
The link for the Kaggle dataset is
https://www.kaggle.com/datasets.
2. UCI Machine Learning Repository
The link for the UCI machine learning repository is
https://archive.ics.uci.edu/ml/index.php.
3. Datasets via AWS
The link for the resource is https://registry.opendata.aws/.
4. Google's Dataset Search Engine
The link for Google's dataset search engine is https://toolbox.google.com/datasetsearch.
5. Microsoft Datasets
The link to download or use datasets from this resource is https://msropendata.com/.
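A minimal sketch: once a dataset is downloaded from any of the sources above (for example as a CSV file), it can be loaded with pandas. The file name dataset.csv is a placeholder:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")   # path to the downloaded file
print(df.shape)                   # number of rows and columns
print(df.head())                  # first few records
```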