Datascience Notes

The document outlines the modeling process in machine learning, which includes hypothesis development, experiment design, execution, and evaluation. It details various types of machine learning, including supervised, unsupervised, semi-supervised, and reinforcement learning, along with their advantages, disadvantages, and applications. Additionally, it touches on outlier detection and introduces data visualization using Matplotlib for creating various types of plots.

Uploaded by

samuelraj2006

UNIT III MACHINE LEARNING

The modeling process


Modeling is a multi-stage methodology for creating trained and tested Machine Learning
and AI models. The modeling process is essentially a scientific experiment, which
includes:
 Development of a hypothesis - e.g., data collected about a specific previous consumer
behavior can be used to predict future behavior
 Design of the experiment - e.g., model/algorithm selection
 Execution of the experiment - e.g., model training and testing
 Evaluation and explanation of results - e.g., is the hypothesis true or false, and what is
the accuracy
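The four stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, and the "model" is a single threshold on a hypothetical past_purchases feature rather than a real learning algorithm.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothesis: customers who bought often in the past will buy again.
# Synthetic data (an assumption): (past_purchases, bought_again) pairs.
data = []
for _ in range(200):
    past = random.randint(0, 10)
    bought_again = 1 if past + random.gauss(0, 2) > 5 else 0
    data.append((past, bought_again))

# Design of the experiment: a one-parameter threshold rule and a train/test split.
random.shuffle(data)
train, test = data[:150], data[150:]

def accuracy(threshold, rows):
    """Fraction of rows where the rule 'past > threshold' matches the label."""
    hits = sum((past > threshold) == bool(label) for past, label in rows)
    return hits / len(rows)

# Execution: "train" by picking the threshold that fits the training set best.
best_threshold = max(range(11), key=lambda t: accuracy(t, train))

# Evaluation: measure accuracy on data the model has not seen.
test_accuracy = accuracy(best_threshold, test)
print(f"threshold={best_threshold}, test accuracy={test_accuracy:.2f}")
```

Evaluating on the held-out test rows, rather than on the training rows, is what makes the accuracy figure a fair check of the hypothesis.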
Types of machine learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as
follows:
 Supervised Machine Learning
 Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Reinforcement Learning
1. Supervised Machine Learning
In supervised learning, a model is trained on a labelled dataset, that is, a dataset in
which every example has both input features and the correct output. The algorithm
learns a mapping from the inputs to the correct outputs.

Example: Suppose you have to build an image classifier that differentiates between
cats and dogs. If you feed labelled images of dogs and cats to the algorithm, it
learns to distinguish the two from those examples. When it is given a new dog or
cat image that it has never seen before, it uses the learned model to predict
whether the image shows a dog or a cat. This is how supervised learning works;
this particular task is image classification.
There are two main categories of supervised learning that are mentioned
below:
 Classification
 Regression

Classification
Classification deals with predicting categorical target variables, which
represent discrete classes or labels. For instance, classifying emails as spam or
not spam, or predicting whether a patient has a high risk of heart disease.
Classification algorithms learn to map the input features to one of the predefined
classes.
Here are some classification algorithms:

 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
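As an illustration of mapping input features to predefined classes, here is a minimal K-Nearest Neighbors sketch in plain Python. The two features (number of links and number of exclamation marks per email) and all data values are hypothetical; a real project would normally use a library implementation.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks).
emails = [(0, 0), (1, 0), (0, 1), (8, 6), (9, 7), (7, 9)]
labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

print(knn_predict(emails, labels, (8, 8)))  # the three nearest emails are all spam
print(knn_predict(emails, labels, (1, 1)))  # the three nearest emails are all not spam
```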

Regression
Regression, on the other hand, deals with predicting continuous target
variables, which represent numerical values. For example, predicting the price of
a house based on its size, location, and amenities, or forecasting the sales of a
product. Regression algorithms learn to map the input features to a continuous
numerical value.

Here are some regression algorithms:


 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
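To make the idea of predicting a continuous value concrete, the following sketch fits a straight line by ordinary least squares in plain Python. The house sizes and prices are made-up numbers chosen to lie exactly on a line, so the fitted parameters are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size in square metres and price in thousands.
sizes = [50, 80, 100, 120, 150]
prices = [110, 170, 210, 250, 310]   # exactly 2 * size + 10

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept   # predicted price for a 90 square metre house
print(slope, intercept, predicted)
```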
Advantages of Supervised Machine Learning
 Supervised learning models can reach high accuracy because they are trained
on labelled data.
 The decision-making process of many supervised models (for example, decision
trees and linear models) is often interpretable.
 Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
 It only learns patterns present in the training data and may struggle with unseen
or unexpected patterns.
 Collecting and labelling data can be time-consuming and costly, because the
method relies on labelled data.
 It may generalize poorly to new data, for example when the model overfits the
training set.

Applications of Supervised Learning


Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Natural language processing: Extract information from text, such as
sentiment, entities, and relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and
stock prices.

 Medical diagnosis: Detect diseases and other medical conditions.


 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the
environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyze player behavior, and create
NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation,
and other meteorological parameters.

 Sports analytics: Analyze player performance, make game predictions,
and optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning
technique in which an algorithm discovers patterns and relationships
using unlabeled data. Unlike supervised learning, unsupervised learning
doesn’t involve providing the algorithm with labeled target outputs. The
primary goal of unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various
purposes, such as data exploration, visualization, and dimensionality reduction.
Example: Consider a dataset that contains information about the purchases
customers made from a shop. Through clustering, the algorithm can group
customers with similar purchasing behavior, which reveals customer segments
without predefined labels. This type of information can help businesses target
customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned
below:
 Clustering
 Association

Clustering
Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labeled examples.
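Here are some clustering algorithms:
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
Principal Component Analysis and Independent Component Analysis are often listed
alongside these, but they are dimensionality reduction techniques rather than
clustering algorithms.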
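The grouping-by-similarity idea can be illustrated with a small K-Means sketch in plain Python. It assumes two-dimensional points and, to stay deterministic, uses the first k points as the initial centroids; library implementations use more careful initialization.

```python
import math

def k_means(points, k, iterations=10):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```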

Association
Association rule learning is a technique for discovering relationships between
items in a dataset. It identifies rules stating that the presence of one item
implies the presence of another item with a specific probability (the rule's
confidence).
Here are some association rule learning algorithms:
 Apriori Algorithm
 Eclat
 FP-growth Algorithm
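The two quantities these algorithms rely on, support and confidence, can be computed directly. The sketch below uses a handful of made-up shopping baskets; it shows the measures only, not the Apriori, Eclat, or FP-growth search procedures themselves.

```python
# Hypothetical transactions, one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(f"support(bread, milk) = {support({'bread', 'milk'}):.2f}")
print(f"confidence(bread -> milk) = {confidence({'bread'}, {'milk'}):.2f}")
```

Here bread appears in 4 of 5 baskets and bread together with milk in 3 of 5, so the rule "bread implies milk" has a confidence of 3/4.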
Advantages of Unsupervised Machine Learning
 It helps to discover hidden patterns and various relationships between the
data.
 It is used for tasks such as customer segmentation, anomaly
detection, and data exploration.
 It does not require labeled data and reduces the effort of data labeling.
 It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
 Without labels, it may be difficult to evaluate the quality of the model’s
output.
 The resulting clusters may not be clear and may not have meaningful
interpretations.

Applications of Unsupervised Learning


Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while
preserving its essential information.
 Recommendation systems: Suggest products, movies, or content to users
based on their historical behavior or preferences.
 Topic modeling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for
multimedia content.
 Data preprocessing: Help with data preprocessing tasks such as data
cleaning, imputation of missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Genomic data analysis: Identify patterns or group genes with similar
expression profiles.
 Community detection in social networks: Identify communities or groups
of individuals with similar interests or connections.
 Customer behavior analysis: Uncover patterns and insights for better
marketing and product recommendations.
 Content recommendation: Classify and tag content to make it easier to
recommend similar items to users.
 Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning, so it uses both labelled and
unlabelled data. It is particularly useful when obtaining labeled data is costly,
time-consuming, or resource-intensive, for example when labeling requires
skilled annotators.
We use these techniques when a small part of the data is labeled and the large
remainder is unlabeled. We can use unsupervised techniques to predict labels
and then feed these labels to supervised techniques. This technique is mostly
applicable to image datasets, where usually not all images are labeled.

Example: Consider building a language translation model. Having labeled
translations for every sentence pair can be resource-intensive. Semi-supervised
learning allows the model to learn from both labeled and unlabeled sentence
pairs, which can make it more accurate. This technique has led to significant
improvements in the quality of machine translation services.

Types of Semi-Supervised Learning Methods


There are a number of different semi-supervised learning methods each with its
own characteristics. Some of the most common ones include:
 Graph-based semi-supervised learning: This approach uses a graph to
represent the relationships between the data points. The graph is then used to
propagate labels from the labeled data points to the unlabeled data points.
 Label propagation: This approach iteratively propagates labels from the
labeled data points to the unlabeled data points, based on the similarities
between the data points.
 Co-training: This approach trains two different machine learning models on
two different views (feature subsets) of the labeled data. Each model then
labels the unlabeled examples it is most confident about, and those examples
are added to the other model's training set.
 Self-training: This approach trains a machine learning model on the labeled
data and then uses the model to predict labels for the unlabeled data. The
model is then retrained on the labeled data and the predicted labels for the
unlabeled data.
 Generative adversarial networks (GANs) : GANs are a type of deep
learning algorithm that can be used to generate synthetic data. GANs can be
used to generate unlabeled data for semi-supervised learning by training two
neural networks, a generator and a discriminator.
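Of the methods above, self-training is the simplest to sketch. The example below is a toy version under stated assumptions: the data is one-dimensional, the base model is a nearest-class-mean classifier, and the confidence threshold of 3.0 is an arbitrary illustrative choice.

```python
def class_means(labeled):
    """Fit the base model: one mean per class from (value, label) pairs."""
    means = {}
    for label in {lab for _, lab in labeled}:
        values = [v for v, lab in labeled if lab == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, value):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted((abs(value - m), label) for label, m in means.items())
    margin = ranked[1][0] - ranked[0][0]
    return ranked[0][1], margin

labeled = [(1.0, "low"), (9.0, "high")]      # small labeled set (hypothetical)
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]        # larger unlabeled set (hypothetical)

for _ in range(3):                           # a few self-training rounds
    means = class_means(labeled)             # retrain on everything labeled so far
    still_unlabeled = []
    for value in unlabeled:
        label, margin = predict(means, value)
        if margin >= 3.0:                    # confidence threshold (an assumption)
            labeled.append((value, label))   # accept the pseudo-label
        else:
            still_unlabeled.append(value)    # too ambiguous, leave unlabeled
    unlabeled = still_unlabeled

print(labeled)
print(unlabeled)
```

The point near the middle (5.2) never clears the threshold and stays unlabeled, which shows why a confidence cut-off is commonly used to limit the spread of wrong pseudo-labels.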

Advantages of Semi- Supervised Machine Learning


 It can lead to better generalization than supervised learning on the labeled
data alone, as it uses both labeled and unlabeled data.
 It can be applied to a wide range of data.

Disadvantages of Semi- Supervised Machine Learning


 Semi-supervised methods can be more complex to implement compared to
other approaches.
 It still requires some labeled data that might not always be available or easy
to obtain.
 Noisy or unrepresentative unlabeled data can degrade model performance.

Applications of Semi-Supervised Learning


Here are some common applications of semi-supervised learning:
 Image Classification and Object Recognition: Improve the accuracy of
models by combining a small set of labeled images with a larger set of
unlabeled images.
 Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a
vast amount of unlabeled text.
 Speech Recognition: Improve the accuracy of speech recognition by
leveraging a limited amount of transcribed speech data and a more extensive
set of unlabeled audio.
 Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions
(labeled data) with a wealth of unlabeled user behavior data.
 Healthcare and Medical Imaging: Enhance medical image analysis by
utilizing a small set of labeled medical images alongside a larger set of
unlabeled images.

4. Reinforcement Machine Learning


Reinforcement learning is a learning method in which an agent interacts
with the environment by producing actions and discovering errors. Trial and
error and delayed reward are the most relevant characteristics of reinforcement
learning. In this technique, the model keeps improving its performance using
reward feedback to learn the behavior or pattern. These algorithms are specific
to a particular problem, e.g. the Google self-driving car, or AlphaGo, where a bot
competes with humans and even with itself to become a better and better player
of the game of Go. Each interaction adds to the agent's experience, which serves
as its training data, so the more it interacts, the better trained and more
experienced it becomes.

Here are some of the most common reinforcement learning algorithms:


 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps state-action pairs to values. The Q-function estimates the expected
cumulative reward of taking a particular action in a given state.
 SARSA (State-Action-Reward-State-Action): SARSA is another model-free
RL algorithm that learns a Q-function. However, unlike Q-learning,
SARSA updates the Q-function using the next action that was actually taken,
rather than the best available next action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
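A tabular Q-learning update can be sketched on a toy problem. The environment below is an assumption made for illustration: five states on a line, a reward of 1 for reaching the rightmost state, and hand-picked values for the learning rate, discount factor, and exploration rate.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_STATES = 5            # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, 1)       # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Move along the line; reaching the goal state pays a reward of 1."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(state):
    """Epsilon-greedy choice with random tie-breaking."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for _ in range(200):                      # episodes
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)
        next_state, reward = step(state, action)
        # The Q-learning target uses the best next action, whatever is taken next.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Greedy policy for the non-terminal states: 1 means "step right".
greedy_policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

A SARSA version would differ only in the target: instead of best_next it would use the Q-value of the action that the epsilon-greedy rule actually picks in next_state.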

Example: Consider that you are training an AI agent to play a game like chess.
The agent explores different moves and receives positive or negative feedback
based on the outcome. Reinforcement learning also finds applications in settings
where agents learn to perform tasks by interacting with their surroundings.

Types of Reinforcement Machine Learning


There are two main types of reinforcement learning:
Positive reinforcement
 Rewards the agent for taking a desired action.
 Encourages the agent to repeat the behavior.
 Examples: Giving a treat to a dog for sitting, providing a point in a game for
a correct answer.
Negative reinforcement
 Removes an undesirable stimulus to encourage a desired behavior.
 Encourages the agent to repeat the behavior that removed the stimulus; it is
not the same as punishment.
 Examples: Turning off a loud buzzer when a lever is pressed, avoiding a
penalty by completing a task.
Advantages of Reinforcement Machine Learning
 It supports autonomous decision-making and is well suited to tasks that require
learning a sequence of decisions, like robotics and game-playing.
 This technique is preferred for achieving long-term results that are very difficult
to achieve otherwise.
 It is used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement Machine Learning
 Training reinforcement learning agents can be computationally expensive
and time-consuming.
 Reinforcement learning is not a good choice for solving simple problems.
 It needs a lot of data and a lot of computation, which can make it impractical
and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Robotics: RL can teach robots to perform tasks autonomously.
 Autonomous Vehicles: RL can help self-driving cars navigate and make
decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems
and chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize
supply chain operations.
 Energy Management: RL can be used to optimize energy consumption.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in
video games.
 Adaptive Personal Assistants: RL can be used to improve personal
assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to
create immersive and interactive experiences.
 Industrial Control: RL can be used to optimize industrial processes.
 Education: RL can be used to create adaptive learning systems.
 Agriculture: RL can be used to optimize agricultural operations.

Outliers and outlier analysis


• An outlier is a data point that is significantly different from the other data
points in a dataset.
• Outliers can occur for various reasons, such as measurement errors, data
entry errors, or natural variations in the data.
• Outlier detection (outlier analysis) is the task of identifying such points.
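One common way to flag points that are "significantly different" is the interquartile range (IQR) rule. The notes do not prescribe a method, so the rule, the factor of 1.5, and the sample readings below are all assumptions chosen for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95, 12, 13]
print(iqr_outliers(readings))
```

Whether a flagged point is an error to remove or a genuine extreme value to keep still has to be decided from domain knowledge.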

UNIT IV DATA VISUALIZATION
Importing Matplotlib
 Matplotlib is a popular data visualization library in Python. It is often
used for creating static, interactive, and animated visualizations, and it
allows you to generate plots, histograms, bar charts, scatter plots, etc.
 Matplotlib contains several submodules, such as pyplot.
 Import matplotlib.pyplot with the alias plt, which is the alias used by
convention for this submodule.
 After defining the two lists days and steps_walked, you use two of
the functions from Matplotlib to plot them.
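A minimal sketch of the import convention and the two lists mentioned above follows. The list contents are invented for illustration, and the choice of plt.plot with axis labels is an assumption about which functions the original example used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical values for the two lists named in the notes.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
steps_walked = [8934, 14902, 3409, 25672, 12300]

lines = plt.plot(days, steps_walked)   # returns a list of Line2D objects
plt.xlabel("Day")
plt.ylabel("Steps walked")
```

In an interactive session, calling plt.show() afterwards displays the figure.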
Simple line plots
A line chart, also referred to as a line graph or a line plot, connects a series of data
points using a line. This chart type presents sequential values to help you identify
trends. Most of the time, the x-axis (horizontal axis) represents a sequential
progression of values.
Perhaps the simplest of all plots is the visualization of a single function y = f(x).
Here we will take a first look at creating a simple plot of this type. As with all the
following sections, we'll start by setting up the notebook for plotting and importing
the packages we will use:
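In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
# The 'seaborn-whitegrid' style was renamed in Matplotlib 3.6;
# on earlier versions use plt.style.use('seaborn-whitegrid') instead.
plt.style.use('seaborn-v0_8-whitegrid')
import numpy as np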
For all Matplotlib plots, we start by creating a figure and an axes. In their simplest
form, a figure and axes can be created as follows:

In [2]:
fig = plt.figure()
ax = plt.axes()
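Once the figure and axes exist, a function can be drawn with ax.plot. The sketch below plots y = sin(x); the choice of function and the sampling range are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.axes()

x = np.linspace(0, 10, 1000)           # 1000 evenly spaced x values
line, = ax.plot(x, np.sin(x))          # connect the (x, sin x) points with a line
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
```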

Simple scatter plots


A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical
axis indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.
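A scatter plot of two related variables can be sketched with ax.scatter. The data here is randomly generated with a built-in linear relationship, purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)   # y depends on x plus noise

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=20, alpha=0.7)    # one dot per (x, y) pair
ax.set_xlabel("x")
ax.set_ylabel("y")
```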
Visualizing errors
A graphical representation of errors or uncertainties in the measurement is called an
error bar. It's a line (a cap line, sometimes) drawn from the data point. The line's
length shows us how precise the measurement is. A short error bar means that the
values are concentrated, and the data is reliable.
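In Matplotlib, error bars can be drawn with ax.errorbar. The measurements and their uncertainties below are made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])       # hypothetical measurements
yerr = np.array([0.2, 0.3, 0.2, 0.8, 0.4])    # hypothetical uncertainties

fig, ax = plt.subplots()
# fmt="o" draws markers only; capsize adds the short cap lines at the bar ends.
container = ax.errorbar(x, y, yerr=yerr, fmt="o", capsize=4)
```

The fourth point has the longest bar, which marks it as the least precise measurement.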

Density and contour plots


A 2D histogram contour plot, also known as a density contour plot, is a 2-
dimensional generalization of a histogram which resembles a contour plot but is
computed by grouping a set of points specified by their x and y coordinates into
bins, and applying an aggregation function such as count or sum (if z is provided)
to ...
• We can also say in a more general way that a contour line of a function
with two variables is a curve which connects points with the same
values.

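import numpy as np

# Coordinates along each axis: 3 points in x, 4 points in y
xlist = np.linspace(-3.0, 3.0, 3)
ylist = np.linspace(-3.0, 3.0, 4)

# meshgrid expands the two 1-D arrays into 2-D coordinate grids of shape (4, 3)
X, Y = np.meshgrid(xlist, ylist)

print(xlist)
print(ylist)
print(X)
print(Y)

OUTPUT
[-3.  0.  3.]
[-3. -1.  1.  3.]
[[-3.  0.  3.]
 [-3.  0.  3.]
 [-3.  0.  3.]
 [-3.  0.  3.]]
[[-3. -3. -3.]
 [-1. -1. -1.]
 [ 1.  1.  1.]
 [ 3.  3.  3.]]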
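With coordinate grids from np.meshgrid, a function of two variables can be drawn as contour lines using ax.contour. The function (distance from the origin) and the contour levels are illustrative choices, so each contour here is a circle of constant radius.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

xlist = np.linspace(-3.0, 3.0, 100)
ylist = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(xlist, ylist)
Z = np.sqrt(X**2 + Y**2)               # value of the function at every grid point

fig, ax = plt.subplots()
contours = ax.contour(X, Y, Z, levels=[0.5, 1.0, 1.5, 2.0, 2.5])
ax.clabel(contours, inline=True, fontsize=8)   # write the level on each line
ax.set_aspect("equal")
```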
Histograms
 A histogram is a graph used to represent the frequency distribution of
the data points of one variable.
 Histograms classify data into “bins” or “range groups” and
count how many data points belong to each of those bins.
 The histogram was introduced by Karl Pearson, an English mathematician.
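The binning and counting described above is what ax.hist does. The sample below is randomly generated, and the number of bins is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=1000)   # hypothetical measurements

fig, ax = plt.subplots()
# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges, patches = ax.hist(data, bins=20)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```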

Legends
A legend is used to identify data in visualizations by its color, size, or other
distinguishing features. Simply connect one or more data visualizations to a legend
and they will automatically display a table of symbols and descriptions to help
users understand what is being displayed.
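In Matplotlib, a legend is typically built from the label given to each plotted element. The curves below are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), "-b", label="sin(x)")    # solid blue line
ax.plot(x, np.cos(x), "--r", label="cos(x)")   # dashed red line
legend = ax.legend(loc="upper right", frameon=False)
```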

Colours
Color selection in data visualization is not merely an aesthetic choice, it is a
crucial tool to convey quantitative information. Properly selected colors convey the
underlying data accurately, in contrast to many color schemes commonly used in
visualization that distort relationships between data values.
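As a sketch of using color to carry quantitative information, the example below colors each point by a data value and adds a colorbar. The "viridis" colormap is used because it is designed to be perceptually uniform; the data itself is random and illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.random(50), rng.random(50)
values = x + y                          # the quantity encoded as color

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=values, cmap="viridis")
colorbar = fig.colorbar(points, ax=ax, label="x + y")   # maps colors back to values
```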
Subplot
Subplots are like mini-plots positioned in a grid arrangement. They are
useful for displaying different plot types for the same data, or the same plot type
with subsets of your data.
For example, you might have 4 plots arranged in a 2 by 2 grid.
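A 2 by 2 grid of subplots can be created with plt.subplots. The four functions plotted are arbitrary examples.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 100)

# axes is a 2 x 2 NumPy array of Axes objects, indexed by [row, column].
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, x)
axes[0, 0].set_title("linear")
axes[0, 1].plot(x, x**2)
axes[0, 1].set_title("quadratic")
axes[1, 0].plot(x, np.sqrt(x))
axes[1, 0].set_title("square root")
axes[1, 1].plot(x, np.exp(-x))
axes[1, 1].set_title("exponential decay")
fig.tight_layout()                      # keep titles and tick labels from overlapping
```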
Text and annotation
Text visualization is one of the most important tools for text mining because it is
readable by both humans and machines. Text visualization is mainly achieved
through the use of graphs, charts, word clouds, maps, networks, timelines, etc.
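Within a Matplotlib plot, text is usually added with ax.text and ax.annotate. The sketch below is an assumption about what this topic covers in a Matplotlib context; the curve and label positions are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_ylim(-1.5, 1.8)

# Plain text placed at data coordinates (0.5, -1.2).
label = ax.text(0.5, -1.2, "y = sin(x)")

# Text with an arrow pointing at the first maximum of the curve.
note = ax.annotate(
    "local maximum",
    xy=(np.pi / 2, 1.0),          # point being annotated
    xytext=(3.5, 1.4),            # where the text is placed
    arrowprops=dict(arrowstyle="->"),
)
```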
Customization
Custom visuals in Power BI refer to visualizations that are not built-in to the
standard set of charts and graphs provided by Power BI Desktop. These custom
visuals are created by third-party developers or by users themselves using the Power
BI custom visuals SDK.
Three dimensional plotting
3D scatter plots display data points in a three-dimensional space, utilizing
three axes (X, Y, and Z) to represent three different variables.
3D data visualization is the process of creating three-dimensional
representations of data sets. This can include anything from simple graphs and charts
to complex models and simulations.
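A 3D scatter plot can be sketched in Matplotlib by creating axes with projection="3d". The three variables below are random numbers used only to have something to draw.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x, y, z = rng.random(100), rng.random(100), rng.random(100)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")       # three-dimensional axes
ax.scatter(x, y, z, c=z, cmap="viridis")    # color also encodes the Z value
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
```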
Geographic data and basemap
Matplotlib's main tool for this type of visualization is the Basemap toolkit,
which is one of several Matplotlib toolkits that live under the mpl_toolkits
namespace.

Visualizing with Seaborn


Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics. For
a brief introduction to the ideas behind the library, you can read the introductory
notes or the paper.
