AIML - Notes: Units 2 & 3
Heuristic functions play a critical role in artificial intelligence (AI), particularly in search
algorithms used for problem-solving. These functions estimate the cost to reach the goal from
a given state, helping to make informed decisions that optimize the search process.
In this article, we will explore what heuristic functions are, their role in search algorithms,
various types of heuristic search algorithms, and their applications in AI.
Table of Contents
What are Heuristic Functions?
Search Algorithm
Heuristic Search Algorithm in AI
o A* Algorithm
o Greedy Best-First Search
o Hill-Climbing Algorithm
Role of Heuristic Functions in AI
Common Problem Types for Heuristic Functions
Path Finding with Heuristic Functions
o Step 1: Define the A* Algorithm
o Step 2: Define the Visualization Function
o Step 3: Define the Grid and Start/Goal Positions
o Step 4: Run the A* Algorithm and Visualize the Path
o Complete Code
Applications of Heuristic Functions in AI
Conclusion
What are Heuristic Functions?
Heuristic functions are strategies or methods that guide the search process in AI algorithms
by providing estimates of the most promising path to a solution. They are often used in
scenarios where finding an exact solution is computationally infeasible. Instead, heuristics
provide a practical approach by narrowing down the search space, leading to faster and more
efficient problem-solving.
Heuristic functions transform complex problems into more manageable subproblems by
providing estimates that guide the search process. This approach is particularly effective in
AI planning, where the goal is to sequence actions that lead to a desired outcome.
Search Algorithm
Search algorithms are fundamental to AI, enabling systems to navigate through problem
spaces to find solutions. These algorithms can be classified into uninformed (blind) and
informed (heuristic) searches. Uninformed search algorithms, such as breadth-first and depth-
first search, do not have additional information about the goal state beyond the problem
definition. In contrast, informed search algorithms use heuristic functions to estimate the cost
of reaching the goal, significantly improving search efficiency.
Heuristic Search Algorithm in AI
Heuristic search algorithms leverage heuristic functions to make more intelligent decisions
during the search process. Some common heuristic search algorithms include:
A* Algorithm
The A* algorithm is one of the most widely used heuristic search algorithms. It uses both the
actual cost from the start node to the current node (g(n)) and the estimated cost from the
current node to the goal (h(n)). The total estimated cost (f(n)) is the sum of these two values:
f(n) = g(n) + h(n)
Greedy Best-First Search
The Greedy Best-First Search algorithm selects the path that appears to be the most
promising based on the heuristic function alone. It prioritizes nodes with the lowest heuristic
cost (h(n)), but it does not necessarily guarantee the shortest path to the goal.
Hill-Climbing Algorithm
The Hill-Climbing algorithm is a local search algorithm that continuously moves towards the
neighbor with the lowest heuristic cost. It resembles climbing uphill towards the goal but can
get stuck in local optima.
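To make the idea concrete, here is a minimal Python sketch of hill climbing; the one-dimensional state space, the neighbour function, and the heuristic are invented purely for illustration and are not part of the original notes.

```python
def hill_climb(start, heuristic, neighbours):
    """Repeatedly move to the neighbour with the lowest heuristic cost.

    Stops when no neighbour improves on the current state, which may
    be a local optimum rather than the global goal.
    """
    current = start
    while True:
        best = min(neighbours(current), key=heuristic)
        if heuristic(best) >= heuristic(current):
            return current  # stuck: local optimum (or the goal)
        current = best

# Toy example: states are integers, the goal is 10, h(n) = |10 - n|.
h = lambda n: abs(10 - n)
adjacent = lambda n: [n - 1, n + 1]
print(hill_climb(0, h, adjacent))  # prints 10
```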
Role of Heuristic Functions in AI
Heuristic functions are essential in AI for several reasons:
Efficiency: They reduce the search space, leading to faster solution times.
Guidance: They provide a sense of direction in large problem spaces, avoiding
unnecessary exploration.
Practicality: They offer practical solutions in situations where exact methods are
computationally prohibitive.
Common Problem Types for Heuristic Functions
Heuristic functions are particularly useful in various problem types, including:
1. Pathfinding Problems: Pathfinding problems, such as navigating a maze or finding the
shortest route on a map, benefit greatly from heuristic functions that estimate the distance
to the goal.
2. Constraint Satisfaction Problems: In constraint satisfaction problems, such as
scheduling and puzzle-solving, heuristics help in selecting the most promising variables
and values to explore.
3. Optimization Problems: Optimization problems, like the traveling salesman problem,
use heuristics to find near-optimal solutions within a reasonable time frame.
Path Finding with Heuristic Functions
Step 1: Define the A* Algorithm
This step involves defining the A* algorithm, which finds the shortest path from the start to
the goal using a heuristic function. The heuristic function used here is the Manhattan
distance. It returns the path from the start to the goal if one is found.
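The notes reference the code for Steps 1 through 4 but do not include it, so the following is a self-contained sketch under assumed conventions: a 2D grid of 0s (free cells) and 1s (obstacles), 4-directional moves with unit step cost, and Manhattan distance as h(n). The visualization step is reduced to a simple print, and the grid, start, and goal values are illustrative.

```python
import heapq

def manhattan(a, b):
    """Heuristic h(n): Manhattan distance between grid cells a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Step 1: A* search on a 2D grid; returns a shortest path or None."""
    rows, cols = len(grid), len(grid[0])
    open_heap = [(manhattan(start, goal), 0, start)]  # entries: (f, g, cell)
    came_from = {}
    g_score = {start: 0}
    while open_heap:
        f, g, cell = heapq.heappop(open_heap)
        if cell == goal:                      # reconstruct the path
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = (cell[0] + dr, cell[1] + dc)
            if (0 <= nbr[0] < rows and 0 <= nbr[1] < cols
                    and grid[nbr[0]][nbr[1]] == 0):
                tentative_g = g + 1           # unit step cost
                if tentative_g < g_score.get(nbr, float("inf")):
                    g_score[nbr] = tentative_g
                    came_from[nbr] = cell
                    heapq.heappush(open_heap,
                                   (tentative_g + manhattan(nbr, goal),
                                    tentative_g, nbr))
    return None  # no path exists

# Steps 2-4: define a grid and start/goal, run the search, show the path.
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
start, goal = (0, 0), (3, 3)
print(a_star(grid, start, goal))
```

Greedy Best-First Search follows from the same skeleton by ordering the heap on h(n) alone instead of g(n) + h(n).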
Machine Learning
In the real world, we are surrounded by humans who can learn everything from their experiences with their learning capability, and we have computers or machines which work on
experiences with their learning capability, and we have computers or machines which work on
our instructions. But can a machine also learn from experiences or past data like a human does?
So here comes the role of Machine Learning.
A subset of artificial intelligence known as machine learning focuses primarily on the creation
of algorithms that enable a computer to independently learn from data and previous
experiences. Arthur Samuel first used the term "machine learning" in 1959. It could be
summarized as follows:
Without being explicitly programmed, machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things.
Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical
data, or training data. For the purpose of developing predictive models, machine learning brings
together statistics and computer science. Algorithms that learn from historical data are either
constructed or utilized in machine learning. The performance will rise in proportion to the
quantity of information we provide.
A machine can learn if it can gain more data to improve its performance.
A machine learning system builds prediction models, learns from previous data, and predicts the output for new data whenever it receives it. The more data it is given, the better the model it can build and the more accurate its predictions become.
Let's say we have a complex problem in which we need to make predictions. Instead of writing code, we just feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning has changed our perspective on such problems.
The demand for machine learning is steadily rising, because it can perform tasks that are too complex for a person to implement directly. Humans are constrained by our inability to manually process vast amounts of data, so we require computer systems, and this is where machine learning comes in to simplify our lives.
We can train machine learning algorithms by providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output. A cost function can be used to measure a machine learning algorithm's performance. Machine learning can save both time and money.
The significance of machine learning can easily be understood through its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Various top companies, such as Netflix and Amazon, have built machine learning models that use a huge amount of data to analyse user interest and recommend products accordingly.
Machine learning implementations are classified into four major categories, depending on the nature of the learning "signal" or "response" available to the learning system:
1. Supervised learning:
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs: the machine is given example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs. Example: Consider the following data regarding patients entering a clinic. The data consists of the gender and age of the patients, and each patient is labeled as "healthy" or "sick".
Gender Age Label
M 48 sick
M 67 sick
F 53 healthy
M 49 sick
F 32 healthy
M 34 healthy
M 21 healthy
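As a brief, hedged illustration of learning from this labeled table, the following sketch fits a decision-tree classifier with scikit-learn; the library choice and the numeric encoding of gender (M=0, F=1) are assumptions for the example, not part of the original notes.

```python
from sklearn.tree import DecisionTreeClassifier

# Patient data from the table above: [gender (M=0, F=1), age] -> label.
X = [[0, 48], [0, 67], [1, 53], [0, 49], [1, 32], [0, 34], [0, 21]]
y = ["sick", "sick", "healthy", "sick", "healthy", "healthy", "healthy"]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[0, 60], [1, 25]]))  # predictions for two new patients
```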
2. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning
algorithms, classification or categorization is not included in the observations. Example:
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
Gender Age
M 48
M 67
F 53
M 49
F 34
M 21
As a kind of learning, it resembles the methods humans use to figure out that certain objects
or events are from the same class, such as by observing the degree of similarity between
objects. Some recommendation systems that you find on the web in the form of marketing
automation are based on this type of learning.
3. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards.
A learner is not told what actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. For example, consider teaching a dog a new trick: we cannot tell it what to do or what not to do, but we can reward or punish it depending on whether it does the right or wrong thing.
To know more about Reinforcement learning refer to:
https://www.geeksforgeeks.org/what-is-reinforcement-learning/.
4. Semi-supervised learning:
Semi-supervised learning is used where an incomplete training signal is given: a training set with some (often many) of the target outputs missing. A special case of this principle is known as transduction, where the entire set of problem instances is known at learning time, except that part of the targets are missing. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training; it falls between unsupervised learning and supervised learning.
Categorizing based on Required Output
Another categorization of machine-learning tasks arises when one considers the desired
output of a machine-learned system:
1. Classification: When inputs are divided into two or more classes, the learner must
produce a model that assigns unseen inputs to one or more (multi-label classification) of
these classes. This is typically tackled in a supervised way. Spam filtering is an example
of classification, where the inputs are email (or other) messages and the classes are
“spam” and “not spam”.
2. Regression: This is also a supervised problem: the case when the outputs are continuous
rather than discrete.
3. Clustering: When a set of inputs is to be divided into groups. Unlike in classification,
the groups are not known beforehand, making this typically an unsupervised task.
Examples of Machine Learning in Action
Machine learning is woven into the fabric of our daily lives. Here are some examples to
illustrate its diverse applications:
Supervised Learning
Filtering Your Inbox: Spam filters use machine learning to analyze emails and identify
spam based on past patterns. They learn from emails you mark as spam and not spam,
becoming more accurate over time.
Recommending Your Next Purchase: E-commerce platforms and streaming services
use machine learning to analyze your purchase history and viewing habits. This allows
them to recommend products and shows you’re more likely to enjoy.
Smart Reply in Emails: Machine learning powers features like “Smart Reply” in
Gmail, suggesting short responses based on the content of the email.
Unsupervised Learning
Grouping Customers: Machine learning can analyze customer data (purchase history,
demographics) to identify customer segments with similar characteristics. This helps
businesses tailor marketing campaigns and product offerings.
Anomaly Detection: Financial institutions use machine learning to detect unusual
spending patterns on your credit card, potentially indicating fraudulent activity.
Grouping Similar Photos: Photo platforms can group visually similar images, such as
photos of the same face, based on similarity alone, without pre-existing labels.
Beyond Categories
Self-Driving Cars: These rely on reinforcement learning, a type of machine learning
where algorithms learn through trial and error in a simulated environment.
Medical Diagnosis: Machine learning algorithms can analyze medical images (X-rays,
MRIs) to identify abnormalities and aid doctors in diagnosis.
Benefits and Challenges of Machine Learning
Machine learning (ML) has become a transformative technology across various industries.
While it offers numerous advantages, it’s crucial to acknowledge the challenges that come
with its increasing use.
Benefits of Machine Learning
Enhanced Efficiency and Automation: ML automates repetitive tasks, freeing up
human resources for more complex work. It also streamlines processes, leading to
increased efficiency and productivity.
Data-Driven Insights: ML can analyze vast amounts of data to identify patterns and
trends that humans might miss. This allows for better decision-making based on real-
world data.
Improved Personalization: ML personalizes user experiences across various
platforms. From recommendation systems to targeted advertising, ML tailors content
and services to individual preferences.
Advanced Automation and Robotics: ML empowers robots and machines to perform
complex tasks with greater accuracy and adaptability. This is revolutionizing fields like
manufacturing and logistics.
Challenges of Machine Learning
Data Bias and Fairness: ML algorithms are only as good as the data they are trained
on. Biased data can lead to discriminatory outcomes, requiring careful data selection
and monitoring of algorithms.
Security and Privacy Concerns: As ML relies heavily on data, security breaches can
expose sensitive information. Additionally, the use of personal data raises privacy
concerns that need to be addressed.
Interpretability and Explainability: Complex ML models can be difficult to
understand, making it challenging to explain their decision-making processes. This lack
of transparency can raise questions about accountability and trust.
Job Displacement and Automation: Automation through ML can lead to job
displacement in certain sectors. Addressing the need for retraining and reskilling the
workforce is crucial.
Conclusion
In conclusion, machine learning is a powerful technology that allows computers to learn
without explicit programming. By exploring different learning tasks and their applications,
we gain a deeper understanding of how machine learning is shaping our world. From
filtering your inbox to diagnosing diseases, machine learning is making a significant impact
on various aspects of our lives.
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to see
if it can accurately predict the output.
The mapping of input data to output data is the objective of supervised learning. Supervised learning is based on supervision, much as a student learns under the guidance of a teacher. Spam filtering is an example of supervised learning.
Supervised learning can be further divided into two categories of algorithms:
o Classification
o Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal
of unsupervised learning is to restructure the input data into new features or a group of objects
with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further divided into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and gets a penalty for each wrong action. The agent learns
automatically with these feedbacks and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of an agent is to get the most
reward points, and hence, it improves its performance.
A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.
History of Machine Learning
About 40-50 years ago, machine learning was science fiction; today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". The idea behind machine learning, however, is old and has a long history. Some milestones in the history of machine learning are given below:
o 1834: In 1834, Charles Babbage, the father of the computer, conceived a device that
could be programmed with punch cards. However, the machine was never built, but all
modern computers rely on its logical structure.
o 1936: In 1936, Alan Turing gave a theory of how a machine can determine and
execute a set of instructions.
o 1943: In 1943, a human neural network was first modeled with an electrical circuit. In 1950, scientists started applying this idea and analysed how human neurons might work.
o 1945: ENIAC, the first electronic general-purpose computer, was completed in 1945. After that, stored-program computers such as EDSAC (1949) and EDVAC (1951) were built.
o 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and Intelligence", on the topic of artificial intelligence, in which he asked, "Can machines think?"
o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play checkers. It performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
o 1959: In 1959, the first neural network was applied to a real-world problem, using an adaptive filter to remove echoes over phone lines.
o 1974-1980: This period was a tough time for AI and ML researchers and is known as the "AI winter". Machine translation failed during this period, and people's interest in AI declined, which led to reduced government funding for research.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural
network NETtalk, which was able to teach itself how to correctly pronounce 20,000
words in one week.
o 1997: IBM's Deep Blue intelligent computer won a chess match against the chess expert Garry Kasparov, becoming the first computer to beat a human chess expert.
2006:
o Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
o The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable computing resources that made it easier to create and implement machine learning models.
2007:
o Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
o The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling huge datasets.
2010s:
o The goal of explainable AI, which focuses on making machine learning models easier to understand, received growing attention.
2017:
o Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-playing ability without human data, using only reinforcement learning.
The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates clustering, classification, decision trees, SVM algorithms, and reinforcement learning, as well as unsupervised and supervised learning.
Present-day machine learning models can be used to make various predictions, including weather prediction, disease prediction, stock market analysis, and so on.
Machine learning is a field of computer science that gives computers the ability to learn
without being explicitly programmed. Supervised learning and unsupervised learning are two
main types of machine learning.
In supervised learning, the machine is trained on a set of labeled data, which means that the
input data is paired with the desired output. The machine then learns to predict the output for
new input data. Supervised learning is often used for tasks such as classification, regression,
and object detection.
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that
the input data is not paired with the desired output. The machine then learns to find patterns
and relationships in the data. Unsupervised learning is often used for tasks such as clustering,
dimensionality reduction, and anomaly detection.
What is Supervised learning?
Supervised learning is a type of machine learning algorithm that learns from labeled data.
Labeled data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. In supervised learning, we teach or train the machine using data that is well labeled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data. For example, a labeled dataset of images of elephants, camels and cows would have each image tagged as "Elephant", "Camel" or "Cow".
Key Points:
Supervised learning involves training a machine from labeled data.
Labeled data consists of examples with the correct answer or classification.
The machine learns the relationship between inputs (fruit images) and outputs (fruit
labels).
The trained machine can then make predictions on new, unlabeled data.
Example:
Let's say you have a fruit basket and want to identify each fruit. The machine would first analyze an image to extract features such as shape, color, and texture. Then, it would compare these features to the features of the fruits it has already learned about. If the new image's features are most similar to those of an apple, the machine would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the
first step is to train the machine with all the different fruits one by one like this:
If the shape of the object is rounded with a depression at the top and it is red in color, then it will be labeled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow color, then it will be labeled as Banana.
Now suppose that, after training, the machine is given a new, separate fruit, say a banana from the basket, and asked to identify it. Since the machine has already learned from the previous data, it can use that knowledge: it will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket containing fruits) and then applies that knowledge to the test data (the new fruit).
Types of Supervised Learning
Supervised learning is classified into two categories of algorithms:
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue” , “disease” or “no disease”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous values, such as
house prices, stock prices, or customer churn. Regression algorithms learn a function that
maps from the input features to the output value.
Some common regression algorithms include:
Linear Regression
Polynomial Regression
Support Vector Machine Regression
Decision Tree Regression
Random Forest Regression
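As a quick, hedged illustration (scikit-learn is assumed here; it is not named in the notes), a linear-regression sketch on invented house-price data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: house size (sq. ft) -> price; values are illustrative only.
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([120000, 150000, 180000, 225000, 270000])

model = LinearRegression().fit(X, y)
print(model.predict([[1300]]))  # estimated price for a 1300 sq. ft house
```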
2- Classification
Classification is a type of supervised learning that is used to predict categorical values, such
as whether a customer will churn or not, whether an email is spam or not, or whether a
medical image shows a tumor or not. Classification algorithms learn a function that maps
from the input features to a probability distribution over the output classes.
Some common classification algorithms include:
Logistic Regression
Support Vector Machines
Decision Trees
Random Forests
Naive Bayes
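Similarly, a minimal classification sketch using logistic regression on invented spam-like features (word and link counts are made up for the example):

```python
from sklearn.linear_model import LogisticRegression

# Toy features: [suspicious-word count, link count] -> spam label.
X = [[5, 3], [0, 0], [7, 4], [1, 0], [6, 2], [0, 1]]
y = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)
print(clf.predict([[4, 2], [0, 0]]))  # classify two new emails
```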
Evaluating Supervised Learning Models
Evaluating supervised learning models is an important step in ensuring that the model is
accurate and generalizable. There are a number of different metrics that can be used to
evaluate supervised learning models, but some of the most common ones include:
For Regression
Mean Squared Error (MSE): MSE measures the average squared difference between
the predicted values and the actual values. Lower MSE values indicate better model
performance.
Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the
standard deviation of the prediction errors. Similar to MSE, lower RMSE values indicate
better model performance.
Mean Absolute Error (MAE): MAE measures the average absolute difference between
the predicted values and the actual values. It is less sensitive to outliers compared to MSE
or RMSE.
R-squared (Coefficient of Determination): R-squared measures the proportion of the
variance in the target variable that is explained by the model. Higher R-squared values
indicate better model fit.
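All four regression metrics are available in scikit-learn (assumed here); a minimal sketch with invented true and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 2.0, 6.5])   # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))              # square root of MSE
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```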
For Classification
Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is
calculated by dividing the number of correct predictions by the total number of
predictions.
Precision: Precision is the percentage of positive predictions that the model makes that
are actually correct. It is calculated by dividing the number of true positives by the total
number of positive predictions.
Recall: Recall is the percentage of all positive examples that the model correctly
identifies. It is calculated by dividing the number of true positives by the total number of
positive examples.
F1 score: The F1 score is a weighted average of precision and recall. It is calculated by
taking the harmonic mean of precision and recall.
Confusion matrix: A confusion matrix is a table that shows the number of predictions
for each class, along with the actual class labels. It can be used to visualize the
performance of the model and identify areas where the model is struggling.
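These classification metrics can likewise be computed with scikit-learn (assumed); the labels below are invented:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted labels (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```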
Applications of Supervised learning
Supervised learning can be used to solve a wide variety of problems, including:
Spam filtering: Supervised learning algorithms can be trained to identify and classify
spam emails based on their content, helping users avoid unwanted messages.
Image classification: Supervised learning can automatically classify images into
different categories, such as animals, objects, or scenes, facilitating tasks like image
search, content moderation, and image-based product recommendations.
Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing
patient data, such as medical images, test results, and patient history, to identify patterns
that suggest specific diseases or conditions.
Fraud detection: Supervised learning models can analyze financial transactions and
identify patterns that indicate fraudulent activity, helping financial institutions prevent
fraud and protect their customers.
Natural language processing (NLP): Supervised learning plays a crucial role in NLP
tasks, including sentiment analysis, machine translation, and text summarization, enabling
machines to understand and process human language effectively.
Advantages of Supervised learning
Supervised learning allows collecting data and producing outputs based on previous
experience.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
It performs classification and regression tasks.
It allows estimating or mapping the result to a new sample.
We have complete control over choosing the number of classes we want in the training
data.
Disadvantages of Supervised learning
Classifying big data can be challenging.
Training a supervised model requires a lot of computation time.
Supervised learning cannot handle all complex tasks in machine learning.
It requires a labeled data set.
It requires a training process.
What is Unsupervised learning?
Unsupervised learning is a type of machine learning that learns from unlabeled data. This
means that the data does not have any pre-existing labels or categories. The goal of
unsupervised learning is to discover patterns and relationships in the data without any explicit
guidance.
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training is given to the machine. The machine must therefore discover the hidden structure in unlabeled data by itself.
For example, you can use unsupervised learning to examine gathered animal data and distinguish several groups according to the traits and actions of the animals. These groupings might correspond to various animal species, allowing you to categorize the creatures without depending on labels that already exist.
Key Points
Unsupervised learning allows the model to discover patterns and relationships in
unlabeled data.
Clustering algorithms group similar data points together based on their inherent
characteristics.
Feature extraction captures essential information from the data, enabling the model to
make meaningful distinctions.
Label association assigns categories to the clusters based on the extracted patterns and
characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images,
containing both dogs and cats. The model has never seen an image of a dog or cat before, and
it has no pre-existing labels or categories for these animals. Your task is to use unsupervised
learning to identify the dogs and cats in a new, unseen image.
For instance, suppose the model is given an image containing both dogs and cats, none of which it has seen before. The machine has no idea about the features of dogs and cats, so we can't categorize the image as "dogs and cats". But it can categorize the animals according to their similarities, patterns, and differences: we can easily split the pictures into two parts, the first containing all pictures with dogs and the second containing all pictures with cats, without any prior training data or examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Types of Unsupervised Learning
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering
Clustering is a type of unsupervised learning that is used to group similar data points
together. Clustering algorithms work by iteratively moving data points closer to their cluster
centers and further away from data points in other clusters.
Clustering approaches can be broadly categorized as:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common clustering and related unsupervised algorithms include:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
6. Gaussian Mixture Models (GMMs)
7. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
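As a small illustration of clustering, a k-means sketch with scikit-learn (assumed; not named in the notes) on invented two-dimensional points:

```python
from sklearn.cluster import KMeans

# Invented 2-D points forming two loose groups.
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned cluster centers
```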
Association rule learning
Association rule learning is a type of unsupervised learning that is used to identify patterns in data. Association rule learning algorithms work by finding relationships between different items in a dataset.
Some common association rule learning algorithms include:
Apriori Algorithm
Eclat Algorithm
FP-Growth Algorithm
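A hedged sketch of the Apriori algorithm, assuming the third-party mlxtend library (not mentioned in the notes) and an invented set of shopping baskets:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Invented shopping baskets.
baskets = [["bread", "milk"], ["bread", "butter"],
           ["bread", "milk", "butter"], ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```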
Evaluating Unsupervised Learning Models
Evaluating unsupervised learning models is an important step in ensuring that the model is effective and useful. However, it can be more challenging than evaluating supervised learning models, as there is no ground-truth data to compare the model's predictions to.
There are a number of different metrics that can be used to evaluate unsupervised learning models, but some of the most common ones include:
Silhouette score: The silhouette score measures how well each data point is clustered
with its own cluster members and separated from other clusters. It ranges from -1 to
1, with higher scores indicating better clustering.
Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the
variance between clusters and the variance within clusters. It ranges from 0 to
infinity, with higher scores indicating better clustering.
Adjusted Rand index: The adjusted Rand index measures the similarity between two
clusterings. It ranges from -1 to 1, with higher scores indicating more similar clusterings.
Davies-Bouldin index: The Davies-Bouldin index measures the average similarity
between clusters. It ranges from 0 to infinity, with lower scores indicating better
clustering.
F1 score: The F1 score is a weighted average of precision and recall, which are two
metrics that are commonly used in supervised learning to evaluate classification
models. However, the F1 score can also be used to evaluate non-supervised learning
models, such as clustering models.
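Most of these scores are available directly in scikit-learn (assumed); a minimal sketch reusing the invented points from the clustering example above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Silhouette       :", silhouette_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin   :", davies_bouldin_score(X, labels))
# The adjusted Rand index needs a reference labelling to compare against:
print("Adjusted Rand    :", adjusted_rand_score([0, 0, 0, 1, 1, 1], labels))
```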
Applications of Unsupervised learning
Unsupervised learning can be used to solve a wide variety of problems, including:
Anomaly detection: Unsupervised learning can identify unusual patterns or deviations
from normal behavior in data, enabling the detection of fraud, intrusion, or system
failures.
Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns
in scientific data, leading to new hypotheses and insights in various scientific fields.
Recommendation systems: Unsupervised learning can identify patterns and similarities in
user behavior and preferences to recommend products, movies, or music that align with
their interests.
Customer segmentation: Unsupervised learning can identify groups of customers with
similar characteristics, allowing businesses to target marketing campaigns and improve
customer service more effectively.
Image analysis: Unsupervised learning can group images based on their content,
facilitating tasks such as image classification, object detection, and image retrieval.
Advantages of Unsupervised learning
It does not require training data to be labeled.
Dimensionality reduction can be easily accomplished using unsupervised learning.
Capable of finding previously unknown patterns in data.
Unsupervised learning can help you gain insights from unlabeled data that you might not
have been able to get otherwise.
Unsupervised learning is good at finding patterns and relationships in data without being
told what to look for. This can help you learn new things about your data.
Disadvantages of Unsupervised learning
Difficult to measure accuracy or effectiveness due to lack of predefined answers during
training.
The results often have lower accuracy.
The user needs to spend time interpreting and labeling the classes that result from the
clustering.
Unsupervised learning can be sensitive to data quality, including missing values, outliers,
and noisy data.
Without labeled data, it can be difficult to evaluate the performance of unsupervised
learning models, making it challenging to assess their effectiveness.
Supervised vs. Unsupervised Machine Learning
Parameters                 Supervised machine learning     Unsupervised machine learning
Computational complexity   Simpler method                  Computationally complex
Model                      We can test our model.          We cannot test our model.
Conclusion
Supervised and unsupervised learning are two powerful tools that can be used to solve a wide
variety of problems. Supervised learning is well-suited for tasks where the desired output is
known, while unsupervised learning is well-suited for tasks where the desired output is
unknown.
What is Big Data?
Data that is very large in size is called Big Data. Normally we work on data of sizes in MB (Word documents, Excel files) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data. It is said that almost 90% of today's data has been generated in the past 3 years.
Sources of Big Data include:
o Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Gartner defines Big Data as “Big data is high-volume, high-velocity and/or high-variety
information that demands cost-effective, innovative forms of information processing that
enable enhanced insight, decision making, and process automation.”
Big Data is a collection of large amounts of data sets that traditional computing approaches
cannot compute and manage. It is a broad term that refers to the massive volume of complex
data sets that businesses and governments generate in today's digital world. It is often
measured in petabytes or terabytes and originates from three key sources: transactional data,
machine data, and social data.
Big Data encompasses the data, frameworks, tools, and methodologies used to store, access, analyse and visualise it. Technologically advanced communication channels like social networking, together with powerful gadgets, have created new ways to generate data and new data-handling challenges for industry participants, who must find new ways to manage it. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.
Big Data Analytics is a powerful tool which helps to find the potential of large and complex
datasets. To get a better understanding, let's break it down into key steps −
Data Collection
This is the initial step, in which data is collected from different sources like social media,
sensors, online channels, commercial transactions, website logs etc. Collected data might be
structured (predefined organisation, such as databases), semi-structured (like log files) or
unstructured (text documents, photos, and videos).
Data Cleaning and Pre-processing
The next step is to process the collected data by removing errors and making it suitable for analysis. Collected raw data generally contains errors, missing values, inconsistencies, and noise. Data cleaning entails identifying and correcting errors to ensure that the data is accurate and consistent. Pre-processing operations may also involve data transformation, normalisation, and feature extraction to prepare the data for further analysis.
Overall, data cleaning and pre-processing entail the replacement of missing data, the
correction of inaccuracies, and the removal of duplicates. It is like sifting through a treasure
trove, separating the rocks and debris and leaving only the valuable gems behind.
Data Analysis
This is a key phase of big data analytics. Different techniques and algorithms are used to
analyse data and derive useful insights. This can include descriptive analytics (summarising
data to better understand its characteristics), diagnostic analytics (identifying patterns and
relationships), predictive analytics (predicting future trends or outcomes), and prescriptive
analytics (making recommendations or decisions based on the analysis).
Data Visualization
This step presents the data in visual form: data visualisation techniques portray the data using charts, graphs, dashboards, and other graphical formats to make data-analysis insights clearer and more actionable.
Decision-Making
Once data analytics and visualisation are done and insights gained, stakeholders analyse the
findings to make informed decisions. This decision-making includes optimising corporate
operations, increasing consumer experiences, creating new products or services, and directing
strategic planning.
Data Storage
Once collected, the data must be stored in a way that enables easy retrieval and analysis.
Traditional databases may not be sufficient for handling large amounts of data, hence many
organisations use distributed storage systems such as Hadoop Distributed File System
(HDFS) or cloud-based storage solutions like Amazon S3.
Big data analytics is a continuous process of collecting, cleaning, and analyzing data to
uncover hidden insights. It helps businesses make better decisions and gain a competitive
edge.
Types of Big Data
Big Data is generally categorized into three different varieties, as shown below:
Structured Data
Semi-Structured Data
Unstructured Data
Structured Data
Structured data has a dedicated data model, a well-defined structure, and a consistent order,
and is designed in such a way that it can be easily accessed and used by humans or
computers. Structured data is usually stored in a well-defined tabular form, that is, in rows and columns. Example: MS Excel, Database Management Systems (DBMS)
Semi-Structured Data
Semi-structured data can be described as another type of structured data. It inherits some
qualities from Structured Data; however, the majority of this type of data lacks a specific
structure and does not follow the formal structure of data models such as an RDBMS.
Example: Comma Separated Values (CSV) File.
Unstructured Data
Unstructured data is a type of data that doesn't follow any structure. It lacks a uniform format and is constantly changing. However, it may occasionally include date- and time-related information. Example: audio files, images, etc.
Types of Big Data Analytics
Descriptive Analytics
Descriptive analytics answers the question "What is happening in my business?" when the dataset is business-related. It summarises prior facts and aids the creation of reports such as a company's income, profit, and sales figures, as well as the tabulation of social media metrics. It works with comprehensive, accurate, live data and supports effective visualisation.
Diagnostic Analytics
Diagnostic analytics determines root causes from data; it answers questions like "Why is it happening?" Common examples are drill-down, data mining, and data recovery. Organisations use diagnostic analytics because it provides in-depth insight into a particular problem. Overall, it can drill down to root causes and isolate confounding information.
For example − A report from an online store says that sales have decreased, even though people are still adding items to their shopping carts. Several things could have caused this, such as the order form not loading properly, the shipping cost being too high, or too few payment choices being offered. You can use diagnostic analytics to figure out why this is happening.
Predictive Analytics
This kind of analytics looks at data from the past and the present to predict what will happen in the future; it answers questions like "What will happen in the future?" Data mining, AI, and machine learning are all used in predictive analytics to analyse current data and forecast what will happen next. It can identify things like market trends, customer trends, and so on.
For example − PayPal uses predictive analytics to keep its customers safe from fraudulent transactions: the business analyses all of its past payment and user-behaviour data and builds a program that can spot fraud.
Prescriptive Analytics
Prescriptive analytics gives the ability to frame strategic decisions; the analytical results answer "What do I need to do?" Prescriptive analytics builds on both descriptive and predictive analytics and, most of the time, relies on AI and machine learning.
For example − Prescriptive analytics can help a company maximise its business and profit. In the airline industry, prescriptive analytics applies sets of algorithms that change flight prices automatically based on customer demand, and reduce ticket prices due to bad weather conditions, location, holiday seasons, etc.
Big Data Tools
Hadoop
A tool to store and analyze large amounts of data. Hadoop makes it possible to deal with big data; it is the tool that made big data analytics practical.
MongoDB
A tool for managing unstructured data. It's a database specially designed to store, access and process large quantities of unstructured data.
Talend
A tool to use for data integration and management. Talend's solution package includes
complete capabilities for data integration, data quality, master data management, and data
governance. Talend integrates with big data management tools like Hadoop, Spark, and
NoSQL databases allowing organisations to process and analyse enormous amounts of data
efficiently. It includes connectors and components for interacting with big data technologies,
allowing users to create data pipelines for ingesting, processing, and analysing large amounts
of data.
Cassandra
A distributed NoSQL database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Spark
Used for real-time processing and analyzing large amounts of data. Apache Spark is a robust
and versatile distributed computing framework that provides a single platform for big data
processing, analytics, and machine learning, making it popular in industries such as e-
commerce, finance, healthcare, and telecommunications.
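As a hedged illustration of Spark's API (assuming PySpark is installed; the records below are invented), a minimal aggregation job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataSketch").getOrCreate()

# Invented transaction records: (category, amount).
df = spark.createDataFrame(
    [("electronics", 250.0), ("books", 40.0), ("electronics", 99.0)],
    ["category", "amount"],
)
df.groupBy("category").sum("amount").show()  # distributed aggregation
spark.stop()
```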
Storm
A distributed real-time computation system used to reliably process unbounded streams of data.
Kafka
It is a distributed streaming platform that is used for fault-tolerant storage. Apache Kafka is a
versatile and powerful event streaming platform that allows organisations to create scalable,
fault-tolerant, and real-time data pipelines and streaming applications to efficiently meet their
data processing requirements.