Unit 1

The document outlines the syllabus and key concepts of a Machine Learning course, including types of machine learning, advantages, disadvantages, and limitations. It details the history of machine learning, various learning approaches such as supervised, unsupervised, semi-supervised, and reinforcement learning, as well as problems that cannot be effectively solved using machine learning. Additionally, it highlights applications of machine learning across different industries, emphasizing its transformative impact.

Uploaded by

achilles2006ad


MACHINE LEARNING COURSE CODE-A8703 MODULE-01

SYLLABUS: Introduction to Machine Learning: Types of Machine Learning, Problems not
to be solved using Machine Learning, Applications of Machine Learning, Tools in Machine
Learning, Issues in Machine Learning, Machine Learning Activities, Basic Types of Data in
Machine Learning, Exploring Structure of Data, Data Quality & Remediation, Data
Pre-Processing.

Introduction to Machine Learning:

What is Machine Learning?
• In the real world, humans learn from their experiences, while computers ordinarily do
only what our instructions tell them to do. Can a machine also learn from experience or
past data, as a human does? This is the role of machine learning.

• Machine learning is a subset of artificial intelligence (AI) that focuses on developing
algorithms and techniques that enable computers to learn from data and make predictions or
decisions without being explicitly programmed. It involves training models on data to identify
patterns, relationships, and insights, which can then be used to perform various tasks and
make predictions on new, unseen data.

Dr M.Ramachandro 1|Page
Advantages:
1. Automation: Machine learning enables automation of tasks that are repetitive or time-
consuming, leading to increased efficiency and productivity.
2. Prediction and Decision Making: ML algorithms can analyze large datasets to make
predictions and decisions with high accuracy, helping businesses and organizations make
informed choices.
3. Scalability: ML models can handle large volumes of data and scale efficiently to
accommodate growing datasets and complex problems.
4. Adaptability: ML models can adapt and learn from new data, allowing them to continuously
improve and stay relevant in dynamic environments.
5. Personalization: ML algorithms can personalize user experiences by analyzing user
behavior and preferences, leading to targeted recommendations and customized services.
6. Pattern Recognition: ML excels at identifying patterns, trends, and anomalies in data that
may not be obvious to humans, leading to valuable insights and discoveries.
Disadvantages:
1. Data Dependency: ML models heavily rely on the quality and quantity of data for training,
and biased or incomplete data can lead to biased or inaccurate predictions.
2. Overfitting: ML models may become too specialized to the training data and fail to generalize
well to unseen data.
3. Interpretability: Some ML models, particularly complex ones like deep neural networks,
lack interpretability, making it challenging to understand how they arrive at their predictions
or decisions.
4. Computational Resources: Training complex ML models requires significant computational
resources, including high-performance hardware and large amounts of memory, which can
be costly and resource-intensive.
5. Ethical and Privacy Concerns: ML algorithms may inadvertently perpetuate biases present
in the data or infringe on privacy rights, raising ethical and social concerns.
6. Lack of Domain Knowledge: ML models may perform poorly in domains where domain-
specific knowledge is essential, as they may not understand the underlying context or
constraints of the problem.
Limitations:
1. Limited by Data Quality: ML models are limited by the quality, relevance, and
representativeness of the training data. Poor-quality or biased data can lead to inaccurate
predictions and unreliable performance.

2. Complexity: Developing and training ML models, especially deep learning models, can be
complex and time-consuming, requiring expertise in data science, mathematics, and
computer science.
3. Interpretability: Many ML models lack interpretability, making it difficult to understand and
trust their decisions, particularly in critical applications like healthcare or finance.
4. Generalization: ML models may struggle to generalize well to unseen data, especially in
scenarios with significant variations or changes in the data distribution.
5. Scalability: While ML models can handle large datasets, scaling them to extremely large
datasets or real-time applications can be challenging and may require distributed computing
or specialized infrastructure.
6. Human Expertise: ML models still rely on human expertise for tasks such as feature
engineering, model selection, and evaluation, and may not fully replace human decision-
making in complex or subjective domains.
History of Machine learning
The history of machine learning traces back to the mid-20th century, with roots in the fields of
mathematics, computer science, and artificial intelligence. Here's a brief overview:
1950s - 1960s: Early Foundations
1. Alan Turing (1950): Turing proposed the Turing Test as a measure of machine intelligence,
laying the groundwork for the concept of artificial intelligence (AI).
2. Arthur Samuel (1959): Samuel coined the term "machine learning" and described one of the
first self-learning programs, a checkers-playing program that improved its performance by
playing against itself.
1970s - 1980s: Symbolic AI Dominance
1. Expert Systems: Symbolic AI, based on rule-based expert systems, dominated the field.
These systems encoded human expertise in the form of rules to solve specific problems.
2. Neural Networks Research: Neural networks research continued, but interest waned due
to limited computational power and the dominance of symbolic AI.
Late 1980s - 1990s: Renaissance of Neural Networks
1. Backpropagation: The popularization of the backpropagation algorithm for training neural
networks (Rumelhart, Hinton, and Williams, 1986) led to renewed interest in neural networks
and machine learning.
2. Support Vector Machines (SVMs): Vladimir Vapnik and others developed support vector
machines, a powerful machine learning algorithm for classification and regression tasks.
2000s - Present: The Big Data Era
1. Big Data: The explosion of data availability due to the internet, social media, and digital
technologies fueled the development of new machine learning algorithms and techniques.

2. Deep Learning: Breakthroughs in deep learning, fueled by advances in computational power
and data availability, led to significant improvements in areas like computer vision, natural
language processing, and speech recognition.
3. Reinforcement Learning: Reinforcement learning gained prominence, particularly in areas
like robotics, gaming (e.g., AlphaGo), and autonomous vehicles.
4. Machine Learning Applications: Machine learning became ubiquitous in various
applications, including recommendation systems, fraud detection, healthcare diagnostics,
autonomous vehicles, and more.
5. Ethical and Social Implications: Attention to the ethical and social implications of machine
learning has increased, including concerns about bias, fairness, privacy, and job displacement.


Types of Machine Learning

Machine learning can be broadly categorized into four types based on the learning approach:

Supervised Learning:
Definition: In supervised learning, the algorithm is trained on a labeled dataset, where each input
is associated with a corresponding output.
Usefulness: Supervised learning is highly useful in real-world scenarios where there is a known
outcome or target variable. It is particularly valuable for tasks such as classification and regression.
Real-world Applications: Supervised learning is applied in various domains such as:
1. Predictive analytics: Forecasting customer churn, predicting sales trends, etc.
2. Healthcare: Diagnosing diseases based on patient data.
3. Finance: Credit scoring, fraud detection, risk assessment.
Advantages:
• Well-understood and widely studied.
• Can achieve high accuracy when trained on sufficient and representative data.
Disadvantages:
• Requires labeled data, which can be expensive and time-consuming to obtain.
• May suffer from overfitting if the model is too complex or the training dataset is small.
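Example: A minimal sketch of supervised learning, assuming a tiny made-up labeled dataset (hours studied, hours slept, and a pass/fail label). It uses a 1-nearest-neighbour rule, chosen only because it is short; the data and the `predict` helper are illustrative assumptions, not part of the course material.

```python
import math

# Hypothetical labeled training set: (hours_studied, hours_slept) -> outcome.
train = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 8.0), "pass"),
]

def predict(x):
    """Return the label of the training input closest to x (1-nearest neighbour)."""
    _, nearest_label = min(train, key=lambda pair: math.dist(pair[0], x))
    return nearest_label

print(predict((7.0, 6.0)))  # nearest training input is (6.0, 7.0)
print(predict((1.5, 4.5)))  # nearest training inputs are both labeled "fail"
```

Each input comes with a known output, and the prediction for a new input is derived from those known pairs. That is what makes the setting supervised.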
Unsupervised Learning:
Definition: In unsupervised learning, the algorithm is trained on an unlabeled dataset, where the
goal is to discover hidden patterns or structures within the data.
Usefulness: Unsupervised learning is valuable in real-world scenarios where the data is unstructured
or lacks labels. It can uncover hidden insights and group similar data points together.

Real-world Applications: Unsupervised learning is applied in various domains such as:
1. Market segmentation: Grouping customers based on similar traits or behaviors.
2. Anomaly detection: Identifying unusual patterns or outliers in data.
3. Recommender systems: Generating personalized recommendations based on user
preferences.
Advantages:
• Can reveal hidden patterns or structures in the data.
• Does not require labeled data, making it applicable to a wide range of datasets.
Disadvantages:
• Evaluation of results can be subjective and challenging.
• Interpretability of the model's output may be limited.
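Example: A minimal sketch of unsupervised grouping, assuming made-up customer spending figures with no labels. It runs plain k-means on single numbers; the starting centroids and the `kmeans_1d` helper are illustrative assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [12, 15, 14, 90, 95, 88]          # hypothetical monthly spend per customer
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 100.0])
print(clusters)   # low spenders and high spenders end up in separate groups
print(centroids)
```

No labels are supplied. The two groups emerge only from how close the values are to each other.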
Semi-Supervised Learning:
Semi-supervised learning is a machine learning paradigm that falls between supervised and
unsupervised learning. In semi-supervised learning, the dataset contains both labeled and unlabeled
data. The algorithm leverages the small amount of labeled data along with the larger pool of
unlabeled data to make predictions or learn patterns.
Here are a few semi-supervised learning algorithms:
Self-Training:
1. Self-training is a simple semi-supervised learning algorithm where the model starts with a
small amount of labeled data.
2. It trains initially on the labeled data and then uses the trained model to make predictions on
the unlabeled data.
3. The predictions with high confidence are added to the labeled dataset, and the process
iterates until convergence.
4. This approach assumes that the model's high-confidence predictions on the unlabeled data are
mostly correct; wrong predictions that are added as labels get reinforced in later rounds.
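Example: The self-training steps above can be sketched as follows. This is a minimal illustration, assuming one-dimensional data, a nearest-centroid classifier, and a confidence rule based on the gap between the two closest class centroids; the `margin` threshold, the data, and both helper functions are hypothetical choices.

```python
def class_centroids(labeled):
    """Mean value per class from (value, label) pairs."""
    sums = {}
    for value, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Repeatedly pseudo-label the confident points (assumes at least two classes)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = class_centroids(labeled)          # "train" on the labeled data
        newly_labeled, still_unlabeled = [], []
        for x in unlabeled:
            ranked = sorted(cents, key=lambda label: abs(x - cents[label]))
            best, runner_up = ranked[0], ranked[1]
            gap = abs(x - cents[runner_up]) - abs(x - cents[best])
            if gap >= margin:                     # confident prediction
                newly_labeled.append((x, best))
            else:
                still_unlabeled.append(x)
        if not newly_labeled:                     # nothing confident left: stop
            break
        labeled += newly_labeled
        unlabeled = still_unlabeled
    return labeled, unlabeled

labeled, leftover = self_train([(1.0, "low"), (9.0, "high")], [2.0, 8.0, 3.5, 5.2])
print(labeled)
print(leftover)   # the ambiguous point stays unlabeled
```

The point 5.2 lies almost midway between the two classes, so it never passes the confidence check and is left unlabeled rather than guessed.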
Semi-Supervised Support Vector Machines (S3VM):
1. S3VM is an extension of traditional Support Vector Machines (SVM) to semi-supervised
settings.
2. It incorporates both labeled and unlabeled data into the SVM framework, aiming to find a
decision boundary that classifies the labeled data correctly while passing through
low-density regions of the unlabeled data.
3. S3VM optimizes a combination of the margin and the empirical error on the labeled data,
along with a term penalizing the model's complexity.
Label Propagation:
1. Label propagation is a graph-based semi-supervised learning algorithm.

2. It constructs a graph representation of the data, where nodes represent data points, and
edges represent similarities between points.
3. Initially, the labeled nodes are assigned their true labels, and then the labels propagate
through the graph based on similarities between nodes.
4. Propagation repeats until the label scores converge, and each unlabeled node then takes its
highest-scoring label.
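Example: A minimal sketch of label propagation for two classes, assuming a small hand-made similarity graph. Each unlabeled node repeatedly takes the weighted average score of its neighbours while labeled nodes stay fixed; the graph, the weights, and the 0.5 threshold are illustrative assumptions.

```python
# Undirected similarity graph as {node: {neighbour: weight}} (hypothetical values).
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 0.1},   # weak link between the two groups
    3: {2: 0.1, 4: 1.0},
    4: {3: 1.0},
}
known = {0: 1.0, 4: 0.0}   # score 1.0 means class "A", 0.0 means class "B"

def propagate(graph, known, iterations=100):
    scores = {node: known.get(node, 0.5) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            if node in known:
                updated[node] = known[node]       # labeled nodes keep their label
            else:
                total = sum(neighbours.values())
                updated[node] = sum(w * scores[n] for n, w in neighbours.items()) / total
        scores = updated
    return {node: ("A" if s >= 0.5 else "B") for node, s in scores.items()}

labels = propagate(graph, known)
print(labels)
```

Nodes 1 and 2 are strongly connected to the "A" node and node 3 to the "B" node, so the labels spread along the strong edges and stop at the weak one.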
Generative Adversarial Networks (GANs):
1. GANs consist of two neural networks, a generator and a discriminator, which are trained
simultaneously through a min-max game.
2. In semi-supervised learning, GANs can be used to generate realistic samples from the
unlabeled data distribution.
3. The generated samples are combined with the labeled data to train a classifier, effectively
leveraging the unlabeled data to improve classification performance.
Reinforcement Learning:
Definition: Reinforcement learning involves an agent learning to make decisions by interacting with
an environment and receiving feedback in the form of rewards or penalties.
Usefulness: Reinforcement learning is beneficial in real-time environments where decisions must
be made sequentially and actions have consequences. It is used in areas such as robotics, gaming,
and autonomous systems.
Real-world Applications: Reinforcement learning is applied in various domains such as:
1. Robotics: Training robots to perform complex tasks in dynamic environments.
2. Autonomous vehicles: Teaching vehicles to navigate safely and efficiently.
3. Resource management: Optimizing energy usage, inventory management, etc.
Advantages:
1. Can learn complex behaviors and strategies through trial and error.
2. Suitable for environments with sparse or delayed feedback.
Disadvantages:
1. Requires a well-defined reward structure, which may be challenging to specify.
2. Can be computationally expensive and time-consuming to train.
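Example: A minimal sketch of the reward-driven update behind Q-learning, assuming a made-up five-state corridor where only reaching the last state gives a reward. To keep the run reproducible, every state-action pair is tried in a fixed sweep instead of being explored at random; the environment, constants, and names are illustrative assumptions.

```python
GOAL = 4                   # states 0..4 on a line; reaching state 4 ends the episode
ACTIONS = (-1, +1)         # move left or move right
ALPHA, GAMMA = 0.5, 0.9    # learning rate and discount factor (illustrative values)

def step(state, action):
    """Environment: return (next_state, reward); reward 1.0 only at the goal."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)

# Q-table: estimated long-term reward for each (state, action) pair.
q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}

for _ in range(200):
    for s in range(GOAL):
        for a in ACTIONS:
            nxt, reward = step(s, a)
            future = 0.0 if nxt == GOAL else max(q[(nxt, b)] for b in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(s, a)] += ALPHA * (reward + GAMMA * future - q[(s, a)])

policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)   # the greedy action in every state is +1 (move toward the goal)
```

The reward arrives only at the final step, yet the discounted update passes its value back to earlier states, which is how delayed feedback is handled.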


Problems Not to Be Solved Using Machine Learning

While machine learning (ML) is a powerful tool for solving a wide range of problems, there are
certain types of problems that it may not be well-suited to address effectively. Here are some
examples of problems that cannot be easily solved by machine learning alone:
1. Lack of Data: Machine learning models require sufficient and high-quality data for training.
If the data is scarce, incomplete, or biased, the performance of ML models can suffer.
2. Undefined Objectives: Machine learning relies on well-defined objectives and metrics for
optimization. If the problem itself is not well-defined, ML might not be effective in finding
solutions.
3. Causal Inference: While ML can identify correlations and patterns in data, it's not inherently
designed to establish causal relationships. Determining cause and effect requires more
rigorous experimental design and statistical methods.
4. Ethical and Moral Judgments: Decisions involving ethical considerations, moral judgments,
and values often require human reasoning, empathy, and contextual understanding that
machine learning lacks.
5. Unstructured Problem Solving: Machine learning is often used for structured tasks with
clearly defined inputs and outputs. Problems requiring creative thinking, intuition, and
subjective judgment may not be suitable for ML.
6. Domain Expertise: ML models require domain-specific knowledge for effective feature
engineering, interpretation of results, and ensuring meaningful outcomes. Lack of domain
expertise can lead to suboptimal solutions.
7. Conceptual Understanding: ML models can predict outcomes based on patterns in data, but
they may not provide a deep conceptual understanding of underlying phenomena or
processes.
8. Small Sample Sizes: Some machine learning algorithms, particularly deep learning models,
require large amounts of data to generalize well. Small sample sizes can lead to overfitting
and poor performance.
9. Incorporating Context: Contextual understanding and reasoning based on broader context,
cultural nuances, and real-world experiences are areas where machines may struggle.
10. Real-time Critical Decision-making: Situations that require real-time decision-making,
especially in high-stakes environments like healthcare or aviation, may not allow sufficient
time for the learning and adaptation process of ML models.
11. Extreme Context Shifts: Machine learning models might not perform well when deployed in
situations drastically different from their training environment. They lack adaptability to

extreme shifts in context.
12. New and Novel Situations: ML models typically operate based on patterns learned from past
data. When faced with entirely new and novel situations, they might not have sufficient
information to provide accurate predictions.
13. Interpersonal and Emotional Understanding: Recognizing and responding to human
emotions, nuances, and interpersonal interactions are challenging tasks that require human
emotional intelligence and social understanding.


Applications of Machine Learning

Machine learning (ML) has become a powerful tool across various industries, transforming how we
live and work. Here's an overview of its applications and limitations:

Healthcare:
1. Disease diagnosis and prognosis
2. Personalized treatment recommendation
3. Drug discovery and development
4. Medical imaging analysis (e.g., MRI, CT scans)
5. Electronic health record (EHR) analysis for patient management

Finance:
1. Fraud detection and prevention
2. Credit scoring and risk assessment
3. Algorithmic trading and financial forecasting
4. Customer segmentation and targeted marketing
5. Portfolio optimization and wealth management

E-commerce:
1. Product recommendation and personalized shopping experiences
2. Customer segmentation and churn prediction
3. Price optimization and dynamic pricing strategies
4. Fraud detection and prevention
5. Supply chain optimization and demand forecasting

Marketing:
1. Customer segmentation and targeting
2. Sentiment analysis and brand sentiment monitoring
3. Social media analytics and influencer identification
4. Customer lifetime value prediction
5. Campaign optimization and marketing attribution modeling

Manufacturing:
1. Predictive maintenance for machinery and equipment
2. Quality control and defect detection
3. Supply chain optimization and inventory management
4. Demand forecasting and production planning
5. Process optimization and efficiency improvement

Transportation:
1. Autonomous vehicles and self-driving cars
2. Route optimization and traffic prediction
3. Demand forecasting for ride-sharing and delivery services
4. Fleet management and vehicle routing
5. Predictive maintenance for transportation infrastructure

Natural Language Processing (NLP):
1. Sentiment analysis and opinion mining
2. Text classification and document categorization
3. Language translation and multilingual communication
4. Chatbots and virtual assistants
5. Text summarization and content generation

Computer Vision:
1. Object detection and recognition
2. Image classification and segmentation
3. Facial recognition and biometric authentication
4. Autonomous drones and aerial surveillance
5. Medical image analysis and diagnosis

Limitations:
1. Lack of Common Sense Reasoning: ML algorithms struggle with tasks requiring common
sense or understanding the context of a situation beyond the data they are trained on.
2. Creativity and Innovation: While applications like generating creative text formats are
emerging, current ML techniques lack the ability to truly innovate or come up with entirely
new ideas.
3. Ethical Decision-Making: ML algorithms cannot make ethical judgments or navigate
complex moral dilemmas requiring human values and understanding.


Tools in Machine Learning

Machine learning, a branch of artificial intelligence, is rapidly evolving and requires a robust set of
tools to build, train, and deploy models effectively. Here are some of the most popular tools, along
with their advantages, disadvantages, and limitations:

1. SCIKIT-LEARN:
Advantages:
• Simple and easy-to-use API, making it great for beginners.
• Comprehensive documentation and community support.
• Implements a wide range of classical machine learning algorithms.
Disadvantages:
• Limited support for deep learning models.
• May not be suitable for very large datasets or complex model architectures.
Limitations:
• Lack of flexibility in customization compared to other frameworks like TensorFlow or
PyTorch.
2. TENSORFLOW:
Advantages:
• Highly flexible and scalable, suitable for both research and production.
• Supports deep learning models with customizable architectures.
• TensorFlow Serving allows easy deployment of models.

Disadvantages:
• Steeper learning curve compared to simpler libraries like scikit-learn.
• Requires more lines of code for simple tasks.
Limitations:
• May require significant computational resources for training complex models.
3. PYTORCH:
Advantages:
• Dynamic computational graph makes it easier to debug and experiment.
• Pythonic API is intuitive and easy to learn.
• Growing popularity in both research and industry.
Disadvantages:
• Historically a less mature ecosystem compared to TensorFlow.
• Historically fewer production deployment tools compared to TensorFlow Serving.
Limitations:
• Training large models can be slower compared to TensorFlow in some setups, depending
on which optimizations are applied.
4. KERAS:
Advantages:
• High-level API, allowing for rapid prototyping and experimentation.
• Runs on top of a backend: historically TensorFlow, Theano, or Microsoft Cognitive Toolkit;
Keras 3 supports TensorFlow, JAX, and PyTorch.
• Simplified syntax makes it easy to build neural networks.
Disadvantages:
• Less flexibility compared to using TensorFlow or PyTorch directly.
• May be less convenient for implementing highly customized architectures.
Limitations:
• Limited support for complex research experiments compared to TensorFlow or PyTorch.
5. APACHE SPARK MLLIB:
Advantages:
• Distributed computing capabilities suitable for big data processing.
• Integration with Apache Spark ecosystem for data preprocessing and analysis.
Disadvantages:
• Limited algorithms compared to standalone libraries like scikit-learn.
• Slower compared to native implementations for smaller datasets.
Limitations:
• Not as actively developed or supported as other ML libraries.


Issues in Machine Learning

Although machine learning is used across industries and helps organizations make more
informed, data-driven choices than classical methodologies allow, it still has problems that
cannot be ignored. Here are some common issues that professionals face when building a
machine learning application from scratch.
Data Quality and Quantity:
Insufficient data: Inadequate amount of data can lead to poor model performance, especially for
complex models like deep learning.
Data imbalance: When the classes in a classification problem are not represented equally, the
model may become biased towards the majority class.
Noisy data: Data may contain errors, outliers, or irrelevant information, which can negatively
impact model performance.
Feature Engineering:
Identifying relevant features: Selecting the right features that contribute to predictive
performance is crucial. Missing important features or including irrelevant ones can degrade
model accuracy.
Handling categorical data: Encoding categorical variables effectively without introducing bias
or increasing dimensionality can be challenging.
Feature scaling: Ensuring that features are on similar scales can improve the performance of
certain algorithms, such as distance-based methods.
Model Selection and Evaluation:
Overfitting and underfitting: Overfitting occurs when a model memorizes the training data
instead of generalizing to unseen data, while underfitting happens when the model is too
simple to capture the underlying patterns.
Hyperparameter tuning: Selecting the optimal hyperparameters for a model can be
time-consuming and require extensive experimentation.
Model evaluation metrics: Choosing appropriate metrics to evaluate model performance based
on the problem domain is critical. Using an unsuitable metric (for example, accuracy on a
heavily imbalanced dataset) can lead to erroneous conclusions.
Interpretability and Explainability:
Black-box models: Complex models such as deep neural networks may lack interpretability,
making it difficult to understand the reasoning behind their predictions.
Model transparency: Understanding how a model makes decisions is important for gaining trust
and addressing concerns about fairness, bias, and ethics.

Deployment and Maintenance:
Deployment challenges: Integrating machine learning models into production systems while
ensuring scalability, reliability, and efficiency can be complex.
Monitoring and updating: Models may degrade over time as the data distribution changes
(data drift). Regular monitoring and updating are necessary to maintain performance.
Ethical and Legal Considerations:
Bias and fairness: Machine learning models can inherit biases present in the training data,
leading to unfair or discriminatory outcomes.
Privacy concerns: Handling sensitive data requires careful attention to privacy regulations and
ethical considerations, such as data anonymization and informed consent.


Machine Learning Activities

Data Collection and Preprocessing:
Example: Suppose you're building a spam email classifier. You collect a dataset containing emails
labeled as spam or not spam. Preprocessing involves tasks like removing HTML tags, converting text
to lowercase, removing stop words, and tokenization.
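Example: The preprocessing steps named above can be sketched with the standard library alone. The stop-word list is a tiny illustrative assumption, and the regular expressions are deliberately simple rather than a complete HTML or language parser.

```python
import re

STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "now"}  # tiny illustrative list

def preprocess(email_html):
    text = re.sub(r"<[^>]+>", " ", email_html)    # remove HTML tags
    text = text.lower()                           # convert to lowercase
    tokens = re.findall(r"[a-z0-9']+", text)      # tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

tokens = preprocess("<p>Claim your <b>FREE</b> prize now!</p>")
print(tokens)
```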
Exploratory Data Analysis (EDA):
Example: Before training a model, you analyze the distribution of features, correlations, and outliers
in your dataset. For instance, you might visualize the frequency of spam words in spam emails
compared to non-spam emails.
Feature Engineering:
Example: In a fraud detection system, you create new features like transaction frequency, average
transaction amount, and account age based on existing data. These features provide more
information to the model for better fraud detection.
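Example: A minimal sketch of deriving the three features mentioned above from raw records. The record layout (an account opening date plus a list of dated amounts), the 30-day window, and the `engineer_features` helper are assumptions made for illustration.

```python
from datetime import date

def engineer_features(account_opened, transactions, today):
    """transactions: list of (date, amount) tuples for one account."""
    recent = [amount for day, amount in transactions if (today - day).days < 30]
    amounts = [amount for _, amount in transactions]
    return {
        "account_age_days": (today - account_opened).days,
        "tx_per_30_days": len(recent),                       # transaction frequency
        "avg_amount": sum(amounts) / len(amounts) if amounts else 0.0,
    }

features = engineer_features(
    account_opened=date(2024, 1, 1),
    transactions=[(date(2024, 3, 1), 20.0),
                  (date(2024, 3, 20), 40.0),
                  (date(2024, 3, 25), 900.0)],
    today=date(2024, 3, 31),
)
print(features)
```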
Model Selection and Training:
Example: You experiment with different algorithms (e.g., logistic regression, random forest, neural
networks) and hyperparameters to find the best-performing model for your task. For instance, you
train multiple classifiers on your spam email dataset and compare their accuracy scores.
Model Evaluation:
Example: You hold out part of your dataset as a test set before training. After training your spam
email classifier on the remaining data, you evaluate it on the test set using metrics like accuracy,
precision, recall, and F1-score to assess how well the model generalizes to unseen data.
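Example: A minimal sketch of how the four metrics are computed from true and predicted labels. The label lists are made up, and "spam" is treated as the positive class by assumption.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here 6 of 8 predictions are correct (accuracy 0.75), while precision and recall are both 2/3 because one spam email was missed and one normal email was flagged.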
Hyperparameter Tuning:
Example: You use techniques like grid search or random search to tune the hyperparameters of your
machine learning model. For instance, you adjust the learning rate, regularization strength, and
batch size of a neural network to optimize its performance on a validation set.
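Example: A minimal sketch of grid search over one hyperparameter. To stay short it tunes the regularization strength of a one-variable ridge regression instead of a neural network; the data, the candidate values in `grid`, and the helper names are illustrative assumptions.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope for y ~ w * x with L2 penalty lam (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the line y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.3]   # noisy training data
val_x, val_y = [4.0, 5.0], [8.0, 10.0]                # held-out validation data

grid = [0.0, 0.1, 1.0, 10.0]                          # candidate regularization strengths
scores = {lam: mse(fit_ridge_slope(train_x, train_y, lam), val_x, val_y)
          for lam in grid}
best_lam = min(scores, key=scores.get)                # lowest validation error wins
print(best_lam, scores[best_lam])
```

Each candidate is fitted on the training data only and judged on the validation data only, so the choice of hyperparameter does not reuse the data the model was fitted on.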
Cross-Validation:
Example: Instead of relying on a single train-test split, you perform k-fold cross-validation to
evaluate your model's performance more robustly. For example, you divide your data into 5 folds,
train the model on 4 folds, and validate it on the remaining fold, repeating this process five times.
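Example: A minimal sketch of how the folds are formed, shown for 10 samples and 5 folds. It only produces index lists; the `k_fold_indices` helper is a hypothetical name, and in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Every sample is used for validation exactly once and for training in the other four rounds. The model's score is then averaged over the five rounds.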
Model Interpretation and Explainability:
Example: In a medical diagnosis system, you use techniques like SHAP (SHapley Additive
exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret the predictions
of your model. This helps you understand which features are most influential in making decisions.

Deployment and Monitoring:
Example: After building and evaluating your model, you deploy it into a production environment
where it can make real-time predictions. You set up monitoring systems to track the model's
performance over time and retrain or update it as necessary to maintain accuracy.
Transfer Learning:
Example: In image classification, you leverage a pre-trained convolutional neural network (CNN),
such as ResNet or VGG, which was trained on a large dataset like ImageNet. You fine-tune the CNN
on your specific task with a smaller dataset, achieving better performance than training from scratch.


Basic Types of Data in Machine Learning

1. Numerical Data:
Examples: Age, temperature, height, salary, stock prices.
Advantages:
o Easy to work with in many machine learning algorithms.
o Can represent a wide range of values and magnitudes.
Disadvantages:
o Outliers can significantly affect analysis and model performance.
o May require scaling or normalization to ensure features are on a similar scale.
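Example: A minimal sketch of the two common ways to put a numerical feature on a comparable scale, min-max scaling and standardization. The ages are made up, and both helpers assume the values are not all identical (otherwise they would divide by zero).

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]
scaled = min_max_scale(ages)
z_scores = standardize(ages)
print(scaled)
print(z_scores)
```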
2. Categorical Data:
Examples: Gender (Male, Female), color (Red, Green, Blue), product categories (Electronics,
Clothing, Books).
Advantages:
o Useful for representing non-numeric attributes and classes.
o Can provide valuable information for classification tasks.
Disadvantages:
o Need to be encoded into numerical values for most machine learning algorithms.
o High cardinality (many unique categories) can lead to issues like the curse of
dimensionality when each category is encoded as its own column.
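Example: A minimal sketch of one-hot encoding, one common way to turn categories into numbers. The `one_hot` helper is a hypothetical name; each distinct category becomes its own 0/1 column, which is why many unique categories produce many columns.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot(["Red", "Green", "Blue", "Green"])
print(categories)
print(encoded)
```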
3. Ordinal Data:
Examples: Education level (High School < Bachelor's < Master's < Ph.D.), Likert scale ratings
(Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree).
Advantages:
o Preserves order or ranking among categories, providing additional information.
o Can be useful for certain types of regression or ranking tasks.
Disadvantages:
o Not all machine learning algorithms can handle ordinal data directly.
o May require careful encoding to maintain the ordinal relationship.
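Example: A minimal sketch of an order-preserving encoding for the education levels listed above. The integer ranks are an assumption that keeps the ordering only; they do not claim that the gaps between levels are equal.

```python
# Categories listed from lowest to highest.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "Ph.D."]
rank = {level: position for position, level in enumerate(EDUCATION_ORDER)}

people = ["Master's", "High School", "Ph.D."]
encoded_levels = [rank[level] for level in people]
print(encoded_levels)   # a higher number means a higher education level
```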
4. Text Data:
Examples: Tweets, emails, articles, customer reviews.
Advantages:
o Rich source of information for sentiment analysis, text classification, and natural
language processing tasks.
o Can capture nuanced information and context.

Disadvantages:
o High-dimensional and sparse representation can be computationally expensive.
o Preprocessing steps like tokenization and stemming are necessary, which can
introduce noise.
5. Image Data:
Examples: Photographs, medical images, satellite images.
Advantages:
o Rich visual information suitable for tasks like object detection, image classification,
and image segmentation.
o Deep learning models like CNNs can automatically extract hierarchical features.
Disadvantages:
o Large memory and computational requirements for processing high-resolution
images.
o Requires extensive preprocessing and data augmentation to handle variations in
lighting, orientation, and scale.
6. Time Series Data:
Examples: Stock prices over time, temperature readings, and sensor data.
Advantages:
o Captures temporal dependencies and trends over time.
o Suitable for forecasting, anomaly detection, and trend analysis.
Disadvantages:
o Need to handle missing values and irregular sampling intervals.
o Sensitive to seasonality, trends, and noise.


Exploring the Structure of Data

Exploring the structure of data involves examining its organization, relationships, patterns, and
attributes to gain insights and understanding. Here are examples, advantages, and disadvantages of
exploring the structure of data:
1. Descriptive Statistics:
Examples: Mean, median, mode, standard deviation, and variance.
Advantages:
o Provides summary statistics that describe the central tendency, dispersion, and shape
of the data distribution.
o Helps identify outliers and anomalies.
Disadvantages:
o May not capture complex relationships between variables.
o Mostly limited to numerical data (of these measures, only the mode applies to categorical data).
Example: Calculating the mean and standard deviation of exam scores to understand the
average performance and variability among students.
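The exam score example can be sketched with Python's standard `statistics` module. The scores are made up for illustration, and the population forms of variance and standard deviation are used here as an assumption; the sample forms (`variance`, `stdev`) would give slightly larger values.

```python
import statistics

# Hypothetical exam scores for eight students.
scores = [55, 62, 62, 70, 71, 78, 85, 93]

mean_score = statistics.mean(scores)        # central tendency
median_score = statistics.median(scores)    # middle value, robust to outliers
mode_score = statistics.mode(scores)        # most frequent value
variance = statistics.pvariance(scores)     # population variance
std_dev = statistics.pstdev(scores)         # population standard deviation

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
print("variance:", variance)
print("std dev:", round(std_dev, 2))
```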
2. Data Visualization:
Examples: Histograms, scatter plots, box plots, bar charts.
Advantages:
o Offers visual representation of data distribution, trends, and relationships.
o Facilitates easy interpretation and communication of findings.
Disadvantages:
o Interpretation may vary based on visualization techniques.
o Limited to visualizing a few variables at a time.
Example: Plotting a histogram of customer ages to understand the age distribution in a
market dataset.
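As a rough illustration of the histogram example, the sketch below groups hypothetical customer ages into 10-year bins and prints a text histogram. A plotting library would normally be used instead; the plain-text output is only meant to show the binning idea.

```python
from collections import Counter

# Hypothetical customer ages.
ages = [19, 23, 25, 31, 34, 35, 38, 42, 47, 51, 58, 63]

# Assign each age to a 10-year bin labelled by its lower edge (e.g. 30 for 30-39).
bins = Counter((age // 10) * 10 for age in ages)

# Print one row per bin, with one '#' per customer in that bin.
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")
```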
3. Correlation Analysis:
Examples: Pearson correlation coefficient, Spearman rank correlation.
Advantages:
o Quantifies the strength and direction of relationships between pairs of variables.
o Helps identify potential predictors in regression analysis.
Disadvantages:
o Pearson correlation measures only linear relationships and Spearman only monotonic ones, so other non-linear associations may be missed.
o Correlation does not imply causation.
Example: Computing the correlation between advertising spending and sales revenue to
understand their relationship in a marketing dataset.
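The two coefficients named above can be sketched in plain Python. The spending and revenue figures are invented, the helper names are hypothetical, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: revenue grows with spending, but not in a straight line.
ad_spend = [1, 2, 3, 4, 5]
revenue = [1, 4, 9, 16, 25]

r_pearson = pearson(ad_spend, revenue)
r_spearman = spearman(ad_spend, revenue)
print("Pearson:", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Because revenue always increases with spending, the rank-based coefficient reaches 1, while the Pearson value stays slightly below 1 since the relationship is curved.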
4. Dimensionality Reduction:
Examples: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor
Embedding (t-SNE).
Advantages:
o Reduces the dimensionality of data while preserving important information.
o Facilitates visualization and interpretation of high-dimensional data.
Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information.
Example: Applying PCA to gene expression data to identify principal components
representing gene expression patterns.
5. Clustering Analysis:
Examples: K-means clustering, hierarchical clustering.
Advantages:
o Identifies natural groupings or clusters within the data.
o Useful for segmentation and pattern recognition.
Disadvantages:
o Requires choosing the number of clusters, which can be subjective.
o Results may vary based on the choice of distance metric and clustering algorithm.
Example: Using K-means clustering to segment customers based on their purchasing
behavior.
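The customer segmentation example can be sketched as a very small K-means on a single feature. The spending values, the initial centroids, and the function name `kmeans_1d` are assumptions for illustration; library implementations handle many features and choose starting points more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny K-means for one feature; `centroids` holds the initial guesses."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spending per customer.
monthly_spend = [12, 15, 18, 20, 95, 100, 110, 120]
centroids, clusters = kmeans_1d(monthly_spend, centroids=[10, 50])
print(centroids)
print(clusters)
```

The number of clusters is fixed by the number of initial centroids, which reflects the subjectivity noted in the disadvantages: a different choice of k or starting points can give a different segmentation.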


Data Quality & Remediation

Data quality refers to the reliability, accuracy, consistency, completeness, and relevance of data. Data
remediation is the process of identifying and correcting data quality issues so that data
is accurate, reliable, and suitable for analysis or decision-making. Here are examples, advantages,
disadvantages, and limitations of data quality and remediation in real-world scenarios:
Examples of Data Quality Issues:
1. Inconsistent formats: In a customer database, phone numbers are stored in various formats
(e.g., +1 (555) 123-4567, 555-123-4567, 5551234567).
2. Missing values: In a sales dataset, some records have missing values for the "sales amount"
field.
3. Duplicate records: A product inventory system contains duplicate entries for the same item.
4. Incorrect data: In a healthcare database, a patient's birthdate is recorded as 01/13/1900,
which is implausible for a living patient (and is not a valid date if the format is DD/MM/YYYY).
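Three of the issues listed above (inconsistent formats, missing values, and duplicate records) can be detected with simple checks, sketched below. The records, field names, and the helper `normalize_phone` are hypothetical, and the phone rule assumes 10-digit numbers with an optional leading country code of 1.

```python
import re

def normalize_phone(raw):
    """Keep digits only and drop a leading country code of 1, if present."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

# Hypothetical customer records showing the issues described above.
records = [
    {"name": "Ann", "phone": "+1 (555) 123-4567", "sales_amount": 120.0},
    {"name": "Ann", "phone": "555-123-4567", "sales_amount": 120.0},
    {"name": "Bob", "phone": "5551234567", "sales_amount": None},
]

# Fix inconsistent formats by storing every phone number the same way.
for record in records:
    record["phone"] = normalize_phone(record["phone"])

# Flag records with a missing sales amount.
missing = [r["name"] for r in records if r["sales_amount"] is None]

# Flag duplicates: the same name and phone seen more than once.
seen, duplicates = set(), []
for r in records:
    key = (r["name"], r["phone"])
    if key in seen:
        duplicates.append(r)
    seen.add(key)

print(missing)
print(duplicates)
```

The duplicate is found only after the phone numbers are normalized, which illustrates why format fixes usually come before duplicate detection.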
Advantages of Data Quality & Remediation:
1. Improved decision-making: High-quality data leads to more accurate insights and better-
informed decisions.
2. Enhanced efficiency: Clean and reliable data reduces the time later spent on rework and
troubleshooting.
3. Increased trust: Stakeholders have greater confidence in data-driven analyses and reports
when data quality is high.
4. Regulatory compliance: Ensuring data quality helps organizations comply with data
protection and privacy regulations.
Disadvantages of Data Quality & Remediation:
1. Time-consuming: Identifying and rectifying data quality issues can be a time-intensive
process, especially for large datasets.
2. Costly: Data remediation efforts may require investments in tools, resources, and personnel.
3. Complexity: Some data quality issues may be challenging to detect and correct, especially in
heterogeneous datasets.
4. Potential for errors: Human error during data cleaning and remediation can introduce new
inaccuracies or biases.
Limitations in Real-World Scenarios:
1. Real-time data streams: Data quality issues may arise in streaming data sources where
there is limited time for manual intervention. Automated data quality checks and remediation
processes are essential in such cases.

2. Data integration: When integrating data from multiple sources, inconsistencies in formats,
naming conventions, and data definitions can complicate data quality efforts. Standardization
and data governance practices are crucial to address these challenges.
3. Unstructured data: Textual data from sources like social media or customer feedback may
contain noise, ambiguity, or sentiment that pose challenges for automated data quality
assessment and remediation.
4. Legacy systems: Older systems may have outdated data formats, redundant fields, or
missing documentation, making it difficult to ensure data quality without significant efforts
in data migration and modernization.


Data Pre-Processing
Data preprocessing is a critical step in machine learning pipelines: it transforms raw data
into a clean, structured format suitable for training machine learning models. It includes
techniques for handling missing values and outliers, scaling and normalizing features, encoding
categorical variables, and removing irrelevant features. Here is an overview with suitable
examples, advantages, and disadvantages.
1. Handling Missing Values:
Example: Suppose you have a dataset of customer information, and some entries have
missing values for the "income" attribute. You can handle this by imputing missing values
using techniques like mean, median, or mode imputation, or by using advanced imputation
methods like K-nearest neighbors (KNN) or predictive models.
Advantages:
o Prevents loss of valuable data.
o Improves the robustness and reliability of the dataset.
Disadvantages:
o Imputation methods may introduce bias if not handled carefully.
o Imputed values may not accurately represent the true underlying data distribution.
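Mean and median imputation for the "income" example can be sketched as follows. The income values are invented, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, median

# Hypothetical incomes; None marks a missing value.
incomes = [42000, None, 58000, 61000, None, 250000]

# Compute the statistics from the observed values only.
observed = [value for value in incomes if value is not None]

# Replace each missing entry with the mean or the median of the observed values.
mean_imputed = [v if v is not None else mean(observed) for v in incomes]
median_imputed = [v if v is not None else median(observed) for v in incomes]

print(mean_imputed)
print(median_imputed)
```

The single very high income pulls the mean well above most observed values, while the median stays close to the typical income. This is one way imputation can introduce the bias mentioned in the disadvantages.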
2. Outlier Detection and Removal:
Example: In a dataset of housing prices, you may find some entries with unrealistically high
or low prices. Outliers can be detected using statistical methods like Z-score or IQR
(Interquartile Range) and removed or adjusted accordingly.
Advantages:
o Improves model performance by reducing the impact of outliers.
o Prevents the model from being skewed by extreme values.
Disadvantages:
o Removal of outliers may lead to loss of valuable information.
o Subjective choice of outlier detection method and threshold.
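The IQR method for the housing price example can be sketched as below. The prices are hypothetical, and the 1.5 multiplier is the conventional threshold, which is itself one of the subjective choices noted above.

```python
from statistics import quantiles

# Hypothetical house prices in thousands; one entry is unrealistically high.
prices = [210, 225, 240, 250, 255, 265, 280, 300, 1500]

# Quartiles split the sorted data into four parts; IQR is the middle spread.
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower or p > upper]
cleaned = [p for p in prices if lower <= p <= upper]

print(outliers)
print(cleaned)
```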
3. Feature Scaling and Normalization:
Example: In a dataset containing features with different scales (e.g., age and income), scaling
techniques like Min-Max scaling or Standardization (Z-score normalization) can be applied to
bring all features to a similar scale.
Advantages:
o Ensures that features contribute equally to the model.
o Helps algorithms converge faster during training.

Disadvantages:
o Scaling may amplify noise, for example when a nearly constant feature is stretched to the same range as informative features.
o Loss of interpretability for some features after scaling.
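Min-Max scaling and standardization for the age and income example can be sketched in plain Python. The values and function names are assumptions for illustration, and the population standard deviation is used here.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Subtract the mean and divide by the standard deviation (Z-score)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 40000, 60000, 80000, 100000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
z_ages = standardize(ages)

print(scaled_ages)
print(scaled_incomes)
print([round(z, 3) for z in z_ages])
```

After Min-Max scaling, both features lie between 0 and 1, so income no longer dominates age simply because its raw numbers are larger.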
4. Encoding Categorical Variables:
Example: Suppose you have a categorical feature like "gender" with values "Male" and
"Female." You can encode it into numerical values using techniques like one-hot encoding or
label encoding.
Advantages:
o Allows algorithms to work with categorical data.
o An ordered integer encoding can preserve the order of categories when one exists (e.g., low < medium < high).
Disadvantages:
o Increases dimensionality, especially with one-hot encoding.
o May introduce sparsity and multicollinearity in the dataset.
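Label encoding and one-hot encoding for the "gender" example can be sketched as follows. The function names are hypothetical, and categories are numbered in alphabetical order, which is an arbitrary choice made for this sketch.

```python
def label_encode(values):
    """Map each category to an integer, numbering categories alphabetically."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Create one 0/1 column per category, in alphabetical column order."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories

genders = ["Male", "Female", "Female", "Male"]

labels, mapping = label_encode(genders)
one_hot, columns = one_hot_encode(genders)

print(labels, mapping)
print(columns)
print(one_hot)
```

One-hot encoding adds one column per category, which shows how dimensionality grows for features with many distinct values.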
5. Dimensionality Reduction:
Example: Applying techniques like Principal Component Analysis (PCA) or t-distributed
Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of high-dimensional
datasets while preserving most of the relevant information.
Advantages:
o Reduces overfitting and computational complexity.
o Visualizes high-dimensional data in lower dimensions.
Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information, leading to loss of predictive power.


Summary of the Topics

PART – A [2 Marks]
1. What are the three main types of machine learning?
2. Provide an example of a problem that is not suitable for solving using machine learning
techniques.
3. Name two applications of machine learning in the healthcare industry.
4. Mention one popular tool used for deep learning and one for data preprocessing in machine
learning.
5. What are two common issues encountered in machine learning projects?
6. List two activities involved in a typical machine learning workflow.
7. Identify two basic types of data encountered in machine learning tasks.
8. Describe one technique used for exploring the structure of data in machine learning.
9. Explain the importance of data quality in machine learning and provide one example of a data
quality issue.
10. Name one step involved in data preprocessing for machine learning tasks.

PART – B [5 Marks]
1. Explain the three main types of machine learning (supervised, unsupervised, and reinforcement
learning) with relevant examples for each. Discuss the key differences between these types,
highlighting their strengths and limitations.
2. Discuss two real-world problems that are not well-suited for solving using machine learning and
explain why. What are some alternative approaches that could be used to address these
problems?
3. Describe three specific applications of machine learning in different domains (e.g., healthcare,
finance, transportation). Explain the specific tasks or challenges these applications address and
the benefits they provide.
4. Compare and contrast two popular machine learning tools (e.g., scikit-learn and TensorFlow)
based on factors like ease of use, flexibility, and suitability for different types of machine learning
problems.
5. Discuss the concepts of overfitting and underfitting in machine learning models. Explain how
these issues can impact the performance of a model and what techniques can be used to mitigate
them.
6. Explain the concept of natural language processing (NLP) and describe two potential applications
of NLP technology in different industries. Discuss the advantages and disadvantages of using NLP
for each application.

7. Differentiate between structured and unstructured data with illustrative examples for each.
Describe two challenges that can arise when working with high-dimensional data and how they
can be addressed.
8. Explain the importance of exploring the structure of data in machine learning. Discuss how
examining data structure can help identify potential biases and improve the quality of data
analysis and model development.
9. Describe the data pre-processing pipeline, outlining the key steps involved and providing
examples of techniques used for each step. Explain the rationale behind data pre-processing and
its impact on machine learning model performance.
10. Discuss two ethical challenges associated with the development and deployment of machine
learning models. Explain the potential consequences of these challenges and propose strategies
to ensure responsible and ethical practices in machine learning.
