
1Datamining Intro


Introduction

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

MA613 Data Mining


MA613 Data Mining: About the course

• 3-credit course; 1-credit lab (Data Mining Lab)

• Create a group mail id

• Syllabus will be uploaded in Moodle or sent to mail id

• Assignments will be uploaded in Moodle or sent to mail id


Patterns in the Data

• Data {(3, 9), (4, 12), (5, 15)}


• Test data: (6,y)
• Given data: {((1, −1, 2)^T, 4), ((3, 4, −1)^T, 14), ((1, 7, −1)^T, 26.5)}

• Test data: {((3, −3, 5)^T, y)}
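For the first data set above, {(3, 9), (4, 12), (5, 15)}, one way to find the pattern is to fit a straight line to the pairs and evaluate it at the test input x = 6. The sketch below assumes NumPy is available and that the pattern is linear (an assumption, not something the slide states). It recovers y = 3x up to floating-point error, so the predicted y is 18.

```python
import numpy as np

# Training pairs (x, y) from the first example on the slide.
x = np.array([3.0, 4.0, 5.0])
y = np.array([9.0, 12.0, 15.0])

# Least-squares fit of a degree-1 polynomial: y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the unknown output for the test input x = 6.
y_test = slope * 6.0 + intercept
print(slope, intercept, y_test)
```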



Patterns in the Data

• Find two clusters


• {2, 121, 94, 11, 3, 1001}
Patterns in the Data

(0.17176, 1.3807)
(1.3042, 0.39963)
(0.29626, 1.6562)
(0.95062, 0.14257)
(1.6749, 0.76618)
(1.1749, 0.35071)
(1.3647, 0.49265)
(0.9279, 0.014485)
(0.35133, 0.9043)
(1.1512, 1.5482)
(1.4698, 1.393)
(0.71853, 1.5127)
(1.2985, 0.33472)
Big Data

• Data explosion
• "Every 2 days we create as much information as we did up to 2003" (Eric Schmidt, former Google CEO)

• Big data
• Image data
• Video data
• Sensor data
• IoT data
What is Machine Learning?

• Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn
Applications of Machine Learning

• Medical diagnosis

• Fraud detection

• Spam filtering

• Weather prediction

• Natural language understanding


Face Detection

Source: Internet
AI Camera

Source: Internet
ChatGPT

Source: Internet
Google Self Driving car

Source: Internet
Automated Navigation

• Navigation on unknown terrains in distant space missions


• Autonav: an autonomous navigation and driving system on the Spirit and Curiosity rovers
• Deciding the suitable spectrum for communication

Source: Internet
Intelligent Robots
• Selection of new observational targets
• AEGIS: NASA software capable of autonomously
identifying interesting rocks and terrain features

Source: Internet
Brain Computer Interface
Machine Learning
Terminologies
Overfitting and Underfitting

Taken from Bishop's book


Types of Learning

• Supervised learning

• Unsupervised learning

• Reinforcement learning

• Semi-supervised learning

• Active learning

• Transfer learning
Supervised Learning

• Let {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be the given data points, where x_i ∈ X ⊆ R^n and y_i ∈ Y ⊆ R
• Cardinality: N

• Attributes (features): n
Example: Model to Detect Heart Disease

• Data collected from 100 subjects. From each subject:

• Cholesterol level
• Fasting blood sugar

• N = 100, n = 2
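Notation: i-th data point

• The i-th data point is a column vector; its transpose is a row vector:

\[
x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{in} \end{pmatrix},
\qquad
x_i^T = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{in} \end{pmatrix}
\]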
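As a rough illustration of this notation, the sketch below stores N = 100 data points with n = 2 attributes as an N × n array, so that row i holds x_i^T. The values are random placeholders, not real measurements, and the array layout is one common convention rather than something the slides prescribe. Python indexing starts at 0, so `X[4]` is the fifth data point.

```python
import numpy as np

N, n = 100, 2  # cardinality and number of attributes, as in the heart disease example

# Placeholder values only: random numbers stand in for real measurements.
rng = np.random.default_rng(0)
X = rng.random((N, n))  # row i holds the transposed data point x_i^T

i = 4
x_i = X[i].reshape(n, 1)  # column vector x_i, shape (n, 1)
x_i_T = x_i.T             # row vector x_i^T, shape (1, n)

print(X.shape, x_i.shape, x_i_T.shape)  # (100, 2) (2, 1) (1, 2)
```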
Objective

• To develop a model that has good generalization capacity


• Training Data
• Testing Data
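Generalization is usually estimated by holding some data back: the model is built on the training data and evaluated on testing data it has not seen. The sketch below shows one simple way to make such a split. The 80/20 ratio, the random shuffling, and the placeholder data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 100, 2
X = rng.random((N, n))          # placeholder inputs, not real measurements
y = rng.integers(0, 2, size=N)  # placeholder binary labels

# Shuffle the indices, then keep 80% for training and 20% for testing
# (the 80/20 ratio is an assumption, not taken from the slides).
perm = rng.permutation(N)
n_train = (4 * N) // 5
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```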
Supervised Learning: Two Types

• Classification : outputs are discrete


• The number of values y takes is finite
• If y takes only two different values, the task is called a binary classification problem
• y ∈ {0, 1}: binary classification problem
• If y takes more than two different values, the task is called a multi-class classification problem
• y ∈ {−1, 0, 1}: three-class problem

• Regression: outputs are continuous


• y ∈ [0, 1]
• y ∈ R
Study of Arthritis Data: Classification Sample Data

Data     Soleus   Gastrocnemius   y_i (1/0)

x_1^T    x_11     x_12            1
x_2^T    x_21     x_22            0
x_3^T    x_31     x_32            0
x_4^T    x_41     x_42            1
x_5^T    x_51     x_52            1
Binary classification Problem

• Two-class classification problem: positive class, negative class
• Positive class: Arthritis patients
• Negative class: Normal subjects
Regression: Sample Data

Data     Angular position   Velocity   y_i: angular acceleration

x_1^T    1                  3          0.1
x_2^T    9                  15         0.05
x_3^T    3                  9          0.07
x_4^T    2                  1          0.03
x_5^T    4                  2          0.5
Challenger USA Space Shuttle
• Task: predict the number of O-rings that experience thermal distress on a flight at 31 degrees F, given data on the previous 23 shuttle flights
• Attributes
• Number of O-rings at risk on a given flight
• Number experiencing thermal distress
• Launch temperature (degrees F)
• Leak-check pressure (psi)
• Temporal order of flight
Relation

Function: f : X → Y
• Domain: X
• Codomain: Y
• Range: {y ∈ Y : y = f(x) for some x ∈ X}, Range(X) ⊆ Y
• If f(x) = y, y is called the image of x and x is called a preimage of y
• f is said to be one to one if f(x1) = f(x2) implies x1 = x2. In other words, no two distinct elements of the domain have the same image.
• f is said to be onto if Range(X) = Y
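For a function on a small finite domain, these definitions can be checked directly. The sketch below represents f as a Python dictionary; the helper names and the example sets are hypothetical and chosen only for illustration.

```python
def function_range(f, domain):
    """Set of images {f(x) : x in domain}."""
    return {f[x] for x in domain}


def is_one_to_one(f, domain):
    """True if no two distinct domain elements share an image."""
    images = [f[x] for x in domain]
    return len(set(images)) == len(images)


def is_onto(f, domain, codomain):
    """True if every element of the codomain is the image of some x."""
    return function_range(f, domain) == set(codomain)


# Illustrative example: X = {1, 2, 3}, Y = {"a", "b", "c"}.
X = {1, 2, 3}
Y = {"a", "b", "c"}
f = {1: "a", 2: "b", 3: "a"}  # 1 and 3 share the image "a"; "c" is never reached

print(function_range(f, X) == {"a", "b"})  # True
print(is_one_to_one(f, X))                 # False
print(is_onto(f, X, Y))                    # False
```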
Supervised Learning: Model

• A function (model) : f : X → Y
• Classification: f defines the decision boundary, which separates the data into different classes
• Regression: f is called the approximating function
Hyperplane: Classification

• Two class problem

• f(x) ≥ threshold, x ∈ positive class

• f(x) < threshold, x ∈ negative class
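The threshold rule can be sketched with an affine function f(x) = w^T x + b, whose level set f(x) = threshold is a hyperplane. The weight vector `w`, offset `b`, and threshold below are hypothetical values picked for illustration; in practice they would be learned from the training data.

```python
import numpy as np

# Hypothetical parameters of the hyperplane (not learned from data).
w = np.array([1.0, -1.0])
b = 0.0
threshold = 0.0


def f(x):
    """Affine function f(x) = w^T x + b."""
    return float(w @ x + b)


def classify(x):
    """Positive class if f(x) >= threshold, negative class otherwise."""
    return "positive" if f(x) >= threshold else "negative"


print(classify(np.array([2.0, 1.0])))  # f = 1.0  -> positive
print(classify(np.array([1.0, 2.0])))  # f = -1.0 -> negative
```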
Hyperplane: Classification

• Separable data
Nonlinear Function

• Non-separable data


Hyperplane: Regression

[Figures: Data; Hyperplane]


Nonlinear Function: Regression
Regression and Classification

• For classification, if the decision boundary is a line, how many attributes does the data have?
• For regression, if the function that approximates the data is a line, how many attributes does the data have?
Linear and Nonlinear Algorithms

• If f is an affine function: linear algorithm

• Otherwise: nonlinear algorithm
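As a sketch of a linear algorithm for regression, the code below fits an affine function f(x) = w^T x + b by ordinary least squares to the five points of the regression sample data (angular position, velocity, angular acceleration), reading the first output as 0.1. Least squares is used here as one common choice; the slides do not name a specific fitting method.

```python
import numpy as np

# Regression sample data: columns are angular position and velocity.
X = np.array([[1.0, 3.0], [9.0, 15.0], [3.0, 9.0], [2.0, 1.0], [4.0, 2.0]])
y = np.array([0.1, 0.05, 0.07, 0.03, 0.5])  # angular acceleration

# Append a column of ones so the offset b is estimated together with w.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:2], coef[2]


def f(x):
    """Fitted affine function f(x) = w^T x + b."""
    return float(w @ x + b)


# At a least-squares solution the residual is orthogonal to the columns of A.
residual = A @ coef - y
print(np.allclose(A.T @ residual, 0.0, atol=1e-9))
```

With only five points the fit is purely illustrative; it says nothing about how well an affine model generalizes for this data.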
Unsupervised Learning

• No label

• Clustering
• Divide the data into different groups
• Data in the same group are similar; data in different groups are dissimilar
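A minimal clustering sketch, applied to the six numbers from the earlier "Find two clusters" slide, is shown below. It assumes that similarity means closeness on the number line and uses a basic 2-means loop with the centers initialized at the smallest and largest values. Both choices are assumptions: a different notion of similarity (for example, prime versus composite) would give a different grouping.

```python
import numpy as np

data = np.array([2.0, 121.0, 94.0, 11.0, 3.0, 1001.0])

# Deterministic initialization (an assumption): smallest and largest values.
centers = np.array([data.min(), data.max()])

for _ in range(100):
    # Assign each point to its nearest center.
    labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of the points assigned to it.
    new_centers = np.array([data[labels == k].mean() for k in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

clusters = [data[labels == k].tolist() for k in range(2)]
print(clusters)  # [[2.0, 121.0, 94.0, 11.0, 3.0], [1001.0]]
```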
Study of Telomeres
• A telomere is a region of repetitive DNA at the end of a chromosome
• Telomeres help protect chromosomes from fusing with each other
• Gradual loss of telomeres results in age-related diseases and cancer
• Techniques that can measure telomere length will help in the early diagnosis and prevention of age-related diseases and cancer
[Figure: plot of SSC-Height (vertical axis) against FSC-Height (horizontal axis); both axes run from 0 to 1000]
Nanoparticle Representation
Reinforcement Learning
A four-legged robot is built. The objective is to program it to walk. How should we proceed?

• Use RL Algorithm
