
CENG566: Machine Learning
Lecture 1 Handouts – Introduction to Machine Learning (extra notes in red)
Source: Machine Learning: An Algorithmic Perspective, Second Edition, by Stephen Marsland
Prepared by: Dr. Zaher Merhi

1
Introduction to Machine Learning
• Consider an online retail store like Amazon, with data about the behaviour of its clients: purchases and preferences.
• Based on these data, we would like to surface items that you might be interested in, just like the movie suggestions on Netflix.
• The problem we have is one of prediction:
• Given the data we have, predict what the next person will buy.
• The reason this works is that people who are similar act similarly.
• This is an example of supervised learning, i.e. learning with a teacher.
• On the other hand, storing all this activity takes a lot of space.
• Storing data in large quantities is a well-known problem.
• The challenge is to do something useful with it.
• The size and complexity of the data mean that humans are unable to extract useful information from it by hand.

2
Vid Notes 1
- Amazon focuses on the clients who buy particular products.
- On the website, a set of products that might interest the client is displayed,
- to encourage them to buy those products.
- When we visit the Amazon site, then, based on previous visits to the website, Amazon collects data about its customers: it inspects the browser history through cookies to find the items of interest for a given client, and it processes this data with an ML program to suggest to the visitor a set of products they might like.
- All of this is a prediction process (it predicts the desires of the customer, by means of supervised learning).

- Another example: the Netflix website

- The first time you visit Netflix, it displays movies from all genres.
- Over time, you focus on a particular category or pick a particular set of movies.
- By the second and third visits, Netflix starts concentrating on the categories we showed interest in before.
- Netflix gathers data about the user and uses it to predict the movies they might like.

- Supervised learning:
- based on previous experience
- like a website that has a database
- we are given the inputs and the proper answer for each situation
3
Vid Notes 2
- ML vs. classical programming
- Example: for a given piece of software, x → programming → y, i.e. y = f(x).
- In ML we do not program the relation between x and y; we give the ML algorithm x1 ... xn and y1 ... yn and it finds the relation between them.
- Traditional approach: we provide the formula.
- ML: we provide data so that it can arrive at the relation.
- ML is based on learning from data; it also gives us a solution for storing huge amounts of data on the servers and databases distributed around the world (keep only the useful data, compress the data, etc.).
- Data is tracked through site visits and social media and is used for analysis.
- Large amounts of data require large storage capacity.
- ML depends on data, but it uses only the useful data and reduces it; we compress it in a way that extracts only the useful part.
4
Introduction to Machine Learning
• Classifying data can be very easy with 2-D data and a small amount of data. Humans struggle to process large amounts of data, so we try to visualize it; beyond 3 dimensions we can no longer do that, and this is one of the things that prompts ML.
• In higher dimensions (above 3) no visualization is possible.
• Projection might help (reducing a 10-D dataset to 2-D or 3-D), but it will mask (lose) some of the information.
• It will hide information needed for distinguishing classes that differ only in the discarded dimensions.
• Imagine a 3-D cloud of points where two classes are separable. If you project it onto a 2-D plane, some points from different classes might overlap (see the sketch below).

5
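As an illustration of this last point, here is a small sketch (not from the lecture; the class means and sizes are made up): two classes that are separated only along the third axis become indistinguishable once we drop that axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes that differ only along the third dimension (z).
class_a = rng.normal(loc=[0.0, 0.0, -3.0], scale=1.0, size=(100, 3))
class_b = rng.normal(loc=[0.0, 0.0, +3.0], scale=1.0, size=(100, 3))

# In 3-D the classes are (almost) perfectly separable by the plane z = 0.
print("fraction of A below z=0:", np.mean(class_a[:, 2] < 0))
print("fraction of B above z=0:", np.mean(class_b[:, 2] > 0))

# Project onto the first two dimensions (drop z): the two clouds now overlap
# completely, so the class information has been lost by the projection.
proj_a, proj_b = class_a[:, :2], class_b[:, :2]
print("2-D means of A and B are almost identical:")
print(proj_a.mean(axis=0), proj_b.mean(axis=0))
```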
Vid Notes 3
- The previous examples were regression / prediction.
- Regression is a type of supervised learning in machine learning used for predicting continuous values. Unlike classification (which assigns labels to categories), regression estimates a numerical value based on input features. (It is a supervised technique.)
- Now, in the table example: x1 and x2 are coordinates.
- A point is the combination of x1 and x2.
- Each point belongs to a certain class; there are only 3 classes in this example.
- The ML program/software should be able, if we give it a data point that is not in the table, i.e. a new point with x1 and x2, to determine its class. This is a classification problem.
- Table data: the training data, available for the software to learn from.
- The software sees each point and its class.
- Class 1: +
- Class 2: lightning bolt
- Class 3: circle
- The software tries to determine the relation between the coordinates x1, x2 and the class, i.e. to determine a decision boundary / threshold that distinguishes the classes. 6
Vid Notes 4
- Number of inputs: P1 up to Pn; each input has 2 values, so the dimension of each input is 2.
- Each of x1 and x2 is a feature of the data.
7
Applications of Machine Learning
• Many applications exist for machine learning:
• Spam filters: the email system works as an ML classifier that classifies legitimate email versus spam.
• Voice recognition: recognizes voices based on their specific characteristics and assigns each sound to the person it belongs to.
• Computer games.
• Automatic number plate recognition.
• Anti-skid braking systems: a safety system in vehicles that prevents the wheels from locking up during braking, improving stability and control. This is especially useful on slippery roads or during emergency stops.
• Vehicle stability control.
8
Learning
• What is learning?
• The key concept is learning from data.
• What about learning from experience?
• The key parts of learning are remembering, adapting, and generalizing.
• That is, recognizing similarity between different situations, so that what was learned in one place can be used in another.

• There are also other aspects of intelligence, such as reasoning and logical deduction.

• However, we are only interested in the most fundamental parts of intelligence, which are learning and adapting,
• and in how to model them in a computer.
9
Vid notes 4
• The program learns on its own based on the data we give it.
• Without data we have no ML.
• If there is data, we can build software based on the techniques we have.
• How? Say we give the software inputs, and for each input there is a given target response t.
• Initially we present input x1 and the program produces y1, but then we tell it that the output should have been t1. Based on the error between t1 and y1, the program changes the parameters it has, so as to reach minimal error and find an output as close as possible to t1 (the target value). This is the feedback (adaptation).
• Adaptation depends on different metrics, the most prominent being the error, t − y: target value minus output value (technically, we will see later that it is actually based on the squared difference).
• Learning is based on experience, on the data we provide: every time we present an input, the program computes its output, and since we have also provided the targeted output, it adapts on that basis.
• In the first phase it learns from the data and adapts its parameters, until it reaches minimal error and the best solution (less error means more accurate software). A minimal sketch of such an adaptation loop follows this slide.
• Remembering: we give x1 and, after adaptation, it gives y1; it should later give the same output for the same input.
• Generalization: trained on n inputs, if we give it a new input that was not in the training set, the ML program should, based on its previous experience, produce an appropriate output: just as x1 gives y1, a new xm should give a corresponding ym. 10
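The following is a minimal sketch (not from the lecture) of this error-driven adaptation: a single weight is repeatedly nudged so that the output y = w·x moves closer to the target t. The data values and learning rate are illustrative assumptions.

```python
# Minimal sketch of error-driven adaptation: adjust one weight w so that
# the output y = w * x approaches the target t.
x, t = 2.0, 6.0          # one input and its target response (made-up values)
w = 0.0                  # initial parameter
learning_rate = 0.1

for step in range(20):
    y = w * x            # current output of the "program"
    error = t - y        # feedback: target value minus output value
    w += learning_rate * error * x   # adapt the parameter to reduce the error

print("learned weight:", w)          # approaches 3.0, since 3.0 * 2.0 = 6.0
print("final output:", w * x, "target:", t)
```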
Machine Learning
• Machine learning is about making computers modify or adapt their actions so that these actions become more accurate.
• We give the software a set of available data with a certain label (solution); the software computes an output and, over time, tries to compute an output that is close to the solution / desired output we gave it.
• Accuracy is measured by how well the chosen actions reflect the correct ones, based on the adaptation of the software to new situations: real output versus desired output.
• Example: playing chess with a computer.
• Initially you keep beating the machine.
• At some point the machine starts to learn your techniques and starts beating you.
• The next time someone else plays with the machine, it will not start from scratch.
• Computational complexity is also of interest.
• It can be broken into two parts:
• Complexity of training:
• Does not happen often. Training happens once (or infrequently), so it is less critical for real-time systems, but it needs time.
• Complexity of applying the trained algorithm:
• Time-critical decisions. Fast predictions are important; application should be fast, since most ML programs are used in real-time applications. 11
Types of Machine Learning
• Picking the variables that you want to use, also called features, is very important for finding a solution.
• Choosing how to process the data can also be important.
• Supervised Learning: a training set of examples with the correct responses (targets) is provided.
• The algorithm generalizes so as to respond correctly to all possible inputs.
• The training examples are also called exemplars (the model is guided by exemplars).
• Training data set: inputs together with their solutions; based on these pairs, the system arrives at a function that performs the right mapping between input and output. We are supervising the machine / learning with a teacher.
• Unsupervised learning: correct responses are not provided; instead, the algorithm tries to identify similarities between the inputs so that they can be categorized. The machine has to perform a statistical study of the inputs and find factors they have in common, in order to sort the inputs into different categories (density estimation).
• The statistical approach is known as density estimation.
• In density estimation we model the probability distribution of the data. It helps identify how data points are distributed in a space: it determines where data points are concentrated and how they spread. It answers: "Given some data, what is the likelihood of a new data point belonging to the same distribution?"
• By estimating density, we can detect clusters, outliers, and anomalies in datasets. 12
Vid notes 5
• Data: a set of input vectors; each vector/input is composed of a vector of values/coordinates, and these components are called features.
• Depending on the problem, it is not necessary to use the whole data set we have; we extract the data that is best suited to the problem at hand.
• From the characteristics of a given input, we take some, not all, of the components of the input vectors, because some components are more useful than others for the desired situation/problem.
• The previous examples were mostly supervised learning.
• When is it supervised and when unsupervised? It depends on the data we have: labeled vs. unlabeled data.
13
Types of Machine Learning
• Reinforcement learning: in between supervised and unsupervised learning.
• The algorithm gets told when the answer is wrong, but it is not told how to correct it (this feedback comes in the form of rewards or punishments).
• It has to explore different possibilities to get it right (it optimizes over time, through trial and error).
• It is called learning with a critic.

• Evolutionary Learning: based on biological evolution.

• It explores models that deal with fitness (a score) corresponding to how good the current solution is.
• Solutions with higher fitness are more likely to be chosen to "reproduce" and make new solutions.
• It is not important for this course; we will only use the previous three types of ML.
14
Data Collection and Preparation
• Sometimes the data is ready and available; most of the time it will be scarce, or will have to be collected from scratch.

• Having a large amount of data is very important in machine learning. However, attaining large amounts is challenging:

• Sensors collecting data are subject to noise, so low-error data is hard to attain.
• Sometimes it is simply difficult to collect the data.
• How much data is enough to learn from, without requiring excessive computation, is impossible to predict in advance.
• We can use the data raw, as it is, or perform feature extraction, keeping the important features that distinguish the different inputs.
• Often we do not have data, or do not have labeled data, and we have to run experiments to get it. Example: in a given network we obtain an input signal through different sensors, but it could be mixed with noise, which affects data quality.
• We may collect data from clinics and hospitals, and there may be mistakes in it because we are collecting it from different people.
• The solution is that we must clean the data and remove the noise, remove outlier values, check for missing values, and check whether the data was entered in the wrong place (the preprocessing step). 15
The Machine Learning Process
• Feature Selection
• Identifying the features that are useful for the problem under examination (depending on the situation); we pick the best features for our problem's scope.
• Requires prior knowledge of the problem and the data, in order to know which features are useful.
• Features should not be expensive to collect (high cost of collection) or corrupted with high noise (which can cause overfitting).
• Algorithm Choice
• Given the data set, what is the appropriate algorithm?
• Algorithm family: unsupervised, supervised, or reinforcement learning, depending on the type of data we have (labeled or not).
• Each of these categories has subcategories (different algorithms belonging to supervised learning, and so on). We choose based on the characteristics of these algorithms and their performance, i.e. whether a given algorithm is suitable for solving this type of problem.
• Evaluation and model selection (e.g. SVMs, covered later)
• Parameters that need to be set manually, or that need experimentation to identify appropriate values. We adjust the parameters of the algorithm and initialize them (learning rate, performance estimate, degrees of freedom of the function in regression).
• Training
• Use computational resources to build a model that predicts the output. This takes time (adaptation, optimization) and yields the proper model.
• Evaluation
• Before the system is deployed it needs to be tested,
• for example through metrics: accuracy, precision, the confusion matrix, etc.
• Evaluation: we give the model a new set of data (test data), ask it to produce the output, and based on that output there are different metrics to evaluate it. A compact sketch of this whole process follows this slide. 16
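A minimal end-to-end sketch of this process, assuming scikit-learn is available; the dataset, the SVM classifier, and the split size are illustrative choices, not part of the lecture.

```python
# Illustrative run through the ML process: data -> split -> algorithm choice
# -> training -> evaluation. Assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)          # features and labels (supervised data)

# Hold out part of the data for testing (evaluation on unseen data).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SVC(kernel="rbf", C=1.0)           # algorithm choice + manually set parameters
model.fit(X_train, y_train)                # training (adaptation / optimization)

y_pred = model.predict(X_test)             # evaluation on the test set
print("accuracy:", accuracy_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```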
Vid notes 6 (upcoming 2 slides)
- An example we will see with neural networks:
- The neuron.
- Adaptation of a piece of software based on a neural network.
- We have input data (a vector x, composed of different features x1 ... xn).
- A neuron is a processing unit composed of 2 stages:
- a combiner/summation, which sums the different inputs (x1 ... xn) multiplied by their corresponding weights (w1 ... wn) of the same dimension, without a bias for now.
- Each input receives a certain scaling/weighting through the receivers (synapses) we have.
- There are also missing values, the bias, etc. (we will talk about them later).
- The combiner adds up the weighted combination of the different features we have.
- The combined value gives a response v, which is passed through an activation function g, which compresses/limits the result.
- Final output: y is the value of g, the activation function, applied to its input v.
- The output y may have a different dimension from x, but y and t have the same dimension.
- In supervised learning this output is compared with the targeted values, using a subtractor for example, to get the error e; e is used to adapt (change the different weights) so as to minimize the error.
- The activation function limits the output between 0 and 1, or between -1 and 1, depending on the function we are using. A small sketch of such a neuron follows this slide. 17
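A minimal sketch of this neuron (weighted sum followed by an activation function); the input values, weights, and the choice of a logistic activation are illustrative assumptions.

```python
import numpy as np

def neuron(x, w, b=0.0):
    """One neuron: a combiner (weighted sum) followed by an activation function."""
    v = np.dot(w, x) + b            # combiner: sum of inputs scaled by their weights
    y = 1.0 / (1.0 + np.exp(-v))    # activation g(v): logistic, limits output to (0, 1)
    return y

x = np.array([0.5, -1.2, 3.0])      # input vector (features x1 ... xn)
w = np.array([0.8, 0.1, -0.4])      # corresponding weights (w1 ... wn)

y = neuron(x, w)
t = 1.0                              # target value for supervised learning
e = t - y                            # error used to adapt the weights
print("output y =", y, "error e =", e)
```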
Some Terminologies:
• Inputs:
• An input vector x = (x1, x2, ..., xm) is the data given as one input to the algorithm, where m is the input dimension.
• Weights:
• wij are the weighted connections between nodes i and j.
• For a neural network (a machine learning technique), weights are analogous to the synapses of the brain.
• They are arranged in a weight matrix W.
• Outputs:
• An output vector y = (y1, y2, ..., yn), where n is the output dimension.
• Targets:
• A target vector t = (t1, t2, ..., tn), where n is the output dimension.
• Target vectors are the extra data that we need for supervised training.
• They provide the correct answers for the algorithm to learn from.
18
Some Terminologies:
• Activation function:
• For a neural network, the activation function g(·) is a mathematical function that describes the threshold at which the neuron is activated or not.
• Error:
• A function E(y, t) that computes the inaccuracies of the network as a function of the outputs y and the targets t.

19
Weight Space
• Plotting data is useful; however, the dimension should be 3 or less for us to visualize it.
• Plotting weights is especially useful in neural networks.
• Weights are the parameters of a neural network that connect the neurons to the input.
• These weights control how much influence each input feature has on the output of a neuron. The weights are the learnable parameters that the model adjusts during training to minimize the error.
• If we treat the weights as a set of coordinates, then we have a weight space.
• With the weight space we can assess how close a neuron and an input are to each other.
• If the neuron is close to the input in this sense then it should fire, and if it is not close then it shouldn't.
• Each point in this space represents a unique configuration of weights, which collectively defines the model's behavior and performance.
• We can plot the inputs in the same space.
• A bias cannot be used, since it would add an extra dimension.
• We can measure closeness using the Euclidean distance between the input vector and the weight vector (a small sketch follows this slide).

• This gives us a different way of thinking about learning:

• By changing the weights we are changing the location of the neuron in the weight space.
• We can use the idea of neurons and inputs being close together to decide when the neuron should fire.
- Suppose we have two neurons, and each neuron is fed different inputs, each a weighted sum of the different features we have (a linear combination of the inputs).
- Thus we can draw the neurons according to their position in weight space,
- i.e. their weights place them at a certain location in the weight space. 20
- With the input at the origin, we compare the neurons based on their Euclidean distance from the origin.
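A tiny sketch of this distance-based view (the vectors are made up for illustration): the neuron whose weight vector is closest to the input, in Euclidean distance, is the one that should fire.

```python
import numpy as np

x = np.array([1.0, 2.0])                  # an input point
w1 = np.array([0.9, 2.1])                 # weight vector of neuron 1
w2 = np.array([-3.0, 0.5])                # weight vector of neuron 2

# Euclidean distance between the input and each neuron's position in weight space.
d1 = np.linalg.norm(x - w1)
d2 = np.linalg.norm(x - w2)
print("distances:", d1, d2)
print("neuron 1 fires" if d1 < d2 else "neuron 2 fires")
```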
Machine Learning Issues
• The Curse of Dimensionality
• The essence of this curse, as it applies to machine learning algorithms, is that as the number of input dimensions increases, we need more data to enable the algorithm to generalize sufficiently well.
• As the number of features (or dimensions) in your data increases, the data becomes sparser. This means the algorithm has fewer examples to work with in each region of the feature space. To make good predictions, the model needs more data to properly understand and generalize from the input.
• The algorithm will try to separate the data into classes based on the features.
• As the number of features increases, so does the number of data points needed.
• Dimensionality curse: high dimension means many inputs are needed; the number of inputs must be much larger than the dimension of each input. The more dimensions/features there are, the larger the number of training inputs must be. That is why it is better to perform feature extraction, to shrink the dimensions.
• Testing machine learning algorithms.
• We need a training set to train our algorithm based on targets (supervised).
• Another set is needed (the test set) to test how well things are going.
• The only problem is that this reduces the amount of data available for training. (It is used after the adaptation.)
• The testing data is only used once the model has been fully trained and optimized. 21
Machine Learning Issues (Overfitting)
• We need to make sure that we do enough training that the algorithm generalizes well.
• (Generalization means that the algorithm performs well on new, unseen data, not just on the training data.)
• The more input vectors we have, the more experience the model gets, the more accuracy, the more training. This is mainly a good thing.
• In the graphs: more training gives a more complicated, more accurate function, but this is over-training: later on it becomes inaccurate, and when we present testing data the accuracy is low.
- Overtrained: high accuracy on the training set, low accuracy on the test set.
- Well trained: accuracy is about the same on both training and test sets.
- Undertraining must also be avoided; we need to reach an equilibrium.
- Overtraining: the software learns the details of the training data and its noise.
- Like a student who memorizes the examples with their answers, but when given new questions cannot solve them.
• There is a risk in overtraining just as there is in undertraining.
• If we train for too long (a large number of epochs/iterations) we will overfit the data, which means we have learned about the noise and inaccuracies as well as the actual function.
• Validation:
• We need to stop learning before the algorithm overfits.
• We need a new set of data to detect overfitting.
• We cannot use the training data,
• because it will not detect it.
• We cannot use the test data,
• because we are keeping it for testing only.
• We need a third set called the validation set.
• This is known as cross-validation in statistics.
• It evaluates model performance during the development phase.
• This helps in adjusting parameters, such as learning rates, to improve model accuracy and prevent overfitting (see the validation sketch below).
22
• Think of cross-validation like testing a student's knowledge by giving them multiple different practice tests (validation data) before the final exam (test data), instead of relying on a single one.
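A minimal sketch of using a held-out validation set to detect overfitting (the model, data, and stopping rule are illustrative, not the lecture's code): as model complexity grows, the training error keeps falling while the validation error eventually rises, which is the signal to stop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of an underlying function (illustrative data).
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Split into training and validation sets (test data would be kept aside).
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

best_degree, best_val_err = None, np.inf
for degree in range(1, 13):                      # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # Training error keeps shrinking; validation error eventually rises (overfitting).
    if val_err < best_val_err:
        best_degree, best_val_err = degree, val_err

print("degree chosen by the validation set:", best_degree)
```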
Machine Learning Issues (Training, Testing and Validation)
• Given a data set, a typical split of the data is shown below. Based on experience, we distribute the examples randomly, so that each subset contains the different types of data (a small sketch of such a split follows this slide).

  Type         Large Data Set   Small Data Set
  Training     50%              60%
  Testing      25%              20%
  Validation   25%              20%

23
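A small sketch of such a random 50/25/25 split using NumPy (the number of examples is illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of examples (illustrative)
indices = rng.permutation(n)               # shuffle so each subset sees all data types

# 50% training, 25% validation, 25% testing (large-data-set proportions).
n_train, n_val = int(0.50 * n), int(0.25 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))   # 500 250 250
```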
Confusion Matrix
• The confusion matrix is a square matrix that contains all possible classes in both the horizontal and vertical directions.
• Left-hand side: targets (actual classes).
• Top side: predicted outputs.

• Element (i, j) of the matrix: how many patterns were placed in class Ci in the targets, but placed in class Cj by the algorithm (the prediction).
• The diagonal elements are the correct answers; the rest are misclassifications.
• It is used to assess the performance of a classification model.
- In classification, evaluation is based on discrete categories, using metrics like accuracy, precision, recall, and the F1 score derived from the confusion matrix, while in regression, where the output is continuous, we use error-based metrics like the Mean Squared Error (MSE).

- Here (in the figure) we have 6 samples that actually belong to C1, but the model predicted 5 of them as C1 and 1 as C2.

- We are summarizing the predicted results based on the actual classes and the predicted classes we have.
- All elements that do not lie on the diagonal are errors.
- Often the matrix is presented the other way around (rows and columns swapped). A small sketch of building such a matrix follows this slide.
24
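A minimal sketch of computing a confusion matrix with NumPy; the target and prediction vectors are made up to mirror the "6 samples in C1, 5 predicted as C1, 1 as C2" example.

```python
import numpy as np

def confusion_matrix(targets, predictions, n_classes):
    """Rows = actual classes (targets), columns = predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(targets, predictions):
        cm[t, p] += 1
    return cm

# Illustrative labels for 3 classes (0, 1, 2): six samples truly belong to class 0,
# of which five are predicted as class 0 and one as class 1.
targets     = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

print(confusion_matrix(targets, predictions, n_classes=3))
```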
Accuracy Metrics
• We can analyze the results further by labelling each prediction:
• True Positive (TP): an observation correctly placed in Class 1.
• False Positive (FP): an observation incorrectly placed in Class 1.
• True Negative (TN): an observation correctly placed in Class 2 (or C3).
• False Negative (FN): an observation incorrectly placed in Class 2 or C3 (it actually belongs to C1).

• Accuracy is defined as the sum of true positives and true negatives divided by the total:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

• It measures the proportion of correct predictions, but it is not reliable on unbalanced sets.
25
Accuracy Metrics
• Sensitivity (true positive rate): the number of correctly classified positive examples out of those that are actually positive.

  Sensitivity = TP / (TP + FN)

• False negatives are examples incorrectly identified as negative when in reality they are positive.
• Specificity (true negative rate): of all the actual negative cases, how many did we correctly predict as negative. It is the counterpart of sensitivity.

  Specificity = TN / (TN + FP)

• Precision: the percentage of true positives over all examples predicted as positive. High precision means fewer false positives (FP), so the model is confident when it predicts "positive."

  Precision = TP / (TP + FP)

• F1 Score: summarizes model performance; it combines precision and sensitivity (recall), and is especially useful when the data is imbalanced.

  F1 = 2 * (Precision * Sensitivity) / (Precision + Sensitivity)

• F1 and accuracy are the most commonly used.

• In general, the importance of each metric depends on the problem, how much tolerance we have, and what type of error matters most. 26
- For each class we need to compute these 4 quantities.
- Here Ck is treated as the positive class.
- In the Ck column, the diagonal entry is the TP and the rest are FP.
- In the Ck row, everything is FN except the diagonal entry, which is the TP.
- Everything else is TN.
- This holds if the matrix is laid out as above, but in the example below it is transposed.
- Here 'apple' is the positive class Ck. A sketch of extracting these per-class counts follows.
27
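A minimal sketch of extracting TP, FP, FN, TN for one class k from a confusion matrix laid out with rows as targets and columns as predictions (the matrix values continue the earlier illustrative sketch).

```python
import numpy as np

# Confusion matrix with rows = actual classes and columns = predicted classes
# (illustrative values, matching the earlier sketch).
cm = np.array([[5, 1, 0],
               [0, 2, 1],
               [0, 0, 3]])

def per_class_counts(cm, k):
    tp = cm[k, k]                              # diagonal entry of class k
    fp = cm[:, k].sum() - tp                   # rest of column k: predicted k, actually other
    fn = cm[k, :].sum() - tp                   # rest of row k: actually k, predicted other
    tn = cm.sum() - tp - fp - fn               # everything else
    return tp, fp, fn, tn

tp, fp, fn, tn = per_class_counts(cm, k=0)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)                   # recall / true positive rate
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(tp, fp, fn, tn, precision, sensitivity, f1)
```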
28
- Ck is orange:

29
The Receiver Operating Characteristic (ROC) Curve
• A plot of the percentage of true positives (y-axis) vs. false positives (x-axis).
• It is used to evaluate a particular classifier or to compare different classifiers.
• A perfect classifier would sit at the point:
• (0,1): 100% true positives, 0% false positives.
• An anti-classifier would sit at the point:
• (1,0): 100% false positives, 0% true positives.
• A classifier working by chance will lie on the diagonal line: no training is needed for that, since a random guess could produce it (not acceptable).
• In order to compare classifiers (algorithms), or choices of parameters for some classifier (to fine-tune the parameters), we just compute the point that is furthest from the chance line along the diagonal. Whichever point (or curve) is furthest, we commit to those parameters.
• Or we can compare different classifiers/algorithms.
• We can also compute the Area Under the Curve (AUC) instead of just a point (it ranges from 0 to 1, where 1 corresponds to a perfect classifier). 30
Unbalanced Datasets
• Balanced data: a data set that contains the same proportion of every class.
• For accuracy we have implicitly assumed the same number of positive and negative examples in the dataset.
• This is known as a balanced dataset.
• For an unbalanced dataset, accuracy is not a good measure: if 95% of the dataset is negative and 5% is positive, a model that always predicts "negative" will have high accuracy, but it will fail to identify any positives.
• In terms of the classes in the confusion matrix: if most of the samples belong to a certain class, the software ends up trained on that class more than the others (an unbalanced data set).
• However, the balance assumption is often not true in practice:
• Balanced accuracy: we can compute the balanced accuracy as the sum of the sensitivity and specificity divided by 2.
• However, a more accurate measure is Matthews' Correlation Coefficient (MCC), given by:

  MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

• If any of the bracketed sums is 0, then the denominator is set to 1, and this still provides a balanced measure.
• +1 indicates perfect classification (all correct predictions), 0 indicates random classification (no better than chance), and -1 indicates a perfect inverse classification (the model always predicts the opposite class). 31
• It takes into account all four quadrants of the confusion matrix (see the sketch below).
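A small sketch of both measures on an imbalanced example (the counts are made up): plain accuracy looks good, while balanced accuracy and MCC reveal the problem.

```python
import math

# Illustrative counts for a very unbalanced binary problem:
# 95 negatives (all predicted negative), 5 positives (all missed).
tp, fn = 0, 5
tn, fp = 95, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)                  # 0.95, looks great
sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
specificity = tn / (tn + fp) if (tn + fp) else 0.0
balanced_accuracy = (sensitivity + specificity) / 2          # 0.5, no better than chance

# MCC with the rule: if any bracketed sum is 0, the denominator is set to 1.
factors = [(tp + fp), (tp + fn), (tn + fp), (tn + fn)]
denom = 1.0 if any(f == 0 for f in factors) else math.sqrt(math.prod(factors))
mcc = (tp * tn - fp * fn) / denom                            # 0.0, random-level performance

print(accuracy, balanced_accuracy, mcc)
```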
Turning Data into Probabilities
• Consider the plot below, showing a histogram of the values of some feature x for two classes, C1 and C2, against their probability.

  (Figure: two overlapping histograms of the feature values, one for class C1 and one for class C2.)

• Members of C2 tend to have larger values of the feature x than members of C1.
• There is an overlap between the two classes. This makes classification less certain in that region.
• At the edges it is fairly easy to distinguish the classes.
• In the middle it is unclear.
• This shows how we can estimate the probability that a data point belongs to a class based on its feature values.
• Different algorithms are built on this kind of probability measurement.
• Example: the Naïve Bayes algorithm is a probabilistic classifier based on Bayes' Theorem and the naive assumption that all features are independent of each other, given the class. It is efficient.
• About the graph in this example:
• The input data x is composed of n features; if we draw the histogram of the values of a feature of x with respect to the different classes, we get these two shapes.
• For a given feature, if its value is big, there is a good chance that it belongs to class C2,
• while the probability of C1 is close to 0.
• For some values of x the two histograms overlap; there x is split between the possibilities of belonging to C1 or C2.
• These probabilistic values help us define a decision range, chosen so that the error/risk is minimal (a small sketch follows this slide). 32
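A minimal sketch of turning feature values into class probabilities using histogram bins (the data is synthetic and only mirrors the idea of the figure, not the lecture's actual numbers).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic feature values: class C1 tends to have smaller x, C2 larger x.
x_c1 = rng.normal(loc=2.0, scale=1.0, size=500)
x_c2 = rng.normal(loc=5.0, scale=1.0, size=500)

bins = np.linspace(-1, 9, 21)                     # quantize x into histogram bins
h1, _ = np.histogram(x_c1, bins=bins)
h2, _ = np.histogram(x_c2, bins=bins)

# P(C1 | x in bin j): fraction of the examples in that bin that belong to C1.
totals = h1 + h2
p_c1_given_bin = np.divide(h1, totals, out=np.full_like(h1, 0.5, dtype=float),
                           where=totals > 0)

x_new = 3.4                                        # a new measurement
j = np.digitize(x_new, bins) - 1                   # which bin it falls into
print("P(C1 | x = %.1f) ~ %.2f" % (x_new, p_c1_given_bin[j]))
```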

Turning Data into Probabilities
• Consider the letters 'a' and 'b'.

• They are very similar in shape.

• However, an English text contains many more occurrences of 'a' than of 'b', which is an example of an unbalanced dataset.
• C1 and C2 are the classes for the letters 'a' and 'b'.
• Assume we know how often each letter occurs in English text; this gives us the prior probabilities P(C1) and P(C2).

- We have 20 features (pixels).
- P(C1) and P(C2) are the prior knowledge.
- The posterior probability is what interests us.
- We use Bayes' rule to see which probability is highest and take the decision on that basis (minimizing the risk).
- Convert the data into probabilistic measurements and then perform the evaluation. 33
Turning Data into Probabilities
• When making a classification, the value of P(C1) will definitely help, in addition to the value of the measurement x. P(C1) is the frequency of class C1.

• Another helpful piece of information is the conditional probability P(C1 | x).
• Conditional probability: how likely class C1 is, given that the measurement is x.
• In the figure we can notice that P(C2 | x) is very small when x is small.
• How do we calculate P(C1 | x), since it cannot be read directly from the histogram?
• Quantize the measurement x,
• i.e. put it into one of a discrete set of values X,
• like the bins in a histogram (this is the plot in the figure).
• These bins can then represent different categories of the feature x.
• For example, if x is age, you might have bins like [0-20], [21-40], [41-60], etc.
34
Turning Data into Probabilities
• If we have lots of examples of the two classes, and the histogram bins they fall into, we can then compute the joint probability.
• Joint probability P(Ci, Xj): how often a measurement of class Ci fell into histogram bin Xj.
• We can calculate it by looking at histogram bin Xj, counting the number of examples of class Ci in it, and dividing by the total number of examples (of any class) in the dataset.
• There is also a different type of conditional probability, P(Xj | Ci):
• how often (in the training set) there is a measurement in bin Xj, given that the example is a member of class Ci.
• This can be computed by counting the number of examples of class Ci in histogram bin Xj and dividing by the number of examples of that class (in any bin).
• We still need a more principled way of combining these values.
35
Turning Data into Probabilities
• Bayes' Rule

  P(Ci | x) = P(x | Ci) P(Ci) / P(x)

• Bayes' Rule is one of the most important rules in machine learning.

• It relates the posterior probability P(Ci | x) to the prior probability P(Ci) and the class-conditional probability P(x | Ci).
• Posterior probability P(Ci | x): what we actually want.
• Prior probability P(Ci): how often each class appears in the training set.
• Class-conditional probability P(x | Ci): the histogram of the feature values of the training set for that class.
• Minimizing risk:
• In the medical field, for example, misclassification can be dangerous.
• Loss matrix: specifies the risk involved in classifying an example as class Ci when it really belongs to class Cj.
• It is similar to the confusion matrix, but the diagonal is 0 (there is no risk in correct classifications).
• An entry of 5 means that that particular misclassification carries a risk of 5 (a small risk-weighting sketch follows this slide).
36
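A minimal sketch of applying Bayes' rule and then weighting the decision with a loss matrix; the priors, class-conditionals, and loss values are all made-up numbers for illustration.

```python
import numpy as np

# Illustrative quantities for a two-class problem and one observed measurement x.
prior = np.array([0.75, 0.25])            # P(C1), P(C2): how often each class appears
likelihood = np.array([0.10, 0.40])       # P(x | C1), P(x | C2): from the histograms

evidence = np.sum(likelihood * prior)     # P(x)
posterior = likelihood * prior / evidence # Bayes' rule: P(Ci | x)
print("posterior:", posterior)

# Loss matrix L[i, j]: risk of deciding class i when the true class is j.
# The diagonal is 0 (no risk for a correct classification).
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])

expected_risk = loss @ posterior          # expected risk of each possible decision
decision = int(np.argmin(expected_risk))  # pick the class with the minimal risk
print("expected risk per decision:", expected_risk, "-> choose class", decision + 1)
```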
Basic Statistics
• A random experiment is one whose outcome is not predictable with certainty in advance. It results in a sample space.
• Sample space: the set of all possible outcomes.
• The sample space is discrete if it consists of a finite set of outcomes; otherwise it is continuous.
• Sample space example: whenever we roll a die we know the outcomes are 1 to 6, but we do not know the exact outcome in advance.
• If the sample space is discrete, the random variable is discrete, and vice versa.
• Random Variables:
• A random variable (RV) is a method/function that maps the sample space to the real numbers, i.e. it maps non-numerical outcomes to real values; we use these to obtain different parameters (discussed later).
• It assigns a number to each outcome in the sample space of a random experiment.
• The probability distribution of a random variable X, for any real number a, is:

  F(a) = P(X <= a), with 0 <= F(a) <= 1

• F(a) (the cumulative distribution of X) adds up the probabilities of all occurrences of X less than or equal to a.
• If X is a discrete random variable, then

  F(a) = sum of p(x) over all x <= a

• where p(x) is the probability mass function, defined as p(x) = P(X = x).
• The probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value.
• For example, F(3) when rolling a die is P(1) + P(2) + P(3).
• If X is a continuous random variable, then

  F(a) = integral of f(x) dx from -infinity to a

• where f(x) is the probability density function,
• like the bell-shaped pdf of a normal distribution (a short sketch of the die example follows this slide). 37
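A tiny sketch of the die example: the probability mass function of a fair six-sided die and the cumulative value F(3).

```python
# Probability mass function of a fair six-sided die and the cumulative F(3).
pmf = {face: 1.0 / 6.0 for face in range(1, 7)}        # p(x) = P(X = x)

F_3 = sum(p for face, p in pmf.items() if face <= 3)   # F(3) = P(1) + P(2) + P(3)
print(F_3)                                              # 0.5
```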
Basic Statistics
• Expectation: the expected value or mean of a random variable X, denoted by E[X], is the average value of X in a large number of experiments:

  E[X] = sum over x of x * P(X = x)

• It is a weighted average, where each value is weighted by the probability that X takes that value.
• For a data sample, it can be computed as the sum of the values divided by the number of data points.
• Expected value: the mean/average value of the RV, i.e. the sum of all values of x, weighted by the probability function of x.
• The variance of a set of numbers is a measure of how spread out the values are.
• Variance: the degree of spread, i.e. how far the values lie from the average; if they are close to it, the variance is small. It is the average (squared) distance. The xi are the set of values of x.
• It is the mean of the squared distances between each element in the set and the expected value (mean) of the set:

  var(X) = E[(X - E[X])^2] = (1/N) * sum over i of (x_i - E[X])^2

• The square root of the variance is known as the standard deviation. 38
Basic Statistics
• The variance looks at the variation of one variable about its mean.
• The covariance generalizes the variance to two variables; if yi = xi, the covariance reduces to the variance.
• We can generalize this to look at how 2 variables vary together, which is known as the covariance.
• Covariance is the measure of how dependent the two variables are, in the statistical sense:

  cov(X, Y) = (1/N) * sum over i of (x_i - mu_x)(y_i - mu_y), where mu_y is the mean of Y

• If the two variables are independent, then the covariance is 0 (they are uncorrelated).
• If they both increase and decrease at the same time, then the covariance is positive.
• If one goes up while the other goes down, then the covariance is negative.
39
Basic Statistics
• The covariance can be used to look at the correlation between all pairs of variables within a data set.
• We compute the covariance of each pair to get the covariance matrix.
• First row: x1 with itself, then x1 with x2, with x3, up to xn. First row, second column: cov(x1, x2).
• Second row, first column: cov(x2, x1).
• Then x2 with itself, and so on up to xn.

• The matrix is square.
• The elements on the leading diagonal are the variances.
• It is symmetric, since cov(xi, xj) = cov(xj, xi).
• Strictly, each entry should be divided by N; in the expression below the division by N is not written explicitly.
• The covariance matrix can be written as follows:

  Sigma = E[(X - E[X])(X - E[X])^T]

• where E[X] is the mean of the vector X.
• Take the data X, subtract the mean from each data point (X - E[X]), multiply by its transpose, and take the expectation (average).

• Thus the matrix is symmetric, i.e. its transpose is the same matrix.
• It says how the data varies along each data dimension (a small numerical sketch follows this slide). 40
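A small sketch of computing a covariance matrix directly from the definition, checked against NumPy (the data is illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))            # 200 data points with 2 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]  # make the features correlated

mean = X.mean(axis=0)                    # E[X]
centered = X - mean                      # subtract the mean from each data point
cov = centered.T @ centered / len(X)     # E[(X - E[X])(X - E[X])^T]

print(cov)                               # diagonal: variances; off-diagonal: covariances
print(np.cov(X, rowvar=False, bias=True))  # NumPy's version (same result)
```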
Basic Statistics
• Example: given the two data sets below, we want to check whether a given test point is part of each data set.

  (Figure: two scatter plots, A and B, with the same test point at the same distance from the centre of each data set.)

• If we check the distances, you might answer that the point is part of A but not part of B, even though in both graphs it has the same distance to the centre of the data (which is just the mean).
• The reason for this answer is that you looked not only at the mean, but also at the location of the point with respect to the spread of the data.
• Based on the variance we felt that the point belongs to A and not to B; the variance is part of our decision-making process. 41
Basic Statistics
• From the previous example we can notice that if the data is tightly controlled (small variance), then the test point has to be close to the mean.
• If the data is very spread out, then the distance to the mean matters much less.
• A measure of distance that takes this into account is called the Mahalanobis distance:

  D(x) = sqrt((x - mu)^T * Sigma^(-1) * (x - mu))

• where
• x is the data arranged as a column vector (the RV vector),
• mu is the column vector representing the mean,
• Sigma^(-1) is the inverse of the covariance matrix.
• If Sigma = I (the identity matrix), then the Mahalanobis distance reduces to the Euclidean distance (a small sketch follows this slide).

42
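A minimal sketch of the Mahalanobis distance, reusing the covariance computation from above (the data is illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])   # widely spread in x, tight in y

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(point, mu, cov_inv):
    d = point - mu
    return np.sqrt(d @ cov_inv @ d)      # sqrt((x - mu)^T Sigma^{-1} (x - mu))

p = np.array([2.0, 2.0])
print("Euclidean distance:  ", np.linalg.norm(p - mu))
print("Mahalanobis distance:", mahalanobis(p, mu, cov_inv))
# The point is unusual mainly along the tightly-controlled y direction,
# which the Mahalanobis distance reflects but the Euclidean distance does not.
```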
The Bias-Variance Tradeoff
• Whenever we train a machine learning algorithm we are making some choices about the model to use, and then fitting the parameters of that model.
• The more degrees of freedom, the more complicated the model is.
• More complicated models suffer from overfitting and require more training data.
• The more complex the model, the higher the accuracy on the training data; in that case the model follows the variations of the data very closely, which leads to overfitting.
• This is the bias-variance dilemma.
• A model can be bad for two reasons:
• It is not accurate and does not match the data well; this is called bias (high bias means underfitting, the model is too simple).
• It is not very precise and there is a lot of variation in the results; this is called variance (it is not consistent: high variance).
• More complex classifiers improve the bias at the cost of the variance (which increases).
• Making the model simpler, by reducing the variance, increases the bias. 44
The Bias-Variance Tradeoff
• Example: fitting a curve to points.
• A straight line cannot pass precisely through all the data points:
• no variance => high bias (the line cannot follow the true relation).
• A spline function can fit the data to arbitrary accuracy, but the variance will increase.

  (Figure: the error is high on the test data, not the training data. Left: high bias, the accuracy stays roughly the same. Right: high accuracy on the training data, high variance.)

• A small polynomial-fitting sketch follows this slide. 45
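A minimal sketch of this tradeoff using polynomial fits (synthetic data, not from the lecture): a degree-1 line underfits (high bias), while a very high-degree polynomial fits the training points almost exactly but typically generalizes badly (high variance).

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.15, size=x_train.shape)
x_test = np.linspace(0, 1, 50)
y_test = np.sin(2 * np.pi * x_test)                     # the true underlying function

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: both errors high (bias); degree 9: tiny training error but typically
    # a larger test error (variance); degree 3: a reasonable compromise.
    print(f"degree {degree}: train error {train_err:.4f}, test error {test_err:.4f}")
```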
