Lecture 1 - Introduction To Machine Learning - HO - Ch0
(Extra notes for Lecture 1 in red)
Handouts – Introduction to Machine Learning
Source: Machine Learning: An Algorithmic Perspective, Second Edition, by Stephen Marsland
Prepared by: Dr. Zaher Merhi
Introduction to Machine Learning
• Consider an online retail store like Amazon with data about the activity of its clients: purchases and preferences.
• Based on these data, we would like to suggest items that a client might be interested in, much like movie suggestions on Netflix.
• The problem we have is one of prediction:
• Given the data we have, predict what the next person will buy.
• The reason this works is that people who are similar act similarly (see the sketch after this list).
• This is an example of supervised learning, i.e., learning with a teacher.
• On the other hand, storing all this activity takes a lot of space.
• Data stored in large quantities is a well-known problem.
• The challenge is to do something useful with it.
• The size and complexity of the data mean humans are unable to extract useful information from it by hand.
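A minimal sketch of the "similar people act similarly" idea, using a hypothetical purchase matrix and nearest-neighbor matching (all data here is an illustrative assumption, not from the source):

```python
import numpy as np

# Hypothetical purchase matrix (illustrative): rows = past customers,
# columns = products, 1 = bought, 0 = not bought.
purchases = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

def recommend(new_customer, purchases):
    """Suggest products bought by the most similar past customer."""
    # Similarity = number of products both customers bought (dot product).
    similarity = purchases @ new_customer
    nearest = purchases[np.argmax(similarity)]
    # Recommend what the neighbour bought but this customer has not.
    return np.where((nearest == 1) & (new_customer == 0))[0]

print(recommend(np.array([1, 1, 0, 0]), purchases))  # -> [3]
```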
Vid Notes 1
- Amazon focuses on the clients who bought particular products.
- On the website, a set of products that might interest the client is displayed,
- to encourage them to buy.
- When we visit the Amazon site, based on previous visits to the website, Amazon collects data about its customers: it looks at the browser history via cookies, finds the items of interest for a given client, and feeds this data to an ML program, which then suggests to the visitor a set of products that might interest them.
- All of this is a prediction process (it predicts the desires of the customer, via supervised learning).
- Supervised learning:
- based on previous experience,
- like a website with a database
- containing the inputs and the proper answer for each situation.
Vid Notes 2
- ML vs. classical programming
- Example: a piece of software maps input x to output y, y = f(x).
- In ML we do not program the relation between x and y; we give the ML algorithm x1...xn and y1...yn and it finds the relation between them (see the sketch below).
- Traditional: we supply the formula.
- ML: we supply data so the algorithm can arrive at the relation.
- ML is based on learning from data; it also offers a remedy for storing huge amounts of data on the servers and databases distributed around the world (use only the useful data, compress the data, etc.).
- Data is tracked through site visits and social media and used for analysis.
- Large amounts of data require large storage capacity.
- ML depends on data, but it uses only the useful data and reduces it: we compress it in a way that extracts only the useful part.
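A minimal sketch of the contrast: instead of writing y = f(x) ourselves, we hand the program example pairs (x_i, y_i) and let it estimate the relation (here with a least-squares line fit; the data values are illustrative assumptions):

```python
import numpy as np

# Traditional programming: we write the formula y = f(x) ourselves.
def f(x):
    return 2.0 * x + 1.0

# Machine learning: we only provide example pairs (x_i, y_i)...
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # noisy samples of an unknown relation

# ...and let the algorithm find the relation (least-squares line fit).
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)   # close to 2 and 1, learned purely from the data
```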
Introduction to Machine Learning
• Classifying data can be very easy with 2-D data and a small amount of data. We humans struggle to process large data sets, so we try to visualize them; beyond 3 dimensions we cannot do that, and that prompts ML.
• In higher dimensions (above 3) no visualization is possible.
• Projection might help (reducing a 10-D dataset to 2-D or 3-D), but it will mask (lose) some of the information.
• It can hide the information that distinguishes classes which differ in the discarded dimensions.
• Imagine a 3-D cloud of points where two classes are separable. If you project it onto a 2-D plane, some points from different classes might overlap. A projection sketch follows below.
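A minimal sketch of projecting high-dimensional data down to 2-D with a PCA-style projection (via numpy's SVD); the random data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 points in 10 dimensions

# PCA-style projection: center the data, keep the top-2 principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T                 # now plottable in 2-D

print(X2d.shape)                    # (100, 2): the other 8 directions are lost
```

Distances along the discarded directions vanish in the plot, which is exactly the information loss described above.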
Vid Notes 3
- Previous examples: regression / prediction.
- Regression is a type of supervised learning in machine learning used for predicting continuous values. Unlike classification (which assigns labels to categories), regression estimates a numerical value based on input features. (A supervised technique.)
- Now, in the table example: x1 and x2 are coordinates.
- A point is the combination of x1 and x2.
- Each point belongs to a certain class; there are only 3 in this example.
- The ML program/software should be able, if we give it data not in the table, i.e., a new point with its own x1 and x2, to determine its class. This is a classification problem.
- Table data: training data, available for the software to learn from.
- The software sees each point and its class:
- Class 1: plus sign
- Class 2: thunderbolt
- Class 3: circle
- The software tries to determine, from (x1, x2) and the class, a decision boundary / threshold that distinguishes the classes (a minimal classifier sketch follows).
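A minimal sketch of this setup: training points (x1, x2) with known classes, and a simple nearest-centroid rule for labeling a new point (the coordinates and class layout are illustrative assumptions):

```python
import numpy as np

# Training data (illustrative): each row is a point (x1, x2), with its class.
points = np.array([[1.0, 1.0], [1.2, 0.8],    # class 1 (+)
                   [5.0, 5.0], [5.2, 4.9],    # class 2 (thunderbolt)
                   [1.0, 5.0], [0.8, 5.2]])   # class 3 (circle)
labels = np.array([1, 1, 2, 2, 3, 3])

def classify(p):
    """Assign a new point to the class whose centroid (mean point) is nearest."""
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - p, axis=1))]

print(classify(np.array([1.1, 0.9])))   # -> 1: falls on class 1's side
```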
Vid Notes 4
- Number of inputs: P1 to Pn; each input has 2 values, so the dimension of the input is 2.
- Each of x1 and x2 is a feature of the data.
Applications of Machine Learning
• Many applications exist for machine learning:
• Spam filters: an email program works as an ML classifier, classifying legitimate email versus spam email.
• Voice recognition: recognizes voices based on their specific characteristics and categorizes each sound accordingly.
• Computer games.
• Anti-skid braking systems: a safety system in vehicles that prevents the wheels from locking up during braking, improving stability and control. This is especially useful on slippery roads or during emergency stops.
Vid notes 4
• The program learns on its own based on the data we give it.
• Without data there is no ML.
• If there is data, we can build software based on the techniques we have.
• How? Say we give the software inputs, where each input must come with a given response t.
• Initially we feed in input x1 and the program outputs y1, but we feed back that the output should be t1. Based on the error between t1 and y1, the program changes the parameters it has, to reach minimal error and produce an output as close as possible to t1 (the target value). This is the feedback (adaptation).
• Adaptation depends on different metrics; the most prominent is the error, which is t − y: target value − output value (technically, as we will see later, it is actually based on the square of the difference).
• Learning is based on experience, i.e., on the data we give it: each time we provide an input, it computes the output; we also provide the target output, and on that basis it adapts.
• In the first stage it learns from the data and adapts its parameters, to reach minimal error and the best solution (less error, more accurate software).
• Remembering: we give x1 and, after adaptation, it gives y1; it must give the same output later for the same input.
• Generalization: trained on n inputs; if we give it a new input that is not in the training set, the ML program, based on previous experience, should behave consistently: just as x1 gives y1, a new xm must give a corresponding ym. (A minimal sketch of this adaptation loop follows.)
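A minimal sketch of the adaptation loop just described: feed input x, compute output y, compare with target t, and nudge the parameter to shrink the error (a least-mean-squares style update; the learning rate and data are illustrative assumptions):

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])   # inputs
t = np.array([1.0, 2.0, 3.0, 4.0])   # given target responses (here t = 2x)

w, eta = 0.0, 0.1                    # one parameter, small learning rate
for epoch in range(50):
    for xi, ti in zip(x, t):
        y = w * xi                   # current program output
        e = ti - y                   # error between target and output
        w += eta * e * xi            # adapt the parameter to shrink the error

print(w)                             # converges near 2.0: the learned relation
```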
Machine Learning
• Machine learning is about making computers modify or adapt their actions so that these actions become more accurate.
• We give the software a set of available data together with a certain label (solution); the software computes an output and, over time, tries to produce an output close to the solution / desired output we gave it.
• Accuracy is measured by how well the chosen actions reflect the correct ones, based on the adaptation of the software to new situations: real output − desired output.
• Example: playing chess with a computer.
• Initially you begin by beating the machine.
• At some point the machine starts to learn your techniques, and starts beating you.
• The next time someone else plays with the machine, it will not start from scratch.
• Computational complexity is of interest.
• It can be broken into two parts:
• Complexity of training:
• Does not happen often. Training happens once (or infrequently), so it is less critical for real-time systems, but it needs time.
• Complexity of applying the trained algorithm:
• Time-critical decisions. Fast predictions are important; applying the algorithm should be fast, since most ML programs are used in real-time applications.
Types of Machine Learning
• Picking the variables that you want to use, also called features, is very important for finding a solution.
• Choosing how to process the data can be equally important.
• Supervised learning: a training set of examples with the correct responses (targets) is provided.
• The algorithm generalizes to respond correctly to all possible inputs.
• These example pairs are also called exemplars (the model is guided by exemplars).
• Training data set: inputs together with their solutions; based on these pairs, the system arrives at a function that performs the right mapping between input and output. We are supervising the machine: learning with a teacher.
• Unsupervised learning: correct responses are not provided; instead, the algorithm tries to identify similarities between the inputs so that they can be categorized. The machine must perform a statistical study of the inputs and find common factors among them, so it can sort the inputs into different categories (density estimation).
• The statistical approach is known as density estimation.
• In density estimation we model the probability distribution of the data. It helps identify how data points are distributed in a space: it determines where data points are concentrated and how they spread. It answers: "Given some data, what is the likelihood of a new data point belonging to the same distribution?"
• By estimating density, we can detect clusters, outliers, and anomalies in datasets (a minimal sketch follows).
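A minimal sketch of density estimation with a histogram: model where the data is concentrated, then score how likely a new point is under that model (the data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=1000)   # unlabeled inputs

# Histogram density estimate: counts per bin, normalised to a density.
density, edges = np.histogram(data, bins=20, density=True)

def likelihood(x):
    """Estimated density at x (0 outside the observed range)."""
    i = np.searchsorted(edges, x) - 1
    return density[i] if 0 <= i < len(density) else 0.0

print(likelihood(0.0), likelihood(5.0))   # high near the cluster, ~0 far away
```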
Vid notes 5
• Data: a set of input vectors; each vector/input is made up of a vector of values/coordinates named features.
• Depending on the problem, it is not necessary to use the entire data set we have; we extract the data best suited to the problem at hand.
• From the characteristics of the given inputs, we take some, not all, of the components of the vectors, because some components are more useful than others for the desired situation/problem.
• Previous examples were mostly supervised learning.
• When supervised and when unsupervised? It depends on the data we have: labeled vs. unlabeled data.
Types of Machine Learning
• Reinforcement learning: in between supervised and unsupervised learning.
• The algorithm gets told if the answer is wrong, but it is not told how to correct it. (This feedback comes in the form of rewards or punishments.)
• It has to explore and try different possibilities to get it right (optimizing over time, through trial and error).
• It is called learning with a critic.
• Having a large amount of data is very important in machine learning. However, attaining large amounts is challenging:
• Sensors collecting data are subject to noise, so low-error data is hard to attain.
• Sometimes it is difficult to collect data.
• How much data is enough, without requiring excessive computation, is impossible to predict in advance.
• We can use the data raw as it is, or perform feature extraction: keep the important features, those that distinguish the different inputs.
• Often we have no data, or no labeled data, so we resort to running experiments to get data. Example: in a given network we acquire an input signal through different sensors, but it could be mixed with noise, which affects data quality.
• We may collect data from clinics and hospitals, and it may contain errors because we are collecting it from different persons.
• The remedy: we must clean the data and remove the noise, remove outlier values, check for missing values, and check whether any data is placed in the wrong field (the preprocessing step; a minimal sketch follows).
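A minimal sketch of the preprocessing step described above: fill missing values, then drop outliers (the thresholds and sensor readings are illustrative assumptions):

```python
import numpy as np

raw = np.array([2.1, 2.3, np.nan, 2.2, 9.9, 2.4, np.nan, 2.0])  # sensor readings

# 1. Missing values: replace NaNs with the mean of the observed values.
clean = np.where(np.isnan(raw), np.nanmean(raw), raw)

# 2. Outliers: drop points more than 2 standard deviations from the mean.
z = np.abs(clean - clean.mean()) / clean.std()
clean = clean[z < 2.0]

print(clean)   # the 9.9 spike is removed and the NaNs are filled
```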
The Machine Learning Process
• Feature Selection
• Identifying features useful for the problem under examination (depending on the situation): we pick the best features for our problem's scope.
• Requires prior knowledge of the problem and the data, so that we know which features are useful.
• Features should not be expensive to collect (high cost of collection) or corrupted with high noise (which can cause overfitting).
• Algorithm of Choice
• Given the data set, what is the appropriate algorithm?
• Algorithm family: unsupervised, supervised, or reinforcement learning, depending on the type of data we have (labeled or not).
• Each of these categories has subcategories (different algorithms belonging to supervised learning, and so on). We choose based on the characteristics of these algorithms, their performance, and whether they are suited to solving this type of problem.
• Evaluation and model selection (e.g., SVM, covered later)
• Parameters that need to be set manually, or need experimentation, to identify appropriate values. Adjust the parameters of the algorithm: initialize them (learning rate, performance estimate, degrees of freedom of the function in regression).
• Training
• Use computational resources to build a model in order to predict the output. This takes time (adaptation, optimization) to reach a proper model.
• Evaluation
• Before the system is deployed it needs to be tested,
• e.g., through metrics: accuracy, precision, the confusion matrix, etc.
• Evaluation: we give the system a new set of data (test data), ask it to produce the output, and, based on the output, use different metrics to evaluate it (a minimal sketch follows).
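A minimal sketch of the evaluation step: hold out test data the model never saw during training, then score its predictions with a metric such as accuracy (the data, split, and trivial threshold model are illustrative assumptions):

```python
import numpy as np

# Labeled dataset (illustrative): one feature, class 1 iff feature >= 10.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X[:, 0] >= 10).astype(int)

# Split: train on 70% of the data, evaluate on the held-out 30%.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train, test = idx[:14], idx[14:]

# "Train" a trivial threshold model on the training portion only.
threshold = X[train][y[train] == 1].min()
pred = (X[test][:, 0] >= threshold).astype(int)

print((pred == y[test]).mean())   # accuracy on data the model never saw
```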
Vid notes 6 (upcoming 2 slides)
- Example we will see with neural networks:
- The neuron
- Adaptation of a piece of software based on a neural network.
- We have input data (a vector x, made up of different features x1...xn).
- Neuron: a processing unit made up of 2 steps:
- A combiner/summation, which sums the different inputs (x1...xn) multiplied by their corresponding weights (w1...wn) of the same dimension, without a bias for now.
- Each input undergoes a certain scaling/weighting through its receptors (by analogy with the synaptic receptors of the brain).
- There are also missing pieces such as the bias, etc. (we will talk about them later).
- The combiner sums the weighted combination of the different features we have.
- The combined value gives a response v, which is passed through an activation function g that compresses/limits the result.
- Final output: y = g(v), the activation function applied to the combiner's output.
- The output y may have a different dimension from the input, but y and t have the same dimension.
- In supervised learning we compare this output with the targeted values, using a subtractor for example, to get the error e; e is used to adapt (change the different weights) so as to minimize the error.
- Activation function: limits the output to between 0 and 1, or −1 and 1, depending on the function we use. (A minimal neuron sketch follows.)
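A minimal sketch of the neuron just described: a combiner forming the weighted sum v, followed by an activation function g that limits the output (the weights, input, and target are illustrative assumptions; the bias is omitted, as in the note):

```python
import numpy as np

def g(v):
    """Sigmoid activation: compresses any v into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, -1.0, 2.0])    # input vector (features x1..xn)
w = np.array([0.8, 0.2, -0.5])    # one weight per input (same dimension)

v = w @ x                         # combiner: weighted sum of the inputs
y = g(v)                          # activation limits the response

t = 1.0                           # target value (supervised learning)
e = t - y                         # error used to adapt the weights later
print(v, y, e)
```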
Some Terminologies:
• Inputs:
• An input vector is the data given as one input to the algorithm: $\mathbf{x}$, with elements $x_i$, $i = 1, \ldots, m$, where $m$ is the input dimension.
• Weights:
• $w_{ij}$ are the weighted connections between nodes $i$ and $j$.
• For a neural network (machine learning technique), weights are analogous to
the synapses of the brain.
• They are arranged in a weight matrix W.
• Outputs:
• An output vector $\mathbf{y}$, with elements $y_j$, $j = 1, \ldots, n$, where $n$ is the output dimension.
• Targets:
• A target vector $\mathbf{t}$, with elements $t_j$, $j = 1, \ldots, n$, where $n$ is the output dimension.
• Target vectors are extra data that we need for supervised training.
• They provide correct answers for the algorithm to learn from.
Some Terminologies:
• Activation function:
• For a neural network, the activation function $g(\cdot)$ is a mathematical function that describes the threshold at which the neuron is activated or not.
• Error:
• $E$: a function that computes the inaccuracies of the network, as a function of the outputs $\mathbf{y}$ and the targets $\mathbf{t}$.
Weight Space
• Plotting data is useful; however, the dimension should be 3 or less for us to visualize it.
• Plotting weights is especially useful in neural networks.
• Weights are the parameters of a neural network that connect the neurons to the input.
• These weights control how much influence each input feature has on the output of a neuron. The weights are the learnable parameters that the model adjusts during training to minimize error.
• If we treat the weights as a set of coordinates, then we have a weight space.
• With the weight space we can assess how close the neuron and the inputs are to each other.
• If the neuron is close to the input in this sense then it should fire, and if it is not close then it shouldn't.
• Each point in this space represents a unique configuration of weights, which collectively define the model's behavior and performance.
• We can plot the inputs in the same space.
• A bias cannot be used, since it would add an extra dimension.
• We can measure this closeness by the Euclidean distance (a minimal sketch follows).
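A minimal sketch of the weight-space idea: treat a neuron's weight vector as a point and measure its Euclidean distance to each input (the weights and inputs are illustrative assumptions):

```python
import numpy as np

w = np.array([0.7, 0.3])                 # a neuron's weights as a point
inputs = np.array([[1.0, 0.0],           # inputs plotted in the same space
                   [0.0, 1.0],
                   [0.8, 0.2]])

# Euclidean distance from the neuron to each input.
dist = np.linalg.norm(inputs - w, axis=1)
print(dist)   # smallest for the third input: the neuron is closest, so it fires
```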
Confusion Matrix
• The confusion matrix: a square matrix that contains all possible classes in both the horizontal and vertical directions.
• Left-hand side: targets.
• Top side: predicted outputs.
• Elements of the matrix: the element at row $i$, column $j$ counts how many patterns belonged to class $i$ in the targets but were placed in class $j$ by the algorithm (the prediction).
• The diagonal contains the correct answers; the rest are misclassifications.
• It is used to assess the performance of a classification model.
- In classification, evaluation is based on discrete categories, using metrics like accuracy, precision, recall, and the F1 score derived from the confusion matrix; in regression, where the output is continuous, we use error-based metrics like mean squared error (MSE).
• False negatives are examples incorrectly identified as negative when in reality they are positive.
• Specificity: the true negative rate. Of all actual negative cases, how many did we correctly predict as negative? (The counterpart of sensitivity.)
• Precision: the percentage of predicted positive examples that are actually positive. High precision means fewer false positives (FP), so the model is confident when it predicts "positive." (A minimal sketch follows.)
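A minimal sketch computing these metrics from the four quadrants of a binary confusion matrix (the counts are illustrative assumptions):

```python
# Illustrative binary confusion-matrix quadrants (assumed counts).
TP, FN = 40, 10    # actual positives: correctly / incorrectly classified
FP, TN = 5, 45     # actual negatives: incorrectly / correctly classified

sensitivity = TP / (TP + FN)   # true positive rate (recall)
specificity = TN / (TN + FP)   # true negative rate
precision   = TP / (TP + FP)   # few false positives => high precision
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(sensitivity, specificity, precision, accuracy)
```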
[Figure slides: worked example; $C_k$ is shown in orange.]
The Receiver Operation Characteristic (ROC) Curve
• A plot of the percentage of true positives on the y-axis vs. the percentage of false positives on the x-axis.
• It is used to evaluate a particular classifier, or to compare different classifiers.
• A perfect classifier would sit at:
• (0, 1): 100% true positives, 0% false positives.
• An anti-classifier would sit at:
• (1, 0): 100% false positives, 0% true positives.
• A classifier working by chance will lie on the diagonal line: no training is needed to land there, since random guessing already achieves it (not acceptable).
• In order to compare classifiers (algorithms), or choices of parameters (to fine-tune parameters) for some classifier, we compute the point that is furthest from the chance line along the diagonal: the further the point (or the curve), the better, and we commit to those parameters.
• Or we can compare different classifiers / algorithms the same way.
• We can also compute the area under the curve (AUC), from 0 to 1, instead of just a point (a minimal sketch follows).
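A minimal sketch placing one classifier on the ROC plot and scoring its distance from the chance line (the counts are illustrative assumptions):

```python
import math

TP, FN, FP, TN = 80, 20, 10, 90     # illustrative test-set counts

tpr = TP / (TP + FN)                # true positive rate (y-axis)
fpr = FP / (FP + TN)                # false positive rate (x-axis)

# Perpendicular distance from the chance (diagonal) line y = x:
# larger is better; 0 means no better than random guessing.
print(tpr, fpr, (tpr - fpr) / math.sqrt(2))
```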
Unbalanced Datasets
• Balanced data: a data set that contains the same proportion of every class.
• For accuracy we have implicitly assumed the same number of positive and negative examples in the dataset.
• This is known as a balanced dataset.
• For an unbalanced dataset, accuracy is not a good measure: if 95% of the dataset is negative and 5% is positive, a model that always predicts "negative" will have high accuracy, but it will fail to identify any positives.
• Among the classes in the confusion matrix, if most samples belong to a certain class, the software ends up trained on that class more than the others (an unbalanced data set).
• Plain accuracy is therefore unreliable here:
• Balanced accuracy: we can compute the balanced accuracy as the sum of the sensitivity and the specificity divided by 2.
• However, a more accurate measure is Matthews' correlation coefficient, given by:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

• If any of the values in brackets is 0, then the denominator is set to 1; this still provides a balanced measure.
• 1 indicates perfect classification (all correct predictions); 0 indicates random classification (no better than chance); −1 indicates perfect inverse classification (the model always predicts the opposite class).
• It takes into account all 4 quadrants of the confusion matrix (a minimal sketch follows).
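A minimal sketch of Matthews' correlation coefficient from the same four quadrants, including the rule that a zero factor sets the denominator to 1 (the counts are illustrative assumptions):

```python
import math

def mcc(TP, TN, FP, FN):
    """Matthews correlation coefficient in [-1, 1]."""
    num = TP * TN - FP * FN
    factors = [TP + FP, TP + FN, TN + FP, TN + FN]
    # If any bracketed factor is 0, the denominator is set to 1.
    den = math.sqrt(math.prod(factors)) if all(factors) else 1.0
    return num / den

# A model that always predicts "positive" on 95%-positive data:
print(mcc(TP=95, TN=0, FP=5, FN=0))   # 0.0: no better than chance
```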
Turning Data into probabilities
• Consider the plot below, showing the measurements of some feature $x$ for the two classes $C_1$ and $C_2$.
[Figure: histograms of the feature values for classes $C_1$ and $C_2$.]
Turning Data into probabilities
• If we have lots of examples of the two classes, and the histogram bins they fall into, we can then compute the joint probability.
• Joint probability $P(C_i, X_j)$: how often a measurement of class $C_i$ fell into histogram bin $X_j$.
• We can calculate it by looking into histogram bin $X_j$, counting the number of examples of class $C_i$ in it, and dividing by the total number of examples (of any class) in the dataset.
• There is also a different type, the conditional probability $P(X_j \mid C_i)$:
• How often (in the training set) there is a measurement in bin $X_j$, given that the example is a member of class $C_i$.
• This can be computed by counting the number of examples of class $C_i$ in histogram bin $X_j$ and dividing by the number of examples of that class (in any bin).
• We need a more principled way of computing these values (a minimal sketch of the counting estimates follows).
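A minimal sketch of both estimates from raw counts, where n[i, j] is the number of class-$C_i$ examples landing in histogram bin $X_j$ (the counts are illustrative assumptions):

```python
import numpy as np

# n[i, j]: count of class-Ci examples that fell into histogram bin Xj
# (illustrative counts, not from the source).
n = np.array([[30, 10, 2],
              [3, 12, 25]])
N = n.sum()                                 # total examples, any class

joint = n / N                               # P(Ci, Xj): bin count over all examples
cond = n / n.sum(axis=1, keepdims=True)     # P(Xj | Ci): over that class only

print(joint[0, 1])   # P(C1, X2) = 10/82
print(cond[0, 1])    # P(X2 | C1) = 10/42
```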
Turning Data into probabilities
• Bayes' Rule:

$$P(C_i \mid X_j) = \frac{P(X_j \mid C_i)\,P(C_i)}{P(X_j)}$$

• The prior and the evidence can be estimated from counts: $P(C_i) \approx N_{C_i}/N$ and $P(X_j) \approx N_{X_j}/N$, where $N$ is the total number of examples. (A short worked example follows.)
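A short worked example, reusing the illustrative counts from the probability sketch above ($n_{12} = 10$ examples of class $C_1$ in bin $X_2$, class total $N_{C_1} = 42$, bin total $N_{X_2} = 22$, overall $N = 82$):

$$P(C_1 \mid X_2) = \frac{P(X_2 \mid C_1)\,P(C_1)}{P(X_2)} = \frac{(10/42)\,(42/82)}{22/82} = \frac{10}{22} \approx 0.45$$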
• The square root of the variance is known as the standard deviation.
Basic Statistics
• The variance looks at the variation of one variable about its mean.
• Variance generalized to two variables gives the covariance; if $y_i = x_i$, the covariance reduces to the variance.
• We can generalize this to look at how 2 variables vary together, which is known as the covariance.
• Covariance is a measure of how dependent the two variables are, in the statistical sense:

$$\mathrm{cov}(x, y) = \frac{1}{N} \sum_i (x_i - \mu_x)(y_i - \mu_y)$$

• where $\mu_x$ is the mean of $x$ and $\mu_y$ is the mean of $y$.
• If the two variables are independent, then the covariance is 0 (uncorrelated).
• If they both increase and decrease at the same time, then the covariance is positive.
• If one goes up while the other goes down, then the covariance is negative. (A minimal sketch follows.)
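A minimal sketch computing the covariance directly from this formula (dividing by N, as in the notes; note that numpy's np.cov divides by N−1 by default, hence bias=True below). The data is an illustrative assumption:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # rises with x, so covariance is positive

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # sum of products / N
print(cov_xy)                            # 2.5
print(np.cov(x, y, bias=True)[0, 1])     # same value from numpy
```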
Basic Statistics
• The covariance can be used to look at the correlation between all pairs of variables within a data set.
• We compute the covariance of each pair to get the covariance matrix:
• First row: cov of x1 with itself, then with x2, then x3, and so on up to xn. So first row, second column: cov(x1, x2).
• Second row, first column: cov(x2, x1).
• Then cov of x2 with itself, and so on up to xn.
• The matrix is square.
• The elements on the leading diagonal are the variances, since cov(xi, xi) = var(xi).
• It is symmetric, since cov(xi, xj) = cov(xj, xi).
• (Note: strictly, each entry should be divided by N; the division by N is left implicit in the entries written here.)
• The covariance matrix can be written as follows:
• Take the data $X$, subtract the mean from each data point ($X - E[X]$), multiply by the transpose, and take the expectation: $\Sigma = E\left[(X - E[X])(X - E[X])^T\right]$.
• This gives rise to the Mahalanobis distance:

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \, \Sigma^{-1} \, (\mathbf{x} - \boldsymbol{\mu})}$$

• where $\mathbf{x}$ is the data arranged as a column vector (a random-variable vector),
• $\boldsymbol{\mu}$ is the column vector representing the mean,
• and $\Sigma^{-1}$ is the inverse of the covariance matrix.
• If $\Sigma = I$ (the identity matrix), then $D_M$ reduces to the Euclidean distance. (A minimal sketch follows.)
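A minimal sketch of the Mahalanobis distance assembled from these pieces (the random data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                # data matrix, one point per row

mu = X.mean(axis=0)                          # mean vector
Sigma = np.cov(X, rowvar=False, bias=True)   # covariance matrix (divide by N)

def mahalanobis(x):
    """D(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return np.sqrt(d @ np.linalg.inv(Sigma) @ d)

print(mahalanobis(X[0]))
# For comparison, the Euclidean distance: the two coincide when Sigma = I.
print(np.linalg.norm(X[0] - mu))
```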
The Bias Variance tradeoff
• Whenever we train a machine learning algorithm, we are making choices about the model to use and fitting the parameters of that model.
• The more degrees of freedom, the more complicated the model is.
• More complicated models suffer from overfitting and require more training data.
• The more complex the model, the more accurate it is on the training data; in that case the model will follow the variation in the data very closely: overfitting.
• This is the bias-variance dilemma.
• A model can be bad for two reasons:
• It is not accurate and does not match the data well; this is called bias (high bias): underfitting, the model is too simple.
• It is not very precise and there is a lot of variation in the results; this is called variance (not consistent: high variance).
• More complex classifiers improve (reduce) the bias at the cost of the variance (which increases).
• Making the model simpler reduces the variance but increases the bias.
The Bias Variance tradeoff
• Example: fitting a curve to points.
• A straight line cannot pass precisely through all the data points:
• no variance => high bias (the line cannot capture the true relation: high bias, low variance).
• A spline function can fit the data to arbitrary accuracy, but the variance will increase. (A minimal sketch follows.)
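A minimal sketch of the tradeoff: fit the same noisy points with a straight line (high bias) and with a high-degree polynomial standing in for the spline (high variance); the data is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)  # noisy samples

line = np.polyfit(x, y, deg=1)     # too simple: high bias, low variance
wiggly = np.polyfit(x, y, deg=9)   # passes through every point: high variance

# The degree-9 fit tracks the noise exactly (overfitting); the straight line
# misses the curve's shape but barely changes if the noise is redrawn.
print(np.polyval(line, 0.5), np.polyval(wiggly, 0.5))
```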
[Figure: left panel - high bias, the fit stays roughly the same across data draws; right panel - high accuracy on the points but high variance.]