Tutorials
Question 1:
What is the difference between ML (Machine Learning) and AI (Artificial Intelligence)?
Suggested discussion: Artificial Intelligence is the broader concept of machines being able to carry out tasks
in a way that we would consider “smart”, while Machine Learning is a current application of AI based around
the idea that we should really just be able to give machines access to data and let them learn for themselves.
https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#741adc6b2742
Ref: https://www.vsinghbisen.com/technology/ai/difference-between-artificial-intelligence-and-machine-learning/
Question 2:
Which of the following is the most reasonable definition of machine learning?
(a) Machine learning is the field of allowing robots to act intelligently.
(b) Machine learning is the science of programming computers.
(c) Machine learning only learns from unlabeled data.
(d) Machine learning is the field of study that gives computers the ability to learn without being explicitly
programmed.
Ans: (d)
Question 3:
A computer program is said to learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot
of historical weather data, and have it learn to predict weather. In this setting what is T?
(a) The historical weather data.
(b) The probability of it correctly predicting a future data’s weather.
(c) The weather prediction task.
(d) None of these.
Ans: (c)
Question 4:
Suppose you are working on weather prediction and use a learning algorithm to predict tomorrow’s temperature (in
degrees Centigrade/Fahrenheit).
(i) Would you treat this as a classification or a regression problem?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) None of these.
(ii) What kind of data should you gather?
Ans: (i)(a)
(ii) Weather forecasts are made by collecting quantitative data (e.g., changes in barometric pressure, current weather
conditions, and sky condition or cloud cover) about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.
Question 5:
You want to develop learning algorithms to address each of the following two problems.
P1: You’d like the software to examine your email accounts, and decide whether each email is spam or not.
P2: You have a large quantity of green tea (e.g., 1000kg) with a record of previous sales. You want to predict how
much of it will sell over the next 6 months.
Should you treat these as classification or as regression problems?
(a) Treat both P1 and P2 as regression problems.
(b) Treat both P1 and P2 as classification problems.
(c) Treat P1 as a regression problem and P2 as a classification problem.
(d) Treat P1 as a classification problem and P2 as a regression problem.
Ans: (d)
Question 6:
Suppose you are working on stock market prediction. Typically tens of millions of shares of a company’s stock are
traded each day. You would like to predict the number of shares that will be traded tomorrow.
(i) Would you treat this as a classification or a regression problem?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) None of these.
(ii) If the data you have collected involved millions of attributes, what would you do?
Ans: (i)(a), (ii)(extract relevant features)
Question 7:
Some of the problems below are best addressed using a supervised learning algorithm, and the others with an
unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that
apply) Assume some appropriate dataset is available for your algorithm to learn from.
(a) Determine whether each audio clip extracted from a piece of music contains vocals (i.e., a human voice
singing), or whether it is a clip of only musical instruments and no vocals.
(b) Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the
treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of
how they respond to the drug, and if so what these categories are.
(c) Given a large dataset of medical records of patients suffering from heart disease, try to learn whether there
might be different clusters of such patients for which we might tailor separate treatments.
(d) Given a set of data which contains the diet and the occurrence of diabetes from a population over a 10-year
period. Predict the odds of a person developing diabetes over the next 10 years.
Ans: (a), (d)
Question 8:
Suppose you are working on a machine learning algorithm to predict if a patient is COVID-19 infected according to
the patient’s symptomatic data, such as fever, dry cough, tiredness, aches and pains, sore throat, diarrhoea,
conjunctivitis, and headache etc. What are the Task, Performance, and Experience involved according to the
definition of machine learning?
Ans: (please refer to the definition of Task, Performance, and Experience in the lecture notes)
Task: patient classification into ‘infected’ or ‘uninfected’
Performance: accuracy of classification
Experience: patient’s symptomatic data with actual diagnosis
Question 9:
We use labelled data for supervised learning, where the labels are used as the desired target of prediction for
classifiers. Which of the following datasets provide useful labelled data?
(a) To build an image object classifier to discriminate between apple and orange, we have many fruit images
labelled with the country of origin.
(b) To build a system to predict the number of COVID cases for tomorrow given the past daily record, we have
a collection of daily data for a period of 12 months.
(c) To build a classifier to automatically evaluate student essays, we have collected a set of student essays that
have not been graded by teachers.
Ans:
(a) The useful fruit images should be labelled with apple or orange. Country of origin doesn’t tell us apple or orange.
Therefore, the data is not useful.
(b) We can use n days of historical data as the input, and n+1th day’s data as the target. This dataset is useful;
(c) The useful dataset should include student essays and the grades. Student essays are the input, and the grades
are the desired target of prediction. This dataset is not useful.
Question 10:
Determine whether each of the following is “inductive” or “deductive” reasoning?
(a) The first coin I pulled from the bag is a penny. The second and the third coins from the bag are also pennies.
Therefore, all the coins in the bag are pennies.
(b) All men are mortal. Harold is a man. Therefore, Harold is mortal.
Ans: (a) inductive, (b) deductive.
Question 11:
Find a problem of your interest and formulate it as a machine learning problem. List out the input features and output
response and provide your choice regarding the types of learning (such as supervised or unsupervised learning,
classification or regression, clustering or dimensionality reduction).
scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and interactive development,
you should also install matplotlib, IPython, and the Jupyter Notebook. We recommend using the following
prepackaged Python distribution, which provides the necessary packages:
Anaconda
A Python distribution made for large-scale data processing, predictive analytics, and scientific computing.
Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook, and scikit-learn. Available
on Mac OS, Windows, and Linux, it is a very convenient solution and is the one we suggest for people without
an existing installation of the scientific Python packages. Anaconda now also includes the commercial Intel
MKL library for free. Using MKL (which is done automatically when Anaconda is installed) can give significant
speed improvements for many algorithms in scikit-learn.
Some tutorials that might be useful:
A quickstart tutorial on NumPy: https://numpy.org/devdocs/user/quickstart.html
Some community tutorials on Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
Scikit-learn tutorials: https://scikit-learn.org/stable/tutorial/index.html
EE2211 Tutorial 2 (Python coding)
Ans:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("government-expenditure-on-education.csv")
expenditureList = df['total_expenditure_on_education'].tolist()
yearList = df['year'].tolist()
plt.plot(yearList, expenditureList, label='Expenditure over the years')
plt.xlabel('Year')
plt.ylabel('Expenditure')
plt.title('Education Expenditure')
plt.show()
(Data Reading and Visualization, slightly more complicated data structure)
Question 2:
Download the CSV file from https://data.gov.sg/dataset/annual-motor-vehicle-population-by-vehicle-type.
Extract and plot the number of Omnibuses, Excursion buses and Private buses over the years as shown below.
(Hint: you might need “import pandas as pd” and “import matplotlib.pyplot as plt”.)
Ans:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("annual-motor-vehicle-population-by-vehicle-type.csv")
year = df['year'].tolist()
category = df['category'].tolist()
vehtype = df['type'].tolist()
number = df['number'].tolist()
val1 = df.loc[df['type']=='Omnibuses'].index
val2 = df.loc[df['type']=='Excursion buses'].index
val3 = df.loc[df['type']=='Private buses'].index
print(val1)
List1 = df.loc[val1]
print(List1)
List2 = df.loc[val2]
print(List2)
List3 = df.loc[val3]
print(List3)
plt.plot(List1['year'], List1['number'], label='Number of Omnibuses')
plt.plot(List2['year'], List2['number'], label='Number of Excursion buses')
plt.plot(List3['year'], List3['number'], label='Number of Private buses')
plt.xlabel('Year')
plt.ylabel('Number of vehicles')
#plt.xticks(List1['year'])
plt.title('Number of vehicles over the years')
plt.legend()
plt.show()
(Data Reading and Visualization, distribution)
Question 3:
The “iris” flower data set consists of measurements such as the length, width of the petals, and the length, width of the
sepals, all measured in centimeters, associated with each iris flower. Get the data set “from sklearn.datasets import
load_iris” and do a scatter plot as shown below. (Hint: you might need “from pandas.plotting import
scatter_matrix”)
Ans:
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
iris_dataset = load_iris()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
from pandas.plotting import scatter_matrix
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins': 20})
(Data Wrangling/Normalization)
Question 4:
You are given a set of data for supervised learning. A sample block of data looks like this:
“ 1.2234, 0.3302, 123.50, 0.0081, 30033.81, 1
1.3456, 0.3208, 113.24, 0.0067, 29283.18, -1
0.9988, 0.2326, 133.45, 0.0093, 36034.33, 1
1.1858, 0.4301, 128.55, 0.0077, 34037.35, 1
1.1533, 0.3853, 116.70, 0.0066, 22033.58, -13
1.2755, 0.3102, 118.30, 0.0098, 30183.65, 1
1.0045, 0.2901, 123.52, 0.0065, 31093.98, -1
1.1131, 0.3912, 113.15, 0.0088, 29033.23, -1 ”
Each row corresponds to a sample data measurement with 5 input features and 1 response.
(a) What kind of undesired effect can you anticipate if this set of raw data is used for learning?
(b) How can the data be preprocessed to handle this issue?
Ans:
(a) Those features with very large values may overshadow those with very small values.
(b) We can either use min-max or z-score normalization to resolve the problem.
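For illustration, a minimal sketch of both options in Python, assuming the rows above are stored in a NumPy array (the use of scikit-learn's MinMaxScaler and StandardScaler here is an illustrative choice, not part of the original answer):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# a few of the raw rows above, input features only (response column excluded)
data = np.array([
    [1.2234, 0.3302, 123.50, 0.0081, 30033.81],
    [1.3456, 0.3208, 113.24, 0.0067, 29283.18],
    [0.9988, 0.2326, 133.45, 0.0093, 36034.33],
])

# min-max normalization: each feature rescaled to [0, 1]
X_minmax = MinMaxScaler().fit_transform(data)

# z-score normalization: each feature centred to mean 0 and unit variance
X_zscore = StandardScaler().fit_transform(data)

print(X_minmax)
print(X_zscore)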
(Missing Data)
Question 5:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given
medical details. Download the Pima-Indians-Diabetes data from
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input
variables and 1 output variable. The variable names are as follows:
0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).
(a) Print the summary statistics of this data set.
(b) Count the number of “0” entries in columns [1,2,3,4,5].
(c) Replace these “0” values by “NaN”.
(Hint: you might need the “.describe()” and “.replace(0, numpy.NaN)” functions “from
pandas import read_csv”.)
Ans:
#(a)
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())
#(b)
print((dataset[[1,2,3,4,5]] == 0).sum())
#(c)
import numpy
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))
print(dataset.isnull().sum())
(In Quiz and Exam format)
Question 6:
Disease Outbreak Response System Condition (DORSCON) in Singapore is a colour-coded framework that shows the
current disease situation. The framework provides us with general guidelines on what needs to be done to prevent and
reduce the impact of infections. There are 4 statuses – Green, Yellow, Orange and Red, depending on the severity and
spread of the disease. Which type of data does DORSCON belong to?
(1) Categorical; (2) Ordinal; (3) Continuous; (4) Interval
A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the
maximum, _BLANK1_, and the first and third quartiles, where the number of data points that fall between the first and
third quartiles amounts to _BLANK2_ percent of the total number of data on display.
Ans:
_BLANK1_: Median
_BLANK2_: 50%
EE2211 Tutorial 3
Answer:
(a) We wish to find the value of c that makes the PMF sum up to one.
Σ_{n=0}^{2} P_N(n) = c + c/2 + c/4 = 1, implying c = 4/7.
Answer:
(a) From the above PDF we can determine the value of c by integrating the PDF and setting it equal to 1.
∫_0^2 cx dx = c[x²/2]_0^2 = c(4/2) = 2c = 1
Therefore c = 1/2.
(b) Pr[0 ≤ X ≤ 1] = ∫_0^1 (x/2) dx = 1/4
(c) Pr[−1/2 ≤ X ≤ 1/2] = ∫_0^{1/2} (x/2) dx = 1/16
(Bayes’ rule)
Question 3:
Let A = {resistor is within 50Ω of the nominal value}. The probability that a resistor is from machine B is Pr[B] = 0.3.
The probability that a resistor is acceptable, i.e., within 50 Ω of the nominal value, is Pr[A] = 0.78. Given that a resistor
is from machine B, the conditional probability that it is acceptable is Pr[A|B] = 0.6. What is the probability that an
acceptable resistor comes from machine B?
Answer:
We are given the event A that a resistor is within 50 Ω of the nominal value, and we need to find Pr[B|A]. Using Bayes’
theorem, we have Pr(B|A) = Pr(A|B)Pr(B) / Pr(A).
Since all of the quantities we need are given in the problem description, our answer is
Pr(B|A) = (0.6)(0.3)/(0.78) ≈ 0.23.
Answer:
Ref: Python for Probability, Statistics, and Machine Learning, Unpingco, José (pp.37-42).
(a) The first thing to do is characterize the measurable function for this as X : (a, b) → (a + b). Next, we associate all
of the (a, b) pairs with their sum. A Python dictionary can be created like this:
#(i)
d={(i,j):i+j for i in range(1,7) for j in range(1,7)}
The next step is to collect all of the (a, b) pairs that sum to each of the possible values from two to twelve.
# (ii) collect all of the (a, b) pairs that sum to each of the possible values
# from two to twelve
from collections import defaultdict
dinv = defaultdict(list)
for i,j in d.items(): dinv[j].append(i)
For example, dinv[7] contains the following list of pairs that sum to seven, [(1, 6), (2, 5), (5, 2),
(6, 1), (4, 3), (3, 4)]. The next step is to compute the probability measured for each of these items.
Using the independence assumption, this means we have to compute the sum of the products of the individual
item probabilities in dinv. Because we know that each outcome is equally likely, the probability of every term
in the sum equals 1/36.
# Compute the probability measured for each of these items
# including the sum equals seven
X={i:len(j)/36. for i,j in dinv.items() }
print(X)
{2: 0.027777777777777776, 3: 0.05555555555555555, 4: 0.08333333333333333, 5:
0.1111111111111111, 6: 0.1388888888888889, 7: 0.16666666666666666, 8:
0.1388888888888889, 9: 0.1111111111111111, 10: 0.08333333333333333, 11:
0.05555555555555555, 12: 0.027777777777777776}
(b) What is the probability that half the product of three dice will exceed their sum?
Using the same method above, we create the first mapping as follows:
d={(i,j,k):((i*j*k)/2>i+j+k) for i in range(1,7)
for j in range(1,7)
for k in range(1,7)}
The keys of this dictionary are the triples and the values are the logical values of whether or not half the product of
three dice exceeds their sum. Now, we do the inverse mapping to collect the corresponding lists,
dinv = defaultdict(list)
for i,j in d.items(): dinv[j].append(i)
Note that dinv contains only two keys, True and False. Again, because the dice are independent, the probability of
any triple is 1/6³ = 1/216. Finally, we collect this for each outcome as in the following,
X={i:len(j)/6.0**3 for i,j in dinv.items() }
print(X)
{False: 0.37037037037037035, True: 0.6296296296296297}
Answer:
Wiki: The cumulative distribution function of a real-valued random variable X is the function given by
F_X(x) = Pr(X ≤ x),   (1)
where the right-hand side represents the probability that the random variable X takes on a value less than or equal
to x. The probability that X lies in the semi-closed interval (a, b], where a < b, is therefore
Pr(a < X ≤ b) = F_X(b) − F_X(a).   (2)
from scipy import stats
# define constants
mu = 30        # mean = 30Ω
sigma = 1.8    # standard deviation = 1.8Ω
x1 = 28        # lower bound = 28Ω
x2 = 33        # upper bound = 33Ω
## calculate probabilities
# cumulative probability up to the lower bound
p_lower = stats.norm.cdf(x1, mu, sigma)
# cumulative probability up to the upper bound
p_upper = stats.norm.cdf(x2, mu, sigma)
# probability of the interval
Prob = p_upper - p_lower
print(Prob)
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module1BDescriptiveStudies_and_Statistics/PH717-Module1B-DescriptiveStudies_and_Statistics6.html
Question 8: (Multiple responses – one or more answers are correct)
If A and B are correlated, but they’re actually caused by C, which of the following statements are correct?
Ans:
a) A and C are correlated (Yes, A and C are correlated because A is caused by C)
b) B and C are correlated (Yes, B and C are correlated because B is caused by C)
c) A causes B to happen (No, A and B share the confounding factor C, but A and B don’t have a causal relationship)
d) A causes C to happen (No, C causes A to happen; however, we are not sure if A causes C to happen).
Suggested Discussion:
(i) Colon cancer is correlated to the amount of daily meat consumption.
(ii) There is a clear linear trend; countries with the lowest meat consumption have the lowest rates of colon
cancer, and the colon cancer rate among these countries progressively increases as meat consumption increases.
(iii) Probably causal.
Question 9: (Multiple responses – one or more answers are correct)
We toss a coin and observe which side is facing up. Which of the following statements represent valid probability
assignments for observing head P[‘H’] and tail P[‘T’]?
Ans:
a) P[‘H’]=0.2, P[‘T’]=0.9 (Invalid because they sum to 1.1, which doesn’t conform to the Axioms of probability)
b) P[‘H’]=0.0, P[‘T’]=1.0 (Valid because it conforms to the Axioms of probability)
c) P[‘H’]=-0.1, P[‘T’]=1.1 (Invalid because a probability cannot have a negative value)
d) P[‘H’]=P[‘T’]=0.5 (Valid because it conforms to the Axioms of probability)
Question 10: (Fill-in-blank)
A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood
have the flu, while the other 10% are sick with COVID-19. Let F stand for the event of a child being sick with flu and
C stand for the event of a child being sick with COVID-19; therefore, we have P[F]=0.9 and P[C]=0.1. Assume for
simplicity that a child is either with flu or with COVID-19, not both.
A well-known symptom of COVID-19 is a dry cough (the event of having which we denote D). Assume that the
probability of having a dry cough if one has COVID-19 is 0.95. However, children with flu also develop a dry cough,
and the probability of having a dry cough if one has flu is 0.5. Upon examining the child, the doctor finds the child
has a dry cough. The probability that the child has COVID-19 is _BLANK_.
Ans:
Because P[C|D] = P[D|C]P[C] / (P[D|C]P[C] + P[D|F]P[F]) = (0.95×0.1) / (0.95×0.1 + 0.5×0.9) ≈ 0.17.
The probability that the child has COVID-19 is 0.17.
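A quick numerical check of this calculation in Python (values taken from the question):

# Bayes' rule check for P(COVID-19 | dry cough)
P_F, P_C = 0.9, 0.1          # priors: flu vs COVID-19
P_D_given_C = 0.95           # P(dry cough | COVID-19)
P_D_given_F = 0.5            # P(dry cough | flu)

P_C_given_D = (P_D_given_C * P_C) / (P_D_given_C * P_C + P_D_given_F * P_F)
print(round(P_C_given_D, 2))  # 0.17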
Question 11: (True/False)
Two vectors a = [1, 2, 3]ᵀ and b = [4, 5, 6]ᵀ are linearly dependent?
Ans: False (because a is not a scalar multiple of b).
Question 12: (Fill-in-blank)
The rank of the matrix [[1, 3], [2, 4]] is 2. (try row echelon form)
Ans: [[1, 3], [2, 4]] => [[1, 3], [0, −2]]; two non-zero rows, so the rank is 2.
Question 13: (Fill-in-blank)
The rank of the matrix [[1, 2, 3], [4, 5, 6], [7, 8, 9]] is 2. (try row echelon form)
Ans: [[1, 2, 3], [4, 5, 6], [7, 8, 9]] => [[1, 2, 3], [0, −3, −6], [7, 8, 9]] => [[1, 2, 3], [0, −3, −6], [0, −6, −12]] => [[1, 2, 3], [0, −3, −6], [0, 0, 0]]; two non-zero rows, so the rank is 2.
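Both rank answers can be checked quickly with numpy (matrix_rank is used here purely as a verification aid):

import numpy as np

A = np.array([[1, 3], [2, 4]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.linalg.matrix_rank(A))  # 2
print(np.linalg.matrix_rank(B))  # 2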
EE2211 Tutorial 4
Given Xw = y where X = [[1, 1], [3, 4]] and y = [0, 1]ᵀ.
Answer:
import numpy as np
X = np.array([[1, 1], [3, 4]])
inv_X = np.linalg.inv(X)
y = np.array([0, 1])
w = inv_X.dot(y)
print(w)
Given Xw = y where X = [[1, 2], [2, 4], [1, −1]] and y = [0, 0.1, 1]ᵀ.
Answer:
ŵ = (XᵀX)⁻¹Xᵀy = [[0.4667, −0.2], [−0.2, 0.1333]] [[1, 2, 1], [2, 4, −1]] [0, 0.1, 1]ᵀ = [0.68, −0.32]ᵀ.
import numpy as np
from numpy.linalg import inv
X = np.array([[1, 2], [2, 4], [1, -1]])
y = np.array([0, 0.1, 1])
# left pseudo-inverse solution for the over-determined system
w = inv(X.T @ X) @ X.T @ y
print(w)
Answer:
The determinant of XXᵀ = det([[2, 2, 1], [2, 4, 0], [1, 0, 2]]) = 2·det([[4, 0], [0, 2]]) − 2·det([[2, 0], [1, 2]]) + 1·det([[2, 4], [1, 0]]) = 2×8 − 2×4 + (−4) = 4.
(c) Since XXᵀ is invertible (its determinant is 4), the dual solution is ŵ = Xᵀ(XXᵀ)⁻¹y, with (XXᵀ)⁻¹ = [[2, −1, −1], [−1, 0.75, 0.5], [−1, 0.5, 1]].
Given wᵀX = yᵀ where X = [[1, 2], [3, 6]] and y = [0, 1]ᵀ.
Answer:
Given wᵀX = yᵀ where X = [[1, 2], [2, 4], [1, −1]] and y = [0, 1]ᵀ.
Answer:
ŵᵀ = (Xa)ᵀ   (the 3-dimensional vector w can be constrained by projecting X onto a 2-dimensional vector a)
 = aᵀXᵀ
 = yᵀ(XᵀX)⁻¹Xᵀ
 = [0 1] [[0.4667, −0.2], [−0.2, 0.1333]] [[1, 2, 1], [2, 4, −1]]
 = [0.0667, 0.1333, −0.3333].
Equivalently, (wᵀX)ᵀ = (yᵀ)ᵀ => Xᵀw = y. Assume a new notation X̃ = Xᵀ; we can then use the dual formula w = X̃ᵀ(X̃X̃ᵀ)⁻¹y.
Note: dim(X) is 3×2 and dim(a) is 2×1; the estimation is done/constrained in the lower dimension (2) and
then projected back to the higher dimension (3).
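A short numerical check of this dual-form solution, using the same X and y as above:

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 2], [2, 4], [1, -1]])
y = np.array([0, 1])

# w^T = y^T (X^T X)^{-1} X^T, i.e. w = X (X^T X)^{-1} y
w = X @ inv(X.T @ X) @ y
print(w)         # approximately [0.0667, 0.1333, -0.3333]
print(X.T @ w)   # should reproduce y = [0, 1]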
(Systems of Linear Equations)
Question 7:
This question is related to determination of types of system where an appropriate solution can be found subsequently.
The following matrix has a left inverse.
2 0 0
𝐗=[ ]
0 0 1
a) True
b) False
Answer: b)
Solution: Left inverse is given by (𝐗 𝑻 𝐗)−𝟏 𝐗 𝑻 where 𝐗 𝑻 𝐗 should be invertible. In this case, 𝐗 𝑻 𝐗 is not invertible
so the matrix does not have a left inverse.
Answer: c and d.
EE2211 Tutorial 5
(a) Perform a linear regression with addition of a bias/offset term to the input feature vector and sketch the result
of line fitting.
(b) Perform a linear regression without inclusion of any bias/offset term and sketch the result of line fitting.
(c) What is the effect of adding a bias/offset term to the input feature vector?
Answer:
The input feature matrix including the bias/offset can be written as Xᵀ = [[1, 1, 1, 1, 1, 1], [−10, −8, −3, −1, 2, 8]].
ŵ = (XᵀX)⁻¹Xᵀy = [[6, −12], [−12, 242]]⁻¹ [[1, 1, 1, 1, 1, 1], [−10, −8, −3, −1, 2, 8]] [5, 5, 4, 3, 2, 2]ᵀ = [3.1055, −0.1972]ᵀ.
(b) This is an over-determined system.
In this case, the input feature without inclusion of a bias/offset is a vector given by x = [−10, −8, −3, −1, 2, 8]ᵀ.
ŵ = (xᵀx)⁻¹xᵀy = [242]⁻¹ [−10, −8, −3, −1, 2, 8] [5, 5, 4, 3, 2, 2]ᵀ = −0.3512.
(c) The bias/offset term allows the line to move away from the origin (moved vertically in this case).
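The two fits in (a) and (b) can be verified with a few lines of Python using the data above:

import numpy as np
from numpy.linalg import inv

x = np.array([-10, -8, -3, -1, 2, 8])
y = np.array([5, 5, 4, 3, 2, 2])

# (a) with a bias/offset column
X = np.column_stack((np.ones(len(x)), x))
w_bias = inv(X.T @ X) @ X.T @ y
print(w_bias)        # approximately [3.1055, -0.1972]

# (b) without a bias/offset term
w_nobias = (x @ y) / (x @ x)
print(w_nobias)      # approximately -0.3512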
{x₁ = 1, x₂ = 1, x₃ = 5} → {y = 3}
(a) Predict the following test data without inclusion of an input bias/offset term.
(b) Predict the following test data with inclusion of an input bias/offset term.
Answer:
A college bookstore must order books two months before each semester starts. They believe that the number of books
that will ultimately be sold for any particular course is related to the number of students registered for the course when
the books are ordered. They would like to develop a linear regression equation to help plan how many books to order.
From past records, the bookstore obtains the number of students registered, X, and the number of books actually sold
for a course, Y, for 12 different semesters. These data are shown below.
(a) Obtain a scatter plot of the number of books sold versus the number of registered students.
(b) Write down the regression equation and calculate the coefficients for this fitting.
(c) Predict the number of books that would be sold in a semester when 30 students have registered.
(d) Predict the number of books that would be sold in a semester when 5 students have registered.
Answer:
(a)
(b) The regression equation is ŷ = w₀ + w₁x, i.e. ŷ = Xw, where w = [w₀, w₁]ᵀ and
Xᵀ = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [36, 28, 35, 39, 30, 30, 31, 38, 36, 38, 29, 26]];
the least-squares fit gives ŵ = (XᵀX)⁻¹Xᵀy = [9.30, 0.6727]ᵀ (used in (c) and (d) below).
(c)
ŷₜ = Xₜŵ = [1 30] [9.30, 0.6727]ᵀ = 29.4818
(d) (ŷₜ = 12.6636) This prediction appears to be somewhat over-optimistic. Since 5 students is not within the range
of the sampled number of students, it might not be appropriate to use the regression equation to make this prediction.
We do not know if the straight-line model would fit data at this point, and we might not want to extrapolate far beyond
the observed range.
Answer:
(a), (b), (c), (d)
Using all 12 samples: Xᵀ = [[1, …, 1], [36, 26, 35, 39, 26, 30, 31, 38, 36, 38, 26, 26]], y = [31, 20, 34, 35, 20, 30, 30, 38, 34, 33, 20, 20]ᵀ,
ŵ = (XᵀX)⁻¹Xᵀy = [[12, 387], [387, 12791]]⁻¹ Xᵀ y = [−10.4126, 1.2143]ᵀ,
ŷₜ = Xₜŵ = [1 30] [−10.4126, 1.2143]ᵀ = 26.0177.
Using the 9 samples that remain after purging the duplicates: Xᵀ = [[1, …, 1], [36, 35, 39, 30, 31, 38, 36, 38, 26]], y = [31, 34, 35, 30, 30, 38, 34, 33, 20]ᵀ,
ŵ = (XᵀX)⁻¹Xᵀy = [[9, 309], [309, 10763]]⁻¹ Xᵀ y = [−3.5584, 1.0260]ᵀ,
ŷₜ = Xₜŵ = [1 30] [−3.5584, 1.0260]ᵀ = 27.2208.
Note: these results show that duplicating samples can influence the learning and decision too. In this case, purging
seems to give a more optimistic prediction for a relatively small number of students (< 37) and more conservative
prediction for a relatively large number of students (>37).
(Linear Regression, python)
Question 5:
Download the data file “government-expenditure-on-education.csv” from Canvas Tutorial Folder.
It depicts the government’s educational expenditure over the years (downloaded in July 2021 from
https://data.gov.sg/dataset/government-expenditure-on-education)
Predict the educational expenditure of year 2021 based on linear regression. Solve the problem using Python with a
plot. Note: please use the file from the canvas link.
Hint: use Python packages like numpy, pandas, matplotlib.pyplot, numpy.linalg.
Answer:
Codes:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import inv

df = pd.read_csv("government-expenditure-on-education.csv")
expenditureList = df['recurrent_expenditure_total'].tolist()
yearList = df['year'].tolist()

# design matrix: bias column and year
X = np.array([np.ones(len(yearList)), yearList]).T
y = np.array(expenditureList)

w = inv(X.T @ X) @ X.T @ y
print(w)

y_line = X.dot(w)
plt.plot(yearList, y_line)
plt.xlabel('Year')
plt.ylabel('Expenditure')
plt.title('Education Expenditure')
plt.show()

# prediction for year 2021
y_predict = np.array([1, 2021]) @ w
print(y_predict)
Output
[-6.4843247e+08 3.2683591e+05]
12102904.270643068
Answer:
import pandas as pd
import numpy as np
from numpy.linalg import inv
from sklearn.metrics import mean_squared_error

wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
wine.info()
y = wine.quality
x = wine.drop('quality', axis=1)
x0 = np.ones((len(y), 1))
X = np.hstack((x0, x))

## (Note: this exercise introduces the basic protocol of using the training-test
## partitioning of samples for evaluation, assuming the list of data is already randomly indexed.
## In case you want a more general random split for better training/test distributions,
## a random permutation of the sample indices can be used instead.)
train_X = X[0:1500]
train_y = y[0:1500]
test_X = X[1500:1599]
test_y = y[1500:1599]

## linear regression
w = inv(train_X.T @ train_X) @ train_X.T @ train_y
print(w)
yt_est = test_X.dot(w)
MSE = np.square(np.subtract(test_y, yt_est)).mean()
print(MSE)
MSE = mean_squared_error(test_y, yt_est)
print(MSE)
0.34352638122440293
0.343526381224403
Question 7:
This question is related to understanding of modelling assumptions. The function given by f(x) = 1 + x₁ + x₂ − x₃ − x₄ is affine.
a) True
b) False
Answer: a)
Question 8:
Answer: b)
Question 9:
The values of feature vector x and their corresponding values of target vector y are shown in the table below:
Find the least square solution of w using linear regression of multiple outputs and then estimate the value of y when
x = [8, 0, 2].
Answer:
#python
import numpy as np
from numpy.linalg import inv
X = np.array([[1, 3, -1, 0], [1, 5, 1, 2], [1, 9, -1, 3], [1, -6, 7, 2],
[1, 3, -2, 0]])
Y = np.array([[1, -1], [-1, 0], [1, 2], [0, 3], [1, -2]])
W = inv(X.T @ X) @ X.T @ Y
print(W)
newX=np.array([1, 8, 0, 2])
newY=newX@W
print(newY)
Outputs
W= [[ 1.14668974 -0.95997404]
[-0.630463 -0.33427088]
[-1.10601471 -0.24426655]
[ 1.3595846 1.77953267]]
newY=[-1.17784509 -0.07507572]
EE2211 Tutorial 6
⇒ 𝐗 𝑇 𝐗𝐰 + 𝜆𝐰 = 𝐗 𝑇 𝐲
⇒ 𝜆𝐰 = 𝐗 𝑇 𝐲 − 𝐗 𝑇 𝐗𝐰
⇒ 𝐰 = 𝜆−1 (𝐗 𝑇 𝐲 − 𝐗 𝑇 𝐗𝐰)
⇒ 𝐰 = 𝜆−1 𝐗 𝑇 (𝐲 − 𝐗𝐰)
𝐰 = 𝐗𝑇𝒂
where
𝒂 = 𝜆−1 (𝐲 − 𝐗𝐰)
⇒ 𝜆𝒂 = (𝐲 − 𝐗𝐰)
⇒ 𝜆𝒂 = (𝐲 − 𝐗𝐗 𝑇 𝒂)
⇒ 𝐗𝐗 𝑇 𝒂 + 𝜆𝒂 = 𝐲
⇒ (𝐗𝐗 𝑇 + 𝜆𝐈)𝒂 = 𝐲
⇒ 𝒂 = (𝐗𝐗 𝑇 + 𝜆𝐈)−1 𝐲
Hence,
𝐰 = 𝐗 𝑇 𝒂 = 𝐗 𝑇 (𝐗𝐗 𝑇 + 𝜆𝐈)−1 𝐲.
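As a numerical sanity check of the identity just derived, the following sketch compares the primal and dual ridge solutions on an arbitrary small example (the matrix and vector below are illustrative, not from the tutorial):

import numpy as np
from numpy.linalg import inv

# check that (X^T X + lam*I)^{-1} X^T y  ==  X^T (X X^T + lam*I)^{-1} y
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, -1.0])
lam = 0.1

w_primal = inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
w_dual = X.T @ inv(X @ X.T + lam * np.eye(X.shape[0])) @ y
print(np.allclose(w_primal, w_dual))  # True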
(a) Perform a 3rd-order polynomial regression and sketch the result of line fitting.
(b) Given a test point { 𝑥 = 9 } predict 𝑦 using the polynomial model.
(c) Compare this prediction with that of a linear regression.
Answer:
Polynomial model of 3rd order: f(x) = w₀ + w₁x₁ + w₂x₁² + w₃x₁³.
P = [[1, −10, 100, −1000], [1, −8, 64, −512], [1, −3, 9, −27], [1, −1, 1, −1], [1, 2, 4, 8], [1, 8, 64, 512]], y = [5, 5, 4, 3, 2, 2]ᵀ.
Polynomial regression:
ŵ = (PᵀP)⁻¹Pᵀy
 = [[6, −12, 242, −1020], [−12, 242, −1020, 18290], [242, −1020, 18290, −100212], [−1020, 18290, −100212, 1525082]]⁻¹ [[1, 1, 1, 1, 1, 1], [−10, −8, −3, −1, 2, 8], [100, 64, 9, 1, 4, 64], [−1000, −512, −27, −1, 8, 512]] [5, 5, 4, 3, 2, 2]ᵀ
 = [2.6894, −0.3772, 0.0134, 0.0029]ᵀ
Linear regression:
ŵ = (XᵀX)⁻¹Xᵀy = [[6, −12], [−12, 242]]⁻¹ [[1, 1, 1, 1, 1, 1], [−10, −8, −3, −1, 2, 8]] [5, 5, 4, 3, 2, 2]ᵀ = [3.1055, −0.1972]ᵀ.
Prediction:
y_predict_Poly = 2.4661
y_predict_Linear = 1.3303
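A short Python sketch reproducing the polynomial and linear fits above and their predictions at x = 9:

import numpy as np
from numpy.linalg import inv

x = np.array([-10, -8, -3, -1, 2, 8])
y = np.array([5, 5, 4, 3, 2, 2])

P = np.column_stack([x**0, x**1, x**2, x**3])   # 3rd-order polynomial design matrix
w_poly = inv(P.T @ P) @ P.T @ y
print(w_poly)                                   # approximately [2.6894, -0.3772, 0.0134, 0.0029]
print(np.array([1, 9, 81, 729]) @ w_poly)       # polynomial prediction at x = 9

X = np.column_stack([np.ones_like(x), x])
w_lin = inv(X.T @ X) @ X.T @ y
print(np.array([1, 9]) @ w_lin)                 # linear prediction at x = 9 (about 1.33)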
(Polynomial Regression, 3D data, Python)
Question 3:
(a) Write down the expression for a 3rd order polynomial model having a 3-dimensional input.
(b) Write down the P matrix for this polynomial given X = [[x₁₁, x₁₂, x₁₃], [x₂₁, x₂₂, x₂₃]] = [[1, 0, 1], [1, −1, 1]].
(c) Given y = [0, 1]ᵀ, can a unique solution be obtained in dual form? If so, proceed to solve it.
(d) Given y = [0, 1]ᵀ, can the primal ridge regression be applied to obtain a unique solution? If so, proceed to solve it.
Answer:
(c) Yes
ŵ = Pᵀ(PPᵀ)⁻¹y = Pᵀ [[10, 10], [10, 20]]⁻¹ [0, 1]ᵀ
ŵᵀ = [0 0 -0.1000 0 -0.1000 -0.1000 0 0 0.1000 0 -0.1000 …]
In python:
w_dual =
[ 0. 0. -0.1 0. 0. -0.1 0. 0.1 -0.1 0. 0. -0.1 0. 0.1 -0.1 0. -0.1 0.1 -0.1 0. ]
(Note: The arrangement of the polynomial terms in the columns of matrix 𝐏 using
PolynomialFeatures from sklearn.preprocessing might be different from that in equation(1).)
(d) Yes
ŵ = (PᵀP + λI)⁻¹Pᵀy
ŵᵀ = [0.0000 0.0000 -0.1000 0.0000 -0.1000 -0.1000 0.0000 0.0000 0.1000 0.0000 -0.1000 0.0000 0.1000 0.1000 0.0000 -0.1000 -0.1000 0.0000 -0.1000 0.0000]
In python:
w_primal = [ 9.99969302e-07 9.99972940e-07 -9.99980001e-02 9.99970098e-07
9.99970666e-07 -9.99980000e-02 9.99967597e-07 9.99980000e-02
-9.99980000e-02 9.99972485e-07 9.99969529e-07 -9.99980000e-02
9.99968506e-07 9.99980000e-02 -9.99980001e-02 9.99970553e-07
-9.99980001e-02 9.99980001e-02 -9.99980001e-02 9.99969416e-07]
(Note: The arrangement of the polynomial terms in the columns of matrix 𝐏 using
PolynomialFeatures from sklearn.preprocessing might be different from that in equation(1).)
Here, at 𝜆 = 0.0001, we observe a very close solution to that in (c) even though (d) constitutes an
approximation whereas (c) is exact.
Codes:
import numpy as np
from numpy.linalg import inv
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1,0,1], [1,-1,1]])
y = np.array([0, 1])
## Generate polynomial features
order = 3
poly = PolynomialFeatures(order)
P = poly.fit_transform(X)
## dual solution (without ridge)
w_dual = P.T @ inv(P @ P.T) @ y
print(w_dual)
## primal ridge
reg_L = 0.0001*np.identity(P.shape[1])
w_primal_ridge = inv(P.T @ P + reg_L) @ P.T @ y
print(w_primal_ridge)
X = [[1, −1], [1, 0], [1, 0.5], [1, 0.3], [1, 0.8]], y = [+1, +1, −1, +1, −1]ᵀ
ŵ = (XᵀX)⁻¹Xᵀy = [0.3333, −1.1111]ᵀ
sgn(ŷₜ) = sgn(Xₜŵ) = sgn([0.4444, −0.1111]ᵀ) = [+1, −1]ᵀ, i.e. class +1 → class 1 and class −1 → class 2.
Codes:
import numpy as np
from numpy.linalg import inv
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1,-1], [1,0], [1,0.5], [1,0.3], [1,0.8]])
y = np.array([1, 1, -1, 1, -1])
## Linear regression for classification
w = inv(X.T @ X) @ X.T @ y
print(w)
Xt = np.array([[1,-0.1], [1,0.4]])
y_predict = Xt @ w
print(y_predict)
y_class_predict = np.sign(y_predict)
print(y_class_predict)
(b) Predict the class label for {𝑥 = −0.1} and {𝑥 = 0.4} using a polynomial model of 5th order and a one-hot
encoded target.
Answer:
X = [[1, −1], [1, 0], [1, 0.5], [1, 0.3], [1, 0.8]], Y = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], Xₜ = [[1, −0.1], [1, 0.4]].
(a) ŵ = (XᵀX)⁻¹XᵀY = [[0.4780, 0.3333, 0.1887], [−0.6499, 0.5556, 0.0943]]
Question 7
MCQ: there could be more than one answer. Given three samples of two-dimensional data points X = [[1, 1], [0, 1], [3, 3]] with corresponding target vector y = [1, 0, 1]ᵀ. Suppose you want to use a full third-order polynomial model to fit these data. Which of the following is/are true?
Answer: a, b, c, d
Question 8
MCQ: there could be more than one answer. Which of the following is/are true?
a) The polynomial model can be used to solve problems with nonlinear decision boundary.
c) The solution for learning feature X with target y based on linear ridge regression can be written as ŵ = (XᵀX + λI)⁻¹Xᵀy for λ > 0. As λ increases, ŵᵀŵ decreases.
d) If there are four data samples with two input features each, the full second-order polynomial model is an over-determined system.
Answer: a, c
EE2211 Tutorial 7
Question 1:
This question explores the use of Pearson’s correlation as a feature selection metric. We are given the
following training dataset.
What are the top two features we should select if we use Pearson’s correlation as a feature selection
metric? Here’s the definition of Pearson’s correlation. Given N pairs of datapoints
{(a₁, b₁), (a₂, b₂), ⋯, (a_N, b_N)}, the Pearson’s correlation r is defined as
r = [ (1/N) Σ_{n=1}^{N} (a_n − ā)(b_n − b̄) ] / (σ_a σ_b),
where ā = (1/N) Σ_{n=1}^{N} a_n and b̄ = (1/N) Σ_{n=1}^{N} b_n are the empirical means of a and b respectively,
σ_a = sqrt((1/N) Σ_{n=1}^{N} (a_n − ā)²) and σ_b = sqrt((1/N) Σ_{n=1}^{N} (b_n − b̄)²) are referred to as the empirical
standard deviations of a and b, and Cov(a, b) = (1/N) Σ_{n=1}^{N} (a_n − ā)(b_n − b̄) is known as the empirical
covariance between a and b.
Answer:
Mean of Feature 1 = μ₁ = (0.3510 + 2.1812 + 0.2415 − 0.1096 + 0.1544)/5 = 0.5637
Mean of Feature 2 = μ₂ = (1.1796 + 2.1068 + 1.7753 + 1.2747 + 2.0851)/5 = 1.6843
Mean of Feature 3 = μ₃ = (−0.9852 + 1.3766 − 1.3244 − 0.6316 − 0.8320)/5 = −0.4793
Mean of Target y = μ_y = (0.2758 + 1.4392 − 0.4611 + 0.6154 + 1.0006)/5 = 0.5740
Cov(Feature 1, y) = (1/5)[(0.3510 − μ₁)(0.2758 − μ_y) + (2.1812 − μ₁)(1.4392 − μ_y) + (0.2415 − μ₁)(−0.4611 − μ_y) + (−0.1096 − μ₁)(0.6154 − μ_y) + (0.1544 − μ₁)(1.0006 − μ_y)] = 0.3188
Cov(Feature 2, y) = (1/5)[(1.1796 − μ₂)(0.2758 − μ_y) + (2.1068 − μ₂)(1.4392 − μ_y) + (1.7753 − μ₂)(−0.4611 − μ_y) + (1.2747 − μ₂)(0.6154 − μ_y) + (2.0851 − μ₂)(1.0006 − μ_y)] = 0.1152
Cov(Feature 3, y) = (1/5)[(−0.9852 − μ₃)(0.2758 − μ_y) + (1.3766 − μ₃)(1.4392 − μ_y) + (−1.3244 − μ₃)(−0.4611 − μ_y) + (−0.6316 − μ₃)(0.6154 − μ_y) + (−0.8320 − μ₃)(1.0006 − μ_y)] = 0.4949
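To complete the feature ranking, the covariances above still need to be divided by the standard deviations; a short Python sketch using the listed values (np.corrcoef computes the Pearson correlation directly):

import numpy as np

f1 = np.array([0.3510, 2.1812, 0.2415, -0.1096, 0.1544])
f2 = np.array([1.1796, 2.1068, 1.7753, 1.2747, 2.0851])
f3 = np.array([-0.9852, 1.3766, -1.3244, -0.6316, -0.8320])
y  = np.array([0.2758, 1.4392, -0.4611, 0.6154, 1.0006])

# Pearson correlation of each feature with the target
for name, f in [("Feature 1", f1), ("Feature 2", f2), ("Feature 3", f3)]:
    r = np.corrcoef(f, y)[0, 1]
    print(name, round(r, 4))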
{𝑥 = −8} → {𝑦 = 2.42}
{𝑥 = −3} → {𝑦 = 0.22}
{𝑥 = −1} → {𝑦 = 0.12}
{ 𝑥 = 2 } → {𝑦 = 0.25}
{ 𝑥 = 7 } → {𝑦 = 3.09}
{𝑥 = −9} → {𝑦 = 3}
{x = −9} → {y = −6}
{x = −7} → {y = 1.81}
{𝑥 = −5} → {𝑦 = 0.80}
{𝑥 = −4} → {𝑦 = 0.25}
{ 𝑥 = −2 } → {𝑦 = −0.19}
{ 𝑥 = 1 } → {𝑦 = 0.4}
{ 𝑥 = 4 } → {𝑦 = 1.24}
{ 𝑥 = 5 } → {𝑦 = 1.68}
{ 𝑥 = 6 } → {𝑦 = 2.32}
{ 𝑥 = 9 } → {𝑦 = 5.05}
(a) Use the polynomial model from orders 1 to 6 to train and test the data without regularization. Plot the
Mean Squared Errors (MSE) over orders from 1 to 6 for both the training and the test sets. Which
model order provides the best MSE in the training and test sets? Why? [Hint: the underlying data was
generated using a quadratic function + noise]
(b) Use regularization (ridge regression) λ=1 for all orders and repeat the same analyses. Compare the
plots of (a) and (b). What do you see? [Hint: the underlying data was generated using a quadratic
function + noise]
Answer:
Q1(a)
• There are 6 training data points. For polynomial orders 1 to 5, we can use the primal solution
ŵ = (PᵀP)⁻¹Pᵀy. For order 6, there are 7 unknowns, so the system is under-determined and we
use the dual solution ŵ = Pᵀ(PPᵀ)⁻¹y.
• See plots for estimated polynomial curves and MSE below
• ====== No Regularization =======
Training MSE: [2.3071 8.4408e-03 8.3026e-03 1.7348e-03 3.8606e-25 2.3656e-17]
Test MSE: [ 3.0006 0.0296 0.0301 0.0854 1.0548 10.7674]
• Observe that the estimated polynomial curves for orders 5 and 6 pass through the training samples
exactly. This results in training MSE of virtually 0, but high test MSE => overfitting
• Note that even though the true underlying data came from a quadratic model (order = 2), estimated
polynomial curves for orders 2, 3 and 4 have relatively low training and test MSE.
• Polynomial curve of order 1 (linear curves) have high training and test MSE => underfitting
Q1(b)
• With regularization, we can simply use the primal solution even for order 6: ŵ = (PᵀP + λI)⁻¹Pᵀy.
• See plots for estimated polynomial curves and MSE below
• ====== Regularization =======
Training MSE: [2.3586 8.4565e-03 8.3560e-03 1.8080e-03 7.2650e-04 1.9348e-04]
Test MSE: [3.2756 0.0302 0.0314 0.0939 0.4369 6.0202]
• With the regularization, none of the polynomial curves passes through the training samples exactly.
In the case of orders 5 and 6, test MSE dropped from 1.0548 (order 5) and 10.7674 (order 6) to
0.4369 (order 5) and 6.0202 (order 6) after regularization was added. Thus, the regularization
reduces the overfitting
• On the other hand, the regularization did not help orders 1 to 4. Observe that the test MSE actually
went up. In this case, the regularization was overly strong, which ended up hurting these orders.
• Note that the addition of the regularization does not necessarily favor the lower order (compared
with the higher order). For example, the loss for order 2 is 0.018476, but the loss for order 6 is
0.0036308, which is lower. Thus, adding regularization does not help us choose the best polynomial
order for best prediction.
• In fact, selecting the best regularization parameter or model complexity (e.g., polynomial order) is
typically done through inner-loop (nested) cross-validation or training-validation-test scheme
(which will be taught in a future lecture).
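A minimal sketch of the experiment described above, assuming the first list of six points is the training set and the second list the test set (as implied by the answer), with PolynomialFeatures from scikit-learn; setting the regularization to 1 reproduces part (b):

import numpy as np
from numpy.linalg import inv, pinv
from sklearn.preprocessing import PolynomialFeatures

# assumed split: 6 training points and 10 test points (from the lists above)
x_train = np.array([-8, -3, -1, 2, 7, -9]).reshape(-1, 1)
y_train = np.array([2.42, 0.22, 0.12, 0.25, 3.09, 3])
x_test = np.array([-9, -7, -5, -4, -2, 1, 4, 5, 6, 9]).reshape(-1, 1)
y_test = np.array([-6, 1.81, 0.80, 0.25, -0.19, 0.4, 1.24, 1.68, 2.32, 5.05])

for reg in [0.0, 1.0]:                       # no regularization, then ridge with lambda = 1
    for order in range(1, 7):
        poly = PolynomialFeatures(order)
        P_train = poly.fit_transform(x_train)
        P_test = poly.transform(x_test)
        if reg == 0.0:
            w = pinv(P_train) @ y_train      # covers both the primal and the dual cases
        else:
            w = inv(P_train.T @ P_train + reg * np.eye(P_train.shape[1])) @ P_train.T @ y_train
        mse_train = np.mean((P_train @ w - y_train) ** 2)
        mse_test = np.mean((P_test @ w - y_test) ** 2)
        print(reg, order, round(mse_train, 4), round(mse_test, 4))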
EE2211 Tutorial 8
Question 1
Suppose we are minimizing f (x) = x4 with respect to x. We initialize x to be 2. We
perform gradient descent with learning rate 0.1. What is the value of x after the first
iteration?
Answer:
• At x = 2, the gradient is 4 × 2³ = 32
• Gradient-descent update: x ← x − 0.1 × 32 = 2 − 3.2 = −1.2, so x = −1.2 after the first iteration
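The same update in a few lines of Python:

# one gradient-descent step on f(x) = x**4 with learning rate 0.1
x = 2.0
grad = 4 * x**3          # derivative of x^4
x = x - 0.1 * grad
print(x)                 # -1.2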
Question 2
Please consider the csv file (government-expenditure-on-education.csv), which depicts
the government’s educational expenditure over the years. We would like to predict
expenditure as a function of year. To do this, fit an exponential model f (x, w) =
exp(−xᵀw) with squared error loss to estimate w based on the csv file and gradient
descent. In other words, C(w) = Σ_{i=1}^{m} (f(xᵢ, w) − yᵢ)².
Note that even though year is one dimensional, we should add the bias term, so x =
[1 year]T . Furthermore, optimizing the exponential function is tricky (because a small
change in w can lead to large change in f ). Therefore for the purpose of optimization,
divide the “year” variable by the largest year (2018) and divide the “expenditure” by the
largest expenditure, so that the resulting normalized year and normalized expenditure
variables are between 0 and 1. Use a learning rate of 0.03 and run gradient descent for
2000000 iterations.
(a) Plot the cost function C(w) as a function of the number of iterations.
(b) Use the fitted parameters to plot the predicted educational expenditure from year
1981 to year 2023.
(c) Repeat (a) using a learning rate of 0.1 and learning rate of 0.001. What do you
observe relative to (a)?
The goal of this question is for you to code up gradient descent, so I will provide you
with the gradient derivation. First, please note that in general, ∇_w(xᵀw) = x. To see this, note that
the j-th component of the gradient is
∂(xᵀw)/∂w_j = ∂(w₁x₁ + w₂x₂ + ··· + w_d x_d)/∂w_j = x_j,
so stacking the components gives
∇_w(xᵀw) = [x₁, x₂, …, x_d]ᵀ = x.   (1)
The above equality will be very useful for the other questions as well. Now, going back to our question,
∇_w C(w) = ∇_w Σ_{i=1}^{m} (f(xᵢ, w) − yᵢ)²   (2)
= Σ_{i=1}^{m} ∇_w (f(xᵢ, w) − yᵢ)²   (3)
= Σ_{i=1}^{m} 2(f(xᵢ, w) − yᵢ) ∇_w f(xᵢ, w)   (chain rule)   (4)
= Σ_{i=1}^{m} 2(f(xᵢ, w) − yᵢ) ∇_w exp(−xᵢᵀw)   (5)
= −Σ_{i=1}^{m} 2(f(xᵢ, w) − yᵢ) exp(−xᵢᵀw) ∇_w(xᵢᵀw)   (chain rule)   (6)
= −Σ_{i=1}^{m} 2(f(xᵢ, w) − yᵢ) exp(−xᵢᵀw) xᵢ   (7)
= −Σ_{i=1}^{m} 2(f(xᵢ, w) − yᵢ) f(xᵢ, w) xᵢ   (8)
Answer:
[Figures for learning rate 0.03: squared-error cost versus iteration number, and the predicted educational expenditure versus year using the fitted parameters.]
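A minimal sketch of the gradient-descent loop implied by the derivation above, assuming the normalization described in the question and the CSV column names used in the Tutorial 2 answer:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("government-expenditure-on-education.csv")
year = df['year'].values / 2018.0                        # normalize year by the largest year
y = df['total_expenditure_on_education'].values
y = y / y.max()                                          # normalize expenditure
X = np.column_stack([np.ones_like(year), year])          # add bias term, x = [1, year]^T

w = np.zeros(2)
lr = 0.03
costs = []
for it in range(2000000):
    f = np.exp(-X @ w)                                   # f(x, w) = exp(-x^T w)
    grad = -2 * ((f - y) * f) @ X                        # gradient from Eq. (8)
    w = w - lr * grad
    costs.append(np.sum((f - y) ** 2))

print(w)
plt.plot(costs)
plt.xlabel('Iteration Number')
plt.ylabel('Squared Error')
plt.show()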
Question 3
Given the linear learning model f(x, w) = xᵀw, where x ∈ R^d. Consider the loss function
L(f(xᵢ, w), yᵢ) = (f(xᵢ, w) − yᵢ)⁴, where i indexes the i-th training sample. The final cost function is
C(w) = Σ_{i=1}^{m} L(f(xᵢ, w), yᵢ), where m is the total number of training samples. Derive the gradient
of the cost function with respect to w.
[Figures for Question 2(c): squared-error cost versus iteration number for learning rate 0.1 and for learning rate 0.001.]
Answer:
∇_w C(w) = ∇_w Σ_{i=1}^{m} (f(xᵢ, w) − yᵢ)⁴   (9)
= Σ_{i=1}^{m} ∇_w (f(xᵢ, w) − yᵢ)⁴   (10)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)   (chain rule)   (11)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w(xᵢᵀw)   (12)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ xᵢ   (13)
Question 4
Repeat Question 3 using f(x, w) = σ(xᵀw), where σ(a) = 1/(1 + exp(−βa)).
Answer:
∇_w C(w) = ∇_w Σ_{i=1}^{m} (f(xᵢ, w) − yᵢ)⁴   (14)
= Σ_{i=1}^{m} ∇_w (f(xᵢ, w) − yᵢ)⁴   (15)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)   (chain rule)   (16)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w σ(xᵢᵀw)   (17)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) ∇_w(xᵢᵀw)   (chain rule)   (18)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) xᵢ   (19)
So we just have to evaluate ∂σ(a)/∂a and plug it into the above equation. Note that ∂σ(a)/∂a is evaluated at a = xᵢᵀw, so
∂σ(a)/∂a = ∂/∂a [1/(1 + exp(−βa))]   (20)
= −[1/(1 + e^{−βa})²] ∂(1 + e^{−βa})/∂a   (21)
= β e^{−βa}/(1 + e^{−βa})²   (22)
= β (1 + e^{−βa} − 1)/(1 + e^{−βa})²   (23)
= β [1/(1 + e^{−βa}) − 1/(1 + e^{−βa})²]   (24)
= β [σ(a) − σ²(a)]   (25)
= β σ(a)(1 − σ(a))   (26)
= β σ(xᵢᵀw)(1 − σ(xᵢᵀw))   (27)
Therefore,
∇_w C(w) = Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ β σ(xᵢᵀw)(1 − σ(xᵢᵀw)) xᵢ   (28)
Question 5
Repeat Question 3 using f (x, w) = σ(xT w), where σ(a) = max(0, a)
Answer:
∇_w C(w) = ∇_w Σ_{i=1}^{m} (f(xᵢ, w) − yᵢ)⁴   (29)
= Σ_{i=1}^{m} ∇_w (f(xᵢ, w) − yᵢ)⁴   (30)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)   (chain rule)   (31)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ ∇_w σ(xᵢᵀw)   (32)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) ∇_w(xᵢᵀw)   (chain rule)   (33)
= Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) xᵢ   (34)
For σ(a) = max(0, a), ∂σ(a)/∂a = δ(xᵢᵀw > 0)   (35)
Therefore, we get
∇_w C(w) = Σ_{i=1}^{m} 4(f(xᵢ, w) − yᵢ)³ xᵢ δ(xᵢᵀw > 0)   (36)
EE2211 Tutorial 9
Answer:
Let’s assume class 1, class 2 and class 3 correspond to red triangles, orange squares and blue circles
respectively.
• For node A, p₁ = 5/18, p₂ = 5/18, p₃ = 8/18 = 4/9
• For node B, p₁ = 4/10 = 2/5, p₂ = 0/10 = 0, p₃ = 6/10 = 3/5
• For node C, p₁ = 1/8, p₂ = 5/8, p₃ = 2/8 = 1/4
For Gini impurity, recall that the formula is 1 − Σᵢ pᵢ²
• Node A: 1 − (5/18)² − (5/18)² − (4/9)² = 0.6481
• Node B: 1 − (2/5)² − (0)² − (3/5)² = 0.48
• Node C: 1 − (1/8)² − (5/8)² − (1/4)² = 0.5312
• Overall Gini at depth 1: (10/18) × 0.48 + (8/18) × 0.5312 = 0.5028
Observe the decrease in Gini impurity from root (0.6481) to depth 1 (0.5028)
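For reference, these numbers can be reproduced with a few lines of Python, using the class counts read off above:

# Gini impurity from the class counts at each node
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([5, 5, 8]), 4))    # node A (root): 0.6481
print(round(gini([4, 0, 6]), 4))    # node B: 0.48
print(round(gini([1, 5, 2]), 4))    # node C: 0.5312
# weighted Gini at depth 1
print(round(10/18 * gini([4, 0, 6]) + 8/18 * gini([1, 5, 2]), 4))   # 0.5028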
Answer:
At depth 1, when x ≤ 5:
• y = {2, 3, 2.5, 1, 2.3, 2.8, 1.5} => ȳ = 2.1571
• MSE = (1/7)((2 − ȳ)² + (3 − ȳ)² + (2.5 − ȳ)² + (1 − ȳ)² + (2.3 − ȳ)² + (2.8 − ȳ)² + (1.5 − ȳ)²) = 0.4367
Overall MSE at depth 1: (6/13) × 0.5958 + (7/13) × 0.4367 = 0.5102
At the root:
• y = {2, 3, 2.5, 1, 2.3, 2.8, 1.5, 2.6, 3.5, 4, 3.5, 5, 4.5} => ȳ = 2.9385
• MSE = (1/13)((2.6 − ȳ)² + (3.5 − ȳ)² + (4 − ȳ)² + (3.5 − ȳ)² + (5 − ȳ)² + (4.5 − ȳ)² + (2 − ȳ)² + (3 − ȳ)² + (2.5 − ȳ)² + (1 − ȳ)² + (2.3 − ȳ)² + (2.8 − ȳ)² + (1.5 − ȳ)²) = 1.2224
Therefore, MSE has decreased from 1.2224 at the root to 0.5102 at depth 1
(Regression tree, Python)
Question 3:
Import the California Housing dataset “from sklearn.datasets import
fetch_california_housing” and “housing = fetch_california_housing()”. This
data set contains 8 features and 1 target variable listed below. Use “MedInc” as the input feature and
“MedHouseVal” as the target output. Fit a regression tree to depth 2 and compare your results with
results generated by “from sklearn.tree import DecisionTreeRegressor” using the
“squared error” criterion.
Target: ['MedHouseVal']
Features:['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude']
Answer:
Please refer to Tut9_Q3_zhou.py. We can exactly replicate the results from scikit-learn. Note that in the
plot below, the blue dots are the training datapoints. The curves from scikit-learn (black line) and our own
tree (red dashed line) are on top of each other, so they might be hard to tell apart.
Answer:
Question 1:
We have two classifiers showing the same accuracy with the same cross-validation. The more complex
model (such as a 9th-order polynomial model) is preferred over the simpler one (such as a 2nd-order
polynomial model).
a) True
b) False
Answer: b).
Question 2:
According to the plots below, the Gini Coefficient is equal to Two times the Area Under the ROC minus One.
a) True
b) False
Answer: a).
Reason: Since the area (A+B) = ½ (half the square box of area 1), Gini-coefficient = A/(A+B) = 2A.
AUC = A + ½ => A = AUC – 0.5. Substitute A into the Gini above: Gini-coefficient = 2(AUC – 0.5).
Question 2:
We have 3 parameter candidates for a classification model, and we would like to choose the optimal one for
deployment. As such, we run 5-fold cross-validation.
Once we have completed the 5-fold cross-validation, in total, we have trained _______ classifiers. Note that we
treat models with different parameters as different classifiers.
A) 10
B) 20
C) 25
D) 15
Answer: D)
In each fold we train 3 classifiers, so 5 folds give 15 classifiers.
Question 3:
Suppose the binary classification problem, which you are dealing with, has highly imbalanced classes. The
majority class has 99 hundred samples and the minority class has 1 hundred samples. Which of the
following metric(s) would you choose for assessing the classification performance? (Select all relevant
metric(s) to get full credit)
a) Classification Accuracy
b) Cost sensitive accuracy
c) Precision and recall
d) None of these
Answer: (b, c)
Question 4:
Given below is a scenario for Training error rate Tr, and Validation error rate Va for a machine learning
algorithm. You want to choose a hyperparameter (P) based on Tr and Va.
P Tr Va
10 0.10 0.25
9 0.30 0.35
8 0.22 0.15
7 0.15 0.25
6 0.18 0.15
Which value of P will you choose based on the above table?
a) 10
b) 9
c) 8
d) 7
e) 6
Answer: e).
(b) Three-category problem (the class-1, class-2 and class-3 data points are respectively indicated by
squares, circles and triangles).
Answer:
(a)
                    Predicted class 1   Predicted class 2
Actual class 1             16                  4
Actual class 2              4                 26
(b)
                    Predicted class 1   Predicted class 2   Predicted class 3
Actual class 1             16                  3                   1
Actual class 2              1                 25                   4
Actual class 3              3                  1                   6
(5-fold Cross-validation)
Question 6:
Get the data set “from sklearn.datasets import load_iris”. Perform a 5-fold Cross-validation to observe the
best polynomial order (among orders 1 to 10 and without regularization) for validation prediction. Note
that, you will have to partition the whole dataset for training/validation/test parts, where the size of
validation set is the same as that of test. Provide a plot of the average 5-fold training and validation error
rates over the polynomial orders. The randomly partitioned data sets of the 5-fold shall be maintained for
reuse in evaluation of future algorithms.
Answer:
##--- load data from scikit ---##
import numpy as np
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
iris_dataset = load_iris()
X = np.array(iris_dataset['data'])
y = np.array(iris_dataset['target'])
## one-hot encoding
Y = list()
for i in y:
letter = [0, 0, 0]
letter[i] = 1
Y.append(letter)
Y = np.array(Y)
test_Idx = np.random.RandomState(seed=2).permutation(Y.shape[0])
X_test = X[test_Idx[:25]]
Y_test = Y[test_Idx[:25]]
X = X[test_Idx[25:]]
Y = Y[test_Idx[25:]]
Question 1
The K-means clustering method uses the target labels for calculating the distances from the cluster centroids for clustering.
a) True
b) False
Ans: b) because target labels are not available in clustering.
Question 2
The fuzzy C-means algorithm groups the data items such that an item can exist
in multiple clusters.
a) True
b) False
Ans: a).
Question 3
How can you prevent a clustering algorithm from getting stuck in bad local
optima?
a) Set the same seed value for each run
b) Use the bottom ranked samples for initialization
c) Use the top ranked samples for initialization
d) All of the above
e) None of the above
Ans: e).
Question 4
Consider the following data points: x = [1, 1]ᵀ, y = [0, 1]ᵀ, and z = [0, 0]ᵀ. The k-means algorithm is initialized
with centers at x and y. Upon convergence, the two centres will be at
a) x and z
b) x and y
c) y and the midpoint of y and z
d) z and the midpoint of x and y
e) None of the above
Ans: e). The converged centers should be x and the midpoint of y and z.
import numpy as np

# Data points
x = np.array([1, 1])
y = np.array([0, 1])
z = np.array([0, 0])

data_points = np.array([x, y, z])

# Initial centers
centers = np.array([x, y])


def k_means(data_points, centers, n_clusters, max_iterations=100, tol=1e-4):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels


centers, labels = k_means(data_points, centers, n_clusters=2)
print("Converged centers:", centers)
Question 5
Consider the following 8 data points: x1 = [0, 0]ᵀ, x2 = [0, 1]ᵀ, x3 = [1, 1]ᵀ, x4 = [1, 0]ᵀ, x5 = [3, 0]ᵀ,
x6 = [3, 1]ᵀ, x7 = [4, 0]ᵀ, and x8 = [4, 1]ᵀ. The k-means algorithm is initialized with centers at c1 = [0, 0]ᵀ
and c2 = [3, 0]ᵀ. The first center after convergence is c1 = [0.5, 0.5]ᵀ. The second centre after convergence is
c2 = [blank1, blank2]ᵀ.
Answer: blank1 = 3.5, blank2 = 0.5.
import numpy as np

# Data points
x1 = np.array([0, 0])
x2 = np.array([0, 1])
x3 = np.array([1, 1])
x4 = np.array([1, 0])
x5 = np.array([3, 0])
x6 = np.array([3, 1])
x7 = np.array([4, 0])
x8 = np.array([4, 1])

data_points = np.array([x1, x2, x3, x4, x5, x6, x7, x8])

# Initial centers
c1_init = np.array([0, 0])
c2_init = np.array([3, 0])

centers = np.array([c1_init, c2_init])

def k_means(data_points, centers, n_clusters, max_iterations=100, tol=1e-4):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels


centers, labels = k_means(data_points, centers, n_clusters=2)
print("Converged centers:", centers)
Question 6
Generate three clusters of data using the following code.
import random as rd
import numpy as np # linear algebra
from matplotlib import pyplot as plt
# Generate data
# Set three centers, the model should predict similar results
center_1 = np.array([2,2])
center_2 = np.array([4,4])
center_3 = np.array([6,1])
(The answer reuses the k_means function from Question 5 on the generated data, and then plots the clustering result:)

centers, labels = k_means(data, centers, n_clusters=k)
print("Converged centers:", centers)
plt.title('Clustering Results')
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', alpha=0.5)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', s=200, c='k')
plt.show()
Figure 1: K=3
Figure 2: K=5
Question 7
Load the iris data from sklearn.datasets import load iris. Assume that
the class labels are not given. Use the Naı̈ve K-means clustering algorithm to
group all the data based on K = 3. How accurate is the result of clustering
comparing with the known labels?
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np

# load the iris dataset
iris = load_iris()

# get the data and the true labels
data = iris.data
y_true = iris.target

# initialize the KMeans centers with K=3
k = 3
centers = data[np.random.choice(len(data), k, replace=False)]

def k_means(data_points, centers, n_clusters, max_iterations=1000, tol=1e-6):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

centers, y_pred = k_means(data, centers, n_clusters=k)

# create a mask that selects elements where the value is 0, 1, 2
mask_0 = (y_pred == 0)
mask_1 = (y_pred == 1)
mask_2 = (y_pred == 2)

y_pred0 = y_pred.copy()
y_pred0[mask_0] = 0
y_pred0[mask_1] = 1
y_pred0[mask_2] = 2

y_pred1 = y_pred.copy()
y_pred1[mask_0] = 0
y_pred1[mask_1] = 2
y_pred1[mask_2] = 1

y_pred2 = y_pred.copy()
y_pred2[mask_0] = 1
y_pred2[mask_1] = 0
y_pred2[mask_2] = 2

y_pred3 = y_pred.copy()
y_pred3[mask_0] = 1
y_pred3[mask_1] = 2
y_pred3[mask_2] = 0

y_pred4 = y_pred.copy()
y_pred4[mask_0] = 2
y_pred4[mask_1] = 0
y_pred4[mask_2] = 1

y_pred5 = y_pred.copy()
y_pred5[mask_0] = 2
y_pred5[mask_1] = 1
y_pred5[mask_2] = 0

# calculate the accuracy of the clustering (best over all label permutations)
accuracy = 0.0
for pred in [y_pred0, y_pred1, y_pred2, y_pred3, y_pred4, y_pred5]:
    accuracy = max([accuracy_score(y_true, pred), accuracy])

print("Accuracy of clustering: {:.2f}".format(accuracy))
EE2211 Tutorial 12
Question 1: The convolutional neural network is particularly useful for applications related to image and
text processing due to its dense connections.
a) True
b) False
Ans: b).
Question 2: In neural networks, nonlinear activation functions such as sigmoid, and ReLU
a) speed up the gradient calculation in backpropagation, as compared to linear units
b) are applied only to the output units
c) help to introduce non-linearity into the model
d) always output values between 0 and 1
Ans: c.
(MLP classifier, find the best hidden node size, assuming same hidden layer size in each layer, based on
cross-validation on the training set and then use it for testing)
Question 5:
Obtain the data set “from sklearn.datasets import load_iris”.
(a) Split the database into two sets: 80% of samples for training, and 20% of samples for testing
using random_state=0
(b) Perform a 5-fold Cross-validation using only the training set to determine the best 3-layer
MLPClassifier (from sklearn.neural_network import MLPClassifier
with hidden_layer_sizes=(Nhidd,Nhidd,Nhidd) for Nhidd in
range(1,11))* for prediction. In other words, partition the training set into two sets, 4/5 for
training and 1/5 for validation; and repeat this process until each of the 1/5 has been validated.
Provide a plot of the average 5-fold training and validation accuracies over the different network
sizes.
(c) Find the size of Nhidd that gives the best validation accuracy for the training set.
(d) Use this Nhidd in the MLPClassifier with
hidden_layer_sizes=(Nhidd,Nhidd,Nhidd) to compute the prediction accuracy
based on the 20% of samples for testing in part (a).
* The assumption of hidden_layer_sizes=(Nhidd,Nhidd,Nhidd)is to reduce the search
space in this exercise. In field applications, the search should take different sizes for each hidden layer.
Answer:
## load data from scikit
import numpy as np
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier # neural network
from sklearn import metrics
def find_network_size(X_train, y_train):
    acc_train_array = []
    acc_valid_array = []
    for Nhidd in range(1,11):
        acc_train_array_fold = []
        acc_valid_array_fold = []
        ## Random permutation of data
        Idx = np.random.RandomState(seed=8).permutation(len(y_train))
        ## Tuning: perform 5-fold cross-validation on the training set to determine the best network size
        for k in range(0,5):
            N = np.around((k+1)*len(y_train)/5)
            N = N.astype(int)
            Xvalid = X_train[Idx[N-24:N]]  # validation features
            Yvalid = y_train[Idx[N-24:N]]  # validation targets
            Idxtrn = np.setdiff1d(Idx, Idx[N-24:N])
            Xtrain = X_train[Idxtrn]  # training features in tuning loop
            Ytrain = y_train[Idxtrn]  # training targets in tuning loop
            ## MLP Classification with same size for each hidden layer (specified in question)
            clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                                hidden_layer_sizes=(Nhidd,Nhidd,Nhidd), random_state=1)
            clf.fit(Xtrain, Ytrain)
            ## trained output
            y_est_p = clf.predict(Xtrain)
            acc_train_array_fold += [metrics.accuracy_score(y_est_p,Ytrain)]
            ## validation output
            yt_est_p = clf.predict(Xvalid)
            acc_valid_array_fold += [metrics.accuracy_score(yt_est_p,Yvalid)]
        acc_train_array += [np.mean(acc_train_array_fold)]
        acc_valid_array += [np.mean(acc_valid_array_fold)]
    ## find the size that gives the best validation accuracy
    Nhidden = np.argmax(acc_valid_array,axis=0)+1
    ## plotting
    import matplotlib.pyplot as plt
    hiddensize = [x for x in range(1,11)]
    plt.plot(hiddensize, acc_train_array, color='blue', marker='o', linewidth=3, label='Training')
    plt.plot(hiddensize, acc_valid_array, color='orange', marker='x', linewidth=3, label='Validation')
    plt.xlabel('Number of hidden nodes in each layer')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracies')
    plt.legend()
    plt.show()
    return Nhidden
## load data
iris_dataset = load_iris()
## split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],
iris_dataset['target'],
test_size=0.20,
random_state=0)
## find the best hidden node size using only the training set
Nhidden = find_network_size(X_train, y_train)
print('best hidden node size =', Nhidden, 'based on 5-fold cross-validation on training set')
## perform evaluation
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(Nhidden,Nhidden,Nhidden),
random_state=1)
clf.fit(X_train, y_train)
## trained output
y_test_predict = clf.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test_predict,y_test)
print('test accuracy =', test_accuracy)
>> best hidden node size = 6 based on 5-fold cross-validation on training set
>> test accuracy = 1.0
Results:
Accuracy for each fold:
> 98.583
> 98.425
> 98.342
> 98.575
> 98.592
Accuracy: mean=98.503 std=0.102, n=5
Improved version (network of larger size):
Accuracy for each fold:
> 98.992
> 98.717
> 98.925
> 99.233
> 98.875
Accuracy: mean=98.948 std=0.169, n=5