
EE2211 Tutorial 1

Question 1:
What is the difference between ML (Machine Learning) and AI (Artificial Intelligence)?
Suggested discussion: Artificial Intelligence is the broader concept of machines being able to carry out tasks
in a way that we would consider “smart”. And, Machine Learning is a current application of AI based around
the idea that we should really just be able to give machines access to data and let them learn for themselves.
https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#741adc6b2742

Ref: https://www.vsinghbisen.com/technology/ai/difference-between-artificial-intelligence-and-machine-learning/

Question 2:
Which of the following is the most reasonable definition of machine learning?
(a) Machine learning is the field of allowing robots to act intelligently.
(b) Machine learning is the science of programming computers.
(c) Machine learning only learns from unlabeled data.
(d) Machine learning is the field of study that gives computers the ability to learn without being explicitly
programmed.
Ans: (d)

Question 3:
A computer program is said to learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot
of historical weather data, and have it learn to predict weather. In this setting what is T?
(a) The historical weather data.
(b) The probability of it correctly predicting a future data’s weather.
(c) The weather prediction task.
(d) None of these.
Ans: (c)

Question 4:
Suppose you are working on weather prediction and use a learning algorithm to predict tomorrow’s temperature (in
degrees Centigrade/Fahrenheit).
(i) Would you treat this as a classification or a regression problem?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) None of these.
(ii) What kind of data should you gather?
Ans: (i)(a)
(ii) Weather forecasts are made by collecting quantitative data (e.g., changes in barometric pressure, current weather
conditions, and sky condition or cloud cover) about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.

Question 5:
You want to develop learning algorithms to address each of the following two problems.
P1: You’d like the software to examine your email accounts, and decide whether each email is a spam or not.
P2: You have a large quantity of green tea (e.g., 1000kg) with a record of previous sales. You want to predict how
much of it will sell over the next 6 months.
Should you treat these as classification or as regression problems?
(a) Treat both P1 and P2 as regression problems.
(b) Treat both P1 and P2 as classification problems.
(c) Treat P1 as a regression problem and P2 as a classification problem.
(d) Treat P1 as a classification problem and P2 as a regression problem.
Ans: (d)

Question 6:
Suppose you are working on stock market prediction. Typically tens of millions of shares of a company’s stock are
traded each day. You would like to predict the number of shares that will be traded tomorrow.
(i) Would you treat this as a classification or a regression problem?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) None of these.
(ii) If the data you have collected involved millions of attributes, what would you do?
Ans: (i)(a), (ii)(extract relevant features)

Question 7:
Some of the problems below are best addressed using a supervised learning algorithm, and the others with an
unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that
apply) Assume some appropriate dataset is available for your algorithm to learn from.
(a) Determine whether there are vocals (i.e., a human voice singing) in each audio clip extracted from a piece of
music, or it is a clip of only musical instruments and no vocals.
(b) Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the
treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of
how they respond to the drug, and if so what these categories are.
(c) Given a large dataset of medical records of patients suffering from heart disease, try to learn whether there
might be different clusters of such patients for which we might tailor separate treatments.
(d) Given a set of data which contains the diet and the occurrence of diabetes from a population over a 10-year
period. Predict the odds of a person developing diabetes over the next 10 years.
Ans: (a), (d)
Question 8:
Suppose you are working on a machine learning algorithm to predict if a patient is COVID-19 infected according to
the patient’s symptomatic data, such as fever, dry cough, tiredness, aches and pains, sore throat, diarrhoea,
conjunctivitis, and headache etc. What are the Task, Performance, and Experience involved according to the
definition of machine learning?

Ans: (please refer to the definition of Task, Performance, and Experience in the lecture notes)
Task: patient classification into ‘infected’ or ‘uninfected’
Performance: accuracy of classification
Experience: patient’s symptomatic data with actual diagnosis

Question 9:
We use labelled data for supervised learning, where the labels are used as the desired target of prediction for
classifiers. Which of the following are useful labelled data?
(a) To build an image object classifier to discriminate between apple and orange, we have many fruit images
labelled with the country of origin.
(b) To build a system to predict the number of COVID cases for tomorrow given the past daily record, we have
a collection of daily data for a period of 12 months.
(c) To build a classifier to automatically evaluate student essays, we have collected a set of student essays that
have not been graded by teachers.

Ans:
(a) The useful fruit images should be labelled as apple or orange. The country of origin does not tell us whether the
fruit is an apple or an orange; therefore, this data is not useful.
(b) We can use n days of historical data as the input, and n+1th day’s data as the target. This dataset is useful;
(c) The useful dataset should include student essays and the grades. Student essays are the input, and the grades
are the desired target of prediction. This dataset is not useful.

Question 10:
Determine whether each of the following is “inductive” or “deductive” reasoning?
(a) The first coin I pulled from the bag is a penny. The second and the third coins from the bag are also pennies.
Therefore, all the coins in the bag are pennies.
(b) All men are mortal. Harold is a man. Therefore, Harold is mortal.
Ans: (a) inductive, (b) deductive.

Question 11:
Find a problem of your interest and formulate it as a machine learning problem. List out the input features and output
response and provide your choice regarding the types of learning (such as supervised or unsupervised learning,
classification or regression, clustering or dimensionality reduction).

Some Python Resources


Installing scikit-learn (Ref: [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine Learning with
Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017)

scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and interactive development,
you should also install matplotlib, IPython, and the Jupyter Notebook. We recommend using the following
prepackaged Python distribution, which provides the necessary packages:
Anaconda
A Python distribution made for large-scale data processing, predictive analytics, and scientific computing.
Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook, and scikit-learn. Available
on Mac OS, Windows, and Linux, it is a very convenient solution and is the one we suggest for people without
an existing installation of the scientific Python packages. Anaconda now also includes the commercial Intel
MKL library for free. Using MKL (which is done automatically when Anaconda is installed) can give significant
speed improvements for many algorithms in scikit-learn.
Some tutorials that might be useful:
A quickstart tutorial on NumPy: https://numpy.org/devdocs/user/quickstart.html
Some community tutorials on Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
Scikit-learn tutorials: https://scikit-learn.org/stable/tutorial/index.html
EE2211 Tutorial 2 (Python coding)

(Data Reading and Visualization, simple data structure)


Question 1:
A Comma Separated Values (CSV) file is a plain text file that contains a list of data. These files are often used for
exchanging data between different applications. Download the file “government-expenditure-on-education.csv”
from https://data.gov.sg/dataset/government-expenditure-on-education. Plot the educational expenditure over the
years. (Hint: you might need “import pandas as pd” and “import matplotlib.pyplot as plt”.)

Ans:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("government-expenditure-on-education.csv")
expenditureList = df['total_expenditure_on_education'].tolist()
yearList = df['year'].tolist()
plt.plot(yearList, expenditureList, label='Expenditure over the years')
plt.xlabel('Year')
plt.ylabel('Expenditure')
plt.title('Education Expenditure')
plt.show()
(Data Reading and Visualization, slightly more complicated data structure)
Question 2:
Download the CSV file from https://data.gov.sg/dataset/annual-motor-vehicle-population-by-vehicle-type.
Extract and plot the number of Omnibuses, Excursion buses and Private buses over the years as shown below.
(Hint: you might need “import pandas as pd” and “import matplotlib.pyplot as plt”.)

Ans:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("annual-motor-vehicle-population-by-vehicle-type.csv")
year = df['year'].tolist()
category = df['category'].tolist()
vehtype = df['type'].tolist()
number = df['number'].tolist()
val1 = df.loc[df['type']=='Omnibuses'].index
val2 = df.loc[df['type']=='Excursion buses'].index
val3 = df.loc[df['type']=='Private buses'].index
print(val1)
List1 = df.loc[val1]
print(List1)
List2 = df.loc[val2]
print(List2)
List3 = df.loc[val3]
print(List3)
plt.plot(List1['year'], List1['number'], label='Number of Omnibuses')
plt.plot(List2['year'], List2['number'], label='Number of Excursion buses')
plt.plot(List3['year'], List3['number'], label='Number of Private buses')
plt.xlabel('Year')
plt.ylabel('Number of vehicles')
#plt.xticks(List1['year'])
plt.title('Number of vehicles over the years')
plt.legend()
(Data Reading and Visualization, distribution)
Question 3:
The “iris” flower data set consists of measurements such as the length, width of the petals, and the length, width of the
sepals, all measured in centimeters, associated with each iris flower. Get the data set “from sklearn.datasets import
load_iris” and do a scatter plot as shown below. (Hint: you might need “from pandas.plotting import
scatter_matrix”)

Ans:
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
iris_dataset = load_iris()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
from pandas.plotting import scatter_matrix
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                                 marker='o', hist_kwds={'bins': 20})
(Data Wrangling/Normalization)
Question 4:
You are given a set of data for supervised learning. A sample block of data looks like this:
“ 1.2234, 0.3302, 123.50, 0.0081, 30033.81, 1
1.3456, 0.3208, 113.24, 0.0067, 29283.18, -1
0.9988, 0.2326, 133.45, 0.0093, 36034.33, 1
1.1858, 0.4301, 128.55, 0.0077, 34037.35, 1
1.1533, 0.3853, 116.70, 0.0066, 22033.58, -13
1.2755, 0.3102, 118.30, 0.0098, 30183.65, 1
1.0045, 0.2901, 123.52, 0.0065, 31093.98, -1
1.1131, 0.3912, 113.15, 0.0088, 29033.23, -1 ”
Each row corresponds to a sample data measurement with 5 input features and 1 response.
(a) What kind of undesired effect can you anticipate if this set of raw data is used for learning?
(b) How can the data be preprocessed to handle this issue?

Ans:
(a) Those features with very large values may overshadow those with very small values.
(b) We can either use min-max or z-score normalization to resolve the problem.
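A minimal sketch of both options using scikit-learn's MinMaxScaler and StandardScaler (the rows below are the first three samples of the block above, input features only):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# first three rows of the sample block (5 input features, response excluded)
X = np.array([[1.2234, 0.3302, 123.50, 0.0081, 30033.81],
              [1.3456, 0.3208, 113.24, 0.0067, 29283.18],
              [0.9988, 0.2326, 133.45, 0.0093, 36034.33]])

# min-max normalization: each feature column is rescaled to [0, 1]
print(MinMaxScaler().fit_transform(X))

# z-score normalization: each feature column has zero mean and unit variance
print(StandardScaler().fit_transform(X))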

(Missing Data)
Question 5:
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given
medical details. Download the Pima-Indians-Diabetes data from
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input
variables and 1 output variable. The variable names are as follows:
0. Number of times pregnant.
1. Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).
(a) Print the summary statistics of this data set.
(b) Count the number of “0” entries in columns [1,2,3,4,5].
(c) Replace these “0” values by “NaN”.
(Hint: you might need the “.describe()” and “.replace(0, numpy.NaN)” functions “from
pandas import read_csv”.)

Ans:
#(a)
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())
#(b)
print((dataset[[1,2,3,4,5]] == 0).sum())
#(c)
import numpy
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))
print(dataset.isnull().sum())
(In Quiz and Exam format)
Question 6:

Disease Outbreak Response System Condition (DORSCON) in Singapore is a colour-coded framework that shows the
current disease situation. The framework provides us with general guidelines on what needs to be done to prevent and
reduce the impact of infections. There are 4 statuses – Green, Yellow, Orange and Red, depending on the severity and
spread of the disease. Which type of data does DORSCON belong to ?
(1) Categorical; (2) Ordinal; (3) Continuous; (4) Interval

Ans: (2) Ordinal. The four statuses are categories with a natural ordering by severity (they are also categorical in the broad sense).

(In Quiz and Exam format)


Question 7:

A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the
maximum, _BLANK1_, and the first and third quartiles, where the number of data points that fall between the first and
third quartiles amounts to _BLANK2_ percent of the total number of data on display.
Ans:
_BLANK1_: Median
_BLANK2_: 50%
EE2211 Tutorial 3

(Probability Mass Function)


Question 1:
The random variable N has probability mass function (PMF)
$$P_N(n) = \begin{cases} c\left(\tfrac{1}{2}\right)^n, & n = 0, 1, 2 \\ 0, & \text{otherwise} \end{cases}$$
(a) What is the value of the constant c?
(b) What is Pr[N ≤ 1]?

Answer:
(a) We wish to find the value of c that makes the PMF sum to one:
$$\sum_{n=0}^{2} P_N(n) = c + \frac{c}{2} + \frac{c}{4} = 1, \quad \text{implying} \quad c = \frac{4}{7}.$$

(b) The probability that N ≤ 1 is


Pr [N ≤ 1] = Pr [N = 0] + Pr [N = 1] = 4/7 + 2/7 = 6/7 .

(Probability Density Function)


Question 2:
The random variable X has probability density function (PDF)
$$f_X(x) = \begin{cases} cx, & 0 \le x \le 2 \\ 0, & \text{otherwise.} \end{cases}$$
Use the PDF to find
(a) the constant c,
(b) Pr[0 ≤ X ≤ 1],
(c) Pr[−1/2 ≤ X ≤ 1/2].

Answer:
(a) From the above PDF we can determine the value of c by integrating the PDF and setting it equal to 1.
$$\int_0^2 cx\,dx = c\left[\frac{x^2}{2}\right]_0^2 = c\left(\frac{4}{2}\right) = 2c = 1$$
Therefore c = 1/2.

(b) $\Pr[0 \le X \le 1] = \int_0^1 \frac{x}{2}\,dx = \frac{1}{4}$

(c) $\Pr[-1/2 \le X \le 1/2] = \int_0^{1/2} \frac{x}{2}\,dx = \frac{1}{16}$
(Bayes’ rule)
Question 3:
Let A = {resistor is within 50Ω of the nominal value}. The probability that a resistor is from machine B is Pr[B] = 0.3.
The probability that a resistor is acceptable, i.e., within 50 Ω of the nominal value, is Pr[A] = 0.78. Given that a resistor
is from machine B, the conditional probability that it is acceptable is Pr[A|B] = 0.6. What is the probability that an
acceptable resistor comes from machine B?
Answer:
We are given the event A that a resistor is within 50 Ω of the nominal value, and we need to find Pr[B|A]. Using Bayes'
theorem, we have
$$\Pr(B|A) = \frac{\Pr(A|B)\Pr(B)}{\Pr(A)}.$$

Since all of the quantities we need are given in the problem description, our answer is
Pr(B|A) = (0.6)(0.3)/(0.78) ≈ 0.23.
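The same arithmetic can be checked with one line of Python:

# Bayes' rule: Pr(B|A) = Pr(A|B) * Pr(B) / Pr(A)
print(round(0.6 * 0.3 / 0.78, 4))   # 0.2308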

(Discrete random variable in Python)


Question 4:
Consider tossing a fair six-sided die. There are only six outcomes possible, Ω = {1, 2, 3, 4, 5, 6}. Suppose we toss two
dice and assume that each throw is independent.
(a) What is the probability that the sum of the dice equals seven?
i. List out all pairs of possible outcomes together with their sums from the two throws.
(hint: enumerate all the items in range(1,7))
ii. Collect all of the (a, b) pairs that sum to each of the possible values from two to twelve (including
the sum equals seven). (hint: use dictionary from collections import defaultdict to
collect all of the (a, b) pairs that sum to each of the possible values from two to twelve)
(b) What is the probability that half the product of three dice will exceed their sum?

Answer:
Ref: Python for Probability, Statistics, and Machine Learning, Unpingco, José (pp.37-42).
(a) The first thing to do is characterize the measurable function for this as X : (a, b) → (a + b). Next, we associate all
of the (a, b) pairs with their sum. A Python dictionary can be created like this:
#(i)
d={(i,j):i+j for i in range(1,7) for j in range(1,7)}

The next step is to collect all of the (a, b) pairs that sum to each of the possible values from two to twelve.
# (ii) collect all of the (a, b) pairs that sum to each of the possible values
# from two to twelve
from collections import defaultdict
dinv = defaultdict(list)
for i,j in d.items(): dinv[j].append(i)

For example, dinv[7] contains the following list of pairs that sum to seven, [(1, 6), (2, 5), (5, 2),
(6, 1), (4, 3), (3, 4)]. The next step is to compute the probability measured for each of these items.
Using the independence assumption, this means we have to compute the sum of the products of the individual
item probabilities in dinv. Because we know that each outcome is equally likely, the probability of every term
in the sum equals 1/36.
# Compute the probability measured for each of these items
# including the sum equals seven
X={i:len(j)/36. for i,j in dinv.items() }
print(X)
{2: 0.027777777777777776, 3: 0.05555555555555555, 4: 0.08333333333333333, 5:
0.1111111111111111, 6: 0.1388888888888889, 7: 0.16666666666666666, 8:
0.1388888888888889, 9: 0.1111111111111111, 10: 0.08333333333333333, 11:
0.05555555555555555, 12: 0.027777777777777776}
(b) What is the probability that half the product of three dice will exceed their sum?
Using the same method above, we create the first mapping as follows:
d={(i,j,k):((i*j*k)/2>i+j+k) for i in range(1,7)
for j in range(1,7)
for k in range(1,7)}

The keys of this dictionary are the triples and the values are the logical values of whether or not half the product of
three dice exceeds their sum. Now, we do the inverse mapping to collect the corresponding lists,
dinv = defaultdict(list)
for i,j in d.items(): dinv[j].append(i)

Note that dinv contains only two keys, True and False. Again, because the dice are independent, the probability of
any triple is 1/6³ = 1/216. Finally, we collect this for each outcome as in the following,
X={i:len(j)/6.0**3 for i,j in dinv.items() }
print(X)
{False: 0.37037037037037035, True: 0.6296296296296297}

(Continuous random variable in Python)


Question 5:
Assuming a normal (Gaussian) distribution with mean 30 Ω and standard deviation of 1.8 Ω, determine the
probability that a resistor coming off the production line will be within the range of 28 Ω to 33 Ω. (Hint: use
stats.norm.cdf function from scipy import stats)

Answer:
Wiki: The cumulative distribution function of a real-valued random variable X is the function given by
$$F_X(x) = \Pr(X \le x), \qquad (1)$$
where the right-hand side represents the probability that the random variable X takes on a value less than or equal
to x. The probability that X lies in the semi-closed interval (a, b], where a < b, is therefore
$$\Pr(a < X \le b) = F_X(b) - F_X(a). \qquad (2)$$
from scipy import stats

# define constants
mu = 30        # mean = 30 Ω
sigma = 1.8    # standard deviation = 1.8 Ω
x1 = 28        # lower bound = 28 Ω
x2 = 33        # upper bound = 33 Ω

## calculate probabilities
# probability from Z=0 to lower bound
p_lower = stats.norm.cdf(x1, mu, sigma)
# probability from Z=0 to upper bound
p_upper = stats.norm.cdf(x2, mu, sigma)
# probability of the interval
Prob = (p_upper) - (p_lower)

# print the results
print('\n')
print(f'Normal distribution: mean = {mu}, std dev = {sigma} \n')
print(f'Probability of occurring between {x1} and {x2}: ')
print(f'--> inside interval Pin = {round(Prob*100,1)}%')
print(f'--> outside interval Pout = {round((1-Prob)*100,1)}% \n')
print('\n')

(Correlation versus Causation)


Question 6:
For each of the following graphs,
(i) State what you think the evidence is trying to suggest. Is there correlation or not?
(ii) Give a reason why you agree or disagree with what the evidence is suggesting.
(iii) Identify whether the variable of the y-axis and the variable of the x-axis are correlated and/or causal?

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module1BDescriptiveStudies_and_Statistics/PH717-Module1B-DescriptiveStudies_and_Statistics6.html
Suggested Discussion:
(i) Colon cancer is correlated to the amount of daily meat consumption.
(ii) There is a clear linear trend; countries with the lowest meat consumption have the lowest rates of colon
cancer, and the colon cancer rate among these countries progressively increases as meat consumption increases.
(iii) Probably causal.

Question 7: (Multiple responses – one or more answers are correct)
If A and B are correlated, but they're actually caused by C, which of the following statements are correct?

Ans:
a) A and C are correlated (Yes, A and C are correlated because A is caused by C)
b) B and C are correlated (Yes, B and C are correlated because B is caused by C)
c) A causes B to happen (No, A and B share the confounding factor C, but A and B don't have a causal relationship)
d) A causes C to happen (No, C causes A to happen; however, we are not sure if A causes C to happen)

Question 8: (Multiple responses – one or more answers are correct)
We toss a coin and observe which side is facing up. Which of the following statements represent valid probability
assignments for observing head P['H'] and tail P['T']?

Ans:
a) P['H']=0.2, P['T']=0.9 (Invalid because they sum to 1.1, which doesn't conform to the Axioms of probability)
b) P['H']=0.0, P['T']=1.0 (Valid because it conforms to the Axioms of probability)
c) P['H']=-0.1, P['T']=1.1 (Invalid because a probability cannot have a negative value)
d) P['H']=P['T']=0.5 (Valid because it conforms to the Axioms of probability)

Question 9: (Fill-in-blank)

A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood
have the flu, while the other 10% are sick with COVID-19. Let F stand for an event of a child being sick with flu and
C stand for an event of a child being sick with COVID-19; therefore, we have P[F]=0.9 and P[C]=0.1. Assume for
simplicity that a child is either with flu or with COVID-19, not both.

A well-known symptom of COVID-19 is a dry cough (the event of having which we denote D). Assume that the
probability of having a dry cough if one has COVID-19 is 0.95. However, children with flu also develop a dry cough,
and the probability of having a dry cough if one has flu is 0.5. Upon examining the child, the doctor finds the child
has a dry cough. The probability that the child has COVID-19 is _BLANK_.

Ans:
Because
$$P[C|D] = \frac{P[D|C]\,P[C]}{P[D|C]\,P[C] + P[D|F]\,P[F]} = \frac{0.95 \times 0.1}{0.95 \times 0.1 + 0.5 \times 0.9} \approx 0.17,$$
the probability that the child has COVID-19 is 0.17.
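A quick numerical check of this posterior in Python (values taken from the question):

# Pr(C|D) = Pr(D|C)Pr(C) / (Pr(D|C)Pr(C) + Pr(D|F)Pr(F))
p_C, p_F = 0.1, 0.9
p_D_given_C, p_D_given_F = 0.95, 0.5
print(round(p_D_given_C * p_C / (p_D_given_C * p_C + p_D_given_F * p_F), 2))   # 0.17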

Question 10: (True/False)
Two vectors $\mathbf{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ and $\mathbf{b} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}$ are linearly dependent?
Ans: False (because a is not a scalar multiple of b).

Question 11: (Fill-in-blank)
The rank of the matrix $\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$ is 2. (try row echelon form)

Ans: $\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 3 \\ 0 & -2 \end{bmatrix}$

Question 12: (Fill-in-blank)

The rank of the matrix $\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ is 2. (try row echelon form)

Ans: $\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 7 & 8 & 9 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}$
EE2211 Tutorial 4

(Systems of Linear Equations)


Question 1:

Given Xw = y where $\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 3 & 4 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Solve for w if it is solvable.

Answer:

(a) This is an even-determined system.


(b) det(X) = 1×4 − 1×3 = 1 ≠ 0, so X is invertible, with
$$\mathbf{X}^{-1} = \frac{1}{1}\begin{bmatrix} 4 & -1 \\ -3 & 1 \end{bmatrix} = \begin{bmatrix} 4 & -1 \\ -3 & 1 \end{bmatrix}.$$
(c) $$\hat{\mathbf{w}} = \mathbf{X}^{-1}\mathbf{y} = \begin{bmatrix} 4 & -1 \\ -3 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}.$$

import numpy as np

m_list = [[1, 1], [3, 4]]

X = np.array(m_list)

inv_X = np.linalg.inv(X)

y = np.array([0, 1])

w = inv_X.dot(y)

print(w)

(Systems of Linear Equations)


Question 2:
Given Xw = y where $\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 3 & 6 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Solve for w if it is solvable.

Answer:

(a) This is an even-determined system.


(b) X is NOT invertible since the determinant of X = 1×6 − 2×3 = 0.
(c) There is no solution for w since the rows/columns of X are linearly dependent. The two lines represented by the equations
are parallel and have no intersection.

(Systems of Linear Equations)


Question 3:

Given Xw = y where $\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 1 & -1 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 0 \\ 0.1 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Find a solution for w if it is solvable.

Answer:

(a) This is an over-determined system.


(b) X is NOT invertible, but $\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 6 & 9 \\ 9 & 21 \end{bmatrix}$ is. The determinant of $\mathbf{X}^T\mathbf{X}$ is 6×21 − 9×9 = 45.
(c) An approximated (least-squares) solution is given by
$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 0.4667 & -0.2 \\ -0.2 & 0.1333 \end{bmatrix}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & -1 \end{bmatrix}\begin{bmatrix} 0 \\ 0.1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.68 \\ -0.32 \end{bmatrix}.$$
# import numpy as np

# m_list = [[1, 2], [2, 4], [1, -1]]

# X = np.array(m_list)

# inv_XTX = np.linalg.inv(X.transpose().dot(X))

# pinv = inv_XTX.dot(X.transpose())

# y = np.array([0, 0.1, 1])

# w = pinv.dot(y)

# print(w)

import numpy as np

from numpy.linalg import inv

X = np.array([[1, 2], [2, 4], [1, -1]])

y = np.array([0, 0.1, 1])

w = inv(X.T @ X) @ X.T @ y

print(w)

(Systems of Linear Equations)


Question 4:
Given Xw = y where $\mathbf{X} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 0 & 0 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Solve for w if it is solvable.

Answer:

(a) This is an under-determined system.


(b) X is NOT invertible, but $\mathbf{X}\mathbf{X}^T$ is.
$$\det(\mathbf{X}\mathbf{X}^T) = \det\begin{bmatrix} 2 & 2 & 1 \\ 2 & 4 & 0 \\ 1 & 0 & 2 \end{bmatrix} = 2\begin{vmatrix} 4 & 0 \\ 0 & 2 \end{vmatrix} - 2\begin{vmatrix} 2 & 0 \\ 1 & 2 \end{vmatrix} + 1\begin{vmatrix} 2 & 4 \\ 1 & 0 \end{vmatrix} = 2 \times 8 - 2 \times 4 + (-4) = 4.$$
(c) $$\hat{\mathbf{w}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & -1 & 1 \\ 1 & 1 & 0 \\ 0 & -1 & 0 \end{bmatrix}\begin{bmatrix} 2 & -1 & -1 \\ -1 & 0.75 & 0.5 \\ -1 & 0.5 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \end{bmatrix}.$$
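A minimal NumPy check of this solution, in the same style as the snippets for Questions 1 and 3:

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 0, 1, 0], [1, -1, 1, -1], [1, 1, 0, 0]])
y = np.array([1, 0, 1])

# under-determined system: use the right pseudo-inverse X^T (X X^T)^(-1)
w = X.T @ inv(X @ X.T) @ y
print(w)   # [0.5 0.5 0.5 0.5]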

(Systems of Linear Equations)


Question 5:

Given $\mathbf{w}^T\mathbf{X} = \mathbf{y}^T$ where $\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 3 & 6 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Solve for w if it is solvable.

Answer:

(a) This is an even-determined system.


(b) X is NOT invertible since the determinant of X = 1⨯6 – 2⨯3 = 0.
(c) There is no solution for w (two parallel lines).

(Systems of Linear Equations)


Question 6:

Given 𝐰 𝑇 X = 𝐲 𝑇 where

$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 1 & -1 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

(a) What kind of system is this? (even-, over- or under-determined?)


(b) Is X invertible? Why?
(c) Solve for w if it is solvable.

Answer:

(a) This is an under-determined system (there are 3 unknowns with 2 equations).


(b) X is NOT invertible, but $\mathbf{X}^T\mathbf{X}$ is. The determinant of $\mathbf{X}^T\mathbf{X}$ is 6×21 − 9×9 = 45.
(c) A constrained (exact) solution is given by
$$\hat{\mathbf{w}}^T = (\mathbf{X}\mathbf{a})^T \quad \text{(the 3-dimensional vector } \mathbf{w} \text{ is constrained by writing it as } \mathbf{X} \text{ applied to a 2-dimensional vector } \mathbf{a}\text{)}$$
$$= \mathbf{a}^T\mathbf{X}^T = \mathbf{y}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} 0.4667 & -0.2 \\ -0.2 & 0.1333 \end{bmatrix}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & -1 \end{bmatrix} = \begin{bmatrix} 0.0667 & 0.1333 & -0.3333 \end{bmatrix}.$$

Note: If we solve for w exactly as asked in the question, we can first take the transpose on both sides:
$$(\mathbf{w}^T\mathbf{X})^T = \mathbf{y} \;\Rightarrow\; \mathbf{X}^T\mathbf{w} = \mathbf{y}.$$
Writing $\mathbf{A} = \mathbf{X}^T$, we can then use the under-determined formula
$$\mathbf{w} = \mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}\mathbf{y} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{y}.$$

Note: dim(X) is 3×2 and dim(a) is 2×1; the estimation is constrained to the lower dimension (2) and
then projected back to the higher dimension (3).
(Systems of Linear Equations)
Question 7:
This question is related to determination of types of system where an appropriate solution can be found subsequently.
The following matrix has a left inverse.
$$\mathbf{X} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
a) True
b) False

Answer: b)

Solution: Left inverse is given by (𝐗 𝑻 𝐗)−𝟏 𝐗 𝑻 where 𝐗 𝑻 𝐗 should be invertible. In this case, 𝐗 𝑻 𝐗 is not invertible
so the matrix does not have a left inverse.
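A short NumPy check (a sketch) that X^T X is singular here, so no left inverse exists:

import numpy as np

X = np.array([[2, 0, 0], [0, 0, 1]])
XtX = X.T @ X                        # 3x3 matrix
print(np.linalg.matrix_rank(XtX))    # 2 < 3, so X^T X is not invertible
print(np.linalg.det(XtX))            # 0.0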

(Systems of Linear Equations)


Question 8:
MCQ: Which of the following is/are true about matrix 𝐀 below? There could be more than one answer.
MCQ: Which of the following is/are true about matrix A below? There could be more than one answer.
$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
a) 𝐀 is invertible
b) 𝐀 is left invertible
c) 𝐀 is right invertible
d) 𝐀 has no determinant
e) None of the above

Answer: c and d.
EE2211 Tutorial 5

(Linear Regression, bias/offset)


Question 1:

Given the following data pairs for training:


{𝑥 = −10} → {𝑦 = 5}
{𝑥 = −8} → {𝑦 = 5}
{𝑥 = −3} → {𝑦 = 4}
{𝑥 = −1} → {𝑦 = 3}
{ 𝑥 = 2 } → {𝑦 = 2}
{ 𝑥 = 8 } → {𝑦 = 2}

(a) Perform a linear regression with addition of a bias/offset term to the input feature vector and sketch the result
of line fitting.
(b) Perform a linear regression without inclusion of any bias/offset term and sketch the result of line fitting.
(c) What is the effect of adding a bias/offset term to the input feature vector?

Answer:

(a) This is an over-determined system.

The input features including the bias/offset can be written as
$$\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -10 & -8 & -3 & -1 & 2 & 8 \end{bmatrix}.$$

$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 6 & -12 \\ -12 & 242 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -10 & -8 & -3 & -1 & 2 & 8 \end{bmatrix}\begin{bmatrix} 5 \\ 5 \\ 4 \\ 3 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 3.1055 \\ -0.1972 \end{bmatrix}.$$

(b) This is an over-determined system.

In this case, the input feature without inclusion of the bias/offset is a vector given by $\mathbf{x} = [-10, -8, -3, -1, 2, 8]^T$.

$$\hat{w} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} = [242]^{-1}[-10, -8, -3, -1, 2, 8]\begin{bmatrix} 5 \\ 5 \\ 4 \\ 3 \\ 2 \\ 2 \end{bmatrix} = -0.3512.$$

(c) The bias/offset term allows the line to move away from the origin (moved vertically in this case).
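A minimal NumPy sketch reproducing both fits (the printed values should match the hand calculations above up to rounding):

import numpy as np
from numpy.linalg import inv

x = np.array([-10, -8, -3, -1, 2, 8])
y = np.array([5, 5, 4, 3, 2, 2])

# (a) with bias/offset: a column of ones plus the feature column
X = np.column_stack((np.ones(len(x)), x))
print(inv(X.T @ X) @ X.T @ y)        # approx [ 3.1055 -0.1972]

# (b) without bias: a single-column feature matrix
Xnb = x.reshape(-1, 1)
print(inv(Xnb.T @ Xnb) @ Xnb.T @ y)  # approx [-0.3512]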

(Linear Regression, prediction, even/under-determined)


Question 2:

Given the following data pairs for training:


{x1 = 1, x2 = 0, x3 = 1} → {y = 1}

{x1 = 2, x2 = −1, x3 = 1} → {y = 2}

{x1 = 1, x2 = 1, x3 = 5} → {y = 3}

(a) Predict the following test data without inclusion of an input bias/offset term.
(b) Predict the following test data with inclusion of an input bias/offset term.

{x1 = −1, x2 = 2, x3 = 8} → {y = ?}

{x1 = 1, x2 = 5, x3 = −1} → {y = ?}

Answer:

(a) Without bias, this is an even-determined system and X is invertible.

$$\hat{\mathbf{w}} = \mathbf{X}^{-1}\mathbf{y} = \begin{bmatrix} 1 & 0 & 1 \\ 2 & -1 & 1 \\ 1 & 1 & 5 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.3333 \\ -0.6667 \\ 0.6667 \end{bmatrix}$$

$$\hat{\mathbf{y}}_t = \mathbf{X}_t\hat{\mathbf{w}} = \begin{bmatrix} -1 & 2 & 8 \\ 1 & 5 & -1 \end{bmatrix}\begin{bmatrix} 0.3333 \\ -0.6667 \\ 0.6667 \end{bmatrix} = \begin{bmatrix} 3.6667 \\ -3.6667 \end{bmatrix}$$

(b) After adding the bias, it becomes an under-determined system.

$$\hat{\mathbf{w}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 0 & -1 & 1 \\ 1 & 1 & 5 \end{bmatrix}\begin{bmatrix} 3 & 4 & 7 \\ 4 & 7 & 7 \\ 7 & 7 & 28 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} -0.1429 \\ 0.5238 \\ -0.4762 \\ 0.6190 \end{bmatrix}$$

$$\hat{\mathbf{y}}_t = \mathbf{X}_t\hat{\mathbf{w}} = \begin{bmatrix} 1 & -1 & 2 & 8 \\ 1 & 1 & 5 & -1 \end{bmatrix}\begin{bmatrix} -0.1429 \\ 0.5238 \\ -0.4762 \\ 0.6190 \end{bmatrix} = \begin{bmatrix} 3.3333 \\ -2.6190 \end{bmatrix}$$
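A minimal NumPy sketch of both predictions (matching the computations above):

import numpy as np
from numpy.linalg import inv

X = np.array([[1, 0, 1], [2, -1, 1], [1, 1, 5]])
y = np.array([1, 2, 3])
Xt = np.array([[-1, 2, 8], [1, 5, -1]])

# (a) without bias: even-determined, X is square and invertible
w = inv(X) @ y
print(Xt @ w)                          # approx [ 3.6667 -3.6667]

# (b) with bias: under-determined, use the right pseudo-inverse
Xb = np.hstack((np.ones((3, 1)), X))
Xtb = np.hstack((np.ones((2, 1)), Xt))
wb = Xb.T @ inv(Xb @ Xb.T) @ y
print(Xtb @ wb)                        # approx [ 3.3333 -2.619 ]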
(Linear Regression, prediction, extrapolation)
Question 3:

A college bookstore must order books two months before each semester starts. They believe that the number of books
that will ultimately be sold for any particular course is related to the number of students registered for the course when
the books are ordered. They would like to develop a linear regression equation to help plan how many books to order.
From past records, the bookstore obtains the number of students registered, X, and the number of books actually sold
for a course, Y, for 12 different semesters. These data are shown below.

Semester Students Books


1 36 31
2 28 29
3 35 34
4 39 35
5 30 29
6 30 30
7 31 30
8 38 38
9 36 34
10 38 33
11 29 29
12 26 26

(a) Obtain a scatter plot of the number of books sold versus the number of registered students.
(b) Write down the regression equation and calculate the coefficients for this fitting.
(c) Predict the number of books that would be sold in a semester when 30 students have registered.
(d) Predict the number of books that would be sold in a semester when 5 students have registered.

Answer:

(a) [Scatter plot of the number of books sold versus the number of registered students.]

(b) Regression equation: $\mathbf{y} = \mathbf{X}\mathbf{w}$,

where $\mathbf{w} = [w_0, w_1]^T$,
$$\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 36 & 28 & 35 & 39 & 30 & 30 & 31 & 38 & 36 & 38 & 29 & 26 \end{bmatrix},$$
$$\mathbf{y}^T = [31, 29, 34, 35, 29, 30, 30, 38, 34, 33, 29, 26].$$

$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 12 & 396 \\ 396 & 13288 \end{bmatrix}^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 9.30 \\ 0.6727 \end{bmatrix}$$

(c)
$$\hat{y}_t = \mathbf{X}_t\hat{\mathbf{w}} = \begin{bmatrix} 1 & 30 \end{bmatrix}\begin{bmatrix} 9.30 \\ 0.6727 \end{bmatrix} = 29.4818$$

(d) ($\hat{y}_t$ = 12.6636) This prediction appears to be somewhat over-optimistic. Since 5 students is not within the range
of the sampled number of students, it might not be appropriate to use the regression equation to make this prediction.
We do not know if the straight-line model would fit data at this point, and we might not want to extrapolate far beyond
the observed range.
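A minimal NumPy sketch of the fit and the two predictions:

import numpy as np
from numpy.linalg import inv

students = np.array([36, 28, 35, 39, 30, 30, 31, 38, 36, 38, 29, 26])
books = np.array([31, 29, 34, 35, 29, 30, 30, 38, 34, 33, 29, 26])

X = np.column_stack((np.ones(len(students)), students))
w = inv(X.T @ X) @ X.T @ books
print(w)                        # approx [9.30  0.6727]
print(np.array([1, 30]) @ w)    # approx 29.48 books for 30 registered students
print(np.array([1, 5]) @ w)     # approx 12.66 books for 5 students (extrapolation)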

(Linear Regression, prediction, impact of duplicated entries)


Question 4:

Repeat the above problem using the following training data:


Semester Students Books
1 36 31
2 26 20
3 35 34
4 39 35
5 26 20
6 30 30
7 31 30
8 38 38
9 36 34
10 38 33
11 26 20
12 26 20

(a) Calculate the regression coefficients for this fitting.


(b) Predict the number of books that would be sold in a semester when 30 students have registered.
(c) Purge those duplicating data and re-fit the line and observe the impact on predicting the number of books that
would be sold in a semester when 30 students have registered.
(d) Sketch and compare the two fitting lines.

Answer:
(a),(b),(c),(d)

Using the full data:
$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 12 & 387 \\ 387 & 12791 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 36 & 26 & 35 & 39 & 26 & 30 & 31 & 38 & 36 & 38 & 26 & 26 \end{bmatrix}\begin{bmatrix} 31 \\ 20 \\ 34 \\ 35 \\ 20 \\ 30 \\ 30 \\ 38 \\ 34 \\ 33 \\ 20 \\ 20 \end{bmatrix} = \begin{bmatrix} -10.4126 \\ 1.2143 \end{bmatrix}$$

$$\hat{y}_t = \mathbf{X}_t\hat{\mathbf{w}} = \begin{bmatrix} 1 & 30 \end{bmatrix}\begin{bmatrix} -10.4126 \\ 1.2143 \end{bmatrix} = 26.0177$$

After purging the duplicated data (9 samples remain):
$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 9 & 309 \\ 309 & 10763 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 36 & 35 & 39 & 30 & 31 & 38 & 36 & 38 & 26 \end{bmatrix}\begin{bmatrix} 31 \\ 34 \\ 35 \\ 30 \\ 30 \\ 38 \\ 34 \\ 33 \\ 20 \end{bmatrix} = \begin{bmatrix} -3.5584 \\ 1.0260 \end{bmatrix}$$

$$\hat{y}_t = \mathbf{X}_t\hat{\mathbf{w}} = \begin{bmatrix} 1 & 30 \end{bmatrix}\begin{bmatrix} -3.5584 \\ 1.0260 \end{bmatrix} = 27.2208$$

Note: these results show that duplicating samples can influence the learning and decision too. In this case, purging
seems to give a more optimistic prediction for a relatively small number of students (< 37) and more conservative
prediction for a relatively large number of students (>37).
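A sketch of the same comparison in NumPy; here the purging step simply drops repeated (students, books) rows before refitting:

import numpy as np
from numpy.linalg import inv

def fit_and_predict(students, books, n=30):
    # linear regression with a bias column, then prediction for n registered students
    X = np.column_stack((np.ones(len(students)), students))
    w = inv(X.T @ X) @ X.T @ books
    return w, np.array([1, n]) @ w

students = np.array([36, 26, 35, 39, 26, 30, 31, 38, 36, 38, 26, 26])
books = np.array([31, 20, 34, 35, 20, 30, 30, 38, 34, 33, 20, 20])
print(fit_and_predict(students, books))           # approx [-10.41, 1.21], 26.02

# purge duplicated (students, books) pairs and refit
pairs = np.unique(np.column_stack((students, books)), axis=0)
print(fit_and_predict(pairs[:, 0], pairs[:, 1]))  # approx [-3.56, 1.03], 27.22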
(Linear Regression, python)
Question 5:
Download the data file “government-expenditure-on-education.csv” from Canvas Tutorial Folder.
It depicts the government’s educational expenditure over the years (downloaded in July 2021 from
https://data.gov.sg/dataset/government-expenditure-on-education)
Predict the educational expenditure of year 2021 based on linear regression. Solve the problem using Python with a
plot. Note: please use the file from the canvas link.
Hint: use Python packages like numpy, pandas, matplotlib.pyplot, numpy.linalg.

Answer:

The predicted educational expenditure in year 2021 is 12102904.270643068.

Codes:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from numpy.linalg import inv

df = pd.read_csv("government-expenditure-on-education.csv")

expenditureList = df ['recurrent_expenditure_total'].tolist()

yearList = df ['year'].tolist()

m_list = [[1]*len(yearList), yearList]

X = np.array(m_list).T

y = np.array(expenditureList)

w = inv(X.T @ X) @ X.T @ y

print(w)

y_line = X.dot(w)

plt.plot(yearList, expenditureList, 'o', label = 'Expenditure over the years')

plt.plot(yearList, y_line)

plt.xlabel('Year')

plt.ylabel('Expenditure')
plt.title('Education Expenditure')

plt.show()

y_predict = np.array([1, 2021]).dot(w)

print(y_predict)

Output
[-6.4843247e+08 3.2683591e+05]
12102904.270643068

(Linear Regression, python)


Question 6:
Download the CSV file for red-wine using “ wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-
databases/wine-quality/winequality-red.csv",sep=';') ” . Use Python to perform the following tasks. Hint: use Python
packages like numpy, pandas, matplotlib.pyplot, numpy.linalg, and sklearn.metrics.

(a) Take y = wine.quality as the target output and x = wine.drop('quality',axis = 1)as


the input features. Assume the given list of data is already randomly indexed (i.e., not in particular order),
split the database into two sets: [0:1500] samples for regression training, and [1500:1599] samples for testing.
(b) Perform linear regression on the training set and print out the learned parameters.
(c) Perform prediction using the test set and provide the prediction accuracy in terms of the mean of squared
errors (MSE).

Answer:

import pandas as pd

#import matplotlib.pyplot as plt

import numpy as np

from numpy.linalg import inv


from sklearn.metrics import mean_squared_error

## get data from web

wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

wine.info()

y = wine.quality

x = wine.drop('quality',axis = 1)

## Include the offset/bias term

x0 = np.ones((len(y),1))

X = np.hstack((x0,x))

## split data into training and test sets

## (Note: this exercise introduces the basic protocol of using the training-test partitioning of samples
## for evaluation, assuming the list of data is already randomly indexed)

## In case you really want a general random split to obtain better training/test distributions:

## from sklearn.model_selection import train_test_split

## train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=99/1599, random_state=0)

train_X = X[0:1500]

train_y = y[0:1500]

test_X = X[1500:1599]

test_y = y[1500:1599]

## linear regression

w = inv(train_X.T @ train_X) @ train_X.T @ train_y

print(w)

yt_est = test_X.dot(w);

MSE = np.square(np.subtract(test_y,yt_est)).mean()

print(MSE)

MSE = mean_squared_error(test_y,yt_est)
print(MSE)

[ 2.22330327e+01 2.68702621e-02 -1.12838019e+00 -2.06141685e-01

1.22000584e-02 -1.77718503e+00 4.29357454e-03 -3.18953315e-03

-1.81795124e+01 -3.98142390e-01 8.92474793e-01 2.77147239e-01]

0.34352638122440293

0.343526381224403

Question 7:
This question is related to understanding of modelling assumptions. The function given by $f(\mathbf{x}) = 1 + x_1 + x_2 - x_3 - x_4$ is affine.
a) True
b) False

Answer: a)

Question 8:

MCQ: There could be more than one answer.


Suppose 𝑓(𝐱) is a scalar function of d variables where 𝐱 is a d ×1 vector. Then, without taking data points into
consideration, the outcome of differentiation of 𝑓(𝐱) w.r.t. 𝐱 is
a) a scalar
b) a d×1 vector
c) a d×d matrix
d) a d×d×d tensor
e) None of the above

Answer: b)

(Linear regression with multiple outputs)

Question 9:

The values of feature vector x and their corresponding values of target vector y are shown in the table below:

x [3, -1, 0] [5, 1, 2] [9, -1, 3] [-6, 7, 2] [3, -2, 0]


y [1, -1] [-1, 0] [1, 2] [0, 3] [1, -2]

Find the least square solution of w using linear regression of multiple outputs and then estimate the value of y when
x = [8, 0, 2].

Answer:
#python
import numpy as np
from numpy.linalg import inv
X = np.array([[1, 3, -1, 0], [1, 5, 1, 2], [1, 9, -1, 3], [1, -6, 7, 2],
[1, 3, -2, 0]])
Y = np.array([[1, -1], [-1, 0], [1, 2], [0, 3], [1, -2]])
W = inv(X.T @ X) @ X.T @ Y
print(W)

newX=np.array([1, 8, 0, 2])
newY=newX@W
print(newY)

Outputs

W= [[ 1.14668974 -0.95997404]
[-0.630463 -0.33427088]
[-1.10601471 -0.24426655]
[ 1.3595846 1.77953267]]
newY=[-1.17784509 -0.07507572]
EE2211 Tutorial 6

(Ridge Regression in Dual Form)


Question 1:
Derive the solution for linear ridge regression in dual form (see Lecture 6 notes page 16).

Answer: For 𝜆 > 0,


(𝐗 𝑇 𝐗 + 𝜆𝐈)𝐰 = 𝐗 𝑇 𝐲

⇒ 𝐗 𝑇 𝐗𝐰 + 𝜆𝐰 = 𝐗 𝑇 𝐲

⇒ 𝜆𝐰 = 𝐗 𝑇 𝐲 − 𝐗 𝑇 𝐗𝐰

⇒ 𝐰 = 𝜆−1 (𝐗 𝑇 𝐲 − 𝐗 𝑇 𝐗𝐰)

⇒ 𝐰 = 𝜆−1 𝐗 𝑇 (𝐲 − 𝐗𝐰)

𝐰 = 𝐗𝑇𝒂

where

𝒂 = 𝜆−1 (𝐲 − 𝐗𝐰)

⇒ 𝜆𝒂 = (𝐲 − 𝐗𝐰)

⇒ 𝜆𝒂 = (𝐲 − 𝐗𝐗 𝑇 𝒂)

⇒ 𝐗𝐗 𝑇 𝒂 + 𝜆𝒂 = 𝐲

⇒ (𝐗𝐗 𝑇 + 𝜆𝐈)𝒂 = 𝐲

⇒ 𝒂 = (𝐗𝐗 𝑇 + 𝜆𝐈)−1 𝐲

Hence,
𝐰 = 𝐗 𝑇 𝒂 = 𝐗 𝑇 (𝐗𝐗 𝑇 + 𝜆𝐈)−1 𝐲.
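A small NumPy check (a sketch using an arbitrary example matrix) that the primal and dual ridge solutions coincide for λ > 0:

import numpy as np
from numpy.linalg import inv

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 0.0, 1.0])
lam = 0.5

w_primal = inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y   # (X^T X + lambda I)^(-1) X^T y
w_dual = X.T @ inv(X @ X.T + lam * np.eye(X.shape[0])) @ y     # X^T (X X^T + lambda I)^(-1) y
print(np.allclose(w_primal, w_dual))                           # True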

(Polynomial Regression, 1D data)


Question 2:
Given the following data pairs for training:
{𝑥 = −10} → {𝑦 = 5}
{𝑥 = −8} → {𝑦 = 5}
{𝑥 = −3} → {𝑦 = 4}
{𝑥 = −1} → {𝑦 = 3}
{ 𝑥 = 2 } → {𝑦 = 2}
{ 𝑥 = 8 } → {𝑦 = 2}

(a) Perform a 3rd-order polynomial regression and sketch the result of line fitting.
(b) Given a test point { 𝑥 = 9 } predict 𝑦 using the polynomial model.
(c) Compare this prediction with that of a linear regression.
Answer:
Polynomial model of 3rd order: $f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$.
$$\mathbf{P} = \begin{bmatrix} 1 & -10 & 100 & -1000 \\ 1 & -8 & 64 & -512 \\ 1 & -3 & 9 & -27 \\ 1 & -1 & 1 & -1 \\ 1 & 2 & 4 & 8 \\ 1 & 8 & 64 & 512 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 5 \\ 5 \\ 4 \\ 3 \\ 2 \\ 2 \end{bmatrix}.$$
Polynomial regression:
$$\hat{\mathbf{w}} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T\mathbf{y} = \begin{bmatrix} 6 & -12 & 242 & -1020 \\ -12 & 242 & -1020 & 18290 \\ 242 & -1020 & 18290 & -100212 \\ -1020 & 18290 & -100212 & 1525082 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -10 & -8 & -3 & -1 & 2 & 8 \\ 100 & 64 & 9 & 1 & 4 & 64 \\ -1000 & -512 & -27 & -1 & 8 & 512 \end{bmatrix}\begin{bmatrix} 5 \\ 5 \\ 4 \\ 3 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2.6894 \\ -0.3772 \\ 0.0134 \\ 0.0029 \end{bmatrix}$$
Linear regression:
$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 6 & -12 \\ -12 & 242 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ -10 & -8 & -3 & -1 & 2 & 8 \end{bmatrix}\begin{bmatrix} 5 \\ 5 \\ 4 \\ 3 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 3.1055 \\ -0.1972 \end{bmatrix}.$$
Prediction at x = 9:
y_predict_Poly = 2.4661
y_predict_Linear = 1.3303
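A minimal NumPy sketch reproducing both fits and the predictions at x = 9:

import numpy as np
from numpy.linalg import inv

x = np.array([-10, -8, -3, -1, 2, 8])
y = np.array([5, 5, 4, 3, 2, 2])

# 3rd-order polynomial design matrix [1, x, x^2, x^3]
P = np.column_stack([x**k for k in range(4)])
w_poly = inv(P.T @ P) @ P.T @ y

# linear design matrix [1, x]
X = np.column_stack((np.ones(len(x)), x))
w_lin = inv(X.T @ X) @ X.T @ y

xt = 9
print(np.array([xt**k for k in range(4)]) @ w_poly)   # approx 2.466
print(np.array([1, xt]) @ w_lin)                      # approx 1.330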
(Polynomial Regression, 3D data, Python)
Question 3:
(a) Write down the expression for a 3rd order polynomial model having a 3-dimensional input.
(b) Write down the P matrix for this polynomial given $\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & -1 & 1 \end{bmatrix}$.
(c) Given $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, can a unique solution be obtained in dual form? If so, proceed to solve it.
(d) Given $\mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, can the primal ridge regression be applied to obtain a unique solution? If so, proceed to
solve it.

Answer:

(a) Polynomial model of 3rd order:


$$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_{12} x_1 x_2 + w_{23} x_2 x_3 + w_{13} x_1 x_3 + w_{11} x_1^2 + w_{22} x_2^2 + w_{33} x_3^2$$
$$\quad + w_{211} x_2 x_1^2 + w_{311} x_3 x_1^2 + w_{122} x_1 x_2^2 + w_{322} x_3 x_2^2 + w_{133} x_1 x_3^2 + w_{233} x_2 x_3^2 + w_{123} x_1 x_2 x_3 + w_{111} x_1^3 + w_{222} x_2^3 + w_{333} x_3^3 \qquad (1)$$
(b) P = [
Columns 1 through 13
1 1 0 1 0 0 1 1 0 1 0 1 0
1 1 -1 1 -1 -1 1 1 1 1 -1 1 1
Columns 14 through 20
0 1 0 0 1 0 1
1 1 -1 -1 1 -1 1 ]

(c) Yes.
$$\hat{\mathbf{w}} = \mathbf{P}^T(\mathbf{P}\mathbf{P}^T)^{-1}\mathbf{y} = \mathbf{P}^T\begin{bmatrix} 10 & 10 \\ 10 & 20 \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
$$\hat{\mathbf{w}}^T = [0,\; 0,\; -0.1,\; 0,\; -0.1,\; -0.1,\; 0,\; 0,\; 0.1,\; 0,\; -0.1,\; 0,\; 0.1,\; 0.1,\; 0,\; -0.1,\; -0.1,\; 0,\; -0.1,\; 0]$$

In python:
w_dual =
[ 0.   0.  -0.1  0.   0.  -0.1  0.   0.1 -0.1  0.   0.  -0.1  0.   0.1 -0.1  0.  -0.1  0.1 -0.1  0. ]

(Note: The arrangement of the polynomial terms in the columns of matrix 𝐏 using
PolynomialFeatures from sklearn.preprocessing might be different from that in equation(1).)

(d) Yes.
$$\hat{\mathbf{w}} = (\mathbf{P}^T\mathbf{P} + \lambda\mathbf{I})^{-1}\mathbf{P}^T\mathbf{y}$$
$$\hat{\mathbf{w}}^T = [0.0000,\; 0.0000,\; -0.1000,\; 0.0000,\; -0.1000,\; -0.1000,\; 0.0000,\; 0.0000,\; 0.1000,\; 0.0000,\; -0.1000,\; 0.0000,\; 0.1000,\; 0.1000,\; 0.0000,\; -0.1000,\; -0.1000,\; 0.0000,\; -0.1000,\; 0.0000]$$
In python:
w_primal = [ 9.99969302e-07 9.99972940e-07 -9.99980001e-02 9.99970098e-07
9.99970666e-07 -9.99980000e-02 9.99967597e-07 9.99980000e-02
-9.99980000e-02 9.99972485e-07 9.99969529e-07 -9.99980000e-02
9.99968506e-07 9.99980000e-02 -9.99980001e-02 9.99970553e-07
-9.99980001e-02 9.99980001e-02 -9.99980001e-02 9.99969416e-07]

(Note: The arrangement of the polynomial terms in the columns of matrix 𝐏 using
PolynomialFeatures from sklearn.preprocessing might be different from that in equation(1).)
Here, at 𝜆 = 0.0001, we observe a very close solution to that in (c) even though (d) constitutes an
approximation whereas (c) is exact.
Codes:
import numpy as np
from numpy.linalg import inv
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1,0,1], [1,-1,1]])
y = np.array([0, 1])
## Generate polynomial features
order = 3
poly = PolynomialFeatures(order)
P = poly.fit_transform(X)
## dual solution (without ridge)
w_dual = P.T @ inv(P @ P.T) @ y
print(w_dual)
## primal ridge
reg_L = 0.0001*np.identity(P.shape[1])
w_primal_ridge = inv(P.T @ P + reg_L) @ P.T @ y
print(w_primal_ridge)

(Binary Classification, Python)


Question 4:
Given the training data:
{𝑥 = −1} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠1 }
{𝑥 = 0 } → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠1 }
{𝑥 = 0.5} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠2 }
{𝑥 = 0.3} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠1 }
{𝑥 = 0.8} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠2 }
Predict the class label for {𝑥 = −0.1} and {𝑥 = 0.4} using linear regression with signum discrimination.
Answer:

$$\mathbf{X} = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 0.5 \\ 1 & 0.3 \\ 1 & 0.8 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} +1 \\ +1 \\ -1 \\ +1 \\ -1 \end{bmatrix}$$

$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 0.3333 \\ -1.1111 \end{bmatrix}$$

$$\mathrm{sgn}(\hat{\mathbf{y}}_t) = \mathrm{sgn}(\mathbf{X}_t\hat{\mathbf{w}}) = \mathrm{sgn}\left(\begin{bmatrix} 0.4444 \\ -0.1111 \end{bmatrix}\right) = \begin{bmatrix} +1 \\ -1 \end{bmatrix} \Rightarrow \begin{bmatrix} \text{class 1} \\ \text{class 2} \end{bmatrix}$$
Codes:
import numpy as np
from numpy.linalg import inv
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1,-1], [1,0], [1,0.5], [1,0.3], [1,0.8]])
y = np.array([1, 1, -1, 1, -1])
## Linear regression for classification
w = inv(X.T @ X) @ X.T @ y
print(w)
Xt = np.array([[1,-0.1], [1,0.4]])
y_predict = Xt @ w
print(y_predict)
y_class_predict = np.sign(y_predict)
print(y_class_predict)

(Multi-Category Classification, Python)


Question 5:
Given the training data:
{𝑥 = −1} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠1 }
{𝑥 = 0 } → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠1 }
{𝑥 = 0.5} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠2 }
{𝑥 = 0.3} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠3 }
{𝑥 = 0.8} → { 𝑦 = 𝑐𝑙𝑎𝑠𝑠2 }
(a) Predict the class label for {𝑥 = −0.1} and {𝑥 = 0.4} based on linear regression towards a one-hot encoded
target.

(b) Predict the class label for {𝑥 = −0.1} and {𝑥 = 0.4} using a polynomial model of 5th order and a one-hot
encoded target.
Answer:

$$\mathbf{X} = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 0.5 \\ 1 & 0.3 \\ 1 & 0.8 \end{bmatrix}, \quad \mathbf{Y} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad \mathbf{X}_t = \begin{bmatrix} 1 & -0.1 \\ 1 & 0.4 \end{bmatrix}.$$

(a) $$\hat{\mathbf{W}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \begin{bmatrix} 0.4780 & 0.3333 & 0.1887 \\ -0.6499 & 0.5556 & 0.0943 \end{bmatrix}$$

$$\hat{\mathbf{Y}}_t = \mathbf{X}_t\hat{\mathbf{W}} = \begin{bmatrix} 1 & -0.1 \\ 1 & 0.4 \end{bmatrix}\begin{bmatrix} 0.4780 & 0.3333 & 0.1887 \\ -0.6499 & 0.5556 & 0.0943 \end{bmatrix} = \begin{bmatrix} 0.5430 & 0.2778 & 0.1792 \\ 0.2180 & 0.5556 & 0.2264 \end{bmatrix} \Rightarrow \begin{bmatrix} \text{class 1} \\ \text{class 2} \end{bmatrix}$$

(b) Polynomial model of 5th order: $f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$

$$\mathbf{P} = \begin{bmatrix} 1.0000 & -1.0000 & 1.0000 & -1.0000 & 1.0000 & -1.0000 \\ 1.0000 & 0 & 0 & 0 & 0 & 0 \\ 1.0000 & 0.5000 & 0.2500 & 0.1250 & 0.0625 & 0.0313 \\ 1.0000 & 0.3000 & 0.0900 & 0.0270 & 0.0081 & 0.0024 \\ 1.0000 & 0.8000 & 0.6400 & 0.5120 & 0.4096 & 0.3277 \end{bmatrix}.$$

$$\hat{\mathbf{W}} = \mathbf{P}^T(\mathbf{P}\mathbf{P}^T)^{-1}\mathbf{Y} = \begin{bmatrix} 1.0000 & 0 & -0.0000 \\ -5.3031 & -3.7023 & 9.0055 \\ 5.2198 & 10.8728 & -16.0926 \\ 6.6662 & 9.4698 & -16.1360 \\ -6.4765 & -12.9099 & 19.3864 \\ -2.6199 & -7.8045 & 10.4244 \end{bmatrix}.$$

$$\mathbf{P}_t = \begin{bmatrix} 1.0000 & -0.1000 & 0.0100 & -0.0010 & 0.0001 & -0.0000 \\ 1.0000 & 0.4000 & 0.1600 & 0.0640 & 0.0256 & 0.0102 \end{bmatrix}.$$

$$\hat{\mathbf{Y}}_t = \mathbf{P}_t\hat{\mathbf{W}} = \begin{bmatrix} 1.5752 & 0.4683 & -1.0435 \\ -0.0521 & 0.4544 & 0.5977 \end{bmatrix} \Rightarrow \begin{bmatrix} \text{class 1} \\ \text{class 3} \end{bmatrix}.$$
Codes:
import numpy as np
from numpy.linalg import inv
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1,-1], [1,0], [1,0.5], [1,0.3], [1,0.8]])
Y = np.array([[1,0,0], [1,0,0], [0,1,0], [0,0,1], [0,1,0]])

## Linear regression for classification


W = inv(X.T @ X) @ X.T @ Y
print(W)
Xt = np.array([[1,-0.1], [1,0.4]])
y_predict = Xt @ W
print(y_predict)
y_class_predict = [[1 if y == max(x) else 0 for y in x] for x in y_predict ]
print(y_class_predict)

## Polynomial regression for


## Generate polynomial features
order = 5
poly = PolynomialFeatures(order)
## only the data column (2nd) is needed for generation of polynomial terms
reshaped = X[:,1].reshape(len(X[:,1]),1)
P = poly.fit_transform(reshaped)
reshaped = Xt[:,1].reshape(len(Xt[:,1]),1)
Pt = poly.fit_transform(reshaped)
## dual solution (without ridge)
Wp_dual = P.T @ inv(P @ P.T) @ Y
print(Wp_dual)
yp_predict = Pt @ Wp_dual
print(yp_predict)
yp_class_predict = [[1 if y == max(x) else 0 for y in x] for x in yp_predict ]
print(yp_class_predict)

(Multi-Category Classification, Python)


Question 6 (continued from Q3 of Tutorial 2):
Get the data set “from sklearn.datasets import load_iris”. Use Python to perform the following
tasks.
(a) Split the database into two sets: 74% of samples for training, and 26% of samples for testing. Hint: you might
want to utilize from sklearn.model_selection import train_test_split for the splitting.
(b) Construct the target output using one-hot encoding.
(c) Perform a linear regression for classification (without inclusion of ridge, utilizing one-hot encoding for the
learning target) and compute the number of test samples that are classified correctly.
(d) Using the same training and test sets as in above, perform a 2nd order polynomial regression for classification
(again, without inclusion of ridge, utilizing one-hot encoding for the learning target) and compute the number
of test samples that are classified correctly. Hint: you might want to use from
sklearn.preprocessing import PolynomialFeatures for generation of the polynomial
matrix.
Codes:
## (a) split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], test_size=0.26, random_state=0)

## (b) one-hot encoding


# Ytr_onehot = list()
# for i in y_train:
# letter = [0, 0, 0]
# letter[i] = 1
# Ytr_onehot.append(letter)
# Yts_onehot = list()
# for i in y_test:
# letter = [0, 0, 0]
# letter[i] = 1
# Yts_onehot.append(letter)
from sklearn.preprocessing import OneHotEncoder
onehot_encoder=OneHotEncoder(sparse=False)
reshaped = y_train.reshape(len(y_train), 1)
Ytr_onehot = onehot_encoder.fit_transform(reshaped)
reshaped = y_test.reshape(len(y_test), 1)
Yts_onehot = onehot_encoder.fit_transform(reshaped)

## (c) Linear Classification

import numpy as np
from numpy.linalg import inv

bias1 = np.ones((X_train.shape[0], 1))
X_train = np.concatenate((bias1, X_train), axis = 1)
bias2 = np.ones((X_test.shape[0], 1))
X_test = np.concatenate((bias2, X_test), axis = 1)

w = inv(X_train.T @ X_train) @ X_train.T @ Ytr_onehot


print(w)
# calculate the output based on the estimated w and the test input X, and then assign
# each sample to one of the classes based on the one-hot encoding
yt_est = X_test.dot(w);
yt_cls = [[1 if y == max(x) else 0 for y in x] for x in yt_est ]
print(yt_cls)
# compare the predicted y with the ground truth
m1 = np.matrix(Yts_onehot)
m2 = np.matrix(yt_cls)
difference = np.abs(m1 - m2)
print(difference)
# calculate the error rate/accuracy
correct = np.where(~difference.any(axis=1))[0]
accuracy = len(correct)/len(difference)
print(len(correct))
print(accuracy)

## (d) Polynomial Classification


import numpy as np
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
P = poly.fit_transform(X_train)
Pt = poly.fit_transform(X_test)
if P.shape[0] > P.shape[1]:
wp = inv(P.T @ P) @ P.T @ Ytr_onehot
else:
wp = P.T @ inv(P @ P.T) @ Ytr_onehot
print(wp)
yt_est_p = Pt.dot(wp);
yt_cls_p = [[1 if y == max(x) else 0 for y in x] for x in yt_est_p ]
print(yt_cls_p)
m1 = np.matrix(Yts_onehot)
m2 = np.matrix(yt_cls_p)
difference = np.abs(m1 - m2)
print(difference)
correct_p = np.where(~difference.any(axis=1))[0]
accuracy_p = len(correct_p)/len(difference)
print(len(correct_p))
print(accuracy_p)
(c) Correct prediction 28/39
(d) Correct prediction 38/39

Question 7
MCQ: there could be more than one answer. Given three samples of two-dimensional data points $\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 3 & 3 \end{bmatrix}$ with
corresponding target vector $\mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$. Suppose you want to use a full third-order polynomial model to fit these data.
Which of the following is/are true?

a) The polynomials model has 10 parameters to learn


b) The polynomial learning system is an under-determined one
c) The learning of the polynomial model has infinite number of solutions
d) The input matrix X has linearly dependent samples
e) None of the above

Answer: a, b, c, d

Question 8

MCQ: there could be more than one answer. Which of the following is/are true?

a) The polynomial model can be used to solve problems with nonlinear decision boundary.

b) The ridge regression cannot be applied to multi-target regression.

c) The solution for learning feature $\mathbf{X}$ with target $\mathbf{y}$ based on linear ridge regression can be written as
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ for $\lambda > 0$. As $\lambda$ increases, $\hat{\mathbf{w}}^T\hat{\mathbf{w}}$ decreases.

d) If there are four data samples with two input features each, the full second-order polynomial model is an over-
determined system.

Answer: a, c
EE2211 Tutorial 7

Question 1:
This question explores the use of Pearson’s correlation as a feature selection metric. We are given the
following training dataset.

Datapoint 1 Datapoint 2 Datapoint 3 Datapoint 4 Datapoint 5


Feature 1 0.3510 2.1812 0.2415 -0.1096 0.1544
Feature 2 1.1796 2.1068 1.7753 1.2747 2.0851
Feature 3 -0.9852 1.3766 -1.3244 -0.6316 -0.8320
Target y 0.2758 1.4392 -0.4611 0.6154 1.0006

What are the top two features we should select if we use Pearson's correlation as a feature selection
metric? Here's the definition of Pearson's correlation. Given $N$ pairs of datapoints
$\{(a_1, b_1), (a_2, b_2), \cdots, (a_N, b_N)\}$, the Pearson's correlation $r$ is defined as
$$r = \frac{\frac{1}{N}\sum_{n=1}^{N}(a_n - \bar{a})(b_n - \bar{b})}{\sqrt{\frac{1}{N}\sum_{n=1}^{N}(a_n - \bar{a})^2}\,\sqrt{\frac{1}{N}\sum_{n=1}^{N}(b_n - \bar{b})^2}},$$
where $\bar{a} = \frac{1}{N}\sum_{n=1}^{N} a_n$ and $\bar{b} = \frac{1}{N}\sum_{n=1}^{N} b_n$ are the empirical means of $a$ and $b$ respectively.
$\sigma_a = \sqrt{\frac{1}{N}\sum_{n=1}^{N}(a_n - \bar{a})^2}$ and $\sigma_b = \sqrt{\frac{1}{N}\sum_{n=1}^{N}(b_n - \bar{b})^2}$ are referred to as the empirical
standard deviations of $a$ and $b$. $\mathrm{Cov}(a, b) = \frac{1}{N}\sum_{n=1}^{N}(a_n - \bar{a})(b_n - \bar{b})$ is known as the empirical
covariance between $a$ and $b$.

Answer:

Mean of Feature 1 = μ1 = (0.3510 + 2.1812 + 0.2415 − 0.1096 + 0.1544)/5 = 0.5637
Mean of Feature 2 = μ2 = (1.1796 + 2.1068 + 1.7753 + 1.2747 + 2.0851)/5 = 1.6843
Mean of Feature 3 = μ3 = (−0.9852 + 1.3766 − 1.3244 − 0.6316 − 0.8320)/5 = −0.4793
Mean of Target y = μy = (0.2758 + 1.4392 − 0.4611 + 0.6154 + 1.0006)/5 = 0.5740

Feature 1 std = σ1 = sqrt{[(0.3510−μ1)² + (2.1812−μ1)² + (0.2415−μ1)² + (−0.1096−μ1)² + (0.1544−μ1)²]/5} = 0.8229

Feature 2 std = σ2 = sqrt{[(1.1796−μ2)² + (2.1068−μ2)² + (1.7753−μ2)² + (1.2747−μ2)² + (2.0851−μ2)²]/5} = 0.3924

Feature 3 std = σ3 = sqrt{[(−0.9852−μ3)² + (1.3766−μ3)² + (−1.3244−μ3)² + (−0.6316−μ3)² + (−0.8320−μ3)²]/5} = 0.9552

Target y std = σy = sqrt{[(0.2758−μy)² + (1.4392−μy)² + (−0.4611−μy)² + (0.6154−μy)² + (1.0006−μy)²]/5} = 0.6469

Cov(Feature 1, y) = (1/5)[(0.3510−μ1)(0.2758−μy) + (2.1812−μ1)(1.4392−μy) + (0.2415−μ1)(−0.4611−μy) + (−0.1096−μ1)(0.6154−μy) + (0.1544−μ1)(1.0006−μy)] = 0.3188

Cov(Feature 2, y) = (1/5)[(1.1796−μ2)(0.2758−μy) + (2.1068−μ2)(1.4392−μy) + (1.7753−μ2)(−0.4611−μy) + (1.2747−μ2)(0.6154−μy) + (2.0851−μ2)(1.0006−μy)] = 0.1152

Cov(Feature 3, y) = (1/5)[(−0.9852−μ3)(0.2758−μy) + (1.3766−μ3)(1.4392−μy) + (−1.3244−μ3)(−0.4611−μy) + (−0.6316−μ3)(0.6154−μy) + (−0.8320−μ3)(1.0006−μy)] = 0.4949

Correlation of Feature 1 & y = Cov(Feature 1, y)/(σ1 σy) = 0.3188/(0.8229 × 0.6469) = 0.5988
Correlation of Feature 2 & y = Cov(Feature 2, y)/(σ2 σy) = 0.1152/(0.3924 × 0.6469) = 0.4537
Correlation of Feature 3 & y = Cov(Feature 3, y)/(σ3 σy) = 0.4949/(0.9552 × 0.6469) = 0.8009

Therefore, the top 2 features are Feature 1 and Feature 3.
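
The hand computation above can be reproduced with a short NumPy sketch (an illustration, not part of the
original answer); note that np.std and the mean-based covariance below use the 1/N convention, matching
the definitions given in the question.

import numpy as np

F = np.array([[ 0.3510, 2.1812,  0.2415, -0.1096,  0.1544],   # Feature 1
              [ 1.1796, 2.1068,  1.7753,  1.2747,  2.0851],   # Feature 2
              [-0.9852, 1.3766, -1.3244, -0.6316, -0.8320]])  # Feature 3
y = np.array([0.2758, 1.4392, -0.4611, 0.6154, 1.0006])

for i, f in enumerate(F, start=1):
    cov = np.mean((f - f.mean()) * (y - y.mean()))   # empirical covariance (1/N)
    r = cov / (f.std() * y.std())                    # Pearson's correlation
    print("Feature {}: r = {:.4f}".format(i, r))     # ~0.5988, ~0.4537, ~0.8009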


Question 2:
This question further explores linear regression and ridge regression. The following data pairs are used for
training:
{𝑥 = −10} → {𝑦 = 4.18}

{𝑥 = −8} → {𝑦 = 2.42}

{𝑥 = −3} → {𝑦 = 0.22}

{𝑥 = −1} → {𝑦 = 0.12}

{ 𝑥 = 2 } → {𝑦 = 0.25}

{ 𝑥 = 7 } → {𝑦 = 3.09}

The data for testing are as follows:

{𝑥 = −9} → {𝑦 = 3}

{𝑥 = −7} → {𝑦 = 1.81}

{𝑥 = −5} → {𝑦 = 0.80}

{𝑥 = −4} → {𝑦 = 0.25}

{ 𝑥 = −2 } → {𝑦 = −0.19}

{ 𝑥 = 1 } → {𝑦 = 0.4}

{ 𝑥 = 4 } → {𝑦 = 1.24}

{ 𝑥 = 5 } → {𝑦 = 1.68}

{ 𝑥 = 6 } → {𝑦 = 2.32}

{ 𝑥 = 9 } → {𝑦 = 5.05}

(a) Use the polynomial model from orders 1 to 6 to train and test the data without regularization. Plot the
    Mean Squared Errors (MSE) over orders from 1 to 6 for both the training and the test sets. Which
    model order provides the best MSE in the training and test sets? Why? [Hint: the underlying data was
    generated using a quadratic function + noise]
(b) Use regularization (ridge regression) λ=1 for all orders and repeat the same analyses. Compare the
    plots of (a) and (b). What do you see? [Hint: the underlying data was generated using a quadratic
    function + noise]

Answer:

Please see code: Tut7_Q2_yeo.py

Q2(a)
• There are 6 training data points. For polynomial orders 1 to 5, we can use the primal solution:
  ŵ = (PᵀP)⁻¹Pᵀy. For order 6, there are 7 unknowns, so the system is under-determined and we
  use the dual solution: ŵ = Pᵀ(PPᵀ)⁻¹y.
• See plots for estimated polynomial curves and MSE below.
• ====== No Regularization =======
  Training MSE: [2.3071 8.4408e-03 8.3026e-03 1.7348e-03 3.8606e-25 2.3656e-17]
  Test MSE: [3.0006 0.0296 0.0301 0.0854 1.0548 10.7674]
• Observe that the estimated polynomial curves for orders 5 and 6 pass through the training samples
  exactly. This results in a training MSE of virtually 0, but a high test MSE => overfitting.
• Note that even though the true underlying data came from a quadratic model (order = 2), the estimated
  polynomial curves for orders 2, 3 and 4 have relatively low training and test MSE.
• The polynomial curve of order 1 (a linear curve) has high training and test MSE => underfitting.

Q2(b)
• With regularization, we can simply use the primal solution even for order 6:
  ŵ = (PᵀP + λI)⁻¹Pᵀy.
• See plots for estimated polynomial curves and MSE below
• ====== Regularization =======
Training MSE: [2.3586 8.4565e-03 8.3560e-03 1.8080e-03 7.2650e-04 1.9348e-04]
Test MSE: [3.2756 0.0302 0.0314 0.0939 0.4369 6.0202]
• With the regularization, none of the polynomial curves passes through the training samples exactly.
  In the case of orders 5 and 6, the test MSE dropped from 1.0548 (order 5) and 10.7674 (order 6) to
  0.4369 (order 5) and 6.0202 (order 6) after regularization was added. Thus, the regularization
  reduces the overfitting.
• On the other hand, the regularization did not help orders 1 to 4. Observe that the test MSE actually
went up. In this case, the regularization was overly strong, which ended up hurting these orders.
• Note that the addition of the regularization does not necessarily favor the lower order (compared
with the higher order). For example, the loss for order 2 is 0.018476, but the loss for order 6 is
0.0036308, which is lower. Thus, adding regularization does not help us choose the best polynomial
order for best prediction.
• In fact, selecting the best regularization parameter or model complexity (e.g., polynomial order) is
typically done through inner-loop (nested) cross-validation or training-validation-test scheme
(which will be taught in a future lecture).
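
The script Tut7_Q2_yeo.py is not reproduced here. The following is a minimal sketch of the same
experiment (our own variable names, with the training and test pairs as listed above; plotting is omitted),
which should show the same overfitting/underfitting trends:

import numpy as np

x_train = np.array([-10, -8, -3, -1, 2, 7], dtype=float)
y_train = np.array([4.18, 2.42, 0.22, 0.12, 0.25, 3.09])
x_test = np.array([-9, -7, -5, -4, -2, 1, 4, 5, 6, 9], dtype=float)
y_test = np.array([3, 1.81, 0.80, 0.25, -0.19, 0.4, 1.24, 1.68, 2.32, 5.05])

def poly_matrix(x, order):
    return np.vander(x, order + 1, increasing=True)   # columns [1, x, x^2, ...]

for lam in [0.0, 1.0]:                                 # (a) no regularization, (b) ridge with lambda = 1
    print("lambda =", lam)
    for order in range(1, 7):
        P, Pt = poly_matrix(x_train, order), poly_matrix(x_test, order)
        if lam == 0.0 and P.shape[0] < P.shape[1]:
            w = P.T @ np.linalg.inv(P @ P.T) @ y_train                                 # dual solution
        else:
            w = np.linalg.inv(P.T @ P + lam*np.identity(P.shape[1])) @ P.T @ y_train   # primal / ridge
        print(order, np.mean((P @ w - y_train)**2), np.mean((Pt @ w - y_test)**2))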
EE2211 Tutorial 8

Question 1
Suppose we are minimizing f (x) = x4 with respect to x. We initialize x to be 2. We
perform gradient descent with learning rate 0.1. What is the value of x after the first
iteration?

Answer:

• The gradient of f(x) is 4x³.

• At x = 2, the gradient is 4 × 2³ = 32.

• After the first iteration of gradient descent, the value of x will be x = 2 − 0.1 × 32 = −1.2
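
A two-line check of this update (hypothetical variable names):

x, lr = 2.0, 0.1
x = x - lr * 4 * x**3    # gradient of x^4 is 4x^3
print(x)                 # -1.2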

Question 2
Please consider the csv file (government-expenditure-on-education.csv), which depicts
the government’s educational expenditure over the years. We would like to predict
expenditure as a function of year. To do this, fit an exponential model f(x, w) = exp(−xᵀw)
with squared error loss to estimate w based on the csv file and gradient descent. In other
words, C(w) = Σᵢ₌₁ᵐ (f(xᵢ, w) − yᵢ)².

Note that even though year is one dimensional, we should add the bias term, so x =
[1 year]T . Furthermore, optimizing the exponential function is tricky (because a small
change in w can lead to large change in f ). Therefore for the purpose of optimization,
divide the “year” variable by the largest year (2018) and divide the “expenditure” by the
largest expenditure, so that the resulting normalized year and normalized expenditure
variables are between 0 and 1. Use a learning rate of 0.03 and run gradient descent for
2000000 iterations.

(a) Plot the cost function C(w) as a function of the number of iterations.

(b) Use the fitted parameters to plot the predicted educational expenditure from year
1981 to year 2023.

(c) Repeat (a) using a learning rate of 0.1 and learning rate of 0.001. What do you
observe relative to (a)?

The goal of this question is for you to code up gradient descent, so I will provide you
with the gradient derivation. First, please note that in general, ∇_w(xᵀw) = x. To see
this:

    ∇_w(xᵀw) = [∂(xᵀw)/∂w₁, ∂(xᵀw)/∂w₂, …, ∂(xᵀw)/∂w_d]ᵀ
             = [∂(w₁x₁ + w₂x₂ + ⋯ + w_d x_d)/∂w₁, …, ∂(w₁x₁ + w₂x₂ + ⋯ + w_d x_d)/∂w_d]ᵀ
             = [x₁, x₂, …, x_d]ᵀ = x                                                  (1)

The above equality will be very useful for the other questions as well. Now, going back
to our question,

    ∇_w C(w) = ∇_w Σᵢ₌₁ᵐ (f(xᵢ, w) − yᵢ)²                                             (2)
             = Σᵢ₌₁ᵐ ∇_w (f(xᵢ, w) − yᵢ)²                                             (3)
             = Σᵢ₌₁ᵐ 2(f(xᵢ, w) − yᵢ) ∇_w f(xᵢ, w)                      (chain rule)  (4)
             = Σᵢ₌₁ᵐ 2(f(xᵢ, w) − yᵢ) ∇_w exp(−xᵢᵀw)                                  (5)
             = −Σᵢ₌₁ᵐ 2(f(xᵢ, w) − yᵢ) exp(−xᵢᵀw) ∇_w(xᵢᵀw)             (chain rule)  (6)
             = −Σᵢ₌₁ᵐ 2(f(xᵢ, w) − yᵢ) exp(−xᵢᵀw) xᵢ                                  (7)
             = −Σᵢ₌₁ᵐ 2(f(xᵢ, w) − yᵢ) f(xᵢ, w) xᵢ                                    (8)
Answer:

Please see code Tut8_Q2.py.


(a) See Figure 1 below. The cost function decreases rapidly at first and then converges
to a final value.
(b) See Figure 2 below.
(c) See Figures 3 and 4 below. A learning rate of 0.1 is too big, so the cost function
    does not decrease monotonically with increasing iterations, but instead fluctuates
    a lot without converging. The final cost function value is much worse than in (a).
    On the other hand, a learning rate of 0.001 is too small. So even though the cost
    function decreases monotonically with increasing iterations, gradient descent has
    not converged even after 2000000 iterations. The final cost function value is much
    worse than in (a).
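
The script Tut8_Q2.py is not reproduced here. Below is a minimal sketch of the gradient-descent loop;
the CSV column order is an assumption (adjust the indexing to the actual header), and plotting is omitted.

import numpy as np
import pandas as pd

df = pd.read_csv('government-expenditure-on-education.csv')
year = df.iloc[:, 0].to_numpy(dtype=float)       # assumed: first column is the year
expn = df.iloc[:, 1].to_numpy(dtype=float)       # assumed: second column is the expenditure

X = np.column_stack([np.ones_like(year), year / year.max()])   # add bias term, normalize year
y = expn / expn.max()                                           # normalize expenditure

w = np.zeros(2)
lr = 0.03
costs = []
for it in range(2000000):
    f = np.exp(-X @ w)                     # model output f(x, w) = exp(-x'w)
    grad = -2.0 * ((f - y) * f) @ X        # gradient from equation (8)
    w = w - lr * grad
    if it % 10000 == 0:
        costs.append(np.sum((f - y)**2))   # cost C(w), stored for plotting later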

Figure 1: Cost function value (squared error) as a function of iterations, learning rate = 0.03.

Figure 2: Fitted curve vs. real data (expenditure over year) from 1981 to 2023, learning rate = 0.03.

Question 3
Given the linear learning model f(x, w) = xᵀw, where x ∈ ℝᵈ. Consider the loss
function L(f(xᵢ, w), yᵢ) = (f(xᵢ, w) − yᵢ)⁴, where i indexes the i-th training sample. The
final cost function is C(w) = Σᵢ₌₁ᵐ L(f(xᵢ, w), yᵢ), where m is the total number of
training samples. Derive the gradient of the cost function with respect to w.

Figure 3: Cost function value (squared error) as a function of iterations, learning rate = 0.1.

Figure 4: Cost function value (squared error) as a function of iterations, learning rate = 0.001.

Answer:
    ∇_w C(w) = ∇_w Σᵢ₌₁ᵐ (f(xᵢ, w) − yᵢ)⁴                                             (9)
             = Σᵢ₌₁ᵐ ∇_w (f(xᵢ, w) − yᵢ)⁴                                             (10)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)                     (chain rule)  (11)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w(xᵢᵀw)                                      (12)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ xᵢ                                             (13)
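
A numerical gradient check of equation (13) on random data (illustrative only, with our own variable names):

import numpy as np

rng = np.random.RandomState(0)
X, y, w = rng.randn(5, 3), rng.randn(5), rng.randn(3)

def cost(w):
    return np.sum((X @ w - y)**4)

analytic = (4 * (X @ w - y)**3) @ X        # equation (13): sum_i 4(f(x_i,w) - y_i)^3 x_i
eps = 1e-6
numeric = np.array([(cost(w + eps*np.eye(3)[j]) - cost(w - eps*np.eye(3)[j])) / (2*eps)
                    for j in range(3)])
print(analytic)
print(numeric)                             # the two should agree closely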

Question 4
Repeat Question 3 using f(x, w) = σ(xᵀw), where σ(a) = 1/(1 + exp(−βa)).

Answer:
    ∇_w C(w) = ∇_w Σᵢ₌₁ᵐ (f(xᵢ, w) − yᵢ)⁴                                             (14)
             = Σᵢ₌₁ᵐ ∇_w (f(xᵢ, w) − yᵢ)⁴                                             (15)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)                     (chain rule)  (16)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w σ(xᵢᵀw)                                    (17)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) ∇_w(xᵢᵀw)             (chain rule)  (18)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) xᵢ                                  (19)

So we just have to evaluate ∂σ(a)/∂a and plug it into the above equation. Note that ∂σ(a)/∂a
is evaluated at a = xᵢᵀw, so

    ∂σ(a)/∂a = ∂/∂a [1/(1 + exp(−βa))]                                                (20)
             = −[1/(1 + e^(−βa))²] ∂(1 + e^(−βa))/∂a                                  (21)
             = β e^(−βa)/(1 + e^(−βa))²                                               (22)
             = β (1 + e^(−βa) − 1)/(1 + e^(−βa))²                                     (23)
             = β [1/(1 + e^(−βa)) − 1/(1 + e^(−βa))²]                                 (24)
             = β (σ(a) − σ²(a))                                                       (25)
             = β σ(a)(1 − σ(a))                                                       (26)
             = β σ(xᵢᵀw)(1 − σ(xᵢᵀw))                                                 (27)

Therefore,

    ∇_w C(w) = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ β σ(xᵢᵀw)(1 − σ(xᵢᵀw)) xᵢ                      (28)
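
The key identity ∂σ(a)/∂a = βσ(a)(1 − σ(a)) can be verified numerically (a small self-contained check,
not part of the original answer):

import numpy as np

beta = 2.0
sigma = lambda a: 1.0 / (1.0 + np.exp(-beta * a))

a = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (sigma(a + eps) - sigma(a - eps)) / (2*eps)     # central finite difference
analytic = beta * sigma(a) * (1 - sigma(a))
print(np.max(np.abs(numeric - analytic)))                  # essentially zero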

Question 5
Repeat Question 3 using f (x, w) = σ(xT w), where σ(a) = max(0, a)

Answer:
    ∇_w C(w) = ∇_w Σᵢ₌₁ᵐ (f(xᵢ, w) − yᵢ)⁴                                             (29)
             = Σᵢ₌₁ᵐ ∇_w (f(xᵢ, w) − yᵢ)⁴                                             (30)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w f(xᵢ, w)                     (chain rule)  (31)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ ∇_w σ(xᵢᵀw)                                    (32)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) ∇_w(xᵢᵀw)             (chain rule)  (33)
             = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ (∂σ(a)/∂a) xᵢ                                  (34)

So we just have to evaluate ∂σ(a)/∂a, evaluated at a = xᵢᵀw, and plug it into the above
equation. When a < 0, σ(a) = 0, so ∂σ(a)/∂a = 0. When a > 0, σ(a) = a, so ∂σ(a)/∂a = 1.
Let us define δ(xᵢᵀw > 0) = 1 if xᵢᵀw > 0 and 0 if xᵢᵀw < 0, so we get

    ∂σ(a)/∂a = δ(xᵢᵀw > 0)                                                            (35)

Therefore, we get

    ∇_w C(w) = Σᵢ₌₁ᵐ 4(f(xᵢ, w) − yᵢ)³ xᵢ δ(xᵢᵀw > 0)                                 (36)

EE2211 Tutorial 9

(Gini impurity, entropy and misclassification rate)


Question 1:
Compute the Gini impurity, entropy, and misclassification rate for nodes A, B and C, as well as the overall
metrics (Gini impurity, entropy, misclassification error) at depth 1 of the decision tree shown below.
[The tree diagram is not reproduced here. From the class proportions used in the answer, node A (the root)
contains 5 red triangles, 5 orange squares and 8 blue circles; it is split into node B (4 red triangles,
6 blue circles) and node C (1 red triangle, 5 orange squares, 2 blue circles).]

Answer:

Let's assume class 1, class 2 and class 3 correspond to red triangles, orange squares and blue circles
respectively.
• For node A, p₁ = 5/18, p₂ = 5/18, p₃ = 8/18 = 4/9
• For node B, p₁ = 4/10 = 2/5, p₂ = 0/10 = 0, p₃ = 6/10 = 3/5
• For node C, p₁ = 1/8, p₂ = 5/8, p₃ = 2/8 = 1/4

For Gini impurity, recall the formula is 1 − Σᵢ pᵢ²
• Node A: 1 − (5/18)² − (5/18)² − (4/9)² = 0.6481
• Node B: 1 − (2/5)² − (0)² − (3/5)² = 0.48
• Node C: 1 − (1/8)² − (5/8)² − (1/4)² = 0.5312
• Overall Gini at depth 1: (10/18) × 0.48 + (8/18) × 0.5312 = 0.5028
Observe the decrease in Gini impurity from root (0.6481) to depth 1 (0.5028)

For entropy, recall the formula is − Σᵢ pᵢ log₂ pᵢ
• Node A: − (5/18) log₂(5/18) − (5/18) log₂(5/18) − (4/9) log₂(4/9) = 1.5466
• Node B: − (2/5) log₂(2/5) − (0) log₂(0) − (3/5) log₂(3/5) = 0.9710
• Node C: − (1/8) log₂(1/8) − (5/8) log₂(5/8) − (1/4) log₂(1/4) = 1.2988
• Overall entropy at depth 1: (10/18) × 0.9710 + (8/18) × 1.2988 = 1.1167
Observe the decrease in entropy from root (1.5466) to depth 1 (1.1167)

For misclassification rate, recall the formula is 1 − maxⱼ pⱼ
• Node A: 1 − max(5/18, 5/18, 4/9) = 1 − 4/9 = 5/9 = 0.5556
• Node B: 1 − max(2/5, 0, 3/5) = 1 − 3/5 = 2/5
• Node C: 1 − max(1/8, 5/8, 1/4) = 1 − 5/8 = 3/8
• Overall misclassification error rate at depth 1: (10/18) × (2/5) + (8/18) × (3/8) = 0.3889
• We can also double check that at depth 1, the 4 red triangles will be classified wrongly for node B
  and the 1 red triangle + 2 blue circles will be classified wrongly for node C. So in total, there will
  be 7 wrong classifications out of 18 datapoints, which corresponds to 7/18 = 0.3889
• Observe the decrease in misclassification rate from root (0.5556) to depth 1 (0.3889)
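
The three metrics can also be computed with a short helper function (a sketch; the per-node class counts
are the ones used in the answer above):

import numpy as np

def impurities(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    gini = 1 - np.sum(p**2)
    entropy = -np.sum([q * np.log2(q) for q in p if q > 0])   # treat 0 log 0 as 0
    misclass = 1 - p.max()
    return gini, entropy, misclass

A, B, C = [5, 5, 8], [4, 0, 6], [1, 5, 2]     # (red, orange, blue) counts for nodes A, B, C
for name, counts in zip("ABC", [A, B, C]):
    print(name, impurities(counts))
gB, eB, mB = impurities(B)
gC, eC, mC = impurities(C)
print(10/18*gB + 8/18*gC, 10/18*eB + 8/18*eC, 10/18*mB + 8/18*mC)   # overall metrics at depth 1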
(MSE of regression trees)
Question 2:
Calculate the overall MSE for the following data at depth 1 of a regression tree assuming a decision
threshold is taken at 𝑥 = 5.0. How does it compare with the MSE at the root?
{𝑥, 𝑦}: {1, 2}, {0.8, 3}, {2, 2.5}, {2.5, 1}, {3, 2.3}, {4, 2.8}, {4.2, 1.5}, {6, 2.6}, {6.3, 3.5}, {7, 4}, {8,
3.5}, {8.2, 5}, {9, 4.5}

Answer:

At depth 1, when x > 5
• y = {2.6, 3.5, 4, 3.5, 5, 4.5} => ȳ = 3.85
• MSE = (1/6)[(2.6 − ȳ)² + (3.5 − ȳ)² + (4 − ȳ)² + (3.5 − ȳ)² + (5 − ȳ)² + (4.5 − ȳ)²] = 0.5958

At depth 1, when x ≤ 5
• y = {2, 3, 2.5, 1, 2.3, 2.8, 1.5} => ȳ = 2.1571
• MSE = (1/7)[(2 − ȳ)² + (3 − ȳ)² + (2.5 − ȳ)² + (1 − ȳ)² + (2.3 − ȳ)² + (2.8 − ȳ)² + (1.5 − ȳ)²] = 0.4367

Overall MSE at depth 1: (6/13) × 0.5958 + (7/13) × 0.4367 = 0.5102

At the root:
• y = {2, 3, 2.5, 1, 2.3, 2.8, 1.5, 2.6, 3.5, 4, 3.5, 5, 4.5} => ȳ = 2.9385
• MSE = (1/13)[(2.6 − ȳ)² + (3.5 − ȳ)² + (4 − ȳ)² + (3.5 − ȳ)² + (5 − ȳ)² + (4.5 − ȳ)² + (2 − ȳ)² +
  (3 − ȳ)² + (2.5 − ȳ)² + (1 − ȳ)² + (2.3 − ȳ)² + (2.8 − ȳ)² + (1.5 − ȳ)²] = 1.2224

Therefore, MSE has decreased from 1.2224 at the root to 0.5102 at depth 1
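
The same numbers come out of a few lines of NumPy (an illustrative sketch):

import numpy as np

x = np.array([1, 0.8, 2, 2.5, 3, 4, 4.2, 6, 6.3, 7, 8, 8.2, 9])
y = np.array([2, 3, 2.5, 1, 2.3, 2.8, 1.5, 2.6, 3.5, 4, 3.5, 5, 4.5])

def mse(v):
    return np.mean((v - v.mean())**2)

left, right = y[x <= 5.0], y[x > 5.0]
depth1 = (len(left)*mse(left) + len(right)*mse(right)) / len(y)
print(mse(y), depth1)     # ~1.2224 at the root, ~0.5102 at depth 1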
(Regression tree, Python)
Question 3:
Import the California Housing dataset “from sklearn.datasets import
fetch_california_housing” and “housing = fetch_california_housing()”. This
data set contains 8 features and 1 target variable listed below. Use “MedInc” as the input feature and
“MedHouseVal” as the target output. Fit a regression tree to depth 2 and compare your results with
results generated by “from sklearn.tree import DecisionTreeRegressor” using the
“squared error” criterion.

Target: ['MedHouseVal']
Features:['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude']

Answer:
Please refer to Tut9_Q3_zhou.py. We can exactly replicate the results from scikit-learn. Note that in the
plot below, the blue dots are the training datapoints. The curves from scikit-learn (black line) and our own
tree (red dashed line) are on top of each other, so they might be hard to tell apart.

(Classification tree, Python)


Question 4:
Get the data set “from sklearn.datasets import load_iris”. Perform the following tasks.
(a) Split the database into two sets: 80% of samples for training, and 20% of samples for testing using
random_state=0
(b) Train a decision tree classifier (i.e., “tree.DecisionTreeClassifier” from sklearn) using
the training set with a maximum depth of 4 based on the “entropy” criterion.
(c) Compute the training and test accuracies. You can use accuracy_score from
sklearn.metrics for accuracy computation
(d) Plot the tree using “tree.plot_tree”.

Answer:

Please refer to Tut9_Q4_yeo.py.
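
Since the script itself is not included here, the following is a minimal sketch of the required steps
using scikit-learn (the exact settings in Tut9_Q4_yeo.py may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.2, random_state=0)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X_train, y_train)
print(accuracy_score(y_train, clf.predict(X_train)))   # training accuracy
print(accuracy_score(y_test, clf.predict(X_test)))     # test accuracy
tree.plot_tree(clf)
plt.show()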

Training accuracy: 0.9917


Test accuracy: 1.0

The resulting tree is produced by tree.plot_tree (the diagram is omitted here).


EE2211 Tutorial 10

Question 1:
We have two classifiers showing the same accuracy with the same cross-validation. The more complex
model (such as a 9th-order polynomial model) is preferred over the simpler one (such as a 2nd-order
polynomial model).
a) True
b) False
Answer: b).

Question 2:
According to the plots below, the Gini Coefficient is equal to two times the Area Under the ROC minus one.
a) True
b) False
Answer: a).
Reason: Since the area (A+B) = ½ (half the square box of area 1), the Gini coefficient = A/(A+B) = 2A.
AUC = A + ½ => A = AUC − 0.5. Substitute A into the Gini above: Gini coefficient = 2(AUC − 0.5).
[The ROC plots referred to above are not reproduced here.]

Question 2:
We have 3 parameter candidates for a classification model, and we would like to choose the optimal one for
deployment. As such, we run 5-fold cross-validation.

Once we have completed the 5-fold cross-validation, in total, we have trained _______ classifiers. Note that we
treat models with different parameters as different classifiers.

A) 10
B) 20
C) 25
D) 15

Answer: D)
In each fold we train 3 classifiers, so 5 folds give 15 classifiers.

Question 3:
Suppose the binary classification problem, which you are dealing with, has highly imbalanced classes. The
majority class has 99 hundred samples and the minority class has 1 hundred samples. Which of the
following metric(s) would you choose for assessing the classification performance? (Select all relevant
metric(s) to get full credit)
a) Classification Accuracy
b) Cost sensitive accuracy
c) Precision and recall
d) None of these

Answer: (b, c)

Question 4:
Given below is a scenario for Training error rate Tr, and Validation error rate Va for a machine learning
algorithm. You want to choose a hyperparameter (P) based on Tr and Va.
P Tr Va
10 0.10 0.25
9 0.30 0.35
8 0.22 0.15
7 0.15 0.25
6 0.18 0.15
Which value of P will you choose based on the above table?
a) 10
b) 9
c) 8
d) 7
e) 6

Answer: e).

(Binary and Multicategory Confusion Matrices)


Question 5:
Tabulate the confusion matrices for the following classification problems.
(a) Binary problem (the class-1 and class-2 data points are respectively indicated by squares and circles)

(b) Three-category problem (the class-1, class-2 and class-3 data points are respectively indicated by
squares, circles and triangles).
Answer:
(a)
            Predicted class 1   Predicted class 2
Class 1            16                  4
Class 2             4                 26

(b)
            Predicted class 1   Predicted class 2   Predicted class 3
Class 1            16                  3                   1
Class 2             1                 25                   4
Class 3             3                  1                   6

(5-fold Cross-validation)
Question 6:
Get the data set “from sklearn.datasets import load_iris”. Perform a 5-fold Cross-validation to observe the
best polynomial order (among orders 1 to 10 and without regularization) for validation prediction. Note
that, you will have to partition the whole dataset for training/validation/test parts, where the size of
validation set is the same as that of test. Provide a plot of the average 5-fold training and validation error
rates over the polynomial orders. The randomly partitioned data sets of the 5-fold shall be maintained for
reuse in evaluation of future algorithms.
Answer:
##--- load data from scikit ---##
import numpy as np
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
iris_dataset = load_iris()
X = np.array(iris_dataset['data'])
y = np.array(iris_dataset['target'])
## one-hot encoding
Y = list()
for i in y:
    letter = [0, 0, 0]
    letter[i] = 1
    Y.append(letter)
Y = np.array(Y)
test_Idx = np.random.RandomState(seed=2).permutation(Y.shape[0])
X_test = X[test_Idx[:25]]
Y_test = Y[test_Idx[:25]]
X = X[test_Idx[25:]]
Y = Y[test_Idx[25:]]

from sklearn.preprocessing import PolynomialFeatures

error_rate_train_array = []
error_rate_val_array = []
##--- Loop for polynomial orders 1 to 10 ---##
for order in range(1, 11):
    error_rate_train_array_fold = []
    error_rate_val_array_fold = []
    # Random permutation of data
    Idx = np.random.RandomState(seed=8).permutation(Y.shape[0])
    # Loop 5 times for 5-fold
    for k in range(0, 5):
        ##--- Prepare training and validation data for this fold ---##
        X_val = X[Idx[k*25:(k+1)*25]]
        Y_val = Y[Idx[k*25:(k+1)*25]]
        Idxtrn = np.setdiff1d(Idx, Idx[k*25:(k+1)*25])
        X_train = X[Idxtrn]
        Y_train = Y[Idxtrn]
        ##--- Polynomial classification ---##
        poly = PolynomialFeatures(order)
        P = poly.fit_transform(X_train)
        Pval = poly.fit_transform(X_val)
        if P.shape[0] > P.shape[1]:   # over-determined: primal solution
            reg_L = 0.00*np.identity(P.shape[1])
            inv_PTP = np.linalg.inv(P.transpose().dot(P) + reg_L)
            pinv_L = inv_PTP.dot(P.transpose())
            wp = pinv_L.dot(Y_train)
        else:                         # under-determined: dual solution
            reg_R = 0.00*np.identity(P.shape[0])
            inv_PPT = np.linalg.inv(P.dot(P.transpose()) + reg_R)
            pinv_R = P.transpose().dot(inv_PPT)
            wp = pinv_R.dot(Y_train)
        ##--- trained output ---##
        y_est_p = P.dot(wp)
        y_cls_p = [[1 if y == max(x) else 0 for y in x] for x in y_est_p]
        m1tr = np.matrix(Y_train)
        m2tr = np.matrix(y_cls_p)
        # training classification error count and rate computation
        difference = np.abs(m1tr - m2tr)
        error_train = np.where(difference.any(axis=1))[0]
        error_rate_train = len(error_train)/len(difference)
        error_rate_train_array_fold += [error_rate_train]
        ##--- validation output ---##
        yval_est_p = Pval.dot(wp)
        yval_cls_p = [[1 if y == max(x) else 0 for y in x] for x in yval_est_p]
        m1 = np.matrix(Y_val)
        m2 = np.matrix(yval_cls_p)
        # validation classification error count and rate computation
        difference = np.abs(m1 - m2)
        error_val = np.where(difference.any(axis=1))[0]
        error_rate_val = len(error_val)/len(difference)
        error_rate_val_array_fold += [error_rate_val]
    # store results for each polynomial order
    error_rate_train_array += [np.mean(error_rate_train_array_fold)]
    error_rate_val_array += [np.mean(error_rate_val_array_fold)]

##--- plotting ---##
import matplotlib.pyplot as plt
order = [x for x in range(1, 11)]
plt.plot(order, error_rate_train_array, color='blue', marker='o', linewidth=3, label='Training')
plt.plot(order, error_rate_val_array, color='orange', marker='x', linewidth=3, label='Validation')
plt.xlabel('Order')
plt.ylabel('Error Rates')
plt.title('Training and Validation Error Rates')
plt.legend()
plt.show()
(Plot the ROC and Compute the AUC for Binary Classification)
Question 7:
Download the spambase data set from the UCI Machine Learning repository
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/ and use the following function to
pack the data:
def load_data(Train=False):
    import csv
    data = []
    ## Read the training data
    f = open('spambase.data')
    reader = csv.reader(f)
    next(reader, None)
    for row in reader:
        data.append(row)
    f.close()
    ## x[:-1]: omit the last element of each x row
    X = np.array([x[:-1] for x in data]).astype(float)   # builtin float (np.float is removed in recent NumPy)
    ## x[-1]: the first element from the right instead of from the left
    y = np.array([x[-1] for x in data]).astype(float)
    del data  # free up the memory
    if Train:
        # returns X_train, X_test, y_train, y_test
        return train_test_split(X, y, test_size=0.2, random_state=8)
    else:
        return X, y
Randomly split the dataset into two parts, 80% for training and 20% for testing. Compute the test
Classification Error Rate and the AUC based on the optimal linear regression model without regularization.
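
No reference solution is given for this question; one possible sketch is shown below. It assumes the
load_data function above has been defined (with numpy and train_test_split imported), fits ordinary least
squares with a bias column on the 80% training split, thresholds the real-valued outputs at 0.5 (our
choice, since the labels are 0/1), and uses sklearn.metrics.roc_auc_score on the raw outputs for the AUC.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = load_data(Train=True)

P = np.column_stack([np.ones(len(y_train)), X_train])     # append a bias column
Pt = np.column_stack([np.ones(len(y_test)), X_test])
w = np.linalg.inv(P.T @ P) @ P.T @ y_train                 # linear regression, no regularization
                                                           # (np.linalg.pinv can be used if ill-conditioned)
scores = Pt @ w                                            # real-valued test outputs
y_pred = (scores > 0.5).astype(float)                      # decision threshold at 0.5
print("test classification error rate:", np.mean(y_pred != y_test))
print("test AUC:", roc_auc_score(y_test, scores))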
EE2211 Tutorial 11

Question 1
The K-means clustering method uses the target labels for calculating the dis-
tances from the cluster centroids for clustering.
a) True
b) False
Ans: b) because target labels are not available in clustering.

Question 2
The fuzzy C-means algorithm groups the data items such that an item can exist
in multiple clusters.
a) True
b) False
Ans: a).

Question 3
How can you prevent a clustering algorithm from getting stuck in bad local
optima?
a) Set the same seed value for each run
b) Use the bottom ranked samples for initialization
c) Use the top ranked samples for initialization
d) All of the above
e) None of the above
Ans: e).

Question 4
     
Consider the following data points: x = [1, 1]ᵀ, y = [0, 1]ᵀ, and z = [0, 0]ᵀ. The k-
means algorithm is initialized with centers at x and y. Upon convergence, the
two centres will be at
a) x and z
b) x and y
c) y and the midpoint of y and z
d) z and the midpoint of x and y
e) None of the above
Ans: e). The converged centers should be x and the midpoint of y and z.
import numpy as np

# Data points
x = np.array([1, 1])
y = np.array([0, 1])
z = np.array([0, 0])

data_points = np.array([x, y, z])

# Initial centers
centers = np.array([x, y])


def k_means(data_points, centers, n_clusters, max_iterations=100, tol=1e-4):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels


centers, labels = k_means(data_points, centers, n_clusters=2)
print("Converged centers:", centers)

Question 5
       
Consider the following 8 data points: x1 = [0, 0]ᵀ, x2 = [0, 1]ᵀ, x3 = [1, 1]ᵀ, x4 = [1, 0]ᵀ,
x5 = [3, 0]ᵀ, x6 = [3, 1]ᵀ, x7 = [4, 0]ᵀ, and x8 = [4, 1]ᵀ. The k-means algorithm is
initialized with centers at c1 = [0, 0]ᵀ and c2 = [3, 0]ᵀ. The first center after convergence
is c1 = [0.5, 0.5]ᵀ. The second centre after convergence is c2 = [blank1, blank2]ᵀ.
Answer: blank1 = 3.5, blank2 = 0.5.
import numpy as np

# Data points
x1 = np.array([0, 0])
x2 = np.array([0, 1])
x3 = np.array([1, 1])
x4 = np.array([1, 0])
x5 = np.array([3, 0])
x6 = np.array([3, 1])
x7 = np.array([4, 0])
x8 = np.array([4, 1])

data_points = np.array([x1, x2, x3, x4, x5, x6, x7, x8])

# Initial centers
c1_init = np.array([0, 0])
c2_init = np.array([3, 0])

centers = np.array([c1_init, c2_init])


def k_means(data_points, centers, n_clusters, max_iterations=100, tol=1e-4):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels


centers, labels = k_means(data_points, centers, n_clusters=2)
print("Converged centers:", centers)

Question 6
Generate three clusters of data using the following codes.

# Import necessary libraries

import random as rd
import numpy as np # linear algebra
from matplotlib import pyplot as plt

# Generate data

# Set three centers, the model should predict similar results

center_1 = np.array([2,2])
center_2 = np.array([4,4])
center_3 = np.array([6,1])

# Generate random data and center it to the three centers

data_1 = np.random.randn(200, 2) + center_1


data_2 = np.random.randn(200,2) + center_2
data_3 = np.random.randn(200,2) + center_3
data = np.concatenate((data_1, data_2, data_3), axis = 0)
plt.scatter(data[:,0], data[:,1], s=7)
(i) Implement the Naïve K-means clustering algorithm to find the 3 cluster
centroids. Classify the data based on the three centroids found and illustrate
the results using a plot (e.g., mark the 3 clusters of data points using different
colours).
(ii) Change the number of clusters K to 5 and classify the data points again
with a plot illustration.
# Import necessary libraries

import random as rd
import numpy as np  # linear algebra
from matplotlib import pyplot as plt

# Generate data

# Set three centers, the model should predict similar results
center_1 = np.array([2, 2])
center_2 = np.array([4, 4])
center_3 = np.array([6, 1])

# Generate random data and center it to the three centers
data_1 = np.random.randn(200, 2) + center_1
data_2 = np.random.randn(200, 2) + center_2
data_3 = np.random.randn(200, 2) + center_3
data = np.concatenate((data_1, data_2, data_3), axis=0)

# initialize cluster centers
k = 3
centers = data[np.random.choice(len(data), k, replace=False)]

def k_means(data_points, centers, n_clusters, max_iterations=100, tol=1e-4):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

centers, labels = k_means(data, centers, n_clusters=k)
print("Converged centers:", centers)
plt.title('Clustering Results')
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', alpha=0.5)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', s=200, c='k')
plt.show()

Figure 1: K=3

Figure 2: K=5

Question 7
Load the iris data using "from sklearn.datasets import load_iris". Assume that
the class labels are not given. Use the Naïve K-means clustering algorithm to
group all the data based on K = 3. How accurate is the result of clustering
compared with the known labels?
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np

# load the iris dataset
iris = load_iris()

# get the data and the true labels
data = iris.data
y_true = iris.target

# initialize the KMeans centers with K=3
k = 3
centers = data[np.random.choice(len(data), k, replace=False)]

def k_means(data_points, centers, n_clusters, max_iterations=1000, tol=1e-6):
    for _ in range(max_iterations):
        # Assign each data point to the closest centroid
        labels = np.argmin(np.linalg.norm(data_points[:, np.newaxis] - centers, axis=2), axis=1)

        # Update centroids to be the mean of the data points assigned to them
        new_centers = np.zeros((n_clusters, data_points.shape[1]))
        for i in range(n_clusters):
            new_centers[i] = data_points[labels == i].mean(axis=0)

        # End if centroids no longer change
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

centers, y_pred = k_means(data, centers, n_clusters=k)

# create a mask that selects elements where the value is 0, 1, 2
mask_0 = (y_pred == 0)
mask_1 = (y_pred == 1)
mask_2 = (y_pred == 2)

# try all 6 permutations of cluster labels, since cluster indices are arbitrary
y_pred0 = y_pred.copy()
y_pred0[mask_0] = 0
y_pred0[mask_1] = 1
y_pred0[mask_2] = 2

y_pred1 = y_pred.copy()
y_pred1[mask_0] = 0
y_pred1[mask_1] = 2
y_pred1[mask_2] = 1

y_pred2 = y_pred.copy()
y_pred2[mask_0] = 1
y_pred2[mask_1] = 0
y_pred2[mask_2] = 2

y_pred3 = y_pred.copy()
y_pred3[mask_0] = 1
y_pred3[mask_1] = 2
y_pred3[mask_2] = 0

y_pred4 = y_pred.copy()
y_pred4[mask_0] = 2
y_pred4[mask_1] = 0
y_pred4[mask_2] = 1

y_pred5 = y_pred.copy()
y_pred5[mask_0] = 2
y_pred5[mask_1] = 1
y_pred5[mask_2] = 0

# calculate the accuracy of the clustering (best over label permutations)
accuracy = 0.0
for pred in [y_pred0, y_pred1, y_pred2, y_pred3, y_pred4, y_pred5]:
    accuracy = max([accuracy_score(y_true, pred), accuracy])

print("Accuracy of clustering: {:.2f}".format(accuracy))

EE2211 Tutorial 12

Question 1: The convolutional neural network is particularly useful for applications related to image and
text processing due to its dense connections.
a) True
b) False

Ans: b).

Question 2: In neural networks, nonlinear activation functions such as sigmoid, and ReLU
a) speed up the gradient calculation in backpropagation, as compared to linear units
b) are applied only to the output units
c) help to introduce non-linearity into the model
d) always output values between 0 and 1
Ans: c.

Question 3: A fully connected network of 2 layers has been constructed as


    f_W(X) = σ(σ(X W₁) W₂)

where X = [1 1 3.0; 1 2 2.5] and W₁ = W₂ = [−1 0 1; 0 −1 0; 1 0 1].

Suppose the Rectified Linear Unit (ReLU) has been used as the activation function (σ) for all the
nodes. Compute the network output matrix f_W(X) (up to 1 decimal place for each entry) based on
the given network weights and data.

    f_W(X) = [output1 output2 output3; output4 output5 output6]
Answer:
    σ(X W₁) = ReLU([1 1 3.0; 1 2 2.5] [−1 0 1; 0 −1 0; 1 0 1])
            = ReLU([2 −1 4.0; 1.5 −2 3.5])
            = [2 0 4.0; 1.5 0 3.5]

    σ(σ(X W₁) W₂) = ReLU([2 0 4.0; 1.5 0 3.5] [−1 0 1; 0 −1 0; 1 0 1])
                  = ReLU([2 0 6; 2 0 5])
                  = [2 0 6; 2 0 5]
Matlab codes:
X = [1 1 3; 1 2 2.5]
W1 = [-1 0 1; 0 -1 0; 1 0 1]
W2 = W1;
F = ReLU(ReLU(X*W1)*W2)
function y = ReLU(x)
y = max(0,x);
end
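
For readers working in Python, an equivalent NumPy version of the Matlab snippet above (added here purely
for illustration):

import numpy as np

relu = lambda a: np.maximum(0, a)
X = np.array([[1, 1, 3.0], [1, 2, 2.5]])
W1 = np.array([[-1, 0, 1], [0, -1, 0], [1, 0, 1]])
W2 = W1
print(relu(relu(X @ W1) @ W2))
# [[2. 0. 6.]
#  [2. 0. 5.]]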

Question 4: A fully connected network of 3 layers has been constructed as


    f_W(X) = σ([1, σ([1, σ(X W₁)] W₂)] W₃)

(here 1 denotes a column of ones appended for the bias), where X = [1 2 1; 1 5 1],
W₁ = [−1 0 1; 0 −1 0; 1 0 −1], and W₂ = W₃ = [−1 0 1; 0 −1 0; 1 0 1; 1 −1 1].

Suppose the Sigmoid has been used as the activation function (σ) for all the nodes. Compute the
network output matrix f_W(X) (up to 1 decimal place for each entry) based on the given network
weights and data.

    f_W(X) = [output1 output2 output3; output4 output5 output6]
Answer:
    σ(X W₁) = σ([1 2 1; 1 5 1] [−1 0 1; 0 −1 0; 1 0 −1])
            = σ([0 −2 0; 0 −5 0])
            = [0.5 0.1192 0.5; 0.5 0.0067 0.5]

    σ([1, σ(X W₁)] W₂) = σ([1 0.5 0.1192 0.5; 1 0.5 0.0067 0.5] [−1 0 1; 0 −1 0; 1 0 1; 1 −1 1])
                       = σ([−0.3808 −1.0000 1.6192; −0.4933 −1.0000 1.5067])
                       = [0.4059 0.2689 0.8347; 0.3791 0.2689 0.8186]

    σ([1, σ([1, σ(X W₁)] W₂)] W₃)
                       = σ([1 0.4059 0.2689 0.8347; 1 0.3791 0.2689 0.8186] [−1 0 1; 0 −1 0; 1 0 1; 1 −1 1])
                       = [0.5259 0.2243 0.8913; 0.5219 0.2319 0.8897]
Matlab Codes:
X = [1 2 1; 1 5 1]
W1 = [-1 0 1; 0 -1 0; 1 0 -1]
W2 = [-1 0 1; 0 -1 0; 1 0 1; 1 -1 1]
W3 = W2;
F = sigmoid([ones(2,1),sigmoid([ones(2,1),sigmoid(X*W1)]*W2)]*W3)
function y = sigmoid(x)
y = 1./(1+exp(-x));
end

(MLP classifier, find the best hidden node size, assuming same hidden layer size in each layer, based on
cross-validation on the training set and then use it for testing)
Question 5:
Obtain the data set “from sklearn.datasets import load_iris”.
(a) Split the database into two sets: 80% of samples for training, and 20% of samples for testing
using random_state=0
(b) Perform a 5-fold Cross-validation using only the training set to determine the best 3-layer
MLPClassifier (from sklearn.neural_network import MLPClassifier
with hidden_layer_sizes=(Nhidd,Nhidd,Nhidd) for Nhidd in
range(1,11))* for prediction. In other words, partition the training set into two sets, 4/5 for
training and 1/5 for validation; and repeat this process until each of the 1/5 has been validated.
Provide a plot of the average 5-fold training and validation accuracies over the different network
sizes.
(c) Find the size of Nhidd that gives the best validation accuracy for the training set.
(d) Use this Nhidd in the MLPClassifier with
hidden_layer_sizes=(Nhidd,Nhidd,Nhidd) to compute the prediction accuracy
based on the 20% of samples for testing in part (a).
* The assumption of hidden_layer_sizes=(Nhidd,Nhidd,Nhidd)is to reduce the search
space in this exercise. In field applications, the search should take different sizes for each hidden layer.

Answer:
## load data from scikit
import numpy as np
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier  # neural network
from sklearn import metrics

def find_network_size(X_train, y_train):
    acc_train_array = []
    acc_valid_array = []
    for Nhidd in range(1, 11):
        acc_train_array_fold = []
        acc_valid_array_fold = []
        ## Random permutation of data
        Idx = np.random.RandomState(seed=8).permutation(len(y_train))
        ## Tuning: perform 5-fold cross-validation on the training set to determine the best network size
        for k in range(0, 5):
            N = np.around((k+1)*len(y_train)/5)
            N = N.astype(int)
            Xvalid = X_train[Idx[N-24:N]]   # validation features
            Yvalid = y_train[Idx[N-24:N]]   # validation targets
            Idxtrn = np.setdiff1d(Idx, Idx[N-24:N])
            Xtrain = X_train[Idxtrn]        # training features in tuning loop
            Ytrain = y_train[Idxtrn]        # training targets in tuning loop
            ## MLP classification with the same size for each hidden layer (specified in question)
            clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                                hidden_layer_sizes=(Nhidd, Nhidd, Nhidd), random_state=1)
            clf.fit(Xtrain, Ytrain)
            ## trained output
            y_est_p = clf.predict(Xtrain)
            acc_train_array_fold += [metrics.accuracy_score(y_est_p, Ytrain)]
            ## validation output
            yt_est_p = clf.predict(Xvalid)
            acc_valid_array_fold += [metrics.accuracy_score(yt_est_p, Yvalid)]
        acc_train_array += [np.mean(acc_train_array_fold)]
        acc_valid_array += [np.mean(acc_valid_array_fold)]
    ## find the size that gives the best validation accuracy
    Nhidden = np.argmax(acc_valid_array, axis=0) + 1

    ## plotting
    import matplotlib.pyplot as plt
    hiddensize = [x for x in range(1, 11)]
    plt.plot(hiddensize, acc_train_array, color='blue', marker='o', linewidth=3, label='Training')
    plt.plot(hiddensize, acc_valid_array, color='orange', marker='x', linewidth=3, label='Validation')
    plt.xlabel('Number of hidden nodes in each layer')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracies')
    plt.legend()
    plt.show()
    return Nhidden

## load data
iris_dataset = load_iris()
## split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],
                                                    iris_dataset['target'],
                                                    test_size=0.20,
                                                    random_state=0)
## find the best hidden node size using only the training set
Nhidden = find_network_size(X_train, y_train)
print('best hidden node size =', Nhidden, 'based on 5-fold cross-validation on training set')
## perform evaluation
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(Nhidden, Nhidden, Nhidden), random_state=1)
clf.fit(X_train, y_train)
## trained output
y_test_predict = clf.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test_predict, y_test)
print('test accuracy =', test_accuracy)
>> best hidden node size = 6 based on 5-fold cross-validation on training set
>> test accuracy = 1.0

(An example of handwritten digit image classification using CNN)


Question 6:
Please go through the baseline example in the following link to get a feel of how the Convolutional Neural
Network (CNN) can be used for handwritten digit image classification.
https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-
mnist-handwritten-digit-classification/
Note: This example assumes that you are using standalone Keras running on top of TensorFlow with Python
3 (you might need conda install -c conda-forge keras tensorflow to get the Keras
library installed).
The following codes might be useful for warnings suppression if you find them annoying:
import warnings
warnings.filterwarnings("ignore",category=UserWarning)
As the data size and the network size are relatively large comparing with previous assignments, the codes
can take quite some time to run (e.g., several minutes running on the latest notebook).

Results:
Accuracy for each fold:
> 98.583
> 98.425
> 98.342
> 98.575
> 98.592
Accuracy: mean=98.503 std=0.102, n=5
Improved version (network of larger size):
Accuracy for each fold:
> 98.992
> 98.717
> 98.925
> 99.233
> 98.875
Accuracy: mean=98.948 std=0.169, n=5
