Compre FoDS
1.a. Using the definition of conditional entropy and the product rule of probability, prove that
H[x, y] = H[y|x] + H[x]. Prove the result for both cases, i.e., when x, y are continuous and when they are
discrete random variables. [8 Marks]
1.b. Prove Bayes' rule for conditional entropy, i.e., H(y|x) = H(x|y) - H(x) + H(y). [6 Marks]
1.c. Consider two binary variables x and y with the joint distribution given in the following table.
Evaluate the following quantities: (a) H[y|x] (b) H[x|y] (c) H[x, y]. [3 Marks]
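As a sanity check for question 1, the identities in 1.a and 1.b can be verified numerically. The joint table below is a made-up stand-in, since the actual table from 1.c is not reproduced here; a minimal Python sketch:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y; this grid is illustrative only.
p_xy = np.array([[1/3, 1/3],
                 [0.0, 1/3]])

def H(p):
    """Entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_H(p_joint):
    """H[y|x] = -sum_{x,y} p(x,y) log2 p(y|x), straight from the definition."""
    p_x = p_joint.sum(axis=1, keepdims=True)
    ratio = np.divide(p_joint, p_x, out=np.zeros_like(p_joint), where=p_x > 0)
    mask = p_joint > 0
    return -np.sum(p_joint[mask] * np.log2(ratio[mask]))

H_x, H_y = H(p_xy.sum(axis=1)), H(p_xy.sum(axis=0))
H_y_given_x = cond_H(p_xy)      # H[y|x]
H_x_given_y = cond_H(p_xy.T)    # H[x|y], roles of x and y swapped

assert np.isclose(H(p_xy), H_y_given_x + H_x)            # 1.a: chain rule
assert np.isclose(H_y_given_x, H_x_given_y - H_x + H_y)  # 1.b: Bayes' rule
print(H_y_given_x, H_x_given_y, H(p_xy))
```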
2.a. Prove that the linear regression problem, solved by minimizing the sum of squared errors, always
has a unique optimal solution. You may assume that there is only one independent variable
(feature). [8 Marks]
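For intuition on 2.a: with one feature, the sum-of-squares loss is a convex quadratic in the intercept and slope, so the normal equations pin down a single optimum whenever the inputs are not all identical. A minimal sketch on made-up data:

```python
import numpy as np

# Made-up one-feature regression data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
w = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations: (X^T X) w = X^T y
print("intercept, slope:", w)

# X^T X is positive definite because the columns of X are linearly
# independent, which is the algebraic reason the minimiser is unique.
```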
2.b. Write down the loss functions that are optimized in lasso and ridge regression. Illustrate a few
advantages of lasso regression over ridge regression. [8 Marks]
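For reference on 2.b: ridge penalizes the sum-of-squares loss with λ||w||_2^2, while lasso uses λ||w||_1; the L1 penalty drives irrelevant coefficients exactly to zero, which is the usual advantage cited. A short illustration with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features matter; the remaining eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1-penalized least squares
ridge = Ridge(alpha=0.1).fit(X, y)  # L2-penalized least squares

print("lasso:", np.round(lasso.coef_, 3))  # irrelevant weights -> exactly 0
print("ridge:", np.round(ridge.coef_, 3))  # irrelevant weights -> small, nonzero
```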
2.c. The uniform distribution for a continuous variable x is defined by U(x|a, b) = 1/(b – a) for a ≤ x ≤ b.
Verify that this distribution is normalized, and find expressions for its mean and variance. [3 Marks]
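For self-checking 2.c, the standard results a correct derivation should reach are:

```latex
\int_a^b \frac{1}{b-a}\,dx = 1, \qquad
\mathbb{E}[x] = \int_a^b \frac{x}{b-a}\,dx = \frac{a+b}{2}, \qquad
\operatorname{var}[x] = \mathbb{E}[x^2]-\mathbb{E}[x]^2
 = \frac{a^2+ab+b^2}{3}-\frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}.
```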
3.a. Principal component analysis, or PCA, is a technique that is widely used for applications such as
dimensionality reduction, lossy data compression, feature extraction, and data visualization. Derive the k
principal components of n features by posing the problem as a maximum-variance formulation.
Prove all steps required in this derivation. [8 Marks]
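A numerical companion to 3.a (not a substitute for the proof): the maximum-variance directions are the top eigenvectors of the sample covariance matrix. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

Xc = X - X.mean(axis=0)               # centre the data
C = (Xc.T @ Xc) / (len(Xc) - 1)       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues

k = 2
W = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors, one per column
Z = Xc @ W                            # k-dimensional projection
print("fraction of variance kept:", eigvals[::-1][:k].sum() / eigvals.sum())
```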
3.b. Given some data in R^3 with the corresponding 3 × 3 covariance matrix C, whose orthonormal
eigenvectors are c1, c2, c3 with eigenvalues η1 = 3, η2 = 1, and η3 = 0.2. [8 Marks]
1. Define a matrix A ∈ R^(2×3) that maps the data into a two-dimensional space while preserving as much
variance as possible.
2. Define a matrix B ∈ R^(3×2) that maps the reduced data back into R^3 with minimal reconstruction error.
How large is the reconstruction error?
3. Prove that AB is an identity matrix. Why would one expect that intuitively?
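A numerical sketch for 3.b (an illustration, not a proof): for concreteness the orthonormal eigenvectors are taken to be the standard basis, which does not affect the shapes or the AB product:

```python
import numpy as np

c1, c2, c3 = np.eye(3)         # stand-in orthonormal eigenvectors of C
eta = np.array([3.0, 1.0, 0.2])

A = np.vstack([c1, c2])        # 2x3: project onto the two largest-variance directions
B = A.T                        # 3x2: embed the reduced data back into R^3

print(A @ B)                   # 2x2 identity, since the rows of A are orthonormal
print(B @ A)                   # rank-2 projector onto span{c1, c2}

# The expected squared reconstruction error is the discarded variance, eta3 = 0.2.
```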
4.a. What are the Markov, Chebyshev, and Chernoff bounds for a random variable X? Illustrate each
bound with a suitable example. [6 Marks]
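As an illustration for 4.a, the three bounds can be compared with the exact tail of X ~ Exp(1), whose mean, variance, and moment generating function are all known in closed form. A minimal sketch:

```python
import numpy as np

# X ~ Exp(1): E[X] = 1, var[X] = 1, E[e^{tX}] = 1/(1 - t) for t < 1.
a = 5.0
true_tail = np.exp(-a)                         # exact P(X >= a)

markov = 1.0 / a                               # P(X >= a) <= E[X]/a, valid since X >= 0
chebyshev = 1.0 / (a - 1.0) ** 2               # P(|X - 1| >= a - 1) <= var/(a - 1)^2

ts = np.linspace(0.01, 0.99, 999)              # optimise the Chernoff bound over t
chernoff = np.min(np.exp(-ts * a) / (1 - ts))  # min_t e^{-ta} E[e^{tX}]

print(f"true={true_tail:.4f}  markov={markov:.4f}  "
      f"chebyshev={chebyshev:.4f}  chernoff={chernoff:.4f}")
```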
4.b. Find the optimal solution to the following constrained optimization problem using the Lagrangian
function. [4 Marks]
maximize 1 - x^2 - y^2
subject to the constraint x + y - 1 = 0
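A symbolic check of 4.b, assuming SymPy is available: setting all partial derivatives of the Lagrangian L = (1 - x^2 - y^2) + λ(x + y - 1) to zero recovers the optimum:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = 1 - x**2 - y**2 + lam * (x + y - 1)

# Stationarity in x, y and the multiplier (which re-imposes the constraint).
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(sols)  # x = y = 1/2, lam = 1, so the constrained maximum is 1/2
```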
4.c. Write down the Lagrangian function with appropriate constraints for the following optimization
problem: maximize f(x) subject to g_j(x) = 0 for j = 1, 2, …, J and h_k(x) ≥ 0 for k = 1, 2, …, K. [4 Marks]
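For self-checking 4.c, one standard convention (sign conventions vary across textbooks) is:

```latex
L\bigl(\mathbf{x}, \{\lambda_j\}, \{\mu_k\}\bigr)
 = f(\mathbf{x})
 + \sum_{j=1}^{J} \lambda_j\, g_j(\mathbf{x})
 + \sum_{k=1}^{K} \mu_k\, h_k(\mathbf{x}),
\qquad \mu_k \ge 0, \quad \mu_k\, h_k(\mathbf{x}) = 0.
```

Here the λ_j are unconstrained in sign, while the μ_k carry the nonnegativity and complementary-slackness conditions for the inequality constraints.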
5. Assume the following data is given: {22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16}.
a) Apply data discretization by binning the data into 4 bins using equal-depth and equal-width binning,
respectively.
b) Describe the differences between the two binning methods. For each method, give an example
application for which that method is the most appropriate.
c) If you know that the data actually represent ages of persons, what kind of binning method would you
then use? [2+2+2 Marks]
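A minimal sketch of the two schemes in 5.a on the given data, using NumPy for convenience:

```python
import numpy as np

data = np.array([22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16])

# Equal-depth (equal-frequency): sort, then split into 4 bins of 4 values each.
depth_bins = np.split(np.sort(data), 4)
print("equal-depth:", [b.tolist() for b in depth_bins])

# Equal-width: 4 intervals of width (68 - 1) / 4 = 16.75 covering [1, 68].
edges = np.linspace(data.min(), data.max(), 5)
idx = np.clip(np.digitize(data, edges) - 1, 0, 3)  # bin index 0..3 per value
width_bins = [np.sort(data[idx == i]).tolist() for i in range(4)]
print("equal-width:", width_bins)
```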
6.a. What is the difference between supervised and unsupervised discretization?
6.b. What is the difference between dimensionality reduction and feature selection? [3+3 Marks]
7.a. What is the data type of each of the following attributes? [3 Marks]
i. Age
ii. Salary
iii. ZIP code
iv. State of residence
v. Height
vi. Weight
7.b. Classify each of the following as OLTP, OLAP, or Big Data. Justify your answers. [3 Marks]
i. Weekly sales of ice cream in Amul BITS campus.
ii. Customer profiles for fraud detection
iii. Adding a book to a shopping cart
8. You are given as input a list of housing records, where each record contains information about a
single house: (address, city, state, zip, value). The output should be the average house value in each zip
code. Draw the pipeline showing how this problem can be solved using map-reduce.
Note: Just show how the input is mapped into (key, value) pairs by the map stage; specify what the key
is and what the associated value is in each pair. [6 Marks]
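A single-process sketch of the requested pipeline (the record values below are made up; a real deployment would run the same two functions under a framework such as Hadoop or Spark):

```python
from collections import defaultdict

# Toy input: (address, city, state, zip, value) records.
records = [
    ("12 A St", "Pilani", "RJ", "333031", 50.0),
    ("34 B St", "Pilani", "RJ", "333031", 70.0),
    ("56 C St", "Goa",    "GA", "403726", 90.0),
]

def map_phase(record):
    address, city, state, zip_code, value = record
    yield (zip_code, value)            # key = zip code, value = house value

def reduce_phase(zip_code, values):
    return (zip_code, sum(values) / len(values))  # average value per zip

# Shuffle stage: group mapped pairs by key, as the framework would.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, vals) for k, vals in groups.items()])
# -> [('333031', 60.0), ('403726', 90.0)]
```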