ML Unit 3 Assignment
Subset Selection:
Subset selection is a specific type of feature selection in which a subset of the original
features is chosen to build the model. The goal is to select the most important or
relevant features for the task, often based on criteria such as statistical significance or
feature importance (a minimal sketch follows the points below).
● Goal: Improve model interpretability and performance by selecting only the most
informative features.
● Impact:
● Can lead to better model accuracy by removing irrelevant or noisy data.
● Increases computational efficiency by reducing the size of the input data.
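As a concrete illustration, here is a minimal sketch of filter-style subset selection with scikit-learn. The synthetic dataset and the choice of k=5 are illustrative assumptions, not part of the assignment:

```python
# Minimal sketch: filter-style subset selection with scikit-learn.
# The synthetic dataset and k=5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, of which only a handful are actually informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature with an ANOVA F-test and keep the 5 best
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)           # (500, 20)
print("Reduced shape:", X_selected.shape)    # (500, 5)
print("Kept feature indices:", selector.get_support(indices=True))
```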
Summary:
● Dimensionality reduction helps simplify high-dimensional data, reducing
overfitting and improving computational efficiency.
● Subset selection chooses the most relevant features, improving model
performance and interpretability.
● PCA and feature selection are key methods, with different approaches depending
on the data and task.
● Dimensionality reduction is especially useful in cases like image processing or
text analysis, where the data has many features but only a subset is informative.
In standard linear regression, the goal is to minimize the sum of squared errors between
the predicted and actual values. Shrinkage methods modify this by adding a penalty
term that controls the size of the coefficients, helping to keep the model simpler and
less sensitive to fluctuations in the training data.
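Concretely, where ordinary least squares minimizes only the squared-error term, the two standard shrinkage objectives add a penalty scaled by λ ≥ 0 (the usual textbook formulation):

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```

The squared (L2) penalty shrinks coefficients smoothly, while the absolute-value (L1) penalty can push some coefficients exactly to zero, which is the key difference in the comparison below.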
Aspect | Lasso (L1) | Ridge (L2)
Feature selection | Yes, shrinks some coefficients exactly to zero | No, retains all features
Handling multicollinearity | Selects one predictor from a correlated group | Shrinks coefficients for all correlated predictors
When to use | When you expect some irrelevant features | When all features are important
Effect on coefficients | Some coefficients exactly zero | Shrinks all coefficients, none exactly zero
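This difference is easy to demonstrate. Below is a minimal sketch, assuming synthetic regression data and a penalty strength of alpha=1.0 (both illustrative choices): Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
# Minimal sketch: Lasso zeroes out coefficients, Ridge only shrinks them.
# The synthetic data and alpha values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Regression data where only 3 of 10 features carry signal
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
# Several entries are exactly 0.0 -> built-in feature selection
print("Ridge coefficients:", np.round(ridge.coef_, 2))
# All entries are nonzero, just pulled toward zero
```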
Summary:
● Shrinkage methods (Lasso and Ridge) prevent overfitting by penalizing the size
of the regression coefficients.
● Lasso (L1) is best for feature selection, while Ridge (L2) is used when you want
to reduce the impact of all features without eliminating any.
● Both methods are useful for improving model generalization and reducing the
effect of irrelevant or redundant features.
1. Principal Component Analysis (PCA) transforms the original features into new
variables (principal components), which are linear combinations of the original
variables. These principal components are uncorrelated and capture the
maximum variance in the data.
2. Instead of using the original features in the regression model, Principal
Components Regression (PCR) fits the model on the top principal components,
which reduces multicollinearity and the model's complexity.
Addressing Multicollinearity:
Multicollinearity occurs when two or more predictor variables are highly correlated,
leading to instability in estimating regression coefficients. In PCR:
● PCA identifies the directions in the data with the most variance and transforms
the original correlated variables into uncorrelated principal components.
● By using only the first few principal components (the ones that explain the most
variance), PCR reduces the dimensionality of the data and eliminates
multicollinearity, leading to more stable regression estimates.
Example Walkthrough:
Consider a wine-quality prediction dataset with 12 chemical properties as predictors.
Because many chemical properties are correlated (like sugar content and alcohol level),
multicollinearity can make a linear regression model unstable. (A code sketch of this
example follows the steps below.)
● After applying PCA, we find that the first 4 principal components explain 85% of
the variance in the data.
● We then use these 4 principal components in the regression model to predict
wine quality.
● By doing so, we reduce multicollinearity, improve model stability, and make the
model more computationally efficient.
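A minimal sketch of this walkthrough, using a synthetic stand-in for the wine data (the dataset and the choice of 4 components are illustrative assumptions):

```python
# Minimal sketch of Principal Components Regression (PCR):
# standardize -> PCA -> linear regression on the top components.
# The synthetic stand-in for the wine data and n_components=4
# are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for 12 correlated chemical properties predicting quality
X, y = make_regression(n_samples=300, n_features=12,
                       effective_rank=4, noise=5.0, random_state=0)

pcr = make_pipeline(StandardScaler(),
                    PCA(n_components=4),   # keep the top 4 components
                    LinearRegression())
pcr.fit(X, y)

pca = pcr.named_steps["pca"]
print("Variance explained by 4 components:",
      round(pca.explained_variance_ratio_.sum(), 3))
print("R^2 on training data:", round(pcr.score(X, y), 3))
```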
Summary:
● Principal Components Regression (PCR) addresses multicollinearity by
transforming correlated variables into uncorrelated principal components and
then applying linear regression.
● PCR combines PCA with regression, which makes it ideal for datasets with many
correlated predictors.
● The number of principal components used is based on how much variance they
explain, usually around 80-90% of the total variance.
● PCR is commonly used in high-dimensional datasets where multicollinearity is a
concern.
Q-4 a) Discuss Logistic Regression and its role in
classification tasks. How does logistic regression model the
probability of a binary outcome, and what is the
interpretation of its coefficients?
Logistic regression is a statistical method for predicting the probability of a binary
outcome (e.g., pass/fail, spam/not spam). Even though it's called "regression," it is used
for classification problems, where you need to assign data into categories. It models the
probability by passing a linear combination of the predictors through the sigmoid
(logistic) function, which maps any real number to a value between 0 and 1; an
observation is then assigned to the positive class when this probability exceeds a
threshold, typically 0.5.
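In symbols, for a single predictor x (such as hours studied), the standard logistic model is:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\qquad\Longleftrightarrow\qquad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta_1 x
```

The right-hand form shows why the coefficients are read as changes in the log-odds, which is the interpretation given next.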
Interpretation of Coefficients:
● The coefficients in logistic regression show how much a predictor (like hours
studied) affects the likelihood of the outcome (e.g., passing an exam): each
coefficient is the change in the log-odds of the positive outcome for a one-unit
increase in that predictor.
● If a coefficient is positive, increasing the predictor increases the probability of
the positive outcome.
● If a coefficient is negative, increasing the predictor decreases the probability of
the positive outcome.
1. Collect Data:
● You have data showing how many hours students studied and whether
they passed or failed.
● For example:
● A student who studied 2 hours failed.
● A student who studied 8 hours passed.
2. Train the Model:
● Logistic regression is used to create a model that learns from this data. It
identifies the relationship between the number of hours studied and the
likelihood of passing.
● The model calculates probabilities, such as:
● A student who studied 5 hours has a 70% probability of passing.
● A student who studied only 2 hours has a 20% probability of
passing.
3. Make Predictions:
● Once trained, the model can be used to predict outcomes for new
students based on how many hours they studied.
● For example, if a student studies 6 hours, the model might predict a pass
because the probability is greater than 0.5.
4. Evaluate the Model:
● You can assess how well the logistic regression model works by
comparing its predictions to actual results. For example, if the model
predicts that a student will pass but they fail, the model might need
improvement. (This whole workflow is sketched in code below.)
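A minimal sketch of this entire workflow with scikit-learn (the tiny hours/pass dataset is made up purely for illustration):

```python
# Minimal sketch: logistic regression on hours studied vs. pass/fail.
# The tiny dataset below is a made-up illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

model = LogisticRegression().fit(hours, passed)

# Predicted probability of passing for a student who studies 6 hours
p_pass = model.predict_proba([[6]])[0, 1]
print(f"P(pass | 6 hours) = {p_pass:.2f}")

# Class prediction uses the 0.5 threshold by default
print("Prediction:", "pass" if model.predict([[6]])[0] == 1 else "fail")

# Positive coefficient -> more hours increases the log-odds of passing
print("Coefficient (log-odds per extra hour):", round(model.coef_[0][0], 2))
```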
Summary:
● Logistic regression is a tool used for binary classification.
● It predicts the probability of an outcome (e.g., pass or fail).
● The coefficients in the model tell you how each factor (e.g., hours studied)
impacts the likelihood of that outcome.