SP 24 BADM 576 Final Exam Study Guide
Linear Regression
Example Solutions:
1. Linear regression is a statistical method used to model the relationship between two variables by
fitting a linear equation to the observed data. The goal is to find the line of best fit that
minimizes the sum of the squared differences between the predicted and actual values. The
equation for the line of best fit is y = mx + b, where y is the dependent variable, x is the
independent variable, m is the slope of the line, and b is the y-intercept.
Closed form solution: The line of best fit is calculated using the method of least squares, which
involves minimizing the sum of the squared residuals between the predicted values and the
actual values. The slope and y-intercept of the line of best fit can be calculated using the
following formulas:
m = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
b = (Σy - mΣx) / n
where n is the number of data points, Σ denotes a sum over all data points, x and y are the independent
and dependent variables, and xy is the product of the two variables.
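As a minimal sketch (the data points are made up for illustration), the closed-form formulas above can be computed directly with NumPy:

import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

n = len(x)
# Slope and intercept from the least-squares formulas above
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n
print(m, b)  # roughly 1.94 and 0.3 for this toy data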
Another method for Linear Regression is Gradient Descent.
2. Gradient descent is an optimization algorithm used to find the values of the parameters that
minimize the cost function in linear regression. The cost function is a measure of the difference
between the predicted and actual values, and the goal is to minimize this difference. Gradient
descent involves iteratively adjusting the parameters in the direction of the steepest descent of
the cost function. The steps involved in gradient descent are:
i. Initialize the parameters to random values.
ii. Calculate the cost function for the current values of the parameters.
iii. Calculate the gradient of the cost function with respect to each parameter.
iv. Update the parameters by subtracting a small fraction of the gradient from each parameter.
v. Repeat steps ii-iv until the cost function converges to a minimum.
The advantages of gradient descent are that it scales to large datasets and high-dimensional feature
spaces, and for convex cost functions such as the mean squared error in linear regression it converges
to the global minimum. The disadvantages are that it can get stuck in local minima when the cost
function is non-convex, and the learning rate must be carefully tuned to balance convergence speed
and stability.
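As a hedged sketch of the steps above (toy data and learning rate chosen purely for illustration), gradient descent for simple linear regression with a mean squared error cost can be written as:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

m, b = 0.0, 0.0        # step i: initialize the parameters
lr = 0.01              # learning rate, which must be tuned
for _ in range(5000):  # step v: repeat steps ii-iv (fixed iteration budget here)
    y_pred = m * x + b
    cost = np.mean((y_pred - y) ** 2)       # step ii: cost function (MSE)
    grad_m = 2 * np.mean((y_pred - y) * x)  # step iii: gradient w.r.t. m
    grad_b = 2 * np.mean(y_pred - y)        # step iii: gradient w.r.t. b
    m -= lr * grad_m                        # step iv: update the parameters
    b -= lr * grad_b
print(m, b)  # should approach the closed-form solution from part 1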
3. Underfitting in linear regression occurs when the model is too simple and fails to capture the
underlying relationship between the variables. This can be detected by plotting the predicted
values against the actual values and observing whether there is a systematic pattern of errors.
Underfitting can be corrected by adding more polynomial features or interaction terms to the
model, which allows it to capture more complex relationships between the variables. Reasons
that can cause underfitting include using too few features, not including interactions between
features, and not including higher-order polynomial terms.
5. The lambda parameter in regularization controls the strength of the penalty term and
determines the tradeoff between bias and variance. A higher value of lambda leads to a simpler
model with smaller parameter values and less overfitting, but may increase bias and underfit the
data. A lower value of lambda leads to a more complex model with larger parameter values and
lower bias, but risks overfitting and poorer generalization to new data. The optimal value of
lambda can be determined with cross-validation, for example using GridSearchCV.
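A minimal sketch of tuning the regularization strength with GridSearchCV, assuming scikit-learn and synthetic data (note that sklearn's Ridge calls the lambda parameter alpha):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Cross-validate over candidate penalty strengths
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # the penalty strength with the best cross-validated error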
Decision Tree and Random Forest
● What are two measures of impurity in the context of classification, and what are their formulae?
● Check slides
● How do you calculate impurity in the case of a regression problem?
● Standard deviation or variance
You must also be able to calculate the Gini Index / Shannon Entropy on a toy dataset (see the sketch after the links below).
● How does a decision tree decide which variable to use at each node?
● What are Shannon's Entropy and Gini Index, and how are they used in building decision trees?
● Describe what an ensemble method is in the context of tree-based models.
● Compare and contrast Random Forest and Gradient Boosted Trees in terms of their approach to
building the trees.
● What are some strategies to avoid overfitting in tree-based models?
● How does splitting data into training and testing sets help in managing overfitting?
● What is information gain, and why is it important in decision trees?
● How is information gain calculated, and what does it signify about a feature?
https://canvas.illinois.edu/courses/44221/pages/week-10-and-11-tree-based-models?module_item_id=3348442
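As a quick sketch (the class labels are made up for illustration), the Gini Index and Shannon Entropy of a toy node can be computed as:

import numpy as np
from collections import Counter

labels = ["yes", "yes", "yes", "no", "no"]  # toy node with 3 "yes" and 2 "no"
counts = np.array(list(Counter(labels).values()))
p = counts / counts.sum()                   # class proportions

gini = 1 - np.sum(p ** 2)                   # Gini Index: 1 - sum(p_i^2)
entropy = -np.sum(p * np.log2(p))           # Shannon Entropy: -sum(p_i * log2(p_i))
print(round(gini, 3), round(entropy, 3))    # 0.48 and about 0.971 for a 3/5 vs 2/5 split

Information gain for a candidate split is then the parent node's impurity minus the weighted average impurity of the child nodes, and the tree chooses the variable and split point that maximize it.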
Unsupervised Learning
Example Solutions:
1. Unsupervised learning is a type of machine learning where the model is trained on a dataset
without labeled output. Unlike supervised learning, there is no specific target variable that the
algorithm is trying to predict. Instead, the algorithm tries to find patterns or relationships within
the data on its own. Examples of unsupervised learning algorithms include clustering algorithms
such as KMeans, DBSCAN, and hierarchical clustering, as well as dimensionality reduction
techniques such as Principal Component Analysis (PCA) and t-SNE. Use cases for unsupervised
learning include market segmentation, anomaly detection, and grouping similar images or documents.
2. Semi-supervised learning is a type of machine learning that uses a combination of labeled and
unlabeled data to train the model. Unlike supervised learning, where all the data is labeled, or
unsupervised learning, where none of the data is labeled, semi-supervised learning falls
somewhere in between. The labeled data is used to guide the model in the right direction, while
the unlabeled data helps the model generalize better. Semi-supervised learning can be more
effective than supervised learning in situations where labeled data is scarce or expensive to
obtain, while still achieving higher accuracy than unsupervised learning.
3. Principal Component Analysis (PCA) is a dimensionality reduction technique that is used to
identify patterns in high-dimensional data. The goal of PCA is to find a lower-dimensional
representation of the data that retains as much of the original information as possible. This is
achieved by projecting the data onto a new set of orthogonal axes, called principal components,
that capture the most variance in the data. PCA is commonly used in image compression,
genetics, and finance, where the data may have hundreds or thousands of features and reducing
the number of dimensions can simplify the analysis and interpretation of the data.
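A minimal, hedged sketch of PCA with scikit-learn (the dataset and the number of components are illustrative; standardizing the features first is generally recommended):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 numeric features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # keep the top 2 principal components
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of the variance captured by each component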
4. The KMeans algorithm is a clustering algorithm that partitions a dataset into k clusters based on
the similarity of the data points to each other. The algorithm works by first randomly selecting k
initial centroids, assigning each data point to the nearest centroid, re-estimating the centroid of
each cluster, and repeating until the centroids no longer change or a maximum number of
iterations is reached. KMeans is different from KNN (K-Nearest Neighbors) in that KMeans is an
unsupervised learning algorithm used for clustering, while KNN is a supervised learning
algorithm used for classification or regression.
5. Distance measures are used in machine learning to determine the similarity or dissimilarity
between two data points. Some common distance measures include Euclidean distance,
Manhattan distance, and cosine similarity. Euclidean distance is calculated as the straight-line
distance between two points in n-dimensional space, while Manhattan distance is calculated as
the sum of the absolute differences between the corresponding coordinates of two points.
Euclidean distance is more appropriate when the data is continuous and the scale of the
variables is important, while Manhattan distance is more appropriate when the data is discrete
or when the variables are on different scales. Cosine similarity measures the cosine of the angle
between two vectors and is commonly used in text or image analysis to measure similarity
between documents or images.
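As a small sketch of these distance measures on two made-up points (scipy.spatial.distance provides the same measures as ready-made functions):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute coordinate differences
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity
print(euclidean, manhattan, cosine_sim)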
6. In the KMeans algorithm, the initial centroids are chosen randomly from the data points. This
can lead to suboptimal clustering results, especially if the initial centroids are chosen poorly. To
address this issue, several methods for selecting initial centroids have been proposed, including:
i. KMeans++: This method chooses the first centroid randomly from the data points, and
each subsequent centroid is chosen from the remaining points with probability
proportional to its squared distance from the nearest existing centroid. This ensures
that the initial centroids are spread out and diverse.
ii. Random Partition: This method randomly assigns each data point to one of the K
clusters, and the initial centroids are then calculated as the means of the points in each
cluster.
7. The number of clusters (K) in KMeans has a significant impact on the results. If K is too small, the
algorithm may group different types of data together, resulting in low cluster quality. If K is too
large, the algorithm may overfit the data, resulting in high variance and poor generalization to
new data. To determine the optimal value of K, several techniques can be used, including:
I. Elbow Method: Plot the within-cluster sum of squares (WCSS) against the number of
clusters, and choose the value of K at the "elbow" of the curve, where the improvement
in WCSS starts to level off.
II. Silhouette Analysis: Calculate the silhouette coefficient for each point, which measures
how well it fits in its assigned cluster compared to other clusters. The optimal value of K
is the one that maximizes the average silhouette coefficient across all points.
8. The objective function of KMeans is to minimize the within-cluster sum of squares (WCSS), which
is the sum of the squared distances between each data point and its assigned centroid. The
algorithm iteratively updates the centroids and assigns each data point to the nearest centroid
until the WCSS converges. The WCSS is used to evaluate the performance of the algorithm, with
lower values indicating better clustering quality.
KMeans also has well-known challenges, including sensitivity to the choice of initial centroids,
sensitivity to outliers, and difficulty with clusters that are not linearly separable. These challenges
can be addressed through various techniques, including more advanced initialization methods,
pre-processing the data to remove outliers, and using other clustering algorithms that can handle
non-linearly separable data.
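A hedged sketch that pulls the KMeans ideas above together on synthetic data (the k-means++ initialization and the elbow/silhouette checks correspond to the methods described in items 6 and 7):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic data

# Elbow method (WCSS is exposed as inertia_) and silhouette score for several K
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# Fit the chosen K and visualize the clusters and centroids
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker="x", color="red")
plt.show()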
1. What is a confusion matrix and how is it used to evaluate the performance of a classification
model? What are the different metrics that can be derived from a confusion matrix?
2. What is accuracy and how is it calculated? What are some limitations of using accuracy as a
metric for evaluation?
3. What is precision and how is it calculated? What is the significance of precision in the context of
classification models?
4. What is recall and how is it calculated? What is the significance of recall in the context of
classification models?
5. What is the F1 score and how is it calculated? How is it different from accuracy, precision, and
recall?
6. What is log loss and how is it used to evaluate the performance of a classification model? What
is the significance of log loss?
7. What is the ROC curve and how is it used to evaluate the performance of a classification model?
What is the AUC-ROC score and how is it calculated? How does it differ from other evaluation
metrics?
Example Solutions:
3. Precision is a metric that measures the proportion of true positive instances out of all instances
that are predicted positive. It is calculated using the formula: TP / (TP + FP). The significance of
precision in the context of classification models is that it measures the ability of the model to
avoid false positives.
4. Recall is a metric that measures the proportion of true positive instances out of all actual
positive instances. It is calculated using the formula: TP / (TP + FN). The significance of recall in
the context of classification models is that it measures the ability of the model to identify all
positive instances, even at the cost of also producing some false positives.
5. The F1 score is a metric that combines precision and recall into a single value by taking the
harmonic mean of the two metrics. It is calculated using the formula: 2 * (precision * recall) /
(precision + recall). The F1 score differs from accuracy, precision, and recall in that it balances
precision and recall in a single number, which is especially useful when the classes are imbalanced.
6. Log loss (also known as cross-entropy loss) is a metric that measures the difference between the
predicted class probabilities and the actual class labels. It is used to evaluate the performance of
a classification model by penalizing confident but incorrect predictions. The significance of log
loss is that it provides a continuous, probability-based measure of the model's performance, and
it can be used to compare different models.
7. The ROC curve (receiver operating characteristic curve) is a graphical representation of the
performance of a binary classification model by plotting the true positive rate (TPR) against the
false positive rate (FPR) at different classification thresholds. The AUC-ROC score (area under the
ROC curve) is a metric that measures the overall performance of the model by calculating the
area under the ROC curve. It is calculated by integrating the ROC curve between the limits of 0
and 1. The AUC-ROC score is different from other evaluation metrics in that it provides a
threshold-independent measure of the model's performance, and it is less sensitive to class
imbalance.
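A minimal sketch computing these metrics with scikit-learn, assuming arrays of true labels, predicted labels, and predicted probabilities (the values here are made up):

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, log_loss, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probability of class 1

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(log_loss(y_true, y_prob))          # penalizes confident but wrong probabilities
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve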
Note: The following questions may not have one right answer. If asked as free text, use your best
judgment. For MCQs, the choices won’t be close.
1. Suppose you are developing a model to predict credit card fraud. The cost of missing fraudulent
transactions is 10 times the cost of flagging non-fraudulent transactions as fraudulent. What
evaluation metric(s) would you use to evaluate the model's performance, and why?
2. You are a doctor and you are deciding which test to use for diagnosing a disease. The disease is
life-threatening and requires immediate treatment. Would you prefer a test with a high
sensitivity of 90% or a high specificity of 90%? Explain your reasoning.
3. Suppose you are working on a spam detection algorithm for emails. The cost of incorrectly
flagging a non-spam email as spam is twice the cost of missing a spam email. What evaluation
metric(s) would you use to evaluate the algorithm's performance, and why?
4. You are working on a classification model to predict if a customer is likely to churn from a
telecom company. The company wants to focus on retaining customers who are likely to churn.
What evaluation metric(s) would you use to evaluate the model's performance, and why?
5. You are developing a machine learning model to predict whether a tumor is malignant or benign.
The cost of missing a malignant tumor is 10 times the cost of incorrectly diagnosing a benign
tumor as malignant. Would you prefer a model with a high sensitivity of 90% or a high specificity
of 90%? Explain your reasoning.
6. You are a data scientist working for a financial company, and you are tasked with predicting
whether a loan applicant is likely to default. The company wants to minimize the risk of default.
What evaluation metric(s) would you use to evaluate the model's performance, and why?
7. You are developing a fraud detection algorithm for a banking application. The cost of missing a
fraudulent transaction is 5 times the cost of incorrectly flagging a non-fraudulent transaction as
fraudulent. What evaluation metric(s) would you use to evaluate the algorithm's performance,
and why?
Example Solutions:
1. We may choose “recall”, since catching fraudulent transactions is more important than
occasionally flagging non-fraudulent ones. We may also want a combined metric that gives
roughly 10 times more weight to recall than to precision, such as an F-beta score with a large
beta (see the sketch after these solutions).
2. In this scenario, the disease is life-threatening and requires immediate treatment. Therefore, it is
more important to have a high sensitivity (true positive rate), which means that the test can
correctly identify as many individuals with the disease as possible. This is because missing a
positive diagnosis could have serious consequences, such as delaying treatment or potentially
causing harm to the patient.
3. In this scenario, the cost of incorrectly flagging a non-spam email as spam is higher than the cost
of missing a spam email. Therefore, the evaluation metric that should be used is precision, which
measures the proportion of correctly identified spam emails among all emails that were
identified as spam. This is because the cost of flagging a non-spam email as spam could result in
important emails being missed or incurring additional costs to retrieve them.
4. In this scenario, the company wants to focus on retaining customers who are likely to churn.
Therefore, the evaluation metric that should be used is recall, which measures the proportion of
customers who are likely to churn that are correctly identified by the model. This is because the
company wants to identify as many customers as possible who are likely to churn so that they
can take actions to retain them.
5. In this scenario, the cost of missing a malignant tumor is higher than the cost of incorrectly
diagnosing a benign tumor as malignant. Therefore, the evaluation metric that should be used is
sensitivity, which measures the proportion of correctly identified malignant tumors among all
malignant tumors. This is because missing a malignant tumor could have serious consequences,
such as delaying treatment or potentially causing harm to the patient.
6. In this scenario, the company wants to minimize the risk of default. Therefore, the evaluation
metric should be recall or sensitivity.
7. In this scenario, the cost of missing a fraudulent transaction is higher than the cost of incorrectly
flagging a non-fraudulent transaction as fraudulent. Therefore, the evaluation metric that should
be used is recall, which measures the proportion of correctly identified fraudulent transactions
among all fraudulent transactions. This is because missing a fraudulent transaction could result
in financial loss to the bank and its customers.
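For solution 1's idea of weighting recall more heavily than precision, one hedged option (not prescribed by the course materials) is scikit-learn's F-beta score, where beta controls how much more recall counts than precision:

from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up fraud labels (1 = fraud)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# beta = 10 weights recall roughly 10 times as heavily as precision
print(fbeta_score(y_true, y_pred, beta=10))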
Regression Custom Metrics
1. You are developing a machine learning model to predict the number of units sold for a particular
product. The cost of overestimating the number of units sold is twice that of underestimating
the number of units sold. How would you define a custom metric that takes this into account?
2. You are working on a regression model to predict the amount of rainfall in a particular region.
The stakeholders are interested in the accuracy of the model in predicting rainfall levels above 50
mm. How would you define a custom metric that focuses on this aspect of the model's
performance?
3. You are developing a machine learning model to predict the amount of time it will take for a
particular project to be completed. The stakeholders are interested in the accuracy of the model
in predicting project completion times that exceed 100 days. How would you define a custom
metric that takes this into account?
4. You are working on a model to predict the number of customer complaints received by a
company. The stakeholders are interested in the accuracy of the model in predicting the number
of complaints during peak seasons, which typically occur between October and December. How
would you define a custom metric that focuses on this aspect of the model's performance?
Example Solutions:
1. To take into account the cost of overestimating the number of units sold being twice that of
underestimating, a possible custom metric could be defined as the absolute difference between
the predicted and true values, multiplied by a factor of 2 if the predicted value is greater than
the true value, and 1 otherwise. This custom metric would penalize overestimation more heavily,
while still taking into account the accuracy of the model in both overestimating and
underestimating.
2. A custom metric that focuses on the accuracy of the model in predicting rainfall levels above 50
mm could be defined as an error measure, such as the mean absolute error, computed only over
the observations where the actual rainfall exceeds this threshold. This custom metric would
provide a measure of the model's performance on high rainfall levels, while standard error
metrics can still be reported for the lower rainfall values.
3. A custom metric that takes into account the accuracy of the model in predicting project
completion times that exceed 100 days could be defined as the absolute difference between the
predicted and true values, multiplied by a factor of 2 if the predicted value is greater than 100
days, and 1 otherwise. This custom metric would penalize overestimation of project completion
times more heavily, while still taking into account the accuracy of the model in predicting shorter
completion times.
4. A custom metric that focuses on the accuracy of the model in predicting the number of
complaints during peak seasons, which typically occur between October and December, could be
defined as an error measure, such as the mean absolute error, computed only over the
predictions made for that period. This custom metric would provide a measure of the model's
performance specifically during the peak season, while standard error metrics can still be
tracked for the rest of the year.
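A minimal sketch of the custom metric from solution 1 (a penalty factor of 2 for overestimation), written as a plain Python function; the same idea can be wrapped with sklearn's make_scorer(..., greater_is_better=False) for use in cross-validation:

import numpy as np

def asymmetric_error(y_true, y_pred, over_penalty=2.0):
    # Mean absolute error that charges over-predictions `over_penalty` times as much
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    errors = np.abs(y_pred - y_true)
    weights = np.where(y_pred > y_true, over_penalty, 1.0)  # 2x cost when overestimating
    return np.mean(weights * errors)

print(asymmetric_error([100, 120], [110, 115]))  # over by 10 (x2) and under by 5 -> 12.5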
ML Process / MLOps
MLflow is an open source platform for managing machine learning workflows. It is used by
MLOps teams and data scientists. MLflow has four main components:
❖ The tracking component allows you to record machine learning model training sessions (called
runs) and run queries using Java, Python, R, and REST APIs.
❖ The model component provides a standard unit for packaging and reusing machine
learning models.
❖ The model registry component lets you centrally manage models and their lifecycle.
❖ The project component packages code used in data science projects, ensuring it can
easily be reused and experiments can be reproduced.
❖ A run is a collection of parameters, metrics, labels, and artifacts related to the training
process of a machine learning model.
❖ An experiment is the basic unit of MLflow organization. All MLflow runs belong to an
experiment. For each experiment, you can analyze and compare the results of different
runs, and easily retrieve metadata artifacts for analysis using downstream tools.
Experiments are maintained on an MLflow tracking server hosted on Azure Databricks or
locally.
Source: https://www.run.ai/guides/machine-learning-operations/mlflow#:~:text=MLflow%20is%20an%20open%20source,MLOps%20teams%20and%20data%20scientists.
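A minimal, hedged sketch of logging a run with the tracking component (the experiment name, data, and logged values are illustrative assumptions; it assumes mlflow and scikit-learn are installed and uses the default local tracking store):

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

mlflow.set_experiment("demo-experiment")      # experiments group related runs
with mlflow.start_run():                      # a run records params, metrics, and artifacts
    model = LinearRegression().fit(X, y)
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # package the fitted model as an artifact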
Coding part (open notebook, internet, chatgpt, anything except coordinating with a human)
Be able to run:
1. Create a pipeline (using column transformers) for data preprocessing and model fitting to fit
Linear Regression, Logistic Regression, and Decision Tree/Random Forest models for
classification or regression problems (see the sketch at the end of this guide).
● Perform hyperparameter tuning on these models
● Report model performance on the train, test, and validation data and comment on the
degree of underfitting/ overfitting.
● Save the model as a pickle file
● Make predictions on a provided dataset
2. Perform KMeans and visualize clusters.
3. Perform PCA and interpret the results
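A hedged end-to-end sketch for item 1 (the column names, toy dataset, and parameter grid are illustrative assumptions, not the actual exam data):

import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Illustrative dataframe with one numeric and one categorical feature
df = pd.DataFrame({
    "age": [22, 35, 47, 52, 28, 41, 33, 60],
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro", "basic", "pro"],
    "churn": [0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df[["age", "plan"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0, stratify=y)

# Preprocess each column type, then fit the model, all inside one pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([("prep", preprocess), ("model", RandomForestClassifier(random_state=0))])

# Hyperparameter tuning with cross-validation
grid = GridSearchCV(pipe, {"model__n_estimators": [50, 100], "model__max_depth": [2, 4]}, cv=2)
grid.fit(X_train, y_train)

# Compare train vs. test scores to comment on under/overfitting
print(grid.score(X_train, y_train), grid.score(X_test, y_test))

# Save the fitted pipeline as a pickle file and make predictions on new data
with open("model.pkl", "wb") as f:
    pickle.dump(grid.best_estimator_, f)
print(grid.predict(X_test))

Items 2 and 3 (KMeans with cluster visualization, and PCA with interpretation of the explained variance) follow the same pattern as the sketches shown earlier in the Unsupervised Learning section.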