Statistics and Data Science II
b. Unsupervised learning
Unlabeled data: no specific output is provided → no target variable. Thus, we cannot assess whether the outcome is right or not; there is no correct value to compare with.
Supervised learning learns patterns from labeled data to make predictions, while
unsupervised learning finds patterns in unlabeled data.
ML projects: Common steps
0. Collecting data
RECAP
- How to compare a variable between 2 groups → t-Test
- We want to check whether differences between groups are statistically
significant.
- In particular, we want to compare means:
- General mean
- One mean value per group
- Both the t-test (which assumes normality) and the Wilcoxon test (which does not require the data to be normally distributed) are based on the idea of analyzing the residual distribution of a base linear model versus an adapted linear model with one mean per group.
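As an illustration, a minimal sketch of both comparisons with scipy on simulated data (the group means, sample sizes, and seed are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 1.0, 50)   # simulated measurements, group A
group_b = rng.normal(5.5, 1.0, 50)   # group B has a slightly higher mean

# Parametric comparison: assumes (approximate) normality in each group
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Rank-based alternative (Mann-Whitney U, the unpaired "Wilcoxon" test):
# does not require normally distributed data
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)
```

Note that `scipy.stats.wilcoxon` is the *paired* version of the test; for two independent groups the Mann-Whitney U test is the usual rank-based choice.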
- How to compare a variable between more than 2 groups → ANOVA
- We want to check whether differences between groups are statistically
significant
- In particular, we want to compare means:
- General mean
- One mean per group
- ANOVA to check whether statistically significant differences exist; Kruskal-Wallis for non-normal data.
- Post-hoc pairwise comparisons: Tukey's Honest Significant Differences (HSD), or Dunn's test for non-normal data.
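A minimal sketch of the two omnibus tests with scipy (the three simulated groups and their means are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three simulated groups; the third has a shifted mean
groups = [rng.normal(mu, 1.0, 40) for mu in (5.0, 5.0, 6.0)]

# ANOVA: tests whether all group means are equal (assumes normal residuals)
f_stat, anova_p = stats.f_oneway(*groups)

# Kruskal-Wallis: rank-based alternative for non-normal data
h_stat, kw_p = stats.kruskal(*groups)
```

For the post-hoc step, statsmodels provides `pairwise_tukeyhsd` for Tukey's HSD.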
MODELLING DATA
What is a model?
It is a mathematical formula that defines a variable y as a function of other variables x:
- One dependent variable (y) → target
- Several independent ones (xs)
In some contexts, we have discovered exact analytical formulas that fit the data almost
perfectly.
Data is not that perfect, though. For a given value of x, we may find several values of y, so it is not possible to build an exact equation fitting our data. Since perfection does not exist, we use statistical approaches.
Whenever we model a dataset, we think about the process that could have generated the
observed data. This is the data generation process.
Some concepts
- Predictions or fitted values: what our model predicts for y when x is substituted by
concrete values. (Regression line, blue)
- True Y values (black dots)
- Residuals or prediction errors: difference between our prediction and the real value (red lines).
Our goal is to minimize the prediction error → making our predictions as close as possible to the real data.
Assumptions:
1. Linearity
2. Normality
3. Homoscedasticity
4. Independence
5. No multicollinearity
6. Exogeneity
Coefficients
There are two ways of computing them:
- Search for the coefficients minimizing the MSE → ordinary least squares, OLS
- Search for the coefficients maximizing the log-likelihood (maximum likelihood estimation)
In the case of linear regression, it can be shown that both methods return the same results.
Significance metrics:
- P-value of each coefficient → t-test to check if they are not 0
- F-statistic → overall model significance
Interpretation
Results
Obviously, if our data does not show a linear relationship between target and independent
variables, our model is not going to properly fit the data.
- Residuals vs Fitted plot (patterns indicate potential non-linearity)
We must always check to what extent our model represents data robustly. To do so we check:
- Coefficient of determination (R2) & Adjusted R2.
- Mean squared error (MSE) & Root mean squared error (RMSE)
- Residual standard error (RSE) = RMSE adjusted by # predictors
- Mean absolute error (MAE)
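These metrics can be computed directly; a toy sketch with scikit-learn (the four observed/predicted values are made up for the example):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # toy observed values
y_pred = np.array([2.8, 5.3, 6.9, 9.2])   # toy model predictions

mse = mean_squared_error(y_true, y_pred)   # 0.045
rmse = np.sqrt(mse)                        # ~0.212, same units as y
mae = mean_absolute_error(y_true, y_pred)  # 0.2
r2 = r2_score(y_true, y_pred)              # ~0.991
```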
Multicollinearity
- Structural multicollinearity: this type occurs when we create a model feature using other features. In other words, it is a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- Data multicollinearity: this type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.
- How to assess multicollinearity for a given predictor? With the variance inflation factor (VIF): a score that measures how much the variance of a coefficient is inflated due to multicollinearity.
  - VIF = 1 → no multicollinearity
  - VIF > 5 → multicollinearity problems
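A minimal sketch of the VIF with statsmodels, using simulated predictors where x2 is nearly a copy of x1 (all variable names and values are assumptions for the example):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated predictor

# Include an intercept column when computing the VIFs
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
# vif["x1"] and vif["x2"] are far above 5; vif["x3"] is close to 1
```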
IMPROVING LINEAR MODELS
Penalized regression:
- alpha = 1 → Lasso Regression (L1)
- alpha = 0 → Ridge Regression (L2)
- alpha between 0 and 1 → ElasticNet model
We need to choose alpha and lambda, the constant values. How to select optimal values for
them? With cross-validation.
Choosing the right regularization depends on your data's feature relevance and correlation structure.
CROSS-VALIDATION (CV)
- Simple validation approach:
We cannot just fit the model to our training data and hope it will work accurately on real data it has never seen before. We need some assurance that our model has captured most of the patterns in the data. Thus, we need to divide our data into 2 sets: training and testing.
CV is a set of methods to evaluate the performance of a model by testing it on new, unseen data (the test data).
The basic idea is:
- Reserve a small sample of data by randomly splitting it into 2 parts.
- Build the model in the remaining part.
- Test the performance on the reserved sample.
CV is also known as a resampling method, as it involves fitting the same model multiple times using different subsets of the data.
- K-Fold Cross-validation:
1. Split the dataset into multiple parts or folds, where some parts are used for training the
model and the remaining for testing (checking performance).
2. The process is performed as many times as folds are.
3. Results are averaged over folds to give a reliable estimate of how the model will
perform on new data, so then we can select the best performing model, parameter, etc.
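The three steps above can be sketched with scikit-learn's KFold utilities (the simulated regression data is an assumption for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)

# 5 folds: each fold is used once for testing, the rest for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()   # averaged over folds → a more reliable estimate
```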
How to select optimal values for lambda and alpha?
In ElasticNet, we first need to establish a range of potential values for each parameter. For every lambda and alpha combination, the model is trained and evaluated across all K folds (5 in the example). We compute the average error for every combination across the folds, and the combination providing the minimum error is selected as the best one.
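A sketch of this grid search with scikit-learn's `ElasticNetCV` (the simulated data and candidate grids are assumptions; beware the naming swap between the notes and sklearn):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
beta = np.array([3.0, -2.0] + [0.0] * 8)   # only 2 informative features
y = X @ beta + rng.normal(0, 1, 200)

# Naming caution: sklearn's l1_ratio is the notes' alpha (L1/L2 mix),
# while sklearn's alpha is the notes' lambda (penalty strength).
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                     alphas=np.logspace(-3, 1, 20),
                     cv=5).fit(X, y)
best_mix, best_strength = model.l1_ratio_, model.alpha_
```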
- Cross-validation
It is used for:
- Model comparison and selection
- Hyperparameters tuning
- Avoiding overfitting
2. HYPOTHESIS TESTING RECAP
The log of the ratio of the probabilities is called the logit function, which is the basis of the
logistic regression. Why? The log-odds of p can be modeled linearly because the logit
transformation expands the probability space to cover all real numbers. The linearity
simplifies the interpretation and computation of the model parameters (allowing us to use
linear regression methods to estimate them) and it assures that p remains between 0 and 1:
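The relationship the paragraph describes can be written out explicitly (a standard reconstruction, using β for the model coefficients):

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand side ranges over all real numbers, so it can be modeled linearly; inverting the transformation guarantees 0 < p < 1.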
The AUC, Area Under the ROC Curve, is a single number that summarizes the
performance of the model over all possible thresholds. It measures the total area under the
ROC curve:
- 0.5 = a random classification
- 1 = a perfect classifier
The points on a ROC curve closest to (0, 1) represent a range of the best performing thresholds for the given model. However, there is an important assumption here: our errors are symmetric, i.e., a false positive is as bad as a false negative. Is this always the case?
In many cases, the 2 types of error are not symmetric, so we must consider this when fixing the score threshold, so that we reduce the most costly error.
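A toy sketch of the AUC with scikit-learn (the labels and scores are made up for the example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])                # true labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])  # model scores

auc = roc_auc_score(y_true, scores)
# Every positive outranks every negative here, so auc == 1.0;
# a classifier scoring at random would hover around 0.5
```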
MULTIPLE CLASSIFICATION
Imagine we had 3 classes in our target data, A, B and C, which do not follow a natural order (the data is not ordinal). How do we design a method based on logistic regression to do multinomial classification?
- Option 1: several logistic regressions
1 regression for each category → the chosen class takes value 1, the rest the value 0.
Thus, we have 3 probabilities for each new observation, one for each possible class. We
identify the maximum value of those 3.
Given new data, we evaluate all models and assign the class with the highest score.
- Option 2: multinomial logistic regression
Once we have the logits, the model estimates the probability of each class using the softmax function, which ensures the probabilities sum up to 1.
So, in our example of the 3 classes, we have:
The final output of the multinomial logistic regression is a probability per class for each observation (many packages provide the highest probability and its class). Thus, as before, given new data we evaluate the three models and assign the class with the highest score.
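The softmax step can be sketched in a few lines of numpy (the logit values for classes A, B and C are made up for the example):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the output is unchanged
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # one logit per class: A, B, C
probs = softmax(logits)              # probabilities summing to 1
predicted = int(np.argmax(probs))    # class A (index 0) wins here
```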
Ordinal (ordered) logistic regression assumes proportional odds. This means the effect of all predictors is consistent across categories ( = the impact of the features does not change for different categories). Because the relationship between all pairs of groups is the same, there is only one set of coefficients.
Use the Brant test to check whether this assumption holds.
Coefficient interpretation: exp(alpha_k) is related to the increase in the odds of being in category k or lower.
Model evaluation: mixed methods for regression and classification.
CLASS 3:
SUMMARY

| Target variable         | Task           | Model                                                             | Distribution         | Link function |
|-------------------------|----------------|-------------------------------------------------------------------|----------------------|---------------|
| Continuous              | Regression     | Linear Regression, ElasticNet, Lasso Regression, Ridge Regression | Normal               | Identity      |
| Binary                  | Classification | Logistic Regression                                               | Binomial             | Logit         |
| Non-ordered categorical | Classification | Multinomial Classification                                        | Multinomial          | Softmax       |
| Ordered categorical     | Regression     | Ordinal Logistic Regression                                       | Ordinal distribution | Ordered Logit |
The model outputs the expected mean number of occurrences (events) given the predictors, which is always non-negative.
However, it is important to know that this 𝜙 factor is not introduced as a new parameter in the model: it only modifies the way standard errors are computed after fitting the model. The model structure and link function remain the same as in Poisson Regression, but the standard errors of the estimated coefficients are adjusted based on 𝜙, ensuring that hypothesis tests and confidence intervals account for the overdispersion.
In this quasi-Poisson approach there is no full likelihood, so 𝜙 is estimated from the data (typically from the Pearson residuals) rather than by maximum likelihood; in a Negative Binomial model, by contrast, the dispersion parameter is estimated along with the other coefficients through maximum likelihood (MLE), making it part of the core model fitting process. In both cases the model structure and link function remain the same.
SOLUTIONS
- Zero-Inflated Poisson Model, ZIP
This model assumes that there are 2 processes involved and handles them separately:
- A logistic regression models the probability of producing a (structural) zero.
- For the remaining data, a Poisson Regression models the counts (and can also produce 0s).
This is our first example of mixed modelling.
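The two-process idea can be sketched by simulating zero-inflated data (the mixing probability and Poisson mean are illustrative assumptions, not values from the course):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
p_zero, lam = 0.3, 2.0   # P(structural zero), Poisson mean

# Process 1 (logistic part): does the observation come from the zero state?
structural_zero = rng.random(n) < p_zero
# Process 2 (count part): Poisson counts, which can also be zero
counts = rng.poisson(lam, n)

y = np.where(structural_zero, 0, counts)

zip_zero_share = (y == 0).mean()           # ~0.39: inflated zero share
poisson_zero_share = (counts == 0).mean()  # ~0.14 (≈ e^-2): far fewer zeros
```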
GLM FINAL REMARKS
UNDERSTANDING E(Y|X)
In Linear models, the focus is on understanding and predicting averages for the response
variable.
LMs VS GLMs
In a standard LM, the relationship between the target variable and the predictors is modeled directly: E[Y ∣ X] = Xβ. In a GLM, a link function is introduced: g(E[Y ∣ X]) = Xβ, where:
- g(⋅): the link function transforms the mean to a scale where it is linearly related to the predictors.
- E[Y ∣ X]: the mean of Y is modeled indirectly via the link function.
- The mean of Y may have a nonlinear relationship with the predictors on its original scale.
FAST SUMMARY
GLMs relate the linear combination of predictors Xβ to the expected value of the response variable E(Y|X) through a link function (glmnet + family in R).

| Target variable         | Task           | Model                                                             | Distribution | Link function |
|-------------------------|----------------|-------------------------------------------------------------------|--------------|---------------|
| Continuous              | Regression     | Linear Regression, ElasticNet, Lasso Regression, Ridge Regression | Normal       | Identity      |
| Binary                  | Classification | Logistic Regression                                               | Binomial     | Logit         |
| Non-ordered categorical | Classification | Multinomial classification                                        | Multinomial  | Softmax       |
Longitudinal data → target variable is measured more than once for each unit of analysis,
with the repeated measures likely to be correlated.
Suppose you want to study the relationship between sleep quality (y) and daily stress levels, with samples from 1,000 people recording their stress level and sleep quality each day for 30 days.
If you model this with a standard linear regression, you ignore the fact that the repeated measurements are nested within each individual: observations from the same person are not independent, leading to correlated residuals.
LMMs are statistical models for continuous outcome variables in which the residuals are normally distributed but may not be independent or have constant variance. LMMs extend linear models by adding flexibility to handle grouped data structures, where observations aren't entirely independent.
LMMs combine the benefits of different approaches, as they account for non-independence while retaining individual-level data. They model the relationship at both levels:
1. Fixed effects: like coefficients in standard linear models, these capture
population-level effects, representing the overall influence of predictors that are
assumed to be the same across all groups or clusters.
a. These are explanatory variables of main interests in our study
b. Estimate general trends
2. Random effects: these capture group-level deviations from the fixed effect trends,
allowing different groups to have their own intercepts or slopes. RE introduce an
additional layer of flexibility, allowing the model to account for intra-group
correlations and variations across groups.
a. Model group-specific deviations or variability around fixed effects
Model specification -- General
Fixed or Random effect?