Statistics and Data Science II
b. Unsupervised learning
Unlabeled data: no specific output is provided → no target variable. Thus, we cannot assess whether the outcome is right or not; there is no correct value to compare with.
Supervised learning learns patterns from labeled data to make predictions, while
unsupervised learning finds patterns in unlabeled data.
ML projects: Common steps
0. Collecting data
RECAP
- How to compare a variable between 2 groups → t-Test
- We want to check whether differences between groups are statistically
significant.
- In particular, we want to compare means:
- General mean
- One mean value per group
- Both the t-test (which assumes normality) and the Wilcoxon test (which does not require the data to be normally distributed) are based on the idea of analyzing the residual distribution of a base linear model versus an adapted linear model with one mean per group.
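As an illustration, a minimal sketch of both comparisons with scipy on simulated data (the group means, sample sizes, and seed are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 1.0, 50)   # simulated measurements, group A
group_b = rng.normal(5.5, 1.0, 50)   # group B has a slightly higher mean

# Parametric comparison: assumes (approximate) normality in each group
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Rank-based alternative (Mann-Whitney U, the unpaired "Wilcoxon" test):
# does not require normally distributed data
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)
```

Note that `scipy.stats.wilcoxon` is the *paired* version of the test; for two independent groups the Mann-Whitney U test is the usual rank-based choice.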
- How to compare a variable between more than 2 groups → ANOVA
- We want to check whether differences between groups are statistically
significant
- In particular, we want to compare means:
- General mean
- One mean per group
- ANOVA to check whether statistically significant differences exist; Kruskal-Wallis for non-normal data.
- Post-hoc pairwise comparisons: Tukey's Honest Significant Differences (HSD), or Dunn's test for non-normal data.
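A minimal sketch of the two omnibus tests with scipy (the three simulated groups and their means are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three simulated groups; the third has a shifted mean
groups = [rng.normal(mu, 1.0, 40) for mu in (5.0, 5.0, 6.0)]

# ANOVA: tests whether all group means are equal (assumes normal residuals)
f_stat, anova_p = stats.f_oneway(*groups)

# Kruskal-Wallis: rank-based alternative for non-normal data
h_stat, kw_p = stats.kruskal(*groups)
```

For the post-hoc step, statsmodels provides `pairwise_tukeyhsd` for Tukey's HSD.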
MODELLING DATA
What is a model?
It is a mathematical formula that defines a variable y as a function of other variables x:
- One dependent variable (y) → target
- Several independent ones (xs)
In some contexts, we have discovered exact analytical formulas that fit the data almost
perfectly.
Data is not that perfect, though. For a given value of x, we may find several values of y, so it is not possible to build an exact equation fitting our data. Since perfection does not exist, we use statistical approaches.
Whenever we model a dataset, we think about the process that could have generated the
observed data. This is the data generation process.
Some concepts
- Predictions or fitted values: what our model predicts for y when x is substituted by
concrete values. (Regression line, blue)
- True Y values (black dots)
- Residuals or prediction errors: difference between our prediction and the real value (red lines).
Our goal is to minimize the prediction error → making our predictions as close as possible to the real data.
Assumptions:
1. Linearity
2. Normality
3. Homoscedasticity
4. Independence
5. No multicollinearity
6. Exogeneity
Coefficients
There are two ways of computing them:
- Search for the coefficients minimizing the MSE → ordinary least squares, OLS
- Search for the coefficients maximizing the log-likelihood (maximum likelihood estimation)
In the case of linear regression, it can be shown that both methods return the same results.
Significance metrics:
- P-value of each coefficient → t-test to check if they are not 0
- F-statistic → overall model significance
Interpretation
Results
Obviously, if our data does not show a linear relationship between target and independent
variables, our model is not going to properly fit the data.
- Residuals vs Fitted plot (patterns indicate potential non-linearity)
We must always check to what extent our model represents data robustly. To do so we check:
- Coefficient of determination (R2) & Adjusted R2.
- Mean squared error (MSE) & Root mean squared error (RMSE)
- Residual standard error (RSE) = RMSE adjusted by # predictors
- Mean absolute error (MAE)
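These metrics can be computed directly; a toy sketch with scikit-learn (the four observed/predicted values are made up for the example):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # toy observed values
y_pred = np.array([2.8, 5.3, 6.9, 9.2])   # toy model predictions

mse = mean_squared_error(y_true, y_pred)   # 0.045
rmse = np.sqrt(mse)                        # ~0.212, same units as y
mae = mean_absolute_error(y_true, y_pred)  # 0.2
r2 = r2_score(y_true, y_pred)              # ~0.991
```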
Multicollinearity
- Structural multicollinearity: this type occurs when we create a model feature using other features. In other words, it is a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- Data multicollinearity: this type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.
- How to assess multicollinearity for a given predictor? With the variance inflation factor (VIF): a score that measures how much the variance of a coefficient is inflated due to multicollinearity.
  - VIF = 1 → no multicollinearity
  - VIF > 5 → multicollinearity problems
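A minimal sketch of the VIF with statsmodels, using simulated predictors where x2 is nearly a copy of x1 (all variable names and values are assumptions for the example):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated predictor

# Include an intercept column when computing the VIFs
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
# vif["x1"] and vif["x2"] are far above 5; vif["x3"] is close to 1
```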
IMPROVING LINEAR MODELS
Penalized regression:
- alpha = 1 → Lasso Regression (L1)
- alpha = 0 → Ridge Regression (L2)
- alpha between 0 and 1 → ElasticNet model
We need to choose alpha and lambda, the constant values. How to select optimal values for
them? With cross-validation.
Choosing the right regularization depends on your data's feature relevance and correlation structure.
CROSS-VALIDATION (CV)
- Simple validation approach:
We cannot just fit the model to our training data and hope it will work accurately on real data it has never seen before. We need some assurance that our model has captured most of the patterns in the data. Thus, we need to divide our data into 2 sets: training and testing.
CV is a set of methods to evaluate the performance of a model by testing it on new, unseen data (the test data).
The basic idea is:
- Reserve a small sample of data by randomly splitting it into 2 parts.
- Build the model in the remaining part.
- Test the performance on the reserved sample.
CV is also known as a resampling method, as it involves fitting the same model multiple times using different subsets of the data.
- K-Fold Cross-validation:
1. Split the dataset into multiple parts or folds, where some parts are used for training the
model and the remaining for testing (checking performance).
2. The process is performed as many times as folds are.
3. Results are averaged over folds to give a reliable estimate of how the model will
perform on new data, so then we can select the best performing model, parameter, etc.
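The three steps above can be sketched with scikit-learn's KFold utilities (the simulated regression data is an assumption for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)

# 5 folds: each fold is used once for testing, the rest for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()   # averaged over folds → a more reliable estimate
```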
How to select optimal values for lambda and alpha?
In ElasticNet, we first need to establish a range of potential values for each parameter. For every lambda and alpha combination, the model is trained and evaluated across all K folds (5 in the example). We compute the average error for every combination across the folds, and the combination providing the minimum error is selected as the best one.
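A sketch of this grid search with scikit-learn's `ElasticNetCV` (the simulated data and candidate grids are assumptions; beware the naming swap between the notes and sklearn):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
beta = np.array([3.0, -2.0] + [0.0] * 8)   # only 2 informative features
y = X @ beta + rng.normal(0, 1, 200)

# Naming caution: sklearn's l1_ratio is the notes' alpha (L1/L2 mix),
# while sklearn's alpha is the notes' lambda (penalty strength).
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                     alphas=np.logspace(-3, 1, 20),
                     cv=5).fit(X, y)
best_mix, best_strength = model.l1_ratio_, model.alpha_
```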
- Cross-validation
It is used for:
- Model comparison and selection
- Hyperparameters tuning
- Avoiding overfitting
2. HYPOTHESIS TESTING RECAP
The log of the ratio of the probabilities is called the logit function, which is the basis of the
logistic regression. Why? The log-odds of p can be modeled linearly because the logit
transformation expands the probability space to cover all real numbers. The linearity
simplifies the interpretation and computation of the model parameters (allowing us to use
linear regression methods to estimate them) and it assures that p remains between 0 and 1:
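The relationship the paragraph describes can be written out explicitly (a standard reconstruction, using β for the model coefficients):

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand side ranges over all real numbers, so it can be modeled linearly; inverting the transformation guarantees 0 < p < 1.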
The AUC, Area Under the ROC Curve, is a single number that summarizes the
performance of the model over all possible thresholds. It measures the total area under the
ROC curve:
- 0.5 = a random classification
- 1 = a perfect classifier
The points on a ROC curve closest to (0, 1) represent a range of the best performing thresholds for the given model. However, there is an important assumption here: our errors are symmetric, i.e., a false positive is as bad as a false negative. Is this always the case?
In many cases, the 2 types of error are not symmetric, so we must consider this when fixing the score threshold, so that we reduce the most costly error.
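A toy sketch of the AUC with scikit-learn (the labels and scores are made up for the example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])                # true labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])  # model scores

auc = roc_auc_score(y_true, scores)
# Every positive outranks every negative here, so auc == 1.0;
# a classifier scoring at random would hover around 0.5
```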
MULTIPLE CLASSIFICATION
Imagine we had 3 classes in our target data, A, B and C, which do not follow a natural order (the data is not ordinal). How do we design a method based on logistic regression to do multinomial classification?
- Option 1: several logistic regressions
1 regression for each category → the chosen class takes value 1, the rest the value 0.
Thus, we have 3 probabilities for each new observation, one for each possible class. We
identify the maximum value of those 3.
Given new data, we evaluate all models and assign the class with the highest score.
- Option 2: multinomial logistic regression
Once we have the logits, the model estimates the probability of each class using the softmax function, which ensures the probabilities sum up to 1.
So, in our example of the 3 classes, we have:
The final output of the multinomial logistic regression is a probability per class for each observation (many packages provide the highest probability and its class). Thus, as before, given new data we evaluate the three models and assign the class with the highest score.
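The softmax step can be sketched in a few lines of numpy (the logit values for classes A, B and C are made up for the example):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the output is unchanged
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # one logit per class: A, B, C
probs = softmax(logits)              # probabilities summing to 1
predicted = int(np.argmax(probs))    # class A (index 0) wins here
```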
Ordinal (ordered) logistic regression assumes proportional odds. This means the effect of all predictors is consistent across categories ( = the impact of the features does not change for different categories). Because the relationship between all pairs of groups is the same, there is only one set of coefficients.
Use the Brant test to check whether this assumption holds.
Coefficient interpretation: exp(alpha_k) is related to the increase in the odds of being in category k or lower.
Model evaluation: mixed methods for regression and classification.
CLASS 3:
SUMMARY

| Target variable         | Task           | Model                                                             | Distribution         | Link function |
|-------------------------|----------------|-------------------------------------------------------------------|----------------------|---------------|
| Continuous              | Regression     | Linear Regression, ElasticNet, Lasso Regression, Ridge Regression | Normal               | Identity      |
| Binary                  | Classification | Logistic Regression                                               | Binomial             | Logit         |
| Non-ordered categorical | Classification | Multinomial Classification                                        | Multinomial          | Softmax       |
| Ordered categorical     | Regression     | Ordinal Logistic Regression                                       | Ordinal distribution | Ordered Logit |
The model outputs the expected mean number of occurrences (events) given the predictors, which is always non-negative.
However, it is important to know that this 𝜙 factor is not introduced as a new parameter in the model: it only modifies the way standard errors are computed after fitting the model. The model structure and link function remain the same as in Poisson Regression, but the standard errors of the estimated coefficients are adjusted based on 𝜙, ensuring that hypothesis tests and confidence intervals account for the overdispersion.
In this quasi-Poisson approach there is no full likelihood, so 𝜙 is estimated from the data (typically from the Pearson residuals) rather than by maximum likelihood; in a Negative Binomial model, by contrast, the dispersion parameter is estimated along with the other coefficients through maximum likelihood (MLE), making it part of the core model fitting process. In both cases the model structure and link function remain the same.
SOLUTIONS
- Zero-Inflated Poisson Model, ZIP
This model assumes that there are 2 processes involved and handles them separately:
- A logistic regression models the probability of producing a (structural) zero.
- For the remaining data, a Poisson Regression models the counts (and can also produce 0s).
This is our first example of mixed modelling.
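The two-process idea can be sketched by simulating zero-inflated data (the mixing probability and Poisson mean are illustrative assumptions, not values from the course):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
p_zero, lam = 0.3, 2.0   # P(structural zero), Poisson mean

# Process 1 (logistic part): does the observation come from the zero state?
structural_zero = rng.random(n) < p_zero
# Process 2 (count part): Poisson counts, which can also be zero
counts = rng.poisson(lam, n)

y = np.where(structural_zero, 0, counts)

zip_zero_share = (y == 0).mean()           # ~0.39: inflated zero share
poisson_zero_share = (counts == 0).mean()  # ~0.14 (≈ e^-2): far fewer zeros
```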
GLM FINAL REMARKS
UNDERSTANDING E(Y|X)
In Linear models, the focus is on understanding and predicting averages for the response
variable.
LMs VS GLMs
In a standard LM, the relationship between the target variable and the predictors is modeled directly: E[Y ∣ X] = Xβ. In a GLM, a link function is introduced: g(E[Y ∣ X]) = Xβ, where:
- g(⋅): the link function transforms the mean to a scale where it is linearly related to the predictors.
- E[Y ∣ X]: the mean of Y is modeled indirectly via the link function.
- The mean of Y may have a nonlinear relationship with the predictors on its original scale.
FAST SUMMARY
GLMs relate the linear combination of predictors Xβ to the expected value of the response variable E(Y|X) through a link function (glmnet + family in R).

| Target variable         | Task           | Model                                                             | Distribution | Link function |
|-------------------------|----------------|-------------------------------------------------------------------|--------------|---------------|
| Continuous              | Regression     | Linear Regression, ElasticNet, Lasso Regression, Ridge Regression | Normal       | Identity      |
| Binary                  | Classification | Logistic Regression                                               | Binomial     | Logit         |
| Non-ordered categorical | Classification | Multinomial classification                                        | Multinomial  | Softmax       |
Longitudinal data → target variable is measured more than once for each unit of analysis,
with the repeated measures likely to be correlated.
Suppose you want to study the relationship between sleep quality (y) and daily stress levels, with samples from 1,000 people recording their stress level and sleep quality each day for 30 days.
If you model this with a standard linear regression, you ignore the fact that the repeated measurements are nested within each individual: observations from the same person are not independent, leading to correlated residuals.
LMMs are statistical models for continuous outcome variables in which the residuals are normally distributed but may not be independent or have constant variance. LMMs extend linear models by adding flexibility to handle grouped data structures, where observations aren't entirely independent.
LMMs combine the benefits of different approaches, as they account for non-independence while retaining individual-level data. They model the relationship at both levels:
1. Fixed effects: like coefficients in standard linear models, these capture
population-level effects, representing the overall influence of predictors that are
assumed to be the same across all groups or clusters.
a. These are explanatory variables of main interests in our study
b. Estimate general trends
2. Random effects: these capture group-level deviations from the fixed effect trends,
allowing different groups to have their own intercepts or slopes. RE introduce an
additional layer of flexibility, allowing the model to account for intra-group
correlations and variations across groups.
a. Model group-specific deviations or variability around fixed effects
Model specification -- General
Fixed or Random effect?