FDSA Unit V Lecture Notes

The document covers the fundamentals of predictive analytics, focusing on linear least squares, regression techniques, and their applications in data science. It discusses the implementation of least squares, limitations, goodness of fit tests, and various regression methods including multiple and logistic regression. Additionally, it introduces time series analysis and the importance of resampling techniques in improving statistical accuracy.


DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3491- FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

UNIT V PREDICTIVE ANALYTICS


Linear least squares – implementation – goodness of fit – testing a linear model – weighted
resampling. Regression using StatsModels – multiple regression – nonlinear relationships –
logistic regression – estimating parameters – Time series analysis – moving averages – missing
values – serial correlation – autocorrelation. Introduction to survival analysis.
LINEAR LEAST SQUARES

➢ Least square method is the process of finding a regression line or best-fitted line for any data
set that is described by an equation.

➢ This method works by minimizing the sum of the squares of the residuals (the offsets of the points
from the curve or line), so the trend of the outcomes is found quantitatively.

➢ The least-squares method is a statistical method used to find the line of best fit of the form
of an equation such as y = mx + b to the given data.
➢ The curve of the equation is called the regression line. Our main objective in this method is
to reduce the sum of the squares of errors as much as possible. This is the reason this
method is called the least-squares method.
Limitations for Least Square Method

Even though the least-squares method is considered the best method to find the line of best fit, it
has a few limitations. They are:

• This method considers only the relationship between the two variables; all other causes
and effects are not taken into consideration.
• This method is unreliable when the data are not evenly distributed.
• This method is very sensitive to outliers, which can skew the results of the least-squares
analysis.
Least Square Graph

The straight line shows the potential relationship between the independent variable and the
dependent variable. The ultimate goal of this method is to reduce the difference between the
observed responses and the responses predicted by the regression line; smaller residuals mean the
model fits better. The method therefore minimizes the residual of each data point from the line.

3 types of regression, depending on which distance is minimized:

1. Vertical distance – direct regression
2. Horizontal distance – reverse regression
3. Perpendicular distance – major axis regression

IMPLEMENTATION

Least-square method is the curve that best fits a set of observations with a minimum sum of squared
residuals or errors. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
in which all x’s are independent variables, while all y’s are dependent ones. This method is used to
find a linear line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-
intercept. The formula to calculate slope m and the value of b is given by:
m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)
b = (∑y - m∑x)/n
Here, n is the number of data points.
Following are the steps to calculate the least square using the above formulas.
• Step 1: Draw a table with 4 columns where the first two columns are for x and y points.
• Step 2: In the next two columns, find xy and x².

• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².


• Step 4: Find the value of slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b in the equation y = mx + b.

A simple function that demonstrates linear least squares (the LeastSquares function from thinkstats2):

def LeastSquares(xs, ys):
    # sample mean and variance of the x values, and mean of the y values
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    # slope = Cov(x, y) / Var(x); the intercept makes the line pass through (meanx, meany)
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope

Example

Consider the set of points: (1, 1), (-2, -1), and (3, 2). Plot these points and the least-squares
regression line on the same graph.
Solution: There are three points, so the value of n is 3

Now, find the value of m using the formula m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²).
Here ∑x = 2, ∑y = 2, ∑xy = 9, and ∑x² = 14, so:
m = [(3×9) - (2×2)] / [(3×14) - (2)²]
m = (27 - 4) / (42 - 4)
m = 23/38

Now, find the value of b using the formula,


b = (∑y - m∑x)/n
b = [2 - (23/38)×2]/3
b = [2 - (23/19)]/3
b = 15/(3×19)
b = 5/19

So, the required equation of least squares is y = mx + b = (23/38)x + 5/19. The required graph
plots the three points together with this line.
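The same fit can be reproduced numerically; a minimal sketch using NumPy (np.polyfit with degree 1 returns the least-squares slope and intercept; the data points are the ones from the example above):

import numpy as np

xs = np.array([1, -2, 3])
ys = np.array([1, -1, 2])

# degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(xs, ys, 1)

print(slope, 23 / 38)      # both approximately 0.6053
print(intercept, 5 / 19)   # both approximately 0.2632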
GOODNESS OF FIT
A goodness-of-fit test, in general, refers to measuring how well do the observed data correspond
to the fitted (assumed) model. The goodness-of-fit test compares the observed values to the
expected (fitted or predicted) values.
A goodness-of-fit statistic tests the following hypothesis:
H0: the model M0 fits vs.
H1: the model M0 does not fit (or, some other model MA fits)

Goodness of fit tests commonly used in statistics are:


1. Chi-square
o Most common method, for categorical data.
2. Kolmogorov-Smirnov
o For continuous data.
o Compares the cumulative distributions of observed and expected data.
o Non-parametric test.
3. Anderson-Darling
o Like the K-S test, but gives more weight to the tails of the distribution.
o Often used for checking normality.
4. Shapiro-Wilk
o Specifically for testing normality in small datasets.
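For illustration, a minimal sketch of two of these tests using SciPy; the observed and expected counts and the sample are made up for the example:

import numpy as np
from scipy import stats

# Chi-square goodness of fit: do observed category counts match the expected counts?
observed = np.array([18, 22, 20, 25, 15])
expected = np.array([20, 20, 20, 20, 20])
chi2, p_chi2 = stats.chisquare(observed, f_exp=expected)
print("chi-square:", chi2, "p-value:", p_chi2)

# Shapiro-Wilk: test a small sample for normality
sample = np.random.normal(loc=0, scale=1, size=50)
w, p_sw = stats.shapiro(sample)
print("Shapiro-Wilk:", w, "p-value:", p_sw)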
Testing a Linear Model

The following measures are used to validate the simple linear regression models.
1. Co-efficient of determination
2. Hypothesis test for the regression coefficient
3. ANOVA test
4. Residual Analysis to validate the regression model
5. Outlier Analysis.

Example

Here's a HypothesisTest for the model that predicts a baby's birth weight from the mother's age:
import thinkstats2
import numpy as np

class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        # test statistic: the least-squares slope of weight on age
        ages, weights = data
        slope, _ = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        # null model: weights are unrelated to ages
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        # simulate the null hypothesis by permuting the residuals
        ages, _ = self.data
        shuffled_weights = self.ybar + np.random.permutation(self.res)
        return ages, shuffled_weights
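A usage sketch, assuming ages and weights are equal-length arrays of paired observations; PValue and actual are part of the thinkstats2 HypothesisTest interface:

# ages, weights: paired observations (mother's age and baby's birth weight)
ht = SlopeTest((ages, weights))
pvalue = ht.PValue()            # fraction of permutations with a slope as large as observed
print('observed slope:', ht.actual)
print('p-value:', pvalue)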
1. WEIGHTED RESAMPLING

Resampling is a series of techniques used in statistics to gather more information about a sample.
This can include retaking a sample or estimating its accuracy. With these additional techniques,
resampling often improves the overall accuracy and estimates any uncertainty within a population.
Sampling Vs Resampling

Sampling is the process of selecting certain groups within a population to gather data.
Resampling often involves performing similar testing methods with sample sizes within that group.
This can mean testing the same sample, or reselecting samples that can provide more information
about a population.
Resampling methods are:
1. Permutation tests (also re-randomization tests)
2. Bootstrapping
3. Cross validation
Permutation tests

Permutation tests rely on resampling the original data assuming the null hypothesis. Based
on the resampled data it can be concluded how likely the original data is to occur under the null
hypothesis.

Bootstrapping

Bootstrapping is a statistical method for estimating the sampling distribution of an


estimator by sampling with replacement from the original sample, most often with the purpose
of deriving robust estimates of standard errors and confidence intervals of a
population parameter like a mean, median, proportion, odds ratio,
correlation coefficient or regression coefficient. It has been called the plug-in principle, as it is the
method of estimation of functionals of a population distribution by evaluating the same functionals
at the empirical distribution based on a sample.

For example, when estimating the population mean, this method uses the sample
mean; to estimate the population median, it uses the sample median; to estimate the population
regression line, it uses the sample regression line.
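A minimal bootstrap sketch in NumPy for the sample mean; the sample data are made up, and the 95% interval is taken from percentiles of the resampled means:

import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=100, scale=15, size=200)   # the original sample

# resample with replacement many times and record the statistic of interest
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

print("sample mean:", sample.mean())
print("bootstrap standard error:", boot_means.std())
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))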

Cross validation
Cross-validation is a statistical method for validating a predictive model. Subsets of the
data are held out for use as validating sets; a model is fit to the remaining data (a training set) and
used to predict for the validation set. Averaging the quality of the predictions across the validation
sets yields an overall measure of prediction accuracy. Cross-validation is employed repeatedly in
building decision trees.
In weighted sampling, each element is given a weight, and the probability of an element being
selected is based on its weight. As an example, if you survey 100,000 people in a country of 300
million, each respondent represents 3,000 people.
If you oversample one group by a factor of 2, each person in the oversampled group would have a
lower weight, about 1,500. To correct for oversampling, we can use weighted resampling.
As an example, we can estimate mean birth weight with and without sampling weights, as in the sketch below.
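A sketch of weighted resampling with pandas and NumPy; the DataFrame and its column names (birth_weight, sampling_weight) are synthetic stand-ins for real survey data:

import numpy as np
import pandas as pd

def resample_rows_weighted(df, weight_col):
    # probability of picking each row is proportional to its sampling weight
    probs = (df[weight_col] / df[weight_col].sum()).to_numpy()
    indices = np.random.choice(df.index, size=len(df), replace=True, p=probs)
    return df.loc[indices]

# toy stand-in for survey data: a measurement and a sampling weight per row
df = pd.DataFrame({
    'birth_weight': np.random.normal(7.3, 1.2, size=1000),
    'sampling_weight': np.random.uniform(1000, 4000, size=1000),
})

print('unweighted mean:', df['birth_weight'].mean())
print('weighted mean:  ', resample_rows_weighted(df, 'sampling_weight')['birth_weight'].mean())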
2. REGRESSION USING STATSMODELS

Statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models.
For multiple regression we use StatsModels, a Python package that provides several forms of
regression and other analyses.

There are four classes of linear regression model in statsmodels that we can use.
The classes are as follows:
1. Ordinary Least Squares (OLS)
2. Weighted Least Squares (WLS)
3. Generalized Least Squares (GLS)
4. Generalized Least Squares with autoregressive errors (GLSAR)
Example
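A minimal sketch of a simple OLS fit with the statsmodels formula API; the data and column names are made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data: y depends linearly on x plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.uniform(0, 10, size=100)})
df['y'] = 2.0 + 0.5 * df['x'] + rng.normal(0, 1, size=100)

model = smf.ols('y ~ x', data=df)   # Ordinary Least Squares
results = model.fit()
print(results.params)               # estimated intercept and slope
print(results.rsquared)             # R-squared, a measure of goodness of fit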

4. MULTIPLE REGRESSION

Multiple regression is an extension of simple linear regression. Multiple linear regression
is a statistical technique that models more complex relationships between two or more independent
variables and one dependent variable. It is used when there are two or more x variables.
Assumptions of Multiple Linear Regression

In multiple linear regression, the dependent variable is the outcome or result you're trying to
predict, and the independent variables are the things that explain it. You can use them to build a
model that accurately predicts the dependent variable from the independent variables.
For your model to be reliable and valid, there are some essential requirements:
• The independent and dependent variables are linearly related.
• There is no strong correlation between the independent variables.

• Residuals have a constant variance.

• Observations should be independent of one another.

• It is important that all variables follow multivariate normality.

y = m1x1+m2x2+m3x3+m4x4+…. + b
Example :
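A sketch of a multiple regression with two independent variables, again with the statsmodels formula API; the variable names and data are made up:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
df['y'] = 1.0 + 3.0 * df['x1'] - 2.0 * df['x2'] + rng.normal(0, 0.5, size=n)

# y is modelled as a linear combination of x1 and x2 plus an intercept
results = smf.ols('y ~ x1 + x2', data=df).fit()
print(results.params)   # intercept and the coefficients of x1 and x2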

The Difference Between Linear and Multiple Regression

When predicting a complex process's outcome, it is best to use multiple linear regression
instead of simple linear regression.

A simple linear regression can accurately capture the relationship between two variables in
simple relationships. On the other hand, multiple linear regression can capture more complex
interactions that require more thought.
A multiple regression model uses more than one independent variable. It does not suffer
from the same limitations as the simple regression equation, and it is thus able to fit curved and
non-linear relationships. The following are the uses of multiple linear regression.

1. Planning and Control.

2. Prediction or Forecasting.

Estimating relationships between variables can be exciting and useful. As with all other
regression models, the multiple regression model assesses relationships among variables in terms
of their ability to predict the value of the dependent variable.
Nonlinear relationships
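One common way to capture a nonlinear relationship while still using linear least squares is to add a transformed predictor, such as a squared term. A minimal sketch; the quadratic data and the column names x, x2 and y are made up:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({'x': rng.uniform(-3, 3, size=200)})
df['y'] = 1.0 + 2.0 * df['x'] + 1.5 * df['x'] ** 2 + rng.normal(0, 1, size=200)

# adding x**2 as an extra column lets a linear least-squares model fit a curved relationship
df['x2'] = df['x'] ** 2
results = smf.ols('y ~ x + x2', data=df).fit()
print(results.params)   # intercept, linear term and quadratic term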
5. LOGISTIC REGRESSION

o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used: Linear Regression
is used for solving regression problems, whereas Logistic Regression is used for solving
classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between 0 and 1:
values above the threshold tend to 1, and values below the threshold tend to 0.
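A minimal sketch of the sigmoid function and a threshold of 0.5 (the input values are arbitrary):

import numpy as np

def sigmoid(z):
    # map any real value to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5: above -> 1, below -> 0
print(probs)
print(labels)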

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

  y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 - y):

  y / (1 - y); this is 0 for y = 0 and infinity for y = 1

o But we need a range from -[infinity] to +[infinity], so we take the logarithm of the equation,
  and it becomes:

  log[y / (1 - y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

6. ESTIMATING PARAMETERS
Unlike linear regression, logistic regression does not have a closed-form solution, so it is solved
by guessing an initial solution and improving it iteratively. The usual goal is to find the maximum-
likelihood estimate (MLE), which is the set of parameters that maximizes the likelihood of the data.
The goal of logistic regression is therefore to find the parameters that maximize this likelihood.
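A sketch of fitting a logistic regression by maximum likelihood with statsmodels; the binary outcome and the predictor are synthetic, and Logit's fit() iterates to the MLE as described above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({'x': rng.normal(size=n)})
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * df['x'])))   # true probabilities
df['y'] = rng.binomial(1, p)                        # binary (0/1) outcome

# Logit has no closed-form solution; .fit() iterates to the maximum-likelihood estimate
results = smf.logit('y ~ x', data=df).fit()
print(results.params)   # estimated intercept and slope on the log-odds scale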

7. TIME SERIES ANALYSIS

Time series analysis is a specific way of analyzing a sequence of data points collected over
an interval of time. In time series analysis, analysts record data points at consistent intervals over
a set period of time rather than just recording the data points intermittently or randomly.
Characteristics of Time Series Analysis:
➢ Homogeneous data
➢ Values recorded with respect to time
➢ Data covering a reasonable period
➢ Gaps between time values must be equal.
Objective of Time series Analysis:
➢ To evaluate past performance in respect of a particular period.
➢ To make future forecasts in respect of the particular variable.
➢ To chart short-term and long-term strategies of the business in respect of a particular
variable.

How to Analyze Time Series?

To perform the time series analysis, we have to follow the following steps:
• Collecting the data and cleaning it
• Preparing Visualization with respect to time vs key feature
• Observing the stationarity of the series
• Developing charts to understand its nature.
• Extracting insights from prediction
Significance of Time Series:
TSA is the backbone for prediction and forecasting analysis, specific to time-based problem
statements.

• Analyzing the historical dataset and its patterns


• Understanding and matching the current situation with patterns derived from the previous stage.
• Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and results.

• Forecasting: Predicting any value for the future.


• Segmentation: Grouping similar items together.
• Classification: Classifying a set of items into given classes.
• Descriptive analysis: Analysis of a given dataset to find out what is there in it.
• Intervention analysis: Effect of changing a given variable on the outcome.
Components of Time Series Analysis:
Let's look at the various components of Time Series Analysis:
• Trend: a long-run increase or decrease over the continuous timeline, with no fixed interval;
the trend can be positive, negative, or null.
• Seasonality: regular or fixed-interval shifts within the dataset over a continuous timeline;
the pattern is typically bell-shaped or saw-toothed.
• Cyclical: movements with no fixed interval and an uncertain pattern.
• Irregularity: unexpected situations/events/scenarios and spikes over a short time span.

What Are the Limitations of Time Series Analysis?


Time series has the below-mentioned limitations; we have to take care of those during our
data analysis.

• As with other models, missing values are not supported by TSA.
• The data points must have a linear relationship.
• Data transformations are mandatory, so they can be a little expensive.
• The models mostly work on univariate data.

A time series is a sequence of measurements from a system that varies in time. The following sketch
reads time-indexed data into a pandas DataFrame.
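A minimal sketch, assuming a CSV file sales.csv with a date column and one or more value columns; the file name and column name are illustrative:

import pandas as pd

# parse the date column and use it as a DatetimeIndex
df = pd.read_csv('sales.csv', parse_dates=['date'], index_col='date')
df = df.sort_index()

print(df.head())
print(df.index.min(), df.index.max())   # time span covered by the series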
8. MOVING AVERAGES
In the method of moving averages, successive arithmetic averages are computed from overlapping
groups of successive values of a time series. Each group includes all the observations in a given
time interval, termed the period of the moving average. Moving averages can be used for data
preparation, feature engineering, and forecasting. The trend and seasonal variations can be used to
help make predictions about the future, and as such can be very useful when budgeting and
forecasting.

Calculating moving averages
One method of establishing the underlying trend (smoothing out peaks and troughs) in a
set of data is using the moving averages technique. Other methods, such as regression analysis can
also be used to estimate the trend. Regression analysis is dealt with in a separate article.
A moving average is a series of averages, calculated from historic data. Moving averages
can be calculated for any number of time periods, for example a three-month moving average, a
seven-day moving average, or a four-quarter moving average. The basic calculations are the same.
The following simplified example will take us through the calculation process.
Monthly sales revenue data (in $000) were collected for a company for 20X2; the first four months
were January 125, February 145, March 186 and April 131.

From this data, we will calculate a three-month moving average, as we can see a basic cycle that
follows a three-monthly pattern.

Calculate the three-month moving average.

Add together the first three sets of data, for this example it would be January, February and
March. This gives a total of (125+145+186) = 456. Put this total in the middle of the data you are
adding, so in this case across from February. Then calculate the average of this total, by dividing
this figure by 3 (the figure you divide by will be the same as the number of time periods you have
added in your total column). Our three-month moving average is therefore (456 ÷ 3) = 152.
The average needs to be calculated for each three-month period. To do this you move your average
calculation down one month, so the next calculation will involve February, March and April. The total
for these three months would be (145+186+131) = 462 and the average would be (462 ÷ 3) = 154.
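The same three-month moving average can be computed with pandas; a sketch using the four monthly values given above (in $000):

import pandas as pd

sales = pd.Series([125, 145, 186, 131], index=['Jan', 'Feb', 'Mar', 'Apr'])

# rolling(3).mean() averages each value with its neighbours; center=True places
# each average against the middle month, as in the worked example above
trend = sales.rolling(window=3, center=True).mean()
print(trend)   # Feb -> 152.0, Mar -> 154.0; the first and last months are NaN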

Calculate the trend

The three-month moving average represents the trend. From our example we can see a clear
trend in that each moving average is $2,000 higher than the preceding month moving average. This
suggests that the sales revenue for the company is, on average, growing at a rate of $2,000 per
month.

This trend can now be used to predict future underlying sales values.

Calculate the seasonal variation

Once a trend has been established, any seasonal variation can be calculated. The seasonal
variation can be assumed to be the difference between the actual sales and the trend (three-month
moving average) value.
A negative variation means that the actual figure in that period is less than the trend and a positive figure
means that the actual is more than the trend.

9. MISSING VALUES
The data in the real world has many missing values in most cases, and there might be different
reasons why each value is missing: there might be loss or corruption of data, or there might be
specific reasons as well. Missing data will decrease the predictive power of your model; if you
apply algorithms to data with missing values, there will be bias in the estimation of parameters,
and you cannot be confident about your results if you don't handle the missing data.
Check for missing values
When you have a dataset, the first step is to check which columns have missing data
and how many.
The isnull() function is used for this; when you call sum() along with isnull(), the output is the
total number of missing values in each column.

missing_values = train.isnull().sum()
print(missing_values)
Dropping rows with missing values
It is a simple method, where we drop all the rows that have any missing values. As easy as this is,
it comes with a big disadvantage: you might end up losing a huge chunk of your data, which will
reduce the size of your dataset and make your model predictions biased. You should use this only
when the number of missing values is very small.
We can drop rows using the dropna function, as in the sketch below.
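A one-line sketch, assuming the data is in a DataFrame called df (as in the fillna examples below):

# drop every row that contains at least one missing value
df_dropped = df.dropna()
print(df_dropped.shape)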
Handle missing values in Time series data
The datasets where information is collected along with timestamps in an orderly fashion are
denoted as time-series data.
1. Forward-fill missing values
The last valid value before the gap (the previous row) is propagated forward to fill the missing
value. 'ffill' stands for 'forward fill'. It is very easy to implement: you just have to pass the
"method" parameter as "ffill" in the fillna() function.
forward_filled=df.fillna(method='ffill')
print(forward_filled)

2. Backward-fill missing values


Here, we use the next valid value (the following row) to fill the missing value. 'bfill' stands for
'backward fill'. Here, you need to pass 'bfill' as the method parameter.
backward_filled=df.fillna(method='bfill')
print(backward_filled)

10. SERIAL CORRELATION


Serial correlation occurs in a time series when a variable and a lagged version of itself (for
instance a variable at times T and at T-1) are observed to be correlated with one another over
periods of time. Repeating patterns often show serial correlation when the level of a variable affects
its future level. In finance, this correlation is used by technical analysts to determine how well the
past price of a security predicts the future price.
Serial correlation is similar to the statistical concepts of autocorrelation or lagged correlation.
KEY TAKEAWAYS

• Serial correlation is the relationship between a given variable and a lagged version of
itself over various time intervals.
• It measures the relationship between a variable's current value given its past values.
• A variable that is serially correlated indicates that it may not be random.
• Technical analysts validate the profitable patterns of a security or group of securities
and determine the risk associated with investment opportunities.

We can shift the time series by an interval called a lag, and then compute the correlation of
the shifted series with the original:
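A sketch of this shift-and-correlate approach, assuming series is a pandas Series ordered by time (loosely following the SerialCorr function used in ThinkStats2, but with pandas' corr):

import numpy as np
import pandas as pd

def serial_corr(series, lag=1):
    # correlation between the series and a copy of itself shifted by `lag` steps
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return xs.corr(ys)

# toy example: a random walk has strong positive serial correlation at lag 1
rng = np.random.default_rng(5)
values = pd.Series(np.cumsum(rng.normal(size=200)))
print(serial_corr(values, lag=1))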


AUTOCORRELATION
Autocorrelation refers to the degree of correlation of the same variables between two successive
time intervals. It measures how the lagged version of the value of a variable is related to the original
version of it in a time series. Autocorrelation, as a statistical concept, is also known as serial
correlation.
We calculate the correlation between two different versions, Xt and Xt-k, of the same time series.
Given time-series measurements Y1, Y2, …, YN at times X1, X2, …, XN, the lag-k autocorrelation
function is defined as:

rk = [ Σ from i=1 to N-k of (Yi - Ȳ)(Yi+k - Ȳ) ] / [ Σ from i=1 to N of (Yi - Ȳ)² ]

where Ȳ is the mean of the series.
An autocorrelation of +1 represents perfectly positive correlations and -1 represents a
perfectly negative correlation.

Usage:
• An autocorrelation test is used to detect randomness in a time series. In many statistical
processes, our assumption is that the data generated are random; to check randomness, we check the
autocorrelation at lag 1.
• To determine whether there is a relation between past and future values of a time series, we
examine the correlation at different lags.

Testing For Autocorrelation


Durbin-Watson Test:
The Durbin-Watson test is used to measure the amount of autocorrelation in the residuals from a
regression analysis; it checks for first-order autocorrelation.
Assumptions for the Durbin-Watson Test:
• The errors are normally distributed and the mean is 0.
• The errors are stationary.
The test statistic is calculated with the following formula:

d = [ Σ from t=2 to T of (et - et-1)² ] / [ Σ from t=1 to T of et² ]

where et is the residual from the Ordinary Least Squares (OLS) regression. The null hypothesis and
alternate hypothesis for the Durbin-Watson test are:
• H0: No first-order autocorrelation.
• H1: There is some first-order autocorrelation.

The Durbin-Watson statistic takes values between 0 and 4. The interpretations are:
• Around 2: no autocorrelation. Generally, we take values from 1.5 to 2.5 as no correlation.
• Between 0 and 2: positive autocorrelation. The closer the value is to 0, the stronger the sign of
positive autocorrelation.
• Between 2 and 4: negative autocorrelation. The closer the value is to 4, the stronger the sign of
negative autocorrelation.
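statsmodels provides this statistic directly; a short sketch, assuming results is a fitted OLS model such as the one in the regression examples above:

from statsmodels.stats.stattools import durbin_watson

# apply the test to the residuals of a fitted OLS model
dw = durbin_watson(results.resid)
print(dw)   # near 2 -> no first-order autocorrelation; near 0 -> positive; near 4 -> negative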

The autocorrelation function is a function that maps from lag to the serial correlation at that lag.
"Autocorrelation" is another name for serial correlation, used more often when the lag is not 1.
The acf function computes serial correlations with lags from 0 through nlags, and the unbiased flag
tells acf to correct the estimates for the sample size. The result is an array of correlations.
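A sketch using statsmodels, where values stands for any time-ordered numeric series (for example, the one built in the serial-correlation sketch above):

from statsmodels.tsa.stattools import acf

# serial correlations at lags 0 through 10; the lag-0 correlation is always 1.0
correlations = acf(values, nlags=10, fft=False)
print(correlations)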
11. Introduction to Survival analysis

Introduction

Survival analysis is a statistical method essential for analyzing time-to-event data, widely
employed in medical research, economics, and various scientific disciplines. At the core of
survival analysis is the Kaplan-Meier estimator, a powerful tool for estimating survival
probabilities over time. This article provides a concise introduction to survival analysis,
unraveling its significance and applications. We delve into the fundamentals of the Kaplan-Meier
estimator, exploring its role in handling censored data and creating survival curves. Whether
you’re new to survival analysis or seeking a refresher, this guide navigates through the key
concepts, making this statistical approach accessible and comprehensible.

What is Survival Analysis?

Survival analysis explores the time until an event occurs, answering questions about failure
rates, disease progression, or recovery duration. It’s a crucial statistical field, involving terms like
event, time, censoring, and various methods like Kaplan-Meier curves, Cox regression models,
and log-rank tests for group comparisons. This branch delves into modeling time-to-event data,
offering insights into diverse scenarios, from medical diagnoses to mechanical system failures.
Understanding survival analysis requires defining a specific time frame and employing various
statistical tools to analyze and interpret data effectively.

Censoring/ Censored Observation

This terminology is defined as if the subject matter on which we are doing the study of
survival analysis doesn’t get affected by the defined event of study, then they are described as
censored. The censored subject might also not have an event after the end of the survival analysis
observation. The subject is called censored in the sense that nothing was observed out of the subject
after the time of censoring.

Censored observations are of 3 types:


1. Right Censored

Right censoring is used in many problems. It happens when we are not certain what happened to
people after a certain point in time.

It occurs when the true event time t is greater than the censoring time c (that is, c < t). This
happens if some people cannot be followed for the entire time because they died, were lost to
follow-up, or withdrew from the study.
2. Left Censored

Left censoring is when we are not certain what happened to people before some point in time. It is
the opposite of right censoring, occurring when the true event time t is less than the censoring
time c (that is, t < c).

3. Interval Censored

Interval censoring is when we know something has happened in an interval (not before the starting
time and not after the ending time of the study), but we do not know exactly when in the interval
it happened.

Interval censoring is a combination of left and right censoring, where the event time is known to
have occurred between two time points.

Survival Function S(t): This is a probability function that depends on the time of the study:
S(t) = P(T > t), the probability that the subject survives beyond time t. The survivor function
gives the probability that the random variable T exceeds the specified time t.
Kaplan Meier Estimator

The Kaplan-Meier estimator is used to estimate the survival function for lifetime data. It is a
non-parametric statistics technique, also known as the product-limit estimator, and the concept
lies in estimating the survival time up to a certain event, such as the endpoint of a major medical
trial, a time of death, failure of a machine, or any other significant event.

There are lots of examples like

1. Failure of machine parts after several hours of operation.


2. How much time it will take for a COVID-19 vaccine to cure a patient.
3. How much time is required to recover from a medical diagnosis, etc.
4. To estimate how many employees will leave the company in a specific period of
time.
5. How many patients will be cured of lung cancer.

To estimate the Kaplan-Meier survival curve, we first need the survival function S(t), the
probability that the event has not occurred by time t:

S(t) = Π over all event times ti ≤ t of (1 - di / ni)

where di is the number of death events at time ti, and ni is the number of subjects at risk of
death just prior to time ti.

Assumptions of Kaplan Meier Survival

In real-life cases, we do not have an idea of the true survival rate function. So in Kaplan Meier
Estimator we estimate and approximate the true survival function from the study data. There are 3
assumptions of Kaplan Meier Survival
• Survival probabilities are the same for samples that joined the study late and those that joined
early; the factors that affect survival are assumed not to change over the course of the study.
• Events occur at the specified times.
• Censoring of the study does not depend on the outcome; the Kaplan-Meier method doesn't depend
on the outcome of interest.

Interpretation of the survival curve: the Y-axis shows the probability that a subject has not yet
experienced the event under study, and the X-axis shows the time survived. Each drop in the
survival function (approximated by the Kaplan-Meier estimator) is caused by the event of interest
happening for at least one observation.

The plot is often accompanied by confidence intervals that describe the uncertainty about the point
estimates: wider confidence intervals show higher uncertainty, which happens when we have few
participants, both because observations die and because they are censored.
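A minimal sketch of the Kaplan-Meier estimator using the lifelines library; the durations and event flags are made up, and event_observed = 0 marks a right-censored subject:

from lifelines import KaplanMeierFitter

# toy data: time until the event (or until censoring) and whether the event was observed
durations = [5, 6, 6, 2, 4, 4, 3, 8, 8, 9]
event_observed = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]   # 1 = event occurred, 0 = right-censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)

print(kmf.survival_function_)      # estimated S(t) at each observed time
print(kmf.median_survival_time_)
kmf.plot_survival_function()       # step plot of the Kaplan-Meier curve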
