FDSA Unit V Lecture Notes

The document covers the fundamentals of predictive analytics, focusing on linear least squares, regression techniques, and their applications in data science. It discusses the implementation of least squares, limitations, goodness of fit tests, and various regression methods including multiple and logistic regression. Additionally, it introduces time series analysis and the importance of resampling techniques in improving statistical accuracy.


DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3491- FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

UNIT V PREDICTIVE ANALYTICS


Linear least squares – implementation – goodness of fit – testing a linear model – weighted
resampling. Regression using StatsModels – multiple regression – nonlinear relationships –
logistic regression – estimating parameters – Time series analysis – moving averages – missing
values – serial correlation – autocorrelation. Introduction to survival analysis.
LINEAR LEAST SQUARES

➢ Least square method is the process of finding a regression line or best-fitted line for any data
set that is described by an equation.

➢ This method works by minimizing the sum of the squares of the residuals (the offsets of the points
from the curve or line), so the trend of the outcomes is found quantitatively.

➢ The least-squares method is a statistical method used to find the line of best fit of the form
of an equation such as y = mx + b to the given data.
➢ The curve of the equation is called the regression line. Our main objective in this method is
to reduce the sum of the squares of errors as much as possible. This is the reason this
method is called the least-squares method.
Limitations for Least Square Method

Even though the least-squares method is considered the best method to find the line of best fit, it
has a few limitations. They are:

• This method considers only the relationship between the two variables; all other causes
and effects are not taken into consideration.
• This method is unreliable when the data are not evenly distributed.
• This method is very sensitive to outliers, which can skew the results of the least-squares
analysis.
Least Square Graph

The straight line shows the potential relationship between the independent variable and the
dependent variable. The ultimate goal of this method is to reduce the difference between the
observed responses and the responses predicted by the regression line; smaller residuals mean the
model fits better. The method therefore minimizes the residual of each data point from the line.

3 types of regression, depending on which distance is minimized:

1. Vertical distance – direct regression
2. Horizontal distance – reverse regression
3. Perpendicular distance – major axis regression

IMPLEMENTATION

Least-square method is the curve that best fits a set of observations with a minimum sum of squared
residuals or errors. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
in which all x’s are independent variables, while all y’s are dependent ones. This method is used to
find a linear line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-
intercept. The formula to calculate slope m and the value of b is given by:
m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)
b = (∑y - m∑x)/n
Here, n is the number of data points.
Following are the steps to calculate the least square using the above formulas.
• Step 1: Draw a table with 4 columns where the first two columns are for x and y points.
• Step 2: In the next two columns, find xy and x².

• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².


• Step 4: Find the value of slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b in the equation y = mx + b.

A simple function that demonstrates linear least squares (the LeastSquares function from thinkstats2):

def LeastSquares(xs, ys):
    # sample mean and variance of the x values, and mean of the y values
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    # slope = Cov(x, y) / Var(x); the intercept makes the line pass through (meanx, meany)
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope

Example

Consider the set of points: (1, 1), (-2, -1), and (3, 2). Plot these points and the least-squares
regression line on the same graph.
Solution: There are three points, so the value of n is 3

Now, find the value of m using the formula m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²).
Here ∑x = 2, ∑y = 2, ∑xy = 9, and ∑x² = 14, so:
m = [(3×9) - (2×2)] / [(3×14) - (2)²]
m = (27 - 4) / (42 - 4)
m = 23/38

Now, find the value of b using the formula,


b = (∑y - m∑x)/n
b = [2 - (23/38)×2]/3
b = [2 - (23/19)]/3
b = 15/(3×19)
b = 5/19

So, the required equation of least squares is y = mx + b = (23/38)x + 5/19. The required graph
plots the three points together with this line.
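The same fit can be reproduced numerically; a minimal sketch using NumPy (np.polyfit with degree 1 returns the least-squares slope and intercept; the data points are the ones from the example above):

import numpy as np

xs = np.array([1, -2, 3])
ys = np.array([1, -1, 2])

# degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(xs, ys, 1)

print(slope, 23 / 38)      # both approximately 0.6053
print(intercept, 5 / 19)   # both approximately 0.2632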
GOODNESS OF FIT
A goodness-of-fit test, in general, refers to measuring how well do the observed data correspond
to the fitted (assumed) model. The goodness-of-fit test compares the observed values to the
expected (fitted or predicted) values.
A goodness-of-fit statistic tests the following hypothesis:
H0: the model M0 fits vs.
H1: the model M0 does not fit (or, some other model MA fits)

Goodness of fit tests commonly used in statistics are:


1. Chi-square
o Most common method, for categorical data.
2. Kolmogorov-Smirnov
o For continuous data.
o Compares the cumulative distributions of observed and expected data.
o Non-parametric test.
3. Anderson-Darling
o Like the K-S test, but gives more weight to the tails of the distribution.
o Often used for checking normality.
4. Shapiro-Wilk
o Specifically for testing normality in small datasets.
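For illustration, a minimal sketch of two of these tests using SciPy; the observed and expected counts and the sample are made up for the example:

import numpy as np
from scipy import stats

# Chi-square goodness of fit: do observed category counts match the expected counts?
observed = np.array([18, 22, 20, 25, 15])
expected = np.array([20, 20, 20, 20, 20])
chi2, p_chi2 = stats.chisquare(observed, f_exp=expected)
print("chi-square:", chi2, "p-value:", p_chi2)

# Shapiro-Wilk: test a small sample for normality
sample = np.random.normal(loc=0, scale=1, size=50)
w, p_sw = stats.shapiro(sample)
print("Shapiro-Wilk:", w, "p-value:", p_sw)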
Testing a Linear Model

The following measures are used to validate the simple linear regression models.
1. Co-efficient of determination
2. Hypothesis test for the regression coefficient
3. ANOVA test
4. Residual Analysis to validate the regression model
5. Outlier Analysis.

Example

Here's a HypothesisTest for the model that predicts a baby's birth weight from the mother's age:
import thinkstats2
import numpy as np

class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        # test statistic: the least-squares slope of weight on age
        ages, weights = data
        slope, _ = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        # null model: weights are unrelated to ages
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        # simulate the null hypothesis by permuting the residuals
        ages, _ = self.data
        shuffled_weights = self.ybar + np.random.permutation(self.res)
        return ages, shuffled_weights
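A usage sketch, assuming ages and weights are equal-length arrays of paired observations; PValue and actual are part of the thinkstats2 HypothesisTest interface:

# ages, weights: paired observations (mother's age and baby's birth weight)
ht = SlopeTest((ages, weights))
pvalue = ht.PValue()            # fraction of permutations with a slope as large as observed
print('observed slope:', ht.actual)
print('p-value:', pvalue)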
1. WEIGHTED RESAMPLING

Resampling is a series of techniques used in statistics to gather more information about a sample.
This can include retaking a sample or estimating its accuracy. With these additional techniques,
resampling often improves the overall accuracy and estimates any uncertainty within a population.
Sampling Vs Resampling

Sampling is the process of selecting certain groups within a population to gather data.
Resampling often involves performing similar testing methods with sample sizes within that group.
This can mean testing the same sample, or reselecting samples that can provide more information
about a population.
Resampling methods are:
1. Permutation tests (also re-randomization tests)
2. Bootstrapping
3. Cross validation
Permutation tests

Permutation tests rely on resampling the original data assuming the null hypothesis. Based
on the resampled data it can be concluded how likely the original data is to occur under the null
hypothesis.

Bootstrapping

Bootstrapping is a statistical method for estimating the sampling distribution of an


estimator by sampling with replacement from the original sample, most often with the purpose
of deriving robust estimates of standard errors and confidence intervals of a
population parameter like a mean, median, proportion, odds ratio,
correlation coefficient or regression coefficient. It has been called the plug-in principle, as it is the
method of estimation of functionals of a population distribution by evaluating the same functionals
at the empirical distribution based on a sample.

For example, when estimating the population mean, this method uses the sample
mean; to estimate the population median, it uses the sample median; to estimate the population
regression line, it uses the sample regression line.
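A minimal bootstrap sketch in NumPy for the sample mean; the sample data are made up, and the 95% interval is taken from percentiles of the resampled means:

import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=100, scale=15, size=200)   # the original sample

# resample with replacement many times and record the statistic of interest
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

print("sample mean:", sample.mean())
print("bootstrap standard error:", boot_means.std())
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))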

Cross validation
Cross-validation is a statistical method for validating a predictive model. Subsets of the
data are held out for use as validating sets; a model is fit to the remaining data (a training set) and
used to predict for the validation set. Averaging the quality of the predictions across the validation
sets yields an overall measure of prediction accuracy. Cross-validation is employed repeatedly in
building decision trees.
In weighted sampling, each element is given a weight, and the probability of an element being
selected is based on its weight. As an example, if you survey 100,000 people in a country of 300
million, each respondent represents 3,000 people.
If you oversample one group by a factor of 2, each person in the oversampled group would have a
lower weight, about 1,500. To correct for oversampling, we can use weighted resampling.
As an example, we can estimate mean birth weight with and without sampling weights, as in the sketch below.
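A sketch of weighted resampling with pandas and NumPy; the DataFrame and its column names (birth_weight, sampling_weight) are synthetic stand-ins for real survey data:

import numpy as np
import pandas as pd

def resample_rows_weighted(df, weight_col):
    # probability of picking each row is proportional to its sampling weight
    probs = (df[weight_col] / df[weight_col].sum()).to_numpy()
    indices = np.random.choice(df.index, size=len(df), replace=True, p=probs)
    return df.loc[indices]

# toy stand-in for survey data: a measurement and a sampling weight per row
df = pd.DataFrame({
    'birth_weight': np.random.normal(7.3, 1.2, size=1000),
    'sampling_weight': np.random.uniform(1000, 4000, size=1000),
})

print('unweighted mean:', df['birth_weight'].mean())
print('weighted mean:  ', resample_rows_weighted(df, 'sampling_weight')['birth_weight'].mean())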
2. REGRESSION USING STATSMODELS

Statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models.
For multiple regression we use StatsModels, a Python package that provides several forms of
regression and other analyses.

There are four classes of linear regression model in statsmodels that we can use.
The classes are as follows:
1. Ordinary Least Squares (OLS)
2. Weighted Least Squares (WLS)
3. Generalized Least Squares (GLS)
4. Generalized Least Squares with autoregressive errors (GLSAR)
Example
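A minimal sketch of a simple OLS fit with the statsmodels formula API; the data and column names are made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy data: y depends linearly on x plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.uniform(0, 10, size=100)})
df['y'] = 2.0 + 0.5 * df['x'] + rng.normal(0, 1, size=100)

model = smf.ols('y ~ x', data=df)   # Ordinary Least Squares
results = model.fit()
print(results.params)               # estimated intercept and slope
print(results.rsquared)             # R-squared, a measure of goodness of fit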

4. MULTIPLE REGRESSION

Multiple regression is an extension of simple linear regression. Multiple linear regression
is a statistical technique that models more complex relationships between two or more independent
variables and one dependent variable. It is used when there are two or more x variables.
Assumptions of Multiple Linear Regression

In multiple linear regression, the dependent variable is the outcome or result you're trying to
predict, and the independent variables are the things that explain it. You can use them to build a
model that accurately predicts the dependent variable from the independent variables.
For your model to be reliable and valid, there are some essential requirements:
• The independent and dependent variables are linearly related.
• There is no strong correlation between the independent variables.

• Residuals have a constant variance.

• Observations should be independent of one another.

• It is important that all variables follow multivariate normality.

y = m1x1+m2x2+m3x3+m4x4+…. + b
Example :
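A sketch of a multiple regression with two independent variables, again with the statsmodels formula API; the variable names and data are made up:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
df['y'] = 1.0 + 3.0 * df['x1'] - 2.0 * df['x2'] + rng.normal(0, 0.5, size=n)

# y is modelled as a linear combination of x1 and x2 plus an intercept
results = smf.ols('y ~ x1 + x2', data=df).fit()
print(results.params)   # intercept and the coefficients of x1 and x2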

The Difference Between Linear and Multiple Regression

When predicting a complex process's outcome, it is best to use multiple linear regression
instead of simple linear regression.

A simple linear regression can accurately capture the relationship between two variables in
simple relationships. On the other hand, multiple linear regression can capture more complex
interactions that require more thought.
A multiple regression model uses more than one independent variable. It does not suffer
from the same limitations as the simple regression equation, and it is thus able to fit curved and
non-linear relationships. The following are the uses of multiple linear regression.

1. Planning and Control.

2. Prediction or Forecasting.

Estimating relationships between variables can be exciting and useful. As with all other
regression models, the multiple regression model assesses relationships among variables in terms
of their ability to predict the value of the dependent variable.
Nonlinear relationships
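One common way to capture a nonlinear relationship while still using linear least squares is to add a transformed predictor, such as a squared term. A minimal sketch; the quadratic data and the column names x, x2 and y are made up:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({'x': rng.uniform(-3, 3, size=200)})
df['y'] = 1.0 + 2.0 * df['x'] + 1.5 * df['x'] ** 2 + rng.normal(0, 1, size=200)

# adding x**2 as an extra column lets a linear least-squares model fit a curved relationship
df['x2'] = df['x'] ** 2
results = smf.ols('y ~ x + x2', data=df).fit()
print(results.params)   # intercept, linear term and quadratic term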
5. LOGISTIC REGRESSION

o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used: Linear Regression
is used for solving regression problems, whereas Logistic Regression is used for solving
classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between 0 and 1:
values above the threshold tend to 1, and values below the threshold tend to 0.
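A minimal sketch of the sigmoid function and a threshold of 0.5 (the input values are arbitrary):

import numpy as np

def sigmoid(z):
    # map any real value to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5: above -> 1, below -> 0
print(probs)
print(labels)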

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

  y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 - y):

  y / (1 - y); this is 0 for y = 0 and infinity for y = 1

o But we need a range from -[infinity] to +[infinity], so we take the logarithm of the equation,
  and it becomes:

  log[y / (1 - y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

6. ESTIMATING PARAMETERS
Unlike linear regression, logistic regression does not have a closed-form solution, so it is solved
by guessing an initial solution and improving it iteratively. The usual goal is to find the maximum-
likelihood estimate (MLE), which is the set of parameters that maximizes the likelihood of the data.
The goal of logistic regression is therefore to find the parameters that maximize this likelihood.
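A sketch of fitting a logistic regression by maximum likelihood with statsmodels; the binary outcome and the predictor are synthetic, and Logit's fit() iterates to the MLE as described above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({'x': rng.normal(size=n)})
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * df['x'])))   # true probabilities
df['y'] = rng.binomial(1, p)                        # binary (0/1) outcome

# Logit has no closed-form solution; .fit() iterates to the maximum-likelihood estimate
results = smf.logit('y ~ x', data=df).fit()
print(results.params)   # estimated intercept and slope on the log-odds scale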

7. TIME SERIES ANALYSIS

Time series analysis is a specific way of analyzing a sequence of data points collected over
an interval of time. In time series analysis, analysts record data points at consistent intervals over
a set period of time rather than just recording the data points intermittently or randomly.
Characteristics of Time Series Analysis:
➢ Homogeneous data
➢ Values recorded with respect to time
➢ Data covering a reasonable period
➢ Gaps between time values must be equal.
Objective of Time series Analysis:
➢ To evaluate past performance in respect of a particular period.
➢ To make future forecasts in respect of the particular variable.
➢ To chart short-term and long-term strategies of the business in respect of a particular
variable.

How to Analyze Time Series?

To perform the time series analysis, we have to follow the following steps:
• Collecting the data and cleaning it
• Preparing Visualization with respect to time vs key feature
• Observing the stationarity of the series
• Developing charts to understand its nature.
• Extracting insights from prediction
Significance of Time Series:
TSA is the backbone for prediction and forecasting analysis, specific to time-based problem
statements.

• Analyzing the historical dataset and its patterns


• Understanding and matching the current situation with patterns derived from the previous stage.
• Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and results.

• Forecasting: Predicting any value for the future.


• Segmentation: Grouping similar items together.
• Classification: Classifying a set of items into given classes.
• Descriptive analysis: Analysis of a given dataset to find out what is there in it.
• Intervention analysis: Effect of changing a given variable on the outcome.
Components of Time Series Analysis:
Let's look at the various components of Time Series Analysis:
• Trend: a long-run increase or decrease over the continuous timeline, with no fixed interval;
the trend can be positive, negative, or null.
• Seasonality: regular or fixed-interval shifts within the dataset over a continuous timeline;
the pattern is typically bell-shaped or saw-toothed.
• Cyclical: movements with no fixed interval and an uncertain pattern.
• Irregularity: unexpected situations/events/scenarios and spikes over a short time span.

What Are the Limitations of Time Series Analysis?


Time series has the below-mentioned limitations; we have to take care of those during our
data analysis.

• As with other models, missing values are not supported by TSA.
• The data points must have a linear relationship.
• Data transformations are mandatory, so they can be a little expensive.
• The models mostly work on univariate data.

A time series is a sequence of measurements from a system that varies in time. The following sketch
reads time-indexed data into a pandas DataFrame.
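A minimal sketch, assuming a CSV file sales.csv with a date column and one or more value columns; the file name and column name are illustrative:

import pandas as pd

# parse the date column and use it as a DatetimeIndex
df = pd.read_csv('sales.csv', parse_dates=['date'], index_col='date')
df = df.sort_index()

print(df.head())
print(df.index.min(), df.index.max())   # time span covered by the series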
8. MOVING AVERAGES
In the method of moving averages, successive arithmetic averages are computed from overlapping
groups of successive values of a time series. Each group includes all the observations in a given
time interval, termed the period of the moving average. Moving averages can be used for data
preparation, feature engineering, and forecasting. The trend and seasonal variations can be used to
help make predictions about the future, and as such can be very useful when budgeting and
forecasting.

Calculating moving averages
One method of establishing the underlying trend (smoothing out peaks and troughs) in a
set of data is using the moving averages technique. Other methods, such as regression analysis can
also be used to estimate the trend. Regression analysis is dealt with in a separate article.
A moving average is a series of averages, calculated from historic data. Moving averages
can be calculated for any number of time periods, for example a three-month moving average, a
seven-day moving average, or a four-quarter moving average. The basic calculations are the same.
The following simplified example will take us through the calculation process.
Monthly sales revenue data (in $000) were collected for a company for 20X2; the first four months
were January 125, February 145, March 186 and April 131.

From this data, we will calculate a three-month moving average, as we can see a basic cycle that
follows a three-monthly pattern.

Calculate the three-month moving average.

Add together the first three sets of data, for this example it would be January, February and
March. This gives a total of (125+145+186) = 456. Put this total in the middle of the data you are
adding, so in this case across from February. Then calculate the average of this total, by dividing
this figure by 3 (the figure you divide by will be the same as the number of time periods you have
added in your total column). Our three-month moving average is therefore (456 ÷ 3) = 152.
The average needs to be calculated for each three-month period. To do this you move your average
calculation down one month, so the next calculation will involve February, March and April. The total
for these three months would be (145+186+131) = 462 and the average would be (462 ÷ 3) = 154.
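The same three-month moving average can be computed with pandas; a sketch using the four monthly values given above (in $000):

import pandas as pd

sales = pd.Series([125, 145, 186, 131], index=['Jan', 'Feb', 'Mar', 'Apr'])

# rolling(3).mean() averages each value with its neighbours; center=True places
# each average against the middle month, as in the worked example above
trend = sales.rolling(window=3, center=True).mean()
print(trend)   # Feb -> 152.0, Mar -> 154.0; the first and last months are NaN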

Calculate the trend

The three-month moving average represents the trend. From our example we can see a clear
trend in that each moving average is $2,000 higher than the preceding month moving average. This
suggests that the sales revenue for the company is, on average, growing at a rate of $2,000 per
month.

This trend can now be used to predict future underlying sales values.

Calculate the seasonal variation

Once a trend has been established, any seasonal variation can be calculated. The seasonal
variation can be assumed to be the difference between the actual sales and the trend (three-month
moving average) value.
A negative variation means that the actual figure in that period is less than the trend and a positive figure
means that the actual is more than the trend.

9. MISSING VALUES
The data in the real world has many missing values in most cases, and there might be different
reasons why each value is missing: there might be loss or corruption of data, or there might be
specific reasons as well. Missing data will decrease the predictive power of your model; if you
apply algorithms to data with missing values, there will be bias in the estimation of parameters,
and you cannot be confident about your results if you don't handle the missing data.
Check for missing values
When you have a dataset, the first step is to check which columns have missing data
and how many.
The isnull() function is used for this; when you call sum() along with isnull(), the output is the
total number of missing values in each column.

missing_values = train.isnull().sum()
print(missing_values)
Dropping rows with missing values
It is a simple method, where we drop all the rows that have any missing values. As easy as this is,
it comes with a big disadvantage: you might end up losing a huge chunk of your data, which will
reduce the size of your dataset and make your model predictions biased. You should use this only
when the number of missing values is very small.
We can drop rows using the dropna function, as in the sketch below.
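A one-line sketch, assuming the data is in a DataFrame called df (as in the fillna examples below):

# drop every row that contains at least one missing value
df_dropped = df.dropna()
print(df_dropped.shape)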
Handle missing values in Time series data
The datasets where information is collected along with timestamps in an orderly fashion are
denoted as time-series data.
1. Forward-fill missing values
The last valid value before the gap (the previous row) is propagated forward to fill the missing
value. 'ffill' stands for 'forward fill'. It is very easy to implement: you just have to pass the
"method" parameter as "ffill" in the fillna() function.
forward_filled=df.fillna(method='ffill')
print(forward_filled)

2. Backward-fill missing values


Here, we use the next valid value (the following row) to fill the missing value. 'bfill' stands for
'backward fill'. Here, you need to pass 'bfill' as the method parameter.
backward_filled=df.fillna(method='bfill')
print(backward_filled)

10. SERIAL CORRELATION


Serial correlation occurs in a time series when a variable and a lagged version of itself (for
instance a variable at times T and at T-1) are observed to be correlated with one another over
periods of time. Repeating patterns often show serial correlation when the level of a variable affects
its future level. In finance, this correlation is used by technical analysts to determine how well the
past price of a security predicts the future price.
Serial correlation is similar to the statistical concepts of autocorrelation or lagged correlation.
KEY TAKEAWAYS

• Serial correlation is the relationship between a given variable and a lagged version of
itself over various time intervals.
• It measures the relationship between a variable's current value given its past values.
• A variable that is serially correlated indicates that it may not be random.
• Technical analysts validate the profitable patterns of a security or group of securities
and determine the risk associated with investment opportunities.

We can shift the time series by an interval called a lag, and then compute the correlation of
the shifted series with the original:
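A sketch of this shift-and-correlate approach, assuming series is a pandas Series ordered by time (loosely following the SerialCorr function used in ThinkStats2, but with pandas' corr):

import numpy as np
import pandas as pd

def serial_corr(series, lag=1):
    # correlation between the series and a copy of itself shifted by `lag` steps
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return xs.corr(ys)

# toy example: a random walk has strong positive serial correlation at lag 1
rng = np.random.default_rng(5)
values = pd.Series(np.cumsum(rng.normal(size=200)))
print(serial_corr(values, lag=1))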


AUTOCORRELATION
Autocorrelation refers to the degree of correlation of the same variables between two successive
time intervals. It measures how the lagged version of the value of a variable is related to the original
version of it in a time series. Autocorrelation, as a statistical concept, is also known as serial
correlation.
We calculate the correlation between two different versions, Xt and Xt-k, of the same time series.
Given time-series measurements Y1, Y2, …, YN at times X1, X2, …, XN, the lag-k autocorrelation
function is defined as:

rk = [ Σ from i=1 to N-k of (Yi - Ȳ)(Yi+k - Ȳ) ] / [ Σ from i=1 to N of (Yi - Ȳ)² ]

where Ȳ is the mean of the series.
An autocorrelation of +1 represents perfectly positive correlations and -1 represents a
perfectly negative correlation.

Usage:
• An autocorrelation test is used to detect randomness in a time series. In many statistical
processes, our assumption is that the data generated are random; to check randomness, we check the
autocorrelation at lag 1.
• To determine whether there is a relation between past and future values of a time series, we
examine the correlation at different lags.

Testing For Autocorrelation


Durbin-Watson Test:
The Durbin-Watson test is used to measure the amount of autocorrelation in the residuals from a
regression analysis; it checks for first-order autocorrelation.
Assumptions for the Durbin-Watson Test:
• The errors are normally distributed and the mean is 0.
• The errors are stationary.
The test statistic is calculated with the following formula:

d = [ Σ from t=2 to T of (et - et-1)² ] / [ Σ from t=1 to T of et² ]

where et is the residual from the Ordinary Least Squares (OLS) regression. The null hypothesis and
alternate hypothesis for the Durbin-Watson test are:
• H0: No first-order autocorrelation.
• H1: There is some first-order autocorrelation.

The Durbin-Watson statistic takes values between 0 and 4. The interpretations are:
• Around 2: no autocorrelation. Generally, we take values from 1.5 to 2.5 as no correlation.
• Between 0 and 2: positive autocorrelation. The closer the value is to 0, the stronger the sign of
positive autocorrelation.
• Between 2 and 4: negative autocorrelation. The closer the value is to 4, the stronger the sign of
negative autocorrelation.
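statsmodels provides this statistic directly; a short sketch, assuming results is a fitted OLS model such as the one in the regression examples above:

from statsmodels.stats.stattools import durbin_watson

# apply the test to the residuals of a fitted OLS model
dw = durbin_watson(results.resid)
print(dw)   # near 2 -> no first-order autocorrelation; near 0 -> positive; near 4 -> negative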

The autocorrelation function is a function that maps from lag to the serial correlation at that lag.
"Autocorrelation" is another name for serial correlation, used more often when the lag is not 1.
The acf function computes serial correlations with lags from 0 through nlags, and the unbiased flag
tells acf to correct the estimates for the sample size. The result is an array of correlations.
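A sketch using statsmodels, where values stands for any time-ordered numeric series (for example, the one built in the serial-correlation sketch above):

from statsmodels.tsa.stattools import acf

# serial correlations at lags 0 through 10; the lag-0 correlation is always 1.0
correlations = acf(values, nlags=10, fft=False)
print(correlations)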
11. Introduction to Survival analysis

Introduction

Survival analysis is a statistical method essential for analyzing time-to-event data, widely
employed in medical research, economics, and various scientific disciplines. At the core of
survival analysis is the Kaplan-Meier estimator, a powerful tool for estimating survival
probabilities over time. This article provides a concise introduction to survival analysis,
unraveling its significance and applications. We delve into the fundamentals of the Kaplan-Meier
estimator, exploring its role in handling censored data and creating survival curves. Whether
you’re new to survival analysis or seeking a refresher, this guide navigates through the key
concepts, making this statistical approach accessible and comprehensible.

What is Survival Analysis?

Survival analysis explores the time until an event occurs, answering questions about failure
rates, disease progression, or recovery duration. It’s a crucial statistical field, involving terms like
event, time, censoring, and various methods like Kaplan-Meier curves, Cox regression models,
and log-rank tests for group comparisons. This branch delves into modeling time-to-event data,
offering insights into diverse scenarios, from medical diagnoses to mechanical system failures.
Understanding survival analysis requires defining a specific time frame and employing various
statistical tools to analyze and interpret data effectively.

Censoring/ Censored Observation

This terminology is defined as if the subject matter on which we are doing the study of
survival analysis doesn’t get affected by the defined event of study, then they are described as
censored. The censored subject might also not have an event after the end of the survival analysis
observation. The subject is called censored in the sense that nothing was observed out of the subject
after the time of censoring.

Censored observations are of 3 types:


1. Right Censored

Right censoring is used in many problems. It happens when we are not certain what happened to
people after a certain point in time.

It occurs when the true event time t is greater than the censoring time c (that is, c < t). This
happens if some people cannot be followed for the entire time because they died, were lost to
follow-up, or withdrew from the study.
2. Left Censored

Left censoring is when we are not certain what happened to people before some point in time. It is
the opposite of right censoring, occurring when the true event time t is less than the censoring
time c (that is, t < c).

3. Interval Censored

Interval censoring is when we know something has happened in an interval (not before the starting
time and not after the ending time of the study), but we do not know exactly when in the interval
it happened.

Interval censoring is a combination of left and right censoring, where the event time is known to
have occurred between two time points.

Survival Function S(t): This is a probability function that depends on the time of the study:
S(t) = P(T > t), the probability that the subject survives beyond time t. The survivor function
gives the probability that the random variable T exceeds the specified time t.
Kaplan Meier Estimator

The Kaplan-Meier estimator is used to estimate the survival function for lifetime data. It is a
non-parametric statistics technique, also known as the product-limit estimator, and the concept
lies in estimating the survival time up to a certain event, such as the endpoint of a major medical
trial, a time of death, failure of a machine, or any other significant event.

There are lots of examples like

1. Failure of machine parts after several hours of operation.


2. How much time it will take for a COVID-19 vaccine to cure a patient.
3. How much time is required to recover from a medical diagnosis, etc.
4. To estimate how many employees will leave the company in a specific period of
time.
5. How many patients will be cured of lung cancer.

To estimate the Kaplan-Meier survival curve, we first need the survival function S(t), the
probability that the event has not occurred by time t:

S(t) = Π over all event times ti ≤ t of (1 - di / ni)

where di is the number of death events at time ti, and ni is the number of subjects at risk of
death just prior to time ti.

Assumptions of Kaplan Meier Survival

In real-life cases, we do not have an idea of the true survival rate function. So in Kaplan Meier
Estimator we estimate and approximate the true survival function from the study data. There are 3
assumptions of Kaplan Meier Survival
• Survival probabilities are the same for samples that joined the study late and those that joined
early; the factors that affect survival are assumed not to change over the course of the study.
• Events occur at the specified times.
• Censoring of the study does not depend on the outcome; the Kaplan-Meier method doesn't depend
on the outcome of interest.

Interpretation of the survival curve: the Y-axis shows the probability that a subject has not yet
experienced the event under study, and the X-axis shows the time survived. Each drop in the
survival function (approximated by the Kaplan-Meier estimator) is caused by the event of interest
happening for at least one observation.

The plot is often accompanied by confidence intervals that describe the uncertainty about the point
estimates: wider confidence intervals show higher uncertainty, which happens when we have few
participants, both because observations die and because they are censored.
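A minimal sketch of the Kaplan-Meier estimator using the lifelines library; the durations and event flags are made up, and event_observed = 0 marks a right-censored subject:

from lifelines import KaplanMeierFitter

# toy data: time until the event (or until censoring) and whether the event was observed
durations = [5, 6, 6, 2, 4, 4, 3, 8, 8, 9]
event_observed = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]   # 1 = event occurred, 0 = right-censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)

print(kmf.survival_function_)      # estimated S(t) at each observed time
print(kmf.median_survival_time_)
kmf.plot_survival_function()       # step plot of the Kaplan-Meier curve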
