IEA 01 Probability & Statistical Method

The Standard Error of Estimate (SEE) is a statistical measure that quantifies the precision of predictions in a regression analysis. In simple terms, it indicates how closely the actual values of the dependent variable cluster around the regression line. The lower the standard error of estimate, the better the model's predictions are expected to be.

Here's a breakdown of the concept:

1. Regression Analysis:
• The Standard Error of Estimate is often associated with regression
analysis, which is a statistical technique used to model the relationship
between a dependent variable and one or more independent variables.
2. Predicted Values:
• In regression analysis, the model generates predicted values for the
dependent variable based on the values of the independent variables.
3. Actual Values:
• The actual values of the dependent variable are the real observations from
the dataset.
4. Residuals:
• Residuals are the differences between the actual values and the predicted
values. A positive residual indicates that the actual value is higher than the
predicted value, while a negative residual indicates the opposite.
5. Standard Error of Estimate:
• The SEE is a measure of the typical size of the residuals. It quantifies the
dispersion or variability of actual values around the regression line.
Mathematically, it is calculated as the square root of the mean squared
residuals.
SEE = √( Σ(Yi − Ŷi)² / (n − k) )
• Where:
• Yi is the actual value of the dependent variable.
• Ŷi is the predicted value of the dependent variable.
• n is the number of data points.
• k is the number of independent variables in the model.
6. Interpretation:
• A smaller SEE suggests that the model's predictions are closer to the
actual values, indicating a better fit.
• A larger SEE implies more variability in the data points around the
regression line and suggests that the model may not be capturing the
underlying patterns well.

In summary, the Standard Error of Estimate helps assess the precision of a regression
model by providing a measure of how well the predicted values align with the actual
observations.
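
As a minimal illustration of the formula above, the following sketch computes the SEE for a small set of hypothetical actual and predicted values; the numbers and variable names are invented purely for demonstration.

```python
import numpy as np

# Hypothetical data: actual values y and predictions y_hat from a fitted
# regression with one independent variable (so k = 1).
y = np.array([3.1, 4.0, 5.2, 6.1, 6.8, 8.2])
y_hat = np.array([3.0, 4.1, 5.0, 6.0, 7.0, 8.0])

n, k = len(y), 1
residuals = y - y_hat
see = np.sqrt(np.sum(residuals ** 2) / (n - k))  # divisor n - k, as in the formula above
print(f"Standard Error of Estimate: {see:.3f}")
```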

Time Series Forecasting:

A time series is a series of data points ordered chronologically. Each data point in a
time series is associated with a specific time and represents the value of a variable or
several variables at that particular time. Time series data is commonly used in various
fields, and time series analysis is particularly important in forecasting.

Key Components of Time Series:

1. Time Interval:
• Time series data is collected at regular intervals over time. The time
intervals can be hourly, daily, monthly, annually, etc., depending on the
nature of the data and the analysis requirements.
2. Data Points:
• Each data point in a time series corresponds to a specific time and
contains information about the variable(s) being measured at that time.
3. Trend:
• A trend represents the long-term movement or pattern in the data. It could
be increasing, decreasing, or stable over time.
4. Seasonality:
• Seasonality refers to recurring patterns or fluctuations in the data that
occur at regular intervals, often associated with certain seasons, months,
days of the week, etc.
5. Cyclic Patterns:
• Cyclic patterns are longer-term undulating patterns that are not strictly tied
to specific calendar intervals. They may occur over several years and are
more irregular than seasonal patterns.
6. Random Fluctuations:
• Random fluctuations are irregular movements in the data that cannot be
attributed to trends, seasonality, or cycles. They represent the inherent
variability or noise in the data.

Time series forecasting involves using historical time series data to predict future values
of the variable(s) of interest. This is particularly useful in various domains such as
finance, economics, weather forecasting, stock market analysis, and more. Forecasting
methods aim to capture and model the underlying patterns in the time series data to
make accurate predictions.

Common techniques for time series forecasting include:

1. Moving Averages:
• Simple Moving Average (SMA) and Exponential Moving Average (EMA)
are methods where the predicted value is based on the average of past
observations.
2. Autoregressive Integrated Moving Average (ARIMA):
• ARIMA is a popular and powerful model for time series forecasting that
combines autoregression, differencing, and moving averages.
3. Seasonal-Trend Decomposition using Loess (STL):
• STL decomposes a time series into its trend, seasonal, and residual
components, making it easier to model and forecast each component
separately.
4. Prophet:
• Developed by Facebook, Prophet is a forecasting tool designed for time
series data with daily observations that may contain missing data and
include various seasonalities.
5. Machine Learning Models:
• Various machine learning algorithms, such as decision trees, random
forests, and neural networks, can be applied to time series forecasting,
especially when dealing with complex and nonlinear patterns.

Effective time series forecasting requires a good understanding of the underlying patterns, careful selection of forecasting models, and validation of the models using appropriate evaluation metrics. It is a valuable tool for decision-making in many fields where anticipating future trends and values is crucial.
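
To make the moving-average methods listed above concrete, here is a minimal pandas sketch on an invented monthly series; the data values and the window length of 3 are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical monthly sales figures (values invented for illustration).
sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

# Simple Moving Average (SMA): mean of the last 3 observations at each point.
sma = sales.rolling(window=3).mean()

# Exponential Moving Average (EMA): recent observations weighted more heavily.
ema = sales.ewm(span=3, adjust=False).mean()

# A naive SMA forecast for the next period is the mean of the last window.
next_period_forecast = sales.tail(3).mean()
print(sma.tail(), ema.tail(), next_period_forecast, sep="\n")
```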

Monthly salary X in a small organisation is normally distributed with mean Rs. 3000 and standard deviation of Rs. 250. What should be the minimum salary of a worker in this organisation so that he belongs to the top 5% of workers?
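
A sketch of one way to solve this: since the top 5% of workers lie above the 95th percentile of X, the cutoff is mean + z(0.95) × standard deviation, with z(0.95) ≈ 1.645.

```python
from scipy.stats import norm

mu, sigma = 3000, 250
# 95th percentile of X: the salary above which only the top 5% of workers fall.
cutoff = norm.ppf(0.95, loc=mu, scale=sigma)
print(round(cutoff, 2))  # ≈ 3000 + 1.645 * 250 ≈ Rs. 3411
```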

A tire manufacturer wants to determine the inner diameter of a certain grade of tire. Ideally, the diameter would be 35 mm. The data are as follows: 35, 36, 35, 34, 33, 35, 36, 34

(i) Find the sample mean and median.


(ii) Find the sample variance, standard deviation and range.
(iii) Using the calculated statistics in parts (i) and (ii), comment on the quality of the tires.

Arrange the data in ascending order: 33, 34, 34, 35, 35, 35, 36, 36
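
A quick sketch of the calculations for parts (i) and (ii) using Python's statistics module; the sample (n − 1 divisor) variance is assumed, as is usual for sample data.

```python
import statistics

data = [35, 36, 35, 34, 33, 35, 36, 34]
sorted_data = sorted(data)              # 33, 34, 34, 35, 35, 35, 36, 36

mean = statistics.mean(data)            # 34.75
median = statistics.median(data)        # 35.0
variance = statistics.variance(data)    # sample variance ≈ 1.07
std_dev = statistics.stdev(data)        # ≈ 1.04
value_range = max(data) - min(data)     # 3
print(mean, median, variance, std_dev, value_range)
```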

In a certain assembly plant, three machines, B1, B2, and B3 make 35%, 40% and 25%,
respectively of the products. It is known from past experience that 2%, 3% and 2% of the
products made by each machine respectively are defective. Now, suppose that a finished
product is randomly selected. What is the probability that it was made by machine B3?
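
As stated, the answer is simply machine B3's share of the output, 0.25. This set-up is usually paired with the condition that the selected product is defective, so the sketch below also shows the Bayes' theorem calculation under that assumed follow-up; treat the conditional part as an assumption about the intended question.

```python
# Prior probabilities of each machine producing an item, and their defect rates.
priors = {"B1": 0.35, "B2": 0.40, "B3": 0.25}
defect_rates = {"B1": 0.02, "B2": 0.03, "B3": 0.02}

# As literally asked: P(made by B3) is just the prior share of B3's output.
p_b3 = priors["B3"]  # 0.25

# Assumed follow-up: P(made by B3 | product is defective), via Bayes' theorem.
p_defective = sum(priors[m] * defect_rates[m] for m in priors)          # 0.024
p_b3_given_defective = priors["B3"] * defect_rates["B3"] / p_defective  # ≈ 0.208
print(p_b3, p_defective, p_b3_given_defective)
```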

A factory produces an article by mass production methods. From past experience it is found that, on average, 25 articles are rejected out of every batch of 100. Find the mean and the variance of the number of rejected articles.
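
Reading the number of rejects in a batch as a binomial variable with n = 100 and p = 25/100 = 0.25 (the usual interpretation of this problem), the mean and variance follow directly; a minimal sketch:

```python
# Number of rejected articles per batch of 100, modelled as Binomial(n, p),
# with p taken from the stated long-run average of 25 rejects per 100.
n, p = 100, 0.25
mean = n * p                # 25
variance = n * p * (1 - p)  # 18.75
print(mean, variance)
```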

Ten college students were given a test in Statistics and their scores were recorded. They were given a month's special coaching and a second test was given to them in the same subject at the end of the coaching period. Test whether the marks given below give evidence that the students benefited from the coaching.
Marks in Test I 75 70 60 75 80 85 68 75 65 58
Marks in Test II 68 70 55 73 75 78 80 92 64 55
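
One standard way to carry out this test is a paired t-test on the before-and-after scores. The sketch below uses SciPy and assumes a one-sided alternative that coaching raises marks (the alternative argument requires SciPy 1.6 or later).

```python
from scipy import stats

test1 = [75, 70, 60, 75, 80, 85, 68, 75, 65, 58]
test2 = [68, 70, 55, 73, 75, 78, 80, 92, 64, 55]

# Paired (dependent) t-test; H0: coaching produces no improvement.
t_stat, p_value = stats.ttest_rel(test2, test1, alternative="greater")
print(t_stat, p_value)
```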

Explain with illustration sample space, event, exhaustive events, and mutually exclusive events.
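
A compact illustration using the standard example of rolling a fair die (the die example is an assumed illustration, not taken from the notes):

```python
# Sample space: all possible outcomes of one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}

even = {2, 4, 6}   # an event: "the roll is even"
odd = {1, 3, 5}    # another event: "the roll is odd"

# Mutually exclusive: the two events share no outcomes.
print(even & odd == set())           # True
# Exhaustive: together the events cover the entire sample space.
print((even | odd) == sample_space)  # True
```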

What Is Analysis of Variance (ANOVA)?
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate
variability found inside a data set into two parts: systematic factors and random factors. The systematic
factors have a statistical influence on the given data set, while the random factors do not. Analysts use
the ANOVA test to determine the influence that independent variables have on the dependent variable in
a regression study.

The Formula for ANOVA is:

F = MST / MSE

where:
F = ANOVA coefficient (the F statistic)
MST = Mean sum of squares due to treatment
MSE = Mean sum of squares due to error

What Does the Analysis of Variance Reveal?

The ANOVA test is the initial step in analyzing factors that affect a given data set. Once the test is
finished, an analyst performs additional testing on the methodical factors that measurably contribute to
the data set's inconsistency. The analyst utilizes the ANOVA test results in an f-test to generate
additional data that aligns with the proposed regression models.

The ANOVA test allows a comparison of more than two groups at the same time to determine whether a
relationship exists between them. The result of the ANOVA formula, the F statistic (also called the F-
ratio), allows for the analysis of multiple groups of data to determine the variability between samples and
within samples.

If no real difference exists between the tested groups, which is called the null hypothesis, the result of
the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible values of the F statistic is
the F-distribution. This is actually a group of distribution functions, with two characteristic numbers,
called the numerator degrees of freedom and the denominator degrees of freedom.

Example of How to Use ANOVA

A researcher might, for example, test students from multiple colleges to see if students from one of the
colleges consistently outperform students from the other colleges. In a business application, an R&D
researcher might test two different processes of creating a product to see if one process is better than
the other in terms of cost efficiency.

The type of ANOVA test used depends on a number of factors. It is applied when the data are experimental. Analysis of variance can also be computed by hand when statistical software is not available; in that form it is simple to use and best suited to small samples. With many experimental designs, the sample sizes have to be the same for the various factor level combinations.

ANOVA is helpful for testing three or more variables. It is similar to running multiple two-sample t-tests, but it results in fewer Type I errors and is appropriate for a range of issues. ANOVA assesses differences by comparing the means of each group and partitioning the observed variance into its different sources. It is employed with subjects, test groups, between-group comparisons and within-group comparisons.
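
Continuing the colleges example above, the sketch below runs a one-way ANOVA with SciPy; the scores are invented solely to show how the comparison of three groups is set up.

```python
from scipy import stats

# Hypothetical exam scores from three colleges (values invented for illustration).
college_a = [85, 86, 88, 75, 78, 94, 98]
college_b = [81, 82, 83, 88, 79, 85, 80]
college_c = [72, 74, 69, 77, 71, 73, 78]

# One-way ANOVA: an F-ratio close to 1 suggests no real difference between group means.
f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f_stat, p_value)
```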

Introduction to Design of Experiments (DOE)
If you look at many industries today you see similar products being offered by multiple
manufacturers. Many companies today are frequently re-designing their products in an
attempt to make their product stand out from the crowd. In addition, a great number of
manufacturers are constantly developing new products to gain a foothold in other
markets. With new products come new or changed processes. Every time we change a
design or process we introduce new content. The amount of new content can be
equated to the level of risk in the design or process. Product validation testing and
prototype production runs are effective, but costly and in many cases problems are
detected late in the development process. Engineers must use various analysis tools
and statistical methods to reduce risk in a design or process. They must evaluate every
change and how it could affect the process output. If you have multiple changes
occurring at one time you could be multiplying your risk. So what can be done to predict
how a set of changes will likely affect the process output? Design of Experiments
(DOE) is a statistical tool available to engineers that can be used to evaluate single
changes or multiple changes to a process at once and predict the resulting change to
the output of the process.

What is Design of Experiments (DOE)


Design of Experiments (DOE) is a branch of applied statistics focused on using the
scientific method for planning, conducting, analyzing and interpreting data from
controlled tests or experiments. DOE is a mathematical methodology used to effectively
plan and conduct scientific studies that change input variables (X) together to reveal
their effect on a given response or the output variable (Y). In plain, non-statistical
language, the DOE allows you to evaluate multiple variables or inputs to a process or
design, their interactions with each other and their impact on the output. In addition, if
performed and analyzed properly you should be able to determine which variables have
the most and least impact on the output. By knowing this you can design a product or
process that meets or exceeds quality requirements and satisfies customer needs.

Why Utilize Design of Experiments (DOE)


DOE allows the experimenter to manipulate multiple inputs to determine their effect on
the output of the experiment or process. By performing a multi-factorial or “full-factorial”
experiment, DOE can reveal critical interactions that are often missed when performing
a single or “fractional factorial” experiment. By properly utilizing DOE methodology, the
number of trial builds or test runs can be greatly reduced. A robust Design of
Experiments can save project time and uncover hidden issues in the process. The
hidden issues are generally associated with the interactions of the various factors. In
the end, teams will be able to identify which factors impact the process the most and
which ones have the least influence on the process output.

When to Utilize Design of Experiments (DOE)
Experimental design or Design of Experiments can be used during a New Product /
Process Introduction (NPI) project or during a Kaizen or process improvement
exercise. DOE is generally used in two different stages of process improvement
projects.
• During the “Analyze” phase of a project, DOE can be used to help identify the Root
Cause of a problem. With DOE the team can examine the effects of the various inputs
(X) on the output (Y). DOE enables the team to determine which of the Xs impact the Y
and which one(s) have the most impact.
• During the “Improve” phase of a project, DOE can be used in the development of a
predictive equation, enabling the performance of what-if analysis. The team can then
test different ideas to assist in determining the optimum settings for the Xs to achieve the
best Y output.
Some knowledge of statistical tools and experimental planning is required to fully
understand DOE methodology. While there are several software programs available for
DOE analysis, to properly apply DOE you need to possess an understanding of basic
statistical concepts.

How to Perform a Design of Experiments (DOE)
A DOE generally consists of the following four main phases, detailed below.
The Experimental Plan
• Gain a thorough understanding of the process being analyzed
o Include all inputs and outputs of the process
o Define the problem or goal of the experimenter
o Clearly identify any specific questions that you need the experiment to answer
• List the known or expected sources of variability in the experimental units (X)
• Determine how to identify and block the uncontrollable inputs (S)
• The method of measurement for the output (Y) should be determined
o Attribute measures (pass / fail) should be avoided
o Measurement Systems Analysis (MSA) should be performed (if not previously
completed) on the selected measurement system

The Test
• The test matrix should include all variables and identify all possible combinations for
each of the controllable input factors of the process
o The number of variables is up to the experimenter

Analyze the Results


After completing the experiment and collecting the data, the next step is to analyze the data
and determine which input factors (X) or interactions (S) had the most impact on the process
output (Y). By analyzing the data the experimenter can optimize the process by determining the
combination of variables that produce the most desirable process output (Y).

Determine Appropriate Actions


Once the experiment is complete and the data is analyzed, actions must be identified to
improve the process. The experimenter or team should determine any appropriate actions to be
taken, assign an owner and a due date for each action.
DOE Example
The example described below is a simple experiment meant only to demonstrate the four steps
of a basic Design of Experiments. Using DOE on your processes will most likely involve several
input factors (X) and multiple interactions (S). Examining each factor individually would require
a tremendous amount of time and resources. Using DOE enables the experimenter to examine
multiple factors at once, including the effect of interactions between factors, reducing the
required number of runs, thus saving time and valuable resources.
• Let’s say we want to evaluate the inputs of sunlight and water in relation to a plant's growth. Use the formula 2^n to determine how many tests need to be run, where n is equal to the number of factors being examined. For our two-factor (sunlight and water) experiment, 2^2 = four runs are required. We will need to represent each factor at its highest and lowest points. The experiments need to be performed in a randomized fashion.
• For each of the input factors, determine the high and low levels to be used for the
experiment. While the levels may be outside of the range currently applied, make sure
they are still realistic. There will be uncontrollable factors, such as soil temperature,
involved that must be blocked from the experiment. For our example, the levels will be
as follows:
o Controllable Factor A: Water, Levels: 0.5 to 1.5 cups per day
o Controllable Factor B: Sunlight, Levels: 1 to 3 hours per day
o Uncontrollable Factors could include soil temperature
• For our experiment, the amount of sunlight per day had a greater effect on the plant growth, and the effect of the interactions was insignificant.
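
To make the 2^n design concrete, the sketch below enumerates the four runs of the water/sunlight example as a full-factorial table; the level values are the ones assumed in the example above.

```python
from itertools import product

# Two controllable factors, each at a low and a high level (a 2^2 design).
water_levels = [0.5, 1.5]     # cups per day
sunlight_levels = [1, 3]      # hours per day

# Full-factorial design: 2^2 = 4 runs covering every combination of levels.
for run, (water, sun) in enumerate(product(water_levels, sunlight_levels), start=1):
    print(f"Run {run}: water = {water} cups/day, sunlight = {sun} h/day")
# In practice the run order would be randomized before executing the experiment.
```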

A process engineer is investigating the effect of process operating temperature X on product yield Y. The study results in the following data:

X: 100  110  120  130  140  150  160  170  180  190
Y:  45   51   54   61   66   70   74   78   85   89

Find the equation of the least squares line which will enable prediction of the yield on the basis of temperature. Also find the degree of relationship between the temperature and the yield.
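
A sketch of the least squares fit and the correlation for these data using NumPy; the numeric values in the comments are approximate.

```python
import numpy as np

x = np.array([100, 110, 120, 130, 140, 150, 160, 170, 180, 190], dtype=float)
y = np.array([45, 51, 54, 61, 66, 70, 74, 78, 85, 89], dtype=float)

# Least squares line y = a + b*x (polyfit returns the slope first).
b, a = np.polyfit(x, y, 1)     # slope b ≈ 0.483, intercept a ≈ -2.74
# Degree of relationship: Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]    # ≈ 0.998, a very strong positive relationship
print(a, b, r)
```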

Investigate the association between the darkness of eye colour in father and son from the following data:

                             Colour of son’s eyes
Colour of father’s eyes    Dark    Not Dark    Total
Dark                         58        90        148
Not Dark                     80       782        862
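
One way to test the association is a chi-square test on the 2×2 contingency table; a minimal SciPy sketch is below. Note that SciPy applies Yates' continuity correction to 2×2 tables by default, so a hand calculation without the correction will differ slightly.

```python
from scipy.stats import chi2_contingency

# Observed frequencies: rows = father's eye colour, columns = son's eye colour.
observed = [[58, 90],    # father Dark:     son Dark, son Not Dark
            [80, 782]]   # father Not Dark: son Dark, son Not Dark

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
```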

Present a brief overview of SPSS package along with its features

SPSS (Statistical Package for the Social Sciences) is a comprehensive statistical software package widely used in social science and business research for data analysis and interpretation. Here's a brief overview of SPSS along with its key features:

1. Data Management:
• SPSS provides tools for data entry, data cleaning, and manipulation.
• It supports various data types, including numeric, string, and date formats.
2. Descriptive Statistics:
• Generates descriptive statistics like mean, median, mode, standard
deviation, and more.
• Frequency distribution and cross-tabulation features help in summarizing
and exploring data.
3. Inferential Statistics:
• Performs a wide range of inferential statistical analyses such as t-tests,
ANOVA, regression analysis, chi-square tests, and non-parametric tests.
• Allows users to test hypotheses and make inferences about populations
based on sample data.
4. Graphics and Visualization:
• Offers a variety of charts and graphs for visualizing data, including bar
charts, histograms, scatterplots, and box plots.
• Customization options for enhancing the visual representation of results.
5. Data Transformation:
• Supports variable transformations, recoding, and the creation of derived
variables.
• Handles missing data and allows imputation methods.
6. Advanced Analytics:
• Includes advanced statistical techniques such as factor analysis, cluster
analysis, and discriminant analysis.
• Supports complex survey data analysis and bootstrapping.
7. Syntax and Programming:
• SPSS allows users to work with a graphical user interface (GUI) for point-
and-click operations.
• Additionally, users can utilize syntax language for scripting and
automating analyses, enhancing reproducibility.
8. Data Export and Integration:
• Enables the import and export of data in various formats, facilitating
integration with other statistical software and data sources.
9. Reporting:
• SPSS allows the creation of professional-looking reports with tables and
charts.
• Integration with other software like Microsoft Excel and Word for seamless
reporting.
10. Extensibility:
• Supports the integration of external extensions and plugins to enhance
functionality.
• Customizable and extensible through its Python and R integration.

SPSS is widely used in academia, market research, healthcare, and various industries
where data analysis is crucial for decision-making. Its user-friendly interface, broad
range of statistical tools, and versatility make it a popular choice for researchers and
analysts conducting statistical analyses on diverse datasets.

The Central Limit Theorem (CLT) is a fundamental concept in statistics that plays a crucial role in inferential statistics. It states that, under certain conditions, the distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal distribution, regardless of the original distribution of the variables.

Here are the key points of the Central Limit Theorem:

1. Random Sampling: The observations must be obtained through a
random process, meaning each observation is independent of the
others.
2. Independence: Each observation must be independent of the others.
The outcome of one observation should not affect the outcome of
another.
3. Sample Size: The sample size should be sufficiently large. While there
is no strict rule about what constitutes "large," a sample size of 30 or
more is often considered adequate for the CLT to apply. However,
smaller sample sizes can also result in approximately normal
distributions if the underlying population is itself normally distributed.
4. Population Distribution: The original population distribution does not
need to be normal. The CLT works for a wide range of distributions, as
long as the sample size is large enough.

The consequences of the Central Limit Theorem are significant for statistical
inference. Regardless of the shape of the original population distribution, the
distribution of the sample mean will tend to be approximately normal. This
normal distribution becomes particularly important when making inferences
about population parameters using statistical tests and confidence intervals.

The Central Limit Theorem forms the basis for many statistical techniques,
such as hypothesis testing and confidence interval estimation. It allows
statisticians to make valid inferences about a population based on the
properties of the sample mean distribution, even when the underlying
population distribution may be unknown or not normal.
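
A small simulation makes the theorem tangible. The sketch below draws repeated samples from a clearly non-normal (exponential) population and shows that the sample means cluster around the population mean with spread close to sigma/sqrt(n); the population and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sample_size = 40        # comfortably above the usual n >= 30 guideline
num_samples = 10_000
# Exponential population with mean 2.0 and standard deviation 2.0 (right-skewed).
samples = rng.exponential(scale=2.0, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# The distribution of the sample means is approximately normal:
# mean ≈ 2.0 and standard deviation ≈ 2.0 / sqrt(40) ≈ 0.316.
print(sample_means.mean(), sample_means.std())
```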

