IEA 01 Probability & Statistical Method
Standard Error of Estimate (SEE):
1. Regression Analysis:
• The Standard Error of Estimate is often associated with regression
analysis, which is a statistical technique used to model the relationship
between a dependent variable and one or more independent variables.
2. Predicted Values:
• In regression analysis, the model generates predicted values for the
dependent variable based on the values of the independent variables.
3. Actual Values:
• The actual values of the dependent variable are the real observations from
the dataset.
4. Residuals:
• Residuals are the differences between the actual values and the predicted
values. A positive residual indicates that the actual value is higher than the
predicted value, while a negative residual indicates the opposite.
5. Standard Error of Estimate:
• The SEE is a measure of the typical size of the residuals. It quantifies the
dispersion or variability of actual values around the regression line.
Mathematically, it is calculated as the square root of the mean squared
residuals.
SEE = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k}}
• Where:
• Y_i is the actual value of the dependent variable.
• Ŷ_i is the predicted value of the dependent variable.
• n is the number of data points.
• k is the number of independent variables in the model.
6. Interpretation:
• A smaller SEE suggests that the model's predictions are closer to the
actual values, indicating a better fit.
• A larger SEE implies more variability in the data points around the
regression line and suggests that the model may not be capturing the
underlying patterns well.
In summary, the Standard Error of Estimate helps assess the precision of a regression
model by providing a measure of how well the predicted values align with the actual
observations; a short computational sketch is given below.
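As a minimal sketch (in Python, assuming NumPy is available; the data values are hypothetical), the SEE can be computed directly from the residuals of a fitted line:

```python
import numpy as np

# Hypothetical data: one independent variable (k = 1)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit a straight line y_hat = b0 + b1 * x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# SEE = sqrt( sum of squared residuals / (n - k) ),
# using the denominator given in the notes above
n, k = len(y), 1
see = np.sqrt(np.sum((y - y_hat) ** 2) / (n - k))
print(f"SEE = {see:.4f}")
```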
Time Series Forecasting:
A time series is a series of data points ordered chronologically. Each data point in a
time series is associated with a specific time and represents the value of a variable or
several variables at that particular time. Time series data is commonly used in various
fields, and time series analysis is particularly important in forecasting.
1. Time Interval:
• Time series data is collected at regular intervals over time. The time
intervals can be hourly, daily, monthly, annually, etc., depending on the
nature of the data and the analysis requirements.
2. Data Points:
• Each data point in a time series corresponds to a specific time and
contains information about the variable(s) being measured at that time.
3. Trend:
• A trend represents the long-term movement or pattern in the data. It could
be increasing, decreasing, or stable over time.
4. Seasonality:
• Seasonality refers to recurring patterns or fluctuations in the data that
occur at regular intervals, often associated with certain seasons, months,
days of the week, etc.
5. Cyclic Patterns:
• Cyclic patterns are longer-term undulating patterns that are not strictly tied
to specific calendar intervals. They may occur over several years and are
more irregular than seasonal patterns.
6. Random Fluctuations:
• Random fluctuations are irregular movements in the data that cannot be
attributed to trends, seasonality, or cycles. They represent the inherent
variability or noise in the data.
Time series forecasting involves using historical time series data to predict future values
of the variable(s) of interest. This is particularly useful in various domains such as
finance, economics, weather forecasting, stock market analysis, and more. Forecasting
methods aim to capture and model the underlying patterns in the time series data to
make accurate predictions.
1. Moving Averages:
• Simple Moving Average (SMA) and Exponential Moving Average (EMA)
are methods in which the predicted value is based on an average of past
observations (a short SMA sketch follows this list).
2. Autoregressive Integrated Moving Average (ARIMA):
• ARIMA is a popular and powerful model for time series forecasting that
combines autoregression, differencing, and moving averages.
3. Seasonal-Trend Decomposition using LOESS (STL):
• STL decomposes a time series into its trend, seasonal, and residual
components, making it easier to model and forecast each component
separately.
4. Prophet:
• Developed by Facebook, Prophet is a forecasting tool designed for time
series data with daily observations that may contain missing data and
include various seasonalities.
5. Machine Learning Models:
• Various machine learning algorithms, such as decision trees, random
forests, and neural networks, can be applied to time series forecasting,
especially when dealing with complex and nonlinear patterns.
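As a minimal sketch of item 1 above (in Python with NumPy; the series values and window length are hypothetical), a simple moving average forecast looks like this:

```python
import numpy as np

# Hypothetical monthly demand series
series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

window = 3  # averaging window (chosen for illustration)

# Simple Moving Average forecast: the next value is predicted
# as the mean of the last `window` observations
forecast_next = series[-window:].mean()
print(f"SMA({window}) forecast for the next period: {forecast_next:.1f}")
```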
Monthly salary X in a small organisation is normally distributed with mean Rs. 3000 and
standard deviation Rs. 250. What should be the minimum salary of a worker in this
organisation so that he belongs to the top 5% of workers?
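A minimal sketch of the calculation (in Python, assuming SciPy is available): the required cut-off is the 95th percentile of a normal distribution with mean 3000 and standard deviation 250.

```python
from scipy.stats import norm

mu, sigma = 3000, 250

# Minimum salary such that only the top 5% of workers earn more:
# the 95th percentile of N(3000, 250)
cutoff = norm.ppf(0.95, loc=mu, scale=sigma)
print(f"Minimum salary for the top 5%: Rs. {cutoff:.2f}")  # roughly Rs. 3411
```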
A tire manufacturer wants to determine the inner diameter of a certain grade of tire. Ideally, the
diameter would be 35 mm. The data are as follows: 35, 36, 35, 34, 33, 35, 36, 34.
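The question as recorded does not say what is to be computed; a minimal descriptive-statistics sketch (in Python with NumPy) for the sample mean, median, and standard deviation:

```python
import numpy as np

diameters = np.array([35, 36, 35, 34, 33, 35, 36, 34], dtype=float)

print("Sample mean:   ", diameters.mean())        # 34.75
print("Sample median: ", np.median(diameters))    # 35.0
print("Sample std (s):", diameters.std(ddof=1))   # about 1.04
```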
In a certain assembly plant, three machines B1, B2, and B3 make 35%, 40%, and 25% of the
products, respectively. It is known from past experience that 2%, 3%, and 2% of the products
made by these machines, respectively, are defective. Now suppose that a finished product is
randomly selected. What is the probability that it was made by machine B3?
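The question as worded stops at "made by machine B3"; in the standard version of this problem the selected product is also found to be defective, and Bayes' theorem gives the posterior probability. A minimal Python sketch under that assumption:

```python
# Prior probability that a randomly chosen product comes from each machine
prior = {"B1": 0.35, "B2": 0.40, "B3": 0.25}
# Probability that a product from each machine is defective
p_def = {"B1": 0.02, "B2": 0.03, "B3": 0.02}

# Total probability that a randomly selected product is defective
p_defective = sum(prior[m] * p_def[m] for m in prior)        # 0.024

# Bayes' theorem: P(B3 | defective)
p_b3_given_def = prior["B3"] * p_def["B3"] / p_defective     # about 0.208
print(p_defective, p_b3_given_def)
```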
A factory produces an article by mass-production methods. From past experience it is
found that, on average, 25 articles are rejected out of every batch of 100. Find the
mean and the variance of the number of rejected articles.
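Assuming the number of rejected articles per batch follows a binomial distribution with n = 100 and p = 25/100 = 0.25:
E[X] = np = 100 × 0.25 = 25
Var(X) = np(1 − p) = 100 × 0.25 × 0.75 = 18.75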
Ten college students were given a test in Statistics and their scores were recorded. They
were then given a month's special coaching, after which a second test in the same subject
was given to them. Test whether the marks given below provide evidence that the students
benefited from the coaching.
Marks in Test I:  75 70 60 75 80 85 68 75 65 58
Marks in Test II: 68 70 55 73 75 78 80 92 64 55
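A minimal paired t-test sketch (in Python, assuming SciPy ≥ 1.6 for the `alternative` argument):

```python
from scipy import stats

test1 = [75, 70, 60, 75, 80, 85, 68, 75, 65, 58]
test2 = [68, 70, 55, 73, 75, 78, 80, 92, 64, 55]

# Paired (dependent) t-test; H1: mean of Test II > mean of Test I,
# i.e. the students benefited from the coaching
t_stat, p_value = stats.ttest_rel(test2, test1, alternative="greater")
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.3f}")
# A large p-value (> 0.05) would mean the data do not show a significant benefit
```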
Explain with illustration sample space, event, exhaustive events, and mutually exclusive
events
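A minimal illustration using a single roll of a die (a hypothetical example expressed with Python sets):

```python
# Sample space for one roll of a fair die
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space
even = {2, 4, 6}
odd = {1, 3, 5}

# Mutually exclusive: the two events cannot occur together (empty intersection)
print("Mutually exclusive:", even & odd == set())   # True

# Exhaustive: together the events cover the whole sample space
print("Exhaustive:", even | odd == sample_space)    # True
```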
What Is Analysis of Variance (ANOVA)?
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate
variability found inside a data set into two parts: systematic factors and random factors. The systematic
factors have a statistical influence on the given data set, while the random factors do not. Analysts use
the ANOVA test to determine the influence that independent variables have on the dependent variable in
a regression study.
The ANOVA test is the initial step in analyzing the factors that affect a given data set. Once the test is
finished, an analyst performs additional testing on the systematic factors that measurably contribute to
the data set's variability. The analyst uses the ANOVA results in an F-test to generate additional
information consistent with the proposed regression models.
The ANOVA test allows a comparison of more than two groups at the same time to determine whether a
relationship exists between them. The result of the ANOVA formula, the F statistic (also called the
F-ratio), allows for the analysis of multiple groups of data to determine the variability between samples and
within samples.
If no real difference exists between the tested groups, which is called the null hypothesis, the result of
the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible values of the F statistic is
the F-distribution. This is actually a group of distribution functions, with two characteristic numbers,
called the numerator degrees of freedom and the denominator degrees of freedom.
A researcher might, for example, test students from multiple colleges to see if students from one of the
colleges consistently outperform students from the other colleges. In a business application, an R&D
researcher might test two different processes of creating a product to see if one process is better than
the other in terms of cost efficiency.
The type of ANOVA test used depends on a number of factors. It is applied when the data are
experimental. Analysis of variance can also be computed by hand when statistical software is not
available; it is simple to use and best suited for small samples. In many experimental designs, the
sample sizes must be the same for the various factor-level combinations.
ANOVA is helpful for testing three or more groups. It is similar to running multiple two-sample t-tests,
but it results in fewer Type I errors and is appropriate for a wide range of problems. ANOVA compares
the means of each group and partitions the observed variance into its different sources. It is used with
subjects, test groups, and between-group and within-group comparisons.
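A minimal one-way ANOVA sketch (in Python with SciPy; the three groups of scores are hypothetical):

```python
from scipy import stats

# Hypothetical exam scores for students from three colleges
college_a = [82, 75, 90, 68, 77]
college_b = [71, 65, 80, 60, 74]
college_c = [88, 92, 79, 85, 90]

# One-way ANOVA: F-ratio of between-group to within-group variability
f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# An F-ratio close to 1 (and a large p-value) would support the null
# hypothesis of no real difference between the group means
```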
Introduction to Design of Experiments
(DOE)
If you look at many industries today you see similar products being offered by multiple
manufacturers. Many companies today are frequently re-designing their products in an
attempt to make their product stand out from the crowd. In addition, a great number of
manufacturers are constantly developing new products to gain a foothold in other
markets. With new products come new or changed processes. Every time we change a
design or process we introduce new content. The amount of new content can be
equated to the level of risk in the design or process. Product validation testing and
prototype production runs are effective, but costly and in many cases problems are
detected late in the development process. Engineers must use various analysis tools
and statistical methods to reduce risk in a design or process. They must evaluate every
change and how it could affect the process output. If you have multiple changes
occurring at one time you could be multiplying your risk. So what can be done to predict
how a set of changes will likely affect the process output? Design of Experiments
(DOE) is a statistical tool available to engineers that can be used to evaluate single
changes or multiple changes to a process at once and predict the resulting change to
the output of the process.
Multiple factors and their interactions can often be studied together in a single or “fractional factorial” experiment. By properly utilizing DOE methodology, the
number of trial builds or test runs can be greatly reduced. A robust Design of
Experiments can save project time and uncover hidden issues in the process. The
hidden issues are generally associated with the interactions of the various factors. In
the end, teams will be able to identify which factors impact the process the most and
which ones have the least influence on the process output.
The Test
• The test matrix should include all variables and identify all possible combinations for
each of the controllable input factors of the process
o The number of variables is up to the experimenter
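As a sketch of how such a test matrix can be enumerated (in Python; the factors and levels are hypothetical, and a full-factorial design lists every combination):

```python
from itertools import product

# Hypothetical controllable input factors and their levels
factors = {
    "temperature": [180, 200],    # deg C
    "pressure": [5, 10],          # bar
    "cycle_time": [30, 45, 60],   # seconds
}

# Full-factorial test matrix: every combination of factor levels
test_matrix = list(product(*factors.values()))
for run, combo in enumerate(test_matrix, start=1):
    print(run, dict(zip(factors.keys(), combo)))
print(f"Total runs: {len(test_matrix)}")  # 2 x 2 x 3 = 12
```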
A process engineer is investigating the effect of process operating temperature X on product yield Y.
The study results in the following data:
X: 100 110 120 130 140 150 160 170 180 190
Y:  45  51  54  61  66  70  74  78  85  89
Find the equation of the least-squares line that will enable yield to be predicted from temperature.
Also find the degree of relationship between temperature and yield.
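A minimal sketch of the fit (in Python with NumPy): the least-squares line, and the Pearson correlation coefficient as the degree of relationship.

```python
import numpy as np

x = np.array([100, 110, 120, 130, 140, 150, 160, 170, 180, 190], dtype=float)
y = np.array([45, 51, 54, 61, 66, 70, 74, 78, 85, 89], dtype=float)

# Least-squares line: y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
print(f"y_hat = {b0:.3f} + {b1:.4f} x")

# Degree of relationship: Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")
```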
Investigate the association between the darkness of eye colour in father and son from
the following data:
                          Colour of son’s eyes
Colour of father’s eyes   Dark    Not Dark   Total
Dark                        58        90       148
Not Dark                    80       782       862
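A minimal chi-square test of independence sketch (in Python with SciPy):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: father's eye colour (Dark, Not Dark)
# Columns: son's eye colour (Dark, Not Dark)
observed = np.array([[58, 90],
                     [80, 782]])

# Note: SciPy applies Yates' continuity correction to 2x2 tables by default;
# pass correction=False for the uncorrected chi-square statistic
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
# A small p-value (< 0.05) would indicate an association between
# father's and son's eye colour
```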
Present a brief overview of the SPSS package along with its features.
1. Data Management:
• SPSS provides tools for data entry, data cleaning, and manipulation.
• It supports various data types, including numeric, string, and date formats.
2. Descriptive Statistics:
• Generates descriptive statistics like mean, median, mode, standard
deviation, and more.
• Frequency distribution and cross-tabulation features help in summarizing
and exploring data.
3. Inferential Statistics:
• Performs a wide range of inferential statistical analyses such as t-tests,
ANOVA, regression analysis, chi-square tests, and non-parametric tests.
• Allows users to test hypotheses and make inferences about populations
based on sample data.
4. Graphics and Visualization:
• Offers a variety of charts and graphs for visualizing data, including bar
charts, histograms, scatterplots, and box plots.
• Customization options for enhancing the visual representation of results.
5. Data Transformation:
• Supports variable transformations, recoding, and the creation of derived
variables.
• Handles missing data and allows imputation methods.
6. Advanced Analytics:
• Includes advanced statistical techniques such as factor analysis, cluster
analysis, and discriminant analysis.
• Supports complex survey data analysis and bootstrapping.
7. Syntax and Programming:
• SPSS allows users to work with a graphical user interface (GUI) for point-and-click operations.
• Additionally, users can utilize syntax language for scripting and
automating analyses, enhancing reproducibility.
8. Data Export and Integration:
• Enables the import and export of data in various formats, facilitating
integration with other statistical software and data sources.
9. Reporting:
• SPSS allows the creation of professional-looking reports with tables and
charts.
• Integration with other software like Microsoft Excel and Word for seamless
reporting.
10. Extensibility:
• Supports the integration of external extensions and plugins to enhance
functionality.
• Customizable and extensible through its Python and R integration.
SPSS is widely used in academia, market research, healthcare, and various industries
where data analysis is crucial for decision-making. Its user-friendly interface, broad
range of statistical tools, and versatility make it a popular choice for researchers and
analysts conducting statistical analyses on diverse datasets.
Conditions for the Central Limit Theorem (CLT):
1. Random Sampling: The observations must be obtained through a
random process, meaning each observation is independent of the
others.
2. Independence: Each observation must be independent of the others.
The outcome of one observation should not affect the outcome of
another.
3. Sample Size: The sample size should be sufficiently large. While there
is no strict rule about what constitutes "large," a sample size of 30 or
more is often considered adequate for the CLT to apply. However,
smaller sample sizes can also result in approximately normal
distributions if the underlying population is itself normally distributed.
4. Population Distribution: The original population distribution does not
need to be normal. The CLT works for a wide range of distributions, as
long as the sample size is large enough.
The consequences of the Central Limit Theorem are significant for statistical
inference. Regardless of the shape of the original population distribution, the
distribution of the sample mean will tend to be approximately normal. This
normal distribution becomes particularly important when making inferences
about population parameters using statistical tests and confidence intervals.
The Central Limit Theorem forms the basis for many statistical techniques,
such as hypothesis testing and confidence interval estimation. It allows
statisticians to make valid inferences about a population based on the
properties of the sample mean distribution, even when the underlying
population distribution may be unknown or not normal.
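A minimal simulation sketch of the CLT (in Python with NumPy; the exponential population and sample size are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal population: exponential with mean 2 (so sigma = 2)
population_mean, n, n_samples = 2.0, 40, 10_000

# Draw many samples of size n and record each sample mean
sample_means = rng.exponential(scale=population_mean,
                               size=(n_samples, n)).mean(axis=1)

# By the CLT the sample means should be approximately normal with
# mean close to 2 and standard deviation close to sigma / sqrt(n) = 0.316
print("Mean of sample means:", sample_means.mean())
print("Std of sample means: ", sample_means.std(ddof=1))
```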