4.1.1 Input Modeling
Contents
• Data Collection
• Identifying the Distribution with Data
• Parameter Estimation
• Goodness-of-Fit Tests
• Fitting a Nonstationary Poisson Process
• Selecting Input Models without Data
• Multivariate and Time-Series Input Data
Purpose & Overview
• Input models provide the driving force for a simulation model.
• The quality of the output is no better than the quality of inputs.
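• In this chapter, we will discuss the 4 steps of input model
development:
(1) Collecting the raw data
(2) Identifying the underlying statistical distribution
(3) Estimating the parameters
(4) Testing for goodness of fit
[Figure: the system yields raw data, which become input data for the simulation; the simulation produces output (performance) data.]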
Histograms: Example
• Vehicle Arrival Example: the number of vehicles arriving at an
intersection between 7:00 am and 7:05 am was recorded

Arrivals per Period    Frequency
0                      12
1                      10
2                      19
3                      17
…                      …

[Figure: histogram of the arrival frequencies.]
Histograms: Example
• Life tests were performed on electronic components at 1.5
times the nominal voltage, and their lifetime was
recorded
• Sample size 10000
• Histograms with different numbers of bins
[Figure: six histograms of the same sample drawn with different numbers of bins; x-axis from −6 to 6, y-axis: Frequency.]
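The effect of the number of bins can be tried out with a short sketch. It assumes a normally distributed sample of size 10000 (the slides state only the sample size, so the distribution and its parameters are assumptions), and `histogram_counts` is a hypothetical helper, not part of any library.

```python
import random


def histogram_counts(data, num_bins, low, high):
    """Count observations in num_bins equal-width bins spanning [low, high]."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for x in data:
        if low <= x <= high:
            # Clamp the index so that x == high falls into the last bin.
            index = min(int((x - low) / width), num_bins - 1)
            counts[index] += 1
    return counts


random.seed(1)  # fixed seed so the sketch is reproducible
# Assumption: N(0, 1.5^2) data; only the sample size comes from the slides.
sample = [random.gauss(0.0, 1.5) for _ in range(10_000)]

for num_bins in (6, 12, 48):
    counts = histogram_counts(sample, num_bins, -6.0, 6.0)
    print(num_bins, "bins, tallest bar:", max(counts))
```

With few bins the shape is coarse; with many bins each bar holds fewer observations and the outline becomes ragged. Observations outside [low, high] are ignored in this sketch.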
[Figure: 3D histogram of the number of people by number of calls and number of movements.]
• Groups with different communication
• Interesting characteristic
• Number of people with an odd number of movements is
negligible!
[Figure: histogram; x-axis: Number of Movements (−1 to 19), y-axis: Number of People.]
Identifying the Distribution
Scatter Diagrams
• A scatter diagram is a quality tool that can show the
relationship between paired data
• Random Variable X = Data 1
• Random Variable Y = Data 2
• Plot random variable X on the x-axis and Y on the y-axis
[Figure: three scatter diagrams illustrating strong correlation, moderate correlation, and no correlation.]
Scatter Diagrams
• Linear relationship
• Correlation: Measures how well data line up
• Slope: Measures the steepness of the data
• Direction
• Y intercept
[Figure: two scatter diagrams illustrating linear relationships; x-axes from 0 to 35.]
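The four characteristics listed above can be computed from paired data. The following is a minimal sketch using the least-squares line and the Pearson correlation coefficient; the function name `linear_summary` and the data values are illustrative assumptions, not taken from the slides.

```python
import math


def linear_summary(xs, ys):
    """Return (correlation, slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                        # steepness of the fitted line
    intercept = mean_y - slope * mean_x      # value of the line at x = 0
    correlation = sxy / math.sqrt(sxx * syy) # how well the data line up
    return correlation, slope, intercept


# Hypothetical paired observations (Data 1, Data 2).
xs = [0, 10, 20, 30, 40]
ys = [2, 9, 17, 22, 33]
r, slope, intercept = linear_summary(xs, ys)
direction = "positive" if slope > 0 else "negative"
print(f"r={r:.3f} slope={slope:.3f} intercept={intercept:.3f} direction={direction}")
```

The direction is simply the sign of the slope, which always matches the sign of the correlation.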
Selecting the Family of Distributions
• A family of distributions is selected based on:
• The context of the input variable
• Shape of the histogram
• Frequently encountered distributions:
• Easier to analyze: Exponential, Normal, and Poisson
• Is it bounded?
• Value range?
• Only positive values
• Only negative values
• Interval of [-a:b]
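Quantile-Quantile Plots
• The q-quantile of a random variable X with CDF F is the value $\gamma$ such that
$$F(\gamma) = P(X \le \gamma) = q, \quad \text{for } 0 < q < 1$$
• When F has an inverse, $\gamma = F^{-1}(q)$
• Let $\{x_i,\ i = 1, 2, \ldots, n\}$ be a sample of data from X
and $\{y_j,\ j = 1, 2, \ldots, n\}$ be this sample in ascending order:
$$y_j \text{ is approximately } F^{-1}\!\left(\frac{j - 0.5}{n}\right)$$
• where j is the ranking or order number
[Figure: Q-Q plot of $y_j$ against $F^{-1}\left(\frac{j - 0.5}{n}\right)$.]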
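The points of a Q-Q plot follow directly from this definition. The sketch below pairs each ordered observation y_j with the theoretical quantile at (j − 0.5)/n, using a normal distribution fitted by the sample mean and standard deviation. The function name `qq_points` and the data values are illustrative assumptions, not taken from the slides.

```python
from statistics import NormalDist, mean, stdev


def qq_points(data, inverse_cdf):
    """Pair the theoretical quantile F^-1((j - 0.5) / n) with each ordered y_j."""
    ordered = sorted(data)
    n = len(ordered)
    return [
        (inverse_cdf((j - 0.5) / n), y)
        for j, y in enumerate(ordered, start=1)
    ]


# Hypothetical measurements, used only to exercise the function.
data = [99.79, 100.26, 100.23, 99.55, 99.96, 99.56, 100.41, 100.27,
        99.62, 99.90, 100.17, 100.02, 99.98, 100.33, 99.83, 100.47]

fitted = NormalDist(mean(data), stdev(data))  # candidate distribution
for theoretical, observed in qq_points(data, fitted.inv_cdf):
    print(f"{theoretical:8.3f} {observed:8.2f}")
```

If the candidate family is appropriate, the printed pairs lie close to a straight line when plotted against each other.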
Quantile-Quantile Plots: Example
• Straight line, supporting the hypothesis of a normal distribution
[Figure: Q-Q plot; both axes range from about 99.2 to 100.8.]
• Superimposed density function of the distribution
[Figure: density function superimposed on the data; x-axis from 99.4 to 100.6.]
Quantile-Quantile Plots
• Consider the following when evaluating the linearity of a
Q-Q plot:
• The observed values never fall exactly on a straight line
• The ordered values are ranked and hence not independent, so the
points are unlikely to be scattered randomly about the line
• The variance of the extremes is higher than that of the middle, so
linearity of the points in the middle of the plot is more important
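Parameter Estimation
• Sample variance of the arrivals-per-period data, with $n = 100$, $\bar{X} = 3.64$ and a sum of squared observations of 2080:
$$S^2 = \frac{2080 - 100\,(3.64)^2}{99} = 7.63$$
[Figure: histogram; x-axis: Number of Arrivals per Period (0 to 11), y-axis: Frequency.]
• Suggested estimators:

Distribution    Parameters         Suggested estimators
Gamma           $\beta, \theta$    $\hat{\theta} = 1/\bar{X}$
Normal          $\mu, \sigma^2$    $\hat{\mu} = \bar{X},\ \hat{\sigma}^2 = S^2$
Lognormal       $\mu, \sigma^2$    $\hat{\mu} = \bar{X},\ \hat{\sigma}^2 = S^2$ (after taking ln of the data)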
Parameter Estimation
• Maximum-likelihood example: exponential distribution
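The derivation for this example can be sketched as follows. It assumes the rate parameterization $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ and an independent sample $x_1, \ldots, x_n$; the symbol $\lambda$ is an assumption, since the slide does not name the parameter.

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
            = \lambda^{n} \exp\!\Big(-\lambda \sum_{i=1}^{n} x_i\Big) \\
\ln L(\lambda) &= n \ln \lambda - \lambda \sum_{i=1}^{n} x_i \\
\frac{d \ln L(\lambda)}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{X}} \\
\frac{d^{2} \ln L(\lambda)}{d\lambda^{2}} &= -\frac{n}{\lambda^{2}} < 0
\end{align*}
```

The negative second derivative confirms that the stationary point is a maximum, so the estimated rate is the reciprocal of the sample mean.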
Goodness-of-Fit Tests
• Conduct hypothesis testing on input data distribution
using
• Kolmogorov-Smirnov test
• Chi-square test
Decision     H0 is true                                              H0 is false
Accept H0    Correct                                                 Incorrectly accept H0 (Type II error, false negative)
Reject H0    Incorrectly reject H0 (Type I error, false positive)    Correct
Chi-Square Test
• Intuition: comparing the histogram of the data to the shape of
the candidate density or mass function
• Valid for large sample sizes when parameters are estimated by
maximum-likelihood
• Arrange the n observations into a set of k class intervals
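• The test statistic is:
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
• $O_i$ is the observed frequency in the i-th class
• $E_i = n \times p_i$ is the expected frequency, where $p_i$ is the theoretical
prob. of the i-th interval. Suggested minimum for $E_i$ = 5
• Test result, where s is the number of parameters estimated from the data:
$$\text{Accept } H_0 \text{ if } \chi_0^2 \le \chi^2_{\alpha,\,k-s-1}, \qquad \text{Reject } H_0 \text{ if } \chi_0^2 > \chi^2_{\alpha,\,k-s-1}$$
• If the distribution tested is discrete:
$$p_i = p(x_i) = P(X = x_i)$$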
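The test procedure can be sketched in a few lines. The function names, the observed counts and the choice of a uniform candidate distribution are all hypothetical; the critical value is passed in as a parameter because it has to be looked up in a chi-square table (11.07 is the tabulated value for α = 0.05 and 5 degrees of freedom).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all class intervals."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_decision(observed, probabilities, critical_value):
    """Return (statistic, reject_h0) for the given class probabilities."""
    n = sum(observed)
    expected = [n * p for p in probabilities]  # E_i = n * p_i
    statistic = chi_square_statistic(observed, expected)
    return statistic, statistic > critical_value


# Hypothetical counts for k = 6 classes, tested against equal probabilities.
observed = [8, 12, 9, 11, 14, 6]
probabilities = [1 / 6] * 6
# No parameters were estimated (s = 0), so the degrees of freedom are 6 - 0 - 1 = 5.
statistic, reject = chi_square_decision(observed, probabilities, critical_value=11.07)
print(f"chi-square statistic = {statistic:.2f}, reject H0: {reject}")
```

Every expected frequency here is 10, which satisfies the suggested minimum of 5; classes with smaller expected counts would normally be merged before the test.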
Chi-Square Test
• If the distribution tested is continuous:
$$p_i = \int_{a_{i-1}}^{a_i} f(x)\,dx = F(a_i) - F(a_{i-1})$$
• where $a_{i-1}$ and $a_i$ are the endpoints of the i-th class interval
• f(x) is the assumed PDF, F(x) is the assumed CDF
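For a continuous candidate, the class probabilities are differences of the CDF at consecutive endpoints. The sketch below uses an exponential CDF as the assumed F; the rate, the endpoints and the function names are illustrative assumptions.

```python
import math


def exponential_cdf(x, rate):
    """CDF of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def class_probabilities(endpoints, cdf):
    """p_i = F(a_i) - F(a_{i-1}) for consecutive endpoints a_0 < a_1 < ... < a_k."""
    return [cdf(b) - cdf(a) for a, b in zip(endpoints, endpoints[1:])]


rate = 0.5                                   # assumed parameter value
endpoints = [0.0, 1.0, 2.0, 4.0, math.inf]   # k = 4 class intervals
probs = class_probabilities(endpoints, lambda x: exponential_cdf(x, rate))

n = 100  # hypothetical sample size
expected = [n * p for p in probs]            # E_i = n * p_i
print([round(p, 4) for p in probs], [round(e, 1) for e in expected])
```

Because the endpoints cover the whole support, the probabilities sum to 1 and the expected frequencies sum to n.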
• Recommended number of class intervals (k):
Sample size (n)    Number of class intervals (k)
20                 Do not use the chi-square test
50                 5 to 10
100                10 to 20
> 100              $\sqrt{n}$ to $n/5$
[Figure: bar chart over the class intervals 1000 ≤ X ≤ 2000, 2000 < X ≤ 2500, 2500 < X ≤ 4500 and 4500 < X ≤ 5000.]
Multivariate and Time-Series Input Models
• The random variables discussed until now were considered to be
independent of any other variables within the context of the
problem
• However, variables may be related
• If they appear as input, the relationship should be investigated and
taken into consideration
• Multivariate input models
• Fixed, finite number of random variables X1, X2, …, Xk
• For example, lead time and annual demand for an inventory model
• An increase in demand results in an increase in lead time, hence the
variables are dependent.
• Time-series input models
• Infinite sequence of random variables, e.g., X1, X2, X3, …
• For example, time between arrivals of orders to buy and sell stocks
• Buy and sell orders tend to arrive in bursts, hence, times between
arrivals are dependent.
Time-Series
• A time series is a sequence of random variables X1, X2, X3,…
which are identically distributed (same mean and
variance) but dependent.
• $\operatorname{cov}(X_t, X_{t+h})$ is the lag-h autocovariance
• $\operatorname{corr}(X_t, X_{t+h})$ is the lag-h autocorrelation
• If the autocovariance value depends only on h and not on t,
the time series is covariance stationary
• For a covariance-stationary time series, the following shorthand for the
lag-h autocorrelation is used:
$$\rho_h = \operatorname{corr}(X_t, X_{t+h})$$
• Notice
• The autocorrelation measures the dependence between random
variables that are separated by h−1 others in the time
series
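A sample version of the lag-h autocorrelation can be sketched as follows. The estimator divides the lag-h cross products by the number of available pairs and the variance by n − 1; this is one of several conventions in use and is an assumption here, as are the function name and the test series.

```python
def sample_autocorrelation(series, lag):
    """Estimate the lag-h autocorrelation of a covariance-stationary series."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / (n - 1)
    # Average the products over the n - lag pairs (X_t, X_{t+lag}).
    autocovariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    ) / (n - lag)
    return autocovariance / variance


# Hypothetical series that alternates around zero.
alternating = [1, -1, 1, -1, 1, -1]
print(sample_autocorrelation(alternating, 1))  # negative: neighbours disagree
print(sample_autocorrelation(alternating, 2))  # positive: every second value agrees
```

For short series the estimate is noisy, so only small lags relative to the series length are usually inspected.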
Multivariate Input Models
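• If $X_1$ and $X_2$ are normally distributed, dependence between them
can be modeled by the bivariate normal distribution with $\mu_1$, $\mu_2$,
$\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$
• To estimate $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, see “Parameter Estimation”
• To estimate $\rho$, suppose we have n independent and identically
distributed pairs $(X_{11}, X_{21}), (X_{12}, X_{22}), \ldots, (X_{1n}, X_{2n})$; then
$$\widehat{\operatorname{cov}}(X_1, X_2) = \frac{1}{n-1} \sum_{j=1}^{n} (X_{1j} - \bar{X}_1)(X_{2j} - \bar{X}_2)$$
$$\hat{\rho} = \frac{\widehat{\operatorname{cov}}(X_1, X_2)}{\hat{\sigma}_1 \hat{\sigma}_2}$$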
Multivariate Input Models: Example
• Let $X_1$ be the average lead time to deliver and $X_2$ the annual
demand for a product.
• Data for 10 years is available.

Lead Time (X1)    Demand (X2)
6.5               103
4.3               83
6.9               116
6.0               97
6.9               112
…                 …

$\bar{X}_1 = 6.14$, $\hat{\sigma}_1 = 1.02$
$\bar{X}_2 = 101.8$, $\hat{\sigma}_2 = 9.93$
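The correlation estimate for such paired data can be sketched as below. Only the five rows that are legible in the table are used, so the results will not match the ten-year statistics quoted on the slide; the function names are hypothetical.

```python
from statistics import stdev


def sample_covariance(xs, ys):
    """Unbiased sample covariance of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))


# The five legible rows of the table (a partial sample, not the full 10 years).
lead_time = [6.5, 4.3, 6.9, 6.0, 6.9]
demand = [103, 83, 116, 97, 112]

cov_hat = sample_covariance(lead_time, demand)
rho_hat = sample_correlation(lead_time, demand)
print(f"covariance = {cov_hat:.2f}, correlation = {rho_hat:.2f}")
```

A correlation estimate close to 1 is consistent with the statement that higher demand goes with longer lead times.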
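• To estimate $\phi$, $\mu$ and $\sigma_\varepsilon^2$ of the AR(1) time-series model $X_t = \mu + \phi (X_{t-1} - \mu) + \varepsilon_t$:
$$\hat{\mu} = \bar{X}, \qquad \hat{\sigma}_\varepsilon^2 = \hat{\sigma}^2 (1 - \hat{\phi}^2), \qquad \hat{\phi} = \frac{\widehat{\operatorname{cov}}(X_t, X_{t+1})}{\hat{\sigma}^2}$$
where $\widehat{\operatorname{cov}}(X_t, X_{t+1})$ is the lag-1 autocovariance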
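These estimators can be checked on simulated data. The sketch assumes an AR(1) model of the form X_t = μ + φ(X_{t−1} − μ) + ε_t with normally distributed noise; the function names and parameter values are hypothetical.

```python
import math
import random


def fit_ar1(series):
    """Estimate (mu, phi, noise variance) of an AR(1) model from one series."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / (n - 1)
    # Lag-1 autocovariance over the n - 1 consecutive pairs.
    lag1_cov = sum(
        (series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1)
    ) / (n - 1)
    phi = lag1_cov / variance
    noise_variance = variance * (1.0 - phi ** 2)
    return mean, phi, noise_variance


def simulate_ar1(mu, phi, noise_sd, n, seed=0):
    """Generate n values of X_t = mu + phi * (X_{t-1} - mu) + eps_t."""
    rng = random.Random(seed)
    # Start from the stationary distribution so that no warm-up is needed.
    x = rng.gauss(mu, noise_sd / math.sqrt(1.0 - phi ** 2))
    series = [x]
    for _ in range(n - 1):
        x = mu + phi * (x - mu) + rng.gauss(0.0, noise_sd)
        series.append(x)
    return series


series = simulate_ar1(mu=10.0, phi=0.7, noise_sd=1.0, n=20_000)
mean_hat, phi_hat, noise_var_hat = fit_ar1(series)
print(f"mu = {mean_hat:.2f}, phi = {phi_hat:.2f}, noise variance = {noise_var_hat:.2f}")
```

With a long series the estimates should land near the values used for the simulation, which gives a quick sanity check of the formulas.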
Summary
• In this chapter, we described the 4 steps in developing input
data models:
(1) Collecting the raw data
(2) Identifying the underlying statistical distribution
(3) Estimating the parameters
(4) Testing for goodness of fit