Chapter 10. Verification, Calibration, and Validation of Simulation Models
One of the most important and difficult tasks facing a model developer is the verification and validation
of the simulation model. The engineers and analysts who use the model outputs to aid in making
design recommendations, and the managers who make decisions based on these recommendations,
justifiably look upon a model with some degree of skepticism about its validity. It is the job of
the model developer to work closely with the end users throughout the period of development and
validation to reduce this skepticism and to increase the model’s credibility.
The goal of the validation process is twofold: (a) to produce a model that represents true system
behavior closely enough for the model to be used as a substitute for the actual system for the purpose
of experimenting with the system, analyzing system behavior, and predicting system performance;
and (b) to increase the credibility of the model to an acceptable level, so that the model will be used
by managers and other decision makers.
Validation should not be seen as an isolated set of procedures that follows model development,
but rather as an integral part of model development. Conceptually, however, the verification and
validation process consists of the following components:
1. Verification is concerned with building the model correctly. It proceeds by the comparison
of the conceptual model to the computer representation that implements that conception. It
asks the questions: Is the model implemented correctly in the simulation software? Are the
input parameters and logical structure of the model represented correctly?
2. Validation is concerned with building the correct model. It attempts to confirm that a model
is an accurate representation of the real system. Validation is usually achieved through the
calibration of the model, an iterative process of comparing the model to actual system behavior
and using the discrepancies between the two, and the insights gained, to improve the model.
This process is repeated until model accuracy is judged to be acceptable.
This chapter describes methods that have been recommended and used in the verification and
validation process. Most of the methods are informal subjective comparisons; a few are formal
statistical procedures. The use of the latter procedures involves issues related to output analysis, the
subject of Chapters 11 and 12. Output analysis refers to analysis of the data produced by a simulation
and to drawing inferences from these data about the behavior of the real system. To summarize their
relationship, validation is the process by which model users gain confidence that output analysis is
making valid inferences about the real system under study.
Many articles and chapters in texts have been written on verification and validation. For discus-
sion of the main issues, the reader is referred to Balci [1994, 1998, 2003], Carson [1986, 2002], Gass
[1983], Kleijnen [1995], Law and Kelton [2000], Naylor and Finger [1967], Oren [1981], Sargent
[2003], Shannon [1975], and van Horn [1969, 1971]. For statistical techniques relevant to various
aspects of validation, the reader can obtain the foregoing references plus those by Balci and Sargent
[1982a,b; 1984a], Kleijnen [1987], and Schruben [1980]. For case studies in which validation is
emphasized, the reader is referred to Carson et al. [1981a,b], Gafarian and Walsh [1970], Kleijnen
[1993], and Shechter and Lucas [1980]. Bibliographies on validation have been published by Balci
and Sargent [1984b] and by Youngblood [1993].
10.1 Model Building, Verification, and Validation

The first step in model building consists of observing the real system and the interactions among its
various components and of collecting data on their behavior. But observation alone seldom yields
sufficient understanding of system behavior. People familiar with the system, or any subsystem,
should be questioned, to take advantage of their special knowledge. Operators, technicians, repair
and maintenance personnel, engineers, supervisors, and managers understand certain aspects of the
system that might be unfamiliar to others. As model development proceeds, new questions may arise,
and the model developers will return to this step to learn more about system structure and behavior.
The second step in model building is the construction of a conceptual model—a collection of
assumptions about the components and the structure of the system, plus hypotheses about the values
of model input parameters. As is illustrated by Figure 10.1, conceptual validation is the comparison
of the real system to the conceptual model.
The third step is the implementation of an operational model, usually by using simulation
software, and incorporating the assumptions of the conceptual model into the worldview and concepts
of the simulation software.
Figure 10.1  Model building, verification, and validation. The figure relates the real system, the conceptual model (1. assumptions on system components; 2. structural assumptions, which define the interactions between system components; 3. input parameters and data assumptions), and the operational model (the computerized representation): calibration and validation compare the real system to the models, conceptual validation compares the real system to the conceptual model, and model verification compares the conceptual model to the operational model.
In actuality, model building is not a linear process with three steps. Instead, the model builder
will return to each of these steps many times while building, verifying, and validating the model.
Figure 10.1 depicts the ongoing model building process, in which the need for verification and
validation causes continual comparison of the real system to both the conceptual model and the
operational model, and induces repeated modification of the model to improve its accuracy.
10.2 Verification of Simulation Models

The purpose of model verification is to assure that the conceptual model is reflected accurately
in the operational model. The conceptual model quite often involves some degree of abstraction
about system operations or some amount of simplification of actual operations. Verification asks the
following question: Is the conceptual model (assumptions about system components and system struc-
ture, parameter values, abstractions, and simplifications) accurately represented by the operational
model?
Many common-sense suggestions can be used in the verification process:
1. Have the operational model checked by someone other than its developer, preferably an
expert in the simulation software being used.
2. Make a flow diagram that includes each logically possible action a system can take when an
event occurs, and follow the model logic for each action for each event type. (An example of
a logic flow diagram is given in Figures 2.4 and 2.5 for the model of a single-server queue.)
3. Closely examine the model output for reasonableness under a variety of settings of the input
parameters. Have the implemented model display a wide variety of output statistics, and
examine all of them closely.
4. Have the operational model print the input parameters at the end of the simulation, to be sure
that these parameter values have not been changed inadvertently.
5. Make the operational model as self-documenting as possible. Give a precise definition of
every variable used and a general description of the purpose of each submodel, procedure
(or major section of code), component, or other model subdivision.
6. If the operational model is animated, verify that what is seen in the animation imitates the
actual system. Examples of errors that can be observed through animation are automated
guided vehicles (AGVs) that pass through one another on a unidirectional path or at an
intersection and entities that disappear (unintentionally) during a simulation.
7. The Interactive Run Controller (IRC) or debugger is an essential component of successful
simulation model building. Even the best of simulation analysts makes mistakes or commits
logical errors when building a model. The IRC assists in finding and correcting those errors
in the following ways:
(a) The simulation can be monitored as it progresses. This can be accomplished by advancing
the simulation until a desired time has elapsed, then displaying model information at that
time. Another possibility is to advance the simulation until a particular condition is in
effect, and then display information.
(b) Attention can be focused on a particular entity, line of code, or procedure. For instance,
every time that an entity enters a specified procedure, the simulation will pause so that
information can be gathered. As another example, every time that a specified entity
becomes active, the simulation will pause.
(c) Values of selected model components can be observed. When the simulation has paused,
the current value or status of variables, attributes, queues, resources, counters, and so on
can be observed.
(d) The simulation can be temporarily suspended, or paused, not only to view information,
but also to reassign values or redirect entities.
8. Graphical interfaces are recommended for accomplishing verification and validation [Bort-
scheller and Saulnier, 1992]. The graphical representation of the model is essentially a form
of self-documentation. It simplifies the task of understanding the model.
These suggestions are basically the same ones any software engineer would follow.
Among these common-sense suggestions, one that is very easily implemented, but quite often
overlooked, especially by students who are learning simulation, is a close and thorough examination
of model output for reasonableness (suggestion 3). For example, consider a model of a complex
network of queues consisting of many service centers in series and parallel configurations. Suppose
that the modeler is interested mainly in the response time, defined as the time required for a customer
to pass through a designated part of the network. During the verification (and calibration) phase
of model development, it is recommended that the program collect and print out many statistics in
addition to response times, such as utilizations of servers and time-average number of customers in
various subsystems. Examination of the utilization of a server, for example, might reveal that it is
unreasonably low (or high), a possible error that could be caused by wrong specification of mean
service time, or by a mistake in model logic that sends too few (or too many) customers to this
particular server, or by any number of other possible parameter misspecifications or errors in logic.
In a simulation language that automatically collects many standard statistics (average queue
lengths, average waiting times, etc.), it takes little or no extra programming effort to display almost
all statistics of interest. The effort required can be considerably greater in a general-purpose language
such as C, C++, or Java, which does not provide built-in statistics-gathering capabilities to aid the programmer.
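For illustration, the modeler working in a general-purpose language might write a small helper that accumulates a time-weighted average of a state variable, so that utilizations and time-average queue lengths can be printed at the end of every run. The sketch below is only an illustration of the idea; the class and variable names are not taken from any particular simulation package:

    class TimeAverage:
        """Accumulates the time-weighted average of a piecewise-constant variable."""
        def __init__(self, initial_value=0.0, initial_time=0.0):
            self.value = initial_value
            self.last_time = initial_time
            self.area = 0.0

        def update(self, new_value, clock):
            # Add the area under the curve since the variable last changed.
            self.area += self.value * (clock - self.last_time)
            self.value = new_value
            self.last_time = clock

        def mean(self, clock):
            # Time-average of the variable over [0, clock].
            total = self.area + self.value * (clock - self.last_time)
            return total / clock if clock > 0 else 0.0

In an event routine, the modeler would call update(n_in_queue, clock) whenever the queue length changes and print mean(clock) at the end of the run, alongside the other output statistics.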
Two sets of statistics that can give a quick indication of model reasonableness are current contents
and total count. These statistics apply to any system having items of some kind flowing through it,
whether these items be called customers, transactions, inventory, or vehicles. Current contents refers
to the number of items in each component of the system at a given time. Total count refers to the total
number of items that have entered each component of the system by a given time. In some simulation
software, these statistics are kept automatically and can be displayed at any point in simulation
time. In other simulation software, simple counters might have to be added to the operational model
and displayed at appropriate times. If the current contents in some portion of the system are high,
this condition indicates that a large number of entities are delayed. If the output is displayed for
successively longer simulation run times and the current contents tend to grow in a more or less
linear fashion, it is highly likely that a queue is unstable and that the server(s) will fall further behind
as time continues. This indicates possibly that the number of servers is too small or that a service
time is misspecified. (Unstable queues were discussed in Chapter 6.) On the other hand, if the total
count for some subsystem is zero, this indicates that no items entered that subsystem—again, a
highly suspect occurrence. Another possibility is that the current count and total count are equal to
one. This could indicate that an entity has captured a resource, but never freed that resource. Careful
evaluation of these statistics for various run lengths can aid in the detection of mistakes in model logic
and data misspecifications. Checking for output reasonableness will usually fail to detect the more
subtle errors, but it is one of the quickest ways to discover gross errors. To aid in error detection,
it is best for the model developer to forecast a reasonable range for the value of selected output
statistics before making a run of the model. Such a forecast reduces the possibility of rationalizing
a discrepancy and failing to investigate the cause of unusual output.
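As an illustration of this check, the modeler might attach two counters to each component of the operational model and print them at several run lengths; steadily growing current contents, a total count of zero, or a component stuck at current = total = 1 are then easy to spot. The names below are hypothetical:

    class ComponentCounters:
        """Current contents and total count for one component of the model."""
        def __init__(self, name):
            self.name = name
            self.current = 0   # items in the component right now
            self.total = 0     # items that have ever entered the component

        def enter(self):
            self.current += 1
            self.total += 1

        def leave(self):
            self.current -= 1

    def reasonableness_report(components, clock):
        # Print both statistics and flag the suspicious patterns described above.
        for c in components:
            flag = ""
            if c.total == 0:
                flag = "  <-- no items ever entered"
            elif c.current == 1 and c.total == 1:
                flag = "  <-- an entity may be holding a resource and never freeing it"
            print(f"t={clock:10.1f}  {c.name:20s}  current={c.current:6d}  total={c.total:8d}{flag}")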
For certain models, it is possible to consider more than whether a particular statistic is reason-
able. It is possible to compute certain long-run measures of performance. For example, as seen in
Chapter 6, the analyst can compute the long-run server utilization for a large number of queueing
systems without any special assumptions regarding interarrival or service-time distributions. Typi-
cally, the only information needed is the network configuration, plus arrival and service rates. Any
measure of performance that can be computed analytically and then compared to its simulated
counterpart provides another valuable tool for verification. Presumably, the objective of the simu-
lation is to estimate some measure of performance, such as mean response time, that cannot be
computed analytically; but, as illustrated by the formulas in Chapter 6 for a number of special queues
(M/M/1,M/G/1, etc.), all the measures of performance in a queueing system are interrelated. Thus,
if a simulation model is predicting one measure (such as utilization) correctly, then confidence in
the model’s predictive ability for other related measures (such as response time) is increased (even
though the exact relation between the two measures is, of course, unknown in general and varies
from model to model). Conversely, if a model incorrectly predicts utilization, its prediction of other
quantities, such as mean response time, is highly suspect.
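For instance, for a station with Poisson arrivals at rate λ, mean service time E(S), and c parallel servers, the long-run utilization is ρ = λE(S)/c regardless of the service-time distribution (Chapter 6), so the simulated utilization can be checked against this value. A sketch of such a check follows; the tolerance and the numbers in the example call are illustrative only:

    def analytic_utilization(arrival_rate, mean_service_time, servers):
        # Long-run server utilization rho = lambda * E(S) / c (meaningful when rho < 1).
        return arrival_rate * mean_service_time / servers

    def check_utilization(simulated_rho, arrival_rate, mean_service_time, servers, tol=0.05):
        rho = analytic_utilization(arrival_rate, mean_service_time, servers)
        if abs(simulated_rho - rho) > tol:
            print(f"WARNING: simulated utilization {simulated_rho:.3f} "
                  f"differs from the analytic value {rho:.3f}")
        return rho

    # Example: arrivals at 45 per hour, mean service time 1.1 minutes, one server
    # gives rho = 45 * (1.1 / 60) = 0.825.
    check_utilization(simulated_rho=0.86, arrival_rate=45.0,
                      mean_service_time=1.1 / 60.0, servers=1)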
Another important way to aid the verification process is the oft-neglected documentation phase.
If a model builder writes brief comments in the operational model, plus definitions of all variables
and parameters, plus descriptions of each major section of the operational model, it becomes much
simpler for someone else, or the model builder at a later date, to verify the model logic. Documentation
is also important as a means of clarifying the logic of a model and verifying its completeness.
A more sophisticated technique is the use of a trace. In general, a trace provides detailed, time-
stamped simulation output representing the values for a selected set of system states, entity attributes,
and model variables whenever an event occurs. That is, at the time an event occurs, the model would
write detailed status information to a file for later examination by the model developer to assist in
model verification and the detection of modeling errors.
Definition of variables:

    CLOCK  = simulation clock
    EVTYP  = event type (start, arrival, departure, or stop)
    NCUST  = number of customers in system at time given by CLOCK
    STATUS = status of server (1 = busy, 0 = idle)

Because a trace that covers an entire run can produce an overwhelming volume of output, a trace used
for verification is usually restricted to a very short period of time. It is desirable, of course, to ensure that each type of event
(such as Arrival) occurs at least once, so that its consequences and effect on the model can be checked
for accuracy. If an event is especially rare in occurrence, it may be necessary to use artificial data to
force it to occur during a simulation of short duration. This is legitimate, as the purpose is to verify
that the effect on the system of the rare event is as intended.
Some software allows a selective trace. For example, a trace could be set for specific locations
in the model or could be triggered to begin at a specified simulation time. Whenever an entity goes
through the designated locations, the simulation software writes a time-stamped message to a trace
file. Some simulation software allows for tracing a selected entity; any time the designated entity
becomes active, the trace is activated and time-stamped messages are written. This trace is very
useful in following one entity through the entire model. Another example of a selective trace is to
set it for the occurrence of a particular condition. For example, whenever the queue before a certain
resource reaches five or more, turn on the trace. This allows running the simulation until something
unusual occurs, then examining the behavior from that point forward in time. Different simulation
software packages support tracing to various extents. In practice, it is often implemented by the
model developer by adding printed messages at appropriate points into a model.
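In a general-purpose language, such a trace amounts to writing a time-stamped record of the selected state variables whenever an event occurs, optionally gated by a condition so that the trace is selective. A minimal sketch, using the variable names defined above (the file name and the triggering condition are illustrative):

    def write_trace(trace_file, clock, evtyp, ncust, status, condition=None):
        """Append one time-stamped trace record; if a condition function is given,
        write only when it evaluates to True (a selective trace)."""
        if condition is not None and not condition():
            return
        trace_file.write(f"CLOCK={clock:10.3f}  EVTYP={evtyp:10s}  "
                         f"NCUST={ncust:4d}  STATUS={status}\n")

    # Example: trace only while five or more customers are in the system.
    # with open("trace.txt", "w") as f:
    #     write_trace(f, clock, "ARRIVAL", ncust, status,
    #                 condition=lambda: ncust >= 5)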
Of the three classes of techniques—the common-sense techniques, thorough documentation,
and traces—it is recommended that the first two always be carried out. Close examination of model
output for reasonableness is especially valuable and informative. A generalized trace may provide
voluminous data, far more than can be used or examined carefully. A selective trace can provide
useful information on key model components and keep the amount of data to a manageable level.
10.3 Calibration and Validation of Models
Calibration and validation, although conceptually distinct, usually are conducted simultaneously by
the modeler. Validation is the overall process of comparing the model and its behavior to the real
system and its behavior. Calibration is the iterative process of comparing the model to the real system,
making adjustments (or even major changes) to the model, comparing the revised model to reality,
making additional adjustments, comparing again, and so on. Figure 10.3 shows the relationship of
model calibration to the overall validation process. The comparison of the model to reality is carried
out by a variety of tests—some subjective, others objective. Subjective tests usually involve people,
who are knowledgeable about one or more aspects of the system, making judgments about the model
and its output. Objective tests always require data on the system’s behavior, plus the corresponding
data produced by the model. Then one or more statistical tests are performed to compare some aspect
of the system data set with the same aspect of the model data set. This iterative process of comparing
model with system and then revising both the conceptual and operational models to accommodate
any perceived model deficiencies is continued until the model is judged to be sufficiently accurate.
A possible criticism of the calibration phase, were it to stop at this point, is that the model has
been validated only for the one data set used—that is, the model has been “fitted” to one data set.
One way to alleviate this criticism is to collect a new set of system data (or to reserve a portion of
the original system data) to be used at this final stage of validation. That is, after the model has been
calibrated by using the original system data set, a “final” validation is conducted, using the second
system data set. If unacceptable discrepancies between the model and the real system are discovered
in the “final” validation effort, the modeler must return to the calibration phase and modify the model
until it becomes acceptable.
Figure 10.3  Iterative process of calibrating a model: the model is compared to the real system, revised, compared again, and revised again until it is judged sufficiently accurate.
Validation is not an either/or proposition—no model is ever totally representative of the system
under study. In addition, each revision of the model, as pictured in Figure 10.3, involves some cost,
time, and effort. The modeler must weigh the possible, but not guaranteed, increase in model accuracy
versus the cost of increased validation effort. Usually, the modeler (and model users) have some
maximum discrepancy between model predictions and system behavior that would be acceptable. If
this level of accuracy cannot be obtained within the budget constraints, either expectations of model
accuracy must be lowered, or the model must be abandoned.
As an aid in the validation process, Naylor and Finger [1967] formulated a three-step approach
that has been widely followed:

1. Build a model that has high face validity.
2. Validate model assumptions.
3. Compare the model input-output transformations to corresponding input-output transformations for the real system.

10.3.1 Face Validity
The first goal of the simulation modeler is to construct a model that appears reasonable on its face to
model users and others who are knowledgeable about the real system being simulated. The potential
users of a model should be involved in model construction from its conceptualization to its implemen-
tation, to ensure that a high degree of realism is built into the model through reasonable assumptions
regarding system structure and through reliable data. Potential users and knowledgeable people can
also evaluate model output for reasonableness and can aid in identifying model deficiencies. Thus,
the users can be involved in the calibration process as the model is improved iteratively by the insights
gained from identification of the initial model deficiencies. Another advantage of user involvement
is the increase in the model’s perceived validity, or credibility, without which a manager would not
be willing to trust simulation results as a basis for decision making.
Sensitivity analysis can also be used to check a model’s face validity. The model user is asked
whether the model behaves in the expected way when one or more input variables is changed. For
example, in most queueing systems, if the arrival rate of customers (or demands for service) were to
increase, it would be expected that utilizations of servers, lengths of lines, and delays would tend to
increase (although by how much might well be unknown). From experience and from observations on
the real system (or similar related systems), the model user and model builder would probably have
some notion at least of the direction of change in model output when an input variable is increased
or decreased. For most large-scale simulation models, there are many input variables and thus many
possible sensitivity tests. The model builder must attempt to choose the most critical input variables
for testing if it is too expensive or time consuming to vary all input variables. If real system data are
available for at least two settings of the input parameters, objective scientific sensitivity tests can be
conducted via appropriate statistical techniques.
10.3.2 Validation of Model Assumptions

Model assumptions fall into two general classes: structural assumptions and data assumptions. Data
assumptions should be based on the collection of reliable data and correct statistical analysis of the
data. In a study of a bank, for example, data were collected on the following:
1. interarrival times of customers during several 2-hour periods of peak loading (“rush-hour”
traffic);
2. interarrival times during a slack period;
3. service times for commercial accounts;
4. service times for personal accounts.
The reliability of the data was verified by consultation with bank managers, who identified typical
rush hours and typical slack times. When combining two or more data sets collected at different
times, data reliability can be further enhanced by objective statistical tests for homogeneity of data.
(Do two data sets {X_i} and {Y_i} on service times for personal accounts, collected at two different
times, come from the same parent population? If so, the two sets can be combined.) Additional tests
might be required, to test for correlation in the data. As soon as the analyst is assured of dealing with
a random sample (i.e., correlation is not present), the statistical analysis can begin.
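A two-sample test is one common way to make the homogeneity check objective before two data sets are pooled; the sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy (the choice of test and the 0.05 level are illustrative choices, not a prescription):

    from scipy import stats

    def can_combine(x, y, alpha=0.05):
        """Two-sample Kolmogorov-Smirnov test of the hypothesis that two samples
        (e.g., service times collected at two different times) come from the
        same parent population."""
        statistic, p_value = stats.ks_2samp(x, y)
        return p_value > alpha   # True: no evidence against combining the sets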
The procedures for analyzing input data from a random sample were discussed in detail in
Chapter 9. Whether done manually or by special-purpose software, the analysis consists of three
steps:
Step 1. Identify an appropriate probability distribution.
Step 2. Estimate the parameters of the hypothesized distribution.
Step 3. Validate the assumed statistical model by a goodness-of-fit test, such as the chi-square or
Kolmogorov–Smirnov test, and by graphical methods.
The use of goodness-of-fit tests is an important part of the validation of data assumptions.
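For example, the exponential assumption for the interarrival times could be screened as follows; the data file is a placeholder, and because the rate is estimated from the same data, the standard Kolmogorov–Smirnov critical values are only approximate:

    import numpy as np
    from scipy import stats

    interarrivals = np.loadtxt("interarrival_times.txt")     # observed data (placeholder file)
    mean_gap = interarrivals.mean()                           # estimate of 1/lambda

    # K-S test of the fitted exponential distribution (loc = 0, scale = mean).
    statistic, p_value = stats.kstest(interarrivals, "expon", args=(0, mean_gap))
    if p_value < 0.05:
        print(f"exponential assumption is questionable (p = {p_value:.3f})")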
10.3.3 Validating Input-Output Transformations

The ultimate test of a model, and in fact the only objective test of the model as a whole, is the model's
ability to predict the future behavior of the real system when the model input data match the real
inputs and when a policy implemented in the model is implemented at some point in the system.
Furthermore, if the level of some input variables (e.g., the arrival rate of customers to a service
facility) were to increase or decrease, the model should accurately predict what would happen in
the real system under similar circumstances. In other words, the structure of the model should be
accurate enough for the model to make good predictions, not just for one input data set, but for the
range of input data sets that are of interest.
In this phase of the validation process, the model is viewed as an input-output transformation—
that is, the model accepts values of the input parameters and transforms these inputs into output
measures of performance. It is this correspondence that is being validated.
Instead of validating the model input-output transformations by predicting the future, the modeler
could use historical data that have been reserved for validation purposes only—that is, if one data
set has been used to develop and calibrate the model, it is recommended that a separate data set be
used as the final validation test. Thus, accurate “prediction of the past” can replace prediction of the
future for the purpose of validating the model.
A model is usually developed with primary interest in a specific set of system responses to be
measured under some range of input conditions. For example, in a queueing system, the responses
may be server utilization and customer delay, and the range of input conditions (or input variables)
may include two or three servers at some station and a choice of scheduling rules. In a production
system, the response may be throughput (i.e., production per hour), and the input conditions may be a
choice of several machines that run at different speeds, with each machine having its own breakdown
and maintenance characteristics.
In any case, the modeler should use the main responses of interest as the primary criteria for
validating a model. If the model is used later for a purpose different from its original purpose, the
model should be revalidated in terms of the new responses of interest and under the possibly new
input conditions.
A necessary condition for the validation of input-output transformations is that some version
of the system under study exist, so that system data under at least one set of input conditions can
be collected to compare to model predictions. If the system is in the planning stages and no system
operating data can be collected, complete input-output validation is not possible. Other types of
validation should be conducted, to the extent possible. In some cases, subsystems of the planned
system may exist, and a partial input-output validation can be conducted.
Presumably, the model will be used to compare alternative system designs or to investigate
system behavior under a range of new input conditions. Assume for now that some version of the
system is operating and that the model of the existing system has been validated. What, then, can be
said about the validity of the model when different inputs are used? That is, if model inputs are being
changed to represent a new system design, or a new way to operate the system, or even hypothesized
future conditions, what can be said about the validity of the model with respect to this new but
nonexistent proposed system or to the system under new input conditions?
First, the responses of the two models under similar input conditions will be used as the criteria
for comparison of the existing system to the proposed system. Validation increases the modeler’s
confidence that the model of the existing system is accurate. Second, in many cases, the proposed
system is a modification of the existing system, and the modeler hopes that confidence in the model
of the existing system can be transferred to the model of the new system. This transfer of confidence
usually can be justified if the new model is a relatively minor modification of the old model in terms
of changes to the operational model (it may be a major change for the actual system). Changes in
the operational model ranging from relatively minor to relatively major include the following:
1. minor changes of single numerical parameters, such as the speed of a machine, the arrival
rate of customers (with no change in distributional form of interarrival times), the number
of servers in a parallel service center, or the mean time to failure or mean time to repair of a
machine;
2. minor changes in the form of a statistical distribution, such as the distribution of a service
time or a time to failure of a machine;
3. major changes in the logical structure of a subsystem, such as a change in queue discipline
for a waiting-line model or a change in the scheduling rule for a job-shop model;
4. major changes involving a different design for the new system, such as a computerized inven-
tory control system replacing an older noncomputerized system, or an automated storage-
and-retrieval system replacing a warehouse system in which workers manually pick items
using fork trucks.
If the change to the operational model is minor, such as in items 1 or 2, these changes can
be carefully verified and output from the new model accepted with considerable confidence. If a
sufficiently similar subsystem exists elsewhere, it might be possible to validate the submodel that
represents the subsystem and then to integrate this submodel with other validated submodels to build
a complete model. In this way, partial validation of the substantial model changes in items 3 and
4 might be possible. Unfortunately, there is no way to validate the input-output transformations of
a model of a nonexisting system completely. In any case, within time and budget constraints, the
modeler should use as many validation techniques as possible, including input-output validation of
subsystem models if operating data can be collected on such subsystems.
Example 10.2 will illustrate some of the techniques that are possible for input-output validation
and will discuss the concepts of an input variable, uncontrollable variable, decision variable, output
or response variable, and input-output transformation in more detail.
Example 10.2: The Fifth National Bank of Jaspar

The Fifth National Bank of Jaspar operates a drive-in teller window on Main Street with one teller
and one lane for waiting cars. Data were collected on one Friday between 11:00 A.M. and 1:00 P.M.;
over this rush period, interarrival times were found to be approximately exponentially distributed at
a rate of 45 customers per hour, and service times were found to be approximately
normally distributed, with mean 1.1 minutes and standard deviation 0.2 minute. Thus, the model has
two input variables:
1. interarrival times, exponentially distributed (i.e., a Poisson arrival process) at rate λ = 45 per
hour;
2. service times, assumed to be N(1.1, (0.2)²).
Each input variable has a level: the rate λ = 45 per hour for the interarrival times, and the mean 1.1
minutes and standard deviation 0.2 minute for the service times. The interarrival times are examples
of uncontrollable variables (i.e., uncontrollable by management in the real system). The service
times are also treated as uncontrollable variables, although the level of the service times might be
partially controllable. If the mean service time could be decreased to 0.9 minute by installing a
computer terminal, the level of the service-time variable becomes a decision variable or controllable
parameter. Setting all decision variables at some level constitutes a policy. For example, the current
bank policy is one teller (D1 = 1), mean service time D2 = 1.1 minutes, and one line for waiting
cars (D3 = 1). (D1, D2, ... are used to denote decision variables.) Decision variables are under
management’s control; the uncontrollable variables, such as arrival rate and actual arrival times, are
not under management’s control. The arrival rate might change from time to time, but such change
is treated as being due to external factors not under management control.
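Under these assumptions, the input variates that drive one replication of the model can be generated as in Chapter 8. The sketch below uses NumPy's generators rather than the inverse-transform routines of Chapter 8, and truncates negative normal variates at zero as a practical safeguard not stated in the model description:

    import numpy as np

    rng = np.random.default_rng(12345)      # one replication's random-number stream
    RATE = 45.0 / 60.0                       # arrival rate in customers per minute

    def next_interarrival():
        # X1n: exponential interarrival time with mean 1/RATE minutes
        return rng.exponential(1.0 / RATE)

    def next_service_time():
        # X2n: normal service time, mean 1.1 minutes, standard deviation 0.2 minute
        return max(0.0, rng.normal(1.1, 0.2))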
Table 10.1 Input and Output Variables for Model of Current Bank Operations
A model of current bank operations was developed and verified in close consultation with bank
management and employees. Model assumptions were validated, as discussed in Section 10.3.2.
The resulting model is now viewed as a “black box” that takes all input-variable specifications and
transforms them into a set of output or response variables. The output variables consist of all statistics
of interest generated by the simulation about the model’s behavior. For example, management is
interested in the teller’s utilization at the drive-in window (percent of time the teller is busy at
the window), average delay in minutes of a customer from arrival to beginning of service, and
the maximum length of the line during the rush hour. These input and output variables are shown
in Figure 10.5 and are listed in Table 10.1, together with some additional output variables. The
uncontrollable input variables are denoted by X, the decision variables by D, and the output variables
by Y. From the “black box” point of view, the model takes the inputs X and D and produces the
outputs Y, namely
(X, D) → Y

or

f(X, D) = Y
Here f denotes the transformation that is due to the structure of the model. For the Fifth National
Bank study, the exponentially distributed interarrival time generated in the model (by the methods
of Chapter 8) between customer n − 1 and customer n is denoted by X1n. (Do not confuse X1n with
An; the latter was an observation made on the real system.) The normally distributed service time
generated in the model for customer n is denoted by X2n. The set of decision variables, or policy, is
D = (D1, D2, D3) = (1, 1.1, 1) for current operations. The output, or response, variables are denoted
by Y = (Y1, Y2, ..., Y7) and are defined in Table 10.1.
For validation of the input-output transformations of the bank model to be possible, real system
data must be available, comparable to at least some of the model output Y of Table 10.1. The system
responses should have been collected during the same time period (from 11:00 A.M. to 1:00 P.M. on
the same Friday) in which the input data {An, Sn} were collected. This is important because, if system
response data were collected on a slower day (say, an arrival rate of 40 per hour), the system responses
such as teller utilization Z1, average delay Z2, and maximum line length Z3 would be expected to be
lower than the same variables during a time slot when the arrival rate was 45 per hour, as observed.
Suppose that the delay of successive customers was measured on the same Friday between 11:00
A.M. and 1:00 P.M. and that the average delay was found to be Z2 = 4.3 minutes. For the purpose
of validation, we will consider this to be the true mean value μ0 = 4.3.
When the model is run with generated random variates X1n and X2n, it is expected that observed
values of average delay Y2 should be close to Z2 = 4.3 minutes. The generated input values X1n and
X2n cannot be expected to replicate the actual input values An and Sn of the real system exactly, but
they are expected to replicate the statistical pattern of the actual inputs. Hence, simulation-generated
values of Y2 are expected to be consistent with the observed system variable Z2 = 4.3 minutes. Now
consider how the modeler might test this consistency.
The modeler makes a small number of statistically independent replications of the model. Statis-
tical independence is guaranteed by using nonoverlapping sets of random numbers produced by the
random-number generator or by choosing seeds for each replication independently (from a random
number table). The results of six independent replications, each of 2 hours duration, are given in
Table 10.2.
Observed arrival rate Y4 and sample average service time Y5 for each replication of the model
are also noted, to be compared with the specified values of 45/hour and 1.1 minutes, respectively. The
validation test consists of comparing the system response, namely average delay Z2 = 4.3 minutes,
to the model responses Y2. Formally, a statistical test of the null hypothesis

H0: E(Y2) = 4.3 minutes
versus
H1: E(Y2) ≠ 4.3 minutes        (10.1)

is conducted. If H0 is not rejected, then, on the basis of this test, there is no reason to consider the
model invalid. If H0 is rejected, the current version of the model is rejected, and the modeler is forced
to seek ways to improve the model, as illustrated by Figure 10.3. As formulated here, the appropriate
statistical test is the t test, which is conducted in the following manner:

Choose a level of significance α and a sample size n. For the bank model, choose

α = 0.05,  n = 6
Compute the sample mean Ȳ2 and the sample standard deviation S over the n replications, by
using Equations (9.1) and (9.2):

Ȳ2 = (1/n) Σ Y2i = 2.51 minutes

and

S = √[ Σ (Y2i − Ȳ2)² / (n − 1) ] = 0.82 minute
Table 10.2  Results of six independent replications of the bank model
(columns: replication number; Y4, observed arrival rate in arrivals/hour; Y5, sample average service time in minutes; Y2, average delay in minutes; the six replication values are followed by the sample mean, 2.51 minutes, and the standard deviation, 0.82 minute, of Y2)
Compute the test statistic

t0 = (Ȳ2 − μ0)/(S/√n)        (10.2)

where μ0 is the specified value in the null hypothesis H0. Here μ0 = 4.3 minutes, so that

t0 = (2.51 − 4.3)/(0.82/√6) = −5.34
For the two-sided test, if |t0| > tα/2,n−1, reject H0. Otherwise, do not reject H0. [For the one-sided
test with H1: E(Y2) > μ0, reject H0 if t0 > tα,n−1; with H1: E(Y2) < μ0, reject H0 if t0 < −tα,n−1.]
Since |t0| = 5.34 > t0.025,5 = 2.571, reject H0, and conclude that the model is inadequate in its
prediction of average customer delay.
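The same test is easy to script from the replication summary; the sketch below reproduces the computation from the summary statistics quoted above (with the six individual Y2 values available, scipy.stats.ttest_1samp could be used instead):

    from math import sqrt
    from scipy import stats

    y2_bar, s, n = 2.51, 0.82, 6         # sample mean, standard deviation, replications
    mu0, alpha = 4.3, 0.05                # system value and significance level

    t0 = (y2_bar - mu0) / (s / sqrt(n))   # Equation (10.2)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)

    if abs(t0) > t_crit:
        print(f"t0 = {t0:.2f} exceeds {t_crit:.3f} in magnitude: reject H0, model inadequate")
    else:
        print(f"t0 = {t0:.2f} does not exceed {t_crit:.3f}: fail to reject H0")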
Recall that, in the testing of hypotheses, rejection of the null hypothesis H0 is a strong conclusion,
because

α = P(Type I error) = P(rejecting H0 | H0 is true)        (10.3)

and the level of significance α is chosen small, say α = 0.05, as was done here. Equation (10.3) says
that the probability is low (α = 0.05) of making the error of rejecting H0 when H0 is in fact true
—that is, the probability is small of declaring the model invalid when it is valid (with respect to the
variable being tested). The assumptions justifying a t test are that the observations Y2i are normally
and independently distributed. Are these assumptions met in the present case?
1. The ith observation Y2i is the average delay of all drive-in customers who began service during
the ith simulation run of 2 hours; thus, by a Central Limit Theorem effect, it is reasonable
to assume that each observation Y2i is approximately normally distributed, provided that the
number of customers it is based on is not too small.
2. The observations Y2i, i = 1, ..., 6, are statistically independent by design—that is, by
choosing the random-number seeds independently for each replication or by using nonover-
lapping streams.
3. The t statistic computed by Equation (10.2) is a robust statistic—that is, it is distributed
approximately as the t distribution with n − 1 degrees of freedom, even when Y21, Y22, ...
are not exactly normally distributed, and thus the critical values in Table A.5 can reliably be
used.
Now that the model of the Fifth National Bank of Jaspar has been found lacking, what should the
modeler do? Upon further investigation, the modeler realized that the model contained two unstated
assumptions:
1. When a car arrived to find the window immediately available, the teller began service imme-
diately.
2. There is no delay between one service ending and the next beginning, when a car is waiting.
Assumption 2 was found to be approximately correct, because a service time was considered to
begin when the teller actually began service but was not considered to have ended until the car had
exited the drive-in window and the next car, if any, had begun service, or the teller saw that the line
was empty. On the other hand, assumption 1 was found to be incorrect because the teller had other
duties—mainly, serving walk-in customers if no cars were present—and tellers always finished with
a previous customer before beginning service on a car. It was found that walk-in customers were
always present during rush hour; that the transactions were mostly commercial in nature, taking a
considerably longer time than the time required to service drive-up customers; and that, when an
arriving car found no other cars at the window, it had to wait until the teller finished with the present
walk-in customer. To correct this model inadequacy, the structure of the model was changed to
include the additional demand on the teller’s time, and data were collected on service times of walk-
in customers. Analysis of these data found that they were approximately exponentially distributed
with a mean of 3 minutes.
The revised model was run, yielding the results in Table 10.3. A test of the null hypothesis
H0: E(Y2) = 4.3 minutes was again conducted, according to the procedure previously outlined.
Choose α = 0.05 and n = 6 (sample size).
Compute Ȳ2 = 4.78 minutes, S = 1.66 minutes.
Look up, in Table A.5, the critical value t0.025,5 = 2.571.
Compute the test statistic t0 = (Ȳ2 − μ0)/(S/√n) = 0.710.
Since |t0| < t0.025,5 = 2.571, do not reject H0, and thus tentatively accept the model as valid.
Failure to reject H0 must be considered as a weak conclusion unless the power of the test has
been estimated and found to be high (close to 1)—that is, it can be concluded only that the data
at hand (Y21, ..., Y26) were not sufficient to reject the hypothesis H0: μ0 = 4.3 minutes. In other
words, this test detects no inconsistency between the sample data (Y21, ..., Y26) and the specified
mean μ0.
The power of a test is the probability of detecting a departure from H0: μ = μ0 when, in fact,
such a departure exists. In the validation context, the power of the test is the probability of detecting
an invalid model. The power may also be expressed as 1 minus the probability of a Type II, or β,
error, where β = P(Type II error) = P(failing to reject H0 | H1 is true) is the probability of accepting
the model as valid when it is not valid.

Table 10.3  Results of six replications of the revised bank model
(columns: replication number; Y4, arrivals/hour; Y5, minutes; Y2, average delay in minutes; the replication values are followed by the sample mean, 4.78 minutes, and the standard deviation, 1.66 minutes, of Y2)
To consider failure to reject H0 as a strong conclusion, the modeler would want β to be small.
Now, β depends on the sample size n and on the true difference between E(Y2) and μ0 = 4.3
minutes—that is, on

δ = |E(Y2) − μ0| / σ

where σ, the population standard deviation of an individual Y2i, is estimated by S. Tables A.10 and
A.11 are typical operating-characteristic (OC) curves, which are graphs of the probability of a Type
II error β(δ) versus δ for given sample size n. Table A.10 is for a two-sided t test; Table A.11 is for a
one-sided t test. Suppose that the modeler would like to reject H0 (model validity) with probability at
least 0.90 if the true mean delay of the model, E(Y2), differed from the average delay in the system,
μ0 = 4.3 minutes, by 1 minute. Then δ is estimated by

|E(Y2) − μ0| / S = 1 / 1.66 = 0.60
For the two-sided test with α = 0.05, the probability β(δ) of failing to detect this departure can be
read from the OC curve of Table A.10 at δ = 0.6 and n = 6; if this risk is judged too large, the number
of replications must be increased until the desired power is achieved.
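The OC-curve lookup can also be done numerically. The sketch below uses statsmodels to evaluate the power of the two-sided one-sample t test at the effect size estimated above and to find the number of replications needed for the 0.90 power the modeler asked for; it is offered as an alternative to the tables, not as the procedure used in the text:

    from statsmodels.stats.power import TTestPower

    analysis = TTestPower()
    effect_size = 1.0 / 1.66     # delta estimated by |E(Y2) - mu0| / S

    # Probability of detecting a 1-minute departure with n = 6 replications.
    power_n6 = analysis.power(effect_size=effect_size, nobs=6, alpha=0.05,
                              alternative="two-sided")

    # Number of replications required to reach power 0.90 (beta = 0.10).
    n_needed = analysis.solve_power(effect_size=effect_size, power=0.90,
                                    alpha=0.05, alternative="two-sided")

    print(f"power with 6 replications: {power_n6:.2f}; "
          f"replications needed for power 0.90: {n_needed:.1f}")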
Philosophically, the hypothesis-testing approach tries to evaluate whether the simulation and the
real system are the same with respect to some output performance measure or measures. A different,
but closely related, approach is to attempt to evaluate whether the simulation and the real-system
performance measures are close enough by using confidence intervals.
We continue to assume that there is a known output performance measure for the existing system,
denoted by μ0, and an unknown performance measure of the simulation, μ, that we hope is close.
The hypothesis-testing formulation tested whether μ = μ0; the confidence-interval formulation tries
to bound the difference |μ − μ0| to see whether it is ≤ ε, a difference that is small enough to allow
valid decisions to be based on the simulation. The value of ε is set by the analyst.
Specifically, if Y is the simulation output and μ = E(Y), then we execute the simulation and
form a confidence interval for μ, such as Ȳ ± tα/2,n−1 S/√n. The determination of whether to accept
the model as valid or to refine the model depends on the best-case and worst-case error implied by
the confidence interval.
1. Suppose the confidence interval does not contain μ0. [See Figure 10.6(a).]
(a) If the best-case error is > ε, then the difference in performance is large enough, even in
the best case, to indicate that we need to refine the simulation model.
(b) If the worst-case error is ≤ ε, then we can accept the simulation model as close enough
to be considered valid.
(c) If the best-case error is ≤ ε, but the worst-case error is > ε, then additional simulation
replications are necessary to shrink the confidence interval until a conclusion can be
reached.
2. Suppose the confidence interval does contain μ0. [See Figure 10.6(b).]
(a) If either the best-case or worst-case error is > ε, then additional simulation replications
are necessary to shrink the confidence interval until a conclusion can be reached.
(b) If the worst-case error is ≤ ε, then we can accept the simulation model as close enough
to be considered valid.
In Example 10.2, μ0 = 4.3 minutes, and "close enough" was ε = 1 minute of expected customer
delay. A 95% confidence interval, based on the 6 replications in Table 10.2, is

Ȳ ± t0.025,5 S/√n = 2.51 ± 2.571(0.82/√6)
Figure 10.6 Validation of the input-output transformation (a) when the known value falls
outside, and (b) when the known value falls inside, the confidence interval.
yielding the interval [1.65, 3.37]. As in Figure 10.6(a), μ0 = 4.3 falls outside the confidence interval.
Since in the best case |3.37 − 4.3| = 0.93 < 1, but in the worst case |1.65 − 4.3| = 2.65 > 1,
additional replications are needed to reach a decision.
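The best-case/worst-case logic is easy to mechanize. The sketch below reproduces the calculation for the six replications of Table 10.2, using the summary statistics quoted in the text:

    from math import sqrt
    from scipy import stats

    def ci_validation(y_bar, s, n, mu0, eps, alpha=0.05):
        """Best-case/worst-case comparison of |mu - mu0| with eps, based on a
        t confidence interval for the mean simulation response."""
        half_width = stats.t.ppf(1 - alpha / 2, n - 1) * s / sqrt(n)
        lo, hi = y_bar - half_width, y_bar + half_width
        best = 0.0 if lo <= mu0 <= hi else min(abs(lo - mu0), abs(hi - mu0))
        worst = max(abs(lo - mu0), abs(hi - mu0))
        if worst <= eps:
            return (lo, hi), "accept the model as close enough"
        if best > eps:
            return (lo, hi), "refine the model"
        return (lo, hi), "make additional replications"

    print(ci_validation(y_bar=2.51, s=0.82, n=6, mu0=4.3, eps=1.0))
    # Interval is about [1.65, 3.37]; best-case error 0.93 <= 1 but worst-case
    # error 2.65 > 1, so additional replications are needed.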
10.3.4 Input-Output Validation: Using Historical Input Data

When using artificially generated data as input data, as was done to test the validity of the bank models
in Section 10.3.3, the modeler expects the model to produce event patterns that are compatible with,
but not identical to, the event patterns that occurred in the real system during the period of data
collection. Thus, in the bank model, artificial input data {X1n, X2n, n = 1, 2, ...} for interarrival
and service times were generated, and replicates of the output data Y2 were compared to what was
observed in the real system by means of the hypothesis test stated in equation (10.1). An alternative
to generating input data is to use the actual historical record, {An, Sn, n = 1, 2, ...}, to drive the
simulation model and then to compare model output with system data.
To implement this technique for the bank model, the data A1, A2, ... and S1, S2, ... would have
to be entered into the model into arrays, or stored in a file to be read as the need arose. Just after
customer n arrived at time tn = A1 + A2 + ··· + An, customer n + 1 would be scheduled on the future event
list to arrive at future time tn + An+1 (without any random numbers being generated). If customer n
were to begin service at time t′n, a service completion would be scheduled to occur at time t′n + Sn.
This event scheduling without random-number generation could be implemented quite easily in a
general-purpose programming language or most simulation languages by using arrays to store the
data or reading the data from a file.
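A sketch of this bookkeeping is given below: the recorded arrival and service times are read from files and every event is scheduled at its historical time, with no random variates generated. The file names and the event-list interface are illustrative, not part of any particular package:

    import numpy as np

    arrivals = np.loadtxt("historical_interarrivals.txt")   # A1, A2, ... (minutes)
    services = np.loadtxt("historical_service_times.txt")   # S1, S2, ... (minutes)

    def schedule_historical_arrivals(event_list):
        """Place every recorded arrival on the future event list at time
        tn = A1 + ... + An; no random numbers are generated."""
        t = 0.0
        for n, a in enumerate(arrivals, start=1):
            t += a
            event_list.schedule(time=t, event_type="ARRIVAL", customer=n)

    def schedule_service_completion(event_list, customer, start_time):
        # Customer n's service lasts exactly the recorded Sn.
        event_list.schedule(time=start_time + services[customer - 1],
                            event_type="DEPARTURE", customer=customer)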
When using this technique, the modeler hopes that the simulation will duplicate as closely as
possible the important events that occurred in the real system. In the model of the Fifth National Bank
of Jaspar, the arrival times and service durations will exactly duplicate what happened in the real
system on that Friday between 11:00 A.M. and 1:00 P.M. If the model is sufficiently accurate, then
the delays of customers, lengths of lines, utilizations of servers, and departure times of customers
predicted by the model will be close to what actually happened in the real system. It is, of course,
the model-builder’s and model-user’s judgment that determines the level of accuracy required.
To conduct a validation test using historical input data, it is important that all the input data
(An, Sn, ...) and all the system response data, such as average delay (Z2), be collected during the
same time period. Otherwise, the comparison of model responses to system responses, such as the
comparison of average delay in the model (Y2) to that in the system (Z2), could be misleading.
The responses Y2 and Z2 depend both on the inputs An and Sn and on the structure of the system
(or model). Implementation of this technique could be difficult for a large system, because of the
need for simultaneous data collection of all input variables and those response variables of primary
interest. In some systems, electronic counters and devices are used to ease the data-collection task by
automatically recording certain types of data. The following example was based on two simulation
models reported in Carson et al. [1981a, b], in which simultaneous data collection and the subsequent
validation were both completed successfully.
Example 10.3: The Candy Factory
The production line at the Sweet Lil’ Things Candy Factory in Decatur consists of three machines
that make, package, and box their famous candy. One machine (the candy maker) makes and wraps
individual pieces of candy and sends them by conveyor to the packer. The second machine (the
packer) packs the individual pieces into a box. A third machine (the box maker) forms the boxes and
supplies them by conveyor to the packer. The system is illustrated in Figure 10.7.
Each machine is subject to random breakdowns due to jams and other causes. These breakdowns
cause the conveyor to begin to empty or fill. The conveyors between the two makers and the packer
are used as a temporary storage buffer for in-process inventory. In addition to the randomly occurring
breakdowns, if the candy conveyor empties, a packer runtime is interrupted and the packer remains
idle until more candy is produced. If the box conveyor empties because of a long random breakdown
of the box machine, an operator manually places racks of boxes onto the packing machine. If a
conveyor fills, the corresponding maker becomes idle. The purpose of the model is to investigate the
frequency of those operator interventions that require manual loading of racks of boxes as a function
of various combinations of individual machines and lengths of conveyor. Different machines have
different production speeds and breakdown characteristics, and longer conveyors can hold more in-
process inventory. The goal is to hold operator interventions to an acceptable level while maximizing
production. Machine stoppages (whether due to a full or an empty conveyor) cause damage to the
product, so this is also a factor in production.
A simulation model of the candy factory was developed, and a validation effort using historical
inputs was conducted. Engineers in the candy factory set aside a 4-hour time slot from 7:00 A.M. to
11:00 A.M. to collect data on an existing production line. For each machine—say, machine i—the
successive times to failure and downtime durations

Ti1, Di1, Ti2, Di2, ...

were collected. For machine i (i = 1, 2, 3), Tij is the jth runtime (or time to failure), and Dij is the
successive downtime. A runtime Tij can be interrupted by a full or empty conveyor (as appropriate),
but resumes when conditions are right. Initial system conditions at 7:00 A.M. were recorded so that
they could be duplicated in the model as initial conditions at time 0. Additionally, system responses of
primary interest—the production level Z1, and the number Z2 and time of occurrence Z3 of operator
interventions—were recorded for comparison with model predictions.
The system input data Tij and Dij were fed into the model and used as runtimes and random
downtimes. The structure of the model determined the occurrence of shutdowns due to a full or empty
conveyor and the occurrence of operator interventions. Model response variables Yi, i = 1, 2, 3, were
collected for comparison to the corresponding system response variables Zi, i = 1, 2, 3.
The closeness of model predictions to system performance aided the engineering staff consider-
ably in convincing management of the validity of the model. These results are shown in Table 10.5.
A simple display such as Table 10.5 can be quite effective in convincing skeptical engineers and
managers of a model’s validity—perhaps more effectively than the most sophisticated statistical
methods!
With only one set of historical input and output data, only one set of simulated output data can be
obtained, and thus no simple statistical tests based on summary measures are possible. However, if K
historical input data sets are collected, and K observations Zi1, Zi2, ..., ZiK of some system response
variable Zi are collected, such that the output measure Zij corresponds to the jth input set, an objective
statistical test becomes possible. For example, Zij could be the average delay of all customers who
were served during the time the jth input data set was collected. With the K input data sets in hand,
the modeler now runs the model K times, once for each input set, and observes the simulated results
Wi1, Wi2, ..., WiK corresponding to Zij, j = 1, ..., K. Continuing the same example, Wij would be
the average delay predicted by the model for the jth input set. The data available for comparison
appear as in Table 10.6.

Table 10.6  Comparison of System and Model Output Measures for Identical Historical Inputs
If the K input data sets are fairly homogeneous, it is reasonable to assume that the K observed
differences dj = Zij − Wij, j = 1, ..., K, are identically distributed. Furthermore, if the collection of
the K sets of input data was separated in time—say, on different days—it is reasonable to assume that
the K differences d1, ..., dK are statistically independent and, hence, that the differences d1, ..., dK
constitute a random sample. In many cases, each Zij and Wij is a sample average over customers,
and so (by the Central Limit Theorem) the differences dj = Zij − Wij are approximately normally
distributed with some mean μd and variance σd². The appropriate statistical test is then a t test of the
null hypothesis of no mean difference:

H0: μd = 0
versus
H1: μd ≠ 0
The proper test is a paired t test (Zi1 is paired with Wi1, each having been produced by the first input
data set, and so on). First, compute the sample mean difference, d̄, and the sample variance, Sd², by
the formulas given in Table 10.6. Then, compute the t statistic as

t0 = (d̄ − μd)/(Sd/√K)        (10.4)
(with μd = 0), and get the critical value tα/2,K−1 from Table A.5, where α is the prespecified
significance level and K − 1 is the number of degrees of freedom. If |t0| > tα/2,K−1, reject the
hypothesis H0 of no mean difference, and conclude that the model is inadequate. If |t0| < tα/2,K−1,
do not reject H0, and hence conclude that this test provides no evidence of model inadequacy.
For the candy-factory model, with K = 5 historical input data sets, the sample mean difference in
production level was d̄ = 5343.2 and the sample standard deviation of the differences was Sd = 8705.85,
so that

t0 = d̄/(Sd/√K) = 5343.2/(8705.85/√5) = 1.37
From Table A.5, the critical value is tα/2,K−1 = t0.025,4 = 2.78. Since |t0| = 1.37 < t0.025,4 =
2.78, the null hypothesis cannot be rejected on the basis of this test—that is, no inconsistency is
detected between system response and model predictions in terms of mean production level. If H0
had been rejected, the modeler would have searched for the cause of the discrepancy and revised the
model, in the spirit of Figure 10.3.
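This paired comparison is straightforward to script. The sketch below reproduces the candy-factory calculation from the summary statistics quoted above; with the raw paired observations available, scipy.stats.ttest_rel(z, w) performs the same paired t test directly:

    from math import sqrt
    from scipy import stats

    d_bar, s_d, K = 5343.2, 8705.85, 5     # mean difference, std. deviation, data sets
    t0 = d_bar / (s_d / sqrt(K))           # Equation (10.4) with mu_d = 0
    t_crit = stats.t.ppf(1 - 0.05 / 2, K - 1)

    print(f"t0 = {t0:.2f}, critical value = {t_crit:.2f}")
    # |t0| = 1.37 < 2.78, so H0 (no mean difference) is not rejected.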
One further subjective test asks persons knowledgeable about system behavior to distinguish genuine
system performance reports from reports produced by the simulation. Suppose, for example, that an
engineer is given five reports of actual system performance on five different days together with five
"fake" reports produced from simulation output data in the same format, without being told which
reports describe the real system. The 10 reports are randomly shuffled and given to the engineer,
who is asked to decide which
reports are fake and which are real. If the engineer identifies a substantial number of the fake reports,
the model builder questions the engineer and uses the information gained to improve the model.
If the engineer cannot distinguish between fake and real reports with any consistency, the modeler
will conclude that this test provides no evidence of model inadequacy. For further discussion and an
application to a real simulation, the reader is referred to Schruben [1980]. This type of validation
test is commonly called a Turing test. Its use as model development proceeds can be a valuable
tool in detecting model inadequacies and, eventually, in increasing model credibility as the model is
improved and refined.
10.4 Summary
Validation of simulation models is of great importance. Decisions are made on the basis of simulation
results; thus, the accuracy of these results should be subject to question and investigation.
Quite often, simulations appear realistic on the surface because simulation models, unlike
analytic models, can incorporate any level of detail about the real system. To avoid being “fooled”
by this apparent realism, it is best to compare system data to model data and to make the comparison
by using a wide variety of techniques, including an objective statistical test, if at all possible.
As discussed by Van Horn [1969, 1971], some of the possible validation techniques, in order of
increasing cost-to-value ratios, include
1. Develop models with high face validity by consulting people knowledgeable about system
behavior on model structure, model input, and model output. Use any existing knowledge in
the form of previous research and studies, observation, and experience.
2. Conduct simple statistical tests of input data for homogeneity, for randomness, and for good-
ness of fit to assumed distributional forms.
3. Conduct a Turing test. Have knowledgeable people (engineers, managers) compare model
output to system output and attempt to detect the difference.
4. Compare model output to system output by means of statistical tests.
5. After model development, collect new system data and repeat techniques 2 to 4.
6. Build the new system (or redesign the old one), conforming to the simulation results, collect
data on the new system, and use the data to validate the model (not recommended if this is
the only technique used).
7. Do little or no validation. Implement simulation results without validating. (Not recom-
mended.)
It is usually too difficult, too expensive, or too time consuming to use all possible validation
techniques for every model that is developed. It is an important part of the model-builder’s task to
choose those validation techniques most appropriate, both to assure model accuracy and to promote
model credibility.
REFERENCES
BALCI, O. [1994], “Validation, Verification and Testing Techniques throughout the Life Cycle of a Simulation
Study,” Annals of Operations Research, Vol. 53, pp. 121-174.
BALCI, O. [1998], “Verification, Validation, and Testing,” in Handbook of Simulation, J. Banks, ed., John Wiley,
New York.
BALCI, O. [2003], “Verification, Validation, and Certification of Modeling and Simulation Applications,” in
Proceedings of the 2003 Winter Simulation Conference, S. Chick, P. J. Sanchez, D. Ferrin, and D. J. Morrice,
eds., New Orleans, LA, Dec. 7-10, pp. 150-158.
BALCI, O., AND R. G. SARGENT [1982a], “Some Examples of Simulation Model Validation Using Hypothesis
Testing,” in Proceedings of the 1982 Winter Simulation Conference, H. J. Highland, Y. W. Chao, and O. S.
Madrigal, eds., San Diego, CA, Dec. 6-8, pp. 620-629.
BALCI, O., AND R. G. SARGENT [1982b], “Validation of Multivariate Response Models Using Hotelling’s
Two-Sample T? Test,” Simulation, Vol. 39, No. 6, pp. 185-192.
BALCI, O., AND R. G. SARGENT [1984a], “Validation of Simulation Models via Simultaneous Confidence
Intervals,” American Journal of Mathematical Management Sciences, Vol. 4, Nos. 3 & 4, pp. 375-406.
BALCI, O., AND R. G. SARGENT [1984b], “A Bibliography on the Credibility Assessment and Validation of
Simulation and Mathematical Models,” Simuletter, Vol. 15, No. 3, pp. 15-27.
BORTSCHELLER, B. J., AND E. T. SAULNIER [1992], “Model Reusability in a Graphical Simulation
Package,” in Proceedings of the 24th Winter Simulation Conference, J. J. Swain, D. Goldsman, R. C. Crain,
and J. R. Wilson, eds., Arlington, VA, Dec. 13-16, pp. 764-772.
CARSON, J. S. [1986], “Convincing Users of Model’s Validity is Challenging Aspect of Modeler’s Job,”
Industrial Engineering, June, pp. 76-85.
CARSON, J. S. [2002], “Model Verification and Validation,” in Proceedings of the 34th Winter Simulation
Conference, E. Yucesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds., San Diego, Dec. 8-11, pp.
52-58.
CARSON, J. S., N. WILSON, D. CARROLL, AND C. H. WYSOWSKI [1981a], “A Discrete Simulation
Model of a Cigarette Fabrication Process,” Proceedings of the Twelfth Modeling and Simulation Conference,
University of Pittsburgh, PA., Apr. 30—May 1, pp. 683-689.
CARSON, J. S., N. WILSON, D. CARROLL, AND C. H. WYSOWSKI [1981b], “Simulation of a Filter Rod
Manufacturing Process,” Proceedings of the 1981 Winter Simulation Conference, T. I. Oren, C. M. Delfosse,
and C. M. Shub, eds., Atlanta, GA, Dec. 9-11, pp. 535-541.
GAFARIAN, A. V., AND J. E. WALSH [1970], “Methods for Statistical Validation of a Simulation Model for
Freeway Traffic near an On-Ramp,” Transportation Research, Vol. 4, pp. 379-384.
GASS, S. I. [1983], “Decision-Aiding Models: Validation, Assessment, and Related Issues for Policy Analysis,”
Operations Research, Vol. 31, No. 4, pp. 601-663.
HINES, W. W., D. C. MONTGOMERY, D. M. GOLDSMAN, AND C. M. BORROR [2002], Probability and
Statistics in Engineering, 4th ed., Wiley, New York.
KLEIJNEN, J. P. C. [1987], Statistical Tools for Simulation Practitioners, Marcel Dekker, New York.
KLEIJNEN, J. P. C. [1993], “Simulation and Optimization in Production Planning: A Case Study,” Decision
Support Systems, Vol. 9, pp. 269-280.
KLEIJNEN, J. P. C. [1995], “Theory and Methodology: Verification and Validation of Simulation Models,”
European Journal of Operational Research, Vol. 82, No. 1, pp. 145-162.
LAW, A. M., AND W. D. KELTON [2000], Simulation Modeling and Analysis, 3d ed., McGraw-Hill, New York.
NAYLOR, T. H., AND J. M. FINGER [1967], “Verification of Computer Simulation Models,” Management
Science, Vol. 2, pp. B92-B101.
Estimation of Absolute
Performance
Output analysis is the examination of data generated by a simulation. Its purpose is either to predict
the performance of a system or to compare the performance of two or more alternative system designs.
This chapter deals with estimating absolute performance, by which we mean estimating the value
of one or more system performance measures; Chapter 12 deals with the comparison of two or more
systems, in other words relative performance.
The need for statistical output analysis is based on the observation that the output data from
a simulation exhibits random variability when random-number generators are used to produce the
values of the input variables—that is, two different streams or sequences of random numbers will
produce two sets of outputs, which (probably) will differ. If the performance of the system is measured
by a parameter θ, the result of a set of simulation experiments will be an estimator θ̂ of θ. The precision of the estimator θ̂ can be measured by the standard error of θ̂ or by the width of a confidence interval for θ. The purpose of the statistical analysis is either to estimate this standard error or confidence
interval or to figure out the number of observations required to achieve a standard error or confidence
interval of a given size—or both.
Consider a typical output variable Y, the total cost per week of an inventory system; Y should
be treated as a random variable with an unknown distribution. A simulation run of length 1 week
provides a single sample observation from the population of all possible observations on Y. By
increasing the run length, the sample size can be increased to n observations, Y_1, Y_2, ..., Y_n, based on a run length of n weeks. However, these observations do not constitute a random sample, in the classic sense, because they are not statistically independent. In this case, the inventory on hand at the end of one week is the beginning inventory on hand for the next week, and so the value of Y_i has some influence on the value of Y_{i+1}. Thus, the sequence of random variables Y_1, Y_2, ..., Y_n could
be autocorrelated (i.e., correlated with itself). This autocorrelation, which is a measure of a lack of
statistical independence, means that classic methods of statistics, which assume independence, are
not directly applicable to the analysis of these output data. The methods must be properly modified
and the simulation experiments properly designed for valid inferences to be made.
In addition to the autocorrelation present in most simulation output data, the specification of
the initial conditions of the system at time 0 can pose a problem for the simulation analyst and
could influence the output data. By “time 0” we mean whatever point in time the beginning of the
simulation run represents. For example, the inventory on hand and the number of backorders at time 0
(Monday morning) would most likely influence the value of Y, the total cost for week 1. Because of
the autocorrelation, these initial conditions would also influence the costs Y_2, ..., Y_n for subsequent
weeks. The specified initial conditions, if not chosen well, can have an effect on estimation of the
steady-state (long-run) performance of a simulation model. For purposes of statistical analysis, the
effect of the initial conditions is that the output observations might not be identically distributed and
that the initial observations might not be representative of the steady-state behavior of the system.
Section 11.1 distinguishes between two types of simulation—transient versus steady state—and
defines commonly used measures of system performance for each type of simulation. Section 11.2
illustrates by example the inherent variability in a stochastic (i.e., probabilistic) discrete-event simu-
lation and thereby demonstrates the need for a statistical analysis of the output. Section 11.3 covers
the statistical estimation of performance measures. Section 11.4 discusses the analysis of terminating
simulations, and Section 11.5 the analysis of steady-state simulations. To support the calculations
in this chapter, a spreadsheet called SimulationTools.xls is provided at www.bcnn.net. SimulationTools.xls is a menu-driven application that allows the user to paste in their own data—generated by any simulation application—and perform statistical analyses on it. A detailed user guide is also available on the book web site, and SimulationTools.xls has integrated
help.
11.1 Types of Simulations with Respect to Output Analysis

In analyzing simulation output data, a distinction is made between terminating or transient simulations and steady-state simulations.
A terminating simulation is one that runs for some duration of time T_E, where E is a specified event (or set of events) that stops the simulation. Such a simulated system opens at time 0 under well-specified initial conditions and closes at the stopping time T_E. The next four examples are terminating simulations.
Example 11.1
The Shady Grove Bank opens at 8:30 A.M. (time 0) with no customers present and 8 of the 11 tellers
working (initial conditions) and closes at 4:30 P.M. (time T_E = 480 minutes). Here, the event E is
merely the fact that the bank has been open for 480 minutes. The simulation analyst is interested in
modeling the interaction between customers and tellers over the entire day, including the effect of
starting up and closing down at the end of the day.
Example 11.3
A communications system consists of several components plus several backup components. It is
represented schematically in Figure 11.1. Consider the system over a period of time T_E, until the
system fails. The stopping event E is defined by E = {A fails, or D fails, or (B and C both fail)}.
Initial conditions are that all components are new at time 0.
Notice that, in the bank model of Example 11.1, the stopping time T_E = 480 minutes is known, but in Example 11.3, the stopping time T_E is generally unpredictable in advance; in fact, T_E is probably the output variable of interest, as it represents the total time until the system breaks down. One goal of the simulation might be to estimate E(T_E), the mean time to system failure.
Example 11.4
A manufacturing process runs continuously from Monday mornings until Saturday mornings. The
first shift of each work week is used to load inventory buffers and chemical tanks with the components
and catalysts needed to make the final product. These components and catalysts are made continually
throughout the week, except for the last shift Friday night, which is used for cleanup and maintenance.
Thus, most inventory buffers are near empty at the end of the week. During the first shift on Monday,
a buffer stock is built up to cover the eventuality of breakdown in some part of the process. It is
desired to simulate this system during the first shift (time 0 to time T_E = 8 hours) to study various
scheduling policies for loading inventory buffers.
In simulating a terminating system, the initial conditions of the system at time 0 must be specified, and the stopping time T_E—or, alternatively, the stopping event E—must be well defined.
Although it is certainly true that the Shady Grove Bank in Example 11.1 will open again the next day,
the simulation analyst has chosen to consider it a terminating system because the object of interest is
one day’s operation, including start up and close down. On the other hand, if the simulation analyst
were interested in some other aspect of the bank’s operations, such as the flow of money or operation
of automated teller machines, then the system might be considered as a nonterminating one. Similar
comments apply to the communications system of Example 11.3. If the failed component were
replaced and the system continued to operate, and, most important, if the simulation analyst were
interested in studying its long-run behavior, it might be considered as a nonterminating system. In
Example 11.3, however, interest is in its short-run behavior, from time 0 until the first system failure
at time T_E. Therefore, whether a simulation is considered to be terminating depends on both the
objectives of the simulation study and the nature of the system.
Example 11.4 is a terminating system, too. It is also an example of a transient (or nonstationary)
simulation: the variables of interest are the in-process inventory levels, which are increasing from
zero or near zero (at time 0) to full or near full (at time 8 hours).
A nonterminating system is a system that runs continuously, or at least over a very long period of
time. Examples include assembly lines that shut down infrequently, continuous production systems
of many different types, telephone and other communications systems such as the Internet, hospital
emergency rooms, police dispatching and patrolling operations, fire departments, and continuously
operating computer networks.
A simulation of a nonterminating system starts at simulation time 0 under initial conditions
defined by the analyst and runs for some analyst-specified period of time T_E. (Significant problems
arise concerning the specification of these initial and stopping conditions, problems that we discuss
later.) Usually, the analyst wants to study steady-state, or long-run, properties of the system—that
is, properties that are not influenced by the initial conditions of the model at time 0. A steady-
state simulation is a simulation whose objective is to study long-run, or steady-state, behavior of a
nonterminating system.
The next two examples are steady-state simulations.
Example 11.5
Consider the manufacturing process of Example 11.4, beginning with the second shift, when the
complete production process is under way. It is desired to estimate long-run production levels and
production efficiencies. For the relatively long period of 13 shifts, this may be considered as a steady-state simulation. To obtain sufficiently precise estimates of production efficiency and other response variables, the analyst could decide to simulate for any length of time, T_E (even longer than 13 shifts). That is, T_E is not determined by the nature of the problem (as it was in terminating simulations);
rather, it is set by the analyst as one parameter in the design of the simulation experiment.
Example 11.6
HAL Inc., a large web-based order-processing company, has many customers worldwide. Thus, its
large computer system with many servers, workstations, and peripherals runs continuously, 24 hours
per day. To handle an increased work load, HAL is considering additional CPUs, memory, and
storage devices in various configurations. Although the load on HAL’s computers varies throughout
the day, management wants the system to be able to accommodate sustained periods of peak load.
Furthermore, the time frame in which HAL’s business will change in any substantial way is unknown,
so there is no fixed planning horizon. Thus, a steady-state simulation at peak-load conditions is
appropriate. HAL systems staff develops a simulation model of the existing system with the current
peak work load and then explores several possibilities for expanding capacity. HAL is interested in
long-run average throughput and utilization of each computer. The stopping time T_E is determined
not by the nature of the problem, but rather by the simulation analyst, either arbitrarily or with a
certain statistical precision in mind.
This is clearly a terminating simulation experiment with time 0 corresponding to 8 A.M. and T_E corresponding to 4 P.M. (another way to terminate the simulation is when the last call that arrives before 4 P.M. is completed, in which case T_E is a random value). Table 11.1 shows the results from
4 replications of the SMP simulation; each row presents the average number of minutes callers spent
on hold, and the average number of callers on hold, during the course of a simulated 8-hour day.
Table 11.1 Results of Four Independent Replications of the SMP Call Center
The table illustrates that there is significant variation in these averages from day to day, implying
that a substantial number of replications may be required to precisely estimate, say, the mean values w_Q = E(W_Q) and L_Q = E(L_Q). Even with a large number of replications there will still be some
estimation error, which is why we want to report not just a single value (the point estimate), but also
a measure of its error in the form of a confidence interval or standard error.
These questions are addressed in Section 11.4 for terminating simulations such as Example 11.7.
Classic methods of statistics may be used because W_Q1, W_Q2, W_Q3, and W_Q4 constitute a random sample—that is, they are independent and identically distributed. In addition, w_Q = E(W_Qr) is the parameter being estimated, so each W_Qr is an unbiased estimate of the true mean waiting time w_Q.
The analysis of Example 11.7 is considered in Example 11.10 of Section 11.4. A survey of statistical
methods applicable to terminating simulations is given by Law [1980]. Additional guidance may be
found in Alexopoulos and Seila [1998], Kleijnen [1987], Law [2007], and Nelson [2001].
The next example illustrates the effects of correlation and initial conditions on the estimation of
long-run mean measures of performance of a system.
Example 11.8
Semiconductor wafer fabrication is an amazingly expensive process involving hundreds of millions
of dollars in processing equipment, material handling, and human resources. FastChip, Inc. wants
to use simulation to design a new “fab” by evaluating the steady-state mean cycle time (time from
wafer release to completion) that they can expect under a particular loading of two distinct products,
called C-chip and D-chip. The process of wafer fabrication consists of two basic steps, diffusion
and lithography, each of which contains many substeps, and the fabrication process requires multiple
passes through these steps. Product will be released in cassettes at the rate of 1 cassette/hour, 7 days a week, 24 hours a day, to achieve a desired throughput of 1 cassette/hour. The hope is that the current
design will achieve long-run average cycle times of less than 45 hours for the C-chip and 30 hours
for the D-chip.
Figure 11.2 shows the first 30 Chip-C cycle times from one run of the simulation model of
the fab. The goal is to estimate the long-run average cycle time as, conceptually, the number of
cassettes produced goes to infinity. Notice that the first few cycle times seem lower than the ones
that follow; this occurs because the simulation was started with the fab “empty and idle,” which is
not representative of long-run or steady-state conditions. Also notice that when there is a low cycle
time (say, around 40 hours), it tends to be followed by several additional low cycle times. Thus, the
cycle time of successive cassettes may be correlated. These features suggest that the cycle-time data
Figure 11.2 First 30 Chip-C cycle times from one simulation run.
are not a random sample, so classic statistical methods may not apply. The analysis of Example 11.8
is considered in Section 11.5.
11.3 Absolute Measures of Performance and Their Estimation

Consider the estimation of a performance parameter, θ (or φ), of a simulated system. When the output data are of the discrete-time form {Y_1, Y_2, ..., Y_n}, the point estimator of θ is defined by

θ̂ = (1/n) Σ_{i=1}^{n} Y_i    (11.1)
where θ̂ is a sample mean based on a sample of size n. Computer simulation languages may refer to this as a “discrete-time,” “collect,” “tally,” or “observational” statistic.
The point estimator θ̂ is said to be unbiased for θ if its expected value is θ; that is, if

E(θ̂) = θ    (11.2)

In general, however,

E(θ̂) ≠ θ    (11.3)

and E(θ̂) − θ is called the bias in the point estimator θ̂. It is desirable to have estimators that are unbiased, or, if this is not possible, to have a small bias relative to the magnitude of θ. Examples of estimators of the form of Equation (11.1) include ŵ and ŵ_Q of Equations (6.5) and (6.7), in which case Y_i is the time spent in the (sub)system by customer i.
The point estimator of φ, based on the data {Y(t), 0 ≤ t ≤ T_E}, where T_E is the simulation run length, is defined by

φ̂ = (1/T_E) ∫_0^{T_E} Y(t) dt    (11.4)

a time average of Y(t). In general,

E(φ̂) ≠ φ    (11.5)

and φ̂ is said to be biased for φ. Again, we would like to obtain unbiased or low-bias estimators. Examples of time averages include L̂ and L̂_Q of Equations (6.3) and (6.4) and φ̂ of Equation (11.4).
Generally, θ and φ are regarded as mean measures of performance of the system being simulated. Other measures usually can be put into this common framework. For example, consider estimation of the proportion of days on which sales are lost through an out-of-stock situation. In the simulation, let

Y_i = 1 if out of stock on day i, and Y_i = 0 otherwise.

With n equal to the total number of days, θ̂, as defined by Equation (11.1), is a point estimator of θ, the proportion of out-of-stock days. For a second example, consider estimation of the proportion of time the queue length is greater than 10 customers. If L_Q(t) represents the simulated queue length at time t, then (in the simulation) define

Y(t) = 1 if L_Q(t) > 10, and Y(t) = 0 otherwise.

Then φ̂, as defined by Equation (11.4), is a point estimator of φ, the proportion of time that the queue length is greater than 10 customers. Thus, estimation of proportions or probabilities is a special case of the estimation of means.
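As a rough illustration of treating proportions as means, the following Python sketch computes both a daily out-of-stock proportion from indicator variables (Equation (11.1)) and a time-average indicator over a piecewise-constant queue-length path (Equation (11.4)). All data values are hypothetical.

# Discrete case: proportion of out-of-stock days, a sample mean of 0/1 indicators.
out_of_stock = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]        # Y_i for n = 10 simulated days
theta_hat = sum(out_of_stock) / len(out_of_stock)     # point estimate of theta

# Continuous case: proportion of time the queue length exceeds 10.
# The queue length is piecewise constant between events; each pair is
# (event time, queue length immediately after the event), up to time T_E.
events = [(0.0, 4), (2.5, 12), (6.0, 9), (7.5, 11), (9.0, 3)]
T_E = 10.0

time_above_10 = 0.0
for (t, L), (t_next, _) in zip(events, events[1:] + [(T_E, 0)]):
    if L > 10:
        time_above_10 += t_next - t     # accumulate intervals where Y(t) = 1

phi_hat = time_above_10 / T_E           # time-average of the indicator
print(theta_hat, phi_hat)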
A performance measure that does not fit this common framework is a quantile or percentile.
Quantiles describe the level of performance that can be delivered with a given probability p. For
instance, suppose that Y represents the delay in queue that a customer experiences in a service system, measured in minutes. Then the p = 0.85 quantile of Y is the value θ such that

Pr{Y ≤ θ} = p = 0.85

As a percentage, θ is the 100pth or 85th percentile of customer delay. Therefore, 85% of all customers will experience a delay of θ minutes or less. Stated differently, a customer has only a 0.15 probability of experiencing a delay of longer than θ minutes. A widely used performance measure is the median, which is the 0.5 quantile or 50th percentile.
The problem of estimating a quantile is the inverse of the problem of estimating a proportion or probability. In estimating a proportion, θ is given and the probability p is to be estimated; however, in estimating a quantile, p is given and θ is to be estimated.
The most intuitive method for estimating a quantile is to form a histogram of the observed values of Y, then find a value θ̂ such that 100p% of the histogram is to the left of (smaller than) θ̂. For instance, if we observe n = 250 customer delays {Y_1, ..., Y_250}, then an estimate of the 85th percentile of delay is a value θ̂ such that (0.85)(250) = 212.5, or approximately 213, of the observed values are less than or equal to θ̂. An obvious estimate is, therefore, to set θ̂ equal to the 213th smallest value in the sample (this requires sorting the data). When the output is a continuous-time process, such as the queue-length process {L_Q(t), 0 ≤ t ≤ T_E}, then a histogram gives the fraction of time that the process spent at each possible level (queue length, in this example). However, the method for quantile estimation remains the same: Find a value θ̂ such that 100p% of the histogram is to the left of θ̂.
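The sorting-based point estimate just described can be sketched in a few lines of Python; the delay data here are artificially generated placeholders.

import math
import random

random.seed(1)
delays = [random.expovariate(1 / 4.0) for _ in range(250)]   # 250 hypothetical delays (minutes)

p = 0.85
n = len(delays)
k = math.ceil(p * n)                  # 0.85 * 250 = 212.5, rounded up to 213
theta_hat = sorted(delays)[k - 1]     # the 213th smallest observed delay
print(f"Estimated {p} quantile: {theta_hat:.2f} minutes")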
be the sample variance across the R replications. The usual confidence interval, which assumes the Y_i are normally distributed, is

Ȳ ± t_{α/2,R−1} S/√R

where t_{α/2,R−1} is the quantile of the t distribution with R − 1 degrees of freedom that cuts off α/2 of the area of each tail. (See Table A.5.) We cannot know for certain exactly how far Ȳ is from θ, but the confidence interval attempts to bound that error. Unfortunately, the confidence interval itself may be wrong. A confidence level, such as 95%, tells us how much we can trust the interval to actually bound the error between Ȳ and θ. The more replications we make, the less error there is in Ȳ, and our confidence interval reflects that, because t_{α/2,R−1} S/√R will tend to get smaller as R increases, converging to 0 as R goes to infinity.
Now suppose we need to make a promise about what the average cycle time will be on a particular day. A good guess is our estimator Ȳ, but it is unlikely to be exactly right. Even θ itself, which is the center of the distribution, is not likely to be the actual average cycle time on any particular day, because the daily average cycle time varies. A prediction interval, on the other hand, is designed to be wide enough to contain the actual average cycle time on any particular day, with high probability. The normal-theory prediction interval is

Ȳ ± t_{α/2,R−1} S √(1 + 1/R)

The length of this interval will not go to 0 as R increases. In fact, in the limit it becomes

θ ± z_{α/2} σ

to reflect the fact that, no matter how much we simulate, our daily average cycle time still varies.
In summary, a prediction interval is a measure of risk, and a confidence interval is a measure of
error. We can simulate away error by making more and more replications, but we can never simulate
away risk, which is an inherent part of the system. We can, however, do a better job of evaluating
risk by making more replications.
Example 11.9
Suppose that the overall average of the average cycle time on 120 replications of a manufacturing
simulation is 5.80 hours, with a sample standard deviation of 1.60 hours. Since t_{0.025,119} = 1.98, a 95% confidence interval for the long-run expected daily average cycle time is 5.80 ± 1.98(1.60/√120), or 5.80 ± 0.29 hours. Thus, our best guess of the long-run average of the daily average cycle times is 5.80 hours, but there could be as much as ±0.29 hours error in this estimate.
On any particular day, we are 95% confident that the average cycle time for all parts produced on that day will be

5.80 ± 1.98(1.60)√(1 + 1/120)

or 5.80 ± 3.18 hours. The ±3.18 hours reflects the inherent variability in the daily average cycle
times and the fact that we want to be 95% confident of covering the actual average cycle time on a
particular day, rather than simply covering the long-run average.
A caution: In neither case (prediction interval or confidence interval) can we make statements
about the cycle time for individual products because we observed the daily average. To make valid
statements about the cycle times of individual products, the analysis must be based on individual
cycle times.
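The arithmetic of Example 11.9 can be reproduced directly; the Python sketch below simply recomputes the confidence-interval and prediction-interval half-widths from the summary values given in the example (the t quantile 1.98 is taken from the text).

import math

R = 120        # replications
y_bar = 5.80   # overall average of the daily average cycle times (hours)
S = 1.60       # sample standard deviation across replications (hours)
t = 1.98       # t_{0.025, 119}

ci_half = t * S / math.sqrt(R)            # confidence-interval half-width, about 0.29
pi_half = t * S * math.sqrt(1 + 1 / R)    # prediction-interval half-width, about 3.18

print(f"95% CI for the long-run mean daily average: {y_bar:.2f} +/- {ci_half:.2f} hours")
print(f"95% prediction interval for one day's average: {y_bar:.2f} +/- {pi_half:.2f} hours")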
11.4 Output Analysis for Terminating Simulations

Consider a terminating simulation that runs over a simulated time interval [0, T_E] and results in observations Y_1, ..., Y_n. The sample size n may be a fixed number, or it may be a random variable (say, the number of observations that occur during time T_E). A common goal in simulation is to estimate

θ = E[(1/n) Σ_{i=1}^{n} Y_i]

When the output data are of the form {Y(t), 0 ≤ t ≤ T_E}, the goal is to estimate

φ = E[(1/T_E) ∫_0^{T_E} Y(t) dt]
The method used in each case is the method of independent replications. The simulation is repeated
a total of R times, each run using a different random number stream and independently chosen initial
conditions (which includes the case that all runs have identical initial conditions). We now address
this problem.
number of parts produced in each replication might differ. Table 11.2 shows, symbolically, the results
of R replications.
The across-replication data are formed by summarizing within-replication data:¹ Ȳ_i· is the sample mean of the n_i cycle times from the ith replication, S_i² is the sample variance of the same data, and the within-replication confidence-interval half-width is

H_i = t_{α/2,n_i−1} S_i/√n_i    (11.7)

The across-replication data are summarized by the overall sample average

Ȳ·· = (1/R) Σ_{i=1}^{R} Ȳ_i·    (11.8)

the sample variance of the replication averages

S² = (1/(R − 1)) Σ_{i=1}^{R} (Ȳ_i· − Ȳ··)²    (11.9)

and finally, the confidence-interval half-width

H = t_{α/2,R−1} S/√R    (11.10)

The quantity S/√R is the standard error, which is sometimes interpreted as the average error in Ȳ·· as an estimator of θ. Notice that S² is not the average of the within-replication sample variances S_i²; rather, it is the sample variance of the within-replication averages Ȳ_1·, Ȳ_2·, ..., Ȳ_R·.
¹We use the convention that a dot, as in the subscript i·, indicates summation over the indicated subscript; and a bar, as in Ȳ_i·, indicates an average over that subscript.
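A small Python sketch of the across-replication summary of Equations (11.8)-(11.10) follows. The replication averages are hypothetical placeholders, and the t quantile is hard-coded from Table A.5 for R = 6.

import math
import statistics

rep_averages = [4.9, 6.3, 5.1, 5.8, 4.4, 6.0]   # hypothetical replication averages, R = 6

R = len(rep_averages)
y_bar = statistics.mean(rep_averages)            # Equation (11.8)
S2 = statistics.variance(rep_averages)           # Equation (11.9), sample variance of the averages
S = math.sqrt(S2)

t_quantile = 2.571                               # t_{0.025, R-1} for R = 6, from Table A.5
H = t_quantile * S / math.sqrt(R)                # Equation (11.10)

print(f"Point estimate {y_bar:.2f}, standard error {S / math.sqrt(R):.2f}, 95% half-width {H:.2f}")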
Within a replication, work in process is a continuous-time output, denoted Y_i(t). The stopping time for the ith replication, T_{E_i}, could be a random variable, in general; in this example, it is the end of the second shift. Table 11.3 is an abstract representation of the data produced.
The within-replication sample mean and variance are defined appropriately for continuous-time data:

Ȳ_i· = (1/T_{E_i}) ∫_0^{T_{E_i}} Y_i(t) dt    (11.11)

and

S_i² = (1/T_{E_i}) ∫_0^{T_{E_i}} (Y_i(t) − Ȳ_i·)² dt    (11.12)
H_i is the corresponding within-replication confidence-interval half-width (Equation (11.13)).
• The overall sample average Ȳ·· and the individual replication sample averages Ȳ_i· are always unbiased estimators of the expected daily average cycle time or daily average WIP.
• Across-replication data are independent, since they are based on different random numbers; are identically distributed, since we are running the same model on each replication; and tend to be normally distributed, if they are averages of within-replication data, as they are here. This implies that the confidence interval Ȳ·· ± H is often pretty good.
• Within-replication data, on the other hand, might have none of these properties. The individual cycle times may not be identically distributed, if the first few parts of the day find the system empty; they are almost certainly not independent, because one part follows another; and whether they are normally distributed is difficult to know in advance. For this reason, S_i² and H_i, which are computed under the assumption of independent and identically distributed (i.i.d.) data, tend not to be useful, although there are exceptions.
• There are situations in which Ȳ·· and Ȳ_i· are valid estimators of the expected cycle time for an individual part or the expected WIP at any point in time, rather than the daily average. (See Section 11.5 on steady-state simulations.) Even when this is the case, the confidence interval Ȳ·· ± H is valid, and Ȳ_i· ± H_i is not. The difficulty occurs because S_i² is a reasonable estimator of the variance of the cycle time, but S_i²/n_i and S_i²/T_{E_i} are not good estimators of Var[Ȳ_i·]; more on this in Section 11.5.2.
Example 11.10
For the four replications of the call-center simulation (Table 11.1), the overall point estimate of mean waiting time is

W̄_Q = (0.88 + 5.04 + 4.13 + 0.52)/4 = 2.64 minutes

Thus, the standard error of W̄_Q = 2.64 is estimated by s.e.(W̄_Q) = S/√4 = 1.14. Obtain t_{0.025,3} = 3.18 from Table A.5, and compute the 95% confidence interval half-width by Equation (11.10) as

H = t_{α/2,R−1} S/√R = (3.18)(1.14) = 3.62
sort of useful estimate of wg, since a negative mean waiting time is impossible. Later in the chapter
we return to this problem and determine how many replications are required to get a meaningful
estimate.
where S² is the sample variance and R is the number of replications. In simulation, we seldom need to select R in advance and simply accept the confidence-interval half-length that results. Instead, we can drive the number of replications to be large enough so that H is small enough to facilitate the decision that the simulation is supposed to support. The acceptable size of H depends on the problem at hand: it might be ±$5000 in a financial simulation, ±6 minutes for product cycle time, or ±1 customer on hold in a call center.
Suppose that an error criterion ε is specified; in other words, it is desired to estimate θ by Ȳ·· to within ±ε with high probability—say, at least 1 − α. Thus, it is desired that a sufficiently large sample size R be taken to satisfy

Pr{|Ȳ·· − θ| < ε} ≥ 1 − α

When the sample size R is fixed, no guarantee can be given for the resulting error. But if the sample size can be increased, an error criterion (such as ε = $5000, 6 minutes, or 1 customer) can be specified.
Assume that an initial sample of size R_0 replications has been observed—that is, the simulation analyst initially makes R_0 independent replications. We must have R_0 ≥ 2, with 10 or more being desirable. The R_0 replications will be used to obtain an initial estimate S_0² of the population variance σ². To meet the half-length criterion, a sample size R must be chosen such that R ≥ R_0 and

H = t_{α/2,R−1} S_0/√R ≤ ε    (11.14)

Solving for R in Inequality (11.14) shows that R is the smallest integer satisfying R ≥ R_0 and

R ≥ (t_{α/2,R−1} S_0/ε)²    (11.15)

An initial estimate for R is given by

R ≥ (z_{α/2} S_0/ε)²    (11.16)

where z_{α/2} is the 100(1 − α/2) percentage point of the standard normal distribution from Table A.3. And since t_{α/2,R−1} ≈ z_{α/2} for large R, say, R ≥ 50, the second inequality for R is adequate when R is large. After determining the final sample size R, collect R − R_0 additional observations (i.e., make R − R_0 additional replications, or start over and make R total replications) and form the 100(1 − α)% confidence interval for θ by

Ȳ·· − t_{α/2,R−1} S/√R ≤ θ ≤ Ȳ·· + t_{α/2,R−1} S/√R    (11.17)
where Ȳ·· and S² are computed on the basis of all R replications, Ȳ·· by Equation (11.8) and S² by Equation (11.9). The half-length of the confidence interval given by Inequality (11.17) should be approximately ε or smaller; however, with the additional R − R_0 observations, the variance estimator S² could differ somewhat from the initial estimate S_0², possibly causing the half-length to be greater
than desired. If the confidence interval from Inequality (11.17) is too large, the procedure may be
repeated, using Inequality (11.15), to determine an even larger sample size.
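The iterative sample-size procedure can be sketched as below (Python, using scipy for the normal and t quantiles). The function name and the numerical inputs are illustrative assumptions, not part of the text.

import math
from scipy.stats import norm, t

def required_replications(S0, epsilon, alpha=0.05, R0=10):
    # Initial estimate from Inequality (11.16) using the normal quantile.
    z = norm.ppf(1 - alpha / 2)
    R = max(R0, math.ceil((z * S0 / epsilon) ** 2))
    # Increase R until the half-width criterion of Inequality (11.14) is met.
    while t.ppf(1 - alpha / 2, R - 1) * S0 / math.sqrt(R) > epsilon:
        R += 1
    return R

# Hypothetical use: S0 = 1.60 hours estimated from R0 = 10 replications,
# desired half-width epsilon = 0.25 hours at alpha = 0.05.
print(required_replications(S0=1.60, epsilon=0.25))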
Figure 11.3 Input Dialog for the “Get Sample Size” utility of SimulationTools.xls
11.4.3 Quantiles
To present the interval estimator for quantiles, it is helpful to review the interval estimator for a mean in the special case when the mean represents a proportion or probability p. In this book, we have chosen to treat a proportion or probability as just a special case of a mean. However, in many statistics texts, probabilities are treated separately.
When the number of independent replications Y_1, ..., Y_R is large enough that t_{α/2,R−1} ≈ z_{α/2}, the confidence interval for a probability p is often written as

p̂ ± z_{α/2} √(p̂(1 − p̂)/(R − 1))

where p̂ is the sample proportion (algebra shows that this formula for the half-width is precisely equivalent to Equation (11.10) when used in estimating a proportion).
As mentioned in Section 11.3, the quantile-estimation problem is the inverse of the probability-estimation problem: Find θ such that Pr{Y ≤ θ} = p. Thus, to estimate the p quantile, we find that value θ̂ such that 100p% of the data in a histogram of Y is to the left of θ̂; stated differently, θ̂ is the Rpth smallest value of Y_1, ..., Y_R.
Extending this idea, an approximate (1 − α)100% confidence interval for θ can be obtained by finding two values: θ̂_ℓ that cuts off 100p_ℓ% of the histogram and θ̂_u that cuts off 100p_u% of the
histogram, where

p_ℓ = p − z_{α/2} √(p(1 − p)/(R − 1))

p_u = p + z_{α/2} √(p(1 − p)/(R − 1))    (11.18)

(Recall that we know p.) In terms of sorted values, θ̂_ℓ is the Rp_ℓ smallest value (rounded down) and θ̂_u is the Rp_u smallest value (rounded up) of Y_1, ..., Y_R.
For example, with R = 300 daily average delays, p = 0.75, and α = 0.05 (so z_{α/2} = 1.96),

p_ℓ = 0.75 − 1.96 √(0.75(0.25)/299) = 0.700

p_u = 0.75 + 1.96 √(0.75(0.25)/299) = 0.799

The lower bound of the confidence interval is θ̂_ℓ = 3.13 minutes (the 300 × p_ℓ ≈ 210th smallest value, rounding down); the upper bound of the confidence interval is θ̂_u = 3.98 minutes (the 300 × p_u ≈ 240th smallest value, rounding up). Thus, with 95% confidence, the point estimate of 3.39 minutes is no more than max{3.98 − 3.39, 3.39 − 3.13} = 0.59 minutes off the true 0.75 quantile.
It is important to understand that the 3.39 minutes is an estimate of the 0.75 quantile (75th percentile)
of daily average customer delay, not an individual customer’s delay. Because the call center starts
the day empty and idle, and the customer load on the system varies throughout the day, the 0.75
quantile of individual customer waiting time varies throughout the day and cannot be given by a
single number.
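A sketch of the quantile point estimate and the confidence interval of Equation (11.18) is given below; the 300 daily average delays are artificially generated stand-ins, and z_{0.025} = 1.96 is taken from Table A.3.

import math
import random

random.seed(7)
Y = [random.gauss(3.0, 0.8) for _ in range(300)]   # hypothetical daily average delays (minutes)

p, z = 0.75, 1.96
R = len(Y)
half = z * math.sqrt(p * (1 - p) / (R - 1))
p_lo, p_hi = p - half, p + half

Y_sorted = sorted(Y)
theta_hat = Y_sorted[math.ceil(p * R) - 1]        # point estimate: the Rp-th smallest value
theta_lo = Y_sorted[math.floor(p_lo * R) - 1]     # R*p_lo-th smallest value, rounded down
theta_hi = Y_sorted[math.ceil(p_hi * R) - 1]      # R*p_hi-th smallest value, rounded up

print(f"{p} quantile estimate {theta_hat:.2f}, approximate 95% CI [{theta_lo:.2f}, {theta_hi:.2f}]")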
Knowing the equation for the confidence-interval half-width is important if all that the simulation software provides is Ȳ·· and H and you need to work out the number of replications required to get a prespecified precision, or if you need to estimate a probability or quantile. You know the number of replications, so the sample standard deviation can be extracted from H by using the formula

S = H√R / t_{α/2,R−1}
Pr{Ȳ_i· ≤ c} ≈ Pr{Z ≤ (c − Ȳ··)/S}

and

θ̂ ≈ Ȳ·· + z_p S

The following example illustrates how this is done. Using the normal approximation and Table A.3,

Pr{Ȳ_i· ≤ 350} ≈ Pr{Z ≤ 1.42} = 0.92
11.5 Output Analysis for Steady-State Simulations

Consider a single run of a simulation model whose purpose is to estimate a steady-state, or long-run, characteristic of the system. Suppose that the single run produces observations Y_1, Y_2, ..., which,
generally, are samples of an autocorrelated time series. The steady-state (or long-run) measure of performance, θ, is defined by

θ = lim_{n→∞} (1/n) Σ_{i=1}^{n} Y_i    (11.19)

with probability 1, where the value of θ is independent of the initial conditions. (The phrase “with probability 1” means that essentially all simulations of the model, using different random numbers, will produce a series Y_i, i = 1, 2, ... whose sample average converges to θ.) For example, if Y_i was the time customer i spent talking to an operator, then θ would be the long-run average time a customer spends talking to an operator; and, because θ is defined as a limit, it is independent of the call center’s conditions at time 0. Similarly, the steady-state performance for a continuous-time output measure {Y(t), t ≥ 0}, such as the number of customers in the call center’s hold queue, is defined as

φ = lim_{T_E→∞} (1/T_E) ∫_0^{T_E} Y(t) dt

with probability 1.
Of course, the simulation analyst could decide to stop the simulation after some number of observations, say, n, have been collected; or the simulation analyst could decide to simulate for some length of time T_E that determines n (although n may vary from run to run). The sample size n (or T_E) is a design choice; it is not inherently determined by the nature of the problem. The simulation analyst will choose simulation run length (n or T_E) with several considerations in mind:
1. Any bias in the point estimator that is due to artificial or arbitrary initial conditions. (The bias
can be severe if run length is too short, but generally it decreases as run length increases.)
2. The desired precision of the point estimator, as measured by the standard error or confidence
interval half-width.
3. Budget constraints on the time available to execute the simulation.
The next subsection discusses initialization bias and the following subsections outline two
methods of estimating point-estimator variability. For clarity of presentation, we discuss only estimation of θ from a discrete-time output process. Thus, when discussing one replication (or run), the notation

Y_1, Y_2, ..., Y_n

will be used; if several replications have been made, the output data for replication r will be denoted by

Y_{r1}, Y_{r2}, Y_{r3}, ..., Y_{rn}    (11.20)
state that is more representative of long-run conditions. This method is sometimes called intelligent
initialization. Examples include
1. setting the inventory levels, number of backorders, and number of items on order and their
arrival dates in an inventory simulation;
2. placing customers in queue and in service in a queueing simulation;
3. having some components fail or degrade in a reliability simulation.
There are at least two ways to specify the initial conditions intelligently. If the system exists,
collect data on it and use these data to specify more nearly typical initial conditions. This method
sometimes requires a large data-collection effort. In addition, if the system being modeled does not
exist—for example, if it is a variant of an existing system—this method is impossible to implement.
Nevertheless, it is recommended that simulation analysts use any available data on existing systems to
help initialize the simulation, as this will usually be better than assuming the system to be “completely
stocked,” “empty and idle,” or “brand new” at time 0.
A related idea is to obtain initial conditions from a second model of the system that has been
simplified enough to make it mathematically solvable. The queueing models in Chapter 6 are very
useful for this purpose. The simplified model can be solved to find long-run expected or most likely
conditions—such as the expected number of customers in the queue—and these conditions can be
used to initialize the simulation.
A second method to reduce the impact of initial conditions, possibly used in conjunction with the first, is to divide each simulation run into two phases: first, an initialization phase, from time 0 to time T_0, followed by a data-collection phase from time T_0 to the stopping time T_0 + T_E; that is, the simulation begins at time 0 under specified initial conditions I_0 and runs for a specified period of time T_0. Data collection on the response variables of interest does not begin until time T_0 and continues until time T_0 + T_E. The choice of T_0 is quite important, because the system state at time T_0, denoted by I, should be more nearly representative of steady-state behavior than are the original initial conditions I_0 at time 0. In addition, the length T_E of the data-collection phase should be long enough to guarantee sufficiently precise estimates of steady-state behavior. Notice that the system state I at time T_0 is a random variable, and to say that the system has reached an approximate steady state is to say that the probability distribution of the system state at time T_0 is sufficiently close to the steady-state probability distribution as to make the bias in point estimates of response variables negligible. Figure 11.5 illustrates the two phases of a steady-state simulation. The effect of starting a simulation run of a queueing system in the empty and idle state, as well as a useful plot to aid the simulation analyst in choosing an appropriate value for T_0, are given in the following example.
Figure 11.5 Initialization and data collection phases of a steady-state simulation run.
Example 11.14
Consider the FastChip wafer fabrication problem discussed in Example 11.8. Suppose that a total
of R = 10 independent replications were made, each run long enough so that 250 cycle times were
collected for each of Chip-C and Chip-D. We will focus on Chip-C.
Normally we average all data within each replication to obtain a replication average. However,
our goal at this stage is to identify the trend in the data due to initialization bias and find out when it
dissipates. To do this, we will average corresponding cycle times across replications and plot them
(this idea is usually attributed to Welch [1983]). Such averages are known as ensemble averages.
Specifically, let Y_{r1}, Y_{r2}, ..., Y_{r,250} be the 250 Chip-C cycle times, in order of completion, from replication r. For the jth cycle time, define the ensemble average across all R replications to be

Ȳ_{·j} = (1/R) Σ_{r=1}^{R} Y_{rj}    (11.21)

(R = 10 here). The ensemble averages are plotted as the solid line in Figure 11.6 (the dashed lines will be discussed below). This plot was produced by the SimulationTools.xls spreadsheet, and we will refer to it as a mean plot.
Notice the gradual upward trend at the beginning. The simulation analyst may suspect that this
is due to the downward bias in these estimators, which in turn is due to the fab being empty and idle
at time 0. As time becomes larger (so the number of cycle times recorded increases), the effect of the
initial conditions on later observations lessens and the observations appear to vary around a common
mean. When the simulation analyst feels that this point has been reached, then the data-collection
phase begins. Here approximately 120 Chip-C cycle times might be adequate.
Although we have determined a deletion amount of 120 cycle times, the initialization phase is
almost always specified in terms of simulation time T_0, because we need to consider the initialization
phase for each performance measure, and they may differ. In the FastChip example we would also
Figure 11.6 Mean Plot Across 10 Replications of 250 Product C Cycle Times
need to examine the cycle times for Chip-D. If we specified the initialization phase as a count for
each output, then we would have to track these counts. When the initialization phase is specified
by time, on the other hand, then a single event can be scheduled at time T_0, and data only retained beyond that point.
To specify T_0 by time, we have to convert the appropriate counts into approximate times and use the largest of these. For instance, for the FastChip simulation, 120 Chip-C cycle times corresponds to approximately 200 hours of simulation time, since cassettes are released at 1 per hour and 60% of the releases are Chip-C: (120 cycle times)/(0.6 releases/hour) = 200 hours (we could also increase this somewhat to account for the time it takes the 120th Chip-C release to traverse the fab). This turns out to be larger than the initialization phase for Chip-D, so we set T_0 = 200 hours.
Plotting ensemble averages is extra work (although SimulationTools.xls makes it easier), so it might be tempting to take a shortcut. For instance, one might plot the output from only replication 1, Y_{11}, Y_{12}, ..., Y_{1n}, or the cumulative average from a replication,

Ȳ_{1ℓ} = (1/ℓ) Σ_{j=1}^{ℓ} Y_{1j}

for ℓ = 1, 2, ..., n. Figure 11.7 shows both. Notice that the raw data from a single replication can be too highly variable to detect the trend, while the cumulative plot is clearly biased low because it retains all of the data from the beginning of the run. Thus, these shortcuts should be avoided and the mean plot used instead.
Figure 11.7 Plot of 250 Product C Cycle Times from 1 Replication, Raw and Cumulative Average
Some additional factors to keep in mind when using the mean plot to determine the initialization
phase:
1. When first starting to examine the initialization phase, a run length and number of replications
will have to be guessed. Ensemble averages, such as Figure 11.6, will reveal a smoother and
more precise trend as the number of replications R is increased, and it may have to be increased
several times until the trend is clear. It is also possible that the run length guess is not long
enough to get past the initialization phase (this would be indicated if the trend in the mean
plot does not disappear), so the run length may have to be increased as well.
2. Ensemble averages can also be smoothed by plotting a moving average rather than the original ensemble averages. In a moving average, each plotted point is actually the average of several adjacent ensemble averages. Specifically, the jth plot point would be

Ȳ_{·j}(m) = (1/(2m + 1)) Σ_{i=j−m}^{j+m} Ȳ_{·i}

for some m ≥ 1, rather than the original ensemble average Ȳ_{·j}. The value of m is typically chosen by trial and error until a smooth plot is obtained. The mean plot in Figure 11.6 was obtained with m = 2. (A computational sketch of ensemble and moving averages follows this list.)
3. Since each ensemble average (or moving average) is the sample mean of i.i.d. observations
across R replications, a confidence interval based on the t distribution can be placed around
each point, as shown by the dashed lines in Figure 11.6, and these intervals can be used to
judge whether or not the plot is precise enough to decide that bias has diminished. This is
the preferred method to determine a deletion point.
4. Cumulative averages, such as in Figure 11.7, become less variable as more data are averaged.
Therefore, it is expected that the left side of the curve will always be less smooth than the right
side. Remember that cumulative averages tend to converge more slowly to long-run perfor-
mance than do ensemble averages, because cumulative averages contain all observations,
including the most biased ones from the beginning of the run. For this reason, cumulative
averages should be used only if it is not feasible to compute ensemble averages, such as when
only a single replication is possible.
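The ensemble averages of Equation (11.21) and the moving average described in item 2 can be computed with a few lines of Python. The data array below is a hypothetical placeholder in which Y[r][j] is the jth cycle time from replication r; window edges are handled by truncating the window, a simplification of the centered average in the text.

import random

random.seed(3)
R, n, m = 10, 250, 2
# Hypothetical within-replication data with an artificial warm-up trend.
Y = [[40 + min(j, 50) * 0.1 + random.gauss(0, 5) for j in range(n)] for _ in range(R)]

# Ensemble average for each observation index j, Equation (11.21).
ensemble = [sum(Y[r][j] for r in range(R)) / R for j in range(n)]

# Moving average of the ensemble averages with window parameter m.
smoothed = []
for j in range(n):
    window = ensemble[max(0, j - m):min(n, j + m + 1)]
    smoothed.append(sum(window) / len(window))

# A deletion point is chosen where the smoothed mean plot levels off.
print(smoothed[:3], smoothed[120:123])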
There has been no shortage of solutions to the initialization-bias problem. Unfortunately, for
every “solution” that works well in some situations, there are other situations in which either it is
not applicable or it performs poorly. Important ideas include testing for bias (e.g., Kelton and Law
[1983], Schruben [1980], Goldsman, Schruben, and Swain [1994]); modeling the bias (e.g., Snell
and Schruben [1985]); and randomly sampling the initial conditions on multiple replications (e.g.,
Kelton [1989]).
{Y_1, ..., Y_n} are not statistically independent, then S²/n, given by Equation (11.9), is a biased estimator of the true variance V(θ̂). This is almost always the case when {Y_1, ..., Y_n} is a sequence of
output observations from within a single replication. In this situation, Y_1, Y_2, ... is an autocorrelated sequence, sometimes called a time series.
Suppose that our point estimator for θ is the sample mean Ȳ = (1/n) Σ_{i=1}^{n} Y_i. A general result² from mathematical statistics is that the variance of Ȳ is

V(Ȳ) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} cov(Y_i, Y_j)    (11.22)

where cov(Y_i, Y_i) = V(Y_i). To construct a confidence interval for θ, an estimate of V(Ȳ) is required. But obtaining an estimate of Equation (11.22) is futile, because each term cov(Y_i, Y_j) could be different, in general. Fortunately, systems that have a steady state will, if simulated long enough to pass the transient phase, produce an output process that is approximately covariance stationary. Intuitively, stationarity implies that Y_{i+k} depends on Y_i in the same manner as Y_{j+k} depends on Y_j. In particular, the covariance between two random variables in the time series depends only on the number of observations between them, called the lag.
For a covariance-stationary time series, {Y_1, Y_2, ...}, define the lag-k autocovariance by γ_k = cov(Y_i, Y_{i+k}), which, by stationarity, does not depend on i, and the lag-k autocorrelation by

ρ_k = γ_k / σ²    (11.25)

where σ² = V(Y_i). If a time series is covariance stationary, then Equation (11.22) can be simplified substantially. Algebra shows that

V(Ȳ) = (σ²/n) [1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρ_k]    (11.26)

²This general result can be derived from the fact that, for two random variables Y_1 and Y_2, V(Y_1 + Y_2) = V(Y_1) + V(Y_2) + 2 cov(Y_1, Y_2).
Figure 11.8 (a) Stationary time series Y_i exhibiting positive autocorrelation; (b) stationary time series Y_i exhibiting negative autocorrelation; (c) nonstationary time series with an upward trend.
On the other hand, if some of the ρ_k < 0, the series Y_1, Y_2, ... will display the characteristics of negative autocorrelation. In this case, large observations tend to be followed by small observations, and vice versa. Figure 11.8(b) is an example of a stationary time series exhibiting negative autocorrelation. The output of certain inventory simulations might be negatively autocorrelated.
Figure 11.8(c) also shows an example of a time series with an upward trend. Such a time series
is not stationary; the probability distribution of Y_i is changing with the index i.
Why does autocorrelation make it difficult to estimate V(Ȳ)? Recall that the standard estimator for the variance of a sample mean is S²/n. By using Equation (11.26), it can be shown [Law, 1977] that the expected value of the variance estimator S²/n is

E(S²/n) = B V(Ȳ)    (11.27)

where

B = (n/c − 1)/(n − 1)    (11.28)

and c is the quantity in brackets in Equation (11.26). The effect of the autocorrelation on the estimator S²/n is derived by an examination of Equations (11.26) and (11.28). There are essentially three possibilities:
Case 1
If the Y_i are independent, then ρ_k = 0 for k = 1, 2, ..., so that c = 1 and B = 1; in this case, S²/n is an unbiased estimator of V(Ȳ).
Case 2
If the autocorrelations ρ_k are primarily positive, then c = 1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρ_k > 1, so that n/c < n, and hence B < 1. Therefore, S²/n is biased low as an estimator of V(Ȳ). If this correlation were ignored, the nominal 100(1 − α)% confidence interval given by Expression (11.10) would be too short, and its true confidence coefficient would be less than 1 − α. The practical effect would be that the simulation analyst would have unjustified confidence in the apparent precision of the point estimator due to the shortness of the confidence interval. If the correlations ρ_k are large, B could be quite small, implying a significant underestimation.
Case 3
If the autocorrelations ρ_k are substantially negative, then 0 < c < 1, and it follows that B > 1 and S²/n is biased high for V(Ȳ). In other words, the true precision of the point estimator Ȳ would be greater than what is indicated by its variance estimator S²/n, because

V(Ȳ) < E(S²/n)

As a result, the nominal 100(1 − α)% confidence interval of Expression (11.10) would have true confidence coefficient greater than 1 − α. This error is less serious than Case 2, because we are unlikely to make incorrect decisions if our estimate is actually more precise than we think it is.
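To see how large the distortion can be, the short Python sketch below evaluates c and the bias factor B of Equations (11.26)-(11.28) for an assumed geometric autocorrelation structure, rho_k = 0.8**k; the structure is purely illustrative.

n = 100
rho = [0.8 ** k for k in range(1, n)]                             # assumed lag-k autocorrelations

c = 1 + 2 * sum((1 - k / n) * rho[k - 1] for k in range(1, n))    # bracketed term in (11.26)
B = (n / c - 1) / (n - 1)                                         # Equation (11.28)

print(f"c = {c:.2f}, B = {B:.3f}")
# With positive correlation c > 1, so B < 1: S^2/n underestimates V(Y-bar),
# and the nominal confidence interval is too short.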
A simple example demonstrates why we are especially concerned about positive correlation:
Suppose you want to know how students on a university campus will vote in an upcoming election. To
estimate their preferences, you plan to solicit 100 responses. The standard experiment is to randomly
select 100 students to poll; call this experiment A. An alternative is to randomly select 20 students
and ask each of them to state their preference 5 times in the same day; call this experiment B.
Both experiments obtain 100 responses, but clearly an estimate based on experiment B will be
less precise (will have larger variance) than an estimate based on experiment A. Experiment A
obtains 100 independent responses, whereas experiment B obtains only 20 independent responses
and 80 dependent ones. The five opinions from any one student are perfectly positively correlated
(assuming a student names the same candidate all five times). Although this is an extreme example, it
illustrates that estimates based on positively correlated data are more variable than estimates based on
independent data. Therefore, a confidence interval or other measure of error should account correctly
for dependent data, but S²/n does not.
Two methods for eliminating or reducing the deleterious effects of autocorrelation upon estimation of a mean are given in the following sections. Unfortunately, some simulation languages either use or facilitate the use of S²/n as an estimator of V(Ȳ), the variance of the sample mean, in all situations. If used uncritically in a simulation with positively autocorrelated output data, the downward bias in S²/n and the resulting shortness of a confidence interval for θ will convey the impression of much greater precision than actually exists. When such positive autocorrelation is present in the output data, the true variance of the point estimator Ȳ can be many times greater than is indicated by S²/n.
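To make the effect of positive autocorrelation concrete, the following short Python sketch (ours, not part of the text) simulates an AR(1) process, whose positive lag correlations play the role of the ρ_k above, and compares the average value of the naive estimator S²/n with the observed variance of the sample mean across many replications. The function names and parameter values are illustrative only.

    import numpy as np

    rng = np.random.default_rng(42)

    def ar1_path(n, phi, sigma=1.0):
        # Stationary AR(1): Y_i = phi * Y_{i-1} + eps_i, so rho_k = phi**k > 0 when phi > 0.
        y = np.empty(n)
        y[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))  # draw Y_1 from the stationary distribution
        eps = rng.normal(0.0, sigma, size=n)
        for i in range(1, n):
            y[i] = phi * y[i - 1] + eps[i]
        return y

    n, phi, reps = 1000, 0.8, 500
    naive, means = [], []
    for _ in range(reps):
        y = ar1_path(n, phi)
        naive.append(y.var(ddof=1) / n)   # the estimator S^2/n from one run
        means.append(y.mean())

    print("average of S^2/n           :", np.mean(naive))          # biased low (Case 2)
    print("observed variance of Y-bar :", np.var(means, ddof=1))   # several times larger

Running such a sketch shows the naive estimator understating the true variability of the sample mean by roughly the factor B of Equation (11.28).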
If initialization bias in the point estimator has been reduced to a negligible level (through some
combination of intelligent initialization and deletion), then the method of independent replications
can be used to estimate point-estimator variability and to construct a confidence interval. The basic
idea is simple: Make R replications, initializing and deleting from each one the same way.
If, however, significant bias remains in the point estimator and a large number of replications
are used to reduce point-estimator variability, the resulting confidence interval can be misleading.
This happens because bias is not affected by the number of replications R; it is affected only by deleting more data (i.e., increasing T_0) or extending the length of each run (i.e., increasing T_E).
Thus, increasing the number of replications R could produce shorter confidence intervals around the
“wrong point.” Therefore, it is important to do a thorough job of investigating the initial-condition
bias.
If the simulation analyst decides to delete d observations of the total of n observations in a replication, then the point estimator of θ is

    Ȳ..(n, d) = (1/R) Σ_{r=1}^{R} [ 1/(n − d) ] Σ_{j=d+1}^{n} Y_{rj}    (11.29)

That is, the point estimator is the average of the remaining data. The basic raw output data {Y_{rj}, r = 1, ..., R; j = 1, ..., n} are exhibited in Table 11.4. For instance, Y_{rj} could be the delay of customer j in queue, or the response time of job j in a job shop, on replication r. The number d of deleted
Table 11.4 Raw Output Data Y_{rj} from R Replications (observations j = 1, ..., d, d + 1, ..., n, and the replication averages).
observations and the total number of observations n might vary from one replication to the next,
in which case replace d by d_r and n by n_r. For simplicity, assume that d and n are constant over
replications.
When using the replication method, each replication is regarded as a single sample for the purpose of estimating θ. For replication r, define

    Ȳ_r.(n, d) = [ 1/(n − d) ] Σ_{j=d+1}^{n} Y_{rj}    (11.30)

as the sample mean of all (nondeleted) observations in replication r. Because all replications use different random-number streams and all are initialized at time 0 by the same set of initial conditions I_0, the replication averages

    Ȳ_1.(n, d), ..., Ȳ_R.(n, d)

are independent and identically distributed random variables; that is, they constitute a random sample from some underlying population having unknown mean

    θ_{n,d} = E[Ȳ_r.(n, d)]

as can be seen from Table 11.4 or from using Equation (11.21). Thus, it follows that

    E[Ȳ..(n, d)] = θ_{n,d}
also. If d and n are chosen sufficiently large, then θ_{n,d} ≈ θ, and Ȳ..(n, d) is an approximately unbiased estimator of θ. The bias in Ȳ..(n, d) is θ_{n,d} − θ.
For convenience, when the values of n and d are understood, abbreviate Ȳ_r.(n, d) (the mean of the undeleted observations from the rth replication) and Ȳ..(n, d) [the mean of Ȳ_1.(n, d), ..., Ȳ_R.(n, d)] by Ȳ_r. and Ȳ.., respectively. To estimate the standard error of Ȳ.., first compute the sample variance,

    S² = [ 1/(R − 1) ] Σ_{r=1}^{R} (Ȳ_r. − Ȳ..)² = [ 1/(R − 1) ] ( Σ_{r=1}^{R} Ȳ_r.² − R Ȳ..² )    (11.33)

The standard error of Ȳ.. is then

    s.e.(Ȳ..) = S / √R    (11.34)

and an approximate 100(1 − α)% confidence interval for θ is

    Ȳ.. − t_{α/2, R−1} S/√R  ≤  θ  ≤  Ȳ.. + t_{α/2, R−1} S/√R    (11.35)

where t_{α/2, R−1} is the 100(1 − α/2) percentage point of a t distribution with R − 1 degrees of freedom. This confidence interval is valid only if the bias of Ȳ.. is approximately zero.
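A minimal sketch of the replication-method computation in Equations (11.30)–(11.35) follows; it is ours, not part of the text, it assumes the raw output of each replication is stored as one row of an array, and all names are illustrative.

    import numpy as np
    from scipy import stats

    def replication_ci(Y, d, alpha=0.05):
        # Y is an (R, n) array of raw output, Y[r, j] = observation j of replication r;
        # the first d observations of every replication are deleted.
        Ybar_r = Y[:, d:].mean(axis=1)               # replication averages, Eq. (11.30)
        R = len(Ybar_r)
        Ybar = Ybar_r.mean()                         # overall point estimator
        S = Ybar_r.std(ddof=1)                       # Eq. (11.33)
        se = S / np.sqrt(R)                          # Eq. (11.34)
        t = stats.t.ppf(1.0 - alpha / 2.0, R - 1)
        return Ybar, (Ybar - t * se, Ybar + t * se)  # Eq. (11.35)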
As a rough rule, the length of each replication, beyond the deletion point, should be at least ten times the amount of data deleted. In other words, (n − d) should be at least 10d (or, more generally, T_E should be at least 10·T_0). Given this run length, the number of replications should be as many as time permits, up to about 25 replications. Kelton [1986] established that there is little value in dividing the available time into more than 25 replications; so, if time permits making more than 25 replications of length T_0 + 10·T_0, then make 25 replications of longer than T_0 + 10·T_0 instead. Again, these are rough rules that need not be followed slavishly.
In this section we have presented results as if the run length is n observations and d of them are deleted. As discussed earlier, in practice we make a run of length T_0 + T_E time units and delete the first T_0 time units. In this case d becomes the number of observations obtained up to time T_0, and n − d is the number of observations recorded between times T_0 and T_0 + T_E, both random variables that typically differ from replication to replication. In the following example we will use (T_0 + T_E, T_0) instead of (n, d) in the notation to make this point clear.
The replication averages Ȳ_r.(T_0 + T_E, T_0), r = 1, 2, ..., 10, are shown in Table 11.5. The point estimator, computed by Equation (11.32), is Ȳ..(T_0 + T_E, T_0) = 46.86 hours, and its standard error, from Equation (11.34), is

    s.e.(Ȳ..(T_0 + T_E, T_0)) = S / √R

Using α = 0.05 and t_{0.025,9} = 2.26, the 95% confidence interval for the long-run mean cycle time is given by Inequality (11.35) as

    46.52 ≤ w ≤ 47.20

The simulation analyst may conclude with a high degree of confidence that the long-run mean cycle time for Chip-C is between 46.52 and 47.20 hours. The confidence interval computed here as given by Inequality (11.35) should be used with caution, because a key assumption behind its validity is that enough data have been deleted to remove any significant bias due to initial conditions—that is, that T_0 and T_0 + T_E are sufficiently large, so that the bias θ_{T_0+T_E, T_0} − θ is negligible.
Suppose it is desired to estimate a long-run performance measure θ within ±ε, with confidence 100(1 − α)%. In a steady-state simulation, a specified precision may be achieved either by increasing the number of replications R or by increasing the run length T_E. The first solution, controlling R, is carried out as given in Section 11.4.2 for terminating simulations.
The second solution, increasing the run length T_E, requires that each replication can be continued: it is necessary to have saved the state of the model at time T_0 + T_E and to be able to restart the model and run it for the additional required time. Otherwise, the simulations would have to be rerun from time 0, which could be time consuming for a complex model. Some simulation languages have the capability to save enough information that a replication can be continued from time T_0 + T_E onward, rather than having to start over from time 0.
In Example 11.16, suppose that the run length was to be increased to achieve the desired error of ±0.1 hours. Since R/R_0 = 82/10 = 8.2, the run length should be (R/R_0)(T_0 + T_E) = 8.2(2200) = 18,040 hours. The data collected from time 0 to time (R/R_0)T_0 = 8.2(200) = 1640 hours would be deleted,
and the data from time 1640 to time 18,040 used to compute new point estimates and confidence
intervals.
One disadvantage of the replication method is that data must be deleted on each replication and, in
one sense, deleted data are wasted data, or at least lost information. This suggests that there might
be merit in using an experiment design that is based on a single, long replication. The disadvantage
of a single-replication design arises when we try to compute the standard error of the sample mean.
Since we only have data from within one replication, the data are dependent, and the usual estimator
is biased.
The method of batch means attempts to solve this problem by dividing the output data from one replication (after appropriate deletion) into a few large batches and then treating the means of these batches as if they were independent. When the raw output data after deletion form a continuous-time process {Y(t), T_0 ≤ t ≤ T_0 + T_E}, such as the length of a queue or the level of inventory, then we form k batches of size m = T_E/k and compute the batch means as

    Ȳ_j = (1/m) ∫_{(j−1)m}^{jm} Y(t + T_0) dt

for j = 1, 2, ..., k. In other words, the jth batch mean is just the time-weighted average of the process over the time interval [T_0 + (j − 1)m, T_0 + jm).
When the raw output data after deletion form a discrete-time process {Y_i, i = d + 1, d + 2, ..., n}, such as the customer delays in a queue or the cost per period of an inventory system, then we form k batches of size m = (n − d)/k and compute the batch means as

    Ȳ_j = (1/m) Σ_{i=(j−1)m+1}^{jm} Y_{i+d}

for j = 1, 2, ..., k (assuming that k divides n − d evenly; otherwise, round the batch size down to the nearest integer). That is, the batch means are formed as shown here:

    Y_1, ..., Y_d | Y_{d+1}, ..., Y_{d+m} | Y_{d+m+1}, ..., Y_{d+2m} | ... | Y_{d+(k−1)m+1}, ..., Y_{d+km}
       deleted            Ȳ_1                     Ȳ_2                                  Ȳ_k
Starting with either continuous-time or discrete-time data, the variance of the sample mean is estimated by

    S²/k = Σ_{j=1}^{k} (Ȳ_j − Ȳ)² / [ k(k − 1) ] = [ Σ_{j=1}^{k} Ȳ_j² − k Ȳ² ] / [ k(k − 1) ]    (11.36)

where Ȳ is the overall sample mean of the data after deletion. The batch means Ȳ_1, Ȳ_2, ..., Ȳ_k are not independent; however, if the batch size is sufficiently large, successive batch means will be approximately independent, and the variance estimator will be approximately unbiased.
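As an illustration, the following sketch (ours, not part of the text) forms k batch means from the retained output of a single replication and computes the variance estimator of Equation (11.36); the names are illustrative, and the helper assumes k is much smaller than the number of retained observations.

    import numpy as np

    def batch_means(y, k):
        # Split the retained output y (deletion already applied) into k batches of
        # equal size and return the batch means, the grand mean, and the variance
        # estimator of Eq. (11.36).
        m = len(y) // k                      # batch size; leftover observations are dropped
        Yj = np.asarray(y)[: k * m].reshape(k, m).mean(axis=1)
        Ybar = Yj.mean()
        var_Ybar = np.sum((Yj - Ybar) ** 2) / (k * (k - 1))
        return Yj, Ybar, var_Ybar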
Unfortunately, there is no widely accepted and relatively simple method for choosing an accept-
able batch size m (or equivalently, choosing a number of batches k). But there are some general
guidelines that can be culled from the research literature:
a. Schmeiser [1982] found that, for a fixed total sample size, there is little benefit from dividing
it into more than k = 30 batches, even if we could do so and still retain independence between
the batch means. Therefore, there is no reason to consider numbers of batches much greater
than 30, no matter how much raw data are available. He also found that the performance of
the confidence interval, in terms of its width and the variability of its width, is poor for fewer
than 10 batches. Therefore, a number of batches between 10 and 30 should be used in most
applications.
b. Although there is typically autocorrelation between batch means at all lags, the lag-1 auto-
correlation ρ_1 = corr(Ȳ_j, Ȳ_{j+1}) is usually studied to assess the dependence between batch
means. When the lag-1 autocorrelation is nearly 0, then the batch means are treated as inde-
pendent. This approach is based on the observation that the autocorrelation in many stochastic
processes decreases as the lag increases. Therefore, all lag autocorrelations should be smaller,
in absolute value, than the lag-1 autocorrelation.
c. The lag-1 autocorrelation between batch means can be estimated (see below). However, the
autocorrelation should not be estimated from a small number of batch means (such as the
10 < k < 30 recommended above); there is bias in the autocorrelation estimator. Law and
Carson [1979] suggest estimating the lag-1 autocorrelation from a large number of batch
means based on a smaller batch size, perhaps 100 < k < 400. When the autocorrelation
between these batch means is approximately 0, then the autocorrelation will be even smaller
if we rebatch the data to between 10 and 30 batch means based on a larger batch size.
Hypothesis tests for 0 autocorrelation are available, as described next.
d. If the total sample size is to be chosen sequentially, say to attain a specified precision, then it
is helpful to allow the batch size and number of batches to grow as the run length increases. It
can be shown that a good strategy is to allow the number of batches to increase as the square
root of the sample size after first finding a batch size at which the lag-1 autocorrelation is
approximately 0. Although we will not discuss this point further, an algorithm based on it
can be found in Fishman and Yarberry [1997]; see also Steiger and Wilson [2002] and Lada,
Steiger, and Wilson [2006].
Given these guidelines, the following procedure can be used to form a batch-means confidence interval:
1. Obtain output data from a single replication and delete as appropriate. Recall our guideline: collect at least 10 times as much data as are deleted.

2. Form up to k = 400 batches (but at least 100 batches) with the retained data, and compute the batch means. Estimate the sample lag-1 autocorrelation of the batch means as

    ρ̂_1 = Σ_{j=1}^{k−1} (Ȳ_j − Ȳ)(Ȳ_{j+1} − Ȳ) / Σ_{j=1}^{k} (Ȳ_j − Ȳ)²
3. Check the correlation to see whether it is sufficiently small.

(a) If ρ̂_1 ≤ 0.2, then rebatch the data into 30 ≤ k ≤ 40 batches, and form a confidence interval using k − 1 degrees of freedom for the t distribution and Equation (11.36) to estimate the variance of Ȳ.

(b) If ρ̂_1 > 0.2, then extend the replication by 50% to 100% and go to Step 2. If it is not possible to extend the replication, then rebatch the data into approximately k = 10 batches, and form the confidence interval, using k − 1 degrees of freedom for the t distribution and Equation (11.36) to estimate the variance of Ȳ.
4. As an additional check on the confidence interval, examine the batch means (at the larger or smaller batch size) for independence, using the following test; see, for instance, Alexopoulos and Seila [1998]. Compute the test statistic

    C = √[ (k² − 1)/(k − 2) ] [ ρ̂_1 + ( (Ȳ_1 − Ȳ)² + (Ȳ_k − Ȳ)² ) / ( 2 Σ_{j=1}^{k} (Ȳ_j − Ȳ)² ) ]

If C < z_β, then accept the independence of the batch means, where β is the Type I error level of the test, such as 0.1, 0.05, or 0.01, and z_β is the corresponding upper critical value of the standard normal distribution. Otherwise, extend the replication by 50% to 100% and go to Step 2. If it is not possible to extend the replication, then rebatch the data into approximately k = 10 batches, and form the confidence interval, using k − 1 degrees of freedom for the t distribution and Equation (11.36) to estimate the variance of Ȳ.
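A schematic Python sketch of one pass of Steps 1–4 follows; it is ours, not part of the text, it reuses the batch_means helper from the earlier sketch, it omits the run-extension branch (a real study would extend the run when the correlation check or the independence test fails), and it uses the test statistic C as written above. All names and the particular batch-count choices are illustrative.

    import numpy as np
    from scipy import stats

    def lag1_autocorrelation(Yj):
        # Sample lag-1 autocorrelation of the batch means (Step 2).
        dev = Yj - Yj.mean()
        return np.sum(dev[:-1] * dev[1:]) / np.sum(dev ** 2)

    def batch_means_ci(y, alpha=0.05, beta=0.05):
        # One pass of Steps 1-4 on already-deleted data y; run extension is not attempted.
        k = min(400, max(100, len(y) // 10))        # Step 2: many small batches (illustrative choice)
        Yj, _, _ = batch_means(y, k)
        rho1 = lag1_autocorrelation(Yj)

        k_final = 40 if rho1 <= 0.2 else 10         # Step 3: rebatch into a small number of large batches
        Yj, Ybar, var_Ybar = batch_means(y, k_final)

        dev = Yj - Yj.mean()                        # Step 4: independence test on the final batch means
        C = np.sqrt((k_final ** 2 - 1) / (k_final - 2)) * (
            lag1_autocorrelation(Yj)
            + (dev[0] ** 2 + dev[-1] ** 2) / (2.0 * np.sum(dev ** 2))
        )
        independent = C < stats.norm.ppf(1.0 - beta)   # z_beta = upper-beta critical value (1.645 for beta = 0.05)

        t = stats.t.ppf(1.0 - alpha / 2.0, k_final - 1)
        half_width = t * np.sqrt(var_Ybar)
        return Ybar, half_width, rho1, independent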
This procedure, including the final check, is conservative in several respects. First, if the lag-1
autocorrelation is substantially negative then we proceed to form the confidence interval anyway. A
dominant negative correlation tends to make the confidence interval wider than necessary, which is
an error, but not one that will cause us to make incorrect decisions. The requirement that ρ̂_1 ≤ 0.2 at 100 ≤ k ≤ 400 batches is fairly stringent and will tend to force us to get more data (and therefore create larger batches) if there is any hint of positive dependence. And finally, the hypothesis test at the end has a probability of β of forcing us to get more data when none are really needed. But this
conservatism is by design; the cost of an incorrect decision is typically much greater than the cost
of some additional computer run time.
The batch-means approach to confidence-interval estimation is illustrated in the next example.
Example 11.18
Reconsider the FastChip wafer fab simulation of Example 11.8. Suppose that we want to estimate the
steady-state mean cycle time for Chip-C, w, by a 95% confidence interval. To illustrate the method
of batch means, assume that one run of the model has been made, simulating 5000 cycle times after
the deletion point. We then form batch means from k = 100 batches of size m = 50 and estimate
the lag-1 autocorrelation to be ρ̂_1 = 0.37 > 0.2. Thus, we decide to extend the simulation to 10,000 cycle times after the deletion point, and again we estimate the lag-1 autocorrelation. This estimate, based on k = 100 batches of size m = 100, is ρ̂_1 = 0.15 < 0.2.
Having passed the correlation check, we rebatch the data into k = 40 batches of size m = 250. The point estimate is the overall mean

    Ȳ = (1/40) Σ_{j=1}^{40} Ȳ_j = 47.00

and the variance of Ȳ is estimated by Equation (11.36) as

    S²/k = [ Σ_{j=1}^{40} Ȳ_j² − 40 Ȳ² ] / [ 40(39) ] = 0.049

The resulting 95% confidence interval is

    47.00 ± t_{0.025,39} √0.049 = 47.00 ± 0.45

or

    46.55 ≤ w ≤ 47.45
Thus, we assert with 95% confidence that true mean cycle time w is between 46.55 and 47.45 hours.
If these results are not sufficiently precise, then the run length should be increased to achieve greater
precision.
As a further check on the validity of the confidence interval, we can apply the correlation hypothesis test. To do so, we compute the test statistic from the k = 40 batches of size m = 250 used to form the confidence interval. The resulting value of C is less than z_{0.05},
which confirms the lack of correlation at the 0.05 significance level. Notice that, at this small number
of batches, the estimated lag-1 autocorrelation appears to be slightly negative, illustrating our point
about the difficulty of estimating correlation with small numbers of observations.
Taking the easier case first, suppose that the output process from a single replication, after appropriate deletion of initial data, is Y_{d+1}, ..., Y_n. To be concrete, Y_i might be the delay in queue of the ith customer. Then the point estimate of the pth quantile can be obtained as before, either from the histogram of the data or from the sorted values. Of course, only the data after the deletion point are used. Suppose we make R replications and let θ̂_r be the quantile estimate from the rth. Then the R quantile estimates, θ̂_1, ..., θ̂_R, are independent and identically distributed. Their average is

    θ̄. = (1/R) Σ_{r=1}^{R} θ̂_r

and an approximate 100(1 − α)% confidence interval for the pth quantile is

    θ̄. ± t_{α/2, R−1} S / √R

where S² is the sample variance of θ̂_1, ..., θ̂_R.
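A short sketch (ours, not part of the text) of this replication-based quantile estimator; np.quantile stands in for "the histogram of the data or the sorted values," and all names are illustrative.

    import numpy as np
    from scipy import stats

    def quantile_ci(replications, p, d=0, alpha=0.05):
        # One quantile estimate per replication (data after the deletion point only),
        # then a t confidence interval across the R i.i.d. estimates.
        theta = np.array([np.quantile(np.asarray(y)[d:], p) for y in replications])
        R = len(theta)
        mean = theta.mean()
        se = theta.std(ddof=1) / np.sqrt(R)
        t = stats.t.ppf(1.0 - alpha / 2.0, R - 1)
        return mean, (mean - t * se, mean + t * se)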
11.6 Summary
This chapter emphasized the idea that a stochastic discrete-event simulation is a statistical experiment.
Therefore, before sound conclusions can be drawn on the basis of the simulation-generated output
data, a proper statistical analysis is required. The purpose of the simulation experiment is to obtain
estimates of the performance measures of the system under study. The purpose of the statistical
analysis is to acquire some assurance that these estimates are sufficiently precise for the proposed
use of the model.
A distinction was made between terminating simulations and steady-state simulations. Steady-
state simulation output data are more difficult to analyze, because the simulation analyst must address
the problem of initial conditions and the choice of run length. Some suggestions were given regarding
these problems, but unfortunately no simple, complete, and satisfactory solution exists. Nevertheless,
simulation analysts should be aware of the potential problems, and of the possible solutions—namely,
deletion of data and increasing of the run length. More advanced statistical techniques (not discussed
in this text) are given in Alexopoulos and Seila [1998], Bratley, Fox, and Schrage [1996], and Law
[2007].
REFERENCES
ALEXOPOULOS, C., AND A. F. SEILA [1998], “Output Data Analysis,” Chapter 7 in Handbook of Simulation,
J. Banks, ed., Wiley, New York.
BRATLEY, P., B. L. FOX, AND L. E. SCHRAGE [1996], A Guide to Simulation, 2d ed., Springer-Verlag, New
York.
FISHMAN, G. S., AND L. S. YARBERRY [1997], “An Implementation of the Batch Means Method,” INFORMS
Journal on Computing, Vol. 9, pp. 296-310.
GOLDSMAN, D., L. SCHRUBEN, AND J. J. SWAIN [1994], “Tests for Transient Means in Simulated Time
Series,” Naval Research Logistics, Vol. 41, pp. 171-187.
KELTON, W. D. [1986], “Replication Splitting and Variance for Simulating Discrete-Parameter Stochastic
Processes,” Operations Research Letters, Vol. 4, pp. 275-279.
KELTON, W. D. [1989], “Random Initialization Methods in Simulation,” IIE Transactions, Vol. 21, pp. 355-367.
KELTON, W. D., AND A. M. LAW [1983], “A New Approach for Dealing with the Startup Problem in Discrete
Event Simulation,” Naval Research Logistics Quarterly, Vol. 30, pp. 641-658.
KLEIJNEN, J. P. C. [1987], Statistical Tools for Simulation Practitioners, Dekker, New York.
LADA, E. K., N. M. STEIGER, AND J. R. WILSON [2006], “Performance Evaluation of Recent Procedures for
Steady-State Simulation Analysis,” IIE Transactions, Vol. 38, pp. 711-727.
LAW, A. M. [1977], “Confidence Intervals in Discrete Event Simulation: A Comparison of Replication and
Batch Means,” Naval Research Logistics Quarterly, Vol. 24, pp. 667-678.
LAW, A. M. [1980], “Statistical Analysis of the Output Data from Terminating Simulations,” Naval Research
Logistics Quarterly, Vol. 27, pp. 131-143.
LAW, A. M. [2007], Simulation Modeling and Analysis, 4th ed., McGraw-Hill, New York.
LAW, A. M., AND J. S. CARSON [1979], “A Sequential Procedure for Determining the Length of a Steady-State
Simulation,” Operations Research, Vol. 27, pp. 1011-1025.
NELSON, B. L. [2001], “Statistical Analysis of Simulation Results,” Chapter 94 in Handbook of Industrial
Engineering, 3d ed., G. Salvendy, ed., Wiley, New York.
SCHMEISER, B. [1982], “Batch Size Effects in the Analysis of Simulation Output,” Operations Research, Vol.
30, pp. 556-568.
SCHRUBEN, L. [1982], “Detecting Initialization Bias in Simulation Output,” Operations Research, Vol. 30,
pp. 569-590.
SNELL, M., AND L. SCHRUBEN [1985], “Weighting Simulation Data to Reduce Initialization Effects,” IIE
Transactions, Vol. 17, pp. 354-363.
STEIGER, N. M., AND J. R. WILSON [2002], “An Improved Batch Means Procedure for Simulation Output
Analysis,” Management Science, Vol. 48, pp. 1569-1586.
WELCH, P. D. [1983], “The Statistical Analysis of Simulation Results,” in The Computer Performance Modeling
Handbook, S. Lavenberg, ed., Academic Press, New York, pp. 268-328.
EXERCISES
1. For each of the systems described below, under what circumstances would it be appropriate to
use a terminating simulation versus a steady-state simulation to analyze this system?
2. Suppose that the output process from a queueing simulation is L(t), 0 ≤ t ≤ T, the total number in queue at time t. A continuous-time output process can be converted into the sort of discrete-time process Y_1, Y_2, ... described in this chapter by first forming k = T/m batch means of size m time units:

    Ȳ_j = (1/m) ∫_{(j−1)m}^{jm} L(t) dt

for j = 1, 2, ..., k. Ensemble averages of these batch means can be plotted to check for initial-condition bias.

(a) Show algebraically that the batch mean over 2m time units can be obtained by averaging two adjacent batch means over m time units. [Hint: This implies that we can start with batch means over rather small time intervals m and build up batch means over longer intervals without reanalyzing all of the data.]

(b) Simulate an M/M/1 queue with λ = 1 and μ = 1.25 for a run length of 4000 time units, computing batch means of the total number in queue with batch size m = 4 time units. Make replications and use a mean plot to determine an appropriate number of batches to delete. Convert this number of batches to delete into a deletion time.
3. Using the results of Exercise 2 above, design and execute an experiment to estimate the steady-state expected number in this M/M/1 queue, L, to within ±0.5 customers with 95% confidence. Check your estimate against the true value of L = λ/(μ − λ).
4. In Example 11.7, suppose that management desired a 95% confidence interval on the estimate
of mean cycle time and that the error allowed was ε = 0.05 hours (3 minutes). Using the same
initial data given in Table 11.5, estimate the required total sample size. Although we cut the error
in half, by how much did the required sample size increase?
5. Again, simulate an M/M/1 queue with λ = 1 and μ = 1.25, this time recording customer time in system (from arrival to departure) as the performance measure for 4000 customers.
Make replications and use a mean plot to determine an appropriate number of customers to
delete when starting the system empty, with 4 customers initially in the system, and with 8
customers initially in the system. How does the warmup period change with these different
initial conditions? What does this suggest about how to initialize simulations?
where I is the level of inventory on hand plus on order at the end of a month, M is the maximum
inventory level, and L is the reorder point. M and L are under management control, so the pair
(M, L) is called the inventory policy. Under certain conditions, the analytical solution of such a
model is possible, but not always. Use simulation to investigate an (M, L) inventory system with
the following properties: The inventory status is checked at the end of each month. Backordering
is allowed at a cost of $4 per item short per month. When an order arrives, it will first be used
to relieve the backorder. The lead time is given by a uniform distribution on the interval [0.25,
1.25] months. Let the beginning inventory level stand at 50 units, with no orders outstanding.
Let the holding cost be $1 per unit in inventory per month. Assume that the inventory position is
reviewed each month. If an order is placed, its cost is $60 + $5Q, where $60 is the ordering cost
and $5 is the cost of each item. The time between demands is exponentially distributed with a
mean of 1/15 month. The sizes of the demands follow this distribution:
Demand _ Probability
(a) Make ten independent replications, each of run length 100 months preceded by a 12-month
initialization period, for the (M, L) = (50, 30) policy. Estimate long-run mean monthly cost
with a 90% confidence interval.
(b) Using the results of part (a), estimate the total number of replications needed to estimate
mean monthly cost within $5.
7. Reconsider Exercise 6, except that, if the inventory level at a monthly review is zero or negative, a rush order for Q units is placed. The cost for a rush order is $120 + $12Q, where $120 is the ordering cost and $12 is the cost of each item. The lead time for a rush order is given by a
uniform distribution on the interval [0.10, 0.25] months.
(a) Make ten independent replications for the (M, L) policy, and estimate long-run mean monthly
cost with a 90% confidence interval.
(b) Using the results of part (a), estimate the total number of replications needed to estimate
mean monthly cost within $5.
8. Suppose that the items in Exercise 6 are perishable, with a selling price given by the following
data:
    Age on shelf (months):   0–1    1–2    > 2
Thus, any item that has been on the shelf more than 2 months cannot be sold. The age is measured
at the time the demand occurs. If an item is outdated, it is discarded, and the next item is brought
forward. Simulate the system for 100 months.
(a) Make ten independent replications for the (M, L) = (50, 30) policy, and estimate long-run
mean monthly cost with a 90% confidence interval.
(b) Using the results of part (a), estimate the total number of replications needed to estimate
mean monthly cost within $5.
At first, assume that all the items in the beginning inventory are fresh. Is this a good assumption?
What effect does this “all-fresh” assumption have on the estimates of long-run mean monthly
cost? What can be done to improve these estimates? Carry out a complete analysis.
(f) The demand on the part of each customer is Poisson distributed with a mean of 3 units.
(g) For simplicity, assume that all demands occur at noon and that all orders are placed imme-
diately thereafter.
Assume further that orders are received at 5:00 P.M., or after the demand that occurred on that
day. Consider the policy having Q = 20. Make ten independent replications, each of length 100
days, and compute a 90% confidence interval for long-run mean daily cost. Investigate the effect
of initial inventory level and existence of an outstanding order on the estimate of mean daily
cost. Begin with an initial inventory of Q + 10 and no outstanding orders.
10. A store selling Mother’s Day cards must decide 6 months in advance on the number of cards
to stock. Reordering is not allowed. Cards cost $0.45 and sell for $1.25. Any cards not sold
by Mother’s Day go on sale for $0.50 for 2 weeks. However, sales of the remaining cards are probabilistic in nature, according to the following distribution:
32% of the time, all cards remaining get sold.
40% of the time, 80% of all cards remaining are sold.
Any cards left after 2 weeks are sold for $0.25. The card-shop owner is not sure how many
cards can be sold, but thinks it is somewhere (i.e., uniformly distributed) between 200 and 400.
Suppose that the card-shop owner decides to order 300 cards. Estimate the expected total profit
with an error of at most $5.00. [Hint: Make ten initial replications. Use these data to estimate
the total sample size needed. Each replication consists of one Mother’s Day.]
11. A very large mining operation has decided to control the inventory of high-pressure piping by
a “periodic review, order up to M” policy, where M is a target level. The annual demand for
this piping is normally distributed, with mean 600 and variance 800. This demand occurs fairly
uniformly over the year. The lead time for resupply is Erlang distributed of order k = 2 with its
mean at 2 months. The cost of each unit is $400. The inventory carrying charge, as a proportion
of item cost on an annual basis, is expected to fluctuate normally about the mean 0.25 (simple interest), with a standard deviation of 0.01. The cost of making a review and placing an order is
$200, and the cost of a backorder is estimated to be $100 per unit backordered. Suppose that the
inventory level is reviewed every 2 months, and let M = 337.
(a) Make ten independent replications, each of run length 100 months, to estimate long-run
mean monthly cost by means of a 90% confidence interval.
(b) Investigate the effects of initial conditions. Calculate an appropriate number of monthly
observations to delete, in order to reduce initialization bias to a negligible level.
12. Consider some number, say N, of M/M/1 queues in series. The M/M/1 queue, described in Section 6.4, has Poisson arrivals at some rate λ customers per hour, exponentially distributed service times with mean 1/μ, and a single server. (Recall that “Poisson arrivals” means that interarrival times are exponentially distributed.) By M/M/1 queues in series, it is meant that,
upon completion of service at a given server, a customer joins a waiting line for the next server.
The system can be shown as follows:
All service times are exponentially distributed with mean 1/μ, and the capacity of each waiting line is assumed to be unlimited. Assume that λ = 8 customers per hour and 1/μ = 0.1 hour.
The measure of performance is response time, which is defined to be the total time a customer
is in the system.
(a) By making appropriate simulation runs, compare the initialization bias for N = 1 (i.e., one M/M/1 queue) to N = 2 (i.e., two M/M/1 queues in series). Start each system with all servers
idle and no customers present. The purpose of the simulation is to estimate mean response
time.
(b) Investigate the initialization bias as a function of N, for N = 1, 2,3, 4, and 5.
(c) Draw some general conclusions concerning initialization bias for “large” queueing systems
when at time 0 the system is assumed to be empty and idle.
13. Jobs enter a job shop in random fashion according to a Poisson process at a stationary overall
rate, two every 8-hour day. The jobs are of four types. They flow from work station to work
station in a fixed order, depending on type, as shown below. The proportions of each type are
also shown.
Processing times per job at each station depend on type, but all times are, approximately, normally
distributed with mean and s.d. in hours as follows:
    [Table of processing-time (mean, s.d.) pairs, in hours, for each job type at stations 1 through 4.]
Station i will have c; workers, i = 1, 2, 3, 4. Each job occupies one worker at a station for the
duration of a processing time. All jobs are processed on a first-in—first-out basis, and all queues
for waiting jobs are assumed to have unlimited capacity. Simulate the system for 800 hours,
preceded by a 200-hour initialization period. Assume that c_1 = 8, c_2 = 8, c_3 = 20, c_4 = 7.
Based on R = 20 replications, compute a 95% confidence interval for average worker utilization
at each of the four stations. Also, compute a 95% confidence interval for mean total response
time for each job type, where a total response time is the total time that a job spends in the shop.
14. Change Exercise 13 to give priority at each station to the jobs by type. Type I jobs have priority
over type II, type II over type III, and type III over type IV. Use 800 hours as run length, 200
hours as initialization period, and R = 20 replications. Compute four 95% confidence intervals
for mean total response time by type. Also, run the model without priorities and compute the
same confidence intervals. Discuss the trade-offs when using first-in-first-out versus a priority
system.
15. Consider a single-server queue with Poisson arrivals at rate λ = 10.82 per minute and normally distributed service times with mean 5.1 seconds and variance 0.98 seconds². It is desired to estimate the mean time in the system for a customer who, upon arrival, finds i other customers in the system; that is, to estimate

    w_i = E(W | N = i)   for i = 0, 1, 2, ...

where W is a typical system time and N is the number of customers found by an arrival. For example, w_0 is the mean system time for those customers who find the system empty, w_1 is the mean system time for those customers who find one other customer present upon arrival, and so on. The estimate ŵ_i of w_i will be a sample mean of system times taken over all arrivals who find i in the system. Plot ŵ_i versus i. Hypothesize and attempt to verify a relation between w_i and i.
(a) Simulate for a 10-hour period with empty and idle initial conditions.
(b) Simulate for a 10-hour period after an initialization of one hour. Are there observable differ-
ences in the results of (a) and (b)?
(c) Repeat parts (a) and (b) with service times exponentially distributed with mean 5.1 seconds.
(d) Repeat parts (a) and (b) with deterministic service times equal to 5.1 seconds.
(e) Find the number of replications needed to estimate w_0, w_1, ..., w_6 with a standard error for each of at most 3 seconds. Repeat parts (a)–(d), but using this number of replications.
16. At Smalltown U., there is one specialized graphics workstation for student use located across
campus from the computer center. At 2:00 A.M. one day, six students arrive at the workstation
to complete an assignment. A student uses the workstation for 10 ± 8 minutes, then leaves to go to the computer center to pick up their graphics output. There is a 25% chance that the run will be OK and the student will go to sleep. If it is not OK, the student returns to the workstation and waits until it becomes free. The round trip from workstation to computer center and back takes 30 ± 5 minutes. The computer becomes inaccessible at 5:00 A.M. Estimate the probability
p that at least five of the six students will finish their assignment in the 3-hour period. First,
make R = 10 replications, and compute a 95% confidence interval for p. Next, work out the number of replications needed to estimate p within ±0.02, and make this number of replications.
Recompute the 95% confidence interval for p.
17. Four workers are spaced evenly along a conveyor belt. Items needing processing arrive according
to a Poisson process at the rate of 2 per minute. Processing time is exponentially distributed,
with mean 1.6 minutes. If a worker becomes idle, then he or she takes the first item to come
by on the conveyor. If a worker is busy when an item comes by, that item moves down the
conveyor to the next worker, taking 20 seconds between two successive workers. When a worker
finishes processing an item, the item leaves the system. If an item passes by the last worker, it
is recirculated on a loop conveyor and will return to the first worker after 5 minutes.
Management is interested in having a balanced workload—that is, management would like
worker utilizations to be equal. Let ρ_i be the long-run utilization of worker i, and let ρ̄ be the average utilization of all workers. Thus, ρ̄ = (ρ_1 + ρ_2 + ρ_3 + ρ_4)/4. According to queueing theory, ρ̄ can be estimated by ρ = λ/(cμ), where λ = 2 arrivals per minute, c = 4 servers, and 1/μ = 1.6 minutes is the mean service time. Thus, ρ = λ/(cμ) = (2/4)1.6 = 0.8; so, on the average, a worker will be busy 80% of the time.
(a) Make 10 independent replications, each of run length 40 hours preceded by a one-hour
initialization period. Compute 95% confidence intervals for ρ_1 and ρ_4. Draw conclusions
concerning workload balance.
(b) Based on the same 10 replications, test the hypothesis H_0: ρ_1 = 0.8 at a level of significance α = 0.05. If a difference of ±0.05 is important to detect, determine the probability that such a deviation is detected. In addition, if it is desired to detect such a deviation with probability at
least 0.9, figure out the sample size needed to do so. [Hint: See any basic statistics textbook
for guidance on hypothesis testing. ]
(c) Repeat (b) for H_0: ρ_4 = 0.8.
(d) From the results of (a)-(c), draw conclusions for management about the balancing of work-
loads.
18. At a small rock quarry, a single power shovel dumps a scoop full of rocks at the loading area
approximately every 10 minutes, with the actual time between scoops modeled well as being
exponentially distributed, with mean 10 minutes. Three scoops of rocks make a pile; whenever
one pile of rocks is completed, the shovel starts a new pile.
The quarry has a single truck that can carry one pile (3 scoops) at a time. It takes approximately
27 minutes for a pile of rocks to be loaded into the truck and for the truck to drive to the processing
plant, unload, and return to the loading area. The actual time to do these things altogether is
modeled well as being normally distributed, with mean 27 minutes and standard deviation 12
minutes.
When the truck returns to the loading area, it will load and transport another pile if one is waiting
to be loaded; otherwise, it stays idle until another pile is ready. For safety reasons, no loading
of the truck occurs until a complete pile is waiting.
The quarry operates in this manner for an 8-hour day. We are interested in estimating the utiliza-
tion of the trucks and the expected number of piles waiting to be transported if an additional
truck is purchased.
19. Big Bruin, Inc. plans to open a small grocery store in Juneberry, NC. They expect to have two
checkout lanes, with one lane being reserved for customers paying with cash. The question they
want to answer is: How many grocery carts do they need?
During business hours (6 A.M.-8 P.M.), cash-paying customers are expected to arrive at 8 per
hour. All other customers are expected to arrive at 9 per hour. The time between arrivals of each
type can be modeled as exponentially distributed random variables.
462 Chapter 11 Estimation of Absolute Performance
The time spent shopping is modeled as normally distributed, with mean 40 minutes and standard
deviation 10 minutes. The time required to check out after shopping can be modeled as lognor-
mally distributed, with (a) mean 4 minutes and standard deviation 1 minute for cash-paying
customers; (b) mean 6 minutes and standard deviation 1 minute for all other customers.
We will assume that every customer uses a shopping cart and that a customer who finishes
shopping leaves the cart in the store so that it is available immediately for another customer.
We will also assume that any customer who cannot obtain a cart immediately leaves the store,
disgusted.
The primary performance measures of interest to Big Bruin are the expected number of shopping
carts in use and the expected number of customers lost per day. Recommend a number of carts
for the store, remembering that carts are expensive, but so are lost customers.
20. Develop a simulation model of the total time in the system for an M/M/1 queue with service rate μ = 1; therefore, the traffic intensity is ρ = λ/μ = λ, the arrival rate. Use the simulation, in conjunction with the technique of plotting ensemble averages, to study the effect of traffic intensity on initialization bias when the queue starts empty. Specifically, see how the initialization phase T_0 changes for ρ = 0.5, 0.7, 0.8, 0.9, 0.95.
21. Many simulation-software tools come with built-in support for output analysis. Use the web to
research the features of one of these products.
12
Estimation of Relative Performance
Chapter 11 dealt with the precise estimation of measures of performance. This chapter discusses a
few of the many statistical methods that can be used to compare the relative performance of two or
more system designs. This is one of the most important uses of simulation. Because the observations
of the response variables contain random variation, statistical analysis is needed to discover whether
any observed differences are due to differences in design or merely to the random fluctuation inherent
in the models.
The comparison of two system designs is easier than the simultaneous comparison of multiple
system designs. Section 12.1 discusses the case of two system designs, using two possible statistical
techniques: independent sampling and correlated sampling. Correlated sampling is also known as
the common random numbers (CRN) technique; simply put, the same random numbers are used to
simulate both alternative system designs. If implemented correctly, CRN usually reduces the variance
of the estimated difference of the performance measures and thus can provide, for a given sample size,
more precise estimates of the mean difference than can independent sampling. Section 12.2 extends
the statistical techniques of Section 12.1 to the comparison of multiple (more than two) system designs
using simultaneous confidence intervals, screening and selection of the best. These approaches are
somewhat limited with respect to the number of system designs that can be considered, so Section 12.3
describes how a large number of complex system designs can sometimes be represented by a simpler
metamodel. Finally, for comparison and evaluation of a very large number of system designs that
are related in a less structured way, Section 12.4 presents optimization via simulation.
12.1 Comparison of Two System Designs

Suppose that a simulation analyst desires to compare two possible configurations of a system. In
a queueing system, perhaps two possible queue disciplines, or two possible sets of servers, are to
be compared. In a supply-chain inventory system, perhaps two possible ordering policies will be
compared. A job shop could have two possible scheduling rules; a production system could have
in-process inventory buffers of various capacities. Many other examples of alternative system designs
can be provided.
The method of replications will be used to analyze the output data. The mean performance
measure for system i will be denoted by θ_i, i = 1, 2. If it is a steady-state simulation, it will be assumed that deletion of data, or other appropriate techniques, have been used to ensure that the point estimators are approximately unbiased estimators of the mean performance measures θ_i. The goal of the simulation experiments is to obtain point and interval estimates of the difference in mean performance, namely θ_1 − θ_2. Two methods of computing a confidence interval for θ_1 − θ_2 will be discussed, but first an example and a general framework will be given.
Example 12.1
Recall the Software Made Personal (SMP) call center problem, Example 11.7: SMP has a customer
support call center where 7 operators handle questions from owners of their software from 8 A.M.
to 4 P.M. Eastern Time. When a customer calls they use an automated system to select between
the two product lines, finance and contact management. Currently each product line has its own
operators (4 for financial, 3 for contact management) and hold queue. SMP wonders if they can
reduce the total number of operators needed by cross-training them so that they can answer calls for
any product line. This is expected to increase the time to process a call by about 10%. The current
system is illustrated in Figure 12.1(a), and the alternative system design is shown in Figure 12.1(b).
Figure 12.1 Current SMP call center (a) and an alternative design (b).
Table 12.1 Simulation Output Data and Summary Measures for Comparing Two Systems (replications r = 1, ..., R_i; observations Y_{ri}; sample means Ȳ_{.i}; sample variances S_i²).
Before considering reducing operators, SMP wants to know what their quality of service would be
if they simply cross-train the 7 operators they currently have. Incoming calls have been modeled as
a nonstationary Poisson arrival process and distributions have been fit to operator time to answer
questions. Comparisons will be based on the average time from call initiation until the customer’s
questions are answered.
When comparing two systems, such as those in Example 12.1, the simulation analyst must decide on a run length T_E^(i) for each model i = 1, 2, a number of replications R_i to be made of each model, or both. From replication r of system i, the simulation analyst obtains an estimate Y_{ri} of the mean performance measure θ_i. In Example 12.1, Y_{ri} would be the average call response time observed during replication r for system i, where r = 1, 2, ..., R_i and i = 1, 2. The data, together with the two summary measures, the sample means Ȳ_{.i} and the sample variances S_i², are exhibited in Table 12.1. Assuming that the estimators Y_{ri} are, at least approximately, unbiased, it follows that

    θ_1 = E(Y_{r1}), r = 1, ..., R_1;    θ_2 = E(Y_{r2}), r = 1, ..., R_2
In Example 12.1, SMP is initially interested in a comparison of two system designs (the same number of operators, but specialized vs. cross-trained), so the simulation analyst decides to compute a confidence interval for θ_1 − θ_2, the difference between the two mean performance measures. This will lead to one of three possible conclusions:

1. If the confidence interval (c.i.) for θ_1 − θ_2 is totally to the left of zero, as shown in Figure 12.2(a), then there is strong evidence for the hypothesis that θ_1 − θ_2 < 0, or equivalently θ_1 < θ_2.

In Example 12.1, θ_1 < θ_2 implies that the mean response time for system 1, the original system, is smaller than for system 2, the alternative system.

2. If the c.i. for θ_1 − θ_2 is totally to the right of zero, as shown in Figure 12.2(b), then there is strong evidence that θ_1 − θ_2 > 0, or equivalently, θ_1 > θ_2.

In Example 12.1, θ_1 > θ_2 can be interpreted as system 2 being better than system 1, in the sense that system 2 has a smaller mean response time.

3. If the c.i. for θ_1 − θ_2 contains zero, then, with the data at hand, there is no strong statistical evidence that one system design is better than the other.
Figure 12.2 Three confidence intervals that can occur when comparing two systems.
Some statistics textbooks say that the weak conclusion θ_1 = θ_2 can be drawn, but such statements can be misleading. A "weak" conclusion is often no conclusion at all. Most likely, if enough additional data were collected (i.e., R_i increased), the c.i. would shift, and definitely shrink in length, until conclusion 1 or 2 would be drawn. Since the confidence interval provides a measure of the precision of the estimator of θ_1 − θ_2, we want to shrink it until we can either conclude that there is a difference that matters or conclude that the difference is so small that we do not need to separate the two designs.
In this chapter, a two-sided 100(1 − α)% c.i. for θ_1 − θ_2 will always be of the form

    Ȳ_{.1} − Ȳ_{.2} ± t_{α/2, ν} s.e.(Ȳ_{.1} − Ȳ_{.2})    (12.1)

where Ȳ_{.i} is the sample mean performance measure for system i over all replications,

    Ȳ_{.i} = (1/R_i) Σ_{r=1}^{R_i} Y_{ri}    (12.2)

ν is the degrees of freedom associated with the variance estimator, t_{α/2, ν} is the 100(1 − α/2) percentage point of a t distribution with ν degrees of freedom, and s.e.(·) represents the standard error of the specified point estimator. To obtain the standard error and the degrees of freedom, the analyst uses one of two statistical techniques. Both techniques assume that the basic data, Y_{ri} of Table 12.1, are approximately normally distributed. This assumption is reasonable provided that each Y_{ri} is itself a sample mean of observations from replication r, which is indeed the situation in Example 12.1.

By design of the simulation experiment, Y_{r1} (r = 1, ..., R_1) are independently and identically distributed (i.i.d.) with mean θ_1 and variance σ_1², say. Similarly, Y_{r2} (r = 1, ..., R_2) are i.i.d. with mean θ_2 and variance σ_2², say. The two techniques for computing the confidence interval in Equation (12.1), which are based on different sets of assumptions, are discussed in the following subsections.
Whether a difference within these bounds is practically significant depends on the particular problem.
Independent sampling means that different and independent random-number streams will be used to simulate the two systems. This implies that all the observations of simulated system 1, namely {Y_{r1}, r = 1, ..., R_1}, are statistically independent of all the observations of simulated system 2, namely {Y_{r2}, r = 1, ..., R_2}. By Equation (12.2) and the independence of the replications, the variance of the sample mean Ȳ_{.i} is given by

    V(Ȳ_{.i}) = σ_i²/R_i,   i = 1, 2

For independent sampling, Ȳ_{.1} and Ȳ_{.2} are statistically independent; hence,

    V(Ȳ_{.1} − Ȳ_{.2}) = V(Ȳ_{.1}) + V(Ȳ_{.2}) = σ_1²/R_1 + σ_2²/R_2    (12.3)
There are confidence-interval procedures that are appropriate when the two variances are equal but unknown in value, that is, σ_1² = σ_2². However, since we would rarely know this to be true in practice, and the advantages of exploiting equal variances are slight provided the number of replications is not too small, we refer the reader to any statistics textbook for this case (for instance, Hines et al. [2002]).
An approximate 100(1 − α)% c.i. for θ_1 − θ_2 can be computed as follows. The point estimate is computed by

    Ȳ_{.1} − Ȳ_{.2}    (12.4)

with Ȳ_{.i} given by Equation (12.2), while the sample variance for system i is

    S_i² = [ 1/(R_i − 1) ] Σ_{r=1}^{R_i} (Y_{ri} − Ȳ_{.i})² = [ 1/(R_i − 1) ] ( Σ_{r=1}^{R_i} Y_{ri}² − R_i Ȳ_{.i}² )    (12.5)

The standard error of the point estimate is then

    s.e.(Ȳ_{.1} − Ȳ_{.2}) = √( S_1²/R_1 + S_2²/R_2 )    (12.6)
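A minimal sketch (ours, not part of the text) of the independent-sampling confidence interval in Equations (12.4)–(12.6) follows. Because this excerpt does not show the degrees-of-freedom formula, the sketch uses the standard Welch–Satterthwaite approximation, which should be treated as an assumption rather than the text's own expression; all names are illustrative.

    import numpy as np
    from scipy import stats

    def independent_ci(y1, y2, alpha=0.05):
        # y1, y2: arrays of per-replication performance measures from independently
        # simulated designs 1 and 2.
        y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
        R1, R2 = len(y1), len(y2)
        diff = y1.mean() - y2.mean()                    # Eq. (12.4)
        v1, v2 = y1.var(ddof=1) / R1, y2.var(ddof=1) / R2
        se = np.sqrt(v1 + v2)                           # Eq. (12.6)
        # Welch-Satterthwaite degrees of freedom (assumed; not shown in the excerpt)
        nu = (v1 + v2) ** 2 / (v1 ** 2 / (R1 - 1) + v2 ** 2 / (R2 - 1))
        t = stats.t.ppf(1.0 - alpha / 2.0, nu)
        return diff, (diff - t * se, diff + t * se)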
CRN (also known as correlated sampling) means that, for each replication, the same random numbers are used to simulate both systems. Therefore, R_1 and R_2 must be equal, say, R_1 = R_2 = R. Thus, for each replication r, the two estimates Y_{r1} and Y_{r2} are no longer independent, but rather are correlated. However, independent random numbers are used on different replications, so the pairs (Y_{r1}, Y_{r2}) and (Y_{s1}, Y_{s2}) are mutually independent when r ≠ s. For example, in Table 12.1, the observation Y_{11} is correlated with Y_{12}, but Y_{11} is independent of all other observations. The purpose of using CRN is to induce a positive correlation between Y_{r1} and Y_{r2} for each r and thus to achieve a variance reduction in the point estimator of the mean difference, Ȳ_{.1} − Ȳ_{.2}. In general, this variance is given by

    V(Ȳ_{.1} − Ȳ_{.2}) = V(Ȳ_{.1}) + V(Ȳ_{.2}) − 2 cov(Ȳ_{.1}, Ȳ_{.2}) = (σ_1² + σ_2² − 2 ρ_{12} σ_1 σ_2)/R    (12.8)

where ρ_{12} is the correlation between Y_{r1} and Y_{r2}.
Now, compare the variance of Ȳ_{.1} − Ȳ_{.2} arising from the use of CRN [Equation (12.8); call it V_CRN] to the variance arising from the use of independent sampling with equal sample sizes [Equation (12.3) with R_1 = R_2 = R; call it V_IND]. Notice that

    V_CRN = V_IND − 2 ρ_{12} σ_1 σ_2 / R    (12.9)

If CRN works as intended, the correlation ρ_{12} will be positive; hence, the second term on the right side of Equation (12.9) will be positive, and, therefore,

    V_CRN < V_IND
That is, the variance of the point estimator will be smaller with CRN than with independent sampling.
A smaller variance, for the same sample size R, implies that the estimator based on CRN is more
precise, leading to a shorter confidence interval on the difference, which implies that smaller differ-
ences in performance can be detected.
To compute a 100(1 − α)% c.i. with correlated data, first compute the differences

    D_r = Y_{r1} − Y_{r2}    (12.10)

which, by the definition of CRN, are i.i.d.; then compute the sample mean difference as

    D̄ = (1/R) Σ_{r=1}^{R} D_r    (12.11)

Thus, D̄ = Ȳ_{.1} − Ȳ_{.2}. The sample variance of the differences {D_r} is computed as

    S_D² = [ 1/(R − 1) ] Σ_{r=1}^{R} (D_r − D̄)² = [ 1/(R − 1) ] ( Σ_{r=1}^{R} D_r² − R D̄² )    (12.12)

which has degrees of freedom ν = R − 1. The 100(1 − α)% c.i. for θ_1 − θ_2 is given by Expression (12.1), with the standard error of Ȳ_{.1} − Ȳ_{.2} = D̄ estimated by

    s.e.(Ȳ_{.1} − Ȳ_{.2}) = s.e.(D̄) = S_D / √R    (12.13)
Because S_D/√R of Equation (12.13) is an estimate of √V_CRN and Expression (12.6) is an estimate of √V_IND, CRN typically will produce a c.i. that is shorter, for a given sample size, than the c.i. produced by independent sampling if ρ_{12} > 0. In fact, the expected length of the c.i. will be shorter with the use of CRN if ρ_{12} > 0.1, provided R ≥ 10. The larger R is, the smaller ρ_{12} can be and still yield a shorter expected length [Nelson, 1987].
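The corresponding paired-t computation for CRN, Equations (12.10)–(12.13), as a short sketch (ours, not part of the text; names are illustrative):

    import numpy as np
    from scipy import stats

    def crn_ci(y1, y2, alpha=0.05):
        # Paired-t interval; requires R1 = R2 = R and replication r of both designs
        # driven by the same (synchronized) random numbers.
        D = np.asarray(y1, float) - np.asarray(y2, float)   # Eq. (12.10)
        R = len(D)
        Dbar = D.mean()                                      # Eq. (12.11)
        se = D.std(ddof=1) / np.sqrt(R)                      # Eqs. (12.12)-(12.13)
        t = stats.t.ppf(1.0 - alpha / 2.0, R - 1)
        return Dbar, (Dbar - t * se, Dbar + t * se)

The only difference from the independent-sampling sketch is that the variance is estimated from the paired differences, which is exactly where the variance reduction of Equation (12.9) enters.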
For any problem, there are many ways of implementing common random numbers. It is never enough simply to use the same seed on the random-number generator(s). Each random number used in one model for some purpose should be used for the same purpose in the second model—that is, the use of the random numbers must be synchronized. For example, if the ith random number is used to generate the call service time of an operator for the 5th caller in model 1, then the ith random number should be used for the very same purpose in model 2. For queueing systems or service facilities,
synchronization of the common random numbers guarantees that the two systems face identical
work loads: both systems face arrivals at the same instants of time, and these arrivals demand equal
amounts of service. (The actual service times of a given arrival in the two models may not be equal;
they could be proportional if the server in one model were faster than the server in the other model.)
For an inventory system, in comparing different ordering policies, synchronization guarantees that
the two systems face identical demand for a given product. For production or reliability systems,
synchronization guarantees that downtimes for a given machine will occur at exactly the same times
and will have identical durations, in the two models. On the other hand, if some aspect of one of
the systems is totally different from the other system, synchronization could be inappropriate—or
even impossible to achieve. In summary, those aspects of the two system designs that are sufficiently
similar should be simulated with common random numbers in such a way that the two models behave
similarly; but those aspects that are totally different should be simulated with independent random
numbers.
Implementation of common random numbers is model-dependent, but certain guidelines can be
given that will make CRN more likely to yield a positive correlation. The purpose of the guidelines
is to ensure that synchronization occurs:
1. Dedicate a random-number stream to a specific purpose, and use as many different streams as
needed. (Different random-number generators, or widely spaced seeds on the same generator,
can be used to get two different, nonoverlapping streams.) In addition, assign independently
chosen seeds to each stream at the beginning of each replication. It is not sufficient to assign
seeds at the beginning of the first replication and then let the random-number generator
merely continue for the second and subsequent replications. If simulation is conducted in
this manner, the first replication will be synchronized, but subsequent replications might not
be.
2. For systems (or subsystems) with external arrivals: As each entity enters the system, the next
interarrival time is generated, and then immediately all random variables (such as service
times, order sizes, etc.) needed by the arriving entity and identical in both models are generated
in a fixed order and stored as attributes of the entity, to be used later as needed. Apply
guideline 1: dedicate one random-number stream to these external arrivals and all their
attributes.
3. For systems having an entity performing given activities in a cyclic or repeating fashion,
assign a random-number stream to this entity. (Example: a machine that cycles between
two states: up, down, up, down, and so on. Use a dedicated random-number stream to generate the
uptimes and downtimes.)
4. If synchronization is not possible, or if it is inappropriate for some part of the two models,
use independent streams of random numbers for this subset of random variates.
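To make guidelines 1 and 2 concrete, the following sketch (our own illustration, not part of the original text, which assumes a simulation package) shows one way to dedicate independently seeded streams to each purpose and each replication using Python and NumPy; the stream layout, function names, and model details are assumptions for the sake of the example.

```python
import numpy as np

def make_streams(replication, n_streams=3, base_seed=12345):
    """Return dedicated, independently seeded streams for one replication.

    Stream 0: external arrivals and all their attributes (guideline 2),
    stream 1: a cycling resource such as an up/down machine (guideline 3),
    stream 2: anything that cannot be synchronized (guideline 4).
    Spawning from a per-replication SeedSequence gives fresh, nonoverlapping
    seeds each replication rather than letting one generator simply run on.
    """
    ss = np.random.SeedSequence(entropy=base_seed, spawn_key=(replication,))
    return [np.random.default_rng(child) for child in ss.spawn(n_streams)]

def simulate(mean_service, streams, n_customers=1000):
    """Placeholder model: pre-generate each arrival's variates in a fixed order."""
    arrivals, cycle, other = streams
    interarrival = arrivals.exponential(1.0, size=n_customers)
    service = arrivals.exponential(mean_service, size=n_customers)
    # ... drive the event logic of the model with these stored attributes ...
    return interarrival.mean() + service.mean()   # stand-in for a real response

for r in range(10):
    y1 = simulate(0.80, make_streams(r))   # model 1
    y2 = simulate(0.70, make_streams(r))   # model 2: same seeds, so CRN applies
```

Because make_streams(r) is called with the same replication index for both models, the ith random number drawn from a given stream serves the same purpose in each model, which is exactly the synchronization the guidelines call for.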
Unfortunately, there is no guarantee that CRN will always induce a positive correlation between
comparable runs of the two models. It is known that if, for each input random variate X, the estimators
Yr1 and Yr2 are increasing functions of the random variate X (or both are decreasing functions of
X), then ρ12 will be positive. The intuitive idea is that both models, that is, both Yr1 and Yr2,
respond in the same direction to each input random variate, and this results in positive correlation.
This increasing or decreasing nature of the response variables, called monotonicity, with respect
to the input random variables, is known to hold for certain queueing systems, such as the GI/G/c
queues, when the response variable is customer delay; so some evidence exists that common random
numbers is a worthwhile technique for queueing simulations. (For simple queues, customer delay
is an increasing function of service times and a decreasing function of interarrival times.) Wright
and Ramsay [1979] reported a negative correlation for certain inventory simulations, however. In
summary, the guidelines described above should be followed, and there should be some reasonable
indication that the response variable of interest is a monotonic function of the random input variables.
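The monotonicity argument can be checked empirically on a toy example. The sketch below (an illustration we add here, not the SMP call-center model from the text) estimates the correlation between the average delays of two hypothetical single-server queues, one pair driven with common random numbers and one with independent sampling; all parameter values are made up.

```python
import numpy as np

def avg_delay(iat, svc):
    """Average delay in a FIFO single-server queue via the Lindley recursion."""
    w, total = 0.0, 0.0
    for i in range(1, len(iat)):
        w = max(0.0, w + svc[i - 1] - iat[i])
        total += w
    return total / (len(iat) - 1)

R, n = 25, 2000
y1, y2_crn, y2_ind = [], [], []
rng_ind = np.random.default_rng(99)              # independent driver for design 2
for r in range(R):
    rng = np.random.default_rng([2024, r])       # one common driver per replication
    iat = rng.exponential(1.0, n)                # identical arrivals in both designs
    u = rng.random(n)                            # identical service-time uniforms
    y1.append(avg_delay(iat, -0.80 * np.log(1.0 - u)))      # design 1: mean service 0.80
    y2_crn.append(avg_delay(iat, -0.70 * np.log(1.0 - u)))  # design 2 under CRN
    y2_ind.append(avg_delay(rng_ind.exponential(1.0, n),
                            rng_ind.exponential(0.70, n)))  # design 2, independent
print("rho with CRN:        ", np.corrcoef(y1, y2_crn)[0, 1])
print("rho with independent:", np.corrcoef(y1, y2_ind)[0, 1])
```

Because customer delay is monotone in the service and interarrival times, the CRN pair should show a strongly positive correlation while the independent pair hovers near zero.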
[Figure 12.3: Average response times from the first 10 replications of the current and proposed (7 operators) SMP call center. Axes: replication (1-10) vs. total time.]
A more quantitative way to see the impact of CRN is to look at the standard errors, because
they directly impact the length of the confidence interval. For instance, the standard error for the
comparison of the two proposed systems is 0.88 minutes. We can approximate what the standard error
would have been with independent sampling by using the sample variances and Equation (12.6):
$$\text{s.e.} = \sqrt{\frac{4.86}{10} + \frac{23.99}{10}} = 1.70$$
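As a quick check of this calculation (our own illustrative snippet, using the variances quoted above):

```python
import math

R = 10
s2_p7, s2_p6 = 4.86, 23.99                  # sample variances of the proposed designs (Table 12.2)
se_ind = math.sqrt(s2_p7 / R + s2_p6 / R)   # Equation (12.6), independent sampling
print(f"independent-sampling s.e. ~ {se_ind:.2f} min, versus 0.88 min with CRN")
# about 1.70 versus 0.88, so CRN roughly halves the c.i. half-width here
```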
Table 12.2 Comparison of System Designs for the SMP Call Center

                           Average Response Time                      Differences
Replication       Current   Proposed (7 ops)   Proposed (6 ops)    C - P7    P7 - P6
     1               ...          ...                ...             0.05      -2.16
     2              9.06         10.64              18.03             ...        ...
     3              8.02          9.53              16.17            -1.51      -6.64
     4               ...          6.15               7.40              ...      -1.25
     5              8.31          7.83              12.70             0.48      -4.87
     6              5.92          6.09               8.26            -0.17      -2.17
     7              8.74          7.02              11.72             1.72      -4.70
     8              7.78          7.03              11.40             0.75      -4.37
     9               ...           ...                ...              ...        ...
    10               ...           ...                ...             1.85      -6.73
Sample mean          ...           ...              13.20            -0.86      -5.05
Sample variance     1.60          4.86              23.99             3.87        ...
Standard error                                                        0.62       0.88
[Figure 12.4: Average response times from the first 10 replications of the proposed designs with 7 and 6 operators for the SMP call center. Axes: replication (1-10) vs. total time.]
Understanding why the strength of the induced correlation due to CRN is different in these two
cases helps in understanding when CRN will be most effective. The two proposed designs (6 and
7 operators) are structurally the same system, differing only in the number of operators. Therefore,
when the random inputs are such that the system with 7 operators is congested, the system with
6 operators is certain to be at least as congested, and probably more so. The current system is
structurally different from the alternatives, having a distinct queue for each type of caller. When
congestion occurs in the financial software queue it may not occur in the contact management queue,
so long response times for the financial callers could be averaged with shorter response times for the
contact management callers. Thus, the relationship with the proposed system that has a single queue
is not as strong.
Let the subscript C denote the current call center, P7 the proposal with 7 operators, and P6 the
proposal with 6 operators. Then a 95% confidence interval for the difference θC - θP7 is

-0.86 ± 2.26(0.62)

or

-2.26 ≤ θC - θP7 ≤ 0.54

Therefore, from these 10 replications we cannot say for certain which design has a lower response
time, although we can say with high confidence that the difference is within ±1.4 minutes.
On the other hand, and as we might expect, the 7 and 6 operator systems are distinctly different,
with 95% confidence interval

-5.05 ± 2.26(0.88)

or

-7.04 ≤ θP7 - θP6 ≤ -3.06

showing that having a 7th operator is at least 3 minutes faster, on average, than only having 6.
Section 11.4.2 described a procedure for obtaining confidence intervals with specified precision.
Confidence intervals for the difference between two systems’ performance can be obtained in an
analogous manner.
Suppose that we want the error in our estimate of θ1 - θ2 to be less than ±ε (the quantity ε
might be a practically significant difference). Therefore, our goal is to find a number of replications
R such that

$$\frac{t_{\alpha/2,\,R-1}\, S_D}{\sqrt{R}} \le \epsilon$$

We can approximate this by finding the smallest integer R satisfying R ≥ R0 and

$$R \ge \left(\frac{z_{\alpha/2}\, S_D}{\epsilon}\right)^2$$

Suppose, for example, that we want to estimate the difference θC - θP7 to within ±0.25 minute with
95% confidence. Using the sample variance S_D² = 3.87 from Table 12.2, the approximation gives

$$R \ge \left(\frac{z_{0.025}\, S_D}{\epsilon}\right)^2 = \frac{(1.96)^2 (3.87)}{(0.25)^2} \approx 238$$
implying that 238 replications are needed, 228 more than in the initial experiment. There are 300
replications available at www.bcnn.net, and the reader is encouraged to check the result.
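A small helper of the kind sketched below (our own, assuming SciPy is available) can automate this sample-size calculation; the function name and the final t-based refinement loop are our additions.

```python
import math
from scipy import stats

def replications_needed(S_D, eps, alpha=0.05, R0=10):
    """Smallest R >= R0 with t_{alpha/2, R-1} * S_D / sqrt(R) <= eps."""
    z = stats.norm.ppf(1 - alpha / 2)
    R = max(R0, math.ceil((z * S_D / eps) ** 2))      # normal-approximation start
    while stats.t.ppf(1 - alpha / 2, R - 1) * S_D / math.sqrt(R) > eps:
        R += 1                                        # refine with the t quantile
    return R

# Values from the call-center comparison: S_D^2 = 3.87 and eps = 0.25 minute
print(replications_needed(math.sqrt(3.87), 0.25))
# the normal approximation gives 238; the t-based check raises this slightly
```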
12.2 Comparison of Several System Designs
Suppose that a simulation analyst desires to compare K alternative system designs. The comparison
will be made on the basis of some specified performance measure θi of system i, for i = 1, 2, ..., K.
Many different statistical procedures have been developed that can be used to analyze simulation
data and draw statistically sound inferences concerning the parameters θi. These procedures can
be classified as being either fixed-sample-size procedures or sequential-sampling (or multistage)
procedures.
In the first type, a predetermined sample size (i.e., run length and number of replications) is
used to draw inferences via hypothesis tests or confidence intervals. Examples of fixed-sample-size
procedures include the interval estimation of a mean performance measure (Section 11.3.2) and the
interval estimation of the difference between mean performance measures of two systems [as by
Expression (12.1) in Section 12.1]. Advantages of fixed-sample-size procedures include a known or
easily estimated cost in terms of computer time before running the experiments. When computer
time is limited, or when a pilot study is being conducted, a fixed-sample-size procedure might be
appropriate. In some cases, clearly inferior system designs may be ruled out at this early stage. A
major disadvantage is that a strong conclusion could be impossible. For example, the confidence
interval could be too wide for practical use, since the width is an indication of the precision of the
point estimator. A hypothesis test may lead to a failure to reject the null hypothesis, a weak conclusion
in general, meaning that there is no strong evidence one way or the other about the truth or falsity of
the null hypothesis.
A sequential sampling scheme is one in which more and more data are collected until an estimator
with a prespecified precision is achieved or until one of several alternatives is selected, with the
probability of correct selection being larger than a prespecified value. A two-stage (or multistage)
procedure is one in which an initial sample is used to estimate how many additional observations
are needed to draw conclusions with a specified precision. An example of a two-stage procedure for
estimating the performance measure of a single system was given in Section 11.4.2 and, for two
systems, in Section 12.1.3.
The proper procedure to use depends on the goal of the simulation analyst. Some possible goals
are the following:

1. Estimation of each parameter θi;
2. Comparison of each performance measure θi to that of a standard or existing system, say θ1;
3. All pairwise comparisons of the differences θi - θj, for i ≠ j;
4. Selection of the best θi (largest or smallest).

The first three goals will be achieved by the construction of confidence intervals. The number of such
confidence intervals is C = K, C = K - 1, and C = K(K - 1)/2, respectively. Hochberg and Tamhane
[1987] and Hsu [1996] are comprehensive references for such multiple-comparison procedures. The
fourth goal requires the use of a type of statistical procedure known as a multiple ranking and selection
procedure. Procedures to achieve these and other goals are discussed by Kleijnen [1975, Chapters
II and V], who also discusses their relative merit and disadvantages. Goldsman and Nelson [1998]
and Law [2007] discuss those selection procedures most relevant to simulation. A comprehensive
reference is Bechhofer, Santner, and Goldsman [1995]. The next subsection presents a fixed-sample-
size procedure that can be used to meet goals 1, 2, and 3, and is applicable in a wide range of
circumstances. Subsection 12.2.2 presents a procedure to achieve goal 4.
Suppose that C confidence intervals are computed and that the ith interval has confidence coefficient
1 - αi. Let Si be the statement that the ith confidence interval contains the parameter (or difference
of two parameters) being estimated. This statement might be true or false for a given set of data, but
the procedure leading to the interval is designed so that statement Si will be true with probability
1 - αi. When it is desired to make statements about several parameters simultaneously, as in goals 1,
2 and 3, the analyst would like to have high confidence that all statements are true simultaneously.
The Bonferroni inequality states that

$$P(\text{all statements } S_i \text{ are true}, \; i = 1, \ldots, C) \;\ge\; 1 - \sum_{j=1}^{C} \alpha_j \;=\; 1 - \alpha_E \qquad (12.16)$$

where αE = Σ(j=1 to C) αj is called the overall error probability. Expression (12.16) can be restated as

$$P(\text{one or more statements } S_i \text{ is false}, \; i = 1, \ldots, C) \;\le\; \alpha_E$$

or, equivalently,

$$P(\text{one or more of the } C \text{ confidence intervals does not contain the parameter being estimated}) \;\le\; \alpha_E$$
Thus, αE provides an upper bound on the probability of a false conclusion. To conduct an experiment
that involves making C comparisons, first select the overall error probability, say, αE = 0.05 or 0.10.
The individual αj may be chosen to be equal (αj = αE/C) or unequal, as desired. The smaller the
value of αj, the wider the jth confidence interval will be. For example, if two 95% c.i.s (α1 = α2 =
0.05) are constructed, the overall confidence level will be 90% or greater (αE = α1 + α2 = 0.10). If
ten 95% c.i.s are constructed (αj = 0.05, j = 1, ..., 10), the resulting overall confidence level could
be as low as 50% (αE = Σ(j=1 to 10) αj = 0.50), which is far too low for practical use. To guarantee an
overall confidence level of 95% when 10 comparisons are being made, one approach is to construct
ten 99.5% confidence intervals for the parameters (or differences) of interest.
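For instance, a short calculation (ours, assuming SciPy) shows how the per-comparison error level and critical value change when the overall error probability is split equally among C comparisons:

```python
from scipy import stats

def bonferroni_t(alpha_E, C, dof):
    """Per-comparison alpha and t critical value for C simultaneous intervals."""
    alpha_j = alpha_E / C                  # equal split of the overall error
    return alpha_j, stats.t.ppf(1 - alpha_j / 2, dof)

# Ten comparisons, overall 95% confidence, 9 degrees of freedom (illustrative)
print(bonferroni_t(0.05, 10, 9))   # alpha_j = 0.005, i.e., ten 99.5% c.i.s, t ~ 3.69
```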
The Bonferroni approach to multiple confidence intervals is based on Expression (12.16). A
major advantage is that it holds whether the models for the alternative designs are run with indepen-
dent sampling or with common random numbers.
The major disadvantage of the Bonferroni approach in making a large number of comparisons is
the increased width of each individual interval. For example, for a given set of data and a large sample
size, a 99.5% c.i. will be z0.0025/z0.025 = 2.807/1.96 = 1.43 times longer than a 95% c.i. For small
sample sizes, say, for a sample of size 5, a 99.5% c.i. will be t0.0025,4/t0.025,4 = 5.598/2.776 = 2.02
times longer than an individual 95% c.i. The width of a c.i. is a measure of the precision of the
estimate. For these reasons, it is recommended that the Bonferroni approach be used only when a
small number of comparisons are being made. Twenty or so comparisons appear to be the practical
upper limit.
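The interval-widening factors quoted above can be reproduced directly from the normal and t quantiles (a two-line check we add here, assuming SciPy):

```python
from scipy import stats

# How much wider a 99.5% c.i. is than a 95% c.i.
print(stats.norm.ppf(0.9975) / stats.norm.ppf(0.975))   # large samples: ~1.43
print(stats.t.ppf(0.9975, 4) / stats.t.ppf(0.975, 4))   # sample of size 5: ~2.02
```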
Corresponding to goals 1, 2, and 3, there are at least three possible ways of using the Bonferroni
Inequality (12.16) when comparing K alternative system designs:
1. Individual c.i.s: Construct a 100(1 - αi)% c.i. for parameter θi by using Expression (11.10),
in which case the number of intervals is C = K. If independent sampling were used, the
K c.i.s would be mutually independent, and thus the overall confidence level would be
(1 - α1)(1 - α2) ··· (1 - αK), which is larger, but not by much, than the right
side of Expression (12.16). This type of procedure is most often used to estimate multiple
parameters of a single system, rather than to compare systems; because multiple parameter
estimates from the same system are likely to be dependent, the Bonferroni inequality typically
is needed.
2. Comparison to an existing system: Compare all designs to one specific design, usually to an
existing system. That is, construct a 100(1 - αi)% c.i. for θi - θ1 (i = 2, 3, ..., K), using
Expression (12.1). (System 1 with performance measure θ1 is assumed to be the existing
system.) In this case, the number of intervals is C = K - 1. This type of procedure is most
often used to compare several competitors to the present system in order to learn which are
better.
3. All pairwise comparisons: Compare all designs to each other. That is, for any two system
designs i ≠ j, construct a 100(1 - αij)% c.i. for θi - θj. With K designs, the number of
confidence intervals computed is C = K(K - 1)/2. The overall confidence coefficient would
be bounded below by 1 - αE = 1 - Σ(i<j) αij [which follows by Expression (12.16)]. It is
generally believed that CRN will make the true overall confidence level larger than the right
side of Expression (12.16), and usually larger than will independent sampling. The right side
of Expression (12.16) can be thought of as giving the worst case, that is, the lowest possible
overall confidence level.
Example 12.2
Reconsider the call center design problem of Example 12.1. The alternative system designs are the
current call center (C), the proposed design with 7 cross-trained operators (P7), and the proposed
design with 6 cross-trained operators (P6).
Using the data in Table 12.2, confidence intervals for θC - θP7, θC - θP6, and θP7 - θP6
will be constructed, having an overall confidence level of 95%. Recall that CRN was used in all
models, but this does not affect the overall confidence level, because, as mentioned, the Bonferroni
Inequality (12.16) holds regardless of the statistical independence or dependence of the data.
Since the overall error probability is αE = 0.05 and C = 3 confidence intervals are to be
constructed, let αj = 0.05/3 = 0.0167. Then use Expression (12.1), with proper modifications, to
construct C = 3 confidence intervals with α = αj = 0.0167 and degrees of freedom ν = 10 - 1 = 9.
The value of tαj/2,R-1 = t0.0083,9 ≈ 2.97 is obtained from Table A.5 by interpolation; the point
estimates and standard errors are obtained from Table 12.2. Each of the three confidence intervals,
with overall confidence coefficient at least 95%, takes the form given by Expression (12.1): the point
estimate of the difference plus or minus 2.97 times its standard error.
The simulation analyst has high confidence (at least 95%) that all three confidence statements are
correct. Notice that the c.i. for θC - θP7 again contains zero; thus, there is no statistically significant
difference between the current design and the alternative with 7 cross-trained operators, a conclusion
that supports the previous results in Example 12.1. The other confidence intervals lie completely to
the left of 0, indicating that C and P7 both dominate P6.
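Using the point estimates and standard errors quoted above for θC - θP7 and θP7 - θP6, two of the simultaneous intervals can be reproduced as follows (our own snippet; the third interval, for θC - θP6, needs its own standard error from the replication data and is not recomputed here):

```python
t_crit = 2.97                                                 # t_{0.0083, 9} from above
diffs = {"C - P7": (-0.86, 0.62), "P7 - P6": (-5.05, 0.88)}   # (estimate, standard error)
for name, (d, se) in diffs.items():
    print(f"{name}: {d - t_crit * se:6.2f} to {d + t_crit * se:6.2f}")
# C - P7 spans zero; P7 - P6 lies entirely below zero, consistent with the conclusions above
```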
Some of the exercises at the end of this chapter provide an opportunity to compare CRN and inde-
pendent sampling and to compute simultaneous confidence intervals under the Bonferroni approach.
Suppose that there are K system designs, and the unknown expected value of the ith system's
performance is θi. We are interested in which system design is best, where "best" means having the
maximum or minimum θi, depending on the problem. For instance, in Example 12.3 we will compare
K = 8 designs for the semiconductor fabrication facility described in Example 11.8 of Chapter 11,
in which θi is the steady-state mean cycle time of one product family for the ith design, and smaller
θi is better. We want a procedure that is capable of handling large K, say, on the order of 100.
Let B denote the (unknown) index of the best system design. The smaller the true differences
|θB - θi|, i ≠ B, are, and the more certain we want to be that we find the best system, the more
replications are required to achieve our goal. Therefore, instead of demanding that we find the
best design B no matter what, we compromise and ask to find B with high probability whenever
the difference between system B and the others is significantly large. More precisely, we want the
probability that we select the best system to be at least 1 - α whenever |θB - θi| > ε for all i ≠ B,
where ε depends on the problem. If there are systems that are within ε of the best, then we will be
satisfied to select either the best or any one of the near-best designs. Both the probability of correct
selection 1 - α and the practically significant difference ε will be under our control.
The following procedure accomplishes this goal by undertaking two stages of simulation: In
the first stage, R0 replications are obtained from each of the system designs. Those systems that are
statistically significantly inferior to the others are screened out, meaning that they are eliminated
from further consideration, with confidence 1 - α/2. If more than one system survives screening,
then the survivors receive enough additional replications to select the best, or a system design within
ε of the best, with confidence 1 - α/2. Then, using reasoning related to the Bonferroni inequality, the
selected system is the best or a near-best system with confidence at least 1 - α for the combination
of screening and selection.
The first-stage screening can be very important because, as Example 12.3 will illustrate, the
number of replications required in the second stage to select the best can be quite large, and there
is no reason to waste time and effort on system designs that are obviously not competitive. This is
especially important when K is large.
The procedure below is implemented in SimulationTools.xls, which is available at
www.bcnn.net. The SimulationTools.xls implementation guides the user as to what data
need to be obtained for each stage.
Select-the-Best Procedure
1. Specify the desired probability of correct selection 1/K < 1 - α < 1, the practically significant
difference ε > 0, an initial number of replications R0 ≥ 10, and the number of competing
systems K. Set

$$t = t_{\,1-(1-\alpha/2)^{1/(K-1)},\; R_0-1}$$

and obtain Rinott's constant h = h(R0, K, 1 - α/2). (Hint: Both critical values are calculated
automatically by SimulationTools.xls; small tables of t and h are given in the Appendix.)

2. Make R0 replications of each system. Calculate the first-stage sample means and sample
variances

$$\bar{Y}_{\cdot i} = \frac{1}{R_0}\sum_{r=1}^{R_0} Y_{ri}, \qquad S_i^2 = \frac{1}{R_0-1}\sum_{r=1}^{R_0}\left(Y_{ri}-\bar{Y}_{\cdot i}\right)^2$$

for i = 1, 2, ..., K.

3. Screening: Calculate the quantities

$$W_{ij} = t\,\sqrt{\frac{S_i^2 + S_j^2}{R_0}}$$

for all i ≠ j.

(a) If bigger is better, then form the survivor subset S containing every system design i
such that

$$\bar{Y}_{\cdot i} \ge \bar{Y}_{\cdot j} - \max\{0,\, W_{ij}-\epsilon\} \quad \text{for all } j \ne i$$

That is, retain every system i whose sample mean is no more than max{0, Wij - ε} smaller
than any other system's sample mean.

(b) If smaller is better, then form the survivor subset S containing every system design i
such that

$$\bar{Y}_{\cdot i} \le \bar{Y}_{\cdot j} + \max\{0,\, W_{ij}-\epsilon\} \quad \text{for all } j \ne i$$

4. If the subset S contains a single system design, then stop and report it as the best. Otherwise,
for every system design i in S, calculate the second-stage sample size

$$R_i = \max\left\{R_0,\; \left\lceil \left(\frac{h S_i}{\epsilon}\right)^2 \right\rceil\right\}$$

and make Ri - R0 additional replications of system i.

5. Calculate the overall sample means

$$\bar{Y}_{\cdot i} = \frac{1}{R_i}\sum_{r=1}^{R_i} Y_{ri}$$

for all i in S. If bigger is better, then select the system with the largest Ȳ.i as best. Otherwise,
select the system with the smallest Ȳ.i as best.
The critical values t and h cannot usually be obtained directly from tables, so interpolation is
often required. As an alternative, algorithms for computing quantiles of the t distribution can be
found in most numerical analysis and spreadsheet software; unfortunately, this is not the case for
Rinott's constant h. Therefore, use of SimulationTools.xls is highly recommended for this
procedure.
Nelson et al. [2001] prove that the Select-the-Best Procedure finds the system with the largest
or smallest θi, or one within ε of the best θi, with confidence level at least 1 - α, provided the data
are normally distributed and the systems are simulated independently. The procedure will also work
if CRN is applied.
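The screening and second-stage-sizing arithmetic of the procedure is easy to script. The sketch below is our own reading of the steps above (assuming NumPy and SciPy); it computes the screening critical value t directly, but Rinott's constant h is not available in standard libraries and must be supplied, for example from the Appendix tables or SimulationTools.xls.

```python
import math
import numpy as np
from scipy import stats

def screen_and_size(Y, eps, alpha=0.05, smaller_is_better=True, h=None):
    """First-stage screening (Step 3) and second-stage sizes (Step 4).

    Y is an R0-by-K array of first-stage outputs; h is Rinott's constant
    h(R0, K, 1 - alpha/2), supplied by the user.
    """
    R0, K = Y.shape
    ybar, s2 = Y.mean(axis=0), Y.var(axis=0, ddof=1)
    t = stats.t.ppf((1 - alpha / 2) ** (1.0 / (K - 1)), R0 - 1)
    z = ybar if smaller_is_better else -ybar          # screen on a "smaller" scale
    survivors = []
    for i in range(K):
        W = t * np.sqrt((s2[i] + s2) / R0)            # W_ij for every j at once
        slack = np.maximum(0.0, W - eps)
        if all(z[i] <= z[j] + slack[j] for j in range(K) if j != i):
            survivors.append(i)
    sizes = None
    if h is not None:
        sizes = {i: max(R0, math.ceil((h * math.sqrt(s2[i]) / eps) ** 2))
                 for i in survivors}
    return survivors, sizes

# Usage sketch (hypothetical file): Y = np.loadtxt("first_stage.csv", delimiter=",")
#                                   screen_and_size(Y, eps=0.15, h=4.635)
```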
[Figure 12.5: SimulationTools.xls input screen for the select-the-best procedure, showing options such as "Bigger is better" and the desired confidence level.]
Including the base case, there are K = 8 system designs, and the performance measures
θ1, θ2, ..., θ8 are the steady-state mean cycle times for C-chip under each design. FastChip decides
that even very small improvements in cycle time are important, so they set ε = 0.15 hours (or 9
minutes). This is a very tight requirement when cycle times average around 45 hours, and we will
see the consequences of this choice below. FastChip would like 95% confidence that they have selected
the best design, or one within 9 minutes of the best. Figure 12.5 shows the input screen to set up this
problem in SimulationTools.xls, which uses the term "indifference level" for the practically
significant difference ε.
Table 12.3 contains the first-stage results obtained from Ro = 10 replications of each system
design. The alternative that adjusts the Oxidize step seems to provide the smallest average cycle
time, but all of the alternatives appear close. The complete data from this example are available at
www.bcnn.net.
The t value we need for screening is

$$t_{\,1-(1-\alpha/2)^{1/(K-1)},\; R_0-1} = t_{\,1-(0.975)^{1/7},\,9} = t_{0.0036,\,9} = 3.455$$
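This critical value is easy to verify with any t-quantile routine (our check, assuming SciPy):

```python
from scipy import stats

alpha, K, R0 = 0.05, 8, 10
print(stats.t.ppf((1 - alpha / 2) ** (1 / (K - 1)), R0 - 1))   # ~3.455
```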
The subset S will always contain the sample best system, which is system i = 4 (Oxidize) here.
To avoid being screened out, the other systems' sample means must be small enough to satisfy
Ȳ.i ≤ Ȳ.j + max{0, Wij - ε} for all j ≠ i, and in particular Ȳ.i ≤ Ȳ.4 + max{0, Wi4 - ε}. Table 12.4
contains the thresholds Ȳ.4 + max{0, Wi4 - ε}; only alternative 2, with sample mean 45.70 hours, is
small enough to survive, so S = {2, 4}.
To determine the second-stage sample sizes for alternatives 2 and 4, we need h = h(R0, K, 1 -
α/2) = h(10, 8, 0.975) = 4.635, which could be obtained from SimulationTools.xls. Then
R2 = ⌈(hS2/ε)²⌉ = 349 and R4 = ⌈(hS4/ε)²⌉ = 90. Thus, alternative 2 requires 349 - 10 = 339
additional replications, while alternative 4 requires 90 - 10 = 80 additional replications. Notice that
the number of replications for alternative 2 is quite large. In general, the second-stage sample size is
large if the sample variance is large, making it difficult to detect differences, or if ε is small, meaning
that even small differences matter, as in this example.
After obtaining the additional replications, Ȳ.2 = 45.88 and Ȳ.4 = 45.00; therefore, alternative 4,
Oxidize, is selected. The procedure guarantees, with 95% confidence, that θ4 is the smallest steady-
state mean cycle time of the 8 alternatives, or, if it is not, then it is within ε = 0.15 hours of being
the best.
The Select-the-Best Procedure, as presented here, does both screening and selection. This makes
it particularly useful for finding the best system design from among a large number of alternatives
since many of them may be screened out in the first stage. The procedure can be used for screening
alone, without selection, if the objective is to eliminate designs that are not competitive. This might
be worthwhile when we want to choose from a collection of competitive alternatives based on criteria
beyond mean performance alone. For screening only, the procedure stops at Step 3 and can use the
critical value

$$t = t_{\,1-(1-\alpha)^{1/(K-1)},\; R_0-1}$$
Similarly, if the number of system designs K is small, and the goal is to select the best, then
screening can be skipped and all systems receive second-stage replications. For selection only, skip
Step 3 and use the somewhat smaller critical value h = h(R0, K, 1 - α).
12.3. Metamodeling
Suppose that there is a simulation output response variable Y that is related to k independent variables,
say, x1, x2, ..., xk. The dependent variable Y is a random variable, while the independent variables
x1, x2, ..., xk are called design variables and are usually subject to control. The true relationship
between the variables Y and x is represented by the often complex simulation model. Our goal is to
approximate this relationship by a simpler mathematical function called a metamodel. In some cases,
the analyst will know the exact form of the functional relationship between Y and x1, x2, ..., xk, say,
Y = f(x1, x2, ..., xk). However, in most cases, the functional relationship is unknown, and the analyst
must select an appropriate f containing unknown parameters, and then estimate those parameters
from a set of data {Y, x}. Regression analysis is one method for estimating the parameters.
In the simplest case, suppose that the relationship between Y and a single independent variable x is
believed to be linear in expectation:

$$E(Y \mid x) = \beta_0 + \beta_1 x \qquad (12.17)$$

where β0 is the intercept on the Y axis, an unknown constant, and β1 is the slope, or change in Y for
a unit change in x, also an unknown constant. It is further assumed that each observation of Y can
be described by the model

$$Y = \beta_0 + \beta_1 x + \epsilon \qquad (12.18)$$

where ε is a random error with mean zero and constant variance σ². The regression model given
by Equation (12.18) involves a single variable x and is commonly called a simple linear regression
model.
Suppose that there are n pairs of observations (Y1, x1), (Y2, x2), ..., (Yn, xn). These observations
can be used to estimate β0 and β1 in Equation (12.18). The method of least squares is commonly used
to form the estimates. In the method of least squares, β0 and β1 are estimated in such a way that the
sum of the squares of the deviations between the observations and the regression line is minimized.
The individual observations in Equation (12.18) can be written as

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n \qquad (12.19)$$

where

$$\epsilon_i = Y_i - \beta_0 - \beta_1 x_i \qquad (12.20)$$

represents the difference between the observed response Yi and the expected response β0 + β1xi
predicted by the model in Equation (12.17). Figure 12.6 shows how εi is related to xi, Yi, and E(Yi|xi).
The sum of squares of the deviations given in Equation (12.20) is

$$L = \sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 x_i\right)^2$$
To minimize L, find ∂L/∂β0 and ∂L/∂β1, set each to zero, and solve for β̂0 and β̂1. Taking the
partial derivatives and setting each to zero yields

$$\hat{\beta}_1 \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \sum_{i=1}^{n} Y_i\left(x_i - \bar{x}\right) \qquad (12.23)$$

$$n\hat{\beta}_0 = \sum_{i=1}^{n} Y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i \qquad (12.24)$$

so that

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i\left(x_i-\bar{x}\right)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2} = \frac{S_{xY}}{S_{xx}} \qquad (12.25)$$

where SxY denotes the corrected sum of cross products of x and Y. The denominator of Equation
(12.25) is rewritten for computational purposes as

$$S_{xx} = \sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

where Sxx denotes the corrected sum of squares of x. The value of β̂0 can be retrieved easily as

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} \qquad (12.28)$$
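A short routine (our own, with made-up data loosely patterned on the claims-processing setting of Figure 12.7) shows Equations (12.25) and (12.28) in use:

```python
import numpy as np

def fit_simple_linear(x, Y):
    """Least-squares slope and intercept via Equations (12.25) and (12.28)."""
    x, Y = np.asarray(x, dtype=float), np.asarray(Y, dtype=float)
    n = len(x)
    S_xY = np.sum(x * Y) - x.sum() * Y.sum() / n     # corrected cross products
    S_xx = np.sum(x ** 2) - x.sum() ** 2 / n         # corrected sum of squares of x
    b1 = S_xY / S_xx
    b0 = Y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical data: x = number of claims, Y = hours to complete processing
x = [10, 20, 30, 40, 50]
Y = [1.2, 2.1, 2.9, 4.2, 4.8]
print(fit_simple_linear(x, Y))   # roughly intercept 0.25, slope 0.093 hour per claim
```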
[Figure 12.7: Relationship between number of claims and hours of processing time (vertical axis: time to complete).]