USC-CSE-91-500
1. Introduction
Like many fields in their early stages, the software field has had its share
of project disasters: the software equivalents of Beauvais Cathedral, the S.S.
Titanic, and the "Galloping Gertie" Tacoma Narrows Bridge. The frequency of
these disaster projects is a serious concern: a recent survey of 600 firms
indicated that 35% of them had at least one "runaway" software project [1].
Current approaches to the software process make it too easy for software
projects to make high-risk commitments that they will later regret. The sequential,
document-driven "waterfall" process model tempts people to overpromise
software capabilities in contractually-binding requirements specifications before
they understand their risk implications. The code-driven evolutionary
development process model tempts people to say, "Here are some neat ideas I'd
like to put into this system. I'll code them up, and if they don't fit other people's
ideas, we'll just evolve things until they work." This sort of approach works fine in
some well-supported mini-domains such as spreadsheet applications, but in
more complex application domains, it most often creates or neglects intrinsic
high-risk elements and leads the project down the path to disaster.
At TRW and elsewhere, I have had the good fortune to observe many
software project managers at work first-hand, and to try to understand and apply
the factors that distinguished the more successful project managers from the less
successful ones. Some were able to successfully use a waterfall approach,
others successfully used an evolutionary development approach, and others
successfully orchestrated complex mixtures of these and other approaches
involving prototyping, simulation, commercial software, executable specifications,
tiger teams, design competitions, subcontracting, and various kinds of cost-
benefit analyses.
One pattern that emerged very strongly was that the successful project
managers were good risk managers. Although they generally didn't use such
terms as risk identification, risk assessment, risk management planning, or risk
monitoring, that's what they were doing. And their projects tended to avoid pitfalls
and produce good products.
Risk Management Planning produces plans for addressing each risk item
(e.g., via risk avoidance, risk transfer, risk reduction, or buying information),
including the coordination of the individual risk-item plans with each other and
with the overall project plan. Typical techniques include checklists of risk
resolution techniques, cost-benefit analysis, and standard risk management plan
outlines, forms, and elements.
Risk Resolution produces a situation in which the risk items are eliminated
or otherwise resolved (e.g., risk avoidance via relaxation of requirements).
Typical techniques include prototypes, simulations, benchmarks, mission
analyses, key-personnel agreements, design-to-cost approaches, and
incremental development.
The next section of this article provides definitions of the basic software
risk management terms and concepts. The following section describes and
illustrates some of the primary software risk management principles and
practices identified in Figure 1. A final section provides an overview of the use of
risk management throughout the software life cycle, and presents an overall
summary of the article's key points. Next, the companion article to this one [3] will
discuss the application of the risk management principles and practices to a large
software-intensive application: the FAA Advanced Automation System.
Definitions
The satellite platform manager identifies two major options for reducing
the risk of losing the experiment: rely on the experiment team to test the
software, or hire an independent verification and validation (IV&V) contractor
to test the critical software.
The decision tree in Figure 2 then shows, for each of the two major
decision options, the possible outcomes, their probabilities, the losses associated
with each outcome, the risk exposure associated with each outcome, and the
total risk exposure (or expected loss) associated with each decision option. In
this case, the total risk exposure associated with the experiment-team option is
$2M. For the IV&V option, the total risk exposure is only $1.3M; thus it represents
the more attractive option.
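In code, the expected-loss arithmetic behind the decision tree looks like this. The branch probabilities and losses below are hypothetical stand-ins, chosen only to be consistent with the $2M and $1.3M totals quoted above; the actual values appear in Figure 2.

```python
# Risk exposure (RE) = sum over outcomes of Prob(outcome) * Loss(outcome).

def risk_exposure(branches):
    """branches: list of (probability, loss in $M) pairs for one option."""
    return sum(p * loss for p, loss in branches)

# Option 1: rely on the experiment team's own testing.
# Hypothetical: a 0.10 chance of a critical software error costing $20M.
re_team = risk_exposure([(0.10, 20.0)])

# Option 2: hire an IV&V contractor for $0.5M (a certain cost), assumed
# to cut the critical-error probability to 0.04.
re_ivv = risk_exposure([(1.00, 0.5), (0.04, 20.0)])

print(f"Experiment-team option RE: ${re_team:.1f}M")  # $2.0M
print(f"IV&V option RE:            ${re_ivv:.1f}M")   # $1.3M
```

Whichever option has the lower total risk exposure is the more attractive one; here the IV&V option wins.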
Besides providing individual solutions for risk management situations, the
decision tree also provides a framework for analyzing the sensitivity of preferred
solutions to the risk exposure parameters. Thus, for example, the experiment-
team option would be preferred if the loss due to a critical software error were
less than $13M, if the experiment team could reduce their critical-software-error
probability to less than 0.065, if the IV&V team cost more than $1.2M, if the IV&V
team were unable to reduce the probability of critical error to less than 0.075, or if
there were various partial combinations of these possibilities.
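Under the same hypothetical baseline values (a $20M loss, a 0.10 experiment-team error probability, a $0.5M IV&V cost, and a 0.04 IV&V error probability), the one-at-a-time breakeven points quoted above can be recomputed directly:

```python
# One-at-a-time sensitivity analysis: vary a single parameter while the
# others stay at their (hypothetical) baseline values.
p_team, p_ivv = 0.10, 0.04   # assumed critical-error probabilities
loss, ivv_cost = 20.0, 0.5   # assumed loss and IV&V cost, in $M

re_team = p_team * loss              # $2.0M total risk exposure
re_ivv = ivv_cost + p_ivv * loss     # $1.3M total risk exposure

# The experiment-team option becomes preferable when...
loss_breakeven = re_ivv / p_team                 # its loss < $13M
p_team_breakeven = re_ivv / loss                 # its error prob < 0.065
cost_breakeven = re_team - p_ivv * loss          # IV&V cost > $1.2M
p_ivv_breakeven = (re_team - ivv_cost) / loss    # IV&V error prob > 0.075
```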
Even with this sort of sensitivity analysis, there may not be enough
information available to quantify the risk exposure parameters well enough to
perform a precise analysis. However, the risk exposure framework supports
some more approximate but still very useful approaches, such as range
estimation and scale-of-10 estimation, which will be discussed in the next
section.
Another significant point with respect to the top-10 list is that most of the
critical risk items have to do with shortfalls in domain understanding and in
properly scoping the software job to be done -- areas which are generally
underemphasized in computer science literature and education.
After using all of the various risk identification checklists, plus the other
risk identification techniques involved in decision driver analysis, assumption
analysis, and decomposition, one very real risk is that the project will identify so
many risk items that it could spend years just investigating them.
This is where risk prioritization and its associated risk analysis activities become
essential.
The most effective technique for risk prioritization involves the Risk
Exposure quantity which we discussed earlier. It allows us to order the candidate
risk items identified and determine which are most important to address.
Three key points emerge from Figures 5 and 6. First, projects often focus
on factors having either a high Prob(UO) or a high Loss(UO), but these may not
be the key factors with a high RE combination. One of the highest Prob(UO)
values comes from item G (data reduction software errors), but the fact that these
errors are recoverable and not mission-critical leads to a low loss factor and a
resulting low RE of 8. Similarly, item I (insufficient memory) has a high potential
loss, but its low probability leads to a low RE of 7. On the other hand, a relatively
low-profile item such as item H (user interface shortfalls) becomes a relatively
high-priority risk item because its combination of moderately high probability and
loss factors yields an RE of 30.
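This prioritization can be sketched in a few lines using 0-to-10 scale ratings for Prob(UO) and Loss(UO). The individual ratings below are hypothetical; only the resulting RE products (8, 7, and 30) come from the discussion above.

```python
# Risk prioritization: RE = Prob(UO) rating * Loss(UO) rating, ranked
# highest first. Individual ratings are hypothetical; the RE products
# match the values quoted in the text for items G, I, and H.
items = {
    "G: data reduction software errors": (8, 1),  # likely but recoverable
    "I: insufficient memory":            (1, 7),  # costly but unlikely
    "H: user interface shortfalls":      (6, 5),  # moderate on both scales
}

ranked = sorted(items.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
for name, (prob, loss) in ranked:
    print(f"RE = {prob * loss:2d}  {name}")
```

The low-profile item H sorts to the top of the list, ahead of the high-probability and high-loss items.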
The third key point emerging from Figures 5 and 6 deals with the
probability rating ranges given for items A, B, and C. It often occurs that there is
a good deal of uncertainty in estimating the probability or loss associated with an
unsatisfactory outcome. (The assessments are frequently subjective, and are
often the product of surveying several different domain experts.) The amount of
uncertainty is itself a major source of risk, which needs to be reduced as early as
possible. The primary example in Figures 5 and 6 is the uncertainty in item C
about whether the software fault tolerance features are going to cause an
unacceptable degradation in real-time performance. If Prob(UO) is rated at 4, this
item has only a moderate Risk Exposure of 28; but if Prob(UO) is 8, the RE has a
top-priority rating of 56.
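Carrying item C's probability uncertainty through the RE product makes the point concrete. The loss rating of 7 is an assumption, chosen to be consistent with the REs of 28 and 56 quoted above.

```python
# Range estimation for item C: propagate the Prob(UO) uncertainty
# through to a range of RE values. The loss rating of 7 is assumed.
loss_rating = 7
prob_low, prob_high = 4, 8

re_low, re_high = prob_low * loss_rating, prob_high * loss_rating
print(f"Item C risk exposure range: {re_low}..{re_high}")  # 28..56
```

A range spanning "moderate" to "top priority" is itself a signal that reducing the uncertainty should come first.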
One of the best ways of reducing this source of risk due to uncertainty is
to buy information about the actual situation. For the issue of fault tolerance
features vs. performance, a good way to buy information would be to invest in a
prototype, to better understand the performance impact of the various fault
tolerance features. We will elaborate on this under Risk Management Planning
next.
Once the Risk Assessment activities determine a project's major risk items
and their relative priorities, we need to establish a set of Risk Control functions to
bring the risk items under control. The first step in this process is to develop a set
of Risk Management Plans which lay out the activities necessary to bring the risk
items under control.
One aid in doing this is the Top-10 checklist in Figure 3 which identifies
the most successful risk management techniques for the most common risk
items on a software project. As an example, the uncertainty in performance
impact of the software fault tolerance features is covered under item 9 of Figure
3, Real-Time Performance Shortfalls. The corresponding risk management
techniques include simulation, benchmarking, modeling, prototyping,
instrumentation, and tuning. Let us assume, for example, that a prototype of
representative fault tolerance features is the most cost-effective way to determine
and reduce their impact on system performance.
The next step in risk management planning is to develop individual Risk
Management Plans for each risk item. Figure 7 shows the individual plan for
prototyping the fault tolerance features and determining their performance
impact. The plan is organized around a standard format for software plans,
oriented around answering the standard questions of "why, what, when, who,
where, how, and how much." This plan organization allows the plans to be
concise (e.g., fitting on one page), action-oriented, easy to understand, and easy
to monitor.
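The seven-question format lends itself to a simple template. The entries below are a hypothetical sketch of such a plan, not the contents of Figure 7.

```python
# A one-page risk management plan keyed by the standard questions.
# All entries are hypothetical illustrations.
risk_plan = {
    "why":      "Determine performance impact of fault tolerance features",
    "what":     "Prototype representative fault tolerance features; measure",
    "when":     "Results needed by the Month 4 design review",
    "who":      "Performance engineering group",
    "where":    "Host development environment",
    "how":      "Build prototype, instrument it, benchmark under load",
    "how much": "Estimated 3 staff-months",
}

# A plan is only complete when every standard question has an answer.
missing = [q for q, answer in risk_plan.items() if not answer.strip()]
assert not missing, f"Plan missing answers for: {missing}"
```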
Once a good set of risk management plans is established, the risk
resolution process consists of implementing whatever prototypes, simulations,
benchmarks, surveys, or other risk reduction techniques are called for in the
plans. Risk monitoring ensures that this is a closed-loop process by tracking risk
reduction progress and applying whatever corrective action is necessary to keep
the risk-resolution process on track.
Figure 8 shows how a project top-10 list could have worked for the
spaceborne experiment project, as of Month 3 of the project. The project's top
risk item in month 3 is a critical staffing problem. Highlighting it in the monthly
review meeting will stimulate a discussion of the staffing options by the project
team and the boss (make the "unavailable" key person available, reshuffle
project personnel, look for new people within or outside the organization). This
should result in an assignment of action items to follow through on the preferred
options chosen, including possible actions by the project manager's boss.
The #2 risk item in Figure 8, target hardware delivery delays, is also one in
which the project manager's boss may be able to help expedite a solution -- by
cutting through corporate-procurement red tape, for example, or by escalating
vendor-delay issues with the vendor's higher management.
As seen in Figure 8, some risk items are going down in priority or going off
the list, while others are escalating upward or coming onto the list. The ones
going down the list, such as the design V&V staffing, fault tolerance prototyping,
and user interface prototyping, still need to be monitored but frequently do not
need special management action. The ones going up or onto the list, such as the
data bus design changes and the testbed interface definitions, are generally the
ones needing higher management attention to help in getting them resolved
quickly.
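The month-to-month reprioritization behind Figure 8 amounts to comparing each item's current rank with last month's rank and flagging the risers and the new arrivals. A minimal sketch, with hypothetical item names and ranks:

```python
# Flag top-10 items that are new or rising (needing management attention)
# versus falling (needing only continued monitoring). Ranks are hypothetical.
last_month = {"key-person staffing": 2, "target hardware delivery": 4,
              "fault tolerance prototyping": 1, "user interface prototyping": 3}
this_month = {"key-person staffing": 1, "target hardware delivery": 2,
              "data bus design changes": 3, "fault tolerance prototyping": 5,
              "user interface prototyping": 6}

for item, rank in sorted(this_month.items(), key=lambda kv: kv[1]):
    prev = last_month.get(item)
    if prev is None:
        status = "NEW: needs management attention"
    elif rank < prev:
        status = "rising: needs management attention"
    else:
        status = "falling: keep monitoring"
    print(f"{rank}. {item} ({status})")
```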
As can be seen from this example, the top-10 risk item list is a very
effective way to focus higher management attention onto the project's critical
success factors. It is also very efficient with respect to management time: by
contrast, the usual monthly review spends most of its time on things the higher
manager can't do anything about. Also, if the higher manager surfaces an additional concern, it
is easy to add it to the top-10 risk item list to be highlighted in future reviews.
In the process, critical success factors get neglected, the project fails, and
nobody wins.
Good people, with good skills and good judgement, are what make
software projects work. Risk management can provide you with some of the
skills, an emphasis on getting good people, and a good conceptual framework for
sharpening your judgement. I hope you find these useful on your next software
project.
References
1. J. Rothfeder, "It's Late, Costly, and Incompetent - But Try Firing a Computer
System," Business Week, November 7, 1988, pp. 164-165.
3. V.R. Basili, B.W. Boehm, and A.E. Salwin, "Ada Risk Management: A Large-
Project Example," Software, (TBD).