Embedded Systems Online Testing
Module 8
Testing of Embedded System
Lesson 42
On-line Testing of Embedded Systems
Version 2 EE IIT, Kharagpur
Instructional Objectives

After going through this lesson the student would be able to

• Explain the meaning of the term On-line Testing
• Describe the main issues in on-line testing and identify applications where on-line testing is required for embedded systems
• Distinguish between concurrent and non-concurrent testing and their relations with BIST and on-line testing
• Describe an application of on-line testing for System-on-Chip

On-line Testing of Embedded Systems

1. Introduction

EMBEDDED SYSTEMS are computers incorporated in consumer products or other devices to perform application-specific functions. The product user is usually not even aware of the existence of these systems. From toys to medical devices, from ovens to automobiles, the range of products incorporating microprocessor-based, software-controlled systems has expanded rapidly since the introduction of the microprocessor in 1971. The lure of embedded systems is clear: They promise previously impossible functions that enhance the performance of people or machines. As these systems gain sophistication, manufacturers are using them in increasingly critical applications: products that can result in injury, economic loss, or unacceptable inconvenience when they do not perform as required.

Embedded systems can contain a variety of computing devices, such as microcontrollers, application-specific integrated circuits, and digital signal processors. A key requirement is that these computing devices continuously respond to external events in real time. Makers of embedded systems take many measures to ensure safety and reliability throughout the lifetime of products incorporating the systems. Here, we consider techniques for identifying faults during normal operation of the product, that is, online-testing techniques. We evaluate them on the basis of error coverage, error latency, space redundancy, and time redundancy.

2. Embedded-system test issues

Cost constraints in consumer products typically translate into stringent constraints on product components. Thus, embedded systems are particularly cost sensitive. In many applications, low production and maintenance costs are as important as performance.

Moreover, as people become dependent on computer-based systems, their expectations of these systems' availability increase dramatically. Nevertheless, most people still expect significant downtime with computer systems, perhaps a few hours per month. People are much less patient with computer downtime in other consumer products, since the items in question did not demonstrate this type of failure before embedded systems were added. Thus, complex consumer products with high availability requirements must be quickly and easily repaired. For this reason, automobile manufacturers, among others, are increasingly providing online detection and diagnosis, capabilities previously found only in very complex and expensive applications such as aerospace systems. Using embedded systems to incorporate functions previously considered exotic in low-cost, everyday products is a growing trend.

Since embedded systems are frequently components of mobile products, they are exposed to vibration and other environmental stresses that can cause them to fail. Embedded systems in automotive applications are exposed to extremely harsh environments, even beyond those experienced by most portable devices. These applications are proliferating rapidly, and their more stringent safety and reliability requirements pose a significant challenge for designers. Critical applications and applications with high availability requirements are the main candidates for online testing.

Embedded systems consist of hardware and software, each usually considered separately in the design process, despite progress in the field of hardware-software co-design. A strong synergy exists between hardware and software failure mechanisms and diagnosis, as in other aspects of system performance. System failures often involve defects in both hardware and software. Software does not "break" in the common sense of the term. However, it can perform inappropriately due to faults in the underlying hardware, or due to specification or design flaws in either hardware or software. At the same time, one can exploit the software to test for and respond to the presence of faults in the underlying hardware.

Online software testing aims at detecting design faults (bugs) that escape detection before the embedded system is incorporated and used in a product. Even with extensive testing and formal verification of the system, some bugs escape detection. Residual bugs in well-tested software typically behave as intermittent faults, becoming apparent only in rare system states. Online software testing relies on two basic methods: acceptance testing and diversity [1]. Acceptance testing checks for the presence or absence of well-defined events or conditions, usually expressed as true-or-false conditions (predicates), related to the correctness or safety of preceding computations. Diversity techniques compare replicated computations, either with minor variations in data (data diversity) or with procedures written by separate, unrelated design teams (design diversity). This chapter focuses on digital hardware testing, including techniques by which hardware tests itself, known as built-in self-test (BIST). Nevertheless, we must consider the role of software in detecting, diagnosing, and handling hardware faults. If we can use software to test hardware, why should we add hardware to test hardware? There are two possible answers. First, it may be cheaper or more practical to use hardware for some tasks and software for others. In an embedded system, programs are stored online in hardware-implemented memories such as ROMs (for this reason, embedded software is sometimes called firmware). This program storage space is a finite resource whose cost is measured in exactly the same way as other hardware. A function such as a test is "soft" only in the sense that it can easily be modified or omitted in the final implementation.

The second answer involves the time that elapses between a fault's occurrence and a problem arising from that fault. For instance, a fault may induce an erroneous system state that can ultimately lead to an accident. If the elapsed time between the fault's occurrence and the corresponding accident is short, the fault must be detected immediately. Acceptance tests can detect many faults and errors in both software and hardware. However, their exact fault coverage is hard to measure, and even when coverage is complete, acceptance tests may take a long time to detect some faults. BIST typically targets relatively few hardware faults, but it detects them quickly.

These two issues, cost and latency, are the main parameters in deciding whether to use hardware or software for testing and which hardware or software technique to use. This decision requires system-level analysis. We do not consider software methods here. Rather, we emphasize the appropriate use of widely implemented BIST methods for online hardware testing. These methods are components in the hardware-software trade-off.
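The two online software-testing methods named in Section 2, acceptance testing and data diversity, can be illustrated with a small sketch. It is not taken from the lesson or from reference [1]: the square-root computation, the function names, the tolerance, and the re-expression sqrt(4x)/2 are all assumptions chosen only to make the idea concrete.

```python
import math


def acceptance_test(x, root, tol=1e-6):
    """Predicate on a preceding computation: is `root` a plausible sqrt of x?"""
    return root >= 0.0 and abs(root * root - x) <= tol * max(1.0, abs(x))


def checked_sqrt(x, sqrt_impl=math.sqrt):
    """Run the computation, then apply the acceptance test to its result."""
    root = sqrt_impl(x)
    if not acceptance_test(x, root):
        raise RuntimeError("acceptance test failed")
    return root


def data_diverse_sqrt(x, sqrt_impl=math.sqrt, tol=1e-6):
    """Data diversity: repeat the computation on a re-expressed input
    (4x instead of x) and compare the two results."""
    primary = sqrt_impl(x)
    replica = sqrt_impl(4.0 * x) / 2.0  # mathematically equal to sqrt(x)
    if abs(primary - replica) > tol * max(1.0, abs(primary)):
        raise RuntimeError("diversity check failed")
    return primary
```

Passing a deliberately faulty `sqrt_impl` that misbehaves for a single input value mimics a residual bug that shows up only in a rare system state; both checks then raise `RuntimeError` for that input while accepting all others.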
3. Online testing

Faults are physical or logical defects in the design or implementation of a digital device. Under certain conditions, they lead to errors, that is, incorrect system states. Errors induce failures, deviations from appropriate system behavior. If the failure can lead to an accident, it is a hazard. Faults can be classified into three groups: design, fabrication, and operational. Design faults are made by human designers or CAD software (simulators, translators, or layout generators) during the design process. Fabrication defects result from an imperfect manufacturing process. For example, shorts and opens are common manufacturing defects in VLSI circuits. Operational faults result from wear or environmental disturbances during normal system operation. Such disturbances include electromagnetic interference, operator mistakes, and extremes of temperature and vibration. Some design defects and manufacturing faults escape detection and combine with wear and environmental disturbances to cause problems in the field.

Operational faults are usually classified by their duration:

• Permanent faults remain in existence indefinitely if no corrective action is taken. Many are residual design or manufacturing faults. The rest usually occur during changes in system operation such as system start-up or shutdown or as a result of a catastrophic environmental disturbance such as a collision.
• Intermittent faults appear, disappear, and reappear repeatedly. They are difficult to predict, but their effects are highly correlated. When intermittent faults are present, the system works well most of the time but fails under atypical environmental conditions.
• Transient faults appear and disappear quickly and are not correlated with each other. They are most commonly induced by random environmental disturbances.

One generally uses online testing to detect operational faults in computers that support critical or high-availability applications. The goal of online testing is to detect fault effects, or errors, and take appropriate corrective action. For example, in some critical applications, the system shuts down after an error is detected. In other applications, error detection triggers a reconfiguration mechanism that allows the system to continue operating, perhaps with some performance degradation. Online testing can take the form of external or internal monitoring, using either hardware or software. Internal monitoring, also called self-testing, takes place on the same substrate as the circuit under test (CUT). Today, this usually means inside a single IC, a system on a chip. There are four primary parameters to consider in designing an online-testing scheme:

• error coverage: the fraction of modeled errors detected, usually expressed as a percentage. Critical and highly available systems require very good error coverage to minimize the probability of system failure.
• error latency: the difference between the first time an error becomes active and the first time it is detected. Error latency depends on the time taken to perform a test and how often tests are executed. A related parameter is fault latency, the difference between the onset of the fault and its detection. Clearly, fault latency is greater than or equal to error latency, so when error latency is difficult to determine, test designers often consider fault latency instead.
• space redundancy: the extra hardware or firmware needed for online testing.
• time redundancy: the extra time needed for online testing.

The ideal online-testing scheme would have 100% error coverage, error latency of 1 clock cycle, no space redundancy, and no time redundancy. It would require no redesign of the CUT and impose no functional or structural restrictions on it. Most BIST methods meet some of these constraints without addressing others. Considering all four parameters in the design of an online-testing scheme may create conflicting goals. High coverage requires high error latency, space redundancy, and/or time redundancy. Schemes with immediate detection (error latency equaling 1) minimize time redundancy but require more hardware. On the other hand, schemes with delayed detection (error latency greater than 1) reduce time and space redundancy at the expense of increased error latency. Several proposed delayed-detection techniques assume equiprobability of input combinations and try to establish a probabilistic bound on error latency [2]. As a result, certain faults remain undetected for a long time because tests for them rarely appear at the CUT's inputs.

To cover all the operational fault types described earlier, test engineers use two different modes of online testing: concurrent and non-concurrent. Concurrent testing takes place during normal system operation, and non-concurrent testing takes place while normal operation is temporarily suspended. One must often overlap these test modes to provide a comprehensive online-testing strategy at acceptable cost.

4. Non-concurrent testing

This form of testing is either event-triggered (sporadic) or time-triggered (periodic) and is characterized by low space and time redundancy. Event-triggered testing is initiated by key events or state changes such as start-up or shutdown, and its goal is to detect permanent faults. Detecting and repairing permanent faults as soon as possible is usually advisable. Event-triggered tests resemble manufacturing tests. Any such test can be applied online, as long as the required testing resources are available. Typically, the hardware is partitioned into components, each exercised by specific tests. RAMs, for instance, are tested with manufacturing tests such as March tests [3].

Time-triggered testing occurs at predetermined times in the operation of the system. It detects permanent faults, often using the same types of tests applied by event-triggered testing. The periodic approach is especially useful in systems that run for extended periods during which no significant events occur to trigger testing. Periodic testing is also essential for detecting intermittent faults. Such faults typically behave as permanent faults for short periods. Since they usually represent conditions that must be corrected, diagnostic resolution is important. Periodic testing can identify latent design or manufacturing flaws that appear only under certain environmental conditions. Time-triggered tests are frequently partitioned and interleaved so that only part of the test is applied during each test period.

5. Concurrent testing

Non-concurrent testing cannot detect transient or intermittent faults whose effects disappear quickly. Concurrent testing, on the other hand, continuously checks for errors due to such faults. However, concurrent testing is not particularly useful for diagnosing the source of errors, so test designers often combine it with diagnostic software. They may also combine concurrent and non-concurrent testing to detect or diagnose complex faults of all types.

A common method of providing hardware support for concurrent testing, especially for detecting control errors, is a watchdog timer [4]. This is a counter that the system resets repeatedly to indicate that the system is functioning properly. The watchdog concept assumes that the system is fault-free, or at least alive, if it can reset the timer at appropriate intervals. The ability to perform this simple task implies that control flow is correctly traversing timer-reset points. One can monitor system sequencing very precisely by guarding the watchdog-reset operations with software-based acceptance tests that check signatures computed while control
[Figure: block diagram with the following labels: DC/DC Converter; Battery & Charger; Power supply to the cores; Data and Control paths; IEEE 1149.4/1149.1 Boundary Scan Bus; Analog Buses (1149.4) AB1 and AB2]
Fig. 42.2 Block Diagram of the SOC Representing On-Line Test Capability
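The watchdog timer described under concurrent testing (Section 5) can be modeled in software to show how a stalled control flow is detected. The sketch below is only a simulation under assumed parameters: the class name, the tick-based timing, and the `run` helper are hypothetical and do not correspond to any particular watchdog peripheral or to the scheme in reference [4].

```python
class WatchdogTimer:
    """Software model of a hardware watchdog: a down-counter that the
    monitored system must reload before it reaches zero."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = timeout_ticks
        self.expired = False

    def kick(self):
        """Called by the system at a timer-reset point in its control flow."""
        if not self.expired:
            self.counter = self.timeout_ticks

    def tick(self):
        """Advance one clock tick; latch `expired` when the counter runs out."""
        if self.expired:
            return
        self.counter -= 1
        if self.counter == 0:
            self.expired = True


def run(watchdog, total_ticks, kick_period, hang_at=None):
    """Simulate a main loop that kicks the watchdog every `kick_period`
    ticks and, optionally, hangs (stops kicking) from tick `hang_at` on.
    Returns the tick at which the watchdog expired, or None."""
    for t in range(total_ticks):
        alive = hang_at is None or t < hang_at
        if alive and t % kick_period == 0:
            watchdog.kick()
        watchdog.tick()
        if watchdog.expired:
            return t
    return None
```

With a timeout longer than the kick period, a healthy loop never lets the counter expire, whereas a loop that hangs is flagged within one timeout interval of its last kick. The gap between `hang_at` and the returned tick is the error latency of this check, and the counter itself is its space redundancy.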