DT3 Speaker Notes

The document provides an overview of system-on-a-chip (SoC) design principles, emphasizing platform-based design to manage complexity and reuse existing designs. It details the processes of platform design and IP block design, and introduces the audioport IP block as a case study for practical application. Key components of SoC architecture, design methodologies, and verification techniques are discussed to facilitate understanding of modern digital system design.


L1

1 What is a system-on-a-chip
2 Platform Based Design
3 Platform Design
4 IP Block Design
5 audioport IP
6 ARM Easy
7 audioport Block Diagram
8 Memory Map
9 TLM Model
10 Summary
L2
2 RTL Design and Verification Flow
3 GRS
4 FRS
5 RTL Architecture Design
6 Whitebox Assertions
7 Verification Plan Design
8 Blackbox Assertions
9 Coverage Model
10 Coding Phase and Verification
11 RTL Flow Summary
12 Week 2
13 Memory-Mapped I/O
14 Testbench
15 APB if
16 apb_if usage
17 Test program organization
18 Coding Tips
L3
2 Principle of Assertion-Based Functional Verification
3 Counter Example 1
4 Counter Example 2
5 Assertions in SystemVerilog
6 Concurrent Assertions
7 How to define properties
8 Booleans
9 Sequences
10 Property Expressions
11 Operator Summary
12 Summary: Assertion Creation and Most Common Building Blocks
13 Installation of Concurrent Assertions in a Design
14 Debugging Assertions in QuestaSim
15 Tip: Auxiliary Code
L4
2 Register-Transfers in Computer Engineering
3 RTL Coding Rules for Combinational Logic
4 RTL Coding Rules for Sequential Logic
5 RTL Architect's Pattern Language: Combinational Blocks
6 RTL Architect's Pattern Language: Registers
7 RTL Architect's Pattern Language: Counters
8 RTL Architect's Pattern Language: Shift Registers
9 RTL Architect's Pattern Language: Control Logic
10 Using the Design Patterns: FIR Filter
11 FIR Filter RTL Code
12 But Wait, There's More!
13 Simple RTL Architecture Design Procedure
14 sramctrl: Register Allocation
15 sramctrl: RTL Architecture Refinement
16 sramctrl: Control
17 Whitebox Assertions
L5 Formal Verification
2 Counter revisited
3 Formal Property Verification
4 Example: Verification of Property
5 Example (Continued)
6 Presentation of a Counterexample in Questa PropCheck
7 Formal Verification vs. Simulation
8 Questa Results
9 Formal Verification Constraints
10 Writing Assertions for Formal Verification
11 Example: control_unit's Interrupt Output Properties
12 Advanced
L6
2 Coverage
3 Code Coverage
4 Code Coverage Types
5 Code Coverage Example
6 Functional Coverage
7 Functional Coverage Example
8 covergroup
9 Coverage Collection in Simulation
10 Covergroup Examples: Bin Creation
11 Covergroup Examples: Bin Creation and Cross Coverage
12 Transition Coverage
13 Covergroup Creation Summary
14 Constrained Random Stimulus Generation
15 Coverage Tracking with an XML Testplan
16 Functional Qualification
L7
2 VHDL vs SystemVerilog
3 Logic Data Types for RTL Design
4 Working with Bit Vector and Numeric Data Types
5 Contents of a Design File
6 Procedural Blocks: Combinational Logic Process
7 Procedural Blocks: Sequential Logic Process
8 Procedural Control Statements
9 Operators
10 FSM
11 VHDL Constructs for RTL Models
12 Miscellaneous
L8
2 Clock Domain Crossings
3 Metastability
4 Metastability Removal Using a Synchronizer
5 Synchronizer Reliability
6 Synchronization of Multibit Data (1)
7 Synchronization of Multibit Data (2)
8 Synchronization of Multibit Data (3)
9 Handshaking State Machines
10 Other Multibit CDC Schemes
11 Data Reconvergence
12 CDC Logic Verification
13 Questa CDC Flow Used in the Project
14 Debugging Violations in Questa CDC
L9
2 SystemC Overview
3 Modeling Levels Supported in SystemC
4 Short Introduction to C++ Classes
5 SystemC Module
6 SystemC Ports, Channels and Processes
7 SC_METHOD Example: RTL Modeling
8 SC_CTHREAD Example: Algorithm-Level Modeling
9 Hierarchical Models
10 Testbench for reg8 Module
11 sc_main
12 Data Types
13 Operators Supported by Integer Data Types
L10
2 RTL Architecture Exploration for Algorithm R4 = R1 + R2 + R3
3 The RTL Design Problem
4 High-Level Synthesis Based Design
5 How Do HLS Programs Work?
6 Scheduling
7 Resource Allocation and Binding
8 RTL Generation
9 High-Level Synthesis Process
10 CDFG Transformations: Loop Unrolling and Pipelining
11 CDFG Transformation: Array Handling
12 Siemens EDA Catapult HLS Tool
L11
1 What is UVM?
2
3 UVM Looks Very Complicated. How Do I Start?
5 UVM Testbench Structure and Verification Components
6 How Is a UVM Component Class Defined and How Does It Work?
7 Transaction-Level Modeling Based Communication Between UVM Components
8 How Does the UVM Testbench Talk to the DUT?
9 Test Data Generation with UVM Sequences
10 Summary of UVM Testbench Creation
L12 UVM Continued
2 UVM TLM Connections
3 TLM Connector Classes
4 Hierarchical Connections
5 audioport_uvm_test
6 TLM Communications: Predictor
7 TLM Communications: Comparator
8 Sequence Execution
9 Factory Overrides
10 Concluding Remarks on UVM
L13
2 'Back-End' IC Design Tasks in DT3
3 Logic Synthesis
4 Timing and Area Reports
5 Design for Testability (DFT)
6 Fault Model Based Test Pattern Generation
7 Test Pattern Generation Example
8 DFT Techniques: Scan Path
9 Scan Path Creation and Usage
10 Scan Insertion in Practice
11 Power Consumption in Digital CMOS Circuits
12 Power-Aware RTL Design
13 Power Minimization with Clock Gating
14 Power Report
L14
2 Physical Layout Design
3 Standard Cell Layout Principle
4 Standard Cell Placement and Routing
5 Block Layout Design Phases
6 Clock Tree Synthesis
7 Post-Layout Timing Analysis: Delays
8 Post-Layout Timing Analysis: Clocks
9 Post-Layout Power Analysis
L1

1 What is a system-on-a-chip
A system-on-a-chip is an integrated circuit that integrates all functions of a digital system on a
single silicon chip.

On this slide you can see a block diagram of a generic SoC design.

The key part of an SoC is the central processing unit, the CPU, which is an instruction set
processor that executes the operating system and application software on the SoC.

Like in any computer system, the CPU is connected to memory and peripheral devices with a
system bus. SoCs can have many buses. A high-performance system bus serves components
that need high-bandwidth access to the memory and other parts of the system. On the other
hand, peripherals that do not require high-speed connections can be placed on a
low-performance bus in order to save power and silicon area.

In addition to the buses, SoCs must have a large number of standard computer system
peripherals such as bus arbiters, interrupt controllers, and memory interfaces.

Different use cases are served by adding subsystem designs that are commonly known as
intellectual property blocks, or IPs for short. These can be general purpose blocks that are
needed to implement for instance various standard interface protocols. Another group of IPs are
application-specific blocks that serve functions that are needed in the end-product the SoC is
designed for. These can be, for instance, radio modems and multimedia codecs.

As digital designs, SoCs are large. Measured in logic gates, the design size can be equivalent
to hundreds of millions or even billions of logic gates. The effort required to design a
system-on-a-chip from scratch is therefore huge.

2 Platform Based Design


The obvious question now is: How can we design such complex chips under the time
constraints set by current product and technology generation life-cycles?

The answer is to not start from scratch every time but instead reuse existing designs as much
as possible. This is made possible by the so-called platform-based design principle.

An SoC platform is a solution based on the use of standard components, interfaces and
protocols, and design methods and tools. New product versions can be derived from the basic
platform by adding and removing some hardware and software parts without having to redesign
the whole SoC architecture every time.

The main facilitators of platform-based design are the following.

First of all, the platform is based on a standard CPU, such as an ARM or RISC-V processor, that
is supported by industry-standard compiler and debug tool chains, making software
development and reuse easy.

A second facilitator of platform-based design is the use of standard on-chip bus architectures
and protocols. This makes it possible to develop IP blocks independently of the SoC platform,
allowing concurrent design of the SoC and IP blocks. It also makes it easy to use IP acquired
from external providers.

A third facilitator is the use of electronic design automation (EDA) tools that support standard design
languages and data exchange formats. This allows SoC designers to build tool chains that
meet their requirements from tools available from many vendors, and allows design teams and
companies to work together seamlessly. Tool chains can also be automated by using scripting
languages, which frees designers from executing routine tasks manually.

In a platform-based design approach, the majority of the design work takes place in the
development of IP blocks. The IP blocks are integrated with the SoC hardware with a standard
memory-mapped register interface, and with the SoC software with device drivers and
application-programming interface libraries. These are the platform-specific parts that an IP
block design must provide for every target platform. The application-specific hardware of the IP
can be developed independently of the target SoC platform.

The deliverables of an IP block design project to the SoC design project include the hardware
and software code needed to verify and implement the IP in the SoC context, as well as the tool
scripts required for executing these tasks with electronic design automation tools. The
deliverables can be packaged in a machine-readable form by using the IP-XACT data
representation format.

3 Platform Design
We can present the SoC development task as two separate processes. The first process is the
one in which the platform itself is developed. The second process is the one where the IP blocks
are designed. There are actually a large number of IP block design processes that can be
executed concurrently with or even before the platform design.

Let's begin by giving an overview of platform design tasks.


First of all, notice that the design process has been divided into three phases here. The phases
are the electronic system level, the RTL level, and the physical level.

In the system-level design phase, a virtual hardware model of the platform is created. The
functions of hardware components are modeled with high-level languages such as C++ or
SystemC. Interfaces are represented with untimed or loosely timed transaction-level models.
The purpose of this virtual hardware platform is to allow software development to begin before
the actual hardware design has been done. The software can be executed on the virtual SoC
platform just like on real hardware, only at a much slower speed.

In the RTL design phase, the hardware of the SoC is defined. Since most of the RTL
functionality is inside the IP blocks, the RTL design of the platform consists of integrating the
parts by designing the bus fabric, and of verifying the design. The verification is often
done using a hardware-based emulator instead of a simulator, which allows for fast execution of
software and connection to real physical peripheral devices and interfaces.

The final design phase is physical design. In this phase the RTL code is synthesized and the
gate-level model is used to drive layout design.

The arrows that enter the platform design process flow from the right represent design data and
constraints that are imported from the IP block design processes.

4 IP Block Design
From the point of view of the SoC platform design process, we can think of all of the parts of
the platform as IP blocks that are just connected to each other in the SoC platform design
process. Most of the functionality of an SoC is therefore defined in the IP block development
processes.

Each IP block design consists of the hardware models that are needed in different phases of
SoC platform design, and the software parts that are needed to give the operating system and
application software access to the hardware.

In the system-level design phase, the first task in IP block design is to decide on the
hardware-software partitioning for the functions of the block. This means that we have to decide
which functions are implemented in hardware and which in software, and how the
hardware and software parts interact. The results of this design phase are the software
components, such as device drivers and application program interface libraries, and a
transaction-level, software-based reference model of the hardware parts that can be used in the
virtual SoC platform. This hardware model can also function as a reference model in RTL
verification.

In the RTL design phase, the RTL architecture of each IP block must be defined, coded and
verified. Verification is carried out in a verification environment developed for the block by using
techniques, such as the Universal Verification Methodology, UVM, which allows the verification
resources to be reused in SoC-level verification.

It is common to do a physical prototype implementation of an IP block so that its
power-performance-area properties can be estimated before the real physical implementation is
done in the SoC layout. Scripts and constraint files created for the prototype implementation
can be used in the SoC implementation.

5 audioport IP

This course is built around a design project, in which you will learn many important design and
verification methods and tools by designing the hardware of a fairly complex IP block. The name
of the IP block is audioport.

The audioport IP block is a serial audio interface that converts parallel audio data into a serial
bit-stream that complies with the I2S standard. The CPU of the SoC writes the data to be
serialized into a memory-mapped register bank inside the audioport. The serialized data stream
is sent off-chip to an external audio codec and headphone amplifier chip. The IP block works
with the AMBA advanced peripheral bus that is common in SoCs. The block also contains digital
signal processing functions and has two clock domains. It therefore has many features that
require the use of special design and verification techniques.

The operating principle of the audioport block is simple: it requests data from the host CPU by
raising an interrupt signal. The software running on the CPU responds to the interrupt by writing
data into a memory-mapped register bank inside the audioport. After that the audioport
processes the data and writes out the results in a serial format.

6 ARM Easy
The audioport IP can in principle be installed on any SoC platform that has an APB bus. To give
you a concrete example, we use the ARM Easy platform as a reference SoC platform for the
project. You can find the documentation for this platform on ARM's web pages. By studying the
platform's documentation you can learn a lot about what a system-on-a-chip platform design
contains.

We assume that we connect the audioport on the peripheral bus of an SoC, and then wire the
interrupt signal to an interrupt controller that is available on the platform.

In the simulation testbench, we model the basic functions of this platform:


The control software that runs on the CPU.
The APB bus interface that generates bus transactions according to the APB protocol.
The interrupt handling subsystem.
This way the audioport block "thinks" that it is running inside this kind of system when we are
simulating it in its testbench.

7 audioport Block Diagram


Let's next take a closer look at the audioport design.

The IP block contains four top-level modules. Each module is completely independent and
serves a different learning goal of this course.

The first module, the control_unit, serves as a playground for learning SystemVerilog-based
RTL coding and verification techniques. We spend many weeks designing and verifying this
module, and completing this module is the first milestone of the project.

Next we design two modules, the i2s_unit and the cdc_unit.

The i2s_unit serves as a small RTL design and coding exercise. In this case, the code will be
written in the VHDL language so that you also get some experience of using this language,
which is still used in many companies.

The cdc_unit implements clock-domain crossing logic. You will learn to design and verify various
synchronizer structures when you create this module.

The dsp_unit is the most complex module. It will contain hundreds of thousands of logic gates.
This is why we are going to implement it with SystemC and high-level synthesis, which saves a
lot of time compared to RTL design and coding.

The last milestone covers the verification and prototype implementation of the complete
audioport design. For the verification, you will learn to create a UVM testbench. UVM is an
object-oriented programming framework that is used to build reusable testbenches for SoC
designs.

You can start the audioport coding effort by defining the top-level modules and module
instantiations according to the block diagram shown here.

8 Memory Map
As we saw before, an IP block connects to the SoC system bus, and communicates with the
CPU or other system components using the protocol specific to the bus standard. The
communication hardware inside the IP block consists of a memory-mapped register interface.
The IP block contains a register bank that from the point of view of the CPU functions as a
memory device.
Designing the interface involves tasks in both the platform and in the IP block design processes.

The platform designer must define the system's memory map, which defines the memory
ranges that are reserved for different devices that are attached to the bus.

The IP block designer, on the other hand, must decide how many registers are needed and thus
the size of the memory area that must be allocated for the block.

The memory map of the EASY system is shown on the left in this slide.

As you can see, the address range starting from the hexadecimal address 8000 0000 has been
reserved for peripherals that are placed on the APB bus.

In the APB memory area, some ranges have already been allocated for the interrupt controller,
timers and other devices. We can place the audioport in the undefined section of the APB
memory map, which begins at address 8C00 0000.

The audioport has a large number of registers. Every register has a specific address in the
memory map. The addresses refer to 32-bit memory locations, which is why register addresses
increase in steps of 4.

You must define the address range, and the addresses of some specific registers in the setup
files so that you can refer to the registers with symbolic names in your code.
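
As an illustration only, such definitions could be collected into a SystemVerilog package along the following lines; the parameter names and address values here are assumptions, not the ones fixed in the project's setup files:

  package audioport_addr_pkg;
    // Base of the APB address range reserved for the audioport (illustrative value).
    parameter logic [31:0] AUDIOPORT_START_ADDRESS = 32'h8C00_0000;
    // 32-bit registers occupy 4 bytes each, so register addresses step by 4.
    parameter logic [31:0] CMD_REG_ADDRESS    = AUDIOPORT_START_ADDRESS + 32'h0;
    parameter logic [31:0] STATUS_REG_ADDRESS = AUDIOPORT_START_ADDRESS + 32'h4;
    parameter logic [31:0] CFG_REG_ADDRESS    = AUDIOPORT_START_ADDRESS + 32'h8;
  endpackage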

9 TLM Model

As mentioned earlier, a transaction-level model is often created for IP blocks for use in
system-level design. In the audioport project, a TLM model written in SystemC is available. You
can use it to learn how the audioport should work. You can simulate this TLM model after you
have defined the memory map addresses and this way check that your address definitions are
correct.

At the bottom of the waveform view you can see the bus transactions, that is, the addresses and
data values that the CPU writes to the audioport. Notice that the TLM model does not use a
pin-level bus interface and clock cycle accurate timing.

The image at bottom right shows a detail of these transactions. You can see that the audioport
raises its interrupt output. The CPU responds by writing audio data to fill a buffer register bank
inside the audioport.

In the image in the middle you can see the bit-serial data generated by the audioport.
The waveforms in the top half show the audio input and output data in analog format. The audio
data has been filtered in the dsp_unit of the audioport.

10 Summary
The key points of this lecture can be summarized as follows.

We have introduced the concept of a system-on-a-chip, which describes a complete digital
system that is implemented on a single silicon chip.

An SoC contains one or more CPUs along with common computer system parts and
peripherals, as well as many industry-standard and application-specific intellectual property
blocks.

The SoC design process has two parts: the platform design and the design of the IP blocks
used in the platform. This course concentrates on IP block design and verification.

IP blocks are connected to the SoC with memory-mapped register interfaces, and they can use
standard computer system communication paradigms such as interrupts and direct memory
access for communicating with the software and with other peripheral devices.
L2

2 RTL Design and Verification Flow

This lecture gives you an overview of the register-transfer level design and verification flow.

A flow defines the order of design and verification tasks in a design project. Every company and
design team uses a flow that has been developed for their specific needs. In this lecture we
describe the flow on a general level.

For our purposes, we divide the flow into four phases. Each phase can consist of many
separate tasks.

The phases are the Specification Phase, the Design Phase, the Coding Phase and the
Verification Phase.

In the beginning of the project, in the specification phase, the detailed functional requirements of
the IP block's hardware are defined in the functional requirements capture task shown as the
first task in the flow diagram. The functional requirements specification created here will be the
foundation for all design and verification tasks that follow.

The next phase in the flow is the design phase. Here the flow diverges into two different paths,
the design path and the verification path.

In the RTL architecture design task, a detailed specification of the hardware is created.
Concurrently with RTL design, a verification plan is developed. The verification plan describes
the tests and other verification tasks that must be executed to ensure that the RTL design
implements the functional requirements correctly.

Once the detailed RTL architecture has been defined, its description can be included in the
verification plan, as indicated by the RTL assertions data flow arrow, so that the verification
environment can check that the RTL code developed from the RTL design implements the
design intent correctly.

The next phase after the design phase is the coding phase. Here the code that implements the
RTL design and the verification plan is written.

The final phase is the verification phase. It consists of a functional verification task in which the
RTL code is simulated in the testbench created according to the verification plan. Formal
methods can be used to complement simulation.
In the following slides we describe the key tasks of the design and verification flow in more
detail.

3 GRS
In the beginning of every design project, some kind of general requirements specification must
exist or be created. The general requirements specification describes the purpose and the
technical and economical constraints of the hardware to be designed. It is often created in a
feasibility study based on which the decision to start the design project is made.

The inputs for the general requirements specification typically include datasheets, protocol
specifications, standards and other documents that are relevant to the product.

Outputs of the study can include an interface specification that lists all input and output signals,
and a functional description. A transaction level simulation model is also often created as part of
the feasibility study to demonstrate the intended functionality of the IP block.

To make this specification task more concrete, a simple general requirements specification is
presented on this slide. The specification describes a static random-access memory controller
IP block. The starting point in this case is the textual requirements description shown in small
print in the text box. It references some datasheets that also become part of the specification.

Drawing 1 describes the intended use case of the SRAM controller as part of an SoC design.

Drawing 2 describes the interface of the IP block. The interface signals have been derived from
the SoC bus specification and the memory component's datasheet.

Drawing 3 describes the interface signal waveforms for the memory access transactions the IP
block must support.

The information shown here would probably be comprehensive enough to allow a detailed, bit
and clock-cycle accurate functional requirements specification to be created.

4 FRS
The functional requirements specification is a refined version of the general requirements
specification. It must be comprehensive, unambiguous and detailed enough so that it can be
used as a definite specification for RTL design and verification.

Functional requirements can be described in many ways and in many formats. In this
presentation, we describe the functional requirements as a set of properties the design must
have.
Properties are bit and clock cycle accurate signal value combinations and sequences,
cause-and-effect behaviors or other similar functions that must be designed into the hardware
and that must be checked in verification.

All design and verification tasks that follow are based on the functional requirements
specification. Creation of this specification is therefore the most critical task in the design flow as
any flaws that remain in the specification will be detected only by chance in verification.

The example shown on the right illustrates the creation of a functional requirements
specification. The textual specification is broken into property definitions that are clarified with
timing diagrams. This way the "human readable" general specification is broken into a large set
of simpler but exact properties that can be tracked throughout the design and verification
process.

The word "feature" is used in this example for a design property to emphasize the difference
with the SystemVerilog keyword "property", which is used to encode the functional properties in
an executable form in the coding phase.

5 RTL Architecture Design


After the functional requirements of the hardware have been identified and fixed, the project can
move into the design phase, RTL architecture design on one hand, and verification plan design
on the other.

Let's begin with RTL architecture design.

RTL design is mostly done by human designers, even though solutions that create RTL
architectures automatically from algorithmic models exist and are used on a limited scale.

An RTL architecture defines the name and interface of the design, the names and types of
internal signals, and the combinational and sequential logic functions that drive the internal
signals and output ports.

An RTL architecture specification can be represented in many formats, such as block diagrams,
pseudo-code, state-charts and tables, algorithmic state machine diagrams, logic equations, and
so on. The representation must be unambiguous, because it functions as the specification for
the RTL code that will be written later, often by a different person.

The example shown in this slide presents the RTL implementation of one functional property,
called f_wctr_logic. A combinational logic function described in the table has been allocated as
an RTL block to implement the functional requirement.
The block diagram shows the allocated combinational block in the context of the complete RTL
architecture. All parts of the block diagram must be defined in a similar way and linked to the
functional requirements specification.

6 Whitebox Assertions
Assertions are pieces of code that are used to check functional properties during verification,
most commonly in simulation. Assertions will be covered later in the course. Here we discuss
only their use cases in the design and verification flow.

The term whitebox assertion describes assertions that have visibility inside the design, which
means that they can observe the state of the internal signals of the design.

When the RTL architecture has been designed, it is a common practice to describe the intended
function of all combinational and sequential blocks and interconnect signals with assertions.
These whitebox assertions can be incorporated in RTL verification to check that the RTL code
implements the intended functionality correctly. Whitebox assertions should therefore be
specified by the RTL architecture designer, not the RTL code designer, as their purpose is to
catch coding errors.

Assertions consist of a property statement that describes the required function, and assert and
cover directives. An assert directive checks that a property is always true, while a cover directive
counts how many times the property was observed to be true.

The logic functions of RTL blocks can often be complex. That is why they are often described
with more than one whitebox assertion, as shown in the example here.
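
As a minimal sketch, a whitebox assertion pair for an internal register could look like this; the clock, reset and register names (clk, rst_n, wctr_r) and the limit MAX_WORDS are assumptions used only to show the idea:

  property f_wctr_range;
    @(posedge clk) disable iff (!rst_n)
      wctr_r <= MAX_WORDS;              // internal word counter must stay in range
  endproperty

  af_wctr_range: assert property (f_wctr_range);   // flag an error if the range is violated
  cf_wctr_range: cover  property (f_wctr_range);   // count how often the check held true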

7 Verification Plan Design


Verification plan creation is done concurrently with RTL design in the design and verification
flow. RTL design and verification planning tasks are typically executed by separate design
teams that work from a common functional requirements specification.

The verification plan describes the methods that are used to verify the RTL design. These
include functional verification by simulation, verification by formal proof of properties, and the
application of various static code checking methods.

Simulation is the most important verification method. It requires a testbench, which models the
design's environment. Testbenches of even simple IP blocks can be complex, and their
composition must therefore be accurately described in the verification plan.

The block diagram shows some common parts of testbenches.


A test program is needed to generate the stimulus data that is fed into the input ports of the
design under test, the DUT, during simulation.

Testbenches are often designed to be self checking, which means that a reference model and
an analyzer component are needed to check the results produced by the DUT.

Property checking of functional, black-box assertions is an essential part of a modern testbench,
as is a coverage model that measures the quality of the verification process.

Testbenches often use parts reused from other projects, or acquired from other parties. Bus
functional models and verification IP blocks are examples of these. These parts and resources
must also be specified in the verification plan.

8 Blackbox Assertions
We have already discussed whitebox assertions that are used to check that RTL code
implements the specified RTL architecture correctly.

Blackbox assertions, on the other hand, are used to check that the RTL code implements the
functional requirements specifications correctly. These assertions observe the DUT only from its
input and output ports. They can therefore be derived directly from the functional requirements
specification.

In this example, you can see the specifications of two blackbox assertions that are used to
cover some properties in simulation. They will be implemented with property statements and
cover directives in the Coding Phase.

9 Coverage Model
Coverage is a measure of verification quality. It measures how completely the required
properties of a design have been checked in simulation and formal verification.

Functional coverage is typically measured using two methods: by using property statements
enabled with cover directives, or by using so-called covergroups, which can be used to create
more complex coverage measurement constructs.

The example in the table shows a specification of a covergroup that measures the coverage of
write accesses to different memory addresses. This covergroup can be implemented as a
SystemVerilog covergroup declaration in the coding phase.

Having a robust coverage model is essential so that the verification team can tell when the
verification process can be stopped. Functional verification is usually carried out by running a
large number of test programs in a sequence to verify different properties. Each test program
increases the coverage. When the coverage has reached 100%, additional tests are no longer
needed.
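
A covergroup of that kind could be sketched roughly as follows; the signal names, register names and bin division are assumptions for illustration, not the project's actual coverage model:

  covergroup write_addr_cg @(posedge clk);
    // Sample the APB address on completed write transfers.
    write_addresses: coverpoint paddr iff (psel && penable && pwrite && pready) {
      bins cmd_reg         = { CMD_REG_ADDRESS };
      bins cfg_reg         = { CFG_REG_ADDRESS };
      bins audio_buffer[4] = { [ABUF_START_ADDRESS : ABUF_END_ADDRESS] };
    }
  endgroup

  write_addr_cg write_addr_cov = new();   // instantiate the covergroup to start collecting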

10 Coding Phase and Verification


With all designing and planning done, the project can move to the coding phase.

Here the RTL architecture design and the verification plan are implemented in the chosen
language or languages. For RTL coding, VHDL, Verilog or SystemVerilog is used. The
implementation of the verification plan, on the other hand, often requires the use of many
languages.

It is important to understand that the purpose of the coding phase is only the conversion of the
specifications into executable form. New design items, such as signals or registers, cannot be
added in this phase. The specifications must also be implemented literally. A name of a register,
for instance, must remain the same in all code that is written based on the specification.

In this example you can see that a register called addr_r is referenced in RTL code, in a
property statement, and in a tool command language script that controls the simulator. If any of
the three code writers for some reason would choose a different name, the verification
environment would break.

The verification phase is the last phase in our representation of the design and verification flow.
This phase is a highly-automated process where the EDA tools are run using scripts that
execute the verification tasks as defined in the verification plan. In the design bring-up phase
the focus is on detecting and fixing bugs. When the design seems to have stabilized, the focus
moves on to tracking coverage, for which various coverage analysis and reporting tools can be
used, as shown in the image on the right.

11 RTL Flow Summary


This slide shows a summary of the RTL design and verification flow phases, and some of the
tasks that must be executed in each phase. As the waterfall representation shows, all tasks
that belong to a specific phase must have been completed before the project can move to the
next phase.

Most tasks require the use of some electronic design automation software tool, and the usage
of a specific tool depends on the results generated by an upstream tool. The EDA tool flow
must therefore also be carefully planned so that the design and verification work can proceed
smoothly.
In the course project, we begin from the stage where most of the specification and design tasks
have already been done, and the focus is mostly on coding and verification. Some specification
and design tasks still remain, so there will be enough work in the project.

12 Week 2
As mentioned in lecture 1, the control_unit module will be our playground for learning RTL
design and verification methods. If you study the project documents, you will notice that the RTL
architecture and verification plan specifications have already been created. This means that we
can begin with any of the coding tasks that must be done next. We assume that you already
have a fair amount of RTL coding experience, which is why we begin by doing some verification
plan coding tasks first so as to make things more interesting to you.

The first part of the verification plan we are going to implement is the test program, which is
described in the "testplan" section of the specifications. The testplan describes a set of tests,
which are simple programs that write data into the control_unit's inputs so as to activate different
properties defined in the functional requirements specification of the control_unit module.

A SystemC model of the control_unit is available, and it can be used as the DUT to test the test
program itself. You cannot simulate the test program with RTL code yet because the code has
not been written.

The purpose of this exercise is to learn to understand the memory-mapped input output
principle, the operation of the APB bus and the SystemVerilog interface construct that is used to
model the bus in simulation, and the use of SystemVerilog "program" blocks and "task"
subprograms.

13 Memory-Mapped I/O


As we have learned earlier, IP blocks are connected to a bus on a system-on-a-chip, and they
communicate with other parts of the SoC, most commonly with the CPU, using the
memory-mapped input-output principle. For this to work, each IP block must contain the logic for
decoding the bus signals, as well as a register bank or memory block for storing the data as
required by the IP block's functions. The control_unit implements these functions in the
audioport.

This slide shows how the software running on a CPU might communicate with the control_unit
in practice. The top half shows the point of view of the software, and the bottom half the
hardware.

In the software-side code example, a section of the SoC's memory is mapped into the address
space of the Linux process, starting from the base address assigned to the audioport in the
system's memory map. The variable "command register" is then assigned the base address
value. On the last line of code, the hexadecimal value 4 is written into this address by using the
variable "command register" as a pointer.

When the CPU executes the variable assignment, its hardware creates a memory write
transaction on the bus just like with any other variable assignment. The address stored in the
command register variable is placed on the address bus PADDR, the value 4 assigned through
the pointer reference is placed on the write data bus PWDATA, and the bus control signals
are "wiggled" according to the bus protocol requirements.

The control_unit is designed to detect transactions that are targeted at it from the bus signals.
When it detects a write transaction, it stores the data value in the register selected by the bus
address. Read transactions would work the same way.

The IP block is free to do whatever it has to do with the data in the register bank. The data can
be used to control the block, for instance to enable and disable some functions, or it could be
data that should be processed by the block.
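
On the hardware side, the write decoding described above boils down to logic of roughly the following form; this is a simplified sketch with assumed signal and register names, and it omits the read path, wait states and the rest of the register bank:

  // Store PWDATA into the register selected by PADDR on an APB write transfer.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)
      cmd_r <= '0;
    else if (psel && penable && pwrite) begin
      case (paddr)
        CMD_REG_ADDRESS: cmd_r <= pwdata;   // e.g. the value 4 written by the CPU
        default: ;                          // writes to other addresses handled elsewhere
      endcase
    end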

14 Testbench
This slide describes the organization of the testbench you can use to simulate the control_unit
with the test program you will create.

The testbench module control_unit_tb instantiates the RTL module control_unit and the
SystemVerilog program control_unit_test. The program implements the "software part" in this
setup and it can also write and read the signals that will connect the control_unit to other parts
of the audioport hardware.

The testbench also instantiates two SystemVerilog interface objects that function as bus
functional models. The APB interface is used extensively in this exercise.

The test program has ports of these interface types, which means that you can use the tasks
defined in the interfaces to write and read the bus signals.

The control_unit module on the other hand is connected to the signals inside the interfaces with
assign statements.

As you can see from the connections, the test program has full control of the control_unit's ports
which is of course required for verification.
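
In outline, such a testbench could be structured along these lines; this is only a sketch, and the module, interface and port names are assumptions, since the project's control_unit_tb fixes the actual port lists and connections (which it makes with assign statements):

  module control_unit_tb;
    logic clk = 1'b0;
    logic rst_n;

    always #5 clk = ~clk;                                 // clock generation (period assumed)
    initial begin rst_n = 1'b0; #20 rst_n = 1'b1; end     // simple reset pulse

    // Bus functional models implemented as SystemVerilog interfaces.
    apb_if     apb_bus (.clk(clk), .rst_n(rst_n));
    irq_out_if irq_bus (.clk(clk));

    // Device under test, wired to the variables inside the interfaces.
    control_unit DUT (
      .clk       (clk),
      .rst_n     (rst_n),
      .psel_in   (apb_bus.psel),
      .pwdata_in (apb_bus.pwdata),
      // ... the remaining APB signals and the irq_out output are connected the same way ...
      .irq_out   (irq_bus.irq)
    );

    // Test program: drives and observes the DUT through the interface tasks.
    control_unit_test test_prog (.apb(apb_bus), .irq(irq_bus));
  endmodule
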
15 APB if
You can use the "abp i f" interface provided with the project files to model the functionality of an
APB bus. Here you can see fragments of the main parts of the interface declaration.

The signals of the interface are represented as variables.

The interface can also have inputs, such as the clock and reset inputs here, through which
external signals can be connected to the interface.

The "modport" declarations can be used to define the interface signals as inputs or outputs in
different scenarios where the interface is used as a port of a module.

A clocking block can be used to synchronize the sampling of interface signals to a clock signal
and to change signal timings.

The task subprograms can be used to read and write interface signals. Since tasks in
SystemVerilog can contain timing controls, it is possible to generate complex timing waveforms
with a task. A test program can then call the task to create full bus transaction waveforms. This
way the bus signaling protocol does not have to be coded into the test program, which makes it
possible to change the bus specification by just choosing a different interface.
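
A heavily abbreviated sketch of such an interface is shown below; the real apb_if in the project files defines the complete APB signal set, modports, clocking and wait-state handling, so treat the names and details here only as an illustration:

  interface apb_if (input logic clk, input logic rst_n);

    // Bus signals modeled as variables.
    logic        psel, penable, pwrite, pready;
    logic [31:0] paddr, pwdata, prdata;

    // Modport: signal directions as seen by a bus master such as the test program.
    modport master (output psel, penable, pwrite, paddr, pwdata,
                    input  pready, prdata);

    // Clocking block: synchronizes signal sampling and driving to the clock.
    clocking cb @(posedge clk);
      output psel, penable, pwrite, paddr, pwdata;
      input  pready, prdata;
    endclocking

    // Task that generates one complete APB write transaction.
    task write (input logic [31:0] addr, input logic [31:0] data, output logic fail);
      @(posedge clk);
      psel <= 1'b1; pwrite <= 1'b1; paddr <= addr; pwdata <= data;
      @(posedge clk);
      penable <= 1'b1;
      @(posedge clk);                           // wait-state handling omitted in this sketch
      psel <= 1'b0; penable <= 1'b0;
      fail = 1'b0;
    endtask

  endinterface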

16 apb_if usage
By using the tasks defined in the APB interface, test program coding becomes easy.

To create a bus write transaction, you'll only have to set up variables to hold the address and
data and pass these variables to the write task.

For a read transaction, you have to pass a variable in which the read value will be returned.

The third argument in the apb.write and apb.read calls is used for returning status data.

The APB interface tasks can also handle wait states that the control_unit may insert by keeping
the PREADY signal low during the access phase of the APB transactions. The write and read
task calls will block until the transaction has completed or the maximum allowed number of wait
states has been exceeded.
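
In code, a test task might use the interface tasks roughly as follows; the variable names and exact argument lists are assumptions based on the description above, so check the project's apb_if declaration for the real ones:

  logic [31:0] addr, wdata, rdata;
  logic        fail;

  addr  = CMD_REG_ADDRESS;
  wdata = 32'h0000_0004;
  apb.write(addr, wdata, fail);        // blocks until the write transaction completes
  if (fail) $error("APB write did not complete");

  apb.read(addr, rdata, fail);         // read value is returned in rdata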

17 Test program organization


The test program code is created in two files.

The file control_unit_test.sv contains a SystemVerilog program block that executes the tests.
The tests are defined as SystemVerilog tasks in another file, called control_unit_test.svh, which
is included into the program file at compile time using the "include" compiler directive.

The test program contains an "initial" procedure from which the test tasks are started
one-by-one. The testplan section of the control_unit's specification describes the required
functionality of each test task. When you have written the code for a task, you can add a call to
it in the initial procedure of the program to test it. So you don't have to write the code for all
tasks first before you can simulate any of them.

Notice that tasks can call other tasks. In this example, the apb_test task calls the reset test task
to execute a reset, avoiding replicating the reset sequence code.
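
The overall shape of the two files is roughly the following; this is a structural sketch in which the task bodies are omitted and the task names simply follow the description above:

  // control_unit_test.sv
  program control_unit_test (apb_if apb, irq_out_if irq);

    `include "control_unit_test.svh"    // test tasks defined in a separate file

    initial begin
      reset_test;                        // add calls here one by one as tasks are completed
      apb_test;
    end

  endprogram

  // control_unit_test.svh
  task reset_test;
    // ... drive the reset sequence ...
  endtask

  task apb_test;
    reset_test;                          // tasks can call other tasks
    // ... APB register write and read checks ...
  endtask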

18 Coding Tips
This slide presents some coding tips for creating the test tasks. Notice that most of these code
examples cannot be used in RTL code.

The first example shows the syntax for creating APB write and read transactions. You have to
declare and initialize the variables passed as arguments before you can use them. It is a good
practice to check the value of the fail-variable after each task execution.

The second example shows the usage of the monitor task of the irq interface for reading the
state of the irq_out output of the control_unit.

The third example shows how you can create an immediate assertion to check if some logical
condition is true, and to report the result and update test statistics. An immediate assertion is
created by placing an "assert" statement inside procedural code. The assert statement works
like an if-statement, but typically only the else block of the statement is defined. The code in the
else block is executed if the assert condition evaluates to false. You must use the identifier
defined in the testplan as the label of the assert statement so that the code is linked with the
verification plan and will be compatible with various setup files. Use the function assert_error
defined in the project's package file for error reporting.

Example 4 shows how you can delay program execution for a certain number of clock cycles
using the repeat loop and the posedge timing control.

Example 5 shows how a level-sensitive wait can be defined using the wait statement.

Example 6 shows how the negedge timing control can be used to change signal values in the
middle of the clock cycle. Having input signals change in the middle of the clock cycle makes
the interpretation of simulation waveforms easier.
The last example shows how the fork statement can be used to create concurrently running
processes inside procedural code. This is useful if you have to control or monitor several signals
at the same time for instance by writing to inputs constantly while at the same time waiting for
results to appear in the outputs.

A fork block can contain any number of code blocks. When simulation enters the fork block, all
code blocks begin to execute at the same time. In the example the code blocks are empty, but
you can write any procedural statements between the begin and end keywords.

The "join any" keyword at the end allows execution of code that follows to continue when any of
the code blocks inside the fork block has run to its end. The "disable fork" statement ends all
processes that may have been left running when execution of code following the fork block was
enabled by the join statement.
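
Put together, a test task combining several of these tips could be sketched as follows; the signal names, register address and the assert_error function are placeholders that stand in for the ones defined in the project files:

  task my_example_test;
    logic fail;

    fork
      begin
        // Drive stimulus: write a command register, then wait a few clock cycles.
        apb.write(CMD_REG_ADDRESS, 32'h4, fail);
        repeat (10) @(posedge clk);
      end
      begin
        // In parallel, wait for the interrupt output to rise (level-sensitive wait).
        wait (irq_out == 1'b1);
      end
    join_any
    disable fork;                        // stop whichever branch is still running

    // Immediate assertion: check the result, report and update test statistics.
    T_IRQ_CHECK: assert (irq_out == 1'b1) else assert_error("T_IRQ_CHECK");
  endtask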
L3

2 Principle of Assertion-Based Functional Verification


Assertions are pieces of code that check that the design has the required properties, meaning
that all required functions have been implemented and that they produce correct results in
simulation.

In the course project, you have already implemented the test program for the control_unit. The
next task is to develop the assertions that check its required functional properties when it is
simulated with the test program.

The key benefit of using assertions is that they take a large part of the burden of checking
verification results away from the verification engineers, as demonstrated in the example shown
here.

In a traditional design and verification flow, RTL designers create the design and verification
engineers create the test programs, after which verification engineers check the results by
examining output signal waveforms or comparing output data with reference data. This can be
hard work whose results depend partly on the vigilance of the engineers, and even if all errors
are eventually detected, finding the cause of each error based only on the output data can still
be difficult.

Assertions add two benefits to the verification process. Since they monitor the design constantly,
they can raise an alarm immediately when the error occurs, and not only after the error's effects
have propagated to the outputs. The second benefit is that the location of the bug in the code
can often be easily found based on the specific assertion that failed.

3 Counter Example 1
The following two slides give an example of the benefits of assertion-based verification.

In this example, we have a binary counter that should count up to state 9 when the mode input
is in state 0, and up to state 13 when the mode input is in state 1. The RTL coder's
interpretation of this functionality is shown on the left.

Since this is a very simple design, the RTL coder might module-test it by just simulating the
counter in mode 0 for some time and then some more in mode 1. The results are shown in the
first waveform image on the right. They look good, don't they?

A better-informed engineer might execute a similar test by driving the mode select input with
random data instead, as testing with random test data is known to give better coverage. On the
surface, the results still look good. Only if you check the waveform carefully will you notice some
anomalies.

The code has one problem. The not-equal operator is used to decide whether the counter should
roll back to zero or continue counting up. If the mode input is changed from 1 to 0 when the
counter value is greater than 9, the counter will count up to 15, which is not allowed in either
mode. A less-than operator should have been used instead, even though its hardware
implementation will probably require more logic gates.
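
A corrected version of the next-state decision could look roughly like this; the signal names are assumptions, since the slide uses the design's own names:

  // Count up while below the limit of the current mode, otherwise roll back to zero.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)
      count_r <= '0;
    else if (count_r < (mode ? 4'd13 : 4'd9))   // was: a != comparison against the limit
      count_r <= count_r + 1;
    else
      count_r <= '0;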

4 Counter Example 2
In the third simulation of the counter shown here we have added assertions into the model. You
can see the code for a property that defines the required functionality for mode 0, and an assert
directive that checks this property. This is a so-called concurrent assertion that is executed on
every clock edge when the reset is not enabled, as indicated in the property statement.

As the red error indicators in the waveform window show, faulty operation is reported
immediately when it is detected, so there is no need to examine the waveforms. We are also
informed that the variable ctrdiv_r is in the wrong state.
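
The flavor of such an assertion is shown in the following sketch; the slide's actual property differs in its details and signal names, so this is only an approximation of the idea:

  // In mode 0, once the counter has reached 9 (or drifted above it),
  // it must wrap to 0 on the next clock cycle.
  property ctr_mode0_wrap;
    @(posedge clk) disable iff (!rst_n)
      (mode == 1'b0 && ctrdiv_r >= 4'd9) |=> (ctrdiv_r == '0);
  endproperty

  af_ctr_mode0_wrap: assert property (ctr_mode0_wrap)
    else $error("ctrdiv_r did not wrap in mode 0");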

5 Assertions in SystemVerilog
The SystemVerilog language has two kinds of assertions. We have already discussed the
procedural assert statement, which is typically placed in procedural code in test programs. This
kind of assertion is also known as an immediate assertion. Its limitation is that it performs its
check only when the assert statement is executed in the procedural code. It is therefore useful
for validating data and checking directed test results at specific times.

The right-hand side of the slide presents the idea of a concurrent assertion, which is the second
assertion type in SystemVerilog.

Concurrent assertions are placed outside procedural code and they typically check the target
property on every clock cycle. Furthermore, the properties to be checked can also be
sequences of logic functions that span many clock cycles, and even more complex behaviors. A
concurrent assertion could check, for instance, that if a request pulse is seen, an
acknowledgement pulse is generated within a certain number of clock cycles.

By coding all required properties of a design as concurrent assertions we can create a robust,
self-checking verification environment that monitors the correctness of the design all the time in
simulation.

6 Concurrent Assertions
A concurrent assertion declaration consists of two parts.
The property to be asserted must first be declared using a property statement. SystemVerilog
has a built-in property declaration language for this.

The properties can be checked with assertion directives. This slide describes their usage.

The assert directive is used to check that a property holds. It is used for detecting bugs.

The cover directive is used to count how many times a property has been true in verification. It
is typically used to check that all properties were activated in simulation by the test data
generated in the test program.

The assume directive behaves in simulation like the assert directive. In formal verification it is
used as a constraint for the input data used to prove properties. You typically create 'assume'
properties to define the valid input data values and ranges or waveforms. This way you will get
notified if the test program generates the wrong kind of data.

The restrict directive has no effect in simulation, and works like the assume directive in formal
verification. It is very seldom used.
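
As a generic illustration, not taken from the project code, the same kind of property can be put to work with the different directives like this:

  property p_req_ack;
    @(posedge clk) disable iff (!rst_n)
      req |-> ##[1:4] ack;               // every request must be acknowledged within 4 cycles
  endproperty

  a_req_ack: assert property (p_req_ack);   // report an error if an ack does not arrive in time
  c_req_ack: cover  property (p_req_ack);   // count how many request-acknowledge pairs were seen

  // Constrain the environment: req is a single-cycle pulse. Checked like an assert in
  // simulation, used as a proof constraint in formal verification.
  m_req_pulse: assume property (@(posedge clk) disable iff (!rst_n) req |=> !req);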

7 How to define properties


As you have seen from the previous examples, property statements can look quite cryptic. They
will begin to look less obscure when you have learned the basic syntax and principles of
constructing them.

In its simplest form, the body of a property statement contains only a Boolean expression. You
can therefore define a property just like you would write the condition of an if statement.

If a simple logical condition cannot describe the behavior you want to model, you can define a
sequence of Boolean expressions. The sequence states which Boolean expressions should be
true on which clock cycles. Some additional syntax elements are needed for defining
sequences. They will be described shortly.

You can declare even more complex properties by using property expressions to define
relationships between, or functions of, sequences. You can for instance state that if one
sequence occurs, then another must follow after a certain time, or that one sequence must
contain another. A large number of property operators are available for declaring property
expressions.

Once you know the syntax for defining sequences and the operators for declaring property
expressions, parsing a complex property statement becomes easier.
In the next few slides you will see examples of different kinds of properties.

8 Booleans
Here you can see a property called in_bounds that is activated on every clock edge if the reset
signal is not in its active state. This is a typical enabling condition for properties. When the
system is in reset, its signals will not get correct values even if we bring correct data into the
inputs.

This property is defined using a simple Boolean expression, which states that the value of the
counter variable should always be less than 25.

These kinds of simple range checks are very powerful in practice. Design bugs are caused by
human errors, and they can have random effects that manifest themselves as completely
out-of-bounds values. Basic range checks like this one are very useful in detecting events
nobody would have expected.
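
Written out, a property of this kind looks roughly as follows; the clock, reset and counter names are assumed here:

  property in_bounds;
    @(posedge clk) disable iff (!rst_n)   // not evaluated while the reset is active
      count < 25;                         // simple Boolean range check
  endproperty

  a_in_bounds: assert property (in_bounds);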

9 Sequences
This example shows how you can define a property as a sequence of Booleans. For that we
have to use the cycle delay operator, ##. The cycle here refers to the clock specified in the
enabling condition of the property.

The body of the property declares a sequence that consists of some Boolean expressions of the
variables load and shift.

The matching of the property with data is started on every clock cycle, and if all elements of the
sequence match on their respective clock cycles, then the sequence matches and the property
evaluates to true. The wave diagram shows some matching trials. Only one complete match
occurs, starting from the second clock cycle.

Notice that the property is activated using a cover directive, which means that in simulation the
simulator will count how many matching data sequences were seen.

An assert directive would not be useful here. It would require that the sequence matches on
every clock cycle, which is not possible.

Notice from the remarks on the left that in addition to fixed delays, you can also define delay
ranges, and even ranges with indefinite length with the $ operator.

The repetition operators shown there are also useful as they can make sequence expressions
much shorter. The "goto repetition" operator shown last does not require a match on every
consecutive clock cycle.
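
A sequence property of this general shape could be written as follows; the actual Booleans and delays on the slide differ, so this sketch only illustrates the syntax:

  property load_then_shift;
    @(posedge clk) disable iff (!rst_n)
      (load && !shift) ##1 (!load && shift) ##2 !shift;
  endproperty

  c_load_then_shift: cover property (load_then_shift);   // count complete sequence matches

  // Variants: ##[1:3] for a delay range, ##[1:$] for an open-ended range,
  // shift[*2] for consecutive repetition, shift[->1] for goto repetition.
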
10 Property Expressions
This slide shows an example of a property that is defined as a property expression. The
expression in this case is an implication, which is probably the most common property expression
and one that you should definitely know.

The implication defines a relationship between two sequences. The sequences are separated
by an implication arrow operator.

The left hand side sequence is called the antecedent. The term "enabling condition" is also often
used.

The right-hand-side sequence is called the consequent.

A non-overlapping implication evaluates to true if the antecedent matches, and the consequent
matches starting from the next clock cycle after the end of the antecedent. The implication also
evaluates to true if the antecedent does not match.

An overlapping implication works the same way, but the last cycle of the antecedent and the first
cycle of the consequent overlap.

In the code example, the sequence property from the previous slide has been rewritten as an
implication. The first sequence element now functions as the antecedent of the implication. This
means that the full sequence must only match starting from the clock cycle on which the first
Boolean is true. In other words, the implication requires that if the start symbol is seen, the full
sequence must follow, but the start symbol does not have to occur on every clock cycle.

It now makes more sense to use the assert directive. With the same data, the assertion now
passes on every clock cycle, either because the antecedent does not match, or because the
antecedent and the consequent sequence both match.

In this slide we have used the implication operator as an example of SystemVerilog property
operators. The next slide presents a list of other likewise powerful operators that you can use to
define property expressions.
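As an illustrative sketch (again with assumed signal names, not the exact slide code), the two
implication forms can be written like this:

  module impl_sketch (input logic clk, rst_n, load, shift);

    // Non-overlapping implication: if load is seen, shift must be high
    // for three cycles starting from the NEXT clock cycle.
    a_nonoverlap: assert property (@(posedge clk) disable iff (!rst_n)
                                   load |=> shift [*3]);

    // Overlapping implication: the consequent starts on the SAME cycle
    // as the last cycle of the antecedent.
    a_overlap: assert property (@(posedge clk) disable iff (!rst_n)
                                load ##1 !load |-> shift [*3]);

  endmodule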

11. Operator Summary


This slide lists all sequence and property operators defined in the SystemVerilog standard. Use
the standard document or resources on the Internet to learn more about these operators. You
don't have to know that many operators to be able to write useful assertions, but the more you
know the easier it is. For Boolean expressions you can use the basic operators of
SystemVerilog.
The third column of the table shows the associativity of the operators. Associativity defines the
evaluation order of operators in an expression that contains several operators. You can always
use parentheses to force the evaluation order, which is a good way to avoid unpleasant surprises.

12. Summary: Assertion Creation and Most Common Building Blocks

This slide shows the workflow, if you will, for writing assertions.

You begin from the bottom, and first decide if the property can be defined as a Boolean
expression, which is evaluated in one clock cycle. Boolean properties are good for checking
conditions that must prevail all the time, for instance, A and B must not be both zero, or C must
be less than D.

Notice that you can use SystemVerilog's sampled value functions such as $rose and $fell in
Boolean expressions and this way avoid creating a two-cycle sequence. Bit-vector functions
such as $isunknown and $onehot can also be useful.
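
For example, a couple of single-cycle checks of this kind might look as follows (the signals req
and ack are hypothetical, invented just for this sketch):

  module bool_prop_sketch (input logic clk, rst_n, req, ack);

    // The handshake signals must never be unknown (X or Z) outside reset.
    a_known: assert property (@(posedge clk) disable iff (!rst_n)
                              !$isunknown({req, ack}));

    // A rising edge of ack is only legal on a cycle where req is asserted.
    a_ack_rise: assert property (@(posedge clk) disable iff (!rst_n)
                                 $rose(ack) |-> req);

  endmodule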

If the property has to track signal values over many clock cycles, you have to define a
sequence. First define the Booleans that check the expected signal values on every clock cycle
of the sequence, and then construct the sequence using delay and repetition operators. If the
sequence consists of several overlapping or intersecting subsequences, you can use sequence
operators to describe the composite sequence.

You finish the property design by constructing a property expression using the property
operators. A simple Boolean can be used as it is, but multi-cycle sequences usually require
some operator such as an implication to be usable with assert directives, as was shown earlier.

When the property has been created, it is put into action with a directive: assert, cover
or assume. Which one to use depends on the verification plan. The general principle is that you
use asserts to detect situations where something went wrong: wrong signal values or a wrong
sequence were seen. Covers will tell you whether what you expected to happen also happened.
Assume directives are usually applied to input ports to define the allowed data ranges or
protocols.

13 Installation of Concurrent Assertions in a Design

Concurrent assertions can be placed in many places in the design hierarchy, even inside RTL
modules. Their code is not synthesizable, but synthesis programs usually gracefully ignore it.
The most common and flexible way to use assertions is to declare them in a separate module
that only contains the assertions and other verification code. This 'assertion module' can then be
instantiated inside a design module by using the SystemVerilog bind statement.

This technique has two advantages. First of all, you don't have to touch the RTL modules and
risk corrupting them. Mixing RTL code and verification code is also a bad idea in general from a
data management point-of-view. The second advantage is that you can easily disable the
assertion module by just commenting out the bind statement when you don't need the
assertions. Assertion evaluations can consume a lot of CPU cycles in simulation.

When the design hierarchy is elaborated, the assertion module gets instantiated inside the
target module. If the assertion module's port names match the port and variable names of the
target module, the assertion module sees everything that happens inside the target module.

The code fragments on the left show the principle of using the bind statement. In this case it is
in a separate file that is included into the testbench module. Using a separate file allows other
tools such as formal verification programs to use the same bind file.
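The sketch below shows the idea with made-up names (my_rtl_module and its internal register
count_r are assumptions for illustration); it is not the course's actual bind file:

  // Assertion module: its ports are named after ports and internal
  // variables of the target module, so a wildcard connection finds them.
  module my_rtl_svamod (input logic clk, rst_n,
                        input logic [4:0] count_r);

    a_count_range: assert property (@(posedge clk) disable iff (!rst_n)
                                    count_r < 25);

  endmodule

  // Bind file, kept separate and included into the testbench: instantiate
  // the assertion module inside every instance of my_rtl_module.
  bind my_rtl_module my_rtl_svamod i_svamod (.*);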

14 Debugging Assertions in QuestaSim


If an assertion fails in simulation, the reason is either a bug in the design or a bug in the
assertion code. The assertions should therefore be verified thoroughly before use so that they
can be trusted in design verification. Simulators usually provide good tools for debugging
assertions.

In this example you can see a property that declares an implication between two sequences.
The antecedent uses an externally defined sequence to simplify the property code. The
sequence statement can be used to create named sequences that can then be used in property
declarations.

The waveform image shows how the evaluation of an assertion thread is presented in the
QuestaSim simulator. The simulator uses different colors and symbols to indicate the clock
edges on which the matching of the antecedent sequence began, when it matched, and when
the consequent matched or did not match. The property variable values are shown on top of the
active thread.

15 Tip: Auxiliary Code


When the simulator sees that checking of an assertion should begin, it starts a processing
thread to evaluate the assertion. On every clock cycle, a large number of threads is started, and
because assertion sequences can be long, an even larger number of threads can be active
simultaneously. This is why assertion code should be optimized for performance so that it
doesn't slow down simulation too much.
It is often the case that a large number of properties depend on the same set of signals, and
that the properties define the same Booleans or sequences with these signals. Properties
describing memory-mapped register interfaces, such as the one shown here, are a good
example of this.

To detect an access to a register, we must check for some Boolean combination of the bus
control signals. If there is a large number of registers we want to track, the same code must be
replicated in every property that tracks accesses to some register. The evaluation of the same
expression therefore occurs many times on every clock cycle.

A good way to optimize assertion code is to identify expressions that occur in many properties,
and compute their value in separate processes that assign the values to temporary variables.
The property code can then reference these variables. This way, the common subexpressions
are evaluated only once. The use of auxiliary code also makes the property code more
readable.
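
The sketch below illustrates the idea with an assumed APB-style register block; the register
names and addresses are invented for the example:

  module regbank_svamod (input logic clk, rst_n,
                         input logic psel, penable, pwrite,
                         input logic [7:0]  paddr,
                         input logic [31:0] pwdata, cmd_r, cfg_r);

    // Auxiliary code: decode the common "register write" conditions once
    // per clock cycle instead of repeating the expression in every property.
    logic cmd_write, cfg_write;

    always_comb begin
      cmd_write = psel && penable && pwrite && (paddr == 8'h00);
      cfg_write = psel && penable && pwrite && (paddr == 8'h04);
    end

    // The properties only reference the pre-computed variables.
    a_cmd_update: assert property (@(posedge clk) disable iff (!rst_n)
                                   cmd_write |=> cmd_r == $past(pwdata));

    a_cfg_update: assert property (@(posedge clk) disable iff (!rst_n)
                                   cfg_write |=> cfg_r == $past(pwdata));

  endmodule
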
L4

2 Register-Transfers in Computer Engineering


In the previous lectures we have discussed different methods for the verification of
register-transfer level designs. In this lecture, we shall focus on the heart of the matter, that is,
RTL design itself. The assumption in this course is that you already have a basic understanding
of what RTL design is. In this lecture we aim to give you some insight on how RTL design is
done in practice.

Let's begin by recalling what we mean when we say that the operation of a digital circuit is
described at the register transfer level. We will use these examples that may look familiar from
computer engineering textbooks, in which computer architecture operation is usually
represented as register transfers.

Here we have three types of register transfers.

The first one represents a direct transfer to a register from an input, or from another register. A
register is actually a bunch of flip-flops that share a clock and reset signals, as shown in the
small schematic. The flip-flops in this configuration are loaded with new data on every rising
edge of the clock signal. In the RTL block diagrams clocks and resets are not usually shown, as
that would not add any relevant information. We just assume that the rectangles represent
registers that are loaded on every clock cycle.

The second example shows a conditional transfer that occurs only when the input K of the
register is true. This could be realized in many ways on the flip-flop level, but let's just assume
that we have a combinational logic based two-input multiplexer in front of the flip-flops. The
control signal K drives the select input of the multiplexer. The flip-flops are still loaded on every
clock edge, but depending on the state of K, they will get new or old data.

In the third example we have added a data processing block, a combinational adder in this
case, in front of the register. In computer architecture terminology, this arrangement defines a
register transfer with a micro-operation. Instead of the adder, any other combinational logic
based data processing function could be used to implement the micro-operation.

These three operations, register transfers, conditional transfers, and transfers with data
processing operations are all that we need to design RTL architectures. Instead of block
diagrams, we could use a textual format, such as the classical RTL notation shown in the
bottom right corner of the slide, to describe the operations. In practice, hardware description
languages are used for this.
3 RTL Coding Rules for Combinational Logic
If you want to model RTL functions with an HDL so that the code can be synthesized into logic
gates and flip-flops, you have to follow certain coding rules. This is because HDLs have not
been developed only for writing RTL code, which is why they allow you to describe functions
and behaviors that do not synthesize well or at all.

Let's begin by reviewing the rules for combinational logic, which are shown in this slide. The
same rules apply to any HDL. In this slide, we have examples in SystemVerilog and VHDL.

The basic building block of functional models is a process, which in HDLs is a construct that
contains statements that are executed sequentially, one by one, but inside one simulation
timestep in ideal RTL models. Processes are concurrent with respect to other processes, and
one design unit can contain any number of processes. In SystemVerilog, the always_comb
procedure is used for modeling combinational logic processes, but you can also use the basic
always procedure for this. In VHDL, the process statement is used.

You can think of processes as loops that are executed forever. In hardware modeling, the top of
the processing loop contains an expression that suspends the loop until the condition defined by
the expression is met. This expression is called an event list in SystemVerilog and a sensitivity
list in VHDL. In combinational processes, the list must contain all inputs of the combinational
logic block. This is the first rule to remember. SystemVerilog's always_comb has an implicit
event list that automatically contains all inputs. In VHDL 2008 you can use the keyword all
instead of listing all names.

The second rule states that you must assign a value to every output of the combinational block
whenever the process is executed, otherwise the code will generate latches that implement the
implied memory function in synthesis. Implied memory means that if a signal is not assigned a
value, it must preserve its state, which requires a latch or flip-flop.

The third rule states that you must assign a value to a variable before you can use the variable
in an expression. The aim is again to avoid describing a function that requires memory.
Combinational circuits can only compute values from the current inputs.

The fourth rule advises you not to write code that creates combinational feedback loops. They
are usually dangerous in synchronous digital circuits.

The examples show the SystemVerilog and VHDL code for a 2-input multiplexer. Notice that the
blocking assignment operator must be used in combinational processes in Verilog and
SystemVerilog.
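A minimal SystemVerilog sketch of such a multiplexer, following the four rules, could look like
this (the port names are assumptions, not the slide's exact code):

  module mux2 (input  logic       sel,
               input  logic [7:0] a_in, b_in,
               output logic [7:0] y_out);

    // always_comb has an implicit event list covering all inputs (rule 1),
    // and the output is assigned on every execution (rule 2), so no
    // latches are inferred. Blocking assignments are used throughout.
    always_comb begin
      if (sel)
        y_out = a_in;
      else
        y_out = b_in;
    end

  endmodule
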
4 RTL Coding Rules for Sequential Logic
This slide presents the coding rules for modeling sequential logic. The basic construct is again a
process. In SystemVerilog, the special always_ff procedure should be used for processes that
represent sequential logic whose state memory is meant to be implemented with edge-sensitive
flip-flops in synthesis.

The structure of the model of a sequential circuit depends on the active clock edge and on the
type of the reset. The examples shown here represent the case where the clock is active on the
rising edge, and the reset is asynchronous and active low. In that case, both the clock and the
reset must be included in the event list of the process. In SystemVerilog, the active edges of the
clock and reset should be indicated with the posedge or negedge operator.

In the process body, the code should first check with an if statement if the reset is active. In
SystemVerilog models, the else branch of the if statement should describe the next state
encoding logic. In VHDL, an elsif statement whose condition expression detects the clock edge
should be used. Notice that with VHDL, sequential logic functions can also be described in
many other ways. The coding style shown here is recommended as it closely resembles the
style used in SystemVerilog and Verilog models.

If the sequential circuit does not have reset functionality, or if the reset is synchronous, the
process template must be changed accordingly.
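
A minimal SystemVerilog sketch of this template, for a rising-edge clock and an asynchronous
active-low reset, might look like this (port names assumed for illustration):

  module reg8 (input  logic       clk, rst_n, load_in,
               input  logic [7:0] d_in,
               output logic [7:0] q_out);

    // Both the clock and the asynchronous reset appear in the event list.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n)
        q_out <= '0;        // reset branch checked first
      else if (load_in)
        q_out <= d_in;      // clocked behavior in the else branch

  endmodule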

5 RTL Architect's Pattern Language: Combinational Blocks


We have now learned that an RTL model defines a design's registers that store data,
multiplexers that control data flow, and combinational logic based data processing blocks that
compute the next state values of registers. We also know the principles for modeling these
building blocks in SystemVerilog or VHDL. This allows us to verify the functionality of the model
with a simulator, and implement it as logic gates and flip-flops by using a logic synthesis
program. The logical question now is: how do we design the RTL model itself?

It is general knowledge that good design, in any field, is based on experience. That's why you
cannot become a good designer overnight. However, you can speed up the learning process by
studying how things are usually done in your field.

In any field, designers face recurring and similar design problems that can usually be solved in
the same way or using the same principles. These are called design patterns. If you learn the
common design patterns of your field, you can benefit from the experiences of previous
generations of designers.

In this and the next few slides, we shall take a look at some common design patterns that recur
in register transfer level architecture design. We shall begin with combinational blocks.

The multiplexer, also known as a data selector, is probably the most common building block in
digital designs. A multiplexer selects one of many inputs and forwards the selected input's value
to the output. A multiplexer can be used to control data flow in general, to implement conditional
register transfers, or to control resource sharing between multiple operations. Here you can see
two different ways of describing a multiplexer in SystemVerilog. Notice that conditional
statements, such as if and case statements, will in general produce multiplexers in synthesis,
even if you are not writing them with the purpose of explicitly defining a multiplexer. It is also
good to understand that nested if statements will produce nested multiplexers.

Combinational blocks that implement arithmetic and logical operations are another important
category in RTL designs. You can define the function of these blocks by just writing arithmetic or
logical expressions. The synthesis tool will automatically map the operators to the respective
combinational resources. A multiplication operator will produce a multiplier block, and so on.

It is important to notice that in RTL design operation bit-width and signedness handling is the
responsibility of the designer. Especially when using the loosely-typed SystemVerilog language,
you have to be very careful with these issues. In the code example, the unsigned and signed
computations do not yield the same results, so forcing the compiler to handle signedness the
way you want it is crucial.
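
The short sketch below demonstrates the kind of difference meant here; it is an assumed
illustration, not the slide's code:

  module sign_demo;

    logic        [7:0] a = 8'hF0;   // 240 unsigned, -16 when seen as signed
    logic        [7:0] b = 8'h10;   // 16
    logic        [8:0] sum_u;
    logic signed [8:0] sum_s;

    initial begin
      sum_u = a + b;                    // unsigned arithmetic: 240 + 16 = 256
      sum_s = $signed(a) + $signed(b);  // signed arithmetic:  -16 + 16 = 0
      $display("unsigned %0d, signed %0d", sum_u, sum_s);
    end

  endmodule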

As the title of the third example states, you are not limited to using single operations or
expressions when describing combinational functions. You can write an algorithm of any
complexity to describe the translation between inputs and outputs, as long as you follow the
rules for modeling combinational logic. In the code example, you can see a combinational logic
based implementation of the Bubble-sort algorithm. It uses loops to describe repeated
operations, but everything can be implemented as combinational logic.

6 RTL Architect's Pattern Language: Registers


Let's move on to discuss common sequential logic based building blocks. This slide shows
some use cases of registers.

In the first example, we have three registers connected in series. There's no control logic visible,
so we can assume that the registers are loaded on every clock cycle. They therefore form a
delay line, where the total delay is equal to the number of registers multiplied by the clock
period. The code examples show two ways of modeling this functionality with RTL code.

Notice that all of the code examples represent registers that have no reset, to keep the code
shorter. In simulation, the registers would start from an unknown state, and get valid contents
when the input data reached them. In a real circuit, the registers would start from a random
state, or from some default state, 0 or 1, depending on the technology. You could make the
registers resettable by adding the reset code, as shown in the slide.

In the second example, multiplexers have been added to the basic delay line. This makes it
possible to control when the registers are loaded. This is the most common way to use
registers.

The third example presents a special case, to which we have given the name delayless register
in this slide. The idea is to make the data available at the output already in the same clock cycle
when it is being loaded into the register.

To implement this kind of functionality in RTL code, you must represent the multiplexer and the
register with separate processes, and use the register's next-state signal as the output.

Some designers prefer to use this design pattern for all registers, as it allows you to flexibly
reference both the current and next-state values of registers in a design.
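
A sketch of the delayless register pattern in SystemVerilog (names assumed) could look like this:

  module delayless_reg (input  logic        clk, rst_n, load_in,
                        input  logic [15:0] d_in,
                        output logic [15:0] q_out);

    logic [15:0] q_r, q_ns;   // register and its next-state signal

    // Combinational process: the input multiplexer.
    always_comb
      q_ns = load_in ? d_in : q_r;

    // Sequential process: the register itself.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n) q_r <= '0;
      else        q_r <= q_ns;

    // Output taken from the next-state signal, so newly loaded data is
    // visible at the output already during the loading clock cycle.
    assign q_out = q_ns;

  endmodule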

7 RTL Architect's Pattern Language: Counters


It is possible to implement any kind of sequential logic circuit by just taking a basic register and
adding an appropriate next-state decoding logic block in front of it. However, it still makes sense
to study examples of registers that have a specific kind of next state decoding logic function,
which makes them useful in some specific design problems. Counters are such circuits.

A counter is a sequential logic circuit whose next-state logic makes it step through a fixed state
sequence. This slide shows three examples.

The first example presents a binary counter, whose state is incremented by one on every clock
cycle if the enable input is high. In the SystemVerilog code, the counter is allowed to overflow
when it has reached the highest value. Assuming that the counter register has enough bits, its
state shows how many cycles the input en_in has been in state 1. In essence, the counter
measures time, which is why counters have numerous applications in electronics.

In the second example, we have added an output decoder to the counter design. The output 'div
2 out' is directly connected to bit 0 of the counter register. The waveform seen at this output is a
square wave whose frequency is half of the clock frequency, which is why this type of counter is
called a clock divider.

You can create a divide-by-4 counter by using the second bit, and so on.

The state of the third output, div10_out, is decoded to be 1 when the counter's value is greater
than 4. The counter's next-state decoder has been modified so that it forces the counter's value
back to 0 from state 9. This arrangement creates a divide-by-10 counter.

In the third example, we have a simple binary counter with a complex output decoder. The
output block is implemented as a lookup-table that just maps input codes to output codes. This
is a very useful design pattern, as it allows you to flexibly create arbitrary data sequencers or
control units by using a counter to drive the decoder logic.
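
A sketch combining the divide-by-2 and divide-by-10 ideas is shown below; the signal names
follow the spirit of the slide but are assumptions here:

  module div_counter (input  logic clk, rst_n, en_in,
                      output logic div2_out, div10_out);

    logic [3:0] count_r;

    // Binary counter whose next-state logic wraps back to 0 from state 9.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n)
        count_r <= '0;
      else if (en_in)
        count_r <= (count_r == 4'd9) ? 4'd0 : count_r + 4'd1;

    // Output decoders.
    assign div2_out  = count_r[0];         // half the clock frequency
    assign div10_out = (count_r > 4'd4);   // high for states 5..9

  endmodule
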
8 RTL Architect's Pattern Language: Shift Registers
Like counters, shift-registers are sequential circuits that have many uses in electronics. In
shift-registers the next state logic moves the bits in the register to left or right.

In the first example, we have an 8-bit register. Its next-state logic works so that when the input
mode_in is high, the register is loaded from input d_in. When mode_in is low, the contents of the
register are shifted once to the left. The output of the left-most bit, bit 7, of the register is
connected to the output of the circuit. The design can therefore function as a parallel-in
serial-out type shift-register. By first loading an 8-bit value into the register, and then keeping
mode_in low for 8 clock cycles, the value can be shifted out in serial format, bit-by-bit. This kind
of circuit has many uses in communications equipment that usually store data in parallel format
but transfer it serially over optical or wireless links, for instance. With minor modifications, the
circuit can be made to function as a serial-in parallel-out shift-register, which is useful at the
other end of the communications link.

A limitation of the design in the first example is that it requires an external control signal,
mode_in, to function. Sometimes we just want to serialize data at a constant rate, in chunks of 8
bits for instance. In such cases, it makes sense to add some control logic to the basic
shift-register design. A divide-by 8 counter is a perfect choice for this purpose. The second
example shows the clock divider generating mode select pulses that enable parallel loading on
every eighth clock cycle.

Counters and shift-registers, especially when used together, are probably the most important
sequential building blocks especially in communications circuit design, where they are used for
serializing and deserializing data, doing rate conversions, and so on.
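
A sketch of the parallel-in serial-out shift register with its divide-by-8 load control (names
assumed) is shown below:

  module piso8 (input  logic       clk, rst_n, en_in,
                input  logic [7:0] d_in,
                output logic       ser_out);

    logic [7:0] shreg_r;
    logic [2:0] count_r;
    logic       load;

    // Divide-by-8 counter: one load pulse every eighth clock cycle.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n)      count_r <= '0;
      else if (en_in)  count_r <= count_r + 3'd1;

    assign load = (count_r == 3'd7);

    // Shift register: parallel load when 'load' is high, otherwise shift left.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n)      shreg_r <= '0;
      else if (en_in)  shreg_r <= load ? d_in : {shreg_r[6:0], 1'b0};

    // Serial output from the left-most bit.
    assign ser_out = shreg_r[7];

  endmodule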

9 RTL Architect's Pattern Language: Control Logic


In the previous slide we saw an example of the use of a counter as a controller. That kind of
solution works as long as the control sequence is fixed. If the controller has to make decisions
based on input data, you will need a finite-state machine controller. A counter is of course also a
state machine in principle, but the design procedure for these is different. The next-state logic is
usually simple and well known for different kinds of counters. The next state logic of FSMs, on
the other hand, can be very complex and requires the use of a robust design method that can
capture the intended behavior correctly.

In the design of data processing circuits, whose datapaths contain large numbers of
multiplexers, enabled registers and multifunction combinational logic blocks, and that have
many control inputs and status outputs, the design procedure based on the finite-state-machine
abstraction is the most flexible. The finite-state machine with datapath is in fact a common
architecture-level design pattern.

The algorithmic state machine chart is the preferred method for describing the operation of
state machines. Its advantage over conventional state charts is that you don't have to write
complex, mutually exclusive logic expressions on the state-transition arcs, which is impractical
and error-prone if the FSM has many inputs. By using an ASM chart you can be sure that the
specification is unambiguous, and that you can easily convert it into an equivalent HDL model.

The RTL model of the FSM shown on this slide is based on a two-process RTL code "template",
in which one process models the combinational parts and the other the state register.

In the code, an enum datatype is used for current and next-state variables. This allows you to
use symbolic state names in the code, which makes it more readable, and makes it easier to
interpret simulation results.
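
The sketch below shows the two-process template with an enum state type; the FSM itself is a
made-up start/done handshake, not the slide's example:

  module simple_fsm (input  logic clk, rst_n, start_in, done_in,
                     output logic busy_out);

    // Enum type gives symbolic state names for readable code and waveforms.
    typedef enum logic [1:0] { IDLE, RUN } state_t;
    state_t state_r, state_ns;

    // Process 1: combinational next-state and output logic.
    always_comb begin
      state_ns = state_r;
      busy_out = 1'b0;
      case (state_r)
        IDLE: if (start_in) state_ns = RUN;
        RUN:  begin
                busy_out = 1'b1;
                if (done_in) state_ns = IDLE;
              end
        default: state_ns = IDLE;
      endcase
    end

    // Process 2: the state register.
    always_ff @(posedge clk or negedge rst_n)
      if (!rst_n) state_r <= IDLE;
      else        state_r <= state_ns;

  endmodule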

10 Using the Design Patterns: FIR Filter


Let's now try to use the design patterns presented in the previous slides to solve a real design
problem. In this case, it's a 3-tap FIR-filter algorithm.

Recall that an FIR-filter is a digital filter that reads data samples into a fixed-length queue
memory, and after having read a sample, multiplies every sample in the queue with a fixed
coefficient, and computes the sum of these multiplication results.

The code shown on the left describes the operation of the filter whose RTL architecture we are
trying to design. Lines 1 and 2 describe the operation of the queue memory represented by the
variable 'data'. Lines 3 and 4 describe the computations, and line 5 the scaling of the results.

The circuit should read in new samples at 10-clock-cycle intervals, and hold the previous result
in its output until a new result is ready. Filter coefficients are available as constant inputs.

The description of the data queue matches the operating principle of the delay line presented
earlier. Since the queue operates at 10-cycle intervals, and not all the time, a version with
enabled registers is applicable here. This covers lines 1 and 2 of the specification.

The next task is to find a hardware implementation for the multiplications and additions. As we
learned earlier, they are directly mapped to combinational logic based multiplier and adder
components in synthesis, so there is no need to design them more specifically. The result is
shown in the middle part of the block diagram. The right-shifting function does not require
hardware resources, which is why it is not shown in the block diagram.

The output-hold function can quite obviously be implemented with a basic enabled register.

The datapath of the circuit is now ready. However, we have a couple of dangling control signals:
The signal c_r that enables shifting of the delay line, and signal c_f that enables loading the
output register. We need some control logic to drive these signals.

The design of the control unit often gives you the opportunity to use creativity, because the
control can often be realized in many ways. In this case, let's decide that the circuit should work
so that it reads in the new sample into the delay line on one clock cycle, and computes the
result and stores it in the output register on another, and then waits until 10 clock cycles have
passed.

The two-state state-machine whose function is defined by the ASM chart at bottom left can
handle this. In the first state, called WAIT, it waits to be notified by the signal 'ten' that it's time to
start. When this happens, it raises the data register load-enable signal c_r, and enters the
execution state, EXE. In this state, the FSM just enables the loading of the results into the
output register. The filter algorithm is thus executed in two clock cycles.

At this point, only one thing remains unsolved. That is, where do we get the signal 'ten' that
should move the state machine into state EXE once in every ten clock cycles?

This again sounds like a job for a clock divider. In this case, the output of the clock divider
should be in state 1 for only one clock cycle in the counting cycle. We have chosen that state to
be state 9.

As a final touch, the counter is equipped with an enable input that can be used to start and stop
the filter.

The architecture design is now ready. The next task is to write the RTL code for it and then
check with simulation if it really works.

11 FIR Filter RTL Code


This slide shows the functional parts of the SystemVerilog code written based on the
architecture design.

There are a few things to note here.

The code uses 'for' loops to describe the data shifting and filtering operations. In RTL synthesis,
loops are just a shorthand way of writing multiple similar statements. Synthesis tools always
unroll loops when they compile the code, as shown in the slide. The loops therefore do not imply
sequential behavior, and the unrolled statements can be implemented as combinational logic.

The second thing you should notice is that it is important to implement type and data width
handling correctly in the code. In this case, the accumulator variable 'acc' has been declared as
a signed 18-bit bit-vector so that it can accommodate the sum of three 16-bit multiplication
results. Binary input data are handled as signed by using conversion functions.

Also notice that while the accumulation loop contains three additions, the first one does not
appear in the synthesized circuit. Because the value of the variable acc is zero in the first addition,
the synthesis program judges that an adder is not needed.

The waveform image shows the result of simulating this code.

12 But Wait, There's More!


In the FIR-filter design example, we made the problem look simple by just mapping the required
functions directly to well known design patterns. However, it is usually not enough to implement
just the functions correctly, as designs can also have performance and cost constraints.

The direct implementation of the FIR-filter has the problem that it consumes a lot of resources,
multipliers and adders, to be specific. For a 3-tap filter this is not a big deal, but if you imagine
what a 256-tap filter would look like, you'll get the idea.

The block diagram shown in this slide presents another solution to the same filter design
problem. This version is based on the assumption that we are allowed to use only one multiplier
and one adder in the architecture.

With these resources, we must realize the filter by computing the three filter taps one at a time,
in three clock cycles, by sharing the multiplier and adder components between the operations in
the algorithm. To share the multiplier we must add multiplexers in its inputs. The adder can be
shared by adding a register at its output to accumulate the results. Multiplexers are not needed
for the adder, as it always gets its data from the multiplier and accumulator register.

The control state machine must be redesigned to reflect the changes in the architecture. New
states must be added for the tap computations, and the output decoder must be changed to
generate control signals for the multiplexers and the accumulator register.

The silicon area of this version is 30% smaller, but its latency is three clock cycles longer than in
the previous design.

This example highlights an essential problem in RTL design, which is finding the best trade-off
between area and performance. In the filter design case, we would also have had the option of
using two multipliers, which would have produced a result somewhere in between the two
solutions presented here. If you again imagine a more complex algorithm, you can conclude that
the number of alternatives can be huge.
13 Simple RTL Architecture Design Procedure
As said, RTL design is a creative process where experience and knowing the design patterns
help. However, they do not completely eliminate the fact that finding a good solution often takes
a lot of work.

This slide outlines a workflow that may be useful, if you have to design an RTL architecture
completely from scratch, and you don't have any obvious solutions in sight.

The first thing to do is to study the functional requirements so that you understand the problem
completely. It is a good idea to write a computer program to model the problem. Having an
executable model at hand makes it a lot easier to learn how the design should work.

In RTL architecture design, you basically just have to figure out how many and what kinds of
registers you need, and then specify the next state logic functions of the registers.

You can therefore start with the registers. By studying the design's functional requirements,
allocate the registers you think are needed in the implementation of the design's functional
properties. Registers are needed for data that must be stored for one or more clock cycles.

When you have the registers, step through the data processing task the circuit should execute
on a clock-cycle-by-clock-cycle basis, and based on this analysis, refine the register descriptions
to include their next-state functions and the required control signals.

You will have to iterate steps 2a and 2b many times. Use block diagrams, spreadsheets or
whatever feels natural to capture your ideas. Eventually you will have an RTL block diagram that
shows the required combinational and sequential logic blocks.

You can now move on to define the control logic for the blocks defined in step 2 as an ASM
chart or state table. After you have done that, the design is ready.

When the RTL architecture is complete, it is a good idea to specify the intended RTL
functionality as whitebox assertions. It will make debugging the code written from your
specification a lot easier.

14 sramctrl: Register Allocation


Let's see how the procedure presented in the previous slide could work in practice with the
S-ram-controller example.

After studying the functional requirement specification document carefully, we are ready to start
allocating registers. You can see the specifications in the table. The text does not define any
registers, so we have to figure them out ourselves.

The first two properties in the table refer to the APB protocol. From the protocol specification we
know that the APB interface can in principle be in two phases, the SETUP and the ACCESS
phase. An educated guess is to allocate a state register to represent these phases. Also
remember that you are never designing circuits in a vacuum, so by examining publicly available
APB interface designs you will soon learn that a two-state state machine is a common design
pattern in APB interfaces.

The second property states that our design must support wait states. It means that the design
must be able to count clock cycles, which in turn means that some kind of a counter is needed.
We therefore add a counter register in the design.

The next three properties state that the circuit must store address and data values and hold
them for some time. Registers are obviously needed for this.

The final property states explicitly that the output we_out must be driven from a flip-flop, so one
of those must be added in the architecture.

We now have the registers allocated. If we later find out that something is missing or that
something is not needed, we can return back to this phase. In real life, you will have to do so
many times.

15 sramctrl: RTL Architecture Refinement


In the next phase, we have to figure out how the registers should exactly work. Let's begin with
the data registers.

From the wave diagrams in the specification we find out that when the peripheral device select
signal psel rises, the value of the address bus paddr should appear immediately in the output
addr_out. The value should also be held there until the next bus transaction is detected. A
conventional enabled register cannot be used here as it would introduce a one-cycle delay in
the address signal. We therefore use the delayless register configuration in which the output is
taken from the register's next-state signal.

The same reasoning applies to the other two data registers. All that remains is the wait state
register.

Here we obviously need a binary counter whose counting is enabled by some yet-to-be-defined
control signal.

The number of wait states is defined by the value that is present in the configuration input
wctrl_in. The wait state counter should probably make it known when this preset value has been
reached in counting, so we add an output decoder to detect that.

At this point it looks like we have all the datapath components done. What remains is to design
a controller to drive the dangling control signals.

16. sramctrl: Control


The design of the control unit is again the icing on the cake in this puzzle. In this case the
control unit must also drive the SRAM control signal outputs in addition to the signals that
control the registers created in the previous steps.

By examining the state table of the finite state machine, you can see that the controller stays in
the ACCESS state as long as the wait counter holds the wctr_full signal low, this way extending
the ACCESS phase.

The controller implementation is described in detail in a separate document. There you can find
the ASM chart, from which its operation may be easier to understand than from the state table
presented in this slide.

The RTL design of the memory controller is now complete.

The aim of the principles and examples presented in this lecture has been to give you some
ideas of how you can get started in RTL design. The presented solutions do not aim to provide a
comprehensive answer to the problem, and they may not be applicable to all kinds of designs,
but they will probably make the design process more comprehensible and less intimidating.

17 Whitebox Assertions
We have already discussed the role of whitebox assertions in earlier lectures. Let's end this
lecture by examining how they knit together with the RTL design task that has just been
completed.

The purpose of whitebox assertions is to capture the designer's intent in an executable form. In
a register-transfer level design, this intent covers every signal in the design. For the RTL
architecture designer, writing whitebox assertions is an excellent way to document his or her
design ideas precisely, and in a format that is readable to both humans and machines.

On the left in this slide you can see how the required function of the next-state signal addr_ns of
the address register is described with SystemVerilog property statements. The description of the
RTL signal has been split into two properties.

To highlight the difference with blackbox assertions, a couple of blackbox properties associated
with this RTL function are shown on the right. The blackbox properties define the expected
functionality as seen from the design's inputs and outputs. The whitebox properties define at the
signal-level, how the RTL design should be implemented so that the required input-output
functionality will be realized.
L5 Formal Verification

2 Counter revisited
Let's recall the case of the troublesome counter from the assertion-based verification lecture.
This counter, whose code is shown on the left, should always count up to either state 9 or to
state 13, but because of the poorly chosen comparison operator, it can in some situations count
up to state 15.

We learned earlier that assertions like the one shown in this slide can help us detect these kinds
of bugs in simulation. We also learned that the bug can only be found if the input data applied
during simulation makes the bug show up. Said another way, the functional coverage of the
test stimulus must be good enough for all properties of the design to become activated. The
functional coverage of verification, or even the verification itself, is often overlooked in the case
of nearly-trivial designs like this counter. This way, tiny bugs like this can be left lurking inside
small modules, ready to cause big trouble when the small module is used as part of a complex
design.

In this lecture you will learn how formal verification techniques can be used to tackle this
coverage problem.

3 Formal Property Verification


Formal property verification is a static verification method, which means that it only uses the
RTL code and assertions, but does not need a testbench to create stimulus data. The quality of
a test program will therefore not affect verification results because a test program is not used at
all.

A formal verification program works on one property at a time, and tries to prove that the
property either does not always hold if it is an assert-property, or that it cannot be covered, if it is
a cover-property.

The program proves this by finding a counterexample that shows that the property can fail in
one of these ways. The counterexample is a sequence of input data that the tool invents by
itself.

Sequential logic circuits can basically have two kinds of bugs. Either the next-state value
computed from the current-state and inputs is wrong, or an output value computed from the
current state is wrong. A formal verification tool tries to prove that this either can or cannot
happen under any circumstances.

The procedure for proving a property is outlined on the right-hand side of this slide.

The tool first sets the design, meaning its registers, into an initial state and puts this initial state on
its internal checklist. The initial state is usually the reset state.

In the second step, the tool checks if the property can fail with any legal input data in any of the
states on the checklist. In the beginning, only the initial state is on the list, so only one state has
to be checked.

If the property can be made to fail with any legal input values in any state on the checklist, the
property has been proved false. The sequence of input data that leads to the failing state is
the counterexample for the property. Verification can stop here with the status 'fail'.

If the checklist was exhausted without finding a counterexample, the tool continues the
verification. In the third step, it finds all possible next states for the states that were just
checked. The possible next states are those that can be entered from the current states by
controlling the design's input ports. The states found this way become the new checklist.
However, states that have already been checked are not put on the list anymore.

If this refill operation found states that had not already been checked before, the tool jumps
back to step 2 to verify the property in the states that are on the new checklist. On the other
hand, if the tool did not find any new states to check, it can conclude that no counterexample
exists, and the verification ends with the status 'pass' for the property in case.

The tool can now pick the next property from the collection, and repeat this procedure until all
properties have been proven to either fail or pass.

As you can see, this verification method is based on an exhaustive search for a counterexample
through the design's state-space, which is why it is so powerful.

4. Example: Verification of Property


Let's next work like a formal verification program and try to prove this property of the counter
design shown here. The property states that if the mode is 0 and the counter state is greater
than or equal to 9, the next state must be 0.

We begin by entering the initial state on the checklist. We can easily see that the property is
always true in this state because the antecedent, or the enabling condition, of the implication is
false. The right-hand side is therefore not checked.

We can now generate a list of next states by using all possible input values. We have only one
input, the counter mode input, so we have only two cases. The next state is the same, state 1,
in both cases. In this graphical representation of the search we show the state twice, because
we can get to this next state with different input data. At this point, the search radius is 1,
because we are now one step away from the initial state.

Now we can try to find a counterexample in state 1 on both branches of the search tree. We can
again conclude that one does not exist because the implication antecedent again evaluates to
false.

At this point you probably begin to see how the search for a counterexample will proceed.
Interesting things begin to emerge only when the counter reaches the state 9.

5 Example (Continued)
We continue the exhaustive search and keep expanding the search tree until the whole search
graph has been searched, the property has failed, or the search radius has become too big and
we have run out of memory or time.

The counter steps through states 1, 2, 3 and so on in both modes, and the property is true in all
of these states, so we skip them here.

In state 9 things change. For the input value 0, the next state would be state 0. This state has
already been checked, so the search on this branch stops there.

For input value 1, the next state is state 10.

In this state, a counterexample is found. If the mode input value is now 0, the next state will be
11, which makes the consequent of the implication, and thus the whole property, evaluate to
false. The series of mode input values that led the search into this node of the graph is the
counterexample that proves the property to be false.

6 Presentation of a Counterexample in Questa PropCheck


Let's discuss formal verification tool usage briefly. These tools are easy to use in principle: you
just provide the RTL code and the assertions, and the tool proves the properties and reports the
results.

The proofs can be debugged by examining the counterexamples in a waveform viewer, such as
the one shown here. The waveform shows the property signal values and the control point
values by clock cycles starting from the initial state and ending in the state in which the property
failed.

To help understand the reasons that led to this, the user can also examine waveforms of other
signals of the design when trying to find the root cause of the error.
7 Formal Verification vs. Simulation
This slide presents the main advantages and disadvantages of formal verification compared to
simulation.

The main advantage is in the coverage, as demonstrated by this state space graph that
describes the operation of a sequential circuit. In simulation, the circuit traverses only through
one path of the graph. The path is determined by the input data used in the simulation run. If the
path does not include the state that has the bug, the bug will go unnoticed. Formal verification
by definition visits every state of the graph and inevitably finds the bug.

The other advantage of formal verification is that you don't have to create a testbench for
verification. This means that RTL verification can begin when the assertions or only some of the
assertions have been created. It is also possible to use formal verification on a partially
completed RTL design, which may actually help to improve the RTL design while it is being
created.

The biggest disadvantage of formal verification is that it cannot be used with very large designs,
in a relative sense. If the number of reachable states is too big, verification can end as
inconclusive when computing resources run out.

The second disadvantage is the requirement to define constraints for input data so that the
counterexamples the formal verification tool invents make sense. If for instance a register bank
has 100 registers and its address input therefore has 7 bits, we must tell the formal verification
tool to apply only addresses that are in the range 0 to 99 to that input. Without that constraint
the tool would use any 7-bit values to prove that the design does not work. These kinds of
constraints are not needed in simulation, so creating them requires extra work.

8 Questa Results
This slide shows the most common per-property verification outcomes for assert and cover type
properties in the Questa PropCheck tool. The terminology is similar in other vendors' tools.

The principal outcomes for assert properties are 'proof' and 'firing'. Notice the reminder shown
here that the property itself and the assumptions used should be correct before you can trust
these results.

If the result is a vacuous proof, there is probably something wrong with the property or the
assumptions used to constrain input data, which causes the property to always be true,
regardless of what the design does. The implication property shown on the right has this
problem. Because the antecedent part is always false, the value of y is never checked and the
property is always true.

A vacuous proof can also indicate a bug in the design. For instance, if a state machine has an
unreachable state, an implication property that compares the state variable with this state code
in its antecedent will always be true.

For properties enabled with the cover directive, the good result is 'coverable' and the bad result
'uncoverable'.

The inconclusive result means that the program timed out before it could prove or disprove the
assertion. This can be fixed by allocating more computing resources, or by rewriting the
properties so that they are better suited for formal verification. We will cover this topic shortly.

9 Formal Verification Constraints


This slide presents some of the most common constraints that you can give to a formal
verification program.

In the previous slide we already mentioned input data constraints. These can be conveniently
specified as assume-type properties. You can see a couple of examples here.

These assume-properties state that the peripheral select signal psel of an APB bus can be in
state 1 only when the bus address is inside the range assigned to the specific bus device we
are verifying. To verify a bus interface, you typically need a bunch of such constraints to prevent
the verification tool from creating counterexamples that violate the bus protocol.
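
A sketch of such a constraint is shown below; the address range is invented for illustration and
the actual range comes from the design's memory map:

  module apb_constraints (input logic clk, rst_n, psel,
                          input logic [31:0] paddr);

    // Formal-only input constraint: the tool may assert psel only when the
    // address is inside the range assigned to this peripheral.
    m_psel_addr: assume property (@(posedge clk) disable iff (!rst_n)
                                  psel |-> (paddr inside
                                            {[32'h8000_0000 : 32'h8000_00FF]}));

  endmodule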

One thing that you should be aware of when you create constraints for input data is that the
constraints must not be too strict. This means that you should be very careful in ruling out data
values that you think will not occur in practice even if they can actually still occur in principle.
Over-constraining the data will prevent formal verification from finding bugs that emerge when
the circuit is excited with data that is perfectly legal but which the designer had not taken into
account.

The next two constraint types shown here can be used if the design you want to verify seems to
be too large for formal verification.

Up to now we have assumed that the initial state of formal verification is the reset state. For
large designs, this is not usually practical. It may take too long to reach all states from the initial
state if the circuit has a large number of flip-flops and thus possible states.

To keep the search radius reasonable, an initialization sequence can be defined for verification
runs that are executed to prove properties that require the circuit to be in a state that is far from
the initial state. In practice this means that the design is first simulated with the initialization
sequence as input data to bring it into the wanted state, after which verification is started. How
the initialization sequence is defined in practice is tool dependent.

Another constraining technique for large designs is the use of cut points. This means that you
cut out the logic that drives some internal signals that control the parts of the circuit you are
interested in, and then replace the signals with constant values in verification. This simplifies the
verification process, as the tool does not have to use all possible value-combinations of the
cut-out signals when it searches for counterexamples. You can use this technique for instance in
cases where the property to be verified is applicable only in a functional mode in which the
cut-out signals are always in a constant state. Cut points are also a tool specific feature in
formal verification.

10 Writing Assertions for Formal Verification


You can use most properties created for simulation-based verification also with formal
verification tools. However, in practice, some types of properties will cause performance
problems in formal verification because of its exhaustive-search-based operating principle. On
this slide you can see some examples of such problematic properties.

In the first example, we have a property that states that the signal 'ack' must be true between 1
to 100 clock cycles after the signal 'req' rises. When the simulator notices that 'req' just rose, it
starts a state-machine thread that follows 'ack' through 100 clock cycles. Computationally, this is
an easy task. A formal verification tool, on the other hand, has to do a radius-one-hundred
search, which requires a lot more computational effort.

In the second example, the property requires the signal 'ack' to eventually become true, as
defined by the $ operator that denotes an unbounded range. This is a tricky case for formal
verification. There are also issues with unbounded ranges that are not related to performance
that we will discuss in a following slide.

The third example shows that some properties can implicitly define an unbounded range, even if
the property's code does not use the $ operator.

The conclusion we can draw from these examples is that properties originally developed with
simulation in mind don't always work very well in formal verification.

The right hand side of the slide shows an example of a different kind of problem. Arithmetic
circuits are known to be challenging for Boolean-logic-based formal verification tools, as
demonstrated by this example.

Here we have an integrator-type circuit that consists of an adder, multiplexer and a register.
Let's assume that we have a fancy hand-coded special adder, which is why we want to verify its
logic thoroughly. The input data has 64 bits, and the register has 128. The circuit therefore has 2
to the power 128 states, and 65 control points. Proving the property that describes its function is
therefore not a fast process. Complex signal processing circuits can have hundreds of this kind
of building blocks which is why a general-purpose formal verification tool may not be a good
choice for their verification.
The conclusion we can draw from these examples is that you should avoid defining properties
that are based on matching extremely long sequences, or that depend on very wide input data.
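
The sketch below contrasts these cases with assumed req/ack signals; it is only an illustration
of the kinds of properties discussed, not code from the slides:

  module formal_friendly_example (input logic clk, rst_n, req, ack);

    // Heavy for formal tools: up to a 100-cycle search radius per evaluation.
    a_ack_within_100: assert property (@(posedge clk) disable iff (!rst_n)
                                       $rose(req) |-> ##[1:100] ack);

    // Liveness-style: unbounded range, no finite counterexample exists.
    a_ack_eventually: assert property (@(posedge clk) disable iff (!rst_n)
                                       $rose(req) |-> ##[1:$] ack);

    // A formal-friendlier alternative, assuming the specification really
    // allows a short, fixed response time.
    a_ack_within_4:   assert property (@(posedge clk) disable iff (!rst_n)
                                       $rose(req) |-> ##[1:4] ack);

  endmodule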

11 Example: control_unit's Interrupt Output Properties


This slide shows an example from the course project. These two properties are OK for
simulation but take a relatively long time to prove in formal verification. The properties are used
to check that the control_unit's interrupt generation logic works as expected.

The property f_irq_out_rise_first, shown on the right, requires that the irq_out output should rise
if the play_out output has first risen, a certain number of tick_out pulses have then occurred, a
stop command has not been written into the command register during that time, and the
interrupt acknowledgement command is not written at the same time as the last tick_out pulse.
So the event we want to check is that irq_out rises if this fairly complex sequence has been
seen before it.

We can probably agree that this property has some of the hallmarks of a property that is not
very well suited for formal verification. Its evaluation will last for thousands of clock cycles
because there is a long interval between the tick_out pulses. This is not a problem in simulation,
but a formal verification tool has to spend a fair amount of time on proving this kind of property.
It is justified to ask if this is the only or the best way to check the functionality. It might be
possible to describe the same behavior with a group of simpler properties. However, in this form,
it makes for a good coding exercise!

On the left you can see some tips that can help you write the code for these properties.

If you try to break the description into pieces that can be realized using the property building
blocks, you may find out that the goto repetition operator can be used for counting the tick_out
pulses that occur with some time in between them.

To check that a command-stop code is not written into the command register during the tick_out
sequence, you could use the 'throughout' sequence operator, which requires a Boolean
condition to hold throughout a sequence.

The CMD_STOP and CMD_IRQ values must be decoded from the APB bus signals. Use of
some auxiliary code could be beneficial there so as to simplify the property expressions.

12 Advanced
This slide highlights some advanced topics that are worth studying if you plan to do some
serious work with assertions and formal verification.
The term safety property describes properties whose failure can always be demonstrated with a
finite counterexample. The first code fragment here is an example of this. We can tell whether the
design works by following the signal 'ack' for a couple of clock cycles. We can say that safety
properties are safe to use both in simulation and formal verification.

Liveness properties on the other hand are properties for which no finite counterexample exists.
A liveness property cannot fail in simulation. These kinds of properties typically contain an
unbounded repetition operation or delay, which may leave the assertion evaluation still open
when simulation is stopped. In that case we cannot say for sure, if the expected event would still
have happened if we had simulated the design a little bit longer.

In formal verification, liveness properties can fail. This happens when the formal verification
program manages to prove that the event the property is prepared to wait for indefinitely can in
fact never happen.

Sequence strength is an important concept in SystemVerilog, because it affects how liveness
properties are handled in simulation and formal verification. Sequences can be classified as
strong or weak. Sequences that require a finite-length match are said to be strong. Weak
sequences, on the other hand, do not require a finite match.

And now comes the important part. By default, sequences inside an 'assert' are weak in
simulation, but strong in formal verification. This is why the liveness-type property does not fail
in simulation if the simulation ends before the property sequence has been completely matched.
It is possible to change this behavior by adding the keyword 'strong' in the property declaration,
as shown in the third example. With this change, the property will now fail if the simulation ends
while the assertion thread is still waiting for the 'ack' to happen.

The SystemVerilog standard defines strong versions of sequence operators that you can use
instead of the 'strong' keyword. The names of these operators have the 's_' prefix. Forcing
unfinished liveness-type properties to fail at the end of simulation can sometimes be beneficial.
It could, for instance, reveal that the simulation ended before a test case was completed.
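
The slide code fragments are not reproduced in these notes, but generic properties along the
following lines illustrate the three cases; 'req' and 'ack' are placeholder signals.

    // 1. Safety property: a finite counterexample exists
    //    (req seen, but no ack within three cycles).
    ap_safety:   assert property (@(posedge clk) req |-> ##[1:3] ack);

    // 2. Liveness property: the delay range is unbounded. Weak by default in
    //    simulation, so it cannot fail there even if ack never arrives before
    //    the simulation ends.
    ap_liveness: assert property (@(posedge clk) req |-> ##[1:$] ack);

    // 3. The same property made strong: an evaluation still waiting for ack
    //    now fails at the end of simulation. s_eventually is an equivalent
    //    strong operator form.
    ap_strong:   assert property (@(posedge clk) req |-> strong(##[1:$] ack));
    ap_s_event:  assert property (@(posedge clk) req |-> s_eventually ack);
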
L6

2 Coverage

We have already touched on the topic of coverage in previous lectures. We have, for instance, seen
that even if a design works in simulation with some initially good-looking test data that has been
designed to directly activate the design's basic functions, it can fail when we apply more diverse
input data. This raises the question: how can we tell that we have exercised the design long
enough with data that is diverse enough? To answer that, we need a measure for it. This
measure is called 'coverage'.

On a general level, we can say that coverage is a measure of how well a design has been
verified. A design project usually has a predefined goal for coverage, and verification is
continued until the goal has been reached.

The method for measuring coverage, and the coverage goal, are defined in the design's
verification plan. The plan must first define all functions that have to be verified. After that, the
coverage measurement method must be specified for each function.

Two types of coverage measurement methods are commonly used in digital design.

The simplest form is known as code coverage, which is sometimes also called implicit coverage.
This type of coverage is measured automatically by the simulator.

A more advanced form of coverage is known as functional coverage. This coverage
measurement method is designed and implemented as code in the design's testbench by the
verification engineers. Functional coverage is sometimes called explicit coverage.

3 Code Coverage
Code coverage analysis has been used for decades in all design work that is done by writing code.

In RTL verification, code coverage analysis is an automatic feature in all simulators. When the
code is executed in the simulator, the simulator counts how many times statements, branches,
expressions and variables in expressions are activated. When the simulation has finished, the
simulator reports the coverage statistics, from which the user can draw some conclusions about
the quality of the test program and even the design itself.

Code coverage analysis is an invaluable tool for verification engineers, but from the coverage
point of view it is only the necessary first step. 100% statement coverage is required for all
designs, but it is never a sufficient coverage result on its own.
The image here shows the code coverage analysis view of a simulator. The code window shows
the covered statements with checkmarks. You can see that all but one statement were covered,
but as indicated in the slide, we cannot conclude that, just because a statement was executed, all
functions of the design affected by that statement were completely verified.

4 Code Coverage Types


This slide presents the code coverage analysis types that are typically available in simulators.

Statement coverage is often called line coverage, even though these terms refer to different
things. One line can have many statements, and statement coverage tracks each statement
separately. A statement is marked as covered if it was executed in simulation even once.

Branch analysis checks whether the condition expression of an 'if' statement assumes both the
false and true values in simulation, while condition expression analysis checks whether each
variable in a condition expression has been the controlling variable whose value has determined
the outcome.

Expression coverage is similar to condition coverage, but it tracks the right-hand-side expressions of assignment statements.

Toggle coverage counts how many times the bits of a design's signals have toggled.

And finally, finite state machine coverage tracks state changes and traversed paths in state
machine models.

All in all, a fair amount of coverage information can be obtained automatically by just simulating
the code.

5 Code Coverage Example

The example shown here aims to help you understand what kind of insight code coverage
analysis results can give.

The example concentrates on the two lines of code shown here. We assume that the simulator
reported the statement, branch and condition coverage results you can see on this slide.

The statement coverage report shows that the assignment statement on line 227 was executed.

The branch coverage report adds some insight by saying that the if statement on line 226 that
controls the assignment was completely covered. This means that at some points during the
simulation the condition expression evaluated to true and at others to false, implying that the
assignment statement on the next line was not executed every time.
The condition coverage report, shown last, reveals that the parts of the condition expression were not
completely covered. The detail view shows that when the first two subexpressions of the 'and'
function were both true, the third subexpression, int_addr < 254, was never false. This could be
because of a bug in the RTL code, or because the input data was not able to activate that function,
which in turn would be a coverage problem.
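
The actual RTL lines 226 and 227 are not reproduced in these notes. A hypothetical
reconstruction of that kind of code, useful only for making the discussion concrete, could look
like this; apart from the subexpression int_addr < 254, all names are invented.

    // Assumed to sit inside a combinational always block of the design.
    always_comb begin
      next_addr = int_addr;
      if (mem_en && write_en && (int_addr < 254))   // "line 226": branch + 3 conditions
        next_addr = int_addr + 8'd1;                // "line 227": covered statement
    end

Condition coverage then asks whether each of mem_en, write_en and int_addr < 254 has, on its
own, determined the outcome of the 'if' during the simulation.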

6 Functional Coverage
Functional coverage is a measure of how much of the functionality of a design has been
exercised in verification. This kind of coverage cannot be measured with automatic tools.
Functional properties are always design-specific, and each property requires an appropriate
coverage measurement solution.

We therefore have to first list the functional properties to be verified and establish a coverage
goal for each, and then develop a method for measuring the coverage of each property. After
simulating the design with the coverage collection method included, the achieved coverage can
be computed by comparing the coverage results collected in simulation with the coverage goal.

The SystemVerilog language provides two methods for collecting functional coverage and
creating coverage models.

The simplest of these is the cover directive, which we have already discussed in previous
lectures.

In this lecture we concentrate on the covergroup construct, which can be used to create very
sophisticated coverage collection and analysis models.

7 Functional Coverage Example


Before we go on to describe how the covergroup construct can be used to define coverage
models, let's look at some of the key concepts with the example shown here.

Assume that we are testing a 3D printer prototype, and we want to be sure that we have tested
it properly before releasing it to the market.

The printer is very simple and has only two user-settings. You can select the color of the object
to be printed from red, green or blue. With the other setting, you can select the type of object to
be either cylinder or cube. The testing method is also simple: We just select the inputs, and then
check what comes out.

If we think of this testing as a verification process, then we are interested in
checking that what we selected was actually printed.
For coverage analysis, we are interested in the input data: Did we select the input data so that
we activated all required functions? What is required is of course design-specific. We might
want to verify that the printer can print in all colors and forms, and that the color and form
settings are combined correctly, and that the color or form can be changed without problems.

The first case, whether we can print in all colors and forms, is an example of point coverage. To measure
it, we can create a bin for each possible value of each input variable, and increment the value of
a bin whenever the respective value of the respective variable is used as input data. After the
short test shown in the drawing, the bin values would be as shown inside the bins. Assuming
that 100% coverage requires at least one hit in each bin, the test would have reached a full
100% point coverage.

Point coverage does not tell us whether it is possible to print both object types in all colors. For this we
have to know the cross coverage. It can be measured by creating a bin for each combination of
input variable values, and incrementing a bin when the respective combination is detected in the
input data. As you can see from the bins, the cross coverage of the test was only 67%.

The third case, whether we can change the form and color without problems, is an example of transition
coverage. To measure it, we create a bin for every possible value pair of each variable whose
coverage we want to measure. In the slide you can see the bins for color transitions. Now the
coverage is only 33%.

It is of course possible to come up with other types of coverage models, but point, cross and
transition coverage are the most common ones used in hardware design, and provide enough
flexibility for most situations.

8 covergroup
We can now introduce the SystemVerilog covergroup construct.

A covergroup can encapsulate the entire coverage model. It is actually a user-defined data type,
based on which multiple covergroup instances can be created, which allows the same
covergroup to be used in many places in a design.

On the right you can see a simple covergroup whose type name is chosen to be g1.

The part highlighted in red specifies the sampling event for the covergroup. The sampling event
defines the times when the coverage measurement is done. In this case it is specified to happen
on every clock edge, but more complex expressions are also allowed here. It is also possible to
specify a custom sampling function which allows the sampling to be triggered from nearly
anywhere in the design hierarchy.
The coverpoint keyword is used to specify the variable or expression, on which we want to
collect coverage. In this example it is a variable called color. The coverpoint can be given a
name, such as the name "c" in this example. Names are needed if you want to measure the
cross coverage of two or more coverpoints.

In this case, one bin is created automatically for every value of the enum datatype of the
coverpoint variable color, and one of these bins is incremented on every clock edge based on
the value of the variable. This default bin behavior can be modified in many ways, as you will
learn shortly. You can also change the coverage computation rule and the goal and many other
features. These advanced features are not covered in this lecture.

As said, the covergroup declaration just creates a datatype, which by itself doesn't do anything.
To use this type, you have to create a covergroup instance. This is done by declaring a variable
of the covergroup's type and initializing this variable with the 'new' operator. The new operator
works the same way as in other object-oriented languages, such as C++.
When you have created the covergroup instance, it begins to collect coverage right away.
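
As a minimal sketch, a covergroup along the lines of the g1 on the slide could look like this; the
enum type, the clock name and the instance name are assumptions.

    typedef enum {RED, GREEN, BLUE} color_t;
    color_t color;

    covergroup g1 @(posedge clk);        // sampling event: every rising clock edge
      c: coverpoint color;               // automatic bins: one per enum value
    endgroup

    g1 g1_inst = new();                  // the instance starts collecting right away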

9 Coverage Collection in Simulation


Here you can see how covergroup results are represented in a simulator. The covergroup in
question is the same as on the previous slide.

The simulator groups the results first based on every covergroup type, and then by every
instance. The bins show that there were no hits in the bin for the value 'blue' which is why the
total coverage is 66.6%. Notice that the default goal of the bins is 1, so only one hit is needed
for 100% bin coverage.

The default rule for total coverage computation is shown in the bottom right corner.

10 Covergroup Examples: Bin Creation


If you have a specific coverage model in mind, you'll want to define the covergroup bins yourself
instead of relying on the automatic bins that the simulator creates. This slide shows some basic
techniques for bin creation.

Here we assume that we are collecting coverage from the simple design presented as a block
diagram on the top-right. It shows a register creg_r that is loaded on the rising edge of the clock
when the input creg_en is in state 1. As said in the slide, the data stored in the register is
expected to be one-hot encoded, meaning that only one of its bits can be in state 1 at a time.

In the first covergroup cg_cmd1 the coverpoint is the register variable creg_r. The list inside the
braces defines the coverage bins. The first bin is called commands, and it is defined to cover all
the legal one-hot codes. The second bin is created using the keyword default. This bin is
incremented every time the coverpoint has a value that does not fall in the range defined for any
other bin. Notice that the sampling expression in this case contains also the enable signal.

The simulation results for this covergroup are shown on the right. The commands bin was hit
four times and the default bin 15 times. From this we can judge that four samples contained legal
codes and 15 contained illegal ones. We cannot, however, tell whether all legal codes, or how many
different legal codes, were seen, as any legal code would have incremented the same bin.

The second covergroup cg_cmd_2 defines the commands-bin a little bit differently. The name of
the commands-bin is now followed by brackets, which means that a separate bin must be
created for every value in the list of values assigned for this bins-definition. The simulation
results now give a much better view of the coverage. We can see that only one legal code was
seen in the simulation.
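
A sketch of the two covergroups, with an assumed 8-bit register and invented one-hot codes,
could look as follows; clk, creg_en and creg_r stand for the DUT signals.

    logic       clk, creg_en;
    logic [7:0] creg_r;                  // expected to hold one-hot command codes

    // Version 1: one shared bin for all legal one-hot codes, plus a default bin.
    covergroup cg_cmd1 @(posedge clk iff creg_en);
      coverpoint creg_r {
        bins commands = {8'h01, 8'h02, 8'h04, 8'h08,
                         8'h10, 8'h20, 8'h40, 8'h80};
        bins others   = default;         // hit by every non-one-hot value
      }
    endgroup

    // Version 2: the brackets create a separate bin for every listed value,
    // so the report shows which legal codes were actually seen.
    covergroup cg_cmd_2 @(posedge clk iff creg_en);
      coverpoint creg_r {
        bins commands[] = {8'h01, 8'h02, 8'h04, 8'h08,
                           8'h10, 8'h20, 8'h40, 8'h80};
        bins others     = default;
      }
    endgroup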

11 Covergroup Examples: Bin Creation and Cross Coverage


We continue with yet another example of bin creation. This example also shows how you can
set up cross-coverage collection.

For this example, assume we have a 32-bit bus system with four devices, as shown in the block
diagram. The address ranges of the devices are shown beside the devices in the block diagram.
Each device has a register bank that consists of 32-bit registers, implying that register
addresses increase in steps of 4. We want to create a covergroup that tells us which registers
inside the devices were written and read during simulation.

The covergroup declaration is shown on the left. Its sampling event expression is a logic
function of the bus control signals that indicates a bus access.

The first coverpoint, called devices, is used to track accesses to the address ranges of the bus
devices. The coverpoint expression selects bits 31 down to 2 of the bus address variable,
PADDR. This way the coverpoint only tracks the addresses of the 32-bit registers, but skips the
byte addresses between 4-byte boundaries.

The bins expression defines a bin for the address range of each device. In the bin value-range
definitions, two least significant bits of the address are again ignored.

The second coverpoint, modes, collects coverage on bus access modes, that is, writes and
reads. It has two bins, one for write mode and one for read mode.

The last item in the covergroup defines a cross-coverage point that collects cross coverage of
device accesses and bus access modes.

The simulation results now give a comprehensive view of what happened in simulation. All
devices were accessed, and both write and read-modes were used, but as the cross-coverage
results show, three devices were accessed only in one mode. This tells us that the test program
used in simulation should be improved so that these cases would also be covered.
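
A sketch of such a covergroup is shown below. The device address ranges are invented here,
since the real ones are only given in the block diagram, and the APB signal names (PCLK, PSEL,
PENABLE, PWRITE, PADDR) are assumptions.

    covergroup cg_bus @(posedge PCLK iff (PSEL && PENABLE));
      devices: coverpoint PADDR[31:2] {           // byte address bits 1:0 ignored
        bins dev0 = {['h00 : 'h3F]};              // byte addresses 0x000 - 0x0FF
        bins dev1 = {['h40 : 'h7F]};              // byte addresses 0x100 - 0x1FF
        bins dev2 = {['h80 : 'hBF]};              // byte addresses 0x200 - 0x2FF
        bins dev3 = {['hC0 : 'hFF]};              // byte addresses 0x300 - 0x3FF
      }
      modes: coverpoint PWRITE {
        bins write = {1'b1};
        bins read  = {1'b0};
      }
      dev_x_mode: cross devices, modes;           // which device, in which mode
    endgroup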

12 Transition Coverage
This slide shows a covergroup that contains examples on transition coverage models.

The first two coverpoints are simple coverpoints similar to the group that was used earlier in the
first covergroup example.

The coverpoint 'color tests' shows examples of how to define bins that collect coverage on
transitions and sequences of transitions.

The bin definition 'all' creates a bin for every possible transition between the three color values,
9 bins in total.

The next bin definition creates a bin that tracks the color setting sequence that was used in the 3D printer
example presented earlier in this lecture.

The bin 'reds' shows how the repetition operation can be used to define a value sequence in a
bins definition.

The simulation results show the coverage obtained with the test data used in the 3D printer
example.
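
A sketch of the transition bins could look like this; the enum values and the bin named 'seq'
(standing in for the sequence bin on the slide) are assumptions.

    typedef enum {RED, GREEN, BLUE} color_t;
    color_t color;

    covergroup cg_trans @(posedge clk);
      colors: coverpoint color {
        // One bin per single-step transition between the three values: 9 bins.
        bins all[] = ([RED, GREEN, BLUE] => [RED, GREEN, BLUE]);
        // A specific setting sequence as one bin.
        bins seq   = (RED => GREEN => BLUE);
        // Repetition inside a transition: two consecutive REDs followed by GREEN.
        bins reds  = (RED [*2] => GREEN);
      }
    endgroup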

13 Covergroup Creation Summary


Here you can see a summary of the steps for creating covergroup definitions.

You begin by defining the sampling condition. You usually have to write an expression that
contains a reference to a clock edge, and a conditional expression. You can also define a
custom sampling function.

The next step is to select the variable or expression that you want to collect coverage on.

It is usually safer to specify coverpoint bins explicitly than to let the simulator create bins
automatically. Bin creation is probably the most difficult part in writing covergroups.
Understanding the purpose of the brackets operator is important here.

We have not shown examples of covergroup options usage in this lecture. You can use options
to change coverpoint goals and weights, among other things.
The final step is to create the covergroup instances. They must be created in a context where
the covered variable is visible, for instance, in a verification module bound to an RTL module.

14 Constrained Random Stimulus Generation


You now have a good basic understanding of coverage model creation and analysis. This slide
explains why coverage models are so important and often indispensable in verification.

The tests executed first in functional verification are usually directed tests. These are carefully
planned tests for which the test data is chosen so that specific functions of the design will be
activated in simulation. The total coverage increases more or less linearly with the time spent
designing and coding directed tests and running simulations. The verification team usually has a
good idea of the expected coverage of such tests.

The problem with using directed tests is that they require a lot of work to create, as each test
has to be designed and coded on a case-by-case basis. This is the reason why directed tests
are often complemented with tests that use random data.

In random testing, the coverage curve looks different. The coverage initially increases slowly
with the verification effort, because creating a randomized test environment requires some extra
work compared to directed testing. However, after some time the speed picks up, as there is no
need to design the new tests anymore. You can just simulate longer and let the test environment
create the test data. In the end, the coverage goal can be reached faster with the help of
random testing. A robust functional coverage measurement setup is now essential, because it is
not possible to predict the coverage of each test run.

Plain random data is not very useful in functional verification, as no design works with random
inputs. The data generated with a random number generator must be constrained so that it is
valid for the design. This constraining could of course be done by just writing code that does
that, but SystemVerilog has a better method for it. You can use the SystemVerilog class
construct to model both the data and its randomization constraints, as shown in the code
example here.

The class keyword has the same meaning in SystemVerilog as in other object-oriented
languages. A class data structure can contain member variables, such as the 'addr' and 'data'
variables in this example. These variables can be made randomizable by adding the 'rand'
keyword in front of the variable declaration.

When an object has been created from this class, its random variables can be assigned new
random values by just calling the built-in member function 'randomize', as shown in the test
program code on the right.
The randomization function can be controlled by defining constraints for the random variables
with the 'constraint' keyword. In this example, the constraint named 'AHB addr' limits the value
of the variable 'addr' to be less than the specified hexadecimal value, and forces its two
least-significant bits to always be zero. This way the randomization will always generate valid
addresses.
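
A minimal sketch of this idea is shown below; the class name, the address bound and the loop
are placeholders, not the exact slide code.

    class bus_transaction;
      rand logic [31:0] addr;
      rand logic [31:0] data;

      // In the spirit of the 'AHB addr' constraint described above:
      // keep the address inside a legal range and word-aligned.
      constraint addr_c {
        addr < 32'h0000_1000;
        addr[1:0] == 2'b00;
      }
    endclass

    // Usage in a test program:
    initial begin
      bus_transaction tr = new();
      repeat (10) begin
        if (!tr.randomize()) $error("randomization failed");
        // drive tr.addr and tr.data onto the bus here
      end
    end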

15 Coverage Tracking with an XML Testplan

This slide shows how coverage tracking works in practice, in this case in the Questa verification
tools.

First, turn your attention to the bottom-left corner of the slide. The flow graph there illustrates the
principle of coverage collection.

Every time you run a tool, such as the simulator, the tool saves the coverage data it collected in
a database at the end of each run. This data contains information about passing and failing
assertions, covered and uncovered cover directives, covergroup results, and the results of code
coverage analysis. New data is added into the database every time you execute a new test
case.

A verification tracking tool can be used to display the current coverage results stored in the
database in a graphical format, as shown here. This helps you identify the things your
verification efforts should concentrate on next so that the coverage goal can be reached.

Viewing raw coverage data, even in graphical form, is not practical because of the huge volume
of the data. An IP block, let alone a complete system-on-a-chip design, can have thousands of
coverage items to be tracked, whereas what you are really interested in is an overall understanding
of the different classes of coverage.

In the Questa tools, this problem is solved by having the user create a spreadsheet that lists the
coverage items that should be reported under one coverage goal in the tracker tool. In principle
this means that you must create a list of the names, that is, the hierarchical paths, of all items
for which you want to see the joint results. This highlights the need for a consistent naming
practice for all design items, such as assert statements or covergroup instances, so that you
can use wildcards to select all items in the same category at once, without having to list the
names of all of them.

In the example shown here, the names of all blackbox assertions have the prefix 'af_', allowing
them to be included in the tracking report with only one hierarchical path reference that contains
a wildcard character. The naming of other types of coverage items is based on the same
principle.
16 Functional Qualification
We have now discussed verification topics in many lectures, and covered many verification
methods and tools. At this point, we might be ready to conclude that if all of our tests
pass in verification and we reach 100% functional coverage, then our design is flawless in
the sense that all the required properties have been implemented correctly.

That, however, is a hasty conclusion.

We can only trust the verification results if we also believe that the verification environment itself
does not have any bugs, and that the functional coverage model we use to track the verification
really covers the required functionality comprehensively. The verification methods we have
covered so far do not give any certainty about these things.

Consider this simple example.

Assume that we have a design that can be in two modes, in which one of two different
computations is performed. Depending on the mode setting, the result of the selected
computation is assigned to the output. Assume further that most of the time, the results of these
computations are the same.

Now consider the verification of this simple if statement shown on the left.

This seems to be an easy function to cover. You just simulate the design first with mode set to
zero, and then with mode set to some other value. However, since the result can only be
observed from the output, the design must be simulated long enough or with test data that is
diverse enough so that the results of the two computations will be different at some point.
Otherwise you cannot tell if the mode selection function, that is, the if statement, has been
implemented correctly. If the coverage model is incomplete at this spot, the simulation will not
detect a bug in the if statement.

So it looks like we still have a gap in our defense through which a bug might sneak in.

Functional qualification is a technique that can be used to close that gap.

Functional qualification is used to check whether the verification environment would be able to
detect a bug if the design had one. So instead of the design itself, it focuses on checking the quality of
the verification environment, usually the simulation testbench.

The operating principle of functional qualification is simple. A functional qualification tool is first
used to inject artificial bugs in the RTL design, and the design is then simulated to find out if the
test program can detect these bugs.
This method can be used to improve the testbench by following the simple procedure shown on
the right. The functional qualification tool is first used to inject one bug in the design. The design
is then simulated using its full test suite. If the bug is detected, simulation can immediately be
stopped. However, if the full test suite failed to detect the bug, the verification engineer must find
out why that happened, and improve the test program. After that, the process is repeated with
another bug.

A functional qualification program can create the artificial bugs automatically. Simple bugs, such
as the ones shown as examples here, are usually sufficient to reveal most weaknesses in the
verification environment. In the first example, the condition expression of the if statement has
been logically complemented. This would immediately reveal the coverage problem discussed
in the first example: even if the results of computation 1 and computation 2 were again the
same, we would know that the simulation should now fail. In the second example, the output is
set to a fixed value. This is a so-called stuck-at fault.
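
As a sketch with invented names, the original code and the two mutations could look like this.

    // Original RTL:
    always_comb begin
      if (mode == 2'b00)
        dout = result1;     // computation 1
      else
        dout = result2;     // computation 2
    end

    // Mutation 1: the condition is logically complemented.
    //   if (!(mode == 2'b00)) dout = result1; else dout = result2;

    // Mutation 2: the output is forced to a constant value (a stuck-at style fault).
    //   dout = '0;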

Because functional qualification requires simulating the design with every injected bug, it is
computationally very intensive. That's why you should begin to apply it early in the design
project, in module-level verification, when the design size is still reasonable. This helps to improve
the quality of module-level designs, and of module-level verification assets such as assertions,
before they are used in sub-system or SoC-level verification.

In some industries, such as the automotive industry, use of functional qualification methods is
mandated by safety standards, which is why electronics engineers should know them.
L7

2 VHDL vs SystemVerilog

This lecture presents you with a short introduction on the use of the VHDL language for RTL
modeling. So far we have used only the SystemVerilog language in our digital design courses,
but since VHDL is still widely used in RTL design in many places, it's useful to learn the basics
of RTL coding also in VHDL. This is not a difficult task. Regardless of the language, in RTL
coding you always have to know only a small subset of all available language features.

Let's begin by making a general comparison between VHDL and Verilog, which is the other
language used for RTL design. As SystemVerilog is the latest version of the Verilog standard,
we use the name SystemVerilog in this presentation.

The geographical footprint of the languages in the northern hemisphere differs so that VHDL is
nowadays the preferred language in Europe, and SystemVerilog in North America and Asia.
Here in Finland, VHDL is the most commonly used RTL design language.

As computer languages, VHDL and SystemVerilog have big differences. VHDL, which was
developed for the United States Department of Defense, has its roots in the Ada
language. Verilog was originally a much simpler and therefore easier-to-adopt language, whose
syntax, operators and system functions resemble those of the C language and its standard
library functions. If you have programmed in C, you'll probably get up to speed a little bit faster
with SystemVerilog than with VHDL.

Another difference is that VHDL is a strongly-typed language, while SystemVerilog isn't. One
example of strong typing is that in an assignment statement, the left-hand-side and the
right-hand-side must always have the same type, otherwise the code will not compile.
SystemVerilog has built-in rules for handling different types in situations like this. You can assign
a floating point value to an integer variable without problems, provided that you know and
understand the built-in type handling rules. With VHDL, you always have to define type
conversions explicitly, which requires you to know the syntax of how to do that.

Even though VHDL is widely used as a design language, all verification code even for VHDL
designs is nowadays written almost exclusively in SystemVerilog. SystemVerilog, which includes
Verilog, is therefore the number one language for digital design and verification engineers, as
you can conclude from the survey results shown here.

3 Logic Data Types for RTL Design


In RTL design, we mostly need logic data types that represent bits and bit vectors.
VHDL has built-in bit data types, but they have some shortcomings, which is why they are almost
never used. Instead of them, designers use data types defined in the IEEE standard 1164
package. To make these data type declarations visible in a design unit, its code must be
preceded by an IEEE library declaration followed by a use-clause for the std_logic_1164
package.

The bit data type in the standard package is called std_logic.

A bit vector can be declared by using the std_logic_vector data type whose name is followed by
a bit range definition. Notice the format of the bit range definition: the 'downto' keyword must be
used as a separator if the index of the left-most bit is greater than the index of the right-most. In
the opposite case, the keyword 'to' should be used.

The examples after the type declarations show how you can assign bit and bit-vector literal
values to signals. Notice the interesting syntax that is used to assign the same value to all bits in
a vector using the others keyword.

On the right-hand side of the slide, you can see how you can declare types for bit-vectors that
should be treated as unsigned or signed binary numbers. Again, you have to use a package, the
numeric_std package in this case. This package requires the std_logic_1164 package to be also
visible inside the design unit.

The example shows three unsigned bit vectors, and an arithmetic addition operation on signals
of this data type. Notice that this code would not have worked with the std_logic_vector data
type, as arithmetic operations have not been defined at all for standard-logic-vectors in the
std_logic_1164 package.
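
A small sketch that pulls these pieces together is shown below; the entity name and ports are
invented for illustration.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity types_demo is
      port ( a_in, b_in : in  unsigned(7 downto 0);
             flag_out   : out std_logic;
             data_out   : out std_logic_vector(7 downto 0);
             sum_out    : out unsigned(7 downto 0) );
    end entity;

    architecture rtl of types_demo is
    begin
      flag_out <= '1';
      data_out <= (others => '0');   -- every bit of the vector set to '0'
      sum_out  <= a_in + b_in;       -- "+" for unsigned comes from numeric_std
    end architecture;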

4 Working with Bit Vector and Numeric Data Types


Here you can see some examples of the usage of bit-vectors and numeric data types.

The signal declarations at the top of the slide define two 8-bit unsigned numbers, one 8-bit
bit-vector, and an 8-bit signed number.

Next in the slide you can see three assignments with arithmetic expressions that do not work. In
each case, not all operands are of the same type, which is why the VHDL compiler will issue an
error message.

The following four examples show code that works.


In the first two examples, the type cast operation is used on the operands to force them to be
treated as signed bit vectors. This is possible because the base type of the unsigned and signed
data types is std_logic_vector.
In the third example, the conversion function to_unsigned is used to convert the integer literal 5
to an 8-bit unsigned bit-vector. The conversion function is defined in the numeric_std package.
This is a thing you just have to know.

In the last example, the integer literal 5 is used without problems with unsigned bit-vectors. The
reason why this works now is that the "+" operator has been overloaded to support the addition
of unsigned and integer type values. The overloading is implemented in the numeric-std
package as a function subprogram that is called when the arguments of the "+" operator match
the function's arguments. This is again a thing you just have to know.
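
The following sketch, with invented signal names, shows the same working cases, plus one
failing case left as a comment.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity conv_demo is
      port ( u1, u2 : in  unsigned(7 downto 0);
             v1     : in  std_logic_vector(7 downto 0);
             s1     : in  signed(7 downto 0);
             y1, y2 : out signed(7 downto 0);
             y3, y4 : out unsigned(7 downto 0) );
    end entity;

    architecture rtl of conv_demo is
    begin
      -- y1 <= u1 + s1;              -- error: operands of different types
      y1 <= signed(u1) + s1;         -- type cast between closely related types
      y2 <= signed(v1) + s1;         -- std_logic_vector cast to signed
      y3 <= u1 + to_unsigned(5, 8);  -- integer converted with a numeric_std function
      y4 <= u2 + 5;                  -- works: "+"(unsigned, natural) is overloaded
    end architecture;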

5 Contents of a Design File


This slide shows how you should organize a design file. It is a common practice to define every
RTL design unit in its own file and name that file according to the name of the design unit.

As mentioned in the previous slide, design files usually have the standard library and package
clauses at top, as shown here.

In VHDL, designs are declared with the keyword 'entity'. An entity declaration defines the
design's name, and it contains a port list that specifies the inputs and outputs of the design, and
their data types. Generic parameters of the design can also be declared here. That is all you
typically have to include in the entity declaration.

The contents of the design are specified inside an architecture declaration, which is associated
with an entity by the name of the entity. The name of the architecture can be chosen freely, and
one entity can have many differently named architectures. In RTL design, it is customary to use
the name RTL for architectures that contain the RTL code of the design.

In the declarative part of the architecture, before the begin keyword, you can declare the signals
that represent combinational block and register outputs. Each signal must have a name and a
type.

The function of the design is specified between the begin and end keywords using concurrent
statements, such as assignment and process statements. In this example we have a conditional
signal assignment statement that describes a multiplexer.
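
A complete, minimal design file following this organization might look like the sketch below;
the entity name and ports are invented.

    library ieee;
    use ieee.std_logic_1164.all;

    entity mux2 is
      generic ( WIDTH : integer := 8 );
      port ( sel_in     : in  std_logic;
             a_in, b_in : in  std_logic_vector(WIDTH-1 downto 0);
             y_out      : out std_logic_vector(WIDTH-1 downto 0) );
    end entity mux2;

    architecture rtl of mux2 is
      signal y : std_logic_vector(WIDTH-1 downto 0);   -- declarative part
    begin
      -- conditional signal assignment describing a multiplexer
      y     <= a_in when sel_in = '1' else b_in;
      y_out <= y;
    end architecture rtl;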

6 Procedural Blocks: Combinational Logic Process


The main building block of functional models is the process statement, which you have to use if
the design cannot be modeled by using only simple concurrent assignment statements.
A process contains code that is executed sequentially, statement by statement, but processes
themselves are concurrent with other processes. When triggered, the processes compute their
outputs in zero time by executing the sequential statements.
Here you can see a process statement that models the function of a combinational
multiplier-adder circuit. The name of the process is mac-arithm. Notice that the name must be
used both at the beginning and end of the process.

After the keyword 'process' a sensitivity list can be defined. It contains the names of the signals
whose value-changes will trigger the execution of the process. For a combinational block, all of
its inputs must be included in the sensitivity list. In this example, we assume that a_in and b_in
are input ports of the entity, and acc_r is a signal declared in the architecture.

The declarative part of the process declares three variables. Variables are used for local storage
inside processes, and they are not visible outside the process they are declared in.

The statements between the begin and end keywords are executed sequentially, but in zero
time inside the current simulation timestep. Notice the type of the assignment operator that must
be used to assign values to process variables. This assignment takes effect immediately.

On the last line of the process, the signal 'sum' is assigned using the signal assignment
operator. This kind of assignment takes effect when the process suspends.

Notice that even though the syntax differs between VHDL and SystemVerilog, the same coding
rules that must be followed for the code to synthesize correctly apply in both cases.
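
A sketch of such a process, assuming 8-bit unsigned inputs a_in and b_in and 16-bit signals
acc_r and sum declared in the enclosing architecture, could look like this; the register that
drives acc_r is sketched in the next section.

    mac_arith : process (a_in, b_in, acc_r)   -- all inputs in the sensitivity list
      variable product : unsigned(15 downto 0);
      variable addend  : unsigned(15 downto 0);
      variable total   : unsigned(15 downto 0);
    begin
      product := a_in * b_in;     -- ":=" variable assignment, takes effect immediately
      addend  := acc_r;
      total   := product + addend;
      sum     <= total;           -- "<=" signal assignment, takes effect when the
    end process mac_arith;        --      process suspends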

7 Procedural Blocks: Sequential Logic Process


Here you can see an architecture that contains a process that describes a sequential logic
circuit.

In this case, the sensitivity list must contain the clock signal, and since the design has an
asynchronous reset, also the reset signal. The body must contain an if-else statement, in which
the if-part defines the reset behavior, and the else-part the synchronous behavior, that is, the
computation of the next state and output values of the sequential circuit. The basic structure of
the code and its functional principle is therefore again the same as in SystemVerilog coding.

In VHDL code, you can define clock edge detection in many ways. The recommended way is to
use the rising_edge function defined in the std_logic_1164 package, as it makes the
intended function of the code very clear.
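
Continuing the multiplier-adder sketch from the previous section, the register process that
drives acc_r could be written as follows; clk and rst_n are assumed port names.

    acc_reg : process (clk, rst_n)
    begin
      if rst_n = '0' then                -- asynchronous reset branch
        acc_r <= (others => '0');
      elsif rising_edge(clk) then        -- synchronous behavior
        acc_r <= sum;                    -- next value of the accumulator register
      end if;
    end process acc_reg;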

8 Procedural Control Statements


The code inside processes consists of assignment statements that assign the values of
right-hand-side expressions to variables and signals. The execution of these statements can be controlled
by procedural control statements, such as if-statements and loops. You are probably familiar
with the usage of such controls, so we shall just highlight some of the syntax features here.
The first two examples show an if-else construct, and a select construct, which is defined with
the keyword 'case' in VHDL. Pay close attention to the syntax which is a little different than in
SystemVerilog. You don't have to use a begin-end block or parentheses to include many
statements in a conditional branch. Indentation is not required and it has no effect on how the
code behaves.

The third example shows a VHDL for-loop, which works differently than in SystemVerilog. You
don't have to declare the loop variable, and you cannot define an expression that computes its
next value at the end of each loop iteration. The variable is always either incremented or
decremented by one depending on how you define the iteration bounds.

The second loop type in VHDL is the while-loop. It works the same way as while-loops do in
other languages.
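
The fragments below, assumed to sit inside a process with the referenced variables already
declared, sketch the syntax of these constructs.

    -- if-else: no begin-end block is needed around multiple statements
    if sel = '1' then
      y   := a;
      cnt := cnt + 1;
    else
      y := b;
    end if;

    -- case statement
    case state is
      when IDLE   => y := '0';
      when ACTIVE => y := a;
      when others => y := b;
    end case;

    -- for loop: the loop variable i is not declared separately and always
    -- steps by one through the given range
    for i in 0 to 7 loop
      parity := parity xor data(i);
    end loop;

    -- while loop
    while count < 8 loop
      count := count + 1;
    end loop;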

9 Operators
This slide presents a summary of the operators you can use in expressions. All the usual
operators are available, but for most of them, the symbol or keyword differs from, for
instance, SystemVerilog or C. There are also hardware-flavored logical operators, such
as nand and nor, that you won't find in other languages.

VHDL has the usual shifting operators, but these are seldom used for reasons that require a
longer explanation and are skipped here. The take-home message is that you should use the
shift_left and shift_right functions defined in the numeric-std package for unsigned and signed
data types. If you want to shift standard bit-vectors, you can just use range-selection and
concatenation to create the shifted vector.

10 FSM
Here you can see an example of a finite-state machine model written in VHDL. The basic
structure of the model is again the same you would use with SystemVerilog. In this case, the
model is based on the two-process-template, in which one process models the state register,
and the other the next-state and output decoding logic.

This example also shows how you can create an enumerated data type to represent the current
and next-state values. In VHDL it is not possible to define the values of the enumeration
constants in the type declaration itself. You can do that by creating VHDL attribute definitions for
the enum values, but this technique is tool-dependent and will not be discussed here.
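
A minimal two-process FSM sketch along these lines, with invented states and ports, is shown
below.

    library ieee;
    use ieee.std_logic_1164.all;

    entity fsm_demo is
      port ( clk, rst_n : in  std_logic;
             start_in   : in  std_logic;
             busy_out   : out std_logic );
    end entity;

    architecture rtl of fsm_demo is
      type state_t is (IDLE, RUN, DONE);       -- enumerated state type
      signal state_r, next_state : state_t;
    begin

      -- process 1: state register
      state_reg : process (clk, rst_n)
      begin
        if rst_n = '0' then
          state_r <= IDLE;
        elsif rising_edge(clk) then
          state_r <= next_state;
        end if;
      end process state_reg;

      -- process 2: next-state and output decoding
      decode : process (state_r, start_in)
      begin
        next_state <= state_r;
        busy_out   <= '0';
        case state_r is
          when IDLE =>
            if start_in = '1' then
              next_state <= RUN;
            end if;
          when RUN =>
            busy_out   <= '1';
            next_state <= DONE;
          when DONE =>
            next_state <= IDLE;
        end case;
      end process decode;

    end architecture;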

11 VHDL Constructs for RTL Models


Here you can see a summary of the VHDL keywords that are used to define the essential parts
of RTL models. The respective SystemVerilog keywords are shown in red. As you can see,
there are not that many things you have to learn in order to be able to move from using one
language to using the other. Learning the syntax of operators and procedural control constructs,
and the type handling rules, will require some more effort.

12 Miscellaneous
This slide presents some additional things it's good to know about VHDL.

You should definitely know that VHDL is not a case sensitive language. This means that lower
and upper case letters are treated as equivalent. The three forms of the name ctr shown here
would therefore refer to the same object. This can cause a lot of confusion if you have been
using only case sensitive languages before.

Another peculiarity of VHDL is that you cannot define a multidimensional array by just defining
the array indices with the array name as in most other languages. In VHDL, you have to define
a new data type that represents the array, as shown here. The type my_type defines an array
data type that consists of 8 8-bit bit-vectors. To declare the actual array, you would have to
create a signal or variable of this data type.

We have not discussed hierarchical modeling techniques in this lecture. You can create a
hierarchical model by instantiating another entity as a component inside the architecture of
another entity. VHDL has many ways of doing that. The one shown here is the most
straightforward. It creates a component instance named U1 of the entity my_reg and selects
the architecture RTL to represent my_reg in this instance. My_reg is taken from the library 'work'.
The port map statement connects the ports of the instance to the signals and ports of the
parent design.

The disadvantage of the method shown here is that the design library from which the
instantiation is made is hard-coded in the RTL code. More flexible instantiation methods are
available, but they are not discussed in this introductory lecture.
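
The two constructs discussed above could look like this in code; the array element size, the
signal names and the ports of my_reg are assumptions.

    -- Array type: first declare a new type, then a signal of that type
    -- (in the declarative part of an architecture).
    type my_type is array (0 to 7) of std_logic_vector(7 downto 0);
    signal mem : my_type;

    -- Direct entity instantiation from library "work", selecting architecture RTL
    -- (in the statement part of the parent architecture).
    U1 : entity work.my_reg(RTL)
      port map ( clk   => clk,
                 rst_n => rst_n,
                 d_in  => d,
                 q_out => q );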

We have already mentioned on a couple of occasions that, to write VHDL efficiently, you must
know the contents of the std_logic_1164 and numeric_std packages quite well. The best way to
learn them is to find the source code of the packages on the Internet and study it to learn which
operators are overloaded, which utility functions are available, and so on.

You may also find packages developed by some EDA tool vendors on the Internet. A package
called std_logic_arith is an example of these. Don't use such packages. The same functionality is
available in the IEEE standard packages, and using them guarantees that your code will
work with the tools of all vendors.
L8

2 Clock Domain Crossings

In this lecture we are going to discuss the design and verification of clock domain crossing
structures.

A system-on-a-chip design can have a large number of asynchronous inputs in which the data
that arrives at the flip-flops of the circuit is not synchronous with the clock signal of the flip-flops.
This means that the data can change at any time with respect to the clock signal edges.

An SOC design can also have many clock domains, which are parts of the circuit that are
clocked by clocks with different frequencies and phases.

The asynchronous inputs and the clock domain boundaries represent clock domain crossings
on which you must synchronize the signals so as to minimize timing problems in the flip-flops
that are on the receiving sides of the crossings. In essence, you must design the crossings so
that they can tolerate the inevitable setup and hold time violations that can occur in these
flip-flops.

In this lecture we are going to study techniques that can be used to design reliable clock domain
crossings and techniques that can be used to verify such designs on the RTL level.

3 Metastability
To understand the problem that has to be solved, you have to understand the timing
requirements of flip-flop devices, and the adverse effects that can follow if these requirements
are violated.

A flip-flop stores the value, 0 or 1, of its data input at the moment when there is a rising edge in
the flip-flop's clock input. For this to work, the data input must be stable when the clock
changes. The data input must have settled a certain time, called the setup time, before the
rising edge of the clock, and it must remain stable for a certain time, called the hold time, after
the rising edge of the clock occurred.

If the data input value changes during the setup-hold time window, the flip-flop's output can
enter a metastable state, in which its voltage hovers between the two valid logic levels.

The metastable state can last so long that the illegal voltage level is seen by the next flip-flop of
the circuit on the next clock cycle. This way metastability can propagate inside the circuit. The
consequences can be anything between a minor glitch that goes unnoticed and a complete
system crash.

Because all systems inevitably have asynchronous inputs, systems must be designed to
tolerate metastability.

4 Metastability Removal Using a Synchronizer


The standard solution to the metastability problem is the 2-flip-flop synchronizer structure shown
here.

Its idea is simple. We just add two extra flip-flops clocked by the clock signal of the receiving
circuit in front of the data input. The first flip-flop, FF1 in this case, will inevitably enter a
metastable state every now and then. However, the metastable signal 'sync' has a complete
clock cycle to recover, which is why the second flip-flop FF2 will go metastable very seldom. The
output of this second flip-flop driving the signal rx in the schematic drives the actual data input of
the circuit.

If we want to make the probability of metastability reaching point rx even smaller, we can insert
an additional flip-flop after FF2. However, it is not possible to completely eliminate the chance
that the metastable state propagates through the synchronizer flip-flops. We can just make the
chance extremely small.

The waveform diagram shows a couple of examples of what can happen if the data signal tx
violates the timing constraints of flip-flop FF1.

In the first case, FF1 enters a metastable state from which it recovers in time to state 0.
Therefore the new value of signal tx is clocked into FF2 only on the second clock edge. The
valid data value reaches point rx after the third clock edge.

In the second case, FF1 again goes metastable but recovers to state 1. Signal rx therefore gets
the new value one clock cycle earlier. There is thus a random one-clock cycle delay variation in
the synchronizer.
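
In RTL code, the 2-flip-flop synchronizer is just two registers in series in the receiving clock
domain. A minimal SystemVerilog sketch, with invented module and reset names, is shown
below.

    module sync2 (
      input  logic clk,      // receiving-domain clock
      input  logic rst_n,
      input  logic tx,       // asynchronous input
      output logic rx        // synchronized output
    );
      logic sync;            // output of FF1, may go metastable

      always_ff @(posedge clk or negedge rst_n)
        if (!rst_n) begin
          sync <= 1'b0;
          rx   <= 1'b0;
        end else begin
          sync <= tx;        // FF1: samples the asynchronous signal
          rx   <= sync;      // FF2: has a full clock cycle to see a settled value
        end
    endmodule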

5 Synchronizer Reliability
It is important to analyze the reliability of a synchronizer before it is accepted for the design.
This slide presents a simple method for doing that.

In this example, we try to estimate the probability of a synchronization failure using data and
clock signal properties, and the properties of the synchronizer flip-flop. A synchronization failure
occurs when a synchronizer flip-flop does not recover from the metastable state before the next
clock edge.
The parameters we have to know for this analysis are the data rate of the input signal, the
receive side clock frequency, the setup-hold time window length, and the flip-flop's recovery time
from metastable state, which we denote with the Greek letter tau. Notice that using setup and
hold times in this estimation gives an overly pessimistic result. Setup and hold time margins are
design guidelines, and the times the flip-flop actually tolerates are shorter.

The formula shown in this slide presents a method for computing the error rate, which is the
number of synchronizer failures per time unit. The more commonly used parameter, the mean
time between failures, or MTBF, is just the inverse of the error rate.

Let's examine the formula term-by-term.

The first term, fTX, is the input data rate. In the most pessimistic scenario, in which there is a
synchronization failure every time the data changes, this parameter alone would determine the
error rate. Fortunately things are not that bad.

The second term of the formula represents the probability with which a change in the data input
actually causes a timing violation. This will occur if the data changes inside the setup-hold
window of the flip-flop. The probability of this is just the window length divided by the
receiver-side clock period. So if the setup-hold window length is for instance 100 picoseconds,
and the clock period is 1000 picoseconds, the probability of a timing violation is 10%.

Now we know the probability of a timing violation. We next assume that the synchronizer flip-flop
goes metastable every time there is a timing violation, and calculate the probability with which
the metastable state lasts for the whole clock cycle, causing a synchronization failure.

This probability can be estimated with the function e to the power of minus t over tau. This is the
probability of the flip-flop still being in the metastable state after the time t. The value of the
recovery time parameter tau is flip-flop specific. For the time t, we use the receiver clock
period. The decreasing exponential function models the synchronizer behavior: the longer you
wait, the more certain you can be that the flip-flop has recovered from the metastable state.

Notice that the error rate formula does not apply if the transmitter side clock edge causes
multiple changes in the data signal. This can happen if the data signal is driven from
combinational logic that can have glitches caused by hazards. This is why synchronizer inputs
should always be driven from a flip-flop in the transmitting clock domain. Clock domain crossing
verification programs will report an error if they detect combinational logic at synchronizer
inputs.
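
As a worked example with invented numbers (a 10 MHz data rate, a 1 GHz receiver clock, a
100 ps setup-hold window and tau = 20 ps), the estimate would be roughly:

    error rate = f_TX * (t_w / T_RX) * exp(-T_RX / tau)
               = 1e7 1/s * (100 ps / 1000 ps) * exp(-1000 ps / 20 ps)
               ≈ 1e6 1/s * 2e-22
               ≈ 2e-16 failures per second

    MTBF       = 1 / error rate ≈ 5e15 s, that is, over a hundred million years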

6 Synchronization of Multibit Data (1)
It looks like we have now solved the synchronization problem, but that is unfortunately not true.
The 2-flip-flop synchronizer only works with 1-bit data.

If the data that we want to move from one clock domain to another has multiple bits, the random
one-clock-cycle delay variation of the synchronizers can cause problems if we just synchronize
every data bit separately, as in the crossed-out schematic shown here.

The timing diagram on the right shows some examples of what can happen. The synchronizer
input data 'tx' changes from all-zeros to all-ones at the same time as the clock edge seen by
the synchronizer. This will have random consequences. In this example, two flip-flops enter the
metastable state but recover to different states, while the third flip-flop captures the new state 1.
The received data therefore consists of the values 0-0-0, 1-0-1 and 1-1-1, which is different from
what was transmitted. The simple synchronization solution obviously does not work for multibit
data.

7 Synchronization of Multibit Data (2)


For multibit data, we need a solution in which the asynchronous data is held stable for some
time and the synchronizer is notified of this so that it can capture the data when it is guaranteed
to be stable. The data multiplexer synchronizer shown in this slide is the simplest solution that
uses this principle.

The transmitter side signals in this case are the multibit data signal tx, and a control signal tx-en,
which is in state 1 when the data in tx is valid and stable. Both signals are clocked into registers
at the transmitter.

The data-valid indicator signal tx_en is synchronized with a standard 2-flip-flop synchronizer,
and the synchronized signal enables the loading of the receiver's data register. The
transmitter-side data does not have to be synchronized because we know that it will remain
stable throughout this process.

This solution is simple but it has a drawback. The transmitter side must know the clock
frequency of the receiver so that it can keep the data and control signals stable long enough for
the receiver to catch them. The data multiplexer is therefore not a universal solution for
synchronizing multibit data.

8 Synchronization of Multibit Data (3)


On this slide you can see a more robust but also more complex solution to the multibit data
synchronization problem. It is known as the handshake synchronizer, and it is based on a
closed-loop solution in which the receiver side notifies the transmitter when it is ready to start
the next synchronization event.
The signal arrangement is the same as before: We have the multibit data signal tx and the
data-valid indicator tx-en. We also have the transmitter and receiver side data registers.

In this synchronizer, the data-valid indicator tx-en is connected as input to a finite-state machine
controller TX FSM.

In the beginning of the synchronization event, tx-en is set to state 1. This causes the control
state machine to enable the loading of the data register txr with the signal txr_en. At the same
time, the FSM raises the request signal 'req'. The request signal is synchronized to the rx-side
clock, and its synchronized version 'sreq' enables the loading of the rx data register, just like in
the data multiplexer synchronizer.

In this solution, the synchronized signal 'sreq' also drives a receiver-side control state-machine,
the RX FSM. When RX FSM sees 'sreq', it raises an acknowledgement signal 'ack', which is
then synchronized to the transmitter-side clock.

When the TX FSM sees the synchronized 'sack' signal, it drops 'req', and when the RX FSM
sees that 'sreq' has dropped, it drops 'ack', and finally, when TX FSM sees that 'sack' has
dropped, it returns to an idle state from which it is ready to start a new handshaking transaction.
The transmitter side application logic can check from the TX FSM state register when it is ready
to accept new data.

The waveform diagram shows a complete synchronization cycle.

With the handshake synchronizer, the transmitter and receiver side clock frequencies don't
have to be known in advance. The downside, in addition to the increased complexity, is that the
synchronization requires a large number of clock cycles because of the closed-loop operating
principle.

9 Handshaking State Machines


This slide shows algorithmic state machine diagram descriptions of the control state machines
of the handshake synchronizer. They are easy to design in principle, but there is one thing that
you should take into account.

The state machines drive the signals 'req' and 'ack' that function as inputs of the 2-flip-flop
synchronizers. As we learned earlier, synchronizer flip-flops must not be driven from
combinational logic. This must be taken into account in the design of the state machines.

The code fragment shown here is an example of coding style that cannot be used here. If you
use conditional statements 'case' or 'if' to decode the values of the handshaking signals, there is
a danger that the synthesized circuit will contain logic gates that drive these signals. And in any
case, a CDC verification program will treat this coding style as a violation of design rules.
An easy solution to this problem would be to add flip-flops at the 'req' and 'ack' signals, but this
would increase the synchronization latency by four clock cycles. A better solution is to use
one-hot state-encoding for the FSM states, and drive the signals directly from the state register
bit that represents the state where 'req' or 'ack' must be high, as shown here.
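
A small SystemVerilog sketch of this idea, with invented state names and bit positions, is shown
below.

    // One-hot state encoding with the state register as a plain vector, so 'req'
    // is taken directly from one register bit; no combinational logic ends up
    // between the state register and the synchronizer input.
    localparam int IDLE      = 0;
    localparam int REQ_HIGH  = 1;
    localparam int WAIT_DROP = 2;

    logic [2:0] state_r;               // exactly one bit is 1 at a time

    assign req = state_r[REQ_HIGH];    // driven straight from a flip-flop output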

10 Other Multibit CDC Schemes


Many other multibit data synchronization schemes exist. Here we present two of them.

The data re-encoding method shown on the left is a low-latency option that can be useful in
some applications. The idea here is to encode the data using the Gray code, which is a code in
which only one bit changes between sequential code words. This allows us to use 2-flip-flop
synchronizers, as shown here. This is not a general purpose solution that works with arbitrary
input data sequences, but it can be used, for instance, to synchronize control data sequences
generated with a counter or a state-machine that executes a fixed state sequence. Its
advantage is the short synchronization latency, which is often needed in control applications.

The second synchronization scheme, shown on the right, is based on the use of an
asynchronous first-in first-out memory component. With an asynchronous fifo, both sides can
use their own clock to write to and read from the memory. True synchronization is of course
required in this case too, but it is handled by the fifo's internal circuitry, and has therefore
already been solved by the designer of the fifo.

11 Data Reconvergence
You can probably solve any synchronization problem by using one of the solutions presented in
this lecture.

The general principle is that you must synchronize every data signal in only one synchronizer.
This way you will not create multiple versions of the data that differ by a random clock cycle
because of the delay variation of the synchronizers.

In large designs, problems can still emerge if the data that originates from one point travels over
different paths and through different synchronizers to another clock domain and converges in
another point. This situation is called data reconvergence. In the block diagram, the two paths
highlighted with red that start from register A and end in register C are an example of this.

The data that passes through the two synchronizers is different, so the synchronize-only-once
principle is not violated here. However, because the data on both paths was launched from the
same register A, and can experience a different delay in the synchronizers, a wrong pair may be
combined in the combinational logic block X that sits in front of the register C. This could be a
problem, but not necessarily. It depends on what the circuit does.
A similar kind of situation exists between registers C and D. In this case, the data that originates
from register A diverges into two registers that can therefore receive improperly aligned data.

The path highlighted with green has no reconvergence or divergence issues because it travels
through only one synchronizer.

CDC verification programs can detect data reconvergence issues in a design. It is up to the
designer to check if they can cause problems in practice.

12 CDC Logic Verification


Special clock domain crossing logic verification programs are nowadays widely used in RTL
verification to detect synchronization problems in an early phase. RTL code will usually work in
simulation even if it has CDC issues. The problems only appear in gate-level simulation when
the timing checks in flip-flop models notify the user of setup or hold time violations.

RTL designs or code can have three kinds of CDC problems.

First of all, a design that has a clock domain crossing may not contain proper synchronization at
all. The reason for this could be that the designer did not notice that a signal crosses a clock
domain boundary somewhere.

Sometimes the wrong kind of CDC solution has been chosen. For instance, a bank of 2-flip-flop
synchronizers is used for multibit data.

The third group is errors made in RTL coding. The idea may be correct, but the RTL code does
not produce the required gate-level topology. The most common example is combinational logic
synthesized into the inputs of synchronizer flip-flops.

CDC verification programs first detect CDC structures in RTL code, then match them with
known-to-be-good synchronizer templates and finally inform the user whether the CDC solution
is good for the crossing. Both formal proof and simulation-based verification techniques are used
to verify the detected CDC logic.

In the course project, we use the Questa CDC tool to check the synchronizers you have
designed. The tool works in two phases.

In static analysis of the RTL code, the tool classifies the identified crossings as violations,
cautions, evaluations, and proven.

The violation status indicates some fundamental problem, like a missing synchronizer.
Cautions and evaluations are borderline cases where the CDC logic looks good, but it must be
verified by simulating it in the design's context. For example, the tool might have identified a data
multiplexer synchronizer and wants to see that the data input is kept stable long enough.

The proven status is given when the tool has been able to formally prove that the CDC solution
cannot fail.

In a static analysis run the tool can automatically create assertions that can be used in the
simulations that are required to verify crossings that were given the Caution or Evaluation
status.

The tool can also add so-called metastability injectors into the RTL model for use in simulations.
The injectors add random delays into CDC data signals when the TX and RX clocks align. This
way synchronization effects can be observed already in RTL simulations.

Reconvergence analysis is also available in the static analysis run.

In the dynamic analysis phase, the RTL code is simulated with the protocol assertions and
metastability injectors created in the static analysis run. The simulation should of course work.
It is also possible to get coverage data of certain CDC properties collected by the protocol
assertions. You can for instance see if some corner cases like minimum intervals between
synchronization events were covered in the simulation.

13 Questa CDC Flow Used in the Project


This slide shows an overview of the CDC verification flow that is used to check the cdc_unit in
the project.

You begin with the functionally verified RTL code. In other words, the design should already
work.

In the first phase, you run the static analysis script. If you see violations, you must fix the code.
Otherwise you can move on.

If there were cautions or evaluations, you must run a simulation that uses the protocol checker
assertions the tool created. You can find information about these checkers in a text report file.

After simulation, you can check the results using a dynamic analysis script. If the protocol
checker assertions passed, you should see that the status of evaluations has changed to
evaluated.
The coverage results are often less than 100%. It is possible that the test program does not
generate data that stress-tests the synchronizers in all corner cases that have been modeled in
the protocol assertions.

14 Debugging Violations in Questa CDC


If static analysis generates violations, you can use the tools available in the graphical user
interface to try to find out what's wrong. The schematic tool is often useful in locating the
problematic points in the design. By using the Help function, you can get a detailed explanation
of the violation.

In this slide you can see how the 'combinational logic before synchronizer' violation is
represented in the GUI.

Notice that in this example the handshake synchronizer check has also failed because of the
faulty 2-flip-flop synchronizer that it contains.
L9

2 SystemC Overview

SystemC is a method for creating untimed and timed system models with the C++ language. It
is defined and promoted by Accellera, and has been approved by the IEEE as standard 1666-2011.

This slide describes the main features of SystemC.

SystemC is implemented as an open-source C++ class library. From this it follows that you
should have some knowledge of C++ programming to be able to use SystemC proficiently.

The SystemC library provides the base classes for hardware-description-language-like modeling
of hierarchical modules, ports and interconnect channels, concurrent processes and time. It also
provides macros that hide some C++ constructs from the user and make adoption easier for
HDL designers. The hardware modeling paradigm, represented by the parts with orange
background in this diagram, is therefore similar to that of HDLs.

The SystemC library also contains a simulator "kernel" that is linked with the user code, and
allows the models to be executed as normal programs in Linux or Windows, or under the control
of an external simulator. It supports both clock-based and untimed or loosely timed
transaction-level modeling.

SystemC also defines some hardware-oriented data types, such as arbitrary-precision integer
and fixed-point types.

3 Modeling Levels Supported in SystemC


This slide introduces the two most commonly used modeling styles that are used to create
SystemC models.

The principle of transaction-level modeling, or TLM for short, is well described by this excerpt from
Wikipedia: The details of communication among modules are separated from the details of the
implementation of functional units or of the communication architecture.

What this means in practice is that you use function calls in your models to pass data from one
module to another. A TLM function could in principle look like the 'put' function shown in the
slide: it just copies the argument you passed to it to the receiver. The implementation will be
more complex in practice, but for the user this view is sufficient, because the user only writes
function calls.
The 'put' functions shown here would in practice be owned by a socket object in a SystemC
model. It could be used so that a processing thread of one module would call an initiator
socket's put-method, which would call the target socket's put-method, which would copy the
data to a variable in the module that owns the target socket. The SystemC library implements
the details of this communication, and the designer can just concentrate on selecting the
interface types and using them in the models.
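
As a rough illustration of the idea, and not of the actual TLM-2.0 socket API, the put-style
communication could be sketched in plain C++ like this; all class and member names here are
invented for these notes:

// A simplified, blocking "put" style interface: the initiator side calls
// put(), and the data ends up in a variable owned by the target side.
template <typename T>
class simple_target {
public:
    T received;                    // variable in the receiving module
    void put(const T& value) {     // target-side put: just copy the data
        received = value;
    }
};

template <typename T>
class simple_initiator {
public:
    simple_target<T>* target = nullptr;
    void put(const T& value) {     // initiator-side put forwards the call
        target->put(value);
    }
};

// Usage sketch:
//   simple_target<int>    tgt;
//   simple_initiator<int> ini;
//   ini.target = &tgt;
//   ini.put(42);                  // tgt.received is now 42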

SystemC TLM 2.0 supports both "loosely" and "approximately" timed modeling. In loosely-timed
models, transactions can have delays. In approximately-timed models transactions can further
be divided into phases that can have delays. Transactions can have generic payloads, which
means that they can carry any kind of data structure.

SystemC also supports accurately-timed models, in which communication signals, signal timing
and protocols are defined with pin and clock cycle accuracy. This is the level of accuracy used in
register transfer level modeling. The user's code must therefore actually wiggle the control and
data pins when it wants to move data. Accurately-timed models are used to describe
cycle-based functionality, which means that all operations are timed by clock signal edges.

TLM models are typically used in the system design phase to create virtual hardware models.
Accurately timed models are used as hardware specifications and as inputs for high-level
synthesis tools. This lecture covers accurately-timed, clock-based modeling.

4 Short Introduction to C++ Classes


Before we begin to discuss SystemC more deeply, let's review the basics of C++ classes and
object-oriented programming in general.

In object-oriented programming, a class is a declaration of a data structure that consists of a
collection of variables and function subroutines that can be used to manipulate the variables of
the class. The variables and functions are often called the members of the class. Member
variables are sometimes called class attributes, and member functions class methods. In
programs you can create an instance of a class by declaring a variable whose data type is the
name of the class. The class instances are called objects.

In C++, class members can be either private or public. Private members can only be accessed
from the member functions of the class, while public members can also be accessed from
outside the class. Member variables are usually private and member functions public.

In the left-hand-side code box, you can see the declaration of the class geom_object that has
four integer-type member variables, x, y, w and h; and four member functions, geom_object,
r_move_to, resize and report.
The function geom_object, which bears the name of the class, is a special function called the
constructor. It will be called automatically when an object of this class is created. In this
example, it initializes the member variables and prints out a message into the cout standard
output stream, which has been made available by the statements at the beginning of the file.

For the three other member functions, only a function prototype is defined. The implementations
must therefore be given outside the class declaration.

Class declarations are usually written in C++ header files that have the .h suffix.

The right-hand-side code box shows the implementations of the class member functions. The
implementation file has the suffix .cpp. This file is passed to the C++ compiler, which then reads
in the header file geom_object.h because of the include directive.

The function declarations are associated with the class with the double-colon scope resolution
operator.

The small code box on the right shows a simple program that uses the geom_object class.

The variable x has the type geom_object which makes it an object of that class. It can be
manipulated by calling the member functions of the class. Notice that it would be an error to
reference private member variables from here by using the same syntax.

The last box on the right shows an example of how you could compile and execute this program
from the Linux command line. In this case, the GNU C++ compiler g++ is used. The
output from the program is shown in yellow font.

In this slide, we have presented only the essentials of C++ classes. You are also recommended
to study features such as inheritance and polymorphism to get a more comprehensive view of
C++.
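
To make the pattern concrete, here is a condensed sketch along the lines of the geom_object
example described above; the constructor arguments and the report output format are
assumptions made for these notes, not the slide code.

// geom_object.h: class declaration
#include <iostream>

class geom_object {
private:
    int x, y, w, h;                              // position and size
public:
    geom_object(int x0, int y0, int w0, int h0)  // constructor
        : x(x0), y(y0), w(w0), h(h0) {
        std::cout << "geom_object created" << std::endl;
    }
    void r_move_to(int dx, int dy);              // prototypes only:
    void resize(int neww, int newh);             // implemented in the .cpp file
    void report();
};

// geom_object.cpp: implementations, associated with the class
// using the :: scope resolution operator.
void geom_object::r_move_to(int dx, int dy) { x += dx; y += dy; }
void geom_object::resize(int neww, int newh)  { w = neww; h = newh; }
void geom_object::report() {
    std::cout << "at (" << x << "," << y << ") size "
              << w << "x" << h << std::endl;
}

// main.cpp: using the class
int main() {
    geom_object obj(0, 0, 10, 10);   // the constructor runs here
    obj.r_move_to(5, 5);
    obj.report();
    return 0;
}

Compiling this sketch with, for example, g++ main.cpp geom_object.cpp -o geom_object and
running the result would print the constructor message followed by the report line.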

5 SystemC Module
The basic building block of hardware models in SystemC is the module. A module defines the
name of a component, its input and output ports and their data types, as well as the processes
that describe the function of the component.

A SystemC module is implemented as a C++ 'struct' data structure. In C++, a struct is just a
class whose members are all public. You can therefore think of a SystemC module declaration
as a C++ class declaration.

The module declaration shown here does not contain the class keyword. This is because it uses
the SC_MODULE macro, which has been defined in the systemc.h include file that contains all
SystemC declarations. The purpose of this macro, as well as other SystemC macros, is to make
it easier for HDL coders to begin to use SystemC. In this case, the compiler expands the macro
into a valid class declaration.

Inside the module, a group of member variables of type sc_in and sc_out are declared. They
represent the ports of the module. sc_in and sc_out are parameterized SystemC classes, for
which the data type handled by the port must be given as a template parameter, such as the
bool or sc_uint types in this example. Notice that the unsigned integer type sc_uint is also a
parameterized class that accepts the number of bits as the template parameter inside the angle
brackets.

The reg_r variable declared after the port declarations is a regular C++ member variable, which
represents the memory of a register in this module.

The function prototype 'run' is declared next. Its purpose will be explained shortly.

SC_CTOR is again a SystemC macro. It expands into a constructor function
declaration during compilation.

Inside the constructor, the SC_CTHREAD macro defines the function 'run' to be a clocked
thread that is sensitive to the rising edges of the clock signal clk.

The reset_signal_is function call defines the reset signal for the clocked thread.

This closes the class declaration.

As you can see, the syntax is a little bit peculiar, but luckily all module declarations will contain
similar parts, so there are not too many things you'll have to learn.

The implementation file reg8.cpp shown on the right contains the code for the member function
'run'.

At this point it is sufficient to pay attention to two things.

First of all, the ports are accessed using their read and write methods. You can often read from
an input port by just using the port name and write to an output port by using the assignment
operator because of operator overloading, but you will have fewer compilation problems if you
always use the read and write methods explicitly.

The second thing to notice is the use of the SystemC wait function call statement. The wait
statement is used to synchronize the execution of the code to the clock. When a wait statement
is executed, control is passed from the executing thread, such as the 'run' thread in this case, back
to the simulator. The wait statement returns when the simulator detects the clocking event
specified for the thread in the module constructor.
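
Based on the description above, the two files could look roughly like this; the exact port names
and the enable input are assumptions made for these notes, not the course code.

// reg8.h: module declaration
#include <systemc.h>

SC_MODULE(reg8) {
    sc_in<bool>           clk, rst_n, en;
    sc_in< sc_uint<8> >   d;
    sc_out< sc_uint<8> >  q;

    sc_uint<8> reg_r;                    // plain C++ member: register storage

    void run();                          // function prototype for the thread

    SC_CTOR(reg8) {
        SC_CTHREAD(run, clk.pos());      // clocked thread on rising clock edges
        reset_signal_is(rst_n, false);   // active-low reset for the thread
    }
};

// reg8.cpp: the clocked thread
void reg8::run() {
    reg_r = 0;                           // reset behavior
    q.write(reg_r);
    wait();
    while (true) {                       // steady-state operation
        if (en.read())
            reg_r = d.read();            // read the input port explicitly
        q.write(reg_r);                  // write the output port explicitly
        wait();                          // synchronize to the clock
    }
}
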
6 SystemC Ports, Channels and Processes
Let's next examine the most important hardware modeling elements that are used to create
modules.

In the previous slide, we already mentioned input and output ports. Ports inside a module
instance are objects that are used by calling their read and write methods. The third important
method is the bind function. It is used to connect a port to a channel, such as the sc_signal
described next. The parentheses operator has been overloaded so that you can use it instead of
an explicit call to the bind function, which makes port mapping statements a little bit simpler.

Channels are SystemC classes that define connections between processes and modules. In
hardware design, you will mostly need the sc_signal channel type.

A value assigned to a signal channel is updated when the processing thread exits or is
suspended by wait so that all readers of the signal get the same value. A regular C++ variable is
updated immediately and should not be used for interprocess communication.

Processes are used to describe functions. The two most commonly used SystemC process
types are SC_METHOD and SC_CTHREAD.

SC_METHOD processes are executed when a signal in their sensitivity list changes.
After execution, they return control back to the simulator kernel. They are suitable for
combinational and sequential logic modeling at the RTL level.

Threads in SystemC are processes that can be suspended and reactivated. A thread is
suspended when wait is called and reactivated by an event in a signal on its sensitivity list. The
SC_CTHREAD is a special case of a thread process that is sensitive only to its clock signal. It is
commonly used for algorithm-level modeling for synthesis.
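
The difference between a signal channel and a plain variable can be sketched like this (a minimal
illustration written for these notes; the module and member names are invented):

#include <systemc.h>

SC_MODULE(sig_demo) {
    sc_in<bool>     clk;
    sc_signal<int>  s;      // readers see the new value only after the
                            // writing thread suspends or exits
    int             v;      // plain C++ variable: updated immediately,
                            // not safe for interprocess communication

    void writer() {
        s.write(1);
        v = 1;              // other processes already see v == 1 here,
        wait();             // but s == 1 becomes visible only after this wait
        s.write(2);
        v = 2;
        wait();
    }

    SC_CTOR(sig_demo) {
        SC_CTHREAD(writer, clk.pos());
    }
};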

7 SC_METHOD Example: RTL Modeling


This example makes use of SC_METHOD processes to model combinational and sequential
logic functions. If you examine the code, you will notice that despite the different syntax, the
coding patterns used are the same you would use to model combinational and sequential logic
in VHDL or in SystemVerilog.

In this example, the code represents the RTL design whose block diagram is on the right. The
design contains two enabled registers whose outputs are connected into a multiplexer block.

In the SystemC code, the relevant parts are the sc_signal type member variables a_r and b_r
that represent the register outputs, and the seq and comb member functions, that have been
defined to be SC_METHOD type processes in the constructor. The sequential process has the
clock edge and reset edge in its sensitivity list, while the combinational process requires all of its
inputs to be in the sensitivity list.

Notice that the syntax of the sensitivity list for SC_METHODs is different from the syntax used
with SC_CTHREADs.

The yellow and blue code boxes contain the code of the seq and comb processes. The code
structure follows the general RTL coding rules you would use with any other RTL language.
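
The slide's code is not reproduced in these notes, but the pattern it follows could be sketched as
below; the port names and the 8-bit data width are assumptions made for this sketch.

#include <systemc.h>

SC_MODULE(regs_mux) {
    sc_in<bool>          clk, rst_n;
    sc_in<bool>          a_en, b_en, sel;
    sc_in< sc_uint<8> >  d_in;
    sc_out< sc_uint<8> > q_out;

    sc_signal< sc_uint<8> > a_r, b_r;        // register outputs

    // Sequential logic: two enabled registers with asynchronous reset.
    void seq() {
        if (!rst_n.read()) {
            a_r.write(0);
            b_r.write(0);
        } else {
            if (a_en.read()) a_r.write(d_in.read());
            if (b_en.read()) b_r.write(d_in.read());
        }
    }

    // Combinational logic: multiplexer between the register outputs.
    void comb() {
        q_out.write(sel.read() ? a_r.read() : b_r.read());
    }

    SC_CTOR(regs_mux) {
        SC_METHOD(seq);
        sensitive << clk.pos() << rst_n.neg();
        SC_METHOD(comb);
        sensitive << sel << a_r << b_r;
    }
};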

8 SC_CTHREAD Example: Algorithm-Level Modeling


This slide presents an example of an algorithm-level model of an FIR filter, modeled using an
SC_CTHREAD process.

Let's first look at some noteworthy parts inside the module declaration.

The module defines two member functions, of which the run function is made an SC_CTHREAD
process in the constructor. The second one, 'sat', is just a normal C++ function that has no
special simulation behavior. When it is called from the processing thread, it is executed in zero
time inside the current simulation timestep.

The three member variables at the end of the module declaration are declared as normal C++
variables, and not as sc_signals. This is okay, since they are only used as local storage
locations, and not for interprocess communication.

In the code of the run function, pay attention to the general structure of the code.

The lines that precede the first wait statement define the reset behavior of the module, since
they will be executed in the beginning of the simulation, before the first clock edge.

The code that follows the first wait statement defines the steady-state operation of the module.
An SC_CTHREAD process by convention should contain one while-loop from which it will never
exit. This is a requirement if you write the code for high-level synthesis in mind. The loop is a
processing loop that repeatedly executes the algorithm defined inside the loop. In this case it is
the FIR filter algorithm.

The eternal processing loop must contain at least one wait statement so that it models clocked
behavior correctly. It can contain any larger number of wait statements. In this case there are
two: one inside the loop in the beginning, where the code waits for the valid-in input to become
true, and one at the end where the code polls the ready-in input.

The data processing section in this case is presented as an untimed algorithm that will be
executed in zero-time.
Notice that labels have been used generously in the code to name code blocks and loops. This
is not required, but it makes it easier to apply constraints on the code in high-level synthesis
tools.

9 Hierarchical Models
Just like any other hardware description language, SystemC allows you to create hierarchical
models. In hierarchical models, you typically place the functional data-processing code in the
leaf-modules of the hierarchy, and then connect these modules to each other inside a higher
level module.

In the example shown here, the top-level module reg8_top instantiates the module reg8 and its
testbench reg8_tb, and connects them to each other with signals.

To create the interconnect signals, you have to declare sc_signal-type variables and give them
the appropriate data type as the template parameter.

Module instantiations are done by just declaring variables that have the module's name as their
type, as shown here for the variables reg8_instance and reg8_tb_instance.

You also have to initialize these variables with their name strings in the constructor's member
initializer list. You can see an example of this in the code. The member variable initializer list
may look like an odd thing, but it is a feature of C++, not SystemC. You can find more
information on its usage in C++ documentation.

The clock period used in simulation is also defined in the constructor's initializer list by initializing
the member variable 'clk' with values that define a 20 nanosecond clock period and a 50% duty
cycle.

The interconnect signals are bound to the module instances' ports in the constructor. You have
to bind every port with a separate function call, and provide the interconnect signal as the
function argument.

This model also declares an SC_THREAD type process reset_thread. This process is executed
only once in the beginning of the simulation. It holds the reset signal low for two clock cycles
and then raises it and exits.
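
A sketch of such a top-level module, consistent with the description above but with assumed
port names, signal names and header file names, could look like this:

#include <systemc.h>
#include "reg8.h"        // assumed header file names
#include "reg8_tb.h"     // the testbench is assumed to mirror reg8's ports

SC_MODULE(reg8_top) {
    sc_clock                 clk;          // clock generator
    sc_signal<bool>          rst_n, en;
    sc_signal< sc_uint<8> >  d, q;

    reg8    reg8_instance;                 // module instances
    reg8_tb reg8_tb_instance;

    void reset_thread() {                  // runs once at the start of simulation
        rst_n.write(false);
        wait(2);                           // hold reset low for two clock cycles
        rst_n.write(true);
    }

    SC_CTOR(reg8_top)
        : clk("clk", 20, SC_NS, 0.5),             // 20 ns period, 50% duty cycle
          reg8_instance("reg8_instance"),         // name strings given in the
          reg8_tb_instance("reg8_tb_instance")    // member initializer list
    {
        // Bind every port to an interconnect signal, one call per port.
        reg8_instance.clk(clk);
        reg8_instance.rst_n(rst_n);
        reg8_instance.en(en);
        reg8_instance.d(d);
        reg8_instance.q(q);

        reg8_tb_instance.clk(clk);
        reg8_tb_instance.rst_n(rst_n);
        reg8_tb_instance.en(en);
        reg8_tb_instance.d(d);
        reg8_tb_instance.q(q);

        SC_THREAD(reset_thread);
        sensitive << clk.posedge_event();
    }
};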

10 Testbench for reg8 Module


This slide presents the code for the testbench module mentioned in the previous slide, just for
completeness. It generates test data that drives the inputs of the reg8 module in simulation.
11 sc_main
We have now covered the basics of hardware modeling in SystemC.

We have yet to show how we can make the hardware simulation model available to a Linux or
Windows process, or to an external simulator process. For this we have to provide an entry
point to our code. This is called sc_main in SystemC, and its purpose is more or less the same
as the purpose of the main function of C-programs. When you start a SystemC program either
from the operating system or from an external simulator tool, the parent process will look up the
sc_main function and start executing it.

You have to provide the sc_main function. It has to create an instance of the top-level
SC_MODULE of your design hierarchy, and then call the sc_start SystemC function to start
simulation. The simulation continues until the SystemC function sc_stop is called from some part of the
model.

Notice that the C++ 'new' operator is used here to create the top-level instance object. This will
allocate memory for the object in the heap of the process, instead of the stack, which may be
safer if the design hierarchy consumes a lot of memory.
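
A minimal sc_main along these lines (the header file name reg8_top.h is an assumption) could be:

#include <systemc.h>
#include "reg8_top.h"

int sc_main(int argc, char* argv[]) {
    // Create the top-level instance in the heap with 'new'.
    reg8_top* top = new reg8_top("reg8_top");

    // Run the simulation; it continues until sc_stop() is called somewhere
    // in the model (here assumed to be in the testbench).
    sc_start();

    delete top;
    return 0;
}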

12 Data Types
This slide shows a summary of the most important SystemC data types you will need in
hardware models.

You need sc_in and sc_out ports and sc_signal channels to represent data. The other data types shown
here are the most common ones used as template parameters of port and signal data types.

The Boolean type 'bool' of C++ is commonly used to represent one-bit data in SystemC code,
even though SystemC has the 2-valued bit type sc_bit and the multi-valued bit-type sc_logic.
You will get along with bool, if you are doing high-level modeling.

In practice you will probably mostly use the sc_uint and sc_int data types to represent
bit-vectors. Notice that these data types support only word lengths of up to 64 bits. For larger
word lengths you have to use the sc_bigint and sc_biguint types, which are slower in simulation.

The sc_ufixed and sc_fixed data types can be used to define arbitrary-length fixed-point data
types that have user-defined quantization and overflow modes. These are very useful in the
design of digital signal processing circuits. Similar data types are not available in SystemVerilog.
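
As a small example of the declarations involved (the bit widths and fixed-point parameters are
chosen arbitrarily for these notes):

#define SC_INCLUDE_FX          // required to enable the SystemC fixed-point types
#include <systemc.h>

void datatype_examples() {
    sc_uint<12> addr   = 0xFFF;        // 12-bit unsigned integer
    sc_int<24>  sample = -1;           // 24-bit signed integer

    // 16-bit signed fixed-point value with 2 integer bits, rounding
    // (SC_RND) and saturation on overflow (SC_SAT); the parameter
    // choices here are only an example.
    sc_fixed<16, 2, SC_RND, SC_SAT> coeff = 0.70710678;
}
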
13 Operators Supported by Integer Data Types
This slide presents a summary of the operators supported by the integer data types sc_uint and
sc_int. As the top part of the table shows, all standard C++ operators are available.

From a hardware designer's point of view the interesting operators are those used for
manipulating unsigned and signed integers as bit-vectors. You can use the brackets operator to
select a bit, just like in SystemVerilog, but to select a part, you must place the
range-specification inside the parentheses. Concatenation of bit-vectors is also done with
parentheses. You can see examples of the use of these operators in the code box.

The conversion functions shown in the table are useful for converting between C++ and
SystemC integers. In the code box, you can see an example of converting a SystemC integer
value into a text string. The SC_BIN constant selects the binary base to be used in the string
representation. You can also specify decimal, hexadecimal or octal formats.

We can conclude that with sc_uint and sc_int you can do the same things as with 'logic unsigned'
and 'logic signed' in SystemVerilog, and unsigned and signed in VHDL.
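
A short sketch of these operators in use (the values are chosen arbitrarily for these notes):

#include <systemc.h>
#include <string>
#include <iostream>

int sc_main(int argc, char* argv[]) {
    sc_uint<8> a = 0xA5;
    sc_uint<4> hi, lo;
    bool       msb;

    msb = a[7];                  // bit-select with the brackets operator
    hi  = a(7, 4);               // part-select with parentheses
    lo  = a(3, 0);

    sc_uint<8> b = (hi, lo);     // concatenation with parentheses

    int         i = a.to_int();                 // SystemC -> C++ integer
    std::string s = a.to_string(SC_BIN);        // binary text representation
    std::cout << i << " " << s << " " << b << std::endl;
    return 0;
}
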
L10

2 RTL Architecture Exploration for Algorithm R4 = R1 + R2 + R3


The topic of this lecture is the high-level synthesis of digital circuits.

The term high-level synthesis is used to describe electronics design automation tools that can
generate optimized, register-transfer-level architectures from algorithm-level models written in a
high-level programming language. In other words, they automatically solve similar problems
that RTL designers solve using their experience and creativity. Before we begin to discuss
high-level synthesis techniques themselves, it's useful to spend some time reviewing the RTL
architecture design problem to which these techniques provide a solution.

We'll use this simple algorithm, R4 = R1 + R2 + R3, as the design problem, and explore the
alternatives we have for creating an RTL implementation for it. We assume that the R-variables
represent registers in the architecture.

We begin by assuming that there are no constraints set for the design. The most
straight-forward solution is then to implement the addition operations with combinational adder
blocks. Since adders conventionally have two inputs, we need two adders to compute the sum
of three values. The RTL architecture will therefore be like the one on the left.

We can now analyze the properties of this implementation. Using the delay values of the adders
shown in the block diagram, we see that the total combinational delay is 14 nanoseconds. If we
leave some margin, we can clock this circuit using a clock signal whose period is 15
nanoseconds. This will also be the latency, or processing delay, of the architecture. The
throughput is the inverse of that, implying that we get one result every 15 nanoseconds. Latency
and throughput are the most important architecture-level performance figures.

In the second case, presented in the middle, we change the problem by introducing a resource
constraint that allows the use of only one adder component. As there are two addition
operations to be executed on only one adder, the additions must be done sequentially,
one-by-one. This requires an architecture that first computes the sum R1 + R2 and saves the
result in a register, and then adds R3 to the value in the register, and stores the final sum in R4.
We need a temporary register for storing the intermediate sum, but we can use R4 for that. The
computation now takes two clock cycles, and on these clock cycles, the inputs of the adder are
different. We therefore have to place multiplexers before the inputs of the adder. This reasoning
yields the architecture shown in the middle.

Compared with the first architecture, the number of resources used in this solution seems to be
smaller because of the sharing of the one adder between two operations, but since we had to
add two multiplexers, the result is not self-evident. This design would also need a simple
two-state state machine for controlling the multiplexers' select inputs.
Using the delay estimates shown in the diagram, we can conclude that this design will work with
a 10 nanosecond clock period. The latency will therefore be 20 nanoseconds, and the
throughput will be 1 per 20.

Assuming that an adder is bigger than two multiplexers and the state-machine, we can conclude
that the resource sharing saved some silicon area, but that came at the cost of performance.

In the third case shown on the right, we have a different kind of constraint. We now have to
create an architecture that works with a clock whose period is less than 10 nanoseconds.
Neither of the previous solutions is good for that.

We can solve the problem by modifying the first architecture so that we split the delay path in
half by adding an extra register R5 in between the two adders. After that the combinational
delay is only 7 nanoseconds. However, after this change, the circuit does not work correctly any
more because the second adder will see a delayed version of R1 + R2. This can be fixed by
delaying R3, too, with a register, R6. Now the data will remain properly aligned when it flows
through the circuits. The circuit can now also operate as a pipeline, in that it can accept new
data and produce results on every clock cycle even though the computation of every result
takes two clock cycles.

The results are again different.

This is probably the most complex solution because of the two additional registers. The latency
is actually worse than in the first case, but the real difference is in the throughput, which is
clearly the best of all three solutions. This is because of the pipelined operation.
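
For concreteness, the pipelined third alternative could be expressed as a single clocked SystemC
process along these lines; the 16-bit operand width is an assumption, and the inputs R1 to R3 are
assumed to be driven by upstream registers.

#include <systemc.h>

SC_MODULE(sum3_pipe) {
    sc_in<bool>            clk;
    sc_in< sc_uint<16> >   r1, r2, r3;
    sc_out< sc_uint<16> >  r4;

    sc_signal< sc_uint<16> > r5, r6;       // pipeline registers

    void seq() {
        r5.write(r1.read() + r2.read());   // stage 1: R5 <= R1 + R2
        r6.write(r3.read());               //          R6 <= R3 (alignment register)
        r4.write(r5.read() + r6.read());   // stage 2: R4 <= R5 + R6
    }                                      // signal reads return the pre-update
                                           // values, so this models the pipeline
    SC_CTOR(sum3_pipe) {
        SC_METHOD(seq);
        sensitive << clk.pos();
    }
};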

3 The RTL Design Problem


We can draw some general conclusions from the architecture exploration exercise presented in
the previous slide.

First of all, the latency can be usually reduced by using more combinational resources, and vice
versa, increasing latency creates opportunities for resource sharing.

The second observation was that resource sharing increases multiplexing and control costs,
which complicates the trade-off analysis.

The third conclusion is that we can improve throughput, often substantially, by pipelining the
computations.
In general, even for small designs, there are a large number of different solutions you can come
up with by trading off area for latency and applying pipelining. The size of this multidimensional
search space is directly proportional to the number of operations in the computation.

In RTL design, the aim is to find a solution that is optimal with respect to design constraints. This
requires a lot of work and experience.

In traditional RTL design, the designers make all the decisions. They define the architecture,
write and verify the code, and analyze the results. Many iterations are often required, because it
is not easy to analyze the silicon area and timing properties of the solution without actually
implementing it first.

And notice that even if the RTL code can look algorithmic, like in the code example shown on
the right, it completely defines the architecture, and therefore the resources, latency and the
level of pipelining. When the code is given to a logic synthesis program, it only optimizes the
combinational logic parts for area and propagation delay.

4 High-Level Synthesis Based Design


We can now contrast the traditional RTL design method with high-level synthesis based design.

The first difference is that you can use algorithmic models written in a high-level language as
input to the synthesis tool. You don't have to describe with clock-cycle-level accuracy the
functions of the combinational and sequential blocks that form the architecture. You can
describe an FIR filter with a simple for-loop, as shown here.
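
For example, an N-tap FIR computation can be written as a plain loop like the following generic
sketch (not the exact course code); the HLS tool decides how many multipliers and adders to use
and on which clock cycles the loop iterations execute.

// N-tap FIR: y = sum over i of c[i] * x[n - i].
int fir(const int c[], int x[], int n_taps, int new_sample) {
    int acc = 0;
    for (int i = n_taps - 1; i > 0; i--) {
        x[i] = x[i - 1];            // shift the delay line
        acc += c[i] * x[i];
    }
    x[0] = new_sample;
    acc += c[0] * x[0];
    return acc;
}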

The second difference is in how a high-level synthesis tool works. It schedules the execution of
the operations in the algorithm in clock cycles, and then allocates the combinational resources,
registers and a control state machine. In essence, this means that the HLS tool generates the
RTL block diagram automatically. This is done according to the optimization constraints.

Since high-level synthesis tools work at the architecture level, the constraints you can give them
are different from logic synthesis constraints. You can constrain architecture properties, such as
latency, resource usage and pipelining, in addition to logic-level constraints such as the clock
period and input and output delays.

All in all, high-level synthesis opens completely new opportunities, because it allows you to
rapidly evaluate different architectures before committing to one that will be implemented.
After this exploration phase, the best found solution can be given to a logic synthesis program
which creates the gate-level implementation, just like in the traditional RTL design flow.

As of today, there are a few HLS tools on the market that have gained some traction in the
industry. We can divide these into two groups based on the input languages they accept.
Some tools support C and C++ language inputs. This means that the input models cannot have
a clock as a time reference and that it is not possible to represent concurrency in the same way
that is possible in hardware description languages.

A design must therefore be modeled as a function call hierarchy according to
HLS-tool-vendor-specific semantic rules. And because there is no clock, it is not possible to
specify clock-cycle-accurate functionality, only algorithmic. Because of the untimed modeling
paradigm, interfaces must be implemented using predefined I/O library parts shipped with the
HLS tool.

If the HLS tool supports SystemC, the design can be modeled more flexibly. As discussed in the
previous lecture, SystemC is a standardized method for modeling systems with clocked
concurrent processes that can describe functionality on both the algorithmic and RTL levels.
Users can therefore also create clock-cycle accurate interface models that can be easily moved
from one HLS tool environment to another.

5 How Do HLS Programs Work?

This slide gives you an overview of the high-level synthesis process.

An HLS tool begins the synthesis by compiling the input code into an internal representation
format, which is some kind of data flow graph, or data and control flow graph. The data flow
graph identifies the operations in the algorithm and the data dependencies between these
operations. From the graph the tool, and the user, can see how many and what kind of
operations there are, and in which order they must be executed.

The first optimization step is scheduling, which means placing the operations on specific clock
cycles in the schedule. The schedule is represented by the gray bars in the data flow diagram.
The schedule must be feasible, meaning that the total delay of chained operations scheduled in
the same clock cycle must not exceed the clock period. This in turn means that the tool must
generate delay estimates for the operations on the fly. In the scheduling phase, the tool makes
area-latency trade-offs by scheduling operations in parallel or sequentially, depending on the
optimization goals.

After a satisfactory schedule has been found, the HLS tool can allocate the hardware resources
and bind the operations to these resources, while again trying to meet the latency and resource
constraints.

Scheduling is the main optimization problem in HLS. If the goal is to minimize latency, the HLS
tool tries to execute as many operations as possible concurrently, and if the goal is to minimize
area, operations are scheduled in different cycles, allowing similar operations to share
resources.
6 Scheduling
Let's examine a simple scheduling example to make this task more concrete.

The slide presents two possible schedules for the sum-of-products computation from the
previous slide.

Remember that for a schedule to be feasible, the total delay of chained operations scheduled in
one clock cycle must be shorter than the clock period. In the data flow diagrams shown here,
the rectangles represent operations and the height of the rectangles is directly proportional to
their delay.

The first schedule is an example of a latency-constrained schedule. In this scheduling scenario,
the aim is to fit the operations in a given number of clock cycles while at the same time
minimizing resource needs. In this case the latency constraint is 5 cycles. It is obvious that two
multiplications per clock cycle are needed, because it is not possible to meet the latency goal by
scheduling just one multiplication per clock cycle. Two additions per clock cycle are also needed
on most cycles.

The second schedule represents a resource-constrained scheduling scenario. Now the aim is to
create the fastest possible schedule by using the given resources. In this case, the resource
constraint is three multipliers. The schedule can therefore have no more than three
multiplications in one clock cycle. It is possible to fit the computations into four clock cycles
under this constraint.

Understanding the latency and resource constrained scheduling scenarios is important because
most electronics products probably fall into one of these two categories.

Many consumer products, such as audio or video chips, have a strict schedule constraint
dictated by the required sample or frame rate, but they must be as cheap as possible to
manufacture.

On the other hand, some products, such as commercial CPUs, are usually designed with a
specific price category in mind, and the aim is to make them as fast as possible while keeping
the manufacturing cost below a predefined level. The resource constrained scheduling principle
is more suitable in these kinds of cases.

7 Resource Allocation and Binding

The other two main optimization tasks in high-level synthesis are allocation and binding.
After scheduling, the HLS tool must allocate the hardware resources that are needed to
implement the computational operations, and the registers in which data values are stored at
the end of every clock cycle.

In resource allocation, the aim is to allocate the resources so that similar operations scheduled
in different clock cycles can share a resource. This is already partially determined by the
schedule itself, but there is still room for some optimization. For instance, if a 16-bit multiplier is
needed in one clock cycle, and a 17-bit one on the next, it is better to allocate one 17-bit
multiplier instead of two different multipliers.

In register allocation, a register must be allocated for every data flow arrow that crosses the
clock cycle boundary. This process can be optimized based on data lifetime analysis. If a
register has been allocated for a data value on one clock cycle, but that data value is not used
any more on a later clock cycle, the register can be allocated for the storage of some other
value.

Resource binding is an optimization step, where the HLS tool assigns the operations scheduled
in a specific clock cycle to one of the allocated resources that is available for those types of
operations.

In register binding, data values are assigned to registers cycle-by-cycle the same way.

Binding is an important optimization task, because it determines how many multiplexers are
required for resource or register sharing.

If you examine the allocation chart shown on the right, you will notice that because the
operations a1, a2, a3 and a4 have been bound to the resource add1, this adder add1 will
always get its data from registers r1 and r2. The registers' outputs can therefore be directly
connected to the inputs of the adder add1. However, if the bindings of a2 and a4 to add1 and
add2 were swapped for instance on cycle C3, multiplexers would have to be added in the inputs
of the adders add1 and add2.

8 RTL Generation
After the architecture optimization tasks, scheduling, allocation and binding, the RTL model can
be generated. This includes adding the multiplexers that are needed for resource sharing,
creation of the control state machine, and eventually generation of the RTL code.

On the right you can see the datapath architecture created for the two-multiplier version of the
example design, and a state table of its controller.
9 High-Level Synthesis Process
This slide presents an overview of the high-level synthesis process from the user's point-of-view.

The input data an HLS tool needs consists of the code to be synthesized, and the constraints to
be used in synthesis. The constraints include the clock period, the synthesis goal, maximum
latency, resource constraints, and the technology library that is needed for generating area and
timing estimates.

In the first phase, the HLS tool compiles and optimizes the code. Both standard C++ and
hardware-oriented optimizations, such as constant propagation and bit-trimming, are done in
this phase.

After this step, the user can view the data flow graph, at least in some tools. This can help to
understand the properties of the algorithm and to optimize the code.

The next phase is called control and data flow graph transformations. In this phase, the user
can apply commands that change the CDFG somehow so as to make it easier to schedule.
Loop and array operations are typical transformation targets. We will discuss them in the
following slides.

In the architecture optimization phase the tool executes the optimization tasks that were
described in the earlier slides.

The final step is RTL architecture and code generation.

The overall usage pattern is similar to the one used with logic synthesis tools in that the user
defines the constraints after which the tool optimizes the design. However, the high-level
synthesis process typically contains the architecture exploration phase in which the user tries
out different alternatives by applying CDFG transformation directives. This phase is often done
interactively using the graphical user interface.

10 CDFG Transformations: Loop Unrolling and Pipelining


This slide presents the control and data flow graph transformations that can be used to control
how loops are handled in scheduling.

Loops are important constructs in hardware modeling. Remember that in the SystemC SC_CTHREAD
process, the whole design is modeled as an eternal loop. The design loop, in turn, can contain
other loops that describe sequentially executed operations.

HLS tools by default schedule one loop iteration in at least one clock cycle. The latency of the
processing loop in the example shown here is therefore 5 clock cycles and the throughput is 1
per 5 because of the 5-iteration 'for' loop. This form of the loop is called the 'rolled' form.
Loop unrolling is a transformation in which the loop is removed, and the loop's body is replicated
as many times as there are iterations in the loop. After this transformation, all operations of the
original loop can be scheduled freely, if there were no data dependencies between loop
iterations. In this case there weren't any, so everything can be executed concurrently in one
clock cycle. All other schedules in the range 1 to 5 cycles are also possible. Notice that loop
unrolling increases the number of operations that have to be scheduled, which increases
synthesis run time.

Loop pipelining is another powerful transformation. It means that the next iteration of a loop is
scheduled to begin before the previous one has ended. The user can choose the initiation
interval at which iterations are started. Loop pipelining is most often applied to the main
processing loop of the design and it implies the automatic unrolling of all inner loops.
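
The effect of unrolling can be visualized by writing out the loop body by hand; in practice the
tool performs this transformation when you apply an unroll directive, so the code below is only
an illustration written for these notes.

void rolled(const int a[5], const int b[5], int y[5]) {
    // Rolled form: by default at least one clock cycle per loop iteration,
    // so the latency of this 5-iteration loop is at least 5 cycles.
    MUL_LOOP: for (int i = 0; i < 5; i++) {
        y[i] = a[i] * b[i];
    }
}

void unrolled(const int a[5], const int b[5], int y[5]) {
    // Fully unrolled form: five independent multiplications with no
    // data dependencies, so the scheduler may place them all in one cycle.
    y[0] = a[0] * b[0];
    y[1] = a[1] * b[1];
    y[2] = a[2] * b[2];
    y[3] = a[3] * b[3];
    y[4] = a[4] * b[4];
}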

11 CDFG Transformation: Array Handling


Array transformations are another important group of CDFG transformations.

An array is a data structure that consists of a collection of elements, each of which is identified
by at least one array index. Arrays are commonly used in all kinds of computer programs. In
digital signal processing, which is the most important application area of high-level-synthesis,
arrays are used extensively to represent for instance filter data and coefficients.

In digital circuits, arrays can be implemented using registers or static RAM memory macrocells.
The implementation style chosen has a big impact on the scheduling of array operations. All
registers of a design can be written and read concurrently, but memory blocks typically allow
only one or two read and write operations to take place in one clock cycle.

In HLS, you can handle arrays in two ways. You can completely partition the array into variables
and implement each variable using a separate register, or you can implement the array using a
RAM block, or maybe first split the array into smaller arrays, and implement them using memory
blocks.

The example on the right shows how this affects scheduling.

The code fragment declares a 128-element array, and then moves data from one element to the
next in a 'for' loop.

If the array is mapped into separate registers, we can unroll the loop, and schedule it to be
executed in one clock cycle, effectively creating a shift-register structure. If we map the array
into the memory block shown on the right, unrolling is useless as we can only execute one
memory read or write operation per clock cycle. Execution of the loop will require at least 256
clock cycles.
We can conclude that array handling has a big impact on performance. A register-based
implementation will always be faster, but also more expensive, since the area of a flip-flop is
much larger than the area of one memory cell.
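
The fragment described above is roughly of the following form (a sketch written for these notes):

// 128-element delay line. If 'buf' is completely partitioned into
// registers, the loop can be unrolled and executed in one clock cycle,
// effectively creating a shift register. If 'buf' is mapped to a
// single-port RAM, each iteration needs one read and one write, so
// unrolling does not help.
void shift_buffer(int buf[128], int new_value) {
    SHIFT_LOOP: for (int i = 127; i > 0; i--) {
        buf[i] = buf[i - 1];
    }
    buf[0] = new_value;
}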

12 Siemens EDA Catapult HLS Tool


In the course project we use the Catapult HLS tool, whose graphical user interface is shown
here.

Catapult accepts design descriptions written in C++ or SystemC, and synthesizes them into
RTL with user-defined area, latency and resource constraints.

The user interface contains a task list that guides the user through the synthesis process
step-by-step. Before each step, the user can apply directives to control the specific task. The
user can view the schedule in a cycle-based Gantt chart.

In the project, we create three versions of the dsp_unit module.

The unconstrained version will produce a minimum-area and maximum-latency version of the
design. This will tell us how much the design would cost at the least.

The as-soon-as-possible version will produce a minimum-latency version, which tells us whether
the design is feasible at all in that it can meet the predefined latency constraint.

If the design seems to be feasible, then the third version is optimized with the latency constraint
to get an as-cheap-as-possible but fast-enough version.

The three versions can be synthesized easily by just moving back in the task list to the
architecture step, and then changing the tool settings values to those shown in the slide.

The RTL module that Catapult creates can be instantiated in the audioport design hierarchy just
like the manually coded RTL modules.
L11

1 What is UVM?

The topic of this lecture is UVM, the universal verification methodology. This slide lists the basic
facts and then tries to explain its use cases, and why it is needed in the first place.

Let's begin with the facts, shown on the left.

UVM is an open-source SystemVerilog-based object-oriented programming framework for
creating testbenches, managed by Accellera, who also manage SystemC.

UVM defines a class library for creating testbench building blocks, also called verification
components, and for modeling transaction-level-modeling-based communications between
them. The basic idea is therefore the same as with SystemC, only the purpose is different.

UVM allows you to create reusable verification IP blocks, also known as VIPs, that can be
reused between different verification environments, such as block-level and SoC-level
testbenches, and between different designs and even between different companies. This has
created a market for commercial VIP products.

A VIP typically packages all parts required for verifying one interface. Typical parts include a bus
driver and monitor, a test data sequencer, test sequences, analysis components, a predictor,
and a results scoreboard.

UVM is now the standard way for creating testbenches in most companies, which is why it is
very important to know it if you plan to work with SoC design and verification.

The right-hand side of this slide explains some use cases for UVM.

As we have mentioned earlier, SoCs consist of IP blocks, and most of the engineering effort
goes into designing and verifying the IPs.

Every IP block design needs a verification environment. It consists of verification components
that drive data from test sequences into the interfaces of the block, parts that measure
coverage, and parts that predict the expected results, to name the most important.

A block can have many separate interfaces, and each interface must have its specific
verification components. Some interfaces can be custom designed, requiring verification
components designed exclusively for it. Others may implement standard interfaces, allowing the
reuse of verification components from earlier projects or VIPs acquired from external parties.
The verification components attached to an IP block will often have to talk to each other, as
indicated by the gray data flow arrow between the interface driver blocks in the diagram. Some
kind of communication scheme is needed for this.

At this point we can probably agree that a standard way of writing the code for the parts of the
verification environment and for reusing existing code is a good idea.

When the IP block is installed in a SoC testbench, the need for standardization becomes even
more evident. Some interfaces of the IP block can still use the same verification components
that were used for block-level verification. Some interfaces, however, will be replaced by direct
connections with the SoC, which is why the verification component may not be needed at all or
it may have to be configured differently. We may also have to be able to create connections with
verification components in the SoC testbench that come from different teams or from
commercial VIP vendors, and we must be able to create these connections in a plug-and-play
manner. This is possible only if everybody implements their VIPs using the same methodology.

Without a standard methodology that supports reuse, most of the SoC testbench code would be
design-specific and large parts of it would have to be created from scratch for every new project.

2
As we mentioned in the previous slide, UVM is an object-oriented programming framework. We
therefore have to again begin by talking about classes. We already covered classes in the
SystemC lecture, so here we only highlight the differences, and some more advanced features
that were not discussed then.

The code box on the left shows a SystemVerilog class declaration. It's not much different from a
C++ class. The name of the constructor function is new in SystemVerilog, but otherwise the
declaration follows the same principles.

The two first code boxes on the right present class declarations that make use of inheritance,
which is an important object-oriented programming concept. The classes 'triangle' and 'circle'
are declared to extend the base class geom_object with some additional members. Objects
created from these two classes will therefore 'inherit' the members of the base class, and also
contain the members added in their class declarations. Derived classes can also 'override' a
member function of their base class by defining their own version of it. The size function in the
triangle and circle classes is an example of this. This feature is known as polymorphism.

The program in the bottom-right corner shows how objects are created and used in
SystemVerilog. The code declares a variable x whose type is circle. This creates an empty
object handle, but not the object itself. The object is created with the new operator on the next
line. Only at this phase will the memory be allocated for the object and the variable x become a
valid object handle.

The function calls on the next two lines demonstrate how inheritance and polymorphism work in
practice.

The x.r_move_to() function call causes the member function of the base class to be executed,
because the derived class circle did not provide its own implementation for this function.
In the second case, the size function is picked up from the derived class.
These two examples highlight the power of object-oriented programming: the code that uses the
objects does not have to know anything about how things must be done with different types of
objects.

3 UVM Looks Very Complicated. How Do I Start?


After these preliminaries, we can begin to tackle UVM.

If you browse through UVM documentation, or even some tutorials, your first impression is
probably that it looks quite complicated. This slide gives some tips on how you can get started.

The essential things to learn first are the following.

You have to know which are the most important verification component classes and what are
these components used for.

After that you have to learn how a testbench is built from these components.

When you know what a complete testbench looks like and is made of, you have to learn how it
can be used to execute tests.

Along the way you should also learn how the components communicate with each other when a
test is running.

Once you understand the basics, it becomes easier to adopt more detailed information from
different sources.

When you are using any object-oriented class library, the most important document you must
have at hand is the class reference manual from which you can easily check what kind of things
different kinds of classes can do.

On the right you can see a simplified UVM class hierarchy diagram. The base class of all UVM
classes is uvm_object, from which the class uvm_report_object is derived. The class
uvm_component in turn is derived from the uvm_report_object.
uvm_component is an important class, because it is the base class of all the component classes
that are used in testbenches. Some of the most important member functions are defined in this
class.

The second branch of the hierarchy contains uvm_sequence_item and uvm_sequence, which
are the base classes for test data.

The class diagram helps you navigate the documentation of different classes. If you want to
learn what you can do with for instance an uvm_agent you can look up its documentation. If you
cannot find what you want there, you can look in the base class. The function you need might
be defined there. For reporting functions, you will need to climb all the way up to the
uvm_report_object class.

In this lecture we'll just introduce the most important component classes, and describe their
basic usage. It will then be easier to search for more information in the documentation, or on the
Internet, when you need it.

5 UVM Testbench Structure and Verification Components


We begin our UVM sight-seeing tour by taking a look at the big picture shown here. It shows the
composition of a complete UVM testbench. Let's study it step by step, following the numbers.

Just like a basic SystemVerilog testbench, the UVM testbench must have a top-level
SystemVerilog module, which is the number 1 in this picture. The purpose of this testbench
module is first of all to do all the usual things, like instantiate the RTL design and maybe
generate clock and reset signals for it. For UVM, the testbench must also instantiate
SystemVerilog interface objects, which is also often done in conventional testbenches.

The only UVM-specific thing the testbench module must do is to execute the run_test task from
an initial procedure. This will create the UVM component hierarchy and execute the test.

The component hierarchy is presented in the left-most rectangle, inside the TEST component
derived from the uvm_test class, indicated by the number 2. This component has two
responsibilities: it must first create the environment component that will contain all components
needed by this specific test, and then create test sequences and run them on uvm_sequencer
objects.

Component number 3 is the environment, which is derived from the uvm_env class. As said, all
the components needed in the test will live inside this object.

Component number 4 is an agent, derived from the uvm_agent class. An agent is a container for
all components that are used to handle one specific interface of the design under test. A typical
example is an agent that handles some bus interface. In this example we have only one agent,
but in an SoC-level testbench there can be many. All the active verification components that
actually do something are created inside the agents.

Inside the agent, the component that communicates with the design under test is the driver,
number 5, derived from the uvm_driver class. In a typical use case, the driver gets access to an
interface object created in the testbench module, and uses it to write and read those input and
output ports of the design that belong to the bus or other interface the agent handles.

Component number 6 is a uvm_sequencer that is connected to the driver with a TLM
connection. When a test is running, the driver requests data packets called transactions from
the sequencer, and converts them to real bus transactions using the interface object. The
sequencer therefore does not have any information about the interface ports of the design or the
protocol used on those ports. This information stays inside the driver.

Number 7 indicates sequence objects, derived from the uvm_sequence class. A sequence is a
test program that generates the transactions for the sequencer. After the UVM testbench has
been created, most of the design effort focuses on sequence design, as the comprehensive
verification of an IP block or a complete SoC requires a large number of sequences.

Component number 8 is a monitor, based on the uvm_monitor class. Its purpose is the opposite
of the driver. It monitors the design's interface ports and extracts transactions by decoding the
interface protocol. This information can then be used for analysis purposes. In this example the
monitor is connected to an analysis component, indicated by number 9. A monitor is not
required in an agent, but it is useful for instance if you want to create a reference model,
sometimes called predictor, that has to get all the same input data that the design gets.

The UVM hierarchy is created in the beginning of the simulation. When the run_test task is
executed, the UVM framework creates the test object. After that, every UVM component must
create the components that it contains. This way the component hierarchy is created top-down
inside the first time-step of the simulation.

An important thing to notice is that the UVM component hierarchy, the left-hand-side of the
diagram, is untimed. This means among other things that it does not use the clock. Its execution
is synchronized with the simulation time when the drivers access the design's interfaces for
instance by calling the interfaces' tasks that block until read or write operations have completed.

6 How Is an UVM Component Class Defined and How Does It Work?

Now it is time to see some code. This slide describes how a UVM component class is declared,
what parts it must contain, and how these parts are used in simulation.

The code box presents the declaration of an agent class. Let's examine it line by line.
The first line declares the class apb_agent as a derived class whose base class is uvm_agent.
So far so good.

On the second line we already come across a UVM peculiarity, a utility macro. UVM has a lot of
macros that you must use to get the testbench to work, and the uvm_component_utils macro is
one of them. Its purpose is to register a UVM class with the UVM factory, which is an internal
mechanism of UVM that allows the creation of UVM components in flexible ways. You can take
two positions on these UVM peculiarities: you can study and learn their purpose thoroughly, or
you can take them as things you just have to remember to include in the code without thinking
too much. You can get started perfectly well with the latter approach.

The next four lines declare member variables.

The types of these variables are names of UVM classes defined elsewhere. The variables
themselves are empty object handles at this phase, as the class instances have not been
created yet.

The first member function 'new' is the constructor. The keyword 'super' refers to the base class
of the current class. In UVM, you must always call the constructor of the base class of your
derived class.

The next two functions are called build_phase and connect_phase.

To understand their purpose, turn your attention to the right-hand side of the slide, which
explains the UVM mechanism called phasing.

When UVM simulation is started by executing the run_test task, the UVM framework takes over.
It controls the simulation by stepping through a set of phases. In each phase, certain actions are
executed.

There are a large number of phases, but the most relevant ones are the build_phase,
connect_phase and run_phase. Every UVM class can, and sometimes must, declare a phasing
function that will be executed automatically when simulation enters the specific phase. This
means that when simulation enters the build phase for instance, the build_phase functions of
components that have been created so far are executed.

We can now have another look at the build_phase function of the apb_agent class.
The first statement is a call to the base class' build_phase function. This is again a UVM
requirement.
On the next four lines, the components used inside the apb_agent are created.
This is done in a very peculiar way that calls for an explanation.
First of all, it would be perfectly OK to call the new method of these classes to create the
objects, just as was done in the SystemVerilog class introduction at the beginning of this
lecture.
The reason for using this complicated-looking way is again flexibility. Each of these
type_id::create calls asks the UVM factory to create an object of the specified type.
This allows the factory to actually create an object of a different type, if the user has configured
the factory to do so. This is called a factory override. It can be useful sometimes if you want to
change the type of a component to be created on-the-fly during simulation. This is why you
should use this method to create UVM components.

The build_phase function highlights an important principle. Every UVM component must create
its children. This creates new components that must be initialized in the same way. The build
phase continues as long as there remain components whose build_phase function has not been
executed. Since the first component to be built is the test component, the component
hierarchy is built from top to bottom.

The third function in the class is the connect_phase function. The purpose of this function is to
create the TLM connection between the child components that require a connection. The
connection is made by calling the connect function of a TLM connector object of one component
and passing the TLM connector object of the other component as an argument. This will be
described in detail later.

In this example, the driver is connected with the sequencer, and the monitor is connected with
the analyzer.

The UVM calls the connect_phase functions starting from the bottom-level components and
proceeding up in the hierarchy.

This agent class does not have a run_phase function. It is a container-type class that doesn't
have anything to do in the run phase, so a run_phase function is not needed.
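
Here is a condensed sketch of what such an agent class can look like. The member names and the sub-component classes (apb_sequencer, apb_driver, apb_monitor, apb_analyzer) are illustrative, and we assume uvm_pkg has been imported and uvm_macros.svh included.

class apb_agent extends uvm_agent;
  `uvm_component_utils(apb_agent)           // register the class with the UVM factory

  apb_sequencer m_sequencer;                // empty handles until build_phase runs
  apb_driver    m_driver;
  apb_monitor   m_monitor;
  apb_analyzer  m_analyzer;

  function new(string name, uvm_component parent);
    super.new(name, parent);                // always call the base-class constructor
  endfunction

  virtual function void build_phase(uvm_phase phase);
    super.build_phase(phase);               // UVM requirement
    m_sequencer = apb_sequencer::type_id::create("m_sequencer", this);
    m_driver    = apb_driver::type_id::create("m_driver", this);
    m_monitor   = apb_monitor::type_id::create("m_monitor", this);
    m_analyzer  = apb_analyzer::type_id::create("m_analyzer", this);
  endfunction

  virtual function void connect_phase(uvm_phase phase);
    m_driver.seq_item_port.connect(m_sequencer.seq_item_export);  // driver <-> sequencer
    m_monitor.analysis_port.connect(m_analyzer.analysis_export);  // monitor -> analyzer
  endfunction
endclass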

7 Transaction-Level Modeling Based Communication Between UVM Components

In the previous slide we saw how UVM components are defined using classes, then created in
the build_phase, and finally connected to each other in the connect_phase. These connections
are used in the run phase to pass data between the components, such as the sequencer and
driver in the example we saw earlier. This slide explains how this transaction-level model-based
communication works in practice in UVM.

Let's begin by defining what we mean by the term transaction.

In UVM, a transaction is an object of a class derived from the uvm_sequence_item base class.
It has member variables that represent the data in the transaction. You can see an example in
this slide on the left.
You can think of transactions as small packets that carry data and information that describes the
data. TLM communication is about moving these packets around inside a system model.

UVM components can have connectors that can send and receive transactions of the type given
as a parameter. The variable put_port, whose declaration is shown here, represents a connector
derived from the parameterized uvm_blocking_put_port class. The parameter defines the type
of the transactions this port can handle. In this case the type is the tx transaction class defined
above.

You can create any number of connectors inside a component by just declaring member
variables whose types are UVM connector classes. In some UVM base classes, some essential
connectors have already been declared, which means that there is no need to create them in
the user's class.

There are basically two kinds of TLM connectors in UVM.

The connectors that can initiate a transaction are called ports.

Connectors that execute the transactions are called exports. In practice, export connectors can
be either 'imps' or 'exports'. Imp-type connectors implement the transaction by actually receiving
the data in the destination component. Export connectors just forward the transaction from a
parent component to a child component.

As mentioned in the previous slide, you create a connection between two components by using
the connect member function of the connector object that is the initiator of communications. This
means that you don't have to define any signals or similar objects to represent the connections
in a physical sense. The connection is made in the connect_phase function of the component
that contains the components to be connected.

The example on the right shows the code that creates a connection between the put port of the
component producer and the put export of the component 'consumer'.

When the connection has been made, it can be used in the run_phase. In the case of a put-kind
of connection, this is done by calling the put member function of the put port from some function
or task inside the producer component with a handle to a transaction object as argument.

Let's assume for simplicity that the export connector put_export is an 'imp' type connector. This
means that the component 'consumer' is the actual receiver of the data, and not a hierarchical
component that would just pass the transaction to its child.

When the initiator's 'put' function has been called, UVM executes the transaction by calling the
'put' member function of the component that contains the put export, again passing the
transaction object handle as the argument. The user must provide the code for this 'put' function
in the class declaration of the component that has the export connector. The code can just copy
the data to a safe place for later use, or process it and write out the results through some other
TLM connector, or do any other tasks required by the testbench functions. Most of the
run-phase activity in a UVM testbench is inside these functions that implement the TLM
communications.

To summarize this slide, the things you have to do to create a TLM connection are to first define
the transaction class, then create the TLM connectors parameterized for the transaction type
inside the components, and finally create the connection in the connect_phase function of the
component that contains the components at both ends of the connection. After that you have to
write the code for the function that actually executes the transaction in the component that
receives the data.
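
A minimal sketch of these steps is shown below. The class names tx, producer and consumer and the field names are illustrative, and uvm_pkg and the UVM macros are assumed to be available.

class tx extends uvm_sequence_item;         // the transaction class
  `uvm_object_utils(tx)
  logic [31:0] addr;
  logic [31:0] data;
  function new(string name = "tx");
    super.new(name);
  endfunction
endclass

class producer extends uvm_component;
  `uvm_component_utils(producer)
  uvm_blocking_put_port #(tx) put_port;     // initiator-side connector
  function new(string name, uvm_component parent);
    super.new(name, parent);
    put_port = new("put_port", this);
  endfunction
  virtual task run_phase(uvm_phase phase);
    tx t = tx::type_id::create("t");
    t.addr = 'h10;
    t.data = 'hABCD;
    put_port.put(t);                        // initiate the transaction
  endtask
endclass

class consumer extends uvm_component;
  `uvm_component_utils(consumer)
  uvm_blocking_put_imp #(tx, consumer) put_export;   // terminates the connection
  function new(string name, uvm_component parent);
    super.new(name, parent);
    put_export = new("put_export", this);
  endfunction
  virtual task put(tx t);                   // called by UVM when the producer puts
    `uvm_info("CONSUMER", $sformatf("got addr=%0h data=%0h", t.addr, t.data), UVM_LOW)
  endtask
endclass

// In the connect_phase of the component that contains both:
//   m_producer.put_port.connect(m_consumer.put_export);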

8 How Does the UVM Testbench Talk to the DUT?


We have now given an overview of how UVM components are defined and created, connected
to each other, and then made to talk to each other. In this slide we examine techniques that are
used to implement the communications between UVM and the design under test, the DUT.

Remember from earlier slides that while the UVM component hierarchy is untimed, it is
synchronized with the design and its clock in the driver component. UVM components also don't
have direct access to the ports of the DUT, like test programs in conventional testbenches have,
and the access must therefore be specifically established inside the driver.

Let's begin with the time issue. As said, the UVM hierarchy is untimed and does not use the
clock. Most of the activity is defined in functions that in SystemVerilog cannot consume time,
which means that they cannot have timing controls such as delays or waits. The run_phase
member of the uvm_driver, however, is defined as a SystemVerilog task, and tasks can consume
time. In practice this means that if the run_phase task is blocked for some reason, it
cannot request new data from the sequencer, which in turn blocks the sequencer, and this in
turn blocks the sequence the sequencer is executing. This way the affected parts of the UVM
testbench stop and wait until the driver is ready to continue.

The drawing on the left-hand-side of the slide shows what happens in practice. The driver
requests transactions, one at a time, from the sequencer through its 'get' port. It generates an
actual bus transaction using a SystemVerilog interface, either by controlling the signals of the
interface directly, or by calling interface tasks like the 'read' and 'write' in this example to do the
same thing. This pin wiggling takes time, as the code has to wait for clock edges, and for status
signals to become true, and so on. These waits will block the run_phase task.

The obvious question now is how does the UVM driver get access to the SystemVerilog
interface, which has been created in the testbench module.

The solution is again a little bit tricky, and one of those things that you just have to know.
The procedure begins in the testbench module. After the interface object has been created, and
its signals have been connected to the DUT ports, the interface object's handle is pushed into
the UVM configuration database, which is again a UVM feature.

In the code example, the name uvm_config_db refers to the database object, and the name 'set'
following the double-colon scope resolution operator is a member function of the database class
that can be used to save any type of data in the database. In this case, the type parameter
defines the data to be of the virtual interface apb_if type. A virtual interface is a kind
of a pointer that represents the actual interface object instance. The virtual interface 'apb' is
saved under the key 'KEY' in the database. Anyone who knows this key can get the virtual
interface handle from the database by using the 'get' function.

This is what the driver component has to do. It must first declare a variable of the virtual
interface type, and then get the interface handle from the database into this variable using the
get method of the database. This can be done in the build_phase or in the beginning of
run_phase, for instance.

After the virtual interface variable has been initialized this way, the run_phase task of the driver
can use it to access the signals and tasks of the real interface. The UVM component hierarchy
is now completely connected to the design's module hierarchy. Information about the DUT's
interface is isolated inside the driver component. If the UVM testbench had a monitor
component, it would also have to access the virtual interface the same way.
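
Here is a rough sketch of both sides of this arrangement. The key string "apb_vif", the transaction type tx, the message texts and the omitted DUT instantiation are illustrative; the apb_if interface is assumed to be defined elsewhere.

// Testbench module side:
module top;
  import uvm_pkg::*;
  apb_if apb ();                            // interface instance, connected to the DUT ports
  // DUT instantiation and clock/reset generation omitted
  initial begin
    // publish a virtual-interface handle under the key "apb_vif"
    uvm_config_db #(virtual apb_if)::set(null, "*", "apb_vif", apb);
    run_test();                             // start UVM
  end
endmodule

// Driver side:
class apb_driver extends uvm_driver #(tx);
  `uvm_component_utils(apb_driver)
  virtual apb_if vif;                       // a "pointer" to the real interface instance
  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
  virtual function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    if (!uvm_config_db #(virtual apb_if)::get(this, "", "apb_vif", vif))
      `uvm_fatal("NOVIF", "apb_if handle not found in the configuration database")
  endfunction
  // run_phase would now use vif to drive and read the DUT's bus signals
endclass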

9 Test Data Generation with UVM Sequences


We have now covered the UVM components and their TLM and DUT connections, and are now
ready to examine how we can get test data moving in the testbench.

Remember that the testbench has the driver that talks to the DUT, and a sequencer that
provides the data for the driver. The sequencer gets the data transactions from a sequence,
which is where the test data is created.

The base class of sequences is the uvm_sequence class. You can create your own sequence
class by extending this base class. The most important extension is the 'body' task. When a
sequence is executed on a sequencer, the body task gets started, and all transactions are then
created in it. You must always define the body task in your sequence class.

The body task of a sequence works more or less like a SystemVerilog 'initial' procedure inside a
program block. It consists of sequential statements that can do
whatever program statements in general can do to generate data. When the data is ready to be
sent to the DUT, it must be packaged as UVM transactions.
The code box on the left shows the essential UVM specific parts of the body task.
The transactions to be generated are UVM transaction objects that you must create just like any
other objects. In this example, the code first declares a variable of the tx transaction type, and
then initializes this variable by using the UVM factory create method that was discussed earlier.

When the transaction object has been created, its member variables can be initialized with the
test data you want to send to the DUT, as shown in the code.

The transaction is given to the sequencer by first calling the task start_item and then the task
finish_item. Two calls are needed because of the handshaking between the sequence and the
sequencer on one hand, and the sequencer and the driver on the other. The call to start_item
will block until the driver requests data from the sequencer, and finish_item will block until the
driver has acknowledged that it has received the transaction.

After sending the transaction on its way, the sequence can generate the next one, and repeat
the handshaking procedure. You can use the same transaction object you created in the
beginning all through the sequence, if it is OK to change the values of its member variables
without causing problems in other components that could have received the transaction
somehow. You can also create a new transaction object every time before you send the next
one. SystemVerilog has built-in garbage collection for class objects, which means sequence
items are automatically deleted once no references to them remain.
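
A minimal sketch of such a body task, assuming the tx transaction class from before and illustrative field values:

class apb_write_sequence extends uvm_sequence #(tx);
  `uvm_object_utils(apb_write_sequence)
  function new(string name = "apb_write_sequence");
    super.new(name);
  endfunction
  virtual task body();
    tx t;
    t = tx::type_id::create("t");           // create the transaction object
    repeat (8) begin
      t.addr = 'h0;                         // fill in the test data
      t.data = $urandom();
      start_item(t);                        // blocks until the driver requests an item
      finish_item(t);                       // blocks until the driver has taken the item
    end
  endtask
endclass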

With the sequence ready, all that remains is to create a test class, which will then execute the
sequence on a sequencer. The right-hand side of this slide presents the relevant part of a test class.
The build_phase and connect_phase functions that create the UVM environment have been
omitted from this example.

The test class must declare the run_phase task. This task must first create the sequence, and
then execute the sequence on a sequencer that exists in the UVM component hierarchy. The
sequence execution statement is highlighted with a yellow background. The sequence is started
by calling its 'start' function with the hierarchical path of the sequencer component as the
argument. The hierarchical path consists of the member variable names of the components in
the hierarchy that lead to the sequencer object.

The raise_objection and drop_objection function calls control a UVM mechanism that prevents
the simulation from being finished by some component as long as this test is running.
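
The run_phase of such a test class could look roughly like this. The environment and sequencer member names used in the hierarchical path are illustrative, and the build_phase that creates the environment is not shown.

class example_test extends uvm_test;
  `uvm_component_utils(example_test)
  example_env m_env;                        // created in build_phase (not shown)
  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
  virtual task run_phase(uvm_phase phase);
    apb_write_sequence seq;
    phase.raise_objection(this);            // keep the simulation alive
    seq = apb_write_sequence::type_id::create("seq");
    seq.start(m_env.m_apb_agent.m_sequencer);   // hierarchical path to the sequencer
    phase.drop_objection(this);             // allow the simulation to finish
  endtask
endclass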

10 Summary of UVM Testbench Creation


We have come to the end of this introductory lecture on UVM. In the next lecture we will discuss
some advanced topics.
The procedure shown on the left-hand-side of this slide outlines the main steps of UVM
testbench creation. Even though UVM now probably feels quite complex, the testbench creation
process itself is very systematic because of its object-oriented nature. You just follow the
procedure, declare the classes you need, and then just fill in the blanks, that is, the
phase functions. After that you can begin to write sequences, and use the same testbench in
many, many tests.

The right-hand side of the slide has some tips for debugging UVM coding errors.

For a beginner, sorting out the compilation errors caused by incorrect syntax, type conflicts and
the like can be a challenge, but once that's done, some run-time errors that emerge in
simulation can cause a new kind of headache.

If you've only written RTL code up until now, you haven't had to worry about memory
management issues. This changes with UVM. Most UVM objects are created dynamically using
the 'new' constructor or the factory create method. If you forget to initialize the variable of the
UVM object in either of these ways, you will have a null handle, a pointer variable that points
nowhere. It can cause testbench or even simulator crashes when you try to use its members or
pass it as an argument to some UVM function. If you see a Linux segmentation violation error
message in the simulator's log, this is the most probable cause.

The second issue that causes trouble for beginners is that all TLM connections must be created
properly. If you forget to create a connection altogether, or use the port and export objects in the
wrong order, the simulation will end unexpectedly, and you will see some kind of UVM run-time
error message in the simulator's log, provided that you know where to look.
L12 UVM Continued

2 UVM TLM Connections

In this lecture, we take a closer look at some features of UVM, such as TLM communications
and sequence creation. We shall begin with TLM topics.

This slide presents a summary of the most commonly used TLM connector types, and therefore
also communication styles, that are available in UVM. We can divide these into two categories based
on the port types: regular TLM ports and analysis ports.

Regular TLM ports are meant for point-to-point communications between two components. This
means that both ends of the connection must always be connected to a connector. Leaving a
regular port connector unconnected results in a run-time error in simulation.

Based on their usage model, regular ports can be further divided into two groups, 'put' ports and
'get' ports.

When put-type ports are used, the component that has a put-port creates a transaction and
sends it out by calling the put-port's 'put' method. At the other end, the owner of the connected
'put' export, which must actually be an 'imp' type export, must implement a 'put' method, that is,
a member function called 'put', to copy the received transaction.

'Get' ports work the other way round.

The component that has a get-type port calls the port's 'get' method with an empty transaction
handle as an argument. At the other end, the owner of the connected 'get imp' must implement
a 'get' method, and create a transaction and assign it to the transaction handle it received.

So with put ports, you can push data down-the-line to the next component, while with get ports
you can pull data out of an upstream component. This difference is indicated in TLM diagrams
by the connector symbols: regular ports are represented as squares and exports or
imps as circles. The type of the connection, put or get, can be determined from the direction of
the arrow.

The second group of TLM ports is analysis ports.

These ports are used for one-to-many connections, where the many can also be zero. Analysis
ports can therefore be left unconnected. If a component class declares an analysis port variable,
it must create it as usual, but it is not necessary to connect it anywhere in the connect-phase if
the connection and the associated functionality is not needed in the current testbench
configuration.
The member function used with analysis ports is called 'write'. The owner of an analysis port
must create transactions and write them to the analysis port using the 'write' method. The write
is non-blocking, which means that the sender does not stop to wait until the data has been
received at the other end, as there could be nobody there or there could be many receivers.

The owner of an export connected to an analysis port must implement the 'write' method to copy
the received transaction.

The symbol for analysis ports is a diamond shape in TLM block diagrams.
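
As a small sketch, a monitor broadcasting transactions and a component receiving them could look like this. The class and member names are illustrative, and the protocol decoding that consumes simulation time is only indicated by a comment.

class i2s_monitor extends uvm_monitor;
  `uvm_component_utils(i2s_monitor)
  uvm_analysis_port #(tx) analysis_port;    // one-to-many, may be left unconnected
  function new(string name, uvm_component parent);
    super.new(name, parent);
    analysis_port = new("analysis_port", this);
  endfunction
  virtual task run_phase(uvm_phase phase);
    tx t;
    forever begin
      t = tx::type_id::create("t");
      // decode one frame of the interface protocol here (waits on clock edges
      // through a virtual interface; details omitted)
      analysis_port.write(t);               // non-blocking broadcast
    end
  endtask
endclass

class i2s_subscriber extends uvm_component;
  `uvm_component_utils(i2s_subscriber)
  uvm_analysis_imp #(tx, i2s_subscriber) analysis_export;
  function new(string name, uvm_component parent);
    super.new(name, parent);
    analysis_export = new("analysis_export", this);
  endfunction
  virtual function void write(tx t);        // called once for every broadcast item
    // copy or process the received transaction here
  endfunction
endclass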

3 TLM Connector Classes


This slide presents a complete list of the connector classes that are available in UVM.

For each connection type, a port, export and imp class is defined, as indicated in the three
columns of the table. The export classes are used to implement hierarchical connections. These
are discussed in the next slide.

If you look at the lists, you will notice the 'put' and 'get' type interfaces there, but in different
versions. There is also the 'peek' interface type we have not mentioned yet.

All interface types are available as blocking and non-blocking versions. The uvm_blocking
interfaces only need the 'put' or 'get' method to be defined by the user. For uvm_non_blocking
interfaces, the methods try_put and try_get must be defined. These methods should return the
value 1 if the transaction was executed successfully, and 0 if it wasn't. Non-blocking
interfaces also need the can_put and can_get methods to be defined. These can be used to
check whether a transaction can be executed, without actually executing it.

For the most general uvm_put and uvm_get type interfaces, support for both blocking and
non-blocking operation should be implemented.

The 'peek' interfaces mentioned in the table use the 'peek' method instead of 'get'. These
interfaces are used for the same purpose as get interfaces, except that the retrieved
transactions are not consumed. This means that successive calls to peek will return the same
object. The get-peek interfaces support both get and peek functionality.

4 Hierarchical Connections
TLM transaction data is usually produced and consumed in components that are at the
bottom-level of the UVM testbench hierarchy. However, the producer and consumer can
sometimes be inside two different hierarchical components, for instance inside an agent and a
scoreboard. Neither of these parent components can therefore create the connection, because
it is not completely within their scope.
In this example you can see this kind of case. Here we want to connect port A of the 'producer'
component to the 'imp' connector D of the 'consumer' component. 'Producer' is inside the
component vc1 and the consumer inside vc2, so neither vc1 nor vc2 can make the connection in
their connect_phase method.

The solution is to create a port connector in vc1, and an export connector in vc2. After that the
connections can be created.

The port in vc1 will be a hierarchical port that just forwards transactions from a connected lower
level port of the producer. This is possible with ports.

At the other end, an export connector is needed in vc2 to forward transactions down to the 'imp'
connector that terminates the connection in the consumer. This explains why two different kinds
of connectors are needed. The owner of an export does not have to provide the method that
executes the transaction but the owner of an 'imp' does.

If we follow the route starting from port A of 'producer', this port must first be connected to port
'B' created for this purpose in the parent component vc1. The connection must be done in the
connect_phase method of vc1. The code is shown in the box at bottom-left.

The next part is the connection between port B of vc1 and export C of vc2. This must be done in
the connect_phase of the class that contains vc1 and vc2. The code is shown inside the box at
the top. We assume that the variables m_vc1 and m_vc2 are the object handles of the
respective components in the class declaration of the container class.

The final part of the route goes from the export C of vc2 to the imp D of consumer. This
connection must be made in the connect_phase of vc2, as shown in the box at bottom-right.

With these connections, everything works as with direct connections. If the 'put' method of port
A of 'producer' is called, the 'put' method of 'consumer' will be executed by UVM to complete
the transaction.

For hierarchical connections, the rule is to use the 'connect' method of the connector that
initiates the transaction.
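
Put together, the three connect_phase methods could look roughly like this. The connector and handle names (a_port, b_port, c_export, d_imp, m_producer, m_consumer, m_vc1, m_vc2) are illustrative, not the actual names in the slide's code boxes.

// In vc1's class (vc1 contains 'producer' as m_producer and declares b_port itself):
virtual function void connect_phase(uvm_phase phase);
  m_producer.a_port.connect(b_port);        // A -> B: child port to parent port
endfunction

// In the class that contains vc1 and vc2 (as m_vc1 and m_vc2):
virtual function void connect_phase(uvm_phase phase);
  m_vc1.b_port.connect(m_vc2.c_export);     // B -> C: port to export
endfunction

// In vc2's class (vc2 contains 'consumer' as m_consumer and declares c_export itself):
virtual function void connect_phase(uvm_phase phase);
  c_export.connect(m_consumer.d_imp);       // C -> D: export down to the imp
endfunction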

5 audioport_uvm_test
This slide shows a block diagram of the UVM environment that is used to execute the
audioport_uvm_test in the course project.

As you can see, it contains many hierarchical components: three agents that handle all
interfaces, and a scoreboard that contains a reference model of the audioport. All functions of
the audioport can be verified with this UVM environment.
Let's examine the components more closely.

The agents are shown on the right, next to the SystemVerilog interface objects that are used to
access DUT ports. These are passed on to the UVM driver and monitors as virtual interfaces
using the UVM configuration database.

The i2s_agent contains a monitor component that converts the serial audio data into
transactions that it then writes into its analysis port. The analysis port is connected to an
analysis port in the i2s_agent through a hierarchical connection. This allows other components
to read I2S data from this agent.

The agent in the middle is the control_unit_agent. It controls the APB bus, and contains a
sequencer and a driver for sending APB transactions to the DUT's bus interface. The agent also
has a monitor that reads the bus signals and extracts APB transactions from them, which it then
writes to its analysis port. This analysis port is connected to an analysis port in the agent in this
case too.

At this point, notice how the control_unit_agent provides the input data stream, the bus
transactions, to the scoreboard, while the i2s_agent provides the audioport's output data
stream. The scoreboard's job is to judge from these, whether the I2S data stream is what it is
supposed to be.

The third agent, the irq_agent, contains a monitor that reads the interrupt signal values and
writes out transactions into its analysis port. This analysis port is also connected to an analysis
port on the agent level. This port is routed to the control_unit, which gets notified of interrupts
this way.

The purpose of the scoreboard component is to check simulation results. The scoreboard class
is derived from the uvm_scoreboard base class, which as of now exists mostly for documenting
its purpose rather than providing ready-to-use scoreboard functionality in the form of member
variables and functions.

The scoreboard has two child-components.

The audioport_predictor class is derived directly from the generic uvm_component base class. It
contains a functional model of the audioport, though without the I2S serializer. It receives APB
transactions from the control_unit_agent's analysis port and predicts the I2S sample values from
these.

The audioport_comparator is also derived directly from the uvm_component base class. It
receives parallelized audio samples from the i2s_agent and compares them with the predicted
samples it obtains from the predictor component.
As you can see from the block diagram, this testbench contains many hierarchical TLM
connections. Many kinds of TLM interfaces are also used, including blocking and non-blocking
ports, and analysis ports.

6 TLM Communications: Predictor


Let's now have a closer look at the audioport_predictor component.

Remember that the predictor receives APB transactions through its imp type analysis_export
from the control unit, as shown in the block diagram close-up on the right. The declaration of the
'uvm_analysis_imp' type member variable is shown on the left. It requires two parameters, the
type of the transaction and the type of the parent component.

When the analysis imp receives a transaction, the write function declared in the predictor class
is called. It implements the functions of the control_unit, including command decoding.

You can see the part of the code of the write function in the code box that decodes the start
command from the bus data. It creates a new I2S transaction, enables the play mode, fills the
transaction with zeroes and pushes it to a queue array.

A SystemVerilog queue is a dynamic array that grows and shrinks automatically and allows
addition and removal of elements anywhere. It is very useful in buffering test data.
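
A sketch of the relevant predictor code is shown below. The transaction class names, the register address, the command code and the audio_data field are illustrative and do not reproduce the actual course code.

class audioport_predictor extends uvm_component;
  `uvm_component_utils(audioport_predictor)

  // imp-type analysis export: transaction type and parent type as parameters
  uvm_analysis_imp #(apb_transaction, audioport_predictor) analysis_export;

  i2s_transaction predicted_q[$];           // queue used to buffer the reference data
  bit play_mode = 0;

  localparam logic [31:0] CMD_REG_ADDR = 'h0;   // illustrative register address
  localparam logic [31:0] CMD_START    = 'h1;   // illustrative command code

  function new(string name, uvm_component parent);
    super.new(name, parent);
    analysis_export = new("analysis_export", this);
  endfunction

  // called automatically when the connected analysis port writes a transaction
  virtual function void write(apb_transaction t);
    i2s_transaction it;
    if (t.addr == CMD_REG_ADDR && t.data == CMD_START) begin  // decode the start command
      play_mode = 1;
      it = i2s_transaction::type_id::create("it");
      foreach (it.audio_data[i]) it.audio_data[i] = '0;       // fill with zeroes
      predicted_q.push_back(it);                              // buffer it for the comparator
    end
  endfunction
endclass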

7 TLM Communications: Comparator


This slide looks at things from the comparator's point of view.

The audioport_comparator receives i2s_transactions from the analysis port of the i2s_agent
with its write method. This works the same way as the analysis port connection explained in the
previous slide.

When the write method of the comparator is executed, it must obtain a reference I2S transaction
that it then can compare with the transaction it just received. It can get the reference data from
its non-blocking 'get' port, represented by the predictor_get_port variable shown on the left.
This port is connected to the respective port in the predictor. Because this is a non-blocking port
we must use the try_get method.

A non-blocking port is used here just for demonstration purposes, but it can be useful in general,
if you are not sure when some component is ready to do transactions and don't want to block
other components.
With both the DUT and reference data transactions available, the comparator's write function
can compare them and report the results. There are many ways to compare transactions in
UVM. The simplest is to just compare member variables one by one. You can also write your
own comparison routine, for instance by overriding the transaction class's do_compare method
so that two transactions can be checked with a single compare() call. This is useful if there are
many comparisons in the code.

The predictor's side is shown in the code box on the right. This is the implementation of the
try_get method in the predictor class. This function will be called, when the comparator executes
the try_get function call.

The code pops the next transaction out of the queue and assigns it to the transaction object it
received. It will now be available in the comparator. After that, the code calls the do_dsp
function, which models the functions of the dsp_unit in the predictor's reference model. The
do_dsp function pushes the next reference transaction into the queue.
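
A sketch of both ends of this non-blocking connection is shown below. The names are illustrative, predicted_q is the queue from the previous sketch, and the predictor would also have to declare a matching uvm_nonblocking_get_imp and implement can_get.

// Comparator side: when a DUT transaction arrives, ask the predictor for reference data.
class audioport_comparator extends uvm_component;
  `uvm_component_utils(audioport_comparator)
  uvm_analysis_imp #(i2s_transaction, audioport_comparator) analysis_export;
  uvm_nonblocking_get_port #(i2s_transaction) predictor_get_port;
  function new(string name, uvm_component parent);
    super.new(name, parent);
    analysis_export    = new("analysis_export", this);
    predictor_get_port = new("predictor_get_port", this);
  endfunction
  virtual function void write(i2s_transaction dut_tx);
    i2s_transaction ref_tx;
    if (predictor_get_port.try_get(ref_tx)) begin
      if (!dut_tx.compare(ref_tx))
        `uvm_error("CMP", "I2S output does not match the predicted data")
    end
  endfunction
endclass

// Predictor side: implementation of try_get, executed through the connection.
virtual function bit try_get(output i2s_transaction t);
  if (predicted_q.size() == 0) return 0;    // nothing to hand out yet
  t = predicted_q.pop_front();              // hand out the next reference transaction
  do_dsp();                                 // model the dsp_unit and refill the queue
  return 1;
endfunction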

8 Sequence Execution
Let's now change the topic, and move on to examine different ways to execute sequences.

Remember that a uvm_sequence is essentially a task that generates transactions. The
sequence is associated with a sequencer, to which it passes the transactions it has created. The
sequencer in turn forwards them to a driver. A sequence is started on a sequencer by calling the
'start' member function of the sequence object with the sequencer handle as argument.

If you have many sequences, you can in principle execute them in many ways.

In a serial execution, you start the sequences one after another, always waiting for the previous
one to finish before starting the next.

Parallel execution is also possible, for instance by using a fork block, as shown in the second
example on the left.

The third way is to run sequences hierarchically. This means that you first start one sequence,
and then start other sequences from the body task of that sequence, as shown in the third
example.

The hierarchical principle is used in the multi-agent testbench example at bottom-left. If you
have multiple agents, you will want to control when each agent executes its sequences. You can
arrange this by creating a virtual sequencer and a sequence for it, whose purpose is to just start
the agent sequences at appropriate times. The word 'virtual' is used here just to indicate that the
purpose of these objects is not to generate test data, and that they are not connected to a
driver. There are no virtual sequencer or sequence classes in UVM.
In this arrangement, the sequences running inside agents are called worker sequences. They
can in turn use so called application program sequences to generate transaction sequences that
occur often in the test data.

The audioport_uvm_test environment has this kind of hierarchical sequence arrangement. The
right-hand-side of the slide describes how this is implemented in practice.

The uvm_test starts a master sequence, which is a kind of virtual sequence. It has two
functions. It first starts the main sequence, which controls the APB interface just like the main
program of an audio software application running on the CPU of the system-on-a-chip would do.
The second function of the master sequence is to start an interrupt service routine sequence
every time the master sequence detects an interrupt event.

The yellow code box shows the relevant parts of the 'body' task of the master sequence.

It first creates the worker sequences using the factory create method. This will become
important in the next slide.

The fork block starts the main sequence, and in parallel with it, a code block that goes into an
eternal loop from which it starts the isr_sequence every time an interrupt occurs.

The isr_sequence uses the sequence's 'grab' method to take over its sequencer, as shown
in the purple code box. This will block other sequences that are running on the same sequencer.
The behavior is similar to that of a computer system, where the main program is suspended and
control is transferred to an interrupt handler when an interrupt occurs.
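
A sketch of this arrangement is shown below. The sequence class names follow the ones used in the project description, but the details are simplified, and in particular the way the interrupt event is detected is only indicated by a comment.

class audioport_master_sequence extends uvm_sequence #(uvm_sequence_item);
  `uvm_object_utils(audioport_master_sequence)
  function new(string name = "audioport_master_sequence");
    super.new(name);
  endfunction
  virtual task body();
    audioport_main_sequence_base main_seq;
    audioport_isr_sequence_base  isr_seq;
    // create the workers through the factory, using the base types (see the next slide)
    main_seq = audioport_main_sequence_base::type_id::create("main_seq");
    isr_seq  = audioport_isr_sequence_base::type_id::create("isr_seq");
    fork
      main_seq.start(m_sequencer, this);          // the "main program"
      forever begin
        // wait here for an interrupt event reported by the irq agent (details omitted)
        isr_seq.start(m_sequencer, this);         // the "interrupt service routine"
      end
    join_any
  endtask
endclass

// Inside the isr_sequence's body task:
//   grab();     // take over the sequencer, blocking other sequences running on it
//   ...         // generate the interrupt-service bus transactions
//   ungrab();   // let the main sequence continue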

9 Factory Overrides
We have already mentioned the UVM factory on a couple of occasions in connection with object
creation. The recommendation in UVM coding guidelines is that you should use the awkwardish
factory create method instead of the much simpler new-operator of SystemVerilog to create
UVM objects. This slide explains why.

In the previous slide, we presented a simple but pretty useful master sequence that models the
behavior of a CPU that executes a main program and jumps into an interrupt service routine
every now and then. This behavior was realized with two worker sequences, the
main_sequence and the isr_sequence.

Now think of a situation where you would like to reuse the master sequence class as it is, in
other tests, but every time with different worker sequences. The UVM factory override
mechanism helps you do that.

Here is how it works.


You begin by creating dummy base classes for your worker sequences. By dummy we mean
that the class declaration is empty except for the constructor function. The
audioport_main_sequence_base class in code box 1 is an example of this.

As the next step, you should declare the real worker sequences as derived classes of the
dummy sequence classes created in step 1. These derived classes should implement the body
task and other features of a real sequence. The audioport_main_sequence class in code box 2
is implemented like this.

When you now create instances of the worker sequences in the body task of the master
sequence, you must create them from the dummy base classes, as was shown in the previous
slide, and also in code boxes 3 and 5.

The next step is important. Before you start the master sequence from the run_phase of the test
class, you must override the dummy base class types with the real sequence classes by using
the factory's type-override method, as shown in code box 4. This has the effect that, from this
point on, every time somebody asks the factory to create an object of the
audioport_main_sequence_base class, the factory will create an object of the
audioport_main_sequence class instead.

So when the master sequence now creates a worker sequence, it does it the old way and is
completely unaware that it has been fooled, as shown in code box 5. From the user's point of view this
is a good thing because it is now possible to create new tests with different worker sequences
without having to touch the code of the master sequence class.

You can only override a class type with one of its derived class types this way. This is why
the dummy base classes are needed.
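
A sketch of the override step, as it could appear in the test class's run_phase, is shown below. The isr base-class name and the hierarchical sequencer path are our assumptions, not the exact names in code box 4.

// Inside the test class (sketch):
virtual task run_phase(uvm_phase phase);
  audioport_master_sequence master_seq;
  phase.raise_objection(this);
  // From now on, whenever the factory is asked for a _base type, it returns the real type.
  audioport_main_sequence_base::type_id::set_type_override(audioport_main_sequence::get_type());
  audioport_isr_sequence_base::type_id::set_type_override(audioport_isr_sequence::get_type());
  master_seq = audioport_master_sequence::type_id::create("master_seq");
  master_seq.start(m_env.m_control_unit_agent.m_sequencer);   // illustrative path
  phase.drop_objection(this);
endtask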

10 Concluding Remarks on UVM


In the previous two lectures, we have introduced you to the main features of UVM. All UVM
features come with a lot of details so there is still much to learn. However, because of the
object-oriented implementation principle of UVM, you don't have to know everything, or even
much, to be able to begin to work with UVM testbench or sequence design. The details can
always be easily looked up in the class reference manual.

One significant feature that we have not covered is the UVM register layer. It allows you to
create abstract models of memory-mapped register banks or memories, giving you 'backdoor'
access to the DUT's registers. This simplifies test sequence creation, as it allows you to bypass
the bus interfaces. It also makes it easier to keep the data in reference models in sync with the
data in the design.
After having studied UVM for some time and having written some UVM code, you have probably
noticed that UVM testbench design is perhaps more about software engineering than traditional
hardware engineering. This is the direction in which system-on-a-chip verification is going in
general. It is becoming very important to be able to verify the hardware and software designs
together. A lot of programming skills and effort with different languages and programming
paradigms will therefore be required in the creation of the complex verification environments
and test cases.

The right-hand side of the slide presents the portable stimulus standard, or PSS, as an example
of emerging high-level SoC verification techniques. The PSS is meant for modeling test cases
on a high level, as activity graphs that define the order and dependencies between the actions
that make up a verification test case. The graph is written in text form in a domain-specific
language, or DSL.

A DSL model can be compiled into UVM sequences that can be executed in a UVM testbench.
The real advantage, however, is that the activity graph can also be compiled into C code that
can be executed on the CPU of the SoC. This way the same tests could be used for both
hardware verification, and for software-hardware integration testing, making it unnecessary, or
at least easier, to maintain two verification environments.
L13

2 'Back-End' IC Design Tasks in DT3


We have now reached the point in the course where we can assume that we have a verified
functional model of the audioport ready and working.

The next step is to create a physical model of the design. By 'physical' we mean a model that
contains the information that is required for manufacturing of a chip that implements the design.
This phase of the digital circuit design is called the implementation phase, or the back-end
design phase.

On this slide you can see the flow diagram that shows the back-end design tasks we are going
to do in the course project.

The flow contains both design tasks, shown as green rectangles, and verification and analysis
tasks, shown as orange rectangles.

In the design tasks, the RTL model is first mapped to logic gates and flip-flops, after which the
mask layout patterns that define the semiconductor devices and metal wires to be implemented
on the chip are created. The design tasks also include tasks related to testing of the chip.

The purpose of the verification tasks is to verify that the functionality of the design does not
change in these transformations, not so much to detect bugs anymore.

Since the course project concentrates on intellectual property block design, the aim of these
back-end design tasks is only to create a prototype implementation of the block. This way we
can, first of all, find out if the RTL code works with the back-end tools, and second, get accurate
estimates of the power, performance and area properties of the block. The final physical
implementation would be done as part of a complete SoC.

In this lecture we'll discuss logic synthesis, testability and power optimization topics.

3 Logic Synthesis
In logic synthesis, RTL code is translated into flip-flops and logic functions, and mapped to
components taken from a target technology library, while optimizing the design for timing, area
and power.

We assume that you already have a basic understanding of logic synthesis and optimization,
and will therefore not discuss the internal workings of logic synthesis programs, such as the
Design Compiler program we are using in the course project.
Logic synthesis programs are usually used with a synthesis script that contains the commands
the program must execute to convert the RTL model into a gate-level netlist file. A simple
synthesis script is presented on the left. Because of the use of scripts, no user interaction is
required during synthesis.

The main advantage of logic synthesis programs is that they allow very large designs to be
optimized for area, timing and power. This way the same RTL code can be a source for very
different kinds of circuits in terms of these properties.

The listing on the right shows a section of the optimization log of a synthesis program. This tool
has three optimization goals. The highest priority goal is to bring the design-rule optimization
cost function to zero by working on the circuit's structure. Design rules are technology
constraints such as maximum rise and fall times or maximum load capacitances of signals that
must not be exceeded. Once the design rule goal has been reached, the optimizer starts to
optimize timing. The cost function must again be brought down to zero, otherwise the circuit will
not work because of excessive delays on register-to-register paths for instance. The third goal is
area. In this phase the optimizer tries to make the circuit area as small as it can. This is not a
critical goal, as the area only affects cost but not functionality.

If you examine the synthesis script, you will notice the command 'insert_dft', which is used to
add test structures into the circuit, and the '-gate_clock' option of the compilation command,
which is used to add structures that reduce power consumption caused by clocking. These are
examples of commands that change the design by adding new functionality that was not
included in the original RTL design.

4 Timing and Area Reports


After you have run synthesis, the first things you typically want to check are the timing and area
reports. The timing report tells you whether the design works with the given constraints. From
the area report you can conclude how much the design would cost. Power is the third important
property, but we shall discuss that later in this lecture.

On the left-hand side of the slide you can see an example of a timing report. It is the most
commonly reviewed timing report, the critical path report. The critical path of a design is the
route from a start-point through combinational logic to an end-point on which the timing margin,
or slack, is the smallest. The slack is the difference between the required arrival time of data at
the end-point and its true arrival time. The arrival time depends on the propagation delay of the
components on the critical path. The required arrival time is determined by the user-defined
timing constraints, such as the clock period, and external delays of inputs and outputs.
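
In other words, for every path the tool computes

  slack = required arrival time - actual arrival time

and a negative slack on a path means that the timing constraint is violated on that path.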

Designers often set up the synthesis script so that it optimizes and reports the timing by path
groups. A path group is just a set of paths selected by the designer. The most commonly used
grouping puts paths that go from register to register, and from inputs to registers and vice versa,
into separate groups. This way the timing reports provide more insight on the properties of the
design. The timing of the reg-to-reg path group, for instance, only depends on the decisions of
the RTL designers. The timing of paths that end at input or output ports, on the other hand, also
depends on how the environment of the design has been modeled.

The area report shown on the right gives a summary of the areas of different types of
components used in the design. The total cell area figure gives the combined area of all the
components. Notice that the total area of the design is reported as undefined. This is
because after synthesis, we don't know yet how much area is taken by the wiring.

5 Design for Testability (DFT)


In a previous slide we pointed out a synthesis script command that was used to insert test
structures in the design. The aim of this was to add structures that would make it easier to test
the chip after it has been manufactured. Techniques and practices that are used to improve
testability are known as design for testability, or DFT, techniques.

To understand why certain DFT techniques are used, you need a basic understanding of
manufacturing testing of integrated circuits in general.

The aim of testing is to detect faults that have emerged in the manufacturing process. The aim
is 'not' to detect design errors, that's the aim of verification.

Silicon chips can have manufacturing faults that must be detected by feeding in test patterns
and checking the response. Each chip is tested multiple times during the manufacturing
process.

Testing has strict requirements. First of all, because automatic test equipment is very
expensive, the test time per chip must be short. At the same time, test coverage must be very
high, around 100%, implying that the test must detect all faulty chips. These constraints make
test design a challenging task.

The main challenge is that testing must be done using only the input and output pins of the chip,
even though the chip can contain billions of wires and transistors that must be tested. The
controllability and observability of the internal nodes is therefore very limited. DFT techniques
seek to improve these.

It's the job of the IC designer to plan the test procedure and create the test patterns. From an
RTL designer's point of view, having a basic understanding of DFT methods and their
requirements is important, because it makes it easier to create test-ready RTL designs and
code.
6 Fault Model Based Test Pattern Generation
Test patterns are bit vectors that are applied to the inputs of the circuit. A good test pattern will
produce a different response from a good and a faulty circuit. Running through all possible
bit-vector values would always detect all faults, but the number of test patterns generated this
way would be far too large. Therefore a more efficient way of choosing test patterns must be
used.

The industry-standard solution to the test pattern generation problem is fault-model based
testing. Its idea is that the chip is assumed to have just one fault of a specific kind in a specific
point at a time, and a pattern that detects just that fault is generated based on the nature of the
fault. This is repeated by assuming the fault in every possible point, a wire or component pin for
instance. After this process, a complete test pattern set has been generated for all the expected
faults.

Commonly used fault models include stuck-at, bridging, transistor stuck-open and stuck-short,
and delay fault models. The stuck-at fault model is by far the most common.

The stuck-at model assumes that every component input and output, one at a time, is
permanently tied to 1 or 0. A test pattern is generated for every case.

A test pattern is a set of input values that produce a different response from a good circuit and a
circuit that has the fault.

The schematic diagram on the right explains the principle of stuck-at fault testing. In this case, a
stuck-at-zero fault has been modeled on the first input pin of the second 'and' gate. This means
that this gate will always see the value of the first input pin as zero. To emphasize this, the wire
connected to this pin has been cut in the drawing. By evaluating the logic functions of the
circuit's outputs, you can see that the test pattern 1-1-1 would detect the fault.

After test pattern generation, the faults are classified as detected, undetected and undetectable,
also called redundant. The third group covers faults that cannot be detected by the test but can
be proven to not have any effect on the circuit's output values.

An important metric that can be computed from these statistics is the test coverage. It is defined
as the ratio of the number of detected faults and all faults minus undetectable faults, or more
simply, the number of detected faults divided by the number of testable faults.
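
Written as a formula:

  test coverage = detected faults / (all faults - undetectable faults)

so faults that are proven undetectable do not lower the coverage figure.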

Test patterns are usually generated separately for combinational logic and flip-flops.
Combinational logic testing is the difficult part because of the complexity of the logic functions
involved.
7 Test Pattern Generation Example
Let's illustrate the test pattern generation problem with an example that is slightly more difficult
than the one in the previous slide.

On the right you can see a combinational logic circuit with three logic gates. We will outline the
test pattern generation procedure using this circuit as an example. The procedure injects and
tests one fault at a time.

In the first step, a fault is inserted in the circuit model from a fault list. It is a stuck-at-zero fault at
pin X. The fault list typically contains all possible faults for the circuit.

In the next step, the goal is to cause the fault to manifest itself. This means that pin X must be
forced to state 1. This can be achieved by setting input 'A' equal to 0 and 'C' equal to 1. Now the
state of X will be zero in a faulty circuit and one in a good circuit.

At this point, remember that we can only observe the outputs of the circuit in testing. Also, we
cannot set a probe on pin X, because it is on an integrated circuit chip, and has the size of a
few nanometers. We must therefore make the state of pin X propagate to the output D by
controlling the inputs. This is called path sensitization.

Since output D is driven by an OR gate, we must set the first input of this gate to state 0 to make
its output depend only on X. The inputs A and B will therefore have to be set to state 0. We have
already fixed A to state 0 so we only have to fix B. We can conclude that the test pattern is
zero-zero-one.

Notice that in the forcing phase, we had to fix some inputs. In this case this did not cause
problems in the sensitization phase, but that is not always, or in general, the case. It can be very
difficult to find input patterns that can both force an internal node to a specific state, and
propagate this state to an output pin. For large circuits with thousands of logic gates, it can be
done only with the help of automatic test pattern generator programs that use sophisticated
algorithms to do the search.

In practice, one test pattern can often detect many different faults. ATPG programs can make
use of this property to reduce the number of different test patterns that actually have to be used.
This technique is called test pattern compression. It can decrease testing time significantly
compared to the case where every single fault is tested separately.

The complexity of the test pattern generation problem is directly proportional to the complexity
of the logic between the fault location and the inputs and outputs. In practice this means that if
the logic is too deep, it may not be possible to find test patterns that provide 100% test coverage
for it. The depth of the logic, in turn, depends on the RTL architecture. This is one of the reasons
why RTL designers should be aware of testability issues.
8 DFT Techniques: Scan Path
As we mentioned earlier, integrated circuits are tested from their input and output pins. This
poses a serious problem for the combinational logic testing method presented in the previous
slides. Most of the combinational parts are deep inside the chip, behind many registers, so it is
not possible to access their inputs and outputs directly. Using test terminology, we can say that
the controllability and observability of these circuit parts is bad.

This problem is solved by creating so-called scan paths inside the design to make the inputs
and outputs of combinational parts accessible from outside the circuit. The insert_dft command
used in the synthesis script shown earlier does just that.

Scan path insertion is a simple process in principle. In the first phase, every regular flip-flop is
replaced with a multiplexed flip-flop. These flip-flops are called scan flip-flops. A control input
signal, scan_en in the figure, is then added to drive the select inputs of the multiplexers inside the
scan flip-flops. The data inputs of the original flip-flops are connected to the first inputs of the
multiplexers. These inputs are selected when scan_en is in state 0.

As the final step, the output of each scan flip-flop is connected to the second input of the
multiplexer inside the next scan flip-flop. This means that if scan_en is set to state 1, the scan
flip-flops form a shift register. The input of the first flip-flop must then be connected to an external
input, and the output of the last flip-flop to an external output. After these modifications, the scan
path is ready. By setting scan_en equal to 1, we can clock test data into the flip-flops. The
flip-flops, in turn, control the combinational logic. We can also capture test result data from the
combinational logic into the scan flip-flops and then shift it serially out using the scan path.
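As a sketch of what scan insertion produces, the following SystemVerilog models a multiplexed-input
scan flip-flop and a three-bit scan chain. The port names (scan_en, scan_in, scan_out) are
illustrative, not the exact names used by the tool.

```systemverilog
// Behavioral sketch of a multiplexed-input scan flip-flop.
module scan_ff (
  input  logic clk, rst_n,
  input  logic d,        // functional data input
  input  logic scan_in,  // serial scan data input
  input  logic scan_en,  // 0: functional mode, 1: scan shift mode
  output logic q
);
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 1'b0;
    else        q <= scan_en ? scan_in : d;  // multiplexer in front of the D input
endmodule

// Three scan flip-flops connected into one scan path: when scan_en = 1,
// q0 -> q1 -> q2 behaves as a shift register from scan_in to scan_out.
module scan_chain_demo (
  input  logic clk, rst_n, scan_en, scan_in,
  input  logic d0, d1, d2,
  output logic scan_out
);
  logic q0, q1, q2;
  scan_ff ff0 (.clk, .rst_n, .d(d0), .scan_in(scan_in), .scan_en, .q(q0));
  scan_ff ff1 (.clk, .rst_n, .d(d1), .scan_in(q0),      .scan_en, .q(q1));
  scan_ff ff2 (.clk, .rst_n, .d(d2), .scan_in(q1),      .scan_en, .q(q2));
  assign scan_out = q2;
endmodule
```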

9 Scan Path Creation and Usage


A scan path can be added in logic synthesis by first mapping flip-flops to scan flip-flops, as done
with the 'compile -scan' command in the script, and then using a synthesis tool command, such as
insert_dft, to create the required number of scan paths.

The user has to define the clock and reset inputs that are used during testing, as well as the
scan path enable signal, and the ports that are used as scan path data inputs and outputs. The
user can also specify test mode select inputs for use in cases where the design has parts that
have to be forced into a specific state during testing. Multiplexers that bypass internally
generated clocks are one example of this.

The logic synthesis program can save a model of the scan path architecture for use with
automatic test pattern generator tools and automatic test equipment.

The procedure for applying one test pattern is presented in the slide.

The circuit is first clocked with scan mode enabled until all flip-flops contain test data.
After that, test data is applied to external data inputs.

The scan mode is then disabled, and the circuit is clocked once. This has the effect that test
results from combinational logic are stored in the flip-flops.

To complete the test, scan mode is again enabled, and the circuit is clocked until all test result
bits have been shifted out. New test data can be shifted in at the same time.

The duration of one test depends on the length of the scan path. You can reduce test time by
inserting many scan paths in the design and using them in parallel.

The flip-flops themselves can be tested by just running a test pattern through the scan path.
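The test procedure can also be sketched as a simple testbench task. The example below reuses the
scan_chain_demo module sketched earlier; the names, the chain length, and the bit ordering are
illustrative assumptions, not the exact protocol produced by an ATPG tool or tester.

```systemverilog
// Testbench-style sketch of applying one scan test pattern.
module scan_test_tb;
  localparam int N = 3;                       // scan chain length
  logic clk = 0, rst_n, scan_en, scan_in, scan_out;
  logic d0, d1, d2;

  always #5 clk = ~clk;                       // free-running test clock

  scan_chain_demo dut (.*);                   // chain sketched earlier

  task automatic apply_scan_pattern(input  logic [N-1:0] stim,
                                    output logic [N-1:0] resp);
    scan_en <= 1;                             // 1. shift the stimulus in
    for (int i = 0; i < N; i++) begin
      scan_in <= stim[i];
      @(posedge clk);
    end
    scan_en <= 0;                             // 2. one capture cycle in functional mode
    @(posedge clk);
    scan_en <= 1;                             // 3. shift the response out
    for (int i = 0; i < N; i++) begin
      @(posedge clk);
      resp[i] = scan_out;                     // last flip-flop's captured bit comes out first
    end
  endtask

  initial begin
    logic [N-1:0] resp;
    rst_n = 0; scan_en = 0; scan_in = 0;
    {d0, d1, d2} = 3'b110;                    // data captured by the flip-flops
    repeat (2) @(posedge clk);
    rst_n = 1;
    @(posedge clk);
    apply_scan_pattern(3'b101, resp);
    $display("captured response = %b", resp);
    $finish;
  end
endmodule
```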

10 Scan Insertion in Practice


This slide describes how the scan path specifications are presented for the logic synthesis
program in practice, and what kind of problems can emerge when you try to insert the scan
paths.

On the left you can see an example of a DFT configuration file. Tool commands, option switches
and option values are shown in black font, and the values that represent the user's design data
in red. Let's examine the commands one-by-one.

The first command defines the port 'clk' as the clock used during testing. This pin will be
connected to the clock output of the test equipment. The values 45, 55 define the clock
waveform edge times in nanoseconds. The clock frequency depends on the capabilities of the
test equipment and is typically much lower than the on-chip clock frequency used in normal
operation.

The second command defines 'rst_n' as an active-low reset signal.

The third command defines the input scan_en_in as the scan enable signal, which activates the
scan path when it is in state 1.

The next two commands define inputs test_mode_in and mclk as constant inputs. This means
that the test equipment must hold these signals in the specified state throughout the test.

On the next line, the command set_scan_configuration sets the properties for the scan paths we
want to create. The style is multiplexed-flip-flop, which is the most common style and the one we
have discussed in this lecture. The second option states that we want to have two scan paths.
Notice that the term scan chain is often used instead of scan path. They mean the same thing.
The option 'clock mixing' specifies whether the insert_dft command can include cells from
different clock domains in the same scan path.

Since two scan paths were requested, the input and output ports allocated for the paths must be
given, otherwise the tool will create new ports. The I/O ports can be regular data ports that will
be used for this purpose in testing. Multiplexers controlled by the scan-enable signal will be
automatically added in front of the output ports.

The scan path insertion command uses these specifications to add the scan paths. Before that,
it runs various design rule checks to make sure that scan paths can really be added in the
design.

The right-hand side of the slide presents some design features that often cause testability
design rule checks to fail. To understand these problems, remember that while test data is shifted
in and out, all flip-flops in the design contain essentially random data. If the design has parts that
are expected to behave as they do in normal operation during the shift, there can be problems.

Internally generated clock and reset signals, and three-state buses, are examples of such
problematic structures. If an internal clock signal is generated with a clock divider counter, the
counter will not produce a proper clock waveform while test data is shifted through its flip-flops.
The same applies to reset generators. On a three-state bus, only one driver can be active at a
time. If the control logic that enables the three-state buffers does not work correctly because of
the scan operation, the bus can have several active drivers, which can even damage the circuit.

Problems like these are often simply caused by bad RTL code, such as the code shown in the
slide.

Here the design should have an internal reset feature, which has been implemented as a
second asynchronous reset signal. This is a bad design practice. Internal resets of flip-flops or
registers should always be implemented by controlling their next-state encoding logic. The
actual reset signal should not have any internal logic that could block or otherwise disturb it.

In this case, the AND gate synthesized onto the reset line will cause flip-flop ff3 to reset every time
there is a zero in the test data during scan path operation. The test data will therefore be
corrupted in ff3.
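The two coding styles can be sketched as follows; the signal and register names are illustrative
and not taken from the slide.

```systemverilog
// Minimal sketch of the two internal-reset coding styles.
module internal_reset_styles (
  input  logic clk, rst_n, soft_clear, d,
  output logic bad_q, good_q
);
  // Bad style: the internal clear is merged into the asynchronous reset,
  // which synthesizes an AND gate onto the reset line of the flip-flop.
  logic int_rst_n;
  assign int_rst_n = rst_n & ~soft_clear;

  always_ff @(posedge clk or negedge int_rst_n)
    if (!int_rst_n) bad_q <= 1'b0;
    else            bad_q <= d;

  // Recommended style: the external reset line stays clean, and the internal
  // clear is implemented in the next-state logic of the register.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)          good_q <= 1'b0;
    else if (soft_clear) good_q <= 1'b0;
    else                 good_q <= d;
endmodule
```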

If it is no longer possible to change the RTL code for schedule reasons, these kinds of problems
can often still be fixed in the synthesis phase with tool commands. This of course off-loads the
work to the back-end engineers, who have to change their scripts to fix problems caused by bad
RTL code.

This again highlights the importance of doing a prototype implementation for an IP block before
releasing its code for SoC level implementation.
11 Power Consumption in Digital CMOS Circuits
Power is the third important physical property of a digital circuit, in addition to timing and area.
To understand power reports, you have to understand the sources of power consumption in
digital circuits manufactured in complementary metal-oxide-semiconductor (CMOS) technology.

The total power can be divided into dynamic and static power components.

Dynamic power consumption is caused by switching activity inside the circuit. Switching means
just the toggling of signals between the 0 and 1 states. This causes power dissipation in two
main ways.

The dominant dynamic power component is the switching power consumed by the charging and
discharging of the load capacitances that the logic components drive. This capacitance is mostly
due to the wires. Switching power can be estimated using the first equation in the slide. In
essence, it states that the power depends on the signals' probability of switching per clock cycle,
the clock frequency, the capacitance and the supply voltage level squared. You would have to
compute this for every signal separately, and then add the results up. The square dependence
on voltage is an important thing to remember here.
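For reference, the switching power expression usually takes the following form, which presumably
also matches the equation on the slide:

P_switching = Σ_i ( α_i · C_i · V_DD² · f_clk )

where α_i is the probability that net i toggles during a clock cycle, C_i is the load capacitance of
that net, V_DD is the supply voltage and f_clk is the clock frequency. The sum runs over all nets in
the design.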

The second component of dynamic power is the internal power of logic components. In theory it
should be zero for CMOS circuits, because one of the stacked PMOS and NMOS transistors is in
principle always off, preventing current flow between the voltage source and ground. In practice,
however, the transistors need some time to switch on and off, and during these transitions both
conduct briefly, creating a short-circuit path from the power supply to ground.
Internal power consumption therefore depends on how long both transistors conduct
simultaneously, which in turn depends on the rise and fall times of the input signals. It is therefore
important to keep the edges of logic signals sharp by not overloading them.

In the static power category, leakage power is the dominant component. There are many
leakage mechanisms in MOS transistors, but we shall not discuss them here. The main thing to
understand is that leakage does not depend on circuit activity. This means that as long as the
supply voltage is on, all transistors will leak and consume power.

Now that we know why power is consumed, we can find means for reducing it.

To minimize dynamic power, the most obvious solutions are to minimize switching activity and
lower the supply voltage level. Both techniques require some changes in the design. For
instance, lowering the voltage will increase the delays of logic components, which could call for
changes in the RTL architecture. At the IP-block level, switching activity reduction is the most
commonly used method. The supply voltage should of course be chosen so that it does not
cause unnecessary power consumption, but it is not very common to have several voltage
domains inside an IP block.
At the chip level, on the other hand, multi-voltage design is standard practice. An optimal
solution would in principle be to set the voltage as low as possible for every IP block. It is also
common to shut down IP blocks altogether when they are not in use. More advanced
techniques are based on scaling the clock frequency and voltage on-the-fly.

Use of multi-threshold component libraries in synthesis is a power optimization method for
reducing leakage power. It is based on the behavior of MOS transistors: a transistor
whose threshold voltage is designed to be lower will work faster but leak more. Increasing the
threshold voltage has the opposite effect. If you have both high and low threshold logic gates
available, it makes sense to use fast low-threshold gates on the critical path, and slower
high-threshold gates in other parts. This way most parts of the circuit can be implemented with
low-leakage components without degrading its performance.

In this course, our focus is on IP block design, so we shall study the most important techniques
that are used to minimize dynamic power on the block level, that is, smart design and clock
gating.

12 Power-Aware RTL Design


How you design the RTL architecture has a big effect on power consumption, so you should
always keep power in mind when designing circuits. The most important guideline is that the
RTL architecture should be chosen so that it does not generate unnecessary switching activity.
Even though it may be possible to use the power optimization features of synthesis programs to
reduce power consumption later in the design cycle, they cannot completely fix problems caused
by a bad design.

The schematic shown on the right presents a case of bad design. The circuit has two 16-bit
shift-registers into which input data is shifted serially, one bit per clock cycle. After 32 clock
cycles, the 32-bit output register is loaded with the results of a 16 times 16-bit combinational
multiplier that computes the product of the 16-bit values stored in the shift-registers.

The problem in this design is that while data is being shifted in, it causes a lot of activity at the
inputs of the multiplier. A multiplier of that size has thousands of logic gates, so the activity will
cause a lot of power consumption. The results, however, are not used on 31 out of 32 clock
cycles, so this power is mostly wasted. A better solution would be to add AND-gates controlled
by the enable signal at the inputs of the multiplier, or to move the 32-bit register over to the
input-side of the multiplier. After these changes, the inputs of the multiplier would remain stable
for most of the time.
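A minimal sketch of the improved structure, assuming the second fix mentioned above (the
registers moved to the input side of the multiplier). The module and signal names, and the way the
capture enable is generated, are illustrative.

```systemverilog
// Operand registers in front of the multiplier keep its inputs stable
// while new data is being shifted in.
module serial_mult_low_power (
  input  logic        clk, rst_n,
  input  logic        ser_a, ser_b,   // serial operand inputs, one bit per cycle
  input  logic        capture_en,     // asserted once when both operands are complete
  output logic [31:0] product_o
);
  logic [15:0] shreg_a_r, shreg_b_r;  // shift registers: active on every cycle
  logic [15:0] op_a_r,   op_b_r;      // operand registers: isolate the multiplier

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      shreg_a_r <= '0; shreg_b_r <= '0;
      op_a_r    <= '0; op_b_r    <= '0;
    end else begin
      shreg_a_r <= {shreg_a_r[14:0], ser_a};
      shreg_b_r <= {shreg_b_r[14:0], ser_b};
      if (capture_en) begin
        op_a_r <= shreg_a_r;          // multiplier operands change only here
        op_b_r <= shreg_b_r;
      end
    end

  // The multiplier sees stable operands between captures, so the shifting
  // activity no longer propagates into its thousands of gates.
  assign product_o = op_a_r * op_b_r;
endmodule
```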

13 Power Minimization with Clock Gating


Smart design can help reduce the activity of data signals. The real culprit for dynamic power
consumption, however, is the clock signal. It is by definition the most active signal in the design,
as it changes twice in every cycle. Its capacitive load is also huge, because it has to drive all
flip-flops in the design. The clock alone can consume tens of percent of the total power of a
digital chip, so it's a good idea to focus power reduction efforts on the clock. Clock-gating is the
standard solution for this.

Clock gating means blocking the clock signal from some parts of the circuit when we know that
the clock is not needed there because nothing is going to be clocked into registers there on the
next clock cycle. The problem is, how do we know which registers?

We can find the potential clock gating targets by using simple reasoning. If a register has to be
loaded on every clock cycle, its clock cannot be gated. However, if the register is loaded only
every now and then, then it does not need the clock to be on all the time.

The register represented by the RTL code fragment on the left matches this description. It has a
control signal that enables the loading of the register represented by the variable reg_r. The
enable signal is probably controlled by a control state machine somewhere. Most digital circuits
contain a large number of registers like this.
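A register of this kind typically looks like the following sketch; apart from reg_r, which the text
mentions, the names are illustrative and not copied from the slide.

```systemverilog
// Load-enabled register of the kind a synthesis tool can map to a clock gating cell.
module gated_candidate (
  input  logic       clk, rst_n, enable,
  input  logic [7:0] data_in,
  output logic [7:0] reg_r
);
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)      reg_r <= '0;
    else if (enable) reg_r <= data_in;  // loaded only when enable is high
endmodule
```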

An RTL synthesis tool can easily detect this kind of register in the code, and insert a clock
gating cell to block the clock signal when the enable signal is off.

You could in principle use a two-input AND-gate as a clock gating cell, but that does not work
very well in practice. The reason is that the output decoder of the control unit can introduce
glitches in the enable signal, that is, unwanted pulses that occur due to the delays of the gates
in the decoder logic. The glitches do no harm in data signals, if they settle before the next clock
edge, but in clock signals they cause flip-flops to be loaded when they shouldn't be.

This is why clock gating cells also contain a latch that filters out the glitches.
The latch receives an inverted copy of the clock, which means that while the main clock is high
during the first half of the clock cycle, the latch is opaque, and the enable signal, together with its
glitches, cannot pass through.
When the main clock falls, the latch becomes transparent and the enable signal passes through,
which allows the AND gate to propagate the next clock pulse.
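The behavior of such a latch-based clock gating cell can be sketched as follows. In a real flow, a
dedicated integrated clock gating (ICG) cell from the technology library would be instantiated by
the synthesis tool; this model only illustrates the principle.

```systemverilog
// Behavioral sketch of a latch-based clock gating cell.
module clock_gate (
  input  logic clk,     // free-running clock
  input  logic enable,  // load enable from the control logic (may glitch)
  output logic gclk     // gated clock
);
  logic enable_latched;

  // The latch is transparent while clk is low and opaque while clk is high,
  // so glitches on enable during the high phase cannot reach the AND gate.
  always_latch
    if (!clk) enable_latched <= enable;

  assign gclk = clk & enable_latched;
endmodule
```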

Applying clock gating on a design has another advantage, as highlighted on the right-hand-side
of the slide. The multiplexer that works as the load-enable logic is not needed any more, and
can be optimized away after the gating logic has been added to the register. This reduces the
area of the design, and power, too.

Clock gating has some negative effects as it increases clock latency and skew. Clock latency is
the delay between the clock input of the circuit and the clock pins of flip-flops. It changes the
timing of input-to-register and register-to-output paths. Clock skew is the variation of the clock
delay. It reduces the slack of all delay paths.
Clock gating also creates new implicit timing and DFT constraints, which is why it should always
be implemented with EDA tools that can create reliable gating logic automatically, while taking
into account the changes in timing and DFT constraints.

14 Power Report
We'll finish this lecture by examining the power reports generated by a logic synthesis tool.

When you interpret the power results, it is important to understand how they were obtained. As
we learned in this lecture, dynamic power depends on circuit activity, and power reporting
therefore also needs some activity data.

In the course project, the activity information used for dynamic power estimation is obtained
from an SAIF file saved in RTL simulation. The SAIF file contains the activity recorded from
module ports. After synthesis, when the synthesis tool generates the power report, it propagates
this activity to nets, ports and pins in the gate-level model, and the power report is based on this
propagated activity.

There are two things to note here. First of all, the reliability of the power results depends on how
well the simulation test case represents the real use case whose power consumption you are
interested in. If the input data does not contain the same amount of activity that the circuit will
have in real use, the power estimate can be far too low.

The second reservation concerns the estimation method. The activity propagation method used
in the synthesis program is not as accurate as actually simulating the gate-level model with real
input data would be. It is used because it is much faster, and because it is usually good enough
for pre-layout power estimation.

The power reports shown in this slide present the main power components for one design that
does not use clock gating, and another that does. In the ungated version, clock power is
reported as zero because the clock is treated as an ideal signal in logic synthesis. In the second
case, clock power includes the power consumed by the clock gating cells.

In the post-layout phase, the clock signal would be buffered in both cases, so the results would
be different. A post-layout power estimate would also include the contribution of wiring
capacitances, which could have a big effect on all power components.
L14

2 Physical Layout Design


We have now reached the final phase of the design flow, the design of the physical layout. Its
aim is to generate the data that is needed in the chip's manufacturing process.

To reach this point, you first have to create a system-level model that works as a high-level
specification for the complete system-on-chip design. From this spec, you first extract the
requirements for the RTL design that you are going to implement, for instance as an IP block.
After that you can design the RTL architecture and write the code. The code is then synthesized
and optimized to produce a gate-level netlist that defines the components, wires, and connections
that make up the structure of the design. In this way the functional model is transformed into a
structural model that implements the functionality. Throughout this process the functionality
has been described with logic, and the data as ones and zeros.

In the final phase, the structural model is converted into a geometrical one, and it does not
describe data processing any more.

The input data you have at hand in the beginning of the layout design phase include the
gate-level netlist of the design, the timing constraints, and a model of the scan paths included in
the design. You also need a component library that contains timing data, and a library that
contains the layout images of the components. A technology file that defines the placement and
routing rules of the target technology, and a capacitance table file that defines parasitic
capacitance and resistance properties of the metal layers are also needed.

The main results of this phase include the layout database, a post-layout gate-level netlist, and
parasitic capacitance and resistance values of wires extracted from the layout. The netlist and
parasitics are needed for analysis purposes.

3 Standard Cell Layout Principle


The layouts of digital integrated circuits are nowadays created using the standard cell based
design principle.

Standard cells are objects that define the layout patterns for the logic components that are used
in the synthesized netlist. The layout patterns, in turn, define the shapes of regions implanted in
the silicon substrate, as well as polysilicon and metal layer wires created on top of the silicon.
These patterns define the masks that are needed in the photolithography-based integrated
circuit manufacturing process.

Standard cells are designed by the silicon vendor or a third-party intellectual property provider,
and they are used as library components in layout design programs. The layout is created by
placing the cells in the chip's floorplan, and then designing the wiring patterns that connect the
cells to each other. This means that only the layout patterns that define the metal-layer wires are
design-specific.

On the right you can see images of some standard cells as they would be represented on the
computer screen in a design program. They are presented in the same scale, which allows you
to make area comparisons.

For the layout design process, only the dimensions of the cells, the location of contact terminals,
and the shape of wires on the first metal layer, shown in dark blue, are relevant information. In
layout design, the cells are placed and metal wires are connected to their contact areas, and
therefore information about the silicon layers is not needed.

The key things to understand about standard cells and their usage are the following.

All cells have the same height but their width varies according to their function, as you can see
on the right.

Power, or VDD, and ground, or VSS, pins are always at the top and bottom edges of the cells.

The cells do not use layers above metal layer 1 for internal routing. This leaves layers from
metal layer 2 up free for signal routing over the cells.

The cells are placed in rows with every other row flipped so that power and ground rails can be
shared between rows. Filler cells that do nothing are used to fill gaps in the rows.

Because of these principles, the layout is very regular. This makes it possible to use powerful
algorithms to solve the cell placement and signal routing problem automatically.

4 Standard Cell Placement and Routing


The main optimization problems that layout design tools solve are the placement of standard
cells and the routing of signals.

In placement, the starting point is the gate-level netlist obtained from logic synthesis. The netlist
contains a list of the components used and the names of the wires that connect them. A
structural Verilog model like the one shown on the right is often used as the netlist. The
layout design tool uses standard cell representations of the components in placement.

In placement, the rows in the floorplan are filled with standard cells using a chosen area
utilization ratio. The utilization ratio, also known as cell density, specifies what fraction of the core
area is covered with cells, and hence how much empty space is left in the layout. The utilization is
usually set to less than 100% to make the design easier to route. Filler cells are placed in the
empty slots.
In the routing phase, the connections defined in the netlist are implemented with metal-layer
wires. More than ten layers are usually available for this. On a given layer, wires run in only one
direction, either horizontal or vertical. Wire segments on different layers can be connected by
punching so-called vias through the dielectric material between the layers. One signal in the
netlist can therefore be implemented as several wire segments that criss-cross on many layers
in the layout. The routes are drawn on a predefined grid, which limits the number of wires that
can be created per unit area. This is why lowering the utilization makes routing easier.

On the computer screen, wires on different layers are drawn in different colors, as shown on the
right.

5 Block Layout Design Phases


An IP block is usually implemented in a system-on-chip's floorplan as one physical block. A
block is often, but not necessarily, a rectangle that contains the standard cells, the signal wires
and the power routing, but no input and output cells for off-chip connections. Blocks are placed
in the SoC floorplan and then connected to each other and to the I/O cells, but the internal layout
of each block can be designed separately. Block-level layout design is therefore simpler than full
SoC layout, because there is no need for floorplan and I/O interface design.

This slide presents the main steps of block layout design.

The shape and size of the block's core area is first specified either by defining its dimensions or
cell density. In the latter case, the design tool computes the required size. The term core refers
to the shape inside which the components will be placed.

When the core area has been fixed, power and ground metal rings are created around it.
After that, power stripes are drawn over the core area using the top metal layers and connected
to the rings with vias.
The rings and the stripes form the backbone of the power distribution network.
Standard cell power and ground rails are drawn on cell rows and connected to the stripes with
vias.

When the power network has been created, the standard cells are placed.

This is followed by clock tree synthesis, in which clock buffer cells are added and then placed
and routed.

In the final phase, the data signals are routed. A global routing is first created at global route cell
level, and this is followed by detailed routing inside the global routing cells.
6 Clock Tree Synthesis
In the previous slide, we mentioned clock tree synthesis. This slide explains what that is all
about.

First recall that in logic synthesis, the clock was treated as an ideal signal, which was not
buffered even though its fanout load is huge. In the gate-level netlist, the clock is represented as
a single wire.

To keep the edges of the clock signal waveform sharp, the signal must be buffered. A tree-like
arrangement that consists of many small buffer cells is used for this. The buffers are usually
added in the layout design phase after placement when wiring delays can be estimated
accurately. This task is called clock tree synthesis (CTS).

Clock tree synthesis is a placement and routing task whose purpose is to create an optimized
clock tree. It is done after placement because the locations of the flip-flops must be known for
optimal buffer cell placement. CTS also involves routing the signals of the clock tree, which is
why it is done before the routing of the design's data signals, when all routing resources are still
available. This way the wire lengths in the clock tree can be minimized.

Clock tree synthesis can be controlled by setting constraints on clock latency and clock skew.
Clock latency is the maximum delay from the root of the clock tree to its farthest leaf flip-flop.
Clock skew is the maximum delay variation caused by different buffer and wiring delays on
different branches of the clock tree.

Controlling the skew is important, because skew can cause hold time violations in flip-flops. If the
clock seen by a receiving flip-flop is delayed with respect to the clock of the flip-flop that drives it,
new data from the driving flip-flop can arrive at the receiving flip-flop's data input too soon after its
clock edge, causing a hold violation. This danger is greatest when there is no logic between the
flip-flops.

It is easy to fix hold violations by delaying data signals between the directly connected flip-flops,
as shown in the figure at bottom right. Hold fixing must be done after clock tree synthesis, when
the timing properties of the clock tree are known.

7 Post-Layout Timing Analysis: Delays


We have already discussed timing reporting in the previous lecture. After logic synthesis, the
timing reports provide an early glimpse of the timing properties. The final truth is only revealed
after layout.

Before generating the post-layout timing report, the design tool extracts parasitic resistance and
capacitance values of the wires in a specific delay corner based on wire dimensions and
material properties. After that the delays for the report are computed with the extracted data.
The delay corner defines the operating conditions, in terms of manufacturing process variation,
the supply voltage level and the operating temperature, for which the user wants the timing to
be reported.

In the post-layout phase, it is useful to generate critical path reports based on both setup and
hold time analysis. Setup analysis means checking that flip-flops' setup times are not violated
because of excessively long delays on paths that end in registers. Hold analysis checks that
clock skew does not cause problems on short paths.

The timing values reported by a logic synthesis tool and a post-layout analysis tool can differ for
many reasons.

First of all, post-layout analysis is based on real extracted RC values, not on the estimates used
in the logic synthesis phase.

Some decisions made in the block's floorplan may cause large timing deviations. For instance, if
a pin on the layout block's perimeter is placed suboptimally, it may require extremely long wiring,
which the synthesis tool could not have predicted. Placement of macrocells, such as memory
blocks, can cause similar effects because of the routing blockages they create.

The third cause for post-layout timing surprises is routing congestion. Congestion occurs inside
a region of the layout, where there are not enough routing channels available to make
connections that should preferably go through that region. Routing blockages caused by
macrocells, and high pin-densities caused by high-fan-in cells inside a small area, are common
reasons for routing congestion.

8 Post-Layout Timing Analysis: Clocks


In the post-layout phase, it is useful to analyze the timing of the clock signal, too. This can give
insight that helps improve the design or its timing constraints.

The clock timing properties of interest are the latency and skew of the clock tree. Layout design
programs contain good tools for analyzing these. On the right you can see a graphical
representation of the structure and timing of a clock tree.

As we have mentioned earlier, clock latency shifts the circuit's internal time to the right, affecting
input-to-register and register-to-output paths that have been constrained with input and output
delays, respectively. Clock skew, on the other hand, can shrink both setup and hold slack.

By observing the clock tree timing, you can update the synthesis constraints to include clock
latency and skew values. Next time you run synthesis, these constraints direct the logic
optimizer to take the latency or skew values into account in timing analysis.
9 Post-Layout Power Analysis
As you can guess, power analysis also produces more accurate results in the post-layout
phase. You can use the method based on propagated activity just like in the logic synthesis
phase, but now you can use real RC data.

This slide presents a more accurate method that uses gate-level simulation to obtain
instantaneous activity data. The idea is presented on the right.

You first write out from the layout tool a post-layout netlist together with delay and wiring parasitic data.

As the first analysis step, you simulate the post-layout netlist using your test program, and set
the simulator to dump all signal changes in a file during the course of the simulation.

After simulation, you can use a power analysis program that can combine the design data with
the activity data, and compute the power consumption on every time instance. This method is
sometimes called time-based power analysis.

The time-based analysis can give an estimate of both the average power and the peak power
consumption seen during the simulation.

Average power data can be used to estimate, for instance, battery lifetime or cooling
requirements. IP blocks also often have an average-power budget that must not be exceeded.

Peak power data can be used to estimate power source and distribution network design
requirements. High power peaks can cause large voltage drops in the power distribution
network if they have not been taken into account in its design.
