Chapter 1
Detailed Syllabus:
The Nature of Hardware and Software: Introducing Hardware/ Software Co-design, The Quest
for Energy Efficiency, The Driving Factors in Hardware/ Software Co-design, The Dualism of
Hardware Design and Software Design.
Data Flow Modeling and Transformation: Introducing Data Flow Graphs, Analyzing
Synchronous Data Flow Graphs, Control Flow Modeling and the Limitations of Data Flow,
Transformations.
Analysis of Control Flow and Data Flow: Data and Control Edges of a C Program, Implementing
Data and Control Edges, Construction of the Control Flow Graph and Data Flow Graph.
Finite State Machine with Datapath: Cycle-Based Bit-Parallel Hardware, Hardware Modules,
Finite State Machines with Datapath, FSMD Design Example: A Median Processor.
System on Chip: The System-on-Chip Concept, Four Design Principles in SoC Architecture, SoC
Modeling in GEZEL. Applications: Trivium Crypto-Coprocessor, CORDIC Co-Processor.
Hardware-Software Co-Design
Learning Resources:
Text Books:
1. Patrick Schaumont, A Practical Introduction to Hardware/ Software Co-design,
Springer, 2010.
2. Ralf Niemann, Hardware/Software Co-Design for Data flow Dominated
Embedded Systems, Springer, 1998.
Reference Books:
The Nature of Hardware and Software
• What is H/S Codesign (Prof. Schaumont’s definition):
Other definitions
• HW/SW Codesign is a design methodology supporting the concurrent
development of hardware and software (co-specification, co-development and co-
verification) in order to achieve shared functionality and performance goals for a
combined system
Hardware
We will model hardware by means of single-clock synchronous digital circuits
created using word-level combinational logic and flip-flops.
Hardware is realized by word-level combinational and sequential components,
such as registers, multiplexers (MUXs), adders, and multipliers.
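[Figure: example word-level circuit with a register (d, q, clk) and an adder (+); signal widths and the constants 0 and 1 are annotated]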
Hardware
Bear in mind that this is a very simplistic treatment of actual hardware. We ignore
advanced circuit styles, including asynchronous hardware, dynamic logic, and
multi-phase clocked hardware.
The cycle-based model is limited because it does not model glitches, race
conditions, or events that occur within a clock cycle.
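The cycle-based view can be sketched in a few lines of Python. This is an illustration only and not part of the original slides: the class name CounterCircuit, the 3-bit width, the reset value of 0, and the choice of an incrementing adder are all assumptions. The point is that the combinational logic is evaluated once per clock cycle and the register updates only at the clock edge, so nothing that happens inside a cycle is visible.

```python
class CounterCircuit:
    """Cycle-based model of a register fed by an adder (hypothetical example)."""

    WIDTH = 3  # assumed register width in bits

    def __init__(self):
        self.q = 0  # register output; reset value assumed to be 0

    def clock(self):
        # Combinational logic: evaluated once per cycle, with no delays or glitches.
        d = (self.q + 1) % (1 << self.WIDTH)
        # Sequential logic: the flip-flops capture d at the clock edge.
        self.q = d
        return self.q


def run(cycles):
    """Return the register value observed after each clock edge."""
    circuit = CounterCircuit()
    return [circuit.clock() for _ in range(cycles)]
```

Calling run(9) returns [1, 2, 3, 4, 5, 6, 7, 0, 1]: the 3-bit register wraps around after eight cycles.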
Software
ARM assembly example
start
               LDR R0, =array          ; Load the address of the array into R0
               LDR R1, =array_length   ; Load the length of the array into R1
               LDR R2, [R0], #4        ; Load the first element into R2 (current max)
               SUBS R1, R1, #1         ; Decrement the counter (first element consumed)
loop:          CMP R1, #0              ; Check if we have processed all elements
               BEQ done                ; If R1 is 0, we're done
               LDR R3, [R0], #4        ; Load the next element of the array into R3
               CMP R3, R2              ; Compare the next element with the current max (R2)
               BLE continue_loop       ; If R3 <= R2, skip the update of max
               MOV R2, R3              ; If R3 > R2, update max value in R2
continue_loop: SUBS R1, R1, #1         ; Decrement the counter
               BNE loop                ; If there are more elements, continue the loop
done:          END
C Program
HDL program
This very simple design can be addressed using hardware/software codesign.
The hardware model contains the 8051 processor, the coprocessor, and the connections
between them.
During execution, the 8051 processor will execute a software program written in C.
The listings show the C program and the RTL hardware model for this design; the hardware
model is written in the GEZEL language.
Each time the C program sends out a data value, it also cycles the value on port P0 between
ins_hello and ins_idle, which are encoded as the values 1 and 0, respectively.
The coprocessor model is a combination of a finite state machine (lines 10–18) and a
datapath (lines 1–8).
This FSMD is quite easy to understand.
The datapath defines two instructions, hello and decode, and the FSM controller selects, each
clock cycle, which of those instructions to execute.
For example, when the value of insreg is 1 and the current state of the FSM controller is s1, the
datapath executes instructions hello and decode, and the next state of the FSM controller is s2.
When the value of insreg is 0, the datapath executes only instruction decode, and
the next state of the FSM controller is s1.
The overall coprocessor behavior is as follows: when the ins input changes from 0 to 1, the
din input is printed in the next clock cycle.
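The controller behavior described above can be sketched as a small cycle-based simulation in Python. This is a hedged illustration, not the GEZEL model itself. The function name simulate, the register names insreg and dinreg, and the behavior in state s2 (wait until insreg returns to 0) are assumptions, because the text only describes the transitions out of state s1. The sketch also assumes that decode copies the inputs ins and din into registers and that hello prints the registered din value.

```python
def simulate(inputs):
    """Cycle-based sketch of the FSMD.

    inputs: list of (ins, din) pairs, one per clock cycle.
    Returns a list of (cycle, value) pairs for every value 'printed' by hello.
    """
    state = "s1"
    insreg, dinreg = 0, 0  # registers, assumed to reset to 0
    printed = []
    for cycle, (ins, din) in enumerate(inputs):
        # The FSM decides based on the current state and register values.
        if state == "s1":
            run_hello = insreg == 1
            next_state = "s2" if insreg == 1 else "s1"
        else:
            # Assumed behavior of s2: wait until insreg returns to 0.
            run_hello = False
            next_state = "s1" if insreg == 0 else "s2"
        if run_hello:
            printed.append((cycle, dinreg))  # instruction hello
        # Instruction decode runs every cycle: registers sample the inputs
        # at the clock edge, so the new values are visible in the next cycle.
        insreg, dinreg = ins, din
        state = next_state
    return printed
```

With the input sequence [(0, 0), (1, 5), (1, 5), (0, 5), (1, 7), (0, 7)], ins rises in cycles 1 and 4, and the sketch reports [(2, 5), (5, 7)]: each value is printed one clock cycle after the rising edge of ins.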
The 8051 microcontroller is captured with three ipblock (GEZEL library
modules), on lines 20–37.
The first ipblock is an i8051system.
It represents the 8051 microcontroller core, and it indicates the name
of the compiled C program that will execute on this core (driver.ihx on
line 22).
The other two ipblock are two 8051 output ports (i8051systemsource),
one to model port P0, and the other to model port P1.
Finally, the coprocessor and the 8051 ports are wired together in a
top-level module, shown in lines 39–49.
We can now simulate the entire model, including hardware and
software, as follows.
First, the 8051 C program is compiled to a binary executable.
Next, the GEZEL simulator will combine the hardware model and the
8051 binary executable in a co-simulation.
Defining Hardware/Software Codesign
A Digital-Signal Processor (DSP) is a processor with a specialized
instruction set, optimized for signal-processing applications.
Writing efficient programs for a DSP requires detailed knowledge of
these specialized instructions.
Very often, this means writing assembly code, or making use of a
specialized software library.
Hence, there is a strong connection between the efficiency of the
software and the capabilities of the hardware.
The Quest for Energy Efficiency
Choosing between implementing a design in hardware and
implementing it in software is very difficult.
From a designer's point of view, the easiest approach is to
write software, for example in C.
Software is easy and flexible, software compilers are fast, there are
large amounts of source code available, and all you need to start
development is a nimble personal computer.
Furthermore, why go through the effort of designing a hardware
architecture when there is already one available (namely, the RISC
processor)?
Relative Performance
The figure illustrates various cryptographic implementations in software
and hardware that have been proposed over the past few years.
These are all designs proposed for embedded applications, where the
trade-off between hardware and software is crucial.
As demonstrated by the graph, hardware crypto architectures have,
on the average, a higher relative performance compared to embedded
processors.
However, relative performance may not be a sufficient argument to
motivate the use of a dedicated hardware implementation.
Consider for example a specialized Application-Specific Integrated
Circuit (ASIC) versus a high-end (workstation) processor.
The hardware inside of the ASIC can execute many operations in
parallel, but the processor runs at a much higher clock frequency.
Furthermore, modern processors are very effective in completing
multiple operations per clock cycle.
As a result, an optimized software program on top of a high-end
processor may outperform a quick-and-dirty hardware design job on
an ASIC.
Thus, the absolute performance of software may very well be higher
than the absolute performance of hardware.
In contrast to relative performance, the absolute performance needs
to take clock frequency into account.
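The difference between relative and absolute performance can be made concrete with a small Python sketch. The numbers below are purely hypothetical and chosen only to illustrate the argument; they are not measurements from the slides. Relative performance is taken as useful operations per clock cycle, and absolute performance multiplies that by the clock frequency.

```python
def absolute_performance(ops_per_cycle, clock_hz):
    """Operations per second = relative performance x clock frequency."""
    return ops_per_cycle * clock_hz


# Hypothetical quick-and-dirty ASIC: many parallel operations, slow clock.
asic = absolute_performance(ops_per_cycle=8, clock_hz=100e6)

# Hypothetical high-end processor: fewer operations per cycle, fast clock.
cpu = absolute_performance(ops_per_cycle=2, clock_hz=3e9)
```

Under these assumed figures the ASIC has four times the relative performance, yet the processor reaches 6e9 operations per second against 8e8 for the ASIC.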
Energy Efficiency
There is another metric which is independent from clock frequency, and which can
be applied to all architectures.
That metric is energy-efficiency: the amount of useful work done per unit of energy.
[Figure: energy efficiency of AES implementations on different target platforms; horizontal axis: flexibility]
Take an example of a particular encryption application (AES) for
different target platforms.
The flexibility of these platforms varies from very high on the left to
very low on the right.
The platforms include: Java on top of a Java Virtual machine on top of
an embedded processor;
C on top of an embedded processor;
optimized assembly-code on top of a Pentium-III processor;
Verilog code on top of a Virtex-II FPGA; and
an ASIC implementation using 0.18 micron CMOS standard cells.
The Y-axis shows the number of gigabits that can be encrypted on each of
these platforms using a single joule of energy.
This shows that battery-operated devices would greatly benefit from using less
flexible, dedicated hardware engines.
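The metric on the Y-axis can be computed from throughput and power, since one watt equals one joule per second. The following Python sketch shows the arithmetic; the function name gigabits_per_joule and the example values are assumptions for illustration and are not data from the figure.

```python
def gigabits_per_joule(throughput_gbps, power_watts):
    """Energy efficiency: (Gb/s) / (J/s) = Gb/J."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_gbps / power_watts


# Hypothetical platform encrypting 2 Gb/s while drawing 0.5 W.
example = gigabits_per_joule(2.0, 0.5)
```

Note that the clock frequency does not appear in the calculation, which is why this metric can be compared across very different architectures.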
The Driving Factors in Hardware/Software Codesign
Specialized hardware architectures are usually also more efficient than software
from a relative performance perspective, i.e., the amount of useful work done per clock
cycle.
Flexibility comes with a significant energy cost, one which energy-optimized
applications cannot tolerate.
Therefore, you will never find a Pentium processor in a cell phone!
Power Densities:
Further increasing clock speed as a performance enhancer in modern high-end processors
has run out of steam because of thermal limits.
This has driven a broad and fundamental shift toward increased parallelism within
processor architectures.
However, at this moment, no dominant parallel computer architecture has been shown
to cover all applications. Commercially available systems include:
• Symmetric multiprocessors with shared memory
• Traditional processors tightly coupled with FPGAs as accelerator engines
• Multi-core and many-core architectures such as GPUs
Nor is there yet any universally adopted parallel programming language, i.e.,
code must be crafted differently depending on the target parallel platform.
This forces programmers to be architecturally aware of the target platform.
Design Complexity:
Today, it is common to integrate multiple microprocessors together with all related
peripherals and hardware components on a single chip.
This approach is known as system-on-chip (SoC). Modern SoCs are extremely complex.
Designing such a component is impossible without a detailed planning and design
phase.
Extensive simulations are required to test the design upfront, before committing to a costly
implementation phase.
Since software bugs are easier to address than hardware bugs, there is a tendency to increase
the amount of software.
Design Cost:
New chips are very expensive to design. As a result, hardware designers make chips
programmable so that these chips can be reused over multiple products or product
generations.
The SoC is a good example of this trend.
However, ‘programmability’ can be found in many different forms other than embedded
processors: reconfigurable systems are based on the same idea of reuse-through-
reprogramming.
Deep-Submicron Effects:
Designing new hardware from-scratch in high-end silicon processes is
difficult due to second-order effects in the implementation.
For example, each new generation of silicon technology has an increased
variability and a decreased reliability.
Programmable, flexible technologies make the hardware design process
simpler, more straightforward, and easier to control.
In addition, programmable technologies can be created to take the effects
of variations into account.
Finding the correct balance, while weighing all these factors, is a complex
problem. Instead, we will focus on optimizing metrics related to design cost
and performance.
In particular, we will consider how adding hardware to a software
implementation increases performance, while weighing the increase in
design cost.
The Hardware–Software Codesign Space
The preceding discussion makes it apparent that there is a multitude of
alternatives available for mapping an application to an architecture.
For a given application, there are many different possible solutions.
The collection of all these implementations is called the hardware–software
codesign space.
The following figure gives a symbolic representation of this design space and
indicates the main design activities in this design space.
[Figure: example micrographs of target platforms: microprocessor, FPGA, SoC, DSP, microcontroller]
SoC Examples
[Figure: example System-on-Chip (SoC) with IP cores. Labeled blocks include a Motorola DragonBall processor (#MC68VZ328), SDRAM (8 MB), Flash (4 MB), a USB interface, a screen digitizer, a power amplifier with its control, digital transceivers, a DSP, a transreflective monochrome backlit display with drivers, an FPGA interface, and manual inputs]
Codesign Examples
[Figure: video codec (H.261) example with camera and display, MSQ and MCC buses, software processors (uP + code), and hardware processors]
Each of the above platforms presents a trade-off between flexibility and efficiency
Codesign involves the following three activities:
• Platform selection
• Application mapping
• Platform programming
Very often, a specification is just a piece of English text, that leaves many
details of the application undefined
Step 2: Application mapping
Examples include:
• RISC: Software is written in C while the hardware is a processor
• FPGAs: Software is written in a hardware description language (HDL)
FPGAs can be configured to implement a soft processor, in which case, software
also needs to be written in C
• DSP: A digital signal processor is programmed using a combination of C and
assembly, which is run on a specialized processor architecture
• ASIP: Programming an ASIP is a combination of C and an HDL description
• ASIC: The application is written in an HDL, which is then synthesized to a hardwired
netlist and implementation
Note: ASICs are typically non-programmable, i.e., the application and platform
are one and the same
However, many platforms are not just composed of simple components, but
rather require multiple pieces of software, possibly in different programming
languages.
For example, the platform may consist of a RISC processor and a specialized
hardware coprocessor
Here, the software consists of C (for the RISC) as well as dedicated
coprocessor instruction-sequences (for the coprocessor).
Another concept reflected in the wedge figure is the domain-specific platform.
Codesign raises two questions: how to select a platform, and how to map an application onto it.
The first question is harder; seasoned designers choose based on their previous
experience with similar applications.
The second question is also challenging, but can be addressed in a more systematic
fashion using a design methodology.
A design method is a systematic sequence of steps to convert a specification
into an implementation
The Dualism of Hardware Design and Software Design
Designing requires the decomposition of a specification into low level primitives such
as gates (HW) and instructions (SW)
Resource Cost:
Temporal vs. spatial decomposition
The dualism in decomposition methods leads to a similar dualism in resource cost.
Decomposition in space, as used by a hardware designer, means that more
gates are required when a more complex design needs to be implemented.
Decomposition in time, as used by a software designer, implies that a more
complex design will take more instructions to complete.
Therefore, the resource cost for hardware is circuit area, while the resource cost for
software is execution time.
Design Constraints:
A hardware designer is constrained by the clock cycle period of a design.
A software designer, on the other hand, is limited by the capabilities of the
processor instruction set and the memory space available with the processor.
Thus, the design constraints for hardware are in terms of a time budget, while the
design constraints for software are fixed by the CPU.
So, a hardware designer invests circuit area to maintain control over execution
time, and a software designer invests execution time for an almost constant circuit
area.
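The trade between area and time can be illustrated with a deliberately simple Python sketch for adding n numbers. This is an assumed toy cost model, not a result from the slides: the spatial version is taken to use one adder per addition and to finish in a single clock cycle (assuming the adder chain fits within the clock period), while the temporal version reuses one adder and needs one step per addition. The function names are hypothetical.

```python
def spatial_cost(n):
    """Hardware-style decomposition: n - 1 adders, one clock cycle (assumed)."""
    return {"adders": n - 1, "steps": 1}


def temporal_cost(n):
    """Software-style decomposition: one adder reused for n - 1 steps."""
    return {"adders": 1, "steps": n - 1}


# Growing the problem from 4 to 16 inputs (n >= 2 is assumed throughout).
small_hw, large_hw = spatial_cost(4), spatial_cost(16)
small_sw, large_sw = temporal_cost(4), temporal_cost(16)
```

In this model a larger problem costs the hardware designer more adders (3 versus 15) at constant time, and costs the software designer more steps (3 versus 15) at constant area.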
Flexibility:
Software excels over hardware in the support of application flexibility.
Flexibility is the ease by which the application can be modified or
adapted after the target architecture for that application is
manufactured.
In software, flexibility is essentially free.
In hardware on the other hand, flexibility is not trivial.
Hardware flexibility requires that circuit elements can be easily reused
for different activities or functions in a design.
Parallelism:
A dual of flexibility can be found in the ease with which parallel
implementations can be created.
Parallelism is the most obvious approach to improving performance.
For hardware, parallelism comes for free as part of the design
paradigm.
For software, on the other hand, parallelism is a major challenge.
If only a single processor is available, software can only implement
concurrency, which requires the use of special programming
constructs such as threads.
When multiple processors are available, a truly parallel software
implementation can be made, but inter-processor communication and
synchronization become a challenge.
Modelling:
In software, modeling and implementation are very close.
Indeed, when a designer writes a C program, the compilation of that
program for the appropriate target processor will also result in the
implementation of the program.
In hardware, the model and the implementation of a design are
distinct concepts.
Initially, a hardware design is modeled using an HDL.
Such a hardware description can be simulated, but it is not an
implementation of the actual circuit.
Hardware designers use a hardware description language, and their
programs are models that are later transformed into an implementation.
Software designers use a software programming language, and their
programs are themselves the implementation.
Reuse:
Finally, hardware and software are also quite different when it comes to
Intellectual Property Reuse or IP-reuse.
The idea of IP-reuse is that a component of a larger circuit or a program can
be packaged, and later reused in the context of a different design.
In software, IP-reuse has undergone dramatic changes in recent years due to
open source software and the proliferation of open platforms.
When designing a complex program these days, designers will start from a
set of standard libraries that are well-documented and available on a wide
range of platforms.
For hardware design, IP-reuse is still in its infancy.
Hardware designers are only starting to define standard exchange
mechanisms.
IP-reuse of hardware has a long way to go compared to the state of reuse in
software.
Abstraction Levels
Discrete-event
Here, simulators abstract node behavior into discrete, possibly
irregularly spaced, time steps called events
Events represent the changes that occur to circuit nodes when the
inputs are changed in the test bench
The simulator is capable of modeling actual propagation delay of the
gates, similar to what would happen in a hardware instance of the
circuit
Discrete-event simulation is very popular for modeling hardware at
the lowest layer of abstraction in codesign
This level of abstraction is much less compute-intensive than
continuous time but accurate enough to capture details of circuit
behavior including glitches
Cycle-accurate
Single-clock synchronous hardware circuits have the important property that
all interesting things happen at regularly spaced intervals, namely at the clock
edge.
This abstraction is important enough to merit its own abstraction level, and it
is called cycle-accurate modeling.
A cycle-accurate model does not capture propagation delays or glitches.
All activities that fall ‘in between’ clock edges are concentrated at the clock
edge itself.
This level of abstraction is considered the golden reference in HW/SW
codesign
Instruction-accurate
RTL models are great but may be too slow for complex systems.
For example, your laptop has a processor that probably clocks at over 1 GHz (one billion cycles per second).
Assuming that you could write a C function that expresses a single clock cycle of processing,
you would have to call that function one billion times to simulate just a single second of
processing.
Clearly, further abstraction can be useful to build leaner and faster models.
Instruction-accurate modeling expresses activities in steps of one microprocessor instruction
(not cycle count)
Each instruction lumps together several cycles of processing.
Instruction-accurate simulators are used to verify complex software systems, such as
complete operating systems.
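The saving from moving up one abstraction level can be estimated with a short Python sketch. It counts how many simulator steps are needed to cover a given amount of simulated time. The function name simulation_steps and the average of 4 cycles per instruction are assumptions chosen for illustration; real processors vary widely.

```python
def simulation_steps(clock_hz, seconds, cycles_per_step=1.0):
    """Number of simulator steps needed to cover 'seconds' of processing."""
    total_cycles = clock_hz * seconds
    return total_cycles / cycles_per_step


# Cycle-accurate: one step per clock cycle at 1 GHz.
cycle_accurate = simulation_steps(1e9, 1.0)

# Instruction-accurate: one step per instruction, assuming 4 cycles each.
instruction_accurate = simulation_steps(1e9, 1.0, cycles_per_step=4.0)
```

Under these assumptions the cycle-accurate model needs one billion steps for one second of processing, while the instruction-accurate model needs 250 million.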
Transaction-accurate
For very complex systems, even instruction-accurate models may be too slow
or require too much modeling effort
In transaction-accurate modeling, only the interactions (transactions) that
occur between components of a system are of interest
For example, suppose you want to model a system in which a user process is
performing hard disk operations, e.g., writing a file
The simulator simulates commands exchanged between the disk drive and
the user application
The sequence of instruction-level operations between two transactions can
number in the millions but the simulator instead simulates a single function
call
Transaction-accurate models are important in the exploratory phases of a
design, before effort is spent on developing detailed models
For this course, we are interested in instruction-accurate and cycle-accurate levels
Concurrency and Parallelism
Concurrency and parallelism are terms that often occur in the context of hardware-
software codesign.
They mean very different things.
Concurrency is the ability to execute simultaneous operations because these
operations are completely independent.
Parallelism is the ability to execute simultaneous operations because the operations
can run on different processors or circuit elements.
Thus, concurrency relates to an application model, while parallelism relates to the
implementation of that model.
Hardware is always parallel.
Software, on the other hand, can be sequential, concurrent, or parallel.
Sequential and concurrent software requires only a single processor.
Parallel software requires multiple processors.
Software running on your laptop (e.g., Word, email) is concurrent.
Software running on a 65536-processor IBM Blue Gene/L is parallel.
Concurrency and Parallelism Cont..
A key objective of HW/SW codesign is to allow designers to leverage the
benefits of true parallelism in cases where concurrency exists in the
application
There is a well-known Comp. Arch principle called Amdahl’s law
The maximum speedup of any application that contains q% sequential
code is 1 / (q/100) = 100/q.
For example, if your application spends 33% of its time running
sequentially, the maximum speedup is about 3.
This means that no matter how fast you make the parallel component
run, the maximum speedup you will ever be able to achieve is 3
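The bound can be checked numerically with a short Python sketch. The general form of Amdahl's law used here, 1 / (s + (1 - s) / p) for a sequential fraction s and a speedup p of the parallel part, is the standard textbook formulation and is not stated on the slide; the function names are hypothetical.

```python
def amdahl_speedup(sequential_fraction, parallel_speedup):
    """Overall speedup when only the parallel part is accelerated."""
    s = sequential_fraction
    return 1.0 / (s + (1.0 - s) / parallel_speedup)


def max_speedup(sequential_fraction):
    """Limit of amdahl_speedup as parallel_speedup grows without bound."""
    return 1.0 / sequential_fraction


# 33% sequential code: the parallel part is made 10x, 1000x, 1e6x faster.
speedups = [amdahl_speedup(0.33, p) for p in (10, 1000, 1e6)]
limit = max_speedup(0.33)
```

The values in speedups increase toward the limit of roughly 3.03 but never reach it, which is the point made above.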
Thus, you see that we don’t only need to have parallel platforms, we
also need a way to write parallel programs to run on those platforms.
Surprisingly, even algorithms that seem sequential at first can be executed (and
specified) in a parallel fashion.
The authors of the Connection Machine (CM), Hillis and Steele, show that it is possible to
express algorithms in a concurrent fashion so that they map neatly onto a CM.
Consider the problem of summing an array of numbers.
To take the sum, distribute the array over the CM processors so that each
processor holds one number.
We can now take the sum over the entire array in log2(n) steps (n being the
number of processors).
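The logarithmic step count can be seen in a sequential Python simulation of the parallel sum. This is a sketch under stated assumptions: the function name parallel_sum is hypothetical, the input is assumed to be non-empty, and the inner loop stands in for additions that would all happen simultaneously on separate processors.

```python
def parallel_sum(values):
    """Simulate a tree-style parallel sum; returns (total, number of steps)."""
    vals = list(values)  # vals[i] is the number held by processor i
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # All additions in this loop are independent of each other, so a
        # parallel machine could perform them in a single step.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps
```

For the eight numbers 1 through 8, parallel_sum returns (36, 3): the total is found in log2(8) = 3 steps rather than 7 sequential additions.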
Even though the parallel sum speeds up the computation significantly, there
remains a lot of wasted compute power, because fewer processors perform useful
additions in each successive step.
On the other hand, if the application requires all partial sums, i.e., the sum of the
first two, three, four, etc. numbers, then the full power of the parallel machine is
used
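Computing all partial sums can be sketched with the same kind of sequential simulation. The version below follows one common formulation of a parallel prefix sum (often called a Hillis-Steele scan); whether it matches the exact scheme on the slide's missing figure is an assumption, and the function name parallel_prefix_sums is hypothetical.

```python
def parallel_prefix_sums(values):
    """Simulate a parallel prefix sum; returns (partial sums, number of steps)."""
    vals = list(values)  # vals[i] is the number held by processor i
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # Snapshot the old values: every processor reads its neighbor's
        # value from before this step, as it would on a parallel machine.
        prev = vals[:]
        for i in range(stride, n):
            vals[i] = prev[i] + prev[i - stride]
        stride *= 2
        steps += 1
    return vals, steps
```

For [1, 2, 3, 4] the result is ([1, 3, 6, 10], 2). Unlike the plain sum, most processors do useful work in every step, and every processor ends up holding one of the requested partial sums.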