SimpleDSP: A Fast and Flexible DSP Processor Model (Extended Abstract)
Jeff Ringenberg, David Oehmke
Todd Austin, Trevor Mudge
{jringenb, doehmke, taustin, tnm}@eecs.umich.edu
The University of Michigan
Advanced Computer Architecture Lab
1.0 Introduction
For a future mobile microprocessor, the standard model of high performance at all costs does
not apply. Power usage and chip costs are two major design points that must be minimized in order for the
processor to be a viable product. Unfortunately, many current high performance designs satisfy neither of these
requirements and, therefore, a separate class of machines has emerged that incorporates DSP functionality. As a
tool to explore this design space, we have developed a simulator for a popular DSP, the Texas Instruments
TMS320C6211 (C62x) [1], and incorporated it into the widely used SimpleScalar toolset [2]. With this simulator,
we have run detailed, cycle-accurate simulations of the underlying architecture and have found several
bottlenecks within the design. In addition, we have discovered the importance of appropriate code design and
instruction selection through the use of intrinsic instructions. However, much deeper studies remain to be performed,
since the purpose of this paper is to show the functionality and usage of the simulator.
2.0 Motivation
The importance of DSP systems does not need justification. Until recently, however, there has not been
widely disseminated support for simulating such architectures in the academic community. The need for this type
of simulator is apparent from the abundant usage of SimpleScalar: many new papers across a wide range of
microarchitecture research make use of this popular simulation environment. However, since SimpleScalar models a
superscalar processor, it is not useful for evaluating architectures that do not adhere to that model; this is
especially true for DSP and VLIW processors. Therefore, to address this gap in functionality, we have modified
the existing SimpleScalar structure to facilitate the simulation of both types of architecture, with a focus on
flexibility and accuracy.
4.0 Implementation
SimpleScalar was designed to simulate processors in which each instruction semantically has a latency of
one cycle and instructions execute serially. The TI-C62x, however, is a VLIW processor and therefore features
both parallel execution and non-uniform instruction latencies. For such processors, the compiler is responsible for
statically scheduling the code around these latencies. Preserving the semantics of the generated code requires that
each instruction be given its correct timing and that instructions be executed in parallel, which makes the
functional simulation of a VLIW processor much more difficult than that of a scalar processor.
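To make the timing issue concrete, the following is a minimal sketch, in C, of how multi-cycle results can be buffered and committed only after their delay slots expire; the names and structure are purely illustrative and are not taken from the simulator source.

    /* Hypothetical sketch (not the simulator source): results of
     * multi-cycle operations are buffered and only become visible in the
     * register file once their delay slots have elapsed, mirroring the
     * latencies the TI compiler schedules around (e.g. two cycles for
     * MPY, five for LDW on the C62x). */

    #define NUM_REGS     32
    #define MAX_INFLIGHT 64

    struct pending_write {
        int  valid;
        int  reg;          /* destination register number        */
        long value;        /* result waiting to be committed     */
        int  cycles_left;  /* cycles until the result is visible */
    };

    static long regfile[NUM_REGS];
    static struct pending_write inflight[MAX_INFLIGHT];

    /* Called when an instruction produces a result with a given latency. */
    void schedule_write(int reg, long value, int latency)
    {
        for (int i = 0; i < MAX_INFLIGHT; i++) {
            if (!inflight[i].valid) {
                inflight[i] = (struct pending_write){ 1, reg, value, latency };
                return;
            }
        }
    }

    /* Called once per simulated cycle, before the next execute packet
     * reads its operands, to commit results whose latency has expired. */
    void commit_expired_writes(void)
    {
        for (int i = 0; i < MAX_INFLIGHT; i++) {
            if (inflight[i].valid && --inflight[i].cycles_left == 0) {
                regfile[inflight[i].reg] = inflight[i].value;
                inflight[i].valid = 0;
            }
        }
    }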
The TI-C62x itself has additional features, many geared towards DSP functionality, that further
complicate the simulation. The pipeline is more complex because the processor uses several techniques to improve
code density, including NOPs with multi-cycle latencies and the decoupling of fetch packets from execute
packets. Instruction decode is also difficult because most instructions can be executed on either of the DSP's two
clusters and on several different functional units, and most instructions also allow one or more sources to be a
register, a constant, or a register from the other cluster. A final complication is that the TI compiler targets TI's
test board, where all I/O system calls are implemented by the connected host PC using a breakpoint and a global
buffer for data transfer.
The complications inherent in this architecture meant that it was not practical to merge this support into the
existing SimpleScalar simulators. Instead, we decided to create a new simulator based on the ideas in
SimpleScalar, reusing as much of the existing code as possible. This allows us to take advantage of future
improvements, allows people familiar with SimpleScalar to learn our simulator quickly, and provides for possible
future interoperability in areas such as heterogeneous multiprocessing. We also decided to explicitly simulate each
stage of the TI pipeline to guarantee accurate timing of all the instructions and to make the resulting simulator
cycle-accurate.
Similar to SimpleScalar, we use “def” files to decode and implement instructions. However, our
simulator uses two separate def files. One def file is similar to SimpleScalar's and decodes the instructions; the
decode is done in one pass over the entire text section of the executable, and an operation structure is filled out for
each instruction. The other def file, called the operation def file, contains the timing information for each operation
as well as its implementation. The basic execution of an operation is similar to that in SimpleScalar, with
one notable exception: the reading and writing of registers is removed from the instruction-specific
implementations and handled generically, using the information stored in the operation structure, with operand
values read from and written to that structure. This was done both to simplify the instruction implementations and
because some of these operations must occur in parallel; for example, during each cycle all data reading
must be complete before any writing can be done. Finally, the pipeline code was moved into a separate
pipeline file so that the bulk of the code could be shared across all the different versions of our simulator. The
pipeline code provides macros that the various versions use to hook into the pipeline. These are used to
provide additional statistics in sim-vliw-profile, to hook into the cache model in sim-vliw-cheetah and sim-vliw-
cache, and to verify the simulation against a trace file in sim-vliw-verify.
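As an illustration of the operation structure and the generic register handling described above, here is a minimal sketch in C; all names and fields are hypothetical and are not taken from the actual def files.

    /* Hypothetical sketch of a decoded operation record and the generic
     * read-then-execute-then-write ordering described in the text; field
     * and function names are illustrative only. */

    #define MAX_SRCS 3

    struct operation {
        int  unit;                 /* functional unit (.L, .S, .M, or .D)    */
        int  cluster;              /* A-side or B-side register file         */
        int  latency;              /* result latency from the operation def  */
        int  src_reg[MAX_SRCS];    /* source register numbers, -1 = constant */
        long src_val[MAX_SRCS];    /* operand values filled in generically   */
        int  dst_reg;              /* destination register, -1 = none        */
        long dst_val;              /* result produced by the implementation  */
        void (*execute)(struct operation *op);  /* body from the operation def file */
    };

    /* One execute packet per cycle: read every source first, then run the
     * implementations (which touch only the structure), then commit every
     * write, so that parallel operations see the pre-cycle register values. */
    void run_execute_packet(struct operation **packet, int n, long *regfile)
    {
        for (int i = 0; i < n; i++)
            for (int s = 0; s < MAX_SRCS; s++)
                if (packet[i]->src_reg[s] >= 0)
                    packet[i]->src_val[s] = regfile[packet[i]->src_reg[s]];

        for (int i = 0; i < n; i++)
            packet[i]->execute(packet[i]);

        for (int i = 0; i < n; i++)
            if (packet[i]->dst_reg >= 0)
                regfile[packet[i]->dst_reg] = packet[i]->dst_val;
        /* (A full model would route these writes through the latency
         *  machinery rather than committing them immediately.) */
    }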
5.0 Experiments
As mentioned previously, our simulator has allowed us to run detailed, cycle-accurate simulations of the
TI-C62x architecture and to identify which features of the design work well and which do not. Since the purpose
of this paper is to show the functionality and features of the simulator, and not performance results, demonstrative
experiments were run for a few DSP-like benchmarks, namely the GSM coder/decoder [5] and several
components of a SmartCamera system [6].
For our first experiment, we counted the cycles that consist entirely of NOP instructions and then analyzed
the effect of removing them. Due to the statically scheduled, VLIW nature of the C62x, NOP instructions are
inserted after branches and other multi-cycle-latency instructions to ensure correct operation. Simply
removing the NOP cycles from the code is not as straightforward as one might think, since the code would not
function properly without them. The results therefore represent a best-case scenario in which the compiler would
not need to insert these scheduling delays. As Figure 1 shows, many cycles consist only of NOPs; they amount to
wasted execution, and if they could be removed, execution time would decrease greatly.
Figure 1: Relative execution time with all NOP cycles removed and with branch delay cycles removed, for the GSM Encoder, GSM Decoder, and Smart Camera Region, Contour, Ellipse, Graph, and HMM benchmarks.
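A count like this can be gathered with a per-cycle hook of the kind the pipeline macros expose to sim-vliw-profile; the following is a minimal, purely illustrative sketch (the names are hypothetical, not taken from the simulator source).

    /* Hypothetical per-cycle profiling hook: count cycles whose execute
     * packet contains nothing but NOPs.  OP_NOP and the hook signature
     * are illustrative only. */

    enum { OP_NOP = 0 };

    static unsigned long long total_cycles;
    static unsigned long long nop_only_cycles;

    void profile_cycle(const int *opcodes, int n_ops)
    {
        int all_nops = 1;

        for (int i = 0; i < n_ops; i++)
            if (opcodes[i] != OP_NOP)
                all_nops = 0;

        total_cycles++;
        if (all_nops)
            nop_only_cycles++;
    }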
For our second experiment, we looked at the effect on execution time of including intrinsic instructions
in the GSM code. These intrinsic instructions, a saturated add being a good example, are tailor-made for the C62x
and are used as much as possible. Since the header files [7] that TI provided us insert the intrinsic instructions
only into the GSM code, we were unable to obtain these results for the SmartCamera applications. It
should be noted that these hand-coded header files are required for appropriate intrinsic selection because the TI
compiler is not effective at discovering this information on its own. Our third experiment explores the varying
ability of the compiler to create efficient code.
As Figure 2 shows, the insertion of the intrinsic instructions has a dramatic effect on the execution time of
the GSM code. This is a clear example of how important such instructions are to the efficient generation and
execution of code on a DSP, or on any other architecture that provides them. It also demonstrates the need for a
good compiler that can decide where these instructions should be placed, or at least for hand-coded header files
that fulfill the same function.
Figure 2: Relative execution time of the GSM encoder and decoder with and without intrinsic instructions.
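To make the saturated-add example concrete, the sketch below contrasts a portable C implementation with the single-instruction intrinsic form; the _sadd spelling follows the TI header files [7], while the preprocessor guard and function names are illustrative.

    #include <limits.h>

    /* Portable saturated 32-bit add: the compiler must generate a
     * multi-instruction compare-and-branch sequence for this. */
    static int sat_add_portable(int a, int b)
    {
        if (a > 0 && b > INT_MAX - a)
            return INT_MAX;
        if (a < 0 && b < INT_MIN - a)
            return INT_MIN;
        return a + b;
    }

    #ifdef _TMS320C6X   /* illustrative guard: only the TI compiler provides _sadd */
    /* With the intrinsic, the whole operation maps onto the C62x SADD
     * instruction, which executes in a single cycle. */
    static int sat_add_intrinsic(int a, int b)
    {
        return _sadd(a, b);
    }
    #endif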
For our third experiment, we examined the TI compiler's ability to create efficient code
by exploring the effects of various compiler optimizations on execution time. Since VLIW processors are highly
dependent on their compiler for optimal code generation and scheduling, it makes sense to explore which
optimizations are the most beneficial.
In Figure 3, all relative execution times are normalized to the defaults with –O3 optimization. The first two
optimizations turn off debug information, and the graph shows that for some benchmarks quite a bit of
performance is lost when this debug information is left in. Forcing the statistics-counting function inline in the GSM
code yields quite a performance benefit as well: C62x function calls have a fair amount of overhead, so inlining
works well when a very small function is called often. For the next optimization, assuming no aliasing allows the
compiler to be aggressive in register allocation and in instruction reordering, and this helps in a couple of the
benchmarks (see the sketch below). Next, using a large inlining threshold benefits those benchmarks that contain
many function calls. Finally, whole-program analysis provides the compiler with more information to use when
compiling, for instance allowing it to propagate constants through function calls; in several benchmarks, this allows
the compiler to software-pipeline some of the major loops more efficiently. As the graph shows, it is not always as
simple as turning on all optimizations, since for a couple of the benchmarks turning on whole-program analysis
actually degrades performance.
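To illustrate the aliasing point, in a loop like the one sketched below the compiler can only hoist loads above earlier stores and software-pipeline the iterations if it knows the pointers cannot overlap, whether through a compiler-wide no-aliasing option or, as here, the C99 restrict qualifier; the example itself is illustrative and is not taken from the benchmarks.

    /* Without an aliasing guarantee the compiler must assume dst and src
     * may overlap, so each iteration's store can block the next
     * iteration's load and software pipelining largely breaks down. */
    void scale(short *restrict dst, const short *restrict src, int n, short k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (short)(src[i] * k);   /* independent iterations pipeline well */
    }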
Figure 3: Relative execution time under various compiler optimizations (profile debug, no debug, inline counts, full speculation, large inline threshold, whole program) for the intrinsic and non-intrinsic GSM encoder and decoder and the Smart Camera Region and Contour benchmarks.
6.0 Conclusion
In this paper, we have presented a simulator that can be used to do detailed analysis of a popular DSP, the
TI TMS320C6211. In addition to being able to simulate this particular architecture, the simulator is general
enough to allow the simulation of other VLIW machines with only minor changes to the infrastructure. With this
simulator, it is now possible to explore both new architecture ideas and compiler ideas in a flexible and accurate
manner. It is our opinion that this tool will make the design of future mobile devices easier and hopefully usher in
a new era of design and simulation.
References
[1] Texas Instruments. TMS320C6000 CPU and Instruction Set Reference Guide. SPRU189F. October 2000.
Available from http://focus.ti.com/lit/ug/spru189f/spru189f.pdf.
[2] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-97-1342,
University of Wisconsin-Madison, June 1997.
[3] Texas Instruments. Code Composer Studio IDE Version 2.2. August 2003. Available from
http://www.ti.com/tmwccs.
[4] D. Arifler and B. L. Evans. Web-Enabled Simulation and Debugging for Digital Signal Processors and
Microcontrollers. Available from http://anchovy.ece.utexas.edu/~arifler/wetics/.
[5] GSM 06.51 Encoder/Decoder, Digital Cellular Telecommunications System (Phase 2+), Enhanced Full Rate
Speech Processing Functions, Version 8.0.1. Available from http://www.etsi.org.
[6] T. Lv, B. Ozer, and W. Wolf. Workload Characterization for Smart Cameras. 3rd Workshop on Media and
Streaming Processors (held in conjunction with the 34th International Symposium on Microarchitecture),
December 2001.
[7] Texas Instruments. ETSI Math Operations in C for the TMS320C62x. SPRA617A. November 2000.
Available from http://focus.ti.com/lit/an/spra617a/spra617a.pdf.