Thesis
Mark Shannon
Submitted for the degree of MSc by Research
University of York
Department of Computer Science
July 2006
Abstract
Compiler design for stack machines, in particular register allocation, is an
under-researched area. In this thesis I present a framework for analysing
and developing register allocation techniques for stack machines. Using this
framework, I analyse previous register allocation methods and develop two new
algorithms for global register allocation, both of which outperform previous
algorithms, and lay the groundwork for future enhancements. Finally I discuss
how effective global register allocation for stack machines can influence the
design of high performance stack architectures.
Included with this thesis is a portable C compiler for stack machines,
which incorporates these new register allocation methods. The C compiler
and associated tools (assembler, linker and simulator) are included in the CD.
Acknowledgements
This work was done as part of the UFO project, to develop a soft-core stack
processor utilising the latest research on stack based architectures. Its aim
was to produce a C compiler which would produce high quality code for the
UFO machine.
The research was funded by the Department of Trade and Industry through
its Next Wave initiative, as part of the AMADEUS project.
I would like to thank my supervisor Chris Bailey for his help and support
throughout this project, and for kindling my interest in stack machines.
Thanks also to Christopher Fraser and David Hanson for providing the lcc C
compiler free of charge. Without it this project could not have been a success.
Finally I would like to thank Ally Price and Huibin Shi for proof reading
and listening to my ideas and ramblings.
Previous Publication
Elements of this thesis, notably the stack-region framework, appeared in a
paper presented at EuroForth 2006[25].
Contents

1 Introduction
  1.1 Stack Machines
    1.1.1 The Stack
    1.1.2 Advantages of the Stack Machine
    1.1.3 History
    1.1.4 Current Stack Machines
    1.1.5 Stack Machines, More RISC Than RISC?
      Common Features
      Differences
    1.1.6 Performance of Stack Machines
      Pros
      Cons
      In Balance
  1.2 The Stack
    1.2.1 An Abstract Stack Machine
    1.2.2 Stack Manipulation
  1.3 Compilers
    1.3.1 Compilers for Stack Machines
    1.3.2 Register Allocation for Stack Machines
  1.4 Context
    1.4.1 The UFO Architecture
  1.5 Goal
  1.6 Contribution
  1.7 Summary

2 The Stack
  2.1 The Hardware Stack
  2.2 Classifying Stack Machines by Their Data Stack
    2.2.1 Views of the Stack
    2.2.2 Stack Regions
      The Evaluation Region
      The Parameter Region
      The Local Region
  2.3 An Example
  2.4 Summary

3 The compiler
  3.1 The Choice of Compiler
    3.1.1 GCC vs LCC
    3.1.2 A Stack Machine Port of GCC: The Thor Microprocessor
  3.2 LCC
    3.2.1 Initial Attempt
  3.3 The Improved Version
    3.3.1 The Register Allocation Phase
    3.3.2 Producing the Flow-graph
    3.3.3 Optimisation
      Tree Flipping
  3.4 Structure of the New Back-End
    3.4.1 Program flow
    3.4.2 Data Structures
    3.4.3 Infrastructure
    3.4.4 Optimisations
  3.5 LCC Trees
    3.5.1 Representing Stack Manipulation Operators as LCC Tree Nodes
    3.5.2 The Semantics of the New Nodes
      Stack
      Copy
    3.5.3 Tuck
    3.5.4 Representing the Stack Operations
      Drop
      Copy
      Rotate Up
      Rotate Down
      Tuck
    3.5.5 Change of Semantics of Root Nodes in the LCC Forest
  3.6 Additional Annotations for Register Allocation
  3.7 Tree Labelling

4 Register allocation
  4.1 Koopman's Algorithm
    4.1.1 Description
    4.1.2 Analysis
    4.1.3 Implementation
    4.1.4 Analysis of Replacements Required
      Maintaining the Depth of the L-stack
      Initial Transformations
      Transformations After the First Node Has Been Transformed
      Transformations After the Second Node Has Been Transformed
      Transformations After Both Nodes Have Been Transformed
  4.2 Bailey's Algorithm
    4.2.1 Description
    4.2.2 Analysis of Bailey's Algorithm
    4.2.3 Implementation
  4.3 A Global Register Allocator
    4.3.1 An Optimistic Approach
    4.3.2 A More Realistic Approach
    4.3.3 A Global Approach
    4.3.4 Outline Algorithm
    4.3.5 Determining X-stacks
      Ordering of Variables
      Heuristics
      Propagation of Preferred Values
    4.3.6 Other Global Allocators
    4.3.7 L-Stack Allocation
    4.3.8 Chain Allocation
  4.4 Final Allocation
  4.5 Peephole Optimisation
    4.5.1 Example
    4.5.2 The Any Instruction
  4.6 Summary

5 Results
  5.1 Compiler Correctness
  5.2 Benchmarks
  5.3 The Baseline
  5.4 The Simulator
  5.5 Cost Models
    5.5.1 The Models
  5.6 Relative Results
    5.6.1 Comparing the Two Global Allocators
    5.6.2 Relative program size
  5.7 Summary

B Results
  B.1 Dynamic Cycle Counts
  B.2 Data memory accesses

C LCC-S Manual
  C.1 Introduction
  C.2 Setting up lcc-s
  C.3 Command line options
  C.4 The structure of lcc-s
  C.5 Implementations
  C.6 The components
    C.6.1 lcc
    C.6.2 scc
    C.6.3 peep
    C.6.4 as-ufo
    C.6.5 link-ufo

G Source Code

Bibliography

Index
List of Tables

1.1
1.2
1.3 Possible sets of stack manipulation operators
2.1 Stack Classification
3.1
3.2 Partial machine description
3.3 Extended machine description
4.1
4.2
4.3
4.4
4.5
5.1 The benchmarks
5.2 Instruction Costs
B.1 Flat model
B.2 Harvard model
B.3 UFO model
B.4 UFO slow-memory model
B.5 Pipelined model
B.6 Stack mapped model
B.7 Data memory accesses
List of Figures

2.1
2.2 Evaluation of expression y = a * b + 4
2.3
2.4 Using the l-stack when evaluating y = a[x] + 4
2.5 C Factorial Function
2.6
2.7
2.8
3.1 Compiler overview
3.2 Translator
3.3 Compiler back end
3.4 lcc tree for y = a[x] + 4
3.5 lcc forest for C source in Figure 2.5
3.6 drop3 as lcc tree
3.7 copy3 as lcc tree
3.8 rot3 as lcc tree
3.9 rrot3 as lcc tree
3.10 tuck3 as lcc tree
3.11 Labelled tree for y = a[x] + 4
3.12 Relabelled tree for y = a[x] + 4
4.1
5.1 Relative
5.2 Relative
5.3 Relative
5.4 Relative
5.5 Relative
5.6 Relative
F.1 doubleloop.c
F.2 twister.c
F.3 Cycles for UFO
F.4 Relative speed
List of Algorithms

1 Koopman's Algorithm
2 Implementation of Koopman's Algorithm
3 Bailey's Algorithm
4 Bailey's Algorithm with x-stack
5 Determining the x-stack
6 Propagation of x-stack candidates
7 L-Stack Allocation
8 Chain Allocation
9 Matching the L-Stack and X-Stack
10 Cost function
11 Proposed Algorithm
Chapter 1
Introduction
This thesis aims to demonstrate that compilers can be made which produce
code well tailored to stack machines, taking advantage of, rather than being
confounded by, their peculiar architecture.
This chapter covers the necessary background, explaining stack machines
and laying out the problems in implementing quality compilers for them.
1.1 Stack Machines
Stack machines have an elegant architecture which would seem to offer a lot
to computer science, yet the stack machine is widely viewed as outdated and
slow. Since there have been almost no mainstream stack machines since the
1970s this is hardly surprising, but is unfortunate as the stack machine does
have a lot of potential. The usual view is that stack machines are slow because
they must move a larger amount of data between the processor and memory
than a standard von Neumann architecture[17]. Obviously, this argument only
holds true if stack machines do indeed have to move more data to and from
memory than other architectures. This dissertation aims to show that this is
not necessarily true and that a well designed compiler for stack machines can
reduce this surplus data movement significantly, by making full and efficient
use of the stack.
1.1.1 The Stack

The number of accessible registers varies. For example the UFO machine[23], which is the
target of the compiler discussed in this thesis, has three accessible registers
below the top of stack but other machines could have considerably more. In
most stack machines, the arithmetic logic unit (ALU) and memory management
unit operate strictly at the top of the stack; all ALU operands and memory
writes are popped from the top of the stack and all ALU results and memory
loads are pushed to the top of the stack. Thus stack machines are also
called implicit-addressing machines, since no explicit references to operands
are required. This stack is usually implemented in hardware, and although
physically finite, it is conceptually infinite, with hardware or the operating
system managing the movement of the lower parts of the stack to and from
memory.
1.1.2 Advantages of the Stack Machine

1.1.3 History
Stack machines have been around since the 1960s, the Burroughs B5000 and
successors[22] being the first widely used stack-based processors, and were once
considered one of the fastest architectures, as it was thought that their design
simplified compiler construction. Subsequently, improvements in compiler design
meant that memory-memory architectures came to the fore. At the time,
memory-to-memory architectures were often considered superior to register-to-memory
or register-to-register architectures, as memories were then able to keep
up with processors, as suggested by Myers[21]. However it was
the development of the IBM 360[3] which sidelined the stack machine. As
processor speeds increased, the need for processors to use on-chip registers
as their primary data-handling mechanism became apparent and the RISC1

1 RISC: Reduced Instruction Set Computer. RISC machines have only a few addressing
modes and notably cannot move data to or from memory at the same time as performing
operations on that data. This simpler design freed silicon resources to be spent on improving
performance.
1.1.4 Current Stack Machines

1.1.5 Stack Machines, More RISC Than RISC?
The guiding philosophy of the RISC revolution was to have a simple, consistent
instruction set that could be executed quickly and facilitate hardware optimisations.
Stack machines also have simple instruction sets and the instruction set has
more orthogonality, since register addressing and computation are in separate
instructions.
Common Features
Simple instructions - All instructions correspond to a single simple operation
within the processor.
Consistent instruction format - The instructions are usually of a fixed
length, with a standard layout.
2 CISC: Complex Instruction Set Computer. CISC machines have instructions capable of
performing several simple operations at once, but at the expense of complex processor design
and performance reductions due to the use of microcoding.

3 Field Programmable Gate Array.
Differences
The primary difference between RISC and stack machines is in the arrangement
of their on-chip registers. RISCs arrange their registers as a fixed size array,
whereas stack machines arrange theirs as a stack.
Stack machines make the simple instruction format of RISC even simpler,
with fully separate addressing and computation. All RISC ALU instructions,
such as add, include addressing information, but stack machine instructions
do not; the stack machine instruction to add two numbers is simply add.

    RISC machine:    add r1, r2, r3
    Stack machine:   add
The stack architecture evaluates expressions without side effects, that is, it
leaves no non-result values in any register. Conventional machine instructions,
such as the add instruction, have the side effect of leaving a value in a
register. Leaving a value in a register is often a useful side effect, but it can
be regarded as a side effect nonetheless.
1.1.6 Performance of Stack Machines
Stack machines have a reputation for being slow. This is a sufficiently widely
held opinion that it is self-perpetuating. The sole formal comparison of RISC
vs stack machine performance was inconclusive[28], and is rather out of date;
however, it is possible to analyse how the differences between stack machines
and RISC might affect performance.
Pros
Higher clock speeds: The stack processor is a simpler architecture and
should be able to run with a faster clock, at least no slower.
Low procedure call overheads: Stack machines have low procedure call
penalties, since no registers need saving to memory across procedure
calls.
Fast interrupt handling: Interrupt routines can execute immediately as
hardware takes care of the stack management.
Smaller programs: Stack machine instructions have a much shorter encoding,
often one byte per instruction, meaning that stack machine programs
are generally smaller. Although the instruction count is usually slightly
higher, overall, the code length is significantly shorter.
Cons
Higher instruction count: The separation of addressing and computation
in stack machine instructions means that more of them are required to
do anything.
1.2 The Stack

1.2.1 An Abstract Stack Machine
In order to discuss stack machines, an ideal stack machine and its instruction
set needs to be defined. It is only the stack manipulation capabilities of this
machine that are of interest in this section. Stack manipulation instructions
take the form nameX where name defines the class of the manipulator, and
X is the index of the element of the stack to be manipulated. The index of
the top of stack is one, so the drop2 instruction removes the second element
on the stack. The abstract stack machine used in this thesis has five classes
of stack manipulation instructions; these are:
Drop Eliminates the Xth item from the stack.
Copy Duplicates the Xth item, pushing the duplicate on to the top of the
stack.
Rot Rotates the Xth item to the top of the stack, pushing the intermediate
items down by one.
Rrot Rotates the top of the stack to the Xth position, pushing the intermediate
items up by one.
Tuck Duplicates the top of the stack, inserting the duplicate into the Xth
position, pushing the intermediate items up by one. This means that
the duplicate actually ends up at the (X + 1)th position.
The abstract machine instruction set is listed more fully in Appendix A.
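To make the indexing convention concrete, the following is a minimal C sketch of the five manipulator classes acting on an array model of the stack. It is illustrative only: the function names and the array representation are not part of the abstract machine definition, which specifies behaviour rather than implementation.

    #include <assert.h>

    #define STACK_MAX 64
    static int stack[STACK_MAX];         /* stack[sp-1] is the top of the stack */
    static int sp = 0;

    static int *slot(int x) { return &stack[sp - x]; }   /* Xth item; top is X = 1 */

    static void dropX(int x) {           /* eliminate the Xth item                */
        for (int i = x; i > 1; i--) *slot(i) = *slot(i - 1);
        sp--;
    }
    static void copyX(int x) {           /* push a duplicate of the Xth item      */
        int v = *slot(x);
        stack[sp++] = v;
    }
    static void rotX(int x) {            /* rotate the Xth item to the top,
                                            pushing intermediate items down       */
        int v = *slot(x);
        for (int i = x; i > 1; i--) *slot(i) = *slot(i - 1);
        *slot(1) = v;
    }
    static void rrotX(int x) {           /* rotate the top to the Xth position,
                                            pushing intermediate items up         */
        int v = *slot(1);
        for (int i = 1; i < x; i++) *slot(i) = *slot(i + 1);
        *slot(x) = v;
    }
    static void tuckX(int x) {           /* duplicate the top into the Xth position;
                                            the duplicate ends up at the (X+1)th
                                            position of the original stack        */
        copyX(1);
        rrotX(x + 1);
    }

    int main(void) {
        stack[sp++] = 30; stack[sp++] = 20; stack[sp++] = 10;  /* top-first: 10 20 30 */
        copyX(2);                        /* duplicate the 2nd item: 20 10 20 30   */
        assert(*slot(1) == 20 && *slot(2) == 10);
        dropX(2);                        /* remove the 2nd item: 20 20 30         */
        assert(*slot(1) == 20 && *slot(2) == 20 && *slot(3) == 30);
        (void)rotX; (void)rrotX; (void)tuckX;
        return 0;
    }

With these definitions, dropX(x) has the same effect as rotX(x) followed by dropX(1), which matches the emulation discussed in Section 1.2.2.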
4 Typically four, when stack scheduling is used, whereas small-scale RISC machines have
considerably more.
1.2.2 Stack Manipulation

Manipulator group    1       2       3
Drop                 drop    nip
Retrieve             dup     over
Save                 dup     tuck
Rotate (up)          nop     swap    rot
Rotate (down)        nop     swap

Manipulator group    1       2       3       4
Drop                 drop1   drop2   drop3   drop4
Retrieve             copy1   copy2   copy3   copy4
Save                 copy1   tuck2   tuck3   tuck4
Rotate (up)          nop     swap    rot3    rot4
Rotate (down)        nop     swap    rrot3   rrot4
With this stack accessing scheme the number of stack manipulation operators
required for a depth of N equally accessible registers is 5N - 4, as shown in the
right hand column of table 1.3. This could become a problem if deep access
into the stack, say N = 8, 12 or even 16, were required. The number of stack
manipulators could become prohibitively large for larger values of N, taking
up an excessive fraction of the instruction set and potentially complicating the
physical design. Not all these operators are required however. The orthogonal
set of operators to cover this space consists of drop1, tuckX (X <= N) and rotX
(X <= N, X != 1). This only requires 2N stack manipulators, as shown in the left
hand column of table 1.3. The instruction dropN can be emulated with rotN
drop1, rrotN can be emulated with tuckN drop and copyN can be emulated
with rotN tuckN. Although this only requires 2N stack manipulators, in order
to do register allocation a compiler might have to insert large numbers of
stack manipulations in the generated code. As a compromise between these
two extremes, an intermediate range of operators is provided by the set of
operators rotX, rrotX, copyX and drop1, as shown in the centre column of
table 1.3. This would allow most of the convenience provided by the complete
set of operators, but with fewer operators.
Table 1.3: Possible sets of stack manipulation operators

Manipulator group    Orthogonal    Intermediate    Complete
Drop                      1             1              N
Retrieve                  0             N              N
Save                      N             0             N-1
Rotate (up)              N-1           N-1            N-1
Rotate (down)             0            N-2            N-2
Total                    2N           3N-2           5N-4

1.3 Compilers

1.3.1 Compilers for Stack Machines

1.3.2 Register Allocation for Stack Machines
A basic block is a piece of code which has one entry point, at the beginning, and one
exit point, at the end. That is, it is a sequence of instructions that must be executed, in
order, from start to finish.
allocation across blocks. Bailey's Algorithm tries to find usage pairs that
span blocks, and then replaces the memory accesses with stack manipulations.
Although this approach spans basic blocks it is still essentially local as it does
not directly consider variables across their whole lifetimes, merely in the blocks
in which they are used. It is limited to variables whose liveness does not change
between blocks, which means that a number of variables, such as loop indices
and some variables used in conditional statements, cannot be dealt with. As
demonstrated in Chapter 5 this can cause a notable loss of performance for
some programs.
In order to produce a better algorithm and understand the issues involved,
I have developed a framework which divides the stack into logical partitions
and simplifies the analysis and development of stack-based register allocation
methods.
Register allocation methods have been developed using this framework
that can reduce local variable traffic significantly in most cases and to zero in
a few cases.
The compiler developed includes the algorithms covered in this thesis. It
also includes a peephole optimiser to minimise stack manipulations generated
by the register allocation algorithms. For comparative purposes both Koopman's
and Bailey's algorithms were also implemented.
1.4 Context
This work was done as part of the development of a compiler for a new stack
machine, called the UFO[23].
1.4.1 The UFO Architecture
The UFO is a dual-stack machine. The data stack has four accessible registers
at the top of the stack which can be permuted by various instructions; copy,
drop, tuck and rotate. Underneath these is the stack buffer which contains
the mid part of the stack. The lower part of the stack resides in memory,
with hardware managing the movement of data to and from the buffer. This
thesis, however, deals with a generalised form of this architecture, with any
number of accessible registers from two upwards, and shows that increasing
the number of usable registers is as valuable in a stack machine as any other
architecture. The UFO design is developed from a research stack architecture,
the UTSA (University of Teesside Stack Architecture). As a simulator for the
UTSA was readily available, most of the experimentation in register allocation
was done with the UTSA version of the compiler.
1.5 Goal
It is the ultimate goal of the work behind this thesis to demonstrate that a stack
machine can be as fast as, or even faster than, a RISC machine. Obviously,
even if this goal can be achieved, it cannot be achieved in a single Master's-level
thesis. Equally obviously, this cannot be done with just software, and it
cannot be done overnight, as modern RISC machines have many performance
enhancements over the original RISC machines of the 1980s. However with a
good compiler it might be possible to demonstrate that a stack machine for a
low-resource system, such as a soft-core processor on an FPGA, can be an equal
or better alternative to a RISC machine.
1.6 Contribution

1.7 Summary
Chapter 2
The Stack
In this chapter, the stack is analysed in detail. After looking at the stack
from various perspectives, a novel theoretical framework for studying the stack
is presented. Finally, the use of this framework to do register allocation is
outlined.
To design a compiler for a stack machine, most of the conventional techniques
for compiler design can be reused, with the exception of register allocation
and, to a lesser extent, instruction scheduling. Register allocation for stack
machines is fundamentally different from that for conventional architectures,
due to the arrangement of the registers. This chapter looks at the stack from
a few different perspectives and describes a way of analysing the stack that
is suitable for classifying and designing register allocation methods for stack
machines. In its strictest form a stack has only one accessible register and
values can only be pushed on to the top of the stack and popped off again,
a pure last-in-first-out arrangement. Whilst this form of stack is simple
to implement and useful for some applications, it is a very limiting way to
organise the registers of a general purpose machine. As a consequence, all real
stack machines have stacks that allow more flexible access to the stack.
2.1 The Hardware Stack

2.2 Classifying Stack Machines by Their Data Stack
Any processor can be classified by the number and purpose of its stacks [24].
The term stack machine is taken to mean a machine with two hardware
stacks, one for data and one for program control. Stack machines can be
further classified by the parameters of their data stacks. One way to classify
the data stack is by the number of accessible registers and by the number of
instructions available for manipulating the stack. The original stack machine,
the Burroughs B5000, had only a few stack manipulation instructions. A range
of stack machines are listed in table 2.1.
2.2.1 Views of the Stack

2.2.2 Stack Regions
To aid analysis of the stack with regard to register allocation, the perspective
chosen divides the stack into a number of regions. These regions are abstract,
having no direct relation to the hardware and existing solely to assist our thinking.
The boundaries between these regions can be moved without any real operation
taking place, but only at well defined points and in well defined ways. This
compiler oriented view of the stack consists of five regions. Starting from the
top, these are:
The evaluation region (e-stack)
The parameter region (p-stack)
The local region (l-stack)
The transfer region (x-stack)
The remainder of the stack, included for completeness.
The Evaluation Region
The evaluation region, or e-stack, is the part of the stack that is used for
the evaluation of expressions. It is defined to be empty except during the
evaluation of expressions when it will hold any intermediate sub-expressions1.
Figure 2.2: Evaluation of expression y = a * b + 4
1 This is by definition; any expression that does not fulfil these criteria should be
broken down into its constituent parts, possibly creating temporary variables if needed.
The conditional expression in C is an example of such a compound expression.
The e-stack and p-stack are the parts of the stack that would be used by
a compiler that did no register allocation. Indeed the stack use of the JVM[19]
code produced by most Java[13] compilers corresponds to the evaluation region.
The Local Region
The local region, or l-stack, is the region directly below the p-stack. The
l-stack is used for register allocation. It is always empty at the beginning and
end of any basic block, but may contain values between expressions. In the
earlier example, no mention was made of where either a or b came from or
where y is stored. They could be stored in memory but it is better to keep
values in machine registers whenever possible. So let us assume that in the
earlier example, y = a * b + 4, a and b are stored in the l-stack, as shown
in Figure 2.4. To move a and b from the l-stack to the e-stack, we can copy
them, thus retaining the value on the l-stack, or move them to the e-stack
from the l-stack. In this example, b might be stored at the top of the l-stack,
with a directly below it; to move them to the e-stack requires no actual move
instruction, merely moving the logical boundary between the e-stack and l-stack.
Likewise storing the result, y, into the l-stack is a virtual operation.

2 This presumes that all procedure calls have been moved to the top level of any expression
trees, as they are by most compilers. If this were not the case then procedure calls would
remove only their own parameters from the top of the p-stack, leaving the remainder.

Figure 2.4: Using the l-stack when evaluating y = a[x] + 4
2.2.3
register allocation. The ability of the hardware to reach a point in the l-stack
depends on the height of the combined e- and p-stacks above it, but it is fixed
during register allocation, meaning it needs to be calculated only once at
the start of register allocation.
The E-stack
The e-stack is unchanged during optimisations. Optimisation changes whether
values are moved to the e-stack by reading from memory or by lifting from a
lower stack region, but the e-stack itself is unchanged.
The P-stack
For a number of register allocation operations, there is no distinction between
the e-stack and p-stack and they can be treated as one region, although the
distinction can be useful. For certain optimisations, which are localised and
whose scopes do not cross procedure calls, the p-stack and l-stack can be
merged increasing the usable part of the stack. For the register allocations
method discussed later, which are global in scope and can cross procedure
calls, the p-stack is treated essentially the same as the e-stack.
The L-stack
The l-stack is the most important region from a register allocation point of
view. All intra-block optimisations operate on this region. Code is improved
by retaining variables in the l-stack rather than storing them in memory.
Variables must be fetched to the l-stack at the beginning of each basic block
and, if they have been altered, restored before the end of the block, since by
definition, the l-stack must be empty at the beginning and end of blocks.
The X-stack
The x-stack allows code to be improved across basic block boundaries. The
division between the l-stack and x-stack is entirely notional; no actual instructions
are inserted to move values from one to the other. Instead the upper portion,
or all, of the x-stack forms the l-stack at the beginning of a basic block.
Conversely, the l-stack forms the upper portion, or all, of the x-stack at the
end of the basic block. Since the e-stack and l-stack are both empty between
basic blocks, the p-stack and x-stack represent the complete stack which is
legally accessible to the current procedure at those points. This makes the
x-stack the critical part of the stack with regards to global register allocation.
Code improvements using the x-stack can eliminate local memory accesses
entirely by retaining variables on the stack for their entire lifetime.
2.2.4
The logical stack regions can be of arbitrary depth regardless of the hardware
constraints of the real stack. However, the usability of the l-stack and x-stack
depends on the capabilities of the hardware. Our real stack machine has a
number of stack manipulation instructions which allow it to access values up
to a fixed depth below the top of the stack. However, as the e-stack and p-stack vary in depth, the possible reach into the l-stack also varies. Variables
that lie below that depth are unreachable at that point, but, as they may have
been reachable earlier and become reachable later, they can still be useful. We
assume that the hardware allows uniform access to a fixed number of registers,
so if we can copy from the nth register we can also store to it and rotate through
it.
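A one-line sketch of this reach calculation follows; the function and parameter names are invented for illustration and are not taken from the compiler.

    /* How deep into the l-stack the hardware can reach at a given point,
       given uniform access to hw_registers registers below the top of the
       stack, and e_depth + p_depth items currently stacked above the l-stack. */
    static int lstack_reach(int hw_registers, int e_depth, int p_depth) {
        int reach = hw_registers - (e_depth + p_depth);
        return reach > 0 ? reach : 0;
    }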
2.2.5 Edge-Sets
2.3 An Example
In order to illuminate the process of using the stack regions to perform register
allocation we will use an example. The program code in Figure 2.5 is a simple
iterative procedure which returns n factorial for any value of n greater than
0; otherwise it returns 1. The C source code is on the left; alongside it is the
output from the compiler without any register allocation.
Figure 2.5: C Factorial Function

C source:

    int fact(int n)
    {
        int f = 1;
        while (n > 0) {
            f = f * n;
            n = n - 1;
        }
        return f;
    }

Assembly:

    .text
    ; function fact
    !loc n
    lit 1
    !loc f
    jump L3
    L2:
    @loc f
    @loc n
    mul
    !loc f
    @loc n
    lit 1
    sub
    !loc n
    L3:
    @loc n
    lit 0
    brgt L2
    @loc f
    exit
The first part of the stack to be determined is the x-stack, and before that
can be done, the edge-sets need to be found; see Figure 2.6. Once the edge-sets are found, the x-stack for each can be determined. Firstly consider the
edge-set {a, b}; both the variables n and f are live on this edge set. Presuming
that the hardware can manage this, it makes sense to leave both variables in
the x-stack. The same considerations apply for {c, d}, so again both n and f
are retained in the x-stack. The order of variables, whether n goes above f, or
vice versa, also has to be decided. In this example we choose to place n above
f, since n is the most used variable, although in this case it does not make a
lot of difference. The algorithms to determine which variables to use in more
complex cases are covered in chapter 4.
Once the x-stack has been determined, the l-stack should be generated in a
way that minimises memory accesses. This is done by holding those variables
which are required by the e-stack in the l-stack, whilst matching the l-stack
to the x-stack at the ends of the blocks. Firstly n, as the most used variable,
is placed in the l-stack. It is required on the l-stack throughout, except during
the evaluation of n = n - 1, when it is removed, so the value is not duplicated.
Secondly f is allocated in the l-stack, directly under n. In the final block the
value of n is superfluous and has to be dropped.
The original and final stack profiles are shown in Figure 2.7. Note the
extra, seemingly redundant, stack manipulations, such as rrot2 which is
equivalent to swap, and rrot1, which does nothing at all. These virtual
stack manipulations serve to mark the movement of variables between the
e-stack and l-stack. The final assembly stack code, with redundant operations
removed, is shown in Figure 2.8 on the right. Not only is the new code shorter
than the original, but the number of memory accesses has been reduced to
zero. Although much of the optimisation occurs in the l-stack, the x-stack is
vital, since without it variables would have to be stored to memory in each
block. Register allocation using only the l-stack can be seen in the centre
column of Figure 2.8. This would suggest that the selection of the x-stack
Figure 2.7: The original and final stack profiles for the fact function (the code annotated with the stack contents and with the virtual stack manipulations rot1, rrot1 and rrot2)
No register allocation:

    .text
    ; function fact
    !loc n
    lit 1
    !loc f
    jump L3
    L2:
    @loc f
    @loc n
    mul
    !loc f
    @loc n
    lit 1
    sub
    !loc n
    L3:
    @loc n
    lit 0
    brgt L2
    @loc f
    exit

Final code (Figure 2.8, right):

    .text
    ; function fact
    lit 1
    swap
    jump L3
    L2:
    tuck2
    mul
    swap
    lit 1
    sub
    L3:
    copy1
    lit 0
    brgt L2
    drop
    exit
2.4 Summary
Chapter 3
The compiler
This chapter covers the choice of compiler, approaches to porting the compiler
to a stack machine and the final structure of the compiler.
3.1 The Choice of Compiler

3.1.1 GCC vs LCC
GCC originally stood for the GNU1 C Compiler; it now stands for the GNU
Compiler Collection. It is explicitly targeted at fixed register array architectures,
but is freely available and well optimised. It is thoroughly, if not very clearly,
documented[27]. lcc does not seem to be an acronym, just a name, and is
designed to be a fast and highly portable C compiler. Its intermediate form is
much more amenable to stack machines. It is documented clearly and in detail
in the book[10] and subsequent paper covering the updated interface[11].
1 GNU is a recursive acronym for GNU's Not UNIX. It is the umbrella name for the Free
Software Foundation's operating system and tools.
Table 3.1: Comparison of GCC and lcc
A point by point comparison of lcc and gcc is shown in table 3.1. The long
list of positives in the GCC column suggests that using GCC would be the
best option for a compiler platform for stack machines, at least from the user's
point of view. However the two negatives suggest that porting GCC to a
stack machine would be difficult, if not impossible. After all, a working and
tested version of lcc is a very useful tool, whereas a non-working version of
GCC is useless. This leads to the question: Could a working, efficient version
of GCC be produced in the time available? As far as I could discover, there
has only been one attempt to port GCC to a stack machine. This port, for the
Thor Microprocessor, demonstrated the difficulties involved in such a port.
3.1.2 A Stack Machine Port of GCC: The Thor Microprocessor

3.2 LCC
lcc is a portable C compiler with a very well defined and documented interface
between the front and back ends. The intermediate representation for lcc
consists of lists of trees known as forests, each tree representing an expression.
The nodes represent arithmetic expressions, parameter passing, and flow control
in a machine independent way. The front end of lcc does lexical analysis,
parsing, type checking and some optimisations. The back end is responsible
for register allocation and code generation.
The standard register allocator attempts to put as many eligible variables2
in registers as possible, without any global analysis. Initially all such variables
are represented by local address nodes that are replaced with virtual register
nodes. The register allocator then determines which of these can be stored in
a real register and which have to be returned to memory. The stack allocator
uses a similar overall approach, although the algorithm used to do the register
allocation is completely different and does include global analysis.
Previous attempts at register allocation for stack machines have been
implemented as post-processors to a compiler rather than as part of the
compiler itself. The compiler simply allocates all local variables to memory
and the post-processor then attempts to eliminate memory accesses. The
problem with this approach is that the post-processor needs to be informed
which variables can be removed from memory without changing the program
semantics, since some of the variables in memory could be aliased, that is, they
could be referenced indirectly by pointer. Moving these variables to registers
would change the program semantics.
3.2.1 Initial Attempt

2 In C local variables may have their addresses taken, which renders them ineligible to be
stored in a register.

3 The lburg code-generator-generator is part of the standard lcc distribution.
part of lcc's back end that the target would be a conventional register machine.
This was circumvented by claiming the stack machine had a large number of
registers4 with the intention of cleaning up these virtual-registers later.
Using special pseudo-instructions to mark these virtual-registers in the
assembly code ensured that the post-processor was aware of which variables
could be legitimately retained on the stack. Many of these virtual registers
could then be moved onto the stack and the remaining ones allocated memory
slots.
This approach works, that is it is able to produce correct code, but has a
number of problems:
The post-processor must parse its input and output assembly language,
and both of these tasks are already done by the compiler.
It has to be implemented anew for each architecture.
4 Thirty-two; but unlike a real machine, all of these were available to store variables,
none were required for parameter passing or expression evaluation.
(Figure: overall structure of the compiler - front end, register allocator, machine description, code generator generator, code generator, peephole optimiser and assembler.)
Flow information, which was readily available within the compiler, has
to be extracted from the assembly code. It is extremely difficult, if not
impossible, to extract flow control from computed jumps such as the the
C switch statement.
Implementing the register allocator within the compiler not only solves the
above problems, it also allows some optimisations which work directly on the
intermediate code trees[8].
Finally, the code-generator always walked the intermediate code tree bottom
up and left to right. This can cause swap instructions to be inserted where
the architecture expects the left hand side of an expression to be on top of
the stack. By modifying the trees directly these back-to-front trees can be
reversed.
3.3 The Improved Version
The standard distribution of lcc comes with a number of backends for the
Intel x86 and various RISC architectures. These share a large amount of
common code. Each back end has a machine specific part of about 1000 lines
of tree-matching rules and supporting C code, while the register allocator
and code-generator-generator are shared. In order to port lcc to a stack
architecture a new register allocator was required, but otherwise as much
code as possible is reused. Apart from the code-generator-generator, which is
reused, the backend is new code. The machine descriptions developed for the
initial implementation were reused.
3.3.1 The Register Allocation Phase
Since the register allocation algorithm which was to be used was unknown
when the architecture of the back-end was being designed, the optimiser
architecture was designed to allow new optimisers to be added later. The
flow-graph is built by the back-end and then handed to each optimisation
phase in turn. The optimised flow-graph is then passed to the code generator
for assembly language output.
3.3.2 Producing the Flow-graph

3.3.3 Optimisation
3.4 Structure of the New Back-End
The source code for the back end is structured in three layers. This is to
allow as much modularity as possible. When choosing between data hiding,
modularity and reducing errors on one side and speed on the other, safety has
been chosen over performance throughout. Performance is noticeably slower
than the standard version of lcc, but there is plenty of potential for improving
speed, by eliminating runtime checks used during development and replacing
some of the smaller functions with macros.
3.4.1 Program flow
The compiler translates the source code from the pre-processor into assembly
language in four phases:
1. Intermediate code is passed from the front end and the flow-graph is
built
2. Data flow analysis is done. Definitions and uses are calculated, liveness
derived from that, and def-use chains created
3. Optimisations are performed
4. Code is generated
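The following outline sketches how these four phases might fit together; all of the identifiers are hypothetical and merely restate the list above, they are not the names used in the actual source.

    /* Hypothetical outline of the four phases; none of these identifiers are
       taken from the actual source code. */
    typedef struct forest     Forest;       /* intermediate code for one function */
    typedef struct flow_graph FlowGraph;

    FlowGraph *build_flow_graph(Forest *ir);    /* phase 1 */
    void analyse_data_flow(FlowGraph *g);       /* phase 2: definitions, uses,
                                                   liveness, def-use chains       */
    void run_optimisations(FlowGraph *g);       /* phase 3 */
    void generate_code(FlowGraph *g);           /* phase 4: emit assembly         */

    void compile_function(Forest *ir)
    {
        FlowGraph *g = build_flow_graph(ir);
        analyse_data_flow(g);
        run_optimisations(g);
        generate_code(g);
    }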
3.4.2 Data Structures
This layer is composed of the simple general data structures required: sets,
lists and graphs. These are implemented using opaque structures to maximise
data hiding. Each data structure consists of a typedef for a pointer to the
opaque data structure and a list of functions, all with a name of the form
xxx_functionName, where xxx is the data structure name.
For example the set data structure BitSet and some of its functions are
defined as:
typedef struct bit_set* BitSet;              /* opaque handle to a bit set          */
BitSet set_new(int size);                    /* create a set holding 'size' bits    */
BitSet set_clone(BitSet original);           /* copy an existing set                */
void set_add(BitSet set, int n);             /* add element n to the set            */
int set_equals(BitSet set1, BitSet set2);    /* non-zero if the two sets are equal  */
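A hypothetical usage sketch, assuming the declarations above are in scope; the behaviour suggested in the comments is the obvious one and is not documented in the text.

    #include <stdio.h>

    /* Assumes the BitSet declarations shown above are in scope. */
    int main(void)
    {
        BitSet live = set_new(32);        /* e.g. one bit per local variable */
        set_add(live, 3);                 /* variable number 3 becomes live  */
        BitSet copy = set_clone(live);
        printf("sets equal: %d\n", set_equals(live, copy));
        return 0;
    }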
3.4.3 Infrastructure
Built upon these classic computer science data structures are the application
specific data structures: the flow-graph, basic blocks, liveness information
and means of representing the e-, p-, l- and x-stacks. These follow a similar
pattern to the fundamental data structures, but with more complex interfaces
and imperfect data hiding. For example the Block data-structure holds
information about a basic block. This includes the depths of the stack at
its start and end and which variables are defined or used in that block. It also
provides functions for inserting code at the start or end of the block, as well
as the ability to join blocks.
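A hedged sketch of what such an interface might look like; none of these identifiers are taken from the source, they simply restate the capabilities listed above (BitSet is the set type from the previous section).

    /* Hypothetical Block interface mirroring the description above. */
    typedef struct block *Block;

    int    block_start_depth(Block b);       /* stack depth at the start of the block */
    int    block_end_depth(Block b);         /* stack depth at the end of the block   */
    BitSet block_defined(Block b);           /* variables defined in the block        */
    BitSet block_used(Block b);              /* variables used in the block           */
    void   block_insert_at_start(Block b, struct tree *code);  /* insert code at start */
    void   block_insert_at_end(Block b, struct tree *code);    /* insert code at end   */
    void   block_join(Block first, Block second);              /* join two blocks      */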
3.4.4 Optimisations

3.5 LCC Trees
The intermediate form in lcc, that is, the data structures that are passed from
the front end to the back end, consists of forests of trees, each tree representing a statement
or expression. For example the expression y = a[x] + 4, where a is an array of
ints, is represented by the tree in Figure 3.4. Trees also represent flow control
and procedure calls. For example the C code in Figure 2.5 is represented by
the forest in Figure 3.5. A glossary of lcc tree nodes, including the new stack
based nodes, is included in appendix D.
3.5.1 Representing Stack Manipulation Operators as LCC Tree Nodes

3.5.2 The Semantics of the New Nodes
Both stack and copy are virtual address nodes. This means they represent
the address of a stack register. They are virtual, since registers cannot have
their address taken. This means they can only occur as the l-value in an
assignment or as the child of an indirection node.
Stack
The stackN node represents the location of a stack based register. The
address of the nth stack register is denoted as stackN . Since registers do
not have real addresses, the stack node can only be used indirectly through
the indir node or as the l-value of an asgn node. Since reading the stack is
typically destructive, fetching from a stack node will remove the value stored
in it from the stack and writing to it will insert a new value, rather than
overwriting any value. The stack node will typically be implemented with a
rotate instruction, to either pull a value out of the stack for evaluation, or to
insert the value currently on top of the stack deeper into the stack.
Copy
The copyN node, like the stackN node, represents the location of a stack
based register, but with the side effect of leaving the value stored in the nth stack
register, when read through the indir node. The copy node cannot be used
as an l-value, as it would be meaningless in this context. The copy node will
typically be implemented with a copy instruction.
3.5.3 Tuck
The third new node added is the tuck node. The tuck node differs from the
other two in that it represents a value rather than an address. Its value is that
of its only child node, but as a side effect, the tuckN node stores that value
into the Nth register on the stack. Since, when evaluating an expression
on the stack, the value of the current partial expression is held in the top-of-stack register, copying that value into the stack corresponds to the tuck
instruction. Although this representation is rather inelegant, it is required,
both to efficiently represent the semantics of the tuck instruction and to allow
localised register allocation.
3.5.4 Representing the Stack Operations

Drop
Since an expression that is evaluated for its side effects only is represented as
a complete tree in lcc, a drop operation can be represented as an indirection
of a stack node. In effect a drop instruction can be regarded as discarding
the result of the evaluation of the nth register. The semantics of root nodes is
explained in Section 3.5.5. See Figure 3.6.
Figure 3.6: drop3 as lcc tree
Copy
copy is represented by fetching from a copy node and assigning to the top of
stack node, stack1 . See Figure 3.7
Figure 3.7: copy3 as lcc tree
Rotate Up
rot is represented by fetching from a stack node and assigning to the top of
stack node, stack1 . See Figure 3.8
Rotate Down
This is a stack node as the l-value of an assignment with the top of stack node,
stack1 , as the assignee. See Figure 3.9.
Figure 3.9: rrot3 as lcc tree
Tuck
This is simply a tuck node, with an arbitrary expression as its child. See
Figure 3.10.
Figure 3.10: tuck3 as lcc tree
3.5.5 Change of Semantics of Root Nodes in the LCC Forest
The other change required for stack based register allocation is a subtle alteration
to the semantics of root nodes in the lcc forest. In lcc the roots of trees
represent statements in C. As such they have no value. On a stack machine
a statement leaves no value on the stack, but an expression does, and so
to convert from an expression to a statement the value has to be explicitly
dropped from the stack. In lcc call nodes may serve as roots, although
they may have a value. This is generalised, so that any node may be a root
node. This means that all the semantics of the stack machine drop instruction
can easily be represented by using an expression as a statement. This also
simplifies the implementation of dead store elimination considerably.
Table 3.2: Partial machine description

    stmt:  ASGN(stk, ADDRL)     "!loc %1\n"      3
    stk:   INDIR(ADDRL)         "@loc %0\n"      3
    stk:   CNST                 "lit %a\n"       1
    stk:   LSH(stk, stk)        "call bshl\n"    10
    stk:   INDIR(stk)           "@\n"            3
3.6 Additional Annotations for Register Allocation

3.7 Tree Labelling
5 The vreg node is produced by lcc's front end, to represent a virtual register. See
Appendix D.

6 These rules are simplified; the real rules include type information, for example the CNST
node is really a CNSTI4, for a 4-byte integer constant.
Table 3.3: Extended machine description

    stmt:  ASGN(stk, ADDRL)     "!loc %1\n"      3
    stk:   INDIR(ADDRL)         "@loc %0\n"      3
    stk:   CNST                 "lit %a\n"       1
    stk:   LSH(stk, stk)        "call bshl\n"    10
    stk:   INDIR(stk)           "@\n"            3
    stk:   LSH(stk, CNST 2)     "shl\nshl\n"     2
    stk:   ADD(stk, CNST 4)     "add4\n"         1
two. The code generated can be improved by adding an extra rule to cover
this case, as well as a rule for the UFO's add4 instruction, to give the machine
description in table 3.3, which is labelled in Figure 3.12, for a reduced cost of
3 + 2 + 3 + 1 + 3 + 1 + 3 = 16. By including sufficient rules and classes to cover
the entire instruction set, good quality code can be generated. See Section 3.8
for an example of this in practice.
3.8
The UTSA7 is a stack machine specification and simulator used for researching
advanced stack architectures. It was designed with the intent of being implemented
with relatively few transistors, yet achieving good performance. It is a standard
two stack architecture (computation stack and address stack), with a couple of
traditional user registers available. All computation is done via the computation
stack. The address stack is used primarily for storing return addresses. The
two user registers are designed with arrays in mind; there are instructions
to read and write to the memory locations referred to while incrementing
or decrementing the register. The feature of the UTSA which had the most
impact on the design of the UTSA machine description was its memory addressing.
3.8.1
    Bits 31 and 30        byte address
    Bits 29 to 24         zero
    Bits 23 down to 0     word address
Since most addresses will be word aligned, this format is identical to the
native word address for the common case. The compiler guarantees that
accesses will be appropriately aligned. When using a half word or byte address,
a standard library function will need to be called for memory accesses anyway;
making these functions aware of the new format is simple.
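A minimal sketch of the conversion this format implies; the function names are invented here, and the hardware achieves the same effect with its ror (bit rotation) instruction, described in the next section.

    #include <stdint.h>

    /* Pack a native byte address into the format above: the two low-order
       (byte-within-word) bits move to bits 31 and 30, the word address
       occupies bits 23 down to 0, and bits 29 to 24 remain zero provided
       the byte address fits in 26 bits. This is a rotate right by two. */
    static uint32_t pack_address(uint32_t byte_addr) {
        return (byte_addr >> 2) | ((byte_addr & 3u) << 30);
    }

    /* Unpack by rotating left by two, recovering the native byte address. */
    static uint32_t unpack_address(uint32_t packed) {
        return (packed << 2) | (packed >> 30);
    }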
3.8.2
Not stack rotation, but bit rotation. The ror instruction moves all but the rightmost
bit of the top of stack register one place to the right, while moving the rightmost bit to the
leftmost position.
byte address down to bits 1 and 0, do the addition, and roll the result back
round9 . In the case just discussed this is incredibly wasteful. Thankfully the
machine description has types to allow the description of different addressing
modes. To take advantage of this, all values known to be a multiple of 4
are marked as a special type, called addr4 to distinguish it from the byteaddress or just addr. Where this special type is used, redundant instructions
can be removed. Careful consideration of which expressions can be considered
to be of this type allows almost all excess instructions to be removed. These
expressions are:
Constant multiples of 4
Addresses of types of size 4 or greater
The result of any value left shifted by 2 or more
The addition or subtraction of two values, both of which are of the type
addr4
The multiplication of two values, either of which is of the type addr4
Conversion from address to numeric value is achieved by inserting the appropriate
rotation instructions. This is relatively rare and the overall effect is a significant
saving.
3.9
The UFO[23] machine has been recently developed at the University of York.
It is a 32 bit, byte addressed stack machine. It has a simple and compact
instruction format. Since the UFO version of lcc supports double word types
the machine description is considerably enlarged compared with the UTSA
machine description, to cover these extra types. The UFO instruction set
includes instructions with three different literal formats; seven, nine and eleven
bits. This also adds bulk, but no real complexity, to the machine description.
3.9.1 Stores
copy node, then the copy or tuck is replaced with a rotation, since the copy-store-drop or tuck-store-drop sequence would be too complex for the peephole
optimiser to deal with.
3.10
3.10.1
When calling a variadic function the total number of stack slots used for the
variable arguments must be pushed onto the stack immediately before the
function is called. Basically we are adding an implicit extra parameter to any
variadic function.
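As an illustration, a call to a hypothetical variadic function f(n, ...) passing two variadic arguments might be compiled to something like the following; the instruction names follow the examples used elsewhere in this thesis, but the argument order and the exact sequence are assumptions rather than actual compiler output.

    @loc b          ; second variadic argument
    @loc a          ; first variadic argument
    @loc n          ; fixed argument
    lit 2           ; implicit extra parameter: the variadic arguments
                    ; occupy two stack slots
    call f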
3.10.2
At the start of any variadic function, the variadic parameters must be stored
to memory as follows:
Pop the top of the stack and store into memory or another register. This
is the number of variadic parameters.
Store each parameter, in turn, to memory.
Store the address of these parameters to a standard location.
Proceed as normal.
Provided the architecture has hardware support for local variables, the best
place to store the variadic parameters is in the local variable frame, moving
the frame pointer to accommodate. The address of the parameters can then
be stored at a fixed offset from the frame pointer. Variadic functions are
implemented in this way for both the UFO and UTSA architectures.
3.11 Summary
Having compared GCC and lcc, lcc was chosen as a better front-end for a
compiler for stack machines. Once the front-end was selected, a new backend was required, although the code generator generator was reused. Some
practical problems involved in the ports to UTSA and UFO were solved. At
this stage a working compiler existed, but one that produced poor quality
code. The next chapter explains approaches to register allocation, in order to
improve the quality of the compiler output.
Chapter 4
Register allocation
In this chapter we look at the pre-existing methods of register allocation for
stack machines as well as global allocation techniques developed using the
framework laid out in Chapter 2.
4.1 Koopman's Algorithm

4.1.1 Description

4.1.2 Analysis

4.1.3 Implementation
Each step in Section 4.1.1 can be directly translated to a form suitable for use
on the lcc intermediate form, as shown in Algorithm 2.
Koopman's Algorithm is implemented in approximately 140 lines of C code,
not including code shared by all the optimisers and data-flow analysis.
4.1.4
Since each member of the pair can be a definition or a use, there are four
possible types of pairs; def-def, def-use, use-def, and use-use. Additionally
as the forest is changed during optimisation, either the first or the second
member can be changed. This multiplies the number of permutations by four;
unchanged-unchanged, unchanged-changed, changed-unchanged, changed-changed.
This gives a total of sixteen combinations to consider.
Initial Transformations
Firstly the initial transformations on the four types of pairs are considered.
This is before any transformations have been performed. Three different
transformations are possible in this case. The transformations are listed below;
the modifications to the trees are shown in table 4.1.
Def-def The first definition is dead, as it is never used, and can be eliminated.
This simplifies subsequent dead store elimination as only the final occurrence
of a variable in each block need be considered.
Def-use A tuck node is inserted before the value is stored, and the use replaced
with a stack node.
Use-def The two nodes are unrelated, so no change is possible.
Use-use As in the def-use case, a tuck is inserted before the first use, and the
second replaced with a stack node.
[Table 4.1: modifications to the trees for each type of pair.]
4.2
4.2.1
Bailey's Algorithm
Description
Bailey's inter-boundary algorithm [5] was the first attempt to utilise the stack
across basic block boundaries. This is done by determining edge-sets; although
in the paper the algorithm is defined in terms of blocks rather than edges.
Then the x-stack, termed the sub-stack inheritance context, is determined for
the edge-set. See Algorithm 3.
Algorithm 3 Bailey's Algorithm
1. Find co-parents and co-children for a block (determine the edge-set).
2. Create an empty sub stack inheritance context.
3. For each variable in a child block, starting with the first to occur:
If that variable is present in all co-parents and co-children, then:
Test to see if it can be added to the base of the x-stack. This
test is done for each co-parent and co-child to see whether the
variable would be reachable at the closest point of use in that
block.
Bailey's Algorithm is designed to be used as a complement to an intra-block
optimiser, such as Koopman's. It moves variables onto the stack across edges
in the flow graph, by pushing the variables onto the stack immediately before
the edge and popping them off the stack immediately after the edge. Without
an intra-block optimiser this would actually cause a significant performance
drop.
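A minimal sketch of the candidate selection in step 3, assuming the variables used in each co-parent and co-child block are available as bitsets (one bit per variable); the data layout is illustrative, and the per-block reachability test still has to be applied to each candidate.

#include <stdint.h>

typedef uint32_t varset;                   /* one bit per local variable */

/* Variables that are present in every co-parent and every co-child block,
 * and are therefore candidates for the sub-stack inheritance context. */
varset xstack_candidates(const varset *coparents, int nparents,
                         const varset *cochildren, int nchildren)
{
    varset s = ~(varset)0;
    for (int i = 0; i < nparents; i++)
        s &= coparents[i];
    for (int i = 0; i < nchildren; i++)
        s &= cochildren[i];
    return s;    /* each candidate must still be reachable at its closest use */
}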
4.2.2
Analysis
x-stack during blocks. Not only that, but the variables in the x-stack are
shunted, via the l-stacks and e-stacks, to and from memory at the ends of
blocks. No attempt is made to integrate the allocation of the x-stack and the
allocation of the l-stack. In terms of performance, the main failing of Bailey's
Algorithm is that it cannot handle variables which are live on some but not
all edges of an edge-set.
4.2.3
Implementation
4.3
A Global Register Allocator
[Table: criteria for variables before an edge (in all predecessor blocks) and after an edge (in all successor blocks).]
to get close to the optimum.
4.3.1
An Optimistic Approach
The first approach is to use the same framework as Bailey's; once x-stacks
are determined, values are pushed onto the stack before an edge and
popped off afterwards, leaving the local optimiser to clean up the extra
memory accesses. However, in Bailey's approach the variables are chosen quite
conservatively. The optimistic approach uses all plausible variables, the set of
all variables that are live on any of the edges in the edge-set. Surprisingly, for
a few small programs with functions using only three or four local variables,
this approach works very well and produces excellent code. It does, however,
produce awful results for larger programs with more variables, as variables are
moved from memory to the stack and back again across every edge in the flow
graph.
4.3.2
Obviously a more refined way to select the x-stack is required. The optimistic
algorithm in the previous section chooses the union of the sets of variables
which are live on each edge in the edge-set, whereas Bailey's Algorithm chooses
the intersection of these sets. Experimental modifications to Bailey's Algorithm
to use various combinations of unions and intersections of liveness and uses
revealed some important limitations in the localised push-on, pop-off approach.
These are:
Excessive spilling
There is no attempt to make the x-stack similar across blocks, so variables
may have to be saved at the start of a block, and other variables loaded
at the end of a block.
Excessive reordering
Even when the x-stack state at the start and end of a block contain
similar or the same variables, the order may be different and thus require
extra instructions.
No ability to use the x-stack across blocks
The requirement for the entire x-stack to be transferred to the l-stack
means that the size of the x-stack is limited. Variables cannot be stored
deeper in the stack when they are not required.
4.3.3
A Global Approach
However, in order to allocate the x-stack in a way that does not impede
the subsequent l-stack allocation, the l-stack must be at least partially
determined before the x-stack.
4.3.4
Outline Algorithm
4.3.5
Determining X-stacks
There are two challenges when determining the x-stack. One is correctness,
that is, the x-stack must allow register allocation in the l-stacks to be both
consistent with the x-stack and legal. The other challenge is the quality of
the generated code. For example making the x-stack empty at all points is
guaranteed to be correct, but not to give good code. Both of the x-stack
finding methods work by first using heuristics to find an x-stack which should
give good code, then correcting the x-stack if necessary. The algorithm for
ensuring correctness is the same, regardless of heuristic used. For the x-stacks
to be correct, two things need to be ensured:
1. Reachability
Ensure all variables in the x-stack that are defined or used in successor
or predecessor blocks, are accessible at this point.
2. Cross block matching
Ensure that all unreachable variables in the x-stack on one edge do not
differ from those in the x-stack on an edge with one intervening block.
Ordering of Variables.
As stated earlier, a globally fixed ordering of variables is used. This is done
by placing variables with higher estimated dynamic reference count nearer
the top of the stack. In our implementation, which is part of a port of lcc[15],
the estimated dynamic reference count is the number of static references to
a variable, multiplying those in loops by 10 and dividing those in branches
by the number of branches that could be taken. However, any reasonable
estimation method should yield similar results. An alternative ordering could
be based around density of use, which would take into account the lifetime
of variables.
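As an illustration only, the weight of a single static reference might be computed as below; one reading of the heuristic multiplies by 10 for each enclosing loop, and the inputs are whatever the front end can supply.

/* Weight of one static reference: x10 per enclosing loop, divided by the
 * number of branches that could be taken at that point.  A variable's
 * estimated dynamic reference count is the sum of these weights. */
double reference_weight(int loop_depth, int branch_ways)
{
    double w = 1.0;
    for (int i = 0; i < loop_depth; i++)
        w *= 10.0;
    if (branch_ways > 1)
        w /= branch_ways;
    return w;
}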
Heuristics
By a process of experiment and successive refinement, two different heuristics
for x-stack creation have been developed. The first is relatively simple and
fast, whereas the second is somewhat more complex, and consequently slower.
Global 1
The first, simpler heuristic is simply to take the union of live values as
done in the optimistic approach in Section 4.3.1. So why is this any better?
It is better because the subsequent parts of the algorithm remove a lot of
variables that would be counter productive, thus ensuring that the x-stack is
not overwhelmed with too many variables. The globally consistent ordering
reduces unnecessary stack manipulations in short blocks. Finally the l-stack
allocation method discussed in Section 4.3.7 is designed to work with x-stacks
directly, further reducing mismatched stacks.
Global 2
This heuristic was developed to improve on Global 1. It considers the ideal
l-stack for each block and then attempts to match the x-stack as closely to that as
possible. Given that the ordering of variables is pre-determined, the x-stack
can be treated as a set. In order to find this set, we determine a set of variables
which would be counter productive to allocate to the l-stacks. The x-stack is
then chosen as the union of live values less this set of rejected values. The
set of rejects is found by doing mock allocation to the l-stack, to see which
values can be allocated, then propagating the values to neighbouring blocks
in order to reduce local variation in the x-stack. See Algorithm 5. Overall
this heuristic outperforms Global 1, but results in slower compilation and
can produce worse code for a few programs; Appendix F shows an extreme
example of this.
Mock L-stack Allocation
Before any real allocation can be done, information needs to be gathered about
which variables can be beneficially allocated to the x-stack. In order to do this,
it helps to know what variables will be in the l-stack. This ideal l-stack is found
by doing a mock allocation. This is essentially the same as the proper l-stack
allocation, covered in Section 4.3.7, except that it is done without regard to the
x-stack, by pretending that it is empty. Since this is a mock allocation, no
change is made to the intermediate form; see Algorithm 7.
Propagation of Preferred Values
Once mock l-stack allocation has taken place, the variables which have been
rejected as x-stack candidates for individual blocks are propagated to their
neighbouring blocks.
4.3.6
4.3.7
L-Stack Allocation
The uses of a variable within a block can be divided up into chains. Each chain
starts either at the beginning of a block or when the variable is assigned. So
if a variable is not assigned during a block then there is only one chain for
that variable. If a variable is neither used nor assigned, but is live through
the block, it still has a chain. These empty chains are still required because
the variable will require stack space, whether it is used or not.
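As a rough sketch, the number of chains for one variable in one block can be counted as follows; the occurrence list and the liveness flag are illustrative stand-ins for the lcc forest and the data-flow results.

enum occ_kind { OCC_USE, OCC_DEF };

/* A chain starts at the beginning of the block (carrying the incoming value,
 * if the variable is live on entry) and at every assignment to the variable. */
int count_chains(const enum occ_kind *occs, int n, int live_on_entry)
{
    int chains = live_on_entry ? 1 : 0;
    for (int i = 0; i < n; i++)
        if (occs[i] == OCC_DEF)
            chains++;
    return chains;     /* a live-through, unused variable has one empty chain */
}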
To allocate a chain, the variable represented by the chain is stored at the
bottom of the l-stack, extending the l-stack downwards for the length of the
chain. See Algorithm 7.
L-stack allocation also needs to ensure that the l-stack matches the x-stack
at the block ends. This is done by dropping, storing or reloading variables
where the l-stack and x-stack do not match; see Algorithm 9. There is one
final complication. If a block has a conditional branch statement at its end,
then statements to match up the l-stack and x-stack cannot be inserted after
the branch, as they would not necessarily execute. However, if the instructions
are inserted before the branch, then the stack would be incorrect, as the branch
test would disturb the stack. This can be solved as follows: if either of the
values in the comparison is not a temporary variable local to the block, then
a temporary is created and assigned that value before the test. For example,
the following intermediate code:
if x+4 < y*2 then goto L1
becomes
t1 = x+4
t2 = y*2
if t1 < t2 then goto L1
These temporary variables, t1 and t2, are ignored during global register allocation.
They are left for the local optimiser to clean up. This guarantees correctness
and since t1 and t2 have such short lifetimes, they are easily removed later.
4.3.8
Chain Allocation
4.4
Final Allocation
As the global register allocation only allocates variables to the stack when an
entire chain can be allocated, it can leave a number of variables unallocated.
To further reduce memory accesses, a further allocator is required. The
global register allocator will have already used the l-stack to good effect,
so Koopman's Algorithm would be unable to handle the output of that
allocator, since the base of the l-stack would be largely unreachable. Consequently
a different form of local optimiser is required, one that places values within the
l-stack, rather than just at its base. This works by trying to remove definition-use
and use-use pairs as well, choosing the depth with a simple cost function,
Algorithm 10, based on an estimate of the number of instructions required.
[Figure: an example of final allocation by chain, showing the instruction sequences and stack contents for expressions such as x*x, y*y, z*z and 2*s.]
This algorithm checks all legal positions in the l-stack for a slot in which to
insert the value; a position is not legal if it causes a use of another value to
become unreachable, or if the value intrudes into the e- or p-stack. If the
pair does not span a procedure call then the p-stack may be viewed as part
of the l-stack, and only the e-stack preserved. The transformations on the
intermediate trees are the same as for Koopman's Algorithm and both share
a considerable amount of code.
Algorithm 10 Cost function
For a usage pair with a depth of x at the start and a depth of y at the end:
1. For the first item:
   If it is already a stackN node and N ≠ 1 and N = x then:
      cost1 ← 0
   Else:
      cost1 ← 1
2. For the second item:
   If y = 0 then:
      cost2 ← 0
   Else:
      cost2 ← 1
3. cost ← cost1 + cost2
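A direct transcription of Algorithm 10 as a C helper; the stackN test is abstracted into a flag and a depth rather than the real lcc node types.

/* Cost of allocating a usage pair at depth x (start) and depth y (end). */
int pair_cost(int first_is_stackN, int N, int x, int y)
{
    int cost1 = (first_is_stackN && N != 1 && N == x) ? 0 : 1;
    int cost2 = (y == 0) ? 0 : 1;
    return cost1 + cost2;
}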
4.5
Peephole Optimisation
The final stage is to clean up the resulting stack manipulations. For example,
in order to move a variable from the l-stack to the e-stack a swap instruction
might be issued. If two in a row were issued then the resulting sequence,
swap-swap, could be eliminated. The peephole optimiser is table driven. It
parses the input accepting stack manipulations and any instruction that puts
a simple value on the stack, such as a literal or a memory read. When any
other type of instruction is encountered or the increase in stack depth exceeds
two, the calculated stack state is looked up in a table to find the optimum
sequence of instructions, which are then emitted. Should none be found or a
different class of instructions be found, then the input is copied.
4.5.1
Example
drop1
copy1
lit 1
rot
add
The add instruction halts the sequence of instructions that can be handled,
with the resulting stack transformation:
A B C  →  B 1 B C   (top of stack leftmost)
The new stack state would be looked up and replaced with the shorter
sequence:
drop1
lit 1
copy2
add
For the UFO architecture, as only the top four values on the stack are in
hardware registers, the peephole optimiser is concerned only with those four
values plus literals and memory reads; this means that 5^6 + 5^5 + 5^4 + 5^3 +
5^2 + 5 + 1 = 19531 possible permutations are required. Generation of this
table requires an exhaustive search over millions of possible sequences, but
can be done at compile time, so is not a problem. The table itself requires
about 150k bytes of memory, out of 209k when compiled using GCC for the
x86 architecture. Performance is more than adequate at about 1 million lines
per second on a typical desktop PC(2006).
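On one reading of this count, a tracked state is a sequence of zero to six stack slots, each holding one of five possibilities (one of the four register values, or a freshly pushed literal or memory value); the snippet below reproduces the total.

#include <stdio.h>

int main(void)
{
    long states = 0, p = 1;
    for (int len = 0; len <= 6; len++) {   /* states of length 0 to 6     */
        states += p;                       /* 5^len states of this length */
        p *= 5;
    }
    printf("%ld\n", states);               /* prints 19531 */
    return 0;
}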
4.5.2
lit 1
rot
add
can be replaced with
lit 1
rot
add
Since the any instruction tends to get inserted in loops, this is a worthwhile
saving.
For completeness, and to avoid any surprises when the peephole optimiser
is turned off, the assembler recognises the any instruction and treats it as a
lit 0 instruction.
4.6
Summary
Chapter 5
Results
This chapter discusses the relative effectiveness of the various register allocation
algorithms for various stack machines, both real and hypothetical. Before
doing that, however, the correctness of both the compiler and the simulator
used needs to be verified.
There are two types of test suite for a compiler:
1. To test coverage and correctness of the compiler.
2. To test performance of the compiler and hardware.
When selecting a suite of test programs for performance testing, it is usual
to consider a ready-made suite to allow comparisons with other architectures.
Since the hardware for UFO was unavailable at the time of writing, comparison
with other architectures is impossible. Therefore, in order to minimise effort
a significant proportion of the test suite was reused for benchmarking. Some
of the benchmarks were also used to verify correctness.
5.1
Compiler Correctness
The test suite consists of the test programs that come with lcc plus the
benchmarks listed below. All benchmarks and test programs ran correctly,
using the simulator.
5.2
Benchmarks
Twelve benchmarks were selected. The criteria for selecting benchmarks were
zero cost and minimal library requirements. Since the UFO compiler is targeting
an embedded processor there is no requirement for the standard C library,
and so only a very limited library is provided to allow interfacing with the
simulator. The characteristics of the benchmarks are listed in table 5.1.
[Table 5.1: characteristics of the benchmarks: bsort, image, matmul, fibfact, life, quick, queens, towers, bitcnts, dhrystone, wf1 and yacc.]
5.3
The Baseline
All the results in this chapter are relative to a baseline compiler with no register
allocator. This baseline is not simply the compiler with all register allocation
turned off, as the front end of lcc generates a number of spurious temporary
variables which need to be removed to get a fair comparison. Consequently
any variable that is defined and never used, or any variable that is used once
and only once immediately after it is defined, is removed. This gives the
output one would expect from a naive compiler.
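A sketch of that filter, assuming per-variable statistics have already been gathered; the structure is illustrative only.

struct varstat {
    int uses;                       /* number of uses of the variable           */
    int single_use_follows_def;     /* the only use immediately follows the def */
};

/* Keep a variable in the baseline output unless it is spurious. */
int keep_in_baseline(const struct varstat *v)
{
    if (v->uses == 0)
        return 0;                   /* defined and never used                   */
    if (v->uses == 1 && v->single_use_follows_def)
        return 0;                   /* used once, immediately after definition  */
    return 1;
}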
5.4
The Simulator
The test platform was the simulator for the UFO hardware. This simulator
is an instruction level simulator; it does not attempt to mimic the low-level
internal architecture of the UFO. It is nonetheless required to be an exact
model of the processor at the visible register level. The simulator is also
enhanced with a number of features that the hardware will not have. These
are debugging aids, monitoring, and a host-architecture interface. Various cost
models, which produce a cost of running any particular program, can be
plugged in to simulate the relative effectiveness of the various register allocation
algorithms with regards to different hypothetical architectures.
5.5
Cost Models
Instructions are classified into groups. The cost model then assigns a cost to
each instruction group. The instruction groups and costs for those groups in
the different models can be seen in table 5.2. Cost models can have internal
state to simulate pipelining, but this simple framework would not allow out-of-order execution to be modelled with any confidence.
Table 5.2: Instruction Costs
Model         Memory access   Predicted branch   Missed branch   Multiply   Divide   Stack    Other
Flat          1               1                  1               1          1        1        1
Harvard       2               2                  2               2          5        2        1
UFO           2 or 3          2                  2               2          5        1        1
UFO (slow)    4 or 5          2                  2               2          5        1        1
Pipelined     1 to 3          1                  4               2          5        1        1
Stack Mapped  1 to 3          1                  4               2          5        0 or 1   1
5.5.1
The Models
These models are not detailed models of real machines, or even realistic models,
but serve to illustrate the effectiveness of the different register allocation
techniques for different types of architecture.
Flat
All instructions have a cost of one. Not a realistic model, but it gives a dynamic
instruction count.
Harvard
This assumes no interference between data memory and instruction fetching.
However, the processor is assumed to run twice as fast as the memory, so
all simple instructions have a cost of one; memory, multiply and flow control
instructions have a cost of two; divide has a cost of five.
UFO
This is an attempt to model the UFO architecture. The UFO is a von Neumann
architecture, so any memory accesses must stall if an instruction fetch is
required. This is modelled by giving each memory access a cost of two, unless
it immediately follows another memory access, in which case it has a cost of three.
Other instructions are costed the same as in the Harvard model.
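The sketch below gives one possible shape for such a pluggable cost model, following the UFO rules just described; the instruction classes and the function name are illustrative, not the simulator's actual interface.

enum iclass { I_MEMORY, I_BRANCH, I_MULTIPLY, I_DIVIDE, I_STACK, I_OTHER };

/* Cost of one instruction under the UFO model; *prev_was_memory is the only
 * internal state, used to charge back-to-back memory accesses more heavily. */
int ufo_cost(enum iclass c, int *prev_was_memory)
{
    int cost;
    switch (c) {
    case I_MEMORY:   cost = *prev_was_memory ? 3 : 2; break;
    case I_BRANCH:   cost = 2; break;
    case I_MULTIPLY: cost = 2; break;
    case I_DIVIDE:   cost = 5; break;
    default:         cost = 1; break;     /* stack manipulation and others */
    }
    *prev_was_memory = (c == I_MEMORY);
    return cost;
}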
5.6
Relative Results
The full results are shown in appendix B. Figures 5.1-5.5 show the performance
of the different algorithms relative to the baseline. Larger numbers are better.
[Figure: relative performance of the allocators for each benchmark under the Flat model.]
For the flat model, in other words the dynamic instruction count, the
global optimisers gain little over Koopman's and Bailey's Algorithms. Global2
provides the best performance with a 12% reduction in executed instructions
over Bailey's Algorithm. However, this is not the case for the quick benchmark,
where the more sophisticated allocators get worse results. In the case of
queens only Global1 manages an improvement over the baseline, at a paltry
0.4%; all the others actually degrade performance relative to the baseline.
In the more realistic UFO model, memory bandwidth becomes important
and the performance of programs compiled with register allocators increases
relative to the baseline. The advantage of the best global allocator over
Bailey's allocator remains about the same, at 11%.
For the UFO with slow memory, the global register allocators have significantly
better performance. This is unsurprising as they are designed to reduce
memory accesses to a minimum even at the cost of additional instructions.
These extra instructions are more than compensated for by the reduced memory
access costs in this case. The Global2 allocator outperforms Bailey's by 20%
in this case and is 82% better than the baseline case.
The pipelined and stack mapped models are fairly loose approximations to
real architectures so the results should be viewed with caution. Nonetheless
the need for register allocation to get good performance out of these sorts of
architectures is clear.
5.6.1
Comparing the Two Global Allocators
Regardless of the performance models, the allocator Global2 gives the best
performance for the benchmarks image, matmul and life, whereas for the
benchmarks quick and queens it is Global1 that gives the best performance.
Figure 5.3: Relative performance for the UFO with slower memory model
[Further figures: relative performance of the allocators for each benchmark.]
Why is this so? To demonstrate what is going on, consider two programs
doubleloop.c and twister.c. For the source code, assembly code and relative
performance of these programs see Appendix F. In the program doubleloop
two deeply nested loops occur one after the other. The two loops share no
variables, and all the variables in the second loop have slightly higher
reference counts. Global1 carries the variables for the second loop throughout
the first, impeding the variables actually needed. In this highly contrived
case, the Global2 version outperforms the Global1 version by 27%. So why
does Global1 do better than Global2 sometimes? When there are a lot of very
short blocks, each with varying variable use, Global2 can end up causing a lot
of spilling, that is, storing one variable to memory, when another is needed
in a register. Ways to prevent this and get the best of both algorithms are
discussed in Section 6.3.2.
5.6.2
The relative program size shown in Figure 5.6 is the size of the executable less
the size of an empty application, so as to remove all linked-in libraries and
most symbolic information. As can be seen, register allocation also reduces
program size but not greatly, as the difference between global allocation and
Koopman's version is only about 4% overall.
5.7
Summary
The results show that for all the models the new global optimisers outperform
both Bailey's and Koopman's Algorithms. The difference was most marked
for the models with the greatest difference between register access times and
memory access times.
[Figure 5.6: relative program size for each benchmark.]
Chapter 6
6.1
Performance
6.2
6.2.1
6.3
6.3.1
6.3.2
Currently the better of the two global allocators, global2, falls down when it
introduces too many changes in the x-stack. It does this because it has no
concept of loops and therefore cannot make good decisions as to which blocks
to prioritise. It currently does a good job where the loops are regular, but
suffers when the flow graph becomes more complex. If the back end were able
to find loops in the flow graph, then the register allocator could fix the x-stack
for inner loops first, ensuring that spills would occur only outside of loops.
A proposed, but unimplemented and thus untested, approach is outlined in
Algorithm 11.
Algorithm 11 Proposed Algorithm
1. By using dominators to determine loops, or by analysis of the source
code, make a static estimate of the dynamic execution frequency of each
block. Note that the dynamic execution frequency could be determined
by profiling in a profile driven framework.
2. For each block, starting with the most frequently executed:
(a) Determine the optimum, or near optimum, l-stack for that block
given any pre-existing x-stacks. For the first block there will be
no x-stack. Note that a pre-existing x-stack can be extended
downwards to incorporate more variables.
(b) Determine either or both of the upper parts of the x-stacks adjoining
that block, if they have not already been determined, leaving the
lower parts to be extended by subsequent propagation.
3. Propagate or eliminate, if necessary, variables in the lower parts of the
x-stack, until global consistency is achieved.
6.3.3
Stack Buffering
In all the models in chapter 5 it has been assumed that the stack never
overflows into memory. Obviously this is not always a reasonable assumption.
For all the non-recursive benchmarks run on the simulator, that is all but
fibfact, no stack overflows occur with a reasonably sized (32-entry) stack
buffer. For the recursive benchmark, a large number of overflows did occur. A
detailed analysis of how the requirements for the stack buffer are changed by
register allocation, especially for machines with larger numbers of accessible
registers, would be valuable for designing future stack processors.
6.4
6.4.1
The purpose of pipelining is to reduce the cycle time while maintaining the
same number of instructions executed per cycle. Since the lifetimes of instructions
6.4.2
6.5
Summary
Appendix A
rot1
rot2
rot3
rot4
copy1
copy2
copy3
copy4
tuck1
tuck2
tuck3
tuck4
drop1
drop2
drop3
drop4
rrot1
rrot2
rrot3
rrot4
lit X
!loc N
@loc N
call f
breq L
Appendix B
Results
B.1
image
1433577
1355957
1355957
1347337
1260509
matmul
12759225
11874561
11260161
12529542
10110390
matmul
8451079
8326279
8115079
9494549
7862549
fibfact
922974
753522
625954
467443
467443
fibfact
573112
520277
456493
350819
350819
life
4720577
4023616
4005531
4152719
3915943
life
2837827
2725660
2725662
2810731
2656871
quick
5399992
3398904
3357982
3245257
3618716
quick
4410403
3131515
3140251
3045400
3374435
quick
2625531
2050801
2165119
2273304
2435846
queens
502863
490917
490663
466126
485818
queens
429269
425709
425503
406002
422186
queens
268591
272967
272967
267457
277077
towers
5729784
4397025
4265932
3881234
3881235
towers
5132586
4306898
4216777
3881016
3881017
towers
3528051
3299597
3275023
3127395
3119204
bitcnts
29794082
24101986
22182576
16896610
17848631
bitcnts
26157154
22463747
21223808
16869692
17680537
bitcnts
17693249
16359801
15983782
14973834
15435415
dhrystone
20113884
18093862
17948860
17598851
17688849
dhrystone
16172202
14842186
14747184
14462175
14537173
dhrystone
9429161
8889154
8874154
8874153
8879151
wf1
8110018
6960778
6857289
5192526
5192526
wf1
6830260
6055894
6110264
4833058
4833058
wf1
4008756
3658075
3834459
3209346
3209346
yacc
1903412
1335649
1317273
1301093
1301093
yacc
1726075
1218675
1205546
1193577
1193577
yacc
1222415
832412
825848
854942
854942
Overall
4603792
3613779
3449838
3151453
3115386
Overall
3979269
3320692
3212539
2962415
2924996
Overall
2528230
2274299
2253700
2180554
2148301
image
2319218
1856649
1846848
1780698
1655160
life
5590601
4445305
4409980
4583040
4324344
Optimiser
none
koopman
bailey
global1
global2
bsort
2522836
1903736
1783935
1561830
1561830
fibfact
1028646
774467
625954
467443
467443
fibfact
1622686
1039601
731628
467443
467443
fibfact
827296
573117
488388
435551
435551
fibfact
817294
499333
414604
350820
350820
life
9525455
6650146
6559715
6925766
6449814
quick
9762024
5644478
5302392
4644333
5474952
quick
4987628
2910228
2796924
2718825
3029176
queens
846062
807028
806202
743772
781854
queens
439281
423619
423457
402960
420048
towers
9077463
5977562
5674402
4848075
4864460
towers
4747365
3398194
3308057
3087168
3070785
life
4749213
3185319
3114623
3295607
3142827
quick
4907405
2690577
2391449
2032853
2320201
queens
427657
399909
399541
371657
385689
towers
4534255
2808120
2677010
2399187
2382804
bitcnts
27065830
21527569
19022652
16105161
16924286
bitcnts
49852300
37159592
32169752
18307763
20243062
dhrystone
18873089
16828070
16693064
16518064
16543062
dhrystone
36865911
32480864
32105854
31030827
31290825
wf1
7079545
5793802
5394743
3732467
3732467
wf1
7246005
6036142
5935481
4457299
4457299
wf1
13975770
11509857
11056042
7572631
7572631
yacc
1440414
1014871
997233
931698
931698
yacc
1544516
1114100
1096463
1106918
1106918
yacc
2928247
2073102
2035770
1920103
1920103
Overall
3858643
2650973
2434217
2142389
2147379
Overall
3996455
3007023
2853808
2709122
2690856
Overall
7729966
5523744
5079844
4287930
4248969
bitcnts
26183796
19599378
16604452
11122810
11831953
Optimiser
none
koopman
bailey
global1
global2
matmul
13903329
12326001
11519601
12760182
10331430
matmul
10825615
8150959
7354159
7035653
6642053
matmul
11574439
9573247
8968447
10110702
8891454
matmul
22184171
18133787
16309787
17080470
13096614
image
2714165
2001300
1981698
1905746
1760804
image
2464263
1290209
1251005
1173772
1067246
image
2512283
1645558
1616155
1568622
1433186
image
4880196
2926048
2876846
2656920
2405452
bsort
2902337
1963637
1783936
1561830
1561830
bsort
2242834
1282834
1002632
902326
902326
bsort
2362434
1542734
1342533
1282430
1282430
bsort
4823150
2804850
2325246
1900930
1900930
Optimiser
none
koopman
bailey
global1
global2
Optimiser
none
koopman
bailey
global1
global2
Optimiser
none
koopman
bailey
global1
global2
Optimiser
none
koopman
bailey
global1
global2
B.2
Optimiser
none
koopman
bailey
global1
global2
bsort
740313
360613
260712
159608
159608
image
760500
375551
365750
308220
269510
matmul
3221985
2462121
2058921
1948832
1161680
fibfact
253186
156511
92726
39891
39891
quick
1570142
865984
760402
557466
723891
queens
125407
117471
117265
103284
109848
towers
1301445
704397
638790
450659
458851
bitcnts
7214085
4857343
3991837
646831
997636
wf1
2103821
1680136
1558122
906029
906029
yacc
406426
289029
282464
241401
241401
Overall
1162813
761535
665658
444265
445742
Appendix C
LCC-S Manual
C.1
Introduction
The port of lcc for stack machines retains much of the original lcc, but includes
a largely new, stack-specific, back-end phase. The front-end has been slightly
modified to provide extra data for the back-end. These modifications have
been kept as small as possible. lcc-s can be ported to a new stack-machine by
writing a new code generator specification and a few ancillary routines, much
in the same way as lcc can be ported to a conventional machine.
C.2
Setting up lcc-s
Fixed location
lcc-s is hardwired to run from fixed locations; if it is installed there, no paths will
be needed.
Unix: /usr/local/lib/lcc-s
Windows: C:\Program Files\lcc-s
Command line
Setting the -Ldir option on the command line is equivalent to setting LCCDIR
to dir. In fact the -L option can be used to override LCCDIR.
C.3
Since lcc-s is a modified version of lcc, it supports most of the command line
options of lcc, plus some additional options:
-A warn about nonANSI usage; 2nd -A warns more
-b emit expression-level profiling code; see bprint(1)
-Bdir/ use the compiler named dir/rcc
-c compile only
-dn set switch statement density to n
-Dname -Dname=def define the preprocessor symbol name
-E run only the preprocessor on the named C programs and unsuffixed files
-g produce symbol table information for debuggers
-help or -? print help message on standard error
-Idir add dir to the beginning of the list of #include directories
-lx search library x
C.4
C.5
Implementations
C.6
C.6.1
The components
lcc
The driver program has to interface the components with the operating system.
Currently four driver programs exist, although they share most source code.
They are:
lcc-ufo for Linux
lcc-ufo for Windows
lcc-utsa for Linux (Not yet)
lcc-utsa for Windows
However, they all share the same interface and will be treated collectively as
lcc-s. The command line interface to lcc-s is listed above.
C.6.2
scc
C.6.3
peep
The peephole optimiser is a table driven optimiser which reduces the number
of stack manipulations in assembler code. It is highly portable. The UTSA
and UFO versions differ only in a 20 line instruction description file. The
peephole optimiser can be run separately. It takes exactly two arguments and
has no options.
peep-ufo source destination
C.6.4
as-ufo
The assembler for ufo is also table driven. It outputs object files in a linkable
a.out format. It takes the input file name as its sole input.
as-ufo [option] source
Options
-oname Names the output, otherwise a.out
-g Inserts any debugging symbols into the object file.
C.6.5
link-ufo
The UFO linker expects input files in linkable a.out format and produces
output in executable a.out format.
link-ufo [options] source1, source2, ..., sourceN
Options
-oname Names the output, otherwise a.out
-lname Links in the named library named libname.ufo
-pathdir Add dir to the search path when looking for libraries.
-e Fail if no entry point.
-a Fail if any entry points.
-n Fail if any unresolved symbols.
-strip Remove symbols
-fix Remove relocation data.
-types=none No type checking.
-types=c C type checking. Default.
-types=strict Strict type checking.
-boot Insert startup code at zero.
-sp=value Initial SP offset, only valid with -boot
-rp=value Initial RP offset, only valid with -boot
To access linker options from lcc-s use the -Wl flag, so to strip symbols
from the driver use -Wl-strip.
Future Options
-textaddress Define start address of text section
-dataaddress Define start address of data section
-bssaddress Define start address of bss section
Type checking
The compiler emits very simple type information about symbols. This is in
the form .type symbol type_symbol. For example, .type main F21 defines
main as a function consuming two stack cells and producing one. The linker
can use this to ensure stack consistency. There are three levels of checking:
None No checking is done
C Defined types are checked but undefined types, which are allowed in
pre-ANSI C, are ignored. This is the default level.
Strict Defined types are checked. Undefined types are not allowed.
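A sketch of how the two checking levels that do any work might compare two type records for the same symbol; the encoding ("F" followed by the cells consumed and produced, as in F21) is inferred from the example above.

#include <string.h>

/* a and b are the type strings recorded for one symbol; an empty string
 * stands for an undefined (pre-ANSI) type. */
int types_consistent(const char *a, const char *b, int strict)
{
    if (a[0] == '\0' || b[0] == '\0')
        return !strict;               /* undefined types: allowed unless strict */
    return strcmp(a, b) == 0;         /* defined types must agree exactly       */
}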
Appendix D
Description
The address of a local variable
A constant
Indirection - Fetch value from address
Add
Multiply
Assign left child(value) to right child(address)
Left shift
Jump
Branch on a comparison of the left node with the right node
Exits procedure, returns child
Virtual register node - Replaced during register allocation.
Appendix E
E.1
Development path
The lcc-s compiler is able to generate code that takes full advantage of stack
architecture, yet in order to reduce local variable accesses to RISC levels, more
available registers will be required. Increasing the number of registers to eight
would appear to be sufficient, but this would need thirty six stack manipulation
instructions for the current orthogonal stack access. This can be trimmed
down to twenty two by removing the tuck and drop instructions, but this is
still a lot, so is it possible to reduce it further? By removing the rotate(down)
instructions, the number of instructions can be brought down to sixteen, but
there is no way to store a value anywhere but the top of the stack. This means
that Koopman's Algorithm cannot be implemented; however, there is no reason
why an effective register allocator cannot be designed to meet this constraint.
Out of order execution
In order to support out of order execution, successive useful instructions1
must not have dependencies. In order to avoid this the compiler would have to
insert two stack manipulations between each useful instruction; in fact, even
with the rotate(down) instruction, most useful instructions would still
need to be separated by a pair of stack manipulations. This means that, not
1
In other words, not stack manipulations, which merely change the arrangement of values
and do not change the values themselves
only would the instruction issuing engine have to do a huge number of stack
manipulations, the code size would increase unacceptably.
Embedded stack manipulations
Since, for a typical operation such as add, the required sequence of instructions
would be something like copy4 rot6 add, it makes sense to embed the stack
manipulations within the instruction, so the add becomes add c4, r6. The
c meaning copy, that is leave the value in the stack and use a copy, and r
meaning rotate, that is pull the value out of the stack for use. Since the copy
or rotate operand packs nicely into four bits (bits 3 - 1: Register, bit 0: Copy?), a regular
instruction format can be used which will assist rapid decoding and issuing of instructions,
thus allowing multiple instructions to be issued per cycle. Note that the rot1
form is not redundant any more, since the addressing is no longer implicit.
E.2
Instruction Format
E.2.1
15 - 12: Category   11 - 8: Field0   7 - 4: Field1   3 - 0: Field2
Categories
Opcode  Name                  Stack effect
0       Arithmetic            -1 to +1
1       Arithmetic Immediate  -1 to +1
2       Logical               -1 to +1
3       Fetch                 0 or +1
4       Store                 -1 or 0
5       Call                  0
6       Long Call             0
7       Local Fetch           +1
8       Local Store           -1 or 0
9       Jump                  0
10      Long Jump             0
11      BranchT               -1
12      BranchF               -1
13      Test                  -1 to +1
14      Literal               +1
15      Special               ?
E.2.2
Arithmetic
15 - 12: Arithmetic   11 - 9: Operation   8: Carry   7 - 5: Register1   4: Preserve   3 - 1: Register2   0: Preserve
E.2.3
Arithmetic Immediate
15 - 12: Arithmetic   11 - 9: Operation   8: Carry   7 - 5: Register1   4: Preserve   3 - 0: Constant
E.2.4
Logical
15 - 12
Logical
Logical
11
0
1
10 - 9
Operation
Operation
8
Invert
Invert
7-5
Register1
Register1
4
Preserve
Preserve
Logical operations
1. And
2. Or
3. Xor
4. K
E.2.5
15 - 12
Fetch
Fetch
11 - 9
Address Register
8
Preserve
7-0
Offset (signed)
3-1
0
Register2 Preserve
Constant
E.2.6
Store
E.2.7
11 - 9
Value Register
11 - 8
0000
11 - 9
Value Register
E.2.10
11
Link
7-0
Offset (signed)
E.2.11
15 - 12
BranchT
E.2.12
15 - 12
BranchF
10 - 0
Offset (signed)
Long Jump
11
Link
15 - 12
Long Jump
15 - 12
Test
Test
8
Preserve
Jump
15 - 12
Jump
E.2.13
7-0
Offset (signed)
Local Store
15 - 12
Local Store
E.2.9
7-0
Offset (signed)
Local Fetch
15 - 12
Local Fetch
E.2.8
8
Preserve
10 - 0
Offset (top 11 bits)
BranchT
11 - 0
Offset (signed)
BranchF
11 - 0
Offset (signed)
Test
11 - 9
Operation
Operation
8
0
1
7-5
Register1
Register1
Test operations
1. Equals
2. Not equals
3. Less than
4. Greater than or equals
4
Preserve
Preserve
3-1
0
Register2 Preserve
Constant
E.2.14
Literal
15 - 12
Literal
11 - 0
Constant
E.2.15
Special
These operations are not necessarily unusual or rare, they just don't fit into
the general framework.
15 - 12
Special
Special
11
0
1
10 - 8
Operation
Operation
7-5
Register1
4
Preserve
Constant
3-0
0000
Appendix F
Quantitative Comparison of
Global Allocators
Figure F.1: doubleloop.c
#include <stdio.h>

int main(int p, int q, int x, int y)
{
    int r = 0;
    int z = 0;
    int i, j, k, l, m, n;
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            for (k = 0; k < 10; k++) {
                r += q + q;
            }
        }
    }
    printf("%d\n", r);
    for (l = 0; l < 10; l++) {
        for (m = 0; m < 10; m++) {
            for (n = 0; n < 10; n++) {
                z += x + y + l + m + n;
                z += x + y;
            }
        }
    }
    printf("%d\n", z);
    return 0;
}
Optimiser     none      koopman   bailey    global1   global2
doubleloop    80241     57801     57801     55275     43453
twister       177014    106014    101010    106007    138007
[Figure: relative performance of the allocators on doubleloop and twister.]
Appendix G
Source Code
This appendix describes how to build lcc-s from the sources. The source for the back-end of lcc-s
is included on the CD. To get the source code for the front-end you will have to download it from
http://www.cs.princeton.edu/software/lcc/. The front end has been modified slightly to inform the
back-end about the jumps in switch statements. The following code
{
	int i;
	(*IR->swtch)(equated(cp->u.swtch.deflab));
	for (i = 0; i < cp->u.swtch.size; i++)
		(*IR->swtch)(equated(cp->u.swtch.labels[i]));
}
should be inserted immediately after:
case Switch:
and before the following
break ;
in the function
void gencode(Symbol caller[], Symbol callee[])
In the file c.h, the structure interface should have the extra member
void (*swtch)(Symbol);
Bibliography
[1] Usenet nuggets. SIGARCH Comput. Archit. News, 21(1):36-38, 1993.
[2] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. In POPL '88: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 1-11, New York, NY, USA, 1988. ACM Press.
[3] G. M. Amdahl, G. A. Blaauw, and F. P. Brooks, Jr. Architecture of the IBM System/360. IBM Journal of Research and Development, 44(1/2):21-36, Jan./Mar. 2000. Special issue: reprints on Evolution of information technology 1957-1999.
[4] C. Bailey. Optimisation Techniques for Stack-Based Processors. PhD thesis, July 1996.
[5] C. Bailey. Inter-boundary scheduling of stack operands: A preliminary study. Proceedings of EuroForth 2000, pages 3-11, 2000.
[6] C. Bailey. A proposed mechanism for super-pipelined instruction-issue for ILP stack machines. In DSD, pages 121-129. IEEE Computer Society, 2004.
[7] L. Brodie. Starting Forth: An introduction to the Forth language and operating system for beginners and professionals. Prentice Hall, second edition, 1987.
[8] J. L. Bruno and T. Lassagne. The generation of optimal code for stack machines. J. ACM, 22(3):382-396, 1975.
[9] L. N. Chakrapani, J. C. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah. Trimaran: An infrastructure for research in instruction-level parallelism. In R. Eigenmann, Z. Li, and S. P. Midkiff, editors, LCPC, volume 3602 of Lecture Notes in Computer Science, pages 32-41. Springer, 2004.
[10] C. W. Fraser and D. R. Hanson. A Retargetable C Compiler: Design and Implementation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[11] C. W. Fraser and D. R. Hanson. The lcc 4.x code-generation interface. Technical Report MSR-TR-2001-64, Microsoft Research (MSR), July 2001.
[12] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a simple, efficient code-generator generator. ACM Lett. Program. Lang. Syst., 1(3):213-226, 1992.
[13] J. Gosling, B. Joy, G. Steele, and G. Bracha. The Java Language Specification, Second Edition. Addison Wesley, 2000.
[14] H. Gunnarsson and T. Lundqvist. Porting the GNU C compiler to the Thor microprocessor. Master's thesis, 1995.
[15] D. R. Hanson and C. W. Fraser. A Retargetable C Compiler: Design and Implementation. Addison Wesley, 1995.
[16] J. R. Hayes and S. C. Lee. The architecture of FRISC 3: A summary. In Proceedings of the 1988 Rochester Forth Conference on Programming Environments, Box 1261, Annandale, VA 22003, USA, 1988. The Institute for Applied Forth Research, Inc.
[17] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, third edition, 2003.
[18] P. Koopman, Jr. A preliminary exploration of optimized stack code generation. Journal of Forth Application and Research, 6(3):241-251, 1994.
[19] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1996.
[20] M. Maierhofer and M. A. Ertl. Local stack allocation. In CC '98: Proceedings of the 7th International Conference on Compiler Construction, pages 189-203, London, UK, 1998. Springer-Verlag.
[21] G. J. Myers. The case against stack-oriented instruction sets. SIGARCH Comput. Archit. News, 6(3):7-10, 1977.
[22] E. I. Organick. Computer system organization: The B5700/B6700 series (ACM monograph series). Academic Press, Inc., Orlando, FL, USA, 1973.
[23] S. Pelc and C. Bailey. Ubiquitous forth objects. Proceedings of EuroForth 2004, 2004.
[24] P. J. Koopman, Jr. Stack computers: the new wave. Halsted Press, New York, NY, USA, 1989.
[25] M. Shannon and C. Bailey. Global stack allocation. In Proceedings of EuroForth 2006, 2006.
[26] H. Shi and C. Bailey. Investigating available instruction level parallelism for stack based machine architectures. In DSD, pages 112-120. IEEE Computer Society, 2004.
[27] R. M. Stallman. Using and Porting the GNU Compiler Collection, For GCC Version 2.95. Free Software Foundation, Inc., 1999.
[28] W. F. Keown, Jr., P. Koopman, Jr., and A. Collins. Performance of the Harris RTX 2000 stack architecture versus the Sun 4 SPARC and the Sun 3 M68020 architectures. SIGARCH Comput. Archit. News, 20(3):45-52, 1992.
Index
A global approach, 60
A Global register allocator, 58
A new register allocator, 80
Abstract stack machine, 15
Additional annotations for register allocation, 45
Advanced stack architectures, 80
Advantages of the stack machine, 12
An example, 28
Analysis, 51
Analysis of Baileys Algorithm, 57
Back-end infrastructure, 40
Baileys Algorithm, 56
Benchmarks, 71
Chain allocation, 64
Change of semantics of root nodes in the lcc forest,
44
Classifying stack machines by their data stack,
22
Comparing the two global allocators, 75
Compiler correctness, 71
Compilers, 17
Compilers for stack machines, 17
Context for This Thesis, 19
Cost models, 73
Current Stack Machines, 13
Peephole optimisation, 68
Performance of Stack machines, 14
Pipelined stack architectures, 81
Producing the flow-graph, 38
Program flow, 39
Propagation of preferred values, 62
Register allocation for stack machines, 18
Relative results, 74
Representing stack manipulation operators as lcc
tree-nodes., 40
Representing the stack operations., 43
Edge-sets, 27
Final allocation, 66
GCC vs LCC, 33
Goal, 20
Heuristics for determining the x-stacks, 62
History, 12
How the logical stack regions relate to the real
stack, 27
Implementation, 58
Implementation of Koopmans Algorithm, 52
Implementation of optimisations, 40
Optimisation, 38
Ordering of variables., 61
Outline Algorithm, 61