Finding Difficult Branches
Lucian Vintan, Arpad Gellert, Adrian Florea, Marius Oancea Lucian Blaga University of Sibiu, Computer Science Department, Emil Cioran Street, No. 4, 550025 Sibiu, Romania, E-mail: {lucian.vintan, arpad.gellert, adrian.florea, marius.oancea}@ulbsibiu.ro
1. Introduction
Two trends are further increasing the importance of branch prediction. From an architectural point of view, processors are getting wider and pipelines deeper, allowing more aggressive clock rates in order to improve overall performance. A very high frequency implies a very short clock cycle, so the prediction can no longer be delivered in a single clock cycle, or in at most two cycles, which is the prediction latency of current commercial processors (see the Alpha 21264 branch predictor) [C]. A very wide superscalar processor also loses considerable performance on a misprediction, when the CPU context must be recovered and the correct path has to be (re)issued. The performance of a Pentium 4 equivalent processor degrades by 0.45% per additional misprediction cycle, and therefore the overall performance is very sensitive to branch prediction. From a technological point of view, modern high-end processors use an array of tables for branch direction and target prediction [D]. These tables are quite large (352 Kbits for the direction predictor in the Alpha EV8) and they are accessed every cycle, resulting in significant energy consumption, sometimes more than 10% of the total chip power [E]. Despite the ability of neural branch predictors to achieve very high prediction rates, the associated complexity (latency, a large number of adder circuits, area and power) is still an obstacle to the industrial adoption of this technique. The path-based neural predictors [F] improve the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16% over the original perceptron predictor. A branch may be linearly inseparable as a whole, but it may be piecewise linearly separable with respect to the distinct associated program paths. In other words, the path-based neural predictor combines path history with pattern history, resulting in learning skills superior to those of a neural predictor that relies only on pattern history.
2. Simulation Methodology
Our first goal is to find the difficult-to-predict branches in the SPEC2000 benchmarks. We consider a branch in a certain context to be difficult to predict if it is unbiased [B], meaning that the numbers of taken and not taken outcomes observed in that context are close (the closer they are, the more unbiased the branch), and if the taken and not taken outcomes are shuffled. The second goal is to improve prediction accuracy for branches with a low polarization rate, by introducing new feature sets that increase their polarization rate and, therefore, their predictability. A feature is the binary context, on p bits, of prediction information such as local history, global history or path. Each static branch has k associated dynamic contexts in which it can appear ($k \le 2^p$). A context instance is a dynamic branch executed in the respective context. We introduce the polarization index (P) of a certain branch context:

$$P(S_i) = \max(f_0, f_1) = \begin{cases} f_0, & f_0 \ge 0.5 \\ f_1, & f_0 < 0.5 \end{cases} \qquad (1)$$

where:
$S = \{S_1, S_2, \ldots, S_k\}$ is the set of distinct contexts that appear during all branch instances;
k is the number of distinct contexts, $k \le 2^p$, where p is the length of the binary context;
$f_0 = \frac{T}{T + NT}$ and $f_1 = \frac{NT}{T + NT}$, where NT is the number of not taken branch instances corresponding to context $S_i$ and T is the number of taken branch instances corresponding to context $S_i$, $(\forall)\, i = 1, 2, \ldots, k$, and obviously $f_0 + f_1 = 1$;
if $P(S_i) = 1$, $(\forall)\, i = 1, 2, \ldots, k$, then the context $S_i$ is completely biased (100%), and thus the corresponding branch is highly predictable;
if $P(S_i) = 0.5$, $(\forall)\, i = 1, 2, \ldots, k$, then the context $S_i$ is totally unbiased, and thus the corresponding branch is not predictable if the taken and not taken outcomes are shuffled.
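The polarization index defined above can be illustrated with a minimal Python sketch. The trace format (tuples of branch address, context value and outcome) and the function names are assumptions made for this illustration, not the authors' simulator; the 0.95 cut-off is the one used later in the paper to call a context unbiased.

```python
from collections import defaultdict


def polarization_by_context(trace):
    """Return P(S_i) for every (branch address, context) pair in a trace.

    `trace` is a hypothetical iterable of (pc, context, taken) tuples,
    where `context` is the p-bit feature value seen by that instance.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [NT, T]
    for pc, context, taken in trace:
        counts[(pc, context)][1 if taken else 0] += 1

    result = {}
    for key, (nt, t) in counts.items():
        f0 = t / (t + nt)   # fraction of taken instances
        f1 = nt / (t + nt)  # fraction of not taken instances
        result[key] = max(f0, f1)
    return result


def unbiased_contexts(polarization, threshold=0.95):
    """Contexts whose polarization index is below the threshold."""
    return {key for key, p in polarization.items() if p < threshold}
```

For example, a context with three taken and one not taken instance gets P = 0.75 and is reported as unbiased, while a context that is always taken gets P = 1.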
As can be observed in Figure 1, we analyze the feature sets used by different present-day branch predictors in order to reduce, step by step, the list of unbiased branch contexts (contexts with low polarization P). Each feature set is evaluated only on the branches found unbiased with the previous feature sets, not on all branches of the benchmark, because the rest were already solved by the previous feature sets. For the final list of unbiased branches we will try to find new feature sets in order to further improve their polarization index.
[Figure 1: diagram relating the simulated feature sets (local history; global history; global history XOR branch address) to the branch predictors that use them (PAg, GAg, Gshare) and to the resulting lists of unpolarized branches.]
More exactly, as can be observed in Figure 2, a certain context of a branch is evaluated only if that branch was unbiased for all its previously analyzed contexts. Thus, the final list of unbiased branches contains only the branches that were unbiased for all their contexts of all lengths (16 to 28 bits).
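The successive filtering described above can be sketched as follows. This is only an illustration under stated assumptions: each dynamic branch instance is represented as a dictionary holding its address, its outcome and one context value per feature set, and the keys `"pc"`, `"taken"`, `"lh"` and `"gh"` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def filter_unbiased(instances, feature, threshold=0.95):
    """Keep only instances whose (pc, context) pair is unbiased for `feature`."""
    counts = defaultdict(lambda: [0, 0])  # (pc, context) -> [NT, T]
    for inst in instances:
        counts[(inst["pc"], inst[feature])][int(inst["taken"])] += 1

    def polarization(key):
        nt, t = counts[key]
        return max(nt, t) / (nt + t)

    return [inst for inst in instances
            if polarization((inst["pc"], inst[feature])) < threshold]


def cascade(instances, features, threshold=0.95):
    """Apply the feature sets in order; each one sees only the survivors."""
    surviving = list(instances)
    for feature in features:
        surviving = filter_unbiased(surviving, feature, threshold)
    return surviving
```

Calling `cascade(instances, ["lh", "gh"])` first discards every instance that is biased on its local-history context and then evaluates global history only on what remains, so the returned list contains the instances that are unbiased for both feature sets.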
[Figure 2 diagram: the feature sets LH, GH and GH XOR PC are applied in sequence at 16 bits, then at 20 bits, and so on up to p bits.]
Figure 2. Reducing the number of unbiased branches through feature set extension.

We concentrated only on benchmarks whose percentage of unbiased branch context instances, obtained with equation (2), is greater than a threshold of T = 1%. The potential prediction accuracy improvement is not significant for benchmarks with less than 1% unbiased context instances: if 1% of the branch context instances are unpredictable, then even solving all of them would increase the prediction accuracy by at most 1%.
$$T = \frac{NUB_i}{NB_i} \qquad (2)$$

where $NUB_i$ is the total number of unbiased branch context instances on benchmark i, and $NB_i$ is the number of dynamic branches on benchmark i (therefore, the total number of branch context instances).
3. Simulation Results
We started our study by evaluating the branch contexts from the SPEC2000 benchmarks on a local branch history of 16 bits. All simulation results are reported on 1 billion instructions, after skipping the first 300 million instructions. Table 1 presents, for each benchmark, the percentages of branch contexts with polarization indexes belonging to five different intervals.
Polarization Rate (P) [%]
[0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)
  10.50        8.17        8.52
   5.90        3.68        4.56
  16.50        8.58        6.94
  15.63       11.03        9.50
  12.72        6.92        5.34
   2.68        1.72        2.30
  10.65        6.68        6.19
Table 1. Polarization rates of branch contexts on local history of 16 bits.

The column Dynamic Branches contains the number of all dynamic conditional branches for each benchmark. The column Static Branches contains the number of static branches for each benchmark. For each benchmark we generated, using equation (1), a list of unbiased branch contexts, having polarization less than 0.95. We considered that the branch contexts with polarization greater than 0.95 are predictable and will obtain relatively high prediction accuracies, around 0.95; therefore, in these cases the potential improvement of the prediction accuracy is low. The following table compares the prediction accuracies obtained with the original PAg predictor (Figure 3) on all branches with those obtained with PAg predicting only the unbiased contexts. The predictors have the same configuration: 1024 entries in the first level (L1size = 1024), LHR length of 16 bits (W = 16), and 2^16 entries in the second level (L2size = 65536). For the PAg predictor we used the sim-bpred simulator from Simplesim-3.0, with the following options: -bpred 2lev -bpred:2lev 1024 65536 16 0.
SPEC2000     PAg                  PAg/Unbiased Contexts   Unbiased Context Instances
Benchmark    Address  Direction   Address  Direction      (P<0.95)
mcf          0.9838   0.9838      0.8267   0.8267          6812313    5.76%
parser       0.9367   0.9367      0.7748   0.7748         17589658   20.60%
bzip         0.9010   0.9010      0.6913   0.6913         11252986   26.42%
gzip         0.8905   0.8905      0.7423   0.7423         27692102   38.73%
twolf        0.8490   0.8490      0.7078   0.7078         31763071   44.98%
gcc          0.9284   0.9299      0.8049   0.8059          9809360   10.80%
Average      0.9149   0.9151      0.7579   0.7581         17486582   24.55%
Table 2. Prediction accuracy on the unbiased branch contexts for local history of 16 bits. The column Unbiased Context Instances contains, for each benchmark, the number of unbiased context instances and their percentage relative to all context instances (dynamic branches).
[Figure 3: PAg predictor. Diagram labels: PC, log2 L1size, LHR (entries 0 to L1size-1, W bits each), L2size, prediction bits, predicted PC.]
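To make the PAg configuration above concrete, here is a minimal Python sketch of a two-level predictor with per-address history registers and a shared table of 2-bit saturating counters. It is an illustration under stated assumptions, not the sim-bpred implementation: the counter initial state (weakly not taken) and the address shift of 3 bits used to select a history register are choices made for this example.

```python
class PAgPredictor:
    """Two-level predictor: per-address local histories, one shared pattern table."""

    def __init__(self, l1_size=1024, history_bits=16, pc_shift=3):
        self.l1_size = l1_size
        self.pc_shift = pc_shift              # assumed alignment shift
        self.mask = (1 << history_bits) - 1
        self.lhr = [0] * l1_size              # first level: local history registers
        # Second level: 2**history_bits two-bit counters, assumed to start at 1.
        self.counters = [1] * (1 << history_bits)

    def _l1_index(self, pc):
        return (pc >> self.pc_shift) % self.l1_size

    def predict(self, pc):
        history = self.lhr[self._l1_index(pc)]
        return self.counters[history] >= 2    # True means "taken"

    def update(self, pc, taken):
        i = self._l1_index(pc)
        history = self.lhr[i]
        c = self.counters[history]
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the local history register.
        self.lhr[i] = ((history << 1) | int(taken)) & self.mask
```

With the default arguments this mirrors the sizes quoted in the text (L1size = 1024, W = 16, L2size = 65536). A strictly alternating branch is learned once its history register has filled, because the two alternating history patterns select two different counters.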
As can be observed in Table 2, the bzip, gzip and twolf benchmarks are difficult to predict with the original PAg predictor (prediction accuracies below 0.9 on all branches). The low prediction accuracies obtained with PAg on the unbiased contexts alone, together with the high percentages of unbiased contexts, show that the prediction accuracy can be significantly improved. We continue our work by analyzing a global branch history of 16 bits only on the local branch contexts that we found unbiased for local branch history (see the last column of Table 2). That means that we used a dynamic branch in our evaluations only if its 16-bit local context is one of the unbiased local contexts. Table 3 presents, for each benchmark, the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches (the local branch context instances that we found unbiased for local history) and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark.
SPEC2000     Unbiased Dynamic       Unbiased Static   Polarization Rate (P) [%]
Benchmark    Branches               Branches          [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)
mcf           6812313    5.76%        25              11.94        9.25        8.13
parser       17589658   20.60%       707               6.98        5.71        6.18
bzip         11252986   26.42%        83              16.62       14.36       13.80
gzip         27692102   38.73%        62              10.09        9.01       10.88
twolf        31763071   44.98%       132               7.43        6.39        9.89
gcc           9809360   10.80%      4923               4.13        3.14        3.56
Average      17486582   24.55%       988.66            9.53        7.97        8.74
Table 3. Polarization rates of branch contexts on global history of 16 bits evaluating only the unbiased local branch contexts of 16 bits
Continuing the previous methodology, for each benchmark we generated, using equation (1), a list of unbiased branch contexts on local and global history of 16 bits, having polarization less than 0.95. The following table compares the prediction accuracies obtained with the original GAg predictor (Figure 4) on all branches with those obtained with GAg on these unbiased branch contexts. The last column contains the number of unbiased branch context instances and their percentage relative to all dynamic branches. The predictors have the same configuration: one global history register of 16 bits, and 2^16 entries in the second level (L2size = 65536). For the GAg predictor we used the sim-bpred simulator from Simplesim-3.0, with the following options: -bpred 2lev -bpred:2lev 1 65536 16 0.
SPEC2000     GAg                  GAg/Unbiased Contexts   Unbiased Context Instances
Benchmark    Address  Direction   Address  Direction      (P<0.95)
mcf          0.9850   0.9850      0.8250   0.8250          3887052    3.28%
parser       0.9452   0.9452      0.7010   0.7011         11064817   12.95%
bzip         0.9080   0.9080      0.6629   0.6629          9969701   23.40%
gzip         0.9241   0.9241      0.7363   0.7363         20659305   28.89%
twolf        0.8564   0.8564      0.6591   0.6591         22893014   32.41%
gcc          0.9492   0.9512      0.7356   0.7365          3563776    3.92%
Average      0.9279   0.9283      0.7199   0.7201         12006278   17.48%
Table 4. Prediction accuracy on the unbiased branch contexts for local and global history of 16 bits.
[Figure 4: GAg predictor. Diagram labels: GHR (W bits), PC, L2size, prediction bits, predicted PC.]
Comparing Tables 2 and 4, we can observe that the global branch history reduced the average percentage of unbiased branch context instances from 24.55% to 17.48%, and it also increased the average prediction accuracy on all branches from 0.91 with the PAg predictor to 0.92 with the GAg predictor. However, the branch contexts that are still unbiased (for local and global history of 16 bits) are more difficult to predict: on these branch contexts, with the GAg predictor, we measured an average prediction accuracy of 0.72. The next feature set we analyzed is the XOR between a global branch history of 16 bits and the lower part of the branch address (PC bits 18 to 3). We used again only the branch contexts we found unbiased for the previous feature sets (local and global branch history of 16 bits). That means that we used a dynamic branch in our evaluations only if its 16-bit local context is one of the unbiased local contexts (Table 2), and its 16-bit global context is one of the unbiased global contexts (Table 4). Table 5 presents, for each benchmark, the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark.
SPEC2000     Unbiased Dynamic
Benchmark    Branches
mcf           3887069    3.28%
parser       11065068   12.95%
bzip          9969757   23.40%
gzip         20659343   28.89%
twolf        22893103   32.41%
gcc           3565197    3.92%
Average      12006590   17.48%
Table 5. Polarization rates on the XOR between global history and branch address on 16 bits evaluating only the unbiased local and global branch contexts of 16 bits.

For each benchmark we generated again, using equation (1), a list of unbiased branch contexts with polarization less than 0.95 (unbiased for local and global history of 16 bits and for the XOR of global history and branch address on 16 bits). The following table compares the prediction accuracies obtained with the original Gshare predictor (Figure 5) on all branches with those obtained with Gshare only on the determined unbiased branch contexts. The last column contains, for each benchmark, the number of unbiased branch context instances and their percentage relative to all dynamic branches. The predictors have the same configuration: one global history register of 16 bits, and 2^16 entries in the second level (L2size = 65536). For the Gshare predictor we used the sim-bpred simulator from Simplesim-3.0, with the following options: -bpred 2lev -bpred:2lev 1 65536 16 1.
SPEC2000     Gshare               Gshare/Unbiased Contexts   Unbiased Context Instances
Benchmark    Address  Direction   Address  Direction         (P<0.95)
mcf          0.9849   0.9849      0.8302   0.8302             3887050     3.28%
parser       0.9510   0.9510      0.7031   0.7032            11063791    12.95%
bzip         0.9110   0.9110      0.6563   0.6563             9969678    23.40%
gzip         0.9231   0.9231      0.7352   0.7352            20659290    28.89%
twolf        0.8837   0.8837      0.6676   0.6676            22892985    32.41%
gcc          0.9603   0.9623      0.7388   0.7398             3561998     3.91%
Average      0.9356   0.9360      0.7218   0.7220            12005798.7  17.47%
Table 6. Prediction accuracy on the unbiased branch contexts for local and global history of 16 bits and for the XOR between global history and branch address on 16 bits.

As can be observed, the XOR of global history and branch address increased the prediction accuracy of the Gshare predictor by almost 1%, but it did not reduce the percentage of unbiased context instances. The high percentages of unbiased branch context instances in the case of the bzip, gzip and twolf benchmarks represent a potential improvement of prediction accuracy.
[Figure 5: Gshare predictor. Diagram labels: GHR (W bits), XOR, PC, L2size, prediction bits, predicted PC.]
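The difference between the GAg and Gshare contexts comes down to how the second-level index is formed. The sketch below is a minimal illustration; the function names are hypothetical, and the shift of 3 bits (so that a 16-bit context uses the low-order address bits above the alignment bits) is an assumption made for the example.

```python
def gag_index(global_history, bits=16):
    """GAg: the context is the global history register alone."""
    return global_history & ((1 << bits) - 1)


def gshare_index(global_history, pc, bits=16, pc_shift=3):
    """Gshare: the context is global history XOR low-order branch address bits."""
    mask = (1 << bits) - 1
    return (global_history ^ (pc >> pc_shift)) & mask
```

Two branches that share the same global history map to the same GAg entry, but generally to different Gshare entries, because their addresses differ in the bits mixed into the index.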
We now want to analyze, for the unbiased branch contexts, whether the taken and not taken outcomes are grouped separately. This study is necessary because, if the taken and not taken outcomes are grouped, they are predictable, whereas if they are shuffled the predictors cannot learn them, and therefore they are not predictable. For this study we introduce the distribution index for a certain branch context, defined as follows:

$$D(S_i) = \begin{cases} 0, & n_t = 0 \\ \dfrac{n_t}{2 \cdot \min(NT, T)}, & n_t > 0 \end{cases} \qquad (3)$$

where:
$n_t$ is the number of branch outcome transitions, from taken to not taken and vice versa, in context $S_i$;
$2 \cdot \min(NT, T)$ is the maximum number of possible transitions;
k is the number of distinct contexts, $k \le 2^p$, where p is the length of the binary context;
if $D(S_i) = 1$, $(\forall)\, i = 1, 2, \ldots, k$, then the behavior of the branch in context $S_i$ is contradictory (the most unfavorable case), and thus learning it is impossible;
if $D(S_i) = 0$, $(\forall)\, i = 1, 2, \ldots, k$, then the behavior of the branch in context $S_i$ is constant (the most favorable case), and it can be learned.
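As a minimal sketch of equation (3), the distribution index of one context can be computed from the ordered list of outcomes observed in that context. The function name and the list-of-booleans input are assumptions made for this illustration.

```python
def distribution_index(outcomes):
    """Distribution index D for the outcomes (True = taken) of one context."""
    taken = sum(1 for o in outcomes if o)
    not_taken = len(outcomes) - taken
    # n_t: number of taken <-> not taken transitions between consecutive outcomes.
    transitions = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    if transitions == 0:
        return 0.0
    # At least one transition implies both outcome kinds occur, so min(...) >= 1.
    return transitions / (2 * min(not_taken, taken))
```

A context whose outcomes are all identical yields 0, a context with five taken outcomes followed by five not taken outcomes yields 1 / (2 * 5) = 0.1, and the sequence N, T, N, T, N reaches the maximum of 4 / (2 * 2) = 1.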
We used equation (3) to determine the distribution indexes for each unpredictable branch context per benchmark. We evaluated only the dynamic branches having all their contexts unbiased (on local history, global history and the XOR of global history and branch address). Table 7 shows, for each benchmark, the percentages of branch contexts with distribution indexes belonging to five different intervals in the case of local branch history. In the same way, Tables 8 and 9 present the distribution indexes in the case of global history and of the XOR between global history and branch address, respectively. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark.
Tables 7, 8 and 9 show that, in the case of unbiased branch contexts, the taken and not taken outcomes are not grouped separately; moreover, they are highly shuffled: 76.3% of the unbiased branch contexts have highly shuffled outcomes in the case of local history of 16 bits (see Table 7), 89.37% of them have highly shuffled outcomes in the case of local and global history of 16 bits (see Table 8), and 89.37% of them have highly shuffled outcomes in the case of local history and XOR of global history and branch address on 16 bits (see Table 9). It can be observed that we obtained the same distribution indexes for both the global history and the XOR between global history and branch address (Tables 8 and 9). A distribution index of 1.0 means the highest possible alternation frequency (with taken or not taken periods of 1). A distribution index of 0.5 again means a high alternation since, supposing a constant frequency, the taken or not taken periods are only 2, lower than the learning times of the predictors. In the same manner, periods of 3 introduce a distribution index of about 0.25, and periods of 5 generate a distribution index of 0.15; therefore, we considered that if the distribution index is lower than 0.2 the taken and not taken outcomes are not highly shuffled, and the behavior of the branch can be learned.
SPEC2000     Unbiased Dynamic       Unbiased Static   Distribution Rate (D) [%]
Benchmark    Branches               Branches          [0.2, 0.4)  [0.4, 0.6)  [0.6, 0.8)
mcf           3887069    3.28%        19              11.02       46.30       13.32
parser       11064250   12.95%       483               9.50       42.44        9.63
bzip          9969752   23.40%        75               6.45       44.00       16.80
gzip         20659339   28.89%        51               5.38       38.70       20.98
twolf        22893094   32.41%       110               5.81       43.42       16.71
gcc           3564489    3.91%      2553               9.11       33.32        6.00
Average      12006332   17.47%       548.5             7.87       41.36       13.90
Table 7. Distribution rates on local history of 16 bits evaluating only the branches that were unbiased on all their 16 bit contexts (on local history, global history and respectively XOR of global history and branch address)
SPEC2000     Unbiased Dynamic       Distribution Rate (D) [%]
Benchmark    Branches               [0.2, 0.4)  [0.4, 0.6)  [0.6, 0.8)
mcf           3887069    3.28%       4.30       37.75       34.38
parser       11064250   12.95%      14.62       36.63       19.33
bzip          9969752   23.40%       2.94       32.24       37.43
gzip         20659339   28.89%       2.18       26.45       35.19
twolf        22893094   32.41%       5.12       26.84       28.44
gcc           3564489    3.91%      18.03       38.66       16.06
Average      12006332   17.47%       7.86       33.09       28.47
Table 8. Distribution rates on global history of 16 bits evaluating only the branches that have all their 16 bit contexts unbiased (on local history, global history and respectively XOR of global history and branch address)
SPEC2000     Unbiased Dynamic       Unbiased Static   Distribution Rate (D) [%]
Benchmark    Branches               Branches          [0.2, 0.4)  [0.4, 0.6)  [0.6, 0.8)
mcf           3887069    3.28%        19               4.30       37.75       34.38
parser       11064250   12.95%       483              14.62       36.63       19.33
bzip          9969752   23.40%        75               2.94       32.24       37.43
gzip         20659339   28.89%        51               2.18       26.45       35.19
twolf        22893094   32.41%       110               5.12       26.84       28.44
gcc           3564489    3.91%      2553              18.03       38.66       16.06
Average      12006332   17.47%       548.5             7.86       33.09       28.47
Table 9. Distribution rates on the XOR between global history and branch address on 16 bits evaluating only the branches that have all their 16 bit contexts unbiased (on local history, global history and respectively XOR of global history and branch address).
We continued our evaluations by extending the lengths of the feature sets from 16 bits to 20, 24 and 28 bits, our hypothesis being that longer feature sets will increase the polarization index and the prediction accuracy. We started with a local branch history of 20 bits, evaluating again only the branch contexts we found unbiased for the previous feature sets of 16 bits. Table 10 presents, for each benchmark, the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 10 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 20 bits, global history of 16 bits and XOR of global history and branch address on 16 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Unbiased Static   Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               Branches          [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           3887050    3.28%        19               8.41        7.96        5.28        5.97       72.37         3147989    2.66%
parser       11063878   12.95%       476               8.50        6.70        3.87        4.44       76.49         7838166    9.18%
bzip          9969651   23.40%        75               8.93        4.69        2.10        2.17       82.11         6493881   15.24%
gzip         20659242   28.89%        51               9.98        7.47        4.55        4.84       73.16        17753722   24.82%
twolf        22892904   32.41%       110              12.79       10.91        5.17        3.93       67.20        17540719   24.83%
gcc           3563213    3.91%      2546               7.79        6.31        3.68        4.56       77.66         2061136    2.26%
Average      12005990   17.47%       546.16            9.4         7.34        4.10        4.31       74.83         9139269   13.17%
Table 10. Polarization rates on local history of 20 bits evaluating only the branches that have all their 16-bit contexts unbiased (on local history, global history and XOR of global history and branch address).

Table 11 shows the results of using a global branch history of 20 bits, evaluating only the branches unbiased for local history of 20 bits, global history of 16 bits and XOR of global history and branch address on 16 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 11 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 20 bits, global history of 20 bits and XOR of global history and branch address on 16 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           3148005    2.66%      20.06       20.55       13.08       10.60       35.71         3057312    2.58%
parser        7838384    9.18%      15.44       14.61       10.83       11.04       48.09         7166404    8.39%
bzip          6493918   15.24%      15.86       17.02       12.45       12.43       42.24         6228047   14.62%
gzip         17753750   24.82%      15.32       16.89       15.88       17.75       34.16        17215762   24.07%
twolf        17540776   24.83%      13.96       12.79       11.63       17.61       44.00        16240443   22.99%
gcc           2062167    2.26%      14.59       13.77        9.35        9.93       52.37         1767385    1.94%
Average       9139500   13.17%      15.87       15.93       12.20       13.22       42.76         8612559   12.43%
Table 11. Polarization rates on global history of 20 bits evaluating only the unbiased branches on local history of 20 bits, global history of 16 bits, and the XOR of global history and branch address on 16 bits.

In the same manner, Table 12 shows the results of using a XOR of 20 bits between global history and branch address, evaluating only the branches unbiased for local history of 20 bits, global history of 20 bits and XOR of global history and branch address on 16 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 12 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 20 bits, global history of 20 bits and XOR of global history and branch address on 20 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           3057327    2.58%      30.53       31.28       19.91       16.14        2.13         3057309    2.58%
parser        7166723    8.39%      27.62       26.16       19.37       19.76        7.08         7166215    8.39%
bzip          6228107   14.62%      26.21       28.12       20.57       20.53        4.57         6228010   14.62%
gzip         17215799   24.07%      20.78       22.96       21.58       24.13       10.55        17215749   24.07%
twolf        16240535   22.99%      21.26       19.48       17.70       26.81       14.74        16240434   22.99%
gcc           1769008    1.94%      28.28       26.84       18.17       19.29        7.41         1766800    1.94%
Average       8612917   12.43%      25.78       25.80       19.55       21.11        7.74         8612420   12.43%
Table 12. Polarization rates on the XOR of 20 bits between global history and branch address evaluating only the branches unbiased for local history of 20 bits, global history of 20 bits and respectively XOR of global history and branch address on 16 bits.
As can be observed, a considerable number of unbiased branches become biased if the feature sets are extended from 16 bits to 20 bits. Extending the feature set length from 16 bits to 20 bits, the percentage of unbiased dynamic branches decreased from 17.47% (see Table 6) to 12.43% (Table 12), on average. Using the same simulation methodology, we extended the feature sets to 24 bits. Tables 13, 14 and 15 show the results of using a local history of 24 bits, a global history of 24 bits and a XOR of 24 bits between global history and branch address, respectively. Table 13 shows the results of using a local branch history of 24 bits, evaluating only the branches unbiased for local history of 20 bits, global history of 20 bits and XOR of global history and branch address on 20 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 13 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 24 bits, global history of 20 bits and XOR of global history and branch address on 20 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Unbiased Static   Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               Branches          [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           3057318    2.58%        18               9.04        7.95        4.59        5.41       73.01         2632531    2.22%
parser        7166415    8.39%       424              10.88        8.16        4.19        4.44       72.34         5083585    5.95%
Table 13. Polarization rates on local history of 24 bits only for branches that were unbiased on all their 20-bit contexts (on local history, global history and XOR of global history and branch address).

Table 14 shows the results of using a global branch history of 24 bits, evaluating only the branches unbiased for local history of 24 bits, global history of 20 bits and XOR of global history and branch address on 20 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 14 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 24 bits, global history of 24 bits and XOR of global history and branch address on 20 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Unbiased Static   Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               Branches          [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           2632542    2.22%        18              15.20       13.79        7.13        5.90       57.98         2568911    2.17%
parser        5083795    5.95%       414              18.82       16.61       10.90       10.41       43.25         4664394    5.46%
bzip          4250689    9.98%        73              12.10       11.31        7.12        7.60       61.87         3799893    8.92%
gzip         13753960   19.23%        44              18.43       18.17       15.37       16.36       31.67        13480788   18.85%
twolf         5459637   17.42%        93              16.99       14.90       10.91       13.88       43.32         5144339    7.28%
gcc           1228364    1.35%      1856              17.16       14.61        9.94       10.15       48.14         1097445    1.20%
Average       5401498    9.36%       416.33           16.45       14.89       10.22       10.71       47.70         5125962    7.31%
Table 14. Polarization rates on global history of 24 bits evaluating only the branches unbiased for local history of 24 bits, global history of 20 bits and respectively XOR of global history and branch address on 20 bits.
Table 15 shows the results of using the XOR of global branch history and branch address on 24 bits, evaluating only the branches unbiased for local history of 24 bits, global history of 24 bits and XOR of global history and branch address on 20 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 15 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 24 bits, global history of 24 bits and XOR of global history and branch address on 24 bits), and their percentage relative to all dynamic branches.
SPEC2000     Unbiased Dynamic       Unbiased Static   Polarization Rate (P) [%]                                    Unbiased Context
Benchmark    Branches               Branches          [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]   Instances (P<0.95)
mcf           2568928    2.17%        18              35.55       32.24       16.67       13.79        1.75         2568910    2.17%
parser        4664693    5.46%       398              31.21       27.52       18.08       17.25        5.93         4664273    5.46%
bzip          3799936    8.92%        72              30.43       28.45       17.91       19.13        4.07         3799859    8.92%
gzip         13480825   18.85%        41              24.64       24.29       20.55       21.87        8.66        13480783   18.85%
twolf         5144419    7.28%        89              27.03       23.73       17.38       22.10        9.76         5144327    7.28%
gcc           1098795    1.20%      1668              30.73       26.27       17.87       18.39        6.75         1097009    1.20%
Average       5126266    7.31%       381              29.93       27.08       18.07       18.75        6.15         5125860    7.31%
Table 15. Polarization rates on the XOR of 24 bits between global history and branch address evaluating only the branches unbiased for local history of 24 bits, global history of 24 bits and respectively XOR of global history and branch address on 20 bits.
Extending the feature set length from 20 bits to 24 bits, the percentage of unbiased dynamic branches decreased from 12.43% (see Table 12) to 7.31% (Table 15), on average. We then extended the feature sets to 28 bits. Tables 16, 17 and 18 show the results of using a local history of 28 bits, a global history of 28 bits and a XOR of 28 bits between global history and branch address, respectively. Table 16 shows the results of using a local branch history of 28 bits, evaluating only the branches unbiased for local history of 24 bits, global history of 24 bits and XOR of global history and branch address on 24 bits. The column Polarization Rate presents the percentages of branch contexts with polarization indexes belonging to five different intervals. The column Unbiased Dynamic Branches contains the number of simulated dynamic branches and their percentage relative to all dynamic branches. The column Unbiased Static Branches contains the number of static branches simulated within each benchmark. The last column of Table 16 shows, for each benchmark, the number of unbiased dynamic branches (unbiased for local history of 28 bits, global history of 24 bits and XOR of global history and branch address on 24 bits), and their percentage relative to all dynamic branches. As can be observed, in the case of the gcc benchmark, extending the feature set length to 28 bits brings the percentage of unbiased context instances below the threshold T of 1% (see equation (2)), and so we eliminate it from our next evaluations. We consider that the conditional branches from the gcc benchmark are not difficult to predict using feature lengths of 28 bits. When computing the values in the Average row of Table 16, we omitted the results obtained with the gcc benchmark, since it is eliminated from our evaluations.
SPEC2000    Unbiased Dynamic  Unbiased Static  Polarization Rate (P) [%]                                     Unbiased Context
Benchmark   Branches          Branches         [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]    Instances (P<0.95)
mcf          2568923   2.17%               18       10.62        8.64        4.69        5.35       70.69     2174101   1.83%
parser       4664502   5.46%              395       11.17        7.09        3.72        4.07       73.95     3301587   3.86%
bzip         3799904   8.92%               71       10.16        5.90        3.04        3.59       77.30     2728593   6.40%
gzip        13480777  18.85%               41        9.76        6.14        3.50        4.14       76.46    10691142  14.95%
twolf        5144325   7.28%               87        9.03        4.44        2.81        3.76       79.96     4208376   5.95%
gcc          1098269   1.20%             1644       13.68       10.29        5.68        6.76       63.59      774654   0.85%
Average      5931686   8.54%            122.4       10.14        6.44        3.55        4.18       75.67     4620759   6.60%
Table 16. Polarization rates on local history of 28 bits only for branches that were unbiased on all their 24 bit contexts (on local history, global history and respectively XOR of global history and branch address)
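The columns of Tables 16, 17 and 18 could be derived from a branch trace roughly as sketched below. This is only an illustration under stated assumptions: the polarization index of a context is taken to be the larger of its taken and not-taken frequencies, the Polarization Rate columns are assumed to count distinct (branch, context) pairs, and the name `polarization_stats` and the trace layout are hypothetical rather than those of the simulator used for the reported results.

```python
from bisect import bisect_right
from collections import defaultdict

# Lower edges of the five intervals [0.5,0.6), [0.6,0.7), ..., [0.9,1.0]
EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]


def polarization_stats(trace, threshold=0.95):
    """trace: iterable of (pc, context, taken) tuples, taken being a bool.

    Returns (rates, unbiased):
      rates[i]  percentage of (pc, context) pairs whose P lies in interval i
      unbiased  number of dynamic branches met in contexts with P < threshold
    """
    counts = defaultdict(lambda: [0, 0])  # (pc, context) -> [not_taken, taken]
    for pc, context, taken in trace:
        counts[(pc, context)][int(taken)] += 1

    bins = [0] * len(EDGES)
    unbiased = 0
    for not_taken, taken in counts.values():
        total = not_taken + taken
        p = max(not_taken, taken) / total  # assumed definition of P
        bins[bisect_right(EDGES, p) - 1] += 1
        if p < threshold:
            unbiased += total

    n = len(counts) or 1  # avoid dividing by zero on an empty trace
    rates = [100.0 * b / n for b in bins]
    return rates, unbiased
```

Dividing `unbiased` by the total number of dynamic branches of the benchmark would give the percentage shown in the last column; comparing that percentage with the threshold T is what removes gcc at 28 bits.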
Table 17 shows the results of using a global branch history of 28 bits, evaluating only the branches that were unbiased for a local history of 28 bits, a global history of 24 bits and the XOR of global history and branch address on 24 bits. The columns have the same meaning as in Table 16. The last column of Table 17 shows, for each benchmark, the number of dynamic branches that remain unbiased (for a local history of 28 bits, a global history of 28 bits and the XOR of global history and branch address on 24 bits) and their percentage relative to all dynamic branches.
SPEC2000    Unbiased Dynamic  Polarization Rate (P) [%]                                     Unbiased Context
Benchmark   Branches          [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]    Instances (P<0.95)
mcf          2174117   1.83%       15.41       11.53        6.18        5.29       61.60     2149108   1.81%
parser       3301768   3.86%       21.26       17.06       10.39       10.18       41.11     3041426   3.56%
bzip         2728627   6.40%       11.81        8.86        5.07        5.55       68.72     2280197   5.35%
gzip        10691161  14.95%       19.36       17.05       13.50       14.84       35.25    10405692  14.55%
twolf        4208418   5.95%       16.53       14.43       10.21       13.55       45.29     4007088   5.67%
Average      4620818   6.60%       16.87       13.78        9.07        9.88       50.39     4376702   6.19%
Table 17. Polarization rates on global history of 28 bits evaluating only the branches unbiased for local history of 28 bits, global history of 24 bits and respectively the XOR of global history and branch address on 24 bits.
Finally, Table 18 shows the results of using the XOR of global branch history and branch address on 28 bits, evaluating only the branches that were unbiased for a local history of 28 bits, a global history of 28 bits and the XOR of global history and branch address on 24 bits. The columns have the same meaning as in Table 16. The last column of Table 18 shows, for each benchmark, the number of dynamic branches that remain unbiased (for a local history of 28 bits, a global history of 28 bits and the XOR of global history and branch address on 28 bits) and their percentage relative to all dynamic branches.
SPEC2000    Unbiased Dynamic  Unbiased Static  Polarization Rate (P) [%]                                     Unbiased Context
Benchmark   Branches          Branches         [0.5, 0.6)  [0.6, 0.7)  [0.7, 0.8)  [0.8, 0.9)  [0.9, 1.0]    Instances (P<0.95)
mcf          2149125   1.81%               18       39.26       29.37       15.73       13.46        2.17     2149107   1.81%
parser       3041691   3.56%              357       34.21       27.48       16.71       16.39        5.22     3041301   3.56%
bzip         2280240   5.35%               69       36.29       27.22       15.57       17.05        3.87     2280161   5.35%
gzip        10405726  14.55%               41       27.56       24.28       19.22       21.13        7.81    10405684  14.55%
twolf        4007152   5.67%               82       27.73       24.21       17.12       22.73        8.21     4007068   5.67%
Average      4376787   6.19%            113.4       33.01       26.51       16.87       18.15        5.45     4376664   6.19%
Table 18. Polarization rates on the XOR of 28 bits between global history and branch address evaluating only the branches unbiased for local history of 28 bits, global history of 28 bits and respectively the XOR of global history and branch address on 24 bits.
Extending the feature set length from 24 bits to 28 bits decreased the percentage of unbiased dynamic branches from 7.31% (see Table 15) to 6.19% (see Table 18), on average. Despite the feature set extension, the percentage of unbiased dynamic branches remains high (6.19%); using longer feature sets alone is therefore not sufficient.
Figure 6. Reduction of average percentages of unbiased context instances (P<0.95) by extending the lengths of feature sets.
The global history solves, on average, 2.56% of the unbiased dynamic branches not solved with local history (see Figure 6). The hashing of global history with branch address (XOR) behaves just like the global history and does not improve the polarization rate further. Figure 6 also shows that increasing the branch history length decreases the percentage of unbiased dynamic branches, suggesting a correlation between branches situated at a large distance in the dynamic instruction stream. The results also show that the ultimate predictability limit of context-based prediction is approximately 94%. A conclusion based on our simulation methodology is that 94% of dynamic branches can be solved with prediction information of up to 28 bits (some of them are solved with 16 bits, others with 20, 24 or 28 bits).
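The successive filtering behind Figure 6 (local history, then global history, then XOR, each stage evaluated only on what the previous stage left unbiased) can be sketched as follows. This is a simplified illustration, not the simulator itself: it assumes that filtering is done per dynamic branch instance, that each trace record already carries its feature values, and that P is the larger of the taken and not-taken frequencies of a context. The record layout and the names `unbiased_indices` and `cascade` are hypothetical.

```python
from collections import defaultdict


def unbiased_indices(records, indices, feature, threshold=0.95):
    """Keep the dynamic branches whose (pc, feature) context has P < threshold."""
    counts = defaultdict(lambda: [0, 0])  # (pc, value) -> [not_taken, taken]
    for i in indices:
        r = records[i]
        counts[(r["pc"], r[feature])][int(r["taken"])] += 1

    kept = []
    for i in indices:
        r = records[i]
        not_taken, taken = counts[(r["pc"], r[feature])]
        if max(not_taken, taken) / (not_taken + taken) < threshold:
            kept.append(i)
    return kept


def cascade(records, features, threshold=0.95):
    """Apply the features in order; return the percentage of dynamic
    branches still unbiased after each stage."""
    if not records:
        return []
    indices = list(range(len(records)))
    remaining = []
    for feature in features:
        indices = unbiased_indices(records, indices, feature, threshold)
        remaining.append(100.0 * len(indices) / len(records))
    return remaining
```

A call such as `cascade(records, ["lh16", "gh16", "xor16", "lh20", "gh20", "xor20"])` would then yield one value per stage, under the assumption that the records hold those (hypothetically named) feature fields.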
Figure 7. Reduction of the percentage of unbiased branch context instances by each feature length extension.
Taking into account that increasing the prediction accuracy by 1% improves the IPC (instructions-per-cycle) by more than 1% (the growth is nonlinear), there are good chances of obtaining considerably better overall performance even if not all of the 6.19% of difficult-to-predict branches are solved. Therefore, we consider that this 6.19% represents a significant percentage of unbiased branch context instances and, at the same time, a good improvement potential in terms of prediction accuracy and IPC. By focusing on these unbiased branches and designing efficient path-based predictors for them [8], [13], the overall prediction accuracy should increase by a few percent, which would be quite remarkable.
The simulation results also lead to the conclusion that the longer the feature set used in the prediction process, the higher the branch polarization index and, hopefully, the prediction accuracy (Figure 7). A given large context (e.g. 100 bits), due to its better precision, has a lower occurrence probability than a smaller one, and a higher dispersion (the dispersion grows exponentially). Thus, very large contexts can significantly improve the branch polarization and the prediction accuracy too. However, they are not always feasible for hardware implementation. The question is: what feature set length is really feasible for hardware implementation and, more importantly, what is the solution for the branches that remain unbiased? In our opinion, a feasible solution could be given by path predictors. The path information could be a solution for relatively short contexts (low correlations). Our hypothesis is that short contexts used together with path information could replace significantly longer contexts while providing the same prediction accuracy. A common criticism of all the present two-level adaptive branch prediction schemes is that they use insufficient global correlation information [A]. There are situations in which, for the same static branch and the same global history context pattern, different targets can be found.
If each bit of the global history is associated during the prediction process with its corresponding PC, the context of the current branch becomes more precise, and therefore its prediction accuracy could be better. Our next goal is to extend the correlation information with the path, according to the above idea [A]. Extending the correlation information in this way means that, at different occurrences of a certain static branch with the same global branch context, the path contexts can be different. In our further work, we want to use the path information to increase the polarization rate, hopefully improving the prediction accuracy in this way. We started our evaluations regarding the path by studying the gain obtained by introducing paths of different lengths. The analyzed feature consists of a global branch history of 16 bits and the last p PCs. We applied this feature only to dynamic branches that we found unbiased (P<0.95) for local and global history of 16 bits and for the XOR of global history and branch address on 16 bits.

Benchmark   lh16->gh16->    lh16->gh16->    lh16->gh16->    lh16->gh16->    lh16->gh16->
            xor16           xor16->path1    xor16->path16   xor16->path20   xor16->lh20
bzip         23.40%          23.35%          22.16%          20.38%          15.24%
gzip         28.89%          28.88%          28.17%          27.51%          24.82%
mcf           3.28%           3.28%           3.28%           3.20%           2.66%
parser       12.95%          12.89%          12.01%          10.95%           9.18%
twolf        32.41%          32.41%          31.46%          27.10%          24.83%
gcc           3.91%           3.91%           3.56%           3.02%           2.26%
Average      17.47%          17.45%          16.77%          15.36%          13.17%
Gain                          0.02%           0.70%           2.11%           4.30%

Table 19. The gain introduced by the path of different lengths (1, 16, 20 PCs) versus the gain introduced by extended local history (20 bits).
The column lh16->gh16->xor16 presents the percentage of unbiased context instances for each benchmark. Columns lh16->gh16->xor16->path1, lh16->gh16->xor16->path16 and lh16->gh16->xor16->path20 present the percentages of unbiased context instances obtained using a global history of 16 bits and a path of 1, 16 and, respectively, 20 PCs.
The last column presents the percentages of unbiased context instances obtained by extending the local history to 20 bits (without path). The Gain row gives, for each feature, the reduction of the average relative to the first column. A path of 1 introduces an insignificant gain of 0.02%. Even a path of 20 introduces a gain of only 2.11%, compared with the more significant gain of 4.30% introduced by an extended local branch history of 20 bits. The results (Table 19) show that the path is useful only in the case of short contexts. Thus, a branch history of 16 bits compresses and approximates the path information well. In other words, a branch history of 16 bits distinguishes well between the different paths that lead to a certain dynamic branch.
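As an illustration of the feature analyzed in Table 19, the sketch below builds, for every dynamic branch, a context made of the global history and the last p PCs, and then recomputes the Gain row from the Average row. It rests on assumptions: the path is taken to be the PCs of the p most recent branches, the trace is a sequence of (pc, taken) pairs, and the names `contexts_with_path` and `gains` are hypothetical.

```python
from collections import deque


def contexts_with_path(trace, hist_bits=16, path_len=1):
    """Yield (pc, context, taken), where context pairs the global history
    register (hist_bits wide) with the PCs of the last path_len branches."""
    mask = (1 << hist_bits) - 1
    ghr = 0
    path = deque(maxlen=path_len)
    for pc, taken in trace:
        yield pc, (ghr, tuple(path)), taken
        ghr = ((ghr << 1) | int(taken)) & mask  # shift in the new outcome
        path.append(pc)                         # oldest PC drops out


# Average row of Table 19 (percentage of unbiased context instances).
AVERAGE = {"xor16": 17.47, "path1": 17.45, "path16": 16.77,
           "path20": 15.36, "lh20": 13.17}


def gains(average, baseline="xor16"):
    """Gain of each feature = baseline average minus the feature's average."""
    return {name: round(average[baseline] - value, 2)
            for name, value in average.items() if name != baseline}
```

Here `gains(AVERAGE)` gives 0.02, 0.70, 2.11 and 4.30 percentage points for path1, path16, path20 and lh20, matching the Gain row.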
Benchmark       p=1      p=4      p=8     p=12     p=16
bzip         58.54%   39.00%   37.24%   35.08%   32.41%
gzip         49.85%   45.93%   43.58%   35.67%   34.10%
mcf          27.85%   21.30%    6.38%    5.89%    6.35%
parser       57.75%   44.64%   36.37%   30.63%   27.25%
twolf        67.49%   59.07%   51.28%   43.51%   37.12%
gcc          34.17%   26.34%   17.65%   12.61%    9.51%
Average      49.28%   39.38%   32.08%   27.23%   24.46%

Table 20. The percentages of unbiased context instances using only the global history of p bits.
Benchmark       p=1      p=4      p=8     p=12     p=16
bzip         38.99%   36.93%   34.41%   32.16%   30.15%
gzip         48.53%   44.81%   42.20%   34.45%   33.21%
mcf          26.01%   20.98%    6.23%    5.85%    6.48%
parser       48.42%   39.50%   32.13%   27.48%   24.66%
twolf        62.65%   55.68%   49.47%   42.60%   35.81%
gcc          28.51%   20.42%   13.84%   10.53%    8.44%
Average      42.19%   36.39%   29.71%   25.51%   23.13%

Table 21. The percentages of unbiased context instances using as feature the global history of p bits together with the path of p PCs.
In the case of the mcf benchmark, we obtained a higher percentage of unbiased context instances when we extended the correlation information (Table 21) from 12 bits of global history and 12 PCs (p=12) to 16 bits of global history and 16 PCs (p=16). This growth is possible because a certain biased context (P≥0.95) is split, through extension, into several longer contexts, and some of these longer contexts can be unbiased (P<0.95), thus increasing the number of unbiased branches.
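A small numeric example may make this effect concrete. The counts below are invented purely for illustration (they are not taken from mcf), and P is assumed to be the larger of the taken and not-taken frequencies of a context; the helper names are hypothetical.

```python
def polarization(taken, not_taken):
    """Assumed polarization index: frequency of the dominant outcome."""
    return max(taken, not_taken) / (taken + not_taken)


def unbiased_instances(contexts, threshold=0.95):
    """Count the dynamic branches that fall in contexts with P < threshold."""
    return sum(t + n for t, n in contexts if polarization(t, n) < threshold)


# One short context seen 100 times: 95 taken, 5 not taken, so P = 0.95 (biased).
short_context = [(95, 5)]

# The same 100 instances after extending the context, which splits it in two:
#   (60, 0) has P = 1.0   and stays biased,
#   (35, 5) has P = 0.875 and becomes unbiased.
extended_contexts = [(60, 0), (35, 5)]
```

With these hypothetical counts, `unbiased_instances(short_context)` is 0 while `unbiased_instances(extended_contexts)` is 40: the five not-taken outcomes, negligible in the short context, are concentrated in one of the longer contexts and pull it below the 0.95 limit.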
[Figure 8 plot: horizontal axis Context Length (p=1, 4, 8, 12, 16); vertical axis from 20% to 55%.]
Figure 8. The gain introduced by the path for different context lengths.
As can be observed in Figure 8, the path brings an important gain in the case of short contexts (p<16). A branch history longer than 16 bits compresses the path information well, and therefore, in these cases, the gain introduced by the path is not significant.
Conclusions
The simulations show that the path is relevant for a better polarization rate and prediction accuracy only in the case of short contexts. In our further work, we can try to reduce the path information by extracting and using only its most important bits. Thus, the path information could be built using only a part of the branch address instead of all 32 bits of the complete PC. We also want to analyze other correlation information: we want to study whether there is some correlation between branch behavior and some important registers (e.g. the stack pointer).
We also want to study some longer contexts. One of them could be a concatenation of the local history with the global history. These new contexts, being longer than the previously studied ones, have higher precision, higher dispersion and, therefore, lower occurrence probability. Thus, for a context of 64 bits (32 bits of local history concatenated with 32 bits of global history), we expect to obtain considerably higher polarization rates and, as a consequence, better prediction accuracies. For simulations that use these longer contexts we need computers with more memory than we have at this time.
The next stage of the work will consist in exploiting the information regarding the branch polarization. Thus, we can pre-train a perceptron only with dynamic branches whose polarization index is greater than 0.95, avoiding in this way the contradictory behavior of the unbiased branches, which is difficult to learn. By pre-training the perceptron with the biased branches, we expect to obtain a higher prediction accuracy and a better overall performance (IPC) than with the original perceptron.
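The pre-training idea could look roughly like the sketch below. It is an assumption-laden illustration, not an evaluated design: the perceptron follows the general update rule and threshold of the perceptron predictor in [5], branches in contexts with P of at least 0.95 are treated as biased (the complement of the P<0.95 criterion used throughout), and the sample layout and the names `biased_only` and `train_perceptron` are hypothetical.

```python
def biased_only(samples, threshold=0.95):
    """samples: (history, taken, p) triples, p being the polarization index
    of the context the branch instance belongs to. Keep biased ones only."""
    return [(history, taken) for history, taken, p in samples if p >= threshold]


def train_perceptron(samples, hist_len, theta=None):
    """Train a single perceptron on (history, taken) pairs.

    history is a tuple of hist_len values, +1 for taken and -1 for not taken.
    Returns the weights, w[0] being the bias weight.
    """
    if theta is None:
        theta = int(1.93 * hist_len + 14)  # training threshold, as in [5]
    w = [0] * (hist_len + 1)
    for history, taken in samples:
        y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], history))
        t = 1 if taken else -1
        # Update on a misprediction or when the output is not confident enough.
        if (y >= 0) != taken or abs(y) <= theta:
            w[0] += t
            for i, xi in enumerate(history, start=1):
                w[i] += t * xi
    return w
```

The weights returned by `train_perceptron(biased_only(samples), hist_len)` would serve as the starting point of the predictor, which is then exposed to all branches during normal operation.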
References
[1] [E] Chaver D., Pinuel L., Prieto M., Tirado F., Huang M., Branch Prediction On Demand: an Energy-Efficient Solution, ISLPED'03, August 25-27, 2003, Seoul, Korea.
[2] [F] Jiménez D., Fast Path-Based Neural Branch Prediction, Proceedings of the 36th Annual International Symposium on Microarchitecture, December 2003.
[3] Jiménez D., Improved Latency and Accuracy for Neural Branch Prediction, ACM Transactions on Computer Systems (TOCS), Vol. 23, No. 2, May 2005.
[4] Jiménez D., Piecewise Linear Branch Prediction, Proceedings of the 32nd International Symposium on Computer Architecture (ISCA-32), June 2005.
[5] [C] Jiménez D., Lin C., Neural Methods for Dynamic Branch Prediction, ACM Transactions on Computer Systems, Vol. 20, No. 4, November 2002.
[6] Loh G. H., Jiménez D., A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction, Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2005.
[7] [B] Loh G. H., Jiménez D., Reducing the Power and Complexity of Path-Based Neural Branch Prediction, 5th Workshop on Complexity Effective Design (WCED5), June 2005.
[8] Nair R., Dynamic Path-Based Branch Correlation, IEEE Proceedings of MICRO-28, 1995.
[9] [D] Seznec A., Felix S., Krishnan V., Sazeides Y., Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor, Proceedings of the 29th International Symposium on Computer Architecture, Anchorage, AK, USA, May 2002.
[10] SimpleScalar, The SimpleSim Tool Set, ftp://ftp.cs.wisc.edu/pub/sohi/Code/simplescalar.
[11] SPEC, The SPEC benchmark programs, http://www.spec.org.
[12] Tarjan D., Skadron K., Merging Path and Gshare Indexing in Perceptron Branch Prediction, ACM Transactions on Architecture and Code Optimization, Vol. 2, No. 3, September 2005.
[13] [A] Vintan L., Egan C., Extending Correlation in Branch Prediction Schemes, International Euromicro'99 Conference, Italy, September 1999.
[14] Yeh T.-Y., Patt Y. N., A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History, Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, May 1993.