
Power Optimization

Rajaram Sivasubramanian
Associate Professor
ECE Department
Thiagarajar College of Engineering, Madurai-15

Courtesy: Prof. Bushnell


Motivation

• Conventional automated layout synthesis method:


– Describe design at RTL or higher level
– Generate technology-independent realization
– Map logic-level circuit to technology library
• Optimization goal: shifting from low-area to low-power and higher performance
• Need accurate signal probability/activity estimates
• Consider low-power needs at all design levels
Behavioral-Level Transformations
Algorithm-Level Power Reductions vs. Other Levels
Logic/Circuit Synthesis for Low Power
Logic-Level Optimizations
Design Flow

• Behavioral Synthesis – not used very much


• Initial design description: RTL or Logic level
• Logic synthesis widely used
• FSMs:
– State Assignment – opportunity for power saving
– Logic Synthesis – look for common subfunctions – opportunity
for power saving
• Custom VLSI design – size transistors to optimize for power, area, and
delay
• Library-based design – technology mapping used to map design into library elements
FSM and Combinational Logic
Synthesis
• Consider likelihood of state transitions during state
assignment
– Minimize # signal transitions on present state inputs V
• Consider signal activity when selecting the best common sub-expression to pull out during multi-level logic synthesis
– Factor the highest-activity common sub-expression out of all affected expressions
Huffman FSM Representation
Introduction to our technique

• Terminology
– Literal: A variable or a constant, e.g. a, b, 2, 3.14
– Cube: Product of literals, e.g. +3a²b, -2a³b²c
– SOP: Sum of cubes, e.g. +3a²b – 2a³b²c
– Cube-free expression: No literal or cube can divide all the cubes of the expression
– Kernel: A cube-free sub-expression of an expression, e.g. 3 – 2abc
– Co-kernel: A cube that is used to divide an expression to get a kernel, e.g. a²b
Kernels and Kernel
Intersections
DEFINITION:
An expression is cube-free if no cube divides the expression evenly (i.e. there is no literal that is common to all the cubes).
ab + c is cube-free
ab + ac and abc are not cube-free

Note: a cube-free expression must have more than one cube.

DEFINITION:
The primary divisors of an expression F are the set of expressions
D(F) = {F/c | c is a cube}.

Kernels and Kernel
Intersections
DEFINITION:
The kernels of an expression F are the set of expressions
K(F) = {G | G ∈ D(F) and G is cube-free}.

In other words, the kernels of an expression F are the cube-free primary divisors of F.

DEFINITION:
A cube c used to obtain the kernel K = F/c is called a co-kernel of K.

C(F) is used to denote the set of co-kernels of F.

Example:
x = adf + aef + bdf + bef + cdf + cef + g
= (a + b + c)(d + e)f + g

kernel              co-kernels
a+b+c               df, ef
d+e                 af, bf, cf
(a+b+c)(d+e)f+g     1

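As an illustrative sketch (plain Python, not the MIS implementation), cubes can be modeled as sets of single-letter literals; dividing by a candidate co-kernel cube and testing the quotient for cube-freeness reproduces the rows of the table above:

```python
def quotient(cubes, divisor):
    """F / c: the cubes of F that contain the divisor cube, with it removed."""
    return [c - divisor for c in cubes if divisor <= c]

def is_cube_free(cubes):
    """True if more than one cube remains and no literal divides them all."""
    return len(cubes) > 1 and not set.intersection(*cubes)

# x = adf + aef + bdf + bef + cdf + cef + g  (single-letter literals)
x = [set(c) for c in ("adf", "aef", "bdf", "bef", "cdf", "cef", "g")]

q = quotient(x, set("df"))                    # divide by candidate co-kernel df
print(sorted("".join(sorted(c)) for c in q))  # ['a', 'b', 'c'] -> kernel a+b+c
print(is_cube_free(q))                        # True: so df is a co-kernel
```

Trying `set("g")` as a divisor instead leaves a single cube, which cannot be a kernel.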
Kernels: Example
F = adf + aef + bdf + bef + cdf + cef + bfg + h
= (a+b+c)(d+e)f + bfg + h
cube   Primary divisor       Kernel?  Co-kernel?  Level
a      df+ef                 NO       NO          --
b      df+ef+fg              NO       NO          --
bf     d+e+g                 YES      YES         0
cf     d+e                   YES      YES         0
df     a+b+c                 YES      YES         0
fg     b                     NO       NO          --
f      (a+b+c)(d+e)+bg       YES      YES         1
1      F                     YES      YES         2
Kerneling Illustrated

abcd + abce + adfg + aefg + adbe + acdef + beg
= a((bc + fg)(d + e) + de(b + cf)) + beg

(Kerneling tree omitted: each branch of the tree divides out one literal; kernels encountered along the way include cd+g, d+e, ac+d+g, c+d, b+ef, b+df, b+cf, ce+g, c+e, and c(d+e) + de = d(c+e) + ce.)
Kerneling Illustrated (continued)

co-kernels   kernels
1            a((bc + fg)(d + e) + de(b + cf)) + beg
a            (bc + fg)(d + e) + de(b + cf)
ab           c(d+e) + de
abc          d+e
abd          c+e
abe          c+d
ac           b(d + e) + def
acd          b + ef

Note: F/bc = ad + ae = a(d + e), which is not cube-free
Probabilistic State Transition
Graphs (STGs)
• Edges showing state transitions not only indicate the input values causing transitions and the resulting outputs
• They also carry labels pij giving the conditional probability of the transition from state Si to Sj
– Given that the machine is in state Si
– Directly related to signal probabilities at primary inputs

• Introduce self-loops in STG for don't-care situations to transform an incompletely-specified machine into a completely-specified machine
Example
Relationship Between State
Assignment and Power
• Hamming distance between states Si and Sj:
– H (Si, Sj) = # bits in which the assignments differ
• Average Power:
– D (i) = signal activity at node i
– Approximate Ci with fanout factor at node i
• Average power proportional to:
Handling Present State Inputs
• Find the state transitions (Si, Sj) of highest probability
• Minimize H (Si, Sj) by changing the state assignments of Si, Sj
• Requires system simulation of the circuit over many clock periods, noting signal values and transitions
• If a one-hot design is used, note that H = 2 for all pairs of states
– Impossible to obtain optimum power reduction
– Uses too many flip-flops
• Optimization cost function:
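The cost-function formula itself did not survive in these notes; a commonly used form weights each transition's Hamming distance by its probability. A minimal sketch under that assumption (the state codes and probabilities below are made up for illustration):

```python
def hamming(a, b):
    """Number of bits in which two state codes differ."""
    return bin(a ^ b).count("1")

def expected_switching(codes, trans_prob):
    """Probability-weighted sum of Hamming distances over state transitions.
    codes: state -> binary code; trans_prob: (Si, Sj) -> transition probability."""
    return sum(p * hamming(codes[si], codes[sj])
               for (si, sj), p in trans_prob.items())

codes_1 = {"S0": 0b00, "S1": 0b11, "S2": 0b01}   # high-probability pair far apart
codes_2 = {"S0": 0b00, "S1": 0b01, "S2": 0b11}   # high-probability pair adjacent
probs = {("S0", "S1"): 0.6, ("S1", "S2"): 0.3, ("S2", "S0"): 0.1}

print(expected_switching(codes_1, probs))   # 1.6 expected bit flips per clock
print(expected_switching(codes_2, probs))   # 1.1 -- the better assignment
```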
Simulated Annealing Optimization
Algorithm
• Allowed moves:
– Interchange codes of two states
– Assign an unassigned code to a state that is randomly
picked for an exchange
• Accept move if it decreases g
• If move increases g, accept with probability:
e^(−|Δg| / Temp)
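A minimal sketch of this acceptance rule (the state-interchange moves themselves are omitted; the `rng` parameter is injectable only so the rule can be tested deterministically):

```python
import math
import random

def accept_move(delta_g, temp, rng=random.random):
    """Metropolis-style acceptance: always take moves that decrease the cost g,
    take worsening moves with probability e^(-|delta_g| / temp)."""
    if delta_g <= 0:          # move decreases g: always accept
        return True
    return rng() < math.exp(-abs(delta_g) / temp)

print(accept_move(-2.0, 5.0))   # True: improving move is always accepted
```

At high temperature almost any move is accepted; as the temperature is lowered, worsening moves become increasingly unlikely.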
Example State Machine
State Assignments

• Coding 1 uses 15% more power than coding 2


Multi-Level Logic Optimization for Low Power
• Combinational logic is F (I, V)
– I = set of primary inputs
– V = present state inputs
• Need to estimate probabilities and activities of V inputs (same
as next state outputs but delayed one clock period) in order to
synthesize logic for minimum power
– Use methods of Chapter 3
• Randomly generate PI signals with probabilities and activities
conforming to a given distribution
– Get D (vj) = transition activity at input vj (transitions / clock
period)
– Get from fast state transition diagram simulation
Power-Driven Multi-Level Logic Optimization
• Use the Berkeley MIS tool
– Takes a set of Boolean functions as input
– Procedure kernel finds all cube-free multiple-cube or single-cube divisors of each Boolean function
– Retains all common divisors
– Factors out the best few common divisors
– Substitution procedure simplifies the original functions to use the factored-out divisor
• Original criterion for selecting a common divisor:
– Chip area saving
• New criterion: power saving
Boolean Expression Factoring
• g = g(u1, u2, …, uK), K ≥ 1, is the common sub-expression
• When g factored out of L functions, signal probabilities and
activities at all circuit nodes are unchanged
• Capacitances at output of driver gates u1, u2, …, uK change
• Each drives L-1 fewer gates than before
• Reduced power:

• D (x) = activity at node x


• nuk = # gates belonging to node g and driven by uK
Factoring (continued)
• Only one copy now of g instead of L copies
– L-1 fewer copies of internal nodes v1, v2, …, vm in the factored-out hardware for switching and dissipating power
• Power saving:

• Total power saving:


Factoring (concluded)

• T (g) = # literals in factored form of g


• Area saving:
• Net saving of power and area:
Optimization Cost Criteria
The accepted optimization criteria for multi-level logic are to minimize some function of:
1. Area occupied by the logic gates and interconnect (approximated by literals = transistors in technology-independent optimization)
2. Critical path delay of the longest path through the logic
3. Degree of testability of the circuit, measured in terms of the percentage of faults covered by a specified set of test vectors for an approximate fault model (e.g. single or multiple stuck-at faults)
4. Power consumed by the logic gates
5. Noise Immunity
6. Wireability
while simultaneously satisfying upper or lower bound constraints placed on
these physical quantities

1) Kernel Extraction : Consider the Boolean Function
F = uvy + vwy + xy + uz + vz.
• Identify all co-kernel/kernel pairs of F. State their levels.

2) Weak Algebraic Division (10 points):


We have studied an algorithm to perform weak algebraic division in class that can be used to decompose a function F algebraically as F (dividend) = G (divisor) · H (quotient) + R (remainder). Divide F = ab + ac + ad′ + bc + bd′ by the following:
• G = a + b. What is the quotient and the remainder?
• G = c + d′. What is the quotient and the remainder?
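For checking answers, here is a sketch of weak algebraic division (a straightforward rendering of the standard algorithm, not necessarily the exact one presented in class): divide F by each cube of G, intersect the partial quotients to get the quotient H, and whatever of F is not covered by G·H is the remainder R.

```python
def weak_divide(F, G):
    """Weak algebraic division F = G*Q + R.
    F, G are lists of cubes; each cube is a frozenset of literals."""
    partial = []
    for g in G:                       # quotient of F by each cube of G
        partial.append({frozenset(c - g) for c in F if g <= c})
    Q = set.intersection(*partial)    # cubes common to all partial quotients
    GQ = {frozenset(g | q) for g in G for q in Q}
    R = [c for c in F if c not in GQ]  # cubes of F not covered by G*Q
    return Q, R

# F = ab + ac + ad' + bc + bd'   (d' written as "D")
F = [frozenset(c) for c in ({"a","b"}, {"a","c"}, {"a","D"}, {"b","c"}, {"b","D"})]

Q1, R1 = weak_divide(F, [frozenset("a"), frozenset("b")])   # G = a + b
print(Q1, R1)   # quotient c + d', remainder ab
Q2, R2 = weak_divide(F, [frozenset("c"), frozenset("D")])   # G = c + d'
print(Q2, R2)   # quotient a + b,  remainder ab
```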
Optimization Algorithm
Optimization Algorithm
(concluded)
Example Unoptimized Circuit
Optimization for Area Alone
Optimization for Low-Power Alone
• Large area but reduces power from 476.12 to 423.12
Example Signal Probabilities
Propagating Combinational
Signal Activities
Results

• On the MCNC Benchmarks:


• Two-stage process
– State assignment problem
– Multi-level combinational logic synthesis based on power dissipation and area reduction
• Result:
– 25% reduction in power
– 5% increase in area
Technology Mapping for Low
Power
• Problem statement:
– Given a Boolean network optimized in a technology-independent way and a target library, bind network nodes to library gates to optimize a given cost
• Method:
– Decompose circuit into trees
– Use dynamic programming to cover trees
– Cost function:

– Traverse tree once from leaves to root


Extension for Low-Power Design
• Power dissipation estimate:

• Estimate partial power consumption of intermediate solutions


• Cost function:

• MinPower (ni) is minimum power cost for input pin ni of g


• power(g) = 0.5 · f · VDD² · ai · Ci
• Formulation:
– R = Total Area, w gives their relative importance
– f = frequency, T = circuit delay
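The power term above can be evaluated directly; a small sketch (the 100 MHz / 1.2 V / 10 fF / 0.1 operating point is an assumed example, not from the slides):

```python
def gate_power(f, vdd, a_i, c_i):
    """Dynamic power of a gate: power(g) = 0.5 * f * Vdd^2 * a_i * C_i."""
    return 0.5 * f * vdd ** 2 * a_i * c_i

# 100 MHz clock, 1.2 V supply, activity 0.1 transitions/cycle, 10 fF load
p = gate_power(100e6, 1.2, 0.1, 10e-15)
print(p)   # 7.2e-08 W, i.e. 72 nW
```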
Top-Level Mapping Algorithm

• Overall process:
– From tree leaves to root, compute trade-off curves for matching gates from the library
– From root to leaves:
• Select the minimum-cost solution
• Reduces average power by 22% while keeping the same delay
– Sometimes increases area by as much as 39%
Circuit-Level Optimizations
Algorithm Components

1. Find which gate to examine next


2. Use a set of transformations for the gate
3. Compute overall power improvement due to
transformations
4. Update the circuit after each transformation
Gate Delay Model
• For every input terminal Ii and output terminal Oj of
every gate:
– Ti,j (G) – fanout-load-independent delay (intrinsic)
– Ri,j (G) – additional delay per unit fanout load
• Total gate propagation delay from input to output:
– Normalize all activities dy by dividing them by the clock activity (2f)
• Probability of rising or falling transition at y:
CMOS Gate Usage

• Deep sub-micron technology:
• The ratio of NAND/NOR delay to inverter delay lessens in deep sub-micron technology
– Series-transistor Vds and Vgs are smaller than those of an inverter transistor
• Encourages wider use of complex CMOS gates
• Important to order series transistors correctly
– Delay varies by 20%
– Power varies by 10%
CMOS Gate Power Consumption
• For series-connected transistors, the signal with lower activity should be on the transistor closest to the power supply rail
Calculating Transition Probability

• Hard to find pzi


– Hard to determine prior state of internal circuit nodes
– Assume that when state cannot be determined, a transition
occurred (upper power limit)
– More accurate bound: Observe that # conducting paths from
node to Vdd must change from 0 to > 0 followed by similar change
in # conducting paths to Vss
• Use # conducting paths that is smaller

– Use serial-parallel graph edge reduction techniques
Transistor Reordering

• Already know the delay of the longest path through each gate input, from a static timing analyzer
• Should (for NAND or NOR) connect latest arriving
signal to input with smallest delay
– Break gate inputs into permutable sets and swap inputs
– Hard to compute which input order is best – can afford
to enumerate all possible orderings and try them
• Compute prob. (signal is switching while all other signals in
permutable set are on) – gives maximum internal node C
charging / discharging
Optimization Algorithm
• Try to meet circuit performance goal (do forwards and then
backwards graph traversal)
• During backwards traversal:
– If a gate delay is larger than specified delay, reorder inputs to
decrease delay
– End up with valid backwards delays for gates, but not valid
forward delays
• Repeat forward traversal if input reordering was done
– Continue reordering inputs if gate path delay specification is
exceeded
• Continue alternating forward/backwards traversals until no
more reorderings happen, then proceed to power minimization
Power Minimization

• Repeat alternating forward and backward traversals


• Change: Determine delay increase for input order
corresponding to least estimated power dissipation
– If increase less than available path slack, reorder inputs
• Available slack: difference between:
1. Larger of maximum acceptable delay and longest path delay
2. Delay of longest path through gate
– Results on MCNC benchmarks – reduced power by 7 to 8 %,
with no critical path delay increase, and very little area
penalty
Zero Slack Algorithm

• Arrival times are computed in the forward direction, from primary inputs to primary outputs.
• Take the maximum arrival time over a gate's inputs: only when all inputs have arrived can the gate generate its output.
• Required times are computed in the backward direction, from primary outputs to primary inputs.
• Take the minimum required time: it says how soon a gate's output must reach the input of the next gate so that the signal propagates to the next level on time.
• Slack = required time − arrival time.
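The forward/backward passes can be sketched on a toy netlist (the gate names and delays below are invented for illustration; slack is computed as required time minus arrival time, so zero-slack nodes lie on the critical path):

```python
# Gates in topological order; each maps to (input signals, gate delay).
netlist = {
    "g1": (["a", "b"], 2),
    "g2": (["b", "c"], 3),
    "g3": (["g1", "g2"], 2),   # primary output
}

arrival = {"a": 0, "b": 0, "c": 0}
for g, (ins, d) in netlist.items():           # forward: max input arrival + delay
    arrival[g] = max(arrival[i] for i in ins) + d

required = {"g3": arrival["g3"]}              # backward from the primary output
for g, (ins, d) in reversed(list(netlist.items())):
    for i in ins:
        required[i] = min(required.get(i, float("inf")), required[g] - d)

slack = {n: required[n] - arrival[n] for n in required}
print(slack)   # zero slack on the critical path b/c -> g2 -> g3
```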
Transistor Reordering
 Logically equivalent CMOS gates may not have identical energy/delay characteristics

y = (a1 + a2) · b

(Schematics omitted: four logically equivalent transistor-level implementations A, B, C, D of this gate, differing in the ordering of the series and parallel transistors a1, a2, b.)

Micro transductors '08, Low Power
Transistor Reordering cont'd

Activity (transitions / s)            Normalized Pdyn: (A)   (B)   (C)   (D)   max. savings
(1) Aa1 = 10 K, Aa2 = 100 K, Ab = 1 M                  0.81  0.84  0.98  1.0   19%
(2) Aa1 = 1 M, Aa2 = 100 K, Ab = 10 K                  0.58  0.53  0.53  0.48  10%

 For a given logic function and activity: the signal with the highest activity should be closest to the output, to reduce charging/discharging of internal nodes
Transition Probabilities for CMOS Gates
Example: Static 2-Input NOR Gate

If A and B have the same input signal probability:
PA=1 = 1/2
PB=1 = 1/2

Truth table of the NOR2 gate:
A B | Out
1 1 | 0
0 1 | 0
1 0 | 0
0 0 | 1

Then:
POut=0 = 3/4
POut=1 = 1/4

P0→1 = POut=0 · POut=1 = 3/4 · 1/4 = 3/16

Ceff = P0→1 · CL = 3/16 · CL
Transition Probabilities cont’d
 A and B with different input signal probability:
 PA and PB : Probability that input is 1
 P1 : Probability that output is 1

 Switching activity in CMOS circuits: P01 = P0 * P1


 For 2-Input NOR: P1 = (1-PA)(1-PB)
 Thus: P01 = (1-P1)*P1 = [1-(1-PA)(1-PB)]*[(1-PA)][1-PB]

P01 = Pout=0 * Pout=1


NOR (1 - (1 - PA)(1 - PB)) * (1 - PA)(1 - PB)
OR (1 - PA)(1 - PB) * (1 - (1 - PA)(1 - PB))
NAND PAPB * (1 - PAPB)
AND (1 - PAPB) * PAPB

XOR (1 - (PA + PB- 2PAPB)) * (PA + PB- 2PAPB)
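These table entries translate directly into code; a quick sketch for spatially and temporally independent inputs:

```python
def p01(p_out1):
    """Switching activity P0->1 = Pout=0 * Pout=1 of a static gate output."""
    return (1 - p_out1) * p_out1

def nor2(pa, pb):  return p01((1 - pa) * (1 - pb))
def nand2(pa, pb): return p01(1 - pa * pb)
def and2(pa, pb):  return p01(pa * pb)
def xor2(pa, pb):  return p01(pa + pb - 2 * pa * pb)

print(nor2(0.5, 0.5))   # 0.1875 = 3/16, matching the NOR2 example above
```

Note that a gate and its complement (NAND/AND, NOR/OR) have the same switching activity, since P0→1 is symmetric in Pout=0 and Pout=1.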


Logic Restructuring
 Logic restructuring: changing the topology of a logic
network to reduce transitions

AND: P01 = P0 * P1 = (1 - PAPB) * PAPB


3/16
0.5 A Y
0.5 (1-0.25)*0.25 = 3/16
A W 7/64 = 0.109 0.5 B 15/256
B X F
15/256 0.5
0.5 C C
0.5 D F
0.5 0.5 D Z
3/16 = 0.188
Chain implementation has a lower overall switching activity than tree
implementation for random inputs
 BUT: Ignores glitching effects
Source: Timmernann, 2007
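The node activities quoted above can be checked in a few lines (a sketch assuming independent inputs, each 1 with probability 0.5):

```python
def and_p01(pa, pb):
    """P0->1 of an AND-gate output with independent inputs."""
    p1 = pa * pb
    return (1 - p1) * p1

p = 0.5
# Chain F = ((A.B).C).D: activity at each successive node
chain = [and_p01(p, p), and_p01(p * p, p), and_p01(p ** 3, p)]
# Tree  F = (A.B).(C.D): two first-level nodes, then the root
tree = [and_p01(p, p), and_p01(p, p), and_p01(p * p, p * p)]

print(sum(chain[:2]), sum(tree[:2]))  # internal activity: 0.296875 < 0.375
```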
Input Ordering
With AND gates (P01 = (1 - PA·PB) * PA·PB) and PA = 0.5, PB = 0.2, PC = 0.1:
• Ordering X = A·B, F = X·C: activity at X is (1 - 0.5·0.2)·(0.5·0.2) = 0.09
• Ordering X = B·C, F = X·A: activity at X is (1 - 0.2·0.1)·(0.2·0.1) = 0.0196

 Beneficial: postponing the introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Transistor Resizing Methods

• Datta, Nag & Roy: resized transistors on critical paths to reduce power and shorten delay
• Wider transistors speed up the critical path and reduce power because you get sharper edges, and therefore less short-circuit power dissipation
– Penalty – larger transistors increase node C, which can increase delay and power
– Increased drive for the present block, and greater transition time for the preceding block (due to the larger load CL), may increase the present block's short-circuit current
– A simulated annealing algorithm tries to optimize gates on the N most critical paths
Transistor Sizing for Power Minimization

(Trade-off diagram omitted: small W's give lower capacitance but require a higher voltage; large W's give higher capacitance but allow a lower voltage to keep performance.)

• Larger-sized devices: only useful when interconnects dominate
• Minimum-sized devices: usually optimal for low power

Source: Timmernann, 2007


Transistor Sizing
• Optimum transistor sizing

• The first stage drives the gate capacitance of the second stage plus the parasitic capacitance
• The input gate capacitance of both stages is given by N·Cref, where Cref represents the gate capacitance of a MOS device with the smallest allowable (W/L)
Transistor Sizing
• When there is no parasitic capacitance contribution (i.e., α = 0), the
energy increases linearly with respect to N and the solution of utilizing
devices with the smallest (W/L) ratios results in the lowest power.
• At high values of α, when parasitic capacitances begin to dominate the gate capacitances, the power decreases temporarily with increasing device sizes and then starts to increase, resulting in an optimal value for N.
• The initial decrease in supply voltage achieved from the reduction in delays more than compensates for the increase in capacitance due to increasing N.
• After some point the increase in capacitance dominates the achievable reduction in voltage, since the incremental speed increase with transistor sizing is very small.
• Minimum-sized devices should be used when the total load capacitance is not dominated by the interconnect.
Summary
• Logic-level multi-level logic optimization is effective
– State assignment
– Modified MIS algorithm
• Logic-level technology mapping
– Tree-covering algorithm is effective
• Circuit-level operations are effective
– Transistor input reordering
– Transistor resizing
Addition of Binary Numbers
Full Adder. The full adder is the fundamental building block
of most arithmetic circuits:
(Block diagram: a full adder with inputs ai, bi, Cin and outputs si, Cout.)
The sum and carry outputs are described as:
si = ai′bi′ci + ai′bi ci′ + ai bi′ci′ + ai bi ci
ci+1 = ai bi ci′ + ai bi′ci + ai′bi ci + ai bi ci = ai bi + ai ci + bi ci
Oklobdzija 2004 Computer Arithmetic 114
Addition of Binary Numbers
Inputs       Outputs
ci ai bi     si ci+1
0  0  0      0  0
0  0  1      1  0    Propagate
0  1  0      1  0    Propagate
0  1  1      0  1    Generate
1  0  0      1  0
1  0  1      0  1    Propagate
1  1  0      0  1    Propagate
1  1  1      1  1    Generate
Full-Adder Implementation
Full Adder operation is defined by the equations:

si = ai′bi′ci + ai′bi ci′ + ai bi′ci′ + ai bi ci = ai ⊕ bi ⊕ ci = pi ⊕ ci
ci+1 = ai bi + (ai ⊕ bi) ci = gi + pi ci

Carry-Propagate: pi = ai ⊕ bi
Carry-Generate: gi = ai bi

A one-bit adder could be implemented as shown (gate-level schematic with inputs ai, bi, cin and outputs si, cout omitted).
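A behavioral sketch of these equations in bit-level Python (not a gate netlist), checked exhaustively against integer addition:

```python
def full_adder(a, b, c):
    """One-bit full adder in propagate/generate form."""
    p = a ^ b            # carry-propagate: p = a XOR b
    g = a & b            # carry-generate:  g = a AND b
    s = p ^ c            # sum:             s = p XOR c
    cout = g | (p & c)   # carry-out:       c_out = g + p*c
    return s, cout

# Exhaustive check against ordinary integer addition
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert 2 * cout + s == a + b + c
print("ok")
```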
High-Speed Addition

ci+1 = gi + pi ci
gi = ai bi,  pi = ai ⊕ bi
si = pi ⊕ ci

A one-bit adder could be implemented more efficiently with the carry-out selected by a multiplexer (schematic omitted), because a MUX is faster.


Array Multipliers with Lower Power Consumption
Fig. 26.5: An array multiplier with gated FA cells (a 5×5 array forming product bits p0–p9 from operand bits a4…a0 and multiplier bits x0…x4, with carry and sum chains and zero boundary inputs; array schematic omitted).


New and Emerging Methods
Dual-rail data encoding with transition signaling:
• Two wires per signal
• A transition on wire 0 (1) indicates the arrival of a 0 (1)
• Dual-rail design does increase the wiring density, but it offers the advantage of complete insensitivity to delays

(Figure omitted: part of an asynchronous chain of computations — arithmetic circuits paired with local control, exchanging Data / Data-ready and Release handshake signals.)
The Ultimate in Low-Power Design

Some reversible logic gates:
(a) Toffoli gate (TG): P = A, Q = B, R = AB ⊕ C
(b) Fredkin gate (FRG): P = A, Q = A′B ⊕ AC, R = A′C ⊕ AB
(c) Feynman gate (FG): P = A, Q = A ⊕ B
(d) Peres gate (PG): P = A, Q = A ⊕ B, R = AB ⊕ C

(Figure omitted: a reversible binary full adder built of 5 Fredkin gates, with a single Feynman gate used to fan out the input B. The label "G" denotes "garbage.")
