Thesis 04
g_1(u_2) = √2 sin(2π u_2)    (4.2)

g_2(u_2) = √2 cos(2π u_2)    (4.3)
The products x_1 and x_2 will have Gaussian distributions with zero mean and σ = 1.
x_1 = f(u_1) g_1(u_2)    (4.4)

x_2 = f(u_1) g_2(u_2)    (4.5)
The HW implementation of these equations is straightforward, as shown in Figure 4.2. The challenge is how to implement the square root of the logarithm and the sine/cosine efficiently. An efficient solution is to employ hybrid look-up tables with non-uniform segmentation, as proposed in [25] and [52]. Nevertheless, the resulting circuit size is still large and it is not easy to make the design generic and scalable.
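As a software reference for the Box-Muller recipe above, the mapping can be sketched in a few lines of Python (an illustrative floating-point model, not the LUT-based hardware; the function name is ours):

```python
import math
import random

def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Map two uniform samples in (0, 1) to two independent N(0, 1) samples."""
    f = math.sqrt(-math.log(u1))                         # f(u1) = sqrt(-ln u1)
    g1 = math.sqrt(2.0) * math.sin(2.0 * math.pi * u2)   # g1(u2), eq. (4.2)
    g2 = math.sqrt(2.0) * math.cos(2.0 * math.pi * u2)   # g2(u2), eq. (4.3)
    return f * g1, f * g2                                # x1 (4.4) and x2 (4.5)

# sanity check: sample mean should be near 0 and sample variance near 1
samples = []
for _ in range(100_000):
    x1, x2 = box_muller(random.random() or 1e-12, random.random())
    samples += [x1, x2]
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples)
```

The `or 1e-12` guard only avoids log(0) for the half-open `random()` range.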
Figure 4.2.: Gaussian random generator using the Box-Muller method (two uniform generators produce u_1 and u_2; the outputs of f(u_1), g_1(u_2), and g_2(u_2) are multiplied to give x_1 and x_2)
4.2.2. The Use of the Central Limit Theorem
The central limit theorem offers a very simple way to generate random Gaussian noise. According to the central limit theorem, the distribution of a sum of independent random variables with arbitrary distributions converges to a normal distribution as the number of variables increases. If N independent variables have equal distributions with mean μ and standard deviation σ, the distribution of their sum will have

μ_N = N μ   and   σ_N = √N σ    (4.6)

A discrete uniform variable taking M equally probable values has the variance

σ_u² = (M² − 1) / 12    (4.7)
If the uniform variable has a width of B bits, the mean μ_u and the standard deviation σ_u have
68 Chapter 4 Multipath Channel Simulators
Bits    1     2^1     2^2     2^3     2^4     2^5     2^6     2^7     2^8     2^9     2^10
Width   1     2       3       4       5       6       7       8       9       10      11
Range   0…1   0…2^1   0…2^2   0…2^3   0…2^4   0…2^5   0…2^6   0…2^7   0…2^8   0…2^9   0…2^10
Mean    1/2   2^0     2^1     2^2     2^3     2^4     2^5     2^6     2^7     2^8     2^9
Stdev   1/2   1/√2    2^0     √2·2^0  2^1     √2·2^1  2^2     √2·2^2  2^3     √2·2^3  2^4

Table 4.1.: Properties of the sum of 2^N random binary variables
the following expressions:

μ_u = (2^B − 1) / 2    (4.8)

σ_u = √( (2^{2B} − 1) / 12 )    (4.9)
The challenge is how to design a multi-bit uniform generator and how to sum so many values, e.g. 256 for a decent accuracy. The complexity can be minimized if we sum 1-bit variables. The number of required adders remains the same, but their size will be much smaller. The problem now consists in how to generate so many independent binary variables in parallel.
Summing 1-bit uniform variables also has the advantage that both the mean and the standard deviation have simple expressions, which simplifies the hardware implementation. We hereafter denote by 2^N the number of variables to be summed, always a power of two. Table 4.1 shows various properties of the resulting sum: bitwidth, range, mean μ, and standard deviation σ. These results have been obtained by treating the binary variables as unsigned numbers and summing them accordingly. This results in a non-zero mean. When summing binary variables the mean is always a power of two, which is convenient for hardware implementations. In most practical applications, however, a zero-mean Gaussian distribution is desired. This can be achieved either by subtracting the known mean from the sum or by simply summing the binary variables as signed numbers. The latter requires no hardware and is obviously the preferred method. Figure 4.4 shows the resulting distribution (histogram) when summing 2^6 binary variables, both as signed and as unsigned. The horizontal axis shows the full range of a 4-bit number. The mean and the limits of the resulting sum are also shown on the horizontal axis. Sigma is 2^2 in both cases.
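The bit-summing scheme is easy to model in software. The following Python sketch (illustrative, not the LFSR-based hardware) sums 2^6 random bits and removes the known mean to obtain zero-mean samples:

```python
import random

N_BITS = 2 ** 6          # number of 1-bit uniform variables per sample
MEAN = N_BITS // 2       # known mean of the unsigned sum (Table 4.1: 2^5)

def gauss_sample(rng: random.Random) -> int:
    """Approximate a zero-mean Gaussian sample by summing 64 random bits."""
    s = sum(rng.getrandbits(1) for _ in range(N_BITS))
    return s - MEAN      # remove the mean; sigma is 2^2 = 4 (Table 4.1)

rng = random.Random(1)
samples = [gauss_sample(rng) for _ in range(50_000)]
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples)   # expect sigma^2 = 16
```

Subtracting the mean here plays the role of the signed summing described above.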
From both Table 4.1 and Figure 4.4 it can be observed that the range of the sum is nearly half of the full range. Another useful observation that can be derived from Table 4.1 is that the relative sigma, i.e. relative to the full range, decreases by √2 with every doubling of the number of summed variables.

Total number of elementary adders:

  Σ_{k=1}^{N−1} k · 2^{N−k}

Number of elementary adders in the critical path:

  Σ_{k=1}^{N−1} k = N(N − 1)/2
Table 4.2 shows the number of elementary adders (total and in the critical path) for N ranging
from 4 to 8. For N = 8 the depth reaches 28, which limits the operating frequency drastically.
Fortunately, the adder tree can be pipelined since there are no loops involved.
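The two adder-count expressions can be checked numerically (a small Python check; the function names are ours):

```python
def total_adders(n: int) -> int:
    """Total elementary (full-adder) count: sum over tree levels k of k * 2**(n-k)."""
    return sum(k * 2 ** (n - k) for k in range(1, n))

def critical_path(n: int) -> int:
    """Elementary adders on the critical path: 1 + 2 + ... + (n-1) = n(n-1)/2."""
    return n * (n - 1) // 2

counts = {n: (total_adders(n), critical_path(n)) for n in range(4, 9)}
# counts[8] reproduces the 494 total adders and depth 28 quoted in the text,
# counts[6] the 114 and 15 of the accumulator-based variant discussed below
```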
The regular structure of the adder tree makes the realization of a scalable design possible.
As a proof of concept we have created a completely generic adder tree design in VHDL. The
parameters are the number and the width of the input operands, and wether they are treated as
signed or unsigned. The number of inputs is not constrained to a power of two. Moreover, the
pipelining is congurable through an additional generic parameter that indicates the number of
combinational adder stages between two registers, starting from the output. If this parameter
is zero, the resulting adder tree is purely combinational. The circuit relies on the recursive
instantiation feature in VHDL, which is currently supported by most of the synthesis tools.
In some applications, Gaussian noise samples need not be generated every clock cycle. An example is the white noise generator for the fading taps, where the sample rate before the interpolator is very low because of the very low Doppler spreads. This situation can be exploited to reduce the number of adders for the same number of bits to be summed, or to sum more bits with the same adder. The solution is to use an accumulator at the output to sum consecutive samples. If we need to generate a sample every 4 clock cycles, the number of bits to be generated and summed in a clock cycle is reduced by a factor of 4. Now instead of 2^8 bits we will only need to generate 2^6 in parallel. According to Table 4.2, the total number of elementary adders decreases from 494 to only 114, while the number of adders in the critical path is reduced from 28 to 15, which saves area and increases the maximum frequency.
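The accumulator-based trade-off can be modeled in software as follows (an illustrative Python sketch, not the hardware itself):

```python
import random

BITS_PER_CYCLE = 2 ** 6      # parallel bits summed by the smaller adder tree
CYCLES = 4                   # sequential factor: cycles accumulated per output

def noise_sample(rng: random.Random) -> int:
    """One output sample: accumulate 4 partial sums of 64 bits = 256 bits."""
    acc = 0
    for _ in range(CYCLES):
        acc += sum(rng.getrandbits(1) for _ in range(BITS_PER_CYCLE))
    return acc - (BITS_PER_CYCLE * CYCLES) // 2   # remove the known mean

rng = random.Random(2)
xs = [noise_sample(rng) for _ in range(20_000)]
var = sum(x * x for x in xs) / len(xs)            # expect sigma^2 = 256/4 = 64
```

The statistics are those of a full 256-bit sum; only the per-cycle hardware shrinks.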
Figure 4.15 shows the distribution of the noise generated with the architecture presented above, that is, by summing 256 binary variables. In this figure, the probability is normalized to the probability of zero and the range is limited to [−4, 4]. The distribution is perfectly smooth, unlike the one obtained in [25] and [52] using the Box-Muller method.
Figure 4.14.: Schematic of a Gaussian noise generator with phase accumulator (four LFSR generators plus an extra LFSR feed a 64-input adder tree, followed by pipeline delay compensation and an output accumulator controlled by a timer)
FPGA resource     Sequential factor              Total in device
                  4        8        16
# Slices          283      189      114          960
# Slice FFs       448      311      244          1920
# 4-input LUTs    499      312      238          1920

Table 4.3.: Synthesis results for a Xilinx Spartan3E XC3S100E-5 FPGA
The proposed architecture is also completely scalable, unlike those proposed in [25] and [52], allowing Gaussian noise to be generated by summing any power-of-two number of binary variables. Moreover, it also allows trading off throughput for precision, which is very desirable for fading tap generators, where data rates are very low before interpolation. Table 4.3 shows the synthesis results for a Gaussian generator that sums 256 binary variables. This implementation uses 4 LFSRs with lengths 32, 33, 34, and 35. Synthesis results are shown for sequential factors of 4, 8, and 16. The sequential factor indicates the number of clock cycles needed to generate a noise sample. Lower sequential factors denote higher parallelism, which explains why they require more FPGA resources. The reported speed after synthesis only, for a grade-5 device, is about 190 MHz in all three cases.

Compared to the results in [52], our architecture requires far fewer resources, in addition to its excellent scalability. For the sake of comparison, we have also synthesized a parallel configuration
Figure 4.15.: Normalized probability and the absolute error when summing 256 binary variables (sample values span [−64, 64]; the absolute probability error stays within about ±2·10^−4)
with roughly the same performance. This configuration generates a sample each clock cycle and has 8 LFSRs with increasing lengths starting from 32, each with 32 parallel outputs, thus generating 256 random bits in parallel. The FPGA used is a Xilinx Virtex2 XC2V4000-6. Out of the 23040 slices available, only 880 are used, that is, about 3.8%. The maximum frequency reported after synthesis was 180 MHz. This compares very favorably with the reference implementation, which takes up 10% of the same FPGA and runs at 133 MHz post-synthesis.
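The LFSR-based uniform bit sources can be modeled as follows (a Python sketch; the tap set [32, 22, 2, 1] is a commonly used maximal-length choice and is our assumption, since the exact polynomials are not listed here):

```python
class LFSR:
    """Fibonacci LFSR; taps are 1-based bit positions, longest first."""

    def __init__(self, taps, seed):
        self.taps = taps
        self.state = seed          # must be non-zero and fit in taps[0] bits

    def step(self) -> int:
        """Shift once and return the feedback bit (one uniform binary sample)."""
        fb = 0
        for t in self.taps:
            fb ^= (self.state >> (t - 1)) & 1
        self.state = ((self.state << 1) | fb) & ((1 << self.taps[0]) - 1)
        return fb

# one generator; the noise source runs several such LFSRs in parallel
gen = LFSR([32, 22, 2, 1], seed=0xDEADBEEF)
bits = [gen.step() for _ in range(10_000)]
ones = sum(bits)   # should be close to 5000 for a balanced sequence
```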
4.3. Spectrum-Shaping Filter
The challenge is how to design a filter with the desired magnitude Y_d(ω). A FIR filter would require a very large number of taps, while an all-pole or auto-regressive filter, although relatively more efficient, would still require a high order to approximate the autocorrelation for large lags. Our approach uses an adaptation of the design method presented in [88] for designing IIR filters with an arbitrary impulse response.
According to this method, an IIR filter of order 2N is synthesized as a cascade of N second-order canonic sections (bi-quads). A general second-order filter is defined by five coefficients. Its transfer function has two poles and two zeros, which are usually complex conjugate pairs,
having the following transfer function:
H(z) = (b_0 + b_1 z^{−1} + b_2 z^{−2}) / (1 + a_1 z^{−1} + a_2 z^{−2})    (4.10)
The straightforward implementation of this equation leads to the configuration in Figure 4.16a, referred to as the direct form I or DF-I. The first stage implements the zeros (numerator), while the second one implements the poles (denominator), or the autoregressive part. If the two stages are swapped, a more hardware-efficient implementation is obtained, which saves two data registers, as shown in Figure 4.16b. This configuration is referred to as the direct form II or DF-II, and will be used in our implementation.
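As a software reference for a DF-II section (an illustrative floating-point model; the variable names are ours):

```python
def biquad_df2(x, b, a):
    """Direct form II biquad: b = (b0, b1, b2), a = (a1, a2).
    Implements H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2)
    with a single pair of delay registers (the DF-II register saving)."""
    b0, b1, b2 = b
    a1, a2 = a
    s1 = s2 = 0.0                       # the two shared state registers
    out = []
    for xn in x:
        w = xn - a1 * s1 - a2 * s2      # autoregressive (pole) part first
        out.append(b0 * w + b1 * s1 + b2 * s2)   # then the zeros
        s2, s1 = s1, w
    return out

# identity check: with b = (1, 0, 0) and a = (0, 0) the filter is a wire
y = biquad_df2([1.0, 2.0, 3.0], (1.0, 0.0, 0.0), (0.0, 0.0))
```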
Figure 4.16.: Second-order canonical filter section. (a) Direct form I; (b) direct form II
For N cascaded second-order sections, the transfer function is given by the following equation, in which the constant K has been obtained by factoring out the b_0 coefficients. In this way, each section is defined by four coefficients instead of five.

H(z) = K ∏_{n=1}^{N} (1 + b_{1,n} z^{−1} + b_{2,n} z^{−2}) / (1 + a_{1,n} z^{−1} + a_{2,n} z^{−2})    (4.11)
For z = e^{jω}, H(z) approaches the desired magnitude response Y_d(ω). The design of the filter is an optimization problem which consists in finding the set of 4N coefficients (a_{1,n}, a_{2,n}, b_{1,n}, b_{2,n}) that leads to the best approximation of Y_d(ω).
4.3.1. Filter Design Algorithm
The first step is to discretize Y_d(ω) by dividing the Nyquist interval [0, π] into M + 1 frequency points. We have that ω_i = iπ/M and Y_{d,i} = Y_d(ω_i), where i = 0, 1, . . . , M. We also define z_i = e^{jω_i}.
For a Jakes spectrum, given by (2.5), we define L = νM, where ν ∈ [0, 1] is the desired discrete Doppler rate. The discretized magnitude response is thus given by the following equation:

Y_{d,i} =
    1 / √(1 − (i/L)²)                    for i = 0, 1, . . . , L − 1
    √( L (π/2 − arcsin((L − 1)/L)) )     for i = L
    0                                    for i = L + 1, . . . , M      (4.12)
The response for i = L results from the requirement that the area under the spectrum be equal
for the sampled and the continuous cases.
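The discretized profile of (4.12) is straightforward to generate in software (a sketch; `jakes_profile` is our illustrative name):

```python
import math

def jakes_profile(M: int, nu: float) -> list[float]:
    """Discretized Jakes magnitude response on M+1 points of [0, pi].
    nu in (0, 1] is the discrete Doppler rate; L = nu * M."""
    L = int(round(nu * M))
    y = [0.0] * (M + 1)
    for i in range(L):                       # i = 0 .. L-1
        y[i] = 1.0 / math.sqrt(1.0 - (i / L) ** 2)
    # the sample at i = L keeps the area equal to that of the continuous spectrum
    y[L] = math.sqrt(L * (math.pi / 2 - math.asin((L - 1) / L)))
    return y                                 # i > L stays 0

y = jakes_profile(512, 0.25)                 # the case study below: M=512, 0.25
```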
We define the vector x of length 4N that contains the coefficients b_{1,n}, b_{2,n}, a_{1,n}, a_{2,n} and express H(z) = K F(z, x), where F(z, x) is the product of biquad transfer functions in (4.11), apart from K. Designing the filter now consists in minimizing the mean squared error (MSE):
E(K, x) = (1/(M + 1)) Σ_{i=0}^{M} ( |K F(z_i, x)| − Y_{d,i} )²    (4.13)
E(K, x) is a function of 4N + 1 variables. In order to reduce the problem order, we determine the value of K that minimizes E(K, x). Knowing that K is positive, differentiating E with respect to K and equating with zero yields the optimum value K_o:
K_o = ( Σ_{i=0}^{M} |F(z_i, x)| Y_{d,i} ) / ( Σ_{i=0}^{M} |F(z_i, x)|² )    (4.14)
The problem is now reduced to optimizing R(x) = E(K_o, x) in 4N dimensions. The gradient ∇_x R(x) is a vector with the partial derivatives of R(x) with respect to each element x_v of x, where v ∈ [1, 4N].
∂R(x)/∂x_v = (2 K_o/(M + 1)) Σ_{i=0}^{M} ( K_o |F(z_i, x)| − Y_{d,i} ) · ∂|F(z_i, x)|/∂x_v    (4.15)
Evaluating (4.15) for all frequencies i and biquad stages n requires the calculation of 4MN
partial derivatives:
∂|F(z_i, x)| / ∂a_{1,n} = −|F(z_i, x)| · Re( z_i^{−1} / (1 + a_{1,n} z_i^{−1} + a_{2,n} z_i^{−2}) )    (4.16)

∂|F(z_i, x)| / ∂a_{2,n} = −|F(z_i, x)| · Re( z_i^{−2} / (1 + a_{1,n} z_i^{−1} + a_{2,n} z_i^{−2}) )    (4.17)

∂|F(z_i, x)| / ∂b_{1,n} = +|F(z_i, x)| · Re( z_i^{−1} / (1 + b_{1,n} z_i^{−1} + b_{2,n} z_i^{−2}) )    (4.18)

∂|F(z_i, x)| / ∂b_{2,n} = +|F(z_i, x)| · Re( z_i^{−2} / (1 + b_{1,n} z_i^{−1} + b_{2,n} z_i^{−2}) )    (4.19)
We now have all the quantities needed for iterative optimization. The evaluation of the cost function and the gradient is performed according to the following steps:

1. Starting from a vector x, evaluate F(z_i, x) at all frequencies i ∈ [0, M].

2. Compute the optimum scaling factor K_o using (4.14).

3. Evaluate the error E_i at all frequencies i ∈ [0, M]:

   E_i = K_o |F(z_i, x)| − Y_{d,i}    (4.20)

4. Evaluate the squared error cost function R(x):

   R(x) = (1/(M + 1)) Σ_{i=0}^{M} E_i²    (4.21)

5. Determine the elements of the gradient vector ∇_x R(x). In the equation below, x_v is one of the a_{1,n}, a_{2,n}, b_{1,n}, b_{2,n} biquad coefficients, where v ∈ [1, 4N] and n ∈ [1, N].

   [∇_x R(x)]_v = (2 K_o/(M + 1)) Σ_{i=0}^{M} E_i · ∂|F(z_i, x)|/∂x_v    (4.22)
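Steps 1 to 4 can be sketched in Python as follows (an illustrative model; `F` and `cost` are our names, and the gradient step is omitted for brevity):

```python
import cmath

def F(z: complex, x: list) -> complex:
    """Product of biquad responses; x packs (b1, b2, a1, a2) per section."""
    h = 1.0 + 0j
    for n in range(0, len(x), 4):
        b1, b2, a1, a2 = x[n:n + 4]
        h *= (1 + b1 / z + b2 / z**2) / (1 + a1 / z + a2 / z**2)
    return h

def cost(x: list, yd: list) -> float:
    """Steps 1-4: evaluate |F|, the optimal K_o of (4.14), and the MSE (4.21)."""
    M = len(yd) - 1
    mags = [abs(F(cmath.exp(1j * cmath.pi * i / M), x)) for i in range(M + 1)]
    ko = sum(m * y for m, y in zip(mags, yd)) / sum(m * m for m in mags)
    return sum((ko * m - y) ** 2 for m, y in zip(mags, yd)) / (M + 1)

# a one-section all-pass (all coefficients zero) fitted to a flat target of 2:
# K_o becomes 2 and the residual error is zero
e = cost([0.0, 0.0, 0.0, 0.0], [2.0] * 65)
```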
Knowing how to evaluate the cost function and the gradient, we can use an iterative optimization technique to find an optimum coefficient set. We have selected the Ellipsoid algorithm because it is very simple to code and works well with highly non-linear target functions. The convergence is not as fast as in the case of a descent method, but since the filter design is done only once this is not a big issue.
4.3.2. A Case Study
In the following, we use the method outlined above to design spectrum-shaping filters for the three Doppler profiles introduced in Subsection 2.1.2: Jakes, flat, and Gaussian. The
generated magnitudes are normalized, in the sense that the magnitude is always 1 at DC. We have created a flexible function that generates discrete Doppler profiles, which accepts three parameters:

1. The number of subdivisions M of the Nyquist interval. The resulting Doppler profile will have M + 1 elements.

2. The Doppler profile type. It can be either jakes, flat, or gauss.

3. The Doppler frequency f_D. For the Jakes and flat profiles it is the maximum Doppler frequency f_Dmax, while for the Gaussian profiles it represents the Doppler spread σ_D.
For our scenario we used M = 512 frequency points, since this ensures a good trade-off between precision and computation time. The Doppler frequency f_D was chosen to be 0.25. The rationale is that we want low frequencies because they result in lower interpolation errors. However, if the frequency is too low, the constraints on the spectrum-shaping filter are increased.
The filter design algorithm accepts three parameters:

1. The discrete magnitude response at M + 1 frequencies. In this case it is the desired Doppler profile.

2. The number of biquad sections.

3. The target minimum squared error (MSE) for the resulting magnitude response.

In the following, we investigate the relationship between the desired MSE and the required number of biquad sections for the three Doppler profiles, for a discrete Doppler frequency of 0.25. For a number of biquads between 1 and 8 we determine the minimum achievable MSE.
The results, shown in Table 4.4, indicate that a Gaussian filter requires far fewer stages for a given MSE. The reason is that its magnitude response is very smooth, unlike the Jakes and flat filters, which exhibit a very sharp cut-off. The Gaussian filter achieves excellent performance with only two biquad stages, while for the other two at least five stages are required. That is why no MSE values are given for Gaussian filters with more than two stages. It is also worth mentioning that each extra stage reduces the MSE by a factor of approximately 10, i.e. by 10 dB.
Table 4.4 helps us to choose the appropriate number of stages depending on the specific requirements. To be on the safe side, we impose a maximum MSE of 10^−6. The appropriate filter sizes and the actual MSEs achieved in one of the optimizations are listed in Table 4.5. The MSE is given for the original 512 frequency points at which the filters were optimized.

Figure 4.17 shows the magnitude response of the three designed filters on logarithmic and linear magnitude axes. The logarithmic plots also display the error between the designed and the theoretical magnitude. Since these filters will now be used for both HW and SW implementations, we also list the designed coefficients in Appendix B for reference. The multiplicative constant K must be combined with a normalization factor that ensures a unity noise power gain, as derived below.
# SOS   Doppler type
        Jakes          Flat           Gaussian
1       7.45·10^−2     1.92·10^−2     1.21·10^−4
2       2.09·10^−2     3.65·10^−3     1.74·10^−9
3       2.66·10^−3     8.75·10^−4
4       3.03·10^−4     1.20·10^−4
5       3.65·10^−5     1.30·10^−5
6       4.40·10^−6     1.24·10^−6
7       5.10·10^−7     1.24·10^−6

Table 4.4.: Minimum achievable MSE vs. the number of second-order sections (SOS)
For a filter with impulse response g(n) and a white noise input of variance σ²_in, the output variance is

σ²_out = σ²_in Σ_{n=0}^{∞} g(n)²    (4.23)
If we know the magnitude response H(f) (continuous and finite), the power gain can also be expressed as in (4.24), where N + 1 is the number of frequencies at which the magnitude response is evaluated. For good precision, N must be high enough, e.g. at least 1024.

A_p = σ²_out / σ²_in = (1/(N + 1)) Σ_{i=0}^{N} |H(i/N)|²    (4.24)
Once the power gain A_p is determined, the normalization factor can be simply calculated as the inverse of the square root of A_p. For the three filters we designed, we merged the multiplicative constant K with the normalization factor, to end up with a single multiplication instead of two.
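This normalization step can be sketched numerically (illustrative Python, using the frequency-domain estimate of (4.24); the function name is ours):

```python
import cmath
import math

def power_gain(b, a, n_freq: int = 1024) -> float:
    """Estimate A_p as the mean of |H|^2 over n_freq+1 points of [0, pi]."""
    acc = 0.0
    for i in range(n_freq + 1):
        z = cmath.exp(1j * math.pi * i / n_freq)
        h = (b[0] + b[1] / z + b[2] / z**2) / (1 + a[0] / z + a[1] / z**2)
        acc += abs(h) ** 2
    return acc / (n_freq + 1)

# a pure gain of 3 has power gain 9; the normalization factor is 1/sqrt(A_p)
ap = power_gain((3.0, 0.0, 0.0), (0.0, 0.0))
norm = 1.0 / math.sqrt(ap)
```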
Figure 4.17.: Magnitude responses of the designed Doppler filters. (a) Jakes, log.; (b) flat, log.; (c) Gaussian, log.; (d) Jakes, lin.; (e) flat, lin.; (f) Gaussian, lin. The logarithmic plots show the magnitude response together with the magnitude error.
The constants K in Table B.1 are already normalized to ensure a unity power gain for WGN
input.
4.3.3. Implementation Guidelines
The filter design algorithm described in the previous section returns second-order sections with poles and zeros in arbitrary order. Finite-precision hardware realizations, however, require the optimization of the filter in order to reach one of the following two goals: 1) minimizing the probability of overflow or 2) minimizing the peak round-off noise. In the case of a Doppler spectrum-shaping filter, the latter optimization is more desirable.
For a given transfer function implemented with second-order sections, minimizing the peak round-off noise is achieved by 2-norm scaling and by reordering the sections so that the poles
are in descending order [91], which means that the first section has the poles closest to the unit circle. The zeros and the poles are then grouped according to their proximity, starting with the poles closest to the unit circle and successively matching each pole with the closest remaining zeros until all of them are matched.
The goal of the 2-norm scaling is to achieve a constant noise variance after each section. The 2-norm of a filter is defined by the equation below and is intuitively the area under the squared magnitude response, or the power gain for filtered white noise.
‖H‖₂² = (1/2π) ∫_0^{2π} |H(e^{jω})|² dω    (4.25)
The normalization is performed by adjusting the b_0 coefficient of each section so that the 2-norm of the filter formed by cascading the current section with all previous sections becomes one. The process starts with the first section and is described by the equation below. In this equation, H_k is the magnitude response of the individual sections, while H_{1,n} is the magnitude response of all cascaded sections up to and including the current section n, out of the total number of sections N.
‖H_{1,n}‖₂² = (1/2π) ∫_0^{2π} | ∏_{k=1}^{n} H_k(e^{jω}) |² dω = 1,   n = 1 . . . N    (4.26)
In the case of a Doppler spectrum-shaping filter, the variance of the input white noise has to be preserved. Using the above scaling method, the variance remains the same after each filter section, and the additional multiplicative constant K is no longer necessary. With each new section, the frequency response converges to the overall filter response. As an example, we consider the Jakes filter designed in the previous section, which has seven second-order sections. Figure 4.18 shows the frequency response of the individual sections, as well as their cumulative response.
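The section-by-section scaling can be sketched as follows (illustrative Python; the 2-norm is estimated on a discrete frequency grid and the helper names are ours):

```python
import cmath
import math

def section_response(sec, z: complex) -> complex:
    """One biquad with an explicit gain: (b0, b1, b2, a1, a2)."""
    b0, b1, b2, a1, a2 = sec
    return b0 * (1 + b1 / z + b2 / z**2) / (1 + a1 / z + a2 / z**2)

def norm2_sq(sections, n_freq: int = 512) -> float:
    """Squared 2-norm of the cascade: mean of |prod H_k|^2 on the unit circle."""
    acc = 0.0
    for i in range(n_freq):
        z = cmath.exp(2j * math.pi * i / n_freq)
        h = 1.0 + 0j
        for sec in sections:
            h *= section_response(sec, z)
        acc += abs(h) ** 2
    return acc / n_freq

def scale_cascade(sections):
    """(4.26): give each partial cascade 1..n a unit 2-norm by adjusting b0."""
    scaled = []
    for sec in sections:
        sec = list(sec)
        scaled.append(sec)
        sec[0] /= math.sqrt(norm2_sq(scaled))   # norm is linear in this b0
    return scaled

# two gain-only sections: after scaling, the overall power gain is unity
out = scale_cascade([[3.0, 0, 0, 0, 0], [0.5, 0, 0, 0, 0]])
g = norm2_sq(out)
```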
The hardware realization of a digital filter involves the discretization of the data samples and the coefficients. After the multiplications and accumulation, the bitwidth of the intermediate results becomes very large and has to be scaled down. Dedicated scaling (casting) blocks are used for this purpose. For a discretized second-order section, the schematic containing the casting blocks is shown in Figure 4.19. Casting to a higher precision is denoted by <<, casting to a lower precision by >>. In order to avoid the two subtractions in the straightforward DF-II implementation, shown in Figure 4.16b, the minus is embedded into a_1 and a_2, which incurs no additional hardware costs.

Figure 4.19 also shows the bitwidth at different points in the signal path. The four bitwidths encountered in the implementation are the following:
Figure 4.18.: Magnitude response of the second-order sections for the designed Jakes filter. (a) Individual sections 1-7; (b) cumulative responses of sections 1…n, for n = 1…7
Figure 4.19.: Discretized second-order filter section with casting blocks
DW: data bitwidth
CW: coefficients bitwidth
SW: states bitwidth
AW: accumulator bitwidth
The result after multiplications and accumulation has the largest bitwidth, which must be at least as large as the bitwidth after multiplication, SW+CW. The accumulation result needs to be cast down to the bitwidth of the filter states or of the output. Depending on the requirements, casting can be performed by truncation or rounding. Rounding produces a lower round-off noise but requires an extra adder. Another parameter of the casting is the overflow behavior, which can be either wrapping or saturation. Wrapping is obtained with no hardware costs, by simply discarding a number of MSBs. Saturation, however, requires two comparators and two
multiplexors. Unlike casting to a lower precision, casting to a higher precision only consists in appending a number of zeros.

If multiple second-order sections are cascaded, casting of the input to the accumulator precision and of the accumulator to the output precision must be performed only once. No casting is necessary between the sections. Moreover, the accumulation of the multiplications with the a coefficients of a section can be combined with the accumulation of the multiplications with the b coefficients of the previous section. These considerations can be better observed in Figure 4.20, which shows three cascaded second-order sections, with the shared accumulations marked by dashed rectangles.
Figure 4.20.: Cascade of three second-order filter sections
Depending on the required throughput, different architectures can be used for the implementation of a cascade of second-order filter sections. In the following we denote with N the number of filter sections. The three most straightforward solutions are outlined in the following list.

Fully parallel. This solution ensures the highest throughput, of one sample per clock cycle, but also requires the most hardware resources. Each section has 4 multipliers, 5 adders, and 2 registers, exactly as in Figure 4.16b. The requirements grow linearly with N.

Individual sequential. Each section contains one multiply-and-accumulate (MAC) unit, which is operated sequentially under the control of a small state machine. The computation of one output sample takes 5 clock cycles, which ensures a throughput independent of N. The complexity, however, grows linearly with N as well.

Fully sequential. All computations are performed sequentially using a single MAC unit under the control of a state machine. The computation time is 5N clock cycles, growing linearly with the number of sections. The essential advantage is that it has the smallest area. One drawback, however, is the fact that the internal bitwidths must be the same for all sections, which precludes the local fine tuning possible with the other two solutions.
As the required throughput before interpolation is very low in the case of a Doppler spectrum generator, the fully sequential architecture is the ideal candidate for a hardware implementation. A possible implementation solution is presented in Figure 4.21. The filter states for all sections
are stored in a single synchronous RAM block, whereas the constant coefficients are stored in a dedicated ROM. The memory maps for both the ROM and the RAM are shown in Figure 4.22.
Figure 4.21.: Sequential architecture for cascaded second-order sections (states RAM, coefficients ROM, MAC with accumulator, and the control state machine)
It is worth mentioning that no full dual-port RAM is needed, because the same address is used for read and write. Moreover, if the RAM supports the read-after-write mode, as is the case for many FPGA embedded RAM blocks, the multiplexor in front of the multiplier and the delay register on the feedback path can be saved. The accumulation result that is written into the RAM would then be available at its output in the next clock cycle.
If the fading tap generator needs to support different Doppler profiles, such as Jakes, flat, or Gaussian, the number of sections and the filter coefficients must be made configurable. This can be easily achieved with the presented architecture by making the control state machine dependent on the number of sections (a configuration parameter) and by replacing the coefficients ROM with a dual-port RAM block that can be written by a central processor. Synchronous on-chip memories, which can be used as RAM or ROM, are readily available in almost all modern FPGA devices. Most FPGA synthesis tools are able to infer them from HDL, which results in generic and scalable designs.
The control state machine is relatively simple and consists of two counters plus a few additional logic gates. One counter iterates through the sections, while the other sequences the 5 MAC operations for each section. The output signals of the state machine are shown in Figure 4.23 for the case of a three-section filter. The proposed architecture has no wait states and keeps the MAC busy for the entire duration of a computation, so that the computation takes exactly 5N clock cycles, where N is the number of filter sections. Pipelining increases the filter latency by 3 cycles, but does not affect its throughput.
Figure 4.22.: Address maps for the coefficients and the filter states (ROM addresses 0-14 hold a_{1,n}, a_{2,n}, b_{0,n}, b_{1,n}, b_{2,n} for n = 1…3; RAM addresses 0-5 hold the states s_{1,n}, s_{2,n})
Figure 4.23.: Signals generated by the control FSM for a three-section filter (rom_addr, ram_addr, ram_wen, acc_init, acc_init_sel, start, and done over the 15 cycles of one computation)
4.4. Spectrum Shifter
In some cases, such as the ionospheric channel profiles, a constant Doppler shift is specified for each path, in addition to the Doppler spread. This frequency shift is performed by multiplying the filtered complex signal with a complex exponential exp(jπ f_Dsh n), where f_Dsh is the normalized Doppler shift between 0 and 1 and n is the sample index. It is essential that the frequency shift does not translate the original symmetric spectrum past the Nyquist limit, i.e. the condition f_Dmax + f_Dsh < 1 has to be fulfilled. Otherwise aliasing would occur.
Implementing a frequency shift in SW is trivial and consists in a call to sin and cos followed by a complex multiplication with the input signal. In HW, however, generating a sin/cos pair and then performing the complex multiplication is not the most efficient approach. A discrete sin/cos generator with programmable frequency is usually implemented using a phase accumulator and a look-up table. An alternative solution is to use a CORDIC rotator to compute the sin/cos values for the generated phase. This HW solution is shown in Figure 4.24a.

The complex multiplier can be completely eliminated if we realize that there is no need to actually generate sin and cos, but only to rotate the complex input signal by the phase
Figure 4.24.: Frequency shifter implementations. (a) Direct implementation: phase accumulator, sin/cos generation via a CORDIC rotator, and a complex multiplier; (b) optimized implementation: the CORDIC rotator placed directly in the signal path
computed by the phase accumulator. The same CORDIC rotator used for sin/cos generation
can be directly inserted in the signal path, thus obviating the need for a complex multiplier.
This HW solution is shown in Figure 4.24b for comparison.
The width of the phase accumulator depends on the required frequency resolution. For an N-bit accumulator, the frequency can be varied linearly between 0 and 1 in 2^{N−1} increments, with a frequency resolution of 1/2^{N−1}. If the phase accumulator increment is A_inc, the normalized frequency shift is exactly A_inc/2^{N−1}.
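A software model of the phase-accumulator shifter (illustrative Python; a complex multiplication stands in for the hardware CORDIC rotation, and the names are ours):

```python
import cmath

def freq_shift(samples, a_inc: int, n_bits: int = 16):
    """Rotate each complex sample by the phase of an n_bits accumulator.
    The normalized frequency shift is a_inc / 2**(n_bits - 1), with 1.0 = Nyquist."""
    acc = 0
    out = []
    for s in samples:
        phase = cmath.pi * acc / 2 ** (n_bits - 1)
        out.append(s * cmath.exp(1j * phase))     # CORDIC rotation in HW
        acc = (acc + a_inc) & (2 ** n_bits - 1)   # accumulator wraps modulo 2^N
    return out

# shifting a DC input with a_inc = 2**(n_bits - 2) gives a tone at half Nyquist
y = freq_shift([1.0 + 0j] * 8, a_inc=2 ** 14, n_bits=16)
```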
4.5. Polyphase Interpolator
The interpolation factors can be very high. The narrower the desired spectrum, the higher
the interpolation factor. Values of 1000 are not uncommon in many applications, like Doppler
fading tap generators.
Various interpolation solutions for fading tap generators have been proposed in the literature. In [51] the interpolation factor is restricted to integers and a single poly-phase interpolation stage is used. In [82] a multi-stage approach is preferred, such as described in [14]. The interpolation factor is in this case a composite number, i.e. L = ∏_{k=0}^{K−1} L_k, where K is the number of successive interpolation stages with integer interpolation factors.
Restricting the interpolation factor to integers may not be flexible enough for certain applications.
One example would be a simulation that requires the Doppler frequency to sweep a
given interval with many intermediate steps. The accuracy will suffer because of the discontinuities
caused by the discrete set of Doppler frequencies that can be generated. However,
the computational efficiency of integer-factor interpolation is very good for a poly-phase
implementation, especially for single-stage solutions.
Another essential disadvantage of the conventional interpolation solutions is that they
are very inflexible for hardware implementation. Changing the interpolation factor entails
changing all poly-phase coefficient sets. This is not an issue in software, where the coefficients
are computed off-line before the simulation starts. In hardware, however, the different
coefficient sets need to be computed off-line during design and stored, for each interpolation factor.
There are practical solutions that alleviate this problem by reducing the number of stored
coefficients, but they all suffer from increased complexity and reduced flexibility.
The solution we propose eliminates all the above-mentioned problems. It offers very high
flexibility in choosing the interpolation factor, without sacrificing computational efficiency
and scalability, while being suitable for both software and hardware implementation.
Moreover, both up-sampling and down-sampling can be implemented with the same structure,
which might be of interest in some applications. In software, it allows for arbitrary interpolation
factors, the only limitation being the intrinsic floating-point number precision. In hardware,
the interpolation factors are restricted to the form $2^N/X$, where N is the width of an
accumulator register and X is an integer increment.
4.5.1. Architecture Overview
The architecture of the interpolator (resampler), shown in Figure 4.25, consists of three main
components:

Phase accumulator
Coefficients generator
Interpolation (resampling) FIR filters
The phase accumulator keeps track of the sub-sample phase by incrementing its value with a
constant value for each output sample. A new sample is read from the input each time the
accumulator overflows. We denote with N the width of the accumulator and with $A_{inc}$ the
increment. The upsampling factor, $f_{s,out}/f_{s,in}$, is given by the formula:

F = \frac{2^N}{A_{inc}}   (4.27)
For upsampling, we have $A_{inc} < 2^N$. An example of the upsampling process is shown in
Figure 4.26 for N = 4 and $A_{inc}$ = 5, with a resulting upsampling factor of 16/5. In the figure,
red arrows mark the transitions for which accumulator overflow occurs. It is readily apparent
that the accumulator phase represents the sub-sample phase of the output interpolated samples.
Figure 4.25.: Resampler architecture (phase accumulator with increment $A_{inc}$, coefficients generators C0...C5, and the interpolation FIR filter acting as a resampler, i.e. a variable delay element)
This normalized phase advances by $A_{inc}/2^N$ for each output sample and lies in the range [0, 1).
Figure 4.26.: Example of the upsampling process (phase sequence 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1 for N = 4 and $A_{inc}$ = 5; a new input sample is consumed at each overflow of the modulo-$2^N$ accumulator)
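The accumulator behaviour illustrated in Figure 4.26 can be reproduced with a short Python model (an illustrative sketch; the function name is ours):

```python
def accumulator_phases(n_bits, a_inc, n_out):
    """Simulate the resampler phase accumulator (Eq. 4.27: F = 2**n_bits / a_inc).

    Returns a list of (phase, overflow) pairs, one per output sample. `phase`
    is the raw N-bit accumulator value; `overflow` marks the transitions at
    which a new input sample is read.
    """
    modulus = 1 << n_bits
    acc, out = 0, []
    for _ in range(n_out):
        acc_next = acc + a_inc
        out.append((acc, acc_next >= modulus))  # overflow -> consume next input
        acc = acc_next % modulus
    return out

# The example from Figure 4.26: N = 4, A_inc = 5, upsampling factor 16/5
pairs = accumulator_phases(4, 5, 16)
```

Over 16 output samples the accumulator overflows 5 times, i.e. 5 input samples are consumed, confirming the 16/5 upsampling factor.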
The interpolator works in pull mode, i.e. it is driven by the output. The output data rate is
fixed and controls the phase accumulator. The input rate is lower and variable, depending on
the actual interpolation factor. Decimators, on the other hand, work in push mode, where the
accumulator is operated at the fixed rate of the input, while the output rate depends on the
decimation factor.
The sub-sample phase generated by the phase accumulator is now used for resampling. Resampling
means generating an output sample from a predefined number of input samples, with a
relative phase between two original samples. The resampling process consists in multiplying the
input samples with a set of weights that depend on the desired phase. Physically, the resampler
is implemented as a FIR filter with variable coefficients and a block that generates the appropriate
coefficients based on the desired phase. In the field of communications, the resampler
block is also referred to as a variable delay element, since it can be regarded as introducing a
sub-sample delay in the input signal.
An advantage of the proposed interpolation architecture becomes apparent in the case of multi-channel
interpolation, i.e. when more than one data channel is upsampled at the same time. In our
case, the signal to be upsampled is complex, so we have two channels processed in parallel.
Other cases include multi-channel audio sampling rate conversion and image scaling, e.g. RGB
or YUV. In this case, the phase accumulator and the coefficients generator can be shared and
only one interpolation FIR filter per extra channel is needed, as shown in Figure 4.27.
Figure 4.27.: Multi-channel interpolation architecture (the phase accumulator and coefficients generator are shared; one interpolation FIR filter per channel, here I and Q)
4.5.2. Polyphase Coefficients Generator
In order to determine the relationship between the desired phase and the coefficients of the
interpolation filter, we first discuss the poly-phase decomposition of interpolation functions.
The principle of interpolation is to first insert zero samples between the original samples,
then apply a low-pass filter to reject the images created by oversampling at multiples of the
sampling frequency. In Figure 4.28 we show the block schematic and the spectra for a 5x
interpolation. First, 4 zeros are inserted between original samples, followed by low-pass
filtering with $f_c$ = 1/5.
(a) Structure (b) Spectra
Figure 4.28.: Interpolation structure and spectra
If the interpolation filter is ideal, i.e. rectangular with $f_c$ = 1/5, no loss of information occurs.
When real filters are used, however, the interpolated signal will be distorted. There are two
classes of distortions: a) linear distortions, due to the high-frequency attenuation of the filter,
and b) non-linear distortions, caused by the insufficient attenuation of the image components.
Since most of the multiplications in the filter are with zero samples, the straightforward implementation
of the textbook structure in Figure 4.28 is very inefficient. For larger interpolation
factors the situation becomes worse. The standard solution is to use a filter with coefficients
that depend on the phase of the output signal. For 8x interpolation, there are eight possible
output phases and therefore eight coefficient sets.
An example of the polyphase decomposition of the original interpolation FIR filter is shown in
Figure 4.29, where the eight phases are shown using different colors. The original
filter is symmetrical with 47 taps, whereas the poly-phase filter has only 6 taps and 8 sets of
coefficients. Thus, the number of MAC operations per output sample decreases from 47 to 6.
The saving is even more significant for very high interpolation factors, for which the textbook
filter approach would be extremely inefficient.
Figure 4.29.: Interpolation filter poly-phase decomposition
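The decomposition itself is mechanical: phase p of an L-phase decomposition collects the taps h[p], h[p+L], h[p+2L], and so on. A minimal Python sketch with an illustrative windowed-free sinc prototype (not the actual thesis filter; all names are ours):

```python
import math

L, TAPS = 8, 6          # 8 phases of 6 taps, matching the 47-tap example above

def proto(n, center=23):
    """Illustrative symmetric 47-tap prototype h[n] (a plain sinc)."""
    x = (n - center) / L
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

h = [proto(n) for n in range(47)] + [0.0]    # pad to 48 = L * TAPS
# Polyphase component p holds every L-th tap, starting at offset p:
phases = [[h[k * L + p] for k in range(TAPS)] for p in range(L)]

# One output sample at sub-phase p/L then costs TAPS MACs instead of 47:
window = [0.0] * (TAPS - 1) + [1.0]          # last TAPS input samples
y = sum(c * s for c, s in zip(phases[3], window))
```

The prototype here is only a stand-in; any 47-tap low-pass designed for 8x interpolation decomposes the same way.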
Besides the reduced number of MAC operations, another essential advantage of the polyphase
structure is that the interpolation filter size is independent of the interpolation factor. The
latter determines only the number of coefficient sets. Such a poly-phase implementation can
be used for generating interpolated samples with any phase between 0 and 1 by simply using
the appropriate coefficient set. If the number of desired phases becomes very large, however, the
storage needed for the coefficients becomes prohibitive.

The solution we propose enables the generation of any output phase using a relatively low
number of stored coefficient sets. The idea is to store coefficients for a limited number of
equidistant phases P, usually a power of 2. For a given phase, the actual filter coefficients are
computed by linear interpolation between two adjacent coefficient sets. Figure 4.30 shows
the interpolation process for the two central taps of a 6-tap filter. The number of coefficient
sets for P phases is P+1 because the coefficients for phase 1 are needed for linear interpolation.
These are simply the coefficients for phase 0 reversed. In the case of 4 poly-phases, 5 coefficient
sets are stored, one for each of the following phases: 0, 1/4, 2/4, 3/4 and 1.
Figure 4.30.: Linear interpolation between stored coefficients (central taps C3 and C4 of a 6-tap filter vs. sub-sample phase)
The sampling interval is thus divided into P segments of equal width. In order to perform linear
interpolation, the segment number $K_s$ and the intra-segment phase $\varphi_s$ have to be computed,
using the following relationships. The desired output phase in the range [0 . . . 1) is denoted here
by $\varphi_o$.

K_s = \lfloor P \varphi_o \rfloor, \quad K_s \in \{0, 1, \ldots, P-1\}   (4.28)
\varphi_s = P \varphi_o - K_s, \quad \varphi_s \in [0 \ldots 1)   (4.29)
The output coefficient $C_{int}$ is computed by linear interpolation between the selected adjacent
coefficients $C_{K_s}$ and $C_{K_s+1}$:

C_{int} = C_{K_s} + \varphi_s (C_{K_s+1} - C_{K_s})   (4.30)
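Equations (4.28)-(4.30) translate directly into code; a minimal floating-point sketch (the function name is ours):

```python
def interp_coeffs(stored, phase):
    """Coefficients for an arbitrary phase in [0, 1), per Eqs. (4.28)-(4.30).

    `stored` holds P + 1 coefficient sets for the phases 0, 1/P, ..., 1.
    """
    P = len(stored) - 1
    ks = int(P * phase)            # segment number, Eq. (4.28)
    mu = P * phase - ks            # intra-segment phase, Eq. (4.29)
    return [a + mu * (b - a)       # Eq. (4.30), applied per filter tap
            for a, b in zip(stored[ks], stored[ks + 1])]
```

In hardware the same operation is done in fixed point, with the multiplexor-based selection described below.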
In hardware, the phase $\varphi_o$ is encoded using a fixed number of bits N. If the number of coefficient
sets is a power of 2, say $2^Q$, the selection of the two adjacent coefficient sets is done with two
multiplexors controlled by the first Q MSBs of the N-bit phase word. The remaining N-Q
bits represent the intra-segment phase $\varphi_s$ and are directly used for linear interpolation. The
schematic is shown in Figure 4.31 for $2^Q$ = 4. Such a structure is needed for every filter
coefficient.
In most applications, the coefficients for each phase are normalized, i.e. their sum is one. This
ensures that the response at DC is the same for all phases. If this condition is not met, ripple
occurs for slowly varying signals, which shows up as high-frequency spurious components in the
spectrum of the interpolated signal.
Figure 4.31.: Schematic of a coefficient generator

When coefficients have finite precision, the normalization of the interpolated coefficients might
be affected, even if the original coefficient sets are normalized. Simulations show that for
discretized coefficients whose sum for each phase equals 256, the resulting sum after
linear interpolation can vary by ±2 around the average of 256. The solution is to renormalize
the coefficients after linear interpolation. Renormalization is performed by computing the error
of the sum and subtracting it from one of the two central coefficients or, even better, from the
larger of the two. The schematic of the proposed solution is shown in Figure 4.32 for a 4-tap
interpolation filter.
Figure 4.32.: Post-interpolation coefficients renormalization
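In fixed point, the renormalization step described above amounts to a few operations per coefficient set; a sketch under the assumption of a target coefficient sum of 256 (the function name is ours):

```python
def renormalize(coeffs, target=256):
    """Restore the coefficient sum after linear interpolation.

    The error of the sum is subtracted from the larger of the two central
    coefficients, as described above for the hardware implementation.
    """
    err = sum(coeffs) - target
    mid = len(coeffs) // 2
    # pick the larger of the two central taps
    k = mid if coeffs[mid] >= coeffs[mid - 1] else mid - 1
    out = list(coeffs)
    out[k] -= err
    return out
```

Subtracting the error from the largest tap minimizes the relative disturbance of the frequency response.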
4.5.3. Interpolation Functions
There are several types of interpolation functions known in the literature, which can be divided
into three main categories:

Polynomial: Lagrange, spline
Windowed sinc: Lanczos, Hamming, Hanning
Optimal, matched to the signal's spectrum

The two main parameters of an interpolation function are the number of samples of the original
signal that are taken into account for interpolation and the integer interpolation factor.
Additionally, the optimal signal-matched interpolation requires knowledge of the signal
spectrum or its autocorrelation function. For our study we consider the Lagrange, Lanczos,
and signal-matched interpolation.
The interpolation function is decomposed into its poly-phase components for efficiency reasons,
as shown in Subsection 4.5.2. Each output sample is obtained by multiplying a fixed number
of neighboring input samples by a set of coefficients that depend on the desired phase.
We want to compare the performance of various interpolation functions for a given input signal.
In the following analysis we consider flat-spectrum band-limited Gaussian noise. First, we
consider a bandwidth of 0.25 and determine the mean square error (MSE) of the interpolated
output as a function of the sub-sample phase between 0 and 1. The test bench, shown in Figure 4.33,
consists of an ideal band-limited noise generator, followed by a decimator and an
interpolator. The decimation and interpolation factors are equal to the number of phases
for which the analysis is performed. The bandwidth of the noise generator is the desired bandwidth
before interpolation divided by the interpolation factor. It is essential that the generated
noise has very low spectral components outside the band of interest, otherwise these would fold
back into the Nyquist band after decimation and would appear as interpolation errors.
Figure 4.33.: Interpolation error measurement (band-limited noise generator, M:1 decimator, 1:M interpolator, and MSE computation over M phases against a delayed reference)
The frequency response of the Lagrange poly-phase filters is shown in Figure 4.34 for sub-sample
phases between 0 and 0.5, for 4 and 8 taps respectively. As the filter is symmetrical,
the coefficients for phase $\varphi$ are the coefficients of phase $1-\varphi$ reversed, and their frequency
responses are identical.
The frequency response of the overall Lagrange interpolation filter for an interpolation factor
of 8 is shown in Figure 4.35, for 4, 6, and 8 taps. As expected, the longer the filter, the better
its response. The ideal interpolation filter should have a constant frequency response
up to the Nyquist frequency, which is marked with a vertical line in the figure, while outside
the Nyquist band the response should be zero. These conditions can only be fulfilled by a sinc
filter with an infinite number of taps. Real interpolation filters, however, cannot fulfill either
of these conditions, which gives rise to two categories of interpolation errors:

Linear distortions, due to the attenuation of the upper frequencies in the original Nyquist band.
Non-linear distortions, due to insufficient rejection of the image components outside the original Nyquist band (aliasing).

Figure 4.34.: Frequency response of Lagrange poly-phase filters: (a) 4 taps, (b) 8 taps (amplitude response vs. normalized frequency, for sub-sample phases 0/8 to 4/8)
Figure 4.35.: Frequency response of Lagrange interpolation filters (amplitude response in dB vs. normalized frequency, for 2, 4, 6, and 8 taps)
It must be mentioned here that the ultimate cause of aliasing is the difference in the frequency
responses of the poly-phase filters in Figure 4.34. If the responses were the same for all phases,
no aliasing would occur, only attenuation of the high-frequency components of the original
signal.
The Lanczos interpolation belongs to the windowed-sinc family of interpolation functions. In
this case, the sinc function is windowed with the main lobe of a wider sinc. The relative width
of the latter sinc is the interpolation factor, as shown in (4.31) for a factor of 3. The frequency
responses of the poly-phase components for Lanczos interpolation are shown in Figure 4.36, for
4 and 8 taps respectively. It can be seen that Lanczos is better than Lagrange for higher-order
interpolation filters and for signals with significant high-pass components.

L(x) = \begin{cases} \dfrac{\sin(\pi x)}{\pi x} \cdot \dfrac{\sin(\pi x/3)}{\pi x/3} & |x| < 3 \\ 0 & |x| \geq 3 \end{cases}   (4.31)
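For reference, (4.31) in code form, a direct transcription with the window width a = 3 as a parameter:

```python
import math

def lanczos(x, a=3):
    """Lanczos kernel of Eq. (4.31): sinc(x) windowed by the main lobe of sinc(x/a)."""
    if abs(x) >= a:
        return 0.0
    if x == 0.0:
        return 1.0
    px = math.pi * x
    return (math.sin(px) / px) * (math.sin(px / a) / (px / a))
```

Like every interpolation kernel, it is 1 at the origin and 0 at all other integer offsets, so the original samples pass through unchanged.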
Figure 4.36.: Frequency response of Lanczos poly-phase filters: (a) 4 taps, (b) 8 taps (amplitude response vs. normalized frequency, for sub-sample phases 0/8 to 4/8)
The frequency response of the overall Lanczos interpolation filter for an interpolation factor of
8 is shown in Figure 4.37, for 4, 6, and 8 taps. Unlike the Lagrange filter, whose frequency
response has lobes centered around odd multiples of the original Nyquist frequency, the Lanczos
filter has a faster decaying frequency response, without clear periodic structures. This property
is typical of all windowed-sinc filters.
The optimal signal-matched interpolation, described in detail in Subsection 5.4.2, has a
frequency response close to that of the Lagrange interpolation, e.g. with lobes at odd multiples
of the original Nyquist frequency, except that the lobes are smaller. Since this filter is MSE-optimized
for a specific signal spectrum, it offers the best interpolation performance, provided
that the actual signal has the spectrum for which the filter was designed. This condition
is always fulfilled in the case of a fading tap generator, where the Doppler spectrum before
interpolation is known by design.
In the following, we want to evaluate the three interpolation functions for various signal bandwidths
and interpolation filter lengths and select the best for our requirements. We consider a
frequency range from 1/8 to 1 and filter lengths in the range 4 . . . 16. The testbench used is the one
in Figure 4.33. The decimation/interpolation factor does not affect the result and has been
set to 4 in this case. The results are plotted in Figure 4.38 for a bandwidth between 1/8 and
1 on a logarithmic scale.

Figure 4.37.: Frequency response of Lanczos interpolation filters (amplitude response in dB vs. normalized frequency, for 4, 6, and 8 taps)
The truncated sinc interpolation was added as a reference to show that a smooth window
is necessary to achieve decent interpolation results. As expected, the results show that the
signal-matched interpolation offers the smallest error for all bandwidths. For high bandwidths,
Lanczos interpolation is superior to Lagrange. The former has a slightly oscillating error floor
at lower bandwidths. For comparison purposes, Figure 4.39 shows the results for the three
interpolation types on the same axes, for 8 and 16 taps, this time only for the two octaves between
1/4 and 1.

For 8 taps, the error is less than -80 dB for bandwidths below 0.25, for both Lagrange and signal-matched
interpolation. At 0.5 bandwidth, the error of the signal-matched interpolation is
20 dB smaller than for Lagrange. Two important conclusions can be drawn from this analysis.
First, Lanczos interpolation, like other windowed-sinc varieties, offers the worst performance for
bandwidths below 0.5 and is not appropriate for fading tap generation. Second, the interpolation
error decreases very fast with the bandwidth, roughly 40 dB per octave for 8 taps and 60 dB
per octave for 16 taps. For a bandwidth of 0.25, the interpolation error is less than -90 dB for
both Lagrange and signal-matched interpolation. The latter achieves -90 dB even with 4 taps
instead of 8, which doubles the computational efficiency.
4.5.4. Performance Analysis
The interpolation performance analysis in the previous section has been performed for a fixed
interpolation factor, assuming the filter coefficients to be exactly computed for the desired
phases. The number of phases is in this case the interpolation factor, so that the coefficients
can be computed off-line and stored in a look-up table. In the case of a resampler, the number
of phases is very large, usually a power of 2. The successive phases are computed using a
phase accumulator with an increment $A_{inc}$. The resulting output frequency has the following
expression, where N is the width of the phase accumulator:

f_{out} = f_{in} \cdot \frac{A_{inc}}{2^N}   (4.32)
This equation shows that the output frequency depends linearly on the input frequency. Such
a resampler architecture can generate $2^N$ equally spaced values of $f_{out}$ in the range 0 . . . $f_{in}$. The increment
$A_{inc}$ can be larger than $2^N$, in which case the resampler performs downsampling and $f_{out}$ is
larger than $f_{in}$. In our case the input bandwidth is a constant 0.25, while the output bandwidth
can be varied between 0 and 0.5.
Unlike interpolation, such a resampler architecture requires the generation of output samples
for a very large number of possible sub-sample phases. Storing coefficients for all these phases
is impractical. As mentioned in Subsection 4.5.2, an efficient solution is to store coefficients
for a predefined number of phases and linearly interpolate for all others. One of the aims of
the present performance analysis is to investigate how the number of coefficient sets affects
the resampling performance.
In the specific case of fading tap generation, resampling performance is defined in terms of
spurious frequency components outside the desired Doppler bandwidth. Both the highest peak
and the variance of these spurious components are of interest. In order to determine them, the
testbench in Figure 4.40 is used. The low-pass filter limits the noise bandwidth to 0.25 and
has a very sharp cut-off and a flat response in the pass-band. It is an elliptic filter of order 16,
with a pass-band ripple of 0.1 dB and a stop-band attenuation of -120 dB, and is scaled so that
the noise variance after filtering is 1. It is implemented as 8 cascaded second-order sections
(bi-quads). The tough constraints force the poles very close to the unit circle, which makes
a straightforward direct-form implementation impossible. Even with double-precision floating-point
coefficients, such a filter would be unstable.
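A filter with these constraints can be designed, for instance, with SciPy (a sketch reproducing the parameters described above, not the thesis code):

```python
from scipy.signal import ellip, sosfreqz

# Elliptic low-pass: order 16, 0.1 dB pass-band ripple, 120 dB stop-band
# attenuation, cut-off at 0.25 of the sampling rate (0.5 of Nyquist),
# returned directly as 8 cascaded second-order sections (bi-quads).
sos = ellip(16, 0.1, 120, 0.5, output='sos')
w, h = sosfreqz(sos, worN=2048)
# The SOS form keeps the near-unit-circle poles numerically stable; a
# direct-form realization of the order-16 transfer function would not be.
```

Requesting `output='sos'` avoids ever forming the ill-conditioned high-order polynomial coefficients.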
The high-pass filter isolates only the desired out-of-band spurious components. Its constraints
are identical to those of the low-pass filter. The cut-off frequency would ideally be the highest
frequency in the output spectrum, $f_{Dmax}$. However, due to the finite width of the transition
bands, the cut-off frequency is chosen to be 1.1 $f_{Dmax}$. This only introduces a small error
when computing the variance of the spurious components.
One way to evaluate the purity of the interpolated signal is by spectral analysis of the out-of-band
resampling artifacts. The power spectral density (PSD) is obtained using Welch's method
[96] with an FFT window size of 4096 ($2^{12}$), which ensures a good trade-off between accuracy
and computational efficiency. The PSD of the original signal with $f_{Dmax}$ = 0.25 is shown
in Figure 4.41, where the decaying tail is a side-effect of the spectrum estimation method.
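With SciPy, for instance, such an estimate is a one-liner (assumed parameters matching the description above; not the thesis code):

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16)          # stand-in test signal
f, psd = welch(x, nperseg=4096)           # Welch's method, 4096-point segments
```

Averaging the squared FFTs of overlapping 4096-point segments trades frequency resolution for variance reduction of the estimate.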
The spectrum of the spurious components depends on the spectrum of the original signal and on
the frequency response of the interpolation filter. Figure 4.42 shows the spurious interpolation
artifacts in the case of a 6-tap filter, for an input bandwidth of 0.25 and an interpolation factor
of 8. The corresponding frequency response of the interpolation filters is plotted on the same
axes. It is now apparent why Lagrange interpolation is better than windowed-sinc types for
lower input bandwidths.
The results presented so far assumed a power-of-2 interpolation factor and exact filter coefficients.
In reality, however, the upsampling factor is a rational number, which causes the
intermediate sub-sample phase to take many different values. As the coefficients are computed
by linear interpolation between a fixed number of coefficient sets, usually a power of 2, this
will contribute to the overall interpolation error. Our goal is to evaluate this error and to
understand its variation with the interpolation factor. To this end we use the test-bench in
Figure 4.40 and measure the variance of the out-of-band components in a given range of
interpolation factors.
As in the previous analysis, the input signal has a flat spectrum and a bandwidth of 0.25. The
resampling factors are chosen so that the bandwidth of the resampled signal covers four octaves,
between 1/16 and 1/2. Figure 4.43a shows the results for Lagrange interpolation with 4, 6, and
8 taps. The number of poly-phase coefficient sets has been chosen sufficiently large
(256), so that the results are not affected by it. As expected, at bandwidths 0.25 (no
resampling) and 0.5 (decimation by 2) the error is zero. The reason is that no intermediate
samples are generated and the output consists of original samples only. The error is otherwise
relatively constant, varying only slightly with the frequency.
Figure 4.43b shows the same analysis in the case of 4 coefficient sets. The results show that
the linear interpolation errors create an additional error floor, so that the error for 6 and 8 taps
is the same. Unlike the ideal case, the error now depends on the actual resampling factor. For
resampling factors 1/2 and 1/4 the error reaches the ideal level in Figure 4.43a because the
phase only takes values for which coefficient sets are precomputed and no linear interpolation
is needed: 0, 1/4, 2/4, and 3/4.
As the coefficients need to be stored, the number of poly-phase sets directly affects the hardware
complexity. This is less of an issue in software, where a few extra bytes of constant storage are
easily available. In order to determine the optimum number of poly-phase coefficient sets we
need to know how this number affects the error floor. The test configuration in Figure 4.40
is also used in this case. We consider interpolation filters with 4, 6, and 8 taps respectively,
for a resampling factor and output bandwidth at which no error minimum occurs, such as
0.1. Figure 4.44 shows the resampling artifacts variance for Lagrange and signal-matched
interpolation.
The conclusion that can be drawn is that the number of coefficient sets must be correlated
with the filter size for a target interpolation performance. It can be seen that a 4-tap signal-matched
interpolation filter with 8 coefficient sets offers excellent interpolation results, with a
-70 dB variance of the out-of-band artifacts relative to the variance of the signal. It can
also be observed that for Lagrange interpolation with 4 and 256 coefficient sets the results are
consistent with those presented in Figure 4.43.
Figure 4.38.: MSE vs. bandwidth for various interpolation filters: (a) Lagrange, (b) signal-matched, (c) Lanczos, (d) truncated sinc (interpolation MSE in dB vs. bandwidth before interpolation, for 4 to 16 taps)
Figure 4.39.: MSE vs. bandwidth for two interpolation filter sizes: (a) 8 taps, (b) 16 taps (Lagrange, signal-matched, and Lanczos interpolation)
Figure 4.40.: Resampler spurious components measurement (AWGN generator, low-pass filter with fc = 0.25, resampler DUT with K = M/N, high-pass filter with fc = 1.1 · 0.25/K, analysis block)
Figure 4.41.: PSD of the band-limited signal used for testing the resampler performance
Figure 4.42.: Spectrum of the out-of-band interpolation artifacts: (a) 6-tap Lagrange interpolation, (b) 6-tap Lanczos interpolation
Figure 4.43.: Resampling artifacts variance vs. output bandwidth: (a) exact coefficients, (b) 4 coefficient sets (out-of-band components variance in dB, for 4, 6, and 8 taps)
Figure 4.44.: Resampling artifacts variance vs. number of coefficient sets: (a) Lagrange interpolation, (b) signal-matched interpolation (out-of-band components variance in dB, for 4, 6, and 8 taps)