On Deep Learning-Based Channel Decoding
Tobias Gruber∗ , Sebastian Cammerer∗ , Jakob Hoydis†, and Stephan ten Brink∗
∗ Institute of Telecommunications, Pfaffenwaldring 47, University of Stuttgart, 70659 Stuttgart, Germany
{gruber,cammerer,tenbrink}@inue.uni-stuttgart.de
† Nokia Bell Labs, Route de Villejust, 91620 Nozay, France
jakob.hoydis@nokia-bell-labs.com
Abstract—We revisit the idea of using deep neural networks for one-shot decoding of random and structured codes, such as polar codes. Although it is possible to achieve maximum a posteriori (MAP) bit error rate (BER) performance for both code families and for short codeword lengths, we observe that (i) structured codes are easier to learn and (ii) the neural network is able to generalize to codewords that it has never seen during training for structured, but not for random codes. These results provide some evidence that neural networks can learn a form of decoding algorithm, rather than only a simple classifier. We introduce the metric normalized validation error (NVE) in order to further investigate the potential and limitations of deep learning-based decoding with respect to performance and complexity.

I. INTRODUCTION

Deep learning-based channel decoding is doomed by the curse of dimensionality [1]: for a short code of length N = 100 and rate r = 0.5, 2^50 different codewords exist, which are far too many to fully train any neural network (NN) in practice. The only way that a NN can be trained for practical blocklengths is if it learns some form of decoding algorithm which can infer the full codebook from training on a small fraction of codewords. However, to be able to learn a decoding algorithm, the code itself must have some structure which is based on a simple encoding rule, like in the case of convolutional or algebraic codes. The goal of this paper is to shed some light on the question whether structured codes are easier to “learn” than random codes, and whether a NN can decode codewords that it has never seen during training.

We want to emphasize that this work is based on very short blocklengths, i.e., N ≤ 64, which enables the comparison with maximum a posteriori (MAP) decoding, but also has an independent interest for practical applications such as the internet of things (IoT). We are currently restricted to short codes because of the exponential training complexity [1]. Thus, the neural network decoding (NND) concept is currently not competitive with state-of-the-art decoding algorithms which have been highly optimized over the last decades and scale to arbitrary blocklengths.

Yet, there may be certain code structures which facilitate the learning process. One of our key findings is that structured codes are indeed easier to learn than random codes, i.e., fewer training epochs are required. Additionally, our results indicate that NNs may generalize or “interpolate” to the full codebook after having seen only a subset of examples, whenever the code has structure.

A. Related Work

In 1943, McCulloch and Pitts published the idea of a NN that models the architecture of the human brain in order to solve problems [2]. But it took about 45 years until the backpropagation algorithm [3] made useful applications such as handwritten ZIP code recognition possible [4]. One early form of a NN is the Hopfield net [5]. This concept was shown to be similar to maximum likelihood decoding (MLD) of linear block error-correcting codes (ECCs) [6]: an erroneous codeword will converge to the nearest stable state of the Hopfield net, which represents the most likely codeword. A naive implementation of MLD means correlating the received vector of modulated symbols with all possible codewords, which makes it infeasible for most practical codeword lengths, as the decoding complexity is O(2^k) with k denoting the number of information bits in the codeword. The parallel computing capabilities of NNs allow us to solve or, at least, approximate the MLD problem in polynomial time [7]. Moreover, the weights of the NN are precomputed during training and the decoding step itself is then relatively simple.

Due to their low storage capacity, Hopfield nets were soon replaced by feed-forward NNs which can learn an appropriate mapping between noisy input patterns and codewords. No assumption has to be made about the statistics of the channel noise because the NN is able to learn the mapping or to extract the channel statistics during the learning process [8].

Different ideas around the use of NNs for decoding emerged in the 90s. While in [8] the output nodes represent the bits of the codeword, it is also possible to use one output node per codeword (one-hot coding) [9]. For Hamming codes, another variation is to use only the syndrome as input of the NN in order to find the most likely error pattern [10]. Subsequently, NND for convolutional codes arose in 1996 when Wang and Wicker showed that NND matches the performance of an ideal Viterbi decoder [1]. But they also mentioned a very important drawback of NND: decoding problems have far more possibilities than conventional pattern recognition problems. This limits NND to short codes. However, the NN decoder for convolutional codes was further improved by using recurrent neural nets [11].
NND did not achieve any big breakthrough for either block or convolutional codes. With the standard training techniques of that time it was not possible to work with NNs employing a large number of neurons and layers, which rendered them unsuited for longer codewords. Hence, the interest in NNs dwindled, not only for machine learning applications but also for decoding purposes. Some slight improvements were made in the following years, e.g., by using random neural nets [12] or by reducing the number of weights [13].

In 2006, a new training technique, called layer-by-layer unsupervised pre-training followed by gradient descent fine-tuning [14], led to the renaissance of NNs because it made training of NNs with more layers feasible. NNs with many hidden layers are called deep. Nowadays, powerful new hardware such as graphical processing units (GPUs) is available to speed up learning as well as inference. In this renaissance of NNs, new NND ideas emerge. Yet, compared to previous work, the NN learning techniques are only used to optimize well-known decoding schemes, which we denote as the introduction of expert knowledge. For instance, in [15], weights are assigned to the Tanner graph of the belief propagation (BP) algorithm and learned by NN techniques in order to improve the BP algorithm. It still seems that the recent advances in the machine learning community have not yet been adapted to the pure idea of learning to decode.

[Fig. 1: Deep learning setup for channel decoding: an encoder maps the information bits b_1, ..., b_k to a codeword x_i ∈ X of length N ({0,1}^k → {0,1}^N), a noise layer models the channel, and the NND with its hidden layers outputs the estimates b̂_1, ..., b̂_k.]

II. DEEP LEARNING FOR CHANNEL CODING

The theory of deep learning is comprehensively described in [16]. Nevertheless, for completeness, we will briefly explain the main ideas and concepts in order to introduce a NN for channel (de-)coding and its terminology. A NN consists of many connected neurons. In such a neuron, all of its weighted inputs are added up, a bias is optionally added, and the result is propagated through a nonlinear activation function, e.g., a sigmoid function or a rectified linear unit (ReLU), which are respectively defined as

g_{\text{sigmoid}}(z) = \frac{1}{1 + e^{-z}}, \qquad g_{\text{relu}}(z) = \max\{0, z\}.   (1)

If the neurons are arranged in layers without feedback connections, we speak of a feedforward NN because information flows through the net from the left to the right without feedback (see Fig. 1). Each layer i with n_i inputs and m_i outputs performs the mapping f^(i): R^{n_i} → R^{m_i} with the weights and biases of the neurons as parameters. Denoting v as input and w as output of the NN, an input-output mapping is defined by a chain of functions depending on the set of parameters θ by

w = f(v; \theta) = f^{(L-1)}\left(f^{(L-2)}\left(\cdots f^{(0)}(v)\right)\right)   (2)

where L gives the number of layers and is also called depth. It was shown in [17] that such a multilayer NN with L = 2 and nonlinear activation functions can theoretically approximate any continuous function on a bounded region arbitrarily closely, provided that the number of neurons is large enough.
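To make (1) and (2) concrete, the following NumPy sketch (our illustration, not part of the paper) evaluates such a chain of layers with ReLU hidden activations and a sigmoid output; the layer sizes and the random weights are placeholders.

```python
import numpy as np

def g_sigmoid(z):
    # Eq. (1): squashes each output to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def g_relu(z):
    # Eq. (1): rectified linear unit
    return np.maximum(0.0, z)

def nn_forward(v, layers):
    """Eq. (2): w = f(v; theta) = f^(L-1)( ... f^(0)(v)).
    `layers` is a list of (W, b, activation) tuples, i.e. the parameter set theta."""
    w = v
    for W, b, activation in layers:
        w = activation(w @ W + b)
    return w

# Toy example: 8 inputs -> 16 ReLU units -> 4 sigmoid outputs (random placeholder weights)
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 16)), np.zeros(16), g_relu),
          (rng.standard_normal((16, 4)), np.zeros(4), g_sigmoid)]
print(nn_forward(rng.standard_normal(8), layers))
```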
In order to find the optimal weights of the NN, a training set of known input-output mappings is required and a specific loss function has to be defined. By the use of gradient descent optimization methods and the backpropagation algorithm [3], weights of the NN can be found which minimize the loss function over the training set. The goal of training is to enable the NN to find the correct outputs for unseen inputs. This is called generalization. In order to quantify the generalization ability, the loss can be determined for a data set that has not been used for training, the so-called validation set.

In this work, we want to use a NN for decoding of noisy codewords. At the transmitter, k information bits are encoded into a codeword of length N. The coded bits are modulated and transmitted over a noisy channel. At the receiver, a noisy version of the codeword is received and the task of the decoder is to recover the corresponding information bits. In comparison to iterative decoding, the NN finds its estimate by passing through each layer only once. As this principle enables low-latency implementations, we term it one-shot decoding.

Obtaining labeled training data is usually a very hard and expensive task in the field of machine learning. But using NNs for channel coding is special because we deal with man-made signals. Therefore, we are able to generate as many training samples as we like. Moreover, the desired NN output, also denoted as label, is obtained for free because, if noisy codewords are generated, the transmitted information bits are obviously known. For the sake of simplicity, binary phase shift keying (BPSK) modulation and an additive white Gaussian noise (AWGN) channel are used. Other channels can be adopted straightforwardly, and it is this flexibility that may be a particular advantage of NN-based decoding.
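The following sketch illustrates how such labeled data can be generated on the fly; it is our own example and assumes a generic encode() function that maps k information bits to an N-bit codeword.

```python
import numpy as np

def generate_batch(encode, k, n_samples, ebn0_db, rate):
    """Generate (noisy BPSK channel values, information bits) pairs.
    `encode` is assumed to map a length-k bit vector to a codeword in {0,1}^N."""
    rng = np.random.default_rng()
    b = rng.integers(0, 2, size=(n_samples, k))              # information bits = labels, for free
    x = np.array([encode(row) for row in b])                 # codewords
    s = 1.0 - 2.0 * x                                        # BPSK: 0 -> +1, 1 -> -1
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0))   # AWGN noise power for the given Eb/N0
    y = s + rng.normal(scale=np.sqrt(sigma2), size=s.shape)  # received channel values
    return y, b
```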
In order to keep the training set small, it is possible to extend the NN with additional layers for modulating and adding noise (see Fig. 1). These additional layers have no trainable parameters, i.e., they perform a certain action such as adding noise and propagate this value only to the node of the next layer with the same index. Instead of creating, and thus storing, many noisy versions of the same codeword, working on the noiseless codeword is sufficient. Thus, the training set X consists of all possible codewords x_i ∈ F_2^N with F_2 = {0, 1} (the labels being the corresponding information bits) and is given by X = {x_0, ..., x_{2^k−1}}.

As recommended in [16], each hidden layer employs a ReLU activation function because it is nonlinear and at the same time very close to linear, which helps during optimization. Since the output layer represents the information bits, a sigmoid function forces the output neurons to be in between zero and one, which can be interpreted as the probability that a “1” was transmitted. If this probability is close to the bit of the label, the loss should be incremented only slightly, whereas large errors should result in a very large loss. Examples for such loss functions are the mean squared error (MSE) and the binary cross-entropy (BCE), defined respectively as

L_{\text{MSE}} = \frac{1}{k} \sum_i \left(b_i - \hat{b}_i\right)^2   (3)

L_{\text{BCE}} = -\frac{1}{k} \sum_i \left[ b_i \ln \hat{b}_i + (1 - b_i) \ln\left(1 - \hat{b}_i\right) \right]   (4)

where b_i ∈ {0, 1} is the ith target information bit (label) and b̂_i ∈ [0, 1] the NN soft estimate.
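Written out in NumPy, the two loss functions (3) and (4) read as follows (a generic sketch of ours, not the Keras implementation used later):

```python
import numpy as np

def loss_mse(b, b_hat):
    # Eq. (3): mean squared error over the k information bits
    return np.mean((b - b_hat) ** 2)

def loss_bce(b, b_hat, eps=1e-12):
    # Eq. (4): binary cross-entropy; eps avoids log(0) for saturated outputs
    b_hat = np.clip(b_hat, eps, 1.0 - eps)
    return -np.mean(b * np.log(b_hat) + (1.0 - b) * np.log(1.0 - b_hat))
```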
There are some alternatives for this setup. First, log-likelihood ratio (LLR) values could be used instead of channel values. For BPSK modulation over an AWGN channel, these are obtained by

\text{LLR}(y) = \ln \frac{P(x = 0 \mid y)}{P(x = 1 \mid y)} = \frac{2}{\sigma^2} y   (5)

where σ^2 is the noise power and y the received channel value. This processing step can also be implemented as an additional layer without any trainable parameters. Note that the noise variance must be known in this case and provided as an additional input to the NN.^1 Representing the information bits in the output layer as a one-hot-coded vector of length 2^k is another variant. However, we refrain from this idea since it does not scale to large values of k. Freely available open-source machine learning libraries, such as Theano^2, help to implement and train complex NN models on fast concurrent GPU architectures. We use Keras^3 as a convenient high-level abstraction front-end for Theano. It allows one to quickly deploy NNs from a very abstract point of view in the Python programming language that hides away a lot of the underlying complexity. As we support reproducible research, we have made parts of the source code of this paper available.^4
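Such a non-trainable LLR layer can be sketched, for example, with a Keras Lambda layer; this is only an illustration of the idea in (5), not the authors' implementation, and it assumes the noise variance sigma2 is known when the layer is built.

```python
from keras.layers import Lambda

def llr_layer(sigma2):
    # Eq. (5): LLR(y) = 2*y / sigma^2 -- a fixed scaling without trainable parameters
    return Lambda(lambda y: 2.0 * y / sigma2)
```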
[Fig. 2: NVE versus training E_b/N_0 for 16 bit-length codes for a 128-64-32 NN trained with M_ep = 2^16 training epochs; (a) polar code, (b) random code.]

III. LEARN TO DECODE

In the sequel, we will consider two different code families: random codes and structured codes, namely polar codes [19]. Both have codeword length N = 16 and code rate r = 0.5. While random codes are generated by randomly picking codewords from the codeword space with a Hamming distance larger than two, the generator matrix of polar codes of block size N = 2^n is given by

\mathbf{G}_N = \mathbf{F}^{\otimes n}, \qquad \mathbf{F} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}   (6)

where F^{⊗n} denotes the nth Kronecker power of F. The codewords are now obtained by x = u G_N, where u contains k information bits and N − k frozen positions; for details we refer to [19]. This way, polar codes are inherently structured.
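As a small illustration of (6), the sketch below builds G_N via repeated Kronecker products and enumerates all 2^k codewords of the training set; the choice of information positions is a placeholder for illustration and not necessarily the polar-code design used in the paper.

```python
import numpy as np
from itertools import product

def polar_generator(n):
    # Eq. (6): G_N = F^{(Kronecker) n} with F = [[1, 0], [1, 1]]
    F = np.array([[1, 0], [1, 1]], dtype=int)
    G = F
    for _ in range(n - 1):
        G = np.kron(G, F)
    return G

def polar_codebook(n, info_positions):
    """All 2^k codewords x = u G_N (mod 2); frozen positions are kept at zero."""
    N, G = 2 ** n, polar_generator(n)
    codebook = []
    for bits in product([0, 1], repeat=len(info_positions)):
        u = np.zeros(N, dtype=int)
        u[list(info_positions)] = bits
        codebook.append(u @ G % 2)
    return np.array(codebook)

# Example: N = 16, k = 8 (the information positions here are illustrative only)
X = polar_codebook(n=4, info_positions=range(8, 16))
print(X.shape)   # (256, 16)
```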
A. Design parameters of NND

Our starting point is a NN as described before (see Fig. 1). We introduce the notation 128-64-32 which describes the design of the NN decoder employing three hidden layers with 128, 64, and 32 nodes, respectively. However, there are other design parameters with a non-negligible performance impact:
1) What is the best training signal-to-noise ratio (SNR)?
2) How many training samples are necessary?
3) Is it easier to learn from LLR channel output values rather than from the direct channel output?
4) What is an appropriate loss function?
5) How many layers and nodes should the NN employ?
6) Which type of regularization^5 should be used?
The area of research dealing with the optimization of these parameters is called hyperparameter optimization [20]. In this work, we do not further consider this optimization and restrict ourselves to a fixed set of hyperparameters which we have found to achieve good results. Our focus is on the differences between random and structured codes.

^1 Inspired by the idea of spatial transformer networks [18], one could alternatively use a second NN to estimate σ^2 from the input and provide this estimate as an additional parameter to the LLR layer.
^2 https://github.com/Theano/Theano
^3 https://github.com/fchollet/keras
^4 https://github.com/gruberto/DL-ChannelDecoding
^5 Regularization is any method that trades off a larger training error against a smaller validation error. An overview of such techniques is provided in [16, Ch. 7]. We do not use any regularization techniques in this work, but leave it as an interesting future investigation.
Since the performance of NND depends not only on the SNR of the validation data set (for which the bit error rate (BER) is computed) but also on the SNR of the training data set^6, we define below a new performance metric, the normalized validation error (NVE). Denote by ρ_t and ρ_v the SNR (measured as E_b/N_0) of the training and validation data sets, respectively, and let BER_NND(ρ_t, ρ_v) be the BER achieved by a NN trained at ρ_t on data with SNR ρ_v. Similarly, let BER_MAP(ρ_v) be the BER of MAP decoding at SNR ρ_v. For a set of S different validation data sets with SNRs ρ_{v,1}, ..., ρ_{v,S}, the NVE is defined as

\text{NVE}(\rho_t) = \frac{1}{S} \sum_{s=1}^{S} \frac{\text{BER}_{\text{NND}}(\rho_t, \rho_{v,s})}{\text{BER}_{\text{MAP}}(\rho_{v,s})}.   (7)

In this work, we compute the NVE over S = 20 different SNR points from 0 dB to 5 dB with a validation set size of 20,000 examples for each SNR.

^6 It would also be possible to have a training data set which contains a mix of different SNR values, but we have not investigated this option here. Recently, the authors in [21] observed that starting at a high training SNR and then gradually reducing the SNR works well.
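Given measured BER values, (7) amounts to a single averaged ratio; a minimal sketch (ours, with placeholder inputs):

```python
import numpy as np

def nve(ber_nnd, ber_map):
    """Eq. (7): mean ratio of NND BER to MAP BER over S validation SNR points,
    e.g. S = 20 points between 0 dB and 5 dB."""
    return float(np.mean(np.asarray(ber_nnd, dtype=float) /
                         np.asarray(ber_map, dtype=float)))
```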
We train our NN decoder in so-called “epochs”. In each epoch, the gradient of the loss function is calculated over the entire training set X using Adam, a method for stochastic gradient descent optimization [22]. Since the noise layer in our architecture generates a new noise realization each time it is used, the NN decoder will never see the same input twice. For this reason, although the training set has a limited size of 2^k codewords, we can train on an essentially unlimited training set by simply increasing the number of epochs M_ep. However, this makes it impossible to distinguish whether the NN is improved by a larger amount of training samples or more optimization iterations.
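To make this training setup concrete, the following rough Keras sketch wraps a 128-64-32 decoder with non-trainable modulation and noise layers; it is our own reconstruction of the described setup (the use of the GaussianNoise layer and the exact API calls are assumptions, not the authors' released code).

```python
from keras.models import Sequential
from keras.layers import Lambda, GaussianNoise, Dense

def build_decoder(N, k):
    # 128-64-32 decoder: ReLU hidden layers, sigmoid output (see Section III-A)
    decoder = Sequential()
    decoder.add(Dense(128, activation='relu', input_shape=(N,)))
    decoder.add(Dense(64, activation='relu'))
    decoder.add(Dense(32, activation='relu'))
    decoder.add(Dense(k, activation='sigmoid'))
    return decoder

def build_training_model(decoder, N, sigma):
    # Non-trainable channel layers in front of the decoder; new noise is drawn in
    # every pass, so the same input is never seen twice during training.
    model = Sequential()
    model.add(Lambda(lambda x: 1.0 - 2.0 * x, input_shape=(N,)))  # BPSK: 0 -> +1, 1 -> -1
    model.add(GaussianNoise(sigma))                               # AWGN, active only while training
    model.add(decoder)
    model.compile(optimizer='adam', loss='mse')
    return model

# Usage sketch: X holds all 2^k codewords, B the corresponding information bits;
# sigma follows from the training Eb/N0, e.g. r = 0.5 and 1 dB give sigma ~ 0.89.
# decoder = build_decoder(N=16, k=8)
# model = build_training_model(decoder, N=16, sigma=0.89)
# model.fit(X, B, batch_size=256, epochs=M_ep, verbose=0)
```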
Starting with a NN decoder architecture of 128-64-32 and M_ep = 2^22 learning epochs, we train the NN with datasets of different training SNRs and evaluate the resulting NVE. The result is shown in Fig. 2, from which it can be seen that there is an “optimal” training E_b/N_0. The occurrence of an optimum can be explained by the two limiting cases:
1) E_b/N_0 → ∞: training without noise, so the NN is not trained to handle noise.
2) E_b/N_0 → 0: training with noise only, so the NN cannot learn the code structure.
This clearly indicates an optimum somewhere in between these two cases. From now on, a training E_b/N_0 of 1 dB and 4 dB is chosen for polar and random codes, respectively.
Fig. 3 shows the BER achieved by a very small NN of dimensions 128-64-32 as a function of the number of training epochs ranging from M_ep = 2^10, ..., 2^18. For BER simulations, we use 1 million codewords per SNR point. For both code families, the larger the number of training epochs, the smaller the gap between MAP and NND performance becomes.
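Such a BER curve can be estimated with a simple Monte-Carlo loop; the sketch below is our illustration and reuses the generate_batch helper and the decoder-only network sketched above (both our own constructions, not the paper's code).

```python
import numpy as np

def estimate_ber(decoder, encode, k, rate, ebn0_db_list, n_words=1_000_000, batch=10_000):
    """Monte-Carlo BER estimate: decode noisy codewords and count wrong information bits."""
    ber = []
    for ebn0_db in ebn0_db_list:
        errors, total = 0, 0
        for _ in range(n_words // batch):
            y, b = generate_batch(encode, k, batch, ebn0_db, rate)     # noisy channel values + labels
            b_hat = (decoder.predict(y, verbose=0) > 0.5).astype(int)  # hard decision on sigmoid outputs
            errors += int(np.sum(b_hat != b))
            total += b.size
        ber.append(errors / total)
    return ber
```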
[Fig. 3: Influence of the number of epochs M_ep on the BER of a 128-64-32 NN for 16 bit-length codes with code rate r = 0.5; (a) polar code, (b) random code, with curves for M_ep = 2^10, ..., 2^18 and MAP.]

However, for polar codes, close to MAP performance is already achieved for M_ep = 2^18 epochs, while we may need a larger NN or more training epochs for random codes.

In Fig. 4, we illustrate the influence of direct channel values versus channel LLR values as decoder input in combination with two loss functions, MSE and BCE. The NVE for all combinations is plotted as a function of the number of training epochs. Such a curve is also called a “learning curve” since it shows the process of learning. Although it is usually recommended to normalize the NN inputs to have zero mean and unit variance, we train the NN without any normalization, which seems to be sufficient for our setup. For a few training epochs, the LLR input improves the learning process; however, this advantage disappears for a larger M_ep. The same holds for BCE against MSE. For polar codes with LLR values and BCE, the learning appears not to converge for the applied number of epochs. In summary, for training the NN with a large number of training epochs it does not matter whether LLR or channel values are used as inputs and which loss function is employed. Moreover, normalization is not required.

In order to answer the question how large the NN should be, we trained NNs with different sizes and structures. From Fig. 5, we can conclude that, for both polar and random codes, it is possible to achieve MAP performance. Moreover, and somewhat surprisingly, the larger the net, the fewer training epochs are necessary. In general, the larger the number of layers and neurons, the larger is the expressive power or capacity of the NN [16]. Contrary to what is common in classic machine learning tasks, increasing the network size does not lead to overfitting since the network never sees the same input twice.
[Fig. 4: Learning curve (NVE versus training epochs M_ep) for 16 bit-length codes with code rate r = 0.5 for a 128-64-32 NN, comparing direct channel values and channel LLRs with MSE and BCE loss.]

[Fig. 5: Learning curve for different NN sizes (128-64-32, 256-128-64, 512-256-128, 1024-512-256) for 16 bit-length codes with code rate r = 0.5, for the polar and the random code.]

B. Scalability

Up to now, we have only considered 16 bit-length codes which are of little practical importance. Therefore, the scalability of the NN decoder is investigated in Fig. 6. One can see that the length N is not crucial to learn a code by deep learning techniques. What matters, however, is the number of information bits k that determines the number of different classes (2^k) which the NN has to distinguish. For this reason, the NVE increases exponentially for larger values of k for a NN of fixed size and fixed number of training epochs. If a NN decoder is supposed to scale, it must be able to generalize from a few training examples. In other words, rather than learning to classify 2^k different codewords, the NN decoder should learn a decoding algorithm which provides the correct output for any possible codeword. In the next section, we investigate whether structure allows for some form of generalization.

[Fig. 6: Scalability shown by NVE for a 1024-512-256 NN for 16/32/64 bit-length codes with different code rates and M_ep = 2^16 training epochs, plotted over the number of information bits k.]

IV. CAPABILITY OF GENERALIZATION

As Fig. 2–6 show, NNDs for polar codes always perform better than for random codes for a fixed NN design and number of training epochs. This provides a first indication that structured codes, such as polar codes, are easier to learn than random codes. In order to confirm this hypothesis, we train the NN based on a subset X_p which covers only p % of the entire set of valid codewords. Then, the NN decoder is evaluated with the complementary set X̄_p that covers the remaining (100 − p) % of X. As a benchmark, we evaluate the NN decoder also for the set of all codewords X. Instead of BER as in Fig. 3, we now use the block error rate (BLER) for evaluation (see Fig. 7). This way, we only consider whether an entire codeword is correctly detected or not, excluding side-effects of similarities between codewords which might lead to partially correct decoding.
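The subset experiment can be sketched as follows; this is our own illustration, and the way the p % training subset is drawn (a random split) is an assumption, since the paper does not spell out the exact selection.

```python
import numpy as np

def split_codebook(X, B, p, seed=0):
    """Training subset X_p with p percent of all codewords; the rest is the unseen set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(len(X) * p / 100.0))
    return (X[idx[:n_train]], B[idx[:n_train]]), (X[idx[n_train:]], B[idx[n_train:]])

# Example for p = 80: train only on X_80, then measure the BLER separately on the
# unseen 20 % and on the full codebook X as a benchmark.
# (X_p, B_p), (X_unseen, B_unseen) = split_codebook(X, B, p=80)
```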
While for polar codes the NN is able to decode codewords that were not seen during training, the NN cannot decode any unseen codeword for random codes. Fig. 8 emphasizes this observation by showing the single-word BLER for the codewords x_i ∈ X̄_80 which were not used for training. Obviously, the NN fails for almost every unseen random codeword, which is plausible. But for a structured code, such as a polar code, the NN is able to generalize even to unseen codewords. Unfortunately, the NN architecture considered here is not able to achieve MAP performance if it is not trained on the entire codebook. However, finding a network architecture that generalizes best is a topic of our current investigations.

In summary, we can distinguish two forms of generalization. First, as described in Section III, the NN can generalize from input channel values with a certain training SNR to input channel values with arbitrary SNR. Second, the NN is able to generalize from a subset X_p of codewords to an unseen subset X̄_p. However, we observed that for larger NNs the capability of the second form of generalization vanishes.

V. OUTLOOK AND CONCLUSION

For small block lengths, we were able to decode both random codes and polar codes with MAP performance. But learning is limited by the exponential complexity as the number of information bits in the codewords increases. The

[Fig. 7: BLER versus E_b/N_0 when training on a subset X_p with p = 70, 80, 90, 100 %, evaluated on X_p and on the full codebook X, with MAP as reference; (a) 16 bit-length polar code (r = 0.5).]

[Fig. 8: Single-word BLER for x_i ∈ X̄_80 at E_b/N_0 = 4.16 dB and M_ep = 2^18 learning epochs, plotted over the codeword index i for the random and the polar code.]

[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, Dec. 1989.
[5] J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci., vol. 79, pp. 2554–2558, 1982.
[6] J. Bruck and M. Blaum, "Neural networks, error-correcting codes, and polynomials over the binary n-cube," IEEE Trans. Inform. Theory, vol. 35, no. 5, pp. 976–987, Sept. 1989.