Bitwise Neural Networks
networks one still needs to employ arithmetic operations, such as multiplication and addition, on fixed-point values. Even though faster than floating point, they still require relatively complex logic and can consume a lot of power.

With the proposed Bitwise Neural Networks (BNN), we take a more extreme view: every input node, output node, and weight is represented by a single bit. For example, a weight matrix between two hidden layers of 1024 units is a 1024 × 1025 matrix of binary values rather than of quantized real values (the extra column accounting for the bias). Although learning those bitwise weights as a Boolean concept is an NP-complete problem (Pitt & Valiant, 1988), bitwise networks have been studied in limited settings, such as µ-perceptron networks, where an input node is allowed to be connected to one and only one hidden node and the final layer is a union of those hidden nodes (Golea et al., 1992). A more practical network was proposed recently in (Soudry et al., 2014), where the posterior probabilities of the binary weights were sought using the Expectation Back Propagation (EBP) scheme, which is similar to backpropagation in its form but has some advantages, such as parameter-free learning and a straightforward discretization of the weights. Its promising results on binary text classification tasks, however, rely on real-valued bias terms and on averaging predictions from differently sampled parameters.

This paper presents a completely bitwise network in which all participating variables are bipolar binaries. Therefore, its feedforward pass uses only XNOR and bit-counting operations instead of multiplication, addition, and a nonlinear activation on floating- or fixed-point variables. For training, we propose a two-stage approach: the first stage is typical network training with a weight compression technique that helps the real-valued model be easily converted into a BNN. To train the actual BNN, we then use those compressed weights to initialize the BNN parameters and perform noisy backpropagation based on the tentative bitwise parameters. To binarize the input signals, we can adopt any binarization technique, e.g. fixed-point representations or hash codes. Regardless of the binarization scheme, each input node is given only a single bit at a time, as opposed to a bit packet representing a fixed-point number. This is significantly different from networks with quantized inputs, where a real-valued signal is quantized into a set of bits and then all of those bits are fed to an input node in place of their corresponding single real value. Lastly, we apply the sign function as our activation function instead of a sigmoid, to make sure the input to the next layer is bipolar binary as well. We compare the performance of the proposed BNN with that of corresponding ordinary real-valued networks on hand-written digit recognition tasks, and show that the bitwise operations can do the job with a very small performance loss, while providing a large margin of improvement in terms of the necessary computational resources.

2. Feedforward in Bitwise Neural Networks

It has long been known that any Boolean function, which takes binary values as input and produces binary outputs as well, can be represented as a bitwise network with one hidden layer (McCulloch & Pitts, 1943), for example, by merely memorizing all the possible mappings between input and output patterns. We define the forward propagation procedure as follows, based on the assumption that we have trained such a network with bipolar binary parameters:

a_i^l = b_i^l + \sum_j^{K^{l-1}} w_{i,j}^l \otimes z_j^{l-1},    (1)
z_i^l = \mathrm{sign}(a_i^l),    (2)
z^l \in B^{K^l},  W^l \in B^{K^l \times K^{l-1}},  b^l \in B^{K^l},    (3)

where B is the set of bipolar binaries, i.e. ±1¹, and ⊗ stands for the bitwise XNOR operation (see Figure 1 (a)). l, j, and i indicate a layer and the input and output units of that layer, respectively. We use bold characters for a vector (or a matrix, if capitalized). K^l is the number of input units at the l-th layer. Therefore, z^0 equals an input vector, where we omit the sample index for notational convenience. We use the sign activation function to generate the bipolar outputs.

¹ In the bipolar binary representation, +1 stands for the “TRUE” status, while −1 is for “FALSE.”
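To make (1)–(3) concrete, note that for bipolar bits the XNOR of two values is simply their product, so a whole layer reduces to a matrix product followed by the sign function. The following is a minimal NumPy sketch of that feedforward step; the array names and the tie-breaking choice for sign(0) are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def bnn_layer(W, b, z):
    """One bitwise layer, Eqs. (1)-(2); W, b, z contain only -1 and +1.

    For bipolar bits XNOR(w, z) == w * z, so the pre-activation is the
    bias plus a sum of elementwise products, i.e. b + W @ z.
    """
    a = b + W @ z                    # Eq. (1)
    return np.where(a >= 0, 1, -1)   # Eq. (2); ties at 0 mapped to +1

# Tiny example: a 4-3-2 bitwise network on a random bipolar input.
rng = np.random.default_rng(0)
z0 = rng.choice([-1, 1], size=4)
W1, b1 = rng.choice([-1, 1], size=(3, 4)), rng.choice([-1, 1], size=3)
W2, b2 = rng.choice([-1, 1], size=(2, 3)), rng.choice([-1, 1], size=2)
z2 = bnn_layer(W2, b2, bnn_layer(W1, b1, z0))   # output is again bipolar
```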
We can check the prediction error E by measuring the bitwise agreement of the target vector t and the output units of the L-th layer, using XNOR as a multiplication operator:

E = \sum_i^{K^{L+1}} \left(1 - t_i \otimes z_i^{L+1}\right) / 2,    (4)

but this error function can be tentatively replaced by involving a softmax layer during the training phase.
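In code, (4) is simply a count of disagreeing output bits (a Hamming distance), since 1 − t_i ⊗ z_i^{L+1} equals 2 for a mismatch and 0 for a match. A minimal NumPy sketch, with variable names assumed:

```python
import numpy as np

def bitwise_error(t, z_out):
    """Eq. (4): number of output bits that disagree with the bipolar target t."""
    return int(np.sum(1 - t * z_out)) // 2   # t * z_out == XNOR for +/-1 bits
```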
The XNOR operation is a faster substitute for binary multiplication. Therefore, (1) and (2) can be seen as a special version of the ordinary feedforward step that only works when the inputs, weights, and bias are all bipolar binaries. Note that in practice these bipolar bits will be implemented using 0/1 binary values, where the activation in (2) is equivalent to counting the number of 1's and then checking whether that count is bigger than half of the number of input units plus one. With no loss of generality, in this paper we use the ±1 bipolar representation, since it is more flexible for defining hyperplanes and examining the network behavior.
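The 0/1 implementation mentioned above can be sketched with ordinary integer words: pack each weight row and the input into bitmasks, take the XNOR (a negated XOR), count the 1's, and compare the count against half of the fan-in plus one. The helper names below and the use of Python integers as bit containers are illustrative assumptions, not the paper's implementation.

```python
def pack_bits(bipolar_values):
    """Pack a sequence of +/-1 values into an integer bitmask (+1 -> 1, -1 -> 0)."""
    word = 0
    for k, v in enumerate(bipolar_values):
        if v > 0:
            word |= 1 << k
    return word

def bnn_unit_01(w_bits, b_bit, z_bits, fan_in):
    """One output unit over packed 0/1 bits, equivalent to Eqs. (1)-(2).

    XNOR is ~(w ^ z) masked to the fan_in lowest bits; the unit outputs +1
    when the number of agreeing bits plus the bias bit reaches at least half
    of (fan_in + 1), which is the same test as sign() applied to Eq. (1).
    """
    mask = (1 << fan_in) - 1
    agree = bin(~(w_bits ^ z_bits) & mask).count("1") + b_bit
    return 1 if 2 * agree >= fan_in + 1 else -1
```

On 64-bit hardware this amounts to one XNOR and one population-count instruction per 64 weights, which is the kind of saving the feedforward description above is pointing at.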
Sometimes a BNN can solve the same problem as a real-valued network without any size modifications, but in general we should expect that a BNN could require a larger network structure than a real-valued one. For example, the XOR problem in Figure 1 (b) can have an infinite number of solutions with real-valued parameters once a pair [...]
instead come up with a solution by combining multiple binary hyperplanes that will eventually increase the perceived complexity of the model. However, even with a larger number of nodes, the BNN is not necessarily more complex than the smaller real-valued network. This is because a parameter or a node of a BNN requires only one bit to represent, while a real-valued node generally requires more than that, up to 64 bits. Moreover, the simple XNOR and bit-counting operations of a BNN bypass the computational complications [...]

3. Training Bitwise Neural Networks

We first train some compressed network parameters, and then retrain them using noisy backpropagation for BNNs.
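The details of the two training stages are not reproduced in this excerpt, but the gradients below and the setup in Section 4 indicate that the compression passes each real-valued parameter through tanh, keeping it in (−1, 1), and that the resulting values are then snapped to ±1 (presumably by their sign) to initialize the BNN. A hedged sketch of that conversion, with function names of my own:

```python
import numpy as np

def compress(W_real):
    """First stage: use tanh(W_real), which lies in (-1, 1), in place of the
    unconstrained real-valued weights during ordinary network training."""
    return np.tanh(W_real)

def binarize(W_real):
    """Second-stage initialization: snap the compressed weights to bipolar
    bits; ties at 0 go to +1, matching the sign convention used earlier."""
    return np.where(np.tanh(W_real) >= 0, 1, -1)
```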
Note that the errors from the next layer are multiplied with the compressed versions of the weights. Hence, the gradients of the parameters in the case of batch learning are

\nabla \bar{w}_{i,j}^l = \sum_n \delta_i^l(n) \, \bar{z}_j^{l-1} \cdot \left(1 - \tanh^2 \bar{w}_{i,j}^l\right),
\nabla \bar{b}_i^l = \sum_n \delta_i^l(n) \cdot \left(1 - \tanh^2 \bar{b}_i^l\right),
with the additional term from the chain rule on the compressed weights.
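These batch gradients translate directly into array operations. In the minimal NumPy sketch below, `delta` stacks the per-sample errors δ_i^l(n), `z_prev` the corresponding compressed inputs to the layer, and `W_bar`, `b_bar` are the real-valued parameters before tanh compression; the names are assumptions, and the computation of `delta` itself belongs to the part of the derivation not shown in this excerpt.

```python
import numpy as np

def compressed_gradients(delta, z_prev, W_bar, b_bar):
    """Batch gradients for the tanh-compressed parameters.

    delta  : (N, K_l)     errors delta_i^l(n), one row per sample n
    z_prev : (N, K_lm1)   compressed activations feeding the layer
    W_bar  : (K_l, K_lm1) real-valued weights (before tanh)
    b_bar  : (K_l,)       real-valued biases  (before tanh)
    """
    grad_W = (delta.T @ z_prev) * (1.0 - np.tanh(W_bar) ** 2)  # chain-rule term
    grad_b = delta.sum(axis=0) * (1.0 - np.tanh(b_bar) ** 2)
    return grad_W, grad_b
```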
4. Experiments

In this section we go over the details and results of the hand-written digit recognition task on the MNIST data set (LeCun et al., 1998) using the proposed BNN system. Throughout the training, we adopt the softmax output layer for these multiclass classification cases. All the networks have three hidden layers with 1024 units per layer.
From the first round of training, we get a regular dropout network with the same settings suggested in (Srivastava et al., 2014), except for the fact that we used the hyperbolic tangent for both weight compression and activation to make [...]
et al., 2014), except the fact that we used the hyperbolic vestigate a bitwise version of convolutive neural networks,
tangent for both weight compression and activation to make where efficient computing is more desirable.