
Bitwise Neural Networks

Minje Kim    MINJE@ILLINOIS.EDU


Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
Paris Smaragdis    PARIS@ILLINOIS.EDU
University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
Adobe Research, Adobe Systems Inc., San Francisco, CA 94103, USA
arXiv:1601.06071v1 [cs.LG] 22 Jan 2016

Abstract

Based on the assumption that there exists a neural network that efficiently represents a set of Boolean functions between all binary inputs and outputs, we propose a process for developing and deploying neural networks whose weight parameters, bias terms, inputs, and intermediate hidden layer output signals are all binary-valued, and require only basic bit logic for the feedforward pass. The proposed Bitwise Neural Network (BNN) is especially suitable for resource-constrained environments, since it replaces either floating- or fixed-point arithmetic with significantly more efficient bitwise operations. Hence, the BNN requires less spatial complexity, less memory bandwidth, and less power consumption in hardware. In order to design such networks, we propose a few training schemes, such as weight compression and noisy backpropagation, which result in a bitwise network that performs almost as well as its corresponding real-valued network. We test the proposed network on the MNIST dataset, represented using binary features, and show that BNNs result in competitive performance while offering dramatic computational savings.

Proceedings of the 31st International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

According to the universal approximation theorem, a single hidden layer with a finite number of units can approximate a continuous function under some mild assumptions (Cybenko, 1989; Hornik, 1991). While this theorem implies that a shallow network may need a potentially intractable number of hidden units when it comes to modeling a complicated function, Deep Neural Networks (DNN) achieve the goal by learning a hierarchy of features in their multiple layers (Hinton et al., 2006; Bengio, 2009).

Although DNNs are extending the state-of-the-art results for various tasks, such as image classification (Goodfellow et al., 2013), speech recognition (Hinton et al., 2012), and speech enhancement (Xu et al., 2014), it is also the case that the relatively bigger networks with more parameters than before call for more resources (processing power, memory, battery time, etc.), which are sometimes critically constrained in applications running on embedded devices. Examples of those applications span from context-aware computing, collecting and analysing a variety of sensor signals on the device (Baldauf et al., 2007), to always-on computer vision applications (e.g. Google Glass), to speech-driven personal assistant services, such as "Hey, Siri." A primary concern that hinders those applications from being more successful is that they assume an always-on pattern recognition engine on the device, which will drain the battery quickly unless it is carefully implemented to minimize the use of resources. Additionally, even in an environment where the necessary resources are available, speeding up a DNN can greatly improve the user experience when it comes to tasks like searching big databases (Salakhutdinov & Hinton, 2009). In either case, a more compact yet still well-performing DNN is a welcome improvement.

Efficient computational structures for deploying artificial neural networks have long been studied in the literature. Most of the effort is focused on training networks whose weights can be transformed into some quantized representation with a minimal loss of performance (Fiesler et al., 1990; Hwang & Sung, 2014). These approaches typically use the quantized weights in the feedforward step at every training iteration, so that the trained weights are robust to the known quantization noise caused by the limited precision. It was also shown that 10 bits for the gradients and 12 bits for storing the weights are enough to implement, and even train, the state-of-the-art maxout networks (Courbariaux et al., 2014). However, in those quantized networks one still needs to employ arithmetic operations, such as multiplication and addition, on fixed-point values. Even though these are faster than floating-point operations, they still require relatively complex logic and can consume a lot of power.
With the proposed Bitwise Neural Networks (BNN), we take a more extreme view: every input node, output node, and weight is represented by a single bit. For example, a weight matrix between two hidden layers of 1024 units is a 1024 × 1025 matrix of binary values rather than of quantized real values (including the bias). Although learning those bitwise weights as a Boolean concept is an NP-complete problem (Pitt & Valiant, 1988), bitwise networks have been studied in limited settings, such as µ-perceptron networks, where an input node is allowed to be connected to one and only one hidden node and the final layer is a union of those hidden nodes (Golea et al., 1992). A more practical network was recently proposed in (Soudry et al., 2014), where the posterior probabilities of the binary weights are sought using the Expectation Back Propagation (EBP) scheme, which is similar to backpropagation in its form, but has some advantages, such as parameter-free learning and a straightforward discretization of the weights. Its promising results on binary text classification tasks, however, rely on real-valued bias terms and on averaging the predictions from differently sampled parameters.

This paper presents a completely bitwise network where all participating variables are bipolar binaries. Therefore, its feedforward pass uses only XNOR and bit-counting operations instead of multiplication, addition, and a nonlinear activation on floating- or fixed-point variables. For training, we propose a two-stage approach, whose first part is typical network training with a weight compression technique that helps the real-valued model be converted easily into a BNN. To train the actual BNN, we use those compressed weights to initialize the BNN parameters, and do noisy backpropagation based on the tentative bitwise parameters. To binarize the input signals, we can adopt any binarization technique, e.g. fixed-point representations or hash codes. Regardless of the binarization scheme, each input node is given only a single bit at a time, as opposed to a bit packet representing a fixed-point number. This is significantly different from networks with quantized inputs, where a real-valued signal is quantized into a set of bits, and then all those bits are fed to one input node in place of their corresponding single real value. Lastly, we apply the sign function as our activation function instead of a sigmoid, to make sure the input to the next layer is bipolar binary as well. We compare the performance of the proposed BNN with that of corresponding ordinary real-valued networks on hand-written digit recognition tasks, and show that the bitwise operations can do the job with a very small performance loss, while providing a large margin of improvement in terms of the necessary computational resources.

2. Feedforward in Bitwise Neural Networks

It has long been known that any Boolean function, which takes binary values as input and produces binary outputs as well, can be represented as a bitwise network with one hidden layer (McCulloch & Pitts, 1943), for example by merely memorizing all the possible mappings between input and output patterns. We define the forward propagation procedure as follows, based on the assumption that we have trained such a network with bipolar binary parameters:

    a_i^l = b_i^l + \sum_{j}^{K^{l-1}} w_{i,j}^l \otimes z_j^{l-1},        (1)
    z_i^l = \mathrm{sign}(a_i^l),                                         (2)
    z^l \in \mathbb{B}^{K^l}, \quad W^l \in \mathbb{B}^{K^l \times K^{l-1}}, \quad b^l \in \mathbb{B}^{K^l},   (3)

where \mathbb{B} is the set of bipolar binaries, i.e. ±1 (in the bipolar binary representation, +1 stands for the "TRUE" status, while −1 is for "FALSE"), and ⊗ stands for the bitwise XNOR operation (see Figure 1 (a)). l, j, and i indicate a layer and the input and output units of that layer, respectively. We use bold characters for a vector (or a matrix if capitalized). K^l is the number of input units of the l-th layer. Therefore, z^0 equals an input vector, where we omit the sample index for notational convenience. We use the sign activation function to generate the bipolar outputs.

We can check the prediction error E by measuring the bitwise agreement between the target vector t and the output units of the L-th layer, using XNOR as a multiplication operator,

    E = \sum_{i}^{K^{L+1}} \big(1 - t_i \otimes z_i^{L+1}\big) / 2,        (4)

but this error function can be tentatively replaced by a softmax layer during the training phase.
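To make the bipolar arithmetic concrete, here is a minimal NumPy sketch of the feedforward pass (1)-(2) and the bit error (4). It assumes inputs, weights, biases, and targets are stored as ±1 integer arrays, in which case the XNOR in (1) reduces to an elementwise product; the function and variable names are ours, not the paper's.

```python
import numpy as np

def bnn_feedforward(x, weights, biases):
    """Bipolar bitwise feedforward, Eqs. (1)-(3).

    x       : +/-1 input vector (z^0)
    weights : list of +/-1 matrices W^l, each of shape (K^l, K^{l-1})
    biases  : list of +/-1 vectors b^l, each of shape (K^l,)

    For +/-1 values XNOR(a, b) == a * b, so the bitwise accumulation in
    Eq. (1) is emulated here with an ordinary matrix product.
    """
    z = x
    for W, b in zip(weights, biases):
        a = b + W @ z                 # Eq. (1): agreements minus disagreements
        z = np.where(a >= 0, 1, -1)   # Eq. (2): sign activation (ties mapped to +1)
    return z

def bitwise_error(t, z):
    """Eq. (4): number of output bits that disagree with the target t."""
    return int(np.sum((1 - t * z) // 2))
```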
The XNOR operation is a faster substitute for binary multiplication. Therefore, (1) and (2) can be seen as a special version of the ordinary feedforward step that only works when the inputs, weights, and bias are all bipolar binaries. Note that these bipolar bits will in practice be implemented using 0/1 binary values, where the activation in (2) is equivalent to counting the number of 1's and then checking whether that count is bigger than half of the number of input units plus 1. With no loss of generality, in this paper we use the ±1 bipolar representation, since it is more flexible for defining hyperplanes and for examining the network behavior.
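The 0/1 view above can be sketched for a single unit as follows, assuming (as an illustration of ours, not the paper's implementation) that a weight row and the incoming layer are packed into k-bit integers: an XNOR of the packed words followed by a population count replaces the summation in (1), and the unit fires when at least half of the k + 1 contributing bits agree.

```python
def bnn_unit_0_1(w_bits, z_bits, b_bit, k):
    """One unit of Eqs. (1)-(2) on 0/1 bits packed into integers.

    w_bits, z_bits : k-bit integers holding a weight row and the input layer
    b_bit          : the 0/1 bias bit
    k              : number of input units (word length)
    """
    agree = ~(w_bits ^ z_bits) & ((1 << k) - 1)   # bitwise XNOR, masked to k bits
    ones = bin(agree).count("1") + b_bit          # bit counting replaces the summation
    # fire (+1 in the bipolar view) when at least half of the k+1 contributing
    # bits (weights plus bias) agree, matching sign(a) >= 0 in the sketch above
    return 1 if 2 * ones >= k + 1 else 0
```

On actual hardware the packed words would be spread across machine-sized registers, but the cost stays at one XNOR and one popcount per word.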
their corresponding single real value. Lastly, we apply the
sign function as our activation function instead of a sig- Sometimes a BNN can solve the same problem as a real-
moid to make sure the input to the next layer is bipolar bi- valued network without any size modifications, but in gen-
nary as well. We compare the performance of the proposed eral we should expect that a BNN could require larger net-
BNN with its corresponding ordinary real-valued networks work structures than a real-valued one. For example, the
on hand-written digit recognition tasks, and show that the XOR problem in Figure 1 (b) can have an infinite num-
bitwise operations can do the job with a very small perfor- ber of solutions with real-valued parameters once a pair
mance loss, while providing a large margin of improvement 1
In the bipolar binary representation, +1 stands for the
in terms of the necessary computational resources. “TRUE” status, while −1 is for “FALSE.”
Among all the possible solutions, we can see that binary weights and biases are enough to define the hyperplanes, x1 − x2 + 1 > 0 and −x1 + x2 + 1 > 0 (the dashed lines). Likewise, the particular BNN defined in Figure 1 (c) has the same classification power once the inputs are binary as well.

Figure 1. (a) An XNOR table. (b) The XOR problem that needs two hyperplanes. (c) A multi-layer perceptron that solves the XOR problem. (d) A linearly separable problem for which bitwise networks nevertheless need two hyperplanes (y = x2). (e) A bitwise network with zero weights that solves the y = x2 problem.
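As a quick sanity check on this example, the short script below wires the two binary hyperplanes x1 − x2 + 1 > 0 and −x1 + x2 + 1 > 0 into a hidden layer and combines them with a bipolar output unit of our own choosing (the exact parameters drawn in Figure 1 (c) may differ); it separates (1, 1) and (−1, −1) from (1, −1) and (−1, 1) using only ±1 parameters.

```python
import numpy as np
from itertools import product

def forward(x, layers):
    """Bipolar feedforward as in Eqs. (1)-(2); layers = [(W, b), ...]."""
    z = x
    for W, b in layers:
        z = np.where(b + W @ z >= 0, 1, -1)
    return z

# Hidden layer: the two binary hyperplanes from the text,
#   h1 = sign( x1 - x2 + 1),   h2 = sign(-x1 + x2 + 1).
W1, b1 = np.array([[1, -1], [-1, 1]]), np.array([1, 1])
# Output layer: fires only when both hyperplanes agree (our own choice of
# bipolar values; Figure 1 (c) may use different ones).
W2, b2 = np.array([[1, 1]]), np.array([-1])

for x in product([-1, 1], repeat=2):
    y = forward(np.array(x), [(W1, b1), (W2, b2)])
    print(x, int(y[0]))   # +1 for (1, 1) and (-1, -1), -1 for the mixed pairs
```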
Figure 1 (d) shows another example where a BNN requires more hyperplanes than a real-valued network. This linearly separable problem is solvable with only one hyperplane, such as −0.1x1 + x2 + 0.5 > 0, but it is impossible to describe such a hyperplane with binary coefficients. We can instead come up with a solution by combining multiple binary hyperplanes, which will eventually increase the perceived complexity of the model. However, even with a larger number of nodes, the BNN is not necessarily more complex than the smaller real-valued network. This is because a parameter or a node of a BNN requires only one bit to represent, while a real-valued one generally requires more than that, up to 64 bits. Moreover, the simple XNOR and bit-counting operations of the BNN bypass the computational complications of a real-valued system, such as the power consumption of multipliers and adders for floating-point operations, the various dynamic ranges of fixed-point representations, erroneous flips of the most significant bits, etc. Note that if the bitwise parameters are sparse, we can further reduce the number of hyperplanes. For example, for an inactive element in the weight matrix W due to the sparsity, we can simply skip its computation, similarly to operations on sparse representations. Conceptually, we can say that those inactive weights serve as zero weights, so that a BNN can solve the problem in Figure 1 (d) by using only one hyperplane, as in (e). From now on, we will use this extended version of the BNN with inactive weights, yet there are still cases where the BNN needs more hyperplanes than a real-valued network even with the sparsity.

3. Training Bitwise Neural Networks

We first train compressed network parameters, and then retrain them using noisy backpropagation for BNNs.

3.1. Real-valued Networks with Weight Compression

First, we train a real-valued network that takes either bitwise inputs or real-valued inputs ranged between −1 and +1. A special part of this network is that we constrain the weights to have values between −1 and +1 as well, by wrapping them with tanh. Similarly, if we choose tanh for the activation, we can say that the network is a relaxed version of the corresponding bipolar BNN. With this weight compression technique, the relaxed forward pass during training is defined as follows:

    a_i^l = \tanh(\bar{b}_i^l) + \sum_{j}^{K^{l-1}} \tanh(\bar{w}_{i,j}^l) \, \bar{z}_j^{l-1},   (5)
    \bar{z}_i^l = \tanh(a_i^l),                                                                 (6)

where all the binary values in (1) and (2) are real for the time being: \bar{W}^l \in \mathbb{R}^{K^l \times K^{l-1}}, \bar{b}^l \in \mathbb{R}^{K^l}, and \bar{z}^l \in \mathbb{R}^{K^l}. The bars on top of the notation mark this distinction.

Weight compression needs some changes in the backpropagation procedure. In a hidden layer we calculate the error as

    \delta_j^l(n) = \Big( \sum_{i}^{K^{l+1}} \tanh(\bar{w}_{i,j}^{l+1}) \, \delta_i^{l+1}(n) \Big) \cdot \big( 1 - \tanh^2(a_j^l) \big).

Note that the errors from the next layer are multiplied with the compressed versions of the weights. Hence, the gradients of the parameters in the case of batch learning are

    \nabla \bar{w}_{i,j}^l = \sum_{n} \delta_i^l(n) \, \bar{z}_j^{l-1} \cdot \big( 1 - \tanh^2(\bar{w}_{i,j}^l) \big),
    \nabla \bar{b}_i^l = \sum_{n} \delta_i^l(n) \cdot \big( 1 - \tanh^2(\bar{b}_i^l) \big),

with the additional term from the chain rule on the compressed weights.
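Below is a minimal sketch of the relaxed forward pass (5)-(6) together with the batch gradients carrying the extra (1 − tanh²) chain-rule factor; the per-layer errors `delta` are assumed to have been backpropagated already as in the equation above, and all function and variable names are ours.

```python
import numpy as np

def compressed_forward(x, Wbar, bbar):
    """Relaxed forward pass, Eqs. (5)-(6): weights and biases are squashed
    into (-1, 1) by tanh, and tanh also serves as the activation."""
    zs = [x]
    for W, b in zip(Wbar, bbar):
        a = np.tanh(b) + np.tanh(W) @ zs[-1]   # Eq. (5)
        zs.append(np.tanh(a))                  # Eq. (6)
    return zs

def compressed_gradients(delta, z_prev, Wbar_l, bbar_l):
    """Batch gradients for layer l with the additional (1 - tanh^2) factor
    from differentiating through the compressed parameters.

    delta  : (N, K^l) errors delta_i^l(n) for this layer
    z_prev : (N, K^{l-1}) activations z^{l-1} from the layer below
    """
    grad_W = (delta.T @ z_prev) * (1.0 - np.tanh(Wbar_l) ** 2)
    grad_b = delta.sum(axis=0) * (1.0 - np.tanh(bbar_l) ** 2)
    return grad_W, grad_b
```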
3.2. Training BNN with Noisy Backpropagation

Since we have trained a real-valued network with a proper range of weights, what we do next is to train the actual bitwise network. The training procedure is similar to the ones with quantized weights (Fiesler et al., 1990; Hwang & Sung, 2014), except that the values we deal with are all bits, and the operations on them are bitwise. To this end, we first initialize all the real-valued parameters, \bar{W} and \bar{b}, with the ones learned in the previous section. Then, we set a sparsity parameter λ, which specifies the proportion of zeros after the binarization, and divide the parameters into three groups: +1, 0, or −1. Therefore, λ decides the boundary β, e.g. w_{i,j}^l = −1 if \bar{w}_{i,j}^l < −β. Note that the number of zero weights, i.e. those with |\bar{w}_{i,j}^l| < β, equals λ K^l K^{l-1}.

The main idea of this second training phase is to feedforward using the binarized weights and the bit operations as in (1) and (2). Then, during noisy backpropagation, the errors and gradients are calculated using those binarized weights and signals as well:

    \delta_j^l(n) = \sum_{i}^{K^{l+1}} w_{i,j}^{l+1} \, \delta_i^{l+1}(n),
    \nabla \bar{w}_{i,j}^l = \sum_{n} \delta_i^l(n) \, z_j^{l-1}, \quad \nabla \bar{b}_i^l = \sum_{n} \delta_i^l(n).   (7)

In this way, the gradients and errors properly take the binarization of the weights and the signals into account. Since the gradients can get too small to update the binary parameters W and b, we instead update their corresponding real-valued parameters,

    \bar{w}_{i,j}^l \leftarrow \bar{w}_{i,j}^l - \eta \nabla \bar{w}_{i,j}^l, \quad \bar{b}_i^l \leftarrow \bar{b}_i^l - \eta \nabla \bar{b}_i^l,   (8)

with η as a learning rate parameter. Finally, at the end of each update we binarize the parameters again with β. We repeat this procedure at every epoch.
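The whole second stage can be summarized by the sketch below, which is our reading of the procedure rather than the authors' code: binarize the compressed parameters into {−1, 0, +1} using the β threshold implied by the sparsity λ, compute errors and gradients with the binarized values as in (7), update the real-valued copies as in (8), and re-binarize. The `grad_fn` routine standing in for the bitwise forward pass and Eq. (7) is hypothetical, and re-binarizing before every mini-batch (rather than exactly once per update cycle) is a simplification.

```python
import numpy as np

def binarize(Wbar, sparsity):
    """Map compressed weights to {-1, 0, +1}: the threshold beta is chosen so
    that roughly a `sparsity` fraction of the entries becomes zero (inactive)."""
    beta = np.quantile(np.abs(Wbar), sparsity)
    W = np.sign(Wbar)
    W[np.abs(Wbar) < beta] = 0
    return W

def noisy_backprop_epoch(Wbar, bbar, batches, grad_fn, lr=0.001, sparsity=0.5):
    """One epoch of the second training stage (Section 3.2), as we read it.

    Feedforward and gradients use the binarized parameters, but the updates
    of Eq. (8) are applied to the real-valued copies Wbar, bbar, which are
    then re-binarized.  `grad_fn` is a hypothetical routine that runs the
    bitwise forward pass and returns per-layer (grad_W, grad_b) via Eq. (7).
    """
    for x, t in batches:
        W = [binarize(w, sparsity) for w in Wbar]   # tentative bitwise weights
        b = [np.sign(v) for v in bbar]              # biases kept bipolar here
        grads_W, grads_b = grad_fn(W, b, x, t)      # errors/gradients from Eq. (7)
        for l in range(len(Wbar)):
            Wbar[l] -= lr * grads_W[l]              # Eq. (8): update real values
            bbar[l] -= lr * grads_b[l]
    return Wbar, bbar
```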
4. Experiments

In this section we go over the details and results of the hand-written digit recognition task on the MNIST data set (LeCun et al., 1998) using the proposed BNN system. Throughout the training, we adopt a softmax output layer for these multiclass classification cases. All the networks have three hidden layers with 1024 units per layer.

Table 1. Classification errors for real-valued and bitwise networks on different types of bitwise features.

    Networks                             Bipolar    0 or 1    Fixed-point (2 bits)
    Floating-point networks (64 bits)    1.17%      1.32%     1.36%
    BNN                                  1.33%      1.36%     1.47%

From the first round of training, we get a regular dropout network with the same settings suggested in (Srivastava et al., 2014), except that we used the hyperbolic tangent for both weight compression and activation, to make the network suitable for initializing the following bipolar bitwise network. Between 500 and 1,000 training iterations were enough to build a baseline. The first row of Table 1 shows the performance of the baseline real-valued network with 64-bit floating-point parameters. As for the input to the real-valued networks, we rescale the pixel intensities into the bipolar range, i.e. from −1 to +1, for the bipolar case (the first column). In the second column, we use the original input between 0 and 1 as it is. For the third column, we encode the four equally spaced regions between 0 and 1 into two bits, and feed each bit into its own input node. Hence, the baseline network for the third input type has 1,568 binary input nodes rather than 784 as in the other cases.

Once we have learned the real-valued parameters, we train the BNN, but with binarized inputs. For instance, instead of real values between −1 and +1 in the bipolar case, we take their signs as the bipolar binary features. As for the 0/1 binaries, we simply round the pixel intensity. Fixed-point inputs are already binarized. We then train the new BNN with the noisy backpropagation technique described in Section 3.2. The second row of Table 1 shows the BNN results. We see that the bitwise networks perform well, with only a very small additional error. Note that the performance of the original real-valued dropout network with a similar network topology (logistic units without the max-norm constraint) is 1.35%.
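For concreteness, the sketch below spells out the three input binarizations described above. The exact layout of the two-bit fixed-point code (which bit of the region index goes to which input node) is not specified in the text, so the msb/lsb split here is only one plausible choice, and the resulting 0/1 bits would be mapped to ±1 for the bipolar BNN.

```python
import numpy as np

def binarize_inputs(pixels, scheme="bipolar"):
    """The three input binarizations used in the experiments.

    `pixels` holds MNIST intensities already scaled to [0, 1], with the
    784 pixels on the last axis.
    """
    if scheme == "bipolar":
        # rescale to [-1, +1] and keep only the sign (ties mapped to +1)
        return np.where(2.0 * pixels - 1.0 >= 0, 1, -1)
    if scheme == "zero_one":
        # simply round each intensity to 0 or 1
        return np.round(pixels).astype(int)
    if scheme == "fixed_point":
        # encode the four equally spaced regions of [0, 1] with two bits per
        # pixel, turning 784 input nodes into 1,568
        level = np.minimum((pixels * 4).astype(int), 3)
        msb, lsb = level // 2, level % 2
        return np.stack([msb, lsb], axis=-1).reshape(*pixels.shape[:-1], -1)
    raise ValueError(f"unknown scheme: {scheme}")
```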
for resource-constrained situations, particularly in cases
From the first round of training, we get a regular dropout where floating-point / fixed-point variables and operations
network with the same setting suggested in (Srivastava are prohibitively expensive. In the future we plan to in-
et al., 2014), except the fact that we used the hyperbolic vestigate a bitwise version of convolutive neural networks,
tangent for both weight compression and activation to make where efficient computing is more desirable.

References

Baldauf, M., Dustdar, S., and Rosenberg, F. A survey on context-aware systems. International Journal of Ad Hoc and Ubiquitous Computing, 2(4):263–277, January 2007.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Courbariaux, M., Bengio, Y., and David, J.-P. Low precision arithmetic for deep learning. arXiv preprint arXiv:1412.7024, 2014.

Cybenko, G. Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.

Fiesler, E., Choudry, A., and Caulfield, H. J. Weight discretization paradigm for optical neural networks. In The Hague '90, 12–16 April, pp. 164–173. International Society for Optics and Photonics, 1990.

Golea, M., Marchand, M., and Hancock, T. R. On learning µ-perceptron networks with binary weights. In Advances in Neural Information Processing Systems (NIPS), pp. 591–598, 1992.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Hwang, K. and Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

McCulloch, W. S. and Pitts, W. H. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Pitt, L. and Valiant, L. G. Computational limitations on learning from examples. Journal of the Association for Computing Machinery, 35:965–984, 1988.

Salakhutdinov, R. and Hinton, G. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.

Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems (NIPS), 2014.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, January 2014.

Xu, Y., Du, J., Dai, L.-R., and Lee, C.-H. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters, 21(1):65–68, 2014.
