Deep Learning 2.0: Artificial Neurons That Matter - Reject Correlation, Embrace Orthogonality
Taha Bouhsine
MLNomads
Agadir, Morocco
yat@mlnomads.com
interdependent data relationships without distorting the geometric topology of the data. By eliminating the need for activation functions, NMNs simplify network architecture, making it possible to interpret weight dependencies directly without sacrificing the model's capacity to learn complex patterns.

Our work also introduces the Neural-Matter State (NMS) Plots, a new framework for visualizing and interpreting weight distributions within NMNs. This approach attempts, for the first time, to provide a framework for exploring the "black box" of neural networks, offering insights into the organization and significance of learned weights. The NMS Plot reveals not only how individual weights contribute to model predictions but also how weight clusters influence factors like overfitting and feature uniqueness. This advance holds significant implications for enhancing the interpretability and trustworthiness of AI systems as they are increasingly applied to critical decision-making tasks.

Additionally, we introduce the ⵟ-regularizer, inspired by the SimO loss proposed by Bouhsine et al. [1]. This regularizer addresses neural collapse, where neurons in the model become too close together in the neuron space. By promoting orthogonality among neurons, the ⵟ-regularizer optimizes their spatial arrangement so that they are positioned further apart from one another and are not linearly dependent on each other.

Our contributions are as follows:

2. Theoretical Foundation

2.1. Multi-Layer Perceptron: Theoretical Analysis

2.1.1 Fundamental Components

Consider an input vector x ∈ R^n entering an MLP layer. Each neuron is characterized by a weight vector w ∈ R^n and a bias term b ∈ R. The transformation process consists of two primary operations:

1. Affine Transformation: The neuron computes a weighted sum followed by a bias addition:

z = w^T x + b = \sum_{i=1}^{n} w_i x_i + b    (1)

2. Non-linear Activation: The affine result undergoes a non-linear transformation via a function f : R → R:

y = f(z) = f(w^T x + b)    (2)

For a layer containing m neurons, we express the operation in matrix notation. Let W ∈ R^{m×n} denote the weight matrix, where each row corresponds to a neuron's weight vector, and let b ∈ R^m represent the bias vector. For a batch of k input samples X ∈ R^{k×n}, the layer output Y ∈ R^{k×m} is computed as:

Y = f(XW^T + b)    (3)

In traditional MLP architectures, vector similarity is computed between the input vector x and each weight vector w_i using the dot product. For neuron i:

z_i = x · w_i = \sum_{j=1}^{n} x_j w_{ij}    (4)
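To make the layer operation concrete, the following minimal numpy sketch computes Equations (1)-(3) for a small batch. The shapes, the random values, and the choice of ReLU for f are illustrative assumptions, not part of the original formulation.

import numpy as np

# Minimal sketch of Equation (3): Y = f(X W^T + b), with ReLU standing in for f.
k, n, m = 4, 3, 2                 # batch size, input dimension, number of neurons
rng = np.random.default_rng(0)

X = rng.normal(size=(k, n))       # batch of k input samples, X in R^{k x n}
W = rng.normal(size=(m, n))       # one weight vector per row, W in R^{m x n}
b = rng.normal(size=(m,))         # bias vector, b in R^m

Z = X @ W.T + b                   # affine transformation, Equation (1) in matrix form
Y = np.maximum(Z, 0.0)            # non-linear activation f, Equations (2)-(3)
print(Y.shape)                    # (k, m)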
2.1.2 Architectural Limitations

Linear Similarity Constraints: The dot product's linearity in Euclidean space limits expressiveness. While ReLU introduces non-linearity, information is lost through the elimination of negative values.

Activation Range Issues: Unbounded dot-product outputs (−∞ to +∞) can lead to neuron dominance. For example, when a · w_1 = 100 versus a · w_2 = 0.1, ReLU preserves the magnitude difference, potentially overshadowing subtle patterns.

Topological Structure Preservation: The interleaving of linear and non-linear operations can obscure topological relationships within the embedding space, complicating interpretability and the preservation of structural information.

Dropout serves as a primary mitigation strategy by randomly zeroing activations:

a_dropout = m ⊙ a    (10)

where m is a binary mask and ⊙ denotes element-wise multiplication. For example:

a_original = [0, 1.5, 0, 122.1]    (11)
a_dropout = [0, 1.5, 0, 0]    (12)

However, dropout remains a probabilistic solution that does not address the fundamental limitations of the dot product, especially during inference. These limitations motivate the exploration of alternative similarity measures that operate in non-Euclidean spaces and capture both similarity and orthogonality in unified operations.

2.2. Bouhsine's Products (ⵟ-product and ⵟ-product)

2.2.1 Definition

The Bouhsine products [1] introduce two distinct similarity measures between vectors in R^n: the ⵟ-product and the ⵟ-product, defined for two vectors e_1, e_2 ∈ R^n as follows:

ⵟ-product (yat):

e_1 ⵟ e_2 = (e_1 · e_2)^2 / ||e_2 − e_1||^2    (13)

ⵟ-product (posi-yat):

e_1 ⵟ e_2 = ||e_2 − e_1||^2 / (e_1 · e_2)^2    (14)

Here, · denotes the standard dot product and || · || the Euclidean norm.

The Bouhsine products are pseudo-metric (for ⵟ) and semi-metric (for ⵟ) (Theorem 8.1), satisfying closure and commutativity while lacking other conventional algebraic properties such as associativity, distributivity, and the existence of an identity element. This unique structure enables the Bouhsine products to capture aspects of vector relationships that standard similarity metrics like the dot product may overlook.

2.2.2 Limitations of the Dot Product and Advantages of the ⵟ-Product

Traditional similarity measures, such as the dot product, often fall short in capturing comprehensive vector relationships, especially when vectors share the same direction but vary in magnitude. This limitation becomes particularly apparent when interpreting neuron weights geometrically, where each weight vector represents a specific reference in the embedding space.

Consider a set of neuron weight vectors (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (8, 8), and (9, 9), which are all parallel and thus point in the same direction. For a new point (6, 6), cosine similarity yields identical results for all of these vectors, since it emphasizes direction while ignoring magnitude, and the dot product simply grows with the magnitude of each weight vector. Both outcomes are misleading: intuitively, the point (6, 6) is closest to (5, 5) in both direction and magnitude. By disregarding relative proximity, the dot product and cosine similarity fail to differentiate how close (6, 6) is to the individual weight vectors.

The ⵟ-product addresses this limitation by incorporating both magnitude and distance information. Unlike the dot product, which is primarily based on angular alignment, the ⵟ-product is designed to account for both directional alignment and the relative distances between vectors. When the ⵟ-product is used to compare (6, 6) with each neuron vector, it correctly identifies (5, 5) as the closest match, offering a nuanced understanding that aligns with the intuitive notion of similarity.

Figure 1 visually demonstrates the differences between the dot product and the ⵟ-product. Plot (b) shows the dot product's tendency to favor larger vector magnitudes. In contrast, the ⵟ-product plot (c) effectively differentiates between vectors based on both magnitude and spatial proximity, underscoring its advantage in applications that require a holistic similarity measure.

3. Methods

We propose a novel approach to the Multilayer Perceptron (MLP) layer operation, replacing the traditional dot product with a custom product we call the ⵟ-product.
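Before detailing the layer, a minimal numpy sketch of the two products from Equations (13) and (14), applied to the (6, 6) example of Section 2.2.2, may help. The epsilon guard and the helper names are assumptions of this sketch; the full plotting code appears in the appendix (Code 8.3).

import numpy as np

def yat(e1, e2, eps=1e-6):
    # ⵟ-product (yat), Equation (13): squared dot product over squared distance.
    return np.dot(e1, e2) ** 2 / (np.linalg.norm(e2 - e1) ** 2 + eps)

def posi_yat(e1, e2, eps=1e-6):
    # ⵟ-product (posi-yat), Equation (14): squared distance over squared dot product.
    return np.linalg.norm(e2 - e1) ** 2 / (np.dot(e1, e2) ** 2 + eps)

neurons = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [8, 8], [9, 9]])
point = np.array([6, 6])

yat_scores = [yat(point, w) for w in neurons]
# The largest yat score is obtained for (5, 5), the weight vector closest to (6, 6).
print(neurons[int(np.argmax(yat_scores))])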
Figure 1. Comparison of similarity measurements for the test point (6, 6) with neuron weight vectors using (b) the dot product and (c) the ⵟ-product (Code 8.3); panel (a) shows the neurons and the test point in 2D space. In (b), the dot product scales with vector magnitude, often exaggerating similarity based on size alone. In (c), the ⵟ-product more accurately reflects relative magnitude and distance, correctly identifying (5, 5) as the closest match to (6, 6).

3.1. ⵟ-Neuron

In our proposed neuron, we replace the traditional dot product in the MLP layer with the ⵟ-product. For a single ⵟ-neuron (Code 8.3) with weight vector w ∈ R^n and input vector x ∈ R^n, the output y is computed as:

y = ⵙ ⵟ(w, x) + b

with ⵙ ∈ R the scale factor, equal to:

ⵙ = (n / log(1 + n))^α

The full output of the neuron is:

y = (n / log(1 + n))^α · (w · x)^2 / (||x − w||^2 + ε) + b

A ⵟ-neuron therefore produces a high output when there is a high ⵟ-score between it and the feature vector it is trying to attract.

We can express the operation in matrix form for a layer with m neurons and a batch of k input samples. Let W ∈ R^{m×n} be the weight matrix, where each row corresponds to a neuron's weight vector, and X ∈ R^{k×n} be the input matrix. The layer's output Y ∈ R^{k×m} is computed as:

Y_ij = ⵙ · ⵟ(W_i, X_j) + b_i

Y_ij = (n / log(1 + n))^α · (W_i · X_j)^2 / (||X_j − W_i||^2 + ε) + b_i

where W ∈ R^{m×n} is the weight matrix, X ∈ R^{k×n} is the input matrix, b ∈ R^m is the bias vector, and ⵙ ∈ R is the scale.

3.3. ⵟ-Regularization

We propose a ⵟ-regularizer based on the ⵟ-product intra-similarity minimization used by Bouhsine et al. for Anchor-Free Contrastive Learning (AFCL) [1]. This regularizer minimizes the ⵟ-similarity score between weight vectors, encouraging intra-orthogonality.

3.4. ⵟ-ViT

Our ⵟ-ViT retains the original design but uses ⵟ-neurons instead, omitting any activation functions after the layer.

3.4.1 ⵟ-MHA

3.4.2 Random Token Masking

x'_i = M_i · [MASK] + (1 − M_i) · x_i
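The masking rule above can be written as a single element-wise operation. The following short JAX sketch illustrates it; the mask rate, the [MASK] embedding, and the tensor shapes are assumptions of this sketch, since they are not specified in the text.

import jax
import jax.numpy as jnp

def random_token_masking(key, tokens, mask_embedding, mask_rate=0.15):
    # tokens: (num_tokens, dim); mask_embedding: (dim,) learnable [MASK] vector.
    # Draw a binary mask M_i ~ Bernoulli(mask_rate) for each token.
    m = jax.random.bernoulli(key, p=mask_rate, shape=(tokens.shape[0], 1))
    # x'_i = M_i * [MASK] + (1 - M_i) * x_i
    return m * mask_embedding + (1.0 - m) * tokens

key = jax.random.PRNGKey(0)
tokens = jnp.ones((8, 4))          # 8 tokens of dimension 4 (illustrative)
mask_embedding = jnp.zeros((4,))   # placeholder [MASK] embedding
masked = random_token_masking(key, tokens, mask_embedding)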
Model (dataset: size^2 / #cls / steps) | CIFAR10 [13] (32^2/10/390) | CIFAR100 [13] (32^2/100) | Caltech101 [14] (96^2/102/23) | Oxford Flowers [22] (224^2/102/15) | STL10 [3] (96^2/10/78)

Traditional Neuron with GeLU Activation Function
MLP     | (/4) 25.31% | (/4) 10.09% | (/8) 14.68% | 1.42%  | 23.22%
ViT-t   | (/4) 72.91% | (/4) 36.93% | (/8) 30.21% | 31.22% | 49.73%

Activation-Free ⵟ-Neuron
ⵟ-MLP   | (/4) 47.36% | (/4) 25.14% | (/8) 24.15% | 20.20% | 42.25%
ⵟ-ViT-t | (/4) 74.22% | (/4) 40.75% | (/8) 34.31% | 31.42% | 51.95%

Table 1. Comparison of test accuracy across multiple image classification datasets for architectures trained from scratch with traditional neurons and activation-free ⵟ-neurons. Results highlight the improved or comparable performance of ⵟ-neuron models, underscoring their effectiveness in both MLP and Vision Transformer (ViT) configurations.
Table 1 compares traditional and ⵟ-neuron models in both MLP and Vision Transformer (ViT) configurations. Across all tasks, models incorporating ⵟ-neurons demonstrate improved or comparable accuracy over traditional neuron models trained for 200 epochs, underscoring the potential of ⵟ-neurons to deliver higher performance with fewer architectural complexities.

For CIFAR-10, the traditional ViT-t architecture achieves a test accuracy of 72.91%, while the ⵟ-ViT-t model outperforms it slightly at 74.22%, highlighting the effectiveness of ⵟ-neurons in enhancing generalization without additional activation functions. On CIFAR-100, ⵟ-neuron models similarly outperform their traditional counterparts, with the ⵟ-ViT-t model achieving 40.75% accuracy compared to 36.93% for the traditional ViT-t. These consistent improvements reflect the robustness of the ⵟ-neuron approach across datasets of varying complexity.

In MLP configurations, the advantage of ⵟ-neurons is even more pronounced. The ⵟ-MLP model achieves 47.36% accuracy on CIFAR-10 and 25.14% on CIFAR-100, compared to the traditional MLP's 25.31% on CIFAR-10 and 10.09% on CIFAR-100. These results illustrate that ⵟ-neurons not only eliminate the need for explicit activation functions but also improve overall performance in simpler architectures, making them an efficient alternative for network design in resource-constrained settings.

On other datasets such as Caltech101 and STL10, ⵟ-ViT-t models maintain their performance advantage, with respective accuracies of 34.31% and 51.95%, outperforming the traditional ViT-t's scores of 30.21% and 49.73%. The performance gains observed in ⵟ-neuron models across these datasets reinforce the generalizability and scalability of ⵟ-neuron-based architectures.

5. Discussion and Analysis of ⵟ-Neuron Performance

The results highlight the capability of ⵟ-neurons to achieve high accuracy without relying on traditional activation functions. The ⵟ-product introduces implicit non-linearity, capturing complex data patterns in a manner that preserves more information than standard activation-based approaches. This novel non-linear processing method represents a shift in how neural networks interpret data, reducing the information loss typically introduced by activation functions (Table 1).

The ⵟ-neuron architecture enhances interpretability by embedding non-linearity directly within the ⵟ-product, thus preserving essential geometric relationships. This design enables a more intuitive understanding of neuron interactions and allows for straightforward visual representations of their behavior.

Additionally, ⵟ-neurons circumvent several stability challenges seen with traditional activation functions. Standard non-linearities can lead to gradient saturation and "dead" neurons [24]: saturating functions often result in vanishing gradients, while non-saturating ones can cause exploding gradients or neuron death, requiring additional techniques such as batch normalization and gradient clipping. The ⵟ-product avoids these issues by maintaining a non-saturating internal non-linearity, offering stable training dynamics and reducing the dependency on additional regularization techniques.

5.1. Artificial Neurons that Matter

The ⵟ-product draws intriguing analogies with physical laws, such as the inverse-square law, hinting at a potential new paradigm for understanding neural networks. Here, neurons are not merely linear separators but operate as geometric constructs within a non-linear manifold. This interpretation reimagines neural network layers as inherently physical, aligning them more closely with natural principles. Such a view opens pathways to network architectures that are both adaptable and more naturally aligned with fundamental scientific laws. Furthermore, including a learnable scale parameter mitigates the risk of weight explosion: without this parameter, weights grow excessively, leading to instability.
(a) Healthy (b) Healthy (c) Overfit (d) Overfit

Figure 6. Analysis of the neurons in the third layer of an MLP head trained on CIFAR-10, using a Neural-Matter State (NMS) plot: (a, b) show a well-fitting, healthy distribution of neuron specialization, while (c, d) show overfitting, which appears as a collapsed distribution.
I would also like to thank Dr. Andrew Ng for creating the Deep Learning course that introduced me to this field; without his efforts to democratize access to knowledge, this work would not have been possible. Additionally, I want to express my appreciation to all the communities I have been part of, especially the MLNomads, Google Developers, and MLCollective communities.

References

[1] Taha Bouhsine, Imad El Aaroussi, Atik Faysal, and Wang. SimO loss: Anchor-free contrastive loss for fine-grained supervised contrastive learning. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. Under review.
[2] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[3] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
[4] Charles-Augustin de Coulomb. Premier mémoire sur l'électricité et le magnétisme. Histoire de l'Académie Royale des Sciences, pages 1–31, 1785. In French.
[5] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv: Machine Learning, 2017.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[7] Carl Friedrich Gauss. Allgemeine Lehrsätze in Beziehung auf die im verkehrten Verhältniss des Quadrats der Entfernung wirkenden Anziehungs- und Abstossungskräfte. Dietrich, Göttingen, 1835.
[8] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
[9] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2023.
[10] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert L. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
[11] Roger David Joseph. Contributions to perceptron theory. Cornell University, 1961.
[12] Johannes Kepler. Ad vitellionem paralipomena, quibus astronomiae pars optica traditur. 1604. Johannes Kepler: Gesammelte Werke, ed. Walther von Dyck and Max Caspar, München, 1939.
[13] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[14] Fei-Fei Li, Marco Andreeto, Marc'Aurelio Ranzato, and Pietro Perona. Caltech 101, 2022.
[15] Zachary C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.
[16] Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay attention to MLPs, 2021. arXiv:2105.08050 [cs].
[17] Siegrid Löwel and Wolf Singer. Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255(5041):209–212, 1992.
[18] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[19] Fraser Mince, Dzung Dinh, Jonas Kgomo, Neil Thompson, and Sara Hooker. The grand illusion: The myth of software portability and implications for ML progress, 2023.
[20] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
[21] Isaac Newton. Philosophiæ Naturalis Principia Mathematica. S. Pepys, London, 1687.
[22] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[23] J. Orbach. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Archives of General Psychiatry, 7(3):218–219, 1962.
[24] Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks, 2021.
[25] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[26] Juergen Schmidhuber. Annotated history of modern AI and deep learning. arXiv preprint arXiv:2212.11279, 2022.
[27] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers, 2022.
[28] Stephen M. Stigler. Gauss and the invention of least squares. The Annals of Statistics, pages 465–474, 1981.
[29] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
Appendix
8.1. Geometric Topology
In geometric topology, various types of spaces are used to study notions of distance and convergence. Each type of space has a specific set of properties defined by a function known as a metric or a generalization thereof. We describe below metric spaces, semi-metric spaces, and pseudo-metric spaces, emphasizing the differences in their definitions. A metric satisfies non-negativity, the identity of indiscernibles, symmetry, and the triangle inequality; a pseudo-metric relaxes the identity of indiscernibles (distinct points may lie at distance zero), while a semi-metric drops the triangle inequality.

Theorem 8.1. Let

ⵟ(e_i, e_j) = o_ij / d^2_ij = (e_i · e_j)^2 / |e_i − e_j|^2

ⵟ(e_i, e_j) = d^2_ij / o_ij = |e_i − e_j|^2 / (e_i · e_j)^2

for e_i, e_j ∈ R^n \ {0} with e_i ≠ e_j, where d^2_ij = |e_i − e_j|^2 is the squared Euclidean distance and o_ij = (e_i · e_j)^2 is the squared dot product.

Then (R^n, ⵟ) is semi-metric, while (R^n, ⵟ) is pseudo-metric.
Proof. We structure this proof into four parts:
1. Preliminary observations and domain analysis
2. Proof of common properties for both measures
3. Proof that ⵟ is a pseudo-metric
4. Proof that ⵟ is a semi-metric
Part I: Preliminary Observations
Before proving the metric properties, we must establish the domain where these measures are well-defined:
1. For non-zero vectors e_i, e_j:
• d^2_ij = 0 ⟺ e_i = e_j
• o_ij = 0 ⟺ e_i ⊥ e_j (vectors are orthogonal)
2. Domain restrictions:
• ⵟ is defined when d^2_ij ≠ 0 (distinct vectors)
• ⵟ is defined when o_ij ≠ 0 (non-orthogonal vectors)
Part II: Common Properties
Both measures satisfy the following properties:
1. Non-negativity: Since both d^2_ij and o_ij are squared quantities, d^2_ij ≥ 0 and o_ij ≥ 0. Therefore:

ⵟ(e_i, e_j) ≥ 0 and ⵟ(e_i, e_j) ≥ 0
2. Identity of Indiscernibles: For both measures, we examine this bidirectionally.

(⇒) If e_i = e_j:
• d^2_ij = 0
• o_ij = |e_i|^4 > 0 (for non-zero vectors)
Therefore, ⵟ(e_i, e_j) = 0 and ⵟ(e_i, e_j) = 0.

(⇐) If ⵟ(e_i, e_j) = 0 or ⵟ(e_i, e_j) = 0:
• For ⵟ: o_ij / d^2_ij = 0 ⟹ o_ij = 0 (since d_ij ≠ 0 for distinct vectors)
• For ⵟ: d^2_ij / o_ij = 0 ⟹ d^2_ij = 0 (since o_ij ≠ 0 in the domain)
Hence the identity of indiscernibles holds for ⵟ but not for ⵟ.
3. Symmetry: Symmetry follows from the symmetry of the dot product and the Euclidean distance:

ⵟ(e_i, e_j) = (e_i · e_j)^2 / |e_i − e_j|^2 = (e_j · e_i)^2 / |e_j − e_i|^2 = ⵟ(e_j, e_i)
To analyze the ⵟ-product as a function of the angle θ between e_1 and e_2, write the squared dot product as

(e_1 · e_2)^2 = (||e_1|| ||e_2|| cos θ)^2 = ||e_1||^2 ||e_2||^2 cos^2 θ.

The squared Euclidean distance between e_1 and e_2 is:

||e_2 − e_1||^2 = ||e_1||^2 + ||e_2||^2 − 2 ||e_1|| ||e_2|| cos θ.

Substituting these expressions into the formula for e_1 ⵟ e_2, with A = ||e_1|| and B = ||e_2||, gives

f(θ) = e_1 ⵟ e_2 = A^2 B^2 cos^2 θ / (A^2 + B^2 − 2AB cos θ).

Differentiating f with respect to θ, each term in the numerator of the derivative has a factor of A^2 B^2 sin θ, which we can factor out.
Finally, to examine the triangle inequality, consider the counterexample

e_1 = (1, 0), e_2 = (0, 1), e_3 = (1, 1)

for which

ⵟ(e_1, e_2) = 0
ⵟ(e_2, e_3) = 1/1 = 1
ⵟ(e_1, e_3) = 2/1 = 2

which means:

ⵟ(e_1, e_3) ≥ ⵟ(e_1, e_2) + ⵟ(e_2, e_3)

Hence the triangle inequality does not hold for ⵟ.

Therefore, we conclude that (R^n, ⵟ) is pseudo-metric and (R^n, ⵟ) is semi-metric.
8.3. Code

# Plot the neurons and the test point in the 2D space
import numpy as np
import matplotlib.pyplot as plt

# Define the neurons as vectors
neurons = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [7, 7], [8, 8]])
test_point = np.array([6, 6])

plt.figure(figsize=(6, 6))

# Plot neurons as red dots
for i, neuron in enumerate(neurons):
    plt.plot(neuron[0], neuron[1], 'ro')
    plt.text(neuron[0] + 0.1, neuron[1], f'Neuron {i+1}', fontsize=12)

# Plot the test point as a blue dot
plt.plot(test_point[0], test_point[1], 'bo')
plt.text(test_point[0] + 0.1, test_point[1], 'Test point', fontsize=12)

# Set up the plot limits and labels (limits widened so every neuron is visible)
plt.xlim(0, 9)
plt.ylim(0, 9)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Neurons and Test Point in 2D Space')
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# Dot product function (used for the similarity comparison in Figure 1)
def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    return dot_product

# Yat-product function
def yat_product(v1, v2, epsilon=1e-6):
    dot_product_squared = np.dot(v1, v2) ** 2
    distance_squared = np.linalg.norm(v2 - v1) ** 2
    return dot_product_squared / (distance_squared + epsilon)

# Calculate cosine similarities and yat-products for every neuron
cosine_similarities = [cosine_similarity(test_point, neuron) for neuron in neurons]
yat_products = [yat_product(test_point, neuron) for neuron in neurons]
neuron_ids = range(1, len(neurons) + 1)

# Plot the cosine similarities
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(neuron_ids, cosine_similarities, marker='o', color='blue', label='Cosine Similarity')
plt.title("Cosine Similarity with (6, 6)")
plt.xlabel("Neuron")
plt.ylabel("Cosine Similarity")
plt.grid(True)
plt.xticks(list(neuron_ids))

# Plot the yat-products
plt.subplot(1, 2, 2)
plt.plot(neuron_ids, yat_products, marker='o', color='red', label='Yat-product')
plt.title("Yat-product with (6, 6)")
plt.xlabel("Neuron")
plt.ylabel("Yat-product")
plt.grid(True)
plt.xticks(list(neuron_ids))

plt.tight_layout()
plt.show()
import numpy as np

def yat_neuron(X, w, b):
    # Squared dot product between each input row and the weight vector
    dot_squared = np.dot(X, w) ** 2
    # Squared Euclidean distance between each input row and the weight vector
    distance_squared = np.sum((w - X) ** 2, axis=1)
    # Avoid division by zero by adding a small epsilon
    epsilon = 1e-6
    return dot_squared / (distance_squared + epsilon) + b
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)  # For reproducibility

w = np.random.randn(2)   # Random weights for a 2D input
b = np.random.randn()    # Random bias

# XOR dataset: inputs and corresponding outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # Inputs
y = np.array([0, 1, 1, 0])                      # Expected outputs for XOR

# Outputs of the untrained yat-neuron
outputs = yat_neuron(X, w, b)

# Print results
print(w)
print(b)
print(outputs)

# Initial parameters: [w1, w2, b]
initial_params = np.append(w, b)

# The optimization step is not shown in the extracted listing; a least-squares fit
# of the yat-neuron outputs to the XOR targets via scipy.optimize.minimize is
# assumed here as a plausible reconstruction.
def loss(params):
    return np.mean((yat_neuron(X, params[:2], params[2]) - y) ** 2)

result = minimize(loss, initial_params)

# Extract optimized weights and bias
optimized_params = result.x
optimized_weights = optimized_params[:2]
optimized_bias = optimized_params[2]
optimized_outputs = yat_neuron(X, optimized_weights, optimized_bias)

print('###############')
print(optimized_weights)
print(optimized_bias)
print(optimized_outputs)
import matplotlib.pyplot as plt

# Grid over the input space (the grid construction is missing from the extracted
# listing; a standard meshgrid reconstruction is used here)
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(-0.5, 1.5, 200))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Positive and negative XOR points, and the optimized parameters from above
X_pos, X_neg = X[y == 1], X[y == 0]
weights, bias = optimized_weights, optimized_bias

# Compute the neuron output for each point in the grid
Z = yat_neuron(grid_points, weights, bias)
Z = Z.reshape(xx.shape)
print(Z)
print(xx)

# Plot the decision boundary and XOR points
plt.figure(figsize=(6, 6))
plt.contourf(xx, yy, Z > 0.5, alpha=0.5, cmap='coolwarm')  # Decision boundary
plt.scatter(X_pos[:, 0], X_pos[:, 1], color='red', label='1', edgecolors='k')
plt.scatter(X_neg[:, 0], X_neg[:, 1], color='blue', label='0', edgecolors='k')
plt.title("XOR Problem: Decision Boundary (ⵟ-Neuron)")
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.grid(True)
plt.show()
from typing import Any, Optional

import jax.numpy as jnp
import jax.lax as lax
from flax import linen as nn
from flax.linen import Module, compact
from flax.linen.initializers import zeros_init

# Type alias for the optional custom dot_general callable (kept as Any here).
DotGeneralT = Any

class YatDense(Module):
    """A dense layer built on the ⵟ-product instead of the dot product.

    Attributes:
      features: the number of output features.
      use_bias: whether to add a bias to the output (default: True).
      dtype: the dtype of the computation.
      param_dtype: the dtype passed to parameter initializers (default: float32).
      precision: numerical precision of the computation.
      kernel_init: initializer function for the weight matrix.
      bias_init: initializer function for the bias.
      epsilon: small constant added to the squared distance to avoid division by zero.
    """
    features: int
    use_bias: bool = True
    dtype: Optional[Any] = None
    param_dtype: Any = jnp.float32
    precision: Any = None
    kernel_init: Any = nn.initializers.orthogonal()
    bias_init: Any = zeros_init()
    # Initialize alpha to 1.0
    alpha_init: Any = lambda key, shape, dtype: jnp.ones(shape, dtype)
    epsilon: float = 1e-6
    dot_general: Optional[DotGeneralT] = None
    dot_general_cls: Any = None
    return_weights: bool = False

    @compact
    def __call__(self, inputs: Any) -> Any:
        """Applies the ⵟ-product transformation to the inputs.

        Args:
          inputs: The nd-array to be transformed.
        Returns:
          The transformed input.
        """
        kernel = self.param(
            'kernel',
            self.kernel_init,
            (self.features, jnp.shape(inputs)[-1]),
            self.param_dtype,
        )
        alpha = self.param(
            'alpha',
            self.alpha_init,
            (1,),  # Single scalar parameter
            self.param_dtype,
        )
        if self.use_bias:
            bias = self.param(
                'bias', self.bias_init, (self.features,), self.param_dtype
            )
        else:
            bias = None

        # Dot product between the inputs and each neuron's weight vector
        y = lax.dot_general(
            inputs,
            jnp.transpose(kernel),
            (((inputs.ndim - 1,), (0,)), ((), ())),
            precision=self.precision,
        )
        # Squared Euclidean distances: ||x - w||^2 = ||x||^2 + ||w||^2 - 2 x.w
        inputs_squared_sum = jnp.sum(inputs ** 2, axis=-1, keepdims=True)
        kernel_squared_sum = jnp.sum(kernel ** 2, axis=-1)
        distances = inputs_squared_sum + kernel_squared_sum - 2 * y
        # Element-wise ⵟ-product: squared dot product over squared distance
        y = y ** 2 / (distances + self.epsilon)
        # Learnable scale factor
        scale = (jnp.sqrt(self.features) / jnp.log(1 + self.features)) ** alpha
        y = y * scale
        if bias is not None:
            y += jnp.reshape(bias, (1,) * (y.ndim - 1) + (-1,))
        if self.return_weights:
            return y, kernel
        return y
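A brief usage sketch for the YatDense module defined above; the input shape, feature count, and RNG seed are illustrative assumptions.

import jax
import jax.numpy as jnp

# Instantiate the ⵟ-dense layer with 8 output features
layer = YatDense(features=8)

# Initialize parameters and apply the layer to a dummy batch
x = jnp.ones((4, 16))                         # batch of 4 inputs of dimension 16
params = layer.init(jax.random.PRNGKey(0), x)
y = layer.apply(params, x)
print(y.shape)                                # (4, 8)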