Deep Learning 2.0: Artificial Neurons That Matter - Reject Correlation, Embrace Orthogonality

Taha Bouhsine
MLNomads
Agadir, Morocco
yat@mlnomads.com

Abstract

We introduce the ⵟ-product-powered neural network, the Neural Matter Network (NMN), a breakthrough in deep learning that achieves non-linear pattern recognition without activation functions. Our key innovation relies on the yat and posi-yat ⵟ-products, which naturally induce non-linearity by projecting inputs into a pseudo-metric space, eliminating the need for traditional activation functions while maintaining only a softmax layer for the final class probability distribution. This approach simplifies network architecture and provides unprecedented transparency into the network's decision-making process. Our comprehensive empirical evaluation across different datasets demonstrates that the NMN consistently outperforms traditional MLPs. The results challenge the assumption that separate activation functions are necessary for effective deep-learning models. The implications of this work extend beyond immediate architectural benefits: by eliminating intermediate activation functions while preserving non-linear capabilities, the ⵟ-MLP establishes a new paradigm for neural network design that combines simplicity with effectiveness. Most importantly, our approach provides unprecedented insights into the traditionally opaque "black-box" nature of neural networks, offering a clearer understanding of how these models process and classify information.

1. Introduction

The perceptron, introduced by Rosenblatt in 1958 [25], has played a fundamental role in the development of neural networks, serving as an essential building block in artificial intelligence over the past six decades [11, 23]. As a linear model, the perceptron projects input data onto a Euclidean space where the dot product of inputs and weights captures similarity. However, the perceptron's linear nature limits its ability to model complex, non-linear patterns in data [18, 28].

The incorporation of non-linear activation functions, such as the Rectified Linear Unit (ReLU) [20], was a pivotal advancement, enabling neural networks to approximate a wide range of functions and thus address the limitations inherent in linear models [10]. This transformation has allowed for the development of deep learning architectures that can capture intricate relationships within large datasets. However, this gain in representational flexibility has introduced a significant trade-off: as models grow in complexity, the interpretability of their learned representations diminishes, making it increasingly challenging to understand the underlying decision-making processes [15, 26].

A core challenge lies in the transformation of the data space. The dot product in the perceptron layer, rooted in Euclidean geometry, measures the similarity between inputs and weight vectors. When followed by non-linear activation functions, this transformation projects data into spaces that are no longer well-defined or explainable, obscuring insights into the learned representations. As neural network models grow in complexity, they become less interpretable, making it challenging to fully comprehend how decisions are made, especially in high-stakes applications [5].

In this paper, we introduce the ⵟ-product as a novel solution to these challenges, aiming to bridge the gap between model complexity and interpretability. The ⵟ-product combines squared Euclidean distance with orthogonality through the squared dot product. Originally proposed by Bouhsine et al. in the context of contrastive learning [1], this operation works in a pseudo-metric space [2], which retains non-linearity without relying on activation functions.

Building on this innovation, we propose the Neural-Matter Network (NMN), a new network layer that leverages the ⵟ-product to create deep neural networks without activation functions. NMNs operate in a pseudo-metric space that accommodates intricate, interdependent data relationships without distorting the geometric topology of the data. By eliminating the need for activation functions, NMNs simplify network architecture, making it possible to interpret weight dependencies directly without sacrificing the model's capacity to learn complex patterns.
Our work also introduces the Neural-Matter State (NMS) Plot, a new framework for visualizing and interpreting weight distributions within NMNs. For the first time, this approach attempts to provide a framework to explore the "black box" of neural networks, offering insights into the organization and significance of learned weights. The NMS Plot reveals not only how individual weights contribute to model predictions but also how weight clusters influence factors like overfitting and feature uniqueness. This advance holds significant implications for enhancing the interpretability and trustworthiness of AI systems as they are increasingly applied to critical decision-making tasks.

Additionally, we introduce the ⵟ-regularizer, which is inspired by the SimO loss proposed by Bouhsine et al. [1]. This regularizer addresses the issue of neural collapse, where neurons in the model become too close in the neuron space. By promoting orthogonality among neurons, the ⵟ-regularizer optimizes their spatial arrangement so that they are positioned further apart and are not linearly dependent on one another.

Our contributions are as follows:

• The Neural-Matter Network (NMN), an activation-free network architecture using the non-linear ⵟ-product to learn representations without distorting data topology.
• The ⵟ-regularization for intra-orthogonality of the vectors in the weight matrix.
• The Neural-Matter State (NMS) Plot, a framework for interpreting and visualizing learned weights.
• A GNU Affero open-source implementation of the Neural-Matter Layer and related experiments using Flax Linen/JAX.

These contributions mark a significant advancement in creating more interpretable and potentially efficient deep learning models. By introducing ⵟ-product-based architectures, we pave the way for further exploration of activation-free designs and more explainable deep learning frameworks.

2. Theoretical Foundation

2.1. Multi-Layer Perceptron: Theoretical Analysis

2.1.1 Fundamental Components

Consider an input vector x ∈ R^n entering an MLP layer. Each neuron is characterized by a weight vector w ∈ R^n and a bias term b ∈ R. The transformation process consists of two primary operations:

1. Affine Transformation: The neuron computes a weighted sum followed by a bias addition:

z = w^T x + b = Σ_{i=1}^{n} w_i x_i + b    (1)

2. Non-linear Activation: The affine result undergoes a non-linear transformation via a function f : R → R:

y = f(z) = f(w^T x + b)    (2)

For a layer containing m neurons, we express the operation in matrix notation. Let W ∈ R^{m×n} denote the weight matrix where each row corresponds to a neuron's weight vector, and let b ∈ R^m represent the bias vector. For a batch of k input samples X ∈ R^{k×n}, the layer output Y ∈ R^{k×m} is computed as:

Y = f(XW^T + b)    (3)

In traditional MLP architectures, vector similarity is computed between the input vector x and each weight vector w_i using the dot product. For neuron i:

z_i = x · w_i = Σ_{j=1}^{n} x_j w_{ij}    (4)

For illustration, consider x = [1, 2, 3] and w_i = [0.5, −1, 0.2]:

z_i = (1 × 0.5) + (2 × −1) + (3 × 0.2) = −0.9    (5)

The activation function introduces non-linearity into the network. Using ReLU as an example:

a_i = max(0, z_i)    (6)

For instance, given z = [−0.9, 1.5, −0.3, 2.1], ReLU yields:

a = [0, 1.5, 0, 2.1]    (7)

The final layer typically employs softmax to generate probability distributions:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j)    (8)

For logits z = [2.0, 1.0, 0.1], softmax produces:

softmax(z) = [0.659, 0.242, 0.099]    (9)
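As a quick illustration of the pipeline described in Eqs. (1)-(9), the following NumPy sketch (ours, not part of the paper's code) reproduces the worked numbers above.

import numpy as np

# Minimal illustration of a traditional MLP neuron: affine map, ReLU, softmax.
x = np.array([1.0, 2.0, 3.0])          # input vector
w_i = np.array([0.5, -1.0, 0.2])       # one neuron's weight vector
b = 0.0

z_i = np.dot(x, w_i) + b               # Eq. (4): dot-product similarity -> -0.9

def relu(z):
    return np.maximum(0.0, z)          # Eq. (6)

def softmax(logits):
    e = np.exp(logits - logits.max())  # shifted for numerical stability
    return e / e.sum()                 # Eq. (8)

print(z_i)                                            # -0.9
print(relu(np.array([-0.9, 1.5, -0.3, 2.1])))         # [0.  1.5 0.  2.1]
print(softmax(np.array([2.0, 1.0, 0.1])).round(3))    # [0.659 0.242 0.099]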
2.1.2 Architectural Limitations

Linear Similarity Constraints: The dot product's linearity in Euclidean space limits expressiveness. While ReLU introduces non-linearity, information loss occurs through the elimination of negative values.

Activation Range Issues: Unbounded dot-product outputs (−∞ to +∞) can lead to neuron dominance. For example, when a · w_1 = 100 versus a · w_2 = 0.1, ReLU preserves the magnitude difference, potentially overshadowing subtle patterns.

Topological Structure Preservation: The interleaving of linear and non-linear operations can obscure topological relationships within the embedding space, complicating interpretability and the preservation of structural information.

Dropout serves as a primary mitigation strategy by randomly zeroing activations:

a_dropout = m ⊙ a    (10)

where m is a binary mask and ⊙ denotes element-wise multiplication. For example:

a_original = [0, 1.5, 0, 122.1]    (11)
a_dropout = [0, 1.5, 0, 0]    (12)

However, dropout remains a probabilistic solution that does not address the fundamental limitations of the dot product, especially during inference.

These limitations motivate the exploration of alternative similarity measures operating in non-Euclidean spaces, capturing both similarity and orthogonality in unified operations.

2.2. Bouhsine's Products (ⵟ-product and ⵟ-product)

2.2.1 Definition

The Bouhsine products [1] introduce two distinct similarity measures between vectors in R^n: the ⵟ-product and the ⵟ-product, defined for two vectors e_1, e_2 ∈ R^n as follows:

ⵟ-product (yat):

e_1 ⵟ e_2 = (e_1 · e_2)^2 / ||e_2 − e_1||^2    (13)

ⵟ-product (posi-yat):

e_1 ⵟ e_2 = ||e_2 − e_1||^2 / (e_1 · e_2)^2    (14)

Here, · denotes the standard dot product and || · || represents the Euclidean norm.

The Bouhsine products are pseudo-metric (for ⵟ) and semi-metric (for ⵟ) (Theorem 8.1), satisfying closure and commutativity, though lacking other conventional algebraic properties such as associativity, distributivity, and the existence of an identity element. This unique structure enables the Bouhsine products to capture aspects of vector relationships that standard similarity metrics like the dot product may overlook.

2.2.2 Limitations of the Dot Product and Advantages of the ⵟ-Product

Traditional similarity measures, such as the dot product, often fall short in capturing comprehensive vector relationships, especially when vectors share the same direction but vary in magnitude. This limitation becomes particularly apparent when interpreting neuron weights geometrically, where each weight vector represents a specific reference in the embedding space.

Consider a set of neuron weight vectors (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (8, 8), and (9, 9), which are all parallel and thus point in the same direction. For a new point (6, 6), cosine similarity yields identical results for all these vectors, since it emphasizes direction while ignoring magnitude, while the dot product simply grows with vector magnitude and ranks (9, 9) above (5, 5). Both outcomes are misleading, as intuitively the point (6, 6) is closest to (5, 5) in both direction and magnitude. Neither the dot product nor cosine similarity reflects the proximity between (6, 6) and the individual weight vectors.

The ⵟ-product addresses this limitation by incorporating both magnitude and distance information. Unlike the dot product, the ⵟ-product is designed to account for both directional alignment and the relative distances between vectors. When applying the ⵟ-product to compare (6, 6) with each neuron vector, it correctly identifies (5, 5) as the closest match, offering a nuanced understanding that aligns with the intuitive notion of similarity.

Figure 1 visually demonstrates the differences between the dot product and the ⵟ-product. Plot (b) shows the dot product's tendency to favor larger vector magnitudes. In contrast, the ⵟ-product plot (c) effectively differentiates between vectors based on both magnitude and spatial proximity, underscoring its advantage in applications that require a holistic similarity measure.

Figure 1. Comparison of similarity measurements for the test point (6, 6) with the neuron weight vectors using (b) the dot product and (c) the ⵟ-product (Code 8.3). In (b), the dot product scales with vector magnitude, often exaggerating similarity based on size alone. In (c), the ⵟ-product more accurately reflects relative magnitude and distance, correctly identifying (5, 5) as the closest match to (6, 6).
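To make the comparison in Figure 1 concrete, the short NumPy sketch below (ours; the paper's full plotting script is listed in Code 8.3) evaluates the dot product and the yat ⵟ-product of Eq. (13) for the test point (6, 6) against a few of the parallel weight vectors.

import numpy as np

def yat(e1, e2, eps=1e-6):
    # yat ⵟ-product of Eq. (13): squared dot product over squared Euclidean distance
    return np.dot(e1, e2) ** 2 / (np.linalg.norm(e2 - e1) ** 2 + eps)

test = np.array([6.0, 6.0])
for v in [np.array([3.0, 3.0]), np.array([5.0, 5.0]), np.array([9.0, 9.0])]:
    print(v, "dot:", np.dot(test, v), "yat:", round(yat(test, v), 1))

# the dot product keeps growing with magnitude: 36, 60, 108 (favours (9, 9))
# the yat product peaks at the nearest parallel vector (5, 5): 72.0, 1800.0, 648.0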
3. Methods

We propose a novel approach to the Multilayer Perceptron (MLP) layer operation, replacing the traditional dot product with a custom product we call the ⵟ-product.

3.1. ⵟ-Neuron

In our proposed neuron, we replace the traditional dot product in the MLP layer with the ⵟ-product. For a single ⵟ-neuron (Code 8.3) with weight vector w ∈ R^n and input vector x ∈ R^n, the output y is computed as:

y = ⵙ · ⵟ(w, x) + b

where ⵙ ∈ R is a scale factor equal to:

ⵙ = (n / log(1 + n))^α

The full output of the neuron is:

y = (n / log(1 + n))^α · (w · x)^2 / (||x − w||^2 + ε) + b

where b ∈ R is the bias term, α ∈ R is a learnable parameter for dampening the output, and ε is a small positive constant added to ensure numerical stability.

Figure 2 shows the impact of the neurons on the embedding space: instead of learning a boundary, the neuron learns a position in the space that maximizes the ⵟ-score between it and the feature vectors it is trying to attract.

Figure 2. Neuron field plot of (a) a single ⵟ-neuron without scale and (b) the impact of multiple ⵟ-neurons on the space.

We can express the operation in matrix form for a layer with m neurons and a batch of k input samples. Let W ∈ R^{m×n} be the weight matrix where each row corresponds to a neuron's weight vector, and X ∈ R^{k×n} be the input matrix. The layer's output Y ∈ R^{k×m} is computed as:

Y_ij = ⵙ · ⵟ(W_i, X_j) + b_i

Y_ij = (n / log(1 + n))^α · (W_i · X_j)^2 / (||X_j − W_i||^2 + ε) + b_i

where W ∈ R^{m×n} is the weight matrix, X ∈ R^{k×n} is the input matrix, b ∈ R^m is the bias vector, and ⵙ ∈ R is the scale.
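The matrix form above can be written compactly with broadcasting. The following NumPy sketch is our own minimal illustration of one way to compute it (the paper's actual JAX/Flax layer is given in Appendix 8.5); α is assumed fixed here rather than learned.

import numpy as np

def yat_layer(X, W, b, alpha=1.0, eps=1e-6):
    # Minimal ⵟ-layer forward pass:
    # Y[i, j] = scale * (X[i]·W[j])^2 / (||X[i] - W[j]||^2 + eps) + b[j]
    # X: (k, n) batch of inputs, W: (m, n) neuron weights, b: (m,) biases.
    n = X.shape[-1]
    dots = X @ W.T                                        # (k, m) dot products
    # squared distances via ||x||^2 + ||w||^2 - 2 x·w (same trick as the Flax code in 8.5)
    dists = (X ** 2).sum(-1, keepdims=True) + (W ** 2).sum(-1) - 2.0 * dots
    scale = (n / np.log(1.0 + n)) ** alpha                # ⵙ = (n / log(1+n))^α
    return scale * dots ** 2 / (dists + eps) + b

# toy usage: 4 samples of dimension 3, 2 neurons
X = np.random.randn(4, 3)
W = np.random.randn(2, 3)
b = np.zeros(2)
print(yat_layer(X, W, b).shape)   # (4, 2)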

3.2. Neural-Matter State (NMS) Plot

To assess the neuron distribution within a layer, we use t-SNE and PCA for dimensionality reduction on the neuron weight vectors. This visualization involves a scatter plot with a corresponding 2D density plot and is accompanied by a similarity matrix based on prior work by Bouhsine et al. [1]. Through these representations, we analyze neuron alignment and identify potential overfitting, vulnerability to adversarial attacks, and neuron redundancy, which can inform decisions about neuron fusion for optimized performance.

3.3. ⵟ-Regularization

We propose a ⵟ-regularizer based on the ⵟ-product intra-similarity minimization used by Bouhsine et al. for Anchor-Free Contrastive Learning (AFCL) [1]. This regularizer minimizes the ⵟ-similarity score between weight vectors, encouraging intra-orthogonality, as sketched below.
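The paper does not give a closed-form expression for the regularizer, so the following NumPy sketch is only our reading of the description: it sums the pairwise yat ⵟ-similarities between distinct weight vectors of a layer, a quantity that is driven toward zero (mutual orthogonality) when added to the training loss. The weighting coefficient lambda_reg is hypothetical.

import numpy as np

def yat_regularizer(W, eps=1e-6):
    # Sum of pairwise yat ⵟ-similarities between distinct rows of W (shape (m, n)).
    # Minimizing this term pushes weight vectors toward mutual orthogonality,
    # since the yat product vanishes only when the dot product does.
    dots = W @ W.T                                              # (m, m) pairwise dot products
    sq_norms = (W ** 2).sum(-1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * dots  # pairwise squared distances
    sim = dots ** 2 / (dists + eps)                             # pairwise yat similarities
    off_diag = sim - np.diag(np.diag(sim))                      # ignore self-similarity
    return off_diag.sum() / 2.0                                 # each pair counted once

# hypothetical use inside a training loop:
# loss = task_loss + lambda_reg * yat_regularizer(W)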

with the ⵙ ∈ R is the scale factor that is equal to: Our ⵟ-ViT retains the original design but uses ⵟ-
neurons instead, omitting any activation functions af-
n ter the layer.
ⵙ=( )α
log(1 + n)
The full output of the neuron is: 3.4.1 ⵟ-MHA

n (w · x)2 Additionally, we replace the standard scaling factor


y=( )α ∗ +b dk with a learnable parameter that resembles the ⵟ-
log(1 + n) ||x − w||2
neuron scaling factor.
where b ∈ R is the bias term and α ∈ R is a learnable
parameter for damening of the output and  is a small Attention(Q, K, V ) = softmax(ⵙ · QK T )V (15)
positive constant added to ensure numerical stability.  α
Figure 2 shows the impact of neurons in the embed- where ⵙ = log(1+n)
n
and α is a learnable parame-
dings space, instead of learning a boundary, the neu- ter. This adjustment enhances the flexibility of scaling
ron learns a position in the space that maximizes the within the attention mechanism.

4
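A minimal JAX sketch of the scaled attention in Eq. (15), written by us for illustration; here n is assumed to be the key/query dimension and α is passed in as a plain value rather than a learned parameter.

import jax.numpy as jnp
from jax.nn import softmax

def yat_scaled_attention(Q, K, V, alpha=1.0):
    # Attention(Q, K, V) = softmax(scale * Q K^T) V with scale = (n / log(1+n))^alpha (Eq. 15)
    n = Q.shape[-1]                                  # assumption: n is the key/query dimension
    scale = (n / jnp.log(1.0 + n)) ** alpha          # replaces the usual 1/sqrt(d_k)
    attn = softmax(scale * (Q @ K.T), axis=-1)
    return attn @ V

# toy shapes: 5 tokens, dimension 8
Q = jnp.ones((5, 8)); K = jnp.ones((5, 8)); V = jnp.ones((5, 8))
print(yat_scaled_attention(Q, K, V).shape)           # (5, 8)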
3.4.2 Random Token Masking

Inspired by the Masked Autoencoder (MAE) approach [8], we apply random token masking between encoder blocks. Specifically, we randomly remove p% of the input tokens and replace them with a single mask token. This strategy bolsters robustness and mitigates overfitting. A sketch of the procedure follows the definition below.

Let:
- X = [x_1, x_2, ..., x_n] be the input sequence,
- p be the masking ratio, representing the probability of masking each token.

1. Binary Mask Generation: For each token x_i, generate a binary mask M_i such that M_i = 1 with probability p and M_i = 0 with probability 1 − p.

2. Apply the Mask: Define each masked token x'_i as:

x'_i = M_i · [MASK] + (1 − M_i) · x_i

giving the masked sequence X' = [x'_1, x'_2, ..., x'_n], where approximately p × n tokens are replaced by [MASK].
4. Results

4.1. Experimental Setup

Our experimental framework compares the ⵟ-neuron and the traditional neuron, ensuring comparable parameter counts across all models, on different image classification datasets: (1) a ⵟ-neuron-based MLP architecture with Dropout; (2) a traditional MLP with ReLU activation, Dropout, and Layer Normalization; (3) the ViT [6] architecture using ⵟ-neurons and traditional neurons.

The configuration for the ViT model [8, 27] used in this study consists of a total of six layers. Each layer contains 128 hidden units and two attention heads, with a feed-forward width of 512, for roughly 1.2M parameters in total.

The MLP and ⵟ-MLP (NMN) models use the same patch-embedding scheme as the ViT, followed by global average pooling. We compare two models with the same number of neurons (0.2M parameters), but the ⵟ-neurons do not use any activation functions, while the traditional neurons use the GeLU activation function [9]. For data augmentation we use rotation, left-right and up-down flips, and color jitter.

To ensure the reproducibility of our results, all experiments are conducted on NVIDIA L4 GPUs using JAX/Flax [19] over GCP/Colab.

4.2. XOR Dilemma

One significant challenge in the field is the XOR problem, which typically requires nonlinear transformations or multiple layers in traditional neural networks to achieve accurate results.

As shown in Figure 3, a single ⵟ-neuron successfully resolves the XOR problem without the need for additional layers or non-homomorphic activation functions, demonstrating its capacity to handle nonlinear problems with minimal complexity.

Figure 3. Decision boundaries of one trained (a) ⵟ-neuron without scale (Code 8.4) vs. (b) a traditional neuron.

4.3. Do you even MNIST bro?

In Figure 4, we manually placed 10 ⵟ-neurons (large circles) on the t-SNE plot of the MNIST dataset. We then used the maximum ⵟ-similarity score between the neurons and the t-SNE point of each image to predict the class. This approach achieved an accuracy of 73%.

Figure 4. t-SNE plot of MNIST with (a) true labels and (b) labels predicted by manually fitting 10 ⵟ-neurons (large circles) without scale, reaching 73% accuracy.

4.4. Vision Classification Task

Table 1 presents the accuracy results across five benchmark image classification datasets, comparing architectures trained with traditional neurons and activation-free ⵟ-neurons in both MLP and Vision Transformer (ViT) configurations.
Model            | CIFAR10 [13]    | CIFAR100 [13] | Caltech101 [14] | Oxford Flowers [22] | STL10 [3]
(in²/#cls/steps) | 32²/10/390 (/4) | 32²/100 (/4)  | 96²/102/23 (/8) | 224²/102/15         | 96²/10/78

Traditional neuron with GeLU activation function:
MLP              | 25.31%          | 10.09%        | 14.68%          | 1.42%               | 23.22%
ViT-t            | 72.91%          | 36.93%        | 30.21%          | 31.22%              | 49.73%

Activation-free ⵟ-neuron:
ⵟ-MLP            | 47.36%          | 25.14%        | 24.15%          | 20.20%              | 42.25%
ⵟ-ViT-t          | 74.22%          | 40.75%        | 34.31%          | 31.42%              | 51.95%

Table 1. Comparison of test accuracy across multiple image classification datasets for architectures trained from scratch with traditional neurons and activation-free ⵟ-neurons. Results highlight the improved or comparable performance of ⵟ-neuron models, underscoring their effectiveness in both MLP and Vision Transformer (ViT) configurations.

Across all tasks, models incorporating ⵟ-neurons demonstrate improved or comparable accuracy over traditional neuron models trained for 200 epochs, underscoring the potential of ⵟ-neurons to deliver higher performance with fewer architectural complexities.

For CIFAR-10, the traditional ViT-t architecture achieves a test accuracy of 72.91%, while the ⵟ-ViT-t model outperforms it slightly at 74.22%, highlighting the effectiveness of ⵟ-neurons in enhancing generalization without additional activation functions. In CIFAR-100, ⵟ-neuron models similarly outperform their traditional counterparts, with the ⵟ-ViT-t model achieving 40.75% accuracy compared to 36.93% for the traditional ViT-t. These consistent improvements reflect the robustness of the ⵟ-neuron approach across datasets of varying complexity.

In MLP configurations, the advantage of ⵟ-neurons is even more pronounced. The ⵟ-MLP model achieves 47.36% accuracy on CIFAR-10 and 25.14% on CIFAR-100, compared to the traditional MLP's 25.31% on CIFAR-10 and 10.09% on CIFAR-100. These results illustrate that ⵟ-neurons not only eliminate the need for explicit activation functions but also improve overall performance in simpler architectures, making them an efficient alternative for network design in resource-constrained settings.

On other datasets like Caltech101 and STL10, ⵟ-ViT-t models maintain their performance advantage, with respective accuracies of 34.31% and 51.95%, outperforming the traditional ViT-t's scores of 30.21% and 49.73%. The performance gains observed in ⵟ-neuron models across these datasets reinforce the generalizability and scalability of ⵟ-neuron-based architectures.

5. Discussion and Analysis of ⵟ-Neuron Performance

The results highlight the capability of the ⵟ-neuron to achieve high accuracy without relying on traditional activation functions. The ⵟ-product introduces implicit non-linearity, capturing complex data patterns in a manner that preserves more information compared to standard activation-based approaches. This novel non-linear processing method represents a shift in how neural networks interpret data, reducing the information loss commonly introduced by activation functions (Table 1).

The ⵟ-neuron architecture enhances interpretability by embedding non-linearity directly within the ⵟ-product, thus preserving essential geometric relationships. This design enables a more intuitive understanding of neuron interactions and allows for straightforward visual representations of their behavior.

Additionally, ⵟ-neurons circumvent several stability challenges seen with traditional activation functions. Standard non-linearities can lead to gradient saturation and "dead" neurons [24]: saturating functions often result in vanishing gradients, while non-saturating ones can cause exploding gradients or neuron death, requiring added techniques like batch normalization and gradient clipping. The ⵟ-product avoids these issues by maintaining a non-saturating internal non-linearity, offering stable training dynamics and reducing the dependency on additional regularization techniques.

5.1. Artificial Neurons that Matter

The ⵟ-product draws intriguing analogies with physical laws, such as the inverse-square law, hinting at a potential new paradigm for understanding neural networks. Here, neurons are not merely linear separators but operate as geometric constructs within a non-linear manifold. This interpretation reimagines neural network layers as inherently physical, aligning them closer to natural principles. Such a view opens pathways to network architectures that are both adaptable and more naturally aligned with fundamental scientific laws. Furthermore, including a learnable scale parameter mitigates the risk of weight explosion: without this parameter, weights grow excessively, leading to instability.
5.2. Effect of ⵟ-Regularization on Output Layer Representation

Our examination of well-trained model weight matrices shows a tendency toward orthogonality among neurons, which supports improved generalization and robustness. Non-orthogonality or linear dependencies in the weight matrix often indicate suboptimal training. The ⵟ-regularizer encourages orthogonality among weight vectors, preventing neuron representations from collapsing into similar configurations. This effect is especially beneficial for distinguishing similar classes, such as "cats" versus "dogs" in datasets like CIFAR-10. Figure 5 illustrates how the ⵟ-regularizer prevents neuron collapse in the output layer, as shown through NMS plots, providing distinct and separated representations for each class; training without regularization leads to misclassification between the dog and cat classes because the neuron weights of those two classes become highly similar.

Figure 5. ⵟ-regularization prevents neuron collapse between the Dog (idx 5) and Cat (idx 3) neurons in the output layer, as shown by the ⵟ-similarity matrices (a: with ⵟ-regularization, b: without regularization) and the NMS plots (c-f).

6. Limitations and Future Directions

While the ⵟ-neuron architecture introduces several advantages, certain limitations may affect its scalability and compatibility across different applications.

The ⵟ-product does not satisfy the standard associative and distributive properties that are fundamental to many matrix operations in deep learning. This limitation could restrict the ⵟ-neuron's adaptability, especially in code optimizations and architectures that depend on these algebraic properties for performance and flexibility.

As for computational efficiency, traditional neurons compute a dot product followed by a ReLU activation, requiring about 2d + 1 FLOPs per neuron (where d is the input dimension). In contrast, ⵟ-neurons use the ⵟ-product, which includes a squared Euclidean distance and magnitude normalization, totaling approximately 5d − 1 FLOPs per neuron.

To analyze the computational overhead, we calculate the FLOP ratio between ⵟ-neurons and traditional neurons:

Efficiency Ratio = (5d − 1) / (2d + 1) ≈ 2.5

This suggests that ⵟ-neurons require approximately 2.5 times the FLOPs of conventional neurons. However, the ⵟ-neuron's design eliminates the need for separate activation functions, potentially streamlining the overall network structure and reducing the memory load in deeper architectures.

The absence of activation functions between layers in ⵟ-neuron architectures results in more information-preserving representations, yet this can lead to a higher risk of overfitting.

Drawing an analogy to Hebbian learning, which states that "neurons that fire together, wire together" [17], we observe that overfitting in neurons can manifest as neuron collapse. This collapse occurs when multiple neurons converge to encode identical information, occupying the same region in the neural space. Figure 6 illustrates this phenomenon using Neural-Matter State (NMS) plots, revealing a collapsed distribution that indicates overfitting in the output layer.

Future research should concentrate on comprehending this new paradigm and developing architectures specifically optimized for this new neuron.
Figure 6. Analysis of the neurons in the third layer of an MLP head trained on CIFAR-10, using Neural-Matter State (NMS) plots: (a, b) a well-fitting, healthy distribution of neuron specialization; (c, d) overfitting appears as a collapsed distribution.

7. Related Works

7.1. Inverse-Square Laws

The inverse-square law describes a principle where a specified physical quantity or intensity is inversely proportional to the square of the distance from the source. This relationship is foundational in various fields, particularly in physics, and underpins several fundamental laws [12]. Newton's Law of Universal Gravitation is one of the earliest applications of the inverse-square law: it states that the gravitational force between two masses is inversely proportional to the square of the distance between their centers [21]. Similarly, Coulomb's Law describes the electrostatic force between two charged particles, with a force that diminishes with the square of the distance [4]. Gauss's Law, though formulated differently, implies an inverse-square relationship for electric and gravitational flux through a closed surface, connecting field flux to the source charge or mass within that surface [7].

7.2. Multi-Layer Perceptron Architectures

Recent research in computer vision explores Multilayer Perceptrons (MLPs) and Vision Transformers [6] as alternatives to conventional convolutional models, focusing on simplicity and computational efficiency. Models like MLP-Mixer [29] and gMLP [16] have demonstrated competitive results on benchmarks like ImageNet by using operations such as matrix multiplication and spatial gating, effectively capturing spatial information without self-attention.

8. Conclusion

Perhaps artificial intelligence's greatest limitation has been our stubborn fixation on the human brain as the pinnacle of intelligence. The universe itself, governed by elegant and powerful laws, demonstrates intelligence far beyond human cognition. These fundamental laws - which shape galaxies and guide quantum particles - represent a deeper form of intelligence that we have largely ignored in our pursuit of AI.

In this paper, we broke free from biological constraints by drawing direct inspiration from the inverse-square law [12], Coulomb's law [4], Newton's law of universal gravitation [21], Gauss's law [7], and Bouhsine's products [1]. By redefining neural fundamentals through the lens of physics and abstract topology, we demonstrated that a single activation-free neuron could solve the XOR problem - a task that traditionally required multiple neurons with complex activation functions. Our extensive experiments show that networks built with ⵟ-neurons consistently outperform traditional architectures without using any non-linear activation functions, suggesting that we have only scratched the surface of what is possible when we look beyond biological metaphors.

License

The source code, algorithms, and all contributions presented in this work are licensed under the GNU Affero General Public License (AGPL) v3.0. This license ensures that any use, modification, or distribution of the code, and any adaptations or applications of the underlying models and methods, must be made publicly available under the same license. This applies whether the work is used for personal, academic, or commercial purposes, including services provided over a network.

Acknowledgment

The Google Developer Expert program and the Google AI/ML Developer Programs team supported this work by providing Google Cloud credit. I want to extend my gratitude to the staff at High Grounds Coffee Roasters for their excellent coffee and peaceful atmosphere.
I would also like to thank Dr. Andrew Ng for creating the Deep Learning course that introduced me to this field; without his efforts to democratize access to knowledge, this work would not have been possible. Additionally, I want to express my appreciation to all the communities I have been part of, especially the MLNomads, Google Developers, and MLCollective communities.

References

[1] Taha Bouhsine, Imad El Aaroussi, Atik Faysal, and Wang. SimO loss: Anchor-free contrastive loss for fine-grained supervised contrastive learning. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. Under review.
[2] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.
[3] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215-223. JMLR Workshop and Conference Proceedings, 2011.
[4] Charles-Augustin de Coulomb. Premier mémoire sur l'électricité et le magnétisme. Histoire de l'Académie Royale des Sciences, pages 1-31, 1785. In French.
[5] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv: Machine Learning, 2017.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[7] Carl Friedrich Gauss. Allgemeine Lehrsätze in Beziehung auf die im verkehrten Verhältniss des Quadrats der Entfernung wirkenden Anziehungs- und Abstossungskräfte. Dietrich, Göttingen, 1835.
[8] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
[9] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2023.
[10] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert L. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[11] Roger David Joseph. Contributions to perceptron theory. Cornell University, 1961.
[12] Johannes Kepler. Ad Vitellionem paralipomena, quibus astronomiae pars optica traditur. 1604. In Johannes Kepler: Gesammelte Werke, Ed. Walther von Dyck and Max Caspar, München, 1939.
[13] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[14] Fei-Fei Li, Marco Andreeto, Marc'Aurelio Ranzato, and Pietro Perona. Caltech 101, 2022.
[15] Zachary C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31-57, 2018.
[16] Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay attention to MLPs, 2021. arXiv:2105.08050 [cs].
[17] Siegrid Löwel and Wolf Singer. Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255(5041):209-212, 1992.
[18] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5:115-133, 1943.
[19] Fraser Mince, Dzung Dinh, Jonas Kgomo, Neil Thompson, and Sara Hooker. The grand illusion: The myth of software portability and implications for ML progress, 2023.
[20] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
[21] Isaac Newton. Philosophiæ Naturalis Principia Mathematica. S. Pepys, London, 1687.
[22] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722-729. IEEE, 2008.
[23] J. Orbach. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Archives of General Psychiatry, 7(3):218-219, 1962.
[24] Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks, 2021.
[25] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[26] Juergen Schmidhuber. Annotated history of modern AI and deep learning. arXiv preprint arXiv:2212.11279, 2022.
[27] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers, 2022.
[28] Stephen M. Stigler. Gauss and the invention of least squares. The Annals of Statistics, pages 465-474, 1981.
[29] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
Appendix
8.1. Geometric Topology
In geometric topology, various types of spaces are used to study notions of distance and convergence. Each type
of space has a specific set of properties defined by a function known as a metric or a generalization thereof. We
describe below metric spaces, semi-metric spaces, and pseudo-metric spaces, emphasizing the differences in their
definitions.

8.1.1 Metric Space


A metric space is a set X equipped with a distance function d : X × X → R, called a metric, which satisfies the
following properties for all x, y, z ∈ X:
1. Non-negativity: d(x, y) ≥ 0.
2. Identity of indiscernibles: d(x, y) = 0 if and only if x = y.
3. Symmetry: d(x, y) = d(y, x).
4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
These properties ensure that a metric space has a well-defined notion of distance between any pair of points in
X, which is fundamental to many topological and analytical concepts.

8.1.2 Semi-Metric Space


A semi-metric space generalizes a metric space by relaxing the triangle inequality requirement. A semi-metric
space is defined as a set X with a distance function d : X × X → R that satisfies:
1. Non-negativity: d(x, y) ≥ 0.
2. Identity of indiscernibles: d(x, y) = 0 if and only if x = y.
3. Symmetry: d(x, y) = d(y, x).
In this case, d provides a notion of distance that is symmetric and non-negative, though it does not necessarily
satisfy the triangle inequality.

8.1.3 Pseudo-Metric Space


A pseudo-metric space is another generalization of a metric space, where the identity of indiscernibles requirement
is omitted. Thus, a pseudo-metric space is a set X with a distance function d : X × X → R that satisfies:
1. Non-negativity: d(x, y) ≥ 0.
2. Symmetry: d(x, y) = d(y, x).
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
In a pseudo-metric space, d(x, y) = 0 does not necessarily imply that x = y; points can have zero distance between
them without being identical, which makes pseudo-metric spaces useful in contexts where such indistinguishability
is needed.
8.2. ⵟ is a Pseudo-Metric Space

Theorem 8.1 (ⵟ is semi-metric and ⵟ is pseudo-metric). Let (R^n, ⵟ) and (R^n, ⵟ) be two spaces where:

ⵟ(e_i, e_j) = o_ij / d²_ij = (e_i · e_j)² / |e_i − e_j|²

ⵟ(e_i, e_j) = d²_ij / o_ij = |e_i − e_j|² / (e_i · e_j)²

for e_i, e_j ∈ R^n \ {0} with e_i ≠ e_j, where d²_ij = |e_i − e_j|² is the squared Euclidean distance and o_ij = (e_i · e_j)² is the squared dot product.

Then (R^n, ⵟ) is semi-metric, while (R^n, ⵟ) is pseudo-metric.

Proof. We structure this proof into four parts:
1. Preliminary observations and domain analysis
2. Proof of common properties for both measures
3. Proof that ⵟ is a pseudo-metric
4. Proof that ⵟ is a semi-metric

Part I: Preliminary Observations

Before proving the metric properties, we must establish the domain where these measures are well-defined:
1. For non-zero vectors e_i, e_j:
• d²_ij = 0 ⟺ e_i = e_j
• o_ij = 0 ⟺ e_i ⊥ e_j (the vectors are orthogonal)
2. Domain restrictions:
• ⵟ is defined when d²_ij ≠ 0 (distinct vectors)
• ⵟ is defined when o_ij ≠ 0 (non-orthogonal vectors)

Part II: Common Properties

Both measures satisfy the following properties:

1. Non-negativity: Since both d²_ij and o_ij are squared quantities:

d²_ij = |e_i − e_j|² ≥ 0 and o_ij = (e_i · e_j)² ≥ 0

Therefore:

ⵟ(e_i, e_j) ≥ 0 and ⵟ(e_i, e_j) ≥ 0

2. Identity of Indiscernibles: For both measures, we examine this bidirectionally.

(⇒) If e_i = e_j:
• d²_ij = 0
• o_ij = |e_i|⁴ > 0 (for non-zero vectors)
Therefore, ⵟ(e_i, e_j) = 0 and ⵟ(e_i, e_j) = 0.

(⇐) If ⵟ(e_i, e_j) = 0 or ⵟ(e_i, e_j) = 0:
• For ⵟ: o_ij / d²_ij = 0 ⟹ o_ij = 0 (since d_ij ≠ 0 for distinct vectors)
• For ⵟ: d²_ij / o_ij = 0 ⟹ d²_ij = 0 (since o_ij ≠ 0 in the domain)
This implies that the property holds for ⵟ, but not for ⵟ.

3. Symmetry: Symmetry follows from the symmetry of the dot product and the Euclidean distance:

ⵟ(e_i, e_j) = (e_i · e_j)² / |e_i − e_j|² = (e_j · e_i)² / |e_j − e_i|² = ⵟ(e_j, e_i)

and similarly for ⵟ.

Part III: Proof of the triangle inequality for ⵟ

To prove that ⵟ satisfies the triangle inequality, we proceed in steps.

Given:

e_1 ⵟ e_2 = (e_1 · e_2)² / ||e_2 − e_1||².

Let:
- e_1 and e_2 be vectors in R^n,
- θ be the angle between e_1 and e_2.

The dot product between e_1 and e_2 can be written as:

e_1 · e_2 = ||e_1|| ||e_2|| cos θ.

Thus, (e_1 · e_2)² becomes:

(e_1 · e_2)² = (||e_1|| ||e_2|| cos θ)² = ||e_1||² ||e_2||² cos² θ.

The Euclidean distance between e_1 and e_2 is:
||e_2 − e_1||² = ||e_1||² + ||e_2||² − 2 ||e_1|| ||e_2|| cos θ.

Now we substitute these expressions into the formula for e_1 ⵟ e_2:

e_1 ⵟ e_2 = ||e_1||² ||e_2||² cos² θ / (||e_1||² + ||e_2||² − 2 ||e_1|| ||e_2|| cos θ).

Let us simplify by defining:
- A = ||e_1||,
- B = ||e_2||.

Thus, the expression becomes:

f(θ) = e_1 ⵟ e_2 = A² B² cos² θ / (A² + B² − 2AB cos θ).

Differentiating with respect to θ using the quotient rule and factoring out the common factor A² B² sin θ in the numerator gives:

f'(θ) = A² B² sin θ [−2 cos θ (A² + B² − 2AB cos θ) − 2AB cos² θ] / (A² + B² − 2AB cos θ)².

Now, distribute −2 cos θ in the first term inside the brackets:

= A² B² sin θ [−2A² cos θ − 2B² cos θ + 4AB cos² θ − 2AB cos² θ] / (A² + B² − 2AB cos θ)².

Combine the cos² θ terms:

= A² B² sin θ [−2A² cos θ − 2B² cos θ + 2AB cos² θ] / (A² + B² − 2AB cos θ)².

Thus, the simplified form of f'(θ) is:

f'(θ) = −2 A² B² sin θ [A² cos θ + B² cos θ − AB cos² θ] / (A² + B² − 2AB cos θ)².

This form allows us to see that the sign of f'(θ) depends on the sign of −sin θ, which is non-positive on the interval [0, π]. Therefore, f'(θ) ≤ 0 on this interval, confirming that f(θ) is monotonically decreasing. Since f(θ) is monotonically decreasing, it follows that ⵟ(e_1, e_2) = f(θ) decreases as θ increases.

Applying the Angular Triangle Inequality: Angles in Euclidean space satisfy the triangle inequality (a consequence of the Cauchy-Schwarz inequality):

θ_ik ≤ θ_ij + θ_jk.

Since ⵟ(e_i, e_j) is a decreasing function of θ, we conclude:

ⵟ(e_i, e_k) ≤ ⵟ(e_i, e_j) + ⵟ(e_j, e_k).

Part IV: Disproof of the triangle inequality for ⵟ

To show that ⵟ is not a metric, we provide a counter-example where the triangle inequality fails. Consider three points in R²:

e_1 = (1, 0)
e_2 = (0, 1)
e_3 = (1, 1)

Computing the values:
ⵟ(e_1, e_2) = 0
ⵟ(e_2, e_3) = 1/1 = 1
ⵟ(e_1, e_3) = 2/1 = 2

which means:

ⵟ(e_1, e_3) ≥ ⵟ(e_1, e_2) + ⵟ(e_2, e_3).

Hence the triangle inequality does not hold for ⵟ.

Therefore, we conclude that (R^n, ⵟ) is pseudo-metric and (R^n, ⵟ) is semi-metric.
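As a quick numerical sanity check of the two common properties established in Part II (non-negativity and symmetry), the following NumPy snippet, which is ours and not part of the paper, evaluates both Bouhsine products on random non-degenerate vector pairs.

import numpy as np

def yat(e1, e2):
    return np.dot(e1, e2) ** 2 / np.sum((e2 - e1) ** 2)       # o_ij / d_ij^2

def posi_yat(e1, e2):
    return np.sum((e2 - e1) ** 2) / np.dot(e1, e2) ** 2       # d_ij^2 / o_ij

rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.normal(size=3), rng.normal(size=3)             # distinct, non-orthogonal with probability 1
    for f in (yat, posi_yat):
        assert f(a, b) >= 0.0                                  # non-negativity
        assert np.isclose(f(a, b), f(b, a))                    # symmetry
print("non-negativity and symmetry hold on all sampled pairs")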

8.3. ⵟ-product vs Dot Product

# Plot the neurons and the test point in the 2D space
import numpy as np
import matplotlib.pyplot as plt

neurons = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [7, 7], [8, 8]])
test_point = np.array([6, 6])

plt.figure(figsize=(6, 6))

# Plot neurons as red dots
for i, neuron in enumerate(neurons):
    plt.plot(neuron[0], neuron[1], 'ro')
    plt.text(neuron[0] + 0.1, neuron[1], f'Neuron {i+1}', fontsize=12)

# Plot the test point (6, 6) as a blue star
plt.plot(test_point[0], test_point[1], 'b*', markersize=15, label='Test Point (6, 6)')
plt.text(test_point[0] + 0.1, test_point[1], 'Test Point (6, 6)', fontsize=12)

# Set up the plot limits and labels
plt.xlim(0, 9)
plt.ylim(0, 9)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Neurons and Test Point in 2D Space')

# Draw lines connecting the origin to the neurons and the test point
for neuron in neurons:
    plt.plot([0, neuron[0]], [0, neuron[1]], 'r--')
plt.plot([0, test_point[0]], [0, test_point[1]], 'b--')

plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()


import numpy as np
import matplotlib.pyplot as plt

# Define the neurons as vectors
neurons = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [7, 7], [8, 8]])
test_point = np.array([6, 6])

# Dot-product function
def dot_product(v1, v2):
    return np.dot(v1, v2)

# Yat-product function
def yat_product(v1, v2, epsilon=1e-6):
    dot_product_squared = np.dot(v1, v2) ** 2
    distance_squared = np.linalg.norm(v2 - v1) ** 2
    return dot_product_squared / (distance_squared + epsilon)

# Calculate dot products and yat-products
dot_products = [dot_product(test_point, neuron) for neuron in neurons]
yat_products = [yat_product(test_point, neuron) for neuron in neurons]

# Plot the dot products
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.plot(range(1, len(neurons) + 1), dot_products, marker='o', color='blue', label='Dot product')
plt.title("Dot product with (6, 6)")
plt.xlabel("Neuron")
plt.ylabel("Dot product")
plt.grid(True)
plt.xticks(range(1, len(neurons) + 1))

# Plot the yat-products
plt.subplot(1, 2, 2)
plt.plot(range(1, len(neurons) + 1), yat_products, marker='o', color='red', label='Yat-product')
plt.title("Yat-product with (6, 6)")
plt.xlabel("Neuron")
plt.ylabel("Yat-product")
plt.grid(True)
plt.xticks(range(1, len(neurons) + 1))

plt.tight_layout()
plt.show()
ⵟ-Neuron with numpy

import numpy as np

def yat_neuron(X, w, b):
    # Squared dot product between each input row and the weight vector
    dot_squared = np.dot(X, w) ** 2
    # Squared Euclidean distance between each input row and the weight vector
    distance_squared = np.sum((w - X) ** 2, axis=1)
    # Avoid division by zero by adding a small epsilon
    epsilon = 1e-6
    return dot_squared / (distance_squared + epsilon) + b

8.4. XOR Code

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

np.random.seed(42)  # For reproducibility

w = np.random.randn(2)  # Random weights for a 2D input
b = np.random.randn()   # Random bias

# XOR dataset: inputs and corresponding outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # Inputs
y = np.array([0, 1, 1, 0])                      # Expected outputs for XOR

# Apply the custom neuron to each input in the XOR dataset
outputs = yat_neuron(X, w, b)

# Print the initial (untrained) results
print(w)
print(b)
print(outputs)

# Loss function: Mean Squared Error between the neuron output and the target XOR output
def loss_function(params):
    w = params[:2]   # First two values are the weights
    b = params[2]    # Last value is the bias
    outputs = yat_neuron(X, w, b)
    return np.mean((outputs - y) ** 2)  # Mean Squared Error (MSE)

# Initial parameters: [w1, w2, b]
initial_params = np.append(w, b)

# Optimize the weights and bias using 'minimize' from SciPy
result = minimize(loss_function, initial_params, method='BFGS')

# Extract the optimized weights and bias
optimized_params = result.x
optimized_weights = optimized_params[:2]
optimized_bias = optimized_params[2]

# Apply the optimized weights and bias to the XOR dataset
optimized_outputs = yat_neuron(X, optimized_weights, optimized_bias)

print('###############')
print(optimized_weights)
print(optimized_bias)
print(optimized_outputs)

# Function to plot the XOR data and decision boundary
def plot_xor_decision_boundary(weights, bias):
    # XOR input data points
    X_pos = X[y == 1]  # Points where y == 1
    X_neg = X[y == 0]  # Points where y == 0

    # Create a mesh grid for the plot
    xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 400), np.linspace(-0.5, 1.5, 400))
    grid_points = np.c_[xx.ravel(), yy.ravel()]

    # Compute the neuron output for each point in the grid
    Z = yat_neuron(grid_points, weights, bias)
    Z = Z.reshape(xx.shape)

    # Plot the decision boundary and the XOR points
    plt.figure(figsize=(6, 6))
    plt.contourf(xx, yy, Z > 0.5, alpha=0.5, cmap='coolwarm')  # Decision boundary
    plt.scatter(X_pos[:, 0], X_pos[:, 1], color='red', label='1', edgecolors='k')
    plt.scatter(X_neg[:, 0], X_neg[:, 1], color='blue', label='0', edgecolors='k')
    plt.title("XOR Problem: Decision Boundary (ⵟ-Neuron)")
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot the XOR decision boundary with the optimized weights and bias
plot_xor_decision_boundary(optimized_weights, optimized_bias)

8.5. ⵟ-Layer (NML) with Flax

from typing import Any, Optional

import jax.numpy as jnp
import jax.lax as lax
from flax import linen as nn
from flax.linen import Module, compact
from flax.linen.dtypes import promote_dtype
from flax.linen.initializers import zeros_init
from flax.typing import DotGeneralT


class YatDense(Module):
    """
    Attributes:
      features: the number of output features.
      use_bias: whether to add a bias to the output (default: True).
      dtype: the dtype of the computation.
      param_dtype: the dtype passed to parameter initializers (default: float32).
      precision: numerical precision of the computation.
      kernel_init: initializer function for the weight matrix.
      bias_init: initializer function for the bias.
      epsilon: small constant added for numerical stability.
    """
    features: int
    use_bias: bool = True
    dtype: Optional[Any] = None
    param_dtype: Any = jnp.float32
    precision: Any = None
    kernel_init: Any = nn.initializers.orthogonal()
    bias_init: Any = zeros_init()
    # Initialize alpha to 1.0
    alpha_init: Any = lambda key, shape, dtype: jnp.ones(shape, dtype)
    epsilon: float = 1e-6
    dot_general: DotGeneralT | None = None
    dot_general_cls: Any = None
    return_weights: bool = False

    @compact
    def __call__(self, inputs: Any) -> Any:
        """
        Args:
          inputs: the nd-array to be transformed.

        Returns:
          The transformed input.
        """
        kernel = self.param(
            'kernel',
            self.kernel_init,
            (self.features, jnp.shape(inputs)[-1]),
            self.param_dtype,
        )
        alpha = self.param(
            'alpha',
            self.alpha_init,
            (1,),  # Single scalar parameter
            self.param_dtype,
        )
        if self.use_bias:
            bias = self.param(
                'bias', self.bias_init, (self.features,), self.param_dtype
            )
        else:
            bias = None

        inputs, kernel, bias = promote_dtype(inputs, kernel, bias, dtype=self.dtype)

        # Compute the dot product between the inputs and the kernel
        if self.dot_general_cls is not None:
            dot_general = self.dot_general_cls()
        elif self.dot_general is not None:
            dot_general = self.dot_general
        else:
            dot_general = lax.dot_general
        y = dot_general(
            inputs,
            jnp.transpose(kernel),
            (((inputs.ndim - 1,), (0,)), ((), ())),
            precision=self.precision,
        )
        # Squared Euclidean distances via ||x||^2 + ||w||^2 - 2 x·w
        inputs_squared_sum = jnp.sum(inputs ** 2, axis=-1, keepdims=True)
        kernel_squared_sum = jnp.sum(kernel ** 2, axis=-1)
        distances = inputs_squared_sum + kernel_squared_sum - 2 * y

        # Element-wise ⵟ-product followed by the learnable scale
        y = y ** 2 / (distances + self.epsilon)
        scale = (jnp.sqrt(self.features) / jnp.log(1 + self.features)) ** alpha
        y = y * scale
        if bias is not None:
            y += jnp.reshape(bias, (1,) * (y.ndim - 1) + (-1,))
        if self.return_weights:
            return y, kernel
        return y
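For completeness, here is a brief usage sketch of the layer above, written by us; the module and parameter names follow the listing, and the standard Flax init/apply pattern is assumed.

import jax
import jax.numpy as jnp

# Hypothetical usage of the YatDense layer defined above.
layer = YatDense(features=10)
x = jnp.ones((4, 32))                                   # batch of 4 inputs with 32 features
params = layer.init(jax.random.PRNGKey(0), x)           # creates 'kernel', 'alpha', and 'bias'
out = layer.apply(params, x)
print(out.shape)                                        # (4, 10)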
