Abstract—Next-generation communication networks are expected to exploit recent advances in data science and cutting-edge communications technologies to improve the utilization of the available communications resources. In this article, we introduce an emerging deep learning (DL) architecture, the transformer-masked autoencoder (TMAE), and discuss its potential in next-generation wireless networks. We discuss the limitations of current DL techniques in meeting the requirements of 5G and beyond-5G networks, and how the TMAE, which differs from classical DL techniques, can potentially address several wireless communication problems. We highlight various areas in next-generation mobile networks which can be addressed using a TMAE, including source and channel coding, estimation, and security. Furthermore, we demonstrate a case study showing how a TMAE can improve data compression performance and complexity compared to existing schemes. Finally, we discuss key challenges and open future research directions for deploying the TMAE in intelligent next-generation mobile networks.

Index Terms—6G, 5G, convolutional neural networks, deep learning, wireless communication, recurrent neural networks, transformer, masked autoencoder.

A. Zayat, M. A. Hasabelnaby, and A. Chaaban are with the School of Engineering, University of British Columbia, Kelowna, BC V1V 1V7, Canada. M. Obeed is with the Systems and Computer Engineering Department, Carleton University, Ottawa, ON K1S 5B6, Canada.

I. INTRODUCTION

Next-generation (NG) mobile networks are increasingly calling for intelligent architectures that support massive connectivity, ultra-low latency, ultra-high reliability, high quality of experience, high spectral and energy efficiency, and lower deployment costs [1]. One way to meet these stringent requirements is to rethink traditional communication techniques by exploiting recent advances in artificial intelligence.

Traditionally, functions such as waveform design, channel estimation, interference mitigation, and error detection and correction are developed based on theoretical models and assumptions. This traditional approach is not capable of adapting to new challenges introduced by emerging technologies. For instance, the pilot-based channel estimation technique, while efficient for MIMO systems with a few antennas and low-mobility users, is not efficient for massive multiple-input multiple-output (MIMO) systems or high-mobility users. Additionally, the plethora of communication protocols, technologies, and services that have been introduced to support the growing demand and diversity of use cases makes it increasingly difficult to mathematically model wireless networks. As such, optimizing communication schemes using mathematical models becomes extremely challenging and computationally complex, particularly with the integration of demanding 6G applications and services, such as the Metaverse, ubiquitous Extended Reality (XR), intelligent connected robotics, large-scale intelligent reconfigurable surfaces (IRSs), and ultra-massive MIMO (UM-MIMO) networks.

Due to this, the use of deep learning (DL) approaches has been proposed to solve wireless communications challenges, owing to their ability to adapt to dynamic environments, approximate complex models, and utilize data to improve performance [2]. Transformer-enabled DL, initially proposed for natural language processing (NLP) tasks [3], opens the door for further advances in this area. The main advantage of transformers is their superior ability to learn complex dependencies between input features compared to classical deep neural networks (DNNs) [4]. The goal of this paper is to discuss a transformer-based architecture, the transformer masked autoencoder (TMAE), and its application in communications. Particularly, we start by highlighting some limitations of existing DNNs (Sec. II), then we explain the general transformer and MAE architectures (Sec. III). Next, we present a use case where a TMAE enhances data compression by exploiting semantics, which can significantly improve achievable rates in wireless networks (Sec. IV). Finally, we discuss opportunities for improving wireless communications using the TMAE (Sec. V) and summarize the takeaway message of the paper (Sec. VI).

II. CLASSICAL DNN LIMITATIONS IN NG NETWORKS

Recent advances in DL have opened the possibility of designing intelligent mobile networks that learn to operate optimally using massive amounts of data [2], thus overcoming mathematical modeling and computational complexity challenges [5]. Although DL offers advantages over mathematical models, it is not free of limitations. Some common DNNs are discussed next, followed by their limitations in NG networks.

A. Common DNN Architectures

The most commonly used DNN architectures in wireless communications include the following:
• Multi-Layer Perceptrons (MLP): An MLP is a feed-forward neural network (NN) that consists of at least three layers of fully-connected nodes: an input layer,
a hidden layer, and an output layer. MLPs have been proposed to tackle various mobile network problems, such as beamforming and channel estimation [5].
• Convolutional Neural Networks (CNN): CNNs replace fully-connected layers with locally connected kernels that capture local correlations in data. This reduces the number of model parameters, which simplifies training and reduces the risk of overfitting. As a result, CNNs often outperform MLPs in many applications such as computer vision (CV). CNNs have also been used in many wireless communications applications such as semantic communications [2], beamforming, channel estimation, and channel state information (CSI) feedback [5].
• Recurrent Neural Networks (RNN): The main difference between a typical NN (MLP or CNN) and an RNN is that an RNN has feedback connections in addition to feed-forward connections. These connections give RNNs memory of previous inputs/outputs, which is useful for sequential processing and helps in exploiting local and global correlations. This makes them ideal for applications with time-series data [5].
• Long Short-Term Memory (LSTM) Networks: Although RNNs have memory, their memory is short. As a special type of RNN, LSTM networks feature gated memory cells which extend their memory to longer sequences. Due to this, LSTM networks have been used for channel estimation in channels with memory [5].

B. Limitations in NG Networks

The aforementioned NNs can be employed to build encoding and decoding layers (an autoencoder), which produce a different representation of the data and reconstruct it from the new representation, respectively. This can then be used to realize physical-layer processing tasks such as signal design, channel estimation, CSI feedback, modulation, and coding [5].
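As a concrete illustration, here is a minimal PyTorch sketch of such an autoencoder, framed as CSI compression; the layer sizes, code dimension, and the CSIAutoencoder name are illustrative assumptions rather than a design taken from the paper.

```python
# Minimal autoencoder sketch: the encoder maps a CSI vector to a low-
# dimensional code, and the decoder reconstructs the input from that code.
import torch
import torch.nn as nn

class CSIAutoencoder(nn.Module):          # hypothetical name and dimensions
    def __init__(self, input_dim=256, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),     # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),    # reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CSIAutoencoder()
x = torch.randn(8, 256)                      # a batch of CSI vectors
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
```

Training such a model end-to-end on the reconstruction loss yields the compact code that would be transmitted or fed back in the tasks listed above.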
However, classical DNNs encounter limitations in fully meeting the demands of NG networks. MLPs have limited ability to extract deep features from raw data, and their performance on sequential data is poor. This leads to challenges in generalizing and transferring learned knowledge between different scenarios, hindering their ability to seamlessly adapt and perform in a dynamically changing environment, which is common in wireless networks. While CNNs, RNNs, and LSTM networks are more adaptable spatially/temporally to local/short-term changes, they have limited capability to effectively exploit global/long-term dependencies in sequential data, which is crucial for capturing the intricate patterns and dynamics inherent in NG wireless networks. In addition to this, training recurrent networks (RNN and LSTM) suffers from challenges related to convergence, vanishing gradients, and parallelization, which limits their usefulness in latency-sensitive applications.

These limitations underscore the need for innovative enhancements, alternative architectures, and hybrid approaches to effectively address the evolving requirements of NG networks. Recently, attention-based DL realized using transformer networks was proposed and shown to achieve remarkable performance gains in various CV and NLP applications compared to classical DL [3]. Next, we present the basic architecture of transformers and discuss their potential compared to classical DNNs. The potential of the TMAE, a transformer-based autoencoder, for NG networks is discussed afterward.

III. POTENTIAL OF TRANSFORMER-BASED NNS

A transformer is an NN architecture proposed originally for NLP [3]. Owing to their remarkable ability to capture complex patterns and relationships in data, transformers have been adapted for various applications, including CV and wireless communications. This is because they have several advantages over classical DNNs. First, transformers employ the attention mechanism, which allows them to dynamically weigh the importance of segments in the data and thereby capture short- and long-term dependencies. This is particularly beneficial in time-series analysis where extended sequences of data points are involved, enhancing their capability of generalizing and adapting to changing environments. Second, transformers take advantage of parallel processing, which significantly improves their efficiency. Finally, pre-trained transformer models, such as BERT [3], can be fine-tuned for customized objectives using a small dataset (realizing transfer learning). These advantages make transformers attractive for challenging DL tasks, including wireless communication applications. This section delves into the workings of transformers and examines their key components. It also examines the architecture of the TMAE.

A. Transformer Architecture

There is a variety of transformer architectures depending on the application. However, all architectures share the same fundamental principle: the attention mechanism. The main components of transformers, as shown in Fig. 1, are input embedding, positional encoding, and multi-head attention, which are discussed next.

Input Embedding: First, the input vector is segmented and projected into the embedding space (usually of higher dimension than the input). This can be achieved using a single convolutional layer, for example. The result of this step is a representation in the embedding space of the segments of the input vector, each of which is characterized by a position.
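As an illustration, this segmentation-and-projection step can be written in a few lines of PyTorch; the patch size, embedding dimension, and image size below are illustrative assumptions.

```python
# Input embedding sketch: a single convolutional layer with stride equal to
# its kernel size splits the image into non-overlapping patches and projects
# each patch into the embedding space.
import torch
import torch.nn as nn

patch, d_model = 16, 768
embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)            # one RGB image
tokens = embed(img)                          # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): one embedding per segment
```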
Positional Encoding: Since the transformer encoder has no recurrence (unlike an RNN), it is essential to add segment positions into the data. This is accomplished with positional encoding. There are numerous methods for positional encoding, one of which employs trigonometric functions. In this case, each odd-indexed segment is encoded using samples of a cosine function with a frequency that depends on the position, effectively encoding this positional information in the generated vector. Similarly, samples of a sine function are used to encode the positions of even-indexed segments. These positional encoding vectors are then added to the input embeddings of their respective segments.
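A minimal sketch of the widely used sinusoidal encoding from [3] follows; in this per-dimension convention, sine samples fill the even embedding indices and cosine samples the odd ones, with position-dependent frequencies, matching the trigonometric scheme described above.

```python
# Sinusoidal positional encoding sketch: each position gets a unique vector of
# sine/cosine samples, which is added to the corresponding input embedding.
import math
import torch

def positional_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(num_positions).unsqueeze(1).float()   # (N, 1)
    freq = torch.exp(-torch.arange(0, d_model, 2).float()
                     * math.log(10000.0) / d_model)          # (d_model/2,)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)                      # even indices
    pe[:, 1::2] = torch.cos(pos * freq)                      # odd indices
    return pe

tokens = torch.randn(1, 196, 768)                # embedded segments
tokens = tokens + positional_encoding(196, 768)  # inject position information
```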
Multi-head Attention: Multi-head attention is the most important component of transformers and plays a crucial role in quantifying the relationships between the inputs. This is achieved using the self-attention mechanism, which relates each segment of the input to every other segment.
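To make the mechanism concrete, below is a minimal sketch of single-head scaled dot-product self-attention; multi-head attention runs several such heads in parallel on different learned projections and concatenates their outputs. The dimensions are illustrative assumptions.

```python
# Self-attention sketch: segments are projected to queries, keys, and values;
# the softmax-normalized query-key products quantify how strongly each segment
# relates to every other one, and weight the combination of the values.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv                      # (N, d_k) projections
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)                   # pairwise relation weights
    return weights @ v

d, d_k = 768, 64
x = torch.randn(196, d)                                   # embedded segments
wq, wk, wv = (torch.randn(d, d_k) for _ in range(3))      # toy projection matrices
out = self_attention(x, wq, wk, wv)                       # (196, 64)
```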
[...] ability, and robustness in tasks such as signal processing, channel estimation, and resource allocation, contributing to the evolution of resilient and high-performance wireless networks.

C. Transformer Challenges

The superiority of transformer-based NNs comes at the cost of some challenges. Their parallel nature increases the resources needed for them to run. Transformers also require a substantial amount of labeled data for training, often necessitating pre-training on large corpora before fine-tuning on the target task. Additionally, fine-tuning pre-trained transformers on specific tasks requires careful balancing to avoid catastrophic forgetting and retain previously learned knowledge [8]. Note, however, that some of these challenges are shared with classical DNNs. A comparison between classical DNNs and transformer-based NNs is given in Table I, providing a concise overview of their strengths and limitations, which helps in understanding their suitability for different tasks.

TABLE I: Comparative Evaluation of Deep Learning Techniques Based on Key Characteristics. [Table not reproduced here.]

In the next section, we provide a case study illustrating how a TMAE enhances the compression rate of existing compression schemes.

IV. CASE STUDY: TMAE-ENHANCED COMPRESSION

In some communication applications, communicating nodes have limited computational resources and are deployed in resource-constrained environments. Examples include Unmanned Aerial Vehicle (UAV) and Internet of Things (IoT) applications. Yet, the demand for transmitting vast amounts of data (whether it be images, sensor readings, or text) persists. Given the limited resources in these systems, transmitting vast amounts of data becomes challenging. Fortunately, such data often exhibits inherent correlations between its segments. For instance, in an image, certain patches can be inferred from their neighboring patches due to the visual structure and semantics of the scene. Similarly, in a sequence of sensor readings, some values can be predicted based on the preceding and succeeding values. Thus, one can omit certain segments of the data, leveraging the correlations to infer these segments at the receiver's end. This approach not only reduces the amount of data to be transmitted but also requires fewer processing resources at the transmitter, making it particularly advantageous for systems like UAVs capturing and transmitting real-time images or IoT devices sending frequent sensor updates.

A. Example: UAV with TMAE-Enhanced Compression

Consider a UAV operating in a distant, hard-to-reach region, tasked with gathering essential information about the landscape through imaging. The primary challenge for this UAV is not merely capturing the images but efficiently transmitting them back to a base station (BS). In this scenario, the communication channel quality is poor, limiting the amount of information that can be sent. This requires compressing the image aggressively, which can compromise its meaningful interpretation. Conventional compression techniques may fail in this scenario. Fig. 4 shows an example where a standard compression scheme produces a very low-quality image when the compression rate (in bits per pixel (BPP)) is low.

Fig. 4: Remote UAV imaging: A qualitative comparison between the conventional JPEG compression and the proposed JPEG-TMAE compression scheme. [Figure not reproduced here.]

TMAE-enhanced compression offers a distinct advantage. Once an image is captured, the UAV segments it into non-overlapping patches and randomly masks some of them, relying on the ability of the BS to infer them from unmasked patches due to the potentially high correlation between patches. Once the patches are masked and removed, the remaining patches are stacked to form a condensed image, which is then further compressed using a standard compression algorithm. Upon receipt at the BS, the compressed image is decompressed to retrieve the stacked patches, and the TMAE, trained to handle random masks, reconstructs the masked patches to reproduce the complete image. For accurate reconstruction, the TMAE requires information about the location of the masked patches. This mask can be obtained by sharing the seed of the random number generator used to generate the mask (or can be stored at the receiver if it is deterministic). While the resulting image might not be of the highest quality, it conveys essential and interpretable information from the UAV despite the communication challenges, as can be seen in Fig. 4. Next, we examine the proposed scheme quantitatively when JPEG is used as the standard compression algorithm.
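The transmitter-side step described above can be sketched as follows, assuming a square image whose sides are multiples of the patch size; the mask_and_stack helper is hypothetical and only illustrates the seeded masking and stacking, not the authors' implementation.

```python
# Masking sketch: a seeded RNG selects the patches to drop; the surviving
# patches are stacked into a condensed array for standard compression, and
# the BS can regenerate the same mask from the shared seed.
import numpy as np

def mask_and_stack(img, patch=16, mask_ratio=0.67, seed=42):
    h, w, c = img.shape                               # h, w divisible by patch
    patches = (img.reshape(h // patch, patch, w // patch, patch, c)
                  .swapaxes(1, 2).reshape(-1, patch, patch, c))
    rng = np.random.default_rng(seed)                 # seed shared with the BS
    n_keep = int(len(patches) * (1 - mask_ratio))
    keep = rng.permutation(len(patches))[:n_keep]     # unmasked patch indices
    return patches[keep], keep                        # condensed data + mask info

stacked, keep = mask_and_stack(np.zeros((256, 256, 3), dtype=np.uint8))
# `stacked` is JPEG-compressed and sent; the BS re-derives `keep`, re-inserts
# the received patches at their positions, and lets the TMAE infer the rest.
```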
B. JPEG-TMAE Compression

The proposed JPEG-TMAE compression scheme is a combination of the widely-used JPEG compression and the TMAE, as shown in Fig. 4. To implement this, we use patches of size 16 × 16 pixels (as an example). Then, we use a masking ratio of R_mask = 0.67 (i.e., 2/3 of the image is masked) and random masking with a seed that is shared between the UAV and the BS. Thus, the size reduction achieved by masking is the ratio between the sizes of the stacked and the original images, namely (1 − R_mask). The remaining patches are stacked and compressed using JPEG. As such, the overall compression rate equals (1 − R_mask) multiplied by the JPEG compression rate. Upon receiving the compressed image, the BS decompresses it using JPEG and then uses a pre-trained TMAE to reconstruct the image.
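A quick sanity check of this rate arithmetic, with the JPEG rate on the stacked image chosen here only as an assumed value:

```python
# Overall rate sketch: keeping (1 - R_mask) of the pixels and coding them with
# JPEG at bpp_jpeg bits per stacked pixel gives the bits per original pixel.
r_mask = 0.67                       # fraction of patches masked
bpp_jpeg = 0.765                    # assumed JPEG rate on the stacked image
bpp_overall = (1 - r_mask) * bpp_jpeg
print(f"{bpp_overall:.3f} BPP")     # 0.252 BPP, close to the 0.255 BPP used below
```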
Leveraging the pre-trained transformer-based model from Facebook AI Research (FAIR) [7], we bypassed extensive training, focusing instead on architectural modifications tailored to our compression goals. This strategic approach ensured an optimal balance between computational efficiency and performance in the described UAV communication scenario.

Fig. 4 shows a comparison where JPEG and the proposed JPEG-TMAE schemes are used to compress an image, using compression rates of 0.41 and 0.255 BPP for JPEG, and a compression rate of 0.255 BPP for the JPEG-TMAE scheme. The JPEG scheme fails at the compression rate of 0.255 BPP, contrary to the proposed JPEG-TMAE, which reconstructs the image with much better quality at the same overall compression rate (a quality comparable with JPEG at 0.41 BPP).

For a quantitative comparison, we can use the structural similarity measure (SSIM) to assess the quality of the reconstructed images [9]. SSIM gauges the similarity between the input and output images, factoring in the interpixel dependencies of closely situated pixels. Recognized as a full-reference metric, SSIM measures image quality using an uncompressed or noise-free initial input image as its reference. Thus, we examine the SSIM of the proposed JPEG-TMAE on the Kodak dataset [10] (often used to compare compression methods). Through a methodical process of iterating over the masking ratio and adjusting the JPEG compression rate for every iteration, we were able to delineate an optimized performance spectrum across all examined masking ratios. We compare with several leading models, specifically mbt2018 (CNN-based) [11], cheng2020-anchor (CNN-based) [12], and ConvLSTM (CNN-LSTM-based) [13].
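For reference, the following is a minimal sketch of how such an SSIM score can be computed with scikit-image; the file names are placeholders.

```python
# SSIM evaluation sketch: compare a reconstruction against its uncompressed
# reference, as done for the Kodak images.
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim

ref = imread("kodak_reference.png")       # uncompressed reference image
rec = imread("kodak_reconstructed.png")   # JPEG-TMAE output
score = ssim(ref, rec, channel_axis=-1, data_range=255)
print(f"SSIM = {score:.3f}")              # 1.0 means identical images
```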
Fig. 5 displays the SSIM in relation to the overall compression rate. Notably, the JPEG-TMAE not only outperforms JPEG, particularly at lower compression rates, but also exhibits superior performance compared to leading models such as mbt2018, cheng2020-anchor, and ConvLSTM at these rates. In the realm of moderate compression rates, it matches the performance of these models. An added advantage is its design approach: while the aforementioned methods are based on autoencoders and thus require part of the DNN architecture to be incorporated at the UAV, increasing its complexity, our JPEG-TMAE seamlessly integrates the entire NN onto the BS, thus simplifying processing at the UAV. Note that while the ConvLSTM method performs best at higher compression rates, we expect a similar transformer-based architecture to perform even better.

In summary, this approach is highly suitable for communication applications with limited resources at the transmitter side (UAVs, IoT, etc.). Note that the results in Fig. 5 can be further improved using dynamic masking to ensure sufficient correlation between masked/unmasked patches, leading to either a lower compression rate or better image quality.

V. APPLICATIONS IN NG NETWORKS

As demonstrated above, the TMAE can be used efficiently to reduce the size of an image and reconstruct it using its semantics. This improves source coding and consequently increases the throughput of wireless communications. In this section, we discuss applications of the TMAE in various areas of wireless communication.