0% found this document useful (0 votes)

7 views6 pages

Background: Image Transformer

The document discusses the Image Transformer, a model that utilizes self-attention mechanisms for image generation, achieving state-of-the-art results on datasets like ImageNet. It compares the Image Transformer to existing models such as PixelCNN and GANs, highlighting its advantages in terms of computational efficiency and image quality. The architecture incorporates local self-attention to manage the complexity of processing high-dimensional image data while maintaining effective receptive fields.

Uploaded by

hansneuro2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views6 pages

Background: Image Transformer

Uploaded by

hansneuro2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

Image Transformer

of parameters and consequently computational performance 2. Background

and can make training such models more challenging.
There is a broad variety of types of image generation mod-
In this work we show that self-attention (Cheng et al., 2016; els in the literature. This work is strongly inspired by au-
Parikh et al., 2016; Vaswani et al., 2017) can achieve a better toregressive models such as fully visible belief networks
balance in the trade-off between the virtually unlimited and NADE (Bengio & Bengio, 2000; Larochelle & Mur-
receptive field of the necessarily sequential PixelRNN and ray, 2011) in that we also factor the joint probability of
the limited receptive field of the much more parallelizable the image pixels into conditional distributions. Following
PixelCNN and its various extensions. PixelRNN (van den Oord et al., 2016a), we also model the
We adopt similar factorizations of the joint pixel distribu- color channels of the output pixels as discrete values gener-
tion as previous work. Following recent work on model- ated from a multinomial distribution, implemented using a
ing text (Vaswani et al., 2017), however, we propose es- simple softmax layer.
chewing recurrent and convolutional networks in favor of The current state of the art in modeling images on CIFAR-
the Image Transformer, a model based entirely on a self- 10 data set was achieved by PixelCNN++, which models the
attention mechanism. The specific, locally restricted form output pixel distribution with a discretized logistic mixture
of multi-head self-attention we propose can be interpreted likelihood, conditioning on whole pixels instead of color
as a sparsely parameterized form of gated convolution. By channels and changes to the architecture (Salimans et al.).
decoupling the size of the receptive field from the num- These modifications are readily applicable to our model,
ber of parameters, this allows us to use significantly larger which we plan to evaluate in future work.
receptive fields than the PixelCNN.
Another, popular direction of research in image generation
Despite comparatively low resource requirements for train- is training models with an adversarial loss (Goodfellow
ing, the Image Transformer attains a new state of the art et al., 2014). Typically, in this regime a generator net-
in modeling images from the standard ImageNet data set, work is trained in opposition to a discriminator network
as measured by log-likelihood. Our experiments indicate trying to determine if a given image is real or generated. In
that increasing the size of the receptive field plays a sig- contrast to the often blurry images generated by networks
nificant role in this improvement. We observe significant trained with likelihood-based losses, generative adversar-
improvements up to effective receptive field sizes of 256 ial networks (GANs) have been shown to produce sharper
pixels, while the PixelCNN (van den Oord et al., 2016b) images with realistic high-frequency detail in generation
with 5x5 filters used 25. and image super-resolution tasks (Zhang et al., 2016; Ledig
Many applications of image density models require condi- et al., 2016).
tioning on additional information of various kinds: from im- While very promising, GANs have various drawbacks. They
ages in enhancement or reconstruction tasks such as super- are notoriously unstable (Radford et al., 2015), motivating
resolution, in-painting and denoising to text when synthesiz- a large number of methods attempting to make their train-
ing images from natural language descriptions (Mansimov ing more robust (Metz et al., 2016; Berthelot et al., 2017).
et al., 2015). In visual planning tasks, conditional image Another common issue is that of mode collapse, where gen-
generation models could predict future frames of video con- erated images fail to reflect the diversity in the training set
ditioned on previous frames and taken actions. (Metz et al., 2016).
In this work we hence also evaluate two different methods A related problem is that GANs do not have a density in
of performing conditional image generation with the Im- closed-form. This makes it challenging to measure the
age Transformer. In image-class conditional generation we degree to which the models capture diversity. This also
condition on an embedding of one of a small number of complicates model design. Objectively evaluating and com-
image classes. In super-resolution with high magnification paring, say, different hyperparameter choices is typically
ratio (4x), we condition on a very low-resolution image, much more difficult in GANs than in models with a tractable
employing the Image Transformer in an encoder-decoder likelihood.
configuration (Kalchbrenner & Blunsom, 2013). In com-
parison to recent work on autoregressive super-resolution
(Dahl et al., 2017), a human evaluation study found im-
3. Model Architecture
ages generated by our models to look convincingly natural 3.1. Image Representation
significantly more often.
We treat pixel intensities as either discrete categories or ordi-
nal values; this setting depends on the distribution (Section
3.4). For categories, each of the input pixels’ three color
channels is encoded using a channel-specific set of 256
Image Transformer

Input Gen Truth Input Gen Truth

Table 2. On the left are image completions from our best conditional generation model, where we sample the second half. On the right are
samples from our four-fold super-resolution model trained on CIFAR-10. Our images look realistic and plausible, show good diversity
among the completion samples and observe the outputs carry surprising details for coarse inputs in super-resolution.

d-dimensional embedding vectors of the intensity values

q’
0 − 255. For output intensities, we share a single, separate
set of 256 d-dimensional embeddings across the channels. LayerNorm
For an image of width w and height h, we combine the
Dropout
width and channel dimensions yielding a 3-dimensional
tensor with shape [h, w · 3, d]. LayerNorm FFNN

For ordinal values, we run a 1x3 window size, 1x3 strided Dropout
convolution to combine the 3 channels per pixel to form an
input representation with shape [h, w, d].
To each pixel representation, we add a d-dimensional encod-
Local Self-Attention

ing representing coordinates of that pixel. We evaluated two MatMul Wv MatMul Wv MatMul Wv
different coordinate encodings: sine and cosine functions of
softmax
the coordinates, with different frequencies across different
cmp cmp cmp
dimensions, following (Vaswani et al., 2017), and learned
·
position embeddings. Since we need to represent two co-
+ + + + MatMul Wq MatMul Wk
ordinates, we use d/2 of the dimensions to encode the row
number and the other d/2 of the dimensions to encode the pq p1 p2 p3
q m1 m2 m3
the column and color channel.

3.2. Self-Attention Figure 1. A slice of one layer of the Image Transformer, recom-
puting the representation q 0 of a single channel of one pixel q by
For image-conditioned generation, as in our super-resolution attending to a memory of previously generated pixels m1 , m2 , . . ..
models, we use an encoder-decoder architecture. The en- After performing local self-attention we apply a two-layer position-
coder generates a contextualized, per-pixel-channel repre- wise feed-forward neural network with the same parameters for
sentation of the source image. The decoder autoregressively all positions in a given layer. Self-attention and the feed-forward
networks are followed by dropout and bypassed by a residual
generates an output image of pixel intensities, one channel
connection with subsequent layer normalization. The position
per pixel at each time step. While doing so, it consumes the encodings pq , p1 , . . . are added only in the first layer.
previously generated pixels and the input image represen-
Image Transformer

tation generated by the encoder. For both the encoder and the transformed memory, weighted by the attention distribu-
decoder, the Image Transformer uses stacks of self-attention tion. In the decoders of our different models we mask the
and position-wise feed-forward layers, similar to (Vaswani outputs of the comparisons appropriately so that the model
et al., 2017). In addition, the decoder uses an attention cannot attend to positions in the memory that have not been
mechanism to consume the encoder representation. For un- generated, yet.
conditional and class-conditional generation, we employ the
To the resulting vector we then apply a single-layer fully-
Image Transformer in a decoder-only configuration.
connected feed-forward neural network with rectified linear
Before we describe how we scale self-attention to images activation followed by another linear transformation. The
comprised of many more positions than typically found in learned parameters of these are shared across all positions
sentences, we give a brief description of self-attention. but different from layer to layer.
Each self-attention layer computes a d-dimensional repre- As illustrated in Figure1, we perform dropout, merge in
sentation for each position, that is, each channel of each residual connections and perform layer normalization after
pixel. To recompute the representation for a given posi- each application of self-attention and the position-wise feed-
tion, it first compares the position’s current representation forward networks (Ba et al., 2016; Srivastava et al., 2014).
to other positions’ representations, obtaining an attention
The entire self-attention operation can be implemented using
distribution over the other positions. This distribution is
highly optimized matrix multiplication code and executed
then used to weight the contribution of the other positions’
in parallel for all pixels’ channels.
representations to the next representation for the position at
hand.
3.3. Local Self-Attention
Equations 1 and 2 outline the computation in our self-
attention and fully-connected feed-forward layers; Figure The number of positions included in the memory lm , or the
1 depicts it. W1 and W2 are the parameters of the feed- number of columns of M , has tremendous impact on the
forward layer, and are shared across all the positions in a scalability of the self-attention mechanism, which has a time
layer. These fully describe all operations performed in every complexity in O(h · w · lm · d).
layer, independently for each position, with the exception of The encoders of our super-resolution models operate on 8×8
multi-head attention. For details of multi-head self-attention, pixel images and it is computationally feasible to attend to
see (Vaswani et al., 2017). all of their 192 positions. The decoders in our experiments,
however, produce 32 × 32 pixel images with 3072 positions,
rendering attending to all positions impractical.
qa = layernorm(q + dropout( Inspired by convolutional neural networks we address this
Wq q(M Wk )T by adopting a notion of locality, restricting the positions

softmax √ M Wv )) (1) in the memory matrix M to a local neighborhood around
d
the query position. Changing this neighborhood per query
position, however, would prohibit packing most of the com-
putation necessary for self-attention into two matrix multi-
q 0 = layernorm(qa + dropout(W1 ReLu(W2 qa ))) (2)
plications - one for computing the pairwise comparisons and
another for generating the weighted averages. To avoid this,
In more detail, following previous work, we call the cur- we partition the image into query blocks and associate each
rent representation of the pixel’s channel, or position, to be of these with a larger memory block that also contains the
recomputed the query q. The other positions whose repre- query block. For all queries from a given query block, the
sentations will be used in computing a new representation model attends to the same memory matrix, comprised of all
for q are m1 , m2 , . . . which together comprise the columns positions from the memory block. The self-attention is then
of the memory matrix M . Note that M can also contain q. computed for all query blocks in parallel. The feed-forward
We first transform q and M linearly by learned matrices Wq networks and layer normalizations are computed in parallel
and Wk , respectively. for all positions.
The self-attention mechanism then compares q to each of In our experiments we use two different schemes for choos-
the pixel’s channel representations
√ in the memory with a dot- ing query blocks and their associated memory block neigh-
product, scaled by 1/ d. We apply the softmax function borhoods, resulting in two different factorizations of the
to the resulting compatibility scores, treating the obtained joint pixel distribution into conditional distributions. Both
vector as attention distribution over the pixel channels in are illustrated in Figure 2.
the memory. After applying another linear transformation
Wv to the memory M , we compute a weighted average of
Image Transformer

Local 1D Attention Local 2D Attention

1D Local Attention For 1D local attention (Section 3.3)
we first flatten the input tensor with positional encodings in Memory Block
Memory Block
raster-scan order, similar to previous work (van den Oord
et al., 2016a). To compute self-attention on the resulting
linearized image, we then partition the length into non-
q q
overlapping query blocks Q of length lq , padding with ze- Query Block

roes if necessary. While contiguous in the linearized image, Query Block

these blocks can be discontiguous in image coordinate space.

For each query block we build the memory block M from
the same positions as Q and an additional lm positions cor-
responding to pixels that have been generated before, which Figure 2. The two different conditional factorizations used in our
can result in overlapping memory blocks. experiments, with 1D and 2D local attention on the left and right,
respectively. In both, the image is partitioned into non-overlapping
query blocks, each associated with a memory block covering a
2D Local Attention In 2D local attention models, we
superset of the query block pixels. In every self-attention layer,
partition the input tensor with positional encodings into
each position in a query block attends to all positions in the memory
rectangular query blocks contiguous in the original image block. The pixel marked as q is the last that was generated. All
space. We generate the image one query block after another, channels of pixels in the memory and query blocks shown in white
ordering the blocks in raster-scan order. Within each block, have masked attention weights and do not contribute to the next
we generate individual positions, or pixel channels, again in representations of positions in the query block. While the effective
raster-scan order. receptive field size in this figure is the same for both schemes, in
2D attention the memory block contains a more evenly balanced
As illustrated in the right half of Figure 2, we generate the number of pixels next to and above the query block, respectively.
blocks outlined in grey lines left-to-right and top-to-bottom.
We use 2-dimensional query blocks of a size lq specified by
height and width lq = wq ·hq , and memory blocks extending
the query block to the top, left and right by hm , wm and dence across channels (Salimans et al.). For each pixel, the
again wm pixels, respectively. number of parameters is 10 times the number of mixture
components: 10 for one unnormalized mixture probability,
In both 1D and 2D local attention, we mask attention three means, three standard deviations, and three coefficients
weights in the query and memory blocks such that posi- which capture the linear dependence. For 10 mixtures, this
tions that have not yet been generated are ignored. translates to 100 parameters for each pixel; for 32 × 32 im-
As can be seen in Figure 2, 2D local attention balances hori- ages, the network outputs 102, 400 dimensions, which is a
zontal and vertical conditioning context much more evenly. 7x reduction enabling denser gradients and lower memory.
We believe this might have an increasingly positive effect on
quality with growing image size as the conditioning informa- 4. Inference
tion in 1D local attention becomes increasingly dominated
by pixels next to a given position as opposed to above it. Across all of the presented experiments, we use categorical
sampling during decoding with a tempered softmax (Dahl
3.4. Loss Function et al., 2017). We adjust the concentration of the distribution
we sample from with a temperature τ > 0 by which we
We perform maximum likelihood, in which we maximize divide the logits for the channel intensities.
Ph·w·3
log p(x) = t=1 log p(xt | x<t ) with respect to network
We tuned τ between 0.8 and 1.0, observing the highest
parameters, and where the network outputs all parameters
perceptual quality in unconditioned and class-conditional
of the autoregressive distribution. We experiment with two
image generation with τ = 1.0. For super-resolution we
settings of the distribution: a categorical distribution across
present results for different temperatures in Table 5.
each channel (van den Oord et al., 2016a) and a mixture of
discretized logistics over three channels (Salimans et al.).
5. Experiments
The categorical distribution (cat) captures each intensity
value as a discrete outcome and factorizes across channels. All code we used to develop, train, and evaluate our models
In total, there are 256·3 = 768 parameters for each pixel; for is available in Tensor2Tensor (Vaswani et al., 2018).
32 × 32 images, the network outputs 786, 432 dimensions.
For all experiments we optimize with Adam (Kingma & Ba,
Unlike the categorical distribution, the discretized mixture 2015), and vary the learning rate as specified in (Vaswani
of logistics (DMOL) captures two important properties: et al., 2017). We train our models on both p100 and k40
the ordinal nature of pixel intensities and simpler depen- GPUs, with batch sizes ranging from 1 to 8 per GPU.
Image Transformer

Table 4. Bits/dim on CIFAR-10 test and ImageNet validation sets.

The Image Transformer outperforms all models and matches Pixel-
CNN++, achieving a new state-of-the-art on ImageNet. Increasing
memory block size (bsize) significantly improves performance.
Model Type bsize NLL
CIFAR-10 ImageNet
(Test) (Validation)
Pixel CNN - 3.14 -
Row Pixel RNN - 3.00 3.86
Gated Pixel CNN - 3.03 3.83
Pixel CNN++ - 2.92 -
PixelSNAIL - 2.85 3.80
Ours 1D local (8l, cat) 8 4.06 -
16 3.47 -
64 3.13 -
256 2.99 -
Ours 1D local (cat) 256 2.90 3.77
Ours 1D local (dmol) 256 2.90 -

Table 3. Conditional image generations for all CIFAR-10 cate- ImageNet is a much larger dataset, with many more cate-
gories. Images on the left are from a model that achieves 3.03 gories than CIFAR-10, requiring more parameters in a gener-
bits/dim on the test set. Images on the right are from our best ative model. Our ImageNet unconditioned generation model
non-averaged model with 2.99 bits/dim. Both models are able
has 12 self-attention and feed-forward layers, d = 512, 8
to generate convincing cars, trucks, and ships. Generated horses,
planes, and birds also look reasonable.
attention heads, 2048 dimensions in the feed-forward lay-
ers, and dropout of 0.1. It significantly outperforms the
Gated PixelCNN and establishes a new state-of-the-art of
3.77 bits/dim with checkpoint averaging. We trained only
5.1. Generative Image Modeling unconditional generative models on ImageNet, since class
labels were not available in the dataset provided by (van den
Our unconditioned and class-conditioned image generation Oord et al., 2016a).
models both use 1D local attention, with lq = 256 and a total
memory size of 512. On CIFAR-10 our best unconditional Table 4 shows that growing the receptive field improves
models achieve a perplexity of 2.90 bits/dim on the test set perplexity significantly. We believe this to highlight a key
using either DMOL or categorical. For categorical, we use advantage of local self-attention over CNNs: namely that
12 layers with d = 512, heads=4, feed-forward dimension the number of parameters used by local self-attention is
2048 with a dropout of 0.3. In DMOL, our best config uses independent of the size of the receptive field. Furthermore,
14 layers, d = 256, heads=8, feed-forward dimension 512 while d > receptivefield, self-attention still requires fewer
and a dropout of 0.2. This is a considerable improvement floating-point operations.
over two baselines: the PixelRNN (van den Oord et al., For experiments with the categorical distribution we evalu-
2016a) and PixelCNN++ (Salimans et al.). Introduced after ated both coordinate encoding schemes described in Section
the Image Transformer, the also self-attention based Pixel- 3.3 and found no difference in quality. For DMOL we only
SNAIL model reaches a significantly lower perplexity of evaluated learned coordinate embeddings.
2.85 bits/dim on CIFAR-10 (Chen et al., 2017). On the
more challenging ImageNet data set, however, the Image
5.2. Conditioning on Image Class
Transformer performs significantly better than PixelSNAIL.
We represent the image classes as learned d-dimensional
We also train smaller 8 layer CIFAR-10 models which have
embeddings per class and simply add the respective em-
d = 512, 1024 dimensions in the feed-forward layers, 8
bedding to the input representation of every input position
attention heads and use dropout of 0.1, and achieve 3.03
together with the positional encodings.
bits/dim, matching the PixelCNN model (van den Oord
et al., 2016a). Our best CIFAR-10 model with DMOL has d We trained the class-conditioned Image Transformer on
and feed-forward layer layer dimension of 256 and perform CIFAR-10, achieving very similar log-likelihoods as in un-
attention in 512 dimensions. conditioned generation. The perceptual quality of generated
Image Transformer

images, however, is significantly higher than that of our

Table 5. Negative log-likelihood and human eval performance for
unconditioned models. The samples from our 8-layer class-
the Image Transformer on CelebA. The fraction of humans fooled
conditioned models in Table 3, show that we can generate is significantly better than the previous state of the art.
realistic looking images for some categories, such as cars
and trucks. Model Type τ %Fooled
ResNet n/a 4.0
5.3. Image Super-Resolution srez GAN n/a 8.5
Super-resolution is the process of recovering a high resolu- PixelRecursive 1.0 11.0
tion image from a low resolution image while generating (Dahl et al., 2017) 0.9 10.4
realistic and plausible details. Following (Dahl et al., 2017), 0.8 10.2
in our experimental setup we enlarge an 8 × 8 pixel image 1D local 1.0 29.6 ± 4.0
four-fold to 32 × 32, a process that is massively underspec- Image Transformer 0.9 33.5 ± 3.5
ified: the model has to generate aspects such as texture of 0.8 35.94 ± 3.0
hair, makeup, skin and sometimes even gender that cannot 2D local 1.0 30.64 ± 4
possibly be recovered from the source image. Image Transformer 0.9 34 ± 3.5
0.8 36.11 ± 2.5
Here, we use the Image Transformer in an encoder-decoder
configuration, connecting the encoder and decoder through
an attention mechanism (Vaswani et al., 2017). For the
encoder, we use embeddings for RGB intensities for each
pixel in the 8×8 image and add 2 dimensional positional
encodings for each row and width position. Since the input Turk where each worker is required to make a binary choice
is small, we flatten the whole image as a [h × w × 3, d] when shown one generated and one real image. Following
tensor, where d is typically 512. We then feed this sequence the same procedure for the evaluation study as (Dahl et al.,
to our stack of transformer encoder layers that uses repeated 2017), we show 50 pairs of images, selected randomly from
self-attention and feed forward layers. In the encoder we the validation set, to 50 workers each. Each generated and
don’t require masking, but allow any input pixel to attend original image is upscaled to 128 × 128 pixels using the
to any other pixel. In the decoder, we use a stack of local Bilinear interpolation method. Each worker then has 1-2
self-attention, encoder-decoder-attention and feed-forward seconds to make a choice between these two images. In
layers. We found using two to three times fewer encoder our method, workers choose images from our model up
than decoder layers to be ideal for this task. to 36.1% of the time, a significant improvement over pre-
We perform end-to-end training of the encoder-decoder vious models. Sampling temperature of 0.8 and 2D local
model for Super resolution using the log-likelihood ob- attention maximized perceptual quality as measured by this
jective function. Our method generates higher resolution evaluation.
images that look plausible and realistic across two datasets. To measure how well the high resolution samples correspond
For both of the following data sets, we resized the image to to the low resolution input, we calculate Consistency, the
8 × 8 pixels for the input and 32 × 32 pixels for the label L2 distance between the low resolution input and a bicubic
using TensorFlow’s area interpolation method. downsampled version of the high resolution sample. We
observe a Consistency score of 0.01 which is on par with
the models in (Dahl et al., 2017).
CelebA We trained both our 1D Local and 2D Local mod-
els on the standard CelebA data set of celebrity faces with We quantify that our models are more effective than exem-
cropped boundaries. With the 1D Local, we achieve a nega- plar based Super Resolution techniques like Nearest Neigh-
tive log likelihood (NLL) of 2.68 bits/dim on the dev set, bors, which perform a naive look-up of the training data to
using lq = 128, memory size of 256, 12 self-attention and find the high resolution output. We take a bicubic down-
feed-forward layers, d = 512, 8 attention heads, 2048 di- sampled version of our high resolution sample, find the
mensions in the feed-forward layers, and a dropout of 0.1. nearest low resolution input image in the training data for
With the 2D Local model, we only change the query and that sample, and calculate the MS-SSIM score between the
memory to now represent a block of size 8 × 32 pixels and high resolution sample and the corresponding high reso-
16 × 64 pixels respectively. This model achieves a NLL lution image in the training data. On average, we get a
of 2.61 bits/dim. Existing automated metrics like pSNR, MS-SSIM score of 44.3, on 50 samples from the validation
SSIM and MS-SSIM have been shown to not correlate with set, which shows that our models don’t merely learn to copy
perceptual image quality (Dahl et al., 2017). Hence, we con- training images but generate high-quality images by adding
ducted a human evaluation study on Amazon Mechanical synthesized details on the low resolution input image.

L S Gan T H F N I S: Arge Cale Raining For IGH Idelity Atural Mage Ynthesis
No ratings yet
L S Gan T H F N I S: Arge Cale Raining For IGH Idelity Atural Mage Ynthesis
35 pages
Internal Resistance Project Class 12
100% (6)
Internal Resistance Project Class 12
16 pages
Image Restoration Using Residual Generative Adversarial Networks-FINAL
No ratings yet
Image Restoration Using Residual Generative Adversarial Networks-FINAL
21 pages
StackGAN Realistic Image Synthesis With Stacked Generative Adversarial Networks
No ratings yet
StackGAN Realistic Image Synthesis With Stacked Generative Adversarial Networks
16 pages
Training Generative Adversarial Networks With Limited Data
No ratings yet
Training Generative Adversarial Networks With Limited Data
37 pages
Minh Hoa KTHK1 Anh 11 - Linh
No ratings yet
Minh Hoa KTHK1 Anh 11 - Linh
2 pages
Exploring Painting Synthesis With Diffusion Models 2
No ratings yet
Exploring Painting Synthesis With Diffusion Models 2
10 pages
Application Letter For Safaricom
No ratings yet
Application Letter For Safaricom
4 pages
GIRAFFE Representing Scenes As Compositional Generative Neural Feature Fields - 2011.12100v2
No ratings yet
GIRAFFE Representing Scenes As Compositional Generative Neural Feature Fields - 2011.12100v2
12 pages
StyleSwin Transformer-Based GAN For High-Resolution Image Generation
No ratings yet
StyleSwin Transformer-Based GAN For High-Resolution Image Generation
11 pages
Generative Adversarial Network An Overview of Theory and Applications
No ratings yet
Generative Adversarial Network An Overview of Theory and Applications
9 pages
1D Local Attention: 3.3 Van Den Oord Et Al. 2016a
No ratings yet
1D Local Attention: 3.3 Van Den Oord Et Al. 2016a
6 pages
A Comprehensive Survey of Image Generation Models Based On Deep Learning
No ratings yet
A Comprehensive Survey of Image Generation Models Based On Deep Learning
30 pages
Epstein Online Papers 2024
No ratings yet
Epstein Online Papers 2024
11 pages
Methods and Trends in Detecting Generated Images: A Comprehensive Review
No ratings yet
Methods and Trends in Detecting Generated Images: A Comprehensive Review
30 pages
Satgan Paper
No ratings yet
Satgan Paper
17 pages
Generative Adversarial Transformers
No ratings yet
Generative Adversarial Transformers
22 pages
2103 - Generative Adversarial Transformers
No ratings yet
2103 - Generative Adversarial Transformers
22 pages
2021 Arxiv - TransGAN Two Transformers Can Make One Strong GAN
No ratings yet
2021 Arxiv - TransGAN Two Transformers Can Make One Strong GAN
13 pages
Few-Shot Image Generation Via Style Adaptation and Content Preservation
No ratings yet
Few-Shot Image Generation Via Style Adaptation and Content Preservation
12 pages
XQ-GAN: An Open-Source Image Tokenization Framework For Autoregressive Generation
No ratings yet
XQ-GAN: An Open-Source Image Tokenization Framework For Autoregressive Generation
12 pages
DL U-III Computer Vision
No ratings yet
DL U-III Computer Vision
30 pages
Are GAN Generated Images Easy To Detect A Critical Analysis of The State-Of-The-Art.
No ratings yet
Are GAN Generated Images Easy To Detect A Critical Analysis of The State-Of-The-Art.
7 pages
Sensors: HEMIGEN: Human Embryo Image Generator Based On Generative Adversarial Networks
No ratings yet
Sensors: HEMIGEN: Human Embryo Image Generator Based On Generative Adversarial Networks
16 pages
Vision Transformers For Dense Prediction: Ren e Ranftl Alexey Bochkovskiy Intel Labs Vladlen Koltun
No ratings yet
Vision Transformers For Dense Prediction: Ren e Ranftl Alexey Bochkovskiy Intel Labs Vladlen Koltun
15 pages
UVCGAN UNetVision Transformer Cycle-Consistent GAN For Unpaired
No ratings yet
UVCGAN UNetVision Transformer Cycle-Consistent GAN For Unpaired
17 pages
Visual Gans
No ratings yet
Visual Gans
19 pages
Self-Attention GAN
No ratings yet
Self-Attention GAN
10 pages
Generative - Adversarial - Networks - For - Extreme - Learned - Image - Compression
No ratings yet
Generative - Adversarial - Networks - For - Extreme - Learned - Image - Compression
11 pages
VQGAN: Taming Transformer For High-Resolution Image Synthesis
No ratings yet
VQGAN: Taming Transformer For High-Resolution Image Synthesis
52 pages
Proceedings of The 1ST International Congress of The International Society of Sports Sciences in The Arab World
No ratings yet
Proceedings of The 1ST International Congress of The International Society of Sports Sciences in The Arab World
140 pages
Zero-Shot Detection of AI-Generated Images: (Davide - Cozzolino, Poggi, Verdoliv) @unina - It Niessner@tum - de
No ratings yet
Zero-Shot Detection of AI-Generated Images: (Davide - Cozzolino, Poggi, Verdoliv) @unina - It Niessner@tum - de
24 pages
Bicycle Gan
No ratings yet
Bicycle Gan
12 pages
ASTM D 422-63 (Reapproved 2002)
No ratings yet
ASTM D 422-63 (Reapproved 2002)
8 pages
FS 1 Observation Narrative Report SHS Format For Sy 2025 2026
No ratings yet
FS 1 Observation Narrative Report SHS Format For Sy 2025 2026
3 pages
Generative Adversarial Network For Medic
No ratings yet
Generative Adversarial Network For Medic
19 pages
Image Super Resolution
No ratings yet
Image Super Resolution
8 pages
Kim2019 Article LatentTransformationsNeuralNet
No ratings yet
Kim2019 Article LatentTransformationsNeuralNet
15 pages
Harsha Thesis
No ratings yet
Harsha Thesis
62 pages
Mechanics of Materials B.C. Punmia - Get The Ebook in PDF Format For A Complete Experience
No ratings yet
Mechanics of Materials B.C. Punmia - Get The Ebook in PDF Format For A Complete Experience
56 pages
Hardy Weinberg Law
No ratings yet
Hardy Weinberg Law
7 pages
Exploring The Various Machine Learning Models For Image Generation - A Comprehensive Survey Unlocking The Future of Digital Creativity
No ratings yet
Exploring The Various Machine Learning Models For Image Generation - A Comprehensive Survey Unlocking The Future of Digital Creativity
15 pages
Wang High-Resolution Image Synthesis CVPR 2018 Paper
No ratings yet
Wang High-Resolution Image Synthesis CVPR 2018 Paper
10 pages
AI Resubmtion
No ratings yet
AI Resubmtion
18 pages
Paper4 (GAN)
No ratings yet
Paper4 (GAN)
24 pages
LargeGANS PDF
No ratings yet
LargeGANS PDF
29 pages
Data-Sheet FieldJointCoating
No ratings yet
Data-Sheet FieldJointCoating
2 pages
Style
No ratings yet
Style
51 pages
ImageGenerationwithGans basedTechniquesASurvey
No ratings yet
ImageGenerationwithGans basedTechniquesASurvey
19 pages
Kang Scaling Up GANs For Text-to-Image Synthesis CVPR 2023 Paper
No ratings yet
Kang Scaling Up GANs For Text-to-Image Synthesis CVPR 2023 Paper
11 pages
NEET Chemistry Chapter Wise Mock Test - Physical Chemistry I - CBSE Tuts
No ratings yet
NEET Chemistry Chapter Wise Mock Test - Physical Chemistry I - CBSE Tuts
25 pages
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
No ratings yet
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
10 pages
Ardizzone2019 - Conditional Coupling Layers
No ratings yet
Ardizzone2019 - Conditional Coupling Layers
11 pages
1 Introduction To Consumer Behavior and Marketing Strategy
No ratings yet
1 Introduction To Consumer Behavior and Marketing Strategy
3 pages
Recent IELTS Writing Topics and Questions 2024 - How To Do IELTS
No ratings yet
Recent IELTS Writing Topics and Questions 2024 - How To Do IELTS
49 pages
Training Tips For The Transformer Model: Martin Popel, Ondřej Bojar
No ratings yet
Training Tips For The Transformer Model: Martin Popel, Ondřej Bojar
28 pages
Meta
No ratings yet
Meta
17 pages
Pmi Test Sample E368808-Pm012-240-056-0003
No ratings yet
Pmi Test Sample E368808-Pm012-240-056-0003
6 pages
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
No ratings yet
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
10 pages
Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications
No ratings yet
Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications
24 pages
Responsive Css
No ratings yet
Responsive Css
24 pages
Deep Generative Image Models Using A Laplacian Pyramid of Adversarial Networks
No ratings yet
Deep Generative Image Models Using A Laplacian Pyramid of Adversarial Networks
10 pages
Base Paper Batch 9 Final Updated 3
No ratings yet
Base Paper Batch 9 Final Updated 3
10 pages
4.2. Training Data Size
No ratings yet
4.2. Training Data Size
19 pages
ATT III - 17. Application of Leadership and Teamworking Skills
No ratings yet
ATT III - 17. Application of Leadership and Teamworking Skills
6 pages
A Review of Generative Adversarial Networks For Computer Vision TasksElectronics Switzerland
No ratings yet
A Review of Generative Adversarial Networks For Computer Vision TasksElectronics Switzerland
17 pages
G10 Answer Key Midterm Exam, Term 1
No ratings yet
G10 Answer Key Midterm Exam, Term 1
5 pages
Image-to-Image Translation With Conditional Adversarial Networks
No ratings yet
Image-to-Image Translation With Conditional Adversarial Networks
17 pages
Synthetic Data Generation For Scarce Road Scene Detection Scenarios
No ratings yet
Synthetic Data Generation For Scarce Road Scene Detection Scenarios
10 pages
Deep Generative Adversarial Networks For Image-To
No ratings yet
Deep Generative Adversarial Networks For Image-To
26 pages
A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF
No ratings yet
A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF
15 pages
Image Transformer: Van Den Oord & Schrauwen 2014 Bellemare Et Al. 2016
No ratings yet
Image Transformer: Van Den Oord & Schrauwen 2014 Bellemare Et Al. 2016
10 pages
Enhancing The Weather - Governance of Weather Modification Activit
No ratings yet
Enhancing The Weather - Governance of Weather Modification Activit
69 pages
The Six Fronts of The Generative Adversarial Networks
No ratings yet
The Six Fronts of The Generative Adversarial Networks
11 pages
Hawas Bajawi CV
No ratings yet
Hawas Bajawi CV
4 pages
UAS High School Profile 2024 25 Vers2
No ratings yet
UAS High School Profile 2024 25 Vers2
4 pages
3 - Thermal Energy Storage in District Heating and Cooling Systems A Review
No ratings yet
3 - Thermal Energy Storage in District Heating and Cooling Systems A Review
22 pages
Sr. No Title Published Problem Statement Methodology Dataset Dataset Avail-Ability
No ratings yet
Sr. No Title Published Problem Statement Methodology Dataset Dataset Avail-Ability
2 pages
Production - Derieux - Cedric - Advances in Automatic Image Restoration and Upscaling
No ratings yet
Production - Derieux - Cedric - Advances in Automatic Image Restoration and Upscaling
4 pages
Lata 2019
No ratings yet
Lata 2019
4 pages
Report 16
No ratings yet
Report 16
9 pages
Discipline of Focus
No ratings yet
Discipline of Focus
9 pages
Local Media3092843488830198412
100% (1)
Local Media3092843488830198412
2 pages
WP0 REPLA0140 Same 0 Box 00 PUBLIC0
No ratings yet
WP0 REPLA0140 Same 0 Box 00 PUBLIC0
114 pages
A Cell-Based Smoothed Finite Element Method For TH
No ratings yet
A Cell-Based Smoothed Finite Element Method For TH
14 pages
Electronic Temperature Controllers: Multipact
No ratings yet
Electronic Temperature Controllers: Multipact
6 pages
Research
No ratings yet
Research
15 pages
Types of Load Pavement Failures in Kenya
No ratings yet
Types of Load Pavement Failures in Kenya
4 pages
ME451: Control Systems Course Roadmap
No ratings yet
ME451: Control Systems Course Roadmap
5 pages
Acta Tropica: P. Coelho, P. Sousa, D.J. Harris, A. Van Der Meijden
No ratings yet
Acta Tropica: P. Coelho, P. Sousa, D.J. Harris, A. Van Der Meijden
9 pages
CC Block (Horiskhali)
No ratings yet
CC Block (Horiskhali)
1 page
VADOSE ZONE Microbial Ecology
No ratings yet
VADOSE ZONE Microbial Ecology
9 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Multi View Three Dimensional Reconstruction: Advanced Techniques for Spatial Perception in Computer Vision
From Everand
Multi View Three Dimensional Reconstruction: Advanced Techniques for Spatial Perception in Computer Vision
Fouad Sabry
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Background: Image Transformer

Uploaded by

Background: Image Transformer

Uploaded by

Image Transformer

of parameters and consequently computational performance 2. Background

Input Gen Truth Input Gen Truth

d-dimensional embedding vectors of the intensity values

Local 1D Attention Local 2D Attention

roes if necessary. While contiguous in the linearized image, Query Block

these blocks can be discontiguous in image coordinate space.

Table 4. Bits/dim on CIFAR-10 test and ImageNet validation sets.

images, however, is significantly higher than that of our

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.