0% found this document useful (0 votes)

15 views21 pages

A Survey On Contrastive Self-Supervised Learning

This document provides a comprehensive review of contrastive self-supervised learning methods, highlighting their significance in avoiding costly data annotation. It discusses various pretext tasks and architectures used in contrastive learning, along with performance comparisons across multiple downstream tasks. The paper concludes by addressing the limitations of current methods and suggesting future research directions to enhance self-supervised learning techniques.

Uploaded by

Jing Ma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views21 pages

A Survey On Contrastive Self-Supervised Learning

Uploaded by

Jing Ma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

A S URVEY ON C ONTRASTIVE S ELF - SUPERVISED L EARNING

Ashish Jaiswal Ashwin Ramesh Babu

The University of Texas at Arlington The University of Texas at Arlington
Arlington, TX 76019 Arlington, TX 76019
ashish.jaiswal@mavs.uta.edu ashwin.rameshbabu@mavs.uta.edu
arXiv:2011.00362v3 [cs.CV] 7 Feb 2021

Mohammad Zaki Zadeh Debapriya Banerjee

The University of Texas at Arlington The University of Texas at Arlington
Arlington, TX 76019 Arlington, TX 76019
mohammad.zakizadehgharie@mavs.uta.edu debapriya.banerjee2@mavs.uta.edu

Fillia Makedon
The University of Texas at Arlington
Arlington, TX 76019
makedon@uta.edu

A BSTRACT
Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating
large-scale datasets. It is capable of adopting self-defined pseudo labels as supervision and use the
learned representations for several downstream tasks. Specifically, contrastive learning has recently
become a dominant component in self-supervised learning methods for computer vision, natural
language processing (NLP), and other domains. It aims at embedding augmented versions of the
same sample close to each other while trying to push away embeddings from different samples. This
paper provides an extensive review of self-supervised methods that follow the contrastive approach.
The work explains commonly used pretext tasks in a contrastive learning setup, followed by different
architectures that have been proposed so far. Next, we have a performance comparison of different
methods for multiple downstream tasks such as image classification, object detection, and action
recognition. Finally, we conclude with the limitations of the current methods and the need for further
techniques and future directions to make substantial progress.

Keywords contrastive learning · self-supervised learning · discriminative learning · image/video classification · object
detection · unsupervised learning · transfer learning

1 Introduction
The advancements in deep learning have elevated it to become one of the core components in most intelligent systems in
existence. The ability to learn rich patterns from the abundance of data available today has made deep neural networks
(DNNs) a compelling approach in the majority of computer vision (CV) tasks such as image classification, object
detection, image segmentation, activity recognition as well as natural language processing (NLP) tasks such as sentence
classification, language models, machine translation, etc. However, the supervised approach to learning features from
labeled data has almost reached its saturation due to intense labor required in manually annotating millions of data
samples. This is because most of the modern computer vision systems (that are supervised) try to learn some form of
image representations by finding a pattern between the data points and their respective annotations in large datasets.
Works such as GRAD-CAM [1] have proposed techniques that provide visual explanations for decisions made by a
model to make them more transparent and explainable.
Traditional supervised learning approaches heavily rely on the amount of annotated training data available. Even
though there’s a plethora of data available out there, the lack of annotations has pushed researchers to find alternative
A PREPRINT - F EBRUARY 9, 2021

approaches that can leverage them. This is where self-supervised methods plays a vital role in fueling the progress of
deep learning without the need for expensive annotations and learn feature representations where data itself provides
supervision.

Figure 1: Basic intuition behind contrastive learning paradigm: push original and augmented images closer and push
original and negative images away

Supervised learning not only depends on expensive annotations but also suffers from issues such as generalization
error, spurious correlations, and adversarial attacks [2]. Recently, self-supervised learning methods have integrated
both generative and contrastive approaches that have been able to utilize unlabeled data to learn the underlying
representations. A popular approach has been to propose various pretext tasks that help in learning features using
pseudo-labels. Tasks such as image-inpainting, colorizing greyscale images, jigsaw puzzles, super-resolution, video
frame prediction, audio-visual correspondence, etc have proven to be effective for learning good representations.

Figure 2: Contrastive learning pipeline for self-supervised training

Generative models gained its popularity after the introduction of Generative Adversarial Networks (GANs) [3] in
2014. The work later became the foundation for many successful architectures such as CycleGAN [4], StyleGAN [5],

2
A PREPRINT - F EBRUARY 9, 2021

PixelRNN [6], Text2Image [7], DiscoGAN [8], etc. These methods inspired more researchers to switch to training deep
learning models with unlabeled data in an self-supervised setup. Despite their success, researchers started realizing
some of the complications in GAN-based approaches. They are harder to train because of two main reasons: (a)
non-convergence–the model parameters oscillate a lot and rarely converge, and (b) the discriminator gets too successful
that the generator network fails to create real-like fakes due to which the learning cannot be continued. Also, proper
synchronization is required between the generator and the discriminator that prevents the discriminator to converge and
the generator to diverge.

Figure 3: Top-1 classification accuracy of different contrastive learning methods against baseline supervised method on
ImageNet

Unlike generative models, contrastive learning (CL) is a discriminative approach that aims at grouping similar samples
closer and diverse samples far from each other as shown in figure 1. To achieve this, a similarity metric is used to
measure how close two embeddings are. Especially, for computer vision tasks, a contrastive loss is evaluated based
on the feature representations of the images extracted from an encoder network. For instance, one sample from the
training dataset is taken and a transformed version of the sample is retrieved by applying appropriate data augmentation
techniques. During training referring to figure 2, the augmented version of the original sample is considered as a
positive sample, and the rest of the samples in the batch/dataset (depends on the method being used) are considered
negative samples. Next, the model is trained in a way that it learns to differentiate positive samples from the negative
ones. The differentiation is achieved with the help of some pretext task (explained in section 2). In doing so, the model
learns quality representations of the samples and is used later for transferring knowledge to downstream tasks. This
idea is advocated by an interesting experiment conducted by Epstein [9] in 2016, where he asked his students to draw a
dollar bill with and without looking at the bill. The results from the experiment show that the brain does not require
complete information of a visual piece to differentiate one object from the other. Instead, only a rough representation of
an image is enough to do so.
Most of the earlier works in this area combined some form of instance-level classification approach[10][11][12] with
contrastive learning and were successful to some extent. However, recent methods such as SwAV [13], MoCo [14], and
SimCLR [15] with modified approaches have produced results comparable to the state-of-the-art supervised method on

3
A PREPRINT - F EBRUARY 9, 2021

ImageNet [16] dataset as shown in figure 3. Similarly, PIRL [17], Selfie [18], and [19] are some papers that reflect the
effectiveness of the pretext tasks being used and how they boost the performance of their models.

2 Pretext Tasks
Pretext tasks are self-supervised tasks that act as an important strategy to learn representations of the data using pseudo
labels. These pseudo labels are generated automatically based on the attributes found in the data. The learned model
from the pretext task can be used for any downstream tasks such as classification, segmentation, detection, etc. in
computer vision. Furthermore, these tasks can be applied to any kind of data such as image, video, speech, signals,
and so on. For a pretext task in contrastive learning, the original image acts as an anchor, its augmented(transformed)
version acts as a positive sample, and the rest of the images in the batch or in the training data act as negative samples.
Most of the commonly used pretext tasks are divided into four main categories: color transformation, geometric
transformation, context-based tasks, and cross-modal based tasks. These pretext tasks have been used in various
scenarios based on the problem intended to be solved.

2.1 Color Transformation

Figure 4: Color Transformation as pretext task [15]. (a) Original (b) Gaussian noise (c) Gaussian blur (d) Color
distortion (Jitter)

Color transformation involves basic adjustments of color levels in an image such as blurring, color distortions, converting
to grayscale, etc. Figure 4 represents an example of color transformation applied on a sample image from the ImageNet
dataset [15]. During this pretext task, the network learns to recognize similar images invariant to their colors.

2.2 Geometric Transformation

A geometric transformation is a spatial transformation where the geometry of the image is modified without altering
its actual pixel information. The transformations include scaling, random cropping, flipping (horizontally, vertically),
etc. as represented in figure 5 through which global-to-local view prediction is achieved. Here the original image is
considered as the global view and the transformed version is considered as the local view. Chen et. al. [15] performed
such transformations to learn features during pretext task.

2.3 Context-Based

2.3.1 Jigsaw puzzle

Traditionally, solving jigsaw puzzles has been a prominent task in learning features from an image in an unsupervised
way. It involves identifying the correct position of the scrambled patches in an image by training an encoder (figure 6).

4
A PREPRINT - F EBRUARY 9, 2021

Figure 5: Geometric Transformation as pretext task [15]. (a) Original (b) Crop and Resize (c) Rotate(90◦ , 180◦ , 270◦ )
(d) crop, resize, flip

In terms of contrastive learning, the original image is the anchor, and an augmented image formed by scrambling the
patches in the original image acts as a positive sample. The rest of the images in the dataset/batch are considered to be
negative samples [17].

Figure 6: Solving jigsaw puzzle being used as a pretext task to learn representation. (a) Original Image (b) reshuffled
image. The original image is the anchor and the reshuffled image is the positive sample.

2.3.2 Frame order based

This approach applies to data that extends through time. An ideal application would be in the case of sensor data or
a sequence of image frames (video). A video contains a sequence of semantically related frames. This implies that
frames that are nearby with respect to time are closely related and the ones that are far away are less likely to be related.
Intuitively, the motive for using such an approach is, solving a pretext task that allows the model to learn useful visual
representations while trying to recover the temporal coherence of a video. Here, a video with shuffled order in the
sequence of its image frames acts as a positive sample while all other videos in the batch/dataset would be negative
samples.
Similarly, other possible approaches include randomly sampling two clips of the same length from a longer video or
applying spatial augmentation for each video clip. The goal is to use a contrastive loss to train the model such that clips
taken from the same video are arranged closer whereas clips from different videos are pushed away in the embedding
space. In the work proposed by Qian et. al. [20], the framework contrasts the similarity between two positive samples
to those of negative samples. The positive pairs are two augmented clips from the same video. As a result, it separates
all encoded videos into non-overlapping regions such that an augmentation used in the training perturbs an encoded
video only within a small region in the representation space.

5
A PREPRINT - F EBRUARY 9, 2021

Figure 7: Contrastive Predictive Coding: Although the figure shows audio as input, similar setup can be used for videos,
images, text etc. [21]

2.3.3 Future prediction

One of the most common strategies for data that extends through time is to predict future or missing information. This
is commonly used for sequential data such as sensory data, audio signals, videos, etc. The goal of a future prediction
task is to predict high-level information of future time-step given a series of past ones. In the work proposed by
[21, 22], high-dimensional data is compressed into a compact lower-dimensional latent embedding space. Powerful
autoregressive models are used to summarize the information in the latent space and a context latent representation
Ct is produced as represented in figure 7. When predicting future information, the target (future) and context Ct are
encoded into a compact distributed vector representation in a way that maximally preserves the mutual information of
the original signals.

2.4 View Prediction (Cross modal-based)

Figure 8: Learning representation from video frame sequence [23]

View prediction tasks are preferred for data that has multiple views of the same scene. Following this approach, in [23],
the anchor and its positive images taken from simultaneous viewpoints, are encouraged to be close in the embedding

6
A PREPRINT - F EBRUARY 9, 2021

space while distant from negative images taken from a different time within the same sequence. The model learns
by trying to simultaneously identify similar features between the frames from different angles and also trying to find
the difference between frames that occur later in the sequence. Figure 8 represents their approach for view prediction.
Similarly, recent work proposes an inter-intra contrastive framework where inter-sampling is learned through multi-view
of the same sample, and intra-sampling that learns the temporal relation is performed through multiple approaches such
as frame repetition and frame order shuffling that acts as the negative samples [24].

2.5 Identifying the right pre-text task

The choice of pretext task relies on the type of problem being solved. Although numerous methods have been proposed
in contrastive learning, a separate track of research is still going on to identify the right pre-text task. Work has identified
and proved that it is important to determine the right kind of pre-text task for a model to perform well with contrastive
learning. The main aim of a pre-text task is to compel the model to be invariant to these transformations while remaining
discriminative to other data points. But the bias introduced through such augmentations could be a double-edged
sword, as each augmentation encourages invariances to a transformation which can be beneficial in some cases and
harmful in others. For instance, applying rotation may help with view-independent aerial image recognition but might
significantly downgrade the performance while trying to solve downstream tasks such as detecting which way is up in a
photograph for a display application. [25]. Similarly, colorization-based pretext tasks might not work out in a fine-grain
classification represented in figure 9.
Similarly, in work [26], the authors focus on the importance of using the right pretext task. The authors pointed out
that in their scenario, except for rotation, other transformations such as scaling and changing aspect ratio may not be
appropriate for the pretext task because they produce easily detectable visual artifacts. They also reveal that rotation
does not work well when the image in a target dataset is constructed by color textures as in DTD dataset [27] as shown
in figure 10.

Figure 9: Most of the shapes of these two pairs of images are same. However, low-level statistics are different (color
and texture). Usage of right pre-text task here is necessary[28]

Figure 10: A sample from the DTD dataset [27]. An example of why rotation based pretext task will not work well.

2.6 Pre-text tasks in NLP

While self-supervised learning has been making significant progress in computer vision tasks for the past few years,
it has been an active area of research in NLP for decades. Using a pretext task refers to generating labels such that
supervised approaches can be applied to unsupervised problems to pre-train models. In NLP, text representations can be
learned from large text corpora using any of the available pretext tasks that are discussed below.

7
A PREPRINT - F EBRUARY 9, 2021

2.6.1 Center and Neighbor Word Prediction

Back in 2013, Word2Vec [29] first introduced self-supervised methods to learn word representations in vector space.
The continuous bag-of-words version of the model used "center word prediction" as the pretext task while the continuous
skip-gram model implemented "neighbor word prediction" task. In center word prediction, the input to the model is a
sequence of words with a fixed window size and one word missing from the center of the sequence. The task of the
model is to predict the missing word in the sequence. On the other hand, the input in skip-gram model is a single word
where the model predicts its neighbor words. By performing these particular tasks, the model is able to learn word
representations that can be further used to train models for downstream tasks.

2.6.2 Next and Neighbor Sentence Prediction

In "next sentence prediction", the model predicts whether two inputs sentences can be consecutive sentences or not. A
positive sample in this case would be a sample that follows the original sentence while a negative sample is a sentence
from a random document. BERT [30] used this method to drastically improve performance on downstream tasks that
required an understanding of sentence relations such as question answering and language inference.
Similarly, given a sentence, a model has to predict its previous and the next sentence in "neighbor sentence prediction
task". This approach was inherited by Skip-Thought Vectors [31] paper. It is similar to the skip-gram method but rather
applied to sentences in place of words.

2.6.3 Auto-regressive Language Modeling

This task involves predicting the next word, given previous words or vice-versa. A sequence of words from a text
document is provided and the model tries to predict the next word that follows the sequence. This technique has been
used by several n-gram models and neural networks such as GPT [32] and its recent versions.

2.6.4 Sentence Permutation

A recent paper known as BART [33] used a pretext task where a continuous span of text from the corpus is taken and
broken into multiple sentences. The position of the sentences are randomly reshuffled and the task of the model is to
predict the original order of the sentences.

3 Architectures
Contrastive learning methods rely on the number of negative samples for generating good quality representations. It can
be seen as a dictionary-lookup task where the dictionary is sometimes the whole training set and the rest of the times
some subset of the dataset. An interesting way to categorize these methods would be based on the technique used to
collect negative samples against a positive data-point during training. Based on the approach taken, we categorized
the methods into four major architectures as shown in figure 11. Each architecture is explained separately along with
examples of successful methods that follow similar principles.

3.1 End-to-End Learning

End-to-end learning is a complex learning system that uses gradient-based learning and is designed in such a way
that all modules are differentiable [34]. This architecture prefers large batch sizes to accumulate a greater number
of negative samples. Except for the original image and its augmented version, the rest of the images in the batch are
considered negative. The pipeline employs two encoders: a Query encoder (Q) and a Key encoder (K) as shown in
figure (11a). The two encoders can be different and are updated end-to-end by backpropagation during training. The
main idea behind training these encoders separately is to generate distinct representations of the same sample. Using a
contrastive loss, it converges to make positive samples closer and negative samples far from the original sample. Here,
the query encoder Q is trained on the original samples and the key encoder K is trained on their augmented versions
(positive samples) along with the negative samples in the batch. The features q and k generated from these encoders
are used to calculate the similarity between the respective inputs using a similarity metric (discussed later in section
5). Most of the time, the similarity metric used is "cosine similarity" which is simply the inner product of two vectors
normalized to have length 1 as defined in equation 2.
Recently, a successful end-to-end model was proposed in SimCLR [15] where they used a batch size of 4096 for
100 epochs. It has been verified that end-to-end architectures are simple in complexity but perform better with large
batch sizes and a higher number of epochs as represented in figure 12. Another popular work that follows end-to-end

8
A PREPRINT - F EBRUARY 9, 2021

(a) End-to-End (b) Memory Bank (c) Momentum Encoder (d) Clustering
Figure 11: Different architecture pipelines for Contrastive Learning: (a) End-to-End training of two encoders where one
generates representation for positive samples and the other for negative samples (b) Using a memory bank to store and
retrieve encodings of negative samples (c) Using a momentum encoder which acts as a dynamic dictionary lookup for
encodings of negative samples during training (d) Implementing a clustering mechanism by using swapped prediction
of the obtained representations from both the encoders using end-to-end architecture

Figure 12: Linear evaluation models (ResNet-50) trained with different batch size and epochs. Each bar represents a
single run from scratch [15]

architecture was proposed by Oord et. al [21] where they learn feature representations of high-dimensional time series
data by predicting the future in latent space by using powerful autoregressive models along with a contrastive loss. This
approach makes the model tractable by using negative sampling. Also, other works that follow this approach include
[35, 36, 37, 38, 39].
The number of negative samples available in this approach is coupled with the batch size as it accumulates negative
samples from the current batch. Since the batch size is limited by the GPU memory size, the scalability factor with these
methods remains an issue. Furthermore, for larger batch sizes, the methods suffer from a large mini-batch optimization
problem and require effective optimization strategies as pointed out by [40].

3.2 Using a Memory Bank

With potential issues from having larger batch sizes that could inversely impact the optimization during training, a
possible solution is to maintain a separate dictionary known as memory bank.
Memory Bank: The aim of maintaining a memory bank is to accumulate a large number of feature representations
of samples that are used as negative samples during training. For this purpose, a dictionary is created that stores and
updates the embeddings of samples with the most recent ones at regular intervals. The memory bank (M) contains a
feature representation mI for each sample I in dataset D. The representation mI is an exponential moving average of
feature representations that were computed in prior epochs. It enables replacing negative samples mI 0 by their memory
bank representations without increasing the training batch size.

9
A PREPRINT - F EBRUARY 9, 2021

Figure 13: Usage of memory bank in PIRL: memory bank contains the moving average representations of all negative
images to be used in contrastive learning. [17]

The representation of a sample in the memory bank gets updated when it is last seen, so the sampled keys are essentially
about the encoders at multiple different steps all over the past epoch. PIRL [17] is one of the recent successful methods
that learns good visual representations of images trained using a memory bank as shown in figure 13. It requires the
learner to construct representations of images that are covariant to any of the pretext tasks being used, though they
focus mainly on the Jigsaw pretext task. Another popular work that uses a memory bank under contrastive setting was
proposed by Wu et al. [12] where they implemented a non-parametric variant of softmax classifier that is more scalable
for big data applications.
However, maintaining a memory bank during training can be a complicated task. One of the potential drawbacks of
this approach is that it can be computationally expensive to update the representations in the memory bank as the
representations get outdated quickly in a few passes.

3.3 Using a Momentum Encoder

To address the issues with a memory bank explained in the previous section 3.2, the memory bank gets replaced by a
separate module called Momentum Encoder. The momentum encoder generates a dictionary as a queue of encoded keys
with the current mini-batch enqueued and the oldest mini-batch dequeued. The dictionary keys are defined on-the-fly by
a set of data samples in the batch during training. The momentum encoder shares the same parameters as the encoder Q
as shown in figure 11c. It is not backpropagated after every pass, instead, it gets updated based on the parameters of the
query encoder as represented by equation 1 [14].

θk ← mθk + (1 − m)θq (1)

In the equation, m ∈ [0, 1) is the momentum coefficient. Only the parameters θq are updated by back-propagation.
The momentum update makes θk evolve smoothly than θq . As a result, though the keys in the queue are encoded by
different encoders (in different mini-batches), the difference among these encoders can be made small.
The advantage of using this architecture over the first two is that it does not require training two separate models.
Furthermore, there is no need to maintain a memory bank that is computationally and memory inefficient.

3.4 Clustering Feature Representations

All three architectures explained above focus on comparing samples using a similarity metric and try to keep similar
items closer and dissimilar items far from each other allowing the model to learn better representations. On the
contrary, this architecture follows an end-to-end approach with two encoders that share parameters, but instead of using
instance-based contrastive approach, they utilize a clustering algorithm to group similar features together.
One of the most recent works that employ clustering methods, SwAV [13] is represented in figure 14. The diagram
points out the differences between other instance-based contrastive learning architectures and the clustering-based
methods. Here, the goal is not only to make a pair of samples close to each other but also, make sure that all other
features that are similar to each other form clusters together. For example, in an embedded space of images, the features
of cats should be closer to the features of dogs (as both are animals) but should be far from the features of houses (as
both are distinct).
In instance-based learning, every sample is treated as a discrete class in the dataset. This makes it unreliable in
conditions where it compares an input sample against other samples from the same class that the original sample

10
A PREPRINT - F EBRUARY 9, 2021

Figure 14: Conventional Contrastive Instance Learning v/s Contrastive Clustering of Feature Representations in SwAV
[13]

belongs to. To explain it clearly, imagine we have an image of a cat in the training batch that is the current input to the
model. During this pass, all other images in the batch are considered as negative. The issue arises when there are images
of other cats in the negative samples. This condition forces the model to learn two images of cats as not similar during
training despite both being from the same class. This problem is implicitly addressed by a clustering-based approach.

4 Encoders

Figure 15: Training an Encoder and transfering knowledge for downstream tasks

Encoders play an integral role in any self-supervised learning pipeline as they are responsible for mapping the input
samples to a latent space. Figure 15 reflects the role of an encoder in a self-supervised learning pipeline. Without
effective feature representations, a classification model might have difficulty in learning to distinguish among different
classes. Most of the works in contrastive learning utilize some variant of the ResNet [41] model. Among its variants,
ResNet-50 has been the most widely used because of its balance between size and learning capability.
In an encoder, the output from a specific layer is pooled to get a single-dimensional feature vector for every sample.
Depending on the approach, they are either upsampled or downsampled. For example, in the work proposed by Misra
et. al. [17], a ResNet-50 architecture is used where the output of the res5 (residual block) features are average-pooled to
get a 2048-dimensional vector for the given sample (image in their case). They further apply a single linear projection
to get a 128-dimensional feature vector. Also, as part of their ablation test, they investigated features from various
stages such as res2, res3, and res4 to evaluate the performance. As expected, features extracted from the later stages of
the encoder proved to be a better representation of the input than the features extracted from the earlier stages.
Similarly, in the work proposed by Chen et. al. [42], a traditional ResNet is used as an encoder where the features are
extracted from the output of the average pooling layer. Further, a shallow MLP (1 hidden layer) maps representations
to a latent space where a contrastive loss is applied. For training a model for action recognition, the most common
approach to extract features from a sequence of image frames is to use a 3D-ResNet as encoder [22, 24].

11
A PREPRINT - F EBRUARY 9, 2021

5 Training
To train an encoder, a pretext task is used that utilizes contrastive loss for backpropagation. The central idea in
contrastive learning is to bring similar instances closer and push away dissimilar instances far from each other. One
way to achieve this is to use a similarity metric that measures the closeness between the embeddings of two samples.
In a contrastive setup, the most common similarity metric used is cosine similarity that acts as a basis for different
contrastive loss functions. The cosine similarity of two variables (vectors) is the cosine of the angle between them and
is defined as follows:

A.B
cos_sim(A, B) = (2)
kAkkBk
Contrastive learning focuses on comparing the embeddings with a Noise Contrastive Estimation (NCE) [43] function
that is defined as follows:

exp(sim(q, k+ )/τ )
LN CE = −log (3)
exp(sim(q, k+ )/τ ) + exp(sim(q, k_ )/τ )
where q is the original sample, k+ represents a positive sample, and k_ represents a negative sample. τ is a hyperparam-
eter used in most of the recent methods and is called temperature coefficient. The sim() function can be any similarity
function, but generally a cosine similarity as defined in equation 2 is used. The initial idea behind NCE was to perform
a non-linear logistic regression that discriminates between observed data and some artificially generated noise.
If the number of negative samples is greater, a variant of NCE called InfoNCE is used as represented in equation 4. The
use of L2 normalization (i.e. cosine similarity) and the temperature coefficient, effectively weighs different examples
and can help the model learn from hard negatives.

exp(sim(q, k+ )/τ )
Linf oN CE = −log PK (4)
exp(sim(q, k+ )/τ ) + i=0 exp(sim(q, ki )/τ )

where ki represents a negative sample.

Similar to other deep learning methods, contrastive learning employs a variety of optimization algorithms for training.
The training process involves learning the parameters of encoder network by minimizing the loss function.
Stochastic Gradient Descent (SGD) has one of the most popular optimization algorithms used with contrastive learning
methods [17, 14, 10, 12]. It is an stochastic approximation of gradient descent optimization since it replaces the actual
gradient (calculated from the entire data set) by an estimate calculated from a randomly selected subset of data. A
crucial hyperparameter for the SGD algorithm is the learning rate which in practice should gradually be decreased over
time. An improved version of SGD (with momentum) is used in most deep learning approaches.
Another popular optimization method known as adaptive learning rate optimization algorithm (Adam) [44] has been
used in a few methods [21, 45, 46]. In Adam, momentum is incorporated directly as an estimate of the first-order
moment. Furthermore, Adam includes bias corrections to the estimates of both the first-order moments and the
second-order moments to account for their initialization at the origin.
Since some of the end-to-end methods [15, 47, 13] use a very large batch size, training with standard SGD-based
optimizers with a linear learning rate scaling becomes unstable. In order to stabilize the training, Layer-wise Adaptive
Rate Scaling (LARS) [48] optimizer along with cosine learning rate [49] was introduced. There are two main differences
between LARS and other adaptive algorithms such as Adam. First, LARS uses a different learning rate for every layer
that leads to better stability. Second, the magnitude of the update is based on the weight norm for better control of
training speed. Furthermore, employing cosine learning rate involves periodically warm restarts of SGD, where in each
restart, the learning rate is initialized to some value and is scheduled to decrease over time.

6 Downstream Tasks
Generally, computer vision pipelines that employ self-supervised learning involve performing two tasks: a pretext task
and a downstream task. Downstream tasks are application-specific tasks that utilize the knowledge that was learned
during the pretext task. They can be anything such as classification, detection, segmentation, future prediction, etc. in
computer vision. Once example of downstream task can be hand gesture classification [55] that involves both object

12
A PREPRINT - F EBRUARY 9, 2021

ImageNet (Self-supervised) Semi-supervised (Top-5)

Method Architecture
Top-1 Top-5 1% Labels 10% Labels
Supervised ResNet50 76.5 - 56.4 80.4
CPC [38] ResNet v2 101 48.7 73.6 - -
InstDisc [12] ResNet50 56.5 - 39.2 77.4
LA [50] ResNet50 60.2 - - -
MoCo [14] ResNet50 60.6 - - -
BigBiGAN [51] ResNet50 (4x) 61.3 81.9 55.2 78.8
PCL [52] ResNet50 61.5 - 75.3 85.6
SeLa [53] ResNet50 61.5 84.0 - -
PIRL [17] ResNet50 63.6 - 57.2 83.8
CPCv2 [38] ResNet50 63.8 85.3 77.9 91.2
PCLv2 [52] ResNet50 67.6 - - -
SimCLR [15] ResNet50 69.3 89.0 75.5 87.8
MoCov2 [47] ResNet50 71.1 - - -
InfoMin Aug [19] ResNet50 73.0 91.1 - -
SwAV [13] ResNet50 75.3 - 78.5 89.9

Table 1: Performance on ImageNet Dataset: Top-1 and Top-5 accuracies of different contrastive learning methods on
ImageNet using self-supervised approach where models are used as frozen encoders for a linear classifier. The second
half of the table (rightmost two columns) show the performance (top-5 accuracy) of these methods after fine-tuning on
1% and 10% of labels from ImageNet

Figure 16: Image classification, localization, detection, and segmentation as downstream tasks in computer vision [54]

detection and classification. Figure 17 represents the overview of how knowledge is transferred to a downstream task.
The learned parameters serve as a pretrained model and are transferred to other downstream computer vision tasks
by fine-tuning. The performance of transfer learning on these high-level vision tasks demonstrates the generalization
ability of the learned features.

Figure 17: An overview of downstream task for images

13
A PREPRINT - F EBRUARY 9, 2021

To evaluate the effectiveness of features learned with a self-supervised approach for downstream tasks, methods such as
kernel visualization, feature map visualization, nearest-neighbor based approaches are commonly used to analyze the
effectiveness of the pretext task.

6.1 Visualizing Kernels and Feature Maps

Here, the kernels of the first convolutional layer from encoders trained with both self-supervised (contrastive) and
supervised approaches are compared. This helps to estimate the effectiveness of the self-supervised approach [56].
Similarly, attention maps generated from different layers of the encoders can be used to evaluate if an approach works
or not. Gidaris et. al. [57] assessed the effectiveness based on the activated regions observed in the input as shown in
figure 18.

Figure 18: Attention maps generated by a trained AlexNet. The images represent the attention maps applied on features
from Conv1 27x27, Conv3 13x13 and Conv5 6x6

6.2 Nearest Neighbor retrieval

In general, the samples that belong to the same class are expected to be closer to each other in the latent space. With the
nearest neighbor approach, for a given input sample, top-K retrieval of the samples from the dataset can be used to
analyze whether a self-supervised approach performs as expected or not.

7 Benchmarks
Recently, several self-supervised learning methods for computer vision tasks have been proposed that challenge the
existing state-of-the-art supervised models. In this section, we collect and compare the performances of these methods
based on the downstream tasks they were evaluated on. For image classification, two popular datasets ImageNet
[16] and Places [58] have been used by most of the methods. Similarly, for object detection, Pascal VOC dataset has
often been referred to for evaluation where these methods have outperformed the best supervised models. For action
recognition and video classification, datasets such as UCF-101 [59], HMDB-51 [60], and Kinetics [61] have been used.
Table 1 highlights the performance of several methods on ImageNet and reflects how these methods have evolved and
performed better with time. At the moment, as seen in figure 3, SwAV [13] produces comparable accuracy to the
state-of-the-art supervised model in learning image representations from ImageNet. Similarly, for image classification
task on Places [58] dataset, SwAV [13] and AMDIM [37] have outperformed top supervised models with higher top-1
accuracies as shown in table 3. The methods shown in the table were first pretrained on ImageNet and later inferred on
Places dataset using a linear classifier. The results advocate that representations learned by contrastive learning methods
performed better than the supervised approach when tested on a different dataset.
These methods have not only excelled in image classification but also have performed well on other tasks like object
detection and action recognition. As shown in table 3, SwAV [13] outperforms the state-of-the-art supervised model in
both linear classification and object detection in the Pascal VOC7 dataset. For linear classification, the models shown in
the table were pretrained on VOC7 and features were taken for training a linear classification model. Similarly, for
object detection, models were finetuned on VOC7+12 using Faster-RCNN. For video classification tasks, contrastive
learning methods have shown promising results in datasets like UCF101, HMDB51, and Kinetics as reflected by table 4.

8 Contrastive Learning in NLP

Contrastive learning was first introduced by Mikolov et. al.[78] for natural language processing in 2013. The authors
proposed a contrastive learning-based framework by using co-occurring words as semantically similar points and

14
A PREPRINT - F EBRUARY 9, 2021

Method Architecture Parameters Top-1 Accuracy

Supervised ResNet50 25.6M 53.2
BiGAN [62] AlexNet 61M 31.0
Context [63] AlexNet 61M 32.7
SplitBrain [64] AlexNet 61M 34.1
AET [65] AlexNet 61M 37.1
DeepCluster [56] AlexNet 61M 37.5
Color [66] ResNet50 25.6M 37.5
Jigsaw [66] ResNet50 25.6M 41.2
Rotation [57] ResNet50 25.6M 41.4
NPID [12] ResNet50 25.6M 45.5
PIRL [17] ResNet50 25.6M 49.8
LA [50] ResNet50 25.6M 50.1
AMDIM [37] - 670M 55.1
SwAV [13] ResNet50 25.6M 56.7

Table 2: Image classification accuracy on Places dataset pretrained on ImageNet.

Method Architecture Parameters (1)Classification (2)Detection

Supervised AlexNet 61M 79.9 56.8
Supervised ResNet50 25.6M 87.5 81.3
Inpaint [67] AlexNet 61M 56.5 44.5
Color [68] AlexNet 61M 65.6 46.9
BiGAN [62] AlexNet 61M 60.1 46.9
NAT [10] AlexNet 61M 65.3 49.4
Context [63] AlexNet 61M 65.3 51.1
DeepCluster [56] AlexNet 61M 72.0 55.4
Color [68] ResNet50 25.6M 55.6 −
Rotation [57] ResNet50 25.6M 63.9 72.5
Jigsaw [66] ResNet50 25.6M 64.5 75.1
LA [50] ResNet50 25.6M 69.1 −
NPID [12] ResNet50 25.6M 76.6 79.1
PIRL [17] ResNet50 25.6M 81.1 80.7
MoCo [14] ResNet50 25.6M − 81.4
SwAV [13] ResNet50 25.6M 88.9 82.6

Table 3: (1) Linear classification top-1 accuracy on top of frozen features and (2) Object detection with finetuned
features on VOC7+12 using Faster-CNN

negative sampling[79] for learning word embeddings. Negative sampling algorithm differentiates a word from the
noise distribution using logistic regression and helps to simplify the training method. This framework results in huge
improvement in the quality of representations of learned words and phrases in a computationally efficient way. Arora et
al.[80] proposed a theoretical framework for contrastive learning that learns useful feature representations from unlabeled
data and introduced latent classes to formalize the notion of semantic similarity and performs well on classification
tasks using the learned representations. Its performance is comparable to the state-of-the-art supervised approach on the
Wiki-3029 dataset. Another recent model, CONtrastive Position and Ordering with Negatives Objective(CONPONO)
[81] discourses coherence and encodes fine-grained sentence ordering in text and outperforms BERT-Large model
despite having the same number of parameters as BERT-Base.
Contrastive Learning has started gaining popularity on several NLP tasks in the recent years. It has shown significant
improvement on NLP downstream tasks such as cross-lingual pre-training [82], language understanding [83], and textual
representations learning [84]. INFOXLM [82], a cross-lingual pretraining model, proposes a cross-lingual pretraining
task based on maximizing the mutual information between two input sequences and learns to differentiate machine
translation of input sequences using contrastive learning. Unlike TLM [85], this model aims to maximize mutual
information between machine translation pairs in cross-lingual platform and improves the cross-lingual transferability
in various downstream tasks, such as cross-lingual classification and question answering. Table 6 shows the recent
contrastive learning methods on NLP downstream task.

15
A PREPRINT - F EBRUARY 9, 2021

Method Model UCF-101 HMDB-51 K (top1) K (top5)

C3D (Supervised) - 82.3 † - - -
3DResNet-18 (Supervised) R3D 84.4† 56.4† - -
P3D (Supervised) - 84.4† - - -
ImageNet-inflated [69] R3D 60.3 30.7 - -
jigsaw [28] - 51.5 22.5 - -
OPN [70] - 56.3 22.1 - -
Cross Learn (with Optical Flow)[71] - 58.7 27.2 - -
O3N [72] - 60.3 32.5 - -
Shuffle and Learn [73] - 50.2 18.1 - -
IIC (Shuffle + res)* [74] R3D 74.4 38.3 - -
inflated SIMCLR [20] R3D-50 - - 48.0 71.5
CVRL [20] R3D-50 - - 64.1 85.8
TCP [22] R3D 77.9 (3 splits) 45.3 - -
SeCo inter+intra+order [75] R3D 88.26† 55.5† 61.91 -
DTG-Net [76] R3D-18 85.6 49.9 - -
CMC(3 views) [77] R3D 59.1 26.7 - -
Table 4: Accuracy on Video Classification. All the proposed methods were pre-trained with their proposed contrastive
based approaches and a linear model was used for validation. R3D in model represents 3D-ResNet. † represents that the
model has been trained on another dataset and further fine-tuned with the specific dataset. K represents Kinetics dataset

Architecture Dataset Accuracy

INFOXLM [82] XNLI and MLQA 79.7 (AVG-5)
Distributed [78] Google internal 72
CERT [83] QQP 90.3
CONPONO [81] DiscoEval 63.0 (AVG-10)
Contrastive [80] Wiki-3029 83.5 (AVG-10)
Table 5: Accuracy on different NLP dataset

Most of the popular language models such as BERT [30], GPT [32] approach pretraining on tokens and hence may
not capture sentence-level semantics. To address this issue, CERT [83] that pretrains models on the sentence level
using contrastive learning was proposed. This model works in two steps: 1) creating augmentation of sentences using
back-translation, and 2) predicting whether two augmented versions are from the same sentence or not by fine-tuning
a pretrained language representation model (e.g., BERT, BART). CERT was also evaluated on 11 different natural
language understanding tasks in the GLUE benchmark where it outperformed BERT on 7 tasks. DeCLUTR [84] is self-
supervised model for learning universal sentence embeddings. This model outperforms InferSent, a popular sentence
encoding method. It has been evaluated based on the quality of sentence embedding on the SentEval benchmark. Table
5 provides the comparison of accuracy on different NLP dataset.

9 Discussions and Future Directions

Although empirical results show that contrastive learning has decreased the gap in performance with supervised models,
there is a need for more theoretical analysis to form a solid justification. For instance, a study by Purushwalkam et. al.

Model Dataset Application areas

Distributed Representations [78] Google internal Training with Skip-gram model
Contrastive Unsupervised [80] Wiki-3029 Unsupervised representation learning
CONPONO [81] RTE, COPA, ReCoRD Discourse fine-grained sentence ordering in text
INFOXLM [82] XNLI and MLQA Learning cross-lingual representations
CERT [83] GLUE benchmark Capturing sentence-level semantics
DeCLUTR [84] OpenWebText Learning universal sentence representations
Table 6: Recent contrastive learning methods in NLP along with the datasets they were evaluated on and the respective
downstream tasks

16
A PREPRINT - F EBRUARY 9, 2021

[86] reveals that approaches like PIRL [17] and MoCo [14] fail to capture viewpoint and category instance invariance
that are crucial components for object recognition. Some of these issues are further discussed below.

9.1 Lack of Theoretical Foundation

In an attempt to investigate the generalization ability of contrastive objective function, the empirical results from Arora
et. al. [80] show that architecture design and sampling techniques also have a profound effect on the performance. Tsai
et. al. [87] provide an information-theoretical framework from a multi-view perspective to understand the properties
that encourage successful self-supervised learning. They demonstrate that self-supervised learned representations
can extract task-relevant information (with a potential loss) and discard task-irrelevant information (with a fixed gap).
Ultimately, it propels the methods towards being highly dependent on the pretext task chosen during training. This
affirms the need for more theoretical analysis on different modules in a contrastive pipeline.

9.2 Selection of Data Augmentation and Pretext Tasks

PIRL [17] emphasizes on methods that produce consistent results irrespective of the pretext task selected, but works like
SimCLR [42], MoCo-v2 [47] and Tian et. al. [19] demonstrate that selecting robust pretext tasks along with suitable
data augmentations can highly boost the quality of the representations. Recently, SwAV [13] beat other self-supervised
methods by using multiple augmentations. It is difficult to directly compare these methods to choose specific tasks and
transformations that can yield the best results on any dataset.

9.3 Proper Negative Sampling during Training

During training, an original (positive) sample is compared against its negative counterparts that contribute towards a
contrastive loss to train the model. In cases of easy negatives (where the similarity between the original sample and a
negative sample is very low), the contribution towards the contrastive loss is minimal. This limits the ability of the
model to converge quickly. To get more meaningful negative samples, top self-supervised methods either increase
the batch sizes [15] or maintain a very large memory bank [17]. Recently, Kalantidis et. al. [88] proposed a few
hard negative mixing strategies to facilitate faster and better learning. However, this introduces a large number of
hyperparameters that are specific to the training set and are difficult to generalize for other datasets.

9.4 Dataset Biases

In any self-supervised learning task, the data itself provides supervision. In effect, the representations learned using
self-supervised objectives are influenced by the underlying data. Such biases are difficult to minimize with increase in
the size of the datasets.

10 Conclusion
This paper has extensively reviewed recent top-performing self-supervised methods that follow contrastive learning
for both vision and NLP tasks. We clearly explain different modules in a contrastive learning pipeline; from choosing
the right pretext task, selecting an architectural design, to using the learned parameters for a downstream task. The
works based on contrastive learning have shown promising results on several downstream tasks such as image/video
classification, object detection, and other NLP tasks. Finally, this work concludes by discussing some of the open
problems of current approaches that are yet to be addressed. New techniques and paradigms are needed to tackle these
issues.

References
[1] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv
Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the
IEEE international conference on computer vision, pages 618–626, 2017.
[2] Xiao Liu, Fanjin Zhang, Zhenyu Hou, Zhaoyu Wang, Li Mian, Jing Zhang, and Jie Tang. Self-supervised learning:
Generative or contrastive, 2020.
[3] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial networks, 2014.

17
A PREPRINT - F EBRUARY 9, 2021

[4] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using
cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision,
pages 2223–2232, 2017.
[5] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial
networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410,
2019.
[6] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks, 2016.
[7] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative
adversarial text to image synthesis, 2016.
[8] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain
relations with generative adversarial networks, 2017.
[9] Robert Epstein. The empty brain, 2016. https://aeon.co/essays/
your-brain-does-not-process-information-and-it-is-not-a-computer.
[10] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise, 2017.
[11] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discrimina-
tive unsupervised feature learning with exemplar convolutional neural networks, 2014.
[12] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric
instance-level discrimination, 2018.
[13] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised
learning of visual features by contrasting cluster assignments, 2020.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual
representation learning, 2019.
[15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive
learning of visual representations, 2020.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[17] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations, 2019.
[18] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding,
2019.
[19] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good
views for contrastive learning, 2020.
[20] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui.
Spatiotemporal contrastive video representation learning, 2020.
[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding,
2018.
[22] Guillaume Lorre, Jaonary Rabarisoa, Astrid Orcesi, Samia Ainouz, and Stephane Canu. Temporal contrastive
pretraining for video action recognition. In The IEEE Winter Conference on Applications of Computer Vision,
pages 662–670, 2020.
[23] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine.
Time-contrastive networks: Self-supervised learning from video, 2017.
[24] Li Tao, Xueting Wang, and Toshihiko Yamasaki. Self-supervised video representation learning using inter-intra
contrastive framework, 2020.
[25] Tete Xiao, Xiaolong Wang, Alexei A. Efros, and Trevor Darrell. What should not be contrastive in contrastive
learning, 2020.
[26] Shin’ya Yamaguchi, Sekitoshi Kanai, Tetsuya Shioda, and Shoichiro Takeda. Multiple pretext-task for self-
supervised learning via mixing multiple image transformations, 2019.
[27] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in
the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613,
2014.
[28] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In
European Conference on Computer Vision, pages 69–84. Springer, 2016.

18
A PREPRINT - F EBRUARY 9, 2021

[29] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector
space, 2013.
[30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding, 2018.
[31] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja
Fidler. Skip-thought vectors. Advances in neural information processing systems, 28:3294–3302, 2015.
[32] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by
generative pre-training, 2018.
[33] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves
Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension, 2019.
[34] Tobias Glasmachers. Limits of end-to-end learning. arXiv preprint arXiv:1704.08305, 2017.
[35] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and
Yoshua Bengio. Learning deep representations by mutual information estimation and maximization, 2018.
[36] Mang Ye, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and
spreading instance feature, 2019.
[37] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual
information across views, 2019.
[38] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron
van den Oord. Data-efficient image recognition with contrastive predictive coding, 2019.
[39] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu,
and Dilip Krishnan. Supervised contrastive learning, 2020.
[40] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017.
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[42] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised gans via auxiliary
rotation loss, 2019.
[43] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized
statistical models. In AISTATS, 2010.
[44] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
[45] Aravind Srinivas, Michael Laskin, and Pieter Abbeel. Curl: Contrastive unsupervised representations for
reinforcement learning, 2020.
[46] Hakim Hafidi, Mounir Ghogho, Philippe Ciblat, and Ananthram Swami. Graphcl: Contrastive self-supervised
learning of graph representations, 2020.
[47] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive
learning, 2020.
[48] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017.
[49] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts, 2016.
[50] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual
embeddings, 2019.
[51] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Advances in Neural
Information Processing Systems, pages 10542–10552, 2019.
[52] Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven C. H. Hoi. Prototypical contrastive learning of
unsupervised representations, 2020.
[53] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and
representation learning, 2019.
[54] Michal Maj. Object detection and image classification with yolo, 2018.
[55] Farnaz Farahanipad, Harish Ram Nambiappan, Ashish Jaiswal, Maria Kyrarini, and Fillia Makedon. Hand-
reha: dynamic hand gesture recognition for game-based wrist rehabilitation. In Proceedings of the 13th ACM
International Conference on PErvasive Technologies Related to Assistive Environments, pages 1–9, 2020.

19
A PREPRINT - F EBRUARY 9, 2021

[56] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning
of visual features, 2019.
[57] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image
rotations, 2018.
[58] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image
database for scene recognition, 2017.
[59] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from
videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[60] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video
database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563.
IEEE, 2011.
[61] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In
proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[62] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning, 2017.
[63] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context
prediction, 2016.
[64] Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-
channel prediction, 2017.
[65] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. Aet vs. aed: Unsupervised representation learning by
auto-encoding transformations rather than data, 2019.
[66] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual
representation learning, 2019.
[67] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature
learning by inpainting, 2016.
[68] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization, 2016.
[69] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time
cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8545–8552,
2019.
[70] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by
sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676,
2017.
[71] Nawid Sayed, Biagio Brattoli, and Björn Ommer. Cross and learn: Cross-modal self-supervision. In German
Conference on Pattern Recognition, pages 228–243. Springer, 2018.
[72] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation
learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 3636–3645, 2017.
[73] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal
order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
[74] Li Tao, Xueting Wang, and Toshihiko Yamasaki. Self-supervised video representation learning using inter-intra
contrastive framework, 2020.
[75] Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, and Tao Mei. Seco: Exploring sequence supervision for
unsupervised representation learning, 2020.
[76] Ziming Liu, Guangyu Gao, AK Qin, and Jinyang Li. Dtg-net: Differentiated teachers guided self-supervised
video action recognition, 2020.
[77] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding, 2019.
[78] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words
and phrases and their compositionality, 2013.
[79] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with
applications to natural image statistics. Journal of Machine Learning Research, 13(11):307–361, 2012.
[80] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical
analysis of contrastive unsupervised representation learning, 2019.

20
A PREPRINT - F EBRUARY 9, 2021

[81] Dan Iter, Kelvin Guu, Larry Lansing, and Dan Jurafsky. Pretraining with contrastive sentence objectives improves
discourse performance of language models, 2020.
[82] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan
Huang, and Ming Zhou. Infoxlm: An information-theoretic framework for cross-lingual language model pre-
training, 2020.
[83] Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. Cert: Contrastive self-supervised
learning for language understanding, 2020.
[84] John M. Giorgi, Osvald Nitski, Gary D. Bader, and Bo Wang. Declutr: Deep contrastive learning for unsupervised
textual representations, 2020.
[85] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining, 2019.
[86] Senthil Purushwalkam and Abhinav Gupta. Demystifying contrastive self-supervised learning: Invariances,
augmentations and dataset biases, 2020.
[87] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning
from a multi-view perspective, 2020.
[88] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative
mixing for contrastive learning. arXiv preprint arXiv:2010.01028, 2020.

2448 Self Supervised Visual Re
No ratings yet
2448 Self Supervised Visual Re
109 pages
Decision Support and Business Intelligence 9th Edition
67% (21)
Decision Support and Business Intelligence 9th Edition
715 pages
Self-Supervised Contrastive Representation Learning For Semi-Supervised Time-Series Classification
No ratings yet
Self-Supervised Contrastive Representation Learning For Semi-Supervised Time-Series Classification
17 pages
Unsupervised Clustering For Deep Learnin
No ratings yet
Unsupervised Clustering For Deep Learnin
25 pages
Understanding Deep Contrastive Learning Via Coordinate-Wise Optimization
No ratings yet
Understanding Deep Contrastive Learning Via Coordinate-Wise Optimization
25 pages
Entropy 24 00551 v2
No ratings yet
Entropy 24 00551 v2
22 pages
A Survey On Semi-, Self - and Unsupervised Learning For Image Classification
No ratings yet
A Survey On Semi-, Self - and Unsupervised Learning For Image Classification
33 pages
ArXiv-2024-MingZhang-0-Towards Graph Contrastive Learning A Survey and Beyond
No ratings yet
ArXiv-2024-MingZhang-0-Towards Graph Contrastive Learning A Survey and Beyond
35 pages
Machine Learning
No ratings yet
Machine Learning
7 pages
What Should Not Be Contrastive in Contrastive Learning
No ratings yet
What Should Not Be Contrastive in Contrastive Learning
13 pages
Dense Constrastive Learning For Self Supervised Visual Pre Training
No ratings yet
Dense Constrastive Learning For Self Supervised Visual Pre Training
11 pages
Basak Pseudo-Label Guided Contrastive Learning For Semi-Supervised Medical Image Segmentation CVPR 2023 Paper
No ratings yet
Basak Pseudo-Label Guided Contrastive Learning For Semi-Supervised Medical Image Segmentation CVPR 2023 Paper
12 pages
Unsupervised Feature Learning Via Non-Parametric Instance Discrimination
No ratings yet
Unsupervised Feature Learning Via Non-Parametric Instance Discrimination
10 pages
Semi-Supervised Learning Literature Survey
No ratings yet
Semi-Supervised Learning Literature Survey
59 pages
Self-Supervised Contrastive Representation Learning For Semi-Supervised Time-Series Classification
No ratings yet
Self-Supervised Contrastive Representation Learning For Semi-Supervised Time-Series Classification
15 pages
Contrastive Learning
No ratings yet
Contrastive Learning
10 pages
Contrastive Self Supervised Learning With Hard Negative Pair Mining
No ratings yet
Contrastive Self Supervised Learning With Hard Negative Pair Mining
8 pages
22 Self Supervised Representation
No ratings yet
22 Self Supervised Representation
15 pages
Learning With Few Data
No ratings yet
Learning With Few Data
67 pages
23-Lopes Self-Supervised Clustering Based On Manifold Learning and Graph Convolutional Networks WACV 2023 Paper
No ratings yet
23-Lopes Self-Supervised Clustering Based On Manifold Learning and Graph Convolutional Networks WACV 2023 Paper
10 pages
Sim CLR
No ratings yet
Sim CLR
11 pages
Self-Supervised Learning For Semi-Supervised Time Series Classification
No ratings yet
Self-Supervised Learning For Semi-Supervised Time Series Classification
13 pages
2020- 【Hard Negative】 - CONTRASTIVE LEARNING WITH Hard Negative Samples
No ratings yet
2020- 【Hard Negative】 - CONTRASTIVE LEARNING WITH Hard Negative Samples
28 pages
Cty I2a 20230403
No ratings yet
Cty I2a 20230403
6 pages
A Unified Contrastive Energy-Based Model For Understanding The Generative Ability of Adversarial Training
No ratings yet
A Unified Contrastive Energy-Based Model For Understanding The Generative Ability of Adversarial Training
18 pages
On The Duality Between Contrastive and Noncontrastive Self-Supervised Learning
No ratings yet
On The Duality Between Contrastive and Noncontrastive Self-Supervised Learning
28 pages
Self-Supervised Representation Learning - Introduction, Advances and Challenges
No ratings yet
Self-Supervised Representation Learning - Introduction, Advances and Challenges
19 pages
29299-Article Text-33353-1-2-20240324
No ratings yet
29299-Article Text-33353-1-2-20240324
9 pages
Supervised Contrastive Learning
No ratings yet
Supervised Contrastive Learning
23 pages
A Guide To Self-Supervised Learning in Computer Vision
No ratings yet
A Guide To Self-Supervised Learning in Computer Vision
15 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
13 pages
Bayesian Self-Supervised Contrastive Learning
No ratings yet
Bayesian Self-Supervised Contrastive Learning
20 pages
Understanding Dimensional Collapse
No ratings yet
Understanding Dimensional Collapse
17 pages
Review On Self-Supervised Image Recognition Using Deep Neural
No ratings yet
Review On Self-Supervised Image Recognition Using Deep Neural
22 pages
ConCur Self-Supervised Graph Representation Based On Contrastive Learning With Curriculum Negative Sampling
No ratings yet
ConCur Self-Supervised Graph Representation Based On Contrastive Learning With Curriculum Negative Sampling
13 pages
SimCLR: Simple Framework For Contrastive Learning of Visual Representaitons
No ratings yet
SimCLR: Simple Framework For Contrastive Learning of Visual Representaitons
20 pages
Scarf
No ratings yet
Scarf
24 pages
Generative Adversarial Networks (Gans) : An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments
No ratings yet
Generative Adversarial Networks (Gans) : An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments
17 pages
6333 Regularization With Stochastic Transformations and Perturbations For Deep Semi Supervised Learning
No ratings yet
6333 Regularization With Stochastic Transformations and Perturbations For Deep Semi Supervised Learning
9 pages
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
No ratings yet
Unsupervised Embedding Learning Via Invariant and Spreading Instance Feature
11 pages
Research Paper
No ratings yet
Research Paper
7 pages
2020 - Supervised Contrastive Learning - Khosla Et Al - Curran Associates, Inc.
No ratings yet
2020 - Supervised Contrastive Learning - Khosla Et Al - Curran Associates, Inc.
13 pages
Weakly Supervised Contrastive Learning
No ratings yet
Weakly Supervised Contrastive Learning
10 pages
Response Two
No ratings yet
Response Two
3 pages
Measurement of The Photosphere Oblateness of Cassiopeiae Via Stellar Intensity Interferometry With The VERITAS Observatory
100% (1)
Measurement of The Photosphere Oblateness of Cassiopeiae Via Stellar Intensity Interferometry With The VERITAS Observatory
28 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
23 pages
GCLAA
No ratings yet
GCLAA
12 pages
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
No ratings yet
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
21 pages
He Momentum Contrast For Unsupervised Visual Representation Learning CVPR 2020 Paper
No ratings yet
He Momentum Contrast For Unsupervised Visual Representation Learning CVPR 2020 Paper
10 pages
Semi-Supervised Learning With Self-Supervised Networks
No ratings yet
Semi-Supervised Learning With Self-Supervised Networks
10 pages
SSL and Few Shot Learning
No ratings yet
SSL and Few Shot Learning
6 pages
Internship Report 2023-24 Data Science
100% (2)
Internship Report 2023-24 Data Science
23 pages
Vertical Structure and Dynamics of A Galactic Disk
No ratings yet
Vertical Structure and Dynamics of A Galactic Disk
223 pages
DrIver Drowsiness Detection PPT - PPTX 20250214 144426 0000
No ratings yet
DrIver Drowsiness Detection PPT - PPTX 20250214 144426 0000
18 pages
K-Means Clustering Numerical Example
No ratings yet
K-Means Clustering Numerical Example
5 pages
Self-Supervised Learning Generative or Contrastive
No ratings yet
Self-Supervised Learning Generative or Contrastive
20 pages
Feature Transformers A Unified Representation Learning Framework For Lifelong Learning
No ratings yet
Feature Transformers A Unified Representation Learning Framework For Lifelong Learning
11 pages
Semi-Supervised Learning With Ladder Network
No ratings yet
Semi-Supervised Learning With Ladder Network
19 pages
Semi Active PDF
No ratings yet
Semi Active PDF
8 pages
Self-Supervised Learning: Generative or Contrastive
No ratings yet
Self-Supervised Learning: Generative or Contrastive
20 pages
Alexnet Paper
No ratings yet
Alexnet Paper
39 pages
Using Weighted Nearest Neighbor To Benef PDF
No ratings yet
Using Weighted Nearest Neighbor To Benef PDF
12 pages
Curriculum Learning: A Survey: Petru Soviany Radu Tudor Ionescu Paolo Rota Nicu Sebe
No ratings yet
Curriculum Learning: A Survey: Petru Soviany Radu Tudor Ionescu Paolo Rota Nicu Sebe
40 pages
NNDL Technical Publication Notes
No ratings yet
NNDL Technical Publication Notes
81 pages
The Effective Field Theory of Large Scale Structure For Mixed Dark Matter Scenarios
No ratings yet
The Effective Field Theory of Large Scale Structure For Mixed Dark Matter Scenarios
49 pages
Relativistic Second Gradient Theory of Continuous Media: Abstract
No ratings yet
Relativistic Second Gradient Theory of Continuous Media: Abstract
31 pages
Dynamic Competition Between Phason and Amplitudon Observed by Ultrafast Multimodal Scanning Tunneling Micros
No ratings yet
Dynamic Competition Between Phason and Amplitudon Observed by Ultrafast Multimodal Scanning Tunneling Micros
32 pages
Agents IA
No ratings yet
Agents IA
23 pages
Recent Advances in Understanding - Process Nucleosynthesis in Metal-Poor Stars and Stellar Systems
No ratings yet
Recent Advances in Understanding - Process Nucleosynthesis in Metal-Poor Stars and Stellar Systems
26 pages
Axion USR Inflation: Alireza Talebian, Hassan Firouzjahi
No ratings yet
Axion USR Inflation: Alireza Talebian, Hassan Firouzjahi
36 pages
Large Scale Wind Driven Structures in The Orion Nebula: C. R. O'Dell and N. P. Abel
No ratings yet
Large Scale Wind Driven Structures in The Orion Nebula: C. R. O'Dell and N. P. Abel
32 pages
An Introductory Note On Machine Learning. A V Narasimhadhan
No ratings yet
An Introductory Note On Machine Learning. A V Narasimhadhan
2 pages
Massive Interacting Binaries Enhance Feedback in Star-Forming Regions
No ratings yet
Massive Interacting Binaries Enhance Feedback in Star-Forming Regions
28 pages
Chaos of Charged Particles in Quadrupole Magnetic Fields Under Schwarzschild Backgrounds
No ratings yet
Chaos of Charged Particles in Quadrupole Magnetic Fields Under Schwarzschild Backgrounds
10 pages
Space-Based Mm/mg-Scale Laser Interferometer For Quantum Gravity
No ratings yet
Space-Based Mm/mg-Scale Laser Interferometer For Quantum Gravity
9 pages
N The Factorization of Matrices Into Products of Positive Definite Ones
No ratings yet
N The Factorization of Matrices Into Products of Positive Definite Ones
7 pages
The SOMA-POL Survey. I. Polarization and Magnetic Field Properties of Massive Protostars
No ratings yet
The SOMA-POL Survey. I. Polarization and Magnetic Field Properties of Massive Protostars
17 pages
All-Flavor Time-Dependent Search For Transient Neutrino Sources
No ratings yet
All-Flavor Time-Dependent Search For Transient Neutrino Sources
10 pages
On A Conjecture of de Branges: Keywords: 2020 MSC
No ratings yet
On A Conjecture of de Branges: Keywords: 2020 MSC
6 pages
Agent Survey
No ratings yet
Agent Survey
35 pages
Prescribed Chern Scalar Curvatures On Complete Hermitian Manifolds
No ratings yet
Prescribed Chern Scalar Curvatures On Complete Hermitian Manifolds
21 pages
BeyondAI Proceedings 2024
No ratings yet
BeyondAI Proceedings 2024
19 pages
Full-Stack Optimized Large Language Models For Lifelong Sequential Behavior Comprehension in Recommendation
No ratings yet
Full-Stack Optimized Large Language Models For Lifelong Sequential Behavior Comprehension in Recommendation
30 pages
Unleashing The Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent Through Multi-Persona Self-Collaboration
No ratings yet
Unleashing The Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent Through Multi-Persona Self-Collaboration
23 pages
Glam: Efficient Scaling of Language Models With Mixture-Of-Experts
No ratings yet
Glam: Efficient Scaling of Language Models With Mixture-Of-Experts
23 pages
A General Relativistic Magnetohydrodynamics Extension To Mesh-Less Schemes in The Code GIZMO
No ratings yet
A General Relativistic Magnetohydrodynamics Extension To Mesh-Less Schemes in The Code GIZMO
20 pages
Mining The Alerts: A Preliminary Catalog of Compact Binaries From The Fourth Observing Run
No ratings yet
Mining The Alerts: A Preliminary Catalog of Compact Binaries From The Fourth Observing Run
9 pages
Measuring and Controlling Instruction in Stability in Language Model Dialogs
No ratings yet
Measuring and Controlling Instruction in Stability in Language Model Dialogs
19 pages
Novel Auroral Line Measurements Extend The Gradual Offset of The FMR Deep Into The First Gyr of Cosmic Time
No ratings yet
Novel Auroral Line Measurements Extend The Gradual Offset of The FMR Deep Into The First Gyr of Cosmic Time
13 pages
SQL Exercises
No ratings yet
SQL Exercises
17 pages
Customizing Language Models With Instance-Wise LoRA For Sequential Recommendation
No ratings yet
Customizing Language Models With Instance-Wise LoRA For Sequential Recommendation
17 pages
Cancellation of One-Loop Correction To Soft Tensor Power Spectrum
No ratings yet
Cancellation of One-Loop Correction To Soft Tensor Power Spectrum
13 pages
Gaia Tess: Jowen Callahan,, D. M. Rowan, C. S. Kochanek, K. Z. Stanek
No ratings yet
Gaia Tess: Jowen Callahan,, D. M. Rowan, C. S. Kochanek, K. Z. Stanek
11 pages
Generative AI For Software Development
No ratings yet
Generative AI For Software Development
2 pages
Separation Strategies For Three Pitfalls in AB Testing Withacknowledgments
No ratings yet
Separation Strategies For Three Pitfalls in AB Testing Withacknowledgments
7 pages
Testing The Asteroseismic Estimates of Stellar Radii With Surface Brightness-Colour Relations and DR3 Parallaxes
No ratings yet
Testing The Asteroseismic Estimates of Stellar Radii With Surface Brightness-Colour Relations and DR3 Parallaxes
5 pages
Annexure II Sci
No ratings yet
Annexure II Sci
109 pages
Deep Learning: Data Mining: Advanced Aspects
No ratings yet
Deep Learning: Data Mining: Advanced Aspects
131 pages
Classification of Diabetes Mellitus Prediction Using Hybrid Machine Learning Techniques
No ratings yet
Classification of Diabetes Mellitus Prediction Using Hybrid Machine Learning Techniques
10 pages
Checkpoint 2 Revision
No ratings yet
Checkpoint 2 Revision
17 pages
ETI Microproject Report
No ratings yet
ETI Microproject Report
19 pages
Driver Drowsiness Detection and Alert Generating System
No ratings yet
Driver Drowsiness Detection and Alert Generating System
5 pages
Qubo Frormulation
No ratings yet
Qubo Frormulation
7 pages
Human Activity Recognition Using ML Techniques
100% (1)
Human Activity Recognition Using ML Techniques
5 pages
Heart Disease Prediction
No ratings yet
Heart Disease Prediction
2 pages
CISCON2024PAPAER522
No ratings yet
CISCON2024PAPAER522
8 pages
Building A Convolutional Neural Network Using Tensorflow Keras
No ratings yet
Building A Convolutional Neural Network Using Tensorflow Keras
10 pages
Guidelines Data mining-II BA Major Sem 4 NEP
No ratings yet
Guidelines Data mining-II BA Major Sem 4 NEP
2 pages
AI - Instructional System Design (Final)
No ratings yet
AI - Instructional System Design (Final)
25 pages
(M) BROCHURE - Data Science Learning Path
No ratings yet
(M) BROCHURE - Data Science Learning Path
33 pages
A Flight Fare Prediction Using Machine Learning
No ratings yet
A Flight Fare Prediction Using Machine Learning
8 pages
Text Mining Project Report
No ratings yet
Text Mining Project Report
27 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
5 pages
Ex 9
No ratings yet
Ex 9
2 pages
Artificial Intelligence (AI) and Machine Learning - A Boon or Bane For Humanity
No ratings yet
Artificial Intelligence (AI) and Machine Learning - A Boon or Bane For Humanity
2 pages
Business Algorithm and Data Structures For Information Systems
No ratings yet
Business Algorithm and Data Structures For Information Systems
3 pages
Statement of Purpose
No ratings yet
Statement of Purpose
1 page
Online Analysis of Handwriting For Disease Diagnosis: A Review
No ratings yet
Online Analysis of Handwriting For Disease Diagnosis: A Review
7 pages
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
From Everand
50 Breakthrough AI Concepts in 500 Words Each: In 500 words, #17
Nietsnie Trebla
No ratings yet
Self-Supervised Learning: Teaching AI with Unlabeled Data
From Everand
Self-Supervised Learning: Teaching AI with Unlabeled Data
Robert Johnson
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

A Survey On Contrastive Self-Supervised Learning

Uploaded by

A Survey On Contrastive Self-Supervised Learning

Uploaded by

A S URVEY ON C ONTRASTIVE S ELF - SUPERVISED L EARNING

Ashish Jaiswal Ashwin Ramesh Babu

Mohammad Zaki Zadeh Debapriya Banerjee

Figure 2: Contrastive learning pipeline for self-supervised training

2.1 Color Transformation

2.2 Geometric Transformation

2.3.1 Jigsaw puzzle

2.3.2 Frame order based

2.3.3 Future prediction

2.4 View Prediction (Cross modal-based)

Figure 8: Learning representation from video frame sequence [23]

2.5 Identifying the right pre-text task

2.6 Pre-text tasks in NLP

2.6.1 Center and Neighbor Word Prediction

2.6.2 Next and Neighbor Sentence Prediction

2.6.3 Auto-regressive Language Modeling

2.6.4 Sentence Permutation

3.1 End-to-End Learning

3.2 Using a Memory Bank

3.3 Using a Momentum Encoder

θk ← mθk + (1 − m)θq (1)

3.4 Clustering Feature Representations

where ki represents a negative sample.

ImageNet (Self-supervised) Semi-supervised (Top-5)

Figure 17: An overview of downstream task for images

6.1 Visualizing Kernels and Feature Maps

6.2 Nearest Neighbor retrieval

8 Contrastive Learning in NLP

Method Architecture Parameters Top-1 Accuracy

Table 2: Image classification accuracy on Places dataset pretrained on ImageNet.

Method Architecture Parameters (1)Classification (2)Detection

Method Model UCF-101 HMDB-51 K (top1) K (top5)

Architecture Dataset Accuracy

9 Discussions and Future Directions

Model Dataset Application areas

9.1 Lack of Theoretical Foundation

9.2 Selection of Data Augmentation and Pretext Tasks

9.3 Proper Negative Sampling during Training

9.4 Dataset Biases

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.