
Self-Supervised Multi-Modal Sequential Recommendation

Kunzhe Song
kz_song@stu.pku.edu.cn
Peking University, Beijing, China

Qingfeng Sun
qins@microsoft.com
Microsoft, Beijing, China

Can Xu
caxu@microsoft.com
Microsoft, Beijing, China

Kai Zheng
zhengkai@microsoft.com
Microsoft, Beijing, China

Yaming Yang
yang.yaming@microsoft.com
Microsoft, Beijing, China

arXiv:2304.13277v1 [cs.IR] 26 Apr 2023

ABSTRACT

With the increasing development of e-commerce and online services, personalized recommendation systems have become crucial for enhancing user satisfaction and driving business revenue. Traditional sequential recommendation methods that rely on explicit item IDs encounter challenges in handling item cold start and domain transfer problems. Recent approaches have attempted to use modal features associated with items as a replacement for item IDs, enabling the transfer of learned knowledge across different datasets. However, these methods typically calculate the correlation between the model's output and item embeddings, which may suffer from inconsistencies between high-level feature vectors and low-level feature embeddings, thereby hindering further model learning. To address this issue, we propose a dual-tower retrieval architecture for sequence recommendation. In this architecture, the predicted embedding from the user encoder is used to retrieve the generated embedding from the item encoder, thereby alleviating the issue of inconsistent feature levels. Moreover, to further improve the retrieval performance of the model, we also propose a self-supervised multi-modal pretraining method inspired by the consistency property of contrastive learning. This pretraining method enables the model to align various feature combinations of items, thereby generalizing effectively to diverse datasets with different item features. We evaluate the proposed method on five publicly available datasets and conduct extensive experiments. The results demonstrate significant performance improvements of our method. The code and pre-trained model for our method are publicly available at https://github.com/kz-song/MMSRec.

CCS CONCEPTS

• Information systems → Recommender systems.

KEYWORDS

Sequential Recommendation, Self-Supervised Learning, Multi-Modal

ACM Reference Format:
Kunzhe Song, Qingfeng Sun, Can Xu, Kai Zheng, and Yaming Yang. 2023. Self-Supervised Multi-Modal Sequential Recommendation. In Proceedings of xx. ACM, New York, NY, USA, 11 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

In recent years, recommendation systems have become an essential component of many online services and e-commerce platforms. Unlike when searching for products with a specific intent, users browsing online content often lack a clear intention. As it is impractical to display all available items on a platform, personalized recommendation systems have emerged. These systems [9] aim to provide users with customized recommendations that help them discover new items or products of interest. Personalized recommendation systems focus on mining the similarity between users and items by obtaining the probability distribution of interactions between users and all items. Early recommendation methods, such as collaborative filtering [38], modeled users and items as vectors in a hidden space based on their past interaction information. Later, deep learning-based recommendation methods modeled users using deep neural networks, introducing user features [11], multi-interest modeling [5, 28], and other factors to more accurately predict user preferences. However, these methods all face certain limitations in capturing the temporal dynamics and sequence information of user behavior.

Sequential recommendation methods [21, 42] have emerged as a promising approach to modeling user behavior in a more fine-grained and dynamic way. These methods exploit the temporal order of user interactions to predict the next item that a user is likely to engage with. By considering the sequential dependencies of user behavior, these methods can also provide more accurate and diverse recommendations, especially for long-term user preferences. In the early days, simple sequential models like Markov Chains [20, 39] were employed to capture sequential patterns in user behavior data. However, this approach faced limitations in capturing long-term dependencies and handling variable-length sequences. With the advent of deep learning techniques, more advanced methods have been proposed. Recurrent Neural Networks (RNNs) [10, 21] can capture long-term dependencies in sequential data, while Transformer-based models [25, 41] excel at capturing complex sequential patterns. Graph-based models [6, 36, 46] can explore more complex item transition patterns in user sequences. These techniques have demonstrated superior performance compared to early methods.

Despite the potential of sequential recommendation methods, the prevalent approach is to represent items explicitly by their unique identifiers or IDs. While this approach has demonstrated efficacy in certain scenarios, it is not without limitations. Specifically, this approach is typically constrained to generating item recommendations within the same platform, as items are not readily transferable across domains, hindering cross-domain generalization of the model. Furthermore, this approach is unable to perform cold-start recommendations for items with limited interaction history on the platform, which can be considered a few-shot learning problem. Consequently, numerous methods have been proposed to address this issue. For instance, some methods mine precise cross-domain user preferences based on intra-sequence and inter-sequence item interactions [4, 51]. Other methods construct mappings that project cold item contents to the warm item embedding space [7, 34]. Although these methods have made progress, they have not fully resolved the fundamental issue caused by explicitly modeling item IDs.

With the rapid advancement of pre-trained models, such as BERT [12] and CLIP [37], recent methods for sequential recommendation have addressed the limitations of traditional item ID indexing by leveraging item-associated modalities to enable representation transfer across diverse datasets. Pre-training on datasets with modalities facilitates efficient model transfer to other datasets with similar modalities, leading to improved performance on those datasets or even achieving zero-shot recommendation. However, existing approaches [13, 23, 44] that employ item-associated modalities for cross-domain transfer still encounter a significant challenge: the issue of inconsistent feature levels between model outputs and item embeddings. In most sequential recommendation methods, user interaction sequences are encoded using a sequence encoder to obtain sequence-level representations, which are then directly compared with item embeddings using cosine similarity. However, the model's output represents high-level features with respect to the sequence context, while item embeddings represent low-level features for individual items, resulting in a substantial gap in feature levels. Computing similarity directly between them forcibly aligns the model's input and output, which prevents the model from generating effective representations.

To address this issue, we propose a sequential recommendation method based on a dual-tower retrieval architecture. Our approach utilizes a sequence encoder for capturing user interaction sequences and an item encoder for encoding item information, ensuring that both are represented as high-level features. Considering that user interaction sequences are fundamentally composed of items, we also share parameters between the two encoders to enable mutual reinforcement of information. To further enhance the retrieval capability of our model, we propose a self-supervised multi-modal pre-training approach. By constructing contrastive learning tasks between each modality input and its own augmented representation, as well as between representations of different modalities, our model is able to strengthen its fine-grained discrimination ability for items and learn alignment of different modality features in the latent space. This enables the pre-trained model to generalize well on downstream datasets with diverse item features.
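To make the retrieval step concrete, the following sketch (our illustration, not the authors' released implementation; `seq_encoder`, `item_encoder`, and the cosine scoring are assumptions based on the description above) shows how a predicted user embedding can retrieve items scored against item-encoder outputs, so that both sides of the comparison are high-level features:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(seq_encoder, item_encoder, user_sequence, all_item_inputs, k=10):
    """Dual-tower retrieval sketch: score the predicted user embedding against
    embeddings produced by the item encoder, not against raw item embeddings."""
    user_emb = seq_encoder(user_sequence)        # (d,)  predicted next-item embedding
    item_embs = item_encoder(all_item_inputs)    # (num_items, d)
    scores = F.cosine_similarity(user_emb.unsqueeze(0), item_embs, dim=-1)
    return torch.topk(scores, k=k).indices       # indices of the k best-matching items
```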
The main contributions of this paper are summarized as follows:

• We propose a sequential recommendation method based on a dual-tower retrieval architecture. Our approach employs a sequence encoder to capture user interaction sequences and an item encoder to encode item information, addressing the issue of inconsistent feature levels between model outputs and item embeddings.
• We introduce a self-supervised multi-modal pre-training approach to enhance the retrieval capability of our model. By constructing contrastive learning tasks between various modality combinations, our model strengthens its fine-grained discrimination ability for items and learns alignment of different modality features in the latent space.
• We conduct extensive experiments on five public datasets to demonstrate the effectiveness of our proposed approach and architecture.

2 RELATED WORK

2.1 Sequential Recommendation

Sequential recommendation has been widely investigated in recent years, and various models have been proposed to address this task. Markov Chain-based techniques have been employed to model sequential dependencies [20, 39] in recommendation tasks. However, these approaches have limited capacity in capturing long-term dependencies in sequences. To overcome this limitation, recurrent neural networks (RNNs) [10], including variants such as Long Short-Term Memory (LSTM) [22] and Gated Recurrent Unit (GRU) [14, 21], have been introduced for sequential recommendation. These models have demonstrated strong performance in capturing both short-term and long-term dependencies in sequences.

Moreover, owing to the power of attention mechanisms, many works have incorporated attention into RNN-based models [25, 41] to enhance the model's capability to attend to relevant parts of the sequence. In addition to modeling based on explicit item IDs, several works explore other ways to improve recommendation performance. For instance, FDSA [52] and S³-Rec [54] have introduced rich attribute information of items, while CL4SRec [47] and DuoRec [35] have introduced self-supervised signals. However, these approaches face a common issue: they are usually limited to specific data domains or platforms due to the existence of explicit item IDs.

Recently, some studies have attempted to overcome the limitation of explicit item IDs for item cold start and cross-domain transfer. ZESRec [13] no longer relies on item indexes but instead uses natural language descriptions as item representations. UniSRec [23] introduces two contrastive learning tasks in the pre-training stage, enabling the model to learn universal item and sequential representations. However, these methods still face a significant limitation: the feature levels of model outputs and item embeddings are inconsistent, as the model's output represents high-level features while item embeddings represent low-level features, resulting in a substantial gap in feature levels, which prevents effective representation generation when computing similarity directly between them.

2.2 Self-supervised Multi-modal Learning

Deep neural networks have proven to be highly effective in many applications, but their performance can be limited by the availability of labeled data. While training on large-scale labeled datasets [27, 30] can improve performance, the use of deep networks can be constrained when data is scarce or obtaining annotations is challenging. To overcome these issues, self-supervised learning (SSL) has emerged as a promising alternative. Unlike traditional supervised learning, SSL [24] enables models to learn from unlabeled data, which is often more abundant than labeled data. SSL works by training models on learning objectives derived from the training samples themselves, without the need for external annotations.

One popular approach in self-supervised learning is contrastive learning [8, 18], where the model is trained to distinguish between a positive example and a set of negative examples. This has been successfully applied in cross-modal tasks such as visual-text matching [31, 37, 48], audio-visual alignment [26, 32], and multi-modal combination [1, 2]. These approaches have achieved state-of-the-art results on various benchmarks, demonstrating the effectiveness of self-supervised learning and contrastive learning in solving real-world problems.

Video is an excellent example of a multimodal learning source that naturally combines multiple modalities and allows for learning from large-scale data that may not be feasible to annotate manually. Recently, there have been many efforts [29, 48, 50] in cross-modal self-supervised learning on video. AudioCLIP [16] uses pre-trained models for feature extraction from three modalities: vision, audio, and text. It employs contrastive learning between each pair of modalities to align their features. FrozenInTime [3] uses a dual encoding model to separately encode video and text, with its space-time transformer encoder capable of flexibly encoding images or videos by treating an image as a single-frame video. Wav2CLIP [45] utilizes CLIP [37] as the encoder for text and images, and constructs a contrastive learning task between the frozen image encoder and an audio encoder to align the hidden representations of the three pre-trained modal encoders. EverythingAtOnce [40] aligns different modal combinations with each other by constructing a composite contrastive learning loss between multiple modalities.

3 METHOD

In this section, we propose a self-supervised multi-modal sequential recommendation model. By leveraging self-supervised learning on multi-modal datasets, our approach effectively enhances the retrieval performance of the model, thereby improving its performance on downstream recommendation tasks. In the following, for convenience, we consider the use of two modalities: vision and text.

3.1 Problem Statement

3.1.1 Input Formulation. Assuming that the set of all items is denoted as $I$, each item $i \in I$ is represented not by a simple item ID, but by a tuple $(v_i, t_i)$ which encompasses both the visual and textual information associated with the item. The visual information associated with an item is represented as $v_i = \{v_i^1, v_i^2, \ldots, v_i^m\}$, where $v_i^j$ corresponds to an image, and the length of the visual sequence is denoted as $|v_i| = m$. Notably, there are two types of input information for the visual modality: videos and images. In the case of videos, the visual sequence is obtained by extracting frames from the video at regular intervals, with each $v_i^j$ representing a frame from the video. In the case of images, the visual sequence length is $|v_i| = 1$, as each item is represented by a single image. The textual information associated with an item is denoted as $t_i = \{t_i^1, t_i^2, \ldots, t_i^n\}$, where $t_i^j$ represents a word token, and the length of the textual sequence is denoted as $|t_i| = n$.

3.1.2 Sequence Recommendation. The objective of the sequence recommendation task is to infer a user's preferences and provide recommendations for the next item based on historical interaction data. Let us denote the user's interaction sequence, sorted chronologically, as $S = \{i_1, \ldots, i_t, \ldots, i_{|S|}\}$, where $i_t \in I$ represents the item that the user interacted with at time step $t$, and $|S|$ denotes the length of the user's interaction sequence. Accordingly, the task objective can be formulated as predicting the probability distribution over the entire item set $I$ for the potential interaction of the user at time step $|S|+1$:

$$p\left(i_{|S|+1} = i \mid S\right) \qquad (1)$$

3.2 Self-supervised Multi-modal Learning

3.2.1 Input Representation. In previous sequential recommendation approaches [21, 25, 41], explicit item IDs were used as trainable representations of items, which restricts the model's ability to transfer general knowledge across different domains. To overcome this limitation and enable effective transfer of pre-trained models to new recommendation scenarios, we propose to use the modal information associated with items as a bridge that connects multiple domains. By extracting and mapping the modal features of items from diverse domains to a common semantic space using pre-trained modality models [12, 19, 37], we overcome the reliance on explicit item IDs. As most recommendation scenarios involve items with both visual and textual modalities, we choose these two modalities as the input feature information for items.

For an item with visual modality, we process it as two types of input data: images and videos. For image data, we extract features directly using a pre-trained visual model (PVM) [37] to obtain the image representation. For video data, we first sample frames from the video at regular intervals to obtain a collection of video frames, denoted as $v_i = \{v_i^1, v_i^2, \ldots, v_i^m\}$. Next, we use the PVM to extract features from all video frames, and then take the average of all frame features to obtain the representation of the video. The obtained representation then passes through a single-layer neural network, producing the visual feature embedding:

$$E_i^v = f_{NN}\left(\frac{1}{m}\sum_{j=1}^{m} f_{PVM}(v_i^j)\right) \qquad (2)$$

where $E_i^v$ represents the visual feature embedding for item $i$, $f_{NN}$ denotes the neural network function, and $f_{PVM}$ represents the feature extraction function of the pre-trained visual model.
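As an illustration of Eq. (2), the following PyTorch sketch (ours, assuming a frozen CLIP-style model exposing an `encode_image` method) averages the frozen frame features and applies the single-layer network $f_{NN}$:

```python
import torch
import torch.nn as nn

class VisualEmbedder(nn.Module):
    """Sketch of Eq. (2): mean of frozen PVM frame features, then a single-layer NN."""
    def __init__(self, pvm, feat_dim=512, embed_dim=512):
        super().__init__()
        self.pvm = pvm                               # frozen PVM, e.g. CLIP ViT-B/32
        self.f_nn = nn.Linear(feat_dim, embed_dim)   # the single-layer network f_NN

    def forward(self, frames):   # frames: (m, C, H, W); an image is the case m = 1
        with torch.no_grad():    # PVM parameters stay frozen during training
            frame_feats = self.pvm.encode_image(frames)   # f_PVM(v_i^j), shape (m, feat_dim)
        video_feat = frame_feats.float().mean(dim=0)      # (1/m) * sum_j f_PVM(v_i^j)
        return self.f_nn(video_feat)                      # E_i^v
```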
For an item with textual modality, we utilize a pre-trained language model (PLM) [37] to extract features from the associated text. Given the text token sequence $t_i = \{t_i^1, t_i^2, \ldots, t_i^n\}$ of item $i$, we first prepend a special token $[CLS]$ to the token sequence, and then input the concatenated sequence into the PLM.

Figure 1: The framework of the self-supervised multi-modal sequential recommendation method. The left part illustrates the self-supervised multi-modal pre-training method, while the right part shows the dual-tower retrieval architecture for sequential recommendation.

Finally, we select the token corresponding to the $[CLS]$ position in the hidden layer as the representation of the text. The obtained representation then passes through a single-layer neural network, producing the text feature embedding:

$$E_i^t = f_{NN}\left(f_{PLM}([CLS], t_i^1, t_i^2, \ldots, t_i^n)\right) \qquad (3)$$

where $E_i^t$ represents the text feature embedding for item $i$, and $f_{PLM}$ represents the feature extraction function of the pre-trained language model.
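A corresponding sketch for Eq. (3). The `plm`/`tokenizer` interface below follows the HuggingFace BERT convention, in which the tokenizer prepends $[CLS]$ automatically; this interface is an assumption for illustration (the backbones actually used are discussed in Section 4.1.3):

```python
import torch
import torch.nn as nn

class TextEmbedder(nn.Module):
    """Sketch of Eq. (3): [CLS]-position feature from a frozen PLM, then a single-layer NN."""
    def __init__(self, plm, tokenizer, feat_dim=512, embed_dim=512):
        super().__init__()
        self.plm, self.tokenizer = plm, tokenizer    # frozen pre-trained language model
        self.f_nn = nn.Linear(feat_dim, embed_dim)   # the single-layer network f_NN

    def forward(self, text: str):
        tokens = self.tokenizer(text, return_tensors="pt")   # prepends [CLS]
        with torch.no_grad():                                # PLM parameters stay frozen
            hidden = self.plm(**tokens).last_hidden_state    # (1, n+1, feat_dim)
        cls_feat = hidden[0, 0]                              # hidden state at [CLS]
        return self.f_nn(cls_feat)                           # E_i^t
```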
Hence, following the feature extraction process using pre-trained modality models, an input item $i \in I$ is converted into a feature embedding tuple $(E_i^v, E_i^t)$. As the item encompasses two distinct modalities, we employ the sum of the feature embedding and a modality embedding as the input to the model to enhance discriminability.

3.2.2 Multi-modal Item Encoder. In various recommendation scenarios, items may be associated with different modalities, such as visual and textual modalities. These modalities can take the form of single modalities, denoted as $v$ or $t$, or a combination of both modalities, represented as a tuple $(v, t)$. Our objective is to develop a method that enables the model to be modality-agnostic, allowing it to map input from any modality combination of an item to a common hidden space and align them using a mapping function $f_{item}$. Considering that an item may have multiple modal inputs, the model needs to encode bidirectionally, enabling it to jointly embed them for representation.

Therefore, we propose a multi-modal item encoder based on the foundational structure of a transformer block [43]. The transformer block comprises a multi-head self-attention layer and a feed-forward network. Prior to inputting the feature embeddings of an item into the model, we add modality embeddings to them. In cases where certain modalities are absent, we replace the corresponding feature embeddings with a $[mask]$ token. Leveraging the powerful attention mechanism of the transformer block, different modalities can mutually attend to each other, facilitating information sharing and the completion of missing modality information. Subsequently, we employ modality-specific projection layers to project the feature embeddings into a shared hidden space, followed by normalization. Finally, a max-pooling layer is utilized to extract the salient feature information from all modalities, resulting in a unified representation of the item:

$$h_i = \mathrm{MaxPooling}(h_i^v, h_i^t) \qquad (4)$$
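The following sketch shows one plausible reading of this encoder. The dimensions, layer counts, and the elementwise max across the two modality positions are our assumptions, chosen to be consistent with Sections 3.2.2 and 4.1.3, not a verbatim reproduction of the released code:

```python
import torch
import torch.nn as nn

class ItemEncoder(nn.Module):
    """Sketch: modality embeddings, transformer attention over the two modality
    tokens, modality-specific projections, normalization, then max-pooling (Eq. 4)."""
    def __init__(self, d=512, heads=8, layers=2):
        super().__init__()
        self.modality_emb = nn.Embedding(2, d)           # 0: visual, 1: textual
        self.mask_token = nn.Parameter(torch.zeros(d))   # stands in for an absent modality
        block = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.proj_v = nn.Linear(d, d)                    # modality-specific projections
        self.proj_t = nn.Linear(d, d)

    def forward(self, e_v=None, e_t=None):
        # Replace a missing modality with [mask], then add modality embeddings.
        v = (e_v if e_v is not None else self.mask_token) + self.modality_emb.weight[0]
        t = (e_t if e_t is not None else self.mask_token) + self.modality_emb.weight[1]
        h = self.encoder(torch.stack([v, t]).unsqueeze(0))       # (1, 2, d), joint attention
        h_v = nn.functional.normalize(self.proj_v(h[0, 0]), dim=-1)
        h_t = nn.functional.normalize(self.proj_t(h[0, 1]), dim=-1)
        return torch.max(h_v, h_t)                               # h_i = MaxPooling(h_v, h_t)
```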
3.2.3 Contrastive Loss. As our objective is to develop a modality-agnostic model that can effectively handle diverse combinations of input modalities and align them in a shared hidden space, we draw inspiration from recent advances in contrastive learning [37, 40]. Contrastive learning has shown remarkable capabilities in representation alignment and generalization, where similar input contents are encouraged to be close while inputs with distinct semantics are pushed apart. To leverage these benefits, we adopt a retrieval pre-training task based on contrastive learning. This approach enhances the semantic alignment capability among multiple modalities, addressing the challenge of inconsistent modalities associated with items in different domains. By employing contrastive learning, we enable items with similar semantics but varying modalities from distinct domains to align with each other, promoting robust and effective representation learning.

Taking the text and visual modalities as an example, the model should possess the capability to align within a single modality, such as $(v, v')$ and $(t, t')$, align across modalities, such as $(v, t)$, and align modalities in combination, such as $(vt, vt')$. Here, $v'$, $t'$, and $vt'$ represent augmented data produced by an unsupervised data augmentation method [15]. The contrastive loss function is designed as follows:

$$\mathcal{L} = \lambda_{v,v'}\,\ell(v, v') + \lambda_{t,t'}\,\ell(t, t') + \lambda_{v,t}\,\ell(v, t) + \lambda_{vt,vt'}\,\ell(vt, vt') \qquad (5)$$

where $\lambda_{x,y}$ represents the weight coefficient between modalities $x$ and $y$, and $\ell(x, y)$ represents the contrastive loss using Noise Contrastive Estimation [33]:

$$\ell(x,y) = \mathop{\mathbb{E}}_{i \in |B|}\left[-\log \frac{\exp(\mathrm{sim}(x_i, y_i)/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(x_i, y_j)/\tau)}\right] + \mathop{\mathbb{E}}_{j \in |B|}\left[-\log \frac{\exp(\mathrm{sim}(y_j, x_j)/\tau)}{\sum_{i=1}^{B}\exp(\mathrm{sim}(y_j, x_i)/\tau)}\right] \qquad (6)$$

where $B$ represents the batch size, and $\tau$ is a temperature coefficient. Through the utilization of the contrastive loss, our model achieves the capability to represent items in a universal manner, overcoming the limitations imposed by the size of the pretraining dataset in the recommendation domain. As a result, our model exhibits robust generalization capabilities, being able to effectively encode inputs of diverse modality combinations and align semantically similar items in the shared hidden space.
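Eqs. (5) and (6) correspond to a symmetric InfoNCE objective with in-batch negatives. A minimal sketch, assuming row-aligned positive pairs within a batch:

```python
import torch
import torch.nn.functional as F

def nce_loss(x, y, tau=0.05):
    """Symmetric NCE loss of Eq. (6): x, y are (B, d) batches of paired embeddings;
    matching rows are positives, all other in-batch rows are negatives."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / tau                       # sim(x_i, y_j) / tau
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def multimodal_loss(v, t, v_aug, t_aug, vt, vt_aug, lam=0.25):
    """Composite objective of Eq. (5); all weights are 0.25 in the paper (Sec. 4.1.3)."""
    return lam * (nce_loss(v, v_aug) + nce_loss(t, t_aug)
                  + nce_loss(v, t) + nce_loss(vt, vt_aug))
```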
3.3 Sequential Representation Learning

Due to the sensitive nature of user interactions in the recommendation domain and strict privacy regulations, there is a lack of publicly available large-scale pretraining datasets. Previous methods for sequence-based recommendation [13, 23, 44] have relied on limited datasets for pretraining, leading to inadequate generalization capabilities and diminished performance when transferring across domains. Recognizing the intricate nature of user behavior patterns in diverse recommendation domains, our objective is to avoid introducing domain-specific biases into the pre-trained model. Therefore, we learn domain-specific sequence interaction patterns separately for each downstream domain.

3.3.1 Input Representation. In the pretraining retrieval task, as discussed in Section 3.2, the model operates on individual items without considering their relative positional relationships. However, in the downstream sequential recommendation task, the model takes into account a user's historical interaction sequence as input, where items are arranged chronologically based on interaction time and thus have explicit positional relationships. To capture this temporal order, we introduce positional embeddings as an additional component of the model input. Specifically, for a given item $i$, its input embedding is obtained by summing the feature embedding, modality embedding, and positional embedding. In this work, in order to improve the performance of our model, we utilize learnable positional embedding matrices instead of fixed sinusoidal embeddings. The positional embedding matrices enable our model to capture contextual relationships and the interaction order of each item within the input sequence, leading to enhanced representations for user sequences. Moreover, considering the constraint on the maximum length $N$ of input sequences that our model can handle, when the length of an input sequence $S = \{i_1, i_2, \ldots, i_{|S|}\}$ exceeds $N$, we truncate the sequence and retain only the last $N$ items as $S_{trunc} = \{i_{|S|-N+1}, \ldots, i_{|S|}\}$.
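A minimal sketch of this input construction, with the maximum length $N$ and the embedding sources as assumed placeholders:

```python
import torch
import torch.nn as nn

MAX_LEN = 20  # N; 20 for the Amazon datasets, 100 for MovieLens-1M (Sec. 4.2.4)

def build_sequence_input(feature_embs, modality_embs, pos_emb: nn.Embedding):
    """Sketch of Sec. 3.3.1: truncate to the last N items, then sum feature,
    modality, and learnable positional embeddings at each position."""
    feature_embs = feature_embs[-MAX_LEN:]       # S_trunc = (i_{|S|-N+1}, ..., i_{|S|})
    modality_embs = modality_embs[-MAX_LEN:]
    positions = torch.arange(feature_embs.size(0))
    return feature_embs + modality_embs + pos_emb(positions)
```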
3.3.2 Multi-modal Sequence Encoder. The sequence encoder serves two primary objectives: firstly, to encode each item based on its multiple modal information, and secondly, to enhance the contextual representation of each item by incorporating the interaction relationships among items within the sequence. As the item encoder has already acquired generalized item representations during the pre-training phase, the sequence encoder leverages the model parameters of the item encoder directly and focuses on learning the interaction relationships within the sequence. The structure of the sequence encoder closely resembles that of the item encoder, employing transformer blocks, which are widely used in various applications. The attention mechanism, known for its robust performance, enables each item in the sequence to attend to contextual information, facilitating the model in inferring representations of each item based on the contextual cues within the sequence.

3.3.3 Masked Item Prediction. The primary objective of sequence recommendation is to infer user preferences and recommend the next potential item based on historical interaction information. To achieve this, the sequence encoder needs to possess the ability to predict items by leveraging contextual content information. In light of the Masked Language Model task [12], we propose a Masked Item Prediction task for our model.

In the Masked Item Prediction task, for a given input sequence $S = \{i_1, i_2, \ldots, i_{|S|}\}$, each item $i$ is replaced with a special token $[mask]$ with probability $p$. The model is then tasked with predicting the original item based on its contextual content information. As our model represents each item with multiple modal representation embeddings as inputs, for a masked item we replace all of its associated modal representation embeddings with the token $[mask]$.

During testing, the objective of predicting the next potential interaction item requires appending the token $[mask]$ at the end of the user's historical interaction sequence. This enables the model to predict the next item in the sequence that the user is likely to interact with.

To ensure consistency in input between training and testing, where the token $[mask]$ only appears at the last position during testing, we adopt a specific strategy during training. The last position of each input sequence sample is always replaced with the token $[mask]$ during training. For the other positions selected for masking, we employ three replacement strategies: (1) replacing the item with the $[mask]$ token 80% of the time, (2) replacing the item with a randomly selected item 10% of the time, and (3) keeping the item unchanged 10% of the time. This approach addresses the inconsistency in input between training and testing, ensuring that the model is trained to effectively predict the masked items in the sequence during testing.
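A sketch of this corruption procedure (the `MASK` placeholder and `item_pool` argument are illustrative names, not identifiers from the released code):

```python
import random

MASK = "[mask]"  # placeholder id for the mask token

def mask_sequence(items, item_pool, p=0.2):
    """Sketch of Masked Item Prediction corruption (Sec. 3.3.3): the last position
    is always masked to match inference; other positions are selected with
    probability p, then replaced by [mask]/random/kept at 80%/10%/10%."""
    corrupted, labels = list(items), [None] * len(items)
    for pos in range(len(items) - 1):
        if random.random() < p:
            labels[pos] = items[pos]
            r = random.random()
            if r < 0.8:
                corrupted[pos] = MASK                      # 80%: mask token
            elif r < 0.9:
                corrupted[pos] = random.choice(item_pool)  # 10%: random item
            # else 10%: keep the original item
    labels[-1] = items[-1]
    corrupted[-1] = MASK                                   # last position always masked
    return corrupted, labels
```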
4 EXPERIMENT

4.1 Pre-training Experiments

4.1.1 Datasets. During the pre-training phase, we employed the WebVid dataset [3] as our training data. This dataset comprises a large-scale collection of video-text pairs, totaling 10 million pairs obtained from stock footage websites. Videos, being a rich source of diverse modal features, are particularly well suited for our self-supervised multi-modal pre-training task. To simplify the training process, we utilized the pre-extracted features¹ obtained from CLIP (ViT-B/32) at a frame rate of 1 FPS.

¹ https://huggingface.co/datasets/iejMac/CLIP-WebVid

In addition, we employed MSR-VTT [49] as our test set to assess the generalization ability of our model. MSR-VTT consists of 10,000 videos, each ranging in length from 10 to 32 seconds, with a total of 200,000 captions. For evaluation purposes, we utilized a set of 1,000 test clips to assess the performance of our model.

4.1.2 Evaluation Metrics. We employed the standard Recall@K retrieval metric to evaluate the performance of our model. Recall@K assesses the proportion of test samples in which the correct result is present among the top K retrieved points for a given query sample. We report the results for Recall@5 and Recall@10 as performance indicators for our model.
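For reference, Recall@K can be computed from a query-candidate similarity matrix as follows (a sketch of the standard protocol, not the authors' evaluation code):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for retrieval: sim[i, j] is the similarity between query i and
    candidate j, with the correct candidate for query i on the diagonal."""
    ranks = (-sim).argsort(axis=1)                 # candidates sorted by descending score
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

# Example: with 1,000 MSR-VTT test clips, sim has shape (1000, 1000) and
# recall_at_k(sim, 5) follows the R@5 protocol reported in Table 1.
```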
4.1.3 Implementation Details. We utilize CLIP (ViT-B/32) as the visual encoder and textual encoder to extract modal features. For MSR-VTT, we evenly partition each video into 10 clips based on its total length, and extract one frame image from each clip for feature extraction. During training, the parameters of these modal feature encoders are kept frozen, and only the item encoder is trained.

The model parameters are optimized using the AdamW optimizer with a learning rate of 5e-5, and exponential decay is applied to the learning rate with a decay rate of 0.9. We set the maximum number of video frames allowed in the model to 10 and the maximum length of text tokens to 77. The batch size is configured to 48,000 for training. The embedding dimension of the model is 512, with 2 layers and 8 heads. We set the embedding dropout and hidden dropout to 0.2 and 0.5, respectively. Additionally, the weight coefficients $\lambda_{v,v'}$, $\lambda_{t,t'}$, $\lambda_{v,t}$, and $\lambda_{vt,vt'}$ are all 0.25, and the temperature coefficient is 0.05. The model is pretrained for 15 epochs using 8 V100 GPUs (32 GB memory), and the training process completes in 5 hours.

4.1.4 Performance. The objective of evaluating the pre-trained model is to assess its efficacy in aligning diverse modalities and its ability to generalize. As the downstream sequential recommendation task involves predicting representations at specific item positions and retrieving items based on these representations, the retrieval performance of the pre-trained model directly impacts its performance on the downstream task. To evaluate the generalization ability of the model, zero-shot testing is conducted directly on the test set. To assess the effectiveness of the alignment between different modalities, the text-to-video retrieval task is chosen for evaluation.

Table 1 presents the results of our model compared to CLIP, EverythingAtOnce, and FrozenInTime on the zero-shot text-to-video retrieval task using MSR-VTT. The table reveals that our model outperforms CLIP in terms of retrieval performance. By integrating the features of multiple modalities through the item encoder, our model exhibits improved representation capability for items compared to using CLIP directly for extracting multi-modality features. Furthermore, when comparing our model with EverythingAtOnce, despite both models utilizing the same backbone, the variance in pre-training datasets also yields significant performance differences on the test set. Hence, training the model on larger pre-training datasets can further augment its generalization ability and enhance its performance.

Table 1: Results of pre-training on the MSR-VTT dataset for zero-shot text-to-video retrieval. "R@k" is short for "Recall@k".

| Method | Pretrain Dataset | Backbone | Backbone Trainable | R@5 | R@10 |
|---|---|---|---|---|---|
| CLIP [37] | - | ViT-B/32 | ✕ | 42.1 | 51.7 |
| EAO [40] | HW100M | CLIP (ViT-B/32) | ✕ | 32.5 | 42.4 |
| FIT [3] | CC+WV | CLIP (ViT-B/32) | ✓ | 46.9 | 57.2 |
| Ours | WV10M | CLIP (ViT-B/32) | ✕ | 47.3 | 59.7 |

4.2 Downstream Task Experiments

4.2.1 Datasets. In order to evaluate the effectiveness of our proposed model, we selected five open-source datasets from real-world platforms, taking into consideration the total number of users, items, and actions, and the dataset sparsity. Among these datasets, four are Amazon platform datasets² [30]: "Beauty," "Sports and Outdoors," "Clothing Shoes and Jewelry," and "Home and Kitchen." Additionally, to evaluate the generalization ability of the model across different platforms, we also selected MovieLens-1M³ [17], which is widely used for evaluating recommendation algorithms.

We followed the approach used in previous works [25, 35, 41, 54] to process these datasets, keeping the five-core datasets and filtering out users and items with fewer than five interactions. Subsequently, we grouped the interactions by user and sorted them in ascending order of timestamp. For the four Amazon datasets, we crawled the image links associated with the items as visual modality information. Items without image links were labeled as having missing visual modality information. Furthermore, we concatenated the titles and descriptions associated with the Amazon items as textual modality information. For MovieLens-1M, we crawled the corresponding movie trailers from YouTube⁴ based on the movie names in the dataset as visual modality information, and concatenated the movie names and tags as textual modality information. The statistical information of the preprocessed datasets is shown in Table 2.

Table 2: Statistics of the downstream task datasets after preprocessing.

| Dataset | #Users | #Items | #Actions | Sparsity |
|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 99.93% |
| Sports | 35,598 | 18,357 | 296,337 | 99.95% |
| Clothing | 39,387 | 23,033 | 278,677 | 99.97% |
| Home | 66,519 | 28,237 | 551,682 | 99.97% |
| ML-1m | 6,040 | 3,416 | 999,611 | 95.16% |

² http://jmcauley.ucsd.edu/data/amazon/
³ https://grouplens.org/datasets/movielens/1m/
⁴ https://www.youtube.com/

Table 3: Performance comparison of different recommendation models. The best and the second-best performances are denoted in bold and underlined fonts, respectively. "Improv." indicates the relative improvement ratios of the proposed approach over the best performance baselines. The features used for item representations by each compared model are listed as ID, feature (F), or both (ID+F).

| Dataset | Metric | Pop | GRU4Rec_ID | SASRec_ID | FDSA_ID+F | S³-Rec_ID+F | DuoRec_ID | UniSRec_F | Ours_F | Improv. |
|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Recall@10 | 0.0189 | 0.0627 | 0.0842 | 0.0874 | 0.0862 | 0.0853 | 0.0743 | 0.0949 | 8.58% |
| | Recall@50 | 0.0555 | 0.1498 | 0.1766 | 0.1838 | 0.1871 | 0.1854 | 0.1885 | 0.2302 | 22.12% |
| | NDCG@10 | 0.0093 | 0.0330 | 0.0416 | 0.0462 | 0.0434 | 0.0449 | 0.0351 | 0.0476 | 3.03% |
| | NDCG@50 | 0.0174 | 0.0520 | 0.0618 | 0.0680 | 0.0653 | 0.0663 | 0.0599 | 0.0754 | 10.88% |
| Sports | Recall@10 | 0.0177 | 0.0351 | 0.0477 | 0.0504 | 0.0523 | 0.0514 | 0.0513 | 0.0635 | 21.41% |
| | Recall@50 | 0.0510 | 0.0928 | 0.1114 | 0.1183 | 0.1210 | 0.1176 | 0.1314 | 0.1607 | 22.30% |
| | NDCG@10 | 0.0097 | 0.0183 | 0.0222 | 0.0276 | 0.0250 | 0.0248 | 0.0252 | 0.0323 | 17.03% |
| | NDCG@50 | 0.0169 | 0.0308 | 0.0360 | 0.0422 | 0.0399 | 0.0391 | 0.0425 | 0.0534 | 25.65% |
| Clothing | Recall@10 | 0.0085 | 0.0158 | 0.0268 | 0.0283 | 0.0370 | 0.0313 | 0.0382 | 0.0452 | 18.32% |
| | Recall@50 | 0.0303 | 0.0477 | 0.0608 | 0.0682 | 0.0858 | 0.0677 | 0.1021 | 0.1269 | 24.29% |
| | NDCG@10 | 0.0045 | 0.0078 | 0.0122 | 0.0156 | 0.0169 | 0.0148 | 0.0191 | 0.0218 | 14.14% |
| | NDCG@50 | 0.0092 | 0.0146 | 0.0195 | 0.0242 | 0.0275 | 0.0227 | 0.0328 | 0.0395 | 20.43% |
| Home | Recall@10 | 0.0136 | 0.0189 | 0.0303 | 0.0277 | 0.0329 | 0.0309 | 0.0265 | 0.0388 | 17.93% |
| | Recall@50 | 0.0436 | 0.0532 | 0.0638 | 0.0680 | 0.0690 | 0.0669 | 0.0725 | 0.0969 | 33.66% |
| | NDCG@10 | 0.0069 | 0.0098 | 0.0152 | 0.0155 | 0.0163 | 0.0156 | 0.0135 | 0.0195 | 19.63% |
| | NDCG@50 | 0.0134 | 0.0171 | 0.0224 | 0.0242 | 0.0241 | 0.0234 | 0.0234 | 0.0321 | 32.64% |
| ML-1m | Recall@10 | 0.0749 | 0.2988 | 0.2993 | 0.3028 | 0.3002 | 0.3013 | 0.1472 | 0.3124 | 3.17% |
| | Recall@50 | 0.2110 | 0.5412 | 0.5457 | 0.5523 | 0.5467 | 0.5421 | 0.4114 | 0.5650 | 2.30% |
| | NDCG@10 | 0.1310 | 0.1731 | 0.1690 | 0.1744 | 0.1694 | 0.1697 | 0.0665 | 0.1835 | 5.22% |
| | NDCG@50 | 0.1522 | 0.2264 | 0.2237 | 0.2296 | 0.2240 | 0.2231 | 0.1242 | 0.2392 | 4.18% |

4.2.2 Evaluation Metrics. To evaluate the performance of the models, we utilize the top-k Recall and top-k Normalized Discounted Cumulative Gain (NDCG) metrics, which are commonly used in related works [23, 41, 47]. Recall measures the presence of the positive item, while NDCG takes into account both the rank position and the presence. In our experiments, we report Recall and NDCG at k = 10 and 50. In addition, we adopt the leave-one-out strategy, which has been widely employed in previous works [35, 44, 54]. Specifically, for each user, we retain the last interaction item as the test data, and the item just before the last as the validation data. The remaining items are used for training. We rank the ground-truth item of each sequence against all other items for evaluation on the test set, and finally calculate the average score across all test users.
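Under this leave-one-out protocol with a single held-out positive, both metrics reduce to simple rank checks; a sketch:

```python
import numpy as np

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@k under leave-one-out: one relevant item, so the ideal DCG is 1 and
    NDCG reduces to 1/log2(rank + 2) when the item ranks inside the top k."""
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0

def evaluate_user(scores: np.ndarray, target: int, k: int = 10):
    """Rank the ground-truth item of a test sequence against all other items."""
    rank = int((scores > scores[target]).sum())    # 0-based rank of the held-out item
    recall = 1.0 if rank < k else 0.0
    return recall, ndcg_at_k(rank, k)
```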
4.2.3 Baselines. We compare the proposed approach with the following baseline methods:

• Pop is a non-personalized approach that recommends the same items to all users. These items are determined by popularity, measured as the highest number of interactions across the entire item set.
• GRU4Rec [21] proposes an approach for session-based recommendations using recurrent neural networks with Gated Recurrent Units.
• SASRec [25] proposes a self-attention based sequential model which uses the multi-head attention mechanism to recommend the next item.
• FDSA [52] integrates various heterogeneous features of items into feature sequences with different weights through a vanilla attention mechanism.
• S³-Rec [54] is a self-supervised learning approach for sequential recommendation that utilizes mutual information maximization (MIM) to learn correlations among attribute, item, subsequence, and sequence.
• DuoRec⁵ [35] addresses the representation degeneration problem in sequential recommendation. It uses contrastive regularization to reshape the distribution of sequence representations and improve the item embedding distribution.
• UniSRec⁶ [23] utilizes the associated description text of items to learn transferable representations across different recommendation scenarios.

⁵ https://github.com/RuihongQiu/DuoRec
⁶ https://github.com/RUCAIBox/UniSRec

4.2.4 Implementation Details. For DuoRec and UniSRec, we utilized the source code provided by their respective authors. For the other methods, we implemented them using RecBole [53], a widely used open-source recommendation library. All hyper-parameters were set based on the recommendations from the original papers. Additionally, fine-tuning of all baseline models was conducted on the five downstream recommendation datasets.

For our proposed model, we use the AdamW optimizer with a learning rate of 1e-3 and configure the batch size to 8192. The mask item ratio of the model is 0.2. For the four Amazon datasets, we set the maximum sequence length to 20, the embedding dropout to 0.2, and the hidden dropout to 0.5. For MovieLens-1M, we set the maximum sequence length to 100, the embedding dropout to 0.2, and the hidden dropout to 0.2. Furthermore, we adopted early stopping with a patience of 10 epochs to prevent overfitting, using Recall@10 as the indicator.


4.2.5 Performance. We conducted a comparative analysis of the proposed approach against several baseline methods on five publicly available datasets, and the results of our experiments are presented in Table 3.

Based on the experimental results, we observed that models which incorporate both modal features and ID embeddings, such as FDSA and S³-Rec, tend to outperform models that solely rely on ID embeddings, such as SASRec and DuoRec. This can be attributed to the fact that these models not only leverage trainable ID embeddings but also utilize the modal features of items as supplementary information, which provides additional cues during the inference process. Notably, UniSRec achieves competitive performance with these modal-assisted models on the four Amazon datasets by exclusively utilizing text features as item representations without employing ID embeddings, largely due to its pretraining on five Amazon datasets⁷. However, as evident from the table, UniSRec exhibits inferior performance on MovieLens-1M compared to other methods. This can be attributed to the fact that UniSRec was only pretrained on a relatively small-scale dataset, resulting in limited generalization capability and inadequate transfer of generic knowledge to cross-platform datasets.

Lastly, our proposed method was systematically compared against all baseline models, and the results clearly indicate that our model achieves superior performance on all datasets. Notably, our model exhibits significant improvements on both sparse datasets (e.g., the Amazon review datasets) and dense datasets (e.g., MovieLens-1M), outperforming all other baseline models by a substantial margin on sparse datasets in particular. Unlike these baselines, our model does not rely on item embedding-based approaches for sequential recommendation. Instead, it adopts a retrieval-based approach facilitated by the self-supervised multi-modal pre-training task, enabling it to learn universal item representations and enhance its generalization capability. In the downstream sequential recommendation task, our model jointly encodes sequences and items, and predicts tokens at masked positions in the sequences using the Masked Item Prediction task for item retrieval. The final results conclusively demonstrate the superior recommendation performance of our proposed method on datasets from diverse domains and varying levels of sparsity.

⁷ Amazon Review Dataset: "Grocery and Gourmet Food", "Home and Kitchen", "CDs and Vinyl", "Kindle Store" and "Movies and TV"

4.3 Further Analysis

4.3.1 Sparsity Influence. The experimental results in Table 3 suggest that the effectiveness of our model is closely related to the sparsity of the datasets. Sparsity, defined as the proportion of missing values relative to the total values in a dataset, is a critical measure in recommendation systems. Higher sparsity indicates a larger number of missing values in the dataset, while lower sparsity indicates fewer. In the context of recommendation systems, the user-item rating matrix is typically sparse, as users only rate a small subset of all items. However, existing methods [25, 35, 41, 54] have commonly adopted a direct approach of retaining the five-core datasets and filtering out users and items with fewer than five interactions, without thoroughly investigating the potential impact of different k-core filtering strategies on model performance. To address this research gap, we conducted experiments on the Sports and Home datasets using various k-core filtering strategies and evaluated the model's performance under each. The statistical information of the Sports and Home datasets under different k-core filtering strategies is presented in Table 4.

Table 4: Statistical information of the Sports and Home datasets under different k-core strategies.

| Dataset | k-core | #Users | #Items | #Actions | Sparsity |
|---|---|---|---|---|---|
| Sports | 4 | 74,224 | 33,980 | 499,812 | 99.98% |
| | 5 | 35,598 | 18,357 | 296,337 | 99.95% |
| | 6 | 17,281 | 9,758 | 169,846 | 99.90% |
| | 7 | 8,389 | 5,135 | 93,449 | 99.78% |
| Home | 4 | 128,156 | 46,325 | 855,824 | 99.99% |
| | 5 | 66,519 | 28,237 | 551,682 | 99.97% |
| | 6 | 35,452 | 17,382 | 351,703 | 99.94% |
| | 7 | 18,748 | 10,606 | 217,497 | 99.89% |

The table reveals that as the k-core value decreases, the sparsity of these datasets increases. Specifically, when the k-core value decreases from 7 to 4, the sparsity of the Sports dataset increases from 99.78% to 99.98%, and the sparsity of the Home dataset increases from 99.89% to 99.99%.
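A sketch of the k-core filtering used here (the column names are illustrative; the filter is applied repeatedly because removing items can push users below the threshold, and vice versa):

```python
import pandas as pd

def k_core_filter(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions
    until the interaction table is stable."""
    while True:
        before = len(df)
        df = df[df.groupby("user_id")["item_id"].transform("size") >= k]
        df = df[df.groupby("item_id")["user_id"].transform("size") >= k]
        if len(df) == before:
            return df

def sparsity(df: pd.DataFrame) -> float:
    """Dataset sparsity as reported in Tables 2 and 4: the fraction of the
    user-item matrix with no observed interaction."""
    return 1.0 - len(df) / (df["user_id"].nunique() * df["item_id"].nunique())
```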
Figure 2: Model performance across different dataset sparsity levels.

The experimental results of our model on datasets with different levels of sparsity are presented in Figure 2. We compared our model with a baseline model, SASRec, in these experiments. As shown in the figure, the evaluation results of both our model and SASRec decrease significantly as the k-core value decreases.

This is due to the fact that as the sparsity of the dataset increases, there are a large number of users and items with only a few interactions, which is the common cold start problem in the recommendation field. However, the performance degradation of our model is noticeably smaller than that of SASRec as the k-core value decreases. The average improvement of our model over SASRec on the two datasets in terms of the Recall@10 metric increased from 12.45% at k-core=7 to 37.78% at k-core=4. Similarly, the average improvement in terms of the Recall@50 metric increased from 26.11% at k-core=7 to 62.08% at k-core=4. Therefore, the experimental results further confirm that our model is capable of effectively addressing the cold start problem in the recommendation field.

4.3.2 Ablation Study. In this section, we conduct an in-depth analysis of the impact of each proposed technique and component on the performance of our system. To this end, we compare our proposed model with several variants, including: (1) w/o V: without the visual modality. (2) w/o T: without the text modality. (3) w/o P: without pre-training. (4) w/o P+V: without pre-training and the visual modality. (5) w/o P+T: without pre-training and the text modality.

The results of the abovementioned experiments are presented in Figure 3. Based on the experimental results, it can be inferred that our model architecture effectively integrates features from multiple modalities, and the incorporation of both the visual and text modalities leads to superior performance compared to using them individually. Furthermore, the comparison between pre-trained and non-pre-trained models reveals that regardless of whether the system has access to both visual and textual modalities or only one modality, pre-training consistently enhances the performance of the model in downstream recommendation tasks. This further corroborates the notion that our pre-training task adeptly learns representation alignment between diverse modalities and promotes mutual reinforcement among them.

Figure 3: Ablation study of our model on Sports and MovieLens-1M.

4.3.3 Backbone. We additionally employed other text backbones to evaluate our model. Specifically, we utilized BERT as our text encoder instead of CLIP. We compared the performance of our model with UniSRec_t on the Sports dataset, and the results are presented in Table 5. Since we did not use any pretraining data to extract features with BERT, the results of our model are reported without pretraining.

Table 5: Evaluation of different text backbones on Sports. The subscript t indicates the text modality only. "R@k" is short for "Recall@k", and "N@k" is short for "NDCG@k".

| Method | Backbone | R@10 | R@50 | N@10 | N@50 |
|---|---|---|---|---|---|
| UniSRec_t | BERT (base) | 0.0513 | 0.1314 | 0.0252 | 0.0425 |
| Ours_t | BERT (base) | 0.0588 | 0.1501 | 0.0299 | 0.0497 |
| Ours_t | CLIP (ViT-B/32) | 0.0603 | 0.1524 | 0.0309 | 0.0501 |

From the table, it can be observed that when using BERT as the backbone, our model outperforms UniSRec_t on all four evaluation metrics, with an average improvement of 16.11%. This demonstrates the effectiveness of our proposed model architecture. Furthermore, by replacing BERT with CLIP, our model shows an average improvement of 2.06%. This further confirms that utilizing a more powerful feature extractor can lead to additional performance gains in our model.

5 CONCLUSIONS

In this paper, we propose a self-supervised multi-modal sequential recommendation method. In contrast to conventional sequential recommendation methods that rely on explicit item IDs, our approach leverages the feature information associated with items to represent them. To address the issue of inconsistent feature levels between the model output and item embeddings, we introduce a dual-tower retrieval architecture for sequential recommendation. In this architecture, the predicted embedding from the user encoder is used to retrieve the generated embedding from the item encoder, thereby mitigating the problem of inconsistent feature levels. Additionally, we share parameters between the two encoders to facilitate mutual reinforcement of information. To further enhance the retrieval performance of the model, we also propose a self-supervised multi-modal pre-training approach. By constructing contrastive learning tasks among different feature combinations of items, our model is capable of improving its fine-grained discrimination ability for items and learning the alignment of different modality features in the latent space. Extensive experiments conducted on five publicly available datasets demonstrate the effectiveness of our proposed model.

REFERENCES

[1] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34 (2021), 24206–24221.
[2] Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems 33 (2020), 25–37.
[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728–1738.
[4] Jiangxia Cao, Xin Cong, Jiawei Sheng, Tingwen Liu, and Bin Wang. 2022. Contrastive cross-domain sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 138–147.
[5] Yukuo Cen, Jianwei Zhang, Xu Zou, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. Controllable multi-interest framework for recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2942–2951.
[6] Jianxin Chang, Chen Gao, Yu Zheng, Yiqun Hui, Yanan Niu, Yang Song, Depeng Jin, and Yong Li. 2021. Sequential recommendation with graph neural networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 378–387.
[7] Hao Chen, Zefan Wang, Feiran Huang, Xiao Huang, Yue Xu, Yishi Lin, Peng He, and Zhoujun Li. 2022. Generative adversarial framework for cold-start item recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2565–2571.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. PMLR, 1597–1607.
[9] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[11] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. 2021. Zero-shot recommender systems. arXiv preprint arXiv:2105.08318 (2021).
[14] Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential user-based recurrent neural network recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 152–160.
[15] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021).
[16] Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2022. AudioCLIP: Extending CLIP to image, text and audio. In ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 976–980.
[17] F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
[18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9729–9738.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[20] Ruining He and Julian McAuley. 2016. Fusing similarity models with Markov chains for sparse sequential recommendation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 191–200.
[21] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[22] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[23] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
[24] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. Technologies 9, 1 (2020), 2.
[25] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[26] Bruno Korbar, Du Tran, and Lorenzo Torresani. 2018. Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems 31 (2018).
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[28] Beibei Li, Beihong Jin, Jiageng Song, Yisong Yu, Yiyuan Zheng, and Wei Zhuo. 2022. Improving micro-video recommendation via contrastive multiple interests. arXiv preprint arXiv:2205.09593 (2022).
[29] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293–304.
[30] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
[31] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2630–2640.
[32] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. 2021. Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12475–12486.
[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[34] Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3421–3430.
[35] Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2022. Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 813–823.
[36] Ruihong Qiu, Jingjing Li, Zi Huang, and Hongzhi Yin. 2019. Rethinking the item order in session-based recommendation with graph neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 579–588.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[38] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
[39] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web. 811–820.
[40] Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S Feris, David Harwath, James Glass, and Hilde Kuehne. 2022. Everything at once: Multi-modal fusion transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20020–20029.
[41] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
[42] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[44] Jie Wang, Fajie Yuan, Mingyue Cheng, Joemon M Jose, Chenyun Yu, Beibei Kong, Zhijin Wang, Bo Hu, and Zang Li. 2022. TransRec: Learning transferable recommendation from mixture-of-modality feedback. arXiv preprint arXiv:2206.06190 (2022).
[22] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural (2022).
computation 9, 8 (1997), 1735–1780. [45] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. 2022.
[23] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wav2clip: Learning robust audio representations from clip. In ICASSP 2022-2022
Wen. 2022. Towards Universal Sequence Representation Learning for Recom- IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
mender Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge IEEE, 4563–4567.
Discovery and Data Mining. 585–593. [46] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019.
[24] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Baner- Session-based recommendation with graph neural networks. In Proceedings of
jee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. the AAAI conference on artificial intelligence, Vol. 33. 346–353.
[47] Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 1259–1273.
[48] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).
[49] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5288–5296.
[50] Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. 2021. Taco: Token-aware cascade contrastive learning for video-text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11562–11572.
[51] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-efficient transfer from sequential behaviors for user modeling and recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1469–1478.
[52] Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. In IJCAI. 4320–4326.
[53] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, et al. 2021. Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4653–4664.
[54] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.