Self-Supervised Multi-Modal Sequential Recommendation
Despite the potential of sequential recommendation methods, the prevalent approach is to represent items explicitly by their unique identifiers or IDs. While this approach has demonstrated efficacy in certain scenarios, it is not without limitations. Specifically, this approach is typically constrained to generating item recommendations within the same platform, as items are not readily transferable across domains, hindering cross-domain generalization of the model. Furthermore, this approach is unable to perform cold-start recommendations for items with limited interaction history on the platform, which can be considered a few-shot learning problem. Consequently, numerous methods have been proposed to address this issue. For instance, some methods mine precise cross-domain user preferences based on intra-sequence and inter-sequence item interactions [4, 51]. Other methods construct mappings that project the cold item contents to the warm item embedding space [7, 34]. Although these methods have made progress, they have not fully resolved the fundamental issue caused by explicitly modeling item IDs.

With the rapid advancement of pre-trained models, such as BERT [12] and CLIP [37], recent methods for sequential recommendation have addressed the limitations of traditional item ID indexing by leveraging item-associated modalities to enable representation transfer across diverse datasets. Pre-training on datasets with modalities facilitates efficient model transfer to other datasets with similar modalities, leading to improved performance on those datasets or even achieving zero-shot recommendation. However, existing approaches [13, 23, 44] that employ item-associated modalities for cross-domain transfer still encounter a significant challenge: the issue of inconsistent feature levels between model outputs and item embeddings. In most sequential recommendation methods, user interaction sequences are encoded using a sequence encoder to obtain sequence-level representations, which are then directly compared with item embeddings using cosine similarity. However, the model's output represents high-level features with respect to the sequence context, while item embeddings represent low-level features for individual items, resulting in a substantial gap in feature levels. Computing similarity directly between them forcibly aligns the model's input and output, which prevents the model from generating effective representations.

To address this issue, we propose a sequential recommendation method based on a dual-tower retrieval architecture. Our approach utilizes a sequence encoder for capturing user interaction sequences and an item encoder for encoding item information, ensuring that both are represented as high-level features. Considering that user interaction sequences are fundamentally composed of items, we also share parameters between the two encoders to enable mutual reinforcement of information. To further enhance the retrieval capability of our model, we propose a self-supervised multi-modal pre-training approach. By constructing contrastive learning tasks between each modality input and its own augmented representation, as well as between representations of different modalities, our model is able to strengthen its fine-grained discrimination ability for items and learn alignment of different modality features in the latent space. This enables the pre-trained model to generalize well on downstream datasets with diverse item features.

The main contributions of this paper are summarized as follows:

• We propose a sequential recommendation method based on a dual-tower retrieval architecture. Our approach employs a sequence encoder to capture user interaction sequences and an item encoder to encode item information, addressing the issue of inconsistent feature levels between model outputs and item embeddings.
• We introduce a self-supervised multi-modal pre-training approach to enhance the retrieval capability of our model. By constructing contrastive learning tasks between various modality combinations, our model strengthens its fine-grained discrimination ability for items and learns alignment of different modality features in the latent space.
• We conduct extensive experiments on five public datasets to demonstrate the effectiveness of our proposed approach and architecture.

2 RELATED WORK

2.1 Sequential Recommendation

Sequential recommendation has been widely investigated in recent years, and various models have been proposed to address this task. Markov Chain-based techniques have been employed to model sequential dependencies [20, 39] in recommendation tasks. However, these approaches have limited capacity in capturing long-term dependencies in sequences. To overcome this limitation, recurrent neural networks (RNNs) [10], including variants such as Long Short-Term Memory (LSTM) [22] and Gated Recurrent Unit (GRU) [14, 21], have been introduced for sequential recommendation. These models have demonstrated strong performance in capturing both short-term and long-term dependencies in sequences.

Moreover, due to the powerful ability of attention mechanisms, many works have incorporated attention mechanisms into RNN-based models [25, 41] to enhance the model's capability to attend to relevant parts of the sequence. In addition to modeling based on explicit item IDs, several works explore other ways to improve recommendation performance. For instance, FDSA [52] and S3-Rec [54] have introduced rich attribute information of items, while CL4SRec [47] and DuoRec [35] have introduced self-supervised signals. However, these approaches face a common issue: they are usually limited to specific data domains or platforms due to the existence of explicit item IDs.

Recently, some studies have attempted to overcome the limitation of explicit item IDs for item cold-start and cross-domain transfer. ZESRec [13] no longer relies on item indexes but instead uses natural language descriptions as item representations. UniSRec [23] has introduced two contrastive learning tasks in the pre-training stage, enabling the model to learn universal item and sequential representations. However, these methods still face a significant limitation in that the feature levels of model outputs and item embeddings are inconsistent: the model's output represents high-level features while item embeddings represent low-level features, resulting in a substantial gap that prevents effective representation generation when similarity is computed directly between them.
2.2 Self-supervised Multi-modal Learning

Deep neural networks have proven to be highly effective in many applications, but their performance can be limited by the availability of labeled data. While training on large-scale labeled datasets [27, 30] can improve performance, the use of deep networks can be constrained when data is scarce or obtaining annotations is challenging. To overcome these issues, self-supervised learning (SSL) has emerged as a promising alternative. Unlike traditional supervised learning, SSL [24] enables models to learn from unlabeled data, which is often more abundant than labeled data. SSL works by training models on learning objectives derived from the training samples themselves, without the need for external annotations.

One popular approach in self-supervised learning is contrastive learning [8, 18], where the model is trained to distinguish between a positive and a set of negative examples. This has been successfully applied in cross-modal tasks such as visual-text matching [31, 37, 48], audio-visual alignment [26, 32], and multi-modal combining [1, 2]. These approaches have achieved state-of-the-art results in various benchmarks, demonstrating the effectiveness of self-supervised learning and contrastive learning in solving real-world problems.

Video is an excellent example of a multimodal learning source that naturally combines multiple modalities and allows for learning from large-scale data that may not be feasible to manually annotate. Recently, there have been many efforts [29, 48, 50] in cross-modal self-supervised learning on video. AudioCLIP [16] uses pre-trained models for feature extraction from three modalities: vision, audio, and text. It employs contrastive learning between each pair of modalities to align their features. FrozenInTime [3] uses a dual encoding model to separately encode video and text, with its space-time transformer encoder capable of flexibly encoding images or videos by treating an image as a single-frame video. Wav2CLIP [45] utilizes CLIP [37] as the encoder for text and images, and constructs a contrastive learning task between the frozen image encoder and an audio encoder to align the hidden representations of the three pre-trained modal encoders. EverythingAtOnce [40] aligns different modal combinations with each other by constructing a composite contrastive learning loss between multiple modalities.

3 METHOD

In this section, we propose a self-supervised multi-modal sequential recommendation model. By leveraging self-supervised learning on multi-modal datasets, our approach effectively enhances the retrieval performance of the model, thereby improving its performance on downstream recommendation tasks. In the following, for convenience, we consider the use of two modalities: vision and text.

3.1 Problem Statement

3.1.1 Input Formulation. Assuming that the set of all items is denoted as I, each item i ∈ I is represented not by a simple item ID, but by a tuple (v_i, t_i) which encompasses both visual and textual information associated with the item. The visual information associated with an item is represented as v_i = (v_i^1, v_i^2, ..., v_i^m), where v_i^j corresponds to an image, and the length of the visual sequence is denoted as |v_i| = m. Notably, there are two types of input information for the visual modality: videos and images. In the case of videos, the visual sequence is obtained by extracting frames from the video at regular intervals, with each v_i^j representing a frame from the video. In the case of images, the visual sequence length is |v_i| = 1, as each item is represented by a single image. The textual information associated with an item is denoted as t_i = (t_i^1, t_i^2, ..., t_i^n), where t_i^j represents a word token, and the length of the textual sequence is denoted as |t_i| = n.

3.1.2 Sequence Recommendation. The objective of the sequence recommendation task is to infer a user's preferences and provide recommendations for the next item based on historical interaction data. Let us denote the user's interaction sequence, sorted chronologically, as S = (i_1, ..., i_t, ..., i_|S|), where i_t ∈ I represents the item that the user interacted with at time step t, and |S| denotes the length of the user's interaction sequence. Accordingly, the task objective can be formulated as predicting the probability distribution over the entire item set I for the potential interaction of the user at time step |S| + 1, which can be formulated as:

p(i_{|S|+1} = i | S)    (1)
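As a concrete illustration of Eq. (1) under the dual-tower retrieval setup sketched in the introduction, the snippet below scores the next item by comparing a sequence-level representation against the embeddings of all candidate items with cosine similarity and normalizes the scores with a softmax. This is a minimal sketch; the tensor names and the softmax normalization are our illustrative assumptions, not an exact reproduction of the implementation.

```python
import torch
import torch.nn.functional as F

def next_item_distribution(seq_repr: torch.Tensor,
                           item_embs: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Approximate p(i_{|S|+1} = i | S) for every candidate item.

    seq_repr:  (d,)     sequence-level representation of the user history S
    item_embs: (|I|, d) embeddings of all items in the item set I
    """
    seq_repr = F.normalize(seq_repr, dim=-1)        # unit norm for cosine similarity
    item_embs = F.normalize(item_embs, dim=-1)
    scores = item_embs @ seq_repr                   # cosine similarity per item
    return F.softmax(scores / temperature, dim=-1)  # distribution over the item set
```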
3.2 Self-supervised Multi-modal Learning

3.2.1 Input Representation. In previous sequential recommendation approaches [21, 25, 41], explicit item IDs were used as trainable representations of items, which restricts the model's ability to transfer general knowledge across different domains. To overcome this limitation and enable effective transfer of pre-trained models to new recommendation scenarios, we propose to use modal information associated with items as a bridge that connects multiple domains. By extracting and mapping the modal features of items from diverse domains to a common semantic space using pre-trained modality models [12, 19, 37], we overcome the reliance on explicit item IDs. As most recommendation scenarios involve items with both visual and textual modalities, we choose these two modalities as the input feature information for items.

For an item with visual modality, we process it as two types of input data: images and videos. For image data, we extract features directly using a pre-trained visual model (PVM) [37] to obtain the image representation. For video data, we first sample frames from the video at regular intervals to obtain a collection of video frames, denoted as v_i = (v_i^1, v_i^2, ..., v_i^m). Next, we use the PVM to extract features from all video frames, and then take the average of all frame features to obtain the representation of the video. The obtained representation then passes through a single-layer neural network, producing the visual feature embedding:

E_i^v = f_NN( (1/m) Σ_{j=1}^{m} f_PVM(v_i^j) )    (2)

where E_i^v represents the visual feature embedding for item i, f_NN denotes the neural network function, and f_PVM represents the feature extraction function of the pre-trained visual model.
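To make Eq. (2) concrete, the following sketch averages frozen frame features from a pre-trained visual model and maps them through a single linear layer; images are simply the m = 1 case. The module and argument names are illustrative assumptions (any image encoder such as CLIP's could play the role of the PVM), not the released implementation.

```python
import torch
import torch.nn as nn

class VisualFeatureEmbedding(nn.Module):
    """Eq. (2): E_i^v = f_NN(mean_j f_PVM(v_i^j)) for a batch of items."""

    def __init__(self, pvm: nn.Module, pvm_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.pvm = pvm.eval()                        # frozen pre-trained visual model
        for p in self.pvm.parameters():
            p.requires_grad = False
        self.f_nn = nn.Linear(pvm_dim, hidden_dim)   # single-layer neural network

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, m, 3, H, W); image items are the special case m = 1
        b, m = frames.shape[:2]
        with torch.no_grad():
            feats = self.pvm(frames.flatten(0, 1))   # (batch * m, pvm_dim)
        feats = feats.view(b, m, -1).mean(dim=1)     # average over the m frames
        return self.f_nn(feats)                      # (batch, hidden_dim)
```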
Figure 1: The framework of the self-supervised multi-modal sequential recommendation method. The left part illustrates the self-supervised multi-modal pre-training method, while the right part shows the dual-tower retrieval architecture for sequential recommendation.

For an item with textual modality, we utilize a pre-trained language model (PLM) [37] to extract features from the associated text. Given the text token sequence t_i = (t_i^1, t_i^2, ..., t_i^n) of item i, we first prepend a special token [CLS] to the token sequence, and then input the concatenated sequence into the PLM. Finally, we select the token corresponding to the [CLS] position in the hidden layer as the representation of the text. The obtained representation then passes through a single-layer neural network, producing the text feature embedding:

E_i^t = f_NN( f_PLM([CLS], t_i^1, t_i^2, ..., t_i^n) )    (3)

where E_i^t represents the text feature embedding for item i, and f_PLM represents the feature extraction function of the pre-trained language model.
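A minimal sketch of Eq. (3), assuming a BERT-style encoder in which the first hidden state corresponds to the prepended [CLS] token; the tokenizer, model choice, and names here are illustrative stand-ins rather than the specific pre-trained language model used in our experiments.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextFeatureEmbedding(nn.Module):
    """Eq. (3): E_i^t = f_NN(f_PLM([CLS], t_i^1, ..., t_i^n))."""

    def __init__(self, plm_name: str = "bert-base-uncased", hidden_dim: int = 512):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(plm_name)
        self.plm = AutoModel.from_pretrained(plm_name).eval()  # frozen PLM
        for p in self.plm.parameters():
            p.requires_grad = False
        self.f_nn = nn.Linear(self.plm.config.hidden_size, hidden_dim)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        with torch.no_grad():
            hidden = self.plm(**batch).last_hidden_state  # (batch, seq, d)
        cls_repr = hidden[:, 0]        # hidden state at the [CLS] position
        return self.f_nn(cls_repr)     # (batch, hidden_dim)
```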
Hence, following the feature extraction process using pre-trained modality models, an input item i ∈ I is converted into a feature embedding tuple (E_i^v, E_i^t). As the item encompasses two distinct modalities, we employ the sum of the feature embedding and a modality embedding as the input to the model to enhance discriminability.

3.2.2 Multi-modal Item Encoder. In various recommendation scenarios, items may be associated with different modalities, such as visual and textual modalities. These modalities can take the form of single modalities, denoted as v or t, or a combination of both modalities, represented as a tuple (v, t). Our objective is to develop a method that enables the model to be modality-agnostic, allowing it to map input from any modality combination of an item to a common hidden space and align them using a mapping function f_item. Considering that an item may have multiple modal inputs, the model needs to encode bidirectionally, enabling it to jointly embed them into a single representation.

Therefore, we propose a multi-modal item encoder based on the foundational structure of a transformer block [43]. The transformer block comprises a multi-head self-attention layer and a feed-forward network. Prior to inputting the feature embeddings of an item into the model, we add modality embeddings to them. In cases where certain modalities are absent, we replace the corresponding feature embeddings with a [mask] token. Leveraging the powerful attention mechanism of the transformer block, different modalities can mutually attend to each other, facilitating information sharing and completion of missing modality information. Subsequently, we employ modality-specific projection layers to project the feature embeddings into a shared hidden space, followed by normalization. Finally, a max-pooling layer is utilized to extract the salient feature information from all modalities, resulting in a unified representation of the item:

h_i = MaxPooling(h_i^v, h_i^t)    (4)
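The sketch below illustrates one way to assemble the item encoder described above: modality embeddings are added to the feature embeddings, a learned [mask] vector stands in for absent modalities, a transformer block lets the modalities attend to each other, and modality-specific projections followed by normalization and max-pooling yield the unified item representation of Eq. (4). Layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalItemEncoder(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.modality_emb = nn.Embedding(2, dim)          # 0: visual, 1: textual
        self.mask_token = nn.Parameter(torch.zeros(dim))  # stands in for a missing modality
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_v = nn.Linear(dim, dim)                 # modality-specific projections
        self.proj_t = nn.Linear(dim, dim)

    def forward(self, e_v, e_t, has_v, has_t):
        # e_v, e_t: (batch, dim) feature embeddings; has_v, has_t: (batch,) bool masks
        e_v = torch.where(has_v.unsqueeze(-1), e_v, self.mask_token)
        e_t = torch.where(has_t.unsqueeze(-1), e_t, self.mask_token)
        x = torch.stack([e_v, e_t], dim=1)                # (batch, 2, dim)
        x = x + self.modality_emb.weight                  # add modality embeddings
        x = self.blocks(x)                                # modalities attend to each other
        h_v = F.normalize(self.proj_v(x[:, 0]), dim=-1)
        h_t = F.normalize(self.proj_t(x[:, 1]), dim=-1)
        return torch.max(h_v, h_t)                        # Eq. (4): element-wise max-pooling
```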
3.2.3 Contrastive Loss. As our objective is to develop a modality-agnostic model that can effectively handle diverse combinations of input modalities and align them in a shared hidden space, we draw inspiration from recent advances in contrastive learning [37, 40]. Contrastive learning has shown remarkable capabilities in representation alignment and generalization, where similar input contents are encouraged to be close while inputs with distinct semantics are pushed apart. To leverage these benefits, we adopt a retrieval pre-training task based on contrastive learning. This approach enhances the semantic alignment capability among multiple modalities, addressing the challenge of inconsistent modalities associated with items in different domains. By employing contrastive learning, we enable items with similar semantics but varying modalities from distinct domains to align with each other, promoting robust and effective representation learning.

Taking the text and visual modalities as an example, the model should possess the capability to align within a single modality, such as (v, v') and (t, t'), to align across modalities, such as (v, t), and to align modality combinations, such as (vt, vt'). Here, v', t', and vt' represent the augmented data obtained using an unsupervised data augmentation method [15]. The contrastive loss function is designed as follows:

L = λ_{v,v'} ℓ(v, v') + λ_{t,t'} ℓ(t, t') + λ_{v,t} ℓ(v, t) + λ_{vt,vt'} ℓ(vt, vt')    (5)
where λ_{x,y} represents the weight coefficient between modalities x and y, and ℓ(x, y) represents the contrastive loss using Noise Contrastive Estimation [33]:

ℓ(x, y) = E_{i ∈ B} [ −log( exp(sim(x_i, y_i)/τ) / Σ_{j=1}^{B} exp(sim(x_i, y_j)/τ) ) ]
        + E_{j ∈ B} [ −log( exp(sim(y_j, x_j)/τ) / Σ_{i=1}^{B} exp(sim(y_j, x_i)/τ) ) ]    (6)

where B represents the batch size and τ is a temperature coefficient. Through the utilization of the contrastive loss, our model achieves the capability to represent items in a universal manner, overcoming the limitations imposed by the size of the pre-training dataset in the recommendation domain. As a result, our model exhibits robust generalization capabilities, being able to effectively encode inputs of diverse modality combinations and align semantically similar items in the shared hidden space.
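The symmetric NCE loss in Eq. (6) and its weighted combination in Eq. (5) can be written compactly as below. This is a minimal sketch over in-batch negatives with cosine similarity; the default weights and temperature correspond to the values reported later in Section 4.1.3, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def nce_loss(x: torch.Tensor, y: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Symmetric contrastive loss of Eq. (6) over a batch of paired representations.

    x, y: (B, d) representations of two views/modalities of the same B items.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / tau                       # (B, B) pairwise similarities
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def pretraining_loss(h_v, h_v_aug, h_t, h_t_aug, h_vt, h_vt_aug,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of Eq. (5): intra-modal, cross-modal, and combined terms."""
    w_vv, w_tt, w_vt, w_vtvt = weights
    return (w_vv * nce_loss(h_v, h_v_aug)
            + w_tt * nce_loss(h_t, h_t_aug)
            + w_vt * nce_loss(h_v, h_t)
            + w_vtvt * nce_loss(h_vt, h_vt_aug))
```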
3.3 Sequential Representation Learning

Due to the sensitive nature of user interactions in the recommendation domain and strict privacy regulations, there is a lack of publicly available large-scale pre-training datasets. Previous methods for sequence-based recommendation [13, 23, 44] have relied on limited datasets for pre-training, leading to inadequate generalization capabilities and diminished performance when transferring across domains. Recognizing the intricate nature of user behavior patterns in diverse recommendation domains, our objective is to avoid introducing domain-specific biases into our model. Therefore, we learn domain-specific sequence interaction patterns for each domain.

3.3.1 Input Representation. In the pre-training retrieval task, as discussed in Section 3.2, the model operates on individual items without considering their relative positional relationships. However, in the downstream sequential recommendation task, the model takes a user's historical interaction sequence as input, where items are arranged chronologically based on interaction time and thus have explicit positional relationships. To capture this temporal order, we introduce positional embeddings as an additional component of the model input. Specifically, for a given item i, its input embedding is obtained by summing the feature embedding, modality embedding, and positional embedding. In this work, in order to improve the performance of our model, we utilize learnable positional embedding matrices instead of fixed sinusoid embeddings. The positional embedding matrices enable our model to capture contextual relationships and the interaction order of each item within the input sequence, leading to enhanced representations for user sequences. Moreover, considering the constraint on the maximum length N of input sequences that our model can handle, when the length of an input sequence S = (i_1, i_2, ..., i_|S|) exceeds N, we truncate the sequence and retain only the last N items as S_trunc = (i_{|S|−N+1}, ..., i_|S|).
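A small sketch of the input construction in Section 3.3.1: the sequence is truncated to its last N items and each position receives the sum of its item feature embedding, modality embedding, and a learnable positional embedding. The class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class SequenceInput(nn.Module):
    def __init__(self, max_len: int = 50, dim: int = 512, n_modalities: int = 2):
        super().__init__()
        self.max_len = max_len
        self.pos_emb = nn.Embedding(max_len, dim)       # learnable positional embeddings
        self.mod_emb = nn.Embedding(n_modalities, dim)  # modality embeddings

    def forward(self, feat_emb: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # feat_emb: (batch, L, dim) item feature embeddings; modality_ids: (batch, L)
        feat_emb = feat_emb[:, -self.max_len:]          # keep only the last N items
        modality_ids = modality_ids[:, -self.max_len:]
        positions = torch.arange(feat_emb.size(1), device=feat_emb.device)
        return feat_emb + self.mod_emb(modality_ids) + self.pos_emb(positions)
```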
3.3.2 Multi-modal Sequence Encoder. The sequence encoder serves two primary objectives: firstly, to encode each item based on its multiple modal information, and secondly, to enhance the contextual representation of each item by incorporating the interaction relationships among items within the sequence. As the item encoder has already acquired generalized item representations during the pre-training phase, the sequence encoder leverages the model parameters of the item encoder directly and focuses on learning the interaction relationships within the sequence. The structure of the sequence encoder closely resembles that of the item encoder, employing transformer blocks, which are widely used in various applications. The attention mechanism, known for its robust performance, enables each item in the sequence to attend to contextual information, facilitating the model in inferring representations of each item based on the contextual cues within the sequence.

3.3.3 Masked Item Prediction. The primary objective of sequence recommendation is to infer user preferences and provide recommendations for the next potential item based on historical interaction information. To achieve this, the sequence encoder needs to possess the ability to predict items by leveraging contextual content information. In light of the Masked Language Model task [12], we propose a Masked Item Prediction task for our model.

In the Masked Item Prediction task, for a given input sequence S = (i_1, i_2, ..., i_|S|), each item i is replaced with a special token [mask] with a probability p. The model is then tasked with predicting the original item based on its contextual content information. As our model represents each item with multiple modal representation embeddings as inputs, for a masked item we replace all of its associated modal representation embeddings with the token [mask].

During testing, the objective of predicting the next potential interaction item requires the addition of the token [mask] at the end of the user's historical interaction sequence. This enables the model to predict the next item in the sequence that the user is likely to interact with.

To ensure consistency in input between training and testing, where the token [mask] only appears at the last position during testing, we adopt a specific strategy during training. The last position of the input sequence sample is always replaced with the token [mask] during training. For the other positions selected for masking, we employ three replacement strategies: (1) replacing the item with the [mask] token 80% of the time, (2) replacing the item with a randomly selected item 10% of the time, and (3) keeping the item unchanged 10% of the time. This approach addresses the inconsistency in input between training and testing, ensuring that the model is trained to effectively predict the masked items in the sequence during testing.
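The masking scheme of Section 3.3.3 can be sketched as follows: the last position is always masked, and every other position is selected with probability p and then handled with the 80%/10%/10% replacement rule. Items are abstracted to indices here so the rule itself is the focus; the helper name and signature are illustrative.

```python
import random

MASK = "[mask]"

def mask_sequence(items: list, p: float = 0.2, num_items: int = 10000):
    """Return (masked_sequence, labels); labels are None where no prediction is needed."""
    masked, labels = list(items), [None] * len(items)
    for pos in range(len(items)):
        last = pos == len(items) - 1
        if not last and random.random() >= p:
            continue                                   # position not selected for masking
        labels[pos] = items[pos]                       # model must recover the original item
        if last:
            masked[pos] = MASK                         # last position is always [mask]
        else:
            r = random.random()
            if r < 0.8:
                masked[pos] = MASK                     # 80%: replace with [mask]
            elif r < 0.9:
                masked[pos] = random.randrange(num_items)  # 10%: random item
            # remaining 10%: keep the original item unchanged
    return masked, labels
```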
4 EXPERIMENT

4.1 Pre-training Experiments

4.1.1 Datasets. During the pre-training phase, we employed the WebVid dataset [3] as our training data. This dataset comprises a large-scale collection of video-text pairs, totaling 10 million pairs obtained from stock footage websites. Videos, being a rich source of diverse modal features, are particularly well-suited for our self-supervised multi-modal pre-training task. To simplify the training process, we utilized the pre-extracted features¹ obtained from CLIP (ViT-B/32) at a frame rate of 1 FPS.

¹ https://huggingface.co/datasets/iejMac/CLIP-WebVid
In addition, we employed MSR-VTT [49] as our test set to assess the generalization ability of our model. MSR-VTT consists of 10,000 videos, each ranging in length from 10 to 32 seconds, and a total of 200,000 captions. For evaluation purposes, we utilized a set of 1,000 test clips to assess the performance of our model.

4.1.2 Evaluation Metrics. We employed the standard Recall@K retrieval metric to evaluate the performance of our model. Recall@K assesses the proportion of test samples in which the correct result is present among the top K retrieved points for a given query sample. We report the results for Recall@5 and Recall@10 as performance indicators for our model.

4.1.3 Implementation Details. We utilize CLIP (ViT-B/32) as the visual encoder and textual encoder to extract modal features. For MSR-VTT, we evenly partition each video into 10 clips based on its total length, and extract one frame image from each clip for feature extraction. During training, the parameters of these modal feature encoders are kept frozen, and only the item encoder is trained.

The model parameters are optimized using the AdamW optimizer with a learning rate of 5e-5, and exponential decay is applied to the learning rate with a decay rate of 0.9. We set the maximum number of video frames allowed in the model to 10 and the maximum length of text tokens to 77. The batch size is configured to 48,000 for training. The embedding dimension of the model is 512, with 2 layers and 8 heads. We set the embedding dropout and hidden dropout to 0.2 and 0.5, respectively. Additionally, the weight coefficients λ_{v,v'}, λ_{t,t'}, λ_{v,t}, and λ_{vt,vt'} are all 0.25, and the temperature coefficient is 0.05. The model is pre-trained for 15 epochs using 8 V100 GPUs (32GB memory), and the training process is completed in 5 hours.

4.1.4 Performance. The objective of evaluating the pre-trained model is to assess its efficacy in aligning diverse modalities and its ability to generalize. As the downstream sequential recommendation task involves predicting representations of specific item positions and retrieving items based on these representations, the retrieval performance of the pre-trained model directly impacts its performance on the downstream task. To evaluate the generalization ability of the model, zero-shot testing is conducted directly on the test set. To assess the effectiveness of the alignment between different modalities, the text-to-video retrieval task is chosen as the evaluation task.

Table 1 presents the results of our model compared to CLIP, EverythingAtOnce, and FrozenInTime on the zero-shot text-to-video retrieval task using MSR-VTT. The table reveals that our model outperforms CLIP in terms of retrieval performance. By integrating the features of multiple modalities through the item encoder, our model exhibits improved representation capability for items compared to using CLIP directly for extracting multi-modality features. Furthermore, when comparing our model with EverythingAtOnce, despite both models utilizing the same backbone, the variance in pre-training datasets also yields significant performance differences on the test set. Hence, training the model on larger pre-training datasets can further augment its generalization ability and enhance its performance.

Table 1: Results of pre-training on the MSR-VTT dataset for zero-shot text-to-video retrieval. "R@k" is short for "Recall@k".

Method     Pretrain Dataset   Backbone          Backbone Trainable   R@5    R@10
CLIP [37]  -                  ViT-B/32          ✕                    42.1   51.7
EAO [40]   HW100M             CLIP (ViT-B/32)   ✕                    32.5   42.4
FIT [3]    CC+WV              CLIP (ViT-B/32)   ✓                    46.9   57.2
Ours       WV10M              CLIP (ViT-B/32)   ✕                    47.3   59.7

Table 2: Statistics of the downstream task datasets after preprocessing.

Dataset    #Users   #Items   #Actions   Sparsity
Beauty     22,363   12,101   198,502    99.93%
Sports     35,598   18,357   296,337    99.95%
Clothing   39,387   23,033   278,677    99.97%
Home       66,519   28,237   551,682    99.97%
ML-1m       6,040    3,416   999,611    95.16%

4.2 Downstream Task Experiments

4.2.1 Datasets. In order to evaluate the effectiveness of our proposed model, we selected five open-source datasets from real-world platforms, taking into consideration the total number of users, items, actions, and dataset sparsity. Among these datasets, four are Amazon platform datasets² [30], including "Beauty," "Sports and Outdoors," "Clothing Shoes and Jewelry," and "Home and Kitchen." Additionally, to evaluate the generalization ability of the model across different platforms, we also selected Movielens-1M³ [17], which is widely used for evaluating recommendation algorithms. We followed the approach used in previous works [25, 35, 41, 54] to process these datasets, keeping the five-core datasets and filtering out users and items with fewer than five interactions. Subsequently, we grouped the interactions by users and sorted them in ascending order based on the timestamps. For the four Amazon datasets, we crawled the image links associated with the Amazon dataset items as visual modality information. Items without image links were labeled as having missing visual modality information. Furthermore, we concatenated the titles and descriptions associated with the Amazon dataset items as textual modality information. For Movielens-1M, we crawled the corresponding movie trailers from YouTube⁴ based on the movie names in the dataset as visual modality information, and concatenated the movie names and tags as textual modality information. The statistical information of the preprocessed datasets is shown in Table 2.

² http://jmcauley.ucsd.edu/data/amazon/
³ https://grouplens.org/datasets/movielens/1m/
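The five-core filtering and chronological grouping described above can be sketched as follows; the data layout (a list of (user, item, timestamp) records), the iterative filtering loop, and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_sequences(interactions, min_count: int = 5):
    """interactions: iterable of (user, item, timestamp) tuples."""
    interactions = list(interactions)
    # Iteratively drop users/items with fewer than five interactions (5-core).
    while True:
        users = Counter(u for u, _, _ in interactions)
        items = Counter(i for _, i, _ in interactions)
        kept = [(u, i, ts) for u, i, ts in interactions
                if users[u] >= min_count and items[i] >= min_count]
        if len(kept) == len(interactions):
            break
        interactions = kept
    # Group by user and keep each user's interactions in chronological order.
    sequences = defaultdict(list)
    for u, i, ts in sorted(interactions, key=lambda x: x[2]):
        sequences[u].append(i)
    return dict(sequences)
```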
Table 3: Performance comparison of different recommendation models. The best and the second-best performances are denoted in bold and underlined fonts, respectively. "Improv." indicates the relative improvement ratios of the proposed approach over the best performance baselines. The features used for item representations of each compared model are listed, whether ID, feature (F), or both (ID+F).

Dataset   Metric      Pop     GRU4Rec(ID)  SASRec(ID)  FDSA(ID+F)  S3-Rec(ID+F)  DuoRec(ID)  UniSRec(F)  Ours(F)  Improv.
Beauty    Recall@10   0.0189  0.0627       0.0842      0.0874      0.0862        0.0853      0.0743      0.0949   8.58%
          Recall@50   0.0555  0.1498       0.1766      0.1838      0.1871        0.1854      0.1885      0.2302   22.12%
          NDCG@10     0.0093  0.0330       0.0416      0.0462      0.0434        0.0449      0.0351      0.0476   3.03%
          NDCG@50     0.0174  0.0520       0.0618      0.0680      0.0653        0.0663      0.0599      0.0754   10.88%
Sports    Recall@10   0.0177  0.0351       0.0477      0.0504      0.0523        0.0514      0.0513      0.0635   21.41%
          Recall@50   0.0510  0.0928       0.1114      0.1183      0.1210        0.1176      0.1314      0.1607   22.30%
          NDCG@10     0.0097  0.0183       0.0222      0.0276      0.0250        0.0248      0.0252      0.0323   17.03%
          NDCG@50     0.0169  0.0308       0.0360      0.0422      0.0399        0.0391      0.0425      0.0534   25.65%
Clothing  Recall@10   0.0085  0.0158       0.0268      0.0283      0.0370        0.0313      0.0382      0.0452   18.32%
          Recall@50   0.0303  0.0477       0.0608      0.0682      0.0858        0.0677      0.1021      0.1269   24.29%
          NDCG@10     0.0045  0.0078       0.0122      0.0156      0.0169        0.0148      0.0191      0.0218   14.14%
          NDCG@50     0.0092  0.0146       0.0195      0.0242      0.0275        0.0227      0.0328      0.0395   20.43%
Home      Recall@10   0.0136  0.0189       0.0303      0.0277      0.0329        0.0309      0.0265      0.0388   17.93%
          Recall@50   0.0436  0.0532       0.0638      0.0680      0.0690        0.0669      0.0725      0.0969   33.66%
          NDCG@10     0.0069  0.0098       0.0152      0.0155      0.0163        0.0156      0.0135      0.0195   19.63%
          NDCG@50     0.0134  0.0171       0.0224      0.0242      0.0241        0.0234      0.0234      0.0321   32.64%
ML-1m     Recall@10   0.0749  0.2988       0.2993      0.3028      0.3002        0.3013      0.1472      0.3124   3.17%
          Recall@50   0.2110  0.5412       0.5457      0.5523      0.5467        0.5421      0.4114      0.5650   2.30%
          NDCG@10     0.1310  0.1731       0.1690      0.1744      0.1694        0.1697      0.0665      0.1835   5.22%
          NDCG@50     0.1522  0.2264       0.2237      0.2296      0.2240        0.2231      0.1242      0.2392   4.18%
4.2.2 Evaluation Metrics. To evaluate the performance of the models, we utilize the top-k Recall and top-k Normalized Discounted Cumulative Gain (NDCG) metrics, which are commonly used in related works [23, 41, 47]. Recall measures the presence of the positive item, while NDCG takes into account both the rank position and the presence. In our experiments, we report Recall and NDCG at k = 10, 50. In addition, we adopt the leave-one-out strategy, which has been widely employed in previous works [35, 44, 54]. Specifically, for each user, we retain the last interaction item as the test data, and the item just before the last as the validation data. The remaining items are used for training. We rank the ground-truth item of each sequence against all other items for evaluation on the test set, and finally calculate the average score across all test users.
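Under the leave-one-out protocol above, each test case reduces to the rank of the single ground-truth item among all candidates, so Recall@k and NDCG@k take a particularly simple form, sketched below with an assumed list of per-user ranks.

```python
import math

def recall_at_k(ranks: list[int], k: int) -> float:
    """ranks: 1-based rank of the held-out ground-truth item for each test user."""
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks: list[int], k: int) -> float:
    # With a single relevant item, DCG reduces to 1/log2(rank + 1) when rank <= k.
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

# Example: ranks of the held-out items for three test users.
ranks = [1, 7, 52]
print(recall_at_k(ranks, 10), ndcg_at_k(ranks, 10))  # approx. 0.667 and 0.444
```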
4.2.3 Baselines. We compare the proposed approach with the following baseline methods:

• Pop is a non-personalized approach that recommends the same items to all users. These items are determined based on their popularity, measured by the highest number of interactions across the entire set of items.
• GRU4Rec [21] proposes an approach for session-based recommendations using recurrent neural networks with Gated Recurrent Units.
• SASRec [25] proposes a self-attention based sequential model which uses the multi-head attention mechanism to recommend the next item.
• FDSA [52] integrates various heterogeneous features of items into feature sequences with different weights through a vanilla attention mechanism.
• S3-Rec [54] is a self-supervised learning approach for sequential recommendation that adopts mutual information maximization (MIM) to learn correlations among attribute, item, subsequence, and sequence.
• DuoRec⁵ [35] addresses the representation degeneration problem in sequential recommendation. It uses contrastive regularization to reshape the distribution of sequence representations and improve the item embedding distribution.
• UniSRec⁶ [23] utilizes the associated description text of items to learn transferable representations across different recommendation scenarios.

⁵ https://github.com/RuihongQiu/DuoRec

4.2.4 Implementation Details. For DuoRec and UniSRec, we utilized the source code provided by their respective authors. For the other methods, we implemented them using RecBole [53], a widely used open-source recommendation library. All hyper-parameters were set based on the recommendations in the original papers. Additionally, fine-tuning of all baseline models was conducted on the five downstream recommendation datasets.

For our proposed model, we use the AdamW optimizer with a learning rate of 1e-3 and configure the batch size to 8192. The mask item ratio of the model is 0.2. For the four Amazon datasets, we set the maximum sequence length to 20, the embedding dropout to 0.2, and the hidden dropout to 0.5. For Movielens-1M, we set the maximum sequence length to 100, the embedding dropout to 0.2, and the hidden dropout to 0.2. Furthermore, we adopted early
stopping with a patience of 10 epochs to prevent overfitting, with Recall@10 as the indicator.

Table 4: Statistical Information of Sports and Home datasets under Different k-core Strategies.

Method     Backbone          R@10     R@50     N@10     N@50
UniSRec_t  BERT (base)       0.0513   0.1314   0.0252   0.0425
Our_t      BERT (base)       0.0588   0.1501   0.0299   0.0497
Our_t      CLIP (ViT-B/32)   0.0603   0.1524   0.0309   0.0501
[47] Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1259–1273.
[48] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).
[49] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
[50] Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. 2021. TACo: Token-aware cascade contrastive learning for video-text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11562–11572.
[51] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-efficient transfer from sequential behaviors for user modeling and recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1469–1478.
[52] Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S. Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level deeper self-attention network for sequential recommendation. In IJCAI. 4320–4326.
[53] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, et al. 2021. RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4653–4664.
[54] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1893–1902.