of the model as well as prevents further precision improvement of multi-label image recognition.

In this paper, we introduce Multi-modal Factorized Bilinear pooling (MFB) [39], which works as an efficient component to fuse cross-modal vectors, and propose a fast GCN based model (termed F-GCN) that fuses image representations and label embeddings for multi-label image recognition. Our F-GCN mainly contains an image representation learning module, a label co-occurrence embedding module and an MFB fusion module. In the image representation learning module, following ML-GCN, we use a CNN (i.e., ResNet-101 [11]) based model to obtain the latent representation of each image. At the same time, in the label co-occurrence embedding module, we take both the word vectors and the co-occurrence correlations of labels as input, then use a GCN to learn label embeddings that reflect the co-occurrence relationships between different objects. Different from previous studies, we then fuse these two modal vectors (i.e., image representations and label co-occurrence embeddings) via MFB to train an end-to-end multi-label image recognition model with a multi-label loss function. We conduct extensive experiments on two multi-label datasets, MS-COCO [23] and VOC2007 [6]. Experimental results demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the recognition performance is also improved compared with the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 discusses several related works. We formulate and introduce our F-GCN in detail in Section 3. Section 4 presents the comparison experiments, ablation study and visual retrieval results of F-GCN. At last, we conclude this paper in Section 5.

"mountain" will rarely co-occur in the same image. Therefore, a large number of combinations of labels will hardly occur in the real world.

Various approaches have been considered to explore the label dependencies to reduce and optimize the label prediction space. Gong et al. [10] adopt a deep convolutional architecture to learn the approximate top-k ranking objective function for multi-label image recognition. Wang et al. [32] combine a CNN with an RNN to model the label dependencies in a sequential fashion by embedding semantic labels into vectors. Besides, others propose to utilize the attention mechanism to capture the correlation between labels. Zhu et al. [41] propose a spatial regularization network to capture both semantic and spatial relations of these multiple labels based on weighted attention maps. Wang et al. [33] introduce a spatial transformer layer and long short-term memory (LSTM) units to capture the label correlation. In addition, a graph based framework [19] has been proposed to describe the relationships between labels via knowledge graphs, which aims to generate more accurate image representations.

2.2 Graph Convolution Network

The basic idea of the graph convolution network (GCN) [17] is to update one node's feature based on the features of the node itself and its related neighbor nodes, according to the correlation matrix of the graph. By learning the structural similarities between training data points, GCN can integrate these relationships into the data features. Formally, GCN takes the correlation matrix A as well as a feature matrix X as input, and produces node-level output. The forward propagation process in GCN is described as:

H^{l+1} = a^l(Â H^l W^l),    (1)
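As a concrete illustration, a single propagation step of Eq. (1) can be written in a few lines of PyTorch. This is a minimal sketch under our own naming: the symmetric normalization of Â follows Kipf and Welling [17], and ReLU is used for the nonlinearity a^l (the choice this paper adopts later in Eq. (6)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One propagation step of Eq. (1): H^{l+1} = a^l(A_hat @ H^l @ W^l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, H, A_hat):
        # H: (C, in_dim) node features; A_hat: (C, C) normalized correlation matrix.
        return F.relu(A_hat @ H @ self.weight)

def normalize_adj(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, the normalization of [17]."""
    A = A + torch.eye(A.size(0))
    d = A.sum(dim=1).pow(-0.5)
    return d.unsqueeze(1) * A * d.unsqueeze(0)
```

Stacking such layers lets each node's feature absorb information from increasingly distant neighbors in the label graph.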
Figure 1: Related objects are more likely to appear in the same image. For example, we can always see the superstar Kobe Bryant and his teammates playing basketball on TV (in the first three images). These images are common in the real world, which illustrates that the combination of "person", "basketball" and "basketball hoop" tends to co-occur. However, we hardly see a "basketball" on a "mountain", and the two are rarely tied together in one image, because there is no direct relationship between these two objects.
2.3 Cross-modal Fusion

Cross-modal feature fusion methods have been proposed to solve the visual question answering (VQA) problem [25]; they usually use concatenation or element-wise summation to fuse the image and question representations. Fukui et al. [7] first propose Multi-modal Compact Bilinear pooling (MCB), which introduces the bilinear model to fuse multi-modal features by taking the outer product of two vectors in different modalities, producing a very high-dimensional feature through quadratic expansion. To reduce this high-dimensional computation, Kim et al. [16] propose the Multi-modal Low-rank Bilinear pooling (MLB) approach based on the Hadamard product of two feature vectors, which achieves performance comparable to MCB but may lead to a low convergence rate. Furthermore, Yu et al. [39] introduce the Multi-modal Factorized Bilinear pooling (MFB) model to efficiently fuse image and text embeddings, which produces remarkable results in VQA and speeds up model convergence.

Motivated by the above studies, our work adopts the MFB component to fuse the image representations and label co-occurrence embeddings respectively generated from the CNN and GCN modules. With the proposed F-GCN, the convergence efficiency of the model is greatly improved. We also demonstrate that F-GCN works as an effective model to learn the label dependencies and can be trained in an end-to-end manner with remarkable performance compared with the state-of-the-art methods.

3 PROPOSED METHODOLOGY

Based on previous studies, in this section we propose F-GCN, a fast GCN based multi-label image recognition framework that adopts a cross-modal component (i.e., MFB) to fuse image representations and label co-occurrence embeddings. Our F-GCN mainly consists of three modules: an image representation learning module, a label co-occurrence embedding module and an MFB fusion module. In the following, we first present the overall framework of F-GCN in Figure 2, and then respectively introduce the workflow of these three modules in detail.

3.1 Overall Framework

For convenience, we list the preliminary notations in Table 1. As shown in Figure 2, there are three key modules in F-GCN: a CNN module for image feature extraction, a GCN module for co-occurrence label embedding generation and an MFB fusion module for cross-modal vector fusion.

Table 1: Preliminary notations used in this paper.

Notation   Explanation
N          the number of input images
x_i        the i-th image
f_i        the i-th image's representation
l_i        the i-th image's ground truth label
y_i        the i-th image's predicted label
C          the number of object categories
o_i        the i-th object in the label set
T_i        the occurrence times of the i-th object
T_ij       the co-occurrence times of the i-th and j-th objects
d          the dimension of each object's word embedding vector
D          the dimension of each image's representation
Z          a C × d object word embeddings matrix
A          a C × C label correlation matrix
W          a C × D label co-occurrence embeddings matrix
W_j        the j-th row vector of W
U          the weights in GCN propagation
M          the dimension of M_1 (or M_2) in MFB fusion
g          the number of units in each group sum-pooling

Given a dataset consisting of N images, the i-th image x_i is input to a CNN module (i.e., ResNet-101 [11]) to extract its image representation f_i (a D-dimensional feature vector) from the "conv5_x" layer of this network.
Figure 2: The overall framework of our proposed F-GCN is comprised of three key modules: image representation learning module, label co-occurrence embedding module and MFB fusion module. The image representation module uses a CNN backbone (i.e., ResNet-101) to train and extract the representation (feature) of each image. The label co-occurrence embedding module designs a two-layer GCN to learn the co-occurrence embeddings that reflect the label dependencies according to the label statistical information. The MFB module efficiently fuses these two cross-modal vectors by means of group sum-pooling. The overall network is trained in an end-to-end manner using the multi-label loss function.
At the same time, the GCN module takes both the object word embeddings matrix Z and the label correlation matrix A as input, then adopts a two-layer GCN to learn the label co-occurrence embeddings matrix W. After obtaining the image representations and label co-occurrence embeddings, we design the MFB fusion module to efficiently fuse these two cross-modal vectors (f_i and W) via group sum-pooling, and use fully connected (fc) layers to generate the predicted label y_i. Finally, the multi-label loss function is adopted to compute the loss between y_i and the ground truth label l_i ∈ {0, 1}^C and train the whole network in an end-to-end manner.

3.2 Image Representation Learning Module

Following ML-GCN [3], we adopt one of the state-of-the-art CNN based networks (i.e., ResNet-101 [11]) for feature extraction in this module. As shown in the orange frame of Figure 2, we first remove the last fc layer and softmax layer of ResNet-101, and then use this sub-network to generate the D-dimensional representation of each image x_i. For ResNet-101, the output dimension D is 2048. Therefore, for an input image x_i with 448 × 448 resolution, we obtain 2048 × 14 × 14 feature maps from the "conv5_x" layer. At last, we adopt global max-pooling to generate the image representation f_i, which is a 2048-dimensional feature vector.

3.3 Label Co-occurrence Embedding Module

In this part, we use GCN to learn the label co-occurrence embeddings according to the relationships between different objects. Different from the original GCN, which was proposed to solve the node classification problem, we treat and design the node-level output of our F-GCN as a classifier corresponding to each label. Specifically, we aim to map the object dependencies of a dataset to label co-occurrence embeddings in our task. The input of GCN calls for the feature vector of each node and the correlation matrix between these nodes. As shown in the blue frame of Figure 2, we adopt the GloVe [27] model to transform each object (C objects in total in a dataset) into a d-dimensional (i.e., 300-dimensional) word vector. Therefore, we can obtain a C × d object word embeddings matrix Z. For example, there are 20 object categories in VOC2007, so the input feature matrix Z for the first GCN layer will be a 20 × 300 matrix.

In addition to obtaining the feature vector of each node (object), another essential issue in F-GCN is to construct the label correlation matrix A between these nodes. In the implementation, we capture the label dependencies and construct matrix A according to the label statistics over the whole dataset. Specifically, for ∀i ∈ [1, C], we collect the occurrence times (i.e., T_i) of the i-th object (i.e., o_i) as well as the co-occurrence times (i.e., T_ij, which equals T_ji) of o_i and o_j. Furthermore, the label dependencies can be formulated by the conditional probability as follows:

P_ij = P(o_i | o_j) = T_ij / T_j,    (2)

where P_ij denotes the probability that o_i occurs on the condition that o_j appears. Note that P_ij is not equal to P_ji, because the conditional probability between two objects is asymmetric. Based on this, we can construct the correlation matrix A below:

A_ij = P_ij,    (3)
where A_ij denotes the element in the i-th row and j-th column of matrix A. However, similar to ML-GCN, if we directly use this correlation matrix to train the model, the rarely co-occurring objects will become noise that affects the data distribution as well as the model convergence. To filter this noise, we use a threshold ϵ to binarize the above matrix A:

A_ij = { 0, if P_ij ≤ ϵ;  1, otherwise },    (4)

where ϵ ∈ [0, 1]. Besides, when using GCN to update each node's feature in the propagation process, the binary correlation matrix may lead to the over-smoothing problem, which makes the generated node features indistinguishable. Therefore, we adopt a weighted scheme to calculate the final correlation matrix:

A_ij = { δ A_ij / Σ_{k=1, k≠i}^{C} A_ik, if i ≠ j;  1 − δ, otherwise },    (5)

where δ ∈ [0, 1]. In this way, we can use this weighted correlation matrix A to update each node's feature by choosing a suitable δ.

After obtaining both the object word embeddings matrix Z and the label correlation matrix A, we design a two-layer GCN to propagate this relationship, where each GCN layer can be described as:

Z^{l+1} = f^l(Â Z^l U^l),  l ∈ {0, 1},    (6)

where Â (see [17] for details) denotes the normalized version of the correlation matrix A, Z^l denotes the latent features of the C nodes in the l-th layer, U^l denotes the weights of the l-th layer, and f^l(·) denotes the non-linear operation, which is a ReLU function. Note that Z is the input of this sub-network, and the output is a C × D label co-occurrence embeddings matrix W, of which each row will be fused with the image representation f_i in the next MFB fusion module.
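To make this construction concrete, the following sketch gathers the label statistics of Eqs. (2)-(3) from a binary annotation matrix, then applies the binarization of Eq. (4) and the re-weighting of Eq. (5). It is a minimal illustration under our own variable names, assuming the training labels are available as an N × C 0/1 tensor; ϵ = 0.4 and δ = 0.2 are the values reported in Section 4.2:

```python
import torch

def build_correlation_matrix(labels, eps=0.4, delta=0.2):
    # labels: (N, C) 0/1 tensor; labels[n, i] = 1 iff object o_i appears
    # in the n-th training image.
    labels = labels.float()
    T = labels.t() @ labels                   # T[i, j] = co-occurrence count T_ij
    Tj = T.diagonal().clamp(min=1)            # T[j, j] = occurrence count T_j
    P = T / Tj.unsqueeze(0)                   # Eqs. (2)-(3): A_ij = P(o_i | o_j)
    A = (P > eps).float()                     # Eq. (4): binarize with threshold eps
    A.fill_diagonal_(0)
    A = delta * A / A.sum(dim=1, keepdim=True).clamp(min=1)  # Eq. (5), i != j
    A = A + (1 - delta) * torch.eye(A.size(0))               # Eq. (5), i == j
    return A
```

The two-layer GCN of Eq. (6) is then simply two applications of the propagation step sketched in Section 2.2, mapping Z (C × 300) through a 1024-dimensional hidden layer to W (C × 2048).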
3.4 MFB Fusion Module

MFB [39] has been proposed as an effective component to fuse cross-modal features, and is usually implemented by combining some commonly-used layers such as fc, element-wise multiplication and pooling layers. Different from previous works that use the dot product (DP) to simply fuse the cross-modal vectors, in this part we adopt MFB to efficiently fuse image representations and label co-occurrence embeddings, which helps F-GCN achieve higher performance.

As shown in the red frame of Figure 2, on the one hand, the Hadamard product increases the interaction between the different modal vectors, which promotes the precision of F-GCN. On the other hand, group sum-pooling reduces over-fitting and parameter explosion, which speeds up the convergence of F-GCN. The input of MFB consists of two parts: f_i and W. Note that in each fusion, we use f_i and one row vector of W to generate one element of y_i.

Formally, given the i-th image representation f_i, for ∀j ∈ [1, C], we fuse f_i and W_j to generate the j-th element of y_i, where W_j is the j-th row vector of W as stated in Table 1. First, we respectively use two fc layers to transform f_i into an M-dimensional vector M_1 and W_j into an M-dimensional vector M_2. Then these two modal vectors are fused to generate the M-dimensional vector M_1 ◦ M_2, where ◦ is the Hadamard product. To speed up the convergence, we use group sum-pooling to convert M_1 ◦ M_2 into an M/g-dimensional vector, where each group containing g units is sequentially mapped into one unit. Finally, we adopt an fc layer to generate the j-th element of y_i. We obtain the complete predicted label vector y_i after C such fusions with f_i. Note that all the fc layers are shared by each fusion.

Finally, we adopt the MultiLabelSoftMarginLoss¹ (termed the multi-label loss) function to update the whole network in an end-to-end manner. The training loss function is described as:

L(y_i, l_i) = −(1/C) Σ_{j=1}^{C} [ l_ij log((1 + exp(−y_ij))^{−1}) + (1 − l_ij) log(exp(−y_ij) / (1 + exp(−y_ij))) ],    (7)

where y_ij and l_ij respectively denote the j-th elements of y_i and l_i.

¹ https://pytorch.org/docs/master/nn.html?highlight=multilabelsoft#torch.nn.MultiLabelSoftMarginLoss
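The fusion just described, together with the backbone of Section 3.2, might look like the following sketch. The fusion is vectorized over all C labels at once, which is equivalent to the C per-label fusions with shared fc layers; module and variable names are ours, and M = 358, g = 2 follow Section 4.2:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MFBFusion(nn.Module):
    """Fuse the image feature f_i (D) with the label embeddings W (C x D)."""
    def __init__(self, D=2048, M=358, g=2):
        super().__init__()
        self.g = g
        self.fc_img = nn.Linear(D, M)       # f_i -> M_1
        self.fc_lab = nn.Linear(D, M)       # W_j -> M_2, shared over all j
        self.fc_out = nn.Linear(M // g, 1)  # shared scorer for each label

    def forward(self, f, W):
        M1 = self.fc_img(f).unsqueeze(1)    # (B, 1, M)
        M2 = self.fc_lab(W).unsqueeze(0)    # (1, C, M)
        fused = M1 * M2                     # Hadamard product -> (B, C, M)
        B, C, M = fused.shape
        # Group sum-pooling: each group of g consecutive units becomes one unit.
        pooled = fused.view(B, C, M // self.g, self.g).sum(dim=-1)
        return self.fc_out(pooled).squeeze(-1)  # predicted logits y_i, (B, C)

class FGCN(nn.Module):
    def __init__(self, label_gcn):
        super().__init__()
        backbone = models.resnet101(pretrained=True)
        # Keep the network up to "conv5_x"; drop the final avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.label_gcn = label_gcn          # e.g., the two-layer GCN of Eq. (6)
        self.fusion = MFBFusion()

    def forward(self, x, Z, A_hat):
        fmap = self.features(x)                 # (B, 2048, 14, 14) for 448x448 input
        f = fmap.flatten(2).max(dim=-1).values  # global max-pooling -> (B, 2048)
        W = self.label_gcn(Z, A_hat)            # (C, 2048)
        return self.fusion(f, W)

# Training step with the multi-label loss of Eq. (7):
# criterion = nn.MultiLabelSoftMarginLoss()
# loss = criterion(model(images, Z, A_hat), targets)
```

Because the two projection layers and the scorer are shared across labels, this batched form produces exactly the C per-label fusion results described above in a single pass.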
4 EXPERIMENTS

In this section, we evaluate the performance of F-GCN and compare it with the state-of-the-art image recognition methods. We first describe the datasets, then introduce the implementation details and evaluation metrics, and finally present the experimental results of F-GCN.

4.1 Datasets

MS-COCO [23] is a popular multi-label dataset for image recognition, segmentation and captioning, which contains 118,287 training images, 40,504 validation images and 40,775 test images, where each image is labeled with about 2.9 object labels on average from 80 semantic categories. Since the ground truth labels of the test set are not available, we train our model on the train set and evaluate the performance on the validation set.

VOC2007 [6] consists of 9,963 multi-label images and 20 object classes, and is divided into train, validation and test sets. On average, each image is annotated with 1.5 labels. We use both the train and validation sets to train our model, and then evaluate the performance on the test set.

4.2 Implementation Details and Evaluation Metrics

Implementation details. All experiments are implemented with PyTorch. In the image representation module, each image is resized to 448 × 448 with random horizontal flips, and the output dimension from ResNet-101 is D = 2048. In the label co-occurrence embedding module, our F-GCN consists of a two-layer GCN with output dimensions of 1024 and 2048, where the initial label word embedding is a 300-dimensional vector generated by the GloVe model pre-trained on the Wikipedia dataset. Note that we use the average embedding of all words as the label word vector if a label is expressed by multiple words. To construct the correlation matrix, we respectively set ϵ = 0.4 in Equation 4 and δ = 0.2 in Equation 5. In the MFB fusion module, we set M = 358 to fuse the cross-modal embeddings and g = 2 to complete group sum-pooling. The whole network is updated by stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 10⁻⁴, an initial learning rate of 0.1 which decays by a factor of 10 every 40 epochs, and a batch size of 32.
Table 2: Comparisons of F-GCN with the state-of-the-art methods on MS-COCO.

                          All                                     Top-3
Method              mAP   CP    CR    CF1   OP    OR    OF1   CP    CR    CF1   OP    OR    OF1
CNN-RNN [32] 61.2 - - - - - - 66.0 55.6 60.4 69.2 66.4 67.8
RNN-Attention [33] - - - - - - - 79.1 58.7 67.4 84.0 63.0 72.0
Order-Free RNN [1] - - - - - - - 71.6 54.8 62.1 74.2 62.2 67.7
ML-ZSL [19] - - - - - - - 74.1 64.5 69.0 - - -
SRN [41] 77.1 81.6 65.4 71.2 82.7 69.9 75.8 85.2 58.8 67.4 87.4 62.5 72.9
Multi-Evidence [8] - 80.4 70.2 74.9 85.2 72.5 78.4 84.5 62.2 70.6 89.1 64.3 74.7
ResNet-101 [11] 77.3 80.2 66.7 72.8 83.9 70.8 76.8 84.1 59.4 69.7 89.1 62.8 73.6
ML-GCN [3] (DP) 83.0 85.1 72.0 78.0 85.8 75.4 80.3 89.2 64.1 74.6 90.5 66.5 76.7
A-GCN [21] (DP) 83.1 84.7 72.3 78.0 85.6 75.5 80.3 89.0 64.2 74.6 90.5 66.3 76.6
F-GCN (MFB) 83.2 85.4 72.4 78.3 86.0 75.7 80.5 89.3 64.3 74.7 90.5 66.6 76.7
Table 3: AP and mAP comparisons of F-GCN with the state-of-the-art methods on VOC2007.
                                                      AP
Method              aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  motor  person  plant  sheep  sofa  train  tv     mAP
CNN-RNN [32] 96.7 83.1 94.2 92.8 61.2 82.1 89.1 94.2 64.2 83.6 70.0 92.4 91.7 84.2 93.7 59.8 93.2 75.3 99.7 78.6 84.0
RLSD [24] 96.4 92.7 93.8 94.1 71.2 92.5 94.2 95.7 74.3 90.0 74.2 95.4 96.2 92.1 97.9 66.9 93.5 73.7 97.5 87.6 88.5
VeryDeep [28] 98.9 95.0 96.8 95.4 69.7 90.4 93.5 96.0 74.2 86.6 87.8 96.0 96.3 93.1 97.2 70.0 92.1 80.3 98.1 87.0 89.7
ResNet-101 [11] 99.5 97.7 97.8 96.4 65.7 91.8 96.1 97.6 74.2 80.9 85.0 98.4 96.5 95.9 98.4 70.1 88.3 80.2 98.9 89.2 89.9
FeV+LV [36] 97.9 97.0 96.6 94.6 73.6 93.9 96.5 95.5 73.7 90.3 82.0 95.4 97.7 95.9 98.6 77.6 88.7 78.0 98.3 89.0 90.6
HCP [34] 98.6 97.1 98.0 95.6 75.3 94.7 95.8 97.3 73.1 90.2 80.0 97.4 96.1 94.9 96.3 78.3 94.7 76.2 97.9 91.5 90.9
RNN-Attention [33] 98.6 97.4 96.3 96.2 75.2 92.4 96.5 97.1 76.5 92.0 87.7 96.8 97.5 93.8 98.5 81.6 93.7 82.8 98.6 89.3 91.9
Atten-Reinforce [2] 98.6 97.1 97.1 95.5 75.6 92.8 96.8 97.3 78.3 92.2 87.6 96.9 96.5 93.6 98.5 81.6 93.1 83.2 98.5 89.3 92.0
ML-GCN [3] (DP) 99.5 98.5 98.6 98.1 80.8 94.6 97.2 98.2 82.3 95.7 86.4 98.2 98.4 96.7 99.0 84.7 96.7 84.3 98.9 93.7 94.0
A-GCN [21] (DP) 99.4 98.5 98.6 98.0 80.8 94.7 97.2 98.2 82.4 95.5 86.4 98.2 98.4 96.7 98.9 84.8 96.6 84.4 98.9 93.7 94.0
F-GCN (MFB) 99.5 98.5 98.7 98.2 80.9 94.8 97.3 98.3 82.5 95.7 86.6 98.2 98.4 96.7 99.0 84.8 96.7 84.4 99.0 93.7 94.1
Evaluation metrics. We use the conventional image recognition evaluation metrics, including the mean of class-average precision (mAP), overall precision (OP), recall (OR) and F1 (OF1), and average per-class precision (CP), recall (CR) and F1 (CF1). For each image, a label is predicted as positive if its confidence is greater than 0.5. In addition, we also present these evaluation results for the top-3 labels for fair comparison.

38.89% and 72.91% lower than our F-GCN. In addition, ML-GCN takes about 200 epochs (more than 11 times that of F-GCN) to complete its training process. This result verifies that our MFB component efficiently fuses the cross-modal embeddings and greatly promotes the convergence efficiency of F-GCN.
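For reference, the per-class and overall precision/recall/F1 protocol described above can be computed as in the sketch below (mAP is omitted); the thresholding at 0.5 and the top-3 variant follow the stated protocol, while the function and variable names are our own:

```python
import numpy as np

def multilabel_prf1(scores, targets, thresh=0.5, top_k=None):
    # scores: (N, C) confidences; targets: (N, C) binary ground truth.
    if top_k is None:
        pred = scores > thresh                     # positive if confidence > 0.5
    else:
        idx = np.argsort(-scores, axis=1)[:, :top_k]
        pred = np.zeros_like(targets, dtype=bool)  # top-k protocol (k = 3 here)
        np.put_along_axis(pred, idx, True, axis=1)
    tp = (pred & (targets > 0)).sum(axis=0).astype(float)
    n_pred, n_gt = pred.sum(axis=0), (targets > 0).sum(axis=0)
    CP = np.mean(tp / np.maximum(n_pred, 1))       # per-class, then averaged
    CR = np.mean(tp / np.maximum(n_gt, 1))
    OP = tp.sum() / max(n_pred.sum(), 1)           # pooled over all classes
    OR = tp.sum() / max(n_gt.sum(), 1)
    CF1, OF1 = 2 * CP * CR / (CP + CR), 2 * OP * OR / (OP + OR)
    return CP, CR, CF1, OP, OR, OF1
```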
            MS-COCO                               VOC2007
# layers    mAP    CF1    OF1    CF1-3   OF1-3    mAP
2 layers    83.2   78.3   80.5   74.7    76.7     94.1
3 layers    82.4   76.9   79.8   73.7    76.3     93.7
4 layers    82.3   76.7   79.6   73.1    75.9     93.2
The dimension M in cross-modal fusion. In this part, we evaluate the performance of F-GCN when changing the dimension M used to fuse the image representations and label co-occurrence embeddings. The input of the MFB fusion module consists of pairs of 2048-dimensional vectors, which are reduced to M dimensions via fc layers. We vary M from 64 to 1024 with a step size of 64. As shown in Figure 6, the performance of F-GCN improves with the increase of M until M exceeds 384 on both MS-COCO and VOC2007. M ∈ [320, 384] may not only play a better role in dimensionality reduction but also fuse the cross-modal vectors efficiently. In fact, in the experiments we find that M = 358 brings a good result for both efficiency and precision. A more detailed picture of the effect of M could be obtained by dividing this interval more finely.

Figure 7: The change of mAP using different values of g. (a) MS-COCO. (b) VOC2007.

The number of units g in group sum-pooling. In this part, we evaluate the performance of F-GCN using different numbers of units g. With group sum-pooling, each M-dimensional vector is transformed into an M/g-dimensional vector. We vary the value of g from 1 to 64 to generate a light-weight fusion vector. As shown in Figure 7(a), F-GCN obtains better performance on MS-COCO when choosing g = 2, while the change of mAP is very slight on VOC2007 in Figure 7(b). We believe g = 2 better preserves the original semantic information under pooling. Moreover, other values of g also bring comparable results and do not affect the model much; it is the structure of MFB that plays the vital role in promoting the performance of F-GCN.
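As a quick illustration of the M-to-M/g transform being varied here, with the reported M = 358 and g = 2 (a hypothetical standalone example):

```python
import torch

M, g = 358, 2
v = torch.randn(M)                  # a fused M-dimensional vector M_1 ∘ M_2
pooled = v.view(M // g, g).sum(-1)  # group sum-pooling -> M/g = 179 units
assert pooled.shape == (M // g,)
```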
Figure 8: Two examples of the returned results for a query image on VOC2007. The query images contain "person + dog" and "bus + car", respectively.
4.3.4 Visual retrieval results. In this section, we evaluate F-GCN by giving two retrieval examples on VOC2007. We return the top-5 images by the k-NN algorithm for each given query image. Figure 8 lists the retrieval results. For example, the first input image contains two objects, "person" and "dog", and each returned image also contains these two objects. Besides, we obtain a similar effect when inputting the second image, which contains "bus" and "car". The visual retrieval results verify that F-GCN has a good classification ability for recognizing multi-label images.
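A sketch of this retrieval protocol; the paper does not state which representation the k-NN runs on, so we assume here that it is the vector of predicted label confidences (sigmoid of the F-GCN logits), with Euclidean distance:

```python
import torch

def retrieve_top5(query_logits, gallery_logits):
    # query_logits: (C,) model outputs for the query image;
    # gallery_logits: (N, C) model outputs for the gallery images.
    q = torch.sigmoid(query_logits)
    G = torch.sigmoid(gallery_logits)
    dist = (G - q.unsqueeze(0)).norm(dim=1)      # Euclidean distance per image
    return dist.topk(5, largest=False).indices   # indices of the top-5 neighbors
```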
5 CONCLUSION AND FUTURE WORK

In order to model the label dependencies and efficiently fuse cross-modal vectors (i.e., image representations and label co-occurrence embeddings), in this paper we introduce a cross-modal fusion component (i.e., MFB) and propose F-GCN, a fast GCN based multi-label image recognition model. F-GCN first respectively adopts a CNN to extract the image features and a GCN to capture the label co-occurrence embeddings according to the relationships between different objects, then utilizes MFB to efficiently fuse these cross-modal embeddings and trains an end-to-end model with a multi-label loss function. Extensive experimental results on MS-COCO and VOC2007 demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the image recognition performance is also improved compared with the state-of-the-art methods. In the future, we will integrate the attention mechanism into our model to extract more accurate image features and further promote the image recognition performance.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China No. 61902135 and the Innovation Group Project of the National Natural Science Foundation of China No. 61821003. Thanks to Jay Chou, a celebrated Chinese singer whose songs have been accompanying the author.

REFERENCES

[1] Shang-Fu Chen, Yi-Chen Chen, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. 2018. Order-Free RNN With Visual Attention for Multi-Label Classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 6714-6721. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16114
[2] Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. 2018. Recurrent Attentional Reinforcement Learning for Multi-Label Image Recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 6730-6737. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16654
[3] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-Label Image Recognition With Graph Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 5177-5186. DOI:http://dx.doi.org/10.1109/CVPR.2019.00532
[4] Amanda Clare and Ross D. King. 2001. Knowledge Discovery in Multi-label Phenotype Data. In Principles of Data Mining and Knowledge Discovery, 5th European Conference, PKDD 2001, Freiburg, Germany, September 3-5, 2001, Proceedings (Lecture Notes in Computer Science), Luc De Raedt and Arno Siebes (Eds.), Vol. 2168. Springer, 42-53. DOI:http://dx.doi.org/10.1007/3-540-44794-6_4
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 248-255. DOI:http://dx.doi.org/10.1109/CVPR.2009.5206848
[6] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (2010), 303-338. DOI:http://dx.doi.org/10.1007/s11263-009-0275-4
[7] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). The Association for Computational Linguistics, 457-468. DOI:http://dx.doi.org/10.18653/v1/d16-1044
[8] Weifeng Ge, Sibei Yang, and Yizhou Yu. 2018. Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 1277-1286. DOI:http://dx.doi.org/10.1109/CVPR.2018.00139
[9] Zongyuan Ge, Dwarikanath Mahapatra, Suman Sedai, Rahil Garnavi, and Rajib Chakravorty. 2018. Chest X-rays Classification: A Multi-Label and Fine-Grained Problem. CoRR abs/1807.07247 (2018). arXiv:1807.07247 http://arxiv.org/abs/1807.07247
[10] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. 2014. Deep Convolutional Ranking for Multilabel Image Annotation. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1312.4894
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 770-778. DOI:http://dx.doi.org/10.1109/CVPR.2016.90
[12] Tao He and Xiaoming Jin. 2019. Image Emotion Distribution Learning with Graph Convolutional Networks. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, ICMR 2019, Ottawa, ON, Canada, June 10-13, 2019, Abdulmotaleb El-Saddik, Alberto Del Bimbo, Zhongfei Zhang, Alexander G. Hauptmann, K. Selçuk Candan, Marco Bertini, Lexing Xie, and Xiao-Yong Wei (Eds.). ACM, 382-390. DOI:http://dx.doi.org/10.1145/3323873.3326593
[13] Fenyu Hu, Yanqiao Zhu, Shu Wu, Liang Wang, and Tieniu Tan. 2019. Hierarchical Graph Convolutional Networks for Semi-supervised Node Classification. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, Sarit Kraus (Ed.). ijcai.org, 4532-4539. DOI:http://dx.doi.org/10.24963/ijcai.2019/630
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2261-2269. DOI:http://dx.doi.org/10.1109/CVPR.2017.243
[15] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. CoRR abs/1612.03651 (2016). arXiv:1612.03651 http://arxiv.org/abs/1612.03651
[16] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Hadamard Product for Low-rank Bilinear Pooling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=r1rhWnZkg
[17] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJU4ayYgl
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger (Eds.). 1106-1114. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
[19] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. 2018. Multi-Label Zero-Shot Learning With Structured Knowledge Graphs. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 1576-1585. DOI:http://dx.doi.org/10.1109/CVPR.2018.00170
[20] Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. 2019. CayleyNets: Graph Convolutional Neural Networks With Complex Rational Spectral Filters. IEEE Trans. Signal Processing 67, 1 (2019), 97-109. DOI:http://dx.doi.org/10.1109/TSP.2018.2879624
[21] Qing Li, Xiaojiang Peng, Yu Qiao, and Qiang Peng. 2019. Learning Category Correlations for Multi-label Image Recognition with Graph Networks. CoRR abs/1909.13005 (2019). arXiv:1909.13005 http://arxiv.org/abs/1909.13005
[22] Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. 2016. Human Attribute Recognition by Deep Hierarchical Contexts. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI (Lecture Notes in Computer Science), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.), Vol. 9910. Springer, 684-700. DOI:http://dx.doi.org/10.1007/978-3-319-46466-4_41
[23] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (Lecture Notes in Computer Science), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.), Vol. 8693. Springer, 740-755. DOI:http://dx.doi.org/10.1007/978-3-319-10602-1_48
[24] Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, and Heng Tao Shen. 2017. Compositional Model Based Fisher Vector Coding for Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 39, 12 (2017), 2335-2348. DOI:http://dx.doi.org/10.1109/TPAMI.2017.2651061
[25] Mateusz Malinowski and Mario Fritz. 2014. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (Eds.). 1682-1690. http://papers.nips.cc/paper/5411-a-multi-world-approach-to-question-answering-about-real-world-scenes-based-on-uncertain-input
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1301.3781
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 1532-1543. DOI:http://dx.doi.org/10.3115/v1/d14-1162
[28] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.1556
[29] Jiaxiang Tang, Wei Hu, Xiang Gao, and Zongming Guo. 2019. Joint Learning of Graph Representation and Node Features in Graph Convolutional Neural Networks. CoRR abs/1909.04931 (2019). arXiv:1909.04931 http://arxiv.org/abs/1909.04931
[30] Konstantinos Trohidis, Grigorios Tsoumakas, George Kalliris, and Ioannis P. Vlahavas. 2008. Multi-Label Classification of Music into Emotions. In ISMIR 2008, 9th International Conference on Music Information Retrieval, Drexel University, Philadelphia, PA, USA, September 14-18, 2008, Juan Pablo Bello, Elaine Chew, and Douglas Turnbull (Eds.). 325-330. http://ismir2008.ismir.net/papers/ISMIR2008_275.pdf
[31] Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-Label Classification: An Overview. IJDWM 3, 3 (2007), 1-13. DOI:http://dx.doi.org/10.4018/jdwm.2007070101
[32] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. 2016. CNN-RNN: A Unified Framework for Multi-label Image Classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2285-2294. DOI:http://dx.doi.org/10.1109/CVPR.2016.251
[33] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. 2017. Multi-label Image Recognition by Recurrently Discovering Attentional Regions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society, 464-472. DOI:http://dx.doi.org/10.1109/ICCV.2017.58
[34] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2016. HCP: A Flexible CNN Framework for Multi-Label Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 38, 9 (2016), 1901-1907. DOI:http://dx.doi.org/10.1109/TPAMI.2015.2491929
[35] Ruiqing Xu, Chao Li, Junchi Yan, Cheng Deng, and Xianglong Liu. 2019. Graph Convolutional Network Hashing for Cross-Modal Retrieval. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, Sarit Kraus (Ed.). ijcai.org, 982-988. DOI:http://dx.doi.org/10.24963/ijcai.2019/138
[36] Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2016. Exploit Bounding Box Annotations for Multi-Label Object Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 280-288. DOI:http://dx.doi.org/10.1109/CVPR.2016.37
[37] Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph Convolutional Networks for Text Classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 7370-7377. DOI:http://dx.doi.org/10.1609/aaai.v33i01.33017370
[38] Jing Yu, Yuhang Lu, Zengchang Qin, Weifeng Zhang, Yanbing Liu, Jianlong Tan, and Li Guo. 2018. Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval. In Advances in Multimedia Information Processing - PCM 2018 - 19th Pacific-Rim Conference on Multimedia, Hefei, China, September 21-22, 2018, Proceedings, Part I (Lecture Notes in Computer Science), Richang Hong, Wen-Huang Cheng, Toshihiko Yamasaki, Meng Wang, and Chong-Wah Ngo (Eds.), Vol. 11164. Springer, 223-234. DOI:http://dx.doi.org/10.1007/978-3-030-00776-8_21
[39] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Trans. Neural Netw. Learning Syst. 29, 12 (2018), 5947-5959. DOI:http://dx.doi.org/10.1109/TNNLS.2018.2817340
[40] Min-Ling Zhang and Zhi-Hua Zhou. 2014. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 26, 8 (2014), 1819-1837. DOI:http://dx.doi.org/10.1109/TKDE.2013.39
[41] Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, and Xiaogang Wang. 2017. Learning Spatial Regularization with Image-Level Supervisions for Multi-label Image Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2027-2036. DOI:http://dx.doi.org/10.1109/CVPR.2017.219