Fast Graph Convolution Network Based Multi-label Image Recognition via Cross-modal Fusion

Yangtao Wang, Yanzhao Xie, Yu Liu*, Ke Zhou, Xiaocui Li
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
{ytwbruce, yzxie, liu_yu, k.zhou, LXC}@hust.edu.cn
*Corresponding author: Yu Liu (liu_yu@hust.edu.cn)

ABSTRACT
In multi-label image recognition, it has become a popular method to predict those labels that co-occur in an image via modeling the label dependencies. Previous works focus on capturing the correlation between labels, but neglect to effectively fuse the image features and label embeddings, which severely affects the convergence efficiency of the model and inhibits the further precision improvement of multi-label image recognition. To overcome this shortcoming, in this paper, we introduce Multi-modal Factorized Bilinear pooling (MFB), which works as an efficient component to fuse cross-modal embeddings, and propose F-GCN, a fast graph convolution network (GCN) based multi-label image recognition model. F-GCN consists of three key modules: (1) an image representation learning module which adopts a convolution neural network (CNN) to learn and generate image representations, (2) a label co-occurrence embedding module which first obtains the label vectors via the word embeddings technique and then adopts GCN to capture label co-occurrence embeddings and (3) an MFB fusion module which efficiently fuses these cross-modal vectors to enable an end-to-end model with a multi-label loss function. We conduct extensive experiments on two multi-label datasets including MS-COCO and VOC2007. Experimental results demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the performance of image recognition has also been promoted compared with the state-of-the-art methods.

CCS CONCEPTS
• Computing methodologies → Image representations.

KEYWORDS
Multi-label Image Recognition; Graph Convolution Network; Cross-modal Fusion

ACM Reference Format:
Yangtao Wang, Yanzhao Xie, Yu Liu*, Ke Zhou, Xiaocui Li. 2020. Fast Graph Convolution Network Based Multi-label Image Recognition via Cross-modal Fusion. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3340531.3411880

1 INTRODUCTION
In recent years, multi-label image recognition has been one of the research hotspots in the computer vision community owing to its wide applications like human attribute recognition [22], music emotion categorization [30], medical diagnosis recognition [9], etc. Different from the conventional image classification task which only learns and predicts one label for each image, the task of multi-label image recognition brings greater challenges that call for more effective methods to recognize those objects that co-occur in an image.

Early multi-label classification algorithms [4, 31, 40] recognize each object in isolation and naively transform this problem into multiple binary classification tasks. Entering the stage of deep convolution neural networks (CNN) [18], image classification has made great progress, and the precision of existing multi-label image recognition methods has been promoted based on CNN and its variants [11, 14, 28]. However, the performance of these methods is essentially limited by ignoring the complex topology between objects in an image, which inhibits the further precision improvement of multi-label image recognition.

An effective method to solve this problem is to model the label dependencies to learn the objective law in the real world that related objects will be more likely to co-occur in an image. As shown in Figure 1, "person", "basketball" and "basketball hoop" will appear in an image at the same time with a high possibility, while "basketball" and "mountain" will rarely co-occur in the same image. Wang et al. [32] utilize the recurrent neural network (RNN) to model the label dependencies in a sequential fashion, but fail to comprehensively take the correlations between image labels and image regions into consideration. To compensate for this shortcoming, some other works [33, 41] explore the label dependencies via the attention mechanism, which considers limited local correlations between attended regions of a single image, but ignores the global correlations of the label distribution over all images. It is worth mentioning that Chen et al. [3] propose ML-GCN, which adopts the graph convolution network (GCN) [17, 20] to capture and learn the label dependencies according to the label statistical information and achieves good performance. Similar to ML-GCN, Li et al. [21] design A-GCN to capture label dependencies by constructing an adaptive label graph. However, both of them use the dot product (DP) to simply fuse the two-modal vectors, i.e., the features extracted from the CNN module and the label co-occurrence embeddings generated from the GCN module, which severely limits the convergence efficiency of the model as well as prevents the further precision improvement of multi-label image recognition.
In this paper, we introduce Multi-modal Factorized Bilinear pooling (MFB) [39], which works as an efficient component to fuse cross-modal vectors, and propose a fast GCN based model (termed as F-GCN) to fuse image representations and label embeddings for multi-label image recognition. Our F-GCN mainly contains an image representation learning module, a label co-occurrence embedding module and an MFB fusion module. In the image representation learning module, following ML-GCN, we use a CNN (i.e., ResNet-101 [11]) based model to obtain the latent representation of each image. At the same time, in the label co-occurrence embedding module, we take both the word vectors and the co-occurrence correlations of labels as input, then use GCN to train and learn the label embeddings that reflect the co-occurrence relationships between different objects. Different from previous studies, in the following, we fuse these two-modal vectors (i.e., image representations and label co-occurrence embeddings) via MFB to train an end-to-end multi-label image recognition model with a multi-label loss function. We conduct extensive experiments on two multi-label datasets including MS-COCO [23] and VOC2007 [6]. Experimental results demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the performance of image recognition has also been promoted compared with the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 talks about several related works. We formulate and introduce our F-GCN in detail in Section 3. Section 4 presents the comparison experiments, ablation study as well as visual retrieval results of F-GCN. At last, we conclude this paper in Section 5.

2 RELATED WORKS
In this section, we first review existing multi-label image recognition methods, then discuss recent GCN based studies, and finally introduce several representative cross-modal fusion works.

2.1 Multi-label Image Recognition
With the development of deep neural networks, image recognition has achieved great success in the past few years on large-scale hand-crafted datasets like ImageNet [5], MS-COCO [23], etc. Powerful CNN based models [11, 14, 28] can extract the visual features of each image and obtain remarkable performance on single-label classification tasks. Furthermore, researchers have made great efforts to explore deep networks to promote the performance of multi-label image recognition.

Early multi-label image recognition methods naively divide this task into multiple independent binary classification tasks, which train a set of classifiers for each label. On the one hand, these methods suffer from the explosion of the label space. Take VOC2007 as an example: there are 20 object classes in this dataset, which calls for 2^20 classifiers if every combination of labels is treated as a single-label class. On the other hand, they treat each object in isolation and thus neglect the topology between objects in a multi-label image. For instance, "person", "basketball" and "basketball hoop" will appear in an image at the same time with a high possibility, while "basketball" and "mountain" will rarely co-occur in the same image. Therefore, a large number of combinations of labels will hardly occur in the real world.

Various approaches have been considered to explore the label dependencies to reduce and optimize the label prediction space. Gong et al. [10] adopt a deep convolution architecture to learn the approximate top-k ranking objective function for multi-label image recognition. Wang et al. [32] combine CNN with RNN to model the label dependencies in a sequential fashion by embedding semantic labels into vectors. Besides, others propose to utilize the attention mechanism to capture the correlation between labels. Zhu et al. [41] propose a spatial regularization network to capture both semantic and spatial relations of these multiple labels based on weighted attention maps. Wang et al. [33] introduce a spatial transformer layer and long short-term memory (LSTM) units to capture the label correlation. In addition, a graph based framework [19] has been proposed to describe the relationships between labels via knowledge graphs, which aims to generate more accurate image representations.

2.2 Graph Convolution Network
The basic idea of the graph convolution network (GCN) [17] is to update one node's feature based on the features of the node itself and other related neighbor nodes according to the correlation matrix of the graph. By learning the structural similarities between training data points, GCN can integrate the relationships into data features. Formally, GCN takes the correlation matrix A as well as the feature matrix X as input, and produces the node-level output. The forward propagation process in GCN is described as:

    H^{l+1} = a^l(\hat{A} H^l W^l),    (1)

where Â denotes the normalized version of the correlation matrix A, and H^l, W^l and a^l respectively denote the input, weight and non-linear activation function (like Sigmoid or ReLU) of the l-th graph convolution layer (a minimal PyTorch sketch of such a layer is given at the end of this subsection).

As a deep learning technique that effectively learns and extracts relationships, GCN has been widely applied to relational feature extraction, node classification prediction and information retrieval tasks [13, 29, 37]. Cross-modal works regard each feature of text or image as a node representation, and complete the learning process according to the mutual relationships. Yu et al. [38] propose to utilize GCN for text modeling and another neural network for image modeling, which achieves significant improvement with a pairwise loss function. GCH [35], another cross-modal research effort, learns modality-unified binary codes via an affinity graph, then adopts GCN to map hash codes by the relationships between nodes. It is worth mentioning that ML-GCN [3] can achieve remarkable performance on multi-label classification tasks. It treats each object in the image as a node, constructs a graph among these nodes, and finally uses GCN to learn the probability of different objects appearing in an image, which explores the label correlation dependency and promotes the precision of image retrieval. Similar to ML-GCN, A-GCN [21] constructs an adaptive label graph to capture label dependencies for image recognition, and EmotionGCN [12] models the correlations between emotions via GCN for emotion distribution learning.
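For concreteness, one graph convolution layer of Equation (1) can be sketched in PyTorch as follows. This is an illustrative reconstruction rather than code from any of the cited papers; the symmetric normalization Â = D^{-1/2}(A + I)D^{-1/2} follows the scheme of Kipf and Welling [17].

    import torch
    import torch.nn as nn

    class GraphConvolution(nn.Module):
        """One GCN layer: H^{l+1} = a^l(A_hat H^l W^l), cf. Equation (1)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
            nn.init.xavier_uniform_(self.weight)

        def forward(self, h, a_hat):
            # h: (C, in_dim) node features; a_hat: (C, C) normalized correlation matrix
            return torch.relu(a_hat @ h @ self.weight)

    def normalize(a):
        """A_hat = D^{-1/2} (A + I) D^{-1/2}, the normalization of Kipf and Welling [17]."""
        a = a + torch.eye(a.size(0))      # add self-loops
        d = a.sum(dim=1).pow(-0.5)        # D^{-1/2} as a vector
        return d.unsqueeze(1) * a * d.unsqueeze(0)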


Figure 1: Related objects are more likely to appear in the same image. For example, we can always see the superstar Kobe Bryant and his teammates playing basketball on TV (in the first three images). These images are common in the real world, which illustrates that the combination of "person", "basketball" and "basketball hoop" tends to co-occur at the same time. However, we hardly see a "basketball" in the "mountain", and they are rarely tied in one image, because there is no direct relationship between these two objects.

2.3 Cross-modal Fusion
Cross-modal feature fusion methods have been proposed to solve the visual question answering (VQA) problem [25], which usually use concatenation or element-wise summation to fuse the image and question representations. Fukui et al. [7] first propose Multi-modal Compact Bilinear pooling (MCB), which introduces the bilinear model to fuse multi-modal features by using the outer product of two vectors in different modalities, producing a very high-dimensional feature for quadratic expansion. To reduce the high-dimension computation, Kim et al. [16] propose the Multi-modal Low-rank Bilinear pooling (MLB) approach based on the Hadamard product of two feature vectors, which can achieve comparable performance to MCB but may lead to a low convergence rate. Furthermore, Yu et al. [39] introduce the Multi-modal Factorized Bilinear pooling (MFB) model to efficiently fuse image and text embeddings, which produces a remarkable result in VQA as well as speeds up the model convergence.

Motivated by the above studies, our work adopts the MFB component to fuse the image representations and label co-occurrence embeddings respectively generated from the CNN and GCN modules. With the proposed F-GCN, the convergence efficiency of the model has been greatly promoted. We also demonstrate that our F-GCN works as an effective model to learn the label dependencies and can be trained in an end-to-end manner with remarkable performance compared with the state-of-the-art methods.

3 PROPOSED METHODOLOGY
Based on previous studies, in this section, we propose F-GCN, a fast GCN based multi-label image recognition framework that adopts the cross-modal component (i.e., MFB) to fuse image representations and label co-occurrence embeddings. Our F-GCN mainly consists of three modules: the image representation learning module, the label co-occurrence embedding module and the MFB fusion module. In the following, we first present the overall framework of F-GCN in Figure 2, and then respectively introduce the workflow of these three modules in detail.

3.1 Overall Framework
For convenience, we list the preliminary notations in Table 1. As shown in Figure 2, there are three key modules in F-GCN: a CNN module for image feature extraction, a GCN module for label co-occurrence embedding generation and an MFB fusion module for cross-modal vector fusion.

Table 1: Preliminary notations used in this paper.

    Notation   Explanation
    N          the number of input images
    x_i        the i-th image
    f_i        the i-th image's representation
    l_i        the i-th image's ground truth label
    y_i        the i-th image's predicted label
    C          the number of object categories
    o_i        the i-th object in the label set
    T_i        the occurrence times of the i-th object
    T_ij       the co-occurrence times of the i-th and j-th objects
    d          the dimension of each object's word embedding vector
    D          the dimension of each image's representation
    Z          a C × d object word embeddings matrix
    A          a C × C label correlation matrix
    W          a C × D label co-occurrence embeddings matrix
    W_j        the j-th row vector of W
    U          the weights in GCN propagation
    M          the dimension of M_1 (or M_2) in MFB fusion
    g          the number of units in each group of group sum-pooling


Figure 2: The overall framework of our proposed F-GCN is comprised of three key modules: the image representation learning module, the label co-occurrence embedding module and the MFB fusion module. The image representation module uses a CNN backbone (i.e., ResNet-101) to train and extract the representation (feature) of each image. The label co-occurrence embedding module designs a two-layer GCN to learn the co-occurrence embeddings that reflect the label dependencies according to the label statistical information. The MFB module efficiently fuses these two cross-modal vectors by means of group sum-pooling. The overall network is trained in an end-to-end manner using the Multi-Label Loss function.

Given a dataset consisting of N images, the i-th image x_i is input to a CNN module (i.e., ResNet-101 [11]) to extract its image representation f_i (a D-dimension feature vector) from the "conv5_x" layer of this network. At the same time, the GCN module takes both the object word embeddings matrix Z and the label correlation matrix A as input, then adopts a two-layer GCN to learn and capture the label co-occurrence embeddings matrix W. After obtaining the image representations and label co-occurrence embeddings, we design the MFB fusion module to efficiently fuse these two cross-modal vectors (f_i and W) via group sum-pooling, and use a fully connected (fc) layer to generate the predicted label y_i. Finally, the multi-label loss function is adopted to compute the loss between y_i and the ground truth label l_i ∈ {0, 1}^C and to train the whole network in an end-to-end manner.

3.2 Image Representation Learning Module
Following ML-GCN [3], we adopt one of the state-of-the-art CNN based networks (i.e., ResNet-101 [11]) to complete the feature extraction of an image in this module. As shown in the orange frame of Figure 2, we first remove the last fc layer and softmax layer of ResNet-101, and then use this sub-network to generate the D-dimensional representation of each image x_i. For ResNet-101, the output dimension D is 2048. Therefore, for an input image x_i with 448 × 448 resolution, we can obtain 2048 × 14 × 14 feature maps from the "conv5_x" layer. At last, we adopt global max-pooling to generate the image representation f_i, which is a 2048-dimension feature vector.
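This truncation can be sketched with torchvision, where the "conv5_x" stage is named layer4; the snippet below is a minimal illustration under these naming assumptions, not the authors' training code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    resnet = models.resnet101(pretrained=True)
    # Keep everything up to and including the "conv5_x" stage (layer4 in
    # torchvision); drop the final average pooling and the fc/softmax head.
    backbone = nn.Sequential(*list(resnet.children())[:-2])

    x = torch.randn(1, 3, 448, 448)                          # one 448 x 448 input image
    feature_maps = backbone(x)                               # (1, 2048, 14, 14)
    f_i = F.adaptive_max_pool2d(feature_maps, 1).flatten(1)  # global max-pooling -> (1, 2048)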
3.3 Label Co-occurrence Embedding Module
In this part, we use GCN to learn the label co-occurrence embeddings according to the relationship between different objects.

Different from the original GCN, which was proposed to solve the node classification problem, we treat and design the node-level output of our F-GCN as a classifier corresponding to each label. Specifically, we aim to map the object dependencies of a dataset to label co-occurrence embeddings in our task. The input of GCN calls for the feature vector of each node and the correlation matrix between these nodes. As shown in the blue frame of Figure 2, we adopt the GloVe [27] model to transform each object (totally C objects in a dataset) into a d-dimensional (i.e., 300-dimensional) word vector. Therefore, we can obtain a C × d object word embeddings matrix Z. For example, there are 20 object categories in VOC2007, so the input feature matrix Z for the first GCN layer will be a 20 × 300 matrix.

In addition to obtaining the feature vector of each node (object), another essential issue in F-GCN is to construct the label correlation matrix A between these nodes. In the implementation, we capture the label dependencies and construct matrix A according to the label statistical information over the whole dataset. Specifically, for ∀i ∈ [1, C], we collect the occurrence times (i.e., T_i) of the i-th object (i.e., o_i) as well as the co-occurrence times (i.e., T_ij, which equals T_ji) of o_i and o_j. Furthermore, the label dependencies can be formulated by the conditional probability as follows:

    P_{ij} = P(o_i | o_j) = \frac{T_{ij}}{T_j},    (2)

where P_ij denotes the probability that o_i occurs under the condition that o_j appears. Note that P_ij is not equal to P_ji owing to the fact that the conditional probability between two objects is asymmetric. Based on this, we can construct the correlation matrix A below:

    A_{ij} = P_{ij},    (3)
where A_ij denotes the element in the i-th row and j-th column of matrix A. However, similar to ML-GCN, if we directly use this correlation matrix to train the model, the rarely co-occurring objects will become noise that affects the data distribution as well as the model convergence. To filter this noise, we choose to use a threshold ϵ to binarize the above matrix A:

    A_{ij} = \begin{cases} 0, & \text{if } P_{ij} \le \epsilon \\ 1, & \text{otherwise} \end{cases}    (4)

where ϵ ∈ [0, 1]. Besides, when using GCN to update the node's feature in the propagation process, the binary correlation matrix may lead to the over-smoothing problem, which makes the generated nodes' features indistinguishable. Therefore, we adopt a weighted scheme to calculate the final correlation matrix as:

    A_{ij} = \begin{cases} \delta A_{ij} \Big/ \sum_{j=1, j \neq i}^{C} A_{ij}, & \text{if } i \neq j \\ 1 - \delta, & \text{otherwise} \end{cases}    (5)

where δ ∈ [0, 1]. In this way, we can use this weighted correlation matrix A to update the node's feature by choosing a suitable δ.
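Equations (2)–(5) condense into a short construction routine. The sketch below is our reading of this pipeline, assuming label_mat is the N × C binary ground-truth matrix gathered from the training set (the variable names are ours, not the paper's):

    import torch

    def build_correlation(label_mat, eps=0.4, delta=0.2):
        # label_mat: (N, C) binary ground-truth labels of the training set
        t = label_mat.t().float() @ label_mat.float()   # T_ij counts; the diagonal holds T_i
        p = t / t.diagonal().clamp(min=1).unsqueeze(0)  # Equation (2): P_ij = T_ij / T_j
        a = (p > eps).float()                           # Equations (3)-(4): binarize with eps
        a.fill_diagonal_(0)                             # the i == j case is handled below
        a = delta * a / a.sum(dim=1, keepdim=True).clamp(min=1)  # Equation (5), i != j
        a = a + (1 - delta) * torch.eye(a.size(0))      # Equation (5), i == j
        return a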
After obtaining both the object word embeddings matrix Z and the label correlation matrix A, we design a two-layer GCN to propagate this relationship, and each GCN layer can be described as:

    Z^{l+1} = f^l(\hat{A} Z^l U^l), \quad l \in \{0, 1\},    (6)

where Â (see [17] for details) denotes the normalized version of the correlation matrix A, Z^l denotes the latent features of the C nodes in the l-th layer, U^l denotes the weights of the l-th layer, and f^l(·) denotes the non-linear operation, which is a ReLU function. Note that Z is the input of this sub-network, and the output is a C × D label co-occurrence embeddings matrix W, of which each row will be fused with the image representation f_i in the next MFB fusion module.
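Stacking two such graph convolution layers (reusing the GraphConvolution sketch from Section 2.2, with the 300 → 1024 → 2048 dimensions reported in Section 4.2) gives an illustrative version of this sub-network:

    import torch.nn as nn

    class LabelGCN(nn.Module):
        """Two-layer GCN mapping Z (C x 300) to W (C x 2048), cf. Equation (6)."""
        def __init__(self, d=300, hidden=1024, out=2048):
            super().__init__()
            self.gc1 = GraphConvolution(d, hidden)   # from the sketch in Section 2.2
            self.gc2 = GraphConvolution(hidden, out)

        def forward(self, z, a_hat):
            # z: (C, d) GloVe label vectors; a_hat: normalized correlation matrix
            return self.gc2(self.gc1(z, a_hat), a_hat)   # output: W, shape (C, 2048)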
3.4 MFB Fusion Module
MFB [39] has been proposed to work as an effective component to fuse cross-modal features, which is usually implemented by combining some commonly-used layers such as fc, element-wise multiplication and pooling layers. Different from the previous works that use DP to simply fuse the cross-modal vectors, in this part, we adopt MFB to efficiently fuse image representations and label co-occurrence embeddings, which helps achieve higher performance of F-GCN.

As shown in the red frame of Figure 2, on the one hand, the Hadamard product increases the interaction between different modal vectors, which promotes the precision of F-GCN. On the other hand, group sum-pooling reduces over-fitting and parameter explosion, which speeds up the convergence of F-GCN. The input of MFB consists of two parts: f_i and W. Note that in each fusion, we use f_i and one row vector of W to generate one element of y_i.

Formally, given the i-th image representation f_i, for ∀j ∈ [1, C], we fuse f_i and W_j to generate the j-th element of y_i, where W_j is stated to be the j-th row vector of W in Table 1. First, we respectively use two fc layers to transform f_i to an M-dimensional vector M_1 and W_j to an M-dimensional vector M_2. Then these two modal vectors are fused to generate an M-dimensional vector M_1 ∘ M_2, where ∘ is the Hadamard product. To speed up the convergence, we use group sum-pooling to convert M_1 ∘ M_2 to an M/g-dimensional vector, where each group containing g units is sequentially mapped into one unit. Finally, we adopt an fc layer to generate the j-th element of y_i. We will obtain a complete predicted label vector y_i after C such fusions with f_i. Note that all the fc layers are shared by each fusion.
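Using broadcasting, all C per-label fusions can be carried out at once; the sketch below (with M = 358 and g = 2 from Section 4.2) is an assumed implementation of this mechanism rather than the released code. Since the fc layers are shared across labels, the C fusions reduce to a single batched operation.

    import torch
    import torch.nn as nn

    class MFBFusion(nn.Module):
        def __init__(self, img_dim=2048, lbl_dim=2048, m=358, g=2):
            super().__init__()
            self.g = g
            self.fc_img = nn.Linear(img_dim, m)   # f_i -> M_1
            self.fc_lbl = nn.Linear(lbl_dim, m)   # W_j -> M_2
            self.fc_out = nn.Linear(m // g, 1)    # shared fc producing one logit per label

        def forward(self, f, w):
            # f: (B, 2048) image representations; w: (C, 2048) label embeddings
            m1 = self.fc_img(f).unsqueeze(1)      # (B, 1, M)
            m2 = self.fc_lbl(w).unsqueeze(0)      # (1, C, M)
            fused = m1 * m2                       # Hadamard product M_1 o M_2, (B, C, M)
            b, c, m = fused.shape
            pooled = fused.view(b, c, m // self.g, self.g).sum(-1)  # group sum-pooling
            return self.fc_out(pooled).squeeze(-1)                  # logits y, (B, C)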
Finally, we adopt PyTorch's MultiLabelSoftMarginLoss (https://pytorch.org/docs/master/nn.html?highlight=multilabelsoft#torch.nn.MultiLabelSoftMarginLoss), termed as the Multi-label loss, to update the whole network in an end-to-end manner. The training loss function is described as:

    L(y_i, l_i) = -\frac{1}{C} \sum_{j=1}^{C} \left[ l_{ij} \log\big((1 + \exp(-y_{ij}))^{-1}\big) + (1 - l_{ij}) \log\frac{\exp(-y_{ij})}{1 + \exp(-y_{ij})} \right],    (7)

where y_ij and l_ij respectively denote the j-th element of y_i and l_i.
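Equation (7) is exactly the quantity computed by torch.nn.MultiLabelSoftMarginLoss, which can be verified numerically:

    import torch

    y = torch.randn(4, 80)                     # logits for a batch of 4 images, C = 80
    l = torch.randint(0, 2, (4, 80)).float()   # binary ground-truth labels

    builtin = torch.nn.MultiLabelSoftMarginLoss()(y, l)
    # Equation (7) written out term by term; sigmoid(y) = (1 + exp(-y))^{-1}
    manual = -(l * torch.log(torch.sigmoid(y))
               + (1 - l) * torch.log(torch.exp(-y) / (1 + torch.exp(-y)))).mean()
    assert torch.allclose(builtin, manual, atol=1e-5)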
4 EXPERIMENTS
In this section, we evaluate the performance of F-GCN and compare it with the state-of-the-art image recognition methods. We first describe the datasets, then introduce the implementation details and evaluation metrics, and finally present the experimental results of F-GCN.

4.1 Datasets
MS-COCO [23] is a popular multi-label dataset for image recognition, segmentation and captioning, which contains 118,287 training images, 40,504 validation images and 40,775 test images, where each image is labeled with about 2.9 object labels on average from the 80 semantic categories. Since the ground truth labels of the test set are not available, we train our model on the train set and evaluate the performance on the validation set.

VOC2007 [6] consists of 9,963 multi-label images and 20 object classes, which is divided into train, validation and test sets. On average, each image is annotated with 1.5 labels. We use both the train and validation sets to train our model, and then evaluate the performance on the test set.

4.2 Implementation Details and Evaluation Metrics
Implementation details. All experiments are implemented with PyTorch. In the image representation module, each image is resized into 448 × 448 using random horizontal flips, and the output dimension from ResNet-101 is D = 2048. In the label co-occurrence embedding module, our F-GCN consists of a two-layer GCN with output dimensions of 1024 and 2048, where the initial label word embedding is a 300-dimensional vector generated by the GloVe model pre-trained on the Wikipedia dataset. Note that we use the average embeddings of all words as the label word vector if a label is expressed by multiple words. To construct the correlation matrix, we respectively set ϵ = 0.4 in Equation 4 and δ = 0.2 in Equation 5. In the MFB fusion module, we set M = 358 to fuse cross-modal embeddings and g = 2 to complete group sum-pooling. The whole network is updated by stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 10^-4, an initial learning rate of 0.1 which decays by a factor of 10 every 40 epochs, and a batchsize of 32.
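This training schedule maps directly onto PyTorch's SGD optimizer and StepLR scheduler; the model below is only a stand-in for the full F-GCN network:

    import torch

    model = torch.nn.Linear(2048, 80)   # stand-in for the full F-GCN network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Decay the learning rate by a factor of 10 every 40 epochs
    # (call scheduler.step() once per epoch in the training loop).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)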


Table 2: Performance comparisons of F-GCN with the state-of-the-art methods on MS-COCO.

Method               mAP | CP   CR   CF1  OP   OR   OF1  (All) | CP   CR   CF1  OP   OR   OF1  (Top-3)
CNN-RNN [32] 61.2 - - - - - - 66.0 55.6 60.4 69.2 66.4 67.8
RNN-Attention [33] - - - - - - - 79.1 58.7 67.4 84.0 63.0 72.0
Order-Free RNN [1] - - - - - - - 71.6 54.8 62.1 74.2 62.2 67.7
ML-ZSL [19] - - - - - - - 74.1 64.5 69.0 - - -
SRN [41] 77.1 81.6 65.4 71.2 82.7 69.9 75.8 85.2 58.8 67.4 87.4 62.5 72.9
Multi-Evidence [8] - 80.4 70.2 74.9 85.2 72.5 78.4 84.5 62.2 70.6 89.1 64.3 74.7
ResNet-101 [11] 77.3 80.2 66.7 72.8 83.9 70.8 76.8 84.1 59.4 69.7 89.1 62.8 73.6
ML-GCN [3] (DP) 83.0 85.1 72.0 78.0 85.8 75.4 80.3 89.2 64.1 74.6 90.5 66.5 76.7
A-GCN [21] (DP) 83.1 84.7 72.3 78.0 85.6 75.5 80.3 89.0 64.2 74.6 90.5 66.3 76.6
F-GCN (MFB) 83.2 85.4 72.4 78.3 86.0 75.7 80.5 89.3 64.3 74.7 90.5 66.6 76.7

Table 3: AP and mAP comparisons of F-GCN with the state-of-the-art methods on VOC2007.

Method               AP: aero bike bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv | mAP
CNN-RNN [32] 96.7 83.1 94.2 92.8 61.2 82.1 89.1 94.2 64.2 83.6 70.0 92.4 91.7 84.2 93.7 59.8 93.2 75.3 99.7 78.6 84.0
RLSD [24] 96.4 92.7 93.8 94.1 71.2 92.5 94.2 95.7 74.3 90.0 74.2 95.4 96.2 92.1 97.9 66.9 93.5 73.7 97.5 87.6 88.5
VeryDeep [28] 98.9 95.0 96.8 95.4 69.7 90.4 93.5 96.0 74.2 86.6 87.8 96.0 96.3 93.1 97.2 70.0 92.1 80.3 98.1 87.0 89.7
ResNet-101 [11] 99.5 97.7 97.8 96.4 65.7 91.8 96.1 97.6 74.2 80.9 85.0 98.4 96.5 95.9 98.4 70.1 88.3 80.2 98.9 89.2 89.9
FeV+LV [36] 97.9 97.0 96.6 94.6 73.6 93.9 96.5 95.5 73.7 90.3 82.0 95.4 97.7 95.9 98.6 77.6 88.7 78.0 98.3 89.0 90.6
HCP [34] 98.6 97.1 98.0 95.6 75.3 94.7 95.8 97.3 73.1 90.2 80.0 97.4 96.1 94.9 96.3 78.3 94.7 76.2 97.9 91.5 90.9
RNN-Attention [33] 98.6 97.4 96.3 96.2 75.2 92.4 96.5 97.1 76.5 92.0 87.7 96.8 97.5 93.8 98.5 81.6 93.7 82.8 98.6 89.3 91.9
Atten-Reinforce [2] 98.6 97.1 97.1 95.5 75.6 92.8 96.8 97.3 78.3 92.2 87.6 96.9 96.5 93.6 98.5 81.6 93.1 83.2 98.5 89.3 92.0
ML-GCN [3] (DP) 99.5 98.5 98.6 98.1 80.8 94.6 97.2 98.2 82.3 95.7 86.4 98.2 98.4 96.7 99.0 84.7 96.7 84.3 98.9 93.7 94.0
A-GCN [21] (DP) 99.4 98.5 98.6 98.0 80.8 94.7 97.2 98.2 82.4 95.5 86.4 98.2 98.4 96.7 98.9 84.8 96.6 84.4 98.9 93.7 94.0
F-GCN (MFB) 99.5 98.5 98.7 98.2 80.9 94.8 97.3 98.3 82.5 95.7 86.6 98.2 98.4 96.7 99.0 84.8 96.7 84.4 99.0 93.7 94.1

Evaluation metrics. We use the conventional image recognition evaluation metrics including the mean of class-average precision (mAP), overall precision (OP), recall (OR), F1 (OF1), and average per-class precision (CP), recall (CR), F1 (CF1). For each image, a label is predicted as positive if its confidence is greater than 0.5. In addition, we also present these evaluation results over the top-3 labels for fair comparisons.
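A sketch of how these metrics can be computed from predicted confidences; this reflects the standard protocol as we read it, not the authors' evaluation script:

    import numpy as np

    def multilabel_metrics(scores, labels, top3=False):
        # scores, labels: (N, C) arrays of confidences and binary ground truth
        if top3:
            pred = np.zeros_like(scores)
            np.put_along_axis(pred, np.argsort(-scores, axis=1)[:, :3], 1.0, axis=1)
        else:
            pred = (scores > 0.5).astype(float)               # 0.5 confidence threshold
        tp = (pred * labels).sum(axis=0)                      # true positives per class
        op = tp.sum() / max(pred.sum(), 1)                    # overall precision (OP)
        orec = tp.sum() / max(labels.sum(), 1)                # overall recall (OR)
        cp = (tp / np.maximum(pred.sum(axis=0), 1)).mean()    # per-class precision (CP)
        cr = (tp / np.maximum(labels.sum(axis=0), 1)).mean()  # per-class recall (CR)
        of1 = 2 * op * orec / (op + orec)                     # OF1
        cf1 = 2 * cp * cr / (cp + cr)                         # CF1
        return op, orec, of1, cp, cr, cf1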

4.3 Experimental Results
In this part, we first show the convergence efficiency of F-GCN, then compare it with the state-of-the-art image recognition methods as well as conduct an ablation study to explore the influence of different parameters and components on the model, and finally give the visual retrieval results.

Figure 3: mAP on the test set with the increase of epochs on the training set. (a) MS-COCO; (b) VOC2007.

4.3.1 Convergence efficiency. We compare the convergence efficiency of F-GCN with that of ML-GCN. For fair comparisons, we employ the same training parameters (SGD, learning rate, batchsize, etc.), loss function (Multi-label loss function) and datasets (MS-COCO and VOC2007) as ML-GCN. In Figure 3, we show the convergence trend of mAP on the test set when increasing the epochs on the training set. As we see, F-GCN has converged at the 17-th and 15-th epoch on MS-COCO and VOC2007 respectively, and produces the higher mAP of 83.2% and 94.1%. However, at the same epochs, ML-GCN has not yet converged and its mAP values are respectively 38.89% and 72.91% lower than our F-GCN. In addition, ML-GCN takes about 200 epochs (more than 11 times that of F-GCN) to complete its training process. This result verifies that our MFB component efficiently fuses the cross-modal embeddings and greatly promotes the convergence efficiency of F-GCN.

4.3.2 Comparisons with the state-of-the-art methods. In this part, we respectively conduct experiments on MS-COCO and VOC2007 to compare the performance of F-GCN with the state-of-the-art methods. Note that, for fair comparisons, we implement our F-GCN using the same feature extraction network (i.e., ResNet-101) and weighted correlation matrix as ML-GCN.


Results on MS-COCO. We compare F-GCN with the state-of-the-art methods including CNN-RNN [32], RNN-Attention [33], Order-Free RNN [1], ML-ZSL [19], SRN [41], Multi-Evidence [8], ResNet-101 [11], ML-GCN [3] and A-GCN [21]. We list the comparison results on MS-COCO in Table 2, including the evaluation metrics over the whole dataset and the top-3 labels. Obviously, F-GCN outperforms all candidate methods on almost all metrics. Specifically, F-GCN greatly promotes the performance compared with the ResNet-101 baseline, which shows GCN plays a crucial role in integrating the label dependencies into image representations. In addition, compared with ML-GCN and A-GCN that use DP to fuse image representations and label co-occurrence embeddings, our F-GCN further improves mAP and the other metrics, which verifies that MFB in our framework can effectively fuse cross-modal embeddings.

Results on VOC2007. We compare F-GCN with the state-of-the-art methods including CNN-RNN [32], RLSD [24], VeryDeep [28], ResNet-101 [11], FeV+LV [36], HCP [34], RNN-Attention [33], Atten-Reinforce [2], ML-GCN [3] and A-GCN [21]. We list the AP and mAP results on VOC2007 in Table 3. Similarly, our F-GCN greatly outperforms the baseline (ResNet-101) and other well-known methods in all metrics except for a lower result on the "table", "dog" and "train" objects. Note that we generate a higher mAP than both ML-GCN and A-GCN, which demonstrates F-GCN has a good effect on multi-label image recognition.

In general, according to the comparison results, F-GCN not only greatly speeds up the convergence efficiency but also promotes the performance of multi-label image recognition via cross-modal fusion.

4.3.3 Ablation study. In this section, we conduct an ablation study to explore the influence of different parameters and components on our model, including the image representation extraction model, word embedding methods for label vectors, the parameters ϵ, δ in the correlation matrix, different numbers of layers in GCN, the dimension M in cross-modal fusion and the number of units g in group sum-pooling. Note that we use mAP, CF1, OF1, CF1-3 and OF1-3 as the evaluation metrics of our F-GCN.

Image representation extraction model. In this part, we evaluate the performance of F-GCN by comparing two commonly-used feature extraction CNN based models: VGG [28] and ResNet-101 [11]. We list the results on MS-COCO (all labels and top-3 labels) and VOC2007 in Table 4. As we see, ResNet-101 produces a higher performance than VGG. This may lie in the fact that the deeper ResNet-101 has a stronger ability to extract features, which will be fused with the label co-occurrence embeddings to generate a better model.

Table 4: Results of different image feature extraction models.

    Model        mAP  CF1  OF1  CF1-3 OF1-3 (MS-COCO) | mAP (VOC2007)
    VGG          82.1 77.1 78.0 73.5  75.7            | 92.9
    ResNet-101   83.2 78.3 80.5 74.7  76.7            | 94.1

Word embedding methods for label vectors. In this part, we evaluate the performance of F-GCN by trying several popular word embedding methods including GloVe [27], GoogleNews [26], FastText [15] and the simple one-hot encoding technique to generate label vectors. We list the experimental results on MS-COCO and VOC2007 in Table 5. As we see, different word embedding methods have a slight impact on the results of F-GCN, except that GloVe gains a small preponderance over the others. This illustrates that the input correlations of GCN will not severely affect the result, but it is GCN that plays a crucial role in propagating and capturing the label dependencies. Of course, we believe more powerful word embeddings can maintain the semantic topology between objects and our F-GCN may benefit from a better scheme. Given this, we choose GloVe to obtain the label vectors by default.

Table 5: Results of different word embedding methods (WEM) for label vectors.

    WEM          mAP  CF1  OF1  CF1-3 OF1-3 (MS-COCO) | mAP (VOC2007)
    GloVe        83.2 78.3 80.5 74.7  76.7            | 94.1
    GoogleNews   83.1 78.2 80.4 74.5  76.7            | 94.0
    FastText     83.2 78.2 80.3 74.5  76.6            | 94.0
    OneHot       83.1 78.2 80.3 75.5  76.6            | 94.0

Parameters ϵ and δ in the correlation matrix. In this part, we evaluate the performance of F-GCN by using different values of ϵ (Equation 4) and δ (Equation 5) to construct the weighted correlation matrix.

Figure 4: The change of mAP using different values of ϵ. (a) MS-COCO; (b) VOC2007.

Figure 5: The change of mAP using different values of δ. (a) MS-COCO; (b) VOC2007.


Parameter ϵ is used to filter the noise data (rare co-occurrence probabilities) to enable a better model. As shown in Figure 4, we vary ϵ from 0.1 to 1 to observe the effect and find that F-GCN achieves the highest mAP on both MS-COCO and VOC2007 when ϵ = 0.4. This may result from the fact that ϵ = 0.4 is a better balance parameter, which not only reduces the small-probability data points but also preserves the correlation between objects.

In addition, we also explore the influence of parameter δ. As we mentioned, δ is used to avoid the over-smoothing problem, which may make the generated nodes' features indistinguishable. We vary δ from 0 to 1 to observe the effect and find F-GCN achieves the highest mAP on both MS-COCO and VOC2007 when δ = 0.2. Note that F-GCN will not converge when setting δ = 1. If we use a larger δ, the feature of the node itself may be ignored in the propagation process. Conversely, a too small δ will make F-GCN ignore the correlation between a node and its neighbor nodes. The result demonstrates δ = 0.2 can well balance this correlation.

Different number of layers in GCN. In this part, we evaluate the performance of F-GCN by changing the number of layers in GCN. We respectively use 2-layer (with output dimensions of 1024 and 2048), 3-layer (with output dimensions of 1024, 1024 and 2048) and 4-layer (with output dimensions of 1024, 1024, 1024 and 2048) GCN to train our model. We list the comparison results in Table 6. As we see, with the increase of layers, the performance begins to decrease on both MS-COCO and VOC2007. This may result from the fact that in the propagation process, the features of nodes will be repeatedly accumulated with more GCN layers, so that the output features become indistinguishable, thereby reducing the performance of image recognition.

Table 6: Different number of layers in GCN.

    # layers   mAP  CF1  OF1  CF1-3 OF1-3 (MS-COCO) | mAP (VOC2007)
    2 layers   83.2 78.3 80.5 74.7  76.7            | 94.1
    3 layers   82.4 76.9 79.8 73.7  76.3            | 93.7
    4 layers   82.3 76.7 79.6 73.1  75.9            | 93.2

The dimension M in cross-modal fusion. In this part, we evaluate the performance of F-GCN by changing the dimension M when fusing the image representations and label co-occurrence embeddings. The input of the MFB fusion module consists of pairs of 2048-dimensional vectors, which will be reduced into M dimensions via fc layers. We vary M from 64 to 1024 with a step size of 64. As shown in Figure 6, the performance of F-GCN improves with the increase of M until M exceeds 384 on MS-COCO and VOC2007. Perhaps M ∈ [320, 384] not only plays a better role in dimensionality reduction, but also efficiently fuses the cross-modal vectors. In fact, in the experiment, we find M = 358 brings a good result for both efficiency and precision. We believe a more detailed perspective on the effect of M would be obtained if the interval were divided more finely.

Figure 6: The change of mAP using different dimensions of M in cross-modal fusion. (a) MS-COCO; (b) VOC2007.

The number of units g in group sum-pooling. In this part, we evaluate the performance of F-GCN by using different numbers of units g. By group sum-pooling, each M-dimensional vector will be transformed into an M/g-dimensional vector. We vary the value of g from 1 to 64 to generate a light-weight fusion vector. As shown in Figure 7(a), F-GCN obtains a better performance on MS-COCO when choosing g = 2, while the change of mAP is very slight on VOC2007 in Figure 7(b). We believe g = 2 can better preserve the original semantic information under pooling. Meanwhile, other values of g also bring comparable results, which does not affect the model too much. It is the structure of MFB that plays a vital role in promoting the performance of F-GCN.

Figure 7: The change of mAP using different units of g. (a) MS-COCO; (b) VOC2007.


Figure 8: Two examples of the returned results with the query image on VOC2007 (queries containing "person + dog" and "bus + car").
4.3.4 Visual retrieval results. In this section, we evaluate F-GCN by giving two retrieval examples on VOC2007. We return the top-5 images by the k-NN algorithm for each given query image. Figure 8 lists the retrieval results. For example, the first input image contains two objects, "person" and "dog", and each returned image also contains these two objects. Besides, we obtain a similar effect when inputting the second image that contains "bus" and "car". The visual retrieval results verify that F-GCN has a good classification ability to recognize multi-label images.
5 CONCLUSION AND FUTURE WORK
In order to model the label dependencies and efficiently fuse cross-modal vectors (i.e., image representations and label co-occurrence embeddings), in this paper, we introduce a cross-modal fusion component (i.e., MFB) and propose F-GCN, a fast GCN based multi-label image recognition model. F-GCN first respectively adopts a CNN to extract the image features and a GCN to capture the label co-occurrence embeddings according to the relationship between different objects, then utilizes MFB to efficiently fuse these cross-modal embeddings and trains an end-to-end model with a multi-label loss function. Extensive experimental results on MS-COCO and VOC2007 demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the performance of image recognition has also been promoted compared with the state-of-the-art methods. In the future, we will integrate the attention mechanism into our model to extract more accurate image features to help further promote the image recognition performance.

ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation of China No.61902135 and the Innovation Group Project of the National Natural Science Foundation of China No.61821003. Thanks to Jay Chou, a celebrated Chinese singer whose songs have been accompanying the author.
2nd International Conference on Learning Representations, ICLR 2014, Banff, AB,
REFERENCES
[1] Shang-Fu Chen, Yi-Chen Chen, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. 2018. Order-Free RNN With Visual Attention for Multi-Label Classification. In AAAI 2018. AAAI Press, 6714–6721.
[2] Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. 2018. Recurrent Attentional Reinforcement Learning for Multi-Label Image Recognition. In AAAI 2018. AAAI Press, 6730–6737.
[3] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-Label Image Recognition With Graph Convolutional Networks. In CVPR 2019. 5177–5186.
[4] Amanda Clare and Ross D. King. 2001. Knowledge Discovery in Multi-label Phenotype Data. In PKDD 2001. Springer, 42–53.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009. 248–255.
[6] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
[7] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP 2016. 457–468.
[8] Weifeng Ge, Sibei Yang, and Yizhou Yu. 2018. Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning. In CVPR 2018. 1277–1286.
[9] Zongyuan Ge, Dwarikanath Mahapatra, Suman Sedai, Rahil Garnavi, and Rajib Chakravorty. 2018. Chest X-rays Classification: A Multi-Label and Fine-Grained Problem. CoRR abs/1807.07247 (2018).
[10] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. 2014. Deep Convolutional Ranking for Multilabel Image Annotation. In ICLR 2014.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR 2016. 770–778.
[12] Tao He and Xiaoming Jin. 2019. Image Emotion Distribution Learning with Graph Convolutional Networks. In ICMR 2019. ACM, 382–390.
[13] Fenyu Hu, Yanqiao Zhu, Shu Wu, Liang Wang, and Tieniu Tan. 2019. Hierarchical Graph Convolutional Networks for Semi-supervised Node Classification. In IJCAI 2019. 4532–4539.
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In CVPR 2017. 2261–2269.
[15] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. CoRR abs/1612.03651 (2016).
[16] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Hadamard Product for Low-rank Bilinear Pooling. In ICLR 2017.
[17] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR 2017.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS 2012. 1106–1114.
[19] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. 2018. Multi-Label Zero-Shot Learning With Structured Knowledge Graphs. In CVPR 2018. 1576–1585.
[20] Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. 2019. CayleyNets: Graph Convolutional Neural Networks With Complex Rational Spectral Filters. IEEE Trans. Signal Processing 67, 1 (2019), 97–109.
[21] Qing Li, Xiaojiang Peng, Yu Qiao, and Qiang Peng. 2019. Learning Category Correlations for Multi-label Image Recognition with Graph Networks. CoRR abs/1909.13005 (2019).
[22] Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. 2016. Human Attribute Recognition by Deep Hierarchical Contexts. In ECCV 2016. Springer, 684–700.
[23] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV 2014. Springer, 740–755.
[24] Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, and Heng Tao Shen. 2017. Compositional Model Based Fisher Vector Coding for Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 39, 12 (2017), 2335–2348.
[25] Mateusz Malinowski and Mario Fritz. 2014. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS 2014. 1682–1690.
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR 2013, Workshop Track.
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP 2014. 1532–1543.
[28] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015.
[29] Jiaxiang Tang, Wei Hu, Xiang Gao, and Zongming Guo. 2019. Joint Learning of Graph Representation and Node Features in Graph Convolutional Neural Networks. CoRR abs/1909.04931 (2019).
[30] Konstantinos Trohidis, Grigorios Tsoumakas, George Kalliris, and Ioannis P. Vlahavas. 2008. Multi-Label Classification of Music into Emotions. In ISMIR 2008. 325–330.
[31] Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-Label Classification: An Overview. IJDWM 3, 3 (2007), 1–13.
[32] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. 2016. CNN-RNN: A Unified Framework for Multi-label Image Classification. In CVPR 2016. 2285–2294.
[33] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. 2017. Multi-label Image Recognition by Recurrently Discovering Attentional Regions. In ICCV 2017. 464–472.
[34] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2016. HCP: A Flexible CNN Framework for Multi-Label Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 38, 9 (2016), 1901–1907.
[35] Ruiqing Xu, Chao Li, Junchi Yan, Cheng Deng, and Xianglong Liu. 2019. Graph Convolutional Network Hashing for Cross-Modal Retrieval. In IJCAI 2019. 982–988.
[36] Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2016. Exploit Bounding Box Annotations for Multi-Label Object Recognition. In CVPR 2016. 280–288.
[37] Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph Convolutional Networks for Text Classification. In AAAI 2019. 7370–7377.
[38] Jing Yu, Yuhang Lu, Zengchang Qin, Weifeng Zhang, Yanbing Liu, Jianlong Tan, and Li Guo. 2018. Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval. In PCM 2018. Springer, 223–234.
[39] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Trans. Neural Netw. Learning Syst. 29, 12 (2018), 5947–5959.
[40] Min-Ling Zhang and Zhi-Hua Zhou. 2014. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 26, 8 (2014), 1819–1837.
[41] Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, and Xiaogang Wang. 2017. Learning Spatial Regularization with Image-Level Supervisions for Multi-label Image Classification. In CVPR 2017. 2027–2036.
