Patch-Based Convolutional Neural Network for Whole Slide Tissue Image Classification
Le Hou (1), Dimitris Samaras (1), Tahsin M. Kurc (2,4), Yi Gao (2,1,3), James E. Davis (5), and Joel H. Saltz (2,1,5,6)

(1) Dept. of Computer Science, Stony Brook University
(2) Dept. of Biomedical Informatics, Stony Brook University
(3) Dept. of Applied Mathematics and Statistics, Stony Brook University
(4) Oak Ridge National Laboratory
(5) Dept. of Pathology, Stony Brook Hospital
(6) Cancer Center, Stony Brook Hospital

{lehhou,samaras}@cs.stonybrook.edu  {tahsin.kurc,joel.saltz}@stonybrook.edu
{yi.gao,james.davis}@stonybrookmedicine.edu
known, as only the image-level ground truth label is given. This complicates the classification problem. Because tumors may have a mixture of structures and texture properties, patch-level labels are not necessarily consistent with the image-level label. More importantly, when aggregating patch-level labels to an image-level label, simple decision fusion methods such as voting and max-pooling are not robust and do not match the decision process followed by pathologists. For example, a mixed subtype of cancer, such as oligoastrocytoma, might have distinct regions of other cancer subtypes. Therefore, neither voting nor max-pooling could predict the correct WSI-level label, since the patch-level predictions do not match the WSI-level label.

Figure 2: An overview of our workflow. Top: A CNN is trained on patches. An EM-based method iteratively eliminates non-discriminative patches. Bottom: An image-level decision fusion model is trained on histograms of patch-level predictions, to predict the image-level label.

We propose using a patch-level CNN and training a decision fusion model as a two-level model, shown in Fig. 2. The first-level (patch-level) model is an Expectation Maximization (EM) based method combined with a CNN that outputs patch-level predictions. In particular, we assume that there is a hidden variable associated with each patch extracted from an image that indicates whether the patch is discriminative (i.e., the true hidden label of the patch is the same as the true label of the image). Initially, we consider all patches to be discriminative. We train a CNN model that outputs the cancer type probability of each input patch. We apply spatial smoothing to the resulting probability map and select only patches with higher probability values as discriminative patches. We iterate this process using the new set of discriminative patches in an EM fashion. In the second level (image level), histograms of patch-level predictions are input into an image-level multiclass logistic regression or Support Vector Machine (SVM) [10] model that predicts the image-level labels.

Pathology image classification and segmentation is an active research field. Most WSI classification methods focus on classifying or extracting features on patches [17, 35, 50, 56, 11, 4, 48, 14]. In [50] a pretrained CNN model extracts features on patches, which are then aggregated for WSI classification. As we show here, the heterogeneity of some cancer subtypes cannot be captured by those generic CNN features. Patch-level supervised classifiers can learn the heterogeneity of cancer subtypes, if a lot of patch labels are provided [17, 35]. However, acquiring such labels at large scale is prohibitive, due to the need for specialized annotators. As digitization of tissue samples becomes commonplace, one can envision large-scale datasets that could not be annotated at patch scale. Utilizing unlabeled patches has led to Multiple Instance Learning (MIL) based WSI classification [16, 51, 52].

In the MIL paradigm [18, 33, 5], unlabeled instances belong to labeled bags of instances. The goal is to predict the label of a new bag and/or the label of each instance. The Standard Multi-Instance (SMI) assumption [18] states that, for a binary classification problem, a bag is positive iff there exists at least one positive instance in the bag. The probability of a bag being positive equals the maximum positive prediction over all of its instances [6, 54, 27]. When combining MIL with Neural Networks (NN) [43, 57, 31, 13], the SMI assumption is modeled by max-pooling. Following this formulation, Back Propagation for Multi-Instance Problems (BP-MIP) [43, 57] performs back propagation along the instance with the maximum response if the bag is positive. This is inefficient because only one instance per bag is trained in one training iteration on the whole bag.

MIL-based CNNs have been applied to object recognition [38] and semantic segmentation [40] in image analysis, where the image is the bag and image windows are the instances [36]. These methods also follow the SMI assumption. The training error is only propagated through the object-containing window, which is also assumed to be the window with the maximum prediction confidence. This is not robust, because one significantly misclassified window might be considered as the object-containing window. Additionally, in WSIs there might be multiple windows that contain discriminative information. Hence, recent semantic image segmentation approaches [12, 41, 39] smooth the
output probability (feature) maps of the CNNs.

To predict the image-level label, max-pooling (SMI) and voting (average-pooling) were applied in [36, 30, 17]. However, it has been shown that in many applications, learning decision fusion models can significantly improve performance compared to voting [42, 45, 24, 47, 26, 46]. Furthermore, such a learned decision fusion model is based on the Count-based Multiple Instance (CMI) assumption, which is the most general MIL assumption [49].

Our main contributions in this paper are: (1) To the best of our knowledge, we are the first to combine patch-level CNNs with supervised decision fusion. Aggregating patch-level CNN predictions for WSI classification significantly outperforms patch-level CNNs with max-pooling or voting. (2) We propose a new EM-based model that automatically identifies discriminative patches in high resolution images for patch-level CNN training, utilizing the spatial relationship between patches. (3) Our model achieves multiple state-of-the-art results classifying WSIs into cancer subtypes on the TCGA dataset. Our results are similar or close to inter-observer agreement between pathologists. Larger classification improvements are observed in the harder-to-classify cases. (4) We provide experimental evidence that combining multiple patch-level classifiers might actually be advantageous compared to whole image classification.

The rest of this paper is organized as follows. Sec. 2 describes the framework of the EM-based MIL algorithm. Sec. 3 discusses the identification of discriminative patches. Sec. 4 explains the image-level model that predicts the image-level label by aggregating patch-level predictions. Sec. 5 shows experimental results. The paper concludes in Sec. 6. App. A lists the cancer subtypes in our experiments.

2. EM-based method with CNN

An overview of our EM-based method can be found in Fig. 2. We model a high resolution image as a bag and the patches extracted from it as instances. We have a ground truth label for the whole image but not for the individual patches. We model whether an instance is discriminative or not as a hidden binary variable.

We denote by X = {X_1, X_2, ..., X_N} the dataset containing N bags. Each bag X_i = {X_{i,1}, X_{i,2}, ..., X_{i,N_i}} consists of N_i instances, where X_{i,j} = <x_{i,j}, y_i> is the j-th instance and its associated label in the i-th bag. Assuming the bags are independent and identically distributed (i.i.d.), X and the hidden variables H are generated by the following generative model:

    P(X, H) = ∏_{i=1}^{N} P(X_{i,1}, ..., X_{i,N_i} | H_i) P(H_i),    (1)

where the hidden variable H = {H_1, H_2, ..., H_N}, H_i = {H_{i,1}, H_{i,2}, ..., H_{i,N_i}}, and H_{i,j} indicates whether instance x_{i,j} is discriminative for the label y_i of bag X_i. We further assume that all X_{i,j} depend on H_{i,j} only and are independent of each other given H_{i,j}. Thus

    P(X, H) = ∏_{i=1}^{N} ∏_{j=1}^{N_i} P(X_{i,j} | H_{i,j}) P(H_i).    (2)

We maximize the data likelihood P(X) using EM:

1. At the initial E step, we set H_{i,j} = 1 for all i, j. This means that all instances are considered discriminative.

2. M step: We update the model parameter θ to maximize the data likelihood

    θ ← argmax_θ P(X | H; θ)
      = argmax_θ ∏_{x_{i,j} ∈ D} P(x_{i,j}, y_i | θ) × ∏_{x_{p,q} ∉ D} P(x_{p,q}, y_q | θ),    (3)

where D is the set of discriminative patches. Assuming a uniform generative model for all non-discriminative instances, the second product is constant in θ, so the optimization in Eq. 3 simplifies to

    argmax_θ ∏_{x_{i,j} ∈ D} P(x_{i,j}, y_i | θ) = argmax_θ ∏_{x_{i,j} ∈ D} P(y_i | x_{i,j}; θ) P(x_{i,j} | θ).    (4)

Additionally, we assume a uniform distribution over x_{i,j}. Thus Eq. 4 describes a discriminative model (in this paper, a CNN).

3. E step: We estimate the hidden variables H. In particular, H_{i,j} = 1 if and only if P(H_{i,j} | X) is above a certain threshold. In the case of image classification, given the i-th image, P(H_{i,j} | X) is obtained by applying Gaussian smoothing to P(y_i | x_{i,j}; θ) (detailed in Sec. 3). This smoothing step utilizes the spatial relationship of P(y_i | x_{i,j}; θ) in the image. We then iterate back to the M step until convergence.

Many MIL algorithms can be interpreted through this formulation. Based on the SMI assumption, the instance with the maximum P(H_{i,j} | X) is the discriminative instance for the positive bag, as in the EM Diverse Density (EM-DD) [55] and BP-MIP [43, 57] algorithms.
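To make the alternation concrete, the sketch below shows this EM loop in Python. It is illustrative only: the caller-supplied train_cnn and predict_probs routines, the per-image grid layout of patches, and all default values are our assumptions rather than the implementation used in the experiments, and the E step here uses a single percentile threshold (the refined two-threshold scheme is sketched in Sec. 3).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def em_train(patch_grids, image_labels, train_cnn, predict_probs,
                 n_rounds=3, percentile=30.0, sigma=1.0):
        """EM loop of Sec. 2 (illustrative sketch).

        patch_grids:  one array per WSI whose first two axes are the
                      patch grid (rows, cols, ...), so patches keep
                      their spatial layout for Gaussian smoothing.
        image_labels: class index y_i of each WSI.
        train_cnn / predict_probs: caller-supplied fit and inference
                      routines; predict_probs returns (n, n_classes).
        """
        # Initial E step: H_ij = 1, i.e. every patch is discriminative.
        masks = [np.ones(g.shape[:2], dtype=bool) for g in patch_grids]
        model = None
        for _ in range(n_rounds):
            # M step: train the CNN on discriminative patches only,
            # each labeled with its image-level label (Eq. 4).
            xs = np.concatenate([g[m] for g, m in zip(patch_grids, masks)])
            ys = np.concatenate([np.full(int(m.sum()), y)
                                 for m, y in zip(masks, image_labels)])
            model = train_cnn(xs, ys)
            # E step: estimate P(H_ij | X) by smoothing P(y_i | x_ij; θ)
            # over each image's patch grid, then threshold.
            masks = []
            for g, y in zip(patch_grids, image_labels):
                flat = g.reshape((-1,) + g.shape[2:])
                p_map = predict_probs(model, flat)[:, y].reshape(g.shape[:2])
                smoothed = gaussian_filter(p_map, sigma=sigma)
                masks.append(smoothed >= np.percentile(smoothed, percentile))
        return model, masks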
3. Discriminative patch selection

Patches x_{i,j} that have P(H_{i,j} | X) larger than a threshold T_{i,j} are considered discriminative and are selected to continue training the CNN. We present in this section the estimation of P(H | X) and the choice of the threshold.

It is reasonable to assume that P(H_{i,j} | X) is correlated with P(y_i | x_{i,j}; θ), i.e., patches with lower P(y_i | x_{i,j}; θ)
tend to have a lower probability of being discriminative. However, a hard-to-classify patch, or a patch close to the decision boundary, may have low P(y_i | x_{i,j}; θ) as well. These patches are informative and should not be rejected. Therefore, to obtain a more robust P(H_{i,j} | X), we apply the following two steps: First, we train two CNNs on two different scales in parallel; P(y_i | x_{i,j}; θ) is the averaged prediction of the two CNNs. Second, we denoise the probability map P(y_i | x_{i,j}; θ) of each image with a Gaussian kernel to compute P(H_{i,j} | X). This use of spatial relationships yields more robust discriminative patch identification, as shown in the experiments in Sec. 5.

Choosing the thresholding scheme carefully yields significantly better performance than a simpler scheme [39]. We obtain the threshold T_{i,j} for P(H_{i,j} | X) as follows: We denote by S_i the set of P(H_{i,j} | X) values for all x_{i,j} of the i-th image, and by E_c the set of P(H_{i,j} | X) values for all x_{i,j} of the c-th class. We introduce the image-level threshold H_i as the P_1-th percentile of S_i, and the class-level threshold R_i as the P_2-th percentile of E_c, where P_1 and P_2 are predefined. The threshold T_{i,j} is defined as the minimum of H_i and R_i. There are two advantages to our method. First, by using the image-level threshold, at least 1 − P_1 percent of the patches in each image are considered discriminative. Second, by using the class-level threshold, the thresholds can easily adapt to classes with different prior probabilities.
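A small sketch of this two-level thresholding, assuming the smoothed probability maps from the E step are already available; the function and parameter names here are illustrative assumptions, not our released implementation:

    import numpy as np

    def select_discriminative(smoothed_maps, image_labels, p1=30.0, p2=30.0):
        """Two-level thresholding of Sec. 3 (illustrative sketch).

        smoothed_maps: per-image 2-D arrays of P(H_ij | X), i.e. the
        Gaussian-smoothed probability maps. Returns one boolean mask
        per image marking its discriminative patches.
        """
        # Class-level threshold R_i: the P2-th percentile of the values
        # pooled over all images of class c.
        class_thresh = {
            c: np.percentile(np.concatenate(
                   [m.ravel() for m, y in zip(smoothed_maps, image_labels)
                    if y == c]), p2)
            for c in set(image_labels)
        }
        masks = []
        for m, y in zip(smoothed_maps, image_labels):
            # Image-level threshold H_i is the P1-th percentile of this
            # image's own values; the final threshold is T = min(H_i, R_i).
            t = min(np.percentile(m, p1), class_thresh[y])
            masks.append(m > t)
        return masks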
4. Image-level decision fusion model

We combine the patch-level classifiers of Sec. 3 to predict the image-level label. We input all patch-level predictions into a multi-class logistic regression or SVM that outputs the image-level label. This decision-level fusion method [28] is more robust than max-pooling [45]. Moreover, this method can be thought of as a Count-based Multiple Instance (CMI) learning method with two-level learning [49], which rests on a more general MIL assumption [20] than the Standard Multiple Instance (SMI) assumption.

There are three reasons for combining multiple instances. First, on difficult datasets, we do not want to assign an image-level prediction based on a single patch-level prediction (as is the case under the SMI assumption [18]). Second, even though certain patches are not discriminative individually, their joint appearance might be discriminative. For example, a WSI of the "mixed" glioma Oligoastrocytoma (see App. A) should be recognized when two single glioma subtypes (Oligodendroglioma and Astrocytoma) are jointly present on the slide, possibly in non-overlapping regions. Third, because the patch-level model is never perfect and probably biased, an image-level decision fusion model may learn to correct the bias of the patch-level decisions.

Because it is unclear at this time whether strongly discriminative features for cancer subtypes exist at whole slide scale [34], we fuse patch-level predictions without the spatial relationship between patches. In particular, the class histogram of the patch-level predictions is the input to a linear multi-class logistic regression model [8] or an SVM with Radial Basis Function (RBF) kernel [10]. Because a WSI contains at least hundreds of patches, the class histogram is very robust to misclassified patches. To generate the histogram, we sum up all of the class probabilities given by the patch-level CNN. Moreover, we concatenate histograms from four CNN models: CNNs trained at two patch scales for two different numbers of iterations. We found in practice that using multiple histograms is robust.
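The following sketch illustrates this fusion step with scikit-learn's logistic regression. For simplicity it builds the histogram from a single patch-level CNN rather than concatenating the four histograms described above; all names are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def class_histogram(patch_probs):
        """Class histogram of one WSI: the per-class probabilities of
        the patch-level CNN summed over all of its patches."""
        return patch_probs.sum(axis=0)

    def fit_fusion_model(per_image_patch_probs, image_labels):
        """Second-level decision fusion (Sec. 4, illustrative sketch).
        per_image_patch_probs holds one (n_patches, n_classes) array
        per WSI. An SVM with RBF kernel (sklearn.svm.SVC) is a drop-in
        alternative to the logistic regression used here."""
        feats = np.stack([class_histogram(p) for p in per_image_patch_probs])
        return LogisticRegression(max_iter=1000).fit(feats, image_labels)

In the full model, the feature vector would be the concatenation of such histograms from the four CNNs.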
5. Experiments

We evaluate our method on two Whole Slide Tissue Image (WSI) classification problems: classifying glioma and Non-Small-Cell Lung Carcinoma (NSCLC) cases into glioma and NSCLC subtypes, respectively. Glioma is a type of brain cancer that arises from glial cells. It is the most common malignant brain tumor and the leading cause of cancer-related deaths in people under age 20 [1]. NSCLC is the most common lung cancer and the leading cause of cancer-related deaths overall [3]. Classifying glioma and NSCLC into their respective subtypes and grades is crucial to the study of disease onset and progression, in order to provide targeted therapies. The dataset of WSIs used in the experiments is part of the public Cancer Genome Atlas (TCGA) dataset [2]. It contains detailed clinical information and Hematoxylin and Eosin (H&E) stained images of various cancers. The typical resolution of a WSI in this dataset is 100K by 50K pixels. In the rest of this section, we first describe the algorithms we tested, then show the evaluation results on the glioma and NSCLC classification tasks.

5.1. Patch extraction and segmentation

To train the CNN model, we extract patches of size 500×500 from WSIs (examples in Fig. 3). To capture structures at multiple scales, we extract patches at 20X (0.5 microns per pixel) and 5X (2.0 microns per pixel) objective magnifications. We discard patches that contain less than 30% tissue or too much blood. We extract around 1000 valid patches per image per scale. In most cases the patches are non-overlapping, given the WSI resolution.
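As an illustration of the tissue check, the sketch below keeps a patch when at least 30% of its pixels are darker than a near-white background level. The gray-level cutoff is our assumption (the text does not specify how tissue fraction is computed), and the blood filter is omitted:

    import numpy as np
    from skimage.color import rgb2gray

    def is_valid_patch(rgb_patch, tissue_frac=0.30, bg_level=0.85):
        """Tissue check of Sec. 5.1 (illustrative sketch): keep a patch
        only if at least tissue_frac of its pixels look like tissue.
        Background in H&E slides is near white, so a pixel counts as
        tissue when its gray level falls below bg_level."""
        gray = rgb2gray(np.asarray(rgb_patch))  # float image in [0, 1]
        return (gray < bg_level).mean() >= tissue_frac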
To prevent the CNN from overfitting, we perform three kinds of data augmentation in every iteration. We select a random 400×400 sub-patch from each 500×500 patch. We randomly rotate and mirror the sub-patch. We randomly adjust the amount of Hematoxylin and Eosin stain on the tissue. This is done by decomposing the RGB color of the tissue into the H&E color space [44], followed by multiplying the magnitude of the H and E components of every pixel by two i.i.d. Gaussian random variables with expectation equal to one.
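This stain augmentation can be sketched with the color deconvolution routines in scikit-image; the standard deviation of the Gaussian factors is our assumption, as the text does not specify it:

    import numpy as np
    from skimage.color import rgb2hed, hed2rgb

    def augment_stain(rgb_patch, rng, std=0.05):
        """H&E stain augmentation of Sec. 5.1 (illustrative sketch).
        Decompose the patch into the H&E color space via color
        deconvolution [44], scale the Hematoxylin and Eosin channels
        by two i.i.d. Gaussian factors with mean one, and recompose."""
        hed = rgb2hed(rgb_patch)              # stain channels: H, E, DAB
        hed[..., 0] *= rng.normal(1.0, std)   # Hematoxylin
        hed[..., 1] *= rng.normal(1.0, std)   # Eosin
        return np.clip(hed2rgb(hed), 0.0, 1.0)

    # Example usage:
    # rng = np.random.default_rng(0)
    # augmented = augment_stain(patch, rng)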
Figure 3: Some 20X sample patches of gliomas and Non-Small-Cell Lung Carcinoma (NSCLC) from the TCGA dataset: (a) GBM, (b) OD, (c) OA, (d) DA, (e) SCC, (f) ADC. Two patches in each column belong to the same subtype of cancer. Notice the large intra-class heterogeneity.
8. EM-CNN-LR/SVM: EM-based method with CNN-LR and CNN-SVM respectively.

9. EM-CNN-LR w/o spatial smoothing: We do not apply Gaussian smoothing to estimate P(H | X); otherwise identical to EM-CNN-LR.

10. EM-Finetune-CNN-LR/SVM: Similar to EM-CNN-LR/SVM, except that instead of training a CNN from scratch, we fine-tune a pretrained 16-layer CNN model [46] by training it on discriminative patches.

11. SMI-CNN-SMI: CNN with max-pooling at both the discriminative patch identification and image-level prediction steps. For the patch-level CNN training, in each WSI only the one patch with the highest confidence is considered discriminative.

12. NM-LBP: We extract Nuclear Morphological features [15] and rotation invariant Local Binary Patterns [37] from all patches. We build a Bag-of-Words (BoW) [19, 53] feature using k-means, followed by an SVM with RBF kernel [10], as a non-CNN baseline.

13. Pretrained-CNN-Fea-SVM: Similar to CNN-Fea-SVM, but instead of training a CNN, we use a pretrained 16-layer CNN model [46] to extract features from patches. We then select the top 500 features according to accuracy on the training set [50].

14. Pretrained-CNN-Bow-SVM: We build a BoW model using k-means on features extracted by the pretrained CNN, followed by an SVM [50].

5.4. WSI of glioma classification

There are WSIs of six subtypes of glioma in the TCGA dataset [2]. The numbers of WSIs and patients in each class are shown in Tab. 2. All classes are described in App. A.

Gliomas      GBM   OD    OA    DA    AA    AO
# patients   209   100   106   82    29    13
# WSIs       510   206   183   114   36    15

Table 2: The numbers of WSIs and patients in each class from the TCGA dataset. Class descriptions are in App. A.

The results of our experiments are shown in Tab. 3. The confusion matrix is given in Tab. 4.

Methods                            Acc     mAP
CNN-Vote                           0.710   0.812
CNN-SMI                            0.710   0.822
CNN-Fea-SVM                        0.688   0.790
EM-CNN-Vote                        0.733   0.837
EM-CNN-SMI                         0.719   0.823
EM-CNN-Fea-SVM                     0.686   0.790
EM-Finetune-CNN-Vote               0.719   0.817
EM-Finetune-CNN-SMI                0.638   0.758
CNN-LR                             0.752   0.847
CNN-SVM                            0.697   0.791
EM-CNN-LR                          0.771   0.845
EM-CNN-LR w/o spatial smoothing    0.745   0.832
EM-CNN-SVM                         0.730   0.818
EM-Finetune-CNN-LR                 0.721   0.822
EM-Finetune-CNN-SVM                0.738   0.828
SMI-CNN-SMI                        0.683   0.765
NM-LBP                             0.629   0.734
Pretrained CNN-Fea-SVM             0.733   0.837
Pretrained-CNN-Bow-SVM             0.667   0.756
Chance                             0.513   0.689

Table 3: Glioma classification results. The proposed EM-CNN-LR method achieved the best result, close to inter-observer agreement between pathologists (Sec. 5.4).

                     Predictions
Ground Truth   GBM   OD   OA   DA   AA   AO
GBM            214   0    2    0    1    0
OD             1     47   22   2    0    1
OA             1     18   40   8    3    1
DA             3     9    6    20   0    1
AA             3     2    3    3    4    0
AO             2     2    3    0    0    1

Table 4: Confusion matrix of glioma classification. The nature of Oligoastrocytoma causes the most confusions. See Sec. 5.4 for details.

An experiment showed that the inter-observer agreement of two experienced pathologists on a similar dataset was approximately 70%, and that even after reviewing the cases together, they agreed only around 80% of the time [22]. Therefore, our accuracy of 77% is similar to inter-observer agreement.

In the confusion matrix, we note that the classification accuracy between GBM and Low-Grade Glioma (LGG) is 97% (chance was 51.3%). A fully supervised method achieved 85% accuracy using a domain-specific algorithm trained on ten manually labeled patches per class [35]. Our method is the first to classify five LGG subtypes automatically, a much more challenging classification task than the benchmark GBM vs. LGG classification. We achieve 57.1% LGG-subtype classification accuracy, with chance at 36.7%. Most of the confusions are related to oligoastrocytoma (OA), since it is a mixed glioma that is challenging for pathologists to agree on, according to a neuropathology study: "Oligoastrocytomas contain distinct regions of oligodendroglial and astrocytic differentiation... The minimal percentage of each component required for the diagnosis of a mixed glioma has been debated, resulting in poor inter-observer reproducibility for this group of neoplasms." [9].

We compare recognition rates for the OA subtype. The F-score of OA recognition is 0.426, 0.482, and 0.544 using Pretrained-CNN-Fea-SVM, CNN-LR, and EM-CNN-LR respectively. We thus see that the improvement over other methods
Figure 5: Sample images of rail surfaces: (a) Grade 0, (b) Grade 2, (c) Grade 4, (d) Grade 7. The grade indicates defect severity. Notice that the defects are at image patch scale and dispersed throughout the image.

is 1200×500, as in Fig. 5.

To support our claim, we tested two additional methods (a sketch of the test-time protocol of the first follows the list):

1. CNN-Image: We apply the CNN at image scale directly. In particular, we train the CNN on 400×400 regions randomly extracted from the images in each iteration. At test time, we apply the CNN on five regions (top left, top right, bottom left, bottom right, center) and average the predictions.

2. Pretrained CNN-ImageFea-SVM: We apply a pretrained 16-layer network [46] to the rail surface images to extract features, and train an SVM on these features.
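A sketch of the five-region test-time averaging used by CNN-Image, assuming a predict_probs routine (our name) that maps a batch of crops to class probabilities:

    import numpy as np

    def five_region_predict(image, predict_probs, size=400):
        """Test-time protocol of CNN-Image (illustrative sketch):
        average the class probabilities over the four corner regions
        and the center region of the image."""
        h, w = image.shape[:2]
        tops = [0, 0, h - size, h - size, (h - size) // 2]
        lefts = [0, w - size, 0, w - size, (w - size) // 2]
        crops = np.stack([image[t:t + size, l:l + size]
                          for t, l in zip(tops, lefts)])
        return predict_probs(crops).mean(axis=0)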
The CNN used in this experiment has a similar architecture to the one described in Tab. 1, with smaller and fewer filters. The size of patches in our patch-based methods is 64 by 64. We apply 4-fold cross-validation and show the averaged results in Tab. 8. Our patch-based methods EM-CNN-SVM and EM-CNN-Fea-SVM outperform the conventional image-based method CNN-Image. Moreover, results using CNN features extracted on patches (Pretrained CNN-Fea-SVM) are better than results with CNN features extracted on whole images (Pretrained CNN-ImageFea-SVM).

Methods                        Acc     mAP
CNN-Vote                       0.695   0.823
CNN-SMI                        0.700   0.801
CNN-Fea-SVM                    0.822   0.903
EM-CNN-Vote                    0.683   0.817
EM-CNN-SMI                     0.684   0.799
EM-CNN-Fea-SVM                 0.830   0.908
CNN-LR                         0.764   0.867
CNN-SVM                        0.803   0.886
EM-CNN-LR                      0.772   0.871
EM-CNN-SVM                     0.813   0.895
SMI-CNN-SMI                    0.258   0.461
Pretrained CNN-Fea-SVM         0.808   0.894
CNN-Image                      0.770   0.876
Pretrained CNN-ImageFea-SVM    0.778   0.878
Chance                         0.228   0.438

Table 8: Rail surface defect severity grade classification results. Our patch-based methods EM-CNN-SVM and EM-CNN-Fea-SVM significantly outperform the image-based methods CNN-Image and Pretrained CNN-ImageFea-SVM.

6. Conclusions

We presented a patch-based Convolutional Neural Network (CNN) model with a supervised decision fusion model that is successful in Whole Slide Tissue Image (WSI) classification. We proposed an Expectation-Maximization (EM) based method that automatically identifies discriminative patches for CNN training. With our algorithm, we can classify cancer subtypes, given the WSIs of patients, with accuracy similar or close to inter-observer agreement between pathologists. Furthermore, we experimentally demonstrated, using a comparable non-cancer dataset of smaller images, that the performance of our patch-based CNN compares favorably to that of an image-based CNN. In the future we will leverage the non-discriminative patches as part of the data likelihood in the EM formulation. We will also optimize CNN training so that it scales up to larger pathology datasets.

Acknowledgment

This work was supported in part by 1U24CA180924-01A1 from the National Cancer Institute, R01LM011119-01 and R01LM009239, and partially supported by NSF IIS-1161876, IIS-1111047, FRA DTFR5315C00011, the Subsample project from the DIGITEO Institute, France, and a gift from Adobe Corp. We thank Ke Ma for providing the rail surface dataset.

Appendix A. Description of cancer subtypes

GBM: Glioblastoma, ICD-O 9440/3, WHO grade IV. A Whole Slide Image (WSI) is classified as GBM iff one patch can be classified as GBM with high confidence.
OD: Oligodendroglioma, ICD-O 9450/3, WHO grade II.
OA: Oligoastrocytoma, ICD-O 9382/3, WHO grade II; Anaplastic oligoastrocytoma, ICD-O 9382/3, WHO grade III. This mixed glioma subtype is hard to classify even for pathologists [22].
DA: Diffuse astrocytoma, ICD-O 9400/3, WHO grade II.
AA: Anaplastic astrocytoma, ICD-O 9401/3, WHO grade III.
AO: Anaplastic oligodendroglioma, ICD-O 9451/3, WHO grade III.
LGG: Low-Grade Glioma. Includes OD, OA, DA, AA, and AO.
SCC: Squamous cell carcinoma, ICD-O 8070/3.
ADC: Adenocarcinoma, ICD-O 8140/3.
ADC-mix: ADC with mixed subtypes, ICD-O 8255/3.
References

[1] Brain tumor statistics. http://www.abta.org/about-us/news/brain-tumor-statistics/.
[2] The Cancer Genome Atlas. https://tcga-data.nci.nih.gov/tcga/.
[3] Non-small-cell lung carcinoma. http://www.cancer.org/cancer/lungcancer-non-smallcell/.
[4] D. Altunbay, C. Cigir, C. Sokmensuer, and C. Gunduz-Demir. Color graphs for automated cancer diagnosis and grading. J Biomed Eng, 2010.
[5] J. Amores. Multiple instance classification: Review, taxonomy and comparative study. AIJ, 2013.
[6] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2002.
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
[8] C. M. Bishop et al. Pattern recognition and machine learning. 2006.
[9] D. J. Brat, R. A. Prayson, T. C. Ryken, and J. J. Olson. Diagnosis of malignant glioma: role of neuropathology. Journal of Neuro-Oncology, 2008.
[10] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2011.
[11] H. Chang, Y. Zhou, A. Borowsky, K. Barner, P. Spellman, and B. Parvin. Stacked predictive sparse decomposition for classification of histology sections. IJCV, 2014.
[12] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv, 2014.
[13] Z. Chen, Z. Chi, H. Fu, and D. Feng. Multi-instance multi-label image classification: A neural approach. Neurocomputing, 2013.
[14] D. C. Cireşan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, 2013.
[15] L. A. Cooper, J. Kong, D. A. Gutman, F. Wang, J. Gao, C. Appin, S. Cholleti, T. Pan, A. Sharma, L. Scarpace, et al. Integrated morphologic analysis for the identification and characterization of disease subtypes. JAMIA, 2012.
[16] E. Cosatto, P.-F. Laquerre, C. Malon, H.-P. Graf, A. Saito, T. Kiyuna, A. Marugame, and K. Kamijo. Automated gastric cancer diagnosis on H&E-stained sections; training a classifier on a large scale with multiple instance machine learning. In Medical Imaging, 2013.
[17] A. Cruz-Roa, A. Basavanhally, F. González, H. Gilmore, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, and A. Madabhushi. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In Medical Imaging, 2014.
[18] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. AIJ, 1997.
[19] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
[20] J. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowl Eng Rev, 2010.
[21] J. E. Grilley-Olson, D. T. Moore, K. O. Leslie, B. F. Qaqish, X. Yin, M. A. Socinski, T. E. Stinchcombe, L. B. Thorne, T. C. Allen, P. M. Banks, et al. Validation of interobserver agreement in lung cancer assessment: hematoxylin-eosin diagnostic reproducibility for non-small cell lung cancer: the 2004 World Health Organization classification and therapeutically relevant subsets. Archives of Pathology & Laboratory Medicine, 2013.
[22] M. Gupta, A. Djalilvand, and D. J. Brat. Clarifying the diffuse gliomas: an update on the morphologic features and markers that discriminate oligodendroglioma from astrocytoma. AJCP, 2005.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[24] M. Hoai and A. Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[27] M. Kim and F. Torre. Gaussian processes multiple instance learning. In ICML, 2010.
[28] M. M. Kokar, J. A. Tomasik, and J. Weyman. Data vs. decision fusion in the category theory framework. 2001.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[31] C. H. Li, I. Gondra, and L. Liu. An efficient parallel neural network-based multi-instance learning algorithm. J Supercomput, 2012.
[32] K. Ma, T. F. Y. Vicente, D. Samaras, M. Petrucci, and D. L. Magnus. Texture classification for rail surface condition evaluation. In WACV, 2016.
[33] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. NIPS, 1998.
[34] S. E. Mills. Histology for Pathologists. Lippincott Williams & Wilkins, 2012.
[35] H. S. Mousavi, V. Monga, G. Rao, and A. U. Rao. Automated discrimination of lower and higher grade gliomas based on histopathological image analysis. JPI, 2015.
[36] M. H. Nguyen, L. Torresani, F. De La Torre, and C. Rother. Weakly supervised discriminative localization and classification: a joint learning process. In ICCV, 2009.
[37] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI, 2002.
[38] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. In NIPS.
[39] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv, 2015.
[40] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. arXiv, 2014.
[41] P. O. Pinheiro and R. Collobert. Weakly supervised semantic segmentation with convolutional networks. arXiv, 2014.
[42] S. Poria, E. Cambria, and A. Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis.
[43] J. Ramon and L. De Raedt. Multi instance neural networks. 2000.
[44] A. C. Ruifrok and D. A. Johnston. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol, 2001.
[45] A. Seff, L. Lu, K. M. Cherry, H. R. Roth, J. Liu, S. Wang, J. Hoffman, E. B. Turkbey, and R. M. Summers. 2D view aggregation for lymph node detection using a shallow hierarchy of linear classifiers. In MICCAI, 2014.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[47] F. Tabib Mahmoudi, F. Samadzadegan, and P. Reinartz. A decision level fusion method for object recognition using multi-angular imagery. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2013.
[48] T. H. Vu, H. S. Mousavi, V. Monga, U. Rao, and G. Rao. DFDL: Discriminative feature-oriented dictionary learning for histopathological image classification. arXiv, 2015.
[49] N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method for generalized multi-instance problems. In ECML, 2003.
[50] Y. Xu, Z. Jia, Y. Ai, F. Zhang, M. Lai, E. I. Chang, et al. Deep convolutional activation features for large scale brain tumor histopathology image classification and segmentation. In ICASSP, 2015.
[51] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, E. I. Chang, et al. Deep learning of feature representation with multiple instance learning for medical image analysis. In ICASSP, 2014.
[52] Y. Xu, J.-Y. Zhu, I. Eric, C. Chang, M. Lai, and Z. Tu. Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis, 2014.
[53] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo. Evaluating bag-of-visual-words representations in scene classification. In Workshop on Multimedia Information Retrieval, 2007.
[54] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2005.
[55] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In NIPS, 2001.
[56] Y. Zhou, H. Chang, K. Barner, P. Spellman, and B. Parvin. Classification of histology sections via multispectral convolutional sparse coding. In CVPR, 2014.
[57] Z.-H. Zhou and M.-L. Zhang. Neural networks for multi-instance learning. In ICIIT, 2002.