
Patch-based Convolutional Neural Network for Whole Slide Tissue Image Classification

Le Hou (1), Dimitris Samaras (1), Tahsin M. Kurc (2,4), Yi Gao (2,1,3), James E. Davis (5), and Joel H. Saltz (2,1,5,6)

(1) Dept. of Computer Science, Stony Brook University
(2) Dept. of Biomedical Informatics, Stony Brook University
(3) Dept. of Applied Mathematics and Statistics, Stony Brook University
(4) Oak Ridge National Laboratory
(5) Dept. of Pathology, Stony Brook Hospital
(6) Cancer Center, Stony Brook Hospital

{lehhou,samaras}@cs.stonybrook.edu  {tahsin.kurc,joel.saltz}@stonybrook.edu
{yi.gao,james.davis}@stonybrookmedicine.edu

Abstract

Convolutional Neural Networks (CNN) are state-of-the-art models for many image classification tasks. However, to recognize cancer subtypes automatically, training a CNN on gigapixel resolution Whole Slide Tissue Images (WSI) is currently computationally impossible. The differentiation of cancer subtypes is based on cellular-level visual features observed on image patch scale. Therefore, we argue that in this situation, training a patch-level classifier on image patches will perform better than or similar to an image-level classifier. The challenge becomes how to intelligently combine patch-level classification results and model the fact that not all patches will be discriminative. We propose to train a decision fusion model to aggregate patch-level predictions given by patch-level CNNs, which to the best of our knowledge has not been shown before. Furthermore, we formulate a novel Expectation-Maximization (EM) based method that automatically locates discriminative patches robustly by utilizing the spatial relationships of patches. We apply our method to the classification of glioma and non-small-cell lung carcinoma cases into subtypes. The classification accuracy of our method is similar to the inter-observer agreement between pathologists. Although it is impossible to train CNNs on WSIs, we experimentally demonstrate, using a comparable non-cancer dataset of smaller images, that a patch-based CNN can outperform an image-based CNN.

1. Introduction

Convolutional Neural Networks (CNNs) are currently the state-of-the-art image classifiers [30, 29, 7, 23]. However, due to high computational cost, CNNs cannot be applied to very high resolution images, such as gigapixel Whole Slide Tissue Images (WSI). Classification of cancer WSIs into grades and subtypes is critical to the study of disease onset and progression and the development of targeted therapies, because the effects of cancer can be observed in WSIs at the cellular and sub-cellular levels (Fig. 1). Applying a CNN directly for WSI classification has several drawbacks. First, extensive image downsampling is required, by which most of the discriminative details could be lost. Second, it is possible that a CNN might only learn from one of the multiple discriminative patterns in an image, resulting in data inefficiency. Discriminative information is encoded in high resolution image patches. Therefore, one solution is to train a CNN on high resolution image patches and predict the label of a WSI based on patch-level predictions.

Figure 1: A gigapixel Whole Slide Tissue Image of a grade IV tumor. Visual features that determine the subtype and grade of a WSI are visible in high resolution. In this case, patches framed in red are discriminative since they show typical visual features of a grade IV tumor. Patches framed in blue are non-discriminative since they only contain visual features from lower grade tumors. Discriminative patches are dispersed throughout the image at multiple locations.
The ground truth labels of individual patches are unknown, as only the image-level ground truth label is given. This complicates the classification problem. Because tumors may have a mixture of structures and texture properties, patch-level labels are not necessarily consistent with the image-level label. More importantly, when aggregating patch-level labels to an image-level label, simple decision fusion methods such as voting and max-pooling are not robust and do not match the decision process followed by pathologists. For example, a mixed subtype of cancer, such as oligoastrocytoma, might have distinct regions of other cancer subtypes. Therefore, neither voting nor max-pooling could predict the correct WSI-level label, since the patch-level predictions do not match the WSI-level label.

Figure 2: An overview of our workflow. Top: A CNN is trained on patches. An EM-based method iteratively eliminates non-discriminative patches. Bottom: An image-level decision fusion model is trained on histograms of patch-level predictions, to predict the image-level label.

We propose using a patch-level CNN and training a decision fusion model as a two-level model, shown in Fig. 2. The first-level (patch-level) model is an Expectation-Maximization (EM) based method combined with a CNN that outputs patch-level predictions. In particular, we assume that there is a hidden variable associated with each patch extracted from an image that indicates whether the patch is discriminative (i.e. the true hidden label of the patch is the same as the true label of the image). Initially, we consider all patches to be discriminative. We train a CNN model that outputs the cancer type probability of each input patch. We apply spatial smoothing to the resulting probability map and select only patches with higher probability values as discriminative patches. We iterate this process using the new set of discriminative patches in an EM fashion. In the second level (image level), histograms of patch-level predictions are input into an image-level multiclass logistic regression or Support Vector Machine (SVM) [10] model that predicts the image-level labels.

Pathology image classification and segmentation is an active research field. Most WSI classification methods focus on classifying or extracting features on patches [17, 35, 50, 56, 11, 4, 48, 14, 50]. In [50] a pretrained CNN model extracts features on patches, which are then aggregated for WSI classification. As we show here, the heterogeneity of some cancer subtypes cannot be captured by those generic CNN features. Patch-level supervised classifiers can learn the heterogeneity of cancer subtypes if many patch labels are provided [17, 35]. However, acquiring such labels at large scale is prohibitive, due to the need for specialized annotators. As digitization of tissue samples becomes commonplace, one can envision large scale datasets that could not be annotated at patch scale. Utilizing unlabeled patches has led to Multiple Instance Learning (MIL) based WSI classification [16, 51, 52].

In the MIL paradigm [18, 33, 5], unlabeled instances belong to labeled bags of instances. The goal is to predict the label of a new bag and/or the label of each instance. The Standard Multi-Instance (SMI) assumption [18] states that, for a binary classification problem, a bag is positive iff there exists at least one positive instance in the bag. The probability of a bag being positive equals the maximum positive prediction over all of its instances [6, 54, 27]. When combining MIL with Neural Networks (NN) [43, 57, 31, 13], the SMI assumption is modeled by max-pooling. Following this formulation, Back Propagation for Multi-Instance Problems (BP-MIP) [43, 57] performs back propagation along the instance with the maximum response if the bag is positive. This is inefficient because only one instance per bag is trained in one training iteration over the whole bag.

MIL-based CNNs have been applied to object recognition [38] and semantic segmentation [40] in image analysis – the image is the bag and image-windows are the instances [36]. These methods also follow the SMI assumption. The training error is only propagated through the object-containing window, which is also assumed to be the window with the maximum prediction confidence. This is not robust because one significantly misclassified window might be considered as the object-containing window. Additionally, in WSIs, there might be multiple windows that contain discriminative information.
Hence, recent semantic image segmentation approaches [12, 41, 39] smooth the output probability (feature) maps of the CNNs.

To predict the image-level label, max-pooling (SMI) and voting (average-pooling) were applied in [36, 30, 17]. However, it has been shown that in many applications, learning decision fusion models can significantly improve performance compared to voting [42, 45, 24, 47, 26, 46]. Furthermore, such a learned decision fusion model is based on the Count-based Multiple Instance (CMI) assumption, which is the most general MIL assumption [49].

Our main contributions in this paper are: (1) To the best of our knowledge, we are the first to combine patch-level CNNs with supervised decision fusion. Aggregating patch-level CNN predictions for WSI classification significantly outperforms patch-level CNNs with max-pooling or voting. (2) We propose a new EM-based model that identifies discriminative patches in high resolution images automatically for patch-level CNN training, utilizing the spatial relationship between patches. (3) Our model achieves multiple state-of-the-art results classifying WSIs into cancer subtypes on the TCGA dataset. Our results are similar or close to inter-observer agreement between pathologists. Larger classification improvements are observed in the harder-to-classify cases. (4) We provide experimental evidence that combining multiple patch-level classifiers might actually be advantageous compared to whole image classification.

The rest of this paper is organized as follows. Sec. 2 describes the framework of the EM-based MIL algorithm. Sec. 3 discusses the identification of discriminative patches. Sec. 4 explains the image-level model that predicts the image-level label by aggregating patch-level predictions. Sec. 5 shows experimental results. The paper concludes in Sec. 6. App. A lists the cancer subtypes in our experiments.

2. EM-based method with CNN

An overview of our EM-based method can be found in Fig. 2. We model a high resolution image as a bag and patches extracted from it as instances. We have a ground truth label for the whole image but not for the individual patches. We model whether an instance is discriminative or not as a hidden binary variable.

We denote by X = \{X_1, X_2, \ldots, X_N\} the dataset containing N bags. Each bag X_i = \{X_{i,1}, X_{i,2}, \ldots, X_{i,N_i}\} consists of N_i instances, where X_{i,j} = \langle x_{i,j}, y_i \rangle is the j-th instance and its associated label in the i-th bag. Assuming the bags are independent and identically distributed (i.i.d.), X and the hidden variables H are generated by the following generative model:

    P(X, H) = \prod_{i=1}^{N} P(X_{i,1}, \ldots, X_{i,N_i} \mid H_i) \, P(H_i),    (1)

where the hidden variable H = \{H_1, H_2, \ldots, H_N\}, H_i = \{H_{i,1}, H_{i,2}, \ldots, H_{i,N_i}\}, and H_{i,j} is the hidden variable that indicates whether instance x_{i,j} is discriminative for the label y_i of bag X_i. We further assume that each X_{i,j} depends on H_{i,j} only and that the X_{i,j} are independent of each other given H_{i,j}. Thus

    P(X, H) = \prod_{i=1}^{N} \Big( \prod_{j=1}^{N_i} P(X_{i,j} \mid H_{i,j}) \Big) P(H_i).    (2)

We maximize the data likelihood P(X) using EM.

1. At the initial E step, we set H_{i,j} = 1 for all i, j. This means that all instances are considered discriminative.

2. M step: We update the model parameter \theta to maximize the data likelihood

    \theta \leftarrow \arg\max_{\theta} P(X \mid H; \theta)
            = \arg\max_{\theta} \prod_{x_{i,j} \in D} P(x_{i,j}, y_i \mid \theta) \times \prod_{x_{p,q} \notin D} P(x_{p,q}, y_q \mid \theta),    (3)

where D is the set of discriminative patches. Assuming a uniform generative model for all non-discriminative instances, the second product does not depend on \theta, so the optimization in Eq. 3 simplifies to:

    \arg\max_{\theta} \prod_{x_{i,j} \in D} P(x_{i,j}, y_i \mid \theta)
        = \arg\max_{\theta} \prod_{x_{i,j} \in D} P(y_i \mid x_{i,j}; \theta) \, P(x_{i,j} \mid \theta).    (4)

Additionally, we assume a uniform distribution over x_{i,j}. Thus Eq. 4 describes a discriminative model (in this paper we use a CNN).

3. E step: We estimate the hidden variables H. In particular, H_{i,j} = 1 if and only if P(H_{i,j} | X) is above a certain threshold. In the case of image classification, given the i-th image, P(H_{i,j} | X) is obtained by applying Gaussian smoothing on P(y_i | x_{i,j}; \theta) (detailed in Sec. 3). This smoothing step utilizes the spatial relationship of P(y_i | x_{i,j}; \theta) in the image. We then iterate back to the M step till convergence.

Many MIL algorithms can be interpreted through this formulation. Based on the SMI assumption, the instance with the maximum P(H_{i,j} | X) is the discriminative instance for the positive bag, as in the EM Diverse Density (EM-DD) [55] and the BP-MIP [43, 57] algorithms.
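For illustration, the loop below is a minimal sketch (not the authors' implementation) of the EM procedure just described: the M step trains the patch-level CNN on the currently discriminative patches, each patch inheriting its slide's label, and the E step re-estimates which patches are discriminative from the smoothed probability maps. The helpers `train_cnn`, `predict_probability_maps`, and `select_discriminative` are hypothetical stand-ins for the training, inference, and thresholding steps of Sec. 2 and Sec. 3.

```python
# Sketch of the EM loop from Sec. 2, assuming wsis maps a slide id to its
# list of patches and image_labels maps a slide id to its subtype label.
def em_train(wsis, image_labels, num_rounds=5):
    # Initial E step: every patch is considered discriminative (H_ij = 1).
    discriminative = {wsi_id: list(range(len(patches)))
                      for wsi_id, patches in wsis.items()}
    model = None
    for _ in range(num_rounds):
        # M step: train the patch-level CNN only on discriminative patches;
        # each patch uses the label of its whole-slide image.
        train_x = [wsis[w][j] for w in wsis for j in discriminative[w]]
        train_y = [image_labels[w] for w in wsis for j in discriminative[w]]
        model = train_cnn(train_x, train_y, init=model)

        # E step: predict P(y_i | x_ij; theta), smooth it spatially, and keep
        # patches whose smoothed probability exceeds the threshold (Sec. 3).
        prob_maps = predict_probability_maps(model, wsis)
        discriminative = select_discriminative(prob_maps, image_labels)
    return model
```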
3. Discriminative patch selection

Patches x_{i,j} that have P(H_{i,j} | X) larger than a threshold T_{i,j} are considered discriminative and are selected to continue training the CNN. We present in this section the estimation of P(H | X) and the choice of the threshold.

It is reasonable to assume that P(H_{i,j} | X) is correlated with P(y_i | x_{i,j}; \theta), i.e. patches with lower P(y_i | x_{i,j}; \theta) tend to have a lower probability of being discriminative. However, a hard-to-classify patch, or a patch close to the decision boundary, may have low P(y_i | x_{i,j}; \theta) as well. These patches are informative and should not be rejected. Therefore, to obtain a more robust P(H_{i,j} | X), we apply the following two steps: First, we train two CNNs on two different scales in parallel; P(y_i | x_{i,j}; \theta) is the averaged prediction of the two CNNs. Second, we simply denoise the probability map P(y_i | x_{i,j}; \theta) of each image with a Gaussian kernel to compute P(H_{i,j} | X). This use of spatial relationships yields more robust discriminative patch identification, as shown in the experiments in Sec. 5.

Choosing the thresholding scheme carefully yields significantly better performance than a simpler thresholding scheme [39]. We obtain the threshold T_{i,j} for P(H_{i,j} | X) as follows: We denote by S_i the set of P(H_{i,j} | X) values for all x_{i,j} of the i-th image and by E_c the set of P(H_{i,j} | X) values for all x_{i,j} of the c-th class. We introduce the image-level threshold H_i as the P_1-th percentile of S_i and the class-level threshold R_i as the P_2-th percentile of E_c, where P_1 and P_2 are predefined. The threshold T_{i,j} is defined as the minimum value between H_i and R_i. There are two advantages to our method. First, by using the image-level threshold, at least 1 - P_1 percent of the patches in each image are considered discriminative. Second, by using the class-level threshold, the thresholds can be easily adapted to classes with different prior probabilities.
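A small sketch of this selection step follows, assuming `prob_map` holds the per-patch P(y_i | x_ij; theta) values of one image laid out on its patch grid, and `class_values` pools the smoothed values over all images of that image's class. The Gaussian smoothing and the two percentile thresholds follow the description above; the smoothing width sigma is an illustrative choice, not a value given in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def select_discriminative_patches(prob_map, class_values, p1, p2, sigma=1.0):
    """Return a boolean mask of discriminative patches for one image.

    prob_map:     2-D array of P(y_i | x_ij; theta) on the patch grid.
    class_values: 1-D array of smoothed values pooled over the image's class.
    p1, p2:       fractions in [0, 1] defining the image- and class-level
                  percentile thresholds (e.g. 0.18-0.25 and 0.05-0.28).
    sigma:        width of the Gaussian smoothing kernel (assumed value).
    """
    # Spatial smoothing of the probability map gives P(H_ij | X).
    smoothed = gaussian_filter(prob_map, sigma=sigma)

    # Image-level threshold H_i: the P1-th percentile over this image.
    h_i = np.quantile(smoothed, p1)
    # Class-level threshold R_i: the P2-th percentile over the whole class.
    r_i = np.quantile(class_values, p2)

    # T_ij is the minimum of the two; patches above it are kept.
    t_ij = min(h_i, r_i)
    return smoothed > t_ij
```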
4. Image-level decision fusion model

We combine the patch-level classifiers of Sec. 3 to predict the image-level label. We input all patch-level predictions into a multi-class logistic regression or SVM that outputs the image-level label. This decision level fusion method [28] is more robust than max-pooling [45]. Moreover, this method can be thought of as a Count-based Multiple Instance (CMI) learning method with two-level learning [49], which rests on a more general MIL assumption [20] than the Standard Multiple Instance (SMI) assumption.

There are three reasons for combining multiple instances: First, on difficult datasets, we do not want to assign an image-level prediction based on a single patch-level prediction (as is the case under the SMI assumption [18]). Second, even though certain patches are not discriminative individually, their joint appearance might be discriminative. For example, a WSI of the "mixed" glioma, Oligoastrocytoma (see App. A), should be recognized when two single glioma subtypes (Oligodendroglioma and Astrocytoma) are jointly present on the slide, possibly in non-overlapping regions. Third, because the patch-level model is never perfect and probably biased, an image-level decision fusion model may learn to correct the bias of patch-level decisions.

Because it is unclear at this time whether strongly discriminative features for cancer subtypes exist at whole slide scale [34], we fuse patch-level predictions without the spatial relationship between patches. In particular, the class histogram of the patch-level predictions is the input to a linear multi-class logistic regression model [8] or an SVM with Radial Basis Function (RBF) kernel [10]. Because a WSI contains at least hundreds of patches, the class histogram is very robust to misclassified patches. To generate the histogram, we sum up all of the class probabilities given by the patch-level CNN. Moreover, we concatenate histograms from four CNN models: CNNs trained at two patch scales for two different numbers of iterations. We found in practice that using multiple histograms is robust.
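As a toy illustration of this fusion step (not the authors' implementation), the sketch below builds the class histogram by summing per-patch class probabilities and feeds it to a multi-class logistic regression; the concatenation of histograms from multiple CNNs and the held-out patch split are omitted, and the data here are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def class_histogram(patch_probs):
    # patch_probs: (num_patches, num_classes) softmax outputs of the
    # patch-level CNN for one slide. Following Sec. 4, the image-level
    # feature is the sum of class probabilities over all patches.
    return patch_probs.sum(axis=0)

# Toy usage with synthetic patch predictions: 10 slides, 200 patches, 3 classes.
rng = np.random.default_rng(0)
slide_features = np.stack([
    class_histogram(rng.dirichlet(np.ones(3), size=200)) for _ in range(10)
])
slide_labels = rng.integers(0, 3, size=10)

fusion = LogisticRegression(max_iter=1000)  # linear multi-class model
fusion.fit(slide_features, slide_labels)
print(fusion.predict(slide_features[:3]))
```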
5. Experiments

We evaluate our method on two Whole Slide Tissue Image (WSI) classification problems: classification of glioma and Non-Small-Cell Lung Carcinoma (NSCLC) cases into glioma and NSCLC subtypes. Glioma is a type of brain cancer that arises from glial cells. It is the most common malignant brain tumor and the leading cause of cancer-related deaths in people under age 20 [1]. NSCLC is the most common lung cancer and the leading cause of cancer-related deaths overall [3]. Classifying glioma and NSCLC into their respective subtypes and grades is crucial to the study of disease onset and progression in order to provide targeted therapies. The dataset of WSIs used in the experiments is part of the public Cancer Genome Atlas (TCGA) dataset [2]. It contains detailed clinical information and the Hematoxylin and Eosin (H&E) stained images of various cancers. The typical resolution of a WSI in this dataset is 100K by 50K pixels. In the rest of this section, we first describe the algorithms we tested and then show the evaluation results on the glioma and NSCLC classification tasks.

5.1. Patch extraction and segmentation

To train the CNN model, we extract patches of size 500x500 from WSIs (examples in Fig. 3). To capture structures at multiple scales, we extract patches at 20X (0.5 microns per pixel) and 5X (2.0 microns per pixel) objective magnifications. We discard patches with less than 30% tissue sections or with too much blood. We extract around 1000 valid patches per image per scale. In most cases the patches are non-overlapping given the WSI resolution.

To prevent the CNN from overfitting, we perform three kinds of data augmentation in every iteration. We select a random 400x400 sub-patch from each 500x500 patch. We randomly rotate and mirror the sub-patch. We randomly adjust the amount of Hematoxylin and Eosin staining on the tissue. This is done by decomposing the RGB color of the tissue into the H&E color space [44], followed by multiplying the magnitude of H and E of every pixel by two i.i.d. Gaussian random variables with expectation equal to one.
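For illustration, a sketch of this stain augmentation assuming scikit-image's color deconvolution routines as a stand-in for the method of [44]; the standard deviation of the perturbation is an assumed value, not one specified in the text.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def augment_he_stain(rgb_patch, sigma=0.05, rng=None):
    # rgb_patch: float RGB image in [0, 1], shape (H, W, 3).
    # Decompose into Hematoxylin/Eosin/DAB channels via color deconvolution,
    # scale H and E by independent Gaussian factors centered at 1, recompose.
    rng = rng or np.random.default_rng()
    hed = rgb2hed(rgb_patch)
    hed[..., 0] *= rng.normal(1.0, sigma)   # Hematoxylin channel
    hed[..., 1] *= rng.normal(1.0, sigma)   # Eosin channel
    return np.clip(hed2rgb(hed), 0.0, 1.0)
```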
Figure 3: Some 20X sample patches of gliomas and Non-Small-Cell Lung Carcinoma (NSCLC) from the TCGA dataset: (a) GBM, (b) OD, (c) OA, (d) DA, (e) SCC, (f) ADC. Two patches in each column belong to the same subtype of cancer. Notice the large intra-class heterogeneity.
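Returning to the patch extraction step described at the start of Sec. 5.1, the sketch below illustrates one way to tile a WSI and keep only tissue-bearing patches, assuming the OpenSlide library for reading slides; the brightness-based tissue test is a simple heuristic standing in for the authors' 30%-tissue and blood criteria.

```python
import numpy as np
import openslide

def extract_patches(wsi_path, patch_size=500, level=0, min_tissue=0.3):
    # level selects a pyramid level (e.g. corresponding to 20X or 5X);
    # read_region expects the location in level-0 coordinates.
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.level_dimensions[level]
    scale = int(slide.level_downsamples[level])
    patches = []
    for y in range(0, height - patch_size, patch_size):
        for x in range(0, width - patch_size, patch_size):
            region = slide.read_region((x * scale, y * scale), level,
                                       (patch_size, patch_size)).convert("RGB")
            rgb = np.asarray(region)
            # Crude tissue check: fraction of pixels darker than the
            # near-white slide background.
            tissue_fraction = (rgb.mean(axis=2) < 220).mean()
            if tissue_fraction >= min_tissue:
                patches.append(rgb)
    return patches
```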

5.2. CNN architecture

The architecture of our CNN is shown in Tab. 1. We used the CAFFE toolbox [25] for the CNN implementation. The network was trained on a single NVidia Tesla K40 GPU.

Layer     | Filter size, stride | Output W x H x N
Input     | -                   | 400 x 400 x 3
Conv      | 10 x 10, 2          | 196 x 196 x 80
ReLU+LRN  | -                   | 196 x 196 x 80
Max-pool  | 6 x 6, 4            | 49 x 49 x 80
Conv      | 5 x 5, 1            | 45 x 45 x 120
ReLU+LRN  | -                   | 45 x 45 x 120
Max-pool  | 3 x 3, 2            | 22 x 22 x 120
Conv      | 3 x 3, 1            | 20 x 20 x 160
ReLU      | -                   | 20 x 20 x 160
Conv      | 3 x 3, 1            | 18 x 18 x 200
ReLU      | -                   | 18 x 18 x 200
Max-pool  | 3 x 3, 2            | 9 x 9 x 200
FC        | -                   | 320
ReLU+Drop | -                   | 320
FC        | -                   | 320
ReLU+Drop | -                   | 320
FC        | -                   | Dataset dependent
Softmax   | -                   | Dataset dependent

Table 1: The architecture of our CNN used in glioma and NSCLC classification. ReLU+LRN is a sequence of Rectified Linear Units (ReLU) followed by Local Response Normalization (LRN). Similarly, ReLU+Drop is a sequence of ReLU followed by dropout. The dropout probability is 0.5.
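The authors implemented this network in CAFFE; the following is a rough PyTorch sketch of Tab. 1 for illustration only. Pooling uses ceil mode so the spatial sizes track the table, and the LRN and dropout hyper-parameters other than the 0.5 dropout rate are assumptions.

```python
import torch.nn as nn

def build_patch_cnn(num_classes):
    # Approximate PyTorch rendering of Tab. 1 (the paper used CAFFE).
    # ceil_mode=True keeps the pooled sizes at 49, 22 and 9 as in the table.
    return nn.Sequential(
        nn.Conv2d(3, 80, kernel_size=10, stride=2),    # 400 -> 196
        nn.ReLU(), nn.LocalResponseNorm(5),
        nn.MaxPool2d(6, stride=4, ceil_mode=True),     # 196 -> 49
        nn.Conv2d(80, 120, kernel_size=5),             # 49 -> 45
        nn.ReLU(), nn.LocalResponseNorm(5),
        nn.MaxPool2d(3, stride=2, ceil_mode=True),     # 45 -> 22
        nn.Conv2d(120, 160, kernel_size=3),            # 22 -> 20
        nn.ReLU(),
        nn.Conv2d(160, 200, kernel_size=3),            # 20 -> 18
        nn.ReLU(),
        nn.MaxPool2d(3, stride=2, ceil_mode=True),     # 18 -> 9
        nn.Flatten(),
        nn.Linear(200 * 9 * 9, 320), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(320, 320), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(320, num_classes),  # softmax applied by the loss function
    )
```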
5.3. Experiment setup

The WSIs of 80% of the patients are randomly selected to train the model and the remaining 20% to test. Depending on the method, training patches are further divided into i) CNN and ii) decision fusion model training sets. We separate the data twice and average the results. The tested algorithms are:

1. CNN-Vote: CNN followed by voting (average-pooling). We use all patches extracted from a WSI to train the patch-level CNN. There is no second-level model. Instead, the predictions of all patches vote for the final predicted label of a WSI.

2. CNN-SMI: CNN followed by max-pooling. Same as CNN-Vote except that the final predicted label of a WSI equals the predicted label of the patch with the maximum probability over all patches and classes.

3. CNN-Fea-SVM: We apply feature fusion instead of decision level fusion. In particular, we aggregate the outputs of the second fully connected layer of the CNN on all patches by 3-norm pooling [50]. Then an SVM with RBF kernel predicts the image-level label.

4. EM-CNN-Vote/SMI, EM-CNN-Fea-SVM: EM-based method with CNN-Vote, CNN-SMI, and CNN-Fea-SVM respectively. We train the patch-level EM-CNN on discriminative patches identified by the E-step. Depending on the dataset, the discriminative threshold P_1 for each image ranges from 0.18 to 0.25; the discriminative threshold P_2 for each class ranges from 0.05 to 0.28 (details in Sec. 3). In each M-step, we train the CNN on all the discriminative patches for 2 epochs.

5. EM-Finetune-CNN-Vote/SMI: Similar to EM-CNN-Vote/SMI except that instead of training a CNN from scratch, we fine-tune a pretrained 16-layer CNN model [46] by training it on discriminative patches.

6. CNN-LR: CNN followed by logistic regression. Same as CNN-Vote except that we train a second-level multi-class logistic regression to predict the image-level label. One tenth of the patches in each image is held out from the CNN to train the second-level multi-class logistic regression.

7. CNN-SVM: CNN followed by an SVM with RBF kernel instead of logistic regression.

8. EM-CNN-LR/SVM: EM-based method with CNN-LR and CNN-SVM respectively.

9. EM-CNN-LR w/o spatial smoothing: We do not apply Gaussian smoothing to estimate P(H | X). Otherwise similar to EM-CNN-LR.

10. EM-Finetune-CNN-LR/SVM: Similar to EM-CNN-LR/SVM except that instead of training a CNN from scratch, we fine-tune a pretrained 16-layer CNN model [46] by training it on discriminative patches.

11. SMI-CNN-SMI: CNN with max-pooling at both the discriminative patch identification and image-level prediction steps. For the patch-level CNN training, in each WSI only one patch with the highest confidence is considered discriminative.

12. NM-LBP: We extract Nuclear Morphological features [15] and rotation invariant Local Binary Patterns [37] from all patches. We build a Bag-of-Words (BoW) [19, 53] feature using k-means, followed by an SVM with RBF kernel [10], as a non-CNN baseline.

13. Pretrained-CNN-Fea-SVM: Similar to CNN-Fea-SVM, but instead of training a CNN, we use a pretrained 16-layer CNN model [46] to extract features from patches. Then we select the top 500 features according to accuracy on the training set [50].

14. Pretrained-CNN-Bow-SVM: We build a BoW model using k-means on features extracted by the pretrained CNN, followed by an SVM [50].

5.4. WSI of glioma classification

There are WSIs of six subtypes of glioma in the TCGA dataset [2]. The numbers of WSIs and patients in each class are shown in Tab. 2. All classes are described in App. A.

Gliomas    | GBM | OD  | OA  | DA  | AA | AO
# patients | 209 | 100 | 106 | 82  | 29 | 13
# WSIs     | 510 | 206 | 183 | 114 | 36 | 15

Table 2: The numbers of WSIs and patients in each class from the TCGA dataset. Class descriptions are in App. A.

The results of our experiments are shown in Tab. 3. The confusion matrix is given in Tab. 4. An experiment showed that the inter-observer agreement of two experienced pathologists on a similar dataset was approximately 70%, and that even after reviewing the cases together, they agreed only around 80% of the time [22]. Therefore, our accuracy of 77% is similar to inter-observer agreement.

Methods                          | Acc   | mAP
CNN-Vote                         | 0.710 | 0.812
CNN-SMI                          | 0.710 | 0.822
CNN-Fea-SVM                      | 0.688 | 0.790
EM-CNN-Vote                      | 0.733 | 0.837
EM-CNN-SMI                       | 0.719 | 0.823
EM-CNN-Fea-SVM                   | 0.686 | 0.790
EM-Finetune-CNN-Vote             | 0.719 | 0.817
EM-Finetune-CNN-SMI              | 0.638 | 0.758
CNN-LR                           | 0.752 | 0.847
CNN-SVM                          | 0.697 | 0.791
EM-CNN-LR                        | 0.771 | 0.845
EM-CNN-LR w/o spatial smoothing  | 0.745 | 0.832
EM-CNN-SVM                       | 0.730 | 0.818
EM-Finetune-CNN-LR               | 0.721 | 0.822
EM-Finetune-CNN-SVM              | 0.738 | 0.828
SMI-CNN-SMI                      | 0.683 | 0.765
NM-LBP                           | 0.629 | 0.734
Pretrained CNN-Fea-SVM           | 0.733 | 0.837
Pretrained-CNN-Bow-SVM           | 0.667 | 0.756
Chance                           | 0.513 | 0.689

Table 3: Glioma classification results. The proposed EM-CNN-LR method achieved the best result, close to inter-observer agreement between pathologists (Sec. 5.4).

Ground Truth \ Predictions | GBM | OD | OA | DA | AA | AO
GBM                        | 214 | 0  | 2  | 0  | 1  | 0
OD                         | 1   | 47 | 22 | 2  | 0  | 1
OA                         | 1   | 18 | 40 | 8  | 3  | 1
DA                         | 3   | 9  | 6  | 20 | 0  | 1
AA                         | 3   | 2  | 3  | 3  | 4  | 0
AO                         | 2   | 2  | 3  | 0  | 0  | 1

Table 4: Confusion matrix of glioma classification. The nature of Oligoastrocytoma causes the most confusions. See Sec. 5.4 for details.

In the confusion matrix, we note that the classification accuracy between GBM and Low-Grade Glioma (LGG) is 97% (chance was 51.3%). A fully supervised method achieved 85% accuracy using a domain specific algorithm trained on ten manually labeled patches per class [35]. Our method is the first to classify five LGG subtypes automatically, a much more challenging classification task than the benchmark GBM vs. LGG classification. We achieve 57.1% LGG-subtype classification accuracy, with chance at 36.7%. Most of the confusions are related to oligoastrocytoma (OA), since it is a mixed glioma that is challenging for pathologists to agree on, according to a neuropathology study: "Oligoastrocytomas contain distinct regions of oligodendroglial and astrocytic differentiation... The minimal percentage of each component required for the diagnosis of a mixed glioma has been debated, resulting in poor inter-observer reproducibility for this group of neoplasms." [9].

We compare recognition rates for the OA subtype. The F-score of OA recognition is 0.426, 0.482, and 0.544 using PreCNN-Fea-SVM, CNN-LR, and EM-CNN-LR respectively.
We thus see that the improvement over other methods becomes increasingly more significant using our proposed method on the harder-to-classify classes. The discriminative patch (region) segmentation results in Fig. 4 demonstrate the quality of our EM-based method.

Figure 4: Examples of discriminative patch (region) segmentation (best viewed in color). Rows show WSIs of subtypes GBM, GBM, OD, and OA; columns show the WSIs and the Pathologist, Max-pooling, and EM results. Discriminative regions are indicated in red. Diagnostic or highly discriminative regions are yellow. Non-discriminative regions are in black. Pathologist: ground truth by a pathologist. Max-pooling: results by CNN with the SMI assumption (SMI-CNN-SMI); the discriminative patches are indicated by red arrows. EM: results by our EM-based patch-level CNN (EM-CNN-Vote/SMI/LR). Notice that max-pooling does not segment enough discriminative regions.

5.5. WSI of NSCLC classification

We use three major subtypes of Non-Small-Cell Lung Carcinoma (NSCLC). The numbers of WSIs and patients in each class are in Tab. 5. All classes are listed in App. A.

NSCLCs     | SCC | ADC | ADC-mix
# patients | 316 | 250 | 75
# WSIs     | 347 | 291 | 80

Table 5: The numbers of WSIs and patients in each class from the TCGA dataset. Class descriptions are in App. A.

Experimental results are shown in Tab. 6; the confusion matrix is in Tab. 7. When classifying SCC vs. non-SCC, the inter-observer agreement between pulmonary pathology experts and between community pathologists, measured by Cohen's kappa, is kappa = 0.64 and kappa = 0.41 respectively [21]. We achieved kappa = 0.75. When classifying ADC vs. non-ADC, the inter-observer agreement between experts and between community pathologists is kappa = 0.69 and kappa = 0.46 respectively [21]. We achieved kappa = 0.60. Therefore, our results appear close to inter-observer agreement.

Methods                | Acc   | mAP
CNN-Vote               | 0.702 | 0.838
CNN-SMI                | 0.731 | 0.852
CNN-Fea-SVM            | 0.637 | 0.793
EM-CNN-Vote            | 0.714 | 0.842
EM-CNN-SMI             | 0.731 | 0.850
EM-CNN-Fea-SVM         | 0.637 | 0.791
EM-Finetune-CNN-Vote   | 0.773 | 0.877
EM-Finetune-CNN-SMI    | 0.729 | 0.853
CNN-LR                 | 0.727 | 0.845
CNN-SVM                | 0.738 | 0.856
EM-CNN-LR              | 0.743 | 0.856
EM-CNN-SVM             | 0.759 | 0.869
EM-Finetune-CNN-LR     | 0.784 | 0.883
EM-Finetune-CNN-SVM    | 0.798 | 0.889
SMI-CNN-SMI            | 0.531 | 0.749
Pretrained CNN-Fea-SVM | 0.778 | 0.879
Pretrained-CNN-Bow-SVM | 0.759 | 0.871
Chance                 | 0.484 | 0.715

Table 6: NSCLC classification results. The proposed EM-CNN-SVM and EM-Finetune-CNN-SVM achieved the best results, close to the inter-observer agreement between pathologists. See Sec. 5.5 for details.

Ground Truth \ Predictions | SCC | ADC | ADC-mix
SCC                        | 199 | 26  | 0
ADC                        | 30  | 155 | 11
ADC-mix                    | 2   | 25  | 17

Table 7: The confusion matrix of NSCLC classification.

The ADC-mix subtype is hard to classify because it contains visual features of multiple NSCLC subtypes. The Pretrained CNN-Fea-SVM method achieves an F-score of 0.412 recognizing ADC-mix cases, whereas our proposed method EM-Finetune-CNN-SVM achieves 0.472. Consistent with the glioma results, our method's performance advantages are more pronounced in the hardest cases.

5.6. Rail surface defect severity grade classification

We evaluate our approach beyond classification of pathology images. A CNN cannot be applied to gigapixel images directly because of computational limitations. Even when the images are small enough for CNNs, our patch-based method compares favorably to an image-based CNN if discriminative information is encoded at image patch scale and dispersed throughout the images.

We classify the severity grade of rail surface defects. Automatic defect grading can obviate the need for laborious examination and grading of rail surface defects on a regular basis. We used a dataset [32] of 939 rail surface images with defect severity grades from 0 to 7.
The typical image resolution is 1200x500, as in Fig. 5.

Figure 5: Sample images of rail surfaces with defect severity grades (a) 0, (b) 2, (c) 4, and (d) 7. The grade indicates defect severity. Notice that the defects are at image patch scale and dispersed throughout the image.

To support our claim, we tested two additional methods:

1. CNN-Image: We apply the CNN at image scale directly. In particular, we train the CNN on 400x400 regions randomly extracted from images in each iteration. At test time, we apply the CNN on five regions (top left, top right, bottom left, bottom right, center) and average the predictions.

2. Pretrained CNN-ImageFea-SVM: We apply a pretrained 16-layer network [46] to rail surface images to extract features, and train an SVM on these features.

The CNN used in this experiment has a similar architecture to the one described in Tab. 1, with smaller and fewer filters. The size of patches in our patch-based methods is 64 by 64. We apply 4-fold cross-validation and show the averaged results in Tab. 8. Our patch-based methods EM-CNN-SVM and EM-CNN-Fea-SVM outperform the conventional image-based method CNN-Image. Moreover, results using CNN features extracted on patches (Pretrained CNN-Fea-SVM) are better than results with CNN features extracted on images (Pretrained CNN-ImageFea-SVM).

Methods                     | Acc   | mAP
CNN-Vote                    | 0.695 | 0.823
CNN-SMI                     | 0.700 | 0.801
CNN-Fea-SVM                 | 0.822 | 0.903
EM-CNN-Vote                 | 0.683 | 0.817
EM-CNN-SMI                  | 0.684 | 0.799
EM-CNN-Fea-SVM              | 0.830 | 0.908
CNN-LR                      | 0.764 | 0.867
CNN-SVM                     | 0.803 | 0.886
EM-CNN-LR                   | 0.772 | 0.871
EM-CNN-SVM                  | 0.813 | 0.895
SMI-CNN-SMI                 | 0.258 | 0.461
Pretrained CNN-Fea-SVM      | 0.808 | 0.894
CNN-Image                   | 0.770 | 0.876
Pretrained CNN-ImageFea-SVM | 0.778 | 0.878
Chance                      | 0.228 | 0.438

Table 8: Rail surface defect severity grade classification results. Our patch-based methods EM-CNN-SVM and EM-CNN-Fea-SVM outperform the image-based methods CNN-Image and Pretrained CNN-ImageFea-SVM significantly.

6. Conclusions

We presented a patch-based Convolutional Neural Network (CNN) model with a supervised decision fusion model that is successful in Whole Slide Tissue Image (WSI) classification. We proposed an Expectation-Maximization (EM) based method that identifies discriminative patches automatically for CNN training. With our algorithm, we can classify subtypes of cancers given WSIs of patients with accuracy similar or close to inter-observer agreement between pathologists. Furthermore, we experimentally demonstrated, using a comparable non-cancer dataset of smaller images, that the performance of our patch-based CNN compares favorably to that of an image-based CNN. In the future we will leverage the non-discriminative patches as part of the data likelihood in the EM formulation. We will optimize CNN training so that it scales up to larger scale pathology datasets.

Acknowledgment

This work was supported in part by 1U24CA180924-01A1 from the National Cancer Institute, R01LM011119-01 and R01LM009239, and partially supported by NSF IIS-1161876, IIS-1111047, FRA DTFR5315C00011, the Subsample project from the DIGITEO Institute, France, and a gift from Adobe Corp. We thank Ke Ma for providing the rail surface dataset.

Appendix A. Description of cancer subtypes

GBM: Glioblastoma, ICD-O 9440/3, WHO grade IV. A Whole Slide Image (WSI) is classified as GBM iff one patch can be classified as GBM with high confidence.
OD: Oligodendroglioma, ICD-O 9450/3, WHO grade II.
OA: Oligoastrocytoma, ICD-O 9382/3, WHO grade II; Anaplastic oligoastrocytoma, ICD-O 9382/3, WHO grade III. This mixed glioma subtype is hard to classify even by pathologists [22].
DA: Diffuse astrocytoma, ICD-O 9400/3, WHO grade II.
AA: Anaplastic astrocytoma, ICD-O 9401/3, WHO grade III.
AO: Anaplastic oligodendroglioma, ICD-O 9451/3, WHO grade III.
LGG: Low-Grade Glioma. Includes OD, OA, DA, AA, AO.
SCC: Squamous cell carcinoma, ICD-O 8070/3.
ADC: Adenocarcinoma, ICD-O 8140/3.
ADC-mix: ADC with mixed subtypes, ICD-O 8255/3.
References

[1] Brain tumor statistics. http://www.abta.org/about-us/news/brain-tumor-statistics/.
[2] The Cancer Genome Atlas. https://tcga-data.nci.nih.gov/tcga/.
[3] Non-small-cell lung carcinoma. http://www.cancer.org/cancer/lungcancer-non-smallcell/.
[4] D. Altunbay, C. Cigir, C. Sokmensuer, and C. Gunduz-Demir. Color graphs for automated cancer diagnosis and grading. J Biomed Eng, 2010.
[5] J. Amores. Multiple instance classification: Review, taxonomy and comparative study. AIJ, 2013.
[6] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2002.
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
[8] C. M. Bishop et al. Pattern recognition and machine learning. 2006.
[9] D. J. Brat, R. A. Prayson, T. C. Ryken, and J. J. Olson. Diagnosis of malignant glioma: role of neuropathology. Journal of Neuro-oncology, 2008.
[10] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2011.
[11] H. Chang, Y. Zhou, A. Borowsky, K. Barner, P. Spellman, and B. Parvin. Stacked predictive sparse decomposition for classification of histology sections. IJCV, 2014.
[12] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv, 2014.
[13] Z. Chen, Z. Chi, H. Fu, and D. Feng. Multi-instance multi-label image classification: A neural approach. Neurocomputing, 2013.
[14] D. C. Cireşan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, 2013.
[15] L. A. Cooper, J. Kong, D. A. Gutman, F. Wang, J. Gao, C. Appin, S. Cholleti, T. Pan, A. Sharma, L. Scarpace, et al. Integrated morphologic analysis for the identification and characterization of disease subtypes. JAMIA, 2012.
[16] E. Cosatto, P.-F. Laquerre, C. Malon, H.-P. Graf, A. Saito, T. Kiyuna, A. Marugame, and K. Kamijo. Automated gastric cancer diagnosis on H&E-stained sections; training a classifier on a large scale with multiple instance machine learning. In Medical Imaging, 2013.
[17] A. Cruz-Roa, A. Basavanhally, F. González, H. Gilmore, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, and A. Madabhushi. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In Medical Imaging, 2014.
[18] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. AIJ, 1997.
[19] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
[20] J. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowl Eng Rev, 2010.
[21] J. E. Grilley-Olson, D. T. Moore, K. O. Leslie, B. F. Qaqish, X. Yin, M. A. Socinski, T. E. Stinchcombe, L. B. Thorne, T. C. Allen, P. M. Banks, et al. Validation of interobserver agreement in lung cancer assessment: hematoxylin-eosin diagnostic reproducibility for non-small cell lung cancer: the 2004 World Health Organization classification and therapeutically relevant subsets. Archives of Pathology & Laboratory Medicine, 2013.
[22] M. Gupta, A. Djalilvand, and D. J. Brat. Clarifying the diffuse gliomas: an update on the morphologic features and markers that discriminate oligodendroglioma from astrocytoma. AJCP, 2005.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[24] M. Hoai and A. Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[27] M. Kim and F. Torre. Gaussian processes multiple instance learning. In ICML, 2010.
[28] M. M. Kokar, J. A. Tomasik, and J. Weyman. Data vs. decision fusion in the category theory framework. 2001.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[31] C. H. Li, I. Gondra, and L. Liu. An efficient parallel neural network-based multi-instance learning algorithm. J Supercomput, 2012.
[32] K. Ma, T. F. Y. Vicente, D. Samaras, M. Petrucci, and D. L. Magnus. Texture classification for rail surface condition evaluation. In WACV, 2016.
[33] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. NIPS, 1998.
[34] S. E. Mills. Histology for Pathologists. Lippincott Williams & Wilkins, 2012.
[35] H. S. Mousavi, V. Monga, G. Rao, and A. U. Rao. Automated discrimination of lower and higher grade gliomas based on histopathological image analysis. JPI, 2015.
[36] M. H. Nguyen, L. Torresani, F. De La Torre, and C. Rother. Weakly supervised discriminative localization and classification: a joint learning process. In ICCV, 2009.
[37] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI, 2002.
[38] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. In NIPS.
[39] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv, 2015.
[40] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. arXiv, 2014.
[41] P. O. Pinheiro and R. Collobert. Weakly supervised semantic segmentation with convolutional networks. arXiv, 2014.
[42] S. Poria, E. Cambria, and A. Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis.
[43] J. Ramon and L. De Raedt. Multi instance neural networks. 2000.
[44] A. C. Ruifrok and D. A. Johnston. Quantification of histochemical staining by color deconvolution. Anal Quant Cytol Histol, 2001.
[45] A. Seff, L. Lu, K. M. Cherry, H. R. Roth, J. Liu, S. Wang, J. Hoffman, E. B. Turkbey, and R. M. Summers. 2D view aggregation for lymph node detection using a shallow hierarchy of linear classifiers. In MICCAI, 2014.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[47] F. Tabib Mahmoudi, F. Samadzadegan, and P. Reinartz. A decision level fusion method for object recognition using multi-angular imagery. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2013.
[48] T. H. Vu, H. S. Mousavi, V. Monga, U. Rao, and G. Rao. DFDL: Discriminative feature-oriented dictionary learning for histopathological image classification. arXiv, 2015.
[49] N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method for generalized multi-instance problems. In ECML, 2003.
[50] Y. Xu, Z. Jia, Y. Ai, F. Zhang, M. Lai, E. I. Chang, et al. Deep convolutional activation features for large scale brain tumor histopathology image classification and segmentation. In ICASSP, 2015.
[51] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, E. I. Chang, et al. Deep learning of feature representation with multiple instance learning for medical image analysis. In ICASSP, 2014.
[52] Y. Xu, J.-Y. Zhu, I. Eric, C. Chang, M. Lai, and Z. Tu. Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis, 2014.
[53] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo. Evaluating bag-of-visual-words representations in scene classification. In Workshop on Multimedia Information Retrieval, 2007.
[54] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2005.
[55] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In NIPS, 2001.
[56] Y. Zhou, H. Chang, K. Barner, P. Spellman, and B. Parvin. Classification of histology sections via multispectral convolutional sparse coding. In CVPR, 2014.
[57] Z.-H. Zhou and M.-L. Zhang. Neural networks for multi-instance learning. In ICIIT, 2002.
