
MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding


Zilong Wang1*   Jiuxiang Gu2   Chris Tensmeyer2   Nikolaos Barmpalios2
Ani Nenkova2   Tong Sun2   Jingbo Shang1   Vlad I. Morariu2
1 University of California, San Diego   2 Adobe Research
1 {zlwang, jshang}@ucsd.edu   2 {jigu, tensmeye, barmpali, nenkova, tsun, morariu}@adobe.com
* This work was completed while the author was working as an intern at Adobe Research.

Abstract

Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), through medium granularity (e.g., regions such as paragraphs or figures), to coarse granularity (e.g., the whole page). The spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features from either the word-level or the region-level but fail to consider both simultaneously. Word-level models are restricted by the fact that they originate from pure-text language models, which only encode the word-level context. In contrast, region-level models attempt to encode regions corresponding to paragraphs or text blocks into a single embedding, but they perform worse with additional word-level features. To deal with these issues, we propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time. MGDoc uses a unified text-visual encoder to obtain multi-modal features across different granularities, which makes it possible to project the multi-granular features into the same hyperspace. To model the region-word correlation, we design a cross-granular attention mechanism and specific pre-training tasks that reinforce the model's learning of the hierarchy between regions and words. Experiments demonstrate that our proposed model can learn better features that perform well across granularities and lead to improvements in downstream tasks.

Figure 1: The multi-granular structure in document images: (a) page-level, (b) region-level, (c) word-level. The image is from the RVL-CDIP dataset. Important information is encoded at the page-level (e.g., the document type), the region-level (constructs such as paragraphs, tables, etc.), and the word-level (specific semantics). Our proposed model reasons about all of these jointly to leverage information across granularities.

1 Introduction

Document images are ubiquitous and are often used as a representation for forms, receipts, printed papers, etc. Unlike plain text documents, document images express rich information via both textual content and heterogeneous layout patterns, which creates barriers to the automatic processing of these document images. Here, layout pattern refers to how the text is spatially arranged on the document page and involves information from multiple levels of granularity. Specifically, a layout pattern divides the entire page into individual regions and, within each region, the fine-grained textual content is distributed following a certain format, such as paragraphs, columns, or lists, as shown in Figure 1.

The layout of a document provides important cues for interpreting the document through spatial structures such as alignment, proximity, and hierarchy between content at different levels of granularity. For example, a numeric text field is more likely to be the total price on a grocery receipt if it is located at the bottom right of a table region; a region is more likely to correspond to the title area of a form if there is a lot of bold type inside the region. In these two examples, it is important to understand page-level information (e.g., that the document is a receipt or a form), region-level information (e.g., that a region is a table or a title), and word/token-level information (e.g., the font style of a word, or that a token is a number), as well as how these relate to each other. Therefore, to facilitate the automatic processing of such documents, it is essential to consider features at multiple granularities and let the model learn the hierarchy between the different levels so as to encode the multi-granular structure of document images.

However, existing methods in document image understanding formulate document image understanding tasks either at the word-level or at the region-level and thus do not use both cues. They mostly follow language modeling methods designed for plain text settings, formulating document image understanding tasks using word-level information and augmenting semantic features with spatial and visual features to exploit the word-level context (Xu et al., 2020b,a; Hong et al., 2021; Garncarek et al., 2021). Recent works go beyond fine-grained word-level inputs and focus on regions instead of words to acquire useful signals (Li et al., 2021; Gu et al., 2021). By encoding the regions corresponding to paragraphs or text blocks, these region-level models manage to save training resources and achieve good performance with rich locality features. However, these models fail to leverage the cross-region word-level correlation, which is also necessary to tackle fine-grained tasks.

Motivated by this observation, we propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes document information at different levels of granularity and represents them using multi-modal features, as highlighted in Figure 2. Specifically, we use an OCR engine to decompose a document page into three granularities: page-level, region-level, and word-level. Following previous works (Xu et al., 2020b,a; Gu et al., 2021), our multi-modal features represent the text, layout (represented by bounding boxes), and image modalities. The input consists of information at different levels of granularity and can be organized into a hierarchy within the page, which means words are included in the corresponding regions and the page includes all of them. We leverage attention to learn the correlation between inputs from different levels of granularity and add special attention weights to encode the hierarchical structure and relative distances (Xu et al., 2020a; Garncarek et al., 2021; Powalski et al., 2021). We rely on pre-training to encourage the model to learn the alignment between regions at different levels of granularity. In addition, we use masked language modeling for the word-level inputs and extend this idea to the more coarse-grained inputs. We mask a proportion of regions and ask the model to minimize the difference between the masked contextual features and the input features corresponding to the selected region.

Figure 2: Comparison between MGDoc and existing methods. While previous methods have explored multi-modal inputs and either word+page or region+page level features, MGDoc combines multi-modal reasoning with joint word-, region-, and page-level reasoning.

We validate MGDoc on three public benchmarks: the FUNSD dataset (Jaume et al., 2019) for form understanding, the CORD dataset (Park et al., 2019) for receipt extraction, and the RVL-CDIP dataset (Lewis et al., 2006) for document image classification. Extensive experiments demonstrate the effectiveness of our proposed approach, with large improvements on fine-grained tasks and good results on coarse-grained tasks. We summarize our contributions as follows:
• We propose MGDoc, a multi-modal multi-granular pre-training framework, which encodes the hierarchy in document images and integrates features from text, layout patterns, and images.
• We design a cross-granularity attention mechanism and a new pre-training task that enable the model to learn the alignment between different levels. This work extends masked language modeling to different granularities to encode the contextual information.
• Extensive experiments demonstrate the effectiveness of MGDoc on three representative benchmarks.
Figure 3: The multi-modal multi-granular pre-training framework of MGDoc. Inputs consist of visual, textual, and spatial representations of the page, regions on the page, and individual words. Multi-granular attention learns relationships within and across granularities within single modalities, followed by cross-modal attention across modalities. The final output consists of an embedding for each input unit at each granularity, on top of which three self-supervision tasks are added to pre-train the model. Specifically, the multi-granularity task ensures that our model makes use of multi-granular inputs and multi-granular attention to solve spatial tasks.

2 Method

2.1 Overview

MGDoc is a multi-modal multi-granular pre-training framework for document image understanding tasks. The framework encodes features from different levels of granularity in a document page and leverages the spatial hierarchy between them. There are three stages in our architecture. First, an OCR engine, human annotations, or digital parsing provides us with the text and the bounding boxes of content at different levels of granularity. We focus on pages, regions, and words in this paper and leave the more fine-grained pixel-level and coarse-grained multi-page modeling for future work. We input the document image, textual content, and bounding boxes at these three levels to the multi-modal encoder to encode the multi-granular information into text and image embeddings. Following previous work (Xu et al., 2020b), we use spatial embeddings to encode spatial layout information. Next, we design a multi-granular attention mechanism to extract the correlation between the features from different levels. Distinct from the normal self-attention mechanism in BERT (Lu et al., 2019), multi-granular attention computes the dot product between the features and encodes the hierarchical relationship between regions and words by adding an attention bias. Then, a cross-attention mechanism is used to combine the features from different modalities. Finally, the sum of the final text and visual features is used in the pre-training or fine-tuning tasks.

2.2 Multi-modal Encoding

The multi-modal encoding is designed to encode the text, visual, and spatial information of the multi-granular inputs into the embedding space. We first acquire the inputs at the word-level, encoding the text and bounding box of each word. At the region-level, words are grouped into regions, where the text of all words within a region is combined and the enclosing bounding box of those words is used as the region bounding box. At the page-level, the textual input is the sequence of all words on the page, and the width and height of the document image are used as the page-level bounding box. Now, the inputs at different levels of granularity each consist of textual content and a bounding box. We denote the inputs as P, {R_1, ..., R_m}, and {w_1, ..., w_n}, where m and n are the numbers of regions and words.

The multi-modal encoding takes the textual content of each input unit, ranging from a single word, to sentences, to the whole textual content of a page, and encodes it with a pre-trained language model, e.g., SBERT (Reimers and Gurevych, 2019). We add spatial embeddings to the text encoder outputs, where a fully-connected layer is used to project the bounding boxes into a hyperspace. In this way, the text embeddings of our model are augmented with spatial information. Then, MGDoc encodes the entire image with a visual feature extractor, e.g., ResNet (He et al., 2016), and extracts region feature maps using the bounding boxes as the Region of Interest (ROI) areas. The results of the vision encoder have different resolutions due to the sizes of the bounding boxes, but they lie in the same feature space. Similarly, we add the spatial embeddings from the bounding boxes to the visual embeddings. The multi-modal embeddings are represented as follows,

  e^T_λ = Enc^T(text_λ) + FC(box_λ) + Emb^T
  e^V_λ = Enc^V(img)[box_λ] + FC(box_λ) + Emb^V

where e^T_λ and e^V_λ are the text and visual embeddings; λ ∈ {P} ∪ {R_1, ..., R_m} ∪ {w_1, ..., w_n} denotes the different levels of granularity; Enc^T and Enc^V are the text and image encoders, respectively; text and box refer to the textual contents and bounding boxes; img is the entire document image; FC(·) is the fully-connected layer; and Emb^T and Emb^V are the type embeddings for text and vision.
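To make this encoding step concrete, the following is a minimal PyTorch sketch of how the two embedding equations above could be realized, assuming a frozen sentence-transformers text encoder that outputs 768-dimensional vectors and a ResNet-50 backbone with ROI pooling. The module name `MultiModalEmbedder` and its argument conventions are illustrative and are not taken from the released MGDoc code.

```python
# Minimal sketch of the multi-modal embedding step above; names and shapes are
# illustrative assumptions, not the official MGDoc implementation.
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

class MultiModalEmbedder(nn.Module):
    def __init__(self, text_encoder, hidden=768):
        super().__init__()
        self.text_encoder = text_encoder            # e.g., a frozen SBERT model
        cnn = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # C5 feature map
        self.vis_proj = nn.Linear(2048, hidden)
        self.box_fc = nn.Linear(4, hidden)          # FC(box): spatial embedding
        self.type_text = nn.Parameter(torch.zeros(hidden))   # Emb^T
        self.type_vis = nn.Parameter(torch.zeros(hidden))    # Emb^V

    def forward(self, texts, boxes, image):
        # texts: list of N strings (page text, region texts, word texts)
        # boxes: (N, 4) pixel coords [x0, y0, x1, y1]; image: (1, 3, H, W)
        h, w = image.shape[-2:]
        norm = boxes / torch.tensor([w, h, w, h], dtype=boxes.dtype)
        spatial = self.box_fc(norm)                               # FC(box_lambda)

        # e^T_lambda: frozen text encoder output + spatial + type embedding
        e_text = self.text_encoder.encode(texts, convert_to_tensor=True)
        e_text = e_text + spatial + self.type_text

        # e^V_lambda: ROI-pooled region of the page feature map + spatial + type
        fmap = self.backbone(image)                               # (1, 2048, h', w')
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
        pooled = roi_align(fmap, rois, output_size=1,
                           spatial_scale=fmap.shape[-1] / w).flatten(1)
        e_vis = self.vis_proj(pooled) + spatial + self.type_vis
        return e_text, e_vis
```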
2.3 Multi-granular Attention

Given the multi-modal embeddings described above, we design a multi-granular attention mechanism to encode the hierarchical relation between regions and words. Specifically, we add attention biases to the original self-attention weights to strengthen the region-word correlation. We apply multi-granular attention to the text embeddings and the visual embeddings individually, because the purpose of this module is to learn the interaction between different levels of granularity rather than to fuse modalities. Therefore, without loss of generality, we omit the modality notation in the expressions. The attention weight is computed as

  A_{α,β} = (1/√d) (W^Q e_α)^⊤ (W^K e_β) + HierBias(box_α ⊆ or ⊈ box_β) + RelBias(box_α − box_β),

where α, β ∈ {P} ∪ {R_1, ..., R_m} ∪ {w_1, ..., w_n}; the first term is the same as the attention mechanism in the original BERT; RelBias(·) is an attention weight bias that encodes the relative distance between the bounding boxes; and HierBias(·) is an attention weight bias that encodes the inside-or-outside relation, which models the spatial hierarchy within the page. Since the regions are created by grouping the words, every word corresponds to a specific region. We embed the binary relation into a fixed-sized vector, and each value in this vector is added to an attention head of the multi-granular attention module. After the multi-granular attention module, self-attention is applied to the input embeddings to learn contextual information and hierarchical relationships between the multi-granular inputs. We denote the resulting textual and visual features by f^T_λ and f^V_λ, respectively, where λ ∈ {P} ∪ {R_1, ..., R_m} ∪ {w_1, ..., w_n}.
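Below is a sketch of a single multi-granular attention layer with the two bias terms from the equation above. The bucketing of relative distances and the binary inside/outside embedding are our own assumptions about a reasonable parameterization; the paper does not spell out these details.

```python
# Sketch of one multi-granular attention module with HierBias and RelBias.
# Bucketing scheme and head-wise bias embeddings are assumptions of this sketch.
import torch
import torch.nn as nn

class MultiGranularAttention(nn.Module):
    def __init__(self, hidden=768, heads=12, num_buckets=32):
        super().__init__()
        self.heads, self.dk = heads, hidden // heads
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        self.hier_bias = nn.Embedding(2, heads)            # inside vs. outside
        self.rel_bias = nn.Embedding(num_buckets, heads)   # bucketed box distance

    def forward(self, e, boxes):
        # e: (N, hidden) embeddings of page/regions/words in one modality
        # boxes: (N, 4) normalized boxes used to derive the two biases
        n = e.size(0)
        q = self.q(e).view(n, self.heads, self.dk)
        k = self.k(e).view(n, self.heads, self.dk)
        v = self.v(e).view(n, self.heads, self.dk)
        scores = torch.einsum("ihd,jhd->hij", q, k) / self.dk ** 0.5

        # HierBias: 1 if box_alpha is contained in box_beta, else 0
        x0, y0, x1, y1 = boxes.unbind(-1)
        inside = ((x0[:, None] >= x0[None]) & (y0[:, None] >= y0[None]) &
                  (x1[:, None] <= x1[None]) & (y1[:, None] <= y1[None])).long()
        scores = scores + self.hier_bias(inside).permute(2, 0, 1)

        # RelBias: bucketed distance between box centers
        centers = torch.stack([(x0 + x1) / 2, (y0 + y1) / 2], dim=-1)
        dist = torch.cdist(centers, centers)
        buckets = (dist * (self.rel_bias.num_embeddings - 1)).clamp_max(
            self.rel_bias.num_embeddings - 1).long()
        scores = scores + self.rel_bias(buckets).permute(2, 0, 1)

        attn = scores.softmax(dim=-1)
        out = torch.einsum("hij,jhd->ihd", attn, v).reshape(n, -1)
        return out
```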
2.4 Cross-modal Attention

As mentioned in Section 2.3, multi-granular attention is only designed for the interaction between different levels of granularity; however, it is also essential to fuse information from multiple modalities. Therefore, we design the cross-modal attention module to conduct the modality fusion. Following previous works in visual-language modeling, we use a cross-attention mechanism to fuse the textual and visual features. Specifically, the cross-attention function is formulated as

  CrossAttn(α|β) = σ( (W^Q α)^⊤ (W^K β) / √d ) W^V β,

where α and β are matrices of the same size; σ is the softmax function that normalizes the attention matrix; and W^Q, W^K, and W^V are the trainable weights in each of the attention heads for query, key, and value. Then we gather the text and visual features from different levels of granularity as F^T = {f^T_λ} and F^V = {f^V_λ} and compute the multi-modal features as follows,

  f^{T→V}_λ = CrossAttn(F^T | F^V)[λ]
  f^{V→T}_λ = CrossAttn(F^V | F^T)[λ]

From these expressions, we can see that the cross attention uses the dot product between multi-modal features as attention weights. In this way, each modality can learn from the other modality, and the module bridges the gap between the modalities. We call the outputs of this module the text and visual multi-modal features, to distinguish them from the text and visual features in Section 2.2, and denote them as f^{T→V}_λ and f^{V→T}_λ. The final representation is the sum of the textual and visual multi-modal features, f_λ = f^{T→V}_λ + f^{V→T}_λ, λ ∈ {P} ∪ {R_1, ..., R_m} ∪ {w_1, ..., w_n}, and is used in the pre-training and downstream tasks.
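The following single-head sketch illustrates this fusion: each modality queries the other, and the two directions are summed into f_λ. Sharing one set of projection weights across both directions is a simplification of this sketch, not necessarily what MGDoc does.

```python
# Single-head sketch of the cross-modal fusion described above.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.wq = nn.Linear(hidden, hidden, bias=False)
        self.wk = nn.Linear(hidden, hidden, bias=False)
        self.wv = nn.Linear(hidden, hidden, bias=False)
        self.scale = hidden ** 0.5

    def cross_attn(self, alpha, beta):
        # CrossAttn(alpha | beta): alpha provides queries, beta keys and values
        attn = torch.softmax(self.wq(alpha) @ self.wk(beta).T / self.scale, dim=-1)
        return attn @ self.wv(beta)

    def forward(self, f_text, f_vis):
        # f_text, f_vis: (N, hidden) features for the same N units (page/regions/words)
        f_t2v = self.cross_attn(f_text, f_vis)   # f^{T->V}_lambda
        f_v2t = self.cross_attn(f_vis, f_text)   # f^{V->T}_lambda
        return f_t2v + f_v2t                     # f_lambda used downstream
```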
2.5 Pre-training Tasks

Large-scale pre-training has shown strong results in document image understanding tasks. With a large amount of unlabeled data, pre-trained models can learn the latent data distribution without manually labelled supervision and can easily transfer the learned knowledge to downstream tasks. The design of pre-training tasks is crucial for successful pre-training. We go beyond classic mask modeling and apply mask text modeling and mask vision modeling to all the inputs from the different levels of granularity. Due to the unified multi-modal encoder (see Section 2.2), it is possible for us to treat all levels of granularity equally and introduce a unified masking task for each modality. Because we believe that spatial hierarchical relationships are essential for encoding documents, we also design a pre-training task that requires the model to identify the spatial relationship between content at different levels of granularity. The final training loss is the sum of the pre-training tasks, L = L_MTM + L_MVM + L_MGM. Below we provide details for each component.

Mask Text Modeling The mask text modeling task requires the model to understand the textual inputs of the model. Specifically, we randomly select a proportion of regions or words, and their textual contents are replaced with a special token [MASK]. We run the model to obtain the contextual features of these masked inputs and compare them with the encoding results of the original textual inputs. We use the Mean Absolute Error as the loss function:

  L_MTM = Σ_{λ∈Λ} | e^T_λ − f^{V→T}_{[MASK]|λ̄} |

where Λ = {P} ∪ {R_1, ..., R_m} ∪ {w_1, ..., w_n}; e^T_λ denotes the encoding result of the original textual contents; λ̄ denotes the multi-granular context without λ; and f^{V→T}_{[MASK]|λ̄} is the contextual feature of the masked textual inputs.

Mask Vision Modeling Similarly to the mask text modeling task, we use mask vision modeling to learn visual contextual information. Instead of substituting a [MASK] token as is done in mask text modeling, we set the visual embeddings of the selected areas to zero vectors. The loss function computes the Mean Absolute Error between the contextual features of the masked areas and the original visual embeddings. The mask vision modeling loss is formulated as

  L_MVM = Σ_{λ∈Λ} | e^V_λ − f^{T→V}_{[0]|λ̄} |

where f^{T→V}_{[0]|λ̄} is the contextual feature of the zero vector given the unmasked inputs.

Multi-Granularity Modeling The multi-granularity modeling task asks the model to understand the spatial hierarchy between different levels of granularity. Since the page-level input includes all regions and words, it is trivial for the model to learn it. We therefore only focus on the hierarchical relation between the regions and words. Although this relation is also encoded in the multi-granular attention, it is necessary to reinforce the region-word correspondence for the model. Otherwise, the spatial hierarchy biases are random add-ons to the attention matrix.

The model takes the region-level and word-level features and predicts which region the given word is located in. We first compute the dot product of the region-level and word-level features as the score and use the cross-entropy as the loss function:

  L_MGM = − Σ_{w∈W} log [ exp(f_w^⊤ f_{r*}) / ( exp(f_w^⊤ f_{r*}) + Σ_{r∈R∖{r*}} exp(f_w^⊤ f_r) ) ]

where W = {w_1, ..., w_n} and R = {R_1, ..., R_m}; r* is the region that includes the word w.
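A compact sketch of how the three losses could be computed from the quantities defined above is given below. The boolean mask, the `region_of_word` index map, and the use of torch's built-in L1 and cross-entropy losses are illustrative choices of ours, not details taken from the paper.

```python
# Illustrative computation of the three pre-training losses; tensor names follow
# the notation above, but the masking bookkeeping is an assumption of this sketch.
import torch
import torch.nn.functional as F

def pretraining_loss(e_text, e_vis, f_ctx_v2t, f_ctx_t2v,
                     f_regions, f_words, region_of_word, masked):
    # e_text, e_vis:        (N, d) input embeddings of all units
    # f_ctx_v2t, f_ctx_t2v: (N, d) multi-modal features computed with the selected
    #                       units masked ([MASK] text / zero vision embeddings)
    # f_regions: (m, d), f_words: (n, d) final features of regions and words
    # region_of_word: (n,) index of the region r* that contains each word
    # masked: (N,) boolean mask of the units selected for masking
    l_mtm = F.l1_loss(f_ctx_v2t[masked], e_text[masked])   # mask text modeling
    l_mvm = F.l1_loss(f_ctx_t2v[masked], e_vis[masked])    # mask vision modeling

    # multi-granularity modeling: classify the enclosing region of each word
    scores = f_words @ f_regions.T                          # (n, m) dot products
    l_mgm = F.cross_entropy(scores, region_of_word)
    return l_mtm + l_mvm + l_mgm
```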
3 Experiments

3.1 Pre-training Settings

We use the RVL-CDIP dataset (Harley et al.) as our pre-training corpus. The RVL-CDIP dataset is a scanned document image dataset containing 400,000 grey-scale images and covering a variety of layout patterns. We use OCR engines to recognize the location of textual content in the document images as well as the location of the individual words. Following Gu et al. (2021), we use EasyOCR¹ with two different output modes: non-paragraph and paragraph. The difference is that the non-paragraph mode extracts the individual words on the pages, while the paragraph mode groups these results into regions. The OCR engine allows us to design the architecture and the pre-training tasks around the multi-granularity of document images. Therefore, the paragraph results serve as the region-level inputs, and the non-paragraph results serve as the word-level inputs.

¹ https://github.com/JaidedAI/EasyOCR
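As a rough illustration of this data preparation step, the snippet below pulls word-level and region-level inputs from EasyOCR's two output modes as described above. The exact structure of the returned tuples is simplified here and should be checked against the EasyOCR documentation.

```python
# Sketch of producing word-level and region-level inputs with EasyOCR's two
# output modes; return formats are simplified assumptions of this sketch.
import easyocr

reader = easyocr.Reader(["en"], gpu=False)

def extract_granularities(image_path):
    # Non-paragraph mode: one box per recognized text snippet (word-level inputs)
    words = reader.readtext(image_path, paragraph=False)
    word_inputs = [(text, box) for box, text, conf in words]

    # Paragraph mode: nearby results grouped into region-level boxes
    regions = reader.readtext(image_path, paragraph=True)
    region_inputs = [(text, box) for box, text in regions]
    return word_inputs, region_inputs
```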
Scale        Model              Pre-training Corpus   #Data   #Param.   FUNSD (F1)   CORD (F1)   RVL-CDIP (Accuracy)
Word         BERT-BASE          -                     -       110M      60.26        89.68       89.81
Word         BERT-LARGE         -                     -       340M      65.63        90.25       89.92
Word         LayoutLM-BASE      IIT-CDIP              11M     113M      78.66        94.72       94.42
Word         LayoutLM-LARGE     IIT-CDIP              11M     343M      78.95        94.93       94.43
Word         BROS-BASE          IIT-CDIP              11M     110M      83.05        96.50       -
Word         BROS-LARGE         IIT-CDIP              11M     340M      84.52        97.28       -
Word         LayoutLMv2-BASE    IIT-CDIP              11M     200M      82.76        94.95       95.25
Word         LayoutLMv2-LARGE   IIT-CDIP              11M     426M      84.20        96.01       95.64
Word         TILT-BASE          RVL-CDIP+             1.1M    230M      -            95.11       95.25
Word         TILT-LARGE         RVL-CDIP+             1.1M    780M      -            96.33       95.52
Word         DocFormer-BASE     IIT-CDIP-             5M      183M      83.34        96.33       96.17
Word         DocFormer-LARGE    IIT-CDIP-             5M      536M      84.55        96.99       95.50
Region       SelfDoc            RVL-CDIP              320K    -         83.36        -           92.81
Region       SelfDoc+VGG-16     RVL-CDIP              320K    -         -            -           93.81
Region       UDoc               IIT-CDIP-             1M      272M      87.96        96.64       93.96
Region       UDoc‡              IIT-CDIP-             1M      272M      87.93        96.86       95.05
Region+Word  MGDoc (Ours)       RVL-CDIP              320K    203M*     89.44        97.11       93.64

Table 1: The experiment results and comparison. * indicates that non-trainable parameters are not included. The
total #param. of MGDoc is 312M. ‡ implies unfreezing the sentence encoder during the finetuning. RVL-CDIP+:
TILT uses extra training pages in pre-training; IIT-CDIP-: DocFormer uses a subset of IIT-CDIP in pre-training.

3.2 Fine-tuning Tasks

We select three representative tasks to evaluate the performance of our model and use publicly-available benchmarks for each task.

Form Understanding The goal of the form understanding task is to predict the label of semantic entities in document images. We use the FUNSD dataset (Jaume et al., 2019) for this task. The FUNSD dataset consists of 199 fully-annotated, noisy scanned forms with various appearances and formats. There are 149 and 50 pages in the training set and the testing set, respectively. Each entity is labeled with one of 3 categories: Header, Question, and Answer. We use the OCR results provided with the dataset and input the textual contents and bounding boxes of the entities to the model. We report the entity-level F1 score as the metric.

Receipt Understanding The goal of the receipt understanding task is to recognize the role of a series of text lines in a document. We use the CORD dataset (Park et al., 2019) for this task. The CORD dataset is fully annotated with bounding boxes and textual contents and contains 800 and 100 pages in the training and testing sets, respectively. There are 30 entity types marked in the dataset; we report the entity-level F1 score for our experiments.

Document Image Classification The document image classification task aims to classify pages into different semantic categories. We use the RVL-CDIP dataset (Harley et al.) for this task, which is a subset of the IIT-CDIP dataset (Lewis et al., 2006). The RVL-CDIP dataset contains 400,000 pages, each annotated with one of 16 semantic categories. The input features for this dataset are extracted by the EasyOCR engine in our experiments. The RVL-CDIP dataset is also used in pre-training, but no labeling information is involved in the pre-training tasks, so there is no concern about data leakage. In the downstream task, the RVL-CDIP dataset is divided into training, validation, and test subsets with an 8:1:1 ratio. We report classification accuracy over the 16 categories for our experiments.
3.3 Implementation Details

In the multi-modal encoder, we use the BERT-NLI-STSb-base model as the text encoder and ResNet-50 as the vision encoder. In the modality fusion, we use 12 layers of cross-modal attention in MGDoc. We set the hidden state size to 768 and the number of attention heads to 12. We freeze the pre-trained weights of the multi-modal encoder and randomly initialize the remaining parameters, which are then learned during our pre-training stage. We run the pre-training for 5 epochs on 8 NVIDIA V100 32G GPUs with the AdamW optimizer. The batch size is set to 64; the learning rate is set to 10^-6; warmup is conducted over the first 20% of training steps.
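The optimizer setup below mirrors the hyperparameters just listed. The choice of a linear decay after warmup and the `get_linear_schedule_with_warmup` helper from HuggingFace transformers are assumptions of this sketch, since the paper only specifies the warmup fraction.

```python
# Optimizer sketch matching the reported hyperparameters (AdamW, lr 1e-6,
# warmup over the first 20% of steps); `model` and `total_steps` are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps):
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-6)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.2 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```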
3.4 Results

We compare MGDoc with strong baselines on the document understanding tasks in Table 1. We list the specific settings of each model, such as the layout-rich pre-training, to clearly demonstrate the effectiveness of our model. All these baseline models resort to different techniques to achieve competitive results. BERT (Devlin et al., 2018), LayoutLM (Xu et al., 2020b), LayoutLMv2 (Xu et al., 2020a), BROS (Hong et al., 2021), TILT (Powalski et al., 2021), and DocFormer (Appalaraju et al., 2021) encode word-level features, while SelfDoc (Li et al., 2021) and UDoc (Gu et al., 2021) encode region-level features. MGDoc surpasses all the existing methods with the help of information from all the different levels of granularity and achieves a new state-of-the-art performance on the fine-grained tasks, i.e., the form understanding task and the receipt understanding task. It also achieves promising performance on the coarse-grained task, i.e., the document image classification task. Specifically, MGDoc improves the entity-level F1 score on the FUNSD dataset by 1.48% and the entity-level F1 score on the CORD dataset by 0.25%, compared with the second-best model. We partially attribute the performance difference on the RVL-CDIP dataset to the OCR engine, since LayoutLMv2 and TILT use the Microsoft OCR and BROS uses the CLOVA OCR, and these commercial OCR engines provide more accurate results. As discussed in Gu et al. (2021), the quality of the OCR engine influences the performance of document image classification. It is also worth mentioning that our model involves a relatively smaller number of trainable parameters and requires less pre-training data, which makes MGDoc more applicable in realistic scenarios.

The performance on the form and receipt understanding tasks is improved by region-level information. UDoc surpasses the word-scale models by large margins, and our proposed MGDoc further improves on UDoc by modeling the alignment between regions and words. We conclude that the region-level information strengthens the locality of the feature extraction, and the word-level information further improves the classification results. This connection is realized by the region-word alignment, which is visualized in Section 3.6.

3.5 Ablation Study

To study the importance of the pre-training tasks, we design an ablation study that skips several pre-training tasks. The results are shown in Table 2. In the first setting, we skip the entire pre-training stage, so all the parameters can only be learned in the downstream tasks. In the second setting, we include the commonly-used masking techniques: the model is pre-trained with the two masking tasks in our design, the mask text modeling and the mask vision modeling. Performance steadily increases as pre-training tasks are added; overall, pre-training improves the performance by 6.43% and 2.69% on FUNSD and RVL-CDIP, respectively.

We believe that the masking strategy enables the model to learn from the multi-modal context of the page. In the third experiment in the table, we add the alignment technique between words and regions designed to strengthen the connection between multiple granularities. The performance on FUNSD is further improved by 2.24%, while there is also a decrease of 0.28% in the performance on RVL-CDIP. Local connections between words and regions are helpful in fine-grained tasks but may introduce some noise into coarse-grained tasks.

Model                    FUNSD (F1)   RVL-CDIP (Accuracy)
MGDoc w/o pre-training   83.01        91.23
MGDoc w/ MTM+MVM         87.20        93.92
MGDoc w/ MTM+MVM+MGM     89.44        93.64

Table 2: The results of the ablation study for pre-training. Performance steadily increases as pre-training tasks are added.

To study the role of the features from each granularity, we also conduct an ablation study using different combinations of multi-granular features, where we feed the model with features from region-level inputs; region-level and word-level inputs; and page-level and region-level inputs, respectively. We report the performance on FUNSD and CORD in Table 3. We observe a steady increase as more features are involved, and the word-level features contribute more to the improvement.

Model                           FUNSD (F1)   CORD (F1)
MGDoc w/ Region                 80.82        94.24
MGDoc w/ Region + Word          86.96        95.49
MGDoc w/ Page + Region          81.65        94.69
MGDoc w/ Page + Region + Word   89.44        97.11

Table 3: The results of the ablation study for multi-granular features. We observe a steady increase as more features are involved, and the word-level features contribute most to the improvement.
Figure 4: The error analysis and visualization.
  Example 1 - Text: "Brown & Williamson Tobacco Corp."; Label: Answer; UDoc: Question; MGDoc: Answer.
  Example 2 - Text: "(502) 568-7297"; Label: Answer; UDoc: Question; MGDoc: Answer.
  Example 3 - Text: "GEOGRAPHY"; Label: Header; UDoc: Question; MGDoc: Question.
  Example 4 - Text: "File with:"; Label: Question; UDoc: Header; MGDoc: Question.

Figure 5: The correlation weight visualization between regions and words (heat maps (a)-(d); x-axis: words, y-axis: regions).

3.6 Region-word Correlation Visualization

We visualize the correlation between regions and words using heat maps. We select four examples from the FUNSD dataset and show the heat maps of the final feature dot products in Figure 5. The x-axis and the y-axis correspond to the words and regions, respectively, and the lighter the color, the higher the correlation. Some cropping is applied for clearer visualization. From the heat maps, we can observe that there are highlighted areas along the matrix diagonal, which means our model learns the region-word hierarchy in the pre-training stage and can leverage such correspondence in downstream tasks. We also see some lighter colored blocks in the matrix. Since all the words and regions are serialized in positional order, these lighter colored blocks indicate that the model is able to use localized features with the help of the multi-granular inputs. This ability further confirms our intuition that combining information from different levels of granularity is beneficial.
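A minimal sketch of how such a heat map can be produced from the final region and word features is shown below; the matplotlib styling is arbitrary.

```python
# Sketch of the region-word correlation heat map: dot products between final
# region and word features, rendered with matplotlib.
import matplotlib.pyplot as plt
import torch

def plot_region_word_correlation(f_regions, f_words, out_path="heatmap.png"):
    # f_regions: (m, d) region features; f_words: (n, d) word features,
    # both serialized in positional (reading) order
    scores = f_regions @ f_words.T              # (m, n) correlation matrix
    plt.imshow(scores.detach().cpu().numpy(), aspect="auto", cmap="viridis")
    plt.xlabel("word")
    plt.ylabel("region")
    plt.colorbar()
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```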
3.7 Error Analysis

We select several representative cases from the comparison between UDoc and MGDoc and show them in Figure 4. We also visualize the weight matrices of the entities in the same way as in Section 3.6. In these examples, our proposed model can leverage the more fine-grained signal from the word-level inputs and make the correct prediction. In example 2, the entity is labeled as Answer, and the corresponding question, "Fax No.:", is at the top of its column. Due to the large distance of this question-answer pair, UDoc predicts the entity as Question, while MGDoc gives the right prediction by directly learning from the digits inside the text field, which are a strong signal for answers. From the heat map, we can also see that a lighter color appears in the corresponding area of the entity. Meanwhile, the word-level information further strengthens the multi-modal features since it provides more details about a given text field. As we can observe in example 4, the entity "File with:" is likely to be Header or Question given its textual contents and location on the page, but MGDoc can infer from the rich visual features that this field is part of normal text and less likely to be a Header; these rich inputs allow MGDoc to make the correct prediction where UDoc cannot. However, in example 3, neither UDoc nor MGDoc predicts correctly. The ground-truth label is Header, but both models predict the entity as Question. The entity is not at the top of the page, where header entities are more likely to be located, so we attribute this error to the dependence of MGDoc on spatial information.
4 Related Work

Word-level Models Word-level models inherit the architecture of pure-text pre-trained language models. Word-level contextual information is encoded by a multi-layered transformer, and spatial and visual features are added to refine the representation. Inspired by the positional embeddings in Vaswani et al. (2017); Raffel et al. (2019); Dai et al. (2019), absolute or relative spatial features based on the bounding boxes are proposed to encode the words' layout with respect to each other (Xu et al., 2020b,a; Hong et al., 2021; Garncarek et al., 2021). Deep computer vision models (He et al., 2016; Xie et al., 2017) are used to extract features from the document images, and self-supervised learning methods are applied to learn the cross-modal correlation between images and words (Xu et al., 2020a; Powalski et al., 2021).

Region-level Models Region-level models encode the regions in the document page, including text blocks, headings, and paragraphs (Li et al., 2021; Gu et al., 2021). Similar spatial and visual features are used in these models as in the word-level models. With the help of coarse-grained inputs, region-level models can emphasize rich locality features and catch high-level cues. Another difference from the word-level models is that the number of regions is much smaller than the number of words on the page, so region-level models are more efficient when processing long documents.

5 Conclusions and Future Work

We present MGDoc, a multi-modal multi-granular pre-training framework, which goes beyond the existing region-level or word-level models and leverages the contents at multiple levels of granularity to understand document pages better. Existing models fail to use the informative multi-granular features in the document due to the restrictions of the word-level model architecture, which leads to unsatisfactory results. We solve these issues with a new architecture design and tailored pre-training tasks. With a unified multi-modal encoder, we embed the features from pages, regions, and words into the same hyperspace, and we design a multi-granular attention mechanism and a multi-granularity modeling task for MGDoc to learn the spatial hierarchical relation between them. Experiments show that our proposed model can understand the spatial relation between the multi-granular features and lead to improvements in downstream tasks.

As for future work, since we have not fully exploited the multi-granular information, we will go beyond the page level and investigate the possibility of encoding multiple pages. We are also interested in inputs that are more fine-grained than the word level, such as pixels.

6 Acknowledgement

This work was supported in part by Adobe Research. We thank the anonymous reviewers and program chairs for their valuable and insightful feedback.

Limitations

Although we inherit the idea of using region-level inputs from Gu et al. (2021); Li et al. (2021), we cannot keep their merit of saving computing resources. Region-level models encode regions instead of all the words on the page, so a smaller number of features is included in the self-attention layers. However, we want to leverage the fine-grained word-level information as in Xu et al. (2020b,a); Hong et al. (2021), so the words are also considered in the multi-granular attention and the multi-modal attention layers. Compared to existing works, our work requires more memory during training and testing.

Ethical Considerations

This paper presents a new framework for document image understanding tasks. Our model is built on open-source tools and datasets, and we aim at increasing the efficiency of processing various documents and also bringing convenience to ordinary people's lives. Thus, we do not anticipate any major ethical concerns.

References
Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R Manmatha. 2021. DocFormer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 993–1003.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, and Filip Graliński. 2021. LAMBERT: Layout-aware language modeling for information extraction. In International Conference on Document Analysis and Recognition, pages 532–547. Springer.

Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Rajiv Jain, Nikolaos Barmpalios, Ani Nenkova, and Tong Sun. 2021. UniDoc: Unified pretraining framework for document understanding. Advances in Neural Information Processing Systems, 34:39–50.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park. 2021. BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. arXiv preprint arXiv:2108.04539.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, pages 1–6. IEEE.

David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 665–666.

Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, and Hongfu Liu. 2021. SelfDoc: Self-supervised document representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5652–5660.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. 2021. Going full-TILT boogie on document understanding with text-image-layout transformer. In International Conference on Document Analysis and Recognition, pages 732–747. Springer.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020a. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020b. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.
