MGDoc: Pre-Training With Multi-Granular Hierarchy For Document Image Understanding
Abstract
[Figure 3 diagram: multi-granular attention followed by cross-modal attention over the multi-granular inputs λ ∈ {P} ∪ {R1, ..., Rm} ∪ {w1, ..., wn}; Task #3, Multi-Granularity Modeling, with Inside/Outside labels for region-word pairs.]
Figure 3: The multi-modal multi-granular pre-training framework of MGDoc. Inputs consist of visual, textual,
and spatial representations of the page, regions on the page, and individual words. Multi-granular attention learns
relationships within and across granularities within single modalities, followed by multi-granular attention across
modalities. The final output consists of an embedding for each region at each granularity, on top of which three
self-supervision tasks are added to pre-train the model. Specifically, the multi-granularity task ensures that our
model makes use of multi-granular inputs and multi-granular attention to solve spatial tasks.
page and leverages the spatial hierarchy between them. There are three stages in our architecture. First, an OCR engine, human annotations, or digital parsing provides us with the text and the bounding boxes of contents at different levels of granularity. We focus on pages, regions, and words in this paper and leave the more fine-grained pixel-level and coarse-grained multi-page modeling for future work. We input the document image, textual content, and bounding boxes at these three levels to the multi-modal encoder to encode multi-granular information into text and image embeddings. Following previous work (Xu et al., 2020b), we use spatial embeddings to encode spatial layout information. Next, we design a multi-granular attention mechanism to extract the correlation between the features from different levels. In addition to the dot product between features used by the normal self-attention mechanism in BERT (Devlin et al., 2018), multi-granular attention encodes the hierarchical relationship between regions and words by adding an attention bias. Then, the cross-attention mechanism is used to combine the features from different modalities. Finally, the sum of the final text and visual features is used in the pre-training or fine-tuning tasks.

2.2 Multi-modal Encoding

The multi-modal encoding is designed to encode the text, visual, and spatial information of the multi-granular inputs into the embedding space. We first acquire the inputs at the word level, encoding the text and bounding box of each word. At the region level, words are grouped into regions, where all words within a region are combined and the enclosing bounding box of all the words is used as the region bounding box. At the page level, the textual input is the sequence of all words on the page, and the width and height of the document image are used as the page-level bounding box. The inputs at every level of granularity thus consist of the textual content and the bounding box. We denote the inputs as $P$, $\{R_1, ..., R_m\}$, $\{w_1, ..., w_n\}$, where $m$ and $n$ are the numbers of regions and words.

The multi-modal encoding takes the textual content of each input unit, ranging from a single word, to sentences, to the whole textual content of a page, and encodes it with a pre-trained language model, e.g., SBERT (Reimers and Gurevych, 2019). We add the spatial embeddings to the text encoder outputs, where a fully-connected layer is used to project the bounding boxes into a hyperspace. In this way, the text embeddings of our model are augmented with spatial information. Then, MGDoc encodes the entire image with a visual feature extractor, e.g., ResNet (He et al., 2016), and extracts region feature maps using bounding boxes as the Region of Interest (ROI) areas. The results of the vision encoder have different resolutions due to the sizes of bounding boxes, but they lie in the same feature space. Similarly, we add the spatial embeddings from the bounding boxes to the visual embeddings. The multi-modal embeddings are represented as follows,

$$e^T_\lambda = \mathrm{Enc}_T(\mathrm{text}_\lambda) + \mathrm{FC}(\mathrm{box}_\lambda) + \mathrm{Emb}_T$$
$$e^V_\lambda = \mathrm{Enc}_V(\mathrm{img})[\mathrm{box}_\lambda] + \mathrm{FC}(\mathrm{box}_\lambda) + \mathrm{Emb}_V$$

where $e^T_\lambda$ and $e^V_\lambda$ are the text and visual embeddings; $\lambda \in \{P\} \cup \{R_1, ..., R_m\} \cup \{w_1, ..., w_n\}$ denotes different levels of granularity; $\mathrm{Enc}_T$ and $\mathrm{Enc}_V$ are the text and image encoders, respectively; $\mathrm{text}$ and $\mathrm{box}$ refer to the textual contents and bounding boxes; $\mathrm{img}$ is the entire document image; $\mathrm{FC}(\cdot)$ is the fully-connected layer; $\mathrm{Emb}_T$ and $\mathrm{Emb}_V$ are the type embeddings for text and vision.

2.3 Multi-granular Attention

Given the multi-modal embeddings described above, we design a multi-granular attention to encode the hierarchical relation between regions and words. Specifically, we add attention biases to the original self-attention weights to strengthen the region-word correlation. We apply multi-granular attention to the text embeddings and visual embeddings individually because the purpose of this module is to learn the interaction between different levels of granularity rather than to fuse modalities. Therefore, without loss of generality, we omit the notation of modality in the expressions. The attention weight is computed as

$$A_{\alpha,\beta} = \frac{1}{\sqrt{d}} (W^Q e_\alpha)^\top (W^K e_\beta) + \mathrm{HierBias}(\mathrm{box}_\alpha \subseteq \text{or} \not\subseteq \mathrm{box}_\beta) + \mathrm{RelBias}(\mathrm{box}_\alpha - \mathrm{box}_\beta),$$

where $\alpha, \beta \in \{P\} \cup \{R_1, ..., R_m\} \cup \{w_1, ..., w_n\}$; the first part is the same as the attention mechanism in the original BERT; $\mathrm{RelBias}(\cdot)$ is the attention weight bias to encode the relative distance between the bounding boxes; $\mathrm{HierBias}(\cdot)$ is the attention weight bias to encode the inside or outside relation, which models the spatial hierarchy within the page. Since the regions are created by grouping the words, every word corresponds to a specific region. We embed the binary relation into a fixed-sized vector and each value in this vector is added to an attention head of the multi-granular attention module. After the multi-granular attention module, self-attention is applied to the input embeddings to learn contextual information and hierarchical relationships between the multi-granular inputs. We denote the resulting textual and visual features by $f^T_\lambda$ and $f^V_\lambda$, respectively, where $\lambda \in \{P\} \cup \{R_1, ..., R_m\} \cup \{w_1, ..., w_n\}$.

2.4 Cross-modal Attention

As mentioned in Section 2.3, multi-granular attention is only designed for the interaction between different levels of granularity; however, it is also essential to fuse information from multiple modalities. Therefore, we design the cross-modal attention module to conduct the modality fusion. Following previous works in vision-language modeling, we use a cross-attention mechanism to fuse the textual and visual features. Specifically, the cross-attention function is formulated as,

$$\mathrm{CrossAttn}(\alpha \mid \beta) = \sigma\left(\frac{(W^Q \alpha)^\top (W^K \beta)}{\sqrt{d}}\right) W^V \beta$$

where $\alpha$ and $\beta$ are matrices of the same size; $\sigma$ is the Softmax function to normalize the attention matrix; $W^Q$, $W^K$, $W^V$ are the trainable weights in each of the attention heads for query, key, and value. Then we list text and visual features from different levels of granularity as $F^T = \{f^T_\lambda\}$ and $F^V = \{f^V_\lambda\}$ and compute the multi-modal features as follows,

$$f^{T \to V}_\lambda = \mathrm{CrossAttn}(F^T \mid F^V)[\lambda]$$
$$f^{V \to T}_\lambda = \mathrm{CrossAttn}(F^V \mid F^T)[\lambda]$$

From these expressions, we can see that the cross attention uses the dot product between multi-modal features as attention weights. In this way, the given modality can learn from the other modality and the module also bridges the gap between the modalities. We call the output of this module text or visual multi-modal features to distinguish them from the text and visual features in Section 2.2, and denote them as $f^{T \to V}_\lambda$ and $f^{V \to T}_\lambda$. The final representation is the sum of the textual and visual multi-modal features, $f_\lambda = f^{T \to V}_\lambda + f^{V \to T}_\lambda$, $\lambda \in \{P\} \cup \{R_1, ..., R_m\} \cup \{w_1, ..., w_n\}$, and is used in the pre-training and downstream tasks.

2.5 Pre-training Tasks

Large-scale pre-training has shown strong results in document image understanding tasks. With a
large amount of unlabeled data, pre-trained models can learn the latent data distribution without manually labelled supervision and can easily transfer the learned knowledge to downstream tasks. The design of pre-training tasks is crucial for successful pre-training. We go beyond the classic mask modeling and apply the mask text modeling and the mask vision modeling on all the inputs from different levels of granularity. Due to the unified multi-modal encoder (see Section 2.2), it is possible for us to treat all levels of granularity equally and introduce a unified masking task for each modality. Because we believe that spatial hierarchical relationships are essential for encoding documents, we design a pre-training task that requires the model to identify the spatial relationship between content at different levels of granularity. The final training loss is the sum of the pre-training tasks, $L = L_{MTM} + L_{MVM} + L_{MGM}$. Below we provide details for each component.

Mask Text Modeling. The mask text modeling task requires the model to understand the textual inputs of the model. Specifically, we randomly select a proportion of regions or words, and their textual contents are replaced with a special token [MASK]. We run the model to obtain the contextual features of these masked inputs and compare them with the encoding result of the original textual inputs. We use the Mean Absolute Error as the loss function.

$$L_{MTM} = \sum_{\lambda \in \Lambda} \left| e^T_\lambda - f^{V \to T}_{[\mathrm{MASK}] \mid \bar{\lambda}} \right|$$

where $\Lambda = \{P\} \cup \{R_1, ..., R_m\} \cup \{w_1, ..., w_n\}$; $e^T_\lambda$ denotes the encoding result of the original textual contents; $\bar{\lambda}$ denotes the multi-granular context without $\lambda$; $f^{V \to T}_{[\mathrm{MASK}] \mid \bar{\lambda}}$ is the contextual feature of the masked textual inputs.

Mask Vision Modeling. Similarly to the mask text modeling task, we use mask vision modeling to learn visual contextual information. Instead of substituting the [MASK] token as is done in mask text modeling, we set the visual embeddings of the selected areas to zero vectors. The loss function computes the Mean Absolute Error between the contextual feature of masked areas and the original visual embeddings. The mask vision modeling loss is formulated as,

$$L_{MVM} = \sum_{\lambda \in \Lambda} \left| e^V_\lambda - f^{T \to V}_{[0] \mid \bar{\lambda}} \right|$$

where $f^{T \to V}_{[0] \mid \bar{\lambda}}$ is the contextual feature of the zero vector given the unmasked inputs.

Multi-Granularity Modeling. The multi-granularity modeling task asks the model to understand the spatial hierarchy between different levels of granularity. Since the page-level input includes all regions and words, the page-level relation is trivial for the model to learn, so we focus only on the hierarchical relation between the regions and words. Although the relation is also encoded in the multi-granular attention, it is necessary to push the model to emphasize the region-word correspondence. Otherwise, the spatial hierarchy biases are random add-ons to the attention matrix.

The model takes the region-level and word-level features and predicts which region the given word is located in. We first compute the dot product of the region-level and word-level features as the score and use the cross-entropy as the loss function.

$$L_{MGM} = \sum_{w \in W} \frac{e^{f_w^\top f_{r^*}}}{e^{f_w^\top f_{r^*}} + \sum_{r \in R - \{r^*\}} e^{f_w^\top f_r}}$$

where $W = \{w_1, ..., w_n\}$ and $R = \{R_1, ..., R_m\}$; $r^*$ is the region that includes the word $w$.

3 Experiments

3.1 Pre-training Settings

We use the RVL-CDIP dataset (Harley et al.) as our pre-training corpus. The RVL-CDIP dataset is a scanned document image dataset containing 400,000 grey-scale images and covering a variety of layout patterns. We use OCR engines to recognize the location of textual content in the document images and also the location of the individual words. Following Gu et al. (2021), we use EasyOCR¹ with two different output modes: non-paragraph and paragraph. The difference is that the non-paragraph mode extracts the individual words in the pages, and the paragraph mode groups these results into regions. The OCR engine allows us to design the architecture and the pre-training tasks focusing on the multi-granularity of document images. Accordingly, the paragraph results serve as the region-level inputs, and the non-paragraph results serve as the word-level inputs.

¹ https://github.com/JaidedAI/EasyOCR
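The inside/outside relation behind HierBias in Section 2.3 can be derived directly from region boxes (paragraph mode) and word boxes (non-paragraph mode). The following is a minimal sketch of that idea, not the authors' implementation. It assumes boxes are `(x0, y0, x1, y1)` tuples, a strict containment test, a single attention head, identity matrices in place of $W^Q$ and $W^K$, and hypothetical bias values; RelBias is left out. All function names are illustrative.

```python
import math


def contains(outer, inner):
    """True if box `inner` lies fully inside box `outer`; boxes are (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def hier_relation(boxes):
    """rel[a][b] is 1 if box a is inside box b, else 0 (the binary relation)."""
    n = len(boxes)
    return [[1 if contains(boxes[b], boxes[a]) else 0 for b in range(n)]
            for a in range(n)]


def attention_scores(embeds, boxes, hier_bias):
    """A[a][b] = (e_a . e_b) / sqrt(d) + hier_bias[rel[a][b]].

    Simplifications (assumptions): W^Q and W^K are identity, one head only,
    and the relative-distance bias RelBias is omitted.
    """
    d = len(embeds[0])
    rel = hier_relation(boxes)
    scores = []
    for a, e_a in enumerate(embeds):
        row = []
        for b, e_b in enumerate(embeds):
            dot = sum(x * y for x, y in zip(e_a, e_b))
            row.append(dot / math.sqrt(d) + hier_bias[rel[a][b]])
        scores.append(row)
    return scores


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [v / z for v in exps]


# Toy page: one paragraph-mode region followed by two non-paragraph-mode words.
boxes = [(0, 0, 100, 20), (5, 2, 30, 18), (5, 32, 50, 48)]
embeds = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
hier_bias = [0.0, 2.0]  # hypothetical learned values: [outside, inside]
scores = attention_scores(embeds, boxes, hier_bias)
weights = [softmax(row) for row in scores]
```

With identical embeddings, the only difference between scores comes from the bias, so the first word attends more strongly to the region that contains it than to the unrelated second word. In a multi-head setting, `hier_bias` would hold one value per head for each relation, as the text describes.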
Scale        Model             Pre-training Corpus  #Data  #Param.  FUNSD (F1)  CORD (F1)  RVL-CDIP (Accuracy)
Word         BERT_BASE         -                    -      110M     60.26       89.68      89.81
Word         BERT_LARGE        -                    -      340M     65.63       90.25      89.92
Word         LayoutLM_BASE     IIT-CDIP             11M    113M     78.66       94.72      94.42
Word         LayoutLM_LARGE    IIT-CDIP             11M    343M     78.95       94.93      94.43
Word         BROS_BASE         IIT-CDIP             11M    110M     83.05       96.50      -
Word         BROS_LARGE        IIT-CDIP             11M    340M     84.52       97.28      -
Word         LayoutLMv2_BASE   IIT-CDIP             11M    200M     82.76       94.95      95.25
Word         LayoutLMv2_LARGE  IIT-CDIP             11M    426M     84.20       96.01      95.64
Word         TILT_BASE         RVL-CDIP+            1.1M   230M     -           95.11      95.25
Word         TILT_LARGE        RVL-CDIP+            1.1M   780M     -           96.33      95.52
Word         DocFormer_BASE    IIT-CDIP-            5M     183M     83.34       96.33      96.17
Word         DocFormer_LARGE   IIT-CDIP-            5M     536M     84.55       96.99      95.50
Region       SelfDoc           RVL-CDIP             320K   -        83.36       -          92.81
Region       SelfDoc+VGG-16    RVL-CDIP             320K   -        -           -          93.81
Region       UDoc              IIT-CDIP-            1M     272M     87.96       96.64      93.96
Region       UDoc‡             IIT-CDIP-            1M     272M     87.93       96.86      95.05
Region+Word  MGDoc (Ours)      RVL-CDIP             320K   203M*    89.44       97.11      93.64

Table 1: The experiment results and comparison. * indicates that non-trainable parameters are not included; the total #param. of MGDoc is 312M. ‡ indicates unfreezing the sentence encoder during finetuning. RVL-CDIP+: TILT uses extra training pages in pre-training; IIT-CDIP-: DocFormer uses a subset of IIT-CDIP in pre-training.
[Figure 4, Example 4. Text: "File with:"; Label: Question; UDoc: Header; MGDoc: Question.]

[Figure 5, panels (a) to (d): heat maps with words on the x-axis and regions on the y-axis.]
Figure 5: The correlation weight visualization between regions and words

3.6 Region-word Correlation Visualization

We visualize the correlation between regions and words using heat maps. We select four examples from the FUNSD dataset and show the heat maps of the final feature dot product in Figure 5. The x-axis and the y-axis correspond to the words and regions, respectively, and the lighter the color is, the higher the correlation. Some cropping is applied for clearer visualization. From the heat maps, we can observe that there are highlighted areas along the matrix diagonal, which means our model learns the region-word hierarchy in the pre-training stage and can leverage such correspondence in downstream tasks. We also see some lighter colored blocks in the matrix. Since all the words and regions are serialized in positional order, these lighter colored blocks indicate that the model is able to use localized features with the help of multi-granular inputs. This ability further confirms our intuition that combining information from different levels of granularity is beneficial.

We select several representative cases in the comparison between UDoc and MGDoc and show them in Figure 4. We also visualize the weight matrix of the entities in the same way as in Section 3.6. In these examples, our proposed model can leverage the more fine-grained signal from word-level inputs and make the correct prediction. In example 2, the entity is labeled as Answer, where the corresponding question, “Fax No.:”, is at the top of this column. Due to the large distance of this question-answer pair, UDoc predicts the entity as Question, while MGDoc can give the right prediction by directly learning from the digits inside of the text fields, which is a strong signal for answers. From the heat map, we can also see that a lighter color appears in the corresponding area of the entity. Meanwhile, the word-level information even strengthens the multi-modal features since it provides more details of a given text field. As we can observe in example 4, the entity “File with:” is likely to be Header or Question given its textual contents and location in the page, but MGDoc can predict from the rich visual features that this field is a part of normal text and less likely to be Header; these rich inputs allow MGDoc to make the correct prediction where UDoc cannot. However, in example 3, neither UDoc nor MGDoc predicts correctly. The ground-truth label is Header but both models predict the entity as Question. The entity is not at the top of the page where the header entities are more likely to be located, so we attribute this error to the dependence of MGDoc on the spatial information.
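The heat maps of Section 3.6 plot dot products between final region and word features, which are the same scores the multi-granularity modeling task of Section 2.5 is trained on. The sketch below computes that score matrix, a loss over it, and a rescaled copy for plotting. It is an illustration under stated assumptions rather than the authors' code: the loss takes the negative log of the softmax ratio, which is the usual reading of "cross-entropy" (the printed expression shows only the ratio), and the min-max rescaling to [0, 1] is a guess at how the color scale was produced. All function names are hypothetical.

```python
import math


def score_matrix(region_feats, word_feats):
    """scores[r][w] = f_r . f_w; rows are regions (y-axis), columns are words (x-axis)."""
    return [[sum(a * b for a, b in zip(f_r, f_w)) for f_w in word_feats]
            for f_r in region_feats]


def mgm_loss(scores, targets):
    """Sum over words of -log softmax over regions, evaluated at the true region.

    targets[w] is the index r* of the region that contains word w.
    The -log is an assumption about how the softmax ratio is minimized.
    """
    total = 0.0
    for w, r_star in enumerate(targets):
        column = [row[w] for row in scores]
        m = max(column)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in column))
        total += log_z - column[r_star]
    return total


def normalize_for_heatmap(scores):
    """Min-max rescale all scores to [0, 1] so lighter cells mean higher correlation."""
    flat = [v for row in scores for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in scores]
    return [[(v - lo) / (hi - lo) for v in row] for row in scores]


# Toy features: two regions, three words; words 0 and 1 belong to region 0.
region_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
targets = [0, 0, 1]
scores = score_matrix(region_feats, word_feats)
loss = mgm_loss(scores, targets)
heat = normalize_for_heatmap(scores)
```

Because words and regions are serialized in positional order, a well-trained model should concentrate high values of `heat` in blocks along the diagonal, which is the pattern the section describes. The `heat` matrix could then be handed to any image-plotting routine.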
4 Related Work

Word-level Models. Word-level models inherit the architecture of pure-text pre-trained language models. Word-level contextual information is encoded by a multi-layered transformer, and spatial and visual features are added to refine the representation. Inspired by the positional embeddings in Vaswani et al. (2017); Raffel et al. (2019); Dai et al. (2019), absolute or relative spatial features based on the bounding boxes are proposed to encode the words’ layout with respect to each other (Xu et al., 2020b,a; Hong et al., 2021; Garncarek et al., 2021). Deep computer vision models (He et al., 2016; Xie et al., 2017) are used to extract features from the document images, and self-supervised learning methods are applied to learn the cross-modal correlation between images and words (Xu et al., 2020a; Powalski et al., 2021).

Region-level Models. Region-level models encode the regions in the document page, including text blocks, headings, and paragraphs (Li et al., 2021; Gu et al., 2021). Similar spatial and visual features are used in these models as in the word-level models. With the help of coarse-grained inputs, region-level models can emphasize the rich locality features and capture high-level cues. Another difference from the word-level models is that the number of regions is much smaller than the number of words on the page, so the region-level models are more efficient when processing long documents.

5 Conclusions and Future Work

We present MGDoc, a multi-modal multi-granular pre-training framework, which goes beyond the existing region-level or word-level models and leverages the contents at multiple levels of granularity to understand the document pages better. Existing models fail to use the informative multi-granular features in the document due to the restriction from the word-level model architecture, which leads to unsatisfactory results. We solve these issues with the new architecture design and tailored pre-training tasks. With a unified multi-modal encoder, we embed the features from pages, regions, and words into the same hyperspace, and design a multi-granular attention mechanism and multi-granularity modeling task for MGDoc to learn the spatial hierarchical relation between them. Experiments show that our proposed model can understand the spatial relation between the multi-granular features and that this leads to improvements in downstream tasks.

As for future work, since we have not fully exploited the multi-granular information, we will go beyond the page level and investigate the possibility of encoding multiple pages. We are also interested in inputs that are more fine-grained than word level, such as pixels.

6 Acknowledgement

This work was supported in part by Adobe Research. We thank anonymous reviewers and program chairs for their valuable and insightful feedback.

Limitations

Although we inherit the idea of using region-level inputs from Gu et al. (2021) and Li et al. (2021), we cannot keep their merit of saving computing resources. Region-level models encode regions instead of all the words in the page, so fewer features are included in the self-attention layers. However, we want to leverage the fine-grained word-level information as in Xu et al. (2020b,a) and Hong et al. (2021), so the words are also considered in the multi-granular attention and the multi-modal attention layers. Compared to existing works, our work requires more memory during training and testing.

Ethical Considerations

This paper presents a new framework for document image understanding tasks. Our model is built on open-source tools and datasets, and we aim at increasing the efficiency of processing various documents and also bringing convenience to ordinary people’s lives. Thus, we do not anticipate any major ethical concerns.

References

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R Manmatha. 2021. DocFormer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 993–1003.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, and Filip Graliński. 2021. LAMBERT: Layout-aware language modeling for information extraction. In International Conference on Document Analysis and Recognition, pages 532–547. Springer.

Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Rajiv Jain, Nikolaos Barmpalios, Ani Nenkova, and Tong Sun. 2021. UniDoc: Unified pretraining framework for document understanding. Advances in Neural Information Processing Systems, 34:39–50.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park. 2021. BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. arXiv preprint arXiv:2108.04539.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, pages 1–6. IEEE.

David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 665–666.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. 2021. Going full-TILT boogie on document understanding with text-image-layout transformer. In International Conference on Document Analysis and Recognition, pages 732–747. Springer.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020a. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020b. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.