DocILE Benchmark for Document Information Localization and Extraction
¹ Rossum.ai, https://rossum.ai, {name.surname}@rossum.ai
² Visual Recognition Group, Czech Technical University in Prague
³ University of La Rochelle, France
⁴ Computer Vision Center, Universitat Autònoma de Barcelona, Spain
Abstract. This paper introduces the DocILE benchmark with the largest
dataset of business documents for the tasks of Key Information Local-
ization and Extraction and Line Item Recognition. It contains 6.7k an-
notated business documents, 100k synthetically generated documents,
and nearly 1M unlabeled documents for unsupervised pre-training. The
dataset has been built with knowledge of domain- and task-specific as-
pects, resulting in the following key features: (i) annotations in 55 classes,
which surpasses the granularity of previously published key information
extraction datasets by a large margin; (ii) Line Item Recognition repre-
sents a highly practical information extraction task, where key informa-
tion has to be assigned to items in a table; (iii) documents come from
numerous layouts and the test set includes zero- and few-shot cases as
well as layouts commonly seen in the training set. The benchmark comes
with several baselines, including RoBERTa, LayoutLMv3 and DETR-
based Table Transformer; applied to both tasks of the DocILE bench-
mark, with results shared in this paper, offering a quick starting point
for future work. The dataset, baselines and supplementary material are
available at https://github.com/rossumai/docile.
1 Introduction
Automating information extraction from business documents has the potential
to streamline repetitive human labour and allow data entry workers to focus
on more strategic tasks. Despite the recent shift towards business digitaliza-
tion, the majority of Business-to-Business (B2B) communication still happens
through the interchange of semi-structured1 business documents such as invoices,
tax forms, orders, etc. The layouts of these documents were designed for human
readers rather than for automated processing.
¹ We use the term semi-structured documents as in [62, 69]; the visual structure is strongly
related to the document semantics, but the layout is variable.
2 Related Work
To address the related work, we first introduce general approaches to document
understanding, before specifically focusing on information extraction tasks and
existing datasets.
Table 1: Datasets with KILE and LIR annotations for semi-structured business
documents.

name             | document type       | # docs labeled | classes | source        | multi-page | lang. | task
DocILE (ours)    | invoice-like        | 106,680        | 55      | digital, scan | yes        | en    | KILE, LIR
CORD [58]        | receipts            | 11,000         | 30-42²  | photo         | no         | id    | ≈KILE, ≈LIR³
WildReceipt [74] | receipts            | 1,740          | 25      | photo         | no         | en    | KILE
EPHOIE [79]      | Chinese forms       | 1,494          | 10      | scan          | no         | zh    | KILE
Ghega [49]       | patents, datasheets | 246            | 11/8    | scan          | yes        | en    | KILE
The landscape of IE problems and datasets was recently reviewed by Borch-
mann et al. [5], building the DUE Benchmark for a wide range of document
understanding tasks, and by Skalický et al. [69], who argue that the crucial
problems for automating B2B document communication are Key Information
Localization and Extraction and Line Item Recognition.
Key Information Extraction (KIE) [15, 29, 72] aims to extract pre-defined
key information (categories of "fields": name, email, the amount due, etc.) from
a document. A number of datasets for KIE are publicly available [29, 49, 72, 73,
74, 79]. However, as noted by [69], most of them are relatively small and contain
only a few annotated field categories.
Key Information Localization and Extraction (KILE) [69] addition-
ally requires precise localization of the extracted information in the input im-
age or PDF, which is crucial for human-in-the-loop interactions, auditing, and
other processing of the documents. However, many of the existing KIE datasets
miss the localization annotations [5, 29, 72]. Publicly available KILE datasets
on business documents [49, 58, 74, 79] and their sizes are listed in Table 1. Due
to the lack of large-scale datasets for KILE from business documents, noted
by several authors [11, 35, 56, 69, 75], many research publications use private
datasets [9, 26, 33, 43, 55, 56, 65, 87].
Line Item Recognition (LIR) [69] is a part of table extraction [3, 9, 25, 46,
56] that aims at finding Line Items (LI), localizing and extracting key informa-
tion for each item. The task is related to Table Structure Recognition [64,71,78],
which typically aims at detecting table rows, columns and cells. However, table
structure recognition alone is not sufficient for LIR: an enumerated item may span
² 54 classes are mentioned in [58], but the repository https://github.com/clovaai/cord
only considers 30 out of 42 listed classes, as of January 2023.
³ CORD annotations contain classification of word tokens (as in NER), but with the
additional information of which tokens are grouped together into fields or menu items,
effectively upgrading the annotations to KILE/LIR field annotations.
Fig. 1: DocILE: a document with KILE and LIR annotations (left) and the Line
Item areas emphasized (right) by alternating blue and green for odd and
even items, respectively. Bottom: color legend for the KILE and LIR classes.
several rows in a table; and columns are often not sufficient to distinguish all
semantic information. There are several datasets [14, 52, 66, 71, 88, 89] for Table
Detection and/or Structure Recognition, PubTables-1M [71] being the largest
with a million tables from scientific articles. The domain of scientific articles is
prevailing among the datasets [14, 66, 71, 89], due to easily obtainable annota-
tions from the LaTeX source code. However, there is a non-trivial domain shift
introduced by the difference in the Tables from scientific papers and business
documents. FinTabNet [88] and SynthTabNet [52] are closer to our domain,
covering table structure recognition of complex financial tables. These datasets,
however, only contain annotations of the table grid/cells. From the available
datasets, CORD [58] is the closest to the task of Line Item Recognition with its
annotation of sub-menu items. The documents in CORD are all receipts, which
generally have a simpler structure than other typical business documents; this
makes the task too simple, as previously noted in [5].
Named Entity Recognition (NER) [39] is the task of assigning one of the
pre-defined categories to entities (usually words or word-pieces in the document)
which makes it strongly related to KILE and LIR, especially when these entities
have a known location. Note that the task of NER is less general as it only
Documents in the DocILE dataset come from two public data sources: UCSF In-
dustry Documents Library [80] and Public Inspection Files (PIF) [82]. The UCSF
Industry Documents Library contains documents from industries that influence
public health, such as tobacco companies. This source has been used to create
the following document datasets: RVL-CDIP [23], IIT-CDIP [37], FUNSD [32],
DocVQA [48] and OCR-IDL [4]. PIF contains a variety of information about
American broadcast stations. We specifically use the "political files" with doc-
uments (invoices, orders, "contracts") from TV and radio stations for political
campaign ads, previously used to create the Deepform dataset [73]. Documents
from both sources were retrieved in PDF format.
Documents for DocILE were selected from the two sources as follows. For
UCSF IDL, we used the public API [81] to retrieve only publicly available doc-
uments of type invoice. For documents from PIF, we retrieved all "political
files" from TV, FM and AM broadcasts. We discarded documents with broken
PDFs, duplicates⁴, and documents not classified as invoice-like⁵. Other types of
documents, such as budgets or financial reports, were discarded as they typically
contain different key information. We refer to the selected documents from the
two sources as PIF and UCSF documents.
Fig. 2: Distribution of the number of document pages in the training, validation
and unlabeled sets. The numbers of documents are displayed above the bars.
105
Training set
Validation set
104 Unlabeled set
103
# documents
102
101
100
0 200 400 600 800 1000
Layout cluster
Fig. 3: The number of documents in each layout cluster in the training, validation
and unlabeled sets, on a logarithmic scale. While some clusters have up to 100k
documents, the largest cluster in the training and validation sets contains only 90 documents.
The clustering was manually corrected for the annotated set. More details
about the clustering can be found in the Supplementary Material.
In the annotation process, documents were skipped if they were not invoice-
like, if they contained handwritten or redacted key information or if they listed
more than one set of line items (e.g. several tables listing unrelated types of
items).
Additionally, PDF files composed of several documents (e.g., a different in-
voice on each page) were split and annotated separately.
For KILE and LIR, fields are annotated as a triplet of location (bounding box
and page), field type (class) and text. LIR fields additionally contain a line item
ID, identifying the line item they belong to. If the same content is listed in several
tables with different granularity (but summing to the same total amount), the
less detailed set of line items is annotated.
Notice that the fields can overlap, sometimes completely. A field can be multi-
line or contain only parts of words. There can be multiple fields with the same
field type on the same page, either repeating the same value in several locations
or carrying different values, and there can be multiple fields of the same field
type within a single line item. The full list of field types and their descriptions is in the
Supplementary Material.
Additional annotations, not necessary for the benchmark evaluation, are
available and can be used in training or for other research purposes. Ta-
ble structure annotations include: 1) line item headers, representing the headers
of columns corresponding to one field type in the table, and 2) the table grid,
containing the position and classification of rows (header, data, gap, etc.) and
the position and field type of columns (when the values in the column corre-
spond to that field type). Additionally, metadata contain: docu-
ment type, currency, layout cluster ID, source and original filename (linking the
document to the source), page count and page image sizes.
Annotating the 6,680 documents took approx. 2,500 hours of annotators’
time including the verification. Of the annotated documents, 53.7% originate
from PIF and the remaining 46.3% from UCSF IDL. The annotated documents
underwent the image pre-processing described in the Supplementary Material.
All remaining documents from PIF and UCSF form the unlabeled set.
The annotated documents in the DocILE dataset are split into training (5,180),
validation (500), and test (1,000) sets. The synthetic set with 100k documents
and unlabeled set with 932k documents are provided as an optional extension to
the training set, as unsupervised pre-training [85] and synthetic training data [6,
12, 19, 52, 53] have been demonstrated to improve results of machine learning
models in different domains.
The training, validation and test splitting was done so that the validation and
test sets contain 25% of zero-shot samples (from layouts unseen during train-
ing⁸), 25% of few-shot samples (from layouts with ≤ 3 documents seen during
training) and 50% of many-shot samples (from layouts with more examples seen
during training). This allows measuring both the generalization of the evaluated
methods and the advantage of observing documents of known layouts.
The test set annotations are not public and the test set predictions will be
evaluated through the RRC website⁹, where the benchmark and competition are
hosted. The validation set can be used when access to annotations and metadata
is needed for experiments in different tasks.
As inputs to the document synthesis described in Section 3.5, 100 one-page
documents were chosen from the training set, each from a different layout cluster.
In both the test and validation sets, roughly half of the few-shot samples are from
layouts for which synthetic documents were generated. There are no synthetic
documents generated for zero-shot samples. For many-shot samples, 35-40%
of the documents are from layouts with synthetic documents.
(where applicable) annotations. Such full annotations were the input to a rule-
based document synthesizer, which uses a rich set of content generators¹⁰ to fill
semantically relevant information in the annotated areas. Additionally, a style
generator controls and enriches the look of the resulting documents (via font
family and size, border styles, shifts of the document contents, etc.). The docu-
ments are first generated as HTML files and then rendered to PDF. The HTML
source code of all generated documents is shared with the dataset and can be
used for future work, e.g., for generative methods for conversion of document
images into a markup language.
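To make the content-generation step concrete, below is a minimal sketch of a field-value generator built on the Mimesis fake-data library listed in the references [16]; the mapping from field types to providers and the value formats are our own illustration, not the actual synthesizer configuration.

```python
# Illustrative content generator in the spirit of the DocILE synthesizer;
# the field-type mapping and value formats are hypothetical.
import random
from datetime import date, timedelta

from mimesis import Address, Person

person, address = Person(), Address()

def generate_field_value(field_type: str) -> str:
    """Return a plausible value for an annotated field type (sketch only)."""
    if field_type == "customer_billing_name":
        return person.full_name()
    if field_type == "customer_billing_address":
        return address.address()
    if field_type == "date_issue":
        return (date(2020, 1, 1) + timedelta(days=random.randint(0, 1000))).isoformat()
    if field_type == "amount_total_gross":
        return f"{random.uniform(10, 5000):.2f}"
    return ""  # field types without a dedicated generator are left to other providers
```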
3.6 Format
The dataset is shared in the form of pre-processed¹¹ document PDFs with task
annotations in JSON. Additionally, each document comes with DocTR [51] OCR
predictions with word-level text and location¹².
A Python library, docile¹³, is provided to ease working with the dataset.
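A minimal sketch of iterating the dataset with the docile library follows; the class and attribute names are assumptions based on typical usage of the repository, so consult https://github.com/rossumai/docile for the exact API.

```python
# Sketch only: Dataset, annotation.fields and the field attributes below are
# assumed names; see https://github.com/rossumai/docile for the released API.
from docile.dataset import Dataset

dataset = Dataset("val", "path/to/docile")  # split name and dataset root
for document in dataset:
    for field in document.annotation.fields:  # KILE fields
        print(field.fieldtype, field.page, field.bbox, field.text)
    for field in document.annotation.li_fields:  # LIR fields with line item ids
        print(field.line_item_id, field.fieldtype, field.text)
```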
Fig. 4: Each word is split uniformly into pseudo-character boxes based on the
number of characters. Pseudo-Character Centers are the centers of these boxes.
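The Pseudo-Character Centers from Fig. 4 can be computed directly from a word's bounding box and its character count; below is a minimal sketch assuming an axis-aligned (left, top, right, bottom) box convention.

```python
def pseudo_character_centers(bbox, text):
    """Split a word box uniformly into per-character boxes and return their centers.

    bbox: (left, top, right, bottom) of the whole word; the convention is assumed.
    """
    left, top, right, bottom = bbox
    n = max(len(text), 1)
    char_width = (right - left) / n
    y_center = (top + bottom) / 2.0
    return [(left + (i + 0.5) * char_width, y_center) for i in range(n)]

# Example: centers of the 6 pseudo-characters of the word "Phone:"
print(pseudo_character_centers((10.0, 5.0, 70.0, 15.0), "Phone:"))
```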
format etc., despite being needed in practice, is not performed. With these sim-
plifications, the main task can also be viewed as a detection problem. Note that
when several instances of the same field type are present, all of them should be
detected.
the requirements from Track 1 (on field type and location) and if it is assigned to
the correct line item. Since the matching of ground truth (GT) and predicted line
items may not be straightforward due to errors in the prediction, our evaluation
metric chooses the best matching in two steps:
1. for each pair of predicted and GT line items, the predicted fields are evalu-
ated as in Track 1,
2. the maximum matching is found between predicted and GT line items, max-
imizing the overall recall.
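A minimal sketch of this two-step matching using the Hungarian algorithm from SciPy is shown below; the per-pair scoring function pair_recall is a stand-in for the Track 1 evaluation of step 1, not the benchmark's exact scoring code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_line_items(pred_items, gt_items, pair_recall):
    """Find the line-item matching that maximizes overall recall.

    pair_recall(pred, gt) -> float is assumed to evaluate the fields of one
    predicted line item against one ground-truth line item as in Track 1.
    """
    if not pred_items or not gt_items:
        return []
    scores = np.array([[pair_recall(p, g) for g in gt_items] for p in pred_items])
    # The Hungarian algorithm maximizes the sum of recalls over matched pairs.
    rows, cols = linear_sum_assignment(-scores)
    return list(zip(rows, cols))  # indices of matched (predicted, ground-truth) items
```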
5 Baseline Methods
We provide as baselines several popular state-of-the-art transformer architec-
tures, covering text-only (RoBERTa), image-only (DETR) and multi-modal (Lay-
outLMv3) document representations. The code and model checkpoints for all
baseline methods are distributed with the dataset.
5.2 RoBERTa
RoBERTa [44] is a modification of the BERT [10] model which uses an improved
training scheme and minor tweaks of the architecture (a different tokenizer). It can
be used for the NER task simply by adding a classification head after the RoBERTa
embedding layer. Our first baseline is purely text-based and uses RoBERTaBASE
as the backbone of the joint multi-label NER model described in Section 5.1.
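A minimal sketch of such a token-classification setup with the Hugging Face Transformers library follows; the number of labels and the single-head configuration are simplified placeholders for the joint multi-label model of Section 5.1.

```python
import torch
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

# 55 KILE classes plus LIR classes and background form the real label space;
# num_labels here is a simplified placeholder.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForTokenClassification.from_pretrained("roberta-base", num_labels=56)

tokens = ["Amount", "due", ":", "1,234.56", "USD"]
encoding = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # (1, sequence_length, num_labels)
print(logits.shape)
```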
5.3 LayoutLMv3
While the RoBERTa-based baseline only operates on the text input, LayoutLMv3
[28] is a multi-modal transformer architecture that incorporates image, text, and
layout information jointly. The images are encoded by splitting them into non-over-
lapping patches and feeding the patches to a linear projection layer, after which they
are combined with positional embeddings. The text tokens are combined with
one-dimensional and two-dimensional positional embeddings, where the former
accounts for the position in the sequence of tokens, and the latter specifies the
spatial location of the token in the document. The two-dimensional positional
embedding incorporates the layout information. All these tokens are then fed
to the transformer model. We use the LayoutLMv3BASE architecture as our sec-
ond baseline, also using the multi-label NER formulation from Section 5.1. Since
LayoutLMv3BASE was pre-trained on an external document dataset, prohibited
in the benchmark, we pre-train a checkpoint from scratch in Section 5.4.
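A minimal sketch of how text, layout and image inputs are fed to LayoutLMv3 with Hugging Face Transformers is shown below; the bounding boxes are assumed to be normalized to the 0-1000 range the model expects, and the public checkpoint name is used only for illustration (the benchmark baseline is pre-trained from scratch, as described next).

```python
import torch
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=56)

image = Image.new("RGB", (842, 595), "white")  # stand-in for a rendered document page
words = ["Invoice", "No.", "12345"]            # OCR words in reading order
boxes = [[50, 40, 150, 60], [160, 40, 200, 60], [210, 40, 280, 60]]  # 0-1000 normalized

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # (1, sequence_length, num_labels)
print(logits.shape)
```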
We use the standard masked language modeling [10] as the unsupervised pre-
training objective to pre-train RoBERTaOURS and LayoutLMv3OURS 14 models.
The pre-training is performed from scratch using the 932k unlabeled samples
introduced in Section 3. Note that the pre-training uses the OCR predictions
provided with the dataset (with reading order re-ordering).
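A minimal sketch of the masked language modeling objective with Hugging Face Transformers follows; the tokenizer of the public roberta-base model is reused only for its vocabulary, and dataset handling, reading-order preparation and hyperparameters are omitted or assumed.

```python
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))  # from scratch

# 15% of tokens are masked, as in standard BERT/RoBERTa MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# texts would be OCR word sequences in reading order, one string per page.
texts = ["Invoice No. 12345 Amount due 1,234.56 USD"]
batch = collator([tokenizer(t, truncation=True, max_length=512) for t in texts])
loss = model(**batch).loss  # MLM loss on the masked positions
print(float(loss))
```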
Additionally, RoBERTaBASE/OURS+SYNTH and LayoutLMv3OURS+SYNTH base-
lines use supervised pre-training on the DocILE synthetic data.
¹⁴ Note that LayoutLMv3BASE [28] used two additional pre-training objectives,
namely masked image modelling and word-patch alignment. Since pre-training
code is not publicly available and some of the implementation details are missing,
LayoutLMv3OURS used only masked language modelling.
Table 3: Baseline results for KILE & LIR. LayoutLMv3BASE , achieving the best
results, was pre-trained on another document dataset – IIT-CDIP [37], which is
prohibited in the official benchmark. The best results among permitted models
are underlined. The primary metric for each task is shown in bold.
Model                            | KILE: F1 / AP / Prec. / Recall | LIR: F1 / AP / Prec. / Recall
RoBERTaBASE                      | 0.664 / 0.534 / 0.658 / 0.671  | 0.686 / 0.576 / 0.695 / 0.678
RoBERTaOURS                      | 0.645 / 0.515 / 0.634 / 0.656  | 0.686 / 0.570 / 0.693 / 0.678
LayoutLMv3BASE (prohibited)      | 0.698 / 0.553 / 0.701 / 0.694  | 0.721 / 0.586 / 0.746 / 0.699
LayoutLMv3OURS                   | 0.639 / 0.507 / 0.636 / 0.641  | 0.661 / 0.531 / 0.682 / 0.641
RoBERTaBASE+SYNTH                | 0.664 / 0.539 / 0.659 / 0.669  | 0.698 / 0.583 / 0.710 / 0.687
RoBERTaOURS+SYNTH                | 0.652 / 0.527 / 0.648 / 0.656  | 0.675 / 0.559 / 0.696 / 0.655
LayoutLMv3OURS+SYNTH             | 0.655 / 0.512 / 0.662 / 0.648  | 0.691 / 0.582 / 0.709 / 0.673
NER upper bound                  | 0.946 / 0.897 / 1.000 / 0.897  | 0.961 / 0.926 / 1.000 / 0.926
DETRtable + RoBERTaBASE          | - / - / - / -                  | 0.682 / 0.560 / 0.706 / 0.660
DETRtable + DETRLI + RoBERTaBASE | - / - / - / -                  | 0.594 / 0.407 / 0.632 / 0.560
5.7 Results
The baselines described above were evaluated on the DocILE test set; the re-
sults are in Table 3. Interestingly, among our pre-trained models (marked OURS),
¹⁵ https://huggingface.co/facebook/detr-resnet-50
the RoBERTa baseline outperforms the LayoutLMv3 baseline utilizing the same
RoBERTa model in its backbone. We attribute this mainly to differences in the
LayoutLMv3 pre-training: 1) our pre-training used only the masked language
modelling loss, as explained in Section 5.4, 2) we did not perform a full hyper-
parameter search, and 3) our pre-training performs image augmentations not
used in the original LayoutLMv3 pre-training; these are described in the Sup-
plementary Material.
Models pre-trained on the synthetic training data are marked with SYNTH .
Synthetic pre-training improved the results for both KILE and LIR in all cases
except for LIR with RoBERTaOURS+SYNTH , validating the usefulness of the
synthetic subset.
The best results among the models permitted in the benchmark – i.e. not uti-
lizing additional document datasets – were achieved by RoBERTaBASE+SYNTH .
6 Conclusions
The DocILE benchmark includes the largest research dataset of business docu-
ments labeled with fine-grained targets for the tasks of Key Information Local-
ization and Extraction and Line Item Recognition. The motivation is to provide
a practical benchmark for evaluation of information extraction methods in a
domain where future advancements can considerably save time that people and
businesses spend on document processing. The baselines described and evaluated
in Section 5, based on state-of-the-art transformer architectures, demonstrate
that the benchmark presents very challenging tasks. The code and model check-
points for the baselines are provided to the research community, allowing a quick
start for future work.
The benchmark is used for a research competition hosted at ICDAR 2023
and CLEF 2023 and will stay open for post-competition submissions for long-
term evaluation. We are looking forward to contributions from different machine
learning communities to compare solutions inspired by document layout mod-
elling, language modelling and question answering, computer vision, information
retrieval, and other approaches.
Areas for future contributions to the benchmark include different training
objective statements — such as different variants of NER, object detection, or
sequence-to-sequence modelling [77], or graph reasoning [74]; different model ar-
chitectures, unsupervised pre-training [28, 77], utilization of table structure —
e.g., explicitly modelling regularity in table columns to improve LIR; address-
ing dataset shifts [57, 68]; or zero-shot learning [34].
Acknowledgements We acknowledge the funding and support from Rossum and
the intensive work of its annotation team, particularly Petra Hrdličková and Kateřina
Večerková. YP and JM were supported by Research Center for Informatics (project
CZ.02.1.01/0.0/0.0/16 019/0000765 funded by OP VVV), by the Grant Agency of the
Czech Technical University in Prague, grant No. SGS20/171/OHK3/3T/13, by Project
StratDL in the realm of COMET K1 center Software Competence Center Hagenberg,
and Amazon Research Award. DK was supported by grant PID2020-116298GB-I00
funded by MCIN/AE/NextGenerationEU and ELSA (GA 101070617) funded by EU.
References
1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: Docformer: End-
to-end transformer for document understanding. In: ICCV (2021)
2. Baek, Y., Nam, D., Park, S., Lee, J., Shin, S., Baek, J., Lee, C.Y., Lee, H.: Cle-
val: Character-level evaluation for text detection and recognition tasks. In: CVPR
workshops (2020)
3. Bensch, O., Popa, M., Spille, C.: Key information extraction from documents: Eval-
uation and generator. In: Abbès, S.B., Hantach, R., Calvez, P., Buscaldi, D., Dessì,
D., Dragoni, M., Recupero, D.R., Sack, H. (eds.) Proceedings of DeepOntoNLP and
X-SENTIMENT (2021)
4. Biten, A.F., Tito, R., Gomez, L., Valveny, E., Karatzas, D.: Ocr-idl: Ocr annota-
tions for industry document library dataset. ECCV workshops (2022)
5. Borchmann, L., Pietruszka, M., Stanislawek, T., Jurkiewicz, D., Turski, M., Szyn-
dler, K., Graliński, F.: DUE: End-to-end document understanding benchmark. In:
NeurIPS (2021)
6. Bušta, M., Patel, Y., Matas, J.: E2E-MLT - an unconstrained end-to-end method
for multi-language scene text. In: ACCV workshops (2019)
7. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: ECCV (2020)
8. Davis, B., Morse, B., Cohen, S., Price, B., Tensmeyer, C.: Deep visual template-free
form parsing. In: ICDAR (2019)
9. Denk, T.I., Reisswig, C.: Bertgrid: Contextualized embedding for 2d document
representation and understanding. arXiv (2019)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. arXiv (2018)
11. Dhakal, P., Munikar, M., Dahal, B.: One-shot template matching for automatic
document data capture. In: Artificial Intelligence for Transforming Business and
Society (AITB) (2019)
12. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van
Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convo-
lutional networks. In: ICCV (2015)
13. Du, Y., Li, C., Guo, R., Yin, X., Liu, W., Zhou, J., Bai, Y., Yu, Z., Yang, Y.,
Dang, Q., Wang, H.: PP-OCR: A practical ultra lightweight OCR system. arXiv
(2020)
14. Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and perfor-
mance metrics for table detection evaluation. In: Blumenstein, M., Pal, U., Uchida,
S. (eds.) DAS (2012)
15. Garncarek, L., Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Turski, M.,
Graliński, F.: Lambert: layout-aware language modeling for information extraction.
In: ICDAR (2021)
16. Geimfari, L.: Mimesis: The fake data generator. https://github.com/
lk-geimfari/mimesis (2022)
17. Gu, J., Kuen, J., Morariu, V.I., Zhao, H., Jain, R., Barmpalios, N., Nenkova, A.,
Sun, T.: Unidoc: Unified pretraining framework for document understanding. In:
NeurIPS (2021)
18. Gu, Z., Meng, C., Wang, K., Lan, J., Wang, W., Gu, M., Zhang, L.: Xylayoutlm:
Towards layout-aware multimodal networks for visually-rich document understand-
ing. In: CVPR (2022)
19. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in nat-
ural images. In: CVPR (2016)
20. Hamad, K.A., Mehmet, K.: A detailed analysis of optical character recognition
technology. International Journal of Applied Mathematics Electronics and Com-
puters (2016)
21. Hamdi, A., Carel, E., Joseph, A., Coustaty, M., Doucet, A.: Information extraction
from invoices. In: ICDAR (2021)
22. Hammami, M., Héroux, P., Adam, S., d’Andecy, V.P.: One-shot field spotting on
colored forms using subgraph isomorphism. In: ICDAR (2015)
23. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for
document image classification and retrieval. In: ICDAR (2015)
24. Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: Tapas: Weakly
supervised table parsing via pre-training. arXiv (2020)
25. Holeček, M., Hoskovec, A., Baudiš, P., Klinger, P.: Table understanding in struc-
tured documents. In: ICDAR workshops (2019)
26. Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings
of the Australasian Language Technology Association Workshop 2018. pp. 53–59
(2018)
27. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: A pre-trained
language model focusing on text and layout for better key information extraction
from documents. In: AAAI (2022)
28. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for document
ai with unified text and image masking. In: ACM-MM (2022)
29. Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.V.: IC-
DAR2019 competition on scanned receipt OCR and information extraction. In:
ICDAR (2019)
30. Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for
semi-structured document information extraction. arXiv (2020)
31. Islam, N., Islam, Z., Noor, N.: A survey on optical character recognition system.
arXiv (2017)
32. Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: A dataset for form understanding
in noisy scanned documents. In: ICDAR (2019)
33. Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul,
J.B.: Chargrid: Towards understanding 2d documents. In: EMNLP (2018)
34. Kil, J., Chao, W.L.: Revisiting document representations for large-scale zero-shot
learning. arXiv (2021)
35. Krieger, F., Drews, P., Funk, B., Wobbe, T.: Information extraction from invoices:
a graph neural network approach for datasets with high layout variety. In: Inno-
vation Through Information Systems: Volume II: A Collection of Latest Research
on Technology Issues (2021)
36. Lee, C.Y., Li, C.L., Dozat, T., Perot, V., Su, G., Hua, N., Ainslie, J., Wang, R.,
Fujii, Y., Pfister, T.: Formnet: Structural encoding beyond sequential modeling in
form document information extraction. In: ACL (2022)
37. Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building
a test collection for complex document information processing. In: SIGIR (2006)
38. Li, C., Bi, B., Yan, M., Wang, W., Huang, S., Huang, F., Si, L.: Structurallm:
Structural pre-training for form understanding. In: ACL (2021)
39. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recog-
nition. IEEE Transactions on Knowledge and Data Engineering (2020)
40. Li, Y., Qian, Y., Yu, Y., Qin, X., Zhang, C., Liu, Y., Yao, K., Han, J., Liu, J., Ding,
E.: Structext: Structured text understanding with multi-modal transformers. In:
ACM-MM (2021)
41. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
42. Lin, W., Gao, Q., Sun, L., Zhong, Z., Hu, K., Ren, Q., Huo, Q.: Vibertgrid: a jointly
trained multi-modal 2d document representation for key information extraction
from documents. In: ICDAR (2021)
43. Liu, W., Zhang, Y., Wan, B.: Unstructured document recognition on business
invoice. Technical Report (2016)
44. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretrain-
ing Approach. arXiv (2019)
45. Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph con-
volutional network. In: ACCV workshops (2018)
46. Majumder, B.P., Potti, N., Tata, S., Wendt, J.B., Zhao, Q., Najork, M.: Repre-
sentation learning for information extraction from form-like documents. In: ACL
(2020)
47. Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Info-
graphicVQA. In: WACV (2022)
48. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for vqa on document
images. In: WACV (2021)
49. Medvet, E., Bartoli, A., Davanzo, G.: A probabilistic approach to printed document
understanding. In: ICDAR (2011)
50. Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character
recognition (ocr): A comprehensive systematic literature review (slr). IEEE Ac-
cess (2020)
51. Mindee: doctr: Document text recognition. https://github.com/mindee/doctr
(2021)
52. Nassar, A., Livathinos, N., Lysak, M., Staar, P.W.J.: Tableformer: Table structure
understanding with transformers. arXiv (2022)
53. Nayef, N., Patel, Y., Busta, M., Chowdhury, P.N., Karatzas, D., Khlif, W., Matas,
J., Pal, U., Burie, J.C., Liu, C.l., et al.: ICDAR 2019 robust reading challenge
on multi-lingual scene text detection and recognition—rrc-mlt-2019. In: ICDAR
(2019)
54. Olejniczak, K., Šulc, M.: Text detection forgot about document ocr. In: CVWW
(2023)
55. Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information
extraction from documents. In: ICDAR (2019)
56. Palm, R.B., Winther, O., Laws, F.: Cloudscan - A configuration-free invoice anal-
ysis system using recurrent neural networks. In: ICDAR (2017)
57. Pampari, A., Ermon, S.: Unsupervised calibration under covariate shift. arXiv
(2020)
58. Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: Cord: A consolidated
receipt dataset for post-ocr parsing. In: NeurIPS workshops (2019)
59. Powalski, R., Borchmann, L., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Palka,
G.: Going full-tilt boogie on document understanding with text-image-layout trans-
former. In: ICDAR (2021)
60. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y.,
Li, W., Liu, P.J., et al.: Exploring the limits of transfer learning with a unified
text-to-text transformer. JMLR (2020)
61. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. In: NeurIPS (2015)
62. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table de-
tection in invoice documents by graph neural networks. In: ICDAR (2019)
63. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog-
nition challenge. IJCV (2015)
64. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learning
for detection and structure recognition of tables in document images. In: ICDAR
(2017)
65. Schuster, D., Muthmann, K., Esser, D., Schill, A., Berger, M., Weidling, C., Aliyev,
K., Hofmeier, A.: Intellix–end-user trained information extraction for document
archiving. In: ICDAR (2013)
66. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with
distantly supervised neural networks. In: Chen, J., Gonçalves, M.A., Allen, J.M.,
Fox, E.A., Kan, M., Petras, V. (eds.) Proceedings of the 18th ACM/IEEE on Joint
Conference on Digital Libraries, JCDL (2018)
67. Šimsa, Š., Šulc, M., Skalický, M., Patel, Y., Hamdi, A.: Docile 2023 teaser: Docu-
ment information localization and extraction. In: ECIR (2023)
68. Šipka, T., Šulc, M., Matas, J.: The hitchhiker’s guide to prior-shift adaptation. In:
WACV (2022)
69. Skalický, M., Šimsa, Š., Uřičář, M., Šulc, M.: Business document information ex-
traction: Towards practical benchmarks. In: CLEF (2022)
70. Smith, R.: An overview of the tesseract ocr engine. In: ICDAR (2007)
71. Smock, B., Pesala, R., Abraham, R.: Pubtables-1m: Towards comprehensive table
extraction from unstructured documents. In: CVPR (2022)
72. Stanislawek, T., Graliński, F., Wróblewska, A., Lipiński, D., Kaliska, A., Rosalska,
P., Topolski, B., Biecek, P.: Kleister: key information extraction datasets involving
long documents with complex layouts. In: ICDAR (2021)
73. Stray, J., Svetlichnaya, S.: Deepform: Extract information from documents (2020),
https://wandb.ai/deepform/political-ad-extraction, benchmark
74. Sun, H., Kuang, Z., Yue, X., Lin, C., Zhang, W.: Spatial dual-modality graph
reasoning for key information extraction. arXiv (2021)
75. Sunder, V., Srinivasan, A., Vig, L., Shroff, G., Rahul, R.: One-shot information
extraction from document images using neuro-deductive program synthesis. arXiv
(2019)
76. Tanaka, R., Nishida, K., Yoshida, S.: Visualmrc: Machine reading comprehension
on document images. In: AAAI (2021)
77. Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C.,
Bansal, M.: Unifying vision, text, and layout for universal document processing.
arXiv (2022)
78. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting
and merging for table structure decomposition. In: ICDAR (2019)
79. Wang, J., Liu, C., Jin, L., Tang, G., Zhang, J., Zhang, S., Wang, Q., Wu, Y., Cai,
M.: Towards robust visual information extraction in real world: New dataset and
novel solution. In: AAAI (2021)
80. Web: Industry Documents Library. https://www.industrydocuments.ucsf.edu/,
accessed: 2022-10-20
81. Web: Industry Documents Library API. https://www.industrydocuments.ucsf.
edu/research-tools/api/, accessed: 2022-10-20
Supplementary Material
1 Dataset Details
The following pre-processing was applied to documents from the annotated set:
– (UCSF only) PDFs from UCSF contain the text "Source: [URL in UCSF]" at
the bottom of the page. As this text would affect the competition tasks, it
was removed from the PDFs. Instead, the original document ID is recorded
in the dataset metadata.
– Skewed pages were detected in PDFs and images rendered from the PDF
pages were deskewed automatically1 . New PDFs were generated from the
deskewed images2 , each rendered to have the longer dimension equal to 842
pixels (this corresponds to the longer side of A4 at 72 DPI).
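A minimal sketch of the rendering step with pdf2image (cited in the footnotes below) follows; the deskewing itself, done with Leptonica, is omitted, and the resizing to 842 px on the longer side is done here with Pillow.

```python
from pdf2image import convert_from_path

def render_pages(pdf_path, longer_side=842):
    """Render PDF pages so that the longer image dimension equals `longer_side` px."""
    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    resized = []
    for page in pages:
        scale = longer_side / max(page.size)
        new_size = (round(page.width * scale), round(page.height * scale))
        resized.append(page.resize(new_size))
    return resized
```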
Documents in the DocILE dataset are assigned to different layout clusters; see
Figure 1 for an example of a layout cluster. To find the layout clustering for the
¹ Using Leptonica: https://tpgit.github.io/Leptonica/skew_8c_source.html
² Using pdf2image: https://pypi.org/project/pdf2image/
annotated and unlabeled sets3 , layout clusters were first detected with the algo-
rithm described in this section and then manually corrected4 for the annotated
set.
The clustering algorithm takes KILE and LIR fields as input. These fields
were predicted by a proprietary model. Then, the following two distance func-
tions were used to detect which clusters should be merged together.
By combining the above distance functions, layout clusters were found in the
following steps:
1. Starting with single document clusters for all documents, clusters were iter-
atively merged if their distance was under a selected threshold. At most 50
documents per cluster were sent for annotation.
2. After annotation, predicted fields were replaced with ground truth fields and
the clustering was re-run. The result was then manually corrected.
3. To cluster the unlabeled set, field predictions were used and each document
was assigned independently to the closest cluster in the annotated set when
the distance was lower than a threshold. When the distance exceeded the
threshold, it was assigned cluster id −1, representing that a corresponding
cluster was not found – which happened for 18.8% of the unlabeled docu-
ments. On a random sample of 100 unlabeled documents, 72 documents were
assigned to some cluster with a precision of 86%. Whether the remaining
28 documents were correctly left unassigned was not checked, as that would
require considerably more effort.
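A minimal sketch of step 3, assigning an unlabeled document to the nearest annotated cluster under a distance threshold, is shown below; cluster_distance stands in for the combination of the two distance functions and is not the actual implementation.

```python
def assign_to_cluster(document, clusters, cluster_distance, threshold):
    """Assign `document` to the closest annotated cluster, or -1 if none is close enough.

    cluster_distance(document, cluster) -> float is a stand-in for the combined
    layout distance computed from the predicted KILE/LIR fields.
    """
    best_id, best_dist = -1, float("inf")
    for cluster_id, cluster in clusters.items():
        dist = cluster_distance(document, cluster)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id if best_dist < threshold else -1
```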
³ For the synthetic set, the clustering is implicit from the selection of the template
documents.
⁴ Notice that a full manual check was infeasible, as that would involve checking over
half a million cluster pairs for a potential merge. Instead, potential cluster merges and
splits were generated using the two distance metrics described in Section 1.3.
⁵ The y-axis is ignored as it often varies even for documents of the same layout, e.g.,
based on the length of the table.
The following rules and parameters were used to split the dataset into training,
validation and test split and to select the templates for synthetic documents.
– The target numbers of documents in the test and validation sets are 1,000
and 500, respectively. The test set is selected first and the validation set is then
selected from the remaining documents with the same algorithm. The re-
maining documents form the training set. With respect to the test set, we
consider the documents from both training and validation sets as available
for training. Below we describe the selection of the test set.
– The target number of synthetic templates (documents from the training set
whose annotations are used for generation of synthetic documents) is set to
100.
– The target ratio of documents from PIF and UCSF sources in the selected
set is 55%:45%, which is similar to the ratio in the full annotated dataset.
Without this constraint the distribution would be significantly different as
the two sources have different distribution of cluster sizes, which affects fur-
ther selection of the documents.
– Let us call a cluster X-shot with respect to the test set if it has X documents
available for training. The cluster is called zero-shot if X = 0, few-shot if
0 < X < 4 and many-shot if X ≥ 4. The target proportion of test documents
belonging to 0, 1, 2, 3-shot clusters is set to approximately 1/4, 1/12, 1/12,
1/12, respectively, for each source. I.e., 1/2 of the test documents belong to
zero-shot or few-shot clusters and 1/2 of the documents belong to many-shot
clusters.
– To ensure diversity of test samples belonging to zero-shot and few-shot clus-
ters, we allow at most 20 samples from each zero-shot or few-shot cluster in
the test set.
– When selecting few-shot samples, approximately half of the documents should
belong to clusters containing a synthetic template, e.g., 1/24 · 1000 ≈ 42 of
the documents in the test set should belong to clusters that contain a syn-
thetic template and that have two documents available for training. The
remaining synthetic templates are from many-shot clusters or clusters not
contained in the test set.
1.5 Description of Annotations

Field Types for KILE and LIR are described in Tables 1 and 2, respectively. The
numbers of documents and clusters containing individual fieldtypes are displayed
in Figures 2 and 3, respectively. Note that some rather rare fields are not repre-
sented in the validation set: iban, bic, customer_tax_id and line_item_tax_rate.
Their counts in the training set are 5, 31, 40 and 96, respectively. With the
micro-averaged evaluation metric, their impact on the evaluation is only minor.

Fig. 2: The number of documents containing individual KILE (left) and LIR
(right) fieldtypes, out of 5580 documents in the training and validation sets.
Fig. 3: The number of clusters containing individual KILE (left) and LIR (right)
fieldtypes, out of 1063 clusters in the training and validation sets.
2 Evaluation Details
Although the two benchmark tasks use different primary metrics, the evaluation
of both tasks includes AP, F1, precision and recall. Predictions can use an op-
tional flag use_only_for_ap to indicate that the prediction does not have a high
enough confidence and should not be counted towards F1, precision and recall,
but should still be used for AP⁶. When AP is computed for LIR, only the "con-
fident" predictions (with use_only_for_ap=False) are used to find the perfect
matching between line items.
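A minimal sketch of how the flag separates the two groups of metrics is shown below; the prediction attributes are illustrative, not the evaluation code's actual data model.

```python
def fscore_inputs(predictions, num_ground_truth_fields):
    """Illustrative use of the use_only_for_ap flag in the evaluation.

    Each prediction is assumed to carry `use_only_for_ap` (bool) and
    `is_correct` (bool, i.e., it matches a ground-truth field); F1, precision
    and recall only count the confident predictions, while AP uses all of them.
    """
    confident = [p for p in predictions if not p.use_only_for_ap]
    true_positives = sum(1 for p in confident if p.is_correct)
    precision = true_positives / len(confident) if confident else 0.0
    recall = true_positives / num_ground_truth_fields if num_ground_truth_fields else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, precision, recall, list(predictions)  # the full list feeds the AP computation
```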
Additionally, we set up a secondary evaluation benchmark for end-to-end
KILE and LIR, where a correctly recognized field also needs to exactly read out
the text.
The benchmark also computes results on the test subsets to compare the
performance on the 0-shot, few-shot and many-shot samples (based on how many
documents from the same layout cluster are available for training) as shown in
Table 3. Furthermore, it is possible to evaluate separately on samples belonging
to clusters for which synthetic documents were generated, to study the influence
of using synthetic data. The provided evaluation code supports evaluation on
any dataset subset, e.g., based on the source of the document (UCSF vs. PIF).
In Average Precision, predictions are sorted by score (confidence) from the high-
est to the lowest and added iteratively. This ordering is not well defined if several
predictions have the same score or if the scores are not provided. Taking into
account the use_only_for_ap flag, effectively marking some predictions as less
confident, the predictions are sorted by the following criteria:
⁶ Note that when computing AP, adding predictions with a lower score than all other
predictions can never decrease the metric, no matter how low the precision is among
these extra predictions.
3 Baseline Details
Unsupervised Pre-training: The RoBERTaOURS [5] model was pre-trained
for 50k training steps with a batch size of 64. The LayoutLMv3OURS [1] model was
pre-trained using the AdamW optimizer [6] for 30 epochs with a batch size of 16.
Both pre-trainings use cosine learning-rate decay with a linear warmup.
Note that LayoutLMv3 [1] uses three training objectives: masked language mod-
eling, masked image modeling, and word-patch alignment loss, whereas our
pre-training only uses masked language modeling. Furthermore, LayoutLMv3
pre-trains without any data augmentations on the IIT-CDIP [2] dataset. Our
setup uses random horizontal flipping of the images and trains on the unlabeled
subset of the introduced DocILE dataset.
OCR Re-ordering: We observed that for the Line Item separation task, it
is crucial to provide the text tokens in per-line reading order (i.e., from top to
bottom, each text line from left to right). The pseudocode of the used re-ordering
algorithm is in Algorithm 1. Since we use a joint multi-label model for both KILE
and LIR, the re-ordering was applied during both training and inference.
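Algorithm 1 is not reproduced here; the sketch below shows one common way to obtain such a per-line reading order (group words into lines by their vertical centers, then sort each line left to right) and is an assumption about the implementation rather than the exact algorithm.

```python
def reading_order(words):
    """Sort OCR words into per-line reading order: top-to-bottom, then left-to-right.

    Each word is a dict with a "bbox" key holding (left, top, right, bottom);
    the line-grouping tolerance is a simple heuristic, not the paper's Algorithm 1.
    """
    words = sorted(words, key=lambda w: (w["bbox"][1] + w["bbox"][3]) / 2)  # by y-center
    lines, current, last_y = [], [], None
    for word in words:
        y_center = (word["bbox"][1] + word["bbox"][3]) / 2
        height = word["bbox"][3] - word["bbox"][1]
        if last_y is not None and y_center - last_y > 0.5 * height:
            lines.append(current)   # vertical gap too large: start a new text line
            current = []
        current.append(word)
        last_y = y_center
    if current:
        lines.append(current)
    # Within each line, order words from left to right.
    return [w for line in lines for w in sorted(line, key=lambda w: w["bbox"][0])]
```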
References
1. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for document
ai with unified text and image masking. In: ACM-MM (2022)
2. Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building
a test collection for complex document information processing. In: SIGIR (2006)
3. Liao, M., Wan, Z., Yao, C., Chen, K., Bai, X.: Real-time scene text detection with
differentiable binarization. In: Proceedings of the AAAI conference on artificial in-
telligence. vol. 34, pp. 11474–11481 (2020)
4. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
5. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining
Approach. arXiv (2019)
6. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
7. Mindee: doctr: Document text recognition. https://github.com/mindee/doctr
(2021)
8. Olejniczak, K., Šulc, M.: Text detection forgot about document ocr. In: CVWW
(2023)
9. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. PAMI (2016)