Wiktextract: Wiktionary as Machine-Readable Structured Data

Tatu Ylonen

Abstract

We present a machine-readable structured data version of Wiktionary. Unlike previous Wiktionary extractions, the new extractor, Wiktextract, fully interprets and expands templates and Lua modules in Wiktionary. This enables it to perform a more complete, robust, and maintainable extraction. The extracted data is multilingual and includes lemmas, inflected forms, translations, etymology, usage examples, pronunciations (including URLs of sound files), lexical and semantic relations, and various morphological, syntactic, semantic, topical, and dialectal annotations. We extract all data from the English Wiktionary. Comparing against previous extractions from language-specific dictionaries, we find that its coverage for non-English languages often matches or exceeds the coverage in the language-specific editions, with the added benefit that all glosses are in English. The data is freely available and regularly updated, enabling anyone to add more data and correct errors by editing Wiktionary. The extracted data is in JSON format and designed to be easy to use by researchers, downstream resources, and application developers.
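As a rough illustration of consuming such JSON data, the sketch below reads one JSON object per line and groups English glosses by the language of the headword. The line-per-entry layout and the field names `word`, `lang`, `pos`, `senses`, and `glosses` are assumptions made for this example; the abstract does not specify the schema, and the sample records are invented.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def glosses_by_language(entries: Iterable[dict]) -> Dict[str, List[str]]:
    """Group glosses by headword language; field names are assumed, not confirmed."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        lang = entry.get("lang", "unknown")
        for sense in entry.get("senses", []):
            grouped[lang].extend(sense.get("glosses", []))
    return dict(grouped)


# Hypothetical sample records, made up for illustration only.
SAMPLE = [
    '{"word": "koira", "lang": "Finnish", "pos": "noun", "senses": [{"glosses": ["dog"]}]}',
    "",
    '{"word": "chien", "lang": "French", "pos": "noun", "senses": [{"glosses": ["dog"]}]}',
]

if __name__ == "__main__":
    print(glosses_by_language(iter_entries(SAMPLE)))
```

The same functions accept an open file handle in place of `SAMPLE`, so a large extraction can be processed line by line without loading it into memory at once.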

Anthology ID: 2022.lrec-1.140
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1317–1325
URL: https://aclanthology.org/2022.lrec-1.140/
Cite (ACL): Tatu Ylonen. 2022. Wiktextract: Wiktionary as Machine-Readable Structured Data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1317–1325, Marseille, France. European Language Resources Association.
Cite (Informal): Wiktextract: Wiktionary as Machine-Readable Structured Data (Ylonen, LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.140.pdf
