Wikidata:Lexicographical data/Documentation/Lemmata
The lemmata (singular lemma) of a lexeme are primarily used as human-readable representations of the lexeme. Usually lemmata are the written forms of a word, phrase, or affix that would be found in a dictionary describing them, whether or not they are considered the 'base' or 'stem' forms morphologically.
Each lemma consists of a string accompanied by a valid IETF language tag (or 'language code'). This code, which often coincides with the ISO 639-1 or ISO 639-3 code for a language (such as es or dag), may also contain subtags referring to a particular writing system (e.g. ja-hira), region (e.g. de-ch), or orthographic standard (e.g. pt-ao1990).
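The structure of such a code can be illustrated with a minimal Python sketch that splits it at its hyphens. The function name split_lemma_code is hypothetical and introduced only for illustration; the split is purely structural and does not check the code against the IANA subtag registry or against the codes Wikidata actually supports.

```python
from typing import List, Tuple


def split_lemma_code(code: str) -> Tuple[str, List[str]]:
    """Split a lemma language code into (primary language, further subtags).

    Hypothetical helper: it only cuts the string at hyphens and performs
    no validation of the individual subtags.
    """
    language, *subtags = code.split("-")
    return language, subtags


# Codes taken from the paragraph above.
for code in ("es", "dag", "ja-hira", "de-ch", "pt-ao1990"):
    print(code, split_lemma_code(code))
```

Here the first element is the part that usually coincides with an ISO 639 code, and the remaining list holds any script, region, or orthography subtags.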
Lexeme lemmata are what are displayed when using the {{L}} template to link to a lexeme on Wikidata.
Examples
- The English lexeme Lexeme:L1296 has the lemma 'tall' because most English dictionaries provide information about this lexeme under the heading 'tall' and not under something like 'taller' or 'tallest'.
- The Bengali lexeme Lexeme:L308045 has the lemma 'খাওয়া' because most Bengali dictionaries provide information about this lexeme under the heading 'খাওয়া' and not under something like 'খাই', 'খা-', or 'খেতে'.
- The Italian lexeme Lexeme:L1196895 has the lemma 'cantare' because most Italian dictionaries provide information about it under that heading and not under something like 'canto', 'cantante', or 'cantato'.
- The Modern Greek lexeme Lexeme:L1098915 has the lemma 'πίνω' because most dictionaries of the modern Greek language provide information about it under that heading and not under something like 'πιω', 'πίνομαι', or 'πιει'.
- The Korean lexeme Lexeme:L154 has the lemma '마시다' because most Korean dictionaries provide information about it under that form, rather than something like '마시-', '마셔', or even '마십니다'.
- The Japanese lexeme Lexeme:L830 has the lemmata '為る' and 'する', even though the use of the 為 character is very unusual, because most Japanese dictionaries provide information about this lexeme under both headings.
Multiple lemmata
Lexemes can have several lemmata, particularly when there are differences in the writing system or other orthographic conventions within a given language. Different lemmata are indicated with different language tags, and a lexeme may only have one lemma for a given language tag (that is, there cannot be two lemmata on the same lexeme with language code da or two with language code gsg).
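The one-lemma-per-language-tag rule can be pictured as a mapping keyed by language code. The following Python sketch is only an illustration under that assumption: the add_lemma helper and the plain dictionary are hypothetical and are not Wikidata's own implementation. The sample values come from the Hausa lexeme Lexeme:L314793 described on this page.

```python
from typing import Dict


def add_lemma(lemmata: Dict[str, str], code: str, text: str) -> None:
    """Add a lemma, refusing a language code that is already in use.

    Hypothetical helper mirroring the rule that a lexeme may only have
    one lemma for a given language tag.
    """
    if code in lemmata:
        raise ValueError(f"a lemma with language code {code!r} already exists")
    lemmata[code] = text


# Hausa lexeme L314793: one lemma per writing system.
hausa: Dict[str, str] = {}
add_lemma(hausa, "ha", "aboki")
add_lemma(hausa, "ha-arab", "أَبُوكِی")

try:
    add_lemma(hausa, "ha", "aboki")  # a second lemma under "ha" is refused
except ValueError as error:
    print(error)
```

Keying the mapping by language code makes the restriction automatic: a second spelling needs its own distinct code before it can be stored alongside the first.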
Examples of writing system differences
- The Hausa lexeme Lexeme:L314793 has two lemmata, 'aboki' with code ha and 'أَبُوكِی' with code ha-arab, which are representations of the same dictionary form in the Latin script (used more generally) and the Arabic script.
- The Hindustani lexeme Lexeme:L641622 has two lemmata, 'चाचा' with code hi and 'چاچا' with code ur, which are representations of the same dictionary form (pronounced /t͡ʃɑː.t͡ʃɑː/) in the Devanagari script (used for Hindi) and the Arabic script (used for Urdu).
- The Japanese lexeme Lexeme:L572 has two lemmata, 'のむ' with code ja-hira and '飲む' with code ja, which are representations of the same dictionary form in either exclusively hiragana or the mixed script of Chinese characters, hiragana, and katakana.
- The New Persian lexeme Lexeme:L742511 has two lemmata, 'دیدن' with code fa and 'дидан' with code tg, which are representations of the same dictionary form in the Arabic script (used for Persian in Iran and Afghanistan) and in the Cyrillic script (used for Tajik).
- The Punjabi lexeme Lexeme:L679506 has two lemmata, 'ਪਿੰਡ' with code pa and 'پِنڈ' with code pnb, which are representations of the same dictionary form in the Gurmukhi script (used in India) and the Arabic script (used in Pakistan).
- The Southern Min lexeme Lexeme:L308008 has three lemmata, '城市' with code nan-hani, 'siânn-tshī' with code nan-x-Q56929, and 'siâⁿ-chhī' with code nan-x-Q559173. These represent the same word form written in either Chinese characters or one of two romanization systems.
- The Turkish lexeme Lexeme:L1171764 has two lemmata, 'yaşamak' with code tr and 'یاشامق' with code ota, which are representations of the same word form before and after the introduction of the Latin script to Turkish in 1928.
Examples of orthographic variation differences
- The English lexeme Lexeme:L35013 has two lemmata, 'hemophilia' with code en and 'haemophilia' with code en-gb, reflecting a difference in spelling this word between different parts of the English-speaking world.
- The Hebrew lexeme Lexeme:L63672 has two lemmata, 'אדום' with code he and 'אָדֹם' with code he-x-Q21283070, which reflect differences in how the same word form is spelt depending on whether diacritics are present.
- The Portuguese lexeme Lexeme:L500697 has two lemmata, 'ciência' with code pt and 'sciência' with code pt-colb1945, reflecting differences in orthographic standard between (in the former case) Portugal, Brazil, and Cape Verde and (in the latter case) the rest of the Lusosphere.
- The Esperanto lexeme Lexeme:L616380 has three lemmata, 'akuŝistino' with code eo, 'akusxistino' with code eo-xsistemo, and 'akushistino' with code eo-hsistemo, reflecting differences in how the circumflex diacritic is substituted when typesetting Esperanto using only ASCII characters.
- The Belarusian lexeme Lexeme:L8880 has two lemmata, 'есці' with code be and 'е́сьці' with code be-tarask, reflecting differences in orthographic standard before and after reforms introduced in the territory of Belarus in 1933.
Handling lemmata language code uniqueness
Because it is not possible to add multiple lemmata to a lexeme that share the same language code, different strategies have been employed across different languages to deal with this problem. What works for the languages below may not be optimal for your own language: be sure to weigh these and other strategies before choosing one for your own language!
- For Bokmål and Nynorsk, where the variation in spelling of a word is tied more to personal preference than to any particular standard, entirely different lexemes are created to deal with this variation, such as in Lexeme:L1219886 for 'kvalkjøt' and Lexeme:L1219887 for 'kvalkjøtt', both in Nynorsk.
- For Southern Min, one pronunciation variation is treated as the lemma and the others as forms, such as in Lexeme:L306309 where the pronunciation 'muê' from one dialect is treated as a lemma and pronunciations from three other dialects are added as lexeme forms.
- For Bengali, where a spelling differs based on its prescription by the language authority in Bangladesh or the language authority in West Bengal, this difference is indicated via a custom language code (see below) on the lemma using the QIDs for those language authorities' items, as in Lexeme:L308189.
Custom language codes
Some language codes used in lemmata contain '-x-'. There are two main reasons for this: 1) the desired language code, while a valid IETF language tag, is unsupported or unsupportable in Wikidata, or 2) a variant of an existing supported language tag is unsupported or unsupportable in Wikidata.
Entirely unsupported language codes
For languages whose language codes are not yet supported or are not supportable, a last-resort option is to combine the base code mis with a private-use subtag containing the QID of the Wikidata item for the language.
- Lexemes in Torwali (Q2665246), such as Lexeme:L1003531, have a lemma with the code mis-x-Q2665246 (though the desired supportable code would be trw).
- Lexemes in Soyot (Q4426878), such as Lexeme:L1015954, have a lemma with the code mis-x-Q4426878 (though no supportable language code exists).
- Lexemes in Láadan (Q35757), such as Lexeme:L623039, have a lemma with the code mis-x-Q35757 (though the desired supportable code would be ldn).
- Lexemes in Yaghnobi (Q34247), such as Lexeme:L684534, have a lemma with the code mis-x-Q34247 (though the desired supportable code would be yai).
- Lexemes in Proto-Indo-European (Q37178), such as Lexeme:L638724, have a lemma with the code mis-x-Q37178 (though no supportable language code exists).
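The pattern shared by the codes in this list can be sketched in Python as follows. The function name unsupported_language_code is hypothetical, and the check that a QID is the letter Q followed by a positive integer is an assumption made for the sketch rather than a rule stated on this page.

```python
import re


def unsupported_language_code(qid: str) -> str:
    """Build a last-resort lemma code from the QID of a language item.

    Hypothetical helper: joins the base code "mis" and the QID with the
    private-use marker "-x-".
    """
    # Assumed QID shape: "Q" followed by a positive integer.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a QID: {qid!r}")
    return f"mis-x-{qid}"


print(unsupported_language_code("Q2665246"))  # Torwali: mis-x-Q2665246
print(unsupported_language_code("Q37178"))    # Proto-Indo-European: mis-x-Q37178
```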
Unsupported variants of supported language codes
If a language has a supported language code, but one of its variations has a language code that is not supported or supportable, the private-use subtag may be attached directly to the existing supported code.
- Lexemes in the Varendri (Q48726757) variety of Bengali, such as Lexeme:L672268, have a lemma with the code bn-x-Q48726757 (where 'bn' is the existing supported code, but no supportable code substitute exists).
- Lemmata in Devanagari Sindhi (Q116688933) for lexemes in Sindhi use the language code sd-x-Q116688933 (where 'sd' is the existing supported code, but the supportable code sd-deva exists).
- Lemmata in the Adlam (Q19606346) script for lexemes in Fula use the language code ff-x-Q19606346 (where 'ff' is the existing supported code, but the supportable code ff-adlm exists).
- Lemmata in the Brolikva (Q113301414) system for lexemes in Brahui use the language code brh-x-Q113301414 (where 'brh' is the existing supported code, but the supportable code brh-latn exists).
- Lemmata in the Mongolian (Q1055705) script for lexemes in Mongolian use the language code mn-x-Q1055705 (where 'mn' is the existing supported code, but the supportable code mn-mong was rejected by the Language Committee for some reason).
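Reading such a code back into its two parts can be sketched in Python as below. The function name split_custom_code is hypothetical, and the expected QID shape (the letter Q followed by a positive integer) is an assumption of the sketch; codes without a '-x-' part are simply reported as not custom.

```python
import re
from typing import Optional, Tuple


def split_custom_code(code: str) -> Optional[Tuple[str, str]]:
    """Return (base code, QID) for a custom lemma code, or None otherwise.

    Hypothetical helper: the base code is whatever precedes "-x-", and
    the private-use part must look like a QID.
    """
    base, marker, qid = code.partition("-x-")
    # Assumed QID shape: "Q" followed by a positive integer.
    if not marker or not re.fullmatch(r"Q[1-9][0-9]*", qid):
        return None
    return base, qid


for code in ("bn-x-Q48726757", "mis-x-Q37178", "ha-arab"):
    print(code, split_custom_code(code))
```

A base of 'mis' then marks an entirely unsupported language, while any other base marks an unsupported variant of a supported code.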