Part of Speech Tagging and Named Entity Recognition
Parts of Speech
From the earliest linguistic traditions (Yaska and Panini, 5th c. BCE; Aristotle, 4th c. BCE) comes the idea that words can be classified into grammatical categories:
• parts of speech, word classes, POS, POS tags
Eight parts of speech are attributed to Dionysius Thrax of Alexandria (c. 1st c. BCE):
• noun, verb, pronoun, preposition, adverb, conjunction, participle, article
These categories are still relevant for NLP today. Words fall into two broad classes: open vs. closed.
Closed class words
• Relatively fixed membership
• Usually function words: short, frequent words with grammatical function
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …

Open class words
• Usually content words: nouns, verbs, adjectives, adverbs
• Plus interjections: oh, ouch, uh-huh, yes, hello
• New nouns and verbs enter constantly: iPhone, to fax

Examples by class:

Open class ("content") words
• Nouns – proper: Janet, Italy; common: cat, cats, mango
• Verbs – main: eat, went
• Adjectives: old, green, tasty
• Adverbs: slowly, yesterday
• Interjections: Ow, hello
• Numbers: 122,312, one
• … and more

Closed class ("function") words
• Auxiliaries: can, had
• Determiners: the, some
• Prepositions: to, with
• Conjunctions: and, or
• Particles: off, up
• Pronouns: they, its
• … and more
Part-of-Speech Tagging
Assigning a part of speech to each word in a text. Words often have more than one POS. For example, book:
• VERB: Book that flight
• NOUN: Hand me that book.
Formally, POS tagging maps a sequence x1, …, xn of words to a sequence y1, …, yn of POS tags. Tags here follow the "Universal Dependencies" tagset (Nivre et al. 2016).

Sample "tagged" English sentences:
There/PRO were/VERB 70/NUM children/NOUN there/ADV ./PUNC
Preliminary/ADJ findings/NOUN were/AUX reported/VERB in/ADP today/NOUN ’s/PART New/PROPN England/PROPN Journal/PROPN of/ADP Medicine/PROPN
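As a concrete illustration of this mapping, the sketch below tags a short sentence with NLTK's pretrained English tagger. This is a minimal sketch, not the method of these slides: it assumes nltk is installed with its tagger and universal-tagset resources downloaded, and NLTK's "universal" tags are an older scheme that differs in places from UD, so the output shown is only illustrative.

```python
# A minimal sketch of POS tagging as a sequence mapping x1..xn -> y1..yn,
# using NLTK's pretrained tagger (assumes the tagger and universal_tagset
# resources have already been fetched via nltk.download).
import nltk

words = ["Preliminary", "findings", "were", "reported", "today", "."]
tagged = nltk.pos_tag(words, tagset="universal")
print(tagged)
# Illustrative output (may vary by NLTK version/model):
# [('Preliminary', 'ADJ'), ('findings', 'NOUN'), ('were', 'VERB'),
#  ('reported', 'VERB'), ('today', 'NOUN'), ('.', '.')]
```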
Why Part of Speech Tagging?
◦ Can be useful for other NLP tasks:
◦ Parsing: POS tagging can improve syntactic parsing
◦ MT: reordering of adjectives and nouns (say, from Spanish to English)
◦ Sentiment or affective tasks: may want to distinguish adjectives or other POS
◦ Text-to-speech (how do we pronounce "lead" or "object"?)
◦ Or for linguistic or language-analytic computational tasks:
◦ Need to control for POS when studying linguistic change, like the creation of new words or meaning shift
◦ Or control for POS in measuring meaning similarity or difference

How difficult is POS tagging in English?
Roughly 15% of word types are ambiguous
• Hence 85% of word types are unambiguous
• Janet is always PROPN, hesitantly is always ADV
But those 15% tend to be very common, so ~60% of word tokens are ambiguous. E.g., back:
• earnings growth took a back/ADJ seat
• a small building in the back/NOUN
• a clear majority of senators back/VERB the bill
• enable the country to buy back/PART debt
• I was twenty-one back/ADV then

POS tagging performance in English
How many tags are correct? (Tag accuracy)
◦ About 97%
◦ Hasn't changed in the last 10+ years
◦ HMMs, CRFs, and BERT perform similarly
◦ Human accuracy is about the same
But the baseline is already 92%!
◦ A baseline is the performance of the stupidest possible method
◦ The "most frequent class" baseline is an important baseline for many tasks:
◦ Tag every word with its most frequent tag
◦ (and tag unknown words as nouns)
◦ (a code sketch of this baseline appears below, after the NER overview)
◦ Partly easy because many words are unambiguous

Sources of information for POS tagging
Janet will back the bill (will: AUX, NOUN, or VERB? back: NOUN or VERB?)
• Prior probabilities of word/tag: "will" is usually an AUX
• Identity of neighboring words: "the" means the next word is probably not a verb
• Morphology and wordshape:
◦ Prefixes: unable: un- → ADJ
◦ Suffixes: importantly: -ly → ADV
◦ Capitalization: Janet: CAP → PROPN

Standard algorithms for POS tagging
Supervised machine learning algorithms:
• Hidden Markov Models
• Conditional Random Fields (CRFs) / Maximum Entropy Markov Models (MEMMs)
• Neural sequence models (RNNs or Transformers)
• Large Language Models (like BERT), finetuned
All require a hand-labeled training set, and all reach about equal performance (97% on English). All make use of the information sources we discussed:
• via human-created features: HMMs and CRFs
• via representation learning: neural LMs

Named Entity Recognition (NER)
◦ A named entity, in its core usage, is anything that can be referred to with a proper name. The most common four tags:
◦ PER (Person): "Marie Curie"
◦ LOC (Location): "New York City"
◦ ORG (Organization): "Stanford University"
◦ GPE (Geo-Political Entity): "India", "Colorado"
◦ Often multi-word phrases
◦ But the term is also extended to things that aren't entities: dates, times, prices

Named entity tagging
The task of named entity recognition (NER):
• find the spans of text that constitute proper names
• tag the type of each entity

Why NER?
• Sentiment analysis: what is a consumer's sentiment toward a particular company or person?
• Question answering: answer questions about an entity
• Information extraction: extract facts about entities from text

Why NER is hard
1) Segmentation
• In POS tagging there is no segmentation problem, since each word gets one tag.
• In NER we have to find and segment the entities!
2) Type ambiguity
• The same name can refer to entities of different types (e.g., "Washington" can name a person, a location, or an organization).
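Returning to the most-frequent-class baseline promised above: here is a minimal sketch, assuming a toy list of (word, tag) pairs stands in for a real hand-labeled training corpus.

```python
# Most-frequent-class baseline: tag each word with the tag it received most
# often in training; tag unknown words as NOUN, as described above.
# The training data here is a toy assumption, not a real corpus.
from collections import Counter, defaultdict

train = [("the", "DET"), ("back", "VERB"), ("back", "NOUN"),
         ("back", "NOUN"), ("bill", "NOUN"), ("will", "AUX")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    # Unknown words default to NOUN, per the slide.
    return [(w, most_frequent.get(w, "NOUN")) for w in words]

print(baseline_tag(["the", "back", "seat"]))
# [('the', 'DET'), ('back', 'NOUN'), ('seat', 'NOUN')]
```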
BIO Tagging
How can we turn this structured problem into a sequence problem like POS tagging, with one label per word?

[PER Jane Villanueva] of [ORG United], a unit of [ORG United Airlines Holding], said the fare applies to the [LOC Chicago] route.
Now we have one tag per token!!!
BIO Tagging
• B: token that begins a span
• I: tokens inside a span
• O: tokens outside of any span
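A minimal sketch of the BIO conversion for the sentence above, assuming a hypothetical input format of a token list plus (start, end, type) spans with exclusive end indices; real corpora encode spans in various ways.

```python
# Convert entity spans to per-token BIO tags: B- at a span's first token,
# I- inside it, O everywhere else. The (start, end_exclusive, type) span
# format is an assumed convention for this sketch.
def to_bio(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return list(zip(tokens, tags))

tokens = ["Jane", "Villanueva", "of", "United", ",", "a", "unit", "of",
          "United", "Airlines", "Holding", ",", "said", "the", "fare",
          "applies", "to", "the", "Chicago", "route", "."]
spans = [(0, 2, "PER"), (3, 4, "ORG"), (8, 11, "ORG"), (18, 19, "LOC")]

for token, tag in to_bio(tokens, spans):
    print(token, tag)
# Jane B-PER / Villanueva I-PER / of O / United B-ORG / ... / Chicago B-LOC
```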
Number of tags (where n is the number of entity types):
• 1 O tag, n B tags, and n I tags: 2n + 1 tags in total (e.g., with the four types above, n = 4 gives 9 tags)

BIO tagging variants: IO and BIOES
[PER Jane Villanueva] of [ORG United], a unit of [ORG United Airlines Holding], said the fare applies to the [LOC Chicago] route.

Standard algorithms for NER
Supervised machine learning, given a human-labeled training set of text annotated with tags:
• Hidden Markov Models
• Conditional Random Fields (CRFs) / Maximum Entropy Markov Models (MEMMs)
• Neural sequence models (RNNs or Transformers)
• Large Language Models (like BERT), finetuned

Part of Speech Tagging Techniques

How hard is the tagging problem? Consider the word that:
• as a determiner (followed by a noun): Give me that hammer.
• as a demonstrative pronoun (without a following noun): Who gave you that?
• as a conjunction (connecting two clauses): I didn't know that she was married.
• as a relative pronoun (forming the subject, object, or complement of a relative clause): It's a song that my mother taught me.
• as an adverb (before an adjective or adverb): Three years? I can't wait that long.
(The original slide also shows a table of the number of word types in the Brown corpus by degree of ambiguity.)
• Many of the 40% ambiguous tokens are easy to disambiguate, because:
– the various tags associated with a word are not equally likely
– e.g., "a" can be a determiner or a letter (perhaps as part of an acronym), but the determiner sense is much more likely

Many tagging algorithms fall into two classes:
◦ Rule-based taggers: involve a large database of hand-written disambiguation rules specifying, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner.
◦ Stochastic taggers: resolve tagging ambiguities by using a training corpus to count the probability of a given word having a given tag in a given context.
The Brill tagger, also called the transformation-based tagger, shares features of both architectures.

Rule-Based Part-of-Speech Tagging
The earliest algorithms for automatically assigning POS were based on a two-stage architecture:
◦ First, use a dictionary to assign each word a list of potential POS tags.
◦ Second, use large lists of hand-written disambiguation rules to winnow this list down to a single POS tag for each word.
The ENGTWOL tagger (1995) is based on the same two-stage architecture, with a much more sophisticated lexicon and disambiguation rules than earlier systems.
◦ Lexicon: about 56,000 entries
◦ A word with multiple POS is counted as separate entries
In the first stage of the tagger:
◦ each word is run through the two-level lexicon transducer, and
◦ the entries for all possible POS are returned.
A set of about 1,100 constraints is then applied to the input sentences to rule out incorrect POS.
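Before looking at a sample constraint, here is a minimal sketch of the first (dictionary) stage. The three-entry lexicon, the tag names, and the fallback guess for unknown words are all toy assumptions standing in for ENGTWOL's roughly 56,000-entry lexicon.

```python
# Stage 1 of a rule-based tagger: look up each word's set of candidate POS
# tags in a lexicon (toy stand-in for ENGTWOL's ~56,000 entries).
LEXICON = {
    "that": {"DET", "PRON", "SCONJ", "ADV"},
    "wait": {"VERB", "NOUN"},
    "long": {"ADJ", "ADV", "VERB"},
}

def candidate_tags(word):
    # Unknown words get an open-class fallback (here simply NOUN).
    return LEXICON.get(word.lower(), {"NOUN"})

sentence = ["I", "can't", "wait", "that", "long", "."]
print([sorted(candidate_tags(w)) for w in sentence])
```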
A simplified version of one such constraint:

ADVERBIAL-THAT RULE
Given input: "that"
  if (+1 A/ADV/QUANT);   /* if the next word is an adjective, adverb, or quantifier */
     (+2 SENT-LIM);      /* and the word after that is a sentence boundary, */
     (NOT -1 SVOC/A);    /* and the previous word is not a verb like 'consider' */
                         /* which allows adjectives as object complements */
  then eliminate non-ADV tags
  else eliminate ADV tags
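A minimal Python sketch of this constraint, operating on the per-word candidate tag sets from the stage-1 sketch above. The tag names, the punctuation-based SENT-LIM test, and the tiny SVOC/A verb list are all simplified assumptions, not ENGTWOL's actual formalism.

```python
# Apply the ADVERBIAL-THAT rule: if "that" is followed by an adjective,
# adverb, or quantifier and then a sentence boundary, and is not preceded
# by a verb like "consider", keep only ADV; otherwise eliminate ADV.
ADJ_ADV_QUANT = {"ADJ", "ADV", "QUANT"}
SVOC_A_VERBS = {"consider", "deem", "find"}   # assumed tiny SVOC/A list
SENT_LIM = {".", "!", "?"}                    # sentence-boundary proxy

def adverbial_that_rule(words, candidates, i):
    next_adjish = i + 1 < len(words) and candidates[i + 1] & ADJ_ADV_QUANT
    then_boundary = i + 2 >= len(words) or words[i + 2] in SENT_LIM
    prev_svoc = i > 0 and words[i - 1].lower() in SVOC_A_VERBS
    if next_adjish and then_boundary and not prev_svoc:
        return candidates[i] & {"ADV"}        # eliminate non-ADV tags
    return candidates[i] - {"ADV"}            # eliminate ADV tags

words = ["I", "can't", "wait", "that", "long", "."]
cands = [{"PRON"}, {"AUX"}, {"VERB"},
         {"DET", "PRON", "SCONJ", "ADV"}, {"ADJ", "ADV"}, {"."}]
print(adverbial_that_rule(words, cands, 3))   # -> {'ADV'}
```

Here "that" sits before the adjective "long", which is followed by a sentence boundary, so the rule keeps only the ADV reading, matching the I can't wait that long example above.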