
Basic Text Processing

Words and Corpora
Corpus (plural corpora):
A computer-readable collection of text or speech.
The Brown corpus is a million-word collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963.

He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15 if we
count punctuation. Whether we treat period (“.”), comma (“,”), and so on as words
depends on the task.
The Switchboard corpus of American English telephone conversations between strangers
was collected in the early 1990s; it contains 2430 conversations averaging 6 minutes each,
totaling 240 hours of speech and about 3 million words.

I do uh main- mainly business data processing

This utterance has two kinds of disfluencies.


The broken-off word main- is called a fragment.

Words like uh and um are called fillers or filled pauses.

Should we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
How about inflected forms like cats versus cat?
These two words have the same lemma cat but are different wordforms.
A lemma is a set of lexical forms having the same stem, the same major part-of-
speech, and the same word sense.
The wordform is the full inflected or derived form of the word.
For morphologically complex languages like Arabic, we often need to deal with
lemmatization.
For many tasks in English, however, wordforms are sufficient.
How many words in a sentence?

Types are the number of distinct words in a corpus;
if the set of words in the vocabulary is V, the number of types is the
vocabulary size |V|.
Tokens are the total number N of running words.
Brown sentence:
They picnicked by the pool, then lay back on the grass and looked at
the stars.
Ignoring punctuation: 16 tokens and 14 types
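
A minimal Python sketch of this count (ignoring punctuation, as above):

# Count tokens and types in the Brown sentence, ignoring punctuation.
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
tokens = re.findall(r"[A-Za-z]+", sentence)   # word tokens, punctuation stripped
types = set(tokens)                           # distinct wordforms
print(len(tokens), "tokens,", len(types), "types")   # 16 tokens, 14 types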
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types, |V| is size of vocabulary
Heaps' Law (also called Herdan's Law): |V| = kN^β, where k is a positive constant
and typically 0.67 < β < 0.75
i.e., vocabulary size grows faster than the square root of the number of word tokens
Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
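
A small Python sketch of the relationship (the values of k and β below are
illustrative, chosen only so the estimate roughly matches the Shakespeare row;
they are not fitted constants):

# Heaps' / Herdan's Law: |V| = k * N**beta, with corpus-dependent k and beta.
def heaps_vocab_size(n_tokens, k=2.0, beta=0.7):
    return k * n_tokens ** beta

print(round(heaps_vocab_size(884_000)))   # about 29,000 types (table shows 31 thousand)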
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
H/E: dost tha or rahega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, SES
Corpus datasheets
Gebru et al (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Text Normalization

Every NLP task requires text normalization:


1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Unix Tools for Crude Tokenization and Normalization

Let’s begin with the ‘complete words’ of Shakespeare in one file, shakes.txt.

We can use tr to tokenize the words by changing every sequence of non-alphabetic
characters to a newline
('A-Za-z' means alphabetic, the -c option complements to non-alphabetic,
and the -s option squeezes each run of newlines down to a single newline)
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Simple Tokenization in UNIX

Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < shakes.txt
Change all non-alpha to newlines
| sort
Sort in alphabetical order
| uniq -c
Merge and count each type

1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot ...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...
Merging upper and lower case
tr -sc 'A-Za-z' '\n' < shakes.txt | tr A-Z a-z | sort | uniq -c

Sorting the counts


tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' |
sort | uniq -c | sort -n -r

The -n option to sort means to sort numerically rather than alphabetically,
and the -r option means to sort in reverse order (highest-to-lowest):
The results show that the most frequent words in Shakespeare, as in any
other corpus, are the short function words like articles, pronouns,
prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
Word Tokenization
Tokenization: the task of segmenting running text into words.
While the Unix command sequence just removed all the numbers and
punctuation, for most NLP applications we’ll need to keep these in our
tokenization.

We often want to break off punctuation as a separate token;
commas are a useful piece of information for parsers, and periods help indicate
sentence boundaries.

But we’ll often want to keep the punctuation that occurs word-internally, in
examples like m.p.h., Ph.D., AT&T, and cap’n.
Issues in Tokenization

Can't just blindly remove punctuation:


◦ m.p.h., Ph.D., AT&T, cap’n
Special characters and numbers will need to be kept in prices ($45.55) and dates
(01/02/06); we don’t want to segment that price into separate tokens of “45” and
“55”.
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses (someone@cs.colorado.edu)

Clitic: a word that doesn't stand on its own and can only occur when it is attached
to another word.
A tokenizer can expand clitic contractions marked by apostrophes,
converting what’re to the two tokens what are, and we’re to we are.
◦ In French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
tokenization algorithms may also tokenize multiword expressions like New York
or rock ’n’ roll as a single token, which requires a multiword expression
dictionary of some sort.
Tokenization is thus intimately tied up with named entity recognition, the task of
detecting names, dates, and organizations.
◦ New York, rock ’n’ roll
Tokenization in NLTK
Since tokenization needs to be run before any other language processing, it
needs to be very fast.
The standard method for tokenization is therefore to use deterministic
algorithms based on regular expressions compiled into very efficient finite state
automata.
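
As a sketch of this regular-expression approach, NLTK's regexp_tokenize can be used
with a verbose pattern (the pattern below is adapted from the one in the NLTK book
and covers only a few of the cases discussed above):

# Regular-expression tokenization with NLTK (illustrative pattern only).
from nltk.tokenize import regexp_tokenize

pattern = r'''(?x)          # verbose regexps: allow comments and whitespace
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \w+(?:[-']\w+)*       # words with optional internal hyphens/apostrophes
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # punctuation kept as separate tokens
'''

print(regexp_tokenize("That U.S.A. poster-print costs $12.40...", pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']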
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!

How do we decide where the token boundaries should be?


Word tokenization in Chinese

Chinese words are composed of characters called "hanzi" (or sometimes just "zi")

Each one represents a meaning unit called a morpheme.

Each word has on average 2.4 of them.

But deciding what counts as a word is complex and not agreed upon.
How to do word tokenization in Chinese?

姚明进入总决赛 “Yao Ming reaches the finals”

3 words?
姚明 进入 总决赛
YaoMing reaches finals

5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals

7 characters? (don't use words at all):


姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
Word tokenization / segmentation

So in Chinese it's common to just treat each character (zi) as a token.


• So the segmentation step is very simple

In other languages (like Thai and Japanese), more complex word segmentation
is required.
• The standard algorithms are neural sequence models trained by
supervised machine learning.
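
For Chinese, treating each character as a token really is this simple; a minimal
Python sketch:

# Character ("zi") tokenization for Chinese.
sentence = "姚明进入总决赛"        # "Yao Ming reaches the finals"
tokens = list(sentence)            # each hanzi becomes its own token
print(tokens)                      # ['姚', '明', '进', '入', '总', '决', '赛']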
Byte Pair Encoding
A third option for tokenizing text
Instead of
• white-space segmentation
• single-character segmentation (as in Chinese)

Use the data to tell us how to tokenize.

If our training corpus contains, say, the words low, new, and newer, but not lower,
then if the word lower appears in our test corpus, our system will not know
what to do with it.
To deal with this unknown word problem, modern tokenizers often
automatically induce sets of tokens that include tokens smaller than
words, called subwords.
Subwords can be arbitrary substrings, or they can be meaning-bearing
units like the morphemes -est or -er.
Subword tokenization (because tokens can be parts of words as well as
whole words)
Subword tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE)
◦ Unigram language modeling tokenization
◦ WordPiece
All have 2 parts:
◦ A token learner that takes a raw training corpus and induces a
vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and tokenizes it
according to that vocabulary
Byte Pair Encoding (BPE) token learner

Let the vocabulary be the set of all individual characters
= {A, B, C, D, …, a, b, c, d, …}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE token learner algorithm
Byte Pair Encoding (BPE)
Addendum
Most subword algorithms are run inside space-separated tokens.
So we commonly first add a special end-of-word symbol '_' before each space in the
training corpus.
Next, separate into letters.
BPE token learner
Original (very fascinating🙄) corpus:
low low low low low lowest lowest newer newer
newer newer newer newer wider wider wider
new new
Add end-of-word tokens and separate into letters, resulting in this corpus
representation and vocabulary:

corpus:
5   l o w _
2   l o w e s t _
6   n e w e r _
3   w i d e r _
2   n e w _

vocabulary: _, d, e, i, l, n, o, r, s, t, w
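
A minimal Python sketch of the learner loop on this toy corpus ('_' is the
end-of-word symbol; a simplified implementation, with ties broken by first
occurrence). Its first three merges match the ones shown on the next slides:

# BPE token learner (simplified sketch).
from collections import Counter

def get_pair_counts(word_counts):
    # Count how often each pair of adjacent symbols occurs in the corpus.
    pairs = Counter()
    for symbols, freq in word_counts.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def learn_bpe(word_counts, k):
    # word_counts maps tuples of symbols (letters plus '_') to word frequencies.
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(word_counts)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent adjacent pair
        merges.append(best)
        merged = "".join(best)
        new_counts = {}
        for symbols, freq in word_counts.items():  # replace every adjacent pair
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_counts[tuple(out)] = freq
        word_counts = new_counts
    return merges

# The toy corpus above: 5 low, 2 lowest, 6 newer, 3 wider, 2 new
corpus = {tuple("low_"): 5, tuple("lowest_"): 2,
          tuple("newer_"): 6, tuple("wider_"): 3, tuple("new_"): 2}
print(learn_bpe(corpus, 3))   # [('e', 'r'), ('er', '_'), ('n', 'e')]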
BPE token learner

Merge e r to er
BPE

Merge er _ to er_
BPE

Merge n e to ne
BPE

The next merges are:
(ne, w), (l, o), (lo, w), (new, er_), (low, _)
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Basic Text Processing

Word Normalization, Lemmatization and Stemming
Word Normalization
Def: Putting words/tokens in a standard format.
Choosing a single normal form for words with
multiple forms like
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
This standardization may be valuable, despite the spelling
information that is lost in the normalization process
Case folding: another kind of normalization
Applications like IR: reduce all letters to lower case
Mapping everything to lower case means that Woodchuck and
woodchuck are represented identically
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
Case folding is very helpful for generalization in many tasks, such as information
retrieval or speech recognition.

For sentiment analysis, MT, and information extraction, however:
◦ Case is helpful (US versus us is important)
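
A minimal Python sketch of case folding and of the information it can lose:

# Case folding maps variant spellings to one form, but can erase distinctions.
print("Woodchuck".lower() == "woodchuck".lower())   # True: good for IR-style matching
print("US".lower() == "us")                         # True: the US/us distinction is lost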
Lemmatization
Task of determining that two words have the same root, despite their surface differences
Represent all words as their lemma, their shared root
= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ Spanish quiero (‘I want’), quieres (‘you want’) → querer ‘want’
◦ He is reading detective stories → He be read detective story
Lemmatization is done by Morphological Parsing
Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical functions

Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into morpheme amar
‘to love’, and the morphological features 3PL and future subjunctive.
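
In practice, for many English tasks, lemmatization is done with an off-the-shelf tool
rather than a full morphological parser. A sketch using NLTK's WordNet lemmatizer
(requires the WordNet data via nltk.download('wordnet'); the part of speech is needed
to map forms like is → be):

# Lemmatization with NLTK's WordNet lemmatizer.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))               # car
print(wnl.lemmatize("is", pos="v"))        # be
print(wnl.lemmatize("reading", pos="v"))   # read
print(wnl.lemmatize("stories"))            # story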
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single exception
of the red crosses and the written notes.

Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all
thing name and height and sound with the singl except of the red cross and the
written note .
Porter Stemmer: widely used stemming algorithm

Based on a series of rewrite rules run in series
◦ A cascade, in which the output of each pass is fed as input to the next pass
Some sample rules:
◦ ATIONAL → ATE (e.g., relational → relate)
◦ ING → ε if stem contains vowel (e.g., motoring → motor)
◦ SSES → SS (e.g., grasses → grass)
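
A sketch using NLTK's implementation of the Porter stemmer (the expected outputs in
the comment match the stemmed passage on the previous slide; note that NLTK
lowercases its output by default):

# Stemming with NLTK's Porter stemmer.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ["accurate", "copy", "single", "exception", "crosses", "notes"]:
    print(w, "->", stemmer.stem(w))
# accurate -> accur, copy -> copi, single -> singl,
# exception -> except, crosses -> cross, notes -> note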
Dealing with complex morphology is
necessary for many languages
◦ e.g., the Turkish word:
◦ Uygarlastiramadiklarimizdanmissinizcasina
◦ `(behaving) as if you are among those whom we could not civilize’
◦ Uygar `civilized’ + las `become’
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
The most useful cues for segmenting a text into sentences are punctuation, like
periods, question marks, and exclamation points.
! and ? are mostly unambiguous, but the period “.” is very ambiguous:
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to classify a period as either (a)
part of the word or (b) a sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules based on this tokenization.
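
A sketch of sentence segmentation using NLTK's pretrained Punkt sentence tokenizer
(requires the Punkt model via nltk.download('punkt'); the example text is illustrative):

# Sentence segmentation with NLTK's Punkt tokenizer.
from nltk.tokenize import sent_tokenize

text = "The stock fell to $4.3 per share. Dr. Smith was not pleased."
print(sent_tokenize(text))
# Should give two sentences: the periods in "$4.3" and "Dr." are not treated
# as sentence boundaries.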
