Basic Text Processing
Corpora
Corpus (plural corpora):
A computer-readable collection of text or speech.
The Brown corpus is a million-word collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963.
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15 if we
count punctuation. Whether we treat the period (“.”) and comma (“,”) as words depends on the task.
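As a quick check of these counts, here is a minimal Python sketch (the regular expressions are only illustrative) that counts the tokens both ways:

import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
words_only = re.findall(r"\w+", sentence)           # words without punctuation
with_punct = re.findall(r"\w+|[^\w\s]", sentence)   # words plus punctuation tokens
print(len(words_only), len(with_punct))             # prints: 13 15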
The Switchboard corpus of American English telephone conversations between strangers
was collected in the early 1990s; it contains 2430 conversations averaging 6 minutes each,
totaling 240 hours of speech and about 3 million words.
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled?
Was there consent? Pre-processing?
Plus: annotation process, language variety, demographics, etc.
Text Normalization
Let’s begin with the complete works of Shakespeare in one file, sh.txt.
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < sh.txt
Change all non-alpha to newlines
| sort
Sort in alphabetical order
| uniq -c
Merge and count each type
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot ...
The second step: sorting
tr -sc 'A-Za-z' '\n' < sh.txt | sort | head
A
A
A
A
A
A
A
A
A
...
Merging upper and lower case
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
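For comparison, a minimal Python sketch of the same pipeline (lowercase, split on non-alphabetic characters, count the word types); it assumes the Shakespeare text is in sh.txt as above:

import re
from collections import Counter

with open('sh.txt') as f:
    words = re.findall(r'[a-z]+', f.read().lower())   # lowercase, keep alphabetic runs
counts = Counter(words)                               # count each word type
print(counts.most_common(5))                          # the five most frequent types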
But we’ll often want to keep the punctuation that occurs word internally, in
examples like m.p.h., Ph.D., and AT&T.
Issues in Tokenization
Clitic: a word that doesn't stand on its own and can only occur when it is attached
to another word.
A tokenizer can expand these contractions, converting what’re to the two tokens what are, and we’re to we are.
◦ In French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
Tokenization algorithms may also tokenize multiword expressions like New York
or rock ’n’ roll as a single token, which requires a multiword expression
dictionary of some sort.
Tokenization is thus intimately tied up with named entity recognition, the task of
detecting names, dates, and organizations.
◦ New York, rock ’n’ roll
Tokenization in NLTK
Since tokenization needs to be run before any other language processing, it
needs to be very fast.
The standard method for tokenization is therefore to use deterministic
algorithms based on regular expressions compiled into very efficient finite state
automata.
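For example, NLTK's regexp_tokenize applies a regular expression of this kind. The pattern below is an illustrative sketch adapted from the NLTK book's example, not NLTK's default tokenizer:

import nltk

pattern = r'''(?x)          # verbose regular expression
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
text = "That U.S.A. poster-print costs $12.40..."
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']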
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!
Chinese words are composed of characters called "hanzi" (or sometimes just "zi")
But deciding what counts as a word is complex and not agreed upon.
How to do word tokenization in Chinese?
3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
In other languages (like Thai and Japanese), more complex word segmentation
is required.
• The standard algorithms are neural sequence models trained by
supervised machine learning.
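As one very simple illustration (a greedy baseline, not the neural models just mentioned), maximum matching segments text by repeatedly taking the longest word-list entry that starts at the current position; a minimal sketch with a tiny hypothetical word list:

def max_match(text, wordlist):
    # Greedy longest-match-first segmentation against a word list.
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # take the longest dictionary match; fall back to a single character
            if text[i:j] in wordlist or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

wordlist = {'姚明', '进入', '总决赛'}
print(max_match('姚明进入总决赛', wordlist))   # ['姚明', '进入', '总决赛']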
Byte Pair Encoding
A third option for tokenizing text, instead of:
• white-space segmentation
• single-character segmentation (as in Chinese)
If our training corpus contains, say, the words low, new, newer, but not lower,
then if the word lower appears in our test corpus, our system will not know
what to do with it.
To deal with this unknown word problem, modern tokenizers often
automatically induce sets of tokens that include tokens smaller than
words, called subwords.
Subwords can be arbitrary substrings, or they can be meaning-bearing
units like the morphemes -est or -er.
This is called subword tokenization (because tokens can be parts of words as well as
whole words).
Subword tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE)
◦ Unigram language modeling tokenization
◦ WordPiece
All have 2 parts:
◦ A token learner that takes a raw training corpus and induces a
vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and tokenizes it
according to that vocabulary
Byte Pair Encoding (BPE) token learner
Successive merges on the training corpus:
◦ Merge e r to er
◦ Merge er _ to er_
◦ Merge n e to ne
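A minimal sketch of the BPE token learner and segmenter; the toy corpus below (low, lowest, newer, wider, new, with frequencies following the standard textbook example) reproduces the three merges listed above:

import re
from collections import Counter

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the corpus vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def segment(word, merges):
    # Token segmenter: greedily apply the learned merges, in order, to a new word.
    symbols = list(word) + ['_']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Each word is written as space-separated characters plus the end-of-word symbol '_'.
vocab = {'l o w _': 5, 'l o w e s t _': 2,
         'n e w e r _': 6, 'w i d e r _': 3, 'n e w _': 2}

merges = []
for _ in range(3):                        # learn three merges
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges)                     # [('e', 'r'), ('er', '_'), ('n', 'e')]
print(segment('newer', merges))   # ['ne', 'w', 'er_'] with only these three merges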
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into the morpheme amar
‘to love’, and the morphological features 3PL and future subjunctive.
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single
exception of the red crosses and the written notes.

After stemming:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet
in all thing name and height and sound with the singl except of the red cross
and the written note .
Porter Stemmer: A widely used stemming algorithm
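NLTK ships an implementation; a minimal usage sketch (the word list here is just drawn from the passage above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["accurate", "copy", "single", "exception", "crosses", "notes"]
print([stemmer.stem(w) for w in words])
# ['accur', 'copi', 'singl', 'except', 'cross', 'note']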