Basic Text Processing
Corpora
Corpus (plural corpora):
A computer-readable collection of text or speech.
The Brown corpus is a million-word collection of samples from 500 written
English texts from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963.
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15 if we
count punctuation. Whether we treat the period (“.”) and comma (“,”) as words depends on the task.
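As a quick check of these counts, here is a minimal Python sketch (the regular expressions are only illustrative) that counts the tokens both ways:

import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
words_only = re.findall(r"\w+", sentence)           # words without punctuation
with_punct = re.findall(r"\w+|[^\w\s]", sentence)   # words plus punctuation tokens
print(len(words_only), len(with_punct))             # prints: 13 15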
The Switchboard corpus of American English telephone conversations between strangers
was collected in the early 1990s; it contains 2430 conversations averaging 6 minutes each,
totaling 240 hours of speech and about 3 million words.
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled?
Was there consent? Pre-processing?
Plus: annotation process, language variety, demographics, etc.
Text Normalization
Let’s begin with the complete works of Shakespeare in one file, sh.txt.
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < sh.txt
Change all non-alpha to newlines
| sort
Sort in alphabetical order
| uniq -c
Merge and count each type
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot ...
The second step: sorting
tr -sc 'A-Za-z' '\n' < sh.txt | sort | head
A
A
A
A
A
A
A
A
A
...
Merging upper and lower case
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
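For comparison, a minimal Python sketch of the same pipeline (lowercase, split on non-alphabetic characters, count the word types); it assumes the Shakespeare text is in sh.txt as above:

import re
from collections import Counter

with open('sh.txt') as f:
    words = re.findall(r'[a-z]+', f.read().lower())   # lowercase, keep alphabetic runs
counts = Counter(words)                               # count each word type
print(counts.most_common(5))                          # the five most frequent types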
But we’ll often want to keep the punctuation that occurs word internally, in
examples like m.p.h., Ph.D., and AT&T.
Issues in Tokenization
Clitic: a word that doesn't stand on its own and can only occur when it is attached
to another word.
A tokenizer can expand these contractions, converting what’re to the two tokens what are, and we’re to we are.
◦ In French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
Tokenization algorithms may also tokenize multiword expressions like New York
or rock ’n’ roll as a single token, which requires a multiword expression
dictionary of some sort.
Tokenization is thus intimately tied up with named entity recognition, the task of
detecting names, dates, and organizations.
◦ New York, rock ’n’ roll
Tokenization in NLTK
Since tokenization needs to be run before any other language processing, it
needs to be very fast.
The standard method for tokenization is therefore to use deterministic
algorithms based on regular expressions compiled into very efficient finite state
automata.
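For example, NLTK's regexp_tokenize applies a regular expression of this kind. The pattern below is an illustrative sketch adapted from the NLTK book's example, not NLTK's default tokenizer:

import nltk

pattern = r'''(?x)          # verbose regular expression
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
text = "That U.S.A. poster-print costs $12.40..."
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']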
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!
Chinese words are composed of characters called "hanzi" (or sometimes just "zi")
But deciding what counts as a word is complex and not agreed upon.
How to do word tokenization in Chinese?
3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
In other languages (like Thai and Japanese), more complex word segmentation
is required.
• The standard algorithms are neural sequence models trained by
supervised machine learning.
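As one very simple illustration (a greedy baseline, not the neural models just mentioned), maximum matching segments text by repeatedly taking the longest word-list entry that starts at the current position; a minimal sketch with a tiny hypothetical word list:

def max_match(text, wordlist):
    # Greedy longest-match-first segmentation against a word list.
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # take the longest dictionary match; fall back to a single character
            if text[i:j] in wordlist or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

wordlist = {'姚明', '进入', '总决赛'}
print(max_match('姚明进入总决赛', wordlist))   # ['姚明', '进入', '总决赛']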
Byte Pair Encoding
A third option for tokenizing text, instead of:
• white-space segmentation
• single-character segmentation (as in Chinese)
If our training corpus contains, say, the words low, new, newer, but not lower,
then if the word lower appears in our test corpus, our system will not know
what to do with it.
To deal with this unknown word problem, modern tokenizers often
automatically induce sets of tokens that include tokens smaller than
words, called subwords.
Subwords can be arbitrary substrings, or they can be meaning-bearing
units like the morphemes -est or -er.
This is called subword tokenization (because tokens can be parts of words as well as
whole words).
Subword tokenization
Three common algorithms:
◦ Byte-Pair Encoding (BPE)
◦ Unigram language modeling tokenization
◦ WordPiece
All have 2 parts:
◦ A token learner that takes a raw training corpus and induces a
vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and tokenizes it
according to that vocabulary
Byte Pair Encoding (BPE) token learner
Successive merges on the training corpus:
◦ Merge e r to er
◦ Merge er _ to er_
◦ Merge n e to ne
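A minimal sketch of the BPE token learner and segmenter; the toy corpus below (low, lowest, newer, wider, new, with frequencies following the standard textbook example) reproduces the three merges listed above:

import re
from collections import Counter

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the corpus vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def segment(word, merges):
    # Token segmenter: greedily apply the learned merges, in order, to a new word.
    symbols = list(word) + ['_']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Each word is written as space-separated characters plus the end-of-word symbol '_'.
vocab = {'l o w _': 5, 'l o w e s t _': 2,
         'n e w e r _': 6, 'w i d e r _': 3, 'n e w _': 2}

merges = []
for _ in range(3):                        # learn three merges
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges)                     # [('e', 'r'), ('er', '_'), ('n', 'e')]
print(segment('newer', merges))   # ['ne', 'w', 'er_'] with only these three merges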
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into the morpheme amar
‘to love’, and the morphological features 3PL and future subjunctive.
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy,
complete in all things-names and heights and soundings-with the single
exception of the red crosses and the written notes.

After stemming:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet
in all thing name and height and sound with the singl except of the red cross
and the written note .
Porter Stemmer: A widely used stemming algorithm
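NLTK ships an implementation; a minimal usage sketch (the word list here is just drawn from the passage above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["accurate", "copy", "single", "exception", "crosses", "notes"]
print([stemmer.stem(w) for w in words])
# ['accur', 'copi', 'singl', 'except', 'cross', 'note']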