2. Basic Text Processing
Katrien Beuls
Artificial Intelligence Laboratory
Vrije Universiteit Brussel
BASIC TEXT PROCESSING
Regular Expressions, Text Normalization, Edit Distance
ELIZA
REGULAR EXPRESSIONS
DISJUNCTIONS
▶ Ranges: [A-Z]

Pattern   Description            Matches
[A-Z]     an upper case letter   Drenched Blossoms
[a-z]     a lower case letter    my beans were impatient
[0-9]     a single digit         Chapter 1: Down ...
REGULAR EXPRESSIONS
NEGATION IN DISJUNCTION
▶ Negations: [^Ss]
  The caret ^ negates only when it is the first symbol inside the brackets.

Pattern   Description                Matches
[^A-Z]    not an upper case letter   Oyfn pripetchik
[^Ss]     neither 'S' nor 's'        I have no reason
[^\.]     not a period               our resident Djinn
a^b       the pattern 'a^b'          look up a^b now
REGULAR EXPRESSIONS
MORE DISJUNCTION
Pattern               Matches
groundhog|woodchuck   groundhog, woodchuck
gupp(y|ies)           guppy, guppies

a|b|c  =  [abc]
[gG]roundhog|[Ww]oodchuck
REGULAR EXPRESSIONS
? * + .
Pattern   Description                  Matches
colou?r   optional previous char       color colour
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
baa+      1 or more of previous char   baa baaa baaaa baaaaa
beg.n     . matches any one char       begin begun began beg3n
REGULAR EXPRESSIONS
ANCHORS ^ $
Pattern      Description                        Matches
^[A-Z]       upper case letter at line start    Palo Alto
^[^A-Za-z]   non-letter at line start           1 “Hello”
\.$          period at line end                 The end.
.$           any character at line end          The end? The end!
REGULAR EXPRESSIONS
EXERCISE
▶ /the/
  Misses capitalised examples
▶ /[tT]he/
  Incorrectly matches other or theology
▶ /[^a-zA-Z][tT]he[^a-zA-Z]/
  Does not match “the” when it begins a line
▶ /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
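A quick sketch in Python of how each refinement of the pattern behaves; the test sentence is invented for illustration:

```python
import re

# The successively refined patterns from the exercise.
patterns = [
    r"the",
    r"[tT]he",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",
]

text = "The theology student said the answer is in the book."

for p in patterns:
    # findall returns group tuples when the pattern has groups,
    # so count the matches with finditer instead
    n = len(list(re.finditer(p, text)))
    print(p, "->", n, "match(es)")
```

The first pattern misses the capitalised The, the second also matches inside theology, the third drops the line-initial The again, and the final pattern finds exactly the three real occurrences.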
REGULAR EXPRESSIONS
OPERATOR PRECEDENCE HIERARCHY
Parenthesis             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |
REGULAR EXPRESSIONS
SUBSTITUTIONS
REGULAR EXPRESSIONS
CAPTURE GROUPS
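Substitutions and capture groups go together in Python's `re.sub`: parentheses capture part of the match, and `\1` in the replacement refers back to what the first group matched. The examples below are our own illustrations:

```python
import re

# s/colour/color/ -- a plain substitution
print(re.sub(r"colour", "color", "My favourite colour is blue"))

# Capture groups: \1 copies whatever the first (...) matched.
# Here we put angle brackets around every integer:
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes held 12 items"))
```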
REGULAR EXPRESSIONS
ELIZA OR A SIMPLE CHATBOT
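ELIZA works by cascading exactly this kind of regular expression substitution: a ranked list of (pattern, response) rules, where the first matching rule fires and capture groups copy the user's words into the response. A minimal sketch follows; the rules are invented for illustration and are not Weizenbaum's original script:

```python
import re

# A few ELIZA-style rules (illustrative, not the original ELIZA script).
# Each rule is (pattern, response template); \1 copies a captured group.
rules = [
    (r".*\bI[' ]?a?m (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_respond(sentence):
    # Apply the first rule whose pattern matches the input.
    for pattern, template in rules:
        if re.match(pattern, sentence, flags=re.IGNORECASE):
            return re.sub(pattern, template, sentence,
                          flags=re.IGNORECASE).upper()
    return "PLEASE GO ON"

print(eliza_respond("I am sad"))            # I AM SORRY TO HEAR YOU ARE SAD
print(eliza_respond("They are all alike"))  # IN WHAT WAY
```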
WORDS
WHAT COUNTS AS A WORD?
WORDS
HOW MANY WORDS ARE THERE IN ENGLISH?
WORDS
POPULAR ENGLISH LANGUAGE CORPORA
Herdan's Law (Heaps' Law):  |V| = kN^β
where |V| is the vocabulary size, N the number of tokens in the corpus, and
typically k ∈ [10, 100] and 0.67 < β < 0.75.
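The vocabulary growth predicted by Heaps' Law can be sketched in a few lines. The values of k and β below are illustrative choices within the commonly cited ranges, not constants from any particular corpus:

```python
# Herdan/Heaps' law sketch: |V| = k * N**beta.
# k and beta are corpus-dependent; the values below are illustrative,
# chosen inside the commonly cited ranges (k in [10, 100], beta in [0.67, 0.75]).
k, beta = 50, 0.7

for n_tokens in [1_000_000, 100_000_000, 1_000_000_000]:
    vocab = k * n_tokens ** beta
    print(f"N = {n_tokens:>13,}  ->  |V| ~ {vocab:,.0f}")
```

Note that the vocabulary keeps growing with corpus size: a billion-token corpus still yields new word types.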
WORDS
DICTIONARY ENTRIES
TEXT NORMALIZATION
FIRST NLP TASK
TEXT NORMALIZATION
WORD TOKENIZATION
Main challenges:
▶ Break off punctuation as a separate token, but preserve it when it occurs
  word-internally (Ph.D., AT&T, ...)
▶ Keep special characters and numbers in prices ($45.55) and dates (15/02/2019)
▶ Expand clitic contractions that are marked by apostrophes (we’re → we are)
▶ Tokenize multiword expressions like New York or rock ’n’ roll as a single token
Some languages, by contrast, use a comma to mark the decimal point, and spaces
(or sometimes periods) where English puts commas, for example, 555 500,50.

TEXT NORMALIZATION
PENN TREEBANK TOKENIZATION

A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and
can only occur when it is attached to another word. Some such contractions occur
in other alphabetic languages, including articles and pronouns in French (j’ai,
l’homme).

Depending on the application, tokenization algorithms may also tokenize
multiword expressions like New York or rock ’n’ roll as a single token, which
requires a multiword expression dictionary of some sort. Tokenization is thus
intimately tied up with named entity detection, the task of detecting names,
dates, and organizations (Chapter 17).

One commonly used tokenization standard is known as the Penn Treebank
tokenization standard, used for the parsed corpora (treebanks) released by the
Linguistic Data Consortium (LDC), the source of many useful datasets. This
standard separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated
words together, and separates out all punctuation:

Input:  “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ The San Francisco-based restaurant , ” they said , “ does n’t charge $ 10 ” .

Tokens can also be normalized, in which a single normalized form is chosen for
words with multiple forms like USA and US or uh-huh and uhhuh. This
standardization may be valuable, despite the spelling information that is lost in
the normalization process. For information retrieval, we might want a query for
US to match a document that has USA; for information extraction we might want to
extract coherent information that is consistent across differently-spelled
instances.
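A toy tokenizer handling some of the challenges discussed above can be written as a single verbose regular expression, in the spirit of the well-known NLTK `regexp_tokenize` example. This is a sketch, not the actual Penn Treebank tokenizer:

```python
import re

# A toy regex tokenizer: abbreviations, prices and dates stay together,
# remaining punctuation becomes separate tokens.
pattern = r"""(?x)           # verbose mode: whitespace and comments ignored
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \$?\d+(?:[./]\d+)*%?   # prices, percentages, dates: $12.40, 82%, 15/02/2019
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \.\.\.                 # ellipsis
    | [][.,;"'?():_`-]       # remaining punctuation as separate tokens
"""

text = 'That U.S.A. poster-print costs $12.40 on 15/02/2019...'
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', 'on', '15/02/2019', '...']
```

Note what a single pattern cannot do: clitic expansion (doesn’t → does n’t) and multiword expressions need extra rules or a dictionary on top.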
TEXT NORMALIZATION
NORMALIZING TOKENS
TEXT NORMALIZATION
CASE FOLDING
TEXT NORMALIZATION
EFFICIENCY
TEXT NORMALIZATION
COLLAPSING WORDS
TEXT NORMALIZATION
COLLAPSING WORDS

The most sophisticated methods for lemmatization involve complete morphological
parsing of the word. Morphology is the study of the way words are built up from
smaller meaning-bearing units called morphemes. Two broad classes of morphemes
can be distinguished: stems (the central morpheme of the word, supplying the
main meaning) and affixes (adding “additional” meanings of various kinds). So,
for example, the word fox consists of one morpheme (the morpheme fox) and the
word cats consists of two: the morpheme cat and the morpheme -s. A morphological
parser takes a word like cats and parses it into the two morphemes cat and s, or
a Spanish word like amaren (‘if in the future they would love’) into the
morphemes amar ‘to love’, 3PL, and future subjunctive.
TEXT NORMALIZATION
SENTENCE SEGMENTATION
MINIMUM EDIT DISTANCE
STRING SIMILARITY
MINIMUM EDIT DISTANCE
EXAMPLE ALIGNMENT

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s

The final row is an operation list for converting the top string into the bottom
string: d for deletion, s for substitution, i for insertion.
MINIMUM EDIT DISTANCE
LEVENSHTEIN DISTANCE
MINIMUM EDIT DISTANCE
ALGORITHM
MINIMUM EDIT DISTANCE
DYNAMIC PROGRAMMING
Dynamic programming is a table-driven method to solve problems by combining
solutions to sub-problems. Some of the most commonly used algorithms in natural
language processing make use of dynamic programming, such as the Viterbi
algorithm (Chapter 8) and the CKY algorithm for parsing (Chapter 11).

MINIMUM EDIT DISTANCE
SHORTEST PATH

The intuition of a dynamic programming problem is that a large problem can be
solved by properly combining the solutions to various sub-problems. Consider the
shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.15.

i n t e n t i o n
    delete i
  n t e n t i o n
    substitute n by e
  e t e n t i o n
    substitute t by x
  e x e n t i o n
    insert u
  e x e n u t i o n
    substitute n by c
  e x e c u t i o n

Figure 2.15 Path from intention to execution.

Imagine some string (perhaps it is exention) that is in this optimal path
(whatever it is). The intuition of dynamic programming is that if exention is in
the optimal operation list, then the optimal sequence must also include the
optimal path from intention to exention.
MINIMUM EDIT DISTANCE
ALGORITHM

function MIN-EDIT-DISTANCE(source, target) returns min-distance

 n ← LENGTH(source)
 m ← LENGTH(target)
 Create a distance matrix distance[n+1, m+1]

 # Initialization: the zeroth row and column is the distance from the empty string
 D[0,0] = 0
 for each row i from 1 to n do
     D[i,0] ← D[i-1,0] + del-cost(source[i])
 for each column j from 1 to m do
     D[0,j] ← D[0,j-1] + ins-cost(target[j])

 # Recurrence relation:
 for each row i from 1 to n do
     for each column j from 1 to m do
         D[i,j] ← MIN( D[i-1,j] + del-cost(source[i]),
                       D[i-1,j-1] + sub-cost(source[i], target[j]),
                       D[i,j-1] + ins-cost(target[j]) )

 # Termination
 return D[n,m]

Figure 2.16 The minimum edit distance algorithm, an example of the class of
dynamic programming algorithms. The various costs can either be fixed (e.g., ∀x,
ins-cost(x) = 1) or can be specific to the letter (to model the fact that some
letters are more likely to be inserted than others). We assume that there is no
cost for substituting a letter for itself (i.e., sub-cost(x, x) = 0).
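The pseudocode of Fig. 2.16 translates almost line for line into Python. The sketch below uses the Levenshtein costs assumed in Fig. 2.17: insertion and deletion cost 1, substitution cost 2 (and 0 when the letters match):

```python
def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=2):
    """Minimum edit distance (Fig. 2.16) with Levenshtein costs by default."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: the zeroth row and column is the distance
    # from the empty string
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,   # deletion
                          D[i - 1][j - 1] + sub,    # substitution (or match)
                          D[i][j - 1] + ins_cost)   # insertion

    # Termination
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8
```

The result, 8, matches the lower-right cell of the matrix in Fig. 2.17.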
MINIMUM EDIT DISTANCE
THE EDIT DISTANCE MATRIX
Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.17 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.16, using Levenshtein distance with cost of 1 for
insertions or deletions, 2 for substitutions.
MINIMUM EDIT DISTANCE
PRODUCING AN ALIGNMENT

Producing an alignment requires only a slight change to the minimum edit
distance algorithm: store pointers in each cell and compute the backtrace to
output an alignment.
      #     e     x     e     c     u     t     i     o     n
 #    0    ←1    ←2    ←3    ←4    ←5    ←6    ←7    ←8    ←9
 i   ↑1  ↖←↑2  ↖←↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7    ↖6    ←7    ←8
 n   ↑2  ↖←↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8    ↑7  ↖←↑8    ↖7
 t   ↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8    ↖7   ←↑8  ↖←↑9    ↑8
 e   ↑4    ↖3    ←4   ↖←5    ←6    ←7   ←↑8  ↖←↑9 ↖←↑10    ↑9
 n   ↑5    ↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8  ↖←↑9 ↖←↑10 ↖←↑11  ↖↑10
 t   ↑6    ↑5  ↖←↑6  ↖←↑7  ↖←↑8  ↖←↑9    ↖8    ←9   ←10  ←↑11
 i   ↑7    ↑6  ↖←↑7  ↖←↑8  ↖←↑9 ↖←↑10    ↑9    ↖8    ←9   ←10
 o   ↑8    ↑7  ↖←↑8  ↖←↑9 ↖←↑10 ↖←↑11   ↑10    ↑9    ↖8    ←9
 n   ↑9    ↑8  ↖←↑9 ↖←↑10 ↖←↑11 ↖←↑12   ↑11   ↑10    ↑9    ↖8
Figure 2.18 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).
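The backtrace idea of Fig. 2.18 can be sketched by storing, for each cell, one pointer to the neighbour it came from (rather than the full set of arrows), then walking back from the lower-right corner. The deletion symbol `*` follows the alignment notation used earlier:

```python
def min_edit_alignment(source, target):
    """Minimum edit distance with backpointers; returns one minimum-cost
    alignment as a pair of strings, using * for insertions/deletions."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # 'diag', 'up' or 'left'

    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            # keep only one best predecessor per cell
            D[i][j], ptr[i][j] = min([(D[i - 1][j - 1] + sub, "diag"),
                                      (D[i - 1][j] + 1, "up"),
                                      (D[i][j - 1] + 1, "left")])

    # Backtrace from the lower-right corner, following the pointers
    aligned_src, aligned_tgt = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "diag":             # match or substitution
            aligned_src.append(source[i - 1])
            aligned_tgt.append(target[j - 1])
            i, j = i - 1, j - 1
        elif ptr[i][j] == "up":             # deletion from source
            aligned_src.append(source[i - 1])
            aligned_tgt.append("*")
            i -= 1
        else:                               # 'left': insertion into source
            aligned_src.append("*")
            aligned_tgt.append(target[j - 1])
            j -= 1
    return "".join(reversed(aligned_src)), "".join(reversed(aligned_tgt))

src, tgt = min_edit_alignment("intention", "execution")
print(src)
print(tgt)
```

Ties between predecessors are broken arbitrarily here, so this prints one of the several possible minimum-cost alignments between intention and execution.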
MINIMUM EDIT DISTANCE
EXTENSIONS
SUMMARY