Week3

Text Processing

 Introduction
 Tokenizing (segmenting) words
 Normalizing word formats
 Segmenting sentences
 The practical
CS3TM20 © XH 1
Introduction
 Text normalization is the process of transforming text into a
single canonical form; it precedes almost any natural language
processing of a text. Applications include:
• predictive text and handwriting recognition
• web search engines
• machine translation, and text analysis to detect sentiment in tweets
and blogs
 At least three tasks are commonly applied as part of any
normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
How many words?
N = number of tokens (running word instances in a document)
V = vocabulary = set of types; |V| is the size of the vocabulary
Heaps' Law (also called Herdan's Law): |V| = kN^β, where often 0.67 < β < 0.75
i.e., vocabulary size grows faster than the square root of the number of word tokens

Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
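As a small illustration of the token/type distinction (a sketch added for the practical; the function name is my own, not from the slides):

```python
import re

def token_type_counts(text):
    """Return (N, |V|): the number of word tokens and distinct word types."""
    tokens = re.findall(r"\w+", text.lower())
    return len(tokens), len(set(tokens))

n, v = token_type_counts("the cat sat on the mat and the dog sat too")
print(n, v)  # 11 tokens, but only 8 types ("the" and "sat" repeat)
```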
Tokenizing (segmenting) words
 Separate a chunk of continuous text into separate words.
 Tokenization is thus intimately tied up with named entity recognition.
 For languages that use space characters between words, segment off a
token between instances of spaces (space-based tokenization).
 Tokenization needs to be fast, since it is run before any other
language processing.
 The standard method for tokenization is therefore to use
deterministic algorithms based on regular expressions.
 Word tokenization is more complex in languages that do not use
spaces to mark potential word boundaries.

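Space-based tokenization can be sketched with a single regular expression (a minimal illustration; note that it leaves punctuation attached to words, which the next slide addresses):

```python
import re

def space_tokenize(text):
    # Split on any run of whitespace characters (spaces, tabs, newlines).
    return re.split(r"\s+", text.strip())

print(space_tokenize("Separate a chunk of continuous text"))
# ['Separate', 'a', 'chunk', 'of', 'continuous', 'text']
```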
Issues in Tokenization
• Can't just blindly remove punctuation:
  • m.p.h., Ph.D., AT&T, cap'n
  • prices ($45.55)
  • dates (01/02/06)
  • URLs (http://www.stanford.edu)
  • hashtags (#nlproc)
  • email addresses (someone@cs.colorado.edu)
• Clitic: a word that doesn't stand on its own
  • "are" in we're, French "je" in j'ai, "le" in l'honneur
• When should multiword expressions (MWEs) be treated as single words?
  • New York, rock 'n' roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
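A deterministic regex tokenizer in the spirit of NLTK's `regexp_tokenize` can handle several of the cases above. This pattern is an illustrative sketch, not exhaustive (it does not cover dates, email addresses, or MWEs, for example):

```python
import re

PATTERN = r"""(?x)           # verbose mode: whitespace and comments ignored
      https?://\S+           # URLs
    | \$?\d+(?:\.\d+)?%?     # prices and percentages like $45.55
    | \#\w+                  # hashtags like #nlproc
    | \w+(?:[.'&]\w+)*\.?    # words, abbreviations (m.p.h.), clitics (we're), AT&T
    | [^\w\s]                # any other punctuation character
"""

def regex_tokenize(text):
    return re.findall(PATTERN, text)

print(regex_tokenize("The price is $45.55 #nlproc"))
# ['The', 'price', 'is', '$45.55', '#nlproc']
```

Alternation order matters: the URL and price patterns are tried before the generic word pattern, so `$45.55` is kept whole rather than shattered into `$`, `45`, `.`, `55`.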
Word Normalization
 Putting words/tokens in a standard format.
 Information Retrieval: indexed text & query terms must
have the same form:
• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are
 We implicitly define equivalence classes of terms.
Case folding
• Applications like IR: reduce all letters to lower case,
since users tend to use lower case.
• Possible exception: upper case in mid-sentence?
  e.g., General Motors, Fed vs. fed, SAIL vs. sail
• For sentiment analysis, MT, and information extraction,
case is helpful (US versus us is important).
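The mid-sentence exception above can be sketched as a simple folding rule (a toy heuristic for illustration, not a production case folder):

```python
def case_fold(tokens):
    """Lower-case tokens, but keep capitals that occur mid-sentence,
    since those may mark names or acronyms (General Motors, Fed, SAIL)."""
    folded = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():
            folded.append(tok)          # keep mid-sentence capitals
        else:
            folded.append(tok.lower())  # fold sentence-initial and lower-case tokens
    return folded

print(case_fold(["The", "Fed", "raised", "rates"]))
# ['the', 'Fed', 'raised', 'rates']
```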
Stemming
• Reduce terms to stems, chopping off affixes crudely.
• Stemming a word or sentence may result in words that are
not actual words.

Original:
  This was not the map we found in Billy Bones's chest, but
  an accurate copy, complete in all things-names and heights
  and soundings-with the single exception of the red crosses
  and the written notes.

Stemmed:
  Thi wa not the map we found in Billi Bone s chest but an
  accur copi complet in all thing name and height and sound
  with the singl except of the red cross and the written note .
Porter Stemmer
• Based on a series of rewrite rules run in series.
• A cascade, in which the output of each pass is fed to the next pass.
• Some sample rules:
  ATIONAL → ATE (relational → relate)
  ING → ε (motoring → motor)
  SSES → SS (grasses → grass)
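The rule-cascade idea can be sketched as a toy rewrite-rule stemmer. Note this is not the real Porter algorithm, which groups rules into measured steps and checks stem conditions (e.g., that a vowel precedes -ing):

```python
# Suffix rewrite rules, tried in order; longer suffixes first
# so "relational" is not caught by a shorter rule.
RULES = [
    ("ational", "ate"),   # relational -> relate
    ("sses", "ss"),       # grasses -> grass
    ("ies", "i"),         # ponies -> poni
    ("ing", ""),          # motoring -> motor
]

def toy_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(toy_stem("relational"), toy_stem("motoring"), toy_stem("grasses"))
# relate motor grass
```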
Lemmatization
 Represent all words as their lemma, their shared root
= dictionary headword form:
• am, are, is → be
• car, cars, car's, cars' → car
• Spanish quiero ('I want'), quieres ('you want') → querer ('want')
• He is reading detective stories → He be read detective story
What is the difference between stemming and lemmatization?
• Morphology is the study of words: how they are formed and
their relationship to other words in the same language.
It analyses the structure of words and parts of words, such
as stems, root words, prefixes, and suffixes.
• Lemmatization usually refers to doing things properly, with
the use of a vocabulary and morphological analysis of words.
• Stemming usually refers to a crude heuristic process that
chops off the ends of words in the hope of achieving this goal
correctly most of the time.
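The contrast can be shown on the same tokens. The lemma dictionary here is a tiny hand-made stand-in for the vocabulary and morphological analysis a real lemmatizer uses:

```python
# Toy lemma dictionary: a stand-in for real morphological analysis.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "reading": "read", "stories": "story"}

def lemmatize(word):
    return LEMMAS.get(word, word)

def crude_stem(word):
    # Chop common suffixes without checking the result is a real word.
    for suffix in ("ing", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)]
    return word

print(crude_stem("stories"), lemmatize("stories"))
# stor story  <- the stemmer produces a non-word, the lemmatizer a headword
```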
Sentence Segmentation
• "!" and "?" are mostly unambiguous, but the period "." is very ambiguous:
  • sentence boundary
  • abbreviations like Inc. or Dr.
  • numbers like .02% or 4.3
• Common algorithm: tokenize first; use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence boundary.
  • An abbreviation dictionary can help.
• Sentence segmentation can then often be done by rules
based on this tokenization.
• Determining if a word is end-of-sentence: a decision tree.
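The rule-based approach with an abbreviation dictionary can be sketched as follows (a minimal illustration; the abbreviation list and number rule are deliberately small):

```python
import re

# Known abbreviations whose trailing period is NOT a sentence boundary.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e."}

def segment(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            if tok.lower() in ABBREVIATIONS:
                continue                           # abbreviation, keep going
            if re.fullmatch(r"\d+(\.\d+)?\.?", tok):
                continue                           # bare number like 1999.
            sentences.append(" ".join(current))    # sentence boundary found
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment("Dr. Smith arrived. He met Mr. Jones!"))
# ['Dr. Smith arrived.', 'He met Mr. Jones!']
```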
Practical
• The practical is based on NLTK Chapter 3.
• The material covers more than an hour's worth. Please
focus on the linguistic aspects if you are confused by the
programming language.
• Why is text processing needed?
• What are the terminologies?
• Can you explain the output of the code?

https://www.nltk.org/book/ch03.html
Slides adapted from Jure Leskovec
