
CS423
DATA WAREHOUSING AND DATA MINING

Chapter 12
Text and Web Mining

Dr. Hammad Afzal
hammad.afzal@mcs.edu.pk
Department of Computer Software Engineering
National University of Sciences and Technology (NUST)
WHAT IS “TEXT MINING”?

 “Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.” – Wikipedia

 “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” – Hearst, 1999
TWO DIFFERENT DEFINITIONS OF MINING

 Goal-oriented (effectiveness driven)
  Any process that generates useful results that are non-obvious is called “mining”.
  Keywords: “useful” + “non-obvious”
  Data isn’t necessarily massive

 Method-oriented (efficiency driven)
  Any process that involves extracting patterns from massive data is called “mining”.
  Keywords: “massive” + “pattern”
  Patterns aren’t necessarily useful
KNOWLEDGE DISCOVERY FROM TEXT DATA

 IBM’s Watson wins at Jeopardy! – 2011
WHAT IS INSIDE WATSON?

 “Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage including the full text of Wikipedia” – PC World

 “The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBPedia, WordNet, and Yago were used.” – AI Magazine
TEXT MINING AROUND US

 Sentiment analysis
 Document summarization
 Movie recommendation
 News recommendation
HOW TO PERFORM TEXT MINING?

 As computer scientists, we view it as
  Text Mining = Data Mining + Text Data

[Word cloud: sources of text data – email, blogs, tweets, web pages, news articles, scientific literature, software documentation – and supporting fields: natural language processing, information retrieval, applied machine learning]
MINING TEXT DATA: AN INTRODUCTION

 Data Mining / Knowledge Discovery operates over a spectrum of data types: Structured Data, Multimedia, Free Text, and Hypertext. The same home-loan fact can appear in each form:

  Structured Data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years); Loans($200K, [map], ...)
  Free Text: “Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.”
  Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
NATURAL LANGUAGE PROCESSING

 Example sentence: “A dog is chasing a boy on the playground”

  Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
  Syntactic analysis (parsing): noun phrases (“a dog”, “a boy”, “the playground”) and a prepositional phrase (“on the playground”) combine with the complex verb into verb phrases and, finally, a Sentence
  Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
  Inference, with the rule Scared(x) if Chasing(_,x,_): Scared(b1)
  Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…
LEVELS OF TEXT REPRESENTATION

 Character (character n-grams and sequences)
 Words (stop-words, stemming, lemmatization)
 Phrases (word n-grams, proximity features)
 Part-of-Speech Tags
 Taxonomies/Thesaurus
CHARACTER LEVEL

 A character-level representation of a text
  consists of sequences of characters…
  …a document is represented by a frequency distribution of these sequences

 Usually we deal with contiguous strings…
  …each character sequence of length 1, 2, 3, … represents a feature, with its frequency as the value
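The character-level representation above can be sketched in a few lines. This is a minimal illustration (the function name `char_ngrams` is ours, not from the slides): slide a window of length n over the text and count each contiguous substring.

```python
from collections import Counter

def char_ngrams(text, n):
    """Frequency distribution of contiguous character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profile = char_ngrams("text mining", 2)
# the bigram "in" occurs twice (in "mining"), "te" once
```

Each n-gram becomes a feature whose value is its count, exactly as described above.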
CHARACTER LEVEL: GOOD AND BAD SIDES

 It captures simple patterns at the character level
  useful for e.g. spam detection, copy detection

 It is used as a basis for “string kernels”, in combination with SVMs, for capturing complex character-sequence patterns

 For deeper semantic tasks, the representation is too weak
WORD LEVEL

 The most common representation of text; used by many techniques

 Tokenization: splitting text into words

 Relations among word surface forms and their senses:
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
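A minimal tokenizer can be sketched with a regular expression; real tokenizers handle punctuation, contractions, and numbers more carefully, but this conveys the idea (the function name `tokenize` is our own choice):

```python
import re

def tokenize(text):
    """A minimal word tokenizer: lowercase, keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("A dog is chasing a boy on the playground.")
# -> ['a', 'dog', 'is', 'chasing', 'a', 'boy', 'on', 'the', 'playground']
```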
STOP-WORDS

 Word frequencies in text follow a power-law (Zipfian) distribution: a small number of very frequent words, and a large number of low-frequency words

 Stop-words are words that carry little information

 Usually we remove them to help the methods perform better

 Stop-words are language dependent – examples:
  English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, …
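Stop-word removal is just filtering against a fixed list. A sketch, using a tiny illustrative stop-word set (real systems ship lists of a few hundred words per language):

```python
# Tiny illustrative stop-word list; production lists are much longer.
STOP_WORDS = {"a", "about", "is", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

content = remove_stop_words(["a", "dog", "is", "chasing", "a", "boy"])
# -> ['dog', 'chasing', 'boy']
```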
STEMMING

 Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)

 Stemming is the process of transforming a word into its stem (normalized form)

 Stemming provides an inexpensive mechanism to merge the different surface forms of a word into a single feature
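A deliberately crude suffix-stripping stemmer illustrates the idea; real systems use far more careful rule sets such as the Porter stemmer (this toy version is our own sketch, not an implementation of Porter's algorithm):

```python
def crude_stem(word):
    """Toy suffix stripper: remove a common suffix if a stem of
    at least 3 characters remains. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# learns, learned, learning all merge to the single stem "learn"
stems = {crude_stem(w) for w in ("learns", "learned", "learning")}
```

The point of the example: after stemming, the three surface forms contribute to one feature instead of three.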
NOUN/VERB PHRASES

 Instead of just single words, we can deal with phrases
  Noun Phrase
  Verb Phrase
PART OF SPEECH

 Introduces word types, making it possible to differentiate the functions of words

 Mainly used for “information extraction”, where we are interested in e.g. named entities, which are “noun phrases”

 Another possible use is reduction of the vocabulary (features)
  E.g. nouns carry most of the information in text documents

 POS taggers are usually learned with HMMs on manually tagged data
WORDNET – DATABASE OF LEXICAL RELATIONS

 The most well-developed and widely used lexical database for English
 Consists of 4 databases (nouns, verbs, adjectives, adverbs)

 Each database consists of sense entries – each sense consists of a set of synonyms, e.g.:
  musician, instrumentalist, player
  person, individual, someone
  life form, organism, being
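The sense entries above can be modeled as sets of synonyms. A toy sketch, hard-coding the three synsets from the slide (real code would query WordNet itself, e.g. via NLTK's wordnet corpus reader, rather than a hand-built table):

```python
# Toy synset table mirroring the sense entries above.
SYNSETS = {
    "musician": {"musician", "instrumentalist", "player"},
    "person": {"person", "individual", "someone"},
    "life form": {"life form", "organism", "being"},
}

def are_synonyms(a, b):
    """True if the two words share a synset in the toy table."""
    return any(a in s and b in s for s in SYNSETS.values())
```

For example, "musician" and "player" share a synset, while "musician" and "someone" do not.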
BAG-OF-TOKENS APPROACHES

 Documents → Feature Extraction → Token Sets

 Example document:
  “Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”

 Resulting token set:
  nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1

 Loses all order-specific information!
 Severely limits context!
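The feature-extraction step above reduces a document to a multiset of word counts. A minimal sketch (note: the slide's counts come from the full Gettysburg Address; the excerpt below yields smaller counts):

```python
from collections import Counter
import re

def bag_of_tokens(document):
    """Count word occurrences, discarding all order information."""
    return Counter(re.findall(r"[a-z]+", document.lower()))

bag = bag_of_tokens(
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)
# bag["nation"] == 1 and bag["men"] == 1 in this excerpt
```

Once the document is a bag, "a dog chasing a boy" and "a boy chasing a dog" become indistinguishable, which is exactly the order/context loss the slide warns about.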
COSINE SIMILARITY

 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
COSINE SIMILARITY

 Other vector objects: gene features in micro-arrays, …

 Applications: information retrieval, biological taxonomy, gene feature mapping, …

 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

 where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d
EXAMPLE: COSINE SIMILARITY

 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

 d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
EXAMPLE: COSINE SIMILARITY

 d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

 d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
 ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 ≈ 6.481
 ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 ≈ 4.123
 cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
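The worked example translates directly into code. A small sketch of the cosine measure as defined on the previous slide (the function name `cosine_similarity` is our own):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(d1, d2)  # ≈ 0.94, matching the hand computation
```

Because the measure normalizes by vector length, it compares the direction of the term-frequency vectors, not their magnitude, so a long document and its short summary can still score as similar.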
