
CS423
DATA WAREHOUSING AND DATA MINING

Chapter 12
Text and Web Mining

Dr. Hammad Afzal
hammad.afzal@mcs.edu.pk
Department of Computer Software Engineering
National University of Sciences and Technology (NUST)
WHAT IS “TEXT MINING”?

 “Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.” – Wikipedia

 “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” – Hearst, 1999
TWO DIFFERENT DEFINITIONS OF MINING

 Goal-oriented (effectiveness driven)
  Any process that generates useful results that are non-obvious is called “mining”.
  Keywords: “useful” + “non-obvious”
  Data isn’t necessarily massive

 Method-oriented (efficiency driven)
  Any process that involves extracting patterns from massive data is called “mining”.
  Keywords: “massive” + “pattern”
  Patterns aren’t necessarily useful
KNOWLEDGE DISCOVERY FROM TEXT DATA

 IBM’s Watson wins at Jeopardy! – 2011
WHAT IS INSIDE WATSON?

 “Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage including the full text of Wikipedia” – PC World

 “The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBPedia, WordNet, and Yago were used.” – AI Magazine
TEXT MINING AROUND US

 Sentiment analysis
 Document summarization
 Movie recommendation
 News recommendation
HOW TO PERFORM TEXT MINING?

 As computer scientists, we view it as
  Text Mining = Data Mining + Text Data

[Word cloud: sources of text data – email, blogs, tweets, web pages, news articles, scientific literature, software documentation – and supporting fields: natural language processing, information retrieval, applied machine learning]
MINING TEXT DATA: AN INTRODUCTION

 Data Mining / Knowledge Discovery operates over a spectrum of data types: Structured Data, Multimedia, Free Text, and Hypertext. The same home-loan fact can appear in each form:

  Structured Data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years); Loans($200K, [map], ...)
  Free Text: “Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.”
  Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...
NATURAL LANGUAGE PROCESSING

 Example sentence: “A dog is chasing a boy on the playground”

  Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
  Syntactic analysis (parsing): noun phrases (“a dog”, “a boy”, “the playground”) and a prepositional phrase (“on the playground”) combine with the complex verb into verb phrases and, finally, a Sentence
  Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
  Inference, with the rule Scared(x) if Chasing(_,x,_): Scared(b1)
  Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…
LEVELS OF TEXT REPRESENTATION

 Character (character n-grams and sequences)
 Words (stop-words, stemming, lemmatization)
 Phrases (word n-grams, proximity features)
 Part-of-Speech Tags
 Taxonomies/Thesaurus
CHARACTER LEVEL

 A character-level representation of a text
  consists of sequences of characters…
  …a document is represented by a frequency distribution of these sequences

 Usually we deal with contiguous strings…
  …each character sequence of length 1, 2, 3, … represents a feature, with its frequency as the value
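The character-level representation above can be sketched in a few lines. This is a minimal illustration (the function name `char_ngrams` is ours, not from the slides): slide a window of length n over the text and count each contiguous substring.

```python
from collections import Counter

def char_ngrams(text, n):
    """Frequency distribution of contiguous character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profile = char_ngrams("text mining", 2)
# the bigram "in" occurs twice (in "mining"), "te" once
```

Each n-gram becomes a feature whose value is its count, exactly as described above.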
CHARACTER LEVEL: GOOD AND BAD SIDES

 It captures simple patterns at the character level
  useful for e.g. spam detection, copy detection

 It is used as a basis for “string kernels”, in combination with SVMs, for capturing complex character-sequence patterns

 For deeper semantic tasks, the representation is too weak
WORD LEVEL

 The most common representation of text; used by many techniques

 Tokenization: splitting text into words

 Relations among word surface forms and their senses:
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
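A minimal tokenizer can be sketched with a regular expression; real tokenizers handle punctuation, contractions, and numbers more carefully, but this conveys the idea (the function name `tokenize` is our own choice):

```python
import re

def tokenize(text):
    """A minimal word tokenizer: lowercase, keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("A dog is chasing a boy on the playground.")
# -> ['a', 'dog', 'is', 'chasing', 'a', 'boy', 'on', 'the', 'playground']
```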
STOP-WORDS

 Word frequencies in text follow a power-law (Zipfian) distribution: a small number of very frequent words, and a large number of low-frequency words

 Stop-words are words that carry little information

 Usually we remove them to help the methods perform better

 Stop-words are language dependent – examples:
  English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, …
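Stop-word removal is just filtering against a fixed list. A sketch, using a tiny illustrative stop-word set (real systems ship lists of a few hundred words per language):

```python
# Tiny illustrative stop-word list; production lists are much longer.
STOP_WORDS = {"a", "about", "is", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

content = remove_stop_words(["a", "dog", "is", "chasing", "a", "boy"])
# -> ['dog', 'chasing', 'boy']
```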
STEMMING

 Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)

 Stemming is the process of transforming a word into its stem (normalized form)

 Stemming provides an inexpensive mechanism to merge the different surface forms of a word into a single feature
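A deliberately crude suffix-stripping stemmer illustrates the idea; real systems use far more careful rule sets such as the Porter stemmer (this toy version is our own sketch, not an implementation of Porter's algorithm):

```python
def crude_stem(word):
    """Toy suffix stripper: remove a common suffix if a stem of
    at least 3 characters remains. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# learns, learned, learning all merge to the single stem "learn"
stems = {crude_stem(w) for w in ("learns", "learned", "learning")}
```

The point of the example: after stemming, the three surface forms contribute to one feature instead of three.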
NOUN/VERB PHRASES

 Instead of just single words, we can deal with phrases
  Noun Phrase
  Verb Phrase
PART OF SPEECH

 Introduces word types, making it possible to differentiate the functions of words

 Mainly used for “information extraction”, where we are interested in e.g. named entities, which are “noun phrases”

 Another possible use is reduction of the vocabulary (features)
  E.g. nouns carry most of the information in text documents

 POS taggers are usually learned with HMMs on manually tagged data
WORDNET – DATABASE OF LEXICAL RELATIONS

 The most well-developed and widely used lexical database for English
 Consists of 4 databases (nouns, verbs, adjectives, adverbs)

 Each database consists of sense entries – each sense consists of a set of synonyms, e.g.:
  musician, instrumentalist, player
  person, individual, someone
  life form, organism, being
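The sense entries above can be modeled as sets of synonyms. A toy sketch, hard-coding the three synsets from the slide (real code would query WordNet itself, e.g. via NLTK's wordnet corpus reader, rather than a hand-built table):

```python
# Toy synset table mirroring the sense entries above.
SYNSETS = {
    "musician": {"musician", "instrumentalist", "player"},
    "person": {"person", "individual", "someone"},
    "life form": {"life form", "organism", "being"},
}

def are_synonyms(a, b):
    """True if the two words share a synset in the toy table."""
    return any(a in s and b in s for s in SYNSETS.values())
```

For example, "musician" and "player" share a synset, while "musician" and "someone" do not.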
BAG-OF-TOKENS APPROACHES

 Documents → Feature Extraction → Token Sets

 Example document:
  “Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”

 Resulting token set:
  nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1

 Loses all order-specific information!
 Severely limits context!
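The feature-extraction step above reduces a document to a multiset of word counts. A minimal sketch (note: the slide's counts come from the full Gettysburg Address; the excerpt below yields smaller counts):

```python
from collections import Counter
import re

def bag_of_tokens(document):
    """Count word occurrences, discarding all order information."""
    return Counter(re.findall(r"[a-z]+", document.lower()))

bag = bag_of_tokens(
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)
# bag["nation"] == 1 and bag["men"] == 1 in this excerpt
```

Once the document is a bag, "a dog chasing a boy" and "a boy chasing a dog" become indistinguishable, which is exactly the order/context loss the slide warns about.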
COSINE SIMILARITY

 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
COSINE SIMILARITY

 Other vector objects: gene features in micro-arrays, …

 Applications: information retrieval, biological taxonomy, gene feature mapping, …

 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

 where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d
EXAMPLE: COSINE SIMILARITY

 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

 d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
EXAMPLE: COSINE SIMILARITY

 d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

 d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
 ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 ≈ 6.481
 ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 ≈ 4.123
 cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
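The worked example translates directly into code. A small sketch of the cosine measure as defined on the previous slide (the function name `cosine_similarity` is our own):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(d1, d2)  # ≈ 0.94, matching the hand computation
```

Because the measure normalizes by vector length, it compares the direction of the term-frequency vectors, not their magnitude, so a long document and its short summary can still score as similar.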
