
Screenshot 2024-06-04 at 12.02.17 AM

Chapter 3 of the Business Intelligence and Analytics textbook discusses text analytics and text mining, emphasizing the importance of extracting knowledge from unstructured data. It differentiates between text mining, web mining, and data mining, outlines the text mining process, and highlights various applications in fields such as law, finance, and medicine. The chapter also covers natural language processing (NLP) and its challenges, as well as tools available for text mining.

Uploaded by

54saleh53
Copyright
© © All Rights Reserved

Business Intelligence and Analytics:

Systems for Decision Support


Global Edition
(10th Edition)

Chapter 3:
Text Analytics, Text Mining
Learning Objectives
- Describe text mining and understand the need for text mining
- Differentiate between text mining, Web mining, and data mining
- Understand the different application areas for text mining
- Know the process of carrying out a text mining project
- Understand the different methods to introduce structure to text-based data
(Continued…)
7-2 © Pearson Education Limited 2014
Text Mining Concepts
- 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
- Unstructured corporate data is doubling in size every 18 months
- Tapping into these information sources is not optional; it is necessary to stay competitive
- Answer: text mining
  - A semi-automated process of extracting knowledge from unstructured data sources
  - a.k.a. text data mining or knowledge discovery in textual databases


Text Analytics and Text Mining
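[Figure: Text analytics and its related areas. Labels inside TEXT ANALYTICS: Text Mining, Web Mining, Data Mining, Information Retrieval, Information Extraction. Supporting fields: Natural Language Processing, Linguistics, Machine Learning. Underlying disciplines: Computer Science, Statistics, Management Science, Artificial Intelligence.]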



Data Mining versus Text Mining
- Both seek novel and useful patterns
- Both are semi-automated processes
- The difference is the nature of the data:
  - Structured versus unstructured data
  - Structured data: in databases
  - Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
- Text mining: first impose structure on the data, then mine the structured data
Text Mining Concepts
- Benefits of text mining are obvious, especially in text-rich data environments
  - e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
- Electronic communication records (e.g., email)
  - Spam filtering
  - Email prioritization and categorization
  - Automatic response generation
Text Mining Application Areas
- Information extraction
- Topic tracking
- Summarization
- Categorization
- Clustering
- Concept linking
- Question answering



Natural Language Processing (NLP)
- Structuring a collection of text
  - Old approach: bag-of-words
  - New approach: natural language processing
- NLP is …
  - a very important concept in text mining
  - a subfield of artificial intelligence and computational linguistics
  - the study of "understanding" natural human language
- Syntax versus semantics-based text mining
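
To make the limitation of the bag-of-words approach concrete, here is a minimal sketch in Python. The function name `bag_of_words` and the simple regular-expression tokenizer are illustrative assumptions, not part of the chapter. It shows that two sentences with opposite meanings collapse to the same bag because word order is discarded.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

# Word order is lost, so the two bags are identical.
print(a == b)
```

Telling these two sentences apart requires syntactic or semantic analysis, which is the motivation for the NLP-based approach.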
Natural Language Processing (NLP)
- What is "understanding"?
  - Humans understand; what about computers?
  - Natural language is vague and context-driven
  - True understanding requires extensive knowledge of a topic
  - Can (or will) computers ever understand natural language in the same way, and as accurately, as we do?
Natural Language Processing (NLP)
- Challenges in NLP
  - Part-of-speech tagging
  - Text segmentation
  - Word sense disambiguation
  - Syntax ambiguity
  - Imperfect or irregular input
  - Speech acts
- Dream of the AI community
  - to have algorithms that are capable of automatically reading and obtaining knowledge from text
Natural Language Processing (NLP)
- WordNet
  - A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
  - A major resource for NLP
  - Needs automation to be completed
- Sentiment analysis
  - A technique used to detect favorable and unfavorable opinions toward specific products and services
  - SentiWordNet
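
As a rough illustration of lexicon-based sentiment analysis, the sketch below scores a text by summing word polarities. The tiny `POLARITY` dictionary is a made-up stand-in for a real resource such as SentiWordNet, and the function names are hypothetical. Real systems also handle negation, intensity, and word sense, none of which is attempted here.

```python
import re

# Hypothetical toy lexicon: +1 for favorable words, -1 for unfavorable ones.
POLARITY = {
    "great": 1, "love": 1, "reliable": 1,
    "poor": -1, "broken": -1, "slow": -1,
}

def sentiment_score(text):
    """Sum the polarity of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(POLARITY.get(token, 0) for token in tokens)

def sentiment_label(text):
    """Map the numeric score to a coarse opinion label."""
    score = sentiment_score(text)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"

print(sentiment_label("I love this phone, great battery"))
print(sentiment_label("Slow and broken on arrival"))
```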



NLP Task Categories
- Information retrieval, information extraction
- Named-entity recognition
- Question answering
- Automatic summarization
- Natural language generation & understanding
- Machine translation
- Foreign language reading & writing
- Speech recognition
- Text proofing, optical character recognition
Text Mining Applications
- Marketing applications
  - Enables better CRM
- Security applications
  - ECHELON, OASIS
  - Deception detection (…)
- Medicine and biology
  - Literature-based gene identification (…)
- Academic applications
  - Research stream analysis
Text Mining Process
[Figure: Context diagram for the text mining process. Activity A0: "Extract knowledge from available data sources". Inputs: unstructured data (text), structured data (databases). Output: context-specific knowledge. Constraints: software/hardware limitations, privacy issues, linguistic limitations. Enablers: domain expertise, tools and techniques.]



Text Mining Process
[Figure: The three-step text mining process, with feedback from Task 2 to Task 1 and from Task 3 to Task 2]

- Task 1, Establish the Corpus: collect and organize the domain-specific unstructured data
- Task 2, Create the Term-Document Matrix: introduce structure to the corpus
- Task 3, Extract Knowledge: discover novel patterns from the T-D matrix

- The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
- The output of Task 1 is a collection of documents in some digitized format for computer processing
- The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies
- The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations



Text Mining Process
- Step 1: Establish the corpus
  - Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)
  - Digitize, standardize the collection (e.g., all in ASCII text files)
  - Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)
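
The "standardize and place in a common place" step could look something like the following sketch, which assumes the corpus has already been digitized into `.txt` files in one directory. The function name `load_corpus`, the file name `memo.txt`, and the normalization choices (lowercasing and collapsing whitespace) are illustrative assumptions.

```python
import tempfile
from pathlib import Path

def load_corpus(directory):
    """Read every .txt file in `directory` into a {filename: text} dict.

    Each document is standardized by collapsing runs of whitespace
    and converting to lowercase.
    """
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        corpus[path.name] = " ".join(raw.split()).lower()
    return corpus

# Demo with a temporary directory holding one hypothetical document.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "memo.txt").write_text("Project   status:\nON TRACK", encoding="utf-8")
    corpus = load_corpus(tmp)

print(corpus)
```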



Text Mining Process
- Step 2: Create the Term-by-Document Matrix (TDM)

[Figure: Sample term-by-document matrix. Rows are Document 1 through Document 6, ...; columns are the terms investment risk, project management, software engineering, development, SAP, ... Each cell holds the frequency of a term in a document; most cells are empty, and the nonzero counts shown range from 1 to 3.]
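
A term-by-document matrix of this kind can be built with a few lines of Python. The sketch below is only an illustration: the three sample documents, the single-word term list, and the function name `term_document_matrix` are assumptions (the slide's terms are multi-word phrases, which would need phrase matching rather than whitespace splitting).

```python
from collections import Counter

def term_document_matrix(docs, terms):
    """Return one row per document holding the count of each tracked term."""
    matrix = []
    for text in docs:
        counts = Counter(text.lower().split())
        # Counter returns 0 for terms that do not occur in the document.
        matrix.append([counts[term] for term in terms])
    return matrix

docs = [
    "software project risk",
    "project management",
    "software software development",
]
terms = ["risk", "project", "software", "development"]

tdm = term_document_matrix(docs, terms)
for row in tdm:
    print(row)
```

Each row is a document and each column a term, matching the orientation of the figure.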



Text Mining Process
- Step 2: Create the Term-by-Document Matrix (TDM)
  - Should all terms be included?
    - Stop words, include words
    - Synonyms, homonyms
    - Stemming
  - What is the best representation of the indices (values in cells)?
    - Raw counts; binary frequencies; log frequencies
    - Inverse document frequency
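
The term-selection questions above can be sketched as a small preprocessing function. Everything here is an assumption made for illustration: the stop-word list is a tiny sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and reduce the remaining words to stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("The mining of mined texts"))
```

Here "mining" and "mined" reduce to the same stem, so they would share one column of the TDM instead of two.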
The Best Representation of the Indices (Values in Cells)
- Binary frequencies: \( f(wf) = 1 \) if \( wf > 0 \); otherwise \( f(wf) = 0 \)
- Log frequencies: \( f(wf) = 1 + \log(wf) \) if \( wf > 0 \); otherwise \( f(wf) = 0 \)
- Inverse document frequencies (idf):

\[
idf(i,j) =
\begin{cases}
0 & \text{if } wf_{i,j} = 0 \\
\left(1 + \log(wf_{i,j})\right) \log\left(N / df_i\right) & \text{if } wf_{i,j} \ge 1
\end{cases}
\]

Where:
- \( i \): the ith word; \( j \): the jth document
- \( N \): total number of documents
- \( wf_{i,j} \): frequency of word i in document j
- \( df_i \): the number of documents that include this word
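
The three weighting schemes translate directly into code. In the sketch below the function names are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base of the logarithm.

```python
import math

def binary_weight(wf):
    """1 if the word occurs in the document at all, otherwise 0."""
    return 1 if wf > 0 else 0

def log_weight(wf):
    """Dampen large raw counts: 1 + log(wf) for wf > 0, otherwise 0."""
    return 1 + math.log(wf) if wf > 0 else 0

def idf_weight(wf, n_docs, df):
    """Log-dampened frequency scaled by log(N / df).

    wf:     frequency of the word in this document
    n_docs: total number of documents (N)
    df:     number of documents that include the word
    """
    if wf == 0:
        return 0
    return (1 + math.log(wf)) * math.log(n_docs / df)

# A word that appears in every document carries no weight.
print(idf_weight(wf=5, n_docs=10, df=10))
# A word confined to a single document is weighted most heavily.
print(idf_weight(wf=5, n_docs=10, df=1))
```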
Text Mining Process
- Step 2: Create the Term-by-Document Matrix (TDM)
  - The TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
    - Manual: a domain expert goes through it
    - Eliminate terms with very few occurrences in very few documents (?)
    - Transform the matrix using singular value decomposition (SVD)
    - SVD is similar to principal component analysis
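
As a hedged illustration of the SVD option, the sketch below uses NumPy's `numpy.linalg.svd` to project each document onto the top k singular directions. The function name `reduce_with_svd`, the sample matrix, and the choice k = 2 are assumptions; rows are taken to be documents and columns terms.

```python
import numpy as np

def reduce_with_svd(tdm, k):
    """Represent each document (row) by its coordinates on the top-k singular directions."""
    matrix = np.asarray(tdm, dtype=float)
    # Thin SVD: matrix = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k left singular vectors by their singular values.
    return u[:, :k] * s[:k]

tdm = [
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 1],
]

reduced = reduce_with_svd(tdm, k=2)
print(reduced.shape)  # 3 documents, now described by 2 values instead of 4
```

How many dimensions to keep is a judgment call; it is usually guided by how quickly the singular values fall off.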



Text Mining Process
- Step 3: Extract patterns/knowledge
  - Classification (text categorization)
  - Clustering (natural groupings of text)
    - Improve search recall
    - Improve search precision
    - Scatter/gather
    - Query-specific clustering
  - Association
  - Trend analysis (…)
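
Many classification and clustering methods for text rest on a similarity measure between document vectors. The sketch below uses cosine similarity, which is one common choice and not necessarily the one the chapter has in mind. The function names and sample rows are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query, rows):
    """Index of the document row closest to the query vector."""
    return max(range(len(rows)), key=lambda i: cosine_similarity(query, rows[i]))

# Rows of a small term-by-document matrix (documents x terms).
rows = [
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 1],
]
query = [0, 0, 1, 1]

print(most_similar(query, rows))
```

Grouping each document with its nearest neighbors in this way is the basic step behind query-specific clustering and nearest-neighbor text categorization.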



Text Mining Tools
- Commercial Software Tools
  - IBM SPSS Modeler - Text Miner
  - SAS Enterprise Miner - Text Miner
  - Statistica Data Miner - Text Miner
  - ClearForest, …
- Free Software Tools
  - RapidMiner
  - GATE
  - Spy-EM, …
