Chapter 3:
Text Analytics, Text Mining
Learning Objectives
• Describe text mining and understand the need for text mining
• Differentiate between text mining, Web mining, and data mining
• Understand the different application areas for text mining
• Know the process of carrying out a text mining project
• Understand the different methods to introduce structure to text-based data
7-2 © Pearson Education Limited 2014
Text Mining Concepts
• 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
• Unstructured corporate data is doubling in size every 18 months
• Tapping into these information sources is not an option, but a need to stay competitive
• Answer: text mining
  ◦ A semi-automated process of extracting knowledge from unstructured data sources
[Figure: Text mining and its related fields. Labels: Information Retrieval, Web Mining, Information Extraction, Data Mining]
• Dream of AI community
  ◦ To have algorithms that are capable of automatically reading and obtaining knowledge from text
Natural Language Processing (NLP)
• WordNet
  ◦ A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets.
  ◦ A major resource for NLP.
  ◦ Needs automation to be completed.
• Sentiment Analysis
  ◦ A technique used to detect favorable and unfavorable opinions toward specific products and services
  ◦ SentiWordNet
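As an illustration of the sentiment analysis idea above, here is a minimal lexicon-based sketch in Python. The word lists `POSITIVE` and `NEGATIVE` and the function `polarity` are hypothetical and tiny; they stand in for a real lexical resource such as SentiWordNet and do not reproduce its scoring.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real system would draw
# on a resource such as SentiWordNet rather than hand-picked words.
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}

def polarity(text: str) -> str:
    """Label text as favorable, unfavorable, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"

print(polarity("Great phone, I love the battery"))
print(polarity("Terrible support and a broken screen"))
```

Simple word counting like this ignores negation ("not good") and context, which is one reason richer NLP resources are needed in practice.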
[Figure: Context diagram of the text mining process. Labels: Domain expertise, Tools and techniques, Feedback]
• Inputs: a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
• Output of Task 1: a collection of documents in some digitized format for computer processing
• Output of Task 2: a flat file called the term-document matrix, where the cells are populated with the term frequencies
• Output of Task 3: a number of problem-specific classification, association, and clustering models and visualizations
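[Table: excerpt of a term-document matrix; term column headings not recoverable, nonzero term frequencies listed per document]
Document 2: 1
Document 3: 3, 1
Document 4: 1
Document 5: 2, 1
Document 6: 1, 1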
...
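The term-document matrix described above (documents as rows, terms as columns, term frequencies in the cells) can be sketched in a few lines of Python. The three-document `corpus`, the `tokenize` helper, and the `term_document_matrix` function are assumptions made up for this example; they are not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical corpus standing in for the output of Task 1
# (a collection of digitized documents).
corpus = {
    "Document 1": "Investment risk rises with project risk",
    "Document 2": "Project management for software development",
    "Document 3": "Software development and software testing",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_matrix(docs):
    """Return (sorted terms, {document name: list of term frequencies})."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    terms = sorted({term for c in counts.values() for term in c})
    # A Counter returns 0 for terms a document does not contain.
    matrix = {name: [c[term] for term in terms] for name, c in counts.items()}
    return terms, matrix

terms, matrix = term_document_matrix(corpus)
print(terms)
for name, row in matrix.items():
    print(name, row)
```

This sketch keeps every token as a term; a real project would typically also remove stop words and reduce the matrix before moving on to Task 3.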