Seven Text Mining Techniques

What are Text Mining Techniques?
• Text mining comprises the activities that derive information
from unstructured text data. Text mining techniques are the
processes that carry out this mining and discover insights
from the data. Each technique relies on various text mining
tools and applications for its execution.
Information Extraction (IE)

• Information extraction (IE) is the technique used to extract valuable
information from a massive amount of data. IE is the first step that
lets a system decipher unstructured text by discovering key phrases and
relationships within it. It involves tasks such as tokenization,
sentence segmentation, named-entity identification, and part-of-speech
assignment.

• IE systems are used to pull specific information, attributes, and
entities out of a document and to recognise the relationships between
them. The extracted data is then stored in associated databases for
additional processing. Precision and recall are used to evaluate how
pertinent the extracted results are.
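
The precision and recall check mentioned above can be sketched as follows. This is a minimal illustration, not a full IE evaluation: the function name `precision_recall` and the sample entity sets are hypothetical, and it assumes extracted items are compared to a hand-labelled reference set by exact match.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items.

    precision = correct extractions / all extractions
    recall    = correct extractions / all items that should have been found
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: entities an IE system returned vs. a labelled reference.
extracted = {"Alice", "Paris", "Monday"}
relevant = {"Alice", "Paris", "Acme Corp", "Bob"}
precision, recall = precision_recall(extracted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the three extracted items are correct, and two of the four reference items were found, so precision is higher than recall in this example.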
Information Retrieval (IR)

• Information retrieval (IR) is the process of finding pertinent
information and associated patterns for a given set of words or
phrases. IR systems deploy different algorithms to track user
behaviour and to discover relevant data and information accordingly.

• For example, the Google search engine relies on information retrieval
to find web documents that are relevant to the phrases a user enters.
Search engines implement query-based algorithms to track trends and
return more closely related results, which gives users more relevant
and accurate information for their search needs.
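
A query-based lookup of this kind can be sketched with a small inverted index. This is a toy illustration only, and it is not how any production search engine works: the documents, the helper names (`tokenize`, `build_index`, `search`), and the ranking rule (count of distinct query terms matched) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document collection.
docs = {
    "d1": "Text mining derives information from unstructured text.",
    "d2": "Search engines retrieve relevant documents for a query.",
    "d3": "Clustering groups similar documents without labels.",
}
index = build_index(docs)
print(search(index, "relevant documents"))
```

Here `d2` matches both query terms and `d3` matches one, so `d2` is ranked first and `d1` is not returned at all.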
Natural Language Processing

• Natural language processing (NLP) deals with the automatic
processing and analysis of unstructured textual information,
and allows computers to read text by analyzing sentence
structure and grammar. It performs various types of analysis,
such as named-entity recognition (NER), summarization, and
sentiment analysis. Some of these tasks are described below.
• Summarization: Produces a concise, intelligible synopsis of the
substantial points of a large body of text.
• Part-of-speech (PoS) tagging: Assigns a tag to each word/token in a
document according to its part of speech (noun, verb, adjective,
etc.). PoS tagging enables semantic analysis of unstructured text.
• Text categorization: Analyzes text documents and classifies them into
predefined topics or categories. It is useful when categorizing
synonyms and abbreviations, and is also known as text classification.


• Sentiment analysis: Determines positive or negative
sentiment from internal or external data sources, and lets
users trace changes in customer behaviour over a specific
time period. It is used to obtain relevant information about
perceptions of brands, products, and services, and so helps
organizations connect with customers to improve processes,
user experience, and satisfaction.
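
The positive/negative decision in sentiment analysis can be illustrated with a very small lexicon-based scorer. This is a sketch under strong assumptions: the word lists `POSITIVE` and `NEGATIVE` are hypothetical and tiny, and the approach ignores negation, sarcasm, and context, which practical systems must handle.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment(text):
    """Label text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "Great product, I love it",
    "Terrible support and slow delivery",
    "It arrived on Monday",
]:
    print(review, "->", sentiment(review))
```

Applying the same scorer to reviews grouped by date is one simple way to trace how sentiment changes over a time period.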
Clustering

• Clustering is an unsupervised process that groups text documents by
applying various clustering algorithms. Similar terms or patterns are
extracted from several documents and organized into groups, and the
clustering can proceed in a top-down or a bottom-up manner.

• As a result, distinct partitions, called clusters, are generated, and
each cluster contains a number of documents. The better the
clustering, the more similar the documents within a single cluster are
to one another and the more dissimilar they are to documents in other
clusters.

• A basic clustering algorithm keeps track of the topics of
each document and assigns weights that measure how well
the document fits into each cluster.

• The quality of a clustering result depends on the similarity
measure the clustering method applies to the text content
and on how the method is implemented. A good clustering
method produces high-quality clusters, with high
intra-cluster similarity and low inter-cluster similarity.

• Clustering differs from categorization in that text
contents are grouped without prior knowledge of the
classes. The main advantage of clustering is that
text content can be relevant to multiple classes.

• Clustering techniques used for analyzing unstructured
text documents include hierarchical, distribution,
density, centroid, and k-means clustering.
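
The intra-cluster and inter-cluster similarity criterion described above can be made concrete with a small sketch. It assumes a bag-of-words representation and cosine similarity, which the slides do not specify; the function names and the sample documents are hypothetical, and the clusters are given by hand rather than produced by a clustering algorithm.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Represent a document as a term -> count mapping."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cluster_quality(clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    vectors = [[bag_of_words(doc) for doc in cluster] for cluster in clusters]
    intra = mean(cosine(a, b) for c in vectors for a, b in combinations(c, 2))
    inter = mean(
        cosine(a, b)
        for c1, c2 in combinations(vectors, 2)
        for a in c1
        for b in c2
    )
    return intra, inter


# Hypothetical clustering of four short documents into two groups.
clusters = [
    ["the cat chased the mouse", "the cat sat near the mouse"],
    ["stock prices rose sharply", "stock prices fell sharply"],
]
intra, inter = cluster_quality(clusters)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A good clustering should show a clearly higher intra-cluster value than inter-cluster value, as it does for this hand-built example where the two groups share no words.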
Categorization

• Under the categorization method, one or more categories are assigned
to free-format text documents. Because it relies on input-output
examples to classify new documents, categorization is considered a
supervised learning method. Predefined classes are assigned to each
text document based on its content.

• The process of text categorization involves steps such as
pre-processing, indexing, dimensionality reduction, and
classification. The objective is to train classifiers on known
examples so that unknown examples can then be categorized
automatically. Text categorization also faces the difficulty of a
high-dimensional feature space.
• Useful classification models for categorizing text
include the naive Bayes classifier, the nearest-neighbor
classifier, decision trees, and support vector machines.
Applications of categorization include document
organization, spam filtering, SMS categorization, and
hierarchical categorization of web pages.
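
As an illustration of training on known examples and then categorizing new text, here is a minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing, applied to a toy spam-filtering task. The training sentences, labels, and function names are hypothetical, and the indexing and dimensionality-reduction steps mentioned above are omitted for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus smoothed log likelihood of each known word.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled examples.
model = train([
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report attached", "ham"),
])
print(classify(model, "free prize offer"))
print(classify(model, "monday project meeting"))
```

Even in this tiny example the vocabulary already has fifteen distinct words, which hints at why real text collections lead to a high-dimensional feature space.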

Visualization

• Visualization methods can improve and clarify the analysis of
relevant information. To outline individual documents or clusters of
documents, text flags are used to show document categories, and
colors are used to show document density.

• In this method, large textual sources are placed in a visual hierarchy
so that a user can interact with the documents by drilling down and
scaling. For example, governments use information visualization to
detect terrorist networks and to identify crime-related information.

• The visualization process has three steps:

• Data preparation: Determine and obtain the original data
to be visualized and create the original data space.
• Data analysis and extraction: Evaluate the original data
and extract from it the data required for visualization,
forming the visualization data space.
• Visualization mapping: Apply mapping algorithms to map
the visualization data space onto the visualization target.

Text Summarization

• Text summarization aims to reduce the length, detail, and
complexity of a document while keeping its significant points
and actual meaning. It helps a user decide whether a lengthy
document meets their requirements and whether it is worth
reading further, so a summary can stand in for a group of
documents.

• In the time a user takes to read the first paragraph, text
summarization software can process and summarize a large text
document. Summarization can be classified into two types:

• Abstractive summarization: Builds a clear understanding
of the key concepts in the text and expresses those
concepts in natural language. It employs linguistic
methods to understand, transform, and restate the text
in a concise form.
• Extractive summarization: Selects the major text
segments, relying on statistical analysis of text
features such as word/phrase frequency, position, or
suggested (cue) words to decide which sentences to
extract.
• In particular, text summarization is a three-step process:

• Pre-processing: Builds a structured representation of
the original text. Tokenization, stop-word removal, and
stemming are some of the methods applied here.
• Processing: Algorithms are applied to convert the text
structure into a summary structure.
• Development: The final summary is retrieved from the
summary structure.
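
The three steps above can be sketched for the extractive case using word frequency as the only feature. This is a minimal sketch under stated assumptions: the stop-word list is a small hypothetical one, stemming is left out, sentences are split with a simple regular expression, and each sentence is scored by the average corpus frequency of its content words.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "for", "on", "that", "this", "from"}


def split_sentences(text):
    """Pre-processing: split text into sentences at ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Pre-processing: tokenize and remove stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    sentences = split_sentences(text)
    # Processing: score sentences by average frequency of their content words.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    # Development: emit the chosen sentences in their original order.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = ("Text mining extracts insights from text. "
        "The weather was pleasant yesterday. "
        "Text mining tools process unstructured text.")
print(summarize(text, max_sentences=1))
```

The off-topic weather sentence scores lowest because none of its words recur elsewhere, so it is the first to be left out of the summary.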
