
UNIT II: TEXT EXTRACTION

Pre-processing Techniques – Clustering – Probabilistic Models – Browsing and Query Refinement on Presentation Layer – Link Analysis – Visualization Approaches and its Operations.

2.1 TEXT PRE-PROCESSING TECHNIQUES

 Text preprocessing is an essential step in natural language processing (NLP) that involves
cleaning and transforming unstructured text data to prepare it for analysis.
 Natural language processing (NLP) can be thought of as an intersection of Linguistics,
Computer Science and Artificial Intelligence that helps computers understand, interpret and
manipulate human language.
 Pre-processing includes the following techniques:

1. Lower Casing
2. Tokenization
3. Punctuation Mark Removal
4. Stop Word Removal
5. Stemming
6. Lemmatization

1. Text Pre-processing Using Lower Casing

 The first step is to convert our text data into lower case. But why is this step needed?
 When we have a text input, such as a paragraph, we find words in both lower and upper case. However, the same words written in different cases are treated as different entities by the computer.
 For example: 'Girl' and 'girl' are considered as two separate words by the computer even though they mean the same thing.

 In order to resolve this issue, we must convert all the words to lower case. This provides
uniformity in the text.
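A minimal sketch in Python of lower casing; the sample sentence is illustrative:

text = "The Girl saw another Girl in the park."
lowered = text.lower()    # every character converted to lower case
print(lowered)            # "the girl saw another girl in the park."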

2. Understand Tokenization in Text Pre-processing

 The next text preprocessing step is Tokenization.


 Tokenization is the process of breaking up the paragraph into smaller units such as sentences
or words.
 Each unit is then considered as an individual token.
 The fundamental principle of Tokenization is to try to understand the meaning of the text by
analyzing the smaller units or tokens that constitute the paragraph.
 To do this, we shall use the NLTK library. NLTK is the Natural Language Toolkit library in
python that is used for Text Preprocessing.
Sentence Tokenize
 Now we shall take a paragraph as input and tokenize it into its constituting sentences.
 The result is a list stored in the variable 'sentences'.
 It contains each sentence of the paragraph.
 The length of the list gives us the total number of sentences.
Word Tokenize
 Similarly, we can also tokenize the paragraph into words.
 The result is a list called ‘words’, containing each word of the paragraph.
 The length of the list gives us the total number of words present in our paragraph.
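A minimal sketch of sentence and word tokenization with NLTK, assuming the 'punkt' tokenizer data has been downloaded (nltk.download('punkt')); the paragraph is illustrative:

import nltk

paragraph = "Text mining is fun. It turns raw text into insight."

sentences = nltk.sent_tokenize(paragraph)   # sentence tokenization
words = nltk.word_tokenize(paragraph)       # word tokenization

print(len(sentences), sentences)            # 2 sentences
print(len(words), words)                    # word and punctuation tokens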

3) Punctuation Mark Removal


 This brings us to the next step.
 We must now remove the punctuation marks from our list of words.

 We can remove all the punctuation marks from our list of words by keeping only the alphanumeric elements.
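A minimal sketch, continuing from the word tokens above (the token list is illustrative):

words = ['Text', 'mining', 'is', 'fun', '.', 'It', 'turns', 'raw', 'text', 'into', 'insight', '.']
words = [w for w in words if w.isalnum()]   # keep only alphanumeric tokens
print(words)                                # punctuation marks removed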

4) Stop Word Removal

 Stop words are commonly used words in any language, not just English. They occur frequently but do not add much meaning to the sentences.
 These are common words that are part of the grammar of any language.
 Every language has its own set of stop words.
 Example: In the context of a search engine, if your search query is "how to develop information retrieval applications", the query is first split into individual words using whitespace, and the search engine tries to find web pages that contain the terms "how", "to", "develop", "information", "retrieval", "applications".
 The terms "how" and "to" are stop words: they occur on almost every page and add little meaning, so they are removed before matching.
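A minimal sketch of stop word removal with NLTK's English stop word list, assuming nltk.download('stopwords') has been run; the query terms are the example above:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
query = ['how', 'to', 'develop', 'information', 'retrieval', 'applications']
filtered = [w for w in query if w not in stop_words]
print(filtered)   # ['develop', 'information', 'retrieval', 'applications']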

5) Stemming

 As the name suggests, Stemming is the process of reduction of a word into its root or stem
word.
The word affixes are removed, leaving behind only the root form or stem.
 For example: The words “connecting”, “connect”, “connection”, “connects” are all reduced
to the root form “connect”. The words “studying”, “studies”, “study” are all reduced to
“studi”.
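A minimal sketch of stemming with NLTK's PorterStemmer; the word list is the example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connecting", "connect", "connection", "connects", "studying", "studies", "study"]:
    print(word, "->", stemmer.stem(word))   # "connect..." -> "connect", "study..." -> "studi"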
6) Lemmatization
 Lemmatization is the process of grouping together different inflected forms of the same word.
It's used in computational linguistics, natural language processing (NLP) and chatbots.
Lemmatization links similar meaning words as one word, making tools such as chatbots and
search engine queries more effective and accurate.
 Example
 Lemmatization takes a word and reduces it to its lemma. For example, the verb "walk" might appear as "walking," "walks" or "walked." Inflectional endings such as "s," "ed" and "ing" are removed, and lemmatization groups all these forms under the lemma "walk."

2.2 TEXT CLUSTERING

 The text clustering process involves several stages of data preprocessing, feature extraction,
and similarity measurement.
 The goal of these stages is to transform raw text data into numerical representations that can
be processed and analyzed by machine learning algorithms.

Fig. Text Clustering

 From social media analytics to risk management and cybercrime protection, dealing with
textual data has never been more important.
 Text clustering is the task of grouping a set of unlabelled texts in such a way that texts in the
same cluster are more similar to each other than to those in other clusters.
 Text clustering algorithms process text and determine if natural clusters (groups) exist in the
data.
 The big idea is that documents can be represented numerically as vectors of features.
 The similarity in text can be compared by measuring the distance between these feature
vectors.
 Objects that are near each other should belong to the same cluster.
 Objects that are far from each other should belong to different clusters.
 We also need a criterion function that tells us when we have the best possible clusters and can stop further processing.
 A greedy algorithm is then used to optimize the criterion function.

Features
 Document Retrieval: To improve recall, start by adding other documents from the same
cluster.
 Taxonomy Generation: Automatically generate hierarchical taxonomies for browsing
content.
 Fake News Identification: Detect whether a news item is genuine or fake.
 Language Translation: Translation of a sentence from one language to another.
 Spam Mail Filtering: Detect unsolicited and unwanted email/messages.
 Customer Support Issue Analysis: Identify commonly reported support issues.
 Clustering is an unsupervised learning approach.
 Classification is a supervised learning approach that maps an input to an output based on
example input-output pairs.

Types of clustering

 Hard Clustering: This groups items such that each item is assigned to only one cluster. For example,
we want to know if a tweet is expressing a positive or negative sentiment. k-means is a hard
clustering algorithm.
 Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items
such that an item can belong to multiple clusters. Fuzzy C Means (FCM) is a soft clustering
algorithm.

STEPS

 Text pre-processing: Text can be noisy, with information hidden behind stop words, inflections and sparse representations. Pre-processing makes the dataset easier to work with.
 Feature Extraction: One of the commonly used techniques to extract the features from textual data
is calculating the frequency of words/tokens in the document/corpus.
 Clustering: We can then cluster different text documents based on the features we have generated.
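A minimal sketch of feature extraction and clustering with scikit-learn; the documents and the choice of k = 2 are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the dog barks at the cat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)       # feature extraction
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)  # clustering
print(labels)   # cluster id per document, e.g. [0 0 1 1]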

What are the steps involved in text pre-processing?


Below are the main components involved in pre-processing.
 Tokenization: Tokenization is the process of parsing text data into smaller units (tokens) such as
words and phrases.
 Transformation: It converts the text to lowercase, removes all diacritics/accents in the text, and
parses html tags.
 Normalization: Text normalization is the process of transforming a text into a canonical (root) form.
Stemming and lemmatization techniques are used for deriving the root word.
 Filtering: Stop words are common words used in a language, such as 'the', 'a', 'on', 'is', or 'all'. These
words do not carry important meaning for text clustering and are usually removed from texts.
What are the levels of text clustering?
Text clustering can be document level, sentence level or word level.
 Document level: It serves to regroup documents about the same topic. Document clustering has
applications in news articles, emails, search engines, etc.
 Sentence level: It's used to cluster sentences derived from different documents. Tweet analysis is an
example.
 Word level: Word clusters are groups of words based on a common theme. The easiest way to build a
cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database
for the English language that groups English words into sets of synonyms called synsets.
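A minimal sketch of word-level clusters via WordNet synsets in NLTK, assuming nltk.download('wordnet') has been run; the word is illustrative:

from nltk.corpus import wordnet as wn

for synset in wn.synsets("walk"):
    print(synset.name(), "->", synset.lemma_names())
# each synset is a small cluster of synonymous words, e.g. ['walk', 'walking'] for one noun sense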

 Words can be similar lexically or semantically:
o Lexical similarity: Words are similar lexically if they have a similar character sequence.
Lexical similarity can be measured using string-based algorithms that operate on string
sequences and character composition.
o Semantic similarity: Words are similar semantically if they have the same meaning, are
opposite of each other, used in the same way, used in the same context or one is a type of
another.
o Semantic similarity can be measured using corpus-based or knowledge-based algorithms.
o Some of the metrics for computing similarity between two pieces of text are Jaccard
coefficient, cosine similarity and Euclidean distance.
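A minimal sketch of the three measures on bag-of-words representations; the two sample sentences are illustrative:

import math

a = "the boy walks to school".split()
b = "the boy runs to the park".split()

set_a, set_b = set(a), set(b)
jaccard = len(set_a & set_b) / len(set_a | set_b)   # Jaccard coefficient on word sets

vocab = sorted(set_a | set_b)
va = [a.count(w) for w in vocab]                    # word-count vectors
vb = [b.count(w) for w in vocab]
dot = sum(x * y for x, y in zip(va, vb))
cosine = dot / (math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb)))
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

print(jaccard, cosine, euclidean)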

What are the text clustering algorithms?
 Hierarchical: In the divisive approach, we start with one cluster and split that into sub-clusters.
Example algorithms include DIANA and MONA. In the agglomerative approach, each document
starts as its own cluster and then we merge similar ones into bigger clusters. Examples include
BIRCH and CURE.
 Partitioning: k-means is a popular algorithm but requires the right choice of k. Other examples are
ISODATA and PAM.
 Density: Instead of using a distance measure, we form clusters based on how many data points fall
within a given radius. DBSCAN is the most well-known algorithm.
 Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This
addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
 Probabilistic: A cluster of words belongs to a topic and the task is to identify these topics. Words also
have probabilities that they belong to a topic.

2.3 PROBABILISTIC MODELS

 Several common themes frequently recur in many tasks related to processing and analyzing
complex phenomena, including natural language texts.
 Among these themes are classification schemes, clustering, probabilistic models, and rule-based
systems.
 Probabilistic models often show better accuracy and robustness against noise than categorical models.
 The ultimate reason for this is not quite clear and is an excellent subject for a philosophical
debate.
 Nevertheless, several probabilistic models have turned out to be especially useful for the different
tasks in extracting meaning from natural language texts.
 Most prominent among these probabilistic approaches are hidden Markov models (HMMs), stochastic context-free grammars (SCFGs), and maximum entropy Markov models (MEMMs).

1) HIDDEN MARKOV MODELS[HMM]

 An HMM is a finite-state automaton with stochastic state transitions and symbol emissions.
 The automaton models a probabilistic generative process.
 In this process, a sequence of symbols is produced by starting in an initial state, emitting a symbol
selected by the state, making a transition to a new state, emitting a symbol selected by the state,
and repeating this transition–emission cycle until a designated final state is reached.
 Formally, let O = {o1,...oM} be the finite set of observation symbols and Q = {q1,... qN} be the
finite set of states.
 A first-order Markov model λ is a triple (π, A, B),
 where π : Q → [0, 1] defines the starting probabilities,
 A : Q × Q → [0, 1] defines the transition probabilities,
 and B : Q × O → [0, 1] denotes the emission probabilities.
 Because the functions π, A, and B define true probabilities, they must satisfy
Σ_{q∈Q} π(q) = 1,  Σ_{q'∈Q} A(q, q') = 1 for every q ∈ Q,  and  Σ_{o∈O} B(q, o) = 1 for every q ∈ Q.
 A model λ together with the random process described above induces a probability distribution over the set O* of all possible observation sequences.
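A minimal sketch of an HMM λ = (π, A, B) and the forward algorithm for computing the probability of an observation sequence; the two-state model and its probabilities are illustrative assumptions, not taken from the text:

import numpy as np

pi = np.array([0.6, 0.4])                 # starting probabilities pi(q)
A = np.array([[0.7, 0.3],                 # transition probabilities A(q, q')
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission probabilities B(q, o)
              [0.1, 0.3, 0.6]])

def sequence_probability(obs):
    # Forward algorithm: sums over all state paths that emit the sequence.
    alpha = pi * B[:, obs[0]]             # start in a state and emit the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # transition, then emit the next symbol
    return alpha.sum()

print(sequence_probability([0, 1, 2]))    # P(o1 o2 o3 | lambda)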

2) Stochastic context-free grammar [SCFG]

 An SCFG is a context-free grammar in which each production is augmented with a probability.

 The probability of a derivation (parse) is then the product of the probabilities of the
productions used in that derivation;

 Thus some derivations are more consistent with the stochastic grammar than others.

Example

Language of the grammar: L = { "a boy runs", "a boy walks", "the boy runs", "the boy walks", "a dog runs", "a dog walks", "the dog runs", "the dog walks" }

(The grammar notation, example derivations, and the derived language were shown in figures that are not reproduced here.)
A context-free grammar is a generative model G = (V, α, S, R), where:

 V is a nonterminal alphabet (e.g., {A, B, C, D, E, ...}),
 α is a terminal alphabet (e.g., {a, c, g, t}),
 S ∈ V is a special start symbol, and
 R is a set of rewriting rules called productions.
 Productions in R are rules of the form X → λ, where X ∈ V and λ ∈ (V ∪ α)*.
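A minimal sketch of a stochastic CFG for the small grammar above using NLTK; the production probabilities are illustrative assumptions:

from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S -> NP VP     [1.0]
    NP -> Det N    [1.0]
    VP -> V        [1.0]
    Det -> 'a'     [0.5]
    Det -> 'the'   [0.5]
    N -> 'boy'     [0.6]
    N -> 'dog'     [0.4]
    V -> 'runs'    [0.7]
    V -> 'walks'   [0.3]
""")

parser = ViterbiParser(grammar)
for tree in parser.parse("the boy runs".split()):
    print(tree)          # the most probable parse tree
    print(tree.prob())   # product of production probabilities: 0.5 * 0.6 * 0.7 = 0.21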

3) MAXIMUM ENTROPY MARKOV MODELS [MEMM]

• An MEMM is a discriminative model that extends a standard maximum entropy classifier by


assuming that the unknown values to be learnt are connected in a Markov chain rather than
being conditionally independent of each other.

• Assume we have a random sample with a training set of n examples x_1 to x_n. These input values are assumed to be independent, so the probability of the data under the model f(x; w) is the product of the probabilities of the individual inputs:

L(w) = f(x_1; w) · f(x_2; w) · ... · f(x_n; w) = ∏_{i=1}^{n} f(x_i; w)

• The likelihood function considers the training data as fixed but varies the parameter values w. The principle of maximum likelihood says that we want to find the parameter values w such that the model assigns the training data x the maximum probability.

• The variable w is a vector of weights, or the embedding vector. The goal is to find the weight parameters ŵ that maximize the likelihood:

ŵ = argmax_w ∏_{i=1}^{n} f(x_i; w)
 Similar to maximum likelihood, the maximum conditional likelihood principle says that we choose a parameter estimate ŵ that maximizes the product of the conditional probabilities f(y_i | x_i; w).
 In this conditional probability, we do not need to assume that the x_i are independent. We only need to assume that the y_i are independent conditionally on the x_i. For a given training set, we estimate ŵ as:

ŵ = argmax_w ∏_{i=1}^{n} f(y_i | x_i; w)

Shortcoming of MEMM
 Although it seems like this algorithm solves all the issues from previous ones, it has a
shortcoming. It is called the label bias problem.

 The transitions from a given state compete only against each other (the transition probabilities out of each state sum to 1).

 This creates a bias in which states with fewer outgoing arcs are preferred.

 For example, suppose state 1 has transitions only to states 1 and 2, whereas state 2 has transitions to all other states.

 In this case, the transition to state 1 is always preferred.

2.4 Browsing and Query Refinement on Presentation Layer

1 BROWSING

 Browsing is a term open to broad interpretation.


 With respect to text mining systems, however, it usually refers to the general front-end
framework through which an end user searches, queries, displays, and interacts with embedded
or middle-tier knowledge-discovery algorithms.
 The software that implements this framework is called a browser.
 Beyond their ability to allow a user to
(a) manipulate the various knowledge discovery algorithms they may operate and
(b) explore the resulting patterns, most browsers also generally support functionality to link to
some portion of the full text of documents underlying the patterns that these knowledge
discovery algorithms may return.
 Usually, browsers in text mining operate as a user interface to specialized query languages
that allow parameterized operation of different pattern search algorithms, though this
functionality is now almost always commanded through a graphical user interface (GUI) in
real-world text mining applications.

 This means that, in practice, many discovery operations are "kicked off" by a query for a particular type of pattern issued through a browser interface, which passes the query arguments to a search algorithm for execution.
 Answers are returned via a large number of possible display modalities in the GUI, ranging
from simple lists and tables to navigable nodal trees to complex graphs generated by
extremely sophisticated data visualization tools.
 Figure 2.4.1 shows a simple distribution browser that allows a user to search for specific distributions while looking at a concept hierarchy that provides some order and context.
Fig. 2.4.1 Example of an interactive browser

1.1 Displaying and Browsing Distributions

 Traditional document retrieval systems allow a user to ask for all documents containing
certain concepts – UK and USA, for example –
 but then present the entire set of matching documents with little information about the
collection’s internal structure other than perhaps sorting them by relevance score or
chronological order.
 In contrast, browsing distributions in a text mining system can enable a user to investigate the
contents of a document set by sorting it according to the child distribution of any node in a
concept hierarchy such as topics, countries, companies, and so on.
 Once the documents are analyzed in this fashion and the distribution is displayed, a user
could, for instance, access the specific documents of each subgroup (see Figure 2.4.2)

1.2 Displaying and Exploring Associations

 Even when data from a document collection are moderately sized, association-finding
methods will often generate substantial numbers of results.
 Therefore, association-discovery tools in text mining must assist a user in identifying the useful results out of all those the system generates.
 One method for doing this is to support association browsing by clustering associations with
identical left-hand sides (LHSs).
 Then, these clusters can be displayed in decreasing order of the generality of their LHS.
 Associations that have more general LHSs will be listed before more specific associations.
 The top-level nodes in the hierarchical tree are sorted in decreasing order of the number of
documents that support all associations in which they appear.
 Some text mining systems include fully featured, association-specific browsing tools (see
Figures 2.4.3 and 2.4.4) geared toward providing users with an easy way for finding
associations and then filtering and sorting them in different orders.
 This type of browser tool can support the specification of simple constraints on the presented
associations.
 The user can select a set of concepts from the set of all possible concepts appearing in the
associations and then choose the logical test to be performed on the associations

1.3 Navigation and Exploration by Means of Concept Hierarchies

 Concept hierarchies and taxonomies can play many different roles in text mining systems.
 However, it is important not to overlook the usefulness of various hierarchical representations
in navigation and user exploration.
 Often, it is visually easier to traverse a comprehensive tree-structure of nodes relating to all
the concepts relevant to an entire document collection or an individual pattern query result set
than to scroll down a long, alphabetically sorted list of concept labels.
 Indeed, sometimes the knowledge inherent in the hierarchical structuring of concepts can
serve as an aid to the interactive or free-form exploration of concept relationships, or both – a
critical adjunct to uncovering hidden but interesting knowledge.
 A concept hierarchy or taxonomy can also enable the user of a text mining system to specify
mining tasks concisely.

2. Query Refinement on Presentation Layer

Query refinement on the presentation layer refers to the process of improving and enhancing
user queries or search inputs within a user interface to provide more accurate and relevant results.

It involves manipulating the user's search query to better match the intended meaning or to
account for possible errors, ambiguity, or incomplete information. This process is typically employed
in search engines, databases, and information retrieval systems to deliver better search results and
improve the overall user experience.

Here are some common techniques used for query refinement on the presentation layer:

1. Autocomplete or Type-ahead: As the user begins typing their query, the system offers
suggestions in real-time, based on popular or previously entered queries. This helps users
quickly find relevant terms and reduces the chance of spelling errors.

2. Spell Correction: Automatic spell checking and correction can help users when they mistype a word or misspell it. The system suggests the correct spelling or automatically corrects the query to improve the accuracy of the search results (a minimal sketch follows this list).

3. Synonym Expansion: The system expands the user query to include synonyms or related
terms, increasing the chances of finding relevant results even if the original query terms are
not explicitly present in the data.

4. Query Suggestions: Along with the search results, the system may also offer related or
alternative queries to help users refine their search and explore different options.

5. Faceted Search: This technique allows users to filter search results using various attributes or
categories (facets) related to the data. It enables users to narrow down results by applying
multiple filters to the original query.

6. Natural Language Processing (NLP): Utilizing NLP techniques, the system can understand
the intent behind the user's query and reformulate it to obtain more accurate and contextually
relevant results.

7. Query Logs Analysis: Analyzing user interactions and query logs can provide insights into
common query refinements made by users. This information can be used to enhance the
system's query refinement capabilities.

8. User Feedback and Relevance Feedback: Incorporating user feedback on the search results
allows the system to learn from user interactions and improve query refinement over time.
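A minimal sketch of simple spell correction against a small vocabulary using Python's standard difflib module; the vocabulary and the misspelled query are illustrative:

import difflib

vocabulary = ["information", "retrieval", "develop", "applications", "clustering"]

def correct(term):
    # pick the closest vocabulary word if it is similar enough, else keep the term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term

query = ["develp", "informaton", "retreival", "applications"]
print([correct(t) for t in query])
# ['develop', 'information', 'retrieval', 'applications']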

Overall, query refinement on the presentation layer is a valuable aspect of search and information
retrieval systems, as it empowers users to find the information they need more efficiently and
effectively.

It also helps to bridge the gap between the user's intent and the available data, resulting in an
enhanced user experience.

2.5 Link Analysis

 Text link analysis is also known as link analysis or graph analysis.

 It is a method used to analyze the relationships and connections between pieces of textual data,
such as documents, web pages, or entities.

 It involves representing the textual elements as nodes and the links or relationships between them
as edges in a graph.

 This approach allows for the exploration and extraction of valuable insights from the patterns of
connections within the data.

 In text link analysis, the graph structure can be constructed based on various types of relationships
between textual elements, such as:

1. Co-occurrence: Nodes (documents, words, or entities) are connected if they co-occur in the
same context or document. For example, if two words frequently appear together in a set of
documents, a link is established between them.

2. Citation or Reference: In academic documents, research papers, or articles, a citation


relationship is formed when one document references or cites another. These citation links can
be used to evaluate the importance or influence of specific documents.

3. Mention or N-grams: Links can be created between documents or phrases that mention or
contain specific entities or n-grams (sequences of n words). This helps identify connections
between related topics.

4. Similarity: Nodes are linked based on their semantic similarity. For instance, documents or
sentences that share similar meanings might be connected.

5. Named Entity Relationships: In natural language processing, named entities (such as people,
organizations, or locations) can be linked based on their interactions or associations in the
text.

 There are several different measures of centrality, each capturing a different aspect of a node's
importance within the network. Some of the common centrality measures include:

1. Degree Centrality: Degree centrality is the simplest centrality measure and is based on the
number of connections (edges) that a node has. Nodes with high degree centrality are well-
connected to many other nodes in the network. In a social network, high-degree nodes would
represent highly popular or sociable individuals.
2. Betweenness Centrality: Betweenness centrality quantifies the extent to which a node acts as
a bridge or intermediary between other nodes in the network. Nodes with high betweenness
centrality lie on many shortest paths between pairs of other nodes, facilitating the flow of
information or influence. In a social network, such nodes might play a crucial role in
connecting different groups of individuals.
3. Closeness Centrality: Closeness centrality measures how close a node is to all other nodes in
the network. Nodes with high closeness centrality can quickly interact with other nodes, and
information can spread efficiently from them to the rest of the network.

Ex: Social Network Graph:

Let's understand centrality with an example using a simple social network graph. Consider a
social network of friends, where each node represents an individual, and the edges represent
friendships between them. We will calculate different centrality measures for this network.
      A
     / \
    B   C
   / \ / \
  D---E---F
In this network:

 Node A is friends with B and C.


 Node B is friends with A, D, and E.
 Node C is friends with A, E, and F.
 Node D is friends with B and E.
 Node E is friends with B, C, D, and F.
 Node F is friends with C and E.

Now, let's calculate some centrality measures for this social network:

1. Degree Centrality: Degree centrality measures the number of connections (edges) that a node
has.

 Degree Centrality of Node A: 2 (connected to B and C)


 Degree Centrality of Node B: 3 (connected to A, D, and E)
 Degree Centrality of Node C: 3 (connected to A, E, and F)
 Degree Centrality of Node D: 2 (connected to B and E)
 Degree Centrality of Node E: 4 (connected to B, C, D, and F)
 Degree Centrality of Node F: 2 (connected to C and E)

2. Betweenness Centrality: Betweenness centrality quantifies the extent to which a node acts as a bridge or intermediary between other nodes in the network. Counting each pair of other nodes once and crediting a node with the fraction of their shortest paths that pass through it:

 Betweenness Centrality of Node A: 0.5 (A lies on one of the two shortest paths between B and C)
 Betweenness Centrality of Node B: 1.5 (the only shortest path B bridges for A-D, plus one of the two shortest paths for A-E)
 Betweenness Centrality of Node C: 1.5 (the only shortest path C bridges for A-F, plus one of the two shortest paths for A-E)
 Betweenness Centrality of Node D: 0 (no shortest paths pass through D)
 Betweenness Centrality of Node E: 3.5 (the only shortest paths for B-F, C-D and D-F, plus one of the two shortest paths for B-C)
 Betweenness Centrality of Node F: 0 (no shortest paths pass through F)

3. Closeness Centrality: Closeness centrality measures how close a node is to all other nodes in the network. Using the standard definition, (n - 1) divided by the sum of shortest-path distances to the other nodes (a NetworkX sketch that reproduces these values follows this list):

 Closeness Centrality of Node A: 5/8 = 0.625
 Closeness Centrality of Node B: 5/7 ≈ 0.714
 Closeness Centrality of Node C: 5/7 ≈ 0.714
 Closeness Centrality of Node D: 5/8 = 0.625
 Closeness Centrality of Node E: 5/6 ≈ 0.833
 Closeness Centrality of Node F: 5/8 = 0.625

4. Eigenvector Centrality: Eigenvector centrality takes into account both a node's own
centrality and the centrality of its neighbors. A node's centrality is higher if it is connected to
other highly central nodes. Eigenvector centrality is useful for identifying nodes that are
influential due to their connections to other influential nodes.
5. PageRank: PageRank is a centrality measure used by Google's search engine algorithm to
rank web pages based on their importance. It considers both the number and quality of links
pointing to a web page, giving higher rank to pages that are linked to by other high-ranking
pages.

 Centrality measures help researchers and analysts understand the structure and dynamics of networks (e.g., how they behave under attacks or failures).
 The specific centrality measure used depends on the characteristics and objectives of the
network analysis being conducted.
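A minimal sketch that builds the friendship graph above with NetworkX and computes the centrality measures discussed; NetworkX's default definitions are assumed, which is handy for checking the hand calculations:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
                  ("C", "E"), ("C", "F"), ("D", "E"), ("E", "F")])

print(nx.degree_centrality(G))                         # degree / (n - 1), e.g. E: 0.8
print(nx.betweenness_centrality(G, normalized=False))  # e.g. E: 3.5
print(nx.closeness_centrality(G))                      # (n - 1) / sum of distances, e.g. E: 0.833
print(nx.eigenvector_centrality(G))                    # highest for E, the best-connected node
print(nx.pagerank(G))                                  # also highest for E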

Applications of Text Link Analysis:

1. Search Engine Ranking: Text link analysis is a crucial component of search engine
algorithms, helping to determine the relevance and importance of web pages based on their
incoming and outgoing links.

2. Information Retrieval: By analyzing the connections between documents, link analysis can
improve information retrieval systems, providing more accurate and relevant search results.

3. Social Network Analysis: In social networks, link analysis is used to identify relationships
between users, detect communities, and understand the influence of individuals within the
network.

4. Recommender Systems: Text link analysis can be applied to build recommender systems that
suggest related articles, products, or services based on user preferences and item connections.

5. Fraud Detection: In financial transactions or networks, link analysis can be utilized to


identify suspicious connections and potential fraud patterns.

2.6 Visualization Approaches and its Operations

• The term "Data Visualization" is derived from "Data Science".

• Data Science is the science of analyzing raw data using statistics and machine learning techniques.

• It transforms raw data from various sources (surveys, feedback, lists of purchases, votes, etc.) into a form suitable for analysis.

 In the high-level functional architecture of a text mining system illustrated in Figure 2.6.1
visualization tools are among those system elements situated closest to the user.
 Visualization tools are mechanisms that serve to facilitate human interactivity with a text
mining system.
 These tools are layered on top of – and are dependent upon – the existence of a processed
document collection and the various algorithms that make up a text mining system’s core
mining capabilities.
 The increased emphasis on adding more sophisticated and varied visualization tools to text mining systems has had several implications for these systems' architectural design.
 Although older text mining systems often had rigidly integrated visualization tools built into their user interface (UI) front ends, newer text mining systems emphasize modularity and abstraction between their front-end (i.e., presentation layer) architectures and their core mining components.
 This kind of visualization tool can also easily be made to allow a user to click on a node
concept and either move to the underlying documents containing the concept or to connect to
information about sets or distributions of documents containing the concept within the
document collection.
 This latter type of information – the answer set to a rather routine type of query in many text
mining systems – can be demonstrated by means of a concept set graph.
 A very simple DAG (directed acyclic graph) is shown in Figure 2.6.2. Traditional, rigidly directed hierarchical
representations might be both much less obvious and less efficient in showing that four
separate concepts have a similar or possibly analogous relationship to a fifth concept.
 Because of their ability to illustrate more complex relationships, DAGs are very frequently
leveraged as the basis for moderately sophisticated relationship maps in text mining
applications.

DAGs can also be employed as the basis for modeling activity networks. An activity network is a
visual structure in which each vertex represents a task to be completed or a choice to be made and the
directed edges refer to subsequent tasks or choices. See Figure 2.6.3
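A minimal sketch of an activity network modeled as a DAG with NetworkX; the tasks and dependencies are illustrative:

import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("collect documents", "pre-process text"),
    ("pre-process text", "extract features"),
    ("extract features", "cluster documents"),
    ("extract features", "visualize concepts"),
])

print(nx.is_directed_acyclic_graph(dag))   # True: no cycles, so it is a valid DAG
print(list(nx.topological_sort(dag)))      # one valid ordering of the tasks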

Similarity Functions for Simple Concept Association Graphs

Similarity functions often form an essential part of working with simple concept association graphs, allowing a user to view relations between concepts according to differing weighting measures. Association rules involving sets (or concepts) A and B, which have been described in detail in Chapter II, are often introduced into a graph format in an undirected way and specified by a support and a confidence threshold. A fixed confidence threshold is often not very reasonable because it is independent of the support of the RHS of the rule. As a result, an association should have a significantly higher confidence than the share of the RHS in the whole context to be considered interesting. Significance is measured by a statistical test (e.g., t-test or chi-square).

Equivalence Classes, Partial Orderings, Redundancy Filters

Very many pairs of subsets can be built from a given category of concepts (e.g., all pairs of country subsets for the set of all countries). Each of these pairs is a possible association between subsets of concepts. Even if the threshold of the similarity function is increased, the resulting graph can have too complex a structure. We therefore define several equivalence relations to build equivalence classes of associations. Only a representative association from each class will then be included in the keyword graph in the default case. A first equivalence is called cover equivalence. Two associations are cover-equivalent if they have the same cover. For example, (Iran, Iraq) => (Kuwait, USA) is equivalent to (Iran, Iraq, Kuwait) => USA because they both have the same cover (Iran, Iraq, Kuwait, USA). The association with the highest similarity is selected as the representative from a cover equivalence class.

Operations on visualization approaches

Browsing-Support Operations

 Browsing-support operations enable access to the underlying document collections from the
concept set visual interface.
 Essentially, a concept set corresponds to a query that can be forwarded to the collection
retrieving those documents, which include all the concepts of the set.
 Therefore, each concept set appearing in a graph can be activated for browsing purposes.
 Moreover, derived sets based on set operations (e.g., difference and intersection) can be
activated for retrieval.

Search Operations

 Search operations define new search tasks related to nodes or associations selected in the
graph.
 A graph presents the results of a (former) search task and thus puts together sets of concepts
or sets of associations.
 In a GUI, the user can specify the search constraints: syntactical, background, quality, and
redundancy constraints.
 The former search is now to be refined by a selection of reference sets or associations in the
result graph.
 Some of the search constraints may be modified in the GUI for the scheduled refined search.
 In refinement operations, the user can, for example, increase the number of elements that are
allowed in a concept set.

Link Operations

 Link operations combine several concept graphs.


 Elements in one graph are selected and corresponding elements are highlighted in the second
graph.
 Three types of linked graphs can be distinguished: links between set graphs, between
association graphs, and between set and association graphs.

 When linking two set graphs, one or several sets are selected in one graph and corresponding
sets are highlighted in the second graph.
 A correspondence for sets can rely, for instance, on the intersections of a selected set with the
sets in the other graph.
 Then all those sets that have a high overlap with a selected set in the first graph are
highlighted in the second graph.
 When selected elements in a set graph are linked with an association graph, associations in the
second graph that have a high overlap with a selected set are highlighted.
 For instance, in a company graph, all country nodes that have a high intersection with a
selected topic in an economical topic graph can be highlighted.
 Thus, linkage of graphs relies on the construct of a correspondence between two set or
association patterns.
 For example, a correspondence between two sets can be defined by a criterion referring to
their intersection, a correspondence between a set and an association by a specialization
condition for the more special association constructed by adding the set to the original
association, and a correspondence between two associations by a specialization condition for
an association constructed by combining the two associations.
Presentation Operations

 A first interaction class relates to diverse presentation options for the graphs.
 It includes a number of operations essential to the customization, personalization, calibration, and administration of presentation-layer elements, including sorting, expanding or collapsing, filtering or finding, and zooming or unzooming nodes, edges, or graph regions.
 Although all these presentation-layer operations can have important effects on usability and, as a result, on user interaction, some can in certain situations have a very substantial impact on the overall power of a system's visualization tools.
 Zoom operations, in particular, can add significant capabilities to otherwise very simple
concept graphs.
 For instance, by allowing a user to zoom automatically to a predetermined focal point in a graph, one can add at least something reminiscent of the type of functionality found in much more sophisticated fisheye and self-organizing map (SOM) visualizations.
