
UNIT II - TEXT EXTRACTION

Pre-processing Techniques – Clustering – Probabilistic Models – Browsing and Query Refinement on the Presentation Layer – Link Analysis – Visualization Approaches and its Operations.

Data Preprocessing

Data preprocessing includes the steps we need to follow to transform or encode data so that it can be easily parsed by the machine.

For a model to be accurate and precise in its predictions, the algorithm must be able to interpret the data's features easily.

Why is Data Preprocessing important?

The majority of real-world datasets used for machine learning are highly susceptible to missing, inconsistent, and noisy data because of their heterogeneous origins.

Applying data mining algorithms to such noisy data would not give quality results, as they would fail to identify patterns effectively. Data preprocessing is therefore important for improving overall data quality.

- Duplicate or missing values may give an incorrect view of the overall statistics of the data.

- Outliers and inconsistent data points often disturb the model's overall learning, leading to false predictions.

Quality decisions must be based on quality data. Data preprocessing is important for obtaining this quality data; without it, we would have a Garbage In, Garbage Out scenario.


4 Steps in Data Preprocessing

Now, let's discuss the four main stages of data preprocessing in more depth.

Data Cleaning

Data cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.

1. Missing values

Here are a few ways to handle this issue:

- Ignore those tuples

This method should be considered when the dataset is huge and numerous missing values are present within a tuple.

- Fill in the missing values

There are many methods to achieve this, such as filling in the values manually, predicting the missing values with a regression method, or using numerical methods like the attribute mean (see the sketch below).
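
As a hedged illustration of filling in missing values with the attribute mean, the sketch below uses pandas; the column names and values are invented for the example.

import pandas as pd

# Toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25, None, 47, 31, None],
    "income": [50000, 62000, None, 58000, 45000],
})

# Replace each numeric column's missing entries with that column's mean
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)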

2. Noisy Data

This step involves removing random error or variance in a measured variable. It can be done with the help of the following techniques:

- Binning

Binning works on sorted data values to smooth any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values (a sketch of bin-mean smoothing follows this list).

- Regression

This data mining technique is generally used for prediction. It helps smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.

- Clustering

Groups/clusters are created from data having similar values. The values that don't lie in any cluster can be treated as noisy data and removed.
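
A minimal sketch of bin-mean smoothing, assuming sorted values and equal-sized bins of three; the numbers are illustrative.

import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 28])  # already sorted
bin_size = 3

smoothed = values.astype(float)
for start in range(0, len(values), bin_size):
    segment = values[start:start + bin_size]
    # Replace every value in the bin with the bin mean
    smoothed[start:start + bin_size] = segment.mean()

print(smoothed)  # bins become [7, 7, 7, 19, 19, 19, 25.67, 25.67, 25.67]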

3. Removing outliers

Clustering techniques group together similar data points. The tuples that lie outside the cluster are
outliers/inconsistent data.
Data Integration

Data integration is a data preprocessing step used to merge data from multiple sources into a single, larger data store such as a data warehouse.

Data integration is especially needed when we aim to solve a real-world problem such as detecting nodules in CT scan images, where the only option is to integrate images from multiple medical nodes into a larger database.

We might run into some issues while adopting data integration as one of the data preprocessing steps (a merge sketch follows this list):

- Schema integration and object matching: The data can be present in different formats and with differently named attributes, which can make integration difficult.

- Removing redundant attributes from all data sources.

- Detection and resolution of data value conflicts.
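
As a hedged illustration of integrating two sources on a shared key, the sketch below uses pandas; the table names and columns (patient_id, scan_date, diagnosis) are assumptions made for the example.

import pandas as pd

scans = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "scan_date": ["2023-01-05", "2023-02-10", "2023-03-12"],
})
reports = pd.DataFrame({
    "patient_id": [1, 2, 4],
    "diagnosis": ["benign", "nodule", "benign"],
})

# Merge the two sources on the shared key; "outer" keeps unmatched rows
integrated = scans.merge(reports, on="patient_id", how="outer")
print(integrated)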

Data Transformation

Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the data transformation strategies mentioned below.

Generalization

Low-level or granular data is converted into high-level information by using concept hierarchies. For example, a primitive attribute in an address, such as the city, can be transformed into higher-level information such as the country.

Normalization

This is one of the most widely used data transformation techniques. Numerical attributes are scaled up or down to fit within a specified range. In this approach, we constrain a data attribute to a particular range so that different data points become directly comparable. Normalization can be done in multiple ways, which are highlighted here (a sketch of the first two follows this list):

- Min-max normalization

- Z-Score normalization

- Decimal scaling normalization
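
A minimal sketch of min-max and z-score normalization using NumPy; the values are illustrative.

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescale to the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(x_zscore)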

Attribute Selection

New properties of the data are created from existing attributes to help in the data mining process. For example, a date-of-birth attribute can be transformed into another property such as is_senior_citizen for each tuple, which can directly influence the prediction of diseases, chances of survival, and so on (see the sketch below).
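
A hedged sketch of deriving such an attribute with pandas; the column names, the cut-off age of 60, and the fixed reference date are assumptions made for the example.

import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1950-04-12", "1988-09-30", "1962-01-15"]})
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

# Derive a new attribute from an existing one (age >= 60 as of a reference date)
reference = pd.Timestamp("2024-01-01")
age_years = (reference - df["date_of_birth"]).dt.days // 365
df["is_senior_citizen"] = age_years >= 60

print(df)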

Aggregation

This is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and transformed so that it is shown in a per-month and per-year format (see the sketch below).
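
A minimal sketch of aggregating transaction-level sales into a monthly summary with pandas; the data is made up.

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03"]),
    "amount": [120.0, 80.0, 200.0],
})

# Aggregate raw transactions into a per-month summary
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)  # 2023-01 -> 200.0, 2023-02 -> 200.0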

Data Reduction

The size of the dataset in a data warehouse can be too large to be handled by data analysis and data mining algorithms.

One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results.

Here is a walkthrough of the various data reduction strategies.

Data cube aggregation

This is a form of data reduction in which the gathered data is expressed in a summary form.

Dimensionality reduction

Dimensionality reduction techniques are used to perform feature extraction. The dimensionality of a dataset refers to the attributes or individual features of the data. This technique aims to reduce the number of redundant features considered by machine learning algorithms. Dimensionality reduction can be done using techniques such as Principal Component Analysis (PCA), as sketched below.
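
A hedged sketch of PCA-based dimensionality reduction, assuming scikit-learn is available; the random data and the choice of two components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

# Toy high-dimensional data: 100 samples with 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the two principal components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_)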

Data compression

By using encoding technologies, the size of the data can be significantly reduced. Compression can be either lossless or lossy. If the original data can be recovered after reconstruction from the compressed data, this is referred to as lossless reduction; otherwise, it is referred to as lossy reduction.

Discretization

Data discretization is used to divide continuous attributes into data with intervals. This is done because continuous features tend to have a smaller chance of correlation with the target variable, which can make the results harder to interpret. After discretizing a variable, the groups corresponding to the target can be interpreted. For example, the attribute age can be discretized into bins such as below 18, 18-44, 44-60, and above 60 (see the sketch below).
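
A minimal sketch of discretizing age into the bins mentioned above, using pandas; the sample ages are made up.

import pandas as pd

ages = pd.Series([12, 25, 47, 63, 18])

# Discretize the continuous age attribute into interval-based groups
bins = [0, 18, 44, 60, 120]
labels = ["below 18", "18-44", "44-60", "above 60"]
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_groups)  # 12 -> below 18, 25 -> 18-44, 47 -> 44-60, 63 -> above 60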

Numerosity reduction

The data can be represented by a model or equation, such as a regression model. Storing the model instead of the full dataset removes the burden of storing huge amounts of data.

Attribute subset selection

It is very important to be specific in the selection of attributes. Otherwise, we may end up with high-dimensional data, which is difficult to train because of underfitting/overfitting problems. Only attributes that add value to model training should be kept; the rest can be discarded.

Data Quality Assessment

Data quality assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Since the data is used for operations, customer management, marketing analysis, and decision making, it needs to be of high quality.

The main components of data quality assessment include:

1. Completeness, with no missing attribute values

2. Accuracy and reliability of the information

3. Consistency across all features

4. Validity of the data

5. Absence of redundancy

The data quality assurance process involves three main activities:

1. Data profiling: This involves exploring the data to identify data quality issues. Once the issues have been analyzed, they are summarized, for example as counts of duplicates, blank values, and so on.

2. Data cleaning: This involves fixing the identified data issues.

3. Data monitoring: This involves keeping the data in a clean state and continuously checking that the business needs are being satisfied by the data.

2. Clustering
Clustering in text extraction refers to the process of grouping similar textual documents
or pieces of text together based on their content. It is a technique used in natural language
processing and information retrieval to organize and categorize large amounts of text
data. The goal of text clustering is to discover patterns, relationships, and themes within
the text that might not be immediately apparent.

In the context of text extraction, clustering involves identifying and grouping text
documents or segments that share common characteristics or topics. This process helps in
organizing and summarizing large text datasets, making it easier to analyze and extract
valuable insights from the text. Text clustering is often used as a preliminary step before
more detailed analysis, such as sentiment analysis, topic modeling, or document
summarization.

Key points about clustering in text extraction:

Similarity Measurement: Clustering algorithms typically use a measure of similarity (or distance) between pairs of text documents. Similarity can be based on various factors, such as word usage, frequency, and context.

Unsupervised Learning: Clustering is typically an unsupervised learning technique, meaning it doesn't require labeled training data. The algorithm identifies patterns solely based on the content of the text.

Grouping for Exploration: Text clustering allows researchers or analysts to explore large
text collections by organizing them into meaningful groups. These groups might
represent topics, themes, or categories present in the text.

Applications: Clustering in text extraction has numerous applications, such as document organization, topic discovery, content recommendation, and identifying emerging trends in textual data.

Algorithm Variety: Various clustering algorithms can be applied to text data, including k-
means, hierarchical clustering, DBSCAN, and more. Each algorithm has its own
strengths and weaknesses, making them suitable for different types of text analysis tasks.

Pre-processing: Pre-processing steps, such as text normalization, stop-word removal, and feature extraction, are often applied before clustering to prepare the text data for analysis.
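
A hedged sketch of clustering short documents with TF-IDF features and k-means, assuming scikit-learn is available; the documents and the choice of two clusters are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "The stock market rallied after the earnings report",
    "Investors reacted to quarterly earnings and share prices",
    "The football team won the championship final",
    "The striker scored twice in the cup final",
]

# Convert the text into TF-IDF feature vectors (stop words removed)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Group similar documents into two clusters
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)  # e.g. [0, 0, 1, 1] - finance vs. sports documents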

3. Probabilistic Models

Probabilistic models refer to a class of statistical techniques and models used to analyze and
extract information from textual data while incorporating uncertainty and probability
distributions. These models are based on the principles of probability theory and are
particularly useful when dealing with uncertain or ambiguous text data.
In the context of text extraction, probabilistic models are employed to capture the likelihood
of certain events or patterns occurring in the text. These models help estimate the
probabilities of different outcomes, making them suitable for tasks such as information
retrieval, text classification, sentiment analysis, and more.

Key points about probabilistic models in text extraction:

Uncertainty Handling: Text data often contains inherent uncertainty, ambiguity, and
variability. Probabilistic models provide a framework to handle and quantify this uncertainty
by assigning probabilities to different outcomes.

Bayesian Framework: Many probabilistic models in text extraction are based on the Bayesian
framework, which allows for the incorporation of prior knowledge and updating of
probabilities as new evidence (textual information) is observed.

Text Classification: Probabilistic models are commonly used for text classification tasks,
where the goal is to assign documents to predefined categories or classes. These models
estimate the probability of a document belonging to each class and make predictions based
on these probabilities.
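
As a hedged illustration of probabilistic text classification, the sketch below trains a multinomial Naive Bayes model with scikit-learn; the tiny training set and the spam/ham labels are invented, and a real system would need far more data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for monday", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a probabilistic (Naive Bayes) classifier
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB()
model.fit(X_train, train_labels)

X_new = vectorizer.transform(["free prize inside"])
print(model.predict(X_new))        # most probable class
print(model.predict_proba(X_new))  # estimated probability of each class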

Topic Modeling: Topic modeling techniques, such as Latent Dirichlet Allocation (LDA), are
probabilistic models that aim to discover the underlying topics in a collection of text
documents. These models estimate the probability distribution of words in each topic and the
probability distribution of topics in each document.
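
A hedged sketch of LDA topic modeling with scikit-learn; the corpus, the choice of two topics, and the three top words shown per topic are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the election results and government policy",
    "parliament debated the new policy bill",
    "the team scored a late goal in the match",
    "the coach praised the players after the match",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# LDA estimates a word distribution per topic and a topic distribution per document
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-3:]]
    print(f"Topic {topic_idx}: {top_words}")
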
Information Retrieval: In information retrieval, probabilistic models are used to rank and
retrieve relevant documents based on their likelihood of containing the information a user is
seeking. Models like the Okapi BM25 and language models are probabilistic in nature.

Named Entity Recognition (NER): Probabilistic models can be used for named entity
recognition, where the goal is to identify and classify entities (such as names of people,
places, organizations) in text. These models estimate the probability of a sequence of words
being an entity.

Sentiment Analysis: For sentiment analysis, probabilistic models can estimate the likelihood
of a given text expressing positive, negative, or neutral sentiment.

Language Generation: In natural language generation tasks, probabilistic models are used to
generate coherent and contextually appropriate text. For instance, generative language
models like GPT-3 are based on probabilistic modeling principles.

Probabilistic models provide a robust framework for dealing with uncertainty in text
extraction tasks. They enable researchers and practitioners to make informed decisions and
predictions based on probabilities, making them valuable tools for various applications in the
field of natural language processing and text analysis.

4. Browsing and Query Refinement on the Presentation Layer

Browsing and query refinement on the presentation layer refer to user-interface and interaction
techniques that enhance the way users explore and interact with textual data. These techniques
are designed to facilitate efficient information retrieval, exploration of relevant content, and
improvement of search results by allowing users to interact with the extracted information in a
more intuitive and user-friendly manner.

Key points about browsing and query refinement on the presentation layer in text
extraction:

User-Friendly Exploration: Browsing and query refinement techniques focus on creating an intuitive and user-friendly interface for users to interact with the extracted textual information. The goal is to simplify the process of finding relevant content within a large collection of text documents.

Faceted Search: Faceted search is a browsing technique that allows users to filter and navigate
through search results using multiple attributes or facets. Each facet corresponds to a specific
characteristic or category associated with the extracted text, such as author, date, topic, or
keyword.

Content-Based Filtering: Content-based filtering is a technique where the presentation layer suggests related or similar documents to the user based on the content of the documents they are currently viewing. This helps users discover additional relevant information without explicitly formulating new queries.

Collaborative Filtering: Collaborative filtering involves suggesting content based on the preferences and actions of other users. In the context of text extraction, collaborative filtering can recommend documents that users with similar interests have found valuable.

Dynamic Queries: Dynamic queries allow users to refine search results by adjusting various parameters in real-time. For example, users can dynamically adjust date ranges, relevance thresholds, or other filters to quickly refine their search.

Query Expansion: Query refinement techniques enable users to expand or modify their queries
to improve search results. Suggestions for query expansion can be based on synonyms, related
terms, or previously viewed documents.
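
A hedged sketch of synonym-based query expansion using NLTK's WordNet interface; it assumes the nltk package and the WordNet corpus are available, and the query term is illustrative.

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time corpus download

def expand_query(term):
    # Collect synonyms of the term from WordNet lemma names
    synonyms = {term}
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

print(expand_query("car"))  # e.g. {'car', 'auto', 'automobile', 'machine', ...}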

Visualization: Visualization techniques provide graphical representations of text data, making it easier for users to identify patterns, relationships, and trends within the extracted information. Visualization aids in quick comprehension of complex textual content.

Interactive Operations: Interactive operations, such as drag-and-drop, sliders, and checkboxes, empower users to manipulate and interact with the presented text and search results in real-time.

Feedback Loop: Browsing and query refinement techniques often incorporate user feedback to
improve search accuracy and relevance over time. The system adapts to user preferences and
behavior.

Overall, browsing and query refinement on the presentation layer in text extraction aim to
enhance the user experience, support efficient exploration of textual content, and help users find
the information they need more effectively. These techniques leverage user interactions to refine
and optimize search results, providing a seamless and interactive approach to accessing and
navigating through extracted text data.

5. Link Analysis


Link analysis refers to the process of analyzing and understanding the relationships and
connections between various textual elements, such as documents, sentences, phrases, or
keywords. It involves identifying and examining the links or associations between these
elements to uncover patterns, semantic relationships, and insights within the text.

In the context of text extraction, link analysis focuses on understanding how different pieces
of text are interconnected and how they contribute to the overall meaning or context. It is
often used to enhance information retrieval, knowledge discovery, and content
summarization from large collections of text data.

Key points about link analysis in text extraction:

Semantic Relationships: Link analysis helps reveal semantic relationships between text
elements, such as synonyms, antonyms, hypernyms, hyponyms, and co-occurring terms.
Understanding these relationships contributes to a deeper understanding of the content.

Co-Occurrence Analysis: Link analysis can identify words, phrases, or entities that
frequently appear together in the same context. This can help uncover associations between
concepts and topics.

Topic Modeling: Link analysis can be used to construct topic models by analyzing the links
between documents or keywords. This assists in summarizing and organizing textual
information based on underlying themes.

Entity Resolution: Link analysis can aid in resolving references to the same entity across
different documents, helping to build a more comprehensive and accurate representation of
the information.

Hyperlinks and Citations: In the context of web-based text extraction, link analysis
involves analyzing hyperlinks and citations between web pages or documents. This can
provide insights into content relevance, authority, and influence.

Graph Representation: Link analysis often involves representing relationships as a graph, with nodes representing text elements and edges representing connections between them. Graph-based algorithms can then be applied for analysis, as sketched below.
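
A hedged sketch of building a small link graph and ranking its nodes with PageRank, assuming the networkx library is available; the terms and their co-occurrence edges are invented for the example.

import networkx as nx

# Nodes are text elements (here, keywords); edges are co-occurrence links
G = nx.Graph()
G.add_edges_from([
    ("machine learning", "clustering"),
    ("machine learning", "classification"),
    ("clustering", "k-means"),
    ("classification", "naive bayes"),
    ("machine learning", "text mining"),
])

# PageRank scores indicate how central each term is in the link structure
scores = nx.pagerank(G)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
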
Contextual Understanding: Link analysis helps provide a broader context for individual
text elements by revealing how they fit into the larger narrative or network of information.

Information Retrieval: By understanding links between keywords, phrases, or documents, link analysis can enhance the accuracy and relevance of information retrieval systems.

Knowledge Discovery: Link analysis contributes to knowledge discovery by uncovering hidden patterns, trends, and insights within text data that might not be immediately apparent.

Sentiment and Opinion Analysis: Link analysis can help identify connections between
sentiment-bearing terms, contributing to a more nuanced understanding of opinions
expressed in text.

Overall, link analysis in text extraction is a technique that aids in revealing the intricate web
of relationships between different textual elements. It provides a powerful tool for
understanding the structure, context, and meaning embedded within large volumes of text
data, enabling more effective information extraction and analysis.

6. Visualization Approaches and its Operations.

Visualization approaches and their operations refer to techniques that transform textual
data into graphical representations to facilitate a better understanding of patterns,
relationships, and insights within the text. These visualizations enable users to explore
and interpret complex textual information more effectively.

Key points about visualization approaches and operations in text extraction:

Word Clouds: Word clouds visually represent the frequency of words in a text document
or collection. The size of each word corresponds to its frequency, allowing users to
quickly identify prominent terms.
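
A hedged sketch of generating a word cloud, assuming the wordcloud and matplotlib packages are available; the input text is made up.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = (
    "text mining clustering clustering extraction extraction extraction "
    "visualization topics topics retrieval"
)

# Word size in the image reflects word frequency in the input text
cloud = WordCloud(width=400, height=200, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()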

Bar Charts and Histograms: Bar charts and histograms display the frequency distribution of words, topics, or concepts. They provide insights into the relative importance or occurrence of different elements.

Heatmaps: Heatmaps visualize the co-occurrence or similarity between words, topics, or documents. Brighter colors indicate stronger relationships.

Network Graphs: Network graphs display relationships between entities, terms, or documents as nodes and edges. They can reveal connections and structures within the text data.

Topic Models: Visualizations of topic models (e.g., LDA) show the distribution of words
across identified topics. These visualizations help users understand the main themes
within the text.

Document Clusters: Visualizing document clusters helps users identify groups of related
documents. Clusters can be represented as groups of points on a 2D plane using
dimensionality reduction techniques.

Time-Series Plots: Time-series plots display how the frequency or sentiment of certain
terms changes over time. They can reveal trends and patterns in text data.

Chord Diagrams: Chord diagrams display relationships between categories or entities using arcs that connect them. They are useful for illustrating interactions and connections in textual data.

Tree Maps: Tree maps represent hierarchical relationships within the text, showing how
concepts or topics are nested and organized.

Sentiment Analysis Visualizations: Visualizations can depict the distribution of positive, negative, and neutral sentiment in text data, providing an overview of emotional content.

Interactive Visualizations: Interactive visualizations allow users to interact with the data, enabling actions like filtering, zooming, and panning. This enhances exploration and discovery.

Geospatial Visualizations: For text data with geographic references, geospatial visualizations show the geographical distribution of topics, sentiments, or entities.

Word Embedding Visualizations: Techniques like t-SNE can visualize high-dimensional word embeddings, showing relationships between words in a reduced 2D or 3D space.
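
A hedged sketch of projecting high-dimensional word embeddings to 2D with t-SNE, assuming scikit-learn is available; the random "embeddings" stand in for real word vectors.

import numpy as np
from sklearn.manifold import TSNE

# Pretend embeddings: 50 "words" with 100-dimensional vectors
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))

# Project to 2D for plotting; perplexity must be smaller than the sample count
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(embeddings)
print(coords.shape)  # (50, 2)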

Coherence and Incoherence Visualization: These visualizations assess the coherence or incoherence of topics in topic models, aiding model evaluation.

Visualizations in text extraction operations allow users to gain insights at a glance, identify trends, and discover relationships that might be difficult to perceive from the raw text alone. They play a crucial role in conveying information, supporting decision-making, and providing a more intuitive means of exploring and interpreting textual data.
