Unit - 2
Data Preprocessing
Data Preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by a machine.
The main agenda for a model to be accurate and precise in its predictions is that the algorithm should be able to interpret the data's features easily.
The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values. Algorithms applied to such data fail to identify patterns effectively. Data Preprocessing is, therefore, important to improve the overall data quality.
Duplicate or missing values may give an incorrect view of the overall statistics of data.
Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to
false predictions.
Quality decisions must be based on quality data. Data Preprocessing is important to get this quality data, without which it would simply be a case of garbage in, garbage out.
Now, let's discuss the four main stages of data preprocessing in more depth.
Data Cleaning
Data Cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.
1. Missing values
Here are a few ways to solve this issue:
Ignore the tuple: This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
Fill in the missing values: There are many methods to achieve this, such as filling in the values manually, predicting the missing values using regression, or using numerical methods like the attribute mean.
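As an illustration, here is a minimal Python sketch of both options (assuming pandas and scikit-learn are available; the tiny DataFrame and its age column are made up for the example):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values in the 'age' attribute
df = pd.DataFrame({"age": [25, None, 47, 35, None, 52]})

# Option 1: ignore (drop) tuples that contain missing values
df_dropped = df.dropna()

# Option 2: fill the missing values with the attribute mean
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df)
```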
2. Noisy Data
Noisy data is a random error or variance in a measured variable. It can be smoothed with the help of the following techniques:
Binning
It is a technique that works on sorted data values to smooth any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values.
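A minimal sketch of smoothing by bin means with pandas; the values and the number of bins are illustrative:

```python
import pandas as pd

# Illustrative sorted attribute values (e.g., prices)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into three equal-frequency bins
bins = pd.qcut(prices, q=3)

# Smoothing by bin means: replace every value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())
```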
Regression
This data mining technique is generally used for prediction. It helps to smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.
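For example, with a single independent attribute, noisy values can be smoothed by fitting a straight line (a sketch using NumPy; the data here is synthetic):

```python
import numpy as np

# Hypothetical noisy measurements of y against a single independent attribute x
x = np.arange(10)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.5, size=x.size)  # noisy line

# Fit a degree-1 (linear) polynomial; use a higher degree for polynomial regression
slope, intercept = np.polyfit(x, y, deg=1)

# Replace the noisy observations with the fitted (smoothed) values
y_smoothed = slope * x + intercept
print(y_smoothed)
```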
Clustering
Groups/clusters are created from data having similar values. Values that do not lie in any cluster can be treated as noisy data and removed.
3. Removing outliers
Clustering techniques group together similar data points. The tuples that lie outside the cluster are
outliers/inconsistent data.
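A minimal sketch using DBSCAN from scikit-learn, where points that are not assigned to any cluster (label -1) are treated as outliers; the 2-D data is synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D data: two dense groups plus one far-away point
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8], [50, 50]])

# DBSCAN labels points that do not belong to any dense cluster as -1 (noise)
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)

inliers = X[labels != -1]
outliers = X[labels == -1]
print("Outliers:", outliers)   # expected: [[50, 50]]
```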
Data Integration
Data Integration is one of the data preprocessing steps used to merge the data present in multiple sources into a single, larger data store such as a data warehouse.
Data Integration is needed especially when we are aiming to solve a real-world scenario like detecting the presence of nodules from CT scan images. The only option is to integrate the images from multiple medical sources to form a larger database.
We might run into some issues while adopting Data Integration as one of the Data Preprocessing steps:
Schema integration and object matching: The data can be present in different formats and with differently named attributes, which might cause difficulty in data integration.
Data Transformation
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the Data Transformation strategies mentioned below.
Generalization
The low-level or granular data that we have is converted to high-level information by using concept hierarchies. For example, we can transform primitive address data, such as the city, into higher-level information such as the country.
Normalization
It is one of the most important and most widely used Data Transformation techniques. Numerical attributes are scaled up or down to fit within a specified range. In this approach, we constrain a data attribute to a particular range so that different data points become comparable. Normalization can be done in several ways, including:
Min-max normalization
Z-Score normalization
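A small sketch of both, using NumPy; the attribute values are illustrative:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # illustrative attribute values

# Min-max normalization: v' = (v - min) / (max - min), mapping values into [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: v' = (v - mean) / standard deviation
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score)
```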
Attribute Selection
New properties of the data are created from existing attributes to help in the data mining process. For example, the date_of_birth attribute can be transformed into another property like is_senior_citizen for each tuple, which can directly influence predictions of diseases or chances of survival, etc.
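A quick sketch of deriving such an attribute with pandas; the column name and the reference date are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1950-05-01", "1990-07-15", "1942-12-30"]})

# Derive an age in years, then the boolean is_senior_citizen attribute
reference_date = pd.Timestamp("2024-01-01")   # fixed date so the example is reproducible
age_years = (reference_date - pd.to_datetime(df["date_of_birth"])).dt.days // 365
df["is_senior_citizen"] = age_years >= 60
print(df)
```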
Aggregation
It is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and summarized on a monthly or yearly basis.
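A brief sketch with pandas, aggregating made-up daily sales into monthly totals:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-28"]),
    "amount": [120, 80, 200, 150],
})

# Aggregate individual transactions into monthly totals
monthly_sales = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly_sales)
```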
Data Reduction
The size of the dataset in a data warehouse can be too large to be handled by data analysis and data
mining algorithms.
One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume yet produces nearly the same analytical results.
Dimensionality reduction
Dimensionality reduction techniques are used to perform feature extraction. The dimensionality of a dataset refers to the attributes or individual features of the data. This technique aims to reduce the number of redundant features we consider in machine learning algorithms. Dimensionality reduction can be performed with techniques such as Principal Component Analysis (PCA).
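As a short sketch, PCA from scikit-learn can project the four features of the classic iris dataset down to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original features per sample
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)     # (150, 4) -> (150, 2)
```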
Data compression
By using encoding technologies, the size of the data can be reduced significantly. Compression can be either lossy or lossless. If the original data can be recovered exactly after reconstruction from the compressed data, it is called lossless compression; otherwise, it is lossy compression.
Discretization
Data discretization is used to divide attributes of a continuous nature into data with intervals. This
is done because continuous features tend to have a smaller chance of correlation with the target variable.
Thus, it may be harder to interpret the results. After discretizing a variable, groups corresponding to the
target can be interpreted. For example, the attribute age can be discretized into bins like below 18, 18-44, and above 44.
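A quick sketch with pandas; the ages and bin edges are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 40, 45, 70])

# Discretize the continuous age attribute into labelled intervals
age_groups = pd.cut(ages, bins=[0, 18, 44, 120], right=False,
                    labels=["below 18", "18-44", "above 44"])
print(age_groups)
```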
Numerosity reduction
The data can be represented as a model or equation like a regression model. This would save the burden of storing the full dataset, since only the model's parameters need to be kept.
Attribute subset selection
It is important to be specific in the selection of attributes; otherwise, we may end up with high-dimensional data, which is difficult to train due to underfitting/overfitting problems. Only attributes that add more value towards model training should be considered, and the rest can be discarded.
Data Quality Assessment
Data Quality Assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Since the data is used for operations, customer management, marketing analysis, and decision making, it must be of high quality.
1. Data profiling: It involves exploring the data to identify data quality issues. Once the analysis of the issues is done, the data is summarized, noting problems such as duplicates and blank values.
2. Data monitoring: It involves maintaining the data in a clean state and continuously checking that the business needs are satisfied by the data.
2. Clustering
Clustering in text extraction refers to the process of grouping similar textual documents
or pieces of text together based on their content. It is a technique used in natural language
processing and information retrieval to organize and categorize large amounts of text
data. The goal of text clustering is to discover patterns, relationships, and themes within
the text that might not be immediately apparent.
In the context of text extraction, clustering involves identifying and grouping text
documents or segments that share common characteristics or topics. This process helps in
organizing and summarizing large text datasets, making it easier to analyze and extract
valuable insights from the text. Text clustering is often used as a preliminary step before
more detailed analysis, such as sentiment analysis, topic modeling, or document
summarization.
Grouping for Exploration: Text clustering allows researchers or analysts to explore large
text collections by organizing them into meaningful groups. These groups might
represent topics, themes, or categories present in the text.
Algorithm Variety: Various clustering algorithms can be applied to text data, including k-
means, hierarchical clustering, DBSCAN, and more. Each algorithm has its own
strengths and weaknesses, making them suitable for different types of text analysis tasks.
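As a hedged sketch, a handful of documents can be clustered by representing them as TF-IDF vectors and applying k-means with scikit-learn; the documents and the number of clusters are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the stock market fell sharply today",
    "investors worry about rising interest rates",
    "the football team won the championship",
    "the striker scored twice in the final match",
]

# Represent each document as a TF-IDF vector, then group similar documents
X = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents on the same theme should share a cluster label
```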
3. Probabilistic Models
Probabilistic models refer to a class of statistical techniques and models used to analyze and
extract information from textual data while incorporating uncertainty and probability
distributions. These models are based on the principles of probability theory and are
particularly useful when dealing with uncertain or ambiguous text data.
In the context of text extraction, probabilistic models are employed to capture the likelihood
of certain events or patterns occurring in the text. These models help estimate the
probabilities of different outcomes, making them suitable for tasks such as information
retrieval, text classification, sentiment analysis, and more.
Uncertainty Handling: Text data often contains inherent uncertainty, ambiguity, and
variability. Probabilistic models provide a framework to handle and quantify this uncertainty
by assigning probabilities to different outcomes.
Bayesian Framework: Many probabilistic models in text extraction are based on the Bayesian
framework, which allows for the incorporation of prior knowledge and updating of
probabilities as new evidence (textual information) is observed.
Text Classification: Probabilistic models are commonly used for text classification tasks,
where the goal is to assign documents to predefined categories or classes. These models
estimate the probability of a document belonging to each class and make predictions based
on these probabilities.
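As a concrete illustration, here is a small sketch of a probabilistic (multinomial naive Bayes) text classifier with scikit-learn; the tiny training set and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great product, works well", "terrible, broke after a day",
               "really happy with this", "awful quality, do not buy"]
train_labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Multinomial naive Bayes estimates P(class | document) from word counts
clf = MultinomialNB().fit(X_train, train_labels)

X_new = vectorizer.transform(["works really well, happy with it"])
print(clf.predict(X_new), clf.predict_proba(X_new))
```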
Topic Modeling: Topic modeling techniques, such as Latent Dirichlet Allocation (LDA), are
probabilistic models that aim to discover the underlying topics in a collection of text
documents. These models estimate the probability distribution of words in each topic and the
probability distribution of topics in each document.
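A brief sketch of fitting an LDA topic model with scikit-learn; the corpus and the number of topics are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "doctors study disease and medicine",
    "the hospital treats patients with new medicine",
    "the team scored a goal in the football match",
    "fans cheered as the match went to extra time",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic distribution (each row sums to 1)
print(lda.transform(counts))
```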
Information Retrieval: In information retrieval, probabilistic models are used to rank and
retrieve relevant documents based on their likelihood of containing the information a user is
seeking. Models like the Okapi BM25 and language models are probabilistic in nature.
Named Entity Recognition (NER): Probabilistic models can be used for named entity
recognition, where the goal is to identify and classify entities (such as names of people,
places, organizations) in text. These models estimate the probability of a sequence of words
being an entity.
Sentiment Analysis: For sentiment analysis, probabilistic models can estimate the likelihood
of a given text expressing positive, negative, or neutral sentiment.
Language Generation: In natural language generation tasks, probabilistic models are used to
generate coherent and contextually appropriate text. For instance, generative language
models like GPT-3 are based on probabilistic modeling principles.
Probabilistic models provide a robust framework for dealing with uncertainty in text
extraction tasks. They enable researchers and practitioners to make informed decisions and
predictions based on probabilities, making them valuable tools for various applications in the
field of natural language processing and text analysis.
4. Browsing and Query Refinement
Browsing and query refinement on the presentation layer refer to user-interface and interaction
techniques that enhance the way users explore and interact with textual data. These techniques
are designed to facilitate efficient information retrieval, exploration of relevant content, and
improvement of search results by allowing users to interact with the extracted information in a
more intuitive and user-friendly manner.
Key points about browsing and query refinement on the presentation layer in text
extraction:
Faceted Search: Faceted search is a browsing technique that allows users to filter and navigate
through search results using multiple attributes or facets. Each facet corresponds to a specific
characteristic or category associated with the extracted text, such as author, date, topic, or
keyword.
Query Expansion: Query refinement techniques enable users to expand or modify their queries
to improve search results. Suggestions for query expansion can be based on synonyms, related
terms, or previously viewed documents.
Feedback Loop: Browsing and query refinement techniques often incorporate user feedback to
improve search accuracy and relevance over time. The system adapts to user preferences and
behavior.
Overall, browsing and query refinement on the presentation layer in text extraction aim to
enhance the user experience, support efficient exploration of textual content, and help users find
the information they need more effectively. These techniques leverage user interactions to refine
and optimize search results, providing a seamless and interactive approach to accessing and
navigating through extracted text data.
5. Link Analysis
In the context of text extraction, link analysis focuses on understanding how different pieces
of text are interconnected and how they contribute to the overall meaning or context. It is
often used to enhance information retrieval, knowledge discovery, and content
summarization from large collections of text data.
Semantic Relationships: Link analysis helps reveal semantic relationships between text
elements, such as synonyms, antonyms, hypernyms, hyponyms, and co-occurring terms.
Understanding these relationships contributes to a deeper understanding of the content.
Co-Occurrence Analysis: Link analysis can identify words, phrases, or entities that
frequently appear together in the same context. This can help uncover associations between
concepts and topics.
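A simple sketch of counting term co-occurrence within the same sentence, in plain Python; the sentences and the naive whitespace tokenization are only for illustration:

```python
from collections import Counter
from itertools import combinations

sentences = [
    "data mining extracts patterns from data",
    "text mining extracts information from text",
    "patterns emerge from large text collections",
]

pair_counts = Counter()
for sentence in sentences:
    terms = sorted(set(sentence.lower().split()))
    # Count every unordered pair of distinct terms appearing in the same sentence
    pair_counts.update(combinations(terms, 2))

print(pair_counts.most_common(5))
```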
Topic Modeling: Link analysis can be used to construct topic models by analyzing the links
between documents or keywords. This assists in summarizing and organizing textual
information based on underlying themes.
Entity Resolution: Link analysis can aid in resolving references to the same entity across
different documents, helping to build a more comprehensive and accurate representation of
the information.
Hyperlinks and Citations: In the context of web-based text extraction, link analysis
involves analyzing hyperlinks and citations between web pages or documents. This can
provide insights into content relevance, authority, and influence.
Sentiment and Opinion Analysis: Link analysis can help identify connections between
sentiment-bearing terms, contributing to a more nuanced understanding of opinions
expressed in text.
Overall, link analysis in text extraction is a technique that aids in revealing the intricate web
of relationships between different textual elements. It provides a powerful tool for
understanding the structure, context, and meaning embedded within large volumes of text
data, enabling more effective information extraction and analysis.
6. Visualization Approaches and Their Operations
Visualization approaches and their operations refer to techniques that transform textual
data into graphical representations to facilitate a better understanding of patterns,
relationships, and insights within the text. These visualizations enable users to explore
and interpret complex textual information more effectively.
Word Clouds: Word clouds visually represent the frequency of words in a text document
or collection. The size of each word corresponds to its frequency, allowing users to
quickly identify prominent terms.
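The computation behind a word cloud is essentially a word-frequency count; a tiny sketch (a plotting library such as wordcloud could then scale words by these counts):

```python
from collections import Counter

text = "text mining turns raw text into insight and insight drives decisions"

# Count word frequencies; a word cloud scales each word by this count
frequencies = Counter(text.lower().split())
print(frequencies.most_common(3))
```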
Bar Charts and Histograms: Bar charts and histograms display the frequency
distribution of words, topics, or concepts. They provide insights into the relative
importance or occurrence of different elements.
Heatmaps: Heatmaps visualize the co-occurrence or similarity between words, topics, or
documents. Brighter colors indicate stronger relationships.
Topic Models: Visualizations of topic models (e.g., LDA) show the distribution of words
across identified topics. These visualizations help users understand the main themes
within the text.
Document Clusters: Visualizing document clusters helps users identify groups of related
documents. Clusters can be represented as groups of points on a 2D plane using
dimensionality reduction techniques.
Time-Series Plots: Time-series plots display how the frequency or sentiment of certain
terms changes over time. They can reveal trends and patterns in text data.
Tree Maps: Tree maps represent hierarchical relationships within the text, showing how
concepts or topics are nested and organized.