Lecture 8 - Pre Processing Techniques

Uploaded by

trisim mathur


Advance Text, Social and Media Analytics –

Preprocessing Techniques

Pre-Processing Techniques
Text mining is so dependent on the various preprocessing techniques that infer or
extract structured representations from raw unstructured data sources, or do
both, that one might even say text mining is to a degree defined by these elaborate
preparatory techniques.
Preprocessing techniques can be categorized in two ways.

According to their task
Task-oriented preprocessing approaches envision the process of creating a structured
document representation in terms of tasks and subtasks. They usually involve some sort of
preparatory goal or problem that needs to be solved, such as extracting titles and authors
from a PDF document. NLP, text categorization, and IE techniques can be used.

According to the algorithms and formal frameworks that they use
These approaches derive from formal methods for analysing complex phenomena that can
also be applied to natural language texts. Such approaches include classification
schemes, probabilistic models, and rule-based systems.

Categorizing text mining preprocessing techniques by either their task orientation or
the formal frameworks from which they derive does not mean that “mixing and matching”
techniques from either category for a given text mining application is prohibited.
Pre-Processing Techniques
Each of the preprocessing techniques starts with a partially structured document and proceeds
to enrich the structure by refining the present features and adding new ones. In the end, the
most advanced and meaning-representing features are used for the text mining, whereas the
rest are discarded.
Task Oriented Approach

Informally, the task of the document structuring process is to take the most
“raw” representation and convert it to the representation through which the essence
(i.e., the meaning) of the document surfaces.

The process follows a divide-and-conquer strategy. The subtasks can be divided broadly into
three classes: preparatory processing, general-purpose NLP tasks, and problem-dependent tasks.

For example, the raw input may be a PDF document, a scanned page, or even recorded speech.
The task of the preparatory processing is to convert the raw input into a stream of text,
possibly labeling the internal text zones such as paragraphs, columns, or tables.
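
A minimal sketch of zone labeling on an already extracted text stream. The blank-line separator, the "two or more spaces means a table row" heuristic, and the name `label_zones` are assumptions made for illustration; real PDF, OCR, or speech front ends are far more involved.

```python
import re

def label_zones(text):
    """Split a text stream into zones and guess a label for each one."""
    zones = []
    # Assumption: zones are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Assumption: a tab or a run of 2+ spaces between non-space
        # characters makes a line look like a table row.
        row_like = [ln for ln in lines if re.search(r"\S(\t| {2,})\S", ln)]
        is_table = bool(lines) and len(row_like) == len(lines)
        zones.append(("table" if is_table else "paragraph", block))
    return zones

sample = (
    "Text mining depends on preprocessing.\n"
    "It turns raw input into structure.\n"
    "\n"
    "Tag      Example\n"
    "Noun     document\n"
    "Verb     extract\n"
)
zones = label_zones(sample)
for label, block in zones:
    print(label, "|", block.splitlines()[0])
```

The labels are only guesses attached to each zone; later stages can use or ignore them.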

Pre-Processing Techniques – Task Oriented Approach
It is sometimes also possible for the preparatory processing to extract some
document-level fields such as <Author> or <Title> in cases in which the visual
position of the fields allows their identification.
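
As a rough illustration of position-based field extraction, the sketch below assumes a first page whose first non-empty line is the title and whose second non-empty line names the author. Both positional rules and the function name `extract_fields` are hypothetical; actual layouts vary widely.

```python
def extract_fields(page_text):
    """Guess <Title> and <Author> from the position of lines on a first page."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    fields = {}
    if lines:
        # Assumption: the first non-empty line is the title.
        fields["Title"] = lines[0]
    if len(lines) > 1:
        # Assumption: the author line sits directly below the title,
        # optionally prefixed with "by".
        author = lines[1]
        if author.lower().startswith("by "):
            author = author[3:].strip()
        fields["Author"] = author
    return fields

# Made-up sample page used only for this illustration.
page = "\n  Preprocessing for Text Mining\n\n  by A. Writer\n\nBody text starts here.\n"
fields = extract_fields(page)
print(fields)
```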

The general-purpose NLP tasks process text documents using general knowledge about
natural language. The tasks may include tokenization, morphological analysis, POS tagging,
and syntactic parsing (either shallow or deep). The tasks are general purpose in the sense
that their output is not specific to any particular problem.

The output is rarely relevant to the end user and is typically employed for further
problem-dependent processing. Domain-related knowledge, however, can often enhance the
performance of the general-purpose NLP tasks and is often used at different levels of
processing.

Parsing
Parsing means assigning a structure to a sequence of text. Syntactic parsing analyses the
words in a sentence for grammar and arranges them in a manner that shows the relationships
among the words.

Pre-Processing Techniques – General Purpose NLP Task
It is currently an orthodox opinion that language processing in humans cannot be separated
into independent components. Various experiments in psycholinguistics clearly
demonstrate that the different stages of analysis – phonetic, morphological, syntactical,
semantical, and pragmatical – occur simultaneously and depend on each other.
The precise algorithms of human language processing are unknown, however, and although
several systems do try to combine the stages into a coherent single process, a completely
satisfactory solution has not yet been achieved. Thus, most text understanding systems
employ the traditional divide-and-conquer strategy, separating the whole problem into
several subtasks and solving them independently.

1. Tokenization
The approach most frequently found in text mining systems involves breaking the text into
sentences and words, which is called tokenization. The main challenge in identifying
sentence boundaries in the English language is distinguishing between a period that signals
the end of a sentence and a period that is part of a previous token like Mr., Dr., and so on. It
is common for the tokenizer also to extract token features such as capitalization, the
inclusion of digits, punctuation, special characters, and so on.
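
The sentence-boundary problem and the token features can be sketched as follows. The abbreviation list is a tiny hand-picked assumption, and the function names are hypothetical; production tokenizers use much richer rules or trained models.

```python
import re

# Assumption: a tiny, hand-picked abbreviation list for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof."}

def split_sentences(text):
    """End a sentence at ., ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Break a sentence into tokens, keeping known abbreviations whole."""
    tokens = []
    for word in sentence.split():
        if word.lower() in ABBREVIATIONS:
            tokens.append(word)
        else:
            # Word characters (with inner apostrophes) or single punctuation marks.
            tokens.extend(re.findall(r"\w+(?:'\w+)*|[^\w\s]", word))
    return tokens

def token_features(token):
    """Simple surface features of the kind a tokenizer can attach to a token."""
    return {
        "capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_punct": all(not ch.isalnum() for ch in token),
    }

text = "Dr. Smith met Mr. Jones in 2019. They discussed text mining!"
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```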
Pre-Processing Techniques – General Purpose NLP Task
2. Part of Speech Tagging
POS tagging is the annotation of words with the appropriate POS tags based on the context
in which they appear. POS tags divide words into categories based on the role they play
in the sentence in which they appear. POS tags provide information about the semantic
content of a word. Nouns usually denote “tangible and intangible things,” whereas
prepositions express relationships between “things.”

Most POS tag sets make use of the same basic categories. The most common set
of tags contains seven different tags (Article, Noun, Verb, Adjective, Preposition,
Number, and Proper Noun). Some systems contain a much more elaborate set of
tags. For example, the complete Brown Corpus tag set has no fewer than 87 basic tags.
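
To make the role of context concrete, here is a toy tagger over the seven-tag set. The lexicon and the single disambiguation rule are assumptions invented for this sketch; it is not how any particular tagger works.

```python
# Assumption: a toy lexicon listing the possible tags for each known word.
LEXICON = {
    "the": ["Article"], "a": ["Article"],
    "in": ["Preposition"], "of": ["Preposition"],
    "large": ["Adjective"], "has": ["Verb"],
    "flight": ["Noun"], "book": ["Noun", "Verb"],
}

def pos_tag(tokens):
    """Tag each token, using the previous tag to resolve Noun/Verb ambiguity."""
    tagged = []
    previous_tag = None
    for position, token in enumerate(tokens):
        word = token.lower()
        options = LEXICON.get(word)
        if word.isdigit():
            tag = "Number"
        elif options and len(options) == 1:
            tag = options[0]
        elif options:
            # Toy context rule for Noun/Verb words: after an article or
            # adjective choose Noun, otherwise choose Verb.
            tag = "Noun" if previous_tag in ("Article", "Adjective") else "Verb"
        elif token[0].isupper() and position > 0:
            tag = "Proper Noun"   # capitalized, not sentence-initial
        else:
            tag = "Noun"          # fallback guess for unknown words
        tagged.append((token, tag))
        previous_tag = tag
    return tagged

first = pos_tag(["Book", "a", "flight", "in", "Paris"])
second = pos_tag(["The", "large", "book", "has", "2", "chapters"])
print(first)
print(second)
```

The word "book" receives a different tag in each sentence, which is the point of tagging by context.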

Pre-Processing Techniques – General Purpose NLP Task
3. Syntactical Parsing

Syntactical parsing components perform a full syntactical analysis of sentences according to
a certain grammar theory. The basic division is between constituency and dependency grammars.
Constituency grammars describe the syntactical structure of sentences in terms of
recursively built phrases, that is, sequences of syntactically grouped elements. Most
constituency grammars distinguish between noun phrases, verb phrases, prepositional phrases,
adjective phrases, and clauses. Each phrase may consist of zero or more smaller phrases or
words according to the rules of the grammar. Additionally, the syntactical structure of
sentences includes the roles of different phrases. Thus, a noun phrase may be labelled as the
subject of the sentence, its direct object, or the complement.
Dependency grammars, on the other hand, do not recognize the constituents as separate
linguistic units but focus instead on the direct relations between words. A typical
dependency analysis of a sentence consists of a labeled directed acyclic graph (DAG) with
words for nodes and specific relationships (dependencies) for edges. For instance, the subject
and direct object nouns of a typical sentence depend on the main verb, an adjective depends
on the noun it modifies, and so on. Usually, the phrases can be recovered from a dependency
analysis: they are the connected components of the sentence graph.
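
The two views can be compared on one hand-built example. The sentence, the labels, and the data layout (nested tuples for constituents, index triples for dependencies) are assumptions for illustration; phrase recovery is approximated here as a head word together with everything that depends on it.

```python
# Constituency analysis as nested tuples: (label, child, child, ...).
constituency = (
    "S",
    ("NP", ("Article", "the"), ("Noun", "analyst")),
    ("VP", ("Verb", "reads"),
           ("NP", ("Article", "the"), ("Adjective", "long"), ("Noun", "report"))),
)

def leaves(tree):
    """Return the words covered by a constituency node, left to right."""
    if isinstance(tree[1], str):          # pre-terminal: (tag, word)
        return [tree[1]]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def phrases(tree, label):
    """Collect the word sequences of all phrases carrying the given label."""
    if isinstance(tree[1], str):
        return []
    found = [leaves(tree)] if tree[0] == label else []
    for child in tree[1:]:
        found.extend(phrases(child, label))
    return found

# Dependency analysis of the same sentence: (head index, relation, dependent index).
words = ["the", "analyst", "reads", "the", "long", "report"]
dependencies = [
    (2, "subject", 1), (2, "direct_object", 5),
    (1, "determiner", 0), (5, "determiner", 3), (5, "modifier", 4),
]

def subtree(head):
    """Indices of a word plus everything depending on it, directly or not."""
    covered = {head}
    for h, _, d in dependencies:
        if h == head:
            covered |= set(subtree(d))
    return sorted(covered)

noun_phrases = phrases(constituency, "NP")
object_phrase = [words[i] for i in subtree(5)]
print(noun_phrases)
print(object_phrase)
```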

Pre-Processing Techniques – General Purpose NLP Task
4. Shallow Parsing

Instead of providing a complete analysis (a parse) of a whole sentence, shallow parsers
produce only parts that are easy and unambiguous. Typically, small and simple noun and
verb phrases are generated, whereas more complex clauses are not formed. Similarly, the most
prominent dependencies might be formed, but unclear and ambiguous ones are left
unresolved.

For the purposes of information extraction, shallow parsing is usually sufficient and
therefore preferable to full analysis because of its far greater speed and robustness.
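
A shallow parser of this kind can be sketched as a noun-phrase chunker over POS-tagged tokens. The pattern "optional Article, any Adjectives, one or more nouns" and the name `chunk_noun_phrases` are assumptions chosen for illustration; everything that does not fit is deliberately left unanalysed.

```python
def chunk_noun_phrases(tagged):
    """Group Article? Adjective* Noun+ runs into NP chunks; leave the rest flat."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "Article":
            j += 1
        while j < len(tagged) and tagged[j][1] == "Adjective":
            j += 1
        first_noun = j
        while j < len(tagged) and tagged[j][1] in ("Noun", "Proper Noun"):
            j += 1
        if j > first_noun:
            # At least one noun was found, so emit a simple noun phrase.
            chunks.append(("NP", [word for word, _ in tagged[i:j]]))
            i = j
        else:
            # No simple noun phrase starts here: keep the token unattached.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

tagged = [("the", "Article"), ("analyst", "Noun"), ("reads", "Verb"),
          ("the", "Article"), ("long", "Adjective"), ("report", "Noun"),
          ("in", "Preposition"), ("Paris", "Proper Noun")]
chunks = chunk_noun_phrases(tagged)
print(chunks)
```

The chunker makes a single left-to-right pass and never builds clauses, which reflects the speed and robustness trade-off described above.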

Pre-Processing Techniques – Problem-Dependent Tasks (Text Categorization and Information
Extraction)

The final stages of document structuring create representations that are meaningful for
either later (and more sophisticated) processing phases or direct interaction of the text
mining system user. The text mining techniques normally expect the documents to be
represented as sets of features, which are considered to be structureless atomic entities
possibly organized into a taxonomy – an IsA-hierarchy.

Text categorization and IE are both popularly referred to as “tagging” (because of the
tag-formatted structures they introduce in a processed document), and they enable one to
obtain formal, structured representations of documents. Text categorization and IE enable
users to move from a “machine-readable” representation of the documents to a
“machine-understandable” form of the documents.

Pre-Processing Techniques – Problem-Dependent Tasks (Text Categorization and Information
Extraction)

Text categorization (sometimes called text classification) tasks tag each document
with a small number of concepts or keywords. The set of all possible concepts or keywords
is usually manually prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
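
A minimal sketch of tagging against a closed, hand-made concept set with a manually prepared hierarchy. The concepts, trigger words, parent links, and the overlap-count scoring are all assumptions for illustration; practical categorizers are usually trained classifiers.

```python
# Assumption: a hand-made closed concept set; each concept lists trigger words.
CONCEPTS = {
    "football": {"goal", "striker", "league"},
    "tennis": {"serve", "racket", "set"},
    "banking": {"loan", "interest", "deposit"},
}
# Assumption: a manually prepared IsA hierarchy, child concept -> parent concept.
IS_A = {"football": "sports", "tennis": "sports", "banking": "finance"}

def categorize(document, max_tags=2):
    """Tag a document with at most max_tags concepts plus their parents."""
    words = set(document.lower().split())
    scores = {c: len(words & triggers) for c, triggers in CONCEPTS.items()}
    # Keep concepts with at least one trigger word, best score first.
    ranked = sorted((c for c in scores if scores[c] > 0),
                    key=lambda c: (-scores[c], c))
    tags = ranked[:max_tags]
    parents = sorted({IS_A[c] for c in tags if c in IS_A})
    return tags, parents

doc = "the striker scored a late goal and the league table changed"
tags, parents = categorize(doc)
print(tags, parents)
```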

IE must often be distinguished from information retrieval or what is more informally
called “search.” Information retrieval returns documents that match a given query but still
requires the user to read through these documents to locate the relevant information. IE, on
the other hand, aims at pinpointing the relevant information and presenting it in a
structured format, typically a tabular one. For analysts and other knowledge workers, IE can
save valuable time by dramatically speeding up discovery-type work.
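
The contrast can be sketched on three made-up sentences: retrieval hands back whole documents, while extraction returns table-like rows. The sample documents, the single hand-written pattern, and the function names are hypothetical; real IE systems combine many patterns or learned models.

```python
import re

# Made-up sample documents used only for this illustration.
documents = [
    "Acme Corp appointed Jane Doe as CEO in 2021.",
    "The weather was mild across the region.",
    "Globex Inc appointed John Roe as CFO in 2019.",
]

def retrieve(query, docs):
    """Information retrieval: return whole documents containing every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d.lower() for t in terms)]

# Assumption: one hand-written pattern of the form
# "<Company> appointed <Person> as <ROLE> in <Year>".
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as (?P<role>[A-Z]+) in (?P<year>\d{4})"
)

def extract(docs):
    """Information extraction: return structured rows instead of documents."""
    return [m.groupdict() for d in docs for m in PATTERN.finditer(d)]

hits = retrieve("appointed", documents)
rows = extract(documents)
print(len(hits), "documents still to read")
for row in rows:
    print(row)
```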

https://www.youtube.com/watch?v=5ctbvkAMQO4&t=172s

What is NLP
https://www.youtube.com/watch?v=CMrHM8a3hqw
