Lecture 8 - Pre Processing Techniques

Uploaded by

trisim mathur

Advance Text, Social and Media Analytics –

Preprocessing Techniques

Pre-Processing Techniques
Text mining is so dependent on the various preprocessing techniques that infer or
extract structured representations from raw unstructured data sources, or do
both, that one might even say text mining is to a degree defined by these elaborate
preparatory techniques.
Preprocessing techniques can be categorized in two ways.

According to their task
Task-oriented preprocessing approaches envision the process of creating a structured
document representation in terms of tasks and subtasks and usually involve some sort
of preparatory goal or problem that needs to be solved, such as extracting titles and
authors from a PDF document. NLP, text categorization, and IE techniques can be used.

According to the algorithms and formal frameworks that they use
These approaches rely on formal methods for analysing complex phenomena that can
also be applied to natural language texts. Such approaches include classification
schemes, probabilistic models, and rule-based systems.

Categorizing text mining preprocessing techniques by either their task orientation or
the formal frameworks from which they derive does not mean that “mixing and
matching” techniques from either category for a given text mining application is
prohibited.
Each of the preprocessing techniques starts with a partially structured document and proceeds
to enrich the structure by refining the present features and adding new ones. In the end, the
most advanced and meaning-representing features are used for the text mining, whereas the
rest are discarded.
Task Oriented Approach

Informally, the task of the document structuring process is to take the most
“raw” representation and convert it to the representation through which the essence
(i.e., the meaning) of the document surfaces.

- divide-and-conquer strategy
The subtasks can be divided broadly into three classes – preparatory processing, general-
purpose NLP tasks, and problem-dependent tasks.

For example, the raw input may be a PDF document, a scanned page, or even recorded
speech. The task of the preparatory processing is to convert the raw input into a
stream of text, possibly labeling the internal text zones such as paragraphs,
columns, or tables.
Pre-Processing Techniques – Task Oriented Approach

It is sometimes also possible for the preparatory processing to extract some
document-level fields such as <Author> or <Title> in cases in which the visual
position of the fields allows their identification.

The general-purpose NLP tasks process text documents using the general
knowledge of the natural language. The tasks may include tokenization,
morphological analysis, POS tagging, and syntactic parsing – either shallow
or deep. The tasks are general purpose in the sense that their output is not specific
to any particular problem.

The output is rarely relevant to the end user and is typically employed for
further problem-dependent processing. Domain-related knowledge, however,
can often enhance the performance of the general-purpose NLP tasks and is often
used at different levels of processing.

Parsing

Parsing essentially means assigning a structure to a sequence of text. Syntactic
parsing involves analysing the words in a sentence for grammar and arranging them
in a manner that shows the relationships among the words.

Pre-Processing Techniques – General Purpose NLP Task
It is currently an orthodox opinion that language processing in humans cannot be separated
into independent components. Various experiments in psycholinguistics clearly
demonstrate that the different stages of analysis – phonetic, morphological, syntactical,
semantical, and pragmatical – occur simultaneously and depend on each other.
The precise algorithms of human language processing are unknown, however, and although
several systems do try to combine the stages into a coherent single process, a completely
satisfactory solution has not yet been achieved. Thus, most text understanding systems
employ the traditional divide-and-conquer strategy, separating the whole problem into
several subtasks and solving them independently.

1. Tokenization
The approach most frequently found in text mining systems involves breaking the text into
sentences and words, which is called tokenization. The main challenge in identifying
sentence boundaries in the English language is distinguishing between a period that signals
the end of a sentence and a period that is part of a previous token like Mr., Dr., and so on. It
is common for the tokenizer also to extract token features such as capitalization, the
inclusion of digits, punctuation, special characters, and so on.
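
The tokenization step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the abbreviation list, the function names, and the choice of features are all assumptions made for the example.

```python
import re

# Illustrative abbreviation list (an assumption); real systems use far larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def split_sentences(text):
    """Split text into sentences. A word ending in '.', '!' or '?' closes a
    sentence unless the word is a known abbreviation such as 'Dr.'."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Break a sentence into word and punctuation tokens, keeping known
    abbreviations as single tokens."""
    tokens = []
    for word in sentence.split():
        if word.lower() in ABBREVIATIONS:
            tokens.append(word)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", word))
    return tokens

def token_features(token):
    """Extract simple surface features of a single token."""
    return {
        "capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_punctuation": all(not ch.isalnum() for ch in token),
    }

text = "Dr. Smith arrived at 9am. Mr. Jones left."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The period after "Dr." and "Mr." is kept as part of the token, whereas the period after "9am" becomes a separate token that marks the sentence boundary.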
2. Part of Speech Tagging
POS tagging is the annotation of words with the appropriate POS tags based on the context
in which they appear. POS tags divide words into categories based on the role they play
in the sentence in which they appear. POS tags provide information about the semantic
content of a word. Nouns usually denote “tangible and intangible things,” whereas
prepositions express relationships between “things.”

Most POS tag sets make use of the same basic categories. The most common set
of tags contains seven different tags (Article, Noun, Verb, Adjective, Preposition,
Number, and Proper Noun). Some systems contain a much more elaborate set of
tags. For example, the complete Brown Corpus tag set has no fewer than 87 basic tags.
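
To make the idea of context-dependent tagging concrete, here is a toy tagger that uses the seven basic tags listed above. The lexicon, the fallback rules, and the single disambiguation rule are assumptions invented for this sketch; real taggers are typically trained statistically on annotated corpora.

```python
# Toy lexicon (an assumption): each word maps to the tags it may take.
LEXICON = {
    "the": ["Article"], "a": ["Article"],
    "on": ["Preposition"], "in": ["Preposition"],
    "red": ["Adjective"],
    "book": ["Noun", "Verb"],   # ambiguous: "the book" vs. "book a flight"
    "table": ["Noun"],
    "reads": ["Verb"],
}

def tag(tokens):
    """Assign one tag per token, using the previous tag as context
    to resolve words that the lexicon lists with several tags."""
    tags = []
    for token in tokens:
        word = token.lower()
        previous = tags[-1] if tags else None
        if word.isdigit():
            choice = "Number"
        elif word in LEXICON:
            candidates = LEXICON[word]
            if len(candidates) == 1:
                choice = candidates[0]
            elif previous in ("Article", "Adjective") and "Noun" in candidates:
                choice = "Noun"      # "the book", "red book"
            elif "Verb" in candidates:
                choice = "Verb"      # "Book a flight"
            else:
                choice = candidates[0]
        elif token[:1].isupper():
            choice = "Proper Noun"   # unknown capitalized word
        else:
            choice = "Noun"          # default for unknown words
        tags.append(choice)
    return tags

print(tag(["Mary", "reads", "the", "red", "book"]))
print(tag(["Book", "2", "flights"]))
```

The word "book" receives a different tag in each sentence, which is the sense in which tags depend on the context in which a word appears.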

3. Syntactical Parsing
Syntactical parsing components perform a full syntactical analysis of sentences
according to a certain grammar theory. The basic division is between the constituency and
dependency grammar.
Constituency grammars describe the syntactical structure of sentences in terms of
recursively built phrases – sequences of syntactically grouped elements. Most constituency
grammars distinguish between noun phrases, verb phrases, prepositional phrases, adjective
phrases, and clauses. Each phrase may consist of zero or more smaller phrases or words
according to the rules of the grammar. Additionally, the syntactical structure of sentences
includes the roles of different phrases. Thus, a noun phrase may be labelled as the subject
of the sentence, its direct object, or the complement.
Dependency grammars, on the other hand, do not recognize the constituents as separate
linguistic units but focus instead on the direct relations between words. A typical
dependency analysis of a sentence consists of a labeled directed acyclic graph (DAG) with words for nodes and
specific relationships (dependencies) for edges. For instance, a subject and direct object
nouns of a typical sentence depend on the main verb, an adjective depends on the noun it
modifies, and so on. Usually, the phrases can be recovered from a dependency analysis –
they are the connected components of the sentence graph.
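
A dependency analysis of this kind can be represented directly as a list of labeled edges. The sketch below is illustrative only: the sentence, the relation names, and the helper functions are assumptions, and a phrase is recovered here as a head word together with everything that depends on it, directly or indirectly.

```python
# Words of the sentence; the list index serves as the node identifier.
words = ["The", "dog", "chased", "a", "small", "cat"]

# Labeled edges as (head_index, relation, dependent_index).
# The relation names are illustrative assumptions.
dependencies = [
    (2, "subject", 1),        # dog   <- chased
    (2, "direct_object", 5),  # cat   <- chased
    (1, "determiner", 0),     # The   <- dog
    (5, "determiner", 3),     # a     <- cat
    (5, "modifier", 4),       # small <- cat
]

def dependents(head, edges):
    """Return the indices of the words that depend directly on `head`."""
    return [dep for h, _, dep in edges if h == head]

def phrase(head, words, edges):
    """Recover the phrase headed by `head`: the head plus all words
    reachable from it by following dependency edges."""
    indices, stack = [head], [head]
    while stack:
        node = stack.pop()
        for dep in dependents(node, edges):
            indices.append(dep)
            stack.append(dep)
    return " ".join(words[i] for i in sorted(indices))

print(phrase(1, words, dependencies))  # noun phrase headed by "dog"
print(phrase(5, words, dependencies))  # noun phrase headed by "cat"
```

Starting from the main verb (index 2) yields the whole sentence, since every other word depends on it directly or through another word.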

4. Shallow Parsing
Instead of providing a complete analysis (a parse) of a whole sentence, shallow parsers
produce only parts that are easy and unambiguous. Typically, small and simple noun and
verb phrases are generated, whereas more complex clauses are not formed. Similarly, most
prominent dependencies might be formed, but unclear and ambiguous ones are left
unresolved.

For the purposes of information extraction, shallow parsing is usually sufficient and
therefore preferable to full analysis because of its far greater speed and robustness.
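
As a rough illustration of shallow parsing, the following sketch groups POS-tagged tokens into simple noun phrases and leaves everything else unanalysed. The chunk pattern (an optional article, any number of adjectives, then one or more nouns) and the function name are assumptions chosen for the example.

```python
def noun_phrase_chunks(tagged):
    """Find simple noun phrases in a list of (word, tag) pairs.

    Pattern: optional Article, any Adjectives, then one or more
    Noun / Proper Noun tokens. Anything else is skipped.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "Article":
            j += 1
        while j < n and tagged[j][1] == "Adjective":
            j += 1
        k = j
        while k < n and tagged[k][1] in ("Noun", "Proper Noun"):
            k += 1
        if k > j:  # at least one noun was found: emit the chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:      # no noun phrase starts here: move on
            i += 1
    return chunks

tagged = [
    ("The", "Article"), ("red", "Adjective"), ("book", "Noun"),
    ("lies", "Verb"), ("on", "Preposition"),
    ("the", "Article"), ("table", "Noun"),
]
print(noun_phrase_chunks(tagged))
```

The chunker makes a single left-to-right pass and never builds clauses or attaches the prepositional phrase, which is where the speed and robustness of shallow parsing come from.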

Pre-Processing Techniques – Problem-Dependent Tasks (Text Categorization and Information
Extraction)

The final stages of document structuring create representations that are meaningful for
either later (and more sophisticated) processing phases or direct interaction of the text
mining system user. The text mining techniques normally expect the documents to be
represented as sets of features, which are considered to be structureless atomic entities
possibly organized into a taxonomy – an IsA-hierarchy.

Text categorization and information extraction (IE) are both popularly referred to as
“tagging” (because of the tag-formatted structures they introduce in a processed
document), and they enable one to obtain formal, structured representations of
documents. Text categorization and IE enable users to move from a “machine-readable”
representation of the documents to a “machine-understandable” form of the documents.


Text categorization (sometimes called text classification) tasks tag each document
with a small number of concepts or keywords. The set of all possible concepts or keywords
is usually manually prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
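
The sketch below shows one very simple way to tag documents from a closed, manually prepared concept set with a manually prepared hierarchy. The concepts, keywords, hierarchy, and keyword-counting score are all assumptions for illustration; practical categorizers are usually learned from labeled examples.

```python
import re

# Closed, manually prepared concept set (illustrative assumption).
CONCEPT_KEYWORDS = {
    "finance": {"bank", "loan", "interest"},
    "sports": {"match", "goal", "team"},
}

# Manually prepared IsA hierarchy: concept -> parent concept (assumption).
PARENT = {"finance": "business", "sports": "leisure"}

def categorize(document, max_tags=2):
    """Tag a document with up to `max_tags` concepts, ranked by how many
    tokens of the document match each concept's keywords."""
    tokens = re.findall(r"[a-z]+", document.lower())
    scores = {
        concept: sum(token in keywords for token in tokens)
        for concept, keywords in CONCEPT_KEYWORDS.items()
    }
    matched = [concept for concept, score in scores.items() if score > 0]
    matched.sort(key=lambda concept: (-scores[concept], concept))
    return matched[:max_tags]

def with_parent(concept):
    """Return the concept followed by its parent in the hierarchy, if any."""
    parent = PARENT.get(concept)
    return [concept, parent] if parent else [concept]

tags = categorize("The bank raised the interest rate on every loan.")
print(tags, [with_parent(t) for t in tags])
```

Because the concept set is closed, a document that matches no keywords simply receives no tags.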

IE must often be distinguished from information retrieval or what is more informally
called “search.” Information retrieval returns documents that match a given query but
still require the user to read through these documents to locate the relevant
information. IE, on the other hand, aims at pinpointing the relevant information and
presenting it in a structured format – typically in a tabular format. For analysts and
other knowledge workers, IE can save valuable time by dramatically speeding up
discovery-type work.
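
The contrast with search can be illustrated with a small pattern-based extractor that turns free text into table rows. The sample text, the single "joined" pattern, and the field names are assumptions made up for this sketch; real IE systems combine many patterns or learned models.

```python
import re

# One hand-written extraction pattern (an illustrative assumption):
# "<First Last> joined <Company> in <year>"
HIRING_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) in (?P<year>\d{4})"
)

def extract_hirings(text):
    """Return one row (a dict) per hiring event found in the text."""
    return [match.groupdict() for match in HIRING_PATTERN.finditer(text)]

text = (
    "Alice Smith joined Acme Corp in 2019. The weather was fine. "
    "Bob Jones joined Initech in 2021."
)

rows = extract_hirings(text)

# Present the extracted facts in a tabular format.
print(f"{'person':<12} {'company':<10} {'year'}")
for row in rows:
    print(f"{row['person']:<12} {row['company']:<10} {row['year']}")
```

An information retrieval system would return the whole passage for a query such as "joined"; the extractor instead returns only the person, company, and year fields, and ignores the irrelevant sentence.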

https://www.youtube.com/watch?v=5ctbvkAMQO4&t=172s

What is NLP
https://www.youtube.com/watch?v=CMrHM8a3hqw
