Lecture 8 - Pre Processing Techniques
Pre-Processing Techniques
Text mining depends so heavily on the various preprocessing techniques that infer or
extract structured representations from raw unstructured data sources (or do both)
that one might even say text mining is, to a degree, defined by these elaborate
preparatory techniques.
According to their task
Task-oriented preprocessing approaches envision the process of creating a structured
document representation in terms of tasks and subtasks, and usually involve some sort of
preparatory goal or problem that needs to be solved, such as extracting titles and authors
from a PDF document.
According to the algorithms and formal frameworks that they use
These approaches rely on methods for analysing complex phenomena that can also be applied
to natural language texts. Such approaches include classification schemes, probabilistic
models, and rule-based systems approaches.
NLP, Text Categorization and IE techniques can be used.
Categorizing text mining preprocessing techniques by either their task orientation or
the formal frameworks from which they derive does not mean that “mixing and matching”
techniques from either category for a given text mining application is prohibited.
Each of the preprocessing techniques starts with a partially structured document and proceeds
to enrich the structure by refining the present features and adding new ones. In the end, the
most advanced and meaning-representing features are used for the text mining, whereas the
rest are discarded.
Task Oriented Approach
Informally, the task of the document structuring process is to take the most
“raw” representation and convert it to the representation through which the essence
(i.e., the meaning) of the document surfaces.
Following a divide-and-conquer strategy, this task is broken into subtasks, which can be
divided broadly into three classes: preparatory processing, general-purpose NLP tasks, and
problem-dependent tasks.
Pre-Processing Techniques – Task Oriented Approach
It is sometimes also possible for the preparatory processing to extract some
document-level fields such as <Author> or <Title> in cases in which the visual
position of the fields allows their identification.
The general-purpose NLP tasks process text documents using the general
knowledge of the natural language. The tasks may include tokenization,
morphological analysis, POS tagging, and syntactic parsing – either shallow
or deep. The tasks are general purpose in the sense that their output is not specific
to any particular problem.
The output is rarely relevant to the end user and is typically used for further
problem-dependent processing. Domain-related knowledge, however, can often enhance
the performance of the general-purpose NLP tasks and is often used at different
levels of processing.
Parsing
Parsing essentially means assigning a structure to a sequence of text. Syntactic
parsing analyses the words in a sentence for grammar and arranges them in a manner
that shows the relationships among the words.
Pre-Processing Techniques – General Purpose NLP Task
It is currently an orthodox opinion that language processing in humans cannot be separated
into independent components. Various experiments in psycholinguistics clearly
demonstrate that the different stages of analysis – phonetic, morphological, syntactical,
semantical, and pragmatical – occur simultaneously and depend on each other.
The precise algorithms of human language processing are unknown, however, and although
several systems do try to combine the stages into a coherent single process, a completely
satisfactory solution has not yet been achieved. Thus, most text understanding systems
employ the traditional divide-and-conquer strategy, separating the whole problem into
several subtasks and solving them independently.
1. Tokenization
The approach most frequently found in text mining systems involves breaking the text into
sentences and words, which is called tokenization. The main challenge in identifying
sentence boundaries in the English language is distinguishing between a period that signals
the end of a sentence and a period that is part of a previous token like Mr., Dr., and so on. It
is common for the tokenizer also to extract token features such as capitalization, the
inclusion of digits, punctuation, special characters, and so on.
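The tokenization step described above can be illustrated with a minimal sketch. It is not a production tokenizer: the abbreviation list, the function names, and the three features chosen are assumptions made for this example only.

```python
import re

# Assumed, deliberately tiny abbreviation list; a real system would need
# a far larger one (or a trained sentence-boundary model).
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def split_sentences(text):
    """Split on ., ! or ? unless the period belongs to a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Break a sentence into word and punctuation tokens."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep "Dr." as a single token
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

def token_features(token):
    """Extract a few simple surface features of a token."""
    return {
        "capitalized": token[:1].isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "is_punct": all(not c.isalnum() for c in token),
    }

text = "Dr. Smith met Mr. Jones in 2019. They talked."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The period after "Dr." and "Mr." is kept inside the token, whereas the period after "2019" is treated as a sentence boundary and emitted as a separate punctuation token.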
2. Part of Speech Tagging
POS tagging is the annotation of words with the appropriate POS tags based on the context
in which they appear. POS tags divide words into categories based on the role they play
in the sentence in which they appear. POS tags provide information about the semantic
content of a word. Nouns usually denote “tangible and intangible things,” whereas
prepositions express relationships between “things.”
Most POS tag sets make use of the same basic categories. The most common set
of tags contains seven different tags (Article, Noun, Verb, Adjective, Preposition,
Number, and Proper Noun). Some systems contain a much more elaborate set of
tags. For example, the complete Brown Corpus tag set has no fewer than 87 basic tags.
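To make the role of context concrete, here is a toy tagger that uses the seven basic tags listed above. It is only a sketch: the lexicon, the single disambiguation rule, and the fallback for unknown words are assumptions for illustration, whereas practical taggers are usually statistical or rule-based systems trained on annotated corpora.

```python
# Hypothetical mini-lexicon: each word maps to the tags it may take.
LEXICON = {
    "the": ("Article",), "a": ("Article",), "an": ("Article",),
    "on": ("Preposition",), "in": ("Preposition",), "of": ("Preposition",),
    "red": ("Adjective",), "old": ("Adjective",),
    "flight": ("Noun",), "table": ("Noun",),
    "book": ("Noun", "Verb"),  # ambiguous: resolved from context
}

def pos_tag(tokens):
    """Return (token, tag) pairs using lexicon lookup plus one context rule."""
    tagged = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower())
        if token.isdigit():
            tag = "Number"
        elif candidates is None:
            # Unknown word: guess Proper Noun if capitalized mid-sentence.
            tag = "Proper Noun" if token[0].isupper() and i > 0 else "Noun"
        elif len(candidates) == 1:
            tag = candidates[0]
        else:
            # Context rule (assumed): after an article or adjective the
            # ambiguous word is a noun, otherwise treat it as a verb.
            previous_tag = tagged[-1][1] if tagged else None
            tag = "Noun" if previous_tag in ("Article", "Adjective") else "Verb"
        tagged.append((token, tag))
    return tagged

print(pos_tag(["Book", "the", "flight"]))
print(pos_tag(["The", "book", "on", "the", "table"]))
```

The word "book" receives a different tag in each sentence, which is the sense in which tags depend on the context in which a word appears.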
3. Syntactical Parsing
4. Shallow Parsing
For the purposes of information extraction, shallow parsing is usually sufficient and
therefore preferable to full analysis because of its far greater speed and robustness.
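One common form of shallow parsing is noun-phrase chunking over POS-tagged tokens. The sketch below assumes the input is already tagged with the basic tags named earlier and uses an assumed pattern (optional article, any adjectives, one or more nouns); it illustrates the idea rather than any particular system.

```python
NOUN_TAGS = ("Noun", "Proper Noun")

def chunk_noun_phrases(tagged):
    """Find flat noun phrases matching: Article? Adjective* Noun+.

    `tagged` is a list of (token, tag) pairs. No nested structure is
    built, which is what keeps this kind of analysis fast and robust.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "Article":
            j += 1
        while j < n and tagged[j][1] == "Adjective":
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:  # at least one noun was found, so emit a chunk
            chunks.append(" ".join(token for token, _ in tagged[i:k]))
            i = k
        else:      # no noun phrase starts here; move on one token
            i += 1
    return chunks

tagged_sentence = [
    ("The", "Article"), ("red", "Adjective"), ("book", "Noun"),
    ("is", "Verb"), ("on", "Preposition"),
    ("the", "Article"), ("table", "Noun"),
]
print(chunk_noun_phrases(tagged_sentence))
```

The chunker makes a single left-to-right pass and simply skips anything it cannot match, so it never fails on an ungrammatical sentence the way a full parser might.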
Pre-Processing Techniques – Problem-Dependent Tasks (Text Categorization and Information
Extraction)
The final stages of document structuring create representations that are meaningful for
either later (and more sophisticated) processing phases or direct interaction of the text
mining system user. The text mining techniques normally expect the documents to be
represented as sets of features, which are considered to be structureless atomic entities
possibly organized into a taxonomy – an IsA-hierarchy.
Text categorization and information extraction (IE) are both also popularly referred to as
“tagging” (because of the tag-formatted structures they introduce in a processed document),
and they enable one to obtain formal, structured representations of documents. They enable
users to move from a “machine-readable” representation of the documents to a
“machine-understandable” form of the documents.
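The feature-set representation mentioned above, with features optionally organized into an IsA hierarchy, can be sketched as follows. The taxonomy contents and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical IsA taxonomy: each feature maps to its direct parent.
IS_A = {
    "poodle": "dog",
    "dog": "animal",
    "cat": "animal",
}

def ancestors(feature):
    """Return all more general features, nearest parent first."""
    result = []
    while feature in IS_A:
        feature = IS_A[feature]
        result.append(feature)
    return result

def expand_features(features):
    """Add every ancestor so that documents can match at any level."""
    expanded = set(features)
    for feature in features:
        expanded.update(ancestors(feature))
    return expanded

# A document is just a set of atomic features.
document = {"poodle", "cat"}
print(ancestors("poodle"))
print(sorted(expand_features(document)))
```

With the expanded set, a query for documents about "animal" would also retrieve a document whose only explicit feature is "poodle".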
Text categorization (sometimes called text classification) tasks tag each document
with a small number of concepts or keywords. The set of all possible concepts or keywords
is usually manually prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
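As an illustration of tagging documents from a small, closed, manually prepared concept set, the sketch below scores each concept by keyword overlap. Keyword matching is a stand-in chosen for brevity, not the method prescribed here; the concept names, keyword lists, and the max_tags parameter are assumptions.

```python
import re

# Hypothetical closed set of concepts with manually prepared keywords.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "finance": {"bank", "stock", "market"},
    "health": {"doctor", "vaccine", "hospital"},
}

def categorize(text, max_tags=2):
    """Tag a document with at most `max_tags` concepts from the closed set."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        concept: len(words & keywords)
        for concept, keywords in CATEGORY_KEYWORDS.items()
    }
    # Keep only concepts with a match, best score first (ties by name).
    ranked = sorted(
        (concept for concept, score in scores.items() if score > 0),
        key=lambda concept: (-scores[concept], concept),
    )
    return ranked[:max_tags]

print(categorize("The team scored a late goal in the match"))
print(categorize("The bank raised rates as the stock market fell, and one hospital closed"))
```

A trained classifier would normally replace the overlap score, but the closed concept set and the small number of tags per document stay the same.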
https://www.youtube.com/watch?v=5ctbvkAMQO4&t=172s
What is NLP
https://www.youtube.com/watch?v=CMrHM8a3hqw