ABSTRACT
The rapid growth of the World Wide Web has rendered document classification by humans infeasible, which has given impetus to techniques like Data Mining, NLP and Machine Learning for the automatic classification of textual documents. With the high availability of information from diverse sources, classification tasks have attained paramount importance. Automated text classification has come to be regarded as a vital method for managing and processing vast numbers of documents in digital form. This paper provides an insight into the text classification process, its phases and various classifiers. It also aims at comparing and contrasting the available classifiers on the basis of a few criteria, such as time complexity and performance.
Keywords: Data Mining, Natural Language Processing, Classifier, Text Classification, Machine Learning.
I. INTRODUCTION

With the increasing availability of digital documents from diverse sources, text classification is gaining popularity day in and day out. The last few years have seen a mushrooming growth of digital data, and data discovery and data mining have worked together to turn this data into useful information and knowledge [10]. Text mining refers to the process of deriving high-quality information from text. It is conducive to utilizing the information contained in textual documents in various ways, including the discovery of patterns, associations among entities, etc., and this is done through an amalgamation of NLP (Natural Language Processing), Data Mining and Machine Learning techniques.

II. CLASSIFICATION PROCESS

From the perspective of automatic text classification systems, the classification task can be sequenced into phases such as feature extraction, feature reduction and classification.
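As a concrete illustration of this sequencing, the sketch below chains the phases discussed in this section into a single pipeline. The choice of scikit-learn, the toy corpus and all parameter values are assumptions for illustration; the paper itself does not prescribe a particular toolkit.

```python
# Minimal sketch: the phases of the text classification process chained
# into one scikit-learn pipeline (corpus and labels are hypothetical).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "team meeting at noon",
        "win money fast", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

pipeline = Pipeline([
    ("extract", TfidfVectorizer(stop_words="english")),  # feature extraction
    ("reduce", SelectKBest(chi2, k=8)),                  # feature reduction
    ("classify", MultinomialNB()),                       # classification
])
pipeline.fit(docs, labels)
print(pipeline.predict(["buy cheap pills", "meeting report"]))
```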
2.3 Feature Extraction

Feature extraction is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature extraction serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when included in the document representation, increases the classification error on new data. Additional features can be mined from the classifiable text; however, the nature of such features is highly dependent on the nature of the classification to be carried out. If web sites need to be separated into spam and non-spam websites, then the word frequency distribution or the ontology is of little use for the classification, because spammers widely copy and paste mixtures of text from legitimate web sites when creating their spam web sites [2].
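To make the idea of restricting the effective vocabulary concrete, the fragment below builds document vectors over an explicitly chosen subset of terms. The term list and documents are illustrative assumptions, not taken from the paper.

```python
# Sketch: using only a hand-picked subset of terms as features,
# assuming scikit-learn; the vocabulary and documents are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

chosen_terms = ["price", "offer", "meeting", "report"]  # the selected subset
vectorizer = CountVectorizer(vocabulary=chosen_terms)

docs = ["special offer on a low price", "meeting report for the offer"]
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the effective vocabulary
print(matrix.toarray())  # each row represents a document over the chosen terms
```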
2.4 Natural Language Processing

The feature extraction and reduction phases of the text classification process are performed with the help of Natural Language Processing techniques. Linguistic features can be extracted from texts and used as part of their feature vectors [3]. For example, parts of the text that are written in direct speech, the use of different types of declensions, the length of sentences, and the proportions of different parts of speech in sentences (such as noun phrases, preposition phrases or verb phrases) can all be detected and used as a feature vector, or in addition to a word frequency feature vector [4].
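As a small illustration of such linguistic features, the sketch below computes sentence length and part-of-speech proportions. The choice of NLTK, its resource names and the particular feature set are assumptions for illustration only.

```python
# Sketch: deriving linguistic features (sentence length, part-of-speech
# proportions) with NLTK; the library choice is an assumption.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)
tags = [tag for _, tag in nltk.pos_tag(tokens)]

features = {
    "sentence_length": len(tokens),
    # proportions of nouns and verbs among all tagged tokens
    "noun_ratio": sum(t.startswith("NN") for t in tags) / len(tags),
    "verb_ratio": sum(t.startswith("VB") for t in tags) / len(tags),
}
print(features)  # can be appended to a word-frequency feature vector
```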
2.5 Feature Reduction

Feature reduction, a.k.a. dimensionality reduction, is about transforming data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information. Stop words like "a", "the" and "but" are required by the grammar structure of any language but carry no meaning. Likewise, stemming converts different word forms into a similar canonical form. Statistical filtering practices are used to glean those words that have higher statistical significance. The most prominent statistical filtering approaches are: odds ratio, mutual information, cross entropy, information gain, weight of evidence, the χ² test, the correlation coefficient [6], conditional mutual information maxmin [8], and the conformity/uniformity criteria [7]. In simple terms, most formulas give high scores to words that appear frequently within a category and less frequently outside of it (conformity), or the opposite (non-conformity); additionally, higher scores are given to words that appear in most documents of a particular category (uniformity). A sketch of one such filter follows below.
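This is a minimal sketch of statistical filtering using the χ² test named above, assuming scikit-learn; the toy corpus and the value of k are hypothetical.

```python
# Sketch: statistical filtering of features with the chi-squared test,
# assuming scikit-learn; the corpus and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["free money offer", "free offer now",
        "project meeting notes", "meeting notes attached"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# keep the 4 terms that best separate the categories (high conformity)
selector = SelectKBest(chi2, k=4).fit(X, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)
```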
2.6 Classification

With each passing day, the automatic classification of documents into predefined categories is gaining the active attention of many researchers. Supervised, unsupervised and semi-supervised methods are used to classify documents. The last decade has seen unprecedented and rapid progress in this area, including machine learning approaches such as the Bayesian classifier, Decision Trees, K-Nearest Neighbour (KNN), Support Vector Machines (SVMs), Neural Networks and Rocchio's algorithm.

III. CLASSIFIERS

3.1 K-Nearest Neighbour

K-Nearest Neighbours is an elegant supervised machine learning algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). K-NN works on the principle that the points (documents) which are close in the space belong to the same class. A major demerit of the similarity measure used in k-NN is that it uses all features in computing distances, which degrades its performance. In myriad document data sets, only a small fraction of the total vocabulary may be useful in categorizing documents. A probable approach to tackle this problem is to learn weights for different features (or words, in document data) [11]. The proposed Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm is based on the k-NN classification paradigm and can enhance the performance of text classification [12].
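The following is a minimal k-NN sketch over tf-idf vectors with a cosine-based distance, assuming scikit-learn; the corpus, labels and parameter choices (k = 3, cosine metric) are illustrative assumptions.

```python
# Sketch: k-NN text classification over tf-idf vectors with a cosine
# distance, assuming scikit-learn; corpus, labels and k are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["stock prices rose today", "the match ended in a draw",
        "markets fell on rate fears", "the striker scored twice",
        "investors await earnings", "coach praised the defence"]
labels = ["finance", "sport", "finance", "sport", "finance", "sport"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# classify a new document by the majority class of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, labels)
print(knn.predict(vectorizer.transform(["markets fell today"])))
```

Replacing the uniform treatment of features with per-term weights learned from the training set would move this sketch in the direction of the WAKNN idea.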
3.2 Support Vector Machine

Initially, the Support Vector Machine (SVM) was developed for building an optimal binary (2-class) classifier, but the technique was thereafter extended to regression and clustering problems. The working principle of SVM is to find a hyperplane (linear/non-linear) which maximizes the margin. Maximizing the margin is equivalent to:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\,w^{T}w + C\sum_{i=1}^{N}\xi_{i}$$

$$\text{subject to}\quad y_{i}\,(w^{T}x_{i} + b) \ge 1 - \xi_{i},\qquad \xi_{i} \ge 0,\qquad 1 \le i \le N$$
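The sketch below is a minimal instance of the soft-margin formulation above, assuming scikit-learn's LinearSVC; its C parameter plays the role of the penalty constant in the objective, and the data are hypothetical.

```python
# Sketch: a linear soft-margin SVM on tf-idf features, assuming
# scikit-learn; C mirrors the penalty constant in the objective above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["win a free prize now", "quarterly report enclosed",
        "claim your free reward", "minutes of the board meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = non-spam

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

svm = LinearSVC(C=1.0).fit(X, labels)  # larger C penalizes slack more
print(svm.predict(vectorizer.transform(["free prize meeting"])))
```

Note that LinearSVC optimizes a closely related hinge-loss form of this objective (squared hinge by default) rather than solving the constrained problem directly.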
SVM is a partial case of kernel-based methods. It maps feature vectors into a higher-dimensional space using a kernel function and builds an optimal linear discriminating function in this space, or an optimal hyperplane that fits the training data. The kernel is not explicitly defined in the case of SVM; instead, a distance between any two points in the hyper-space needs to be defined.

3.3 Naive Bayes

Experiments witness that the Naive Bayes algorithm performs well on numeric and textual data. Though it is often outperformed by other techniques such as boosted trees, random forests, Max Entropy, Support Vector Machines, etc., the Naive Bayes classifier is quite efficient, since it is less computationally intensive (in both CPU and memory) and it necessitates only a small amount of training data. However, its assumption of conditional independence is breached by real-world data with highly correlated features, thereby degrading its performance.
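A minimal multinomial Naive Bayes sketch over word counts, assuming scikit-learn, appears below; the tiny corpus and labels are hypothetical.

```python
# Sketch: multinomial Naive Bayes over word counts, assuming
# scikit-learn; the tiny corpus is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loans approved fast", "lecture notes for week two",
        "fast cash approved today", "week two assignment posted"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)  # needs only a little training data
print(nb.predict(vectorizer.transform(["cash approved fast"])))
```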
3.4 Neural Networks

Neural networks can be used to model complex relationships between inputs and outputs and to find patterns in data. By using neural networks as a tool, data warehousing firms are gathering information from datasets in the process known as data mining. A neural network classifier is a network of units, where the input units usually represent terms and the output unit(s) represent the category. To classify a text document, its term weights are assigned to the input units; the activation of these units is propagated forward through the network, and the value that the output unit(s) take up as a consequence determines the categorization decision.
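The sketch below is a minimal instance of this scheme, assuming scikit-learn's MLPClassifier: tf-idf term weights feed the input units, activations propagate forward, and the predicted label plays the role of the output unit. All data and layer sizes are illustrative assumptions.

```python
# Sketch: a small feed-forward network whose input units carry tf-idf
# term weights, assuming scikit-learn; data and layer size are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = ["goal scored in extra time", "parliament passed the bill",
        "the keeper saved a penalty", "senate debates new law"]
labels = ["sport", "politics", "sport", "politics"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# one hidden layer; activations propagate forward to the output unit
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(X, labels)
print(mlp.predict(vectorizer.transform(["penalty in extra time"])))
```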