
International Journal of Computer Science Trends and Technology (IJCST) – Volume 3 Issue 2, Mar-Apr 2015

RESEARCH ARTICLE OPEN ACCESS

Survey Paper on Document Classification and Classifiers


Upendra Singh [1], Saqib Hasan [2]
UG Students [1] & [2]
Department of Computer Science and Engineering
Madan Mohan Malaviya University of Technology
Gorakhpur - 273010
UP - India

ABSTRACT
The rapid growth of the World Wide Web has rendered document classification by humans infeasible, which has given impetus to techniques like Data Mining, NLP and Machine Learning for the automatic classification of textual documents. With the high availability of information from diverse sources, classification tasks have attained paramount importance. Automated text classification has been considered a vital method to manage and process the vast number of documents available in digital form. This paper provides an insight into the text classification process, its phases and the various classifiers in use. It also aims at comparing and contrasting the available classifiers on the basis of criteria such as time complexity and performance.
Keywords:- Data Mining, Natural Language Processing, Classifier, Text classification, Machine Learning.

I. INTRODUCTION

With the increasing availability of digital documents from diverse sources, text classification is gaining popularity day in and day out. There has been a mushroom growth of digital data made available in the last few years, and data discovery and data mining have worked together to extract meaningful data into useful information and knowledge [10]. Text mining refers to the process of deriving high-quality information from text. It is conducive to utilizing the information contained in textual documents in various ways, including the discovery of patterns, associations among entities, etc., and this is done through an amalgamation of NLP (Natural Language Processing), Data Mining and Machine Learning techniques.

The infeasibility of human beings going through all the available documents to find a document of interest precipitated the rise of document classification. Automatically categorizing documents can provide people significant ease in this realm. Text classification assigns documents to one or more predefined categories. The notion of classification is very general and has many applications within and beyond information retrieval (IR). For instance, text classification finds application in automatic spam detection, sentiment analysis, automatic detection of obscenity, personal email sorting, and topic-specific or vertical searches. An example of classification would be automatically labeling news stories with subjects like “business”, “entertainment”, “sports”, etc.

II. CLASSIFICATION PROCESS

From the perspective of automatic text classification systems, the classification task can be sequenced into the steps shown in Fig 2.1.

Fig 2.1 Steps of Text Classification

2.1 Document Collection

Text classification starts with this step of collecting various types of documents in different formats, such as HTML, .pdf, .doc, web content, etc.

2.2 Tokenization

Tokenization, when applied to documents, is the process of breaking a text up into its constituent tokens: a document is treated as a string and is then partitioned into a list of tokens such as words and symbols. Stop words such as “the”, “a”, “and”, etc. occur frequently but are insignificant, and therefore need to be removed.
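As a concrete illustration, the following minimal Python sketch performs both steps just described; the regular-expression token pattern and the tiny stop-word list are illustrative assumptions rather than parts of any surveyed system.

```python
import re

# Illustrative stop-word list; practical systems use much larger lists.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on"}

def tokenize(document: str) -> list[str]:
    """Partition a document (viewed as one string) into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", document.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop the frequently occurring but insignificant words."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(tokenize("The cat sat on the mat and purred.")))
# -> ['cat', 'sat', 'mat', 'purred']
```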
2.3 Feature Extraction

Feature extraction is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature extraction serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when included in the document representation, increases the classification error on new data. Additional features can be mined from the classifiable text; however, the nature of such features should be highly dependent on the nature of the classification to be carried out. If web sites need to be separated into spam and non-spam websites, then the word frequency distribution or the ontology is of little use for the classification, because of the widespread tactic among spammers of copying and pasting mixtures of text from legitimate web sites when creating their spam web sites [2].
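To make the idea concrete, the sketch below builds an effective vocabulary and turns tokenized documents into feature vectors. The selection criterion used here (keep the most frequent training terms) and the function names are deliberately simple illustrative assumptions, not a method prescribed by the surveyed literature.

```python
from collections import Counter

def build_vocabulary(train_docs: list[list[str]], size: int) -> list[str]:
    # Select a subset of terms from the training set to act as features;
    # here simply the `size` most frequent terms, an illustrative criterion.
    counts = Counter(term for doc in train_docs for term in doc)
    return [term for term, _ in counts.most_common(size)]

def to_feature_vector(doc: list[str], vocabulary: list[str]) -> list[int]:
    # Represent a tokenized document as term counts over the chosen vocabulary.
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

docs = [["price", "free", "offer"], ["meeting", "agenda", "offer"]]
vocab = build_vocabulary(docs, size=4)
print(vocab, to_feature_vector(["free", "offer", "offer"], vocab))
```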
2.4 Natural Language Processing

The feature extraction and reduction phases of the text classification process are performed with the help of Natural Language Processing techniques. Linguistic features can be extracted from texts and used as part of their feature vectors [3]. For example, parts of the text that are written in direct speech, the use of different types of declensions, the length of sentences, and the proportions of different parts of speech in sentences (such as noun phrases, preposition phrases or verb phrases) can all be detected and used as a feature vector, either on their own or in addition to a word frequency feature vector [4].
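Below is a minimal sketch of how two such linguistic features might be computed in plain Python; the particular features chosen (average sentence length, and the share of quoted text as a rough proxy for direct speech) and their definitions are illustrative assumptions.

```python
import re

def linguistic_features(text: str) -> dict[str, float]:
    # Two simple linguistic features of the kind described above (both
    # illustrative choices): average sentence length in words, and the
    # share of characters inside double quotes as a proxy for direct speech.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    quoted = sum(len(q) for q in re.findall(r'"([^"]*)"', text))
    return {"avg_sentence_length": avg_len,
            "direct_speech_ratio": quoted / max(len(text), 1)}

print(linguistic_features('"Stop!" she said. The train was already moving.'))
```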
2.5 Feature Reduction

Feature reduction, a.k.a. dimensionality reduction, is about transforming data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information. The computational complexity of any operation on such feature vectors is proportional to the size of the feature vector (Yang & Pedersen, 1997) [9], so any method that reduces the size of the feature vector without significantly impacting the classification performance is very welcome in any practical application. Additionally, it has been shown that some specific words in specific languages only add noise to the data, and removing them from the feature vector actually improves classification performance.

The set of feature reduction operations involves a combination of three general approaches [5]:
1. Stop words;
2. Stemming;
3. Statistical filtering.

Stop words like “a”, “the”, “but” are required by the grammatical structure of a language but carry no meaning. Likewise, stemming converts different word forms into a similar canonical form. Statistical filtering practices are used to glean those words that have higher statistical significance. The most common statistical filtering approaches are: odds ratio, mutual information, cross entropy, information gain, weight of evidence, the χ2 test, the correlation coefficient [6], conditional mutual information maxmin [8], and conformity/uniformity criteria [7]. In simple terms, most formulas give high scores to words that appear frequently within a category and less frequently outside of it (conformity), or to the opposite (non-conformity). Additionally, higher scores are given to words that appear in most documents of a particular category (uniformity).
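The sketch below illustrates the conformity idea with a deliberately simplified score: the smoothed ratio of a term's relative frequency inside a category to its relative frequency outside it. This formula is an illustrative assumption and does not reproduce any of the cited criteria exactly.

```python
from collections import Counter

def conformity_scores(docs_in_class, docs_outside, smoothing=1.0):
    # Score each term by how much more frequent it is inside the category
    # than outside it -- a simplified conformity-style filter.
    inside = Counter(t for d in docs_in_class for t in d)
    outside = Counter(t for d in docs_outside for t in d)
    n_in = sum(inside.values()) + smoothing
    n_out = sum(outside.values()) + smoothing
    return {t: ((inside[t] + smoothing) / n_in) /
               ((outside[t] + smoothing) / n_out)
            for t in set(inside) | set(outside)}

scores = conformity_scores([["goal", "match"], ["match", "win"]],
                           [["stock", "market"], ["market", "win"]])
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]  # keep top-k
```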
2.6 Classification

With each passing day, the automatic classification of documents into predefined categories is gaining the active attention of many researchers. Supervised, unsupervised and semi-supervised methods are used to classify documents. The last decade has seen unprecedented and rapid progress in this area, including machine learning approaches such as the Bayesian classifier, Decision Tree, K-nearest neighbour (KNN), Support Vector Machines (SVMs), Neural Networks and Rocchio’s algorithm.

III. CLASSIFIERS

3.1 K-Nearest Neighbour

K-nearest neighbours is an elegant supervised machine learning algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). K-NN works on the principle that points (documents) which are close to one another in the feature space belong to the same class. The algorithm assimilates all training samples and predicts the response for a new sample by analyzing a certain number (K) of the nearest neighbours of the sample, using some similarity measure such as the Euclidean distance; the distance between two neighbours x and y can be found using the formula

$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$

A major demerit of the similarity measure used in k-NN is that it uses all features in computing distances, which degrades its performance. In myriad document data sets, only a small fraction of the total vocabulary may be useful in categorizing documents. A probable approach to tackle this problem is to learn weights for different features (or words, in document data) [11]. The proposed Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm is based on the k-NN classification paradigm and can enhance the performance of text classification [12].
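A compact sketch of the algorithm as described, combining the Euclidean distance above with a majority vote over the K nearest stored cases; the toy training vectors in the usage lines are illustrative.

```python
import math
from collections import Counter

def euclidean(x, y):
    # The distance formula given above.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, training, k=3):
    # training: list of (feature_vector, label) pairs -- the stored cases.
    # Predict by majority vote among the k nearest neighbours of the query.
    nearest = sorted(training, key=lambda case: euclidean(query, case[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([1.0, 0.0], "sports"), ([0.9, 0.2], "sports"), ([0.0, 1.0], "business")]
print(knn_classify([0.8, 0.1], train, k=3))  # majority vote -> "sports"
```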
3.2 Support Vector Machine

Initially, Support Vector Machines (SVM) were developed for building an optimal binary (2-class) classifier, but the technique was thereafter extended to regression and clustering problems. The working principle of SVM is to find a hyperplane (linear/non-linear) which maximizes the margin. Maximizing the margin is equivalent to

$\min_{w,\, b,\, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i$

subject to $y_i (w^T x_i + b) + \zeta_i - 1 \ge 0$, $1 \le i \le N$, and $\zeta_i \ge 0$, $1 \le i \le N$.

SVM is a particular case of kernel-based methods. It maps feature vectors into a higher-dimensional space using a kernel function and builds an optimal linear discriminating function in this space, or an optimal hyperplane that is consistent with the training data. In the case of SVM, the mapping itself is not explicitly defined; instead, a kernel giving the distance (inner product) between any two points in the higher-dimensional space needs to be defined.

The key features of SVMs are the use of kernels, the absence of local minima, the sparseness of the solution and the capacity control obtained by optimizing the margin. Besides the advantages of SVMs, from a practical point of view they have some drawbacks. An important practical question that is not entirely solved is the selection of the kernel function parameters (for Gaussian kernels, the width parameter σ) and, in regression, the value of ε in the ε-insensitive loss function.
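For illustration only, the following sketch minimises the primal objective above with plain batch subgradient descent on its hinge-loss form. Production SVMs are trained with dedicated solvers (e.g., SMO), and the learning rate and epoch count here are arbitrary assumptions.

```python
def train_linear_svm(data, C=1.0, lr=0.01, epochs=200):
    # Minimise (1/2) w.w + C * sum of hinge losses by batch subgradient
    # descent. data: list of (x, y) with x a feature vector, y in {-1, +1}.
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0          # gradient of (1/2)||w||^2 is w
        for x, y in data:
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1:
                for j in range(dim):           # hinge active: subgradient -C*y*x
                    grad_w[j] -= C * y * x[j]
                grad_b -= C * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

w, b = train_linear_svm([([2.0, 1.0], +1), ([0.0, 0.5], -1)])
```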

3.3 Naïve Bayes

The Naive Bayes classifier is a probabilistic classifier based on Bayes’ theorem with strong (naïve) independence assumptions. It is considered one of the most basic text classification techniques, with various applications in email spam detection, personal email sorting, document categorization, sexually explicit content detection, language detection and sentiment detection.

Experiments show that this algorithm performs well on both numeric and textual data. Though it is often outperformed by other techniques such as boosted trees, random forests, Max Entropy and Support Vector Machines, the Naive Bayes classifier is quite efficient, since it is less computationally intensive (in both CPU and memory) and it necessitates only a small amount of training data. However, the assumption of conditional independence is breached by real-world data with highly correlated features, which degrades its performance.
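A minimal multinomial Naive Bayes for tokenized documents is sketched below, under the simplifying assumptions noted in the comments (Laplace smoothing; terms unseen in training are skipped at prediction time).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    # docs: list of (token_list, label). Returns log-priors and Laplace-
    # smoothed multinomial log-likelihoods, following Bayes' theorem with
    # the naive conditional-independence assumption.
    n_docs, term_counts = Counter(), defaultdict(Counter)
    for tokens, label in docs:
        n_docs[label] += 1
        term_counts[label].update(tokens)
    vocab = {t for c in term_counts.values() for t in c}
    total = sum(n_docs.values())
    log_prior = {c: math.log(n / total) for c, n in n_docs.items()}
    log_like = {c: {t: math.log((term_counts[c][t] + 1) /
                                (sum(term_counts[c].values()) + len(vocab)))
                    for t in vocab}
                for c in n_docs}
    return log_prior, log_like

def nb_classify(tokens, log_prior, log_like):
    # Pick argmax_c [ log P(c) + sum_t log P(t | c) ]; terms never seen in
    # training are skipped here for brevity (a simplification).
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_like[c].get(t, 0.0) for t in tokens))

prior, like = train_naive_bayes([(["cheap", "pills"], "spam"),
                                 (["meeting", "notes"], "ham")])
print(nb_classify(["cheap", "meds"], prior, like))  # -> "spam"
```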
3.4 Neural Networks

Neural networks can be used to model complex relationships between inputs and outputs and to find patterns in data. By using neural networks as a tool, data warehousing firms gather information from datasets in the process known as data mining. A neural network classifier is a network of units, where the input units usually represent terms and the output unit(s) represent the category. For classifying a text document, its term weights are assigned to the input units; the activation of these units is propagated forward through the network, and the value that the output unit(s) take up as a consequence determines the categorization decision.

Fig 3.4 Simple Neural Network Demonstration

Suitability for both discrete and continuous data makes the neural network a popular choice for text classification purposes.
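The forward pass just described can be sketched as follows for a single hidden layer. The tanh/sigmoid activations, the layer sizes and the hand-picked weights are illustrative assumptions, and training by backpropagation is omitted for brevity.

```python
import math

def nn_classify(term_weights, W_hidden, w_out, threshold=0.5):
    # Forward pass of a one-hidden-layer classifier: the input units hold
    # the document's term weights, activations propagate through the
    # network, and the output unit's value decides the category. Weights
    # would normally be learned by backpropagation (training omitted).
    hidden = [math.tanh(sum(w * x for w, x in zip(row, term_weights)))
              for row in W_hidden]
    output = 1 / (1 + math.exp(-sum(w * h for w, h in zip(w_out, hidden))))
    return output > threshold  # True -> document assigned to the category

# Tiny example: 3 input terms, 2 hidden units, hand-picked weights.
print(nn_classify([0.5, 0.0, 1.0],
                  [[1.0, -1.0, 0.5], [0.0, 1.0, 1.0]],
                  [2.0, -1.0]))
```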
3.5 Rocchio’s Algorithm

Rocchio’s algorithm is based on the method of relevance feedback found in information retrieval systems, which stemmed from the SMART Information Retrieval System around the year 1970. In this algorithm, a prototype vector is built for each class: the prototype vector is the average vector over all training document vectors that belong to class $c_i$,

$\vec{c}_i = \frac{1}{|D_i|} \sum_{\vec{d} \in D_i} \vec{d}$

where $D_i$ is the set of training document vectors of class $c_i$.

The similarity between a text document and each of the prototype vectors is determined, and the text document is assigned to the class with maximum similarity. The algorithm is based on the assumption that most users have a general conception of which documents should be denoted as relevant or non-relevant.

This algorithm is deemed a very fast learner and is easy to implement. Although easy to implement, it suffers from poor classification accuracy. The selection of values for the constants alpha and beta plays a vital role in its performance.
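A minimal sketch of the classification rule: build one centroid prototype per class and assign a document to the class with the most similar prototype. Cosine similarity is assumed here as the measure, and the relevance-feedback constants alpha and beta do not appear in this simplified form.

```python
import math

def prototype(class_vectors):
    # Average (centroid) of all training document vectors in one class.
    n = len(class_vectors)
    return [sum(col) / n for col in zip(*class_vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def rocchio_classify(doc_vec, prototypes):
    # prototypes: {class_label: prototype_vector}. Assign the document to
    # the class whose prototype it is most similar to.
    return max(prototypes, key=lambda c: cosine(doc_vec, prototypes[c]))

protos = {"sports": prototype([[1.0, 0.0], [0.8, 0.2]]),
          "business": prototype([[0.0, 1.0], [0.1, 0.9]])}
print(rocchio_classify([0.7, 0.3], protos))  # -> "sports"
```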
IV. PROPOSED METHODOLOGY

When confronted with the need to build a text classifier, the first question to ask is how much training data is currently available: none, very little, quite a lot, or a huge amount that grows every day? For many problems and algorithms, hundreds or thousands of examples from each class are required to produce a high-performance classifier, and many real-world contexts involve large sets of categories.

Training a supervised classifier with little data may not turn out to be beneficial, so in that case it is advisable to opt for a semi-supervised classifier. When a huge amount of data is available, it may be best to choose a classifier based on the scalability of training, or even on runtime efficiency. The general rule of thumb is that each doubling of the training data size produces a linear increase in classifier performance, but with very large amounts of data the improvement becomes sub-linear.
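This guidance can be summarised as a small decision helper; the numeric thresholds below are illustrative assumptions, not values prescribed by the paper.

```python
def suggest_approach(n_labelled_per_class: int, data_keeps_growing: bool) -> str:
    # A rough decision rule paraphrasing the guidance above; the numeric
    # thresholds are illustrative assumptions.
    if n_labelled_per_class == 0:
        return "no supervised training possible: label some data first"
    if n_labelled_per_class < 100:
        return "prefer a semi-supervised classifier"
    if data_keeps_growing:
        return "choose for training scalability and runtime efficiency"
    return "a standard supervised classifier is appropriate"

print(suggest_approach(30, data_keeps_growing=False))
```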
V. CONCLUSION

Text classification is a widespread domain of research encompassing Data Mining, NLP and Machine Learning. It has received much heed owing to the high growth rate of the internet and the relevance of internet search engines. This review paper circumscribes the existing literature: it explores document representation, analyses feature extraction methods, and broaches the different available classifiers. Various methods of classification and feature extraction have been compared and contrasted with coeval methods on the basis of parameters like time complexity and performance. It is deemed that no single representation scheme and classifier can be put forward as a general model for any application; the performance of different algorithms varies according to the data collection. However, SVM with a term-weighted VSM representation scheme has shown promising results in text classification tasks to some extent, but universal acceptance of this algorithm still remains implausible.

REFERENCES

[1] F. Sebastiani, “Text categorization”, in Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, pp. 109-129, 2005.

[2] Fetterly, D., Manasse, M. & Najork, M. (2005). Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 170-177). ACM Press, Salvador, Brazil.

[3] Hunnisett, D. S. & Teahan, W. J. (2004). Context-based methods for text categorisation. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 578-579). ACM Press, Sheffield, United Kingdom.

[4] Stamatatos, E., Kokkinakis, G. & Fakotakis, N. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26, pp. 471-495.

[5] Liu, H. & Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.

[6] Ng, H. T., Goh, W. B. & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 67-73).

[7] Chen, C., Lee, H. & Hwang, C. (2005). A Hierarchical Neural Network Document Classifier with Linguistic Feature Selection. Applied Intelligence, 23, pp. 277-294.

[8] Wang, G. & Lochovsky, F. H. (2004). Feature selection with conditional mutual information maximin in text categorization. In Proceedings of the thirteenth ACM international conference on Information and knowledge management (pp. 342-349).

[9] Yang, Y. & Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 412-420). Morgan Kaufmann Publishers Inc, San Francisco, CA, USA.

[10] Rupali Bhaisare & T. Raju Rao (2013). “Review On Text Mining With Pattern Discovery”.

[11] Muhammed Miah, “Improved k-NN Algorithm for Text Classification”, Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA.

[12] Fang Lu & Qingyuan Bai, “A Refined Weighted K-Nearest Neighbours Algorithm for Text Categorization”, IEEE, 2010.

[13] Kwangcheol Shin, Ajith Abraham & Sang Yong Han, “Improving kNN Text Categorization by Removing Outliers from Training Set”, Springer-Verlag Berlin Heidelberg, 2006.

[14] Robert Burbidge & Bernard Buxton (2000). An Introduction to Support Vector Machines for Data Mining.

[15] Vidhya. K. A & G. Aghila, “A Survey of Naïve Bayes Machine Learning approach in Text Document Classification”, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, 2010.

[16] S. M. Kamruzzaman & Chowdhury Mofizur Rahman, “Text Categorization using Association Rule and Naive Bayes Classifier”, CoRR, 2010.

[17] Miguel E. Ruiz & Padmini Srinivasan, “Automatic Text Categorization Using Neural Networks”, Advances in Classification Research, Volume VIII.

[18] J. J. Rocchio. Document Retrieval Systems - Optimization and Evaluation. PhD thesis, Harvard Computational Laboratory, Cambridge, MA, 1966.

[19] J. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System - Experiments in Automatic Document Processing, pages 313-323, Englewood Cliffs, NJ, 1971. Prentice Hall, Inc.

[20] Kjersti Aas & Line Eikvil, “Text Categorization: A Survey”, Report No. 941, ISBN 82-539-0425-8, June 1999.
