Abstract
Spam filtering is a text categorization task that has attracted significant attention because of the ever-growing volume of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose a statistical but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both of these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques, we reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported for the CRM114 classifier on the same corpus.
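The abstract names the two components but does not spell them out, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. It assumes that OSB pairs each token with each of the few tokens preceding it inside a sliding window, marking skipped positions with a placeholder, and that Winnow is a mistake-driven linear classifier with multiplicative weight updates. The names `osb_features`, `Winnow`, and `SKIP`, the window size, and the promotion, demotion, and threshold values are all illustrative assumptions.

```python
SKIP = "<skip>"  # hypothetical placeholder for a skipped token position


def osb_features(tokens, window=5):
    """Pair each token with each of the up to window-1 tokens before it.

    Skipped positions between the two tokens are encoded with SKIP, so the
    same word pair at different distances yields different features.
    """
    features = []
    for i, right in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            gap = [SKIP] * (dist - 1)
            features.append(" ".join([tokens[j], *gap, right]))
    return features


class Winnow:
    """Simplified Winnow: multiplicative updates, applied only on mistakes."""

    def __init__(self, promotion=1.23, demotion=0.83, threshold=1.0):
        # Parameter values are assumptions chosen for illustration.
        self.promotion = promotion
        self.demotion = demotion
        self.threshold = threshold
        self.weights = {}  # feature -> weight; unseen features count as 1.0

    def score(self, features):
        """Average weight of the active (present) features."""
        active = set(features)
        if not active:
            return 0.0
        return sum(self.weights.get(f, 1.0) for f in active) / len(active)

    def predict(self, features):
        """Return True (spam) if the score exceeds the threshold."""
        return self.score(features) > self.threshold

    def train(self, features, is_spam):
        """Update weights of active features only if the prediction is wrong."""
        if self.predict(features) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in set(features):
            self.weights[f] = self.weights.get(f, 1.0) * factor
```

Because updates happen only on misclassified messages, such a classifier can be trained incrementally: calling `train(osb_features(tokens), is_spam)` on each new message as it arrives adjusts only the weights of the features present in that message.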
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Siefkes, C., Assis, F., Chhabra, S., Yerazunis, W.S. (2004). Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_38
DOI: https://doi.org/10.1007/978-3-540-30116-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive