Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering

Siefkes, Christian; Assis, Fidelis; Chhabra, Shalendra; Yerazunis, William S.

doi:10.1007/978-3-540-30116-5_38

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery


Abstract

Spam filtering is a text categorization task that has attracted significant attention due to the rapidly growing volume of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose a statistical but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both of these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques, we reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported for the CRM114 classifier on the same corpus.
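The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameters, so the window size, the promotion and demotion factors, the initial weight of 1.0, and the normalized threshold below are all assumptions chosen for illustration. `osb_features` pairs each token with each of the few tokens preceding it and records how many words were skipped, and `Winnow` is a plain mistake-driven learner with multiplicative weight updates.

```python
from collections import defaultdict


def osb_features(tokens, window=5):
    """Pair each token with each of the up to window-1 tokens before it.

    The number of skipped words is stored in the feature, so that
    "a b" and "a <skip> b" remain distinct features.
    The window size of 5 is an assumption for this sketch.
    """
    features = []
    for i, right in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            features.append((tokens[j], dist - 1, right))
    return features


class Winnow:
    """Mistake-driven linear classifier with multiplicative updates.

    promotion, demotion, and the threshold of 1.0 on the mean weight
    of the active features are illustrative assumptions.
    """

    def __init__(self, promotion=1.25, demotion=0.8):
        self.promotion = promotion
        self.demotion = demotion
        self.weights = defaultdict(lambda: 1.0)  # unseen features start at 1.0

    def score(self, features):
        active = set(features)
        if not active:
            return 0.0
        return sum(self.weights[f] for f in active) / len(active)

    def predict(self, features):
        """Return True for spam, False for ham."""
        return self.score(features) > 1.0

    def train(self, features, is_spam):
        """Update weights only when the current prediction is wrong."""
        if self.predict(features) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in set(features):
            self.weights[f] *= factor


clf = Winnow()
spam = osb_features("buy cheap pills".split())
ham = osb_features("meeting at noon".split())
clf.train(spam, is_spam=True)
clf.train(ham, is_spam=False)
```

Because weights change only on a misclassified message, a learner of this kind can be trained incrementally, one message at a time, which matches the incremental setting named in the title.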



Author information

Authors and Affiliations

Berlin-Brandenburg Graduate School in Distributed Information Systems, Database and Information Systems Group, Freie Universität Berlin, Berlin, Germany
Christian Siefkes
Empresa Brasileira de Telecomunicações, Embratel, Rio de Janeiro, RJ, Brazil
Fidelis Assis
Computer Science and Engineering, University of California, Riverside, California, USA
Shalendra Chhabra
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
William S. Yerazunis


Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi

Cite this paper

Siefkes, C., Assis, F., Chhabra, S., Yerazunis, W.S. (2004). Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_38


DOI: https://doi.org/10.1007/978-3-540-30116-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
