Website: http://carlo.zottmann.org/
Original page: http://carlo.zottmann.org/2003/06/27/jun-27th-2003-0954-gmt/
I have found, however, that more and more often the spam I'm receiving has all the relevant information put into an image contained within the e-mail. There isn't any real text content for a filter such as a Bayesian one to compare.
ugh.
Zhaneel
spambayes cope with it? After all, HTML email *is* text - it should be *easy* to train it to filter out stuff with comments.
1. Slurp the message into memory, ignoring binary-encoded attachments.
2. Remove all HTML tags from the message. In essence, perform the following: s///g; (spambayes is Python, but it uses Perl-style regular expressions).
3. Tokenize the message. Much voodoo here, and this is where most new spambayes development is taking place.
4. Using the token database, take the 15 top-weighted tokens (weight: distance from 50%) and classify the message.
The presence of HTML comments gets ignored since HTML gets stripped. You don't get tokens for, say, number of comments, but a string like VIA<!-- aren't I clever? -->GRA will get transformed to "VIAGRA" before the tokenizer gets to it. It's a partial solution to the problem, but in my experience, it works very well. I haven't had a false positive in months. In fact, the only false positive I've gotten this year was when a friend of mine forwarded me an (unintentionally) amusing porn spam that he got.
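To make the VIAGRA example concrete, here is a minimal sketch of the stripping step. The two patterns are my own stand-ins (the regex quoted earlier in the thread did not survive), and `strip_html` is a hypothetical helper, not a spambayes function. Comments are removed first because a comment may itself contain a `>`, which a plain tag pattern would trip over.

```python
import re

# Assumed patterns, for illustration only.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments, possibly multi-line
TAG_RE = re.compile(r"<[^>]*>")                    # any remaining tag


def strip_html(text):
    """Remove HTML comments and tags so split-up words are rejoined."""
    text = COMMENT_RE.sub("", text)
    return TAG_RE.sub("", text)


cleaned = strip_html("VIA<!-- aren't I clever? -->GRA")
```

After this pass the tokenizer sees the word whole, which is why the comment trick buys the spammer nothing.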