carlo.comments: carlo.log → Jun 27th 2003, 09:54 GMT

  • entipy · 6 years ago
    Sounds good.

    I have found, however, that more and more often the SPAM I'm receiving has all the relevant information put onto an IMAGE contained within the e-mail. There isn't any real text content for a filter such as Bayesian to compare.

    ugh.
  • Zhaneel · 6 years ago
    There is a neat article on Fury about how it can also be thwarted.

    Zhaneel
  • Spam · 6 years ago
    That's a good argument for getting rid of all HTML email. If you have to send text, then Bayesian filtering can work on it. If your mail client is set to never display HTML messages, you'll never see such spams. They'll just show up as a blank message with a couple attachments, and that's clearly spam.
  • Carlo Zottmann · 6 years ago
    #1: what Spam (#3) said. No text, just HTML? Most likely spam. If you get two or more of these and train the filter, it'll learn. It's working. I know because I got a couple of those in the past myself. ;)
  • terpsichoros · 6 years ago
    #2 - more specifically, the article is at http://fury.com/article/1789.php. In it, he complains that the Bayesian filtering can't cope with HTML comments. Why not? Spam - can spambayes cope with it? After all, HTML email *is* text - it should be *easy* to train it to filter out stuff with comments.
  • Spam · 6 years ago
    @5: The way spambayes handles things now is:
      1. Slurp the message into memory, ignoring binary-encoded attachments.
      2. Remove all HTML tags from the message. In essence, perform the following: s///g; (spambayes is Python, but it uses PCRE)
      3. Tokenize the message. Much voodoo here, and this is where most new spambayes development is taking place. Using the token database, take the 15 top-weighted tokens (weight: distance from 50%) and classify the message.
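
    For anyone curious what the "15 top-weighted tokens" step might look like, here is a rough sketch. It is not spambayes' actual code: the function name `classify`, the 0.9 cutoff, the clamping bounds, and the Graham-style way of combining the probabilities are all assumptions made for illustration. Only the "pick the tokens furthest from 50%" part comes from the description above.

```python
from math import prod  # Python 3.8+


def classify(token_probs, n_top=15, cutoff=0.9):
    """Score a message from the per-token spam probabilities of its tokens.

    token_probs: iterable of floats in [0, 1], one per token in the message.
    Returns (score, is_spam). The name, cutoff and combining rule are
    illustrative assumptions, not spambayes' real API.
    """
    # Clamp so a single 0.0 or 1.0 cannot zero out both products.
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]

    # Weight = distance from the neutral 50%; keep the n_top strongest.
    strongest = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)[:n_top]

    # Graham-style combination (assumed here): spam evidence vs. ham evidence.
    spam = prod(strongest)
    ham = prod(1.0 - p for p in strongest)
    score = spam / (spam + ham)
    return score, score >= cutoff
```

    A message with no known tokens comes out at exactly 0.5 under this sketch, i.e. "no idea", which seems like the sensible default.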

    The presence of HTML comments gets ignored since HTML gets stripped. You don't get tokens for, say, number of comments, but a string like VIA<!-- aren't I clever? -->GRA will get transformed to "VIAGRA" before the tokenizer gets to it. It's a partial solution to the problem, but in my experience, it works very well. I haven't had a false positive in months. In fact, the only false positive I've gotten this year was when a friend of mine forwarded me an (unintentionally) amusing porn spam that he got.
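
    To make the comment trick concrete, here is a minimal sketch of tag stripping followed by a naive tokenizer. The pattern `<[^>]*>` and the helper names `strip_html` and `tokenize` are assumptions for illustration; the real spambayes tokenizer does a lot more than split on whitespace.

```python
import re

# Assumed pattern: anything from "<" up to the next ">" counts as a tag.
# HTML comments such as <!-- ... --> are matched by the same rule.
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(text):
    """Remove tags and comments, joining the text on either side."""
    return TAG_RE.sub("", text)


def tokenize(text):
    """Very naive tokenizer: strip markup, lowercase, split on whitespace."""
    return strip_html(text).lower().split()


print(strip_html("VIA<!-- aren't I clever? -->GRA"))
print(tokenize("Buy <b>VIA<!-- aren't I clever? -->GRA</b> now"))
```

    One caveat with a pattern this simple: a comment that itself contains a ">" would only be partly removed, so a hardened version would need to handle comments separately.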