spamassassin-users April 2012 archive

Re: Regex help (targetting very long HTML comments)

From: Kris Deugau <kdeugau_at_nospam>
Date: Mon Apr 09 2012 - 16:10:31 GMT
To: spamassassin-users <users@spamassassin.apache.org>

Adam Katz wrote:
> % grep html_text_match..comment 20_html_tests.cf

I hadn't known about that function until I saw Henrik's replies last
week, so it would have been hard to search for it.

> Any more than 512 chars isn't going to be helpful but will end up being
> computationally expensive (I've played with this idea). Also, I'd say
> this is more of a ham indicator than a spam indicator.

*shrug* I happen to be getting a wave of ~400K spams that consist of
about 1K of real HTML tags, loading the spam content via image from a
remote server, with the remainder of that 400K message consisting of
maybe four *very* long HTML comments (50K+) with nothing but gibberish
(groups of ~4-8 words, separated by /, ;, # and occasionally some other
symbol).
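
To make the shape of these messages concrete, here is a rough Python sketch that measures how much of an HTML body sits inside comments. It is not a SpamAssassin rule and does not use SpamAssassin's API. The function name comment_stats, the 512-character cutoff (borrowed from the quoted remark above), and the synthetic sample message are all assumptions made for illustration.

```python
import re

# Non-greedy match of a single HTML comment body, allowed to span newlines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def comment_stats(html, min_len=512):
    """Return (lengths of comment bodies longer than min_len,
    fraction of the whole text that lies inside comment bodies).

    Hypothetical helper for illustration only; min_len=512 is an assumed cutoff.
    """
    bodies = COMMENT_RE.findall(html)
    long_lengths = [len(b) for b in bodies if len(b) > min_len]
    total = sum(len(b) for b in bodies)
    fraction = total / len(html) if html else 0.0
    return long_lengths, fraction


# Synthetic stand-in for the spam described above: a little real markup,
# a remote image, and one very long comment full of filler words.
sample = (
    "<html><body><img src='http://example.invalid/x.png'>"
    + "<!--" + "foo bar baz / " * 4000 + "-->"
    + "</body></html>"
)

long_lengths, fraction = comment_stats(sample)
print(long_lengths, round(fraction, 3))
```

A message where nearly all of the bytes are comment text, as in the sample, is the pattern described here. A 5K block of CSS inside a comment in an otherwise normal message would produce a long comment but a much lower fraction, so the two numbers together are more informative than comment length alone.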

I've also seen gobs of mail with ~5K of CSS in an HTML comment - mostly
from Outlook. *eyeroll*

These are most of what's still getting through to *my* inbox, but with
~50K users I'd assume they're hitting other people as well.
Unfortunately, as an ISP sysadmin, my ability to get useful, timely
feedback from a high proportion of the userbase is... limited.

-kgd