spamassassin-users April 2011 archive
spamassassin-users: Darxus's LOCAL_8X_TAGS

From: Adam Katz <antispam_at_nospam>
Date: Thu Apr 21 2011 - 23:35:26 GMT
To: SpamAssassin Users List <users@spamassassin.apache.org>

Split off from the previous thread to prevent confusion.

On 04/21/2011 04:18 PM, darxus@chaosreigns.com wrote:
> On 04/21, Adam Katz wrote:
>> rawbody LOCAL_5X_BR_TAGS /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
>
> I wonder if it would be useful to generalize this as:
>
> rawbody LOCAL_8X_TAGS /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi
>
> Just a mess of tags in a row without any content.

I'm not sure about email clients specifically, but it is (or rather,
used to be -- I'm way out of date here) a common WYSIWYG foible to
create empty tags when the user plays with various formatting buttons
(like bold and italics) as they decide how something is presented.
Therefore, it is not uncommon to have strings like this:

<b></b><b>1.</b> <b><i>Example bullet</i></b><b>
</b>
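
To see how a rule like LOCAL_8X_TAGS treats that kind of markup, here
is a rough Python sketch. It is not SpamAssassin itself: the pattern is
copied from the quoted rule, the way rawbody feeds the message to the
regex is not modeled, and longest_tag_run is a hypothetical helper made
up for illustration.

```python
import re

# Pattern copied from the quoted rule; /mi becomes re.M | re.I.
LOCAL_8X_TAGS = re.compile(r'(?:<[^>]*>[\s\r\n]{0,4}){8}', re.M | re.I)

# Helper patterns (assumptions for this sketch, not part of any ruleset).
TAG_RUN = re.compile(r'(?:<[^>]*>\s{0,4})+')  # tags separated by <= 4 whitespace chars
TAG = re.compile(r'<[^>]*>')                  # a single tag


def longest_tag_run(html):
    """Return the largest number of consecutive tags with no content between them."""
    runs = TAG_RUN.findall(html)
    return max((len(TAG.findall(run)) for run in runs), default=0)


# The WYSIWYG-style string from above.
wysiwyg = '<b></b><b>1.</b> <b><i>Example bullet</i></b><b>\n</b>'
# An artificial run of eight line breaks, the kind of thing the rule targets.
spammy = '<br>\n' * 8

for label, sample in (('wysiwyg', wysiwyg), ('spammy', spammy)):
    hit = LOCAL_8X_TAGS.search(sample) is not None
    print(label, 'longest run:', longest_tag_run(sample), 'rule hit:', hit)
```

On the string above the longest run is four tags, so a threshold of
eight does not fire on it. Messier editor output could plausibly get
there, which is why the threshold needs checking against real ham.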

I kept thinking that there was a good psychology study in there
somewhere, since a solid knowledge of the inner workings of a specific
WYSIWYG editor would reveal a lot about how the document was composed
(order, revisions, etc.).

Sloppiness in HTML generators is so common that many of them run their
final output through a cleanup tool (e.g. Wikipedia uses HTML Tidy).