Broken apart from previous thread to prevent confusion.
On 04/21/2011 04:18 PM, darxus@chaosreigns.com wrote:
> On 04/21, Adam Katz wrote:
>> rawbody LOCAL_5X_BR_TAGS /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
>
> I wonder if it would be useful to generalize this as:
>
> rawbody LOCAL_8X_TAGS /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi
>
> Just a mess of tags in a row without any content.
I'm not sure about email clients specifically, but it is (or rather,
used to be -- I'm way out of date here) a common WYSIWYG foible to
create empty tags when the user plays with various formatting buttons
(like bold and italics) as they decide how something is presented.
Therefore, it is not uncommon to have strings like this:
<b></b><b>1.</b> <b><i>Example bullet</i></b><b>
</b>
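As a rough check of how strings like this interact with the proposed rule, here is a minimal Python sketch. It assumes Python's `re` module behaves like Perl's regex engine for this particular pattern, and it ignores how SpamAssassin splits a message into chunks for `rawbody` rules. The helper names `tag_run_regex` and `longest_tag_run` are made up for illustration and are not part of SpamAssassin.

```python
import re

# One "unit" of the quoted rule: a tag followed by up to four whitespace chars.
# [\s\r\n] is equivalent to \s; it is kept as written in the rule.
UNIT = r"<[^>]*>[\s\r\n]{0,4}"
UNIT_RE = re.compile(UNIT)
RUN_RE = re.compile(r"(?:" + UNIT + r")+")


def tag_run_regex(count):
    """Build the generalized rule with a configurable repeat count (/mi flags)."""
    return re.compile(r"(?:" + UNIT + r"){%d}" % count,
                      re.MULTILINE | re.IGNORECASE)


def longest_tag_run(text):
    """Return the largest number of consecutive tags with no content between them."""
    best = 0
    for run in RUN_RE.finditer(text):
        best = max(best, len(UNIT_RE.findall(run.group(0))))
    return best


# The WYSIWYG-style example string from this message (line break included).
SAMPLE = "<b></b><b>1.</b> <b><i>Example bullet</i></b><b>\n</b>"

print(longest_tag_run(SAMPLE))                # 4
print(bool(tag_run_regex(8).search(SAMPLE)))  # False
print(bool(tag_run_regex(4).search(SAMPLE)))  # True
```

On this particular sample the longest run is four tags, so a threshold of eight would not fire, while a lower threshold would. Running something like `longest_tag_run` over a ham corpus would be one way to see where a safe threshold sits.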
I kept thinking that there was a good psychology study in there
somewhere, since a good knowledge of the inner workings of a specific
WYSIWYG editor would reveal a lot about how the document
was composed (order, revisions, etc.).
Sloppy output from HTML generators is so common that many of them
run their final code through a cleanup tool (e.g. Wikipedia uses
HTML Tidy).