spamassassin-users April 2012 archive

Re: Regex help (targetting very long HTML comments)

From: Bowie Bailey <Bowie_Bailey_at_nospam>
Date: Mon Apr 02 2012 - 17:04:42 GMT
To: users@spamassassin.apache.org

On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
> 2012-04-02 12:40:27 -0400, Kris Deugau:
>> Can anyone point out what bit of stupidity I'm committing in trying
>> to use this:
>>
>> rawbody OVERSIZE_COMMENT m|<!--(?!-->).{32000,}|s
>>
>> to match messages that are mostly very very long HTML comment(s)?
>>
>> Testing the same regex against the whole raw message outside of SA
>> seems to fire just fine.
> [...]
>
> Don't know about the spamassassin issue, but that regexp
> matches <!-- followed by a sequence of 32000 or more characters
> provided that sequence doesn't start with "-->".
>
> ITYM
>
> m|<!--(?:(?!-->).){32000,}|s
>
> That is, you need to look ahead at each character of the sequence
> to look for the closing comment tag, otherwise you'll match on
> <!-- short comment --> <31982 or more characters>

You may also want to require the closing comment tag at the end:

m|<!--(?:(?!-->).){32000,}-->|s
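
To see how the three patterns in this thread differ, here is a rough sketch using Python's re module as a stand-in for Perl (the lookahead syntax is the same for these patterns). MIN_LEN and the three sample strings are made up for illustration, and the threshold is shrunk from 32000 to 20 to keep the inputs small. None of this runs inside SpamAssassin.

```python
import re

MIN_LEN = 20  # illustrative stand-in for the 32000 used in the thread

# Original rule: the lookahead is checked only once, right after "<!--".
naive = re.compile(r"<!--(?!-->).{%d,}" % MIN_LEN, re.S)

# Stephane's version: the lookahead is checked before every character consumed.
tempered = re.compile(r"<!--(?:(?!-->).){%d,}" % MIN_LEN, re.S)

# Variant that also requires the closing "-->".
closed = re.compile(r"<!--(?:(?!-->).){%d,}-->" % MIN_LEN, re.S)

# Made-up inputs: a short comment followed by long text, a long closed
# comment, and a long comment that is never closed.
short_comment = "<!-- hi -->" + "x" * 40
long_comment = "<!--" + "y" * 40 + "-->"
unclosed_comment = "<!--" + "z" * 40

for name, rx in [("naive", naive), ("tempered", tempered), ("closed", closed)]:
    hits = [bool(rx.search(s)) for s in (short_comment, long_comment, unclosed_comment)]
    print(name, hits)
```

The original pattern hits all three inputs, including the short comment followed by unrelated text. The per-character lookahead rejects the short comment but still hits the unclosed one. Requiring the closing tag leaves only the long, properly closed comment.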

Also, because the lookahead runs at every character of the comment,
this may be an expensive regexp. If you try it, keep a close eye on
your SA. If it slows down to a crawl, this rule is probably the culprit.
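
If you want a rough feel for the cost before putting the rule into production, something like the following could time the pattern on a synthetic message. This is only a sketch: it uses Python's re rather than Perl, time_search and the sample input are hypothetical, and the numbers will not carry over directly to SA's own regexp engine or to real mail.

```python
import re
import time

def time_search(pattern, text):
    """Return (matched, seconds) for a single search of pattern over text."""
    rx = re.compile(pattern, re.S)
    start = time.perf_counter()
    matched = rx.search(text) is not None
    return matched, time.perf_counter() - start

# Hypothetical input: one closed comment with a 40000-character body.
sample = "<!--" + "a" * 40000 + "-->"

matched, seconds = time_search(r"<!--(?:(?!-->).){32000,}-->", sample)
print(f"matched={matched} in {seconds:.4f}s")
```

Running it against a handful of real messages from your corpus, rather than a synthetic string, would give a more honest picture.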

-- Bowie