spamassassin-users September 2011 archive

charset in rules

From: Matus UHLAR - fantomas <uhlar_at_nospam>
Date: Mon Sep 26 2011 - 12:57:59 GMT
To: users@spamassassin.apache.org

Hello,

I was trying to write a rule that would lower the effect of the FRT_PENIS1
rule, since it often matches text in Czech/Slovak
(e.g. peníze == money).

I didn't want to zero the score of FRT_PENIS1, because it may still catch
some spam.

I expected that putting UTF-8 text into a body rule like

body __PENI_NOPENIS /pen[íě]\s?z/

(i.e. iacute, ecaron)

could help me; however, this rule does not match any of the 3 mails I've
checked.

I was surprised that when I changed the rule's character set to ISO-8859-2,
it matched (even on a very badly formatted HTML mail with HTML encoding):

body __PENI_NOPENIS /pen[\xED\xEC]\s?z/
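
A minimal sketch of one possible explanation, assuming body rules are matched byte by byte against the undecoded message text (an assumption, not something confirmed for SA 3.3.1 here). Python's bytes regexes stand in for Perl; the sample word, the variable names, and the `matches` helper are illustrative only. Under that assumption, a character class written in UTF-8 becomes a set of four single bytes, so it can never consume a whole two-byte character:

```python
import re

# Illustrative sample word, encoded the two ways discussed above.
word = "peníze"
body_utf8 = word.encode("utf-8")         # b'pen\xc3\xadze'
body_latin2 = word.encode("iso-8859-2")  # b'pen\xedze'

# Rule typed as UTF-8 text, then treated as raw bytes (assumption):
# the class is [\xC3\xAD\xC4\x9B] and matches exactly one of those bytes.
rule_utf8_class = re.compile("pen[íě]\\s?z".encode("utf-8"))

# Rule written with ISO-8859-2 byte escapes, as in the second rule above.
rule_latin2 = re.compile(rb"pen[\xED\xEC]\s?z")

# Possible workaround for UTF-8 mail: spell out each two-byte sequence
# as an alternative instead of using a character class.
rule_utf8_alt = re.compile(rb"pen(?:\xC3\xAD|\xC4\x9B)\s?z")


def matches(rule, body):
    """Return True if the byte-level rule matches anywhere in body."""
    return rule.search(body) is not None


for name, rule in [("utf8 class", rule_utf8_class),
                   ("latin2 bytes", rule_latin2),
                   ("utf8 alternation", rule_utf8_alt)]:
    print(name, matches(rule, body_utf8), matches(rule, body_latin2))
```

If the assumption holds, the UTF-8 character class matches neither encoding, the ISO-8859-2 rule matches only ISO-8859-2 bodies, and the alternation form matches only UTF-8 bodies.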

Is this expected behaviour?

My version of SA is 3.3.1 with Perl 5.12.3, and LC_CTYPE is set to
sk_SK.utf-8

-- 
Matus UHLAR - fantomas, uhlar_at_fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.