spamassassin-dev September 2011 archive

Re: [SA-dev] bernie-it_batt ham 61% DKIM_ADSP_ALL and other fun in the corpora

From: Adam Katz <antispam_at_nospam>
Date: Fri Sep 30 2011 - 04:22:11 GMT
To: dev@spamassassin.apache.org

On 09/28/2011 01:29 PM, darxus@chaosreigns.com wrote:
> I wrote a script to read in scores of all the rules from
> /var/lib/spamassassin/3.003002/updates_spamassassin_org/*.cf, then
> read in the corpora from the last mass-check. It adds up the score
> of each of the emails, and outputs the hits for emails that scored on
> the wrong side of a threshold of 5.
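
For anyone following along, here is a rough Python sketch of what a script
like that might do. It is not darxus's actual script. It assumes the usual
"score NAME value [value value value]" lines in the .cf files and takes the
corpus as already-parsed (is_spam, hits) pairs, since the mass-check log
parsing isn't shown here. The rule names, scores and choice of score set
are made up for illustration.

```python
def parse_scores(cf_lines, score_set=0):
    """Collect rule scores from 'score NAME v [v v v]' lines."""
    scores = {}
    for line in cf_lines:
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) < 3 or fields[0] != "score":
            continue
        try:
            values = [float(v) for v in fields[2:]]
        except ValueError:
            continue  # skip score forms this sketch does not handle
        # A single value applies to every score set; four values are per set.
        scores[fields[1]] = values[score_set] if len(values) == 4 else values[0]
    return scores


def wrong_side(messages, scores, threshold=5.0):
    """Return (total, hits) for messages scored on the wrong side of threshold."""
    flagged = []
    for is_spam, hits in messages:
        total = sum(scores.get(rule, 0.0) for rule in hits)
        if (total >= threshold) != is_spam:
            flagged.append((total, hits))
    return flagged


# Hypothetical rules and scores, for illustration only.
cf = ["score RULE_A 0 3.0 0 2.5", "score RULE_B 2.5", "score RULE_C -1.0  # comment"]
scores = parse_scores(cf, score_set=1)
corpus = [
    (False, ["RULE_A", "RULE_B"]),            # ham totalling 5.5: false positive
    (False, ["RULE_A", "RULE_B", "RULE_C"]),  # ham totalling 4.5: fine
    (True, ["RULE_A"]),                       # spam totalling 3.0: false negative
]
for total, hits in wrong_side(corpus, scores):
    print(total, " ".join(hits))
```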

% ... |grep -Eo '[A-Z_]\w{2,}' |sort |uniq -c |sort -n |sed '/ 1 /d'

      2 HTML_MESSAGE
      2 HTTP_ESCAPED_HOST
      2 MILLION_USD
      2 MIME_HTML_ONLY
      2 NUMERIC_HTTP_ADDR
      2 RCVD_IN_BRBL_LASTEXT
      2 URIBL_RHS_DOB
      2 URIBL_SBL
      3 LOTS_OF_MONEY
      3 URI_OBFU_WWW
      4 FRT_APPROV
      4 SPF_PASS
      4 SPOOF_COM2OTH
      4 URIBL_SC_SURBL
      4 URI_NOVOWEL
      5 DRUGS_ERECTILE
      5 DRUGS_ERECTILE_OBFU
      5 FH_HELO_EQ_D_D_D_D
      5 HELO_DYNAMIC_IPADDR2
      5 RDNS_DYNAMIC
      7 URI_HEX
      9 RP_MATCHES_RCVD
      9 URIBL_DBL_SPAM
     10 URIBL_AB_SURBL
     11 DOS_RCVD_IP_TWICE_C
     11 URIBL_JP_SURBL
     11 URIBL_WS_SURBL
     12 NORMAL_HTTP_TO_IP
     13 RCVD_IN_DNSWL_MED
     14 RDNS_NONE
     14 URIBL_BLACK
     16 DOS_RCVD_IP_TWICE_B
     16 FORGED_RELAY_MUA_TO_MX
     16 RCVD_IN_PBL
     16 DKIM_ADSP_ALL
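
The same tally can be sketched in Python if that is easier to extend than
the pipeline. This assumes the script's output arrives as lines of text
containing rule names; the sample lines below are invented, and ties are
ordered by name, which may not match sort's ordering exactly.

```python
import re
from collections import Counter

# Same token pattern the grep above uses.
RULE_NAME = re.compile(r"[A-Z_]\w{2,}")


def tally(lines, min_count=2):
    """Count rule names across lines, drop singletons, sort by count ascending."""
    counts = Counter(name for line in lines for name in RULE_NAME.findall(line))
    return sorted((n, name) for name, n in counts.items() if n >= min_count)


# Hypothetical input lines, for illustration only.
sample = [
    "DKIM_ADSP_ALL,SPF_PASS,HTML_MESSAGE",
    "DKIM_ADSP_ALL,RDNS_NONE",
    "SPF_PASS,DKIM_ADSP_ALL",
]
for n, name in tally(sample):
    print(f"{n:7d} {name}")
```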

16 hits isn't screamingly problematic (out of 208473 hams, that's 0.0077%,
though I suspect your subset of the ham corpus is smaller), but FP
reduction is always a Good Thing.
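
The percentage is just hits over total hams. A quick check, using the two
figures quoted above:

```python
hits = 16
hams = 208473

# Share of ham messages hit, as a percentage.
fp_percent = 100 * hits / hams
print(f"{fp_percent:.4f}%")
```

With a smaller ham subset the same 16 hits would of course be a larger
share.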

I've been sitting on a fix to HELO_DYNAMIC_IPADDR2 for a bit. Checking
that in now. It changes a match in last-external HELO

from
\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+

to
\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+

I also added some examples of what this hits. I can't find too many
exotics at the moment though.
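
To make the difference concrete, here is a small Python sketch that runs
both fragments against a few hostnames. The hostnames are invented for
illustration (they are not the examples checked in), and this uses only
the fragments quoted above under Python's re module, not the full rule as
SA applies it to the relay pseudoheader.

```python
import re

OLD = re.compile(r"\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+")
NEW = re.compile(r"\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+")

# Hypothetical HELO strings, for illustration only.
samples = [
    "192-0-2-10.dsl.example.net",        # dashed dotted-quad: both match
    "192x0x2x10.dyn.example.net",        # 'x' as separator: both match
    "1a2b3c4.foo.example.com",           # arbitrary letters as separators: old only
    "2011-09-30-0001.mail.example.com",  # digit runs longer than an octet: old only
]

results = {s: (bool(OLD.search(s)), bool(NEW.search(s))) for s in samples}
for s, (old_hit, new_hit) in results.items():
    print(f"old={old_hit!s:5} new={new_hit!s:5} {s}")
```

The new fragment limits each number to one to three digits and limits the
separators to non-word characters, "x" or "_", which is what drops the
last two samples.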

One of the FPs I saw in my ham corpus included a space in the text
matched by [^\.]*, which you can see I have corrected by using [^\s.]*
instead. Since I'm nitpicking on this front, note that [\Wx_] does allow
a space, but it must be followed by a digit. Since no attribute of the
SA-generated X-Spam-Relays-External pseudoheader begins with a digit,
there is no risk of it matching the space between attributes.
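
A quick way to see the space handling, again as a Python sketch over the
quoted fragments only. The relay fragment below is invented and cut down
to the helo= and by= attributes for illustration.

```python
import re

OLD = re.compile(r"\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+")
NEW = re.compile(r"\d{1,3}(?:[\Wx_]\d{1,3}){3}[^\d\s][^\s.]*\.\S+\.\S+")

# Hypothetical pseudoheader fragment: the HELO itself has no dots after the
# numbers, but the next attribute does.
fragment = "helo=host1-2-3-4a by=mx.example.org"

# The old tail class runs across the space into by=...; the new one cannot.
old_tail_crosses = re.fullmatch(r"[^\.]*", "a by=mx") is not None
new_tail_crosses = re.fullmatch(r"[^\s.]*", "a by=mx") is not None

old_hit = OLD.search(fragment) is not None
new_hit = NEW.search(fragment) is not None

# [\Wx_] accepts a space only when a digit follows it.
sep_before_attr = re.search(r"[\Wx_]\d", " by=") is not None
sep_before_digit = re.search(r"[\Wx_]\d", " 4") is not None

print(old_tail_crosses, new_tail_crosses, old_hit, new_hit)
print(sep_before_attr, sep_before_digit)
```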

I also kept ccTLDs from breaking the exclusion (e.g. for
foo.com.au.s3.amazonaws.com) in SPOOF_COM2OTH and friends.