spamassassin-dev September 2011 archive

Re: High ham rate in darxus corpora for URIBL_WS_SURBL Re: ham scores

From: Axb <axb.lists_at_nospam>
Date: Tue Sep 20 2011 - 06:05:44 GMT
To: dev@spamassassin.apache.org

On 2011-09-19 23:37, darxus@chaosreigns.com wrote:
> On 09/19, Mark wrote:
>> darxus@chaosreigns.com wrote:
>>> The other 38 were notifications from livejournal.com, nothing spam
>>> related, from 2011-08-02 to 2011-08-11. It looks like you just had
>>> livejournal.com listed as a spammer for those 10 days. Those emails
>>> are not hitting this rule now.
>>
>> livejournal.com has been whitelisted for years, so it's certainly not expected
>> behaviour.
>
> Any SA dev folks have opinions on this? I'm up for assuming there was
> somehow a problem on my end and removing these from my corpora if that's
> what you devs think I should do.
>
> Mark, I encourage you to include dev@spamassassin.apache.org in your
> replies.
>
>> Perhaps you were using a DNS server that returned bad results. Some
>> governments (e.g. China) intercept DNS requests and return their own IP. Some
> ISPs think they can do that too for NXDOMAIN results.
>
> It seems unlikely. I'm using a local bind server with two forwarders to my
> hosting provider, linode.com, which is very open-source oriented, and
> seems unlikely to pull something like that. Although I'm happy to ask
> them via a support request if there was a related incident during this
> time period.
>
> The relevant rule is:
> urirhssub URIBL_WS_SURBL multi.surbl.org. A 4
>
> Does that mean it could've matched anything ending in .4, or only
> 127.0.0.4?
>
> Man page is Mail::SpamAssassin::Plugin::URIDNSBL
>
>> That should be preventable to a large extent by checking if the return code is
>> within the 127/8 IP range.
>
> Devs, if urirhssub with a value of "4" does not constrain to 127/8,
> we should change the rules to match only, for example, 127.0.0.4.
>
>> We don't control external DNS servers of course, so if one of them decides to
>> return a 127/8 code due to whatever cause (e.g. cache poisoning), it will
>> cause a false detection signal.
>
> Indeed.
>
>> Another possibility is DNS client error. That is known to occur with
>> multithreaded and asynchronous dns clients. Typical is a race condition while
>> accessing memory, causing a mix up of query returns.
>
> Seems unlikely, mostly because of the time frame.
>
>> Did the livejournal.com hits have specific subdomains?
>
> I just looked for notifications from livejournal that didn't hit this rule
> in the same time frame - there were none. Everything I got from
> livejournal.com from August 2nd to August 11th hit URIBL_WS_SURBL. And all
> included these urls:
> http://news.livejournal.com/
> http://www.livejournal.com/manage/subscriptions/
> Other URLs were generally subdomains of the form <user>.livejournal.com.
>
>> Also, I would expect that there would not be any query to SURBL for a domain
>> that is on SA's internal frequently queried whitelist. livejournal.com should
>> be on that list. Can you see if there were any changes/updates to SA that
>> could have caused this?
>
> The rules currently include:
>
> 25_uribl.cf:uridnsbl_skip_domain juno.com kernel.org livejournal.com lycos.com
>
> Certainly looks to me like that shouldn't allow livejournal.com to be
> looked up against SURBL.
>
> Closest backup of those config files I have is 2011-08-23, and that file
> has an md5 checksum identical to my current 25_uribl.cf. Same as the
> backup from 2011-07-01:
>
> # md5sum panic-2011-07-01/var/lib/spamassassin/3.004000/updates_spamassassin_org/25_uribl.cf
> 64a27859c0a7cdafbd856dce3461c2f3 panic-2011-07-01/var/lib/spamassassin/3.004000/updates_spamassassin_org/25_uribl.cf
>
> $ md5sum /var/lib/spamassassin/3.004000/updates_spamassassin_org/25_uribl.cf
> 64a27859c0a7cdafbd856dce3461c2f3 /var/lib/spamassassin/3.004000/updates_spamassassin_org/25_uribl.cf
>
>
> So it shouldn't be possible for livejournal.com to hit URIBL_WS_SURBL.
> I've removed the examples from my corpora. I'd still like to know how it
> happened. Here's the simplest example I can find:
> http://www.chaosreigns.com/sa/ws_surbl.txt
> Only URLs that could hit URIBL_WS_SURBL are www.livejournal.com and
> news.livejournal.com, right? Yep.
>
> spamassassin -D 2>&1 | grep multi.surbl | grep starting | less
>
> Sep 19 17:22:39.564 [9037] dbg: async: starting: URI-DNSBL, DNSBL:multi.surbl.org.:news.livejournal.com (timeout 15.0s, min 3.0s)
> Sep 19 17:22:39.569 [9037] dbg: async: starting: URI-DNSBL, DNSBL:multi.surbl.org.:www.livejournal.com (timeout 15.0s, min 3.0s)
>
> That's current trunk output, so there's a bug causing uridnsbl_skip_domain
> to not work? Opened bug:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6662
>
>
> Even without uridnsbl_skip_domain I still can't explain why this rule hit,
> and that still bothers me.

From what I'm seeing, livejournal.com is in 20_aux_tlds.cf:

util_rb_2tld livejournal.com

uridnsbl_skip_domain applies to the parent domain, not to subdomains.
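
A minimal sketch of how that interaction could produce the lookups in the debug output above. It is a simplified model in Python, not SpamAssassin's Perl code: the function names are hypothetical, and the exact-match comparison against the skip list is an assumption about the behaviour, not something confirmed from the source.

```python
# Simplified model only; this is not SpamAssassin's implementation.

# Stands in for the line: util_rb_2tld livejournal.com
TWO_LEVEL_TLDS = {"livejournal.com"}

# Stands in for the uridnsbl_skip_domain line quoted from 25_uribl.cf
SKIP_DOMAINS = {"juno.com", "kernel.org", "livejournal.com", "lycos.com"}


def registered_domain(host: str) -> str:
    """Return the name that would be looked up for `host` (hypothetical helper).

    Normally that is the last two labels. If those two labels are listed
    as a two-level TLD, one more label is kept.
    """
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_TLDS:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def is_skipped(host: str) -> bool:
    """Assumed behaviour: the trimmed name must match a skip entry exactly."""
    return registered_domain(host) in SKIP_DOMAINS


for host in ("news.livejournal.com", "www.livejournal.com",
             "livejournal.com", "www.kernel.org"):
    state = "skipped" if is_skipped(host) else "queried"
    print(f"{host} -> {registered_domain(host)} ({state})")
```

Under these assumptions news.livejournal.com and www.livejournal.com are trimmed to themselves rather than to livejournal.com, so the skip entry never matches them, which is consistent with the two "async: starting" lines in the quoted debug log.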

You are trusting a third-party DNS server (your forwarder), which *could*
be manipulating your queries.
If you have a local resolver, why add the extra query hop?

Or am I missing something?
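
On the quoted question of whether a subtest of "4" could match answers outside 127/8, here is a rough sketch of the difference between a bare bitmask test and one that also requires a 127.0.0.0/8 answer. It assumes the decimal subtest is treated as a bitmask on the returned A record, which is how I read the URIDNSBL man page. The function names are made up for illustration, and the sketch does not show what SpamAssassin itself does with out-of-range answers.

```python
import ipaddress

# Range that DNSBL-style answers are conventionally expected to fall in.
LOOPBACK_NET = ipaddress.ip_network("127.0.0.0/8")


def mask_hit(answer: str, mask: int) -> bool:
    """Bare bitmask test: true when any bit of `mask` is set in the address."""
    return (int(ipaddress.IPv4Address(answer)) & mask) != 0


def constrained_hit(answer: str, mask: int) -> bool:
    """Bitmask test that also requires the answer to lie in 127.0.0.0/8."""
    return ipaddress.IPv4Address(answer) in LOOPBACK_NET and mask_hit(answer, mask)


# 203.0.113.4 is a documentation address standing in for a hijacked answer.
for answer in ("127.0.0.4", "127.0.0.6", "127.0.0.2", "203.0.113.4"):
    print(answer, mask_hit(answer, 4), constrained_hit(answer, 4))
```

With a bare bitmask, any answer with the 4 bit set would count, including an address an intercepting resolver substitutes for NXDOMAIN. The constrained version rejects those, though it cannot help when the bad answer is itself inside 127/8.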