spamassassin-users October 2010 archive

Re: Collecting IP reputation data from many people

From: Royce Williams <royce.williams_at_nospam>
Date: Fri Oct 22 2010 - 14:39:38 GMT
To: users@spamassassin.apache.org

On Fri, Oct 22, 2010 at 5:19 AM, Michael Scheidell
<michael.scheidell@secnap.com> wrote:
> On 10/21/10 8:50 PM, Darxus@ChaosReigns.com wrote:
>>
>> I'd like to try collecting reputation data for every IP address from
>> everyone willing to submit it.

> re-inventing the wheel.

If what's being suggested is a non-commercial alternative to a
commercial product, then I think that the pejorative connotations of
"re-inventing the wheel" don't apply. :-) This is a wheel that needs
re-inventing, and begs for an RFC.

OK, a bit of brainstorming here, indulge me a bit ...

Imagine an open standard that commercial and non-commercial vendors
could implement. It could scale by letting each site choose which
peers to share reputation with. One could assign relative "trust"
levels to different peers, depending on how much a site believes that
their spam judgments overlap with another's. ISPs might more closely
peer with each other, as could small businesses, etc.
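
To make the per-peer "trust" idea concrete, here is a minimal sketch in
Python. Nothing in it comes from an existing standard or product: the
function name, the peer names, the -1..+1 score scale, and the 0..1
trust weights are all assumptions made for illustration.

```python
# Hypothetical sketch: the names and the score scale are assumptions.
def combined_reputation(reports, trust):
    """Trust-weighted average of peer scores for one IP.

    reports: {peer_name: score}, score in [-1.0, 1.0]
             (-1 = spammy, +1 = clean)
    trust:   {peer_name: weight}, weight in [0.0, 1.0],
             chosen by the local admin
    Returns None when no trusted peer has reported on the IP.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for peer, score in reports.items():
        weight = trust.get(peer, 0.0)  # peers we don't know get no say
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        return None
    return weighted_sum / total_weight


# Made-up peers and scores for a single IP.
trust = {"isp-a": 0.9, "isp-b": 0.7, "small-biz": 0.2}
reports = {"isp-a": -0.8, "isp-b": -0.6, "small-biz": 0.5, "stranger": 1.0}
score = combined_reputation(reports, trust)
```

Here the two closely trusted ISPs outweigh the lightly trusted peer, and
the unknown "stranger" is ignored entirely, so the combined score comes
out negative (spammy).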

If it were left up to the individual admins to decide how to figure
out if something is spam or not, then some sites would be subjectively
"better" at spam rating than others - fodder for a meta-reputation
system. If my system and Bob's system have 10,000 IPs in common, and
there is good overlap in our true/false positives/negatives, then I
could tell my system to put more faith in his ratings of IPs that I
haven't seen yet, and vice versa. (This would work sort of like
Netflix ratings, where someone who has tastes very similar to mine
really liked this movie that I haven't seen, so I'm likely to like it,
too.)
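
One crude way to turn that overlap into a number is the fraction of
shared IPs on which both sites reached the same verdict. The sketch
below assumes each site can export a simple spam/not-spam verdict per
IP; the function name, the min_overlap parameter, and the sample data
are hypothetical. Raw agreement is about the simplest measure possible,
and a real system would probably want something more robust.

```python
# Hypothetical sketch: verdicts are plain booleans per IP (an assumption).
def peer_agreement(mine, theirs, min_overlap=100):
    """Fraction of shared IPs on which two sites reached the same verdict.

    mine, theirs: {ip: True if judged a spam source, False otherwise}
    Returns None if fewer than min_overlap IPs are shared, since a
    handful of common IPs says little about how similar two sites are.
    """
    shared = mine.keys() & theirs.keys()
    if len(shared) < min_overlap:
        return None
    agree = sum(1 for ip in shared if mine[ip] == theirs[ip])
    return agree / len(shared)


# Tiny made-up data set using documentation-range addresses.
mine = {"192.0.2.1": True, "192.0.2.2": False,
        "192.0.2.3": True, "192.0.2.4": True}
bobs = {"192.0.2.1": True, "192.0.2.2": False,
        "192.0.2.3": False, "192.0.2.4": True,
        "192.0.2.99": True}  # an IP only Bob has seen

agreement = peer_agreement(mine, bobs, min_overlap=3)
```

The resulting agreement could then serve as the weight given to Bob's
ratings for IPs such as 192.0.2.99 that the local site has not seen.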

I've read in multiple places that if 500 people all guess how many
marbles are in a jar, then while there may be wide variation in the
guesses, the average is remarkably close to the real count. While
there's no hard "real" spamminess value (because it's relative, as
others have said in this thread), I'd bet that the aggregate would be
very useful. Seeing the statistical spread, and being able to act on
it programmatically, would be sweet. (Amazon shows this kind of
spread: if, out of 500 ratings, 480 people gave an item 5 stars, 15
gave it 4, etc., then consensus is pretty clear that it's a cool item.)
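
As a rough illustration of acting on the spread, the sketch below
summarizes a list of ratings and checks whether most of them land on a
single value. It uses the 500-rating example above. The post only
specifies the 480 and 15 counts, so the remaining 5 ratings are assumed
to be 3 stars. The function names and the min_count and share
thresholds are likewise assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(ratings):
    """Average, spread, and per-value counts for a list of ratings."""
    return {
        "count": len(ratings),
        "mean": mean(ratings),
        "spread": pstdev(ratings),  # population standard deviation
        "histogram": dict(Counter(ratings)),
    }


def clear_consensus(ratings, min_count=50, share=0.9):
    """True when at least `share` of the ratings fall on one value.

    min_count and share are arbitrary thresholds for this sketch.
    """
    if len(ratings) < min_count:
        return False
    top_count = Counter(ratings).most_common(1)[0][1]
    return top_count / len(ratings) >= share


# 480 five-star and 15 four-star ratings are from the example above;
# the last 5 ratings (3 stars) are an assumption to reach 500.
ratings = [5] * 480 + [4] * 15 + [3] * 5
summary = summarize(ratings)
consensus = clear_consensus(ratings)
```

With 96% of the ratings on one value, the consensus check passes. A
reputation consumer could apply the same test to per-IP reports before
trusting the aggregate.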

Decentralized distribution would be the tricky part, of course. :-)

Royce