Collecting IP reputation data from many people

From: <Darxus_at_nospam>
Date: Fri Oct 22 2010 - 00:50:43 GMT

I'd like to try collecting reputation data for every IP address from
everyone willing to submit it, via something like spamassassin's --report
(spam) and --revoke (non-spam) functions, and then providing a list with
the percentage of non-spam (vs. spam) seen from every IP. Hopefully
that would be useful in filtering spam, similar to the hostkarma white /
black / yellow list.

Because the existing lists only cover a small percentage of non-spam. For
example, I believe covers by far the most, and all of its
ranks together (high, medium, low, and none) only cover 30% of non-spam
according to the latest SA network mass check.

And DNSWL is ranked highest among rules matching non-spam.
(I've worked with on and off for years.)

I realize the primary problem with this is that if it becomes an effective
tool for filtering spam, I can expect that the vast majority of the data
I would get would be maliciously bad data from spammers via zombies.

But I think it would be little enough work for me to set up to be worth
the effort:

1) Write a little server to accept user ID, IP, and spam/non-spam over an
   encrypted TCP connection, and write the aggregate out to a file to be
   downloaded via http or rsync or something.
2) Write a SA module to report that data to the above server via --report
   and --revoke.
3) Write another SA module to test sender IPs (last untrusted relay)
   against this list.
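
As a rough illustration of the aggregation half of step 1, here is a
minimal Python sketch. It leaves out the encrypted TCP listener entirely.
ReputationStore, record, aggregate_lines, and the "IP percentage" line
format are hypothetical names and choices made up for this example, not an
existing SpamAssassin interface.

```python
import ipaddress
from collections import defaultdict


class ReputationStore:
    """Hypothetical in-memory tally of spam / non-spam reports per sender IP."""

    def __init__(self):
        # Raw reports are kept so the tally could be rebuilt later.
        self.reports = []
        # ip -> [spam_count, nonspam_count]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, user_id, ip, is_spam):
        """Store one report. Raises ValueError if ip is not a valid address."""
        ip = str(ipaddress.ip_address(ip))
        self.reports.append((user_id, ip, is_spam))
        self.counts[ip][0 if is_spam else 1] += 1

    def aggregate_lines(self):
        """Yield one 'IP percent_nonspam' line per IP, as an integer percent."""
        for ip in sorted(self.counts):
            spam, nonspam = self.counts[ip]
            yield f"{ip} {100 * nonspam // (spam + nonspam)}"

    def write_aggregate(self, path):
        """Write the aggregate to a file that could be served via http or rsync."""
        with open(path, "w") as out:
            for line in self.aggregate_lines():
                out.write(line + "\n")
```

Keeping the user ID with each raw report is what would later allow reports
from a particular submitter to be discounted.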

I'm not interested in ranking every IP listed on the major blacklists.
I'll continue to use a couple of blacklists at my MTA, and myself only
report spam that makes it into my inbox. If others wish to be more thorough
in reporting spam, that's fine with me.

So do you think this is worth my effort?

Would you be willing to occasionally submit data?

How could this be improved?

The fun part comes when hundreds of thousands of IPs which have been
mimicking legitimate reporters suddenly falsely report some IP(s)
as sending (only/mostly) non-spam. The response would be flagging those
reporters to be ignored, and then catching the next wave.

Or falsely reporting legit mail servers as sending large quantities of spam
in another attempt to cripple the usefulness of such a system.
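
Flagging reporters to be ignored could look roughly like the Python sketch
below. It assumes the raw (user ID, IP, spam/non-spam) reports are kept
rather than only the totals, so counts can be recomputed once a set of
reporters is flagged. The function name tally and its arguments are
hypothetical, and the sketch says nothing about how bad reporters would be
detected in the first place.

```python
from collections import defaultdict


def tally(reports, ignored_reporters=frozenset()):
    """Count (spam, nonspam) per IP, skipping reports from ignored user IDs.

    reports: iterable of (user_id, ip, is_spam) tuples.
    ignored_reporters: set of user IDs whose reports are dropped.
    """
    counts = defaultdict(lambda: [0, 0])
    for user_id, ip, is_spam in reports:
        if user_id in ignored_reporters:
            continue  # reporter was flagged, so this report is not counted
        counts[ip][0 if is_spam else 1] += 1
    return {ip: tuple(pair) for ip, pair in counts.items()}
```

For example, if two flagged reporters vouch for an IP that one trusted
reporter marked as spam, recomputing with those two ignored leaves only the
spam report for that IP.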

I'm also interested in more ideas on how spammers could game this system,
and what could be done about it.

I was originally thinking it would be most informative to provide the
number of spams and non-spams from each IP over some time period. But that
would also give spammers too much information on how much they're managing
to affect the list. So maybe percentage of email from each IP which is not
spam, and a number proportional to the total number of emails sent from
each IP (100 for the highest traffic IP, 0 for the lowest), in integers?
I can certainly see why hostkarma just went with black / white /
yellow / brown.
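
To make the percentage-plus-volume idea concrete, here is a small Python
sketch. It reads "100 for the highest traffic IP, 0 for the lowest" as
linear min-max scaling, which is only one possible interpretation. The
function name publish_rows and the input shape (IP mapped to a spam /
non-spam pair) are assumptions for illustration.

```python
def publish_rows(counts):
    """Turn raw counts into (ip, percent_nonspam, volume) integer rows.

    counts: dict mapping ip -> (spam_count, nonspam_count).
    volume is scaled linearly so the busiest IP gets 100 and the quietest 0.
    """
    totals = {ip: s + n for ip, (s, n) in counts.items() if s + n > 0}
    if not totals:
        return []
    lowest, highest = min(totals.values()), max(totals.values())
    span = highest - lowest
    rows = []
    for ip in sorted(totals):
        spam, nonspam = counts[ip]
        percent_nonspam = 100 * nonspam // totals[ip]
        # If every IP has the same traffic there is nothing to scale against.
        volume = 100 * (totals[ip] - lowest) // span if span else 100
        rows.append((ip, percent_nonspam, volume))
    return rows
```

Publishing only these two integers hides the raw counts, at the cost of
making low-traffic IPs look as certain as high-traffic ones unless the
volume column is taken into account.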

-- "Life is either a daring adventure or it is nothing at all." - Helen Keller