spamassassin-dev September 2011 archive

Non-English accuracy Re: Rescore Masscheck for 3.4.x?

From: <darxus_at_nospam>
Date: Thu Sep 22 2011 - 17:59:50 GMT
To: dev@spamassassin.apache.org

On 09/22, Warren Togami Jr. wrote:
> On a separate note, I have a volunteer at school willing to help us build
> a Mandarin language ham corpus a few months from now. That will be
> interesting to see how that throws off our statistics. =)

I've been wondering about SA's accuracy on other languages. It looks like
the only non-English corpus we have is your wt-jp1? What's the accuracy
like on that? Is the accuracy available somewhere on ruleqa? I'm actually
more curious about accuracy on non-English *spam*, because a very
significant portion of my missed spam is in a non-Latin alphabet.
And I don't want to just tell SA to classify non-English as spam, because
it would be nice if SA were actually usable for people who speak these
languages.

75 out of the 113 spams SA has missed so far this month (66.4%) have
subjects in a non-Latin alphabet. That doesn't even include a bunch of the
non-English stuff written in a Latin alphabet.
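
A rough sketch of how a count like the one above could be reproduced from a list of Subject headers. This is not how the figure in this message was produced; the function names, the 50% letter threshold, and the "LATIN" check on Unicode character names are all assumptions made for illustration.

```python
import unicodedata
from email.header import decode_header, make_header


def is_non_latin_subject(raw_subject, threshold=0.5):
    """Return True if most letters in the decoded subject are not Latin.

    The threshold is an arbitrary illustrative choice, not an SA setting.
    """
    # Decode RFC 2047 encoded words such as =?UTF-8?B?...?= into text.
    subject = str(make_header(decode_header(raw_subject)))
    letters = [ch for ch in subject if ch.isalpha()]
    if not letters:
        return False
    # Heuristic: Latin letters have "LATIN" in their Unicode name.
    non_latin = sum(
        1 for ch in letters if "LATIN" not in unicodedata.name(ch, "")
    )
    return non_latin / len(letters) > threshold


def non_latin_share(subjects):
    """Return (hits, total, percentage) for a collection of subjects."""
    subjects = list(subjects)
    hits = sum(is_non_latin_subject(s) for s in subjects)
    pct = 100.0 * hits / len(subjects) if subjects else 0.0
    return hits, len(subjects), pct
```

Called on the subjects of missed spam, `non_latin_share` returns the count, the total, and the percentage, so 75 hits out of 113 messages comes out at roughly 66.4%. Non-English spam written in a Latin alphabet would not be caught by this check.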

(I'm also not using bayes.)

-- 
"Some people will tell you that slow is good - and it may be, on some
days - but I am here to tell you that fast is better.... That is why God
made fast motorcycles...." - Hunter S. Thompson
http://www.ChaosReigns.com