spamassassin-dev September 2011 archive

Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

From: Jari Fredriksson <jarif_at_nospam>
Date: Fri Sep 23 2011 - 07:30:18 GMT

22.9.2011 20:59, wrote:
> On 09/22, Warren Togami Jr. wrote:
>> On a separate note, I have a volunteer at school willing to help us build
>> a Mandarin language ham corpus a few months from now. That will be
>> interesting to see how that throws off our statistics. =)
> I've been wondering about SA's accuracy on other languages. It looks like
> the only corpus we have is your wt-jp1? What's the accuracy like on that?
> Is the accuracy available somewhere on ruleqa? I'm actually more curious
> about accuracy of *spam* in non-English, because I'd say a very
> significant portion of my missed spam is in a non-Latin alphabet.
> And I don't want to just tell SA to classify non-English as spam because
> it would be nice if SA was actually usable for people who speak these
> languages.
> 75 out of the 113 spams SA has missed so far this month have subjects in a
> non-Latin alphabet. 66.4%. That doesn't even include a bunch of the
> non-English stuff.
> (I'm also not using bayes.)

My smallish corpus (mostly ham) is mainly Finnish, with some English in
it. The spam is of course in English and other languages; there is no
Finnish spam available ;)

-- 
"I wonder", he said to himself, "what's in a book while it's closed. Oh,
I know it's full of letters printed on paper, but all the same, something
must be happening, because as soon as I open it, there's a whole story
with people I don't know yet and all kinds of adventures and battles."
-- Bastian B. Bux