spamassassin-dev September 2011 archive

Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

From: Jari Fredriksson <jarif_at_nospam>
Date: Fri Sep 23 2011 - 07:30:18 GMT
To: dev@spamassassin.apache.org

On 22.9.2011 20:59, darxus@chaosreigns.com wrote:
> On 09/22, Warren Togami Jr. wrote:
>> On a separate note, I have a volunteer at school willing to help us build
>> a Mandarin language ham corpus a few months from now. That will be
>> interesting to see how that throws off our statistics. =)
>
> I've been wondering about SA's accuracy on other languages. It looks like
> the only corpus we have is your wt-jp1? What's the accuracy like on that?
> Is the accuracy available somewhere on ruleqa? I'm actually more curious
> about accuracy of *spam* in non-English, because I'd say a very
> significant portion of my missed spam is in a non-Latin alphabet.
> And I don't want to just tell SA to classify non-English as spam because
> it would be nice if SA was actually usable for people who speak these
> languages.
>
> 75 out of the 113 spams SA has missed so far this month have subjects in a
> non-Latin alphabet. 66.4%. That doesn't even include a bunch of the
> non-English stuff.
>
> (I'm also not using bayes.)
>

My smallish corpus (mostly ham) is in Finnish, with some English in it
as well. The spam is of course in English and other languages; there is
no Finnish spam available ;)
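
For anyone wanting to reproduce the kind of figure quoted above (75 of 113 missed spams, 66.4%), here is a rough sketch of how the share of non-Latin subjects could be counted. It is not part of SpamAssassin or its mass-check tooling; the function names and the 0.5 threshold are made up for illustration, and it only uses the Python standard library.

```python
import unicodedata
from email.header import decode_header, make_header


def decode_subject(raw):
    """Decode an RFC 2047 encoded Subject header into plain text."""
    return str(make_header(decode_header(raw)))


def is_non_latin(subject, threshold=0.5):
    """Return True if more than `threshold` of the letters are non-Latin.

    The threshold is an arbitrary assumption, not a SpamAssassin setting.
    """
    letters = [ch for ch in subject if ch.isalpha()]
    if not letters:
        return False
    # Unicode character names of Latin letters contain the word "LATIN",
    # including accented ones such as the Finnish "ä".
    non_latin = sum(
        1 for ch in letters if "LATIN" not in unicodedata.name(ch, "")
    )
    return non_latin / len(letters) > threshold


def non_latin_share(raw_subjects):
    """Percentage of subjects classified as non-Latin, to one decimal."""
    subjects = [decode_subject(s) for s in raw_subjects]
    if not subjects:
        return 0.0
    hits = sum(1 for s in subjects if is_non_latin(s))
    return round(100.0 * hits / len(subjects), 1)


# The arithmetic from the quoted message: 75 of 113 missed spams.
quoted_share = round(100.0 * 75 / 113, 1)
```

Feeding `non_latin_share` the Subject headers of missed spam would give a comparable percentage. A Finnish subject counts as Latin under this check, so it would not be lumped together with the non-Latin spam.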

-- "I wonder", he said to himself, "what's in a book while it's closed. Oh, I know it's full of letters printed on paper, but all the same, something must be happening, because as soon as I open it, there's a whole story with people I don't know yet and all kinds of adventures and battles." -- Bastian B. Bux