spamassassin-dev September 2011 archive

Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

From: Henrik Krohns <hege_at_nospam>
Date: Fri Sep 23 2011 - 07:37:12 GMT
To: dev@spamassassin.apache.org

On Fri, Sep 23, 2011 at 10:30:18AM +0300, Jari Fredriksson wrote:
> 22.9.2011 20:59, darxus@chaosreigns.com wrote:
> > On 09/22, Warren Togami Jr. wrote:
> >> On a separate note, I have a volunteer at school willing to help us build
> >> a Mandarin language ham corpus a few months from now. That will be
> >> interesting to see how that throws off our statistics. =)
> >
> > I've been wondering about SA's accuracy on other languages. It looks like
> > the only corpus we have is your wt-jp1? What's the accuracy like on that?
> > Is the accuracy available somewhere on ruleqa? I'm actually more curious
> > about accuracy of *spam* in non-English, because I'd say a very
> > significant portion of my missed spam is in a non-Latin alphabet.
> > And I don't want to just tell SA to classify non-English as spam because
> > it would be nice if SA was actually usable for people who speak these
> > languages.
> >
> > 75 out of the 113 spams SA has missed so far this month have subjects in a
> > non-Latin alphabet. 66.4%. That doesn't even include a bunch of the
> > non-English stuff.
> >
> > (I'm also not using bayes.)
> >
>
> My smallish corpus (mostly ham) is in Finnish, with some English in
> it. The spam is of course in English and other languages; there is no
> Finnish spam available ;)

There isn't any Finnish spam per se, but there are loads of "badly
autotranslated" Finnish-language spam/phishing coming in daily.