"David F. Skoll" <email@example.com> wrote:
> The philosophical one: Do heuristics like PhishingScanURLs belong in a
> virus scanner? I realize that once the engine is in place, it's
> tempting to add features, but I'm not convinced such things belong in
> a virus scanner. I think they are more in the domain of anti-spam
> software, especially since it's good for security to keep your
> virus-scanner small, fast and secure and do more complex text analysis
> in a language other than C. I guess I would vote for PhishingScanURLs
> to be "no" by default rather than "yes".
A/V and spam filters do the same task: scan a stream for multiple strings. I've long said they should be implemented in the same package; otherwise you double the overhead of feeding your files through the scanners. The external overhead of running two programs is *far* worse than the computational overhead of searching for twice as many strings. If you *are* running both, then I agree that the better fit is in the spam software, unless there's almost no overhead from adding it to the AV software.
In fact, with a decent string search algorithm (using a trie of strings), there should be very little extra overhead in adding more strings to be searched in parallel. (Example code is at http://www.gtoal.com/spam/src/newcode/multidawg.c.html) Straight string matches (possibly with a char-by-char lookup table at the point of comparison, which can handle case equivalence and a few other useful equivalences) are much faster than regexps, and I've found that I haven't missed having regexps.
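To make the trie idea concrete, here is a minimal sketch in C. It is not the multidawg.c code linked above and not how ClamAV does it; the names `struct node`, `init_fold`, `new_node`, `trie_add` and `trie_scan` are made up for this example. It restarts the trie walk at every offset, so the work per offset is bounded by the longest pattern and does not grow with the number of patterns. An Aho-Corasick automaton would remove the restarts as well, but that is left out to keep the sketch short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define ALPHABET 256

/* One trie node: a child per byte value, plus the id of the
 * pattern that ends at this node (-1 if none ends here). */
struct node {
    struct node *child[ALPHABET];
    int pattern_id;
};

/* Char-by-char equivalence table, applied at the point of comparison.
 * Here it only folds case; other equivalences could be added. */
static unsigned char fold[ALPHABET];

static void init_fold(void)
{
    for (int c = 0; c < ALPHABET; c++)
        fold[c] = (unsigned char)tolower(c);
}

static struct node *new_node(void)
{
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    n->pattern_id = -1;
    return n;
}

/* Insert one pattern; prefixes shared with other patterns share nodes. */
static void trie_add(struct node *root, const char *pattern, int id)
{
    struct node *n = root;
    for (const unsigned char *p = (const unsigned char *)pattern; *p; p++) {
        unsigned char c = fold[*p];
        if (n->child[c] == NULL)
            n->child[c] = new_node();
        n = n->child[c];
    }
    n->pattern_id = id;
}

/* Scan buf for all patterns at once. Records up to max_hits pattern ids
 * in hits[], in order of discovery, and returns how many were recorded. */
static size_t trie_scan(const struct node *root, const unsigned char *buf,
                        size_t len, int *hits, size_t max_hits)
{
    assert(root != NULL && buf != NULL && hits != NULL);
    size_t found = 0;
    for (size_t start = 0; start < len; start++) {
        const struct node *n = root;
        for (size_t i = start; i < len; i++) {
            n = n->child[fold[buf[i]]];
            if (n == NULL)
                break;              /* no pattern continues with this byte */
            if (n->pattern_id >= 0 && found < max_hits)
                hits[found++] = n->pattern_id;
        }
    }
    return found;
}
```

After calling `init_fold()` and adding "paypal" (id 0), "pay" (id 1) and "login" (id 2), scanning the text "Visit PayPal-Login now" records ids 1, 0, 2 in that order. Adding a fourth or a four-thousandth pattern only adds nodes to the trie; the scanning loop itself stays the same.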
It's been some time since I looked at ClamAV, but I would expect (or hope) that its byte string search is done something like this, in a way that scales well as more byte sequences are added.
You're right in your assessment above. It should be simple and lightweight. That doesn't rule out scanning for URLs in the body text; it just means you have to do so efficiently, and IMHO using regexps is inefficient and seldom justified.