On 5/21/07, Arian J. Evans <email@example.com> wrote: <snip>
> I can theorize why some of the crazy things in the wild exist, but in the
> end they may be simple control-c/v artifacts.
> (As Napoleon said: "Never ascribe to malice what one can ascribe to
No doubt. =)
What surprises me is that not all codepage conversion libraries do the same thing with this data. I've tested a few: some canonicalize full-width Unicode characters to their ASCII equivalents, and others do not. We run into trouble when the component doing input validation uses one canonicalization technique and the component doing the actual work uses a different one. Figuring out exactly what the different application platforms do would help show how much of a problem this poses in the real world.
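To make the mismatch concrete, here is a minimal Python sketch. It assumes a validator that looks only at the raw characters and a later stage that folds compatibility characters. Both function names are hypothetical, and Unicode NFKC normalization stands in for whatever conversion a given platform really performs.

```python
import unicodedata


def naive_validator(text: str) -> bool:
    """Hypothetical filter: accept input only if it has no ASCII angle brackets."""
    return "<" not in text and ">" not in text


def downstream_canonicalize(text: str) -> str:
    """Hypothetical later stage that folds compatibility characters via NFKC."""
    return unicodedata.normalize("NFKC", text)


# U+FF1C and U+FF1E are the fullwidth forms of '<' and '>'.
payload = "\uff1cscript\uff1e"

accepted = naive_validator(payload)        # the filter sees no ASCII brackets
result = downstream_canonicalize(payload)  # the worker sees plain ASCII markup

print(accepted, result)
```

The filter passes the input because it never sees an ASCII bracket, while the second stage ends up handling ordinary markup. Canonicalizing once, before validation, and reusing that same form downstream would close this particular gap.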
Somebody ought to put together a test suite for this, just to see what different vendors have done.
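A rough starting point for such a suite might look like the sketch below. The converters listed are only stand-ins available in the Python standard library (an assumption for illustration); the conversion routines of real vendor platforms would have to be plugged into the same table to learn anything about them.

```python
import unicodedata

# Fullwidth ASCII variants occupy U+FF01..U+FF5E; each one sits 0xFEE0
# above its ASCII counterpart (U+0021..U+007E).
FULLWIDTH = [chr(cp) for cp in range(0xFF01, 0xFF5F)]

# Stand-in converters (assumptions, not any vendor's actual behavior).
CONVERTERS = {
    "NFC": lambda s: unicodedata.normalize("NFC", s),
    "NFKC": lambda s: unicodedata.normalize("NFKC", s),
    "ascii/ignore": lambda s: s.encode("ascii", "ignore").decode("ascii"),
}


def count_folded(convert) -> int:
    """Count fullwidth characters that the converter turns into their ASCII twin."""
    return sum(1 for ch in FULLWIDTH if convert(ch) == chr(ord(ch) - 0xFEE0))


results = {name: count_folded(fn) for name, fn in CONVERTERS.items()}

for name, folded in results.items():
    print(f"{name:14} folds {folded} of {len(FULLWIDTH)} fullwidth characters")
```

Each row answers one question: does this converter silently turn fullwidth input into ASCII? Two components that land in different rows are candidates for the validation mismatch described above.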
(At first I was of the opinion that doing such conversions was a dangerous misfeature, but it actually has some fairly important applications. For example, doing full text indexing of character data from different sources requires that you canonicalize first...)
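As a small illustration of the indexing case, the sketch below builds a toy inverted index that canonicalizes every token before storing it. The helper names and sample documents are made up for the example, and NFKC plus case folding is just one plausible choice of canonical form.

```python
import unicodedata
from collections import defaultdict


def canonical(token: str) -> str:
    """Fold compatibility characters and case so equivalent spellings collide."""
    return unicodedata.normalize("NFKC", token).casefold()


def build_index(docs: dict) -> dict:
    """Map each canonical token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[canonical(token)].add(doc_id)
    return index


docs = {
    "a": "unicode bugs",
    # Second document spells "Unicode" with fullwidth letters.
    "b": "\uff35\uff4e\uff49\uff43\uff4f\uff44\uff45 bugs",
}

index = build_index(docs)
hits = index[canonical("Unicode")]
print(sorted(hits))
```

Without the canonicalization step the fullwidth spelling would be indexed as a separate term, and a search for the ASCII word would miss the second document.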