Sunday, February 6, 2011

Hash conflicts - let's keep it complicated

Hashing (and cryptology in general) is one area where the keep-it-simple-stupid principle is often ... well, quite stupid. I have run into hash collisions in weird and wonderful ways over the last year.

1) Firefox bug - hash collisions caused by Google Maps tile URLs being too similar. The Firefox URL hash is based on simple circular shifts, so near-identical URLs can land on the same value. Google ended up appending permutations of the word Galileo to make the URLs more distinct and less susceptible to collisions (see the sketch after this list).
2) Stuxnet exploits - Windows stored CRC32 hashes of scheduled task files, and these can be made to collide, letting malicious code run with elevated privileges. The patch changed the hashing to something less prone to collisions (SHA).
3) ESTA visitor registration site - my re-login ID collided with somebody from Perth, and they called me to warn me. Even though a fair few organisations have my details, I am seriously disappointed in the US government's inability to protect my data.
4) Git and Mercurial + other DVCS - I moved over to the DVCS temple due to the weaknesses of classic version control systems: linear version numbers and poor merging. Content-addressing revisions with SHA makes patches much more unique and version numbers unambiguous; if you find SHA hashes hard to remember, just use tags (a small sketch of Git's content addressing also follows this list). Collisions only become a statistical worry somewhere near the birthday bound of the hash - by that point your software project will have entered the cryptography hall-of-fame.
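
To make the Firefox/Google Maps case concrete, here is a purely illustrative Python sketch. The weak_hash function is a simplified rotate-and-XOR hash in the spirit of a shift-based URL/cache hash - not Firefox's actual code - and the tile-like query strings are made up. Because a 4-bit rotation wraps a 32-bit word every 8 characters, two requests whose differing digits sit exactly 8 characters apart hash to the same value, and padding with strings of varying length (the Galileo trick) breaks that alignment.

    def weak_hash(s: str) -> int:
        """Simplified rotate-and-XOR string hash (a stand-in for a
        shift-based URL/cache hash, not Firefox's actual function)."""
        h = 0
        for c in s:
            h = ((h << 4) | (h >> 28)) & 0xFFFFFFFF  # 32-bit rotate left by 4
            h ^= ord(c)
        return h

    # Two *different* tile requests: x and y are swapped, but the differing
    # digits sit exactly 8 characters apart, so their rotated contributions
    # land on the same bit positions and the hashes collide.
    a = "vt?x=12&pd&y=21"
    b = "vt?x=21&pd&y=12"
    print(weak_hash(a) == weak_hash(b))  # True

    # Appending padding of varying length (in the spirit of Google's Galileo
    # permutations) shifts the alignment, so the pair no longer cancels out.
    print(weak_hash(a + "&s=Gal") == weak_hash(b + "&s=Galileo"))  # almost certainly False

On the DVCS side, the unambiguous version numbers come from content addressing: Git's object id for a blob, for instance, is the SHA-1 of a small header plus the content, so the same bytes always get the same name. A minimal reimplementation (the helper name is mine, not Git's):

    import hashlib

    def git_blob_id(data: bytes) -> str:
        """Object id as computed by `git hash-object` for a blob:
        SHA-1 over a "blob <size>\\0" header followed by the content."""
        header = f"blob {len(data)}\0".encode()
        return hashlib.sha1(header + data).hexdigest()

    # Prints the same id that `git hash-object --stdin` reports for this content.
    print(git_blob_id(b"hello world\n"))
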

The proliferation of information in today's world is such that there is a high likelihood of some overlap occurring. Consider the data-pyramid-for-fast-access case; this trick is performed everywhere in the image processing world. If you are lucky enough to have the data in a format that supports built-in overviews, like TIFF, you can add them to the file itself. Otherwise you are stuck inventing your own pyramid storage scheme: .rrd files for Erdas Imagine, .ovr files for OSSIM, etc. Now you have to make sure you store the pyramids somewhere with write access, and if the data is on a CD you create a reusable tempfile that is tied to the original file in an unambiguous manner - i.e. hashing the source file is needed, as sketched below.
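
A minimal sketch of that last step - purely illustrative, with a made-up helper name and SHA-256 as the hash; the .ovr suffix just echoes the OSSIM convention mentioned above:

    import hashlib
    from pathlib import Path

    def overview_cache_path(source: Path, cache_dir: Path) -> Path:
        """Derive a writable sidecar location for pyramids/overviews that is
        unambiguously tied to a (possibly read-only) source raster. Hashing
        the file contents - rather than names or mtimes, which are unreliable
        on read-only media - gives a stable, collision-resistant key."""
        h = hashlib.sha256()
        with open(source, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return cache_dir / f"{source.stem}.{h.hexdigest()[:16]}.ovr"

    # Overviews for a raster on a CD land in a writable cache directory, and
    # re-running against the same bytes finds the same .ovr file again.
    # overview_cache_path(Path("/media/cdrom/scene.tif"), Path("/tmp/ovr_cache"))
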

It would be much nicer to use a file format which allows arbitrary internal overviews with compression - yes, I am talking about the obvious: JPEG2000. At some point there was talk of JPEG2000-based compression for NetCDF, but it turned out to be mostly talk. NetCDF libraries already decode GRIB data using JPEG2000; it would be nice to have the multi-dimensional data compressed, with arbitrary overviews, using the same scheme. There seems to be ongoing work on NetCDF streaming - ncstream - and perhaps compression could feature there.

2 comments:

Luke said...
This comment has been removed by a blog administrator.
Anonymous said...

I think you've highlighted not just the problem with md5 (maybe) but also bad programming!