I had a fun case today.
There is set of cache tables which cache certain content in MyISAM tables and queries for these tables such as:
1 | select data from cache0003 where `key`=2342526263 and real_key='cp_140797_6460aad5d2e50d3e859e8649007686ac'; |
The “key” is CRC32 of the real key which is used to keep index size as small as possible so if we have a cache miss we can in most case learn it without going to the disk.
So far so good.
The problem I discovered however is some of these queries would take enormous amount of time while CRC32 conflicts are really rare.
Looking deep into the problem I found out PHP and MySQL are both to blame. PHP is to blame because in 32bit PHP version result of crc32() function was returned as signed integer, in 64bit build of same PHP version it became signed.
The system worked on 32bit platform initially so “key” column was defined as “int”
As it was migrated to 64bit platform we got unsigned 32bit values which did not fit in this column any more so MySQL was silently converting them to 2^32-1, in just about 50% of the cases. This one is kind of expected.
What was unexpected however is how MySQL executed select queries if key value would be out of signed int range.
Instead of simply telling “impossible where noticed” as we have value outside of rage of values which can possibly be in the database we have MySQL truncating this value to 2^32-1, then performing index ref lookup (traversing about half of the rows in pages as cardinality for this constant is low) and discarding all of them before no values matched supplied key value.
So beware, data truncation can backfire in a ways you might not ever expect 🙂
This sounds like a great argument against ever truncating data. Having a DBMS store different data that what I provided it with is just wrong. Instead it should throw an error.
Hmmmm, didn’t know crc32 returned an ‘unsigned 32 bit’ value on 64bit. It probably is a regular signed int, but since signed 64bit can store a much large value, it won’t go negative.
Thanks for mentioning, am now recalculating all crc32 values in my mixed environment setup 🙂
Thomas,
I might be mistaken and it could be it is just int which wraps around in 32bit but not at 64bit, but the fact is if you echo/convert it to string you’ll get different value in 32bit and 64bit versions.
Use MD5 or SHA1 instead of CRC32.
Bolk,
This is not the point. Things should be working with CRC32 as well. For this task MD5 or SHA1 are not good fit as they are too long and what I need it short keys even if it means a bit higher collision rate.
Joseph,
You’re right this is the problem and this is why MySQL 5.0 gets strict mode.
If you however look at general pictures in many cases people select simple thing compared to reliable thing. For example so widely popular PHP does not require variables to be declared neither has exact types which cause numerous security and reliability problems.
Same applies to MySQL most of MySQL user base loading some dirty data to MySQL would rather like it to be truncated int the case when unexpected data is found rather than ball out and require investigations on how to convert it right.
For other group reliability is more important, and they usually only started using MySQL in the recent years – this is where we get Strict mode and similar changes from.
@ peter
As far as I checked, the problem is not the conversion to string, but the fact 64bit php returns a 64 bit signed integer which wont wrap at the same position as a 32bit signed integer.
I’ve solved my problems by using sprintf(‘%u’,crcvalue) when crcvalue is negative and then store all crc32 values as unsigned integers in mysql.
Thomas,
Yes I can be wrong regarding what really happens internally. I just know if you just echo it out it will be signed on 32bit platforms and always positive on 64bit platforms. It well may be wrapping around thing.
I know about sprintf thing but in this case it was existing application with existing data so it was not changed.
But you’re right if I would be dealing with same problem today I would use sprintf for compatibility reasons and have field unsigned.