Quite common task during schema review is to find the optimal data type for the column value – for example column is defined as INT but is it really needed or may be SMALLINT or even TINYINT will do instead. Does it contain any NULLs or it can be defined NOT NULL which reduces space needed and speeds up processing in most cases ?
These and similar tasks are often done by bunch of SQL queries while really MySQL has a native feature to perform this task – PROCEDURE ANALYSE
Here is sample run of PROCEDURE ANALYSE on drupal schema of Percona web site. The output contains row for each column in the table but I’ve omitted everything by few rows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | mysql> select * from node procedure analyse() \G *************************** 1. row *************************** Field_name: percona_drupal.node.nid Min_value: 1 Max_value: 194 Min_length: 1 Max_length: 3 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 115.8345 Std: 45.5192 Optimal_fieldtype: TINYINT(3) UNSIGNED NOT NULL *************************** 2. row *************************** Field_name: percona_drupal.node.vid Min_value: 1 Max_value: 194 Min_length: 1 Max_length: 3 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 115.8345 Std: 45.5192 Optimal_fieldtype: TINYINT(3) UNSIGNED NOT NULL *************************** 3. row *************************** Field_name: percona_drupal.node.type Min_value: blog Max_value: webform Min_length: 4 Max_length: 13 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 8.7482 Std: NULL Optimal_fieldtype: ENUM('blog','contact','customer','image','lab_project','news','page','percona_faq','presentation','press_release','project','random_phrase','service','team_member','webform') NOT NULL |
Here is how this table was originally defined:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | mysql> desc node; +----------+------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+------------------+------+-----+---------+----------------+ | nid | int(10) unsigned | NO | PRI | NULL | auto_increment | | vid | int(10) unsigned | NO | PRI | 0 | | | type | varchar(32) | NO | MUL | | | | title | varchar(128) | NO | MUL | | | | uid | int(11) | NO | MUL | 0 | | | status | int(11) | NO | MUL | 1 | | | created | int(11) | NO | MUL | 0 | | | changed | int(11) | NO | MUL | 0 | | | comment | int(11) | NO | | 0 | | | promote | int(11) | NO | MUL | 0 | | | moderate | int(11) | NO | MUL | 0 | | | sticky | int(11) | NO | | 0 | | +----------+------------------+------+-----+---------+----------------+ 12 rows in set (0.00 sec) |
This example shows very well the output of PROCEDURE ANALYSE or any similar tool should not be taken for an instant action but rather used as a base to make your own decision.
For example we can see nid and vid which are defined as INT in original schema can fit to TINYINT. Should we change them ? Probably not – the site will continue to growth and we expect to have more than 255 items this type would allow.
We probably could have changed it to SMALLINT or MEDIUMINT but on the small database like this it does not really yields any significant benefits so it is better to stick to default.
The type field is suggested for conversion to ENUM though as we know new types can appear by inserts we can’t do it. In other drupal installation though when types never change and savings could be significant the change may make sense.
Another important piece of information for type column is maximum length which is really helpful in more cases than ENUM data type – maximum length of data actually stored often can tell us if we defined varchar field too wide. Comparing maximum field length to average length we can see if CHAR or VARCHAR type fits better – for example if column always have same width like hashes CHAR can be more efficient storage.
In general even though I would not just use this tool suggested data types to do an ALTER TABLE it provides very helpful information for performing schema audit.
One thing that is really missing from the output is the *current* data type, and how far I am from exceeding it.
I’m developing a utility which will roughly act like PROCEDURE ANALYSE(), but will rather let you know you capacity percentage.
So far, it only supports AUTO_INCREMENT columns – and let’s you know how far you are from running out of space. You can ask it to only show you columns which are more than, say, 60% used. For example, a TINYINT column with a value of 200 will show in the output.
I’ll be adding other issues, such as used char-length etc. It does not come to replace PROCEDURE ANALYZE (it does not recommend data types) but rather give you an overall look on “what’s my status right now: am I running out of numbers? Am I running out of text length?”
Find it here: oak-show-limits. Again, it’s quite limited now, but work in ongoing.
Nice, real nice.
Both of you. 🙂
Shlomi,
Right. It is just something build in which make it nice. Thank for the link.
Peter.
Indeed. PROCEDURE ANALYSE() is a helpful utility. Sorry if I appear to do too much “marketing” for my toolkit.
Shlimi,
No this is quite fine 🙂
The Max_length field in the analyse() output is also very useful, as Shlomi indicates, for detecting if a varchar column is too narrow. I had created a column in a table to store names provided by a customer, and I thought I had made it sufficiently wide to handle the data they said they would be loading. While doing some maintenance I decided to run analyse() and discovered they had recently starting adding some entities with names that were only one character shorter than the maximum size. This allowed me to alter the table in off hours rather than during the middle of the day when they expanded the names further and INSERTs suddenly started failing.
Of course, this issue jumped out at me only because I knew all the column widths. A tool that combines both bits of info is indeed useful.
@Robert
I think it’s a good opportunity to remind that sql_mode had best been set to ‘TRADITIONAL’ or otherwise strict mode, so that when attempting to INSERT texts *longer* than column width – an error is thrown (instead of silent truncation)
This topic is useful, I learned from it.
PROCEDURE ANALYSE is good in addition to EXPLAIN but also consider the future growth of your database before doing any data type alter.