July 25, 2014

Converting Character Sets

The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those still on latin1 have begun migrating their databases to utf8. Googling for “mysql convert charset to utf8” turns up a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here, show why they don’t work, and then present a script that can be used generically to convert a database (or set of tables) to a target character set and collation.

Approach #1:

Take the following table as an example why this approach will not work:
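A sketch of the situation, assuming Approach #1 is MySQL’s one-shot ALTER TABLE … CONVERT TO CHARACTER SET (the table and column names t1 and c1 are illustrative):

```sql
-- A latin1 table with a TEXT column (names are illustrative)
CREATE TABLE t1 (
  c1 TEXT
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Approach #1: let MySQL convert everything in one statement
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- SHOW CREATE TABLE t1 now reports c1 as MEDIUMTEXT: utf8 needs up to
-- three bytes per character, so MySQL widens the column to guarantee
-- that everything the original column could hold still fits.
```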

Notice the implicit conversion of c1 from TEXT to MEDIUMTEXT. This approach can silently modify data types (MySQL widens columns so the converted data still fits), which makes it unacceptable for our purposes.

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to its binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterparts. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, the process is inherently unsuitable for large databases because it requires multiple ALTER statements to be run on each table:

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back
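Against the same illustrative table, that sequence might look like the following (a sketch; the exact statements depend on each table’s columns and indexes, and the index name ft_c1 is hypothetical):

```sql
ALTER TABLE t1 DROP INDEX ft_c1;                    -- 1) drop FULLTEXT indexes
ALTER TABLE t1 MODIFY c1 BLOB;                      -- 2) to the binary counterpart
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;       -- 3) convert the table
ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;   -- 4) restore the data type
ALTER TABLE t1 ADD FULLTEXT INDEX ft_c1 (c1);       -- 5) add FULLTEXT indexes back
```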

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process: first dump only the schema and edit it by hand to use the appropriate character sets; then dump the data separately; finally, re-create the schema and import the data. If you’re using replication, this usually isn’t even an option, because you’ll generate a ridiculous amount of binary logs and force a reload of the data on every server in the replication chain (very time-, bandwidth-, and disk-space-consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:
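A sketch of such a statement, assuming the illustrative table t1 with a TEXT column c1 and utf8_general_ci as the target collation:

```sql
ALTER TABLE t1
  DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci,
  MODIFY c1 TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
-- One pass: changes the table default and converts the column data,
-- while the explicit TEXT keeps the declared type (no MEDIUMTEXT surprise).
```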

This approach changes the default character set for both the table and the target column, while leaving any FULLTEXT indexes in place. It also requires only a single ALTER statement per table. A Perl script has been put together to parallelize the ALTER statements and is available at:

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

- Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
- Allow throttling of the number of threads created (currently creates one per table)

About Ryan Lowe

Ryan is a Principal Consultant and team manager at Percona. He has experience with many database technologies in industries such as health care, telecommunications, and social networking.

Comments

  1. Shameless plug:

    Allow me to suggest using oak-modify-charset, part of the openark kit, which modifies a single column of text.
    At present the utility supports a single column change, but as it matures, it will also support streaming of columns, aggregated into one ALTER statement.
    You may also just use --print-only and get the ALTER command without executing it, so you can still tailor it using your favorite Perl/awk/Python script.

    Converting an entire table is undesirable, in my opinion, since, in addition to said issue, it converts all columns in that table. What if I have a CHAR(32) column for some MD5 checksum? I wouldn’t want that to grow to 96 utf8 bytes.

  2. Mrten says:

    You should store things like md5 and sha1 as binary data in the first place, as it is wasteful (two times to be precise) to store it in hex, with no tangible benefits. Likewise for IP-addresses (the ones you’re not searching for, at least).
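    For example (a sketch; the table and column names are hypothetical):

    ```sql
    -- 16 raw bytes instead of 32 hex characters (or 96 bytes as utf8 CHAR(32))
    CREATE TABLE checksums (
      md5 BINARY(16) NOT NULL
    );
    INSERT INTO checksums (md5) VALUES (UNHEX(MD5('some payload')));
    SELECT HEX(md5) FROM checksums;  -- back to the familiar hex form
    ```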

  3. There are two kinds of character set conversion:

    1. If the columns are declared as latin1 and the data are really stored in latin1. Then a simple MODIFY can convert everything.

    2. If the columns are declared as latin1 but the data are actually stored in utf8. This is quite common for applications created with MySQL < 4.1 or for applications which do not use SET CHARACTER SET or SET NAMES. A simple MODIFY would mangle the data; a conversion to binary and then to utf8 is necessary.
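    In SQL terms, the two cases might look like this (a sketch, assuming a hypothetical table t1 with a VARCHAR(255) column c1):

    ```sql
    -- Case 1: declared latin1, data really is latin1 -- one MODIFY converts it:
    ALTER TABLE t1 MODIFY c1 VARCHAR(255) CHARACTER SET utf8;

    -- Case 2: declared latin1, but the bytes are really utf8 -- pass through
    -- binary so the bytes are reinterpreted rather than "converted" (mangled):
    ALTER TABLE t1 MODIFY c1 VARBINARY(255);
    ALTER TABLE t1 MODIFY c1 VARCHAR(255) CHARACTER SET utf8;
    ```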

  4. @Mrten,

    I partially agree. As exceptions, take a look at mysql.user: the hashed passwords are textual, not binary. Also, I believe indexing a textual column is easier to handle – but that may be a minor issue.

    With regard to IP addresses – I completely agree. I even wrote a post about it a few months ago.

    thanks

  5. @Mrten

    In any case, other examples can be email addresses, local file names (hopefully ASCII), mount points, airport abbreviations, etc. There are quite a few ASCII-only texts, which was my point.

    Regards

  6. Marki says:

    @Jakub
    I’ve just migrated around 300 databases from 4.0 to 5.0. It was not easy, because every database could use different charset…
    1) Know the charset of database
    2) Export database from mysql 4.0 server
    3) Create new empty database on mysql 5.0 with correct default charset
    4) Import data into mysql 5.0 server, but prepend data with SET NAMES:
    (echo "SET NAMES $chset;"; cat $path/$db.sql) | mysql newdb -hnewserver

  7. Pavel says:

    What happens if a column has mixed latin1 and utf8 at the same time (left from MySQL 4)?
    Does MySQL detect that right?

  8. Ryan Lowe says:

    @shlomi Had I seen that script sooner, I may have used it! convert_charset has a --columns parameter, so it is possible to do one-and-only-one column at a time.

    @jakub & @pavel This script assumes latin1 data is stored in latin1 columns :)

  9. @Ryan,

    Cheers. There’s room for more than one utility per task!

    Shlomi

  10. Gil says:

    In a multi-master replication setup, if I broke replication and ran this script on the passive master, then started replication from active->passive, would the queries fail or do other unwanted things? Are the binary logs charset-sensitive during replication?

  11. @Gil
    If you have a utf8 column on a master, and a latin1 column on a slave, and try to replicate, you may get corrupted data if you insert true utf8 text into the master. The slave will not be able to store utf8 characters and will truncate them.

  12. Gil says:

    Thanks @Shlomi

    My question was more about what happens if you have latin1 on the master and utf8 on the slave. Same problem?

  13. @Gil,

    good question. Will try to set up the configuration and test.

  14. nadavkav says:

    Thank you for this valuable post! But it was a little too late for me, after trying one of the suggestions on the internet similar to the #2 method category, converting text to binary and back to text again. We use Mahara (an ePortfolio framework) that was installed with the wrong charset and got filled with data very quickly, before I had a chance to notice the issue. Eventually (after the #2 method did not work properly) I had to manually change the relevant DB fields :-(
    A good ending to this story is that everything works fine now :-)

  15. bfarber says:

    While the general information in the article is useful, there are two issues I’m spotting (since I recently undertook the same task).

    1) Many databases store text of any charset in latin1 columns, a legacy of MySQL 4 using latin1 as the default. The text stores and displays just fine from a front-end application querying it, but if you just change the character set, MySQL will literally try to convert from latin1 to utf8, and since the text isn’t really latin1 it gets mangled.

    2) I’ve seen many applications that store serialized data in the database (from PHP’s serialize() function). This method would break those serialized strings, making them impossible to deserialize later (serialize() embeds string byte lengths, which change when the conversion changes the bytes).

    Just some things to point out.

  16. Hi Ryan,
    My blog is on a WordPress platform. Do you know if I’ll have to make the upgrade to utf8? Thanks.

  17. Pradeep Jindal says:

    Thanks for this great post. Here’s my attempt on handling FK: http://blog.pradeepjindal.com/blog:4

  18. Tom says:

    I get a whole bunch of these when I run the script on windows:

    DBD::mysql::db quote failed: handle 2 is owned by thread cb0f8 not current thread 4fc00b8 (handles can’t be shared between threads and your driver may need a CLONE method added) at convert_charset.pl line 281, line 1.

  19. Mike says:

    Forgive me if I’m wrong, but isn’t mediumtext larger than text? If that is the case, that should not result in any truncation.

  20. sun says:

    Ryan,

    Your convert_charset Perl script still works excellently (and the code also looks beautiful from a pure code standpoint ;)), and is especially required for big data, as well as replication scenarios – as you outlined already. Unlike the openark (oak) script mentioned above, it is really handy to have all table column conversions automated.

    To my surprise, there’s only 1 unmodified copy of your script on github, and also nowhere else on the net. It hasn’t been added to the Percona Tools yet.

    Time to give it a proper home? :)

    Thanks!
    sun

  21. Ryan says:

    Sun,

    Glad to hear the script worked for you! It wasn’t deemed generally useful enough to put into the Percona Tools set (or Percona-Toolkit) :-D

    I’ve given it a “proper home” at https://github.com/rlowe/mysql_convert_charset for the time being … Pull Requests are welcome :)

    – Ryan Lowe

  22. George Lund says:

    Approach number 1 is correct.

    There was no truncation: MEDIUMTEXT is bigger than TEXT.

    MySQL does this to make sure there’s *no* truncation, as explained in the manual – http://dev.mysql.com/doc/refman/5.1/en/alter-table.html

    “For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column.” [followed by a complete explanation of why TEXT converts to MEDIUMTEXT]

    Maybe something changed since this article was originally written, but right now it seems downright misleading.

  23. David says:

    Correct me if I’m wrong, but this script doesn’t convert the data within the table? For instance, I have a table using latin1 swedish that I want to convert to utf8. The table gets converted, but the data still remains as is.
