August 27, 2014

utf8 data on latin1 tables: converting to utf8 without downtime or double encoding

Here’s a problem many of us have encountered: you have a latin1 table defined like the one below, and your application stores utf8 data into a column of that table over a latin1 connection. Because the connection and column character sets match, the raw utf8 bytes are stored untouched, and a straight conversion of the table to utf8 would double-encode them. Now your development team has decided to use utf8 everywhere, but during the migration you can afford little to no downtime, and your stored data has to stay valid.
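The original table definition is not reproduced in this copy of the post, so here is a minimal sketch consistent with the rest of the article: table t with a latin1 default character set and an indexed TEXT column c (the id column, the index name idx_c and the engine are assumptions for illustration).

    CREATE TABLE t (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      c TEXT,
      KEY idx_c (c(100))      -- secondary key on c, referenced later in the post
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;   -- engine assumed; the FRM-swap demo later would not work on InnoDB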

One approach, as described in the manual, is to convert the TEXT column into BLOB, then convert the table character set to utf8 and the c column back to TEXT, like this:
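The statements themselves are missing from this copy, but the manual's approach boils down to something like this (using the assumed names from the sketch above):

    -- 1. change the column to BLOB so its bytes are left untouched
    ALTER TABLE t MODIFY c BLOB;
    -- 2. switch the table default character set to utf8
    ALTER TABLE t CONVERT TO CHARACTER SET utf8;
    -- 3. change the column back to TEXT labelled utf8; the stored bytes are reinterpreted, not converted
    ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;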

All good so far, but if the table is big enough that these ALTERs would disrupt your application significantly, this becomes a problem unless you take downtime. The old trick of using slaves now comes into play. In a nutshell, you first convert the TEXT column into BLOB on a slave, then switch your application over to use this slave as its primary. Any utf8 data written via replication or from the application will be stored and retrieved without issues, whether the connection character set is latin1 or not, because BINARY data types do not have character sets. Let me show you:
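A hedged illustration of that (the sample string is hypothetical; the point is that the raw utf8 bytes pass through the BLOB column unchanged):

    -- on the slave, with replication still flowing from the latin1 master
    ALTER TABLE t MODIFY c BLOB;

    SET NAMES latin1;                    -- the application still connects as latin1
    INSERT INTO t (c) VALUES ('日本語');  -- the raw utf8 bytes go in as-is; BLOB does no conversion
    SELECT c, HEX(c) FROM t;             -- the same bytes come back out, unmangled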

As you can see, while the column is still a BLOB, I have no problem storing utf8 data in it or reading it back. Now, after your application has been reconfigured to use this slave and to connect with the utf8 character set, you can convert the column back to TEXT and the table to utf8.
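A sketch of that final step on the slave (now the primary), again with the assumed names:

    SET NAMES utf8;                                   -- the application now connects as utf8
    ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;   -- the BLOB bytes are taken as-is and labelled utf8
    ALTER TABLE t DEFAULT CHARACTER SET utf8;         -- new columns will default to utf8 as well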

Some caveats, though. First, you cannot replicate the BLOB or utf8 data back to the original latin1 column, as that would just result in double encoding, so you will have to discard the original master. Second, while the column is a BLOB (or any other BINARY type) and it is indexed, you may get different results when the index is used, because BINARY data is sorted by the numeric value of each byte rather than by character collation rules. Here is an example:
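The original output is not included here, but the effect can be sketched with a few hypothetical rows ('apple', 'Banana', 'cherry'):

    SELECT c FROM t ORDER BY c;
    -- as latin1 TEXT (latin1_swedish_ci, case-insensitive): apple, Banana, cherry
    -- as BLOB (ordered by raw byte value):                  Banana, apple, cherry   ('B' = 0x42 sorts before 'a' = 0x61)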

See how the results are now ordered differently?

What’s your utf8 horror story? Share with us in the comments below :-)

UPDATE: This is how the process looks without downtime or the table being blocked for an extended time, but there are other ways. One of them is creating a copy of the original table converted to utf8 and doing an INSERT INTO .. SELECT using the CAST or CONVERT functions, like below.
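A sketch of that approach, assuming a scratch table x and the names used earlier; the CAST to BINARY strips the latin1 label from the bytes before they are reinterpreted as utf8, which avoids double encoding:

    CREATE TABLE x LIKE t;
    ALTER TABLE x CONVERT TO CHARACTER SET utf8;     -- x is still empty, only the definition changes
    INSERT INTO x (id, c)
      SELECT id, CONVERT(CAST(c AS BINARY) USING utf8) FROM t;
    RENAME TABLE t TO t_old, x TO t;                 -- swap the tables once the copy is complete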

Another method is to take an FRM file for the same table structure, but in utf8, and replace your original table’s FRM file with it. Since the data is already stored as utf8 bytes, you should be able to read it over a utf8 connection. However, you will have to rebuild the indexes on the affected columns, since they were originally sorted as latin1. In my tests though, there was no difference before and after rebuilding the index, so YMMV. To demonstrate, using the same two tables as before: on the filesystem, I replaced t.frm with a copy of x.frm and then did a FLUSH TABLES; afterwards, t looked like this:
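The SHOW CREATE TABLE output is not reproduced in this copy; the steps themselves look roughly like this (the datadir path is an assumption, and the table must not be written to while the file is copied):

    -- on the filesystem: cp /var/lib/mysql/test/x.frm /var/lib/mysql/test/t.frm
    FLUSH TABLES;
    SHOW CREATE TABLE t\G     -- t now reports DEFAULT CHARSET=utf8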

Now, attempting to read the data over a latin1 connection causes truncation:
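For example (the exact mangling depends on the data, but multi-byte characters do not survive the utf8-to-latin1 conversion on the way out to the client):

    SET NAMES latin1;
    SELECT c FROM t;     -- characters that do not exist in latin1 are lost and the value comes back truncated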

But over a utf8 connection, I am now able to read it fine:
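That is, with the connection character set switched to utf8 (same assumed table as above):

    SET NAMES utf8;
    SELECT c FROM t;     -- the stored utf8 bytes come back intact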

Rebuilding the secondary key on the c column made no difference in the results either.
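For reference, rebuilding that key amounts to something like this (index name and prefix length taken from the assumed definition above):

    ALTER TABLE t DROP KEY idx_c, ADD KEY idx_c (c(100));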

UPDATE: Apparently, the last method will not work for InnoDB tables because the character set and collation are also stored in the InnoDB data dictionary, as my colleague Alexander Rubin pointed out. But not all is lost: you can still rebuild the table with pt-online-schema-change without blocking it.
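A hedged sketch of how that might look with pt-online-schema-change, reusing the two-step BLOB trick so the data itself is never converted (the database name db and the rest of the DSN are assumptions):

    pt-online-schema-change --alter "MODIFY c BLOB" D=db,t=t --execute
    pt-online-schema-change --alter "MODIFY c TEXT CHARACTER SET utf8" D=db,t=t --execute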

About Jervin Real

Jervin is a member of Percona's Rapid Response Consulting team. When you come to Percona for consulting, chances are he'll be greeting you first. His primary role is to make sure customer issues are handled efficiently and professionally. Jervin joined Percona in April 2010.

Comments

  1. Nate says:

    We had a vaguely similar problem — we had UTF-8 columns, but were somehow passing our UTF-8 data into the database but converting it to Latin-1 while we did so. It was kind of crazy.

    I wrote up my understanding of the problem and how we solved it here: http://www.ridesidecar.com/2013/07/30/of-databases-and-character-encodings/

  2. Jackson says:

    I’ve run into this multiple times. So frustrating at times. Not so much a problem anymore.

  3. Parisa says:

    How about VARCHAR and CHAR fields? Do I have to change them to BLOB too?

  4. Jervin Real says:

    @Parisa, in the example above yes, it can be BINARY or VARBINARY too.
