July 30, 2014

Fixing column encoding mess in MySQL

Just had an interesting issue with an encoding mess in a column containing non-ASCII (Russian) text. The solution was not immediately obvious, so I decided it was worth sharing.

The column (actually the whole table) was created with DEFAULT CHARSET cp1251, and most of the data was indeed in the proper cp1251 national encoding. However, because the web application failed to set the connection encoding properly, some of the rows were actually in UTF-8. That needed to be fixed.

Simply using CONVERT(column USING xxx) did not work because MySQL treated the source data as if it was in cp1251. One obvious solution would be to write a throwaway PHP script which would SET NAMES cp1251, pull the offending rows (they’d come out in UTF-8), iconv() them to proper cp1251, and UPDATE them with new values.

However, it’s possible to fix the issue entirely within MySQL. The trick is to tell it to treat the string coming from the table as binary, and then do the charset conversion (here table, column, and id=123 stand in for the actual table, column, and affected row):
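
UPDATE table SET column=CONVERT(CONVERT(CONVERT(column USING binary) USING utf8) USING cp1251) WHERE id=123;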

This can be further simplified. After the 2nd conversion the result is in UTF-8, and this time MySQL knows that it’s UTF-8 as well. So it will perform the conversion to the per-table charset automatically, and the 3rd explicit CONVERT can be omitted:
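
UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;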

The same trick could be applied to fix the notorious KOI8-R vs CP1251 encoding issues in Russian, and other national SBCS-vs-UTF-8 issues, I suppose (see the sketch below). But, of course, ideally you’d always set the encoding properly in the web application and avoid these issues altogether.
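
For instance, relabeling a cp1251 column that actually holds KOI8-R bytes should work along these lines (an untested sketch following the same pattern; again, id=123 marks the affected row):

UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING koi8r) WHERE id=123;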

About Andrew Aksyonoff

Andrew is the creator of the Sphinx full-text search engine, which powers Craigslist, Slashdot, Dealnews, and many other large-scale websites.

Comments

  1. jrabbit says:

    I was just about to tackle the exact opposite problem when I read this: I have a table that is defined as latin1_swedish_ci, but all the string fields actually contain valid UTF-8. Is there a way to convert the table definition to utf-8 without it trying to transcode the strings, which are already in the correct encoding?

    Converting the utf-8 to latin1_swedish_ci first is not an option as the strings contain characters not representable in the 1-byte character sets – I need to maintain the full utf-8 character set during the change.

    Is there a way to apply this technique in this situation?

  2. shodan says:

    Yes. There are two routes that I’d try.

    First, you could create a temporary table (CREATE TABLE tmp LIKE table), fix the encoding there while it’s empty, and populate it with the values from the original table corrected using the trick:

    INSERT INTO tmp SELECT … CONVERT(CONVERT(column USING binary) USING utf8) … FROM table

    Second, you could ALTER the existing table and set UTF-8 on the column. The data would then get UTF-8 encoded twice, so you’d need to run an update which would decode it once. The corresponding CONVERT() sequence is somewhat more tricky, but I believe that could be done too.
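
    Something along these lines should undo the double encoding (an untested sketch; tbl and col are placeholders for the actual table and column):

    UPDATE tbl SET col = CONVERT(CONVERT(CONVERT(col USING latin1) USING binary) USING utf8);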

    Update: it would be much better to convert the column to binary character set and then to UTF-8 again; see below.

  3. Teak says:

    Why not just convert it to binary (nothing is really converted, the column is just marked as binary), and then from binary to the real encoding of the data (again nothing is really converted, all the data are left untouched)? I remember I did something like this many times in the past.

  4. shodan says:

    Teak, that would most probably work too, I just wanted to test it before posting. The CONVERT() trick is tested and works OK.

  5. Scott says:

    The manual shows doing the same thing, except altering the column type to binary and back. The CONVERT method is a little smoother, since it’s all one statement.

    http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

  6. Teak says:

    Altering the table type should be much quicker on big tables.

  7. peter says:

    Teak,

    I think you miss Andrew’s point a bit. It is not about the case when ALL the data in the table is in the wrong encoding, but only a single row (note the id=123 in the statements), or maybe several rows.

  8. shodan says:

    Right. I’ve just tested changing the column charset to binary and then to UTF-8, and I can confirm it works fine. So the rule of thumb is:

    1) if the encoding itself is OK for the whole table, but MySQL thinks there’s some other encoding, use CHANGE COLUMN to binary and back to the proper encoding, as sketched below. (This is the case that jrabbit has.)

    2) if the encoding itself is not OK, or if you only need to fix several rows, use CONVERT() tricks. (This is the case which I had.)
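
    For case 1, a minimal sketch (untested; assuming a VARCHAR(255) column col in table tbl – match the binary type to the original column type):

    ALTER TABLE tbl CHANGE COLUMN col col VARBINARY(255);
    ALTER TABLE tbl CHANGE COLUMN col col VARCHAR(255) CHARACTER SET utf8;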

  9. Sven Neuhaus says:

    Thanks for the helpful article and comments.
    So if the encoding is wrong (for example latin1 data in a column marked utf8) I would CHANGE COLUMN to binary (no conversion), then CHANGE COLUMN to latin1 (no conversion), and then CHANGE COLUMN to utf8 (conversion). I haven’t tried this yet, but it sounds as if it’d work.
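
    In ALTER terms, that would be roughly (an untested sketch, assuming a VARCHAR(255) column col in table tbl):

    ALTER TABLE tbl CHANGE COLUMN col col VARBINARY(255);                      -- no conversion
    ALTER TABLE tbl CHANGE COLUMN col col VARCHAR(255) CHARACTER SET latin1;  -- no conversion
    ALTER TABLE tbl CHANGE COLUMN col col VARCHAR(255) CHARACTER SET utf8;    -- conversion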

  10. I need some suggestions from you. I am working on a web application which has partially English and partially Simplified Chinese text.
    English is mainly used throughout the system. Chinese is used only to store additional information like addresses and the Chinese-specific names of employees. I had all my tables as default latin1 with collation latin1_swedish_ci. But as I had to add new fields to the system to show Chinese info to employees who cannot handle English as their first language, I have changed all my fields to UTF-8, so that I can simply store Chinese anywhere.
    However, I have only about 10 columns in the whole database which use Chinese. So is this the right thing to do?
    The options I felt were right (anybody can point out if I am wrong):
    1) keep everything as UTF-8, collation utf8_general_ci
    2) keep all English columns as latin1 with collation latin1_swedish_ci, and Chinese columns as UTF-8, collation utf8_general_ci
    The database is stored on a local server (intranet), not on the internet, so storage is not an issue.
    Any suggestions are welcome. Thanks in advance.

    Mohnkhan
    http://www.mohitech.com

  11. shodan says:

    Mohnkhan,
    English is encoded exactly the same in both Latin-1 and UTF-8, so I’d simply go with everything in UTF-8 for unification.

  12. SCC says:

    I wish there was a way to change an entire table at once instead of going column by column. I need to convert from latin1_swedish_ci to UTF8 for dozens of columns.

  13. Jerk says:

    @Sven Neuhaus: That worked perfectly for Latin1-encoded chars in a UTF8 col. THANX! That saved our lives.

  14. Jeroen Sen says:

    I’ve used your conversion a bit for converting a database from the latin1 character set to the utf8 character set (just like SCC above). But I have a problem with some special characters like ‘ë’ and ‘ä’, for example. When these characters are converted to their binary counterparts they are present in the database. But when I convert the database to utf8, all the contents of the fields containing these symbols are dropped starting from the symbol. Does anyone have a suggestion?

    Thnx in advance!

  15. shodan says:

    Jeroen,

    this means that the data you’re marking as UTF-8 is *not* really in UTF-8 encoding, and then the UTF-8 decoder chokes.

    E.g. if the binary data is actually in Latin-1 encoding but you’re tricking MySQL into thinking it’s UTF-8, the decoder will choke at the very first non-ASCII symbol.

  16. andreea says:

    hi

  17. tamcy says:

    Perhaps a more complicated issue:

    The table is created with charset utf8 – correct.
    The data is exported with charset utf8 – also correct.
    But things go wrong when trying to import the data into MySQL with the “\.” syntax. The user forgot to run “SET NAMES utf8”, and MySQL sees all the data as latin1.

    Now, the script can use “SET NAMES latin1” to get back utf8 data.
    However, “CONVERT(CONVERT(offending_col USING binary) USING utf8)” doesn’t work – the same wrong garbage characters are shown.

  18. tamcy says:

    Sorry – answering myself, but I just realized the way to fix it and would like to share it with the others.

    As I said, the column expects utf8 data, and the data comes out correct when “SET NAMES latin1” is issued.
    This means I can get the correct (utf8) data when MySQL performs a “utf8 -> latin1” conversion.
    In order to fix the encoding I need to treat this “latin1” data as binary, then convert it to utf8.
    And so here is the statement I need:

    UPDATE table SET col = CONVERT(CONVERT(CONVERT(col USING latin1) USING binary) USING utf8);

  19. brigada says:

    phpMyAdmin makes this very easy to change – open your database in phpMyAdmin, select a table on the left, select the field(s) you need to change, click the With All: CHANGE button, and change the collation. Depending on your number of tables (I had 40) this takes about 10 min. max, and it’s error-free.

    You can test it out on any old database you have, if you wish, before doing it “live” :)

  20. Thanks Shodan. Very, very useful.

  21. sbf says:

    Thank you very much, Shodan! Just what I’d been looking for…

  22. Ken says:

    I’ve tried to follow this guide, but have no idea what “123” is. Am I supposed to replace certain words in the following statement with something else?

    > UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;

    Also, I’m just not able to do this thing on my own. Can anyone help me convert my data for a fee? Please contact me at sanbat[@t]gmail.com

  23. Ken says:

    Btw, the situation is, I have a db where most of the fields are latin1_swedish_ci and filled with Chinese data. The site displays fine on the frontend. I need to convert all the data to utf8.

  24. EON says:

    Well, my data is so munged that I don’t even know what to do with it. The table is InnoDB, utf8, and the column is utf8; the original data we imported may have been UTF-8, and the tables were MyISAM – it came out of SQL Server into MySQL, and may have been converted to UTF-8 during that first import.

    I tried (with a backup, of course) changing the column to binary, then latin1, then back to utf8, and the data looks exactly the same: ¡§de-mystify¡¨ (that’s a word with what I believe to be MS Word curly quotes around it, but they’re coming out as an upside-down exclamation mark followed by, respectively, the “section” glyph and an umlaut). In the db, I see ¡§ for the first one after the conversion (with the addition of a capital A with circumflex). On a web page (utf-8 in the header and the content-type meta tag), it is as above.

    Searching everywhere, I have found nothing that tells me what character set encoding that is, and I’ve tried many. When it was sensed (by PHP) as ISO-8859-1, it prints as “��” (two black-diamond question marks, whether the web page showing it is in utf-8 or ISO-8859-1), though if I display it converted to utf-8, it’s back to upside-down exclamation marks and “section” glyphs, etc. And some characters look like „« (double-comma, double-left-angle-brackets), or this (in my Navicat view): „X .

    I’m sadly guessing that there is no solution for this within MySQL. Does anybody have information to share?

  25. Hi EON, sometimes binary strings look like a strange language like that. Have you tried to retrieve this information using PHP’s binary functions as well? Just by looking at them it will be impossible to know what it is…

  26. EON says:

    Well, I took a look at the PHP functions, and didn’t find any binary functions that seem likely to help. They have bin2hex… what would I do with hex data? And some comparison functions, and that’s all. What would I compare them against? But thanks for giving me an idea, at least. If you could explain a bit further, I’d be most thankful.

    Tried a REPLACE on test data in MySQL: SET field = REPLACE(field, '•', '•');
    MySQL said it happily replaced the string in lots of fields (and searching for that string returns no rows), but when viewing in a browser, they’re still there.

  27. Daniel says:

    Thanks, this works perfectly and saved my life!

  28. Tadas S says:

    Arggh. I have a similar problem and nothing works. MyISAM tables with latin1 encoding.

    Tried:
    ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;

    It does nothing except change the table/column encodings – no actual content changes.

    So... after this MySQL thinks my table is utf8, and I try the #2 way mentioned in this blog:
    UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;

    And it... does nothing. The £ sign stays the same, nothing changes.

    Server version: 5.1.61-0ubuntu0.11.10.1

  29. Tadas S says:

    Before banging your head on the table after “ALTER TABLE tbl CONVERT TO CHARACTER SET utf8”, try looking at your columns via HEX(column_name). It did convert everything very well. My problem was not setting the connection encoding, i.e. “SET NAMES utf8”.
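
    For example, a quick way to see the actual bytes (tbl and col being placeholders for your table and column):

    SELECT col, HEX(col) FROM tbl LIMIT 10;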

  30. Vulpes M says:

    Excellent! This is the solution I was looking for, for the same problem.

  31. When someone writes an article, he/she keeps the reader in mind and thinks about how the reader will understand it. That’s why this article is amazing. Thanks!

  32. Said Bakr says:

    There is a magical question: how can I determine the real encoding of the data? For example, I use phpMyAdmin to view a table’s data; the data should be Arabic text, but it renders strangely. I have no idea about its real encoding, which I’d need in order to perform a CONVERT. I just need it to be utf8_general_ci. However, this data was copied (copied meaning inserted using an SQL INSERT) from a table with latin1_swedish_ci, and to render correctly on the application pages the meta tag charset has to be set to windows-1256.
