Converting Character Sets

The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for “mysql convert charset to utf8” results in a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here and show why they don’t work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.

Approach #1:

ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

1	ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Take the following table as an example why this approach will not work:

mysql> CREATE TABLE `t1` (
-> `c1` text NOT NULL
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)

mysql> ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> SHOW CREATE TABLE `t1`\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`c1` mediumtext NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.01 sec)

mysql> CREATE TABLE `t1` (

-> `c1` text NOT NULL

-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Query OK, 0 rows affected (0.02 sec)

mysql> ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Query OK, 0 rows affected (0.02 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> SHOW CREATE TABLE `t1`\G

*************************** 1. row ***************************

Table: t1

Create Table: CREATE TABLE `t1` (

`c1` mediumtext NOT NULL

) ENGINE=MyISAM DEFAULT CHARSET=utf8

1 row in set (0.01 sec)

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to it’s binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterpart. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, this process is inherently unsuitable for large databases because it requires multiple alter statements to be run on each table:

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process, where one must first dump only the schema and then edit it by hand to have the appropriate character sets and the dump the data separately. After which, the schema must be re-created and data imported. If you’re using replication, this usually isn’t even an option because you’ll have a ridiculous amount of binary logs and force a reload of data on every server in the replication chain (very time/bandwidth/disk space consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

ALTER TABLE `t1`
DEFAULT CHARSET=utf8,
MODIFY COLUMN `c1` text CHARACTER SET utf8;

ALTER TABLE `t1`

DEFAULT CHARSET=utf8,

MODIFY COLUMN `c1` text CHARACTER SET utf8;

This approach will both change the default character set for the table and target column, while leaving in place any FULLTEXT indexes. It also requires only a single ALTER statement for a given table. A perl script has been put together to parallel-ize the ALTER statements and is available at:

%> wget http://www.pablowe.net/convert_charset

1	%> wget http://www.pablowe.net/convert_charset

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

– Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
– Allow throttling of the number of threads created (currently creates one per table)

Usage:
convert_charset --database=database [options]

Options:
--askpass        Prompt for a MySQL password
--charset        The target character set to convert to
--collate        The target collation to convert to
--database|d     The target database
--help|?         Display this help and exit
--host|h         The target host
--ignore-columns Columns to ignore, useful if you want to
keep the existing charset for a target column
Comma-separated.  NO SPACES.
table.column
--ignore-tables  A comma-separated list of tables to ignore
--password|p     The MySQL password to use
--port           The target port
--tables         A comma-separated list of tables to convert.
All non-named tables will be ignored
--test           Print the ALTER statements that would be executed
without executing them.
--user|u         The MySQL user
--version|V      Display version information and exit

defaults are:

ATTRIBUTE                  VALUE
-------------------------- ------------------
askpass                    FALSE
charset                    utf8
collate                    No Default Value
database                   No Default Value
help                       FALSE
host                       localhost
ignore-columns             No Default Value
ignore-tables              No Default Value
password                   No Default Value
port                       3306
tables                     No Default Value
test                       FALSE
user                       Current User
version                    FALSE

Usage:

convert_charset --database=database [options]

Options:

--askpass Prompt for a MySQL password

--charset The target character set to convert to

--collate The target collation to convert to

--database|d The target database

--help|? Display this help and exit

--host|h The target host

--ignore-columns Columns to ignore, useful if you want to

keep the existing charset for a target column

Comma-separated. NO SPACES.

table.column

--ignore-tables A comma-separated list of tables to ignore

--password|p The MySQL password to use

--port The target port

--tables A comma-separated list of tables to convert.

All non-named tables will be ignored

--test Print the ALTER statements that would be executed

without executing them.

--user|u The MySQL user

--version|V Display version information and exit

defaults are:

ATTRIBUTE VALUE

-------------------------- ------------------

askpass FALSE

charset utf8

collate No Default Value

database No Default Value

help FALSE

host localhost

ignore-columns No Default Value

ignore-tables No Default Value

password No Default Value

port 3306

tables No Default Value

test FALSE

user Current User

version FALSE

25 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Shlomi Noach

15 years ago

Shameless plug:

Allow me to suggest using oak-modify-charset, part of the openark kit, which modifies a single column of text.
At current the utility supports a single column change, but as it matures, it will also support streaming of column, aggregated into one ALTER statement.
You may also just –print-only, and get the ALTER command without executing it, so you can still tailor it using your favorite PERL/awk/python script.

Converting an entire table is undesired, in my opinion, since, in addition to said issue, it converts all columns in that table. What if I have a CHAR(32) column for some md5 shcecksum? I wouldn’t that to grow to 96 UTF8 bytes.

Mrten

15 years ago

You should store things like md5 and sha1 as binary data in the first place, as it is wasteful (two times to be precise) to store it in hex, with no tangible benefits. Likewise for IP-addresses (the ones you’re not searching for, at least).

Jakub VrÃ¡na

15 years ago

There are two kinds of character set conversion:

1. If the columns are declared as latin1 and the data are really stored in latin1. Then a simple MODIFY can convert everything.

2. If the columns are declared as latin1 but the data are stored in utf8. This is quite common for applications created with MySQL < 4.1 or for applications which does not use SET CHARACTER SET or SET NAMES. A simple MODIFY would mangle the data and conversion to binary and then to utf8 is necessary.

Shlomi Noach

15 years ago

@Mrten,

I partially agree. As exceptions, take a look at mysql.user: the hashed passwords are textual, not binary. Also, I believe indexing a textual column is easier to handle – but that may be a minor issue.

With regard to IP addresses – I completely agree. I even wrote a post about it a few months ago.

thanks

Shlomi Noach

15 years ago

@Mrten

At any case other examples can be email addresses, local file names (hopefully ASCII), mount points, airports abbreviations, etc. There are quite a few ASCII-only texts, which was my point.

Regards

Marki

15 years ago

@Jakub
I’ve just migrated around 300 databases from 4.0 to 5.0. It was not easy, because every database could use different charset…
1) Know the charset of database
2) Export database from mysql 4.0 server
3) Create new empty database on mysql 5.0 with correct default charset
4) Import data into mysql 5.0 server, but prepend data with SET NAMES:
(echo “SET NAMES $chset;”; cat $path/$db.sql) | mysql newdb -hnewserver

Pavel

15 years ago

What happens if a column has mixed latin1 and utf8 at the same time (left from MySQL 4)?
Does MySQL detect that right?

Ryan Lowe

Author

15 years ago

@shlomi Had I seen that script sooner, I may have used it! convert_charset has –columns parameter, so it is possible to do one-and-and-only-one column at a time.

@jakub & @pavel This script assumes latin1 data is stored in latin1 columns:)

Shlomi Noach

15 years ago

@Ryan,

Cheers. There’s room for more than one utility per task!

Shlomi

Gil

15 years ago

In a multi-master replication setup, if I broke replication and ran this script on the passive master, then started replication from active->passive, would the queries fail or do other unwanted things? Are the binary logs charset-sensitive during replication?

Shlomi Noach

15 years ago

@Gil
If you have a utf8 column on a master, and a latin1 column on a slave, and try to replicate, you may get corrupted data if you insert true utf8 text into the master. The slave will not be able to store utf8 characters and will truncate them.

Gil

15 years ago

Thanks @Schlomi

My question was more about what happens if you have latin1 on the master and utf8 on the slave. Same problem?

Shlomi Noach

15 years ago

@Gil,

good question. Will try to set up the configuration and test.

nadavkav

15 years ago

Thank you ! for this valuable post.
But it was a little to late for me, after trying one of those suggestions on the internet
similar to the #2 method category for converting text to binary and back to text again.
We use a Mahara (ePortfolio framework) that was installed with the wrong charset
and got filled with data very quickly, before i had that chance to notice this issue.
eventually (after #2 method did not work properly) i had to manually change the relevant
DB fields 🙁
A good ending to this story is that everything works fine, now 🙂

bfarber

14 years ago

While the general information in the article is useful, there’s 2 issues I’m spotting (since I recently undertook the same task).

1) Many databases store any charset text into the latin1 column, part of legacy MySQL 4 using latin1 as the default. While the text stores and can be displayed just fine from a front end application querying it, if you just change the character set I would assume MySQL is going to literally try to convert from latin1 to utf8, and in this case since the text isn’t latin1 it gets mangled.

2) I’ve seen many applications that store serialized data in the database (from PHP serialize() function). This method would break that serialized string, making it impossible to deserialize it later.

Just some things to point out.

Chris Huntley

13 years ago

Hi Ryan,
My blog is on a WordPress platform. Do you know if I’ll have to make the upgrade to utf8? Thanks.

Pradeep Jindal

11 years ago

Thanks for this great post. Here’s my attempt on handling FK: http://blog.pradeepjindal.com/blog:4

Tom

11 years ago

I get a whole bunch of these when I run the script on windows:

DBD::mysql::db quote failed: handle 2 is owned by thread cb0f8 not current thread 4fc00b8 (handles can’t be shared between threads and your driver may need a CLONE method added) at convert_charset.pl line 281, line 1.

Mike

11 years ago

Forgive me if I’m wrong, but isn’t mediumtext larger than text? If that is the case, that should not result in any truncation.

sun

10 years ago

Ryan,

Your convert_charset Perl script still works excellently (and the code also looks beautiful from a pure code standpoint ;)), and is especially required for big data, as well as replication scenarios – as you outlined already. Unlike the openark (oak) script mentioned above, it is really handy to have all table column conversions automated.

To my surprise, there’s only 1 unmodified copy of your script on github, and also nowhere else on the net. It hasn’t been added to the Percona Tools yet.

Time to give it a proper home? 🙂

Thanks!
sun

Ryan

10 years ago

Sun,

Glad to hear the script worked for you! It wasn’t deemed generally useful enough to put into the Percona Tools set (or Percona-Toolkit) 😀

I’ve given it a “proper home” at https://github.com/rlowe/mysql_convert_charset for the time being … Pull Requests are welcome 🙂

— Ryan Lowe

George Lund

10 years ago

Approach number 1 is correct.

There was no truncation: MEDIUMTEXT is bigger than TEXT.

MySQL does this to make sure there’s *no* truncation, as explained in the manual – http://dev.mysql.com/doc/refman/5.1/en/alter-table.html

“For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column.” [followed by a complete explanation of why TEXT coverts to MEDIUMTEXT]

Maybe something changed since this article was originally written, but right now it seems downright misleading.

David

10 years ago

Correct me if i’m wrong, but this script doesn’t convert the data within the table? For instance, I have a table using latin1 swedish that I want to conver to UTF8. The table gets converted, but the data still remains as is.

raul

9 years ago

hi friends

i have database in latin1_swedish_ci y need migrate to utf8 utf8_general_ci

i use
./convert_charset --host=127.0.0.1 --user=root --password=mypass --database=database --charset=utf8 --collate=utf8_general_ci>/code>

but this did not work

not display anything on the web, even in mysql, if the change shows utf8_general_ci

and everything looks fine

anyone can help me

Stijn de Witt

7 years ago

Thanks for this informative post.

It is worth mentioning however that MySQL’s UTF-8 is not actually UTF-8… Think of it as ‘MySQL’s 3-byte version of a subset of UTF-8’. What the rest of the world calls UTF-8, and which takes up to a maximum of 4 bytes per character to encode, MySQL actually calls ‘utf8mb4’.

SEE:
https://stijndewitt.com/2014/08/09/max-bytes-in-a-utf-8-char/
https://stijndewitt.com/2015/06/15/use-mysql-utf8mb4-if-you-want-full-unicode-support/

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Converting Character Sets

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis Replication and Auto-Failover With Sentinel Service

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Converting Character Sets

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis Replication and Auto-Failover With Sentinel Service

Valkey/Redis: Sets and Sorted Sets

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation