Why is converting my database to UTF-8 truncating entries?

The issue was one of mixed encoding. Some fields contained data encoded properly as UTF-8; others contained data encoded as something else, probably ISO-8859-1. When imported to a new UTF-8 database, this was causing truncation.

My steps to solve this:

  1. Copy the original database, wordpress, to a new database, wordpress2. Make sure the collation of wordpress2 is set to UTF-8.
  2. Follow any one of the steps above to convert the tables and columns of wordpress2 to UTF-8. This will cause truncation.
  3. For each row on wordpress containing entries where the data when converted to UTF-8 was not the same as the data when converted to ASCII, update the corresponding row in wordpress2 with the wordpress data, converted to UTF-8. A sample script is below.

The script:

UPDATE wordpress3.wp_options wp3
INNER JOIN wordpress.wp_options wp ON (wp.option_id = wp3.option_id 
    AND convert(wp.option_value using utf8) != convert(wp.option_value using ascii))
SET wp3.option_value = convert(wp.option_value using utf8);

A more knowledgeable friend wrote a series of scripts for me that queried information_schema to locate all the columns containing entries where convert(value using utf8) != convert(value using ascii), and to then generate versions of the script above for them.

Results:

It worked! On my new database, I can save Japanese without it turning into question marks (because the character set has successfully been set to UTF-8), and all the mis-encoded fields that were causing data truncation have been fixed.

There are some posts that contain misencoded characters, but since I can find almost all of those strings by searching for â, Ã, Â, or æ, I can just go in and manually replace them.