How to diagnose encoding issues and keep them in check?

So after about a year (on and off!) I have hopefully managed to get a fix on the encoding issue.

Why it breaks

What my experience boils down to is that encoding issues like this are mostly caused by miscommunication when moving data around.

  • in the best case it is a read mismatch, when correct data is wrongly interpreted
  • in the worst case it is a write mismatch, when data is saved incorrectly, causing a waterfall of issues and various degrees of corruption down the line
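
The difference between the two cases is easiest to see at the byte level. Here is a minimal sketch in Python, chosen only because it makes bytes easy to inspect. The sample string and the UTF-8 / Latin-1 mix-up are assumptions for illustration, not something taken from a particular WP setup.

```python
# Illustrative only: the sample text and the UTF-8 / Latin-1 mix-up are assumptions.
original = "café"

# Correctly stored data: UTF-8 bytes.
stored = original.encode("utf-8")

# Read mismatch: correct bytes are interpreted with the wrong encoding.
# The stored bytes are untouched, so reading them correctly still works.
misread = stored.decode("latin-1")
reread = stored.decode("utf-8")

# Write mismatch: the misread text gets saved back as UTF-8.
# Now the stored bytes themselves are wrong (double encoded).
corrupted = misread.encode("utf-8")
after_corruption = corrupted.decode("utf-8")

print(stored.hex())                 # 636166c3a9
print(corrupted.hex())              # 636166c383c2a9
print(reread == original)           # True: storage was never damaged
print(after_corruption == original) # False: reading correctly no longer helps
```

In the read mismatch the fix is to change how the data is read. In the write mismatch the stored bytes have to be converted back, and every further mismatched save adds another layer of damage.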

Preemptive measures

The earliest you can screw up database encoding in WP is when creating the database, so before you even download the WP archive to install.

Do not rely on defaults, and make sure that all components talk in the same encoding (like UTF-8) internally, as well as to each other and to visitors. This goes well beyond WP and involves MySQL configuration, possibly with some tweaks for Apache and PHP on top.

See WordPress Database Charset and Collation Configuration

Fixing

When things are thoroughly broken you are in for a ton of pain figuring out what is wrong and how to get it back to normal.

I found mb_detect_encoding() highly useful. It’s not a magic wand, but (in strict mode) a false return from it is a good signal that things are not normal.
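
The strict check boils down to asking whether the bytes are actually valid in the encoding you expect. As a rough analogue, the same signal can be sketched in Python. This uses Python's strict decoder, not mb_detect_encoding() itself, and the helper name is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


good = "naïve".encode("utf-8")
bad = "naïve".encode("latin-1")  # contains a lone 0xEF byte, invalid as UTF-8
# Double-encoded text: UTF-8 bytes misread as Latin-1, then saved as UTF-8 again.
double_encoded = good.decode("latin-1").encode("utf-8")

print(is_valid_utf8(good))            # True
print(is_valid_utf8(bad))             # False
print(is_valid_utf8(double_encoded))  # True, even though the text is garbled
```

The last line is why this is a warning sign rather than a diagnosis: a failed check means something is definitely off, but a passing check does not prove the data is healthy.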

On the WP-specific front, $wpdb has encoding-related properties (such as $wpdb->charset and $wpdb->collate).

When you have a reason/guess/idea of what is wrong, copy the data to a safe place and try to convert it into a meaningfully normalized form, see: