Please explain how WordPress works with MySQL character set and collation at a low level

There are two defines in wp-config.php of WordPress website:

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');

There are several things which are most commonly misunderstood. Names
of constants in those defines, might suggest that they are related to
the database itself. They are not. They are related to tables within
the database.

Database creation is totally independent from table creation.
WordPress does not create a database and does not care about
database’s default character set and collation, as long as it can
connect to the database.

The value ‘utf8’ in the first define means, the least restricted
character set from ‘utf8’ family, which is either ‘utf8’ or ‘utf8mb4’.

If you leave defines above unchanged, before an attempt to install your website, it is like telling WordPress to make its own choices, regarding database’s tables character set and collation, which are supported by MySQL ( depending of MySQL version ) and are least limiting.

The following are the things, WordPress analyses to determine its choices, during installation:

  • MySQL’s version
  • database’s collation ( in wp-config.php )

Based on MySQL’s version, WordPress decides, which group of utf8 family to use. There are two, distinguished by their names: utf8 and utf8mb4. Character sets from utf8 group, allow storing of a maximum 3-bytes long characters. Character sets from utf8mb4 group, allow storing of a maximum 4-bytes long characters.

Now, WordPress checks the value of DB_COLLATE define. If empty, it will use the least limiting collation from chosen utf8 family, otherwise, will use the value specified.

Examples

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');

If MySQL does not support utf8mb4 ( older versions ) then tables
character set will be utf8 and collation will be
utf8_general_ci. Otherwise, we can expect utf8mb4 and utf8mb4_unicode_520_ci, or utf8mb4_unicode_ci ( MySQL version dependent ), respectively.

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', 'utf8_polish_ci');

Older MySQL version – utf8 and utf8_polish_ci. Newer MySQL
version – utf8mb4 and utf8mb4_polish_ci ( _polish_ci
suffix is honoured )

define('DB_CHARSET', 'cp1250');
define('DB_COLLATE', 'cp1250_polish_ci');

Any MySQL version – cp1250 and cp1250_polish_ci.

define('DB_CHARSET', 'cp1250');
define('DB_COLLATE', 'utf8_general_ci');

Any MySQL version – error ( mismatch of character set and collation )

Summary

In most cases, leaving values of defines, explained above, unchanged, is a good choice. But, if you want tables collation to match the language of your website, you can modify the value of DB_COLLATE define, appropriately ( for example – utf8mb4_polish_ci ).

Note: that explains, why the character 𝌆 was stored and retrieved properly. Simply, your tables character set belonged to utf8mb4 group, not utf8.

Leave a Comment

tech