Why don’t WordPress post slugs allow accents?

The sanitize_title() function calls remove_accents() right away. Both of these functions date back to v1.2.1 or earlier. The remove_accents() function relies on a hard-coded list that explicitly replaces a handful of accented characters from a few specific languages. The inline comment and function reference simply say:

Converts all accent characters to ASCII characters.

According to RFC 3986, a valid URL may contain only ASCII characters.

So (although I can’t find any evidence of this) I assume the accented characters are replaced to produce an ASCII-valid URL for languages that have just a few almost-ASCII characters, and that this behavior originates from way back.

The invalid URL /à-b-c/ becomes the valid /a-b-c/ (instead of the valid but percent-encoded /%wtv-b-c/).
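
To make the two options concrete, here is a minimal Python sketch (not WordPress’s actual code). The ACCENT_MAP table and the strip_accents_sketch() name are made up for illustration; the real remove_accents() list is much longer and is not reproduced here.

```python
from urllib.parse import quote

# Hypothetical, heavily abbreviated replacement table (an assumption for
# illustration only; this is not the list WordPress ships).
ACCENT_MAP = {
    "\u00e0": "a",  # a with grave accent
    "\u00e9": "e",  # e with acute accent
    "\u00f1": "n",  # n with tilde
    "\u00f6": "o",  # o with diaeresis
}

def strip_accents_sketch(text: str) -> str:
    """Replace each mapped accented character; leave everything else untouched."""
    return "".join(ACCENT_MAP.get(ch, ch) for ch in text)

slug = "\u00e0-b-c"

# Option 1: swap the accented character for its ASCII look-alike.
print(strip_accents_sketch(slug))  # a-b-c

# Option 2: keep the character but percent-encode its UTF-8 bytes.
print(quote(slug))                 # %C3%A0-b-c
```

Characters that are not in the table, such as Chinese ones, pass through strip_accents_sketch() unchanged, which mirrors the "nothing to replace them with" situation described below.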


As for why the Chinese characters (which aren’t ASCII) are not replaced, stripped out, or encoded/escaped by WordPress: that seems to be intentional. Again, I can’t find any documentation on this, but these characters aren’t even close to ASCII the way the aforementioned accented characters are, so there’s nothing to replace them with. Escaping the URL would be ridiculous and leave it nearly unusable. And this entire thread, and notably this post, shed some light on these characters in URLs:

[addresses with non-ASCII characters] are not URIs (and therefore not URLs, since URLs are a type of URIs). If we consider ourselves beholden to the terminology of existing IETF standards, then we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.
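
As a rough illustration of the conversion that quote describes, the Python sketch below percent-encodes the UTF-8 bytes of every non-ASCII character and leaves ASCII alone. The iri_to_uri_sketch() name is hypothetical, and the sketch only covers the path-style case; it ignores host names and any normalization steps, which a full RFC 3987 conversion would have to consider.

```python
from urllib.parse import quote

def iri_to_uri_sketch(iri: str) -> str:
    """Percent-encode only the non-ASCII characters of an IRI-like string.

    ASCII characters (including '/', '-' and existing '%' escapes) are
    passed through unchanged; this is a simplification, not a full
    RFC 3987 implementation.
    """
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in iri)

# Two Chinese characters (U+4E2D, U+6587) in a slug-like path.
print(iri_to_uri_sketch("/\u4e2d\u6587/"))  # /%E4%B8%AD%E6%96%87/
```

Two characters become eighteen, which gives a feel for why a fully escaped slug is hard to read or type.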

So I would assume that, since WordPress has no handlers for these characters, it leaves them up to the user and the browser, whereas it has been cleaning up accents since the early days.

(I realize this answer doesn’t satisfy the question, but hopefully it provides a bit more info to get you closer to your answer).
