Why does WP encode URLs containing Unicode (UTF-8)? Any drawbacks of Unicode URLs?


There’s no such thing as a real Unicode URL without encoding of some kind. If you type a Unicode character into a URL, the browser encodes it before sending the request. The Unicode character you see in the address bar is purely UI assistance.

Unicode in URLs

For the part that comes after the domain, e.g. /page, we need to URL-encode (percent-encode) according to a separate spec, e.g. with the PHP function rawurlencode (urlencode is similar, but turns spaces into + instead of %20).

So ヒキワリ.ナットウ becomes %E3%83%92%E3%82%AD%E3%83%AF%E3%83%AA.%E3%83%8A%E3%83%83%E3%83%88%E3%82%A6.

You can see the same thing with plain ASCII: put a space into a URL and it gets turned into %20.
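
As a rough illustration of the encoding described above, here is a minimal Python sketch. Python's urllib.parse.quote is used as a stand-in for PHP's rawurlencode; that choice is an assumption made for this example, not what WordPress itself runs. It takes the UTF-8 bytes of the string and writes each byte that is not allowed as %HH.

```python
from urllib.parse import quote, unquote

slug = "ヒキワリ.ナットウ"

# Encode as UTF-8, then write every byte that is not an unreserved
# ASCII character (letters, digits, "-", ".", "_", "~") as %HH.
encoded = quote(slug, safe="")
print(encoded)

# A space is not allowed in a URL either, so it is escaped the same way.
print(quote("a b", safe=""))  # a%20b

# Decoding reverses the process and gives back the original characters.
print(unquote(encoded) == slug)  # True
```

The "." survives untouched because it is one of the characters URLs already allow.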

See https://www.w3.org/International/O-URL-and-ident.html for information on how characters not allowed by the URL RFC are encoded as UTF-8 and then written as hexadecimal values prefixed with %:

Internationalized Resource Identifiers (IRIs) are a new protocol element, a complement to URIs [RFC2396]. An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO10646). There is a mapping from IRIs to URIs, which means that IRIs can be used instead of URIs where appropriate to identify resources.

…

Internationalization of URIs is important because URIs may contain all kinds of information from all kinds of protocols or formats that use characters beyond ASCII. The URI syntax defined in RFC 2396 currently only allows as subset of ASCII, about 60 characters. It also defines a way to encode arbitrary bytes into URI characters: a % followed by two hexadecimal digits (%HH-escaping). However, for historical reasons, it does not define how arbitrary characters are encoded into bytes before using %HH-escaping.

Among various solutions discussed a few years ago, the use of UTF-8 as the preferred character encoding for URIs was judged best. This is in line with the IRI-to-URI conversion, which uses encoding as UTF-8 and then escaping with %hh:

See also https://www.w3.org/International/articles/idn-and-iri/

Unicode in domains

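This is a little different and doesn’t use percent-encoding. Instead, each non-ASCII part of the domain name is converted to an ASCII form (Punycode) that starts with xn--.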

Browsers do a lot of heavy lifting to hide this encoding from you. Take http://JP納豆.例.jp: you can copy and paste the URL, visit it, and so on, but if you try to view the source, you’ll see the URL is actually view-source:http://xn--jp-cd2fp15c.xn--fsq.jp/
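
To see the domain conversion outside a browser, here is a small Python sketch. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat it as an approximation of what a current browser does rather than an exact match.

```python
host = "JP納豆.例.jp"

# The "idna" codec converts each non-ASCII label to its ASCII (xn--) form.
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)  # xn--jp-cd2fp15c.xn--fsq.jp

# Decoding gives back the Unicode labels; the ASCII letters come back lowercased.
unicode_host = ascii_host.encode("ascii").decode("idna")
print(unicode_host)  # jp納豆.例.jp
```

Note there is no % anywhere in the result: the host part and the path part of a URL are encoded by two different mechanisms.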

Read this article for more information:

https://www.w3.org/International/articles/idn-and-iri/

The TLDR

URLs are only allowed to contain a very limited subset of characters due to their history, which wasn’t great once people outside of CERN and the US started to use them. Standards and specs were agreed upon to encode characters that fall outside that subset of ASCII.

So WordPress is converting the pretty URLs into the real URLs. Otherwise, you’d have all sorts of problems with matching and searching the database.

To MySQL, ヒキワリ.ナットウ isn’t the same as %E3%83%92%E3%82%AD%E3%83%AF%E3%83%AA.%E3%83%8A%E3%83%83%E3%83%88%E3%82%A6, so WordPress uses the latter, as that’s the actual URL. ヒキワリ.ナットウ is just the human-friendly visual representation.
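
To make the matching problem concrete, here is a small Python sketch. The slug table and the find_post helper are hypothetical, not WordPress's actual schema or code. The point is only that the two spellings are different strings, so a lookup works reliably only if both sides are normalized to the same form first.

```python
from urllib.parse import quote

# Hypothetical stand-in for a database column that stores encoded slugs.
stored_slugs = {quote("ヒキワリ.ナットウ", safe=""): "post 42"}

def find_post(requested):
    """Look up a slug after normalizing it to the percent-encoded form."""
    # Keep "%" as-is so an already encoded slug is not encoded twice.
    return stored_slugs.get(quote(requested, safe="%"))

pretty = "ヒキワリ.ナットウ"
encoded = quote(pretty, safe="")

print(pretty == encoded)         # False: two different strings
print(stored_slugs.get(pretty))  # None: a raw lookup misses
print(find_post(pretty))         # post 42
print(find_post(encoded))        # post 42
```

Storing one canonical form and converting on input is what keeps every later comparison a plain string match.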

To be more specific, WP converts it on input, not output. If you disable that, two things happen: first, your HTML no longer validates, and second, the browser converts it to the encoded value when you try to use the URL anyway. Some software may not do this conversion though, and if you put the Unicode characters directly in the HTTP request, that may break things.
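
A minimal Python sketch of that last failure mode follows. It assumes a strict client that builds the request line as plain ASCII; the request_line helper is hypothetical, and real clients vary in how they react.

```python
from urllib.parse import quote

def request_line(path):
    """Build an HTTP/1.1 request line, treated here as strictly ASCII."""
    return f"GET {path} HTTP/1.1\r\n".encode("ascii")

raw_path = "/ヒキワリ.ナットウ"

try:
    request_line(raw_path)
    raw_ok = True
except UnicodeEncodeError:
    # The raw characters cannot be represented in the ASCII request line.
    raw_ok = False

print(raw_ok)  # False

# The percent-encoded path is pure ASCII, so it goes through unchanged.
encoded_line = request_line(quote(raw_path))
print(encoded_line)
```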
