Japanese ASCII Code

ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters. I believe you want the Unicode code points for so

JSON and escaping characters

This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC: 2.5. Strings The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks … Read more

List of all unicode’s open/close brackets?

There is a plain-text database of information about every Unicode character available from the Unicode Consortium; the format is described in Unicode Annex #44. The primary information is contained in UnicodeData.txt. Open and close punctuation characters are denoted with Ps (punctuation start) and Pe (punctuation end) in the General_Category field (the third field, delimited by … Read more

What does _T stands for in a CString

_T stands for “text”. It will turn your literal into a Unicode wide character literal if and only if you are compiling your sources with Unicode support. See http://msdn.microsoft.com/en-us/library/c426s321.aspx.

Byte and char conversion in Java

A character in Java is a Unicode code-unit which is treated as an unsigned number. So if you perform c = (char)b the value you get is 2^16 – 56 or 65536 – 56. Or more precisely, the byte is first converted to a signed integer with the value 0xFFFFFFC8 using sign extension in a widening conversion. This in turn is … Read more

What’s the difference between UTF-8 and UTF-8 without BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8. Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary. According to the Unicode standard, the BOM … Read more