What is the difference between UTF-8 and Unicode?

To expand on the answers others have given: We’ve got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point. Computers deal with such numbers as bytes… skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an …
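A minimal sketch of the distinction in Python 3: the code point is the number Unicode assigns to a character, and UTF-8 is one way of turning that number into bytes. The helper name `utf8_info` and the sample characters are assumptions made for illustration, not part of the original answer.

```python
def utf8_info(ch):
    """Return (code point, UTF-8 bytes) for a single character."""
    code_point = ord(ch)             # the number Unicode assigns to the character
    utf8_bytes = ch.encode("utf-8")  # how UTF-8 stores that number as bytes
    return code_point, utf8_bytes


# Sample characters chosen for illustration only.
for ch in ["A", "\u00e9", "\u20ac"]:
    code_point, utf8_bytes = utf8_info(ch)
    print(f"U+{code_point:04X} -> {utf8_bytes.hex()} ({len(utf8_bytes)} byte(s))")
```

The same code point can take one, two, or three bytes here depending on its value, which is why "Unicode" (the numbering) and "UTF-8" (the byte encoding) are not interchangeable terms.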

python encoding utf-8

You don’t need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That implicit decode is what is failing here. Just write your data directly to the file; there is no need to encode already-encoded data. If you build up unicode values instead, you would …
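The implicit decode-then-encode described above is Python 2 behaviour; in Python 3, bytes objects have no `.encode()` method at all. The underlying rule can still be sketched in Python 3: write already-encoded bytes as they are, and encode text exactly once at the file boundary. The sample string and the temporary output path are assumptions made for illustration.

```python
import os
import tempfile

text = "caf\u00e9"               # Unicode text (str); sample value is an assumption
encoded = text.encode("utf-8")   # bytes that are already UTF-8 encoded

# Hypothetical output location, created only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Already-encoded bytes: write them unchanged in binary mode, no second encode.
with open(path, "wb") as f:
    f.write(encoded)

# Unicode text: let the file object encode it exactly once.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Both approaches leave the same bytes on disk.
with open(path, "rb") as f:
    on_disk = f.read()
print(on_disk)
```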

Python – Reading and writing csv files with utf-8 encoding

You report three separate problems. This is a bit of a shot in the dark, because there’s not enough information to be sure, but you should try the following: input encoding: As suggested in comments, try “utf-8-sig”. This will remove the Byte Order Mark (BOM) from your input. double quotes: Among the csv parameters, you specify quoting=csv.QUOTE_NONE. This tells the csv library …
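The first two suggestions can be tried on a small in-memory sample. This is only a sketch: the bytes in `raw` and the helper `read_rows` are hypothetical stand-ins for the asker's file and reading code, which the excerpt does not show.

```python
import csv
import io

# Hypothetical input: UTF-8 CSV data that starts with a BOM (EF BB BF).
raw = b'\xef\xbb\xbfname,city\r\n"M\xc3\xbcller","K\xc3\xb6ln"\r\n'


def read_rows(data, encoding, quoting=csv.QUOTE_MINIMAL):
    """Decode the bytes with the given codec and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text, quoting=quoting))


rows_plain = read_rows(raw, "utf-8")      # BOM survives as '\ufeff' in the first field
rows_sig = read_rows(raw, "utf-8-sig")    # BOM is stripped while decoding
rows_quote_none = read_rows(raw, "utf-8-sig", csv.QUOTE_NONE)  # quotes kept literally

print(rows_plain[0])
print(rows_sig)
print(rows_quote_none[1])
```

With `csv.QUOTE_NONE` the reader does no special processing of quote characters, so the double quotes stay in the field values.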

u’\ufeff’ in Python string

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only …
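A short sketch of how the choice of codec decides whether the BOM is removed, using Python 3's standard codecs. The sample byte strings are assumptions made for illustration.

```python
# Sample data, chosen for illustration only.
utf16_le = b"\xff\xfeh\x00i\x00"   # BOM FF FE followed by "hi" in little-endian UTF-16
utf8_bom = b"\xef\xbb\xbfhi"       # UTF-8-encoded BOM (EF BB BF) followed by "hi"

# Codecs that expect a BOM consume it while decoding.
from_utf16 = utf16_le.decode("utf-16")
from_utf8_sig = utf8_bom.decode("utf-8-sig")

# Codecs that do not expect a BOM keep it as the character U+FEFF.
kept_utf16_le = utf16_le.decode("utf-16-le")
kept_utf8 = utf8_bom.decode("utf-8")

print(repr(from_utf16), repr(from_utf8_sig))
print(repr(kept_utf16_le), repr(kept_utf8))
```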

error UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte

Python tries to convert a byte array (a bytes object that it assumes holds a UTF-8-encoded string) to a Unicode string (str). This conversion is a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded data (namely the 0xff at position 0). Since you did …
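The failure can be reproduced with a few bytes. In this sketch the data is assumed to be UTF-16 with a BOM, which is one common source of a leading 0xff; the asker's actual data and its real encoding are not shown in the excerpt.

```python
# Assumed sample: "hi" in little-endian UTF-16, preceded by the BOM FF FE.
data = b"\xff\xfeh\x00i\x00"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # The exception records which byte was rejected and why.
    print(f"byte {exc.object[exc.start]:#04x} at position {exc.start}: {exc.reason}")

# Decoding with the codec that matches the data succeeds.
decoded = data.decode("utf-16")
print(decoded)
```

No valid UTF-8 sequence starts with 0xff, so the decoder rejects the very first byte.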