Differences between utf8 and latin1

UTF-8 is prepared for world domination; Latin1 isn’t. If you try to store non-Latin characters such as Chinese, Japanese, Hebrew, or Russian using the Latin1 encoding, they will end up as mojibake. You may find the introductory text of this article useful (even more so if you know a bit of Java). Note that full 4-byte UTF-8 support was only introduced in MySQL …
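
To make the mojibake point concrete, here is a minimal Python sketch. The sample string and variable names are illustrative assumptions, not taken from the original answer. It shows that Latin1 cannot encode Japanese text at all, and that UTF-8 bytes misread as Latin1 turn into garbage that can only be repaired by reversing the wrong step.

```python
# Illustrative sample text (an assumption, not from the original answer).
text = "日本語"

# Latin1 only covers code points U+0000..U+00FF, so encoding fails outright.
try:
    text.encode("latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc

# UTF-8 can encode every Unicode code point; these three characters
# take three bytes each.
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as Latin1 never raises, because every byte value
# maps to some Latin1 character. The result is mojibake.
mojibake = utf8_bytes.decode("latin-1")

# The damage is reversible only if the wrong decoding step is undone exactly.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(latin1_error)
print(len(utf8_bytes), repr(mojibake))
print(repaired == text)
```

A database column declared as Latin1 behaves like the middle step: the bytes are accepted, but they are interpreted with the wrong character set.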

What’s the difference between UTF-8 and UTF-8 without BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8. Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary. According to the Unicode standard, the BOM …
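
The three BOM bytes can be inspected from Python. This is a small sketch assuming Python 3 and its built-in "utf-8-sig" codec; the sample string is illustrative. It shows that the BOM is simply a three-byte prefix, and what happens when a reader does or does not strip it.

```python
import codecs

payload = "h\u00e9llo"  # illustrative sample text: "héllo"

with_bom = payload.encode("utf-8-sig")   # codec that prepends the BOM
without_bom = payload.encode("utf-8")    # plain UTF-8, no BOM

# The BOM is exactly the bytes 0xEF, 0xBB, 0xBF in front of the same payload.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == without_bom

# A plain UTF-8 decoder keeps the BOM as the character U+FEFF.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" decoder strips a BOM if present and tolerates its absence.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_sig_no_bom = without_bom.decode("utf-8-sig")

print(codecs.BOM_UTF8, has_bom, same_payload)
print(repr(decoded_plain), repr(decoded_sig), repr(decoded_sig_no_bom))
```

When reading files of unknown origin, decoding with "utf-8-sig" is a reasonable default because it accepts both variants.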

What is a unicode string?

Update for Python 3: In Python 3, Unicode strings are the default. The type str is a sequence of Unicode code points, and the type bytes is used for representing sequences of 8-bit integers (often interpreted as ASCII characters). Here is the code from the question, updated for Python 3: … Working with files: … Historical answer for Python 2: In Python 2, …
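
The code from the original answer is not reproduced in this excerpt, so the following is a stand-in sketch rather than the author's code. It assumes Python 3 and uses a made-up sample string and a temporary file to show the str/bytes distinction and an explicit encoding when working with files.

```python
import tempfile
from pathlib import Path

text = "na\u00efve caf\u00e9"   # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: a sequence of 8-bit integers

# The two accented letters are one code point each but two bytes each in UTF-8.
print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Working with files: always name the encoding instead of relying on the
# platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"   # hypothetical file name
    path.write_text(text, encoding="utf-8")
    raw_on_disk = path.read_bytes()                    # what is stored
    round_tripped = path.read_text(encoding="utf-8")   # decoded back to str

print(raw_on_disk == data, round_tripped == text)
```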

UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x80 in position 3131: invalid start byte

It doesn’t help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It’s a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information. The error is happening because line is a string and you’re calling encode(). encode() only makes sense if the string is a Unicode string, so Python tries to convert it to Unicode first using …
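
The "invalid start byte" message can be reproduced in isolation. The sketch below is written for Python 3 (the answer above concerns Python 2) and uses a made-up byte string. The choice of cp1252 as the real encoding is only an assumption for illustration; the actual encoding of the asker's data is not known from this excerpt.

```python
raw = b"caf\x80"   # illustrative bytes; 0x80 cannot start a UTF-8 sequence

# Decoding as UTF-8 fails at the offending byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error.reason, error.start)

# If the data is really in another encoding, decode with that one instead.
# cp1252 is an assumption here; in cp1252, byte 0x80 is the euro sign.
as_cp1252 = raw.decode("cp1252")

# When the encoding is unknown, errors="replace" at least avoids the crash
# by substituting U+FFFD for undecodable bytes.
lossy = raw.decode("utf-8", errors="replace")

print(repr(as_cp1252), repr(lossy))
```

The fix is to decode bytes once, at the input boundary, with the encoding the data was actually written in, rather than calling encode() on data that has not been decoded yet.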

Using unicode character u201c

The reason is that in Python 3.x you can’t just mix Unicode strings with byte strings. You have probably read manuals dealing with Python 2.x, where such things are possible as long as the byte string contains convertible characters. works fine for me, so the only reason is that you’re using the wrong encoding for the source file or …
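
A short sketch of the rule about not mixing the two types, assuming Python 3. The variable names and the surrounding text are illustrative; only the character U+201C comes from the question title.

```python
quote = "\u201c"   # LEFT DOUBLE QUOTATION MARK as a str

# Mixing str and bytes raises TypeError in Python 3; there is no implicit
# conversion as there was in Python 2.
try:
    mixed = quote + b"hello"
    mix_error = None
except TypeError as exc:
    mix_error = exc

# Convert explicitly in one direction or the other before combining.
as_text = quote + b"hello".decode("ascii")
as_bytes = quote.encode("utf-8") + b"hello"

print(type(mix_error).__name__)
print(repr(as_text), as_bytes)
```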