Understanding a typical 8-bit character problem (such as with accented characters in European languages)

July 23, 2012

If a single accented European character is displayed as two seemingly random characters, then at some point utf-8 bytes were incorrectly interpreted as bytes in a single-byte ANSI encoding.

For example, consider the character “é”.

In the utf-8 encoding, this character is represented by two bytes: 0xC3 0xA9.
In a typical ANSI encoding (such as Windows-1252 or iso-8859-1), it is a single byte: 0xE9.
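The byte values above can be checked with a few lines of Python. This is only a sketch: the helper name hex_bytes is made up for illustration, and "cp1252" is Python's codec name for Windows-1252.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex values, e.g. '0xC3 0xA9'."""
    return " ".join(f"0x{b:02X}" for b in data)

ch = "é"

# Encode the same character under each encoding mentioned above.
encoded = {name: ch.encode(name) for name in ("utf-8", "cp1252", "iso-8859-1")}

for name, data in encoded.items():
    print(name, hex_bytes(data))
# utf-8 0xC3 0xA9
# cp1252 0xE9
# iso-8859-1 0xE9
```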

If the word “appliquée” is stored as utf-8 bytes, but those bytes are interpreted as ANSI chars, you would see this: “appliquÃ©e”.

The reason for “Ã©” is that each of the bytes 0xC3 and 0xA9 is being interpreted as a separate ANSI char. If the iso-8859-1 code chart at http://en.wikipedia.org/wiki/ISO/IEC_8859-1 is examined, you’ll find that:

0xC3 is “Ã”
0xA9 is “©”
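The misreading can be reproduced directly. The sketch below encodes the word as utf-8 and then deliberately decodes it with the wrong charset; the variable names are illustrative only.

```python
word = "appliquée"

raw = word.encode("utf-8")           # the bytes as stored or transmitted
garbled = raw.decode("iso-8859-1")   # wrong: every byte becomes its own character

print(garbled)  # appliquÃ©e

# Look at the two bytes of "é" one at a time, the way an iso-8859-1 reader would.
for b in "é".encode("utf-8"):
    print(f"0x{b:02X} -> {bytes([b]).decode('iso-8859-1')}")
# 0xC3 -> Ã
# 0xA9 -> ©
```

The garbled string is one character longer than the original, because the two utf-8 bytes of “é” were turned into two separate characters.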

The solution is to determine how and why the utf-8 bytes were mistakenly interpreted as ANSI.
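While tracking down the cause, it can help to confirm the diagnosis by reversing the mistake: re-encode the garbled text with the ANSI charset and decode the result as utf-8. The function below is a hypothetical diagnostic helper, not part of any library, and it assumes the wrong charset was Windows-1252. It is a check, not a substitute for fixing the place where the bytes are misread.

```python
def undo_misread_utf8(text: str, wrong_charset: str = "cp1252") -> str:
    """Reverse a utf-8-read-as-ANSI mistake, if that is what happened.

    Returns the text unchanged when the round trip fails, which suggests
    the text was not garbled in this particular way.
    """
    try:
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(undo_misread_utf8("appliquÃ©e"))  # appliquée
print(undo_misread_utf8("appliquée"))   # appliquée (already correct, left unchanged)
```

If the garbled text comes back readable, the bytes were utf-8 all along and some step decoded them with the wrong charset.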

One common issue with the FTP2 component is that it is not always possible to determine automatically the character encoding of directory listings returned by the FTP server. The Ftp2.DirListingCharset property provides a way to tell the FTP2 component how to interpret the bytes returned in a directory listing. The default is ANSI. However, if the server actually returns the listing as utf-8 bytes, then this misinterpretation will occur. The solution is to set the DirListingCharset property to “utf-8”.
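The effect of the charset setting can be illustrated without an FTP server. The sketch below does not use the FTP2 component's API. The function decode_listing, its charset parameter, and the sample filenames are all hypothetical, and "cp1252" stands in for the ANSI default.

```python
def decode_listing(raw_listing: bytes, charset: str = "cp1252") -> list:
    """Decode raw directory-listing bytes with the given charset, one entry per line."""
    return raw_listing.decode(charset).splitlines()

# Hypothetical listing from a server that sends filenames as utf-8.
raw_listing = "appliquée.txt\r\nrésumé.pdf\r\n".encode("utf-8")

print(decode_listing(raw_listing))           # ['appliquÃ©e.txt', 'rÃ©sumÃ©.pdf']
print(decode_listing(raw_listing, "utf-8"))  # ['appliquée.txt', 'résumé.pdf']
```

The bytes are identical in both calls. Only the charset used to interpret them changes, which is the role the DirListingCharset property plays.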

admin
