Posts Tagged ‘charset’

Chilkat provides API’s that are identical across a variety of different programming languages. One difficulty in doing this is in handling strings. Different programming languages pass strings in different ways. In some programming languages, such as Python or C/C++, a “string” is simply a sequence of bytes terminated by a null. (I’m referring to “multibyte” strings, not Unicode (utf-16) strings. The term “multibyte” means any charset such that each letter or symbol is represented by one or more bytes without using nulls.) A Python or C/C++ application must indicate how the bytes are going to be interpreted. There are two choices: ANSI or utf-8. Each Chilkat class has a “Utf8″ property that controls whether the bytes are interpreted as ANSI or utf-8. Note: The Utf8 property only exists in programming languages where strings are passed as a sequence of bytes. For example, in .NET strings are objects and are always passed as objects (and returned as objects). If the ActiveX is used, then strings are always passed as utf-16. However, in the case of Python or C/C++, strings are simply sequences of bytes and some additional mechanism must be used to indicate how the bytes are to be interpreted.

To encrypt a string, we must precisely specify the exact byte representation of the string we want to be encrypted. This is achieved via the Charset property. For example, maybe it is the ANSI byte representation that is to be encrypted. Or maybe it is the utf-16 byte representation. Or maybe utf-8, or anything else. The mechanism to specify the byte representation of the string to be encrypted must be entirely separate from the mechanism used to unambiguously pass the string to the Chilkat method. These are two separate things. Therefore, string encryption/decryption happens in these steps:

Encrypting a String (EncryptStringENC)

1) Unambiguously pass the string to the EncryptStringENC method.
2) (Internal to the Chilkat method) Convert the string to the byte representation specified by the Charset property.
3) Encrypt
4) Encode the binary encrypted bytes according to the EncodingMode property (which can be base64, hex, etc.) and return this string.

Decrypting a String (DecryptStringENC)

1) Pass the encoded string to DecryptStringENC method. Note that all possible encodings (base64, hex, etc.) use only us-ascii chars. In all multibyte charsets, it is only the non-us-ascii chars that are different. us-ascii chars are always represented by a single byte that is less than 0×80. Therefore, the Utf8 property can be either true or false because us-ascii chars have the same byte representation in both utf-8 and ANSI.
2) (Internal to the Chilkat method) Decode the base64/hex/etc. to get the binary encrypted bytes.
3) Decrypt to get the string in the byte representation as was indicated by the Charset property when encrypting. (The Charset property must be set to this same value when decrypting.)
4) Unambiguously return the string. For a languages such as Python or C/C++, this means examining the Utf8 property setting, and performing whatever conversion is necessary (if any) to convert from the charset indicated by the Charset property, to return the string in the ANSI or utf-8 encoding. (For languages such as C#, Chilkat will convert as appropriate to return as string object to the .NET language.)

Question:

In C++, is it somehow possible to specify a desired charset (like ISO-8859-15) when getting mail headers with POP3?

Answer:

Instead of calling the method that returns a “const char *” — which can return either utf-8 or ANSI (see this Chilkat blog post about the Utf8 property common to all Chilkat C++ classes), call the alternate method that returns the string in a CkString object.  You can then get the iso-8859-15 string from the CkString object.

Each Chilkat C++ method that returns a string has two versions — an upper-case version that returns the string in a CkString (always the last argument), and a lower-case version that returns a “const char *”.

For example, in the CkEmail class:

bool GetHeaderField(const char *fieldName, CkString &outFieldValue);
const char *getHeaderField(const char *fieldName);

The lower-case method returning a “const char *” returns a pointer to memory that may be overwritten in subsequent calls.  Therefore, make sure to copy the string to a safe place immediately before making additional calls on the same Chilkat object instance.  (Only methods that also return “const char *” would overwrite the memory from a previous call.)

The upper-case version of the method returns the string in a CkString object.  It is an output-only argument, meaning that the CkString contents are replaced, not appended.  To get the iso-8859-15 string from the CkString, call the getEnc method.  For example:

const char  *str_iso_8859_15 = outFieldValue.getEnc("iso-8859-15");

This returns a NULL-terminated string where each character is represented as a single byte using the iso-8859-15 encoding.

Question:
I have a Base64 decode error, as follows:

CkString str;
str.setString("16q");
str.base64Decode("gb2312");
const char *strResult = str.getString();

convert result is { cb f2 }
But the correct result should be { d7 aa}

What’s wrong?

The platform is WinCE 6.0, use Chilkat_PPC_M5.lib

Answer:

The following code shows how to do it correctly:

    CkString str;
    str.setString("16q");

    // The following line of code tells the CkString object to 
    // decode the base64 to raw bytes, then interpret those bytes as
    // GB2312 encoded characters and store them within the string.
    // Internally, the string is stored as utf-8.
    str.base64Decode("gb2312");

    // This is an implicit conversion to ANSI, because
    // getString returns either ANSI or utf-8,
    // depending on the setting of get_Utf8/put_Utf8
    const char *strAnsi = str.getString();

    // Instead, fetch the string as GB2312 bytes:
    const char *strGb2312 = str.getEnc("gb2312");

    const unsigned char *c = (const unsigned char *) strGb2312;
    while (*c != '\0') {  printf("%02x ",*c); c++; }
    printf("\n");

    // The output is "d7 aa "


    // Another way to decode using CkByteData...
    CkByteData data;
    data.appendEncoded("16q","base64");
    c = data.getData();
    unsigned long i;
    unsigned long sz = data.getSize();
    for (i=0; i<sz; i++) { printf("%02x ",*c); c++; }
    printf("\n");

    // The output is "d7 aa "

Charset Name Charset Value
(Hex)
Charset Value
(Decimal)
Code-Page ID
ANSI_CHARSET 0×00 0 1252
DEFAULT_CHARSET 0×01 1
SYMBOL_CHARSET 0×02 2
SHIFTJIS_CHARSET 0×80 128 932
HANGUL_CHARSET 0×81 129 949
GB2312_CHARSET 0×86 134 936
CHINESEBIG5_CHARSET 0×88 136 950
GREEK_CHARSET 0xA1 161 1253
TURKISH_CHARSET 0xA2 162 1254
HEBREW_CHARSET 0xB1 177 1255
ARABIC_CHARSET 0xB2 178 1256
BALTIC_CHARSET 0xBA 186 1257
RUSSIAN_CHARSET 0xCC 204 1251
THAI_CHARSET 0xDE 222 874
EE_CHARSET 0xEE 238 1250
OEM_CHARSET 0xFF 255