Unicode and C

Tue, Jul 15, 2014 | tags: C programming unicode

Have you ever wondered how Unicode and its encodings are handled when programming in C? If you have investigated this question before, you have probably encountered the wchar.h header, which according to the man pages contains functions and types for wide-character handling.

You might have asked yourself after taking a look at these man pages what those “wide-characters” are and why there exist functions like

int mbtowc(wchar_t *pwc, const char *s, size_t n);

which “extracts the next complete multibyte character, converts it to a wide character and stores it at *pwc”. If a multibyte character can already contain a Unicode code point, why do we need a “wide character” as well?

It took some head-scratching until I figured out why there are two types involved in the handling of Unicode characters. The man page for the function contains a subtle hint about what is going on. The page says

NOTES
  The behavior of mbtowc() depends on the LC_CTYPE category of the
  current locale.

where the LC_CTYPE is explained in the locale(7) man page.

LC_CTYPE
  This category determines the interpretation of byte sequences as
  characters (e.g., single versus multibyte characters), character
  classifications (e.g., alphabetic or digit), and the behavior of
  character  classes.   It  changes  the behavior of the character
  handling and classification functions, such  as  isupper(3)  and
  toupper(3),  and  the  multibyte  character  functions  such  as
  mblen(3) or wctomb(3).

On my system when I run the locale command I get the following output.

$ locale
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC=en_US.utf8
LC_TIME=en_US.utf8
LC_COLLATE=en_US.utf8
LC_MONETARY=en_US.utf8
LC_MESSAGES=en_US.utf8
LC_PAPER=en_US.utf8
LC_NAME=en_US.utf8
LC_ADDRESS=en_US.utf8
LC_TELEPHONE=en_US.utf8
LC_MEASUREMENT=en_US.utf8
LC_IDENTIFICATION=en_US.utf8
LC_ALL=

It is easy to see that my LC_CTYPE environment variable is set to en_US.utf8, which means that it is an American-English locale using the UTF-8 encoding. Since mbtowc depends on the value of LC_CTYPE, it must make use of this information somehow.
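
As a quick way to see which locale a program actually ends up with, ```setlocale``` can also be called with NULL as its second argument, which queries the current setting without changing anything. Here is a minimal sketch of my own (not part of the program discussed below):

```
#include <locale.h>
#include <stdio.h>

int main(void)
{
        // Every C program starts out in the minimal "C" locale
        // until setlocale is called.
        printf("startup locale: %s\n", setlocale(LC_CTYPE, NULL));

        // An empty string selects the locale from the environment
        // (LC_ALL, LC_CTYPE or LANG) -- en_US.utf8 in my case.
        setlocale(LC_CTYPE, "");
        printf("environment locale: %s\n", setlocale(LC_CTYPE, NULL));

        return 0;
}
```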

Exactly how ```mbtowc``` does this can be shown by using it in a
simple example program. In order to set up the locale, we have to use the
```setlocale``` function in the program, whose signature can be found
on the ```setlocale(3)``` man page. Calling this function with an empty
string as an argument makes it use the values defined in the environment
variables (so the ones above in my case). I will use the following
[simple](/data/simple.c) program for demonstration
purposes (compile it using ```gcc -Wall -o simple filename```).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char *argv[])
{
        size_t len, wclen;
        int converted;
        // The non-ascii characters in hexadecimal notation: "Hello,
        // \xe4\xb8\x96\xe7\x95\x8c!\n"
        const char *helloworld = "Hello, 世界!\n";
        const char *ch;
        wchar_t helloworldwc[12];
        wchar_t *wcchar;

        // This call is needed to set up the environment.
        setlocale(LC_ALL, "");

        // Getting the string length.
        len = strlen(helloworld);
        wprintf(L"strlen: %lu\n", len);

        converted = 0;
        wcchar = helloworldwc;
        for (ch = helloworld; *ch; ch += converted) {
                // Use the mbtowc function to convert the
                // multibyte-string into the wide-char string.
                converted = mbtowc(wcchar, ch, 4);
                wprintf(L"Bytes converted: %i\n", converted);
                wcchar++;
        }
        *wcchar = '\0';

        wclen = wcslen(helloworldwc);
        wprintf(L"wclen: %lu\n", wclen);
        wprintf(helloworldwc);

        return 0;
}


You can see that I included the ```wchar.h``` header for all the
functions starting with "w" and the ```wchar_t``` type. If I run the
program it prints the following (provided that you do not comment out
the ```setlocale``` call; try commenting it out and recompiling the
program, and the output will most likely change).

$ ./simple
strlen: 15
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 3
Bytes converted: 3
Bytes converted: 1
Bytes converted: 1
wclen: 11
Hello, 世界!


The program prints the length of the char string, converts the characters
in the string to ```wchar_t``` one at a time and then prints the length
of the ```wchar_t``` array obtained by the ```wcslen``` function.

The first thing to note is that the C string length (```strlen```) differs
from the length of the wide-character string (```wclen```). It should
be fairly obvious that ```strlen``` counts the bytes in the C string,
of which there are 15 in my case. This count is due to my editor being set
up to save my files using the UTF-8 encoding (the same as my locale, as we
have seen). That means that the (Sino-)Japanese characters in my C string
are UTF-8-encoded, which results in the byte values that I have written
down in the comment.
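
If you want to verify those byte values yourself, a small sketch like this one (my own addition, using the same string literal) dumps the raw bytes of the C string:

```
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *helloworld = "Hello, 世界!\n";
        const unsigned char *p = (const unsigned char *)helloworld;
        size_t i;

        // Prints 15 bytes; the characters 世 and 界 show up as the
        // UTF-8 sequences e4 b8 96 and e7 95 8c.
        for (i = 0; i < strlen(helloworld); i++)
                printf("%02x ", p[i]);
        printf("\n");

        return 0;
}
```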

So how do we get the actual length of the string in characters shown
and not in bytes used for the encoding? Simple: we convert the original
```helloworld``` C string into a wide-character string using ```mbtowc```
and then run the ```wcslen``` function over it, which returns the
length of a wide-character string. Since ```wcslen``` returns the
expected value of 11 characters, this is apparently what we wanted.
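
As an aside, the standard library also provides ```mbstowcs```, which converts an entire multibyte string in a single call instead of looping over ```mbtowc``` one character at a time. A sketch of the same conversion using it (assuming the same string and locale setup):

```
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
        wchar_t wc[12];
        size_t n;

        setlocale(LC_ALL, "");

        // Convert the whole string at once; returns the number of
        // wide characters written, excluding the terminating null.
        n = mbstowcs(wc, "Hello, 世界!\n", 12);
        wprintf(L"wclen: %zu\n", n);  // prints 11 here

        return 0;
}
```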

Why did the conversion from UTF-8 using the ```mbtowc``` function make it
possible to determine the actual number of characters in the string? The
wide-character type is usually implemented as a data type big enough to
contain all the code points in the Unicode standard. This is the
case on Unix-like operating systems like Linux and OS X, where ```wchar_t```
is typedef'ed to a 4-byte ```int```. Not so on Windows, where ```wchar_t``` is only
16 bits long (the same size as a ```short```). To be fair, at the time Windows
implemented ```wchar_t``` the Unicode standard did not contain more
than the 65,536 characters representable by a 16-bit type. Through the
```mbtowc``` conversion function, the UTF-8-encoded code point in
the C string is converted to a ```wchar_t``` that is set to a value
corresponding to the Unicode code point itself (which is identical to
its UTF-32 encoding). After the conversion each Unicode character
has the same length (the length of a ```wchar_t```), and ```wcslen```
can easily figure out the length of the string in Unicode characters.
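
A couple of lines are enough to check what your own platform does. This sketch also tests the C99 ```__STDC_ISO_10646__``` macro, which is defined when ```wchar_t``` values are Unicode (ISO 10646) code points:

```
#include <stdio.h>
#include <wchar.h>

int main(void)
{
        // Prints 4 on glibc/Linux; on Windows it would print 2.
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

#ifdef __STDC_ISO_10646__
        // Defined (as a date of the form yyyymmL) when wchar_t
        // values are ISO 10646 / Unicode code points.
        printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
#endif

        return 0;
}
```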

The conversion to a wide-char type allows the standard library to abstract
away the different encodings used to encode non-ASCII characters. A
program making use of the functions defined in the ```wchar.h``` header
and the ```wchar_t``` type does not have to take into account multiple text
encodings for input and output. It can rely on the multibyte-to-wide-char
functions to convert the encoded multibyte values into the corresponding
code points. Those code points can then be used in all the functions that
take the ```wchar_t``` type and the wide-char-to-multibyte functions
will then take care of converting them to the local encoding again
(as long as the locale is set up correctly, that is).
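
The reverse direction is covered by ```wctomb``` and ```wcstombs```. Here is a sketch of such a round trip back to the locale's multibyte encoding (again assuming a UTF-8 locale):

```
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const wchar_t *wide = L"Hello, 世界!\n";
        char mb[64];
        size_t n;

        setlocale(LC_ALL, "");

        // Converts the wide string back to the locale's encoding;
        // returns the number of bytes written, excluding the null.
        n = wcstombs(mb, wide, sizeof(mb));
        if (n != (size_t)-1)
                printf("converted to %zu bytes: %s", n, mb);  // 15 bytes under UTF-8

        return 0;
}
```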

That the representation of the Unicode code point changed from a UTF-8-encoded
value to a ```wchar_t``` containing the code point value itself
can be shown by replacing the ```wprintf(helloworldwc);``` line with
the following (or downloading a changed version of the simple program
[here](/data/simple.hextended.c)).

const char *end;
for (wcchar = helloworldwc; *wcchar; wcchar++) {
        // Print the numeric value of the wide character first...
        wprintf(L"dec: %i / hex: ", *wcchar);
        // ...then reinterpret the wchar_t as four raw bytes and
        // print each byte's hexadecimal value.
        ch = (char *)wcchar;
        for (end = (ch + 3); ch <= end; ch++) {
                wprintf(L"%2X", *ch);
        }
        wprintf(L"\n");
}

Compiling and running the changed program should print the
decimal and hexadecimal value of each ```wchar_t``` in the wide-char
string, which looks something like this.

dec: 72 / hex: 48 0 0 0
dec: 101 / hex: 65 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 111 / hex: 6F 0 0 0
dec: 44 / hex: 2C 0 0 0
dec: 32 / hex: 20 0 0 0
dec: 19990 / hex: 164E 0 0
dec: 30028 / hex: 4C75 0 0
dec: 33 / hex: 21 0 0 0
dec: 10 / hex: A 0 0 0


The Japanese characters in the string have the highest values. You
can easily confirm that the printed values are the actual Unicode
code points by searching for them on one of the many Unicode
search sites. Remember though that since most of us use
little-endian processors, the hexadecimal byte values are printed
in the wrong order (least significant byte first). To search for
the two Japanese characters in the string you have
to swap the byte values and use the official [code point specification
format](http://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology)
like this.

U+4E16 U+754C

How to print the hex values of the wide-characters in the correct order is left as an exercise for the reader :P
