Unicode and C

Tue, Jul 15, 2014 | tags: C programming unicode

Have you ever wondered how Unicode and its encodings are handled when programming in C? If you have investigated this question before, you have probably encountered the wchar.h header, which according to the man pages contains functions and types for wide-character handling.

After taking a look at these man pages, you might have asked yourself what those “wide characters” are and why there are functions like the following one (declared in stdlib.h)

int mbtowc(wchar_t *pwc, const char *s, size_t n);

which “extracts the next complete multibyte character, converts it to a wide character and stores it at *pwc.” If a multibyte character can already encode a Unicode code point, why do we need a “wide character” as well?

It took some head-scratching until I figured out why there are two types involved in the handling of Unicode characters. The man page for the function contains a subtle hint about what is going on. The page says

NOTES
  The behavior of mbtowc() depends on the LC_CTYPE category of the
  current locale.

where LC_CTYPE is explained in the locale(7) man page.

LC_CTYPE
  This category determines the interpretation of byte sequences as
  characters (e.g., single versus multibyte characters), character
  classifications (e.g., alphabetic or digit), and the behavior of
  character  classes.   It  changes  the behavior of the character
  handling and classification functions, such  as  isupper(3)  and
  toupper(3),  and  the  multibyte  character  functions  such  as
  mblen(3) or wctomb(3).

On my system when I run the locale command I get the following output.

$ locale
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC=en_US.utf8
LC_TIME=en_US.utf8
LC_COLLATE=en_US.utf8
LC_MONETARY=en_US.utf8
LC_MESSAGES=en_US.utf8
LC_PAPER=en_US.utf8
LC_NAME=en_US.utf8
LC_ADDRESS=en_US.utf8
LC_TELEPHONE=en_US.utf8
LC_MEASUREMENT=en_US.utf8
LC_IDENTIFICATION=en_US.utf8
LC_ALL=

It is easy to see that my LC_CTYPE setting is en_US.utf8, which means that it is an American English locale using the UTF-8 encoding. Since mbtowc depends on the value of LC_CTYPE according to the man page, we can assume that the function makes use of this information somehow.
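The effect of LC_CTYPE can also be observed from inside a program. The following is a minimal sketch, not anything taken from the man pages: the helper name max_bytes_per_char is made up for illustration, and it relies only on setlocale and the standard MB_CUR_MAX macro, which reports how many bytes one multibyte character may occupy in the current locale.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

// Hypothetical helper (not part of any library): switch LC_CTYPE to the
// named locale and report the maximum number of bytes that a single
// multibyte character can occupy in it.
// Returns 0 if the locale is not available on this system.
// Note that this changes the LC_CTYPE setting of the whole process.
size_t max_bytes_per_char(const char *locale_name) {
	assert(locale_name != NULL);
	if (setlocale(LC_CTYPE, locale_name) == NULL) {
		return 0;
	}
	return MB_CUR_MAX;
}
```

In the default "C" locale the result is 1. Passing an empty string picks up the locale from the environment, and for a UTF-8 locale such as the one shown above the result is larger than 1, since one character can take several bytes.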

Exactly how mbtowc does this can be shown with a simple example program. In order to set up the locale, we have to call the setlocale function in the program, whose signature can be found on the setlocale(3) man page. Calling this function with an empty string as the locale argument makes it use the values defined in the environment variables (so the ones above in my case). I will use the following simple program for demonstration purposes (compile it using gcc -Wall -o simple filename).

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void) {
	size_t len, wclen;
	int converted;
	// The non-ASCII characters in hexadecimal notation:
	// "Hello, \xe4\xb8\x96\xe7\x95\x8c!\n"
	const char *helloworld = "Hello, 世界!\n";
	const char *ch;
	// 11 characters plus the terminating null character.
	wchar_t helloworldwc[12];
	wchar_t *wcchar;

	// Adopt the locale from the environment. Without this call the
	// program runs in the default "C" locale.
	setlocale(LC_ALL, "");

	// Getting the string length in bytes.
	len = strlen(helloworld);
	wprintf(L"strlen: %zu\n", len);

	wcchar = helloworldwc;
	for (ch = helloworld; *ch; ch += converted) {
		// Use the mbtowc function to convert the multibyte string
		// into the wide-char string, one character at a time.
		converted = mbtowc(wcchar, ch, MB_CUR_MAX);
		if (converted < 0) {
			// Invalid byte sequence for the current locale.
			wprintf(L"Conversion failed\n");
			return 1;
		}
		wprintf(L"Bytes converted: %i\n", converted);
		wcchar++;
	}
	*wcchar = L'\0';

	wclen = wcslen(helloworldwc);
	wprintf(L"wclen: %zu\n", wclen);
	// Pass the string as an argument, never as the format itself.
	wprintf(L"%ls", helloworldwc);

	return 0;
}

You can see that I included the wchar.h header for the wchar_t type and all the functions starting with “w”. When I run the program with the setlocale call in place, it prints the following. (Try commenting the call out and recompiling the program. The output will most likely change…)

$ ./simple
strlen: 15
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 3
Bytes converted: 3
Bytes converted: 1
Bytes converted: 1
wclen: 11
Hello, 世界!

The program prints the length of the char string, converts the characters in the string to wchar_t one at a time and then prints the length of the wchar_t array obtained by the wcslen function.
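When the input is not a known constant, the conversion has to cope with byte sequences that are invalid in the current locale and with a destination buffer that is too small. The following is a minimal sketch of such a defensive conversion using the restartable mbrtowc function from wchar.h. The helper name to_wide and its return convention are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Hypothetical helper: convert the multibyte string src into dst, which
// has room for dstlen wide characters including the terminator.
// Returns the number of wide characters written (terminator excluded),
// or (size_t)-1 on an invalid sequence or if dst is too small.
size_t to_wide(const char *src, wchar_t *dst, size_t dstlen) {
	mbstate_t state;
	size_t count = 0;
	size_t remaining;

	assert(src != NULL && dst != NULL);
	if (dstlen == 0) {
		return (size_t)-1;
	}
	// Start in the initial conversion state.
	memset(&state, 0, sizeof state);
	remaining = strlen(src);

	while (remaining > 0) {
		size_t used;
		if (count + 1 >= dstlen) {
			// No room left for this character plus the terminator.
			return (size_t)-1;
		}
		used = mbrtowc(&dst[count], src, remaining, &state);
		if (used == (size_t)-1 || used == (size_t)-2) {
			// Invalid or truncated multibyte sequence.
			return (size_t)-1;
		}
		src += used;
		remaining -= used;
		count++;
	}
	dst[count] = L'\0';
	return count;
}
```

Like mbtowc, this depends on the LC_CTYPE setting, so setlocale still has to be called first if anything other than the default "C" locale is wanted.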

The first thing to note is that the length reported by strlen differs from the length of the wide-character string (wclen). strlen counts the bytes in the C string, of which there are 15 in my case: 7 for “Hello, ”, 3 for each of the two (Sino-)Japanese characters, and 2 for “!\n”. This count is due to my editor being set up to save my files using the UTF-8 encoding (the same as my locale, as we have seen). That means that the (Sino-)Japanese characters in my C string are UTF-8-encoded, which results in the byte values that I have written down in the comment.

So how do we get the actual length of the string in characters shown and not in bytes used for the encoding? Simple, we convert the original helloworld C string into a wide-character string using mbtowc and then run the wcslen function over it. This function returns the length of a wide-character string. Considering that this function returns the right value this apparently is what we wanted.
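If only the count is needed, the wide-character buffer can be skipped altogether. A minimal sketch, assuming the standard mbsrtowcs function from wchar.h, which counts without storing anything when its destination is NULL. The helper name count_chars is made up for illustration. Keep in mind that what gets counted are wide characters (code points on the systems discussed here), so a letter followed by a combining accent would count as two.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

// Hypothetical helper: count the characters in a multibyte string
// according to the current LC_CTYPE locale.
// Returns (size_t)-1 if the string contains a byte sequence that is
// invalid in that locale.
size_t count_chars(const char *s) {
	mbstate_t state;
	const char *p = s;

	assert(s != NULL);
	// Start in the initial conversion state.
	memset(&state, 0, sizeof state);
	// With a NULL destination, mbsrtowcs() only counts the wide
	// characters (terminator excluded) and ignores the length argument.
	return mbsrtowcs(NULL, &p, 0, &state);
}
```

After a setlocale(LC_ALL, "") call in a UTF-8 locale, count_chars(helloworld) should give 11 for the string from the example program, the same value as wclen.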

Why did the conversion from UTF-8 using the mbtowc function make it possible to determine the actual number of characters in the string? The wide-character type is usually implemented as a data type big enough to hold every code point in the Unicode standard. This is the case on Unix-like operating systems like Linux and OS X, where wchar_t is a typedef for a 4-byte integer type. Not so on Windows, where wchar_t is only 16 bits wide (the same size as a short). To be fair, at the time Windows introduced its wchar_t, the Unicode standard did not contain more than the 65,536 characters representable by a 16-bit type. Through the mbtowc conversion function, each UTF-8-encoded code point in the C string is converted to a wchar_t whose value, on these systems, is the Unicode code point itself (which is identical to its UTF-32 encoding). After the conversion, every Unicode character occupies the same amount of space (one wchar_t), so wcslen can easily determine the length of the string in Unicode characters.
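Whether a particular code point fits into a single wchar_t on a given platform can be checked against the standard WCHAR_MAX macro. A minimal sketch; the helper name fits_in_wchar is made up for illustration, and it assumes that the platform stores code point values directly in wchar_t.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

// Hypothetical helper: report whether wchar_t on this platform can hold
// the given Unicode code point in a single unit (1 = yes, 0 = no).
int fits_in_wchar(unsigned long code_point) {
	// U+10FFFF is the highest code point defined by Unicode.
	assert(code_point <= 0x10FFFFul);
	return code_point <= (unsigned long)WCHAR_MAX;
}
```

With a 4-byte wchar_t every code point fits. With a 16-bit wchar_t anything above U+FFFF does not, which is why such platforms need two units (a surrogate pair) for those characters.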

The conversion to a wide-char type allows the standard library to abstract away the different encodings used for non-ASCII characters. A program making use of the functions defined in the wchar.h header and the wchar_t type does not have to take into account multiple text encodings for input and output. It can rely on the multibyte-to-wide-char functions to convert the encoded multibyte values into the corresponding code points. Those code points can then be used in all the functions that take the wchar_t type, and the wide-char-to-multibyte functions will then take care of converting them to the local encoding again (as long as the locale is set up correctly, that is).
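The round trip described above can be sketched as follows: decode with mbstowcs, work on the wide characters (here with towupper from wctype.h), and encode again with wcstombs. The helper name upcase_multibyte and its return convention are made up for illustration, and the sketch assumes an encoding without shift states, such as UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

// Hypothetical helper: upper-case a multibyte string by going through
// wchar_t. out receives at most outsize bytes including the terminator.
// Returns 0 on success and -1 on a conversion error or a short buffer.
int upcase_multibyte(const char *in, char *out, size_t outsize) {
	wchar_t *wide;
	size_t n, i, written;

	assert(in != NULL && out != NULL);

	// strlen(in) is an upper bound for the number of characters,
	// because every character takes at least one byte.
	n = strlen(in) + 1;
	wide = malloc(n * sizeof *wide);
	if (wide == NULL) {
		return -1;
	}

	// Multibyte (locale encoding) to wide characters.
	n = mbstowcs(wide, in, n);
	if (n == (size_t)-1) {
		free(wide);
		return -1;
	}

	// The actual work happens on wide characters only.
	for (i = 0; i < n; i++) {
		wide[i] = (wchar_t)towupper((wint_t)wide[i]);
	}

	// Wide characters back to the locale encoding.
	written = wcstombs(out, wide, outsize);
	free(wide);
	if (written == (size_t)-1 || written >= outsize) {
		// Error, or no room was left for the terminating null byte.
		return -1;
	}
	return 0;
}
```

Nothing in the function mentions UTF-8 or any other encoding. Which one is used for decoding and encoding is decided entirely by the LC_CTYPE setting at the time of the call.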

That the representation of each Unicode code point changed from a UTF-8-encoded byte sequence to a wchar_t containing the code point value itself can be shown by replacing the final wprintf call, the one that prints helloworldwc, with the following (or downloading a changed version of the simple program here).

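	const unsigned char *byte, *end;
	for (wcchar = helloworldwc; *wcchar; wcchar++) {
		wprintf(L"dec: %i / hex: ", (int)*wcchar);
		// Walk over the bytes of the wchar_t in memory order. Reading
		// them as unsigned char avoids sign extension of bytes >= 0x80.
		byte = (const unsigned char *)wcchar;
		for (end = byte + sizeof(wchar_t) - 1; byte <= end; byte++) {
			wprintf(L"%2X", (unsigned int)*byte);
		}
		wprintf(L"\n");
	}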

Compiling and running the changed program should print the decimal value of each wchar_t in the wide-char string, followed by its individual bytes in hexadecimal, which looks something like this.

dec: 72 / hex: 48 0 0 0
dec: 101 / hex: 65 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 111 / hex: 6F 0 0 0
dec: 44 / hex: 2C 0 0 0
dec: 32 / hex: 20 0 0 0
dec: 19990 / hex: 164E 0 0
dec: 30028 / hex: 4C75 0 0
dec: 33 / hex: 21 0 0 0
dec: 10 / hex:  A 0 0 0

The Japanese characters in the string have the highest values. You can easily confirm that the printed values are the actual Unicode code points by searching for them on one of the many Unicode search sites. Remember, though, that most of us use little-endian processors, which store the least significant byte first, so the hexadecimal bytes are printed in reverse order: the bytes 16 4E stand for the value 0x4E16. To search for the two Japanese characters in the string you have to swap the byte values and use the official code point notation like this.

U+4E16
U+754C

How to print the hex values of the wide-characters in the correct order is left as an exercise for the reader :P
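For comparison with your own solution, here is one possible sketch. It formats the numeric value of the wchar_t instead of walking over its bytes, which sidesteps the byte order entirely. The helper name format_code_point is made up for illustration, and the sketch assumes that the wchar_t holds the Unicode code point itself.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

// Hypothetical helper: write wc in the U+XXXX notation into buf.
// Returns the number of characters the full text needs, as snprintf()
// does. Assumes the value stored in wc is the Unicode code point.
int format_code_point(wchar_t wc, char *buf, size_t bufsize) {
	assert(buf != NULL);
	// %lX prints the most significant hex digits first on any
	// machine, regardless of how the bytes are laid out in memory.
	return snprintf(buf, bufsize, "U+%04lX", (unsigned long)wc);
}
```

Called with the two Japanese characters from the example, this should produce U+4E16 and U+754C.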
