Unicode and C
Have you ever wondered how Unicode and its encodings are handled when
programming in C? If you have investigated this question before, you have
probably encountered the wchar.h
header which, according to the man
pages, contains functions and types for wide-character handling.
After taking a look at these man pages, you might have asked yourself what those “wide characters” are and why there exist functions like
int mbtowc(wchar_t *pwc, const char *s, size_t n);
which “extracts the next complete multibyte character, converts it to a wide character and stores it at *pwc.” If a multibyte character can already contain a Unicode code point, why do we need a “wide character” as well?
It took some head-scratching until I figured out why there are two types involved in the handling of Unicode characters. The man page for the function contains a subtle hint about what is going on. The page says
NOTES
The behavior of mbtowc() depends on the LC_CTYPE category of the
current locale.
where the LC_CTYPE category
is explained in the locale(7) man page:
LC_CTYPE
This category determines the interpretation of byte sequences as
characters (e.g., single versus multibyte characters), character
classifications (e.g., alphabetic or digit), and the behavior of
character classes. It changes the behavior of the character
handling and classification functions, such as isupper(3) and
toupper(3), and the multibyte character functions such as
mblen(3) or wctomb(3).
On my system, when I run the locale
command, I get the following output.
$ locale
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC=en_US.utf8
LC_TIME=en_US.utf8
LC_COLLATE=en_US.utf8
LC_MONETARY=en_US.utf8
LC_MESSAGES=en_US.utf8
LC_PAPER=en_US.utf8
LC_NAME=en_US.utf8
LC_ADDRESS=en_US.utf8
LC_TELEPHONE=en_US.utf8
LC_MEASUREMENT=en_US.utf8
LC_IDENTIFICATION=en_US.utf8
LC_ALL=
It is easy to see that my LC_CTYPE
environment variable is set
to en_US.utf8,
which means that it is an American-English locale
using the UTF-8 encoding. Since mbtowc
depends on the value of
LC_CTYPE
according to the man page, we can assume that the function
makes use of this information somehow.
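In fact, this dependence is easy to observe with one of the related multibyte functions: the same byte sequence is interpreted differently depending on which LC_CTYPE is in effect. The following is a minimal sketch of my own using mblen (the exact return value for the non-ASCII bytes in the default “C” locale is implementation-defined, but it will not be 3).
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // The UTF-8 encoding of 世 (U+4E16).
    const char *seq = "\xe4\xb8\x96";

    // At program start the "C" locale is active; it has no multibyte
    // sequences, so the three bytes are not recognized as one character.
    printf("C locale:   mblen = %d\n", mblen(seq, 3));

    // Switch to the locale from the environment (en_US.utf8 in my case).
    setlocale(LC_ALL, "");
    mblen(NULL, 0); // reset the internal conversion state
    printf("env locale: mblen = %d\n", mblen(seq, 3)); // prints 3
    return 0;
}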
Exactly how mbtowc
does this can be shown by using it in a
simple example program. In order to set up the locale, we have to use the
setlocale
function in the program, whose signature can be found
on the setlocale(3)
man page. Calling this function with an empty
string as an argument makes it use the values defined in the environment
variables (so the ones above in my case). I will use the following
simple program for demonstration
purposes (compile it using gcc -Wall -o simple filename
).
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char *argv[]) {
    size_t len, wclen;
    int converted;
    // The non-ascii characters in hexadecimal notation: "Hello,
    // \xe4\xb8\x96\xe7\x95\x8c!\n"
    const char *helloworld = "Hello, 世界!\n";
    const char *ch;
    wchar_t helloworldwc[12];
    wchar_t *wcchar;

    // This call is needed to set up the locale from the environment.
    setlocale(LC_ALL, "");

    // Getting the string length in bytes.
    len = strlen(helloworld);
    wprintf(L"strlen: %zu\n", len);

    converted = 0;
    wcchar = helloworldwc;
    for (ch = helloworld; *ch; ch += converted) {
        // Use the mbtowc function to convert the
        // multibyte string into the wide-char string.
        converted = mbtowc(wcchar, ch, 4);
        if (converted < 0) {
            // The bytes at ch are not a valid multibyte
            // character in the current locale.
            break;
        }
        wprintf(L"Bytes converted: %i\n", converted);
        wcchar++;
    }
    *wcchar = L'\0';

    wclen = wcslen(helloworldwc);
    wprintf(L"wclen: %zu\n", wclen);
    wprintf(helloworldwc);
    return 0;
}
You can see that I included the wchar.h
header for all the
functions starting with “w” and the wchar_t
type. If I run the
program, it prints the following (provided that you do not comment out
the setlocale
call; try commenting it out and re-compiling the
program, and the output will most likely change …).
$ ./simple
strlen: 15
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 1
Bytes converted: 3
Bytes converted: 3
Bytes converted: 1
Bytes converted: 1
wclen: 11
Hello, 世界!
The program prints the length of the char string, converts the characters
in the string to wchar_t
one at a time, and then prints the length of the resulting wchar_t
array as reported by the wcslen
function.
The first thing to note is that the C string length strlen
differs
from the length of the wide-character string (wclen
). It should
be fairly obvious that strlen
counts the bytes in the C string,
of which there are 15 in my case. This count is due to my editor being set
up to save my files using the UTF-8 encoding (the same as my locale, as we
have seen). That means that the (Sino-)Japanese characters in my C string will
be UTF-8-encoded, which results in the byte values that I have written
down in the comment.
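For reference, the 15 bytes break down as follows.
"Hello, "   7 bytes  (seven ASCII characters, one byte each)
世           3 bytes  (e4 b8 96)
界           3 bytes  (e7 95 8c)
"!"          1 byte
"\n"         1 byte
            --------
            15 bytes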
So how do we get the actual length of the string in characters shown,
and not in bytes used for the encoding? Simple: we convert the original
helloworld
C string into a wide-character string using mbtowc
and then run the wcslen
function over it. This function returns
the length of a wide-character string. Since it
returns the right value (11), this apparently is what we wanted.
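As an aside, converting character by character is not strictly necessary: the standard library also provides mbstowcs, which converts a whole multibyte string in one call and returns the number of wide characters it produced. A minimal sketch, assuming the locale has been set up from the environment as above:
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");
    const char *helloworld = "Hello, 世界!\n";
    wchar_t wcs[12];

    // mbstowcs converts the whole multibyte string at once and returns
    // the number of wide characters written, not counting the L'\0'.
    size_t n = mbstowcs(wcs, helloworld, 12);
    if (n == (size_t)-1) {
        return 1; // invalid multibyte sequence in the current locale
    }
    wprintf(L"characters: %zu\n", n); // prints 11
    return 0;
}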
Why did the conversion from UTF-8 using the mbtowc
function make it
possible to determine the actual number of characters in the string? The
wide-character type is usually implemented as a data type big enough to
contain all the Unicode code points in the Unicode standard. This is the
case on Unix-like operating systems like Linux and OSX, where wchar_t
is typedef’ed to a 4-byte int.
Not so on Windows, where wchar_t
is only
16 bits wide (the same size as a short
). To be fair, at the time Windows
implemented wchar_t,
the Unicode standard did not contain more
than the 65,536 characters representable by a 16-bit type. Through the
mbtowc
conversion function, the UTF-8-encoded Unicode code point in
the C string is converted to a wchar_t
that is set to a value
corresponding to the Unicode code point itself (which is identical to
its UTF-32 encoding). After the conversion, each Unicode character will
have the same size (that of a wchar_t
) and wcslen
can easily figure out the length of the string in Unicode characters.
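Whether wchar_t really holds Unicode code points on a given platform can even be checked at compile time: the C standard defines the macro __STDC_ISO_10646__, which an implementation defines when its wchar_t values are ISO 10646 (Unicode) code points. A small sketch to inspect this on your system:
#include <stdio.h>
#include <wchar.h>

int main(void) {
    // glibc on Linux typically reports 4 bytes here; Windows reports 2.
    printf("sizeof(wchar_t): %zu bytes\n", sizeof(wchar_t));
#ifdef __STDC_ISO_10646__
    // Defined as a date of the form yyyymmL when wchar_t holds code points.
    printf("__STDC_ISO_10646__ = %ld\n", (long)__STDC_ISO_10646__);
#else
    printf("wchar_t values are not guaranteed to be Unicode code points\n");
#endif
    return 0;
}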
The conversion to a wide-char type allows the standard library to abstract
away the different encodings used to encode non-ASCII characters. A
program making use of the functions defined in the wchar.h
header
and the wchar_t
type does not have to take into account multiple text
encodings for input and output. It can rely on the multibyte-to-wide-char
functions to convert the encoded multibyte values into the corresponding
code points. Those code points can then be used in all the functions that
take the wchar_t
type, and the wide-char-to-multibyte functions
will then take care of converting them back to the locale’s encoding
(as long as the locale is set up correctly, that is).
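Going back in the other direction works the same way. Here is a minimal sketch of my own using wcstombs, which re-encodes a wide-character string into the locale’s multibyte encoding (UTF-8 in my case):
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");

    // A wide-character string: the compiler stores the code points directly.
    const wchar_t *wcs = L"Hello, 世界!\n";
    char buf[32];

    // wcstombs encodes the wide characters using the locale's encoding and
    // returns the number of bytes written, not counting the trailing '\0'.
    size_t bytes = wcstombs(buf, wcs, sizeof(buf));
    if (bytes == (size_t)-1) {
        return 1; // a wide character could not be represented
    }
    printf("bytes: %zu\n", bytes); // prints 15 in a UTF-8 locale
    fputs(buf, stdout);
    return 0;
}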
That the representation of the Unicode code point changed from a UTF-8-encoded
value to a wchar_t
containing the Unicode code point value itself
can be shown by replacing the wprintf(helloworldwc);
line with
the following (or downloading a changed version of the simple program
here).
const char *end;
for (wcchar = helloworldwc; *wcchar; wcchar++) {
    wprintf(L"dec: %i / hex: ", *wcchar);
    // Look at the individual bytes of the wchar_t (assumes a 4-byte wchar_t).
    ch = (char *)wcchar;
    for (end = (ch + 3); ch <= end; ch++) {
        wprintf(L"%2X", (unsigned char)*ch);
    }
    wprintf(L"\n");
}
Compiling and running the changed program should print the
decimal and hexadecimal value of each wchar_t
in the wide-char
string, which looks something like this.
dec: 72 / hex: 48 0 0 0
dec: 101 / hex: 65 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 108 / hex: 6C 0 0 0
dec: 111 / hex: 6F 0 0 0
dec: 44 / hex: 2C 0 0 0
dec: 32 / hex: 20 0 0 0
dec: 19990 / hex: 164E 0 0
dec: 30028 / hex: 4C75 0 0
dec: 33 / hex: 21 0 0 0
dec: 10 / hex: A 0 0 0
The Japanese characters in the string have the highest values. You can easily confirm that the printed values are the actual Unicode code points by searching for them on one of the many Unicode search sites. Remember though that since most of us use little-endian processors, the hexadecimal byte values are printed in reverse order. To search for the two Japanese characters in the string, you have to swap the byte values and use the official code point notation, like this.
U+4E16
U+754C
How to print the hex values of the wide-characters in the correct order is left as an exercise for the reader :P
References: