Jim Cobban <jcobban@magma.ca> to GCC-help, 26 Aug 2008 18:08:07 GMT
The following is my understanding of how the industry got to the
situation where this is an issue.
Back in the 1980s Xerox was the first company to seriously examine
multi-lingual text handling, at their famous Palo Alto Research Centre,
in the implementation of the Star platform. Xerox PARC quickly focussed
on a 16-bit representation for each character, which they believed would
be capable of representing all of the symbols used in
living languages. Xerox called this representation Unicode. Microsoft
was one of the earliest adopters of this representation, since it
naturally wanted to sell licenses for M$ Word to every human being on
the planet. Because the primary purpose of Visual C++ was to facilitate
the implementation of Windows and M$ Office, it incorporated support,
using the wchar_t type, for strings of Unicode characters. Later, after
M$ had frozen their implementation, the ISO standards committee decided
that in order to support processing of strings representing non-living
languages (Akkadian cuneiform, Egyptian hieroglyphics, Mayan
hieroglyphics, Klingon, Elvish, archaic Chinese, etc.) more than 16 bits
were needed, so the ISO 10646 standard, as adopted, requires a 32-bit
word to hold every conceivable character.
The definition of a wchar_t string or std::wstring, even if a wchar_t is
16 bits in size, is not the same thing as UTF-16. A wchar_t string or
std::wstring, as defined by the C, C++, and POSIX standards, contains
ONE wchar_t value for each displayed glyph. Put another way, the value of
wcslen() for a wchar_t string is the same as the number of glyphs in the
displayed representation of the string.
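To illustrate (this sketch is mine, not taken from any standard): with a
conforming wide-character library, wcslen() reports the number of
characters, no matter how many bytes the same text would occupy in UTF-8
or UTF-16:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Five characters, five wchar_t values. */
        const wchar_t *w = L"na\u00EFve";        /* "naive" with a diaeresis on the i */
        printf("wcslen = %zu\n", wcslen(w));     /* prints 5 */
        return 0;
    }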
In these standards the size of a wchar_t is not explicitly defined
except that it must be large enough to represent every text
"character". It is critical to understand that a wchar_t string, as
defined by these standards, is not the same thing as a UTF-16 string,
even if a wchar_t is 16 bits in size. UTF-16 may use up to TWO 16-bit
words (a surrogate pair) to represent a single character, although I
believe that almost all
symbols actually used by living languages can be represented in a single
word in UTF-16. I have not worked with Visual C++ recently precisely
because it accepts a non-portable language. The last time I used it, the
M$ library was standards-compliant, with the understanding that its
definition of wchar_t as a 16-bit word meant the library could not
support some languages. If the implementation of the wchar_t strings in
the Visual C++ library has been changed to implement UTF-16 internally,
then in my opinion it is not compliant with the POSIX, C, and C++ standards.
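To make the distinction concrete (my own example, not from any of the
standards): U+1D11E, MUSICAL SYMBOL G CLEF, lies above U+FFFF, so UTF-16
must split it into a surrogate pair of two 16-bit units, while a 32-bit
wchar_t holds it directly:

    #include <stdio.h>

    int main(void)
    {
        unsigned long cp = 0x1D11E;               /* code point above U+FFFF  */
        unsigned long v  = cp - 0x10000;          /* 20-bit value to be split */
        unsigned hi = 0xD800 | (v >> 10);         /* high (lead) surrogate    */
        unsigned lo = 0xDC00 | (v & 0x3FF);       /* low (trail) surrogate    */
        printf("U+%05lX -> 0x%04X 0x%04X in UTF-16\n", cp, hi, lo);
        return 0;
    }

A 16-bit wchar_t string holding that pair would report a length of 2 for
what the user sees as a single character, which is exactly the behaviour
the one-value-per-character model of the standards does not allow.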
Furthermore, UTF-8 and UTF-16 should have nothing to do with the
internals of the representation of strings inside a C++ program. It is
obviously convenient that a wchar_t * or std::wstring should contain
one "word" for each external glyph, which is not true for either UTF-8
or UTF-16. UTF-8 and UTF-16 are standards for the external
representation of text for transmission between applications, and in
particular for writing files used to carry international text. For
example, UTF-8 is clearly a desirable format for the representation of
C/C++ programs themselves, because so many of the characters used in the
language are limited to the ASCII code set, each of which requires only a
single 8-bit byte in UTF-8. However, once such a file is read into an
application, its contents should be represented internally using wchar_t
* or std::wstring with fixed-length words. Full compliance with ISO
10646 requires that the internal representation use at least 32-bit
words, although a practical implementation can get away with 16-bit words.
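A minimal sketch of that external-to-internal conversion, using only
standard C facilities (the locale name and the sample text are my own
assumptions):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* Assumes a UTF-8 locale is installed on the system. */
        setlocale(LC_CTYPE, "en_US.UTF-8");

        const char *external = "na\xC3\xAFve";        /* UTF-8 bytes, e.g. read from a file */
        wchar_t internal[64];

        size_t n = mbstowcs(internal, external, 64);  /* multibyte -> fixed-width wchar_t */
        if (n == (size_t)-1) {
            fprintf(stderr, "invalid multibyte sequence\n");
            return 1;
        }
        printf("%zu bytes externally, %zu wchar_t values internally\n",
               strlen(external), n);
        return 0;
    }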