печально быть антисоциальным - found an interesting story about Unicode

Sep. 29th, 2008

08:51 pm - found an interesting story about Unicode



Jim Cobban <jcobban@magma.ca> to GCC-help, 26 Aug 2008 18:08:07 GMT

The following is my understanding of how the industry got to the situation where this is an issue.

Back in the 1980s Xerox was the first company to seriously examine multilingual text handling, at their famous Palo Alto Research Centre, in the implementation of the Star platform. Xerox PARC quickly focused on a 16-bit representation for each character, which they believed would be capable of representing all of the symbols used in living languages. Xerox called this representation Unicode. Microsoft was one of the earliest adopters of this representation, since it naturally wanted to sell licenses for M$ Word to every human being on the planet. Since the primary purpose of Visual C++ was to facilitate the implementation of Windows and M$ Office, it incorporated support, using the wchar_t type, for strings of Unicode characters. Later, after M$ had frozen their implementation, the ISO standards committee decided that in order to support processing of strings representing non-living languages (Akkadian cuneiform, Egyptian hieroglyphics, Mayan hieroglyphics, Klingon, Elvish, archaic Chinese, etc.) more than 16 bits were needed, so the ISO 10646 standard as adopted requires a 32-bit word to hold every conceivable character.
The definition of a wchar_t string or std::wstring, even if a wchar_t is 16 bits in size, is not the same thing as UTF-16. A wchar_t string or std::wstring, as defined by the C, C++, and POSIX standards, contains ONE wchar_t value for each displayed glyph. Put another way, the value of wcslen() for a wchar_t string is the same as the number of glyphs in the displayed representation of the string.
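A minimal sketch of the property being described, using nothing beyond the standard <cwchar> functions; the sizes printed depend on the compiler (wchar_t is typically 2 bytes with Visual C++ and 4 bytes with GCC on Linux):

    #include <cstdio>
    #include <cwchar>

    int main() {
        // One wchar_t element per character: "Jānis" is five characters,
        // so wcslen() reports 5 no matter how wide a wchar_t is.
        const wchar_t *name = L"J\u0101nis";            // \u0101 is 'ā'
        std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
        std::printf("wcslen(name)    = %u\n", (unsigned)std::wcslen(name));
        return 0;
    }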
In these standards the size of a wchar_t is not explicitly defined except that it must be large enough to represent every text "character". It is critical to understand that a wchar_t string, as defined by these standards, is not the same thing as a UTF-16 string, even if a wchar_t is 16 bits in size. UTF-16 may use TWO 16-bit code units (a surrogate pair) to represent a single code point, although I believe that almost all symbols actually used by living languages can be represented in a single unit in UTF-16. I have not worked with Visual C++ recently, precisely because it accepts a non-portable dialect of the language. The last time I used it the M$ library was standards compliant, with the understanding that its definition of wchar_t as a 16-bit word meant the library could not support some languages. If the implementation of wchar_t strings in the Visual C++ library has been changed to implement UTF-16 internally, then in my opinion it is not compliant with the POSIX, C, and C++ standards.
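A sketch of the surrogate-pair caveat, using an arbitrary character from outside the Basic Multilingual Plane; the result differs purely because of how wide the implementation's wchar_t is:

    #include <cstdio>
    #include <cwchar>

    int main() {
        // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits.
        // With a 32-bit wchar_t (GCC on Linux) the string holds one element;
        // with a 16-bit wchar_t (Visual C++) it is stored as a UTF-16
        // surrogate pair, so wcslen() reports 2 for a single character.
        const wchar_t *clef = L"\U0001D11E";
        std::printf("wcslen(clef) = %u\n", (unsigned)std::wcslen(clef));
        return 0;
    }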
Furthermore, UTF-8 and UTF-16 should have nothing to do with the internal representation of strings inside a C++ program. It is obviously convenient that a wchar_t * or std::wstring should contain one "word" for each external glyph, which is not true for either UTF-8 or UTF-16. UTF-8 and UTF-16 are standards for the external representation of text for transmission between applications, and in particular for writing files used to carry international text. For example, UTF-8 is clearly a desirable format for the representation of C/C++ programs themselves, because so many of the characters used in the language fall within the ASCII code set, which requires only 8 bits per character in UTF-8. However, once such a file is read into an application its contents should be represented internally using wchar_t * or std::wstring with fixed-length words. Full compliance with ISO 10646 requires that internal representation to use at least 32-bit words, although a practical implementation can get away with 16-bit words.
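A minimal sketch of the external-to-internal step described above: decoding UTF-8 bytes read from a file into a fixed-width wide string. It assumes a wchar_t of at least 32 bits, skips validation of malformed input, and the function name is only illustrative:

    #include <stdexcept>
    #include <string>

    // Decode a UTF-8 byte string into a std::wstring with one element per
    // code point (no validation of overlong or surrogate sequences).
    std::wstring utf8_to_wstring(const std::string &in) {
        std::wstring out;
        for (std::string::size_type i = 0; i < in.size(); ) {
            unsigned char b = static_cast<unsigned char>(in[i]);
            unsigned long cp;
            int extra;                          // continuation bytes to follow
            if      (b < 0x80) { cp = b;        extra = 0; }
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            if (i + extra >= in.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            for (int k = 1; k <= extra; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            out.push_back(static_cast<wchar_t>(cp));
            i += extra + 1;
        }
        return out;
    }

    int main() {
        // "Rīga" as raw UTF-8 bytes: 5 bytes, 4 code points.
        std::wstring w = utf8_to_wstring("R\xC4\xABga");
        return w.size() == 4 ? 0 : 1;
    }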


Comments:

From: dooora
Date: September 30th, 2008 - 07:13 am

aha, this is a neat little piece, I'll reread it more carefully once I wake up.
From: bubu
Date: September 30th, 2008 - 12:11 pm

There's also the catch that wchar_t isn't always 16-bit. With GCC on Linux, at least, wchar_t is 32 bits.
In any case, I've never understood why I'd want to use wchar_t (or std::wstring). I normally use std::string with UTF-8 encoding, plus a small library that lets you work with such a string (substr, iterators) while understanding UTF-8 characters: http://utfcpp.sourceforge.net/ The only hassle is if you want to call the Unicode variants of the WinAPI functions. But UTF-8 can also be converted to UCS-2 relatively easily.
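A sketch of what the comment above describes, assuming the utf8::distance and utf8::utf8to16 helpers as the utfcpp header provides them (treat the exact names as a recollection rather than a reference); byte counts and code-point counts diverge as soon as non-ASCII text appears:

    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    #include "utf8.h"   // utfcpp, http://utfcpp.sourceforge.net/

    int main() {
        // UTF-8 kept in a plain std::string; 'ī' is the two bytes 0xC4 0xAB.
        std::string text = "R\xC4\xABga";                           // "Rīga"

        std::cout << "bytes:       " << text.size() << "\n";        // 5
        std::cout << "code points: "
                  << utf8::distance(text.begin(), text.end()) << "\n"; // 4

        // Conversion to UTF-16 only where a "wide" WinAPI call demands it.
        std::vector<unsigned short> utf16;
        utf8::utf8to16(text.begin(), text.end(), std::back_inserter(utf16));
        return 0;
    }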
From: elfz
Date: September 30th, 2008 - 03:55 pm

It looks to me like his main arguments are purism, tradition, and the rather weak "It is obviously convenient that a wchar_t * or std::wstring should contain one "word" for each external glyph", but then I'm also from the generation that considers UTF-8 the best thing ever, and that keeping all text in UTF-8 and operating on it as UTF-8 solves all of life's important problems.
From: smejmoon
Date: September 30th, 2008 - 09:24 pm

The hassle comes when you have to use the Windows W APIs (and God forbid the A APIs), the OS X UTF-8 APIs (fopen, for example, takes a char* encoded in UTF-8 by default), and the OS X "Unicode" APIs. All at the same time.

I repeat: when you have to support several OSes whose names start with Windows and several OSes whose names start with Mac, it's a hassle. :) Especially since the C++ compilers' and libraries' take on this also changes over time and differs between NT, XP and Vista, and somewhere in the 10.4/10.5 transition some mess happened as well. Not to mention that programmers themselves are at different stages of development. I know, I am.
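A sketch of the kind of wrapper this situation forces on you, assuming the usual Win32 calls MultiByteToWideChar and _wfopen; the caller always passes a UTF-8 path and the platform difference stays in one place (error handling omitted):

    #include <cstdio>
    #include <cstring>
    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Open a file whose name is given in UTF-8, on any of the platforms above.
    std::FILE *open_utf8(const std::string &path, const char *mode) {
    #ifdef _WIN32
        // The narrow ("A") WinAPI reads char* in the local code page, not
        // UTF-8, so convert the path to UTF-16 and use the wide CRT call.
        // The length returned by MultiByteToWideChar includes the NUL.
        int len = MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1, NULL, 0);
        std::wstring wpath(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1, &wpath[0], len);
        std::wstring wmode(mode, mode + std::strlen(mode));   // mode is ASCII
        return _wfopen(wpath.c_str(), wmode.c_str());
    #else
        // OS X (and other Unix systems) accept UTF-8 in fopen() directly.
        return std::fopen(path.c_str(), mode);
    #endif
    }

    int main() {
        std::FILE *f = open_utf8("R\xC4\xABga.txt", "r");   // "Rīga.txt"
        if (f) std::fclose(f);
        return 0;
    }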
From: smejmoon
Date: September 30th, 2008 - 09:25 pm

I'm not saying the old fellow has uttered some great truth, just that he put the thought in an interesting way.