The ISO/IEC 10646 standard, entitled Information technology - Universal Multiple-Octet Coded Character Set (UCS), aims to define a universal character encoding system for all languages.
It is closely tied to Unicode.
Characters, symbols, glyphs, letters, digits, ideograms and logograms are drawn from the languages, scripts and traditions of the whole world and are then listed in the UCS. Characters from other, less widely known writing systems are also added or updated in the UCS on a regular basis.
Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and the ISO/IEC 10646 standard. The repertoire, character names and code points of Unicode version 2.0 correspond exactly to those of ISO/IEC 10646-1:1993 with its first seven published amendments. Each new version of Unicode then gives rise to an update of the standard, i.e. the addition of new characters and the revision of those already present. For example, Unicode 3.0, published in February 2000, corresponds to ISO/IEC 10646-1:2000. See the section Relation with Unicode for more details.
The UCS contains more than 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) were in common use before 2000. This began to change when the People's Republic of China (PRC) ruled in 2000 that computer systems sold on its territory had to support GB 18030, which in turn required those systems to handle characters beyond the BMP.
The system deliberately leaves many code points unassigned, even in the BMP. This leaves room for future extensions and minimizes conflicts with other encodings.
ISO/IEC 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character. Exactly two bytes, i.e. one 16-bit word, are therefore enough to represent the value of any such code point. Consequently, UCS-2 allows a binary representation of every code point of the Basic Multilingual Plane (BMP), provided that the code point is assigned, i.e. that it actually represents a character. It also follows that UCS-2 cannot represent code points outside the BMP.
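The fixed two-byte layout described above can be sketched in a few lines of Python. This is only an illustration, not text from the standard: the function name `ucs2_encode` is hypothetical, and big-endian byte order is an assumption made for the example.

```python
import struct

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def ucs2_encode(text: str) -> bytes:
    """Encode text as UCS-2: exactly two bytes per code point.

    Big-endian byte order is assumed here for illustration.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > BMP_MAX:
            # UCS-2 has no way to express code points outside the BMP.
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        out += struct.pack(">H", cp)  # one unsigned 16-bit word
    return bytes(out)


print(ucs2_encode("A\u20ac").hex())  # two characters, four bytes
```

Every character occupies the same number of bytes, so the n-th character always starts at byte offset 2n; the price is that a character such as U+1F600 is rejected outright.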
The first amendment to the original edition of the UCS defines UTF-16, an extension of UCS-2, to represent code points outside the BMP. The BMP contains a range, located in the special zone (S-zone), whose code points are not assigned to characters. UCS-2 forbids the use of code values for these code points, but UTF-16 allows them to be used in pairs to designate characters outside the BMP.
This range is split into a high half and a low half, and each pair consists of one "RC-element" from each of these halves. An RC-element is a two-byte sequence made of an R-octet and a C-octet taken from the four-byte sequence that corresponds to a cell in the coding space of a coded character set.
Unicode also adopted UTF-16, but in Unicode terminology the elements of the high half are called "high surrogates" and those of the low half "low surrogates". The term "surrogate code unit" is also encountered.
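The pairing mechanism can be illustrated with a short Python sketch. The function names `to_surrogates` and `from_surrogates` are hypothetical; the ranges used (high half 0xD800 to 0xDBFF, low half 0xDC00 to 0xDFFF) are those of UTF-16 as defined by Unicode.

```python
from typing import Tuple

HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF  # high half of the reserved range
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF    # low half of the reserved range


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into a (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = cp - 0x10000                  # a 20-bit value
    high = HIGH_FIRST + (offset >> 10)     # top 10 bits
    low = LOW_FIRST + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (HIGH_FIRST <= high <= HIGH_LAST and LOW_FIRST <= low <= LOW_LAST):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Because the two halves do not overlap, a decoder can tell from a single 16-bit value whether it is the first or the second element of a pair.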
Another encoding, UCS-4, uses a single code value between 0x00 and (theoretically) 0x7FFFFFFF for each character (although the UCS stops at 0x10FFFF, and ISO/IEC 10646 has established that all future character assignments will take place within that range). UCS-4 represents each value with exactly four bytes (one 32-bit word). It can therefore represent every code point of the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed size in bytes, which makes it simple to handle, but it naturally requires twice as much memory as UCS-2.
Occasionally, articles on Unicode confuse UCS-2 with "UCS-16". UCS-16 does not exist; authors who make this mistake mean either UCS-2 or UTF-16.
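As a rough illustration of the four-byte layout, the following Python sketch packs each code point into one 32-bit word. The names `ucs4_encode` and `ucs4_decode` are hypothetical, and big-endian byte order is assumed for the example.

```python
import struct


def ucs4_encode(text: str) -> bytes:
    """Encode text with exactly four bytes per code point (big-endian assumed)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def ucs4_decode(data: bytes) -> str:
    """Decode a buffer of four-byte code values back into text."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of four bytes")
    # chr() itself rejects values above 0x10FFFF, the end of the UCS.
    return "".join(chr(cp) for (cp,) in struct.iter_unpack(">I", data))


encoded = ucs4_encode("A\U0001F600")
print(encoded.hex())         # 000000410001f600
print(len(encoded))          # 8
```

The same two-character string needs a surrogate pair in UTF-16, whereas here both characters take the same four bytes, at twice the cost of a two-byte encoding for BMP text.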
The characters of this first ISO 10646 standard could be encoded in three ways:
In 1990, two initiatives for a universal character set existed: Unicode, with 16 bits for each character (65,536 possible characters), and ISO 10646. Software companies refused to accept the complexity and size requirements of the ISO standard and managed to convince a number of ISO National Bodies to vote against it. The ISO standardizers realized that they could not keep supporting the standard as it stood and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the restrictions on characters (the prohibition of control-character values), which allowed characters such as 0x0000101F, and the synchronization of the repertoire of the Basic Multilingual Plane with that of Unicode.
Over time, however, the situation changed within the Unicode standard itself: 65,536 characters quickly proved insufficient, and from version 2.0 onward the standard supports the encoding of 1,112,064 characters through the surrogate mechanism of UTF-16. For this reason, ISO 10646 was limited to as many characters as UTF-16 can encode, i.e. a little over one million characters instead of more than 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, with the UTF-16 limitation, under the name UTF-32. As for UTF-1, nobody used it, because of its poor design (no way of distinguishing single bytes, the leading bytes of sequences and the other bytes, a problem similar to that of the Shift-JIS encoding for Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the developers of the Plan 9 operating system, designed a new, fast and well-designed variable-length encoding, which came to be called UTF-8.
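The figure of 1,112,064 and the byte-distinguishing property that UTF-1 lacked can both be checked with a small Python sketch. The helper name `utf8_byte_kind` is hypothetical; the arithmetic assumes the usual layout of 17 planes of 65,536 code points, with the 2,048 surrogate code points excluded.

```python
PLANES = 17                        # planes 0 through 16, up to 0x10FFFF
PER_PLANE = 0x10000                # 65,536 code points per plane
SURROGATES = 0xDFFF - 0xD800 + 1   # 2,048 values reserved for pairing

encodable = PLANES * PER_PLANE - SURROGATES
print(f"{encodable:,}")            # 1,112,064
print(f"{2**31:,}")                # size of the original 31-bit space


def utf8_byte_kind(b: int) -> str:
    """Classify one UTF-8 byte by value alone, with no surrounding context."""
    if b < 0x80:
        return "single"            # a complete one-byte character
    if b <= 0xBF:
        return "continuation"      # never starts a sequence
    if 0xC2 <= b <= 0xF4:
        return "lead"              # first byte of a multi-byte sequence
    return "invalid"               # values that never occur in UTF-8


print([utf8_byte_kind(b) for b in "A\u20ac".encode("utf-8")])
```

Since the three classes occupy disjoint value ranges, a reader dropped into the middle of a UTF-8 stream can find the next character boundary without rescanning from the start.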
Some applications support ISO 10646 characters but do not fully support Unicode. One of these is xterm, which can correctly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single writing direction. It can handle some combining marks by simple overstriking, but cannot display Hebrew (bidirectional), Devanagari (one character to several glyphs) or Arabic (both features). Most GUI applications rely on standard text-drawing libraries that handle such scripts, although the applications themselves still do not always treat them quite correctly. For example, selecting text in certain scripts in Mozilla Firefox makes the text jump around.
See §D.1 of The Unicode Standard for more details.
Official synopsis of ISO/IEC 10646 (preview on www.iec.ch)