The ISO/IEC 10646 standard, entitled Information technology - Universal Multiple-Octet Coded Character Set (UCS), aims to define a universal character encoding system for all languages.

It is closely tied to Unicode.

Description

The international standard ISO/IEC 10646 defines the Universal Character Set (UCS) as an abstract character set. It contains nearly a hundred thousand abstract characters, each identified by an unambiguous name (one in English and one in French) and associated with a positive integer called its code point (or code position).

The characters, letters, digits, symbols, glyphs, ideograms and logograms are drawn from the languages, scripts and traditions of the whole world and are listed in the UCS. Characters from the remaining, less well-known writing systems are also frequently added or updated in the UCS.

Since 1991, the Unicode Consortium has collaborated with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names and code points of Unicode version 2.0 correspond exactly to those of ISO/IEC 10646-1:1993 with its first seven published amendments. Each publication of a new version of Unicode then leads to an update of the standard, i.e. the addition of new characters and the update of those already present. For example, Unicode 3.0, published in February 2000, corresponds to ISO/IEC 10646-1:2000. See the section Relation with Unicode for more details.

The UCS has room for more than 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) were in widespread use before 2000. This situation began to change when the People's Republic of China (PRC) ruled in 2000 that computer systems sold on its territory had to support GB 18030, which required them to handle characters beyond the BMP.

The system deliberately leaves many code points unassigned, even in the BMP. This leaves room for future extensions and minimizes conflicts with other encodings.

Encoding forms of the Universal Character Set

ISO/IEC 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, assigns each character a single code value between 0 and 65,535. Thus exactly two bytes, i.e. one 16-bit word, suffice to represent the value of any such code point. Consequently, UCS-2 gives a binary representation of every code point of the Basic Multilingual Plane (BMP), provided that the code point is assigned, i.e. that it actually represents a character. It also follows that UCS-2 cannot represent code points outside the BMP.
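As an illustration of the two-byte representation described above, here is a minimal Python sketch. The function name `ucs2_encode` is hypothetical, and big-endian byte order is an assumption made for the example; the sketch checks only the numeric range and does not verify that a code point is actually assigned.

```python
def ucs2_encode(code_point: int) -> bytes:
    """Encode one code point as a single 16-bit value (two bytes).

    Assumption: big-endian byte order, chosen only for illustration.
    """
    if not 0 <= code_point <= 0xFFFF:
        # Code points outside the BMP have no UCS-2 representation.
        raise ValueError("UCS-2 cannot represent code points outside the BMP")
    return code_point.to_bytes(2, "big")


latin_a = ucs2_encode(0x0041)    # LATIN CAPITAL LETTER A
euro = ucs2_encode(0x20AC)       # EURO SIGN
```

Calling `ucs2_encode(0x10000)` raises `ValueError`, which mirrors the limitation noted above: the first code point beyond the BMP does not fit in 16 bits.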

The first amendment to the original edition of the UCS defines UTF-16, an extension of UCS-2, to represent code points outside the BMP. The BMP contains a range, located in the special zone (S-zone), whose code points are not assigned to characters. UCS-2 prohibits the use of code values for these code points, but UTF-16 allows them to be used in pairs to designate characters outside the BMP.

This range is subdivided into a high part and a low part, and each pair consists of one "RC" element from each of these zones. An RC element is a two-byte sequence made of an R (row) byte and a C (cell) byte, taken from the four-byte sequence that corresponds to a cell in the coding space of a coded character set.

Unicode also adopted UTF-16, but in Unicode terminology the elements of the high half are called "high surrogates" and those of the low half "low surrogates". The term surrogate code unit is also used for these 16-bit values.
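The pairing mechanism can be sketched in a few lines of Python. The function names are hypothetical. The range boundaries (0xD800 to 0xDBFF for the high half, 0xDC00 to 0xDFFF for the low half) are the standard UTF-16 constants; they are not stated in the text above.

```python
from typing import Tuple

HIGH_START = 0xD800   # first value of the high half of the reserved range
LOW_START = 0xDC00    # first value of the low half of the reserved range


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into a (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000          # a 20-bit value
    high = HIGH_START + (offset >> 10)     # upper 10 bits
    low = LOW_START + (offset & 0x3FF)     # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


pair = to_surrogate_pair(0x1F600)
```

Each half contributes 10 bits, so the pairs cover 1,024 × 1,024 = 1,048,576 code points above the BMP.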

Another encoding, UCS-4, uses a single code value between 0x00 and (theoretically) 0x7FFFFFFF for each character (although the UCS stops at 0x10FFFF, and ISO/IEC 10646 has established that all future character assignments will fall within this range). UCS-4 represents each value with exactly four bytes (one 32-bit word). UCS-4 thus gives a binary representation of every code point of the UCS, including those outside the BMP. As in UCS-2, each encoded character has a fixed size in bytes, which makes it simple to handle, but it requires twice as much memory as UCS-2.

Occasionally, articles about Unicode confuse UCS-2 with "UCS-16". UCS-16 does not exist; authors who make this error mean UCS-2 or UTF-16.
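The fixed four-byte layout and its memory cost can be illustrated with a short Python sketch. The name `ucs4_encode` is hypothetical and big-endian byte order is an assumption made for the example.

```python
def ucs4_encode(code_point: int) -> bytes:
    """Encode one code point as a single 32-bit value (four bytes).

    Assumption: big-endian byte order, chosen only for illustration.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value outside the theoretical UCS-4 range")
    return code_point.to_bytes(4, "big")


text = "A\u20ac"   # two BMP characters: LATIN CAPITAL LETTER A, EURO SIGN
ucs4_size = sum(len(ucs4_encode(ord(ch))) for ch in text)   # four bytes each
ucs2_size = 2 * len(text)                                    # two bytes each
```

For text made only of BMP characters, `ucs4_size` is exactly twice `ucs2_size`, which is the memory trade-off mentioned above.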

History of ISO 10646

The International Organization for Standardization (ISO) began work on the Universal Character Set in 1989 and published a draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. This draft differed markedly from the current standard. It defined 128 groups of 256 planes of 256 rows of 256 cells, for an apparent total of 2,147,483,648 characters, but in fact it could encode only 679,477,248 characters, because its rules prohibited the byte values of the control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) everywhere. The Latin capital letter A, for example, was located in group 0x20, plane 0x20, row 0x20, cell 0x41.
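The two totals quoted above can be checked with a little arithmetic. The Python sketch below assumes that the prohibition applies to each of the four bytes independently (group, plane, row and cell); the variable names are chosen for the example.

```python
# Byte values reserved for control characters: 0x00-0x1F and 0x80-0x9F.
forbidden = set(range(0x00, 0x20)) | set(range(0x80, 0xA0))

# The group byte ranges over 128 values (0x00-0x7F), the others over 256.
allowed_groups = [b for b in range(0x80) if b not in forbidden]
allowed_bytes = [b for b in range(0x100) if b not in forbidden]

apparent_total = 128 * 256 ** 3
usable_total = len(allowed_groups) * len(allowed_bytes) ** 3

# One plane of the draft: allowed rows times allowed cells.
plane_size = len(allowed_bytes) ** 2
```

Under that assumption, 96 group values and 192 values for each remaining byte give 96 × 192³ = 679,477,248 usable positions, and a single plane holds 192 × 192 = 36,864 of them.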

The characters of this first ISO 10646 draft could be encoded in three ways:

  1. UCS-4, four bytes per character, allowing simple encoding of all characters;
  2. UCS-2, two bytes per character, allowing encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 code points; the other planes and groups were reached by switching to them with ISO 2022 escape sequences;
  3. UTF-1, which encodes all characters in byte sequences of variable length (1 to 5 bytes, none of which contains a control character).

In 1990, two initiatives for a universal character set existed: Unicode, with 16 bits per character (65,536 possible characters), and ISO 10646. Software companies refused to accept the complexity and size requirements of the ISO standard and managed to convince a number of ISO National Bodies to vote against it. The ISO standardizers realized that they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitations on characters (the prohibition of control-character values), which made code points such as 0x0000101F available, and the synchronization of the repertoire of the Basic Multilingual Plane with that of Unicode.

Over time, however, the situation changed within the Unicode standard itself: 65,536 characters quickly proved insufficient and, from version 2.0 onward, the standard supports the encoding of 1,112,064 characters through the surrogate mechanism of UTF-16. For this reason, ISO 10646 was limited to as many characters as UTF-16 can encode, that is, slightly more than one million characters instead of more than 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, with the UTF-16 limitation, under the name UTF-32. As for UTF-1, nobody used it, because of its poor design (no way to distinguish single bytes, lead bytes and the other bytes of a sequence, a problem similar to that of the Shift-JIS encoding for Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the developers of the Plan 9 operating system, designed a new, fast and well-designed variable-length encoding, which came to be called UTF-8.
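The figure of 1,112,064 follows from the size of the UTF-16 code space. The Python sketch below reproduces it; the constants (the upper limit 0x10FFFF and the 2,048 values reserved for surrogates, 0xD800 to 0xDFFF) are standard UTF-16 facts rather than figures taken from the text above.

```python
code_space = 0x10FFFF + 1              # all values from 0 to 0x10FFFF
surrogates = 0xDFFF - 0xD800 + 1       # values reserved for pairing

encodable = code_space - surrogates

# The same total, split into BMP characters and paired characters.
bmp_characters = 0x10000 - surrogates
paired_characters = 1024 * 1024        # high values times low values
```

Both routes give the same result: 1,114,112 − 2,048 = 63,488 + 1,048,576 = 1,112,064.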

Differences between ISO 10646 and Unicode

ISO 10646 and Unicode have an identical repertoire and assign the same numbers to the same characters. The difference between the two is that Unicode adds rules and specifications to ISO 10646. ISO 10646 is a simple character table, an extension of earlier standards such as ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts such as Hebrew and Arabic. For interoperability between platforms, especially when bidirectional scripts are used, supporting ISO 10646 is not enough; Unicode must be implemented.

Some applications support ISO 10646 characters but do not fully support Unicode. One of these is Xterm, which can correctly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single writing direction. It can handle some combining marks by simple overstriking, but cannot display Hebrew (bidirectional), Devanagari (one character to several glyphs) or Arabic (both features). Most GUI applications use standard text-drawing libraries that handle such scripts, although the applications themselves still do not always treat them quite correctly. For example, selecting text in certain scripts in Mozilla Firefox makes the text jump around.

Referring to the Universal Character Set

ISO 10646, a general reference to the ISO/IEC 10646 family of standards, is acceptable in most cases. And even though it is a separate standard, the term Unicode is also frequently used, informally, when discussing the UCS. However, a normative reference to the UCS, as in a publication, should cite a particular part and version of one of these two standards in the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.

Relation with Unicode

  • ISO/IEC 10646-1:1993 ≈ Unicode 1.1
  • ISO/IEC 10646-1:2000 ≈ Unicode 3.0
  • ISO/IEC 10646-2:2001 ≈ Unicode 3.2
  • ISO/IEC 10646:2003 ≈ Unicode 4.0
  • ISO/IEC 10646:2003 Amendment 1 ≈ Unicode 4.1

See §D.1 of The Unicode Standard for more details.

External links

  • ISO/IEC JTC1/SC2/WG2, the working group responsible for ISO 10646
  • UTF-8 and Unicode FAQ
  • SIL's freeware fonts, editors and documentation
  • The official abstract of ISO/IEC 10646 (preview on www.iec.ch)
