UNICODE

12/16/2000

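Unicode: A character set whose code points run from U+0000 to U+10FFFF (21 bits); the underlying ISO 10646 standard originally defined a 31 bit code space. It is generally used either in the variable-length UTF-8 encoding, which is built from 8 bit units, or restricted to the 16 bit Basic Multilingual Plane (BMP, Plane 0). Unicode assigns the same characters and code points as ISO 10646-1. The BMP includes characters representing all common languages plus common typographical and mathematical symbols. A few specialized character sets -- e.g. Egyptian hieroglyphics -- are relegated to the supplementary planes beyond the BMP, which need up to 21 bits.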
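As a rough illustration of how code points relate to their encoded sizes, the following Python sketch reports, for a single character, its code point, whether it fits in the 16 bit BMP, and how many 8 bit units UTF-8 and 16 bit units UTF-16 need for it. The function name describe is hypothetical and chosen only for this example; the sample hieroglyph U+13000 is an assumption picked as a convenient character outside the BMP.

```python
def describe(ch: str) -> dict:
    """Summarize one character: code point, BMP membership, encoded sizes."""
    cp = ord(ch)  # the character's Unicode code point as an integer
    return {
        "code_point": f"U+{cp:04X}",
        "in_bmp": cp <= 0xFFFF,                           # fits in 16 bits (Plane 0)
        "utf8_bytes": len(ch.encode("utf-8")),            # 1 to 4 bytes
        "utf16_units": len(ch.encode("utf-16-be")) // 2,  # 1 unit, or 2 (surrogate pair)
    }


if __name__ == "__main__":
    # ASCII letter, accented Latin letter, euro sign, Egyptian hieroglyph
    for ch in ["A", "\u00e9", "\u20ac", "\U00013000"]:
        print(describe(ch))
```

Characters in the ASCII range take one byte in UTF-8, which is why plain ASCII text is already valid UTF-8.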

Unicode is a specific implementation of the Universal Character Set (UCS), the character set defined by ISO 10646.

Characters (00)00-7F correspond to ASCII; 00-FF correspond to Latin-1 (ISO 8859-1, not the DOS Code Page 850, which arranges its upper 128 characters differently). Characters E000-F8FF are set aside for private character sets.
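The ranges above can be checked with a short Python sketch. The helper names classify and latin1_matches_unicode are hypothetical and exist only for this illustration; the codec names "latin-1" and "cp850" are the ones shipped with Python's standard library.

```python
def classify(cp: int) -> str:
    """Name the range a code point falls in, using the ranges given above."""
    if cp <= 0x7F:
        return "ASCII"
    if cp <= 0xFF:
        return "Latin-1"
    if 0xE000 <= cp <= 0xF8FF:
        return "private use"
    return "other"


def latin1_matches_unicode() -> bool:
    """True if every Latin-1 byte value decodes to the code point of the same value."""
    decoded = bytes(range(256)).decode("latin-1")
    return all(ord(ch) == i for i, ch in enumerate(decoded))


if __name__ == "__main__":
    print(classify(0x41), classify(0xE9), classify(0xE000), classify(0x4E2D))
    print(latin1_matches_unicode())
    # The same letter sits at different byte values in the two 8 bit sets.
    print("\u00e9".encode("latin-1"), "\u00e9".encode("cp850"))
```

The last line shows why Latin-1 and Code Page 850 should not be confused: the byte for an accented letter differs between them, and only the Latin-1 value equals the Unicode code point.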

Although commonly used accented letters have their own precomposed code points, there are provisions to add accents and similar markings to any character by following it with combining characters. Chinese characters are not built this way; each one has its own code point. The Universal Character Set can be implemented without allowing combining characters. Unicode, however, requires that the UCS be implemented with support for combining characters.
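A minimal Python sketch of combining characters, using the standard unicodedata module. Normalization (NFC and NFD) is not discussed in the text above; it is brought in here only as an assumption about how one would compare a precomposed letter with its base-plus-accent form. The helper name same_text is hypothetical.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # e with acute as a single code point
COMBINED = "e\u0301"     # plain e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after composing base letters with their accents (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(PRECOMPOSED == COMBINED)            # raw code point sequences differ
    print(same_text(PRECOMPOSED, COMBINED))   # equal once both are normalized
    # A nonzero combining class marks U+0301 as a combining character.
    print(unicodedata.combining("\u0301"))
    # An accent can follow any base letter, even where no precomposed form
    # exists; here NFC leaves two code points.
    print(len(unicodedata.normalize("NFC", "x\u0301")))
```

Software that compares or searches text usually normalizes first, since both forms display the same but differ as raw sequences.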

Refer to: UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn, http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs

Copyright 1994-2002 by Donald Kenney.