The encoding is somewhat outlandish. It follows these rules.
1. Any byte starting with a 0 bit is a character containing an ASCII value.
2. The first byte of any multibyte character starts with as many one bits as there are bytes in the code, followed by a zero bit. For example, the first byte of a two byte code will start with 110 and the first byte of a three byte code will start with 1110.
3. Second and later bytes of multibyte characters start with 10.
Thus, single byte characters are in the range 00-7F and represent the corresponding ASCII characters. Double byte characters have the form 110xxxxx 10xxxxxx, occupy the range C280-DFBF, and correspond to the Unicode characters 080-7FF, which can be fully expressed in 11 data bits.
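4. UTF-8 characters were originally allowed to be up to 6 bytes = 31 data bits. Since RFC 3629 (2003) they are limited to 4 bytes = 21 data bits, which is enough to cover the Unicode range 0-10FFFF.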
5. In principle Capital A (hex 41) could be represented as 41, C181, E08181, etc. However, in the interest of maintaining programmer sanity when doing comparisons, etc., only the shortest form is actually permitted.
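The rules above can be sketched in a few lines of Python. This is only an illustration, assuming the modern 4 byte limit; the names utf8_encode and decodes_ok are made up for this example and are not part of any library. The first function builds the bytes by hand from the bit patterns, and the second uses Python's built-in strict decoder to show that longer-than-necessary forms are refused.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by hand (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 0xxxxxxx: 7 data bits, plain ASCII
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 11 data bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 data bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 data bits
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decodes_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Capital A in its shortest form, then in two overlong forms.
print(utf8_encode(0x41).hex())          # shortest form of 'A'
print(decodes_ok(bytes([0x41])))        # accepted
print(decodes_ok(bytes([0xC1, 0x81])))  # overlong two byte form
print(decodes_ok(bytes([0xE0, 0x81, 0x81])))  # overlong three byte form
```

Because the encoder picks the length from the size of the value, it can only ever produce the shortest form; the overlong forms have to be constructed by hand, and a conforming decoder rejects them.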
Copyright 1994-2008 by Donald Kenney.