UTF-8

12/23/2000

UTF-8: An encoding of Unicode that represents each character as a variable-length sequence of bytes. The rationale for UTF-8 is that it allows ASCII characters to be used unaltered, as seven-bit values in an eight-bit byte with a leading zero bit.

The encoding is somewhat outlandish. It obeys the following rules.

1. Any byte starting with a 0 bit is a complete single-byte character holding an ASCII value.

2. The first byte of any multibyte character starts with as many one bits as there are bytes in the code, followed by a zero bit. For example, the first byte of a two-byte code starts with 110, and the first byte of a three-byte code starts with 1110.

3. Second and later bytes of multibyte characters start with 10.

Thus, single-byte characters are in the range 00-7F and represent the corresponding ASCII characters. Two-byte characters consist of a first byte of the form 110xxxxx and a second byte of the form 10xxxxxx, so they carry 11 data bits and correspond to the Unicode characters 080-7FF. Because only the shortest form is permitted, they occupy the range C280-DFBF.
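The bit layout can be made concrete with a short Python sketch. It is an illustration only, not a reference implementation: the function name utf8_encode is hypothetical, it assumes the modern four-byte limit (code points up to 10FFFF), it assumes lead bytes of the form 110xxxxx, 1110xxxx, and 11110xxx for two-, three-, and four-byte codes, and it does not check for surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")

    # A leading 0 bit marks a single byte holding the ASCII value itself.
    if code_point < 0x80:
        return bytes([code_point])

    # Pick the shortest length that fits, and the matching lead-byte prefix.
    if code_point < 0x800:        # fits in 11 data bits
        length, lead = 2, 0xC0    # 110xxxxx
    elif code_point < 0x10000:    # fits in 16 data bits
        length, lead = 3, 0xE0    # 1110xxxx
    else:                         # fits in 21 data bits
        length, lead = 4, 0xF0    # 11110xxx

    # Peel off six data bits at a time, lowest bits first, into
    # continuation bytes of the form 10xxxxxx.
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # Whatever bits remain go into the lead byte.
    out.append(lead | code_point)
    return bytes(reversed(out))


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"{cp:06X} -> {utf8_encode(cp).hex()}")
```

The results can be compared against Python's built-in encoder, for example utf8_encode(0x20AC) == "€".encode("utf-8").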

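4. As originally defined, UTF-8 characters may be up to 6 bytes long, which gives 31 data bits. The current standard (RFC 3629) limits them to 4 bytes, which is enough for the Unicode range up to 10FFFF.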

5. In principle, capital A (hex 41) could be represented as 41, C181, E08181, etc. However, in the interest of maintaining programmer sanity when doing comparisons, etc., only the shortest form is actually permitted.
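The shortest-form rule can be observed with Python's built-in strict UTF-8 decoder. Capital A is hex 41; padded out with leading zero data bits, its two-byte and three-byte forms would be C1 81 and E0 81 81. The sketch below (the helper name is_rejected is hypothetical) feeds all three to the decoder, which is expected to accept only the single-byte form.

```python
def is_rejected(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # Shortest form of "A", then two overlong (padded) forms of it.
    for raw in (b"\x41", b"\xc1\x81", b"\xe0\x81\x81"):
        verdict = "rejected" if is_rejected(raw) else "accepted"
        print(raw.hex(), verdict)
```

Rejecting the longer forms means that two strings containing the same characters always have the same bytes, so a plain byte comparison is enough.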