UTF-8

12/23/2000

UTF-8: An encoding of Unicode that represents each character as a variable-length sequence of one or more bytes. The rationale for UTF-8 is that it allows ASCII characters to be used unaltered: each is a seven-bit value in an eight-bit byte with a leading zero bit.

The encoding is somewhat outlandish. It obeys the following rules.

1. Any byte starting with a 0 bit is a character containing an ASCII value.

2. The first byte of any multibyte character starts with as many one bits as there are bytes in the character, followed by a zero bit. For example, the first byte of a two-byte character starts with 110, and the first byte of a three-byte character starts with 1110.

3. Second and later bytes of multibyte characters start with 10.
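
As an illustration, the role of any byte can be read off its leading bits. The following is a minimal sketch in Python; the function names leading_ones and describe are made up for this example and are not part of any library.

```python
def leading_ones(byte: int) -> int:
    """Count the one bits at the top of an 8-bit value, up to the first zero bit."""
    count = 0
    mask = 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def describe(byte: int) -> str:
    """Say what role a byte plays, judging only by its leading bits."""
    n = leading_ones(byte)
    if n == 0:
        return "single-byte (ASCII) character"
    if n == 1:
        return "continuation byte"
    if n > 6:
        # 0xFE and 0xFF match none of the bit patterns and never appear.
        return "not valid in UTF-8"
    return f"first byte of a {n}-byte character"


for value in (0x41, 0xC3, 0xA9, 0xE2):
    print(f"{value:02X} {value:08b} {describe(value)}")
```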

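Thus, single-byte characters are in the range 00-7F and represent the corresponding ASCII characters. Two-byte characters occupy the range C280-DFBF and correspond to the Unicode characters 080-7FF, which fit in the 11 data bits available (five in the first byte, six in the second). First bytes C0 and C1 are never used, because they could only encode values that already fit in a single byte.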
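
To make the bit packing concrete, here is a rough sketch of an encoder in Python. The name encode_code_point is hypothetical, the sketch stops at four bytes, and it does not screen out the surrogate values D800-DFFF, which a real encoder would refuse. Python's built-in str.encode is used only as a cross-check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative helper only)."""
    if cp < 0x80:
        # 7 data bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 11 data bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 data bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        # 21 data bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# Cross-check a few values against Python's own encoder.
for cp in (0x41, 0x80, 0xE9, 0x7FF, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
    print(f"{cp:06X} -> {encode_code_point(cp).hex().upper()}")
```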

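4. As originally designed, UTF-8 characters may be up to 6 bytes long, carrying 31 data bits. The current standard (RFC 3629) limits characters to 4 bytes, which is enough for every Unicode value up to 10FFFF.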

5. In principle capital A (hexadecimal 41, decimal 65) could be represented as 41, C181, E08181, etc. However, in the interest of maintaining programmer sanity when doing comparisons, etc., only the shortest form is actually permitted.
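
The shortest-form rule can be sketched as a check in a decoder. The Python below is an assumption-laden illustration, not a reference implementation: decode_one, MIN_VALUE and the strict flag are invented for this example, and it handles a single character of at most four bytes. With strict turned off it shows that the longer forms really would decode to capital A if they were allowed.

```python
# Smallest value that genuinely needs a sequence of each length.
MIN_VALUE = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(data: bytes, strict: bool = True) -> int:
    """Decode exactly one character; reject overlong forms when strict."""
    first = data[0]
    if first < 0x80:
        length, value = 1, first
    elif (first & 0xE0) == 0xC0:
        length, value = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        length, value = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for b in data[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        # Each continuation byte contributes six data bits.
        value = (value << 6) | (b & 0x3F)
    if strict and value < MIN_VALUE[length]:
        raise ValueError("overlong form")
    return value


# All three byte strings carry the value 41 hex, but only the first is legal.
for raw in (b"\x41", b"\xc1\x81", b"\xe0\x81\x81"):
    print(raw.hex().upper(), hex(decode_one(raw, strict=False)))
```

Python's built-in decoder applies the same rule: bytes.fromhex("C181").decode("utf-8") raises UnicodeDecodeError.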

Copyright 1994-2002 by Donald Kenney.