User:Hillgentleman1/p1
What is Cantonese?
Description
There are several current, slightly different definitions of UTF-8 in various standards documents. They supersede the definitions given in a number of older, now obsolete works.
They are all the same in their general mechanics with the main differences being on issues such as allowed range of code point values and safe handling of invalid input. The bits of a Unicode character are divided into several groups which are then divided among the lower bit positions inside the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, up to four bytes are required. The most significant bit of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.
For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:
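The bit-splitting described above can be sketched in a few lines of Python. This is only an illustration: the function name `utf8_encode` is made up for this note, and the range checks assume the RFC 3629 limits (U+0000 to U+10FFFF, no surrogates).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (illustrative helper, not a library API)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point under RFC 3629")
    if code_point < 0x80:
        # 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Aleph, U+05D0 = binary 101 1101 0000, split as 10111 | 010000:
#   first byte  110 10111 = 0xD7
#   second byte 10 010000 = 0x90
print(utf8_encode(0x05D0).hex())  # d790
```

The result agrees with Python's built-in encoder, `"א".encode("utf-8")`.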
So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes. By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:
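One way to cross-check which byte values are excluded is brute force: encode every code point that RFC 3629 permits and note which of the 256 byte values never turn up. The sketch below relies on Python's built-in encoder; the function name `unused_utf8_bytes` is invented for this note, and the loop takes a moment because it visits over a million code points.

```python
def unused_utf8_bytes() -> list:
    """Return the byte values that occur in no valid UTF-8 sequence."""
    seen = set()
    for code_point in range(0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        seen.update(chr(code_point).encode("utf-8"))
    return [b for b in range(256) if b not in seen]


unused = unused_utf8_bytes()
print([f"{b:02X}" for b in unused])
# ['C0', 'C1', 'F5', 'F6', 'F7', 'F8', 'F9', 'FA', 'FB', 'FC', 'FD', 'FE', 'FF']
```

C0 and C1 could only start overlong two-byte encodings of ASCII characters, F5 to F7 would start four-byte sequences above U+10FFFF, and F8 to FF belong to the longer sequences of the original specification or were never used at all.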
While the last two categories were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent so they should never have appeared in actual text. The design of the algorithm has some similarities with Huffman coding.
Unicode code point (hex) => UTF-8 bytes (hex), as used in URL percent-encoding
- 0-7F => 00-7F
- 80-BF => C280-C2BF
- C0-CF => C380-C38F
- D0-DF => C390-C39F
- E0-EF => C3A0-C3AF
- F0-FF => C3B0-C3BF
- 100 => C480
- 140 => C580
- 180 => C680
- 1C0 => C780
- 200 => C880
- 300 => CC80
- 400 => D080
- 500 => D480
- 7FF => DFBF
- 800 => E0A080
- 900 => E0A480
- C00 => E0B080
- 1000 => E18080
- 1100 => E18480
- 1400 => E19080
- 1800 => E1A080
- 1C00 => E1B080
- 2000 => E28080
- 3000 => E38080
- 4000 => E48080
- 5000 => E58080
- 8000 => E88080
- C000 => EC8080
- F000 => EF8080
- FFFF => EFBFBF
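The entries in this list can be checked against Python's built-in UTF-8 encoder. This is a minimal sketch; `utf8_hex` is a helper name made up for this note, and the sample values are copied from the list above.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point as uppercase hex digits."""
    return chr(code_point).encode("utf-8").hex().upper()


# A few rows from the list, keyed by code point.
samples = {
    0x79: "79",
    0x80: "C280",
    0xFF: "C3BF",
    0x7FF: "DFBF",
    0x800: "E0A080",
    0x1100: "E18480",
    0xFFFF: "EFBFBF",
}

for code_point, expected in samples.items():
    assert utf8_hex(code_point) == expected, hex(code_point)
    print(f"{code_point:04X} => {utf8_hex(code_point)}")
```

The jumps in the list (C2 to C3 at C0, E0 to E1 at 1000, and so on) fall where the bits above the low 6 or 12 bits change, because each continuation byte carries exactly 6 bits.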
utf 1000 --> e18080 --> {{urldecode:%e1%80%80}}
utf 1001 --> e18081 --> {{urldecode:%e1%80%81}}
utf 1010 --> e18090 --> {{urldecode:%e1%80%90}}
utf 1100-->
0001 --> 01
0079 --> 79
0080 --> C280
0100 --> C480
0400 --> D080
0799 --> DE99
0800 --> E0A080
0C00 --> E0B080 ... %E6%9B%B8
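Going from a code point to its percent-encoded form, and back again, can be tried out with Python's standard `urllib.parse` module. This is a sketch of the round trip only; it assumes the percent-encoded text is UTF-8, which is the default for both `quote` and `unquote`.

```python
from urllib.parse import quote, unquote

# Code point -> percent-encoded UTF-8 bytes.
print(quote(chr(0x1000)))  # %E1%80%80

# Percent-encoded UTF-8 bytes -> character -> code point.
# Hex digits in the escape may be upper or lower case.
char = unquote("%E6%9B%B8")
print(char, f"U+{ord(char):04X}")  # 書 U+66F8
```

So the last entry in the notes, %E6%9B%B8, is the three-byte encoding of U+66F8.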