User:Hillgentleman1/p1

What is Cantonese?

Description

There are several current, slightly different definitions of UTF-8 in various standards documents:

RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
ISO/IEC 10646-1:2000 Annex D (2000)

They supersede the definitions given in the following obsolete works:

ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
The Unicode Standard, Version 2.0, Appendix A (1996)
RFC 2044 (1996)
RFC 2279 (1998)
The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
Unicode Standard Annex #27: Unicode 3.1 (2001)

They all share the same general mechanics; the main differences concern the allowed range of code point values and the safe handling of invalid input.

The bits of a Unicode code point are split into several groups, which are then distributed among the lower bit positions of the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In all other cases, two to four bytes are required. The most significant bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.

Code range (hexadecimal)	Scalar value (binary)	UTF-8 (binary / hexadecimal)	Notes
000000–00007F (128 codes)	00000000 00000000 0zzzzzzz	0zzzzzzz (00–7F)	ASCII equivalence range; byte begins with 0.
	seven z	seven z
000080–0007FF (1920 codes)	00000000 00000yyy yyzzzzzz	110yyyyy (C2–DF) 10zzzzzz (80–BF)	First byte begins with 110; the following byte begins with 10.
	three y; two y, six z	five y; six z
000800–00FFFF (63488 codes)	00000000 xxxxyyyy yyzzzzzz	1110xxxx (E0–EF) 10yyyyyy 10zzzzzz	First byte begins with 1110; the following bytes begin with 10.
	four x, four y; two y, six z	four x; six y; six z
010000–10FFFF (1048576 codes)	000wwwxx xxxxyyyy yyzzzzzz	11110www (F0–F4) 10xxxxxx 10yyyyyy 10zzzzzz	First byte begins with 11110; the following bytes begin with 10.
	three w, two x; four x, four y; two y, six z	three w; six x; six y; six z
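The four rows of the table above can be sketched as a small Python function. This is only an illustration of the bit layout: the name `utf8_encode` is made up for this sketch, and it does not special-case any reserved code points inside the ranges.

```python
def utf8_encode(cp):
    """Encode one code point (an int) into UTF-8 bytes, one branch per table row.

    Hypothetical helper for illustration; it follows the table literally
    and does not treat any code point inside the ranges specially.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if cp < 0x80:
        # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x05D0).hex())  # d790
```

For ordinary characters the result should agree with Python's built-in `chr(cp).encode("utf-8")`.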

For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:

It falls into the range of U+0080 to U+07FF. The table shows it will be encoded using two bytes, 110yyyyy 10zzzzzz.
Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
The eleven bits are put in their order into the positions marked by "y"-s and "z"-s: 11010111 10010000.
The final result is the two bytes, more conveniently expressed as the two hexadecimal bytes 0xD7 0x90. That is the encoding of the character aleph (א) in UTF-8.
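The same steps can be traced with bit strings in Python. This is a sketch for the two-byte case only, and `encode_two_byte` is a name invented here for illustration.

```python
def encode_two_byte(cp):
    """Follow the worked steps for a code point in U+0080..U+07FF.

    Returns the 11-bit string and the two resulting bytes.
    Hypothetical helper, not a general encoder.
    """
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("not in the two-byte range")
    bits = format(cp, "011b")      # eleven bits, zero-padded on the left
    y, z = bits[:5], bits[5:]      # five y bits, six z bits
    first = int("110" + y, 2)      # 110yyyyy
    second = int("10" + z, 2)      # 10zzzzzz
    return bits, bytes([first, second])


bits, encoded = encode_two_byte(0x05D0)
print(bits, encoded.hex())  # 10111010000 d790
```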

So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
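As a rough check of these byte counts, the thresholds can be written out directly. The function name `utf8_length` and the sample characters are chosen here for illustration only.

```python
def utf8_length(cp):
    """Number of UTF-8 bytes for a code point, using the range boundaries above."""
    if cp < 0x80:
        return 1        # US-ASCII
    if cp < 0x800:
        return 2        # the next 1920 characters
    if cp < 0x10000:
        return 3        # rest of the BMP
    return 4            # beyond the BMP


# Sample characters (assumed examples): Latin A, e-acute, Greek omega,
# Hebrew aleph, a CJK ideograph, and a character outside the BMP.
for cp in (0x41, 0xE9, 0x3A9, 0x5D0, 0x66F8, 0x1F600):
    print("U+%04X" % cp, utf8_length(cp))
```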

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the universal character set). However, in November 2003 RFC 3629 restricted UTF-8 to the area covered by the formal Unicode definition, U+0000 to U+10FFFF. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:

Codes (binary)	Codes (hexadecimal)	Notes
1100000x	C0, C1	Overlong encoding: lead byte of a 2-byte sequence, but code point <= 127
1111111x	FE, FF	Invalid: lead byte of a 7- or 8-byte sequence
111110xx, 1111110x	F8, F9, FA, FB, FC, FD	Restricted by RFC 3629: lead byte of a 5- or 6-byte sequence
11110101, 1111011x	F5, F6, F7	Restricted by RFC 3629: lead byte of a code point above 10FFFF

While the last two categories were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent so they should never have appeared in actual text.
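The byte values listed in the table can be collected into a set and used as a quick first screen on input. This is a minimal sketch with invented names (`ILLEGAL_BYTES`, `find_illegal_bytes`); passing this check does not by itself make a byte string valid UTF-8.

```python
# C0, C1 (overlong lead bytes) and F5..FF (lead bytes excluded by RFC 3629
# or never valid), as listed in the table above.
ILLEGAL_BYTES = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def find_illegal_bytes(data):
    """Return the offsets of bytes that can never occur in legal UTF-8."""
    return [i for i, b in enumerate(data) if b in ILLEGAL_BYTES]


print(find_illegal_bytes(b"abc\xc0\x80"))  # [3]
```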

The design of the algorithm has some similarities with Huffman coding.

Code point (hex) ---> UTF-8 bytes (hex), as used in urlencode

0-7F => 00-7F
80-BF => C280-C2BF
C0-CF => C380-C38F
D0-DF => C390-C39F
F0-FF => C3B0-C3BF
100 => C480
140 => C580
180 => C680
1C0 => C780
200 => C880
300 => CC80
400 => D080
500 => D480
7FF => DFBF
800 => E0A080
900 => E0A480
C00 => E0B080
1000 => E18080
1100 => E18480
1400 => E19080
1800 => E1A080
1C00 => E1B080
2000 => E28080
3000 => E38080
4000 => E48080
5000 => E58080
8000 => E88080
C000 => EC8080
F000 => EF8080
FFFF => EFBFBF
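Entries like the ones above can be regenerated with Python's standard library. The helper names `utf8_hex` and `percent_encode` are invented for this sketch; `urllib.parse.quote` encodes as UTF-8 by default and leaves unreserved ASCII characters unescaped.

```python
from urllib.parse import quote


def utf8_hex(cp):
    """UTF-8 bytes of a code point as an uppercase hex string, e.g. 'E0A080'."""
    return chr(cp).encode("utf-8").hex().upper()


def percent_encode(cp):
    """URL-encoded form of a code point, e.g. '%E1%80%80'."""
    return quote(chr(cp), safe="")


for cp in (0x80, 0x7FF, 0x800, 0x1000, 0xFFFF):
    print("%X => %s  %s" % (cp, utf8_hex(cp), percent_encode(cp)))
```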

utf 1000 --> e18080 --> {{urldecode:%e1%80%80}}

utf 1001 --> e18081 --> {{urldecode:%e1%80%81}}

utf 1010 --> e18090 --> {{urldecode:%e1%80%90}}

utf 1100-->

0001 --> 01

0079 --> 79

0080 --> C280

0100 --> C480

0400 --> D080

0799 --> DE99

0800 --> E0A080

0C00 --> E0B080 ... %E6%9B%B8
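Going the other way, a percent-encoded string can be decoded back to code points in Python. This is a sketch using `urllib.parse.unquote`, which decodes as UTF-8 by default; the helper name `urldecode_to_codepoints` is made up here, and it is only a rough analogue of the wiki `{{urldecode:}}` calls in these notes.

```python
from urllib.parse import unquote


def urldecode_to_codepoints(s):
    """Decode a percent-encoded UTF-8 string and list its code points."""
    text = unquote(s)  # percent-decoding, then UTF-8 decoding
    return ["U+%04X" % ord(ch) for ch in text]


print(urldecode_to_codepoints("%E6%9B%B8"))  # ['U+66F8']
print(urldecode_to_codepoints("%e1%80%80"))  # ['U+1000']
```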