Friday, June 18, 2010

Understanding Unicode ... a 5 minute deep dive

While Unicode provides a number for each character in our universe, there is also a "Unicode transformation format" that is an encoding for each of those characters. And it is this encoded character format that is actually used on the wild, wild web pages you are looking at now.

So basically, for each character, there is a number, called a Unicode number (officially, a code point), and then there is an encoding of that number into bytes, called the Unicode transformation format for the character. The most common encoding on the web at this time is called utf-8.
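To make the difference between the number and the encoding concrete, here is a minimal Python 3 sketch. It relies only on the built-ins `ord` and `str.encode`; the sample characters (A, é, ệ and an emoji) and the names `samples` and `table` are just my picks for illustration.

```python
# A few sample characters, written as escapes so the source stays ASCII.
samples = ["A", "\u00e9", "\u1ec7", "\U0001f600"]

# Map each Unicode number (code point) to the hex of its UTF-8 bytes.
table = {ord(ch): ch.encode("utf-8").hex() for ch in samples}

for code_point, utf8_hex in table.items():
    print(f"U+{code_point:04X} -> {utf8_hex}")
# U+0041 -> 41
# U+00E9 -> c3a9
# U+1EC7 -> e1bb87
# U+1F600 -> f09f9880
```

Notice that the number and the bytes only look alike for plain ASCII; everything else gets spread over two, three or four bytes.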

Not many people, including myself, can remember all of the details of Unicode, so I thought that this brief explanation could help lots of folks. Web developers might want to understand how a browser actually grabs a character on a web page of the wild, wild web, transforms it from the binary bits that it starts with into a Unicode number, and then figures out which of the fonts at hand has a glyph matching that Unicode value, or vice versa.

So, the following is an example showing the basics of how to convert from the Unicode number of a character to the Unicode Transformation Format using the 8-bit blocks that are used on the web. Stated another way, it shows how to convert from a Unicode number to utf-8.

From a Unicode number to UTF-8

Let's start with the non-English character ệ, whose Unicode number is U+1EC7. I will give you the answer right away, which is "E1BB87" in utf-8 hex. Now, let me show you how to get from U+1EC7 to "E1BB87".

We will use the following table from RFC 3629 (a Request for Comments document), which I will call "table A":

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Step 1.

Determine the number of octets required from the character number and the first column of the table above. So our unicode value of 1EC7 for the character ệ places us on the third row between 0800 and FFFF:
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
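Step 1 can be sketched in a few lines of Python. This is only an illustration of the first column of table A; the function name `utf8_octet_count` is made up for this post, and the sketch does not reject the surrogate range D800-DFFF, which RFC 3629 forbids in UTF-8.

```python
def utf8_octet_count(code_point: int) -> int:
    """Return how many octets UTF-8 needs, following the ranges in table A."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1  # row 1: 0xxxxxxx
    if code_point <= 0x7FF:
        return 2  # row 2: 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3  # row 3: 1110xxxx 10xxxxxx 10xxxxxx
    if code_point <= 0x10FFFF:
        return 4  # row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    raise ValueError("code point is beyond U+10FFFF")


print(utf8_octet_count(0x1EC7))  # 3
```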

And here are some math details describing 1EC7, one hex digit at a time:
1 E C 7 in hex
1 14 12 7 in decimal
1 16 14 7 in octal
0001 1110 1100 0111 in binary
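If you would rather let the computer do the base conversions, the same four rows can be reproduced with a short Python sketch. It splits the value into 4-bit groups and prints each group in every base; the variable names are mine.

```python
code_point = 0x1EC7

# Split the 16-bit value into four 4-bit groups (nibbles), high nibble first.
nibbles = [(code_point >> shift) & 0xF for shift in (12, 8, 4, 0)]

print(" ".join(f"{n:X}" for n in nibbles), "in hex")       # 1 E C 7 in hex
print(" ".join(f"{n:d}" for n in nibbles), "in decimal")   # 1 14 12 7 in decimal
print(" ".join(f"{n:o}" for n in nibbles), "in octal")     # 1 16 14 7 in octal
print(" ".join(f"{n:04b}" for n in nibbles), "in binary")  # 0001 1110 1100 0111 in binary
```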

Step 2.

Prepare the high-order bits in accordance with the second column of table A.

Since 1EC7 is between 0800 and FFFF, the given binary pattern is:
1110xxxx 10xxxxxx 10xxxxxx
The little x(s) mark 4 bits, then an additional 6 bits, and then another 6 bits, for a total of 16 bits or 16 x(s). Since 16 bits is equivalent to 2 bytes, it is understandable why some people refer to this as double byte encoding, even though the final number of bytes in utf-8 for this character will be three.
So the high order bits, the fixed 1s and 0s in front of the x(s), will be:
1110xxxx 10xxxxxx 10xxxxxx

And the low order bits will be:
0001 1110 1100 0111

Filling the x(s) from left to right with those 16 bits, split into groups of 4, 6 and 6:
  1110xxxx 10xxxxxx 10xxxxxx
+     0001   111011   000111
----------------------------
  11100001 10111011 10000111

And then the math details to get us to hex bytes, one 4-bit group at a time:

1110 0001 1011 1011 1000 0111 in binary
16 1 13 13 10 7 in octal
14 1 11 11 8 7 in decimal
E 1 B B 8 7 in hex

Finally, the unicode number U+1EC7 encodes to the three hex bytes E1 BB 87 of utf-8. And these utf-8 bytes are what you will actually find on a web page of the wild, wild web.
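Step 2 boils down to a few shifts and masks. Here is a minimal Python sketch of it, assuming a Unicode number on the third row of table A only. The function name `encode_three_octets` is hypothetical, and the sketch does not reject the surrogate range D800-DFFF.

```python
def encode_three_octets(code_point: int) -> bytes:
    """Encode a code point in the range 0x0800-0xFFFF using row three of table A."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-octet row")
    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([first, second, third])


encoded = encode_three_octets(0x1EC7)
print(" ".join(f"{b:08b}" for b in encoded))  # 11100001 10111011 10000111
print(encoded.hex().upper())                  # E1BB87
```

The OR with the fixed prefix is the "plus" in the sum above: the prefix supplies the high order bits and the shifted, masked value supplies the x(s).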

A handy cross check

If you like perl, you can use "perldoc -f pack" and you will find:
B A bit string (descending bit order inside each byte).
H A hex string (high nybble first).
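So here is a line of perl code that verifies the hex and binary answer found above for the character ệ (assuming your shell and terminal are set to UTF-8, so that echo emits the character as utf-8 bytes):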
$ echo "ệ" | perl -lpe 'print unpack "H*"; $_ = unpack "B*";'
e1bb87
111000011011101110000111
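If perl is not your thing, roughly the same cross check can be sketched in Python 3. The character is written as an escape here so the result does not depend on the terminal encoding; `hex_string` and `bit_string` are just names I chose.

```python
# Encode the character to UTF-8 and show the bytes as hex and as bits.
data = "\u1ec7".encode("utf-8")

hex_string = data.hex()
bit_string = "".join(f"{byte:08b}" for byte in data)

print(hex_string)  # e1bb87
print(bit_string)  # 111000011011101110000111
```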

Conclusion

You can easily see how this algorithm works both backwards and forwards. A browser can use these same steps in reverse to turn the bytes "e1 bb 87" that it pulled off of a web page in the wild back into the Unicode number U+1EC7, and then look for a glyph for that number in the fonts it has at hand.
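The reverse direction can be sketched the same way. This is only an illustration for three-byte sequences, not how any particular browser does it; the function name `decode_three_octets` is hypothetical, and the sketch does not reject overlong or surrogate sequences the way a real decoder must.

```python
def decode_three_octets(data: bytes) -> int:
    """Recover the Unicode number from a three-octet UTF-8 sequence."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    first, second, third = data
    # Check the fixed high order bits: 1110xxxx 10xxxxxx 10xxxxxx.
    if (first & 0xF0) != 0xE0 or (second & 0xC0) != 0x80 or (third & 0xC0) != 0x80:
        raise ValueError("not a three-octet UTF-8 sequence")
    # Strip the prefixes and glue the 4 + 6 + 6 payload bits back together.
    return ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)


code_point = decode_three_octets(bytes.fromhex("e1 bb 87"))
print(f"U+{code_point:04X}")  # U+1EC7
```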

Hope that helps you understand the web a little better. Enjoy.

References

For more details you could start with the following documents:
RFC 3629
Look up characters given a unicode number
Look up unicode numbers given a character
