Monday, August 16, 2010

Unicode Normalization Forms: Interesting issue with Google Translation services and APIs

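Drawing on my limited knowledge of the ocean of information that is the mashup of the Basic Multilingual Plane (BMP) and the Google AJAX Language API, it looks like the Gtranslate callbacks return inconsistent portions of the BMP for Vietnamese: the same accented letter sometimes comes back as a single precomposed character and sometimes as a base letter followed by a combining mark.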

For example, "Hello girl" vs. "Hello grandma" translated into Vietnamese (each response dumped with od -bc):
0000000 042 103 150 141 315 200 157 040 143 303 264 042 012
          "   C   h   a 315 200   o       c 303 264   "  \n

0000000 042 130 151 156 040 143 150 303 240 157 040 142 303 240 042 012
          "   X   i   n       c   h 303 240   o       b 303 240   "  \n
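At first glance, it looks to me like the first one comes back decomposed and the second one comes back as NFC. It is great that the translator is more polite with Grandma, hence the extra word of deference "Xin", but that is not the difference I am noting here. Rather, it is the "à" in the word "chào" that comes back two different ways: as "a" followed by the bytes 315 200 in the first response, and as the single precomposed character 303 240 (U+00E0) in the second. Strictly speaking, the first response is not in any normalization form, NFKD included. The bytes 315 200 (0xCD 0x80) decode to U+0340 COMBINING GRAVE TONE MARK, which is canonically equivalent to U+0300 COMBINING GRAVE ACCENT and gets replaced by it under every normalization form, and the "ô" in "cô" (303 264, U+00F4) is precomposed in the same string.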
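To double-check a reading like this, the octal bytes from the two dumps above can be decoded and tested against each normalization form. The following is only a minimal sketch using Python's standard unicodedata module; the helper names decode_octal and forms_matched are my own, and the byte lists are copied by hand from the dumps with the surrounding quotes and newline left out.

```python
import unicodedata

# Octal byte values copied from the two od dumps (quotes and newline omitted).
GIRL_OCTAL = "103 150 141 315 200 157 040 143 303 264"
GRANDMA_OCTAL = "130 151 156 040 143 150 303 240 157 040 142 303 240"


def decode_octal(dump):
    """Turn space-separated octal byte values into a str, assuming UTF-8."""
    return bytes(int(tok, 8) for tok in dump.split()).decode("utf-8")


def forms_matched(text):
    """Return the normalization forms that `text` already satisfies."""
    return [form for form in ("NFC", "NFD", "NFKC", "NFKD")
            if unicodedata.normalize(form, text) == text]


girl = decode_octal(GIRL_OCTAL)
grandma = decode_octal(GRANDMA_OCTAL)

for label, text in (("girl", girl), ("grandma", grandma)):
    # List every code point so combining marks are visible as separate entries.
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    print(label, code_points, forms_matched(text))
```

The code-point listing makes it easy to see which marks are combining and which letters are precomposed, and the last column shows which forms, if any, each response already satisfies.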

Maybe there is some option somewhere in Google's APIs that allows you to specify the Unicode normalization form (or the range of Unicode characters) you want the translation to come back in. If anyone knows this, please comment.
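Failing such an option, one workaround is to normalize whatever comes back before comparing or displaying it. This is just a sketch under my own assumptions, not anything Google's API provides: the helper name to_nfc is hypothetical, and the two strings are the responses from the dumps in this post written with explicit escapes.

```python
import unicodedata


def to_nfc(text):
    """Normalize text to NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# The two responses from this post, written with explicit code points.
girl = "Cha\u0340o c\u00f4"          # "a" followed by a combining tone mark
grandma = "Xin ch\u00e0o b\u00e0"    # precomposed "à"

# Without normalization the two spellings of "chào" differ.
raw_match = girl.split()[0].lower() == grandma.split()[1]

# After NFC both collapse to the same precomposed sequence.
nfc_match = to_nfc(girl).split()[0].lower() == to_nfc(grandma).split()[1]

print(raw_match, nfc_match)  # False True
```

Normalizing on the client side does not explain why the service is inconsistent, but it does make string comparison and lookup behave predictably.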

Note: Here are two command lines I used to test this:
$ wget -O - "" 2>>/dev/null | cut -d: -f2 | cut -d"}" -f1 | od -bc

$ wget -O - "" 2>>/dev/null | cut -d: -f2 | cut -d"}" -f1 | od -bc
It looks like this issue is happening in IE7, Firefox, and Android browsers as of this post, but it is not happening on the BlackBerry browser here on my Bold 9700. It looks like the BlackBerry OS still handles decomposed forms as well as NFC. I wonder how it works on the iPhone, but I will not lose any sleep over not knowing.
