Thursday, October 30, 2008

Vietnamese Unicode and the BlackBerry

NOTE: The issue described below was occurring while using BlackBerry OS version 4.0 and 4.1. The newer 5.0 version of the BlackBerry OS does not exhibit these issues.

Apparently the BlackBerry does a good job of rendering decomposed Unicode characters (a base letter followed by separate combining marks) as readable text. It does not, however, appear to be able to render precomposed characters, which is what most pages on the internet use.

So to make a long story short, if you copy some Vietnamese text from a web page and paste it into an email, that email will not render very well on a BlackBerry.

Well, here is a hack that may help someone out: decompose the characters, and send the decomposed characters in the email. If you do that, the email will likely render "fine" in a normal email client as well as in the BlackBerry email client. I call this a hack because the W3C generally recommends exchanging text in NFC, yet most BlackBerry email clients will not render all precomposed (NFC) characters.
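To make "decompose the characters" concrete, here is a minimal sketch using the standard java.text.Normalizer class. The class name DecomposeDemo and its helper methods are made up for illustration, and the sample letter is just one Vietnamese character chosen for the example. It prints the code points before and after decomposition.

```java
import java.text.Normalizer;

public class DecomposeDemo {
    // Decompose precomposed (NFC) text into base letters plus combining marks (NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // Render each code point as U+XXXX so the otherwise invisible difference shows up.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The letter a with breve and acute (as in "lắm"), stored as one precomposed code point.
        String precomposed = "\u1EAF";
        System.out.println(codePoints(precomposed));            // U+1EAF
        System.out.println(codePoints(decompose(precomposed))); // U+0061 U+0306 U+0301
    }
}
```

The single precomposed code point becomes a plain "a" followed by a combining breve and a combining acute accent, which is the kind of sequence the BlackBerry seemed to handle.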

For example, if you are NOT using a Windows box, try copying and pasting the following into an email:
Subject: NFC (like most of the web)

Chúa yêu em lòng em vui thay
Kia Kinh Thánh đã tỏ cho hay
Các con thơ thuộc Jê-sus đây
Chúng yếu nhưng Ngài khỏe mạnh hoài

Jê-sus yêu em lắm
Phải em được Chúa yêu
Jê-sus yêu em lắm
Chính trong lời Chúa dạy nhiều
You'll find that emails sent with these characters render fine in a normal desktop email client, or even a web email client like Gmail, but they do not render correctly in the BlackBerry email client.

Now, try the same with this "decomposed" text:
Subject: NFKD (a decomposed form)

Chúa yêu em lòng em vui thay
Kia Kinh Thánh đã tỏ cho hay
Các con thơ thuộc Jê-sus đây
Chúng yếu nhưng Ngài khỏe mạnh hoài

Jê-sus yêu em lắm
Phải em được Chúa yêu
Jê-sus yêu em lắm
Chính trong lời Chúa dạy nhiều
While the two texts may look similar on this web page, they are different, trust me. You'll find that these characters render well in Gmail, Outlook, and Evolution, and also render well on the BlackBerry.
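If you would rather not take my word for it, a check along these lines can show that two look-alike strings differ underneath. This is only a sketch: the class name FormCheck and its methods are invented for the example, and the two sample strings are the first words of the song written out with explicit escapes so that copy and paste cannot silently change them.

```java
import java.text.Normalizer;

public class FormCheck {
    // True when the text is already in NFC, i.e. uses precomposed characters where they exist.
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    // True when two strings represent the same text once both are normalized to NFC.
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "Ch\u00FAa y\u00EAu em";       // u-acute and e-circumflex as single code points
        String decomposed = "Chu\u0301a ye\u0302u em";      // plain u and e followed by combining marks

        System.out.println(precomposed.equals(decomposed));   // false
        System.out.println(isNfc(precomposed));               // true
        System.out.println(isNfc(decomposed));                // false
        System.out.println(sameText(precomposed, decomposed)); // true
    }
}
```

So the two versions are different sequences of code points that stand for the same text, which is why they look alike on screen.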

I'm not sure why I feel like including a small Java program I wrote to help folks create emails that render better on the BlackBerry, but here it is:

    import java.text.Normalizer;
    import java.util.Scanner;

    public class d {
        public static void main(String[] args) {
            // Read all of standard input using the platform default charset.
            // The "\\A" delimiter matches only at the start of the input,
            // so the scanner hands back everything as a single token.
            Scanner sc = new Scanner(System.in);
            sc.useDelimiter("\\A");
            String input = sc.hasNext() ? sc.next() : "";

            // NFKD: compatibility decomposition (base letters plus combining marks).
            Normalizer.Form nf = Normalizer.Form.NFKD;
            System.out.println(nf + " Compatibility Decomposed:\n" + Normalizer.normalize(input, nf));
        }
    }
This program would be used as follows from a command line:
$ cat myFileWithPrecomposedCharacters | java d
So you could paste NFC characters from, say, a Vietnamese web page into a file, and then run the file through the program to generate NFKD. Paste that output into the email you're sending, and the email should render in a readable way in a desktop email client, a web email client, or on a BlackBerry.

I've only tested this methodology with Vietnamese, and because the incident at the tower of Babel was so confusing, all bets are off with other languages.

By the way, it looks to me like neither NFC nor NFKD renders correctly in AndroidMail as of today's build, so we should end up seeing complaints about Vietnamese not rendering well on the G1, unless the developers get it fixed soon. Maybe we will have a follow-up post with more on that subject.

UPDATE: A great write-up on Unicode and the BlackBerry is available on the LogicMail website. LogicMail is a J2ME email client supporting IMAP and POP, designed to run on RIM BlackBerry handheld devices.

NOTE:

1) For more help with the definitions of the normalization forms mentioned above try here. It is a good document to be familiar with if you are planning to do i18n or l10n.
- i18n stands for internationalization
- l10n stands for localization

2) For those of you who just want a quick overview ...
- These terms are roughly equivalent for this discussion:
   compatibility decomposition (NFKD)
   canonically decomposed characters (NFD)
   composite Unicode
   composite characters
   the "separated" diacritical marks and letters used in Vietnamese without combining

- These terms are also roughly equivalent, and should not be confused with those just above:
   compatibility composition (NFKC)
   NFC - normalization form canonical composition
   precompound unicode
   unicode dựng sẵn (Vietnamese for "precomposed Unicode")
   composed characters
   recomposed characters (by canonical equivalence)
   precomposed characters
   decomposable characters
   pre-composite characters
   ligatures
   the set of completed characters (including all markings)
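The "roughly equivalent" above glosses over the difference between the canonical forms (NFD, NFC) and the compatibility forms (NFKD, NFKC). Here is a small sketch of where that difference shows up and where it does not. The class name FormsCompare and its helper are made up for the example, and the two sample strings are my own choices: a Vietnamese word and the single-character "fi" ligature (U+FB01).

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class FormsCompare {
    // Normalize the text to the given form and list its code points as U+XXXX.
    static String describe(String text, Form form) {
        StringBuilder sb = new StringBuilder();
        Normalizer.normalize(text, form).codePoints()
                .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String vietnamese = "l\u1EAFm"; // "lắm" with a precomposed letter
        String ligature = "\uFB01";     // the "fi" ligature as one code point

        // Print every normalization form for both samples.
        for (Form form : Form.values()) {
            System.out.println(form + "  Vietnamese: " + describe(vietnamese, form)
                    + "  ligature: " + describe(ligature, form));
        }
    }
}
```

For the Vietnamese sample, NFD and NFKD produce the same decomposed sequence, which is why the terms can be treated as interchangeable in this post. The ligature is the case where they part ways: the canonical forms leave it alone, while the compatibility forms replace it with the two letters "f" and "i".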

If you think there is a problem with the terms or equivalencies drawn above, please let's discuss it via email. If there are things that need to be corrected, I am open to that, just let me know.

5 comments:

Anonymous said...

Great. I am a VietNamese at Saigon, my cellfone also a BlackBerry, i have plan about this problem too ! Yar blog helped me alot.
Cảm ơn bạn rất nhiều. (Thank you very much.)
(Writed from BB)

merrymenvn said...

I'm a Vietnamese. 99.99% Vietnamese sites use pre-compound unicode. http://vi.wikipedia.org/wiki/Trang_Ch%C3%ADnh is one of them. I'm sure.

edburns said...

Hi Vern,

I read the post but haven't yet figured out how best to respond. I'll have to read it again.

Ed

P.S. Happy Easter.

Vernon Singleton said...

NOTE: The issue described below was occurring while using BlackBerry OS version 4.0 and 4.1. The newer 5.0 version of the BlackBerry OS does not exhibit these issues.

Unknown said...

Hi Vernon, I am a student and now I am working on project in which I want to display Vietnamese characters in the BrowserField.
I tried so much but I still can not do it.

Could you help me this problem. If you have answer, please send it to my mailbox (luongphanbinh@gmail.com)

Thank you very much.