The emergence of the Internet and other distributed networks puts increasing importance on the ability to create global software - that is, software that can be developed independently of the countries or languages of intended users and then translated for multiple countries or regions. Java™ 1.1 takes a major step forward in creating global applications and applets. It provides frameworks based on the Unicode™ 2.0 character set - the standard for international text - and an architecture for developing global applications, which can present messages, numbers, dates and currency in any country's conventional formats.
In Part I, we discussed how to convert your application into a global application. Now, in Part II, we will look at the current limitations in the JDK 1.1 and what the future may hold in store for us.
The bulk of the international support in JDK 1.1 was licensed from Taligent, with data supplied by IBM's Native Language Technical Center (NLTC) in Toronto. Taligent developed an integrated set of object-oriented frameworks to support the creation of international software, providing a standard API to make handling the requirements of different countries and languages transparent to developers. Using experience developed building C++ frameworks, Taligent redesigned its frameworks in Java, which allows for a much simpler API and implementation. This was a cooperative effort with JavaSoft, which participated in reviewing and adapting the APIs we supplied.
Before diving into the text, let's clarify some of the terms we will be using:
The JDK 1.1 implements the Unicode 2.0 character set with the Unicode character database version 2.0.14; for brevity, we will refer to this as Unicode 2.0.14. Also, our use of the term application should be understood to include applets, unless otherwise noted.
- display string: A string that may be shown to the user. These strings will need to be translated for different countries. Non-display strings, such as URLs, are used programmatically and are not translated.
- locale: A name for conventions shared among a large set of users for language, dates, times, numbers, etc. Typically, a single country is represented by a single locale (and loosely speaking, you may use "country" and "locale" interchangeably). However, some countries, such as Switzerland, have more than one official language and thereby multiple locales.
- global application: An application that can be completely translated for use in different locales. All text shown to the user is in the native language, and user expectations are met for dates, times and other locale conventions. Also known as localizable application. To globalize is to convert your program into a localizable one. Of course, with Java you can also have global applets.
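The terms above map directly onto the java.text package in JDK 1.1. As a minimal sketch (the choice of locales here is arbitrary), the same number and date render differently according to each locale's conventions:

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

public class LocaleDemo {
    public static void main(String[] args) {
        double amount = 1234.56;
        Date now = new Date();
        // The same values, formatted per each locale's conventions.
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.FRANCE };
        for (int i = 0; i < locales.length; i++) {
            NumberFormat nf = NumberFormat.getNumberInstance(locales[i]);
            DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, locales[i]);
            System.out.println(locales[i] + ": " + nf.format(amount)
                               + " / " + df.format(now));
        }
    }
}
```

A global application never hard-codes these patterns; it asks the formatting classes for the right format for the current locale.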
Limitations of JDK 1.1
The JDK 1.1 release provides a lot of support for European languages and the countries that use them (we will refer to these locales as Western) but, due to time constraints, only minimal support for the Far East. There is no adequate support for the Middle East or Southeast Asia (in which we include the Indian subcontinent). In addition, the font support is very weak, even just for English!
On the plus side, JDK 1.1 fonts have the capability to draw any Unicode characters, assuming that the host platform/browser supports drawing those characters. This requires that the appropriate fonts be installed on the host system. Java does have a mechanism for combining many different native fonts into a single logical font that covers a larger range of Unicode characters; currently this is done by editing one or more font.properties text files in a special format. JavaSoft has very good documentation for this, and for its current limitations, on its Web site at http://www.javasoft.com:80/products/jdk/1.1/docs/guide/intl/index.html
On the minus side, Java support for fonts is still very weak. For one thing, there is no way to access the full set of fonts on a system; you are limited to a small set of logical fonts: Serif, SansSerif, Monospaced, etc. (By the way, it is futile to search for a method that lists fonts in Font; use Toolkit.getFontList.) For most applets this is not so bad; the few logical fonts supported on each implementation are usually sufficient. For Java applications, however, it is a real problem; you can't build a Java application that can list and use the available fonts on a system, something that the simplest of native applications can do.
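To make the limitation concrete, here is a minimal sketch of the only font enumeration available: Toolkit.getFontList returns just the logical font names, not the host's installed fonts (the method was later deprecated in favor of richer APIs).

```java
import java.awt.Toolkit;

public class FontListDemo {
    public static void main(String[] args) {
        // In JDK 1.1 this is the whole story: a handful of logical font
        // names; there is no API for enumerating the host's actual fonts.
        String[] fonts = Toolkit.getDefaultToolkit().getFontList();
        for (int i = 0; i < fonts.length; i++) {
            System.out.println(fonts[i]);
        }
    }
}
```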
Note: The logical font names will map to different fonts on different platforms; never make assumptions about metrics or coverage of these fonts.
The following are general deficiencies in the current international support. For background information on all of the following topics, see the Unicode Standard.
- Default Locales in Applets: Unfortunately, there is no per-thread data in JDK 1.1. That, plus security concerns, prevents applets from being able to call Locale.setDefault. If you want to do this, the only current work-around is to store your own default in a well-known place and pass it around explicitly. This will not work for inaccessible code that uses just the standard default locale, such as exception formatting.
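One possible shape for that work-around is a small holder class of your own devising (AppletLocale is a hypothetical name, not a JDK class):

```java
import java.util.Locale;

// A hypothetical "well-known place" for an applet-wide default locale,
// since applets cannot call Locale.setDefault() in JDK 1.1.
public class AppletLocale {
    private static Locale current = Locale.getDefault();

    public static synchronized Locale get() {
        return current;
    }

    public static synchronized void set(Locale locale) {
        if (locale != null) {
            current = locale;
        }
    }

    public static void main(String[] args) {
        set(Locale.FRANCE);
        // Pass the stored locale explicitly to every formatting call.
        System.out.println(java.text.NumberFormat.getNumberInstance(get()).format(1234.56));
    }
}
```

Every format or collation request must then be given AppletLocale.get() explicitly; code you cannot reach, such as the JDK's own exception formatting, still uses the real default locale.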
- Full Locale Coverage: While a large number of locales are in JDK 1.1, more need to be added. This is especially true for South America, where the locale data tends to differ only in the name of the country and in the currency and number formats; these countries are otherwise in pretty good shape.
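You can check which locales your particular JDK ships formatting data for; DateFormat.getAvailableLocales() (present since 1.1) lists them:

```java
import java.text.DateFormat;
import java.util.Locale;

public class AvailableLocales {
    public static void main(String[] args) {
        // The locales for which the formatting classes supply data.
        Locale[] locales = DateFormat.getAvailableLocales();
        System.out.println(locales.length + " locales:");
        for (int i = 0; i < locales.length; i++) {
            System.out.println("  " + locales[i].getDisplayName());
        }
    }
}
```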
- Editing: The current TextArea and TextField use host peers to actually do the editing. That means that if the host does not support Unicode natively, there is a conversion to some character set the host does handle, typically one that covers only the host's default locale. In such circumstances, the rich set of symbols and punctuation in Unicode (let alone letters in other languages) is simply discarded. In addition, some current implementations do signed conversion back to Unicode; so when you put \u00E5 (å) into the TextArea, you get back \uFFE5 (fullwidth ¥)!
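That particular corruption is the classic sign-extension bug: Java bytes are signed, so a careless byte-to-char widening turns \u00E5 into \uFFE5. A minimal demonstration:

```java
public class SignExtension {
    public static void main(String[] args) {
        // The byte 0xE5 is negative in Java (bytes are signed: 0xE5 == -27).
        byte b = (byte) 0xE5;

        // Careless widening sign-extends, turning U+00E5 (å)
        // into U+FFE5 (fullwidth ¥).
        char wrong = (char) b;          // '\uFFE5'

        // Masking to 8 bits before widening preserves the code point.
        char right = (char) (b & 0xFF); // '\u00E5'

        System.out.println(Integer.toHexString(wrong)); // ffe5
        System.out.println(Integer.toHexString(right)); // e5
    }
}
```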
- Character Code Conversions: As we mentioned earlier, you really need to be able to iterate through all of the installed character code converters, and have a richer API - among other reasons, for better performance.
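What 1.1 does provide is conversion by encoding name through String; there is no way to enumerate the installed converters, only to try a name and catch the failure. A small sketch:

```java
import java.io.UnsupportedEncodingException;

public class Converters {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "café";
        // Convert chars to bytes under a named encoding...
        byte[] latin1 = s.getBytes("ISO-8859-1");
        byte[] utf8 = s.getBytes("UTF-8");
        // ...and back again. An unknown encoding name simply throws
        // UnsupportedEncodingException; you cannot ask what is installed.
        System.out.println(new String(latin1, "ISO-8859-1")); // café
        System.out.println(latin1.length + " vs " + utf8.length); // 4 vs 5
    }
}
```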
- Keyboards: Most non-Western locales use mixtures of different scripts; for example, you will find English product names mixed in with Japanese or with Arabic. To handle this, the user needs to be able to change between different keyboard mappings. Usually the operating system will provide support for this, but for word processing you need to be able to find out what the current keyboard is, iterate through the installed keyboards, and reset the current keyboard. You can then, for example, provide a convenience whereby, when the user clicks in Japanese text, the keyboard automatically switches to Japanese.
- Calendars: Most non-Western locales have alternative calendars and need to allow a choice between the standard (Gregorian) calendar and at least one alternative. Japanese, for example, needs an additional calendar, which is based on the year of the various Emperors' reigns. If you need to do this in JDK 1.1, you will need to subclass Calendar, which is fairly straightforward. Although you can do this, it may be of limited use until some of the other features are supported in Java.
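A full Calendar subclass must override methods such as computeFields() and computeTime(); as a much-simplified sketch of the underlying idea (the class name and its whole-year era boundaries are illustrative only; real eras begin mid-year), one can map Gregorian years onto Emperor-era years:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

// A minimal sketch of the idea behind a Japanese imperial-era calendar:
// map a Gregorian year onto the reigning Emperor's era. A real Calendar
// subclass would override computeFields()/computeTime(); the era starts
// below are simplified to whole years.
public class JapaneseEra {
    private static final String[] NAMES = { "Meiji", "Taisho", "Showa", "Heisei" };
    private static final int[] STARTS = { 1868, 1912, 1926, 1989 };

    public static String format(int gregorianYear) {
        for (int i = STARTS.length - 1; i >= 0; i--) {
            if (gregorianYear >= STARTS[i]) {
                // Era years are 1-based: the start year is year 1.
                return NAMES[i] + " " + (gregorianYear - STARTS[i] + 1);
            }
        }
        return String.valueOf(gregorianYear);
    }

    public static void main(String[] args) {
        GregorianCalendar cal = new GregorianCalendar(1997, Calendar.JANUARY, 1);
        System.out.println(format(cal.get(Calendar.YEAR))); // Heisei 9
    }
}
```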
- Styled Text: Using the current Java API, you can perform your own text layout to support drawing, hit-testing, highlighting and line-break. For example, you would make a series of draw calls, with font changes between each one. Even with Western scripts, this does not support higher-level features such as justification efficiently. With non-Western scripts such as Hebrew, having a mixture of right-to-left and left-to-right characters, this method breaks down very quickly.
Chinese, Japanese and Korean (CJK) have very large alphabets which require fonts that can handle large character sets and special support for inputting characters. For high-end systems, vertical text and ruby (textual annotations) are also required.
The main issue for correct localization for CJK is input. Due to the complexity of the character sets, there is a conversion facility that transforms input from a small set of phonetic or component characters that the user types into the actual CJK characters stored in the document. This facility is often called an input method engine (IME) or sometimes a front-end processor (FEP). An IME is generally quite complex. It often does sophisticated grammatical analysis of the text and commonly:
- uses the input context to disambiguate characters.
- marks special states of text with distinctive highlighting.
- allows the user to choose and control alternative transformations.
- allows the user to add new expressions to user dictionaries.

There are three main types of input support, each offering a different level of capability and requiring a different degree of application change:
Name: Off-the-spot (aka bottom-line)
User Value: Minimal
Application Changes: None
When the user types a character, a window appears (usually at the bottom of the screen). Within that window, the user interacts with the IME. When the user is done, a series of keyboard events is fed to the unsuspecting application, one at a time. (This would speed up if there were a Java keyboard event that contained an entire string.)
Name: Over-the-spot
User Value: Partial
Application Changes: Minimal
When the user types a character, a window appears right over where the user was typing. The text is often in the same font and size, and feels more like the user is typing directly into the document. Otherwise, this is the same as off-the-spot.
Name: On-the-spot (aka inline)
User Value: Full
Application Changes: Major (for word processors)
When the user types a character, it goes directly into the document. The special highlighting happens within the text and changes are immediately reflected, including word-wrap. These require fairly complex interactions for word processors; programs that use the built-in Java editing (TextField, TextArea) are not affected.
Currently, you are completely dependent on the quality of the Java implementation on the host platform/browser.
You will get on-the-spot support in TextField and TextArea, but only if the implementation supports it; otherwise you will get off-the-spot support, again only if the implementation supports it.
It is fairly easy for a host platform/browser to support both these features, at least on the major platforms that have CJK support, but unfortunately there are no guarantees that this is done.
Moreover, there is no way for a Java program to support on-the-spot outside of TextArea/TextField, such as for word processors doing real rich-text editing with mixed styles and fonts. You can do over-the-spot support yourself by opening up your own small window that contains a TextArea, putting the window in the right position and setting the font yourself. However, you can't get a list of the available IMEs or choose which gets invoked.
Large character fonts are handled in 1.1, although, as discussed above, the selection is limited. Neither ruby nor vertical text is in 1.1. Both require special handling in text layout, but they are fairly high-end features and not a problem for most programs.
- Ruby: Because people often don't know the pronunciation of a particular ideograph (Kanji), small phonetic symbols are often placed over one or more ideographs (see Figure 1).
- Vertical Text: CJK characters can also be written vertically, with lines that go from right to left (usually). There are a couple of complications: some characters will rotate or change shape in a vertical context; and intermixed Latin text may rotate 90 degrees clockwise (and Arabic characters 90 degrees counter-clockwise!).
Arabic and Hebrew are written from right to left, but also allow mixing in left-to-right text, such as numbers or English text (see Figure 2). This feature is called BIDI (short for bidirectional). Moreover, Arabic characters may change shape radically, depending on their context. Both of these features require very special handling in text layout to support drawing, hit-testing, highlighting and line-break, and they are not optional for these locales.
Moreover, the general flow of objects will generally also be from right to left. For example, this includes the flow of components with a FlowLayout, or tab stops in text, or which side the box appears in a Checkbox. Text is also generally right-flush instead of left-flush. The localizer and developer need to have control of this flow direction on a component-by-component basis.
In addition, legacy data in other character sets may be stored in either visual or logical order, while Unicode uses logical order. So, special character converters must be written that can convert back and forth.
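As a deliberately naive illustration of what such a converter does (this handles only a line consisting of a single right-to-left run; real bidirectional data needs the full Unicode BIDI algorithm):

```java
// A deliberately naive sketch of a visual-to-logical converter: if a
// legacy file stores a line of purely right-to-left text in visual order,
// reversing the line recovers logical order. Mixed-direction text with
// embedded numbers or Latin requires the full Unicode BIDI algorithm.
public class VisualToLogical {
    public static String convertLine(String visual) {
        StringBuffer sb = new StringBuffer(visual);
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        // Latin letters stand in here for a purely right-to-left run.
        System.out.println(convertLine("CBA")); // ABC
    }
}
```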
Indic languages require special handling since they require special ligatures (called conjuncts) and also rearrange certain vowels. Thai does not have these issues, but does require precise placement of multiple accents, stacking upon one another. This requires very special handling in text layout to support drawing, hit-testing, highlighting and line-break and is not optional (see Figure 3).
These languages may also employ simple input methods to alert the user to illegitimate combinations of letters.
JDK 1.2 and Beyond
JavaSoft is working with several companies to provide input method support in 1.2. This should address the major shortcoming of Java in the Far East. The new work on the JFC will also include 100% Java text editing, handling the concerns mentioned above.
In addition, Taligent is under contract to JavaSoft to assist in further improving the international system in general. In the 1.2 release, this will involve supporting BIDI and providing a much more general text layout and font mechanism, as well as fixing bugs and incorporating further performance improvements from our C/C++ libraries. After 1.2, this will include support for Southeast Asian countries as well.
Much of this support, such as new calendars, will be transparent to the developer who is already using 1.1 features. However, some, such as the input method support, may require changes to applications, especially those doing their own text processing.
Some of the other possible future enhancements include the following:

- Per-thread locale data, so applets can call Locale.setDefault()
- getWebLocale(), so applets can find out their Web page's locale
- Full synchronization of host and Java default locales and time zones
- Searching API on Collators
- Programmatic switching of the priority of uppercase and lowercase letters in Collators
- Host-matching formats, collators, etc.
- Formatted TextFields (e.g., that format numbers and check that input text matches the format)
- Historical time zones, for better compatibility with UNIX
- Sublinear searching for fast international searching
- Transliteration for rule-based text conversions for smart quotes and kana
- International regular expressions for language-sensitive matching
- Spell-check framework for connecting spell-check engines
- Hyphenation framework for connecting hyphenation engines

If you have any feedback on improvements that you would like to see made, or any bugs that you have encountered, please send e-mail to JavaSoft and us ([email protected] and [email protected]).
With the Unicode support already in Java 1.1, the amount of work that you have to do to globalize your application is much less than on other platforms. You can start localizing your programs right now, which will get your application a long way toward world coverage, including Europe, the Americas and (minimally) the Far East. As Java continues to evolve, you will soon be able to localize to all world markets, building on the same code base you have now.
My thanks to Brian Beck, Ken Whistler, Laura Werner, Kathleen Wilson, Baldev Soor, Debbie Coutant, Tom McFarland, Lee Collins, Andy Clark, David Goldsmith and Dave Fritz for their review or assistance with this paper.
Pulling the JDK 1.1 international classes together on a very short schedule demanded a lot of hard work by people at Taligent, including Kathleen Wilson, Helena Shih, Chen-Lieh Huang and John Fitzpatrick. This was assisted by people at the IBM NLTC, most especially Baldev Soor, but also Siraj Berhan and Stan Winitsky. Without the support and excellent feedback from people at JavaSoft it also would not have been possible, especially from Brian Beck, but also from Asmus Freytag, David Bowen, Bill Shannon, Mark Reinhold, Guy Steele, Larry Cable, Graham Hamilton and Naoyuki Ishimura.
For more detailed information about each of the topics, you should definitely consult the Java 1.1 International documentation: http://www.javasoft.com:80/products/jdk/1.1/docs/guide/intl/index.html
To see the Java international classes in action, look at Taligent's Java Demos (JavaSoft has a copy of these on its site and in JDK 1.1, although it may sometimes be a somewhat older version): http://www.taligent.com/Products/javaintl/Demos/About.html
To see how to write robust Java classes, consult Java Cookbook: Well-Mannered Objects; http://www.taligent.com/Technology/WhitePapers/PortingPaper/WellMannered.html
If you are a beginner at Java, but are acquainted with C++ or C, look at Java Cookbook: Porting C++ to Java. http://www.taligent.com/Technology/WhitePapers/PortingPaper/index.html
We also supply C/C++ versions of these classes, in case you are interested in licensing them for applications besides Java. In addition, we provide on-line updates to this paper and a discussion forum. You can find this and other information at Taligent's home page: http://www.taligent.com
I also strongly recommend buying a copy of The Unicode Standard, Version 2.0 (and I don't even get any of the royalties!). For purchasing information and general information about the Unicode Consortium, look at the Unicode web site: http://unicode.org
About the Author
Dr. Mark Davis is the director of the Core Technologies department at Taligent, Inc., a wholly owned subsidiary of IBM. Mark co-founded the Unicode effort and is the president of the Unicode Consortium. He is a principal co-author and editor of the Unicode Standard, Version 1.0 and the new Version 2.0. Mark can be reached at [email protected]