Unicode Support in ICU for Java Doug Felt dougfelt@us.ibm.com Globalization Center of Competency, San Jose, CA
Overview • What is ICU4J? • ICU and the JDK, a brief history • Benefits and tradeoffs of ICU4J • Features of ICU4J • Performance of ICU4J • Using ICU4J • Conclusion and References
What is ICU4J? • Internationalization Library • Sister project of ICU (C/C++) • Open-source, non-viral license • Sponsored by IBM • Unicode Standard compliant, up-to-date • 100% Pure Java • Enhances and extends JDK functionality • Over five years of continuous development
ICU and Java, a History • Started with Java 1.1 internationalization • Much code contributed by IBM/Taligent • IBM provided support, bug fixes, enhancements • Became open-source project in 2000 • ICU4C code started with port from Java • Continued contributions to Java since then • TextLayout, OpenType layout, Normalization
Collaboration with Java Teams • We continue to work with Java internationalization, graphics2D teams • We participate in Java expert groups (e.g. JSR 204, Supplementary Support) • Differences • perspectives (conformance, features versus size) • processes (open source versus corporate/JSR) • timetable (twice a year versus every two years)
Benefits • Fully implements current standards • Unicode collation, normalization, break iteration • Updated more frequently than Java • Full CLDR data • Improved performance • Open source, open license, customizable • Compatible with ICU C/C++ libraries and data • Runs on JDK 1.4 • Get supplementary support without moving to 1.5
Tradeoffs • Not built-in, unlike Java i18n support • Some API differences • But generally a superset of the Java API • Some differences unavoidable due to class restrictions • Rule syntax differs to varying degrees • Data differences • ICU4J uses its own CLDR data, not the JVM’s data • Size • Can trim ICU4J, but it will always be larger than 0K
Features of ICU4J • Collation • Normalization • Break Iteration • UnicodeSet and Transforms • Character Properties • Locale data • Other • Calendars, Formatters, IDNA, StringPrep, IMEs
Collation • Full UCA (Unicode Collation Algorithm) • Java does not implement UCA collation • Locale data • Over 60 tailorings for locale-specific collation • Variants: Pinyin, stroke, traditional, etc. • Performance • sorting: 2 to 20 times faster • sort key generation: 1.5 to 4 times faster • sort key length: 2/3 to 1/4 the length of Java sort keys
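The sort key bullets above refer to precomputing a binary key per string so that repeated comparisons become cheap byte comparisons. A minimal sketch of that pattern follows. It uses the JDK's java.text.Collator so that it runs without extra jars. The class name SortKeySketch and its methods are hypothetical, and the speed and key-length figures on the slide apply to ICU4J, not to this JDK code.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU4J or the JDK.
final class SortKeySketch {

    // Sorts by precomputed sort keys: each string is converted to a key once,
    // then the keys are compared instead of re-running the collation rules.
    static String[] sortWithKeys(String[] input, Locale locale) {
        Collator collator = Collator.getInstance(locale); // instantiate once, reuse
        CollationKey[] keys = new CollationKey[input.length];
        for (int i = 0; i < input.length; i++) {
            keys[i] = collator.getCollationKey(input[i]);
        }
        Arrays.sort(keys); // CollationKey implements Comparable
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    // Length in bytes of the sort key, the quantity the slide compares.
    static int sortKeyLength(String s, Locale locale) {
        return Collator.getInstance(locale).getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "apple", "banana"};
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.ENGLISH)));
        System.out.println(sortKeyLength("apple", Locale.ENGLISH) + " bytes");
    }
}
```

Sort keys pay off when the same strings are compared many times, for example in a database index. For a one-off sort, passing the collator directly to Arrays.sort is usually simpler.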
Normalization • Java does not provide normalization APIs • Java uses ICU’s implementation internally • Useful for searching, string equivalence, simplifying processing of text • Full implementation of Unicode standard • NFC, NFD, NFKC, NFKD • Also provides FCD ‘quick check’ for optimization
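To illustrate why normalization helps with string equivalence, here is a minimal sketch. It assumes a later JDK: java.text.Normalizer was added in Java 6, after this talk, and its signatures differ from ICU4J's own Normalizer class. The class NormalizationSketch and its method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper; uses the JDK normalizer (Java 6+), not ICU4J's class.
final class NormalizationSketch {

    // Two strings are canonically equivalent if their NFD forms match.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // NFC composes base letters and combining marks where possible.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFKC additionally folds compatibility characters such as ligatures.
    static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // e with acute, one code point
        String decomposed = "e\u0301"; // e followed by combining acute
        System.out.println(precomposed.equals(decomposed));           // plain comparison
        System.out.println(canonicallyEqual(precomposed, decomposed)); // normalized comparison
    }
}
```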
Break Iteration • Fully conforms to Unicode specifications • supplementary characters, Hangul • Tags • e.g., “what kind of word was this” • Title case iteration • Rule-based, dictionary-based for Thai
Unicode Set and Transforms • UnicodeSet • collections of characters based on properties • logical set operations, flexible • “[[:mark:]&[\u0600-\u067f]]” • Transliterator • general transformations, with chaining and editing • converts between scripts, e.g. Greek/Latin, Devanagari/Gujarati • rule-based, rules for common conversions supplied • UScriptRun
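The UnicodeSet pattern above intersects a property set (marks) with a code point range. As a rough analog only, the same intersection can be sketched with a JDK regular-expression character class, which writes intersection as && rather than &. This is not the UnicodeSet API, and the class SetIntersectionSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a java.util.regex analog of the set pattern, not UnicodeSet.
final class SetIntersectionSketch {

    // General category M (marks) intersected with the range U+0600..U+067F.
    private static final Pattern MARK_IN_RANGE =
            Pattern.compile("[\\p{M}&&[\\u0600-\\u067F]]");

    // True if the single code point belongs to the intersection.
    static boolean contains(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return MARK_IN_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(0x064E)); // ARABIC FATHA, a mark inside the range
        System.out.println(contains(0x0628)); // ARABIC LETTER BEH, a letter inside the range
        System.out.println(contains(0x0301)); // COMBINING ACUTE ACCENT, a mark outside the range
    }
}
```

A regex class only answers membership questions. UnicodeSet, as described on the slide, is a collection object in its own right.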
Character Properties • All Unicode character properties • over 80, Java provides access to about 10 • All defined code points • Current with latest Unicode release • ICU4J 3.0 uses Unicode 4.0.1 data • Fast access to character data
Locale Data • Standard data, included with ICU4J • CLDR (Common Locale Data Repository) • Ensures same data is available everywhere • Can share resource data with ICU4C applications • More locales, more kinds of data • ~230 locales, compared to ~130 for Java • Can modularize to include only the data you need • RFC3066bis support (language_script_region) • e.g., zh_Hans, zh_Hant • keywords (orthogonal variants)
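The language_script_region identifiers and keywords above can be explored with a small sketch. It assumes a later JDK: java.util.Locale gained script and keyword support in Java 7, using hyphenated BCP 47 tags instead of the underscore and @keyword syntax that ICU4J's ULocale accepts. In BCP 47 form, the collation keyword is written as the "co" extension key. The class LocaleTagSketch is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper; uses java.util.Locale (Java 7+), not ICU4J's ULocale.
final class LocaleTagSketch {

    // Script subtag of a BCP 47 tag, or the empty string if none is present.
    static String script(String languageTag) {
        return Locale.forLanguageTag(languageTag).getScript();
    }

    // Value of the collation keyword ("co"), or null if the tag has none.
    static String collationType(String languageTag) {
        return Locale.forLanguageTag(languageTag).getUnicodeLocaleType("co");
    }

    public static void main(String[] args) {
        System.out.println(script("zh-Hans"));
        System.out.println(script("zh-Hant-TW"));
        System.out.println(collationType("es-ES-u-co-trad"));
    }
}
```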
Performance of ICU4J • Instantiation times are comparable • Common instantiate and reuse model • ICU4J and Java both use caches to limit impact • Collation performance faster • faster sorting, smaller sort keys • Performance is difficult to measure • JVM makes a difference • ICU4J performs well in spot tests • Use a scenario that matters to you to test
Property Data Timings • 1.13 GHz PIII, Win2K • Nanoseconds/operation for character property access (getType, toLowerCase, getDirectionality) on three JVMs (chart not reproduced)
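A spot test of this kind could be sketched as follows. This is not the harness behind the timings above: the class PropertyTimingSketch is hypothetical, it measures the JDK's Character.getType rather than UCharacter, and a simple loop like this is sensitive to JIT warm-up and to the JVM in use, so the numbers are only a rough indication.

```java
// Hypothetical spot test; measures java.lang.Character, not ICU4J's UCharacter.
final class PropertyTimingSketch {

    // Written to after each run so the JIT cannot discard the measured loop.
    static volatile int sink;

    // Average nanoseconds per Character.getType call over the BMP.
    static double nanosPerGetType(int passes) {
        int acc = 0;
        // Warm-up pass so the timed loop is more likely to run compiled code.
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            acc += Character.getType(cp);
        }
        long start = System.nanoTime();
        for (int p = 0; p < passes; p++) {
            for (int cp = 0; cp <= 0xFFFF; cp++) {
                acc += Character.getType(cp);
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / ((long) passes * 0x10000);
    }

    public static void main(String[] args) {
        System.out.println(nanosPerGetType(100) + " ns/op");
    }
}
```

To compare libraries, the same loop would be run once per implementation on the same JVM, ideally with text drawn from the scenario that matters to the application.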
Sizes of ICU4J • Full jar file: 2,700K • Modular builds for common subsets • normalizer: 420K • collator: 1,400K • calendar: 1,300K • break iterator: 1,300K • basic properties: 500K • full properties: 1,200K • formatting: 2,200K • transforms: 1,500K
Using ICU4J • Jar file, just add to class path • Or roll into your distribution, it’s Open Source! • Modular builds help you to trim ICU4J’s code • Data can be trimmed to further reduce size • Parallel APIs • APIs on parallel classes are generally a superset • Change import (one line change) or change class name • Some differences unavoidable (our supplementary support for Java 1.4 can’t add API to String)
Code Examples (1)
import com.ibm.icu.text.BreakIterator;

// Visit every word boundary in text.
BreakIterator b = BreakIterator.getWordInstance();
b.setText(text);
for (int pos = b.first(); pos != BreakIterator.DONE; pos = b.next()) {
    doSomething(pos);
}
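Code Examples (2)
import com.ibm.icu.lang.UCharacter;

// Fragment from a boolean method: walk text one code point at a time
// and report any code point whose type is SURROGATE.
int cp, pos = 0;
while (pos < text.length()) {
    cp = UCharacter.codePointAt(text, pos);
    if (UCharacter.getType(cp) == UCharacter.SURROGATE)
        return true;
    pos += UCharacter.charCount(cp); // advance by 1 or 2 chars
}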
Code Examples (3)
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.Collator;
import java.util.Arrays;

// Sort with Spanish traditional collation, selected by locale keyword.
ULocale ulocale = new ULocale("es_ES@collation=traditional");
Collator col = Collator.getInstance(ulocale);
String[] list = ...
Arrays.sort(list, col);
Conclusion • ICU4J is not for you if • you have tight size constraints • you require the Java runtime behavior • ICU4J is for you if • you need full compliance with current standards • you need current or additional locale and property data • you need customizability • you need features missing from Java (normalization) • you need additional performance
References • ICU4J • http://oss.software.ibm.com/icu4j/ • Java • http://java.sun.com/ • http://www.ibm.com/java/ • Unicode, CLDR • http://www.unicode.org/ • http://www.unicode.org/cldr/