Unicode Transforms in ICU
Mark Davis, Chief SW Globalization Architect, IBM

What is ICU?
• Internationalization libraries for C, C++, Java*
• Open source – non-viral
• Sponsored by IBM
• Sun’s Java licenses an earlier ICU version; ICU4J updates it.
• Unicode standard compliant
• Full supplementary support
• Cross-platform; extensible and customizable
• High performance and thread-safe
• Multiple locales in same thread – simultaneously
• http://oss.software.ibm.com/icu/
22nd International Unicode Conference

ICU Features
• Unicode text handling
• Character set conversions (700+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Breaks: character, word, line, & sentence
• Formatting
  • Date & time
  • Messages
  • Numbers & currencies
• Transforms
  • Normalization
  • Casing
  • Transliterations

ICU Transforms
• Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion…
• Supports Styled Text, not just Plain Text
• Chaining, Filters, Buffering
• Customizable

Transform Examples
• “Any-Uppercase”: a → A
• “Any-Hex/Java”: a → \u0061
• “Greek-Latin”: α → a

Filters
• “[aeiou] Latin-Greek”
  • “Latin” is the source
  • “[aeiou]” is a filter; it restricts the application to only English vowels. Uses UnicodeSet.
  • “Greek” is the target
• “[^\u0000-\u007E] Any-Hex”
  • “A δ is…” → “A \u03B4 is\u2026”

UnicodeSet
• Ranges: [ABC a-z]
• Union: [[:Lu:] [:P:]]
• Intersection: [[:Lu:] & [\u0000-\u01FF]]
• Set Difference: [[:Lu:] - [\u0000-\u01FF]]
• Complement: [^aeiou]
• Properties
  • Uppercase letters: [:Lu:]
  • Punctuation: [:P:]
  • Script: [:Greek:]
• ICU 2.2: all enumerated Unicode 3.2 properties

UnicodeSet Property Syntax
• Either POSIX or Perl style
  • Perl style: \p{letter}
  • POSIX style: [:letter:]
• Short or long form (UCD Property Aliases)
  • \p{general_category = uppercase_letter}
  • \p{gc=Lu}
• Case-, Space-, Underbar-Insensitive

Example Filter
• “[:Lu:] Latin-Katakana; Latin-Hiragana”
  • Converts all uppercase Latin characters to Katakana,
  • Then converts all other Latin characters to Hiragana.

Chaining Transforms
• “Kana-Latin; Any-Title”
  • たけだ, まさゆき
  • takeda, masayuki
  • Takeda, Masayuki
• Any number of transforms can be chained

Filtering plus Chaining
• “NFD; [:Mark:] Remove; NFC”
  • Decompose
  • Remove accents (Marks)
  • Recompose

Built-in Transforms
• Normalization
  • Å → Å
• Casing
  • a → A
• Full ↔ Halfwidth
  • カ → カ
• Character Names
  • a → {LATIN SMALL LETTER A}
• Hex: XML, Java, C++, Perl, … styles
  • a → \u0061, U+0061, …

Script ↔ Script Conversions
• General conversions, e.g.: Greek-Latin
  • Source-Target Reversible: φ → ph → φ
  • Not Target-Source Reversible: f → φ → ph
• Variants
  • By Language: Greek-German
  • By Standard: Greek-Latin/UNGEGN
• Can build your own

“Any-Latin” Example
김, 국삼 → Gim, Gugsam
김, 명희 → Gim, Myeonghyi
정, 병호 → Jeong, Byeongho
たけだ, まさゆき → Takeda, Masayuki
ますだ, よしひこ → Masuda, Yoshihiko
やまもと, のぼる → Yamamoto, Noboru
Ρούτση, Άννα → Roútsē, Ánna
Καλούδης, Χρήστος → Kaloúdēs, Chrḗstos
Θεοδωράτου, Ελένη → Theodōrátou, Elénē

Styled Text
• Preserves individual styles on letters, where possible
• απα → apa

When Buffering
• Conversions are not performed if they may extend over boundaries
• A typed p stays unconverted while it could still become p, ph, or ps
Key   Result
a     α
p     αp
a     απα
p     απαp
h     απαφ
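
The key-by-key table above can be reproduced with a small standalone sketch. This is not ICU code: the class name `BufferedTransliterator`, its `type` and `finish` methods, and the four-rule table used with it are all assumptions made for illustration. The one idea taken from the slide is that a tail which could still grow into a longer match (a clipped match) is left unconverted until more input arrives.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule type for this sketch: (pattern, replacement), tried in order.
using BufferRule = std::pair<std::string, std::string>;

// Illustrative stand-in for buffered (keyboard) transliteration; not the ICU API.
class BufferedTransliterator {
public:
    explicit BufferedTransliterator(std::vector<BufferRule> rules)
        : rules_(std::move(rules)) {}

    // Feed one typed key and return the text as it would currently be displayed.
    std::string type(char key) {
        pending_ += key;
        convertPending(false);
        return committed_ + pending_;
    }

    // No more input is coming, so any clipped match can be resolved now.
    std::string finish() {
        convertPending(true);
        return committed_ + pending_;
    }

private:
    // True if the unconverted tail starting at pos is a proper prefix of some
    // rule pattern, i.e. a longer match might still be completed by later keys.
    bool isClipped(std::size_t pos) const {
        const std::size_t remaining = pending_.size() - pos;
        for (const BufferRule& rule : rules_) {
            if (rule.first.size() > remaining &&
                rule.first.compare(0, remaining, pending_, pos, remaining) == 0) {
                return true;
            }
        }
        return false;
    }

    void convertPending(bool atEnd) {
        std::size_t pos = 0;
        while (pos < pending_.size()) {
            if (!atEnd && isClipped(pos)) break;  // wait for more input
            bool matched = false;
            for (const BufferRule& rule : rules_) {
                if (!rule.first.empty() &&
                    pending_.compare(pos, rule.first.size(), rule.first) == 0) {
                    committed_ += rule.second;
                    pos += rule.first.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) committed_ += pending_[pos++];  // pass through unchanged
        }
        pending_.erase(0, pos);
    }

    std::vector<BufferRule> rules_;
    std::string committed_;  // text already converted
    std::string pending_;    // typed keys not yet converted
};
```

With the assumed rules `ph → φ`, `ps → ψ`, `p → π`, `a → α` (in that order), typing a, p, a, p, h yields α, αp, απα, απαp, απαφ, matching the table. The strings are UTF-8 bytes, which is enough here because the patterns are ASCII.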

Custom Rules
• Similar to Regular Expressions
  • Variables
  • Property matches
  • Contextual matches
  • Rearrangement: $1, $2…
  • Quantifiers: *, +, ?

Differences from Reg. Exp.’s
• More Powerful…
  • Buffered/Keyboard
  • Styled Text
  • Ordered Rules
  • Cursor Backup
• Less Powerful…
  • Only greedy quantifiers
  • No backup: so no (X | Y)
  • No “input-side back references”

Example of Custom Rules
• “UnixQuotes-RealQuotes”
  \`\` > “ ; # two graves → left quote
  \'\' > ” ; # two generic apostrophes → right quote
• Example (SJ Mercury News online)
  ``expertise'' → “expertise”

Rule Ordering
• Find first rule that matches at start
• If no match, or (isBuffered & clipped-Match)
  • advance start by 1
• Else if match,
  • Substitute text
  • Move start as specified
• Continue until start reaches limit

Rule Ordering Example
          Translit.      Reg. Exp.
Rules     xy > c ;       s/xy/c/g
          yx > d ;       s/yx/d/g
Input     xyx-yxy-xyx    xyx-yxy-xyx
Result    cx-dy-cx       cx-yc-cx
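
The difference between the two result columns can be checked with a short standalone sketch. It does not use ICU or a regex library; the function names `applyOrderedRules` and `applySequentialPasses` and the `Rule` alias are hypothetical. The first function follows the rule-ordering procedure from the previous slide without buffering, and the second mimics running one global substitution after another.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule type for this sketch: (pattern, replacement).
using Rule = std::pair<std::string, std::string>;

// Transliterator-style pass: at each position apply the FIRST rule that
// matches there, then continue after the matched text. If no rule matches,
// copy one character and advance by 1.
std::string applyOrderedRules(const std::string& input,
                              const std::vector<Rule>& rules) {
    std::string out;
    std::size_t start = 0;
    while (start < input.size()) {
        bool matched = false;
        for (const Rule& rule : rules) {
            if (!rule.first.empty() &&
                input.compare(start, rule.first.size(), rule.first) == 0) {
                out += rule.second;
                start += rule.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += input[start++];
    }
    return out;
}

// Regex-style passes: each rule is applied over the whole string before the
// next rule is considered, like s/xy/c/g followed by s/yx/d/g.
std::string applySequentialPasses(std::string text,
                                  const std::vector<Rule>& rules) {
    for (const Rule& rule : rules) {
        if (rule.first.empty()) continue;
        std::size_t pos = 0;
        while ((pos = text.find(rule.first, pos)) != std::string::npos) {
            text.replace(pos, rule.first.size(), rule.second);
            pos += rule.second.size();  // do not rescan the replacement
        }
    }
    return text;
}
```

For the rules `xy > c` and `yx > d` and the input `xyx-yxy-xyx`, `applyOrderedRules` gives `cx-dy-cx` while `applySequentialPasses` gives `cx-yc-cx`: in the sequential version the first pass has already consumed every `x` that the second rule would have needed.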

Context
• Rules:
  {γ} [ Γ Κ Χ Ξ γ κ χ ξ ] > n;
  γ > g;
• Meaning:
  • Convert gamma into n IF followed by Γ, Κ, Χ, Ξ, γ, κ, χ, or ξ
  • Otherwise into g
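
A hand-written equivalent of these two rules may make the role of context clearer. This is only a sketch in plain C++ with hypothetical names (`isVelarContext`, `romanizeGamma`); it handles nothing but gamma and leaves every other character untouched. The point it illustrates is that the following letter is inspected but not consumed, so it is still available to later rules.

```cpp
#include <string>

// True if c is one of the letters listed in the rule's context set.
bool isVelarContext(char32_t c) {
    static const std::u32string velars = U"ΓΚΧΞγκχξ";
    return velars.find(c) != std::u32string::npos;
}

// Hand-coded version of:
//   {γ} [ Γ Κ Χ Ξ γ κ χ ξ ] > n;
//   γ > g;
// Only lowercase gamma is converted in this sketch.
std::u32string romanizeGamma(const std::u32string& text) {
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == U'γ') {
            // Look at the next character as context; it is not consumed.
            const bool velarFollows =
                i + 1 < text.size() && isVelarContext(text[i + 1]);
            out += velarFollows ? U'n' : U'g';
        } else {
            out += text[i];
        }
    }
    return out;
}
```

For example, in `γγ` the first gamma is followed by a gamma and becomes `n`, while the second has no such context and becomes `g`, giving `ng`.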

Cursor Backup
• Allows text to be revisited
• Reduces rule-count
• Example Rules
  BY > ビ | ~Y ;
  ~YO > ョ ;
• Steps (| marks the cursor)
  0  |BYO
  1  ビ|~YO
  2  ビョ|
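
The three steps can be traced with a small standalone sketch. It is not ICU's rule engine: the `CursorRule` struct and `applyWithCursor` function are hypothetical, and each rule's replacement is simply split into the part left of the `|` marker and the part right of it. After a substitution the cursor is placed between the two parts, so the right-hand part is matched again by later rules.

```cpp
#include <string>
#include <vector>

// Hypothetical rule shape for this sketch. For "BY > ビ | ~Y" the pattern is
// "BY", beforeCursor is "ビ" and afterCursor is "~Y".
struct CursorRule {
    std::string pattern;       // text to match at the cursor
    std::string beforeCursor;  // replacement text left of '|'
    std::string afterCursor;   // replacement text right of '|', revisited later
};

// Applies the first matching rule at the cursor, then moves the cursor to the
// '|' position inside the replacement. Offsets are UTF-8 byte offsets.
// Note: a rule with an empty beforeCursor whose afterCursor matches again
// would loop forever; this sketch does not guard against that.
std::string applyWithCursor(std::string text,
                            const std::vector<CursorRule>& rules) {
    std::size_t start = 0;
    while (start < text.size()) {
        bool matched = false;
        for (const CursorRule& rule : rules) {
            if (!rule.pattern.empty() &&
                text.compare(start, rule.pattern.size(), rule.pattern) == 0) {
                text.replace(start, rule.pattern.size(),
                             rule.beforeCursor + rule.afterCursor);
                start += rule.beforeCursor.size();  // cursor backs up to '|'
                matched = true;
                break;
            }
        }
        if (!matched) ++start;  // no rule applies here: advance by 1
    }
    return text;
}
```

With the two rules from the slide, the input `BYO` first becomes `ビ~YO` with the cursor after `ビ`, and the second rule then turns `~YO` into `ョ`, giving `ビョ`. A single `~Y…` rule per vowel can thus serve every consonant, which is how the rule count is reduced.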

Demonstration
• Public Demo
  • http://oss.software.ibm.com/icu/demo
• (local copy, samples)

More Information
• http://oss.software.ibm.com/…
  User Guide    /icu/userguide/
  C             /icu/apiref/utrans_h.html
  C++           /icu/apiref/
  Java API      /icu4j/doc/com/ibm/text/
• Latest version of these slides
  • http://www.macchiato.com

ICU Transforms
• Powerful, flexible mechanism
  • Uppercase, Lowercase, Titlecase, Full/Halfwidth
  • Normalization
  • Hex, Character Names
  • Script to Script conversion…
• Supports Styled Text, not just Plain Text
• Chaining, Filters, Buffering
• Customizable

Q & A

Backup Slides
• Not used in the presentation, except in response to questions

Buffered Usage
• No conversion for clipped match
• Fill buffer: …t…t
• Transliterate: …τ…t (the final t is not converted, since it might begin th)
• May have left-overs
• Copy left-overs to start
• Fill rest of buffer: th…
• Transliterate: θ…

Styled Text Handling
• Transforms operate on Replaceable, an interface/abstract class defined by ICU
• In ICU4C, UnicodeString is a Replaceable subclass (with no out-of-band data, so no styles)
• ICU4J defines ReplaceableString, a Replaceable subclass, also with no styles
• Clients must define their own Replaceable subclass that implements their styled text.

Transliteration Sources
• Søren Binks
  • http://homepage.mac.com/sirbinks/translit.html
• UNGEGN
  • http://www.eki.ee/wgrs/
• …

API: Information
• Like other ICU APIs, can get each of the available Transform IDs:
  count = Transliterator::countAvailableIDs();
  myID = Transliterator::getAvailableID(n);
• And get a localizable name for each:
  Transliterator::getDisplayName(myID, france, nameForUser);
• Note: these are C++ APIs; C and Java are also available.

API: Creation
• Use an ID to create (status is a UErrorCode that receives any error):
  myTrans = Transliterator::createInstance("Latin-Greek", UTRANS_FORWARD, status);

API: Simple usage
• Convert entire string (createInstance returns a pointer):
  myTrans->transliterate(myString);

More Control
• Specify Context
• Use with Styled Text
[Figure: the text abcdefghijklmnopqrstuvwxyz marked with four positions: contextStart, start, limit, contextLimit]