Unicode (and Java) Brice Giesbrecht
Objective of Presentation • The need for Unicode • How it works • Differentiate between encodings • How to get your browser to work… • See how Java consumes and produces data
Overview of Presentation • Character Sets • Unicode • Encodings • Unicode Support in Java • Unicode Support in Databases (?) • Demonstration (web app) • Resources • Door Prizes (for those still awake…)
Character Sets • What is a character set? • Code Page: a mapping in which a sequence of bits, usually a single octet representing integer values 0 through 255, is associated with a specific character (Wikipedia) • Most character sets are a direct mapping of a character to a number (7 bit / 8 bit) • Character sets are NOT fonts! • Encoding is usually a lookup in a table • Most IBM and Microsoft code pages use ASCII as their base set of characters • The English bias (compare to Indic languages)
Character Sets • Issues Within a single Language • Selectors to overcome 8 bit limitations (especially for CJK sets) • Historical importance of platforms and hardware • Compatibility (or more likely, lack thereof) • ISCII as an example • Issues outside a single Language • How do you produce content using multiple languages? (Or the characters from those languages?) • http://en.wikipedia.org/wiki/Code_page_437
Character Sets • Enter the standards • ISO-646 (ASCII, still 7 bit) • 12 whole code points to play with! • C0 Control Set (0x00 – 0x1F) • ISO-8859-n • 0x00 – 0x7F ISO-646 IRV • 0x80 – 0xFF Different for each set (or part) • ISO 8859-1 (Latin1) • C1 Control Set (0x80 – 0x9F) • ISO-2022 • Designed for transmission • Non-Latin bases & multibyte sets
Character Sets • Enter Microsoft! • Windows code pages • http://www.microsoft.com/globaldev/reference/wincp.mspx • Cp1252 • Based on ISO 8859-1 • C1 code points used for printable characters • Often mislabeled as ISO-8859-1 due to their similarities
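The difference between ISO 8859-1 and Cp1252 is easiest to see by decoding the same bytes with both. The following is a minimal sketch, assuming the JVM ships the "windows-1252" charset (it is not one of the charsets the Java specification guarantees, although common JVMs include it); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class C1Demo {
    /** Decodes a single byte with the given charset and returns the resulting code point. */
    static int decode(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this JVM provides the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x41 and 0xE9 decode identically; 0x80 and 0x93 fall in the C1 range and differ.
        for (int b : new int[] {0x41, 0x80, 0x93, 0xE9}) {
            System.out.printf("byte 0x%02X  ISO-8859-1: U+%04X  Cp1252: U+%04X%n",
                    b, decode(b, StandardCharsets.ISO_8859_1), decode(b, cp1252));
        }
    }
}
```

Bytes outside 0x80 – 0x9F decode the same way in both, which is why mislabeled text often looks fine until a curly quote or euro sign shows up.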
Unicode What is Unicode? Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Unicode • ISO 10646 (1990) • Merged with the work of the Unicode Consortium • Ties a character, name, and a code point together • BMP – Basic Multilingual Plane (the first 65,536 code points) • ISO and UC character repertoires are synchronized • UCS (Universal Character Set)
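The character / name / code point association can be inspected from Java. This is a small sketch, assuming Java 7 or later (where `Character.getName` is available); the class and method names are illustrative only.

```java
final class CodePointInfo {
    /** Formats a code point as "U+XXXX NAME" using the JDK's copy of the Unicode character names. */
    static String describe(int cp) {
        return String.format("U+%04X %s", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));     // a Basic Latin letter
        System.out.println(describe(0x00E9));  // a Latin-1 Supplement letter
        System.out.println(describe(0x20AC));  // the euro sign
    }
}
```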
Unicode • Q: So are they the same thing? A: No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646. (http://unicode.org/faq/unicode_iso.html)
Unicode • The Unicode Standard includes a set of characters, names, and coded representations that are identical with those in ISO/IEC 10646:2003. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers. [It] strengthens Unicode support for worldwide communication, software availability, and publishing. (http://www.iso.org)
Unicode • UCS Code space: (0x0 – 0x7FFFFFFF) • 128 x 256 x 256 x 256 (group, plane, row, cell) = 2,147,483,648 possible code points • The Unicode Character Database • http://unicode.org/Public/UNIDATA/UCD.html • Main Definition (UnicodeData.txt) • Available online • http://www.unicode.org/Public/UNIDATA/ • Unicode Code Space (0x0 – 0x10FFFF) • 17 x 256 x 256 = 1,114,112 code points
Unicode • As of Unicode 5.0.0, 101,063 (9.1%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, leaving 875,441 (78.6%) unassigned. The number of assigned code points is made up as follows: • 98,884 graphic characters • 140 formatting characters • 65 control characters • 2,048 surrogate code points
Unicode • Plane 0 (0000-FFFF) • Basic Multilingual Plane (BMP) • Used for most of the alphabets • Not all code points are used • Allocated in areas/blocks
Unicode • Plane 1 (10000-1FFFF) • Supplementary Multilingual Plane (SMP) • Used for historic scripts such as Linear B, and also for musical and mathematical symbols
Unicode • Plane 2 (20000-2FFFF) • Supplementary Ideographic Plane (SIP) • Used for about 40,000 rare Chinese characters that are mostly historic
Unicode • Planes 3 to 13 (30000-DFFFF) • Unassigned
Unicode • Plane 14 (E0000-EFFFF) • Supplementary Special-purpose Plane (SSP) • glyph (font) selection • code point + variation selector = variation sequence • http://www.unicode.org/reports/tr37/tr37-3.html (Ideographic Variation Database)
Unicode • Plane 15 (F0000-FFFFF) • Plane 16 (100000-10FFFF) • Plane 0 (E000-F8FF) • Private Use Area (PUA) • The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways.
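Because each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates that arithmetic together with the private use ranges listed on the slides; the class and method names are hypothetical.

```java
final class Planes {
    /** Returns the plane number (0 to 16) of a code point in the Unicode code space. */
    static int plane(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode code space: " + cp);
        }
        return cp >>> 16; // 0x10000 code points per plane
    }

    /** True for the private use ranges named on the slides: E000-F8FF and planes 15 and 16. */
    static boolean isPrivateUse(int cp) {
        return (cp >= 0xE000 && cp <= 0xF8FF) || (cp >= 0xF0000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x4E8C, 0x10302, 0x20000, 0xE0100, 0xE000, 0x10FFFF}) {
            System.out.printf("U+%04X  plane %d  private use: %b%n",
                    cp, plane(cp), isPrivateUse(cp));
        }
    }
}
```

The range check treats the last two code points of planes 15 and 16 as private use for simplicity; a stricter check would exclude them.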
Unicode ConScript Unicode Registry • The purpose of the ConScript Unicode Registry (CSUR) is to coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages. • Cirth, Klingon, Tengwar, etc.
Encodings The purpose of the following encodings is to get the Unicode value to you. Depending on the storage or transmission protocol, different encodings will need to be used. These are not different character sets; they are ways of representing the characters in Unicode.
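Encodings • Endianness • 0x1234 • LE 34 12 • BE 12 34 • Byte Order Mark (BOM) – 0xFEFF • Helps determine endianness • 0xFEFF originally doubled as ZERO WIDTH NO-BREAK SPACE; Unicode 3.2 added 0x2060 (WORD JOINER) to take over that role • 0xFFFE is reserved (a noncharacter), so a byte-swapped BOM can be detected • 0xFEFF set aside for the BOM • Also used to declare encoding (UTF-8)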
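Byte order and the BOM can be observed directly by encoding one character with Java's UTF-16 charsets. A minimal sketch follows, using only `java.nio.charset.StandardCharsets`; the class name and the `hex` helper are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    /** Renders bytes as space-separated two-digit hex, e.g. "12 34". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the given charset and returns the hex dump. */
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String s = String.valueOf((char) 0x1234);
        System.out.println("UTF-16BE: " + encode(s, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE: " + encode(s, StandardCharsets.UTF_16LE));
        // Java's plain "UTF-16" encoder writes a big-endian BOM first.
        System.out.println("UTF-16  : " + encode(s, StandardCharsets.UTF_16));
        // When decoding, the plain "UTF-16" charset uses the BOM to pick the byte order.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x34, 0x12};
        String decoded = new String(littleEndianWithBom, StandardCharsets.UTF_16);
        System.out.printf("decoded : U+%04X%n", (int) decoded.charAt(0));
    }
}
```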
Encodings UTF-8 • Variable-length character encoding • Can address all characters in the UCS but was limited by RFC 3629 to just address the Unicode code space. • BOM – EF BB BF • Format
000000-00007F  0zzzzzzz
000080-0007FF  110yyyyy 10zzzzzz
000800-00FFFF  1110xxxx 10yyyyyy 10zzzzzz
010000-10FFFF  11110www 10xxxxxx 10yyyyyy 10zzzzzz
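The bit layout above translates almost line for line into shifts and masks. The following is a sketch of a hand-written encoder for a single code point, meant only to illustrate the format (in practice `String.getBytes(StandardCharsets.UTF_8)` does this); the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one code point to UTF-8 following the four-row format table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp < 0x80) {            // 0zzzzzzz
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110yyyyy 10zzzzzz
            return new byte[] {(byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10yyyyyy 10zzzzzz
            return new byte[] {(byte) (0xE0 | (cp >> 12)),
                               (byte) (0x80 | ((cp >> 6) & 0x3F)),
                               (byte) (0x80 | (cp & 0x3F))};
        }
        // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return new byte[] {(byte) (0xF0 | (cp >> 18)),
                           (byte) (0x80 | ((cp >> 12) & 0x3F)),
                           (byte) (0x80 | ((cp >> 6) & 0x3F)),
                           (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0xE9, 0x4E8C, 0x10302}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```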
Encodings UTF-32/UCS-4 • Fixed-length character encoding • Uses 31 bits, stored in four octets • UCS-4 capable of addressing entire UCS, but was restricted to only cover the Unicode code space • UTF-32 only covers the Unicode code space • Code points 4E8C, 10302 are stored as 00004E8C, 00010302 • BE BOM – 00 00 FE FF • LE BOM – FF FE 00 00
Encodings UCS-2 • Fixed-length encoding • Two-octet • It is NOT UTF-16! • Only addresses BMP • UCS-2BE, UCS-2LE • Obsoleted by UTF-16
Encodings UTF-16 • Variable-length encoding • UTF-16BE, UTF-16LE • BE BOM – FE FF • LE BOM – FF FE • Surrogates are used to address code points outside the BMP. (We will cover this later)
Encodings UTF-16 Surrogate Pairs • Needed for code points > 0xFFFF • High surrogate 0xD800 – 0xDBFF (first code unit) • Low surrogate 0xDC00 – 0xDFFF (second code unit) • Algorithm: • ((cp - 0x10000) high 10 bits) | 0xD800 • ((cp - 0x10000) low 10 bits) | 0xDC00
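The surrogate algorithm above can be written out in a few lines. This is a sketch for illustration, with hypothetical class and method names; the JDK already provides the same conversion through `Character.toChars` and `Character.toCodePoint`.

```java
import java.util.Arrays;

final class SurrogateSketch {
    /** Splits a supplementary code point into {high surrogate, low surrogate}. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 | (v >>> 10));  // high 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));  // low 10 bits
        return new char[] {high, low};
    }

    /** Reverses toSurrogates: combines a surrogate pair back into a code point. */
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x10302;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        System.out.println("matches Character.toChars: "
                + Arrays.equals(pair, Character.toChars(cp)));
        System.out.printf("round trip: U+%X%n", fromSurrogates(pair[0], pair[1]));
    }
}
```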
Encodings Which Encoding should you use? • If dealing with CJK or Hindi (BMP code points from 0x0800 up), UTF-8 requires 3 bytes per character whereas UTF-16 needs only 2 • UTF-8 is great for ASCII (1 byte per character) whereas UTF-16 needs 2 bytes for it • Java uses UTF-16 internally • Windows uses UTF-16LE internally • UTF-32 not really used that much • UTF-8 and UTF-16 are the most common
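The size trade-off is easy to measure. The sketch below counts encoded bytes for an ASCII, a Hindi, and a CJK sample; the sample strings and the class name are arbitrary choices for illustration, and UTF-16BE is used so that no BOM is counted.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length; // BE variant: no BOM added
    }

    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());          // four octets per code point
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",                                  // ASCII
            "\u0928\u092E\u0938\u094D\u0924\u0947",   // Hindi (Devanagari), 6 code points
            "\u4E8C"                                  // CJK ideograph U+4E8C
        };
        for (String s : samples) {
            System.out.printf("%d code point(s): UTF-8 %d, UTF-16 %d, UTF-32 %d bytes%n",
                    s.codePointCount(0, s.length()),
                    utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```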
Java • Unicode version supported by each release: • J2SE 1.5: Unicode 4.0 • J2SE 1.4: Unicode 3.0 • J2SE 1.3: Unicode 2.1 • Supplementary characters were part of Unicode 3.1 • Addressed in JSR 204 (http://jcp.org/en/jsr/detail?id=204)
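Since a Java `char` is a 16-bit UTF-16 code unit, a supplementary character occupies two `char` values, and the code point methods added in J2SE 1.5 are needed to treat it as one character. A brief sketch, with an illustrative class name:

```java
final class SupplementaryDemo {
    /** Collects the code points of a string, stepping over surrogate pairs. */
    static int[] codePointsOf(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        // "A" followed by U+10302, which lies outside the BMP.
        String s = "A" + new String(Character.toChars(0x10302));
        System.out.println("length()         = " + s.length());
        System.out.println("codePointCount() = " + s.codePointCount(0, s.length()));
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Code that indexes strings with `charAt` and `length()` alone will see the two surrogates as separate characters.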
Java • Unicode characters are specified using \u such as \u0039 • Unicode can be used in source files • file.encoding=Cp1252 on my machine • You can change this, but beware… • Java reads and writes using this encoding by default • You can specify the character set to use for reading or writing
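To avoid depending on the platform default, the charset can be passed explicitly when wrapping streams. The following sketch writes and reads through in-memory streams so that it is self-contained; the class and method names are hypothetical, and real code would wrap file or socket streams the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetIo {
    /** Writes text through a Writer that uses the given charset and returns the bytes. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads all characters from the bytes through a Reader that uses the given charset. */
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        String text = "caf\u00E9";
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes written: " + utf8.length);
        System.out.println("read back as UTF-8 matches: "
                + read(utf8, StandardCharsets.UTF_8).equals(text));
        // Decoding with the wrong charset does not fail; it silently yields different characters.
        System.out.println("read back as ISO-8859-1 matches: "
                + read(utf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```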
Databases (Maybe) • SQL-92 NATIONAL CHARACTER • The <key word>s NATIONAL CHARACTER are used to specify a character string data type with a particular implementation-defined character repertoire. Special syntax (N'string') is provided for representing literals in that character repertoire. • Collation • Database Support • MySQL • Oracle • SQL Server • PostgreSQL
Demonstration • Read/Write/Examine UTF-8/UTF-16/UTF-16LE encoded text (with hex editor) • Show encoding settings in Eclipse and Java • Show how Windows (and the Eclipse console) can/can't display some characters • Web browser settings • Chinese article on cracking of SHA-1 • Martin Fowler article on Dependency Injection
Resources • The big ones: • http://www.unicode.org/Public/UNIDATA/ • http://en.wikipedia.org/wiki/Unicode • http://www.evertype.com/standards/csur • The rest: • http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp • http://en.wikibooks.org/wiki/Unicode/Character_reference • http://www.joelonsoftware.com/articles/Unicode.html • http://www.cl.cam.ac.uk/~mgk25/unicode.html • http://czyborra.com/charsets/iso646.html • http://www.fileformat.info/ (GREAT resource) • For fun: • http://www.omniglot.com/ • http://en.wikipedia.org/wiki/Constructed_language • http://talideon.com/concultures/wiki/