Unicode

The basic idea behind Unicode is to be able to encode all the characters in all the world's languages with a single encoding. Sounds great, right? Unfortunately, getting all this to work and making everybody happy is hideously complicated. In fact, the Unicode standard is so complex that (to my knowledge) no single product implements all of it. For a slightly longer but very accessible introduction to Unicode see Wikipedia. Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a very nice introduction from a software engineering perspective. More technical information on Unicode, including code charts, normalization, collation and bidirectional (Hebrew and Arabic support) issues, can be found at the Unicode Home Page.

Code Points

Every character in Unicode is assigned a code point, which is written as U+XXXX where XXXX is a group of four to six hexadecimal digits. For example, LATIN CAPITAL LETTER A (the familiar A) has the code point U+0041, and the EURO SIGN (€) has the code point U+20AC. Characters from U+0000 to U+FFFF are said to be in Plane 0 or the Basic Multilingual Plane (BMP). All software supporting Unicode should support manipulation of characters in the BMP, though most current software and fonts will not support the correct display of all possible characters and combinations of characters in the BMP. Processing and display of characters outside the BMP, those with code points greater than U+FFFF, is not yet widely supported.
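
As a small illustration of the notation, here is a sketch in Python (the language is our choice; describe is a hypothetical helper name, not part of any library). It prints the U+XXXX form of a character and reports whether the character lies in the BMP.

```python
# Code points are plain integers; Python's ord() and chr() convert
# between a one-character string and its code point.
def describe(ch):
    """Return the U+XXXX notation for ch and whether it lies in the BMP."""
    cp = ord(ch)
    notation = "U+%04X" % cp      # at least four hex digits, more if needed
    in_bmp = cp <= 0xFFFF         # Plane 0 is U+0000 through U+FFFF
    return notation, in_bmp

print(describe("A"))             # LATIN CAPITAL LETTER A
print(describe("\u20ac"))        # EURO SIGN
print(describe("\U0001D11E"))    # MUSICAL SYMBOL G CLEF, outside the BMP
```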

Unicode Transformation Formats

Code points are just an abstract assignment and are not directly used in data. Instead a Unicode Transformation Format (UTF), or encoding, is applied to the code points.

UTF-8

The most common encoding of Unicode is UTF-8. It is a variable-width encoding ranging from 1 to 4 bytes. It can encode any Unicode character from U+0000 to U+10FFFF, the largest allowed code point in Unicode.

UTF-8 has many advantages. One major one is that plain 7-bit ASCII is forward-compatible with UTF-8 - a plain ASCII document is already valid UTF-8. Also, UTF-8 is byte-order independent - that is, represented the same on little-endian machines such as x86 and on big-endian machines like SPARC. Finally, comparing UTF-8 strings byte by byte gives the same order as comparing them code point by code point, so you can sort correctly (or at least in order by Unicode code point; the Unicode Collation Algorithm is a table-driven nightmare) using byte-oriented sorting routines.
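
These properties are easy to check. The following is a minimal sketch in Python; the sample characters and variable names are our own choices for illustration.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

# Number of bytes each sample character occupies in UTF-8.
widths = [len(s.encode("utf-8")) for s in samples]

# Plain ASCII text encodes to the very same bytes under both encodings.
ascii_same = "plain text".encode("ascii") == "plain text".encode("utf-8")

# Sorting the encoded bytes gives the same order as sorting by code point.
by_code_point = sorted(samples, key=ord)
by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

print(widths)
print(ascii_same)
print(by_code_point == by_bytes)
```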

The disadvantage of UTF-8 is the complexity that results from it being variable-width. In practice this isn't too bad since there is no ambiguity - if we attempt to decode a character starting in the middle of a multibyte sequence, it will simply be invalid, as opposed to giving an incorrect character as can happen in some other variable-width encodings such as Shift-JIS.
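
To see this behaviour, the sketch below (Python; try_decode is a hypothetical helper) decodes the EURO SIGN once from its first byte and once from the middle of its byte sequence.

```python
data = "\u20ac".encode("utf-8")     # EURO SIGN is three bytes: E2 82 AC

def try_decode(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

whole = try_decode(data)            # starts on a lead byte
middle = try_decode(data[1:])       # starts on a continuation byte

print(whole == "\u20ac")
print(middle)
```

Starting in the middle yields an error rather than a wrong character, because continuation bytes can never be mistaken for the start of a sequence.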

Perl uses UTF-8 as its internal encoding. UTF-8 is also the default encoding for XML. The use of UTF-8 only is highly recommended for data interchange.

Wikipedia explains how to encode Unicode code points in UTF-8.

UCS-2 and UTF-16

UCS-2 is a fixed-width two-byte encoding that simply encodes each code point from U+0000 to U+FFFF as itself. It has the advantage of being simple and fast but has the disadvantages of being byte-order dependent and of being incompatible with 7-bit ASCII.

UTF-16 is exactly like UCS-2 but allows the encoding of characters above U+FFFF with pairs of 16-bit values from the range U+D800 to U+DFFF, called surrogate pairs. Surrogate pairs complicate the processing of UTF-16 considerably, since Unicode characters may now be either 2 or 4 bytes. So UTF-16 has all the disadvantages of UCS-2 in terms of byte order and all the disadvantages of UTF-8 in terms of variable width.
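
The arithmetic behind a surrogate pair is short. Here is a sketch in Python (to_surrogates is a hypothetical helper) that splits a code point above U+FFFF into its two 16-bit halves and compares the result with Python's own big-endian UTF-16 encoder.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)        # MUSICAL SYMBOL G CLEF
encoded = "\U0001D11E".encode("utf-16-be")

print("%04X %04X" % (high, low))
print(encoded.hex())
print(len("A".encode("utf-16-be")), len(encoded))   # bytes per character
```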

Microsoft Windows uses UTF-16 as its internal encoding, but support for surrogate pairs is disabled by default. Java also uses UTF-16 as its internal encoding; Java 1.4 and before did not support surrogate pairs, but Java 1.5 does. As an internal encoding, the byte order issue is less of a problem, but the surrogate pairs issue remains problematic.

Python by default uses UCS-2 internally and is thus incapable of handling surrogate pairs, but it can be configured at build time to use UCS-4, which encodes each Unicode character in 4 bytes. It's a very simple solution but is wasteful of space - but if you're using Python you probably won't mind.

Because of the byte order issues, UTF-16 is actually slightly more complicated than this - Wikipedia explains byte order marks as well as how to encode characters using surrogate pairs.
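
As a rough illustration of the byte order issue, the Python sketch below encodes the same character in both byte orders and prefixes each with the corresponding byte order mark from the standard codecs module. The variable names are ours.

```python
import codecs

text = "A"
big = text.encode("utf-16-be")        # most significant byte first
little = text.encode("utf-16-le")     # least significant byte first

# A byte order mark (U+FEFF) at the start tells the reader which order follows.
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little

# The generic "utf-16" decoder reads the mark, picks the order and drops the mark.
print(big, little)
print(with_bom_be.decode("utf-16") == with_bom_le.decode("utf-16") == text)
```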

Some Problems of Unicode

One goal of Unicode that complicates things considerably is to enable round-trip compatibility with legacy encodings. Round-trip compatibility means the ability to convert from one character set to another and back and get the original input.

One major problem is glyph variants - more than one way to encode what is conceptually the same grapheme. For example, GB2312 includes ⒈, ⑴, and ① (1., (1) and 1 in a circle) all as separate characters, so Unicode does too.

Another issue with Unicode is precomposed characters vs. combining character sequences. Precomposed characters are characters like é where a single code point (in this case U+00E9 LATIN SMALL LETTER E WITH ACUTE) encodes something that we could build from two components, in this case U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT. Unfortunately, support for displaying and manipulating combining character sequences is still not very good. If your browser supports them, é (U+0065 followed by U+0301) should display the same as é. If not, it might show an e with an acute accent floating off to one side. Besides diacritics, there are also many ligatures with precomposed versions available. Ligatures are particular sequences of characters that are displayed differently than their individual component characters. For example, in Latin script ffi, fi, fl and ffl are usually typeset with a ligature, and some legacy encodings include a separate character for some of these, so they have separate code points in Unicode.
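
The difference between the two representations is visible from code. The following Python sketch uses the standard unicodedata module; the variable names are our own.

```python
import unicodedata

precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"       # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# They look alike on screen but are different strings of different lengths.
print(precomposed == combining, len(precomposed), len(combining))

# Composing or decomposing converts one representation into the other.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, decomposed == combining)

# The fi ligature U+FB01 carries a compatibility mapping to plain f + i.
ligature_mapping = unicodedata.decomposition("\ufb01")
print(ligature_mapping)
```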

Sometimes it's not clear whether something is a ligature or a separate grapheme - for example, ß in German or ij in Dutch. Usually, the handling of these issues depends on the language or dialect the text is in! Ligatures aren't the only language-specific issue. For example, in Swedish ä is viewed as a separate letter of the alphabet, completely independent from a; in this case we might want to retain the composed form U+00E4, Latin Small Letter A with Diaeresis. In German it's viewed as a form of the letter a; there we might prefer to use the combining character sequence U+0061 U+0308, Latin Small Letter A followed by Combining Diaeresis.

Just as important as display is representation - how do we search for something if there's more than one way to represent it? How do we even know what one character is? Is it a single code point or a single combining character sequence? How should glyph variants be treated? Different applications answer this in different ways, and usually the answer is that a single code point is a single character, because that's the easiest solution. Java 1.5 is the only platform I'm aware of that supports combining character sequences as a single logical unit.
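
To make the search problem concrete, here is a sketch in Python. The function name contains is hypothetical, and bringing both strings to the same normalization form before comparing is one possible approach, not the only one.

```python
import unicodedata

def contains(haystack, needle, form="NFC"):
    """Substring test after bringing both strings to the same form."""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return n in h

text = "cafe\u0301"     # typed with a combining acute accent
query = "caf\u00e9"     # typed with the precomposed character

naive = query in text           # code point by code point comparison
normalized = contains(text, query)

print(naive, normalized)
```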

Some things might be simpler (or at least work better) if there were no precomposed characters and everybody had to support combining character sequences. But Unicode tries to be everything to everybody (everybody who speaks languages with enough money, anyway) so there are large numbers of precomposed characters in Unicode to support the vaunted goal of Round-Trip Compatibility With Legacy Encodings. As it turns out, there's still ambiguity - even having all those precomposed characters available doesn't solve the round-trip problem. Since all the "important" precomposed characters are available, not much software fully supports combining character sequences, so minority languages still tend to have poor support. As it stands, normalization is still a big problem, and even with all the precomposed characters and glyph variants it's not always possible to ensure round-trip compatibility with most legacy character sets.

Normalization Forms

"The wonderful thing about standards is that there are so many of them to choose from." - attributed to Rear Admiral Grace Hopper

Yes, Unicode has not just one but four different normalization forms. It's not always clear which one you should use. ICU is one of the few Unicode libraries that fully supports normalization, so if you have a fixed data set you can always use the command-line uconv utility to normalize before further processing.

All of the normalization algorithms are concerned with the handling of combining character sequences. They differ in which characters they decompose and which sequences they compose.

Normalization consists of two phases. In the first phase characters are decomposed. We have to choose whether to decompose only characters with canonical decompositions or also those with compatibility decompositions. Characters with compatibility decompositions are called compatibility characters. According to the Unicode standard (section 3.6):

Compatibility characters are included in the Unicode Standard only to represent distinctions in other base standards and would not otherwise have been encoded. However, replacing a compatibility character by its decomposition may lose round-trip convertibility with a base standard.

Compatibility characters include such things as ⒈, ⑴, and ① (1., (1) and 1 in a circle) from GB2312 as well as more common things like ² and ½ from ISO-8859-1. Many represent existing graphemes with only formatting differences; others, like the Arabic Presentation Forms A and B, exist primarily for internal use by applications.

In the second phase we have to choose whether or not to recombine the characters we decomposed in the first phase.

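So there are two choices to make, which results in four possible forms:

NFD: canonical decomposition, no recomposition
NFC: canonical decomposition, followed by canonical composition
NFKD: compatibility decomposition, no recomposition
NFKC: compatibility decomposition, followed by canonical composition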

Note that NFKC does Canonical Composition, not the (nonexistent) Compatibility Composition - you would almost certainly not want to replace occurrences of (1) in the source text with ⑴, for example.
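
The effect of the two choices can be seen by running one string through all four forms. This is a sketch using Python's standard unicodedata module; the sample string and variable names are our own.

```python
import unicodedata

# One precomposed character plus one compatibility character:
# U+00E9 (e with acute) and U+2474 (PARENTHESIZED DIGIT ONE).
sample = "\u00e9\u2474"

forms = {name: unicodedata.normalize(name, sample)
         for name in ("NFD", "NFC", "NFKD", "NFKC")}

for name, value in forms.items():
    # Show each result as a list of code points so nothing is hidden.
    print(name, ["U+%04X" % ord(ch) for ch in value])
```

Note that the K forms turn the parenthesized digit into the three plain characters (1), and that composition never turns those three characters back into U+2474.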

A rather technical but important note: For combining character sequences that have more than one combining character, normalization puts them in a canonical order. Each combining character belongs to a particular combining character class. Combining characters within a class interact typographically; characters in different classes don't. Thus, the order of combining characters in the same class is significant - those that appear first are drawn closest to the base character. The order of combining characters in different classes is not significant. All the normalization forms sort combining characters in ascending order by class, retaining the original order for combining characters in the same class. The meaning of various classes is given in UCD.html from the Unicode Character Database; the class for particular characters is found in UnicodeData.txt in the fourth field.
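
A short Python sketch of canonical ordering follows. It relies on unicodedata.combining, which returns a character's combining class; the choice of accents and the variable names are ours.

```python
import unicodedata

acute = "\u0301"        # COMBINING ACUTE ACCENT
diaeresis = "\u0308"    # COMBINING DIAERESIS
dot_below = "\u0323"    # COMBINING DOT BELOW

classes = [unicodedata.combining(c) for c in (acute, diaeresis, dot_below)]
print(classes)

# Different classes: the marks are sorted into ascending class order.
reordered = unicodedata.normalize("NFD", "a" + acute + dot_below)
print(reordered == "a" + dot_below + acute)

# Same class: the original order is significant and is kept as written.
kept_1 = unicodedata.normalize("NFD", "a" + acute + diaeresis)
kept_2 = unicodedata.normalize("NFD", "a" + diaeresis + acute)
print(kept_1 != kept_2)
```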

Which Normalization Form Should I Use?

For Web data, the W3C recommends using NFC. For natural language processing, NFKC or NFKD are probably the best choices - in general the goal is to reduce sparseness as much as possible, and the existence of compatibility characters can hamper this. Of course, if the compatibility characters do represent important distinctions in the base text that could be lost (such as with IPA) then NFKC or NFKD would be inappropriate. The choice of normalization form to use can be as much philosophical as technical; ultimately, the most important thing is to use some form of normalization and use it consistently.

Performing Normalization

The best way to do normalization is to use uconv from ICU. Assuming the input data is already UTF-8:

uconv -f utf-8 -t utf-8 -x Any-FORM < input.txt > output.txt
where FORM is one of NFC, NFD, NFKC, NFKD.
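
If ICU is not at hand, something similar can be sketched with Python's standard unicodedata module. This is a substitute for uconv, not the ICU implementation, and normalize_stream is a hypothetical helper name. It works line by line, which assumes that no combining character sequence spans a line break.

```python
import io
import unicodedata

def normalize_stream(src, dst, form="NFC"):
    """Copy text from src to dst, normalizing each line to the given form."""
    for line in src:
        dst.write(unicodedata.normalize(form, line))

# In-memory demonstration; real files would be opened with encoding="utf-8".
src = io.StringIO("cafe\u0301\n")
dst = io.StringIO()
normalize_stream(src, dst, "NFC")
result = dst.getvalue()
print(result == "caf\u00e9\n")
```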