Encodings in Java

Working with Unicode in Java is a complex topic, but reading and writing files in different encodings is pretty easy. Java has had Unicode support since 1.0; that support was significantly overhauled in Java 1.5 to follow the Unicode 4.0 standard. For more information, some good places to look are the internationalization resources, the internationalization (i18n) guide, java.lang.Character and its inner classes, and the java.nio.charset package. For information on supplementary character support (characters above U+FFFF), see Supplementary Characters in the Java Platform and JSR 204. IBM's ICU4J (International Components for Unicode, for Java) is another useful resource.

Documentation on how Unicode actually works in Java is somewhat spotty and widely scattered - this page tries to tell you everything you need to know. This information is primarily based on Java 1.5, but much should also be applicable to 1.4.2 and earlier.

The Java documentation frequently refers to encodings as "character sets", but this is not technically correct: we are interested in particular encodings of character sets, not the abstract character repertoire. For example, Unicode is a character set; UTF-8, UTF-16, etc. are encodings.

String values - Sequences of UTF-16 code units

Most Java programmers know that String values have something to do with Unicode, but many don't know how things actually work. Strings and char[] arrays store sequences of UTF-16 code units - it takes one char to hold a Unicode character for code points U+0000 - U+FFFF and two chars (or one int) to hold a code point from U+10000 to U+10FFFF. Thus (unlike Perl or Python) a String is not a sequence of abstract Unicode code points. Java 1.4.2 and earlier don't have any API support for characters with code points above U+FFFF (supplementary characters), although the UTF-8 support will convert these characters into surrogate pairs.
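To make the difference between code units and code points concrete, here is a minimal sketch. The class name CodeUnitsDemo and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample supplementary character are only for illustration; the methods used are the ones added in Java 1.5.

```java
public class CodeUnitsDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        String s = "A" + CLEF;

        // length() counts UTF-16 code units (chars), not characters.
        System.out.println("code units:  " + s.length());

        // codePointCount() counts Unicode code points.
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // charAt(1) returns only the first half of the surrogate pair...
        System.out.println("high surrogate at 1: " + Character.isHighSurrogate(s.charAt(1)));

        // ...while codePointAt(1) combines both halves into one int.
        System.out.println("code point at 1: U+"
                + Integer.toHexString(s.codePointAt(1)).toUpperCase());
    }
}
```

The string holds two characters but three chars, which is why indexing with charAt alone is not safe for text that may contain supplementary characters.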

Java 1.5 adds support for supplementary characters. If you do processing that cares about individual characters, you need to use Character.codePointAt(char[] a, int index) or String.codePointBefore(int index) to extract the code point to pass to methods like Character.isLetter(int codePoint), which are now overloaded to work with ints as well as chars. Things like java.util.regex, java.text, string casing and font rendering all deal correctly with strings containing supplementary characters in Java 1.5.0. Note that although there is now a String constructor that accepts an array of ints representing code points, there is no simple way to convert a String to an int array representing abstract code points or to iterate over a String as a sequence of code points.
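Since the Java 1.5 API offers no ready-made conversion, the loop has to be written by hand. The following is one possible sketch; the class CodePointIteration and the helper toCodePoints are hypothetical names, not part of the JDK. It steps through the String with codePointAt and advances by Character.charCount so that a surrogate pair is consumed as a single code point.

```java
public class CodePointIteration {
    /** Returns the code points of s, treating each surrogate pair as one code point. */
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            // Advance by 1 char for U+0000..U+FFFF, by 2 chars for supplementary characters.
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        int[] cps = toCodePoints(s);
        for (int i = 0; i < cps.length; i++) {
            System.out.println("U+" + Integer.toHexString(cps[i]).toUpperCase()
                    + " isLetter=" + Character.isLetter(cps[i]));
        }
        // The int[] constructor of String reverses the conversion.
        System.out.println(new String(cps, 0, cps.length).equals(s));
    }
}
```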

byte[] and String values - Encoding and Decoding

byte[] arrays are decoded into sequences of UTF-16 code units - Strings. This can be done with the various String constructors that accept byte[]s, with a CharsetDecoder, or automatically through an InputStreamReader (which converts a stream of bytes into a stream of UTF-16 code units) wrapped around an InputStream. Likewise, Strings must be encoded into byte[]s using an OutputStreamWriter, String.getBytes() or a CharsetEncoder.
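As a small illustration of encoding and decoding, the sketch below encodes one String in two encodings and decodes the bytes again. The class name EncodeDecodeDemo is made up for this example, and it assumes Java 6 or later for the String.getBytes(Charset) and String(byte[], Charset) overloads; on Java 1.5 you would pass the encoding name as a String instead and handle UnsupportedEncodingException.

```java
import java.nio.charset.Charset;

public class EncodeDecodeDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    /** Encodes a String into bytes using the given encoding. */
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    /** Decodes bytes into a String (UTF-16 code units) using the given encoding. */
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // ends in U+00E9, e with acute accent

        byte[] utf8 = encode(s, UTF8);
        byte[] latin1 = encode(s, LATIN1);
        System.out.println("UTF-8 bytes:      " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);

        // Decoding with the matching encoding restores the original String.
        System.out.println(decode(utf8, UTF8).equals(s));

        // Decoding with the wrong encoding silently yields different text.
        System.out.println(decode(utf8, LATIN1).equals(s));
    }
}
```

The same four characters occupy a different number of bytes in each encoding, and nothing in the byte[] itself records which encoding was used, so the caller has to keep track of it.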

Java uses the "platform default charset", which can be retrieved via Charset.defaultCharset(), as the default value for various encoding-related methods. On UNIX, the default charset can be set via the LC_CTYPE environment variable - for example, to use UTF-8 by default, set LC_CTYPE to "en_US.UTF-8". This is especially important because System.out and System.err use the default encoding. Many applications also read System.in through an InputStreamReader that uses the platform default encoding, so it's important to make sure LC_CTYPE is set correctly.
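To check what the platform default actually is on a given machine, something like the following sketch can help. The class name DefaultCharsetDemo and the helper usesDefault are hypothetical, and String.getBytes(Charset) assumes Java 6 or later.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DefaultCharsetDemo {
    /** True if the no-argument getBytes() agrees with an explicit use of the default charset. */
    static boolean usesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // The value printed here depends on the platform and its locale settings.
        System.out.println("default charset: " + Charset.defaultCharset().name());

        // The no-argument getBytes() is one of the methods that falls back to the default.
        System.out.println(usesDefault("caf\u00E9"));
    }
}
```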

There are various ways to convert using something other than the default encoding. Here are the main ones:

Streams and Readers

Although the multi-layer implementation with Streams, Readers and Writers seems overly complicated, it does mean that it's somewhat easier to keep track of what's going on with your files than in Perl or Python. If you've got a Reader or Writer, then the strings you read or write are automatically decoded or encoded according to some encoding, which unless you say otherwise is the platform default encoding. And if you have an InputStream or an OutputStream, then you're reading and writing bytes. One interesting exception is PrintStream, which inherits from OutputStream yet has methods for writing strings; unless an encoding name is passed to its constructor, it uses the platform default encoding. System.out and System.err are PrintStreams.

To change the encoding for System.out or System.err, wrap it in an OutputStreamWriter. You provide the encoding name, a Charset or a CharsetEncoder. For example:

CharsetEncoder enc = Charset.forName("Big5_HKSCS").newEncoder();
enc.onMalformedInput(CodingErrorAction.REPORT);
enc.onUnmappableCharacter(CodingErrorAction.REPORT);
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, enc));

System.in is an InputStream, so to read Strings from standard input you'll always need to wrap it in something. Here's a typical use that allows you to read a line at a time from standard input:

BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));

while(true) {
   String myString = in.readLine();
   if(myString == null) { break; }
   // do some stuff with string..
}

If no encoding name, Charset or CharsetDecoder is passed to the InputStreamReader constructor, then the platform default charset is used.

Specifying the encoding for any other InputStream or OutputStream works the same way; just wrap it in an InputStreamReader or OutputStreamWriter. For example, to open a file for output and specify the use of CP1256:

PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream("outfile.txt"), "Cp1256"));

If you don't specify an encoding, or if you use the FileReader or FileWriter convenience classes, the platform default charset is used.
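Putting the pieces together, the sketch below writes a String to a file in a chosen encoding and reads it back. The class FileEncodingDemo and its helpers writeText and readText are hypothetical names, and the temporary file and the UTF-8 and UTF-16BE encodings are chosen only because every Java platform is required to support them; any other supported encoding name could be passed instead.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class FileEncodingDemo {
    /** Writes text to file, encoding it with the named encoding. */
    static void writeText(File file, String text, String charsetName) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(file), charsetName);
        try {
            out.write(text);
        } finally {
            out.close();
        }
    }

    /** Reads the whole file, decoding it with the named encoding. */
    static String readText(File file, String charsetName) throws IOException {
        Reader in = new InputStreamReader(new FileInputStream(file), charsetName);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();
        String text = "h\u00E9llo";

        writeText(file, text, "UTF-8");
        System.out.println(file.length() + " bytes as UTF-8");
        System.out.println(readText(file, "UTF-8").equals(text));

        writeText(file, text, "UTF-16BE");
        System.out.println(file.length() + " bytes as UTF-16BE");
        System.out.println(readText(file, "UTF-16BE").equals(text));
    }
}
```

Passing the encoding explicitly on both sides means the result does not depend on the platform default charset of the machine that runs the code.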

Converting Between Encodings with ICU

Using Java to natively read files in different encodings isn't always the best solution. Sometimes, Java may not support a particular encoding you want to use or you may not want to worry about the encoding in Java. It's often best to just convert your data to Unicode ahead of time.

IBM's ICU toolkit provides the most flexible and portable method of converting between various encodings. Its command-line conversion tool is called uconv.

To list all the encodings that uconv supports, run uconv -l. The online ICU Converter Explorer provides a better organized view. Despite the wide range of encodings available, even ICU does not support all known encodings.

The simplest and most common way of using ICU is to convert data from one encoding to another. To do that, just run

uconv -f from-encoding -t to-encoding < input > output

For example, to convert file.gb.txt from gb2312 to UTF-8, you'd do

uconv -f gb2312 -t utf-8 < file.gb.txt > file.utf8.txt

If uconv finds any invalid characters, by default it will report the location and stop.

Repairing Invalid Data with ICU

Frequently you run across data that is mostly encoded according to some particular encoding but has a few invalid byte sequences. Often this happens as the result of editing a file with a non-encoding-aware editor or by copying and pasting starting in the middle of a byte sequence. Sometimes this can happen because a file uses vendor extensions to a standard character set. ICU supports many of these vendor encodings and you should try to figure out which one is in use. Even if you aren't sure exactly what went wrong, uconv provides a very easy way to fix the problem.

Suppose file.gb.txt has an invalid byte sequence. For example, suppose a line of French encoded in ISO-8859-1 somehow got pasted into the file. If we have a lot of data it can be very tedious to go through by hand and find all the problems. Instead, we can just use uconv to ignore or replace invalid byte sequences. The most useful modes are substitute and skip. Substitute replaces any invalid byte sequence with a replacement character. For UTF-8 this replacement character is U+FFFD, which your browser shows as "�". Skip just ignores the invalid byte sequence entirely. Substitute has the advantage that after conversion you can grep for the replacement character and see in what contexts invalid byte sequences occurred. To use a callback to handle invalid byte sequences in the input:

uconv --from-callback callback -f from-encoding -t to-encoding < input > output

For example, to replace the invalid byte sequences in file.gb.txt with U+FFFD, do

uconv --from-callback substitute -f gb2312 -t utf-8 < file.gb.txt > file.utf8.txt

Other callbacks include stop, which is the default and simply exits when an error occurs, and a series of escape callbacks which are typically more useful for --to-callback - they write Unicode characters that can't be represented in the target encoding with various kinds of escape sequences such as &#xFFFD; for HTML and XML or \uFFFD for C or Java.
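If the data has to be handled inside Java rather than with uconv, roughly comparable behavior can be sketched with the CodingErrorAction settings of a CharsetDecoder: REPORT resembles stop, REPLACE resembles substitute and IGNORE resembles skip. This is an analogy under stated assumptions, not a description of how uconv works; the class MalformedInputDemo, the helper decode and the sample bytes are invented for the example, and the input is decoded as UTF-8 rather than gb2312 so that the sketch runs on any Java platform.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class MalformedInputDemo {
    // "caf", then a stray ISO-8859-1 byte 0xE9, then " au lait": not valid UTF-8.
    static final byte[] BAD = {
        'c', 'a', 'f', (byte) 0xE9, ' ', 'a', 'u', ' ', 'l', 'a', 'i', 't'
    };

    /** Decodes bytes as UTF-8, handling invalid input according to action. */
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // REPLACE: the invalid byte becomes the replacement character U+FFFD.
        System.out.println(decode(BAD, CodingErrorAction.REPLACE));

        // IGNORE: the invalid byte is dropped.
        System.out.println(decode(BAD, CodingErrorAction.IGNORE));

        // REPORT: decoding fails with an exception at the first invalid byte.
        try {
            decode(BAD, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("invalid input: " + e);
        }
    }
}
```

As with uconv's substitute mode, the output of REPLACE can afterwards be searched for U+FFFD to find where the invalid bytes were.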

More serious corruption often occurs as the result of running the data through a program that is not aware of the encoding. Sometimes this wreaks wholesale havoc on your data. In that case there's no choice but to fix the program so that it handles the encoding correctly or, if the source to the program isn't available and the program must still be run, to write a one-off tool to reverse the damage. Both of these can be quite challenging.