Working with Unicode in Java is a complex topic, but reading and
writing files in different encodings is pretty easy. Java has had
Unicode support since 1.0; in Java 1.5 it underwent a significant overhaul to
support the Unicode 4.0 standard. For more information, some good
places to look are the internationalization
resources, internationalization
(i18n) guide, java.lang.Character
and its inner classes and the java.nio.charset
package. For information on supplementary character support
(characters above U+FFFF), see Supplementary
Characters in the Java Platform and JSR 204. IBM's ICU4J (Internationalization
Components for Unicode, for Java) is another useful resource.
Documentation on how Unicode actually works in Java is somewhat spotty and widely scattered - this page tries to tell you everything you need to know. This information is primarily based on Java 1.5, but much should also be applicable to 1.4.2 and earlier.
The Java documentation frequently refers to encodings as "character sets", but this is not technically correct: we are interested in particular encodings of character sets, not the abstract character repertoire. For example, Unicode is a character set; UTF-8, UTF-16, etc. are encodings.
Most Java programmers know that String values have something to do
with Unicode, but many don't know how things actually work. Strings
and char[] arrays store sequences of UTF-16 code units: it takes one
char to hold a code point in the range U+0000 to
U+FFFF and two chars (or one int) to hold a code point
from U+10000 to U+10FFFF. Thus (unlike Perl
or Python) a String is not a sequence of abstract Unicode code
points. Java 1.4.2 and earlier don't have any API support for
characters with code points above U+FFFF (supplementary
characters), although the UTF-8 support will convert these characters
into surrogate pairs.
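To make the difference between code units and code points concrete, here is a minimal sketch. The class name, the helper method and the sample code points are chosen purely for illustration, and the methods used need Java 1.5 or later.

```java
class CodeUnitDemo {
    // Builds a String holding the single character with the given code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E9);            // fits in one UTF-16 code unit
        String supplementary = fromCodePoint(0x1D50A); // above U+FFFF

        System.out.println(bmp.length());              // 1 char
        System.out.println(supplementary.length());    // 2 chars (a surrogate pair)
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));       // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));        // true
    }
}
```

Note that length() counts chars, so any code that assumes one char per character will miscount strings containing supplementary characters.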
Java 1.5 adds support for supplementary characters. If you do
processing that cares about individual characters, you need to use
Character.codePointAt(char[] a, int index) or
String.codePointBefore(int index) to extract the code
point to pass to methods like Character.isLetter(int
codePoint) that now are overloaded to work with
ints as well as chars. Things like
java.util.regex, java.text, string casing
and font rendering all deal correctly with strings containing
supplementary characters in Java 1.5.0. Note that although there is
now a String constructor that accepts an array of ints
representing code points, String(int[] codePoints, int offset, int count),
there is no simple way to convert a String to an int array
representing abstract code points or to iterate over a String as a
sequence of code points.
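In practice you end up writing the loop by hand. The following is one possible sketch; the class and method names are made up for this example, and it relies only on String.codePointAt, String.codePointCount and Character.charCount from Java 1.5.

```java
class CodePointIterator {
    // Collects the code points of s, stepping over surrogate pairs as single units.
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 1 for U+0000..U+FFFF, 2 above that
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D50A)) + "b";
        int[] cps = toCodePoints(s);
        for (int i = 0; i < cps.length; i++) {
            System.out.println("U+" + Integer.toHexString(cps[i]).toUpperCase()
                    + " letter=" + Character.isLetter(cps[i]));
        }
        // Round trip through the int[] constructor added in Java 1.5.
        System.out.println(new String(cps, 0, cps.length).equals(s)); // true
    }
}
```

If you can rely on Java 8 or later, String.codePoints() should give you the same traversal as an IntStream without the manual index arithmetic.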
byte[] arrays are decoded into sequences of
UTF-16 code units - Strings. This can be done with the
various String
constructors that accept byte[]s, the CharsetDecoder,
or automatically through an InputStreamReader
(which converts a stream of bytes into a stream of UTF-16 code units)
wrapped around an InputStream.
Likewise, Strings must be
encoded into byte[]s using OutputStreamWriter,
String.getBytes()
or a CharsetEncoder.
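As a small illustration of the simplest of these routes, the sketch below round-trips a String through byte[] with an explicitly named encoding. The class and helper names are invented for the example; the String constructor and getBytes overloads are the standard ones.

```java
import java.io.UnsupportedEncodingException;

class ByteStringRoundTrip {
    // Decodes bytes using a named encoding instead of the platform default.
    static String decode(byte[] bytes, String charsetName) throws UnsupportedEncodingException {
        return new String(bytes, charsetName);
    }

    // Encodes a String using a named encoding instead of the platform default.
    static byte[] encode(String s, String charsetName) throws UnsupportedEncodingException {
        return s.getBytes(charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00E9";
        System.out.println(encode(s, "UTF-8").length);       // 5: U+00E9 takes two bytes
        System.out.println(encode(s, "ISO-8859-1").length);  // 4: one byte per character
        System.out.println(decode(encode(s, "UTF-8"), "UTF-8").equals(s)); // true
    }
}
```

The same four characters produce different byte sequences in different encodings, which is why the encoding has to be known on both sides of any byte[] boundary.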
Java uses the "platform default charset" which can be retrieved via
Charset.defaultCharset()
as the default value for various encoding-related methods. On UNIX,
the default charset can be set via the LC_CTYPE
environment variable. For example, to use UTF-8 by default, set
LC_CTYPE="en_US.UTF-8". This is especially important as
System.out and System.err use the default
encoding. Many applications use System.in wrapped in an
InputStreamReader that uses the system default encoding,
so it's important to make sure to set LC_CTYPE correctly.
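If you are not sure what default your JVM has picked up, a quick check along the following lines can help. This is only a sketch; the class name is arbitrary, and file.encoding is the system property that has traditionally reflected the default.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the charset Java falls back to when no encoding is given.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running it under different LC_CTYPE settings shows how the locale feeds into the default. Be aware that JDK 18 and later default to UTF-8 regardless of locale (JEP 400), so the locale only changes the result on older runtimes.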
There are various ways to convert using something other than the default encoding. Here are the main ones:
Use the String(byte[] bytes, String
charsetName) constructor to decode an array of bytes according
to the given encoding and getBytes(String charsetName) to
encode. If charsetName is not specified, the
platform default charset is used. If invalid byte sequences for the
given charset are encountered, the behavior is undefined.
Use a CharsetDecoder
to decode bytes to Strings and a CharsetEncoder
to encode Strings to bytes. This gives you much more control over the
conversion process. To get one of these objects, do
CharsetDecoder dec = Charset.forName(charsetName).newDecoder();
where charsetName is a supported
encoding.
You can then set a CodingErrorAction
for what to do on malformed input and what to do when the target
character set does not encode some abstract character. The default is
to report the error, which the convenience decode() and encode() methods
turn into a CharacterCodingException. The other
possibilities are to ignore the offending input or to replace it with
a configurable replacement value. For example,
dec.onUnmappableCharacter(CodingErrorAction.REPLACE);
dec.onMalformedInput(CodingErrorAction.REPORT);
The CharsetEncoder and CharsetDecoder
operate on ByteBuffer
and CharBuffer
objects rather than directly on byte or character arrays. You can
wrap a byte[] in a ByteBuffer by calling
ByteBuffer.wrap(byte[] array). Likewise you can wrap a
char[], String or StringBuffer
in a CharBuffer.
A complete example of decoding an array of bytes:
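// Needs: import java.nio.ByteBuffer; import java.nio.charset.*;
byte[] myBytes = getBytesFromSomewhere();
CharsetDecoder dec = Charset.forName("Big5_HKSCS").newDecoder();
// Report (rather than silently repair) anything that is not valid Big5-HKSCS.
dec.onMalformedInput(CodingErrorAction.REPORT);
dec.onUnmappableCharacter(CodingErrorAction.REPORT);
// decode() throws CharacterCodingException on the first reported error.
String myString = dec.decode(ByteBuffer.wrap(myBytes)).toString();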
And encoding:
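CharsetEncoder enc = Charset.forName("GB18030").newEncoder();
enc.onMalformedInput(CodingErrorAction.REPORT);
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer buf = enc.encode(CharBuffer.wrap(myString));
// Copy only the encoded bytes; buf.array() may include unused trailing capacity.
byte[] gbBytes = new byte[buf.remaining()];
buf.get(gbBytes);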
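One handy use of REPORT is checking whether some bytes are actually valid in the encoding you think they are in. The sketch below wraps that check in a helper; the class and method names and the choice to return null on failure are assumptions made for this example, not part of the Java API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    // Returns the decoded text, or null if the bytes are not valid in the given charset.
    static String decodeOrNull(byte[] bytes, String charsetName) {
        CharsetDecoder dec = Charset.forName(charsetName).newDecoder();
        dec.onMalformedInput(CodingErrorAction.REPORT);
        dec.onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return dec.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null; // malformed or unmappable input was reported
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes   = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 }; // "caf" + U+00E9 in UTF-8
        byte[] latin1Bytes = { 'c', 'a', 'f', (byte) 0xE9 };              // same text in ISO-8859-1

        System.out.println(decodeOrNull(utf8Bytes, "UTF-8") != null);        // true
        System.out.println(decodeOrNull(latin1Bytes, "UTF-8") != null);      // false: 0xE9 starts a truncated sequence
        System.out.println(decodeOrNull(latin1Bytes, "ISO-8859-1") != null); // true
    }
}
```

Note that a successful decode only tells you the bytes are well-formed in that encoding, not that the encoding is the right one: almost any byte sequence decodes without error as ISO-8859-1.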
Although the multi-layer implementation with Streams, Readers and
Writers seems overly complicated, it does mean that it's somewhat
easier to keep track of what's going on with your files than in Perl
or Python. If you've got a Reader
or Writer
then the strings you read or write are automatically decoded or
encoded according to some encoding, which unless you say otherwise is
the platform default encoding. And if you have an InputStream
or an OutputStream,
then you're reading and writing bytes. One interesting exception is PrintStream,
which inherits from OutputStream yet has methods for writing strings;
unless it is constructed with an explicit encoding name, it uses the
platform default encoding. System.out
and System.err are PrintStreams.
To change the encoding for System.out or
System.err, wrap it in an OutputStreamWriter. You
can provide the encoding name, a Charset or a
CharsetEncoder. For example:
CharsetEncoder enc = Charset.forName("Big5_HKSCS").newEncoder();
enc.onMalformedInput(CodingErrorAction.REPORT);
enc.onUnmappableCharacter(CodingErrorAction.REPORT);
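// The second argument turns on auto-flush for println(); otherwise call out.flush() before exiting.
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, enc), true);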
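If you would rather keep a PrintStream (for example to pass to code that expects one), another option is the PrintStream constructor that takes an encoding name. The following sketch assumes nothing beyond that constructor; the class and helper names are made up, and the in-memory stream is only there so that the resulting bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EncodedPrintStream {
    // Wraps any OutputStream in an auto-flushing PrintStream that encodes with charsetName.
    static PrintStream wrap(OutputStream out, String charsetName)
            throws UnsupportedEncodingException {
        return new PrintStream(out, true, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Text printed here is written as UTF-8 whatever the platform default is.
        PrintStream utf8Out = wrap(System.out, "UTF-8");
        utf8Out.println("caf\u00E9");

        // Capture the bytes in memory to see the encoding at work.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream p = wrap(sink, "UTF-8");
        p.print("\u00E9");
        p.flush();
        System.out.println(sink.size()); // 2: U+00E9 is two bytes in UTF-8
    }
}
```

Whether the terminal then displays the bytes correctly is a separate matter: it has to be configured for the same encoding.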
System.in is an InputStream,
so to read Strings from standard input you'll always need to wrap it
in something. Here's a typical use that allows you to read a line at a
time from standard input:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
String myString;
// readLine() returns null at end of input
while ((myString = in.readLine()) != null) {
    // do some stuff with the string...
}
If no encoding name, Charset or
CharsetDecoder is passed to the
InputStreamReader constructor, then the platform default
charset is used.
Specifying the encoding for any other InputStream or OutputStream works the same way; just wrap it in an InputStreamReader or OutputStreamWriter. For example, to open a file for output and specify the use of CP1256:
PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream("outfile.txt"), "Cp1256"));
If you don't specify an encoding, or if you use the FileReader or FileWriter convenience classes, the platform default charset is used.
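Putting the reading and writing halves together, here is a rough sketch that writes a file in one encoding and reads it back. The class and method names are hypothetical, and ISO-8859-1 is used instead of Cp1256 only because every Java runtime is required to support it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class FileEncodingRoundTrip {
    // Writes text to file, encoding it with the named charset.
    static void writeText(File file, String text, String charsetName) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(file), charsetName);
        try {
            out.write(text);
        } finally {
            out.close(); // closing flushes the encoder and the underlying stream
        }
    }

    // Reads the whole file, decoding it with the named charset.
    static String readText(File file, String charsetName) throws IOException {
        Reader in = new InputStreamReader(new FileInputStream(file), charsetName);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("encoding-demo", ".txt");
        f.deleteOnExit();
        String text = "caf\u00E9";
        writeText(f, text, "ISO-8859-1");
        System.out.println(f.length());                             // 4 bytes on disk
        System.out.println(readText(f, "ISO-8859-1").equals(text)); // true
        System.out.println(readText(f, "UTF-8").equals(text));      // false: wrong decoder
    }
}
```

The last line shows the typical failure mode: reading with the wrong encoding does not raise an error, it just quietly gives you different text.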
Using Java to natively read files in different encodings isn't always the best solution. Sometimes, Java may not support a particular encoding you want to use or you may not want to worry about the encoding in Java. It's often best to just convert your data to Unicode ahead of time.
IBM's ICU toolkit provides the most flexible and portable method of converting between various encodings; its command-line converter is called uconv.
To list all the encodings that uconv supports, run uconv
-l. The online ICU
Converter Explorer provides a better organized
view. Despite the wide range of encodings available, even ICU
does not support all known encodings.
The simplest and most common way of using ICU is to convert data from one character set to another. To do that, just run
uconv -f from-encoding -t to-encoding < input > output
For example, to convert file.gb.txt from gb2312 to UTF-8, you'd do
uconv -f gb2312 -t utf-8 < file.gb.txt > file.utf8.txt
If uconv finds any invalid characters, by default it will report the location and stop.
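For comparison, the same kind of stream conversion can be sketched in plain Java when uconv is not available and Java supports both encodings. This is an illustration only, not a reimplementation of uconv; the class name and command-line handling are invented for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class Transcode {
    // Copies in to out, decoding with fromEncoding and re-encoding with toEncoding.
    static void transcode(InputStream in, String fromEncoding,
                          OutputStream out, String toEncoding) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, fromEncoding));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, toEncoding));
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Transcode from-encoding to-encoding < input > output");
            System.exit(1);
        }
        transcode(System.in, args[0], System.out, args[1]);
    }
}
```

One difference worth knowing: an InputStreamReader created from an encoding name does not stop on invalid input the way uconv does by default; it silently substitutes a replacement character. If you want the stop-on-error behavior, pass a CharsetDecoder configured with CodingErrorAction.REPORT instead of the name.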
Frequently you run across data that is mostly encoded according to some particular encoding but has a few invalid byte sequences. Often this happens as the result of editing a file with a non-encoding-aware editor or by copying and pasting starting in the middle of a byte sequence. Sometimes this can happen because a file uses vendor extensions to a standard character set. ICU supports many of these vendor encodings and you should try to figure out which one is in use. Even if you aren't sure exactly what went wrong, uconv provides a very easy way to fix the problem.
Suppose file.gb.txt has an invalid byte sequence. For
example, suppose a line of French encoded in ISO-8859-1 somehow got
pasted into the file. If we have a lot of data it can be very tedious
to go through by hand and find all the problems. Instead, we can just
use uconv to ignore or replace invalid byte sequences. The most useful
modes are substitute and
skip. Substitute replaces any invalid byte
sequences with a replacement character. For UTF-8 this
replacement character is U+FFFD, which your browser shows
as "�". Skip just ignores the invalid byte
sequence entirely. Substitute has the advantage that
after conversion you can grep for the replacement character and see in
what contexts invalid byte sequences occurred. To use a callback to
handle invalid byte sequences in the input:
uconv --from-callback callback -f from-encoding -t to-encoding < input > output
For example, to replace the invalid byte sequences in
file.gb.txt with U+FFFD, do
uconv --from-callback substitute -f gb2312 -t utf-8 < file.gb.txt > file.utf8.txt
Other callbacks include stop, which is the default and
simply exits when an error occurs, and a series of
escape callbacks which are typically more useful for
--to-callback - they write Unicode characters that can't
be represented in the target encoding with various kinds of escape
sequences such as &#xFFFD; for HTML and XML or
\uFFFD for C or Java.
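The substitute and skip callbacks have close counterparts in Java's CodingErrorAction.REPLACE and CodingErrorAction.IGNORE, so the same clean-up can be sketched in Java. The class and helper names below are assumptions made for the example, and countReplacements plays the role of grepping the output for U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class LenientDecode {
    // Decodes bytes, applying the same action to malformed and unmappable input.
    static String decode(byte[] bytes, String charsetName, CodingErrorAction action)
            throws CharacterCodingException {
        return Charset.forName(charsetName).newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Counts U+FFFD characters so you can see how much of the input was invalid.
    static int countReplacements(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] damaged = { 'a', (byte) 0xFF, 'b' }; // 0xFF can never appear in UTF-8

        String substituted = decode(damaged, "UTF-8", CodingErrorAction.REPLACE);
        String skipped = decode(damaged, "UTF-8", CodingErrorAction.IGNORE);

        System.out.println(substituted.length());           // 3: 'a', U+FFFD, 'b'
        System.out.println(countReplacements(substituted)); // 1
        System.out.println(skipped);                        // ab
    }
}
```

As with uconv, replacing is usually the safer choice because the damage stays visible in the output, whereas ignoring leaves no trace of where bytes were dropped.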
More serious corruption often occurs as the result of running the data through a program that is not aware of the encoding. Sometimes this wreaks wholesale havoc on your data. If you have the source, there's no choice but to fix the program so that it handles the encoding correctly; if the source isn't available and the program must be run, you have to write a one-off tool to reverse the damage. Both of these can be quite challenging.