Package org.marketcetera.util.unicode

Unicode en/decoding with BOMs, including I/O stream support.

Unicode handling in Java can be confusing, especially where byte order marks (BOMs) are concerned. Java has flip-flopped on the handling of BOMs in UTF-8 (Java 6 does not provide any), provides BOMs for UTF-16 (see Charset), and does not provide BOMs for UTF-32 (unless the X-UTF-32LE-BOM or X-UTF-32BE-BOM variants are used). Also, as of Java 5, Java moved from UCS-2 to UTF-16 for its internal string representation, broadening the range of byte streams that Java can en/decode. This package provides an improved framework for handling BOMs; the classes herein are tested with Unicode code points beyond those covered by UCS-2 (thus exercising the new UTF-16 representation of Java strings).
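The inconsistencies described above can be observed with the JDK alone. The following is a minimal sketch, not part of this package: the class name BomEncodingDemo and its helpers are made up for illustration, and it assumes a JDK on which the UTF-32 charset is available (it is not among the charsets every Java platform must support).

```java
import java.nio.charset.Charset;

final class BomEncodingDemo {

    /** Renders bytes as space-separated, upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the named JDK charset and returns the bytes as hex. */
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // Any BOM the JDK encoder emits shows up as leading bytes.
        String[] names = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE", "UTF-32"};
        for (String name : names) {
            System.out.println(name + ": " + encodeHex("A", name));
        }

        // U+1D11E (musical symbol G clef) lies outside UCS-2, so a Java
        // string stores it as a surrogate pair: two chars, one code point.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("chars: " + clef.length()
            + ", code points: " + clef.codePointCount(0, clef.length()));
        System.out.println("UTF-8: " + encodeHex(clef, "UTF-8"));
    }
}
```

On a typical JDK, only the plain UTF-16 line starts with a BOM (FE FF); the UTF-8, UTF-16BE, and UTF-16LE lines carry none.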

Foundation

UnicodeCharset is the basic tie-in to the JDK Unicode facilities. Each instance is a thin wrapper around a Unicode Charset instance, and obeys the standard Java semantics for BOMs (inconsistencies included).

Signature detects generic signatures in the header of a byte stream. Enumeration constants are defined for the Unicode BOM signatures, but this class makes no attempt to map signatures to charsets.

SignatureCharset pairs charsets with the signatures that can mark them. Every charset can occur either without a signature (Signature.NONE) or with the BOM that the Unicode standard defines specifically for that charset.

Serialization groups a list of SignatureCharset instances. A byte source is checked against every signature in a serialization, in list order, to determine which charset to use.
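The ordered check described above can be sketched with the JDK alone. This is an illustration of the idea, not this package's implementation: BomSniffer and Candidate are hypothetical names (Candidate is only a stand-in for a signature/charset pair), and the sketch assumes the running JDK provides the UTF-32BE and UTF-32LE charsets. Order matters because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signature has to be tried first.

```java
import java.nio.charset.Charset;

final class BomSniffer {

    /** Hypothetical (signature, charset) pair used only by this sketch. */
    static final class Candidate {
        final String charsetName;
        final byte[] signature;

        Candidate(String charsetName, int... signature) {
            this.charsetName = charsetName;
            this.signature = new byte[signature.length];
            for (int i = 0; i < signature.length; i++) {
                this.signature[i] = (byte) signature[i];
            }
        }

        /** True if the data starts with this candidate's signature bytes. */
        boolean matches(byte[] data) {
            if (data.length < signature.length) {
                return false;
            }
            for (int i = 0; i < signature.length; i++) {
                if (data[i] != signature[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Checked in order; the 4-byte UTF-32 BOMs come before the 2-byte UTF-16 ones.
    static final Candidate[] CANDIDATES = {
        new Candidate("UTF-32BE", 0x00, 0x00, 0xFE, 0xFF),
        new Candidate("UTF-32LE", 0xFF, 0xFE, 0x00, 0x00),
        new Candidate("UTF-8", 0xEF, 0xBB, 0xBF),
        new Candidate("UTF-16BE", 0xFE, 0xFF),
        new Candidate("UTF-16LE", 0xFF, 0xFE),
    };

    /** Decodes data using the first matching BOM, or the fallback charset if none matches. */
    static String decode(byte[] data, String fallbackCharsetName) {
        for (Candidate c : CANDIDATES) {
            if (c.matches(data)) {
                int skip = c.signature.length; // drop the BOM itself
                return new String(data, skip, data.length - skip,
                    Charset.forName(c.charsetName));
            }
        }
        return new String(data, Charset.forName(fallbackCharsetName));
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        System.out.println(decode(utf8WithBom, "ISO-8859-1"));
    }
}
```

The fallback argument plays the role of the "no signature" case: when no BOM is found, the caller must already know, or guess, the charset.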

Finally, DecodingStrategy groups a list of Serialization instances. Functionally, it is similar to Serialization, but it occupies a higher level of abstraction, making it easy to define signature-detection strategies using Serialization instances as building blocks.

I/O

In addition to the basic foundation, this package exposes readers and writers that are BOM-aware. This includes:

  • Readers that recognize a BOM, and thus automatically determine a byte stream's charset.
  • Writers that place a BOM at the head of a byte stream, thus specifying the stream's charset.

Basic converters wrap an input stream in a reader (UnicodeInputStreamReader) or an output stream in a writer (UnicodeOutputStreamWriter); file-based counterparts are also provided as a reader (UnicodeFileReader) and a writer (UnicodeFileWriter).
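To show what BOM-aware reading and writing involve, here is a minimal round trip built only on JDK streams. It deliberately does not use the classes named above, since their constructors are not documented in this summary; BomRoundTrip and its methods are hypothetical names. The sketch is also fixed to UTF-8, whereas the readers in this package determine the charset from the BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class BomRoundTrip {

    /** Writes the text as UTF-8, preceded by a BOM (U+FEFF encodes to EF BB BF). */
    static byte[] writeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // the JDK UTF-8 encoder adds no BOM on its own
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads UTF-8 text, dropping a leading BOM if one is present. */
    static String readSkippingBom(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c = reader.read();
            // The JDK UTF-8 decoder passes a leading BOM through as U+FEFF.
            if (c != -1 && c != 0xFEFF) {
                sb.append((char) c);
            }
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = writeWithBom("hello");
        System.out.println(bytes.length + " bytes written");
        System.out.println(readSkippingBom(bytes));
    }
}
```

Appending chars one at a time keeps surrogate pairs intact, so text beyond UCS-2 survives the round trip.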