Package org.marketcetera.util.unicode

Unicode en/decoding with BOMs, including I/O stream support.


Interface Summary
Messages  The internationalization constants used by this package.
 

Class Summary
UnicodeFileReader          A UnicodeInputStreamReader which gets its input from a file.
UnicodeFileWriter          A UnicodeOutputStreamWriter which directs its output to a file.
UnicodeInputStreamReader   A variation of InputStreamReader that is BOM-aware.
UnicodeOutputStreamWriter  A variation of OutputStreamWriter that is BOM-aware.
 

Enum Summary
DecodingStrategy  A list of one or more Serialization instances.
Serialization     A list of signature/charset pairs.
Signature         A byte stream signature.
SignatureCharset  A byte stream signature (Signature) coupled with a charset (UnicodeCharset) that may follow the signature (aka a signature/charset pair).
UnicodeCharset    A thin wrapper around Charset for the Unicode charsets.
 

Package org.marketcetera.util.unicode Description

Unicode en/decoding with BOMs, including I/O stream support.

Unicode handling in Java can be confusing, especially where byte order marks (BOMs) are concerned. Java has flip-flopped on the handling of BOMs in UTF-8 (Java 6 does not provide any), provides BOMs for UTF-16 (see Charset), and does not provide BOMs for UTF-32 (unless the UTF-32LE or UTF-32BE variants are used). Also, as of Java 5, Java moved from UCS-2 to UTF-16 for its internal string representation, widening the range of byte streams that Java can en/decode. This package provides an improved framework for handling BOMs; its classes are tested with Unicode code points beyond those covered by UCS-2 (thus exercising the UTF-16 representation of Java strings).
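The inconsistencies described above can be observed with the JDK alone. The following is a minimal sketch, not part of this package; the class name JdkBomBehavior and its helper methods are made up for illustration. It covers only the UTF-8 and UTF-16 charsets from StandardCharsets, plus one code point outside UCS-2.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of org.marketcetera.util.unicode).
class JdkBomBehavior {

    /** Number of bytes the JDK encoder emits for the given text. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Decodes the bytes with the JDK decoder for the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // UTF-8 encoder: no BOM, so 'a' takes a single byte.
        System.out.println(encodedLength("a", StandardCharsets.UTF_8));    // 1
        // UTF-16 encoder: writes the BOM FE FF, then big-endian data.
        System.out.println(encodedLength("a", StandardCharsets.UTF_16));   // 4
        // UTF-16LE encoder: explicit byte order, no BOM.
        System.out.println(encodedLength("a", StandardCharsets.UTF_16LE)); // 2

        // UTF-8 decoder: a leading BOM is kept as the character U+FEFF.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(decode(utf8WithBom, StandardCharsets.UTF_8).length()); // 2

        // A code point beyond UCS-2 occupies two chars (a surrogate pair).
        String fraktur = new String(Character.toChars(0x1D50A));
        System.out.println(fraktur.length() + " chars, "
            + fraktur.codePointCount(0, fraktur.length()) + " code point");
    }
}
```

The UTF-8 case is the one that most often surprises callers: the decoded string starts with an invisible U+FEFF unless the BOM is stripped before decoding.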

Foundation

UnicodeCharset is the basic tie-in to the JDK Unicode facilities. Each instance is just a wrapper around a Unicode Charset instance, and obeys the standard Java semantics for BOMs (with their inconsistencies).

Signature detects a generic signature in the header of a byte stream. Enumeration constants are defined for the Unicode BOM signatures, but this class makes no attempt to map signatures to charsets.
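As an illustration of what signature detection amounts to, here is a minimal sketch of a prefix matcher over the standard Unicode BOM byte sequences. The enum BomSignature and its methods are hypothetical and do not reproduce the actual Signature API; like Signature, the sketch knows nothing about charsets.

```java
// Hypothetical enum (not the package's Signature class).
enum BomSignature {
    UTF32BE(0x00, 0x00, 0xFE, 0xFF),
    UTF32LE(0xFF, 0xFE, 0x00, 0x00),
    UTF8(0xEF, 0xBB, 0xBF),
    UTF16BE(0xFE, 0xFF),
    UTF16LE(0xFF, 0xFE),
    NONE(); // empty signature: matches any header

    private final byte[] mark;

    BomSignature(int... mark) {
        this.mark = new byte[mark.length];
        for (int i = 0; i < mark.length; i++) {
            this.mark[i] = (byte) mark[i];
        }
    }

    /** Number of bytes this signature occupies at the start of a stream. */
    int length() {
        return mark.length;
    }

    /** True if the header begins with this signature's bytes. */
    boolean matches(byte[] header) {
        if (header.length < mark.length) {
            return false;
        }
        for (int i = 0; i < mark.length; i++) {
            if (header[i] != mark[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE), so a header can match more than one signature; resolving that ambiguity is left to whatever orders the checks.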

SignatureCharset pairs charsets with the signatures that can mark them. Every charset can occur either without a signature (Signature.NONE) or with the BOM that the Unicode standard defines specifically for that charset.

Serialization groups a list of SignatureCharset instances. A byte source is checked against the signatures in a serialization, in order, to determine which charset to use.
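The ordered check can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the class SerializationSketch, its nested Pair class, and the list UTF8_THEN_UTF16 are hypothetical names, and only charsets available in StandardCharsets are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not the package's Serialization enum).
class SerializationSketch {

    /** A signature (possibly empty) coupled with the charset that follows it. */
    static final class Pair {
        final byte[] signature;
        final Charset charset;

        Pair(Charset charset, int... signature) {
            this.charset = charset;
            this.signature = new byte[signature.length];
            for (int i = 0; i < signature.length; i++) {
                this.signature[i] = (byte) signature[i];
            }
        }

        boolean matches(byte[] data) {
            if (data.length < signature.length) {
                return false;
            }
            for (int i = 0; i < signature.length; i++) {
                if (data[i] != signature[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Pairs are tried in list order; the last pair has an empty
    // signature and therefore acts as the fallback.
    static final List<Pair> UTF8_THEN_UTF16 = Arrays.asList(
        new Pair(StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF),
        new Pair(StandardCharsets.UTF_16BE, 0xFE, 0xFF),
        new Pair(StandardCharsets.UTF_16LE, 0xFF, 0xFE),
        new Pair(StandardCharsets.UTF_8));

    /** Decodes data with the first matching pair, skipping its signature. */
    static String decode(byte[] data, List<Pair> pairs) {
        for (Pair pair : pairs) {
            if (pair.matches(data)) {
                int skip = pair.signature.length;
                return new String(data, skip, data.length - skip, pair.charset);
            }
        }
        throw new IllegalArgumentException("No signature matched");
    }
}
```

Because the first match wins, the order of the list is part of the contract: if UTF-32 pairs were added, the UTF-32LE pair would have to precede the UTF-16LE pair, whose signature is a prefix of it.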

Finally, DecodingStrategy groups a list of Serialization instances. Functionally, it is similar to Serialization, but it occupies a higher level of abstraction, making it easy to define signature-detection strategies using Serialization instances as building blocks.
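The building-block idea can be illustrated with a small sketch in which each block either recognizes a header or declines, and a strategy is simply an ordered list of blocks. Everything here is hypothetical (the StrategySketch class, its nested Serialization interface, and the two example strategies) and is simplified so that each block holds a single signature/charset pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch (not the package's DecodingStrategy enum).
class StrategySketch {

    /** A building block: names a charset for a header, or declines. */
    interface Serialization {
        Optional<Charset> detect(byte[] header);
    }

    /** Block that accepts only headers starting with the given BOM. */
    static Serialization signed(Charset charset, int... mark) {
        return header -> {
            if (header.length < mark.length) {
                return Optional.empty();
            }
            for (int i = 0; i < mark.length; i++) {
                if (header[i] != (byte) mark[i]) {
                    return Optional.empty();
                }
            }
            return Optional.of(charset);
        };
    }

    /** Block that accepts any header (no signature required). */
    static Serialization unsigned(Charset charset) {
        return header -> Optional.of(charset);
    }

    // Two strategies sharing the same signed blocks, differing in fallback.
    static final List<Serialization> UTF8_DEFAULT = Arrays.asList(
        signed(StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF),
        signed(StandardCharsets.UTF_16BE, 0xFE, 0xFF),
        signed(StandardCharsets.UTF_16LE, 0xFF, 0xFE),
        unsigned(StandardCharsets.UTF_8));

    static final List<Serialization> UTF16_DEFAULT = Arrays.asList(
        signed(StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF),
        signed(StandardCharsets.UTF_16BE, 0xFE, 0xFF),
        signed(StandardCharsets.UTF_16LE, 0xFF, 0xFE),
        unsigned(StandardCharsets.UTF_16BE));

    /** Returns the charset named by the first block that accepts the header. */
    static Charset detect(byte[] header, List<Serialization> strategy) {
        for (Serialization block : strategy) {
            Optional<Charset> result = block.detect(header);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalArgumentException("No serialization matched");
    }
}
```

The two strategies agree on any input that carries a BOM and differ only in how unsigned input is interpreted, which is the kind of policy choice a strategy is meant to capture.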

I/O

In addition to the basic foundation, this package exposes BOM-aware readers and writers: basic converters from an input stream to a reader (UnicodeInputStreamReader) and from an output stream to a writer (UnicodeOutputStreamWriter), as well as a file-based reader (UnicodeFileReader) and writer (UnicodeFileWriter).
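To show what BOM-awareness means at the stream level, the sketch below builds a reader and a writer from JDK classes only. It does not use or reproduce the constructors of the classes named above; the class BomAwareIo and its methods are hypothetical, the reader recognizes only the UTF-8 and UTF-16 BOMs, and unsigned input is assumed to be UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not the package's reader/writer classes).
class BomAwareIo {

    /** Opens a reader that consumes a leading BOM and picks the charset from it. */
    static Reader openReader(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream shorter than the longest signature
            }
            n += r;
        }
        Charset charset = StandardCharsets.UTF_8; // assumed default
        int skip = 0;
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            skip = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            skip = 2;
        }
        // Return the non-BOM bytes to the stream so the decoder sees them.
        pushback.unread(head, skip, n - skip);
        return new InputStreamReader(pushback, charset);
    }

    /** Opens a UTF-8 writer that emits the BOM before any text. */
    static Writer openWriter(OutputStream out) throws IOException {
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    /** Convenience: decodes a whole byte array through openReader. */
    static String decodeAll(byte[] bytes) {
        try (Reader reader = openReader(new ByteArrayInputStream(bytes))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[256];
            int count;
            while ((count = reader.read(buffer)) >= 0) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience: encodes text through openWriter into a byte array. */
    static byte[] encodeAll(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = openWriter(out)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With these helpers, text written through openWriter and read back through openReader round-trips without a stray U+FEFF at the start, which is the behavior a plain InputStreamReader does not give for UTF-8.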



Copyright © 2012. All Rights Reserved.