Package org.marketcetera.util.unicode
Unicode en/decoding with BOMs, including I/O stream support.
Unicode handling in Java can be confusing, especially in the
context of byte order marks (BOMs). Java
has gone back and forth
on the handling of BOMs in UTF-8 (Java 6 does not provide any),
provides BOMs for UTF-16 (see Charset), and
does not provide BOMs for UTF-32 (unless the UTF-32LE
or UTF-32BE variants are used). Also, as of Java 5, Java
moved from UCS-2 to UTF-16 for its internal string representation,
widening the range of byte streams that Java can en/decode. This
package provides an improved framework for handling BOMs; the classes
herein are tested with Unicode code points beyond those covered by
UCS-2 (thus exercising the UTF-16 representation of Java
strings).
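The BOM behavior described above can be observed with the plain JDK. The following is a minimal sketch that uses only java.nio.charset.StandardCharsets, not any class from this package; the class name BomDemo and its hex helper are hypothetical and exist only for illustration. It covers UTF-8 and UTF-16 and a code point outside UCS-2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of org.marketcetera.util.unicode.
final class BomDemo {

    // Renders bytes as space-separated upper-case hex, e.g. "FE FF 00 61".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "a";
        // The JDK UTF-8 encoder writes no BOM.
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // 61
        // The JDK UTF-16 encoder writes a big-endian BOM (FE FF).
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 61
        // The endian-specific UTF-16 variants write no BOM.
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 00 61
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16LE))); // 61 00

        // U+1D11E lies outside UCS-2, so a Java string stores it as a
        // surrogate pair: two chars, one code point.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                              // 2
        System.out.println(clef.codePointCount(0, clef.length()));      // 1
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8))); // F0 9D 84 9E
    }
}
```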
Foundation
UnicodeCharset is the basic
tie-in to the JDK Unicode facilities. Each instance is just a wrapper
around a Unicode Charset instance, and obeys
the standard Java semantics for BOMs (with their inconsistencies).
Signature is a generic
detector of signatures in the header of a byte stream. Enumeration
constants are defined for the Unicode BOM signatures, but this class
makes no attempt to map signatures to charsets.
SignatureCharset pairs
charsets with the signatures that can mark them. Every charset can
occur either without a signature (Signature.NONE), or with the BOM
defined by the Unicode standard specifically for that charset.
Serialization groups a list
of SignatureCharset
instances. A byte source is checked against the signatures in a
serialization, in order, to determine which charset to
use.
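The ordered check described above can be sketched with the plain JDK. This is an illustration of the idea only, not this package's implementation: BomSniffer, Pair, ORDERED, and detect are hypothetical names, and the ordering shown is an assumption. It does show why order matters: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signature has to be tried first.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of org.marketcetera.util.unicode.
final class BomSniffer {

    // Hypothetical signature/charset pair: BOM bytes plus a charset name.
    static final class Pair {
        final String charsetName;
        final byte[] bom;

        Pair(String charsetName, int... bom) {
            this.charsetName = charsetName;
            this.bom = new byte[bom.length];
            for (int i = 0; i < bom.length; i++) {
                this.bom[i] = (byte) bom[i];
            }
        }

        // True if the data starts with this pair's BOM.
        boolean matches(byte[] data) {
            if (data.length < bom.length) {
                return false;
            }
            for (int i = 0; i < bom.length; i++) {
                if (data[i] != bom[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Assumed ordering: 4-byte signatures come before their 2-byte prefixes.
    static final List<Pair> ORDERED = Arrays.asList(
        new Pair("UTF-32BE", 0x00, 0x00, 0xFE, 0xFF),
        new Pair("UTF-32LE", 0xFF, 0xFE, 0x00, 0x00),
        new Pair("UTF-8", 0xEF, 0xBB, 0xBF),
        new Pair("UTF-16BE", 0xFE, 0xFF),
        new Pair("UTF-16LE", 0xFF, 0xFE));

    // Returns the charset name of the first matching pair, else the fallback.
    static String detect(byte[] data, String fallback) {
        for (Pair p : ORDERED) {
            if (p.matches(data)) {
                return p.charsetName;
            }
        }
        return fallback;
    }
}
```

Note that a UTF-16LE stream whose first character after the BOM is U+0000 would be misread as UTF-32LE by this sketch; any BOM-based detection has to accept or work around that ambiguity.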
Finally, DecodingStrategy
groups a list of Serialization
instances. Functionally, it is similar to Serialization, but it occupies a higher
level of abstraction, making it easy to define signature-detection
strategies using Serialization
instances as building blocks.
I/O
In addition to the basic foundation, this package exposes readers and writers that are BOM-aware. This includes:
- Readers that can recognize a BOM, and thus automatically determine a byte stream's charset.
- Writers that can place a BOM at the header of a byte stream, thus specifying the stream's charset.
There are basic input stream/output stream converters to a reader
(UnicodeInputStreamReader) or
writer (UnicodeOutputStreamWriter), as well as
a file-based reader (UnicodeFileReader) and writer (UnicodeFileWriter).
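To make the reader and writer roles concrete, here is a rough plain-JDK sketch of the same idea for UTF-8 only. It does not use this package's classes, whose constructors are not shown here; Utf8BomIo, write, and read are hypothetical names. The writer emits the UTF-8 BOM before the text, and the reader skips the BOM if present, since the JDK's own UTF-8 decoder would otherwise pass it through as the character U+FEFF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not part of org.marketcetera.util.unicode.
final class Utf8BomIo {

    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes the UTF-8 BOM followed by the UTF-8 encoding of the text.
    static void write(OutputStream out, String text) throws IOException {
        out.write(UTF8_BOM);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush();
    }

    // Reads all text as UTF-8, dropping a leading BOM if there is one.
    static String read(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // Fill the header buffer; a single read may return fewer bytes.
        while (n < head.length) {
            int r = pb.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == head.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so they are decoded as text.
            pb.unread(head, 0, n);
        }
        Reader reader = new InputStreamReader(pb, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        for (int len; (len = reader.read(buf)) != -1; ) {
            sb.append(buf, 0, len);
        }
        return sb.toString();
    }
}
```

A round trip through write and then read returns the original text, and read also accepts input that carries no BOM. Supporting several charsets would require combining this with ordered signature detection rather than a single hard-coded BOM.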
Interface Summary
- Messages: The internationalization constants used by this package.
Class Summary
- UnicodeFileReader: A UnicodeInputStreamReader which gets its input from a file.
- UnicodeFileWriter: A UnicodeOutputStreamWriter which directs its output to a file.
- UnicodeInputStreamReader: A variation of InputStreamReader that is BOM-aware.
- UnicodeOutputStreamWriter: A variation of OutputStreamWriter that is BOM-aware.
Enum Summary
- DecodingStrategy: A list of one or more Serialization instances.
- Serialization: A list of signature/charset pairs.
- Signature: A byte stream signature.
- SignatureCharset: A byte stream signature (Signature) coupled with a charset (UnicodeCharset) that may follow the signature (aka a signature/charset pair).
- UnicodeCharset: A thin wrapper around Charset for the Unicode charsets.