| Interface | Description |
|---|---|
| Messages | The internationalization constants used by this package. |

| Class | Description |
|---|---|
| UnicodeFileReader | A UnicodeInputStreamReader which gets its input from a file. |
| UnicodeFileWriter | A UnicodeOutputStreamWriter which directs its output to a file. |
| UnicodeInputStreamReader | A variation of InputStreamReader that is BOM-aware. |
| UnicodeOutputStreamWriter | A variation of OutputStreamWriter that is BOM-aware. |

| Enum | Description |
|---|---|
| DecodingStrategy | A list of one or more Serialization instances. |
| Serialization | A list of signature/charset pairs. |
| Signature | A byte stream signature. |
| SignatureCharset | A byte stream signature (Signature) coupled with a charset (UnicodeCharset) that may follow the signature (aka a signature/charset pair). |
| UnicodeCharset | A thin wrapper around Charset for the Unicode charsets. |

Unicode en/decoding with BOMs, including I/O stream support.
Unicode handling in Java can be confusing, especially where byte order
marks (BOMs) are concerned. Java has flip-flopped
on the handling of BOMs in UTF-8 (Java 6 does not provide any),
provides BOMs for UTF-16 (see Charset), and
does not provide BOMs for UTF-32 (unless the UTF-32LE
or UTF-32BE variants are used). Also, as of Java 5, Java
moved from UCS-2 to UTF-16 for its internal string representation,
widening the range of byte streams that Java can encode and decode. This
package provides an improved framework for handling BOMs; its classes
are tested with Unicode code points beyond those covered by
UCS-2 (thus exercising the UTF-16 representation of Java
strings).
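
The inconsistencies described above can be observed with the JDK alone. The following is a minimal sketch, assuming Java 7 or later (for `StandardCharsets`); the class name `JdkBomBehavior` and its helper methods are hypothetical and are not part of this package. It shows that the JDK's UTF-16 encoder prepends a BOM while its UTF-8 encoder does not, that the UTF-8 decoder leaves a leading BOM in the decoded text, and that a code point beyond UCS-2 occupies two `char` values in a Java string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it uses only JDK calls, not this package's API.
class JdkBomBehavior {
    /** Encodes with the JDK "UTF-16" charset, whose encoder prepends a big-endian BOM (FE FF). */
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    /** Encodes with the JDK "UTF-8" charset, whose encoder writes no BOM. */
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8; a leading BOM is not stripped and surfaces as the char 0xFEFF. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Builds a string holding one code point; values above 0xFFFF need a surrogate pair. */
    static String ofCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 bytes for \"A\": " + encodeUtf16("A").length);
        System.out.println("UTF-8 bytes for \"A\": " + encodeUtf8("A").length);

        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        System.out.println("chars decoded from BOM + \"A\": " + decodeUtf8(withBom).length());

        String beyondUcs2 = ofCodePoint(0x1F600);
        System.out.println("char count: " + beyondUcs2.length()
                + ", code point count: " + beyondUcs2.codePointCount(0, beyondUcs2.length()));
    }
}
```
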

UnicodeCharset is the basic
tie-in to the JDK Unicode facilities. Each instance is just a wrapper
around a Unicode Charset instance and obeys
the standard Java semantics for BOMs (with their inconsistencies).

Signature detects a generic
signature in the header of a byte stream. Enumeration
constants are defined for the Unicode BOM signatures, but this enum
makes no attempt to map signatures to charsets.

SignatureCharset pairs
charsets with the signatures that can mark them. Every charset can
occur either without a signature (Signature.NONE) or with the BOM
that the Unicode standard defines specifically for that charset.

Serialization groups a list
of SignatureCharset
instances. A byte source is checked against all signatures in a
serialization, in order, to determine the charset to
use.
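
The ordered check described above can be sketched roughly as follows. This is an illustration of the idea only, not this package's implementation: `BomSniffer`, `Pair`, and `detect` are hypothetical names, the UTF-8 fallback is an arbitrary choice, and the sketch assumes the running JDK supplies the UTF-32BE and UTF-32LE charsets (common JDKs do, but they are not among the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of ordered signature matching; not this package's API.
class BomSniffer {
    /** A signature/charset pair: the bytes that may open a stream, and the charset they imply. */
    static final class Pair {
        final Charset charset;
        final byte[] signature;

        Pair(Charset charset, int... signature) {
            this.charset = charset;
            this.signature = new byte[signature.length];
            for (int i = 0; i < signature.length; i++) {
                this.signature[i] = (byte) signature[i];
            }
        }

        /** True if head starts with this signature; an empty signature matches anything. */
        boolean matches(byte[] head) {
            if (head.length < signature.length) {
                return false;
            }
            for (int i = 0; i < signature.length; i++) {
                if (head[i] != signature[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
    // so the longer signature has to be tried first.
    static final List<Pair> PAIRS = Arrays.asList(
            new Pair(Charset.forName("UTF-32BE"), 0x00, 0x00, 0xFE, 0xFF),
            new Pair(Charset.forName("UTF-32LE"), 0xFF, 0xFE, 0x00, 0x00),
            new Pair(StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF),
            new Pair(StandardCharsets.UTF_16BE, 0xFE, 0xFF),
            new Pair(StandardCharsets.UTF_16LE, 0xFF, 0xFE),
            new Pair(StandardCharsets.UTF_8)); // no signature: assumed fallback

    /** Returns the charset of the first pair whose signature matches the head of the stream. */
    static Charset detect(byte[] head) {
        for (Pair pair : PAIRS) {
            if (pair.matches(head)) {
                return pair.charset;
            }
        }
        throw new IllegalStateException("unreachable: the last pair has an empty signature");
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 'A', 0x00};
        System.out.println(detect(head));
    }
}
```

Placing the signature-less pair last gives the list a default, which mirrors the idea of a pair that carries no signature at all.
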
Finally, DecodingStrategy
groups a list of Serialization
instances. Functionally, it is similar to Serialization, but it occupies a higher
level of abstraction, making it easy to define signature-detection
strategies using Serialization
instances as building blocks.

In addition to this foundation, the package exposes BOM-aware readers and writers:
basic converters from an input stream to a reader
(UnicodeInputStreamReader) and from an output stream to a
writer (UnicodeOutputStreamWriter), as well as
a file-based reader (UnicodeFileReader) and writer (UnicodeFileWriter).
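
To illustrate what BOM-awareness means on the reading side, here is a minimal sketch built only on JDK classes. It handles UTF-8 alone and is not how this package's readers are implemented; `BomSkippingReaderSketch`, `openUtf8`, and `decode` are hypothetical names. The sketch peeks at the first three bytes, consumes them if they form the UTF-8 BOM, and otherwise pushes them back before handing the stream to a plain `InputStreamReader`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a BOM-skipping UTF-8 reader; not this package's API.
class BomSkippingReaderSketch {
    /** Wraps the stream in a UTF-8 reader, dropping a leading EF BB BF if present. */
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back to the stream
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    /** Convenience helper for the example: decodes a byte array through openUtf8. */
    static String decode(byte[] bytes) {
        try (Reader reader = openUtf8(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = reader.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decode(withBom));
    }
}
```
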
Copyright © 2015. All Rights Reserved.