public class HTMLScanner extends Object implements XMLDocumentSource, XMLLocator, HTMLComponent
This component recognizes the following features:
This component recognizes the following properties:
See Also: HTMLElements

| Modifier and Type | Class and Description |
|---|---|
| class | HTMLScanner.ContentScanner: The primary HTML document scanner. |
| class | HTMLScanner.PlainTextScanner: Special scanner used for the PLAINTEXT tag. |
| static interface | HTMLScanner.Scanner: Basic scanner interface. |
| class | HTMLScanner.ScriptScanner: Special scanner used for script tags. |
| class | HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
| Modifier and Type | Field and Description |
|---|---|
| static String | ALLOW_SELFCLOSING_IFRAME: Allows a self-closing <iframe/> tag. |
| static String | ALLOW_SELFCLOSING_TAGS: Allows self closing tags e.g. |
| static String | AUGMENTATIONS: Include infoset augmentations. |
| static String | CDATA_SECTIONS: Scan CDATA sections. |
| protected static boolean | DEBUG_CALLBACKS: Set to true to debug callbacks. |
| protected static int | DEFAULT_BUFFER_SIZE |
| static String | DEFAULT_ENCODING: Default encoding. |
| static String | DOCTYPE_PUBID: Doctype declaration public identifier. |
| static String | DOCTYPE_SYSID: Doctype declaration system identifier. |
| static String | ENCODING_TRANSLATOR: Encoding translator. |
| static String | ERROR_REPORTER: Error reporter. |
| protected int | fBeginCharacterOffset: Beginning character offset in the file. |
| protected int | fBeginColumnNumber: Beginning column number. |
| protected int | fBeginLineNumber: Beginning line number. |
| protected PlaybackInputStream | fByteStream: The playback byte stream. |
| protected HTMLScanner.Scanner | fContentScanner: Content scanner. |
| protected MiniStack<org.htmlunit.cyberneko.HTMLScanner.CurrentEntity> | fCurrentEntityStack: The current entity stack. |
| protected String | fDefaultIANAEncoding: Default encoding. |
| protected String | fDoctypePubid: Doctype declaration public identifier. |
| protected String | fDoctypeSysid: Doctype declaration system identifier. |
| protected XMLDocumentHandler | fDocumentHandler: The document handler. |
| protected int | fElementCount: Element count. |
| protected int | fElementDepth: Element depth. |
| protected EncodingTranslator | fEncodingTranslator: Encoding translator. |
| protected int | fEndCharacterOffset: Ending character offset in the file. |
| protected int | fEndColumnNumber: Ending column number. |
| protected int | fEndLineNumber: Ending line number. |
| protected HTMLErrorReporter | fErrorReporter: Error reporter. |
| protected String | fIANAEncoding: Auto-detected IANA encoding. |
| protected String | fJavaEncoding: Auto-detected Java encoding. |
| protected short | fNamesAttrs: Modify HTML attribute names. |
| protected short | fNamesElems: Modify HTML element names. |
| protected HTMLScanner.Scanner | fScanner: The current scanner. |
| protected short | fScannerState: The current scanner state. |
| protected HTMLScanner.ScriptScanner | fScriptScanner: Special scanner used for script tags. |
| protected HTMLScanner.SpecialScanner | fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
| protected XMLString | fStringBuffer: String buffer. |
| static String | HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN"). |
| static String | HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd"). |
| static String | HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN"). |
| static String | HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd"). |
| static String | HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN"). |
| static String | HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd"). |
| static String | IGNORE_SPECIFIED_CHARSET: Ignore specified charset found in the <meta equiv='Content-Type' content='text/html;charset=…'> tag or in the <? |
| static String | INSERT_DOCTYPE: Insert document type declaration. |
| static String | NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }. |
| static String | NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }. |
| protected static short | NAMES_LOWERCASE: Lowercase HTML names. |
| protected static short | NAMES_NO_CHANGE: Don't modify HTML names. |
| protected static short | NAMES_UPPERCASE: Uppercase HTML names. |
| static String | NORMALIZE_ATTRIBUTES: Normalize attribute values. |
| static String | OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers. |
| static String | PARSE_NOSCRIPT_CONTENT: Parse <noscript>... |
| static String | PLAIN_ATTRIBUTE_VALUES: Store the plain attribute values also. |
| static String | REPORT_ERRORS: Report errors. |
| static String | SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<! |
| static String | SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<! |
| protected static short | STATE_CONTENT: State: content. |
| protected static short | STATE_END_DOCUMENT: State: end document. |
| protected static short | STATE_MARKUP_BRACKET: State: markup bracket. |
| protected static short | STATE_START_DOCUMENT: State: start document. |
| static String | STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<! |
| static String | STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<! |
| protected static HTMLEventInfo | SYNTHESIZED_ITEM: Synthesized event info item. |
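The NAMES_ELEMS and NAMES_ATTRS properties accept "upper", "lower", or "default", and the NAMES_UPPERCASE, NAMES_LOWERCASE, and NAMES_NO_CHANGE constants name the corresponding modes. The standalone sketch below illustrates that idea without depending on the library. The class `NameModeSketch`, its constant values, and the choice to treat any unrecognized value as "no change" are assumptions made for illustration; they are not taken from the NekoHTML source.

```java
import java.util.Locale;

/** Toy model of the "upper" / "lower" / "default" name modes; not NekoHTML code. */
final class NameModeSketch {
    // Hypothetical stand-ins for NAMES_NO_CHANGE, NAMES_UPPERCASE and NAMES_LOWERCASE.
    // The real constant values are not given in this document.
    static final short NO_CHANGE = 0;
    static final short UPPERCASE = 1;
    static final short LOWERCASE = 2;

    /** Maps a property value to a mode; anything unrecognized is assumed to mean "no change". */
    static short namesValue(String value) {
        if ("upper".equals(value)) {
            return UPPERCASE;
        }
        if ("lower".equals(value)) {
            return LOWERCASE;
        }
        return NO_CHANGE;
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Div", namesValue("upper")));   // prints DIV
        System.out.println(modifyName("Div", namesValue("lower")));   // prints div
        System.out.println(modifyName("Div", namesValue("default"))); // prints Div
    }
}
```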
| Modifier and Type | Method and Description |
|---|---|
| void | cleanup(boolean closeall): Cleans up used resources. |
| void | evaluateInputSource(XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script). |
| static String | expandSystemId(String systemId, String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded. |
| protected static String | fixURI(String str): Fixes a platform dependent filename to standard URI form. |
| String | getBaseSystemId(): Returns the base system identifier. |
| int | getCharacterOffset(): Returns the character offset. |
| int | getColumnNumber(): Returns the current column number. |
| XMLDocumentHandler | getDocumentHandler(): Returns the document handler. |
| String | getEncoding(): Returns the encoding. |
| String | getExpandedSystemId(): Returns the expanded system identifier. |
| Boolean | getFeatureDefault(String featureId): Returns the default state for a feature. |
| int | getLineNumber(): Returns the current line number. |
| String | getLiteralSystemId(): Returns the literal system identifier. |
| protected static short | getNamesValue(String value) |
| Object | getPropertyDefault(String propertyId): Returns the default state for a property. |
| String | getPublicId(): Returns the public identifier. |
| String[] | getRecognizedFeatures(): Returns recognized features. |
| String[] | getRecognizedProperties(): Returns recognized properties. |
| protected static String | getValue(XMLAttributes attrs, String aname) |
| String | getXMLVersion(): Returns the XML version. |
| protected Augmentations | locationAugs() |
| protected static String | modifyName(String name, short mode) |
| protected String | nextContent(int len): Reads the next characters WITHOUT impacting the buffer content up to current offset. |
| void | pushInputSource(XMLInputSource inputSource): Pushes an input source onto the current entity stack. |
| protected int | readPreservingBufferContent() |
| void | reset(XMLComponentManager manager): Resets the component. |
| protected void | scanDoctype() |
| boolean | scanDocument(boolean complete): Scans a document. |
| protected int | scanEntityRef(XMLString str, XMLString plainValue, boolean content) |
| protected String | scanLiteral() |
| protected String | scanName(boolean strict) |
| protected String | scanTagName() |
| void | setDocumentHandler(XMLDocumentHandler handler): Sets the document handler. |
| void | setFeature(String featureId, boolean state): Sets a feature. |
| void | setInputSource(XMLInputSource source): Sets the input source. |
| void | setProperty(String propertyId, Object value): Sets a property. |
| protected void | setScanner(HTMLScanner.Scanner scanner) |
| protected void | setScannerState(short state) |
| protected boolean | skip(String s) |
| protected boolean | skipMarkup(boolean balance) |
| protected int | skipNewlines() |
| protected boolean | skipSpaces() |
| protected Augmentations | synthesizedAugs() |
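The summary describes expandSystemId as expanding a system id and returning it as a URI "if it can be expanded". As a rough illustration of what resolving a relative system id against a base can look like, here is a minimal sketch built only on `java.net.URI`. The class `SystemIdSketch`, its `expand` method, and the fallback of returning the input unchanged are assumptions; the library's own method may handle more cases (for example platform file names, which fixURI addresses).

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative only: resolves a relative system id against a base using java.net.URI. */
final class SystemIdSketch {

    /** Returns the resolved id, or the input unchanged when it cannot be expanded. */
    static String expand(String systemId, String baseSystemId) {
        if (systemId == null || systemId.isEmpty()) {
            return systemId;
        }
        try {
            URI uri = new URI(systemId);
            // Already absolute, or nothing to resolve against: keep as is.
            if (uri.isAbsolute() || baseSystemId == null) {
                return systemId;
            }
            return new URI(baseSystemId).resolve(uri).toString();
        } catch (URISyntaxException e) {
            // Not a valid URI reference; this sketch simply gives up.
            return systemId;
        }
    }

    public static void main(String[] args) {
        String base = "http://example.org/docs/index.html";
        System.out.println(expand("img/logo.png", base)); // prints http://example.org/docs/img/logo.png
        System.out.println(expand("http://other.example/a.html", base));
    }
}
```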
public static final String HTML_4_01_STRICT_PUBID
public static final String HTML_4_01_STRICT_SYSID
public static final String HTML_4_01_TRANSITIONAL_PUBID
public static final String HTML_4_01_TRANSITIONAL_SYSID
public static final String HTML_4_01_FRAMESET_PUBID
public static final String HTML_4_01_FRAMESET_SYSID
public static final String AUGMENTATIONS
public static final String REPORT_ERRORS
public static final String SCRIPT_STRIP_COMMENT_DELIMS
public static final String SCRIPT_STRIP_CDATA_DELIMS
public static final String STYLE_STRIP_COMMENT_DELIMS
public static final String STYLE_STRIP_CDATA_DELIMS
public static final String IGNORE_SPECIFIED_CHARSET
public static final String CDATA_SECTIONS
public static final String OVERRIDE_DOCTYPE
public static final String INSERT_DOCTYPE
public static final String PARSE_NOSCRIPT_CONTENT
public static final String ALLOW_SELFCLOSING_IFRAME
public static final String ALLOW_SELFCLOSING_TAGS
public static final String NORMALIZE_ATTRIBUTES
public static final String PLAIN_ATTRIBUTE_VALUES
public static final String NAMES_ELEMS
public static final String NAMES_ATTRS
public static final String DEFAULT_ENCODING
public static final String ERROR_REPORTER
public static final String ENCODING_TRANSLATOR
public static final String DOCTYPE_PUBID
public static final String DOCTYPE_SYSID
protected static final short STATE_CONTENT
protected static final short STATE_MARKUP_BRACKET
protected static final short STATE_START_DOCUMENT
protected static final short STATE_END_DOCUMENT
protected static final short NAMES_NO_CHANGE
protected static final short NAMES_UPPERCASE
protected static final short NAMES_LOWERCASE
protected static final int DEFAULT_BUFFER_SIZE
protected static final boolean DEBUG_CALLBACKS
protected static final HTMLEventInfo SYNTHESIZED_ITEM
protected short fNamesElems
protected short fNamesAttrs
protected String fDefaultIANAEncoding
protected HTMLErrorReporter fErrorReporter
protected EncodingTranslator fEncodingTranslator
protected String fDoctypePubid
protected String fDoctypeSysid
protected int fBeginLineNumber
protected int fBeginColumnNumber
protected int fBeginCharacterOffset
protected int fEndLineNumber
protected int fEndColumnNumber
protected int fEndCharacterOffset
protected PlaybackInputStream fByteStream
protected final MiniStack<org.htmlunit.cyberneko.HTMLScanner.CurrentEntity> fCurrentEntityStack
protected HTMLScanner.Scanner fScanner
protected short fScannerState
protected XMLDocumentHandler fDocumentHandler
protected String fIANAEncoding
protected String fJavaEncoding
protected int fElementCount
protected int fElementDepth
protected HTMLScanner.Scanner fContentScanner
protected final HTMLScanner.SpecialScanner fSpecialScanner
protected final HTMLScanner.ScriptScanner fScriptScanner
protected final XMLString fStringBuffer
public void pushInputSource(XMLInputSource inputSource)
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
Parameters:
inputSource - The new input source to start scanning.
See Also:
evaluateInputSource(XMLInputSource)

public void evaluateInputSource(XMLInputSource inputSource)
Parameters:
inputSource - The new input source to start evaluating.
See Also:
pushInputSource(XMLInputSource)

public void cleanup(boolean closeall)
Parameters:
closeall - Close all streams, including the original. This is used in cases when the application has opened the original document stream and should be responsible for closing it.

public String getEncoding()
Specified by:
getEncoding in interface XMLLocator

public String getPublicId()
Specified by:
getPublicId in interface XMLLocator

public String getBaseSystemId()
Specified by:
getBaseSystemId in interface XMLLocator

public String getLiteralSystemId()
Specified by:
getLiteralSystemId in interface XMLLocator

public String getExpandedSystemId()
Specified by:
getExpandedSystemId in interface XMLLocator

public int getLineNumber()
Specified by:
getLineNumber in interface XMLLocator
Returns:
-1 if no line number is available.

public int getColumnNumber()
Specified by:
getColumnNumber in interface XMLLocator
Returns:
-1 if no column number is available.

public String getXMLVersion()
Specified by:
getXMLVersion in interface XMLLocator

public int getCharacterOffset()
Specified by:
getCharacterOffset in interface XMLLocator
Returns:
-1 if no character offset is available.

public Boolean getFeatureDefault(String featureId)
Specified by:
getFeatureDefault in interface HTMLComponent
getFeatureDefault in interface XMLComponent
Parameters:
featureId - The feature identifier.

public Object getPropertyDefault(String propertyId)
Specified by:
getPropertyDefault in interface HTMLComponent
getPropertyDefault in interface XMLComponent
Parameters:
propertyId - The property identifier.

public String[] getRecognizedFeatures()
Specified by:
getRecognizedFeatures in interface XMLComponent

public String[] getRecognizedProperties()
Specified by:
getRecognizedProperties in interface XMLComponent

public void reset(XMLComponentManager manager) throws XMLConfigurationException
Specified by:
reset in interface XMLComponent
Parameters:
manager - The component manager.
Throws:
XMLConfigurationException

public void setFeature(String featureId, boolean state)
Specified by:
setFeature in interface XMLComponent
Parameters:
featureId - The feature identifier.
state - The state of the feature.

public void setProperty(String propertyId, Object value) throws XMLConfigurationException
Specified by:
setProperty in interface XMLComponent
Parameters:
propertyId - The property identifier.
value - The value of the property.
Throws:
XMLConfigurationException - Thrown for configuration error. In general, components should only throw this exception if it is really a critical error.

public void setInputSource(XMLInputSource source) throws IOException
Parameters:
source - The input source.
Throws:
IOException - Thrown on i/o error.

public boolean scanDocument(boolean complete) throws XNIException, IOException
Parameters:
complete - True if the scanner should scan the document completely, pushing all events to the registered document handler. A value of false indicates that the scanner should only scan the next portion of the document and return. A scanner instance is permitted to completely scan a document if it does not support this "pull" scanning model.
Throws:
IOException - Thrown on i/o error.
XNIException - on error.

public void setDocumentHandler(XMLDocumentHandler handler)
Specified by:
setDocumentHandler in interface XMLDocumentSource
Parameters:
handler - the new handler

public XMLDocumentHandler getDocumentHandler()
Specified by:
getDocumentHandler in interface XMLDocumentSource

protected static String getValue(XMLAttributes attrs, String aname)

public static String expandSystemId(String systemId, String baseSystemId)
Parameters:
systemId - The systemId to be expanded.
baseSystemId - baseSystemId

protected static String fixURI(String str)
Parameters:
str - The string to fix.

protected static short getNamesValue(String value)
protected void setScanner(HTMLScanner.Scanner scanner)
protected void setScannerState(short state)
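The scanner state constants (STATE_START_DOCUMENT, STATE_CONTENT, STATE_MARKUP_BRACKET, STATE_END_DOCUMENT) and the `complete` flag of scanDocument suggest a state machine that can either run to the end or advance one portion per call. The toy scanner below sketches that "pull" model over a plain string. Everything in it (the class `PullScanSketch`, the event strings, and what counts as one "portion") is a simplification assumed for illustration and does not reproduce the library's scanning logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy state machine showing complete vs. step-wise ("pull") scanning; not NekoHTML code. */
final class PullScanSketch {
    enum State { START_DOCUMENT, CONTENT, MARKUP_BRACKET, END_DOCUMENT }

    private final String input;
    private int offset;
    private boolean finished;
    private State state = State.START_DOCUMENT;
    /** Events recorded here stand in for callbacks to a document handler. */
    final List<String> events = new ArrayList<>();

    PullScanSketch(String input) {
        this.input = input;
    }

    /** Scans everything when complete is true, otherwise one step; returns true while input remains. */
    boolean scan(boolean complete) {
        if (finished) {
            return false;
        }
        do {
            switch (state) {
                case START_DOCUMENT:
                    events.add("startDocument");
                    state = State.CONTENT;
                    break;
                case CONTENT: {
                    if (offset >= input.length()) {
                        state = State.END_DOCUMENT;
                        break;
                    }
                    int lt = input.indexOf('<', offset);
                    if (lt == offset) {
                        state = State.MARKUP_BRACKET;
                        break;
                    }
                    int end = lt < 0 ? input.length() : lt;
                    events.add("characters:" + input.substring(offset, end));
                    offset = end;
                    break;
                }
                case MARKUP_BRACKET: {
                    int gt = input.indexOf('>', offset);
                    int end = gt < 0 ? input.length() : gt + 1;
                    events.add("markup:" + input.substring(offset, end));
                    offset = end;
                    state = State.CONTENT;
                    break;
                }
                case END_DOCUMENT:
                    events.add("endDocument");
                    finished = true;
                    return false;
            }
        } while (complete);
        return true;
    }

    public static void main(String[] args) {
        PullScanSketch scanner = new PullScanSketch("a<b>c");
        while (scanner.scan(false)) {
            // A caller could do other work between steps here.
        }
        System.out.println(scanner.events);
    }
}
```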
protected void scanDoctype() throws IOException
Throws:
IOException

protected String scanLiteral() throws IOException
Throws:
IOException

protected String scanName(boolean strict) throws IOException
Throws:
IOException

protected String scanTagName() throws IOException
Throws:
IOException

protected int scanEntityRef(XMLString str, XMLString plainValue, boolean content) throws IOException
Throws:
IOException

protected boolean skip(String s) throws IOException
Throws:
IOException

protected boolean skipMarkup(boolean balance) throws IOException
Throws:
IOException

protected boolean skipSpaces() throws IOException
Throws:
IOException

protected int skipNewlines() throws IOException
Throws:
IOException

protected final Augmentations locationAugs()

protected final Augmentations synthesizedAugs()
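The fBegin* and fEnd* fields record where an item starts and ends by line, column, and character offset. The sketch below shows one plausible way to maintain such positions while text is consumed. The class `LocationSketch`, the 1-based line and column numbering, the 0-based offset, and the handling of only `\n` as a line break are all assumptions for illustration, not a description of how the scanner actually computes them.

```java
/** Toy begin/end position tracker; the numbering conventions are assumptions. */
final class LocationSketch {
    int beginLine = 1;
    int beginColumn = 1;
    int beginOffset = 0;
    int endLine = 1;
    int endColumn = 1;
    int endOffset = 0;

    /** Consumes one token: the previous end becomes the new begin, then the end advances. */
    void consume(String token) {
        beginLine = endLine;
        beginColumn = endColumn;
        beginOffset = endOffset;
        for (int i = 0; i < token.length(); i++) {
            endOffset++;
            if (token.charAt(i) == '\n') {
                endLine++;
                endColumn = 1;
            } else {
                endColumn++;
            }
        }
    }

    public static void main(String[] args) {
        LocationSketch loc = new LocationSketch();
        loc.consume("<p>");
        loc.consume("a\nb");
        // prints 1:4 to 2:2
        System.out.println(loc.beginLine + ":" + loc.beginColumn + " to " + loc.endLine + ":" + loc.endColumn);
    }
}
```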
protected String nextContent(int len) throws IOException
Parameters:
len - the number of characters to read
Throws:
IOException - in case of io problems

protected int readPreservingBufferContent() throws IOException
Throws:
IOException

Copyright © 2024 HtmlUnit. All rights reserved.