Version 1.0
-
A graphical user interface is introduced, providing an environment for easier
configuration development and testing.
The html-to-xml processor, which is based on HtmlCleaner, now exposes attributes
for controlling the cleaner's behaviour.
-
Besides the BeanShell scripting engine, two others are added: Groovy and JavaScript.
It is now possible to choose a preferred scripting engine or even to mix engines in a
single Web-Harvest configuration. This option is supported by new attributes on the
config, script and template processors.
-
Access to the HTTP client is supported through the new implicit context variable
http.
It is now possible to check important HTTP response values, such as
http.mimeType, http.headers and http.statusCode,
or even to obtain the instance of the org.apache.commons.httpclient.HttpClient class
through http.client and manipulate it at runtime.
-
A new attribute, cookie-policy, is added to the http processor,
specifying how HttpClient manages cookies.
-
Command-line use is improved by adding several new parameters.
-
For more comfortable use of Web-Harvest context variables in the script engines'
runtime scopes, several handy methods are added to the class
org.webharvest.runtime.variables.Variable (the interface IVariable in
previous versions of Web-Harvest).
-
Several useful methods are added to the implicit Web-Harvest context variable sys,
such as sys.xpath(expression, xml), sys.isVariableDefined(varname)
and sys.defineVariable(varName, varValue, [overwrite]).
-
The attribute overwrite is added to the var-def processor,
making it possible to specify whether an existing variable with the specified name
will be overwritten or not.
-
A new processor,
<exit condition=... message=.../>, is introduced
in order to support conditional termination of execution.
-
Encoding selection in the http processor is changed: if no encoding is explicitly
specified with the charset attribute, the one given in the HTTP response is used
to read the downloaded text content.
-
The NTLM proxy authentication scheme is supported.
-
Performance improvements and bug fixes.
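
As an illustration of the var-def overwrite attribute, the http charset
fallback, the implicit http variable and the exit processor described above,
a configuration might look roughly like the sketch below. This is a sketch
under assumptions, not a tested configuration: the URL, the variable names
pageUrl and page, the status check and the message text are all hypothetical,
and it assumes that http.statusCode reflects the most recent http call.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- overwrite="false": keep the value of pageUrl if it was already defined
         (hypothetical variable name and URL) -->
    <var-def name="pageUrl" overwrite="false">http://www.example.com/</var-def>

    <var-def name="page">
        <html-to-xml>
            <!-- no charset attribute here, so the encoding announced in the
                 HTTP response is used to read the content -->
            <http url="${pageUrl}"/>
        </html-to-xml>
    </var-def>

    <!-- assumption: http.statusCode refers to the request made above;
         execution stops when the condition evaluates to true -->
    <exit condition="${http.statusCode != 200}" message="Unexpected HTTP status"/>
</config>
```

Passing overwrite="false" is useful when the same variable may already have
been supplied from outside the configuration and should not be replaced by
the default.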
Version 0.5
-
The html-to-xml parser is changed: HtmlCleaner is used instead of TagSoup. The
drawback is that some existing Web-Harvest configurations may need
corrections to their XPath or XQuery processors. On the other hand, many
previously existing problems are now solved.
-
The script processor is introduced. It adds scripting support based on
the BeanShell scripting language.
-
The template processor is now also based on BeanShell instead of
OGNL, which makes it possible to share the same variables and methods
with script processing.
-
An optional attribute, type, is now added to xq-param,
defaulting to node(). It specifies the type of the external XQuery
parameter. Before Web-Harvest 0.5 this parameter was implicitly declared
at the beginning of the XQuery expression and was always of type node()*.
From now on, each parameter defined with xq-param requires a matching
explicit declaration inside xq-expression
(declare variable $var_name as var_type external;).
-
A couple of new constructors are added to the class
ScraperConfiguration,
allowing a configuration to be loaded from a URL or from an arbitrary input stream.
-
file and include processors now support both absolute
and relative paths. File paths are regarded as absolute if they begin with X:,
/, or \, where X is a letter.
-
In order to avoid ambiguity when exchanging values with
script
and template processing, Web-Harvest variables are case-sensitive
from this version on.
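
To illustrate the xq-param change described above, the following sketch shows
a parameter with an explicit type and the matching declaration inside
xq-expression. It is only a sketch under assumptions: the URL, the variable
name page, the parameter name doc and the query itself are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- hypothetical source document, cleaned to XML -->
    <var-def name="page">
        <html-to-xml>
            <http url="http://www.example.com/"/>
        </html-to-xml>
    </var-def>

    <var-def name="links">
        <xquery>
            <!-- type is optional and defaults to node() -->
            <xq-param name="doc" type="node()">
                <var name="page"/>
            </xq-param>
            <xq-expression><![CDATA[
                (: the explicit declaration is now required; its name and
                   type must match the xq-param above :)
                declare variable $doc as node() external;
                for $a in $doc//a return string($a/@href)
            ]]></xq-expression>
        </xquery>
    </var-def>
</config>
```

Configurations written for earlier versions, which relied on the implicit
node()* declaration, would need a declare variable line added for each
xq-param.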