com.googlecode.fascinator.harvester.filesystem
Class FileSystemHarvester

java.lang.Object
  extended by com.googlecode.fascinator.common.harvester.impl.GenericHarvester
      extended by com.googlecode.fascinator.harvester.filesystem.FileSystemHarvester
All Implemented Interfaces:
Harvester, Plugin

public class FileSystemHarvester
extends GenericHarvester

This plugin harvests files in a specified directory, or a single specified file, on the local file system. It can use a cache to do incremental harvests, which only harvest files that have changed since the last time it was run.

Configuration

Sample configuration file for file system harvester: local-files.json

Option: baseDir
  Description: Path of directory or file to be harvested
  Required: Yes
  Default: ${user.home}/Documents/public/

Option: facetDir
  Description: Used to specify the top level directory for the file_path facet
  Required: No
  Default: ${user.home}/Documents/public/

Option: ignoreFilter
  Description: Pipe-separated ('|') list of filename patterns to ignore
  Required: No
  Default: .svn|.ice|.*|~*|Thumbs.db|.DS_Store

Option: recursive
  Description: Set true to harvest files recursively
  Required: No
  Default: true

Option: force
  Description: Force harvest the specified directory or file again even when it's not modified (ignore cache)
  Required: No
  Default: false

Option: link
  Description: Store the digital object as a link in the storage and point to the original file in the file system
  Required: No
  Default: true

Option: caching
  Description: Caching method to use. Valid entries are 'basic' and 'hashed'
  Required: No
  Default: null

Option: cacheId
  Description: The cache ID to use in the database if caching is in use.
  Required: Yes (if valid 'caching' value is provided)
  Default: null

Option: derbyHome
  Description: Path to use for the file store of the database. Should match other Derby paths provided in the configuration file for the application.
  Required: Yes (if valid 'caching' value is provided)
  Default: null
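
The ignoreFilter option can be illustrated with a small sketch. It is not the plugin's own code: IgnoreFilterSketch and isIgnored are hypothetical names, and the sketch assumes that each pattern is a simple wildcard in which '*' matches any run of characters and everything else is literal. The plugin's actual matching rules may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Fascinator API.
// Assumption: '*' is the only wildcard; all other characters are literal.
class IgnoreFilterSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    IgnoreFilterSketch(String ignoreFilter) {
        // Split the option value on the literal '|' separator
        for (String glob : ignoreFilter.split("\\|")) {
            if (glob.isEmpty()) {
                continue;
            }
            StringBuilder regex = new StringBuilder();
            for (char c : glob.toCharArray()) {
                if (c == '*') {
                    regex.append(".*");
                } else {
                    // Quote each literal character so '.' is not a regex wildcard
                    regex.append(Pattern.quote(String.valueOf(c)));
                }
            }
            patterns.add(Pattern.compile(regex.toString()));
        }
    }

    // True if the whole file name matches any one of the patterns
    boolean isIgnored(String fileName) {
        for (Pattern p : patterns) {
            if (p.matcher(fileName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IgnoreFilterSketch filter =
                new IgnoreFilterSketch(".svn|.ice|.*|~*|Thumbs.db|.DS_Store");
        System.out.println(filter.isIgnored("Thumbs.db"));
        System.out.println(filter.isIgnored("report.pdf"));
    }
}
```

Under this assumption the default value ignores every file name that starts with '.' or '~', plus Thumbs.db.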

Caching

With regard to the underlying cache, you have three options for configuration:
  1. No caching: All files will always be harvested. Be aware that without caching there is no support for deletion.
  2. Basic caching: The file is considered 'cached' if the last modified date matches the database entry. On some operating systems (like Linux) this limits change detection to a granularity of around 2 seconds. For most purposes this is sufficient, and this cache is the most efficient.
  3. Hashed caching: The entire contents of the file are SHA hashed and the hash is stored in the database. The file is considered cached if the old hash matches the new hash. This approach will only trigger a harvest if the contents of the file really change, but it is quite slow across large data sets and large files.
Deletion support is provided by any configured cache. After the standard harvest is performed, any 'stale' cache entries are considered to be targets for deletion. This is why the 'cacheId' is particularly important: you don't want cache entries from a different harvest configuration getting deleted.
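
The caching and deletion behaviour described above can be sketched as follows. This is a minimal illustration, not the plugin's implementation: CacheSketch and all of its methods are hypothetical, an in-memory map stands in for the Derby database, and SHA-256 is an assumption because the documentation only says 'SHA'.

```java
import java.io.File;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the in-memory map stands in for the cache database.
class CacheSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final Set<String> seen = new HashSet<>();

    // 'basic' style fingerprint: the last modified time of the file
    static String basicFingerprint(File file) {
        return String.valueOf(file.lastModified());
    }

    // 'hashed' style fingerprint: a hex digest of the whole file contents.
    // Assumption: SHA-256; the plugin may use a different SHA variant.
    static String hashedFingerprint(byte[] contents) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(contents)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Record the fingerprint and report whether it differs from the stored one
    boolean needsHarvest(String path, String fingerprint) {
        seen.add(path);
        String previous = entries.put(path, fingerprint);
        return !fingerprint.equals(previous);
    }

    // After a harvest pass, entries that were not seen are stale:
    // remove them from the cache and return them as deletion targets.
    Set<String> endPass() {
        Set<String> stale = new HashSet<>(entries.keySet());
        stale.removeAll(seen);
        entries.keySet().removeAll(stale);
        seen.clear();
        return stale;
    }
}
```

A separate CacheSketch instance per harvest configuration plays the role that 'cacheId' plays in the real cache: stale entries are only ever computed against files seen by the same configuration.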

Examples

  1. Harvest the ${user.home}/Documents/public/ directory recursively, including files in subdirectories. Files whose names match a pattern specified in ignoreFilter are ignored, and unmodified files are not re-harvested if they already exist in the cache database under the 'default' cache.
       "harvester": {
          "type": "file-system",
          "file-system": {
              "targets": [
                  {
                      "baseDir": "${user.home}/Documents/public/",
                      "facetDir": "${user.home}/Documents/public/",
                      "ignoreFilter": ".svn|.ice|.*|~*|Thumbs.db|.DS_Store",
                      "recursive": true,
                      "force": false,
                      "link": true
                  }
              ],
              "caching": "basic",
              "cacheId": "default",
              "derbyHome" : "${fascinator.home}/database"
          }
      }
     

Rule file

Sample rule file for the file system harvester: local-files.py

Wiki Link

None

Author:
Oliver Lucido

Constructor Summary
FileSystemHarvester()
          File System Harvester Constructor
 
Method Summary
 Set<String> getDeletedObjectIdList()
          Delete cached references to files which no longer exist and return the set of IDs to delete from the system.
 Set<String> getObjectIdList()
          Harvest the next set of files, and return their Object IDs
 boolean hasMoreDeletedObjects()
          Check if there are more objects to delete
 boolean hasMoreObjects()
          Check if there are more objects to harvest
 void init()
          Initialisation of File system harvester plugin
 void shutdown()
          Shutdown the plugin
 
Methods inherited from class com.googlecode.fascinator.common.harvester.impl.GenericHarvester
getId, getJsonConfig, getName, getObjectId, getPluginDetails, getStorage, init, init, setStorage
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FileSystemHarvester

public FileSystemHarvester()
File System Harvester Constructor

Method Detail

init

public void init()
          throws HarvesterException
Initialisation of File system harvester plugin

Specified by:
init in class GenericHarvester
Throws:
HarvesterException - if the plugin fails to initialise

shutdown

public void shutdown()
              throws HarvesterException
Shutdown the plugin

Specified by:
shutdown in interface Plugin
Overrides:
shutdown in class GenericHarvester
Throws:
HarvesterException - if there are errors

getObjectIdList

public Set<String> getObjectIdList()
                            throws HarvesterException
Harvest the next set of files, and return their Object IDs

Returns:
The set of object IDs just harvested
Throws:
HarvesterException - if there are errors

hasMoreObjects

public boolean hasMoreObjects()
Check if there are more objects to harvest

Returns:
true if there are more, false otherwise

getDeletedObjectIdList

public Set<String> getDeletedObjectIdList()
                                   throws HarvesterException
Delete cached references to files which no longer exist and return the set of IDs to delete from the system.

Specified by:
getDeletedObjectIdList in interface Harvester
Overrides:
getDeletedObjectIdList in class GenericHarvester
Returns:
The set of object IDs deleted
Throws:
HarvesterException - if there are errors

hasMoreDeletedObjects

public boolean hasMoreDeletedObjects()
Check if there are more objects to delete

Specified by:
hasMoreDeletedObjects in interface Harvester
Overrides:
hasMoreDeletedObjects in class GenericHarvester
Returns:
true if there are more, false otherwise
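
The four batch methods above suggest a calling pattern along the following lines. This is a sketch under stated assumptions, not Fascinator's own client code: BatchSource, InMemorySource and HarvestLoopSketch are hypothetical stand-ins that mirror the method names documented here, the checked HarvesterException is omitted for brevity, and the choice to fetch a batch before checking for more is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical interface mirroring the four documented method names.
// The real methods declare HarvesterException; it is left out here.
interface BatchSource {
    Set<String> getObjectIdList();
    boolean hasMoreObjects();
    Set<String> getDeletedObjectIdList();
    boolean hasMoreDeletedObjects();
}

// In-memory stand-in so the loop can be run without a file system
class InMemorySource implements BatchSource {
    private final Deque<Set<String>> harvested = new ArrayDeque<>();
    private final Deque<Set<String>> deleted = new ArrayDeque<>();

    void addHarvestBatch(String... ids) {
        harvested.add(new LinkedHashSet<>(Arrays.asList(ids)));
    }

    void addDeleteBatch(String... ids) {
        deleted.add(new LinkedHashSet<>(Arrays.asList(ids)));
    }

    public Set<String> getObjectIdList() {
        Set<String> next = harvested.poll();
        return next == null ? new LinkedHashSet<String>() : next;
    }

    public boolean hasMoreObjects() {
        return !harvested.isEmpty();
    }

    public Set<String> getDeletedObjectIdList() {
        Set<String> next = deleted.poll();
        return next == null ? new LinkedHashSet<String>() : next;
    }

    public boolean hasMoreDeletedObjects() {
        return !deleted.isEmpty();
    }
}

class HarvestLoopSketch {
    // Collect every harvested ID, one batch at a time
    static Set<String> drainHarvested(BatchSource source) {
        Set<String> all = new LinkedHashSet<>();
        do {
            all.addAll(source.getObjectIdList());
        } while (source.hasMoreObjects());
        return all;
    }

    // Collect every ID flagged for deletion, one batch at a time
    static Set<String> drainDeleted(BatchSource source) {
        Set<String> all = new LinkedHashSet<>();
        do {
            all.addAll(source.getDeletedObjectIdList());
        } while (source.hasMoreDeletedObjects());
        return all;
    }

    public static void main(String[] args) {
        InMemorySource source = new InMemorySource();
        source.addHarvestBatch("oid-1", "oid-2");
        source.addHarvestBatch("oid-3");
        source.addDeleteBatch("oid-0");
        System.out.println("Harvested: " + drainHarvested(source));
        System.out.println("Deleted: " + drainDeleted(source));
    }
}
```

The deletion pass runs after the harvest pass, matching the description that stale cache entries are identified once the standard harvest has been performed.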


Copyright © 2009-2013. All Rights Reserved.