Class CmdLineCrawler
java.lang.Object
org.lockss.laaws.crawler.impl.pluggable.CmdLineCrawler
- All Implemented Interfaces:
PluggableCrawler
- Direct Known Subclasses:
WgetCmdLineCrawler
A base implementation of a command-line crawler.
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
static interface
static class
-
Field Summary
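Fields
Modifier and Type, Field, Description
static final String ATTR_COMPRESS_WARC
static final String ATTR_COMPRESSED_WARC_FILE_EXTENSION
static final String ATTR_CRAWL_EXECUTOR_SPEC: Controls the number of AUs running cmd line crawls.
static final String ATTR_ERROR_LOG_LEVEL
static final String ATTR_EXCLUDE_STATUS_PATTERN
static final String ATTR_JOIN_OUTPUT_STREAMS
static final String ATTR_OUTPUT_LOG_LEVEL
static final String ATTR_PROC_EXIT_WAIT
static final String ATTR_UNCOMPRESSED_WARC_FILE_EXTENSION
static final String ATTR_UNSUPPORTED_PARAMS
protected CmdLineCrawler.CommandLineBuilder cmdLineBuilder
protected boolean compressWarc
protected CrawlerConfig config: The configuration for this crawler.
protected HashMap<String,CmdLineCrawl> crawlMap: The map of crawls for this crawler.
static final String DEFAULT_CMDLINE_CRAWL_EXECUTOR_SPEC
static final String DEFAULT_COMPRESS_WARC
static final String DEFAULT_COMPRESSED_WARC_FILE_EXTENSION
static final String DEFAULT_ERROR_LOG_LEVEL
static final String DEFAULT_EXCLUDE_STATUS_PATTERN
static final String DEFAULT_JOIN_OUTPUT_STREAMS
static final String DEFAULT_OUTPUT_LOG_LEVEL
static final long DEFAULT_PROC_EXIT_WAIT
static final String DEFAULT_UNCOMPRESSED_WARC_FILE_EXTENSION
protected String errorLogLevel: The level to use when logging errors from a process.
protected String excludeStatusPattern: The HTTP response codes to exclude from WARC import.
protected String outputLogLevel: The level to use when logging output from a process.
protected PluggableCrawlManager pcManager
static final String PREFIX
protected long procExitWait
static final String START_URL_KEY
unsupportedParams
static final String URL_STEMS_KEY
warcFileFilter
-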
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
void deleteAllCrawls(): Stop all crawls and clear the crawl queue managed by this crawler.
protected boolean didCrawlSucceed(int exitCode)
void disable(boolean abortCrawling): Disable this crawler, clearing any queued crawls.
protected CmdLineCrawler.CommandLineBuilder getCmdLineBuilder
getCompressedWarcExtension
getConfig
getCrawl: Get a Crawl for a given crawl id.
getCrawlerConfig: Return the configuration for this crawler.
getCrawlerId: Return the unique Id for this crawler.
getErrorLogLevel
getOutputLogLevel
getPluggableCrawlManager
long getProcExitWait()
getUncompressedWarcExtension
getUnsupportedParams
getWarcFileFilter
protected void initCrawlScheduler(String reqSpec)
boolean isCrawlerEnabled(): Whether this crawler is enabled.
boolean isElgibleForCrawl(String auId)
boolean isJoinOutputStreams()
requestCrawl(org.lockss.plugin.ArchivalUnit au, org.lockss.util.rest.crawler.CrawlJob crawlJob)
setCmdLineBuilder(CmdLineCrawler.CommandLineBuilder cmdLineBuilder)
setConfig(CrawlerConfig config)
setCrawlManager(PluggableCrawlManager pcManager)
setNamespace(String namespace)
void setPluggableCrawlManager(PluggableCrawlManager pluggableCrawlManager): Set the Crawl Manager which created and maintains this crawler.
setV2Repo(org.lockss.util.rest.repo.LockssRepository v2Repo)
void shutdown(): Shutdown the crawler.
protected void shutdownWithWait(ExecutorService scheduler)
stopCrawl: Stop a specific crawl.
void storeInRepository(String auId, File warcFile, boolean isCompressed)
void updateAuConfig(org.lockss.plugin.ArchivalUnit au, boolean isRepairCrawl, List<String> reqUrls, List<String> crawlStems)
void updateCrawlerConfig(CrawlerConfig crawlerConfig): Set the configuration parameters for this crawler.
boolean useCompressWarc()
-
Field Details
-
PREFIX
-
ATTR_CRAWL_EXECUTOR_SPEC
Controls the number of AUs running cmd line crawls.
-
DEFAULT_CMDLINE_CRAWL_EXECUTOR_SPEC
-
ATTR_EXCLUDE_STATUS_PATTERN
-
DEFAULT_EXCLUDE_STATUS_PATTERN
-
ATTR_OUTPUT_LOG_LEVEL
-
DEFAULT_OUTPUT_LOG_LEVEL
-
ATTR_ERROR_LOG_LEVEL
-
DEFAULT_ERROR_LOG_LEVEL
-
ATTR_JOIN_OUTPUT_STREAMS
-
DEFAULT_JOIN_OUTPUT_STREAMS
-
ATTR_PROC_EXIT_WAIT
-
DEFAULT_PROC_EXIT_WAIT
public static final long DEFAULT_PROC_EXIT_WAIT
-
ATTR_COMPRESS_WARC
-
DEFAULT_COMPRESS_WARC
-
ATTR_COMPRESSED_WARC_FILE_EXTENSION
-
DEFAULT_COMPRESSED_WARC_FILE_EXTENSION
-
ATTR_UNCOMPRESSED_WARC_FILE_EXTENSION
-
DEFAULT_UNCOMPRESSED_WARC_FILE_EXTENSION
-
ATTR_UNSUPPORTED_PARAMS
-
START_URL_KEY
-
URL_STEMS_KEY
-
config
The configuration for this crawler.
-
outputLogLevel
The level to use when logging output from a process.
-
errorLogLevel
The level to use when logging errors from a process.
-
excludeStatusPattern
The HTTP response codes to exclude from WARC import.
-
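As a rough illustration of how a status-code exclusion pattern such as excludeStatusPattern might be applied, the sketch below filters HTTP status codes with a regular expression. The pattern value, the class name, and both method names are hypothetical; the actual value of DEFAULT_EXCLUDE_STATUS_PATTERN and the way CmdLineCrawler applies it are not shown on this page.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StatusFilterSketch {
  // Hypothetical pattern (an assumption, not the library default):
  // exclude every 4xx and 5xx response.
  static final Pattern EXCLUDE = Pattern.compile("[45]\\d\\d");

  /** Returns true if a record with this HTTP status would be skipped on import. */
  static boolean isExcluded(int status) {
    return EXCLUDE.matcher(Integer.toString(status)).matches();
  }

  /** Keeps only the status codes that do not match the exclusion pattern. */
  static List<Integer> importable(List<Integer> statuses) {
    return statuses.stream()
        .filter(s -> !isExcluded(s))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(importable(List.of(200, 301, 404, 503)));
  }
}
```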
compressWarc
protected boolean compressWarc
-
warcFileFilter
-
procExitWait
protected long procExitWait
-
unsupportedParams
-
crawlMap
The map of crawls for this crawler.
-
cmdLineBuilder
-
pcManager
-
-
Constructor Details
-
CmdLineCrawler
public CmdLineCrawler()
Instantiates a new command-line crawler.
-
-
Method Details
-
setCrawlManager
-
setV2Repo
-
setNamespace
-
setConfig
-
setCmdLineBuilder
-
getConfig
-
getCmdLineBuilder
-
getCrawlerId
Description copied from interface: PluggableCrawler
Return the unique Id for this crawler.
- Specified by:
getCrawlerId in interface PluggableCrawler
- Returns:
the id of the crawler
-
updateCrawlerConfig
Description copied from interface: PluggableCrawler
Set the configuration parameters for this crawler.
- Specified by:
updateCrawlerConfig in interface PluggableCrawler
- Parameters:
crawlerConfig - the configuration parameters to use
-
getCrawlerConfig
Description copied from interface: PluggableCrawler
Return the configuration for this crawler.
- Specified by:
getCrawlerConfig in interface PluggableCrawler
- Returns:
the configuration parameters in use by this crawler.
-
getProcExitWait
public long getProcExitWait()
-
getWarcFileFilter
-
getCompressedWarcExtension
-
getUncompressedWarcExtension
-
getUnsupportedParams
-
useCompressWarc
public boolean useCompressWarc()
-
requestCrawl
public PluggableCrawl requestCrawl(org.lockss.plugin.ArchivalUnit au, org.lockss.util.rest.crawler.CrawlJob crawlJob)
- Specified by:
requestCrawl in interface PluggableCrawler
-
isElgibleForCrawl
-
stopCrawl
Description copied from interface: PluggableCrawler
Stop a specific crawl.
- Specified by:
stopCrawl in interface PluggableCrawler
- Parameters:
crawlId - The crawl id of the crawl to stop
- Returns:
The PluggableCrawl containing the results of this crawl attempt.
-
getCrawl
Description copied from interface: PluggableCrawler
Get a Crawl for a given crawl id.
- Specified by:
getCrawl in interface PluggableCrawler
- Parameters:
crawlId - The crawl id of the crawl to retrieve
- Returns:
The PluggableCrawl that matches a crawl id.
-
deleteAllCrawls
public void deleteAllCrawls()
Description copied from interface: PluggableCrawler
Stop all crawls and clear the crawl queue managed by this crawler.
- Specified by:
deleteAllCrawls in interface PluggableCrawler
-
isCrawlerEnabled
public boolean isCrawlerEnabled()
Description copied from interface: PluggableCrawler
Whether this crawler is enabled.
- Specified by:
isCrawlerEnabled in interface PluggableCrawler
- Returns:
true if this crawler is set to enabled.
-
shutdown
public void shutdown()
Description copied from interface: PluggableCrawler
Shutdown the crawler.
- Specified by:
shutdown in interface PluggableCrawler
-
shutdownWithWait
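The body of shutdownWithWait is not shown on this page. A common way to shut down an ExecutorService with a bounded wait, following the two-phase pattern from the java.util.concurrent.ExecutorService documentation, looks roughly like the sketch below. The timeout parameter, the boolean return value, and the class name are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch {
  /** Returns true if the executor terminated within the allowed time. */
  static boolean shutdownWithWait(ExecutorService scheduler, long timeoutMs) {
    scheduler.shutdown(); // stop accepting new tasks; queued tasks still run
    try {
      if (!scheduler.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        scheduler.shutdownNow(); // interrupt tasks that are still running
        return scheduler.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
      }
      return true;
    } catch (InterruptedException e) {
      scheduler.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> System.out.println("crawl task running"));
    System.out.println("terminated: " + shutdownWithWait(pool, 5000));
  }
}
```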
-
disable
public void disable(boolean abortCrawling)
Description copied from interface: PluggableCrawler
Disable this crawler, clearing any queued crawls. This applies when a crawler that was running is now marked as disabled or is missing from the supported crawler ids in the configuration.
- Specified by:
disable in interface PluggableCrawler
- Parameters:
abortCrawling - abort the currently running crawls.
-
setPluggableCrawlManager
Description copied from interface: PluggableCrawler
Set the Crawl Manager which created and maintains this crawler.
- Specified by:
setPluggableCrawlManager in interface PluggableCrawler
-
getPluggableCrawlManager
-
storeInRepository
- Throws:
IOException
-
updateAuConfig
public void updateAuConfig(org.lockss.plugin.ArchivalUnit au, boolean isRepairCrawl, List<String> reqUrls, List<String> crawlStems) throws IOException
- Throws:
IOException
-
initCrawlScheduler
-
didCrawlSucceed
protected boolean didCrawlSucceed(int exitCode)
-
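Because didCrawlSucceed(int exitCode) is protected, a subclass can presumably override it to interpret tool-specific exit codes. The sketch below uses stand-in classes rather than the real CmdLineCrawler: the "zero means success" default and the WgetLikeCrawler subclass are assumptions, and nothing here describes what WgetCmdLineCrawler actually does. The only external fact relied on is that wget documents exit status 8 as "Server issued an error response".

```java
public class ExitCodeSketch {
  /** Stand-in for the base crawler; the default rule here is an assumption. */
  static class BaseCrawler {
    protected boolean didCrawlSucceed(int exitCode) {
      return exitCode == 0;
    }
  }

  /** Hypothetical wget-style subclass with a more lenient rule. */
  static class WgetLikeCrawler extends BaseCrawler {
    @Override
    protected boolean didCrawlSucceed(int exitCode) {
      // wget exits with 8 when a server issued an error response (for
      // example a 404 on one URL); a crawl may still have fetched useful content.
      return exitCode == 0 || exitCode == 8;
    }
  }

  public static void main(String[] args) {
    System.out.println(new BaseCrawler().didCrawlSucceed(8));
    System.out.println(new WgetLikeCrawler().didCrawlSucceed(8));
  }
}
```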
getOutputLogLevel
-
getErrorLogLevel
-
isJoinOutputStreams
public boolean isJoinOutputStreams()
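How isJoinOutputStreams is consumed is not documented here. One plausible use, sketched below with the standard ProcessBuilder API, is to merge a child process's standard error into its standard output so both can be logged from a single stream. The configure helper, the class name, and the wget command line are hypothetical.

```java
import java.util.List;

public class JoinStreamsSketch {
  /**
   * Builds a ProcessBuilder for the given command. When joinOutputStreams
   * is true, stderr is merged into stdout for the started process.
   */
  static ProcessBuilder configure(List<String> command, boolean joinOutputStreams) {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(joinOutputStreams);
    return pb;
  }

  public static void main(String[] args) {
    // The process is only configured here, not started.
    ProcessBuilder pb = configure(List.of("wget", "--version"), true);
    System.out.println("stderr merged: " + pb.redirectErrorStream());
  }
}
```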
-