The compression codec used for files written to HDFS. This may be set through the "mothra.repacker.compression" property. If that property is not set, CorePacker.DEFAULT_COMPRESSION is used.
The maximum number of open files maintained by the file cache. This is
determined by the mothra.repacker.fileCacheSize Java property, or by
CorePacker.DEFAULT_FILE_CACHE_SIZE when the property is not set. This
value must be no less than CorePacker.MINIMUM_FILE_CACHE_SIZE.
See CorePacker.DEFAULT_FILE_CACHE_SIZE for a full description of this value.
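The lookup described above (property value, else default, subject to a minimum) could be sketched as follows. This is an illustration only, not the CorePacker implementation: the object and method names are hypothetical, the values 2000 and 128 are taken from the mothra.repacker.fileCacheSize property description, and rejecting a too-small value (rather than clamping it) is an assumption.

```scala
object FileCacheSizeSketch {
  // Values from the documented property description; assumed to match
  // CorePacker.DEFAULT_FILE_CACHE_SIZE and CorePacker.MINIMUM_FILE_CACHE_SIZE.
  val DefaultFileCacheSize = 2000
  val MinimumFileCacheSize = 128

  /** Resolve the cache size using any property lookup function. */
  def resolve(lookup: String => Option[String]): Int = {
    val size = lookup("mothra.repacker.fileCacheSize")
      .map(_.trim.toInt)
      .getOrElse(DefaultFileCacheSize)
    // Assumption: a value below the minimum is rejected, not clamped.
    require(size >= MinimumFileCacheSize,
      s"fileCacheSize $size is below the minimum $MinimumFileCacheSize")
    size
  }

  /** Convenience wrapper that reads the JVM system properties. */
  def fromSystemProperties(): Int = resolve(key => sys.props.get(key))
}
```

Passing the lookup as a function keeps the sketch easy to exercise without touching the real JVM properties.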
The number of hours covered by each file in the repository. This is determined by the "mothra.repacker.hoursPerFile" property, or CorePacker.DEFAULT_HOURS_PER_FILE when that property is not set.
How often to print log messages regarding the number of tasks, in seconds.
maxScanJobs specifies the maximum number of scanning threads to start. Since at most one thread can scan a directory, the default is to create 1 scanner per srcDir. Setting this to a value larger than the number of source directories has no effect. This may be modified by setting the mothra.repacker.maxScanJobs property.
maxThreads specifies the maximum number of scanning and reader/repacker threads to start. By default this is
(scanningJobs * (1 + readersPerScanner))
Setting it to a value larger than that has no effect.
This may be modified by setting the mothra.repacker.maxThreads property.
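As a worked illustration of these defaults, the following sketch computes the thread counts from the number of source directories. The object and method names are hypothetical. Treating "has no effect" as taking the smaller of the requested value and the computed ceiling is an assumption about how the limits combine.

```scala
object ThreadDefaultsSketch {
  /** Default documented for maxThreads: one scanner plus its readers, per scan job. */
  def defaultMaxThreads(maxScanJobs: Int, readersPerScanner: Int): Int =
    maxScanJobs * (1 + readersPerScanner)

  /** Scan jobs never exceed the number of source directories. */
  def effectiveScanJobs(requested: Option[Int], sourceDirs: Int): Int =
    requested.fold(sourceDirs)(r => math.min(r, sourceDirs))

  /** A requested maxThreads above the default ceiling has no effect. */
  def effectiveMaxThreads(requested: Option[Int],
                          maxScanJobs: Int,
                          readersPerScanner: Int): Int = {
    val ceiling = defaultMaxThreads(maxScanJobs, readersPerScanner)
    requested.fold(ceiling)(r => math.min(r, ceiling))
  }
}
```

For example, three source directories with two readers per scanner give a default of 3 * (1 + 2) = 9 threads.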
The (approximate) maximum file size to create. Typically a file's size will not exceed this value by more than the maximum size of an IPFIX message, 64k. The default is no maximum. When a file's size exceeds this value, the file is closed and a new file is started.
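The reason a file overshoots the limit by at most one message can be seen in a small simulation. This is a sketch under the assumption that the size check happens after each message is written; the names are hypothetical and message sizes stand in for real IPFIX messages.

```scala
object SizeRolloverSketch {
  /** Group message sizes into files, closing a file once it exceeds maximumSize. */
  def splitBySize(messageSizes: Seq[Int], maximumSize: Option[Long]): Vector[Vector[Int]] = {
    val files = Vector.newBuilder[Vector[Int]]
    var current = Vector.empty[Int]
    var currentSize = 0L
    for (size <- messageSizes) {
      // Write the message first, then check the limit (assumed ordering).
      current = current :+ size
      currentSize += size
      if (maximumSize.exists(limit => currentSize > limit)) {
        files += current          // close the file that went over the limit
        current = Vector.empty    // start a new file
        currentSize = 0L
      }
    }
    if (current.nonEmpty) files += current
    files.result()
  }
}
```

With a limit of 100 and five 40-byte messages, the first file closes at 120 bytes (one message past the limit) and the remaining two messages go to a second file.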
readersPerScanner specifies the number of file reader/repacker threads that are invoked per scanning thread. The default is 1. This may be modified by setting the mothra.repacker.readersPerScanner property.
Object to implement the Repacker application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.RepackerMain mothra-tools.jar <partition-conf> <dest-dir> <work-dir> <s1> [<s2> <s3> ...]
where:
partition-conf: Partitioning configuration file as Hadoop URI
dest-dir: Root destination directory as Hadoop URI
work-dir: Working directory on the local disk (not file://)
s1..sn: Source directories as Hadoop URIs
Makes a single recursive scan of the source directories <s1>,<s2>,... for IPFIX files. Splits the IPFIX records in the source files into output file(s) in a time-based directory structure based on the partitioning rules in the partitioning configuration file <partition-conf>. The output files are initially created in the working directory <work-dir>, and, once ALL input files have been read, are moved to the destination directory and the initial source files removed. The dest-dir may be a source directory.
Repacker runs as a batch process, not as a daemon.
Example/Intended uses for the Repacker include:
(1) Changing how the records are packed, for example packing by the silkAppLabel instead of the protocolIdentifier.
(2) Combining multiple files for an hour into a single file for that hour, merging hourly files into a file that covers a longer duration, or splitting a longer duration file into smaller files.
(3) Changing the compression algorithm used on the IPFIX files.
Currently the repacker does NOT support modifying the records; it only moves the records into different files.
Repacker uses multiple threads. By default, each source directory specified on the command line gets one thread dedicated to scanning that directory and its subdirectories recursively for IPFIX files, and another thread dedicated to reading those files and repacking them. The repacker does not support having multiple threads scan a directory, but it does allow multiple threads to process a single directory's files.
The <work-dir> must NOT be a source directory or a subdirectory of a source directory. To repack the files in an existing working directory, use a different working directory. The repacker ignores any files in the <work-dir> that exist when the repacker is started, and it ignores files placed there by other programs.
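A caller could check this constraint before launching the repacker. The sketch below is not part of the Repacker; it uses hypothetical names and assumes the source directories in question are on the local file system and given as plain paths, since a work directory on local disk can only collide with local sources.

```scala
import java.nio.file.Paths

object WorkDirCheckSketch {
  /** True when workDir is a source directory or lies beneath one. */
  def isInsideAnySource(workDir: String, localSourceDirs: Seq[String]): Boolean = {
    val work = Paths.get(workDir).toAbsolutePath.normalize
    localSourceDirs.exists { src =>
      // Path.startsWith compares whole path components, so
      // /data/srcwork is not treated as being inside /data/src.
      work.startsWith(Paths.get(src).toAbsolutePath.normalize)
    }
  }
}
```

Comparing normalized absolute paths avoids false negatives caused by relative segments such as "..".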
The property values that are used by the repacker are:
mothra.repacker.compression -- The compression algorithm used for the new IPFIX files. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. The empty string indicates no compression.
mothra.repacker.hoursPerFile -- The number of hours covered by each file in the repository. The valid range is 1 (a file for each hour) to 24 (one file per day). The default is 1.
mothra.repacker.maxScanJobs -- The maximum number of threads dedicated to scanning the source directories. The default (and maximum) value is the number of source directories.
mothra.repacker.readersPerScanner -- The number of reader/repacker threads to create for each source directory. The default is 1.
mothra.repacker.maxThreads -- The maximum number of worker (scanner and repacker) threads to create. The default value is computed using the formula: (maxScanJobs * (1 + readersPerScanner)).
mothra.repacker.maximumSize -- The (approximate) maximum file size to create. When specified, a work file that exceeds this size is closed and moved into the repository. NOTES: (1) This value uses the uncompressed file size, and does not consider any compression that may occur when the file is moved from the workDir to the tgtDir. In addition, a file's size tends to grow in large steps because of buffering by the Java stream code. (2) Specifying a maximumSize may temporarily cause duplicate records to appear in the repository, because some records are in the original files and some are in the new file. Once Repacker finishes scanning all files, the original files are removed and only the newly packed files are left. This issue of temporarily having duplicate records in the repository will be resolved in a future release.
mothra.repacker.archiveDirectory -- The root directory into which working files are moved after the repacker has finished running, as a Hadoop URI. If not specified, the working files are deleted.
mothra.repacker.fileCacheSize -- The maximum size of the open file cache.
This is the maximum number of open files maintained by the file cache for writing to files in the work directory. The repacker does not limit the number of files in the work directory; this only limits the number of open files. Once the cache reaches this number of open files and the packer needs to (re-)open a file, the packer closes the least-recently-used file. This value does not include the file handles required when reading incoming files or when copying files from the work directory to the data directory. The default is 2000; the minimum permitted is 128.
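The least-recently-used eviction described above can be sketched with java.util.LinkedHashMap in access order. This is an illustration of the general technique, not the packer's actual cache: the class name and the open/close callbacks are hypothetical, and the handle type is left generic.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

/** Keeps at most maxOpen handles open, closing the least-recently-used one. */
final class LruOpenFileCache[H](maxOpen: Int, open: String => H, close: H => Unit) {
  // accessOrder = true makes iteration order run from least to most recently used.
  private val handles = new JLinkedHashMap[String, H](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, H]): Boolean = {
      val evict = size() > maxOpen
      if (evict) close(eldest.getValue) // close the handle before it is dropped
      evict
    }
  }

  /** Return the open handle for path, (re-)opening it if necessary. */
  def get(path: String): H =
    if (handles.containsKey(path)) handles.get(path) // get() refreshes recency
    else {
      val handle = open(path)
      handles.put(path, handle) // may trigger removeEldestEntry
      handle
    }

  def openCount: Int = handles.size()
}
```

With a limit of 2, requesting files a, b, a, c in that order closes b, because a was used more recently when c was opened. Only open handles are bounded; nothing here limits how many files exist in the work directory.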