Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the FileJoiner application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.FileJoinerMain mothra-tools.jar <s1> [<s2> <s3> ...]
where:
s1..sn: Directories to process, as Hadoop URIs
FileJoiner reduces the number of data files in a Mothra repository. It may also be used to modify the files' compression.
FileJoiner runs as a batch process, not as a daemon.
FileJoiner makes a single recursive scan of the source directories <s1>,
<s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or
"YYYYMMDD.HH-PTddH." (It looks for files matching the regular expression
^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Files whose names match that pattern are
processed by FileJoiner to create a single new file in the same directory
with the same prefix as the originals; the original files are then
removed.
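The filename test above can be sketched with java.util.regex (the file
names used in the test are illustrative):

```java
import java.util.regex.Pattern;

public class PrefixMatch {
    // The documented pattern: "YYYYMMDD.HH." or "YYYYMMDD.HH-PTddH."
    static final Pattern PREFIX =
        Pattern.compile("^\\d{8}\\.\\d{2}(?:-PT\\d\\d?H)?\\.");

    static boolean isRepositoryFile(String fileName) {
        return PREFIX.matcher(fileName).find();
    }
}
```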
By default, files that share the same prefix are only processed when there
are two or more files. To force re-writing when there is a single file,
set the Java property mothra.filejoiner.minCountToJoin to a value less
than 2. Setting the property to a larger value causes a new file to be
created only when an "excessive" number of files share the same prefix.
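A minimal sketch of the per-prefix grouping this implies (the helper
name is hypothetical, not the Mothra API):

```java
import java.util.*;
import java.util.regex.*;

public class JoinGroups {
    static final Pattern PREFIX =
        Pattern.compile("^\\d{8}\\.\\d{2}(?:-PT\\d\\d?H)?\\.");

    // Group file names by matched prefix, keeping only groups with at
    // least minCountToJoin members (default 2; 1 rewrites single files,
    // which is the file-splitter use described below).
    static Map<String, List<String>> joinable(List<String> names,
                                              int minCountToJoin) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String n : names) {
            Matcher m = PREFIX.matcher(n);
            if (m.find()) {
                groups.computeIfAbsent(m.group(), k -> new ArrayList<>())
                      .add(n);
            }
        }
        groups.values().removeIf(g -> g.size() < minCountToJoin);
        return groups;
    }
}
```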
There is always a single thread that recursively scans the directories.
The number of threads that join the files may be set by specifying the
mothra.filejoiner.maxThreads Java property. If not specified, the
default is 6.
FileJoiner may be run so that either it spawns a thread for every
directory that contains files to be joined or it spawns a thread for each
set of files in a directory that have the same prefix. The behavior is
controlled by whether the mothra.filejoiner.spawnThread Java property is
set to by-prefix or by-directory. The default is by-directory.
(For backwards compatibility, by-hour is an alias for by-prefix.)
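The two properties above might be read like this (a sketch; only the
property names, defaults, and the by-hour alias come from the text):

```java
public class JoinerConfig {
    // Default is 6 join threads, per the text.
    static int maxThreads() {
        return Integer.parseInt(
            System.getProperty("mothra.filejoiner.maxThreads", "6"));
    }

    // Default is by-directory; "by-hour" is the documented
    // backwards-compatible alias for by-prefix.
    static String spawnMode() {
        String v = System.getProperty("mothra.filejoiner.spawnThread",
                                      "by-directory");
        return v.equals("by-hour") ? "by-prefix" : v;
    }
}
```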
By default, FileJoiner does not compress the files it writes.
(NOTE: It should support writing the output using the same compression as
the input.) To specify the compression codec that it should use, specify
the mothra.filejoiner.compression Java property. Values typically
supported by Hadoop include bzip2, gzip, lz4, lzo, lzop,
snappy, and default. The empty string indicates no compression.
FileJoiner joins files sharing the same prefix into a single file by
default. The mothra.filejoiner.maximumSize Java property may be used to
limit the maximum file size. The size is for the compressed file if
compression is active. The value is approximate since it is only checked
after the data appears on disk, which occurs in large blocks because of
buffering by the Java stream code and the compression algorithm. (By
setting that property and mothra.filejoiner.minCountToJoin to 1, you can
force large files to be split into smaller ones, making the FileJoiner a
file-splitter.)
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the FileSanitizer application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.FileSanitizerMain mothra-tools.jar <f1>[,<f2>[,<f3>...]] <s1> [<s2> <s3> ...]
where:
f1..fn: Names of InfoElements to be removed from the files
s1..sn: Directories to process, as Hadoop URIs
FileSanitizer removes Information Element fields from the data files in a Mothra repository. In addition, when multiple files share the same name except for the UUID, FileSanitizer combines those files together.
The IE fields to be removed must be specified in a single argument, as a
comma-separated list of names, such as
sourceTransportPort,destinationTransportPort.
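Parsing that single argument can be sketched as follows (the helper name
is hypothetical):

```java
import java.util.*;

public class IeArgs {
    // Split the single comma-separated argument into IE names,
    // ignoring empty entries and surrounding whitespace.
    static List<String> parseFields(String arg) {
        List<String> names = new ArrayList<>();
        for (String s : arg.split(",")) {
            String t = s.trim();
            if (!t.isEmpty()) names.add(t);
        }
        return names;
    }
}
```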
Each remaining argument is a single directory to process.
FileSanitizer runs as a batch process, not as a daemon.
FileSanitizer makes a single recursive scan of the source directories
<s1>, <s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or
"YYYYMMDD.HH-PTddH." (It looks for files matching the regular expression
^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Files whose names match that pattern are
processed by FileSanitizer to remove the named Information Elements. All
files where the regular expression matched the same string are joined into
a single file, similar to the FileJoiner. Finally, the original files are
removed.
There is always a single thread that recursively scans the directories.
The number of threads that sanitize and join the files may be set by
specifying the mothra.filesanitizer.maxThreads Java property. If not
specified, the default is 6.
FileSanitizer may be run so that either it spawns a thread for every
directory that contains files to process or it spawns a thread for each
set of files in a directory that have the same prefix. The behavior is
controlled by whether the mothra.filesanitizer.spawnThread Java property is
set to by-prefix or by-directory. The default is by-directory.
(For backwards compatibility, by-hour is an alias for by-prefix.)
By default, FileSanitizer does not compress the files it writes.
(NOTE: It should support writing the output using the same compression as
the input.) To specify the compression codec that it should use, specify
the mothra.filesanitizer.compression Java property. Values typically
supported by Hadoop include bzip2, gzip, lz4, lzo, lzop,
snappy, and default. The empty string indicates no compression.
FileSanitizer joins the files sharing the same prefix into a single file
by default. The mothra.filesanitizer.maximumSize Java property may be
used to limit the maximum file size. The size is for the compressed file
if compression is active. The value is approximate since it is only
checked after the data appears on disk, which occurs in large blocks
because of buffering by the Java stream code and the compression
algorithm.
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the InvariantPacker application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.InvariantPackerMain mothra-tools.jar [--one-shot] <sourceDir> <destinationDir> <partitionerFile>
Processes files created by super_mediator running in invariant mode and writes them into HDFS.
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the Packer application.
Typical usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.PackerMain mothra-tools.jar [--one-shot] <srcDir> <destDir> <workDir> <partitioner>
where:
srcDir: Source (incoming) directory as Hadoop URI
destDir: Destination directory as Hadoop URI
workDir: Working directory on the local disk (not file://)
partitioner: Partitioning file as Hadoop URI
Packer scans the source directory (srcDir) for IPFIX files. It splits
the IPFIX records in each file into output file(s) in a time-based
directory structure based on the partitioning rules in the partitioning
file (partitioner). The output files are initially created in the
working directory (workDir), and when they meet size and/or age
thresholds, they are moved to the destination directory (destDir).
If "--one-shot" is included on the command line, the srcDir is only
scanned one time. Once all files in srcDir have been packed (or they
fail to be packed after some number of attempts), the packer exits.
The Java property values that are used by Packer are:
mothra.packer.compression -- The compression to use for files written to
HDFS. Values typically supported by Hadoop include bzip2, gzip,
lz4, lzo, lzop, snappy, and default. The empty string indicates
no compression. The default is no compression.
mothra.packer.maxPackJobs -- The size of the thread pool that determines
the maximum number of input files that may be processed simultaneously. A
larger value provides more throughput. The default is 1.
mothra.packer.hoursPerFile -- The number of hours covered by each file
in the repository. The valid range is 1 (a file for each hour) to 24 (one
file per day). The default is 1.
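As a sketch of what this implies for file naming (the mapping here is
assumed from the "YYYYMMDD.HH." / "YYYYMMDD.HH-PTddH." pattern used by
the other tools, not taken from the Packer source):

```java
public class FilePrefix {
    // Assumed mapping: the file's window starts at the hour rounded down
    // to a multiple of hoursPerFile, and multi-hour files carry a -PTnH
    // duration suffix.
    static String prefix(String yyyymmdd, int hour, int hoursPerFile) {
        int start = (hour / hoursPerFile) * hoursPerFile;
        String duration = (hoursPerFile == 1)
            ? "" : String.format("-PT%dH", hoursPerFile);
        return String.format("%s.%02d%s.", yyyymmdd, start, duration);
    }
}
```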
mothra.packer.pollingInterval -- How long the main thread sleeps (in
seconds) between scans (polls) of the source directory checking for IPFIX
files to process. The default is 30.
mothra.packer.workDir.checkInterval -- How often, in seconds, to check
the sizes and ages of the files in the working directory. The default is
60. At each check, files in the working directory that meet ANY ONE of
the following criteria are closed and moved into the data repository:
--- Files that were created more than maximumAge seconds ago. Since
files are only checked at this interval, a file could potentially be one
interval older than the maximumAge.
--- Files whose size exceeds maximumSize. Since a file's size is not
continuously monitored, a file could be larger than this size, and the
user should set this value appropriately.
--- Files whose size is at least minimumSize AND that were created at
least minimumAge seconds ago.
mothra.packer.workDir.maximumAge -- Files in the working directory that
were created over this number of seconds ago are always moved into the
repository, regardless of their size. The default value is 1800 seconds
(30 minutes).
mothra.packer.workDir.maximumSize -- Files in the working directory
whose size, in octets, is greater than this value are always moved into
the repository, regardless of their age. The default value is 104857600
bytes (100 MiB).
mothra.packer.workDir.minimumAge -- Files in the working directory are
NOT eligible to be moved into the repository if they are younger than
this age (created less than this number of seconds ago) unless their
size exceeds maximumSize. The default is 600 seconds (10 minutes).
mothra.packer.workDir.minimumSize -- Files in the working directory are
NOT eligible to be moved into the repository if they are smaller than
this size (in octets) unless their age exceeds maximumAge. The default
is 67108864 bytes (64 MiB).
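Taken together, the four workDir thresholds amount to the following
predicate (constants are the documented defaults; the function itself is
an illustrative sketch, not the Packer's code):

```java
public class WorkDirPolicy {
    // Documented defaults: seconds for ages, octets for sizes.
    static final long MAXIMUM_AGE  = 1800;       // 30 minutes
    static final long MINIMUM_AGE  = 600;        // 10 minutes
    static final long MAXIMUM_SIZE = 104857600L; // 100 MiB
    static final long MINIMUM_SIZE = 67108864L;  // 64 MiB

    // A work file is closed and moved when ANY ONE criterion holds.
    static boolean eligible(long ageSeconds, long sizeBytes) {
        return ageSeconds > MAXIMUM_AGE
            || sizeBytes > MAXIMUM_SIZE
            || (ageSeconds >= MINIMUM_AGE && sizeBytes >= MINIMUM_SIZE);
    }
}
```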
mothra.packer.numMoveThreads -- The size of the thread pool that closes
the work files and moves them to the destination directory. A task is
potentially created every workdirCheckInterval seconds if files are
determined to have met the limits. The default is 4.
mothra.packer.archiveDirectory -- The root directory into which working
files are moved after the packer copies their content to the repository,
as a Hadoop URI. If not specified, the working files are deleted.
mothra.packer.packAttempts -- The number of times the packer attempts to
process a file found in the srcDir. After this number of failed attempts,
the file is ignored by this invocation of the packer. The default is 3.
mothra.packer.fileCacheSize -- The maximum size of the open file cache.
This is the maximum number of open files maintained by the file cache for
writing to files in the work directory. The packer does not limit the
number of files in the work directory; this only limits the number of open
files. Once the cache reaches this number of open files and the packer
needs to (re-)open a file, the packer closes the least-recently-used file.
This value does not include the file handles required when reading
incoming files or when copying files from the work directory to the data
directory. The default is 2000; the minimum permitted is 128.
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the Repacker application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.RepackerMain mothra-tools.jar <partition-conf> <dest-dir> <work-dir> <s1> [<s2> <s3> ...]
where:
partition-conf: Partitioning configuration file as Hadoop URI
dest-dir: Root destination directory as Hadoop URI
work-dir: Working directory on the local disk (not file://)
s1..sn: Source directories as Hadoop URIs
Makes a single recursive scan of the source directories <s1>, <s2>, ... for IPFIX files. Splits the IPFIX records in the source files into output file(s) in a time-based directory structure based on the partitioning rules in the partitioning configuration file <partition-conf>. The output files are initially created in the working directory <work-dir> and, once ALL input files have been read, are moved to the destination directory, and the original source files are removed. The dest-dir may be a source directory.
Repacker runs as a batch process, not as a daemon.
Example/Intended uses for the Repacker include:
(1) Changing how the records are packed -- for example, packing by the silkAppLabel instead of the protocolIdentifier.
(2) Combining multiple files for an hour into a single file for that hour, merging hourly files into a file that covers a longer duration, or splitting a longer-duration file into smaller files.
(3) Changing the compression algorithm used on the IPFIX files.
Currently the repacker does NOT support modifying the records; it only moves the records into different files.
Repacker uses multiple threads. By default, each source directory specified on the command line gets one thread dedicated to scanning that directory and its subdirectories recursively for IPFIX files, and another thread dedicated to reading those files and repacking them. The repacker does not support having multiple threads scan a directory, but it does allow multiple threads to process a single directory's files.
The <work-dir> must NOT be a source directory or a subdirectory of a source directory. To repack the files in an existing working directory, use a different working directory. The repacker ignores any files in the <work-dir> that exist when the repacker is started, and it ignores files placed there by other programs.
The property values that are used by the repacker are:
mothra.repacker.compression -- the compression algorithm used for the
new IPFIX files. Values typically supported by Hadoop include bzip2,
gzip, lz4, lzo, lzop, snappy, and default. The empty string
indicates no compression.
mothra.repacker.hoursPerFile -- The number of hours covered by each file
in the repository. The valid range is 1 (a file for each hour) to 24 (one
file per day). The default is 1.
mothra.repacker.maxScanJobs -- the maximum number of threads dedicated
to scanning the source directories. The default (and maximum) value is
the number of source directories.
mothra.repacker.readersPerScanner -- the number of reader/repacker
threads to create for each source directory. The default is 1.
mothra.repacker.maxThreads -- the maximum number of worker (scanner and
repacker) threads to create. The default value is computed using the
formula: (maxScanJobs * (1 + readersPerScanner)).
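For example, with the default readersPerScanner of 1 and maxScanJobs
equal to the number of source directories, three source directories
yield six worker threads:

```java
public class RepackerThreads {
    // Default maxThreads per the formula above:
    // maxScanJobs * (1 + readersPerScanner)
    static int defaultMaxThreads(int maxScanJobs, int readersPerScanner) {
        return maxScanJobs * (1 + readersPerScanner);
    }
}
```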
mothra.repacker.maximumSize -- the (approximate) maximum file size to
create. When specified, a work-file that exceeds this size is closed and
moved into the repository. NOTES: (1) This value uses the uncompressed
file size and does not consider any compression that may occur when the
file is moved from the workDir to the tgtDir. In addition, a file's size
tends to grow in large steps because of buffering by the Java stream code.
(2) Specifying a maximumSize may temporarily cause duplicate records to
appear in the repository, with some records in the original files and
some in the new file. Once Repacker finishes scanning all files, the
original files are removed and only the newly packed files are left. This
issue of temporarily having duplicate records in the repository will be
resolved in a future release.
mothra.repacker.archiveDirectory -- the root directory into which
working files are moved after the repacker has finished running, as a
Hadoop URI. If not specified, the working files are deleted.
mothra.repacker.fileCacheSize -- The maximum size of the open file
cache. This is the maximum number of open files maintained by the file
cache for writing to files in the work directory. The repacker does not
limit the number of files in the work directory; this only limits the
number of open files. Once the cache reaches this number of open files
and the packer needs to (re-)open a file, the packer closes the
least-recently-used file. This value does not include the file handles
required when reading incoming files or when copying files from the work
directory to the data directory. The default is 2000; the minimum
permitted is 128.
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.
Object to implement the RollupDay application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.RollupDayMain mothra-tools.jar <s1> [<s2> <s3> ...]
where:
s1..sn: Directories to process, as Hadoop URIs
RollupDay reduces the number of data files in a Mothra repository. It may also be used to modify the files' compression.
RollupDay runs as a batch process, not as a daemon.
RollupDay makes a single recursive scan of the source directories <s1>,
<s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or
"YYYYMMDD.HH-PTddH." (It looks for files matching the regular expression
^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Files whose names match that pattern and
reside in the same directory are processed by RollupDay to create a single
new file (see next paragraph) in the same directory containing the records
in all files in that directory.
RollupDay joins the files in a directory into a single file by default.
The mothra.rollupday.maximumSize Java property may be used to limit the
maximum file size. The size is for the compressed file if compression is
active. The value is approximate since it is only checked after the data
appears on disk, which occurs in large blocks because of buffering by the
Java stream code and the compression algorithm.
There is always a single thread that recursively scans the directories.
The number of threads that join the files may be set by specifying the
mothra.rollupday.maxThreads Java property. If not specified, the
default is 6.
By default, RollupDay does not compress the files it writes.
(NOTE: It should support writing the output using the same compression as
the input.) To specify the compression codec that it should use, specify
the mothra.rollupday.compression Java property. Values typically
supported by Hadoop include bzip2, gzip, lz4, lzo, lzop,
snappy, and default. The empty string indicates no compression.
Wrapper to provide a better command-line argument experience over the top of the main packer class. Things should be folded together in the future.