object RepackerMain extends App with StrictLogging
Object to implement the Repacker application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.RepackerMain mothra-tools.jar <partition-conf> <dest-dir> <work-dir> <s1> [<s2> <s3> ...]
where:
partition-conf: Partitioning configuration file as Hadoop URI
dest-dir: Root destination directory as Hadoop URI
work-dir: Working directory on the local disk (not file://)
s1..sn: Source directories as Hadoop URIs
Makes a single recursive scan of the source directories <s1>, <s2>, ... for IPFIX files. Splits the IPFIX records in the source files into output file(s) in a time-based directory structure based on the partitioning rules in the partitioning configuration file <partition-conf>. The output files are initially created in the working directory <work-dir>. Once ALL input files have been read, the output files are moved to the destination directory and the original source files are removed. The <dest-dir> may be a source directory.
Repacker runs as a batch process, not as a daemon.
Example/Intended uses for the Repacker include:
(1) Changing how the records are packed, for example packing by the silkAppLabel instead of the protocolIdentifier.
(2) Combining multiple files for an hour into a single file for that hour, merging hourly files into a file that covers a longer duration, or splitting a longer-duration file into smaller files.
(3) Changing the compression algorithm used on the IPFIX files.
Currently the repacker does NOT support modifying the records; it only moves the records into different files.
Repacker uses multiple threads. By default, each source directory specified on the command line gets one thread dedicated to scanning that directory and its subdirectories recursively for IPFIX files, and another thread dedicated to reading those files and repacking them. The repacker does not support having multiple threads scan a directory, but it does allow multiple threads to process a single directory's files.
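The threading model described above can be pictured as one producer feeding several consumers. The following is a minimal sketch of that pattern only, not the repacker's implementation: the object name, the process method, and the use of a blocking queue with None as an end-of-scan marker are all assumptions made for illustration.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: one scanner thread and `readers` reader threads
// for a single source directory. Names here are illustrative only.
object ScanAndRepackSketch {
  /** Feeds `files` through a queue; returns how many files the readers handled. */
  def process(files: Seq[String], readers: Int): Int = {
    require(readers >= 1, "at least one reader is needed")
    val queue = new LinkedBlockingQueue[Option[String]]()
    val handled = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(1 + readers)

    // Scanner: the only thread that walks this directory.
    pool.execute(new Runnable {
      def run(): Unit = {
        files.foreach(f => queue.put(Some(f)))
        // One end-of-scan marker per reader so every reader stops.
        (1 to readers).foreach(_ => queue.put(None))
      }
    })

    // Readers: several threads may drain the same directory's queue.
    (1 to readers).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          var next = queue.take()
          while (next.isDefined) {
            handled.incrementAndGet() // a real reader would repack the file here
            next = queue.take()
          }
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    handled.get
  }
}
```

With readers set to 1 this matches the default of one scanner and one reader per source directory; larger values correspond to raising readersPerScanner.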
The <work-dir> must NOT be a source directory or a subdirectory of a source directory. To repack the files in an existing working directory, use a different working directory. The repacker ignores any files in the <work-dir> that exist when the repacker is started, and it ignores files placed there by other programs.
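The restriction on <work-dir> amounts to a path-prefix test. The sketch below shows one way such a check could be written, assuming for simplicity that the source directories are given as local paths (the repacker itself takes Hadoop URIs for them); WorkDirCheck and insideAnySource are hypothetical names.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper; assumes local-filesystem paths for illustration.
object WorkDirCheck {
  private def norm(p: String): Path = Paths.get(p).toAbsolutePath.normalize

  /** True when `workDir` equals a source directory or lies beneath one. */
  def insideAnySource(workDir: String, sourceDirs: Seq[String]): Boolean = {
    val work = norm(workDir)
    // Path.startsWith compares whole path elements, so /data/in2 is
    // not treated as being inside /data/in.
    sourceDirs.exists(src => work.startsWith(norm(src)))
  }
}
```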
The property values that are used by the repacker are:
mothra.repacker.compression -- the compression algorithm used for the
new IPFIX files. Values typically supported by Hadoop include bzip2,
gzip, lz4, lzo, lzop, snappy, and default. The empty string
indicates no compression.
mothra.repacker.hoursPerFile -- The number of hours covered by each file
in the repository. The valid range is 1 (a file for each hour) to 24 (one
file per day). The default is 1.
mothra.repacker.maxScanJobs -- the maximum number of threads dedicated
to scanning the source directories. The default (and maximum) value is
the number of source directories.
mothra.repacker.readersPerScanner -- the number of reader/repacker
threads to create for each source directory. The default is 1.
mothra.repacker.maxThreads -- the maximum number of worker (scanner and
repacker) threads to create. The default value is computed using the
formula: (maxScanJobs * (1 + readersPerScanner)).
mothra.repacker.maximumSize -- the (approximate) maximum file size to
create. When specified, a work-file that exceeds this size is closed and
moved into the repository. NOTES: (1) This value uses the uncompressed
file size, and does not consider any compression that may occur when the
file is moved from the <work-dir> to the <dest-dir>. In addition, a file's
size tends to grow in large steps because of buffering by the Java stream
code. (2) Specifying a maximumSize may temporarily cause duplicate records
to appear in the repository because some records are in the original files
and some are in the new file. Once Repacker finishes scanning all files, the
original files are removed and only the newly packed files are left. This
issue of temporarily having duplicate records in the repository will be
resolved in a future release.
mothra.repacker.archiveDirectory -- the root directory into which
working files are moved after the repacker has finished running, as a
Hadoop URI. If not specified, the working files are deleted.
mothra.repacker.fileCacheSize -- The maximum size of the open file
cache. This is the maximum number of open files maintained by the file
cache for writing to files in the work directory. The repacker does not
limit the number of files in the work directory; this only limits the
number of open files. Once the cache reaches this number of open files
and the packer needs to (re-)open a file, the packer closes the
least-recently-used file. This value does not include the file handles
required when reading incoming files or when copying files from the work
directory to the data directory. The default is 2000; the minimum
permitted is 128.
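The least-recently-used behaviour described for mothra.repacker.fileCacheSize can be sketched with java.util.LinkedHashMap in access order. This is an illustrative sketch only, not the repacker's cache: LruOpenFileCache, getOrOpen, and closeHandle are hypothetical names, and the sketch accepts small limits for demonstration even though the repacker requires at least 128.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical LRU cache of open handles; `closeHandle` runs on eviction.
final class LruOpenFileCache[K, V](maxOpen: Int, closeHandle: V => Unit) {
  require(maxOpen >= 1, "maxOpen must be at least 1")

  // accessOrder = true keeps entries ordered least-recently-used first.
  private val entries: JLinkedHashMap[K, V] =
    new JLinkedHashMap[K, V](16, 0.75f, true) {
      override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = {
        val evict = size() > maxOpen
        if (evict) closeHandle(eldest.getValue) // close before dropping it
        evict
      }
    }

  /** Returns the cached handle, or opens one, evicting the LRU entry if full. */
  def getOrOpen(key: K)(open: => V): V = synchronized {
    if (entries.containsKey(key)) entries.get(key) // get() refreshes recency
    else {
      val handle = open
      entries.put(key, handle)
      handle
    }
  }

  /** Number of handles currently held open by the cache. */
  def openCount: Int = synchronized(entries.size())
}
```

Only the number of open handles is bounded here; as in the description above, nothing limits how many files exist in the work directory.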
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val archiveDir: Option[Path]
- final def args: Array[String]
- Attributes
- protected
- Definition Classes
- App
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- val compressCodec: Option[CompressionCodec]
The compression codec used for files written to HDFS. This is set by the "mothra.repacker.compression" property. If that property is not set, CorePacker.DEFAULT_COMPRESSION is used.
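As a rough illustration of how the property, the default, and the "empty string means no compression" rule combine, consider the sketch below. It works on codec names only and does not touch Hadoop; CompressionSetting and resolve are hypothetical names, and defaultName stands in for CorePacker.DEFAULT_COMPRESSION, whose value is not given here.

```scala
// Hypothetical helper; operates on codec names, not Hadoop codec objects.
object CompressionSetting {
  /** Resolves the codec name to use; None means "write uncompressed". */
  def resolve(property: Option[String], defaultName: String): Option[String] =
    property.getOrElse(defaultName) match {
      case ""   => None       // empty string indicates no compression
      case name => Some(name) // e.g. "gzip" or "bzip2"
    }
}
```

A caller might pass sys.props.get("mothra.repacker.compression") as the first argument, assuming the property is supplied as a Java system property.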
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final val executionStart: Long
- Definition Classes
- App
- val fileCacheSize: Int
The maximum number of open files maintained by the file cache. This is determined by the mothra.repacker.fileCacheSize Java property, or by CorePacker.DEFAULT_FILE_CACHE_SIZE when the property is not set. This value must be no less than CorePacker.MINIMUM_FILE_CACHE_SIZE.
- See also
CorePacker.DEFAULT_FILE_CACHE_SIZE for a full description of this value.
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- implicit val hadoopConf: Configuration
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- val hoursPerFile: Int
The number of hours covered by each file in the repository. This is determined by the "mothra.repacker.hoursPerFile" property, or CorePacker.DEFAULT_HOURS_PER_FILE when that property is not set.
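To make the effect of hoursPerFile concrete, the sketch below maps an hour of the day to the first hour covered by its file. It assumes files are aligned to midnight, which the description above does not state; HourBuckets and fileStartHour are hypothetical names.

```scala
// Hypothetical helper; assumes multi-hour files start on multiples of
// hoursPerFile counted from midnight.
object HourBuckets {
  /** First hour of day (0-23) covered by the file that contains `hourOfDay`. */
  def fileStartHour(hourOfDay: Int, hoursPerFile: Int): Int = {
    require(hoursPerFile >= 1 && hoursPerFile <= 24, "hoursPerFile must be in 1..24")
    require(hourOfDay >= 0 && hourOfDay <= 23, "hourOfDay must be in 0..23")
    hourOfDay - (hourOfDay % hoursPerFile)
  }
}
```

Under that assumption, a value of 6 places records from hour 13 in the file starting at hour 12, and a value of 24 places every record of the day in a single file.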
- val infoModel: InfoModel
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- val logTaskCountInterval: Int
How often to print log messages regarding the number of tasks, in seconds.
- val logger: Logger
- Attributes
- protected
- Definition Classes
- StrictLogging
- final def main(args: Array[String]): Unit
- Definition Classes
- App
- val maxScanJobs: Int
maxScanJobs specifies the maximum number of scanning threads to start. Since at most one thread can scan a directory, the default is to create 1 scanner per srcDir. Setting this to a value larger than the number of source directories has no effect. This may be modified by setting the mothra.repacker.maxScanJobs property.
- val maxThreads: Int
maxThreads specifies the maximum number of scanning and reader/repacker threads to start. By default this is
(maxScanJobs * (1 + readersPerScanner))
Setting it to a value larger than that has no effect.
This may be modified by setting the mothra.repacker.maxThreads property.
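The default thread limits can be worked through with a few lines of arithmetic. The sketch below encodes the documented defaults only; ThreadDefaults and its method names are hypothetical.

```scala
// Hypothetical helper that mirrors the documented defaults.
object ThreadDefaults {
  /** Scan jobs default to, and are capped at, the number of source directories. */
  def effectiveScanJobs(requested: Option[Int], sourceDirCount: Int): Int =
    requested.fold(sourceDirCount)(n => math.min(n, sourceDirCount))

  /** Default worker cap: each scan job has one scanner plus its readers. */
  def defaultMaxThreads(maxScanJobs: Int, readersPerScanner: Int): Int =
    maxScanJobs * (1 + readersPerScanner)
}
```

For example, three source directories with the default of one reader per scanner give 3 * (1 + 1) = 6 worker threads.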
- val maximumSize: Option[Long]
The (approximate) maximum file size to create. Typically a file's size will not exceed this value by more than the maximum size of an IPFIX message, 64k. The default is no maximum. When a file's size exceeds this value, the file is closed and a new file is started.
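The reason a file can overshoot the limit by at most one message is that the size check happens after a write. The sketch below models that behaviour with a byte counter, under the assumption that the check is made once per message; RollingCounter and its members are hypothetical names and no file I/O is performed.

```scala
// Hypothetical model of size-based rollover; counts bytes only.
final class RollingCounter(maximumSize: Option[Long]) {
  private var bytesInCurrent = 0L
  private var filesClosed = 0

  /** Records one message; returns true if the file was closed after this write. */
  def write(messageLength: Int): Boolean = {
    bytesInCurrent += messageLength
    // The check follows the write, so the overshoot is at most one message.
    val exceeded = maximumSize.exists(limit => bytesInCurrent > limit)
    if (exceeded) {
      filesClosed += 1
      bytesInCurrent = 0L // a new file is started
    }
    exceeded
  }

  def closedFiles: Int = filesClosed
  def currentSize: Long = bytesInCurrent
}
```

With maximumSize set to None the counter never rolls over, matching the default of no maximum.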
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- val packConf: PackerConfig
- val packLogic: PackingLogic
- val packer: CorePacker
- val positionalArgs: Array[String]
- var readersPerScanner: Int
readersPerScanner specifies the number of file reader/repacker threads that are invoked per scanning thread. The default is 1. This may be modified by setting the mothra.repacker.readersPerScanner property.
- val removeList: ConcurrentLinkedQueue[Path]
- val rootDir: Path
- val runTimePackConf: Path
- var running: Boolean
- val sourceDirs: Array[Path]
- val sourceFileSystem: FileSystem
- val switches: Array[String]
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def usage(full: Boolean = false): Unit
- def version(): Unit
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- val workDir: Path