object FileSanitizerMain extends App with StrictLogging
Object to implement the FileSanitizer application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.FileSanitizerMain mothra-tools.jar <f1>[,<f2>[,<f3>...]] <s1> [<s2> <s3> ...]
where:
f1..fn: Names of InfoElements to be removed from the files
s1..sn: Directories to process, as Hadoop URIs
FileSanitizer removes Information Element fields from the data files in a Mothra repository. In addition, when multiple files share the same name except for the UUID, FileSanitizer combines those files together.
The IE fields to be removed must be specified in a single argument, as a
comma-separated list of names, such as
sourceTransportPort,destinationTransportPort.
Each remaining argument is a single directory to process.
FileSanitizer runs as a batch process, not as a daemon.
FileSanitizer makes a single recursive scan of the source directories
<s1>, <s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or
"YYYYMMDD.HH-PTddH." (It looks for files matching the regular expression
^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Files whose names match that pattern are
processed by FileSanitizer to remove the named Information Elements. All
files where the regular expression matched the same string are joined into
a single file, similar to the FileJoiner. Finally, the original files are
removed.
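The matching and grouping step described above can be sketched as follows. This is a minimal illustration, not FileSanitizer's actual code: the object name PrefixGrouping, its methods, and the sample file names in the usage note are hypothetical; only the regular expression is taken from the text.

```scala
// Sketch only: groups file names by the prefix matched by the documented regex.
object PrefixGrouping {
  // The regular expression quoted in the description above.
  val Prefix = """^\d{8}\.\d{2}(?:-PT\d\d?H)?\.""".r

  /** Returns the matched prefix, or None when the name is not a candidate. */
  def prefixOf(name: String): Option[String] = Prefix.findFirstIn(name)

  /** Groups candidate names by matched prefix; non-matching names are dropped. */
  def groupByPrefix(names: Seq[String]): Map[String, Seq[String]] =
    names
      .flatMap(n => prefixOf(n).map(p => p -> n))
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2) }
}
```

Under this sketch, the made-up names 20240115.13.a and 20240115.13.b fall into one group keyed by "20240115.13.", while 20240115.13-PT4H.c gets its own group and README.txt is ignored.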
There is always a single thread that recursively scans the directories.
The number of threads that sanitize and join the files may be set by
specifying the mothra.filesanitizer.maxThreads Java property. If not
specified, the default is 6.
FileSanitizer may be run so that either it spawns a thread for every
directory that contains files to process or it spawns a thread for each
set of files in a directory that have the same prefix. The behavior is
controlled by whether the mothra.filesanitizer.spawnThread Java property is
set to by-prefix or by-directory. The default is by-directory.
(For backwards compatibility, by-hour is an alias for by-prefix.)
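As a rough sketch of how the two threading properties might be resolved, the following reads them from a java.util.Properties object and applies the documented defaults. The object and method names are hypothetical, and the handling of malformed or unknown values (fall back to the default, or throw) is an assumption; the text does not say what FileSanitizer does in those cases.

```scala
// Sketch only: resolves the threading properties with the documented defaults.
object SanitizerSettings {
  val DefaultMaxThreads = 6                 // default stated in the text
  val DefaultSpawnThread = "by-directory"   // default stated in the text

  // Maps each accepted spawnThread value to "one thread per directory?".
  val SpawnThreadMap: Map[String, Boolean] = Map(
    "by-directory" -> true,
    "by-prefix"    -> false,
    "by-hour"      -> false // backwards-compatible alias for by-prefix
  )

  /** Thread count; falls back to the default on a missing or invalid value (assumption). */
  def maxThreads(props: java.util.Properties): Int =
    Option(props.getProperty("mothra.filesanitizer.maxThreads"))
      .flatMap(s => scala.util.Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultMaxThreads)

  /** True for by-directory, false for by-prefix or by-hour; rejects other values (assumption). */
  def threadPerDirectory(props: java.util.Properties): Boolean = {
    val value = Option(props.getProperty("mothra.filesanitizer.spawnThread"))
      .getOrElse(DefaultSpawnThread)
    SpawnThreadMap.getOrElse(
      value,
      throw new IllegalArgumentException(s"Unknown spawnThread value: $value"))
  }
}
```

Passing System.getProperties to either method would pick up values supplied as -D options on the JVM command line.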
By default, FileSanitizer does not compress the files it writes.
(NOTE: It should support writing the output using the same compression as
the input.) To specify the compression codec that it should use, specify
the mothra.filesanitizer.compression Java property. Values typically
supported by Hadoop include bzip2, gzip, lz4, lzo, lzop,
snappy, and default. The empty string indicates no compression.
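The "empty string means no compression" convention can be illustrated with a small sketch. The object and method names are hypothetical, and trimming whitespace is an assumption. The sketch stops at the codec name; turning that name into a Hadoop CompressionCodec (for example through Hadoop's CompressionCodecFactory) is left out so the example has no Hadoop dependency.

```scala
// Sketch only: interprets the compression property as an optional codec name.
object CompressionSetting {
  val PropertyName = "mothra.filesanitizer.compression"
  val DefaultCompression = "" // empty string: no compression, the documented default

  /** None when no compression is requested, otherwise the requested codec name. */
  def codecName(props: java.util.Properties): Option[String] =
    Option(props.getProperty(PropertyName))
      .getOrElse(DefaultCompression)
      .trim match {
        case ""   => None
        case name => Some(name)
      }
}
```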
FileSanitizer joins the files sharing the same prefix into a single file
by default. The mothra.filesanitizer.maximumSize Java property may be
used to limit the maximum file size. The size is for the compressed file
if compression is active. The value is approximate because it is only
checked after the data appears on disk, which occurs in large blocks
because of buffering by the Java stream code and the compression
algorithm.
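The reason the limit is approximate can be seen in a simplified model of the rollover logic. The sketch below is an illustration under stated assumptions, not the actual implementation: it works on in-memory byte arrays instead of HDFS streams, the names SizeRollover and split are hypothetical, and the size is checked only after each record has been written, so an output can overshoot the limit by up to one record.

```scala
// Sketch only: models "check the size after writing, then start a new file".
object SizeRollover {
  /** Splits records into outputs; an output is closed once its size exceeds maxSize. */
  def split(records: Seq[Array[Byte]], maxSize: Option[Long]): Vector[Vector[Array[Byte]]] = {
    val done = Vector.newBuilder[Vector[Array[Byte]]]
    var current = Vector.empty[Array[Byte]]
    var written = 0L
    for (r <- records) {
      current = current :+ r
      written += r.length
      // The check happens after the write, so the limit may already be exceeded.
      if (maxSize.exists(limit => written > limit)) {
        done += current
        current = Vector.empty
        written = 0L
      }
    }
    if (current.nonEmpty) done += current
    done.result()
  }
}
```

With five 10-byte records and a limit of 15 bytes, this model produces outputs holding 2, 2, and 1 records: each full output ends at 20 bytes, past the 15-byte limit. With no limit (None), all records land in a single output.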
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val DEFAULT_COMPRESSION: String
The default compression codec to use for files written to HDFS. This may be modified by specifying the following property: mothra.filesanitizer.compression.
Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. The empty string indicates no compression.
- val DEFAULT_MAX_THREADS: Int
The default number of threads to run for sanitizing files when the mothra.filesanitizer.maxThreads Java property is not set. (The scanning task always runs in its own thread.)
- val DEFAULT_SPAWN_THREAD: String
The default value for spawnThread when the mothra.filesanitizer.spawnThread Java property is not specified.
- final def args: Array[String]
- Attributes
- protected
- Definition Classes
- App
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- val compressCodec: Option[CompressionCodec]
The compression codec used for files written to HDFS. This may be set by setting the "mothra.filesanitizer.compression" property. If that property is not set, DEFAULT_COMPRESSION is used.
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final val executionStart: Long
- Definition Classes
- App
- val fileSystem: FileSystem
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- implicit val hadoopConf: Configuration
The Hadoop configuration
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- implicit val infoModel: InfoModel
The information model
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- val logTaskCountInterval: Int
How often to print log messages regarding the number of tasks, in seconds.
- val logger: Logger
- Attributes
- protected
- Definition Classes
- StrictLogging
- final def main(args: Array[String]): Unit
- Definition Classes
- App
- val maxThreads: Int
The maximum number of filesanitizer threads to start. It defaults to the value DEFAULT_MAX_THREADS. This run-time behavior may be modified by setting the mothra.filesanitizer.maxThreads property.
- val maximumSize: Option[Long]
The (approximate) maximum size of a file to create. The default is no maximum. When a file's size exceeds this value, the file is closed and a new file is started. Typically a file's size will not exceed this value by more than the maximum size of an IPFIX message, 64k.
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- val positionalArgs: Array[String]
- val spawnThread: String
The behavior as to whether a file-sanitizing thread is spawned:
by-directory: for every directory that contains files to be sanitized, or
by-prefix: for every unique basename prefix (that is, the file name without the UUID) in a single directory that contains files to be sanitized. by-hour is an alias for by-prefix.
The default is specified by the DEFAULT_SPAWN_THREAD variable. The run-time behavior may be modified by setting the mothra.filesanitizer.spawnThread Java property to one of those values.
- val spawnThreadMap: Map[String, Boolean]
Mapping from spawnThread value to threadPerDirectory.
- val switches: Array[String]
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- val toRemove: Set[InfoElement]
- def toString(): String
- Definition Classes
- AnyRef → Any
- def usage(full: Boolean = false): Unit
- def version(): Unit
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()