org.cert.netsa.mothra.tools

FileSanitizerMain

object FileSanitizerMain extends App with StrictLogging

Object to implement the FileSanitizer application.

Typical Usage in a Spark environment:

spark-submit --class org.cert.netsa.mothra.tools.FileSanitizerMain mothra-tools.jar <f1>[,<f2>[,<f3>...]] <s1> [<s2> <s3> ...]

where:

f1..fn: Names of InfoElements to be removed from the files

s1..sn: Directories to process, as Hadoop URIs

FileSanitizer removes Information Element fields from the data files in a Mothra repository. In addition, when multiple files share the same name except for the UUID, FileSanitizer combines those files together.

The IE fields to be removed must be specified in a single argument, as a comma-separated list of names, such as sourceTransportPort,destinationTransportPort.

Each remaining argument is a single directory to process.
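
The argument layout described above can be sketched as follows. This is a minimal illustration, not the FileSanitizer source: the object name `ArgsSketch` and its `parse` method are hypothetical, field names are kept as plain strings rather than resolved to InfoElement objects, and trimming whitespace around names is an assumption.

```scala
object ArgsSketch {
  /** Splits the positional arguments into (field names, directories).
    * Returns None unless there is a non-empty field list followed by
    * at least one directory. Hypothetical helper for illustration only.
    */
  def parse(positional: Seq[String]): Option[(Set[String], Seq[String])] =
    positional match {
      case fields +: dirs if dirs.nonEmpty =>
        // The first argument is a single comma-separated list of IE names.
        val names = fields.split(',').map(_.trim).filter(_.nonEmpty).toSet
        if (names.isEmpty) None else Some((names, dirs))
      case _ =>
        None
    }
}
```

For example, `ArgsSketch.parse(Seq("sourceTransportPort,destinationTransportPort", "hdfs:///repo/a"))` yields the two field names and one directory; the URI shown is a made-up placeholder.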

FileSanitizer runs as a batch process, not as a daemon.

FileSanitizer makes a single recursive scan of the source directories <s1>, <s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or "YYYYMMDD.HH-PTddH." (that is, names matching the regular expression ^\d{8}\.\d{2}(?:-PT\d\d?H)?\.). FileSanitizer processes each matching file to remove the named Information Elements. All files for which the regular expression matched the same string are joined into a single file, similar to the FileJoiner. Finally, the original files are removed.
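
The matching and grouping step can be illustrated with the regular expression quoted above. This is a sketch under stated assumptions: `PrefixGroupingSketch`, `prefixOf`, and `groupByPrefix` are hypothetical names, and the code works on plain file-name strings rather than Hadoop paths.

```scala
object PrefixGroupingSketch {
  // The regular expression quoted in the description above.
  private val Prefix = """^\d{8}\.\d{2}(?:-PT\d\d?H)?\.""".r

  /** Returns the matched prefix of a file name, or None when the name
    * does not match (such a file would be ignored by the scan).
    */
  def prefixOf(fileName: String): Option[String] =
    Prefix.findFirstIn(fileName)

  /** Groups file names by matched prefix; each group corresponds to one
    * set of files that would be joined. Non-matching names are dropped.
    */
  def groupByPrefix(fileNames: Seq[String]): Map[String, Seq[String]] =
    fileNames
      .flatMap(name => prefixOf(name).map(prefix => prefix -> name))
      .groupBy(_._1)
      .map { case (prefix, pairs) => prefix -> pairs.map(_._2) }
}
```

With made-up names such as `20240102.03.a` and `20240102.03.b`, both fall into the group keyed by `20240102.03.`, while a name such as `notes.txt` is left out.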

There is always a single thread that recursively scans the directories. The number of threads that sanitize and join the files may be set by specifying the mothra.filesanitizer.maxThreads Java property. If not specified, the default is 6.
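
A Java property is normally supplied with -Dname=value on the JVM command line. The lookup with a default can be sketched as below; `MaxThreadsSketch` is a hypothetical name, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption of this sketch, not documented behavior.

```scala
object MaxThreadsSketch {
  // Default stated in the text above.
  val DefaultMaxThreads = 6

  /** Reads mothra.filesanitizer.maxThreads from the given properties.
    * Assumption: invalid or non-positive values fall back to the default.
    */
  def maxThreads(props: scala.collection.Map[String, String]): Int =
    props.get("mothra.filesanitizer.maxThreads")
      .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultMaxThreads)
}
```

Calling `MaxThreadsSketch.maxThreads(sys.props)` reads the value from the running JVM's system properties.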

FileSanitizer may spawn either one thread for every directory that contains files to process or one thread for each set of files in a directory that share the same prefix. The behavior is controlled by whether the mothra.filesanitizer.spawnThread Java property is set to by-prefix or by-directory. The default is by-directory. (For backwards compatibility, by-hour is an alias for by-prefix.)
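
The three accepted values can be pictured as a lookup from property value to a thread-per-directory flag, which is what the spawnThreadMap member's type (Map[String, Boolean]) suggests. The sketch below is an assumption-laden illustration: `SpawnThreadSketch` and `threadPerDirectory` are hypothetical, and reporting unknown values as a `Left` is a choice made here, not documented behavior.

```scala
object SpawnThreadSketch {
  // Default stated in the text above.
  val DefaultSpawnThread = "by-directory"

  // true  = one thread per directory
  // false = one thread per file-name prefix within a directory
  val spawnThreadMap: Map[String, Boolean] = Map(
    "by-directory" -> true,
    "by-prefix"    -> false,
    "by-hour"      -> false // alias for by-prefix
  )

  /** Resolves an optional property value to the thread-per-directory flag. */
  def threadPerDirectory(value: Option[String]): Either[String, Boolean] = {
    val v = value.getOrElse(DefaultSpawnThread)
    spawnThreadMap.get(v).toRight(s"unknown spawnThread value: $v")
  }
}
```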

By default, FileSanitizer does not compress the files it writes. (NOTE: It should support writing the output using the same compression as the input.) To specify the compression codec that it should use, specify the mothra.filesanitizer.compression Java property. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. The empty string indicates no compression.

FileSanitizer joins the files sharing the same prefix into a single file by default. The mothra.filesanitizer.maximumSize Java property may be used to limit the maximum file size. The size is for the compressed file if compression is active. The value is approximate because the size is checked only after the data appears on disk, which happens in large blocks because of buffering by the Java stream code and the compression algorithm.
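
Why the limit is only approximate can be seen in a toy model where the size check happens after each write. This is a sketch, not the FileSanitizer writer: `RolloverSketch` and `split` are hypothetical, and record sizes stand in for the buffered blocks that actually reach disk.

```scala
object RolloverSketch {
  /** Distributes a sequence of write sizes across output files, starting
    * a new file once the current one has exceeded maxSize. Because the
    * check runs after the write, a file may overshoot the limit by up to
    * one write. None means no limit (a single output file).
    */
  def split(writeSizes: Seq[Long], maxSize: Option[Long]): Vector[Vector[Long]] = {
    val files = Vector.newBuilder[Vector[Long]]
    var current = Vector.empty[Long]
    var written = 0L
    for (size <- writeSizes) {
      current :+= size
      written += size
      if (maxSize.exists(written > _)) {
        // Limit exceeded: close this file and start a new one.
        files += current
        current = Vector.empty
        written = 0L
      }
    }
    if (current.nonEmpty) files += current
    files.result()
  }
}
```

With five writes of 40 bytes and a limit of 100, the first file closes at 120 bytes and the second holds the remaining 80.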

Linear Supertypes
StrictLogging, App, DelayedInit, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. val DEFAULT_COMPRESSION: String

    The default compression codec to use for files written to HDFS. This may be modified by specifying the following property: mothra.filesanitizer.compression.

    Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. The empty string indicates no compression.

  5. val DEFAULT_MAX_THREADS: Int

    The default number of threads to run for sanitizing files when the mothra.filesanitizer.maxThreads Java property is not set. (The scanning task always runs in its own thread.)

  6. val DEFAULT_SPAWN_THREAD: String

    The default value for spawnThread when the mothra.filesanitizer.spawnThread Java property is not specified.

  7. final def args: Array[String]
    Attributes
    protected
    Definition Classes
    App
  8. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  9. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  10. val compressCodec: Option[CompressionCodec]

    The compression codec used for files written to HDFS. This may be set with the mothra.filesanitizer.compression property. If that property is not set, DEFAULT_COMPRESSION is used.

  11. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  12. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  13. final val executionStart: Long
    Definition Classes
    App
  14. val fileSystem: FileSystem
  15. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable])
  16. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  17. implicit val hadoopConf: Configuration

    The Hadoop configuration

  18. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  19. implicit val infoModel: InfoModel

    The information model

  20. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  21. val logTaskCountInterval: Int

    How often to print log messages regarding the number of tasks, in seconds.

  22. val logger: Logger
    Attributes
    protected
    Definition Classes
    StrictLogging
  23. final def main(args: Array[String]): Unit
    Definition Classes
    App
  24. val maxThreads: Int

    The maximum number of filesanitizer threads to start. It defaults to the value DEFAULT_MAX_THREADS.

    This run-time behavior may be modified by setting the mothra.filesanitizer.maxThreads property.

  25. val maximumSize: Option[Long]

    The (approximate) maximum size of a file to create. The default is no maximum. When a file's size exceeds this value, the file is closed and a new file is started. Typically a file's size will not exceed this value by more than the maximum size of an IPFIX message, 64k.

  26. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  27. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  28. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  29. val positionalArgs: Array[String]
  30. val spawnThread: String

    Determines when a file-sanitizing thread is spawned:

    by-directory: for every directory that contains files to be sanitized, or

    by-prefix: for every unique basename prefix (that is, the file name without the UUID) in a single directory that contains files to be sanitized. by-hour is an alias for by-prefix.

    The default is specified by the DEFAULT_SPAWN_THREAD variable. The run-time behavior may be modified by setting the mothra.filesanitizer.spawnThread Java property to one of those values.

  31. val spawnThreadMap: Map[String, Boolean]

    Mapping from spawnThread value to threadPerDirectory.

  32. val switches: Array[String]
  33. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  34. val toRemove: Set[InfoElement]
  35. def toString(): String
    Definition Classes
    AnyRef → Any
  36. def usage(full: Boolean = false): Unit
  37. def version(): Unit
  38. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  39. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  40. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()

Deprecated Value Members

  1. def delayedInit(body: => Unit): Unit
    Definition Classes
    App → DelayedInit
    Annotations
    @deprecated
    Deprecated

    (Since version 2.11.0) the delayedInit mechanism will disappear
