object FileJoinerMain extends App with StrictLogging
Object to implement the FileJoiner application.
Typical Usage in a Spark environment:
spark-submit --class org.cert.netsa.mothra.packer.tools.FileJoinerMain mothra-tools.jar <s1> [<s2> <s3> ...]
where:
s1..sn: Directories to process, as Hadoop URIs
FileJoiner reduces the number of data files in a Mothra repository. It may also be used to modify the files' compression.
FileJoiner runs as a batch process, not as a daemon.
FileJoiner makes a single recursive scan of the source directories <s1>,
<s2>, ... for files whose names match the pattern "YYYYMMDD.HH." or
"YYYYMMDD.HH-PTddH." (It looks for files matching the regular expression
^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Files whose names match that pattern are
processed by FileJoiner to create a single new file in the same directory
that has the same prefix as the originals, and then the original file(s)
are removed.
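The prefix matching described above can be sketched as follows. This is an illustration only, not FileJoiner's actual code: the object name `PrefixGrouping`, its helper methods, and the sample file names are hypothetical, and the capture group is added to the quoted regular expression so that the prefix can be extracted.

```scala
// Sketch: group file names by the prefix that FileJoiner looks for.
object PrefixGrouping {
  // The regular expression quoted in the description; the capture group
  // around the prefix is an addition made for this sketch.
  private val Prefix = """^(\d{8}\.\d{2}(?:-PT\d\d?H)?)\.""".r

  /** Returns the "YYYYMMDD.HH" or "YYYYMMDD.HH-PTddH" prefix, if the name has one. */
  def prefixOf(name: String): Option[String] =
    Prefix.findFirstMatchIn(name).map(_.group(1))

  /** Groups names that share a prefix; names without a prefix are dropped. */
  def groupByPrefix(names: Seq[String]): Map[String, Seq[String]] =
    names
      .flatMap(n => prefixOf(n).map(p => p -> n))
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2) }
}
```

Each group in the resulting map corresponds to one set of files that would be candidates for joining into a single file.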
By default, files that share the same prefix are only processed when there
are two or more files. To force re-writing when there is a single file,
set the Java property mothra.filejoiner.minCountToJoin to a value less
than 2. The property may also be used to create a new file only when an
"excessive" number of files share the same prefix.
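A minimal sketch of how the threshold might be read and applied, assuming the comparison is "at least minCountToJoin files". The object `MinCount` and its methods are hypothetical names; only the property name and the default of 2 come from this page.

```scala
// Sketch: read mothra.filejoiner.minCountToJoin and decide whether to join.
object MinCount {
  // Default stated in the documentation.
  val DefaultMinCountToJoin = 2

  /** Reads the threshold from the given properties (system properties by default). */
  def minCountToJoin(props: collection.Map[String, String] = sys.props): Int =
    props.get("mothra.filejoiner.minCountToJoin")
      .map(_.trim.toInt)
      .getOrElse(DefaultMinCountToJoin)

  /** Assumption: a prefix group is processed when it holds at least minCount files. */
  def shouldJoin(fileCount: Int, minCount: Int): Boolean =
    fileCount >= minCount
}
```

With the default, a lone file is left untouched; with the property set to 1, a lone file is rewritten, which is what allows a compression change to be applied to every file.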
There is always a single thread that recursively scans the directories.
The number of threads that join the files may be set by specifying the
mothra.filejoiner.maxThreads Java property. If not specified, the
default is 6.
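The bounded pool of joining threads could look roughly like the sketch below. It assumes a fixed-size `java.util.concurrent` thread pool, which this page does not confirm; `JoinerPool`, `maxThreads`, and `runAll` are hypothetical names, and the one-minute wait is arbitrary.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch: run joining tasks on at most maxThreads worker threads.
object JoinerPool {
  // Default stated in the documentation.
  val DefaultMaxThreads = 6

  /** Reads mothra.filejoiner.maxThreads, falling back to the default. */
  def maxThreads(props: collection.Map[String, String] = sys.props): Int =
    props.get("mothra.filejoiner.maxThreads")
      .map(_.trim.toInt)
      .getOrElse(DefaultMaxThreads)

  /** Submits every task to a fixed-size pool and waits for the pool to drain. */
  def runAll(tasks: Seq[Runnable], threads: Int): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    try tasks.foreach(t => pool.execute(t))
    finally pool.shutdown() // stop accepting work; queued tasks still run
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```

In this sketch the caller plays the role of the single scanning thread, and the pool holds only the joining work.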
FileJoiner may be run so that either it spawns a thread for every
directory that contains files to be joined or it spawns a thread for each
set of files in a directory that have the same prefix. The behavior is
controlled by whether the mothra.filejoiner.spawnThread Java property is
set to by-prefix or by-directory. The default is by-directory.
(For backwards compatibility, by-hour is an alias for by-prefix.)
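One way to picture the three accepted values is as a lookup table from property value to a "thread per directory" flag, similar in shape to the spawnThreadMap member (a Map[String, Boolean]). The contents below are an assumption inferred from the description, and `SpawnMode` is a hypothetical name; how FileJoiner treats an unrecognized value is not stated here.

```scala
// Sketch: map the spawnThread property value to a thread-per-directory flag.
object SpawnMode {
  // Default stated in the documentation.
  val DefaultSpawnThread = "by-directory"

  // Assumed contents: true means one thread per directory,
  // false means one thread per set of files sharing a prefix.
  val spawnThreadMap: Map[String, Boolean] = Map(
    "by-directory" -> true,
    "by-prefix"    -> false,
    "by-hour"      -> false // backwards-compatible alias for by-prefix
  )

  /** Returns None for a value that is not one of the documented choices. */
  def threadPerDirectory(value: String): Option[Boolean] =
    spawnThreadMap.get(value)
}
```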
By default, FileJoiner does not compress the files it writes.
(NOTE: It should support writing the output using the same compression as
the input.) To specify the compression codec that it should use, specify
the mothra.filejoiner.compression Java property. Values typically
supported by Hadoop include bzip2, gzip, lz4, lzo, lzop,
snappy, and default. The empty string indicates no compression.
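The "empty string means no compression" convention maps naturally onto an Option. The sketch below only interprets the property value; it assumes the default is the empty string (consistent with the statement that files are not compressed by default) and leaves out the step of resolving the name to a Hadoop codec, which could be done with Hadoop's CompressionCodecFactory. `CompressionSetting` and `codecName` are hypothetical names.

```scala
// Sketch: interpret mothra.filejoiner.compression as an optional codec name.
object CompressionSetting {
  // Assumption: the default is the empty string, meaning no compression.
  val DefaultCompression = ""

  /** None means "write uncompressed"; Some(name) names the codec to look up. */
  def codecName(props: collection.Map[String, String] = sys.props): Option[String] =
    props.getOrElse("mothra.filejoiner.compression", DefaultCompression).trim match {
      case ""   => None
      case name => Some(name)
    }
}
```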
FileJoiner joins files sharing the same prefix into a single file by
default. The mothra.filejoiner.maximumSize Java property may be used to
limit the maximum file size. The size is for the compressed file if
compression is active. The value is approximate since it is only checked
after the data appears on disk, which occurs in large blocks because of
buffering by the Java stream code and the compression algorithm. (By
setting that property and mothra.filejoiner.minCountToJoin to 1, you can
force large files to be split into smaller ones, making the FileJoiner a
file-splitter.)
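The size-limited behavior can be modeled with a small simulation. This is a simplification under stated assumptions: records are represented only by their sizes in bytes, the size check happens after each record rather than after buffered blocks reach disk, and `SizeSplit.split` is a hypothetical helper rather than part of FileJoiner.

```scala
// Sketch: start a new output file once the current one exceeds maximumSize.
object SizeSplit {
  /** Distributes record sizes across output files. A file is closed as soon
    * as its running size exceeds the limit, so it may overshoot by one record.
    * With no limit (None), everything goes into a single file. */
  def split(recordSizes: Seq[Long], maximumSize: Option[Long]): Vector[Vector[Long]] = {
    val start = (Vector.empty[Vector[Long]], Vector.empty[Long], 0L)
    val (closed, current, _) = recordSizes.foldLeft(start) {
      case ((files, cur, size), record) =>
        val grown   = cur :+ record
        val newSize = size + record
        if (maximumSize.exists(newSize > _)) (files :+ grown, Vector.empty[Long], 0L)
        else (files, grown, newSize)
    }
    // Keep the last, still-open file if it received any records.
    if (current.nonEmpty) closed :+ current else closed
  }
}
```

Applied to a single large input with a limit set, the same loop yields several smaller outputs, which is the file-splitter use described above.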
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
DEFAULT_COMPRESSION: String
The default compression codec to use for files written to HDFS. This may be modified by specifying the following property: mothra.filejoiner.compression.
Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. The empty string indicates no compression.
-
val
DEFAULT_MAX_THREADS: Int
The default number of threads to run for joining files when the mothra.filejoiner.maxThreads Java property is not set. (The scanning task always runs in its own thread.)
-
val
DEFAULT_SPAWN_THREAD: String
The default value for spawnThread when the mothra.filejoiner.spawnThread Java property is not specified.
-
def
args: Array[String]
- Attributes
- protected
- Definition Classes
- App
- Annotations
- @deprecatedOverriding( "args should not be overridden" , "2.11.0" )
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
val
compressCodec: Option[CompressionCodec]
The compression codec used for files written to HDFS. This may be set by setting the "mothra.filejoiner.compression" property. If that property is not set, DEFAULT_COMPRESSION is used.
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
executionStart: Long
- Definition Classes
- App
- Annotations
- @deprecatedOverriding( ... , "2.11.0" )
- val fileSystem: FileSystem
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
implicit
val
hadoopConf: Configuration
The Hadoop configuration
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
implicit
val
infoModel: InfoModel
The information model
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
val
logTaskCountInterval: Int
How often to print log messages regarding the number of tasks, in seconds.
-
val
logger: Logger
- Attributes
- protected
- Definition Classes
- StrictLogging
-
def
main(args: Array[String]): Unit
- Definition Classes
- App
- Annotations
- @deprecatedOverriding( "main should not be overridden" , "2.11.0" )
-
val
maxThreads: Int
The maximum number of filejoiner threads to start. It defaults to the value DEFAULT_MAX_THREADS. This run-time behavior may be modified by setting the mothra.filejoiner.maxThreads property.
-
val
maximumSize: Option[Long]
The (approximate) maximum size of a file to create. The default is no maximum. When a file's size exceeds this value, the file is closed and a new file is started. Typically a file's size will not exceed this value by more than the maximum size of an IPFIX message, 64k.
-
val
minCountToJoin: Int
The number of files that must exist having the same "YYYYMMDD.HH" or "YYYYMMDD.HH-PTddH" prefix (in a single directory) for those files to be joined into a larger file. The default is 2.
This may be modified by setting the mothra.filejoiner.minCountToJoin property. For example, when changing the compression, you may want to modify all files by setting this to 1, even if multiple files do not need to be joined.
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- val positionalArgs: Array[String]
-
val
spawnThread: String
The behavior as to whether a file-joining thread is spawned...
by-directory: for every directory that contains files to be joined, or
by-prefix: for every unique basename prefix (that is, the file name without the UUID) within a single directory that has files to be joined.
by-hour is an alias for by-prefix.
The default is specified by the DEFAULT_SPAWN_THREAD variable. The run-time behavior may be modified by setting the mothra.filejoiner.spawnThread Java property to one of those values.
-
val
spawnThreadMap: Map[String, Boolean]
Mapping from spawnThread value to threadPerDirectory.
- val switches: Array[String]
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
- def usage(full: Boolean = false): Unit
- def version(): Unit
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
Deprecated Value Members
-
def
delayedInit(body: ⇒ Unit): Unit
- Definition Classes
- App → DelayedInit
- Annotations
- @deprecated
- Deprecated
(Since version 2.11.0) the delayedInit mechanism will disappear