Class GuaguaInputFormat


  • public class GuaguaInputFormat
    extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    GuaguaInputFormat determines how many mappers a Guagua MapReduce job starts.

    In getSplits(JobContext), one GuaguaInputSplit instance is added as the master; all other splits are workers. This ensures that one master and multiple workers are started as mapper tasks.
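
    The one-master, many-workers layout described above can be pictured with a small model. This is only a sketch under stated assumptions: SketchSplit is a hypothetical stand-in for GuaguaInputSplit, buildSplits is a hypothetical helper, and placing the master first in the list is an assumption, not something this page states. It uses plain Java collections rather than the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MasterWorkerSplitSketch {

    /** Hypothetical stand-in for GuaguaInputSplit: a master flag plus the files a worker reads. */
    static final class SketchSplit {
        final boolean master;
        final List<String> files;

        SketchSplit(boolean master, List<String> files) {
            this.master = master;
            this.files = files;
        }
    }

    /**
     * Builds one master split (no input files) followed by one worker split per input file.
     * Each returned split corresponds to one mapper task.
     */
    static List<SketchSplit> buildSplits(List<String> inputFiles) {
        List<SketchSplit> splits = new ArrayList<>();
        // Assumption: the master split is added first and carries no input data.
        splits.add(new SketchSplit(true, new ArrayList<String>()));
        for (String file : inputFiles) {
            List<String> one = new ArrayList<>();
            one.add(file);
            splits.add(new SketchSplit(false, one));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<SketchSplit> splits = buildSplits(Arrays.asList("part-00000", "part-00001", "part-00002"));
        int masters = 0;
        for (SketchSplit s : splits) {
            if (s.master) {
                masters++;
            }
        }
        System.out.println(splits.size() + " mapper tasks: " + masters + " master, "
                + (splits.size() - masters) + " workers");
    }
}
```

    In this model, starting additional masters would amount to adding more entries with the master flag set, which is what the multiple-master note below refers to.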

    If multiple masters are needed, add more GuaguaInputSplit instances in getSplits(JobContext). However, fail-over across multiple masters is sometimes worse than letting Hadoop restart the failed master through its normal mapper task fail-over. In the multi-master case, when one master goes down, ZooKeeper detects the failure only after the session timeout expires. If the session timeout is too large, detection may take longer than Hadoop needs to restart a task.

    By default Guagua relies on Hadoop's default split implementation, but it also provides a mechanism for combining several splits. Set GuaguaConstants.GUAGUA_SPLIT_COMBINABLE to true and GuaguaConstants.GUAGUA_SPLIT_MAX_COMBINED_SPLIT_SIZE to the maximum combined size so that splits are combined up to that size.

     -Dguagua.split.combinable=true -Dguagua.split.maxCombinedSplitSize=268435456
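
    To make the effect of the maximum combined size concrete, the sketch below groups split lengths greedily, in order, until adding the next split would exceed the limit. It is an illustration only: CombineSplitsSketch and combine are hypothetical names, the greedy in-order strategy is an assumption (the actual logic is described on this page as copied from Pig), and split lengths are modelled as plain long values rather than InputSplit objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSplitsSketch {

    /**
     * Groups split lengths in order. A group is closed as soon as adding the next
     * split would push its total above maxCombinedSplitSize. A single split larger
     * than the limit still forms its own group.
     */
    static List<List<Long>> combine(List<Long> splitLengths, long maxCombinedSplitSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0L;
        for (long length : splitLengths) {
            if (!current.isEmpty() && currentSize + length > maxCombinedSplitSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0L;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long block = 134217728L;    // 128 MiB, an assumed split size for illustration
        long maxSize = 268435456L;  // the value used in the -D example above
        List<Long> lengths = Arrays.asList(block, block, block, block, block);
        List<List<Long>> groups = combine(lengths, maxSize);
        System.out.println(lengths.size() + " splits combined into " + groups.size() + " groups");
    }
}
```

    With five 128 MiB splits and a limit of 268435456, this sketch yields three groups (two pairs and one single split), so fewer worker mappers would be started than with uncombined splits.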
     
    • Method Summary

      Modifier and Type Method Description
      org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,​org.apache.hadoop.io.Text> createRecordReader​(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)  
      static List<List<org.apache.hadoop.mapreduce.InputSplit>> getCombineGuaguaSplits​(List<org.apache.hadoop.mapreduce.InputSplit> oneInputSplits, long maxCombinedSplitSize)  
      protected List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits​(List<org.apache.hadoop.mapreduce.InputSplit> newSplits, long combineSize)
      Copied from the Pig implementation; this logic still needs to be checked.
      protected List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits​(org.apache.hadoop.mapreduce.JobContext job)
      Generate the list of files and make them into FileSplits.
      List<org.apache.hadoop.mapreduce.InputSplit> getSplits​(org.apache.hadoop.mapreduce.JobContext job)
      Builds the splits, including the master split; also supports combining input splits, as Pig does.
      protected boolean isPigOrHadoopMetaFile​(org.apache.hadoop.fs.Path path)
      Whether the path is a Pig or Hadoop meta output file.
      protected boolean isSplitable​(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path file)  
      • Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat

        addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
    • Constructor Detail

      • GuaguaInputFormat

        public GuaguaInputFormat()
    • Method Detail

      • getSplits

        public List<org.apache.hadoop.mapreduce.InputSplit> getSplits​(org.apache.hadoop.mapreduce.JobContext job)
                                                               throws IOException
        Builds the splits, including the master split; also supports combining input splits, as Pig does.
        Overrides:
        getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,​org.apache.hadoop.io.Text>
        Throws:
        IOException
      • getFinalCombineGuaguaSplits

        protected List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits​(List<org.apache.hadoop.mapreduce.InputSplit> newSplits,
                                                                                           long combineSize)
                                                                                    throws IOException
        Copied from the Pig implementation; this logic still needs to be checked.
        Throws:
        IOException
      • getGuaguaSplits

        protected List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits​(org.apache.hadoop.mapreduce.JobContext job)
                                                                        throws IOException
        Generate the list of files and make them into FileSplits.
        Throws:
        IOException
      • isPigOrHadoopMetaFile

        protected boolean isPigOrHadoopMetaFile​(org.apache.hadoop.fs.Path path)
        Whether the path is a Pig or Hadoop meta output file.
      • isSplitable

        protected boolean isSplitable​(org.apache.hadoop.mapreduce.JobContext context,
                                      org.apache.hadoop.fs.Path file)
        Overrides:
        isSplitable in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat
      • createRecordReader

        public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,​org.apache.hadoop.io.Text> createRecordReader​(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                                                              org.apache.hadoop.mapreduce.TaskAttemptContext context)
        Overrides:
        createRecordReader in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat