ml.shifu.guagua.mapreduce
Class GuaguaInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text>
          extended by org.apache.hadoop.mapreduce.lib.input.TextInputFormat
              extended by ml.shifu.guagua.mapreduce.GuaguaInputFormat

public class GuaguaInputFormat
extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat

GuaguaInputFormat is used to determine how many mappers are started in a guagua MapReduce job.

In getSplits(JobContext), we add one GuaguaInputSplit instance as the master; the others are workers. This ensures that one master and multiple workers are started as mapper tasks.

If multiple masters are needed, add new GuaguaInputSplit instances in getSplits(JobContext). Note, however, that fail-over with multiple masters is sometimes no better than restarting the master task through Hadoop's mapper fail-over: with multiple masters, if one master goes down, ZooKeeper waits for the configured session timeout to detect the failed master, and if that timeout is too large it may exceed the time Hadoop takes to restart a failed task.
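The one-master-plus-workers arrangement above can be sketched in plain Java. This is a hypothetical model, not the real API: each mapper task is represented only by a flag saying whether it runs the master, whereas the actual getSplits(JobContext) builds GuaguaInputSplit objects carrying the file splits.

```java
import java.util.ArrayList;
import java.util.List;

public class MasterSplitSketch {
    // Hypothetical sketch of GuaguaInputFormat's split arrangement:
    // one extra split is prepended as the master, so that exactly one
    // mapper task becomes the master and the rest become workers.
    public static List<Boolean> buildSplits(int workerSplits) {
        List<Boolean> splits = new ArrayList<>();
        splits.add(Boolean.TRUE);       // the single master split (no data)
        for (int i = 0; i < workerSplits; i++) {
            splits.add(Boolean.FALSE);  // worker splits carrying input data
        }
        return splits;
    }
}
```

A job with N data splits thus launches N + 1 mapper tasks, of which exactly one is the master.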

By default guagua relies on Hadoop's default split implementation, but it also provides a mechanism to combine several splits together. Set GuaguaConstants.GUAGUA_SPLIT_COMBINABLE to true and GuaguaConstants.GUAGUA_SPLIT_MAX_COMBINED_SPLIT_SIZE to a byte size to combine splits up to that size:

 -Dguagua.split.combinable=true -Dguagua.split.maxCombinedSplitSize=268435456
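The combining behavior can be sketched in plain Java: a greedy pass packs split sizes into groups whose total stays under the configured maximum. The names here are illustrative only; the real getCombineGuaguaSplits operates on InputSplit objects rather than raw sizes.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCombiner {
    // Hypothetical sketch of split combining: greedily pack split sizes
    // (in bytes) into groups whose total does not exceed
    // maxCombinedSplitSize, starting a new group whenever the next split
    // would overflow the current one.
    public static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSplitSize) {
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0L;
        for (long size : splitSizes) {
            if (!current.isEmpty() && currentSize + size > maxCombinedSplitSize) {
                combined.add(current);          // current group is full
                current = new ArrayList<>();
                currentSize = 0L;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);              // flush the last group
        }
        return combined;
    }
}
```

For example, four 100-byte splits with a 250-byte limit collapse into two combined splits of two members each, halving the number of mapper tasks.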
 


Constructor Summary
GuaguaInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
           
static List<List<org.apache.hadoop.mapreduce.InputSplit>> getCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> oneInputSplits, long maxCombinedSplitSize)
           
protected  List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> newSplits, long combineSize)
          Copied from the Pig implementation; this code logic still needs review.
protected  List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits(org.apache.hadoop.mapreduce.JobContext job)
          Generate the list of files and make them into FileSplits.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
          Split building logic, including master split setup and a Pig-like input-combining feature.
protected  boolean isPigOrHadoopMetaFile(org.apache.hadoop.fs.Path path)
          Whether the given path is a Pig or Hadoop meta output file.
protected  boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path file)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GuaguaInputFormat

public GuaguaInputFormat()
Method Detail

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
                                                       throws IOException
Split building logic, including master split setup and a Pig-like input-combining feature.

Overrides:
getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text>
Throws:
IOException

getFinalCombineGuaguaSplits

protected List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> newSplits,
                                                                                   long combineSize)
                                                                            throws IOException
Copied from the Pig implementation; this code logic still needs review.

Throws:
IOException

getGuaguaSplits

protected List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits(org.apache.hadoop.mapreduce.JobContext job)
                                                                throws IOException
Generate the list of files and make them into FileSplits.

Throws:
IOException

getCombineGuaguaSplits

public static List<List<org.apache.hadoop.mapreduce.InputSplit>> getCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> oneInputSplits,
                                                                                        long maxCombinedSplitSize)
                                                                                 throws IOException,
                                                                                        InterruptedException
Throws:
IOException
InterruptedException

isPigOrHadoopMetaFile

protected boolean isPigOrHadoopMetaFile(org.apache.hadoop.fs.Path path)
Whether the given path is a Pig or Hadoop meta output file.


isSplitable

protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context,
                              org.apache.hadoop.fs.Path file)
Overrides:
isSplitable in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                                                org.apache.hadoop.mapreduce.TaskAttemptContext context)
Overrides:
createRecordReader in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat


Copyright © 2014. All Rights Reserved.