Class GuaguaInputFormat
- java.lang.Object
  - org.apache.hadoop.mapreduce.InputFormat<K,V>
    - org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text>
      - org.apache.hadoop.mapreduce.lib.input.TextInputFormat
        - ml.shifu.guagua.mapreduce.GuaguaInputFormat
public class GuaguaInputFormat extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat

GuaguaInputFormat determines how many mappers are started in a guagua MapReduce job.

In getSplits(JobContext), one GuaguaInputSplit instance is added as the master; the other splits are workers. This ensures that one master and multiple workers are started as mapper tasks.

If multiple masters are needed, add more GuaguaInputSplit instances in getSplits(JobContext). However, fail-over across multiple masters is sometimes worse than letting Hadoop restart the failed master through its normal mapper task fail-over. With multiple masters, when one master goes down, ZooKeeper waits for the session timeout before the failed master is detected. If the session timeout is too large, that wait may be longer than the time Hadoop needs to restart a task.

By default guagua relies on Hadoop's default splits implementation, but it also provides a mechanism for combining several splits together. Set GuaguaConstants.GUAGUA_SPLIT_COMBINABLE to true and GuaguaConstants.GUAGUA_SPLIT_MAX_COMBINED_SPLIT_SIZE to a number to combine splits up to the given size:

-Dguagua.split.combinable=true -Dguagua.split.maxCombinedSplitSize=268435456
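To make the effect of the maximum combined split size concrete, here is a minimal sketch of one way splits could be grouped under a size limit. It is not the Guagua or Pig implementation: the class name SplitCombineSketch, the greedy in-order grouping, and the reading of the limit as a byte count are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: greedy grouping of split lengths, not Guagua's actual code. */
final class SplitCombineSketch {

    /**
     * Groups consecutive split lengths so that each group's total stays at or
     * below maxCombinedSplitSize. A split larger than the limit forms its own group.
     */
    static List<List<Long>> combine(List<Long> splitLengths, long maxCombinedSplitSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0L;
        for (long length : splitLengths) {
            // Close the current group when adding this split would exceed the limit.
            if (!current.isEmpty() && currentSize + length > maxCombinedSplitSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0L;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long max = 268435456L; // value from the -D example, assumed here to be bytes (256 MB)
        long mb = 1024L * 1024L;
        List<Long> lengths = List.of(128 * mb, 128 * mb, 64 * mb, 300 * mb, 10 * mb);
        System.out.println(combine(lengths, max).size() + " combined splits");
    }
}
```

Each resulting group would correspond to one worker mapper, so a larger limit means fewer workers, each reading more data.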
-
Constructor Summary
Constructors
- GuaguaInputFormat()
-
Method Summary
Modifier and Type, Method, Description
- org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
- static List<List<org.apache.hadoop.mapreduce.InputSplit>> getCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> oneInputSplits, long maxCombinedSplitSize)
- protected List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> newSplits, long combineSize)
  Copied from the Pig implementation; the logic still needs to be checked.
- protected List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits(org.apache.hadoop.mapreduce.JobContext job)
  Generates the list of files and makes them into FileSplits.
- List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
  Builds the splits, including the master split; also supports combining input splits, as Pig does.
- protected boolean isPigOrHadoopMetaFile(org.apache.hadoop.fs.Path path)
  Whether the path is a Pig or Hadoop meta output file.
- protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path file)
-
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
-
Method Detail
-
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException
Builds the splits, including the master split; also supports combining input splits, as Pig does.
- Overrides: getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text>
- Throws: IOException
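The one-master, many-workers arrangement can be pictured with a short sketch. This is an illustration under assumptions, not the actual getSplits code: the Split class is a hypothetical stand-in for GuaguaInputSplit, and placing the master first with no input files is a choice made here for clarity.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: models a split list with one master and several workers. */
final class MasterWorkerSplitSketch {

    /** Hypothetical stand-in for GuaguaInputSplit: a master flag plus a worker's input files. */
    static final class Split {
        final boolean master;
        final List<String> files;

        Split(boolean master, List<String> files) {
            this.master = master;
            this.files = files;
        }
    }

    /** Builds one master split (assumed to carry no input) plus one worker split per file group. */
    static List<Split> buildSplits(List<List<String>> workerFileGroups) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(true, List.of()));
        for (List<String> group : workerFileGroups) {
            splits.add(new Split(false, List.copyOf(group)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Split> splits = buildSplits(
                List.of(List.of("part-00000"), List.of("part-00001", "part-00002")));
        // Hadoop launches one mapper task per split: one master and two workers here.
        System.out.println(splits.size() + " mapper tasks");
    }
}
```

Because Hadoop starts one mapper per split, the length of the returned list fixes the number of mapper tasks in the job.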
-
getFinalCombineGuaguaSplits
protected List<org.apache.hadoop.mapreduce.InputSplit> getFinalCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> newSplits, long combineSize) throws IOException
Copied from the Pig implementation; the logic still needs to be checked.
- Throws: IOException
-
getGuaguaSplits
protected List<org.apache.hadoop.mapreduce.InputSplit> getGuaguaSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException
Generates the list of files and makes them into FileSplits.
- Throws: IOException
-
getCombineGuaguaSplits
public static List<List<org.apache.hadoop.mapreduce.InputSplit>> getCombineGuaguaSplits(List<org.apache.hadoop.mapreduce.InputSplit> oneInputSplits, long maxCombinedSplitSize) throws IOException, InterruptedException
- Throws: IOException, InterruptedException
-
isPigOrHadoopMetaFile
protected boolean isPigOrHadoopMetaFile(org.apache.hadoop.fs.Path path)
Whether the path is a Pig or Hadoop meta output file.
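A meta-file check of this kind might look like the sketch below. The marker names are assumptions: _SUCCESS is the marker Hadoop commonly writes into a finished job's output directory, and .pig_header and .pig_schema are files Pig can write next to its output. Whether GuaguaInputFormat tests exactly these names is not stated here, and MetaFileFilterSketch and isMetaFile are hypothetical names.

```java
import java.util.List;

/** Illustrative only: filters out meta files by name, not Guagua's actual check. */
final class MetaFileFilterSketch {

    // Assumed marker names; not taken from Guagua's source.
    private static final List<String> META_MARKERS =
            List.of("_SUCCESS", ".pig_header", ".pig_schema");

    /** Returns true when the last path component is one of the assumed meta file names. */
    static boolean isMetaFile(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return META_MARKERS.contains(name);
    }

    public static void main(String[] args) {
        for (String p : List.of("/data/out/part-m-00000", "/data/out/_SUCCESS", "/data/out/.pig_header")) {
            System.out.println(p + " -> " + (isMetaFile(p) ? "skip" : "read"));
        }
    }
}
```

Skipping such files keeps empty markers and schema descriptions from being turned into worker splits.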
-
isSplitable
protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, org.apache.hadoop.fs.Path file)
- Overrides: isSplitable in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat
-
createRecordReader
public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
- Overrides: createRecordReader in class org.apache.hadoop.mapreduce.lib.input.TextInputFormat