Package org.apache.avro.mapred
Avro data files do not contain key/value pairs as expected by Hadoop's MapReduce API, but rather just a sequence of values. Thus we provide here a layer on top of Hadoop's MapReduce API.
In all cases, input and output paths are set and jobs are submitted as with standard Hadoop jobs:
- Specify input files with FileInputFormat.setInputPaths(org.apache.hadoop.mapred.JobConf, java.lang.String)
- Specify an output directory with FileOutputFormat.setOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)
- Run your job with JobClient.runJob(org.apache.hadoop.mapred.JobConf)
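These three steps can be sketched as a minimal driver skeleton; the class name and command-line argument usage are illustrative, not part of the Avro API:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyDriver.class);
    // Specify input files and the output directory (paths taken from args here).
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ...configure mapper/reducer as appropriate for the job type...
    JobClient.runJob(job);  // submit the job and wait for completion
  }
}
```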
For jobs whose input and output are Avro data files:
- Call AvroJob.setInputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) and AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's input and output schemas.
- Subclass AvroMapper and specify this as your job's mapper with AvroJob.setMapperClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroMapper>)
- Subclass AvroReducer and specify this as your job's reducer and perhaps combiner, with AvroJob.setReducerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>) and AvroJob.setCombinerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>)
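Putting these steps together, here is a word-count sketch for an Avro-in/Avro-out job; the class names and word-count logic are illustrative, assuming a "string" input schema:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class AvroWordCount {
  // Emits a (word, 1) pair for each word in an input string datum.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+"))
        collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
    }
  }

  // Sums the counts for each word.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long c : counts) sum += c;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroWordCount.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setCombinerClass(job, SumReducer.class);  // a summing reducer is safe as a combiner
    JobClient.runJob(job);
  }
}
```

Note that here the same AvroReducer subclass serves as both reducer and combiner, since its input and output pair types match.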
For jobs whose input is an Avro data file and which use an AvroMapper, but whose reducer is a non-Avro Reducer and whose output is a non-Avro format:
- Call AvroJob.setInputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's input schema.
- Subclass AvroMapper and specify this as your job's mapper with AvroJob.setMapperClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroMapper>)
- Implement Reducer and specify your job's reducer with JobConf.setReducerClass(java.lang.Class<? extends org.apache.hadoop.mapred.Reducer>). The input key and value types should be AvroKey and AvroValue.
- Optionally implement Reducer and specify your job's combiner with JobConf.setCombinerClass(java.lang.Class<? extends org.apache.hadoop.mapred.Reducer>). You will be unable to re-use the same Reducer class as the combiner, as the combiner will need its input and output key to be AvroKey, and its input and output value to be AvroValue.
- Specify your job's output key and value types with JobConf.setOutputKeyClass(java.lang.Class<?>) and JobConf.setOutputValueClass(java.lang.Class<?>)
- Specify your job's output format with JobConf.setOutputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.OutputFormat>)
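A hedged sketch of this hybrid configuration, again using word count as the example; the class names are illustrative, and the explicit map-output schema call (AvroJob.setMapOutputSchema) is an assumption about how the intermediate shuffle schema is declared:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class AvroInTextOut {
  // AvroMapper reading string datums and emitting (word, 1) pairs.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+"))
        collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
    }
  }

  // Non-Avro reducer: input key/value are AvroKey/AvroValue, output is plain text.
  public static class SumReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) sum += counts.next().datum();
      out.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroInTextOut.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    // Assumption: the map-output (shuffle) schema is declared explicitly as a pair schema.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
    JobClient.runJob(job);
  }
}
```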
For jobs whose input is a non-Avro data file and which use a non-Avro Mapper, but whose reducer is an AvroReducer and whose output is an Avro data file:
- Set your input file format with JobConf.setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat>)
- Implement Mapper and specify your job's mapper with JobConf.setMapperClass(java.lang.Class<? extends org.apache.hadoop.mapred.Mapper>). The output key and value types should be AvroKey and AvroValue.
- Subclass AvroReducer and specify this as your job's reducer and perhaps combiner, with AvroJob.setReducerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>) and AvroJob.setCombinerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>)
- Call AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's output schema.
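A sketch of the reverse direction, text in and Avro out; the class names are illustrative, and it assumes the intermediate types are derived from the output pair schema passed to AvroJob.setOutputSchema:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TextInAvroOut {
  // Non-Avro mapper: wraps its output in AvroKey/AvroValue for the AvroReducer.
  public static class TokenizeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroKey<Utf8>, AvroValue<Long>> out, Reporter reporter)
        throws IOException {
      for (String word : line.toString().split("\\s+"))
        out.collect(new AvroKey<Utf8>(new Utf8(word)), new AvroValue<Long>(1L));
    }
  }

  // AvroReducer writing (word, count) pairs to the Avro output file.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long c : counts) sum += c;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextInAvroOut.class);
    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenizeMapper.class);
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setReducerClass(job, SumReducer.class);
    JobClient.runJob(job);
  }
}
```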
For jobs whose input is a non-Avro data file and which use a non-Avro Mapper and no reducer, i.e., a map-only job:
- Set your input file format with JobConf.setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat>)
- Implement Mapper and specify your job's mapper with JobConf.setMapperClass(java.lang.Class<? extends org.apache.hadoop.mapred.Mapper>). The output key and value types should be AvroWrapper and NullWritable.
- Call JobConf.setNumReduceTasks(int) with zero.
- Call AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's output schema.
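A minimal map-only sketch, converting text lines into an Avro data file of strings; the class names and schema choice are illustrative:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TextToAvroMapOnly {
  // Wraps each input line as an Avro string datum, with NullWritable as the value.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroWrapper<Utf8>, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(new AvroWrapper<Utf8>(new Utf8(line.toString())), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextToAvroMapOnly.class);
    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);  // map-only: mapper output goes straight to the output format
    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));
    JobClient.runJob(job);
  }
}
```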
Class Summary
- AvroAsTextInputFormat: An InputFormat for Avro data files, which converts each datum to string form in the input key.
- AvroCollector<T>: A collector for map and reduce output.
- AvroInputFormat<T>: An InputFormat for Avro data files.
- AvroJob: Setters to configure jobs for Avro data.
- AvroKey<T>: The wrapper of keys for jobs configured with AvroJob.
- AvroKeyComparator<T>: The RawComparator used by jobs configured with AvroJob.
- AvroMapper<IN,OUT>: A mapper for Avro data.
- AvroMultipleInputs: Supports Avro-MapReduce jobs that have multiple input paths with a different Schema and AvroMapper for each path.
- AvroMultipleOutputs: Simplifies writing Avro output data to multiple outputs.
- AvroOutputFormat<T>: An OutputFormat for Avro data files.
- AvroRecordReader<T>: A RecordReader for Avro data files.
- AvroReducer<K,V,OUT>: A reducer for Avro data.
- AvroSerialization<T>: The Serialization used by jobs configured with AvroJob.
- AvroTextOutputFormat<K,V>: The equivalent of TextOutputFormat for writing to Avro data files with a "bytes" schema.
- AvroUtf8InputFormat: An InputFormat for text files.
- AvroValue<T>: The wrapper of values for jobs configured with AvroJob.
- AvroWrapper<T>: The wrapper of data for jobs configured with AvroJob.
- FsInput: Adapt an FSDataInputStream to SeekableInput.
- Pair<K,V>: A key/value pair.
- SequenceFileInputFormat<K,V>: An InputFormat for sequence files.
- SequenceFileReader<K,V>: A FileReader for sequence files.
- SequenceFileRecordReader<K,V>: A RecordReader for sequence files.