Package org.apache.avro.mapred
Avro data files do not contain key/value pairs as expected by Hadoop's MapReduce API, but rather just a sequence of values. Thus we provide here a layer on top of Hadoop's MapReduce API.
In all cases, input and output paths are set and jobs are submitted as with standard Hadoop jobs:
- Specify input files with FileInputFormat.setInputPaths(org.apache.hadoop.mapred.JobConf, java.lang.String)
- Specify an output directory with FileOutputFormat.setOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)
- Run your job with JobClient.runJob(org.apache.hadoop.mapred.JobConf)
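These three steps can be sketched as a minimal driver skeleton; the class name and command-line argument usage are illustrative, not part of the Avro API:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyDriver.class);
    // Specify input files and the output directory (paths taken from args here).
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ...configure mapper/reducer as appropriate for the job type...
    JobClient.runJob(job);  // submit the job and wait for completion
  }
}
```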
For jobs whose input and output are Avro data files:
- Call AvroJob.setInputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) and AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's input and output schemas.
- Subclass AvroMapper and specify this as your job's mapper with AvroJob.setMapperClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroMapper>)
- Subclass AvroReducer and specify this as your job's reducer and perhaps combiner, with AvroJob.setReducerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>) and AvroJob.setCombinerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>)
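Putting these steps together, here is a word-count sketch for an Avro-in/Avro-out job; the class names and word-count logic are illustrative, assuming a "string" input schema:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class AvroWordCount {
  // Emits a (word, 1) pair for each word in an input string datum.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+"))
        collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
    }
  }

  // Sums the counts for each word.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long c : counts) sum += c;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroWordCount.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setCombinerClass(job, SumReducer.class);  // a summing reducer is safe as a combiner
    JobClient.runJob(job);
  }
}
```

Note that here the same AvroReducer subclass serves as both reducer and combiner, since its input and output pair types match.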
For jobs whose input is an Avro data file and which use an AvroMapper, but whose reducer is a non-Avro Reducer and whose output is a non-Avro format:
- Call AvroJob.setInputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's input schema.
- Subclass AvroMapper and specify this as your job's mapper with AvroJob.setMapperClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroMapper>)
- Implement Reducer and specify your job's reducer with JobConf.setReducerClass(java.lang.Class<? extends org.apache.hadoop.mapred.Reducer>). The input key and value types should be AvroKey and AvroValue.
- Optionally implement Reducer and specify your job's combiner with JobConf.setCombinerClass(java.lang.Class<? extends org.apache.hadoop.mapred.Reducer>). You will be unable to re-use the same Reducer class as the combiner, as the combiner will need its input and output key to be AvroKey, and its input and output value to be AvroValue.
- Specify your job's output key and value types with JobConf.setOutputKeyClass(java.lang.Class<?>) and JobConf.setOutputValueClass(java.lang.Class<?>)
- Specify your job's output format with JobConf.setOutputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.OutputFormat>)
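A hedged sketch of this hybrid configuration, again using word count as the example; the class names are illustrative, and the explicit map-output schema call (AvroJob.setMapOutputSchema) is an assumption about how the intermediate shuffle schema is declared:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class AvroInTextOut {
  // AvroMapper reading string datums and emitting (word, 1) pairs.
  public static class TokenizeMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
        Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+"))
        collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
    }
  }

  // Non-Avro reducer: input key/value are AvroKey/AvroValue, output is plain text.
  public static class SumReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) sum += counts.next().datum();
      out.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroInTextOut.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    // Assumption: the map-output (shuffle) schema is declared explicitly as a pair schema.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenizeMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
    JobClient.runJob(job);
  }
}
```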
For jobs whose input is a non-Avro data file and which use a non-Avro Mapper, but whose reducer is an AvroReducer and whose output is an Avro data file:
- Set your input file format with JobConf.setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat>)
- Implement Mapper and specify your job's mapper with JobConf.setMapperClass(java.lang.Class<? extends org.apache.hadoop.mapred.Mapper>). The output key and value types should be AvroKey and AvroValue.
- Subclass AvroReducer and specify this as your job's reducer and perhaps combiner, with AvroJob.setReducerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>) and AvroJob.setCombinerClass(org.apache.hadoop.mapred.JobConf, java.lang.Class<? extends org.apache.avro.mapred.AvroReducer>)
- Call AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's output schema.
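A sketch of the reverse direction, text in and Avro out; the class names are illustrative, and it assumes the intermediate types are derived from the output pair schema passed to AvroJob.setOutputSchema:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.*;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TextInAvroOut {
  // Non-Avro mapper: wraps its output in AvroKey/AvroValue for the AvroReducer.
  public static class TokenizeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroKey<Utf8>, AvroValue<Long>> out, Reporter reporter)
        throws IOException {
      for (String word : line.toString().split("\\s+"))
        out.collect(new AvroKey<Utf8>(new Utf8(word)), new AvroValue<Long>(1L));
    }
  }

  // AvroReducer writing (word, count) pairs to the Avro output file.
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
        AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
      long sum = 0;
      for (long c : counts) sum += c;
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextInAvroOut.class);
    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenizeMapper.class);
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
    AvroJob.setReducerClass(job, SumReducer.class);
    JobClient.runJob(job);
  }
}
```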
For jobs whose input is a non-Avro data file and which use a non-Avro Mapper and no reducer, i.e., a map-only job:
- Set your input file format with JobConf.setInputFormat(java.lang.Class<? extends org.apache.hadoop.mapred.InputFormat>)
- Implement Mapper and specify your job's mapper with JobConf.setMapperClass(java.lang.Class<? extends org.apache.hadoop.mapred.Mapper>). The output key and value types should be AvroWrapper and NullWritable.
- Call JobConf.setNumReduceTasks(int) with zero.
- Call AvroJob.setOutputSchema(org.apache.hadoop.mapred.JobConf, org.apache.avro.Schema) with your job's output schema.
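A minimal map-only sketch, converting text lines into an Avro data file of strings; the class names and schema choice are illustrative:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TextToAvroMapOnly {
  // Wraps each input line as an Avro string datum, with NullWritable as the value.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroWrapper<Utf8>, NullWritable> {
    @Override
    public void map(LongWritable offset, Text line,
        OutputCollector<AvroWrapper<Utf8>, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(new AvroWrapper<Utf8>(new Utf8(line.toString())), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextToAvroMapOnly.class);
    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);  // map-only: mapper output goes straight to the output format
    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));
    JobClient.runJob(job);
  }
}
```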
Class Summary
- AvroAsTextInputFormat: An InputFormat for Avro data files, which converts each datum to string form in the input key.
- AvroCollector<T>: A collector for map and reduce output.
- AvroInputFormat<T>: An InputFormat for Avro data files.
- AvroJob: Setters to configure jobs for Avro data.
- AvroKey<T>: The wrapper of keys for jobs configured with AvroJob.
- AvroKeyComparator<T>: The RawComparator used by jobs configured with AvroJob.
- AvroMapper<IN,OUT>: A mapper for Avro data.
- AvroMultipleInputs: Supports Avro-MapReduce jobs that have multiple input paths with a different Schema and AvroMapper for each path.
- AvroMultipleOutputs: Simplifies writing Avro output data to multiple outputs.
- AvroOutputFormat<T>: An OutputFormat for Avro data files.
- AvroRecordReader<T>: A RecordReader for Avro data files.
- AvroReducer<K,V,OUT>: A reducer for Avro data.
- AvroSerialization<T>: The Serialization used by jobs configured with AvroJob.
- AvroTextOutputFormat<K,V>: The equivalent of TextOutputFormat for writing to Avro data files with a "bytes" schema.
- AvroUtf8InputFormat: An InputFormat for text files.
- AvroValue<T>: The wrapper of values for jobs configured with AvroJob.
- AvroWrapper<T>: The wrapper of data for jobs configured with AvroJob.
- FsInput: Adapt an FSDataInputStream to SeekableInput.
- Pair<K,V>: A key/value pair.
- SequenceFileInputFormat<K,V>: An InputFormat for sequence files.
- SequenceFileReader<K,V>: A FileReader for sequence files.
- SequenceFileRecordReader<K,V>: A RecordReader for sequence files.