public class AvroMultipleOutputs extends Object
Case one: writing to additional outputs other than the job default output.
Each additional output, or named output, may be configured with its own
Schema
and OutputFormat
.
A named output can be a single file or a multi file. The later is refered as
a multi named output which is an unbound set of files all sharing the same
Schema
.
Case two: to write data to different files provided by user
AvroMultipleOutputs supports counters, by default they are disabled. The
counters group is the AvroMultipleOutputs
class name. The names of the
counters are the same as the output name. These count the number of records
written to each output name. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.
JobConf job = new JobConf(); FileInputFormat.setInputPath(job, inDir); FileOutputFormat.setOutputPath(job, outDir); job.setMapperClass(MyAvroMapper.class); job.setReducerClass(HadoopReducer.class); job.set("avro.reducer",MyAvroReducer.class); ... Schema schema; ... // Defines additional single output 'avro1' for the job AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema); // Defines additional output 'avro2' with different schema for the job AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null); // if Schema is specified as null then the default output schema is used ... job.waitForCompletion(true); ...
Usage in Reducer:
public class MyAvroReducer extends AvroReducer<K, V, OUT> { private MultipleOutputs amos; public void configure(JobConf conf) { ... amos = new AvroMultipleOutputs(conf); } public void reduce(K, Iterator<V> values, AvroCollector<OUT>, Reporter reporter) throws IOException { ... amos.getCollector("avro1", reporter).collect(datum); amos.getCollector("avro2", "A", reporter).collect(datum); amos.getCollector("avro3", "B", reporter).collect(datum); ... } public void close() throws IOException { amos.close(); ... } }
Constructor and Description |
---|
AvroMultipleOutputs(org.apache.hadoop.mapred.JobConf job)
Creates and initializes multiple named outputs support, it should be
instantiated in the Mapper/Reducer configure method.
|
Modifier and Type | Method and Description |
---|---|
static void |
addMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf,
String namedOutput,
Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass,
Schema schema)
Adds a multi named output for the job.
|
static void |
addNamedOutput(org.apache.hadoop.mapred.JobConf conf,
String namedOutput,
Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass,
Schema schema)
Adds a named output for the job.
|
void |
close()
Closes all the opened named outputs.
|
AvroCollector |
getCollector(String namedOutput,
org.apache.hadoop.mapred.Reporter reporter)
Gets the output collector for a named output.
|
AvroCollector |
getCollector(String namedOutput,
String multiName,
org.apache.hadoop.mapred.Reporter reporter)
Gets the output collector for a multi named output.
|
static boolean |
getCountersEnabled(org.apache.hadoop.mapred.JobConf conf)
Returns if the counters for the named outputs are enabled or not.
|
static Class<? extends org.apache.hadoop.mapred.OutputFormat> |
getNamedOutputFormatClass(org.apache.hadoop.mapred.JobConf conf,
String namedOutput)
Returns the named output OutputFormat.
|
Iterator<String> |
getNamedOutputs()
Returns iterator with the defined name outputs.
|
static List<String> |
getNamedOutputsList(org.apache.hadoop.mapred.JobConf conf)
Returns list of channel names.
|
static boolean |
isMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf,
String namedOutput)
Returns if a named output is multiple.
|
static void |
setCountersEnabled(org.apache.hadoop.mapred.JobConf conf,
boolean enabled)
Enables or disables counters for the named outputs.
|
public AvroMultipleOutputs(org.apache.hadoop.mapred.JobConf job)
job
- the job configuration objectpublic static List<String> getNamedOutputsList(org.apache.hadoop.mapred.JobConf conf)
conf
- job confpublic static boolean isMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput)
conf
- job confnamedOutput
- named outputtrue
if the name output is multi, false
if it is single. If the name output is not defined it returns
false
public static Class<? extends org.apache.hadoop.mapred.OutputFormat> getNamedOutputFormatClass(org.apache.hadoop.mapred.JobConf conf, String namedOutput)
conf
- job confnamedOutput
- named outputpublic static void addNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput, Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass, Schema schema)
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters
and numbers only, cannot be the word 'part' as
that is reserved for the
default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutputpublic static void addMultiNamedOutput(org.apache.hadoop.mapred.JobConf conf, String namedOutput, Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass, Schema schema)
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters
and numbers only, cannot be the word 'part' as
that is reserved for the
default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutputpublic static void setCountersEnabled(org.apache.hadoop.mapred.JobConf conf, boolean enabled)
MultipleOutputs
class name.
The names of the counters are the same as the named outputs. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.conf
- job conf to enableadd the named output.enabled
- indicates if the counters will be enabled or not.public static boolean getCountersEnabled(org.apache.hadoop.mapred.JobConf conf)
MultipleOutputs
class name.
The names of the counters are the same as the named outputs. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.conf
- job conf to enableadd the named output.public Iterator<String> getNamedOutputs()
public AvroCollector getCollector(String namedOutput, org.apache.hadoop.mapred.Reporter reporter) throws IOException
namedOutput
- the named output namereporter
- the reporterIOException
- thrown if output collector could not be createdpublic AvroCollector getCollector(String namedOutput, String multiName, org.apache.hadoop.mapred.Reporter reporter) throws IOException
namedOutput
- the named output namemultiName
- the multi name partreporter
- the reporterIOException
- thrown if output collector could not be createdpublic void close() throws IOException
super.close()
at the
end of their close()
IOException
- thrown if any of the MultipleOutput files
could not be closed properly.Copyright © 2009-2013 The Apache Software Foundation. All Rights Reserved.