public class AvroMultipleOutputs extends Object
Case one: writing to additional outputs other than the job default output.
Each additional output, or named output, may be configured with its own
Schema
and OutputFormat
.
A named output can be a single file or a multi file. The later is refered as
a multi named output which is an unbound set of files all sharing the same
Schema
.
Case two: to write data to different files provided by user
AvroMultipleOutputs supports counters, by default they are disabled. The
counters group is the AvroMultipleOutputs
class name. The names of the
counters are the same as the output name. These count the number of records
written to each output name. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.
JobConf job = new JobConf(); FileInputFormat.setInputPath(job, inDir); FileOutputFormat.setOutputPath(job, outDir); job.setMapperClass(MyAvroMapper.class); job.setReducerClass(HadoopReducer.class); job.set("avro.reducer",MyAvroReducer.class); ... Schema schema; ... // Defines additional single output 'avro1' for the job AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema); // Defines additional output 'avro2' with different schema for the job AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null); // if Schema is specified as null then the default output schema is used ... job.waitForCompletion(true); ...
Usage in Reducer:
public class MyAvroReducer extends AvroReducer<K, V, OUT> { private MultipleOutputs amos; public void configure(JobConf conf) { ... amos = new AvroMultipleOutputs(conf); } public void reduce(K, Iterator<V> values, AvroCollector<OUT>, Reporter reporter) throws IOException { ... amos.collect("avro1", reporter,datum); amos.getCollector("avro2", "A", reporter).collect(datum); amos.collect("avro1",reporter,schema,datum,"testavrofile");// this create a file testavrofile and writes data with schema "schema" into it and uses other values from namedoutput "avro1" like outputclass etc. amos.collect("avro1",reporter,schema,datum,"testavrofile1"); ... } public void close() throws IOException { amos.close(); ... } }
Constructor and Description |
---|
AvroMultipleOutputs(JobConf job)
Creates and initializes multiple named outputs support, it should be
instantiated in the Mapper/Reducer configure method.
|
Modifier and Type | Method and Description |
---|---|
static void |
addMultiNamedOutput(JobConf conf,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Schema schema)
Adds a multi named output for the job.
|
static void |
addNamedOutput(JobConf conf,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Schema schema)
Adds a named output for the job.
|
void |
close()
Closes all the opened named outputs.
|
void |
collect(String namedOutput,
Reporter reporter,
Object datum)
Output Collector for the default schema.
|
void |
collect(String namedOutput,
Reporter reporter,
Schema schema,
Object datum)
OutputCollector with custom schema.
|
void |
collect(String namedOutput,
Reporter reporter,
Schema schema,
Object datum,
String baseOutputPath)
OutputCollector with custom schema and file name.
|
AvroCollector |
getCollector(String namedOutput,
Reporter reporter)
Deprecated.
Use
collect(java.lang.String, org.apache.hadoop.mapred.Reporter, java.lang.Object) method for collecting output |
AvroCollector |
getCollector(String namedOutput,
String multiName,
Reporter reporter)
Gets the output collector for a named output.
|
static boolean |
getCountersEnabled(JobConf conf)
Returns if the counters for the named outputs are enabled or not.
|
static Class<? extends OutputFormat> |
getNamedOutputFormatClass(JobConf conf,
String namedOutput)
Returns the named output OutputFormat.
|
Iterator<String> |
getNamedOutputs()
Returns iterator with the defined name outputs.
|
static List<String> |
getNamedOutputsList(JobConf conf)
Returns list of channel names.
|
static boolean |
isMultiNamedOutput(JobConf conf,
String namedOutput)
Returns if a named output is multiple.
|
static void |
setCountersEnabled(JobConf conf,
boolean enabled)
Enables or disables counters for the named outputs.
|
public AvroMultipleOutputs(JobConf job)
job
- the job configuration objectpublic static List<String> getNamedOutputsList(JobConf conf)
conf
- job confpublic static boolean isMultiNamedOutput(JobConf conf, String namedOutput)
conf
- job confnamedOutput
- named outputtrue
if the name output is multi, false
if it is single. If the name output is not defined it returns
false
public static Class<? extends OutputFormat> getNamedOutputFormatClass(JobConf conf, String namedOutput)
conf
- job confnamedOutput
- named outputpublic static void addNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema)
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters
and numbers only, cannot be the word 'part' as
that is reserved for the
default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutputpublic static void addMultiNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema)
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters
and numbers only, cannot be the word 'part' as
that is reserved for the
default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutputpublic static void setCountersEnabled(JobConf conf, boolean enabled)
AvroMultipleOutputs
class name.
The names of the counters are the same as the named outputs. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.conf
- job conf to enableadd the named output.enabled
- indicates if the counters will be enabled or not.public static boolean getCountersEnabled(JobConf conf)
AvroMultipleOutputs
class name.
The names of the counters are the same as the named outputs. For multi
named outputs the name of the counter is the concatenation of the named
output, and underscore '_' and the multiname.conf
- job conf to enableadd the named output.public Iterator<String> getNamedOutputs()
public void collect(String namedOutput, Reporter reporter, Object datum) throws IOException
namedOutput
- the named output namereporter
- the reporterdatum
- output dataIOException
- thrown if output collector could not be createdpublic void collect(String namedOutput, Reporter reporter, Schema schema, Object datum) throws IOException
namedOutput
- the named output name (this will the output file name)reporter
- the reporterdatum
- output dataschema
- schema to use for this outputIOException
- thrown if output collector could not be createdpublic void collect(String namedOutput, Reporter reporter, Schema schema, Object datum, String baseOutputPath) throws IOException
namedOutput
- the named output namereporter
- the reporterbaseOutputPath
- outputfile name to use.datum
- output dataschema
- schema to use for this outputIOException
- thrown if output collector could not be createdpublic AvroCollector getCollector(String namedOutput, Reporter reporter) throws IOException
collect(java.lang.String, org.apache.hadoop.mapred.Reporter, java.lang.Object)
method for collecting outputnamedOutput
- the named output namereporter
- the reporterIOException
- thrown if output collector could not be createdpublic AvroCollector getCollector(String namedOutput, String multiName, Reporter reporter) throws IOException
namedOutput
- the named output namereporter
- the reportermultiName
- the multinameIOException
- thrown if output collector could not be createdpublic void close() throws IOException
super.close()
at the
end of their close()
IOException
- thrown if any of the MultipleOutput files
could not be closed properly.Copyright © 2009–2017 The Apache Software Foundation. All rights reserved.