Class AvroMultipleOutputs
Case one: writing to additional outputs other than the job default output.
Each additional output, or named output, may be configured with its own
Schema
and OutputFormat
. A named output can be a
single file or a multi file. The later is refered as a multi named output
which is an unbound set of files all sharing the same Schema
.
Case two: to write data to different files provided by user
AvroMultipleOutputs supports counters, by default they are disabled. The
counters group is the AvroMultipleOutputs
class name. The names of
the counters are the same as the output name. These count the number of
records written to each output name. For multi named outputs the name of the
counter is the concatenation of the named output, and underscore '_' and the
multiname.
JobConf job = new JobConf(); FileInputFormat.setInputPath(job, inDir); FileOutputFormat.setOutputPath(job, outDir); job.setMapperClass(MyAvroMapper.class); job.setReducerClass(HadoopReducer.class); job.set("avro.reducer",MyAvroReducer.class); ... Schema schema; ... // Defines additional single output 'avro1' for the job AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema); // Defines additional output 'avro2' with different schema for the job AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null); // if Schema is specified as null then the default output schema is used ... job.waitForCompletion(true); ...
Usage in Reducer:
public class MyAvroReducer extends AvroReducer<K, V, OUT> { private MultipleOutputs amos; public void configure(JobConf conf) { ... amos = new AvroMultipleOutputs(conf); } public void reduce(K, Iterator<V> values, AvroCollector<OUT>, Reporter reporter) throws IOException { ... amos.collect("avro1", reporter,datum); amos.getCollector("avro2", "A", reporter).collect(datum); amos.collect("avro1",reporter,schema,datum,"testavrofile");// this create a file testavrofile and writes data with schema "schema" into it and uses other values from namedoutput "avro1" like outputclass etc. amos.collect("avro1",reporter,schema,datum,"testavrofile1"); ... } public void close() throws IOException { amos.close(); ... } }
-
Constructor Summary
ConstructorDescriptionCreates and initializes multiple named outputs support, it should be instantiated in the Mapper/Reducer configure method. -
Method Summary
Modifier and TypeMethodDescriptionstatic void
addMultiNamedOutput
(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema) Adds a multi named output for the job.static void
addNamedOutput
(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema) Adds a named output for the job.void
close()
Closes all the opened named outputs.void
Output Collector for the default schema.void
OutputCollector with custom schema.void
OutputCollector with custom schema and file name.getCollector
(String namedOutput, String multiName, Reporter reporter) Gets the output collector for a named output.getCollector
(String namedOutput, Reporter reporter) Deprecated.static boolean
getCountersEnabled
(JobConf conf) Returns if the counters for the named outputs are enabled or not.static Class
<? extends OutputFormat> getNamedOutputFormatClass
(JobConf conf, String namedOutput) Returns the named output OutputFormat.Returns iterator with the defined name outputs.getNamedOutputsList
(JobConf conf) Returns list of channel names.static boolean
isMultiNamedOutput
(JobConf conf, String namedOutput) Returns if a named output is multiple.static void
setCountersEnabled
(JobConf conf, boolean enabled) Enables or disables counters for the named outputs.
-
Constructor Details
-
AvroMultipleOutputs
Creates and initializes multiple named outputs support, it should be instantiated in the Mapper/Reducer configure method.- Parameters:
job
- the job configuration object
-
-
Method Details
-
getNamedOutputsList
Returns list of channel names.- Parameters:
conf
- job conf- Returns:
- List of channel Names
-
isMultiNamedOutput
Returns if a named output is multiple.- Parameters:
conf
- job confnamedOutput
- named output- Returns:
true
if the name output is multi,false
if it is single. If the name output is not defined it returnsfalse
-
getNamedOutputFormatClass
public static Class<? extends OutputFormat> getNamedOutputFormatClass(JobConf conf, String namedOutput) Returns the named output OutputFormat.- Parameters:
conf
- job confnamedOutput
- named output- Returns:
- namedOutput OutputFormat
-
addNamedOutput
public static void addNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema) Adds a named output for the job.- Parameters:
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters and numbers only, cannot be the word 'part' as that is reserved for the default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutput
-
addMultiNamedOutput
public static void addMultiNamedOutput(JobConf conf, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Schema schema) Adds a multi named output for the job.- Parameters:
conf
- job conf to add the named outputnamedOutput
- named output name, it has to be a word, letters and numbers only, cannot be the word 'part' as that is reserved for the default output.outputFormatClass
- OutputFormat class.schema
- Schema to used for this namedOutput
-
setCountersEnabled
Enables or disables counters for the named outputs. By default these counters are disabled. MultipleOutputs supports counters, by default the are disabled. The counters group is theAvroMultipleOutputs
class name. The names of the counters are the same as the named outputs. For multi named outputs the name of the counter is the concatenation of the named output, and underscore '_' and the multiname.- Parameters:
conf
- job conf to enableadd the named output.enabled
- indicates if the counters will be enabled or not.
-
getCountersEnabled
Returns if the counters for the named outputs are enabled or not. By default these counters are disabled. MultipleOutputs supports counters, by default the are disabled. The counters group is theAvroMultipleOutputs
class name. The names of the counters are the same as the named outputs. For multi named outputs the name of the counter is the concatenation of the named output, and underscore '_' and the multiname.- Parameters:
conf
- job conf to enableadd the named output.- Returns:
- TRUE if the counters are enabled, FALSE if they are disabled.
-
getNamedOutputs
Returns iterator with the defined name outputs.- Returns:
- iterator with the defined named outputs
-
collect
Output Collector for the default schema.- Parameters:
namedOutput
- the named output namereporter
- the reporterdatum
- output data- Throws:
IOException
- thrown if output collector could not be created
-
collect
public void collect(String namedOutput, Reporter reporter, Schema schema, Object datum) throws IOException OutputCollector with custom schema.- Parameters:
namedOutput
- the named output name (this will the output file name)reporter
- the reporterschema
- schema to use for this outputdatum
- output data- Throws:
IOException
- thrown if output collector could not be created
-
collect
public void collect(String namedOutput, Reporter reporter, Schema schema, Object datum, String baseOutputPath) throws IOException OutputCollector with custom schema and file name.- Parameters:
namedOutput
- the named output namereporter
- the reporterschema
- schema to use for this outputdatum
- output databaseOutputPath
- outputfile name to use.- Throws:
IOException
- thrown if output collector could not be created
-
getCollector
Deprecated.Usecollect(java.lang.String, org.apache.hadoop.mapred.Reporter, java.lang.Object)
method for collecting outputGets the output collector for a named output.- Parameters:
namedOutput
- the named output namereporter
- the reporter- Returns:
- the output collector for the given named output
- Throws:
IOException
- thrown if output collector could not be created
-
getCollector
public AvroCollector getCollector(String namedOutput, String multiName, Reporter reporter) throws IOException Gets the output collector for a named output.- Parameters:
namedOutput
- the named output namemultiName
- the multinamereporter
- the reporter- Returns:
- the output collector for the given named output
- Throws:
IOException
- thrown if output collector could not be created
-
close
Closes all the opened named outputs. If overriden subclasses must invokesuper.close()
at the end of theirclose()
- Throws:
IOException
- thrown if any of the MultipleOutput files could not be closed properly.
-
collect(java.lang.String, org.apache.hadoop.mapred.Reporter, java.lang.Object)
method for collecting output