Avro C++ Documentation

Introduction to Avro C++

Avro is a data serialization system. See http://avro.apache.org/docs/current/ for background information.

Avro C++ is a C++ library that implements parts of the Avro specification. The library includes the following functionality:

Presently there is no support for the following features of the Avro specification:

Note: Prior to Avro release 1.5, some of the functionality mentioned above was available through a somewhat different API and set of tools. These are partially incompatible with the present ones. They continue to be available but will be deprecated and discontinued at some point in the future. The documentation for that API can be found at http://avro.apache.org/docs/1.4.0/api/cpp/html/index.html

Installing Avro C++

Supported platforms and pre-requisites

One should be able to build Avro C++ on any UNIX flavor, including Cygwin for Windows. We have tested it on Linux systems and Cygwin.

In order to build Avro C++, one needs the following:

On Ubuntu Linux, for example, you can obtain these by running apt-get install for the following packages:

Installing Avro C++

  1. Download the latest Avro distribution. The Avro distribution is a compressed tarball. Please see the main documentation if you want to build anything more than Avro C++.
  2. Expand the tarball into a directory.
  3. Change to the lang/c++ subdirectory.
  4. Type ./build.sh test. This builds Avro C++ and runs its tests.
  5. Type ./build.sh install. This installs Avro C++ under /usr/local on your system.

Getting started with Avro C++

Although Avro does not require the use of code generation, it is the easiest way to get started with the Avro C++ library. The code generator reads a schema and generates a C++ header file that defines one or more C++ structs to represent the data for the schema, together with functions to encode and decode those structs. Even if you wish to write custom code to encode and decode your objects using the core functionality of Avro C++, the generated code can serve as an example of how to use that core functionality.

Let's walk through an example using a simple schema that represents a complex number:

File: cpx.json

{
    "type": "record",
    "name": "cpx",
    "fields" : [
        {"name": "re", "type": "double"},
        {"name": "im", "type" : "double"}
    ]
}

Note: All the example code given here can be found under the examples directory of the distribution.

Assume this JSON representation of the schema is stored in a file called cpx.json. To generate the code, issue the command:

avrogencpp -i cpx.json -o cpx.hh -n c

The -i flag specifies the input schema file and the -o flag specifies the output header file to generate. The generated C++ code will be in the namespace specified with the -n flag.

Among other things, the generated file will contain the following:

...
namespace c {
...
struct cpx {
    double re;
    double im;
};
...
}

cpx is a C++ representation of the Avro schema cpx.

Now let's see how we can use the generated code to encode data into Avro and decode it back.

File: generated.cc

00001 
00019 #include "cpx.hh"
00020 #include "avro/Encoder.hh"
00021 #include "avro/Decoder.hh"
00022 
00023 
00024 int
00025 main()
00026 {
00027     std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
00028     avro::EncoderPtr e = avro::binaryEncoder();
00029     e->init(*out);
00030     c::cpx c1;
00031     c1.re = 1.0;
00032     c1.im = 2.13;
00033     avro::encode(*e, c1);
00034 
00035     std::auto_ptr<avro::InputStream> in = avro::memoryInputStream(*out);
00036     avro::DecoderPtr d = avro::binaryDecoder();
00037     d->init(*in);
00038 
00039     c::cpx c2;
00040     avro::decode(*d, c2);
00041     std::cout << '(' << c2.re << ", " << c2.im << ')' << std::endl;
00042     return 0;
00043 }
00044 

In line 27, we construct a memory output stream. By this we indicate that we want to send the encoded Avro data into memory. In line 28, we construct a binary encoder, meaning that the output should be encoded using the Avro binary standard. In line 29, we attach the output stream to the encoder. At any given time an encoder can write to only one output stream.

In line 33, we write the contents of c1 into the output stream using the encoder. Now the output stream contains the binary representation of the object. The rest of the code verifies that the data is indeed in the stream.

In line 35, we construct a memory input stream from the contents of the output stream. Thus the input stream has the binary representation of the object. In lines 36 and 37, we construct a binary decoder and attach the input stream to it. Line 40 decodes the contents of the stream into another object c2. Now c1 and c2 should have identical contents, which one can readily verify from the output of the program, which should be:

(1, 2.13)
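
To make concrete what ends up in the memory stream, the following sketch mimics what the Avro specification prescribes for this record: a record is the concatenation of its fields in declaration order, and each double is written as 8 bytes in little-endian order. This is a standalone illustration, not the library's implementation. The names putDouble, getDouble and encodeCpx are made up for this example, and the code assumes that double is a 64-bit IEEE 754 type on the platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; they are not part of Avro C++.
// Assumption: double is a 64-bit IEEE 754 value on this platform.

// Appends v as the 8 little-endian bytes of its bit pattern.
void putDouble(std::vector<std::uint8_t>& out, double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    }
}

// Reads a double from 8 little-endian bytes starting at pos and advances pos.
double getDouble(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    }
    pos += 8;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Encodes the two fields of the cpx record in declaration order: re, then im.
std::vector<std::uint8_t> encodeCpx(double re, double im) {
    std::vector<std::uint8_t> out;
    putDouble(out, re);
    putDouble(out, im);
    return out;
}
```

Calling encodeCpx(1.0, 2.13) yields 16 bytes, and two successive getDouble calls on that buffer return the original values. Note that the encoded bytes carry no field names or type tags, which is why the reader needs to know the schema.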

Now, if you want to encode the data using Avro JSON encoding, you should use avro::jsonEncoder() instead of avro::binaryEncoder() in line 28 and avro::jsonDecoder() instead of avro::binaryDecoder() in line 36.

On the other hand, if you want to write the contents to a file instead of memory, you should use avro::fileOutputStream() instead of avro::memoryOutputStream() in line 27 and avro::fileInputStream() instead of avro::memoryInputStream() in line 35.

Reading a JSON schema

The section above demonstrated nearly everything you need to know to get started reading and writing objects using the Avro C++ code generator. The following sections cover some additional topics.

The library provides some utilities to read a schema that is stored in a JSON file:

File: schemaload.cc

#include <fstream>

#include "avro/ValidSchema.hh"
#include "avro/Compiler.hh"


int
main()
{
    // Open the file that holds the JSON schema.
    std::ifstream in("cpx.json");

    // Parse it into an in-memory schema; throws if the schema is invalid.
    avro::ValidSchema cpxSchema;
    avro::compileJsonSchema(in, cpxSchema);
}

This reads the file and parses the JSON schema into an in-memory schema object of type avro::ValidSchema. If, for some reason, the schema is not valid, the cpxSchema object will not be set, and an exception will be thrown. If you always use the Avro code generator, you don't really need the in-memory schema objects. But if you use custom objects and routines to encode or decode Avro data, you will need the schema objects. Other uses of schema objects are generic data objects and schema resolution, described in the following sections.

Custom encoding and decoding

Suppose you want to encode objects of type std::complex<double> from the C++ standard library using the schema defined in cpx.json. Since std::complex<double> was not generated by Avro, Avro doesn't know how to encode or decode objects of that type. You have to tell Avro how to do that.

The recommended way to tell Avro how to encode or decode is to specialize Avro's codec_traits template. For std::complex<double>, here is what you'd do:

File: custom.cc

#include <complex>
#include <iostream>
#include <memory>

#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"

namespace avro {
// Tell Avro how to encode and decode std::complex<T>.
template<typename T>
struct codec_traits<std::complex<T> > {
    // Write the real part, then the imaginary part.
    static void encode(Encoder& e, const std::complex<T>& c) {
        avro::encode(e, std::real(c));
        avro::encode(e, std::imag(c));
    }

    // Read the two parts back in the same order.
    static void decode(Decoder& d, std::complex<T>& c) {
        T re, im;
        avro::decode(d, re);
        avro::decode(d, im);
        c = std::complex<T>(re, im);
    }
};

}

int
main()
{
    std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
    avro::EncoderPtr e = avro::binaryEncoder();
    e->init(*out);
    std::complex<double> c1(1.0, 2.0);
    avro::encode(*e, c1);

    std::auto_ptr<avro::InputStream> in = avro::memoryInputStream(*out);
    avro::DecoderPtr d = avro::binaryDecoder();
    d->init(*in);

    std::complex<double> c2;
    avro::decode(*d, c2);
    std::cout << '(' << std::real(c2) << ", " << std::imag(c2) << ')' << std::endl;
    return 0;
}

Please notice that the main function is very similar to the one we used for the generated class. Once codec_traits for a specific type is supplied, you do not need to do anything special for your custom types.

But wait, how does Avro know that std::complex<double> represents the data for the schema in cpx.json? It doesn't. In fact, if you had used std::complex<float> instead of std::complex<double>, the program would still have worked. But the data in memory would not have corresponded to the schema in cpx.json.
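
The point can be illustrated without Avro at all. The sketch below is a toy model of trait-based dispatch written for this document; the namespace mini, the ByteSink type and every function in it are hypothetical and do not reproduce Avro's internals. It shows that a single specialization for std::complex<T> accepts both float and double, yet produces a different number of bytes in each case, since the Avro binary encoding uses 4 bytes for a float and 8 bytes for a double.

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace mini {  // toy model for illustration, not the avro namespace

typedef std::vector<std::uint8_t> ByteSink;

// Primary template: every supported type supplies its own specialization.
template <typename T> struct codec_traits;

// Entry point: forwards to whichever specialization matches T.
template <typename T>
void encode(ByteSink& s, const T& v) { codec_traits<T>::encode(s, v); }

// Writes the bit pattern of v as sizeof(Bits) little-endian bytes.
template <typename Bits, typename T>
void putLittleEndian(ByteSink& s, const T& v) {
    static_assert(sizeof(Bits) == sizeof(T), "size mismatch");
    Bits bits;
    std::memcpy(&bits, &v, sizeof bits);
    for (std::size_t i = 0; i < sizeof bits; ++i) {
        s.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    }
}

template <> struct codec_traits<float> {
    static void encode(ByteSink& s, float v) { putLittleEndian<std::uint32_t>(s, v); }
};

template <> struct codec_traits<double> {
    static void encode(ByteSink& s, double v) { putLittleEndian<std::uint64_t>(s, v); }
};

// One specialization covers std::complex<float> and std::complex<double>.
template <typename T> struct codec_traits<std::complex<T> > {
    static void encode(ByteSink& s, const std::complex<T>& c) {
        mini::encode(s, std::real(c));
        mini::encode(s, std::imag(c));
    }
};

// Returns the number of bytes produced when encoding v.
template <typename T>
std::size_t encodedSize(const T& v) {
    ByteSink s;
    mini::encode(s, v);
    return s.size();
}

}  // namespace mini
```

In this model, mini::encodedSize(std::complex<float>(1.0f, 2.0f)) is 8 while mini::encodedSize(std::complex<double>(1.0, 2.0)) is 16. Nothing in the dispatch itself notices the difference, and only the 16-byte form matches a record of two doubles.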

In order to ensure that you indeed use the correct type, you can use the validating encoder and decoder. Here is how:

File: validating.cc

#include <fstream>
#include <complex>
#include <iostream>
#include <memory>

#include "avro/Compiler.hh"
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"

namespace avro {
// Tell Avro how to encode and decode std::complex<T>.
template<typename T>
struct codec_traits<std::complex<T> > {
    static void encode(Encoder& e, const std::complex<T>& c) {
        avro::encode(e, std::real(c));
        avro::encode(e, std::imag(c));
    }

    static void decode(Decoder& d, std::complex<T>& c) {
        T re, im;
        avro::decode(d, re);
        avro::decode(d, im);
        c = std::complex<T>(re, im);
    }
};

}

int
main()
{
    std::ifstream ifs("cpx.json");

    avro::ValidSchema cpxSchema;
    avro::compileJsonSchema(ifs, cpxSchema);

    // Wrap the binary encoder so every write is checked against cpxSchema.
    std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
    avro::EncoderPtr e = avro::validatingEncoder(cpxSchema,
        avro::binaryEncoder());
    e->init(*out);
    std::complex<double> c1(1.0, 2.0);
    avro::encode(*e, c1);

    // Wrap the binary decoder so every read is checked against cpxSchema.
    std::auto_ptr<avro::InputStream> in = avro::memoryInputStream(*out);
    avro::DecoderPtr d = avro::validatingDecoder(cpxSchema,
        avro::binaryDecoder());
    d->init(*in);

    std::complex<double> c2;
    avro::decode(*d, c2);
    std::cout << '(' << std::real(c2) << ", " << std::imag(c2) << ')' << std::endl;
    return 0;
}

Here, instead of using the plain binary encoder, you use a validating encoder backed by a binary encoder. Similarly, instead of using the plain binary decoder, you use a validating decoder backed by a binary decoder. Now, if you use std::complex<float> instead of std::complex<double>, the validating encoder and decoder will throw an exception stating that you are trying to encode or decode float instead of double.

You can use any encoder behind the validating encoder and any decoder behind the validating decoder. But in practice, only the binary encoder and the binary decoder have no knowledge of the underlying schema. All other encoders (JSON encoder) and decoders (JSON decoder, resolving decoder) do know about the schema and they validate internally. So, fronting them with a validating encoder or validating decoder is wasteful.
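
Conceptually, a validating encoder walks the schema in step with the calls made on it and rejects any call that does not match the next expected type. The sketch below models that idea for a flat record only. It is a deliberately simplified stand-in written for illustration: the names Kind, ValidatingWriter and cpxKinds are hypothetical, not Avro classes, and the written values are simply dropped where a real validating encoder would forward them to the encoder behind it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified model of schema-driven validation.
enum Kind { KIND_FLOAT, KIND_DOUBLE };

class ValidatingWriter {
public:
    // expected: the field types of a flat record, in declaration order.
    explicit ValidatingWriter(const std::vector<Kind>& expected)
        : expected_(expected), next_(0) {}

    // Each write only checks the type; the value itself is discarded here.
    void writeFloat(float) { check(KIND_FLOAT, "float"); }
    void writeDouble(double) { check(KIND_DOUBLE, "double"); }

    // True once every field of the record has been written.
    bool complete() const { return next_ == expected_.size(); }

private:
    // Throws unless the schema expects `actual` at the current position.
    void check(Kind actual, const char* name) {
        if (next_ >= expected_.size()) {
            throw std::runtime_error(std::string("unexpected extra ") + name);
        }
        if (expected_[next_] != actual) {
            throw std::runtime_error(
                std::string("schema does not expect ") + name + " here");
        }
        ++next_;
    }

    std::vector<Kind> expected_;
    std::size_t next_;
};

// Field types of the cpx schema: two doubles.
std::vector<Kind> cpxKinds() {
    return std::vector<Kind>(2, KIND_DOUBLE);
}
```

With a ValidatingWriter built from cpxKinds(), two writeDouble calls succeed and leave the writer complete, whereas a first call to writeFloat throws std::runtime_error. This mirrors the float-versus-double mismatch described above.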

Generic data objects

A third way to encode and decode data is to use Avro's generic datum. Avro's generic datum allows you to read arbitrary data corresponding to an arbitrary schema into a generic object. One need not know anything about the schema or data at compile time.

Here is an example of how one can use the generic datum.

File: generic.cc

#include <fstream>
#include <complex>
#include <iostream>
#include <memory>

#include "cpx.hh"

#include "avro/Compiler.hh"
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"
#include "avro/Generic.hh"

int
main()
{
    std::ifstream ifs("cpx.json");

    avro::ValidSchema cpxSchema;
    avro::compileJsonSchema(ifs, cpxSchema);

    // Encode with the generated struct.
    std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
    avro::EncoderPtr e = avro::binaryEncoder();
    e->init(*out);
    c::cpx c1;
    c1.re = 100.23;
    c1.im = 105.77;
    avro::encode(*e, c1);

    std::auto_ptr<avro::InputStream> in = avro::memoryInputStream(*out);
    avro::DecoderPtr d = avro::binaryDecoder();
    d->init(*in);

    // Decode into a generic datum shaped by the schema.
    avro::GenericDatum datum(cpxSchema);
    avro::decode(*d, datum);
    std::cout << "Type: " << datum.type() << std::endl;
    if (datum.type() == avro::AVRO_RECORD) {
        const avro::GenericRecord& r = datum.value<avro::GenericRecord>();
        std::cout << "Field-count: " << r.fieldCount() << std::endl;
        if (r.fieldCount() == 2) {
            const avro::GenericDatum& f0 = r.fieldAt(0);
            if (f0.type() == avro::AVRO_DOUBLE) {
                std::cout << "Real: " << f0.value<double>() << std::endl;
            }
            const avro::GenericDatum& f1 = r.fieldAt(1);
            if (f1.type() == avro::AVRO_DOUBLE) {
                std::cout << "Imaginary: " << f1.value<double>() << std::endl;
            }
        }
    }
    return 0;
}

In this example, we encode the data using generated code and decode it with generic datum. Then we examine the contents of the generic datum and extract them. Please see avro::GenericDatum for more details on how to use it.

Reading data with a schema different from that of the writer

It is possible to read data written according to one schema using a different schema, provided the reader's schema and the writer's schema are compatible according to Avro's schema resolution rules.

For example, suppose you have a reader that is interested only in the imaginary part of a complex number, while the writer writes both the real and imaginary parts. It is possible to do automatic schema resolution between the writer's schema and the reader's schema, as shown below.

File: imaginary.json

{
    "type": "record",
    "name": "cpx",
    "fields" : [
        {"name": "im", "type" : "double"}
    ]
}

avrogencpp -i imaginary.json -o imaginary.hh -n i

File: resolving.cc

00001 
00019 #include <fstream>
00020 
00021 #include "cpx.hh"
00022 #include "imaginary.hh"
00023 
00024 #include "avro/Compiler.hh"
00025 #include "avro/Encoder.hh"
00026 #include "avro/Decoder.hh"
00027 #include "avro/Specific.hh"
00028 #include "avro/Generic.hh"
00029 
00030 
00031 
00032 avro::ValidSchema load(const char* filename)
00033 {
00034     std::ifstream ifs(filename);
00035     avro::ValidSchema result;
00036     avro::compileJsonSchema(ifs, result);
00037     return result;
00038 }
00039 
00040 int
00041 main()
00042 {
00043     avro::ValidSchema cpxSchema = load("cpx.json");
00044     avro::ValidSchema imaginarySchema = load("imaginary.json");
00045 
00046     std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
00047     avro::EncoderPtr e = avro::binaryEncoder();
00048     e->init(*out);
00049     c::cpx c1;
00050     c1.re = 100.23;
00051     c1.im = 105.77;
00052     avro::encode(*e, c1);
00053 
00054     std::auto_ptr<avro::InputStream> in = avro::memoryInputStream(*out);
00055     avro::DecoderPtr d = avro::resolvingDecoder(cpxSchema, imaginarySchema,
00056         avro::binaryDecoder());
00057     d->init(*in);
00058 
00059     i::cpx c2;
00060     avro::decode(*d, c2);
00061     std::cout << "Imaginary: " << c2.im << std::endl;
00062 
00063 }

In this example, the writer and the reader deal with different schemas; both are records with the same name cpx. The writer's schema has two fields and the reader's has just one. We generated code for the writer's schema in the namespace c and for the reader's in the namespace i.

Please notice how the reading part of the example at line 60 reads as if the stream contained data corresponding to the reader's schema. The schema resolution is done automatically by the resolving decoder.

In this example, we have used a simple (somewhat artificial) projection, where the set of fields in the reader's schema is a subset of the set of fields in the writer's. More complex resolutions are allowed by the Avro specification.
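
A projection of this kind can be pictured as follows: the decoder must still consume every field the writer wrote, in the writer's order, but it hands over only the fields that the reader's schema names. The sketch below models this with field names and already-decoded values. It illustrates the idea under that simplification and is not the resolving decoder's implementation; project is a hypothetical helper introduced for this example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not part of Avro C++.
// writerRecord holds (field name, value) pairs in the writer's field order.
// readerFields lists the field names declared by the reader's schema.
std::map<std::string, double> project(
    const std::vector<std::pair<std::string, double> >& writerRecord,
    const std::vector<std::string>& readerFields) {
    std::map<std::string, double> result;
    for (std::size_t i = 0; i < writerRecord.size(); ++i) {
        // Every writer field is visited, as it would be consumed from the stream,
        for (std::size_t j = 0; j < readerFields.size(); ++j) {
            // but it is kept only if the reader asks for it by name.
            if (writerRecord[i].first == readerFields[j]) {
                result[writerRecord[i].first] = writerRecord[i].second;
            }
        }
    }
    return result;
}
```

Given a writer record of ("re", 100.23) and ("im", 105.77) and a reader that lists only "im", the result holds the single entry "im" with the value 105.77. Matching is by field name, so the relative order of fields in the two schemas does not matter.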

Using Avro data files

The Avro specification defines a format for data files, and Avro C++ implements that format. The code below demonstrates how one can use an Avro data file to store and retrieve a collection of objects corresponding to a given schema.

File: datafile.cc

#include <fstream>
#include <iostream>

#include "cpx.hh"
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/ValidSchema.hh"
#include "avro/Compiler.hh"
#include "avro/DataFile.hh"


// Loads and compiles the JSON schema stored in the given file.
avro::ValidSchema loadSchema(const char* filename)
{
    std::ifstream ifs(filename);
    avro::ValidSchema result;
    avro::compileJsonSchema(ifs, result);
    return result;
}

int
main()
{
    avro::ValidSchema cpxSchema = loadSchema("cpx.json");

    // Write 100 records to the data file.
    {
        avro::DataFileWriter<c::cpx> dfw("test.bin", cpxSchema);
        c::cpx c1;
        for (int i = 0; i < 100; i++) {
            c1.re = i * 100;
            c1.im = i + 100;
            dfw.write(c1);
        }
        dfw.close();
    }

    // Read the records back until the file is exhausted.
    {
        avro::DataFileReader<c::cpx> dfr("test.bin", cpxSchema);
        c::cpx c2;
        while (dfr.read(c2)) {
            std::cout << '(' << c2.re << ", " << c2.im << ')' << std::endl;
        }
    }
    return 0;
}

Please see DataFile.hh for more details.
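
As a small aid when working with such files, the data file format in the Avro specification begins with a four-byte magic: the ASCII characters 'O', 'b', 'j' followed by the byte 1. The sketch below checks for that prefix using only the standard library. The function names hasAvroMagic and isAvroDataFile are made up for this example and are not part of Avro C++, and a matching prefix does not by itself prove that the rest of the file is well formed.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helpers for illustration; not part of Avro C++.

// True if the bytes begin with the data file magic 'O', 'b', 'j', 1.
bool hasAvroMagic(const std::string& bytes) {
    static const char magic[4] = {'O', 'b', 'j', '\x01'};
    return bytes.size() >= 4 && bytes.compare(0, 4, magic, 4) == 0;
}

// Reads up to the first four bytes of the file and checks them.
// Returns false if the file cannot be opened or is shorter than four bytes.
bool isAvroDataFile(const char* filename) {
    std::ifstream ifs(filename, std::ios::binary);
    char buf[4];
    ifs.read(buf, 4);
    return hasAvroMagic(std::string(buf, static_cast<std::size_t>(ifs.gcount())));
}
```

After running the example above, isAvroDataFile("test.bin") would be expected to return true, assuming the writer produced a file in the specified format.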