Introduction to Avro C++
Avro is a data serialization system. See https://avro.apache.org/docs/current/ for background information.
Avro C++ is a C++ library which implements parts of the Avro Specification. The library includes the following functionality:
- Assembling schemas programmatically.
- A schema parser, which can parse an Avro schema (written in JSON) into a Schema object.
- Encoders and decoders to encode data into Avro format and decode it back using primitive functions. There are multiple implementations of encoders and decoders:
  - A binary encoder, which encodes into binary Avro data.
  - A JSON encoder, which encodes into JSON Avro data.
  - A validating encoder, an encoder proxy, which validates the call sequence to the encoder before sending the calls to another encoder.
  - A binary decoder, which decodes binary Avro data.
  - A JSON decoder, which decodes JSON Avro data.
  - A validating decoder, a decoder proxy, which validates the call sequence to the decoder before sending the calls to another decoder.
  - A resolving decoder, which accepts calls according to a reader's schema but decodes data corresponding to a different (writer's) schema, doing schema resolution according to the resolution rules in the Avro specification.
- Streams for storing and reading data, which Encoders and Decoders use.
- Support for Avro DataFile.
- A code generator, which generates C++ classes and functions to encode and decode them. The code generator produces a C++ header file from a given schema file.
Presently there is no support for the following specified in Avro specification.
Note: Prior to Avro release 1.5, some of the functionality mentioned above was available through a somewhat different API and set of tools. These are partially incompatible with the present ones. They continue to be available but will be deprecated and discontinued sometime in the future. The documentation on that API can be found at https://avro.apache.org/docs/1.4.0/api/cpp/html/index.html
Installing Avro C++
Supported platforms and pre-requisites
One should be able to build Avro C++ on (1) any UNIX flavor, including Cygwin for Windows, and (2) natively on Windows using Visual Studio. We have tested it on (1) Linux systems (Ubuntu and RHEL), (2) Cygwin, and (3) Visual Studio 2010 Express edition.
In order to build Avro C++, one needs the following:
- A C++ compiler and runtime libraries.
- Boost library version 1.38 or later. Apart from the header-only libraries of Boost, Avro C++ requires the filesystem, iostreams, system and program_options libraries. Please see https://www.boost.org or your platform's documentation for details on how to set up Boost for your platform.
- CMake build tool version 2.6 or later. Please see https://www.cmake.org or your platform's documentation for details on how to set up CMake for your system.
- Python. If not already present, please consult your platform-specific documentation on how to install Python on your system.
For Ubuntu Linux, for example, you can get these by running apt-get install for the following packages:
- cmake
- g++
- libboost-dev
- libboost-filesystem-dev
- libboost-iostreams-dev
- libboost-program-options-dev
- libboost-system-dev
For Windows native builds, you need to install the following:
- cmake
- Boost distribution from Boost Consulting
- Visual Studio
Installing Avro C++
- Download the latest Avro distribution. The Avro distribution is a compressed tarball. Please see the main documentation if you want to build anything more than Avro C++.
On Unix systems and on Cygwin
- Expand the tarball into a directory.
- Change to the lang/c++ subdirectory.
- Type ./build.sh test. This builds Avro C++ and runs tests on it.
- Type ./build.sh install. This installs Avro C++ under /usr/local on your system.
On native Windows
- Ensure that CMake's bin directory and Boost's lib directory are in the path.
- Expand the tarball into a directory.
- Change to the lang/c++ subdirectory.
- Create a subdirectory, say, build.win, and change to that directory.
- Type cmake -G "Visual Studio 10". It creates, among other things, the Avro-cpp.sln file.
- Open the solution file using Visual Studio and build the projects from within Visual Studio.
- To run all unit tests, build the special project named "RUN_TESTS".
- After building all the projects, you can also execute the unit tests from the command line: ctest -C release or ctest -C debug.
Getting started with Avro C++
Although Avro does not require use of code generation, that is the easiest way to get started with the Avro C++ library. The code generator reads a schema and generates a C++ header file that defines one or more C++ structs to represent the data for the schema, and functions to encode and decode those structs. Even if you wish to write custom code to encode and decode your objects using the core functionality of Avro C++, the generated code can serve as an example of how to use the core functionality.
Let's walk through an example, using a simple schema. Use the schema that represents a complex number:
File: cpx.json
{"type": "record", "name": "cpx", "fields": [
    {"name": "re", "type": "double"},
    {"name": "im", "type": "double"}
]}
Note: All the example code given here can be found under the examples directory of the distribution.
Assume this JSON representation of the schema is stored in a file called cpx.json. To generate the code, issue the command:
avrogencpp -i cpx.json -o cpx.hh -n c
The -i flag specifies the input schema file and the -o flag specifies the output header file to generate. The generated C++ code will be in the namespace specified with the -n flag.
The generated file, among other things, will have the following:
...
namespace c {
...
struct cpx {
    double re;
    double im;
};
...
}
cpx is a C++ representation of the Avro schema cpx.
Now let's see how we can use the generated code to encode data into Avro and decode it back.
File: generated.cc
20 #include "avro/Encoder.hh"
21 #include "avro/Decoder.hh"
...
41 std::cout << '(' << c2.re << ", " << c2.im << ')' << std::endl;
In line 27, we construct a memory output stream. By this we indicate that we want to send the encoded Avro data into memory. In line 28, we construct a binary encoder, whereby we mean the output should be encoded using the Avro binary standard. In line 29, we attach the output stream to the encoder. At any given time an encoder can write to only one output stream.
In line 32, we write the contents of c1 into the output stream using the encoder. Now the output stream contains the binary representation of the object. The rest of the code verifies that the data is indeed in the stream.
In line 35, we construct a memory input stream from the contents of the output stream. Thus the input stream has the binary representation of the object. In lines 36 and 37, we construct a binary decoder and attach the input stream to it. Line 40 decodes the contents of the stream into another object c2. Now c1 and c2 should have identical contents, which one can readily verify from the output of the program, which should be:
(1, 2.13)
Now, if you want to encode the data using Avro JSON encoding, you should use avro::jsonEncoder() instead of avro::binaryEncoder() in line 28 and avro::jsonDecoder() instead of avro::binaryDecoder() in line 36.
On the other hand, if you want to write the contents to a file instead of memory, you should use avro::fileOutputStream() instead of avro::memoryOutputStream() in line 27 and avro::fileInputStream() instead of avro::memoryInputStream() in line 35.
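To make the binary round trip above more concrete, here is a minimal sketch of what the encoded bytes for a cpx record look like. It does not use the Avro C++ library at all. It relies only on the Avro specification's rules that a double is written as 8 little-endian bytes of its IEEE 754 bit pattern and that a record is the concatenation of its fields in declaration order. The names Cpx, putDouble, getDouble, encodeCpx and decodeCpx are hypothetical helpers invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the generated c::cpx struct.
struct Cpx {
    double re;
    double im;
};

// Append one double as 8 little-endian bytes of its IEEE 754 bit pattern.
void putDouble(std::vector<std::uint8_t>& out, double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    }
}

// Read one double from 8 little-endian bytes at pos and advance pos.
double getDouble(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + 8 <= in.size());
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    }
    pos += 8;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// A record is just its fields, one after another: re, then im.
std::vector<std::uint8_t> encodeCpx(const Cpx& c) {
    std::vector<std::uint8_t> out;
    putDouble(out, c.re);
    putDouble(out, c.im);
    return out;
}

Cpx decodeCpx(const std::vector<std::uint8_t>& in) {
    std::size_t pos = 0;
    Cpx c;
    c.re = getDouble(in, pos);
    c.im = getDouble(in, pos);
    return c;
}
```

Under these assumptions, encoding a Cpx holding 1.0 and 2.13 yields 16 bytes, and decoding those bytes gives back the same two values. There are no field names or type tags in the bytes, which is why the decoder must know the schema.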
Reading a JSON schema
The section above demonstrated pretty much all that's needed to know to get started reading and writing objects using the Avro C++ code generator. The following sections will cover some more information.
The library provides some utilities to read a schema that is stored in a JSON file:
File: schemaload.cc
#include "avro/ValidSchema.hh"
#include "avro/Compiler.hh"
...
std::ifstream in("cpx.json");
This reads the file and parses the JSON schema into an in-memory schema object of type avro::ValidSchema. If, for some reason, the schema is not valid, the cpxSchema object will not be set, and an exception will be thrown.
If you always use the Avro code generator, you don't really need the in-memory schema objects. But if you use custom objects and routines to encode or decode Avro data, you will need the schema objects. Other uses of schema objects are generic data objects and schema resolution, described in the following sections.
Custom encoding and decoding
Suppose you want to encode objects of type std::complex<double> from C++ standard library using the schema defined in cpx.json. Since std::complex<double> was not generated by Avro, it doesn't know how to encode or decode objects of that type. You have to tell Avro how to do that.
The recommended way to tell Avro how to encode or decode is to specialize Avro's codec_traits template. For std::complex<double>, here is what you'd do:
File: custom.cc
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"
...
struct codec_traits<std::complex<T> > {
    static void encode(Encoder& e, const std::complex<T>& c) {
    ...
    static void decode(Decoder& d, std::complex<T>& c) {
    ...
    c = std::complex<T>(re, im);
...
std::complex<double> c1(1.0, 2.0);
...
std::complex<double> c2;
...
std::cout << '(' << std::real(c2) << ", " << std::imag(c2) << ')' << std::endl;
Please notice that the main function is very similar to the one we used for the generated class. Once codec_traits for a specific type is supplied, you do not really need to do anything special for your custom types.
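The dispatch mechanism behind this can be sketched without the library. The example below is a toy model, not Avro's implementation: everything lives in a hypothetical namespace toy, and toy::Encoder and toy::Decoder merely store doubles in a vector. It mirrors only the shape described above: generic encode and decode functions forward to codec_traits<T>, and supporting a new type means adding a specialization.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

namespace toy {  // hypothetical namespace; this is not the avro namespace

// Toy encoder: records each double it is given.
class Encoder {
public:
    void encodeDouble(double v) { values_.push_back(v); }
    const std::vector<double>& values() const { return values_; }
private:
    std::vector<double> values_;
};

// Toy decoder: hands the recorded doubles back in order.
class Decoder {
public:
    explicit Decoder(const std::vector<double>& v) : values_(v), pos_(0) {}
    double decodeDouble() {
        assert(pos_ < values_.size());
        return values_[pos_++];
    }
private:
    std::vector<double> values_;
    std::size_t pos_;
};

// Primary template is only declared; each supported type specializes it.
template <typename T> struct codec_traits;

template <> struct codec_traits<double> {
    static void encode(Encoder& e, double v) { e.encodeDouble(v); }
    static void decode(Decoder& d, double& v) { v = d.decodeDouble(); }
};

// Generic entry points that forward to the traits for T.
template <typename T> void encode(Encoder& e, const T& t) {
    codec_traits<T>::encode(e, t);
}
template <typename T> void decode(Decoder& d, T& t) {
    codec_traits<T>::decode(d, t);
}

// User-supplied specialization: a complex number is its real part
// followed by its imaginary part.
template <typename T> struct codec_traits<std::complex<T> > {
    static void encode(Encoder& e, const std::complex<T>& c) {
        toy::encode(e, std::real(c));
        toy::encode(e, std::imag(c));
    }
    static void decode(Decoder& d, std::complex<T>& c) {
        T re, im;
        toy::decode(d, re);
        toy::decode(d, im);
        c = std::complex<T>(re, im);
    }
};

}  // namespace toy
```

With this in place, calling toy::encode on a std::complex<double> and then toy::decode on the recorded values reproduces the original number, and the calling code looks the same as it would for any other supported type.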
But wait, how does Avro know that complex<double> represents the data for the schema in cpx.json? It doesn't. In fact, if you had used std::complex<float> instead of std::complex<double>, the program would have worked. But the data in memory would not have corresponded to the schema in cpx.json.
In order to ensure that you indeed use the correct type, you can use the validating encoder and decoder. Here is how:
File: validating.cc
#include "avro/Compiler.hh"
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"
...
struct codec_traits<std::complex<T> > {
    static void encode(Encoder& e, const std::complex<T>& c) {
    ...
    static void decode(Decoder& d, std::complex<T>& c) {
    ...
    c = std::complex<T>(re, im);
...
std::ifstream ifs("cpx.json");
...
std::complex<double> c1(1.0, 2.0);
...
std::complex<double> c2;
...
std::cout << '(' << std::real(c2) << ", " << std::imag(c2) << ')' << std::endl;
Here, instead of using the plain binary encoder, you use a validating encoder backed by a binary encoder. Similarly, instead of using the plain binary decoder, you use a validating decoder backed by a binary decoder. Now, if you use std::complex<float> instead of std::complex<double>, the validating encoder and decoder will throw an exception stating that you are trying to encode or decode float instead of double.
You can use any encoder behind the validating encoder and any decoder behind the validating decoder. But in practice, only the binary encoder and the binary decoder have no knowledge of the underlying schema. All other encoders (JSON encoder) and decoders (JSON decoder, resolving decoder) do know about the schema and they validate internally. So, fronting them with a validating encoder or validating decoder is wasteful.
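The proxy idea can be illustrated with a small stand-alone sketch. This is not how the library implements validation. It is a toy model under stated assumptions: the "schema" is flattened to a list of expected primitive types, and toy::Encoder, toy::CountingEncoder and toy::ValidatingEncoder are hypothetical classes invented here. The point is only the pattern: check each call against the expected sequence, then forward it to the underlying encoder.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

namespace toy {  // hypothetical namespace; not the avro namespace

enum class Type { Float, Double };

// Minimal encoder interface with two primitive calls.
class Encoder {
public:
    virtual ~Encoder() {}
    virtual void encodeFloat(float v) = 0;
    virtual void encodeDouble(double v) = 0;
};

// Stand-in for an encoder with no schema knowledge: it accepts any call.
class CountingEncoder : public Encoder {
public:
    void encodeFloat(float) override { ++calls; }
    void encodeDouble(double) override { ++calls; }
    int calls = 0;
};

// Proxy: validates the call sequence, then forwards to the base encoder.
class ValidatingEncoder : public Encoder {
public:
    ValidatingEncoder(const std::vector<Type>& expected, Encoder& base)
        : expected_(expected), base_(base), pos_(0) {}

    void encodeFloat(float v) override {
        check(Type::Float, "float");
        base_.encodeFloat(v);
    }
    void encodeDouble(double v) override {
        check(Type::Double, "double");
        base_.encodeDouble(v);
    }

private:
    // Throws if the next expected type differs from the attempted call.
    void check(Type actual, const char* name) {
        if (pos_ >= expected_.size() || expected_[pos_] != actual) {
            throw std::runtime_error(std::string("unexpected call: ") + name);
        }
        ++pos_;
    }

    std::vector<Type> expected_;
    Encoder& base_;
    std::size_t pos_;
};

}  // namespace toy
```

If the expected sequence is two doubles, a call to encodeDouble passes through to the base encoder, while a following call to encodeFloat throws before the base encoder ever sees it. This is also why wrapping an encoder that already validates internally adds cost without adding safety.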
Generic data objects
A third way to encode and decode data is to use Avro's generic datum. Avro's generic datum allows you to read any arbitrary data corresponding to an arbitrary schema into a generic object. One need not know anything about the schema or data at compile time.
Here is an example of how one can use the generic datum.
File: generic.cc
#include "avro/Compiler.hh"
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/Specific.hh"
#include "avro/Generic.hh"
...
std::ifstream ifs("cpx.json");
...
std::cout << "Type: " << datum.type() << std::endl;
...
std::cout << "Field-count: " << r.fieldCount() << std::endl;
...
std::cout << "Real: " << f0.value<double>() << std::endl;
...
std::cout << "Imaginary: " << f1.value<double>() << std::endl;
In this example, we encode the data using generated code and decode it with generic datum. Then we examine the contents of the generic datum and extract them. Please see avro::GenericDatum for more details on how to use it.
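For readers unfamiliar with the idea of a generic datum, the following toy sketch shows the general shape of such an object: a value that carries its own type tag and can be inspected at run time. It is a simplified illustration, not the avro::GenericDatum class. The namespace toy, the class Datum and the accessor asDouble are hypothetical, and only doubles and records are modeled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

namespace toy {  // hypothetical namespace; not the avro namespace

enum Type { TYPE_DOUBLE, TYPE_RECORD };

// A value that knows its own type: either a double or a record of fields.
class Datum {
public:
    static Datum makeDouble(double v) {
        Datum d(TYPE_DOUBLE);
        d.number_ = v;
        return d;
    }
    static Datum makeRecord(const std::vector<Datum>& fields) {
        Datum d(TYPE_RECORD);
        d.fields_ = fields;
        return d;
    }

    Type type() const { return type_; }

    // Valid only for doubles.
    double asDouble() const {
        if (type_ != TYPE_DOUBLE) throw std::logic_error("not a double");
        return number_;
    }

    // Valid only for records.
    std::size_t fieldCount() const {
        if (type_ != TYPE_RECORD) throw std::logic_error("not a record");
        return fields_.size();
    }
    const Datum& fieldAt(std::size_t pos) const {
        if (type_ != TYPE_RECORD) throw std::logic_error("not a record");
        return fields_.at(pos);
    }

private:
    explicit Datum(Type t) : type_(t), number_(0.0) {}

    Type type_;
    double number_;
    std::vector<Datum> fields_;
};

}  // namespace toy
```

Code that receives such an object first checks type(), then, for a record, walks the fields with fieldCount() and fieldAt(), which is the same inspection order the example above follows.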
Reading data with a schema different from that of the writer
It is possible to read data written according to one schema using a different schema, provided the reader's schema and the writer's schema are compatible according to Avro's schema resolution rules.
For example, you have a reader which is interested only in the imaginary part of a complex number while the writer writes both the real and imaginary parts. It is possible to do automatic schema resolution between the writer's schema and the reader's schema as shown below.
File: imaginary.json
{"type": "record", "name": "cpx",
 "fields": [{"name": "im", "type": "double"}]}
avrogencpp -i imaginary.json -o imaginary.hh -n i
File: resolving.cc
22 #include "imaginary.hh"
24 #include "avro/Compiler.hh"
25 #include "avro/Encoder.hh"
26 #include "avro/Decoder.hh"
27 #include "avro/Specific.hh"
28 #include "avro/Generic.hh"
...
34 std::ifstream ifs(filename);
...
61 std::cout << "Imaginary: " << c2.im << std::endl;
In this example, the writer and the reader deal with different schemas; both have a record with the name 'cpx'. The writer's schema has two fields and the reader's has just one. We generated code for the writer's schema in the namespace c and the reader's in i.
Please notice how the reading part of the example at line 60 reads as if the stream contains the data corresponding to its schema. The schema resolution is automatically done by the resolving decoder.
In this example, we have used a simple (somewhat artificial) projection, where the set of fields in the reader's schema is a subset of the set of fields in the writer's. More complex resolutions are allowed by the Avro specification.
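What a projection like this involves can be sketched in a few lines without the library. The example below is a simplified model, not the resolving decoder's implementation. It assumes every field is a double, represents each schema as a plain list of field names, and uses hypothetical helpers appendDouble, readDouble and resolveRecord. It follows the specification's rule for records: fields are matched by name, and a writer field the reader does not declare is skipped. Reader fields missing from the writer would need default values, which this sketch does not handle and reports as an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Append a double as 8 little-endian bytes (Avro binary encoding of double).
void appendDouble(std::vector<std::uint8_t>& out, double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    }
}

// Read a double from 8 little-endian bytes starting at pos.
double readDouble(const std::vector<std::uint8_t>& in, std::size_t pos) {
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    }
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Walk the data in the writer's field order and keep only the fields the
// reader asks for, returned in the reader's field order.
std::vector<double> resolveRecord(const std::vector<std::string>& writerFields,
                                  const std::vector<std::string>& readerFields,
                                  const std::vector<std::uint8_t>& data) {
    std::vector<double> result(readerFields.size(), 0.0);
    std::vector<bool> filled(readerFields.size(), false);
    std::size_t pos = 0;
    for (const std::string& w : writerFields) {
        if (pos + 8 > data.size()) throw std::runtime_error("truncated data");
        for (std::size_t r = 0; r < readerFields.size(); ++r) {
            if (readerFields[r] == w) {
                result[r] = readDouble(data, pos);
                filled[r] = true;
            }
        }
        pos += 8;  // consume the field whether or not the reader wanted it
    }
    for (bool f : filled) {
        if (!f) throw std::runtime_error("reader field missing in writer");
    }
    return result;
}
```

With writer fields re and im and a reader that lists only im, the function skips the first 8 bytes and returns the single value stored in the second 8 bytes.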
Using Avro data files
The Avro specification defines a format for data files. Avro C++ implements the specification. The code below demonstrates how one can use the Avro data file to store and retrieve a collection of objects corresponding to a given schema.
File: datafile.cc
#include "avro/Encoder.hh"
#include "avro/Decoder.hh"
#include "avro/ValidSchema.hh"
#include "avro/Compiler.hh"
#include "avro/DataFile.hh"
...
std::ifstream ifs(filename);
...
for (int i = 0; i < 100; i++) {
...
while (dfr.read(c2)) {
    std::cout << '(' << c2.re << ", " << c2.im << ')' << std::endl;
Please see DataFile.hh for more details.
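As a small aside on the file format itself, the Avro specification states that an object container file begins with the four bytes 'O', 'b', 'j', 1. The sketch below checks a buffer for that prefix. It is independent of the Avro C++ library, and hasAvroMagic is a hypothetical helper named for this illustration; it does not validate the rest of the file.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the buffer starts with the four magic bytes
// 'O', 'b', 'j', 1 that open an Avro object container file.
bool hasAvroMagic(const std::vector<std::uint8_t>& bytes) {
    static const std::uint8_t kMagic[4] = {'O', 'b', 'j', 1};
    if (bytes.size() < 4) {
        return false;
    }
    for (std::size_t i = 0; i < 4; ++i) {
        if (bytes[i] != kMagic[i]) {
            return false;
        }
    }
    return true;
}
```

A check like this can be a cheap first test before handing a file to a data file reader, for example to give a clearer error message when the wrong file is supplied.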