Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Avro is a data serialization system. See http://hadoop.apache.org/avro/docs/current/ for background information.
This is the documentation for a C++ implementation of Avro. The library includes:
objects for assembling schemas programmatically
objects for reading and writing data, which may be used to build custom serializers and parsers
an object that validates the data against a schema during serialization (used primarily for debugging)
an object that reads a schema during parsing, and notifies the reader which type (and name or other attributes) to expect next, used for debugging or for building dynamic parsers that don't know a priori which data to expect
a code generation tool that creates C++ objects from a schema, and the code to convert back and forth between the serialized data and the object
a parser that can convert data written in one schema to a C++ object with a different schema
Although Avro does not require use of code generation, the easiest way to get started with the Avro C++ library is to use the code generation tool. The code generator reads a schema and outputs a C++ object to represent the data for the schema. It also creates the code to serialize and deserialize this object, so all the heavy coding is done for you. Even if you wish to write custom serializers or parsers using the core C++ libraries, the generated code can serve as an example of how to use these libraries.
Let's walk through an example using a simple schema that represents a complex number:
{
  "type": "record",
  "name": "complex",
  "fields" : [
    {"name": "real", "type": "double"},
    {"name": "imaginary", "type" : "double"}
  ]
}
Assume this JSON representation of the schema is stored in a file called imaginary. Generating the code is a two-step process:
precompile < imaginary > imaginary.flat
The precompile step converts the schema into an intermediate format that is used by the code generator. This intermediate file is just a text-based representation of the schema, flattened by a depth-first traversal of the tree structure of the schema types.
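The flattening idea can be sketched as follows. The actual intermediate format written by precompile is not shown in this document, so the `Node` type and the one-line-per-node output below are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical schema tree node; this is not the Avro library's own type.
struct Node {
    std::string type;            // e.g. "record" or "double"
    std::string name;            // record or field name, may be empty
    std::vector<Node> children;  // nested types, in declaration order
};

// Depth-first, pre-order traversal: emit the node, then each child in turn.
void flatten(const Node &node, int depth, std::ostream &out)
{
    out << depth << ' ' << node.type;
    if (!node.name.empty()) {
        out << ' ' << node.name;
    }
    out << '\n';
    for (const Node &child : node.children) {
        flatten(child, depth + 1, out);
    }
}

// Builds the "complex" record from this walkthrough and flattens it.
std::string flattenComplex()
{
    Node record{"record", "complex",
                {Node{"double", "real", {}}, Node{"double", "imaginary", {}}}};
    std::ostringstream out;
    flatten(record, 0, out);
    return out.str();
}
```

Each output line carries the nesting depth, so a reader of the flat file could rebuild the tree without parsing JSON again.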
python scripts/gen-cppcode.py --input=imaginary.flat --output=example.hh --namespace=Math
This tells the code generator to read your flattened schema as its input and generate a C++ header file, example.hh. The optional --namespace argument puts the objects in that namespace; if you don't specify one, the default namespace avrouser is used.
Here's the start of the generated code:
namespace Math {

struct complex {

    complex () :
        real(),
        imaginary()
    { }

    double real;
    double imaginary;
};
This is the C++ representation of the schema. It creates a structure for the record, a default constructor, and a member for each field of the record.
There is some other output that we can ignore for now. Let's look at an example of serializing this data:
void serializeMyData()
{
    Math::complex c;

    c.real = 10.0;
    c.imaginary = 20.0;

    // Declare the stream to which to serialize the data
    std::ostringstream os;

    // Ostreamer wraps a stream so that Avro serializer can use it
    avro::Ostreamer ostreamer(os);

    // Writer is the object that will do the actual I/O
    avro::Writer writer(ostreamer);

    // This will invoke the writer on my object
    avro::serialize(writer, c);

    // At this point, the ostringstream "os" stores the serialized data!
}
Using the generated code, all that is required to serialize the data is to call avro::serialize() on the object. Some setup is required to tell the serializer where to write the data. The Ostreamer object is a simple object that understands how to write to STL ostreams. It is derived from a virtual base class called OutputStreamer. You can derive from OutputStreamer to create an object that can write to any kind of buffer you wish.
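The pattern of deriving from a virtual base class to target a different buffer can be sketched like this. The OutputStreamer interface itself is not shown in this document, so `ByteSink` and its `writeBytes` method are hypothetical stand-ins, not the library's real signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an output base class. The real OutputStreamer
// interface may look different.
class ByteSink {
  public:
    virtual ~ByteSink() {}
    // Write `size` bytes from `data`; returns the number of bytes written.
    virtual std::size_t writeBytes(const void *data, std::size_t size) = 0;
};

// A sink that appends to an in-memory vector rather than an STL stream.
class VectorSink : public ByteSink {
  public:
    std::size_t writeBytes(const void *data, std::size_t size) override
    {
        const std::uint8_t *bytes = static_cast<const std::uint8_t *>(data);
        buffer_.insert(buffer_.end(), bytes, bytes + size);
        return size;
    }

    const std::vector<std::uint8_t> &buffer() const { return buffer_; }

  private:
    std::vector<std::uint8_t> buffer_;
};

// Code written against the base class works with any derived sink.
std::size_t writeGreeting(ByteSink &sink)
{
    const char text[] = {'a', 'v', 'r', 'o'};
    return sink.writeBytes(text, sizeof text);
}
```

The writing code only sees the base class, so swapping the vector for a socket or a memory-mapped file would not require changing it.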
Now let's do the inverse, and read the serialized data into our object:
void parseMyData(const std::string &myData)
{
    Math::complex c;

    // Assume the serialized data is being passed as the contents of a string
    // (Note: this may not be the best way since the data is binary)

    // Declare a stream from which to read the serialized data
    std::istringstream is(myData);

    // Istreamer wraps a stream so that Avro parser can use it
    avro::Istreamer istreamer(is);

    // Reader is the object that will do the actual I/O
    avro::Reader reader(istreamer);

    // This will invoke the reader on my object
    avro::parse(reader, c);

    // At this point, c is populated with the deserialized data!
}
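To make the note about binary data concrete, here is a sketch of what the serialized bytes look like. The Avro specification encodes a double as 8 bytes in little-endian IEEE 754 order, and a record as the concatenation of its fields' encodings. The code below hand-encodes the complex record under those rules without using the Avro library; `encodeDouble` and `encodeComplex` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: append a double as 8 little-endian bytes.
void encodeDouble(double value, std::string &out)
{
    std::uint64_t bits = 0;
    static_assert(sizeof bits == sizeof value, "double must be 64 bits");
    std::memcpy(&bits, &value, sizeof bits);
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<char>((bits >> (8 * i)) & 0xFF));
    }
}

// A record is just its fields, encoded in schema order with no framing.
std::string encodeComplex(double real, double imaginary)
{
    std::string out;
    encodeDouble(real, out);
    encodeDouble(imaginary, out);
    return out;
}
```

For real = 10.0 the first byte is zero, so any code that treats the buffer as a C string (for example via `strlen`) sees an empty string. A std::string is safe only as long as its length is carried along with it.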
In case you're wondering how avro::serialize() and avro::parse() handled the custom data type, the answer is in the generated code. It created the following functions:
template <typename Serializer>
inline void serialize(Serializer &s, const complex &val, const boost::true_type &)
{
    s.writeRecord();
    serialize(s, val.real);
    serialize(s, val.imaginary);
}

template <typename Parser>
inline void parse(Parser &p, complex &val, const boost::true_type &)
{
    p.readRecord();
    parse(p, val.real);
    parse(p, val.imaginary);
}
It also adds the following to the avro namespace:
template <> struct is_serializable<Math::complex> : public boost::true_type{};
This sets up a type trait for the complex structure, telling Avro that this object has serialize and parse functions available.
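How a trait like this selects the right overload can be shown with a self-contained sketch of tag dispatch. It uses std::true_type in place of boost::true_type, and the `sketch` namespace, `LogWriter`, and the two-argument entry point are assumptions for illustration; the library's own dispatch code may be organized differently.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace sketch {

struct complex { double real; double imaginary; };

// The trait defaults to false; a specialization opts a type in.
template <typename T> struct is_serializable : std::false_type {};
template <> struct is_serializable<double> : std::true_type {};
template <> struct is_serializable<complex> : std::true_type {};

// Hypothetical writer that records each call so the order can be inspected.
struct LogWriter {
    std::string log;
    void writeRecord() { log += "record;"; }
    void writeDouble(double) { log += "double;"; }
};

// Entry point: turns the trait into a tag argument for overload resolution.
template <typename Serializer, typename T>
void serialize(Serializer &s, const T &val)
{
    serialize(s, val, is_serializable<T>());
}

// Chosen for doubles.
template <typename Serializer>
void serialize(Serializer &s, double val, const std::true_type &)
{
    s.writeDouble(val);
}

// Chosen for the record; each field goes back through the entry point.
template <typename Serializer>
void serialize(Serializer &s, const complex &val, const std::true_type &)
{
    s.writeRecord();
    serialize(s, val.real);
    serialize(s, val.imaginary);
}

// Serializes one record and returns the sequence of writer calls.
std::string demoDispatch()
{
    LogWriter w;
    complex c = {10.0, 20.0};
    serialize(w, c);
    return w.log;
}

}  // namespace sketch
```

In this sketch, a type without a specialization produces a false tag, no overload accepts it, and the mistake surfaces as a compile error rather than at runtime.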
The section above covered nearly everything you need to know to get started reading and writing objects with the Avro C++ code generator. The following sections cover some additional features.
The library provides some utilities to read a schema that is stored in a JSON file or string. Take a look:
void readSchema()
{
    // My schema is stored in a file called "example"
    std::ifstream in("example");

    avro::ValidSchema mySchema;
    avro::compileJsonSchema(in, mySchema);
}
This reads the file, and parses the JSON schema into an object of type avro::ValidSchema. If, for some reason, the schema is not valid, the ValidSchema object will not be set, and an exception will be thrown.
The last section showed how to create a ValidSchema object from a schema stored in JSON. You may wonder, what can I use the ValidSchema for?
One use is to ensure that the writer is actually writing the types that match what the schema expects. Let's revisit the serialize function from above, but this time checking against our schema.
void serializeMyData(const avro::ValidSchema &mySchema)
{
    Math::complex c;

    c.real = 10.0;
    c.imaginary = 20.0;

    std::ostringstream os;
    avro::Ostreamer ostreamer(os);

    // ValidatingWriter will make sure our serializer is writing the correct types
    avro::ValidatingWriter writer(mySchema, ostreamer);

    try {
        avro::serialize(writer, c);
        // At this point, the ostringstream "os" stores the serialized data!
    }
    catch (avro::Exception &e) {
        std::cerr << "ValidatingWriter encountered an error: " << e.what();
    }
}
The difference between this code and the previous version is that the Writer object was replaced with a ValidatingWriter. If the serializer function mistakenly writes a type that does not match the schema, the ValidatingWriter will throw an exception.
The ValidatingWriter will incur more processing overhead while writing your data. For the generated code, it's not necessary to use validation, because (hopefully!) the mechanically generated code will match the schema. Nevertheless it is nice while debugging to have the added safety of validation, especially when writing and testing your own serializing code.
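The idea behind validation, and the source of its overhead, can be sketched without the library: every write is compared against the next type the schema expects. The `FieldType` enumeration and `CheckingWriter` class below are hypothetical and far simpler than a real validating writer.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical type tags; the library's own enumeration may differ.
enum FieldType { kRecord, kDouble, kLong };

// Checks each write against the flattened list of types the schema expects.
class CheckingWriter {
  public:
    explicit CheckingWriter(const std::vector<FieldType> &expected)
        : expected_(expected), next_(0) {}

    void writeRecord() { check(kRecord); }
    void writeDouble(double) { check(kDouble); }
    void writeLong(long) { check(kLong); }

    // True once every expected item has been written.
    bool done() const { return next_ == expected_.size(); }

  private:
    void check(FieldType actual)
    {
        if (next_ >= expected_.size() || expected_[next_] != actual) {
            throw std::runtime_error("write does not match schema");
        }
        ++next_;
    }

    std::vector<FieldType> expected_;
    std::size_t next_;
};

// Returns true if a deliberately buggy serializer is caught.
bool buggySerializerIsCaught()
{
    std::vector<FieldType> schema;
    schema.push_back(kRecord);
    schema.push_back(kDouble);
    schema.push_back(kDouble);

    CheckingWriter writer(schema);
    try {
        writer.writeRecord();
        writer.writeDouble(10.0);
        writer.writeLong(20);  // bug: the schema expects a double here
    } catch (const std::runtime_error &) {
        return true;
    }
    return false;
}
```

The extra comparison on every write is the cost; the benefit is that a hand-written serializer that drifts from the schema fails at the first wrong write instead of silently producing unreadable data.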
The ValidSchema may also be used when parsing data. In addition to making sure that the parser reads types that match the schema, it provides an interface to query the next type to expect, and the field's name if it is a member of a record.
The following code is not very flexible, but it does demonstrate the API:
void parseMyData(const std::string &myData, const avro::ValidSchema &mySchema)
{
    std::istringstream is(myData);
    avro::Istreamer istreamer(is);

    // Manually parse data; the Parser object binds the data to the schema
    avro::Parser<avro::ValidatingReader> parser(mySchema, istreamer);

    assert(parser.nextType() == avro::AVRO_RECORD);

    // Begin parsing
    parser.beginRecord();

    Math::complex c;

    assert(parser.currentRecordName() == "complex");
    for (int i = 0; i < 2; ++i) {
        assert(parser.nextType() == avro::AVRO_DOUBLE);
        if (parser.nextFieldName() == "real") {
            c.real = parser.readDouble();
        }
        else if (parser.nextFieldName() == "imaginary") {
            c.imaginary = parser.readDouble();
        }
        else {
            std::cout << "I did not expect that!\n";
        }
    }
}
The above code shows that if you don't know the schema at compile time, you can still write code that parses the data, by reading the schema at runtime and querying the ValidatingReader to discover what is in the serialized data.
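A fully schema-driven reader can be sketched along the same lines. The example below does not use the library's Parser: `RuntimeRecord` is a hypothetical stand-in for a schema loaded at runtime, restricted to records whose fields are all doubles, and the bytes are decoded by hand using Avro's 8-byte little-endian encoding for doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical runtime description of a record whose fields are all doubles.
// It stands in for a schema loaded at runtime; it is not avro::ValidSchema.
struct RuntimeRecord {
    std::string name;
    std::vector<std::string> fieldNames;
};

// Decode 8 little-endian bytes into a double.
double decodeDouble(const unsigned char *bytes)
{
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    }
    double value = 0.0;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Walk the fields named by the runtime schema and collect their values.
std::map<std::string, double> parseDynamic(const RuntimeRecord &schema,
                                           const std::string &data)
{
    if (data.size() < schema.fieldNames.size() * 8) {
        throw std::runtime_error("not enough data for " + schema.name);
    }
    const unsigned char *in =
        reinterpret_cast<const unsigned char *>(data.data());
    std::map<std::string, double> values;
    for (std::size_t i = 0; i < schema.fieldNames.size(); ++i) {
        values[schema.fieldNames[i]] = decodeDouble(in + 8 * i);
    }
    return values;
}
```

Nothing in `parseDynamic` mentions "real" or "imaginary"; the field names come entirely from the schema object, which is what lets one routine handle records it has never seen.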
You can use objects to create schemas in your code. There are schema objects for each primitive and compound type, and they all share a common base class called Schema.
Here's an example of creating a schema for an array of records of complex data types:
void createMySchema()
{
    // First construct our complex data type:
    avro::RecordSchema myRecord("complex");

    // Now populate my record with fields (each field is another schema):
    myRecord.addField("real", avro::DoubleSchema());
    myRecord.addField("imaginary", avro::DoubleSchema());

    // The complex record is the same as used above, let's make a schema
    // for an array of these records
    avro::ArraySchema complexArray(myRecord);
The above code created our schema, but at this point the schema is not guaranteed to be valid (a record might have no fields, or some field names might not be unique, etc.). In order to use the schema, you need to convert it to a ValidSchema object:
    // this will throw if the schema is invalid!
    avro::ValidSchema validComplexArray(complexArray);

    // now that I have my schema, what does it look like in JSON?
    // print it to the screen
    validComplexArray.toJson(std::cout);
}
When the above code executes, it prints:
{
    "type": "array",
    "items": {
        "type": "record",
        "name": "complex",
        "fields": [
            {
                "name": "real",
                "type": "double"
            },
            {
                "name": "imaginary",
                "type": "double"
            }
        ]
    }
}
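The two validity rules mentioned earlier (a record needs at least one field, and its field names must be unique) can be sketched as a standalone check. `RecordDraft`, `validate`, and `isValid` are hypothetical names; the library performs its own checks when a ValidSchema is constructed, and those may cover more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record description; this is not the library's RecordSchema.
struct RecordDraft {
    std::string name;
    std::vector<std::string> fieldNames;
};

// Throws if the draft breaks either rule; returns normally otherwise.
void validate(const RecordDraft &draft)
{
    if (draft.fieldNames.empty()) {
        throw std::invalid_argument("record '" + draft.name + "' has no fields");
    }
    std::set<std::string> seen;
    for (std::size_t i = 0; i < draft.fieldNames.size(); ++i) {
        // insert() reports whether the name was new to the set.
        if (!seen.insert(draft.fieldNames[i]).second) {
            throw std::invalid_argument("duplicate field name: " +
                                        draft.fieldNames[i]);
        }
    }
}

// Convenience wrapper that converts the exception into a boolean.
bool isValid(const RecordDraft &draft)
{
    try {
        validate(draft);
        return true;
    } catch (const std::invalid_argument &) {
        return false;
    }
}
```

Separating the mutable draft from the validated result mirrors the split between the schema objects and ValidSchema: code that receives the validated form never has to repeat these checks.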
The Avro spec provides rules for dealing with schemas that are not exactly the same (for example, the schema may evolve over time, and the data my program now expects may differ from the data stored previously with the older version).
The code generation tool may help again in this case. For each structure it generates, it creates a special indexing structure that may be used to read the data, even if the data was written with a different schema.
In example.hh, this indexing structure looks like:
class complex_Layout : public avro::CompoundOffset {
  public:
    complex_Layout(size_t offset) :
        CompoundOffset(offset)
    {
        add(new avro::Offset(offset + offsetof(complex, real)));
        add(new avro::Offset(offset + offsetof(complex, imaginary)));
    }
};
Let's say my data was previously written with floats instead of doubles. According to the schema resolution rules, the schemas are compatible, because floats are promotable to doubles. As long as both the old and the new schemas are available, a dynamic parser may be created that reads the data into the code-generated structure.
void dynamicParse(const std::string &data,
                  const avro::ValidSchema &writerSchema,
                  const avro::ValidSchema &readerSchema)
{
    // Instantiate the Layout object (offset 0: the struct is not nested in another object)
    Math::complex_Layout layout(0);

    // Create a schema parser that is aware of my type's layout, and both schemas
    avro::ResolverSchema resolverSchema(writerSchema, readerSchema, layout);

    // Set up the reader
    std::istringstream is(data);
    avro::Istreamer istreamer(is);
    avro::ResolvingReader reader(resolverSchema, istreamer);

    Math::complex c;

    // Do the parse
    avro::parse(reader, c);

    // At this point, c is populated with the deserialized data!
}
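To show why the layout offsets matter, here is a library-free sketch of the same resolution: data written as 4-byte little-endian floats (Avro's float encoding) is promoted to double and stored at byte offsets obtained with offsetof. The `sketch` namespace, `decodeFloat`, `resolveInto`, and `readOldData` are hypothetical helpers, and a real resolver handles many more cases than this single promotion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch {

struct complex { double real; double imaginary; };

// Decode a 4-byte little-endian IEEE 754 float (the writer's type).
float decodeFloat(const unsigned char *bytes)
{
    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i) {
        bits |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Reads one float per offset, promotes it to double, and stores it at that
// byte offset inside the target object.
void resolveInto(const std::string &data,
                 const std::vector<std::size_t> &offsets, void *target)
{
    if (data.size() < offsets.size() * 4) {
        throw std::runtime_error("not enough data");
    }
    const unsigned char *in =
        reinterpret_cast<const unsigned char *>(data.data());
    char *base = static_cast<char *>(target);
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        double promoted = decodeFloat(in + 4 * i);  // float -> double
        std::memcpy(base + offsets[i], &promoted, sizeof promoted);
    }
}

// Fills a complex from data written under the older float-based schema.
complex readOldData(const std::string &data)
{
    std::vector<std::size_t> offsets;
    offsets.push_back(offsetof(complex, real));
    offsets.push_back(offsetof(complex, imaginary));

    complex c = {0.0, 0.0};
    resolveInto(data, offsets, &c);
    return c;
}

}  // namespace sketch
```

Because `resolveInto` knows only offsets and not member names, the same routine could fill any struct of doubles, which is the role the generated Layout class plays for the resolver.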