Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Avro is a data serialization system. See http://hadoop.apache.org/avro/docs/current/ for background information.
This is the documentation for a C++ implementation of Avro. The library includes:
objects for assembling schemas programmatically
objects for reading and writing data, which may be used to build custom serializers and parsers
an object that validates the data against a schema during serialization (used primarily for debugging)
an object that reads a schema during parsing, and notifies the reader which type (and name or other attributes) to expect next, used for debugging or for building dynamic parsers that don't know a priori which data to expect
a code generation tool that creates C++ objects from a schema, and the code to convert back and forth between the serialized data and the object
a parser that can convert data written in one schema to a C++ object with a different schema
Although Avro does not require use of code generation, the easiest way to get started with the Avro C++ library is to use the code generation tool. The code generator reads a schema and outputs a C++ object to represent the data for the schema. It also creates the code to serialize this object and to deserialize it, so all the heavy coding is done for you. Even if you wish to write custom serializers or parsers using the core C++ libraries, the generated code can serve as an example of how to use these libraries.
Let's walk through an example, using a simple schema. Use the schema that represents a complex number:
    {
      "type": "record",
      "name": "complex",
      "fields" : [
        {"name": "real", "type": "double"},
        {"name": "imaginary", "type" : "double"}
      ]
    }
Assume this JSON representation of the schema is stored in a file called imaginary. Generating the code is a two-step process:
precompile < imaginary > imaginary.flat
The precompile step converts the schema into an intermediate format that is used by the code generator. This intermediate file is just a text-based representation of the schema, flattened by a depth-first traversal of the tree structure of the schema types.
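To make the idea of flattening concrete, here is a minimal sketch of a depth-first (pre-order) walk over a schema tree. It does not use the Avro library: SchemaNode, flatten(), flattenComplexSchema() and the "depth type name" line format are all hypothetical, chosen only for illustration. The real layout of the .flat file is not described here and will differ.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical schema tree node, not the library's own type.
struct SchemaNode {
    std::string type;                  // e.g. "record", "double"
    std::string name;                  // record or field name, may be empty
    std::vector<SchemaNode> children;  // nested types, in schema order
};

// Emit one line per node, parent before children (pre-order, depth-first).
// The "depth type name" line format is an assumption made for this sketch.
void flatten(const SchemaNode &node, int depth, std::ostream &out) {
    out << depth << ' ' << node.type;
    if (!node.name.empty()) {
        out << ' ' << node.name;
    }
    out << '\n';
    for (const SchemaNode &child : node.children) {
        flatten(child, depth + 1, out);
    }
}

// Build the "complex" record from the example schema and flatten it.
std::string flattenComplexSchema() {
    SchemaNode real{"double", "real", {}};
    SchemaNode imaginary{"double", "imaginary", {}};
    SchemaNode record{"record", "complex", {real, imaginary}};

    std::ostringstream out;
    flatten(record, 0, out);
    return out.str();
}
```

For the complex record, this sketch produces three lines: the record itself at depth 0, followed by its two double fields at depth 1.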
python scripts/gen-cppcode.py --input=imaginary.flat --output=example.hh --namespace=Math
This tells the code generator to read your flattened schema as its input, and generate a C++ header file in example.hh. The optional argument namespace will put the objects in that namespace (if you don't specify a namespace, you will still get a default namespace of avrouser).
Here's the start of the generated code:
    namespace Math {

    struct complex {

        complex () :
            real(),
            imaginary()
        { }

        double real;
        double imaginary;
    };
This is the C++ representation of the schema. It creates a structure for the record, a default constructor, and a member for each field of the record.
There is some other output that we can ignore for now. Let's look at an example of serializing this data:
    void serializeMyData() {

        Math::complex c;
        c.real = 10.0;
        c.imaginary = 20.0;

        // Writer is the object that will do the actual I/O and buffer the results
        avro::Writer writer;

        // This will invoke the writer on my object
        avro::serialize(writer, c);

        // At this point, the writer stores the serialized data in a buffer,
        // which can be extracted to an immutable buffer
        avro::InputBuffer buffer = writer.buffer();
    }
Using the generated code, all that is required to serialize the data is to call avro::serialize() on the object.
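For intuition about what the serialized bytes look like, here is a small standalone sketch that follows the binary encoding rules in the Avro specification: a double is written as the 8 bytes of its IEEE 754 bit pattern in little-endian order, and a record is simply its fields written in declaration order. It does not use avro::Writer. The names Complex, appendDouble() and encodeComplex() are hypothetical stand-ins, and the sketch assumes the platform's double is a 64-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for the generated Math::complex struct.
struct Complex {
    double real;
    double imaginary;
};

// Append a double as the 8 little-endian bytes of its IEEE 754 bit pattern.
void appendDouble(std::vector<std::uint8_t> &out, double value) {
    std::uint64_t bits = 0;
    static_assert(sizeof(bits) == sizeof(value), "double must be 64 bits");
    std::memcpy(&bits, &value, sizeof(bits));
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF));
    }
}

// A record is encoded as its fields in declaration order, with no
// length prefix or separators between them.
std::vector<std::uint8_t> encodeComplex(const Complex &c) {
    std::vector<std::uint8_t> out;
    appendDouble(out, c.real);
    appendDouble(out, c.imaginary);
    return out;
}
```

Encoding the value from the example above (real 10.0, imaginary 20.0) this way yields 16 bytes, 8 for each field.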
The data may be accessed by requesting an avro::InputBuffer object. From there, it can be sent to a file, over the network, etc.
Now let's do the inverse, and read the serialized data into our object:
    void parseMyData(const avro::InputBuffer &myData) {

        Math::complex c;

        // Reader is the object that will do the actual I/O
        avro::Reader reader(myData);

        // This will invoke the reader on my object
        avro::parse(reader, c);

        // At this point, c is populated with the deserialized data!
    }
In case you're wondering how avro::serialize() and avro::parse() handled the custom data type, the answer is in the generated code. It created the following functions:
    template <typename Serializer>
    inline void serialize(Serializer &s, const complex &val, const boost::true_type &) {
        s.writeRecord();
        serialize(s, val.real);
        serialize(s, val.imaginary);
        s.writeRecordEnd();
    }

    template <typename Parser>
    inline void parse(Parser &p, complex &val, const boost::true_type &) {
        p.readRecord();
        parse(p, val.real);
        parse(p, val.imaginary);
        p.readRecordEnd();
    }
It also adds the following to the avro namespace:
template <> struct is_serializable<Math::complex> : public boost::true_type{};
This sets up a type trait for the complex structure, telling Avro that this object has serialize and parse functions available.
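If the trait mechanism is unfamiliar, the following standalone sketch shows the general pattern (often called tag dispatch) with standard library types in place of Boost. It is not the library's code: the is_serializable defined here is a local stand-in, Complex and Opaque are hypothetical types, the output format is made up, and the fallback overload for unsupported types is an assumption added to show what the trait selects between.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// Local stand-in for the trait: false unless a specialization opts in.
template <typename T>
struct is_serializable : std::false_type {};

struct Complex {
    double real;
    double imaginary;
};

// Opt in, mirroring the specialization the code generator emits.
template <>
struct is_serializable<Complex> : std::true_type {};

// A type with no serialize function; its trait stays false.
struct Opaque {};

// Selected when the trait is true.
inline void serialize(std::ostream &out, const Complex &val, const std::true_type &) {
    out << val.real << ',' << val.imaginary;
}

// Fallback selected when the trait is false (an assumption of this sketch).
template <typename T>
inline void serialize(std::ostream &out, const T &, const std::false_type &) {
    out << "<unsupported>";
}

// Entry point: the trait picks the overload at compile time.
template <typename T>
inline void serialize(std::ostream &out, const T &val) {
    serialize(out, val, is_serializable<T>());
}
```

Calling serialize(out, c) on a Complex reaches the first overload, because is_serializable<Complex> converts to std::true_type. Calling it on an Opaque reaches the fallback.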
The section above covers nearly everything needed to start reading and writing objects with the Avro C++ code generator. The following sections cover additional features of the library.
The library provides some utilities to read a schema that is stored in a JSON file or string. Take a look:
    void readSchema() {

        // My schema is stored in a file called "example"
        std::ifstream in("example");

        avro::ValidSchema mySchema;
        avro::compileJsonSchema(in, mySchema);
    }
This reads the file, and parses the JSON schema into an object of type avro::ValidSchema. If, for some reason, the schema is not valid, the ValidSchema object will not be set, and an exception will be thrown.
The last section showed how to create a ValidSchema object from a schema stored in JSON. You may wonder, what can I use the ValidSchema for?
One use is to ensure that the writer is actually writing the types that match what the schema expects. Let's revisit the serialize function from above, but this time checking against our schema.
    void serializeMyData(const avro::ValidSchema &mySchema) {

        Math::complex c;
        c.real = 10.0;
        c.imaginary = 20.0;

        // ValidatingWriter will make sure our serializer is writing the correct types
        avro::ValidatingWriter writer(mySchema);

        try {
            avro::serialize(writer, c);
            // At this point, the writer stores the serialized data!
        }
        catch (avro::Exception &e) {
            std::cerr << "ValidatingWriter encountered an error: " << e.what();
        }
    }
The difference between this code and the previous version is that the Writer object was replaced with a ValidatingWriter. If the serializer function mistakenly writes a type that does not match the schema, the ValidatingWriter will throw an exception.
The ValidatingWriter will incur more processing overhead while writing your data. For the generated code, it's not necessary to use validation, because (hopefully!) the mechanically generated code will match the schema. Nevertheless it is nice while debugging to have the added safety of validation, especially when writing and testing your own serializing code.
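To illustrate what validating during serialization means, here is a minimal standalone sketch of a writer that checks each write against the sequence of types a schema expects, and throws on a mismatch. It is not how avro::ValidatingWriter is implemented. CheckingWriter, the Type enum and complexSchemaSequence() are hypothetical, and a real validator has to handle nested and variable-length types, which a flat sequence cannot express.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type tags used only by this sketch.
enum class Type { Record, RecordEnd, Double, Int };

// Checks every write against the sequence of types the schema expects.
class CheckingWriter {
public:
    explicit CheckingWriter(std::vector<Type> expected)
        : expected_(std::move(expected)), next_(0) {}

    void writeRecord() { expect(Type::Record); }
    void writeRecordEnd() { expect(Type::RecordEnd); }
    void writeDouble(double) { expect(Type::Double); }
    void writeInt(int) { expect(Type::Int); }

    // True once every expected item has been written.
    bool complete() const { return next_ == expected_.size(); }

private:
    // Throws if the write is out of sequence, otherwise advances.
    void expect(Type actual) {
        if (next_ >= expected_.size() || expected_[next_] != actual) {
            throw std::runtime_error(
                "write does not match schema at position " + std::to_string(next_));
        }
        ++next_;
    }

    std::vector<Type> expected_;
    std::size_t next_;
};

// Expected write sequence for the "complex" record.
std::vector<Type> complexSchemaSequence() {
    return {Type::Record, Type::Double, Type::Double, Type::RecordEnd};
}
```

The extra comparison on every write is the kind of overhead described above. It is cheap per call, but it is paid for every value written.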
The ValidSchema may also be used when parsing data. In addition to making sure that the parser reads types that match the schema, it provides an interface to query the next type to expect, and the field's name if it is a member of a record.
The following code is not very flexible, but it does demonstrate the API:
    void parseMyData(const avro::InputBuffer &myData, const avro::ValidSchema &mySchema) {

        // Manually parse data, the Parser object binds the data to the schema
        avro::Parser<avro::ValidatingReader> parser(mySchema, myData);

        assert( nextType(parser) == avro::AVRO_RECORD);

        // Begin parsing
        parser.readRecord();

        Math::complex c;

        std::string recordName;
        assert( currentRecordName(parser, recordName) == true);
        assert( recordName == "complex");

        std::string fieldName;
        for(int i=0; i < 2; ++i) {
            assert( nextType(parser) == avro::AVRO_DOUBLE);
            assert( nextFieldName(parser, fieldName) == true);
            if(fieldName == "real") {
                c.real = parser.readDouble();
            }
            else if (fieldName == "imaginary") {
                c.imaginary = parser.readDouble();
            }
            else {
                std::cout << "I did not expect that!\n";
            }
        }

        parser.readRecordEnd();
    }
The above code shows that if you don't know the schema at compile time, you can still write code that parses the data, by reading the schema at runtime and querying the ValidatingReader to discover what is in the serialized data.
You can use objects to create schemas in your code. There are schema objects for each primitive and compound type, and they all share a common base class called Schema.
Here's an example of creating a schema for an array of records of complex data types:
    void createMySchema() {

        // First construct our complex data type:
        avro::RecordSchema myRecord("complex");

        // Now populate my record with fields (each field is another schema):
        myRecord.addField("real", avro::DoubleSchema());
        myRecord.addField("imaginary", avro::DoubleSchema());

        // The complex record is the same as used above, let's make a schema
        // for an array of these records
        avro::ArraySchema complexArray(myRecord);
The above code created our schema, but at this point the schema may not be valid (a record may not have any fields, or some field names may not be unique, etc.). In order to use the schema, you need to convert it to a ValidSchema object:
        // this will throw if the schema is invalid!
        avro::ValidSchema validComplexArray(complexArray);

        // now that I have my schema, what does it look like in JSON?
        // print it to the screen
        validComplexArray.toJson(std::cout);
    }
When the above code executes, it prints:
    {
      "type": "array",
      "items": {
        "type": "record",
        "name": "complex",
        "fields": [
          { "name": "real", "type": "double" },
          { "name": "imaginary", "type": "double" }
        ]
      }
    }
The Avro spec provides rules for dealing with schemas that are not exactly the same (for example, the schema may evolve over time, and the data my program now expects may differ from the data stored previously with the older version).
The code generation tool may help again in this case. For each structure it generates, it creates a special indexing structure that may be used to read the data, even if the data was written with a different schema.
In example.hh, this indexing structure looks like:
    class complex_Layout : public avro::CompoundOffset {
      public:
        complex_Layout(size_t offset) :
            CompoundOffset(offset)
        {
            add(new avro::Offset(offset + offsetof(complex, real)));
            add(new avro::Offset(offset + offsetof(complex, imaginary)));
        }
    };
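The reason for recording offsets is that a generic parser can then store values into a structure without knowing its field names at compile time. Here is a minimal standalone sketch of that idea using only offsetof and memcpy. Complex, complexFieldOffsets() and storeDoubleAt() are hypothetical names, and this sketch makes no claim about how the library's Offset classes are actually used internally.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for the generated Math::complex struct.
struct Complex {
    double real;
    double imaginary;
};

// Byte offsets of each field, in schema order, relative to `base`.
std::vector<std::size_t> complexFieldOffsets(std::size_t base) {
    return {base + offsetof(Complex, real), base + offsetof(Complex, imaginary)};
}

// Store a double at a byte offset inside an object, without naming the field.
void storeDoubleAt(void *object, std::size_t offset, double value) {
    std::memcpy(static_cast<unsigned char *>(object) + offset, &value, sizeof(value));
}
```

A parser that holds the offsets for a type can fill in the fields in schema order:

```cpp
Complex c{0.0, 0.0};
std::vector<std::size_t> offsets = complexFieldOffsets(0);
storeDoubleAt(&c, offsets[0], 10.0);  // writes c.real
storeDoubleAt(&c, offsets[1], 20.0);  // writes c.imaginary
```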
Let's say my data was previously written with floats instead of doubles. According to the schema resolution rules, the schemas are compatible, because floats are promotable to doubles. As long as both the old and the new schemas are available, a dynamic parser may be created that reads the data into the code-generated structure.
    void dynamicParse(const avro::ValidSchema &writerSchema,
                      const avro::ValidSchema &readerSchema,
                      const avro::InputBuffer &data) {

        // Instantiate the Layout object (offset 0, since c is a top-level object)
        Math::complex_Layout layout(0);

        // Create a schema parser that is aware of my type's layout, and both schemas
        avro::ResolverSchema resolverSchema(writerSchema, readerSchema, layout);

        // Setup the reader
        avro::ResolvingReader reader(resolverSchema, data);

        Math::complex c;

        // Do the parse
        avro::parse(reader, c);

        // At this point, c is populated with the deserialized data!
    }
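To show what the float-to-double promotion amounts to at the byte level, here is a standalone sketch that reads a record written as two floats into a structure of two doubles. It follows the Avro specification's binary encoding, where a float is the 4 bytes of its IEEE 754 bit pattern in little-endian order. It does not use ResolvingReader. Complex, readFloat() and readFloatRecordAsDoubles() are hypothetical names, and the writer and reader schemas are hard-coded rather than resolved at runtime.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Stand-in for the generated Math::complex struct (reader schema: doubles).
struct Complex {
    double real;
    double imaginary;
};

// Decode one float: 4 little-endian bytes of an IEEE 754 bit pattern.
float readFloat(const std::vector<std::uint8_t> &data, std::size_t &pos) {
    if (pos + 4 > data.size()) {
        throw std::out_of_range("truncated float");
    }
    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i) {
        bits |= static_cast<std::uint32_t>(data[pos + i]) << (8 * i);
    }
    pos += 4;
    float value = 0.0f;
    static_assert(sizeof(value) == sizeof(bits), "float must be 32 bits");
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Writer schema: record of two floats. Reader schema: record of two doubles.
Complex readFloatRecordAsDoubles(const std::vector<std::uint8_t> &data) {
    std::size_t pos = 0;
    Complex c;
    c.real = readFloat(data, pos);       // float widened to double
    c.imaginary = readFloat(data, pos);  // float widened to double
    return c;
}
```

The widening is lossless, since every float value is exactly representable as a double, which is why the resolution rules allow this promotion.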