Expand description
§A primer on Apache Avro
Avro is a schema based encoding system, like Protobuf. This means that if you have raw Avro data without a schema, you are unable to decode it. It also means that the format is very space efficient.
§Schemas
Schemas are defined in JSON and look like this:
{
"type": "record",
"name": "example",
"fields": [
{"name": "a", "type": "long", "default": 42},
{"name": "b", "type": "string"}
]
}For all possible types and extra attributes, see the schema section of the specification.
Schemas can depend on each other. For example, the schema defined above can be used again or a schema can include itself:
{
"type": "record",
"name": "references",
"fields": [
{"name": "a", "type": "example"},
{"name": "b", "type": "bytes"},
{"name": "recursive", "type": ["null", "references"]}
]
}Schemas are represented using the Schema type.
§Data serialization and deserialization
There are various formats to encode and decode Avro data. Most formats use the Avro binary encoding.
§Object Container File
This is the most common file format used for Avro, it uses the binary encoding. It includes the schema in the file, and can therefore be decoded by a reader who doesn’t have the schema. It includes many records in one file.
This file format can be used via the Reader and Writer types.
§Single Object Encoding
This file format also uses the binary encoding, but the schema is not included directly. It instead includes a fingerprint of the schema, which a reader can look up in a schema database or compare with the fingerprint that the reader is expecting. This file format always contains one record.
This file format can be used via the GenericSingleObjectReader,
GenericSingleObjectWriter, SpecificSingleObjectReader,
and SpecificSingleObjectWriter types.
§Avro datums
This is not really a file format, as it’s just the raw Avro binary data. It does not include a schema and can therefore not be decoded without the reader knowing exactly which schema was used to write it.
This file format can be used via the to_avro_datum, from_avro_datum,
to_avro_datum_schemata, from_avro_datum_schemata,
from_avro_datum_reader_schemata, and
write_avro_datum_ref functions.
§Avro JSON
Not be confused with the schema definition which is also in JSON. This is the Avro data encoded in JSON.
It can be used via the From<serde_json::Value> for Value and
TryFrom<Value> for serde_json::Value implementations.
§Compression
For records with low entropy it can be useful to compress the encoded data. Using the Object Container File format this is directly possible in Avro. Avro supports various compression codecs:
- deflate
- bzip2
- Snappy
- XZ
- Zstandard
All readers are required to implement the deflate codec, but most implementations implement most
codecs.