- class avro.compatibility.SchemaCompatibilityType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
- class avro.compatibility.SchemaIncompatibilityType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
- class avro.compatibility.SchemaType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)¶
- avro.compatibility.merge(this: SchemaCompatibilityResult, that: SchemaCompatibilityResult) SchemaCompatibilityResult ¶
Merges two SchemaCompatibilityResult instances into a new one, combining their lists of incompatibilities and regressing to the SchemaCompatibilityType.incompatible state if any incompatibilities are encountered.
- Parameters:
this – SchemaCompatibilityResult
that – SchemaCompatibilityResult
- Returns:
SchemaCompatibilityResult
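The merge helper pairs naturally with the compatibility checker in this module. A minimal, hedged sketch follows; ReaderWriterCompatibilityChecker, its get_compatibility method, and the result's compatibility attribute are assumed names rather than entries documented above:
import avro.schema
from avro.compatibility import ReaderWriterCompatibilityChecker, merge

# Writer's schema and a reader's schema that adds a defaulted field (illustrative).
writer = avro.schema.parse('{"type": "record", "name": "R", "fields": [{"name": "a", "type": "int"}]}')
reader = avro.schema.parse(
    '{"type": "record", "name": "R", "fields": ['
    '{"name": "a", "type": "int"}, {"name": "b", "type": "string", "default": ""}]}'
)

checker = ReaderWriterCompatibilityChecker()          # assumed helper class in avro.compatibility
backward = checker.get_compatibility(reader, writer)  # can the new reader read old data?
forward = checker.get_compatibility(writer, reader)   # can the old reader read new data?
full = merge(backward, forward)                       # incompatible if either direction fails
print(full.compatibility)                             # SchemaCompatibilityType member (assumed attribute name)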
Read/Write Avro File Object Containers.
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
- class avro.datafile.DataFileReader(reader: IO, datum_reader: DatumReader)¶
Read files written by DataFileWriter.
- close() None ¶
Close this reader.
- determine_file_length() int ¶
Get file length and leave file cursor where we found it.
- class avro.datafile.DataFileWriter(writer: IO, datum_writer: DatumWriter, writers_schema: Schema | None = None, codec: str = 'null')¶
- append(datum: object) None ¶
Append a datum to the file.
- close() None ¶
Close the file.
- flush() None ¶
Flush the current state of the file, including metadata.
- sync() int ¶
Return the current position as a value that may be passed to DataFileReader.seek(long). Forces the end of the current block, emitting a synchronization marker.
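Taken together, a typical write-then-read round trip with the container file classes looks like the following sketch (the schema and file name are illustrative, and the with-statement usage assumes context-manager support, which current releases provide):
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Illustrative record schema.
schema = avro.schema.parse(
    '{"type": "record", "name": "User", "fields": ['
    '{"name": "name", "type": "string"}, {"name": "age", "type": "int"}]}'
)

# Write two records into an object container file (codec defaults to "null").
with DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema, codec="deflate") as writer:
    writer.append({"name": "Alice", "age": 34})
    writer.append({"name": "Bob", "age": 29})

# Read them back; the reader recovers the writer's schema from the file header.
with DataFileReader(open("users.avro", "rb"), DatumReader()) as reader:
    for user in reader:
        print(user)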
- class avro.datafile.HeaderType¶
Support for inter-process calls.
- class avro.ipc.BaseRequestor(local_protocol, transceiver)¶
Base class for the client side of a protocol interaction.
- read_call_response(message_name, decoder)¶
- The format of a call response is:
response metadata, a map with values of type bytes
a one-byte error flag boolean, followed by either:
if the error flag is false, the message response, serialized per the message’s response schema; or
if the error flag is true, the error, serialized per the message’s error union schema.
- request(message_name, request_datum)¶
Writes a request message and reads a response or error message.
- write_call_request(message_name, request_datum, encoder)¶
- The format of a call request is:
request metadata, a map with values of type bytes
the message name, an Avro string, followed by
the message parameters. Parameters are serialized according to the message’s request declaration.
- class avro.ipc.FramedReader(reader)¶
Wrapper around a file-like object to read framed data.
- class avro.ipc.FramedWriter(writer)¶
Wrapper around a file-like object to write framed data.
- class avro.ipc.HTTPTransceiver(host, port, req_resource='/')¶
A simple HTTP-based transceiver implementation. Useful for clients but not for servers.
- class avro.ipc.Requestor(local_protocol, transceiver)¶
- class avro.ipc.Responder(local_protocol)¶
Base class for the server side of a protocol interaction.
- invoke(local_message, request)¶
Actual work done by server: cf. handler in thrift.
- respond(call_request)¶
Called by a server to deserialize a request, compute and serialize a response or error. Compare to ‘handle()’ in Thrift.
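A hedged client-side sketch tying these pieces together; the protocol definition, host, port, and message name are illustrative, and a matching responder is assumed to be running on the other end:
import avro.ipc
import avro.protocol

# Illustrative echo protocol.
ECHO = avro.protocol.parse("""
{
  "protocol": "Echo",
  "namespace": "example",
  "messages": {
    "echo": {
      "request": [{"name": "text", "type": "string"}],
      "response": "string"
    }
  }
}
""")

client = avro.ipc.HTTPTransceiver("localhost", 8080)
requestor = avro.ipc.Requestor(ECHO, client)
reply = requestor.request("echo", {"text": "hello"})  # writes the request, reads the response or error
print(reply)
client.close()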
Protocol implementation.
https://avro.apache.org/docs/current/spec.html#Protocol+Declaration
- class avro.protocol.Message(name: str, request: Sequence[Mapping[str, object]], response: str | object, errors: Sequence[str] | None = None, names: Names | None = None, validate_names: bool = True)¶
A message has attributes:
a doc, an optional description of the message,
a request, a list of named, typed parameter schemas (this has the same form as the fields of a record declaration);
a response schema;
an optional union of declared error schemas. The effective union has “string” prepended to the declared union, to permit transmission of undeclared “system” errors. For example, if the declared error union is [“AccessError”], then the effective union is [“string”, “AccessError”]. When no errors are declared, the effective error union is [“string”]. Errors are serialized using the effective union; however, a protocol’s JSON declaration contains only the declared union.
an optional one-way boolean parameter.
A request parameter list is processed equivalently to an anonymous record. Since record field lists may vary between reader and writer, request parameters may also differ between the caller and responder, and such differences are resolved in the same manner as record field differences.
The one-way parameter may only be true when the response type is “null” and no errors are listed.
- class avro.protocol.MessageObject¶
- class avro.protocol.Protocol(name: str, namespace: str | None = None, types: Sequence[str] | None = None, messages: Mapping[str, MessageObject] | None = None, validate_names: bool = True)¶
Avro protocols describe RPC interfaces. Like schemas, they are defined with JSON text.
A protocol is a JSON object with the following attributes:
protocol, a string, the name of the protocol (required);
namespace, an optional string that qualifies the name;
doc, an optional string describing this protocol;
types, an optional list of definitions of named types (records, enums, fixed and errors). An error definition is just like a record definition except it uses “error” instead of “record”. Note that forward references to named types are not permitted.
messages, an optional JSON object whose keys are message names and whose values are objects whose attributes are described below. No two messages may have the same name.
The name and namespace qualification rules defined for schema objects apply to protocols as well.
- class avro.protocol.ProtocolObject¶
- avro.protocol.make_avpr_object(json_data: ProtocolObject, validate_names: bool = True) Protocol ¶
Build Avro Protocol from data parsed out of JSON string.
- avro.protocol.parse(json_string: str, validate_names: bool = True) Protocol ¶
Constructs the Protocol from the JSON text.
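A hedged sketch of a declaration exercising these attributes, including a declared error type (all names are illustrative); per the Message documentation above, the effective error union for add would be [“string”, “CalcError”]:
import avro.protocol

CALC = avro.protocol.parse("""
{
  "protocol": "Calculator",
  "namespace": "example.calc",
  "types": [
    {"type": "error", "name": "CalcError",
     "fields": [{"name": "message", "type": "string"}]}
  ],
  "messages": {
    "add": {
      "request": [{"name": "x", "type": "int"}, {"name": "y", "type": "int"}],
      "response": "int",
      "errors": ["CalcError"]
    }
  }
}
""")

print(CALC.name)            # "Calculator"
print(list(CALC.messages))  # ["add"]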
Contains Codecs for Python Avro.
Note that the word “codecs” means “compression/decompression algorithms” in the Avro world (https://avro.apache.org/docs/current/spec.html#Object+Container+Files), so don’t confuse it with Python’s codecs package, which is mainly for converting charsets (https://docs.python.org/3/library/codecs.html).
- class avro.codecs.BZip2Codec¶
- static compress(data: bytes) Tuple[bytes, int] ¶
Compress the passed data.
- Parameters:
data (bytes) – a byte string to be compressed
- Return type:
tuple
- Returns:
compressed data and its length
- static decompress(readers_decoder: BinaryDecoder) BinaryDecoder ¶
Read compressed data via the passed BinaryDecoder and decompress it.
- Parameters:
readers_decoder (avro.io.BinaryDecoder) – a BinaryDecoder object currently being used for reading an object container file
- Return type:
avro.io.BinaryDecoder
- Returns:
a newly instantiated BinaryDecoder object that contains the decompressed data, wrapped in a BytesIO
- class avro.codecs.Codec¶
Abstract base class for all Avro codec classes.
- abstract static compress(data: bytes) Tuple[bytes, int] ¶
Compress the passed data.
- Parameters:
data (bytes) – a byte string to be compressed
- Return type:
tuple
- Returns:
compressed data and its length
- abstract static decompress(readers_decoder: BinaryDecoder) BinaryDecoder ¶
Read compressed data via the passed BinaryDecoder and decompress it.
- Parameters:
readers_decoder (avro.io.BinaryDecoder) – a BinaryDecoder object currently being used for reading an object container file
- Return type:
avro.io.BinaryDecoder
- Returns:
a newly instantiated BinaryDecoder object that contains the decompressed data, wrapped in a BytesIO
- class avro.codecs.DeflateCodec¶
- static compress(data: bytes) Tuple[bytes, int] ¶
Compress the passed data.
- Parameters:
data (bytes) – a byte string to be compressed
- Return type:
tuple
- Returns:
compressed data and its length
- static decompress(readers_decoder: BinaryDecoder) BinaryDecoder ¶
Read compressed data via the passed BinaryDecoder and decompress it.
- Parameters:
readers_decoder (avro.io.BinaryDecoder) – a BinaryDecoder object currently being used for reading an object container file
- Return type:
avro.io.BinaryDecoder
- Returns:
a newly instantiated BinaryDecoder object that contains the decompressed data, wrapped in a BytesIO
- class avro.codecs.NullCodec¶
- static compress(data: bytes) Tuple[bytes, int] ¶
Compress the passed data.
- Parameters:
data (bytes) – a byte string to be compressed
- Return type:
tuple
- Returns:
compressed data and its length
- static decompress(readers_decoder: BinaryDecoder) BinaryDecoder ¶
Read compressed data via the passed BinaryDecoder and decompress it.
- Parameters:
readers_decoder (avro.io.BinaryDecoder) – a BinaryDecoder object currently being used for reading an object container file
- Return type:
avro.io.BinaryDecoder
- Returns:
a newly instantiated BinaryDecoder object that contains the decompressed data, wrapped in a BytesIO
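A hedged round trip through DeflateCodec; normally a codec is selected via DataFileWriter’s codec argument, and the length-prefixed framing below is an assumption about how decompress expects to find a block inside a container file:
import io
from avro.codecs import DeflateCodec
from avro.io import BinaryDecoder, BinaryEncoder

payload = b"some block of avro data"
compressed, length = DeflateCodec.compress(payload)  # (bytes, int)

# Frame the compressed block as (long length, data), the way it would appear in a
# container file, before handing it to decompress (assumed framing).
buf = io.BytesIO()
BinaryEncoder(buf).write_bytes(compressed)
buf.seek(0)

plain_decoder = DeflateCodec.decompress(BinaryDecoder(buf))
assert plain_decoder.read(len(payload)) == payload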
Contains Constants for Python Avro
Input/Output utilities, including:
i/o-specific constants
i/o-specific exceptions
schema validation
leaf value encoding and decoding
datum readers and writers
Also includes a generic representation for data, which uses the following mapping:
Schema records are implemented as dict.
Schema arrays are implemented as list.
Schema maps are implemented as dict.
Schema strings are implemented as str.
Schema bytes are implemented as bytes.
Schema ints are implemented as int.
Schema longs are implemented as int.
Schema floats are implemented as float.
Schema doubles are implemented as float.
Schema booleans are implemented as bool.
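For example, a record datum combining several of these types is simply a plain Python structure (illustrative values):
# Illustrative datum following the mapping above.
user_datum = {
    "name": "Ada",               # string  -> str
    "id": 2**40,                 # long    -> int
    "scores": [9.5, 8.0],        # array   -> list
    "tags": {"role": "admin"},   # map     -> dict
    "avatar": b"\x89PNG",        # bytes   -> bytes
    "active": True,              # boolean -> bool
}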
Validation:
The validation of a schema is performed using a breadth-first graph traversal. This allows validation exceptions to pinpoint the exact node within a complex schema that is problematic, simplifying debugging considerably. Because it is a traversal, it will also be less resource-intensive, particularly when validating schemas with deep structures.
Components¶
Nodes¶
Avro schemas contain many different schema types. Data about the schema types is used to validate the data in the corresponding part of a Python body (the object to be serialized). A node combines a given schema type with the corresponding Python data, as well as an optional “name” to identify the specific node. Names are generally the name of a schema (for named schemas) or the name of a field (for child nodes of schemas with named children, like maps and records), or None for schemas whose children are not named (like arrays).
Iterators¶
Iterators are generator functions that take a node and return a generator which will yield a node for each child datum in the data for the current node. If a node is of a type which has no children, then the default iterator will immediately exit.
Validators¶
Validators are used to determine if the datum for a given node is valid according to the given schema type. Validator functions take a node as an argument and return a node if the node datum passes validation. If it does not, the validator must return None.
In most cases, the node returned is identical to the node provided (is in fact the same object). However, in the case of Union schema, the returned “valid” node will hold the schema that is represented by the datum contained. This allows iteration over the child nodes in that datum, if there are any.
- class avro.io.BinaryDecoder(reader: IO[bytes])¶
Read leaf values.
- read(n: int) bytes ¶
Read n bytes.
- read_boolean() bool ¶
a boolean is written as a single byte whose value is either 0 (false) or 1 (true).
- read_bytes() bytes ¶
Bytes are encoded as a long followed by that many bytes of data.
- read_date_from_int() date ¶
int is decoded as python date object. int stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
- read_decimal_from_bytes(precision: int, scale: int) Decimal ¶
Decimal bytes are decoded as signed short, int or long depending on the size of bytes.
- read_decimal_from_fixed(precision: int, scale: int, size: int) Decimal ¶
Decimal is encoded as fixed. Fixed instances are encoded using the number of bytes declared in the schema.
- read_double() float ¶
A double is written as 8 bytes. The double is converted into a 64-bit integer using a method equivalent to Java’s doubleToRawLongBits and then encoded in little-endian format.
- read_float() float ¶
A float is written as 4 bytes. The float is converted into a 32-bit integer using a method equivalent to Java’s floatToRawIntBits and then encoded in little-endian format.
- read_int() int ¶
int and long values are written using variable-length, zig-zag coding.
- read_long() int ¶
int and long values are written using variable-length, zig-zag coding.
- read_null() None ¶
null is written as zero bytes
- read_time_micros_from_long() time ¶
long is decoded as python time object which represents the number of microseconds after midnight, 00:00:00.000000.
- read_time_millis_from_int() time ¶
int is decoded as python time object which represents the number of milliseconds after midnight, 00:00:00.000.
- read_timestamp_micros_from_long() datetime ¶
long is decoded as python datetime object which represents the number of microseconds from the unix epoch, 1 January 1970.
- read_timestamp_millis_from_long() datetime ¶
long is decoded as python datetime object which represents the number of milliseconds from the unix epoch, 1 January 1970.
- read_utf8() str ¶
A string is encoded as a long followed by that many bytes of UTF-8 encoded character data.
- class avro.io.BinaryEncoder(writer: IO[bytes])¶
Write leaf values.
- write(datum: bytes) None ¶
Write an arbitrary datum.
- write_boolean(datum: bool) None ¶
a boolean is written as a single byte whose value is either 0 (false) or 1 (true).
- write_bytes(datum: bytes) None ¶
Bytes are encoded as a long followed by that many bytes of data.
- write_date_int(datum: date) None ¶
Encode python date object as int. It stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
- write_decimal_bytes(datum: Decimal, scale: int) None ¶
Decimals in bytes are encoded as a long. Since the packed size in bytes of a signed long is 8, 8 bytes are written.
- write_decimal_fixed(datum: Decimal, scale: int, size: int) None ¶
Decimal in fixed are encoded as size of fixed bytes.
- write_double(datum: float) None ¶
A double is written as 8 bytes. The double is converted into a 64-bit integer using a method equivalent to Java’s doubleToRawLongBits and then encoded in little-endian format.
- write_float(datum: float) None ¶
A float is written as 4 bytes. The float is converted into a 32-bit integer using a method equivalent to Java’s floatToRawIntBits and then encoded in little-endian format.
- write_int(datum: int) None ¶
int and long values are written using variable-length, zig-zag coding.
- write_long(datum: int) None ¶
int and long values are written using variable-length, zig-zag coding.
- write_null(datum: None) None ¶
null is written as zero bytes
- write_time_micros_long(datum: time) None ¶
Encode python time object as long. It stores the number of microseconds from midnight, 00:00:00.000000
- write_time_millis_int(datum: time) None ¶
Encode python time object as int. It stores the number of milliseconds from midnight, 00:00:00.000
- write_timestamp_micros_long(datum: datetime) None ¶
Encode python datetime object as long. It stores the number of microseconds from midnight of unix epoch, 1 January 1970.
- write_timestamp_millis_long(datum: datetime) None ¶
Encode python datetime object as long. It stores the number of milliseconds from midnight of unix epoch, 1 January 1970.
- write_utf8(datum: str) None ¶
A string is encoded as a long followed by that many bytes of UTF-8 encoded character data.
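A small round trip of leaf values over an in-memory buffer:
import io
from avro.io import BinaryDecoder, BinaryEncoder

buf = io.BytesIO()
encoder = BinaryEncoder(buf)
encoder.write_long(42)        # variable-length zig-zag
encoder.write_utf8("héllo")   # long length + UTF-8 bytes
encoder.write_boolean(True)   # single byte

buf.seek(0)
decoder = BinaryDecoder(buf)
assert decoder.read_long() == 42
assert decoder.read_utf8() == "héllo"
assert decoder.read_boolean() is True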
- class avro.io.DatumReader(writers_schema: Schema | None = None, readers_schema: Schema | None = None)¶
Deserialize Avro-encoded data into a Python data structure.
- read_array(writers_schema: ArraySchema, readers_schema: ArraySchema, decoder: BinaryDecoder) List[object] ¶
Arrays are encoded as a series of blocks.
Each block consists of a long count value, followed by that many array items. A block with count zero indicates the end of the array. Each item is encoded per the array’s item schema.
If a block’s count is negative, then the count is followed immediately by a long block size, indicating the number of bytes in the block. The actual count in this case is the absolute value of the count written.
- read_enum(writers_schema: EnumSchema, readers_schema: EnumSchema, decoder: BinaryDecoder) str ¶
An enum is encoded by an int, representing the zero-based position of the symbol in the schema.
- read_fixed(writers_schema: FixedSchema, readers_schema: Schema, decoder: BinaryDecoder) bytes ¶
Fixed instances are encoded using the number of bytes declared in the schema.
- read_map(writers_schema: MapSchema, readers_schema: MapSchema, decoder: BinaryDecoder) Mapping[str, object] ¶
Maps are encoded as a series of blocks.
Each block consists of a long count value, followed by that many key/value pairs. A block with count zero indicates the end of the map. Each item is encoded per the map’s value schema.
If a block’s count is negative, then the count is followed immediately by a long block size, indicating the number of bytes in the block. The actual count in this case is the absolute value of the count written.
- read_record(writers_schema: RecordSchema, readers_schema: RecordSchema, decoder: BinaryDecoder) Mapping[str, object] ¶
A record is encoded by encoding the values of its fields in the order that they are declared. In other words, a record is encoded as just the concatenation of the encodings of its fields. Field values are encoded per their schema.
- Schema Resolution:
the ordering of fields may be different: fields are matched by name.
schemas for fields with the same name in both records are resolved recursively.
if the writer’s record contains a field with a name not present in the reader’s record, the writer’s value for that field is ignored.
if the reader’s record schema has a field that contains a default value, and writer’s schema does not have a field with the same name, then the reader should use the default value from its field.
if the reader’s record schema has a field with no default value, and writer’s schema does not have a field with the same name, then the field’s value is unset.
- read_union(writers_schema: UnionSchema, readers_schema: UnionSchema, decoder: BinaryDecoder) object ¶
A union is encoded by first writing an int value indicating the zero-based position within the union of the schema of its value. The value is then encoded per the indicated schema within the union.
- class avro.io.DatumWriter(writers_schema: Schema | None = None)¶
DatumWriter for generic python objects.
- write_array(writers_schema: ArraySchema, datum: Sequence[object], encoder: BinaryEncoder) None ¶
Arrays are encoded as a series of blocks.
Each block consists of a long count value, followed by that many array items. A block with count zero indicates the end of the array. Each item is encoded per the array’s item schema.
If a block’s count is negative, then the count is followed immediately by a long block size, indicating the number of bytes in the block. The actual count in this case is the absolute value of the count written.
- write_enum(writers_schema: EnumSchema, datum: str, encoder: BinaryEncoder) None ¶
An enum is encoded by an int, representing the zero-based position of the symbol in the schema.
- write_fixed(writers_schema: FixedSchema, datum: bytes, encoder: BinaryEncoder) None ¶
Fixed instances are encoded using the number of bytes declared in the schema.
- write_map(writers_schema: MapSchema, datum: Mapping[str, object], encoder: BinaryEncoder) None ¶
Maps are encoded as a series of blocks.
Each block consists of a long count value, followed by that many key/value pairs. A block with count zero indicates the end of the map. Each item is encoded per the map’s value schema.
If a block’s count is negative, then the count is followed immediately by a long block size, indicating the number of bytes in the block. The actual count in this case is the absolute value of the count written.
- write_record(writers_schema: RecordSchema, datum: Mapping[str, object], encoder: BinaryEncoder) None ¶
A record is encoded by encoding the values of its fields in the order that they are declared. In other words, a record is encoded as just the concatenation of the encodings of its fields. Field values are encoded per their schema.
- write_union(writers_schema: UnionSchema, datum: object, encoder: BinaryEncoder) None ¶
A union is encoded by first writing an int value indicating the zero-based position within the union of the schema of its value. The value is then encoded per the indicated schema within the union.
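Putting the datum classes together with the leaf-value encoder and decoder (the schema is illustrative; here the writer’s and reader’s schemas are identical, otherwise the schema-resolution rules above apply):
import io
import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

schema = avro.schema.parse(
    '{"type": "record", "name": "Pair", "fields": ['
    '{"name": "key", "type": "string"}, {"name": "value", "type": "long"}]}'
)

# Serialize one record with DatumWriter, then deserialize it with DatumReader.
buf = io.BytesIO()
DatumWriter(schema).write({"key": "answer", "value": 42}, BinaryEncoder(buf))

buf.seek(0)
datum = DatumReader(schema, schema).read(BinaryDecoder(buf))
assert datum == {"key": "answer", "value": 42}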
- class avro.io.ValidationNode(schema, datum, name)¶
- datum¶
Alias for field number 1
- name¶
Alias for field number 2
- schema¶
Alias for field number 0
- avro.io.validate(expected_schema: Schema, datum: object, raise_on_error: bool = False) bool ¶
Return True if the provided datum is valid for the expected schema
If raise_on_error is passed and True, then raise a validation error with specific information about the error encountered in validation.
- Parameters:
expected_schema – An avro schema type object representing the schema against which the datum will be validated.
datum – The datum to be validated, a Python dictionary or some other supported type
raise_on_error – True if an AvroTypeException should be raised immediately when a validation problem is encountered.
- Raises:
AvroTypeException if datum is invalid and raise_on_error is True
- Returns:
True if datum is valid for expected_schema, False if not.
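For instance (illustrative schema):
import avro.schema
from avro.io import validate

schema = avro.schema.parse(
    '{"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}'
)

assert validate(schema, {"name": "Ada"}) is True
assert validate(schema, {"name": 123}) is False

# With raise_on_error=True, the AvroTypeException pinpoints the failing node.
try:
    validate(schema, {"name": 123}, raise_on_error=True)
except Exception as err:  # avro.errors.AvroTypeException
    print(err)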
- class avro.tether.HTTPRequestor(server, port, protocol)¶
This is a small requestor subclass I created for the HTTP protocol. Since the HTTP protocol isn’t persistent, we need to instantiate a new transceiver and new requestor for each request. But I wanted the use of the requestor to be identical to that for the SocketTransceiver so that we can seamlessly switch between the two.
- class avro.tether.TaskRunner(task)¶
This class ties together the server handling the requests from the parent process and the instance of TetherTask which actually implements the logic for the mapper and reducer phases
- close()¶
Handler for the close message
- start(outputport=None, join=True)¶
Start the server
Parameters¶
- outputport - (optional) The port on which the parent process is listening for requests from the task.
This will typically be supplied by an environment variable; we allow it to be supplied as an argument mainly for debugging.
- join - (optional) If set to false, we don’t issue a join to block until the thread executing the server terminates.
This is mainly for debugging. By setting it to false, we can resume execution in this thread so that we can do additional testing.
- class avro.tether.TetherTask(inschema, midschema, outschema)¶
Base class for python tether mapreduce programs.
ToDo: Currently the subclass has to implement both reduce and reduceFlush. This is not very pythonic. A pythonic way to implement the reducer would be to pass the reducer a generator (as dumbo does) so that the user could iterate over the records for the given key. How would we do this? I think we would need two threads: one thread would run the user’s reduce function. This loop would be suspended when no reducer records were available. The other thread would read in the records for the reducer. This thread should only buffer so many records at a time (i.e. if the buffer is full, self.input shouldn’t return right away but wait for space to free up).
- complete()¶
Process the complete request
- configure(taskType, inSchemaText, outSchemaText)¶
Parameters¶
- taskType - What type of task (e.g. map, reduce)
This is an enumeration which is specified in the input protocol
- inSchemaText - string containing the input schema
This is the actual schema with which the data was encoded, i.e. it is the writer_schema (see https://avro.apache.org/docs/current/spec.html#Schema+Resolution). This is the schema the parent process is using, which might be different from the one provided by the subclass of tether_task.
- outSchemaText - string containing the output schema
This is the schema expected by the parent process for the output
- count(group, name, amount)¶
Called to increment a counter
- fail(message)¶
Call to fail the task.
- input(data, count)¶
Receive input from the server
Parameters¶
- data - Should contain the bytes encoding the serialized data
I think this gets represented as a string
- count - how many input records are provided in the binary stream
- abstract map(record, collector)¶
Called with input values to generate intermediate values (i.e. mapper output).
Parameters¶
- record - The input record
- collector - The collector to collect the output
This is an abstract function which should be overloaded by the application specific subclass.
- open(inputport, clientPort=None)¶
Open the output client - i.e. the connection to the parent process
Parameters¶
- inputport - This is the port that the subprocess is listening on, i.e. the subprocess starts a server listening on this port to accept requests from the parent process.
- clientPort - The port on which the server in the parent process is listening.
If this is None we look for the environment variable AVRO_TETHER_OUTPUT_PORT.
This is mainly provided for debugging purposes. In practice we want to use the environment variable.
- property partitions¶
Return the number of map output partitions of this job.
- abstract reduce(record, collector)¶
Called with input values to generate reducer output. Inputs are sorted by the mapper key.
The reduce function is invoked once for each value belonging to a given key outputted by the mapper.
Parameters¶
- record - The mapper output
- collector - The collector to collect the output
This is an abstract function which should be overloaded by the application specific subclass.
- abstract reduceFlush(record, collector)¶
Called with the last intermediate value in each equivalence run. In other words, reduceFlush is invoked once for each key produced in the reduce phase. It is called after reduce has been invoked on each value for the given key.
Parameters¶
record - the last record on which reduce was invoked.
- status(message)¶
Called to update task status
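A heavily hedged word-count style sketch of a subclass; the schema strings and the collector’s collect method are assumptions drawn from the descriptions above, not verified signatures (the tether examples in the Avro source tree show the working pattern):
from avro.tether import TetherTask

# Illustrative schemas: input lines of text, intermediate/output (word, count) pairs.
IN_SCHEMA = '"string"'
MID_SCHEMA = (
    '{"type": "record", "name": "Pair", "fields": ['
    '{"name": "key", "type": "string"}, {"name": "value", "type": "long", "order": "ignore"}]}'
)

class WordCountTask(TetherTask):
    def __init__(self):
        super().__init__(IN_SCHEMA, MID_SCHEMA, MID_SCHEMA)
        self._count = 0

    def map(self, record, collector):
        for word in record.split():
            collector.collect({"key": word, "value": 1})  # collect() is an assumed collector method

    def reduce(self, record, collector):
        self._count += record["value"]  # values arrive grouped and sorted by key

    def reduceFlush(self, record, collector):
        collector.collect({"key": record["key"], "value": self._count})
        self._count = 0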
- avro.tether.find_port() int ¶
Return an unbound port
- class avro.tether.tether_task_runner.TaskRunner(task)¶
This class ties together the server handling the requests from the parent process and the instance of TetherTask which actually implements the logic for the mapper and reducer phases
- close()¶
Handler for the close message
- start(outputport=None, join=True)¶
Start the server
Parameters¶
- outputport - (optional) The port on which the parent process is listening for requests from the task.
This will typically be supplied by an environment variable; we allow it to be supplied as an argument mainly for debugging.
- join - (optional) If set to false, we don’t issue a join to block until the thread executing the server terminates.
This is mainly for debugging. By setting it to false, we can resume execution in this thread so that we can do additional testing.
- avro.tether.util.find_port() int ¶
Return an unbound port
- class avro.tether.tether_task.HTTPRequestor(server, port, protocol)¶
This is a small requestor subclass I created for the HTTP protocol. Since the HTTP protocol isn’t persistent, we need to instantiate a new transceiver and new requestor for each request. But I wanted the use of the requestor to be identical to that for the SocketTransceiver so that we can seamlessly switch between the two.
- class avro.tether.tether_task.TetherTask(inschema, midschema, outschema)¶
Base class for python tether mapreduce programs.
ToDo: Currently the subclass has to implement both reduce and reduceFlush. This is not very pythonic. A pythonic way to implement the reducer would be to pass the reducer a generator (as dumbo does) so that the user could iterate over the records for the given key. How would we do this? I think we would need two threads: one thread would run the user’s reduce function. This loop would be suspended when no reducer records were available. The other thread would read in the records for the reducer. This thread should only buffer so many records at a time (i.e. if the buffer is full, self.input shouldn’t return right away but wait for space to free up).
- complete()¶
Process the complete request
- configure(taskType, inSchemaText, outSchemaText)¶
Parameters¶
- taskType - What type of task (e.g. map, reduce)
This is an enumeration which is specified in the input protocol
- inSchemaText - string containing the input schema
This is the actual schema with which the data was encoded, i.e. it is the writer_schema (see https://avro.apache.org/docs/current/spec.html#Schema+Resolution). This is the schema the parent process is using, which might be different from the one provided by the subclass of tether_task.
- outSchemaText - string containing the output schema
This is the schema expected by the parent process for the output
- count(group, name, amount)¶
Called to increment a counter
- fail(message)¶
Call to fail the task.
- input(data, count)¶
Receive input from the server
Parameters¶
- data - Should contain the bytes encoding the serialized data
I think this gets represented as a string
- count - how many input records are provided in the binary stream
- abstract map(record, collector)¶
Called with input values to generate intermediate values (i.e. mapper output).
Parameters¶
- record - The input record
- collector - The collector to collect the output
This is an abstract function which should be overloaded by the application specific subclass.
- open(inputport, clientPort=None)¶
Open the output client - i.e. the connection to the parent process
Parameters¶
- inputport - This is the port that the subprocess is listening on, i.e. the subprocess starts a server listening on this port to accept requests from the parent process.
- clientPort - The port on which the server in the parent process is listening.
If this is None we look for the environment variable AVRO_TETHER_OUTPUT_PORT.
This is mainly provided for debugging purposes. In practice we want to use the environment variable.
- property partitions¶
Return the number of map output partitions of this job.
- abstract reduce(record, collector)¶
Called with input values to generate reducer output. Inputs are sorted by the mapper key.
The reduce function is invoked once for each value belonging to a given key outputted by the mapper.
Parameters¶
- record - The mapper output
- collector - The collector to collect the output
This is an abstract function which should be overloaded by the application specific subclass.
- abstract reduceFlush(record, collector)¶
Called with the last intermediate value in each equivalence run. In other words, reduceFlush is invoked once for each key produced in the reduce phase. It is called after reduce has been invoked on each value for the given key.
Parameters¶
record - the last record on which reduce was invoked.
- status(message)¶
Called to update task status
Arbitrary utilities and polyfills.
- avro.utils.TypedDict(typename, fields=None, /, *, total=True, **kwargs)¶
A simple typed namespace. At runtime it is equivalent to a plain dict.
TypedDict creates a dictionary type such that a type checker will expect all instances to have a certain set of keys, where each key is associated with a value of a consistent type. This expectation is not checked at runtime.
Usage:
>>> class Point2D(TypedDict):
...     x: int
...     y: int
...     label: str
...
>>> a: Point2D = {'x': 1, 'y': 2, 'label': 'good'}  # OK
>>> b: Point2D = {'z': 3, 'label': 'bad'}  # Fails type check
>>> Point2D(x=1, y=2, label='first') == dict(x=1, y=2, label='first')
True
The type info can be accessed via the Point2D.__annotations__ dict, and the Point2D.__required_keys__ and Point2D.__optional_keys__ frozensets. TypedDict supports an additional equivalent form:
Point2D = TypedDict('Point2D', {'x': int, 'y': int, 'label': str})
By default, all keys must be present in a TypedDict. It is possible to override this by specifying totality:
class Point2D(TypedDict, total=False):
    x: int
    y: int
This means that a Point2D TypedDict can have any of the keys omitted. A type checker is only expected to support a literal False or True as the value of the total argument. True is the default, and makes all items defined in the class body be required.
The Required and NotRequired special forms can also be used to mark individual keys as being required or not required:
class Point2D(TypedDict):
    x: int  # the "x" key must always be present (Required is the default)
    y: NotRequired[int]  # the "y" key can be omitted
See PEP 655 for more details on Required and NotRequired.
- avro.utils.randbytes(n)¶
Generate n random bytes.
- exception avro.errors.AvroException¶
The base class for exceptions in avro.
- exception avro.errors.AvroOutOfScaleException(*args)¶
Raised when attempting to write a decimal datum with an exponent too large for the decimal schema.
- exception avro.errors.AvroRemoteException¶
Raised when an error message is sent by an Avro requestor or responder.
- exception avro.errors.AvroRuntimeException¶
Raised when compatibility parsing encounters an unknown type
- exception avro.errors.AvroTypeException(*args)¶
Raised when datum is not an example of schema.
- exception avro.errors.AvroWarning¶
Base class for warnings.
- exception avro.errors.ConnectionClosedException¶
Raised when attempting IPC on a closed connection.
- exception avro.errors.DataFileException¶
Raised when there’s a problem reading or writing file object containers.
- exception avro.errors.IONotReadyException¶
Raised when attempting an avro operation on an io object that isn’t fully initialized.
- exception avro.errors.IgnoredLogicalType¶
Warnings for unknown or invalid logical types.
- exception avro.errors.InvalidAvroBinaryEncoding¶
For invalid numbers of bytes read.
- exception avro.errors.InvalidDefault¶
User attempted to parse a schema with an invalid default.
- exception avro.errors.InvalidDefaultException(*args)¶
Raised when a default value isn’t a suitable type for the schema.
- exception avro.errors.InvalidName¶
User attempted to parse a schema with an invalid name.
- exception avro.errors.ProtocolParseException¶
Raised when a protocol failed to parse.
- exception avro.errors.SchemaParseException¶
Raised when a schema failed to parse.
- exception avro.errors.SchemaResolutionException(fail_msg, writers_schema=None, readers_schema=None, *args)¶
- exception avro.errors.UnknownFingerprintAlgorithmException¶
Raised when attempting to generate a fingerprint with an unknown algorithm
- exception avro.errors.UnsupportedCodec¶
Raised when the compression named cannot be used.
- exception avro.errors.UsageError¶
An exception raised when incorrect arguments were passed.
Contains the Name classes.
- class avro.name.Name(name_attr: str | None = None, space_attr: str | None = None, default_space: str | None = None, validate_name: bool = True)¶
Class to describe Avro name.
- property space: str | None¶
Back out a namespace from full name.
- class avro.name.Names(default_namespace: str | None = None, validate_names: bool = True)¶
Track name set and default namespace during parsing.
- add_name(name_attr: str, space_attr: str | None, new_schema: NamedSchema) Name ¶
Add a new schema object to the name set.
@arg name_attr: name value read in schema @arg space_attr: namespace value read in schema.
@return: the Name that was just added.
- prune_namespace(properties: Dict[str, object]) Dict[str, object] ¶
Given a properties mapping, return it with the namespace removed if it matches our own default namespace.
- avro.name.validate_basename(basename: str) None ¶
Raise InvalidName if the given basename is not a valid name.
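For example:
from avro.errors import InvalidName
from avro.name import validate_basename

validate_basename("user_record")   # a valid Avro name; returns None
try:
    validate_basename("3illegal")  # names may not start with a digit
except InvalidName as err:
    print(err)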
Command-line tool
NOTE: The API for the command-line tool is experimental.
- class avro.tool.GenericHandler(request, client_address, server)¶
- class avro.tool.GenericResponder(proto, msg, datum)¶
- invoke(message, request) object ¶
Actual work done by server: cf. handler in thrift.
- class avro.timezones.TSTTzinfo¶
- dst(dt: datetime | None = None) timedelta ¶
datetime -> DST offset as timedelta positive east of UTC.
- tzname(dt: datetime | None = None) str ¶
datetime -> string name of time zone.
- utcoffset(dt: datetime | None = None) timedelta ¶
datetime -> timedelta showing offset from UTC, negative values indicating West of UTC
- class avro.timezones.UTCTzinfo¶
- dst(dt: datetime | None = None) timedelta ¶
datetime -> DST offset as timedelta positive east of UTC.
- tzname(dt: datetime | None = None) str ¶
datetime -> string name of time zone.
- utcoffset(dt: datetime | None = None) timedelta ¶
datetime -> timedelta showing offset from UTC, negative values indicating West of UTC
Command line utility for reading and writing Avro files.
- avro.__main__.print_json_pretty(row: object) None ¶
Pretty print JSON
Contains the Schema classes.
- A schema may be one of:
A record, mapping field names to field value data;
An error, equivalent to a record;
An enum, containing one of a small set of symbols;
An array of values, all of the same schema;
A map containing string/value pairs, each of a declared schema;
A union of other schemas;
A fixed sized binary object;
A unicode string;
A sequence of bytes;
A 32-bit signed int;
A 64-bit signed long;
A 32-bit floating-point float;
A 64-bit floating-point double;
A boolean;
or Null.
- class avro.schema.ArraySchema(items, names=None, other_props=None, validate_names: bool = True)¶
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None.
- class avro.schema.BytesDecimalSchema(precision, scale=0, other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a Decimal object, else None.
- class avro.schema.CanonicalPropertiesMixin¶
A Mixin that provides canonical properties to Schema and Field types.
- class avro.schema.DateSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid date object, else None.
- class avro.schema.EnumSchema(name: str, namespace: str, symbols: Sequence[str], names: Names | None = None, doc: str | None = None, other_props: Mapping[str, object] | None = None, validate_enum_symbols: bool = True, validate_names: bool = True)¶
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid member of this Enum, else None.
- class avro.schema.EqualByJsonMixin¶
A mixin that defines equality as equal if the json deserializations are equal.
- class avro.schema.EqualByPropsMixin¶
A mixin that defines equality as equal if the props are equal.
- class avro.schema.ErrorUnionSchema(schemas, names=None, validate_names: bool = True)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- class avro.schema.Field(type_, name, has_default, default=None, order=None, names=None, doc=None, other_props=None, validate_names: bool = True)¶
- class avro.schema.FixedDecimalSchema(size, name, precision, scale=0, namespace=None, names=None, other_props=None, validate_names: bool = True)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a Decimal object, else None.
- class avro.schema.FixedSchema(name, namespace, size, names=None, other_props=None, validate_names: bool = True)¶
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None.
- class avro.schema.MapSchema(values, names=None, other_props=None, validate_names: bool = True)¶
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None.
- class avro.schema.NamedSchema(type_: str, name: str, namespace: str | None = None, names: Names | None = None, other_props: Mapping[str, object] | None = None, validate_names: bool = True)¶
Named Schemas specified in NAMED_TYPES.
- class avro.schema.PrimitiveSchema(type, other_props=None)¶
Valid primitive types are in PRIMITIVE_TYPES.
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this type of primitive schema, else None
@arg datum: The data to be checked for validity according to this schema @return Schema object or None
- class avro.schema.PropertiesMixin¶
A mixin that provides basic properties.
- check_props(other: PropertiesMixin, props: Sequence[str]) bool ¶
Check that the given props are identical in two schemas.
@arg other: The other schema to check @arg props: An iterable of properties to check @return bool: True if all the properties match
- property other_props: Mapping[str, object]¶
Dictionary of non-reserved properties
- class avro.schema.RecordSchema(name, namespace, fields, names=None, schema_type='record', doc=None, other_props=None, validate_names: bool = True)¶
- static make_field_objects(field_data: Sequence[Mapping[str, object]], names: Names, validate_names: bool = True) Sequence[Field] ¶
We’re going to need to make message parameters too.
- match(writer)¶
Return True if the current schema (as reader) matches the other schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None
- class avro.schema.Schema(type_: str, other_props: Mapping[str, object] | None = None, validate_names: bool = True)¶
Base class for all Schema classes.
- fingerprint(algorithm='CRC-64-AVRO') bytes ¶
Generate fingerprint for supplied algorithm.
‘CRC-64-AVRO’ will be used as the algorithm by default, but any algorithm supported by hashlib (as can be referenced with hashlib.algorithms_guaranteed) can be specified.
The algorithm param is used as an algorithm name, and UnknownFingerprintAlgorithmException will be raised if the algorithm is not among those supported.
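For example:
import avro.schema

schema = avro.schema.parse('"int"')
print(schema.fingerprint().hex())          # default CRC-64-AVRO fingerprint, 8 bytes
print(schema.fingerprint("sha256").hex())  # any hashlib.algorithms_guaranteed name works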
- abstract match(writer: Schema) bool ¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the writer schema to match against. @return bool
- abstract to_canonical_json(names: Names | None = None) object ¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- abstract to_json(names: Names | None = None) object ¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- abstract validate(datum: object) Schema | None ¶
Returns the appropriate schema object if datum is valid for that schema, else None.
To be implemented in subclasses.
Validation concerns only shape and type of data in the top level of the current schema. In most cases, the returned schema object will be self. However, for UnionSchema objects, the returned Schema will be the first branch schema for which validation passes.
@arg datum: The data to be checked for validity according to this schema @return Optional[Schema]
- class avro.schema.TimeMicrosSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None.
- class avro.schema.TimeMillisSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this schema, else None.
- class avro.schema.TimestampMicrosSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this type of primitive schema, else None
@arg datum: The data to be checked for validity according to this schema @return Schema object or None
- class avro.schema.TimestampMillisSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this type of primitive schema, else None
@arg datum: The data to be checked for validity according to this schema @return Schema object or None
- class avro.schema.UUIDSchema(other_props=None)¶
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return self if datum is a valid representation of this type of primitive schema, else None
@arg datum: The data to be checked for validity according to this schema @return Schema object or None
- class avro.schema.UnionSchema(schemas, names=None, validate_names: bool = True)¶
names is a dictionary of schema objects
- match(writer)¶
Return True if the current schema (as reader) matches the writer schema.
@arg writer: the schema to match against @return bool
- to_canonical_json(names=None)¶
Converts the schema object into its Canonical Form http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
To be implemented in subclasses.
- to_json(names=None)¶
Converts the schema object into its AVRO specification representation.
Schema types that have names (records, enums, and fixed) must be aware of not re-defining schemas that are already listed in the parameter names.
- validate(datum)¶
Return the first branch schema of which datum is a valid example, else None.
- avro.schema.from_path(path: Path | str, validate_enum_symbols: bool = True, validate_names: bool = True) Schema ¶
Constructs the Schema from a path to an avsc (json) file.
- avro.schema.get_other_props(all_props: Mapping[str, object], reserved_props: Sequence[str]) Mapping[str, object] ¶
Retrieve the non-reserved properties from a dictionary of properties @arg reserved_props: The set of reserved properties to exclude
- avro.schema.make_avsc_object(json_data: object, names: Names | None = None, validate_enum_symbols: bool = True, validate_names: bool = True) Schema ¶
Build Avro Schema from data parsed out of JSON string.
@arg names: A Names object (tracks seen names and default space) @arg validate_enum_symbols: If False, will allow enum symbols that are not valid Avro names.
- avro.schema.make_bytes_decimal_schema(other_props)¶
Make a BytesDecimalSchema from just other_props.
- avro.schema.make_logical_schema(logical_type, type_, other_props)¶
Map the logical types to the appropriate literal type and schema class.
- avro.schema.parse(json_string: str, validate_enum_symbols: bool = True, validate_names: bool = True) Schema ¶
Constructs the Schema from the JSON text.
@arg json_string: The json string of the schema to parse
@arg validate_enum_symbols: If False, will allow enum symbols that are not valid Avro names.
@arg validate_names: If False, will allow names that are not valid Avro names. When disabling the validation, test the non-compliant names for non-compliant behavior, also in interoperability cases.
@return Schema
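A short usage sketch (the schema JSON is illustrative):
import json
import avro.schema

schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "LogLine",
    "namespace": "example.logs",
    "fields": [
        {"name": "level", "type": {"type": "enum", "name": "Level", "symbols": ["INFO", "WARN", "ERROR"]}},
        {"name": "message", "type": "string"},
    ],
}))

print(schema.name)                                                     # "LogLine"
print(schema.validate({"level": "INFO", "message": "ok"}) is schema)   # shallow top-level check, True
print(json.dumps(schema.to_json()))                                    # Avro specification representation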