Encoder.writeXyz
methods can be validated against a schema, and similarly sequences of Decoder.readXyz
methods can be validated against a schema.
The second usecase is using grammars to perform schema resolution. For this usecase, we've developed a subclass of Decoder
which takes two Avro schemas as input  a reader and a writer schema. This subclass accepts an input stream written according to the writer schema, and presents it to a client expecting the reader schema. If the writer writes a long, for example, where the reader expects a double, then the Decoder.readDouble
method will convert the writer's long into a double.
This document looks at grammars in the context of these two usecases. We first look at the singleschema case, then the doubleschema case. In the future, we believe the interpretation of Avro schemas as CFGs will find other uses (for example, to determine whether or not a schema admits finitesized values).
We parse a schema into a set of JSON objects. For each record, map, array, union schema inside this set, this parse is going to generate a unique identifier "n_{i}" (the "pointer" to the schema). By convention, n_{0} is the identifier for the "toplevel" schema (i.e., the schema we want to read or write). In addition, where n_{i} is a union, the parse will generate a unique identifier "b_{ij}" for each branch of the union.
A contextfree grammar (CFG) consists of a set of terminal symbols, a set of nonterminal symbols, a set of productions, and a start symbol. Here's how we interpret an Avro schema as a CFG:
Terminal symbols: The terminal symbols of the CFG consist of null
, bool
, int
, long
, float
, double
, string
, bytes
, enum
, fixed
, arraystart
, arrayend
, mapstart
, mapend
, and union
. In addition, we define the special terminals "1"
, "2"
, "3"
, ...
which designate the "tag" of a union (i.e., which branch of the union is actually being written or was found in the data).
Below, we use the variable P to represent any one of null
, bool
, int
, long
, double
, string
, bytes
(i.e., the "primitives").
Nonterminal symbols: The nonterminal symbols of the CFG consist of the identifiers n_{i}, u_{i}, r_{i}, e_{i}, f_{i} and r_{p} (there is a nonterminal r_{p} for each symbol in P).
Productions: The productions of the CFG are as follows:
Records: If n_{i} is a recordschema, then it defines the following production:
n_{i} ::= sym(f_{i1}) sym(f_{i2}) .. sym(f_{im})
where f_{ij} is field "j" of record n_{i}, and sym(f_{ij}) is the appropriate member of P if f_{ij} is a primitive type, or the appropriate n_{k} for some k if f_{ij} is a map, array, union, or record schema.
Arrays: If n_{i} is an array schema, then it defines the following productions:
n_{i} ::= arraystart
r_{i} arrayend
r_{i} ::= sym(n_{i}) r_{i}  ε
where "sym(n_{i})" is either some P, if this is an array of primitives, or the nonterminal associated with the schema of the elementtype of n_{k}.
Maps: If n_{i} is a map schema of element type P, then it defines the following production:
n_{i} ::= mapstart
r_{i} mapend
r_{i} ::= string
sym(n_{i}) r_{i}  ε
where "sym(n_{i})" is either some P, if the valuetype is a primitive, or the nonterminal associated with the schema of the valuetype of n_{k}.
Unions: If n_{i} is a union schema, then it defines the following productions:
n_{i} ::= union
u_{i}
u_{i} ::= 1 sym(b_{i1})  2 sym(b_{i2})  ...  j sym(b_{ij})
where the "1", "2", "3" are the tags for the union, and the b_{ij} is branch "j" of union "n_{i}", and sym(b_{ij}) is the appropriate member of P if b_{ij} is a primitive type, or the appropriate n_{k} if b_{ij} is a map, array, union, or record schema. (The introduction of the terminal symbol "UNION" plus the introduction of the additional nonterminal "u_{i}" is a convenience to our parsing implementation.)
Enum If n_{i} is an enum schema, then it defines the following production:
n_{i} ::= enum
e_{i}
e_{i} ::= ε
Here there is no real production for e_{i}. The symbol is used to associate some meta information such as the number of values in the enumeration.
Fixed If n_{i} is an fixed binary schema, then it defines the following production:
n_{i} ::= enum
f_{i}
f_{i} ::= ε
Here there is no real production for f_{i}. The symbol is used to associate some meta information such as the size of the fixed binary.
Start symbol: the starting symbol of the grammar is n_{0}.
This grammar defined by the above transformation is LL(1). (Proof: The only alternatives in these grammars are for the u_{i} ("union") symbols and the r_{i} ("repeating") symbols. For "union" the alternative productions correspond to each one of the branches of the union. Each alternative production for a union starts of a unique tagterminal, so by looking at the very first terminal one can decide which of the productions to select. In the case of the r_{k}, there are two alternative production, the second of which is ε. Since these only appear inside array
 or mapstart
/end
pairs, the arrayend
or mapend
symbol serves to predict that the ε should be selected and any other terminal symbol predicts that the other production should be selected.)
Here's an example. Consider the schema:
{ "type":"record", "name":"foo", "fields":[ {"name":"bar","type":"double"}, {"name":"baz","type":{"type":"array", "items":"string"}}, {"name":"zip", "type":{"type":"map", "values":["null",{"type":"array", "items":"bytes"},"foo"]}}, ] }This schema generates the following grammar:
n0 ::= double n1 n2 r1 ::= string r1  ε n1 ::= arraystart r1 arrayend r2 ::= string n3 r2  ε n2 ::= mapstart r2 mapend u3 ::= 1 null  2 n4  3 n0 n3 ::= union u3 r4 ::= bytes r4  ε n4 ::= arraystart r4 arrayendThe symbol "n0" is the startsymbol for this grammar.
So the idea is that you put your start symbol on the stack to initialize things. You pop that symbol off the stack, ask which production for S could've produced the input you're looking at, push that production back on the stack, and repeat. Let's fill in the details.
The parsing table for this parsing procedure is a function of two inputs and one output:
T: Nonterminal x Terminal > ProductionRemember, a "production" is a sequence of symbols  a mix of terminals and nonterminals  that derive a nonterminal in the grammar.
This function T
takes a a nonterminal, a terminal, and returns a production for the nonterminal. The nonterminal is the symbol you're trying to derive (from the top of the parsing stack); the terminal is the current symbol in your input stream (the lookahead symbol). If X
is the first input and a
, then the output is the unique production for X
that can produce the input symbol a
. (This function can also return the special result "Error" to indicate there is no such production, i.e., we can't parse the input.)
If you have such a table, then your parsing code looks like this:
parse(Table T, TokenStream in): Stack stack = new Stack(Table.startSymbol); for (Token t = in.next(); t != EOF; t = in.next()) advance(stack, T, t); advance(Stack stack, Table T, Token t): X = stack.pop(); while (! isTerminal(X)): if T(X,t) yields production Y_{1} Y_{2} ... Y_{n}): // push production in reverse order, so we leave looking for // the first symbol of the production stack.push(Y_{n}); ...; stack.push(Y_{2}); stack.push(Y_{1}); else, T(X,t) is undefined, so throw an error; X = stack.pop(); // Repeat until we find a terminal if X == t then return else throw an error;
T(A,y) = A ::= X_{1} ... X_{n}  if y is in Predict(A ::= X_{1} ... X_{n}) T(A,y) = Error  otherwisewhere
Predict(A ::= X_{1} ... X_{n})
returns the unique first symbol that predicts this particular production for A
.
But in our case, almost all productions have a single alternative. If a nonterminal symbol A
is on the top of the stack, then we don't even have to look at the input to figure out which production could derive A
because there's only one such production! Thus, we can define a special parsing table for Avroinduced grammars as follows:
T(A,y) = A ::= sym(f_{i1}) sym(f_{i2}) .. sym(f_{im})  if A is a record schema T(A,y) = A ::=Note that only the last three rules forarraystart
r_{i}arrayend
 if A is the nonterminal for an array schema T(A,y) = A ::=mapstart
r_{i}mapend
 if A is the nonterminal for an map schema T(A,y) = A ::=union
u_{i}  if A is a union schema T(A,y) = A ::= y sym(b_{ij})  if A is a u_{i} schema (note the "y" inside this production) T(A,y) = Error  if A is "A ::= k sym(b_{ij})" and "y" isn't  in any of the branches of the corresponding union T(A,y) = A ::= n_{i} r_{i}  if A is r_{i} and y is neitherarrayend
normapend
T(A,y) = A ::= ε  if A is r_{i} and y is eitherarrayend
ormapend
T(A,y)
consider the lookahead symbol (i.e., only the last three rules actually look at the value of y
). These are the rules for dealing with productions that have alternatives, i.e., the rules for unions (where there is an alternative for each branch) and the rules for repeaters (where there is one alternative for the "repeat" case and another alternative for the "end" case).
The nice thing about this alternative formulation of the parsing table is that we don't actually have to compute the predict set, which is not super complicated, but would be a pile of code to test and maintain.
It should be noted that the resulting parsing table catches errors in different states than the traditional LL(1) parsing table. For example, let's say our Schema is simply an array of ints, which induces the following grammar:
n_{0} ::=The traditional LL(1) table would be:arraystart
r_{int}arrayend
r_{int} ::= int r_{int}  ε
T(n_{0},while our parser table would be:arraystart
) = n_{0} ::=arraystart
r_{int}arrayend
T(r_{int},int) = r_{int} ::= int r_{int} T(r_{int},arrayend
) = ε T(A,y) = Error  if (A,y) is none of the above
T'(n_{0},y) = n_{0} ::=Note thatarraystart
r_{int}arrayend
 for all y T'(r_{int},y) = r_{int} ::= int r_{int}  for all y other thanarrayend
T'(r_{int},arrayend
) = ε
T
is defined as Error
for a lot of (A,y)
pairs, but T'
is defined as Error
for none of them. How can this be?
The difference is that T
catches many errors when terminals fail to appear in Predict sets, while T'
catches the errors when terminals fail to match corresponding terminals on the parser stack. For example, let's say r_{int}
is on the top of the parser stack, and the symbol double
is arrives (which means, in practice, that a writeDouble
call is encountered). In this case, a parser with the standard table will catch the error right away, because double
is not in the predictset for r_{int}
. A parser with our alternative table will first replace the r_{int}
on the stack with the sequence int r_{int}
(with int
on the top of the stack). It will then throw an error, because the input symbol double
does not match the nonterminal int
that's now on the top of the stack.
However, we believe that our modified parser will except exactly the same set of strings as the standard parser.
The first section ("The interpretation") informally describes the grammer generated by an Avro schema. This section provides a more formal description using a set of induction rules. The earlier description in section one is fine for describing how a single Avro schema generates a grammar. But soon we're going to describe how two schemas together define a "resolving" grammar, and for that description we'll need the more formal mechanism described here.
The terminal and nonterminal symbols in our grammar are as described in the first section. Our induction rules will define a function "C(S)=<G,a>", which takes an Avro schema "S" and returns a pair consisting of a set of productions "G" and a symbol "a". This symbol "a"  which is either a terminal, or a nonterminal defined by G  generates the values described by schema S.
The first rule applies to all Avro primitive types:
p in {null , boolean , int , long , double , string , bytes } 
C(p)=<{}, p> 
This first rule does not generate any productions, and simply returns the terminal symbol corresponding to the primitive types of the schema.
The next rule applies to record schemas:
 
C(S_{j})=<G_{j}, f_{j}>  
C(S)=<G_{1} ∪ ... ∪ G_{n} ∪ {a::=f_{1} f_{2} ... f_{n}}, a> 
In this case, the set of outputproductions consists of all the productions generated by the elementtypes of the record, plus a production that defines the nonterminal "a" to be the sequence of fieldtypes. We return "a" as the grammar symbol representing this recordschema.
Next, we define the rule for arrays:
S={"type":"array", "items":S_{e}} 
C(S_{e})=<G_{e},e> 
C(S)=<G_{e} ∪ {r ::= e r, r ::= ε, a ::= arraystart r arrayend }, a> 
For arrays, the set of output productions again contains all productions generated by the elementtype. In addition, we define two productions for "r", which represents the repetition of this element type. The first production is the recursive case, which consists of the elementtype followed by "r" all over again. The next case is the base case, which is the empty production. Having defined this repetition, we can then define "a" as this repetition bracketed by the terminal symbols arraystart
and arrayend
.
The rule for maps is almost identical to that for arrays:
S={"type":"map", "values":S_{e}} 
C(S_{e})=<G_{e},e> 
C(S)=<G_{e} ∪ {r ::= string e r, r ::= ε, a ::= mapstart r mapend }, a> 
The only difference from arrays is that mapelements consists of a string
together with an elementtype (vs. just an element type).
The rule for unions:
S=[S_{1}, S_{2}, ..., S_{n}] 
C(S_{j})=<G_{j}, b_{j}> 
C(S)=<G_{1} ∪ ... ∪ G_{n} ∪ {u::=1 b_{1}, u::=2 b_{2}, ..., u::=n b_{n}, a::=union u}, a> 
In this rule, we again accumulate productions (G_{j}) generated by each of the subschemas for each branch of the union. If there are "k" branches, we define "k" different productions for the nonterminal symbol "u", one for each branch in the union. These perbranch productions consist of the index of the branch (1 for the first branch, 2 for the second, and so forth), followed by the symbol representing the schema of that branch. With these productions for "u" defined, we can define "a" as simply the terminal symbol union
followed by this nonterminal "u".
The rule for fixed size binaries:
S={"type":"fixed", "name":a, "size":s} 
C(S)=<{a::=fixed f, f::=ε}, a> 
In this rule, we define a new nonterminal f which has associated size of the fixedbinary.
The rule for enums:
S={"type":"enum", "name":a, "symbols":["s1", "s2", "s3", ...]} 
C(S)=<{a::=enum e, e::=ε}, a> 
In this rule, we define a new nonterminal e which has associated range of values.
[long, string]
union, and the reader is expecting just a long
, an error is only reported when the writer sends a string rather than a long. Further, the Avro spec recommends that all errors be detected at readingtime, even if they could be detected earlier. Error actions support the deferral of errors.
These actions will become "action symbols" in our grammar. Action symbols are symbols that cause our parser to perform special activities when they appear on the top of the parsing stack. For example, when the skipaction makes it to the top of the stack, the parser will automatically skip the next value in the input stream. (Again, Fischer and LeBlanc has a nice description of action symbols.)
We're going to use induction rules to define a grammar. This time, our induction rules will define a twoargument function "C(W,R)=<G,a>", which takes two schema, the writer's and reader's schemas respectively. The results of this function are the same as they were for the singleschema case.
The first rule applies to all Avro primitive types:
p in {null , boolean , int , long , double , string , bytes } 
C(p,p)=<{}, p> 
In this case, the writer and reader schemas agree, so the resulting grammar should just expect the agreedupon primitive type.
The next rule deals with resolution of primitive types:
w in {int , long , double } 
r in {long , double } 
w != r 
C(w,r)=<{}, ResolverAction(w,r)> 
When this parameterized action is encountered, the parser will resolve the writer's value into the reader's expectedtype for that value. In the parsing loop, when we encounter this symbol, we use the "r" parameter of this symbol to check that the reader is asking for the right type of value, and we use the "w" parameter to figure out how to parse the data in the input stream.
One final possibility for primitive types is that they are incompatible types:
The w,r pair does not fit the previous two rules, AND neither 
of the pair is a union, AND the pair aren't both compounds 
of the same type (i.e., two arrays, two records, or two maps) 
C(w,r)=<{}, ErrorAction> 
When this parameterized action is encountered, the parser will throw an error. Keep in mind that this symbol might be generated in the middle of a recursive call to "G." For example, if the reader's schema is long, and the writer's is [long, string], we'll generate an error symbol for the stringbranch of the union; if this branch is occurred in actual input, an error will then be generated.
The next rule deals with resolution of fixed size binaries:
w = {"type":"fixed", "name":"n1", "size":s1} 
r = {"type":"fixed", "name":"n2", "size":s2} 
n1 != n2 or s1 != s2 
C(w,r)=<{}, ErrorAction> 
w = {"type":"fixed", "name":"n1", "size":s1} 
r = {"type":"fixed", "name":"n2", "size":s2} 
n1 == n2 and s1 == s2 
C(w,r)=<{ a::=fixed f, f::=ε}, a> 
The next rule deals with resolution of enums:
w = {"type":"enum", "symbols":[sw_{1}, sw_{2}, ..., sw_{m}] } 
r = {"type":"enum", "symbols":[sr_{1}, sr_{2}, ..., sr_{n}] } 
f_{i} = EnumAction(i, j) if sw_{i} == sr_{j} 
f_{i} = ErrorAction if sw_{i} does not match any sr_{j} 
C(w,r)=<{ a::=enum e, e::=ε}, a> 
Now that we have rules for primitive types, we can define rules for compound types. First, let's look at records:




{F_{1}, ..., F_{m}} is a subset of {E_{1}, ..., E_{n}}  
C(S_{j}, T_{i}) = <G_{j}, f_{j}>  for all E_{j}=F_{i}  
C(S_{j}) = <G_{j}, f_{j}>  for all E_{j} not in {F_{1}, ..., F_{m}}  


C(W,R)=<G_{1} ∪ G_{2} ∪ ... ∪ G_{n} ∪ { w::=f'_{1} f'_{2} ... f'_{n} }, w> 
The substance of this rule lies in the definion of the "f'_{j}". If the writer's field F_{j} is not a member of the reader's schema, then a skipaction is generated, which will cause the parser to automatically skip over the field without the reader knowing. (In this case, note that we use the singleargument version of "C", i.e., the version defined in the previous section!) If the writer's field F_{j} is a member f the reader's schema, then "f'_{j}" is a twosymbol sequence: the first symbol is a (parameterized) fieldaction which is used to tell the reader which of its own fields is coming next, followed by the symbol for parsing the value written by the writer.
The above rule for records works only when the reader and writer have the same name, and the reader's fields are subset of the writer's. In other cases, an error is producted.
The rule for arrays is straightforward:
W={"type":"array", "items":S_{w}} 
R={"type":"array", "items":S_{r}} 
C(S_{w}, S_{r})=<G_{e},e> 
C(W,R)=<G_{e} ∪ {r ::= e r, r ::= ε, a ::= arraystart r arrayend}, a> 
Here the rule is largely the same as for the singleschema case, although the recursive use of G may result in productions that are very different. The rule for maps changes in a similarlysmall way, so we don't bother to detail that case in this document.
The final rules are for unions. Let's first look at the case where the writer is a union but the reader is not:
W=[S_{1}, ..., S_{n}] 
R is not a union schema 
C(S_{j},R)=<G_{j}, b_{j}> 
C(R,W)=<G_{1} ∪ G_{2} ∪ ... ∪ G_{n} ∪ {a::=WriterUnionAction(b_{1}, b_{2}, ..., b_{n})}, a> 
Here, a writerunion action is generated that looks much like a union did in the singleschema case. However, unlike in that case, the writerunion action will cause the parser to automatically interpret the writer's union value.
Now let's look when the reader expects a union. The first of these cases is an error case:
W is not a union schema 
R=[R_{1}, ..., R_{n}] 
W does not resolve to any of the branches of R 
C(W,R)=<{}, ErrorAction> 
In this case, there's no way to resolve the two schemas, so we generate an error action to remind us of this fact at runtime. (Again, this error action might be under a branch of a containing union, and thus might never be triggered at runtime, so it wouldn't be correct to signal an error at "compile" time.)
Here's the nonerror case:
W is not a union schema 
R=[R_{1}, ..., R_{n}] 
Branch "j" of R is the best match for W 
C(W,R_{j})=<G,w> 
C(W,R)=<G, ReaderUnionAction(j,w)> 
In this case, we can decide at "compile time" which of the branches of the reader will be the best match for the value that's going to be written by the writer. We then generate a reader union action, which tells the parser first, which branchnumber of the reader's we should report to the schema, and then second which symbol to use to parse the writer's actual value.
The interesting case is when the writer's and reader's schemas are both unions:
W=[W_{1}, ..., W_{n}] 
R=[R_{1}, ..., R_{m}] 
C(W_{j}, R)=<G_{j}, b_{j}> 
C(W,R)=<G_{1} ∪ ... ∪ G_{n} ∪ {u::=1 b_{1}, u::=2 b_{2}, ..., u::=n b_{n}, a::=union u}, a> 
Note that in the inductive case ("C(W_{j}, R)"), each branch of the writer ("W_{j}") is compared to the entire union of the reader ("R"). Thus, one of the two previous cases (the error case or the readerunion case) gets generated for each branch of the writer's union.
Symbol advance(Stack stack, Table T, Symbol t, TokenStream in): Symbol X = stack.pop(); while (! isTerminal(X)): case X: FieldAction: // In this case, the main parsing loop can "ask" for the // field information by passing a FieldAction symbol as // "t". If it does, it'll get the (parameterized) symbol // from the parsing table. If it doesn't ask for this // information, then the information will be ignored. if (isFieldAction(t)) return X; SkipAction(productionToSkip): // In this case we automatically skip the production we've // been asked to skip in.skip(productionToSkip); WriterUnionAction(b_1, b_2, ..., b_n): // In this case, we read from the token inputstream to // determine the actual branch witten by the writer. // We then push this branch on the parsing stack, to tell // the parser what type of value to look for int i = in.readIndex(); stack.push(b_i); NonTerminal: if T(X,t) yields production Y_{1} Y_{2} ... Y_{n}): // push production in reverse order, so we leave looking for // the first symbol of the production stack.push(Y_{n}); ...; stack.push(Y_{2}); stack.push(Y_{1}); else, T(X,t) is undefined, so throw an error; X = stack.pop(); // We've left the loop, so X is a terminal symbol: case X: ResolvingTable(w,r): // If reader is looking for an "r", then the reader's // looking for the right thing according to the reader's // schema, but return the type actually written so the // proper conversion can happen. if (r == t) return w; ReaderUnionAction(index,writerSym): // Readerunion actions are allowed where the reader // is expecting a union. In this case, we return the // (parameterized!) readerunionaction symbol and // the code above figures out what to do if (t == union) return X; ErrorAction: throw the deferred error; // Fallthrough case: if (X == t) then return X else throw an error