Avro schemas as LL(1) CFG definitions

This document shows how an Avro schema can be interpreted as the definition of a context-free grammar in LL(1). We use such an interpretation for two use-cases. In the first use-case, we use these grammars to validate readers and writers of data against a single Avro schema. Specifically, sequences of Encoder.writeXyz methods can be validated against a schema, and similarly sequences of Decoder.readXyz methods can be validated against a schema.

The second use-case is using grammars to perform schema resolution. For this use-case, we've developed a subclass of Decoder which takes two Avro schemas as input -- a reader and a writer schema. This subclass accepts an input stream written according to the writer schema, and presents it to a client expecting the reader schema. If the writer writes a long, for example, where the reader expects a double, then the Decoder.readDouble method will convert the writer's long into a double.

This document looks at grammars in the context of these two use-cases. We first look at the single-schema case, then the double-schema case. In the future, we believe the interpretation of Avro schemas as CFGs will find other uses (for example, to determine whether or not a schema admits finite-sized values).
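
As a concrete sketch of the second use-case, the snippet below writes a long and reads it back as a double through a resolving decoder. It assumes Avro's Java org.apache.avro.io API (EncoderFactory, DecoderFactory, ResolvingDecoder); the exact factory methods may differ between Avro versions, so treat this as illustrative rather than definitive.

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.EncoderFactory;
  import org.apache.avro.io.ResolvingDecoder;

  public class PromoteLongToDouble {
    public static void main(String[] args) throws Exception {
      Schema writer = new Schema.Parser().parse("\"long\"");   // what was actually written
      Schema reader = new Schema.Parser().parse("\"double\""); // what the client expects

      // Write a long according to the writer's schema.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
      enc.writeLong(42L);
      enc.flush();

      // Read it back through a resolving decoder, which follows the resolving
      // grammar and promotes the writer's long to the double the reader asks for.
      ResolvingDecoder dec = DecoderFactory.get().resolvingDecoder(
          writer, reader, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
      System.out.println(dec.readDouble());  // prints 42.0
    }
  }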

The interpretation

We parse a schema into a set of JSON objects. For each record, array, map, union, enum, and fixed schema inside this set, the parse generates a unique identifier "ni" (the "pointer" to the schema). By convention, n0 is the identifier for the "top-level" schema (i.e., the schema we want to read or write). In addition, where ni is a union, the parse generates a unique identifier "bij" for each branch of the union.

A context-free grammar (CFG) consists of a set of terminal symbols, a set of non-terminal symbols, a set of productions, and a start symbol. Here's how we interpret an Avro schema as a CFG:

Terminal symbols: The terminal symbols of the CFG consist of null, bool, int, long, float, double, string, bytes, enum, fixed, arraystart, arrayend, mapstart, mapend, and union. In addition, we define the special terminals "1", "2", "3", ... which designate the "tag" of a union (i.e., which branch of the union is actually being written or was found in the data).

Below, we use the variable P to represent any one of null, bool, int, long, float, double, string, or bytes (i.e., the "primitives").

Non-terminal symbols: The non-terminal symbols of the CFG consist of the identifiers ni, ui, ri, ei, fi and rp (there is a non-terminal rp for each symbol in P).

Productions: The productions of the CFG are as follows:

Records: If ni is a record-schema, then it defines the following production:
   ni ::= sym(fi1) sym(fi2) .. sym(fim)
where fij is field "j" of record ni, and sym(fij) is the appropriate member of P if fij is a primitive type, or the appropriate nk for some k if fij is a map, array, union, or record schema.

Arrays: If ni is an array schema, then it defines the following productions:
   ni ::= arraystart ri arrayend
   ri ::= sym(ni) ri | ε
where "sym(ni)" is either some P, if this is an array of primitives, or the non-terminal associated with the schema of the element-type of nk.

Maps: If ni is a map schema, then it defines the following productions:
   ni ::= mapstart ri mapend
   ri ::= string sym(ni) ri | ε
where "sym(ni)" is either some P, if the value-type is a primitive, or the non-terminal associated with the schema of the value-type of nk.

Unions: If ni is a union schema, then it defines the following productions:
   ni ::= union ui
   ui ::= 1 sym(bi1) | 2 sym(bi2) | ... | j sym(bij)
where the "1", "2", "3" are the tags for the union, and the bij is branch "j" of union "ni", and sym(bij) is the appropriate member of P if bij is a primitive type, or the appropriate nk if bij is a map, array, union, or record schema. (The introduction of the terminal symbol "UNION" plus the introduction of the additional non-terminal "ui" is a convenience to our parsing implementation.)

Enums: If ni is an enum schema, then it defines the following productions:
   ni ::= enum ei
   ei ::= ε
Here there is no real production for ei. The symbol is used to associate some meta information such as the number of values in the enumeration.

Fixed: If ni is a fixed binary schema, then it defines the following productions:
   ni ::= fixed fi
   fi ::= ε
Here there is no real production for fi. The symbol is used to associate some meta information such as the size of the fixed binary.

Start symbol: the starting symbol of the grammar is n0.

The grammar defined by the above transformation is LL(1). (Proof: The only alternatives in these grammars are for the ui ("union") symbols and the ri ("repeating") symbols. For unions, the alternative productions correspond to the branches of the union, and each alternative starts with a unique tag-terminal, so by looking at the very first terminal one can decide which of the productions to select. In the case of the ri, there are two alternative productions, the second of which is ε. Since these symbols only appear inside arraystart/arrayend or mapstart/mapend pairs, the arrayend or mapend lookahead predicts that ε should be selected, and any other terminal predicts that the other production should be selected.) Here's an example. Consider the schema:

{
  "type":"record", "name":"foo",
  "fields":[
    {"name":"bar","type":"double"},
    {"name":"baz","type":{"type":"array", "items":"string"}},
    {"name":"zip",
     "type":{"type":"map",
     "values":["null",{"type":"array", "items":"bytes"},"foo"]}},
  ]
}
This schema generates the following grammar:
  n0 ::= double n1 n2
  r1 ::= string r1 | ε
  n1 ::= arraystart r1 arrayend
  r2 ::= string n3 r2 | ε
  n2 ::= mapstart r2 mapend
  u3 ::= 1 null | 2 n4 | 3 n0
  n3 ::= union u3
  r4 ::= bytes r4 | ε
  n4 ::= arraystart r4 arrayend
The symbol "n0" is the start-symbol for this grammar.

Reminder on LL(1) parsing

While there's lots of material on the Web on table-driven LL(1) parsing, it all tends to overcomplicate things. The best discussion I've found is in Crafting a Compiler, by Fischer and LeBlanc (my copy is from 1988 -- I hope the quality hasn't slid since then). Here's a quick summary. Parsing is the process of attempting to prove that a string can be derived from a grammar. Top-down parsing attempts this proof in a top-down manner. You start with the start symbol of the grammar and you ask yourself "Hey, given the input string, how can I derive this start symbol?" Now, in general, the start-symbol can be derived from one of a finite number of alternative productions:
   S ::= A11 A12 .. A1n1 | A21 .. A2n2 | ... | Am1 .. Amnm
So the question of deriving the symbol "S" comes down to asking "Hey, given the input I'm looking at, which of these productions for S could have produced that input?" The key property of LL(1) grammars is that this question is easy to answer. All you need to do is look at the first token in your input, and that token tells you which of these alternatives could've produced that input. (This token is sometimes called the "lookahead symbol.")

So the idea is that you put your start symbol on the stack to initialize things. You pop that symbol off the stack, ask which production for S could've produced the input you're looking at, push that production back on the stack, and repeat. Let's fill in the details.

The parsing table for this parsing procedure is a function that takes two inputs and produces one output:

   T: Non-terminal x Terminal --> Production
Remember, a "production" is a sequence of symbols -- a mix of terminals and non-terminals -- that derive a non-terminal in the grammar.

This function T takes a non-terminal and a terminal, and returns a production for the non-terminal. The non-terminal is the symbol you're trying to derive (the one on top of the parsing stack); the terminal is the current symbol in your input stream (the lookahead symbol). If X is the non-terminal and a is the lookahead symbol, then the output is the unique production for X that can produce input starting with a. (This function can also return the special result "Error" to indicate there is no such production, i.e., we can't parse the input.)

If you have such a table, then your parsing code looks like this:

parse(Table T, TokenStream in):
  Stack stack = new Stack(Table.startSymbol);
  for (Token t = in.next(); t != EOF; t = in.next())
    advance(stack, T, t);

advance(Stack stack, Table T, Token t):
  X = stack.pop();
  while (! isTerminal(X)):
    if T(X,t) yields a production Y1 Y2 ... Yn:
      // push production in reverse order, so we leave looking for
      // the first symbol of the production
      stack.push(Yn);
      ...;
      stack.push(Y2);
      stack.push(Y1);
    else, T(X,t) is undefined, so throw an error;
    X = stack.pop(); // Repeat until we find a terminal

  if X == t then return
  else throw an error;

Parsing tables for Avro

Traditionally, the parsing table for an LL(1) grammar is defined as follows:
  T(A,y) = A ::= X1 ... Xn  -- if y is in Predict(A ::= X1 ... Xn)
  T(A,y) = Error              -- otherwise
where Predict(A ::= X1 ... Xn) is the set of terminal symbols that predict this particular production for A (in an LL(1) grammar, each terminal predicts at most one of the productions for A).

But in our case, almost all productions have a single alternative. If a non-terminal symbol A is on the top of the stack, then we don't even have to look at the input to figure out which production could derive A because there's only one such production! Thus, we can define a special parsing table for Avro-induced grammars as follows:

  T(A,y) = A ::= sym(fi1) sym(fi2) .. sym(fim) -- if A is a record schema
  T(A,y) = A ::= arraystart ri arrayend       -- if A is the non-terminal for an array schema
  T(A,y) = A ::= mapstart ri mapend           -- if A is the non-terminal for a map schema
  T(A,y) = A ::= union ui                     -- if A is a union schema
  T(A,y) = A ::= y sym(biy)                   -- if A is a ui non-terminal (note the "y" inside this production)
  T(A,y) = Error                              -- if A is a ui non-terminal and "y" isn't the tag
                                              --   of any branch of the corresponding union
  T(A,y) = A ::= sym(ni) ri                   -- if A is ri and y is neither arrayend nor mapend
  T(A,y) = A ::= ε                            -- if A is ri and y is either arrayend or mapend
Note that only the last four rules for T(A,y) consider the lookahead symbol (i.e., only those rules actually look at the value of y). These are the rules for dealing with productions that have alternatives, i.e., the rules for unions (where there is an alternative for each branch) and the rules for repeaters (where there is one alternative for the "repeat" case and another alternative for the "end" case).

The nice thing about this alternative formulation of the parsing table is that we don't actually have to compute the predict set, which is not super complicated, but would be a pile of code to test and maintain.
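
To make the shape of this table concrete, here is a small Java sketch of the lookup. The Symbol and Kind types are hypothetical, illustrative stand-ins rather than Avro's actual parsing classes; the point is simply that only the union-tag ("ui") and repeater ("ri") cases consult the lookahead, while every other non-terminal blindly returns its single production.

  import java.util.ArrayList;
  import java.util.List;

  final class Symbol {
    enum Kind { TERMINAL, RECORD, ARRAY, MAP, UNION, UNION_TAG, REPEATER }
    final Kind kind;
    final String name;
    // For most non-terminals: a single production.  For UNION_TAG: one production per branch.
    final List<List<Symbol>> productions;
    Symbol(Kind kind, String name, List<List<Symbol>> productions) {
      this.kind = kind; this.name = name; this.productions = productions;
    }
  }

  final class AvroParsingTable {
    // T(A, y): the production to push when non-terminal A is on top of the
    // parsing stack and y is the lookahead terminal.
    static List<Symbol> lookup(Symbol a, Symbol y) {
      switch (a.kind) {
        case RECORD: case ARRAY: case MAP: case UNION:
          return a.productions.get(0);                // single alternative; y is ignored
        case UNION_TAG: {                             // the "ui" non-terminals
          int tag = y.name.matches("\\d+") ? Integer.parseInt(y.name) : -1;
          if (tag < 1 || tag > a.productions.size())  // the Error row of the table
            throw new IllegalStateException("no branch of this union is tagged " + y.name);
          List<Symbol> production = new ArrayList<>();
          production.add(y);                          // A ::= y sym(b_iy)
          production.addAll(a.productions.get(tag - 1));
          return production;
        }
        case REPEATER:                                // the "ri" non-terminals
          if (y.name.equals("arrayend") || y.name.equals("mapend"))
            return List.of();                         // the ε production
          return a.productions.get(0);                // sym(n_i) r_i
        default:
          throw new IllegalStateException(a.name + " is not a non-terminal");
      }
    }
  }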

It should be noted that the resulting parsing table catches errors in different states than the traditional LL(1) parsing table. For example, let's say our Schema is simply an array of ints, which induces the following grammar:

  n0 ::= arraystart rint arrayend
  rint ::= int rint | ε
The traditional LL(1) table would be:
  T(n0,arraystart) = n0 ::= arraystart rint arrayend
  T(rint,int) = rint ::= int rint
  T(rint,arrayend) = rint ::= ε
  T(A,y) = Error -- if (A,y) is none of the above
while our parser table would be:
  T'(n0,y) = n0 ::= arraystart rint arrayend -- for all y
  T'(rint,y) = rint ::= int rint             -- for all y other than arrayend
  T'(rint,arrayend) = rint ::= ε
Note that T is defined as Error for a lot of (A,y) pairs, but T' is defined as Error for none of them. How can this be?

The difference is that T catches many errors when terminals fail to appear in Predict sets, while T' catches the errors when terminals fail to match corresponding terminals on the parser stack. For example, let's say rint is on the top of the parser stack, and the symbol double arrives (which means, in practice, that a writeDouble call is encountered). In this case, a parser with the standard table will catch the error right away, because double is not in the predict-set for rint. A parser with our alternative table will first replace the rint on the stack with the sequence int rint (with int on the top of the stack). It will then throw an error, because the input symbol double does not match the terminal int that's now on the top of the stack.

However, we believe that our modified parser will accept exactly the same set of strings as the standard parser.

Induction rules

The first section ("The interpretation") informally describes the grammar generated by an Avro schema. This section provides a more formal description using a set of induction rules. The earlier description is fine for describing how a single Avro schema generates a grammar. But soon we're going to describe how two schemas together define a "resolving" grammar, and for that description we'll need the more formal mechanism described here.

The terminal and non-terminal symbols in our grammar are as described in the first section. Our induction rules will define a function "C(S)=<G,a>", which takes an Avro schema "S" and returns a pair consisting of a set of productions "G" and a symbol "a". This symbol "a" -- which is either a terminal, or a non-terminal defined by G -- generates the values described by schema S.

The first rule applies to all Avro primitive types:
p in {null, boolean, int, long, float, double, string, bytes}

C(p)=<{}, p>

This first rule does not generate any productions; it simply returns the terminal symbol corresponding to the primitive type of the schema.

The next rule applies to record schemas:
S={"type":"record", "name":a,
"fields":[{"name":F1, "type":S1}, ..., {"name":Fn, "type":Sn}]}
C(Sj)=<Gj, fj>

C(S)=<G1 ∪ ... ∪ Gn ∪ {a::=f1 f2 ... fn}, a>

In this case, the set of output-productions consists of all the productions generated by the element-types of the record, plus a production that defines the non-terminal "a" to be the sequence of field-types. We return "a" as the grammar symbol representing this record-schema.
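
For example, applying this rule (together with the rule for primitives) to the schema {"type":"record", "name":"pair", "fields":[{"name":"x", "type":"int"}, {"name":"y", "type":"string"}]} gives
   C(pair) = <{pair ::= int string}, pair>
since C(int)=<{}, int> and C(string)=<{}, string> contribute no productions of their own.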

Next, we define the rule for arrays:
S={"type":"array", "items":Se}
C(Se)=<Ge,e>

C(S)=<Ge ∪ {r ::= e r, r ::= ε, a ::= arraystart r arrayend}, a>

For arrays, the set of output productions again contains all productions generated by the element-type. In addition, we define two productions for "r", which represents the repetition of this element type. The first production is the recursive case, which consists of the element-type followed by "r" all over again. The next case is the base case, which is the empty production. Having defined this repetition, we can then define "a" as this repetition bracketed by the terminal symbols arraystart and arrayend.

The rule for maps is almost identical to that for arrays:
S={"type":"map", "values":Se}
C(Se)=<Ge,e>

C(S)=<Ge ∪ {r ::= string e r, r ::= ε, a ::= mapstart r mapend}, a>

The only difference from arrays is that map-elements consist of a string (the key) together with an element-type (vs. just an element-type).

The rule for unions:
S=[S1, S2, ..., Sn]
C(Sj)=<Gj, bj>

C(S)=<G1 ∪ ... ∪ Gn ∪ {u::=1 b1, u::=2 b2, ..., u::=n bn, a::=union u}, a>

In this rule, we again accumulate the productions (Gj) generated by each of the sub-schemas for each branch of the union. If there are n branches, we define n different productions for the non-terminal symbol "u", one for each branch in the union. These per-branch productions consist of the index of the branch (1 for the first branch, 2 for the second, and so forth), followed by the symbol representing the schema of that branch. With these productions for "u" defined, we can define "a" as simply the terminal symbol union followed by this non-terminal "u".

The rule for fixed size binaries:
S={"type":"fixed", "name":a, "size":s}

C(S)=<{a::=fixed f, f::=ε}, a>

In this rule, we define a new non-terminal f, which has the size of the fixed binary associated with it.

The rule for enums:
S={"type":"enum", "name":a, "symbols":["s1", "s2", "s3", ...]}

C(S)=<{a::=enum e, e::=ε}, a>

In this rule, we define a new non-terminal e, which has the enum's set of values associated with it.

Resolution using action symbols

We want to use grammars to represent Avro's rules for schema resolution. To do this, we need a way to encode certain actions that the parser should perform as part of the resolution: promoting the writer's primitive values to the types the reader expects, skipping writer fields that the reader doesn't know about, telling the reader which of its own fields comes next, selecting branches when either schema is a union, and reporting incompatibilities that are discovered at run-time.

These actions will become "action symbols" in our grammar. Action symbols are symbols that cause our parser to perform special activities when they appear on the top of the parsing stack. For example, when the skip-action makes it to the top of the stack, the parser will automatically skip the next value in the input stream. (Again, Fischer and LeBlanc have a nice description of action symbols.)

We're going to use induction rules to define a grammar. This time, our induction rules will define a two-argument function "C(W,R)=<G,a>", which takes two schemas: the writer's and the reader's schema, respectively. The results of this function are the same as they were for the single-schema case.

The first rule applies to all Avro primitive types:
p in {null, boolean, int, long, float, double, string, bytes}

C(p,p)=<{}, p>

In this case, the writer and reader schemas agree, so the resulting grammar should just expect the agreed-upon primitive type.

The next rule deals with resolution of primitive types:
w in {int, long, double}
r in {long, double}
w != r

C(w,r)=<{}, ResolverAction(w,r)>

When this parameterized action is encountered, the parser will resolve the writer's value into the reader's expected-type for that value. In the parsing loop, when we encounter this symbol, we use the "r" parameter of this symbol to check that the reader is asking for the right type of value, and we use the "w" parameter to figure out how to parse the data in the input stream.
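
As an illustration, ResolverAction(long, double) amounts to something like the following. This is a hypothetical sketch against the org.apache.avro.io.Decoder interface, not Avro's actual resolver code: the value is parsed using the writer's type and handed back as the type the reader expects.

  import java.io.IOException;
  import org.apache.avro.io.Decoder;

  final class LongToDoubleResolver {
    // Invoked when the reader calls readDouble() but the writer wrote a long.
    static double readDouble(Decoder in) throws IOException {
      return (double) in.readLong();  // parse as written (long), return as expected (double)
    }
  }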

One final possibility for primitive types is that they are incompatible types:
The w,r pair does not fit the previous two rules, AND neither
of the pair is a union, AND the pair aren't both compounds
of the same type (i.e., two arrays, two records, or two maps)

C(w,r)=<{}, ErrorAction>

When this parameterized action is encountered, the parser will throw an error. Keep in mind that this symbol might be generated in the middle of a recursive call to "C". For example, if the reader's schema is long, and the writer's is [long, string], we'll generate an error symbol for the string-branch of the union; if this branch occurs in the actual input, an error will then be generated.

The next rule deals with resolution of fixed size binaries:
w = {"type":"fixed", "name":"n1", "size":s1}
r = {"type":"fixed", "name":"n2", "size":s2}
n1 != n2 or s1 != s2

C(w,r)=<{}, ErrorAction>

w = {"type":"fixed", "name":"n1", "size":s1}
r = {"type":"fixed", "name":"n2", "size":s2}
n1 == n2 and s1 == s2

C(w,r)=<{a::=fixed f, f::=ε}, a>

If the names and the sizes are identical, the schemas match; otherwise an error action is generated.

The next rule deals with resolution of enums:
w = {"type":"enum", "symbols":[sw1, sw2, ..., swm] }
r = {"type":"enum", "symbols":[sr1, sr2, ..., srn] }
fi = EnumAction(i, j) if swi == srj
fi = ErrorAction if swi does not match any srj

C(w,r)=<{a::=enum e, e::=ε}, a>

The symbol e has the set of actions fi associated with it. At run-time it chooses the right action based on the value actually written.

Now that we have rules for primitive types, we can define rules for compound types. First, let's look at records:
W= {"type":"record","name":w,
"fields":[{"name":E1, "type":S1},..., {"name":En, "type":Sn}]}
R= {"type":"record", "name":r,
"fields":[{"name":F1, "type":T1},..., {"name":Fm, "type":Tm}]}
{F1, ..., Fm} is a subset of {E1, ..., En}
C(Sj, Ti) = <Gj, fj> -- for all Ej=Fi
C(Sj) = <Gj, fj> -- for all Ej not in {F1, ..., Fm}
f'j = FieldAction(i, Ei) fj    -- if Ej=Fi
f'j = SkipAction(fj)           -- if Ej not in {F1, ..., Fm}

C(W,R)=<G1 ∪ G2 ∪ ... ∪ Gn ∪ { w::=f'1 f'2 ... f'n }, w>

The substance of this rule lies in the definition of the "f'j". If the writer's field Ej is not a member of the reader's schema, then a skip-action is generated, which will cause the parser to automatically skip over the field without the reader knowing. (In this case, note that we use the single-argument version of "C", i.e., the version defined in the previous section!) If the writer's field Ej is a member of the reader's schema, then "f'j" is a two-symbol sequence: the first symbol is a (parameterized) field-action which is used to tell the reader which of its own fields is coming next, followed by the symbol for parsing the value written by the writer.
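
For example, if the writer's record has fields a (an int), b (a string), and c (a double), and the reader's record (with the same name) has only a and c with the same types, the rule yields:

f'1 = FieldAction(1, a) int     -- "a" is field 1 of the reader
f'2 = SkipAction(string)        -- the reader has no field "b"; skip it
f'3 = FieldAction(2, c) double  -- "c" is field 2 of the reader

C(W,R) = <{ w ::= f'1 f'2 f'3 }, w>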

The above rule for records works only when the reader and writer schemas have the same name and the reader's fields are a subset of the writer's. In other cases, an error is produced.

The rule for arrays is straightforward:
W={"type":"array", "items":Sw}
R={"type":"array", "items":Sr}
C(Sw, Sr)=<Ge,e>

C(W,R)=<Ge ∪ {r ::= e r, r ::= ε, a ::= arraystart r arrayend}, a>

Here the rule is largely the same as for the single-schema case, although the recursive use of C may result in productions that are very different. The rule for maps changes in a similarly small way, so we don't bother to detail that case in this document.

The final rules are for unions. Let's first look at the case where the writer is a union but the reader is not:
W=[S1, ..., Sn]
R is not a union schema
C(Sj,R)=<Gj, bj>

C(W,R)=<G1 ∪ G2 ∪ ... ∪ Gn ∪ {a::=WriterUnionAction(b1, b2, ..., bn)}, a>

Here, a writer-union action is generated that looks much like a union did in the single-schema case. However, unlike in that case, the writer-union action will cause the parser to automatically interpret the writer's union value.

Now let's look when the reader expects a union. The first of these cases is an error case:
W is not a union schema
R=[R1, ..., Rn]
W does not resolve to any of the branches of R

C(W,R)=<{}, ErrorAction>

In this case, there's no way to resolve the two schemas, so we generate an error action to remind us of this fact at run-time. (Again, this error action might be under a branch of a containing union, and thus might never be triggered at run-time, so it wouldn't be correct to signal an error at "compile" time.)

Here's the non-error case:
W is not a union schema
R=[R1, ..., Rn]
Branch "j" of R is the best match for W
C(W,Rj)=<G,w>

C(W,R)=<G, ReaderUnionAction(j,w)>

In this case, we can decide at "compile time" which branch of the reader's union is the best match for the value that's going to be written by the writer. We then generate a reader-union action, which tells the parser, first, which branch-number of the reader's union should be reported to the reader, and second, which symbol to use to parse the writer's actual value.

The interesting case is when the writer's and reader's schemas are both unions:
W=[W1, ..., Wn]
R=[R1, ..., Rm]
C(Wj, R)=<Gj, bj>

C(W,R)=<G1 ∪ ... ∪ Gn ∪ {u::=1 b1, u::=2 b2, ..., u::=n bn, a::=union u}, a>

Note that in the inductive case ("C(Wj, R)"), each branch of the writer ("Wj") is compared to the entire union of the reader ("R"). Thus, one of the two previous cases (the error case or the reader-union case) gets generated for each branch of the writer's union.

Resolving parser

Here's a stylized version of the actual parsing code, with comments, to illustrate how a resolving-grammar is actually used. To better understand this code, compare it to the simple code for "advance" given earlier in this document.
  Symbol advance(Stack stack, Table T, Symbol t, TokenStream in):
    Symbol X = stack.pop();
    while (! isTerminal(X)):
      case X:
        FieldAction:
          // In this case, the main parsing loop can "ask" for the
          // field information by passing a FieldAction symbol as
          // "t".  If it does, it'll get the (parameterized) symbol
          // from the parsing table.  If it doesn't ask for this
          // information, then the information will be ignored.
          if (isFieldAction(t)) return X;

        SkipAction(productionToSkip):
          // In this case we automatically skip the production we've
          // been asked to skip
          in.skip(productionToSkip);

        WriterUnionAction(b_1, b_2, ..., b_n):
          // In this case, we read from the token input-stream to
          // determine the actual branch written by the writer.
          // We then push this branch on the parsing stack, to tell
          // the parser what type of value to look for
          int i = in.readIndex();
          stack.push(b_i);

        NonTerminal:
          if T(X,t) yields a production Y1 Y2 ... Yn:
            // push production in reverse order, so we leave looking for
            // the first symbol of the production
            stack.push(Yn);
            ...;
            stack.push(Y2);
            stack.push(Y1);
          else, T(X,t) is undefined, so throw an error;

      X = stack.pop();

    // We've left the loop, so X is a terminal symbol:
    case X:
      ResolverAction(w,r):
        // If reader is looking for an "r", then the reader's
        // looking for the right thing according to the reader's
        // schema, but return the type actually written so the
        // proper conversion can happen.
        if (r == t) return w;

      ReaderUnionAction(index,writerSym):
        // Reader-union actions are allowed where the reader
        // is expecting a union.  In this case, we return the
        // (parameterized!) reader-union-action symbol and 
        // the code above figures out what to do
        if (t == union) return X;

      ErrorAction:
        throw the deferred error;
      
    // Fall-through case:
    if (X == t) then return X
    else throw an error