Crutcher Dunnavant
University of Alabama
19 Nov 2004
crutcher (at) gmail.com
Original document available at: http://press.samedi-studios.com/drafts/dunnavant2004list_resolution/dunnavant2004list_resolution.pdf
Most popular parser generation tools work in terms of grammar rules which are not directly capable of providing list semantics. While the basic techniques for realizing list semantics are frequently re-invented, the literature has suffered from a lack of a solid collection of these techniques. This paper presents a collection of techniques for realizing various forms of lists as grammatically re-written forms using AST re-writes compatible with most parser generation environments. Also provided is a set of additional EBNF extension operators for specifying AST re-write actions.
Using most parser generation tools, matching a list of elements requires
the construction of a recursive non-terminal (left-recursive for LALR(1)
grammars) which matches the various productions of the list, and has an
epsilon production. Thus, in order to describe a list containing some
arbitrary number of
s and
s, representable in ISO
EBNF[10] as:
most parser generation tools would require the re-structured rule:
Unfortunately, given a symbol stream A B A A, this rule would yield the containment (L (L (L (L (L) A) B) A) A), rather than the more useful (L A B A A). It would likely be necessary to restructure this tree before applying semantic actions; and since matching lists of elements is virtually guaranteed to come up while writing a language evaluator, there is some real value in having a higher level model for dealing with this.
This paper defines transformations upon grammatical lists of various kinds which permit their realization in most parser generation environments. These transformations are themselves defined in terms of 4 simple parse-time AST re-write actions, which should be simple to implement in any parsing environment.
The yacc[3] family of parser generators won the parser generation space. This family is what is taught in compiler classes, and what books are written about. Many projects exist to re-implement the yacc generator in various environments. These include bison[5](C), Yacc++[9](C++), CUP (or java_cup)[6](Java), and JavaCC[8](Java). These tools work on input files structured to associate, and partially re-write, action code which is attached to the various rules of the grammar. For example, the rule:
would require a description for most of the yacc-alikes akin to:
A B C {
ASTNode Foo = new ASTNode('Foo');
Foo.addChild($1); /* For A */
Foo.addChild($2); /* For B */
Foo.addChild($3); /* For C */
$$ = Foo;
return Foo_symbol;
}
which would be re-written by the parser generator to define
,
,
, and
. Working within these constraints, it is
often difficult to provide the list semantics desired for a given grammar's
AST.
Some of the more advanced parser-generation environments support lists directly. The meta-DSL tool smgn[4] flattens recursive lists, but does not handle delimited lists, and provides no means of controlling the nature of the collapse. SDF[1], which provides parser generation for full context-free languages, supports list constructs; as does GDK[2], which exists for debugging and/or reverse engineering grammars and provides a rich framework of grammar re-write algorithms.
However, most computer programming and configuration languages are not developed in robust grammar transformation environments, but are instead developed with vanilla yacc-alike parser generation tools. In such environments, it is necessary to perform custom grammar tweaks to achieve the list semantics desired.
Provided with a sufficiently abstract AST implementation, there are many useful tree re-writes which can be performed at production time. The list resolution transforms will be expressed in terms of four such actions. These actions are here defined as extensions to EBNF, in order to permit an abstract description of the list transforms. Figure 1 provides a graphical account of the functioning of the preserve, token, and content actions.
The default action taken for a symbol in a production is preserve. When a symbol is preserved in a production rule, the AST sub-tree rooted at that symbol is made a child of the sub-tree produced for the rule's non-term. For example, the rule:
would, upon matching
, produce a
node which
contained the trees for
and
as it's children.
The drop action, represented by
, is very
simple. When a symbol is dropped in a production rule, the AST sub-tree
rooted at that symbol is omitted from the sub-tree produced for the rule's
non-term. For example, the rule:
would, upon matching
, produce a
node which
contained only the tree for
as a child.
The token action, represented by
(as
though
is to be cut below), is more complex. When a symbol is
tokenized in a production rule, only the root of the AST sub-tree rooted at
that symbol is included in the sub-tree produced for the rule's non-term,
while the symbol's sub-tree's root's children are omitted. For example, the
rule:
would, upon matching
, produce a
node which
contained as children the root of the tree for
, and the complete tree
for
. Alternatively, if this rule matched the subtrees (Bar a
b c) and (Baz e f g), it would produce
(Foo Bar (Baz e f g)).
The content action, represented by
(as
though
is to be cut above), is the complement of
token. When the content action is performed on a symbol in a
production rule, the children of the symbol's sub-tree's root are included
as children of the sub-tree produced for the rule's non-term, while the
symbol's sub-tree root itself is omitted. For example, the rule:
would, upon matching
, produce a
node which
contained as children the root of the tree for
, and the complete tree
for
. Alternatively, matching (Bar a b c) (Baz e f g)
would produce (Foo a b c (Baz e f g)).
Note:
is defined in EBNF[10] as one or
more occurrences of
(an arbitrary number of
except the empty set).
This is confusing, and the equivalent form
will be used here
instead.
The simplest form of lists are repetitions of the same symbol over and over
again without delimiters. There are two variations on this form, lists which
may be empty, and lists which may not be. Given
, a single grammar
symbol, these forms may be handled quite adequately with the following
transformations (though for empty lists,
cannot itself have an
production without producing a shift-reduce conflict).

Upon matching I I I, both of these rules would produce the sub-tree (L I I I).
However, if it is necessary for the list to match multiple symbols, or collections of symbols, there are several distinct cases to consider.
Often there are several possible matches for a list item, as in:
and we wish to collect each item up, so that each group which
makes up the list is collected together. This is a simple matter of
re-writing the list to create an item non-terminal
which produces the
various options. Letting
be a series of production alternatives
lacking
:

The non-empty variant of this transform is straightforward:

If
were
,
upon matching (I Expression) (I Expression) (I Keyword equals
Expression), these rules would produce the sub-tree (L (I
Expression) (I Expression) (I Keyword equals Expression)).
If there are several possible matches for a list item, as in:
but the items are NOT to be grouped, then there are two transformations available, both which produce the same result. Either we re-write the tree at production time, using this transform:

or, letting
be the various individual
productions of
, we re-write the grammar, using this transform:

These transforms are roughly equivalent; but the non-empty variant of the first transform:

is so much more efficient than that of the second:

that it is usually wisest to use the first form of both the empty
and non-empty list transforms, unless a pressing reason exists not too.
If
were
, upon matching A B C, all
of these rules would produce the sub-tree (L A B C).
A more complex form of lists are delimited lists. Delimited lists
have explicit separators between items. Beginning again with simple cases,
let us take
and
, singular grammar symbols, and describe a
transformation which is sufficient to handle a list of
delimited by
:
Note that a shift-reduce conflict will exist if
and
both contain
productions. Upon matching I D I D
I, this rule would produce the sub-tree (L I D I D
I).
There are several distinct variants which we may consider for delimited lists.
Suppose that we only wanted to keep the
symbols? In this
case, we would describe the initial list, and the transform, as:
Which, upon matching I D I D I, would produce
the sub-tree (L I I I). Using
as a list item
non-terminal with dropped delimiters is the preferred way to parse most
lists in most environments, though it is necessary to examine the empty
version of this transform if empty lists are needed.
Suppose that we wished to keep a marker that a delimiter had been matched, but not it's content. In this case, we would describe the initial list, and the transform, as:
Which, upon matching I (D a) I (D b) I, would produce the sub-tree (L I D I D I).
When working with ungrouped lists, it is sometimes useful to maintain the delimiters in order to separate the groups at a later time. The following transform provides this:

If
were
, upon matching A B D C,
these rules would produce the sub-tree (L A B D C).
The same collection technique used for grouping list items should be used
in the case that multiple delimiters are permitted for a given list.
The various delimiter options should be extracted into their own
non-terminal, and appropriate re-write action should be taken on the
resultant non-terminal if necessary. Letting
be a series of
production alternatives for the delimiter lacking
:

Delimited lists which can be empty require the most complex list
transformations to realize. For starters, neither their items nor their
delimiters can yield
productions without conflicting with the
productions of the list itself. Also, if the list is given an
production and a recursive production containing the delimiter
, then it would be possible to match D I, which is
not what is intended. As a result, we will introduce an additional
non-terminal,
, which will be processed with the content action
wherever it occurs. The following transformation provides for empty
delimited lists matching against the item symbol
and the delimiter
symbol
.

More complex item and delimiter groups can be realized using techniques from 4.1.1. Grouped Lists and 4.2.4. Variable Delimiters. All of the usual AST re-writes remain applicable with this transformation situation, and a complete enumeration of the various possibilities is not needed here.
There is a general target form which, with the application of different AST
re-write actions on
and
, is capable of realizing most list
semantics. If
and
be a series of production alternatives
lacking
:

Because of its flexibility, this form is a good target form for automated tools which provide list semantics.
http://homepages.cwi.nl/~jvisser/papers/sdfintro.pdf
www.gnu.org/software/bison/
www.cs.princeton.edu/~appel/modern/java/CUP/
javacc.dev.java.net
world.std.com/~compres/
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split=0 -tmp=/Users/crutcher/tmp/tmp/ dunnavant2004list_resolution.tex
The translation was initiated by Crutcher Dunnavant on 2006-03-08