List Resolution Techniques for Grammar Development

Crutcher Dunnavant

University of Alabama

19 Nov 2004

crutcher (at) gmail.com

Original document available at: http://press.samedi-studios.com/drafts/dunnavant2004list_resolution/dunnavant2004list_resolution.pdf

Abstract:

Most popular parser generation tools work in terms of grammar rules which are not directly capable of providing list semantics. While the basic techniques for realizing list semantics are frequently re-invented, the literature has suffered from a lack of a solid collection of these techniques. This paper presents a collection of techniques for realizing various forms of lists as grammatically re-written forms using AST re-writes compatible with most parser generation environments. Also provided is a set of additional EBNF extension operators for specifying AST re-write actions.

Introduction

Using most parser generation tools, matching a list of elements requires the construction of a recursive non-terminal (left-recursive for LALR(1) grammars) which matches the various productions of the list, and has an epsilon production. Thus, in order to describe a list containing some arbitrary number of $A$s and $B$s, representable in ISO EBNF[10] as:

$L = \{ A \ \vert \ B \}$

most parser generation tools would require the re-structured rule:

$L = L \ A \ \vert \ L \ B \ \vert \ \epsilon$

Unfortunately, given a symbol stream A B A A, this rule would yield the containment (L (L (L (L (L) A) B) A) A), rather than the more useful (L A B A A). It would likely be necessary to restructure this tree before applying semantic actions; and since matching lists of elements is virtually guaranteed to come up while writing a language evaluator, there is some real value in having a higher level model for dealing with this.

This paper defines transformations upon grammatical lists of various kinds which permit their realization in most parser generation environments. These transformations are themselves defined in terms of 4 simple parse-time AST re-write actions, which should be simple to implement in any parsing environment.

Background

The yacc[3] family of parser generators won the parser generation space. This family is what is taught in compiler classes, and what books are written about. Many projects exist to re-implement the yacc generator in various environments. These include bison[5](C), Yacc++[9](C++), CUP (or java_cup)[6](Java), and JavaCC[8](Java). These tools work on input files structured to associate, and partially re-write, action code which is attached to the various rules of the grammar. For example, the rule:

$Foo = A \ B \ C$

would require a description for most of the yacc-alikes akin to:

A B C {
    ASTNode Foo = new ASTNode('Foo');
    Foo.addChild($1); /* For A */ 
    Foo.addChild($2); /* For B */ 
    Foo.addChild($3); /* For C */ 
    $$ = Foo;
    return Foo_symbol;
}

which would be re-written by the parser generator to define $\$1$, $\$2$, $\$3$, and $\$\$$. Working within these constraints, it is often difficult to provide the list semantics desired for a given grammar's AST.

Some of the more advanced parser-generation environments support lists directly. The meta-DSL tool smgn[4] flattens recursive lists, but does not handle delimited lists, and provides no means of controlling the nature of the collapse. SDF[1], which provides parser generation for full context-free languages, supports list constructs; as does GDK[2], which exists for debugging and/or reverse engineering grammars and provides a rich framework of grammar re-write algorithms.

However, most computer programming and configuration languages are not developed in robust grammar transformation environments, but are instead developed with vanilla yacc-alike parser generation tools. In such environments, it is necessary to perform custom grammar tweaks to achieve the list semantics desired.

AST Re-Write Actions

Provided with a sufficiently abstract AST implementation, there are many useful tree re-writes which can be performed at production time. The list resolution transforms will be expressed in terms of four such actions. These actions are here defined as extensions to EBNF, in order to permit an abstract description of the list transforms. Figure 1 provides a graphical account of the functioning of the preserve, token, and content actions.

Figure 1: AST Re-Write Actions

AST Action: Preserve

The default action taken for a symbol in a production is preserve. When a symbol is preserved in a production rule, the AST sub-tree rooted at that symbol is made a child of the sub-tree produced for the rule's non-term. For example, the rule:

$Foo = Bar \ Baz$

would, upon matching $ Bar \ Baz $, produce a $Foo$ node which contained the trees for $Bar$ and $Baz$ as it's children.

AST Action: Drop

The drop action, represented by $[\![ Symbol ]\!]$, is very simple. When a symbol is dropped in a production rule, the AST sub-tree rooted at that symbol is omitted from the sub-tree produced for the rule's non-term. For example, the rule:

$Foo = [\![Bar]\!] Baz$

would, upon matching $ Bar \ Baz $, produce a $Foo$ node which contained only the tree for $Baz$ as a child.

AST Action: Token

The token action, represented by $\lfloor Symbol \rfloor$ (as though $Symbol$ is to be cut below), is more complex. When a symbol is tokenized in a production rule, only the root of the AST sub-tree rooted at that symbol is included in the sub-tree produced for the rule's non-term, while the symbol's sub-tree's root's children are omitted. For example, the rule:

$Foo = \lfloor Bar \rfloor Baz$

would, upon matching $ Bar \ Baz $, produce a $Foo$ node which contained as children the root of the tree for $Bar$, and the complete tree for $Baz$. Alternatively, if this rule matched the subtrees (Bar a b c) and (Baz e f g), it would produce (Foo Bar (Baz e f g)).

AST Action: Content

The content action, represented by $\lceil Symbol \rceil$ (as though $Symbol$ is to be cut above), is the complement of token. When the content action is performed on a symbol in a production rule, the children of the symbol's sub-tree's root are included as children of the sub-tree produced for the rule's non-term, while the symbol's sub-tree root itself is omitted. For example, the rule:

$Foo = \lceil Bar \rceil Baz$

would, upon matching $ Bar \ Baz $, produce a $Foo$ node which contained as children the root of the tree for $Bar$, and the complete tree for $Baz$. Alternatively, matching (Bar a b c) (Baz e f g) would produce (Foo a b c (Baz e f g)).

List Transformations

Note: $\{ P \}-$ is defined in EBNF[10] as one or more occurrences of $P$ (an arbitrary number of $P$ except the empty set). This is confusing, and the equivalent form $P, \{ P \}$ will be used here instead.

Non-Delimited Lists

The simplest form of lists are repetitions of the same symbol over and over again without delimiters. There are two variations on this form, lists which may be empty, and lists which may not be. Given $I$, a single grammar symbol, these forms may be handled quite adequately with the following transformations (though for empty lists, $I$ cannot itself have an $\epsilon$ production without producing a shift-reduce conflict).

$\begin{array}{rcl}
L = \{ I \} & \Rightarrow &
L = \epsilon \ \vert \ \lceil L...
... = I, \{ I \} & \Rightarrow &
L = I \ \vert \ \lceil L \rceil \ I
\end{array}$

Upon matching I I I, both of these rules would produce the sub-tree (L I I I).

However, if it is necessary for the list to match multiple symbols, or collections of symbols, there are several distinct cases to consider.

Grouped Lists

Often there are several possible matches for a list item, as in:

$L = \{ \ Keyword \ equals \ Expression \ \vert \ Expression \ \}$

and we wish to collect each item up, so that each group which makes up the list is collected together. This is a simple matter of re-writing the list to create an item non-terminal $I$ which produces the various options. Letting $\Phi$ be a series of production alternatives lacking $\epsilon$:

$\begin{array}{rcl}
L & = & \{ \Phi \} \\
& \Downarrow & \\
L & = & \{ I \} \...
...\\
L & = & \epsilon \ \vert \ \lceil L \rceil \ I \\
I & = & \Phi
\end{array}$

The non-empty variant of this transform is straightforward:

$\begin{array}{rcl}
L & = & \Phi, \{ \Phi \} \\
& \Downarrow & \\
L & = & I, ...
...rrow & \\
L & = & I \ \vert \ \lceil L \rceil \ I \\
I & = & \Phi
\end{array}$

If $\Phi$ were $Keyword \ equals \ Expression \ \vert \ Expression$, upon matching (I Expression) (I Expression) (I Keyword equals Expression), these rules would produce the sub-tree (L (I Expression) (I Expression) (I Keyword equals Expression)).

Ungrouped Lists

If there are several possible matches for a list item, as in:

$L = \{ \ A \ B \ \vert \ C \ \}$

but the items are NOT to be grouped, then there are two transformations available, both which produce the same result. Either we re-write the tree at production time, using this transform:

$\begin{array}{rcl}
L & = & \{ \Phi \} \\
& \Downarrow & \\
L & = & \{ \lceil...
...psilon \ \vert \ \lceil L \rceil \ \lceil I \rceil \\
I & = & \Phi
\end{array}$

or, letting $\phi_1 ... \phi_n$ be the various individual productions of $\Phi$, we re-write the grammar, using this transform:

$\begin{array}{rcl}
L & = & \{ \Phi \} \\
& \Downarrow & \\
L & = & \epsilon ...
... L \rceil \ \phi_1 \ \vert \ ... \ \vert \ \lceil L \rceil \ \phi_n
\end{array}$

These transforms are roughly equivalent; but the non-empty variant of the first transform:

$\begin{array}{rcl}
L & = & \Phi, \{ \Phi \} \\
& \Downarrow & \\
L & = & \lc...
...ceil \ \vert \ \lceil L \rceil \lceil I \rceil \\
I & = & \Phi \\
\end{array}$

is so much more efficient than that of the second:

$\begin{array}{rcl}
L & = & \Phi, \{ \Phi \} \\
& \Downarrow & \\
L & = & \ep...
... L \rceil \ \phi_1 \ \vert \ ... \ \vert \ \lceil L \rceil \ \phi_n
\end{array}$

that it is usually wisest to use the first form of both the empty and non-empty list transforms, unless a pressing reason exists not too. If $\Phi$ were $A \ B \ \vert \ C$, upon matching A B C, all of these rules would produce the sub-tree (L A B C).

Non-Empty Delimited Lists

A more complex form of lists are delimited lists. Delimited lists have explicit separators between items. Beginning again with simple cases, let us take $I$ and $D$, singular grammar symbols, and describe a transformation which is sufficient to handle a list of $I$ delimited by $D$:

$\begin{array}{rcl}
L = I, \{ D I \} & \Rightarrow &
L = I \ \vert \ \lceil L \rceil \ D \ I
\end{array}$

Note that a shift-reduce conflict will exist if $I$ and $D$ both contain $\epsilon$ productions. Upon matching I D I D I, this rule would produce the sub-tree (L I D I D I).

There are several distinct variants which we may consider for delimited lists.

Dropped Delimiters

Suppose that we only wanted to keep the $I$ symbols? In this case, we would describe the initial list, and the transform, as:

$\begin{array}{rcl}
L = I, \{ [\![ D ]\!] I \} & \Rightarrow &
L = I \ \vert \ \lceil L \rceil \ [\![ D ]\!] \ I
\end{array}$

Which, upon matching I D I D I, would produce the sub-tree (L I I I). Using $I$ as a list item non-terminal with dropped delimiters is the preferred way to parse most lists in most environments, though it is necessary to examine the empty version of this transform if empty lists are needed.

Token Delimiters

Suppose that we wished to keep a marker that a delimiter had been matched, but not it's content. In this case, we would describe the initial list, and the transform, as:

$\begin{array}{rcl}
L = I, \{ \lfloor D \rfloor I \} & \Rightarrow &
L = I \ \vert \ \lceil L \rceil \ \lfloor D \rfloor \ I
\end{array}$

Which, upon matching I (D a) I (D b) I, would produce the sub-tree (L I D I D I).

Delimiters on Ungrouped Non-Empty Lists

When working with ungrouped lists, it is sometimes useful to maintain the delimiters in order to separate the groups at a later time. The following transform provides this:

$\begin{array}{rcl}
L & = & \Phi, \{ D , \Phi \} \\
& \Downarrow & \\
L & = &...
... \vert \ \lceil L \rceil \ D \ \lceil I \rceil \\
I & = & \Phi \\
\end{array}$

If $\Phi$ were $A \ B \ \vert \ C$, upon matching A B D C, these rules would produce the sub-tree (L A B D C).

Variable Delimiters

The same collection technique used for grouping list items should be used in the case that multiple delimiters are permitted for a given list. The various delimiter options should be extracted into their own non-terminal, and appropriate re-write action should be taken on the resultant non-terminal if necessary. Letting $\Phi$ be a series of production alternatives for the delimiter lacking $\epsilon$:

$\begin{array}{rcl}
L & = & I, \{ \Delta, I \} \\
& \Downarrow & \\
L & = & I...
... & \\
L & = & I \ \vert \ \lceil L \rceil D \ I \\
D & = & \Delta
\end{array}$

Empty Delimited Lists

Delimited lists which can be empty require the most complex list transformations to realize. For starters, neither their items nor their delimiters can yield $\epsilon$ productions without conflicting with the $\epsilon$ productions of the list itself. Also, if the list is given an $\epsilon$ production and a recursive production containing the delimiter $D$, then it would be possible to match D I, which is not what is intended. As a result, we will introduce an additional non-terminal, $\hat{L}$, which will be processed with the content action wherever it occurs. The following transformation provides for empty delimited lists matching against the item symbol $I$ and the delimiter symbol $D$.

$\begin{array}{rcl}
L & = & ( I, \{ D \ I \} ) \ \vert \ \epsilon \\
& \Downarr...
...L} \rceil \\
\hat{L} & = & I \ \vert \ \lceil \hat{L} \rceil D \ I
\end{array}$

More complex item and delimiter groups can be realized using techniques from 4.1.1. Grouped Lists and 4.2.4. Variable Delimiters. All of the usual AST re-writes remain applicable with this transformation situation, and a complete enumeration of the various possibilities is not needed here.

A General Form

There is a general target form which, with the application of different AST re-write actions on $I$ and $D$, is capable of realizing most list semantics. If $\Phi$ and $\Delta$ be a series of production alternatives lacking $\epsilon$:

$\begin{array}{rcl}
L & = & \epsilon \ \vert \ \lceil \hat{L} \rceil \\
\hat{L}...
...\lceil \hat{L} \rceil D \ I \\
I & = & \Phi \\
D & = & \Delta \\
\end{array}$

Because of its flexibility, this form is a good target form for automated tools which provide list semantics.

Bibliography

1
Joost Visser, and Jeroen Scheerder
A Quick Introduction to SDF.
2000.

http://homepages.cwi.nl/~jvisser/papers/sdfintro.pdf



2
Jan Kort, Ralf Lämmel, and Chris Verhoef
The Grammar Deployment Kit.
In Electronic Notes in Theoretical Computer Science 65 No. 3, 2002.



3
John Levine, Tony Mason, and Doug Brown
lex & yacc.
O'Reilly, 2nd Edition, 1992.
ISBN: 1-56592-000-7



4
Holger M. Kienle
Using smgn for Rapid Prototyping of Small Domain-Specific Languages.
In ACM SIGPLAN Notices V.36(9), September 2001.



5
The GNU Project
Bison.

www.gnu.org/software/bison/



6
Scott E. Hudson
CUP Parser Generator for Java.

www.cs.princeton.edu/~appel/modern/java/CUP/



7
Don Yessick, and Joel Jones
Reinventing the Wheel Or Not Yet Another Compiler Compiler Compiler.
In Southeast ACM Conference, 2002.



8

Java Compiler Compiler [tm] (JavaCC [tm]) - The Java Parser Generator.

javacc.dev.java.net



9

Yacc++ and the Language Objects Library.

world.std.com/~compres/



10
International Organization for Standards
ISO/IEC 14977 - Extended BNF.
1996.



About this document ...

List Resolution Techniques for Grammar Development

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split=0 -tmp=/Users/crutcher/tmp/tmp/ dunnavant2004list_resolution.tex

The translation was initiated by Crutcher Dunnavant on 2006-03-08


Crutcher Dunnavant 2006-03-08