Lake symbols for island parsing

Katsumi Okuda1 and Shigeru Chiba2

The Art, Science, and Engineering of Programming, 2021, Vol. 5, Issue 2, Article 11

Submission date: 2020-02-06
Publication date: 2020-11-01
Full text: PDF


Context: An island parser reads an input text and builds the parse (or

abstract syntax) tree of only the programming constructs of interest in

the text. These constructs are called islands and the rest of the text

is called water, which the parser ignores and skips over. Since an

island parser does not have to parse all the details of the input, it is

often easy to develop but still useful enough for a number of software

engineering tools. When a parser generator is used, the developer can

implement an island parser by just describing a small number of grammar

rules, for example, in Parsing Expression Grammar (PEG).

Inquiry: In practice, however, the grammar rules are often complicated

since the developer must define the water inside the island; otherwise,

the island parsing will not reduce the total number of grammar

rules. When describing the grammar rules for such water, the developer

must consider other rules and enumerate a set of symbols, which we call

alternative symbols. Due to this difficulty, island parsing seems to be

not widely used today despite its usefulness in many applications.

Approach: This paper proposes the lake symbols for addressing this

difficulty in developing an island parser. It also presents an extension

to PEG for supporting the lake symbols. The lake symbols automate the

enumeration of the alternative symbols for the water inside an

island. The paper proposes an algorithm for translating the extended PEG

to the normal PEG, which can be given to an existing parser generator

based on PEG.

Knowledge: The user can use lake symbols to define water without

specifying each alternative symbol. Our algorithms can calculate all

alternative symbols for a lake symbol, based on where the lake symbol is

used in the grammar.

Grounding: We implemented a parser generator accepting our extended PEG

and implemented 36 island parsers for Java and 20 island parsers for

Python. Our experiments show that the lake symbols reduce 42 % of

grammar rules for Java and 89 % of rules for Python on average,

excluding the case where islands are expressions.

Importance: This work eases the use of island parsing. Lake symbols

enable the user to define the water inside the island simpler than

before. Defining water inside the island is essential to apply island

parsing for practical programming languages.