Parser generator - Architecture

From Laja
Jump to: navigation, search

Parser generation

The following diagram tries to explain the steps performed when generating a paser in Laja.

  1. The user can be the developer that wants to parse some text format.
  2. Two text files is created, a Laja template file that specifies code generation details (like output directory, package etc) and a grammar file that defines the grammar rules and how the parsed data is organized into objects and messages/events.
  3. The Laja code generator now generates a parser based on the grammar and settings.
  4. The parser consists of the parser code and interfaces that corresponds to the objects defined in the grammar file (2).
  5. This is the program that uses the parser. It needs to create classes that implements the interfaces in the parser (the factory and objects). This results in a separation of the generated parser and the code that uses the parser. These classes can be auto generated by Laja, but should after that be manually maintained.


Laja-parser-generator.png

An example

Code is best explained by examples. Let's look at example2 in directory LAJA_HOME/example/java/example2.

In this example we want to parse the text file example2.txt that looks like this:

age: 18
name: Roger
country: Sweden!

To be able to parse this format we need to define the parsing rules (grammar). It would also be nice if we could have the output from the parser packaged into an object structure. In Laja we can define how the grammar should be structured into objects and messages/events.


In the grammar file, you define both grammar rules and how they relate to objects. Take a look at example2.grammar (row numbers are not included in the original file):


 1. grammar example2 {
 2.    s = [" "]+;
 3.    newline = "\r\n" | "\n";
 4.
 5.    letter = "a".."z";
 6.    digit = "0".."9";
 7.    label = letter [digit|letter]+;
 8.    row = label ":" s [!(newline|END)+]:value [newline];
 9.    example2 = row+;
10.
11.    Row row.setLabel(String label);
12.    row.setValue(String value);
13.
14.    Example2 example2.addRow(Row row);
15. }

Row 2 to 9 contains the grammar definition, row 11-14 contains the mapping information which is needed if we want to receive any data from the parser.

Row Description
1 the first thing in the grammar file is the word grammar followed by the name of the grammar, example2 in this example. The main node example2 is defined in row 9, and is mandatory.
2 [x]+ means zero or more times of x. Here we allow any number of spaces to occure. We need to take care of white spaces in the grammar.
3 the character | is the OR operator. We accept both Windows style \r\n and Unix style \n end of lines.
5 any letters between "a" and "z" (a and z included). Case sensitive is default.
6 any number between 0 and 9.
7 a letter otionally followed by one or more digits or letters, e.g: x, x2, ab93c.
8 the statement
[!(newline|END)+]:value
means: take everything until we have reached the end of the line or the end of the source and let it have the name value.
9 example2 is the main definition of the whole grammar and says that we expect to find one or more rows.
11 declares row to be handled by the object Row. When a matching label is found, the statement row.setLabel(String label) is executed on the current instance of Row.
12 when a matching value is found, the statement row.setValue(String value) is executed on the current instance of Row.
14 declares the top node element2 to be handled by the object Example2. When a matching row is found, the statment example2.addRow(Row row) is executed on the instance of Element2.

Parsing Algorithm

The following diagram shows how the generated parser works. The parser Example2Parser.java is used as an example. Laja perform parsing in two steps. In phase one the text is parsed and matched against the grammar and all decisions along the way are saved in a list. In phase two, all desicions are read from that list, target objects (instances of Example2 and Row in this example) are created by the factory, and messages/events ar sent (executing set and add methods) to those instances.

The object hierarchy is defined together with the grammar in the file example2.grammar. We are in most of the cases interested to package the data we parse in some kind of data structure. In Laja these objects are represented by plain old Java objects (POJO's) and is created by the factory by calling the methods createExample2() and createRow() in this example.

In phase two Laja begins with putting an instance of Example2 (lets call that instance example2) on top of the Example2 stack. This is now the base and top instance of our object hierarchy and will remain on the stack during the whole parsing.

When a matching row is found in phase two, an instance of Row (lets call it row) is put on top of row's stack. When the whole definition of row has been parsed and the statements row.setLabel(String label) and row.setValue(String value) has been called, then the method example2.addRow(Row row) is called with row as argument. The top node exmample2 is the result from the parsing.

Laja-parser-generator-algorithm.png


Lets having a look at a snippet from the generated parser Example2Parser.java (to generate the parser, go to directory LAJA_HOME/example/java/parsergenerator/example2 and type laja example2):

 1.    private net.sf.laja.parser.engine1.element.Element getGrammar2() {
 2.        // *** Output classes ***
 3.        Data.RowRowLabel rowRowLabel = data2.new RowRowLabel("rowRowLabel");
 4.        Data.RowRowValue rowRowValue = data2.new RowRowValue("rowRowValue");
 5.        Data.Example2Example2Row example2Example2Row = data2.new Example2Example2Row("example2Example2Row");
 6.
 7.        // *** Declarations and Statements ***
 8.        Optional s = new Optional(1, "s");
 9.        OrList newline = new OrList(2, "newline");
10.        Range letter = new Range(3, "letter", "a", "z");
11.        Range digit = new Range(4, "digit", "0", "9");
12.        ElementList label = new ElementList(5, "label");
13.        ElementList row = new ElementList(6, "row");
14.        Repeat example2 = new Repeat(7, "example2");
15.
16.        // s = [" "]+
17.        Repeat s_1 = new Repeat(8, "s_1");
18.        s_1.add(10, new Str(9, " "));
19.        s.add(11, s_1);
20.
21.        // newline = "\r\n" | "\n"
22.        newline.add(13, new Str(12, "\r\n"));
23.        newline.add(15, new Str(14, "\n"));
24.
25.        // letter = "a".."z"
26.
27.        // digit = "0".."9"
28.
29.        // label = letter [digit|letter]+
30.        label.add(16, letter);
31.        Optional label_1 = new Optional(17, "label_1");
32.        Repeat label_1_1 = new Repeat(18, "label_1_1");
33.        OrList label_1_1_1 = new OrList(19, "label_1_1_1");
34.        label_1_1_1.add(20, digit);
35.        label_1_1_1.add(21, letter);
36.        label_1_1.add(22, label_1_1_1);
37.        label_1.add(23, label_1_1);
38.        label.add(24, label_1);
39.
40.        // row = label ":" s [!(newline|END)+]:value [newline]
41.        row.add(25, label, rowRowLabel);
42.        row.add(27, new Str(26, ":"));
43.        row.add(28, s);
44.        Optional row_1 = new Optional(29, "row_1", rowRowValue);
45.        Repeat row_1_1 = new Repeat(30, "row_1_1");
46.        OrList row_1_1_1 = new OrList(31, "row_1_1_1");
47.        row_1_1_1.add(32, newline);
48.        row_1_1_1.add(34, new End(33, "row_1_1_1"));
49.        row_1_1.add(35, row_1_1_1, NOT);
50.        row_1.add(36, row_1_1);
51.        row.add(37, row_1);
52.        Optional row_2 = new Optional(38, "row_2");
53.        row_2.add(39, newline);
54.        row.add(40, row_2);
55.
56.        // example2 = row+
57.        example2.add(41, row, example2Example2Row);
58.
59.        return new TopElement(data2, example2);
60.    }

As you can see, the generated parser code is quite readable. It is generated from the files example2.laja and example2.grammar. The target classes (receiver of events) are declared first and corresponds to the mapping declarations in example2.grammar' (line 2 to 9). The section declarations and statements contains the code that is a result of the grammar definition in example2.grammar (line 11 to 14).

Row Description
1 The 2 in the method name getGrammar2 indicates that this method is called in phase 2.
2-5 These objects handles the events (often set or add methods) to the current object of that type which is the top elements of the stacks. These objects does not exist in method getGrammar1() because no events are sent in phase 1.
8-14 Declares the basic elements. All these wrapper classes Optional, OrList, Range, ElementList and Repeat implements different behaviour and is part of the parser engine.
16,21,25,27,29,40,56 Corresponds to row 3 to 9 in the grammar file (example2.grammar').
17-18 Laja translates [" "]+ to [" "+]. The outer [] is declared on line 8 (Optional) and " "+ on line 17-18.
19 Puts " "+ inside the Optional object.
22-23 Defines the new line character(s) for Unix and Windows.
30-38 Defines label.
41-54 Defines row.
44 The instance rowRowValue takes care of the event handling. The construct [x]:name will execute row.setValue(String value) even if no matching x was found. Change to [x:name] if the setter should be executed only if a matching x was found.
57 Defines example2. The variable example2Example2Row takes care of the event handling example2.addRow(Row row) for every matching row.
59 Needed for the parser to handle events correctly on example2.


The object structure looks like this:

800px