Part 10 of 10: Summary and Advanced Debugging

Let's summarize what we've learned as well as look at more in-depth debugging techniques to help you write more complex grammars.

Summary of Treetop Features

Comments start with "#", just like Ruby. They capture everything up to the end of the line, just like Ruby.

Grammars are identified by the keyword grammer, has a CamelCase name, and ends with the word end, like Ruby classes and modules.

Note: you can also have modules in Treetop grammars, just like Ruby. Modules can be used to encapsulate complicated subgrammars. You can use the include keyword to include other grammars.

Grammars consists of Rules. Each rule starts with the keyword rule, has a lower-case name, and ends with the word end, like Ruby methods.

Each rule generates an instance of the class SyntaxNode, extended with the methods you defined between the curly brackets. Methods to access the named elements (child nodes) are also included.

Some useful methods available to a syntax node:

terminal? True if the node doesn't have children.
non_terminal? True if the node does have children.
text_value The text the node has matched.
elements An array of all the children.
inspect Pretty-print the contents of the node and (recursively) of its children.

A simple program to load a Treetop grammar grammar_name and to parse something with it is:

#!/usr/bin/env ruby
require 'rubygems'
require 'treetop'
Treetop.load grammar_name
# or include grammar_name if Polyglot gem installed
parser=GrammarNameParser.new
result=parser.parse(input)

The parse method returns nil if the input cannot be parsed by the grammar. In that case parser.failure_reason can be used to determine where and for what reason the parser failed. (Unless the reason is because the parser could not consume all the input. In that case, you can use parser.index to determine how far it got.)

The parse method calls the first rule defined in the grammar. You can override this by setting parser.root.

A rule consists of:

  • Terminals (text in quotes)
  • Non-Terminals (names of other rules)
  • Regular Expressions (usually inside square brackets)
  • Iterations (* matches 0 or more; + matches 1 or more, ? matches 0 or 1)
  • Choices (separated by /)
  • Lookaheads (identified by ! or &)
  • Method Definitions (enclosed in curly brackets)
  • and, of course brackets to override the standard order.

Here are some sample rules to illustrate the types of components:

grammar SampleGrammar
  rule root_rule
    'terminal' non_terminal [regular expression]
  end
  rule sample_iterations
    optional_non_terminal?
    'terminal repeated 0 or more times'*
    [expression repeated one or more times]+
  end
  rule choice_sample
    a / b tried_if_cant_match_a_and_does_match_b
  end
  rule bracket_override
    (a / b) tried if matches_a_or_matches_b
  end
  rule lookaheads
    matches_me !as_long_as_not_followed_by_me /
    matches_me &only_if_followed_by_me
  end
  rule with_methods
    non_terminal {
      def sample_method
        can_refer_to_by_name(non_terminal)
      end
    }
  end

Advanced Debugging Capabilities

The tt command provided with Treetop allows you to compile your Treetop grammar into Ruby code. The compiler takes your file_name.treetop file and creates a file_name.rb file. If you have problems loading your Treetop grammar, use the command-line and tt to get more details on why your grammar does not compile.

If your grammar does not parse what you think it should parse, you can use the generated Ruby file to figure out what is going on.

A sample rule will generate code like this (at least in Treetop 1.4.4 — this aspect of Treetop is the one most likely to change):

def _nt_double_quote
  # (some code about cached info)

  if has_terminal?('"', false, index)
    r0 = instantiate_node(SyntaxNode,input, index...(index + 1))
    @index += 1
  else
    terminal_parse_failure('"')
    r0 = nil
  end

  node_cache[:double_quote][start_index] = r0

  r0
end

You can insert debug statements into this code and use require in place of Treetop.load to help you understand how the compiled parser is parsing your text.

Finally, you can start up the interactive interpreter irb and execute your test code from there. In irb, you can call Treetop.load; parser = GrammarNameParser.new to reload your grammar each time you change it. In irb, you can test out the value of the various syntax nodes and of the parser to determine which grammar will work for you.

If your grammar fails to parse, you can still get a sense of how far it got by examining parser.instance_variable_get :@node_cache. But by now you're getting into pretty heavy debugging...

So that's it! You should have enough knowledge by now to set you on your way. Happy parsing!

Previous

Part 9 of 10: Looking ahead

There is one way that our Treetop grammar is different than the harried UN translator. Treetop grammars have a lookahead function, which allows them to peer a short way into the future, rather like Nicholas Cage in Next. We need that feature in order to handle the last case of our email list...

Part 8 of 10: Chasing our tail

Our goal has always been to understand a list of email addresses. We would like to apply the full_email_address rule over and over until our whole list is consumed. Can we? Yes, we can!...

Part 7 of 10: Has/has not

Let's look at the next contact in our email list. This one just has an email address. So the email name portion is optional. We can represent this in Treetop as follows...

Part 6 of 10: Testing Your Understanding

Yes! Our program understands! But how can we be sure? Can a Treetop parser transform what it claims to have understood?...