Part 8 of 10: Chasing our tail

Our goal has always been to understand a list of email addresses. We would like to apply the full_email_address rule over and over until our whole list is consumed. Can we? Yes, we can!

Remember, our parser is like the poor UN translator — it can only take things in the order it receives them. So how about if we think of our email list using the following rule:

A full email address

followed by (maybe) some spaces

followed by either a comma or a semicolon

followed by (maybe) some more spaces

followed by another email list.

Basically, an email list is an email item followed by an email list (which is an email item followed by an email list, which is...). That's the principle of Tail Recursion. Tail Recursion is an important part of grammars as it allows us to process a list when we don't know at the start how many items there will be in it.

We can represent this in Treetop as follows:

grammar EmailList
  rule email_list
    full_email_address [ ]* (','/';') [ ]* email_list
  rule full_email_address
    optional_email_name [ ]* '<' email_address '>'
  rule optional_email_name
    '"' email_name '"' / ''
  rule email_name
  rule email_address

Notice that we put our new rule at the head of the list. The first rule of our grammar is called the Root. It represents what, all together, we are looking for. Since we are looking for an email list, it's appropriate that our root rule is email_list.

You may be asking why we used (','/';') instead of the simpler [,;]. The reasons are rather difficult to explain, having to do with backtracking capability (the ability to pretend we never translated something). From your point of view, items you put in Regular Expressions will not generate error messages when they're missing, whereas items in quotes will.

This is all very well and fine if we are processing an infinite email list. But email lists, like all good things, have to end some time. So we need to build in an way to end our email list. We already know how to use options, because we did that in the previous tutorial.

Modify your grammar as follows to make the remainder of the list optional:

  rule email_list
    full_email_address [ ]* optional_email_list
  rule optional_email_list
    (','/';') [ ]* email_list / ''

Now we need a way to access the list of email names. What do you think is the easiest way? Of course, you're right! We add a Ruby routine to extract the email name and address, and then we append the return value from our recursive call to that list. Your code will look something like this:

  rule email_list
    full_email_address [ ]* optional_email_list {
      # return an array of email pairs,
      # each pair consisting of an (optionally empty) email name
      # followed by an email address.
      def list_of_email_pairs
        # start out with the email name and address for the first email
        email_list = [ [full_email_address.optional_email_name.email_name_text_value, full_email_address.email_address.text_value] ]
        # unless we are the last pair,
        unless optional_email_list.terminal?
          # add in the pairs from the rest of the list
          email_list += optional_email_list.email_list.list_of_email_pairs

The modifications to the email list program should be easy. Just use the each method to display the pairs according to our original intent:

  puts "You said the following email addresses:"
  result.list_of_email_pairs.each do |email_pair|
    print "  #{email_pair[1]}"
    print ", #{email_pair[0]}" unless email_pair[0].empty?
    print "\n"

Give it a spin with our first two full addresses together:

"Jena L. Dovie" <>, <>

Works? Good! Okay, at this point, you have learned quite a bit about Treetop, and you could go on to write more of your own grammars with a little help from the (sparse) documentation.

However, there's one email address we haven't figured out how to parse: Charmain Lashunda  <> To parse that address we have to look at some of the advanced features of Treetop grammars. Read on, if you dare!