In my research, I try, to whatever extent possible, to allow myself to be guided by the ideals of beauty and elegance. It’s not always easy to achieve these ideals, and sometimes it’s hard to justify prioritizing them over more mundane and practical concerns. But I believe there is a deep connection between truth and beauty, as John Keats pointed out long ago, and so I try to keep beauty at the front of my mind whenever possible.
(In my last post, I made fun of the PCFG formalism, which I think is both ugly and confusing).
The most aesthetically important component of the system is the tool that produces a visual description of a parse tree. I spend a lot of time looking at these images, so if they don’t look nice, I get a headache. And more importantly, if the image is too cluttered or confusing, it impedes my ability to understand what’s going on in the grammar engine.
So I am very excited about one of the features in the most recent development cycle, which is a new, slimmed-down look for the parse trees. To motivate this, observe the following image, which was built using the old system’s parse viewer (if it is too small, right click and select “view in new tab”):
There are a couple of annoying things going on in this image. First of all, notice how in the final prepositional phrase (“on gun ownership”), there is one main link labelled with on, and then another subsidiary link labelled on_xp. The main link is meaningful, because it expresses the logical connection between the words “restrictions” and “ownership”. On the other hand, the subsidiary link really doesn’t tell you very much. It’s required by the inner workings of the parsing system, but it doesn’t help you understand the structure of the sentence.
Another example of this redundancy is the pos_zp role that links the name “Bloomberg” to the possessive “’s”. Again, the system requires this role before it will allow the owner role, but the latter role is the one that is actually informative.
The new visualization system removes the “preparatory” links from the parse image. This removes the redundant clutter, and it also has a nice additional benefit of reducing the height of the image.
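As a rough illustration of this filtering step, here is a minimal Python sketch. It is not the actual Ozora code: the `Link` type, the `PREPARATORY_ROLES` set, and the `visible_links` function are hypothetical names, and the endpoints of the on_xp link are a guess made only so the example has something to drop. The role names themselves (on, on_xp, pos_zp) come from the image discussed above.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """One labelled link in a parse (hypothetical representation)."""
    head: str
    dependent: str
    role: str


# Assumption: preparatory roles are listed explicitly. The real system
# may identify them in some other way.
PREPARATORY_ROLES = {"on_xp", "pos_zp"}


def visible_links(links: List[Link]) -> List[Link]:
    """Keep only the links that should be drawn in the slimmed-down view."""
    return [link for link in links if link.role not in PREPARATORY_ROLES]


links = [
    Link("restrictions", "ownership", "on"),  # informative main link
    Link("ownership", "on", "on_xp"),         # preparatory; endpoints are a guess
    Link("Bloomberg", "'s", "pos_zp"),        # preparatory link to the possessive
]

drawn = visible_links(links)
```

The parser still needs the preparatory links internally, so the sketch filters at drawing time and leaves the underlying parse untouched.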
Another difference, which should be pretty obvious, is the change in the part-of-speech annotations that are listed underneath each word. In the old system, I was using a POS system that was basically descended from old PCFG concepts. So the word “restrictions” was labeled with NNS, while “joined” was labeled with VBA. Now instead of those somewhat cryptic symbols, we just have N[-s], which indicates a plural noun, for “restrictions”, and V[-ed], which indicates a past-tense verb, for “joined”.
In other cases, I’ve left out the POS annotations entirely, because they’re obvious from the role label. For example, here’s the old and new output on a simple sentence:
As you can see in the old version, the word “has” is marked with a special POS tag HAVE. Now, it is an important technical point that the form “has/had/have” is its own grammatical category, and therefore in principle it should have its own symbol. However, the viewer of the parse doesn’t need to be reminded of this, since the word is marked with the special aux_have link, which cannot connect to any other category.
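The display rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ROLE_IMPLIES_POS` table and the `pos_label` function are hypothetical names, and I am assuming for the example that the viewer knows which role attaches to each word.

```python
from typing import Optional

# Assumption: a small table of roles that can only attach to one category.
# Only the aux_have / HAVE pair comes from the example above.
ROLE_IMPLIES_POS = {"aux_have": "HAVE"}


def pos_label(pos: str, role: Optional[str]) -> str:
    """Return the POS text to draw under a word, or "" if the role implies it."""
    if role is not None and ROLE_IMPLIES_POS.get(role) == pos:
        return ""
    return pos


has_label = pos_label("HAVE", "aux_have")   # suppressed, role already says it
noun_label = pos_label("N[-s]", "on")       # still shown
```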
Question for readers: which visualization format do you prefer? Is the new version easier to understand or not?
A closing thought on the power of writing up ideas for presentation: in the course of writing this post, I noticed a subtle bug in the way the new version of the system is handling punctuation. Do you see the boxes around the periods at the end of the sentences? The color indicates the amount of codelength required to send the sentence-ending punctuation. The green boxes indicate that the (old) system is using a very short code, which is feasible for encoding periods, because almost all normal declarative sentences end with periods. The yellow boxes indicate an increased codelength cost. Now in the images from the old version, the boxes are green, but in the new version, the boxes are yellow. This indicates that the new version has a bug in the component responsible for predicting the sentence-final punctuation.
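For readers unfamiliar with codelengths, here is a minimal Python sketch of the idea behind the colored boxes. It uses the standard Shannon codelength, which is the negative base-2 logarithm of the predicted probability. The function names and the one-bit threshold between green and yellow are assumptions chosen for illustration, not the values the viewer actually uses.

```python
import math


def codelength_bits(probability: float) -> float:
    """Shannon codelength, in bits, of an outcome with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def box_color(bits: float, green_max: float = 1.0) -> str:
    """Map a codelength to a box color (the threshold is hypothetical)."""
    return "green" if bits <= green_max else "yellow"


# A model that expects a period almost every time pays very little for it.
confident = box_color(codelength_bits(0.9))
# A model that predicts the period poorly pays several bits.
unsure = box_color(codelength_bits(0.1))
```

Under this scheme, a box that changes from green to yellow means the model has become less confident about something that should be easy to predict, which is why the color change points to a bug.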
Thanks for reading! Please feel free to comment on this page directly, or reach out to me through the “contact” link, if you are interested in talking about NLP, sentence parsing, grammar, and related issues.
 – Several years ago, when I was working on the early versions of the Ozora parser, I actually used a grammar formalism based on the PCFG. There were a number of technical problems with this formalism, but these technical problems probably would not have proved compelling on their own. In addition to the technical issues, though, there was also a huge aesthetic issue: the resulting parse trees didn’t look good, and the problem got worse and worse as the sentence got bigger. Because of the way English sentences branch, PCFG visualizations tend to appear very tall, and angled down and to the right. Consider the following sentence:
The debate over the referendum was rekindled in Israel after reports that Naftali Bennett , a minister whose Jewish Home Party opposes the establishment of a Palestinian state , was soliciting the support of Yair Lapid , the finance minister and leader of the centrist Yesh Atid Party , for new legislation .
This is a long sentence, and any visualization tool is going to struggle with the width. But when faced with a long sentence, PCFG tree images also have a huge problem with height. I parsed this sentence using Microsoft’s Linguistic Analysis API, and here’s a slice of what came out:
You can see the huge gap between the POS annotations at the top of the image, and the words at the bottom. Almost half the screen space is entirely wasted. In contrast, here’s the Ozora visualizer’s output for the left half of the sentence:
As you can see, there is not nearly as much wasted space in the visualization, and it is much easier to understand the logical relationship between the various words.
 This concept of VBA was something I was quite proud of when I came up with it, though I am no longer using it. The idea behind the VBA category relates to the fact that, while most English verbs have only four distinct conjugations, some verbs have a fifth conjugation, called the past participle, which normally has an -en suffix. Examples are “broken”, “spoken”, “eaten”. This form is usually denoted VBN, while the regular past is denoted VBD. Now if the verb has no distinct VBN, then you are allowed to use the VBD instead. But if it does have a VBN, you must use it or the sentence will be ungrammatical. Consider:
I have visited London three times in the past year.
I have spoke to the president about the issue. ***
Mike wanted to buy a used car.
John was able to fix the broke radiator. ***
In these sentences, the first example of each pair is grammatical, because the verbs “visit” and “use” have no distinct VBN form. The second examples, marked with ***, are ungrammatical, because the verbs “speak” and “break” have distinct VBN forms, which must be used in the given context.
It’s actually a bit tricky to express this rule succinctly in the parsing engine. You have to say something like “allow VBD, but only if the verb does not have VBN”. In other words, the parsing system has to query back into the lexicon to determine if an expression is grammatical.
To avoid this, my idea was to package words with ambiguous VBD/VBN conjugations, like “visited” and “joined”, together in the VBA symbol. Verbs that had separate fifth conjugations would produce VBD or VBN as appropriate, but the four-conjugation verbs would produce only VBA. Then you could express the grammar rule in a succinct way: accept either VBA or VBN but not VBD.
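Here is a small Python sketch of how the VBA idea might look in practice. It is a toy reconstruction, not the old parser's code: the `LEXICON` layout and the function names `tag_forms` and `allowed_in_participle_slot` are hypothetical. The point it illustrates is that the lexicon is consulted once, when tags are assigned, so the grammar rule itself is just a set-membership test.

```python
# Toy lexicon (hypothetical layout): each verb lists its past and
# past-participle forms. The verbs are the ones used in the examples above.
LEXICON = {
    "visit": {"past": "visited", "participle": "visited"},
    "use":   {"past": "used",    "participle": "used"},
    "speak": {"past": "spoke",   "participle": "spoken"},
    "break": {"past": "broke",   "participle": "broken"},
}


def tag_forms(lexicon):
    """Assign VBA to ambiguous past forms, VBD/VBN to distinct ones."""
    tags = {}
    for entry in lexicon.values():
        if entry["past"] == entry["participle"]:
            tags[entry["past"]] = "VBA"        # four-conjugation verb
        else:
            tags[entry["past"]] = "VBD"        # simple past only
            tags[entry["participle"]] = "VBN"  # distinct fifth form
    return tags


# The grammar rule: a participle slot accepts VBA or VBN but never VBD.
PARTICIPLE_SLOT_TAGS = {"VBA", "VBN"}


def allowed_in_participle_slot(word, tags):
    """Check the rule without querying the lexicon again."""
    return tags[word] in PARTICIPLE_SLOT_TAGS


TAGS = tag_forms(LEXICON)
```

With these tags, “have visited” and “have spoken” pass the check, while “have spoke” is rejected, matching the judgments on the example sentences.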