Against Phrasal Taxonomy Grammar

This post is about the problems with approaches to grammar theory that depend on the creation of a phrasal taxonomy. A phrasal taxonomy is a system for categorizing phrases into discrete categories like “Noun Phrase” (NP) or “Verb Phrase” (VP). At first glance, the idea seems quite plausible. Most people have a strong intuitive feeling that phrases like “John”, “the man on the train”, or “a red brick house” are all members of the NP category, while phrases like “kicked a ball”, “flew”, or “raised a family” are all VPs.

In addition to our intuition, the phrasal taxonomy concept is also supported by an important empirical phenomenon. Linguistic terminology is terrible, so in this post I’ll just call the phenomenon pluggability. To illustrate the idea of pluggability, we write down a couple of simple sentences, and then delete a couple of words to leave a blank:

  1. John ____ to the store.
  2. Sally ____ with her mother.
  3. Mike and Bob ____ .
  4. My friend ____ .

Then we brainstorm a bit and try to come up with some completion phrases that could fill the blank spot. In this case, it’s easy to come up with some good ones:

  1. rode a bike
  2. spoke quickly
  3. danced a polka
  4. threw a party

Here’s the amazing thing: all of the phrase candidates can be “plugged into” all of the original sentences. And we could easily write down many more completion phrases and blanked sentences that would work just as well. This is the core insight of pluggability. Pluggability suggests that if we can document and characterize all the sets of phrases that can be plugged into various types of blanks, we’ll have made a major step towards understanding grammar.
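The pluggability idea can be sketched mechanically. This is just a toy illustration with the sentences and phrases from above: every completion fills every blank, giving 4 × 4 = 16 grammatical sentences.

```python
# Toy illustration of pluggability: every completion phrase can fill
# every blanked sentence, yielding a full cross product of sentences.
templates = [
    "John ____ to the store.",
    "Sally ____ with her mother.",
    "Mike and Bob ____ .",
    "My friend ____ .",
]
completions = ["rode a bike", "spoke quickly", "danced a polka", "threw a party"]

sentences = [t.replace("____", c) for t in templates for c in completions]
for s in sentences:
    print(s)
```

Every one of the sixteen outputs is grammatical, though some (like the haunted store below) are semantically odd.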

At this point it’s important to point out that, as grammarians, we’re primarily interested in grammatical acceptability rather than semantic plausibility. If the grammatical structure of a sentence is clear, the sentence is acceptable even if it expresses a bizarre, whimsical, or fantastical event. For example, if we insert phrase 2 into sentence 1 above, the resulting sentence is certainly grammatical, but sounds very strange in terms of meaning (perhaps the store is haunted).

Based on the observations of the above blanked sentences and completion phrases, let’s declare victory and claim that we’ve identified our first phrase category. We’ll call this phrase category VP. Let’s keep working and look at another type of phrase. Consider the sentences:

  1. The woman was arrested by ____ .
  2. ____ started a technology company.
  3. Ryan attended a meeting with ____ .
  4. Mark and ____ travelled to Russia.

Consider the following phrases as valid completions for the above sentences:

  1. Jim
  2. the man in black
  3. a tree
  4. four Japanese government officials

Again we see that despite the wide diversity of the completion phrases, each one can still complete every sentence, with varying degrees of semantic plausibility. So we’ll call the phrases in the above list “Noun Phrases”, or NPs for short.

It looks like we have made a good start: we’ve got a good handle on the two most important types of phrases in English. Of course we are going to have to do a lot of work to refine our taxonomy to cover the full range of English grammar, but probably (it seems at the outset) there will not be too many phrase categories – perhaps a couple of hundred? It doesn’t seem like it will be unreasonably difficult to document all of these categories. After all, the language must be easy enough that normal human children can learn it reliably.

What do we do with the taxonomy once we’ve completed it?

There are a variety of paths one might consider as ways to reach the ultimate destination, which is a complete theory of grammar. The most obvious path is to use the taxonomy to describe the ways in which phrase structures can be combined together.

For example, take the basic structures VP and NP as mentioned above, plus a structure S for sentence, a symbol PP for prepositional phrase, and Prep for an individual preposition word like “to”, “from”, or “on”. We can then write down the following rules:

  1. S :: NP + VP
  2. VP :: VP + NP
  3. VP :: VP + PP
  4. PP :: Prep + NP

Rule 1 says that a sentence is an NP subject plus a VP. Rule 2 says a VP can “split off” an NP representing a direct object. Rule 3, similarly, says that a VP can split off a prepositional phrase. And rule 4 shows that a PP is just an NP with a single preposition word in front of it. Again, this set of rules is hardly complete, but it seems plausible that it is a good start, and that with a serious effort we could identify the full set of rules for English.
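To see the rules in action, here is a small random generator built on the four rules above. The terminal rules (the actual words for NP, VP, and Prep) are my own hypothetical additions so that the recursion can bottom out; a depth limit steers the recursive VP rules toward non-recursive expansions.

```python
import random

# The four rules from the text, plus hypothetical terminal rules
# (the words are my own examples) so that generation terminates.
rules = {
    "S":    [["NP", "VP"]],
    "VP":   [["VP", "NP"], ["VP", "PP"], ["walked"], ["kicked"]],
    "PP":   [["Prep", "NP"]],
    "NP":   [["John"], ["the ball"], ["the store"]],
    "Prep": [["to"], ["from"]],
}

def generate(symbol, depth=0):
    if symbol not in rules:                    # terminal word
        return symbol
    options = rules[symbol]
    if depth > 2:                              # past the limit, avoid self-recursion
        nonrecursive = [o for o in options if symbol not in o]
        options = nonrecursive or options
    expansion = random.choice(options)
    return " ".join(generate(s, depth + 1) for s in expansion)

print(generate("S"))
```

Running this a few times produces sentences like “John kicked the ball to the store” – structurally plausible, if semantically random.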

This approach is the basis for one of the most common types of grammar theory, the Probabilistic Context Free Grammar (PCFG). Chomsky famously used context-free phrase structure rules (without the probabilistic component), along with another concept called “syntactic movement”, in his early theories of grammar. Because this type of formalism is widely used in mainstream linguistics, it was also adopted by many researchers in machine parsing. The first and most important parse tree annotation effort, the Penn Treebank, used a PCFG-style formalism to define its annotations. Because of the influence of the Penn Treebank, a lot of early work in machine parsing also used the PCFG. More recently, many research groups have moved on to grammar formalisms based on typed dependencies, but Microsoft notably still offers a PCFG-based parser through their Linguistic Analysis API. (In a previous blog post, I poked fun at Microsoft’s parser and the PCFG formalism.)

In my view, the PCFG and other approaches to grammar that are based on phrasal taxonomies suffer from a number of serious conceptual flaws. I want to illustrate these flaws by describing how one might handle some more advanced grammatical phenomena using a taxonomy.


An example of a PCFG-based parse tree, from English Stack Exchange

Verb Tense

Let’s look at what a phrasal taxonomy needs to do to handle verb tenses. English has a relatively simple verb conjugation system, but the rules it does have are strongly enforced – no native English speaker would say “I have broke the radio”, except, perhaps, in an extremely informal setting. Have a look at the following sentences, which are minor variations on the previous set:

  1. John is ___ to the store.
  2. Mike and Bob are ___ .
  3. Sally has ___ with her mother.
  4. My friends have ___ .

And consider the results of trying to complete the sentences with the following phrases:

  1. riding a bike
  2. throwing a party
  3. spoken quickly
  4. danced a polka

You can see the issue easily. To complete sentences 1+2, we need a phrase in the gerund (-ing) form. But for sentences 3+4, we need a phrase in the past participle form (which in the case of “dance” is just “danced”, because it has no distinct past participle). So the pluggability phenomenon breaks down here, even though all of the completion phrases are seemingly VPs.

Hmmm. That’s unfortunate. It looks like we are going to need to introduce some additional complexity to deal with tense. Well, maybe it’s not such a big deal. We just need to split the original VP category into five subcategories: VBP for present, VBD for past, VBZ for third person singular, VBG for the gerund, and VBN for the past participle form. It’s going to be a bit tricky to write the phrase combination rules in terms of these new symbols, but probably not impossible. There aren’t that many more symbols to deal with.

Quantity Modifiers

Let’s look at another set of examples, this time relating to quantity. Consider the following sentences:

  1. ___ talked to the mayor about the crime problem.
  2. The woman gave ___ a plate of cupcakes.
  3. On the way home I met some of ___ .
  4. All of ____ went to the show.

At first glance, it looks like these blanks are going to be filled by NPs. So let’s try some NP-like completions:

  1. the audience
  2. my friends
  3. most of the people
  4. one of the journalists

As you can probably see, the issue here relates to the quantity modifiers all/most/one/etc. The simple phrases 1+2 can be inserted into any of the sentences. Similarly, any of the completion phrases can be plugged into the simple sentences 1+2. So we see again the pluggability phenomenon that motivated this project at the outset.

The problem comes when we try to insert phrases 3+4 into sentences 3+4. This combination doesn’t work because both the phrases and the sentences have quantity modifiers, and using two quantity modifiers in the same phrase is ungrammatical. You cannot say “On the way home I met some of one of the journalists”.

This is another setback for the phrasal taxonomy project. Most of the completion phrases above look like NPs, so we would like to categorize them as such. But the analysis shows that we are going to need to make a distinction between different sets of NPs, depending on whether they contain a quantity modifier. Well, okay, let’s introduce new categories NP[+Q] and NP[-Q], to capture this distinction. Again, it might not seem like an insurmountable barrier; at this stage it is just one additional symbol.
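The NP[+Q]/NP[-Q] distinction amounts to a simple constraint: a quantity-modifier slot can only take an NP that does not itself carry a quantity modifier. A minimal sketch (the modifier list is my own illustrative set):

```python
# Sketch of the [+/-Q] split: a slot introduced by a quantity modifier
# can only accept an NP[-Q], i.e. an NP without its own quantity modifier.
QUANTITY_WORDS = {"all", "most", "some", "one", "two", "many"}

def has_quantity_modifier(phrase):
    return phrase.split()[0] in QUANTITY_WORDS

def can_embed(slot_has_quantifier, inner_phrase):
    # Two stacked quantity modifiers are ungrammatical:
    # *"some of one of the journalists"
    return not (slot_has_quantifier and has_quantity_modifier(inner_phrase))

assert can_embed(True, "the audience")               # "some of the audience"
assert not can_embed(True, "one of the journalists") # *"some of one of..."
```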

Adjective Ordering Rules

When he was a child, the great fantasy author JRR Tolkien told a story to his mother about a “green great dragon”. His mother told him that this was the wrong way to say it, and he should say “great green dragon” instead. Tolkien asked why, but his mother couldn’t explain it, and the episode was an early influence leading to his lifelong interest in language and fantasy.

Tolkien’s error was to violate an adjective ordering rule. English has a lot of complex rules about how adjectives must be ordered, one of which is that adjectives expressing magnitude, like “great” or “large”, must come before ones expressing color or quality, like “green” or “wooden”. Let’s see how this impacts the phrasal taxonomy project. Here are some sentences:

  1. A ___ rolled down the street.
  2. I wanted to give my sister a ___ .
  3. My parents bought a wooden ___ .
  4. The black ___ exploded with a flash of fire.

And compare the completion results with the following phrases:

  1. car
  2. necklace
  3. large house
  4. tiny basket

We see exactly the same problem pattern that we saw for the quantity modifiers. The one-word phrases (“car” and “necklace”) can be plugged into any of the sentence slots. And all of the phrases can be plugged into sentences 1+2. But the two-word phrases 3+4 cannot be plugged into the sentences that already contain an adjective, because of adjective ordering rules. The phrase “wooden large house” is ungrammatical; we have to say “large wooden house” instead. (If someone said “wooden large house” in real life, a listener might interpret it as a compound word “largehouse” that has some specific technical meaning, like “longboard” has a special meaning in the context of surfing).
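The ordering constraint itself is easy to state once each adjective is assigned a class rank: a sequence of adjectives is acceptable only if the ranks are non-decreasing. The two-class lexicon below is a deliberately tiny, hypothetical fragment; real English ordering involves many more classes.

```python
# Hypothetical adjective-class ranks: magnitude adjectives (rank 0) must
# precede color/material adjectives (rank 1). Real English has more classes.
ADJ_CLASS = {
    "great": 0, "large": 0, "tiny": 0,     # magnitude
    "green": 1, "black": 1, "wooden": 1,   # color / material
}

def order_ok(adjectives):
    ranks = [ADJ_CLASS[a] for a in adjectives]
    return ranks == sorted(ranks)

assert order_ok(["great", "green"])       # Tolkien's mother's version
assert not order_ok(["green", "great"])   # young Tolkien's version
assert order_ok(["large", "wooden"])
assert not order_ok(["wooden", "large"])
```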

If you are a non-native speaker of English, you might be curious or skeptical about how strongly these ordering rules are enforced. The Google NGram Viewer gives us a nice tool to study this phenomenon. This search indicates that the phrase “large wooden house” occurs in thousands of books, while the frequency of the phrase “wooden large house” is so small as to be statistically indistinguishable from zero.

Anyway, this is another problem for the phrasal taxonomy project. In this case it’s not even clear how many new symbols we need to introduce to solve the problem. Do we need a new symbol for every category of adjective that could be implicated in an ordering rule?

Are We Done Yet?


In each of the above sections, we saw that in order to describe a certain grammatical phenomenon, we were going to have to introduce another distinction into our taxonomy. Instead of a simple VP category, we’re going to need a couple of subcategories that express information about verb tense. Instead of a simple NP category, we’re going to need a bunch of subcategories that include information about plurality, adjective ordering and the presence of quantity modifiers.

You can imagine a hapless grammarian sitting at a desk, assiduously writing down a long list of phrasal categories that capture all the relevant grammatical issues in English. Will this poor soul ever be finished with his Sisyphean task? Probably not in this lifetime.

A Scholar in his Study (Norwegian: En lærd i sitt studerkammer)

The problem is that the number of categories is exponential in the number of distinctions that need to be expressed. Consider noun phrases. As we saw, we will need to create different categories that represent whether the noun has a quantity modifier, and what type of adjective it contains. Suppose we decide that 3 distinct adjective levels are necessary. Furthermore, it is obvious that we will need to express the distinction between plural and singular nouns. We are already at 2*2*3=12 different categories of noun phrases, and we’ve only just started. What about determiners? What about preposition attachments: is “the mayor of Baltimore” in the same category as just “the mayor”? For every new type of distinction we need to make, we’re going to have to double (at least) the number of symbols we need to maintain.

The situation for Verb Phrases is even worse. We saw above how we will need to create different categories for each verb tense. But in fact we will need to go further than that. We’ll need to create new category splits for each of the following distinctions:

  1. Does the phrase have a direct object?
  2. Does the phrase have an indirect object?
  3. Does the phrase have an infinitive (“to”-) complement?
  4. Does the phrase have a sentential (“that”-) complement?

Each of these distinctions requires us to double the number of categories we use. As we saw above, there are five verb forms, so now we’re at 5*2*2*2*2=80 categories just for verb phrases.
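The combinatorics can be written out directly, which also makes the general pattern clear: every new binary distinction multiplies the category count by two.

```python
# Multiplying out the distinctions discussed above.
noun_categories = 2 * 2 * 3     # quantity modifier x plurality x adjective level
verb_categories = 5 * 2 ** 4    # five verb forms x four binary complement distinctions
print(noun_categories, verb_categories)  # 12 80

# The general pattern: with k base forms and n binary distinctions,
# the taxonomy needs k * 2**n categories -- exponential in n.
def category_count(base_forms, binary_distinctions):
    return base_forms * 2 ** binary_distinctions
```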

Why does this Matter?

In the early days of Natural Language Processing, the plan was to combine the domain expertise of linguists with the algorithmic expertise of computer scientists, in order to create programs that could understand human language. This plan was pursued vigorously by several research groups. After some time an awkward pattern emerged: it turned out that the linguists couldn’t actually contribute very much to the project. Early researchers found that smart algorithms and machine learning techniques were more useful for building NLP systems than linguistic knowledge. This pattern was expressed succinctly by Fred Jelinek, one of the pioneers of the field, who said “every time I fire a linguist, the performance of the system goes up”. From these early observations about the efficacy of different approaches, almost everyone in the field drew the following conclusion:

  • Linguistic knowledge and theory is not useful for NLP work. Instead, people should rely entirely on Machine Learning techniques.

As a result of this collective decision, modern NLP research contains only a very superficial discussion of actual linguistic phenomena. Most people in the NLP world have little interest in analyzing, for example, the differences or similarities between various types of relative clauses. They do not have much interest in studying the empirical adequacy of different theories of grammar (e.g. X-Bar theory). Instead, the focus is overwhelmingly on computer science topics such as data structures, algorithms, learning methods, and neural network architectures.

My research is predicated upon an alternate explanation of the early history of the field:

  • Linguistic knowledge is potentially very useful for NLP. However, the field of linguistics has not yet obtained a sufficiently accurate theory, and it is better to rely on ML than on low-quality theory.

In other words, if it can be discovered, a highly accurate linguistic theory of grammar will prove decisively significant for the development of NLP systems. Unfortunately, we cannot look to mainstream linguistics for this theory. That field has been dominated for decades by the personality of Noam Chomsky. His pronouncements are given far too much weight by his followers, in particular his bizarre and almost anti-intellectual position that “probabilistic models give no insight into the basic problems of syntactic structure”. In fact, probabilistic modeling is the only tool that can bring a scientific debate about the structure of language to a decisive conclusion.

I encountered the general issues with phrasal taxonomy that I mentioned above in my early work on large scale text compression. My first combined parser/compressor systems were based on a PCFG formalism. The initial versions of the system, which used only simple grammar rules, worked acceptably well. But as I attempted to make increasingly sophisticated grammatical refinements, I needed to scale up the taxonomy, and I found that this was extremely difficult to do, largely because of the kinds of issues I mentioned above. That eventually led me to discard the PCFG formalism in favor of a system that doesn’t require a strict taxonomy.





Ozora Research: One Page Summary

At Ozora Research, our goal is to build NLP systems that meet and exceed the state of the art, by using a radically new research methodology. The methodology is extremely general, which means our work is high-risk, high-payoff: if the NLP research is successful, it will affect not just NLP, but many adjacent fields like computer vision and bioinformatics.

The methodology works as follows. We have a lossless compression program for English text. The input to the compressor is a special sentence description that is based on a parse tree. We have a sentence parser, which analyzes a natural language sentence to find the parse tree that produces the shortest possible encoded length for the sentence. With these tools in place, we can now rigorously and systematically evaluate the parser (and other related NLP tools) by looking at the codelength the combined system achieves on a raw, unlabelled text corpus.
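The Ozora compressor itself is not public, so as a stand-in, this sketch uses zlib at two settings to illustrate the evaluation principle: a model is scored purely by the total codelength it achieves on raw, unlabelled text, with no gold annotations required. A better model earns a shorter codelength.

```python
import zlib

# Stand-in for the real evaluation loop: score two "models" (zlib at a weak
# and a strong setting) by total codelength on a raw text corpus.
corpus = (
    b"John rode a bike to the store. "
    b"Sally danced a polka with her mother. "
) * 50

def codelength_bits(data, level):
    """Total encoded length, in bits, under a given compression level."""
    return 8 * len(zlib.compress(data, level))

weak = codelength_bits(corpus, 1)
strong = codelength_bits(corpus, 9)
assert strong <= weak   # the better model achieves the shorter codelength
```

The same comparison works for any pair of compressors, including ones built on parse trees: whichever produces the shorter total codelength has, by definition, captured more of the corpus’s structure.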

Compare this methodology to the situation in mainstream NLP research. In sentence parsing, almost all work depends entirely on the existence of human-annotated “gold standard” parse data, such as the Penn Treebank (PTB). This dependence puts severe limitations on the field. One issue is that any conceptual error or inconsistency in the PTB annotation process gets “baked in” to the resulting parsers. Another issue is the small size of the corpus, which is on the order of 40,000 sentences: there are many important but infrequent linguistic phenomena that simply will not appear in such a small sample.

Our research also engages new, interdisciplinary expertise by emphasizing the role of empirical science, as opposed to algorithmic science which is the centerpiece of modern NLP work. For example, our system incorporates knowledge about verb argument structure: certain verbs such as “argue”, “know”, or “claim” can take sentential (that-) complements, while most verbs cannot. Similarly, our system knows about the special grammar of emotion adjectives like “happy” or “proud”, which can be connected to complements that explain the cause of the emotion (“My father was happy that the Cubs won the World Series”). From this viewpoint, the challenge is to develop a computational framework within which the relevant grammatical knowledge can be expressed simply and cleanly. These issues are largely ignored in mainstream NLP work.

Our work is in the early stages. The basic components of the system are in place, but it has not yet achieved a high level of performance. Funding from the NSF will enable us to scale up the system to determine if the approach is truly viable. Specifically, we will scale up the grammar system to include many infrequent but important phenomena, and also upgrade the statistical model that backs the compressor, by using more advanced machine learning techniques. Funding will also enable us to package the results in a publishable form for the benefit of the broader research community.




New Parse Visualization Format

In my research, I try, to whatever extent possible, to allow myself to be guided by the ideals of beauty and elegance. It’s not always easy to achieve these ideals, and sometimes it’s hard to justify prioritizing them over more mundane and practical concerns. But I believe there is a deep connection between truth and beauty, as John Keats pointed out long ago, and so I try to keep beauty in the front of my mind whenever possible.

(In my last post, I made fun of the PCFG formalism, which I think is both ugly and confusing [1]).

The most aesthetically important component of the system is the tool that produces a visual description of a parse tree. I spend a lot of time looking at these images, so if they don’t look nice, I get a headache. And more importantly, if the image is too cluttered or confusing, it impedes my ability to understand what’s going on in the grammar engine.

So I am very excited about one of the features in the most recent development cycle, which is a new, slimmed-down look for the parse trees. To motivate this, observe the following image, which was built using the old system’s parse viewer (if it is too small, right click and select “view in new tab”):


There are a couple of annoying things going on in this image. First of all, notice how in the final prepositional phrase (“on gun ownership”), there is one main link labelled with on, and then another subsidiary link labelled on_xp. The main link is meaningful, because it expresses the logical connection between the words “restrictions” and “ownership”. On the other hand, the subsidiary link really doesn’t tell you very much. It’s required by the inner workings of the parsing system, but it doesn’t help you understand the structure of the sentence.

Another example of this redundancy is the pos_zp role that links the name “Bloomberg” to the apostrophe “‘s”. Again, the system requires this role before it will allow the owner role, but the latter role is the one that is actually informative.

The new visualization system removes the “preparatory” links from the parse image. This removes the redundant clutter, and it also has a nice additional benefit of reducing the height of the image.


Another difference, which should be pretty obvious, is the change in the part-of-speech annotations that are listed underneath each word. In the old system, I was using a POS system that was basically descended from old PCFG concepts. So the word “restrictions” was labeled with NNS, while “joined” was labeled with VBA [2]. Now instead of those somewhat cryptic symbols, we just have N[-s], which indicates a plural noun, for “restrictions”, and V[-ed], which indicates a past-tense verb, for “joined”.

In other cases, I’ve left out the POS annotations entirely, because they’re obvious from the role label. For example, here’s the old and new output on a simple sentence:



As you can see in the old version, the word “has” is marked with a special POS tag HAVE. Now, it is an important technical point that the form “has/had/have” is its own grammatical category, and therefore in principle it should have its own symbol. However, the viewer of the parse doesn’t need to be reminded of this, since the word is marked with the special aux_have link, which cannot connect to any other category.

Question for readers: which visualization format do you prefer? Is the new version easier to understand or not?

A closing thought on the power of writing up ideas for presentation: in the course of writing this post, I noticed a subtle bug in the way the new version of the system is handling punctuation. Do you see the boxes around the periods at the end of the sentences? The color indicates the amount of codelength required to send the sentence-ending punctuation. The green boxes indicate that the (old) system is using a very short code, which is feasible for encoding periods, because almost all normal declarative sentences end with periods. The yellow boxes indicate an increased codelength cost. Now in the images from the old version, the boxes are green, but in the new version, the boxes are yellow. This indicates that the new version has a bug in the component responsible for predicting the sentence-final punctuation.
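The color coding above is just codelength made visible. Under an entropy code, the cost of a symbol is -log2 of the probability the model assigns it, so a model that confidently predicts a sentence-final period pays a fraction of a bit, while an under-predicting (buggy) model pays much more. A minimal sketch:

```python
import math

# Codelength = -log2(probability). A model that expects the period pays
# little to encode it; a model that under-predicts it pays more.
def codelength_bits(p):
    return -math.log2(p)

healthy = codelength_bits(0.95)  # "green box": period was highly expected
buggy = codelength_bits(0.40)    # "yellow box": period under-predicted
print(round(healthy, 3), round(buggy, 3))
assert healthy < buggy
```

The specific probabilities here are illustrative, not taken from the actual system.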

Thanks for reading!  Please feel free to comment on this page directly, or reach out to me through the “contact” link, if you are interested in talking about NLP, sentence parsing, grammar, and related issues.



[1] – Several years ago, when I was working on the early versions of the Ozora parser, I actually used a grammar formalism based on the PCFG. There were a number of technical problems with this formalism, but these technical problems probably would not have proved compelling on their own. In addition to the technical issues, though, there was also a huge aesthetic issue: the resulting parse trees didn’t look good, and the problem got worse and worse as the sentences got bigger. Because of the way English sentences branch, PCFG visualizations tend to appear very tall, and angled down and to the right. Consider the following sentence:

The debate over the referendum was rekindled in Israel after reports that Naftali Bennett , a minister whose Jewish Home Party opposes the establishment of a Palestinian state , was soliciting the support of Yair Lapid , the finance minister and leader of the centrist Yesh Atid Party , for new legislation .

This is a long sentence, and any visualization tool is going to struggle with the width. But when faced with a long sentence, PCFG tree images also have a huge problem with height. I parsed this sentence using Microsoft’s Linguistic Analysis API, and here’s a slice of what came out:


You can see the huge gap between the POS annotations at the top of the image, and the words at the bottom. Almost half the screen space is entirely wasted. In contrast, here’s the Ozora visualizer’s output for the left half of the sentence:


As you can see, there is not nearly as much wasted space in the visualization, and it is much easier to understand the logical relationship between the various words.

[2] This concept of VBA was something I was quite proud of when I came up with it, though I am no longer using it. The idea behind the VBA category relates to the fact that, while most English verbs have only four distinct conjugations, some verbs have a fifth conjugation, called the past participle, which normally has an -en suffix. Examples are “broken”, “spoken”, “eaten”. This form is usually denoted VBN, while the regular past is denoted VBD. Now if the verb has no distinct VBN, then you are allowed to use the VBD instead. But if it does have a VBN, you must use it or the sentence will be ungrammatical. Consider:

I have visited London three times in the past year.

I have spoke to the president about the issue. ***

Mike wanted to buy a used car.

John was able to fix the broke radiator. ***


In these sentences, the first example of each pair is grammatical, because the verbs “visit” and “use” have no distinct VBN form. The second examples are ungrammatical, because the verbs “speak” and “break” have distinct VBN forms (“spoken”, “broken”), which must be used in the given context.

It’s actually a bit tricky to express this rule succinctly in the parsing engine. You have to say something like “allow VBD, but only if the verb does not have a VBN”. In other words, the parsing system has to query back into the lexicon to determine if an expression is grammatical.


To avoid this, my idea was to package words with ambiguous VBD/VBN conjugations, like “visited” and “joined”, together under the VBA symbol. Verbs that had separate fifth conjugations would produce VBD or VBN as appropriate, but other four-conjugation words would produce only VBA. Then you could express the grammar rule in a succinct way: accept either VBA or VBN but not VBD.
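The payoff of the merged tag is that the perfect construction becomes a pure set-membership check, with no callback into the lexicon. A sketch with a three-word toy lexicon:

```python
# Sketch of the VBA idea: four-conjugation verbs, whose past and participle
# forms coincide, emit the merged tag VBA; five-conjugation verbs emit VBD
# or VBN as appropriate. (Toy lexicon; real tagging comes from a full lexicon.)
TAGS = {
    "visited": {"VBA"},   # four-conjugation verb: past == participle
    "spoke":   {"VBD"},   # five-conjugation verb, simple past
    "spoken":  {"VBN"},   # five-conjugation verb, past participle
}

def perfect_ok(verb_form):
    """The perfect construction accepts {VBA, VBN} and rejects bare VBD."""
    return bool(TAGS[verb_form] & {"VBA", "VBN"})

assert perfect_ok("visited")    # "I have visited London"
assert perfect_ok("spoken")     # "I have spoken to the president"
assert not perfect_ok("spoke")  # *"I have spoke to the president"
```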



Chuckling a Bit at Microsoft and the PCFG Formalism

(Note: I actually think Microsoft is a pretty great company, with a ton of smart people)

When you’re doing independent research, it’s really important to celebrate the small victories. Being independent means you don’t have a lot of money, you don’t have access to the best data sets, and you can’t run your code on the fastest computers. You have to do all the nitty-gritty software engineering and DevOps work yourself, while researchers at big tech companies can outsource that stuff to someone else, so they can focus on high-level research issues. Even worse, you have to deal with the possibility that your Big Idea is actually wrong. So it’s important psychologically to savor the rare moments when you succeed where a BigCo fails.

For me, one of those moments came a couple of days ago when I was surfing the Microsoft Cognitive Services page. This service is a part of Microsoft’s big, recent commitment to research in artificial intelligence (AI). The basic idea is to make a suite of state-of-the-art algorithms available on the cloud through a web API, so that small companies or researchers who want to use such algorithms can do so without having to do lots of yak-shaving. One of the services is called the Linguistic Analysis API, which is basically a natural language sentence parser. You can try out the parser on the page, and the demonstration sentence they preload into the page is the following:

The Linguistic Analysis API simplifies complex languages to help you easily parse text

This results in the following parse tree [2]:



Now, take a moment to study this parse tree. Does it make sense to you? Is it visually appealing? In my (very biased) opinion, the answer to both questions is no. This kind of parse tree totally obscures the actual grammatical relationships between the words. What, for example, is the relationship between the verb “simplifies” and the noun “languages”? Obviously the correct answer is object, but to determine this from looking at the tree, you need to jump up from the word itself, to the VBZ symbol, then connect to the NP subtree through the VP parent, and then drill down to the NNS. And you have to just know that this particular configuration of symbols implies a direct object relationship.

Not only is it confusing, this kind of parse tree visualization looks really ugly. Notice how much space is being wasted on the left side, where there is a huge gap between the symbol layer on the top, and the words on the bottom.  On the right side, the words and symbols are packed together tightly, because there are so many symbol expansions – the word “parse” is nine levels below the starting TOP symbol. Reading the sentence and introspecting, do you really feel that it requires that much complexity to describe?

But, okay, maybe these are just aesthetic complaints that don’t have a place in proper scientific deliberation. Here’s a real complaint: the parse is wrong. If you’re so inclined, take a minute to study the tree and try to figure out the issue.






Ironically, the mistake relates to the word “parse” itself. The parser thinks that “parse” is an adjective (JJ), contained within an adjective phrase (ADJP). That’s clearly wrong: “parse” can be a noun or a verb, but not an adjective. The symbol above it (“easily parse”) should be a VP, not an ADJP. And since the word “text” is the direct object of “parse”, the subtree for “text” should be below the subtree for “parse”, not above it.

Huh. Okay, well, the Microsoft parser got this one wrong. So what? Natural language parsing is hard, very hard; I’ll be the first to admit it. The fact that their parser makes a few mistakes on one sentence isn’t such a huge failure – it got most of the sentence right, and probably it works fine on other sentences.

But wait a minute. Their parser failed on the demo sentence they chose to put on the splash page. They could have used any sentence they wanted (“The Linguistic Analysis API makes natural language a piece of cake!”, “Use the Linguistic Analysis API to simplify your text analysis workflow!”, “Can eagles that fly swim?” etc). Or, if for some reason they are really attached to that particular sentence, they could have hacked the parser somehow to require that it produces the correct parse for it [1]. The parse result for the demo sentence is literally the first thing a potential customer would see when trying out the service.

So what really happened? My guess is one of two things: 1) They didn’t actually notice that the parse is wrong, or 2) they think potential customers won’t notice that the parse is wrong. In both scenarios, the ultimate cause is the same: the parse tree notation system they’re using is incomprehensible gibberish. Either the bad notation obscured the problem from the Microsoft developers themselves, or they figured the bad notation would obscure the problem from potential customers.

This is the point where we stop chuckling at Microsoft, and start chuckling at the formalism itself. The formalism is called a Probabilistic Context Free Grammar (PCFG), and it wasn’t invented by Microsoft. It’s been in use by both the mainstream linguistics and the mainstream NLP communities for decades.

In fact, until a couple of years ago, if you wanted your parser to be taken seriously by the NLP community, it had to use the PCFG formalism as its output. This is because the most prominent evaluation database, the Penn Treebank, was annotated (by humans) in a PCFG format. To evaluate your parser, you invoked it on the test sentences in the PTB, and then compared the output of your system to the annotation information in the database. If your system didn’t produce PCFG output, it could not be evaluated. Furthermore, you were required to use not just the PCFG formalism in general, but the treebank’s specific instantiation of it. For example, you were not at liberty to add a new symbol OBSP to represent gerundial verb phrases that act as the target of an observation verb (“I heard him playing the piano in the other room”).
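To make “PCFG format” concrete: a PTB annotation is a bracketed string, with a phrase label after each open paren and part-of-speech tags at the leaves. Here is a minimal sketch (my own illustration, not the actual treebank tooling) of reading one of these strings into a nested structure:

```python
def parse_ptb(s):
    """Parse a PTB-style bracketed string. Nonterminals become
    (label, children) tuples; leaves become (tag, word) pairs."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        if tokens[pos] != "(":           # preterminal: (TAG word)
            word = tokens[pos]
            pos += 2                     # skip the word and its ")"
            return (label, word)
        children = []
        while tokens[pos] == "(":        # nonterminal: (LABEL child+)
            children.append(read())
        pos += 1                         # skip the closing ")"
        return (label, children)

    return read()

tree = parse_ptb("(S (NP (NNP John)) (VP (VBD kicked) (NP (DT a) (NN ball))))")
```

The point of the sketch is just to show the shape of the data: a tree of phrase symbols (S, NP, VP) over tag/word pairs, which is exactly the structure a PCFG-style parser is graded against.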

So, again up until recently, if there were any errors or problems, general or specific, with the PCFG formalism as used in the treebank – if in fact this parse tree notation style is incomprehensible gibberish – then these problems were baked into all the parsers that were evaluated in this manner. You could never claim that your parser is better than the Penn Treebank; the quality of a parsing system is defined by the extent to which it agrees with the benchmark. If the researchers who developed the PTB made a mistake when choosing their part-of-speech tagset, this error would propagate into all the parsers developed by the community.

Actually, the field depends on the Penn Treebank for more than just evaluation. Almost without exception, modern NLP systems are developed using machine learning (ML) techniques. In order for ML to work, the algorithms must have a good supply of “training” data: input samples that are labeled with the correct output response. The ML algorithm then learns from these examples to reproduce the desired input/output function. So, if you’re doing parsing research, where do you get the labeled training data? You guessed it: the Penn Treebank.

More recently, many researchers in the field have switched to a new parse formalism called dependency grammar (closely related to, though distinct from, link grammar). This formalism describes sentences in terms of links pointing from head words to tail words; the links usually also carry a label denoting the type of the relationship (subject, object, etc). To evaluate a parser that produces dependency grammar output, researchers take the PCFG annotation data and programmatically convert it to the new format. In this way, researchers are able to break away from the overarching PCFG formalism.
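A dependency parse is just a set of labeled head-to-dependent arcs. Here is a minimal sketch of that representation, with a well-formedness check; the attachment and label choices are my own illustration in a roughly Universal Dependencies style, not the output of any particular parser:

```python
# One arc per word: (head_index, label, dependent_index).
# Index 0 is a virtual ROOT node; labels are illustrative.
sentence = ["Use", "the", "API", "to", "parse", "text"]
arcs = [
    (0, "root",  1),   # ROOT  -> Use
    (3, "det",   2),   # API   -> the
    (1, "obj",   3),   # Use   -> API
    (5, "mark",  4),   # parse -> to
    (1, "advcl", 5),   # Use   -> parse
    (5, "obj",   6),   # parse -> text
]

def is_tree(arcs, n):
    """Check that every word has exactly one head, and that following
    the head chain from any word reaches ROOT without a cycle."""
    heads = {}
    for head, _, dep in arcs:
        if dep in heads:
            return False          # a word with two heads
        heads[dep] = head
    if set(heads) != set(range(1, n + 1)):
        return False              # a word with no head
    for dep in heads:
        seen, cur = set(), dep
        while cur != 0:
            if cur in seen:
                return False      # a cycle not anchored at ROOT
            seen.add(cur)
            cur = heads[cur]
    return True
```

Note how much flatter this is than a phrase-structure tree: six words, six arcs, and no intermediate NP/VP/ADJP symbols at all.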

But they are still constrained by the inherent limitations of the approach to evaluation. Any error or lacuna in the underlying human annotation will still cause an error in the converted version. If the underlying data set fails to make an important distinction (such as that related to observational verbs mentioned above), the converted version will fail to make the distinction also.

Furthermore, any evaluation procedure based on a limited size corpus will have serious difficulty with judging a system’s performance on rare grammatical constructs. For example, the word “help” has an eccentric characteristic, which is that it can connect to bare infinitive complements:

I would like to help you (to) build your company.

Most other verbs that connect to infinitive complements require the particle “to”, but for “help”, it is optional or even discouraged. So if the treebank dataset doesn’t have enough sentences with the specific word “help”, parsers won’t be able to learn how to handle this construct correctly.

Let’s see how some other mainstream parsing systems handle Microsoft’s teaser sentence. Here’s the result from the Stanford parser:


Here’s the result from spaCy:


Well, these examples are at least better visually [3]. One thing that is immediately obvious is that they are flatter. But it is still quite difficult to understand what is going on. In particular, what are xcomp, ccomp, and advcl? If you look up these tags on the Universal Dependencies page, you will find that xcomp is an “open clausal complement”, ccomp is a “clausal complement”, and advcl is an “adverbial clause modifier”. Does that clarify things for you?

Let’s look at what the Ozora parser does with this sentence [4]:



Now you can see that the connection between “simplifies” and “help” is a purpose link. That seems a lot more informative than xcomp or advcl. Also, “help” links to “parse” with an inf_arg link, which is more precise than ccomp, because the latter contains that-complements as well as infinitive complements.

Of course, I’ve got a long way to go: the Ozora parser is far from perfect. But, unlike almost every other parser, Ozora’s was developed without a treebank database. So it is immune to the inherent limitations of that evaluation paradigm. If you’re interested in how this works, check out my book.




[1] Depending on the methodology they used to build the parser, this may not actually be that easy. A big drawback of many modern ML approaches is that they are not interpretable and therefore not easily debuggable. You can’t dig into your 10-layer CNN and surgically modify some weights to ensure it outputs the right response for a given input. See also the fiasco related to Google’s image tagging.

[2] These screenshots were taken on March 14, 2017. Here is a full screenshot showing the MSFT URL in the upper left:


[3] spaCy’s parse actually contains an error: “help” is connected to “text”, and “parse” is linked to it using an amod link.

[4] Full disclosure: I had to do a bit of hacking to get it to work correctly on this example. The reason relates to the acronym “API”. In the demo sentence, “API” acts as a common noun; but in the previous configuration, the parser treated all acronyms as proper nouns. And proper nouns basically cannot take modifiers, so “Linguistic Analysis API” was ungrammatical. To fix this, I changed the system to allow it to interpret acronyms as both common and proper nouns and then to pick the best interpretation.




Argument Structure of Observational Verbs and the Eccentric Verb “Help”

Argument structure is a way of categorizing words, especially verbs, in terms of what types of grammatical structures they can connect to. These connections help to complete the meaning of the phrase in various ways. Since different verbs are used to express different types of logical relationships, it’s not surprising that they have different rules about what kind of grammatical connections they can make.

For example, words like “argue”, “know”, “understand”, and “promise” can connect to that-complements. A that-complement is a standalone sentence with the word “that” in front of it.

I know that the treasure is buried on the other side of the island.

The lawyer argued that there was a logical inconsistency in the wording of the law.

Even very young children understand that people like to be treated fairly.

My brother promised that he would return from the war as soon as possible.

As you can see, in every case mentioned above the subclause that occurs after “that” is a complete grammatical sentence that could stand on its own. Here is the sentence parse generated by the Ozora parser:


One important fact about that-complements is that the particle “that” can be elided. The above sentences can all be rewritten without “that”, and will still be grammatical and still have the same meaning.


(As an aside, it is awkward to write about the word “that” because it is used in so many different ways in English: to introduce that-complements, to link nouns to relative clauses, as a demonstrative pronoun (“that newspaper”), and more).

Another common type of argument structure relates to infinitive complements. These are verb phrases, with no subject, in the base (dictionary) form, that start with the particle “to”. Verbs that express ideas about actions, such as “ask”, “request”, “allow”, and “want”, often take infinitive complements. But if you try to attach an infinitive complement to a verb that does not accept one, you get something that is either ungrammatical or strange-sounding:

You really need to take a shower.

The king (asked, ruled**) the knight to rescue the princess.

The new technology will allow us to transmit much more data per second.

The governor (wants, thinks**) to ban smoking in public schools.

Some verbs can take both a direct object and an infinitive complement. These constructions typically express an idea about a person taking an action. In these instances, the object must follow the normal rule about case in English: if you use a pronoun, it must be in the accusative case (him/her/me as opposed to he/she/I). This is pretty natural and doesn’t lead to confusion, because the presence of the particle to clearly indicates that the clause is an infinitive form.

Here are two parse descriptions, illustrating that the Ozora parser “understands” this rule. The red link in the second one is a parse failure indicator, showing that the system doesn’t accept the sentence as grammatical:



Getting to the point of this post, there is a strange category of verbs, which all seem to relate to observation, that allow an odd type of argument structure:

I saw (him/Yo-Yo Ma/he**) play the cello in New York City.

We heard a rock shatter the window.

The paparazzi photographed (her/the actress/she**) swimming naked in a lake.

We lay in the field and watched clouds (meander/meandering/meanders**) across the sky.

A few things to notice here:

  • The particle to does not and in fact cannot appear (try inserting it into the above sentences).
  • There are two possibilities for the form of the verb: it can be the base form (VB, dictionary form) or the gerund form (-ing suffix, VBG, progressive aspect).
  • There is a noun which is simultaneously the subject of the subclause and the object of the parent clause. This noun must be inflected for accusative case.
  • The subclauses are not grammatical on their own, both because the noun is required to be in accusative case, and because the subclause verb must be in the base or gerund form, regardless of the subject.


Him play the cello. **

He played the cello.

I saw him play the cello.

I saw him played the cello**.

Eccentric Verb “Help”

The observational verbs are kind of a strange category, but at least they are a category. It appears that the verb help has a unique argument structure, illustrated as follows:

I helped him (build/building**) his company.

The rain helped wash away the stains on the driveway.

The office staff helped (him/the janitor/he**) clean up the mess.


Actually, there is an easy way to characterize this verb: it takes a bare infinitive argument. A bare infinitive is just an infinitive with the particle to removed. With this qualification, we can analyze “help” with the same rules we used for the standard infinitive.

Can you think of any other verbs that take bare infinitive arguments?
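The argument-structure categories from this post can be summarized as a toy lexicon. The complement-type names and the verb entries below are my own illustration (not the Ozora parser’s actual format), but they capture the distinctions discussed above, including the eccentric behavior of “help”:

```python
# Toy argument-structure lexicon. The feature names ("that_comp",
# "to_inf", "bare_inf", "gerund") are invented for illustration.
COMPLEMENTS = {
    "know":    {"that_comp"},
    "promise": {"that_comp"},
    "want":    {"to_inf"},
    "ask":     {"to_inf"},
    "see":     {"bare_inf", "gerund"},   # observational verbs
    "watch":   {"bare_inf", "gerund"},
    "help":    {"to_inf", "bare_inf"},   # eccentric: "to" is optional
}

def allows(verb, complement_type):
    """True if the lexicon says this verb accepts this complement type."""
    return complement_type in COMPLEMENTS.get(verb, set())
```

With a table like this, a grammar checker can rule out “the governor thinks to ban smoking” (no "to_inf" entry for “think”) while accepting both “helped him build” and “helped him to build”.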

Writing computer software to analyze English can be frustrating, because it involves a lot of code to handle finicky edge cases. The word “help” is one example: I apparently need to build an entirely new set of grammar rules to accommodate a single word. Another example is the allomorphic word pair “a/an”, which has a special agreement rule that refers to the phonological expression of the word immediately after it (not the head of the noun phrase: “an oak tree” vs “a large oak tree”).
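The “a/an” rule, at least in its orthographic approximation, is easy to sketch. The key point the sketch illustrates is that the rule looks at the word immediately after the article, not the head noun; note that this letter-based heuristic is only an approximation of what is really a phonological rule (“an hour” and “a university” would break it):

```python
def indefinite_article(noun_phrase_words):
    """Choose "a" or "an" by inspecting the word immediately following
    the article, not the head of the noun phrase. Letter-based
    approximation only; the true rule is phonological."""
    first_word = noun_phrase_words[0].lower()
    return "an" if first_word[0] in "aeiou" else "a"

# The head noun "oak" is the same in both cases, but the article differs:
indefinite_article(["oak", "tree"])           # "an oak tree"
indefinite_article(["large", "oak", "tree"])  # "a large oak tree"
```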





Response to Review of “Notes” by Peter McCluskey


An acquaintance of mine, Peter McCluskey, was nice enough to read my book Notes on a New Philosophy of Empirical Science and write a review of it on his blog Bayesian Investor. The review is basically positive. Several of the critical comments in the review indicate real shortcomings of the book, which I hope to correct in the final version.

McCluskey understands the big-picture concepts presented by the book. This is a good start, because it is often hard for me to convey these concepts to people who are not well versed in information theory and statistics, even though the book attempts to address a general audience. For example, he seems to understand without much effort that there is a near equivalence between the step “make a concrete prediction” in the traditional scientific method, and the step “build a data compressor” in the CRM. That is a big leap for most people.

I do want to clarify some points about the goal of the book. McCluskey summarizes:

Machine Learning (ML) is potentially science, and this book focuses on how ML will be improved by viewing its problems through the lens of CRM. Burfoot complains about the toolkit mentality of traditional ML research, arguing that the CRM approach will turn ML into an empirical science.

It’s important to emphasize that I don’t want to “fix” Machine Learning. Instead, I want to create a field that is adjacent to ML, and stands in the same relation to ML as mathematics stands to physics. To emphasize this, I’ll write it in analogical form:

mathematics:physics   ::   machine learning:comperical science

At this point I will mention that in 2011, when I was writing the book, all of these speculations were purely theoretical. At that time a critic could have justifiably attacked me for doing armchair philosophy and making presumptuous claims without strong evidence. But now, in 2017, comperical science is no longer a merely theoretical construct: I have been applying the philosophy to the field of NLP and have made substantial progress.

McCluskey’s first critical point is that, though protection against fraud and manual overfitting are benefits of the CRM approach, I have “exaggerated” them. This is probably a reasonable criticism; I have a bad tendency to employ absolutist vocabulary (eg “invincibly rigorous”) when describing the CRM philosophy.

However, I stand by the claim that the CRM provides very strong protection against the most common types of both honest and dishonest scientific mistakes, especially those related to statistical errors. Furthermore, this protection is urgently needed by several fields of science, such as nutrition, medicine, psychology, and economics. Consider the following commentary by leading experts in a variety of fields:

Andrew Gelman:

[M]any published papers are clearly in error, which can often be seen just by internal examination of the claims and which becomes even clearer following unsuccessful replication…

When seemingly solid findings in social psychology turn out not to replicate, we’re no longer surprised….

Paul Romer:

For more than three decades, macroeconomics has gone backwards. The treatment of identification now is no more credible than in the early 1970s…

A parallel with string theory from physics hints at a general failure mode of science that is triggered when respect for highly regarded leaders evolves into a deference to authority that displaces objective fact from its position as the ultimate determinant of scientific truth.

John Ioannidis:

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field….

Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.

I invite the reader to compare the degree of objectivity and rigor that must be present in comperical science to the issues and problems indicated by the above comments. In most cases, any kind of bug or miscalculation in a data compressor will be instantly revealed when the decompressed file fails to match the original. Furthermore, exact input/output matching is not enough: in order for a result to be valuable, the encoded file must also be small in comparison to previous results. File size is easily, directly and unambiguously verifiable by basic computer commands.
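This verification loop can be sketched in a few lines. Here zlib stands in for a real model-based codec (a CRM “theory” would be a specialized compressor, not a generic one); the two checks are exactly the ones described above:

```python
import zlib

# Stand-in for a shared benchmark corpus; real CRM work targets
# large, agreed-upon datasets.
original = b"the cat sat on the mat " * 1000

encoded = zlib.compress(original, 9)

# 1) Lossless round-trip: any bug or miscalculation in the codec is
#    exposed immediately when the decompressed file fails to match.
assert zlib.decompress(encoded) == original

# 2) The score is simply the encoded size - directly and unambiguously
#    comparable across competing theories.
print(f"{len(original)} bytes -> {len(encoded)} bytes")
```

There is no room here for p-hacking or statistical sleight of hand: either the round-trip succeeds and the file is smaller than the previous best, or it isn’t.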

McCluskey complains that the book does not discuss the distinction between lossless and lossy compression. This is a fair complaint, and it is one that I have heard from other people. In my mind the rationale is clear, but the book should include more direct statements about why the CRM requires lossless compression.

The rationale is that lossless compression permits strong, objective comparisons between competing candidate theories of a phenomenon. This, in turn, permits researchers to conduct a systematic search through the space of theories, eventually homing in on a high-quality theory. Lossy compression is an interesting problem, but it does not allow strong comparisons between theories, because different theories might choose to discard different details of the original data. Without the ability to make strong comparisons, the theory-search gets stuck in the conceptual mud.

McCluskey is an investor, and so he naturally wondered about applying the CRM idea to stock market data. He concluded that the CRM was not really applicable to this kind of data set. I agree, and I should have emphasized this more in the book. In general, it is very important to exercise a degree of taste and judgment when selecting the type of data set to be targeted for CRM research. If you pick a bad data set, the resulting research will fail to produce any interesting results. A good data set for CRM work should have a few properties:

  1. The distribution that produces the data should be essentially stationary. Otherwise, practitioners must take care to emphasize the boundary of applicability of the conclusions of the research. For example, if a CRM dataset is produced by taking images of cancerous growths in men, then the resulting knowledge should not be used to diagnose cancerous growths in women.
  2. It should be related to a phenomenon of intrinsic interest. One can imagine a line of CRM research that attempts to compress a large database of cat pictures. Such research would produce a highly detailed computational description of the visual appearance of felines – an achievement of somewhat dubious intellectual value.
  3. It should have rich, organic structure. A database of random noise is a poor choice, because noise cannot be compressed. A database of numbers produced by a computer’s pseudo-random number generator is also a poor choice; it can be very highly compressed, but only by reverse-engineering the PRNG.

Stock market data runs into problems with properties #1 and #3 above. First, it is fundamentally time-dependent and not stationary. As McCluskey notes, a database of stock market data from the years 1995-2015 will be strongly influenced by the bubble. In many ways, the bubble was a unique economic event, and so knowledge of the conditions that it produced will not be helpful in predicting future trends.

Secondly, changes in stock prices are intrinsically hard to predict, because the present price should reflect almost all the available information the market has to evaluate the stock. This is called the Efficient Market Hypothesis. Actually, one interesting reformulation of the EMH is that the stream of stock price changes is random and thus incompressible.
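This reformulation is easy to illustrate. Pseudo-random bytes (standing in here for a genuinely unpredictable stream of price changes) come out of a general-purpose compressor slightly larger than they went in:

```python
import random
import zlib

random.seed(0)
# Simulated "price change" stream; under the EMH the real stream is
# similarly patternless from the compressor's point of view.
changes = bytes(random.randrange(256) for _ in range(100_000))

ratio = len(zlib.compress(changes, 9)) / len(changes)
# ratio comes out at roughly 1.0 (slightly above, due to container
# overhead): the stream is effectively incompressible.
```

A CRM research program aimed at such a stream would simply stall, which is exactly McCluskey’s point about stock data.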

One final point McCluskey raises is about the question “what is intelligence?” and how comperical science relates to natural and artificial intelligence. The book only hints at the answer. In the final thought experiment, the fictional protagonist compiles an enormously large and varied database of images, text, audio, and scientific measurements. Then he begins a CRM inquiry targeting this database, using a suite of extremely abstract, general, and meta-theoretical techniques. I believe a suite of techniques such as this, if sufficiently powerful, would be equivalent to intelligence. However, I do not believe this to be within the reach of modern AI research.

One of the key problems with developing general, abstract, and meta-theoretical techniques is that they are hard to evaluate. It is hard enough to formulate technical problems well, and formulating technical meta-problems is much harder. A slight glitch in the problem formulation can make the challenge impossible on one side, or trivial on the other. One of the goals of comperical philosophy is to provide a framework within which researchers can scale up the power of their abstractions and general-purpose techniques, while always staying on firm methodological ground. Consider the following “road map” of development in comperical linguistics:

  • First, develop a good specific theory of English text
  • Next develop a good specific theory of Chinese text. Then Russian text. Then Hindi. Then French. Presumably, each step will be easier than the last, as researchers learn new tricks, techniques, and concepts.
  • Finally, develop a unified theory of natural language text: a general purpose algorithm that, when invoked on a large corpus of text in a given language, will automatically produce a good specific theory of the language.

You can imagine an analogous process for image understanding: first develop a good specific theory of cars, then for buildings, then for plants, and so on. Then develop a unified theory and learning algorithm that automatically builds specific theories, given a good data set.

The final leap is to unify the general theories for the various domains. Instead of one general algorithm for language learning, another for image learning, and a third for motor skills learning, construct a truly general purpose algorithm that can be applied to any type of data set.

Thanks again to Peter McCluskey for his review, and to you, the reader, for taking the time to read my response.

Place-Activity Nouns

Most of the time, an English noun phrase that is headed by a common singular noun requires a determiner. All of the following sentences are ungrammatical as written, but can be fixed by adding an appropriate determiner:

I want to eat apple. **

Plane crashed in the river. **

My uncle bought new car. **

There are a lot of exceptions to the determiner rule. Today I want to discuss just one exception, which involves a special set of nouns like “work”, “church”, “school”, or “prison”. These nouns nominally denote places, but also implicitly indicate an activity that occurs at the places. Consider these examples:

I have to go to work early tomorrow morning.

My cousin was released from prison last year.

The minister’s daughter made an embarrassing scene at church on Sunday.

The above sentences are grammatical, even though the noun phrases “to work”, “from prison”, and “at church” are missing determiners. To see the contrast, consider the following ungrammatical sentences:

He got really drunk at bar last night. **

My sister spends every morning in cafe reading philosophy. **

I went to store to buy an umbrella after I lost my previous one. **

There is no obvious reason why “cafe” or “bar” couldn’t have the same special status as “church”, since they too are particular locations where fairly specific types of activities happen.

Possibly the special status of the place-activity nouns (“work”, “church”, etc) reflects the logical fact that a person typically goes to only one of those places. Most people work at a specific workplace and attend school at a specific institution. In logical terms, there is a many-to-one relationship between people and workplaces or people and churches. So place-activity noun phrases are implicitly determined:

I have to get up early tomorrow to go to (my) church.

My sister hates going to (her) school.

In contrast, people often visit several different cafes, bars, restaurants or stores. With these words, the many-to-one logical relationship becomes a many-to-many relationship, and therefore those nouns cannot be implicitly determined.

This example illustrates the fact that in order for a computer system to make strong grammar judgments, it must be equipped with an advanced lexical database (ie dictionary). “Cafe” and “school” have exactly the same part of speech, but there are sentences which become ungrammatical if you replace “school” with “cafe” (the converse does not appear to be true).
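Here is a sketch of the kind of lexical database entry this requires. The feature name "place_activity" is my own invention for illustration, not a standard tagset label:

```python
# Toy lexical entries for singular common nouns.
NOUNS = {
    "church": {"place_activity": True},
    "school": {"place_activity": True},
    "work":   {"place_activity": True},
    "cafe":   {"place_activity": False},
    "bar":    {"place_activity": False},
}

def needs_determiner(noun, inside_prep_phrase):
    """A singular common noun requires a determiner, except when a
    place-activity noun appears inside a prepositional phrase
    ("at church", "from prison")."""
    return not (NOUNS[noun]["place_activity"] and inside_prep_phrase)
```

A lexicon like this lets the grammar accept “at church” while still rejecting “at bar”, even though the two nouns carry identical parts of speech.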

As a subject for another post, observe that the word “home” has an even more special status: “I want to go home” is grammatical without either a determiner or a preposition.

Can you think of other place-activity nouns in addition to the ones I mentioned above?