This post is about the problems with approaches to grammar theory that depend on the creation of a phrasal taxonomy. A phrasal taxonomy is a system for categorizing phrases into discrete categories like “Noun Phrase” (NP) or “Verb Phrase” (VP). At first glance, the idea seems quite plausible. Most people have a strong intuitive feeling that phrases like “John”, “the man on the train”, or “a red brick house” are all members of the NP category, while phrases like “kicked a ball”, “flew”, or “raised a family” are all VPs.
In addition to our intuition, the phrasal taxonomy concept is also supported by an important empirical phenomenon. Linguistic terminology is terrible, so in this post I’ll just call the phenomenon pluggability. To illustrate the idea of pluggability, we write down a couple of simple sentences, and then delete a couple of words to leave a blank:
- John ____ to the store.
- Sally ____ with her mother.
- Mike and Bob ____ .
- My friend ____ .
Then we brainstorm a bit and try to come up with some completion phrases that could fill the blank spot. In this case, it’s easy to come up with some good ones:
- rode a bike
- spoke quickly
- danced a polka
- threw a party
Here’s the amazing thing: all of the phrase candidates can be “plugged into” all of the original sentences. And we could easily write down many more completion phrases and blanked sentences that would work just as well. This is the core insight of pluggability. Pluggability suggests that if we can document and characterize all the sets of phrases that can be plugged into various types of blanks, we’ll have made a major step towards understanding grammar.
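The cross-plugging experiment can be sketched in a few lines of Python. This is only an illustration: the function name `plug` is made up for this example, and the script merely generates the combinations. Judging whether each result is grammatical is still left to the reader.

```python
from itertools import product

# Blanked sentences and completion phrases copied from the lists above.
frames = [
    "John ____ to the store.",
    "Sally ____ with her mother.",
    "Mike and Bob ____ .",
    "My friend ____ .",
]
phrases = ["rode a bike", "spoke quickly", "danced a polka", "threw a party"]


def plug(frame: str, phrase: str) -> str:
    """Fill the blank in a frame and tidy the space before a final period."""
    return frame.replace("____", phrase).replace(" .", ".")


# Every phrase is tried in every frame: 4 frames x 4 phrases.
sentences = [plug(frame, phrase) for frame, phrase in product(frames, phrases)]

for sentence in sentences:
    print(sentence)
```

Reading through the printed sentences is a quick way to convince yourself that every combination is grammatical, even when the meaning is odd.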
At this point it’s important to point out that, as grammarians, we’re primarily interested in grammatical acceptability instead of semantic plausibility. If the grammatical interpretation of a sentence is clear, it is no problem if the sentence expresses a bizarre, whimsical, or fantastical event. For example, if we insert phrase 2 into sentence 1 above, the resulting sentence (“John spoke quickly to the store”) is certainly grammatical, but sounds very strange in terms of meaning (perhaps the store is haunted).
Based on the observations of the above blanked sentences and completion phrases, let’s declare victory and claim that we’ve identified our first phrase category. We’ll call this phrase category VP. Let’s keep working and look at another type of phrase. Consider the sentences:
- The woman was arrested by ____ .
- ____ started a technology company.
- Ryan attended a meeting with ____ .
- Mark and ____ travelled to Russia.
Consider the following phrases as candidate completions for the above sentences:
- the man in black
- a tree
- four Japanese government officials
Again we see that despite the wide diversity of the completion phrases, each of them can complete every one of the sentences, with varying degrees of semantic plausibility. So we’ll call the phrases in the above list “Noun Phrases” or NPs for short.
It looks like we have made a good start: we’ve got a good handle on the two most important types of phrases in English. Of course we are going to have to do a lot of work to refine our taxonomy to cover the full range of English grammar, but probably (it seems at the outset) there will not be too many phrase categories – perhaps a couple of hundred? It doesn’t seem like it will be unreasonably difficult to document all of these categories. After all, the language must be easy enough that normal human children can learn it reliably.
What do we do with the taxonomy once we’ve completed it?
There are a variety of paths one might consider as ways to reach the ultimate destination, which is a complete theory of grammar. The most obvious path is to use the taxonomy to describe the ways in which phrase structures can be combined together.
For example, suppose we take the basic structures VP and NP as mentioned above, plus a structure S for sentence, a symbol PP for prepositional phrase, and Prep for an individual preposition word like “to”, “from”, “on”, etc. We could then write down the following rules:
- S :: NP + VP
- VP :: VP + NP
- VP :: VP + PP
- PP :: Prep + NP
Rule 1 says that a sentence is an NP subject plus a VP. Rule 2 says a VP can “split off” an NP representing a direct object. Rule 3, similarly, says that a VP can split off a prepositional phrase. And rule 4 shows that a PP is just an NP with a single preposition word in front of it. Again, this set of rules is hardly complete, but it seems plausible that it is a good start, and that with a serious effort we could identify the full set of rules for English.
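To make the four rules concrete, here is a minimal sketch of a recognizer that checks whether a word sequence can be derived from S. It uses the standard CKY chart algorithm, which works here because every rule has exactly two symbols on its right-hand side. The one-word lexicon is an assumption made for this example (the rules above say nothing about individual words), and the sketch uses no probabilities, so it is a plain context-free recognizer.

```python
from itertools import product

# The four binary rules from the text, keyed by (left child, right child).
RULES = {
    ("NP", "VP"): "S",
    ("VP", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("Prep", "NP"): "PP",
}

# Hypothetical one-word lexicon (an assumption, not part of the rules above).
LEXICON = {
    "John": {"NP"}, "Sally": {"NP"}, "cake": {"NP"}, "Boston": {"NP"},
    "ate": {"VP"}, "flew": {"VP"},
    "to": {"Prep"}, "in": {"Prep"},
}


def recognize(words):
    """Return True if the word list can be derived from S under RULES."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every category that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    # Build longer spans out of two adjacent shorter spans.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, right in product(chart[start][mid], chart[mid][end]):
                    parent = RULES.get((left, right))
                    if parent is not None:
                        chart[start][end].add(parent)
    return "S" in chart[0][n]


print(recognize("John flew to Boston".split()))
print(recognize("Sally ate cake in Boston".split()))
print(recognize("to John flew".split()))
```

Even in this toy, all of the grammatical knowledge lives in the category labels. Any distinction the grammar needs to make has to be encoded as a new label.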
This approach is the basis for one of the most common types of grammar theory, the Context Free Grammar (CFG), along with its statistical extension, the Probabilistic Context Free Grammar (PCFG), which attaches a probability to each rule. Chomsky famously used rewrite rules of this kind, along with another concept called “syntactic movement”, in his early theories of grammar. Because this type of formalism is widely used in mainstream linguistics, it was also adopted by many researchers in machine parsing. The first and most important parse tree annotation effort, called the Penn Treebank, used a phrase structure formalism of this type to define the annotations. Because of the influence of the Penn Treebank, a lot of early work in machine parsing also used the PCFG. More recently, though many research groups have moved on to a grammar formalism based on typed dependencies, Microsoft notably still offers a PCFG-based parser through their Linguistic Analysis API. (In a previous blog post, I poked fun at Microsoft’s parser and the PCFG formalism).
In my view, the PCFG and other approaches to grammar that are based on phrasal taxonomies suffer from a number of serious conceptual flaws. I want to try to illustrate these flaws by describing how one might describe some more advanced grammatical phenomena using a taxonomy.
Let’s look at what phrasal taxonomy needs to do to handle verb tenses. English has a relatively simplistic verb conjugation system, but the rules it does have are strongly enforced – no native English speaker would say “I have broke the radio”, except, perhaps, in an extremely informal setting. Have a look at the following sentences, which are minor variations on the previous set:
- John is ___ to the store.
- Mike and Bob are ___ .
- Sally has ___ with her mother.
- My friends have ___ .
And consider the results of trying to complete the sentences with the following phrases:
- riding a bike
- throwing a party
- spoken quickly
- danced a polka
You can see the issue easily. To complete sentences 1+2, we need to use a phrase that is in the gerund (-ing) tense. But for sentences 3+4, we need a phrase in the past participle tense (which in the case of “dance” is just “danced” because it does not have a distinct past participle form). So the pluggability phenomenon breaks down here, even though all of the completion phrases are seemingly VPs.
Hmmm. That’s unfortunate. It looks like we are going to need to introduce some additional complexity to deal with tense. Well, maybe it’s not such a big deal. We just need to split the original VP category into 5 subcategories. We can call them VBP for present, VBD for past, VBZ for third person singular, VBG for gerundial, and VBN for the past participle form. It’s going to be a bit tricky to write the phrase combination rules in terms of these new symbols, but probably not impossible. There aren’t that many more symbols to deal with.
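Here is a rough sketch of what the split looks like in practice. The tag names come from the paragraph above. The two lookup tables and the function `can_plug` are assumptions made for this example, and the sketch ignores passive constructions, where “is” can also be followed by a past participle.

```python
# Hypothetical table: the verb form each auxiliary requires of the phrase after it.
AUX_REQUIRES = {"is": "VBG", "are": "VBG", "has": "VBN", "have": "VBN"}

# Completion phrases from the text, hand-tagged with the form(s) of the head verb.
PHRASE_FORMS = {
    "riding a bike": {"VBG"},
    "throwing a party": {"VBG"},
    "spoken quickly": {"VBN"},
    # "danced" is both the simple past and the past participle of "dance".
    "danced a polka": {"VBD", "VBN"},
}


def can_plug(aux: str, phrase: str) -> bool:
    """True if the phrase has the verb form that the auxiliary requires."""
    return AUX_REQUIRES[aux] in PHRASE_FORMS[phrase]


for aux in AUX_REQUIRES:
    for phrase in PHRASE_FORMS:
        verdict = "ok" if can_plug(aux, phrase) else "blocked"
        print(f"{aux} + {phrase}: {verdict}")
```

The single VP label is gone. Every rule that used to mention VP now has to say which of the five forms it wants.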
Let’s look at another set of examples, this time relating to quantity. Consider the following sentences:
- ___ talked to the mayor about the crime problem.
- The woman gave ___ a plate of cupcakes.
- On the way home I met some of ___ .
- All of ____ went to the show.
At a first glance, it looks like basically these are going to be NPs. So let’s try some NP-like completions:
- the audience
- my friends
- most of the people
- one of the journalists
As you can probably see, the issue here relates to the quantity modifiers all/most/one/etc. The simple phrases 1+2 can be inserted into any of the sentences. Similarly, any of the completion phrases can be plugged into the simple sentences 1+2. So we see again the pluggability phenomenon that motivated this project at the outset.
The problem comes when we try to insert phrases 3+4 into sentences 3+4. This combination doesn’t work because both the phrases and the sentences have quantity modifiers, and using two quantity modifiers in the same phrase is ungrammatical. You cannot say “On the way home I met some of one of the journalists”.
This is another setback for the phrasal taxonomy project. Most of the completion phrases above look like NPs, so we would like to categorize them as such. But the analysis shows that we are going to need to make a distinction between different sets of NPs, depending on whether they contain a quantity modifier. Well, okay, let’s introduce new categories NP[+Q] and NP[-Q], to capture this distinction. Again, it might not seem like an insurmountable barrier; at this stage it is just one additional symbol.
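The NP[+Q] / NP[-Q] distinction can be sketched as a boolean feature on each phrase and on each sentence frame. The class names `NP` and `Frame` and the function `fill` are hypothetical, and the feature values are assigned by hand. This is a sketch of the bookkeeping, not a real parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NP:
    text: str
    has_quantifier: bool  # True for NP[+Q], False for NP[-Q]


@dataclass(frozen=True)
class Frame:
    template: str         # contains "___" where the NP goes
    has_quantifier: bool  # True if the frame already supplies "some of", "all of", ...


def fill(frame: Frame, np: NP) -> Optional[str]:
    """Return the completed sentence, or None if two quantity modifiers would stack."""
    if frame.has_quantifier and np.has_quantifier:
        return None
    return frame.template.replace("___", np.text)


frames = [
    Frame("The woman gave ___ a plate of cupcakes.", has_quantifier=False),
    Frame("On the way home I met some of ___ .", has_quantifier=True),
]
phrases = [
    NP("my friends", has_quantifier=False),
    NP("one of the journalists", has_quantifier=True),
]

for frame in frames:
    for np in phrases:
        print(fill(frame, np))
```

Only the [+Q] frame combined with the [+Q] phrase is rejected, which matches the pattern described above.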
Adjective Ordering Rules
When he was a child, the great fantasy author JRR Tolkien told a story to his mother about a “green great dragon”. His mother told him that this was the wrong way to say it, and he should say “great green dragon” instead. Tolkien asked why, but his mother couldn’t explain it, and the episode was an early influence leading to his lifelong interest in language and fantasy.
Tolkien’s error was to violate an adjective ordering rule. English has a lot of complex rules about how adjectives must be ordered, one of which is that adjectives expressing magnitude, like “great” or “large”, must come before ones expressing color or quality, like “green” or “wooden”. Let’s see how this impacts the phrasal taxonomy project. Here are some sentences:
- A ___ rolled down the street.
- I wanted to give my sister a ___ .
- My parents bought a wooden ___ .
- The black ___ exploded with a flash of fire.
And compare the completion results with the following phrases:
- car
- necklace
- large house
- tiny basket
We see exactly the same problem pattern that we saw for the quantity modifiers. The one-word phrases (“car” and “necklace”) can be plugged into any of the sentence slots. And all of the phrases can be plugged into sentences 1+2. But the two-word phrases 3+4 cannot be plugged into the sentences that already contain an adjective, because of adjective ordering rules. The phrase “wooden large house” is ungrammatical; we have to say “large wooden house” instead. (If someone said “wooden large house” in real life, a listener might interpret it as a compound word “largehouse” that has some specific technical meaning, like “longboard” has a special meaning in the context of surfing).
If you are a non-native speaker of English, you might be curious or skeptical about how strongly these ordering rules are enforced. The Google NGram Viewer gives us a nice tool to study this phenomenon. This search indicates that the phrase “large wooden house” occurs in thousands of books, while the frequency of the phrase “wooden large house” is so small as to be statistically indistinguishable from zero.
Anyway, this is another problem for the phrasal taxonomy project. In this case it’s not even clear how many new symbols we need to introduce to solve the problem. Do we need a new symbol for every category of adjective that could be implicated in an ordering rule?
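One way to see the bookkeeping involved is to give each adjective a rank and require that ranks never decrease from left to right. In the sketch below, only the magnitude-before-color/quality rule comes from the discussion above. The two-level ranking, the table `ADJ_RANK`, and the function `order_ok` are assumptions for illustration, and real English needs more levels than this.

```python
# Hypothetical ordering classes: a lower rank must come earlier in the phrase.
ADJ_RANK = {
    "great": 0, "large": 0, "tiny": 0,    # magnitude
    "green": 1, "black": 1, "wooden": 1,  # color or quality
}


def order_ok(adjectives):
    """True if the adjectives appear in non-decreasing rank order."""
    ranks = [ADJ_RANK[adj] for adj in adjectives]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


print(order_ok(["large", "wooden"]))  # "large wooden house"
print(order_ok(["wooden", "large"]))  # "wooden large house"
print(order_ok(["great", "green"]))   # "great green dragon"
print(order_ok(["green", "great"]))   # "green great dragon"
```

A taxonomy cannot use a numeric rank like this directly. It would have to turn each rank into a separate category symbol, so that a frame such as “a wooden ___” could demand a phrase whose leading adjective ranks no lower than “wooden”.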
Are We Done Yet?
In each of the above sections, we saw that in order to describe a certain grammatical phenomenon, we were going to have to introduce another distinction into our taxonomy. Instead of a simple VP category, we’re going to need a couple of subcategories that express information about verb tense. Instead of a simple NP category, we’re going to need a bunch of subcategories that include information about plurality, adjective ordering and the presence of quantity modifiers.
You can imagine a hapless grammarian sitting at a desk, assiduously writing down a long list of phrasal categories that capture all the relevant grammatical issues in English. Will this poor soul ever be finished with his Sisyphean task? Probably not in this lifetime.
The problem is that the number of categories is exponential in the number of distinctions that need to be expressed. Consider noun phrases. As we saw, we will need to create different categories that represent whether the noun has a quantity modifier, and what type of adjective it contains. Suppose we decide that 3 levels of distinct adjectives are necessary. Furthermore, it is obvious that we will need to express the distinction between plural and singular nouns. We are already at 2*2*3=12 different categories of noun phrases, and we’ve only just started. What about determiners? What about preposition attachments: is “the mayor of Baltimore” in the same category as just “the mayor”? For every new type of distinction we need to make, we’re going to have to double (at least) the number of symbols we need to maintain.
The situation for Verb Phrases is even worse. We saw above how we will need to create different categories for each verb tense. But in fact we will need to go further than that. We’ll need to create new category splits for each of the following distinctions:
- Does the phrase have a direct object?
- Does the phrase have an indirect object?
- Does the phrase have an infinitive (“to”-) complement?
- Does the phrase have a sentential (“that”-) complement?
Each of these distinctions requires us to double the number of categories we use. As we saw above, there are five verb tenses, so now we’re at 5*2*2*2*2=80 categories just for verb phrases.
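The arithmetic is easy to check by enumerating the categories directly. In the sketch below the feature names and value labels are my own placeholders (for instance "A1" to "A3" for the three adjective levels). Only the number of values per feature is taken from the discussion above.

```python
from itertools import product

# Features for noun phrases: 2 * 2 * 3 combinations.
NP_FEATURES = {
    "quantity_modifier": ["+Q", "-Q"],
    "number": ["singular", "plural"],
    "adjective_level": ["A1", "A2", "A3"],  # hypothetical labels
}

# Features for verb phrases: 5 * 2 * 2 * 2 * 2 combinations.
VP_FEATURES = {
    "form": ["VBP", "VBD", "VBZ", "VBG", "VBN"],
    "direct_object": [True, False],
    "indirect_object": [True, False],
    "infinitive_complement": [True, False],
    "sentential_complement": [True, False],
}


def categories(features):
    """Enumerate every combination of feature values, one tuple per category."""
    return list(product(*features.values()))


print(len(categories(NP_FEATURES)))
print(len(categories(VP_FEATURES)))

# Adding one more yes/no distinction doubles the number of categories.
with_determiner = dict(NP_FEATURES, has_determiner=[True, False])
print(len(categories(with_determiner)))
```

Each tuple returned by `categories` would have to be its own symbol in the taxonomy, with its own combination rules.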
Why Does This Matter?
In the early days of Natural Language Processing, the plan was to combine the domain expertise of linguists with the algorithmic expertise of computer scientists, in order to create programs that could understand human language. This plan was pursued vigorously by several research groups. After some time an awkward pattern emerged: it turned out that the linguists couldn’t actually contribute very much to the project. Early researchers found that smart algorithms and machine learning techniques were more useful for building NLP systems than linguistic knowledge. This pattern was expressed succinctly by Fred Jelinek, one of the pioneers of the field, who said “every time I fire a linguist, the performance of the system goes up”. From these early observations about the efficacy of different approaches, almost everyone in the field drew the following conclusion:
- Linguistic knowledge and theory are not useful for NLP work. Instead, people should rely entirely on Machine Learning techniques.
As a result of this collective decision, modern NLP research contains only a very superficial discussion of actual linguistic phenomena. Most people in the NLP world have little interest in analyzing, for example, the differences or similarities between various types of relative clauses. They do not have much interest in studying the empirical adequacy of different theories of grammar (e.g. X-Bar theory). Instead, the focus is overwhelmingly on computer science topics such as data structures, algorithms, learning methods, and neural network architectures.
My research is predicated upon an alternate explanation of the early history of the field:
- Linguistic knowledge is potentially very useful for NLP. However, the field of linguistics has not yet obtained a sufficiently accurate theory, and it is better to rely on ML than on low-quality theory.
In other words, if it can be discovered, a highly accurate linguistic theory of grammar will prove decisively significant for the development of NLP systems. Unfortunately, we cannot look to mainstream linguistics for this theory. That field has been dominated for decades by the personality of Noam Chomsky. His pronouncements are given far too much weight by his followers, in particular his bizarre and almost anti-intellectual position that “probabilistic models give no insight into the basic problems of syntactic structure” (see here and also here). In fact, probabilistic modeling is the only tool that can bring a scientific debate about the structure of language to a decisive conclusion.
I encountered the general issues with phrasal taxonomy that I mentioned above in my early work on large scale text compression. My first combined parser/compressor systems were based on a PCFG formalism. The initial versions of the system, which used only simple grammar rules, worked acceptably well. But as I attempted to make increasingly sophisticated grammatical refinements, I needed to scale up the taxonomy, and I found that this was extremely difficult to do, largely because of the kinds of issues I mentioned above. That eventually led me to discard the PCFG formalism in favor of a system that doesn’t require a strict taxonomy.