P2FE : Case Rules for Pronouns in Partial Determiner Phrases

In my work, I get deeply involved in both computer languages (like Java) and natural languages (like English). Programmers spend a lot of time thinking about and debating the relative merits and shortcomings of different computer languages. Often these debates take on the aspect of a religious conflict, but they are not entirely worthless, because programming languages can change over time, as language designers add new features and remove inconsistencies. The work of deciding on and incorporating the improvements is a difficult process, not just because of the technical difficulty, but because the changes require buy-in and approval from the language community. To propose a new feature, a language designer writes a detailed document describing a problem and proposing a solution. These documents have names like “Java Specification Request” (JSR) and “Python Enhancement Proposal” (PEP). After the proposal has been subjected to sufficient study and scrutiny, a standards organization takes a vote to decide whether or not to adopt it.

In my research I often come up against some particularly annoying feature of English, and think to myself, “hmmm, the language would really be so much better if we fixed this inconsistency”. Unfortunately, there is no mechanism for proposing and adopting improvements to English that is analogous to the JSR or PEP. While English does change, this change happens in a haphazard and unplanned way. The gradual drift in the way the language is used means that, over time, it becomes more and more cluttered and inconsistent. This disorganization causes the language to become harder to learn, harder to understand, and less useful for the purposes of communication.

Gandhi advises us to “be the change you want to see in the world” [1]. In that spirit, I submit this Plan to Fix English (P2FE) for consideration and study by English speakers everywhere. I believe this proposal, if widely adopted, will make the language more consistent and precise. Even if you do not accept this particular improvement, I hope you endorse the general principle that we can and should work together to make English better. And if you don’t care about fixing English, you might at least enjoy the discussion of grammar theory, which begins with a description of the case phenomenon in relation to pronouns.

Pronouns and the Case Phenomenon

Many languages exhibit the linguistic pattern known as case. Case is a grammatical phenomenon where a noun must be inflected in a certain way depending on its relationship to a verb. Grammarians commonly distinguish between the nominative case, which is used for the subject of the verb, the accusative case, used for the object, and the reflexive case, used when the same entity is both the subject and object.

Some languages exhibit case phenomena quite strongly, requiring every noun to be inflected as appropriate. English exhibits only weak, vestigial case rules, which apply only to pronouns. The rules are illustrated by the following sentences:

  1. He bought a new car.
  2. Him bought a new car (***).
  3. She told him about the murder.
  4. She told he about the murder (***).
  5. The lawyer talked to us for three hours.
  6. The lawyer talked to we for three hours (***).
  7. John talked himself (John) into buying a new car.
  8. John talked him (John) into buying a new car. (***)
  9. John talked him (Mike) into buying a new car.
  10. John talked himself (Mike) into buying a new car. (***)

As you can see, even though the pronouns ‘he’, ‘him’ and ‘himself’ are semantically identical, we need to inflect them properly in order to satisfy grammatical constraints.

The grammatical rule of case can be expressed in terms of the role that the pronoun is fulfilling in the sentence. For the time being, let’s consider a simplified set of three roles: subject, object, and prepositional complement. Then the rules are as follows:

  1. If the pronoun fulfills the subject role, it must be in the nominative case.
  2. If the pronoun acts as the object or prepositional complement, it must be in the accusative case.
  3. Exception: if the pronoun in case 2 refers to the subject of the sentence, it must be in the reflexive case.

This set of rules explains the pattern of grammaticality shown in the above sentences. Sentence #2 is ungrammatical because it uses an accusative case pronoun in the subject role, and sentence #4 is ungrammatical because it uses a nominative case pronoun in the object role. Sentence #6 is ungrammatical because it uses a nominative case pronoun as a prepositional complement.
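The three rules above are simple enough to write down directly. Here is a minimal Python sketch of them; the pronoun table, the role names, and the function names are all made up for illustration, and this is not the Ozora parser's actual code.

```python
# Hypothetical pronoun table: (nominative, accusative, reflexive) forms.
PRONOUN_FORMS = {
    "he": ("he", "him", "himself"),
    "she": ("she", "her", "herself"),
    "we": ("we", "us", "ourselves"),
    "they": ("they", "them", "themselves"),
}
CASE_NAMES = ("nominative", "accusative", "reflexive")

# Reverse lookup: surface form -> case name.
CASE_OF_FORM = {
    form: case
    for forms in PRONOUN_FORMS.values()
    for form, case in zip(forms, CASE_NAMES)
}


def required_case(role, refers_to_subject=False):
    """Apply rules 1-3: the case a pronoun must take in a given role."""
    if role == "subject":
        return "nominative"  # rule 1
    if role in ("object", "prep_complement"):
        # rule 3 (exception) takes priority over rule 2
        return "reflexive" if refers_to_subject else "accusative"
    raise ValueError(f"unknown role: {role}")


def is_acceptable(form, role, refers_to_subject=False):
    """True if the pronoun form satisfies the case rules for the role."""
    return CASE_OF_FORM[form.lower()] == required_case(role, refers_to_subject)


print(is_acceptable("he", "subject"))           # sentence 1
print(is_acceptable("him", "subject"))          # sentence 2
print(is_acceptable("we", "prep_complement"))   # sentence 6
```

The `refers_to_subject` flag has to be supplied by the caller, because deciding which entity a pronoun refers to is a separate problem from checking its case.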

(I won’t discuss reflexive pronouns in this post, but I’ve included some sentences that illustrate their usage pattern. Using a reflexive pronoun instead of an accusative one never makes a sentence ungrammatical. Instead, it precludes certain interpretations about which noun the pronoun refers to. For example, sentence #8 is marked incorrect because the pronoun ‘him’ cannot refer to ‘John’; it must refer to some other person.)

The Ozora parser can handle these simple sentences quite easily, because the rules described above are easy to encode in computational terms. As you can see below, it parses the grammatical sentences well, but refuses to accept the ungrammatical sentences, producing instead a red gen_srole link that indicates that it found no acceptable parse.


The fact that this knowledge is built into the system helps with parsing more complex sentences, by allowing the system to prune away bad candidate parses early. This refusal to parse ungrammatical sentences is a distinctive feature of the Ozora parser. Other parsing systems will happily produce output for the bad sentences; here’s the output from the Microsoft Linguistic Services API:


The Head/Tail Question

Many grammar theories, including the one used in the Ozora parser, are described in terms of merge operations. A merge operation takes a head phrase and a tail phrase and joins them together. The resulting merged structure inherits the properties of the head phrase:

  1. The headword of the merged structure is the headword of the head phrase
  2. The syntactic type (NP/VP/etc) of the merged structure is the syntactic type of the head phrase

Because of this asymmetry in the inheritance rules between the head phrase and the tail phrase, it is very important, when writing down a grammar rule, to know which substructure is the head. Fortunately, often it is quite obvious which is which, as in the following examples:

  1. unusually large car
  2. swimming in the river
  3. My mother’s necklace

In the first example, “unusually large” is the tail phrase, while ‘car’ is the head phrase, so the result is an NP with the headword ‘car’. Next, the head phrase ‘swimming’ joins with the tail phrase “in the river”, and the result is a VP, with the headword ‘swimming’. Finally, the head phrase ‘necklace’ joins with the tail phrase “my mother’s”, so that the resulting phrase is a NP with the headword ‘necklace’.
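To make the inheritance rules concrete, here is a small Python sketch of a merge operation. The `Phrase` class, the `head_first` flag, and the type labels for the tail phrases ("AdjP", "PP") are assumptions for the sake of the example, not the data structures used inside the Ozora parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    text: str
    headword: str
    syntactic_type: str  # e.g. "NP", "VP"


def merge(head, tail, head_first):
    """Join a head phrase and a tail phrase.

    The merged phrase inherits its headword and syntactic type from the
    head phrase; the tail contributes only its text.
    """
    if head_first:
        text = f"{head.text} {tail.text}"
    else:
        text = f"{tail.text} {head.text}"
    return Phrase(text, head.headword, head.syntactic_type)


# "unusually large car": the tail comes before the head.
car = Phrase("car", "car", "NP")
unusually_large = Phrase("unusually large", "large", "AdjP")  # label assumed
big_car = merge(car, unusually_large, head_first=False)

# "swimming in the river": the head comes before the tail.
swimming = Phrase("swimming", "swimming", "VP")
in_the_river = Phrase("in the river", "river", "PP")  # label and headword assumed
swim = merge(swimming, in_the_river, head_first=True)

print(big_car)
print(swim)
```

Swapping the `head` and `tail` arguments changes both the headword and the syntactic type of the result, which is why it matters so much to get the choice right.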

The basic phrases listed above are easy, but when one is dealing with more complex grammatical patterns, the question becomes more subtle. Over time, I’ve built up a fair bit of experience and intuition related to this problem. One important rule of thumb is that the headword and the syntactic type should act as a linguistic summary of the entire phrase, and this summary should provide most of the information necessary to determine whether the phrase can fit into other places. To understand this idea, consider the following sentences:

  1. John bought a car.
  2. John bought the shiny new red car that had been parked in the dealer’s lot for five weeks.

In both sentences, the object of the word ‘bought’ is a noun phrase headed by the word ‘car’. That’s mostly all you need to know to determine that the phrase can fit into the object role. Even though the object phrase in the second sentence is quite long, the whole thing can be summarized as an NP headed with ‘car’. With just this summary, we can correctly deduce that the long phrase is a (semantically and syntactically) valid object of the verb ‘bought’. In contrast, if our headword-identification logic was broken so that we guessed that the headword was the verb ‘parked’, then we would incorrectly conclude that the phrase couldn’t fit into the slot it actually occupies. Similarly, if the headword was ‘dealer’, the phrase would be syntactically acceptable, but very odd semantically, because ‘dealer’ is an unlikely choice for the object of ‘bought’.

In addition to the logical arguments in favor of a grammar formalism based on the head/tail distinction, there is also a strong statistical rationale. Consider the following sentence:

  1. The boys will eat pizza and hamburgers.

Here’s the parse that the Ozora parser builds from the sentence:


The Ozora parser works by finding a parse description that maximizes the probability of a sentence. The language model is built from a large number of modular submodels. The most important submodel is the one used to assign probabilities to particular tailwords, given the headword and the semantic role. This type of model is both statistically powerful and intuitively natural. Given the headword and role as context information, humans can make strong intuitive judgments about what tailwords are plausible. In the example sentence, we see that the word “boys” appears as the tailword of “eat” with the subject role. From a semantic point of view, that is perfectly reasonable. Similarly, ‘pizza’ is a highly plausible tailword for “eat” in the object role. On the other hand, unless the text in question is some kind of bizarre fast food horror novel, “pizza” is wildly implausible as the tailword for “eat” in the subject role.

These strong intuitive judgments generally agree with the actual statistics of the text data: you will probably need to search through millions of books or newspaper articles before finding a sentence where “pizza” is the subject of “eat”, but it will be easy to find one where it is the object. Because of its statistical power, this strategy of language modeling helps us to improve both the overall performance of the model and also the accuracy of the parser.
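As a rough illustration of the tailword submodel, here is a toy Python version that estimates the probability of a tailword given a headword and role from counts. The observations are invented, and the additive smoothing is my own simplification for this sketch; the real submodel is trained on far more data and I am not claiming it works this way internally.

```python
from collections import Counter, defaultdict

# Toy (headword, role, tailword) observations, invented for illustration.
OBSERVATIONS = [
    ("eat", "subject", "boys"),
    ("eat", "subject", "boys"),
    ("eat", "subject", "children"),
    ("eat", "object", "pizza"),
    ("eat", "object", "pizza"),
    ("eat", "object", "hamburgers"),
]
VOCAB = sorted({tail for _, _, tail in OBSERVATIONS})


def fit(observations):
    """Count tailwords separately for each (headword, role) context."""
    counts = defaultdict(Counter)
    for head, role, tail in observations:
        counts[(head, role)][tail] += 1
    return counts


def tailword_probability(counts, head, role, tail, alpha=0.1):
    """Estimate P(tailword | headword, role) with additive smoothing."""
    context = counts.get((head, role), Counter())
    total = sum(context.values())
    return (context[tail] + alpha) / (total + alpha * len(VOCAB))


counts = fit(OBSERVATIONS)
p_object = tailword_probability(counts, "eat", "object", "pizza")
p_subject = tailword_probability(counts, "eat", "subject", "pizza")
print(p_object > p_subject)  # pizza is far more likely as the object
```

Even on six observations the model prefers "pizza" as the object of "eat" rather than the subject, which is the same judgment a human reader makes instantly.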

Partial Determiners

The third topic of this post relates to partial determiners, which are words such as all, most, several, and none. They are a common and important feature of English, allowing speakers to refer to groups with a greater or lesser degree of participation. The typical pattern is to complete the partially determined phrase by connecting the determiner to a plural noun or uncountable noun with the word of:

  1. All of the pizza
  2. Most of the boys
  3. Some of the water
  4. None of his friends
  5. A few of the Italians

This pattern seems intuitively easy to understand, so we might expect that it will be easy to describe using a grammar formalism. But when you actually try to do this, a thorny question arises: which phrase is the head and which is the tail? For the phrase “all the pizza”, is the headword “pizza”, or is it “all”? To answer this, let’s look at a few sentences that use partial determiners:

  1. The boys ate all the pizza.
  2. The mayor spoke to some of the reporters.
  3. None of the criminals escaped from jail.

The arguments about semantic and statistical plausibility given above suggest that in these constructions the main noun should be the headword. ‘Pizza’ is a highly plausible target for the object of ‘eat’; ‘reporters’ is a great target for the to-complement of ‘speak’, and so on. In contrast, the words ‘all’, ‘some’, and ‘none’ have a vanilla character; they’re not especially implausible in those contexts, but they’re not especially strong choices either.

In addition to this statistical argument, there’s another reason to choose the core noun as the headword instead of the partial determiner. In the small community of grammar formalism developers, there is a rule of thumb that says “Function words should not be headwords”. This rule is based on practicality concerns. Imagine you are a journalist with data science training, and you want to find articles where a particular individual (say, John Kerry) talked to reporters [2]. This can be expressed as a relatively simple query: find phrases where the main verb is ‘talk’, the subject is ‘John Kerry’, and the verb connects to ‘reporter’ with a to role. Then consider this sentence:

  1. Secretary of State John Kerry talked to some of the reporters about the Middle East crisis.

If the core noun acts as the headword, then our simple query will return the above sentence. But if the partial determiner is the headword, we’ll need to make the query significantly more complex to retrieve the sentence.
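To see the difference in query complexity, here is a Python sketch that represents each candidate parse of the Kerry sentence as a set of (headword, role, tailword) links over lemmas. The link and role names are hypothetical and are not the query language of the Ozora demo system.

```python
# Two hypothetical parses of the same sentence, as (head, role, tail) links.
NOUN_HEADED = {
    ("talk", "subject", "Kerry"),
    ("talk", "to", "reporter"),
    ("reporter", "quantifier", "some"),
    ("talk", "about", "crisis"),
}
DETERMINER_HEADED = {
    ("talk", "subject", "Kerry"),
    ("talk", "to", "some"),
    ("some", "of", "reporter"),
    ("talk", "about", "crisis"),
}
PARTIAL_DETERMINERS = frozenset({"all", "most", "some", "several", "none"})


def simple_query(links):
    """Did Kerry talk to reporters? Two direct link lookups."""
    return ("talk", "subject", "Kerry") in links and (
        "talk", "to", "reporter") in links


def determiner_aware_query(links):
    """Same question, but also looks through a partial determiner."""
    if simple_query(links):
        return True
    if ("talk", "subject", "Kerry") not in links:
        return False
    return any(
        ("talk", "to", det) in links and (det, "of", "reporter") in links
        for det in PARTIAL_DETERMINERS
    )


print(simple_query(NOUN_HEADED))
print(simple_query(DETERMINER_HEADED))
print(determiner_aware_query(DETERMINER_HEADED))
```

With the noun as headword the two-lookup query is enough. With the determiner as headword, every query about a noun needs the extra hop through the determiner.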

The Plan to Fix English

To understand my proposal, spend a moment studying the following sentences:

  1. None of us want to go to school tomorrow.
  2. None of we want to go to school tomorrow. (***)
  3. Some of them thought that the president should resign.
  4. Some of they thought that the president should resign. (***)

Let’s see how the Ozora parser responds to the first two sentences:


As you can see, the parser gets it exactly reversed. It happily generates a parse for the ungrammatical sentence, while refusing to accept the grammatical sentence (the gen_srole indicates a parse failure).

Why does it fail on such a simple pair of sentences? The reason relates to the issues I mentioned above. The Ozora grammar system handles partial determiner phrases by treating the main noun (‘us’ or ‘we’) as the headword, and the quantifier (‘none’) as the tailword. That means the system considers the headword of the phrase “none of us” to be ‘us’, not ‘none’. Furthermore, the system knows about the rules of case, so it refuses to allow ‘us’ to fulfill the subject role.
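The failure can be reproduced in a few lines of Python. This is a toy reconstruction of the logic described above, with made-up word lists and function names, not the parser's real code.

```python
ACCUSATIVE = {"me", "him", "her", "us", "them"}
PARTIAL_DETERMINERS = {"all", "most", "some", "several", "none"}


def headword(phrase):
    """Pick the headword, treating the main noun as head of 'X of Y'."""
    words = phrase.lower().split()
    if len(words) == 3 and words[0] in PARTIAL_DETERMINERS and words[1] == "of":
        return words[2]  # the main noun, not the determiner
    return words[-1]     # crude fallback: last word of the phrase


def accepts_as_subject(phrase):
    """The case rule: an accusative headword cannot fill the subject role."""
    return headword(phrase) not in ACCUSATIVE


print(accepts_as_subject("none of we"))  # accepted, though ungrammatical
print(accepts_as_subject("none of us"))  # rejected, though grammatical
```

Each function is reasonable on its own. It is only when the head choice and the case rule are combined that the result contradicts what English speakers actually say.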

This brings us to the substance of this Plan to Fix English. The proposal is simply to reverse our intuitive judgments and accept the parser’s response to the above sentences as grammatically correct.

This may seem like a ridiculous argument. I’m effectively saying, “you all should change the way you speak to make my life easier.” In fact, that’s exactly what I’m saying, but it’s not as unreasonable as it sounds at first. The Ozora parser is built on a set of logical rules about grammar. In a large majority of cases, the parser agrees with our intuitive grammatical judgments. In many other cases, the parser gets the wrong answer, because of an implementation bug or a legitimate shortcoming of its grammar system. In those instances, I’ll frankly agree that the responsibility to fix the problem is entirely mine. But in this particular case, the parser is faithfully following the logical rules that I described above. In other words, it disagrees with our intuition because our intuition is logically inconsistent.

I could, of course, write some additional code to implement a special case exception that would allow the parser to agree with our intuition. But I don’t want to do that, because the extra code would be unavoidably ugly, since it is forced to reflect ugliness in the underlying material. That brings me back to the appeal I made at the beginning of the post. English is a priceless shared resource which we bequeath to our children and which upholds our civilization and culture. But as the language drifts, it is gradually becoming more and more disorganized. The situation doesn’t seem terrible today, but in time the language may become so muddy and illogical that we lose the ability to discuss sophisticated concepts or to compose beautiful works of literature. We can’t stop linguistic drift, but we might at least be able to guide the movement in the direction of greater elegance and precision, instead of towards confusion and sloppiness.

I hope you found this P2FE to be somewhat thought-provoking, even if you don’t agree with the specific conclusions. Stay tuned for the next installment!

[1] – This may be a false attribution, but it sounds like something Gandhi might say, and anyway it is good advice.

[2] – You can actually try this query on Ozora’s online demo system. As of October 2017, the top result is the following sentence from this article:

Kerry was talking to reporters after meetings in Beijing with top Chinese officials , including President Xi Jinping.



Against Phrasal Taxonomy Grammar

This post is about the problems with approaches to grammar theory that depend on the creation of a phrasal taxonomy. A phrasal taxonomy is a system for categorizing phrases into discrete categories like “Noun Phrase” (NP) or “Verb Phrase” (VP). At first glance, the idea seems quite plausible. Most people have a strong intuitive feeling that phrases like “John”, “the man on the train”, or “a red brick house” are all members of the NP category, while phrases like “kicked a ball”, “flew”, or “raised a family” are all VPs.

In addition to our intuition, the phrasal taxonomy concept is also supported by an important empirical phenomenon. Linguistic terminology is terrible, so in this post I’ll just call the phenomenon pluggability. To illustrate the idea of pluggability, we write down a couple of simple sentences, and then delete a couple of words to leave a blank:

  1. John ____ to the store.
  2. Sally ____ with her mother.
  3. Mike and Bob ____ .
  4. My friend ____ .

Then we brainstorm a bit and try to come up with some completion phrases that could fill the blank spot. In this case, it’s easy to come up with some good ones:

  1. rode a bike
  2. spoke quickly
  3. danced a polka
  4. threw a party

Here’s the amazing thing: all of the phrase candidates can be “plugged into” all of the original sentences. And we could easily write down many more completion phrases and blanked sentences that would work just as well. This is the core insight of pluggability. Pluggability suggests that if we can document and characterize all the sets of phrases that can be plugged into various types of blanks, we’ll have made a major step towards understanding grammar.
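The pluggability test is just a cross product of blanked sentences and completion phrases. Here is a short Python sketch that generates every combination; the function names are mine, and the code only builds the candidate sentences, since judging whether each one is grammatical is still up to a human reader.

```python
from itertools import product

FRAMES = [
    "John ____ to the store.",
    "Sally ____ with her mother.",
    "Mike and Bob ____.",
    "My friend ____.",
]
COMPLETIONS = ["rode a bike", "spoke quickly", "danced a polka", "threw a party"]


def plug(frame, completion, blank="____"):
    """Fill the first blank in a frame with a completion phrase."""
    return frame.replace(blank, completion, 1)


def all_combinations(frames, completions):
    """Every frame paired with every completion."""
    return [plug(f, c) for f, c in product(frames, completions)]


sentences = all_combinations(FRAMES, COMPLETIONS)
for sentence in sentences:
    print(sentence)
```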

At this point it’s important to point out that, as grammarians, we’re primarily interested in grammatical acceptability instead of semantic plausibility. If the grammatical interpretation of a sentence is clear, it is no problem if the sentence expresses a bizarre, whimsical, or fantastical event. For example, if we insert phrase 2 into sentence 1 above, the resulting sentence is certainly grammatical, but sounds very strange in terms of meaning (perhaps the store is haunted).

Based on the observations of the above blanked sentences and completion phrases, let’s declare victory and claim that we’ve identified our first phrase category. We’ll call this phrase category VP. Let’s keep working and look at another type of phrase. Consider the sentences:

  1. The woman was arrested by ____ .
  2. ____ started a technology company.
  3. Ryan attended a meeting with ____ .
  4. Mark and ____ travelled to Russia.

Consider the following phrases as valid completions for the above sentences:

  1. Jim
  2. the man in black
  3. a tree
  4. four Japanese government officials

Again we see that despite the wide diversity of the completion phrases, each of them can complete each of the sentences, with varying degrees of semantic plausibility. So we’ll call the phrases in the above list “Noun Phrases” or NPs for short.

It looks like we have made a good start: we’ve got a good handle on the two most important types of phrases in English. Of course we are going to have to do a lot of work to refine our taxonomy to cover the full range of English grammar, but probably (it seems at the outset) there will not be too many phrase categories – perhaps a couple of hundred? It doesn’t seem like it will be unreasonably difficult to document all of these categories. After all, the language must be easy enough that normal human children can learn it reliably.

What do we do with the taxonomy once we’ve completed it?

There are a variety of paths one might consider as ways to reach the ultimate destination, which is a complete theory of grammar. The most obvious path is to use the taxonomy to describe the ways in which phrase structures can be combined together.

For example, given the basic structures VP and NP as mentioned above, plus a structure S for sentence, a symbol PP for prepositional phrase, and Prep for an individual preposition word like “to”, “from”, “on”, etc., we could write down the following rules:

  1. S :: NP + VP
  2. VP :: VP + NP
  3. VP :: VP + PP
  4. PP :: Prep + NP

Rule 1 says that a sentence is an NP subject plus a VP. Rule 2 says a VP can “split off” an NP representing a direct object. Rule 3, similarly, says that a VP can split off a prepositional phrase. And rule 4 shows that a PP is just an NP with a single preposition word in front of it. Again, this set of rules is hardly complete, but it seems plausible that it is a good start, and that with a serious effort we could identify the full set of rules for English.
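Here is a small Python sketch showing the four rules at work in a chart-based (CYK-style) recognizer. The lexicon that maps single words to categories is my own assumption, since the rules above say nothing about individual words, and the sketch ignores probabilities entirely, so it is a plain context-free recognizer and not a full PCFG parser.

```python
from itertools import product

# The four rules above, keyed by their right-hand side.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("VP", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("Prep", "NP"): "PP",
}

# Hypothetical one-word lexicon, invented for this example.
LEXICON = {
    "John": "NP",
    "bikes": "NP",
    "Boston": "NP",
    "rode": "VP",
    "to": "Prep",
}


def recognize(words, goal="S"):
    """Return True if the words can be combined into the goal category."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every category that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        category = LEXICON.get(word)
        if category is not None:
            chart[i][i + 1].add(category)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, right in product(chart[i][k], chart[k][j]):
                    parent = BINARY_RULES.get((left, right))
                    if parent is not None:
                        chart[i][j].add(parent)
    return goal in chart[0][n]


print(recognize("John rode bikes to Boston".split()))
print(recognize("rode John".split()))
```

A chart is used here because rules 2 and 3 are left-recursive (a VP starts with a VP), which would send a naive top-down recursive parser into an infinite loop.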

This approach is the basis for one of the most common types of grammar theory, called the Probabilistic Context Free Grammar (PCFG). Chomsky famously used this approach, along with another concept called “syntactic movement”, in his early theories of grammar. Because this type of formalism is widely used in mainstream linguistics, it was also adopted by many researchers in machine parsing. The first and most important parse tree annotation effort, called the Penn Treebank, used a PCFG formalism to define the annotations. Because of the influence of the Penn Treebank, a lot of early work in machine parsing also used the PCFG. More recently, many research groups have moved on to a grammar formalism based on typed dependencies, though Microsoft notably still offers a PCFG-based parser through their Linguistic Analysis API. (In a previous blog post, I poked fun at Microsoft’s parser and the PCFG formalism).

In my view, the PCFG and other approaches to grammar that are based on phrasal taxonomies suffer from a number of serious conceptual flaws. I want to try to illustrate these flaws by describing how one might describe some more advanced grammatical phenomena using a taxonomy.


An example of a PCFG-based parse tree, from English Stack Exchange

Verb Tense

Let’s look at what phrasal taxonomy needs to do to handle verb tenses. English has a relatively simplistic verb conjugation system, but the rules it does have are strongly enforced – no native English speaker would say “I have broke the radio”, except, perhaps, in an extremely informal setting. Have a look at the following sentences, which are minor variations on the previous set:

  1. John is ___ to the store.
  2. Mike and Bob are ___ .
  3. Sally has ___ with her mother.
  4. My friends have ___ .

And consider the results of trying to complete the sentences with the following phrases:

  1. riding a bike
  2. throwing a party
  3. spoken quickly
  4. danced a polka

You can see the issue easily. To complete sentences 1+2, we need to use a phrase that is in the gerund (-ing) tense. But for sentences 3+4, we need a phrase in the past participle tense (which in the case of “dance” is just “danced” because it does not have a distinct past participle form). So the pluggability phenomenon breaks down here, even though all of the completion phrases are seemingly VPs.

Hmmm. That’s unfortunate. It looks like we are going to need to introduce some additional complexity to deal with tense. Well, maybe it’s not such a big deal. We just need to split the original VP category into 5 subcategories. We can call them VBP for present, VBD for past, VBZ for third person singular, VBG for gerundial, and VBN for the past participle form. It’s going to be a bit tricky to write the phrase combination rules in terms of these new symbols, but probably not impossible. There aren’t that many more symbols to deal with.

Quantity Modifiers

Let’s look at another set of examples, this time relating to quantity. Consider the following sentences:

  1. ___ talked to the mayor about the crime problem.
  2. The woman gave ___ a plate of cupcakes.
  3. On the way home I met some of ___ .
  4. All of ____ went to the show.

At a first glance, it looks like basically these are going to be NPs. So let’s try some NP-like completions:

  1. the audience
  2. my friends
  3. most of the people
  4. one of the journalists

As you can probably see, the issue here relates to the quantity modifiers all/most/one/etc. The simple phrases 1+2 can be inserted into any of the sentences. Similarly, any of the completion phrases can be plugged into the simple sentences 1+2. So we see again the pluggability phenomenon that motivated this project at the outset.

The problem comes when we try to insert phrases 3+4 into sentences 3+4. This combination doesn’t work because both the phrases and the sentences have quantity modifiers, and using two quantity modifiers in the same phrase is ungrammatical. You cannot say “On the way home I met some of one of the journalists”.

This is another setback for the phrasal taxonomy project. Most of the completion phrases above look like NPs, so we would like to categorize them as such. But the analysis shows that we are going to need to make a distinction between different sets of NPs, depending on whether they contain a quantity modifier. Well, okay, let’s introduce new categories NP[+Q] and NP[-Q], to capture this distinction. Again, it might not seem like an insurmountable barrier; at this stage it is just one additional symbol.

Adjective Ordering Rules

When he was a child, the great fantasy author JRR Tolkien told a story to his mother about a “green great dragon”. His mother told him that this was the wrong way to say it, and he should say “great green dragon” instead. Tolkien asked why, but his mother couldn’t explain it, and the episode was an early influence leading to his lifelong interest in language and fantasy.

Tolkien’s error was to violate an adjective ordering rule. English has a lot of complex rules about how adjectives must be ordered, one of which is that adjectives expressing magnitude, like “great” or “large”, must come before ones expressing color or quality, like “green” or “wooden”. Let’s see how this impacts the phrasal taxonomy project. Here are some sentences:

  1. A ___ rolled down the street.
  2. I wanted to give my sister a ___ .
  3. My parents bought a wooden ___ .
  4. The black ___ exploded with a flash of fire.

And compare the completion results with the following phrases:

  1. car
  2. necklace
  3. large house
  4. tiny basket

We see exactly the same problem pattern that we saw for the quantity modifiers. The one-word phrases (“car” and “necklace”) can be plugged into any of the sentence slots. And all of the phrases can be plugged into sentences 1+2. But the two-word phrases 3+4 cannot be plugged into the sentences that already contain an adjective, because of adjective ordering rules. The phrase “wooden large house” is ungrammatical; we have to say “large wooden house” instead. (If someone said “wooden large house” in real life, a listener might interpret it as a compound word “largehouse” that has some specific technical meaning, like “longboard” has a special meaning in the context of surfing).

If you are a non-native speaker of English, you might be curious or skeptical about how strongly these ordering rules are enforced. The Google NGram Viewer gives us a nice tool to study this phenomenon. This search indicates that the phrase “large wooden house” occurs in thousands of books, while the frequency of the phrase “wooden large house” is so small as to be statistically indistinguishable from zero.

Anyway, this is another problem for the phrasal taxonomy project. In this case it’s not even clear how many new symbols we need to introduce to solve the problem. Do we need a new symbol for every category of adjective that could be implicated in an ordering rule?

Are We Done Yet?


In each of the above sections, we saw that in order to describe a certain grammatical phenomenon, we were going to have to introduce another distinction into our taxonomy. Instead of a simple VP category, we’re going to need a couple of subcategories that express information about verb tense. Instead of a simple NP category, we’re going to need a bunch of subcategories that include information about plurality, adjective ordering and the presence of quantity modifiers.

You can imagine a hapless grammarian sitting at a desk, assiduously writing down a long list of phrasal categories that capture all the relevant grammatical issues in English. Will this poor soul ever be finished with his Sisyphean task? Probably not in this lifetime.

NOR En lærd i sitt studerkammer, ENG A Scholar in his Study

The problem is that the number of categories is exponential in the number of distinctions that need to be expressed. Consider noun phrases. As we saw, we will need to create different categories that represent whether the noun has a quantity modifier, and what type of adjective it contains. Suppose we decide that 3 levels of distinct adjectives are necessary. Furthermore, it is obvious that we will need to express the distinction between plural and singular nouns. We are already at 2*2*3=12 different categories of noun phrases, and we’ve only just started. What about determiners? What about preposition attachments: is “the mayor of Baltimore” in the same category as just “the mayor”? For every new type of distinction we need to make, we’re going to have to double (at least) the number of symbols we need to maintain.

The situation for Verb Phrases is even worse. We saw above how we will need to create different categories for each verb tense. But in fact we will need to go further than that. We’ll need to create new category splits for each of the following distinctions:

  1. Does the phrase have a direct object?
  2. Does the phrase have an indirect object?
  3. Does the phrase have an infinitive (“to”-) complement?
  4. Does the phrase have a sentential (“that”-) complement?

Each of these distinctions requires us to double the number of categories we use. As we saw above, there are five verb tenses, so now we’re at 5*2*2*2*2=80 categories just for verb phrases.
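The arithmetic is easy to check by enumerating the categories directly. In the Python sketch below the feature names are labels I made up for the distinctions discussed in this post; the counts come straight from multiplying the number of values per feature.

```python
from itertools import product

# Feature names are invented labels for the distinctions discussed above.
NP_FEATURES = {
    "quantity_modifier": [False, True],
    "plural": [False, True],
    "adjective_level": [1, 2, 3],
}
VP_FEATURES = {
    "tense": ["VBP", "VBD", "VBZ", "VBG", "VBN"],
    "direct_object": [False, True],
    "indirect_object": [False, True],
    "infinitive_complement": [False, True],
    "sentential_complement": [False, True],
}


def enumerate_categories(features):
    """List every combination of feature values as its own category."""
    names = list(features)
    return [dict(zip(names, values)) for values in product(*features.values())]


np_categories = enumerate_categories(NP_FEATURES)
vp_categories = enumerate_categories(VP_FEATURES)
print(len(np_categories))  # 2 * 2 * 3 = 12
print(len(vp_categories))  # 5 * 2 * 2 * 2 * 2 = 80
```

Adding one more yes/no distinction to `VP_FEATURES` takes the count from 80 to 160, and every grammar rule that mentions a VP has to be rewritten in terms of the new symbols.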

Why does this Matter?

In the early days of Natural Language Processing, the plan was to combine the domain expertise of linguists with the algorithmic expertise of computer scientists, in order to create programs that could understand human language. This plan was pursued vigorously by several research groups. After some time an awkward pattern emerged: it turned out that the linguists couldn’t actually contribute very much to the project. Early researchers found that smart algorithms and machine learning techniques were more useful for building NLP systems than linguistic knowledge. This pattern was expressed succinctly by Fred Jelinek, one of the pioneers of the field, who said “every time I fire a linguist, the performance of the system goes up”. From these early observations about the efficacy of different approaches, almost everyone in the field drew the following conclusion:

  • Linguistic knowledge and theory is not useful for NLP work. Instead, people should rely entirely on Machine Learning techniques.

As a result of this collective decision, modern NLP research contains only a very superficial discussion of actual linguistic phenomena. Most people in the NLP world have little interest in analyzing, for example, the differences or similarities between various types of relative clauses. They do not have much interest in studying the empirical adequacy of different theories of grammar (e.g. X-Bar theory). Instead, the focus is overwhelmingly on computer science topics such as data structures, algorithms, learning methods, and neural network architectures.

My research is predicated upon an alternate explanation of the early history of the field:

  • Linguistic knowledge is potentially very useful for NLP. However, the field of linguistics has not yet obtained a sufficiently accurate theory, and it is better to rely on ML than on low-quality theory.

In other words, if it can be discovered, a highly accurate linguistic theory of grammar will prove decisively significant for the development of NLP systems. Unfortunately, we cannot look to mainstream linguistics for this theory. That field has been dominated for decades by the personality of Noam Chomsky. His pronouncements are given far too much weight by his followers, in particular his bizarre and almost anti-intellectual position that “probabilistic models give no insight into the basic problems of syntactic structure” (see here and also here). In fact, probabilistic modeling is the only tool that can bring a scientific debate about the structure of language to a decisive conclusion.

I encountered the general issues with phrasal taxonomy that I mentioned above in my early work on large scale text compression. My first combined parser/compressor systems were based on a PCFG formalism. The initial versions of the system, which used only simple grammar rules, worked acceptably well. But as I attempted to make increasingly sophisticated grammatical refinements, I needed to scale up the taxonomy, and I found that this was extremely difficult to do, largely because of the kinds of issues I mentioned above. That eventually led me to discard the PCFG formalism in favor of a system that doesn’t require a strict taxonomy.




Ozora Research: One Page Summary

At Ozora Research, our goal is to build NLP systems that meet and exceed the state of the art, by using a radically new research methodology. The methodology is extremely general, which means our work is high-risk, high-payoff: if the NLP research is successful, it will affect not just NLP, but many adjacent fields like computer vision and bioinformatics.

The methodology works as follows. We have a lossless compression program for English text. The input to the compressor is a special sentence description that is based on a parse tree. We have a sentence parser, which analyzes a natural language sentence to find the parse tree that produces the shortest possible encoded length for the sentence. With these tools in place, we can now rigorously and systematically evaluate the parser (and other related NLP tools) by looking at the codelength the combined system achieves on a raw, unlabelled text corpus.
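The selection step can be sketched in a few lines. This is a toy illustration, not the Ozora system: it assumes each candidate parse can be reduced to a list of model probabilities, one per coding decision, and it uses the ideal code length of -log2(p) bits per decision in place of a real compressor. The function names and the candidate data are made up for the example.

```python
import math

def codelength_bits(probabilities):
    """Ideal code length, in bits, for a sequence of coding decisions,
    given the probability the model assigns to each decision."""
    return sum(-math.log2(p) for p in probabilities)

def best_parse(candidates):
    """Return the name of the candidate parse with the shortest code length."""
    return min(candidates, key=lambda name: codelength_bits(candidates[name]))

# Hypothetical candidates: each parse is represented only by the
# probabilities of the decisions needed to encode the sentence under it.
candidates = {
    "parse_a": [0.5, 0.25, 0.5],    # 1 + 2 + 1 = 4 bits
    "parse_b": [0.5, 0.125, 0.25],  # 1 + 3 + 2 = 6 bits
}

print(best_parse(candidates))  # parse_a
```

The same quantity serves as the evaluation metric: summing the code length of the chosen parses over an unlabelled corpus gives a single number that can be compared across versions of the system.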

Compare this methodology to the situation in mainstream NLP research. In sentence parsing, almost all work depends entirely on the existence of human-annotated “gold standard” parse data, such as the Penn Treebank (PTB). This dependence puts severe limitations on the field. One issue is that any conceptual error or inconsistency in the PTB annotation process gets “baked in” to the resulting parsers. Another issue is the small size of the corpus, which is on the order of 40,000 sentences: there are many important but infrequent linguistic phenomena that simply will not appear in such a small sample.

Our research also engages new, interdisciplinary expertise by emphasizing the role of empirical science, as opposed to algorithmic science, which is the centerpiece of modern NLP work. For example, our system incorporates knowledge about verb argument structure: certain verbs such as “argue”, “know”, or “claim” can take sentential (that-) complements, while most verbs cannot. Similarly, our system knows about the special grammar of emotion adjectives like “happy” or “proud”, which can be connected to complements that explain the cause of the emotion (“My father was happy that the Cubs won the World Series”). From this viewpoint, the challenge is to develop a computational framework within which the relevant grammatical knowledge can be expressed simply and cleanly. These issues are largely ignored in mainstream NLP work.

Our work is in the early stages. The basic components of the system are in place, but it has not yet achieved a high level of performance. Funding from the NSF will enable us to scale up the system to determine if the approach is truly viable. Specifically, we will scale up the grammar system to include many infrequent but important phenomena, and also upgrade the statistical model that backs the compressor, by using more advanced machine learning techniques. Funding will also enable us to package the results in a publishable form for the benefit of the broader research community.




New Parse Visualization Format

In my research, I try, to whatever extent possible, to allow myself to be guided by the ideals of beauty and elegance. It’s not always easy to achieve these ideals, and sometimes it’s hard to justify prioritizing them over more mundane and practical concerns. But I believe there is a deep connection between truth and beauty, as John Keats pointed out long ago, and so I try to keep beauty in the front of my mind whenever possible.

(In my last post, I made fun of the PCFG formalism, which I think is both ugly and confusing [1]).

The most aesthetically important component of the system is the tool that produces a visual description of a parse tree. I spend a lot of time looking at these images, so if they don’t look nice, I get a headache. And more importantly, if the image is too cluttered or confusing, it impedes my ability to understand what’s going on in the grammar engine.

So I am very excited about one of the features in the most recent development cycle, which is a new, slimmed-down look for the parse trees. To motivate this, observe the following image, which was built using the old system’s parse viewer (if it is too small, right click and select “view in new tab”):


There are a couple of annoying things going on in this image. First of all, notice how in the final prepositional phrase (“on gun ownership”), there is one main link labelled with on, and then another subsidiary link labelled on_xp. The main link is meaningful, because it expresses the logical connection between the words “restrictions” and “ownership”. On the other hand, the subsidiary link really doesn’t tell you very much. It’s required by the inner workings of the parsing system, but it doesn’t help you understand the structure of the sentence.

Another example of this redundancy is the pos_zp role that links the name “Bloomberg” to the apostrophe “’s”. Again, the system requires this role before it will allow the owner role, but the latter role is the one that is actually informative.

The new visualization system removes the “preparatory” links from the parse image. This removes the redundant clutter, and it also has a nice additional benefit of reducing the height of the image.
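The filtering idea is simple enough to sketch. The representation below is hypothetical: I am assuming a link is a (role, head, dependent) tuple and that the preparatory roles can be listed explicitly. Only the role names on, on_xp, and pos_zp come from the example above, and the endpoints given for on_xp are a guess for illustration.

```python
# Roles the parser needs internally but that add nothing for a human reader.
# Only these two appear in the example; a real list would be longer.
PREPARATORY_ROLES = {"on_xp", "pos_zp"}

def visible_links(links):
    """Keep only the links worth drawing in the parse image."""
    return [link for link in links if link[0] not in PREPARATORY_ROLES]

# Hypothetical links, each written as (role, head, dependent).
links = [
    ("on", "restrictions", "ownership"),
    ("on_xp", "ownership", "on"),      # endpoints assumed for illustration
    ("pos_zp", "Bloomberg", "'s"),
]

print(visible_links(links))  # [('on', 'restrictions', 'ownership')]
```

Because the filter runs only at display time, the parser still sees every link it requires.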


Another difference, which should be pretty obvious, is the change in the part-of-speech annotations that are listed underneath each word. In the old system, I was using a POS system that was basically descended from old PCFG concepts. So the word “restrictions” was labeled with NNS, while “joined” was labeled with VBA [2]. Now instead of those somewhat cryptic symbols, we just have N[-s], which indicates a plural noun, for “restrictions”, and V[-ed], which indicates a past-tense verb, for “joined”.

In other cases, I’ve left out the POS annotations entirely, because they’re obvious from the role label. For example, here’s the old and new output on a simple sentence:



As you can see in the old version, the word “has” is marked with a special POS tag HAVE. Now, it is an important technical point that the form “has/had/have” is its own grammatical category, and therefore in principle it should have its own symbol. However, the viewer of the parse doesn’t need to be reminded of this, since the word is marked with the special aux_have link, which cannot connect to any other category.

Question for readers: which visualization format do you prefer? Is the new version easier to understand or not?

A closing thought on the power of writing up ideas for presentation: in the course of writing this post, I noticed a subtle bug in the way the new version of the system is handling punctuation. Do you see the boxes around the periods at the end of the sentences? The color indicates the amount of codelength required to send the sentence-ending punctuation. The green boxes indicate that the (old) system is using a very short code, which is feasible for encoding periods, because almost all normal declarative sentences end with periods. The yellow boxes indicate an increased codelength cost. Now in the images from the old version, the boxes are green, but in the new version, the boxes are yellow. This indicates that the new version has a bug in the component responsible for predicting the sentence-final punctuation.

Thanks for reading!  Please feel free to comment on this page directly, or reach out to me through the “contact” link, if you are interested in talking about NLP, sentence parsing, grammar, and related issues.



[1] – Several years ago, when I was working on the early versions of the Ozora parser, I actually used a grammar formalism based on the PCFG. There were a number of technical problems with this formalism, but these technical problems probably would not have proved compelling on their own. In addition to the technical issues, though, there was also a huge aesthetic issue: the resulting parse trees didn’t look good, and the problem got worse and worse as the sentence got bigger. Because of the way English sentences branch, PCFG visualizations tend to appear very tall, and angled down and to the right. Consider the following sentence:

The debate over the referendum was rekindled in Israel after reports that Naftali Bennett , a minister whose Jewish Home Party opposes the establishment of a Palestinian state , was soliciting the support of Yair Lapid , the finance minister and leader of the centrist Yesh Atid Party , for new legislation .

This is a long sentence, and any visualization tool is going to struggle with the width. But when faced with a long sentence, PCFG tree images also have a huge problem with height. I parsed this sentence using Microsoft’s Linguistic Analysis API, and here’s a slice of what came out:


You can see the huge gap between the POS annotations at the top of the image, and the words at the bottom. Almost half the screen space is entirely wasted. In contrast, here’s the Ozora visualizer’s output for the left half of the sentence:


As you can see, there is not nearly as much wasted space in the visualization, and it is much easier to understand the logical relationship between the various words.

[2] This concept of VBA was something I was quite proud of when I came up with it, though I am no longer using it. The idea behind the VBA category relates to the fact that, while most English verbs have only four distinct conjugations, some verbs have a fifth conjugation, called the past participle, which normally has an -en suffix. Examples are “broken”, “spoken”, “eaten”. This form is usually denoted VBN, while the regular past is denoted VBD. Now if the verb has no distinct VBN, then you are allowed to use the VBD instead. But if it does have a VBN, you must use it or the sentence will be ungrammatical. Consider:

I have visited London three times in the past year.

I have spoke to the president about the issue. ***

Mike wanted to buy a used car.

John was able to fix the broke radiator. ***


In these sentences, the first example of each pair is grammatical, because the verbs “visit” and “use” have no distinct VBN form. The second example of each pair is ungrammatical, because the verbs “speak” and “break” have distinct VBN forms, which must be used in the given context.

It’s actually a bit tricky to express this rule succinctly in the parsing engine. You have to say something like “allow VBD, but only if the verb does not have VBN”. In other words the parsing system has to query back into the lexicon to determine if an expression is grammatical.


To avoid this, my idea was to package words with ambiguous VBD/VBN conjugations, like “visited” and “joined”, together under the VBA symbol. Verbs that had separate fifth conjugations would produce VBD or VBN as appropriate, but other four-conjugation words would produce only VBA. Then you could express the grammar rule in a succinct way: accept either VBA or VBN but not VBD.
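Here is a minimal sketch of the VBA idea, using a four-verb toy lexicon. The lexicon layout and function names are invented for the example and are not taken from the Ozora code. The point is that the lexicon lookup happens once, when the tags are assigned, so the grammar rule itself never has to query the lexicon.

```python
# Toy lexicon: lemma -> (past form, past participle form).
LEXICON = {
    "visit": ("visited", "visited"),
    "use": ("used", "used"),
    "speak": ("spoke", "spoken"),
    "break": ("broke", "broken"),
}

def build_tags(lexicon):
    """Assign VBA when past and participle coincide, else VBD and VBN."""
    tags = {}
    for past, participle in lexicon.values():
        if past == participle:
            tags[past] = "VBA"
        else:
            tags[past] = "VBD"
            tags[participle] = "VBN"
    return tags

TAGS = build_tags(LEXICON)

def participle_slot_ok(word):
    """Grammar rule for a participle slot: accept VBA or VBN, never VBD."""
    return TAGS.get(word) in {"VBA", "VBN"}

print(participle_slot_ok("visited"))  # True:  "I have visited London"
print(participle_slot_ok("spoke"))    # False: "I have spoke to the president"
print(participle_slot_ok("broken"))   # True:  "fix the broken radiator"
```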