(Note: I actually think Microsoft is a pretty great company, with a ton of smart people.)
When you’re doing independent research, it’s really important to celebrate the small victories. Being independent means you don’t have a lot of money, you don’t have access to the best data sets, and you can’t run your code on the fastest computers. You have to do all the nitty-gritty software engineering and DevOps work yourself, while researchers at big tech companies can outsource that stuff to someone else, so they can focus on high-level research issues. Even worse, you have to deal with the possibility that your Big Idea is actually wrong. So it’s important psychologically to savor the rare moments when you succeed where a BigCo fails.
For me, one of those moments came a couple of days ago when I was surfing the Microsoft Cognitive Services page. This service is a part of Microsoft’s big, recent commitment to research in artificial intelligence (AI). The basic idea is to make a suite of state-of-the-art algorithms available on the cloud through a web API, so that small companies or researchers who want to use such algorithms can do so without having to do lots of yak-shaving. One of the services is called the Linguistic Analysis API, which is basically a natural language sentence parser. You can try out the parser on the page, and the demonstration sentence they preload into the page is the following:
The Linguistic Analysis API simplifies complex languages to help you easily parse text
Which results in the following parse tree:
Now, take a moment to study this parse tree. Does it make sense to you? Is it visually appealing? In my (very biased) opinion, the answer to both questions is no. This kind of parse tree totally obscures the actual grammatical relationships between the words. What, for example, is the relationship between the verb “simplifies” and the noun “languages”? Obviously the correct answer is object, but to determine this from looking at the tree, you need to jump up from the word itself, to the VBZ symbol, then connect to the NP subtree through the VP parent, and then drill down to the NNS. And you have to just know that this particular configuration of symbols implies a direct object relationship.
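To make that complaint concrete, here is a minimal Python sketch of what it takes to recover the verb-object relationship from a bracketed tree. The tree is a simplified, hand-written fragment (not the actual Microsoft output), and the tuple encoding and the `find_direct_objects` helper are assumptions made up for this illustration.

```python
# Simplified, hand-written constituency fragment (NOT the real parser output).
# Internal nodes are (label, child, child, ...); leaves are (tag, word).
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "API")),
        ("VP", ("VBZ", "simplifies"),
               ("NP", ("JJ", "complex"), ("NNS", "languages"))))


def is_leaf(node):
    """A leaf is a (tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def find_direct_objects(node):
    """Yield (verb, noun) pairs using the configuration rule:
    a VP with a verb child and an NP child means the NP's last noun is the object."""
    if is_leaf(node):
        return
    label, children = node[0], node[1:]
    if label == "VP":
        verbs = [c[1] for c in children if is_leaf(c) and c[0].startswith("VB")]
        for np in (c for c in children if not is_leaf(c) and c[0] == "NP"):
            nouns = [c[1] for c in np[1:] if is_leaf(c) and c[0].startswith("NN")]
            if verbs and nouns:
                yield verbs[0], nouns[-1]
    for child in children:
        yield from find_direct_objects(child)


constituency_pairs = list(find_direct_objects(tree))

# The same fact stored as an explicit labeled link needs no tree walking at all.
dependency_links = [("simplifies", "obj", "languages")]
dependency_pairs = [(head, dep) for head, rel, dep in dependency_links if rel == "obj"]

print(constituency_pairs)
print(dependency_pairs)
```

In the bracketed form, the object relationship exists only as a pattern that the reader (or the code) has to know in advance. In the link form, it is written down directly.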
Not only is it confusing, this kind of parse tree visualization looks really ugly. Notice how much space is being wasted on the left side, where there is a huge gap between the symbol layer on the top, and the words on the bottom. On the right side, the words and symbols are packed together tightly, because there are so many symbol expansions – the word “parse” is nine levels below the starting TOP symbol. Reading the sentence and introspecting, do you really feel that it requires that much complexity to describe?
But, okay, maybe these are just aesthetic complaints that don’t have a place in proper scientific deliberation. Here’s a real complaint: the parse is wrong. If you’re so inclined, take a minute to study the tree and try to figure out the issue.
Ironically, the mistake relates to the word “parse” itself. The parser thinks that “parse” is an adjective (JJ), contained within an adjective phrase (ADJP). That’s clearly wrong: “parse” can be a noun or a verb, but not an adjective. The symbol above it (“easily parse”) should be a VP, not an ADJP. And since the word “text” is the direct object of “parse”, the subtree for “text” should be below the subtree for “parse”, not above it.
Huh. Okay, well, the Microsoft parser got this one wrong. So what? Natural language parsing is hard, very hard; I’ll be the first to admit it. The fact that their parser makes a few mistakes on one sentence isn’t such a huge failure – it got most of the sentence right, and probably it works fine on other sentences.
But wait a minute. Their parser failed on the demo sentence they chose to put on the splash page. They could have used any sentence they wanted (“The Linguistic Analysis API makes natural language a piece of cake!”, “Use the Linguistic Analysis API to simplify your text analysis workflow!”, “Can eagles that fly swim?” etc.). Or, if for some reason they are really attached to that particular sentence, they could have hacked the parser somehow to guarantee that it produces the correct parse for it. The parse result for the demo sentence is literally the first thing a potential customer would see when trying out the service.
So what really happened? My guess is one of two things: 1) They didn’t actually notice that the parse is wrong, or 2) they think potential customers won’t notice that the parse is wrong. In both scenarios, the ultimate cause is the same: the parse tree notation system they’re using is incomprehensible gibberish. Either the bad notation obscured the problem from the Microsoft developers themselves, or they figured the bad notation would obscure the problem from potential customers.
This is the point where we stop chuckling at Microsoft, and start chuckling at the formalism itself. The formalism is called a Probabilistic Context Free Grammar (PCFG), and it wasn’t invented by Microsoft. It’s been in use by both the mainstream linguistics and the mainstream NLP communities for decades.
In fact, until a couple of years ago, if you wanted your parser to be taken seriously by the NLP community, it had to use the PCFG formalism as its output. This is because the most prominent evaluation database, the Penn Treebank, was annotated (by humans) in a PCFG format. To evaluate your parser, you invoked it on the test sentences in the PTB, and then compared the output of your system to the annotation information in the database. If your system didn’t produce a PCFG output, it could not be evaluated. Furthermore, you’re required not just to use the PCFG formalism, but the specific instantiation of the formalism. For example, you’re not at liberty to add a new symbol OBSP to represent gerundial verb phrases that act as the target of an observation verb (“I heard him playing the piano in the other room”).
So, again up until recently, if there were any errors or problems, general or specific, with the PCFG formalism as used in the treebank – if in fact this parse tree notation style is incomprehensible gibberish – then these problems were baked into all the parsers that were evaluated in this manner. You could never claim that your parser is better than the Penn Treebank; the quality of a parsing system is defined by the extent to which it agrees with the benchmark. If the researchers who developed the PTB made a mistake when choosing their part-of-speech tagset, this error would propagate into all the parsers developed by the community.
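As a rough illustration of what "agreeing with the benchmark" means in practice, here is a minimal sketch of labeled-bracket scoring, in the style commonly called PARSEVAL. The two trees are toy fragments loosely modeled on the "easily parse" error described earlier; they are not real treebank annotations or real parser output, and the function names are made up for this example.

```python
def brackets(node, start=0):
    """Return (set of (label, start, end) spans, end position) for a tuple tree.
    Leaves are (tag, word) pairs and contribute no bracket of their own."""
    if len(node) == 2 and isinstance(node[1], str):
        return set(), start + 1
    spans, pos = set(), start
    for child in node[1:]:
        child_spans, pos = brackets(child, pos)
        spans |= child_spans
    spans.add((node[0], start, pos))
    return spans, pos


def bracket_f1(gold_tree, predicted_tree):
    """Precision, recall and F1 over labeled brackets (toy version)."""
    gold, _ = brackets(gold_tree)
    pred, _ = brackets(predicted_tree)
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Toy "gold" annotation for: help you easily parse text
gold = ("VP", ("VB", "help"),
        ("S", ("NP", ("PRP", "you")),
              ("VP", ("ADVP", ("RB", "easily")),
                     ("VB", "parse"),
                     ("NP", ("NN", "text")))))

# Toy "system output" that treats "parse" as an adjective.
predicted = ("VP", ("VB", "help"),
             ("NP", ("PRP", "you")),
             ("NP", ("ADJP", ("RB", "easily"), ("JJ", "parse")),
                    ("NN", "text")))

precision, recall, f1 = bracket_f1(gold, predicted)
print(precision, recall, f1)
```

Notice that the score only measures overlap with the gold brackets. If the gold annotation itself is questionable, a parser that reproduces it faithfully still gets full marks.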
Actually, the field depends on the Penn Treebank for more than just evaluation. Almost without exception, modern NLP systems are developed using machine learning (ML) techniques. In order for ML to work, the algorithms must have a good supply of “training” data: input samples that are labeled with the correct output response. The ML algorithm then learns from these examples to reproduce the desired input/output function. So, if you’re doing parsing research, where do you get the labeled training data? You guessed it: the Penn Treebank.
More recently, many researchers in the field have switched to a new parse formalism called dependency grammar, also known as link grammar. These formalisms describe sentences in terms of links pointing from head words to tail words; the links usually also have a label denoting the type of the relationship (subject, object, etc). To evaluate a parser that produces dependency grammar output, researchers take the PCFG annotation data and programmatically convert it to the new format. In this way, researchers are able to break away from the overarching PCFG formalism.
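One common way to do this kind of conversion is with head rules: for each phrase, pick one child as the head, then link the head words of the other children to it. The sketch below is a toy version of that idea. The `HEAD_RULES` table and the tree are invented for illustration and are far simpler than what real conversion tools use; links are left unlabeled.

```python
# Hypothetical head rules: for each phrase label, child labels to prefer
# as the head, in priority order. Real rule tables are much larger.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VBZ", "VB", "VP"],
    "NP": ["NNS", "NN", "NP"],
}


def to_dependencies(node, links):
    """Return the head word of `node`, appending (head, dependent) pairs to `links`.
    Words serve as identifiers here, so this toy breaks on repeated words."""
    if len(node) == 2 and isinstance(node[1], str):   # leaf: (tag, word)
        return node[1]
    label, children = node[0], node[1:]
    child_heads = [to_dependencies(child, links) for child in children]

    # Choose the head child by rule priority; fall back to the last child.
    head_index = len(children) - 1
    for preferred in HEAD_RULES.get(label, []):
        matches = [i for i, child in enumerate(children) if child[0] == preferred]
        if matches:
            head_index = matches[0]
            break

    head = child_heads[head_index]
    for i, dependent in enumerate(child_heads):
        if i != head_index:
            links.append((head, dependent))
    return head


# Simplified, hand-written fragment (not real treebank data).
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "API")),
        ("VP", ("VBZ", "simplifies"),
               ("NP", ("JJ", "complex"), ("NNS", "languages"))))

links = []
root = to_dependencies(tree, links)
print(root)
print(links)
```

The converter can only rearrange information that the bracketed annotation already contains. Any distinction the original annotators did not record has no way of appearing in the converted links.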
But they are still constrained by the inherent limitations of the approach to evaluation. Any error or lacuna in the underlying human annotation will still cause an error in the converted version. If the underlying data set fails to make an important distinction (such as that related to observational verbs mentioned above), the converted version will fail to make the distinction also.
Furthermore, any evaluation procedure based on a limited-size corpus will have serious difficulty judging a system’s performance on rare grammatical constructs. For example, the word “help” has an eccentric characteristic, which is that it can connect to bare infinitive complements:
I would like to help you (to) build your company.
Most other verbs that connect to infinitive complements require the particle “to”, but for “help”, it is optional or even discouraged. So if the treebank dataset doesn’t have enough sentences with the specific word “help”, parsers won’t be able to learn how to handle this construct correctly.
Let’s see how some other mainstream parsing systems handle Microsoft’s teaser sentence. Here’s the result from the Stanford parser:
Here’s the result from spaCy:
Well, these examples are at least better visually. One thing that is immediately obvious is that they are flatter. But it is still quite difficult to understand what is going on. In particular, what are xcomp, ccomp, and advcl? If you look up these tags on the Universal Dependencies page, you will find that xcomp is an “open clausal complement”, ccomp is a “clausal complement”, and advcl is an “adverbial clause modifier”. Does that clarify things for you?
Let’s look at what the Ozora parser does with this sentence:
Now you can see that the connection between “simplifies” and “help” is a purpose link. That seems a lot more informative than xcomp or advcl. Also, “help” links to “parse” with an inf_arg link, which is more precise than ccomp, because the latter contains that-complements as well as infinitive complements.
Of course, I’ve got a long way to go: the Ozora parser is far from perfect. But, unlike almost every other parser, Ozora’s was developed without a treebank database. So it is immune to the inherent limitations of that evaluation paradigm. If you’re interested in how this works, check out my book.
 Depending on the methodology they used to build the parser, this may not actually be that easy. A big drawback of many modern ML approaches is that they are not interpretable and therefore not easily debuggable. You can’t dig into your 10-layer CNN and surgically modify some weights to ensure it outputs the right response for a given input. See also the fiasco related to the Google Image tagging.
 These screenshots were taken on March 14, 2017. Here is a full screenshot showing the MSFT URL in the upper left:
 spaCy’s parse actually contains an error: “help” is connected to “text”, and “parse” is linked to it using an amod link.
 Full disclosure: I had to do a bit of hacking to get it to work correctly on this example. The reason relates to the acronym “API”. In the demo sentence, “API” acts as a common noun; but in the previous configuration, the parser treated all acronyms as proper nouns. And proper nouns basically cannot take modifiers, so “Linguistic Analysis API” was ungrammatical. To fix this, I changed the system to allow it to interpret acronyms as both common and proper nouns and then to pick the best interpretation.