At Ozora Research, our goal is to build NLP systems that meet and exceed the state of the art by using a radically new research methodology. The methodology is extremely general, which makes our work high-risk, high-payoff: if the research succeeds, it will affect not just NLP but many adjacent fields, such as computer vision and bioinformatics.
The methodology works as follows. We have a lossless compression program for English text, whose input is a structured sentence description derived from a parse tree. We also have a sentence parser, which analyzes a natural-language sentence to find the parse tree that yields the shortest possible encoded length for that sentence. With these tools in place, we can rigorously and systematically evaluate the parser (and other related NLP tools) by measuring the codelength the combined system achieves on a raw, unlabelled text corpus.
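The evaluation loop described above can be sketched in a few lines. Everything here is illustrative: the function names, the toy probability inputs, and the list-of-candidates interface are assumptions rather than the actual system. The sketch shows only the core principle of selecting, for each sentence, the parse with the shortest Shannon codelength, and scoring the whole system by total bits over a corpus.

```python
import math

def codelength(probability):
    """Shannon codelength in bits for an event with the given probability."""
    return -math.log2(probability)

def best_parse(candidates):
    """Select the candidate parse with the shortest encoded length.

    `candidates` is a list of (parse, probability) pairs; the parse
    structures themselves are opaque in this sketch, and the probability
    stands in for whatever the real statistical model assigns.
    """
    return min(candidates, key=lambda pair: codelength(pair[1]))

def corpus_codelength(sentence_candidates):
    """Evaluation metric: total bits needed to encode the corpus,
    summing the shortest achievable codelength for each sentence."""
    return sum(codelength(best_parse(c)[1]) for c in sentence_candidates)
```

A lower `corpus_codelength` means the parser plus compressor jointly model the text better, which is what lets the system be evaluated on raw, unlabelled data.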
Compare this methodology to the situation in mainstream NLP research. In sentence parsing, almost all work depends entirely on human-annotated “gold standard” parse data, such as the Penn Treebank (PTB). This dependence places severe limitations on the field. One issue is that any conceptual error or inconsistency in the PTB annotation process gets “baked into” the resulting parsers. Another is the small size of the corpus, on the order of 40,000 sentences: many important but infrequent linguistic phenomena simply will not appear in such a small sample.
Our research also engages new, interdisciplinary expertise by emphasizing empirical science, as opposed to the algorithmic science that is the centerpiece of modern NLP work. For example, our system incorporates knowledge about verb argument structure: certain verbs, such as “argue”, “know”, or “claim”, can take sentential (that-) complements, while most verbs cannot. Similarly, the system knows about the special grammar of emotion adjectives like “happy” or “proud”, which can be connected to complements that explain the cause of the emotion (“My father was happy that the Cubs won the World Series”). From this viewpoint, the challenge is to develop a computational framework within which the relevant grammatical knowledge can be expressed simply and cleanly. These issues are largely ignored in mainstream NLP work.
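One simple way to express such lexical facts declaratively is as small word sets keyed by syntactic category. The word lists below come from the examples above, but the representation itself is a hypothetical illustration, not the proposal's actual framework.

```python
# Hypothetical mini-lexicon: which words can head a that-complement.
# The entries are taken from the examples in the text; a real lexicon
# would be far larger and richer.
THAT_COMPLEMENT_VERBS = {"argue", "know", "claim"}
EMOTION_ADJECTIVES = {"happy", "proud"}  # license cause complements

def licenses_that_complement(word, pos):
    """Return True if `word`, with part of speech `pos`, can take a
    sentential (that-) complement under this toy lexicon."""
    if pos == "verb":
        return word in THAT_COMPLEMENT_VERBS
    if pos == "adjective":
        return word in EMOTION_ADJECTIVES
    return False
```

A grammar expressed this way stays declarative: adding a new subcategorization fact is a one-line change to a word set, not a change to parsing code.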
Our work is in the early stages. The basic components of the system are in place, but it has not yet achieved a high level of performance. Funding from the NSF will enable us to scale up the system to determine whether the approach is truly viable. Specifically, we will extend the grammar system to cover many infrequent but important phenomena, and upgrade the statistical model that backs the compressor using more advanced machine learning techniques. Funding will also enable us to package the results in a publishable form for the benefit of the broader research community.