The Ozora Research Blog is where we discuss our work in “comperical” linguistics.

Comperical science is defined by the following process:

  1. Obtain a large database of observations relevant to some phenomenon of interest.
  2. Develop a theory of the phenomenon, or modify an existing theory.
  3. Instantiate the theory as a lossless data compressor.
  4. Invoke the compressor on the database; the theory’s “score” is the size of the resulting encoded file plus the bit length of the compressor program itself.
  5. Keep the new theory if its score is lower than the previous best result; otherwise discard it.
  6. Return to step #2.
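The search loop above can be sketched in a few lines of Python. This is an illustrative skeleton only: the candidate theories and their bit counts are made-up numbers, and in practice each "candidate" would be a full compressor run against the database.

```python
# Hypothetical sketch of the comperical loop. A theory is scored by the
# size of the compressed database plus the size of the compressor itself
# (an MDL-style two-part code); we keep a theory only if it lowers the score.

def score(program_bits, encoded_bits):
    """Total codelength: the theory's own description plus the data it encodes."""
    return program_bits + encoded_bits

# Illustrative (program_bits, encoded_bits) pairs for three candidate theories.
candidates = [(5_000, 120_000), (9_000, 100_000), (4_000, 130_000)]

best = float("inf")
for program_bits, encoded_bits in candidates:
    s = score(program_bits, encoded_bits)
    if s < best:  # step 5: keep only improvements
        best = s

print(best)  # 109000
```

Note that charging for the compressor program itself is what keeps the process honest: a theory cannot win simply by memorizing the database inside its own code.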

The nature of the research involved in this process depends on the type of data used in Step #1. Comperical linguistics is the research field that results from using a text database.

This work involves two very different types of intellectual activity. On the one hand, we do a great deal of linguistic analysis. Many concepts discussed in mainstream linguistics also show up in our work: parse trees, verb argument structure, morphology, the theta criterion, subject-auxiliary inversion, and so on.

On the other hand, we also do a lot of work that is more closely related to the fields of computer science, machine learning, and natural language processing (NLP). The central requirement of text data compression is to build a good statistical model of the probability of a sentence. There are simple ways to do that (N-grams), but our goal is to go beyond simple techniques, and build real linguistic concepts into our system.