Tuesday, January 8, 2013

Supervised Learning Model for Lemmatization and Thoughts About Compression

The quality of the lemmatizer depends on the quality of its training data. It can be used for essentially any natural language, except for the Porter stemmer part (the current naive implementation supports Russian only).

Code and functionality description:

The model is implemented by the LemmatizerModel Java class.
The entry point and the consuming code live in Main.java.
TrainDataParser opens and parses the training data.

MODEL DESIGN

To make searching the trained data fast and to keep the model in a packed data structure, the model is represented as a prefix tree. Word forms are paths from the root; each node contains a value (a Character) and two kinds of child nodes: the next characters, and one special child that holds the list of lemmas for the current tree layer.
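
As an illustration, here is a minimal sketch of such a node in Java. The names (TrieNode, children, lemmas) are mine and not necessarily those used in LemmatizerModel:

    import java.util.List;
    import java.util.Map;

    // One node of the prefix tree: word forms are paths from the root,
    // "children" holds the next characters, and "lemmas" plays the role of
    // the special child with the lemma list for this tree layer.
    class TrieNode {
        final char value;                  // the character stored in this node
        Map<Character, TrieNode> children; // next-character children (created lazily)
        List<String> lemmas;               // lemmas for word forms ending here

        TrieNode(char value) {
            this.value = value;
        }
    }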


[Figure: an example of the prefix tree with lemmas stored]


Compression is achieved by sharing word prefixes and by using a string pool to store the resulting lemmas. In Java the string pool can be used automatically; in C++ we would have to implement it ourselves by providing an additional data structure (a container similar to Java's string pool). For the lemmas it could even be a compressed in-memory stream with indexed access to its elements.
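
A tiny, self-contained illustration of the Java side, using String.intern() to keep one shared copy of each lemma (the class LemmaPoolDemo exists only for the demonstration):

    import java.util.ArrayList;
    import java.util.List;

    class LemmaPoolDemo {
        public static void main(String[] args) {
            List<String> lemmas = new ArrayList<String>();
            // intern() returns the canonical copy from the JVM string pool, so
            // equal lemmas coming from different tree nodes share one object.
            lemmas.add(new String("бежать").intern());
            lemmas.add(new String("бежать").intern());
            System.out.println(lemmas.get(0) == lemmas.get(1)); // true: same pooled object
        }
    }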

High speed of the lemma-retrieval operation is achieved thanks to the fast-search property of the prefix tree: we have to make no more than n decisions to reach the lemma, where n is the length of the word form.
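
With the TrieNode sketched above, retrieval can look like this (lookup is an illustrative name, not the model's real API):

    import java.util.List;

    class Lookup {
        // At most one child lookup per character, so the number of decisions
        // is bounded by n, the length of the word form.
        static List<String> lookup(TrieNode root, String wordForm) {
            TrieNode node = root;
            for (int i = 0; i < wordForm.length(); i++) {
                if (node.children == null) return null;       // path is absent
                node = node.children.get(wordForm.charAt(i));
                if (node == null) return null;                // unknown word form
            }
            return node.lemmas; // may be null if no lemmas end at this node
        }
    }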

At the moment the tree is implemented with a "node is a dictionary" approach, where all nodes that have not been initialized are virtualized, i.e. allocated lazily. This is a significant memory-usage optimization.
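
One way to write this lazy allocation for the sketched TrieNode (getOrCreateChild is an illustrative method name):

    // A method added to the TrieNode sketch (requires java.util.HashMap).
    TrieNode getOrCreateChild(char c) {
        if (children == null) {
            children = new HashMap<Character, TrieNode>(); // allocated only on first use
        }
        TrieNode child = children.get(c);
        if (child == null) {
            child = new TrieNode(c);
            children.put(c, child);
        }
        return child;
    }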

Possible packing approach without lemmatization quality loss

The lemma-retrieval algorithm is very fast, so in a low-memory environment the tree can be saved in a special binary format and even kept on a hard drive. Note that even very slow hard-drive access will hardly affect lemma retrieval, because only a very small number of accesses is needed to reach a lemma.
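
As a starting point, a naive pre-order dump of the sketched TrieNode to a binary stream could look like the code below. The format is only an assumption; a real on-disk layout would also need an index to support random access:

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.List;

    class TreeDump {
        static void dump(TrieNode node, DataOutputStream out) throws IOException {
            out.writeChar(node.value);
            List<String> lemmas =
                    node.lemmas == null ? Collections.<String>emptyList() : node.lemmas;
            out.writeInt(lemmas.size());
            for (String lemma : lemmas) out.writeUTF(lemma);
            Collection<TrieNode> children = node.children == null
                    ? Collections.<TrieNode>emptyList() : node.children.values();
            out.writeInt(children.size());
            for (TrieNode child : children) dump(child, out); // pre-order traversal
        }
    }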

Model size reduction approaches with quality loss

  • The first idea is to truncate the lengths of paths in the word tree. The model API provides setCompressionAspect(aspect), where aspect is a double that must lie in the [0.3, 1.0] interval; its value sets the ratio between the path length in the prefix tree and the word-form length (a sketch of how such truncation might be applied appears after this list). The resulting compressed tree will look like this:
[Figure: the prefix tree with truncated paths]
    • This truncation works, but it lowers lemmatization quality: the model will return redundant lemmas for a request, and that is a problem. A possible way to make the quality drop less significant is to apply filtering rules, and this is why I want to try a heuristic based on stemming. A Stemmer class for Russian is available in the code (it is naive, but it basically works). This approach supports a flexible size-reduction aspect.

  • The second idea is to store only a segment hash (or something similar to Huffman coding) wherever a stretch of the path is free of lemma nodes and branching. This can be a really good optimization, but it does not allow a flexible compression ratio for this model; it is just a post-processing packing of the model.


[Figure: path segments marked with rectangles]
The rectangles are candidates for hashing; all hashes can be stored in a special pool (similar to the string pool), or, if the hashed representation is small, they can be inlined.

  • The third idea is based on stream packing. We can devise a binary representation of the word prefix tree that supports random access and traversal, and then store this data structure in a custom stream that supports compression. In this case we can pick any compression ratio the compression algorithm supports, and we get a trade-off between memory usage and processor load; the extra CPU load has to be counted as the quality cost. The good news is that the accuracy of the final result is not affected at all. A minimal sketch of such a packed writer is given at the end of the post.
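
Here is the truncation sketch promised in the first item. setCompressionAspect is the real model API, but insertTruncated and the ceiling rounding are only my guess at how the ratio could be applied during insertion:

    import java.util.ArrayList;

    class TruncatedInsert {
        // With aspect < 1.0 the stored path is shorter than the word form, so
        // several word forms end up sharing one node and its lemma list grows.
        // That shared, grown list is exactly the source of the redundant lemmas.
        static void insertTruncated(TrieNode root, String wordForm,
                                    String lemma, double aspect) {
            int maxDepth = (int) Math.ceil(wordForm.length() * aspect);
            TrieNode node = root;
            for (int i = 0; i < maxDepth; i++) {
                node = node.getOrCreateChild(wordForm.charAt(i));
            }
            if (node.lemmas == null) node.lemmas = new ArrayList<String>();
            if (!node.lemmas.contains(lemma)) node.lemmas.add(lemma);
        }
    }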

All these approaches require additional research and experimenting.
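
For the third idea, the binary dump sketched earlier can simply be written through a Deflater stream; the Deflater level is the knob that trades CPU time for size. The wiring below (PackedModelWriter, the file path) is, again, just an illustration:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    class PackedModelWriter {
        static void write(TrieNode root, String path) throws IOException {
            // BEST_COMPRESSION burns more CPU for a smaller model;
            // lower levels shift the trade-off the other way.
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            try (DataOutputStream out = new DataOutputStream(
                    new DeflaterOutputStream(new FileOutputStream(path), deflater))) {
                TreeDump.dump(root, out); // the pre-order dump sketched earlier
            }
        }
    }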

Further reading: see the repository https://github.com/korobool/nlp
