Posterous theme by Cory Watilo

Tuning your Maxixe segmenter with the new optimizer

Remember when we talked about Maxixe? We talked about how to use it to train a very simple segmenter and split a sentence without spaces into single words. But now you will probably want to try it yourself, for real, in your big enterprisey next big thing! But how do you actually do that?

Ingredients

Maxixe needs three things to build a segmenter:

  1. A large corpus of text data. This should be as close to your production data as possible.
  2. A set of n-gram counts from this corpus. For example, all 2-grams, 3-grams and 4-grams from the corpus are counted and then saved as the index.
  3. A threshold that controls how lhigh the score for a possible segmentation has to be for it to be cut. Simply put, the lower this is, the more segments you will get.

The corpus

So how do you get hundres of megabytes of text in your target language? This can actually be pretty easy! For example, Wikipedia has dumps of their database for all languages, and some other gals and guys already figured out how to get a plain text corpus from them. This can be a bit much and you will probably want to try only a subset of Wikipedia. Also remember: After using Wikipedia as a corpus, your segmenter will be trained to split encyclopedia articles, but it will not necessarily be well equipped to split blog comments or 4chan memes. 

Another source can be Project Gutenberg, which has a lot of free books in plain text format, ripe for the taking. Similar pages are available for more specialized areas. For japanese, there is Aozora Bunko, for classical japanese you can take a look at the Japanese Text Intiative. Just type "$favorite_language text" in Google or DDG and use whatever comes up. Okay, if you need Chinese, I searched for that, too.

Anyway, you now have your corpus. AWSM! What to do next?

Building your index

When you use the segmenter, you can actually change the threshold value anytime you like (you should not, though). But you will have to decide in advance which n-gram count indexes you want to have, as these have to be pre-computed. So which n-grams should you index?

I know that every time I write "n-gram", someone stops reading. What is this thing? An n-gram is just a string of length n. So let's say we have the string "nyan" and want all 2-grams that are in it. Easy! Just start at the beginning and take two characters each time. So for "nyan" we get ["ny", "ya", "an"]. The 3-grams would be ["nya", "yan"]. 

Now, remember that the Tango algorithm that Maxixe uses depends on the fact that n-grams that are part of words or words themselves are more likely to occur in a corpus than n-grams that are not. A simple example: If you have a large corpus, your 3-gram count for "cat" and "dog" will probably be rather high, but the count for "qxc" will be rather low, as a word would need to end with "qx", begin with "xc" or even contain the string "qxc" completely if this 3-gram could be found in the corpus.

So, which values for n should you choose? This is not an easy question. Generally speaking, small n values will lead to small segments and vice versa. But sometimes you may, for example, need n-grams for n = 2, 4 to get the best recognition performance. You will just have to test it.

Over the threshold

There is this last parameter you need, the threshold t. This is the cutoff value. If any position between two characters in the string has a calculated probability value that is larger than the threshold, it will be split at this point. Why do you need this? Usually, positions are split if the calculated value for the current position is a local maximum, which means that it is more likely that a word boundary lies on the current position then on the one directly before or after it. But sometimes, this just isn't enough. A simple example is the word "a", as in "not a simple example". To split this string correctly, the positions right before and right after "a" have to be split. But they are next to each other, so they can not both be local maxima, as this would mean they are each larger than the other. If you are a mathematician, you can now construct an algebra were this is possible, but we normal people will just have to set this to some value. As values can only vary between 0 and 1, 0.5 seem like a good natural compromise. But if you want to get the best value, you will have to test it, too.

Optimize

Guess what? I wrote a method to help you test which values for n and t are the best! You should probably do a "gem install maxixe" as soon as you can reach your terminal to get the new version.

I stole this idea from the paper, too. The idea is to split a small number of sentences (5 to 50) by hand and use these to figure out the parameters that work best for your use case. Now, the authors of the paper are proper scientists, so they used precision and recall to measure recognition performance. But I wrote this in 2 hours after work, so I will measure recognition performance by edit distance: diff the pre-segmented strings with the ones split by the algorithm, add up the edits needed to make them the same and use this as a score. A score of 0 means that there was no difference between your pre-segmented sentences and the ones split by the segmenter, so this is a perfect score. The higher the scores, the worse the word recognition.

Here is the example from maxixes readme:

As you can see, you will have to build an index first. This should contain any values for n you could possible want. The optimizer will try any combination of values for n and t and calculate the score. Then, it will give you the smallest result for score, n and t, in that order. The above example has the best recognition performance if you have n-grams with n = 2,4 and a threshold of 0.5. If we used a real corpus, we would now have a segmenter with nicely tuned parameters, ready to segment a lot of sentences.

BTW, there are more combinations of parameters that will give you a score of 0 for this example. If you want to see these, take a look at Maxixe::Trainer.check_recognition. 

What's next?

By now, you should have a working segmenter, finely tuned for your exact use case. So, you made a segmenter for biblical Greek? Or for Cicero's work? Why not make a gem out of it, so others can use it? I will do the same, so stay tuned for more...