Posterous theme by Cory Watilo

Simple testing for your algorithmic Javascript

Here's the backstory.

I noticed that Codebrawl does not yet support Github Flavored Markdown in Gist comments. Gist comments get loaded client-side via the Gist API and are then rendered with showdown.js. Showdown is a great project, but it does not support the extensions Github uses. This really sucks for gist comments which contain code, as they are rendered as regular Markdown-style code spans, which ignores newlines and indentation and makes them completely unreadable.

So I forked the project, added the necessary 30 lines of code and made a pull request. I "tested" my code the stupid way: Look at the output in the browser. This made me feel somewhat uneasy. I want this code to be useful, but how should anyone know that I did not break anything? I decided to add some tests. I have done testing in other languages, but never in Javascript. How do you do this?

Testing your Javascript

How can you test Javascript? JS usually runs in the browser, so you could just make an html page that runs your test code. The downside is that you need a browser and you need to refresh it. I'd rather work only on the command line for this project, as my JS does not really do anything that needs a DOM. It just translates one string to another. So what about testing it in node.js or something?

There is a really nice looking BDD testing framework for Javascript that also supports node.js: Jasmine by Pivotal Labs. It looks pretty much like what I want. It has an RSpec-like syntax and supports all kinds of run options. Having said that, I could not get it to run at all, at least not in under an hour. This was probably my own fault. Still, it drove me to think about what I actually needed and if I could just write it myself.

I really just want two things for this: RSpec-like syntax and RSpec-like output. I like the descriptive "it should do somethign", but I don't need all this beautiful "things.should be_wonderful" stuff, a simple assert_equal would be enough. 

Here is what I came up with.

This allows you to write tests like this:

Looks kinda okay! Now to run the tests, just run the specs through node.js. Here is a screenshot.

Jstestsuccess

Nice! How does it look if it fails?

Jstestfail

Also nice! I think I'll use this for now.

This is of course not a full featured solution. Right now, it only has an assert_equal method to test things - you could easily add more methods, though. I found that this mini-framework works quite well for my DOM-less algorithmic javascript, without introducing any external dependencies. If you have a similar problem, you might want to check out my showdown fork at https://github.com/rogerbraun/showdown and see how I use this.

Have fun!

Selective color effect with ChunkyPNG, or: How I won Codebrawl #10

The selective color effect, implemented with ChunkyPNG

So, I won Codebrawl #10. I participated the previous time that ChunkyPNG was featured and I really enjoyed using it. ChunkyPNG is a pure Ruby library for handling PNG files. One of the nice features it has is that it gives you direct read-write access to the pixels, so you can just load an image, do a map or fold over the array of pixels and then save it right back to a new PNG. ChunkyPNG is well documented and gives you several easy methods to work with pixels and colors, so be sure to take a look

When I started to work on this Codebrawl, I thought about two ways to do this: Try to detect each crayon as an object and keep one of them in color, or take a reference color and just keep the colors that are similar to it. Object detection would be great, of course. Just changing colors based on reference does not work in cases were you want to, for example, keep a flag in color. Okay, it does work when you use it on a Japanese flag... Just using a color reference and a distance does work great for the example image that the Codebrawl uses, though, and implementing object detection is a much harder problem than mapping over colors.

I think I know how to iterate over an array, so all I was missing was a distance function for colors. But how do you actually compare colors?

How to measure the distance between colors

There are several ways to get a value that tells you how much alike two colors are. If you have an RGB pixel, you could just take the red, green and blue values as coordinates in a three dimensional space and calculate the distance there (and I guess you know how to do that).

I did not actually implement this, as it does not represent similar colors in the way we would see them. Think about it, the color (255, 0, 0) and (127, 0, 0) are both just shades of red, but they would have the same distance as (255, 0, 0) and (255, 128, 0), which introduces green to the mix and looks pretty different overall.

This also tells us something about what kind of representation we want: One where only the color counts, but not the brightness (or luminance). Skipping through Wikipedia, I first noticed rg chromacity space.

rg chromacity

rg chromacity is a simple way to remove intensity information from your colors and only keep the proportions of red, green and blue. You normalize the values to be between 0 and 1, with r + g + b always adding up to one. It is called rg chromacity because only the red and green components are needed to describe a color, as the blue component is always b = 1 - (r + g). For example, rgb(255,0,0) is rg(1,0), as is rgb(127,0,0).

You now have a 2d space, so distances can be easily calculated. Using this to decide which colors to keep gives somewhat satisfactory results, at least for some of the crayons. But in the end, it just was not as good as I had hoped. Part of this surely is the way this color space actually looks. It is somewhat uneven, as the distance between pure red, rg(1,0), and pure blue, rg(0,0), is one. But between red and green, rg(0,1), it is the square root of two!

By just measuring the distance, I am essentially cutting a circle of colors I want to keep out of this triangle, but colors in this space are not uniformly distributed. I had a lot of ideas how to counter these problems: Make colors spread from a reference color if the neighbors are similar, make negative cuts in this space by specifying colors that should always be made gray, etc... What I really needed was a better distance function for my colors. So let's use a grown-ups color system, HSV.

HSV

If you go looking for color spaces, you will encounter HSL and HSV pretty soon. They use hue, saturation and lightness (or value, respectively) to define a color. If you take a look at some pictures of this color space, you will see that the hue component looks pretty much like what we need.

Now, calculating the hue is more complicated than calculating rg chromacity colors. If I understood it correctly, you make a hexagon of colors, put red at 0°, green at 120° and blue at 240° and then calculate where your color lies. You can see the formula I used in the code.

Calculating the distance is just subtracting one hue from the other. We have to do it two times, though, as the hue is circular and has red on both ends, so the shorter distance counts. This gives good results for most colors. I could not get the yellow and red crayon to seperate perfectly... This may be because I did not find a good reference color, or just that the dark yellow in the tip of the crayon and the light red are actually too similar and this approach won't work at all. You should take a look at the other entries and see how well they did.

Better color distance

I tackled the problem of color similarity from a rather primitive point of view: Numerical values of single pixels. As it turns out, this is not enough to mirror human color perception. Take a look at this: http://gizmodo.com/5839481/the-most-wicked-optical-illusion-ive-seen-so-far. Both spirals have the same color, but you would never think that if you didn't know or check. As far as I know, no algorithm that works for cases like this exists, yet.

There is a standard for measuring color difference by the International Commission on Illumination (sounds good, right?). It takes into account how humans perceive color, so it should give you the best results (if you are human, that is). I did not try it, though, as it seems somewhat complicated and I wanted to keep this entry short and to the point. Peplin did it like this and his code is much more complicated, but still quite readable. One problem with this approach is that you have a lot of magic numbers that you get from the standard definition. These are (at least for me) hard to visualize and I find it hard to grasp their meaning.

Anyway, here are my pictures and my Ruby code.

(download)

BONUS: Coffeescript version

I also wrote a Coffescript version of the same algorithm which you can use to quickly check the effects of changing the reference color or the color distance. Porting it over from Ruby was pretty straight-forward, except for the strange interface for pixels that the HTML5 canvas element uses and the fact that modulus is broken in Javascript (WTF?!). I tried to take a functional approach, but I am not a Coffee/Javascript programmer, so this code is rather bad. If you are interested anyway, you can see the code at GitHub.

You can find this version ready to use at http://severe-autumn-9391.heroku.com/index.html. Use the slider to set the color distance that will still be colored, and just click on any point in the image on the right to set it as the reference color. Have fun!

Tuning your Maxixe segmenter with the new optimizer

Remember when we talked about Maxixe? We talked about how to use it to train a very simple segmenter and split a sentence without spaces into single words. But now you will probably want to try it yourself, for real, in your big enterprisey next big thing! But how do you actually do that?

Ingredients

Maxixe needs three things to build a segmenter:

  1. A large corpus of text data. This should be as close to your production data as possible.
  2. A set of n-gram counts from this corpus. For example, all 2-grams, 3-grams and 4-grams from the corpus are counted and then saved as the index.
  3. A threshold that controls how lhigh the score for a possible segmentation has to be for it to be cut. Simply put, the lower this is, the more segments you will get.

The corpus

So how do you get hundres of megabytes of text in your target language? This can actually be pretty easy! For example, Wikipedia has dumps of their database for all languages, and some other gals and guys already figured out how to get a plain text corpus from them. This can be a bit much and you will probably want to try only a subset of Wikipedia. Also remember: After using Wikipedia as a corpus, your segmenter will be trained to split encyclopedia articles, but it will not necessarily be well equipped to split blog comments or 4chan memes. 

Another source can be Project Gutenberg, which has a lot of free books in plain text format, ripe for the taking. Similar pages are available for more specialized areas. For japanese, there is Aozora Bunko, for classical japanese you can take a look at the Japanese Text Intiative. Just type "$favorite_language text" in Google or DDG and use whatever comes up. Okay, if you need Chinese, I searched for that, too.

Anyway, you now have your corpus. AWSM! What to do next?

Building your index

When you use the segmenter, you can actually change the threshold value anytime you like (you should not, though). But you will have to decide in advance which n-gram count indexes you want to have, as these have to be pre-computed. So which n-grams should you index?

I know that every time I write "n-gram", someone stops reading. What is this thing? An n-gram is just a string of length n. So let's say we have the string "nyan" and want all 2-grams that are in it. Easy! Just start at the beginning and take two characters each time. So for "nyan" we get ["ny", "ya", "an"]. The 3-grams would be ["nya", "yan"]. 

Now, remember that the Tango algorithm that Maxixe uses depends on the fact that n-grams that are part of words or words themselves are more likely to occur in a corpus than n-grams that are not. A simple example: If you have a large corpus, your 3-gram count for "cat" and "dog" will probably be rather high, but the count for "qxc" will be rather low, as a word would need to end with "qx", begin with "xc" or even contain the string "qxc" completely if this 3-gram could be found in the corpus.

So, which values for n should you choose? This is not an easy question. Generally speaking, small n values will lead to small segments and vice versa. But sometimes you may, for example, need n-grams for n = 2, 4 to get the best recognition performance. You will just have to test it.

Over the threshold

There is this last parameter you need, the threshold t. This is the cutoff value. If any position between two characters in the string has a calculated probability value that is larger than the threshold, it will be split at this point. Why do you need this? Usually, positions are split if the calculated value for the current position is a local maximum, which means that it is more likely that a word boundary lies on the current position then on the one directly before or after it. But sometimes, this just isn't enough. A simple example is the word "a", as in "not a simple example". To split this string correctly, the positions right before and right after "a" have to be split. But they are next to each other, so they can not both be local maxima, as this would mean they are each larger than the other. If you are a mathematician, you can now construct an algebra were this is possible, but we normal people will just have to set this to some value. As values can only vary between 0 and 1, 0.5 seem like a good natural compromise. But if you want to get the best value, you will have to test it, too.

Optimize

Guess what? I wrote a method to help you test which values for n and t are the best! You should probably do a "gem install maxixe" as soon as you can reach your terminal to get the new version.

I stole this idea from the paper, too. The idea is to split a small number of sentences (5 to 50) by hand and use these to figure out the parameters that work best for your use case. Now, the authors of the paper are proper scientists, so they used precision and recall to measure recognition performance. But I wrote this in 2 hours after work, so I will measure recognition performance by edit distance: diff the pre-segmented strings with the ones split by the algorithm, add up the edits needed to make them the same and use this as a score. A score of 0 means that there was no difference between your pre-segmented sentences and the ones split by the segmenter, so this is a perfect score. The higher the scores, the worse the word recognition.

Here is the example from maxixes readme:

As you can see, you will have to build an index first. This should contain any values for n you could possible want. The optimizer will try any combination of values for n and t and calculate the score. Then, it will give you the smallest result for score, n and t, in that order. The above example has the best recognition performance if you have n-grams with n = 2,4 and a threshold of 0.5. If we used a real corpus, we would now have a segmenter with nicely tuned parameters, ready to segment a lot of sentences.

BTW, there are more combinations of parameters that will give you a score of 0 for this example. If you want to see these, take a look at Maxixe::Trainer.check_recognition. 

What's next?

By now, you should have a working segmenter, finely tuned for your exact use case. So, you made a segmenter for biblical Greek? Or for Cicero's work? Why not make a gem out of it, so others can use it? I will do the same, so stay tuned for more...

Using Picky for search and profit

As you might remember from my post about my statistical segmenter, Maxixe, I have used the nearly pure Ruby search engine Picky in one of my current projects, wadoku.eu.

I had tried to use Solr before, but I just could not get it to work in any useful way. I made my own database-based search, but it seems I am an idiot. I just could not get the search times to be reasonably small, so I toyed with the idea of using the fulltext search included with SQLite. But then it would have been hard to switch DBs later, so I started looking for other solutions. Picky turned out to be exactly what I needed: Offline indexing, fast performance, pure Ruby (so I could actually take a look under the hood - more on that later) and easy to setup. If you want to get a quick taste of why I prefer Picky over Solr, just take a look at their respective homepages. Which one looks more accessible?

Picky and WaDokuJT

Wadoku.eu is an online interface for Ulrich Apel's WaDokuJT dictionary. As you might imagine, dictonary data is highly structured. WaDokuJT has about 200.000 entries, each a japanese word or phrase with several german translation equivalents, arranged in meaning groups. There is even more: Etymological information, notes on word usage, etc. WaDokuJT uses a simple markup language to structure these entries. I wrote parser for it in Citrus, but please don't look at it.

Anyway, there is a lot of structure and semantics in these entries, and just using a full text search would lose a lot of this. Picky brands itself as a "semantic text search engine" and is perfect for the job. It works like this:

  1. Give Picky some of your data to index (everything with an #each is indexable), define indexed categories and searches
  2. Run the Picky server
  3. That's it!

Picky then gives you a very simple JSON interface for your searches. You should try it, it's really easy and really fast. The beauty lies in how you define what you want to index and how it is searched. Take a look at the Picky Wiki for a taste.

I had a tab seperated file containing all entries, so I wrote a class that would respond to #each with every entry in the file, disguised as an object. Your object just has to respond to #id and whatever you name your categories. This made it easy to handle a problem I had: I wanted to make everything searchable not only by kanji and kana, but also by their transcription into latin characters (or, as we smart guys say, romaji). This is really easy. I have the reading in kana already in the file, so let's say my entry objects responds to #kana. I can then just write this:

I can then add the category "romaji" to the index and it will work just as if it were part of the file. As you can see, transforming your data for use in indexing is very easy with Picky. Just write your Ruby and be happy.

I won't go into more detail here, but I want to talk about another thing that makes Picky great: It's author.

Fixing Picky

The data I want to search through contains a lot of what is (too often) considered to be special characters: umlautskanji, stuff like that. When I made Picky index my data, I noticed two things: First, that the partial index looked strange. Second, that the Picky server would not start at all. Bummer. I wrote a small example and filed a bug. Soon, Florian Hanke (Picky's father) answered with a detailed explanation of what he thought went wrong and explained how he debugged it. This was wonderful, as it enabled me to quickly find the source of the two bugs (one in Picky, one inyajl-ruby) and make some suggestions on fixing them. As it turns out, Picky had seemingly never been used for characters outside of the ASCII set before, as it usually does substitutions to get rid of diacritics before putting the tokens in the index. The indexing treated the symbols as ASCII, so it would cut off only one byte at a time when generating partials. This won't work for japanese characters, as they are 3 byte each. Also, yajl-ruby would not symbolize non-ASCII characters correctly, as it also treated everything as ASCII internally. There were some other UTF-8 related problems on the way (read the issue if you are interested), but everything was working about three days after I looked at Picky for the first time. That is less time then I spent on getting Sphinx or Solr to work.

Florian helped a lot with getting it to work, even though japanese was not his use case. I really got the feeling like this was the ideal way an open source project should work: Put your stuff out there, help other people fix bugs and contribute, make the software better for everyone. Florian clearly is passionate about Picky - and it's contagious!

Just use it, already!

So why should you use Picky?

  • It's fast to search and easy to setup
  • It's just Ruby, everywhere
  • If you have problems, you will get help
  • It has an octopus as mascot

Maxixe - A simple statistical segmenter for any language

[UPDATE: I wrote a follow up article on how to train your segmenter and tune the parameters!]

Recently, I wanted to use Picky for a dictionary app (and did, see WadokuJT). This worked great after fixiing some small problems (which I will talk about in a later post). 

Now, to use Picky, you simply have to tell it which data to index, and then it will perform some operations on it to make it searchable. One of these operations is the tokenization: The data is split in tokens which are then indexed seperately. This is rather straight-forward for western languages: Just split the string data on spaces and punctuation. So a string "I love my cat" would give you these tokens: ["I", "love", "my", "cat"].

Now, not all languages work this way. Many asian and other languages have scriptio continua, so they just write word after word after work, with no spaces (and sometimes, no punctuation). You can even get this with western languages, for example latin. So, if you want to tokenize text in these languages, you will have to find out where the word boundaries are.

In my case, Japanese, there are already several software packages that can split text into words, for example morphological analyzers like MeCab, ChaSen and Juman. These do more than we need, as they also analyze sentence structure and morphology of the words. Also, they are all rather large pieces of software that contain hand-written dictionaries and grammar rules to parse natural-language sentences. Further, they are often somewhat hard to install and run, the documentation often being available only in Japanese. And finally, if your language is not Japanese, but say, Chinese, or Linear B, you will have to take a completely different approach unless you want to rewrite the dictionary and grammar. I wanted something universal, written in Ruby, that I could just require from Picky.

So i went to Google Scholar and looked at the results for "japanese segmentation". I found a paper called "Mostly-unsupervised statistical segmentation of Japanese kanji sequences" by Ando and Lee. As it turns out, the algorithm described in this paper (called TANGO) is not really specific to japanese, and you can use it to segment pretty much any language.

So i wrote a gem, Maxixe, that implements this algorithm. Here is an example of what you can do with it.

Say you have a training file containing these unsegmented sentences: 

ILIKEMYDOG
THISHOUSEISMYHOUSE
MYDOGISSONICE
INMYHOUSETHEREAREFOURDOGS
IWANTAHOUSEFORMYDOG

You can use these to train the segmenter and split a sentence. It will detect word boundaries by itself.

hash = Maxixe::Trainer.generate_training_data([3,4], "dogs")
s = Maxixe::Segmenter.new(hash)
s.segment "MYDOGISINTHEHOUSE"
 => "MY DOG IS IN THE HOUSE"

Nice, isn't it? So how does this work? 

The algorithm is based on the idea that if you count any n-grams (substrings of length n) in a text, n-grams that are words or parts of word will appear more often than those which are just random strings created by two adjacent words. So what you do is use your training data to build a hash which contains the count of all n-grams in this corpus. Then, when you try to segment a string, you calculate if it is more probable that you are between two words or on a word, by comparing the count of n-grams to the left and right of the current position and the count of the n-grams that overlap with the current position. Then, after getting a numerical value for all positions, you split on all local maxima and if the value is above a certain threshold. I suggest that you read the paper yourself if you are interested. Once you grasp the underlying idea, it is quite easy to understand.

There are three things here that affect the segmentation performance: The corpus you use, the kinds of n-grams you use and your threshold. The corpus part is straightforward: Use a large amount of text that is similar to the one you want to split. The n-grams and threshold are more complicated. The above example works if you use a threshold of 0.5 and generate n-grams for n = 3,4, but you can easily find other values which will give the wrong results. Sadly, you will have to figure out which parameters are the best by segmenting a small number of sentences by hand and checking which parameters give the results most similar to your test data. I plan to add support for automating this to Maxixe, but right now, you will have to do it yourself.

And that's it! So now, if you ever need to tokenize some text that does not have spaces, give Maxixe a try. The code is small and pure Ruby, so be sure to take a look under the hood.

(download)

Hausaufgabe für praktische Lexikographie: Einfache Kanji-Statistik mit Ruby

Jetzt zu der eigentlichen Aufgabe: Ladet euch folgende zwei Dateien herunter und benennt "kanji_statistics.rb.txt" in "kanji_statistics.rb" um (entfernt also das .txt). Lest die Kommentare in kanji_statistics.rb und versucht die Datei so zu ergänzen daß ein Aufruf von "ruby kanji_statistics.rb inugami.txt" euch die Häufigkeiten der Kanji zurück gibt. Aufgabe: Welche Kanji sind die drei häufigsten? Wie häufig kommen sie vor? BONUS: Sortiere die Kanji nach Häufigkeit! Schlage dazu die Methode "sort" nach (http://ruby-doc.org/core/classes/Hash.html#M002865). Wenn ihr Fragen habt kommentiert einfach oder schreibt eine Mail.

Hausaufgabe für praktische Lexikographie: Ruby-Installation und erste Schritte

Für die Lösung dieser Aufgabe ist ein Mindestmaß an Ruby-Kenntnissen erforderlich. Wer noch keine hat sollte das Tutorial auf tryruby.org durchmachen, fast alle für dieses Problem nötigen Fähigkeiten werden in diesem vermittelt. Dann installiert euch Ruby in einer Version ab 1.9. Wer MacOS oder Linux benutzt schafft es hoffentlich selber (wenn nicht: melden!), aber für die Windowsnutzer gilt folgende Anleitung: Von http://rubyinstaller.org/ die Datei rubyinstaller-1.9.2-p0.exe herunterladen und ausführen. Bei der Installation unbedingt anwählen die Dateien zur PATH-Variable hinzuzufügen wie in den Screenshots. Danach auf Start -> Ausführen und "cmd" eingeben. Ein schwarzes Fenster mit einer Eingabeaufforderung erscheint und in diesem Fenster könnt ihr jetzt eine interaktive Ruby-Umgebung (wie auf tryruby.org) benutzen wenn ihr "irb" eingebt.
Media_httprbraunnetbl_cnrsf
Media_httprbraunnetbl_ceqnb
Media_httprbraunnetbl_iqccb
In irb könnt ihr jetzt eure Rubyfähigkeiten nach belieben austesten. Testet mal einfache Arithmetik wie "2 + 4" und Stringfunktionen wie "'Hallo'.reverse". Wenn ihr soweit gekommen seid könnt ihr bei der eigentlichen Aufgabe weitermachen.

Update: Studentenausweise

So wie es aussieht liest jemand bei der Uni mein Blog (oder sie sind selbst drauf gekommen). Hier der Aufdruck auf einem ganz aktuellen Ausweis:
Media_httprbraunnetbl_drfea
Allerdings ist dies ein Ausweis der so verschickt wurde. Ob die Automaten auch diesen Aufdruck machen weiß ich leider nicht. Wenn ein Leser seinen Aufdruck gerade frisch aktualisiert hat würde ich mich über eine Rückmeldung freuen!

Die Supereliteuni Tübingen (leider keine Geographiekenntnisse)

Die Uni Tübingen hat ein neues Corporate Design! Corporate Design ist das, was man macht, wenn man so tun will als wäre man aktiv und würde alles zum besseren verändern. Zu diesem Anlass wurden an alle Mitarbeiter schöne kleine Pins verteilt. Die sollen uns hier nicht interessieren. Die Kartonkarte auf der die Pins drauf sind ist allerdings interessant. Hier das Bild:
Media_httprbraunnetbl_qdifk
Wer genau hinschaut kann ja mal gucken wo der rote Punkt, der wohl Tübingen darstellen soll, denn eigentlich wirklich drauf ist. Ich habe mal eine Deutschlandkarte drübergelegt und Tübingen schwarz markiert. Das Ergebnis:
Media_httprbraunnetbl_cfitb
Sieht also so aus als hätten wir das abgelegt Design irgendeiner Uni aus Berlin bekommen... Nunja. Dieser Fall zeigt wieder mal ganz gut wie eigener Anspruch und tatsächliche Fähigkeiten bei dieser Uni auseinander driften