Using Picky for search and profit
As you might remember from my post about my statistical segmenter, Maxixe, I have used the nearly pure Ruby search engine Picky in one of my current projects, wadoku.eu.
I had tried to use Solr before, but I just could not get it to work in any useful way. I made my own database-based search, but it seems I am an idiot. I just could not get the search times to be reasonably small, so I toyed with the idea of using the full-text search included with SQLite. But that would have made it hard to switch DBs later, so I started looking for other solutions. Picky turned out to be exactly what I needed: offline indexing, fast performance, pure Ruby (so I could actually take a look under the hood - more on that later) and easy to set up. If you want to get a quick taste of why I prefer Picky over Solr, just take a look at their respective homepages. Which one looks more accessible?
Picky and WaDokuJT
Wadoku.eu is an online interface for Ulrich Apel's WaDokuJT dictionary. As you might imagine, dictionary data is highly structured. WaDokuJT has about 200,000 entries, each a Japanese word or phrase with several German translation equivalents, arranged in meaning groups. There is even more: etymological information, notes on word usage, etc. WaDokuJT uses a simple markup language to structure these entries. I wrote a parser for it in Citrus, but please don't look at it.
Anyway, there is a lot of structure and semantics in these entries, and just using a full text search would lose a lot of this. Picky brands itself as a "semantic text search engine" and is perfect for the job. It works like this:
- Give Picky some of your data to index (everything with an #each is indexable), define indexed categories and searches
- Run the Picky server
- That's it!
Picky then gives you a very simple JSON interface for your searches. You should try it, it's really easy and really fast. The beauty lies in how you define what you want to index and how it is searched. Take a look at the Picky Wiki for a taste.
I had a tab-separated file containing all entries, so I wrote a class that would respond to #each with every entry in the file, disguised as an object. Your object just has to respond to #id and to whatever you name your categories. This made it easy to handle a problem I had: I wanted to make everything searchable not only by kanji and kana, but also by their transcription into Latin characters (or, as we smart guys say, romaji). This is really easy. I have the reading in kana already in the file, so let's say my entry objects respond to #kana. All that is missing is a #romaji method that converts that reading.
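Here is a minimal sketch of how such a source and entry class could look. It is not the code running on wadoku.eu: the three-column file layout, the class names `Entry` and `EntrySource`, and the tiny hand-rolled kana table are all assumptions made for illustration. A real converter would need the full syllabary plus rules for small kana, long vowels and so on.

```ruby
# Hypothetical sketch: file layout, class names and the kana table are
# assumptions for illustration, not the code used on wadoku.eu.

# Deliberately tiny hiragana-to-romaji table, just enough for the demo data.
KANA_TO_ROMAJI = {
  "か" => "ka", "な" => "na", "さ" => "sa", "く" => "ku",
  "ら" => "ra", "ね" => "ne", "こ" => "ko"
}.freeze

# One dictionary entry. It responds to #id plus one method per category.
class Entry
  attr_reader :id, :kanji, :kana

  def initialize(id, kanji, kana)
    @id = id
    @kanji = kanji
    @kana = kana
  end

  # Derived category: transliterate the kana reading character by character.
  # Characters missing from the table are passed through unchanged.
  def romaji
    kana.each_char.map { |char| KANA_TO_ROMAJI.fetch(char, char) }.join
  end
end

# Wraps tab-separated lines (id, kanji, kana) and yields Entry objects.
class EntrySource
  include Enumerable

  def initialize(lines)
    @lines = lines
  end

  def each
    return enum_for(:each) unless block_given?

    @lines.each do |line|
      id, kanji, kana = line.chomp.split("\t")
      yield Entry.new(id.to_i, kanji, kana)
    end
  end
end

source = EntrySource.new(["1\t桜\tさくら\n", "2\t猫\tねこ\n"])
source.each { |entry| puts "#{entry.id} #{entry.kana} #{entry.romaji}" }
```

The point of the design is that the derived value lives on the entry object itself, so whatever consumes the source never needs to know that the romaji was computed rather than read from the file.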
I can then add the category "romaji" to the index and it will work just as if it were part of the file. As you can see, transforming your data for use in indexing is very easy with Picky. Just write your Ruby and be happy.
I won't go into more detail here, but I want to talk about another thing that makes Picky great: its author.
Fixing Picky
The data I want to search through contains a lot of what are (too often) considered to be special characters: umlauts, kanji, stuff like that. When I made Picky index my data, I noticed two things: First, that the partial index looked strange. Second, that the Picky server would not start at all. Bummer. I wrote a small example and filed a bug. Soon, Florian Hanke (Picky's father) answered with a detailed explanation of what he thought went wrong and explained how he debugged it. This was wonderful, as it enabled me to quickly find the source of the two bugs (one in Picky, one in yajl-ruby) and make some suggestions on fixing them. As it turns out, Picky had seemingly never been used for characters outside of the ASCII set before, as it usually does substitutions to get rid of diacritics before putting the tokens in the index. The indexing treated the symbols as ASCII, so it would cut off only one byte at a time when generating partials. This won't work for Japanese characters, as they are 3 bytes each in UTF-8. Also, yajl-ruby would not symbolize non-ASCII characters correctly, as it also treated everything as ASCII internally. There were some other UTF-8 related problems on the way (read the issue if you are interested), but everything was working about three days after I looked at Picky for the first time. That is less time than I spent on getting Sphinx or Solr to work.
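To see why cutting one byte at a time produces a strange-looking partial index, here is a small illustration in plain Ruby. This is not Picky's actual partial-generation code; `byte_partials` and `char_partials` are hypothetical helpers that only contrast the two ways of building prefixes.

```ruby
# Illustrative only: this is not Picky's actual partial-generation code.

# Byte-based prefixes: cut the UTF-8 byte sequence one byte at a time.
def byte_partials(token)
  (1..token.bytesize).map { |n| token.byteslice(0, n) }
end

# Character-based prefixes: cut only at character boundaries.
def char_partials(token)
  (1..token.length).map { |n| token[0, n] }
end

token = "辞書" # two kanji, three bytes each in UTF-8

# Prefixes that end in the middle of a character are not valid UTF-8.
broken = byte_partials(token).reject(&:valid_encoding?)

puts "bytes: #{token.bytesize}, characters: #{token.length}"
puts "invalid byte-based partials: #{broken.size}"
puts "character-based partials: #{char_partials(token).inspect}"
```

Every byte-based prefix that stops inside a character is garbage as far as searching is concerned, while the character-based version yields exactly one usable partial per character.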
Florian helped a lot with getting it to work, even though Japanese was not his use case. I really got the feeling that this is the ideal way an open source project should work: Put your stuff out there, help other people fix bugs and contribute, make the software better for everyone. Florian clearly is passionate about Picky - and it's contagious!
Just use it, already!
So why should you use Picky?
- It's fast to search and easy to set up
- It's just Ruby, everywhere
- If you have problems, you will get help
- It has an octopus as its mascot