If you happened to catch my RSS feed at some terrible hour of the morning a couple nights ago, you may have read a little about Merquery.

Merquery is the reason I want to parse nice search query syntax like Google's (or more generally, Lucene's). It's a full-text indexer and search engine. Not the crawly, home-page kind like Google, but the deployable kind like Lucene.

Now I know what you're thinking. You're coming up with all these nerd-slang fake syndromes with which to diagnose me. You're wildly gesticulating while going on about how a bunch of people smarter than me have done this before and published their results.

Let me say that I think Lucene is one of the best open source offerings out there. If you're serious about deploying search locally, you should use Lucene. Heck, you can even use Python to do that.

Problem is, there are a ton of people out there who aren't that serious about search. Or, more accurately, don't need to be that serious about search. I've been investigating Lucene for the past week, and my impression is that it's:

  • huge and complex
  • overkill for Joe Programmer's First TurboGears App
  • hard to deploy in my app in less than 5 minutes

Obviously, Lucene was never intended for that second use case. But there are a lot of developers these days using small-but-scalable web frameworks who could use a lightweight search engine that's easy to plug into their apps. I'm talking, like, just pass it a list of SQLObject columns and it will do the rest. Giving your SQLObject tables a search method is easy, but giving them a fully-featured query syntax and ranking the results is a little more work. Sure, we can count on Google to index our pages for us, but Google can't search our databases, and that's a huge need too.
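To make the "pass it a list of columns" idea concrete, here's a minimal sketch using only the standard library. To be clear, none of this is Merquery's actual API: the `TinyIndex` class, its method names, and the use of plain dicts in place of SQLObject rows are all assumptions for illustration, and the ranking is just a raw count of term occurrences.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric words as a list."""
    return TOKEN_RE.findall(text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index (not Merquery's real API)."""

    def __init__(self, columns):
        # Names of the columns whose text should be searchable.
        self.columns = columns
        # term -> {row id -> number of occurrences in that row}
        self.postings = defaultdict(Counter)

    def add(self, row_id, row):
        """Index the chosen columns of one row (a plain dict here)."""
        for column in self.columns:
            for term in tokenize(str(row.get(column, ""))):
                self.postings[term][row_id] += 1

    def search(self, query):
        """Return ids of rows containing every query term, best match first."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = set(self.postings.get(terms[0], {}))
        for term in terms[1:]:
            matches &= set(self.postings.get(term, {}))
        # Score each matching row by its total count of query terms.
        scores = {
            rid: sum(self.postings[term][rid] for term in terms)
            for rid in matches
        }
        return sorted(scores, key=lambda rid: (-scores[rid], rid))
```

Usage would look something like `index = TinyIndex(["title", "body"])`, then `index.add(row_id, row)` for each row and `index.search("search engine")`. A real engine would want better scoring and on-disk storage, but the shape of the interface is the point here.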

So my goal is for Merquery to focus on being super-easy to deploy for those web developers using things like Django, TurboGears, Pylons, etc. I already know what Python database wrappers they're likely to be using, so I can make Merquery cater directly to their needs.

With that in mind, here are some design decisions and other things I've been thinking about:


  • Full Lucene query syntax would be nice, but hasn't Google's small subset of it proven to be good enough?

  • I like being able to search for stop words. Google used to drop stop words from queries, but now you can search for anything you want.

  • Python has lots of nice ways to process large datasets without having everything in memory at once; I plan to use these methods extensively.

  • So far I think pyparsing will be the only dependency. There's a lot of cool stuff already in the standard library that's well-suited to building a search engine. And hey, Paul says search query parsing will come with the next release of pyparsing.

  • Some motivating phrases: lightweight, fast, easily deployable, easily extendable
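
A couple of those points are easier to show than to describe. Here's a rough sketch of a Google-style query subset (plain words, "quoted phrases", and -exclusions) combined with a generator that filters documents one at a time instead of loading them all into memory. Treat it as an illustration under assumptions: the standard library's `re` stands in for pyparsing so the example has no dependencies, and the names `Query`, `parse_query`, and `filter_docs` are made up for this post.

```python
import re
from collections import namedtuple

# Hypothetical container for a parsed query.
Query = namedtuple("Query", "required phrases excluded")

# An optional leading minus, then either a "quoted phrase" or a bare word.
TOKEN_RE = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')
WORD_RE = re.compile(r"[a-z0-9]+")


def parse_query(text):
    """Split a query into required words, phrases, and excluded items."""
    required, phrases, excluded = [], [], []
    for minus, phrase, word in TOKEN_RE.findall(text):
        value = (phrase or word).lower()
        if not value:
            continue  # skip empty quotes like ""
        if minus:
            excluded.append(value)
        elif phrase:
            phrases.append(value)
        else:
            required.append(value)
    return Query(required, phrases, excluded)


def _contains(value, words, lowered):
    """Match single words against the word set, phrases as substrings."""
    if " " in value:
        return value in lowered
    return value in words


def filter_docs(docs, query):
    """Lazily yield (doc_id, text) pairs that satisfy the query.

    `docs` can be any iterable, such as a database cursor, so only one
    document needs to be in memory at a time.
    """
    for doc_id, text in docs:
        lowered = text.lower()
        words = set(WORD_RE.findall(lowered))
        if not all(word in words for word in query.required):
            continue
        if not all(phrase in lowered for phrase in query.phrases):
            continue
        if any(_contains(item, words, lowered) for item in query.excluded):
            continue
        yield doc_id, text
```

Since `filter_docs` is a generator, nothing is read from `docs` until you iterate over the result. One simplification to be aware of: words are matched against alphanumeric tokens only, so a query term containing punctuation won't match anything.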


This article (with code) about making something similar in Python is a good read. In fact I've found a few articles like that one, but they're all from at least a few years ago. I got the impression that no one has approached this problem from the same perspective of focusing on the tools today's Python web developers are using (SQLObject, TurboGears, and such). But if you know otherwise, let me know!