Merquery, Text Indexing and Search with a Focus
posted by brian at 09:06 AM
If you happened to catch my RSS feed at some terrible hour of the morning a couple nights ago, you may have read a little about Merquery.

Merquery is the reason I want to parse nice search query syntax like Google's (or more generally, Lucene's). It's a full-text indexer and search engine. Not the crawly, home-page kind like Google, but the deployable kind like Lucene.
Now I know what you're thinking. You're coming up with all these nerd-slang fake syndromes with which to diagnose me. You're wildly gesticulating while going on about how a bunch of people smarter than me have done this before and published their results.
Let me say that I think Lucene is one of the best open source offerings out there. If you're serious about deploying search locally, you should use Lucene. Heck, you can even use Python to do that.
Problem is, there are a ton of people out there who aren't that serious about search. Or more appropriately, don't need to be that serious about search. I've been investigating Lucene for the past week, and my impressions are that it's:
- huge and complex
- overkill for Joe Programmer's First TurboGears App
- hard to deploy for my app in less than 5 minutes
Obviously, Lucene was never intended for that second use case. But there are a lot of developers these days using small-but-scalable web frameworks who could use a lightweight search engine that's easy to plug into their apps. I'm talking, like, just pass it a list of SQLObject columns and it will do the rest. Giving your SQLObject tables a search method is easy, but giving them a fully-featured query syntax and ranking the results is a little more work. Sure, we can count on Google to index our pages for us, but they can't search our databases, which is also a huge need.
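To make "pass it a list of columns and it will do the rest" concrete, here is a minimal sketch of the idea. It is not Merquery's actual code: `TinyIndex`, `tokenize`, and the sample rows are hypothetical names made up for illustration, plain dicts stand in for SQLObject rows, and the ranking is a simple TF-IDF-style score chosen only to show that results can be ordered.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index over selected record fields."""

    def __init__(self, fields):
        self.fields = fields
        self.postings = defaultdict(Counter)  # term -> {record id: term count}
        self.doc_count = 0

    def add(self, record_id, record):
        """Index the configured fields of one record (a plain dict here)."""
        self.doc_count += 1
        for field in self.fields:
            for term in tokenize(str(record.get(field, ""))):
                self.postings[term][record_id] += 1

    def search(self, query):
        """Return ids of records matching any query term, best score first."""
        scores = Counter()
        for term in tokenize(query):
            matches = self.postings.get(term)
            if not matches:
                continue
            # Terms that appear in fewer records weigh more.
            idf = math.log(1 + self.doc_count / len(matches))
            for record_id, count in matches.items():
                scores[record_id] += count * idf
        return [record_id for record_id, _ in scores.most_common()]


# Stand-in data; a real deployment would pull these from database rows.
rows = {
    1: {"title": "Parsing search queries", "body": "pyparsing makes query parsing easy"},
    2: {"title": "Deploying TurboGears", "body": "a small web app"},
    3: {"title": "Search for web apps", "body": "search search search"},
}
index = TinyIndex(fields=["title", "body"])
for row_id, row in rows.items():
    index.add(row_id, row)
```

Calling `index.search("search")` on this data ranks row 3 ahead of row 1, because row 3 mentions the term more often. The part this sketch leaves out, a real query syntax, is where most of the remaining work lies.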
So my goal is for Merquery to focus on being super-easy to deploy for those web developers using things like Django, TurboGears, Pylons, etc. I already know what Python database wrappers they're likely to be using, so I can make Merquery cater directly to their needs.
With that in mind, here are some design decisions and other things I've been thinking about:
- Full Lucene query syntax would be nice, but hasn't Google's small subset of it proven to be good enough?
- I like being able to search for stop words. Google used to have stop words, but now you can search for anything you want.
- Python has lots of nice ways to process large datasets without having everything in memory at once; I plan to use these methods extensively.
- So far I think pyparsing will be the only dependency. There's a lot of cool stuff already in the standard library that's well-suited to building a search engine. And hey, Paul says search query parsing will come with the next release of pyparsing.
- Some motivating phrases: lightweight, fast, easily deployable, easily extendable
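A few of the points above can be sketched together: a small Google-style subset of query syntax (bare words, quoted phrases, and `-` for exclusion), no stop-word filtering, and lazy processing with generators. This is only an illustration under stated assumptions. It uses the standard library's `re` module as a stand-in for pyparsing, `parse_query` and `matching_docs` are hypothetical names, and matching is naive substring matching rather than a real index lookup.

```python
import re

# A quoted phrase or a bare word, each optionally prefixed with "-".
QUERY_TOKEN_RE = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into (required, excluded) lists of lowercase phrases."""
    required, excluded = [], []
    for minus, phrase, word in QUERY_TOKEN_RE.findall(query):
        text = (phrase or word).lower()
        if not text:
            continue  # skip empty quoted phrases such as ""
        (excluded if minus else required).append(text)
    return required, excluded


def matching_docs(docs, query):
    """Lazily yield ids of documents that satisfy the query.

    `docs` can be any iterable of (doc_id, text) pairs, for example a
    generator reading rows from a database cursor, so the full dataset
    never has to sit in memory at once.
    """
    required, excluded = parse_query(query)
    for doc_id, text in docs:
        lowered = text.lower()
        # Naive substring matching; nothing is dropped as a stop word.
        if all(term in lowered for term in required) and not any(
            term in lowered for term in excluded
        ):
            yield doc_id
```

For example, `parse_query('"to be" -lucene python')` gives the required phrases `to be` and `python` and the excluded term `lucene`. Since nothing is filtered as a stop word, a phrase like "to be" stays searchable.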
This article (with code) about making something similar in Python is a good read. In fact I've found a few articles like that one, but they're all from at least a few years ago. I got the impression that no one has approached this problem from the same perspective of focusing on the tools today's Python web developers are using (SQLObject, TurboGears, and such). But if you know otherwise, let me know!
Comments
Hi!
I think this indexer would be a better choice
http://www.oluyede.org/blog/2005/11/16/hype-the-python-indexer/
I played with PyLucene a little and it leaked memory like hell.
Nice idea though, hope it works out :)
Hey Brian --
This sounds awesome! I've posted some thoughts about the project over here: http://www.jacobian.org/2006/mar/29/merquery/
Please drop me a line if there's anything I can do to give you a hand on this -- it's super exciting.
Sebastjan: Looks promising so far, I'll give it a test run this evening. Thanks for the link! I wonder why it didn't come up in search results while I was researching this stuff?
Jacob: Great! I'll get you a write account on my repository assuming Hype doesn't blow away all my goals.
Either way, there's a little work to be done to make adding decent search to web apps a one-liner. I'll probably post about Hype here tonight.
Have you looked at Xapian? (I've looked, but haven't actually used it for anything real.) It isn't pure-python, but in my experience it is pretty easy to build (way easier than pylucene), and the Divmod guys seem to have made good Python bindings for it. Heck, they wrote LuPy (http://divmod.org/projects/lupy) and then did xapwrap instead (http://divmod.org/projects/xapwrap), so they went down this path already and chose Xapian.
There's also a library to embed the JVM in the Python process. I can't remember the name. Anyway, I know people have used this to embed Lucene, and it's not as complex to manage as pylucene.
Ian, I did come across Xapian (and Xapwrap) while I was shopping around for search engines. My first impression was "Lucene in C++", but I should actually do some test runs with Xapwrap before I go and put work into a library.
Here's an idea: just write some adapters to work with all of the above (Hype, Lucene, Xapian) geared towards the audience I described in my post? As I said, I think it should be a one-liner for just a basic setup without customization. Then if people don't want the above dependencies, throw in a pure-Python indexer later? I'll make a post about this once I'm a little more informed.
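One way the adapter idea might look, as a rough sketch only: every name here (`SearchBackend`, `PurePythonBackend`, `make_index`) is hypothetical, and no real Hype, Lucene, or Xapian calls are shown. Adapters for those engines would implement the same two methods, while the pure-Python fallback keeps the basic setup free of dependencies.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that each engine adapter would implement."""

    @abstractmethod
    def add(self, doc_id, text):
        """Index one document's text under the given id."""

    @abstractmethod
    def search(self, query):
        """Return a list of matching document ids."""


class PurePythonBackend(SearchBackend):
    """Dependency-free fallback: a dict mapping each word to a set of ids."""

    def __init__(self):
        self.postings = {}

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(doc_id)

    def search(self, query):
        # A document must contain every word in the query.
        sets = [self.postings.get(w, set()) for w in query.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []


def make_index(rows, fields, backend=None):
    """The 'one-liner' setup: index the chosen fields of every row.

    `rows` is an iterable of (row_id, mapping) pairs standing in for
    database records.
    """
    if backend is None:
        backend = PurePythonBackend()
    for row_id, row in rows:
        backend.add(row_id, " ".join(str(row[f]) for f in fields))
    return backend
```

With this shape, swapping in a heavier engine later would mean passing a different `backend` to `make_index` and leaving the calling code alone.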
Something that was API-compatible with Xapwrap, but written in pure Python, would definitely be a nice way to handle the packaging issues. It would take the pressure off performance too, because you'd just throw in the "real" Xapian index when that started to matter.
Have you considered staying compatible with the Lucene on-disk format? The Zend guys have said that their Lucene implementation (in pure PHP) will use the same format, and I've run into a number of organizations that have existing Lucene indexes which would be useful to leverage.
I've also used PyLucene in two of my own Django projects and didn't have any problems, although maybe that's because I'm familiar with the Java implementation. A more Pythonic API would be nice though :)
Have you seen Solr?
http://incubator.apache.org/projects/solr
Basically you treat it as a 'search server' and just talk to it via HTTP.
It is pretty easy to set up, and has all the power of Lucene (it is Lucene).
what about hype? http://hype.python-hosting.com/
Hi,
I'm one of the authors of the search query parser for pyparsing. I've actually implemented this into Django, including an indexer. I think the parser works really well, my indexer is still a bit messy because it was the first thing I wrote with Python+Django. If you need any help with the parser and implementing it into Django, I'd be more than happy to help, since I'd like to have an improved search engine soon.
Rudolph Froger
Have you tried Xapian? It is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
I use it myself and love it. The latest version is 1.0.0, as of 05/18/07.
Mike