More on Merquery
posted by brian at 10:34 PM
A bunch of traffic has been directed to this blog due to the post about Merquery. There seems to be discussion going on in that post's comments and also over at Jacob Kaplan-Moss' post.
As predicted, a lot of people are only seeing the "reinvention" aspect of Merquery. And admittedly, there is a lot that would need to be reinvented based on the goals I wrote down.
The real novel part about Merquery is that it's easy to drop into a Python web application. Imagine if you're going through a TurboGears tutorial and all you have to do is add one line to add full-text indexing and search to your database tables. Cool!
So here's the reformulated plan. Write adapters for the nice-looking Python indexing engines mentioned so far, such as PyLucene, Hype, and Xapwrap. Make using any of them look the same (so they're easy to swap in and out), and make the most basic indexing setup a one-liner. Then, add a pure-Python indexer to the package as a side project, for people who don't want dependencies. (All three of the existing libraries mentioned above still require the library they wrap to be installed.)
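To make the "look the same" part concrete, here is a rough sketch of what a shared adapter interface could look like, with a toy pure-Python backend standing in for the side-project indexer. None of these names exist anywhere yet: IndexAdapter, InMemoryIndex, add, and query are all placeholders made up for illustration, and a real PyLucene or Hype adapter would implement the same two methods on top of its own library.

```python
import re
from abc import ABC, abstractmethod
from collections import defaultdict


class IndexAdapter(ABC):
    """Hypothetical common interface every backend adapter would expose."""

    @abstractmethod
    def add(self, doc_id, text):
        """Index `text` under the identifier `doc_id`."""

    @abstractmethod
    def query(self, terms):
        """Return the ids of documents containing every term."""


class InMemoryIndex(IndexAdapter):
    """Toy dependency-free backend: an inverted index kept in a dict."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        # Lowercase and split on non-word characters; no stemming or stop words.
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(doc_id)

    def query(self, terms):
        words = re.findall(r"\w+", terms.lower())
        if not words:
            return []
        # A document matches only if it appears in every term's posting set.
        matches = set.intersection(
            *(self._postings.get(word, set()) for word in words)
        )
        return sorted(matches)


index = InMemoryIndex()
index.add(1, "Brian Beck")
index.add(2, "Jeff Beck")
print(index.query("beck"))        # both documents contain "beck"
print(index.query("brian beck"))  # only document 1 contains both terms
```

Swapping backends would then just mean constructing a different IndexAdapter subclass; the calling code would not change.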
Unlike the current interfaces for those indexing libraries, these adapters don't have to be completely general (yet). If they only provide adapters for SQLObject classes and the Django database API, that's already a great accomplishment, even though these adapters are less flexible than the generic interfaces already provided. This will allow Django and TurboGears developers to stick with what they know rather than worry about getting an indexer working with their underlying database. (Hey, we've got to start somewhere, might as well have mass appeal right from the get-go.)
Here's an idea of what some customization of a developer's search engine might look like:
class Person(SQLObject):
    firstName = StringCol(notNone=True)
    lastName = StringCol(notNone=True)

nameSearch = Merquery.LuceneIndex(first=Person.firstName,
                                  last=Person.lastName)
(I have no idea why that space is there. Sorry for my blog being so ugly.)
In this example, the developer has customized the index by giving Person.firstName strings the field name 'first' and Person.lastName strings the field name 'last'. So to find people with 'Beck' in their name but not 'Brian Beck', this would work:
beck -first:brian
Developers could just pass query strings like the above directly from their forms to the index:
results = nameSearch.query("beck -first:brian")
Since LuceneIndex knows we passed in SQLObject columns, it will know to return results as a ranked list of SQLObject instances.
results[0].firstName, results[0].lastName
Obviously this example might not be very realistic since firstName and lastName are just strings and we could accomplish this with SQL. But the same ideas apply for fields storing big documents, etc., where things like term frequency and proximity become important.
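For anyone wondering what a query like "beck -first:brian" boils down to, here is a minimal pure-Python sketch of the matching rules described above. It is an illustration only, not how Lucene parses or scores queries: parse_query, clause_matches, and search are made-up helper names, records are plain dicts rather than SQLObject instances, and results are filtered but not ranked.

```python
def parse_query(query):
    """Split a query like 'beck -first:brian' into include/exclude clauses.

    Each clause is a (field, term) pair; field is None when the term
    may match in any field.
    """
    include, exclude = [], []
    for token in query.lower().split():
        target = include
        if token.startswith("-"):
            # A leading minus means "must NOT match".
            target = exclude
            token = token[1:]
        field, sep, term = token.partition(":")
        if not sep:
            # No 'field:' prefix, so the whole token is the term.
            field, term = None, field
        if term:
            target.append((field, term))
    return include, exclude


def clause_matches(clause, record):
    """True if the clause's term is a whole word in the chosen field(s)."""
    field, term = clause
    values = record.values() if field is None else [record.get(field, "")]
    return any(term in value.lower().split() for value in values)


def search(query, records):
    """Keep records matching every include clause and no exclude clause."""
    include, exclude = parse_query(query)
    return [
        record for record in records
        if all(clause_matches(c, record) for c in include)
        and not any(clause_matches(c, record) for c in exclude)
    ]


people = [
    {"first": "Brian", "last": "Beck"},
    {"first": "Jeff", "last": "Beck"},
    {"first": "Brian", "last": "Eno"},
]
print(search("beck -first:brian", people))
```

A real backend would add ranking on top of this, which is where term frequency and proximity come in.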
Thoughts?
Update: I made a Merquery Google Group so discussion can now happen in a centralized place. I was also kind enough to make the first typo on there.
Comments
"Less flexible" is often a good goal when building APIs to existing tools.
The ideal API, imho, is something that is absolutely trivial to use in the common case and then provides graceful upward steps as your needs increase.
I think that what you're proposing makes good sense, because it will handle common cases trivially. As long as it can be shown how to go from your example to something more complicated (generating indexes from data in multiple tables?), then it's golden. (Why would you want to generate indexes from multiple tables? I had experience with Lucene ~2002, and we pulled certain bits of data from our RDBMS and stuck them in separate fields in Lucene and could run superfast, complicated queries on that data. Blew the pants off of a normalized relational database, and was faster than a denormalized database for the stuff we were doing...)
The solution to scaling upward may be as simple as diving into the original search API, possibly with some helper functions...
I would suggest just picking *a* search engine to start with and not even worry about abstracting the interfaces. IIRC, Xapian is GPL, which is always unpleasant for businessfolk to worry about. PyLucene undoubtedly has greater overhead because it needs to drag along libgcj. Which makes Hype sound good. I'm looking forward to giving Hype a try myself.
BTW, TurboGears 0.9a2 also supports SQLAlchemy. I expect we'll be reading more about SQLAlchemy as time goes on.
Hello,
This sounds great, since the project I am starting will need such a thing; the problem is that I do not have a clue how to do it. So I will be happy to follow your progress and maybe give you a hand on some topics if I can.
Regarding the API, I would dream about something as simple as a decorator, on the model of @login_required in Django.
@content_indexed
class Announce(meta.Model):
    titre = meta.CharField(maxlength=30)
    description = meta.TextField(maxlength=300)
    is_approved = meta.BooleanField()
That was my 2 cents
This is the most important development in the Python web world. I'd be glad to help. Contact me if you're interested.
Peter Hunt
I started something quite similar to this a while back, and have written down some of the ideas I had at the time.
I thought you might find them useful.
I'd be very interested in a full text indexer abstraction layer. I started writing one myself a while back and have written some of my thoughts down here. Perhaps they will be useful to you.
Any updates on Merquery? Has anyone started on developing it yet?
Jason,
I'm developing it for Django as part of Google's Summer of Code program. Check out Alec's library above for something similar, but not necessarily focused on web framework integration.
Will it provide search capabilities through multiple 'tables' (SQLObjects)?
If there's
class Person(SQLObject):
    # as you defined it
class Bio(SQLObject):
    Person = ForeignKey(...)
    long_historic_text = StringCol(notNone=True)
and I search for a name that's not in the Person database but is part of someone's history, will that be found?
If not, you should consider adding it. Need any help?