Entries for March 2006

More on Merquery

A bunch of traffic has been directed to this blog due to the post about Merquery. Seems like there has been discussion going on in that post's comments and also over at Jacob Kaplan-Moss' post.

As predicted, a lot of people are only seeing the "reinvention" aspect of Merquery. And admittedly, there is a lot that would need to be reinvented based on the goals I wrote down.

The real novel part about Merquery is that it's easy to drop into a Python web application. Imagine if you're going through a TurboGears tutorial and all you have to do is add one line to add full-text indexing and search to your database tables. Cool!

So here's the reformulated plan. Write adapters for the nice-looking Python indexing engines mentioned so far, such as PyLucene, Hype, and Xapwrap. Make them all look the same to use (so they're easy to swap in and out), and make the most basic indexing setup a one-liner. Then, add a pure-Python indexer to the package as a side project, for those people who don't want dependencies. (All three of the existing libraries mentioned above still require the library they wrap to be installed.)
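To make the "look the same" idea concrete, here is a minimal sketch of a shared adapter interface. Everything in it is an assumption made for illustration: the names `IndexAdapter`, `PurePythonIndex`, and `make_index` are hypothetical, and the pure-Python class is a toy inverted index, not a wrapper around PyLucene, Hype, or Xapwrap.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class IndexAdapter(ABC):
    """Hypothetical common interface every engine adapter would expose."""

    @abstractmethod
    def add(self, doc_id, text):
        """Index `text` under the identifier `doc_id`."""

    @abstractmethod
    def query(self, terms):
        """Return matching doc ids, best match first."""


class PurePythonIndex(IndexAdapter):
    """Dependency-free stand-in: an in-memory inverted index."""

    def __init__(self):
        # term -> {doc_id: number of occurrences}
        self._postings = defaultdict(dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            counts = self._postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def query(self, terms):
        scores = defaultdict(int)
        for term in terms.lower().split():
            for doc_id, count in self._postings.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties broken by doc id for determinism.
        return sorted(scores, key=lambda d: (-scores[d], d))


def make_index(engine="pure"):
    """One-liner entry point; other engine adapters would register here."""
    engines = {"pure": PurePythonIndex}
    return engines[engine]()


index = make_index()
index.add(1, "Brian Beck writes Python")
index.add(2, "Beck released an album and Beck toured")
print(index.query("beck"))
```

Swapping engines would then only mean passing a different name to `make_index`, as long as each adapter honors the same two methods.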

Unlike the current interfaces for those indexing libraries, these adapters don't have to be completely general (yet). If they only provide adapters for SQLObject classes and the Django database API, that's already a great accomplishment, even though these adapters are less flexible than the generic interfaces already provided. This will allow Django and TurboGears developers to stick with what they know rather than worry about getting an indexer working with their underlying database. (Hey, we've got to start somewhere, might as well have mass appeal right from the get-go.)

Here's an idea of what some customization of a developer's search engine might look like:


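from sqlobject import SQLObject, StringCol

class Person(SQLObject):
    firstName = StringCol(notNone=True)
    lastName = StringCol(notNone=True)

# Merquery is the proposed package; this call is a design sketch, not a working API.
nameSearch = Merquery.LuceneIndex(first=Person.firstName,
                                  last=Person.lastName)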

(I have no idea why that space is there. Sorry for my blog being so ugly.)

In this example, the developer has customized the index by giving Person.firstName strings the field name 'first' and Person.lastName strings the field name 'last'. So to find people with 'Beck' in their name but not 'Brian Beck', this would work:

beck -first:brian

Developers could just pass query strings like the above directly from their forms to the index:

results = nameSearch.query("beck -first:brian")

Since LuceneIndex knows we passed in SQLObject columns, it will know to return results as a ranked list of SQLObject instances.

results[0].firstName, results[0].lastName

Obviously this example might not be very realistic since firstName and lastName are just strings and we could accomplish this with SQL. But the same ideas apply for fields storing big documents, etc., where things like term frequency and proximity become important.
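The query semantics in this example can be tried out without any indexing engine at all. The sketch below is a toy: `FieldIndex` is a hypothetical stand-in for the proposed `Merquery.LuceneIndex`, it uses a `namedtuple` instead of SQLObject so it runs on its own, and it only filters (results keep their input order, with no ranking).

```python
from collections import namedtuple

Person = namedtuple("Person", "firstName lastName")


class FieldIndex:
    """Toy stand-in for the hypothetical Merquery.LuceneIndex."""

    def __init__(self, records, **fields):
        # `fields` maps a query field name to an attribute name,
        # e.g. first="firstName".
        self.records = list(records)
        self.fields = fields

    def _matches(self, record, field, word):
        # Unqualified words search every indexed field.
        names = [field] if field else list(self.fields)
        return any(
            word in getattr(record, self.fields[name]).lower().split()
            for name in names
        )

    def query(self, text):
        required, excluded = [], []
        for token in text.lower().split():
            bucket = excluded if token.startswith("-") else required
            token = token.lstrip("+-")
            # "first:brian" -> ("first", "brian"); "beck" -> ("", "beck")
            field, _, word = token.rpartition(":")
            bucket.append((field, word))
        return [
            r for r in self.records
            if all(self._matches(r, f, w) for f, w in required)
            and not any(self._matches(r, f, w) for f, w in excluded)
        ]


people = [Person("Brian", "Beck"), Person("Jeff", "Beck"), Person("Brian", "Eno")]
nameSearch = FieldIndex(people, first="firstName", last="lastName")
results = nameSearch.query("beck -first:brian")
print(results[0].firstName, results[0].lastName)
```

A real index would replace the linear scan with postings lists and score each match, which is where term frequency and proximity come in.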

Thoughts?

Update: I made a Merquery Google Group so discussion can now happen in a centralized place. I was also kind enough to make the first typo on there.

Python Web Programming Talk Post-Mortem

As mentioned earlier today, I gave a Python Web Programming talk this evening for the Case community. A slightly larger crowd showed up compared to the last talk—around a couple dozen people.

Like last time, I could have been more organized. I had a sufficiently complex example lined up for demonstration, but unfortunately it involved way more JavaScript than Python, so it turned out not to be such a great example. So I just came up with a new example on the spot—the classic to-do list example. It fit my requirements of having more than one SQLObject table (so I could show how to relate them), having more than one page (so I could show how controllers work), and not being a wiki.

Also like last time, there were a few things I neglected to mention that probably would have put many people's minds at ease while they were trying to keep up with what the heck I was coding.

For instance, I never mentioned that templating with Kid guarantees well-formed XML input and output, and in practice leads to much cleaner templates (based on my own experiences with Cheetah). So people who had never seen Kid or ZPT before were probably thinking "what the heck is this all about?"

Deficiencies aside, I was told that it was an entertaining talk and I didn't make too much of a fool of myself. But most importantly, I hope I made a good impression of Python.

Web Programming with Python Talk

At 5:30 this evening (Thursday, March 30th) I'll be giving a talk about Web Programming with Python. The talk is in Glennan 421, same place as the previous Python talk.

I'll give a primer on the major web application frameworks in Python, then show you how to make a web application with TurboGears, one of the frameworks. Even if you don't know Python, it is easy to pick up. If you're coming from a PHP or Ruby background, I can even answer any questions about how things are different in Python. By the end of the talk we should have a live, working web application.*

At 7:00 PM there is a Google talk with pizza and such. I'll finish with plenty of time to spare so we can all go over to the Google talk afterwards!

* If there are any web application ideas you'd like to see (that are possible in less than an hour), suggest them here. I haven't planned this app at all yet.

<capsule> eggs again sent me like 80 e-mails about some snake

Merquery, Text Indexing and Search with a Focus

If you happened to catch my RSS feed at some terrible hour of the morning a couple nights ago, you may have read a little about Merquery.

Merquery is the reason I want to parse nice search query syntax like Google's (or more generally, Lucene's). It's a full-text indexer and search engine. Not the crawly, home-page kind like Google, but the deployable kind like Lucene.

Now I know what you're thinking. You're coming up with all these nerd-slang fake syndromes with which to diagnose me. You're wildly gesticulating while going on about how a bunch of people smarter than me have done this before and published their results.

Let me say that I think Lucene is one of the best open source offerings out there. If you're serious about deploying search locally, you should use Lucene. Heck, you can even use Python to do that.

Problem is, there are a ton of people out there who aren't that serious about search. Or more appropriately, don't need to be that serious about search. I've been investigating Lucene for the past week, and my impressions are that it's:

  • huge and complex
  • overkill for Joe Programmer's First TurboGears App
  • hard to deploy for my app in less than 5 minutes

Obviously, Lucene was never intended for use with that second point. But there are a lot of developers these days using small-but-scalable web frameworks that could use a lightweight search engine that's easy to plug into their apps. I'm talking, like, just pass it a list of SQLObject columns and it will do the rest. Giving your SQLObject tables a search method is easy, but giving them a fully-featured query syntax and ranking the results is a little more work. Sure, we can count on Google to index our pages for us, but they can't search our databases, which is also a huge need.

So my goal is for Merquery to focus on being super-easy to deploy for those web developers using things like Django, TurboGears, Pylons, etc. I already know what Python database wrappers they're likely to be using, so I can make Merquery cater directly to their needs.

With that in mind, here are some design decisions and other things I've been thinking about:


  • Full Lucene query syntax would be nice, but hasn't Google's small subset of it proven to be good enough?

  • I like being able to search for stop words. Google used to have stop words, but now you can search for anything you want.

  • Python has lots of nice ways to process large datasets without having everything in memory at once; I plan to use these methods extensively.

  • So far I think pyparsing will be the only dependency. There's a lot of cool stuff already in the standard library that's well-suited to building a search engine. And hey, Paul says search query parsing will come with the next release of pyparsing.

  • Some motivating phrases: lightweight, fast, easily deployable, easily extendable
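As one illustration of the "not everything in memory at once" point, generators let each stage of an indexing pipeline pull a single row or token at a time. This is only a sketch under assumptions: `rows` stands in for a database cursor, and the names `postings` and `batches` are made up for the example.

```python
import re
from itertools import islice

TOKEN = re.compile(r"\w+")


def rows():
    """Stand-in for a database cursor that yields one row at a time."""
    yield 1, "To be or not to be"
    yield 2, "The quick brown fox"
    yield 3, "Not the fox again"


def postings(row_iter):
    """Lazily yield (term, doc_id) pairs; stop words are kept."""
    for doc_id, text in row_iter:
        for match in TOKEN.finditer(text.lower()):
            yield match.group(), doc_id


def batches(iterable, size):
    """Yield lists of at most `size` items without reading ahead."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Only one batch of postings exists in memory at any moment.
for batch in batches(postings(rows()), 5):
    print(batch)
```

Each batch could be written to disk or merged into an on-disk index before the next one is read.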


This article (with code) about making something similar in Python is a good read. In fact I've found a few articles like that one, but they're all from at least a few years ago. I got the impression that no one has approached this problem from the same perspective of focusing on the tools today's Python web developers are using (SQLObject, TurboGears, and such). But if you know otherwise, let me know!

Google Search Syntax BNF

A while ago I was making a specialized search engine of sorts and I wanted to use Google's nice query syntax with my own custom modifiers instead of things like inurl:, intitle:, site:, etc. Besides these powerful modifiers, Google's search syntax is nice because no query is invalid. Yahoo, MSN, and Amazon also at least support more than just basic search expressions.

Sometimes I wish other sites (like reddit) would implement this syntax for their search queries. So tomorrow I'll release a little Python module that parses this query syntax and makes the query easy to read and process. I wrote code that did this successfully a year or two ago, but I'd like to rewrite it now with pyparsing or something. Now there will be no excuse to offer only lame search queries.

To get the ball rolling, here's the BNF for the syntax I will implement. I have no idea if this is even a proper way to express BNF, but I used the Python grammar as a reference (because if it's good enough for Guido, it's good enough for me).

L ::= expr | expr L
expr ::= term | binary_expr
binary_expr ::= term " " binary_op " " term
binary_op ::= "*" | "OR" | "AND"
include_bool ::= "+" | "-"
term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
modifier ::= (letter | "_")+
literal ::= word | quoted_words
quoted_words ::= '"' word (" " word)* '"'
word ::= (letter | digit | "_")+
number ::= digit+
range ::= number (".." | "...") number
letter ::= "A"..."Z" | "a"..."z"
digit ::= "0"..."9"

Contrary to what many people believe, you can NOT use parentheses for grouping or precedence in Google search queries. Every punctuation character except for [+-_".] is converted to a space and becomes meaningless.
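For a feel of how little code the term-level part of this grammar needs, here is a rough sketch using only the standard library `re` module rather than pyparsing. It is a simplification, not the promised module: it returns a flat list of tokens instead of building `binary_expr` nodes, it accepts `~` combined with other prefixes, and the names `TOKEN` and `parse` are made up. Unmatched characters are skipped, which mirrors the "no query is invalid" behavior.

```python
import re

TOKEN = re.compile(r'''
    (?P<op>\*|OR\b|AND\b)                 # binary_op
  | (?P<synonym>~)?(?P<include>[+-])?     # "~" or include_bool prefix
    (?:(?P<modifier>[A-Za-z_]+):)?        # modifier ":"
    (?:
        "(?P<phrase>[^"]*)"               # quoted_words
      | (?P<low>\d+)\.{2,3}(?P<high>\d+)  # range
      | (?P<word>[A-Za-z0-9_]+)           # word
    )
''', re.VERBOSE)


def parse(query):
    """Return a list of dicts, one per operator or term, in query order."""
    tokens = []
    for m in TOKEN.finditer(query):
        if m.group("op"):
            tokens.append({"op": m.group("op")})
            continue
        token = {
            "exclude": m.group("include") == "-",
            "synonym": m.group("synonym") is not None,
            "modifier": m.group("modifier"),
        }
        if m.group("low") is not None:
            token["range"] = (int(m.group("low")), int(m.group("high")))
        elif m.group("phrase") is not None:
            token["phrase"] = m.group("phrase").split()
        else:
            token["word"] = m.group("word")
        tokens.append(token)
    return tokens


for token in parse('beck -first:brian year:2000..2006 OR "to do"'):
    print(token)
```

Parentheses and other stray punctuation simply fall between matches, so `parse("(beck)")` gives the same result as `parse("beck")`.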

Intro to Python Talk, Post-Mortem

It only just occurred to me that I should have blogged about this event before it happened to advertise... oh well.

So tonight Chris (with a little help from me) gave an introductory Python talk. It went pretty well and around 15 people showed up, including some surprise Python experts!

I wish I had helped Chris plan it a little better, because afterwards we remembered a couple things we should have mentioned. Some of these include more standard library modules that people might be interested in, like the threading modules, a few more of the common built-in functions (like range, enumerate, sorted), and the Cleveland Python Interest Group.

Other than that, it was a pretty good introduction to Python, from the basics like syntax to the high-level stuff like closures and magic methods. (Thanks to Gary, who saved the day by showing how easy it is to do closures in Python while I was rambling on about them like a moron.)

Next week I'll be giving a talk on Web Programming in Python. Confused by all the Python web programming frameworks? I'll help you pick a good one and show you how to make an application with it. Do you know PHP or Ruby on Rails? Then I can probably even answer your questions about exactly how it differs from Python web development. Time and location are TBD; I'll update this post when things are finalized.

Sprinting messes with your mind

What Mike and I accomplished during Spring break can only be described as a sprint. We hammered out an initial release of a decent contribution to the Python community in two days. In fact, we started preparing for the next release by rewriting everything from scratch to more easily support new collaborative filtering models and make our current models more accurate. There's another project we started that I'll put online in a week or so.

Our long and focused coding sessions resulted in what I have only ever experienced after playing a single video game for prolonged hours. I fall asleep thinking about code and math; sleep is interrupted by thoughts about code and math; the day begins (often in the late afternoon) with thoughts about code and math. It takes a while to wear off.

More specifically, here are some thoughts that keep popping up since we first released consensus:

  • Why don't any of the models presented in research papers work as presented? We've had to modify almost all of them.
  • Why is our refactored code an order of magnitude slower than our initial naive code?
  • Why did using numarray for churning numbers on our dataset make things slower, not faster?
  • If AudioScrobbler's dataset is so large, why are their all-time top recommendations for me so unstable?

Anyway, I hope that posting these thoughts will get them off my mind, because I'm ready to think about other things for a while.

consensus: Collaborative filtering in Python

Mike is visiting for Spring break, so naturally we've spent the whole time coding. We've started a few projects, and we've just made the first of them available.

consensus is a collaborative filtering library for Python. It implements three different filtering models, and makes it easy to implement more. We're no experts on the subject, but the results seem pretty good.
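For readers new to the idea, here is a generic sketch of one common collaborative filtering approach: user-based filtering with cosine similarity. It is not the consensus API and not necessarily one of its three models. The function names and the play counts are made up for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top=3):
    """Rank items `user` has not rated by similarity-weighted ratings."""
    scores = {}
    for other, items in ratings.items():
        if other == user:
            continue
        weight = cosine(ratings[user], items)
        for item, rating in items.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top]


# Made-up play counts, purely for demonstration.
plays = {
    "me":    {"Wilco": 5, "Beck": 3},
    "alice": {"Wilco": 4, "Beck": 2, "Interpol": 5},
    "bob":   {"Beck": 1, "Bright Eyes": 4},
}
print(recommend("me", plays))
```

Users whose tastes overlap more with yours get a larger say in what is suggested, which is why "Interpol" (from the more similar listener) outranks "Bright Eyes" here.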

As an example, we wrote a little script to harvest data from thousands of AudioScrobbler users, starting with myself. All three models yielded similar results with lots of overlap. Based on my AudioScrobbler profile (which hasn't been updated in at least a year), our models suggested these additional artists:

  • Sufjan Stevens
  • Elliott Smith
  • Broken Social Scene
  • Belle and Sebastian
  • The Beatles
  • Interpol
  • Wilco
  • The Decemberists
  • Bright Eyes
  • Beck
  • Death Cab for Cutie

Not bad! There is some overlap with AudioScrobbler's suggestions, so we must be doing something right. We're not sure what fancy algorithms they're using, but based on which suggestions we determined I would actually like, our models had a higher success rate. But who knows, if we had a larger data set, maybe they'd be the same...

Anyway, if you'd like to install the latest release of consensus, just use setuptools:

sudo easy_install consensus

Update: Both consensus and easyBay are now on the Cheese Shop.

i need a linkblog

...for posts like this one, undeserving of a name.

Hey web guys, check out this nice presentation/screencast (warning, 379M video) from Sean Kelly at NASA comparing some web development frameworks. I haven't even gotten to the Python parts yet and I'm excited. So there's something for you to watch while you're at work. Unlike many screencasts, this one is fast-paced and actually kept my attention.

easyBay Unleashed at Clepy

I gave my presentation about talking to eBay with Python last night at Clepy and I think it went pretty well, with positive feedback. You can view the slides from the presentation online. Clink clink:

You can now check out my Python eBay library from my Subversion repository:

svn co svn://exogen.case.edu/easyBay

Some documentation can be found at exogen.case.edu/easyBay. No point release or Python egg yet, everything is still in the trunk. You can sign up for (free) Developer Keys at developer.ebay.com in order to use the library.

1NCREA5E UR C0NF1D3NCE

Everyone knows that when communicating with the opposite sex, what's important is confidence. But did you know that some Case people are trying to vote away this reasonable convention? Hey, this isn't Survivor, guys!

All I have to say is that my saving throw fails against Hundert's charisma.

I press Save.

Friday Fun Facts, Thursday Edition

Did you know there are dozens of little services hiding on campus just waiting to make your life easier? Now you can get a handle on them with start.case.edu. Disclaimer: it's a beta.

I've watched this amazing video of the game Spore three times and I'm still not tired of it. I'd love to play this on the Virtual Worlds Lab computers...

Last Tuesday Project Club gave an electromagnetics demonstration to attract interest to the upcoming electromagnetics competition. We had a coil gun, a disc launcher, a Jacob's ladder, and a friggin' Tesla coil. Oh yeah, and a couple games of Shock Tanks. Pictures of me and Jon Ward drawing arcs from the coil are sure to crop up soon.

Patty is back from Amsterdam, so I've had to give up spending my days doing manly things, such as harnessing the power of Tesla coils.

Talk about manly... today Steve and I went on a Haircut & Ice Cream man-date, our second ever. And I'm sure the future holds many more.

Yesterday's Achewood strip amazingly embodies one aspect of what makes the comic so good... attention to detail, developed history, and subtle humor. My favorite in a while.

I have like a dozen experimental web services gathering dust in various states of disrepair. I need to release early and often to stop this from happening in the future.

Only four days until the March meeting of the Cleveland Python Interest Group. Don't forget, I'll be giving a talk about how to talk to eBay using Python — great for economics research projects or just for fun.