Entries in "Projects"
September Projects
I'm in the situation Ian Bicking was in not long ago—I'm really tired of this blog design and software and it's making me not want to post any of the entries I have pending. This blog will soon redirect to something better.
Pagoda should have a Developer Preview in October. Check out my presentation from the September meeting of the Cleveland Python interest group.
Remember how Ian and I spent months thinking up hundreds of names for our company? We are now incorporated as Unstoppable Rocket—one of the first names that was suggested.
If you've been following the geopy list, you've heard about the new release coming out. It should make things much more flexible and extendable, and fix all the issues from the past year or so. geopy 0.99 will be out this week.
The geopy update is also getting me back into the campus crime map and my Case geocoder service, which is going to be really smart. Updates there soon.
I started a new project called Revisionist, which is like Pagoda's revision model except generalized and using SQLAlchemy 0.4. I'm hoping other people will be interested in using and improving such a project. With the right helpers it should make revisioning complex models really easy.
If anyone has any neat suggestions for what Gary or I should talk about at the October Clepy meeting, let me know.
Project and apartment updates
Some interesting bits from the past few weeks...
Next Tuesday I'll be having lunch with Mike Cermak, webmaster for the Greater Cleveland Regional Transit Authority. In my previous entry I mentioned my RTA Schedule project which has been gaining popularity. There were only a few routes listed on there when I posted it, and the list has been growing as people have been using the route adder. Mike wants to work together to come up with ideas and improvements that will encourage projects like mine—a very cool response, and beneficial to RTA users as well. I'm looking forward to it!
Remember those wacky import tricks I posted about to get multiple database engines working nicely in Pagoda? After coming up with that, Ian dug around to figure out what changes would be necessary to not have to do that. He narrowed it down to one single line of code in TurboGears! In turbogears.database:
def create_session():
"Creates a session with the appropriate engine"
return sqlalchemy.create_session(bind_to=get_engine())
That bind_to argument is totally unnecessary when using DynamicMetaData! Changing that to just use SQLAlchemy's create_session without arguments makes multiple database engines possible without any black magic. Unfortunately, we didn't notice TurboGears 1.0.2 about to be released and didn't start any discussion about changing this in time. For now we use this little monkeypatch:
session_context = turbogears.database.session.context
session_context.registry.createfunc = sqlalchemy.create_session
So I think it works more like Alchemyware now, except we don't have to write models any differently and the engines are cached. The metadata is simply pointed to the appropriate engine in each thread.
Speaking of Pagoda, we're still at least a couple weeks away from a beta release. We're currently writing glue for all the little bits and pieces we've created over the past couple months. We've satisfied many of our original goals and learned more about (and sometimes changed) others. I'll share more about these satisfied and modified goals later.
Pagoda's third contributor, Chris, moved back home to start hunting for jobs in the California area. Good luck, Chris! Chris is a fine electrical engineer and programmer and you should hire him. This was his plan since starting to help with Pagoda, so it doesn't really affect our development schedule.
After receiving practically no feedback from the release of dmath, there has been a small surge of interest recently, with a couple contributions, so there will likely be a new release. I put up a new egg of the old version on the Cheese Shop after learning that the Python 2.5 version was busted.
geopy continues to receive patches; recently the most-requested improvement was contributed by Amos Latteier and that is the removal of print chatter in favor of logging. I'll get 0.94 out this weekend with that and other improvements.
Since Chris moved out, our friend Greg moved in with me and Sara. Greg went to school for art and likes to paint and draw, and might even prove his cooking talents at culinary school next semester. I'll be helping him make a website for his comics, which are very funny, but I can't decide if it's because I know Greg and imagine him coming up with them, which itself makes me laugh. You'll be the judge soon enough...
There are two more new, smaller residents of our apartment as well... one's a 14-inch Oscar cichlid and the other's a 15-inch Plecostomus. They're friendly and big! Now I have fantasies about getting them a bigger aquarium with all manner of luxuries. I picked them up from someone who's graduating and they came with their 45-gallon home and necessities for free! I'll post some pictures of these guys soon.
Multiple sites, one Python: Pagoda import tricks
One of our early goals when designing Pagoda was to allow a single Pagoda instance to support multiple sites. This was due to the way memory works for web servers running on Python and TurboGears. How exactly this adds up depends on your threading and web server configuration (mod_python), but traditionally hosting multiple sites means running at least one Python instance per site, each costing 10-20 MB. The more modules each instance loads, the higher the memory usage, and since Pagoda sites will likely use a bunch of modules, that adds up. The most limiting factor in many hosting services is the amount of memory your account is allowed to consume.
Obviously if each Pagoda site is large and running custom code, it might be a good idea to run each in its own Python instance, so one site can't bring down all the others. But the common case, we think, is a bunch of moderately sized sites using just the built-in page management tools. So we devised some ways to allow multiple sites to run from one TurboGears project...
The first and simplest plan involved a database model, where pages and other table rows point to whichever site they belong to. You probably already know why this is a bad idea. First of all, every single table in the database needed to have a site_id column, since nothing would be shared between sites. Unique things like usernames would need their constraints modified to only be unique per-site. That got old pretty fast. Secondly was security. How could we ensure that every piece of code touching the database, even the eventual third-party plugins, would use the correct site in their queries so as not to mess with the others? And finally, having each site's contents in one massive database would not be very convenient if the site owners wanted backups of their portion of the database.
So we started looking at multi-database solutions, and quickly realized we were pretty much on our own for what we wanted to do. We don't just want some models in one database, and other models in a difference database; we want the same models in every database. Every site needs a pages table, for example. Since we're mapping tables with SQLAlchemy, and each mapper is bound to metadata, an engine, and a session, it seems that we'd need to run the table and mapper definitions once per site; each time, the engine would point to the appropriate site's. And now the big trick: how do we do this without modifying any model code, so that plugin writers don't have to learn any silly new details, and without doing a bunch of extra work every time a controller needs to use a model? If our controllers import pagoda.models.pages, how will it know to get the Page class bound to the current site's engine, and not another site's?
We looked to CherryPy for inspiration. In a TurboGears controller, importing cherrypy.request and cherrypy.response will make the current thread's request and response objects available. How do these objects magically belong to the appropriate thread? They simply use a class called ThreadLocalProxy. As the name suggests, cherrypy.request and cherrypy.response are proxy objects that determine the current thread and point object access to the correct request and response instances. Similarly, we want something like SiteLocalProxy, which will make model classes available that are magically bound to the correct site's engine.
Using ThreadLocalProxy as inspiration, we made a clever little object called site. When anything is imported from pagoda.site, it will rebind turbogears.database.metadata and turbogears.database.session after updating sqlalchemy.dburi in the config to point to the current site's. Then the requested module is imported and cached for next time (so the models aren't reinitialized every time). No model code was changed at all! The only necessary modification was importing from pagoda.site.models instead of pagoda.models in our controllers.
Our first implementation looked very much like ThreadLocalProxy, but it made our import statements look funny since site wasn't a real module. So we started investigating the imp, ihooks, and imputils modules, eventually leading us to PEP 302. With help from Importing (to reduce the amount of code necessary), we now have a special pseudo-module called site, and Pagoda modules imported from that will take the current request's site into account instead of just being imported once for the entire process.
Before writing up this entry, I came across Alchemyware. At first it looked promising for what we want to do, but as far as I can tell it requires modifying the way you write models and reinstantiating them on every request. Also, I don't understand how the mapped class can be "shared by everyone" if it's being mapped to multiple databases.
Anyway, after cleaning up our proof-of-concept I'll share the code behind our import trickery in case anyone is trying to do something similar, but mostly just because such tricks are interesting.
In case you forgot, we missed the end-of-March deadline we set for our demo, due in part to being burned out after PyCon. We're shooting for the end of April now.
Pagoda CMS Notes

If you were at PyCon 2007 or read Gary's blog or read the TurboGears mailing list, you may have seen Pagoda CMS mentioned. Pagoda is an open source content management system I've been working on with Chris and Ian. It's built on TurboGears and is focused on being simple yet extensible. We put up an introductory screencast at pagodacms.org that we hurriedly made the night before PyCon.
We've tried a bunch of content management systems, both open source and commercial, and developed for small shops, big corporations, government organizations, and of course Case itself. There are features that are consistently implemented poorly, hard to understand, or simply missing. Pagoda is a result of the observations we've made of how content management systems are really used in a production setting.
These are just a few of the notes and design goals we've been using along the way.
Don't overengineer it
Somewhere along the line someone decided that if you're going to make a content management system, you have to build everything on top of a dozen layers of abstraction. Some pretend that there's no difference between static page content (like a blog entry) and dynamically generated content (like a news feed). Some pretend that building complex workflows that are exactly suited to the way your organization is structured is a common need (we've found that people already have real-life workflows and rarely do they want this duplicated in a CMS).
Experience has shown us that such complexity is rarely needed. We don't try to fit every feature into a "plugin" structure or an "actions" framework. We've streamlined the features based on our experience, and hopefully kept it fun to hack on (when you do need something extra) by avoiding meaningless abstractions.
Do one thing really, really well
A lot of content management systems try to do everything involved in running a web site. Database management, email management, form design, you name it. We don't want a content management system that takes over every aspect of making a web site. We've made conscious decisions to leave a lot of features out. In addition to the above (which can all be found in Zope + Plone, for example), we've spent a lot of time deciding how far certain features should reach and what should be left up to the webmaster.
One example is theme switching. When you're first building a web site, being able to download prepackaged themes might be nice. But for production sites, this simply does not happen. Imagine the Cleveland Museum of Natural History or a university department downloading new themes and swapping them out. Not gonna happen. Instead this is limiting, because prepackaged themes require predetermined markup. As a result, most Plone sites look the same and are structured the same way. They have the little tree on the left and those tiny tabs and a logo above that. And then you're scared to modify too much CSS because there's a bunch already dedicated to making those tabs pixel-perfect. We don't have a default theme or even default markup. Markup and design are meant for programmers and web designers, let's not pretend otherwise.
Use simple terminology
As Jeffrey Veen mentioned in Making A Better CMS, stop it with the jargon already! "Mambots", "archetypes", "portlets", and I'll admit it, I'm not even a fan of the term "widgets". We've tried to use understandable terminology throughout Pagoda and not extend failed analogies.
One example where we created a feature and spent some effort on choosing a name is Placeholders. This is a feature that we've actually needed on production sites but haven't found in other content management systems. The idea is that there is text that appears on multiple pages within the content, and it would be nice to only have to change in one place so we don't have to hunt down every page in the future. Phone numbers, store hours, admission prices, and press contact information are some examples. These aren't template variables because they have nothing to do with templates (to the user) and aren't arbitrary Python objects, and they're not code snippets because they have nothing to do with code. They're simply content placeholders. Here's the mockup we used while implementing this feature:

Borrow features that work
We've had a lot of inspiration along the way and used it to solve real problems. For example, if you need to have a downloadable file on your web site, a lot of content management systems will force you to ask "where do I put this?" and once you've decided on a place, require you to find your way there in the filesystem. We decided on pages having Attachments. Most downloads are associated with a particular page, so just upload them to that page and that will determine their location. We used 37signal's Campfire for inspiration, where people can upload files to the room they're in and they appear as attachments.
Reduce the number of clicks
We're lucky enough to have started developing after AJAX became popular. The "Web 2.0" buzzword might be annoying, but this is really something we can use to make content management quicker and easier. Navigation and messing around with page options won't require dozens of clicks and page reloads anymore. Instead of having to retrofit our software to take advantage of AJAX, we can design with it in mind.
Built a content management system, not a new framework
Similar to doing one thing really well, we're not building a web framework. That's what TurboGears is for. People can still use their existing TurboGears controllers, models, and templates. We're using SQLAlchemy for Pagoda's models and Genshi for the templates. To install Pagoda for your existing TurboGears project, you'll just have to subclass from PagodaController instead of the default RootController, so Pagoda can dispatch requests to the appropriate page.
So hopefully it sounds like an interesting project. We're still hacking on the core and hope to release a demo before the end of March, when we'll also invite people to help out and find weak spots. We have some mailing lists on Google Groups for discussion: pagoda-talk (general discussion), pagoda-coders (core development), and pagoda-announcements (for releases and other notices). For the first few releases we'll also make announcements on the TurboGears list.
Japan Loves Sucks/Rocks
Last night, Gary and I were in Jennings making some improvements to sucks/rocks. We switched to Paste's HTTP server so that queries to the search API wouldn't block the whole server (which was single-threaded). This was a big problem because after running out of both Yahoo and Google queries, we had to fall back to Microsoft Live's, which is the slowest to respond by far, and has an annoyingly small index.
While doing that and other minor changes, Gary noticed the server log start going crazy. "Holy crap! Someone's sending tons of traffic our way!" I asked him to check the referrer. "I just hope it's not — shit, it's Slashdot! Japanese Slashdot!" We checked slashdot.jp and sure enough, it was the latest story. We went into panic mode as our CPU usage on WebFaction grew to 50% as we served multiple requests per second. We were able to make a few changes and get it back down around 6%, and Paste's HTTP server kept us online the whole time! We also noticed a bunch of other Japanese sites in our logs, including headlines.yahoo.co.jp , 100shiki.com, and japan.internet.com. Japanese people love sucks/rocks! Our Google Analytics graph for this morning is below.


While Slashdot Japan was busy hammering us, we added a Latest searches box where you can watch incoming searches scroll by. We had been doing this before by watching the server log, and figured others would enjoy it too. This was flying by with all the traffic last night. (You can see below how crappy Microsoft's index is — they don't even have any results for xbox360, with or without the space!)

I'm not sure how many more features such a silly little project can handle, but Gary seems to have started working on a new (unrelated) project.
sucks/rocks: Does it suck?
A few nights ago Gary and I were in JCC trying to think of a quick project to make together. Someone suggested a web app that uses Gary's algorithm for evaluating software and companies: Google for "software/company sucks" and "software/company rocks" and compare the number of results. The result is sucks-rocks.com, which we've been refining for the past few days.
This might sound a lot like Google Fight until you realize that Google Fight's results are totally meaningless in the context of which term is better. sucks/rocks actually compares the number of pages praising something and the number of pages griping about it, which is a much better metric. Several phrases are used for both positive and negative queries.
I showed this to some people at the Django sprint at PyCon after someone brought up Cats or Dogs, which is even more interesting.
We included some examples for version control systems, operating systems, broadband providers, cell phone providers, and programming languages.
We're actually using Yahoo's Search API instead of Google's, since Google deprecated theirs and it only allows 1,000 queries per day. The results aren't much different. If you find anything interesting, post the link (using "Link to this") in a comment. (Note: There are a couple bugs that are fixed but still need to be pushed to the live site.)
CAS 1.0 Authentication for Django, Part 2
After using my Django CAS authentication module for a while, I decided to make a couple improvements.
The biggest improvement is that instead of modifying code in the CAS module itself to set your CAS address and do things like custom User field population, all this stuff can now be configured in your settings file.
Another improvement is that CAS authentication now works for the bundled admin interface. Since the administration interface does not account for an authentication backend that doesn't know the user's password, this makes the login form useless. The CAS module will now intercept requests to the administration interface and do the proper authentication routine if necessary, never showing the login form (which doesn't make sense for CAS). Intercepting requests, you ask? Yes, that means the CAS module is now middleware. Actually it's middleware, a couple views, and an authentication backend.
So here's how to use it now...
Extract it in django/contrib/. The code will be located at django/contrib/cas/. Is this a valid place to install third-party middleware? It's not really clear. Just do it anyway.
Now add it to the middleware and authentication backends in your settings. Make sure you also have the authentication middleware installed. Here's what mine looks like:
MIDDLEWARE_CLASSES = (
'django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.cas.middleware.CASMiddleware',
'django.middleware.doc.XViewMiddleware',
)
AUTHENTICATION_BACKENDS = (
'django.contrib.cas.backend.CASBackend',
)
You can now configure the CAS module in the same settings file. Here are the possible options, most of which can be safely ignored:
CAS_SERVICE_URL: This is the only setting you must explicitly define. Set it to the base URL of your CAS source.CAS_POPULATE_USER: A callable or the location of a callable. When a user logs in and is missing name and email attributes in the database, this will be called with their User model instance. Default is None (do nothing).CAS_ADMIN_PREFIX: The URL prefix of the Django administration site. If undefined, the CAS middleware will just check the view being rendered to see if it lives indjango.contrib.admin.views. The method is a little evil, but it works.CAS_LOGIN_URL: The URL where you bounddjango.contrib.cas.views.login. If undefined, assume/accounts/login/.CAS_LOGOUT_URL: The URL where you bounddjango.contrib.cas.views.logout. If undefined, assume/accounts/logout/.CAS_REDIRECT_URL: Where to send a user after logging in or out if there is no referrer and nonextpage set. Default is/.CAS_REDIRECT_FIELD_NAME: The name of the GET parameter in which to store the page URL to send the user to after logging in. Default isnext.
Need an example? Here's what my CAS settings look like:
CAS_SERVICE_URL = 'https://login.case.edu/cas/'
CAS_POPULATE_USER = 'present.utils.populate_user'
And the callable that lives at present.utils.populate_user (notice this code lives in my project instead of tinkering with the CAS module) looks like this:
def populate_user(user):
try:
ldap = LDAP()
person = ldap.filter_one_by(uid=user.username)
except:
if not user.email:
user.email = "%s@case.edu" % user.username
else:
# If it succeeds, update their User entry
user.email = person.mail[0]
user.first_name = fix_case(person.givenName[0])
user.last_name = fix_case(person.sn[0])
(LDAP and fix_case also live in my utils module).
Finally, make sure your project knows how to log users in and out by adding these to your URLconf:
(r'^accounts/login/$', 'django.contrib.cas.views.login'),
(r'^accounts/logout/$', 'django.contrib.cas.views.logout'),
Users should now be able to log into your site, and staff into the administration interface, using CAS 1.0.
Talk: Geocoding at Case
Tomorrow (November 28th at 6 PM) is ACM's Nerd Cultural Dinner at Case. The dinner is located in Thwing Ballroom and you should probably reserve a ticket by e-mailing acm-officers@case.edu.
After dinner is served, there will be several presentations. The first (at 6:30 PM) is by yours truly and is about Geocoding at Case. They gave me 15 minutes to talk, and I'll be following this approximate outline:
Geocoding at Case
- What is geocoding?
- Why is it useful?
- Using the Case geocoder web service
- A simple example
- Development details
- Location finding: Case Wiki, Google Maps, ...
- Location parsing: Matching buildings, areas, addresses, ...
- How you can contribute to its development and improvement
Hope to see you there!
Simple CAS 1.0 Authentication for Django
Back when I expressed interest in making the web presentation bounty based solely on client-side code, Simon (bounty master and Filer admin) expressed his wish to keep the two services decoupled (so I shouldn't rely on Filer for slideshow storage). While I still want to have a save-to-Filer feature, I decided that I should just go ahead and get the web presentation system up and running before worrying about a client-side-only version. So I started a Django project.
Anyway, the result is that I got CAS 1.0 working alongside the Django authentication system, which means I can take advantage of built-in features like permissions and messages with CAS-authenticated users.
If anyone else is interested in using CAS authentication with Django, you can download the code I'm using. Here's a brief usage guide:
- Set
SERVICE_URLincas/__init__.pyto the location of your CAS service. For example, Case's ishttps://login.case.edu/cas/. - Set
DEFAULT_REDIRECT_URLincas/__init__.py. Normally the user will be sent back to theirHTTP_REFERER(the page that requested login) after authentication. But if the user requests/accounts/login/directly (or there is noHTTP_REFERER), they will be sent toDEFAULT_REDIRECT_URL. - Enable the
loginandlogoutviews by adding these to your URLconf (customize the URLs if you want):(r'^accounts/login/$', 'your_site.cas.views.login'), (r'^accounts/logout/$', 'your_site.cas.views.logout'), - Add the backend in
settings.py:AUTHENTICATION_BACKENDS = ( 'your_site.cas.backends.CASBackend', ) - Make sure at least the following apps are installed:
INSTALLED_APPS = ( 'django.contrib.auth', 'django.contrib.sessions', 'your_site.cas', ) - Finally, if you have a way to populate the user's name and e-mail address fields from their username, put it in
cas/backends.py(see the comments). For example, I have LDAP code there.
P.S.: This just implements the minimum required for CAS authentication. Features like gateway, renew, and proxies are not supported.
An alpha version of the presentation system should be online to play with later this week.
geopy 0.93 Released: distance, util, GeoNames
Finally released geopy 0.93, which contains the distance and util modules I previously mentioned, a GeoNames geocoder, and improvements to the Google geocoder in other formats. Updating the documentation was all that was holding it back, really.
You can now pass domain and resource arguments to the Google geocoder. To query the actual Google Maps interface (instead of their official HTTP geocoder), initialize like so:
g = geocoders.Google(resource='maps')
The JavaScript results tend to be the best for this resource, so change that as well:
g = geocoders.Google(resource='maps', output_format='js')
Finally, for geocoding addresses outside of the US, change the domain being queried:
g = geocoders.Google(domain='maps.google.co.uk', resource='maps', output_format='js')
As James Robinson brought up on the geopy mailing list, work is under way for accuracy support. This will let you determine how precise the geocoded result is for the given location. For example, is it only guaranteed to be the correct city? Street? Is it the exact address? I decided to release this version of geopy without completing this, because not much work is done so far (and we also want to normalize values across geocoders), and the distance module was a pretty big addition.
To upgrade:
sudo easy_install geopy
A Web-Based Presentation System for Case
If you saw the new EECS department bounty system, you may have noticed the bounty for a web-based presentation system. This is now a work-in-progress by yours truly.
The goal is to let Case people generate slideshows through the web with Case-themed templates, mostly for use on the plasma displays in the EECS department (well, the displays that will be in Glennan). Another use could be by faculty in place of PowerPoint (so it should have familiar features such as transitions).
You may have seen those displays in the fishbowl last year showing pictures and such. Apparently these are a pain to administer because they are connected by a wireless connection and thus are not easy to connect to via Remote Desktop. So when they are rebooted or PowerPoint closes somehow, getting the slideshow back up and running can be a real hassle. Having them in kiosk mode in a web browser will allow them to easily refresh if anything goes wrong, and will also let them grab updated content. This is how the plasma displays around campus show their (I think Flash-based) slideshows.
There are two such tools on which to base such a system: S5 and Slidy. If you've ever viewed a text-based presentation online, it was probably with S5. This is the first time I have heard of Slidy, but it actually looks a lot better to me. Every time I've encountered S5 in the wild, the JavaScript has eventually completely messed up the text sizing and positioning, making everything unreadable.
There are existing services that combine web presentation systems with editing and sharing, such as S5 Presents and SoapBX. SoapBX is not open source while S5 Presents is GPL licensed and in Ruby. I would have no problem with using S5 Presents, but apparently it has gone unmaintained for almost two years, and Simon was disappointed when he looked into configuring it to run under a real web server.
Tonight I decided to see what actually using S5 is like, so I customized the default theme to resemble the Case PowerPoint template found on the Case branding page. So here is my Case-themed S5 presentation. It's pretty incomplete and doesn't match exactly, but it's close (and anyway, the goal is to make it look good, not match exactly—their PowerPoint template could use some work).
When I first considered this bounty, I thought "hmm, web-based, which web framework will I use?" One of the major goals is to be easily maintainable once I leave, so something popular or at least very easy would be the best choice. Which best fits this description? Django? TurboGears? web.py? I've never really considered the maintainability of any of these, since usually I can choose whichever I like at the moment.
After a bit of talking with Chris, we determined that any server-side code at all isn't even really needed. The only thing it would be used for is to display view and edit pages for the presentations and to store them. The first two can be done with static HTML and JavaScript, while the third can be used by POSTing with AJAX to some location. For example, we could POST presentations to the EECS department's new DokuWiki-powered site with the appropriate permissions. Since S5 and Slidy presentations are just HTML, the wiki page itself would be the presentation outline, and the slideshow viewer could just grab the markup from there. Or we could use AJAX to upload the presentation via WebDAV to the author's Filer account.
Since there would be heavy JavaScript usage regardless of the selected framework, the author would need to be fluent in it anyway. So why not a pure JavaScript solution, given the possibilities above? This is what I'll be looking into for the next few weeks. The code will soon live at opensource.case.edu/projects/present.
If anyone has experience with S5 or Slidy and would like to share their opinions, please do (even if it's just which system you like more).
dmath: Math routines for Python's arbitrary-precision Decimal type
Yesterday Chris and I spent all day writing math functions for Python's Decimal type. The result is our new dmath library, available on Google Code and the Cheese Shop under the MIT/X11 license.
Sparked by the routine for atan in my last post, I decided it wouldn't be too hard to go ahead and do the rest of the functions already offered by math and cmath. We now have acos, asin, atan, atan2, ceil, cos, cosh, degrees, e, exp, floor, golden_ratio, hypot, log, log10, pi, pow, radians, sign, sin, sinh, sqrt, tan, and tanh.
Check it out:
>>> from dmath import *
>>> from decimal import Decimal as D, getcontext
>>> getcontext().prec = 50
>>> asin(D(1))
Decimal("1.5707963267948966192313216916397514420985846996876")
>>> golden_ratio()
Decimal("1.6180339887498948482045868343656381177203091798058")
We're calling this release 0.9 because it just needs some testing and maybe some speed improvements, otherwise it's ready to use. There is currently some work being done in Python sandbox/trunk to convert the decimal module to C, and maybe they'll include fast versions of all these routines. But hey, you can use these right now!
Arbitrary precision is one of the coolest things in programming. We spent a lot of time in Mathematica, where if you ask it to tell you the precision, it says 'Infinity'. During our testing, we actually stumbled across a bug in Mathematica's ArcTan function! This page correctly states that ArcTan[-Infinity, y] should always be Pi (with the sign of y). However, Mathematica always returns 0. I sent a message with my findings to the Mathematica mailing list and Daniel Lichtblau of Wolfram Research confirmed that it is indeed a simple bug. ArcTan users, beware!
Anyway, enjoy dmath. Contributions are welcome, especially if you have any speed tips!
Geocoding and Python's decimal module
Python has an awesome decimal module for decimal floating point arithmetic. It has configurable precision and keeps track of significant digits and does some other neat stuff.
While I was adding the geopy distance module, I began to wonder if it would be worth the effort to switch everything in geopy over to use Decimals instead of floats. After checking out the decimal module (I had never used it before), I decided that I had nothing to lose, so I went for it...
I quickly ran into some snags when I realized that I'd have to code my own trigonometric functions for use with Decimals, since those that come with Python are for complex or floating point numbers. The decimal recipes page in the documentation has functions for sin and cos, but distance uses asin, acos, atan, and atan2. Don Peterson has a nice decimalfuncs module with most of these, but it's GPL (and would be an uncommon dependency) — geopy is MIT/X11. So I went ahead and started on these...
I decided it would be easiest to define asin and acos in terms of atan, and it turns out there is a (relatively) quickly converging algorithm for that. Here's what I came up with for a Decimal-compatible atan:
def atan(x):
if x == D('-Inf'):
return pi() / -2
elif x == 0:
return D(0)
elif x == D('Inf'):
return pi() / 2
if x < -1:
c = pi() / -2
x = 1 / x
elif x > 1:
c = pi() / 2
x = 1 / x
else:
c = 0
getcontext().prec += 2
x_squared = x ** 2
y = x_squared / (1 + x_squared)
y_over_x = y / x
i, lasts, s, coeff, num = D(0), 0, y_over_x, 1, y_over_x
while s != lasts:
lasts = s
i += 2
coeff *= i / (i + 1)
num *= y
s += num * coeff
if c:
s = c - s
getcontext().prec -= 2
return +s
It depends on the pi function from the decimal recipes page, which calculates pi to the currently configured precision.
Upon finishing this, Chris came home and I told him what I was doing. Immediately, he tried to talk me out of it, asserting that floating point was good enough for geocoding. I tried to counter by explaining all the floating point calculations being performed in distance, but in the end he won. I no longer think it would be a very important change to convert everything in geopy to use the Decimal type.
What finally convinced me was this quote from the Vincenty distance page I used for reference:
Vincenty’s formula is accurate to within 0.5mm, or 0.000015″ (!), on the ellipsoid being used.
0.000015 arcseconds is about 4.16667e-9 degrees. Well, if floating point is good to about 10 decimal places, I guess Chris wins this time...
Still, if anyone wants Decimal support in the future, maybe I'll just ask Don Peterson for permission to include decimalfuncs with geopy...
Update: On second thought, maybe I will just continue implementing my own trig functions for Decimals. Chris and I just spent a while investigating the precision of my atan vs. decimalfunc's, and mine seems to be faster and more precise.
geopy gets distance and util modules
If you check out geopy trunk right now you'll notice a few changes.
I introduced two modules: util and distance.
util now contains the parse_geo and arc_angle functions, and will grow more in the future.
distance is a bigger addition and contains helpful functions for calculating geodesic distances. I planned to add this eventually, but development was sparked by a request from Chris Mulligan.
There are two distance formulas: Great-circle (aka haversine, aka spherical law of cosines) distance and Vincenty distance.
Great-circle distance uses a spherical model of the earth, using the average great-circle radius of 6372.795 kilometers (this is configurable). This results in an error of up to about 0.5%.
Vincenty distance uses a more accurate ellipsoidal model of the earth. This is the default distance formula, and is thus aliased as distance.distance — so you can easily swap out distance formulas just by changing distance.distance at the top of your code. There are multiple popular ellipsoidal models, and which one will be the most accurate depends on where your points are located on the earth. geopy includes a few good models in the distance.ELLIPSOIDS dictionary:
# model major (km) minor (km) flattening
ELLIPSOIDS = {'WGS-84': (6378.137, 6356.7523142, 1 / 298.257223563),
'GRS-80': (6378.137, 6356.7523141, 1 / 298.257222101),
'Airy (1830)': (6377.563396, 6356.256909, 1 / 299.3249646),
'Intl 1924': (6378.388, 6356.911946, 1 / 297.0),
'Clarke (1880)': (6378.249145, 6356.51486955, 1 / 293.465),
'GRS-67': (6378.1600, 6356.774719, 1 / 298.25),
}
Here's an example usage of distance.distance:
>>> from geopy import distance
>>> import Case
>>> wiki = Case.Geocode.CaseWikiGeocoder()
>>> _, a = wiki.geocode('Wade')
>>> _, b = wiki.geocode('Fribley')
>>> distance.distance(a, b).kilometers
1.342250272726943
>>> distance.distance(a, b).miles
0.83403565192666562
Using Great-circle distance:
>>> distance.distance = distance.GreatCircleDistance
>>> distance.distance(a, b).miles
0.835175984734287
You can change the ellipsoid model used by the Vincenty formula like so:
>>> distance.VincentyDistance.ELLIPSOID = 'Intl 1924'
The above model name will automatically be retrieved from the ELLIPSOIDS dictionary. Alternatively, you can specify the model values directly:
>>> distance.VincentyDistance.ELLIPSOID = (6377., 6356., 1 / 297.)
Oh yeah, you can add distances too (for paths and such). Here's the distance from Fribley to Wade to Phi Kappa Theta:
>>> _, c = wiki.geocode('Phi Kappa Theta')
>>> (distance.distance(b, a) + distance.distance(a, c)).miles
1.0596624112817861
Also included in the distance module are functions for converting between length units (kilometers, miles, feet, nautical miles), and calculating a destination given a starting point, initial bearing, and distance.
This stuff is still just in trunk, no egg or updated documentation yet...
geopy: Now on Google Code, More Geocoders
I decided to try out the Google's Project Hosting feature for geopy. You can find the hosted page at code.google.com/p/geopy. So far it seems pretty sweet and very easy to administer.
I added a geocoder for Microsoft's Windows Live Local (powered by Virtual Earth) to the geocoders module. Sadly, they don't actually have a non-JavaScript geocoding API, so I had to reverse-engineer it.
Norman Khine and I have been investigating issues geocoding UK addresses with the Google Maps API. Due to contractual reasons, they can't offer geocoded addresses with their HTTP geocoder. So instead I again had to reverse-engineer their JavaScript to get it to work. The geocoded results aren't always accurate, but this is Google's problem and not geopy's.
I also tried to add a geocoder for MapQuest's OpenAPI. It is possible to get geocoded results over HTTP (although they don't tell you how, you have to look at their JavaScript or guess), but unfortunately they require you to parse the input location first. This is totally lame. You're telling me they can't parse the address into street, city, country for me? I didn't want to have to do this, but I now plan to add address parsing methods to geopy.
geopy: A Geocoding Toolbox for Python
I just uploaded the first release of geopy to the Python Cheese Shop.
The web site (with a little documentation) is located at exogen.case.edu/projects/geopy/.
There are five geocoders: Google, Yahoo!, geocoder.us, MediaWiki, and Semantic MediaWiki. Usage examples are given on the web site.
I need to read up more on setuptools to make my package maintenance life easier. For instance, telling the Cheese Shop to install my package from Subversion would be nice right now, but I'm not sure how.
Update: You can now view the pretty syntax-highlighted source code on the web site. Thanks to the wonderful KDevelop for that ability!
Geocoding Tools for Python (and CaseClasses)
I've been working on some geocoding classes in Python. Right now I've got tools made for MediaWiki, Semantic MediaWiki, and the Google geocoder. I plan to include the Yahoo! geocoder in this toolbox soon.
I think this could be a useful package, so I plan to upload it to the Cheese Shop soon.
I added the relevant geocoder classes to the Python CaseClasses so that developers can easily geocode strings using Case's Semantic MediaWiki.
Check it out. First grab the geocoder:
>>> from Case import Geocode
>>> wiki = Geocode.CaseWikiGeocoder()
Then start geocoding:
>>> place, (lat, lng) = wiki.geocode('KSL')
Fetching http://wiki.case.edu/KSL...
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
Kelvin_Smith_Library: 41.50727, -81.60950
geocode returns a tuple consisting of the location name found and the coordinates (another tuple).
Here's where the Semantic part comes in. The Project Club article isn't geocoded, but it is located in Olin, which is geocoded:
>>> place, (lat, lng) = wiki.geocode('Project Club')
Fetching http://wiki.case.edu/Project_Club...
Fetching http://wiki.case.edu/index.php/Special:ExportRDF/Project_Club?xmlmime=rdf...
Fetching http://wiki.case.edu/index.php/Olin_Building...
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
Olin_Building: 41.50224, -81.60778
CaseWikiGeocoder is a subclass of SemanticMediaWikiGeocoder and is defined by only the following:
class CaseWikiGeocoder(SemanticMediaWikiGeocoder):
def __init__(self):
super(CaseWikiGeocoder, self).__init__("http://wiki.case.edu/%s",
relations=['Located in'])
This creates a SemanticMediaWikiGeocoder with a base URL of 'http://wiki.case.edu/' that follows the 'Located in' relation if a page fails to geocode. So SemanticMediaWikiGeocoder could easily be used for any Semantic MediaWiki with any set of relationships defined. This class is brand new and has only been tested on the Case Wiki, so it might be buggy.
MediaWikiGeocoder relies on BeautifulSoup since it assumes wiki pages can be malformed.
Remember, if you have easy_install, you can simply type this to install CaseClasses:
sudo easy_install http://opensource.case.edu/svn/CaseClasses/python/trunk
After this geocoding toolbox is complete, I'll see if I can make that Case geocoder web service we talked about on the forum.
Update: Per Greg's comment, support for reading coordinates from semantic attributes is now in Case.Geocode (along with some other small improvements).
CaseWikiGeocoder is now defined as:
class CaseWikiGeocoder(SemanticMediaWikiGeocoder):
def __init__(self):
base = super(CaseWikiGeocoder, self)
base.__init__("http://wiki.case.edu/%s",
attributes=['Geographical coordinate'],
relations=['Located in'], prefer_semantic=True)
JavaScript Contour Mapping: Still Seems Feasible
A few posts ago I talked about generic contour plotting for Google Maps using only JavaScript.
After some more work and research into contouring algorithms, it still seems like a reasonable idea.
Instead of gridding the irregularly spaced points (crime locations in my case) I learned that many contouring algorithms instead use Delaunay triangulation. So I implemented incremental triangulation in JavaScript and the result seems fast enough (note: still needs Firefox or Adobe SVG). On my (older) computer it gets unreasonably slow around a few hundred points, but this is more than I expect to render at a time anyway (consider that the Crimes in 2006 page only has about 60 unique locations).
The next step is to modify the algorithm to use Constrained Delaunay or Ruppert's Refined Delaunay.
After that, it's just a matter of interpolating the crime density (so that levels in between high and low densities are created if they don't exist), smoothing the polygons, and drawing & coloring them with SVG.
I checked out a nice collection of research papers from KSL called Geometric Modelling which includes some contouring discussion. I plan to make another visit to the library this week to see if I can find anything helpful about the steps above. Googling for this stuff has been extremely helpful, but no single page has very comprehensive information, I just have to piece it together.
On the Case Forum Greg mentioned his idea for a Case geocoding service that would aggregate information from geocoding services such as Google Maps and the Case Wiki so that any Case web service could easily geocode local places (Andrew mentioned geocoding photos, for example). I think the crime log location parser fits this bill pretty nicely! The version in trunk has tons of Case-specific location parsing techniques that could save other web services a lot of work. It just needs to be put on a server that will accept strings and send back geocoded results.
More crime & mapping news to come...
Merquery Summer of Code Results
This morning I commited a working version of Merquery to the Django Subversion repository.
svn co http://code.djangoproject.com/svn/django/branches/search-api/
My code in particular lives in branches/search-api/django/contrib/search.
The Lucene adapter is fully functional (but needs some more convenience functions), and the Xapian and Hyper Estraier adapters need a little more work.
Here's an example of using the Lucene adapter, using a similar model to my old Merquery post:
from django.db import models
from django.contrib.search.backends import LuceneIndexer
class Person(models.Model):
first_name = models.CharField(maxlength=30)
last_name = models.CharField(maxlength=30)
biography = models.TextField()
indexer = LuceneIndexer('/tmp/person-index', Person,
fields=['Person.biography'],
attributes={'first': 'Person.first_name',
'last': 'Person.last_name'})
As you can see, you specify an index location (database locations should be supported in the future), which model should be considered the document (hit results), and which fields to index.
It also allows shorthand for the fields:
indexer = LuceneIndexer('/tmp/person-index', Person, 'Person.biography',
first='Person.first_name', last='Person.last_name')
Okay, let's insert some people...
b = Person(first_name='Brian', last_name='Beck', biography='Python advocate')
g = Person(first_name='Guido', last_name='van Rossum', biography='Python creator')
s = Person(first_name='Spiros', last_name='Eliopoulos', biography='Loves Haskell')
And force all the Person objects to be indexed...
indexer.update()
You can also send update a list of Person objects to update (beware of Lucene's update inserting duplicates for now).
Finally, Lucene's sweet query syntax is available for your database:
>>> for hit in indexer.search('python'):
... print hit, hit.instance
<LuceneHit: merquery.person 1, Score: 0.625> <Person: Brian Beck>
<LuceneHit: merquery.person 2, Score: 0.625> <Person: Guido van Rossum>
>>> for hit in indexer.search('python creator'):
... print hit, hit.instance
<LuceneHit: merquery.person 2, Score: 1.0> <Person: Guido van Rossum>
<LuceneHit: merquery.person 1, Score: 0.168048456311> <Person: Brian Beck>
>>> for hit in indexer.search('last:Beck OR first:Spiros'):
... print hit, hit.instance
<LuceneHit: merquery.person 1, Score: 0.496906995773> <Person: Brian Beck>
<LuceneHit: merquery.person 3, Score: 0.496906995773> <Person: Spiros Eliopoulos>
There are some things that should be changed before Merquery is production-ready, and some things that would be nice in the long-term.
One is that the indexer knows how to follow ForeignKey fields (maybe 'Person.address.street_name', for example), but there needs to be a couple changes to get them to work—this is a result of a Field instance seemingly not having an attribute linking back to the Model it is bound to. Support for ManyToMany joins also needs to be thought out.
Another is the way you pass Model instances to specify as the document return type (the type returned by hit.instance). Documents should be able to aggregate many Model instances and not have to consider any particular Model the document type. So a Document class will probably be introduced that acts like the Model metaclass, letting you 'build' a Document prototype and telling the indexer how all of its attributes should be retrieved and treated.
Automatically knowing when to update the index would also be very nice.
There are a few main long-term goals...
One is to make a universal Merquery query language that will automatically translate queries into the backend's query syntax. This would especially good for Hyper Estraier, which has whacked-out attribute-search syntax:
@first STRINC Brian
Another is to make some Models to keep track of indexing status and query statistics, and offer nice admin views of these.
Finally, storing all index data in a database (especially a Model-compatible one) instead of on the filesystem would be great.
Since I have a week before classes start, I'm going to continue making the necessary changes to make Merquery production-ready.
Functional Programming in JavaScript: Nice But Slow?
While making those two JavaScript contour prototypes I posted yesterday, I had the chance to try out the nice functional programming tools offered by MochiKit. I love MochiKit—mostly because it's highly influenced by Python.
However, since the performance of the contour mapping code is extremely important to its feasibility, I did some profiling and was a bit surprised.
Consider this code to test membership in an array of points:
var containsPoint = function (container, element) {
for (var i = 0; i < container.length; i++) {
if (pointsEqual(container[i], element)) {
return true;
}
}
return false;
}
Boring, right? Now the functional equivalent:
return some(container, partial(pointsEqual, element));
some and partial come from MochiKit.Base and MochiKit.Iter, respectively.
Sadly, I profiled the functional approach at being 27 times slower than the verbose version.
Similarly, this code:
var neighbors = new Array();
for (var i = 0; i < points.length; i++) {
var p = points[i];
if (is_neighbor(scale, prev, p)) {
neighbors.push(p);
}
}
...is 3 times faster than this code:
var neighbors = filter(partial(is_neighbor, scale, prev), points);
I assume this has something to do with the functional approach being run as "pure JavaScript" while the verbose version is optimized by the implementation—like pure-Python vs. Python written to take advantage of its C implementation (the "pure" version being slower).
I think a lot of people don't profile their JavaScript code, so I'm just throwing this out there. Unless someone can point out a solution to the dramatic speed difference, I will sadly be avoiding a lot of the functional tools offered by MochiKit.
I'll be spending the next few days moving into my Village apartment.
Tomorrow: The final day of Summer of Code, I'll talk about my results.
Generic Contour Plotting for Google Maps
The other day Mike, Spiros, and I began thinking about generating SVG contour maps with JavaScript. The reason is that I want to be able to show a contour map of crime density (like this one of property crime) on the Campus Crime Mapper.
I think a way to generate generic contour maps for Google Maps in JavaScript would be extremely popular with Google Maps developers. It could be used to map temperatures, population density, precipitation, wifi coverage, and lots of other things people are using Google Maps for.
There are only a couple sites out there that have overlayed contour maps on their Google map. The best one seems to be EarthTools, which uses it to display a topographic contour map of the area. I can't tell if it uses static image tiles, SVG, or Canvas; maybe someone else can figure it out.
You're probably wondering why I want to do it all in JavaScript and SVG as opposed to rendering static tiles on the server and overlaying them. There are a few reasons:
- Using JavaScript will make it available to any Google Maps developer, regardless of their server-side tools.
- The contour shapes will only need to be calculated once for all zoom levels, since they are infinitely scalable.
- I am almost positive that the time it takes for the client to calculate and render the SVG contours will be about the same amount of time it would take to download all the custom tiles anyway (especially if they will be zooming and scrolling — hint: they will be).
- Theoretically it will allow the user to change the granularity, colorization, and other aspects of the map.
I'm not looking to generate anything insanely detailed, just something at about this detail level.
I have some proof-of-concept code online that tests finding shape contours and smoothing them with Bezier curves. You will need either Firefox 1.5 or a browser with Adove SVG enabled to view those demos (I'm considering also supporting Canvas).
Note: Those pages (and the Crime Mapper) will be down while I move into the Village.
Speaking of SVG and Canvas, I'm going to use this sweet JavaScript plotting library called plotkit to show crime statistics — check out the demo on that page!
I'm excited to see how far I can take this, but right now I've got two days left in the Summer of Code.
Saturday Mapping Thrills
Today I decided that I needed to be punished and proceeded to create and geocode 58 parking lots on the Case Wiki.
This will slightly help the campus crime mapper since some reports are given lot numbers as their location. You probably noticed that the crime mapper got a bit fancier. I made icons for a few categories of violations, and took care of multiple incidents being located in one spot. It's also running with FastCGI now so maybe it's faster?
One pattern I've already started to notice is that some places are hotspots for a certain type of violation. Check out the Crimes in 2006 page and you can tell that it's a bad idea to park your bike near Fribley, and to park your car on Carlton Road. (Also on that page, zoom way out for a surprise!)
Well, I'm all geocoded out after hunting down parking lots for the past couple hours. Suggestions for the crime mapper are welcome.
P.S. The code is now on opensource.case.edu and there's a to-do list on the wiki. Feel free to browse around or contribute!
Update: Hey all you geocoders and wiki mappers, check out this awesome resource I just found. It's a build identification glossary with location names, aliases, and addresses. This is exactly what the crime mapper needs...
Saturday Mapping Thrills
Today I decided that I needed to be punished and proceeded to create and geocode 58 parking lots on the Case Wiki.
This will slightly help the campus crime mapper since some reports are given lot numbers as their location. You probably noticed that the crime mapper got a bit fancier. I made icons for a few categories of violations, and took care of multiple incidents being located in one spot. It's also running with FastCGI now so maybe it's faster?
One pattern I've already started to notice is that some places are hotspots for a certain type of violation. Check out the Crimes in 2006 page and you can tell that it's a bad idea to park your bike near Fribley, and to park your car on Carlton Road. (Also on that page, zoom way out for a surprise!)
Well, I'm all geocoded out after hunting down parking lots for the past couple hours. Suggestions for the crime mapper are welcome.
P.S. The code is now on opensource.case.edu and there's a to-do list on the wiki. Feel free to browse around or contribute!
Update: Hey all you geocoders and wiki mappers, check out this awesome resource I just found. It's a build identification glossary with location names, aliases, and addresses. This is exactly what the crime mapper needs...
Mapping Crimes on Campus
So a couple entries ago I mentioned that someone should make something like chicagocrime.org for Case. This is possible since we have a daily crime log which lists the location of each incident, and an excellent wiki with many geocoded locations.
I was bored a few nights ago and decided to try it out. In the interest of releasing early and often, I've made my first test available online at exogen.case.edu/crime/recent.
As you can see, it's nothing too fancy yet. This is my first non-trivial Django application, and I plan on using it as a testbed for Merquery.
So here's my to-do list (when I'm not working on Merquery, of course):
- Draw new markers.
- What should happen when two markers are placed on the same location, which is likely? Show both in the info window? Change the color of the marker? Try to put them next to each other?
- Offer views by type of violation.
- Activate the better location parsing (right now only the wiki data is checked, but the Google geocoding API helps with addresses). Already written, just needs to be put into production...
- Geocode all the locations that couldn't be found.
- Offer an RSS feed of recent crimes.
- Import all incidents back to 2000 (for fun).
- Use all the imported data for crime statistics per area.
- Put code on opensource.case.edu.
Now to watch for that bicycle thief...
Update: Problem seems to be fixed, let me know if it's broken for you.
Update: Put the new location parser into production, and it now has a 66% success rate for geocoding locations. More wiki entries would help that number... anything to avoid doing them manually.
What You Need, When You Need It
So, it's midnight and I'm sitting in the basement of Olin. I'm looking up schemes to help The Segelmeister's wireless connectivity, which has been less than optimal for the past week. I settle on these plans for a coffee can wireless antenna. I spin around in my chair, ready to collect some parts. I look down at my feet. About one foot away, sitting in a box, is a completely assembled coffee cantenna, including enough spare parts for another.
Where else could this happen but in Case's very own Thunder Labs a.k.a. Project Club Lab? I've mentioned Thunder Labs and Project Club several times on this blog, but I've never really explained it to the uninitiated. The wiki page does it no justice. Project Club is where hackers, builders, and inventors can go on campus to tinker!
Located in Olin 101, we've got enough toys to keep the creative and technically minded busy during their time at Case. We have oscilloscopes, soldering irons, motors, wheels, batteries, tons of tools and materials, workstations, servers, computer parts, a projector, a laminator, secure storage lockers, engineering & programming textbooks, a silkscreener, and more. Did I mention we have pirate costumes? Top shelf, in the Dell box.
Want to run a live web cam through some awesome Linux software for your crazy art show project? (Ian?)
Want to use a terminal for the awesome Linux setup in the Jennings lab, except on a couch, with a nice stereo, and even a nice paint job?
Want to build a robot? A tesla coil? An electric bike?
We want to do all that stuff, too. So come give us a hand in Olin 101!
Now, back to Sara's wireless problem...
More on Merquery
A bunch of traffic has been directed to this blog due to the post about Merquery. Seems like there has been discussion going on in that posts's comments and also over at Jacob Kaplan-Moss' post.
As predicted, a lot of people are only seeing the "reinvention" aspect of Merquery. And admittedly, there is a lot that would need to be reinvented based on the goals I wrote down.
The real novel part about Merquery is that it's easy to drop into a Python web application. Imagine if you're going through a TurboGears tutorial and all you have to do is add one line to add full-text indexing and search to your database tables. Cool!
So here's the reformulated plan. Write adapters for the nice-looking Python indexing engines mentioned so far, such as PyLucene, Hype, and Xapwrap. Make using any of them look the same (so they're easy to swap in and out), and make it a one-liner for the most basic indexing setup desirable. Then, add a pure-Python indexer to the package as a side project, for those people who don't want dependencies. (All three of those existing libraries mentioned above still require the library they wrap to be installed.)
Unlike the current interfaces for those indexing libraries, these adapters don't have to be completely general (yet). If they only provide adapters for SQLObject classes and the Django database API, that's already a great accomplishment, even though these adapters are less flexible than the generic interfaces already provided. This will allow Django and TurboGears developers to stick with what they know rather than worry about getting an indexer working with their underlying database. (Hey, we've got to start somewhere, might as well have mass appeal right from the get-go.)
Here's an idea of what some customization of a developer's search engine might look like:
class Person(SQLObject):
firstName = StringCol(notNone=True)
lastName = StringCol(notNone=True)
nameSearch = Merquery.LuceneIndex(first=Person.firstName,
last=Person.lastName)
(I have no idea why that space is there. Sorry for my blog being so ugly.)
In this example, the developer has customized the index by giving Person.firstName strings the field name 'first' and Person.lastName strings the field name 'last'. So to find people with 'Beck' in their name but not 'Brian Beck', this would work:
beck -first:brian
Developers could just pass query strings like the above directly from their forms to the index:
results = nameSearch.query("beck -first:brian")
Since LuceneIndex knows we passed in SQLObject columns, it will know to return results as a ranked list of SQLObject instances.
results[0].firstName, results[0].lastName
Obviously this example might not be very realistic since firstName and lastName are just strings and we could accomplish this with SQL. But the same ideas apply for fields storing big documents, etc., where things like term frequency and proximity become important.
Thoughts?
Update: I made a Merquery Google Group so discussion can now happen in a centralized place. I was also kind enough to make the first typo on there.
Merquery, Text Indexing and Search with a Focus
If you happened to catch my RSS feed at some terrible hour of the morning a couple nights ago, you may have read a little about Merquery.

Merquery is the reason I want to parse nice search query syntax like Google's (or more generally, Lucene's). It's a full-text indexer and search engine. Not the crawly, home-page kind like Google, but the deployable kind like Lucene.
Now I know what you're thinking. You're coming up with all these nerd-slang fake syndromes with which to diagnose me. You're wildly gesticulating while going on about how a bunch of people smarter than me have done this before and published their results.
Let me say that I think Lucene is one of the best open source offerings out there. If you're serious about deploying search locally, you should use Lucene. Heck, you can even use Python to do that.
Problem is, there are a ton of people out there who aren't that serious about search. Or more appropriately, don't need to be that serious about search. I've been investigating Lucene for the past week, and my impressions are that it's:
- huge and complex
- overkill for Joe Programmer's First TurboGears App
- hard to deploy for my app in less than 5 minutes
Obviously, Lucene was never intended for use with that second point. But there are a lot of developers these days using small-but-scalable web frameworks that could use a lightweight search engine that's easy to plug into their apps. I'm talking, like, just pass it a list of SQLObject columns and it will do the rest. Giving your SQLObject tables a search method is easy, but giving them a fully-featured query syntax and ranking the results is a little more work. Sure, we can count on Google to index our pages for us, but they can't search our databases, which is also a huge need.
So my goal is for Merquery to focus on being super-easy to deploy for those web developers using things like Django, TurboGears, Pylons, etc. I already know what Python database wrappers they're likely to be using, so I can make Merquery cater directly to their needs.
With that in mind, here are some design decisions and other things I've been thinking about:
- Full Lucene query syntax would be nice, but hasn't Google's small subset of it proven to be good enough?
- I like being able to search for stop words. Google used to have stop words, but now you can search for anything you want.
- Python has lots of nice ways to process large datasets without having everything in memory at once; I plan to use these methods extensively.
- So far I think pyparsing will be the only dependency. There's a lot of cool stuff already in the standard library that's well-suited to building a search engine. And hey, Paul says search query parsing will come with the next release of pyparsing.
- Some motivating phrases: lighweight, fast, easily deployable, easily extendable
This article (with code) about making something similar in Python is a good read. In fact I've found a few articles like that one, but they're all from at least a few years ago. I got the impression that no one has approached this problem from the same perspective of focusing on the tools today's Python web developers are using (SQLObject, TurboGears, and such). But if you know otherwise, let me know!
Google Search Syntax BNF
A while ago I was making a specialized search engine of sorts and I wanted to use Google's nice query syntax with my own custom modifiers instead of things like inurl:, intitle:, site:, etc. Besides these powerful modifiers, Google's search syntax is nice because no query is invalid. Yahoo, MSN, and Amazon also at least support more than just basic search expressions.
Sometimes I wish other sites (like reddit) would implement this syntax for their search queries. So tomorrow I'll release a little Python module that parses this query syntax and makes the query easy to read and process. I wrote code that did this successfully a year or two ago, but I'd like to rewrite it now with pyparsing or something. Now there will be no excuse to offer only lame search queries.
To get the ball rolling, here's the BNF for the syntax I will implement. I have no idea if this is even a proper way to express BNF, but I used the Python grammar as a reference (because if it's good enough for Guido, it's good enough for me).
L ::= expr | expr L
expr ::= term | binary_expr
binary_expr ::= term " " binary_op " " term
binary_op ::= "*" | "OR" | "AND"
include_bool ::= "+" | "-"
term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
modifier ::= (letter | "_")+
literal ::= word | quoted_words
quoted_words ::= '"' word (" " word)* '"'
word ::= (letter | digit | "_")+
number ::= digit+
range ::= number (".." | "...") number
letter ::= "A"..."Z" | "a"..."z"
digit ::= "0"..."9"
Contrary to what many people believe, you can NOT use parentheses for grouping or precedence in Google search queries. Every punctuation character except for [+-_".] is converted to a space and becomes meaningless.
Sprinting messes with your mind
What Mike and I accomplished during Spring break can only be described as a sprint. We hammered out an initial release of a decent contribution to the Python community in two days. In fact, we started preparing for the next release by rewriting everything from scratch to more easily support new collaborative filtering models and make our current models more accurate. There's another project we started that I'll put online in a week or so.
Our long and focused coding sessions resulted in what I have only ever experienced after playing a single video game for prolonged hours. I fall asleep thinking about code and math; sleep is interrupted by thoughts about code and math; the day begins (often in the late afternoon) with thoughts about code and math. It takes a while to wear off.
More specifically, here are some thoughts that keep popping up since we first released consensus:
- Why don't any of the models presented in research papers work as presented? We've had to modify almost all of them.
- Why is our refactored code an order of magnitude slower than our initial naive code?
- Why did using numarray for churning numbers on our dataset make things slower, not faster?
- If AudioScrobbler's dataset is so large, why are their all-time top recommendations for me so unstable?
Anyway, I hope that posting these thoughts will get them off my mind, because I'm ready to think about other things for a while.
consensus: Collaborative filtering in Python
Mike is visiting for Spring break, so naturally we've spent the whole time coding. We've started a few projects, and we've just made the first of them available.
consensus is a collaborative filtering library for Python. It implements three different filtering models, and makes it easy to implement more. We're no experts on the subject, but the results seem pretty good.
As an example, we wrote a little script to harvest data from thousands of AudioScrobbler users, starting with myself. All three models yielded similar results with lots of overlap. Based on my AudioScrobbler profile (which hasn't been updated in at least a year), our models suggested these additional artists:
- Sufjan Stevens
- Elliott Smith
- Broken Social Scene
- Belle and Sebastian
- The Beatles
- Interpol
- Wilco
- The Decemberists
- Bright Eyes
- Beck
- Death Cab for Cutie
Not bad! There is some overlap with AudioScrobbler's suggestions, so we must be doing something right. We're not sure what fancy algorithms they're using, but based on which suggestions we determined I would actually like, our models had a higher success rate. But who knows, if we had a larger data set, maybe they'd be the same...
Anyway, if you'd like to install the latest release of consensus, just use setuptools:
sudo easy_install consensus
Update: Both consensus and easyBay are now on the Cheese Shop.
Knoware Client Update
It's getting pretty late into the summer now, and I predict that the client portion of Knoware will be done by the end of the week. So, outlook good.
Despite my prior C++ coding projects, my course experience here at Case, and even having taught it for a class, I was dreading getting back into it after having been dazzled by Python for the past year. God damn it, I thought, I'm gonna have to find all these libraries for crap like XML parsing, gonna have to re-learn sockets for like the third time, etc. Of course, no one said I had to use C++ — Python even has bindings for both the Qt and KDE libraries. But, I thought, I owe it to the users to slave over a damn solid C++ program.
The only thing I was certain of was using Qt for the GUI, obviously, just like the rest of KDE. And that's when I started getting more surprised every day, even now. What? Qt does XML parsing? It does sockets? It does threads? It can communicate with external processes? I don't even have to use the classic C++ strings, because it does that, too, and like I'd expect? Qt has definitely made getting back into C++ a pleasure. Thanks, Trolltech.
Remember those old Knoware screenshots? I prettied up a few areas with the help of KHTML, finally bringing them to life. Take a look:
If I don't write a post here about statistical results by this time next week, I'm in trouble.
Knoware, Statistics & Orange
The lack of recent Knoware updates on this blog is a good indication of what the statistical aspects of this project mean for a programmer like myself — lots of time reading and researching up front. Aside from things like database design, I've put very little actual code into Knoware for the past week due the sheer amount of knowledge required to understand (let alone successfully apply) Random Forests and Classification & Regression Trees. This particular approach also includes learning a bit of R.
However, I just came across Orange for the second time and actually read the feature list this time. Wow! Could I have asked for a better module? Their Orange For Beginners guide includes examples of classification, bagging, regression, and other processes that make up the bulk of Random Forests. I think this could definitely speed up the process of getting a working system up before September.
Since I'm talking about the pieces I'm using to construct Knoware, I guess I'll mention that I'm working with SQLite right now for the database backend. I've used it for a few projects so far and I've become quite a fan. MySQL and friends are just too much sometimes — users, permissions, configuration, bah!
The Return of CASPER
This post is not related to Knoware, but an old half-finished project called CASPER.
CASPER is a recursive acronym standing for Casper Artist-Sorted Playlist Entropy Reducer. The idea relies on the fact that there are two extremes dominating the way people can sort a playlist: alphabetically or completely random. Since the former would be a strange way to listen to music, most people go with the shuffled approach. In my experience, the reasoning behind listening to a shuffled playlist is to promote variety. But while complete chaos might increase variety, it certainly does not maximize it. Some people notice that their media players seem to have an affinity for certain artists or songs. So I came up with a method to increase the variety of any given attribute in a playlist.
The method places songs throughout the playlist such that items with the same key (for example, songs by the same artist) occur as far apart as possible, and it does this for every song in the list. For example, if the playlist contains 3 songs by X and 3 songs by Y, one result might be X2-Y3-X1-Y2-X3-Y1, where the order of the songs by X and Y are random (so there are lots of possibilities for even such simple cases). With large playlists, this careful song placement is much less apparent, so it appears random. The total entropy of the shuffled sequence is decreased, but the variety is increased.
I've decided to release some simple, documented Python code demonstrating the algorithm used for CASPER. This should make it easy for anyone to implement this shuffling method into their media player if they so desire. If your media player can easily make use of Python, it would probably take less than a minute to do. Check out casper.py if you're curious. Running the script as-is will produce a simple example. Requires Python 2.4 or higher, but is easy to port.
The full source code of the linked file is in my extended entry if you're really lazy, but it's probably formatted poorly and won't have syntax highlighting.
Continue reading "The Return of CASPER"
Drafting the Knoware Interface
As suggested by my KDE mentor, one of my first tasks is to try designing an attractive interface for Knoware. I've been doing mostly web programming for the past year or two, so getting back to using a GUI toolkit has been fun, and working with Qt Designer is a pleasure.
As mentioned in a previous post, the Knoware client will be used to perform three main tasks:
- Detect and display the user's system configuration,
- Browse and report problems the user is experiencing, and
- Collaborate with other users on possible solutions.
Here's what I've come up with so far (click to enlarge):
An announcements panel, the first thing the user sees when opening Knoware. Much more information will be displayed than is currently shown. It will be styled with CSS, so it should look very nice (think of amaroK's sidebar).
A system configuration panel. Self-explanatory. I may try to make the visualization of the user's system a bit more interesting and easier to browse.
A panel for searching and browsing problems. The search results should be updated as the user types. Problem details, instructions to reproduce the problem, and subscribed users will all be shown here. This is also where the statistics Knoware has collected should be displayed — more on how I plan to visualize this later.
A discussion panel. I do have more things in mind for collaborative problem-solving, but this is the most important and flexible, I think. Discussion 'rooms' may be thought of as one-per-problem (like comments on a Bugzilla page). But an interesting feature is that they are both live (like a chat room) and persistent (like a forum). Right now the two forms of communication people turn to for fixing their problems are forums and IRC — this unites the strong points of both. Since live chat may generate a long discussion history, users can flag individual messages to indicate importance. The flag and recycle-bin icons next to the search field are toggle buttons which show only flagged messages and show junk messages, respectively.
Besides putting all this functionality in one place and including a few niceties, there isn't much new here. The novel part about Knoware is, after all, on the server-side: the use of statistical methods to identify patterns. The discovered patterns should be visible to users, since they are the people who have the ability to use the information and apply it. So how can I best visualize this information?
One way might be to think of any general system configuration as a tree (kinda like in the System panel above). There are always certain components in a configuration, and various areas that differ on each system. To visualize what Knoware has discovered, then, might include making such a tree in which only the likely predictors are shown. The more confident Knoware is about a predictor, the bigger or bolder it could show it, or maybe all the other components are shown but grayed out — the important thing is that the likely predictors of the problem stand out.
Another way might be to show a list much like Apple's Spotlight results. Instead of object categories and files, results would be arranged into component categories and predictors, along with some statistical information.
Some mock-ups would convey these ideas much more easily. Maybe I'll post some later.
Knoware, Part II
In my last entry, Andi Roedl helpfully pointed out LSHW, a hardware-listing tool for Linux. I gave it a try (it's even on Portage), and it works great! From what I can tell, it even retrieves information such as clock speeds, so Knoware may have the ability to identify overclocking as a problem source. I contacted Lyonel Vincent, the author of LSHW, and it looks like this is the best answer to the hardware detection part of Knoware so far.
The other piece of the puzzle is installed package detection. This could end up being even more important than the hardware portion, since issues like library version compatibility seem to cause just as much distress. I haven't searched very hard for something like this yet — any ideas? Chris pointed out that I may have to find per-distribution methods for this to work reliably, which is what I was thinking. Gentoo, for example, uses equery and qpkg.
In the statistics department, my friend Erin and her husband Ryan have been offering their educated brains for the picking. Both suggested that Bayesian networks (as I had in mind) might not provide the best results for Knoware. Instead they pointed me at Random Forests (more info here). Erin explains:
The idea is you can't make any good statistical inference from one decision tree. Random Forests works by splitting the data using majority vote decision tree processes. It does that lots of times, thousands if you want, and it gets more precise the more it does that. Then it tells you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. Also, it is a means to evaluate groups, so you can see if a certain kind of problem clumps with other factors.
The source code for Random Forests is available, but it's in Fortran and I'm unsure of licensing issues. Word on the street is there's source in another language out there somewhere. Otherwise I could just do a bit of research and try to code it from scratch. This kind of process seems like a prime candidate for rapid prototyping, so I'm planning to use Python for it. Hopefully it will provide a nice complement to the Python Bayesian Network Toolbox being developed by Elliot Cohen.
Mark Dickie recently commented on my previous entry, saying:
I believe that Dell have something like this although I think it's nothing much more than an update/security announcer tailored to your hardware and software. Might be worth a look.
This raises an interesting point that I have not yet made explicit. I always imagined that utilities like the Microsoft crash reporter (and the Dell utility Mark speaks of, which is news to me) aim to accomplish something exactly like Knoware, but behind closed doors. A big distinction here is that Knoware puts actually fixing problems into the user's hands, and more importantly into the community's hands.
But there is also a simple yet key feature that Mark brings up, and that is providing a distribution channel for patches and other announcements. Notifying users of solutions once they are found is what will eventually cause more users to participate in Knoware. Even if a user has not taken the time to subscribe to a specific problem, Knoware will be able to determine that the user probably does experience the problem based on their system configuration.
Knoware, Part I
In my last post I mentioned that I made my decision to continue with my KDE project, Knoware, for the Summer of Code. For those who missed the full description (read it if you get confused), Knoware is a program that will use Bayesian networks to find patterns between Linux system configurations and bug report subscriptions. In this entry I'll provide a little background as well as some more details, plans, and open issues. Since the project has barely started yet, I'm open to input from any direction.
First, some background. I actually envisioned Knoware as a Windows program in high school more than five years ago. At the time I didn't even know Bayes nets existed, just that computers must have the processing power to analyze the thousands of statistics and arrive at a solution. I pictured the interface much like a chat program, since collaboratively working through problem solutions will play a prominent role in the program's success. I even came up with a business plan for it: make it free for users, but provide custom statistics reports to hardware and software companies at a cost (remember, Knoware will know which products are conflicting with one another, the distribution of models and driver versions, and such). I look back on this now and still see it as plausible, but I'm very glad to be developing for a platform I use and enjoy now. More on other aspects of my original vision, like the interface, shortly.
You may have realized that the mention of Windows brings up an interesting point: conceptually, Knoware has very little to do with KDE or even the Linux platform. It could easily be made to work with Windows or Mac OS, and focus on their respective bugs. If it turns out to work as well as expected, you can be sure versions will pop up on other platforms.
The current limitation to Linux and KDE may actually be a benefit in finding accurate statistics. Sure, simultaneously opening up the program to Windows and Mac users would result in a larger database. The problem is that the bug reports and system configurations would be much less focused. The extra input does no good if it's for a completely different problem and coming from a completely different system. If all input configurations have Linux and KDE in common, this will help the users focus on more of the same problems instead of completely unrelated ones.
The issue has been raised that it may be hard to gain enough users initially, making it difficult to draw useful conclusions. I'm pretty confident when it comes to this issue. Take for example the site KDE-Forum.org. This forum has more than 8,000 registered users, and this is surely a small fraction of KDE users — the English-speaking vocal minority. If just 2.5% of that vocal minority uses Knoware to report a specific problem, that's 200 configurations to analyze. I'm no expert on Bayes nets (yet), but to me that sounds like a fairly significant data set. And in reality, of course, I picture many more users giving Knoware a try.
One aspect that will help draw in participants is the user interface. My mentor for this project told me to concentrate on this area first, since it will ultimately be the first thing to get a user's attention. In my proposal, I scheduled the user interface toward the end of the summer, so already I've learned something new (about KDE development, at least). Earlier I mentioned that in the past I pictured the user interface much like a chat program. This has changed significantly in my mind, since collaborating with other participants is something the user will only be doing some of the time. The two other main tasks will be scanning — that is, gathering information about the user's system — and browsing for relevant bugs.
Scanning the system is an automated process, but requires the user to specify the level of granularity they will provide, since there is a chance that this information will be made available to other users. It's also possible that the user will need to input some information manually, in case the scanner can't determine an important aspect of the user's system.
Browsing for bugs will hopefully be made easy through searching, live filtering, tagging, and other new trends in navigation. Since the detailed bug reports will need to be created at some point, this process is not just read-only. The interface should therefore assist the user in proper naming and categorization of the problem.
Lastly, I'll discuss possible integration with existing projects. As far as I know, there's nothing out there right now (in the public, at least) which hopes to accomplish the same thing as Knoware. However, there are a few projects which can assist in its goal. KDE has an existing bug tracking system using Bugzilla, and a graphical front-end for it called KBugBuster. While I hope to use more meta-data to make bug browsing easier, there's no sense in completely reinventing the wheel, so this is a likely project I will turn to for integration. Additionally, KDE has a crash handler that tries to make crashes a little more friendly. While crashes are only a tiny fraction of the problems I encounter, I won't rule out the possibility of submitting crash reports to Knoware through this program (much like the Windows crash handler).
Hopefully that puts to rest any questions or doubts out there. As I said, input is welcome during any phase of development.
Summer of Code: Pre-Project Developments
Since the spring semester ended and my number of commitments promptly fell close to zero, my e-mail inbox has been mostly empty. But over the past few days, it's been a challenge to keep up with the swarm of traffic going on due to Google's Summer of Code program. This is a dream come true for people like me who check their e-mail every 30 seconds.
In a previous posting I mentioned that I was leaning towards my KDE proposal as my selected Summer of Code project. At the time of writing, I had already been in contact with the KDE folks for a couple days, and hadn't received a single message from the Python guys. The selection process inevitably requires crushing some dreams, and the lack of contact with the nice folks at the PSF facilitated my early decision.
What a surprise it was, then, when I received a message from one of the KDE mentors telling me that they had already been exchanging messages with the Python guys, who really wanted to see the proposal brought to life. Google had apparently instructed the organizations to 'fight' among themselves for duplicate acceptances. After finally receiving messages from Ian Bicking and David Ascher (this is not to their discredit — the KDE guys were just fast), I wasn't so sure which project to select anymore.
On one hand, I felt a little selfish selecting my own project over the Python one simply for the reason that it excited me more, even with more risk involved. On the other hand, I think a lot of other people are qualified to work on the Python project, and I also think Knoware will directly benefit more people in the end. Since my decision was not swayed, I hope it doesn't look like I've just been stubborn all along — I looked very closely at the goals laid out in my project schedule, and I see no reason not to continue. Furthermore, I plan to use Python for a significant part of my KDE project, and it directly lines up with another proposal accepted by the PSF. So there's something in it for the Python guys as well. Who'd have thought that after all this time, I'd have to turn down the organization and community that has been the biggest part of my life as a programmer for the past year?
Patty brought up a good point last night. After about a month of being unemployed and exploring possible ways to make money, it's funny that now I'm stressing out over my options of doing something completely awesome and doing something just slightly less awesome. Funny how things change...
I'm sending my final decision out to the mentoring organizations shortly. This means that the PSF will get to choose another proposal, which must be exciting for them as well as one lucky participant.
Summer of Code Projects
As I mentioned in my last posting, I have the option of participating in the development of either of two Summer of Code proposals. Before I reveal the projects, it might help to put them into context with some information about the mentoring organizations: KDE and the Python Software Foundation. KDE is a desktop environment for Linux with a very complete software suite. I've tried out nearly two dozen window managers and desktop environments, and KDE definitely has something special going on. (Okay, for my brief GNOME-bashing on Alex's blog, I must say that I've learned to appreciate GNOME as well. I'm actually using it right now to write this up.) Python is my favorite programming language — what more can I say? I'm extremely productive with it.
On to the projects — first, the KDE proposal. On the KDE bounties page there's an entry for a "visionary application addition," awarded to innovative proposals — this was the bounty I was selected for. The idea is for a program called Knoware where users voluntarily contribute information about their system's configuration (hardware, installed packages, etc.) and then subscribe to bugs or problems they're experiencing (doesn't have to be a bug — any fixable annoyance). This information is kept on a server, which uses a system called Bayes Nets (short for Bayesian Networks — here are short and long descriptions — basically, inductive logic) to identify patterns between bug subscriptions and system configurations. If enough people participate, Knoware could quickly track down bugs that would otherwise take forever to identify. In my experience, a vast amount of problems in Linux arise from the unlimited number of possible hardware configurations people have. Often people turn to forums for tech support, where users generally don't provide enough information about their problem and others spend forever probing them about their system. There are a lot more details I could go into about this project, but I'll save it for a later posting. For example, the client isn't just a passive application that gathers system information, it also networks you with other users who are subscribed to similar problems, or have similar hardware configurations, so that you can collaboratively work through possible solutions (and log your progress for others to see).
If that description confused you, consider this example: everyone who subscribes to the 'Program Y crashes when I do this' bug has an 'X' brand graphics card. Knoware will recognize this pattern and then examine more details like the card model, specifications, drivers, etc. For instance, it could eventually determine that only 512MB 2nd generation cards with driver version 2.5 cause this crash, and this gives developers a starting point for fixing the problem.
My other submission was picked up from the Python Web Programming Ideas page. I proposed the addition of a comments & annotation system to the Python standard reference, much like what is seen in PHP's documentation. Oddly, I've had multiple conversations with people who agree that the PHP documentation is often made worse by the comments system. My proposal included the use of Ajax to provide abilities like live editing and filtering, a rating or categorization system to prevent abuse, and other additions. Okay, the comments thing is kind of boring, but the annotation part I am excited about — they are two different things. If you take a look at PHP's documentation, you see that all the user contributed notes are at the bottom of the page, like any other comments sytem. Annotation, on the other hand, is meant to sit alongside a specific block of text on the page. For example, many textbooks feature in-margin annotation to identify terms or concepts. Since it can be very easy to disrupt the flow of a page with extraneous text, especially on a web page, I'm looking forward to trying out different solutions such as CSS tooltips, in-margin annotation, and in-place footnote links. Assuming, of course, that I accept this bounty.
Right now I'm leaning heavily towards the Knoware proposal. My reasoning is that it's my own idea and I have a strong interest in seeing it done right and by my own hand. I'll go into more detail about Knoware and other Summer of Code happenings in future postings.
Server Changes
So this weekend I decided to put my money where my mouth is and try out a different HTTP server for my personal site. (Admittedly, my site doesn't come anywhere close to needing the scaling abilities of any decent HTTP server.) I'm trying out lighttpd for now, next on my list are thttpd and Boa. (That reminds me, how could I have forgotten to mention Boa and Jigsaw in those comments on Jeremy's blog?) Anyway, my dislike for Apache has grown over the years, so we'll see if all this talk about alternatives holds any merit.
Much of the content on my site was outdated, so I took this opportunity to start anew. I even made a new homepage design (experimental). The style is pretty much the opposite of what you see here, tiny-and-desaturated versus the big-and-bold design of my blog (I like both).
I've been working on a pretty complex web application for the past few weeks. Writing production-quality code takes a while. Whenever I start a new project, I can make a working prototype in a couple nights (and did so with this one as well), but a few weeks after starting from scratch I'm still not up to the feature set of the prototype. Not that it's a problem. The version I'm working on is extremely polished and flexible. I plan to release it before the end of June.
And no, it's not what you think — that goes for you, too, Ben. That's next in the queue.
