Entries in "Programming"
Multiple sites, one Python: Pagoda import tricks
One of our early goals when designing Pagoda was to allow a single Pagoda instance to support multiple sites. This was due to the way memory works for web servers running on Python and TurboGears. How exactly this adds up depends on your threading and web server configuration (mod_python), but traditionally hosting multiple sites means running at least one Python instance per site, each costing 10-20 MB. The more modules each instance loads, the higher the memory usage, and since Pagoda sites will likely use a bunch of modules, that adds up. The most limiting factor in many hosting services is the amount of memory your account is allowed to consume.
Obviously if each Pagoda site is large and running custom code, it might be a good idea to run each in its own Python instance, so one site can't bring down all the others. But the common case, we think, is a bunch of moderately sized sites using just the built-in page management tools. So we devised some ways to allow multiple sites to run from one TurboGears project...
The first and simplest plan involved a database model, where pages and other table rows point to whichever site they belong to. You probably already know why this is a bad idea. First of all, every single table in the database needed to have a site_id column, since nothing would be shared between sites. Unique things like usernames would need their constraints modified to only be unique per-site. That got old pretty fast. Secondly was security. How could we ensure that every piece of code touching the database, even the eventual third-party plugins, would use the correct site in their queries so as not to mess with the others? And finally, having each site's contents in one massive database would not be very convenient if the site owners wanted backups of their portion of the database.
So we started looking at multi-database solutions, and quickly realized we were pretty much on our own for what we wanted to do. We don't just want some models in one database, and other models in a difference database; we want the same models in every database. Every site needs a pages table, for example. Since we're mapping tables with SQLAlchemy, and each mapper is bound to metadata, an engine, and a session, it seems that we'd need to run the table and mapper definitions once per site; each time, the engine would point to the appropriate site's. And now the big trick: how do we do this without modifying any model code, so that plugin writers don't have to learn any silly new details, and without doing a bunch of extra work every time a controller needs to use a model? If our controllers import pagoda.models.pages, how will it know to get the Page class bound to the current site's engine, and not another site's?
We looked to CherryPy for inspiration. In a TurboGears controller, importing cherrypy.request and cherrypy.response will make the current thread's request and response objects available. How do these objects magically belong to the appropriate thread? They simply use a class called ThreadLocalProxy. As the name suggests, cherrypy.request and cherrypy.response are proxy objects that determine the current thread and point object access to the correct request and response instances. Similarly, we want something like SiteLocalProxy, which will make model classes available that are magically bound to the correct site's engine.
Using ThreadLocalProxy as inspiration, we made a clever little object called site. When anything is imported from pagoda.site, it will rebind turbogears.database.metadata and turbogears.database.session after updating sqlalchemy.dburi in the config to point to the current site's. Then the requested module is imported and cached for next time (so the models aren't reinitialized every time). No model code was changed at all! The only necessary modification was importing from pagoda.site.models instead of pagoda.models in our controllers.
Our first implementation looked very much like ThreadLocalProxy, but it made our import statements look funny since site wasn't a real module. So we started investigating the imp, ihooks, and imputils modules, eventually leading us to PEP 302. With help from Importing (to reduce the amount of code necessary), we now have a special pseudo-module called site, and Pagoda modules imported from that will take the current request's site into account instead of just being imported once for the entire process.
Before writing up this entry, I came across Alchemyware. At first it looked promising for what we want to do, but as far as I can tell it requires modifying the way you write models and reinstantiating them on every request. Also, I don't understand how the mapped class can be "shared by everyone" if it's being mapped to multiple databases.
Anyway, after cleaning up our proof-of-concept I'll share the code behind our import trickery in case anyone is trying to do something similar, but mostly just because such tricks are interesting.
In case you forgot, we missed the end-of-March deadline we set for our demo, due in part to being burned out after PyCon. We're shooting for the end of April now.
CAS 1.0 Authentication for Django, Part 2
After using my Django CAS authentication module for a while, I decided to make a couple improvements.
The biggest improvement is that instead of modifying code in the CAS module itself to set your CAS address and do things like custom User field population, all this stuff can now be configured in your settings file.
Another improvement is that CAS authentication now works for the bundled admin interface. Since the administration interface does not account for an authentication backend that doesn't know the user's password, this makes the login form useless. The CAS module will now intercept requests to the administration interface and do the proper authentication routine if necessary, never showing the login form (which doesn't make sense for CAS). Intercepting requests, you ask? Yes, that means the CAS module is now middleware. Actually it's middleware, a couple views, and an authentication backend.
So here's how to use it now...
Extract it in django/contrib/. The code will be located at django/contrib/cas/. Is this a valid place to install third-party middleware? It's not really clear. Just do it anyway.
Now add it to the middleware and authentication backends in your settings. Make sure you also have the authentication middleware installed. Here's what mine looks like:
MIDDLEWARE_CLASSES = (
'django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.cas.middleware.CASMiddleware',
'django.middleware.doc.XViewMiddleware',
)
AUTHENTICATION_BACKENDS = (
'django.contrib.cas.backend.CASBackend',
)
You can now configure the CAS module in the same settings file. Here are the possible options, most of which can be safely ignored:
CAS_SERVICE_URL: This is the only setting you must explicitly define. Set it to the base URL of your CAS source.CAS_POPULATE_USER: A callable or the location of a callable. When a user logs in and is missing name and email attributes in the database, this will be called with their User model instance. Default is None (do nothing).CAS_ADMIN_PREFIX: The URL prefix of the Django administration site. If undefined, the CAS middleware will just check the view being rendered to see if it lives indjango.contrib.admin.views. The method is a little evil, but it works.CAS_LOGIN_URL: The URL where you bounddjango.contrib.cas.views.login. If undefined, assume/accounts/login/.CAS_LOGOUT_URL: The URL where you bounddjango.contrib.cas.views.logout. If undefined, assume/accounts/logout/.CAS_REDIRECT_URL: Where to send a user after logging in or out if there is no referrer and nonextpage set. Default is/.CAS_REDIRECT_FIELD_NAME: The name of the GET parameter in which to store the page URL to send the user to after logging in. Default isnext.
Need an example? Here's what my CAS settings look like:
CAS_SERVICE_URL = 'https://login.case.edu/cas/'
CAS_POPULATE_USER = 'present.utils.populate_user'
And the callable that lives at present.utils.populate_user (notice this code lives in my project instead of tinkering with the CAS module) looks like this:
def populate_user(user):
try:
ldap = LDAP()
person = ldap.filter_one_by(uid=user.username)
except:
if not user.email:
user.email = "%s@case.edu" % user.username
else:
# If it succeeds, update their User entry
user.email = person.mail[0]
user.first_name = fix_case(person.givenName[0])
user.last_name = fix_case(person.sn[0])
(LDAP and fix_case also live in my utils module).
Finally, make sure your project knows how to log users in and out by adding these to your URLconf:
(r'^accounts/login/$', 'django.contrib.cas.views.login'),
(r'^accounts/logout/$', 'django.contrib.cas.views.logout'),
Users should now be able to log into your site, and staff into the administration interface, using CAS 1.0.
Simple CAS 1.0 Authentication for Django
Back when I expressed interest in making the web presentation bounty based solely on client-side code, Simon (bounty master and Filer admin) expressed his wish to keep the two services decoupled (so I shouldn't rely on Filer for slideshow storage). While I still want to have a save-to-Filer feature, I decided that I should just go ahead and get the web presentation system up and running before worrying about a client-side-only version. So I started a Django project.
Anyway, the result is that I got CAS 1.0 working alongside the Django authentication system, which means I can take advantage of built-in features like permissions and messages with CAS-authenticated users.
If anyone else is interested in using CAS authentication with Django, you can download the code I'm using. Here's a brief usage guide:
- Set
SERVICE_URLincas/__init__.pyto the location of your CAS service. For example, Case's ishttps://login.case.edu/cas/. - Set
DEFAULT_REDIRECT_URLincas/__init__.py. Normally the user will be sent back to theirHTTP_REFERER(the page that requested login) after authentication. But if the user requests/accounts/login/directly (or there is noHTTP_REFERER), they will be sent toDEFAULT_REDIRECT_URL. - Enable the
loginandlogoutviews by adding these to your URLconf (customize the URLs if you want):(r'^accounts/login/$', 'your_site.cas.views.login'), (r'^accounts/logout/$', 'your_site.cas.views.logout'), - Add the backend in
settings.py:AUTHENTICATION_BACKENDS = ( 'your_site.cas.backends.CASBackend', ) - Make sure at least the following apps are installed:
INSTALLED_APPS = ( 'django.contrib.auth', 'django.contrib.sessions', 'your_site.cas', ) - Finally, if you have a way to populate the user's name and e-mail address fields from their username, put it in
cas/backends.py(see the comments). For example, I have LDAP code there.
P.S.: This just implements the minimum required for CAS authentication. Features like gateway, renew, and proxies are not supported.
An alpha version of the presentation system should be online to play with later this week.
Workshop: Making Databases Fun with Python
Reminder! This is today!
Did you ever notice how writing SQL is not very fun?
This Monday (November 20th) on behalf of Case Project Club, I will be hosting a workshop for those interested in Python and databases. The talk will be at 7:01 PM (sharp) until 8:30 PM in the Olin 303 classroom/computer lab. I'll have Python all set up for everyone to play with and follow along. Pizza and drinks will be provided!
Python is a powerful dynamic programming language suitable for many tasks, including data analysis for research, web programming, and just plain fun. Even if you don't know Python, there won't be any crazy wizardry going on during the worskhop, so you should be able to pick up the basics very quickly.
Some contents of the talk will include:
- Simple data/object persistence, for when SQL is overkill.
- The dbapi, a standardized interface for talking to databases with Python.
- An overview of object-relational mappers that will let you harness the power of relational databases without writing a single line of SQL (and easily swap out SQL backends).
- Construction of a database application during the workshop everyone can play with, made with Django's object-relational mapper (or perhaps SQLAlchemy).
Again, no prior knowledge of Python or any of the related libraries is required.
Hope to see you there!
Automating Case Wiki Tasks
A while ago Chris added a login method to the CAS module in CaseClasses. It returns a mechanize Browser object so that you can programmatically surf the web as if you had logged into CAS in a real web browser.
CaseClasses also has a Codes module that has the abbreviated codes for majors, departments, and buildings. I combined these two features to tackle the Building codes project on the Case Wiki.
P.S.: There is a MediaWiki API that would normally be used to do this kind of stuff, but according to Greg, editing is not fully functional yet.
Think you could add a lot to the wiki with some automated task? Here's how it was done.
First, you'll need mechanize and CaseClasses:
$ sudo easy_install mechanize
$ sudo easy_install http://opensource.case.edu/svn/CaseClasses/python/trunk
Now log into CAS with mechanize:
import Case
from getpass import getpass
username = 'bmb12'
password = getpass() # Enter a password without echoing
cas = Case.CAS()
browser = cas.login(username, password)
You can open any page with browser and interact with it as a logged in Case user. So let's go to the Case Wiki and log in:
browser.set_handle_robots(False)
browser.open("http://wiki.case.edu")
browser.follow_link(text_regex='Log In')
Editing can be done like so:
browser.open("http://wiki.case.edu/User:Brian.Beck")
browser.follow_link(text='Edit this page')
browser.select_form(name='editform')
browser['wpTextbox1'] += " Also, this guy sucks!"
browser.submit()
Automating the building code edits was done like so:
for code, name in Case.Codes.buildings.iteritems():
url = "http://wiki.case.edu/%s" % name.replace(' ', '_')
try:
browser.open(url)
except:
print "Didn't find %r." % name
else:
browser.follow_link(text='Edit this page')
browser.select_form(name='editform')
source = browser['wpTextbox1']
add_text = "The building code for %s is [[building code:=%s]].\r\n"
add_text %= (name, code)
if 'code:=' not in source:
insert_at = source.find('{{Building')
if insert_at != -1:
new_source = source[:insert_at] + add_text + source[insert_at:]
else:
new_source = source + add_text
browser['wpTextbox1'] = new_source
browser.submit()
print "Added building code for %r." % name
Happy automating!
Update: The same has now been done for the Street addresses project. Check out the discussion to see how.geopy 0.93 Released: distance, util, GeoNames
Finally released geopy 0.93, which contains the distance and util modules I previously mentioned, a GeoNames geocoder, and improvements to the Google geocoder in other formats. Updating the documentation was all that was holding it back, really.
You can now pass domain and resource arguments to the Google geocoder. To query the actual Google Maps interface (instead of their official HTTP geocoder), initialize like so:
g = geocoders.Google(resource='maps')
The JavaScript results tend to be the best for this resource, so change that as well:
g = geocoders.Google(resource='maps', output_format='js')
Finally, for geocoding addresses outside of the US, change the domain being queried:
g = geocoders.Google(domain='maps.google.co.uk', resource='maps', output_format='js')
As James Robinson brought up on the geopy mailing list, work is under way for accuracy support. This will let you determine how precise the geocoded result is for the given location. For example, is it only guaranteed to be the correct city? Street? Is it the exact address? I decided to release this version of geopy without completing this, because not much work is done so far (and we also want to normalize values across geocoders), and the distance module was a pretty big addition.
To upgrade:
sudo easy_install geopy
A Web-Based Presentation System for Case
If you saw the new EECS department bounty system, you may have noticed the bounty for a web-based presentation system. This is now a work-in-progress by yours truly.
The goal is to let Case people generate slideshows through the web with Case-themed templates, mostly for use on the plasma displays in the EECS department (well, the displays that will be in Glennan). Another use could be by faculty in place of PowerPoint (so it should have familiar features such as transitions).
You may have seen those displays in the fishbowl last year showing pictures and such. Apparently these are a pain to administer because they are connected by a wireless connection and thus are not easy to connect to via Remote Desktop. So when they are rebooted or PowerPoint closes somehow, getting the slideshow back up and running can be a real hassle. Having them in kiosk mode in a web browser will allow them to easily refresh if anything goes wrong, and will also let them grab updated content. This is how the plasma displays around campus show their (I think Flash-based) slideshows.
There are two such tools on which to base such a system: S5 and Slidy. If you've ever viewed a text-based presentation online, it was probably with S5. This is the first time I have heard of Slidy, but it actually looks a lot better to me. Every time I've encountered S5 in the wild, the JavaScript has eventually completely messed up the text sizing and positioning, making everything unreadable.
There are existing services that combine web presentation systems with editing and sharing, such as S5 Presents and SoapBX. SoapBX is not open source while S5 Presents is GPL licensed and in Ruby. I would have no problem with using S5 Presents, but apparently it has gone unmaintained for almost two years, and Simon was disappointed when he looked into configuring it to run under a real web server.
Tonight I decided to see what actually using S5 is like, so I customized the default theme to resemble the Case PowerPoint template found on the Case branding page. So here is my Case-themed S5 presentation. It's pretty incomplete and doesn't match exactly, but it's close (and anyway, the goal is to make it look good, not match exactly—their PowerPoint template could use some work).
When I first considered this bounty, I thought "hmm, web-based, which web framework will I use?" One of the major goals is to be easily maintainable once I leave, so something popular or at least very easy would be the best choice. Which best fits this description? Django? TurboGears? web.py? I've never really considered the maintainability of any of these, since usually I can choose whichever I like at the moment.
After a bit of talking with Chris, we determined that any server-side code at all isn't even really needed. The only thing it would be used for is to display view and edit pages for the presentations and to store them. The first two can be done with static HTML and JavaScript, while the third can be used by POSTing with AJAX to some location. For example, we could POST presentations to the EECS department's new DokuWiki-powered site with the appropriate permissions. Since S5 and Slidy presentations are just HTML, the wiki page itself would be the presentation outline, and the slideshow viewer could just grab the markup from there. Or we could use AJAX to upload the presentation via WebDAV to the author's Filer account.
Since there would be heavy JavaScript usage regardless of the selected framework, the author would need to be fluent in it anyway. So why not a pure JavaScript solution, given the possibilities above? This is what I'll be looking into for the next few weeks. The code will soon live at opensource.case.edu/projects/present.
If anyone has experience with S5 or Slidy and would like to share their opinions, please do (even if it's just which system you like more).
dmath: Math routines for Python's arbitrary-precision Decimal type
Yesterday Chris and I spent all day writing math functions for Python's Decimal type. The result is our new dmath library, available on Google Code and the Cheese Shop under the MIT/X11 license.
Sparked by the routine for atan in my last post, I decided it wouldn't be too hard to go ahead and do the rest of the functions already offered by math and cmath. We now have acos, asin, atan, atan2, ceil, cos, cosh, degrees, e, exp, floor, golden_ratio, hypot, log, log10, pi, pow, radians, sign, sin, sinh, sqrt, tan, and tanh.
Check it out:
>>> from dmath import *
>>> from decimal import Decimal as D, getcontext
>>> getcontext().prec = 50
>>> asin(D(1))
Decimal("1.5707963267948966192313216916397514420985846996876")
>>> golden_ratio()
Decimal("1.6180339887498948482045868343656381177203091798058")
We're calling this release 0.9 because it just needs some testing and maybe some speed improvements, otherwise it's ready to use. There is currently some work being done in Python sandbox/trunk to convert the decimal module to C, and maybe they'll include fast versions of all these routines. But hey, you can use these right now!
Arbitrary precision is one of the coolest things in programming. We spent a lot of time in Mathematica, where if you ask it to tell you the precision, it says 'Infinity'. During our testing, we actually stumbled across a bug in Mathematica's ArcTan function! This page correctly states that ArcTan[-Infinity, y] should always be Pi (with the sign of y). However, Mathematica always returns 0. I sent a message with my findings to the Mathematica mailing list and Daniel Lichtblau of Wolfram Research confirmed that it is indeed a simple bug. ArcTan users, beware!
Anyway, enjoy dmath. Contributions are welcome, especially if you have any speed tips!
Geocoding and Python's decimal module
Python has an awesome decimal module for decimal floating point arithmetic. It has configurable precision and keeps track of significant digits and does some other neat stuff.
While I was adding the geopy distance module, I began to wonder if it would be worth the effort to switch everything in geopy over to use Decimals instead of floats. After checking out the decimal module (I had never used it before), I decided that I had nothing to lose, so I went for it...
I quickly ran into some snags when I realized that I'd have to code my own trigonometric functions for use with Decimals, since those that come with Python are for complex or floating point numbers. The decimal recipes page in the documentation has functions for sin and cos, but distance uses asin, acos, atan, and atan2. Don Peterson has a nice decimalfuncs module with most of these, but it's GPL (and would be an uncommon dependency) — geopy is MIT/X11. So I went ahead and started on these...
I decided it would be easiest to define asin and acos in terms of atan, and it turns out there is a (relatively) quickly converging algorithm for that. Here's what I came up with for a Decimal-compatible atan:
def atan(x):
if x == D('-Inf'):
return pi() / -2
elif x == 0:
return D(0)
elif x == D('Inf'):
return pi() / 2
if x < -1:
c = pi() / -2
x = 1 / x
elif x > 1:
c = pi() / 2
x = 1 / x
else:
c = 0
getcontext().prec += 2
x_squared = x ** 2
y = x_squared / (1 + x_squared)
y_over_x = y / x
i, lasts, s, coeff, num = D(0), 0, y_over_x, 1, y_over_x
while s != lasts:
lasts = s
i += 2
coeff *= i / (i + 1)
num *= y
s += num * coeff
if c:
s = c - s
getcontext().prec -= 2
return +s
It depends on the pi function from the decimal recipes page, which calculates pi to the currently configured precision.
Upon finishing this, Chris came home and I told him what I was doing. Immediately, he tried to talk me out of it, asserting that floating point was good enough for geocoding. I tried to counter by explaining all the floating point calculations being performed in distance, but in the end he won. I no longer think it would be a very important change to convert everything in geopy to use the Decimal type.
What finally convinced me was this quote from the Vincenty distance page I used for reference:
Vincenty’s formula is accurate to within 0.5mm, or 0.000015″ (!), on the ellipsoid being used.
0.000015 arcseconds is about 4.16667e-9 degrees. Well, if floating point is good to about 10 decimal places, I guess Chris wins this time...
Still, if anyone wants Decimal support in the future, maybe I'll just ask Don Peterson for permission to include decimalfuncs with geopy...
Update: On second thought, maybe I will just continue implementing my own trig functions for Decimals. Chris and I just spent a while investigating the precision of my atan vs. decimalfunc's, and mine seems to be faster and more precise.
geopy gets distance and util modules
If you check out geopy trunk right now you'll notice a few changes.
I introduced two modules: util and distance.
util now contains the parse_geo and arc_angle functions, and will grow more in the future.
distance is a bigger addition and contains helpful functions for calculating geodesic distances. I planned to add this eventually, but development was sparked by a request from Chris Mulligan.
There are two distance formulas: Great-circle (aka haversine, aka spherical law of cosines) distance and Vincenty distance.
Great-circle distance uses a spherical model of the earth, using the average great-circle radius of 6372.795 kilometers (this is configurable). This results in an error of up to about 0.5%.
Vincenty distance uses a more accurate ellipsoidal model of the earth. This is the default distance formula, and is thus aliased as distance.distance — so you can easily swap out distance formulas just by changing distance.distance at the top of your code. There are multiple popular ellipsoidal models, and which one will be the most accurate depends on where your points are located on the earth. geopy includes a few good models in the distance.ELLIPSOIDS dictionary:
# model major (km) minor (km) flattening
ELLIPSOIDS = {'WGS-84': (6378.137, 6356.7523142, 1 / 298.257223563),
'GRS-80': (6378.137, 6356.7523141, 1 / 298.257222101),
'Airy (1830)': (6377.563396, 6356.256909, 1 / 299.3249646),
'Intl 1924': (6378.388, 6356.911946, 1 / 297.0),
'Clarke (1880)': (6378.249145, 6356.51486955, 1 / 293.465),
'GRS-67': (6378.1600, 6356.774719, 1 / 298.25),
}
Here's an example usage of distance.distance:
>>> from geopy import distance
>>> import Case
>>> wiki = Case.Geocode.CaseWikiGeocoder()
>>> _, a = wiki.geocode('Wade')
>>> _, b = wiki.geocode('Fribley')
>>> distance.distance(a, b).kilometers
1.342250272726943
>>> distance.distance(a, b).miles
0.83403565192666562
Using Great-circle distance:
>>> distance.distance = distance.GreatCircleDistance
>>> distance.distance(a, b).miles
0.835175984734287
You can change the ellipsoid model used by the Vincenty formula like so:
>>> distance.VincentyDistance.ELLIPSOID = 'Intl 1924'
The above model name will automatically be retrieved from the ELLIPSOIDS dictionary. Alternatively, you can specify the model values directly:
>>> distance.VincentyDistance.ELLIPSOID = (6377., 6356., 1 / 297.)
Oh yeah, you can add distances too (for paths and such). Here's the distance from Fribley to Wade to Phi Kappa Theta:
>>> _, c = wiki.geocode('Phi Kappa Theta')
>>> (distance.distance(b, a) + distance.distance(a, c)).miles
1.0596624112817861
Also included in the distance module are functions for converting between length units (kilometers, miles, feet, nautical miles), and calculating a destination given a starting point, initial bearing, and distance.
This stuff is still just in trunk, no egg or updated documentation yet...
geopy: Now on Google Code, More Geocoders
I decided to try out the Google's Project Hosting feature for geopy. You can find the hosted page at code.google.com/p/geopy. So far it seems pretty sweet and very easy to administer.
I added a geocoder for Microsoft's Windows Live Local (powered by Virtual Earth) to the geocoders module. Sadly, they don't actually have a non-JavaScript geocoding API, so I had to reverse-engineer it.
Norman Khine and I have been investigating issues geocoding UK addresses with the Google Maps API. Due to contractual reasons, they can't offer geocoded addresses with their HTTP geocoder. So instead I again had to reverse-engineer their JavaScript to get it to work. The geocoded results aren't always accurate, but this is Google's problem and not geopy's.
I also tried to add a geocoder for MapQuest's OpenAPI. It is possible to get geocoded results over HTTP (although they don't tell you how, you have to look at their JavaScript or guess), but unfortunately they require you to parse the input location first. This is totally lame. You're telling me they can't parse the address into street, city, country for me? I didn't want to have to do this, but I now plan to add address parsing methods to geopy.
geopy: A Geocoding Toolbox for Python
I just uploaded the first release of geopy to the Python Cheese Shop.
The web site (with a little documentation) is located at exogen.case.edu/projects/geopy/.
There are five geocoders: Google, Yahoo!, geocoder.us, MediaWiki, and Semantic MediaWiki. Usage examples are given on the web site.
I need to read up more on setuptools to make my package maintenance life easier. For instance, telling the Cheese Shop to install my package from Subversion would be nice right now, but I'm not sure how.
Update: You can now view the pretty syntax-highlighted source code on the web site. Thanks to the wonderful KDevelop for that ability!
Functional Programming in JavaScript: Nice But Slow?
While making those two JavaScript contour prototypes I posted yesterday, I had the chance to try out the nice functional programming tools offered by MochiKit. I love MochiKit—mostly because it's highly influenced by Python.
However, since the performance of the contour mapping code is extremely important to its feasibility, I did some profiling and was a bit surprised.
Consider this code to test membership in an array of points:
var containsPoint = function (container, element) {
for (var i = 0; i < container.length; i++) {
if (pointsEqual(container[i], element)) {
return true;
}
}
return false;
}
Boring, right? Now the functional equivalent:
return some(container, partial(pointsEqual, element));
some and partial come from MochiKit.Base and MochiKit.Iter, respectively.
Sadly, I profiled the functional approach at being 27 times slower than the verbose version.
Similarly, this code:
var neighbors = new Array();
for (var i = 0; i < points.length; i++) {
var p = points[i];
if (is_neighbor(scale, prev, p)) {
neighbors.push(p);
}
}
...is 3 times faster than this code:
var neighbors = filter(partial(is_neighbor, scale, prev), points);
I assume this has something to do with the functional approach being run as "pure JavaScript" while the verbose version is optimized by the implementation—like pure-Python vs. Python written to take advantage of its C implementation (the "pure" version being slower).
I think a lot of people don't profile their JavaScript code, so I'm just throwing this out there. Unless someone can point out a solution to the dramatic speed difference, I will sadly be avoiding a lot of the functional tools offered by MochiKit.
I'll be spending the next few days moving into my Village apartment.
Tomorrow: The final day of Summer of Code, I'll talk about my results.
Brian's TurboGears Tutorial
This weekend I decided to write my own TurboGears tutorial instead of working on Merquery. If you're interested, check out Brian's TurboGears Tutorial. The application demonstrated in the tutorial can be played with on my server and downloaded*.
Why my own tutorial?
First off, last week I gave a talk on Web Programming with Python, and I never made the demonstration code available to the attendees. Hopefully this will make up for that. The application is the same as the one I made from scratch during the talk, except not crappy.
Secondly, I've made a dozen or more TurboGears apps now and I know how the routine goes. I've also walked a couple beginners through the basics in person. I generally do things in a specific order and this tutorial reflects that.
This tutorial's goal is to give beginners a good understanding of how every basic component of TurboGears works. That means SQLObject, CherryPy, Kid, and MochiKit.
For one thing, it slightly annoys me that the Identity login stuff is plastered all over the default code (okay, just in two places) in 0.9. If it were two lines of code to add that stuff, I'd be okay with it. But as it is, it just looks confusing to newbies. When absolute beginners encounter that stuff in controllers.py and master.kid, I just tell them "Aw, wipe that crap offa there!"
And as soon as you start throwing around any remotely abstract-sounding terms like SecureResource, WidgetValidator, IdentityManagement, DataController... what pops into my head is this thing you may have heard of, starts with a Z.
So yeah. This tutorial starts the reader off with a clean slate, trying to explain all the little things TurboGears does for them along the way. Here's that link again.
* Note: setup.py sdist needs to grab everything in templates/ and static/ by default for TurboGears projects. I just made the archive myself because it was easier than figuring out how to do it with setup.py.
reducipes: Use reduce in Python For Real
If you're like me, you enjoy browsing through the list of Python's built-ins to find better ways to do high-level things more concisely.
If you're even more like me, you stop to scratch your chin every time you come across reduce, a nice way to do some functional programming in Python. If you're unfamiliar with reduce, it goes a little something like this:
Apply function of two arguments cumulatively to the items of sequence, from left to right, so as to reduce the sequence to a single value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5).
"Hmmm," you think, "I bet I could use this for something. But what?!" Well, today I submitted this recipe to the ActiveState Python Cookbook. It provides a few ideas for real-life situations where using reduce is both helpful and concise.
Here's an example:
import operator
def factorial(n):
"""Calculate n factorial"""
return reduce(operator.mul, range(2, n+1), 1)
Usage:
>>> factorial(10)
3628800
Another one:
def intersection(*sets):
"""Get the intersection of all input sets"""
return reduce(set.intersection, sets)
Usage:
>>> a = set([1, 2, 3, 4, 5])
>>> b = set([5, 6, 3, 7])
>>> c = set([8, 7, 5])
>>> intersection(a, b, c)
set([5])
If you have any of your own reduce recipes, post them here or to the Cookbook recipe. Happy reducing!
More on Merquery
A bunch of traffic has been directed to this blog due to the post about Merquery. Seems like there has been discussion going on in that posts's comments and also over at Jacob Kaplan-Moss' post.
As predicted, a lot of people are only seeing the "reinvention" aspect of Merquery. And admittedly, there is a lot that would need to be reinvented based on the goals I wrote down.
The real novel part about Merquery is that it's easy to drop into a Python web application. Imagine if you're going through a TurboGears tutorial and all you have to do is add one line to add full-text indexing and search to your database tables. Cool!
So here's the reformulated plan. Write adapters for the nice-looking Python indexing engines mentioned so far, such as PyLucene, Hype, and Xapwrap. Make using any of them look the same (so they're easy to swap in and out), and make it a one-liner for the most basic indexing setup desirable. Then, add a pure-Python indexer to the package as a side project, for those people who don't want dependencies. (All three of those existing libraries mentioned above still require the library they wrap to be installed.)
Unlike the current interfaces for those indexing libraries, these adapters don't have to be completely general (yet). If they only provide adapters for SQLObject classes and the Django database API, that's already a great accomplishment, even though these adapters are less flexible than the generic interfaces already provided. This will allow Django and TurboGears developers to stick with what they know rather than worry about getting an indexer working with their underlying database. (Hey, we've got to start somewhere, might as well have mass appeal right from the get-go.)
Here's an idea of what some customization of a developer's search engine might look like:
class Person(SQLObject):
firstName = StringCol(notNone=True)
lastName = StringCol(notNone=True)
nameSearch = Merquery.LuceneIndex(first=Person.firstName,
last=Person.lastName)
(I have no idea why that space is there. Sorry for my blog being so ugly.)
In this example, the developer has customized the index by giving Person.firstName strings the field name 'first' and Person.lastName strings the field name 'last'. So to find people with 'Beck' in their name but not 'Brian Beck', this would work:
beck -first:brian
Developers could just pass query strings like the above directly from their forms to the index:
results = nameSearch.query("beck -first:brian")
Since LuceneIndex knows we passed in SQLObject columns, it will know to return results as a ranked list of SQLObject instances.
results[0].firstName, results[0].lastName
Obviously this example might not be very realistic since firstName and lastName are just strings and we could accomplish this with SQL. But the same ideas apply for fields storing big documents, etc., where things like term frequency and proximity become important.
Thoughts?
Update: I made a Merquery Google Group so discussion can now happen in a centralized place. I was also kind enough to make the first typo on there.
Merquery, Text Indexing and Search with a Focus
If you happened to catch my RSS feed at some terrible hour of the morning a couple nights ago, you may have read a little about Merquery.

Merquery is the reason I want to parse nice search query syntax like Google's (or more generally, Lucene's). It's a full-text indexer and search engine. Not the crawly, home-page kind like Google, but the deployable kind like Lucene.
Now I know what you're thinking. You're coming up with all these nerd-slang fake syndromes with which to diagnose me. You're wildly gesticulating while going on about how a bunch of people smarter than me have done this before and published their results.
Let me say that I think Lucene is one of the best open source offerings out there. If you're serious about deploying search locally, you should use Lucene. Heck, you can even use Python to do that.
Problem is, there are a ton of people out there who aren't that serious about search. Or more appropriately, don't need to be that serious about search. I've been investigating Lucene for the past week, and my impressions are that it's:
- huge and complex
- overkill for Joe Programmer's First TurboGears App
- hard to deploy for my app in less than 5 minutes
Obviously, Lucene was never intended for use with that second point. But there are a lot of developers these days using small-but-scalable web frameworks that could use a lightweight search engine that's easy to plug into their apps. I'm talking, like, just pass it a list of SQLObject columns and it will do the rest. Giving your SQLObject tables a search method is easy, but giving them a fully-featured query syntax and ranking the results is a little more work. Sure, we can count on Google to index our pages for us, but they can't search our databases, which is also a huge need.
So my goal is for Merquery to focus on being super-easy to deploy for those web developers using things like Django, TurboGears, Pylons, etc. I already know what Python database wrappers they're likely to be using, so I can make Merquery cater directly to their needs.
With that in mind, here are some design decisions and other things I've been thinking about:
- Full Lucene query syntax would be nice, but hasn't Google's small subset of it proven to be good enough?
- I like being able to search for stop words. Google used to have stop words, but now you can search for anything you want.
- Python has lots of nice ways to process large datasets without having everything in memory at once; I plan to use these methods extensively.
- So far I think pyparsing will be the only dependency. There's a lot of cool stuff already in the standard library that's well-suited to building a search engine. And hey, Paul says search query parsing will come with the next release of pyparsing.
- Some motivating phrases: lighweight, fast, easily deployable, easily extendable
This article (with code) about making something similar in Python is a good read. In fact I've found a few articles like that one, but they're all from at least a few years ago. I got the impression that no one has approached this problem from the same perspective of focusing on the tools today's Python web developers are using (SQLObject, TurboGears, and such). But if you know otherwise, let me know!
Google Search Syntax BNF
A while ago I was making a specialized search engine of sorts and I wanted to use Google's nice query syntax with my own custom modifiers instead of things like inurl:, intitle:, site:, etc. Besides these powerful modifiers, Google's search syntax is nice because no query is invalid. Yahoo, MSN, and Amazon also at least support more than just basic search expressions.
Sometimes I wish other sites (like reddit) would implement this syntax for their search queries. So tomorrow I'll release a little Python module that parses this query syntax and makes the query easy to read and process. I wrote code that did this successfully a year or two ago, but I'd like to rewrite it now with pyparsing or something. Now there will be no excuse to offer only lame search queries.
To get the ball rolling, here's the BNF for the syntax I will implement. I have no idea if this is even a proper way to express BNF, but I used the Python grammar as a reference (because if it's good enough for Guido, it's good enough for me).
L ::= expr | expr L
expr ::= term | binary_expr
binary_expr ::= term " " binary_op " " term
binary_op ::= "*" | "OR" | "AND"
include_bool ::= "+" | "-"
term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
modifier ::= (letter | "_")+
literal ::= word | quoted_words
quoted_words ::= '"' word (" " word)* '"'
word ::= (letter | digit | "_")+
number ::= digit+
range ::= number (".." | "...") number
letter ::= "A"..."Z" | "a"..."z"
digit ::= "0"..."9"
Contrary to what many people believe, you can NOT use parentheses for grouping or precedence in Google search queries. Every punctuation character except for [+-_".] is converted to a space and becomes meaningless.
Sprinting messes with your mind
What Mike and I accomplished during Spring break can only be described as a sprint. We hammered out an initial release of a decent contribution to the Python community in two days. In fact, we started preparing for the next release by rewriting everything from scratch to more easily support new collaborative filtering models and make our current models more accurate. There's another project we started that I'll put online in a week or so.
Our long and focused coding sessions resulted in what I have only ever experienced after playing a single video game for prolonged hours. I fall asleep thinking about code and math; sleep is interrupted by thoughts about code and math; the day begins (often in the late afternoon) with thoughts about code and math. It takes a while to wear off.
More specifically, here are some thoughts that keep popping up since we first released consensus:
- Why don't any of the models presented in research papers work as presented? We've had to modify almost all of them.
- Why is our refactored code an order of magnitude slower than our initial naive code?
- Why did using numarray for churning numbers on our dataset make things slower, not faster?
- If AudioScrobbler's dataset is so large, why are their all-time top recommendations for me so unstable?
Anyway, I hope that posting these thoughts will get them off my mind, because I'm ready to think about other things for a while.
consensus: Collaborative filtering in Python
Mike is visiting for Spring break, so naturally we've spent the whole time coding. We've started a few projects, and we've just made the first of them available.
consensus is a collaborative filtering library for Python. It implements three different filtering models, and makes it easy to implement more. We're no experts on the subject, but the results seem pretty good.
As an example, we wrote a little script to harvest data from thousands of AudioScrobbler users, starting with myself. All three models yielded similar results with lots of overlap. Based on my AudioScrobbler profile (which hasn't been updated in at least a year), our models suggested these additional artists:
- Sufjan Stevens
- Elliott Smith
- Broken Social Scene
- Belle and Sebastian
- The Beatles
- Interpol
- Wilco
- The Decemberists
- Bright Eyes
- Beck
- Death Cab for Cutie
Not bad! There is some overlap with AudioScrobbler's suggestions, so we must be doing something right. We're not sure what fancy algorithms they're using, but based on which suggestions we determined I would actually like, our models had a higher success rate. But who knows, if we had a larger data set, maybe they'd be the same...
Anyway, if you'd like to install the latest release of consensus, just use setuptools:
sudo easy_install consensus
Update: Both consensus and easyBay are now on the Cheese Shop.
easyBay Unleashed at Clepy
I gave my presentation about talking to eBay with Python last night at Clepy and I think it went pretty well, with positive feedback. You can view the slides from the presentation online. Clink clink:

You can now check out my Python eBay library from my Subversion repository:
svn co svn://exogen.case.edu/easyBay
Some documentation can be found at exogen.case.edu/easyBay. No point release or Python egg yet, everything is still in the trunk. You can sign up for (free) Developer Keys at developer.ebay.com in order to use the library.
Fostering Python Development at Case
I love programming in Python. Despite what your opinions of Python might be...
Alright, so instead of mirroring Greg's post, I'll just talk a little about the new Python Case classes.
Right now you can easily:
- Authenticate someone with CAS
- Search and retrieve LDAP attributes
- Query the USG web service
- Read syndicated feeds from Blog@Case, KSL, CaseWiki, Housing, Help Desk, Case News, UPB, Photos, and Forum
- Manage files with Filer
- Read and manage mail over IMAP
- Various little niceties like validating Case IDs, guessing someone's Case ID from their name, and checking all the valid mail domains (having a single alias at Case is like having at least 10 e-mail addresses).
Sprinkle on some HTML and you've got a portal. In a later posting I'll go into the usage of these modules, since seeing it in action might encourage people to tinker around with it. By the way, I found all those RSS feeds (and a few more) with this Google query.
I had my first experience creating a Python Egg for easy distribution of Case Classes. Eggs are the Python equivalent of Java's Jars. It was all quite painless. You can grab the latest egg (Case-0.1dev-py2.4.egg) from Subversion.
Knoware Client Update
It's getting pretty late into the summer now, and I predict that the client portion of Knoware will be done by the end of the week. So, outlook good.
Despite my prior C++ coding projects, my course experience here at Case, and even having taught it for a class, I was dreading getting back into it after having been dazzled by Python for the past year. God damn it, I thought, I'm gonna have to find all these libraries for crap like XML parsing, gonna have to re-learn sockets for like the third time, etc. Of course, no one said I had to use C++ — Python even has bindings for both the Qt and KDE libraries. But, I thought, I owe it to the users to slave over a damn solid C++ program.
The only thing I was certain of was using Qt for the GUI, obviously, just like the rest of KDE. And that's when I started getting more surprised every day, even now. What? Qt does XML parsing? It does sockets? It does threads? It can communicate with external processes? I don't even have to use the classic C++ strings, because it does that, too, and like I'd expect? Qt has definitely made getting back into C++ a pleasure. Thanks, Trolltech.
Remember those old Knoware screenshots? I prettied up a few areas with the help of KHTML, finally bringing them to life. Take a look:
If I don't write a post here about statistical results by this time next week, I'm in trouble.
Knoware, Statistics & Orange
The lack of recent Knoware updates on this blog is a good indication of what the statistical aspects of this project mean for a programmer like myself — lots of time reading and researching up front. Aside from things like database design, I've put very little actual code into Knoware for the past week due the sheer amount of knowledge required to understand (let alone successfully apply) Random Forests and Classification & Regression Trees. This particular approach also includes learning a bit of R.
However, I just came across Orange for the second time and actually read the feature list this time. Wow! Could I have asked for a better module? Their Orange For Beginners guide includes examples of classification, bagging, regression, and other processes that make up the bulk of Random Forests. I think this could definitely speed up the process of getting a working system up before September.
Since I'm talking about the pieces I'm using to construct Knoware, I guess I'll mention that I'm working with SQLite right now for the database backend. I've used it for a few projects so far and I've become quite a fan. MySQL and friends are just too much sometimes — users, permissions, configuration, bah!
The Return of CASPER
This post is not related to Knoware, but an old half-finished project called CASPER.
CASPER is a recursive acronym standing for Casper Artist-Sorted Playlist Entropy Reducer. The idea relies on the fact that there are two extremes dominating the way people can sort a playlist: alphabetically or completely random. Since the former would be a strange way to listen to music, most people go with the shuffled approach. In my experience, the reasoning behind listening to a shuffled playlist is to promote variety. But while complete chaos might increase variety, it certainly does not maximize it. Some people notice that their media players seem to have an affinity for certain artists or songs. So I came up with a method to increase the variety of any given attribute in a playlist.
The method places songs throughout the playlist such that items with the same key (for example, songs by the same artist) occur as far apart as possible, and it does this for every song in the list. For example, if the playlist contains 3 songs by X and 3 songs by Y, one result might be X2-Y3-X1-Y2-X3-Y1, where the order of the songs by X and Y are random (so there are lots of possibilities for even such simple cases). With large playlists, this careful song placement is much less apparent, so it appears random. The total entropy of the shuffled sequence is decreased, but the variety is increased.
I've decided to release some simple, documented Python code demonstrating the algorithm used for CASPER. This should make it easy for anyone to implement this shuffling method into their media player if they so desire. If your media player can easily make use of Python, it would probably take less than a minute to do. Check out casper.py if you're curious. Running the script as-is will produce a simple example. Requires Python 2.4 or higher, but is easy to port.
The full source code of the linked file is in my extended entry if you're really lazy, but it's probably formatted poorly and won't have syntax highlighting.
Continue reading "The Return of CASPER"
Drafting the Knoware Interface
As suggested by my KDE mentor, one of my first tasks is to try designing an attractive interface for Knoware. I've been doing mostly web programming for the past year or two, so getting back to using a GUI toolkit has been fun, and working with Qt Designer is a pleasure.
As mentioned in a previous post, the Knoware client will be used to perform three main tasks:
- Detect and display the user's system configuration,
- Browse and report problems the user is experiencing, and
- Collaborate with other users on possible solutions.
Here's what I've come up with so far (click to enlarge):
An announcements panel, the first thing the user sees when opening Knoware. Much more information will be displayed than is currently shown. It will be styled with CSS, so it should look very nice (think of amaroK's sidebar).
A system configuration panel. Self-explanatory. I may try to make the visualization of the user's system a bit more interesting and easier to browse.
A panel for searching and browsing problems. The search results should be updated as the user types. Problem details, instructions to reproduce the problem, and subscribed users will all be shown here. This is also where the statistics Knoware has collected should be displayed — more on how I plan to visualize this later.
A discussion panel. I do have more things in mind for collaborative problem-solving, but this is the most important and flexible, I think. Discussion 'rooms' may be thought of as one-per-problem (like comments on a Bugzilla page). But an interesting feature is that they are both live (like a chat room) and persistent (like a forum). Right now the two forms of communication people turn to for fixing their problems are forums and IRC — this unites the strong points of both. Since live chat may generate a long discussion history, users can flag individual messages to indicate importance. The flag and recycle-bin icons next to the search field are toggle buttons which show only flagged messages and show junk messages, respectively.
Besides putting all this functionality in one place and including a few niceties, there isn't much new here. The novel part about Knoware is, after all, on the server-side: the use of statistical methods to identify patterns. The discovered patterns should be visible to users, since they are the people who have the ability to use the information and apply it. So how can I best visualize this information?
One way might be to think of any general system configuration as a tree (kinda like in the System panel above). There are always certain components in a configuration, and various areas that differ on each system. To visualize what Knoware has discovered, then, might include making such a tree in which only the likely predictors are shown. The more confident Knoware is about a predictor, the bigger or bolder it could show it, or maybe all the other components are shown but grayed out — the important thing is that the likely predictors of the problem stand out.
Another way might be to show a list much like Apple's Spotlight results. Instead of object categories and files, results would be arranged into component categories and predictors, along with some statistical information.
Some mock-ups would convey these ideas much more easily. Maybe I'll post some later.
Knoware, Part II
In my last entry, Andi Roedl helpfully pointed out LSHW, a hardware-listing tool for Linux. I gave it a try (it's even on Portage), and it works great! From what I can tell, it even retrieves information such as clock speeds, so Knoware may have the ability to identify overclocking as a problem source. I contacted Lyonel Vincent, the author of LSHW, and it looks like this is the best answer to the hardware detection part of Knoware so far.
The other piece of the puzzle is installed package detection. This could end up being even more important than the hardware portion, since issues like library version compatibility seem to cause just as much distress. I haven't searched very hard for something like this yet — any ideas? Chris pointed out that I may have to find per-distribution methods for this to work reliably, which is what I was thinking. Gentoo, for example, uses equery and qpkg.
In the statistics department, my friend Erin and her husband Ryan have been offering their educated brains for the picking. Both suggested that Bayesian networks (as I had in mind) might not provide the best results for Knoware. Instead they pointed me at Random Forests (more info here). Erin explains:
The idea is you can't make any good statistical inference from one decision tree. Random Forests works by splitting the data using majority vote decision tree processes. It does that lots of times, thousands if you want, and it gets more precise the more it does that. Then it tells you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. Also, it is a means to evaluate groups, so you can see if a certain kind of problem clumps with other factors.
The source code for Random Forests is available, but it's in Fortran and I'm unsure of licensing issues. Word on the street is there's source in another language out there somewhere. Otherwise I could just do a bit of research and try to code it from scratch. This kind of process seems like a prime candidate for rapid prototyping, so I'm planning to use Python for it. Hopefully it will provide a nice complement to the Python Bayesian Network Toolbox being developed by Elliot Cohen.
Mark Dickie recently commented on my previous entry, saying:
I believe that Dell have something like this although I think it's nothing much more than an update/security announcer tailored to your hardware and software. Might be worth a look.
This raises an interesting point that I have not yet made explicit. I always imagined that utilities like the Microsoft crash reporter (and the Dell utility Mark speaks of, which is news to me) aim to accomplish something exactly like Knoware, but behind closed doors. A big distinction here is that Knoware puts actually fixing problems into the user's hands, and more importantly into the community's hands.
But there is also a simple yet key feature that Mark brings up, and that is providing a distribution channel for patches and other announcements. Notifying users of solutions once they are found is what will eventually cause more users to participate in Knoware. Even if a user has not taken the time to subscribe to a specific problem, Knoware will be able to determine that the user probably does experience the problem based on their system configuration.
Knoware, Part I
In my last post I mentioned that I made my decision to continue with my KDE project, Knoware, for the Summer of Code. For those who missed the full description (read it if you get confused), Knoware is a program that will use Bayesian networks to find patterns between Linux system configurations and bug report subscriptions. In this entry I'll provide a little background as well as some more details, plans, and open issues. Since the project has barely started yet, I'm open to input from any direction.
First, some background. I actually envisioned Knoware as a Windows program in high school more than five years ago. At the time I didn't even know Bayes nets existed, just that computers must have the processing power to analyze the thousands of statistics and arrive at a solution. I pictured the interface much like a chat program, since collaboratively working through problem solutions will play a prominent role in the program's success. I even came up with a business plan for it: make it free for users, but provide custom statistics reports to hardware and software companies at a cost (remember, Knoware will know which products are conflicting with one another, the distribution of models and driver versions, and such). I look back on this now and still see it as plausible, but I'm very glad to be developing for a platform I use and enjoy now. More on other aspects of my original vision, like the interface, shortly.
You may have realized that the mention of Windows brings up an interesting point: conceptually, Knoware has very little to do with KDE or even the Linux platform. It could easily be made to work with Windows or Mac OS, and focus on their respective bugs. If it turns out to work as well as expected, you can be sure versions will pop up on other platforms.
The current limitation to Linux and KDE may actually be a benefit in finding accurate statistics. Sure, simultaneously opening up the program to Windows and Mac users would result in a larger database. The problem is that the bug reports and system configurations would be much less focused. The extra input does no good if it's for a completely different problem and coming from a completely different system. If all input configurations have Linux and KDE in common, this will help the users focus on more of the same problems instead of completely unrelated ones.
The issue has been raised that it may be hard to gain enough users initially, making it difficult to draw useful conclusions. I'm pretty confident when it comes to this issue. Take for example the site KDE-Forum.org. This forum has more than 8,000 registered users, and this is surely a small fraction of KDE users — the English-speaking vocal minority. If just 2.5% of that vocal minority uses Knoware to report a specific problem, that's 200 configurations to analyze. I'm no expert on Bayes nets (yet), but to me that sounds like a fairly significant data set. And in reality, of course, I picture many more users giving Knoware a try.
One aspect that will help draw in participants is the user interface. My mentor for this project told me to concentrate on this area first, since it will ultimately be the first thing to get a user's attention. In my proposal, I scheduled the user interface toward the end of the summer, so already I've learned something new (about KDE development, at least). Earlier I mentioned that in the past I pictured the user interface much like a chat program. This has changed significantly in my mind, since collaborating with other participants is something the user will only be doing some of the time. The two other main tasks will be scanning — that is, gathering information about the user's system — and browsing for relevant bugs.
Scanning the system is an automated process, but requires the user to specify the level of granularity they will provide, since there is a chance that this information will be made available to other users. It's also possible that the user will need to input some information manually, in case the scanner can't determine an important aspect of the user's system.
Browsing for bugs will hopefully be made easy through searching, live filtering, tagging, and other new trends in navigation. Since the detailed bug reports will need to be created at some point, this process is not just read-only. The interface should therefore assist the user in proper naming and categorization of the problem.
Lastly, I'll discuss possible integration with existing projects. As far as I know, there's nothing out there right now (in the public, at least) which hopes to accomplish the same thing as Knoware. However, there are a few projects which can assist in its goal. KDE has an existing bug tracking system using Bugzilla, and a graphical front-end for it called KBugBuster. While I hope to use more meta-data to make bug browsing easier, there's no sense in completely reinventing the wheel, so this is a likely project I will turn to for integration. Additionally, KDE has a crash handler that tries to make crashes a little more friendly. While crashes are only a tiny fraction of the problems I encounter, I won't rule out the possibility of submitting crash reports to Knoware through this program (much like the Windows crash handler).
Hopefully that puts to rest any questions or doubts out there. As I said, input is welcome during any phase of development.
Summer of Code: Pre-Project Developments
Since the spring semester ended and my number of commitments promptly fell close to zero, my e-mail inbox has been mostly empty. But over the past few days, it's been a challenge to keep up with the swarm of traffic going on due to Google's Summer of Code program. This is a dream come true for people like me who check their e-mail every 30 seconds.
In a previous posting I mentioned that I was leaning towards my KDE proposal as my selected Summer of Code project. At the time of writing, I had already been in contact with the KDE folks for a couple days, and hadn't received a single message from the Python guys. The selection process inevitably requires crushing some dreams, and the lack of contact with the nice folks at the PSF facilitated my early decision.
What a surprise it was, then, when I received a message from one of the KDE mentors telling me that they had already been exchanging messages with the Python guys, who really wanted to see the proposal brought to life. Google had apparently instructed the organizations to 'fight' among themselves for duplicate acceptances. After finally receiving messages from Ian Bicking and David Ascher (this is not to their discredit — the KDE guys were just fast), I wasn't so sure which project to select anymore.
On one hand, I felt a little selfish selecting my own project over the Python one simply for the reason that it excited me more, even with more risk involved. On the other hand, I think a lot of other people are qualified to work on the Python project, and I also think Knoware will directly benefit more people in the end. Since my decision was not swayed, I hope it doesn't look like I've just been stubborn all along — I looked very closely at the goals laid out in my project schedule, and I see no reason not to continue. Furthermore, I plan to use Python for a significant part of my KDE project, and it directly lines up with another proposal accepted by the PSF. So there's something in it for the Python guys as well. Who'd have thought that after all this time, I'd have to turn down the organization and community that has been the biggest part of my life as a programmer for the past year?
Patty brought up a good point last night. After about a month of being unemployed and exploring possible ways to make money, it's funny that now I'm stressing out over my options of doing something completely awesome and doing something just slightly less awesome. Funny how things change...
I'm sending my final decision out to the mentoring organizations shortly. This means that the PSF will get to choose another proposal, which must be exciting for them as well as one lucky participant.
Summer of Code Projects
As I mentioned in my last posting, I have the option of participating in the development of either of two Summer of Code proposals. Before I reveal the projects, it might help to put them into context with some information about the mentoring organizations: KDE and the Python Software Foundation. KDE is a desktop environment for Linux with a very complete software suite. I've tried out nearly two dozen window managers and desktop environments, and KDE definitely has something special going on. (Okay, for my brief GNOME-bashing on Alex's blog, I must say that I've learned to appreciate GNOME as well. I'm actually using it right now to write this up.) Python is my favorite programming language — what more can I say? I'm extremely productive with it.
On to the projects — first, the KDE proposal. On the KDE bounties page there's an entry for a "visionary application addition," awarded to innovative proposals — this was the bounty I was selected for. The idea is for a program called Knoware where users voluntarily contribute information about their system's configuration (hardware, installed packages, etc.) and then subscribe to bugs or problems they're experiencing (doesn't have to be a bug — any fixable annoyance). This information is kept on a server, which uses a system called Bayes Nets (short for Bayesian Networks — here are short and long descriptions — basically, inductive logic) to identify patterns between bug subscriptions and system configurations. If enough people participate, Knoware could quickly track down bugs that would otherwise take forever to identify. In my experience, a vast amount of problems in Linux arise from the unlimited number of possible hardware configurations people have. Often people turn to forums for tech support, where users generally don't provide enough information about their problem and others spend forever probing them about their system. There are a lot more details I could go into about this project, but I'll save it for a later posting. For example, the client isn't just a passive application that gathers system information, it also networks you with other users who are subscribed to similar problems, or have similar hardware configurations, so that you can collaboratively work through possible solutions (and log your progress for others to see).
If that description confused you, consider this example: everyone who subscribes to the 'Program Y crashes when I do this' bug has an 'X' brand graphics card. Knoware will recognize this pattern and then examine more details like the card model, specifications, drivers, etc. For instance, it could eventually determine that only 512MB 2nd generation cards with driver version 2.5 cause this crash, and this gives developers a starting point for fixing the problem.
My other submission was picked up from the Python Web Programming Ideas page. I proposed the addition of a comments & annotation system to the Python standard reference, much like what is seen in PHP's documentation. Oddly, I've had multiple conversations with people who agree that the PHP documentation is often made worse by the comments system. My proposal included the use of Ajax to provide abilities like live editing and filtering, a rating or categorization system to prevent abuse, and other additions. Okay, the comments thing is kind of boring, but the annotation part I am excited about — they are two different things. If you take a look at PHP's documentation, you see that all the user contributed notes are at the bottom of the page, like any other comments sytem. Annotation, on the other hand, is meant to sit alongside a specific block of text on the page. For example, many textbooks feature in-margin annotation to identify terms or concepts. Since it can be very easy to disrupt the flow of a page with extraneous text, especially on a web page, I'm looking forward to trying out different solutions such as CSS tooltips, in-margin annotation, and in-place footnote links. Assuming, of course, that I accept this bounty.
Right now I'm leaning heavily towards the Knoware proposal. My reasoning is that it's my own idea and I have a strong interest in seeing it done right and by my own hand. I'll go into more detail about Knoware and other Summer of Code happenings in future postings.
