Entries for September 2006

Miscellaneous School, Blog, Python Stuff

Today was perhaps the most easygoing day ever. My first class was cancelled, my second class ended 20 minutes early, and my last class only lasted for 20 minutes. Ah, education!

After getting some free food from the ACM / Women in EECS event, I finished my day by admiring this totally legit use of the Expression Wall (clink clink!):

regexp.jpg

Finally, an expression I can relate to!

A while ago I decided to try out some fancy log analyzers like Performancing Metrics and Google Analytics. Google Analytics seems to be better for checking out data about your users, while Performancing Metrics seems to be better for checking referrers and (surprisingly) search terms (well, Google's might be better, but Performancing Metrics is way easier to navigate).

One interesting thing these sites (and Blog@Case Stats) tell me is that start.case.edu is consistently my top external referrer. So it seems to send a lot of traffic my way. Go start!

Chris and I are working on the next version of dmath, mostly for speed and to deal with custom contexts. For example, the result of atan2(0, 0) should be indefinite, but in the math module it's 0 (presumably so that the function is continuous). But if someone wants it to be indefinite (by which I mean D('NaN')), they should be able to set that in their context. Oh yeah, one big improvement is that pow will allow Decimals to be raised to Decimal powers.

We're still trying to wrap our heads around some of the context stuff. For example, should all of our functions accept an optional context argument, like the sqrt, pow, and other methods in Decimal? If so, does every Decimal constructed within that function need to also be passed the context, even D(1)? This is stuff that will probably be obvious after some more browsing of decimal.py. We're also looking into doing things in pyrex once everything is known to be in working order. Need for speed, baby!

Did I ever mention that geopy trunk now has support for GeoNames, and may soon support Map24? Map24 has done a pretty good job of convoluting their JavaScript so that their free geocoder is only accessible via AJAX, but this is merely a speedbump and not a road block. It almost works (but not the version in trunk). Sadly, like Yahoo!'s, their Terms of Use state that their geocoding tools can only be used in combination with their Maps AJAX API. But hey, just because you can access their stuff from Python doesn't mean the developer isn't still using it legitimately (that is to say, to show locations on a Map24 map).

That's all I got!

dmath: Math routines for Python's arbitrary-precision Decimal type

Yesterday Chris and I spent all day writing math functions for Python's Decimal type. The result is our new dmath library, available on Google Code and the Cheese Shop under the MIT/X11 license.

Sparked by the routine for atan in my last post, I decided it wouldn't be too hard to go ahead and do the rest of the functions already offered by math and cmath. We now have acos, asin, atan, atan2, ceil, cos, cosh, degrees, e, exp, floor, golden_ratio, hypot, log, log10, pi, pow, radians, sign, sin, sinh, sqrt, tan, and tanh.

Check it out:

>>> from dmath import *
>>> from decimal import Decimal as D, getcontext
>>> getcontext().prec = 50
>>> asin(D(1))
Decimal("1.5707963267948966192313216916397514420985846996876")
>>> golden_ratio()
Decimal("1.6180339887498948482045868343656381177203091798058")

We're calling this release 0.9 because it just needs some testing and maybe some speed improvements, otherwise it's ready to use. There is currently some work being done in Python sandbox/trunk to convert the decimal module to C, and maybe they'll include fast versions of all these routines. But hey, you can use these right now!

Arbitrary precision is one of the coolest things in programming. We spent a lot of time in Mathematica, where if you ask it to tell you the precision, it says 'Infinity'. During our testing, we actually stumbled across a bug in Mathematica's ArcTan function! This page correctly states that ArcTan[-Infinity, y] should always be Pi (with the sign of y). However, Mathematica always returns 0. I sent a message with my findings to the Mathematica mailing list and Daniel Lichtblau of Wolfram Research confirmed that it is indeed a simple bug. ArcTan users, beware!

Anyway, enjoy dmath. Contributions are welcome, especially if you have any speed tips!

Geocoding and Python's decimal module

Python has an awesome decimal module for decimal floating point arithmetic. It has configurable precision and keeps track of significant digits and does some other neat stuff.

While I was adding the geopy distance module, I began to wonder if it would be worth the effort to switch everything in geopy over to use Decimals instead of floats. After checking out the decimal module (I had never used it before), I decided that I had nothing to lose, so I went for it...

I quickly ran into some snags when I realized that I'd have to code my own trigonometric functions for use with Decimals, since those that come with Python are for complex or floating point numbers. The decimal recipes page in the documentation has functions for sin and cos, but distance uses asin, acos, atan, and atan2. Don Peterson has a nice decimalfuncs module with most of these, but it's GPL (and would be an uncommon dependency) — geopy is MIT/X11. So I went ahead and started on these...

I decided it would be easiest to define asin and acos in terms of atan, and it turns out there is a (relatively) quickly converging algorithm for that. Here's what I came up with for a Decimal-compatible atan:

def atan(x):
    if x == D('-Inf'):
        return pi() / -2
    elif x == 0:
        return D(0)
    elif x == D('Inf'):
        return pi() / 2
    
    if x < -1:
        c = pi() / -2
        x = 1 / x
    elif x > 1:
        c = pi() / 2
        x = 1 / x
    else:
        c = 0
    
    getcontext().prec += 2
    x_squared = x ** 2
    y = x_squared / (1 + x_squared)
    y_over_x = y / x
    i, lasts, s, coeff, num = D(0), 0, y_over_x, 1, y_over_x
    while s != lasts:
        lasts = s    
        i += 2
        coeff *= i / (i + 1)
        num *= y
        s += num * coeff
    if c:
        s = c - s
    getcontext().prec -= 2
    return +s

It depends on the pi function from the decimal recipes page, which calculates pi to the currently configured precision.

Upon finishing this, Chris came home and I told him what I was doing. Immediately, he tried to talk me out of it, asserting that floating point was good enough for geocoding. I tried to counter by explaining all the floating point calculations being performed in distance, but in the end he won. I no longer think it would be a very important change to convert everything in geopy to use the Decimal type.

What finally convinced me was this quote from the Vincenty distance page I used for reference:

Vincenty’s formula is accurate to within 0.5mm, or 0.000015″ (!), on the ellipsoid being used.

0.000015 arcseconds is about 4.16667e-9 degrees. Well, if floating point is good to about 10 decimal places, I guess Chris wins this time...

Still, if anyone wants Decimal support in the future, maybe I'll just ask Don Peterson for permission to include decimalfuncs with geopy...

Update: On second thought, maybe I will just continue implementing my own trig functions for Decimals. Chris and I just spent a while investigating the precision of my atan vs. decimalfunc's, and mine seems to be faster and more precise.

geopy gets distance and util modules

If you check out geopy trunk right now you'll notice a few changes.

I introduced two modules: util and distance.

util now contains the parse_geo and arc_angle functions, and will grow more in the future.

distance is a bigger addition and contains helpful functions for calculating geodesic distances. I planned to add this eventually, but development was sparked by a request from Chris Mulligan.

There are two distance formulas: Great-circle (aka haversine, aka spherical law of cosines) distance and Vincenty distance.

Great-circle distance uses a spherical model of the earth, using the average great-circle radius of 6372.795 kilometers (this is configurable). This results in an error of up to about 0.5%.

Vincenty distance uses a more accurate ellipsoidal model of the earth. This is the default distance formula, and is thus aliased as distance.distance — so you can easily swap out distance formulas just by changing distance.distance at the top of your code. There are multiple popular ellipsoidal models, and which one will be the most accurate depends on where your points are located on the earth. geopy includes a few good models in the distance.ELLIPSOIDS dictionary:

#             model             major (km)   minor (km)     flattening
ELLIPSOIDS = {'WGS-84':        (6378.137,    6356.7523142,  1 / 298.257223563),
              'GRS-80':        (6378.137,    6356.7523141,  1 / 298.257222101),
              'Airy (1830)':   (6377.563396, 6356.256909,   1 / 299.3249646),
              'Intl 1924':     (6378.388,    6356.911946,   1 / 297.0),
              'Clarke (1880)': (6378.249145, 6356.51486955, 1 / 293.465),
              'GRS-67':        (6378.1600,   6356.774719,   1 / 298.25),
              }

Here's an example usage of distance.distance:

>>> from geopy import distance
>>> import Case
>>> wiki = Case.Geocode.CaseWikiGeocoder()
>>> _, a = wiki.geocode('Wade')
>>> _, b = wiki.geocode('Fribley')
>>> distance.distance(a, b).kilometers
1.342250272726943
>>> distance.distance(a, b).miles
0.83403565192666562

Using Great-circle distance:

>>> distance.distance = distance.GreatCircleDistance
>>> distance.distance(a, b).miles
0.835175984734287

You can change the ellipsoid model used by the Vincenty formula like so:

>>> distance.VincentyDistance.ELLIPSOID = 'Intl 1924'

The above model name will automatically be retrieved from the ELLIPSOIDS dictionary. Alternatively, you can specify the model values directly:

>>> distance.VincentyDistance.ELLIPSOID = (6377., 6356., 1 / 297.)

Oh yeah, you can add distances too (for paths and such). Here's the distance from Fribley to Wade to Phi Kappa Theta:

>>> _, c = wiki.geocode('Phi Kappa Theta')
>>> (distance.distance(b, a) + distance.distance(a, c)).miles
1.0596624112817861

Also included in the distance module are functions for converting between length units (kilometers, miles, feet, nautical miles), and calculating a destination given a starting point, initial bearing, and distance.

This stuff is still just in trunk, no egg or updated documentation yet...

Project Club Barbecue, Friday 9/15

Tomorrow Case Project Club will be hosting another barbecue at 12:30 PM in front of Olin, which means free food for everyone! The last one was a big success. According to Jon Ward, we have "way too much food." There will be hamburgers, hot dogs, chips, pop, fruit, veggies and other snacks.

See you there!

openSUSE on My Laptop

Speaking of experimenting with new distros, last night I installed openSUSE 10.1 on my laptop. I don't keep anything of importance on my laptop, so it was a just quick format away.

It left me super impressed with almost everything, but some things annoyed me.

Maybe I haven't installed Linux in a while, but the installer seemed to take forever. What annoyed me the most was when you get to the GUI installer and it starts doing something in the background that takes 5-20 minutes without explaining itself. This is the part where you can't press ESC to see what's going on. Anything to do with the catalogue, packages, or network took forever while just giving a vague snippet of an explanation.

After finally getting into KDE with some software installed, the first thing it did was find updates. Okay, cool, but the updates it found had all sorts of conflict problems immediately. This is something I wouldn't expect to deal with after just installing some basic stuff, but whatever. I was impressed with the interface for resolving conflicts. Unlike Gentoo, where you often have no idea what the best move is when encountering a conflict, the package manager gives you two or three sensible options.

The GTK-interface Install/Remove Software windows are incredibly slow at whatever the hell they're doing... I'm guessing it's retrieving the huge list of packages. How about a progress bar? Or how about you don't do that and just let me search, then send off that query instead? I think YaST might have a better interface for doing this stuff, I haven't checked yet.

The KDE defaults are awesome and pretty. I love KDE. I wish it had more classy themes like GNOME has. I really think most KDE themes are gaudy. And I think Tango looks (mostly) better than Oxygen. I think KDE icons and themes have too much contrast and saturation, which really detracts from their iconic intention and instead makes them look like real-life objects I have to identify first, then identity the iconic intention of the object. Have much bolder and well-defined outlines would help a lot in this area... but I'm getting off-topic here.

Like Jeremy, I appreciate any tips from experienced SUSE users about getting started.

geopy: Now on Google Code, More Geocoders

I decided to try out the Google's Project Hosting feature for geopy. You can find the hosted page at code.google.com/p/geopy. So far it seems pretty sweet and very easy to administer.

I added a geocoder for Microsoft's Windows Live Local (powered by Virtual Earth) to the geocoders module. Sadly, they don't actually have a non-JavaScript geocoding API, so I had to reverse-engineer it.

Norman Khine and I have been investigating issues geocoding UK addresses with the Google Maps API. Due to contractual reasons, they can't offer geocoded addresses with their HTTP geocoder. So instead I again had to reverse-engineer their JavaScript to get it to work. The geocoded results aren't always accurate, but this is Google's problem and not geopy's.

I also tried to add a geocoder for MapQuest's OpenAPI. It is possible to get geocoded results over HTTP (although they don't tell you how, you have to look at their JavaScript or guess), but unfortunately they require you to parse the input location first. This is totally lame. You're telling me they can't parse the address into street, city, country for me? I didn't want to have to do this, but I now plan to add address parsing methods to geopy.

Making dp.SyntaxHighlighter for Python Not Suck

I've always wanted syntax highlighting for my Python code on the web. The geopy site, for example, is made much more readable with syntax-highlighted code. However, it's all done statically by using KDevelop's HTML Export feature.

The other night I decided to finally try out dp.SyntaxHighlighter, a JavaScript syntax highlighter. But unfortunately I discovered that its Python support is actually pretty ugly. Its Python example page isn't even valid Python! Type declarations? Not only that, but the test page demonstrates its parsing failures—the comment within a string is highlighted as a comment, not a string.

So I made some modifications. You can see the modified test page here. If you want to use it, check out my modified shBrushPython.js and my python.css, which you can modify to change the colors.

I took out all the highlighting of the builtins, exceptions, types, and modules. In practice I don't think that's very useful. The purpose of syntax highlighting is to make code clearer, not to classify every little token and make it look like a rainbow.

One thing I was not able to do was highlight class names and function names where they are defined without also highlighting the "def" and "class" keywords. I think this makes things look a lot nicer (as you can see in the highlighted geocoders.py code), but unfortunately JavaScript does not support lookbehind assertions, so I don't think it's possible without modifying dp.SyntaxHighlighter a bunch.

I also really like having operators highlighted (see here again), but that slowed it down considerably. Uncomment that line in shBrushPython.js if you want it.

Enjoy.

geopy: A Geocoding Toolbox for Python

I just uploaded the first release of geopy to the Python Cheese Shop.

The web site (with a little documentation) is located at exogen.case.edu/projects/geopy/.

There are five geocoders: Google, Yahoo!, geocoder.us, MediaWiki, and Semantic MediaWiki. Usage examples are given on the web site.

I need to read up more on setuptools to make my package maintenance life easier. For instance, telling the Cheese Shop to install my package from Subversion would be nice right now, but I'm not sure how.

Update: You can now view the pretty syntax-highlighted source code on the web site. Thanks to the wonderful KDevelop for that ability!

MT-Blacklist and Spam

My blog has been getting hit hard by spammers (you probably noticed if you subscribe to Blog@Case Comments). I use the MT-Blacklist plugin to mark every one of them as spam, so I figured I was doing some good... but it keeps letting them through, even if the URL patterns are obviously similar.

So I checked the MT-Blacklist page and noticed that it says only 39 URL patterns are blacklisted. Huh? Does it just add all those as string patterns instead? If so, even that's not working; I still get spams with the URL dunetribune.info even after de-spamming all of them.

Is our spam filter working? Am I just not seeing the tidal wave of messages that are actually coming in? Why don't I ever see any spam on Mano Singham's journal (and I'm subscribed to Blog@Case Comments, so I'd see them even if he deleted them afterward)?

Geocoding Tools for Python (and CaseClasses)

I've been working on some geocoding classes in Python. Right now I've got tools made for MediaWiki, Semantic MediaWiki, and the Google geocoder. I plan to include the Yahoo! geocoder in this toolbox soon.

I think this could be a useful package, so I plan to upload it to the Cheese Shop soon.

I added the relevant geocoder classes to the Python CaseClasses so that developers can easily geocode strings using Case's Semantic MediaWiki.

Check it out. First grab the geocoder:

>>> from Case import Geocode
>>> wiki = Geocode.CaseWikiGeocoder()

Then start geocoding:

>>> place, (lat, lng) = wiki.geocode('KSL')
Fetching http://wiki.case.edu/KSL...
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
Kelvin_Smith_Library: 41.50727, -81.60950

geocode returns a tuple consisting of the location name found and the coordinates (another tuple).

Here's where the Semantic part comes in. The Project Club article isn't geocoded, but it is located in Olin, which is geocoded:

>>> place, (lat, lng) = wiki.geocode('Project Club')
Fetching http://wiki.case.edu/Project_Club...
Fetching http://wiki.case.edu/index.php/Special:ExportRDF/Project_Club?xmlmime=rdf...
Fetching http://wiki.case.edu/index.php/Olin_Building...
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
Olin_Building: 41.50224, -81.60778

CaseWikiGeocoder is a subclass of SemanticMediaWikiGeocoder and is defined by only the following:

class CaseWikiGeocoder(SemanticMediaWikiGeocoder):
    def __init__(self):
        super(CaseWikiGeocoder, self).__init__("http://wiki.case.edu/%s",
                                               relations=['Located in'])

This creates a SemanticMediaWikiGeocoder with a base URL of 'http://wiki.case.edu/' that follows the 'Located in' relation if a page fails to geocode. So SemanticMediaWikiGeocoder could easily be used for any Semantic MediaWiki with any set of relationships defined. This class is brand new and has only been tested on the Case Wiki, so it might be buggy.

MediaWikiGeocoder relies on BeautifulSoup since it assumes wiki pages can be malformed.

Remember, if you have easy_install, you can simply type this to install CaseClasses:

sudo easy_install http://opensource.case.edu/svn/CaseClasses/python/trunk

After this geocoding toolbox is complete, I'll see if I can make that Case geocoder web service we talked about on the forum.

Update: Per Greg's comment, support for reading coordinates from semantic attributes is now in Case.Geocode (along with some other small improvements).

CaseWikiGeocoder is now defined as:

class CaseWikiGeocoder(SemanticMediaWikiGeocoder):
    def __init__(self):
        base = super(CaseWikiGeocoder, self)
        base.__init__("http://wiki.case.edu/%s",
                      attributes=['Geographical coordinate'],
                      relations=['Located in'], prefer_semantic=True)