Raghu Ramakrishnan - The World Online
Today felt like being back in taught courses, with three seminars at UW in one day. Once again this is very convenient, but I'll be spreading the write-ups out so as not to take up too long a block of time with them.
The CSE department had Raghu Ramakrishnan of Yahoo! Research talking about the social, participatory web, which as far as I can tell is Yahoo!'s prime focus these days. He gave several examples of relatively well-known participatory sites, some of which Yahoo! owns (Flickr, Yahoo! Answers, Yahoo! Groups), and some of which it probably wishes it did (Freecycle and YouTube), to illustrate the general area he's working in, and then spoke about some of the challenges and benefits of sites that are heavily based on user-supplied content.
The shift that people generally associate with these newer, participation-heavy sites is that a much larger proportion of viewers also contribute some content; for instance, I use Flickr both to upload my own photos and to view other people's. From the system designers' point of view, though, Ramakrishnan talked in terms of two shifts that this causes: getting content is no longer a problem but organising it is, and it's possible to do that organising based on the structure of how all the content is linked rather than simply on keywords. The shift from keywords to structures was already happening, but user-submitted content tends to be much more heavily interlinked than proprietary content (because, for example, each newspaper's website tries to be self-contained and keep eyeballs on its own advertising), so there is an ever-richer supply of structural information to be used. He didn't say as much explicitly, but this is also a prime motivator behind Yahoo!'s strategy of acquiring interesting Web 2.0 companies: the more of these are kept 'under one roof', the richer the additional information they can harvest from users' behaviour.
The rest of the talk was devoted to different approaches to linking and classifying all this information to make it useful. The most obvious such approach is to encourage users to do the job explicitly. This works in places, but there has to be some motivating reason for people to submit the information. One such example is Flickr tags; users tag photos either to drive traffic to them or to organise their own pictures, and the overall effect is to produce a database of words associated with images that Flickr then gets to mine for further relationships, such as clusters within tags.
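To make the idea of mining tags for relationships a bit more concrete, here is a minimal sketch of one way it could be done: count how often pairs of tags appear on the same photo and treat frequently co-occurring tags as related. The photo IDs, tags, function names and the `min_count` threshold are all made up for illustration; this is my own guess at the flavour of the technique, not anything Ramakrishnan presented or how Flickr actually does it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: photo id -> set of user-supplied tags.
photos = {
    "p1": {"seattle", "skyline", "night"},
    "p2": {"seattle", "skyline", "spaceneedle"},
    "p3": {"seattle", "rain"},
    "p4": {"cat", "kitten"},
    "p5": {"cat", "kitten", "cute"},
}

def cooccurrence(photos):
    """Count how many photos carry each (alphabetically ordered) pair of tags."""
    counts = Counter()
    for tags in photos.values():
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts

def related_tags(photos, tag, min_count=2):
    """Tags that share at least min_count photos with the given tag."""
    related = set()
    for (a, b), n in cooccurrence(photos).items():
        if n < min_count:
            continue
        if a == tag:
            related.add(b)
        elif b == tag:
            related.add(a)
    return related

print(related_tags(photos, "seattle"))
```

On this toy data, "seattle" comes out as related to "skyline" because they share two photos. A real system would presumably need something more robust than raw counts, since very popular tags co-occur with almost everything.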
Slightly less obvious, but along similar lines, is Luis von Ahn's approach of setting up games that are fun to play and that produce useful data as a by-product. Ramakrishnan talked about the ESP Game, which von Ahn set up and which has (according to the statistics here) acquired over a million data points for each month it has been running. As with the Flickr tags, the object is not to collect the photos themselves (there's an enormous supply of those online already) but the metadata that allows those photos to be searched.
Finally, if the information to be categorised is text or has useful text associated with it (such as descriptive URLs, as opposed to the auto-generated ones on a site like Flickr), a lot of the information can be harvested automatically. Ramakrishnan demonstrated DBLife, an automatically compiled portal for database researchers along these lines. DBLife uses fairly simple heuristics to scrape information from sources like researchers' own homepages and conferences' lists of contributors, and then links all the information together to come up with lists of who collaborates with whom, who focuses on which problem areas, and so on. It makes some of what it's doing very explicit (for instance, see DBLife's marked-up version of Ramakrishnan's homepage), and it also provides opportunities for people to correct faulty data. I think that last point is quite an important one: it's not actually necessary for the automated process to be perfect, as long as it's good enough to (a) be useful, and (b) leave no more work for humans to do than can realistically be done. A flawed auto-generated portal is still going to be much less work to fix than a whole portal would have been to create in the first place, after all.
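As a toy illustration of the "simple heuristics plus human corrections" idea, the sketch below pulls author names out of some scraped lines with a deliberately crude rule, links co-authors together, and then applies a list of corrections supplied by people. Everything in it is an assumption on my part: the `Title | Author, Author` line format, the names, and the functions `parse_authors` and `collaborators` are invented for the example and say nothing about how DBLife really works.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical scraped lines in the form "Title | Author One, Author Two".
scraped = [
    "Paper A | Alice Smith, Bob Jones",
    "Paper B | Alice Smith, Carol Wu",
    "Paper C | Bob Jones, Alice Smith, Dan Lee",
    "Not a paper line at all",
]

def parse_authors(line):
    """Crude heuristic: authors are the comma-separated names after '|'."""
    if "|" not in line:
        return []  # the heuristic gives up on lines it does not recognise
    _, _, names = line.partition("|")
    return [n.strip() for n in names.split(",") if n.strip()]

def collaborators(lines, corrections=()):
    """Map each author to the set of people they appear alongside."""
    graph = defaultdict(set)
    for line in lines:
        for a, b in combinations(parse_authors(line), 2):
            graph[a].add(b)
            graph[b].add(a)
    # Human-supplied corrections: pairs flagged as wrong are removed.
    for a, b in corrections:
        graph[a].discard(b)
        graph[b].discard(a)
    return graph

graph = collaborators(scraped)
fixed = collaborators(scraped, corrections=[("Alice Smith", "Dan Lee")])
print(sorted(graph["Alice Smith"]))
print(sorted(fixed["Alice Smith"]))
```

The point of the `corrections` argument is the one made above: the extraction step can afford to be sloppy as long as fixing its mistakes is a small job compared with building the whole collaboration list by hand.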