November 13, 2009

Text encoding in the era of mass digitization, TEI conference 2009, Ann Arbor, MI

The 2009 TEI conference, currently under way in Ann Arbor, MI, has some great sessions and speakers on the program, focusing on some of the larger digital libraries and text encoding projects. A good number of specialists and practitioners from across the States are here representing diverse projects, along with quite a few international participants and speakers. I am quite excited to be here as a co-presenter on a segment on Day 1, along with my colleagues Rich Wisneski and Stephanie Pasadyn. Our presentation spoke more to the smaller TEI projects and libraries, and we quickly found the small handful of attendees from similar institutions whose projects are just getting off the ground.

Day 1:

So far, I've found this group to be an interesting mix of librarians, programmers, library directors, historians and independent researchers.

Virtual Research Environments in the Humanities: Challenges and New Developments with a Focus on Europe, presented by Elmar Mittler. He addressed many of the larger European projects, such as Europeana and an earlier digital Gutenberg project (perhaps to illustrate an early effort and show how far encoding has progressed since then).

Mr. Mittler directed most of his speech to the larger digital repositories: how they will be sustained (and who will be responsible for sustaining them), and the concern over central funding in projects such as Europeana, which combines records from many diverse contributors. He went into depth on the growing body of online research resources and databases such as Clarin (language resource), Dariah (digital resource for arts/humanities) and GÉANT.

Another European effort along the same lines: the European Strategy Forum on Research Infrastructures.
----------------------

Between the folds. A hybrid model for online publishing projects integrating philology, mass document repositories, and automatic text analysis, presented by Thomas Crombez (University of Antwerp). This presentation was interesting to me in that the collection of objects in the digital collection was fairly concentrated and specialized (Polish theater programs), but incorporated some other search engines and similarly encoded digital objects. The programs have been coded, with special attention given to the fields important in this type of material (director, theater companies, titles, dramatists, poets, writers, composers). These coded fields are listed along the right side of each program, linking any field to all the other mentions of that name, place, person, etc. There is also a list of items related to the particular program, ranked by keyword relevance, which also pulls in results from this website of related digital works. What this project is looking for with its use of TEI is a more refined way to develop well-formed queries, and a method for users to quickly and easily find related items.
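
To make the linking idea concrete, here is a minimal sketch of how tagged names could be turned into cross-links between programs. It is not the project's actual code or markup, which were not shown in detail: the three program records, the names in them, the use of `persName` with a `role` attribute, and the helper names `build_index` and `related_programs` are all assumptions for illustration. The ranking simply counts shared tagged names, which is only a stand-in for whatever keyword relevance measure the project really uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified program records (invented for this sketch).
PROGRAMS = {
    "prog-001": """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
        <title>Hamlet</title>, directed by
        <persName role="director">Jan Kowalski</persName>, text by
        <persName role="dramatist">William Shakespeare</persName>
    </p></body></text></TEI>""",
    "prog-002": """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
        <title>Concert evening</title>, directed by
        <persName role="director">Jan Kowalski</persName>, music by
        <persName role="composer">Anna Nowak</persName>
    </p></body></text></TEI>""",
    "prog-003": """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
        <title>Macbeth</title>, directed by
        <persName role="director">Jan Kowalski</persName>, text by
        <persName role="dramatist">William Shakespeare</persName>
    </p></body></text></TEI>""",
}


def build_index(programs):
    """Map each tagged personal name to the set of program ids mentioning it."""
    index = defaultdict(set)
    for prog_id, xml_text in programs.items():
        root = ET.fromstring(xml_text)
        for el in root.iterfind(".//tei:persName", TEI_NS):
            # Collapse line breaks and runs of spaces inside the element.
            name = " ".join("".join(el.itertext()).split())
            index[name].add(prog_id)
    return index


def related_programs(index, prog_id):
    """Rank other programs by how many tagged names they share with prog_id."""
    scores = defaultdict(int)
    for ids in index.values():
        if prog_id in ids:
            for other in ids - {prog_id}:
                scores[other] += 1
    # Highest score first; ties broken by program id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `related_programs(build_index(PROGRAMS), "prog-001")` ranks `prog-003` ahead of `prog-002`, because it shares two tagged names with `prog-001` rather than one. The same index could drive the sidebar links, since each name maps to every program that mentions it.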

Mr. Crombez also spoke to developing a "service-oriented architecture", pulling in outside, related digital content in a meaningful manner. This relies on bridging data that can pull these relations together (similar coding, the same metadata schema, the same level of mark-up). He also looked ahead to annotated text and to bridging the gap between separate bodies of digital content.
--------------------
TEI stand-off architecture of the National Corpus of Polish, presented by Piotr Banski (University of Warsaw). The National Corpus of Polish is a collection of texts encoded especially for linguistic research on the Polish language. Compared to the other projects, which use a lower level of encoding or deal with creative works, their encoding process is much more technical. I was interested to see the variety of source material used in their project: classic literature, contemporary newspapers, and more ephemeral materials like the spoken word (transcribed) and internet texts. Mr. Banski presented some of the problems in tagging word structure, which I would imagine is especially problematic with slang or changes in word use over time. The project uses XPointer, which Mr. Banski was quick to point out has some issues when applied in a project such as theirs.
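
For readers new to the term, stand-off annotation means the base text is stored untouched and each annotation layer points into it (or into another layer) instead of wrapping the text in tags. The sketch below illustrates only that general idea under my own assumptions. It uses plain character offsets rather than XPointer expressions, and the sample sentence, segment ids, tag names and function names are invented here; they are not the corpus's actual format or tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str
    start: int  # character offset into the base text, inclusive
    end: int    # character offset into the base text, exclusive


# The base text is kept as-is; nothing is embedded in it.
BASE_TEXT = "Ala ma kota."

# Layer 1 (hypothetical): segmentation, pointing into the base text.
SEGMENTS = [
    Segment("seg_1", 0, 3),
    Segment("seg_2", 4, 6),
    Segment("seg_3", 7, 11),
    Segment("seg_4", 11, 12),
]

# Layer 2 (hypothetical, simplified tags): points at segment ids, not at the text.
TAGS = {"seg_1": "noun", "seg_2": "verb", "seg_3": "noun", "seg_4": "punct"}


def resolve(text, segments, tags):
    """Follow the pointers: tag -> segment -> substring of the base text."""
    by_id = {s.seg_id: s for s in segments}
    return [(text[by_id[sid].start:by_id[sid].end], tag) for sid, tag in tags.items()]


def out_of_range(text, segments):
    """Return ids of segments whose offsets no longer fit inside the text."""
    return [s.seg_id for s in segments
            if not (0 <= s.start <= s.end <= len(text))]
```

Here `resolve(BASE_TEXT, SEGMENTS, TAGS)` pairs each word or punctuation mark with its tag without the base text ever being modified. The `out_of_range` check hints at one general fragility of any pointer-based scheme: if the base text changes, the pointers can go stale. I am not claiming that this is one of the specific XPointer issues raised in the talk.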

More to come on Day 2...

Posted by vad17 at November 13, 2009 02:13 PM
