
November 19, 2009

Memento: Time Travel for the Web : OCLC Research Distinguished Seminar Series Presentation

The topic of web archiving is enough to make your head spin, or at least make you feel like you are at the bottom of a very large iceberg. Herbert Van de Sompel (from Los Alamos National Laboratory) spoke this morning at OCLC about Memento, a current project focusing on some of the issues relevant to web archiving. I found a similar talk, along with PowerPoint slides from this morning's talk, that include some visual representations of the underpinnings on the programming side.

Van de Sompel mentioned some other efforts to archive previous versions of websites, such as the Internet Archive. While that site has captured websites since 1996 (for example, an early Case Western Reserve homepage from the Internet Archive), the captures took place rather intermittently. Van de Sompel spoke about integrating navigation: providing a means to combine multiple manifestations of a page (particularly news content) in a way that is easier to navigate, or in his terms 'transparent content negotiation'. There were also the terms 'time gate' and 'time map', which may sound more like something out of a science fiction book. What was really interesting about Van de Sompel's lecture was that it addressed not only the issues of navigating multiple versions of content, but also the display and content of the changing websites. His work on Memento attempts to correlate these diverse efforts into a single feed and display that is capable of accounting for multiple time points. This creates a method to navigate these web pages over time, and is certainly less clunky than the Internet Archive. A demo is currently up on the Memento project site.
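To make the 'time gate' and 'time map' terms a little more concrete, here is a rough Python sketch of the idea as I understood it: a time map lists the archived copies of a page with their capture times, and a time gate picks the copy closest to the date the reader asks for. Everything in it is an assumption for illustration. The URIs and the function names are made up, and the `Accept-Datetime` header name is the one used in the later Memento specification, not necessarily what was shown in the talk.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical time map: capture datetimes paired with made-up archive URIs.
TIME_MAP = [
    (datetime(1996, 12, 20, tzinfo=timezone.utc), "http://archive.example/1996/home"),
    (datetime(2003, 6, 1, tzinfo=timezone.utc), "http://archive.example/2003/home"),
    (datetime(2009, 11, 1, tzinfo=timezone.utc), "http://archive.example/2009/home"),
]


def accept_datetime_header(when):
    """Build a request header carrying the wanted datetime as an HTTP date."""
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}


def time_gate(time_map, when):
    """Return the URI of the archived copy captured closest to `when`."""
    closest = min(time_map, key=lambda entry: abs(entry[0] - when))
    return closest[1]


wanted = datetime(2004, 1, 15, tzinfo=timezone.utc)
print(accept_datetime_header(wanted))
print(time_gate(TIME_MAP, wanted))
```

In a real setup the negotiation would happen over HTTP between the browser and the archive; the sketch only shows the "closest capture" selection step on local data.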

Related article: Memento: Time Travel for the Web

Posted by vad17 at 07:52 PM

November 14, 2009

TEI conference, day 2

(notes from Day 2)

Computational Work with Very Large Text Collections: Google Books, HathiTrust, the Open Content Alliance, and the Future of TEI [slides] (Gallery, Hatcher Graduate Library North) Speaker: John M. Unsworth

MorphAdorner

Unsworth spoke on integrating research tools/other databases into a single interface, offering faceted browsing. He also differentiated between high-level research and "non-consumptive research" (e.g. image analysis, textual analysis, citation extraction, indexing).

Unsworth poses the question: does there exist a marriage of convenience between computer science and the humanities? (a doctored image of Christopher Columbus and Pocahontas- have the two worlds collided?)

Micropapers - (5 min mini presentations)

DeReKo goes P5: Customizing TEI P5 for the Mannheim German Reference Corpus- Andreas Witt- Database of written contemporary language, ca. 3 3/4 billion words (+300 million words added every year)
XCES was used initially; internal usage only in the beginning
using P5 now

The Chicago Foreign Language Press Survey in TEI- Douglas Knox

Transcribed foreign language press; a number of languages surveyed- basic encoding. Using XSLT to expand on basic encoding. Project also points to a taxonomy created specifically from terms in the survey (serves as a way to correct some of the issues, such as misspellings, authority name records, etc.)


Evolving TEI standards and the burdens of digital project maintenance
-Andrew Jewell
Beginning to think about the transition from P4 to P5 with the Willa Cather archive (which is almost completely in TEI). When/how to migrate- how to make the decision for conversion, particularly with other migrations likely to occur in the future? Jewell states that in the digital realm, 'stability is an illusion'- something will always be changing down the road. How to make these long-term decisions about content?


The role of TEI in large text-analysis projects
-Brian Pytlik Zillig
Uses Abbot software for the project. Refers to the 'gated communities' of larger digital libraries (HathiTrust, etc.)

TEI documentation and the need to be responsive and accessible to a varied user community -Brett Barney
The difficulty of figuring out some of the more complex tags (restore)

How can the researcher turned digital project decipher the P5 guidelines- where does the computer science take over? Is this tangible for researchers/humanists to use as well?
How to make TEI more tangible and legible for consumption by a larger audience?


TEI in the classroom, with emphasis on the need for mark up that engages student interpretive interests
-Amanda Gailey
From the perspective of an English prof making applications in a classroom setting-
How to merge 2 worlds- mass digitization w/ literature
How to create meaningful projects
How to make this more approachable to non-techies?
Are there more learning environments and workshops to address this?

Also posed some larger issues on the TEI subject from discussion:
Can we study how TEI projects are used/researched (to what level of encoding, for example- basic?)

How to logistically keep up with levels/coding - how much time to spend on conversion, every level- every time you upgrade? or not?
where is your text going? do you want it to be conformant with other projects/digital repositories

How to sustain small TEI projects- where should they go? who will store these? curate? track?

Posted by vad17 at 06:40 PM

November 13, 2009

Text encoding in the era of mass digitization, TEI conference 2009, Ann Arbor, MI

The 2009 TEI conference currently going on in Ann Arbor, MI has some great sessions and speakers on the program, focusing on some of the larger digital libraries and text encoding projects. There are a good number of specialists and practitioners from across the States at the conference representing diverse projects, and I should add, quite a few international participants and speakers. I am quite excited to be here at the conference as a co-presenter on a segment on Day 1 of the conference, along with my colleagues Rich Wisneski and Stephanie Pasadyn. Our project spoke more to the smaller TEI projects and libraries, and we immediately found the small handful of attendees from similar institutions and projects just getting off the ground.

Day 1:

So far, I've found this group to be an interesting mix of librarians, programmers, library directors, historians and independent researchers.

Virtual Research Environments in the Humanities: Challenges and New Developments with a Focus on Europe, presented by Elmar Mittler. He addressed many of the larger European projects, such as Europeana and an earlier digital Gutenberg project (perhaps to illustrate an early project, and show how far encoding has progressed since then).

Mr. Mittler directed most of his speech to the larger digital repositories, and to how the content will be sustained (and who will be responsible for it). He raised concern over central funding in projects such as Europeana, which combines records from many diverse contributors. He went into depth on the growing online digital research resources and databases such as Clarin (language resource), Dariah (digital resource for arts/humanities) and GÉANT.

Some other European interest along the same lines: European Strategy Forum on Research Infrastructures
----------------------

Between the folds. A hybrid model for online publishing projects integrating philology, mass document repositories, and automatic text analysis, presented by Thomas Crombez (University of Antwerp). This presentation was interesting to me, in that the collection of objects in the digital collection was fairly concentrated and specialized (Polish theater programs), but incorporated some other search engines and similarly encoded digital objects. The programs have been coded, with special attention given to the fields important in this type of material (director, theater companies, titles, dramatists, poets, writers, composers). These coded fields are listed along the right side of each program, linking any field to all the other mentions of name, place, person, etc. There is also a list of items related to the particular program, rated by relevancy in the key words, and also pulling results from this website of related digital works. What this project is looking for with its use of TEI is a more refined way to develop well-formed queries, and a method for users to quickly and easily find related items.

Mr. Crombez also spoke to developing a "service-oriented arc", pulling in outside, related digital content in a meaningful manner. This relies on bridging data that is able to pull these relations together (similarly coded, same metadata schema, same level of mark-up). He also discussed how to look forward with regard to annotated text, bridging the gap between digital content.
--------------------
TEI stand-off architecture of the National Corpus of Polish, presented by Piotr Banski (University of Warsaw). The National Corpus of Polish is a collection of texts encoded especially for research in linguistics of the Polish language. Their project, especially in comparison to the other projects using a lower level of encoding or creative works, is much more technical in the encoding process. I was interested to see the variety of the source material used in their project- classic literature, contemporary newspapers and more ephemeral materials like the spoken word (transcribed) and internet texts. Mr. Banski presented some of the problems in tagging word structure, which I would imagine is especially problematic with slang or changes in word use over time. The project uses XPointer, which Banski was quick to point out has some issues in its application and use in a project such as theirs.
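For anyone unfamiliar with the "stand-off" idea, here is a minimal Python sketch of the general principle: the source text is left untouched, and each layer of annotation lives separately and points back into it. This is only my simplified illustration, not the corpus's actual format. The real project works with XML files and XPointer references, whereas the sample sentence, the layer names and the plain character offsets below are all assumptions.

```python
# The base text is never modified; annotations are kept apart from it.
TEXT = "Ala ma kota"

# Hypothetical segmentation layer: segment id -> (start, end) character offsets.
SEGMENTS = {
    "seg1": (0, 3),
    "seg2": (4, 6),
    "seg3": (7, 11),
}

# Hypothetical morphosyntactic layer: points at segments by id, not at the text.
MORPH = [
    {"target": "seg1", "pos": "noun"},
    {"target": "seg2", "pos": "verb"},
    {"target": "seg3", "pos": "noun"},
]


def resolve(text, segments, annotation):
    """Follow an annotation's pointer back to the stretch of text it describes."""
    start, end = segments[annotation["target"]]
    return text[start:end]


def tagged_tokens(text, segments, morph):
    """Pair every annotated segment with its part-of-speech label."""
    return [(resolve(text, segments, a), a["pos"]) for a in morph]


print(tagged_tokens(TEXT, SEGMENTS, MORPH))
```

The appeal of this arrangement is that several layers can describe the same text without their tags colliding; the cost is that every pointer has to stay valid, which is where a pointing mechanism can become a weak spot.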

more to come on Day 2.........

Posted by vad17 at 02:13 PM