
December 03, 2010

Week 6: Web Archiving, part 1: Harvesting and Access Tools

Markus Enders, “A METS-Based Information Package for Long Term Accessibility of Web Archives,” iPRES ’10, Vienna, Austria, 2010

The British Library's web archive was examined; it uses an underlying METS profile to disseminate complex digital objects. Multiple tools were used in this project: Heritrix, the Wayback Machine, and the Web Curator Tool. PANDAS was used originally, since it was the only tool available at the time; the Web Curator Tool proved the better fit because its structure follows the OAIS model.

ARC file format - multiple concatenated archive files representing change over time. This introduces difficulties in capturing and storing the multiple components inherent to web content. Even more issues emerge over time: new formats, new display and style mechanisms (CSS, JavaScript), and new transport protocols will further complicate matters. Is there a way to standardize submission materials? For these reasons, a data model should be flexible and extensible enough to adapt to these questions. AIPs as defined in the OAIS model will need to support migration of file format types and also track change in the respective metadata file.
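The concatenated-record layout described above can be illustrated by pulling apart a single WARC record by hand. A minimal sketch (the sample record and its header values are invented for illustration; real archives hold many such records back to back):

```python
# Hypothetical single WARC record: a version line, named headers,
# a blank line, then the payload (here a tiny response body).
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2010-12-03T11:04:00Z\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
)

# Split the header block from the payload at the first blank line.
header_block, _, body = record.partition(b"\r\n\r\n")
lines = header_block.decode("utf-8").split("\r\n")
version = lines[0]                                   # "WARC/1.0"
headers = dict(line.split(": ", 1) for line in lines[1:])

print(version, headers["WARC-Target-URI"], body.decode())
```

A full archive file is just many such records concatenated, which is why a harvester can append new captures of the same URI over time and an access tool like the Wayback Machine can index them by WARC-Date.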

The British Library has defined levels of description for each component involved in the process: the website as a whole, each webpage (part of the whole), and every associated object (the digital content within each page). These have been defined as object types within the METS record. The WARC container is also described within the METS record, and technical and provenance information can be included and defined in a set structure.
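The three levels of description can be sketched as a nested METS structMap. The element names below follow the METS schema itself; the TYPE values and the file ID are illustrative placeholders, not the British Library profile's actual vocabulary:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# website -> webpage -> object, mirroring the levels of description.
mets = ET.Element(f"{{{METS_NS}}}mets")
struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
site = ET.SubElement(struct_map, f"{{{METS_NS}}}div", TYPE="website")
page = ET.SubElement(site, f"{{{METS_NS}}}div", TYPE="webpage")
obj = ET.SubElement(page, f"{{{METS_NS}}}div", TYPE="object")
# fptr would point at the captured file inside the WARC container
# (the FILEID here is a made-up placeholder).
ET.SubElement(obj, f"{{{METS_NS}}}fptr", FILEID="file-0001")

xml_out = ET.tostring(mets, encoding="unicode")
print(xml_out)
```

The nesting is the point: each div inherits its context from the level above, so a single object can always be traced back to the page and site it was harvested from.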

One open issue in the METS format is how to record the DIP (Dissemination Information Package).

Tracy Seneca and Shifra Pride Raffel, “NDIIPP Web-at-Risk: The Development of a Web Archiving Service at the California Digital Library,” 2006.

Defining "at-risk" web content: for example, .gov sites with huge volumes of text, forms, and media, compounded by the factor of constant change (a survey that went into more depth: Web-Based Government Information: Evaluating Solutions for Capture, Curation, and Preservation).

The Web Archiving Service (WAS) from the California Digital Library was examined.

Six months into the project, Hurricane Katrina struck New Orleans. The project immediately turned its efforts to collecting over 600 seed URLs on a range of topics, testing the program's capabilities in a situation of potential loss.

Framework and assessment tools were designed to accommodate the changes needed as more studies were made. XML-formatted parameters were fed into the capture phase and filtered into METS.

Tools and Resources

Alex Ball, “Web Archiving,” Digital Curation Centre, March 2010
Web Curator Tool
Archive-It (Internet Archive)
Web Archiving Service (California Digital Library)
CONTENTdm Web Harvester service (OCLC)

Posted by vad17 at December 3, 2010 11:04 AM
