
November 29, 2010

Week 5: Migration and Emulation Tools

Selecting the Right Preservation Strategy: Migration

Migration is the copying or conversion of digital objects from one format to another, preserving their most significant properties; it can affect both the hardware and the software involved in rendering the object. Within OAIS, migration is defined as a transformation of the Archival Information Package (AIP), carried out according to migration plans developed by the Administration function. Ideally migration should be reversible at any point in time (analogous to preservation treatments of physical media), though in practice a certain level of "acceptable loss" is tolerated; that threshold should also be defined by the Administration.

Decisions for the Administration of a digital archive include whether to keep proprietary formats (and if or when to upgrade), which is harder because proprietary markets and the direction of future software versions are difficult to predict. As multiple migrations accumulate, a fully reversible file becomes almost impossible as more and more editions of a piece of software are introduced (and who will keep up with those old software editions over time?). The Administration also needs a plan for incorporating new versions into the repository's workflow.

Many repositories limit the number of formats supported within the digital library to those which "embody the best overall compromise among characteristics like functionality, longevity and preservability." This means defining the format types and the specifications of each format (e.g. uncompressed Baseline TIFF), committing to support the selected formats indefinitely, and defining how to conform other incoming formats to the selected ones (normalisation, using a tool such as Xena). Institutions can also run a technology watch facility to identify formats at risk of obsolescence.
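
A rough sketch of what normalisation on ingest might look like, assuming a hypothetical identify_format() helper (in practice a tool like DROID or JHOVE would do the identification) and ImageMagick's convert command for the conversion; the supported-format list is purely illustrative, not any repository's actual policy:

```python
import subprocess
from pathlib import Path

# Formats the repository has committed to support indefinitely (illustrative).
SUPPORTED_FORMATS = {"image/tiff", "text/plain", "application/pdf"}

def identify_format(path: Path) -> str:
    """Hypothetical stand-in for a format-identification tool such as DROID or JHOVE."""
    # Faked here from the file extension; a real workflow would inspect file signatures.
    return {".tif": "image/tiff", ".tiff": "image/tiff",
            ".jpg": "image/jpeg", ".txt": "text/plain"}.get(path.suffix.lower(), "unknown")

def normalise(path: Path, out_dir: Path) -> Path:
    """Convert an incoming file to a supported format, or keep it as-is."""
    fmt = identify_format(path)
    if fmt in SUPPORTED_FORMATS:
        return path  # already in a preservation format
    target = out_dir / (path.stem + ".tiff")
    # Assumes ImageMagick's `convert` is installed; only sensible for image inputs.
    subprocess.run(["convert", str(path), str(target)], check=True)
    return target
```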

There are a few types of migration plans: migrate when newer formats or software become available, or migrate when requested by users (as explored in the CEDARS and CAMiLEON projects). The latter approach is problematic since it depends on continual use of and access to the repository, which may not be easy to assume.
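
Migration on request could be sketched roughly as below: the archived original is never altered, and a converted access copy is produced (and cached) only when a user actually asks for it. The cache layout and the convert_to() helper are assumptions made for illustration, not part of either project:

```python
from pathlib import Path

CACHE = Path("access_cache")

def convert_to(original: Path, target_fmt: str, dest: Path) -> None:
    """Hypothetical converter; in practice this would call a real migration tool."""
    dest.write_bytes(original.read_bytes())  # placeholder: real code would transform

def deliver(original: Path, target_fmt: str) -> Path:
    """Return an access copy in the requested format, converting on first request."""
    CACHE.mkdir(exist_ok=True)
    access_copy = CACHE / f"{original.stem}.{target_fmt}"
    if not access_copy.exists():      # migrate only objects that are actually used
        convert_to(original, target_fmt, access_copy)
    return access_copy                # the archived original is left untouched
```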

Advantages and disadvantages of migration:
Advantages

It is a widely used strategy and procedures for simple migration are well established.
It is generally a reliable way to preserve the intellectual content of digital objects and is particularly suited to page-based documents.
Conversion software for some formats is readily available.
Disadvantages

It requires a large commitment of resources, both initially and over time. Migration at the point of obsolescence is labour intensive unless it can be automated, because formats evolve so rapidly; as collections grow, the work involved in migration also increases. The migration on request approach may mitigate this to some extent, in that migration is not carried out on digital objects which may not be used; standardisation of formats also makes batch migration easier.
Some of the data or attributes (e.g. formatting) of the digital object may be lost during migration; the authenticity of the record may then be compromised. In particular, there is likely to be a significant loss of functionality in the case of complex digital objects. Migration is based on the assumption that content is more important than functionality or look and feel.
The potential loss of data and attributes may compromise the integrity and authenticity of a digital object, which is a major issue for digital archivists.
There may be potential IPR problems if either the source or the new format is proprietary, although these are unlikely to be as prohibitive as they might be in the case of emulation. It is unclear yet whether the Gowers Review, published in December 2006, will mitigate the problem of IPR: Recommendation 10b of this report states that by 2008 libraries in the UK should be enabled to format shift archival copies to ensure that records do not become obsolete.
Specialised conversion tools are needed to convert digital objects from one format to another, and if no appropriate tool is available for a specific file format, developing a customised migration system can be complex and expensive, although costs could be shared with institutions wishing to perform the same migration.
Selecting the Right Preservation Strategy: Emulation

Emulation uses current means to mimic the environment in which a digital object was originally rendered, by emulating the software applications, operating system and/or hardware. It has not yet been tested to preservation standards, but it might be a better solution for complex digital objects such as websites or data sets. Aspects such as execution speed, display resolution, colour and input devices like a keyboard or mouse can be controlled to reflect the original experience more faithfully. Emulation differs from migration in that the original file and the original software are retained, and the hardware environment is recreated by the emulator. (A fuller preservation strategy? Though trickier to maintain, since the specifications of the original software and hardware, as well as the emulator needed to run the original programs, must all be kept viable over time.)

Virtual machine approach: usually based on Java, which is tricky in a digital preservation setting, since Java itself changes frequently.

Universal Virtual Computer (UVC): developed by IBM, this approach preserves the bitstream of a digital object along with a specially written decoding program that runs on a platform-independent virtual computer. In theory, UVC programs can be written for each file format. The program describes in detail how the digital object is structured; raster-based images, for example, are described pixel by pixel.
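
The UVC idea of decoding a file into a format-neutral "logical view" can be illustrated very loosely as below; this is not the actual UVC instruction set or IBM's data model, just a toy decoder for a raw greyscale image described pixel by pixel:

```python
def describe_raw_greyscale(bitstream: bytes, width: int, height: int):
    """Toy 'decoder': emit a format-neutral, pixel-by-pixel description of an image.

    A real UVC decoder would be written in the UVC's own instruction set and would
    emit a Logical Data View defined for the format; this only shows the idea.
    """
    assert len(bitstream) == width * height, "expects 8-bit greyscale with no header"
    for y in range(height):
        for x in range(width):
            yield {"x": x, "y": y, "grey": bitstream[y * width + x]}

# Example: a 2x2 image stored as four raw bytes.
for pixel in describe_raw_greyscale(bytes([0, 64, 128, 255]), width=2, height=2):
    print(pixel)
```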

Advantages and disadvantages of emulation:

Advantages

In theory full emulation enables us to recreate the full functionality and exact look and feel of a digital object's performance. It is therefore an attractive approach for preserving complex digital objects and those where appearance or functionality are identified as significant properties.
In contrast to migration, the focus of emulation is on changing the environment rather than the digital object itself, thus lessening the risk of data loss through repeated migration cycles.
Oltmans and Kol have concluded that emulation is more cost-effective for preserving large collections, despite the relatively high initial costs for developing an emulation device; in contrast, migration applies to all the objects in a collection repetitively, creating high ongoing costs. However, the need for chaining emulators in the future may detract from this.
The emulation approach can be implemented at a higher level than the migration approach, so rather than developing conversion solutions per format institutions can develop emulation solutions per environment.
It means that records in obscure formats do not have to be abandoned; in theory if the creating hardware/software can be emulated, all the records created in that environment can be recreated.
Regardless of the principal preservation approach adopted by a digital repository, emulation could be useful as a backup mechanism that would provide access to the 'digital original' form of each record and may be necessary for the extraction of digital objects from older technological environments.
Disadvantages

As yet, emulation has not been widely tested as a long-term digital preservation strategy, and further practical tests are essential before more definitive conclusions about its reliability can be drawn.
An emulation system may require the user to master completely unfamiliar technology in order to understand an archival digital record, and technological developments are incredibly rapid; for instance, many have already forgotten how to use relatively recent word processing programs like WordStar. This problem could potentially be addressed by developing different means or levels of access.
Selecting an emulation strategy also involves buying into a migration strategy because emulators themselves become obsolete, so it becomes necessary to replace the old emulator with a new one, or to create a new emulator that allows the old emulator to work on new platforms.
Most emulation approaches will involve preserving or emulating proprietary software which is covered by patent, licence or other IPR. This is a major issue and must be addressed by any institution introducing an emulation strategy; it is unclear yet whether the Gowers Review will alter this situation.
The concept of 'exact original look and feel' is itself debatable; can it therefore be preserved by emulation? Digital objects are so dependent on the environment used to render them; for instance, a user's experience of a website can differ according to what software and hardware they are using.
Emulation may require a large commitment in resources, and highly skilled computer programmers would be needed to write the emulator code.
If the UVC approach is used, large numbers of decoder programs will be necessary to cope with the variety of file formats that are available, and it may be that new UVC emulators need to be written for each new generation of hardware.
"Systematic Characterisation of Objects in Digital Preservation: The eXtensible Characterisation Languages," Christoph Becker et al.

Validation is needed in any action that transforms file formats, to ensure that the intellectual content remains the same. Much of this work is currently done manually, which is not practical for larger collections. The article describes "eXtensible Characterisation Languages (XCL) that support the automatic validation of document conversions and the evaluation of migration quality by hierarchically decomposing a document and representing documents from different sources in an abstract XML language."
The article mentions a survey among archiving professionals: the 100 Year Archive Task Force survey, 2007.

The consequences of migration are sometimes subtle losses of functionality or content. Think of converting most word processing files to other formats such as ODT or PDF: depending on the file, the loss of functionality might be acceptable (for a final version), but are the footnotes and links preserved? Deciding which pieces of information are important is sometimes difficult.

Variation in the quality of conversion is very high: layout in word processing files in particular may be lost, as may footnotes or references. Layout may not be an issue, but missing footnotes would certainly affect the overall meaning of a document. These things are not easily measurable in an automated way, so acceptable loss needs to be defined.
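
In the spirit of XCL-style characterisation (though not using the XCL tools themselves), one could imagine extracting a handful of measurable properties before and after a conversion and checking them against explicitly defined acceptable-loss thresholds. The properties and tolerances below are invented purely for illustration:

```python
# Characteristics of the source and migrated documents (in a real workflow these
# would come from a characterisation tool, not be typed in by hand).
before = {"pages": 12, "words": 4180, "footnotes": 23, "hyperlinks": 7}
after  = {"pages": 12, "words": 4178, "footnotes": 20, "hyperlinks": 7}

# Acceptable loss, defined per property by the repository's administration.
tolerance = {"pages": 0, "words": 5, "footnotes": 0, "hyperlinks": 0}

def validate(before: dict, after: dict, tolerance: dict) -> list[str]:
    """Return the properties whose loss exceeds the defined tolerance."""
    failures = []
    for prop, original in before.items():
        if abs(original - after.get(prop, 0)) > tolerance[prop]:
            failures.append(prop)
    return failures

print(validate(before, after, tolerance))  # -> ['footnotes']: migration rejected
```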

Planets includes Plato, a digital preservation planning tool, and a method for automating part of the workflow. It also provides evaluation in a standardised testbed setting.

Jeffrey van der Hoeven, et al. “Emulation for Digital Preservation in Practice: The Results,” International Journal of Digital Curation, Issue 2, Volume 2 (2007): 123-132

Early skepticism about emulation was mainly due to technical complexities and constraints, as well as the initial start-up costs for new initiatives. The National Library of the Netherlands started a project in 2004 to deal with the issue of emulation, particularly with regard to complex digital files such as interactive multimedia. The project wrapped up in 2007 and delivered a component-based computer emulator, Dioscuri.

The project broke the digital object down into five attributes: content, context, structure, appearance and behaviour (or functionality), as defined by Rothenberg and Bikson (1999). The importance of each attribute may vary by format type, depending on the requirements of the repository (or on whether a final edition or working files are being considered).

Existing emulators at the time of the project were functional, though issues of migrating emulators also arise over time. Some solutions chain or migrate emulators, though this also risks the functionality of the emulator and the overall performance of the service.

Emulation is made difficult by the constant development of new formats and new versions of the main software packages. This is even tougher with proprietary software, since there is usually a delay before (if ever) the source code is released, and that source code is fundamental to designing user-friendly, up-to-date emulator functionality at any level. (Or, conversely, software may have been out of date for so long that no documentation exists at all; see the small-business proprietary video card example in the article.) Could a digital file be fully understood and processed without a complete software application? How long could any migration or emulation product last before it is outdated? Updates to peripheral features of a product can also affect functionality; the "Emulation for Digital Preservation in Practice" article gives an example of the consequences when the virtual machine is updated while the emulator stays the same.

The different flavours of each software version are also in question: in the Dioscuri set-up, does the open-source substitute for MS-DOS (FreeDOS) change the file or lose data? Dioscuri itself was still limited to a particular window of information (16-bit).

It might be difficult to assure the quality of original files without a substantial amount of work comparing the original file to the file after an emulation session, since some software products will inadvertently apply minor (or, depending on the source file, major) upgrades upon opening the file.
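
A first, crude line of defence against such silent changes is simply to record a checksum of each file before it enters an emulated session and verify it afterwards; it will not say what changed, only that something did. A minimal sketch using only the Python standard library (the filename is illustrative):

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute a SHA-256 checksum for fixity comparison."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

before = sha256(Path("report_1992.wp"))   # hypothetical archived file
# ... file is opened inside the emulated environment, then copied back out ...
after = sha256(Path("report_1992.wp"))

if before != after:
    print("File was altered during the emulation session - inspect manually.")
```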

"Assisted Emulation for Legacy Executables," Kam Woods and Geoffrey Brown, International Journal of Digital Curation, Issue 1, Volume 5, 2010

The article tests the implications of emulation using simple scripts and tools to assure the quality of the process. The case study used an existing collection of 4,000 virtual CD-ROM images containing thousands of custom binary executables. The authors successfully designed "wrappers" around older DOS-based and early Windows GUI applications to generate reports and export data from these binaries, sparing the user from configuring a virtual machine application, installing the original software, or learning installation procedures tied to older hardware. With a small amount of code, the authors used modern applications to check and validate the files and ensure that data was not lost in the process.
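
The assisted-emulation idea can be sketched as a wrapper that writes an emulator configuration pointing at the archived CD-ROM image and then launches the emulator, so the user never edits the configuration themselves. This is only a sketch, not the authors' actual scripts: it assumes DOSBox is installed and on the PATH, and the image path and start command are invented per-item metadata.

```python
import subprocess
import tempfile
from pathlib import Path

def run_assisted(iso_image: Path, start_command: str) -> None:
    """Write a throwaway DOSBox config that mounts the archived CD-ROM image
    and starts the target program, then launch DOSBox with that config."""
    config = f"""
[autoexec]
imgmount d "{iso_image}" -t iso
d:
{start_command}
"""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as conf:
        conf.write(config)
        conf_path = conf.name
    subprocess.run(["dosbox", "-conf", conf_path], check=True)

# Hypothetical collection item: the image path and executable name are illustrative.
run_assisted(Path("/archive/cdroms/census_1990_disc1.iso"), "VIEWER.EXE")
```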

The article describes the nightmare scenario of maintaining multiple hardware environments, an approach heavily reliant on the user's knowledge base (and likely a great deal of time and patience) to reproduce the environment again and again, with hardware requirements and security issues arising with each application. The authors note that issues related to the fonts, drivers or languages available in an operating system can lead to user frustration and potential problems. They identify three main areas where problems arise in emulation: administration, configuration and maintenance.

"Keeping the Game Alive: Evaluating Strategies for the Preservation of Console Video Games," Mark Guttenbrunner et al., International Journal of Digital Curation, Issue 1, Volume 5, 2010
Digital preservation for console games: some companies are trying to preserve and present games, such as the Virtual Console channel on Nintendo's Wii console, alongside research projects like Preserving Virtual Worlds and the National Videogame Archive. Open versus closed platforms, different media, and different game input devices ("controllers") are some of the main hurdles for console game preservation and emulation. Three case studies are examined using the Planets preservation planning approach.

UNESCO guidelines define four ways in which a digital object can be at risk: as a physical object (the storage media), as a logical object (the encoded bitstream), as a conceptual object, and through the essential elements that define its context (metadata).

For emulation, software and hardware components must be preserved along with the digital media. Proposed solutions include the Emulation Virtual Machine (EVM) and the Universal Virtual Computer (UVC). A modular emulation project under development within Planets builds an emulator at the hardware level.

Article summarizes video game and console history.

Documentation may be lacking for certain hardware (say, if the manufacturer goes out of business or the documentation cannot be found), as may the complete game code and the storage media (ROM cartridges, optical media, software). Other components to consider include graphics, audio and special controllers.

Backwards compatibility with older software and hardware systems is needed without affecting the original make-up of the game (e.g. viewing 8-bit graphics on an HD display). Migration strategies include code re-compilation and, in the case of missing or incomplete code, simulation, in which part of the code is reproduced or reinterpreted.

Evaluation of the strategies considered five main factors: object characteristics; infrastructure; process characteristics; costs; and content/data characteristics.
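
Planets-style preservation planning (as in Plato) ultimately weighs alternatives against criteria like these. A toy weighted-scoring pass might look like the following; the weights and scores are invented purely to show the mechanics and are not taken from the article:

```python
# Illustrative weights for the five factor groups (they sum to 1.0 here).
weights = {"object": 0.30, "infrastructure": 0.15, "process": 0.15,
           "cost": 0.15, "content": 0.25}

# Invented 0-5 scores for two candidate strategies for one console platform.
alternatives = {
    "emulator_A": {"object": 3, "infrastructure": 4, "process": 4, "cost": 5, "content": 3},
    "migration":  {"object": 2, "infrastructure": 3, "process": 3, "cost": 2, "content": 4},
}

def weighted_score(scores: dict, weights: dict) -> float:
    """Simple weighted sum over the factor groups, as in basic utility analysis."""
    return sum(weights[factor] * value for factor, value in scores.items())

for name, scores in alternatives.items():
    print(name, round(weighted_score(scores, weights), 2))
```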

Tests on the case studies (Nintendo SNES, Sega Genesis, SNK Neo Geo and NEC TurboGrafx-16) showed that every emulator tested exhibited some loss of functionality and display fidelity.

René van Horik and Dirk Roorda, “MIXED: Repository of Durable File Format Conversions,” iPRES 2009 proceedings, San Francisco, 2009

The MIXED project aims to form a sound theoretical framework for the curation of durable file format conversions, together with the corresponding services and tools to support that framework.

EASY is the electronic archiving system, a database for datasets, run by Data Archiving and Networked Services (DANS). A huge variety of formats exist in the system: images, Lotus 1-2-3, SPSS, WordPerfect, ASCII. MIXED offers migration of these varied formats into XML (referred to as "smart migration"). Aspects that do not survive the transformation include presentation and action details (fonts, forms, updating, editing). SDFP (Standard Data Format for Preservation) is the umbrella format that defines the XML structure. Future work on the format will address this functionality and expand coverage to more file formats.
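
The "smart migration" idea of pulling tabular data out of an ageing format into a self-describing XML container can be sketched as below. The element names are invented and do not follow the actual SDFP schema, and the rows would in reality come from a format-specific reader (such as the DANS DBF Library for xBase files) rather than a hard-coded list:

```python
import xml.etree.ElementTree as ET

# Rows as they might come out of a legacy table (invented sample data).
columns = ["respondent_id", "region", "income"]
rows = [("0001", "Utrecht", "31200"),
        ("0002", "Groningen", "28750")]

# Build a format-neutral XML view of the table (illustrative vocabulary, not SDFP).
table = ET.Element("table", name="survey_1987")
header = ET.SubElement(table, "columns")
for name in columns:
    ET.SubElement(header, "column", name=name)
data = ET.SubElement(table, "rows")
for row in rows:
    row_el = ET.SubElement(data, "row")
    for name, value in zip(columns, row):
        cell = ET.SubElement(row_el, "cell", column=name)
        cell.text = value

ET.ElementTree(table).write("survey_1987.sdfp.xml", encoding="utf-8", xml_declaration=True)
```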

The DANS DBF Library is a Java library for reading and writing xBase database files.

Resources and Tools
Dioscuri
JPC Emulator
Universal Virtual Computer
Software Preservation Society
The Emulator Zone
Walker Sampson, “An Annotated Bibliography: Approaches to Software Preservation”

Posted by vad17 at November 29, 2010 07:16 AM
