April 03, 2008
Live Blogging from KSL GIS Symposium--Clifford Lynch keynote address
Cliff Lynch, Executive Director of the Coalition for Networked Information, spoke about the notion of cyberinfrastructure as related to scientific inquiry:
high performance computing; sensors connected to the Network; very large data sets; virtual organizations. These concepts are known collectively as e-science, especially in Europe. In the U.S. the same concepts are known (in Lynch's view, somewhat perversely) as cyberinfrastructure. The NSF is the guiding body for these concepts through its Office of Cyberinfrastructure, headed by Dan Atkins.
He discussed simulation as a fundamental tool for science, one that some argue should be taught broadly to undergraduate students. Examples: disaster planning under varying conditions (time of day, spring break crowds, location on a bridge); simulating early agrarian societies.
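To make the disaster-planning example concrete, here is a minimal sketch of simulation as a scientific tool: a toy Monte Carlo model of evacuation time. The model, its parameters, and the scenarios are my own illustrative assumptions, not anything from the talk.

```python
import random

def evacuation_time(population, exits, trials=1000, seed=42):
    """Toy Monte Carlo estimate of average evacuation time (minutes).

    Hypothetical model: each person picks a random exit, an exit
    clears roughly one person per second, and the slowest exit
    dominates. All parameters are illustrative.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = [0] * exits
        for _ in range(population):
            counts[rng.randrange(exits)] += 1
        total += max(counts) / 60.0  # slowest queue, in minutes
    return total / trials

# A "spring break" scenario might double the crowd at the same venue:
normal = evacuation_time(population=500, exits=4)
crowded = evacuation_time(population=1000, exits=4)
print(f"normal: {normal:.1f} min, crowded: {crowded:.1f} min")
```

Planners would run many such scenarios (time of day, crowd size, blocked exits) and compare the distributions, which is exactly the kind of routine computational experiment Lynch suggests undergraduates should learn.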
Sensors: use of very tiny sensors--"smart dust"--that can be "spread around" to gather data. Ecologists and environmentalists are using "dumb sensors" that can sense only a few kinds of phenomena, overlaid with a system of "mobile sensors" that can move to a place where the data indicate the need for more sophisticated gathering. Social elements: closed-circuit TV; highway monitoring; cell phones that know where you are and are now starting to carry other kinds of sensors--with certain kinds of sensors built in, it would be easy to build a ubiquitous national sensing network. Much social and commercial activity has moved to the Internet, where it can now be monitored and tracked, creating an enormous social sensor network that we have not had before. Lynch's impression is that the major software firms (Google, Yahoo, Microsoft) are very protective of their users' privacy for sound business reasons: if the data were not anonymized, they would not be able to gather it.
Data: It is a fundamental cornerstone of e-science. Not just preservation, but data curation: keeping data not merely for altruistic purposes, but to enable new scholarship and repurposing--not for the sake of creating archives, but for the possibility of creating new knowledge downstream. Examples: meta-analysis across separate data sources, especially diverse sources removed from their original purposes. Preservation of data costs money, and we don't necessarily want to preserve everything. There is still a strong bias in the sciences to preserve as little data as possible, creating a "nightmare scenario" in which future funding is required to rescue old data. We are just beginning to have a language for this work: "data curation." "Data scientists" are a new breed of professional starting to come out of schools of information science. Most projects will not be able to support large-scale data curation staff, and in some scientific areas the long-term problem looks unsolvable (e.g. high energy physics). Data also needs to be collected in some sort of context, and once it is packaged, some entity needs to take responsibility for managing the data in the long term. It is part of fundamental scientific results:
*disciplinary repositories (e.g. molecular biology) with norms for collection and sharing; some agencies are now demanding pre-publication data sharing; an open question is who pays for the repository
*journal publishers: "give us the whole package"--article, data, computer programs. Sometimes this is in reaction to academic fraud cases. The journals are quite vague about who has long-term responsibility for these "supplemental materials." How much supplemental data can you give them?
*the universities that host the research, especially through the university library. There are serious financial issues for a university/library that undertakes these efforts, and it is a big expansion of the library's role. There is also a need to reduce duplication of effort by concentrating expertise for particular academic disciplines among fewer institutions.
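The point above that data must be "collected in some sort of context" and then packaged for a long-term steward can be sketched as a minimal data package: the observations plus enough provenance to make them reusable downstream. The field names here are my own illustrative assumptions, not any repository's actual schema.

```python
import json

# Illustrative sketch of a minimal "data package": the dataset plus
# enough context (provenance, units, responsible party) to support
# later reuse and meta-analysis. Field names are hypothetical.
package = {
    "title": "Stream temperature readings, site A (hypothetical)",
    "creator": "Example Lab, Example University",
    "collected": "2008-03-15",
    "instrument": "thermistor sensor node (a 'dumb sensor')",
    "units": {"temperature": "degrees Celsius"},
    "steward": "university library repository",  # who manages it long term
    "data": [
        {"time": "06:00", "temperature": 7.2},
        {"time": "12:00", "temperature": 11.8},
    ],
}

manifest = json.dumps(package, indent=2)
print(manifest)
```

Without the contextual fields (units, instrument, steward), the bare numbers would be nearly useless to a future researcher attempting the kind of cross-source meta-analysis Lynch describes.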
Lynch pointed out that what he calls "e-science" can more correctly be termed "e-research," since the same research techniques can be used in the humanities and social sciences. The humanities are beginning to generate large data sets that will also need to be preserved and curated.
QUESTIONS FROM THE AUDIENCE:
Q. Product liability for information? E.g. faulty sensors or data that cause a catastrophe; pharmaceutical trials in which data may have been suppressed.
A. Lynch sees this as a serious problem that will get worse, especially given corporate lawyers who want data destroyed as soon as possible, because old data is "pure liability." It also raises the question of what constitutes the material used for peer review: the article? the data? the computer programs?
Q. Comment on data storage.
A. For most data, the raw costs of storage are not very significant. Storing human-produced data (writing, speech, video) is now "not that big a deal." Getting rid of data will be done for other social reasons, not for cost. Historians no longer have enough hours to review the entire human record, so data mining becomes essential.
Q. Will there be improvement of metadata?
A. Deposited data should be streamlined as much as possible, with the metadata managed in other ways.
Posted by tdr at April 3, 2008 10:10 AM