September 04, 2008

August 2008 Cataloging Statistics

Here are our statistics for August 2008:

August 2008 (MS Excel File)


OCLC August 2008 Statistics (MS Excel File)

Also, stay tuned to KSL Statistics and Reports to view our 2007-2008 fiscal year statistical report...

August 28, 2008

XML vs. Databases

The following excerpt from Julia Flanders, of Brown University's Women Writers Project, and the links she offers provide an interesting discussion of "XML vs. Databases."

J. Flanders:
"I should also say in advance (before anyone reads the more detailed screed below) that we're teaching XML and TEI for a reason, which is that they help us work with text in a way that respects both its nuance and our own interest in that nuance. So my own personal recommendation for representing textual information is to use XML on principle, because (regardless of what tools are available right now) in the long run it's the right kind of approach. However, it's worth understanding the broader context, which I will try to sketch below.
"Database tools and XML tools differ in the kinds of things they're good at (and this is where the readings may come in handy, to give concreteness to this point). In addition, database structures and XML structures differ somewhat in their emphasis: database structures emphasize what is regular and predictable about your data (e.g. the fact that every individual commenter has a name, address, and gender). XML structures emphasize what is less regular and predictable about your data (e.g. the fact that the comment might or might not include praise for the exhibit, references to other exhibits, references to specific artists of interest, statements about being inspired, etc., and also the fact that the comment might contain an unpredictable number of paragraphs). For your data, which has a fairly regular and predictable structure, the difference is comparatively minor. For other kinds of data, though, the difference might be great: it would be much more difficult and bizarre to express the structure of a novel using a database."

Article Links


  • XML and databases

  • Going Native: Use Cases for Native XML Databases

  • Introduction to Native XML Databases

  • Why Use a Native XML Database

    August 07, 2008

    July 08 stats

    Here are the bibliographic/metadata statistics for July 2008:

    III Millennium Statistics (MS Excel Format)

    OCLC stats (MS Excel Format)

    August 06, 2008

    August 2008 presentation

    Here is the PowerPoint presentation I gave August 6, 2008. Note: Some links to external files will not work:
    PowerPoint: Cataloging Trends and Challenges

    July 24, 2008

    NINES-TEI Workshop -- Day 3

    Notes from Day 3 of NINES-TEI Conference:

    _Customizing the TEI schema: options_
    --Select modules
    --Delete unnecessary elements
    --Add new elements or attributes
    --Change element or attribute names
    --Constrain attribute values (constrain data early and often. The tighter you make the schema, the fewer errors you'll have. But don't constrain too much).
    --Constrain structure
    --Manipulate functional groupings of elements
    --Produce an internationalized version of the TEI
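
As a rough illustration of "constrain attribute values," the Python sketch below checks one attribute against a closed list of allowed values. In a real TEI project the constraint would be written into the customized schema itself; the function name, the allowed list, and the sample markup here are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical closed value list for the type attribute on <div>.
ALLOWED_DIV_TYPES = {"chapter", "letter", "poem"}

def bad_div_types(xml_text):
    """Return every type value on a <div> that is outside the allowed list.

    A <div> with no type attribute at all is reported as None.
    """
    root = ET.fromstring(xml_text)
    return [d.get("type") for d in root.iter("div")
            if d.get("type") not in ALLOWED_DIV_TYPES]

# Sample markup with one misspelled value ("chaptr").
sample = '<body><div type="chapter"/><div type="chaptr"/><div type="poem"/></body>'

print(bad_div_types(sample))
```

The tighter the list, the sooner a typo like "chaptr" is caught, which is the point of constraining early.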

    _Contextual Information_
    Information we know that is relevant to an understanding of the text:
    --The identity of things named in the text: people, places, books, etc.
    --Information about things named in the text: birthdates, geographical locations, date published, etc.
    --Interpretive information: themes, keywords
    --Normalization of measurements, dates, etc.
    _Contextual Information in the TEI_
    The TEI provides several different structures for encoding contextual information:
    --'Ographies: prosopography (personography), gazetteers (placeography), orgography, bibliography
    --keywords applied to the text as a whole
    --thematic or interpretive information applied to specific places in the text
    --attributes for supplying normalized values

    _Personography_
    --Like a local name authority file
    --Can be simple or very detailed
    --Can be kept in your encoded file or externally
    --Includes specific elements for the most common data
    --Also includes general elements for the unforeseen
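
A minimal sketch of the "local name authority file" idea, in Python. It assumes a personography kept in the same file as the text, with names in the text pointing to it through ref="#id". The person, the id, and the sentence are invented, and the TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:id attribute with its full namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring("""
<TEI>
  <listPerson>
    <person xml:id="jdoe">
      <persName>Jane Doe</persName>
      <birth when="1801"/>
    </person>
  </listPerson>
  <text>
    <p>A letter from <persName ref="#jdoe">Mrs. D.</persName> arrived.</p>
  </text>
</TEI>
""")

# Build the authority file: xml:id -> person entry.
people = {p.get(XML_ID): p for p in doc.iter("person")}

def resolve(name_element):
    """Follow a ref="#id" pointer from a name in the text to its person entry."""
    return people[name_element.get("ref").lstrip("#")]

mention = doc.find(".//p/persName")   # the name as it appears in the text
entry = resolve(mention)              # the normalized record it points to
print(mention.text, "->", entry.findtext("persName"), entry.find("birth").get("when"))
```

The same lookup would work if the personography were kept in an external file and parsed separately.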

    _Placeography (Gazetteer)_
    Key points:
    --Very similar to personography...but for places!
    --Can be linked to maps via geographic information data

    * use XSLT for organization, CSS for display, a database or search engine (e.g. Oracle, Lucene) for search capabilities. Do in-house, NOT via outsourcing

    _Successful Digital Projects_
    --Well-planned workflow (planning before execution, communication mechanisms, etc.)
    --Phased approach (show progress early on)
    --Successful staff plan (needs versus practicality)
    --Realistic technical implementation plan (identify tools at the scale you need. Oracle database?)
    --Funding...

    _Workflow Issues_
    Source --> Transcription --> Corrected Transcription --> HTML output
    Information gain should occur at the "corrected transcription" stage. "Information loss" may occur at HTML output.

    _Workflows that Make Sense_
    1. Craft approach
    detailed initial capture by hand (i.e. you encode it) --> review and error correction --> interface created that allows you to do various things

    2. Expertise in the Craft Approach
    detailed initial capture by hand (i.e. you encode it) --> review and error correction --> scholarly and technical expertise interacts with review-and-error correction --> display

    3. Phased Approach
    simple initial capture (not detailed capture; could be automated via OCR capture) --> simple error correction (e.g. spell-checking) and standard publication tools that do simple searches (e.g. search on author) --> simple reading and search interface output --> then do more advanced information (i.e. add to the XML markup) --> new output
    Occurs often in large projects

    _Project Life Cycle: Starting Out_
    1. initial idea. Essentially volunteer
    2. Seed funding (e.g. NEH DHSG)
    3. Implementation that works, with a real audience. Requires serious funding (NEH, Mellon, Getty, NSF,...)
    4. Discover flaws and re-do; get funding because it's still an interesting project
    5. wrap up and archive; institutional funding via institutional repository
    -OR- sustainability model for on-going project ad infinitum --> ongoing redevelopment/new prototypes

    July 23, 2008

    NINES-TEI Workshop -- Day 2

    Notes from 2nd day at NINES -- TEI workshop at Miami U. of Ohio
    XML
    A vocabulary and a grammar

    Example:
    --DocBook is a markup language for writing books
    vocabulary includes: article, title, and paragraph
    grammar states: "paragraph" is not allowed inside "title"
    Extensible
    Not a language, but a meta-language
    --methods of defining markup language
    --syntax for expressing markup language

    Gives us methods to define a markup language

    XML has no tags of its own, but instead defines the syntax of tags; it defines no vocabulary or grammar of its own, but does tell you how to define a vocabulary and grammar

    XML languages vary greatly
    --many different purposes (financial data, linguistics, literary texts,...)
    --many different kinds of markup (structure, content, interpretation,...)
    --many different user communities (IRS, literary scholars, librarians,...)

    XML is
    --easy to understand
    --non-proprietary plain text (i.e. no particular company owns it; plus, no binary data, the kind you see when you open a JPEG in MS Word or Notepad. Btw, RTF is problematic because it's an MS format; MS can do away with it):
    --human readable
    --software independent
    --hardware independent
    --(relatively) easy to write a parser for
    --widespread: very well supported by both commercial and open source software

    Definition of...
    Parser--the software that differentiates markup from document.

    XML is a metalanguage
    --no tags or attributes of its own
    --instead, a set of rules for defining tags and attributes
    --imposes no constraints on elements and attributes in a document
    --instead, defines how rules for such constraints are written

    Everything is Delimited
    Text is divided into elements
    --elements by start- and end-tags
    --start tags by <...>
    --end tags by </...>
    --special case: short-hand for an element with no content. <.../> = <...></...>
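
A small Python sketch of the empty-element short-hand, using the standard library parser. The sample paragraph is made up; pb is borrowed from TEI (page break) only as a convenient element with no content.

```python
import xml.etree.ElementTree as ET

# The same paragraph, with the empty element spelled two ways.
long_form = ET.fromstring("<p>text<pb></pb>more</p>")
short_form = ET.fromstring("<p>text<pb/>more</p>")

# Once parsed, the two spellings are indistinguishable:
# both serialize to the same string.
print(ET.tostring(long_form, encoding="unicode"))
print(ET.tostring(short_form, encoding="unicode"))
```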

    Everything's Delimited: attributes
    Elements have attributes

    Users can turn the attributes within the brackets (quantity=, unit=) on or off when searching.
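
To make the quantity= and unit= attributes concrete, here is a short Python sketch. The sentence and values are invented for illustration; only the attribute names come from the notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: the attributes hold normalized values,
# the element content keeps the original wording.
p = ET.fromstring(
    '<p>Add <measure quantity="2" unit="lb">two pounds</measure> of flour.</p>'
)
measure = p.find("measure")

print(measure.attrib)   # the normalized values
print(measure.text)     # the wording as it appears in the text

# A search can use the normalized values instead of the wording.
hits = [m.text for m in p.iter("measure") if m.get("unit") == "lb"]
print(hits)
```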

    Everything's Delimited: Character References
    To refer to a character that is not on your keyboard, delimit its ISO 10646 (or Unicode) code-point with:
    --&# and ; for decimal values, or
    --&#x and ; for hexadecimal values
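
A quick Python check that the decimal and hexadecimal forms name the same character. The word and the element name w are arbitrary examples.

```python
import xml.etree.ElementTree as ET

# U+00E9 (e with acute accent) written both ways.
decimal = ET.fromstring("<w>caf&#233;</w>").text
hexadecimal = ET.fromstring("<w>caf&#xE9;</w>").text

# The parser resolves both references to the same code point.
print(decimal == hexadecimal)
print(ord(decimal[-1]), hex(ord(decimal[-1])))
```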

    Well-Formedness
    Simple set of rules on document syntax:
    --single "root" element (meaning, the one at the top of the tree/box)
    --every element has a start- and an end-tag (or is an empty tag)
    --no elements overlap
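
The three rules can be tried out with any XML parser, since a parser refuses anything that is not well-formed. A minimal Python sketch using the standard library follows; the function name and the sample strings are my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the parser accepts the text, False if it raises a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "one root, all tags closed": "<doc><p>one</p><p>two</p></doc>",
    "empty tag": "<doc><pb/></doc>",
    "two roots": "<p>one</p><p>two</p>",
    "missing end-tag": "<doc><p>one</doc>",
    "overlapping elements": "<doc><a><b></a></b></doc>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

The first two samples pass; the last three each break one of the rules above.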

    NOTE: XML is not a perfect representation, but it gives us strategic ways to model information.

    One solution to overlap
    use the TEI part= attribute to work around this problem
    --"I" is for initial, "F" is for final.
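
A sketch of how the part= workaround plays out, assuming the common case of a verse line shared between two speakers: the line would overlap the speech boundaries, so it is encoded as two fragments. The dialogue is invented, and the rejoining function is only an illustration in Python. The "M" (middle) value is not in my notes, but the TEI also defines it.

```python
import xml.etree.ElementTree as ET

scene = ET.fromstring("""
<div>
  <sp><speaker>First</speaker><l part="I">Where is he now?</l></sp>
  <sp><speaker>Second</speaker><l part="F">Gone, my lord.</l></sp>
  <sp><speaker>First</speaker><l>Then we must follow him at once.</l></sp>
</div>
""")

def whole_lines(root):
    """Rejoin verse lines that were split into part="I" ... part="F" fragments."""
    lines, pending = [], []
    for l in root.iter("l"):
        part = l.get("part")
        if part == "I":        # initial fragment: start collecting
            pending = [l.text]
        elif part == "M":      # middle fragment: keep collecting
            pending.append(l.text)
        elif part == "F":      # final fragment: close the line
            lines.append(" ".join(pending + [l.text]))
            pending = []
        else:                  # ordinary, unsplit line
            lines.append(l.text)
    return lines

for line in whole_lines(scene):
    print(line)
```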

    Validity
    A valid XML document follows the rules of a schema that describes a particular markup language:
    --lexicon or available vocabulary: elements and attributes
    --grammar for how the lexicon is used: rules for nesting, sequencing, etc.
    e.g. a paragraph can be inside a chapter, but a chapter cannot be inside a paragraph
    e.g. a chapter must begin with a heading followed by at least one paragraph
    --there exist various schema languages with which you can describe an XML grammar, each with advantages and disadvantages.
    --in order to be valid, an instance must be well-formed
    --a well-formed document need not be valid

    You have to pass the well-formed test first, then do the validity check!
    You can be well-formed and yet not valid; you cannot be valid and not well-formed.
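
The two-step test can be sketched in Python. This is a toy stand-in for a real schema language, not how a validator actually works: the vocabulary (book, chapter, heading, paragraph), the rule table, and the function name are all assumptions built from the two examples above.

```python
import xml.etree.ElementTree as ET

# Toy grammar: which child elements each element may contain.
ALLOWED_CHILDREN = {
    "book": {"chapter"},
    "chapter": {"heading", "paragraph"},
    "heading": set(),
    "paragraph": set(),
}

def check(xml_text):
    """Return (well_formed, validity_errors) for a document."""
    try:
        root = ET.fromstring(xml_text)       # test 1: well-formedness
    except ET.ParseError:
        return False, []                     # cannot even ask about validity
    errors = []                              # test 2: validity
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            errors.append(f"<{parent.tag}> is not in the vocabulary")
            continue
        children = [child.tag for child in parent]
        for tag in children:
            if tag not in allowed:
                errors.append(f"<{tag}> not allowed inside <{parent.tag}>")
        if parent.tag == "chapter" and children[:2] != ["heading", "paragraph"]:
            errors.append("<chapter> must begin with a heading and a paragraph")
    return True, errors

good = "<book><chapter><heading>One</heading><paragraph>Text</paragraph></chapter></book>"
no_heading = "<book><chapter><paragraph>Text</paragraph></chapter></book>"
broken = "<book><chapter>"

for text in (good, no_heading, broken):
    print(check(text))
```

The second sample is well-formed but not valid; the third never gets as far as the validity check.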

    Namespaces
    --a way to use tag vocabularies from different markup languages
    --allows for specialization of markup languages (by discipline, by function)
    --good for metadata: can use TEI header in a METS record
    --good for specialized markup: e.g. MusicML
    --No need for every markup language to handle everything
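
A sketch of the "TEI header in a METS record" case, in Python. The namespace URIs are the published METS and TEI ones; the record itself is a made-up, heavily abbreviated fragment that would not validate against either schema.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI, used when searching.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "tei": "http://www.tei-c.org/ns/1.0",
}

record = ET.fromstring("""
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:tei="http://www.tei-c.org/ns/1.0">
  <mets:dmdSec ID="dmd1">
    <mets:mdWrap MDTYPE="TEIHDR">
      <mets:xmlData>
        <tei:teiHeader>
          <tei:fileDesc>
            <tei:titleStmt><tei:title>Sample Title</tei:title></tei:titleStmt>
          </tei:fileDesc>
        </tei:teiHeader>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>
""")

# The namespace tells the two vocabularies apart inside one document.
title = record.find(".//tei:title", NS)
print(title.text)   # the TEI title, found inside the METS wrapper
print(title.tag)    # ElementTree writes names as {namespace-uri}localname
```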

    * Make use of www.unicode.org

    _Challenges of Markup_
    3 Critical areas:
    --overlapping structures
    --images and figurality
    --materiality of the text

    July 22, 2008

    NINES Workshop -- TEI

    Notes from NINES Workshop, Miami University of Ohio, July 22-24, 2008:

    Motives for TEI
    -- to store info for long term
    -- to analyse info
    -- to share info

    Granularity
    Longevity

    What is TEI
    --technically: a standards organization for humanities text encoding
    --organizationally: an int'l membership consortium
    --socially: a community of people and projects

    TEI Consortium founded in 2000. Members pay an annual fee, which pays for editorial work, outreach, workshops

    Formal declarations for encoding language. Constraints for how you can encode.
    Guidelines are thick; not everything is needed.

    P5 released Nov. 2007. Current version. Substantial shift.
    NEH cares more about practical interchange than technical aspects.
    P4 and P5 have many of the same elements, but P5 is different from P4 in that:
    --P5 introduces some new elements
    --P5 fixes some errors from P4 (e.g. postscript element)
    --P5 doesn't change much for the encoder, but it does change how the texts are displayed and the way its language is expressed. How you do customizations is very different in P5 from P4.


    TEI Guidelines
    --can be applied strictly or loosely
    --Can adapt to local conditions
    --Designed as a set of modules that can be selected as needed
    --Not unlike a human language in some respects

    A vocabulary to use experimentally and in an exploratory way.
    TEI is modular. Drama, verse, etc.
    TEI provides the tools. You have to ask: what is my project? What are my needs? What level of granularity do I want?

    Areas of Usage
    --digital libraries and digital archives
    --literary and cultural materials
    --scholarly editions
    --manuscript collections and descriptions
    --dictionaries
    --language corpora
    --historical documents
    --anthropology and social sciences
    --authoring
    --linguistics
    --many other areas...

    See, for example, projects on William Blake, Herman Melville, Poetess Archive (Miami Univ. of Ohio), Women Writers Project (Brown Univ.)

    Customization
    The process of altering the TEI schema and documentation to match your needs.
    changes include:
    --choosing which parts to use or omit
    --changing the name of elements or attributes
    --restricting the values of attributes
    --adding new elements
    --adding new attributes

    "Interchange" is a crucial element of TEI. The goal is meaningful exchange.

    Sources of Info
    --WWP seminars site
    --TEI web site
    --TEI listserv (TEI-L)
    --Colleagues at other projects