April 06, 2008

Using DTDs and Catalogs for XHTML Validation

Posted at April 6, 2008 01:48 AM in PHP, XML.

If you are like me, whenever you develop web sites or pages, you constantly find yourself validating the generated XHTML using the W3C Markup Validator (TIP: the Web Developer Firefox extension has an option under the "Tools" menu to validate local source, which automatically uploads the source to the validation service).

This approach is a good start, but it is far from ideal because it is based on an honor system of sorts. You often forget to validate each change you make and there is always some corner case that you forget. So, what can be done about it?

Well, if you find yourself developing in PHP, you can employ the following solution.

First, the output of your script needs to be available to the PHP script itself. If you are using one of the many frameworks out there, chances are you have a $response->getBody() method or equivalent. If you aren't already saving output to a string, use the Output Control functions to capture it. For example:


ob_start();
header('Content-Type: application/xhtml+xml');
print '<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head />
<body>
<!-- stuff here -->
</body>
</html>';

$xhtml = ob_get_clean();

Now you have the XHTML (supposedly XHTML 1.1) in a string variable, and we need to plug in the validation part. All you need is PHP's DOM extension, which is built on libxml2. The extension has built-in functions for validating XML, but instructions for configuring it optimally are hard to find (if they even exist).

The basic recipe for validating the XHTML/XML response is the following:


libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadXML($xhtml, LIBXML_DTDLOAD | LIBXML_DTDATTR);

if (!$doc->validate()) {
  foreach (libxml_get_errors() as $err) {
    print $err->line . ' : ' . $err->message . PHP_EOL;
  }
}

You can actually use any kind of error handling you want. My personal favorite is to split the XHTML string by newlines and then correlate the $err->line to the actual line in the output so you can print error context and figure out what the offending code is.
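
The error-context idea can be sketched roughly as follows. This is only an illustration, not the exact code I use: the function name errorContext() and its $radius parameter are made up for this example, and the type declarations assume PHP 7 or later.

```php
<?php
/**
 * Return the lines around a reported error line, marking the offending one.
 * errorContext() and $radius are hypothetical names for this sketch;
 * $radius is the number of lines shown on each side of the error.
 */
function errorContext(string $xhtml, int $errLine, int $radius = 2): string
{
    // Split on any newline style so the numbering follows the source text.
    $lines = preg_split('/\r\n|\r|\n/', $xhtml);

    // Clamp the window to the lines that actually exist (1-based numbering).
    $first = max(1, $errLine - $radius);
    $last  = min(count($lines), $errLine + $radius);

    $out = [];
    for ($n = $first; $n <= $last; $n++) {
        // Prefix the reported line with '>' and all others with a space.
        $marker = ($n === $errLine) ? '>' : ' ';
        $out[] = sprintf('%s %4d | %s', $marker, $n, $lines[$n - 1]);
    }
    return implode(PHP_EOL, $out);
}
```

Inside the foreach loop over libxml_get_errors(), you would then print errorContext($xhtml, $err->line) right after the error message.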

When you run the above code, you will find that script execution becomes extremely slow, or you will get a cryptic error about remote URIs not being accessible. The reason is that loading and validating the document fetches the DTDs referenced in your XML. In the above example, it will fetch http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd and any files referenced by that document.

But the W3C doesn't like people who fetch DTDs all day, so we need a way to avoid the remote file fetching.

This is where XML Catalogs come to the rescue (see #1 and #2 for more about XML Catalogs). Basically, XML catalogs allow you to map remote entities into a local space, mainly your file system. In this case, we will be using XML catalogs to map the XHTML DTD files to your local filesystem so validation doesn't have to go across the internet to access necessary files.

The first step is to define mappings from remote URIs to the local filesystem in the system XML catalog. On my Gentoo Linux install, this file is /etc/xml/catalog. In this file, I add the following mappings inside the root catalog element:


<rewriteSystem systemIdStartString="http://www.w3.org/MarkUp/DTD" rewritePrefix="file:///etc/xml/www.w3.org/MarkUp/DTD"/>
<rewriteSystem systemIdStartString="http://www.w3.org/TR" rewritePrefix="file:///etc/xml/www.w3.org/TR" />

There is probably a way to do this with the 'xmlcatalog' program distributed as part of libxml2, but I am lazy.

The above example maps http://www.w3.org/MarkUp/DTD to /etc/xml/www.w3.org/MarkUp/DTD and http://www.w3.org/TR to /etc/xml/www.w3.org/TR. Any remote entity whose system identifier starts with one of these prefixes will be mapped to the corresponding local filesystem location.
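
To make the prefix rewriting concrete, here is a small sketch of what a rewriteSystem rule does to a system identifier. It only imitates the catalog lookup in plain PHP for illustration; libxml2 does the real work for you. The function name rewriteSystemId() is hypothetical, and the sketch skips the URI normalization a real catalog resolver performs.

```php
<?php
/**
 * Imitate catalog rewriteSystem resolution for one system identifier.
 * $rules maps systemIdStartString => rewritePrefix.
 * Returns the rewritten URI, or null when no rule applies.
 */
function rewriteSystemId(string $systemId, array $rules): ?string
{
    $best = null;
    foreach ($rules as $start => $prefix) {
        // A rule applies when the system ID begins with its start string;
        // when several apply, the longest start string wins.
        if (strncmp($systemId, $start, strlen($start)) === 0
            && ($best === null || strlen($start) > strlen($best))) {
            $best = $start;
        }
    }
    if ($best === null) {
        return null;
    }
    // Swap the matched prefix for the local one and keep the rest of the path.
    return $rules[$best] . substr($systemId, strlen($best));
}

$rules = [
    'http://www.w3.org/MarkUp/DTD' => 'file:///etc/xml/www.w3.org/MarkUp/DTD',
    'http://www.w3.org/TR'         => 'file:///etc/xml/www.w3.org/TR',
];

print rewriteSystemId('http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd', $rules) . PHP_EOL;
// prints file:///etc/xml/www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
```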

Once you make these additions, try to run the XML validation in your script again. It will probably fail fast, but this time with a different error message. It will most likely say that it couldn't find /etc/xml/www.w3.org/TR/xhtml11/DTD/xhtml11.dtd. So, you need to add it to your filesystem.

Create the necessary directory structure in your local filesystem (e.g. /etc/xml/www.w3.org/TR/xhtml11/DTD). Then, start downloading the missing files needed for validation. The XHTML 1.1 DTDs can be found in http://www.w3.org/TR/xhtml11/xhtml11.tgz.

Once you have that, you will need to find all the modular XHTML files. You can find these inside http://www.w3.org/TR/xhtml-modularization/xhtml-modularization.tgz. Be sure to place the files in directories that match the mapping you created in your XML catalog file.

Now, try running validation again. If it complains about missing files, just fetch them one at a time from w3.org until you get no more errors about missing files.

After you have put all of these files on your local file system, calls to validate() should be quick and will immediately tell you if your output conforms to XHTML 1.1. No need to use the W3C service!

So, there you have it. An easy and quick method for automatically validating script output for XML compliance!

Finally, you will probably want to turn off XML validation on production web sites because it adds unnecessary overhead. But for development, it is an invaluable asset! Even though the above example was for XHTML 1.1, the same approach works for any XML that has a DTD, and DOMDocument also offers schemaValidate() and relaxNGValidate() for documents described by an XML Schema or a RELAX NG grammar.
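
One way to make validation easy to switch off is to wrap it in a function that takes a flag. The following is a minimal sketch under my own assumptions: the function name validationErrors() and the APP_ENV environment variable mentioned afterwards are hypothetical, so use whatever your application already has for telling development from production.

```php
<?php
/**
 * Validate $xml against its DTD and return the error messages.
 * Returns an empty array without doing any work when $enabled is false.
 * validationErrors() is a hypothetical helper name for this sketch.
 */
function validationErrors(string $xml, bool $enabled): array
{
    if (!$enabled) {
        return [];
    }

    // Collect libxml errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $messages = [];
    // Report errors if the document cannot be parsed or fails DTD validation.
    if (!$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_DTDATTR) || !$doc->validate()) {
        foreach (libxml_get_errors() as $err) {
            $messages[] = $err->line . ' : ' . trim($err->message);
        }
    }

    // Leave libxml's error state the way we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}
```

You could then call it as validationErrors($xhtml, getenv('APP_ENV') !== 'production') and print whatever comes back.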

Update 1, April 6
Added content-type header to help remind people it is needed. Some rewording all around.

Comments

http://users.skynet.be/mgueury/mozilla/
This Addon could be interesting for you.

Posted by Felix at April 6, 2008 04:15 AM

Obligatory link about XHTML being useless and just use HTML.

Posted by Jeremy Smith at April 6, 2008 01:19 PM

Obviously, the Content-Type response header is obligatory for full XHTML 1.1 conformance. I left it out of the example for conciseness. Perhaps I should put it in...

Posted by Gregory Szorc at April 6, 2008 02:46 PM

ob_tidyhandler?

Posted by Daniel O'Connor at April 7, 2008 08:21 AM
