DocZilla can load HTML as true SGML

HTML is SGML

HTML is not only a cool language that lets you write hyperlinked documents in very little time; formally, HTML is an SGML application defined by a document type definition (DTD). Many have been taught to start an HTML document with a DOCTYPE declaration, like: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> without ever knowing what it actually means in practice. That's because there hasn't been any way to put it into practice and see the result.
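To make the pieces of that declaration concrete, here is a minimal Python sketch that splits it into its parts. It is only an illustration, not how DocZilla parses anything: the regular expression handles just the PUBLIC form shown above, and the function name parse_doctype and the dictionary keys are our own inventions.

```python
import re

# Matches only the common PUBLIC form:
#   <!DOCTYPE name PUBLIC "public identifier" "system identifier">
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<name>[A-Za-z][\w.-]*)'
    r'\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)

def parse_doctype(declaration):
    """Split a PUBLIC DOCTYPE declaration into its named parts (hypothetical helper)."""
    match = DOCTYPE_RE.search(declaration)
    if match is None:
        raise ValueError("not a PUBLIC DOCTYPE declaration")
    public_id = match.group("public")
    return {
        # Name of the document (root) element the DTD applies to.
        "document_element": match.group("name"),
        # Location-independent name of the DTD.
        "public_id": public_id,
        # Where the DTD can actually be fetched from (None if absent).
        "system_id": match.group("system"),
        # The public identifier's '//'-separated fields.
        "fpi_fields": public_id.split("//"),
    }

decl = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">')
parts = parse_doctype(decl)
for key, value in parts.items():
    print(key, "=", value)
```

The system identifier is the part an SGML parser can dereference to obtain the DTD itself; the public identifier only names it.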

With DocZilla, you can do that now.

DocZilla validates and loads HTML as SGML

While DocZilla generally reuses the non-validating HTML support of Mozilla for viewing ordinary webpages, loading HTML as SGML makes for an interesting demonstration. We took the front page of the W3C HTML 4.01 standard as our example. (Copyright and licensing information here.)

Original W3C document as HTML

Here's the original HTML document describing HTML 4.01, served with content type of text/html: http://www.w3.org/TR/1999/REC-html401-19991224/.

The same document appearing as SGML

Note: you need a working network connection in order to test this demonstration.

This one, instead, is an SGML file. Check the content type using View > Page Info in your DocZilla. Because SGML defines neither the semantics nor the appearance of a document, DocZilla loads it as an arbitrary SGML file, never knowing that it's actually HTML. We had to add a stylesheet invocation to our local entityrc to help DocZilla style it and, as an optional extra, an XSLT TOC to act as a table of contents. Those are all the files we have on our server.

What happens with the DOCTYPE declaration is that DocZilla loads the DTD from the system ID "http://www.w3.org/TR/html4/strict.dtd". Yes, straight off the web! To confirm this, open the Message Logger (View > Message Logger); you'll see messages about many DTD subsets loading from W3C's website. DocZilla then parses the document as SGML and displays it using the html.css stylesheet referred to in entityrc. You might have noticed that the document loads a bit slowly: that's because it has to fetch half a dozen files over the internet.
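Part of the reason several files get fetched is that a DTD can pull in further external entities whose system identifiers are relative to the DTD's own location. The Python sketch below shows how such relative identifiers resolve against the system ID above. The three entity-set file names are an assumption based on our reading of the W3C DTD, and resolve_all is a hypothetical helper; the sketch performs no network access and says nothing about DocZilla's internals.

```python
from urllib.parse import urljoin

# System identifier taken from the DOCTYPE declaration.
DTD_SYSTEM_ID = "http://www.w3.org/TR/html4/strict.dtd"

# Assumed relative system identifiers of the character entity sets
# referenced from the DTD (not confirmed by this page).
RELATIVE_SYSTEM_IDS = ["HTMLlat1.ent", "HTMLsymbol.ent", "HTMLspecial.ent"]

def resolve_all(base, relative_ids):
    """Resolve each relative system identifier against the base URL."""
    return [urljoin(base, rel) for rel in relative_ids]

resolved = resolve_all(DTD_SYSTEM_ID, RELATIVE_SYSTEM_IDS)
for url in resolved:
    print(url)
```

Each resolved URL is a separate fetch, which adds up quickly over a slow connection.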

All elements are presented as generic SGML elements, meaning that the special semantics of HTML elements are not in effect. Links are missing in action since DocZilla doesn't know that the "HREF" attribute of an "A" element is actually a URL reference. That can be fixed by adding a few HyTime constructs to the "A" element's ATTLIST declaration in the DTD, to help DocZilla turn them into SGML hyperlinks. Here's a modified version of the SGML document, with a HyTime declaration in the internal DTD subset. The Message Logger will report an error because we declare the attribute list for the "A" element twice, but the HyTime links will nevertheless work.
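To see why a generic SGML parser has no way of recognizing links, it helps to look at what the DTD actually declares for HREF. The Python sketch below expands parameter entities in a simplified excerpt written in the style of the HTML 4.01 DTD (comments and most attributes are dropped). The excerpt and the helper names parameter_entities and declared_types are ours, for illustration only; this is not a real DTD parser.

```python
import re

# Simplified excerpt in the style of the HTML 4.01 DTD (an assumption
# for illustration; the real declarations carry comments and many more
# attributes).
DTD_EXCERPT = '''
<!ENTITY % URI "CDATA">
<!ATTLIST A
  href   %URI;   #IMPLIED
  name   CDATA   #IMPLIED
>
'''

def parameter_entities(dtd):
    """Collect parameter entity declarations as {name: replacement text}."""
    return dict(re.findall(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>', dtd))

def declared_types(dtd, element):
    """Return {ATTRIBUTE: declared value} for one element's ATTLIST."""
    entities = parameter_entities(dtd)
    attlist = re.search(
        r'<!ATTLIST\s+' + re.escape(element) + r'\s+(.*?)>', dtd, re.DOTALL
    )
    if attlist is None:
        return {}
    # Replace every %name; reference with its replacement text.
    body = re.sub(r'%([\w.-]+);', lambda m: entities[m.group(1)], attlist.group(1))
    tokens = body.split()
    # Tokens come in triples: attribute name, declared value, default.
    return {tokens[i].upper(): tokens[i + 1] for i in range(0, len(tokens) - 2, 3)}

types = declared_types(DTD_EXCERPT, "A")
print(types)
```

After expansion, HREF is plain character data, exactly like NAME. Nothing in the declaration says "this is a link", which is why extra hints such as HyTime attributes are needed.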

The numbering of the table of contents entries also fails, because Mozilla doesn't fully implement CSS counters yet, whereas with "OL" and "LI" elements in HTML the numbering is built in. (There are tricks to overcome this restriction, if absolutely needed.) So we applied the powerful DocZilla Table of Contents to the document instead. If the Sidebar isn't visible, press F9 and you should see a simple, XSLT-generated Table of Contents tree in it.


Last modified: Fri Jul 8 12:59:14 EEST 2005