There is more to XML than roll-your-own HTML

Uche Ogbuji
consultant and cofounder of FourThought LLC
uche.ogbuji@linuxworld.com

    Uche Ogbuji explains various XML-related standards, XML implementations available to Linux users, and why Linux must adopt XML now if it is to become a premier data- and document-exchange platform for the enterprise. (4,000 words)

XML support key to Linux success in the enterprise

The rise of the Internet was a flash flood that caught many by surprise. Fueled by open standards and a host of Unix-compliant technologies, it led to a wild, crazy, and prolific period in publishing and data exchange.

Now that this pioneering boomtown is beginning to show signs of chaos and unmanageability, the city planners have moved in on the scene and are quietly putting things in order. Their primary tool is XML, the Extensible Markup Language.

XML itself is yet another markup language. It's a subset of SGML that allows the description and implementation of documents in a tag format familiar to anyone who reads HTML. However, XML isn't just, as some say, roll-your-own HTML. The language introduces many subtleties and extensions that go well beyond HTML. There is even more to XML than XML.

The many XML-based proposals wending their way through standards bodies right now promise to revolutionize the way information is stored, retrieved, managed, and exchanged.

This isn't due to any leap in technology. Few of the technologies advanced by the XML community are in themselves revolutionary. In fact, many of them are based on decades-old information processing models. The power of XML lies in the fact that so many vendors and users are finally agreeing on practical standards for such models.

For this reason, XML might well be the ideal vehicle for boosting Linux in the enormous halls of enterprise information management. Among those looking to get in on the ground floor of the enterprise market is Microsoft. Having largely failed to capitalize on the Internet boom, Redmond is feverishly working to establish itself in the enterprise. And it's using XML.

This article focuses on XML in the enterprise, with a particular eye to XML-Linux development. As such, the article isn't intended as an introduction to XML. See the Resources section below for recommended introductory reading.

More than markup: Data formats and data exchange

There are really two ways to look at XML. Its primary emphasis is the representation of documents. But its inherent tree structure and sophisticated support for element composition and attributes make it a capable data definition language as well. These two areas lead to subtly different toolsets and supporting standards.

 
For the XML exchange of documents, the primary tools needed are an intelligent viewer or browser (preferably one that supports style sheets) and an intelligent XML editor (preferably one that can guide the editing process according to a document type definition [DTD] and help develop style sheets). Linux is lacking in both areas.

But even without slick browsers, there is much that can be done with lower-level tools for handling document representation via style sheets. SGML has been around long enough for style sheet tools to have matured, and the foremost standard is the Document Style Semantics and Specification Language, or DSSSL. DSSSL is already well-known to the Linux community, as it is used by the SGMLTools group to generate many of the info, TeX, PostScript, and manual pages with which we are all familiar from the Linux Documentation Project.

Enterprise level document exchange: Open standards

The venerable Electronic Data Interchange (EDI) standard seeks to automate the exchange of business information between companies. An entire industry centers on EDI data conversion, exchange, and transmission, and Linux has virtually no presence in it. The same is true in the complementary world of Enterprise Resource Planning (ERP).

Proprietary and byzantine data formats have contributed to the inability of open technology to crack these markets. But as more and more businesses see the merits of open technology, vendors are being forced to standardize. Many of them are shaping their standardization efforts around XML.

A vast range of information systems vendors -- who are developing everything from databases to groupware -- are exploring XML for standardized and flexible data management. This movement is fueled by a flurry of XML-based standards in almost every industry. The more Linux tools support these standards, the more Linux can insinuate itself into the corporate enterprise. Support for such widely used technologies as TCP/IP and SMB has taken Linux this far. We can ride the XML train even farther.

Linux system administration made easy -- or at least less painful

Even if you couldn't care less about corporate enterprise, there are good reasons you should take a second look at XML.

One of the more annoying aspects of Linux administration is keeping up with the various formats and conventions of dotfiles. The format of, say, sendmail.cf differs greatly from that of .procmailrc. While a difference in semantics is understandable (because the tools that use the files are in different domains) there is no reason for the syntax to be so different.

The adoption of XML for formatting such tools would be a large step in the right direction. The basic syntax would be standard from tool to tool and existing configurations would be easier for humans to read.

XML .procmailrc, for example, might read as follows:

<?xml version="1.0"?>
<!DOCTYPE ProcmailRC SYSTEM "/usr/doc/procmail/procmailrc.dtd">
<ProcmailRC>
    <Recipe>
        <Condition regex="^(To|Cc):.*xml-dev"/>
        <Action>
            <Redirect command="/usr/lib/mh/rcvstore +XML-DEV-Folder"/>
        </Action>
    </Recipe>
    <Recipe>
        <Condition regex="^From:.*jokes@spam.com"/>
        <Action>
            <Forward>
                <Recipient address="spamee@myorg.com"/>
                <Recipient address="joke.lover@myorg.com"/>
            </Forward>
        </Action>
    </Recipe>
</ProcmailRC>

One notable problem with this is that users would have to get used to XML's more complex (if quite sensible) way of escaping control characters. For instance, to have the recipe condition that the message subject contain the string [XML&Linux], one would have to use

regex="^Subject:.*\[XML&amp;Linux\]"

where not only would the ampersand (&) be replaced with the built-in entity &amp; (due to its special meaning in XML), but the square brackets ([) and (]) would be escaped in a different way (due to their special meaning in regular expressions). But at least the uniform adoption of XML would standardize the forbidden characters, as the uniform adoption of regular expressions has.
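The two layers of escaping can be made concrete with a minimal Python sketch. It assumes only the standard library and reuses the hypothetical Condition element from the example above. It writes the attribute with XML escaping applied, then shows that a parser hands the application the regular expression with the XML layer already removed.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The regular expression as the mail filter wants to see it: the square
# brackets are escaped for the regex engine, the ampersand is literal.
pattern = r"^Subject:.*\[XML&Linux\]"

# quoteattr() applies XML escaping (& becomes &amp;) and adds the quotes.
element_text = "<Condition regex=%s/>" % quoteattr(pattern)
print(element_text)

# Parsing undoes the XML layer, so the application gets the regex back intact.
condition = ET.fromstring(element_text)
recovered = condition.get("regex")
print(recovered == pattern)
print(bool(re.search(recovered, "Subject: [XML&Linux] parser news")))
```

Only the person editing the file by hand ever sees the &amp; form; a tool that reads or writes the file through an XML library never has to think about it.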

 
There are a growing number of XML parser libraries available for Python, C++, Java, Perl, and other languages. Using these would free the application from the task of validating the basic syntax and semantics of config files. A parser would check the former as a matter of course, and the latter by validating against a supplied DTD. In fact, if a fast and compact parser library could be agreed upon by developers and included as a shared object in Linux distributions, not only would application writers be able to use it, but there would be the advantage of reducing bloat. Parsing is a complex business, and right now almost every utility comes with its own miniparser. The libxml that comes with the GNOME desktop environment is a useful, if incomplete, step in this direction.
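As a rough illustration of how little code an application needs once a parser library does the syntax checking, here is a sketch using Python's standard xml.etree.ElementTree module. The ProcmailRC vocabulary is the hypothetical one from the earlier example, and the function names are invented for this sketch. ElementTree is a nonvalidating parser, so it checks only well-formedness. Checking against a DTD would need a validating parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML .procmailrc, following the example earlier in the article.
CONFIG = """<?xml version="1.0"?>
<ProcmailRC>
    <Recipe>
        <Condition regex="^(To|Cc):.*xml-dev"/>
        <Action>
            <Redirect command="/usr/lib/mh/rcvstore +XML-DEV-Folder"/>
        </Action>
    </Recipe>
    <Recipe>
        <Condition regex="^From:.*jokes@spam.com"/>
        <Action>
            <Forward>
                <Recipient address="spamee@myorg.com"/>
                <Recipient address="joke.lover@myorg.com"/>
            </Forward>
        </Action>
    </Recipe>
</ProcmailRC>
"""


def load_recipes(text):
    """Return a list of (regex, action element name) pairs, one per recipe."""
    root = ET.fromstring(text)  # raises ET.ParseError if not well-formed
    recipes = []
    for recipe in root.findall("Recipe"):
        regex = recipe.find("Condition").get("regex")
        action = recipe.find("Action")[0].tag  # first child of <Action>
        recipes.append((regex, action))
    return recipes


def is_well_formed(text):
    """Let the parser, not the application, judge the basic syntax."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


recipes = load_recipes(CONFIG)
for regex, action in recipes:
    print(action, regex)
```

The application code never tokenizes anything itself. A missing quote or an unclosed tag is rejected by the library before load_recipes sees any data.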

Future XML: Smarter search engines, better hyperlinks, collaborative Web publishing

One of the more ambitious aims of XML is to allow easier indexing of the Web, and the leading tool in this effort is the Resource Description Framework, or RDF. Basically, RDF provides a standard means of specifying and exchanging the classification of resources, which are often housed on the Web (or intranet).

Linux support for RDF -- in the form of a backend tool to store and exchange RDF properties, thus allowing for searches and listings -- would be of great value to companies in need of a system to manage their intranet. It would also be attractive to large commercial sites looking for a simpler way to provide site maps and indexes.

Equally attractive is the way XML promises to improve hyperlinks. Much as HTML and the Web were a generalization of the Apple HyperCard, XML XLinks has the potential to foster a revolution that will change the face of hypertext today.

HTML hyperlinks are unidirectional one-to-one links. This article, for example, links to the Koala XSL page, allowing you to click from this page to the Koala page. But once there, there is nothing in the Koala page that links back to this one. You can click the Back button on your browser, but that is a browser convenience, not a link. And if I wish to link to ArborText's XML Styler, I need to provide a separate link.

 
XLink expands this narrow idea of links. It allows you to define a single link with multiple destinations. If you go to the GNOME page, you'll see that each entry in the table of GNOME applications has a link to a home page, a screen shot, and a package for download. With an XLink, a single link handles all three roles. XLink doesn't specify how the browser handles this, but an XLink-capable browser might pop up a submenu for the link, allowing you to choose from the three destinations. Or, if you provide a link with two destinations -- one to a visible page and one to an audio readout of that page -- a browser could be programmed to always select the audio destination for its blind user.
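A single link with several destinations might look like the sketch below. This is an illustration under assumptions. The attribute names follow the later W3C XLink 1.0 Recommendation and not the drafts current when this article was written. The app and dest element names and the example.org URLs are invented. The Python code plays the part of a browser collecting the choices it could offer in a submenu.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical extended link: one link, three destinations.
LINK = """
<app xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
    <dest xlink:type="locator" xlink:title="Home page"
          xlink:href="http://example.org/app/"/>
    <dest xlink:type="locator" xlink:title="Screen shot"
          xlink:href="http://example.org/app/shot.png"/>
    <dest xlink:type="locator" xlink:title="Download"
          xlink:href="http://example.org/app/app.tar.gz"/>
</app>
"""


def destinations(link_text):
    """Map each locator's title to its target, as a submenu might list them."""
    root = ET.fromstring(link_text)
    result = {}
    for child in root:
        if child.get("{%s}type" % XLINK_NS) == "locator":
            title = child.get("{%s}title" % XLINK_NS)
            result[title] = child.get("{%s}href" % XLINK_NS)
    return result


def choose(link_text, preferred_title):
    """Pick one destination automatically; returns None if it is not offered."""
    return destinations(link_text).get(preferred_title)


for title, href in destinations(LINK).items():
    print(title, "->", href)
```

A browser configured for a blind user could call something like choose() with its preferred kind of destination and skip the menu altogether.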

XLink also supports multiended, multidirectional links. This means I could link to the Koala XSL page as a reference and have a built-in link from that page back to this one. That way, if you went to the Koala page first, you would find a link to this article, as well as to any other articles with this kind of bidirectional link. The key is that the Koala page maintainers don't need to change their page to make this happen.

There is even more hocus-pocus possible with XLink. In practice, though, this would require an XLink database to which the browser had access. This could be local, if we were in the same department, or you and I could subscribe to the same global XLink database.
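The idea of a shared link database can be sketched in a few lines of Python. This is a toy in-memory model, not any real XLink implementation. The LinkDatabase class and the example.org URLs are hypothetical. The point it illustrates is that the links are stored outside the pages, so neither page has to be edited for the link to work in both directions.

```python
from collections import defaultdict


class LinkDatabase:
    """Toy out-of-line link store: links live here, not in the pages."""

    def __init__(self):
        self._links = defaultdict(set)

    def add_link(self, a, b):
        # Record the link at both ends so it can be followed either way.
        self._links[a].add(b)
        self._links[b].add(a)

    def links_for(self, url):
        # What a browser would ask the database when it displays `url`.
        return sorted(self._links.get(url, ()))


db = LinkDatabase()
article = "http://example.org/articles/xml-on-linux.html"
reference = "http://example.org/koala/xsl.html"
db.add_link(article, reference)

print(db.links_for(reference))  # the referenced page now "links back"
```

Whether the database is local to a department or global only changes where the browser sends its query.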

XLink extensions are admirable, but they're also very ambitious and it will take some time to develop their practical application. However, a related innovation -- which works quite well even within the familiar HTML-style links -- is a much more powerful way of describing destinations. Using XPointers, one can not only link to all of an XML document, but also to a particular tag at a particular position using a content model familiar to JavaScript users.
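To give a feel for addressing "a particular tag at a particular position", the sketch below uses the limited path syntax of Python's standard xml.etree.ElementTree module. This is a stand-in chosen for illustration, not XPointer syntax itself. The document is a trimmed version of the hypothetical ProcmailRC example from earlier in the article.

```python
import xml.etree.ElementTree as ET

DOC = """<ProcmailRC>
    <Recipe><Condition regex="^(To|Cc):.*xml-dev"/></Recipe>
    <Recipe><Condition regex="^From:.*jokes@spam.com"/></Recipe>
</ProcmailRC>"""

root = ET.fromstring(DOC)

# "The Condition inside the second Recipe": element name plus position.
target = root.find("./Recipe[2]/Condition")
print(target.get("regex"))
```

A link could then point at that one element and not just at the document as a whole.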

The Web today is mostly a read-only medium. Publishing is typically effected through FTP. XML is already attempting to change this, with the Distributed Authoring and Versioning standard, or WebDAV. This XML-based standard allows distributed authoring on the Web, with version control and management of document location. The intention of the WebDAV working group is to include extensions to HTTP to enable some of these goals.
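For a sense of what "XML-based" means here, the sketch below builds the XML body that a client would send with WebDAV's PROPFIND method to ask a server for properties of a resource. The element names and the DAV: namespace come from the WebDAV specification (RFC 2518). The propfind_body helper is a hypothetical name, and the sketch deliberately stops short of sending anything over the network.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"


def propfind_body(*property_names):
    """Build the XML body of a PROPFIND request for the named properties."""
    ET.register_namespace("D", DAV_NS)  # serialize with the usual D: prefix
    propfind = ET.Element("{%s}propfind" % DAV_NS)
    prop = ET.SubElement(propfind, "{%s}prop" % DAV_NS)
    for name in property_names:
        ET.SubElement(prop, "{%s}%s" % (DAV_NS, name))
    return ET.tostring(propfind, encoding="unicode")


body = propfind_body("displayname", "getlastmodified")
print(body)
```

The request itself is still ordinary HTTP with a new method name and an XML payload.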

Much work to be done

There have always been two flavors of "free" in Linux; price is the less important of the two.

The more compelling argument for increased adoption of Linux is the variety of solutions available to solve any particular problem, coupled with the freedom each of us has to develop his or her own solution if necessary.

The more support Linux has for the many open XML-based standards in emergence, the better our position for advocacy. What follows is a summary of what is possible with XML on Linux today (and what isn't), with my comments. Please see the Resources section at the bottom of this article for links to the documents, groups and sites mentioned in the following overview.

Browsers and editors
The only XML-enabled browsers for Linux, Mozilla and Jumbo, are both in the alpha stage. Jumbo, implemented in Java, is heavily slanted towards the Chemical Markup Language (CML).

For intelligent editing, there is only Visual XML, in beta and using Java. ArborText's sophisticated, commercial AdeptEditor supports many Unix platforms, but not Linux. Perhaps ArborText needs some coaxing to type make for us. Of course, Emacs and XEmacs have the powerful PSGML-mode to support editing XML the traditional Unix way.

Hopefully open source projects will crop up to address these basic gaps as more and more languages used by Linux users gain XML toolkits.

Data formats
A significant movement in the EDI community leans toward employing XML-based formats and technology, with an eye to making EDI less expensive and thus available to a greater percentage of companies. This increased openness should enable the development and deployment of more enterprise solutions on Linux.

It may be to the advantage of Linux software developers to participate in current initiatives for an XML-based standard for exchanging modeling and design data. This would allow major commercial CASE tools without a Linux version, such as Rational Rose, to exchange documents with the slowly emerging tools available on Linux, such as Together/J or Argo/UML (Java).

A group of vendors, including Rational Software and IBM, is pushing XMI, the XML Metadata Interchange Format (as well as SMIF, an XML language for modeling metadata), through the Object Management Group. Separately, a group of developers, myself included, is working on UxF, a lightweight format for UML exchange.

There are many other XML-based data-exchange formats in development, such as Extensible Forms Definition Language (XFDL); Web Interface Definition Language (WIDL); XML Remote Procedure Calls (XML-RPC); and a language for expressing data schemas in XML (XML-Data).

Style sheets
The SGMLTools project (formerly Linuxdoc-SGML) uses James Clark's SGMLS parser and the Jade DSSSL processor to generate pages in multiple formats for the Linux Documentation Project.

Since XML is a fully conformant subset of SGML, one can use the Jade processor under Linux to create XML style sheets in DSSSL. However, XML has a style sheet language of its own, the Extensible Style Sheet Language, or XSL. Also, the Cascading Style Sheet specification supports XML as well as HTML. Unfortunately, XSL is still in the intermediate stages of standardization. Thus the availability of tools is hampered by the likelihood of change.

One interesting option available now is IBM's TeXML Java toolkit, which processes an XML document and XSL style sheet into a special format, TeXML, and then converts the TeXML to TeX. There are many tools for converting TeX or its DVI output to various formats, including HTML. Also for Java, there are the Koala XSL processor and ArborText's XML Styler. Then there is the Language Technology Group's xslj, which translates XSL to DSSSL. This might be the best current option for those who don't wish to use Java.

XML support in programming and parsing
Right now we have Expat for C, by the prolific James Clark, and the Language Technology Group's LT XML, but neither validates documents (they only check for basic syntactical compliance, aka "well-formedness"). For document validation we currently have Clark's SP for C++.

Java has Clark's XP, IBM's XML4J library (the license of which was recently liberalized), and DataChannel's DXP. Sun and Oracle also distribute Java XML processors and toolkits of varying levels of openness.

If you're planning to use a Java XML parser in an applet, you'll find that the main libraries mentioned are probably too heavyweight for convenient Internet download. Microstar provides AElfred as a lightweight Java-based parser particularly for use in applets.

Python has a comprehensive XML package which includes a set of validating and nonvalidating parsers and APIs for many supporting standards.

Perl's support is primarily through Clark Cooper's XML::Parser which can function as a wrapper for James Clark's Expat (thus nonvalidating), or in other forms. There are also other modules for XML-related standards.

There are already a few utilities that use XML as a backend technology. Glade is a GTK-based user interface builder. It stores project data in slightly off-standard XML, from which C code for the GUI is generated.

And of course the Mozilla project uses XML for everything from GUI design to internal data storage (including bookmarks). Sun Microsystems's recent $30,000 "bounty" on an XSL engine for Mozilla should warm things up even more.

RDF, XLinks, and XCatalog
Janne Saarela's Simple RDF Parser & Compiler (SiRPAC) provides a start in Java, and IBM shows its continued zeal for XML and Java with its own RDF processor. Interestingly, the rpm2html tool for generating Web pages for RPM packages supports RDF metadata as a way of eventually managing large repositories of RPMs.

Python has PyPointers, an XPointer locator by Lars Garshol, and Java has Patrice Bonhomme's XPointer Parser. For those who do want to work with XLink in Linux, there is Simon St. Laurent's XLinkFilter. (A link database for Linux that could be used with Mozilla would be a worthwhile project.)

For browsers, parsers, and style sheet processors to locate DTDs, entities, and the like, which are based on public identifiers, there needs to be a catalog of identifiers. One can use the SGML Open Catalog spec for this, but there is a draft proposal for an XCatalog subset specialized for XML. Some protocols and standards for supporting catalogs in Linux are important for XML development. Unfortunately, catalogs have no official standing in current XML standards, but with the XCatalog work in progress, this is likely to change. Luckily, some parsers already support XCatalogs, such as Lars Garshol's xmlproc, which is part of the aforementioned Python XML package.

WebDAV
Greg Stein provides the mod_dav module to support DAV extensions for the Apache Web server.


Resources

Introduction to XML
XML-enabled browsers for Linux
XML editors for Linux
Data modelling
XML language toolkits & parsers
Style sheets
RDF
XCatalog
XLinks
WebDAV
References
Further reading

Source: LinuxWorld, published by Web Publishing Inc., March 1999