
D-Lib Magazine
February 1999

Volume 5 Number 2

ISSN 1082-9873

ICAAP eXtended Markup Language: Exploiting XML and Adding Value to the Journals Production Process


Mike Sosteric
Executive and founding director of ICAAP
Athabasca University
Athabasca, Alberta, Canada

mikes@athabascau.ca

 

Introduction

The same idea has been bandied about by university leaders and head librarians in the United States for the past year. But the consortium... may be the first large-scale project designed to encourage scholars to publish their work on their own.

Lisa Guernsey writing about ICAAP - Chronicle of Higher Education, 1998.

This article discusses the technological advances attained by a recently announced international effort to reform the scholarly communication system and provide an alternative to the high priced commercial presses. This consortium, named the International Consortium for Alternative Academic Publication (ICAAP), has as its explicit goal the elimination of technological, social, and political barriers to reforming the scholarly communication system. ICAAP is an international consortium of scholars, libraries, and programmers, based at Athabasca University [Note 1], and devoted to demonstrating that a high quality scholarly communication system can be created without the high cost of the old paper based system. This paper describes the progress that has been made on the technical aspects of that agenda. For more information on ICAAP see Sosteric (1998) or Guernsey (1998).

Despite both the need for reform created by a high cost scholarly communication system, and the potential for reform inherent in information technologies, significant obstacles have impeded attempts to bring change to the scholarly communication system. These obstacles include the inability of various stakeholders to work together, the resistance of the commercial presses to what they most likely perceive as a threat to their continued existence, and a global political agenda pushing the institutions of higher education away from a public service ethic and towards an ideology that emphasises private profit and market orientation.

The potential costs of failing to reform the system are well rehearsed by now. Rising costs for distributing scholarly information, declining access to the world’s scientific output (especially in developing nations), the development of a tiered communication system, and a decline in educational quality, are all possibilities if the system continues to move towards commercialisation and higher cost. This scenario has not gone unnoticed. Many have chosen to raise their voices against ongoing commercialisation over the years, and appeals for reform have been more frequent and increasingly resounding as time has passed.

ICAAP Production: Bringing SGML Sophistication to Electronic Publication

A critical part of the task of demonstrating that non-commercial alternatives are viable is the development of a journals production system which exploits the power of available information technologies. ICAAP has been exploring SGML and XML as the central technologies to be used to deliver online resources. In fact, ICAAP has developed an XML implementation known as the ICAAP eXtended Markup Language, abbreviated "IXML."

The IXML language, along with backend software, allows ICAAP to introduce sophisticated indexing and document handling capabilities at a very low cost. In a recent article, Anthony Beavers (1998) described the application of IXML as a meta-tagging system that is used with the GOLIATH search engine to automatically add structured indexing capability to online scholarly journals. However, the potentials extend far beyond those to be realized in this indexing system, which was developed by the Internet Applications Laboratory (IALab) at the University of Evansville.

XML basically provides an easier-to-implement SGML system. In the words of Peter Flynn, XML is:

an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be parsed and validated the same as any other SGML file. [Note 2]

Understanding the potentials of IXML requires going into some detail concerning the tagging system. IXML is based in large measure on HTML. Indeed, IXML is both an extension of HTML and a stripped down version of it. It is stripped down in the sense that the elements that are used in regular HTML to specify the format of a page are disallowed in IXML (e.g., the <FONT> tag, the <EM> tag, etc.) because they are irrelevant to scholarly journals and unnecessarily complicate document handling by adding too much complexity and uncertainty. Elements that describe the structure of a page are retained. There is good reason to remove complexity -- or at least control it. By removing superfluous elements and by increasing control over document structures, IXML makes it possible to streamline and automate the journals production process. Put another way, with a reduced HTML element set, it is much easier to anticipate document characteristics and handle electronic texts automatically.

Note, however, that eliminating the ability to include <FONT> and <EM> tags does not disadvantage a journal article in terms of appearance or functionality. In fact, quite the opposite is the case. As is well known, this separation of the logical structure of the article from its presentation is a desideratum of electronic text handling. Not only does this allow tighter control over the logical structure of documents, but by stripping the ability to include discredited and deprecated elements, IXML expects journal authors and editors to handle the appearance of information via style sheets. In the end, this allows journals to enhance both control over logical structure and control over presentation.

Most modern browsers support Cascading Style Sheets, a system of style sheets designed for use with HTML. XSL (the eXtensible Style Language) is the equivalent that is under development for use with XML. As can be seen by examining, e.g., http://www.icaap.org/TheCraft/content/1999/beavers/, with a stylesheet enabled browser, professional results can be achieved even when appearance information is strictly excluded from IXML markup.

As noted, besides being a stripped down version of HTML, IXML is also an extension to HTML designed to more accurately reflect the logical structure of scholarly journal articles. That is, IXML adds element definitions for those types of structures most often found in journal articles. For example, unlike HTML which has only two top level elements (the head and body), IXML has four. These top level elements include, like HTML, a document head and body. However, in addition to these basic elements, optional endnotes and references sections can be added. Figure One below provides a graphical representation of these document elements. [Note 3]

Figure One: IXML Top Level Document Elements

IXML
|_(head,
|__body,
|__endnotes?,
|__references?)

The usage of the endnotes and references elements is self-explanatory. There is considerable utility in providing separate IXML elements for these document structures. When automating document handling, having these additional structures allows document endnotes and references to be treated in unique ways. For example, providing an IXML container for all references allows ICAAP parsing software to add style commands to paragraphs in the references section differently than those that appear in the body. Thus, while paragraphs in the body section may be styled as double space, paragraphs in the references section may be styled as single space. Providing these additional containers thus provides an efficient way of identifying key structures in journal articles and processing these structures in a unique, but efficient, manner.
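The container-specific styling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ICAAP's parsing software: the sample document, the `SECTION_STYLES` table, and the `style_paragraphs` function are hypothetical, and only the top level element names come from Figure One.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature IXML document. Only the top level element
# names (HEAD, BODY, REFERENCES) are taken from Figure One.
SAMPLE = """<IXML>
  <HEAD/>
  <BODY><P>First body paragraph.</P><P>Second body paragraph.</P></BODY>
  <REFERENCES><H1>References</H1><P>Beavers, Anthony (1998).</P></REFERENCES>
</IXML>"""

# Assumed style rules: double spacing in the body, single spacing
# in the references section.
SECTION_STYLES = {
    "BODY": "line-height: 2",
    "REFERENCES": "line-height: 1",
}

def style_paragraphs(root):
    """Attach a style attribute to each P according to its top level container."""
    for section in root:
        style = SECTION_STYLES.get(section.tag)
        if style is None:
            continue  # e.g. HEAD: no paragraph styling defined
        for para in section.iter("P"):
            para.set("style", style)
    return root

styled = style_paragraphs(ET.fromstring(SAMPLE))
for section in ("BODY", "REFERENCES"):
    for para in styled.find(section).iter("P"):
        print(section, para.get("style"))
```

Because the container, rather than the paragraph itself, carries the distinction, the same P element can be rendered differently without any presentational markup in the source.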

Besides adding handling capability, adding these top level elements also allows for a more robust article error control process because the content of the elements can be more tightly controlled. XML allows IXML to specify which elements are permitted in certain structures, or to exclude some based on the position of elements. For example, the references section of an IXML document cannot contain the full set of IXML elements. It can contain only an optional level one heading (H1) and paragraph content. Similarly, the endnotes section can only contain an endnotetext container. This endnotetext container is an IXML widget that will be described below. Figure Two provides a graphical representation of the allowed content of these two sections.

Figure Two: IXML Second Level Document Elements - REFERENCES and ENDNOTES

REFERENCES
|_(h1* |
|__p)+

ENDNOTES
|_(endnotetext)+

The benefit of this tight content control is simple. It eases the task of document handling and conversion and creates a less error prone process. In technical terms, it allows ICAAP processing software to anticipate all document possibilities with ease and confidence. Tightly controlling document content means there are fewer surprises that might "break" the ICAAP document conversion process. This allows for the creation of a very robust and virtually error free production system.
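To make the idea of tight content control concrete, the following Python sketch checks the two sections against the child elements listed in Figure Two. In practice a validating parser working from the DTD would perform this check; the hand-written `ALLOWED_CHILDREN` table and `check_section` function are hypothetical stand-ins used here only for illustration.

```python
import xml.etree.ElementTree as ET

# Allowed direct children, read off Figure Two.
ALLOWED_CHILDREN = {
    "REFERENCES": {"H1", "P"},
    "ENDNOTES": {"ENDNOTETEXT"},
}

def check_section(section):
    """Return a list of error strings for one REFERENCES or ENDNOTES element."""
    allowed = ALLOWED_CHILDREN[section.tag]
    errors = []
    children = list(section)
    if not children:
        # Both content models end in "+", so at least one child is required.
        errors.append(f"{section.tag} must not be empty")
    for child in children:
        if child.tag not in allowed:
            errors.append(f"{child.tag} is not allowed inside {section.tag}")
    return errors

good = ET.fromstring("<REFERENCES><H1>References</H1><P>Sosteric (1998).</P></REFERENCES>")
bad = ET.fromstring("<ENDNOTES><P>Stray paragraph.</P></ENDNOTES>")
print(check_section(good))  # []
print(check_section(bad))   # ['P is not allowed inside ENDNOTES']
```

A conversion program that runs only on documents passing such a check never has to guess what it will find inside these containers.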

As noted above, IXML requires a body and a head element. As can be seen from Figure Three, the IXML body is pretty much the same as a regular HTML body - sans some irrelevant elements. The body of an IXML document takes paragraphs, quotations, headings, list structures, tables (not shown) and an IXML widget called a publicationnote.

Figure Three: IXML Second Level Document Elements - BODY

BODY
|_((publicationnote)?,
|__(h1* |
|___h2* |
|___h3* |
|___h4* |
|___h5* |
|___h6* |
|___p* |
|___ul* |
|___ol* |
|___dl* |
|___blockquote)+)

Most of the items in Figure Three are self explanatory. Headings, ordered and unordered lists are familiar from their widespread use in HTML. However there is one relatively important difference between the HTML body and the IXML body. This difference appears in the content model for the paragraph tag (<P>). As Figure Four demonstrates, the IXML paragraph is both less than, and greater than, the HTML paragraph.

Figure Four: IXML Third Level Document Elements - P

P
|_((#PCDATA |
|___tt |
|___i |
|___b |
|___u |
|___sub |
|___sup |
|___br |
|___a |
|___inline |
|___endnotenumber)+)

As can be seen, the IXML paragraph contains much of what individuals would expect. Paragraphs contain text (#PCDATA), italic, bold, underline, superscript, and subscripted text. Paragraphs may also contain line breaks (BR) and HTML anchors (A). Unlike HTML, however, IXML paragraphs cannot contain the logical formatting elements (e.g., EM). Also unlike HTML, the IXML paragraph contains additional elements to mark IXML widgets. Here the IXML widgets include an element for inline graphic and textual content, and an element to mark endnote numbers.
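A short Python sketch shows how the paragraph content model in Figure Four could be used to flag markup carried over from ordinary HTML. The `ALLOWED_IN_P` set is copied from the figure; the function name and the two sample paragraphs (including the content placed inside ENDNOTENUMBER) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Inline elements permitted inside P according to Figure Four.
ALLOWED_IN_P = {"TT", "I", "B", "U", "SUB", "SUP", "BR", "A", "INLINE", "ENDNOTENUMBER"}

def disallowed_inline(paragraph):
    """Return the tags of direct children of a P that Figure Four does not list."""
    return [child.tag for child in paragraph if child.tag not in ALLOWED_IN_P]

# Hypothetical paragraphs: one valid, one carrying legacy HTML elements.
ok = ET.fromstring("<P>Water is H<SUB>2</SUB>O.<ENDNOTENUMBER>1</ENDNOTENUMBER></P>")
legacy = ET.fromstring('<P>An <EM>emphasised</EM> and <FONT size="2">resized</FONT> phrase.</P>')

print(disallowed_inline(ok))      # []
print(disallowed_inline(legacy))  # ['EM', 'FONT']
```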

The IXML Head

So far in this discussion of IXML, we have seen how the elements for references, endnotes, and the document body both add to, and subtract from, regular HTML in order to provide a more intuitive, easier to handle, and more robust, representation of journal documents. A key component of the IXML document language that allows the creation of this integrated production system is the use of an extended IXML head structure. The head of the IXML document is reserved primarily for bibliographic and indexing information. This information generally includes the document abstract and author, document web location, keywords, publishers and distributor of the document, etc. Unlike regular HTML where this information is included in an often haphazard manner in the body of the document, in IXML all such information is moved out of the body and into the head.

Recently, some digital libraries researchers have advocated the emerging Resource Description Framework (RDF) as a method for expressing the syntax of a metadata scheme, with XML used to represent it. For many purposes, including ICAAP, this is unduly complex. By representing metadata directly in the DTD, IXML provides everything that a scholar requires in a manner that is comprehensible to authors, straightforward to process, and flexible and extensible enough to incorporate more complex requirements and future developments.

The benefits from locating all metadata in the head are almost innumerable. Putting all this information in a location that is consistent and tightly controlled allows for the intelligent parsing and indexing of IXML documents. This means that search engines like the DAVID engine of the IALab can add structured indexing and sophisticated database capabilities not possible with unstructured HTML. It also means that the documents can be parsed and formatted in a consistent and controlled manner. For example, always being able to locate the document title and subtitle means always knowing where to place them in output files. This solves a significant problem with online publication -- i.e., the lack of consistency and standardisation of web documents. With the IXML head structure, and the use of stylesheets, all articles in a journal can be guaranteed to look the same.

There are other benefits. The most important benefit from this author's perspective is that the use of the IXML head structure allows documents to be output in multiple formats, and for multiple platforms, in an easy and efficient manner. Browsers have great difficulties in printing conventional web pages. Being able to locate and control key bibliographic information means that output programs can be written that provide complex document transformations. Space constraints limit going into more detail about the transformation process. For now it seems worthwhile to examine in more detail the structure of the IXML head. Figure Five gives a graphical representation of the top level elements in the IXML head.

Figure Five: IXML HEAD Elements

HEAD
|_(resourcegroup,
|__publicationgroup,
|__seriesgroup,
|__indexinggroup)

As can be seen from Figure Five, the IXML head contains four top level elements. Each of these containers is designed to store a logical segment of an article or resource's bibliographic information. That is, the four containers provide an intuitive way of grouping information at different levels of abstraction. The resourcegroup is designed to hold information useful for describing the individual article. The publicationgroup is used to describe the publisher and distributor of the article or resource. The seriesgroup contains information on serialisation including volume and issue numbers, special issue title, and special issue editors, if applicable. Finally, the indexinggroup contains bibliographic information including Library of Congress subject headings, and the start date of the journal. It will be useful to go into a bit more detail concerning each of the groupings.

As noted above, the indexinggroup contains bibliographic and indexing information. The indexinggroup includes a list of keywords, an identifier to indicate the keyword scheme, and a startdate. The actual realisation of the indexinggroup in IXML code would look something like that in Figure Six.

Figure Six: IXML HEAD Elements - INDEXINGGROUP Example

<INDEXINGGROUP>
<KEYWORDS scheme="LCSH">
<ITEM>Women in Judaism</ITEM>
</KEYWORDS>
<IDNO type="IUICODE">900.1999.1.1</IDNO>
<STARTDATE><YEAR>1998-</YEAR></STARTDATE>
</INDEXINGGROUP>

As can be seen from Figure Six, the keywords element contains any number of item elements which can be used to provide a list of journal level keywords. In the above example, these keywords are derived from the Library of Congress Subject Heading (LCSH) Red Books. However, different schemes could be utilised including the UNESCO subject classification. The startdate indicates when the journal began publication, and when (and if) the journal stopped publication.

The idno number appears many times in the IXML header. In this case, the idno is of type "IUICODE." IUICODE stands for ICAAP Unique Identifier Code and is a unique identifier assigned by ICAAP that allows each article published under the auspices of ICAAP to be uniquely identified in the DAVID search database. This ability to uniquely identify articles independent of their location on the WWW allows very sophisticated document indexing, maintenance and tracking. This will mean that authors and readers will always be able to track down a journal article regardless of its web location simply by citing its IUICODE to the GOLIATH search engine.
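The location-independent lookup described above can be sketched as follows. The sketch reads the indexing information from an INDEXINGGROUP and keeps a small in-memory table mapping each IUICODE to its current web location. Following the prose, the identifier in the sample is marked with type "IUICODE". The `registry` dictionary, the function names, and the example.org addresses are hypothetical; they stand in for the search database and are not a description of it.

```python
import xml.etree.ElementTree as ET

INDEXINGGROUP = """<INDEXINGGROUP>
<KEYWORDS scheme="LCSH">
<ITEM>Women in Judaism</ITEM>
</KEYWORDS>
<IDNO type="IUICODE">900.1999.1.1</IDNO>
<STARTDATE><YEAR>1998-</YEAR></STARTDATE>
</INDEXINGGROUP>"""

def read_indexing(xml_text):
    """Extract keyword scheme, keywords, and IUICODE from an INDEXINGGROUP."""
    group = ET.fromstring(xml_text)
    keywords = group.find("KEYWORDS")
    idno = group.find("IDNO[@type='IUICODE']")
    return {
        "scheme": keywords.get("scheme"),
        "keywords": [item.text.strip() for item in keywords.findall("ITEM")],
        "iuicode": idno.text.strip() if idno is not None else None,
    }

# Hypothetical registry: IUICODE -> current web location.
registry = {}

def register(iuicode, url):
    """Record (or update) where the identified article currently lives."""
    registry[iuicode] = url

def resolve(iuicode):
    """Return the current location, or None if the code is unknown."""
    return registry.get(iuicode)

record = read_indexing(INDEXINGGROUP)
register(record["iuicode"], "http://www.example.org/old/article.html")
register(record["iuicode"], "http://www.example.org/new/article.html")  # article moved
print(resolve(record["iuicode"]))
```

The citation never changes when an article moves; only the entry in the table does.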

The second to the last element in the head is the seriesgroup. This IXML element is designed to hold information relevant to serialisation of the journal. As noted in Figure Seven, the seriesgroup contains a description of the resource. Figure Eight gives the content model for the IXML description element.

Figure Seven: IXML HEAD Elements - SERIESGROUP

SERIESGROUP
|_(description)

As can be seen, the seriesgroup contains only a description of the journal series. However this description can be quite detailed. As Figure Eight indicates, an IXML description can contain a number of elements including a stylesheet, graphic, web address, title and subtitle, date, abstract, etc.

Figure Eight: IXML HEAD Elements - DESCRIPTION

DESCRIPTION
|_((stylesheet?),
|__(graphic?),
|__(web?),
|__(title?),
|__(subtitle?),
|__(date?),
|__(abstract?),
|__(language?),
|__(idno?),
|__(availability?),
|__(respstmt?))

Note that the description element is designed to be used in a number of places inside the IXML head -- generally whenever a description of the resource is required. This means that the actual content of the description offers more options than would normally be used in describing a particular level of the resource in question. For example, inside a seriesgroup, most of the elements that are possible inside a description are not used. Generally, the description of a journal series would look something like the representation in Figure Nine.

Figure Nine: IXML HEAD Elements - SERIESGROUP Example

<SERIESGROUP>
<DESCRIPTION>
<WEB>http://www.sociology.org/Vol004.001/</WEB>
<DATE><YEAR>1999</YEAR></DATE>
<IDNO type="vol">4.1</IDNO>
</DESCRIPTION>
</SERIESGROUP>

The description above indicates that this article belongs to volume four, issue one of the journal. This issue was published in 1999 and is located at http://www.sociology.org/Vol004.001/. As can be seen, this basic description is quite simple and provides only the absolute minimum of information required to identify the location of an article in a journal series. Note, however, that additional tags can be added to indicate that the issue is a special issue, with its own title and editor and even its own copyright requirements.
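As an illustration of how little code is needed to recover this information, the Python sketch below reads the seriesgroup from Figure Nine. It assumes that an identifier of type "vol" is always written as volume, full stop, issue, which is how the prose above reads "4.1"; the `read_series` function itself is hypothetical.

```python
import xml.etree.ElementTree as ET

SERIESGROUP = """<SERIESGROUP>
<DESCRIPTION>
<WEB>http://www.sociology.org/Vol004.001/</WEB>
<DATE><YEAR>1999</YEAR></DATE>
<IDNO type="vol">4.1</IDNO>
</DESCRIPTION>
</SERIESGROUP>"""

def read_series(xml_text):
    """Return volume, issue, year, and web location of a journal issue."""
    desc = ET.fromstring(xml_text).find("DESCRIPTION")
    # Assumption: a "vol" identifier has the form "<volume>.<issue>".
    volume, issue = desc.findtext("IDNO[@type='vol']").split(".")
    return {
        "volume": int(volume),
        "issue": int(issue),
        "year": int(desc.findtext("DATE/YEAR")),
        "web": desc.findtext("WEB"),
    }

print(read_series(SERIESGROUP))
# {'volume': 4, 'issue': 1, 'year': 1999, 'web': 'http://www.sociology.org/Vol004.001/'}
```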

The second element in the IXML head is the publicationgroup. This element is used exclusively to indicate who is responsible for the journal or resource. Generally this involves "describing" the journal and also providing information on the publisher and distributor (if any) of the resource. The content model of the IXML publicationgroup element is given in Figure Ten.

Figure Ten: IXML HEAD Elements - PUBLICATIONGROUP

PUBLICATIONGROUP
|_((description?),
|
|__publisher,
| |_(name,
| |__address?,
| |__respstmt?)
|
|__distributor?)
  |_(name,
  |__address?)

As can be seen, the publicationgroup contains a description (which contains identical element possibilities to the previously discussed description), a publisher and a distributor. The publisher and distributor elements both contain the basic structures you would expect to find when providing information on organisations. There is a name and an address. The name and address tags contain bottom level elements that describe the information that would most often be contained in names and addresses. Like the description element, the name and address tags are designed to be reusable in other structures (e.g., to provide information on author). Figure Eleven describes the content model for the IXML name and address elements.

Figure Eleven: IXML HEAD Elements - NAME and ADDRESS

NAME
|_(full |
|__(honorific?,
|___first,
|___middle?,
|___last))

ADDRESS
|_(street*
|__city?
|__province?
|__postalcode?
|__organisation?
|__division?
|__email?
|__web?)

Figure Twelve provides an example of how the publicationgroup may be realised in a production environment.

Figure Twelve: IXML HEAD Elements - PUBLICATIONGROUP Example

<PUBLICATIONGROUP>
<DESCRIPTION>
<WEB>http://www.sociology.org/</WEB>
<TITLE>Electronic Journal of Sociology</TITLE>
<IDNO type="ISSN">1198 3655</IDNO>
</DESCRIPTION>

<PUBLISHER>
<NAME><FULL>Athabasca University</FULL></NAME>
<ADDRESS><EMAIL>mikes@athabascau.ca</EMAIL>
</ADDRESS>
</PUBLISHER>

<DISTRIBUTOR>
<NAME><FULL>ICAAP</FULL></NAME>
<ADDRESS><WEB>http://www.icaap.org/</WEB></ADDRESS>
</DISTRIBUTOR>
</PUBLICATIONGROUP>

Of course, the name, address and description tags are capable of resolving the publisher, distributor, and journal with much more detail if so desired.
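Because the name element is reused for publishers, distributors, and authors, a single routine can render all of them. The Python sketch below follows the content model in Figure Eleven: a name is either a single FULL element or a sequence of parts. The `display_name` function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

def display_name(name):
    """Render a NAME element (Figure Eleven) as a single string."""
    full = name.findtext("FULL")
    if full is not None:
        return full.strip()
    # Otherwise assemble the parts in content-model order, skipping
    # the optional ones that are absent.
    parts = [name.findtext(tag) for tag in ("HONORIFIC", "FIRST", "MIDDLE", "LAST")]
    return " ".join(part.strip() for part in parts if part and part.strip())

publisher = ET.fromstring("<NAME><FULL>Athabasca University</FULL></NAME>")
author = ET.fromstring("<NAME><FIRST>Mike </FIRST><LAST>Sosteric</LAST></NAME>")
print(display_name(publisher))  # Athabasca University
print(display_name(author))     # Mike Sosteric
```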

The final top level element of the IXML head to be described is the resourcegroup. This element is used to describe the resource at the "article" level. As can be seen from Figure Thirteen, the resourcegroup also contains a description of the resource (this time applied to the article itself), and one or more author elements. Each author element will contain, not surprisingly, a name and an address.

Figure Thirteen: IXML HEAD Elements - RESOURCEGROUP

RESOURCEGROUP
|_(description,
|
|__author+)
  |_(name,
  |__address?)

An example of the realisation of the resourcegroup tag is provided in Figure Fourteen.

Figure Fourteen: IXML HEAD Elements - RESOURCEGROUP Example

<RESOURCEGROUP>
<DESCRIPTION>
<STYLESHEET>http://www.icaap.org/TheCraft/article.css </STYLESHEET>
<GRAPHIC>http://www.icaap.org/graphics/quill1.jpg</GRAPHIC>
<WEB>http://www.icaap.org/TheCraft/1999/sosteric/article.html/</WEB>
<TITLE>ICAAP Document Automation</TITLE>
<SUBTITLE>Standardising the Storage of Electronic Texts</SUBTITLE>
<AVAILABILITY status="free">Copyright 1999 ICAAP</AVAILABILITY>
</DESCRIPTION>

<AUTHOR>
<NAME>
<FIRST>Mike </FIRST>
<LAST>Sosteric</LAST>
</NAME>
<ADDRESS>
<EMAIL>mikes@athabascau.ca</EMAIL>
<ORGANISATION>Athabasca University</ORGANISATION>
<DIVISION>Department of Global and Social Analysis</DIVISION>
</ADDRESS>
</AUTHOR>
</RESOURCEGROUP>
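Since the title, subtitle, stylesheet, and author always sit at known positions in the resourcegroup, the same source can feed more than one output format. The Python sketch below is a minimal illustration using an abridged copy of the example above; the function names and the two output shapes (a plain text heading and an HTML head fragment) are assumptions chosen for the example and do not describe ICAAP's own conversion programs.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abridged copy of the RESOURCEGROUP example.
RESOURCEGROUP = """<RESOURCEGROUP>
<DESCRIPTION>
<STYLESHEET>http://www.icaap.org/TheCraft/article.css </STYLESHEET>
<TITLE>ICAAP Document Automation</TITLE>
<SUBTITLE>Standardising the Storage of Electronic Texts</SUBTITLE>
</DESCRIPTION>
<AUTHOR>
<NAME><FIRST>Mike </FIRST><LAST>Sosteric</LAST></NAME>
</AUTHOR>
</RESOURCEGROUP>"""

def read_resource(xml_text):
    """Collect the article level fields needed by the output routines."""
    group = ET.fromstring(xml_text)
    desc = group.find("DESCRIPTION")
    authors = []
    for name in group.findall("AUTHOR/NAME"):
        first = (name.findtext("FIRST") or "").strip()
        last = (name.findtext("LAST") or "").strip()
        authors.append(f"{first} {last}".strip())
    return {
        "title": (desc.findtext("TITLE") or "").strip(),
        "subtitle": (desc.findtext("SUBTITLE") or "").strip(),
        "stylesheet": (desc.findtext("STYLESHEET") or "").strip(),
        "authors": authors,
    }

def to_plain_text(info):
    """Plain text heading, e.g. for a print or e-mail version."""
    return f"{info['title']}: {info['subtitle']}\nby {', '.join(info['authors'])}"

def to_html_head(info):
    """HTML head fragment that links the stylesheet named in the IXML source."""
    return (
        f"<head><title>{escape(info['title'])}</title>"
        f'<link rel="stylesheet" href="{escape(info["stylesheet"])}"></head>'
    )

info = read_resource(RESOURCEGROUP)
print(to_plain_text(info))
print(to_html_head(info))
```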

At first glance, the IXML head structures may seem quite complicated. However, this complexity is more apparent than real. Most of the information contained in the IXML head is consistent across all resources of an individual journal or publisher. Thus tags in the indexinggroup and publicationgroup remain constant. Tags in the seriesgroup change with each new issue of a journal. Of course, tags in the resourcegroup change on a per article basis. However, it is possible to have authors fill this information in for themselves by providing cut and paste templates, or by providing online forms to fill out. Either way, the actual task of adding an IXML header to documents is small when compared against the benefits of localising bibliographic information.
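The online form approach mentioned above might look something like the following Python sketch, which turns a handful of form fields into a resourcegroup. The function name, its parameters, and the sample values (including the example.org address) are hypothetical, and only a subset of the elements from Figure Thirteen is generated.

```python
import xml.etree.ElementTree as ET

def build_resourcegroup(title, first, last, email):
    """Build a minimal RESOURCEGROUP from values an author typed into a form."""
    group = ET.Element("RESOURCEGROUP")
    desc = ET.SubElement(group, "DESCRIPTION")
    ET.SubElement(desc, "TITLE").text = title
    author = ET.SubElement(group, "AUTHOR")
    name = ET.SubElement(author, "NAME")
    ET.SubElement(name, "FIRST").text = first
    ET.SubElement(name, "LAST").text = last
    address = ET.SubElement(author, "ADDRESS")
    ET.SubElement(address, "EMAIL").text = email
    # Serialising through the library escapes characters such as "&",
    # so authors cannot break the markup by accident.
    return ET.tostring(group, encoding="unicode")

print(build_resourcegroup("Markup & Meaning", "Ann", "Author", "author@example.org"))
```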

Conclusion

The question that must be on the reader's mind at this point is, "So what?" IXML looks pretty, but what are the production implications? Unfortunately, space constraints limit the ability of this author to go into more detail. That will have to wait for a future article. Suffice it to note at this point that in addition to allowing for sophisticated document indexing in search engines, IXML is also being used to provide automatic author, date, and title indices for ICAAP journals (see http://www.sociology.org/ for a beta example). IXML also gives ICAAP the ability to output multiple document formats. So far this includes the creation of a regular HTML version, and also a Dynamic HTML version with popup widgets (endnotes, graphics, etc.). The technical article at http://www.icaap.org/TheCraft/content/1999/sosteric/article_d.html provides a demonstration. Note that the above article and its multimedia features (pop up graphics and endnotes), along with a second regular HTML version, were created instantly from the original IXML source.

The implications of IXML extend beyond simply providing an automated document processing system. The digital libraries community is currently carrying out research and development into mark-up languages, style sheets, metadata schemes, and resource description. The work embraces XML, XSL, the Dublin Core, RDF, URNs, and much more. It is easy to believe that this research will solve all the problems of scholarly publication and that there is little benefit in beginning implementation yet. Both these assumptions are wrong. At ICAAP, we have implemented a system for scholarly publishing, using technology that is widely available today. Moreover, our experience provides valuable feedback to suggest which of the research concepts are likely to be welcomed in practice.

We are using a Document Type Definition (DTD) (influenced by the TEI DTD), which builds on HTML, but is much better suited to represent scholarly journal articles. The IXML DTD encapsulates several key features of the ICAAP approach, which is designed to provide a sensible and easy to use representation of the journal document structure. With IXML, metadata is consolidated in the head container. The structure of the head is carefully defined to allow for automated document handling and future element extensions for those journals or scholarly resources that require more sophisticated metadata representation. The elements of the IXML body are restricted to those that describe the content and structure of the document, eliminating those that are purely concerned with appearance. Cascading Style Sheets (and eventually XSL, because IXML is XML) are used to render documents. Finally, a unique identifier, IUICODE, is associated with every document, and a database system is used to resolve the identifier to the location of the document.

The strengths of the ICAAP approach are numerous. Since IXML is SGML, documents benefit from the strengths associated with SGML (document longevity, safe archival, etc.). With the IXML head, it becomes possible to represent complex metadata and bibliographic information in a simple, easy to understand, easy to incorporate, and easy to extend format. Rather than relying on the inadequate HTML meta tag, which was never designed to handle complex metadata and requires confusing and error prone contortions to represent complex information, the IXML head allows bibliographic information to be represented in a hierarchical and structured fashion. Further, the association of IXML with a DTD means that incredibly powerful SGML validation and error correction can be applied to all tagged articles in order to guarantee that the IXML head and body have been incorporated correctly. This level of certainty is impossible with schemes like the Dublin Core and allows us increased confidence that automated processing software, including web roaming robots, will correctly utilise the information contained in the IXML head and document.

One of the principal objections to IXML that could be raised is that, like other initiatives, it is not supported by a software base which can recognise and handle the IXML language. However, ICAAP and the IALab have worked very hard to overcome this critical limitation. Work has already begun on search engines which understand IXML. The Noesis search engine (http://noesis.evansville.edu/) is being made IXML-aware, as is the GOLIATH scholarly journals indexing system being developed by the IALab. In addition, ICAAP has a suite of tools already in place which work with IXML and provide the backend required for processing IXML documentation. In a very short time, ICAAP has created an admittedly young and undeveloped, but largely complete and workable, scholarly communications infrastructure.

A final benefit of IXML is worth noting. Unlike many other initiatives, IXML and especially the IXML head, is designed to allow for easy document transformation. Under the current circumstances, where no general agreement has been reached as to the correct way to represent metadata or journal information, IXML is perhaps the safest alternative among a number of competing approaches. Not only is there a workable infrastructure in place which can support the use of IXML, but even if IXML does not become a standard approach to representing journal articles, the fact that IXML has been designed to allow for easy document conversions means that any documents currently marked up in IXML can be efficiently transformed to any future markup or metadata language. For example, should the Dublin Core become the standard representation scheme, a simple script could be developed and distributed to easily convert IXML to HTML plus the Dublin Core. Alternatively, scripts could be developed which could output a document that includes both IXML and the Dublin Core (or some other scheme).
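A conversion script of the kind imagined above might look like the following Python sketch, which emits HTML meta tags carrying Dublin Core element names. The mapping from IXML paths to Dublin Core elements is an assumption made for this illustration and is not an official crosswalk; the sample head is a composite assembled from the figures in this article, and `dublin_core_meta` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET
from html import escape

# Composite sample head assembled from the article's figures (illustrative only).
HEAD = """<HEAD>
<RESOURCEGROUP>
<DESCRIPTION><TITLE>ICAAP Document Automation</TITLE></DESCRIPTION>
<AUTHOR><NAME><FIRST>Mike</FIRST><LAST>Sosteric</LAST></NAME></AUTHOR>
</RESOURCEGROUP>
<PUBLICATIONGROUP>
<PUBLISHER><NAME><FULL>Athabasca University</FULL></NAME></PUBLISHER>
</PUBLICATIONGROUP>
<SERIESGROUP>
<DESCRIPTION><DATE><YEAR>1999</YEAR></DATE></DESCRIPTION>
</SERIESGROUP>
<INDEXINGGROUP>
<KEYWORDS scheme="LCSH"><ITEM>Women in Judaism</ITEM></KEYWORDS>
</INDEXINGGROUP>
</HEAD>"""

def dublin_core_meta(head_xml):
    """Return HTML meta tags for the fields that have an assumed DC equivalent."""
    head = ET.fromstring(head_xml)
    pairs = []
    title = head.findtext("RESOURCEGROUP/DESCRIPTION/TITLE")
    if title:
        pairs.append(("DC.title", title.strip()))
    for person in head.findall("RESOURCEGROUP/AUTHOR/NAME"):
        parts = (person.findtext("FIRST"), person.findtext("LAST"))
        pairs.append(("DC.creator", " ".join(p.strip() for p in parts if p)))
    publisher = head.findtext("PUBLICATIONGROUP/PUBLISHER/NAME/FULL")
    if publisher:
        pairs.append(("DC.publisher", publisher.strip()))
    year = head.findtext("SERIESGROUP/DESCRIPTION/DATE/YEAR")
    if year:
        pairs.append(("DC.date", year.strip()))
    for item in head.findall("INDEXINGGROUP/KEYWORDS/ITEM"):
        pairs.append(("DC.subject", item.text.strip()))
    return [f'<meta name="{key}" content="{escape(value)}">' for key, value in pairs]

for tag in dublin_core_meta(HEAD):
    print(tag)
```

Because every field is read from a fixed position in the head, adding another target scheme would mean writing another small mapping rather than re-keying the articles.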

ICAAP has only begun to tap the potentials of IXML. Future plans include the introduction of a document tracking system so that documents can be tracked by simply entering a URL with the article's IUICODE (e.g., http://www.icaap.org/iuicode?100.4.1.1), and the ability to produce document abstracts, tables of contents, etc., "on the fly" from a single source file.
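Producing a table of contents "on the fly" is straightforward once headings are the only structural markers allowed in the body. The Python sketch below is one hypothetical way to do it; the sample body and the `table_of_contents` function are invented for the example, and only the heading element names come from Figure Three.

```python
import xml.etree.ElementTree as ET

# Hypothetical article body using only elements from Figure Three.
BODY = """<BODY>
<H1>Introduction</H1><P>Opening text.</P>
<H2>Background</H2><P>More text.</P>
<H1>Conclusion</H1><P>Closing text.</P>
</BODY>"""

# Map heading tags H1..H6 to their nesting level.
HEADINGS = {f"H{level}": level for level in range(1, 7)}

def table_of_contents(body_xml):
    """Return (level, title) pairs for the headings, in document order."""
    body = ET.fromstring(body_xml)
    return [
        (HEADINGS[el.tag], "".join(el.itertext()).strip())
        for el in body
        if el.tag in HEADINGS
    ]

for level, title in table_of_contents(BODY):
    print("  " * (level - 1) + title)
```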

The possibilities seem endless and the potentials enormous. Imagination is our only limitation.

References

Beavers, Anthony (1998). Evaluating Search Engine Models for Scholarly Purposes: A Report from the Internet Applications Laboratory. D-Lib Magazine, December. http://www.dlib.org/dlib/december98/12beavers.html

Guernsey, Lisa (1998). Research Libraries Newsletter Examines Profits of Journal Publishers. Chronicle of Higher Education, October 30. Reprinted at http://www.icaap.org/chronicle98.html

Sosteric, Mike (1998). At the Speed of Thought: Pursuing Non-Commercial Alternatives to Scholarly Communication. Association of Research Libraries Newsletter, 200. http://www.arl.org/newsltr/200/200toc.html

Endnotes

[Note 1] Athabasca University's mission is to "remove barriers that traditionally restrict access to and success in university and to increase equality of educational opportunity." Supporting the development of a high quality, low cost, scholarly communication system supports the long term goals of Athabasca University. More information is available at <http://www.athabascau.ca/> and at <http://www.athabascau.ca/openu.htm>.

[Note 2] Peter Flynn, from the XML FAQ at <http://www.ucc.ie/xml>.

[Note 3] For production examples of IXML, see <http://www.icaap.org/TheCraft/> and examine the contents page and the associated articles.

Copyright © 1999 Mike Sosteric


DOI: 10.1045/february99-sosteric