The <xsl:key>s to Happiness 1

David J. Birnbaum (djbpitt+@pitt.edu)

Location: http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/keys.html
Last revised: 2004-08-02


Abstract: The present article is intended as a tutorial on the use of XSLT keys with multiple files.


Introduction

In the summer of 2003 I wrote an XSLT stylesheet that could produce a graphic (SVG) representation of the comparative structure of two or more medieval miscellany manuscripts. This visualization, originally developed in a different software environment by the American Slavist Hugh Olmsted in the 1970s, 2 is called a plectogram. A sample plectogram is available in the original SVG at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/am100mcb_am149nbw.svg (or embedded in HTML at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/plectogram.html) and in PNG format (without active animation) at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/plectogram.png. A more complete report about both the philological and informational aspects of the project is available at http://clover.slavic.pitt.edu/~djb/2003_ljubljana/2003_ljubljana_paper.pdf.

My XSLT stylesheet operated in two stages. In the first of these, it needed to read the modified-TEI manuscript descriptions, extract a list of the contents (which I called articles) from each description, and draw a column of cells for each manuscript, where each cell contained a four-digit hexadecimal number that represented the title of the article. The result of the first stage was a series of parallel columns, one for each manuscript, where each column consisted of cells containing four-digit numbers. The second stage involved drawing lines that connected cells in adjacent columns if they contained the same numbers, effectively creating a visual map of the correspondences in contents between adjacent manuscripts.

Both stages of plectogram generation required that the XSLT stylesheet access multiple files. To populate a column with numbers in the first stage, the stylesheet needed to read the manuscript description file, retrieve each article title, and look up the corresponding four-digit number in a separate reference file. To draw the connecting lines, the stylesheet needed to access in turn each article title in the left-hand member of two adjacent manuscripts and compare it to every article title in its right-hand neighbor, drawing lines where these matched. Although it was not strictly necessary for the purpose of plectogram generation to access the reference file during this stage (since the program was comparing the actual article titles in the two manuscript files, and not their associated numbers), the stylesheet nonetheless did access the reference file in order to generate a stream of information about its progress (using <xsl:message> elements).

The Problem

An early version of the stylesheet accessed the various manuscript descriptions and the reference file through regular XPath expressions. It worked as advertised, but very slowly, because it needed to load, parse, and navigate the same files repeatedly; generating a large plectogram for ten or twelve manuscripts could take as long as fifteen or twenty minutes. Optimization for speed was necessary in general, and particularly so because the plectogram-generation stylesheet was intended as part of an eventual client-server web system, whose remote users could not reasonably be expected to wait as patiently for the results of their queries as scholars at stand-alone workstations running local programs.

The Solution

General

When I presented a report on this project at Extreme Markup 2003, I had the opportunity to complain to Jeni Tennison about the slow operation, and she advised me to use XSLT keys as a way of generating cached hash tables that would obviate repeated parsing of the same document. This was, indeed, the key to solving the execution-time problem; a plectogram that took almost fifteen minutes to create without keys required less than four seconds with keys. While groping my way toward this solution, however, I discovered that the requirements for using keys across multiple documents were not entirely intuitive, and this report is intended to alert others who may be using keys for the first time to possible pitfalls and ways to avoid them.

Creating an <xsl:key> Element

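The <xsl:key> element takes three obligatory attributes:

  1. name, which identifies the key so that the key() function can refer to it;
  2. match, a pattern that specifies the nodes that the key retrieves;
  3. use, an expression, evaluated relative to each matched node, whose value serves as the pointer to that node.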

Sample <xsl:key> elements from my plectogram project are:

  1. <xsl:key name="articleByTitle" match="articleName[@lang='BG' or @lang='bg']" use="normalize-space(.)"/>
  2. <xsl:key name="codeByTitle" match="title" use="../code"/>

The first of these examples indexes every Bulgarian-language <articleName> element by the space-normalized version of its string value. It thus makes it possible to supply a string (typically the normalized string value of a single node that one has matched) and retrieve all of the Bulgarian-language <articleName> elements in the current context document whose normalized value equals that string. Thus, I could find all <articleName> elements in the current context document that have the value “Fiziolog” with key('articleByTitle','Fiziolog').

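The second example makes it possible to retrieve a <title> by the value of its sibling <code> in a reference file. If the numerical code corresponding to “Fiziolog” is “0070,” I can retrieve the “Fiziolog” <title> with key('codeByTitle','0070'). The quotation marks around the code are required: an unquoted 0070 is a number, which is converted to the string “70” and therefore does not match “0070.”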
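The second key can be tried out in a single-document stylesheet along the following lines. This is a minimal sketch rather than the plectogram code itself: the layout of the reference file is not given here, so the <articles> and <article> wrapper elements are assumptions, and only the sibling relationship between <title> and <code> comes from the key definition.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Keys are declared at the top level, as children of xsl:stylesheet. -->
  <xsl:key name="codeByTitle" match="title" use="../code"/>

  <!-- Assumed input (hypothetical layout of the reference file):
       <articles>
         <article><code>0070</code><title>Fiziolog</title></article>
       </articles> -->
  <xsl:template match="/">
    <!-- '0070' is passed as a string; key() returns the title element(s)
         whose sibling code has that string value. -->
    <xsl:value-of select="key('codeByTitle', '0070')"/>
  </xsl:template>
</xsl:stylesheet>
```

Applied to a reference file of the assumed shape, this should write the string Fiziolog. Because the input here is the reference file itself, no change of context document is needed.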

Things to know when creating <xsl:key> elements are:

Using key() to Access a Key

As was described above, once a key has been created, one can access information in it with the key() function by specifying the name of the key and the value to use as a pointer (via the use attribute of the <xsl:key> element) to the desired information (as specified by the match attribute of the <xsl:key> element). This process is straightforward as long as one remains within a single document, but because keys apply potentially to all documents accessed by the stylesheet, it is sometimes necessary to change contexts in order to retrieve information from the appropriate source. More confusingly, it is sometimes necessary to use information from one document to point into another, and to take the information retrieved from the second document and use it while continuing to process the first.

Whadda ya mean <xsl:for-each>?

The “stupid XSLT trick” to change the context document is to embed the retrieval inside an <xsl:for-each> element. The select attribute in this case does not serve, as it does with genuine for-each operations, to specify a set of nodes over which one wishes to iterate. Instead, it specifies a single node that is (typically) the root of a different document. When the key() function is used inside an <xsl:for-each> element, it will point into the document specified as the value of the select attribute.

For example, if one is working in a document other than bib.xml and one wishes to retrieve information with a key from bib.xml, one embeds the following code into one’s stylesheet (taken from the XSLT specification at http://www.w3.org/TR/xslt):

<xsl:for-each select="document('bib.xml')">
    <xsl:apply-templates select="key('bib',$name)"/>
</xsl:for-each>

This instructs the stylesheet to switch from whatever document it is processing to a document called bib.xml and to use the key called “bib” to retrieve the information that the value of the $name variable points to.
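Placed in a complete stylesheet, the same pattern might look like the sketch below, which reuses the codeByTitle key and the articles.xml file name from this article. The assumption is that the stylesheet is run against a manuscript description, so that the source document and the document holding the key values are different files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:key name="codeByTitle" match="title" use="../code"/>

  <!-- The source document is assumed to be a manuscript description,
       not articles.xml. -->
  <xsl:template match="/">
    <!-- Called here, outside the for-each, key() would search the
         manuscript description and find nothing useful. -->
    <xsl:for-each select="document('articles.xml')">
      <!-- The context node is now the root of articles.xml,
           so key() searches that document. -->
      <xsl:value-of select="key('codeByTitle', '0070')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

With a string argument and no second argument, document() resolves a relative URI such as 'articles.xml' against the location of the stylesheet, not against the source document.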

How Do I Get Information into an <xsl:for-each>?

Once one has entered an <xsl:for-each> element of the sort illustrated in the example above, the context has switched to the new document. One may have been processing the current node (“.”) previously, but once one has entered the <xsl:for-each> element in that example, “.” points to the bib.xml document, and not to the node that was originally being processed in a different document. Thus, if one needs to refer to that other document, one can no longer use “.” (or anything similar) to do so, and a different strategy is required.

One gets information into an <xsl:for-each> element by assigning it to a variable. This includes not only the value of the current context (“.”), but also any values that one might need to deduce by referring to it. For example, the plectogram program needs to perform differently inside the <xsl:for-each> depending on whether the value used to point into the key is the first or second occurrence of a specific article title. It also needs to write information that depends on the position of that original value in the tree in which it occurs. This information is stored in variables before entering the <xsl:for-each> element, as follows:

<xsl:variable name="senior" select="boolean(../preceding-sibling::*/articleName[. = current()])"/>
<xsl:variable name="count" select="count(../preceding-sibling::*/articleName[@lang='BG' or @lang='bg']) + 1"/>
<xsl:for-each select="document('articles.xml')"> ...

The “senior” variable determines whether an article name in a manuscript description file has been encountered before and the “count” variable gives the ordinal position of the article name in the manuscript description file. The XPaths in these variable declarations refer to axes within the manuscript description files; were one to use those XPaths inside the <xsl:for-each> element, they would refer (incorrectly for our purposes) to those same axes within the articles.xml reference file.
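A fuller sketch of this pattern follows. The two variable declarations are the ones quoted above; everything else is an assumption made for illustration, in particular the $name variable, the text output, and the codeForTitle key, which is a hypothetical key that indexes each <code> in articles.xml by its sibling <title>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical key: finds a code in articles.xml by its sibling title. -->
  <xsl:key name="codeForTitle" match="code" use="normalize-space(../title)"/>

  <!-- Suppress the built-in copying of text nodes from the manuscript file. -->
  <xsl:template match="text()"/>

  <xsl:template match="articleName[@lang='BG' or @lang='bg']">
    <!-- Everything that depends on the manuscript description is
         captured before the context changes. -->
    <xsl:variable name="name" select="normalize-space(.)"/>
    <xsl:variable name="senior" select="boolean(../preceding-sibling::*/articleName[. = current()])"/>
    <xsl:variable name="count" select="count(../preceding-sibling::*/articleName[@lang='BG' or @lang='bg']) + 1"/>
    <xsl:for-each select="document('articles.xml')">
      <!-- "." is now the root of articles.xml; only the variables
           still carry information from the manuscript description. -->
      <xsl:value-of select="concat($count, ' ', key('codeForTitle', $name))"/>
      <xsl:if test="$senior"> (repeat)</xsl:if>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For each Bulgarian-language article name, this writes its ordinal position and the code retrieved from the reference file, and marks names that have already occurred earlier in the manuscript.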

How Do I Get Information out of an <xsl:for-each>?

You can’t. Variables created inside an <xsl:for-each> element are local to that element, and do not exist once one has left it. This limitation may require one to move code inside the <xsl:for-each> element that might logically seem to belong outside it.

For example, suppose one needs to process a manuscript description file a.xml and one needs to look up the numerical codes for each article in that file by using the article name as a pointer into a reference file articles.xml. One then needs to use the numerical code while continuing to process the manuscript description file. My first instinct was to step inside an <xsl:for-each> element to get the code, store it in a variable, and then step back outside to continue processing the manuscript description, using that stored variable value where needed. Because the variable does not exist outside the <xsl:for-each> element, this approach fails. The correct alternative is to obtain all information needed about the manuscript description file before entering the <xsl:for-each> element, bring that information along with you in variables, and perform all processing that requires information from the (in this case) articles.xml reference file within the <xsl:for-each> element. One can continue processing the original context after leaving the <xsl:for-each> element, but only if one has no further use for the information obtained within that element.
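The contrast between the two approaches can be sketched as follows. The element names <cells> and <cell>, the $name variable, and the codeForTitle key (a hypothetical key that indexes each <code> in articles.xml by its sibling <title>) are assumptions for illustration and are not taken from the plectogram stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical key: finds a code in articles.xml by its sibling title. -->
  <xsl:key name="codeForTitle" match="code" use="normalize-space(../title)"/>

  <!-- This fails: $code is declared inside the for-each and is out of
       scope once the for-each has ended.
       <xsl:for-each select="document('articles.xml')">
         <xsl:variable name="code" select="key('codeForTitle', $name)"/>
       </xsl:for-each>
       <cell><xsl:value-of select="$code"/></cell>
  -->

  <xsl:template match="/">
    <cells>
      <xsl:apply-templates select="//articleName"/>
    </cells>
  </xsl:template>

  <xsl:template match="articleName">
    <!-- Information from the manuscript description is gathered first. -->
    <xsl:variable name="name" select="normalize-space(.)"/>
    <xsl:for-each select="document('articles.xml')">
      <!-- The output that needs the code is written here,
           inside the for-each, where the lookup result is available. -->
      <cell title="{$name}">
        <xsl:value-of select="key('codeForTitle', $name)"/>
      </cell>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Each <cell> is produced inside the <xsl:for-each>, so nothing retrieved from articles.xml ever has to leave it.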

Conclusions

The <xsl:key> element and key() function provide substantial improvement (in my project, by a factor of more than two hundred) in the execution time of XSLT stylesheets that require repeated access to the same trees. Keys work effectively across multiple documents, as long as one:

  1. makes judicious, if intuitively peculiar, use of <xsl:for-each> to change context;
  2. uses variables to carry information into an <xsl:for-each> element from outside;
  3. moves code into the <xsl:for-each> element as a way of overcoming the difficulty of moving information outside.

Footnotes

1 The title of this article is an allusion to Anastasya Verbitskaia’s engagingly trashy early-twentieth-century novel entitled The Keys to Happiness, available in English translation from Indiana University Press (ISBN 0253335388).

2 The plectogram model is described in Olmsted’s “Modeling the Genealogy of Maksim Grek’s Collection Types: The ‘Plectogram’ as Visual Aid in Reconstruction.” In: Michael S. Flier and Daniel Rowland, eds. Medieval Russian Culture. Volume II. (= California Slavic Studies XIX.) Berkeley: University of California Press. 1994. 107–33. For examples of its application in a large-scale investigation, see also his “Studies in the Early Manuscript Tradition of Maksim Grek’s Collected Works.” Unpublished doctoral dissertation, Harvard University. 1977.