<xsl:key>s to Happiness
1
David J. Birnbaum (djbpitt+@pitt.edu)
Location: http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/keys.html
Last revised: 2004-08-02
Abstract: The present article is intended as a tutorial on the use of XSLT keys with multiple files.
In the summer of 2003 I wrote an XSLT stylesheet that could produce a graphic (SVG) representation of the comparative structure of two or more medieval miscellany manuscripts. This visualization, originally developed in a different software environment by the American Slavist Hugh Olmsted in the 1970s, 2 is called a plectogram. A sample plectogram is available in the original SVG at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/am100mcb_am149nbw.svg (or embedded in HTML at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/plectogram.html) and in PNG format (without active animation) at http://clover.slavic.pitt.edu/~repertorium/plectogram/keys/plectogram.png. A more complete report about both the philological and informational aspects of the project is available at http://clover.slavic.pitt.edu/~djb/2003_ljubljana/2003_ljubljana_paper.pdf.
My XSLT stylesheet operated in two stages. In the first of these, it needed to read the modified-TEI manuscript descriptions, extract a list of the contents (which I called articles) from each description, and draw a column of cells for each manuscript, where each cell contained a four-digit hexadecimal number that represented the title of the article. The result of the first stage was a series of parallel columns, one for each manuscript, where each column consisted of cells containing four-digit numbers. The second stage involved drawing lines that connected cells in adjacent columns if they contained the same numbers, effectively creating a visual map of the correspondences in contents between adjacent manuscripts.
Both stages of plectogram generation required that the XSLT stylesheet access multiple
files. To populate a column with numbers in the first stage, the stylesheet needed to
read the manuscript description file, retrieve each article title, and look up the
corresponding four-digit number in a separate reference file. To draw the connecting
lines, the stylesheet needed to access in turn each article title in the leftmost of two
adjacent manuscripts and compare it to every article title in its right-hand neighbor,
drawing lines where these matched. Although it was not strictly necessary for the
purpose of plectogram generation to access the reference file during this stage (since
the program was comparing the actual article titles in the two manuscript files, and not
their associated numbers), the stylesheet nonetheless did access the reference file in
order to generate a stream of information about its progress (using
<xsl:message> elements).
An early version of the stylesheet accessed the various manuscript descriptions and the reference file through regular XPath expressions. It worked as advertised, but very slowly, because it needed to load, parse, and navigate the same files repeatedly; generating a large plectogram for ten or twelve manuscripts could take as long as fifteen or twenty minutes. Optimization for speed was necessary in general, and particularly so because the plectogram-generation stylesheet was intended as part of an eventual client-server web system, where remote users could not reasonably be expected to wait as patiently for the results of their queries as scholars at stand-alone workstations running local programs.
When I presented a report on this project at Extreme Markup 2003, I had the opportunity to complain to Jeni Tennison about the slow operation, and she advised me to use XSLT keys as a way of generating cached hash tables that would obviate repeated parsing of the same document. This was, indeed, the key to solving the execution-time problem; a plectogram that took almost fifteen minutes to create without keys required less than four seconds with keys. While groping my way toward this solution, however, I discovered that the requirements for using keys across multiple documents were not entirely intuitive, and this report is intended to alert others who may be using keys for the first time to possible pitfalls and ways to avoid them.
<xsl:key> Element
The <xsl:key> element takes three obligatory attributes:
name identifies a string that can be used to access a particular key using the key() function
match identifies the node set that will be returned when one uses the key() function
use identifies the XPath that will serve as a pointer into the table built by creating the <xsl:key> element
Sample <xsl:key> elements from my plectogram project are:
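As a hedged sketch only, two declarations of this kind might look like the following. The key names articleByTitle and codeByTitle and the lang test are taken from elsewhere in this article; the match and use expressions, and the assumption that each <title> in the reference file shares a parent with its <code>, are illustrative guesses based on the descriptions that follow, not the project's actual code.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: Bulgarian article names carry lang="BG" or lang="bg".
       Each one is indexed by its whitespace-normalized string value. -->
  <xsl:key name="articleByTitle"
           match="articleName[@lang='BG' or @lang='bg']"
           use="normalize-space(.)"/>

  <!-- Assumption: in the reference file every <title> has a sibling <code>.
       Each <title> is indexed by the string value of that sibling. -->
  <xsl:key name="codeByTitle"
           match="title"
           use="../code"/>

</xsl:stylesheet>
```

Note that neither declaration names a file: the keys are declared once, at the top level of the stylesheet, and the document they index is determined only when key() is called.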
The first of these examples makes it possible to take a space-normalized version of the
current node set (or, more accurately, the string value of a single node that one has
matched) and retrieve all of the Bulgarian-language
<articleName> elements in the current context document that
it matches. Thus, I could find all <articleName> elements in
the current context document that have the value “Fiziolog” with key('articleByTitle','Fiziolog').
The second example makes it possible to retrieve a <title>
that is a sibling to a <code> in a reference file. If the
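numerical code corresponding to “Fiziolog” is “0070,” I can retrieve “Fiziolog” with key('codeByTitle','0070') (the code is quoted so that it is treated as a string; an unquoted 0070 would be read as the number 70 and would not match).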
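A minimal sketch of such a lookup within a single document, assuming a reference file shaped like the one shown in the comment (the <articles> and <article> wrapper elements are hypothetical). The stylesheet is meant to be run directly against the reference file, so no change of context is needed. The code is passed as a quoted string, because a bare 0070 is an XPath number and would be converted to the string "70" before the lookup.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical input (the reference file):
       <articles>
         <article><code>0070</code><title>Fiziolog</title></article>
       </articles> -->

  <!-- Assumption: each <title> is indexed by its sibling <code>. -->
  <xsl:key name="codeByTitle" match="title" use="../code"/>

  <xsl:template match="/">
    <!-- Looks up the <title> whose sibling <code> is the string '0070'. -->
    <xsl:value-of select="key('codeByTitle', '0070')"/>
  </xsl:template>
</xsl:stylesheet>
```

With the hypothetical input above, this should output Fiziolog.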
Things to know when creating <xsl:key> elements are:
<code> elements), and the relevant file must be specified not when the key is created, but when it is used for retrieval.
<articleName> element exists only on the XPath /TEI.2/teiHeader/profileDesc/articleContentDesc/articleName, but I nonetheless specify only the articleName leaf (the lang attribute value is necessary because the files also contain article names in languages other than Bulgarian, and I need to exclude those).
key() to Access a Key
As was described above, once a key has been created, one can access information in it
with the key() function by specifying the name of the key and the value to
use as a pointer (via the use attribute of the
<xsl:key> element) to the desired information (as specified
by the match attribute of the <xsl:key>
element). This process is straightforward as long as one remains within a single
document, but because keys apply potentially to all documents accessed by the
stylesheet, it is sometimes necessary to change contexts in order to retrieve
information from the appropriate source. More confusingly, it is sometimes necessary to
use information from one document to point into another, and to take the information
retrieved from the second document and use it while continuing to process the first.
<xsl:for-each>?
The “stupid XSLT trick” to change the context document is to embed the retrieval inside
an <xsl:for-each> element. The select attribute
in this case does not serve, as it does with genuine for-each operations, to specify a
set of nodes over which one wishes to iterate. Instead, it specifies a single node that
is (typically) the root of a different document. When the key() function is
used inside an <xsl:for-each> element, it will point into the
document specified as the value of the select attribute.
For example, if one is working in a document other than bib.xml and one wishes to retrieve information with a key from bib.xml, one embeds the following code into one’s stylesheet (taken from the XSLT specification at http://www.w3.org/TR/xslt):
<xsl:for-each select="document('bib.xml')">
  <xsl:apply-templates select="key('bib',$name)"/>
</xsl:for-each>
This instructs the stylesheet to switch from whatever document it is processing to a
document called bib.xml and to use the key called “bib” to retrieve the information that
the value of the $name variable points to.
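For readers who want to see the fragment in context, here is a minimal sketch of a complete stylesheet built around it. The element names bibref and entry, the name attribute, and the <cite> output element are assumptions about the two documents' structure made for illustration; only the <xsl:for-each> fragment itself is quoted above.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: bib.xml contains <entry name="..."> elements. -->
  <xsl:key name="bib" match="entry" use="@name"/>

  <!-- Assumption: the main document cites entries as <bibref>NAME</bibref>. -->
  <xsl:template match="bibref">
    <!-- Capture the citation while "." still points into the main document. -->
    <xsl:variable name="name" select="string(.)"/>
    <!-- A single "iteration" over the root of bib.xml; its only purpose
         is to make bib.xml the document that key() searches. -->
    <xsl:for-each select="document('bib.xml')">
      <xsl:apply-templates select="key('bib', $name)"/>
    </xsl:for-each>
  </xsl:template>

  <!-- Hypothetical rendering of a retrieved entry. -->
  <xsl:template match="entry">
    <cite><xsl:value-of select="."/></cite>
  </xsl:template>
</xsl:stylesheet>
```

The same key declaration would also index any <entry> elements in the main document; it is the <xsl:for-each> wrapper, not the declaration, that directs the lookup to bib.xml.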
<xsl:for-each>?
Once one has entered an <xsl:for-each> element of the sort
illustrated in the example above, the context has switched to the new document. One may
have been processing the current node (“.”) previously, but once one has entered the
<xsl:for-each> element in that example, “.” points to the
bib.xml document, and not to the node that was originally being processed in a different
document. Thus, if one needs to refer to that other document, one can no longer use “.”
(or anything similar) to do so, and a different strategy is required.
One gets information into an <xsl:for-each> element by
assigning it to a variable. This includes not only the value of the current context
(“.”), but also any values that one might need to deduce by referring to it. For
example, the plectogram program needs to perform differently inside the
<xsl:for-each> depending on whether the value used to point
into the key is the first or second occurrence of a specific article title. It also needs
to write information that depends on the position of that original value in the tree in
which it occurs. This information is stored in variables before entering the
<xsl:for-each> element, as follows:
<xsl:variable name="senior" select="boolean(../preceding-sibling::*/articleName[. = current()])"/>
<xsl:variable name="count" select="count(../preceding-sibling::*/articleName[@lang='BG' or @lang='bg']) + 1"/>
<xsl:for-each select="document('articles.xml')"> ...
The “senior” variable determines whether an article name in a manuscript description file
has been encountered before and the “count” variable gives the ordinal position of the
article name in the manuscript description file. The XPaths in these variable
declarations refer to axes within the manuscript description files; were one to use
those XPaths inside the <xsl:for-each> element, they would
refer (incorrectly for our purposes) to those same axes within the articles.xml
reference file.
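The effect of the context switch can be made visible with a small sketch along the following lines. It assumes a manuscript description in which each Bulgarian <articleName> sits inside its own parent element, as the XPaths above imply; the $title variable and the text output format are hypothetical additions for illustration.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="articleName[@lang='BG' or @lang='bg']">
    <!-- All three variables are evaluated in the manuscript description,
         before the context changes. -->
    <xsl:variable name="title" select="normalize-space(.)"/>
    <xsl:variable name="senior"
        select="boolean(../preceding-sibling::*/articleName[. = current()])"/>
    <xsl:variable name="count"
        select="count(../preceding-sibling::*/articleName[@lang='BG' or @lang='bg']) + 1"/>

    <xsl:for-each select="document('articles.xml')">
      <!-- Here "." is the root of articles.xml, so name(*) reports the
           document element of the reference file. Only the variables
           still describe the manuscript description. -->
      <xsl:value-of select="concat($count, ' ', $title,
                                   ' seen-before=', $senior,
                                   ' context=', name(*), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

  <!-- Suppress all other text so that only the report lines appear. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Each output line combines values computed in the manuscript description with the name of the reference file's document element, which shows that the variables survive the switch while "." does not.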
<xsl:for-each>?
You can’t. Variables created inside an <xsl:for-each> element
are local to that element, and do not exist once one has left it. This limitation may
require one to move code inside the <xsl:for-each> element
that might logically seem to belong outside it.
For example, suppose one needs to process a manuscript description file a.xml and one
needs to look up the numerical codes for each article in that file by using the article
name as a pointer into a reference file articles.xml. One then needs to use the
numerical code while continuing to process the manuscript description file. My first
instinct was to step inside an <xsl:for-each> element to get
the code, store it in a variable, and then step back outside to continue processing the
manuscript description, using that stored variable value where needed. Because the
variable does not exist outside the <xsl:for-each> element,
this approach fails. The correct alternative is to obtain all information needed about
the manuscript description file before entering the
<xsl:for-each> element, bring that information along
in variables, and perform all processing that requires information from the (in this
case) articles.xml reference file within the <xsl:for-each>
element. One can continue processing the original context after leaving the
<xsl:for-each> element, but only if one has no further use
for the information obtained within that element.
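A minimal sketch of this arrangement follows. The key name codeForTitle, the <column> and <cell> output elements, and the assumed pairing of <code> and <title> as siblings in articles.xml are hypothetical and chosen only for illustration; they are not the plectogram stylesheet's actual SVG output.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: in articles.xml each <code> has a sibling <title>. -->
  <xsl:key name="codeForTitle" match="code" use="normalize-space(../title)"/>

  <xsl:template match="/">
    <column><xsl:apply-templates/></column>
  </xsl:template>

  <xsl:template match="articleName[@lang='BG' or @lang='bg']">
    <!-- Gather everything needed from the manuscript description first. -->
    <xsl:variable name="title" select="normalize-space(.)"/>
    <xsl:variable name="position"
        select="count(../preceding-sibling::*/articleName[@lang='BG' or @lang='bg']) + 1"/>

    <xsl:for-each select="document('articles.xml')">
      <!-- $code is declared here, so it is visible only until the
           closing tag of this xsl:for-each. -->
      <xsl:variable name="code" select="string(key('codeForTitle', $title))"/>
      <!-- The output that needs $code is therefore written here, inside. -->
      <cell n="{$position}" title="{$title}">
        <xsl:value-of select="$code"/>
      </cell>
    </xsl:for-each>
    <!-- $code is out of scope from this point on. -->
  </xsl:template>

  <!-- Suppress all other text. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Referring to $code after the closing </xsl:for-each> tag would be reported as an error by the XSLT processor, which is why the <cell> element is written inside the <xsl:for-each> element rather than after it.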
The <xsl:key> element and key() function provide
substantial improvement (by a factor of several hundred) in the execution time of XSLT
stylesheets that require repeated access to the same trees. Keys work effectively across
multiple documents, as long as one:
<xsl:for-each> to change context;
<xsl:for-each> element from outside;
<xsl:for-each> element as a way of overcoming the difficulty of moving information outside.
1 The title of this article is an allusion to Anastasya Verbitskaia’s engagingly trashy late-nineteenth-century novel entitled The Keys to Happiness, available in English translation from Indiana University Press (ISBN 0253335388).
2 The plectogram model is described in Olmsted’s “Modeling the Genealogy of Maksim Grek’s Collection Types: The ‘Plectogram’ as Visual Aid in Reconstruction.” In: Michael S. Flier and Daniel Rowland, eds. Medieval Russian Culture. Volume II. (= California Slavic Studies XIX.) Berkeley: University of California Press. 1994. 107–33. For examples of its application in a large-scale investigation, see also his “Studies in the Early Manuscript Tradition of Maksim Grek’s Collected Works.” Unpublished doctoral dissertation, Harvard University. 1977.