David J. Birnbaum
University of Pittsburgh
David J. Birnbaum is Associate Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. His research in electronic text technology is concentrated primarily on problems of encoding and processing medieval Slavic manuscript materials.
David A. Mundie
DASYS Inc.
David A. Mundie is head of Software Quality Assurance at DASYS Incorporated. He has a long history of dabbling in the computer-assisted analysis of literary texts.
In the context of encoding new texts, the use of automated tools to ensure conformance to a grammar specified in a DTD has many benefits, most notably the guarantee of correctness and the simplification of downstream processing applications. In the context of producing new electronic editions of existing print documents, however, the use of those tools is problematic, because the existing paper texts may violate their underlying structure due to human error during their compilation or production. The producer of such a document is torn between the need to preserve the historical record of the original text, on the one hand, and the need to produce a correct, validated document that conforms to a DTD on the other. Such texts raise the philosophical problem of encoding what is in essence an invalid document within a framework designed specifically to support validity.
The solution that we propose is to view the valid and the invalid texts as transforms of each other. By capturing the transformation rules between them, we can easily produce a correct SGML document instance while at the same time preserving the historical record.
(To appear in Markup Languages: Theory and Practice.)
SGML (Standard Generalized Markup Language, ISO 8879:1986) was developed primarily for encoding new texts, an environment in which the rigorous adherence to a DTD (Document Type Definition) ensures that the resulting documents will exhibit a coherent structure. Consider, for example, the OED (Oxford English Dictionary), which is divided into, among other things, lexical entries, each of which consists of the following information (described in the introduction to the dictionary itself):
The following entry for "mavourneen" is a simplified partial illustration:
mavourneen (m@'vU@ni:n). Also 9 mavournin. [Irish mo mhurnín.] My darling.
One way to represent the structure of dictionary entries of this
type (partially, and with considerable simplification) would be the
following (assuming that the content of all undefined elements is
#PCDATA):[1]
<!ELEMENT entry - - (ident, variants?, morph, signif+) >
<!ELEMENT ident - - (main, pron, gram?, spec?, status?, early?, inflect?) >
<!ELEMENT morph - - (etym, hist?, misc?) >
<!ELEMENT signif - - (mean, cites+) >
<!ELEMENT cites - - (cite+) >
<!ELEMENT cite - - (date, source, text) >
<!ELEMENT source - - (who?, where) >
<!ELEMENT where - - (title, page?) >
The text of the dictionary excerpt for "mavourneen" cited above might be marked up in conformity with this DTD fragment as follows:[2]
<entry>
<ident>
<main>mavourneen</main>
<pron>m@'vU@ni:n</pron>
</ident>
<variant>mavournin</variant>
<morph>
<etym>Irish mo mhurnín</etym>
</morph>
<signif>
<mean>My darling.</mean>
</signif>
</entry>
Users who create new dictionaries based on this DTD will be required by their SGML editing and validating tools to follow the specified structure. The model provides some flexibility, so that, for example, authors may include or omit an indication of the lexical status of a particular entry. But this flexibility is restricted; for example, if a <status> element is included, it must follow, rather than precede, the obligatory <pron> element. SGML software is not, of course, able to ensure that the author actually enters status information into the <status> element, but the software can, at least, verify that the <status> element itself occurs only in a legal context. An SGML development environment thus protects users from inadvertently creating syntactically contradictory documents.
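The kind of check described above can be sketched in a few lines of Python. This is only an illustration, not how an SGML parser is implemented: the dictionary CONTENT_MODELS and the function conforms are hypothetical names, and the two content models are hand-translated into regular expressions from the DTD fragment given earlier.

```python
import re

# Hypothetical hand-translation of two content models from the DTD fragment.
# Each child element name is followed by a space so that names cannot run together.
CONTENT_MODELS = {
    "entry": r"ident (variants )?morph (signif )+",
    "ident": r"main pron (gram )?(spec )?(status )?(early )?(inflect )?",
}

def conforms(parent, children):
    """Return True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None

print(conforms("ident", ["main", "pron", "status"]))  # status follows pron
print(conforms("ident", ["main", "status", "pron"]))  # status precedes pron
print(conforms("ident", ["main", "pron"]))            # status omitted
```

Under these assumptions the first and third calls succeed and the second fails, mirroring the rule that <status> is optional but may not precede <pron>. Note that the check looks only at element names and order; whether the content of <status> is sensible is outside its reach.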
In a standard SGML authoring model, SGML tools can ensure that newly-created documents conform to a DTD developed for a particular purpose. This model is appropriate in an environment where SGML tools are used to create new structured documents, but it is less well suited to the production of electronic versions of pre-existing print (or even non-SGML electronic) documents, an extremely common enterprise in humanities computing. One reason that such transcriptions are problematic is that portions of pre-existing documents that were created outside SGML editors may, owing to the fallibility of human editors, violate the overall logical structure of those documents. For example, a dictionary entry might improperly omit an obligatory element, or place it out of an otherwise strict and regular position.[3]
To illustrate, consider the following two entries taken from the second edition of the OED. As we have seen, in this dictionary variants normally precede etymological information, but for the word "maverick" the variants follow, rather than precede, the etymology. Compare the entries for "mavourneen," which is regular, and for "maverick," which is anomalous:[4]
mavourneen (m@'vU@ni:n). Also 9 mavournin. [Irish mo mhurnín.] My darling.
maverick ('m{v@rIk), sb. [Samuel A. Maverick (1803-1870), a Texas cattle-owner who left the calves of his herd unbranded.] Also mavorick.
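This sort of error leaves the editor of the electronic edition with several unattractive choices, including: (1) changing the text during transcription to correct the error; (2) using a loose DTD that does not enforce a strict element order; (3) combining a tightly-structured DTD with a loosely-structured escape hatch for deviant data; and (4) retaining the anomaly as invalid markup in the document instance.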
The first three strategies are compatible with SGML processing: all three yield a DTD and a document instance that can be processed with standard SGML tools. The fourth, on the other hand, yields a document that when submitted to SGML tools generates parser error messages and undefined results at best, and at worst is simply rejected. The interesting issues, then, are of two types: those that distinguish the three SGML-compatible solutions from one another and those that support the SGML-incompatible solution.
It may be worth recalling at this point that the detection of many types of logical errors is the responsibility of processing applications themselves, and not of the SGML tool set per se. For example:
SGML cannot ensure that a user enters character data representing pronunciation inside the <pron> element and character data representing etymological information inside the <etym> element, rather than inadvertently switching them around. SGML is concerned primarily with document structure, and except for the special treatment of certain SGML markup characters, SGML's interest in the specific bytes that constitute each piece of PCDATA content is limited to verifying that all character data contains only legal characters.[5]
Even when a user does enter pronunciation information in the <pron> element and character data representing etymological information in the <etym> element, SGML software has no way to check whether this information has been entered without error. As in the preceding case, SGML may be concerned with where PCDATA may occur within a document, but it is indifferent to the makeup of each instance of PCDATA itself.
SGML can verify that character data occurs only in a PCDATA environment, but it has no way of monitoring whether a PCDATA location contains any data at all. Because a requirement for PCDATA in a particular location may be satisfied by zero data characters, a user who includes an <etym> element but forgets to enter the textual content will not be notified of any error by an SGML application.
Using <pron> systematically to tag etymological information and <etym> systematically to tag pronunciation information would be monumentally confusing to a human, but it would not be an SGML error, because GIs are merely arbitrary signs as far as an SGML system is concerned.
These SGML-irrelevant errors invite us to ask what sorts of errors an SGML system should monitor. The obvious answer is that SGML is concerned with the syntax of documents, and, specifically, with verifying that the syntax of a document instance obeys the syntactic rules established in an associated DTD. Deviations from the DTD within the document instance constitute SGML errors, and must be reported as such by a validating SGML parser.
Those who work with SGML texts are used to thinking of valid SGML as the inevitable goal of our encoding projects, and we assume that error messages are generated by a parser to alert us to the presence of unwanted faulty data, which we then normally repair before publication or further processing. We are not conditioned to think of syntactically invalid SGML as a natural or desirable state, or as a practical or appropriate way of representing syntactically contradictory source data.
Our evaluation of the various practical solutions to dealing with anomalies in source documents is based on three requirements.
First, we require that where the meaning or function of character data is unambiguous, that data should be transcribed faithfully. Editorial comments about errors in the original source document are not excluded, but, for philological reasons discussed below, these should be restricted to markup, rather than introduced by altering the character data content of the document.
Second, we require that a DTD be an interpretive statement, rather than just an engineering convenience. A DTD represents the human editor's analysis of the structure of a document, as inferred from document analysis.
Third, we recognize that documents produced by humans operating without the benefit of an SGML editing environment may contain errors, and we consider it the philological responsibility of the eventual human editor to distinguish a document's somewhat idealized structure from the reality that may contain fortuitous errors. In other words, the most direct SGML representation of inadvertent deviations from what is otherwise a coherent structure would be an SGML document with syntactic violations. This is in keeping with the second assumption, above, that a DTD is an interpretive statement. An SGML document may thus include the interpretive information that the document instance has a regular structure (formalized in the DTD) and also includes sporadic violations of that structure (encoded within the document instance).
From a more practical perspective, one may consider the DTD inferred from a print dictionary to be a formal statement about the structure of the dictionary. As such, the DTD itself may be regarded as an object of study, and its value as such would be restricted were there no formal distinction between the general structure of the dictionary, on the one hand, and the reality that may contain sporadic deviations from that structure, on the other. To represent a document with errors as syntactically consistent by allowing the DTD to license what philological analysis identifies as erroneous structures would deprive the DTD of much of its utility as a structural model.
The first of the four solutions noted above, changing text during transcription to correct errors in an original source document, is unattractive because transcriptions of existing print dictionaries (to continue the original example) have two essences: they are new and functional electronic dictionaries and they are electronic records of existing archaeographic materials, viz. print dictionaries. While correcting errors observes the spirit of the first of these essences, since it produces a more useful and practical electronic reference work, it runs counter to the second, in that it simply suppresses information about the original source document. When the history of the OED is written, scholars may choose to ignore certain errors that occurred in the pre-electronic print editions, but unless those errors are preserved in the source, individual scholars will be unable to make deliberate decisions about how to deal with them. If the editor simply modifies the entry for "maverick" to be:
<entry>
<ident>
<main>maverick</main>
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</ident>
<variant>Also mavorick.</variant>
<morph>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morph>
</entry>
then any trace of the original, anomalous data will have been lost.
The TEI (Text Encoding Initiative) DTD
addresses a superficially related problem: the treatment of anomalous
character data during textual transcription. For example, errors in
character data in original sources may be tagged as
<sic>, with an editorial emendation stored
in a corr attribute. Alternatively, the
editorial emendation may be entered as content in a
<corr> element, with the original reading
stored in a sic attribute. But despite the
superficial similarity between anomalous character data and anomalous
structure, SGML does not readily support markup of
markup through the use of attributes, and the TEI
DTD contains no comparable proposal for wrapping
anomalous syntactic structures in an umbrella element that would
facilitate the specification of two types of markup as markup, one
anomalous (as found in the original source) and the other logically
correct but philologically unfaithful.[6]
One complication that arises when working with print documents is that the print medium necessarily imposes a linear order, even when this order may appear to be more properly a matter of presentation than of structure. Much as footnotes and endnotes may be regarded as two presentational variants of a single structural note-type element, the order of the parts of a print dictionary entry may be regarded as purely a presentational artifact imposed by the print medium on information that alternatively may be encoded fully without reference to linear order by SGML tags. This issue emerges from an imbalance between the SGML model of data as a single ordered hierarchy of content objects (see the critique in Renear et al. 1993) and the inherently twofold existence of print documents as both archeographic objects (including all of the presentational features imposed by print) and structural abstractions.
The second alternative is to use a loose DTD that does not enforce a strict element order. For example, the ordered set of elements can be made unordered by changing the commas in the content model to ampersands, and accidental omissions in the source can be catered to by making all elements in the content model optional, as in this humorously noncommittal declaration--a Hamlet among DTDs:
<!ELEMENT entry - - (ident? & variants? & morph? & signif*) >
<!ELEMENT ident - - (main? & pron? & gram? & spec? & status? & early? & inflect?) >
<!ELEMENT morph - - (etym? & hist? & misc?) >
<!ELEMENT signif - - (mean? & cites*) >
<!ELEMENT cites - - (cite*) >
<!ELEMENT cite - - (date? & source? & text?) >
<!ELEMENT source - - (who? & where?) >
<!ELEMENT where - - (title? & page?) >
This approach is unattractive because it sacrifices structural information. If the evidence of document analysis clearly points to the transgression in a particular place of a structure that otherwise observes a strict order, rather than to conformity to a loose structure that does not require strict order, a DTD based on the latter conceals information about the document, viz. what is regular and what is exceptional. If the DTD is viewed as a formal model of the editor's interpretation of the structure of a document, it is properly a potential object of study in its own right, and suppressing the distinction between the regular and the exceptional distorts the model.
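The loss of information can be made concrete with a small Python sketch. It is a deliberate simplification and rests on assumptions of our own: only three top-level children are modeled (<signif> is left out), and the names ORDERED, REQUIRED, strict_ok, and loose_ok are hypothetical.

```python
# Simplified top-level children of an entry; signif is omitted for brevity.
ORDERED = ["ident", "variant", "morph"]
REQUIRED = {"ident", "morph"}

def strict_ok(children):
    """Ordered model: children appear in canonical order, required ones present."""
    if any(c not in ORDERED for c in children):
        return False
    positions = [ORDERED.index(c) for c in children]
    strictly_increasing = positions == sorted(set(positions))
    return strictly_increasing and REQUIRED <= set(children)

def loose_ok(children):
    """Unordered model: every child optional, any order, each at most once."""
    return set(children) <= set(ORDERED) and len(children) == len(set(children))

mavourneen = ["ident", "variant", "morph"]
maverick = ["ident", "morph", "variant"]
for name, kids in (("mavourneen", mavourneen), ("maverick", maverick)):
    print(name, strict_ok(kids), loose_ok(kids))
```

In this sketch the strict check separates the regular entry from the anomalous one, while the loose check accepts both and therefore reports nothing about which entry is exceptional.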
One might wonder, when confronted with a very small number of articles in the dictionary that share some structural peculiarity, whether they are a) a very rare but perfectly valid variation on the standard pattern, b) errors, or c) the sole remaining evidence that the compilers of the original dictionary had a mental model of the dictionary entry that permitted far more variation than they (almost) ever actually used.
This question is a particular application of a general problem
that arises whenever we transcribe existing print sources. One
advantage commonly cited for descriptive over procedural markup is
that procedural markup may neutralize and conflate structurally
different elements. For example, a printed document may use italics to
represent emphasis, book titles, and foreign words, all of which may
occur in some of the same environments. SGML does not
prohibit the use
of procedural markup tags such as <italic>,
but most approaches to encoding such texts for humanities research
would assign descriptive tags such as
<emph>, <title>,
and <foreign> to these pieces of text, and
it would be the responsibility of the editor to infer the meaning of
italic type wherever it occurs. In most cases the function of italic
type in a specific context in a printed source document is obvious,
but apparent ambiguities are possible, and it may not be easy to
describe the algorithm a human editor would use to resolve
them.
In the problem cited above, assuming a large volume of consistent material, we would rule out (a) (rare but valid variation) unless we could identify a reason for the dictionary compiler to have deviated deliberately from a pattern followed consistently elsewhere. And we would usually rule out (c) (broader mental model) as possibly true but uninteresting, in that it is always possible that the compiler had a mental model that was broader even than anything actually printed in the dictionary. Interpreting typographic features structurally suffers from the same limitations as any historical reconstruction, and historical records are often faulty. A decision in the present case may come down to a subjective editorial conclusion about whether it is more probable that the compiler had a loose mental model than that he committed a small number of errors.[7]
The third alternative is to combine a tightly-structured
DTD with a loosely-structured escape hatch for
deviant data. This is the solution adopted by the TEI
DTD, which for transcriptions of print dictionaries
distinguishes <entry>, a well-structured
lexical entry, from <entryfree>, which may
contain any dictionary elements in any combination, and therefore
caters to lexical entries that violate normal structural constraints.
Applying this solution to our example, we might have:
<!ELEMENT entryfree - - (identfree? & variants? & morphfree? & signiffree*) >
<!ELEMENT identfree - - (main? & pron? & gram? & spec? & status? & early? & inflect?) >
<!ELEMENT morphfree - - (etym? & hist? & misc?) >
<!ELEMENT signiffree - - (mean? & citesfree*) >
<!ELEMENT citesfree - - (citefree*) >
<!ELEMENT citefree - - (date? & sourcefree? & text?) >
<!ELEMENT sourcefree - - (who? & wherefree?) >
<!ELEMENT wherefree - - (title? & page?) >
The entry for "maverick" would then be:
<entryfree>
<identfree>
<main>maverick</main>
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</identfree>
<morphfree>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morphfree>
<variant>Also mavorick.</variant>
</entryfree>
Although from an engineering standpoint this technique allows the editor to mark up the text in a way that can be processed by standard SGML tools, from an intellectual perspective it is unsatisfactory because it violates our second requirement, namely that the DTD be an interpretive statement. It seems unlikely that the editors of the OED were indulging in a little esoteric philological humor by recording the word in a totally free-form way so that the entry would reflect the meaning--by having "maverick" be a maverick, so to speak. In all probability they intended to use the strict entry, but goofed up. Using a loose entry, even as an escape hatch, obscures that fact.
Put another way, with this approach the fact that the content of
such elements is erroneous, rather than correctly and appropriately
unusual or unconstrained, becomes purely a semantic matter, even
though by nature the error is syntactic. That is, markup in general is
supposed to represent the syntax of the document, but the syntactic
error in this case would be represented not by syntactically erroneous
markup, but by the extrasyntactic meaning assigned to specific
elements. This type of model involves less loss of information than in
the first two solutions, since irregularities are identified as
irregularities, albeit through the names of GIs,
rather than through the distinction between syntactic validity and
invalidity in the document instance. But since the principal thing
SGML understands is the distinction between
syntactic validity and invalidity, there is something philosophically
unsatisfactory about this dislocation, and from an
information-processing point of view there still is a loss of
information, since the use of the
<entryfree> element tells us nothing about
the particular deviation of each
anomalous element from its ideal counterpart.
The ironic truth is that nothing captures the reality of a mistake in legacy text as cleanly as invalid markup. A document instance that violates its DTD preserves the integrity of the original source, encodes explicitly the difference between norm and violation of norm, and represents syntactic anomaly in the source as syntactic anomaly in the document instance. This approach seems almost kinky or subversive in an SGML context, because our conditioned perception of SGML as a way of modeling document structure assumes that documents should be structured, that there should be no errors in a document's structure, and that a document's structure should be represented by its DTD.
From the perspective we advocate here, a document must be considered to have at least two structures: an ideal one (which must be inferred by scholars) and a concrete one that conforms to the ideal in most--but not necessarily all--places. The SGML document model described here is intended to support the use of the DTD to formalize a grammar of the ideal structure, while permitting the document instance to include violations of that structure.[8]
This model may be unusual in SGML terms, but it is standard operating procedure in other disciplines. For example, descriptive linguistics traditionally distinguishes competence from performance or langue from parole, where the first term represents the ideal abstract grammatical competence of a speaker and the second term represents his real speech, complete with unintended slips of the tongue. Within this model, linguistic performance is easy to observe and describe, while abstract linguistic competence can only be inferred from the study of performance. Linguists nonetheless recognize that the system of linguistic competence is an important object of study from several perspectives, including formal modeling. To return to an SGML environment, we suggest here that a DTD can be considered a model of the grammatical competence of a document, while the document instance represents the associated--and possibly imperfect--grammatical performance.
The traditional assumption that documents should not contain structural errors is challenged by the examples cited above from the OED, which demonstrate that documents that exhibit what in many respects appears to be a rigid structure (such as dictionaries) may occasionally deviate from an otherwise very consistent structure through human error (or, to describe the situation from a less SGML-oriented perspective, as a result of the editors not being as preoccupied with formal structure as SGML authoring tools might be). These violations are informational, at least from an archaeographic perspective, and should be preserved. And what should be preserved is not merely that the offending portions observe a different but unremarkable structure, and certainly not that the overall document structure is generally loose. What should be preserved is what document analysis reveals: the document has a highly-structured implicit DTD and the offending portions are conceptual violations of this DTD, rather than alternative valid structures. Although SGML distinguishes valid from invalid syntactic structures, and although documents may contain a smattering of philologically true (authentic, imported from a source and requiring preservation) syntactically invalid structures amid a sea of valid ones, the "escape hatch" solution preserves the distinction only by translating it from the syntactic to the semantic. This translation, in turn, restricts the DTD to functioning as a grammar of performance, rather than of competence.
The requirement that SGML be syntactically valid seems in most contexts so obvious that it would rarely be questioned, but if document analysis of existing documents reveals violations of the structure identified during the document analysis process, the most appropriate model of this information in SGML terms involves invalid SGML.
Having concluded that invalid markup is philosophically the most satisfactory solution to the anomalous data problem, one might be tempted to let it go at that, maintaining an invalid document instance that generates error messages when submitted to SGML tools. To continue with our "maverick" example, one would simply grin and bear it when submitting the document to the SGML system:
<entry>
<ident>
<main>maverick</main>
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</ident>
<morph>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morph>
<variant>Also mavorick.</variant>
***** Line 13: Element "variant" not allowed here.
</entry>
Unfortunately, this approach is just as unacceptable from an engineering viewpoint as the first three were from a philosophical one. Some of the problems it poses are:
<entry>
<ident>
<main>maverick</main>
<gram>sb.</gram>
***** Line 4: Element "gram" not allowed here
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</ident>
<morph>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morph>
<variant>Also mavorick.</variant>
***** Line 13: Element "variant" not allowed here.
</entry>
it is a tedious, error-prone process to figure out that the error on line 4 should be corrected but the one on line 13 should not be.
***** Line 13: Element "variant" not allowed here.
is, in some very loose sense, a specification of the relationship between this dictionary entry and the DTD, but it is certainly not a very good one. It fails to capture the underlying concept that "the variant and the morphology have been interchanged", and certainly cannot be used to produce the correct version of the entry.
To be sure, there may be partial work-arounds to some of these problems within the context of traditional tools. SGML parsers may be able to recognize some types of errors well enough to generate informative error messages, recognizing, in effect, both a correct grammar that raises no errors and a looser grammar that includes a number of incorrect constructions that the parser can nonetheless identify. This raises the possibility of using the notion of multiple grammars to enrich parser output, so that a document might, for example, be evaluated not only as correct or incorrect, but as correct or incorrect in a variety of specific ways. Elements that have been omitted or included erroneously are already recognized by such parsers as James Clark's nsgmls; recognizing elements that are misplaced as dislocations, rather than unexpected omissions combined with unexpected insertions, may prove more difficult. Other techniques might include using architectural forms to map from a strict to a loose DTD, or using DSSSL or XSLT to do tree transformations. All of these approaches, however, suffer from the inherent clumsiness of attempting to deal with invalid data in an environment designed to enforce validity.
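To illustrate what recognizing a dislocation might involve, here is a minimal Python sketch. It is not a description of nsgmls or of any existing parser: the function diagnose, the list CANONICAL, and the heuristic itself (try swapping each pair of neighbours and see whether canonical order is restored) are assumptions introduced purely for illustration.

```python
# Assumed canonical order of the top-level children of an entry.
CANONICAL = ["ident", "variant", "morph", "signif"]

def diagnose(children):
    """Classify a child sequence relative to the canonical order."""
    rank = {name: i for i, name in enumerate(CANONICAL)}
    unknown = [c for c in children if c not in rank]
    if unknown:
        return ("unexpected", unknown)
    ranks = [rank[c] for c in children]
    if ranks == sorted(ranks):
        return ("in order", [])
    # Look for a single swap of neighbours that restores canonical order.
    for i in range(len(children) - 1):
        swapped = ranks[:i] + [ranks[i + 1], ranks[i]] + ranks[i + 2:]
        if swapped == sorted(swapped):
            return ("dislocation", [children[i], children[i + 1]])
    return ("disordered", [])

print(diagnose(["ident", "variant", "morph"]))  # regular entry
print(diagnose(["ident", "morph", "variant"]))  # the "maverick" pattern
```

For the "maverick" pattern this heuristic reports a dislocation involving morph and variant, rather than an unexpected omission followed by an unexpected insertion. A real parser would have to make the same judgment against full content models, which is considerably harder.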
Instead of simply maintaining an invalid document instance and attempting to work around the problems it causes, there is another solution which we feel allows one to eat one's markup cake and have it too. In this approach, the relationship between what we have called the "ideal" document and its concrete instantiation is specified as a set of rule-based text-to-text transformations. Rule-based rewriting systems are a well-understood topic in computer science, and provide a powerful paradigm for expressing the relationships between texts. The system we have in mind looks like this:

The basic idea is to maintain the text in a valid
SGML document from which the historical, invalid
document can be automatically derived at any time for research
purposes.[10] To this end,
we allow the editor to record the discrepancy between the ideal
element and the actual one with an attribute specifying how to
transform the one into the other. For our
"maverick" example, we might have a
variant-morph transformation which captures the
fact that the variant and the morphology have been interchanged:
<entry transform="variant-morph">
<ident>
<main>maverick</main>
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</ident>
<variant>Also mavorick.</variant>
<morph>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morph>
</entry>
To implement the transformational component, Gema (Generalized Macro Language) is an obvious choice, since it was designed specifically for rule-based text-to-text transformations and allows the entire transformation engine to be written in just one line. Other pattern-matching languages such as Perl or Awk could be used as well.
To express the way the anomalous entry for
"maverick" is related to its platonically correct
one, we might write the variant-morph
transformation rule in Gema as follows:
variant-morph:[variant][morph]=$2$1
This states in an intuitive way that in entries of this type,
the morphology and the variant have been swapped. The recognizers for
morph and variant are also
quite easy to specify in Gema:
morph:<morph>[U]</morph>=$0@end
morph:=@fail
variant:<variant>[U]</variant>=$0@end
variant:=@fail
These expressions say quite simply that anything between
morph tags is a morph, and anything between
variant tags is a variant. It should be stressed
that these Gema specifications are executable, so that applying the
variant-morph transformation to the canonical entry for
"maverick" will at any time generate the
anomalous version:
<entry transform="variant-morph">
<ident>
<main>maverick</main>
<pron>'m{v@rIk</pron>
<gram>sb.</gram>
</ident>
<morph>
<etym>Samuel A. Maverick (1803-1870), a Texas
cattle-owner who left the calves of his herd
unbranded.</etym>
</morph>
<variant>Also mavorick.</variant>
</entry>
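For readers unfamiliar with Gema, the same swap can be approximated in another pattern-matching language. The Python sketch below is a rough analogue and not a translation of the Gema rules: the names VARIANT_MORPH and variant_morph are hypothetical, the regular expression assumes that the two elements are adjacent and not nested, and the sample entry is abridged.

```python
import re

# Matches a <variant> element immediately followed by a <morph> element.
VARIANT_MORPH = re.compile(
    r"(<variant>.*?</variant>)(\s*)(<morph>.*?</morph>)", re.DOTALL
)

def variant_morph(entry):
    """Derive the anomalous order: morph first, then variant."""
    return VARIANT_MORPH.sub(r"\3\2\1", entry, count=1)

# Abridged canonical entry, used only to exercise the function.
canonical = (
    "<entry>\n"
    "<variant>Also mavorick.</variant>\n"
    "<morph>\n<etym>Samuel A. Maverick, a Texas cattle-owner</etym>\n</morph>\n"
    "</entry>"
)
print(variant_morph(canonical))
```

The whitespace between the two elements is captured separately so that the line layout of the entry survives the swap.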
The final piece of the picture is the one-line transformation
engine that uses the transform attribute to call
the appropriate transformation:
<entry transform\="[U]">[U]</entry>=@define{temp:[U]=\@$1\{\$0\}}@temp{$0}
We may break this rule into three pieces. The left-hand side matches entries with transform attributes:
<entry transform\="[U]">[U]</entry>=
For simplicity's sake we are assuming that the transform
attribute will be the only attribute on entries. The
define statement redefines the
temp statement to be a call on the transformation
specified by the transform attribute (the first pattern matched,
referenced as "$1"):
@define{temp:[U]=\@$1\{\$0\}}
Finally, the entry as a whole is transformed using the redefined
temp transform:
@temp{$0}
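The dispatch step can likewise be sketched in Python, under the same simplifying assumption that transform is the only attribute on an entry. The registry TRANSFORMS and the function derive_historical are hypothetical names; unlike the Gema rule, this sketch applies the named transformation to the content of the entry and then restores the surrounding tags.

```python
import re

def variant_morph(body):
    """Swap an adjacent <variant>/<morph> pair (hypothetical Python analogue)."""
    pattern = r"(<variant>.*?</variant>)(\s*)(<morph>.*?</morph>)"
    return re.sub(pattern, r"\3\2\1", body, count=1, flags=re.DOTALL)

# Registry mapping values of the transform attribute to functions.
TRANSFORMS = {"variant-morph": variant_morph}

# Assumes transform is the only attribute and that entries are not nested.
ENTRY = re.compile(r'<entry transform="([^"]*)">(.*?)</entry>', re.DOTALL)

def derive_historical(text):
    """Rewrite every entry that carries a transform attribute."""
    def apply(match):
        name, body = match.group(1), match.group(2)
        # An unregistered transform name raises KeyError in this sketch.
        return '<entry transform="%s">%s</entry>' % (name, TRANSFORMS[name](body))
    return ENTRY.sub(apply, text)

sample = (
    '<entry transform="variant-morph">\n'
    "<variant>Also mavorick.</variant>\n"
    "<morph><etym>x</etym></morph>\n"
    "</entry>"
)
print(derive_historical(sample))
```

Entries without a transform attribute are left untouched, so the same pass can be run over the whole canonical document.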
This is a very simple example, but should suffice to demonstrate the utility of the technique. Any textual elements that can be extracted by means of Gema's powerful context-sensitive parsing rules can be reordered, deleted, duplicated, and augmented by all the techniques available in programming languages. In our experience, the implementation cost of such transformations is a linear function of their conceptual complexity.
With this approach, we believe we have found the best of both worlds. The publisher can submit the canonical SGML version of the text to standard SGML tools and have it processed without error messages or other deleterious side effects, while the archeographer can at any time, at near-zero cost, produce a version of the text that represents the anomalous, historical print record.
Specifying anomalous data by means of text-to-text transformation rules seems to solve the problem that such data pose. Intellectually it captures the essence of the problem and meets all of the philosophical desiderata we listed above. In addition, it has many advantages from the engineering viewpoint. It provides a complete, formal, machine-processable specification of the intertextual relationship. It is user-friendly: intuitive, maintainable, and intellectually satisfying. It can generate both the anomalous and the ideal version of the document on demand. Finally, it is easily portable across computer systems, and not tied to any particular SGML tool set.
Ultimately one can envisage merging this transformation-based system with a traditional SGML system, perhaps using the error-correction facilities of the parser to generate at least some of the transformation rules automatically. In an authoring environment enriched in this way, the system might query the user upon encountering a parsing error. The user would either correct the error or inform the system of how the erroneous structure should be mapped automatically to a valid structure. The interjection of this type of associative layer into the model has three benefits. It allows the document instance to preserve the syntax of the original. It allows the DTD to model the abstract structure underlying the original even when that structure is not followed with absolute fidelity. And it provides a format in which users can specify formally the relationships between ideal and actual markup without compromising the integrity of either the transcription of the primary source or the DTD that purports to represent the syntactic structure underlying that source.
[1] See Berg 1989 concerning the conversion of the printed OED to electronic form and Young-Lai 1996 concerning one strategy for inferring a DTD from a non-SGML document with in-line tags. Painter 1998 is clearly relevant, but I have not been able to obtain access to a copy.
[2] Note that in a real document, the pronunciation would more probably be represented with SDATA character entities (for example the Unicode IPA (International Phonetic Alphabet) block), rather than the SAMPA (Speech Assessment Methods Phonetic Alphabet) transcription of IPA characters using ASCII (American Standard Code for Information Interchange).
[3] Quin 1996 distinguishes prescriptive from descriptive DTDs, with the latter suitable for creating "an electronic version of material that already exists in a non-SGML format" (415) in a way that records ambiguity within a valid SGML document. One might reasonably approach documents that were originally constructed outside an SGML environment as taking priority over the DTD that might be inferred after the fact, so that what is anomalous in such examples is not so much the data (text), but the structure. From this perspective, anomalous structures may be correct but not captured by the inferred DTD, incorrect with respect to the inferred DTD, or perhaps simply not capturable within any usable DTD.
[4] The only statement that may be made about the validity of a document within an SGML model is that the document is either valid or invalid. But as Asimov 1986 notes, there may be different degrees of wrong, and the more nuanced XML (Extensible Markup Language) model distinguishes documents that are valid (in the SGML sense) from those that are well-formed (the elements, delimited by start- and end-tags, nest properly within each other and there is a single root element, but their structure is not necessarily governed by a DTD). All valid XML documents are also well-formed, but the reverse is not true. The present defense of invalid documents assumes that the documents in question are well-formed.
It should also be noted that our description of the structure of the entry for "maverick" as anomalous is an interpretation, and one might alternatively argue that the order of the components of this entry represents a variant, rather than an anomaly. This issue is discussed in greater detail in the discussion of loose DTDs, below.
[5] SGML does support data typing through notations, although a) the actual type validation must be performed by an external process, rather than by the SGML parser itself, and b) not all constraints on data content are easily expressed in ways that lend themselves easily to automated validation, especially in a system-independent manner.
[6] See the discussion of escape hatches, below.
[7] Distinguishing the rare from the exceptional is, of course, a general classificatory problem not restricted to questions of textual markup.
[8] It might be instructive in this context to consider one of the principal differences between the DTD as part of the original SGML standard and schemas: SGML and XML allow only a single DTD, while a schema-based system would be able to associate a document instance with multiple schemata. A schema-based approach would thus enable a document to be associated with both of the structures described above. See Prescod 1998.
[9] For example, the structure <A><C><B><D> where <A><B><C><D> is expected may be interpreted in at least three ways: (1) omission of <B> between <A> and <C> and insertion of <B> between <C> and <D>; (2) omission of <C> between <B> and <D> and insertion of <C> between <A> and <B>; and (3) transposition of <B> and <C> between <A> and <D>. While all three produce the same result in this example, they have somewhat different implications on a more general scale. For example, interpretations 1 and 2 do not preclude omission without insertion or insertion without omission, while interpretation 3 views these two processes as interdependent.
[10] To be perfectly honest, it makes no difference whether the "master" document is the valid one or the invalid one.
The first version of this paper was presented by David J. Birnbaum under the title "In Defense of Invalid SGML" at the 1997 annual joint meeting of the ACH (Association for Computers and the Humanities)/ALLC (Association for Literary and Linguistic Computing) at Queen's University, Kingston, Ontario. A second version was presented by the same author under the title of the present article at the Markup Technologies '98 conference in Chicago, IL and, with slight revision, at a meeting of the Pittsburgh (PA) XML/SGML Lecture Series (PghXSLS). The authors are grateful to John Lavagnino, C. Michael Sperberg-McQueen, Chris Maden, Paul Prescod, Michael Spring, and Frank Tompa for comments on those earlier texts, and to an anonymous referee for comments on an earlier version of this article.